Learning by Abstraction: The Neural State Machine

07/09/2019 · by Drew A. Hudson, et al.

We introduce the Neural State Machine, seeking to bridge the gap between the neural and symbolic views of AI and integrate their complementary strengths for the task of visual reasoning. Given an image, we first predict a probabilistic graph that represents its underlying semantics and serves as a structured world model. Then, we perform sequential reasoning over the graph, iteratively traversing its nodes to answer a given question or draw a new inference. In contrast to most neural architectures that are designed to closely interact with the raw sensory data, our model operates instead in an abstract latent space, by transforming both the visual and linguistic modalities into semantic concept-based representations, thereby achieving enhanced transparency and modularity. We evaluate our model on VQA-CP and GQA, two recent VQA datasets that involve compositionality, multi-step inference and diverse reasoning skills, achieving state-of-the-art results in both cases. We provide further experiments that illustrate the model's strong generalization capacity across multiple dimensions, including novel compositions of concepts, changes in the answer distribution, and unseen linguistic structures, demonstrating the qualities and efficacy of our approach.

1 Introduction

Language is one of the most marvelous feats of the human mind. The emergence of a compositional system of symbols that can distill and convey anything from rich sensory experiences to creative new ideas has been a major turning point in the evolution of intelligence, one that made a profound impact on the nature of human cognition, fostering our remarkable capacities for abstraction and generalization (Chomsky, 2017; Vygotsky, 1964; Boroditsky, 2011). Indeed, humans are particularly adept at making abstractions of various kinds: we make analogies and form concepts to generalize from given instances to unseen examples (Rogers & McClelland, 2004); we see things in context, and build compositional world models to represent objects and understand their interactions and subtle relations, turning raw sensory signals into high-level semantic knowledge (Navon, 1977); and we deductively draw inferences via conceptual rules and statements to proceed from known facts to novel conclusions (Harman, 1986; Hudson & Manning, 2018). Not only are humans capable of learning, but we are also talented at reasoning.

Ideas about compositionality, abstraction and reasoning greatly inspired the classical views of artificial intelligence (Smolensky, 1987; Newell, 1980), but have lately been overshadowed by the astounding success of deep learning over a wide spectrum of real-world tasks (He et al., 2016; Mnih et al., 2015; Wu et al., 2016b). Yet, even though neural networks are undoubtedly powerful, flexible and robust, recent work has repeatedly demonstrated their flaws, showing how they struggle to generalize in a systematic manner (Lake & Baroni, 2017), overly adhere to superficial and potentially misleading statistical associations instead of learning true causal relations (Agrawal et al., 2016; Jia & Liang, 2017), strongly depend on large amounts of data and supervision (Garnelo et al., 2016; Lake et al., 2017), and sometimes behave in surprising and worrisome ways (Goodfellow et al., 2014; Das et al., 2017). The sheer size and statistical nature of these models that support their robustness and versatility are also what hinder their interpretability, modularity, and soundness.

Figure 1: The Neural State Machine is a graph network that simulates the computation of an automaton. For the task of VQA, the model constructs a probabilistic scene graph to capture the semantics of a given image, which it then treats as a state machine, traversing its states as guided by the question to perform sequential reasoning.

Motivated to alleviate these deficiencies and bring the neural and symbolic approaches more closely together, we propose the Neural State Machine, a differentiable graph-based model that simulates the operation of an automaton, and explore it in the domain of visual reasoning and compositional question answering. Essentially, we proceed through two stages: modeling and inference. Starting from an image, we first generate a probabilistic scene graph (Johnson et al., 2015; Krishna et al., 2017) that captures its underlying semantic knowledge in a compact form. Nodes correspond to objects and consist of structured representations of their properties, and edges depict both their spatial and semantic relations. Once we have the graph, we then treat it as a state machine and simulate an iterative computation over it, aiming to answer questions or draw inferences. We translate a given natural language question into a series of soft instructions, and feed them one-at-a-time into the machine to perform sequential reasoning, using attention to traverse its states and compute the answer.

Drawing inspiration from Bengio’s consciousness prior (Bengio, 2017), we further define a set of semantic embedded concepts that describe different entities and aspects of the domain, such as various kinds of objects, attributes and relations. These concepts are used as the vocabulary that underlies both the scene graphs derived from the image as well as the reasoning instructions obtained from the question, effectively allowing both modalities to “speak the same language”. Whereas neural networks typically interact directly with raw observations and dense features, our approach encourages the model to reason instead in a semantic and factorized abstract space, which enables the disentanglement of structure from content and improves its modularity.

We demonstrate the value and performance of the Neural State Machine on two recent Visual Question Answering (VQA) datasets: GQA (Hudson & Manning, 2019) which focuses on real-world visual reasoning and multi-step question answering, as well as VQA-CP (Agrawal et al., 2018), a recent split of the popular VQA dataset (Agrawal et al., 2017; Goyal et al., 2017) that has been designed particularly to evaluate generalization. We achieve state-of-the-art results on both tasks under single-model settings, substantiating the robustness and efficiency of our approach in answering challenging compositional questions. We then construct new splits leveraging the associated structured representations provided by GQA and conduct further experiments that provide significant evidence for the model’s strong generalization skills across multiple dimensions, such as novel compositions of concepts and unseen linguistic structures, validating its versatility under changing conditions.

Our model ties together two important qualities: abstraction and compositionality, with the respective key innovations of representing meaning as a structured attention distribution over an internal vocabulary of disentangled concepts, and capturing sequential reasoning as the iterative computation of a differentiable state machine over a semantic graph. We hope that creating such a neural form of a classical model of computation will encourage and support the integration of the connectionist and symbolic methodologies in AI, opening the door to enhanced modularity, versatility, and generalization.

2 Related work

Figure 2: Question examples along with answers predicted by the NSM. The questions involve diverse reasoning skills such as multi-step inference, relational and spatial reasoning, logic and comparisons.

Our model connects to multiple lines of research, including works about compositionality (Bottou, 2014; Hu et al., 2017), concept acquisition (Higgins et al., 2017; Wu et al., 2016a), and neural computation (Graves et al., 2014; Miller et al., 2016; Andreas et al., 2016). Several works have explored the discovery and use of visual concepts in the contexts of reinforcement or unsupervised learning (Higgins et al., 2016; Choi et al., 2018), as well as in classical computer vision (Farhadi et al., 2009; Vogel & Schiele, 2007). Others have argued for the importance of incorporating strong inductive biases into neural architectures (Bottou, 2014; Battaglia et al., 2018; Andreas, 2019; Andreas et al., 2018), and indeed, there is a growing body of research that seeks to introduce different forms of structural priors inspired by computer architectures (Graves et al., 2016; Weston et al., 2014; Hudson & Manning, 2018) or the theory of computation (Aleksander, 1994; Forcada & Carrasco, 2001), aiming to bridge the gap between the symbolic and neural paradigms.

We explore our model in the context of VQA, a challenging multimodal task that has gained substantial attention in recent years (Goyal et al., 2017; Wang et al., 2017; Hudson & Manning, 2018). Prior work commonly relied on dense visual features produced by either CNNs (Xiong et al., 2016; Yang et al., 2016) or object detectors (Anderson et al., 2017), with a few recent models that use the relationships among objects to augment those features with contextual information from each object’s surroundings (Li et al., 2019a; Teney et al., 2017; Norcliffe-Brown et al., 2018). We move further in this direction, performing iterative reasoning over inferred scene graphs, and in contrast to prior models, incorporate higher-level semantic concepts to represent both the visual and linguistic modalities in a shared and sparser manner that facilitates their interaction.

Closest to the NSM is MAC (Hudson & Manning, 2018), a model that we developed in prior work: a recurrent network that applies attention-based operations to perform sequential reasoning. In fact, our new model follows the high-level structure proposed by MAC, where the new decoder (section 3.3) is analogous to the control unit, and the model simulation (section 3.4) is parallel to the read-write operation. At the same time, the Neural State Machine differs from MAC in two crucial respects: First, we reason over graph structures rather than directly over spatial maps of visual features, traversing the graph by successively shifting attention across its nodes and edges. Second, key to our model is the notion of semantic concepts that are used to express knowledge about both the visual and linguistic modalities, instead of working directly with the raw observational features. As our findings suggest, these ideas contribute significantly to the model's performance compared to MAC and enhance its compositionality, transparency and generalization skills.

3 The Neural State Machine

The Neural State Machine is a graph-based network that simulates the computation of a finite automaton (Hopcroft, 2008), and is explored here in the context of VQA, where we are given an image and a question and asked to provide an answer. We go through two stages – modeling and inference, the first to construct the state machine, and the second to simulate its operation.

In the modeling stage, we transform both the visual and linguistic modalities into abstract representations. The image is decomposed into a probabilistic graph that represents its semantics – the objects, attributes and relations in the depicted visual scene (section 3.2), while the question is converted into a sequence of reasoning instructions (section 3.3) that have to be performed in order to answer it.

In the inference stage (section 3.4), we treat the graph as a state machine, where the nodes, the objects within the image, correspond to states, and the edges, the relations between the objects, correspond to transitions. We then simulate a serial computation by iteratively feeding the machine with the instructions derived from the question and traversing its states, which allows us to perform sequential reasoning over the semantic visual scene, as guided by the question, to arrive at the answer.

We begin with a formal definition of the machine. In simple terms, a state machine is a computational model that consists of a collection of states, which it iteratively traverses while reading a sequence of inputs, as determined by a transition function. In contrast to the classical deterministic versions, the neural state machine defines an initial distribution over the states, and then performs a fixed number of computation steps N, recurrently updating the state distribution until completion. Formally, we define the neural state machine as a tuple (C, S, E, {r_i}_{i=0}^{N}, p_0, δ):

  • C – the model's alphabet, consisting of a set of concepts, embedded as learned vectors.

  • S – a collection of states.

  • E – a collection of directed edges that specify valid transitions between the states.

  • {r_i}_{i=0}^{N} – a sequence of instructions, each of dimension d, that are passed in turn as an input to the transition function δ.

  • p_0 – the probability distribution of the initial state.

  • δ – a state transition function: a neural module that at each step i considers the distribution p_i over the states as well as an input instruction r_i, and uses them to redistribute the probability along the edges, yielding an updated state distribution p_{i+1}.
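
To make the definition concrete, below is a minimal sketch of the machine's components as a plain data container (Python/numpy); the class and field names are ours and purely illustrative, not taken from the released implementation, and the transition function δ is a learned neural module that is sketched separately in section 3.4.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NeuralStateMachine:
    """Container for the components (C, S, E, {r_i}, p_0, delta) listed above."""
    concepts: np.ndarray      # C: (num_concepts, d) embedded concept vectors (the alphabet)
    states: np.ndarray        # S: (num_states, num_properties, d) structured state variables
    edges: list               # E: (source_idx, target_idx) pairs of valid transitions
    edge_reprs: np.ndarray    # embedded relation type of each edge, (num_edges, d)
    instructions: np.ndarray  # {r_i}: (N + 1, d) reasoning instructions decoded from the question
    p0: np.ndarray            # initial probability distribution over the states, (num_states,)

    @property
    def num_steps(self) -> int:
        # the machine performs one computation step per reasoning instruction
        return len(self.instructions)
```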

3.1 Concept vocabulary

In contrast to many common networks, the neural state machine operates over a discrete set of concepts. We create an embedded concept vocabulary for the machine (initialized with GloVe (Pennington et al., 2014)), that will be used to capture and represent the semantic content of input images. The vocabulary is grouped into properties such as object identity (e.g. cat, shirt), different types of attributes (e.g. colors, materials) and relations (e.g. holding, behind), all derived from the Visual Genome dataset (Krishna et al., 2017) (see section 6.3 for details). We similarly define a set of embeddings for each of the property types (such as “color” or “shape”).

In using the notion of concepts, we draw a lot of inspiration from humans, who are known for their ability to learn concepts and use them for tasks that involve abstract thinking and reasoning (Bealer, 1998; Barsalou et al., 2003; Greeno, 1989; Pecher & Zwaan, 2005). In the following sections, rather than using raw and dense sensory input features directly, we represent both the visual and linguistic inputs in terms of our vocabulary, finding the most relevant concepts that they relate to. By associating such semantic concepts with raw sensory information from both the image and the question, we are able to derive higher-level representations that abstract away from irrelevant raw fine-grained statistics tied to each modality, and instead capture only the semantic knowledge necessary for the task. That way we can effectively cast both modalities onto the same space to facilitate their interaction, and, as discussed in section 4, improve the model's compositionality, robustness and generalization skills.

3.2 States and edge transitions

In order to create the state machine, we construct a probabilistic scene graph that specifies the objects and relations in a given image, and serves us as the machine’s state graph, where objects correspond to states and relations to valid transitions. Multiple models have been proposed for the task of scene graph generation (Xu et al., 2017; Yang et al., 2018; Chen et al., 2019; Zellers et al., 2018). Here, we largely follow the approaches of Yang et al. (2018) and Chen et al. (2019) in conjunction with a variant of the Mask R-CNN object detector (He et al., 2017) proposed by Hu et al. (2018). Further details regarding the graph generation can be found in section 6.4.

By using such a graph generation model, we can infer a scene graph that consists of: (1) A set of object nodes from the image, each accompanied by a bounding box, a mask, dense visual features, and a collection of discrete probability distributions for each of the object's semantic properties (such as its color, material, shape, etc.), defined over the concept vocabulary presented above; (2) A set of relation edges between the objects, each associated with a probability distribution over its semantic type (e.g. on top of, eating) among the concepts in C, and corresponding to a valid transition between the machine's states.

Once we obtain the sets of state nodes and transition edges, we proceed to computing structured embedded representations for each of them. For each state s ∈ S that corresponds to an object in the scene, we define a set of property variables {s_j} and assign each of them with

s_j = Σ_k P_j(k) c_k

where c_k denotes each embedded concept of the j-th property type and P_j(k) refers to the corresponding property distribution over these concepts, resulting in a soft-binding of concepts to each variable. To give an example, if an object is recognized by the object detector as likely to be e.g. red, then its color variable will be assigned to an averaged vector close to the embedding of the "red" concept. Edge representations are computed in a similar manner, resulting in matching embeddings of their relation type: e' = Σ_k P(k) c_k for each edge e ∈ E.

Consequently, we obtain a set of structured representations for both the nodes and the edges that underlie the state machine. Note that by associating each object and relation in the scene with not one, but a collection of vectors that capture each of their semantic properties, we are able to create disentangled representations that encapsulate the statistical particularities of the raw image and express it instead through a factorized discrete distribution over a vocabulary of embedded semantic concepts, aiming to encourage and promote higher compositionality.
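
As an illustration of this soft-binding step, here is a small numpy sketch (function and variable names are ours, and the number of concepts per type is hypothetical) that turns a predicted per-property distribution into an averaged concept embedding for a node or edge variable:

```python
import numpy as np

def soft_bind(prob_dist: np.ndarray, concept_embeddings: np.ndarray) -> np.ndarray:
    """Average the concept embeddings of one property type under the predicted
    distribution, e.g. 0.85 * red + 0.05 * orange + ... for a color variable."""
    # prob_dist: (num_concepts_of_type,); concept_embeddings: (num_concepts_of_type, d)
    return prob_dist @ concept_embeddings

# hypothetical node: its color distribution is peaked on the first ("red") concept
color_concepts = np.random.randn(8, 300)    # stand-in embeddings of 8 color concepts
color_dist = np.array([0.85, 0.05, 0.04, 0.02, 0.01, 0.01, 0.01, 0.01])
color_variable = soft_bind(color_dist, color_concepts)   # lands close to "red"

# edge variables are built the same way from the relation-type distribution
relation_concepts = np.random.randn(170, 300)
relation_dist = np.random.dirichlet(np.ones(170))
edge_variable = soft_bind(relation_dist, relation_concepts)
```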

3.3 Reasoning instructions

Figure 3: A visualization of object masks from the inferred scene graphs, which form the basis for our model.

In the next step, we translate the question into a sequence of reasoning instructions (each expressed in terms of the concept vocabulary C), which will later be read by the state machine to guide its computation. The translation process consists of two steps: tagging and decoding.

We begin by embedding all the question words using GloVe (dimension 300). We process each word with a tagger function that either translates it into the most relevant concept in our vocabulary or alternatively keeps it intact, if it does not match any of them closely enough. Formally, for each embedded word w_i we compute a similarity-based distribution

P_i = softmax(w_i^T W C')

where W is a learned matrix initialized to the identity, and C' denotes the matrix of all embedded concepts along with an additional learned default embedding c' to account for structural or other non-content words.

Next, we translate each word into a concept-based representation:

v_i = P_i(c') w_i + Σ_{c ∈ C} P_i(c) c

Intuitively, a content word such as apples will be considered mostly similar to the concept apple (by comparing their GloVe embeddings), and thus will be replaced by the embedding of that term, whereas function words such as who, are, how will be deemed less similar to the semantic concepts and hence will stay close to their original embedding. Overall, this process allows us to "normalize" the question, by transforming content words to their matching concepts, while keeping function words mostly unaffected.
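
A minimal numpy sketch of the tagging step as reconstructed above; the exact parameterization and the handling of the default embedding follow our reconstruction and should be read as assumptions rather than the released code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def normalize_word(w, concepts, W, default_emb):
    """Translate an embedded question word w (d,) into a concept-based representation:
    content words collapse onto their closest concept, while function words put most
    of their mass on the default slot and therefore stay close to their own embedding."""
    vocab = np.vstack([concepts, default_emb])   # C': all concepts plus a default embedding
    P = softmax(vocab @ (W @ w))                 # similarity-based distribution P_i
    # concept-based representation: mix of concepts, with the default slot keeping w itself
    return P[-1] * w + P[:-1] @ concepts
```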

Finally, we process the normalized question words with an attention-based encoder-decoder, drawing inspiration from (Hudson & Manning, 2018): given a question of normalized words v_1, …, v_M, we first pass it through an LSTM encoder, obtaining its final state q to represent the question. Then, we roll out a recurrent decoder for a fixed number of steps, yielding hidden states {h_i}_{i=0}^{N}, and transform each of them into a corresponding reasoning instruction:

r_i = Σ_{s=1}^{M} softmax_s(h_i^T W_r v_s) · v_s

Here, we compute attention over the normalized question words at each decoding step. By repeating this process for all steps, we decompose the question into a series of reasoning instructions that selectively focus on its various parts, accomplishing the goal of this stage.
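
The decoding step can be sketched as follows; this is a simplified numpy illustration of attention over the normalized words, and the learned projection W_r is an assumption of our reconstruction:

```python
import numpy as np

def reasoning_instruction(h_i, V, W_r):
    """One decoding step: attend over the normalized question words V (M, d) with the
    decoder hidden state h_i (d,) and return the reasoning instruction r_i (d,)."""
    scores = V @ (W_r @ h_i)              # unnormalized attention over the M words
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()           # attention distribution over the question words
    return alpha @ V                      # r_i: attention-weighted average of the words
```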

3.4 Model simulation

Figure 4: A visualization of a graph traversal step, where attention is being shifted from one node to its neighbor along the most relevant edge.

Having all the building blocks of the state machine ready – the graph of states S and edges E, the instruction series {r_i}_{i=0}^{N}, and the concept vocabulary C – we can now simulate the machine's sequential computation. Basically, we begin with a uniform initial distribution p_0 over the states (the objects in the image's scene), and at each reasoning step i, read an instruction r_i as derived from the question, and use it to redistribute our attention over the states (the objects) by shifting probability along the edges (their relations).

Formally, we perform this process by implementing a neural module for the state transition function δ. At each step i, the module takes the distribution p_i over the states as an input and computes an updated distribution p_{i+1}, guided by the instruction r_i. Our goal is to determine which next states to traverse to (p_{i+1}) based on the states we are currently attending to (p_i). To achieve that, we perform a couple of steps.

Recall that in section 3.2 we defined for each object a set of variables {s_j}, representing its different properties (e.g. identity, color, shape). We further assigned each edge a variable e' that similarly represents its relation type. Our first goal is thus to find the instruction type: the property type that is most relevant to the instruction r_i – basically, to figure out what the instruction is about. We compute a distribution R_i over the embedded property types defined in section 3.1, based on their similarity to r_i. We further denote the element of R_i that corresponds to the relation property by r_i', measuring the degree to which that reasoning instruction is concerned with semantic relations (in contrast to other possibilities such as e.g. objects or attributes).

Once we know what the instruction is looking for, we can use it as a guiding signal while traversing the graph from the current states we are focusing on to their most relevant neighbors. We compare the instruction to all the states s ∈ S and edges e ∈ E, computing for each of them a relevance score:

γ_i(s) = σ( Σ_j R_i(j) · (r_i^T W_j s_j) )        (1)
γ_i(e) = σ( r_i^T W_e e' )                        (2)

where σ is a non-linearity, {s_j} are the state variables corresponding to each of its properties, and e' is the edge variable representing its type. We thus get relevance scores between the instruction and each of the variables, which are finally averaged for each state and edge using R_i.

Having a relevance score for both the nodes and the edges, we can use them to achieve the key goal of this section: shifting the model's attention from the current nodes (states) to their most relevant neighbors – the next states:

p_{i+1}^s = softmax_{s ∈ S}( γ_i(s) )                                  (3)
p_{i+1}^r = softmax_{s ∈ S}( Σ_{(s',s) ∈ E} p_i(s') · γ_i((s',s)) )    (4)
p_{i+1} = r_i' · p_{i+1}^r + (1 − r_i') · p_{i+1}^s                    (5)

Here, we compute the distribution p_{i+1} over the next states by averaging two probabilities, p_{i+1}^s and p_{i+1}^r: the former is based on each potential next state's own internal properties, while the latter considers the next states' contextual relevance, relative to the current states the model attends to. Overall, by repeating this process over N steps, we can simulate the iterative computation of the neural state machine.
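
Putting equations (1)-(5) together, the following numpy sketch illustrates one transition step under the reconstruction above; the learned weight matrices Ws and We, and the scalar form of the relevance scores, are our assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def transition_step(p, r, R, states, edges, edge_reprs, Ws, We):
    """One step of the simulated state machine: redistribute the attention p over
    the states, guided by instruction r and its property distribution R
    (R[j] weights property j of the states, R[-1] is the relation weight r')."""
    num_states, num_props, d = states.shape

    # eq. (1): relevance of each state, averaging its property variables by R
    scores = np.zeros(num_states)
    for j in range(num_props):
        scores += R[j] * ((states[:, j] @ Ws[j]) @ r)   # r^T W_j s_j for every state
    gamma_s = sigmoid(scores)

    # eq. (2): relevance of each edge through its single relation variable
    gamma_e = sigmoid((edge_reprs @ We) @ r)            # one score per edge

    # eq. (3): next-state scores based on each state's own internal properties
    p_s = softmax(gamma_s)

    # eq. (4): contextual scores, shifting probability from the current states
    # to their neighbors along the most relevant edges
    ctx = np.zeros(num_states)
    for (src, dst), g in zip(edges, gamma_e):
        ctx[dst] += p[src] * g
    p_r = softmax(ctx)

    # eq. (5): interpolate by how "relational" the instruction is
    return R[-1] * p_r + (1.0 - R[-1]) * p_s
```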

After completing the final computation step, and in order to predict an answer, we use a standard 2-layer fully-connected softmax classifier that receives the concatenation of the question vector q as well as an additional vector m that aggregates information from the machine's final states:

m = Σ_{s ∈ S} p_N(s) Σ_j R_N(j) · s_j        (6)

where m reflects the information extracted from the final states as guided by the final reasoning instruction r_N: averaged first by the reasoning instruction type R_N, and then by the attention over the final states, as specified by p_N.
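
A small sketch of this final readout and classification step; the two-layer classifier's shapes are illustrative, and the relation slot of R_N is ignored since states carry no relation variable:

```python
import numpy as np

def final_readout(p_N, R_N, states, q, W1, b1, W2, b2):
    """Aggregate the final states into the summary vector m (eq. 6) and classify."""
    num_states, num_props, d = states.shape
    # average each state's property variables by the final instruction type R_N
    per_state = np.einsum('j,sjd->sd', R_N[:num_props], states)   # (num_states, d)
    # then average the states by the final attention distribution p_N
    m = p_N @ per_state                                           # (d,)
    # standard 2-layer fully-connected softmax classifier over [q; m]
    h = np.maximum(W1 @ np.concatenate([q, m]) + b1, 0.0)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()                                            # distribution over answers
```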

Overall, the above process allows us to perform a differentiable traversal over the scene graph, guided by the sequence of instructions that were derived from the question: Given an image and a question, we have first inferred a graph to represent the objects and relations in the image's scene, and analogously decomposed the question into a sequence of reasoning instructions. Notably, we have expressed both the graph and the instructions in terms of the shared vocabulary of semantic concepts, translating them both into the same "internal language". Then, we simulate the state machine's iterative operation, successively shifting our attention across the nodes and edges over the course of its computation as we ground each instruction in the graph to guide our traversal. Essentially, this allows us to locate each part of the question in the image, and to perform sequential reasoning over the objects and relations in the image's scene graph until we finally arrive at the answer.

4 Experiments

Figure 5: A visualization of the NSM’s reasoning process: given an image and a question (left side), the model first builds a probabilistic scene graph (the blue box and the image on the right), and translates the question into a series of instructions (the green and purple boxes, where for each instruction we present its closest concept (or word) in vector space (section 3.1)). The model then performs sequential reasoning over the graph, attending to relevant object nodes in the image’s scene as guided by the instructions, to iteratively compute the answer.

We evaluate our model (NSM) on two recent VQA datasets: (1) The GQA dataset (Hudson & Manning, 2019) which focuses on real-world visual reasoning and compositional question answering, and (2) VQA-CP (version 2) (Agrawal et al., 2018), a split of the VQA dataset (Goyal et al., 2017) that has been particularly designed to test generalization skills across changes in the answer distribution between the training and the test sets. We achieve state-of-the-art performance both for VQA-CP, and, under single-model settings, for GQA. To further explore the generalization capacity of the NSM model, we construct two new splits for GQA that test generalization over both the questions’ content and structure, and perform experiments based on them that provide substantial evidence for the strong generalization skills of our model across multiple dimensions. Finally, performance diagnosis, ablation studies and visualizations are presented in section 6.2 to shed more light on the inner workings of the model and its qualitative behavior.

Both our model and the implemented baselines are trained to minimize the cross-entropy loss of the predicted candidate answer (out of the top 2000 possibilities), using a hidden state size of 300 and, unless otherwise stated, the same fixed number N of computation steps for the MAC and NSM models. Please refer to section 6.5 for further information about the training procedure, implementation details, hyperparameter configuration and data preprocessing, along with a complexity analysis of the NSM model. The model has been implemented in TensorFlow, and will be released along with the features and instructions for reproducing the described experiments.

4.1 Compositional question answering

We begin by testing the model on the GQA task (Hudson & Manning, 2019), a recent dataset that features challenging compositional questions involving diverse reasoning skills in real-world settings, including spatial reasoning, relational reasoning, logic and comparisons. We compare our performance both with the baselines that appear in (Hudson & Manning, 2019), as well as with the top-5 single-model and top-10 ensemble submissions to the GQA challenge (note that the official leaderboard mixes up single-model and ensemble results; we present separate scores for each track). For the single-model setting, to ensure a fair comparison, we consider all models that, similarly to ours, did not use the strong program supervision as an additional training signal, but rather learn directly from the questions and answers.

As table 1 shows, we achieve state-of-the-art performance for a single model across the dataset's various metrics (defined in (Hudson & Manning, 2019)) such as accuracy and consistency. For the ensemble setting, we compute a majority vote of 10 instances of our model, achieving the 3rd highest score among the 52 submissions that participated in the challenge (table 2), with significantly stronger scores than the 4th-placed and lower submissions.

Note that while several submissions (marked with *) use the associated functional programs that GQA provides with each question as a strong supervision during train time, we intentionally did not use them in training our model, but rather aimed to learn the task directly using the question-answer pairs only. These results serve as an indicator for the ability of the model to successfully address questions that involve different forms of reasoning (see section 6 for examples), and especially multi-step inference, which is particularly common in GQA.

Model Binary Open Consistency Validity Plausibility Distribution Accuracy
Human (Hudson & Manning, 2019) 91.20 87.40 98.40 98.90 97.20 - 89.30
Global Prior (Hudson & Manning, 2019) 42.94 16.62 51.69 88.86 74.81 93.08 28.90
Local Prior (Hudson & Manning, 2019) 47.90 16.66 54.04 84.33 84.31 13.98 31.24
Language (Hudson & Manning, 2019) 61.90 22.69 68.68 96.39 87.30 17.93 41.07
Vision (Hudson & Manning, 2019) 36.05 1.74 62.40 35.78 34.84 19.99 17.82
Lang+Vis (Hudson & Manning, 2019) 63.26 31.80 74.57 96.02 84.25 7.46 46.55
BottomUp (Anderson et al., 2017) 66.64 34.83 78.71 96.18 84.57 5.98 49.74
MAC (Hudson & Manning, 2018) 71.23 38.91 81.59 96.16 84.48 5.34 54.06
SK T-Brain* 77.42 43.10 90.78 96.26 85.27 7.54 59.19
PVR* 77.69 43.01 90.35 96.45 84.53 5.80 59.27
GRN 77.53 43.35 88.63 96.18 84.71 6.06 59.37
Dream 77.84 43.72 91.71 96.38 85.48 8.40 59.72
LXRT 77.76 44.97 92.84 96.30 85.19 8.31 60.34
NSM 78.94 49.25 93.25 96.41 84.28 3.71 63.17
Table 1: GQA scores for the single-model settings, including official baselines and top submissions

4.2 Generalization experiments

Motivated to measure the generalization capacity of our model, we perform experiments over three different dimensions: (1) changes in the answer distribution between the training and the test sets, (2) contextual generalization for concepts learned in isolation, and (3) unseen grammatical structures.

First, we measure the performance on VQA-CP (Agrawal et al., 2018), which provides a new split of the VQA2 dataset (Goyal et al., 2017), where the answer distribution is kept different between the training and the test sets (e.g. in the training set, the most common color answer is white, whereas in the test set, it is black). Such settings reduce the extent to which models can circumvent the need for genuine scene understanding skills by exploiting dataset biases and superficial statistics (Agrawal et al., 2016; Johnson et al., 2017; Goyal et al., 2017), and are known to be particularly difficult for neural networks (Lake et al., 2017). Here, we follow the standard VQA accuracy metric for this task (defined in (Agrawal et al., 2018)). Table 3 presents our performance compared to existing approaches. We can see that NSM surpasses alternative models by a large margin.

We perform further generalization studies on GQA, leveraging the fact that the dataset provides grounding annotations of the question words. For instance, a question such as "What color is the book on the table?" is accompanied by the annotation expressing the fact that e.g. the word "book" refers to the book object node. These annotations allow us to split the training set in two interesting ways to test generalization over both content and structure (see figure 6 for an illustration of each split):

Content: Since the annotations specify which objects each question refers to, and by using the GQA ontology, we can identify all the questions that are concerned with particular object types, e.g. foods, or animals. We use this observation to split the training set by excluding all question-answer pairs that refer to these categories, and measure the model’s generalization over them. Note however, that the object detector module described in section 3.2 is still trained over all the scene graphs including those objects – rather, the goal of this split is to test whether the model can leverage the fact that it was trained to identify a particular object in isolation, in order to answer unseen questions about that type of object without any further question training.

Model Accuracy
Kakao* 73.33
270 70.23
NSM 67.25
LXRT 62.71
GRN 61.22
MSM 61.09
DREAM 60.93
SK T-Brain* 60.87
PKU 60.79
Musan 59.93
Table 2: GQA ensemble
Model Accuracy
SAN (Yang et al., 2016) 24.96
HAN (Malinowski et al., 2018) 28.65
GVQA (Agrawal et al., 2018) 31.30
RAMEN (Shrestha et al., 2019) 39.21
BAN (Kim et al., 2018) 39.31
MuRel (Cadene et al., 2019) 39.54
ReGAT (Li et al., 2019a) 40.42
NSM 45.80
Table 3: VQA-CPv2
Model Content Structure
Global Prior 8.51 14.64
Local Prior 12.14 18.21
Vision 17.51 18.68
Language 21.14 32.88
Lang+Vis 24.95 36.51
BottomUp (Anderson et al., 2017) 29.72 41.83
MAC (Hudson & Manning, 2018) 31.12 47.27
NSM 40.24 55.72
Table 4: GQA generalization
Figure 6: Our new generalization splits for GQA, evaluating generalization over (1) content: where test questions ask about novel concepts, and (2) structure: where test questions follow unseen linguistic patterns.

Structure: We can use the annotations described above to mask out the words that refer to objects (see figure 6 for examples), reducing each question to its underlying linguistic pattern and allowing us to divide the questions in the training set into pattern groups. Then, by splitting these groups into two separate sets, we can test whether a model is able to generalize from some linguistic structures to unseen ones.

Table 4 summarizes the results for both settings, comparing our model to the baselines released for GQA (Hudson & Manning, 2019), all using the same training scheme and input features. We can see that here as well, NSM performs significantly better than the alternative approaches, testifying to its strong generalization capacity both over concepts it has not seen any questions about (but only learned in isolation), as well as over questions that involve novel linguistic structures. In our view, these results point to the strongest quality of our approach. Several prior works have argued for the great potential of abstractions and compositionality in enhancing models of deep learning (Andreas et al., 2018; Battaglia et al., 2018). Our results suggest that incorporating these notions may indeed be highly beneficial for creating models that are better at coping with changing conditions and can better generalize to novel situations.

5 Conclusion

In this paper, we have introduced the Neural State Machine, a graph-based network that simulates the operation of an automaton, and demonstrated its versatility, robustness and strong generalization skills on the tasks of real-world visual reasoning and compositional question answering. By incorporating the concept of a state machine into neural networks, we are able to introduce a strong structural prior that enhances compositionality both in terms of the representation, by having a structured graph serve as our world model, as well as in terms of the computation, by performing sequential reasoning over such graphs. We hope that our model will help in the effort to integrate symbolic and connectionist approaches more closely together, in order to elevate neural models from sensory and perceptual tasks, where they currently shine, into the domains of higher-level abstraction, knowledge representation, compositionality and reasoning.

References

6 Supplementary material

Figure 7: Accuracy of NSM and baselines for different structural and semantic question types of the GQA dataset. NSM surpasses the baselines for all question types, with the largest gains for relational and comparative questions, and high improvements for attribute, logic and query (open) questions.

6.1 Related work (full version)

Our model connects to multiple lines of research, including works about concept acquisition (Higgins et al., 2017), visual abstractions (Gan et al., 2017; Wu et al., 2016a; Lu et al., 2018, 2017; Yi et al., 2018), compositionality (Bottou, 2014; Hu et al., 2017), and neural computation (Graves et al., 2014; Miller et al., 2016; Andreas et al., 2016). Several works have explored the discovery and use of visual concepts in the contexts of reinforcement or unsupervised learning (Higgins et al., 2016; Choi et al., 2018) as well as in classical computer vision (Farhadi et al., 2009; Vogel & Schiele, 2007). Others have argued for the importance of incorporating strong inductive biases into neural architectures (Bottou, 2014; Battaglia et al., 2018; Andreas, 2019; Andreas et al., 2018), and indeed, there is a growing body of research that seeks to introduce different forms of structural priors inspired by computer architectures (Graves et al., 2016; Weston et al., 2014; Hudson & Manning, 2018) or theory of computation (Aleksander, 1994; Forcada & Carrasco, 2001), aiming to bridge the gap between the symbolic and neural paradigms.

One such structural prior that underlies our model is that of the probabilistic scene graph (Johnson et al., 2015; Krishna et al., 2017), which we construct and reason over to answer questions about presented images. Scene graphs provide a succinct representation of the image's semantics, and have been effectively used for a variety of applications such as image retrieval, captioning or generation (Johnson et al., 2015; Li et al., 2017; Johnson et al., 2018). Recent years have witnessed an increasing interest both in scene graphs in particular (Xu et al., 2017; Yang et al., 2018; Chen et al., 2019; Zellers et al., 2018; Li et al., 2018) as well as in graph networks in general (Battaglia et al., 2018; Kipf & Welling, 2016; Veličković et al., 2017) – a family of graph-structured models in which information is iteratively propagated across the nodes, making their representations increasingly contextual with information from their neighborhoods. In our work, we also use the general framework of graph networks, but in contrast to common approaches, we avoid the computationally-heavy state updates per each node, keeping the graph states static once predicted, and instead perform a series of refinements of one global attention distribution over the nodes, resulting in a soft graph traversal which better suits our need for supporting sequential reasoning.

We explore our model in the context of visual question answering (Gupta, 2017), a challenging multi-modal task that has gained substantial attention over the last years. Plenty of models have been proposed (Anderson et al., 2017; Wang et al., 2017; Hudson & Manning, 2018), focusing on visual reasoning or question answering in both abstract and real-world settings (Goyal et al., 2017; Johnson et al., 2017; Santoro et al., 2018; Hudson & Manning, 2019). Typically, most approaches use either CNNs (Xiong et al., 2016; Yang et al., 2016) or object detectors (Anderson et al., 2017) to derive visual features which are then compared to a fixed-size question embedding obtained by an LSTM. A few newer models use the relationships among objects to augment features with contextual information from each object’s surroundings (Li et al., 2019b; Teney et al., 2017; Norcliffe-Brown et al., 2018). We move further in this direction, performing iterative reasoning over the inferred scene graphs, and in contrast to prior models (Li et al., 2019a; Cadene et al., 2019), incorporate higher-level semantic concepts to represent both the visual and linguistic modalities in a shared and sparser manner to facilitate their interaction.

Other methods for visual reasoning such as (Yi et al., 2018; Mao et al., 2019; Mascharka et al., 2018) have similar motivation to ours of integrating neural and symbolic perspectives by learning and reasoning over structured representations. However, in contrast to our approach, those models heavily rely on symbolic execution, either through strong supervision of program annotations (Mascharka et al., 2018), non-differentiable python-based functions (Yi et al., 2018), or a collection of hand-crafted modules specifically designed for each given task (Mao et al., 2019), and consequently, they have mostly been explored in artificial environments such as CLEVR (Johnson et al., 2017). Instead, our model offers a fully-neural and more general graph-based design that, as demonstrated by our experiments, successfully scales to real-world settings.

Closest to our work is a model called MAC (Hudson & Manning, 2018), a recurrent network that applies attention-based operations to perform sequential reasoning. However, the Neural State Machine differs from MAC in two crucial respects: First, we reason over graph structures rather than directly over spatial maps of visual features, traversing the graph by successively shifting attention across its nodes and edges. Second, key to our model is the notion of semantic concepts that are used to express knowledge about both the visual and linguistic modalities, instead of working directly with the raw observational features. As our findings suggest, these ideas contribute significantly to the model’s performance compared to MAC and enhance its compositionality, transparency and generalization skills.

6.2 Ablation studies

Figure 8: Accuracy as a function of the number of computation steps. Performance is reported on the GQA Test-Dev split.

To gain further insight into the relative contributions of different aspects of our model to its overall performance, we have conducted multiple ablation experiments, as summarized in table 5 and figure 8. First, we measure the impact of using our new visual features (section 3.2) compared to the default features used by the official baselines (Hudson & Manning, 2019). Such settings result in an improvement of 1.27%, confirming that most of the gain achieved by the NSM model (with 62.95% overall) stems from its inherent architecture rather than from the input features.

We then explore several ablated models: one where we do not perform any iterative simulation of the state machine, but rather directly predict the answer from its underlying graph structure (55.41%); and a second, enhanced version where we perform a traversal across the states but without considering the typed relations among them, adopting instead a uniform fully-connected graph (58.83%). These experiments suggest that using the graph structure to represent the visual scene, as well as performing sequential reasoning over it by traversing its nodes, are both crucial to the model's overall accuracy and have a positive impact on its performance.

Next, we evaluate a variant of the model where, instead of using the concept-based representations defined in section 3.1, we fall back to the more standard dense features to represent the various graph elements, which results in an accuracy of 58.48%. Comparing this score to that of the default model's settings (62.95%) demonstrates the significance of the higher-level semantic representations for the overall accuracy of the NSM.

Finally, we measure the impact of varying the number of computation steps on the obtained results (figure 8), revealing a steady increase in accuracy as we perform a higher number of reasoning steps, until saturation. These experiments further validate the effectiveness of sequential computation in addressing challenging questions, and especially compositional questions which are at the core of GQA.

6.3 Concept vocabulary

We define a vocabulary of 1335 embedded concepts about object types (e.g. cat, shirt), attributes, grouped into types (e.g. colors, materials), and relations (e.g. holding, on top of), all initialized with GloVe (Pennington et al., 2014). Our concepts consist of 785 objects, 170 relations, and 303 attributes that are divided into 77 types, with most types being binary (e.g. short/tall, light/dark). All concepts are derived from the Visual Genome dataset (Krishna et al., 2017), which provides human-annotated scene graphs for real-world images. We use the refined dataset supported by GQA (Hudson & Manning, 2019), where the graphs are further cleaned-up and consolidated, resulting in a closed-vocabulary version of the dataset.

6.4 Scene graph generation

Model Accuracy
NSM (default) 62.95 ± 0.22
No concepts 58.48 ± 0.14
No relations (set) 58.83 ± 0.17
No traversal 55.41 ± 0.35
Visual features 47.82 ± 0.09
Baseline (Lang+Vis) 46.55 ± 0.13
Table 5: Ablations (reported on GQA Test-Dev)

In order to generate scene graphs from given images, we re-implement a model that largely follows Yang et al. (2018); Chen et al. (2019) in conjunction with an object detector proposed by Hu et al. (2018). Specifically, we use a variant of Mask R-CNN (He et al., 2017; Hu et al., 2018) to obtain object detections from each image, using Hu et al. (2018)'s official implementation along with ResNet-101 (He et al., 2016) and FPN (Lin et al., 2017) for features and region proposals respectively, and keep the top detections that pass a confidence threshold. We train the object detector (and the following graph generation model) over a cleaner version of Visual Genome scene graphs (Krishna et al., 2017), offered as part of the GQA dataset (Hudson & Manning, 2019). In particular, the detector heads are trained to classify both the object class and category (akin to YOLO's hierarchical softmax (Redmon & Farhadi, 2017)), as well as the object's attributes per each property type, resulting in a set of probability distributions over the object's various properties (e.g. its identity, color or shape).

Once we obtain the graph nodes – the set of detected objects, we proceed to detecting their relations, mainly following the approach of Yang et al. (2018): First, we create a directed edge for each pair of objects that are in close enough proximity – as long as the relative distance between the objects in both axes is smaller than 15% of the image dimensions (which covers over 94% of the ground truth edges, and allows us to sparsify the graph, reducing the computation load). Once we set the graph structure, we process it through a graph attention network (Veličković et al., 2017) to predict the identities of each relation, resulting in a probability distribution over the relation type of each edge.
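
The proximity rule can be sketched as follows; the box format and helper names are our assumptions, while the 15% threshold is taken from the text:

```python
def close_enough(box_a, box_b, img_w, img_h, thresh=0.15):
    """Candidate edge test: the two detections must be within 15% of the image
    dimensions of each other along both axes (boxes given as center x, y, w, h)."""
    (xa, ya, _, _), (xb, yb, _, _) = box_a, box_b
    return abs(xa - xb) <= thresh * img_w and abs(ya - yb) <= thresh * img_h

def candidate_edges(boxes, img_w, img_h):
    """Directed edges between every ordered pair of sufficiently close detections."""
    return [(i, j) for i, b_i in enumerate(boxes)
                   for j, b_j in enumerate(boxes)
                   if i != j and close_enough(b_i, b_j, img_w, img_h)]
```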

6.5 Implementation and training details

We train the model using the Adam optimization method (Kingma & Ba, 2014), with a manually tuned learning rate and a batch size of 64. We use gradient clipping, and employ early stopping based on the validation accuracy, leading to a training process of approximately 15 epochs, equivalent to roughly 30 hours on a single Maxwell Titan X GPU. Both the hidden states and word vectors have a dimension size of 300, the latter being initialized using GloVe (Pennington et al., 2014). During training, we maintain exponential moving averages of the model weights, with a decay rate of 0.999, and use them at test time instead of the raw weights. Finally, we use ELU as the non-linearity and a dropout rate of 0.15 across the network: in the initial processing of images and questions, for the state and edge representations, and in the answer classifier. All hyperparameters, including the learning rate, batch size and dropout, were tuned manually, with the hidden and word dimensions set to comply with the sizes provided by GloVe.
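
The exponential moving average of the weights can be sketched as follows; this is a plain numpy illustration using the decay of 0.999 from the text, whereas the actual implementation is in TensorFlow:

```python
import numpy as np

class WeightEMA:
    """Keep exponential moving averages of the model weights (decay 0.999) and use
    the averaged weights instead of the raw ones at test time."""
    def __init__(self, weights, decay=0.999):
        self.decay = decay
        self.shadow = {name: np.array(w, copy=True) for name, w in weights.items()}

    def update(self, weights):
        # called after every training step with the current raw weights
        for name, w in weights.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * w

    def averaged(self):
        return self.shadow
```

At evaluation time, the shadow values would be loaded into the model in place of the raw weights.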

We have preprocessed all the questions by removing punctuation and keeping the top 5000 most common words, and use the standard training/test splits provided by the original datasets we have explored (Hudson & Manning, 2019; Agrawal et al., 2018): for GQA, we use the more common "balanced" version that has been designed to reduce biases within the answer distribution (similar in motivation to the VQA2 dataset (Goyal et al., 2017)), and includes 1.7M questions split into 70%/10%/10% for the training, validation and test sets respectively. For VQA-CP, we use version v2, which consists of 438k/220k questions for training/test respectively. Finally, for the new generalization splits, we have downsampled the data to a ratio of 80%/20% for training/test over approximately 800k questions overall. All results reported are for single-model settings, except the ensemble scores for GQA, which compute a majority vote over 10 models, and the ablation studies, which are performed over 5 runs for each ablated version. To measure the confidence of the results, we have performed 5 additional runs of our best-performing model over both GQA Test-Dev and VQA-CP, obtaining standard deviations of 0.22 and 0.31 respectively.

In terms of time complexity, the NSM is linear both in the number of question words and in the size of the graph (|S| states and |E| edges, where |E| is kept proportional to |S| due to the proximity-based pruning we perform while constructing the graph), multiplied by a constant number N of reasoning steps. Similarly, the space complexity of our model is linear in the graph size, since we do not have to store separate values for each computation step, but rather keep only the most recent ones.