Think Visually: Question Answering through Virtual Imagery

by Ankit Goyal, et al.
University of Michigan

In this paper, we study the problem of geometric reasoning in the context of question-answering. We introduce Dynamic Spatial Memory Network (DSMN), a new deep network architecture designed for answering questions that admit latent visual representations. DSMN learns to generate and reason over such representations. Further, we propose two synthetic benchmarks, FloorPlanQA and ShapeIntersection, to evaluate the geometric reasoning capability of QA systems. Experimental results validate the effectiveness of our proposed DSMN for visual thinking tasks.



1 Introduction

The ability to reason is a hallmark of intelligence and a requirement for building question-answering (QA) systems. In AI research, reasoning has been strongly associated with logic and symbol manipulation, as epitomized by work in automated theorem proving (Fitting, 2012). But for humans, reasoning involves not only symbols and logic, but also images and shapes. Einstein famously wrote: “The psychical entities which seem to serve as elements in thought are certain signs and more or less clear images which can be ‘voluntarily’ reproduced and combined… Conventional words or other signs have to be sought for laboriously only in a secondary state…” And the history of science abounds with discoveries from visual thinking, from the Benzene ring to the structure of DNA (Pinker, 2003).

There are also plenty of ordinary examples of human visual thinking. Consider a square room with a door in the middle of its southern wall. Suppose you are standing in the room such that the eastern wall of the room is behind you. Where is the door with respect to you? The answer is ‘to your left.’ Note that in this case both the question and answer are just text. But in order to answer the question, it is natural to construct a mental picture of the room and use it in the process of reasoning. Similar to humans, the ability to ‘think visually’ is desirable for AI agents like household robots. An example could be to construct a rough map and navigation plan for an unknown environment from verbal descriptions and instructions.

In this paper, we investigate how to model geometric reasoning (a form of visual reasoning) using deep neural networks (DNN). Specifically, we address the task of answering questions through geometric reasoning—both the question and answer are expressed in symbols or words, but a geometric representation is created and used as part of the reasoning process.

In order to focus on geometric reasoning, we do away with natural language by designing two synthetic QA datasets, FloorPlanQA and ShapeIntersection. In FloorPlanQA, we provide the blueprint of a house in words and ask questions about location and orientation of objects in it. For ShapeIntersection, we give a symbolic representation of various shapes and ask how many places they intersect. In both datasets, a reference visual representation is provided for each sample.

Further, we propose Dynamic Spatial Memory Network (DSMN), a novel DNN that uses virtual imagery for QA. DSMN is similar to existing memory networks (Kumar et al., 2016; Sukhbaatar et al., 2015; Henaff et al., 2016) in that it uses vector embeddings of questions and memory modules to perform reasoning. The main novelty of DSMN is that it creates virtual images for the input question and uses a spatial memory to aid the reasoning process.

We show through experiments that with the aid of an internal visual representation and a spatial memory, DSMN outperforms strong baselines on both FloorPlanQA and ShapeIntersection. We also demonstrate that explicitly learning to create visual representations further improves performance. Finally, we show that DSMN is substantially better than the baselines even when visual supervision is provided for only a small proportion of the samples.

It’s important to note that our proposed datasets consist of synthetic questions as opposed to natural texts. Such a setup allows us to sidestep difficulties in parsing natural language and instead focus on geometric reasoning. However, synthetic data lacks the complexity and diversity of natural text. For example, spatial terms used in natural language have various ambiguities that need to be resolved by context (e.g. how far is “far” and whether “to the left” is relative to the speaker or the listener) (Shariff, 1998; Landau and Jackendoff, 1993), but our synthetic data lacks such complexities. Therefore, our method and results do not automatically generalize to real-life tasks involving natural language. Additional research is needed to extend and validate our approach on natural data.

Our contributions are three-fold: First, we present Dynamic Spatial Memory Network (DSMN), a novel DNN that performs geometric reasoning for QA. Second, we introduce two synthetic datasets that evaluate a system’s visual thinking ability. Third, we demonstrate that on synthetic data, DSMN achieves superior performance for answering questions that require visual thinking.

2 Related Work

Natural language datasets for QA: Several natural language QA datasets have been proposed to test AI systems on various reasoning abilities (Levesque et al., 2011; Richardson et al., 2013). Our work differs from them in two key aspects: first, we use synthetic data instead of natural data; and second, we specialize in geometrical reasoning instead of general language understanding. Using synthetic data helps us simplify language parsing and thereby focus on geometric reasoning. However, additional research is necessary to generalize our work to natural data.

Synthetic datasets for QA: Synthetic datasets for QA have recently become important in AI as well. In particular, bAbI (Weston et al., 2015) has driven the development of several recent DNN-based QA systems (Kumar et al., 2016; Sukhbaatar et al., 2015; Henaff et al., 2016). bAbI consists of 20 tasks to evaluate different reasoning abilities. Two tasks, Positional Reasoning (PR) and Path Finding (PF), are related to geometric reasoning. However, each Positional Reasoning question contains only two sentences, and can be solved through simple logical deduction such as ‘A is left of B implies B is right of A’. Similarly, Path Finding involves a search problem that requires simple spatial deductions such as ‘A is east of B implies B is west of A’. In contrast, the questions in our datasets involve longer descriptions, more entities, and more relations; they are thus harder to answer with simple deductions. We also provide a reference visual representation for each sample, which is not available in bAbI.

Mental Imagery and Visual Reasoning: The importance of visual reasoning has long been recognized in AI (Forbus et al., 1991; Lathrop and Laird, 2007). Prior works in NLP (Seo et al., 2015; Lin and Parikh, 2015) have also studied visual reasoning. Our work differs from them in that we use synthetic language instead of natural language. Our synthetic language is easier to parse, allowing our evaluation to mainly reflect the performance of geometric reasoning. On the other hand, while our method and conclusions can potentially apply to natural text, this remains to be validated and involves nontrivial future work. There are other differences from prior works as well. Specifically, Seo et al. (2015) combined information from textual questions and diagrams to build a model for solving SAT geometry questions. However, our task is different as diagrams are not provided as part of the input, but are generated from the words/symbols themselves. Also, Lin and Parikh (2015) take advantage of synthetic images to gather semantic common sense knowledge (visual common sense) and use it to perform fill-in-the-blank (FITB) and visual paraphrasing tasks. Similar to us, they also form ‘mental images’. However, there are two differences (apart from natural vs. synthetic language): first, their benchmark tests higher-level semantic knowledge (like “Mike is having lunch when he sees a bear.” “Mike tries to hide.”), while ours is more focused on geometric reasoning. Second, their model is based on hand-crafted features while we use a DNN.

Spatial language for Human-Robot Interaction: Our work is also related to prior work on making robots understand spatial commands (e.g. “put that box here”, “move closer to the box”) and complete tasks such as navigation and assembly. Earlier work (Müller et al., 2000; Gribble et al., 1998; Zelek, 1997) in this domain used template-based commands, whereas more recent work (Skubic et al., 2004) tried to make the commands more natural. This line of work differs from ours in that the robot has visual perception of its environment that allows grounding of the textual commands, whereas in our case the agent has no visual perception, and an environment needs to be imagined.

Image Generation: Our work is related to image generation using DNNs which has a large body of literature, with diverse approaches (Reed et al., 2016; Gregor et al., 2015). We also generate an image from the input. But in our task, image generation is in the service of reasoning rather than an end goal in itself—as a result, photorealism or artistic style of generated images is irrelevant and not considered.

Visual Question Answering: Our work is also related to visual QA (VQA) (Johnson et al., 2016; Antol et al., 2015; Lu et al., 2016). Our task is different from VQA because our questions are in terms of words/symbols whereas in VQA the questions are visual, consisting of both text descriptions and images. The images involved in our task are internal and virtual, and are not part of the input or output.

Memory and Attention: Memory and attention have been increasingly incorporated into DNNs, especially for tasks involving algorithmic inference and/or natural language (Graves et al., 2014; Vaswani et al., 2017). For QA tasks, memory and attention play an important role in state-of-the-art (SOTA) approaches. Sukhbaatar et al. (2015) introduced End-To-End Memory Network (MemN2N), a DNN with memory and recurrent attention mechanism, which can be trained end-to-end for diverse tasks like textual QA and language modeling. Concurrently, Kumar et al. (2016) introduced Dynamic Memory Network (DMN), which also uses attention and memory. Xiong et al. (2016) proposed DMN+, with several improvements over the previous version of DMN, and achieved SOTA results on VQA (Antol et al., 2015) and bAbI (Weston et al., 2015). Our proposed DSMN is a strict generalization of DMN+ (see Sec. 4.1). On removing the images and spatial memory from DSMN, it reduces to DMN+. Recently Gupta et al. (2017) also used spatial memory in their deep learning system, but for visual navigation. We are using spatial memory for QA.

Figure 1: An example in the ShapeIntersection dataset.

Component     Template
House door    The house door is in the middle of the {nr, sr, er, wr} wall of the house.
              The house door is located in the {n-er, s-er, n-wr, s-wr, n-er, s-er, n-wr, s-wr} side of the house, such that it opens towards {n, s, e, w}.
Room door     The door for this room is in the middle of its {nr, sr, er, wr} wall.
              This room’s door is in the middle of its {nr, sr, er, wr} wall.
              The door for this room is located in its {n-er, s-er, n-wr, s-wr, n-er, s-er, n-wr, s-wr} side, such that it opens towards {n, s, e, w}.
              This room’s door is located in its {n-er, s-er, n-wr, s-wr, n-er, s-er, n-wr, s-wr} side, such that it opens towards {n, s, e, w}.
Small room    Room {1, 2, 3} is small in size and it is located in the {n, s, e, w, c, n-e, s-e, n-w, s-w} of the house.
              Room {1, 2, 3} is located in the {n, s, e, w, c, n-e, s-e, n-w, s-w} of the house and is small in size.
Medium room   Room {1, 2, 3} is medium in size and it extends from the {n, s, e, w, c, n-e, s-e, n-w, s-w} to the {n, s, e, w, c, n-e, s-e, n-w, s-w} of the house.
              Room {1, 2, 3} extends from the {n, s, e, w, c, n-e, s-e, n-w, s-w} to the {n, s, e, w, c, n-e, s-e, n-w, s-w} of the house and is medium in size.
Large room    Room {1, 2, 3} is large in size and it stretches along the {n-s, e-w} direction in the {n, s, e, w, c} of the house.
              Room {1, 2, 3} stretches along the {n-s, e-w} direction in the {n, s, e, w, c} of the house and is large in size.
Object        A {cu, cd, sp, co} is located in the middle of the {nr, sr, er, wr} part of the house.
              A {cu, cd, sp, co} is located in the {n-er, s-er, n-wr, s-wr, n-er, s-er, n-wr, s-wr, cr} part of the house.
              A {cu, cd, sp, co} is located in the middle of the {nr, sr, er, wr} part of this room.
              A {cu, cd, sp, co} is located in the {n-er, s-er, n-wr, s-wr, n-er, s-er, n-wr, s-wr, cr} part of this room.

Table 1: Templates used by the description generator for FloorPlanQA. For compactness we use the following notation: n - north, s - south, e - east, w - west, c - center, nr - northern, sr - southern, er - eastern, wr - western, cr - central, cu - cube, cd - cuboid, sp - sphere and co - cone.

3 Datasets

We introduce two synthetically generated QA datasets to evaluate a system’s geometric reasoning ability: FloorPlanQA and ShapeIntersection. These datasets are not meant to test natural language understanding, but instead focus on geometric reasoning. Owing to their synthetic nature, they are easy to parse, but they are nevertheless challenging for DNNs like DMN+ (Xiong et al., 2016) and MemN2N (Sukhbaatar et al., 2015) that achieved SOTA results on existing QA datasets (see Table 2(a)).

The proposed datasets are similar in spirit to bAbI (Weston et al., 2015), which is also synthetic. In spite of its synthetic nature, bAbI has proved to be a crucial benchmark for the development of new models like MemN2N and DMN+, variants of which have proved successful in various natural domains (Kumar et al., 2016; Perez and Liu, 2016). Our proposed datasets are the first to explicitly test ‘visual thinking’, and their synthetic nature helps us avoid the expensive and tedious task of collecting human annotations. Meanwhile, it is important to note that conclusions drawn from synthetic data do not automatically translate to natural data, and methods developed on synthetic benchmarks need additional validation on natural domains.

The proposed datasets also contain visual representations of the questions. Each of them has 38,400 questions, evenly split into a training set, a validation set and a test set (12,800 each).

FloorPlanQA: Each sample in FloorPlanQA involves the layout of a house that has multiple rooms (max 3). The rooms are either small, medium or large. All the rooms and the house have a door. Additionally, each room and empty-space in the house (i.e. the space in the house that is not part of any room) might also contain an object (either a cube, cuboid, sphere, or cone).

Each sample has four components: a description, a question, an answer, and a visual representation. Each sentence in the description describes either a room, a door or an object. Each question follows the template: “Suppose you are entering the {house, room 1, room 2, room 3}, where is the {house door, room 1 door, room 2 door, room 3 door, cube, cuboid, sphere, cone} with respect to you?” The answer is one of left, right, front, or back. Other characteristics of FloorPlanQA are summarized in Fig. 2.
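To make the question template concrete, the following sketch computes an egocentric answer (left, right, front, back) from a viewer position, a heading, and a target position. It is a hypothetical helper and not the dataset's generator: the coordinate convention (x grows east, y grows north), the function name `relative_direction`, and the tie-breaking rule between front/back and left/right are all assumptions made for illustration.

```python
# Unit vectors for the four headings, assuming x grows east and y grows north.
HEADINGS = {"n": (0, 1), "s": (0, -1), "e": (1, 0), "w": (-1, 0)}


def relative_direction(viewer_xy, heading, target_xy):
    """Classify where target_xy lies relative to a viewer facing `heading`.

    Assumption: the dominant axis decides the answer, and ties go to
    front/back. The real dataset may resolve such cases differently.
    """
    fx, fy = HEADINGS[heading]
    dx = target_xy[0] - viewer_xy[0]
    dy = target_xy[1] - viewer_xy[1]
    # Component of the offset along the facing direction.
    forward = dx * fx + dy * fy
    # Component along the viewer's left, i.e. the heading rotated 90 degrees
    # counter-clockwise: (-fy, fx).
    left = dx * (-fy) + dy * fx
    if abs(forward) >= abs(left):
        return "front" if forward >= 0 else "back"
    return "left" if left > 0 else "right"


# The room example from the introduction: the eastern wall is behind the
# viewer (so the viewer faces west) and the door is in the middle of the
# southern wall of a unit-square room.
answer = relative_direction((0.5, 0.5), "w", (0.5, 0.0))
```

With these conventions the door in the introductory example comes out on the viewer's left, matching the answer given in the text.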

The visual representation of a sample consists of an ordered set of image channels, one per sentence in the description. An image channel pictorially represents the location and/or orientation of the described item (room, door, object) w.r.t. the house. An example is shown in Fig. 2.

To generate samples for FloorPlanQA, we define a probabilistic generative process which produces tree structures representing layouts of houses, similar to scene graphs used in computer graphics. The root node of a tree represents an entire house, and the leaf nodes represent rooms. We use a description generator and a visual generator to produce, respectively, the description and the visual representation from the tree structure. The templates used by the description generator are described in Table 1. Furthermore, the order of sentences in a description is randomized while making sure that the description still makes sense. For example, in one sample the description of room 1 can appear before that of the house door, while in another sample the order could be reversed. Similarly, for a room, the sentence describing the room’s door could appear before or after the sentence describing the object in the room (if the room contains one). We perform rejection sampling to ensure that all the answers are equally likely, thus removing answer bias.
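The answer-balancing step can be illustrated with a small rejection sampler. This is a minimal sketch under stated assumptions, not the authors' generator: it uses a per-answer quota as the acceptance rule, and `biased_toy_generator`, `sample_balanced`, and the sample dictionary layout are hypothetical names introduced only for this example.

```python
import random
from collections import Counter

ANSWERS = ("left", "right", "front", "back")


def biased_toy_generator(rng):
    """Stand-in for a layout generator whose raw answers are not uniform."""
    answer = rng.choices(ANSWERS, weights=(6, 2, 1, 1))[0]
    return {"answer": answer}


def sample_balanced(draw_sample, per_answer, seed=0, max_draws=1_000_000):
    """Keep drawing samples, rejecting those whose answer quota is full."""
    rng = random.Random(seed)
    counts = Counter()
    kept = []
    target = per_answer * len(ANSWERS)
    draws = 0
    while len(kept) < target:
        if draws >= max_draws:
            raise RuntimeError("could not fill every answer quota")
        draws += 1
        sample = draw_sample(rng)
        if counts[sample["answer"]] < per_answer:
            counts[sample["answer"]] += 1
            kept.append(sample)  # accepted
        # otherwise the sample is rejected and we draw again
    return kept


balanced = sample_balanced(biased_toy_generator, per_answer=50)
answer_counts = Counter(s["answer"] for s in balanced)
```

After sampling, every answer appears equally often even though the underlying generator strongly prefers one of them, so a model cannot score above chance by learning the answer prior.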

Vocabulary size: 66
# unique sentences: 264
# unique descriptions: 38093
# unique questions: 32
# unique question-description pairs: 38228
Avg. # words per sentence: 15
Avg. # sentences per description: 6.61

Figure 2: An example and characteristics of FloorPlanQA (when considering all the 38,400 samples, i.e. training, validation and test sets combined).

ShapeIntersection: As the name suggests, ShapeIntersection is concerned with counting the number of intersection points between shapes. In this dataset, the description consists of symbols representing various shapes, and the question is always “how many points of intersection are there among these shapes?”

There are three types of shapes in ShapeIntersection: rectangles, circles, and lines. The description of shapes is provided in the form of a sequence of 1D vectors, each vector representing one shape. A vector in ShapeIntersection is analogous to a sentence in FloorPlanQA. Hence, for ShapeIntersection, the term ‘sentence’ actually refers to a vector. Each sentence describing a shape consists of 5 real numbers. The first number stands for the type of shape: 1 - line, 2 - circle, and 3 - rectangle. The subsequent four numbers specify the size and location of the shape. For example, in case of a rectangle, they represent its height, its width, and coordinates of its bottom-left corner. Note that one can also describe the shapes using a sentence, e.g. “there is a rectangle at (5, 5), with a height of 2 cm and width of 8 cm.” However, as our focus is to evaluate ‘visual thinking’, we work directly with the symbolic encoding.
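As a hedged illustration of what the task asks a model to compute, the sketch below counts intersection points among the line shapes of a description. The text only specifies the layout of the four location numbers for rectangles, so the line encoding used here, `(1, x1, y1, x2, y2)` with the two endpoints, is an assumption. Circles and rectangles are ignored, and touching or collinear segments are not counted, so this is a partial example rather than a full solver for the dataset.

```python
from itertools import combinations

LINE = 1  # shape-type code from the description: 1 - line


def _orient(p, q, r):
    """Signed area test: > 0 if r is left of the directed line p -> q."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a single interior point."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def count_line_intersections(description):
    """Count pairwise crossings among the line 'sentences' of a description.

    Assumption: a line vector is (1, x1, y1, x2, y2). Other shape types
    are skipped.
    """
    segments = [((x1, y1), (x2, y2))
                for kind, x1, y1, x2, y2 in description if kind == LINE]
    return sum(segments_cross(a, b, c, d)
               for (a, b), (c, d) in combinations(segments, 2))


description = [
    (1, 0, 0, 4, 4),  # diagonal line
    (1, 0, 4, 4, 0),  # opposite diagonal, crosses the first at (2, 2)
    (1, 0, 5, 4, 5),  # horizontal line above both diagonals
    (3, 2, 8, 5, 5),  # rectangle (height 2, width 8, corner (5, 5)); skipped
]
num_crossings = count_line_intersections(description)
```

A network trained on ShapeIntersection has to learn this kind of geometric computation implicitly from the symbolic vectors alone.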

In a given description, there are 6.5 shapes on average, and at most 6 lines, 3 rectangles and 3 circles. All the shapes in the dataset are unique and lie on a canvas. While generating the dataset, we do rejection sampling to ensure that the number of intersections is uniformly distributed from 0 to the maximum possible number of intersections, regardless of the number of lines, rectangles, and circles. This ensures that the number of intersections cannot be estimated from the number of lines, circles or rectangles.

Similar to FloorPlanQA, the visual representation for a sample in this dataset is an ordered set of image channels. Each channel is associated with a sentence, and it plots the described shape. An example is shown in Figure 1.
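The idea of one image channel per sentence can be sketched by rasterizing a single rectangle sentence. The function name `rectangle_channel`, the canvas `extent`, the pixel `resolution`, and the choice to draw only the outline are assumptions for illustration; the dataset's actual rendering may differ.

```python
def rectangle_channel(height, width, x0, y0, extent, resolution):
    """Return a binary image (list of rows) with a rectangle outline set to 1.

    (x0, y0) is the bottom-left corner, as in the rectangle encoding above.
    Row 0 of the image corresponds to y = 0. `extent` is the side length of
    the square canvas in description units (hypothetical parameter).
    """
    scale = (resolution - 1) / extent

    def to_px(v):
        # Map a coordinate to a pixel index, clamped to the canvas.
        return min(resolution - 1, max(0, round(v * scale)))

    left, right = to_px(x0), to_px(x0 + width)
    bottom, top = to_px(y0), to_px(y0 + height)
    img = [[0] * resolution for _ in range(resolution)]
    for col in range(left, right + 1):  # bottom and top edges
        img[bottom][col] = 1
        img[top][col] = 1
    for row in range(bottom, top + 1):  # left and right edges
        img[row][left] = 1
        img[row][right] = 1
    return img


# One channel per sentence: here a single rectangle sentence (3, 2, 3, 1, 1).
channels = [rectangle_channel(2, 3, 1, 1, extent=10, resolution=11)]
```

A full visual representation would stack one such channel for every sentence in the description, in the order the sentences appear.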

4 Dynamic Spatial Memory Network

We propose Dynamic Spatial Memory Network (DSMN), a novel DNN designed for QA with geometric reasoning. What differentiates DSMN from other QA DNNs is that it forms an internal visual representation of the input. It then uses a spatial memory to reason over this visual representation.

A DSMN can be divided into five modules: the input module, visual representation module, question module, spatial memory module, and answer module. The input module generates an embedding for each sentence in the description. The visual representation module uses these embeddings to produce an intermediate visual representation for each sentence. In parallel, the question module produces an embedding for the question. The spatial memory module then goes over the question embedding, the sentence embeddings, and the visual representation multiple times to update the spatial memory. Finally, the answer module uses the spatial memory to output the answer. Fig. 3 illustrates the overall architecture of DSMN.

Input Module: This module produces an embedding for each sentence in the description. It is therefore customized based on how the descriptions are provided in a dataset. Since the descriptions are in words for FloorPlanQA, a position encoding (PE) layer is used to produce the initial sentence embeddings. This is done to ensure a fair comparison with DMN+ (Xiong et al., 2016) and MemN2N (Sukhbaatar et al., 2015), which also use a PE layer. A PE layer combines the word embeddings to encode the position of words in a sentence (please see Sukhbaatar et al. (2015) for more information). For ShapeIntersection, the description is given as a sequence of vectors. Therefore, two FC layers (with ReLU in between) are used to obtain the initial sentence embeddings.
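For readers unfamiliar with the PE layer, the sketch below implements the position-encoding weights as defined by Sukhbaatar et al. (2015), l_kj = (1 - j/J) - (k/d)(1 - 2j/J), where j is the 1-based word position, J the sentence length, k the 1-based embedding dimension, and d the embedding size. The helper names and the use of plain Python lists are assumptions; a real model would apply this to learned word embeddings inside a deep learning framework.

```python
def position_encoding(num_words, dim):
    """Weights l[j][k] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based j, k."""
    J, d = num_words, dim
    return [[(1 - j / J) - (k / d) * (1 - 2 * j / J)
             for k in range(1, d + 1)]
            for j in range(1, J + 1)]


def encode_sentence(word_vectors):
    """Sum the word vectors after element-wise weighting by position."""
    J, d = len(word_vectors), len(word_vectors[0])
    weights = position_encoding(J, d)
    return [sum(weights[j][k] * word_vectors[j][k] for j in range(J))
            for k in range(d)]


# Two toy 2-dimensional "word embeddings" in both possible orders.
forward = encode_sentence([[1.0, 0.0], [0.0, 1.0]])
reverse = encode_sentence([[0.0, 1.0], [1.0, 0.0]])
```

Unlike a plain bag-of-words sum, the two word orders produce different sentence embeddings, which is what allows the layer to encode word position.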

These initial sentence embeddings are then fed into a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) to propagate the information across sentences. Let $\overrightarrow{s_i}$ and $\overleftarrow{s_i}$ be the respective outputs of the forward and backward GRU at the $i$-th step. Then, the final sentence embedding for the $i$-th sentence is given by $s_i = \overrightarrow{s_i} + \overleftarrow{s_i}$.

Question Module: This module produces an embedding for the question. It is also customized to the dataset. For FloorPlanQA, the embeddings of the words in the question are fed to a GRU, and the final hidden state of the GRU is used as the question embedding. For ShapeIntersection, the question is always fixed, so we use an all-zero vector as the question embedding.

Visual Representation Module: This module generates a visual representation for each sentence in the description. It consists of two sub-components: an attention network and an encoder-decoder network. The attention network gathers information from previous sentences that is important to produce the visual representation for the current sentence. For example, suppose the current sentence describes the location of an object with respect to a room. Then in order to infer the location of the object with respect to the house, one needs the location of the room with respect to the house, which is described in some previous sentence.

The encoder-decoder network encodes the visual information gathered by the attention network, combines it with the current sentence embedding, and decodes the visual representation of the current sentence. An encoder ($En(\cdot)$) takes an image as input and produces an embedding, while a decoder ($De(\cdot)$) takes an embedding as input and produces an image. An encoder is composed of a series of convolution layers and a decoder is composed of a series of deconvolution layers.

Suppose we are currently processing the sentence $s_t$. This means we have already processed the sentences $s_1, s_2, \ldots, s_{t-1}$ and produced the corresponding visual representations $S_1, S_2, \ldots, S_{t-1}$. We also add $s_0$ and $S_0$, which are all-zero vectors to represent the null sentence. The attention network produces a scalar attention weight $a_i$ for the $i$-th sentence which is given by $a_i = \mathrm{Softmax}(w_s^T z_i + b_s)$ where $z_i = [\,|s_i - s_t|;\; s_i \circ s_t\,]$. Here, $w_s$ is a vector, $b_s$ is a scalar, $\circ$ represents element-wise multiplication, $|\cdot|$ represents element-wise absolute value, and $[v_1; v_2]$ represents the concatenation of vectors $v_1$ and $v_2$.

The gathered visual information is $\bar{S}_t = \sum_{i=0}^{t-1} a_i S_i$. It is fed into the encoder-decoder network. The visual representation for $s_t$ is given by $S_t = De_s([s_t;\; En_s(\bar{S}_t)])$. The parameters of $En_s(\cdot)$, $De_s(\cdot)$, $w_s$, and $b_s$ are shared across multiple iterations.
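The attention step described above (one scalar weight per earlier sentence, computed from element-wise absolute differences and products, followed by a weighted sum of the earlier visual representations) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the softmax normalisation over earlier sentences, the parameter names `w` and `b`, and the list-of-lists image format are assumptions, and the convolutional encoder and decoder are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_features(s_i, s_t):
    """Concatenate |s_i - s_t| and the element-wise product s_i * s_t."""
    return ([abs(a - b) for a, b in zip(s_i, s_t)]
            + [a * b for a, b in zip(s_i, s_t)])


def gather_visual_context(prev_sentences, prev_images, current, w, b):
    """Weight earlier visual representations by their relevance to `current`.

    prev_sentences[0] / prev_images[0] are the all-zero null entries.
    Returns the attention weights and the weighted sum of earlier images.
    """
    scores = [sum(wk * zk for wk, zk in zip(w, attention_features(s, current))) + b
              for s in prev_sentences]
    weights = softmax(scores)
    rows, cols = len(prev_images[0]), len(prev_images[0][0])
    context = [[sum(a * img[r][c] for a, img in zip(weights, prev_images))
                for c in range(cols)]
               for r in range(rows)]
    return weights, context


# Null sentence plus one earlier sentence identical to the current one.
sentences = [[0.0, 0.0], [1.0, 0.0]]
images = [[[0, 0], [0, 0]], [[1, 0], [0, 0]]]
weights, context = gather_visual_context(
    sentences, images, current=[1.0, 0.0], w=[0.0, 0.0, 5.0, 5.0], b=0.0)
```

In this toy setup the weight vector rewards overlap with the current sentence, so almost all attention goes to the matching earlier sentence and the gathered context is close to its image. In the model, the gathered context would then pass through the encoder-decoder network together with the current sentence embedding.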

In the proposed model, we make the simplifying assumption that the visual representation of the current sentence does not depend on future sentences. In other words, it can be completely determined from the previous sentences in the description. Both FloorPlanQA and ShapeIntersection satisfy this assumption.

Spatial Memory Module: This module gathers relevant information from the description and updates memory accordingly. Similar to DMN+ and MemN2N, it collects information and updates memory multiple times to perform transitive reasoning. One iteration of information collection and memory update is referred to as a ‘hop’.

The memory consists of two components: a 2D spatial memory and a tag vector. The 2D spatial memory can be thought of as a visual scratch pad on which the network ‘sketches’ out the visual information. The tag vector is meant to represent what is ‘sketched’ on the 2D spatial memory. For example, the network can sketch the location of room 1 on its 2D spatial memory, and store the fact that it has sketched room 1 in the tag vector.

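As mentioned earlier, each step of the spatial memory module involves gathering relevant information and updating the memory. Suppose we are in step $t$. Let $M^{(t-1)}$ represent the 2D spatial memory and $m^{(t-1)}$ represent the tag vector after step $t-1$. The network gathers the relevant information by calculating the attention value for each sentence based on the question and the current memory. For sentence $s_i$ (with visual representation $S_i$), the scalar attention value $g_i^{(t)}$ is equal to $\mathrm{Softmax}(w_y^T p_i^{(t)} + b_y)$, where, with $q$ denoting the question embedding, $p_i^{(t)}$ is given as

$$p_i^{(t)} = \big[\, |m^{(t-1)} - s_i|;\; m^{(t-1)} \circ s_i;\; |q - s_i|;\; q \circ s_i;\; En_{p_1}^{(t)}(|M^{(t-1)} - S_i|);\; En_{p_2}^{(t)}(M^{(t-1)} \circ S_i) \,\big] \tag{1}$$

$M^{(0)}$ and $m^{(0)}$ represent initial blank memory, and their elements are all zero. Then, the gathered information is represented as a context tag vector, $c^{(t)} = \mathrm{AttGRU}(g_i^{(t)} s_i)$, and a 2D context, $C^{(t)} = \sum_{i} g_i^{(t)} S_i$. Please refer to Xiong et al. (2016) for information about AttGRU(.). Finally, we use the 2D context and context tag vector to update the memory as follows:

$$m^{(t)} = \mathrm{ReLU}\big(W_m^{(t)} [\, m^{(t-1)};\; q;\; c^{(t)};\; En_c^{(t)}(C^{(t)}) \,] + b_m^{(t)}\big) \tag{2}$$

$$M^{(t)} = De_m^{(t)}\big([\, m^{(t)};\; En_m^{(t)}(M^{(t-1)}) \,]\big) \tag{3}$$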

Figure 3: The architecture of the proposed Dynamic Spatial Memory Network (DSMN).

Answer Module: This module uses the final memory and question embedding to generate the output. The feature vector $f$ used for predicting the answer is given by

$$f = [\, En_f(M^{(T)});\; m^{(T)};\; q \,] \tag{4}$$

where $M^{(T)}$ and $m^{(T)}$ represent the final memory and $q$ is the question embedding.

To obtain the output, an FC layer is applied to $f$ in case of regression, while the FC layer is followed by softmax in case of classification. To keep DSMN similar to DMN+, we apply a dropout layer on the sentence encodings ($s_i$) and $f$.

4.1 DSMN as a strict generalization of DMN

DSMN is a strict generalization of DMN+. If we remove the visual representation of the input along with the 2D spatial memory, and just use vector representations with memory tags, then DSMN reduces to DMN+. This ensures that the comparison with DMN+ is fair.

4.2 DSMN with or without intermediate visual supervision

As described in previous sections, a DSMN forms an intermediate visual representation of the input. Therefore, if we have a ‘ground-truth’ visual representation for the training data, we could use it to train our network better. This leads to two different ways of training a DSMN, one with intermediate visual supervision and one without it. Without intermediate visual supervision, we train the network in an end-to-end fashion using a loss that compares the predicted answer with the ground truth. With intermediate visual supervision, we train our network using an additional visual representation loss that measures how close the generated visual representation is to the ground-truth representation. Thus, the loss used for training with intermediate supervision is a weighted combination of the answer loss and the visual representation loss, where the weight is a hyperparameter which can be tuned for each dataset. Note that in neither case do we need any visual input once the network is trained. During testing, the only input to the network is the description and question.

Also note that we can provide intermediate visual supervision to DSMN even when the visual representations for only a portion of samples in the training data are available. This can be useful when obtaining visual representation is expensive and time-consuming.
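Training with visual supervision on only part of the data can be sketched as a masked loss. This is a minimal illustration under stated assumptions: the additive form `answer + lam * visual`, the per-sample averaging, and the names `combined_loss` and `lam` are choices made for this example and are not taken from the paper.

```python
def combined_loss(answer_losses, visual_losses, lam):
    """Combine per-sample answer losses with optional visual losses.

    visual_losses[i] is None when sample i has no ground-truth visual
    representation; such samples contribute only through the answer loss.
    Assumption: the two terms are combined additively with weight `lam`.
    """
    answer_term = sum(answer_losses) / len(answer_losses)
    supervised = [v for v in visual_losses if v is not None]
    visual_term = sum(supervised) / len(supervised) if supervised else 0.0
    return answer_term + lam * visual_term


# A batch of two samples where only the second has visual supervision.
partial = combined_loss([1.0, 3.0], [None, 4.0], lam=0.5)
# A batch with no visual supervision falls back to the answer loss alone.
unsupervised = combined_loss([1.0, 3.0], [None, None], lam=0.5)
```

Because unsupervised samples simply drop out of the visual term, the same training loop covers the fully supervised, partially supervised, and unsupervised settings.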

5 Experiments

Baselines: LSTM (Hochreiter and Schmidhuber, 1997) is a popular neural network for sequence processing tasks. We use two versions of LSTM-based baselines. LSTM-1 is a common version that is used as a baseline for textual QA (Sukhbaatar et al., 2015; Graves et al., 2016). In LSTM-1, we concatenate all the sentences and the question to a single string. For FloorPlanQA, we do word embedding look-up, while for ShapeIntersection, we project each real number into higher dimension via a series of FC layers. The sequence of vectors is fed into an LSTM. The final output vector of the LSTM is then used for prediction.

We develop another version of LSTM that we call LSTM-2, in which the question is concatenated to the description. We use a two-level hierarchy to embed the description. We first extract an embedding for each sentence. For FloorPlanQA, we use an LSTM to get the sentence embeddings, and for ShapeIntersection, we use a series of FC layers. We then feed the sentence embeddings into an LSTM, whose output is used for prediction.

Further, we compare our model to DMN+ (Xiong et al., 2016) and MemN2N (Sukhbaatar et al., 2015), which achieved state-of-the-art results on bAbI (Weston et al., 2015). In particular, we compare the 3-hop versions of DSMN, DMN+, and MemN2N.

MODEL     FloorPlanQA (accuracy in %)     ShapeIntersection (rmse)
LSTM-1    41.36                           3.28
LSTM-2    50.69                           2.99
MemN2N    45.92                           3.51
DMN+      60.29                           2.98
DSMN      68.01                           2.84
DSMN*     97.73                           2.14

(a) The test set performance of different models on FloorPlanQA and ShapeIntersection. DSMN* refers to the model with intermediate supervision.

MODEL (variant of $f$ in Eqn. 4)     Accuracy (%)
DSMN                                 67.65
DSMN                                 43.90
DSMN                                 68.12
DSMN*                                97.24
DSMN*                                95.17
DSMN*                                98.08

(b) The validation set performance for the ablation study on the usefulness of the tag ($m^{(T)}$) and the 2D spatial memory ($M^{(T)}$) in the answer feature vector $f$.

MODEL           Accuracy (%)
1-Hop DSMN      63.32
2-Hop DSMN      65.59
3-Hop DSMN      68.12
1-Hop DSMN*     90.09
2-Hop DSMN*     97.45
3-Hop DSMN*     98.08

(c) The validation set performance for the ablation study on variation in performance with hops.

Table 2: Experimental results showing comparison with baselines, and ablation study of DSMN.

Figure 4: Performance of DSMN* with varying percentage of intermediate visual supervision. (a) Test set rmse on ShapeIntersection. (b) Test set accuracy on FloorPlanQA.

Figure 5: Attention values on each sentence during different memory ‘hops’ for a sample from FloorPlanQA. Darker color indicates more attention. To answer, one needs the location of room 1’s door and the house door. To infer the location of room 1’s door, DSMN* directly jumps to sent. 3. Since DMN+ does not form a visual representation, it tries to infer the location of room 1’s door w.r.t the house by finding the location of the room’s door w.r.t the room (sent. 3) and the location of the room w.r.t the house (sent. 2). Both DSMN* and DMN+ use one hop to infer the location of the house door (sent. 1).

Training Details: We used ADAM (Kingma and Ba, 2014) to train all models, and the learning rate for each model is tuned for each dataset. We tune the embedding size and regularization weight for each model and dataset pair separately. For reproducibility, the best-tuned hyperparameter values are reported in the appendix. As reported by Sukhbaatar et al. (2015), Kumar et al. (2016) and Henaff et al. (2016), we also observe that the results of memory networks are unstable across multiple runs. Therefore, for each hyperparameter choice, we run all the models 10 times and select the run with the best performance on the validation set. For FloorPlanQA, all models are trained up to a maximum of 1600 epochs, with early stopping after 80 epochs if the validation accuracy did not increase. The maximum number of epochs for ShapeIntersection is 800, with early stopping after 80 epochs. Additionally, we modify the input module and question module of DMN+ and MemN2N to be the same as ours for the ShapeIntersection dataset.
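The selection protocol above (early stopping with a patience of 80 epochs, then keeping the best of several runs by validation performance) can be sketched as follows. The function names and the exact stopping rule (stop once `patience` epochs have passed without a new best validation accuracy) are assumptions about details the text does not spell out, and the toy accuracy curves are invented inputs used only to exercise the code.

```python
def early_stopping_result(val_accuracies, patience=80):
    """Return (best_epoch, best_accuracy, last_epoch_run) for one run."""
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            return best_epoch, best_acc, epoch
    return best_epoch, best_acc, len(val_accuracies) - 1


def select_best_run(runs, patience=80):
    """Pick the run whose early-stopped validation accuracy is highest."""
    results = [early_stopping_result(run, patience) for run in runs]
    best = max(range(len(runs)), key=lambda i: results[i][1])
    return best, results[best]


# Toy validation curves with a short patience so the effect is visible.
toy_runs = [
    [0.50, 0.60, 0.55, 0.58, 0.90],  # stops before reaching the late 0.90
    [0.70, 0.65],
]
best_index, best_result = select_best_run(toy_runs, patience=2)
```

Note that the first toy run is stopped before its late improvement, which shows why the patience value matters when results are unstable across runs.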

For MemN2N, we use the publicly available implementation and train it exactly as all other models (same optimizer, total epochs, and early stopping criteria) for fairness. While the reported best result for MemN2N is on the version with position encoding, linear start training, and random injection of time index noise (Sukhbaatar et al., 2015), the version we use has only position encoding. Note that the comparison is still meaningful because linear start training and time index noise are not used in DMN+ (and, as a result, not in our proposed DSMN either).

Results: The results for FloorPlanQA and ShapeIntersection are summarized in Table 1(a). For brevity, we will refer to the DSMN model trained without intermediate visual supervision as DSMN, and the one with intermediate visual supervision as DSMN*. We see that DSMN (i.e., the one without intermediate supervision) outperforms DMN+, MemN2N and the LSTM baselines on both datasets. However, we consider DSMN to be only slightly better than DMN+ because both are observed to be unstable across multiple runs, and so the gap between the two has a large variance. Finally, DSMN* outperforms all other approaches by a large margin on both datasets, which demonstrates the utility of visual supervision in the proposed tasks. While the variation can be significant across runs, if we run each model 10 times and choose the best run, we observe consistent results. We visualized the intermediate visual representations, but when no visual supervision is provided, they were not interpretable (sometimes they looked like random noise, sometimes blank). When visual supervision is provided, the intermediate visual representation is well-formed and similar to the ground truth.

We further investigate how DSMN* performs when intermediate visual supervision is available for only a portion of training samples. As shown in Fig. 4, DSMN* outperforms DMN+ by a large margin, even when intermediate visual supervision is provided for only a small fraction of the training samples. This can be useful when obtaining visual representations is expensive and time-consuming. One possible explanation for why even a small amount of visual supervision helps so much is that it constrains the high-dimensional space of possible intermediate visual representations. With limited data and no explicit supervision, automatically learning these high-dimensional representations can be difficult.
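One way to picture training with partial visual supervision is a per-sample mask that switches the visual loss on only for the supervised subset. The sketch below is an assumption for illustration: this section does not state how the answer loss and the visual loss are combined, so the additive form, the `visual_weight` parameter, and the function names are all hypothetical.

```python
import random
from typing import List, Sequence


def choose_supervised(num_samples: int, fraction: float, seed: int = 0) -> List[bool]:
    """Mark a random `fraction` of the training samples as having
    intermediate visual supervision (True) and the rest as not (False)."""
    rng = random.Random(seed)
    k = round(num_samples * fraction)
    chosen = set(rng.sample(range(num_samples), k))
    return [i in chosen for i in range(num_samples)]


def combined_loss(
    answer_losses: Sequence[float],
    visual_losses: Sequence[float],
    supervised: Sequence[bool],
    visual_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Mean over samples of: answer loss, plus the visual loss only where
    ground-truth visual representations are available."""
    total = 0.0
    for a, v, s in zip(answer_losses, visual_losses, supervised):
        total += a + (visual_weight * v if s else 0.0)
    return total / len(answer_losses)
```

Samples outside the supervised subset still contribute through the answer loss, so the model is trained on all of the data while the visual representation is anchored by only a few examples.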

Additionally, we performed an ablation study (see Table 1(b)) on the usefulness of the final memory tag vector and the 2D spatial memory in the answer feature vector (see Eqn. 4). We removed each of them one at a time and retrained (with hyperparameter tuning) the DSMN and DSMN* models. Note that they are removed only from the final feature vector, and the two are still coupled. The model with both the tag vector and the 2D spatial memory performs slightly better than the model with only the tag vector. Also, as expected, the model with only the 2D spatial memory performs much better for DSMN* than for DSMN because of the intermediate supervision.

Further, Table 1(c) shows the effect of varying the number of memory ‘hops’ for DSMN and DSMN* on FloorPlanQA. The performance of both DSMN and DSMN* increases with the number of ‘hops’. Note that even the 1-hop DSMN* performs well (better than the baselines). Also, note that the difference in performance between 2-hop DSMN* and 3-hop DSMN* is small. A possible explanation for why DSMN* performs well even with fewer memory ‘hops’ is that DSMN* completes some ‘hops of reasoning’ in the visual representation module itself. Suppose one needs to find the location of an object placed in a room, w.r.t. the house. To do so, one first needs to find the location of the room w.r.t. the house, and then the location of the object w.r.t. the room. However, if one has already ‘sketched’ out the location of the object in the house, one can directly fetch it. It is while sketching the object’s location that one completes a ‘hop of reasoning’. For a sample from FloorPlanQA, we visualize the attention maps in the memory module of 3-hop DMN+ and 3-hop DSMN* in Fig. 5. To infer the location of room 1’s door, DSMN* directly fetches sentence 3, while DMN+ tries to do so by fetching two sentences (one for the room’s door location w.r.t. the room and one for the room’s location w.r.t. the house).
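The ‘hop of reasoning’ argument can be made concrete with a toy example. The sketch below assumes locations are integer grid offsets and that each entity is described relative to one parent; these data structures and function names are hypothetical and are not how DSMN represents its memory. It only contrasts chaining relative descriptions at query time with looking up a position that was resolved while ‘sketching’.

```python
from typing import Dict, Tuple

Offset = Tuple[int, int]


def locate_by_chaining(
    obj: str,
    parent_of: Dict[str, str],
    offset_in_parent: Dict[str, Offset],
) -> Tuple[Offset, int]:
    """Resolve obj's position w.r.t. the house by following its chain of
    'w.r.t.' relations. Returns (position, number of lookups needed)."""
    x, y = 0, 0
    hops = 0
    node = obj
    while node in parent_of:  # the house itself has no parent
        dx, dy = offset_in_parent[node]
        x, y = x + dx, y + dy
        node = parent_of[node]
        hops += 1
    return (x, y), hops


def sketch(
    parent_of: Dict[str, str],
    offset_in_parent: Dict[str, Offset],
) -> Dict[str, Offset]:
    """'Sketch' every entity into house coordinates once, so that later
    queries need a single lookup instead of a chain."""
    return {
        name: locate_by_chaining(name, parent_of, offset_in_parent)[0]
        for name in parent_of
    }


# Toy description: room 1 sits at (4, 0) in the house; its door is at (1, 2) in the room.
parent_of = {"room 1": "house", "door 1": "room 1"}
offset_in_parent = {"room 1": (4, 0), "door 1": (1, 2)}

position, hops = locate_by_chaining("door 1", parent_of, offset_in_parent)
canvas = sketch(parent_of, offset_in_parent)
```

Here `locate_by_chaining` needs two lookups for the door, whereas `canvas["door 1"]` returns the same position in one, which mirrors the difference in attention patterns between DMN+ and DSMN* described above.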

Conclusion: We have investigated how to use DNNs for modeling visual thinking. We have introduced two synthetic QA datasets, FloorPlanQA and ShapeIntersection, that test a system’s ability to think visually. We have developed DSMN, a novel DNN that reasons in the visual space for answering questions. Experimental results have demonstrated the effectiveness of DSMN for geometric reasoning on synthetic data.

Acknowledgements: This work is partially supported by the National Science Foundation under Grant No. 1633157.

Appendix A

Model     Embedding size    Regularization weight
LSTM-1    128               1e-4
LSTM-2    512               1e-5
MemN2N    64                NA
DMN+      64                1e-3
DSMN      32                1e-3
DSMN*     32                1e-4
Table 3: The values of the tuned hyperparameters for all the models on FloorPlanQA. The MemN2N model (Sukhbaatar et al., 2015) does not use regularization, so we tuned only the embedding size.

Model     Embedding size    Regularization weight
LSTM-1    2048              0.01
LSTM-2    512               0.1
MemN2N    512               NA
DMN+      128               0.1
DSMN      32                0.1
DSMN*     64                0.01
Table 4: The values of the tuned hyperparameters for all the models on ShapeIntersection. The MemN2N model (Sukhbaatar et al., 2015) does not use regularization, so we tuned only the embedding size.