SARN: Relational Reasoning through Sequential Attention

by   Jinwon An, et al.
Seoul National University

This paper proposes an attention-augmented relational network called SARN (Sequential Attention Relational Network) that carries out relational reasoning by extracting a reference object and pairing it efficiently with the other objects. SARN greatly reduces the computational and memory requirements of the relational network, which computes all object pairs. It also achieves higher accuracy than other models on the Sort-of-CLEVR dataset, especially on relational questions.




1 Introduction

Relational reasoning is one of the fundamental building blocks of all kinds of human cognitive activity [2]. Representing relations and reasoning over them is a challenging problem [4, 1], and many graph-based approaches have tried to solve it [6, 3]. These studies focused primarily on problems where a sparse matrix represents the relationships that form a network. However, relational reasoning based on a network structure requires clearly defined entities and relations, which is mostly not the case in the real world.

We view relational reasoning as a series of decision processes. Consider a general setting where there are many objects in a scene and information about certain relationships has to be inferred from a given question. The relational reasoning procedure begins with identifying the reference object based on the question. For example, given an image and the question "What is the closest object to object A?", we first identify object A, the reference object, in the image. After the reference object is identified, it is compared with the other objects to determine the relationship between each pair. The distance from object A to each other object is then calculated and sorted to find the nearest object. We do not need the relationship between object B and object C to answer the question above. In other words, we focus our 'attention' only on the relevant relationships by filtering out relationships that do not include the reference object.

Two previous studies on relational neural networks gave intuition to our study: [5] and [7]. [5] proposed the relational network (RN), which computes relational information by pairing every object representation with every other one and passing each pair through a relational module. However, because all object pairs are put through the relational module, the computational complexity is $O(n^2)$, where $n$ is the number of objects.

[7] used sequentially stacked attention layers to focus on the areas of an image that are relevant to answering a given question. The attention maps not only enhance performance but also explain how the model views the image when reasoning. Although this approach incorporates a sequential reasoning process, it does not use an explicit relational reasoning module.

In this work, we propose an efficient relational reasoning algorithm that sequentially processes information using attention. The reference object is found using soft-attention and is then paired with the other objects. Next, the relevant relationship between each object and the reference object is extracted. Because only object pairs that include the reference object are formed, the computational complexity is reduced to $O(n)$. Attention maps and relational module activation maps show that the results are much more interpretable, selectively showing high activation values in the areas of the image that are relevant.
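To make the saving concrete, the following minimal sketch counts how many pairs are passed through the relational module when all object pairs are formed versus when only pairs containing the single reference object are formed. The feature-map side length `d` is an illustrative assumption, not a value taken from the paper.

```python
def rn_pair_count(n: int) -> int:
    """All object pairs, as in the relational network of [5]: n * n."""
    return n * n


def sarn_pair_count(n: int) -> int:
    """Only pairs that include the one reference object: n."""
    return n


d = 5          # hypothetical feature-map side length (assumption)
n = d * d      # each feature-map pixel is treated as one object
print(rn_pair_count(n), sarn_pair_count(n))
```

The gap widens quickly with resolution: doubling the side length of the feature map multiplies the all-pairs count by 16 but the reference-pair count only by 4.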

2 Framework and formulation

We use the same notation as in [5]: a convolutional neural network for pixel-wise object representation using feature maps (with coordinate vectors attached), $g_\theta$ for the relational module, and $f_\phi$ for processing the aggregated relational information. The additional attention module that we propose for representing the reference object is $h_\psi$. When we use the word object, we refer to a pixel in the feature map, except for the reference object, which is a weighted sum of the pixels.

First, we extract the reference object using soft-attention. The attention module takes the feature map and question embedding as input:

$$a_i = \mathrm{softmax}_i\big(h_\psi(o_i, q)\big)$$

where $o_i$ is an object and $q$ is the question embedding. It outputs a softmax attention across the objects that locates the reference object. The reference object is represented as a weighted sum of the objects:

$$o_{ref} = \sum_i a_i \, o_i$$
We use soft-attention instead of hard-attention because a pixel of the feature map does not correspond exactly to one object. In other words, the receptive field of a pixel in the feature map may not contain an object entirely; the object could be distributed among nearby pixels. Selecting only one pixel could leave the object representation incomplete. A soft-attention weighted sum can adequately represent the reference object even in these situations. We check this idea in Section 3.4 by considering various image resolutions.
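The difference between the two choices can be illustrated with a toy case in which one object straddles two feature-map pixels. This is only a sketch: the object vectors and attention weights below are made-up values, and the helper names are hypothetical.

```python
def soft_reference(weights, objects):
    """Attention-weighted sum of the object vectors (soft attention)."""
    dim = len(objects[0])
    return [sum(w * o[k] for w, o in zip(weights, objects)) for k in range(dim)]


def hard_reference(weights, objects):
    """Keep only the single highest-weighted object vector (hard attention)."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return list(objects[best])


# Hypothetical case: one object straddles pixels 0 and 1; pixel 2 is unrelated.
objects = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
weights = [0.6, 0.4, 0.0]

soft = soft_reference(weights, objects)  # blends both halves of the object
hard = hard_reference(weights, objects)  # keeps pixel 0 only, drops pixel 1
print(soft, hard)
```

In this toy case the hard selection discards everything carried by pixel 1, while the weighted sum keeps a contribution from both pixels.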

Next, we pair this reference object representation with each of the objects by concatenating it channel-wise. As in [5], the question embedding is concatenated to each pair, which is then fed to the relational module $g_\theta$:

$$\mathrm{SARN}(O, q) = f_\phi\Big(\sum_i g_\theta(o_{ref}, o_i, q)\Big)$$

Figure 1 shows the overall architecture of our proposed model.

Figure 1: Model architecture overview
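The pipeline described in this section (attention over objects, weighted-sum reference object, pairing with every object, aggregation) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the CNN is replaced by random object vectors, and all sizes (`obj_dim`, `hidden`, `n_answers`) and helper names are hypothetical.

```python
import math
import random

random.seed(0)


def make_mlp(sizes):
    """Random (untrained) weights for an MLP; sizes = [in, hidden, ..., out]."""
    return [([[random.gauss(0.0, 0.1) for _ in range(m)] for _ in range(k)], [0.0] * k)
            for m, k in zip(sizes[:-1], sizes[1:])]


def mlp(layers, x):
    """Apply the MLP with ReLU between layers and no activation on the last."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sarn_forward(objects, q, h, g, f):
    # 1) Attention module: one score per object, softmax across objects.
    attention = softmax([mlp(h, o + q)[0] for o in objects])
    # 2) Reference object: attention-weighted sum of the objects.
    dim = len(objects[0])
    o_ref = [sum(a * o[k] for a, o in zip(attention, objects)) for k in range(dim)]
    # 3) Relational module on the n pairs (reference, object, question).
    g_out = [mlp(g, o_ref + o + q) for o in objects]
    # 4) Sum over pairs, then process the aggregate.
    aggregated = [sum(column) for column in zip(*g_out)]
    return attention, mlp(f, aggregated)


# Hypothetical sizes, chosen only to keep the example small.
n, obj_dim, q_dim, hidden, n_answers = 9, 4, 14, 8, 10
objects = [[random.random() for _ in range(obj_dim)] for _ in range(n)]
q = [0.0] * q_dim
q[2], q[9] = 1.0, 1.0  # two concatenated one-hot parts
h = make_mlp([obj_dim + q_dim, hidden, hidden, 1])
g = make_mlp([2 * obj_dim + q_dim, hidden, hidden, hidden])
f = make_mlp([hidden, hidden, hidden, n_answers])
attention, logits = sarn_forward(objects, q, h, g, f)
```

Note that the relational module is called once per object, so the loop in step 3 runs `n` times rather than `n * n` times.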

3 Experiments

3.1 Dataset

In our experiments, we used a modified version of the Sort-of-CLEVR dataset from [5]. Each image has 6 objects, each of which is randomly assigned to be a square or a circle. 6 different colors are used to identify the objects. Given a reference object identified by one of the 6 colors, 3 non-relational and 5 relational questions are generated. The non-relational questions are the same as in [5]: (1) horizontal position, (2) vertical position, (3) shape. The relational questions of [5] are (1) shape of the nearest object, (2) shape of the furthest object, (3) number of objects of the same shape. We additionally include (4) color of the nearest object and (5) color of the furthest object. Sample questions are shown in Figure . A total of 9800 images were generated for training and 200 were left for testing. Each image has 48 questions (18 non-relational questions, 30 relational questions). Question vectors are represented by concatenating two one-hot encoding vectors, one for the color and the other for the question type.
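The question encoding and the per-image question count can be sketched as below. The question-type labels reuse the column names from the appendix tables; the ordering of the two one-hot parts and of the entries within them is an assumption.

```python
NUM_COLORS = 6
NON_RELATIONAL = ["horizontal", "vertical", "shape"]
RELATIONAL = ["cl_sh", "fur_sh", "count", "cl_col", "fur_col"]
QUESTION_TYPES = NON_RELATIONAL + RELATIONAL


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def encode_question(color_idx, question_type):
    """Concatenate a color one-hot with a question-type one-hot."""
    type_idx = QUESTION_TYPES.index(question_type)
    return one_hot(color_idx, NUM_COLORS) + one_hot(type_idx, len(QUESTION_TYPES))


# Every (color, question type) combination for one image.
questions = [encode_question(c, t) for c in range(NUM_COLORS) for t in QUESTION_TYPES]
print(len(questions), len(questions[0]))
```

With 6 colors and 8 question types this yields 6 x 3 = 18 non-relational and 6 x 5 = 30 relational questions per image, each encoded as a 14-dimensional vector.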

3.2 Models and parameters

We ran three different models: the relational network of [5], a baseline model without object pairing, and our proposed model, SARN. Our baseline differs from that of [5], which flattens the CNN feature map and concatenates it with the question embedding. Instead, our baseline takes individual objects, rather than object pairs, as inputs to $g_\theta$, and the aggregated output is then run through $f_\phi$.

The model parameters are the same for each model. CNN: 4 convolutional layers with 32 kernels, ReLU non-linearities, and layer normalization. $h_\psi$, $g_\theta$, and $f_\phi$: three-layer MLPs with 128 hidden units per layer.

Test accuracy is shown in Table 1. SARN shows higher accuracy on both non-relational and relational tasks. For detailed accuracy results for each type of question, see the Appendix.

model      overall   non-rel   rel
SARN       96.73     99.84     94.88
RN         93.56     99.81     89.83
baseline   89.07     97.58     83.97
Table 1: Test accuracy (%)

3.3 Reasoning inspection and interpretability

Since our model runs in a sequential manner, we can examine the attention module and the relational module to verify whether reference objects are correctly retrieved, and important relationships are highlighted.

The attention map produced by $h_\psi$ is shown in Figure 2. It shows the attention weights over the objects. It correctly identifies the region of the reference object according to the question. We did not give only the color embedding vector to $h_\psi$ but used the whole concatenated question embedding vector; $h_\psi$ learned which information to use from it.

Inspecting the output of $g_\theta$ can show whether the pairs most relevant to the question were identified. To obtain an activation value for each object-reference pair, the output of $g_\theta$ is summed across channels. This gives the aggregated amount of activation that each object-reference pair has produced.
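A minimal sketch of this channel sum is given below. It assumes the relational module's outputs are stored as one list of channel values per object-reference pair, in row-major order over the feature map; the function name and the toy values are hypothetical.

```python
def channel_sum_map(g_out, height, width):
    """Sum each pair's output across channels and arrange the sums on the grid."""
    if len(g_out) != height * width:
        raise ValueError("expected one output vector per feature-map pixel")
    sums = [sum(channels) for channels in g_out]
    return [sums[r * width:(r + 1) * width] for r in range(height)]


# Toy 2x2 feature map with 2 channels per pair (made-up values).
g_out = [[1.0, 2.0], [0.0, 0.5], [3.0, 1.0], [0.0, 0.0]]
activation_map = channel_sum_map(g_out, height=2, width=2)
print(activation_map)  # [[3.0, 0.5], [4.0, 0.0]]
```

The resulting grid has the same spatial layout as the feature map, so it can be upscaled and overlaid on the input image for inspection.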

Figure 2(a) shows that $h_\psi$ correctly picks the blue object as the reference object. The channel sum of $g_\theta$ also correctly exhibits high activation values for objects that are near the blue object. When the question asks for the furthest object, as in Figure 2(b), high activation values are found near the red object, which is the furthest from the blue object. For other examples, see the Appendix.

We also checked whether RN uses the proper object pairs to answer the question. We found, however, that object pairs with little significance for the given question have high activation values of $g_\theta$. This indicates a lack of interpretability: the activations cannot be used to verify that the reasoning is sound. See the Appendix for detailed examples.

3.4 Robustness on image size and object sparsity

We tested how robust SARN is to object size and image size, using the same model parameters as in Section 3.2; the results are shown in Table 4. By varying the image size while fixing the object size, we can evaluate how the model deals with sparsity. As the image size gets bigger, more and more objects (pixels in the feature maps) correspond to blank spots. By comparing configurations in which object size and image size are in roughly the same proportion, we can evaluate how the model deals with the granularity of the object representation.

We first tested robustness to sparsity by varying the image size to 64, 75, 128 while fixing the object size to 5. The baseline model and RN have similar performance on non-relational questions across all image sizes. However, they show worse results on relational questions as image size gets bigger. SARN is relatively robust and even shows higher accuracy with bigger image size.

Next, we tested robustness to granularity by changing the image size and object size while keeping their ratio the same (75-5, 64-4, 128-8). Objects in the (128-8) configuration are represented by more pixels than in (64-4). In the case of RN, each object (pixel) then represents only a fraction of the original object in the image. SARN, however, takes a soft-attention weighted representation for the reference object and is thus robust to how many pixels represent an original object in the image. The results reflect this: RN shows lower performance as the image size gets bigger, whereas SARN shows stronger performance, especially on relational questions.

(a) What is the shape of the object closest to the blue object?
(b) What is the shape of the object furthest from the blue object?
Figure 2: Sample attention maps and relational module activation. The first column shows the image. The second column shows the (upscaled) attention map of $h_\psi$ overlaid on the image. The third column shows the output of $g_\theta$ summed across channels and overlaid on the image. The fourth column emphasizes and compares the activation value of $g_\theta$ for each object.

4 Conclusion

We propose an attention-augmented relational network called SARN (Sequential Attention Relational Network) that implements an efficient sequential reasoning process of (1) finding the reference object and (2) extracting relevant relationships between the reference object and the other objects. This greatly reduces the computational and memory requirements of [5], which computes all object pairs. SARN shows higher accuracy than the other models on the modified Sort-of-CLEVR dataset, especially on relational questions. In addition, by inspecting the attention map and the relational module, we can verify that the reasoning process is interpretable.


References

  [1] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
  [2] Charles Kemp and Joshua B Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences, 2008.
  [3] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
  [4] Allen Newell. Physical symbol systems. Cognitive Science, 4(2):135–183, 1980.
  [5] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
  [6] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  [7] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.

5 Appendix

5.1 Test accuracy by question type

model      horizontal   vertical   shape   non-rel
SARN       99.92        99.67      99.92   99.84
RN         99.92        99.67      99.83   99.81
baseline   96.33        96.58      99.83   97.58
Table 2: Test accuracy (%): non-relational questions

model      cl_col   cl_sh   fur_col   fur_sh   count    rel
SARN       90.75    93.92   93.75     96.33    99.67    94.88
RN         86.33    88.42   84.17     90.25    100      89.83
baseline   84.92    88.50   67.83     79.25    99.33    83.97
Table 3: Test accuracy (%): relational questions

5.2 Image resolution robustness

model      questions   64_4     128_8    64_5     128_5
SARN       non-rel     0.9970   0.9999   0.9948   0.9988
           rel         0.8949   0.9440   0.8370   0.8669
           total       0.9345   0.9650   0.8970   0.9163
RN         non-rel     0.9944   0.9981   0.9964   0.9931
           rel         0.8415   0.8207   0.8430   0.7719
           total       0.8989   0.8872   0.9005   0.8555
baseline   non-rel     0.9941   0.9972   0.9933   0.9978
           rel         0.8120   0.8625   0.8163   0.8532
           total       0.8803   0.9130   0.8827   0.9074
Table 4: Test accuracy by configuration (columns: image size_object size)

5.3 Proposed model and channel sum plot

(a) What is the shape of the object closest to the green object?
(b) What is the shape of the object furthest from the green object?
(c) What is the number of objects that have the same shape as the violet object?
(d) What is the color of the object closest to the orange object?
Figure 3: Sample attention maps and relational module activation: additional examples
(a) Question: What is the shape of the object that is furthest from the red object?
(b) Plot for each object pair
Figure 4: Relational module activation of RN. The three figures above show the relational module output summed across channels. The 8 figures below show the pairs with the highest activation values. The number above each figure indicates the object that is paired with the others. The first figure shows the output for the object pairs that include object 0; these pairs had the largest summed activation value, 9.75. However, since the blue object is the furthest from the red object, the relation involving object 23 should have been the most critical to solving the problem.