Over the last few years, deep learning approaches have been highly successful in diverse domains. Sophisticated systems even outperform humans in several tasks such as image recognition and question answering. However, most tasks focus on recognizing superficial patterns in data, and only a few works aim to solve tasks that require reasoning skills beyond pattern recognition or information retrieval. Several recent works used human IQ tests to evaluate the reasoning abilities of machines. In the vision domain, several works utilized Raven’s Progressive Matrices (RPM) as a test bed for measuring the abstract reasoning abilities of neural models [16, 3, 34]. These tasks are challenging even for sophisticated systems because the systems are required to understand the logic of humans.
While following the spirit of existing works that dealt with a variety of reasoning, we focus on spatial reasoning. We utilize human IQ tests known as spatial reasoning tests to explore the spatial understanding of neural models. Spatial reasoning tests require one to mentally visualize and transform objects in 2D or 3D space. For instance, to solve a rotation problem such as the one in Figure 1(b), one should first mentally rotate objects in 3D space, and then determine whether the rotated objects are the same by visually comparing them. Hence, both image recognition and spatial comprehension are required to solve the task.
Of the various tasks in spatial reasoning IQ tests, we use the rotation task and the shape composition task. The rotation task involves finding an object that is different from the 3D-polyomino in the given image, as shown in Figure 1(b). The rotation task involves recognizing visual features in 2D and visualizing rotated features in three dimensions. The shape composition task, similar to solving a Tangram puzzle, involves choosing a set of pieces that would produce the given shape if combined, as shown in Figure 1(c). Hence, the shape composition task evaluates the ability to aggregate spatial information and understand the relative size, edges, and angles from given images.
A well-designed dataset is needed to evaluate the spatial reasoning ability of models. However, spatial reasoning IQ tests are either unavailable due to copyright issues or contain too few test samples. To address this issue, we systematically generate a well-defined dataset. The dataset is intentionally designed so that its figures are simple to recognize but challenging to reason about. We define three and four complexity levels for the rotation task and shape composition task, respectively. In the rotation task, the complexity level is determined by the shapes of objects in images, whereas in the shape composition task it is determined by the number of piece images in each candidate image. Creating different complexity levels makes our experiments extensible. We generated 10,000 problems for each complexity level. There is a total of 70,000 problems consisting of 750,000 images, which is sufficient for evaluating the reasoning ability of neural models. (Footnote 1: We will make our data public after publication.)
Once a model learns a certain reasoning ability, the model should generalize the ability to unseen situations. Thus, we evaluated models in the extrapolation setting, as well as in the neutral setting. Unlike the neutral setting, where the same complexity levels of problems appear in both the training and test sets, the extrapolation setting uses test sets that contain more complex problems. Although it is easy for humans to apply knowledge acquired from solving simple problems to more complex problems, it is challenging for machines to do so.
Finally, we describe a variety of baseline models that are commonly used in the visual reasoning domain. For the shape composition task, we propose a novel model called CNN+GloRe, which is based on the GloRe unit introduced in . CNN+GloRe recognizes each piece image and then combines them in various ways like humans. We provide the experimental results of all the baselines. From the results, we analyze factors that affect the generalization abilities of the models in spatial reasoning tests. Also, we visualize which part of the input image has a strong signal for predicting the answer by a gradient-based approach. This explains the most important question: how do neural models solve spatial reasoning tests? We believe our findings would be valuable insights into understanding a machine and the difference between a machine and human.
2 Spatial Reasoning Tests
Like verbal questions in  and RPMs in [16, 3, 34], spatial reasoning tests have been used and studied for decades in the fields of psychology, education, and career development [11, 8, 10]. To the best of our knowledge, this work is the first to utilize spatial reasoning tests to study recent deep learning based models. Various types of IQ tests differ in their goal. The verbal questions measure an understanding of the meaning of words. RPMs measure the ability to find abstract rules or patterns such as progression, AND, OR and XOR, from the given context images. On the other hand, spatial reasoning tests measure the ability to mentally visualize and transform objects in 2D or 3D spaces.
Spatial reasoning is not new in the vision domain because various tasks such as movement prediction and the room-to-room navigation task implicitly require spatial reasoning [31, 1]. However, these works focused on their main tasks and did not explicitly study spatial reasoning. In contrast, our main tasks are to solve spatial reasoning tests, which means we focus on spatial reasoning itself. Similar to our work, the CLEVR dataset was proposed to study visual reasoning. However, the spatial relationships that CLEVR requires are directly captured in the given 2D images, without the need to obtain hidden information from them. Our aim is to determine whether models can obtain hidden spatial information from superficial visual inputs.
2.1 Rotation
2.1.1 Task Description
In the rotation task, the object in each image is a 3D-polyomino, which is formed by joining blocks edge to edge. A model has to choose, out of four candidate answers, the one whose 3D-polyomino differs from that in the given question. The remaining three candidate answers are made by rotating the polyomino in the question image both horizontally and vertically in 3D space. All edges are at 90 degrees to each other. Figure 1(b) shows an example of the rotation task.
2.1.2 Related Work
In the field of vision systems, learning rotation was used as a tool for other downstream tasks such as image classification. There exist three main approaches for learning rotation-invariant representations in downstream tasks. The first approach involves predicting the degree of rotation [18, 5], but it requires target labels for the degrees of rotation. The second approach is a structural method which involves transforming kernels in CNN layers to obtain rotation-invariant features from images [29, 33]. However, this approach considers only the 2D rotation of 2D images. In our task, there are no labels for the degrees of rotation, and the 3D rotation of objects from 2D images is considered. Hence, the last approach, using the vector distance, is the most suitable for our task. Like [21, 7], we train our model on the vector distance between question and candidate answers.
2.2 Shape Composition
2.2.1 Task Description
The shape composition task involves choosing the correct set of piece images that would produce the original image if combined, as shown in Figure 1(c). Unlike the rotation task, each candidate answer in the shape composition task contains several pieces, ranging from 2 to 5. The correct answer is the candidate whose pieces produce the original image when combined. The remaining three incorrect candidate answers contain pieces that are not part of the original image.
2.2.2 Related Work
Compared to shape composition tasks in the previous works , in our shape composition task, only images are given without any information such as the lengths of sides of objects. The shape composition task is similar to solving a jigsaw puzzle in that all pieces are combined to produce the original image. Methods and skills used for solving jigsaw puzzles can be applied to biology , archaeology , image editing , and learning visual representations [9, 24, 27, 4]. Most existing works divide an original image into a grid of equal-sized squares, and then assemble the square pieces to produce the original image. In , the shape of pieces was converted to a rectangle. On the other hand, we assemble polygons with various shapes.
3 Data Construction and Experimental Settings
In this section, we describe the data generation process for each task. We focus on measuring reasoning ability, and not recognition ability. Hence, we use basic blocks and polygons and create incorrect candidate answers not too different from the question and the correct answer. Each image in our datasets has 224x224 pixels so that the images can be easily viewed by humans. The image size can be reduced for memory efficiency.
A problem in both tasks consists of one question and four candidate answers including the correct answer. We classify problems into three complexity levels for the rotation task and four complexity levels for the shape composition task. The complexity levels are described in the subsections below. Using the complexity levels, we conduct various experiments on generalization. All the experiments are categorized into the neutral and extrapolation settings. In the neutral setting, the training and test sets contain the same complexity levels of problems. In the extrapolation setting, on the other hand, we use test sets with more complex problems. Table 1 describes complexity levels and data statistics for each task.
We denote each experiment by the complexity levels contained in its training set and the complexity levels contained in its test set. For example, one extrapolation experiment uses level 1 and level 2 problems in the training set and level 3 problems in the test set. In the neutral setting, all complexity levels are used in both the training and test sets. Problems of each complexity level have the same probability of being selected.
We evaluate the baseline models on the following two test sets: 1) an in-distribution (in-dist) test set with the same complexity levels as the training set, and 2) an out-of-distribution (out-dist) test set with different complexity levels from the training set. The validation set is used only for tuning the hyper-parameters of each model. (Footnote 2: Previous studies reported the performance of models on only validation and test sets [3, 34]. However, it is not fair to compare performance on the validation set with that on the test set since models can fit to the validation set.)
We use accuracy as the evaluation metric.
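The neutral and extrapolation settings can be illustrated with a small sketch. The names `build_split` and `accuracy`, the representation of problems as integer ids, and the sampling without replacement are illustrative assumptions rather than the actual data pipeline; the sketch only mirrors the rules stated above, namely that every listed complexity level is equally likely to be selected and that accuracy is the fraction of correctly answered problems.

```python
import random


def build_split(problems_by_level, train_levels, test_levels, n_train, n_test, seed=0):
    """Sample (level, problem_id) pairs for a training set and a test set.

    problems_by_level maps a complexity level to a list of problem ids.
    Each level in train_levels / test_levels is equally likely per draw.
    Problems are popped from their pool, so the two sets never overlap.
    """
    rng = random.Random(seed)
    pools = {}
    for level, problems in problems_by_level.items():
        pool = list(problems)
        rng.shuffle(pool)
        pools[level] = pool

    def draw(levels, n):
        levels = sorted(levels)
        picked = []
        for _ in range(n):
            level = rng.choice(levels)
            picked.append((level, pools[level].pop()))
        return picked

    train = draw(train_levels, n_train)
    test = draw(test_levels, n_test)
    return train, test


def accuracy(predictions, answers):
    """Fraction of problems whose predicted candidate index is correct."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Under these assumptions, a neutral experiment passes the same level set for training and testing, while an extrapolation experiment passes, for example, levels 1 and 2 for training and level 3 for testing.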
3.1 Rotation
In this section, we describe how to systematically create a dataset for the rotation task. A polyomino is specified by three components: its edge lengths, its directions, and its rotation angles. The edge length is equal to the number of blocks that form each edge, and each edge can have 3 to 9 blocks. The rotation angles are given as pairs of a vertical angle and a horizontal angle. Angles are sampled from intervals delimited by the multiples of a fixed base angle, and angles too close to those multiples are excluded since images are indistinguishable when such angles are used. We create an object by joining edges one after another, where each new edge is at 90 degrees to the previous edge and takes one of four possible perpendicular directions. Finally, a polyomino is expressed as a triplet of edge lengths, directions, and an angle pair, where the edge lengths and directions are sampled from their respective sets and the vertical and horizontal angles are sampled from the set of angles. The number of edges of an object determines how many edge lengths and directions are sampled.
A question and the three incorrect candidate answers have the same edge lengths and directions, i.e., they show the same polyomino. The question and the correct answer share one of these two components and differ in the other, which results in different polyominoes. The angles of the question image and the four candidate answer images are taken from four different intervals, which implies that two of the five images share the same interval for either the vertical or the horizontal angle.
There are three complexity levels in the rotation task. The complexity levels are determined by the number of edges, and a larger number of edges produces more complex problems. For each level, we randomly generate 10,000 problems from combinations of the triplet components, which results in a total of 30,000 problems. 7K, 1K, and 2K problems are used for the training, validation, and test sets, respectively.
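The triplet representation described above can be sketched as follows. This is a minimal illustration under loudly stated assumptions: the base angle of 90 degrees, the exclusion margin of 15 degrees, the use of one direction per join (so one fewer direction than edges), and the names `sample_angle` and `sample_polyomino` are all hypothetical, since the exact values and notation are not recoverable from the text. Only the edge-length range of 3 to 9 blocks and the four perpendicular directions come from the description.

```python
import random

EDGE_LENGTHS = range(3, 10)   # each edge has 3 to 9 blocks (from the text)
NUM_DIRECTIONS = 4            # four perpendicular directions per join (from the text)
BASE_ANGLE = 90.0             # ASSUMPTION: placeholder for the elided base angle
MARGIN = 15.0                 # ASSUMPTION: placeholder for the elided exclusion margin
NUM_INTERVALS = int(360 // BASE_ANGLE)


def sample_angle(rng, interval):
    """Sample an angle inside one interval, away from multiples of BASE_ANGLE."""
    low = interval * BASE_ANGLE + MARGIN
    high = (interval + 1) * BASE_ANGLE - MARGIN
    return rng.uniform(low, high)


def sample_polyomino(rng, num_edges):
    """Return a (lengths, directions, angles) triplet for one object."""
    lengths = [rng.choice(EDGE_LENGTHS) for _ in range(num_edges)]
    # ASSUMPTION: one direction is chosen for each join between consecutive edges.
    directions = [rng.randrange(NUM_DIRECTIONS) for _ in range(num_edges - 1)]
    vertical = sample_angle(rng, rng.randrange(NUM_INTERVALS))
    horizontal = sample_angle(rng, rng.randrange(NUM_INTERVALS))
    return lengths, directions, (vertical, horizontal)
```

A more complex level would simply call `sample_polyomino` with a larger `num_edges`.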
In the case of humans, it does not take much effort to acquire reasoning skills. Hence, our datasets are smaller than the benchmark datasets used in the computer vision field, but the size of our datasets is sufficient for studying reasoning. An excessive amount of data can cause unintended bias or contain irrelevant features, which makes it difficult to train models.
3.1.1 Experimental Settings
We conduct one experiment in the neutral setting and three experiments in the extrapolation setting for the rotation task.
We use 7K, 1K, 1K problems for the training, validation and in-dist test sets, respectively, in the neutral setting and use additional 1K problems for out-dist test sets in the extrapolation setting. The problems are sampled from the corresponding problem sets of each complexity level. In the extrapolation setting, the same out-dist test set is used.
3.2 Shape Composition
To build the dataset for the shape composition task, we first created original images. We generated initial images coloured in black and linearly cut each initial image twice. We use an initial image as an original image only if its size is larger than 25,000, where the size denotes the number of pixels coloured in black that remain after the linear cuts. Here, each linear cut has the following two requirements: 1) the slope of the cutting line takes one of two fixed values, and 2) the point corresponding to the y-intercept is between 1/4 and 3/4 of the image height.
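The size check for original images can be sketched as a pixel count. The function names, the representation of a cut as a `(slope, intercept, keep)` triple, and the example slopes are assumptions made for illustration, since the two allowed slope values are not recoverable from the text; the 224x224 image size, the 25,000-pixel threshold, and the intercept range come from the description.

```python
SIZE = 224            # image side length in pixels (from the text)
MIN_PIXELS = 25_000   # minimum number of remaining black pixels (from the text)


def intercept_ok(intercept, size=SIZE):
    """The y-intercept must lie between 1/4 and 3/4 of the image height."""
    return size / 4 <= intercept <= 3 * size / 4


def count_remaining(cuts, size=SIZE):
    """Count pixels that survive every cut.

    Each cut is (slope, intercept, keep). A pixel (x, y) survives a cut when
    the truth value of `y >= slope * x + intercept` equals `keep`.
    """
    count = 0
    for y in range(size):
        for x in range(size):
            if all((y >= m * x + b) == keep for m, b, keep in cuts):
                count += 1
    return count


def is_valid_original(cuts, size=SIZE):
    """An image is kept only if enough black pixels remain after cutting."""
    return count_remaining(cuts, size) > MIN_PIXELS
```

For instance, a single horizontal cut through the middle of the image leaves 112 rows of 224 pixels, which is just above the threshold, whereas a second cut lower in the image removes too much.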
We then linearly cut the original image into a number of pieces. When the number of pieces is 2, 3, or 4, the size of each piece is in the range of 3,000 to 30,000, and when the number of pieces is 5, the size of each piece is in the range of 2,000 to 30,000. The cut pieces are then rotated, but never flipped. The set of pieces cut from the original image forms the correct answer. For two of the incorrect candidate answers, the pieces are also taken from the same original image, but one randomly chosen piece is replaced with another piece of similar size. For the remaining incorrect candidate answer, one of the pieces of the correct answer is scaled, which makes it impossible to form the original image.
In the shape composition task, there are four complexity levels. The complexity levels are determined by the number of piece images, and a larger number of pieces produces more complex problems. For each level, we randomly generate 11,000 problems, which results in a total of 44,000 problems. 8K, 2K and 1K problems are used for the training, validation and test sets, respectively.
3.2.1 Experimental Settings
In the shape composition task, we conduct one experiment in the neutral setting as in the rotation task. We also conduct seven experiments in the extrapolation setting, divided into two groups, where the same out-dist test set is used within each group. In the first group, the test set contains complexity level 4. In the second group, the test set contains complexity levels 3 and 4. We use 7K, 1K, 1K and 1K problems for the training, validation, in-dist test and out-dist test sets, respectively.
4 Model Architectures and Training
In this section, we describe the baseline models used in spatial reasoning tasks. All of the baselines are neural networks that are commonly used in the field of vision reasoning or designed to be task-specific.
In both the rotation and shape composition tasks, an image of one question and images of four candidates are given as inputs. Models have to calculate a similarity score for each question and candidate answer pair $(x_q, x_c)$. A softmax function is applied to the scores of the four pairs to obtain the probability of each candidate answer being the correct answer. For optimization, cross-entropy loss is used.
Given the image pair $x_q$ and $x_c$, we first encode this image pair into vector representations using an image recognizer $f$ as follows: $v_q = f(x_q)$ and $v_c = f(x_c)$. Next, we combine $v_q$ and $v_c$, and compute the score of the pair as follows: $s = g(v_q, v_c)$, where $g$ is a reasoning module that captures relationships between question and candidate answer images and then reasons about the correct answer. The structure of the baseline models for the rotation task is determined by $f$ and $g$.
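The scoring objective described above, a softmax over the four pair scores followed by cross-entropy against the index of the correct candidate, can be written out in a few lines. This is a plain-Python sketch for illustration; the function names are hypothetical and the scores are assumed to come from an arbitrary reasoning module.

```python
import math


def softmax(scores):
    """Turn the four pair scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, correct_index):
    """Negative log-probability assigned to the correct candidate."""
    return -math.log(softmax(scores)[correct_index])


def predict(scores):
    """Pick the candidate with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)
```

With four equal scores, every candidate receives probability 0.25 and the loss equals log 4, which is the chance-level reference point for a four-way choice.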
We compare the 4-layer CNN with the deeper CNN ResNet-50  which is one of the most widely used image recognizers. Like CNN+MLP, ResNet+MLP has two MLP layers.
Using the vector distance between two images is one way to solve the rotation task. This approach can be used when the degree of rotation is not provided. Using Siamese networks, the vector distance between the question vector and the candidate vector is calculated by cosine similarity. (Footnote 3: We trained Siamese using the L1 loss, but the performance decreased.) Unlike CNN+MLP and ResNet+MLP, our Siamese model is optimized with binary cross-entropy losses on the four similarity scores of the question and candidate pairs. When evaluating Siamese, we choose the candidate answer with the lowest similarity score as the correct answer. Figure 3(a) illustrates the Siamese model.
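The evaluation rule for the Siamese model, choosing the candidate that is least similar to the question, can be sketched as follows. The vectors stand in for encoder outputs, and the function names are hypothetical; only the use of cosine similarity and the lowest-score rule come from the description.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_odd_one_out(question_vec, candidate_vecs):
    """Return the index of the candidate least similar to the question."""
    sims = [cosine_similarity(question_vec, c) for c in candidate_vecs]
    return min(range(len(sims)), key=sims.__getitem__)
```

The lowest score is chosen because the correct answer in the rotation task is the one candidate showing a different polyomino, whereas the three incorrect candidates are rotated views of the question object.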
4.2 Shape Composition
Unlike the rotation task, in the shape composition task, a candidate image consists of multiple piece images. Thus, we need a model that aggregates piece vectors and represents a candidate as a vector with a fixed length. We compare models with different aggregation functions.
We concatenate piece vectors encoded using CNN, and feed the concatenated vectors to the MLP layers to obtain a fixed sized candidate vector . The number of hidden units in the MLP layer is set to the maximum number of pieces, and we randomly feed each image to each unit.
Max-pooling is a simple function that calculates dimension-wise maximum values of piece vectors. The MLP layers are used to compute a score.
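Dimension-wise max-pooling over piece vectors is simple enough to show directly. The sketch below uses plain lists in place of encoder outputs, and the function name is hypothetical.

```python
def max_pool(piece_vectors):
    """Dimension-wise maximum over a list of equal-length piece vectors."""
    return [max(values) for values in zip(*piece_vectors)]
```

The output length equals the vector dimension regardless of how many pieces are given, and the result does not depend on the order of the pieces, which is why this aggregation can be applied unchanged to candidates with 2 to 5 pieces.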
Following , we adopt Global Reasoning unit (GloRe unit) to create a novel model, CNN+GloRe, that is specified to the shape composition task. The CNN+GloRe is illustrated in Figure 3(b). The GloRe unit was proposed to capture not only local but also global relationships between image regions. The GloRe unit maps image regions in a coordinate space into nodes in an interaction space, then operates weighted graph pooling . We combine the GloRe unit with CNN for our shape composition task, where each piece image is regarded as an image region. The GloRe aggregates piece vectors into a candidate vector as follows:
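In this aggregation, the function applied to the piece vectors is the GloRe unit. The question vector and the candidate vector are then concatenated and fed into the MLP layers. (Footnotes 4 and 5: We implemented GloRe using the code released by the authors, available at https://github.com/facebookresearch/GloRe.)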
In detail, each node in an interaction space is obtained as follows:
where is a learnable parameter and is the index of a piece image of the candidate answer. Unlike in the rotation task, the recognizer for encoding questions and the recognizer for piece images are different. The question and piece vectors are encoded as and , respectively. All the nodes in the graph convolutional layers are as follows:
is an identity matrix and is a trainable adjacency matrix that is randomly initialized. and are trainable weights where is the dimension size of piece vectors. is the representation of the graph convolution output nodes.
Similar to Equation 2, the obtained node representations in the interaction space are mapped to the coordinate space as follows:
where is the -th row of , and is a learnable parameter. Finally, a candidate vector is computed as the element-wise mean of vectors . Figure 3(b) illustrates the graph-based structure of the CNN+GloRe model.
4.3 Hyperparameter Settings
In this section, we summarize all the hyper-parameter settings of the baseline models. We implemented baseline models using PyTorch. (Footnote 6: https://pytorch.org) The source code for reproduction is publicly available at github.com/blind.
4.3.1 Rotation
For the CNN, we used 4 convolutional layers, each with its own number of feature maps. The kernel size is set to 7 for all layers. The dimension of the image vectors is set to 512. We used a 2-layer MLP with a ReLU activation function for CNN+MLP. The same image recognizer is used for question and candidate images in the rotation task. CNN+MLP and Siamese were optimized by an SGD optimizer with an initial learning rate of 0.1, and the batch size was set to 64. ResNet+MLP was optimized by the Adam optimizer with an initial learning rate of 0.0005, and the batch size was set to 16. A learning rate decay of 0.9 at each epoch was used.
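The per-epoch decay can be made concrete with a one-line schedule. The sketch assumes that the decay is multiplicative, i.e., the learning rate at epoch t is the initial rate times 0.9 to the power t; the text does not state which scheduler was used, and the function name is hypothetical.

```python
def learning_rate(initial_lr, epoch, decay=0.9):
    """Learning rate after `epoch` completed epochs of multiplicative decay."""
    return initial_lr * decay ** epoch
```

Under this assumption, the SGD rate of 0.1 falls to 0.081 after two epochs. In PyTorch, the same schedule corresponds to `torch.optim.lr_scheduler.ExponentialLR` with `gamma=0.9`.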
4.3.2 Shape Composition
In the shape composition task, we used the same hyperparameters for CNN, which were used in the rotation task. However, we used two CNNs: one for question images and the other for candidate images. For CNN+GloRe, we used the same GloRe unit used in. The SGD optimizer is used for CNN+GloRe, and the Adam optimizer is used for the other models. A learning rate of 0.1 is used for the SGD optimizer and a learning rate of 0.0005 is used for the Adam optimizer. A batch size of 64 is used for all models.
5 Experimental Results
This section discusses the experimental results and provides analysis. Table 2 and Table 3 show the results in the neutral and extrapolation settings for the rotation task and the shape composition task, respectively. The overall section provides analysis of the experimental results that are consistent across both tasks. In the rotation and shape composition sections, we analyze the experimental results of the baseline models and focus mainly on the differences in their architectures.
Training on complex problems is more effective than training on simple problems. Table 2 shows that performance of CNN+MLP in is 39.9% higher than that in . Also, Table 3 shows that its performance in is 48.0% higher than that in , and performance in is 54.8% higher than that in .
Training on different complexity levels improves generalization. Before we conducted experiments in the extrapolation setting, we predicted that the performance on out-dist test sets would increase if models were trained on various complexity levels. In the shape composition task, our prediction is consistent with the following results: performance of CNN+MLP and CNN+GloRe in is higher than that in , and performance in is higher than that in . However, the results from Table 2 ( and ) and Table 3 ( and ) are inconsistent with our prediction.
We hypothesized that adding an equal number of simple and complex problems to a training set would result in models learning from fewer complex problems. To test this, we conducted an additional experiment using different ratios of complexity levels in the training set. We provide results for one experiment in the rotation task and one in the shape composition task. As Figure 4 shows, changing the ratio of complexity levels in the training set affects performance on the test set. Figure 4(a) shows that CNN+MLP and Siamese achieved the highest and second highest accuracy of 74.0% and 65.7%, respectively, when the ratio is 1:2 in the rotation task. The performance of CNN+MLP and Siamese was 3.06% and 4.45% higher, respectively, than when training with the ratio of 1:1. However, when simple problems are added (the ratios of 1:3 and 1:4), performance slightly decreases. When the proportion of simple problems in the training set is too small, the simple problems may act as noise. When the ratio of the simple problems increases, performance improves. After the performance peaks with the ratio of 1:2, it decreases as the proportion of simple problems increases. If the proportion of simple problems is too high, performance may not improve. Figure 4(b) shows the result for the shape composition task. The CNN+MLP model achieves the highest performance of 73.3% when the ratio is 1:1:2, which shows that performance in both tasks improves as the proportion of more complex problems increases. Thus, we confirmed that training on different complexity levels improves generalization.
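How a ratio such as 1:2 or 1:1:2 translates into per-level problem counts can be sketched as below. The function name and the assumption that the total training-set size is held fixed (7,000 in the example) are illustrative; the text does not specify how the ratio experiments were sized.

```python
def counts_for_ratio(ratio, total):
    """Split `total` training problems across levels according to `ratio`.

    Any remainder from integer division is assigned to the last (most
    complex) level so that the counts always sum to `total`.
    """
    weight_sum = sum(ratio)
    counts = [total * r // weight_sum for r in ratio]
    counts[-1] += total - sum(counts)
    return counts
```

For example, `counts_for_ratio((1, 1, 2), 7000)` gives 1,750 problems for each of the two simpler levels and 3,500 for the most complex one.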
5.2 Rotation
Siamese does not generalize well in most experiments. In previous studies, the vector distance between two images was commonly used to learn rotation-invariant representations when the degree of rotation is not available. Siamese networks, trained on the vector distance, generalized well to new situations. However, in our experiments, Siamese performed worse than CNN+MLP in most experiments, as Table 2 shows. These results imply that using only a recognizer is not enough to solve the rotation task. A reasoning module such as an MLP that aggregates question and candidate answers is helpful.
A larger model size does not guarantee higher performance. As shown in Table 2, the validation and test performance of ResNet+MLP is relatively low when ResNet+MLP is trained from scratch on our rotation task. Since the low performance may be due to underfitting, we replaced ResNet with a pretrained ResNet and trained all weights in the model (ResNet-pre). Although the pretrained ResNet was trained on the large amount of data in the ImageNet dataset, ResNet-pre achieved 54.4% and 52.0% on the validation and in-dist test sets, respectively, in the neutral setting. (Footnote 7: ResNet+MLP also performed relatively poorly in the shape composition task although it requires more memory and training time.) From these results, we conclude that a large model size does not always improve performance on spatial reasoning tasks. Using heavy image recognizers can cause overfitting. Instead, proper structures or training methods should be explored to solve the spatial reasoning tasks.
5.3 Shape Composition
CNN+GloRe always outperforms other baselines in the neutral setting, but not in the extrapolation setting. In the neutral setting, CNN+GloRe obtained 2.5% and 4.6% higher performance than CNN+MLP and CNN+Max, respectively, as shown in Table 3. However, in the extrapolation setting, CNN+MLP generally outperforms CNN+GloRe. We assumed that CNN+GloRe would be able to learn shape composition skills (e.g., combining piece images) like humans, and achieve high performance in all experimental settings, but it did not. Since CNN+GloRe relies on features from images in the training data, it obtained low performance in the extrapolation setting. We believe that more work is needed to develop models, such as CNN+GloRe, that solve the task in a way similar to how humans solve Tangram puzzles. Such work should also carefully consider how to learn the principles of reasoning.
CNN+Max generalizes well when trained on only one complexity level. Table 3 shows that CNN+Max outperforms the other baselines in four experiments in the extrapolation setting. CNN+Max learns to find the most noticeable features from images, regardless of the number of piece images, resulting in the high performance in these experiments. However, CNN+Max does not capture relationships between piece images, which are important in tasks such as the shape composition task. Even if CNN+Max is trained on more piece images, its performance does not improve. On the other hand, CNN+MLP and CNN+GloRe learn how to combine piece images. We hypothesized that when CNN+MLP and CNN+GloRe are given different numbers of piece images in training, the models can combine more piece images. The experimental results support our hypothesis.
6 Qualitative Analysis
In experiments, we confirmed that neural models can solve spatial reasoning tests, and generalize their ability even in the extrapolation setting. However, it is still questionable whether they solve the tasks based on spatial understanding or pattern matching. In this section, we provide further analysis and conclusions with visual aids. For visualization, we utilize Grad-CAM, which uses the gradients flowing into a convolutional layer. We used the third convolutional layer of the CNNs to understand which features are important for the answer.
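The Grad-CAM computation itself is compact: each feature map is weighted by the spatial average of its gradient, the weighted maps are summed, and negative values are clipped to zero. The following plain-Python sketch operates on nested lists that stand in for the activations and gradients of the chosen convolutional layer; the function name and data layout are assumptions for illustration, and obtaining the gradients from a trained network is outside its scope.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map.

    activations, gradients: lists of K feature maps, each an H x W nested list.
    Returns an H x W nested list: ReLU(sum_k w_k * A_k), where w_k is the
    spatial mean of the gradient of the score with respect to feature map k.
    """
    weights = [
        sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for grad in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])
    return [
        [
            max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
```

The resulting map has the spatial resolution of the chosen layer and is usually upsampled to the input size before being overlaid on the image.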
We randomly sampled problems in the neutral setting in the rotation task that CNN+MLP predicted correctly, and analyzed them. In most cases, we found that the model solves problems by capturing common structures of objects between question and candidates. In Figure 5(a), the model captured the common ’’-shaped parts in the question and three candidates out of four and the L-shaped part in the remaining candidate. As a result, the remaining candidate was chosen as the correct answer. However, it is worth noting that the model did not always focus on ’’-shaped parts or L-shaped parts. Its focus varies depending on objects. The model occasionally captured joining blocks and the longest edges of objects. Similarly, the model predicted the answer correctly when the model focused on the same parts in the question and the incorrect candidates, and different part in the correct candidate, as Figure 5(b) shows.
Next, we randomly sampled error cases of CNN+MLP, and classified our findings into three categories. First, if the model captures the same structures from the question and all the candidates, i.e., there is no difference between candidates, the model is confused. In Figure 6(a), the model focused on the same L-shaped parts in all five objects, which results in an incorrect prediction. To solve the problem, the model is required to understand the direction in which the L-shaped parts are bent relative to the longest edges. The model failed to answer correctly due to a lack of spatial understanding, whereas humans can easily solve this problem. Also, as Figure 6(b) shows, the model is confused when the correct candidate answer image is similar to a mirror image of another candidate. Second, the model is vulnerable to situations where a part of an object is obscured by another part of the object. In Figure 6(c), the L-shaped part in the question is obscured. In this case, while humans can restore the obscured part in their mind, the model recognizes only the frontmost part and misses the obscured part. Lastly, even when nothing is obscured, the model sometimes did not capture the parts that are important for solving the problem. In Figure 6(d), we marked the important parts, the joining blocks, with yellow circles. The model missed them and mainly focused on the longest edges, which results in an incorrect prediction.
6.2 Shape Composition
In this section, we analyze how neural models solve the shape composition task using Grad-CAM. In the shape composition task, as Figure 1 shows, original images are adjacent to the border with no margins in the background. However, the weights are drawn outside the image outlines, so information from the adjacent edges is lost, which makes visual analysis difficult. Hence, we padded the images with a margin of 50 on each side and re-scaled the padded images to prevent information loss.
Using the newly preprocessed images, we trained CNN+MLP and CNN+Max on a single complexity level to investigate the reason for the high performance of CNN+Max and the difference in problem solving between CNN+Max and CNN+MLP. (Footnote 8: The performance of models after padding and re-scaling is slightly higher than before. In this setting, CNN+MLP and CNN+Max achieved 52.6% and 65.1%, respectively. The CNN+MLP model trained on complexity levels 1 and 2 achieved 68.3%.) We refer to this model as the single-level CNN+MLP to distinguish it from the CNN+MLP model trained on complexity levels 1 and 2. Note that CNN+Max first captures features of each piece image, and then selects the most distinguishable ones among the features. If the piece images contain the same parts as those features, the model predicts the candidate image as the answer. For example, in Figure 7(b) and 7(f), CNN+Max captured an oblique side and a vertex on the top from the question image and captured the same part in the third piece image. The CNN recognizer also captured other features, but max-pooling does not consider the combination of these features. This problem solving method of CNN+Max is more effective than that of CNN+MLP when the latter has little opportunity to learn how to combine piece images, i.e., when it is trained on only one complexity level, or a single number of piece images. In fact, the single-level CNN+MLP did not solve the problem (Figure 7(e) and (g)): because the model was trained to combine only two pieces, it was not able to generalize its ability to combining more than two pieces.
However, CNN+Max is vulnerable when piece images are rotated. Since standard CNN filters are not rotation-invariant, CNNs capture different features from two images that are identical except that one of them is rotated. Thus, if the third piece image in Figure 7(f) is rotated, the model outputs another answer. CNN+Max is also confused when features from the question image are common to multiple candidate images. In Figure 8, the top part of the question image (b) is very similar to the third piece image in the correct answer (f), but the model failed to answer correctly because a piece image in another candidate (e) is also similar to the question image. On the other hand, CNN+MLP does not rely only on the similarity between the question and piece images; it also considers the combination of piece images. As a result, CNN+MLP solved the problem correctly even though similar piece images existed.
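The sensitivity of a standard convolutional filter to rotation can be illustrated with a small sketch. The 2x2 edge filter and the 4x4 image below are hypothetical and chosen only for illustration; they are not taken from our trained models.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation, the operation a standard CNN filter computes."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]


def rotate90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def max_response(image, kernel):
    """Strongest activation of the filter anywhere in the image."""
    return max(max(row) for row in correlate2d(image, kernel))


# A filter that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1, 1], [-1, 1]]

# A 4x4 image whose right half is bright, i.e., it contains one vertical edge.
image = [[0, 0, 1, 1] for _ in range(4)]

print(max_response(image, vertical_edge))                      # 2
print(max_response(rotate90(image), vertical_edge))            # 0
print(max_response(rotate90(image), rotate90(vertical_edge)))  # 2
```

The same filter responds strongly to the original image and not at all to its rotated copy; the response is recovered only when the filter is rotated as well. A fixed set of learned filters therefore yields different features for a rotated piece.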
In addition, we trained another CNN+MLP model to see the effect of training on different numbers of piece images. Compared to the original CNN+MLP, this second model captures the shapes of images more globally, as Figures 7(d), 7(h), 8(d) and 8(h) show. This implies that it consults more shape features prior to combining pieces.
Several works have studied the ability of CNNs to capture image shapes [22, 2, 12], but they did not provide visual analysis. Unlike the finding that ImageNet-trained CNNs are insensitive to image shapes [2, 12], our models solve the task based on information about image shapes. However, consistent with prior findings, our CNN-based models usually do not capture the global shapes of images, whereas humans naturally capture global shapes when solving a Tangram puzzle. Reducing this difference between humans and machines could be the key to solving puzzle-related tasks and even reasoning tasks.
7 Conclusions and Future Work
In this paper, we introduced two spatial reasoning tests, rotation and shape composition, both of which are human IQ tests that require spatial reasoning. We generated a dataset for each task, each consisting of various complexity levels. In experiments in the neutral and extrapolation settings, we examined whether neural-network-based models can apply their spatial reasoning ability to unseen situations and confirmed that they can. Several factors improve the models’ generalization: training on complex problems, training on different complexity levels, and using reasoning modules such as MLPs. Surprisingly, max-pooling is effective in the extrapolation setting. Another lesson is that higher in-distribution performance does not guarantee better generalization. A large model size and additional components may improve in-distribution performance, but they can cause overfitting. We also analyzed how baseline models solve spatial reasoning tests. Although spatial reasoning tests were designed to measure spatial understanding, the models solve the tasks through pattern matching while lacking an understanding of space.
Future work should focus on training models to understand space. If such training is possible, we can create a model that solves complex problems just by learning simple ones. Moreover, we can train such a model efficiently with a small amount of data.
We simplified our tasks to focus on the reasoning abilities of neural models. Based on the results of this study, we plan to extend these tasks to more general situations. In the extended rotation task, polyominoes can be converted to cylindrical shapes. In the extended shape composition task, the original shape and piece images can be converted to three dimensions. These extended tasks may yield more interesting and meaningful discoveries.
References
-  (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683. Cited by: §2.
-  (2018) Deep convolutional networks do not classify based on global object shape. PLoS computational biology 14 (12), pp. e1006613. Cited by: §6.2.
-  (2018) Measuring abstract reasoning in neural networks. arXiv preprint arXiv:1807.04225. Cited by: §1, §2, footnote 2.
-  (2019) Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238. Cited by: §2.2.2.
-  (2019) Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12154–12163. Cited by: §2.1.2.
-  (2019) Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 433–442. Cited by: §1, §4.2.3, §4.3.2.
-  Rifd-cnn: rotation-invariant and fisher discriminative convolutional neural networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2884–2893. Cited by: §2.1.2.
-  (2004) Geometric and spatial thinking in early childhood education. Engaging young children in mathematics: Standards for early childhood mathematics education, pp. 267–297. Cited by: §2.
-  (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430. Cited by: §2.2.2.
-  (2006) The importance of gesture in children’s spatial reasoning.. Developmental psychology 42 (6), pp. 1259. Cited by: §2.
-  (1992) Multiple intelligences. Vol. 5, Minnesota Center for Arts Education. Cited by: §2.
-  (2018) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231. Cited by: §6.2.
-  (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §2.1.2.
-  (2017) From square pieces to brick walls: the next challenge in solving jigsaw puzzles. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4029–4037. Cited by: §2.2.2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.2.
-  (2017) Iq of neural networks. arXiv preprint arXiv:1710.01692. Cited by: §1, §2, §4.1.1.
-  (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §2.
-  Rotationnet: joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5010–5019. Cited by: §2.1.2.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.1.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §4.2.3.
-  (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §2.1.2, §4.1.3, §5.2.
-  (2016) Deep neural networks as a computational model for human shape sensitivity. PLoS computational biology 12 (4), pp. e1004896. Cited by: §6.2.
-  (2007) Mitochondrial dna as a genomic jigsaw puzzle. Science 318 (5849), pp. 415–415. Cited by: §2.2.2.
-  (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.2.2.
-  (2018) Image reassembly combining deep learning and shortest path problem. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–167. Cited by: §2.2.2.
-  (2018) Logical composition of qualitative shapes applied to solve spatial reasoning tests. Cognitive Systems Research 52, pp. 82–102. Cited by: §2.2.2.
-  (2017) Deeppermnet: visual permutation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3949–3957. Cited by: §2.2.2.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §6.
-  Patch reordering: a novel way to achieve rotation and translation invariance in convolutional neural networks. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2.1.2.
-  A generalized genetic algorithm-based solver for very large jigsaw puzzles of complex types. In Twenty-Eighth AAAI Conference on Artificial Intelligence. Cited by: §2.2.2.
-  (2019) Out of sight but not out of mind: an answer set programming based online abduction framework for visual sensemaking in autonomous driving. arXiv preprint arXiv:1906.00107. Cited by: §2.
-  Solving verbal questions in iq test by knowledge-powered word embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 541–550. Cited by: §1, §2.
-  (2018) Learning steerable filters for rotation equivariant cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 849–858. Cited by: §2.1.2.
-  (2019) Raven: a dataset for relational and analogical visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5317–5327. Cited by: §1, §2, footnote 2.