
Multi-Granularity Modularized Network for Abstract Visual Reasoning

by   Xiangru Tang, et al.

Abstract visual reasoning connects mental abilities to the physical world and is a crucial factor in cognitive development. Most toddlers display sensitivity to this skill, but it is not easy for machines. To study it, we focus on the Raven Progressive Matrices Test, which is designed to measure cognitive reasoning. Recent work has designed black-box models that solve it in an end-to-end fashion, but they are highly complicated and difficult to explain. Inspired by cognitive studies, we propose a Multi-Granularity Modularized Network (MMoN) to bridge the gap between the processing of raw sensory information and symbolic reasoning. Specifically, it learns modularized reasoning functions that model the semantic rule from the visual grounding in a neuro-symbolic and semi-supervised way. To comprehensively evaluate MMoN, we conduct experiments on datasets with both seen and unseen reasoning rules. The results show that MMoN is well suited for abstract visual reasoning and is also explainable on the generalization test.



I Introduction

Can an agent do relational and analogical visual reasoning as well as a toddler? Moreover, can an agent solve reasoning tasks it has never seen before?

Abstract visual reasoning is a remarkable cognitive mechanism that lets humans reach logical conclusions in the absence of physical objects, specific instances, or concrete phenomena. Here, the capacity for reasoning is primarily a generalization about relations and attributes rather than about concrete objects. More importantly, current machine learning techniques are data-hungry and brittle: they can only make sense of patterns they have seen before. Using current methods, an algorithm can gain new skills through exposure to large amounts of data, but cognitive abilities that generalize broadly to many tasks remain elusive. This raises the question of what happens when the agent meets a new, unseen reasoning type. We also want to know what thinking in images and abstract reasoning mean for machines.

To deal with these issues, we focus on abstract visual reasoning, which offers the potential for more human-like abstraction and reasoning. Concretely, we verify our agent on the Raven's Progressive Matrices (RPM) Test, designed to measure abstract visual reasoning. It is also used to test human non-verbal cognitive functioning in some public exams. In the test, the agent is shown a matrix of geometric designs. Given eight candidates for the missing layout, the agent has to choose the correct one, which requires following the rule of the analogical relations and figuring out the specific pattern in the matrix [14], based on Spearman's two-factor theory of intelligence [13].

Unlike existing work that measures abstract visual reasoning using RPM [10], we simulate and validate our designs on the RAVEN [14] test, because RAVEN establishes a semantic link between vision and thinking by providing a tree-based structure representation. Previous work [14, 16, 15, 12] designs extremely complex models that perform representation and reasoning in an end-to-end fashion. These models are cumbersome and therefore hard to explain, and the structure information is not well utilized. Most importantly, they cannot easily extrapolate their knowledge to new situations.

To better understand how machines handle this task, we aim to find out whether a computer can learn the (semantic) rule from visual sensory information. Toddlers must rely on intrinsic cognitive functions to reach logical conclusions, and they can be attuned to relationships between features of objects, actions, and the physical environment. Inspired by these cognitive studies, we equip our model with simple modularized reasoning functions that are jointly trained with the perception backbone in a neuro-symbolic way. We adopt the module network [1], with one module for each rule in our case. To train it, we take our rules as the target of a latent semantic parser, and the goal is to recover that target. Meta-target information is then used to restrict the space of potential semantic parsers that we consider, which provides a certain level of intelligence. To assess our model, we evaluate it on the RAVEN dataset against various baselines. Furthermore, we design four generalization tests to demonstrate its improved ability to deal with unseen reasoning rules.

II Related Work

The Raven's Progressive Matrices problem is widely used to test the capability of abstract reasoning. In recent years, different models and datasets have been designed to lift the reasoning ability of modern vision systems. Inspired by RPM, [10] built the first large-scale RPM dataset, named PGM, and proposed a relational model, the Wild Relation Network (WReN), which leverages representations of pair-wise relations for each choice. Then, [11] made use of a pre-trained Variational Autoencoder to improve the generalization performance of WReN [10]. [14] generated a new RPM-style dataset, RAVEN, with structured representation and proposed the Dynamic Residual Tree (DRT), which considers annotations of image structure. Both PGM and RAVEN are designed to be easy to recognize but hard to reason about. [16] proposed a student-teacher architecture to deal with distracting features. More recently, [12] used a multi-layer multiplex graph to capture multiple relations between objects. In addition, [17] modified ResNet [4] to reduce overfitting and proposed MCPT to solve RPM problems in an unsupervised manner.

Fig. 1: Multi-Granularity Modularized Network

Many previous studies have used modular neural architectures for various tasks. For example, [2] flexibly assembled networks from a collection of specialized substructures to answer questions. [6] jointly learned a good representation of visual concepts and a semantic parsing of sentences from images and question-answer pairs, even without explicit supervision, because it uses different modules to extract different information.

III Approach

III-A Problem Formulation

The task is designed to measure non-verbal, cognitive, and abstract reasoning. In the task setting, the agent is shown a matrix of geometric designs in which the last diagram is missing. Given eight candidates for the missing layout, the agent has to choose the correct one by following the rule of the analogical relations. In the example in Figure 1, the problem has an inside-outside structure in which the external component is a layout with a single centered object and the inside component is a grid layout. The rules are listed in Figure 1. The compositional nature of the rules makes this problem a difficult one, and the correct answer is seven.

The task can be formally defined as follows. We are given a set of training samples, each consisting of the input images, a label, and a meta-target. The input images contain 8 context panels and 8 candidate answers. The label marks the correct candidate, and the meta-target is a tensor containing the attributes and rules of the sample. Each input sample has a rule set whose elements are tuples of two elements, an attribute and a rule, meaning that for a certain row or column of the matrix the attribute follows that rule. The attribute is sampled from a fixed attribute set. The input of the model consists of the input images and the meta-target given by the dataset. In detail, the meta-target is a multi-hot vector consisting of an attribute part and a rule part: each position of the vector represents an attribute or a rule, with 1 for present and 0 for absent. This information is used for learning the rules of the training samples.

Specifically, there are 4 types of rules in our setting: Constant, Progression, Arithmetic, and Distribute Three. As shown in Figure 1, they are denoted as [attribute: rule] pair.
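As an illustration of the multi-hot meta-target described above, the following is a minimal sketch rather than the authors' implementation. The four rule names come from the text; the five attribute names, the ordering of positions, and the helper `encode_meta_target` are assumptions made for the example.

```python
# Rule names follow the text; the attribute names and the position order
# are assumptions for illustration only.
ATTRIBUTES = ["Number", "Position", "Type", "Size", "Color"]
RULES = ["Constant", "Progression", "Arithmetic", "Distribute Three"]


def encode_meta_target(rule_set):
    """Encode [attribute: rule] pairs as a multi-hot list.

    The first len(ATTRIBUTES) positions form the attribute part and the
    remaining len(RULES) positions form the rule part (1 = present).
    """
    vec = [0] * (len(ATTRIBUTES) + len(RULES))
    for attribute, rule in rule_set:
        vec[ATTRIBUTES.index(attribute)] = 1
        vec[len(ATTRIBUTES) + RULES.index(rule)] = 1
    return vec


# A hypothetical sample whose rows follow two [attribute: rule] pairs.
example = encode_meta_target([("Size", "Progression"), ("Color", "Constant")])
print(example)
```

Note that a multi-hot vector of this kind records which attributes and which rules occur in a sample, but not which rule is attached to which attribute.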

III-B Multi-Granularity Modularized Network

There are two vector spaces in our architecture: the space of visual representations and the space of reasoning rules. Given a set of modules corresponding to different attribute types, we need to learn a projection from the high-dimensional representation space to the low-dimensional rule space and find the most similar embedding.

III-B1 Multi-Granularity Sensory Representation

Inspired by principles of psychological development, we note that the human capacity for abstract visual reasoning develops from initial reasoning about physical, concrete objects and then from the subsequent formation of categories and schemas [5]. Following this hierarchical reasoning strategy, we incorporate hierarchical features from three levels of granularity: panel-level, row-level, and overall-level. This multi-granularity sensory representation captures both coarse-grained and fine-grained features. In addition, the representations of the panels are coupled and interact with each other.

models                   Avg      Center   2*2Grid  3*3Grid  L-R      U-D      O-IC     O-IG
Random                   12.50%   12.50%   12.50%   12.50%   12.50%   12.50%   12.50%   12.50%
LSTM[14]                 13.07%   13.19%   14.13%   13.69%   12.84%   12.35%   12.15%   12.99%
LSTM+DRT[14]             13.96%   14.29%   15.08%   14.09%   13.79%   13.24%   13.99%   13.29%
CNN[14]                  36.97%   33.58%   30.30%   33.53%   39.43%   41.26%   43.20%   37.54%
CNN+DRT[14]              39.42%   37.30%   30.06%   34.57%   45.49%   45.54%   45.93%   37.54%
ResNet-18+MLP+DRT[14]    59.56%   58.08%   46.53%   50.40%   65.82%   67.11%   69.09%   60.11%
ResNet-50+MLP+DRT[17]    86.26%   89.45%   66.60%   67.95%   97.85%   98.15%   96.60%   87.20%
WReN[10]                 14.69%   13.09%   28.62%   28.27%   7.49%    6.34%    8.38%    10.56%
LEN[16]                  72.9%    80.2%    57.5%    62.1%    73.5%    81.2%    84.4%    71.5%
LEN + Teacher Model[16]  78.3%    82.3%    58.5%    64.3%    87.0%    85.5%    88.9%    81.9%
MXGNet[12]               83.91%   /        /        /        /        /        /        /
MMoN                     83.01%   92.06%   82.84%   63.44%   82.33%   78.29%   80.45%   82.54%
MMoN(meta-target)        87.04%   95.12%   90.14%   78.82%   89.45%   84.72%   82.68%   88.27%
TABLE I: Testing accuracy of different models on RAVEN.
models                   Center   L-R      U-D      O-IC     2*2Grid  3*3Grid
Random                   12.50    12.50    12.50    12.50    12.50    12.50
ResNet-18+MLP+DRT[14]    51.87    40.03    35.46    38.84    38.69    39.14
ResNet-50+MLP+DRT[17]    60.80    43.65    41.40    43.65    42.24    43.87
MMoN                     59.47    40.28    38.91    41.11    39.84    42.55
MMoN(meta-target)        62.49    45.21    43.68    44.56    45.16    47.25
TABLE II: Generalization test (accuracy in %). First, the model is trained on Center and tested on three other figure configurations. The 3*3Grid column means the model is trained on 2*2Grid and tested on 3*3Grid, and the 2*2Grid column means the model is trained on 3*3Grid and tested on 2*2Grid.
models                   Center    L-R       U-D       O-IC(single)  O-IC(four)  2*2Grid   3*3Grid
Random                   12.5000   12.5000   12.5000   12.5000       12.5000     12.5000   12.5000
LSTM[14]                 12.0192   13.2212   11.7788   12.0192       10.0962     13.4615   11.7788
LSTM+DRT[14]             12.2596   12.2596   9.8558    12.5000       13.7019     12.5000   10.0962
CNN[14]                  12.0192   12.0192   11.0577   13.9423       10.5769     10.3365   12.5000
CNN+DRT[14]              11.5385   12.2596   14.9038   12.5000       11.0577     13.4615   10.8173
ResNet-18+MLP+DRT[14]    19.9519   17.3077   20.1923   14.6635       17.7885     18.2692   15.8654
ResNet-50+MLP+DRT[17]    31.4904   40.8654   37.0192   37.9808       31.7308     30.2885   25.7212
MMoN                     33.9500   38.5000   17.7500   39.4499       31.3000     35.8500   38.5499
MMoN(meta-target)        40.1534   43.7400   22.1868   41.8625       37.4919     43.9735   47.1300
TABLE III: Generalization test (accuracy in %). One model is trained on data without the Distribute Three rule and tested on data with it. Another model is trained on data without the Progression rule and tested on data with it.

Panel-wise granularity: It takes each panel as input and handles the attributes of the graphical elements inside it. Moreover, we take the correlations among panels of the same row into consideration and apply a Relation Network [9] to obtain this inner relationship. For each panel in a row, we first use a Residual Network [4] to extract its features. Then WReN is used to extract the representation of the pair-wise relationships among the 3 panels in the row.


Row-wise granularity: Furthermore, the network at this level takes each row as input. In RAVEN, the same rules are applied to every row. Motivated by this, we stack the three panels in a row together (instead of treating each panel separately as at the panel level). The stack is then fed to a pre-trained 3-channel ResNet to encode the entire row into a compact embedding.


Overall-wise granularity: Considering that the rules of the third row are the same as those of the first two rows, it is essential to take two rows together as input and jointly learn the rule patterns underlying them. Thus, we perform pair-wise embedding to capture the interaction between two rows, just as WReN does. Each combination of two rows is passed as input to the overall-level network, which treats the combination as a whole and uses a 6-channel ResNet to take the pair-wise relationships among rows into consideration.
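The three granularities described above differ mainly in how panels are grouped before being encoded. The sketch below only illustrates that grouping and the resulting channel counts (1, 3, and 6); it does not build any network. The row-major panel numbering, the `("cand", k)` tag for a candidate answer, the choice to stack every unordered pair of rows, and the helper `granularity_inputs` are all assumptions made for this example.

```python
from itertools import combinations


def granularity_inputs(candidate_index):
    """Group panel ids of one candidate-completed 3x3 matrix.

    Context panels are numbered 0..7 in row-major order (an assumption);
    the candidate filling the ninth cell is tagged ("cand", candidate_index).
    """
    cells = list(range(8)) + [("cand", candidate_index)]
    rows = [cells[0:3], cells[3:6], cells[6:9]]

    panel_level = [[cell] for cell in cells]  # one panel -> 1 channel
    row_level = rows                          # three panels -> 3 channels
    # Assumption: every unordered pair of rows is stacked -> 6 channels.
    overall_level = [a + b for a, b in combinations(rows, 2)]
    return panel_level, row_level, overall_level


panel, row, overall = granularity_inputs(candidate_index=6)
print([len(group[0]) for group in (panel, row, overall)])  # [1, 3, 6]
```

Each group would then be stacked along the channel axis and passed to the encoder of the matching granularity.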


III-B2 Modularized Reasoning

Here, we need to infer the corresponding rules from the representation and then obtain the answer. A set of modularized functions corresponds to the different rules, so each module represents a specific rule, and we regard each rule as an operation. Then, given the attribute and the rule, we use a simple MLP to estimate how correct each candidate is. The goal of these functions is to learn the right parameterization of the modules to capture the right rule, such as the size changing or the color remaining constant. In this neuro-symbolic way [7, 3], the signals for learning the modules come from the sensory representation, and the modules are trained jointly through the final candidate selection. We end up with a rule embedding that is close to the projection. For training, we start from randomly initialized modules. Moreover, we feed the meta-target as a semi-supervised signal.

To bind each MLP to a specific attribute, we use the meta-target information in the dataset as supervision. The meta-target records the attributes and rules included in the current training sample. During training, we only train the networks corresponding to the attributes present in the current sample and freeze the other MLPs. In this way, each MLP is bound to a unique attribute.
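The selective training described above can be pictured as a per-sample mask over the attribute modules. The following is a minimal sketch under stated assumptions: the attribute names and the helper `trainable_modules` are hypothetical, and in a real training loop the returned flags would be used to enable or disable gradient updates for each MLP.

```python
# Assumed attribute names; the source states only that each component
# has five attributes.
ATTRIBUTES = ["Number", "Position", "Type", "Size", "Color"]


def trainable_modules(sample_attributes):
    """Map each attribute module to True (train) or False (freeze)."""
    present = set(sample_attributes)
    unknown = present - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return {name: name in present for name in ATTRIBUTES}


# A hypothetical sample whose meta-target lists Size and Color.
flags = trainable_modules(["Size", "Color"])
print(flags)
```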

Moreover, we define one modularized function for each attribute in the attribute set. For each attribute, the multi-granularity representation is fed to the corresponding function to obtain the transformation for one rule.

The modularized functions are then used to infer the transformation on a specific attribute with the help of the meta-targets. Cosine similarity is applied to find the transformation that is most similar to the correct rule given by the meta-target. Concretely, we compute the similarity between the projected representation and the rule embedding to choose the best candidate.


For inference, we apply the cosine similarity function as a score function to calculate the score of each candidate at each of the three granularities. Finally, we choose the candidate with the highest total score, and the corresponding candidate answer is the output of the model.
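The scoring step can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes a single rule embedding shared by the three granularities, sums the cosine scores with equal weight, and uses three toy candidates with 2-dimensional vectors in place of the eight real candidates. The names `cosine` and `choose_candidate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_candidate(projections, rule_embedding):
    """Pick the candidate with the highest total score.

    projections[g][k] is the projected vector of candidate k at
    granularity g (panel, row, overall).
    """
    n_candidates = len(projections[0])
    totals = [
        sum(cosine(level[k], rule_embedding) for level in projections)
        for k in range(n_candidates)
    ]
    best = max(range(n_candidates), key=totals.__getitem__)
    return best, totals


# Toy data: three granularities, three candidates, 2-d embeddings.
rule = [1.0, 0.0]
projections = [
    [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],   # panel-level
    [[-1.0, 0.0], [2.0, 0.0], [0.0, 1.0]],  # row-level
    [[0.0, 1.0], [3.0, 0.0], [1.0, 0.0]],   # overall-level
]
best, totals = choose_candidate(projections, rule)
print(best)  # 1
```

Because cosine similarity ignores vector length, candidate 1 reaches the maximum score at every level even though its three projections have different magnitudes.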

Our model finally chooses one image from the candidate set to complete the matrix correctly, namely the one that satisfies the underlying rules in the matrix.


The training objective combines two losses, one on the target label (the correct answer) and one on the meta-target, and an adjustable hyperparameter controls the relative weight of the two. The weak supervision thus comes from the concept space and from the answer. The meta-target constrains which kinds of operations are used, and the answer constrains what the output of each function should be.

IV Experiment

In our experiment, the dataset is split into training, validation, and testing sets with a ratio of 6:2:2. The input images are resized before being fed to the model. Because the number of input channels differs across the three levels, we set the input channels of the first layer of each ResNet to 1, 3, and 6, respectively, to fit the shape of the input. The MLPs in our model share the same structure (linear and dropout layers). In RAVEN, there are two components for each sample, and each component has five attributes. We use Adam as the optimizer for training. Our code is available at

Fig. 2: Different kinds of components in the dataset.

In total, we have seven configurations, as shown in Figure 2. To test the generalization performance of the models on unseen data, we design two experiments: (a) the models are trained on one subset of the data and tested on another subset with an unseen layout, and (b) the models are trained on a subset of the data without one specific rule and then tested on another subset that contains this rule. Results can be found in Tables II and III. The results show that MMoN is well suited for abstract visual reasoning (see Table I) and is also explainable on the generalization tests.

V Conclusion

We propose a novel Multi-Granularity Modularized Network, which achieves high accuracy and remains stable across the different layouts of RAVEN.

VI Acknowledgments

We sincerely thank Rodolfo Corona for useful discussions and assistance with data analysis. This work is substantially supported by the National Natural Science Foundation of China under grant number 61572223 and the University-Industry Collaborative Education Program between the Ministry of Education of China and Google Information Technology (China) Co., Ltd. (PJ190496).


[1] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016a.
[2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016b.
[3] Chi Han, Jiayuan Mao, Chuang Gan, Josh Tenenbaum, and Jiajun Wu. Visual concept-metaconcept learning. In Advances in Neural Information Processing Systems, pages 5002–5013, 2019.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[5] Mary A Luszcs. Psychological development. North Holland, 1989.
[6] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations, 2019a.
[7] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584, 2019b.
[8] John Raven. The raven's progressive matrices: change and stability over culture and time. Cognitive psychology, 41(1):1–48, 2000.
[9] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
[10] Adam Santoro, Felix Hill, David Barrett, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pages 4477–4486, 2018.
[11] Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization for abstract reasoning tasks using disentangled feature representations. arXiv preprint arXiv:1811.04784, 2018.
[12] Duo Wang, Mateja Jamnik, and Pietro Lio. Abstract diagrammatic reasoning with multiplex graph networks. 2020.
[13] Wayne Weiten. Psychology: Themes and variations. Cengage Learning, 2007.
[14] Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5317–5327, 2019a.
[15] Chi Zhang, Baoxiong Jia, Feng Gao, Yixin Zhu, Hongjing Lu, and Song-Chun Zhu. Learning perceptual inference by contrasting. In Advances in Neural Information Processing Systems, pages 1073–1085, 2019b.
[16] Kecheng Zheng, Zheng-Jun Zha, and Wei Wei. Abstract reasoning with distracting features. In Advances in Neural Information Processing Systems, pages 5834–5845, 2019.
[17] Tao Zhuo and Mohan Kankanhalli. Solving raven's progressive matrices with neural networks. arXiv preprint arXiv:2002.01646, 2020.