Can an agent do relational and analogical visual reasoning as well as a toddler? Moreover, can an agent solve reasoning tasks it has never seen before?
Abstract visual reasoning is a remarkable cognitive ability that lets humans reach logical conclusions in the absence of physical objects, specific instances, or concrete phenomena. Here, the capacity for reasoning generalizes primarily over relations and attributes, rather than over concrete objects. However, current machine learning techniques are data-hungry and brittle: they can only make sense of patterns they have seen before. With current methods, an algorithm can gain new skills through exposure to large amounts of data, but cognitive abilities that broadly generalize across many tasks remain elusive. This raises the question of what happens when an agent encounters a new, unseen reasoning type, and, more broadly, what "thinking in images" and abstract reasoning mean for machines.
To address these issues, we focus on abstract visual reasoning, which offers the potential for more human-like abstraction and reasoning. Concretely, we evaluate our agent on the Raven's Progressive Matrices (RPM) test, which is designed to measure abstract visual reasoning and is also used to assess human non-verbal cognitive functioning in some public exams. In this test, the agent is shown a matrix of geometric designs. Given eight candidates for the missing panel, the agent must choose the correct one by following the analogical relations and figuring out the specific pattern in the matrix, in line with Spearman's two-factor theory of intelligence.
Unlike existing work measuring abstract visual reasoning with RPM, we design and validate our model on the RAVEN test, because RAVEN establishes a semantic link between vision and thinking by providing a tree-based structural representation. Previous work [14, 16, 15, 12] designs extremely complex models that perform representation and reasoning in an end-to-end fashion. These models are unwieldy and therefore hard to explain, their structural information is not well utilized, and, most importantly, they cannot easily extrapolate their knowledge to new situations.
To better understand how machines handle this task, we aim to determine whether a computer can learn the rules (semantics) from visual sensory information. Toddlers, by contrast, must rely on intrinsic cognitive functions to reach logical conclusions; they can be attuned to relationships between features of objects, actions, and the physical environment. Inspired by these cognitive studies, we equip our model with simple modularized reasoning functions that are jointly trained with the perception backbone in a neuro-symbolic way. We adopt a module network in which each module corresponds to one rule. To train it, we treat the rules as the target of a latent semantic parser and aim to recover them. Meta-target information is then used to restrict the space of potential semantic parses under consideration, providing a certain level of supervision. To evaluate our model's effectiveness, we compare it against various baselines on the RAVEN dataset. Furthermore, we design four generalization tests to demonstrate its improved ability to deal with unseen reasoning rules.
II Related Work
Raven's Progressive Matrices (RPM) problems are widely used to test the capability of abstract reasoning. In recent years, different models and datasets have been designed to lift the reasoning ability of modern vision systems. Inspired by RPM, Santoro et al. built the first large-scale RPM dataset, PGM, and proposed the Wild Relation Network (WReN), a relational model that leverages representations of pair-wise relations for each choice. Steenbrugge et al. then used a pre-trained Variational Auto-Encoder to improve the generalization performance of WReN. Zhang et al. generated a new RPM-style dataset, RAVEN, with structured representations and proposed the Dynamic Residual Tree (DRT), which exploits annotations of image structure. Both PGM and RAVEN are designed to be easy to recognize but hard to reason about. Zheng et al. proposed a student-teacher architecture to deal with distracting features. More recently, Wang et al. used a multi-layer multiplex graph to capture multiple relations between objects. Besides, Zhuo and Kankanhalli modified ResNet to reduce overfitting and proposed MCPT to solve RPM problems in an unsupervised manner.
Many previous studies have used modular neural architectures for various tasks. For example, Andreas et al. assembled networks flexibly from a collection of specialized substructures to answer questions, and Mao et al. jointly learned representations of visual concepts and semantic parses of sentences from images and question-answer pairs, even without explicit supervision, by using different modules to extract different kinds of information.
III-A Problem Formulation
The task is designed to measure non-verbal, cognitive, and abstract reasoning. In the task's setting, the agent is shown a matrix of geometric designs in which the last panel is missing. Given eight candidates for the missing panel, the agent must choose the correct one by following the analogical relations. See the example in Figure 1: this problem has an inside-outside structure in which the external component is a layout with a single centered object, and the inside component is a grid layout. The rules are listed in Figure 1. The compositional nature of the rules makes this problem a difficult one; here the correct answer is candidate seven.
The task can be formally defined as follows. Each training sample is denoted (x, y, m), where the input x contains 8 context panels and 8 candidate answers, y is the label, and m is the meta-target of the sample. Each input sample has a rule set R, where every rule r in R is a tuple (a, o), meaning that for a certain row or column of x, attribute a follows rule o; the attribute a is drawn from an attribute set A. The model takes as input the images x together with the meta-target m given by the dataset. In detail, the meta-target is a multi-hot vector consisting of an attribute part (representing a) and a rule part (representing o): each position of the vector corresponds to one attribute or rule, set to 1 if it is present in the sample and 0 otherwise. This information is used to learn the rules of the training samples.
Specifically, there are four types of rules in our setting: Constant, Progression, Arithmetic, and Distribute Three. As shown in Figure 1, they are denoted as [attribute: rule] pairs.
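The meta-target encoding described above can be sketched as follows. This is a minimal illustration, assuming RAVEN's five attribute names and the four rules listed above; the exact vector layout in the dataset annotations may differ.

```python
# Hypothetical sketch of the multi-hot meta-target encoding:
# the first slots mark attributes, the remaining slots mark rules.
ATTRIBUTES = ["Number", "Position", "Type", "Size", "Color"]
RULES = ["Constant", "Progression", "Arithmetic", "Distribute_Three"]

def encode_meta_target(rule_pairs):
    """Encode a set of (attribute, rule) pairs as one multi-hot vector:
    1 where an attribute or rule is present in the sample, 0 otherwise."""
    vec = [0] * (len(ATTRIBUTES) + len(RULES))
    for attr, rule in rule_pairs:
        vec[ATTRIBUTES.index(attr)] = 1
        vec[len(ATTRIBUTES) + RULES.index(rule)] = 1
    return vec

# e.g. the pairs [Size: Progression] and [Color: Constant]
v = encode_meta_target([("Size", "Progression"), ("Color", "Constant")])
```

Note that the encoding marks which attributes and which rules occur, but not their pairing; this is exactly the weak, semi-supervised nature of the meta-target signal.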
III-B Multi-Granularity Modularized Network
Our architecture involves two vector spaces: the space of visual representations and the space of reasoning rules. Given modules corresponding to different attribute types, we learn a projection from the high-dimensional representation space to the low-dimensional rule space and then look for the most similar embedding.
III-B1 Multi-Granularity Sensory Representation
Principles of psychological development suggest that the human capacity for abstract visual reasoning develops first through reasoning about physical, concrete objects, and then through the subsequent formation of categories and schemas. Inspired by this hierarchical reasoning strategy, we incorporate hierarchical features from three levels of granularity: panel-level, row-level, and overall-level. This multi-granularity sensory representation effectively captures both coarse-grained and fine-grained features, and the representations of the individual panels are coupled and interact with each other.
Panel-wise granularity: this branch takes each panel as input and handles the attributes of the graphical elements inside it. Moreover, we take the correlations among panels of the same row into account and apply a Relation Network to capture this inner relationship. For each panel in a row, we first use a Residual Network to extract its features; a WReN-style module is then used to build a representation of the pair-wise relationships among the 3 panels in the row.
Row-wise granularity: this branch takes each row as input. In RAVEN, the same rules are applied across rows. Motivated by this, we stack the three panels of a row together (instead of treating each panel separately, as in the panel-wise branch) and feed the stack to a pre-trained 3-channel ResNet, which encodes the entire row as a compact embedding.
Overall granularity: since the rules of the third row are the same as those of the first two, it is essential to take two rows together as input and jointly learn the rule patterns underlying them. We therefore perform pair-wise embedding to capture the interaction between two rows, much as WReN does: each row pair is combined, treated as a whole, and passed to a 6-channel ResNet that takes the pair-wise relationships among rows into consideration.
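The three granularities above can be summarized with a small sketch. The helper names are hypothetical; the point is the pair enumeration at panel granularity and the channel counts fed to the three ResNet encoders.

```python
from itertools import permutations

def panel_pairs(row_features):
    """All ordered pairs of the three panel features in one row, as
    consumed by the WReN-style relation module at panel granularity."""
    return list(permutations(row_features, 2))

def encoder_input_channels():
    """Input channels of the three ResNet encoders: a single panel,
    a stack of 3 panels (one row), and a stack of 6 panels (two rows)."""
    return {"panel": 1, "row": 3, "overall": 6}
```

In a framework like PyTorch, the differing channel counts would simply mean replacing the first convolution of each (otherwise identical) ResNet backbone, which matches the modification described in the experimental setup.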
III-B2 Modularized Reasoning
Here, we need to infer the corresponding rules from the representation and then obtain the answer. We introduce a bank of modularized functions, one per rule, so each module represents a specific rule, and we regard each rule as an operation. Given the attribute and the rule, a simple MLP then tells us how correct each candidate is. The goal of these functions is to learn the right parameterization of the modules so as to capture the right rule, such as "the size changes" or "the color remains the same". In this neuro-symbolic way [7, 3], the learning signal for the modules comes from the sensory representation, and the final candidate selection trains them jointly, yielding a rule embedding that is close to the projection. For training, we start from randomly initialized modules and feed the meta-target as a semi-supervised signal.
To bind an attribute to a specific module, we use the meta-target information in the dataset as supervision. The meta-target records the attributes and rules present in the current training sample. During training, we only update the networks corresponding to the attributes of the present sample; the MLPs of all other attributes are frozen. In this way, each MLP is bound to a unique attribute.
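This meta-target-driven freezing can be sketched as a simple selection step. The module objects here are hypothetical stand-ins for the per-attribute MLPs; in practice, freezing would mean excluding a module's parameters from the optimizer step.

```python
def select_trainable_modules(modules, present_attributes):
    """Split the per-attribute modules of one training sample into
    those that receive gradients (attribute present in the meta-target)
    and those that stay frozen (attribute absent)."""
    trainable = {a: m for a, m in modules.items() if a in present_attributes}
    frozen = {a: m for a, m in modules.items() if a not in present_attributes}
    return trainable, frozen

modules = {"Size": "mlp_size", "Color": "mlp_color", "Type": "mlp_type"}
train, freeze = select_trainable_modules(modules, {"Size"})
```

Because only the modules named by the meta-target are ever updated on a sample, each MLP gradually specializes to its own attribute, which is exactly the binding described above.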
Moreover, we define the proposed modularized functions over the attribute set: for each attribute, the multi-granularity representation is fed to its module to obtain the transformation implied by one rule. The modularized functions thus infer the transformation applied to a specific attribute under the meta-targets. Cosine similarity is then applied to find the inferred transformation most similar to the correct one given by the meta-target; concretely, we compute the similarity between the inferred rule embedding and each candidate embedding to choose the best candidate.
For inference, we apply cosine similarity as a score function for each candidate. The scores are computed separately at the three granularities, and we choose the candidate with the highest total score; the corresponding candidate answer is the output of the model.
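The scoring step above can be sketched as follows: sum each candidate's cosine similarity to the inferred rule embedding over the available granularities and take the argmax. The dictionary layout is a hypothetical simplification of the model's actual tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def choose_candidate(rule_embs, cand_embs):
    """rule_embs[g]: inferred rule embedding at granularity g;
    cand_embs[g][i]: embedding of candidate i at granularity g.
    A candidate's total score sums its cosine similarity to the rule
    embedding over all granularities; the argmax is the model's answer."""
    n = len(next(iter(cand_embs.values())))
    totals = [sum(cosine(rule_embs[g], cand_embs[g][i]) for g in rule_embs)
              for i in range(n)]
    return max(range(n), key=totals.__getitem__)
```

Summing per-granularity scores, rather than learning a fusion layer, keeps each granularity's contribution to the final decision directly inspectable.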
Our model finally chooses one image from the candidate set to complete the matrix correctly, i.e., to satisfy the underlying rules of the matrix.
We train the model with a combined loss L = L_ans + λ·L_meta, where y is the target label and the answer term L_ans encourages selecting the correct answer; λ is an adjustable hyperparameter controlling the relative weight of the two losses. The weak supervision comes from the concept space and from the answer: the meta-target constrains which kinds of operations are used, and the answer constrains what the output of each modularized function should be.
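A minimal sketch of this combined objective, assuming (as the multi-hot meta-target suggests) a binary cross-entropy term for the meta-target prediction; the actual answer-selection loss is left abstract here.

```python
import math

def bce(pred, target):
    """Binary cross-entropy over a multi-hot meta-target vector
    (pred: predicted probabilities, target: 0/1 ground truth)."""
    eps = 1e-9
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

def total_loss(answer_loss, meta_pred, meta_target, lam):
    """L = L_ans + lam * L_meta, with lam the adjustable weight
    trading off answer supervision against meta-target supervision."""
    return answer_loss + lam * bce(meta_pred, meta_target)
```

With lam = 0 this reduces to pure answer supervision; increasing lam pushes the modules to respect the attribute/rule annotations, which is the semi-supervised signal described above.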
In our experiments, the dataset is split into training, validation, and testing sets with a 6:2:2 ratio. The input images are resized before being fed to the model. Because the input channels of the three levels differ, we modify the first convolution of each ResNet to accept 1, 3, and 6 channels, respectively, to fit the shape of the input. Our model contains MLPs with the same structure (linear and dropout layers). In RAVEN, each sample has two components, and each component has five attributes. We use Adam as the optimizer for training. Our code is available at https://github.com/creeper121386/RAVEN-test.
In total, there are seven configurations, as shown in Figure 2. To test the generalization performance of models on unseen data, we design two experiments: (a) models are trained on a subset of the data but tested on another subset with an unseen layout; (b) models are trained on a subset of the data from which one specific rule is withheld, and then tested on another subset containing that rule. Results can be found in Tables II and III. They show that MMoN is well suited for abstract visual reasoning (see Table I) and generalizes in an explainable way.
We propose a novel Multi-Granularity Modularized Network that achieves high accuracy and remains stable across the different layouts of RAVEN.
We sincerely thank Rodolfo Corona for useful discussions and assistance with data analysis. This work is substantially supported by the National Natural Science Foundation of China under grant number 61572223 and by the University-Industry Collaborative Education Program between the Ministry of Education of China and Google Information Technology (China) Co., Ltd. (PJ190496).
- Andreas et al. [2016b] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016b.
- Han et al.  Chi Han, Jiayuan Mao, Chuang Gan, Josh Tenenbaum, and Jiajun Wu. Visual concept-metaconcept learning. In Advances in Neural Information Processing Systems, pages 5002–5013, 2019.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- Luszcs  Mary A Luszcs. Psychological development. North Holland, 1989.
- Mao et al. [2019a] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations, 2019a.
- Raven  John Raven. The raven’s progressive matrices: change and stability over culture and time. Cognitive psychology, 41(1):1–48, 2000.
- Santoro et al.  Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
- Santoro et al.  Adam Santoro, Felix Hill, David Barrett, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pages 4477–4486, 2018.
- Steenbrugge et al.  Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization for abstract reasoning tasks using disentangled feature representations. arXiv preprint arXiv:1811.04784, 2018.
- Wang et al.  Duo Wang, Mateja Jamnik, and Pietro Lio. Abstract diagrammatic reasoning with multiplex graph networks. In International Conference on Learning Representations, 2020.
- Weiten  Wayne Weiten. Psychology: Themes and variations: Themes and variations. Cengage Learning, 2007.
- Zhang et al. [2019a] Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5317–5327, 2019a.
- Zhang et al. [2019b] Chi Zhang, Baoxiong Jia, Feng Gao, Yixin Zhu, Hongjing Lu, and Song-Chun Zhu. Learning perceptual inference by contrasting. In Advances in Neural Information Processing Systems, pages 1073–1085, 2019b.
- Zheng et al.  Kecheng Zheng, Zheng-Jun Zha, and Wei Wei. Abstract reasoning with distracting features. In Advances in Neural Information Processing Systems, pages 5834–5845, 2019.
- Zhuo and Kankanhalli  Tao Zhuo and Mohan Kankanhalli. Solving raven’s progressive matrices with neural networks. arXiv preprint arXiv:2002.01646, 2020.