1 Introduction
Raven’s Progressive Matrices (RPM) raven1941standardization; raven1936mental; raven1998raven is a widely acknowledged test of human cognitive ability. RPM primarily assesses a person’s capacity to perform comparisons, reason by analogy, and think logically. The RPM task is thus considered to have the highest “g-loading” and plays a central role among cognitive tests klein2018scrambled.
The vision community has often employed Raven’s test to evaluate the reasoning skills of AI models carpenter1990one; hoshen2017iq; little2012bayesian; lovett2017modeling; lovett2010structure; lovett2009solving. These tests are presented as a matrix in which each cell contains a visual geometric design except the last, which is left empty (Figure 1). The AI model must pick the best-fit image from a provided list of 6 to 8 choices. To answer correctly, the model has to deduce the abstract relationships among visual features such as shape, position, color, and size, and the underlying rule applied to them in the matrix. These relationships were first proposed by carpenter1990one: constant in a row or column, quantitative pairwise progression, figure addition or subtraction, distribution of three values, and distribution of two values. The visual IQ score a model obtains from solving abstract reasoning tasks provides grounds to compare AI against human intelligence.
Please refer to Figure 1 for an example RPM prepared using the relation constant in a row. Early research employed computational models that relied on hand-crafted heuristic rules over propositions formed from visual inputs to solve RPM carpenter1990one; bringsjord2003artificial; lovett2007analogy. Recent approaches instead use visual representations learned by neural networks hoshen2017iq; barrett2018measuring; zhang2019raven; van2019disentangled. The Wild Relation Network (WReN) barrett2018measuring, built on the Relation Network santoro2017simple, is the current state-of-the-art neural model for Raven’s test. While WReN outperforms other reasoning models on datasets such as PGM barrett2018measuring and RAVEN zhang2019raven, its performance is still suboptimal compared to humans. These setbacks are caused by the lack of effective, task-appropriate visual representations: during training, the model should separate out the key attributes needed for reasoning. Such disentangled representations bengio2013representation; ridgeway2016survey break visual features down into their independent generative factors, i.e., factors that can generate the full spectrum of variation in the ambient data. We argue that better disentanglement is essential for better machine reasoning. A recent study locatello2019challenging showed, via an impossibility theorem, the limitation of learning disentanglement independently: without some form of inductive bias, learning disentangled factors is impossible. Since collecting labels for the generative factors is challenging and often impossible in real-world datasets, previous works have focused on semi-supervised or weakly supervised methods, such as observing a subset of ground-truth factors kingma2014semi, using paired images that share values for a subset of factors bouchacourt2018multi, or knowing the rank of a subset of factors for a pair of images wang2014learning. In our work, we improve the model’s reasoning ability by using the inductive bias present in the spatial features. This underlying structure induces weak supervision that improves disentanglement, and hence reasoning.
Our work considers the benefit of jointly learning disentangled representations and learning to reason (critical thinking). Unlike the models above, which work in a staged process to improve either disentangling or downstream accuracy, we address the weaknesses of both components and propose a novel way to optimize both in a single end-to-end trained model. We demonstrate the benefits of the interaction between representation learning and reasoning ability.
Our motivation for using the same evaluation procedure as van2019disentangled is as follows: 1) the strong visual presence; 2) information about the generative factors helps demonstrate model efficacy on both reasoning accuracy and disentanglement (strong correlation); 3) the possibility of comparing our disentangling results with state-of-the-art (SOTA) ones.
Our contributions are summarized as follows:

We propose a general generative graphical model for RPM, GMRPM, which forms the essential basis for inductive bias in joint representation and reasoning learning.

Building upon GMRPM, we propose a novel learning framework named Disentangling-based Abstract Reasoning Network (DAReN), composed of two primary components: a disentanglement network and a reasoning network. DAReN learns to disentangle factors and uses the resulting representation to detect both the underlying relationship and the object property it is applied to.

We show that DAReN outperforms all SOTA baseline models across reasoning and disentanglement metrics, demonstrating that reasoning and disentangled representations are tightly related; learning both in unison can effectively improve the downstream reasoning task.
2 Related Works
Visual Reasoning. Earlier works on abstract visual reasoning took traditional approaches, such as rule-based heuristics over symbolic representations carpenter1990one; lovett2017modeling; lovett2010structure; lovett2009solving or relational structures in the images little2012bayesian; mcgreggor2014confident; mekik2018similarity. These methods were limited in their ability to fully address reasoning tasks due to two underlying assumptions. First, the machines were assumed to have access to symbolic representations of the images, from which rules were inferred. Second, they required domain expertise to understand the operations and comparisons in the design principles of these reasoning tasks. Fully addressing these tasks became feasible when wang2015automatic proposed a systematic way of automatically generating RPMs using first-order logic. Recently, there has been growing interest in using deep neural networks for abstract reasoning, owing to their ability to learn powerful representations directly from images, and significant progress has followed. Hoshen and Werman hoshen2017iq employed a CNN to find the matching choice for the context images, while barrett2018measuring proposed the Wild Relation Network (WReN) to model the relationships between context and choice images.
Disentanglement. In recent years, research on learning disentangled representations has gained momentum bengio2013representation; higgins2016beta; kim2018disentangling; kim2019bayes; kim2019relevance; locatello2019challenging; ridgeway2016survey; tschannen2018recent. However, the area has not reached consensus on two major points: i) there is no widely accepted formal definition bengio2013representation; ridgeway2016survey; locatello2019challenging; tschannen2018recent, and ii) there is no single robust evaluation metric to compare models burgess2018understanding; kim2018disentangling; chenisolating; eastwood2018framework; kumar2017variational. The one property common to all models that claim to learn disentangled representations is the extraction of statistically independent ridgeway2016survey learned factors. A majority of research in this area builds on the definition of bengio2013representation, which states that the underlying generative factors correspond to independent latent dimensions, such that changing a single factor of variation should change only a single latent dimension while leaving the others invariant. Recently, locatello2019challenging proved a theoretical impossibility result: it is impossible to learn disentangled representations without inductive bias. Consequently, there has been a shift towards semi-supervised locatello2019disentangling; sorrenson2020disentanglement; khemakhem2020variational and weakly supervised locatello2020weakly; bouchacourt2018multi; hosoya2018group; shu2019weakly disentangling models. In this work, we focus both on learning disentangled representations and on solving abstract visual reasoning. A large-scale study by van2019disentangled showed, using previous state-of-the-art (SOTA) disentangling models, that disentangled representations help in solving reasoning tasks. We use the inductive bias present in the reasoning questions to jointly optimize both tasks. In the sections below, we propose a general framework for abstract reasoning and a novel model design (DAReN), followed by empirical results comparing DAReN with previous SOTA methods.
3 Problem Formulation and Approach
We first describe Raven’s Progressive Matrices, which form our visual reasoning task, in Section 3.1. We then propose our general generative graphical model for RPM, GMRPM, which forms the essential basis for inductive bias in joint representation and reasoning learning, in Section 3.2. Finally, in Section 3.3, we describe our novel learning framework, based on a variational autoencoder (VAE) and a reasoning network, for joint representation-reasoning learning.
3.1 Visual Reasoning Task
The Raven’s matrix, denoted $X$, of size $N \times M$, contains images at all locations except the last. The aim is to find the best-fit image for that location from a list of choices $C$. For our current work, we fix $N = M = 3$, with images $x_1, \dots, x_9$ in row-major order, where $x_9$ is empty and must be filled with the correct image from the choices. We also set the number of choices to $k$, where $C = \{c_1, \dots, c_k\}$.
We formulate an abstract representation of all possible patterns in $X$ by defining a structure $S$ on the image attributes ($A$) and on the relation types ($R$) applied to those attributes.
The set $R$ consists of the relations proposed by carpenter1990one: constant in a row, quantitative pairwise progression, figure addition or subtraction, distribution of three values, and distribution of two values. We describe our approach below assuming these relationships hold on the rows of the RPM; the procedure is the same for modeling relationships on the columns. Similarly, the image attribute set $A$ consists of object type, size, position, and color. These image attributes are usually the underlying generative factors of the images. The structure $S$ is a set of tuples, each formed by randomly sampling a relation from $R$ and an image attribute from $A$. The set $S$ can contain at most $|A|$ tuples, and the difficulty rises with the size of $S$, $R$, or $A$, or any combination of them. Finally, the matrix $X$ is created using the tuples in $S$. For the matrix to be a valid RPM, an additional consistency constraint, described below, is required.
RPM Constraint.
Using $S$, multiple realizations of the matrix $X$ are possible depending on the randomly sampled attribute values. For example, if $S$ = {(constant in a row, object type), (constant in a row, object size)}, every image in each row of $X$ will have the same (constant) value for object type and object size.^1 The values of image attributes not in $S$ are sampled randomly for every image. These sampled values must not comply with the relation set across all rows or columns of $X$. In the example above, where position and color are not in $S$, a valid $X$ can still have the same values for position or color (or both) within the first row, as long as the values in the other two rows differ (so as not to contradict $S$). This is an example of a distractor: attributes outside $S$ may, during sampling, satisfy some relation for one row of $X$ but not for all rows. These randomly varying values add a layer of difficulty to solving RPM. Any model trained to solve $X$ has to find the attribute subset that is consistent across all rows or columns of $X$ and discard the distracting features. In the rest of the paper, for simplicity, we work with row-based relationships in RPM; our solution extends easily to columns or to both rows and columns.

^1 We interchangeably use $a$ to denote the subset of image attributes that adhere to the rules of the RPM and the multi-hot vector whose non-zero values index those attributes.
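The sampling of the structure $S$ and the constant-in-a-row constraint above can be sketched as follows. This is a minimal, hypothetical sketch (function and variable names are our own, not the paper's); it builds a grid of attribute-value vectors rather than rendered images.

```python
import random

import numpy as np

# Hypothetical attribute/relation sets; the paper's S is a set of
# (relation, attribute) tuples sampled from R x A.
RELATIONS = ["constant_in_a_row"]          # the relation DAReN targets
ATTRIBUTES = ["type", "size", "position", "color"]

def sample_structure(max_tuples=2, rng=random):
    """Sample the abstract structure S: (relation, attribute) tuples."""
    attrs = rng.sample(ATTRIBUTES, k=rng.randint(1, max_tuples))
    return [(rng.choice(RELATIONS), a) for a in attrs]

def build_rpm(structure, n_rows=3, n_cols=3, n_values=8, rng=np.random):
    """Build a 3x3 grid of attribute-value vectors honouring S.

    Attributes in S are held constant within each row; all other
    attributes are sampled freely per image (the distractors).
    """
    ruled = {a for _, a in structure}
    grid = rng.randint(0, n_values, size=(n_rows, n_cols, len(ATTRIBUTES)))
    for i in range(n_rows):
        for k, a in enumerate(ATTRIBUTES):
            if a in ruled:                 # enforce constant-in-a-row
                grid[i, :, k] = grid[i, 0, k]
    return grid

def satisfies_constant_in_row(grid, attr_idx):
    """Check whether attribute attr_idx is constant in every row."""
    return all(len(set(row[:, attr_idx].tolist())) == 1 for row in grid)
```

Note that a distractor attribute may, by chance, also be constant in one row; a valid generator would reject samples where a distractor satisfies the relation across all rows.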
3.2 Inductive Prior for RPM (GMRPM)
While previous works have made strides in solving RPMs little2012bayesian; lovett2017modeling; van2019disentangled, a gap in reasoning and representation learning remains between those approaches and human performance. To narrow this gap, we propose a minimal inductive bias in the form of the probabilistic graphical model described here, which can guide the joint representation-reasoning learning. Figure 2 defines the structure of the general generative graphical model for RPM. This model describes an RPM $X = \{x_{ij}\}$, where $x_{ij}$ denote the images in the puzzle, with the correct answer at $(3,3)$, defined by a rule $r$ on the subset of possible attributes indexed by the multi-hot attribute selection vector $a$.
Latent vectors $z_{ij}$ are the representations of the attributes, to be learned by our approach, together with some inherent noise process encompassed in the remaining dimensions of $z_{ij}$, denoted $\epsilon_{ij}$, which we refer to as nuisances. Ideally, after an optimal model is learned, some factors in $z$ should be isomorphic to the attributes themselves in this simple RPM setting. The latent vector gives rise to ambient images through some stochastic nonlinear mapping $g_\theta$,^2 parameterized by $\theta$, which is to be learned,

^2 We drop RPM indices where obvious.
(1) $x_{ij} = g_\theta(z_{ij}), \qquad i \in \{1, \dots, N\},\; j \in \{1, \dots, M\},$

where $Z = \{z_{ij}\}$ is the latent tensor for the RPM.
The RPM inductive bias comes from the way the priors on $Z$ are formed, given the unknown rule $r$. Specifically,
(2) $z_{ij} = \left[\, z^{(a)}_{ij},\; z^{(\bar a)}_{ij},\; \epsilon_{ij} \,\right],$

where $z^{(a)}_{ij}$ is the latent representation of the factors used in rule $r$ and $z^{(\bar a)}_{ij}$ is the latent representation of the complementary, unused factors. The key in RPM is how the priors on those factors are defined. The factors used in the rule, grouped as the tensor $Z_a$, follow some joint density over elements of the same row
(3) $p\!\left(Z_a^{(i)} \mid r, a\right) = p_r\!\left(z^{(a)}_{i1}, \dots, z^{(a)}_{iM}\right),$

where $Z_a^{(i)}$ is the matrix of the rule-factor latent representations of all images in row $i$ of the RPM. The factors not used in the rule, $Z_{\bar a}$, have a different, iid prior
(4) $p\!\left(Z_{\bar a}\right) = \prod_{i,j} \mathcal{N}\!\left(z^{(\bar a)}_{ij} \,\middle|\, 0, I\right).$
This is depicted by the innermost plate, “columns: M”, in Figure 2. Finally, the factors representing the noise information have an iid prior
(5) $p(E) = \prod_{i,j} \mathcal{N}\!\left(\epsilon_{ij} \,\middle|\, 0, I\right).$
Additionally, we assume that all factors are independent across those used for the puzzle, those not used, and the nuisances, $p(Z \mid r, a) = p(Z_a \mid r, a)\, p(Z_{\bar a})\, p(E)$.
This finally gives rise to the full generative model for RPM (GMRPM),
(6) $p(X, Z \mid r, a) = p(Z_a \mid r, a)\; p(Z_{\bar a})\; p(E) \prod_{i,j} p_\theta\!\left(x_{ij} \mid z_{ij}\right).$
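As a sanity check, latents can be drawn from the GMRPM prior for the constant-in-a-row rule: rule factors share one value per row, while non-rule and nuisance factors are iid standard normal. This is a minimal numpy sketch; the dimension sizes (`d_a`, `d_abar`, `d_eps`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_gmrpm_latents(n_rows=3, n_cols=3, d_a=1, d_abar=3, d_eps=2, rng=None):
    """Sample a latent tensor Z from the GMRPM prior (constant-in-a-row).

    z_a:    rule factors, one draw per row, copied across its columns
    z_abar: non-rule factors, iid N(0, I) per image
    z_eps:  nuisance factors, iid N(0, I) per image
    """
    rng = rng or np.random.default_rng(0)
    row_vals = rng.standard_normal((n_rows, 1, d_a))   # one draw per row
    z_a = np.repeat(row_vals, n_cols, axis=1)          # shared along the row
    z_abar = rng.standard_normal((n_rows, n_cols, d_abar))
    z_eps = rng.standard_normal((n_rows, n_cols, d_eps))
    return np.concatenate([z_a, z_abar, z_eps], axis=-1)
```

Pushing such a sample through the decoder $g_\theta$ of (1) would yield one realization of the puzzle.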
3.3 Representation and Reasoning: DAReN
Inspired by our GMRPM, we propose a novel framework for learning representation and reasoning named Disentangling based Abstract Reasoning Network (DAReN). Please refer to Figure 3 for an overview of DAReN.
DAReN is composed of two primary components, a variational autoencoder (VAE) module and a reasoning module. The encoder encodes the matrix $X$ (without the last element) into a latent representation $Z$, whose structure follows (2), and the choice list $C$ into $Z^C$. The latent variables $Z$ and $Z^C$ are input to both the disentangling and the reasoning module. In the disentangling module, we apply an averaging strategy to the subset of underlying factors indexed by $\hat a$. The updated latent variable $\hat Z$ is given to the decoder to reconstruct the original images. The input to the reasoning module is prepared by placing each choice latent at the missing location $(3,3)$, yielding $k$ candidate completions. The reasoning network outputs a probability score for each choice; the choice with the highest value is predicted as the best-fit image. In the following, we describe the disentanglement learning using the VAE, followed by the reasoning module.

Inference
Our proposed generative model uses a VAE kingma2013auto to learn the disentangled representations. As described in Section 3.2, the goal is to infer the values of the latent variables that generated the observations, i.e., to compute the posterior over $Z$, which is intractable. Instead, kingma2013auto proposed an approximation using a variational posterior $q_\phi(Z \mid X)$, where $\phi$ are the variational parameters. In this work, we further define this variational posterior as
(7) $q_\phi(Z \mid X) = \prod_{i,j} q_\phi\!\left(\tilde z_{ij} \mid x_{ij}\right).$
Currently, DAReN is designed to solve for $r$ = constant in a row. Here $\tilde Z$ is an intermediate variable, used to arrive at the final estimate $\hat Z$ via the Rule Enforcing Constraint function, as described further in this section. We use the inductive bias of Sec. 3.2 to define this procedure, as described below.

Disentangling Module
Initial inference of the latent factors, $\tilde Z$, is accomplished using a disentangling VAE encoder tschannen2018recent, which uses the following general ELBO objective
(8) $\max_{\theta, \phi}\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \lambda_1 R_1\!\left(q_\phi(z \mid x)\right) - \lambda_2 R_2\!\left(q_\phi(z)\right).$
Here, the regularizer $R_1$ controls the information flow in $q_\phi(z \mid x)$, the regularizer $R_2$ induces disentanglement on the latent representation, and the hyperparameters $\lambda_1, \lambda_2$ control the weight of each regularizer. Most regularization-based disentangling objectives can be subsumed into this form.

Weak Supervision via RPM Structure
We use the inductive bias present in the RPM structure to guide our disentangling module. We describe here how to infer the possible attribute dimensions. We initialize the variational parameters $\phi$ using a partially pre-trained network and then compute the variance across each factor index. Indices whose variance exceeds a certain threshold are considered factors and the rest nuisances.
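The variance-based split above can be sketched as follows. This is a minimal sketch under our own naming; the threshold value `tau` is a hypothetical placeholder for the paper's hyperparameter.

```python
import numpy as np

def split_factors(latents, tau=0.5):
    """Split latent dimensions into candidate factors and nuisances.

    latents: (n_samples, d) matrix of encoder means from a partially
    pre-trained network. Dimensions whose variance across samples exceeds
    tau are treated as factors; the rest as nuisances.
    """
    var = latents.var(axis=0)
    factor_idx = np.flatnonzero(var > tau)
    nuisance_idx = np.flatnonzero(var <= tau)
    return factor_idx, nuisance_idx
```

The intuition: generative factors must vary across the dataset to produce the observed images, while nuisance dimensions of a (partially) trained encoder collapse toward the prior and carry little variance.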
Enforcing Fixed Factor Constraint
We infer the set bits of $\hat a$ for a given $X$ from the prior in (3): the image attributes share the same value at the set bits of $\hat a$ within each row. We compute the KL divergence over the latent vectors in each row and extract the indices with the lowest divergence values as $\hat a$,
(9) $d^{(i)}_k = \sum_{j \ne j'} D_{KL}\!\left(q_\phi\!\left(\tilde z^{(k)}_{ij} \mid x_{ij}\right) \,\middle\|\, q_\phi\!\left(\tilde z^{(k)}_{ij'} \mid x_{ij'}\right)\right),$

where $j, j'$ range over the indices in row $i$. The divergence $d^{(i)}_k$ is used to determine whether the factor at index $k$ is consistent in row $i$. The set $\hat a$ consists of the indices with values lower than a threshold; in the empirical study, we fix the number of selected indices to one.
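The fixed-factor selection can be sketched as a pairwise KL computation over the diagonal-Gaussian posteriors of one row. A minimal sketch with our own function names; the closed-form KL for univariate Gaussians is standard.

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """Per-dimension KL( N(mu1, var1) || N(mu2, var2) ), diagonal Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def row_consistent_index(mu, var):
    """Score each latent index by summed pairwise KL across a row's images.

    mu, var: (n_cols, d) posterior means/variances for one row. The
    lowest-scoring index is the best candidate for the rule attribute.
    """
    n = mu.shape[0]
    score = np.zeros(mu.shape[1])
    for j in range(n):
        for k in range(n):
            if j != k:
                score += kl_diag_gauss(mu[j], var[j], mu[k], var[k])
    return int(np.argmin(score)), score
```

A dimension governed by a constant-in-a-row rule has nearly identical posteriors across the row, so its summed divergence is near zero, while distractor dimensions accumulate large divergences.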
Enforcing Rule Constraint
We now describe the preparation of the estimated latent variable $\hat Z$. An averaging strategy is applied to the latent vectors: we use the set bits of $\hat a$ extracted above to enforce the relation on the images in each row $i$, while the values at the non-rule and noise indices remain unchanged in $\hat Z$.
For $r$ = constant in a row, the averaging strategy is a variant of the method in Multi-Level VAE bouchacourt2018multi,

(10) $\hat z^{(k)}_{ij} = \frac{1}{M} \sum_{j'=1}^{M} \tilde z^{(k)}_{ij'}, \qquad k \in \hat a.$
The final latent vector $\hat Z$ is assembled as per (2). From the above constraints, we obtain a new representation for each image in $X$, in which the attributes indexed by $\hat a$ are constrained by the rule. The decoder network maps $\hat Z$ back to the original images. Since the attributes vary across $X$, these constraints force the model to keep the factors separate, which improves disentangling.
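The row-wise averaging of (10) amounts to replacing the rule-factor dimensions by their row mean. A minimal sketch (our own naming), operating on a tensor of posterior means:

```python
import numpy as np

def enforce_rule(z, factor_idx):
    """Average the rule-factor dimensions across each row (constant-in-a-row),
    leaving non-rule and nuisance dimensions untouched.

    z: (n_rows, n_cols, d) latent means; factor_idx: list of indices in a-hat.
    """
    z_hat = z.copy()
    row_mean = z[:, :, factor_idx].mean(axis=1, keepdims=True)
    z_hat[:, :, factor_idx] = row_mean      # broadcast shared value per row
    return z_hat
```

Because the decoder must still reconstruct every image from $\hat Z$, any rule-relevant information that leaked into other dimensions is penalized, which is what pushes the factors apart.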
3.4 Reasoning Module
The reasoning component of DAReN incorporates a relational structure to infer the abstract relationship on the attributes of the images in $X$. Its input is the disentangled latent variables of $X$ and $C$. We prepare $k$ possible last rows by placing each choice image from $C$ at the missing location $(3,3)$. Given $\hat a$, the correct image must satisfy the relation in the last row, since the relation already holds for the top rows. To determine the correct choice, we contrast the values at the indices $\hat a$ by computing the variance over all images in a row; we do this for the top rows and for each probable last row. Additionally, we append a positional encoding to each row-wise variance. We then concatenate the variances of the top rows with each possible last row to prepare $k$ choice vectors. Finally, we train a basic three-layered MLP to process each choice vector and produce a logit vector of size $k$.
The choice with the highest score is selected as the correct answer. We train the reasoning model using cross-entropy loss.

4 Experiments
4.1 Datasets
We study the performance of DAReN on six datasets: (i) dSprites matthey2017dsprites, (ii) modified dSprites van2019disentangled, (iii) Shapes3D kim2018disentangling, and (iv-vi) MPI3D (Real, Realistic, Toy) gondal2019transfer. We use experimental settings similar to van2019disentangled to create RPMs for these datasets. Please see the Appendix for descriptions of the datasets and details on RPM preparation. We compare against SOTA methods on both reasoning accuracy and disentanglement scores, and show that our model outperforms them in both categories.
Model/Dataset  DSprites  Modified DSprites  Shapes3D  MPI3D–Real  MPI3D–Realistic  MPI3D–Toy 

StagedWReN van2019disentangled  
E2EWReN  
DAReN 
4.2 Experimental Setup, Baselines and Evaluation Metric
For our experiments, we use the objectives proposed in FactorVAE kim2018disentangling as the regularizers in (8) for the disentangling module. In FactorVAE, the information-flow regularizer is the KL term $D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$ and the disentanglement regularizer is the total correlation $D_{KL}\!\left(q(z) \,\|\, \prod_d q(z_d)\right)$. All our models are implemented in PyTorch NEURIPS2019_9015 and optimized using the ADAM optimizer kingma2014adam, with a learning rate of 1e-4 for the reasoning + representation network, except for the FactorVAE discriminator, which uses 1e-5. To demonstrate that our approach is less sensitive to the choice of hyperparameters and network initialization, we sweep over 35 different settings of hyperparameters and initialization seeds. We keep the architecture and other hyperparameters similar to the experiments in van2019disentangled. The image size in all six datasets is 64 × 64 × 3 pixels, with pixel intensities scaled to [0, 1]. We use a batch size of 64 during training.

Baselines.
We refer to the training process of van2019disentangled as StagedWReN for convenience. With the above settings, we evaluate DAReN against StagedWReN for the abstract reasoning task and against FactorVAE for disentangled representation learning. StagedWReN is a two-stage training process: a disentangling network is trained first, followed by training a Wild Relation Network (WReN) barrett2018measuring on RPM using the fixed disentangled representation from the trained encoder. Similarly, we use the weights of a partially pre-trained model to initialize our VAE; however, we start from an earlier checkpoint than StagedWReN and train DAReN for 100K iterations. We also propose an adapted baseline over StagedWReN, referred to as E2EWReN, in which we train both the disentangling and the reasoning network jointly end-to-end. We adopt the notation of van2019disentangled for context panels and answer panels; WReN is evaluated for each answer panel in relation to the context panels. The learning objective of E2EWReN is
(11) $\mathcal{L}_{\text{E2EWReN}} = \mathcal{L}_{\text{WReN}} + \mathcal{L}_{\text{VAE}},$
which is similar to StagedWReN; however, E2EWReN also trains the VAE module. Please refer to the appendix for a detailed description of E2EWReN.
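The joint training signal can be sketched as the sum of the VAE terms and the reasoning cross-entropy. A minimal numpy sketch under our own naming; the weights `beta` and `gamma` are hypothetical, standing in for the paper's hyperparameters.

```python
import numpy as np

def softmax_xent(logits, target):
    """Cross-entropy of the reasoning module's choice logits (stable form)."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def e2e_loss(recon_err, kl, ce, beta=1.0, gamma=1.0):
    """Joint E2E-style objective (sketch): VAE reconstruction and KL terms
    plus the reasoning cross-entropy, with hypothetical weights."""
    return recon_err + beta * kl + gamma * ce
```

Because the cross-entropy gradient flows through the encoder, the latent representation receives reasoning-driven signals in addition to the reconstruction signal, which is the difference from the staged setup.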
4.3 Analyzing Abstract Visual Reasoning Results
We report performance as the mean and variance of reasoning accuracy (Table 1) over the six benchmark datasets. We start by comparing StagedWReN against the adapted baseline E2EWReN. We observe an increase in performance from StagedWReN to E2EWReN, which supports our hypothesis that better representations yield improved reasoning. WReN learns a relational structure by aggregating all pairwise relations formed within the context panels and between context and choice panels in the latent space to find the best-fit choice for the context. Since the working procedure of StagedWReN and E2EWReN is otherwise the same, the difference in reasoning performance is credited to the stronger signals in the latent representation of E2EWReN that backpropagate from the reasoning module. Despite this, WReN remains suboptimal at learning the underlying reasoning patterns. One reason is that the pairwise relations formed between the context panels do not contribute towards predicting any choice, because their values remain the same across all choice panels. Consequently, a trained WReN learns to ignore within-context pairwise comparisons; only the pairwise scores between context panels and each choice panel are used for inferring the relationship. We have empirically verified this on both trained networks (StagedWReN and E2EWReN) over all datasets: the scores of these non-informative comparisons do not impact the final prediction. A detailed analysis of the empirical results is provided in the appendix. This blind formulation of pairwise relations is a major downside of the quality of learning in WReN. From GMRPM, we can observe that the relation binds the image attributes in a row-wise manner; hence, any pairwise comparison between a choice panel (including the correct image) and a context panel in the top rows does not help in learning the rule or the attributes.
For example, when $r$ = constant in a row, a pairwise comparison between a context panel in the top rows and a choice panel (including the correct answer) can never share the same values for the rule attributes. Forming every possible pairwise relation therefore hinders model learning. In DAReN, we avoid forming general pairwise combinations and instead use GMRPM to model the structure. DAReN outperforms E2EWReN on all datasets. These results provide evidence of a strong affinity between reasoning and disentanglement, which DAReN is able to exploit.
4.4 Analyzing Disentangling Results
As illustrated above, while learning the relations, DAReN also separates the underlying latent factors. In Table 2, we report the disentanglement scores of the trained models on five widely used evaluation metrics: the β-VAE metric higgins2016beta, the FactorVAE metric kim2018disentangling, DCI disentanglement eastwood2018framework, MIG chenisolating, and SAP score kumar2017variational. We compare the latent representations learned by all three models at the same iteration. As described in Section 4.3, a joint learning framework not only improves reasoning performance but also disentangles the latent vector more strongly than FactorVAE, an unsupervised (reasoning-free) approach. E2EWReN improves upon FactorVAE but is still suboptimal at disentangling the latent factors, as observed in Table 2. One major reason for DAReN's large improvements over FactorVAE, and even E2EWReN, is its extraction of the underlying generative factors and the averaging strategy over the least-varying index in each row. During unsupervised training, indices of nuisance factors may receive weak signals from the true generative factors; DAReN removes this mixing of true signals with the nuisance factors by separating the factor and nuisance dimensions.
5 Conclusion
In this paper, we introduce an end-to-end learning framework that jointly learns representation and reasoning, using inductive bias as a weak form of supervision on Raven’s Progressive Matrices. To this end, we propose a general generative graphical model for RPM (GMRPM) as a prior for the reasoning task. We realize a new joint learning framework, DAReN, based on the principles defined in GMRPM. DAReN is composed of two components, a disentanglement network and a reasoning network; the disentangling network allows easy integration of different representation learning frameworks. We fix the relation to constant in a row and evaluate DAReN on six benchmark datasets, measuring both reasoning and disentangling performance. Abstract visual reasoning is a hard problem in the presence of latent generative factors, relations, and image attributes. We show evidence of a strong correlation between learning disentangled representations and solving reasoning tasks. The general nature of GMRPM and DAReN opens further possibilities for research in joint learning of representation and reasoning, including generalization to arbitrary relationships.