DAReN: A Collaborative Approach Towards Reasoning And Disentangling

by   Pritish Sahu, et al.

Computational learning approaches to solving visual reasoning tests, such as Raven's Progressive Matrices (RPM), critically depend on the ability of the computational approach to identify the visual concepts used in the test (i.e., the representation) as well as the latent rules based on those concepts (i.e., the reasoning). However, learning of representation and reasoning is a challenging and ill-posed task, often approached in a stage-wise manner (first representation, then reasoning). In this work, we propose an end-to-end joint representation-reasoning learning framework, which leverages a weak form of inductive bias to improve both tasks together. Specifically, we propose a general generative graphical model for RPMs, GM-RPM, and apply it to solve the reasoning test. We accomplish this using a novel learning framework, Disentangling based Abstract Reasoning Network (DAReN), based on the principles of GM-RPM. We perform an empirical evaluation of DAReN over several benchmark datasets. DAReN shows consistent improvement over state-of-the-art (SOTA) models on both the reasoning and the disentanglement tasks. This demonstrates the strong correlation between disentangled latent representations and the ability to solve abstract visual reasoning tasks.




1 Introduction

Figure 1: a: A sample instance of the RPM with the missing bottom-right image. b: The independent underlying generative factors of the images in the RPM. c: Reasoning using the latent factors: each value is represented by a Gaussian. The relation constant in a row holds for the factor representing color, while the other two factors are inconsistent within a row. The blue-colored image from the choices is the correct answer, as it matches the relation constant in a row for the last row.

Raven’s Progressive Matrices (RPM) raven1941standardization; raven1936mental; raven1998raven is a widely acknowledged metric in the research community for testing the cognitive skills of humans. The purpose of RPM is primarily to assess a person’s capacity to perform comparisons, reason by analogy, and carry out logical thinking processes. The RPM task is thus considered to have the highest “g-loading” and plays a central role among most cognitive tests klein2018scrambled.

The vision community has often employed Raven’s test to evaluate the reasoning skills of an AI model carpenter1990one; hoshen2017iq; little2012bayesian; lovett2017modeling; lovett2010structure; lovett2009solving. These tests are presented as a matrix, where each cell contains a visual geometric design except the last, which is left empty (Figure 1). The AI model is asked to pick the best-fit image from a provided list of 6 to 8 choices. To answer correctly, the model has to deduce the abstract relationship between visual features such as shape, position, color, and size, and the underlying rule applied to them in the matrix. These relationships, first proposed by carpenter1990one, are: constant in a row or column, quantitative pairwise progression, figure addition or subtraction, distribution of three values, and distribution of two values. The visual IQ score a model obtains from solving abstract reasoning tasks provides ground for comparing AI against human intelligence.

Please refer to Figure 1 for a demo example of an RPM prepared using the relation constant in a row. Early research employed computational models that depended on handcrafted heuristic rules over propositions formed from visual inputs to solve RPM carpenter1990one; bringsjord2003artificial; lovett2007analogy. Recent approaches use visual representations from neural networks to solve RPM hoshen2017iq; barrett2018measuring; zhang2019raven; van2019disentangled. The latest model, proposed by barrett2018measuring and known as the Wild Relation Network (WReN), is the current state-of-the-art neural network for Raven’s test, built on the Relation Network santoro2017simple. While WReN outperforms other reasoning-based models on datasets such as PGM barrett2018measuring and RAVEN zhang2019raven, its performance is still suboptimal compared to humans.

These setbacks in model performance are caused by the lack of an effective, task-appropriate visual representation. During training, the model should separate the key attributes needed for reasoning. These key attributes, also known as disentangled representations bengio2013representation; ridgeway2016survey, break down the visual features into their independent generative factors, i.e., factors that can be used to generate the full spectrum of variations in the ambient data. We argue that a better-disentangled model is essential for better machine reasoning. A recent study locatello2019challenging showed, via an impossibility theorem, the limitation of learning disentanglement independently: without any form of inductive bias, learning disentangled factors is impossible. Since collecting label information for the generative factors is challenging, and nearly impossible in real-world datasets, previous works have focused on semi-supervised or weakly supervised methods, such as observing a subset of ground-truth factors kingma2014semi, using paired images that share values for a subset of factors bouchacourt2018multi, or knowing the rank of a subset of factors for a pair of images wang2014learning, to improve disentanglement. In our work, we improve the model’s reasoning ability by using the inductive structure present in the spatial features. This underlying structure induces weak supervision that helps improve disentanglement, in turn leading to better reasoning.

Our work considers the benefit of jointly learning a disentangled representation and learning to reason (critical thinking). Unlike the models above, which work in a staged process (either improving disentangling or improving downstream accuracy), we address the weaknesses of both components and propose a novel way to optimize both in a single end-to-end trained model. We demonstrate the benefits of the interaction between representation learning and reasoning ability.

Our motivation for using the same evaluation procedure as van2019disentangled is as follows: 1) the strong visual presence; 2) the available information about the generative factors helps demonstrate model efficacy on both reasoning accuracy and disentanglement (strong correlation); 3) the possibility of comparing our disentangling results with state-of-the-art (SOTA) disentangling results.

Our contributions are summarized as follows:

  • We propose a general generative graphical model for RPM, GM-RPM, which will form the essential basis for inductive bias in joint learning for representation + reasoning.

  • Building upon GM-RPM, we propose a novel learning framework named Disentangling based Abstract Reasoning Network (DAReN), composed of two primary components: a disentanglement network and a reasoning network. It learns to disentangle factors and uses the representation to detect both the underlying relationship and the object property used for the relation.

  • We show that DAReN outperforms all SOTA baseline models across reasoning and disentanglement metrics, demonstrating that reasoning and disentangled representations are tightly related; learning both in unison can effectively improve the downstream reasoning task.

2 Related Works

Visual Reasoning.  Earlier works on abstract visual reasoning relied on traditional approaches such as rule-based heuristics in the form of symbolic representations carpenter1990one; lovett2017modeling; lovett2010structure; lovett2009solving or relational structures in the images little2012bayesian; mcgreggor2014confident; mekik2018similarity. These methods were limited in their ability to fully address reasoning tasks because of their underlying assumptions. The first limitation was assuming that machines have access to symbolic representations of the images and infer rules based on them. The second was the need for domain expertise to understand the operations and comparisons required by the design principles of these reasoning tasks. Progress toward fully solving these tasks became possible when wang2015automatic proposed a systematic way of automatically generating RPMs using first-order logic. Recently, there has been growing interest in using deep neural networks to solve abstract reasoning because of their ability to learn powerful representations directly from images. Since then, there has been significant progress in solving reasoning tasks with neural networks. Hoshen and Werman hoshen2017iq employed a CNN to find the matching choice for the context images, while barrett2018measuring proposed the Wild Relation Network (WReN) to study the nature of relationships between context and choice images.

Disentanglement.  In recent years, research on learning disentangled representations has gained momentum bengio2013representation; higgins2016beta; kim2018disentangling; kim2019bayes; kim2019relevance; locatello2019challenging; ridgeway2016survey; tschannen2018recent. However, the area has not reached consensus on two major points: i) there is no widely accepted formal definition bengio2013representation; ridgeway2016survey; locatello2019challenging; tschannen2018recent, and ii) there is no single robust evaluation metric for comparing models burgess2018understanding; kim2018disentangling; chenisolating; eastwood2018framework; kumar2017variational. The key property common to all models that claim to learn disentangled representations is the extraction of statistically independent ridgeway2016survey learned factors. A majority of the research in this area is based on the definition presented by bengio2013representation, which states that the underlying generative factors correspond to independent latent dimensions, such that changing a single factor of variation changes only a single latent dimension while remaining invariant to the others. In recent work, locatello2019challenging proved a theoretical impossibility result stating that it is impossible to learn disentangled representations without the presence of inductive bias. This has led to a shift toward semi-supervised locatello2019disentangling; sorrenson2020disentanglement; khemakhem2020variational and weakly supervised locatello2020weakly; bouchacourt2018multi; hosoya2018group; shu2019weakly disentangling models.

In this work, we focus on learning disentangled representations as well as solving abstract visual reasoning. A large-scale study by van2019disentangled showed, using previous state-of-the-art (SOTA) disentangling models, that disentangled representations are helpful in solving reasoning tasks. We focus on using the inductive bias present in the reasoning questions to jointly optimize both tasks. In the sections below, we start by proposing a general framework for solving abstract reasoning and a novel model design (DAReN), followed by empirical results comparing DAReN with previous SOTA methods.

3 Problem Formulation and Approach

We first describe the Raven’s Progressive Matrices that form our visual reasoning task in Section 3.1. We then propose our general generative graphical model for RPM, GM-RPM, which forms the essential basis for inductive bias in joint representation-reasoning learning, in Section 3.2. Finally, in Section 3.3, we describe our novel learning framework, based on a variational autoencoder (VAE) and a reasoning network, for joint representation-reasoning learning.

3.1 Visual Reasoning Task

The Raven’s matrix contains images at all locations except the last. The aim is to find the best-fit image for the missing location from a list of choices. For our current work, we fix the matrix to size 3 × 3, with cells indexed in row-major order and the bottom-right cell left empty, to be filled with the correct image from the choices; the number of choices is set to six.

We formulate an abstract representation for all possible patterns by defining a structure over the image attributes and the relation types applied to those attributes.

The relation set consists of the relations proposed by carpenter1990one: constant in a row, quantitative pairwise progression, figure addition or subtraction, distribution of three values, and distribution of two values. We describe our approach below by assuming these relationships on the rows of the RPM; the procedure remains the same for modeling relationships on the columns. Similarly, the image attribute set consists of object type, size, position, and color. These image attributes are usually the underlying generative factors of the images. The structure is a set of tuples, where each tuple is formed by randomly sampling a relation and an image attribute. The difficulty rises with the size of the structure, the relation set, the attribute set, or any combination of them. Finally, the matrix is created using the tuples in the structure. For the matrix to be a valid RPM, an additional consistency constraint, described below, is required.

RPM Constraint.

Using the structure, multiple realizations of the matrix are possible, depending on the randomly sampled attribute values. For example, if the structure is {(constant in a row, object type), (constant in a row, object size)}, every image in each row of the matrix will have the same (constant) value for object type and size.¹ The values for image attributes that are not part of the structure are sampled randomly for every image; these sampled values must not comply with the relation set across all rows or columns. In the example above, a valid matrix can still have the same values for position or color (or both) within the first row, as long as the values in the other two rows differ (so as not to contradict the structure). This is an example of a distractor: an attribute outside the structure may, during sampling, satisfy some relation for one row but not for all rows. These randomly varying values add a layer of difficulty to solving the RPM.

¹We interchangeably use the attribute-selection notation to denote both the subset of image attributes that adhere to the rules of RPM and the multi-hot vector whose non-zero values index those attributes.

Any model trained to solve the RPM has to find the attribute subset that is consistent across all rows or columns and discard the distracting features. In the rest of the paper, for simplicity, we work with row-based relationships on the RPM. Our solution easily extends to columns, or to both rows and columns.
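The structure sampling and the row-wise consistency check described above can be sketched as follows. This is a minimal illustration, not the paper's actual generator: the names `RELATIONS`, `ATTRIBUTES`, and the dictionary encoding of images are assumptions, and only the relation constant in a row is checked.

```python
import random

# Hypothetical stand-ins for the relation set R and attribute set F above.
RELATIONS = ["constant_in_a_row"]
ATTRIBUTES = ["type", "size", "position", "color"]

def sample_structure(max_tuples=2, rng=random):
    """Sample a structure: a set of (relation, attribute) tuples."""
    k = rng.randint(1, max_tuples)
    attrs = rng.sample(ATTRIBUTES, k)
    return [(rng.choice(RELATIONS), a) for a in attrs]

def row_satisfies(row, attr):
    """'Constant in a row' holds iff every image in the row shares the value."""
    return len({img[attr] for img in row}) == 1

def is_valid_rpm(matrix, structure):
    """Every tuple in the structure must hold for every row of the matrix."""
    return all(row_satisfies(row, attr)
               for _, attr in structure for row in matrix)
```

A matrix whose rows each hold one color but vary in size would pass the check for (constant in a row, color) and fail it for (constant in a row, size), mirroring the distractor behavior described above.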

3.2 Inductive Prior for RPM (GM-RPM)

While previous works have made strides in solving RPMs little2012bayesian; lovett2017modeling; van2019disentangled, the gap in reasoning and representation learning between those approaches and human performance remains. To narrow this gap, we propose a minimal inductive bias, in the form of the probabilistic graphical model described here, to guide joint representation-reasoning learning.




Figure 2: Generative models for RPM (GM-RPM). See Sec. 3.2 for details.

Figure 2 defines the structure of the general generative graphical model for RPM. This model describes an RPM whose images, with the correct answer at the missing location, are defined by a rule on the subset of possible attributes indexed by the multi-hot attribute-selection vector.

Latent vectors are the representations of the attributes, to be learned by our approach, together with some inherent noise process encompassed in the remaining dimensions, which we refer to as nuisances. Ideally, in this simple RPM setting, some latent factors should be isomorphic to the attributes themselves once an optimal model is learned. This latent vector gives rise to the ambient images through a stochastic nonlinear mapping² parameterized by weights that are to be learned; stacking the latents of all cells yields the latent tensor for the RPM.

²We drop RPM indices where obvious.

The RPM inductive bias comes from the way the priors over the latents are formed, given the unknown rule. Specifically, the latent vector splits into the representation of the factors used in the rule and the representation of the complementary, unused factors. The key in RPM is how the priors on those factors are defined. The factors used in the rule, grouped row-wise, follow some joint density over elements of the same row, where the row matrix collects all latent representations in that row of the RPM. The factors not used in the rule have a different, iid prior. This is depicted by the inner-most plate, “columns: M”, in Figure 2. Finally, the factors representing the noise information also have an iid prior. Additionally, we assume independence among the three groups of factors: those used for the puzzle, those not used, and the nuisances. This finally gives rise to the full generative model for RPM (GM-RPM).
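The prior structure above can be sketched by a small sampler. This is an illustrative assumption, not the paper's parameterization: the latent dimension `d`, the rule indices `f_idx`, and the use of standard normal densities are all placeholders; what matters is that rule factors share one value per row while unused and nuisance factors are drawn iid per cell.

```python
import numpy as np

def sample_gm_rpm_latents(rows=3, cols=3, d=6, f_idx=(0,), seed=None):
    """Sample a (rows, cols, d) latent tensor under a GM-RPM-style prior
    for the relation 'constant in a row'."""
    rng = np.random.default_rng(seed)
    # iid prior for unused factors and nuisances, one draw per cell
    z = rng.standard_normal((rows, cols, d))
    for i in f_idx:
        # rule factors: one draw per row, broadcast across the row's columns
        z[:, :, i] = rng.standard_normal((rows, 1))
    return z
```

Under this sampler, the rule dimensions have zero variance within each row, while every other dimension varies freely, which is exactly the row-wise coupling the inner plate of Figure 2 expresses.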

Figure 3: Illustration of DAReN. Our novel learning framework consists of a VAE-based generative model (left) and a reasoning network (right). Generative model: the encoder encodes the context and choice images into latent representations. The possible attributes are learned by picking the factor indices with high variance; the rest are kept as nuisances. The set bits of the attribute-selection vector (shown as a multi-hot vector) for row m are obtained from the fixed factor constraint. No operation is performed on the factor indices outside the set bits or on the nuisance factors. The factor indices at the set bits are enforced by the rule constraint to take the same value via an averaging strategy (G). The updated latent representation is given to the decoder to reconstruct the images. Reasoning model: we use the latent representations to extract the standard deviation across factor indices for the top two rows and all six possible last rows. An MLP trained on the concatenation of the top-two-row standard deviations with each choice predicts the best-fit choice image.

3.3 Representation and Reasoning: DAReN

Inspired by our GM-RPM, we propose a novel framework for learning representation and reasoning named Disentangling based Abstract Reasoning Network (DAReN). Please refer to Figure 3 for an overview of DAReN.

DAReN is composed of two primary components: a variational autoencoder (VAE) module and a reasoning module. The encoder encodes the matrix (without the last element) and the choice list into latent representations, whose structure follows (2). These latent variables are used as input to both the disentangling and the reasoning modules. In the disentangling module, we apply an averaging strategy to the subset of underlying factors selected by the attribute vector; the updated latent variable is given to the decoder to reconstruct the original images. The input to the reasoning module is prepared by completing the last row, placing each choice in turn at the missing location. The reasoning network outputs a probability score for each choice, and the choice with the highest value is predicted as the best-fit image. In the following sections, we describe the disentanglement learning using a VAE, followed by the reasoning module.


Our proposed generative model uses a VAE kingma2013auto to learn the disentangled representations. As described in Section 3.2, the goal is to infer the values of the latent variables that generated the observations, i.e., to calculate the posterior distribution over the latents, which is intractable. Instead, kingma2013auto proposed approximating the intractable posterior with a variational distribution with learnable variational parameters. In this work, we further structure this variational posterior. Currently, DAReN is designed to solve for the relation constant in a row. An intermediate variable is used to arrive at the final estimate of the latent representation via the rule-enforcing constraint function, as described further in this section. We use the inductive bias of Sec. 3.2 to define it as described below.

Disentangling Module.  Initial inference of the latent factors is accomplished using the disentangling VAE encoder from tschannen2018recent, which uses a general regularized ELBO optimization. One regularizer controls the information flow in the latent code, a second regularizer induces disentanglement on the latent representation, and hyperparameters control the weight on each regularizer. Most regularization-based disentangling work can be subsumed into this objective form.
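The shape of this objective can be made concrete with a schematic, using Factor-VAE-style terms as in our experiments: a KL (information-flow) term and a weighted total-correlation estimate. The plain scalar arguments stand in for the batch statistics an actual VAE would produce; this is a sketch of the objective's form, not a training loop.

```python
def regularized_elbo(recon_log_lik, kl_to_prior, tc_estimate, gamma=10.0):
    """Schematic regularized ELBO: reconstruction minus two regularizers."""
    r1 = kl_to_prior            # information-flow regularizer on q(z|x)
    r2 = gamma * tc_estimate    # disentanglement regularizer (total correlation)
    return recon_log_lik - r1 - r2
```

Setting the disentanglement weight to zero recovers the plain VAE ELBO, which is why most regularization-based disentangling objectives can be subsumed into this form.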

Weak Supervision via RPM Structure.  We use the inductive bias present in the RPM, via its structure, to guide our disentangling module. We describe the approach to infer the possible attributes. We initialize the variational parameters using a partially pre-trained network and then compute the variance across each factor index. Indices whose variance exceeds a threshold are considered attribute factors, and the rest are nuisances.
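The attribute/nuisance split above reduces to a variance threshold over the encoder means. A minimal sketch, where the threshold `tau` and the use of posterior means are illustrative assumptions:

```python
import numpy as np

def split_factors(z_mean, tau=0.1):
    """z_mean: (num_images, d) encoder means over a batch of images.
    Dimensions that vary across images are treated as attribute factors;
    near-constant dimensions are treated as nuisances."""
    var = z_mean.var(axis=0)
    attr_idx = np.flatnonzero(var > tau)
    nuisance_idx = np.flatnonzero(var <= tau)
    return attr_idx, nuisance_idx
```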

Enforcing Fixed Factor Constraint.  We infer the set bits of the attribute vector for a given row from the prior in (3), under which the image attributes share the same value at the set bits within that row. We compute the KL divergence over the latent vectors in the row and extract the indices with the lowest divergence values. The KL divergence at a factor index is used to determine whether that factor is consistent within the row; the set consists of the indices with the lowest divergence. In the empirical study, we fix the number of rule attributes to one.
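One way to realize this test is sketched below: for each latent dimension, measure how far the per-image Gaussian posteriors in a row diverge from their row average, and take the k lowest-divergence dimensions as the set bits. Pooling by averaging the row's posterior parameters is an assumption for illustration; the paper's exact divergence computation may differ.

```python
import numpy as np

def gauss_kl(mu_a, var_a, mu_b, var_b):
    """Elementwise KL( N(mu_a, var_a) || N(mu_b, var_b) ) for diagonal Gaussians."""
    return 0.5 * (np.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

def fixed_factor_indices(mu, var, k=1):
    """mu, var: (num_images_in_row, d) posterior parameters for one row.
    Returns the k dimensions whose posteriors agree most across the row."""
    mu_bar, var_bar = mu.mean(axis=0), var.mean(axis=0)
    div = gauss_kl(mu, var, mu_bar, var_bar).sum(axis=0)  # (d,) total divergence
    return np.argsort(div)[:k]
```

A dimension on which all images in a row agree contributes zero divergence, so it is selected before any dimension that varies across the row.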

Enforcing Rule Constraint.  We now describe the process of preparing the estimated latent variable. The averaging strategy is applied to the latent vectors: we use the set bits extracted above to enforce the relation on the images in each row, while the values at the unused and noise indices remain unchanged. For the relation constant in a row, the averaging strategy is a variant of the method in Multi-Level VAE bouchacourt2018multi. The final latent vector is prepared as per (2).
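The averaging step for constant in a row can be sketched in a few lines; the function name and the hard row-mean replacement are illustrative simplifications of the Multi-Level-VAE-style grouping:

```python
import numpy as np

def enforce_constant_in_row(z_row, f_idx):
    """z_row: (num_images_in_row, d) latents; f_idx: indices of rule attributes.
    Rule dimensions are replaced by their row average; all other dimensions,
    including nuisances, pass through unchanged."""
    z_hat = z_row.copy()
    z_hat[:, f_idx] = z_row[:, f_idx].mean(axis=0)  # share one value per row
    return z_hat
```

Because the decoder must still reconstruct each distinct image from the averaged rule dimensions, the information that varies within a row is pushed into the remaining dimensions, which is the mechanism that keeps the factors separated.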

From the above constraints, we obtain a new representation for each image, where the attributes selected by the rule are constrained. The decoder network maps each updated latent vector back to the original image. Since the unconstrained attributes vary across images, the above constraints force the model to keep the factors separate, which improves disentangling.

3.4 Reasoning Module

The reasoning component of DAReN incorporates a relational structure to infer the abstract relationship over the attributes of the images. The input to our reasoning module is the disentangled latent variables of the context and the choices. We prepare the possible last rows by placing each choice image at the missing location. The correct image is the one that satisfies, in the last row, the relation that already holds for the top rows. To determine the correct choice, we contrast the values at each factor index by computing the variance over all images in a row. Using this step, we compute the variance for all rows, including all probable last rows, and append a positional encoding at the end of each row-wise variance. Next, we concatenate the variances of the top rows with each possible last row to prepare the choice vectors. Finally, we train a basic three-layer MLP to process each choice vector and produce a logit vector over the choices. The choice with the highest score is selected as the correct answer. We train the reasoning model using the cross-entropy loss.
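The preparation of the choice vectors can be sketched as follows, assuming a 3 × 3 grid with the bottom-right cell missing. This is a simplified illustration (standard deviation in place of variance, and the positional encoding omitted); the function name and shapes are assumptions, not the paper's exact implementation.

```python
import numpy as np

def build_choice_vectors(z_context, z_choices):
    """z_context: (8, d) latents of the grid minus the missing cell, row-major;
    z_choices: (num_choices, d) latents of the candidate answers.
    Returns (num_choices, 3*d): per-dimension stds of the two complete top
    rows, concatenated with the std of each candidate last row."""
    d = z_context.shape[1]
    top_std = z_context[:6].reshape(2, 3, d).std(axis=1).reshape(-1)  # (2*d,)
    vecs = []
    for c in z_choices:
        last_row = np.vstack([z_context[6:], c[None]])  # complete row 3
        vecs.append(np.concatenate([top_std, last_row.std(axis=0)]))
    return np.stack(vecs)
```

For the correct choice, the std at the rule dimensions of the completed last row drops toward zero, matching the near-zero stds of the top rows; the MLP then only needs to score this agreement.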

4 Experiments

4.1 Datasets

We study the performance of DAReN on six datasets: (i) dSprites matthey2017dsprites, (ii) modified dSprites van2019disentangled, (iii) Shapes3D kim2018disentangling, (iv-vi) MPI3D (Real, Realistic, Toy) gondal2019transfer. We use experimental settings similar to those proposed in van2019disentangled to create RPMs for these datasets. Please see the Appendix for descriptions of the datasets and details on the preparation of the RPMs. We compare our results on both reasoning accuracy and disentanglement scores against SOTA methods and show that our model outperforms them in both categories.

Model/Dataset DSprites Modified DSprites Shapes3D MPI3D–Real MPI3D–Realistic MPI3D–Toy
Staged-WReN van2019disentangled
Table 1: Performance (mean ± variance) of reasoning accuracy on the 6 benchmark datasets. Higher is better. The best score for each dataset among the competing models is shown in bold red and the second-best in blue. (Note: Staged-WReN values are taken from van2019disentangled.)

4.2 Experimental Setup, Baselines and Evaluation Metric

For our experiments, we use the objectives proposed in Factor-VAE kim2018disentangling as the regularizers in (8) for the disentangling module. In the case of Factor-VAE, the information-flow regularizer is the KL divergence to the prior and the disentanglement regularizer is the total-correlation penalty. All our models are implemented in PyTorch NEURIPS2019_9015 and optimized with the ADAM optimizer kingma2014adam, using a learning rate of 1e-4 for the reasoning + representation network, except the Factor-VAE discriminator, which is set to 1e-5. To demonstrate that our approach is less sensitive to the choice of hyper-parameters and network initialization, we sweep over 35 different settings of hyper-parameters and initialization seeds. We keep the architecture and other hyperparameters similar to the experiments in van2019disentangled. The image size in all six datasets is 64 × 64 × 3 pixels, with pixel intensities scaled to [0, 1]. We use a batch size of 64 during training.


We refer to the training process in van2019disentangled as Staged-WReN for convenience. With the above settings, we evaluate DAReN against Staged-WReN on the abstract reasoning task and against Factor-VAE on disentangled representation learning. Staged-WReN is a two-stage training process: a disentangling network is trained first, followed by training a Wild Relation Network (WReN) barrett2018measuring on RPM using the fixed disentangled representation from the trained encoder. Similarly, we use the weights of a partially pre-trained model to initialize our VAE. However, we start from an earlier checkpoint than Staged-WReN and train DAReN for 100K iterations.

We propose an adapted baseline over Staged-WReN, referred to as E2E-WReN, in which we train the disentangling and reasoning networks jointly end-to-end. We adopt notation similar to van2019disentangled for the context panels and answer panels; WReN is evaluated for each answer panel in relation to the context panels. The learning objective of E2E-WReN is similar to that of Staged-WReN; however, E2E-WReN trains the VAE module as well. Please refer to the appendix for a detailed description of E2E-WReN.

4.3 Analyzing Abstract Visual Reasoning Results

We report the performance as mean ± variance of reasoning accuracy (Table 1) over the 6 benchmark datasets. We start by comparing Staged-WReN against the adapted baseline (E2E-WReN). We observe an increase in performance from Staged-WReN to E2E-WReN, which supports our hypothesis that a better representation results in improved reasoning performance. WReN learns a relational structure by aggregating all pairwise relations formed within the context panels and between context and choice panels in the latent space to find the best-fit choice image for the context. Since the working procedure of Staged-WReN and E2E-WReN is similar, the difference in reasoning performance is credited to the stronger signals in the latent representation of E2E-WReN that back-propagate from the reasoning module. Despite this, WReN remains suboptimal at learning the underlying reasoning patterns. One reason is that the pairwise relations formed between context panels do not contribute to predicting any choice, because their values remain the same across all choice panels; consequently, a trained WReN learns to ignore within-context pairwise comparisons. Only the pairwise scores between context panels and each choice panel are used for inferring the relationship. We have empirically verified this on both trained networks (Staged-WReN and E2E-WReN) over all the datasets: the scores of non-interfering comparisons across all choices do not impact the final prediction by the model. A detailed analysis of the empirical results is provided in the appendix. This blind formulation of pairwise relations presents a major downside to the quality of learning in WReN. We can observe from GM-RPM that the relation binds the image attributes in a row-wise manner; hence, any pairwise comparison between a choice panel (including the correct image) and a context panel in the top rows does not help in learning. For example, with the relation constant in a row, a pairwise comparison between a context panel in the top rows and a choice panel (including the correct answer) need never share the same values for the rule attributes. For this reason, forming every possible pairwise relation hinders model learning. In DAReN, we avoid forming general pairwise combinations and instead use GM-RPM to model the row structure. DAReN outperforms E2E-WReN on all datasets. Our results on these datasets provide evidence of a strong affinity between reasoning and disentanglement, which DAReN is able to exploit.

4.4 Analyzing Disentangling Results

As illustrated above, while learning the relations, DAReN also separates the underlying latent factors. In Table 2, we report the disentanglement scores of the trained models on five widely used evaluation metrics: the β-VAE metric higgins2016beta, the Factor-VAE metric kim2018disentangling, DCI disentanglement eastwood2018framework, MIG chenisolating, and SAP score kumar2017variational. We compare the latent representations learned by all three models at the same iteration. As described in Section 4.3, the joint learning framework not only improves reasoning performance but also disentangles the latent vector more strongly than Factor-VAE, an unsupervised (reasoning-free) approach. E2E-WReN improves upon it but is still suboptimal at disentangling the latent factors, as observed in Table 2. One major reason for the large improvements of DAReN over Factor-VAE, and even E2E-WReN, is the extraction of the underlying generative factors and the averaging strategy over the least-varying indices. During unsupervised training, indices of nuisance factors may receive weak signals from the true generative factors; DAReN removes this infusion of true signals into the nuisance factors by separating the attribute factors from the nuisances.

Dataset          Model      β-VAE         F-VAE         DCI           MIG           SAP
DSprites         F-VAE      85.9 ± 6.2    74.4 ± 7.3    52.9 ± 10.5   28.7 ± 11.5    4.0 ± 1.4
                 E2E-WReN   86.9 ± 2.8    77.6 ± 5.0    58.1 ± 8.0    38.2 ± 7.7     6.0 ± 2.2
                 DAReN      87.8 ± 2.1    79.2 ± 6.2    59.0 ± 6.4    39.0 ± 0.0     6.0 ± 2.0
Mod DSprites     F-VAE      51.4 ± 13.8   44.0 ± 10.6   31.2 ± 7.6    13.8 ± 7.1     6.4 ± 2.6
                 E2E-WReN   75.2 ± 10.0   65.1 ± 10.6   43.0 ± 6.9    26.1 ± 9.1     8.3 ± 3.3
                 DAReN      87.6 ± 11.6   77.0 ± 13.2   50.1 ± 11.3   34.9 ± 12.4   12.6 ± 4.8
Shapes3D         F-VAE      91.9 ± 5.9    84.5 ± 8.7    73.9 ± 9.0    44.6 ± 18.8    6.3 ± 2.9
                 E2E-WReN   94.2 ± 4.7    91.3 ± 6.5    79.1 ± 7.7    54.9 ± 15.7    8.4 ± 3.8
                 DAReN      99.9 ± 0.3    98.4 ± 3.2    91.6 ± 4.7    68.8 ± 17.5   17.2 ± 4.8
MPI3D-Realistic  F-VAE      61.7 ± 6.9    45.0 ± 5.5    37.4 ± 4.6    22.7 ± 7.7     9.8 ± 2.6
                 E2E-WReN   70.7 ± 5.8    55.3 ± 6.1    42.5 ± 6.4    29.7 ± 8.1    12.8 ± 2.9
                 DAReN      87.8 ± 7.9    75.8 ± 8.6    50.6 ± 5.9    36.1 ± 8.2    19.3 ± 5.2
MPI3D-Real       F-VAE      71.6 ± 8.7    57.8 ± 7.5    46.1 ± 2.5    31.2 ± 6.1    14.4 ± 4.0
                 E2E-WReN   78.4 ± 7.6    65.0 ± 7.3    48.0 ± 3.5    34.3 ± 7.0    18.2 ± 4.4
                 DAReN      90.0 ± 10.0   75.8 ± 9.7    51.8 ± 4.9    37.0 ± 8.1    20.8 ± 5.3
MPI3D-Toy        F-VAE      67.4 ± 5.0    49.2 ± 4.1    43.0 ± 2.5    29.7 ± 7.0    10.6 ± 2.4
                 E2E-WReN   75.3 ± 4.3    58.6 ± 4.2    48.0 ± 2.9    37.8 ± 7.9    13.7 ± 2.9
                 DAReN      87.6 ± 14.2   75.3 ± 15.4   52.8 ± 6.8    35.1 ± 9.6    18.7 ± 5.5

Table 2: Disentanglement metrics on the 6 benchmark datasets: performance (mean ± variance) on widely used disentanglement metrics.

5 Conclusion

In this paper, we introduce an end-to-end learning framework to jointly learn representation and reasoning, using inductive bias as a weak form of supervision, for Raven’s Progressive Matrices. To this end, we propose a general generative graphical model for RPM (GM-RPM) as a prior for the reasoning task. We realize a new joint learning framework, DAReN, based on the principles defined in GM-RPM. DAReN is composed of two components: a disentanglement network and a reasoning network. Our disentangling network allows easy integration of different representation learning frameworks. We fix the relation and evaluate DAReN on six benchmark datasets, measuring both reasoning and disentangling performance. Abstract visual reasoning is a hard problem in the presence of latent generative factors, relations, and image attributes. We show evidence of a strong correlation between learning disentangled representations and solving reasoning tasks. The general nature of GM-RPM and DAReN opens further possibilities for research in joint learning of representation and reasoning, including generalization to arbitrary relationships.