1 Introduction
Imitating the human ability to infer complicated patterns from observations has been a longstanding goal of artificial intelligence. To build models capable of such reasoning, recent works
(zheng2019abstract; zhang2019learning; hahne2019attention; wang2020abstract; hu2020hierarchical; wu2020scattering) have focused on training a deep neural network (DNN) to solve abstract reasoning problems that resemble an IQ test (Figure 1). In these problems, one should infer the common rules of the contexts without any information other than the context images, and select the correct answer from a candidate set. Accordingly, these DNN-based approaches attempt to derive a framework for situations where neither supervision on the rules of the problem nor explicit symbol labels of each image are provided. A couple of studies (santoro2018measuring; zhang2019raven) demonstrated that widely used neural network architectures such as ResNet (he2016deep) or LSTM (hochreiter1997long) are unfit for learning reasoning capability, as these architectures employ no priors resembling the human reasoning procedure. Remarkably, recent works have shown that performance can be significantly improved with careful neural architecture design motivated by the human reasoning process, and can even surpass humans (zhang2019learning; zheng2019abstract; wu2020scattering).

Most DNN-based methods for abstract reasoning have resembled how humans perform reasoning via the response elimination strategy (Figure 1(a)), i.e., excluding candidate answers based on matching with the given context images. Intriguingly, cognitive science (bethell1984adaptive; carpenter1990one) reveals that humans use two types of abstract reasoning strategies, not limited to response elimination. Specifically, humans also have the ability to perform constructive matching (Figure 1(b)); they can first imagine the answer from the context images without any candidates, and then match the candidate answers to select the most similar one.
Notably, several works (mitchum2010solve; becker2016preventing) have emphasized that the latter strategy better reflects the general intelligence of humans. However, the ability of neural networks to achieve constructive matching remains underexplored, even though such a direction is promising.
Contribution. We introduce a new end-to-end generative framework, coined logic-guided generation (LoGe), which learns a constructive matching strategy for abstract reasoning, as humans do. Our main idea is to reduce these reasoning problems to optimization problems in propositional logic. Leveraging such prior knowledge, LoGe learns to embed each image into discrete variables and to generate the answer image by incorporating a differentiable propositional logical reasoning layer. We note that both objectives are achieved without any supervision on the exact propositional variables of each image or the underlying rules of the problem. Specifically, we propose a three-step framework to achieve these objectives: abstraction, reasoning, and reconstruction.
We verify that LoGe effectively solves the proposed task on the RAVEN (zhang2019raven) dataset. Specifically, we show that our framework generates high-quality, correct images by capturing the underlying abstract rules and attributes. This result is remarkable, as widely used neural network architectures perform poorly on this conditional generation task. We also verify that LoGe performs comparably to neural networks that rely on the response elimination strategy for abstract reasoning, even though our task is arguably harder and our model has no access to the wrong candidates during training.
2 Logic-guided Generation
In this section, we present logic-guided generation (LoGe), a framework that imitates a human's constructive matching strategy for abstract reasoning.
2.1 Overview of logic-guided generation
Our problem setup is largely inspired by bethell1984adaptive, who evaluated the constructive matching ability of humans to measure their generative reasoning ability. In this perspective, we express reasoning as the task of inferring a rule $r$ from a given problem $P$, where the problem $P = (C, y)$ is a pair of a context $C$ and an answer $y$ satisfying the rule $r$. We especially focus on the generative strategy for solving this task; given a query context $C$, we evaluate the ability of machines to infer a rule $\hat{r}$ from the context and to generate an answer image $\hat{y}$ that matches the ground-truth image $y$.
To teach models the generative strategy in abstract reasoning, we train them on a dataset of problems $\mathcal{D} = \{(C_i, y_i)\}_{i=1}^{N}$, where each context $C_i$ is a tuple of images and $y_i$ is an answer image. Images are specified by a collection of abstract features such as shape, color, and size. For instance, we visualize the case of the generative strategy for the Raven's Progressive Matrices (RPM) structure in Figure 1(b): the context is given as eight images $C = (x_1, \dots, x_8)$, and the goal is to generate an answer image for the remaining location, denoted by a question mark.
Our main idea is to incorporate the generative reasoning strategy into the framework by connecting abstract reasoning to an optimization problem in propositional logic. For instance, in RPM problems, one can define propositional variables of the form "the image placed at the $i$-th row and $j$-th column contains an attribute $a$" to represent the contexts of the problem, as shown in Figure 1(c). Here, attributes are sets of features of each context image, e.g., the sets of shapes, colors, and sizes. With these variables, the underlying rules can be written as propositional logical formulas, as in Figure 1(c). In this respect, one may interpret the answer generation procedure as a MAXSAT optimization problem in propositional logic: finding propositional variables representing the answer which satisfy as many of the underlying logical formulas of the given RPM problem as possible. We provide a more detailed description of the MAXSAT problem in Appendix A.
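As a concrete illustration, each cell of an RPM problem can be encoded as a binary vector of propositional variables, one per (attribute, value) pair. The following minimal sketch uses hypothetical attribute vocabularies for illustration only, not the dataset's exact ones:

```python
# Hypothetical attribute vocabularies for illustration only.
SHAPES = ["triangle", "square", "pentagon"]
COLORS = ["white", "gray", "black"]
SIZES = ["small", "medium", "large"]

# One propositional variable per (attribute, value) pair:
# "the image at this cell has attribute value v".
VARIABLES = [f"{attr}={val}"
             for attr, vals in [("shape", SHAPES), ("color", COLORS), ("size", SIZES)]
             for val in vals]

def propositionalize(shape: str, color: str, size: str) -> list[int]:
    """Encode one cell's attributes as a 0/1 assignment over VARIABLES."""
    truth = {f"shape={shape}", f"color={color}", f"size={size}"}
    return [1 if v in truth else 0 for v in VARIABLES]

p = propositionalize("square", "black", "small")
assert sum(p) == 3  # exactly one value is true per attribute
```

A rule such as "shape is constant across a row" then becomes a logical formula over these variables, shared across the rows of the problem.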
As these propositional variables are not provided in the dataset $\mathcal{D}$, LoGe learns a propositional embedding and the rules of problems in a self-supervised manner. In particular, we derive a three-step framework with an encoder network $f_{\mathtt{enc}}$, a decoder network $f_{\mathtt{dec}}$, and a logical reasoning layer $f_{\mathtt{rsn}}$, parametrized by $\theta$, $\phi$, and $\psi$, respectively, along with a latent codebook $\mathcal{E} = \{e_k\}_{k=1}^{K}$, where each element $e_k$ is a trainable $D$-dimensional real-valued vector:

(Abstraction.) The encoder network and the codebook embed the context images into propositional variables.

(Reasoning.) The reasoning layer predicts propositional variables of the answer image.

(Reconstruction.) The decoder network and the codebook generate the answer image from the inferred propositional variables.
We provide an illustration of our framework in Figure 2.
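The three steps above can be sketched end-to-end as follows. This is a toy illustration with stand-in components (a random codebook, and a simple averaging function in place of the actual SATNet reasoning layer); the real encoder, reasoning layer, and decoder are the neural networks described in Section 2.2:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 8, 4                      # codebook size and embedding dimension (illustrative)
codebook = rng.normal(size=(K, D))

def abstraction(image_feature: np.ndarray) -> np.ndarray:
    """Quantize an encoder output to a one-hot propositional embedding."""
    idx = np.argmin(np.linalg.norm(codebook - image_feature, axis=1))
    return np.eye(K)[idx]

def reasoning(context_props: np.ndarray) -> np.ndarray:
    """Stand-in for the SATNet layer: here, just averages context variables."""
    return context_props.mean(axis=0)

def reconstruction(answer_props: np.ndarray) -> np.ndarray:
    """Map predicted propositional variables back to a codebook vector."""
    return codebook[np.argmax(answer_props)]

context = rng.normal(size=(8, D))                 # 8 context "encoder outputs"
props = np.stack([abstraction(z) for z in context])
answer_latent = reconstruction(reasoning(props))
assert answer_latent.shape == (D,)
```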
2.2 Detailed description of LoGe
In the rest of this section, we describe each step of our framework in detail.
Abstraction. We first compute propositional variables of each context image $x$ using the encoder network and the latent codebook $\mathcal{E} = \{e_k\}_{k=1}^{K}$. To achieve this, we pass each image $x$ through the encoder network to obtain the output $z = f_{\mathtt{enc}}(x)$, where we denote $z = (z_1, \dots, z_M)$ as the collection of $M$ spatial feature vectors. We then quantize the output with the codebook, denoted by $z^{q} = (z_1^{q}, \dots, z_M^{q})$, where each $z_i^{q}$ is the nearest codebook element:

$z_i^{q} = e_{k_i}, \quad k_i = \arg\min_{k} \|z_i - e_k\|_2.$

We finally consider a one-hot encoding of the indices $(k_1, \dots, k_M)$: we map each index $k_i$ to a categorical one-hot vector $p_i \in \{0, 1\}^{K}$ in which the $k_i$-th value is 1. Consequently, we have a one-hot embedding $p = (p_1, \dots, p_M)$ of each image $x$. This one-hot embedding is regarded and utilized as the propositional variables of $x$ in further steps.

Reasoning. With the propositional variables $(p^{(1)}, \dots, p^{(8)})$ of the context $C$, we compute the propositional variables $\hat{p}$ that correspond to the predicted answer image, i.e., $\hat{p} = f_{\mathtt{rsn}}(p^{(1)}, \dots, p^{(8)})$. For the reasoning layer $f_{\mathtt{rsn}}$, we choose the SATNet layer (wang2019satnet), a differentiable MAXSAT solver that learns propositional logical formulas from data as its layer weights. We provide details of this reasoning layer in Appendix B.
Reconstruction. Finally, we infer the answer image from the predicted propositional variables $\hat{p} = (\hat{p}_1, \dots, \hat{p}_M)$. To achieve this, we first recover a latent vector $\hat{z} = (\hat{z}_1, \dots, \hat{z}_M)$ from $\hat{p}$ and the codebook:

$\hat{z}_i = e_{\hat{k}_i}, \quad \hat{k}_i = \arg\max_{k} (\hat{p}_i)_k.$

We then return the decoder output $\hat{y} = f_{\mathtt{dec}}(\hat{z})$ as the final answer image.
Training objective. To train LoGe, we propose three loss functions, $\mathcal{L}_{\mathtt{abs}}$, $\mathcal{L}_{\mathtt{rsn}}$, and $\mathcal{L}_{\mathtt{rec}}$, for the abstraction, reasoning, and reconstruction steps, respectively. To formulate these objectives, we additionally consider $z^{q,(y)}$ and $p^{(y)}$, the quantized vector and the propositional variables of the ground-truth answer $y$ obtained from the abstraction step. We first formulate $\mathcal{L}_{\mathtt{abs}}$ and $\mathcal{L}_{\mathtt{rec}}$, which resemble the objective of the vector-quantized variational autoencoder (van2017neural; razavi2019generating):

$\mathcal{L}_{\mathtt{abs}} = \|\bar{z} - z^{q}\|_2^2 + \beta \|z - \bar{z}^{q}\|_2^2, \qquad \mathcal{L}_{\mathtt{rec}} = \|y - f_{\mathtt{dec}}(z^{q,(y)})\|_2^2,$

where a term with a bar indicates a term under the stop-gradient operator. Moreover, we define $\mathcal{L}_{\mathtt{rsn}}$ as the binary cross-entropy (BCE) loss between the propositional variables of the ground-truth answer and the predicted ones:

$\mathcal{L}_{\mathtt{rsn}} = \mathrm{BCE}(p^{(y)}, \hat{p}).$

To sum up, we optimize the total loss $\mathcal{L} = \mathcal{L}_{\mathtt{abs}} + \mathcal{L}_{\mathtt{rsn}} + \mathcal{L}_{\mathtt{rec}}$, which is the sum of the above three loss functions.
Here, the total loss contains several discrete operations, e.g., on the outputs of the reasoning layer. We provide a detailed description of how to handle this non-differentiability during optimization in Appendix C.
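For concreteness, the loss terms can be computed at the value level as follows. This is a numpy sketch with an illustrative β; in training, the barred (stop-gradient) terms are implemented with an operator such as `.detach()` in PyTorch, which does not change the loss value:

```python
import numpy as np

def vq_losses(z, codebook, x, x_rec, beta=0.25):
    """Value-level sketch of the abstraction and reconstruction losses.
    z: encoder outputs (M, D); codebook: (K, D); x / x_rec: an image and its
    decoder reconstruction. beta is illustrative, not the paper's value."""
    # nearest-codebook quantization
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    zq = codebook[d.argmin(axis=1)]
    loss_abs = np.mean((zq - z) ** 2) + beta * np.mean((z - zq) ** 2)
    loss_rec = np.mean((x - x_rec) ** 2)
    return loss_abs, loss_rec

def reasoning_loss(p_true, p_pred, eps=1e-9):
    """Binary cross-entropy between ground-truth and predicted variables."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))
```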
3 Experiments
We verify the effectiveness of our framework on the RAVEN/iRAVEN datasets (zhang2019raven; hu2020hierarchical). Our results demonstrate that the proposed logic-guided generation (LoGe) framework generates correct answer images for given abstract reasoning problems, while other neural architectures fail to do so. Moreover, we show that our framework can also be employed for discriminative tasks, i.e., choosing the answer among candidates, where it shows improved results compared to existing discriminative methods (zhang2019raven; zheng2019abstract; zhang2019learning; hu2020hierarchical; wu2020scattering).
Table 1: Comparison of our framework and prior approaches on discriminative abstract reasoning tasks.

Method  Center  UD  LR  OIC

LSTM (zhang2019raven)  12.3  10.3  12.7  12.9 
WReN (santoro2018measuring)  23.3  15.2  16.5  16.8 
LEN (zheng2019abstract)  42.5  28.1  27.6  32.9 
CoPINet (zhang2019learning)  50.4  40.8  40.0  42.7 
SRAN (hu2020hierarchical)  53.4  43.1  41.4  44.0 
LoGe (Ours)  87.5  64.0  51.7  48.5 
3.1 Experimental setup
Datasets. To verify the effectiveness of LoGe, we choose RAVEN (zhang2019raven) and iRAVEN (hu2020hierarchical). For more details of these datasets, see Appendix D.
Baselines. We note that LoGe utilizes VQ-VAE (van2017neural) and SATNet (wang2019satnet) to leverage propositional logic. To verify the effectiveness of this logical prior, we qualitatively compare the generated answers with those from other black-box neural network frameworks, i.e., frameworks with encoders and reasoning networks other than VQ-VAE and SATNet, respectively. For quantitative results, we compare the performance with prior methods for discriminative abstract reasoning. We provide detailed descriptions of the baselines in Appendix E.
3.2 Main results
Qualitative results. Figure 3 summarizes the generated answers on different configurations of the RAVEN dataset. LoGe successfully generates the answer image across various configurations of abstract reasoning problems, while other black-box architecture design choices fail. This indicates that the propositional logic prior is beneficial for achieving this objective. We provide more illustrations in Appendix F.
Quantitative results. Table 1 compares our framework with prior approaches on discriminative tasks. To employ our framework in discriminative tasks, we select the answer by choosing the candidate with the smallest mean-squared error from the generated answer image. Intriguingly, LoGe shows better performance than existing works, even though our method does not access candidates other than the answer during training. (For hu2020hierarchical, we resized the images to 80×80 for a fair comparison with other baselines.)
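The discriminative selection rule above can be sketched in a few lines (with hypothetical array shapes; the generated answer and each candidate are compared by pixel-wise mean-squared error):

```python
import numpy as np

def select_answer(generated: np.ndarray, candidates: np.ndarray) -> int:
    """Pick the candidate closest to the generated image in MSE."""
    mse = ((candidates - generated[None]) ** 2).mean(axis=(1, 2))
    return int(mse.argmin())

generated = np.zeros((80, 80))          # generated answer image (illustrative)
candidates = np.ones((8, 80, 80))       # 8 candidate images
candidates[3] = 0.1                     # candidate 3 is closest to the answer
assert select_answer(generated, candidates) == 3
```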
4 Conclusion
We introduce a new generative deep neural network framework that resembles a human's constructive matching strategy for abstract reasoning. Specifically, we derive a three-step procedure based on connecting abstract reasoning to an optimization problem in propositional logic. Experimental results demonstrate the effectiveness of our framework across various problem types in abstract reasoning, generating the correct answer by capturing common patterns with a propositional logic prior.
5 Acknowledgements
This work was partially supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019000075, Artificial Intelligence Graduate School Program (KAIST)). This work was mainly supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFCIT190206.
References
Appendix A Detailed description of the MAXSAT problem
CNF formula. A conjunctive normal form (CNF) formula is a conjunction of clauses, where each clause is a disjunction (OR) of literals, i.e., propositional variables or their negations. For example, the following is a CNF formula with 2 propositional variables $v_1$ and $v_2$:

$(v_1 \lor v_2) \land (\neg v_1 \lor v_2) \land (v_1 \lor \neg v_2) \land (\neg v_1 \lor \neg v_2).$  (1)
MAXSAT problem. The MAXSAT problem aims to find values of propositional variables that maximize the number of satisfied clauses of a given propositional logical formula in CNF. One may interpret the MAXSAT problem as a combinatorial optimization problem. Specifically, let $\tilde{v} \in \{-1, 1\}^{n}$ denote the $n$ propositional variables, and let $S = (s_{ij}) \in \{-1, 0, 1\}^{m \times n}$ encode the CNF formula with $m$ clauses, where $s_{ij}$ indicates the sign of the $j$-th propositional variable in the $i$-th clause (0 if absent). Then the MAXSAT problem is:

$\max_{\tilde{v} \in \{-1, 1\}^{n}} \sum_{i=1}^{m} \bigvee_{j=1}^{n} \mathbb{1}\{s_{ij} \tilde{v}_j > 0\}.$  (2)
We note that there may not exist any assignment of the propositional variables that satisfies all clauses of a CNF formula. For instance, consider the CNF formula (1) with propositional variables $v_1$ and $v_2$: no assignment of the two propositional variables makes the formula satisfiable, while there exist assignments that simultaneously satisfy three of the four clauses.
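A brute-force check confirms this for a formula of this kind, such as $(v_1 \lor v_2) \land (\neg v_1 \lor v_2) \land (v_1 \lor \neg v_2) \land (\neg v_1 \lor \neg v_2)$: every assignment of the two variables satisfies exactly three of the four clauses, so the MAXSAT optimum is 3:

```python
from itertools import product

# Clauses as lists of signed variable indices: +i means v_i, -i means ¬v_i.
clauses = [[+1, +2], [-1, +2], [+1, -2], [-1, -2]]

def num_satisfied(assignment: dict[int, bool]) -> int:
    """Count clauses satisfied by a truth assignment."""
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

best = max(num_satisfied(dict(zip([1, 2], vals)))
           for vals in product([False, True], repeat=2))
assert best == 3  # MAXSAT optimum: no assignment satisfies all four clauses
```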
Appendix B Detailed description of the reasoning layer
As the MAXSAT problem in Appendix A is a combinatorial optimization problem and is not differentiable, several approaches (wang2018low; goemans1995improved) relax the MAXSAT problem into a continuous optimization problem. Specifically, they convert the MAXSAT problem into a semidefinite program (SDP). In detail, the MAXSAT problem in (2) can be formulated as the following optimization problem:

$\min_{V \in \mathbb{R}^{k \times (n+1)}} \langle S^{\top} S, V^{\top} V \rangle \quad \text{subject to} \quad \|v_j\|_2 = 1 \text{ for all columns } v_j \text{ of } V,$  (3)

where the columns of $V$ and the rows of $S$ are relaxations of the propositional variables and the clauses, respectively. Here, the solution of the original MAXSAT problem can be recovered from the solution of the SDP in a probabilistic manner via randomized rounding with a random hyperplane (goemans1995improved; wang2019satnet). Moreover, barvinok1995problems; pataki1998rank show that the optimal solution of the original problem can be recovered from the relaxed SDP if $k > \sqrt{2n}$, and wang2017mixing proves that a coordinate descent update on $V$ for this SDP converges to an optimal fixed point.
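Randomized rounding can be sketched as follows. This is a minimal numpy illustration in the style of goemans1995improved of recovering signs from relaxed unit-vector variables via a random hyperplane; the truth-direction column `v_truth` is an assumption borrowed from the SATNet-style parametrization, not an exact reproduction of it:

```python
import numpy as np

rng = np.random.default_rng(42)

def randomized_rounding(V: np.ndarray, v_truth: np.ndarray) -> np.ndarray:
    """Round relaxed unit-vector variables (columns of V) to {-1, +1}."""
    r = rng.normal(size=V.shape[0])          # random hyperplane
    # a variable is true iff it lands on the same side as the truth vector
    return np.sign(V.T @ r) * np.sign(v_truth @ r)

k, n = 4, 3
V = rng.normal(size=(k, n))
V /= np.linalg.norm(V, axis=0)               # unit-norm columns
v_truth = rng.normal(size=k)
v_truth /= np.linalg.norm(v_truth)
rounded = randomized_rounding(V, v_truth)
assert set(np.unique(rounded)) <= {-1.0, 1.0}
```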
With this continuous relaxation of the MAXSAT problem, wang2019satnet propose SATNet to bridge such continuously relaxed problems and deep neural networks. Specifically, they regard $S$ in (3) as the layer weight of the neural network: they define a differentiable forward operation that solves (3) with the current weight $S$, and a backward operation that learns a relaxed logical formula from a given dataset by optimizing $S$.
Appendix C Detailed description of the optimization scheme
We first note that the input of the reasoning layer is non-differentiable, as it is produced by the $\arg\min$ operator. Consequently, optimizing the reasoning loss affects the reasoning layer parameters but not the other parameters, e.g., the codebook. To solve this problem, we use a relaxed version of the propositional embeddings, denoted by $\tilde{p}$, obtained by replacing the hard codebook assignment with a differentiable soft assignment over codebook distances.
Moreover, we also note that no gradient flows through the quantization step itself due to the $\arg\min$ operator used to obtain the quantized latent from each image. To compensate for this issue, we simply use the straight-through estimator (bengio2013estimating), i.e., we copy the gradients of the decoder input to the encoder output.
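The softened codebook assignment can be illustrated as follows. This is a hypothetical softmax relaxation over negative codebook distances; the exact relaxation used in LoGe may differ, but the contrast between the hard and soft assignments is the same:

```python
import numpy as np

def hard_assignment(z: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Non-differentiable one-hot assignment via argmin distance."""
    d = np.linalg.norm(codebook - z, axis=1)
    return np.eye(len(codebook))[d.argmin()]

def soft_assignment(z: np.ndarray, codebook: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Differentiable relaxation: softmax over negative distances."""
    d = np.linalg.norm(codebook - z, axis=1)
    logits = -d / tau
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
codebook = rng.normal(size=(8, 4))
z = codebook[5] + 0.01 * rng.normal(size=4)  # a latent close to code 5
p_hard = hard_assignment(z, codebook)
p_soft = soft_assignment(z, codebook)
assert p_hard.argmax() == p_soft.argmax() == 5
```

As the temperature `tau` decreases, the soft assignment approaches the hard one-hot assignment while remaining differentiable.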
Hyperparameters. We note that LoGe contains the following hyperparameters: the size of the codebook, the size of the spatial features, the size of the reasoning layer (see Appendix B), and the coefficient $\beta$ in the abstraction loss. In all experiments, we use the same hyperparameter setup.

Appendix D Details of the RAVEN/iRAVEN dataset
RAVEN. The RAVEN dataset (zhang2019raven) is a synthetic dataset to evaluate the abstract reasoning ability of machines, where each problem follows the Raven's Progressive Matrices (RPM) format. Specifically, the dataset consists of 7 problem types in total: CenterSingle (Center), LeftRight (LR), UpDown (UD), OutInCenter (OIC), OutInGrid (OIG), 2x2Grid, and 3x3Grid, where each configuration contains 10000 problems. Here, we consider 4 out of the 7 configurations: Center, LR, UD, and OIC. Each image contains five attributes: number of objects, position, shape, size, and color. The dataset contains 4 rules in total: an attribute is either constant, progressive, arithmetic, or distributed across each row of the problem. Figure 1 illustrates the Center configuration of the RAVEN dataset, and more illustrations are provided in Figure 3.
iRAVEN. The iRAVEN dataset (hu2020hierarchical) is a modified version of the RAVEN dataset with a different rule for generating the list of candidates in each problem. To be specific, hu2020hierarchical find that there exists a shortcut bias in the candidates of the RAVEN dataset, i.e., one can find the answer from the candidates alone, without accessing the context images of the problem. Accordingly, they propose a new RAVEN dataset in which this bias is removed from the candidate set to better measure the reasoning ability of discriminative abstract reasoning frameworks. Moreover, they find that the accuracy of existing methods drops significantly when this shortcut bias is removed.
Appendix E Detailed description of baselines
Baselines for qualitative results. We note that the neural architecture of LoGe is composed of a vector-quantized variational autoencoder (VQ-VAE) (van2017neural) and SATNet (wang2019satnet), based on employing a propositional logic prior to solve abstract reasoning problems. To verify the effectiveness of this prior on abstract reasoning, we qualitatively compare generated answer images from other widely used neural architectures without this assumption. As generative neural architectures for solving reasoning problems are underexplored, we compare the results with architecture variants in which VQ-VAE and SATNet are substituted with different neural architectures. Specifically, we compare results from combinations of different encoders (autoencoder and VQ-VAE) and reasoning networks (an attention layer and 2-layer convolutional neural networks (CNNs) with kernel sizes 3 and 1).
Baselines for quantitative results. In the rest of this section, we briefly explain previous approaches that solve the abstract reasoning problem via deep neural networks.


LSTM (zhang2019raven) utilizes an LSTM to validate the inefficiency of conventional deep neural network architectures for solving reasoning problems.

WReN (santoro2018measuring) proposes to solve the abstract reasoning problem via a relation network (NIPS2017_e6acf4b0).

LEN (zheng2019abstract) proposes a variant of the relation network, where the input of the network is a triplet of images rather than a pair. Furthermore, they empirically verify that performance can be further boosted with curriculum learning based on a reinforcement learning framework.

CoPINet (zhang2019learning) suggests a contrastive learning algorithm to learn underlying rules from given images.

SRAN (hu2020hierarchical) designs a hierarchical neural network framework that simultaneously considers images in the problem individually and also at the row and column level.
Appendix F More illustration of generated results
In this section, we provide additional comparisons of generated results for different configurations of the RAVEN dataset.