Can We Automate Diagrammatic Reasoning?

02/13/2019 ∙ by Sk. Arif Ahmed, et al. ∙ Indian Institute of Technology Bhubaneswar ∙ NIT Durgapur ∙ IIT Roorkee

Learning to solve diagrammatic reasoning (DR) problems is a challenging but interesting problem for the computer vision research community. It is believed that next-generation pattern recognition applications should be able to simulate the human brain to understand and analyze the reasoning conveyed by images. However, due to the lack of diagrammatic reasoning benchmarks, present research primarily focuses on visual reasoning applied to real-world objects. In this paper, we present a diagrammatic reasoning dataset that provides a large variety of DR problems. In addition, we propose a Knowledge-based Long Short Term Memory (KLSTM) to solve diagrammatic reasoning problems. Our analysis is arguably the first work in this research area. Several state-of-the-art learning frameworks have been compared with the proposed KLSTM framework in the present context. Preliminary results indicate that the domain is closely related to computer vision and pattern recognition research and offers several challenging avenues.




1 Introduction

Diagrammatic reasoning involves visual representations of objects or diagrams. It requires understanding concepts and ideas from images consisting of patterns. Solving such diagrammatic reasoning problems using computer vision and artificial intelligence can help us understand complex patterns of objects in images. Typically, a test in diagrammatic reasoning consists of questions that require analyzing a sequence of shapes or patterns. This is also known as an abstract or inductive reasoning test. The task is to identify the rules that apply to a sequence and then use them to pick the appropriate answer. The questions are usually multiple-choice. They generally consist of a series of pictures, each of which is differently shaped or oriented, and the task is to choose another picture from a number of options to complete the series. For example, Figure 1 shows a typical diagrammatic reasoning problem, where the first row represents the question and the second row contains the four options, out of which only one is correct.

Figure 1: A typical example of a diagrammatic reasoning problem. The first row presents the first three objects of a sequence of four objects in a particular order. The second row presents the multiple choices typically shown to an examinee. Option D is the right answer for the above problem.

1.1 Related Work

Solving reasoning problems using artificial intelligence (AI) is a challenging task. For example, solving mathematical word problems kushman2014learning using natural language processing (NLP) is well known in artificial intelligence. Solutions to such problems have enhanced supervised learning strategies by introducing newer rules. However, similar tasks in visual reasoning have not received focused attention from the computer vision research community. Two similar domains that have attracted computer vision and pattern recognition researchers are visual question answering antol2015vqa ; yang2016stacked ; noh2016image and visual reasoning johnson2017inferring ; hu2017learning ; hu2017modeling . Figure 2(a) depicts the Compositional Language and Elementary Visual Reasoning (CLEVR) johnson2017clevr dataset, which has been used to build artificial intelligence systems that reason about and answer questions on visual data. Figure 2(b) depicts the Cornell Natural Language Visual Reasoning (NLVR) synthetic dataset, built for the task of determining whether a statement is true or false about an image. Figure 2(c) represents the Visual Question Answering (VQA) dataset, which contains open-ended questions about images antol2015vqa . Figure 2(d) represents reasoning over image pairs balanced_vqa_v2 .

Figure 2: Visual reasoning dataset. (a) Question: How many objects are either small cylinders or red things? (b) Question: There is exactly one big yellow square not touching any edge (True/False) (c) Q: How many slices are there in the pizza? (d) Q: Is the umbrella upside down?

1.2 Motivation and Contributions

Diagrammatic reasoning can also be posed as a visual sequence prediction problem gavornik2014learned . In computer vision, similar approaches are used to predict future video frames lu2017flexible ; liu2017video ; vukotic2017one . Following a line of thinking similar to visual reasoning in computer vision, we introduce this new domain of research, namely solving diagrammatic reasoning using a machine-learning-guided computer vision process. In this context, we have made the following contributions:

  • We solve diagrammatic reasoning problems with the help of computer vision and pattern recognition techniques.

  • We introduce a rich diagrammatic reasoning dataset that can be used by the computer vision research community for solving similar problems through pattern recognition and machine learning.

  • We introduce a new learning framework, referred to as Knowledge-based Long Short Term Memory (KLSTM), to solve diagrammatic reasoning problems.

The rest of the paper is organized as follows. In Section 2, we present the datasets and benchmarks. Section 3 presents the proposed DR solving method. Experimental results are presented in Section 4. Conclusion and future work are presented in Section 5.

2 Datasets and Benchmarks

The ultimate goal of visual reasoning is to learn image understanding and interpretation. Due to the unavailability of datasets and benchmarks, research in this domain is still in its infancy. There is a large variety of DR problems. For example, Figure 3(a) represents a matrix reasoning problem and Figure 3(b) represents a graphical reasoning problem. These examples seem complex, and we have found them hard to solve through common machine learning frameworks. Such problems are left for future research. In this paper, we have considered diagrammatic problems such as the one shown in Figure 1.

Figure 3: Examples of DR problems that are difficult in nature and may not be possible to solve using typical machine learning frameworks. (a) Example of a matrix reasoning problem. (b) Example of a graphical reasoning problem.

We have collected images of diagrammatic reasoning problems from the web and prepared a dataset. The dataset contains 619 problems. We have categorized these problems into four groups, namely (i) Rotation (RT), (ii) Counting (CT), (iii) Shape Scaling (SS), and (iv) Other Type (OT). Figure 4 depicts one sample question with possible answers from each category, and Figure 5 depicts the distribution of the problems across the categories in our dataset.

Figure 4: Examples of the four types of typical DR problems present in our dataset. (a) Example of a rotation problem (RT), where a pattern is rotated relative to the first image; the correct answer is option B. (b) A typical problem of number series prediction (CT). The question consists of a set of filled circles, whose number varies as 2, 4, 6, ?. The task is to predict the picture with 8 filled circles; the correct answer is option B. (c) The third is an example of a typical shape and scaling problem (SS). The pattern can be interpreted as {Circle, Large Triangle, Circle, Big Triangle, Circle, Small Triangle, ?}. The task is to predict Circle, Tiny Triangle, which is option B. (d) The fourth is a typical pattern understanding problem. We have categorized such problems as Other Type (OT). The task is to predict the pattern; the correct answer is option A.
Figure 5: Distribution of different DR problems in our dataset.

3 KLSTM for Solving DR Problems

The proposed DR solving method is based on a set of features and rules. We have introduced a supervised, rule-based method to extract relational features (RF) from image sequences. The proposed method consists of two major steps. During the first level of processing, the question and options are passed through a knowledge acquisition tool to construct the knowledge base. The knowledge consists of a set of image features extracted from the individual images and a set of relational features extracted from the sequence of images in the question. Next, the problem type is identified using a rule-based method. Finally, the features are passed through a Knowledge-based Long Short Term Memory (KLSTM) to predict the possible output pattern or image. Figure 6 depicts the proposed framework in detail. The KLSTM network consists of (i) a knowledge acquisition module, (ii) a set of LSTMs, and (iii) a problem classifier and LSTM chooser module.

Figure 6: Architecture of the proposed DR solving framework with its various components. We take the question sequence and the options as input and construct a knowledge base. Finally, the framework predicts the best possible option out of the four input options and produces a complete sequence of four patterns/images. The framework consists of a rule-based problem classifier and a set of LSTMs similar to vinyals2015show . The input of the LSTMs are the relational features (RF).

The problem space (P) is defined in (1), where the question contains a set of images and the given options are grouped in another set. Diagrammatic reasoning is to predict the answer image such that it correctly completes the sequence. First, we represent the problem using a high-level knowledge structure. This is carried out as follows. Domain knowledge of human experts (rules) is used to understand the relation among a sequence of images. Next, a knowledge base is constructed. The expert opinions (rules) are then integrated with the system to solve visual reasoning problems. The method is presented in Algorithm 1.
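The problem space described above can be sketched as a small data structure. The following is a minimal sketch in Python; the class name, fields, and image-path representation are illustrative assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DRProblem:
    """One DR problem: an ordered question sequence, candidate options,
    and the index of the correct option (illustrative layout)."""
    question: List[str]   # ordered question images, e.g. the first 3 of a sequence
    options: List[str]    # candidate answer images, e.g. 4 options
    answer: int           # index into `options` of the correct image

def is_correct(problem: DRProblem, predicted_index: int) -> bool:
    """Check a predicted option index against the ground-truth answer."""
    return predicted_index == problem.answer

# The Figure 1 example: three question images, four options, D correct.
p = DRProblem(question=["q1.png", "q2.png", "q3.png"],
              options=["a.png", "b.png", "c.png", "d.png"],
              answer=3)  # option D
```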

1: Input: problem space P as defined in (1)
2: Output: the predicted answer image from the option set
3: Extract knowledge base K from the training data
4: C = Classify(Q, K), where C ∈ {RT, CT, SS, OT}
5: Predict the answer image using the model selected for class C
Algorithm 1 Diagrammatic Reasoning

Knowledge Acquisition: Knowledge acquisition is carried out during training. The knowledge base is extracted from a set of training samples, i.e., problems used in training. First, the shapes in each image and the number of similar shapes are extracted using YOLO redmon2018yolov3 . YOLO is fast, and its accuracy is good enough in our context. We then introduce a new feature for solving diagrammatic reasoning problems, referred to as the relational feature. Unlike image-based features such as color, texture, shape, or edge that are typically used in various computer vision applications, we extract three relational features, namely rotation, counts, and scaling, from the set of given images. The feature-set is given in (2). Various components of the feature-set are described hereafter.


Shape Detection: Each image is passed through a deep learning module to extract the shapes. We have considered common geometrical shapes such as circle, triangle, rectangle, square, diamond, star, and hexagon that are usually present in various DR problems. All the shapes are classified as either empty (only edges) or filled. YOLO redmon2018yolov3 has been found to be a good binary classifier as compared to ResNet50/101 he2016deep , VGG16 simonyan2014very , or GoogleNet szegedy2015going .

Rotation: In a typical rotation diagrammatic reasoning problem (Figure 7), the solution lies in rotating the figure correctly to complete the sequence. We assume the first image to be the reference with a rotation of 0°. All other images are expressed using a rotation angle with respect to the reference image.

Figure 7: We represent the rotation problem as a set of 7 images or patterns. In rotation problems, we consider the first image (red) as the reference image with 0° rotation and extract the rotation relation of the other images.

To achieve this, 360 images are generated by incrementally rotating the base image by 1°. A few samples of the rotated images corresponding to the DR problem described in Figure 7 are shown in Figure 8. The similarity score is defined in (3). This score is estimated between a query image and every image of the rotated set using ResNet50.


The relative rotation of each image is then extracted with respect to the reference. If the images in the question are different from each other, we categorize the question as a non-rotation problem and a flag NA is set. A threshold has been used to decide the success of matching. The rotation value is accepted if the matching score returned by ResNet50 is above the threshold. However, in the event of multiple images scoring above the threshold, the image that gives the highest value is selected and its rotation angle is taken as the final input. In the event that none is found suitable, the problem is categorized as a non-rotation diagrammatic reasoning problem.
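The rotation-matching procedure above can be sketched in code. The sketch below swaps the ResNet50 similarity for plain normalized correlation and restricts the search to 90° steps for brevity (the paper searches 1° increments over 360 templates); the function names and the 0.9 threshold are illustrative assumptions:

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized correlation between two equal-sized images --
    a simple stand-in for the ResNet50 matching score in the paper."""
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def estimate_rotation(reference: np.ndarray, query: np.ndarray,
                      threshold: float = 0.9):
    """Return the rotation (degrees) of `query` relative to `reference`,
    searching 90-degree steps only; returns None (the 'NA' flag) when
    no rotated template scores above the threshold."""
    best_angle, best_score = None, threshold
    for k in range(4):                       # 0, 90, 180, 270 degrees
        candidate = np.rot90(reference, k)   # rotated template
        score = similarity(candidate, query)
        if score > best_score:
            best_angle, best_score = 90 * k, score
    return best_angle

# A toy asymmetric 'image': an L-shaped pattern.
ref = np.zeros((8, 8))
ref[1:7, 1] = 1
ref[6, 1:5] = 1
rotated = np.rot90(ref, 3)                   # query rotated by 270 degrees
```

Using finer angular steps only changes the template-generation loop; the threshold logic, including the NA fall-through, stays the same.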

Figure 8: Sample images after applying rotations to the first image of P, depicting how a possible match is found for a given query image.

The relative rotations of the diagrammatic reasoning problem depicted in Figure 4(a) are extracted for the images in the question and for the images in the answer options.

Counting: Counting is a reasoning problem where the solution requires extracting the correct number of shapes present in the problem sequence. First, the shapes are detected and the number of identical shapes is estimated. For example, Figure 9 depicts typical filled-circle detection and counting using YOLO. Each image of the problem space is expressed using the count of shapes in the sequence, and the missing number is predicted from that sequence.

Figure 9: A typical counting DR problem with 7 pictures. The first three patterns represent the sequence given in the question (2, 4, 6, ?) and the next four patterns represent the options for the probable answer, with 8 as the correct option.
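The counting logic can be sketched as a simple arithmetic-progression predictor over the detected counts. The per-option counts in the usage lines are hypothetical; only the 2, 4, 6 → 8 sequence comes from the example above:

```python
def predict_next_count(counts):
    """Predict the next term of the question's count sequence, assuming
    (as in the 2, 4, 6, ? example) a constant common difference."""
    diffs = [b - a for a, b in zip(counts, counts[1:])]
    if len(set(diffs)) != 1:
        return None          # not an arithmetic progression
    return counts[-1] + diffs[0]

def pick_option(question_counts, option_counts):
    """Select the index of the option whose shape count matches the
    predicted term, or None when no option matches."""
    target = predict_next_count(question_counts)
    return option_counts.index(target) if target in option_counts else None

# Figure 9 example: question counts 2, 4, 6, ?
question = [2, 4, 6]
options = [5, 8, 7, 9]   # hypothetical per-option counts; 8 is the answer
```

More general series (e.g. geometric growth) would need additional rules, which is consistent with the paper's observation that simple rules do not cover every DR problem.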

Scaling: Relative scaling is extracted from the bounding boxes of the detected shapes. First, the bounding boxes are extracted from the shapes. Next, similar-sized shapes are grouped using unsupervised Density-Based Spatial Clustering of Applications with Noise (DBSCAN) tran2013revised . The groups are then rearranged in order of size. These groups can be labelled as large, medium, small, and tiny for a typical DR problem. The process of grouping and labelling shapes is described in Algorithm 2.
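The size-grouping step can be approximated with a one-dimensional, gap-based clustering of bounding-box areas. This is a simplified stand-in for the DBSCAN step above, and the relative tolerance `eps` is an assumed value, not one reported in the paper:

```python
def group_by_size(areas, eps=0.25):
    """Group bounding-box areas so that consecutive areas within a
    relative gap of `eps` share a cluster -- a simplified 1-D stand-in
    for the DBSCAN grouping described in the text."""
    order = sorted(range(len(areas)), key=lambda i: areas[i])
    labels = [0] * len(areas)
    label = 0
    for prev, cur in zip(order, order[1:]):
        # A large relative jump between consecutive sorted areas
        # starts a new size group.
        if areas[cur] - areas[prev] > eps * areas[prev]:
            label += 1
        labels[cur] = label
    labels[order[0]] = 0
    return labels    # 0 = smallest group ('tiny'), increasing with size
```

A real DBSCAN (e.g. `sklearn.cluster.DBSCAN` on the area values) would behave similarly here, with `eps` playing the role of the neighbourhood radius.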

1: Input: problem space (P) as defined in equation (1)
2: Output: relational scaling of each image
3: Extract bounding boxes of the detected shapes
4: Group similar-sized boxes using DBSCAN
5: Rearrange the groups and assign ordered size labels
6: Assign each shape its group (size) label
Algorithm 2 Diagrammatic Reasoning for Scaling Problems

Figure 10(a) depicts a DR problem where the size of the pattern is used as a clue for the solution. The DBSCAN algorithm identified four classes or groups, which express the problem sequence and the solution options.

Figure 10: (a) Detection of shapes achieved by YOLO. (b) The bounding boxes are grouped using DBSCAN. Each color represents a group of same-sized shapes.

Representation of the Knowledge Base: For a given problem space, the shapes are detected and the relational features (RF) are extracted as mentioned earlier. The knowledge base consists of four sets, namely shapes, rotation, counting, and scaling. The shapes set stores information about the structures, and the other sets represent the various components of the relational features. Table 1 shows the knowledge extracted from four different problems.

DR Problem Constructed Knowledge Base
Table 1: Typical examples of knowledge base extracted using the features described earlier

Classification and Solving: The final stage is to learn the pattern from the question images and predict the correct answer from the given options. At the beginning, the relational features (RF) are extracted from all training samples. Next, four independent LSTMs, corresponding to the RT, CT, SS, and unknown (OT) problems, are trained to build the prediction model. In the testing phase, a similar knowledge base is extracted from the test sample. Next, a rule-based method as described in Algorithm 3 is applied to classify the problem as Category 1 (RT, CT, or SS) or Category 2 (OT). In the case of Category 1, a variation of the LSTM proposed in vinyals2015show is used. That method was originally used to generate captions from images. Rather than using conventional image-based features vukotic2017one ; liu2017video , we use the relational features (RF) extracted by the knowledge extractor. The method is depicted in Figure 11, where the knowledge extractor (KE) is the process of extracting RFs and the representor (R) is the image with its set of relational features.

1: Input: relational features of the problem
2: Output: problem class
3: if the rotation features differ across the sequence then class = RT
4: else if the count features differ then class = CT
5: else if the scaling features differ then class = SS
6: else class = OT
7: end if
Algorithm 3 Classification of Diagrammatic Reasoning Problem
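The rule cascade of Algorithm 3 can be sketched directly. The dictionary layout of the knowledge base and the order in which the features are tested are illustrative assumptions:

```python
def classify_problem(knowledge):
    """Rule-based DR problem classifier in the spirit of Algorithm 3:
    whichever relational feature varies across the question sequence
    determines the category; if none varies, fall back to Other Type.
    `knowledge` maps feature names to per-image value lists (assumed layout)."""
    if len(set(knowledge.get("rotation", []))) > 1:
        return "RT"          # rotation varies -> rotation problem
    if len(set(knowledge.get("counts", []))) > 1:
        return "CT"          # shape count varies -> counting problem
    if len(set(knowledge.get("scaling", []))) > 1:
        return "SS"          # size label varies -> scaling problem
    return "OT"              # nothing varies -> unknown / other type
```

Category 1 problems (RT, CT, SS) are then routed to the corresponding LSTM, and Category 2 (OT) to the interpolation model.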
Figure 11: Prediction model for solving Category 1 problems. KE: Knowledge Extractor, R: Representor, M: Matching Module.

Unknown problems (Category 2) are solved by a variation of LSTM called the Flexible Spatio-Temporal Network (FSTN) proposed in lu2017flexible . Originally, the method predicts future video frames from a set of observed sequences. In this method, image-based features are sequentially passed through an LSTM in an encoding/decoding manner. Figure 12 depicts the method in detail. The method consists of encoders (E), decoders (D), and a matching network (M). The network is trained using several image sequences. Table 2 shows the reference feature prediction for the problems shown in Table 1.

Figure 12: Interpolation model for solving Category 2 problems. E: Encoder, D: Decoder, M: Matching Network.
Predicted Answer Predicted Knowledge Detected Category Correct?
Category 1 (RT) Yes
Category 1 (CT) Yes
Category 1 (SS) Yes
NA Category 2 (OT) No
Table 2: Details of the predicted knowledge and answers
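The final step in both categories, matching the predicted knowledge against the options (module M in Figures 11 and 12), can be sketched as scoring each option's relational features against the predicted ones. Exact-match scoring here is a simplification of the learned matching network, and the feature layout is assumed:

```python
def match_option(predicted, option_features):
    """Matching-module sketch: count how many predicted relational
    features each option reproduces exactly, and return the index of
    the best-scoring option."""
    def score(opt):
        return sum(1 for k in predicted if opt.get(k) == predicted[k])
    scores = [score(opt) for opt in option_features]
    return scores.index(max(scores))

# E.g. a rotation problem whose predicted next rotation is 270 degrees,
# with hypothetical per-option rotations:
predicted = {"rotation": 270}
options = [{"rotation": 90}, {"rotation": 180},
           {"rotation": 270}, {"rotation": 0}]
```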

4 Experiments using Baselines

We present the experimental results in this section. The first step of the method is to detect shapes in a given image. We have experimented with state-of-the-art convolutional networks including ResNet50, ResNet101 he2016deep , VGG16 simonyan2014very , GoogleNet szegedy2015going , and YOLO redmon2018yolov3 . YOLO has been found to be the best architecture for the present task. 70% of the data have been used for training and 30% for testing across all experiments. We have performed 10-fold cross-validation and report the average results. Table 3 summarizes the shape detection results.

Algorithm Accuracy (%)
ResNet50 (baseline) 57.19
ResNet101 he2016deep 62.19
VGG16 simonyan2014very 71.11
GoogleNet szegedy2015going 77.22
YOLO redmon2018yolov3 86.76
Table 3: Results of shape detection

In the next stage, classification of the problem type has been carried out. The confusion matrix for the four types of problems is depicted in Figure 13. It is observed that the proposed method can successfully classify the problems with reasonably high accuracy.

Figure 13: The confusion matrix for classifying DR problem. R: RT, C: CT, S: SS, O: OT.

We have carried out several experiments to understand the behavior of the DR problem solver. We have taken image-based features as the baseline and applied state-of-the-art recurrent neural networks (RNN) to solve the reasoning problems. The results are summarized in Table 4. Figure 14 depicts some success and failure cases.

Algorithm Rotation (RT) Counting (CT) Scaling (SS) Other (OT) Average
Image+LSTM (baseline) 57.12 42.13 62.11 36.5 49.46
Image+Encoder/Decoder vukotic2017one 62.11 41.12 61.11 37.89 50.55

Image+Deep feature 64.39 47.19 41.91 42.86 49.08
Image+RNN bengio2015scheduled 56.80 41.19 54.91 32.20 46.27
Image+FSTN lu2017flexible 66.11 37.19 66.91 34.90 51.27
RF+FSTN lu2017flexible 51.16 52.19 46.34 42.86 48.13
Proposed KLSTM 75.87 76.22 73.41 66.91 73.10
Table 4: Comparative results of DR problem solving
Figure 14: A few samples of the proposed DR dataset. Green boxes represent ground truths and correctly solved problems, red boxes represent wrongly predicted answers.

5 Conclusion

In this paper, we have introduced a new dataset for solving diagrammatic reasoning (DR) problems using machine learning and computer vision. The dataset can open up new challenges to the vision community. We have experimented with several state-of-the-art learning frameworks to solve typical DR problems. It has been observed that image-based analysis fails to answer correctly in many cases. We have introduced a new feature set called the relational feature. Rule-based learning with the help of LSTMs has been used to classify the DR questions. The results show that the proposed rule-based method outperforms existing image-based analysis.

It has been observed that the simple rules defined in this work may not be sufficient to solve all types of DR problems. More complicated rules need to be defined, and we may need to redefine the feature-set for solving complex DR problems. In particular, Other Type (OT) DR problems need further attention from the research community.


Funding: We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used for this research.
Conflict of interest: The authors declare that there is no conflict of interest regarding the publication of this paper.
Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent: Informed consent was obtained from all individual participants included in the study.





  • (1) N. Kushman, Y. Artzi, L. Zettlemoyer, R. Barzilay, Learning to automatically solve algebra word problems, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, 2014, pp. 271–281.
  • (2) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
  • (3) Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks for image question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 21–29.
  • (4) H. Noh, P. Hongsuck Seo, B. Han, Image question answering using convolutional neural network with dynamic parameter prediction, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 30–38.

  • (5) J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, R. B. Girshick, Inferring and executing programs for visual reasoning., in: ICCV, 2017, pp. 3008–3017.
  • (6) R. Hu, J. Andreas, M. Rohrbach, T. Darrell, K. Saenko, Learning to reason: End-to-end module networks for visual question answering, CoRR, abs/1704.05526 3.
  • (7) R. Hu, M. Rohrbach, J. Andreas, T. Darrell, K. Saenko, Modeling relationships in referential expressions with compositional modular networks, in: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, 2017, pp. 4418–4427.
  • (8) J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, R. Girshick, Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, in: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, 2017, pp. 1988–1997.
  • (9) Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, in: Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • (10) J. P. Gavornik, M. F. Bear, Learned spatiotemporal sequence recognition and prediction in primary visual cortex, Nature neuroscience 17 (5) (2014) 732.
  • (11) C. Lu, M. Hirsch, B. Schölkopf, Flexible spatio-temporal networks for video prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6523–6531.
  • (12) Z. Liu, R. A. Yeh, X. Tang, Y. Liu, A. Agarwala, Video frame synthesis using deep voxel flow., in: ICCV, 2017, pp. 4473–4481.
  • (13) V. Vukotić, S.-L. Pintea, C. Raymond, G. Gravier, J. C. van Gemert, One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network, in: International Conference on Image Analysis and Processing, Springer, 2017, pp. 140–151.
  • (14) O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
  • (15) J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767.
  • (16) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • (17) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
  • (18) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • (19) T. N. Tran, K. Drab, M. Daszykowski, Revised dbscan algorithm to cluster data with dense adjacent clusters, Chemometrics and Intelligent Laboratory Systems 120 (2013) 92–96.
  • (20) T. Lan, T.-C. Chen, S. Savarese, A hierarchical representation for future action prediction, in: European Conference on Computer Vision, Springer, 2014, pp. 689–704.
  • (21) S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.