1 Introduction
Concept learning actively extracts representations from a visual scene and connects them to linguistic tokens for identification, resembling human cognition [inhelder2013early]. For example, when identifying an orange, we may focus on its color (e.g. reddish-yellow) and shape (e.g. round). Concept learning is fundamental to vision-and-language tasks that perform multi-step inference over a scene’s entities and their relationships, such as Visual Question Answering (VQA) [antol2015vqa], Visual Commonsense Reasoning (VCR) [zellers2019recognition] and Vision-Language Navigation (VLN) [anderson2018vision].
In recent years, great progress has been made on concept learning to increase the accuracy of identification. Early algorithms require explicit annotations, such as ground-truth scene graphs, for training concept learners [mascharka2018transparency, yi2018neural].

More recently, Mao et al [mao2019neuro] proposed quasi-symbolic execution together with learning visual subspaces using only question-answer pairs for images. Though the state-of-the-art methods can achieve near-perfect accuracy on diagnostic datasets such as the CLEVR dataset [johnson2017clevr], they suffer a heavy performance drop on out-of-distribution compositions [mascharka2018transparency, yi2018neural, mao2019neuro, marois2018transfer]. We also observe that they are vulnerable to attribute perturbations during inference.
A demonstration of attribute perturbation is shown in Figure 1. The left scene shares exactly the same setting as the training data, i.e. object attributes such as color, shape, and material follow the same distributions as the training data. In the right scene, small perturbations are added to the color attribute, e.g. red → light red. To examine a learner’s robustness to color perturbation, we output the scores for judging metal and sphere from the well-trained NSCL model [mao2019neuro] (shown as blue bars). We find it performs perfectly in the left scene, but much worse in the right one. This reveals that the learner’s recognition of the material and shape attributes is affected by colors. We ascribe this phenomenon to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g. {red, blue, …} belong to the ‘color’ subspace whereas cube belongs to ‘shape’.
An abstraction of such a hierarchy is vital to human reasoning. It is one of the fundamental capabilities that make us robust in understanding the real world and able to generalize to unseen scenarios [inhelder2013early, murphy2004big, landauer1997solution, lund1996producing, lake2020word, tenenbaum2011grow, rosch1976basic, tanaka1991object]. Human learners capture the essential hypothesis spaces in parsimonious form, and the resulting hierarchical structure enables describing not only the specific situation at hand, but also a broader class of situations over which learning should generalize [tenenbaum2011grow]. As an attempt to introduce such a hierarchy into machine learning, Han et al [han2020visual] propose a visual concept-metaconcept learner (VCML) that utilizes extra metaconcept supervision. Different from them, we learn the semantic hierarchy from natural VQA data alone. To the best of our knowledge, we are the first to explore the intrinsic semantic hierarchy under natural weak supervision.

In this paper, we propose a visual superordinate abstraction framework that explicitly models semantic-aware visual subspaces, which we denote as visual superordinates. With weak supervision from visual question answering data, the concept learner first acquires the semantic hierarchy from a linguistic view in a simple curriculum. In the following curriculum, it explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. Upon this abstraction framework, we propose a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme, which further enhance the discrimination and independence of concepts within each visual superordinate. Quasi-center visual concept clustering models the relationships between the clusters of visual representations and the linguistic features: by introducing a quasi-center for each visual cluster, we simultaneously (i) reduce the distance between visual representations and the quasi-center, and (ii) increase the similarity between visual samples and the corresponding concept. With superordinate shortcut learning, the learner rectifies its judgment by reducing spurious causal-effect relations among superordinates.
Finally, comprehensive experiments are conducted to verify the effectiveness and generalization ability of the proposed concept learner. For the demonstration in Figure 1, our method identifies metal and sphere correctly even with color perturbations. Statistically, the proposed model achieves accuracy comparable to the state-of-the-art methods under the regular setting. On the more challenging CLEVR-Perturb setting, our method outperforms NSCL [mao2019neuro] by a relative improvement of about 7% (Table 3). On the CLEVR-CoGenT dataset, our model overcomes the implicit bias in the training data and performs the best on split val-B without finetuning, surpassing the best competing method by over 12 points (Table 1).
2 Related Work
Concept learning
Recent exploration on elementary visual reasoning starts from [johnson2017clevr], which provides a diagnostic dataset to test the reasoning and generalization ability of a model by answering visual-related questions. As for joint learning of vision and natural language, existing methods mainly differ in their visual representations and question parsing process. Initial methods conduct reasoning primarily on convolutional feature maps. Johnson et al [johnson2017inferring] build a reasoning system composed of a program generator and an execution engine, where the engine executes the decoded program sequence on feature maps obtained from CNNs [he2016deep]. Hu et al [hu2017learning] propose end-to-end module networks that conduct reasoning by directly predicting instance-specific network layouts without an off-the-shelf parser. To further get rid of annotated layout data, Hu et al [hu2018explainable] replace the layout graph in [hu2017learning] with a stack-based data structure that allows fully differentiable optimization. Similarly, Mascharka et al [mascharka2018transparency] propose a set of visual reasoning primitives from which a model is composed according to a given question.
Different from the methods reasoning on convolutional feature maps, Yi et al [yi2018neural] parse a scene into a structural graph using trainable object and attribute detectors. Since the scene parser in [yi2018neural] requires ground-truth structural graphs for training, Mao et al [mao2019neuro] propose quasi-symbolic execution and simultaneously learn the scene parser and semantic parser using only question-answer pairs for images. Li et al [li2020competence] propose a multi-dimensional Item Response Theory (mIRT) model that guides the learning process with an adaptive curriculum to increase training efficiency. Perez et al [perez2018film] design a feature-wise linear modulation layer to improve the reasoning ability of the vanilla baseline in [johnson2017inferring]. Hudson and Manning [hudson2018compositional] introduce a recurrent memory, attention, and composition (MAC) cell that maintains control and memory separately to balance transparency and versatility, without explicitly parsing questions into programs. Following MAC [hudson2018compositional], Wang et al [wang2021interpretable] devise an object-centric compositional attention model to induce a symbolic concept space. To boost the quality of detected objects, Kamath et al [kamath2021mdetr] pre-train an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, and then finetune it on visual reasoning datasets. We follow the neuro-symbolic reasoning process in [mao2019neuro], but different from their semantic-agnostic visual representations, we learn visual superordinates to increase robustness.

Causal inference
Causal analysis infers probabilities under both static and changing conditions, aiming to disentangle correlations in biased data [pearl2009causal]. The structural causal model (SCM), a directed graph that reveals causal relationships between random variables, has been developed as a formal tool to model causation from statistical data and to support counterfactual reasoning. It has been widely used in medical, psychological, and social research [dunn2015evaluation, king2008political, mackinnon2007mediation, richiardi2013mediation] to determine the effect of a treatment or policy, and has recently been introduced into computer vision [nair2019causal, niu2020counterfactual, qi2020two, wang2020visual, yang2020deconfounded, tang2020unbiased] to enable counterfactual reasoning. Though effective, it can hardly be applied to concept learning models without semantic-aware visual subspaces. In a flat concept set, the distance between any two concepts is not comparable with that between another two, and neither is the distance between visual features and different concepts. For example, an object representation can be close to both “red” and “cube”, which gives no hint about its distance to other concepts. However, within a visual superordinate, closeness to “red” indicates a relatively large distance to “green”. Our visual superordinate abstraction framework models semantic-aware visual subspaces as independent variables, which enables such advanced strategies to be introduced into concept learning.

3 Method
To conduct neuro-symbolic reasoning, the pipeline generally includes a concept learner and two parsers for vision and language respectively. Following NSCL [mao2019neuro], in our method, visual parsing is done via pre-trained object detectors. The language parser converts a given question into a sequence of programs that consist of concepts and operations. For instance, the question “What is the size of the sphere left of the blue metal object?” is parsed into the program sequence Filter(ObjConcept 2 (blue metal), Scene) → Filter(RelConcept 1 (left), ·) → Filter(ObjConcept 1 (sphere), ·) → Query(Attribute 1 (size), ·). The key reasoning step is judging whether an object shares a concept.
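To make this pipeline concrete, the sketch below shows how such a parsed program could be executed step by step over detected objects. It is an illustration of the general quasi-symbolic pattern only, with hypothetical operation and field names (`Filter`, `Query`, `scores`), not the authors' implementation, and it omits relational steps for brevity.

```python
# Minimal sketch of quasi-symbolic program execution (hypothetical names,
# not the authors' implementation). Each detected object carries per-concept
# scores, and each program step filters or queries the current selection.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DetectedObject:
    scores: Dict[str, float]      # e.g. {"blue": 0.93, "metal": 0.88, ...}
    attributes: Dict[str, str]    # e.g. {"size": "small"} (used by Query)

def execute(program: List[dict], scene: List[DetectedObject]):
    """Run a simplified program of Filter/Query steps over the scene."""
    selection = scene  # start from all detected objects
    for step in program:
        if step["op"] == "Filter":
            # keep objects whose concept scores all exceed a threshold
            selection = [o for o in selection
                         if all(o.scores.get(c, 0.0) > 0.5 for c in step["concepts"])]
        elif step["op"] == "Query":
            # return the queried attribute of the remaining object
            return selection[0].attributes[step["attribute"]]
    return selection

# Toy usage: "What is the size of the blue metal sphere?"
scene = [DetectedObject({"blue": 0.9, "metal": 0.8, "sphere": 0.7}, {"size": "small"}),
         DetectedObject({"blue": 0.1, "metal": 0.9, "sphere": 0.2}, {"size": "large"})]
program = [{"op": "Filter", "concepts": ["blue", "metal", "sphere"]},
           {"op": "Query", "attribute": "size"}]
print(execute(program, scene))  # -> "small"
```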
3.1 Visual Superordinate Abstraction
Linguistic hierarchy
Let $o_i$ denote the visual feature of the $i$-th object in a scene, and let $\mathcal{C}$ be the set of concepts to be learned. These concepts describe $K$ attributes in total, yet it is unknown which concept belongs to which attribute. The concept learner learns mapping functions $\{g_k\}_{k=1}^{K}$ and forms visual subspaces $\{\mathcal{V}_k\}_{k=1}^{K}$, where each $\mathcal{V}_k$ contains the mapped features $g_k(o_i)$.
In previous methods, $\{\mathcal{V}_k\}_{k=1}^{K}$ are semantic-agnostic visual subspaces, thus the probability of the $i$-th object sharing concept $c$ is estimated as

$$ p(c \mid o_i) = \sum_{k=1}^{K} p(k \mid c)\;\sigma\!\left(\frac{\cos\!\big(g_k(o_i),\, e_c^k\big) - \beta}{\gamma}\right), \qquad (1) $$

where $e_c^k$ is the embedding of concept $c$ corresponding to $\mathcal{V}_k$ and $\sigma$ is the sigmoid function. $\beta$ and $\gamma$ are the shifting and the scaling parameters respectively. The prior probability $p(k \mid c)$ is given by a normalized vector learned along with the concept $c$. The conditional probability $p(c \mid o_i, k)$ is given by the cosine similarity in the $k$-th subspace, where the normalization denominator is omitted.
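A minimal sketch of this semantic-agnostic scoring is given below, using the notation above with randomly initialized embeddings and placeholder values for $\beta$ and $\gamma$; the tensor names are illustrative and not the authors' code.

```python
# Sketch of Eq. (1): semantic-agnostic concept probability (illustrative only).
# The score marginalizes over all K subspaces, weighted by a prior p(k | c).
import torch
import torch.nn.functional as F

K, D = 4, 64                       # number of subspaces, subspace dimension
beta, gamma = 0.2, 0.1             # shifting / scaling (placeholder values)

obj_feat = torch.randn(256)        # detected object feature o_i
proj = [torch.nn.Linear(256, D) for _ in range(K)]   # mappings g_k
concept_emb = torch.randn(K, D)    # embedding e_c^k of concept c in each subspace
prior = torch.softmax(torch.randn(K), dim=0)         # p(k | c), normalized

def prob_semantic_agnostic(o):
    # cosine similarity in every subspace, then sigmoid, then prior-weighted sum
    sims = torch.stack([F.cosine_similarity(g(o), e, dim=0)
                        for g, e in zip(proj, concept_emb)])
    return (prior * torch.sigmoid((sims - beta) / gamma)).sum()

print(prob_semantic_agnostic(obj_feat))
```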
The goal of our method is to first learn the linguistic hierarchy that naturally exists in visual reasoning questions. Different from the curriculum in NSCL [mao2019neuro], which arranges lessons by the depth of programs, we devise the curricula from easy question types to hard ones. For curriculum-I, we use programs of depth less than six and scenes containing fewer than six objects, where the query-type question-answer pairs contain a soft alignment of the linguistic hierarchy. Our experiments show that, after curriculum-I, our model is capable of accurately defining the linguistic hierarchy, i.e. the affiliation illustrated in Figure 2 (b). With the explored linguistic hierarchy, we can further explore semantic-aware visual superordinates and the clustering inside them.
Training objectives. Given parsed questions from a pre-trained language parser, the concept learner is fully differentiable and is trained by maximizing the likelihood of the right answers,

$$ \max_{\theta}\; \mathbb{E}_{(q,a)}\big[\log p\big(a \mid \mathrm{Exec}(q;\theta)\big)\big], \qquad (2) $$

where $\mathrm{Exec}(\cdot;\theta)$ represents the quasi-symbolic program execution and $\theta$ is the set of parameters in the concept learner. The execution result of each step is given by Eq. (1).
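As a rough sketch of this objective, assuming the executor returns a distribution over candidate answers (the variable names and the number of answer candidates are illustrative):

```python
# Sketch of Eq. (2): maximize the likelihood of the correct answer,
# i.e. minimize -log p(a | Exec(q; theta)).
import torch
import torch.nn.functional as F

answer_logits = torch.randn(1, 28, requires_grad=True)  # e.g. 28 candidate answers
target = torch.tensor([3])                               # index of the ground-truth answer
loss = F.cross_entropy(answer_logits, target)            # = -log p(a | Exec(q; theta))
loss.backward()                                           # gradients flow into the concept learner
```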
Semantic-aware visual superordinate abstraction.
Though the linguistic hierarchy is learned, the visual mappings are still semantic-agnostic, which leads to entanglement between visual representations. The cause lies in the full-probability reasoning of Eq. (1): the learner must refer to every subspace to determine whether an object shares a concept $c$. However, as discussed in the Introduction, we argue that $c$ should exclusively belong to only one visual subspace.
To this end, we utilize the acquired linguistic hierarchy to learn semantic-aware visual subspaces, i.e. the visual superordinates in Figure 2 (c). Upon the learned linguistic hierarchy, the concept learner can uniquely relate each concept $c$ to a single visual subspace $\mathcal{V}_{k(c)}$ (e.g. color, shape, or material), and estimate the probability of the $i$-th object sharing $c$ accordingly. For example, assuming that concept $c$ belongs to subspace $\mathcal{V}_k$, then

$$ p(c \mid o_i) = \sigma\!\left(\frac{\cos\!\big(g_k(o_i),\, e_c^k\big) - \beta}{\gamma}\right). \qquad (3) $$

Comparing Eq. (1) and Eq. (3), we can see that $c$ uniquely matches $\mathcal{V}_k$ and the learner only refers to $\mathcal{V}_k$ in its judgment. This way, the subspaces become aware of superordinates such as color and shape.
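A companion sketch of Eq. (3) is shown below, assuming a learned concept-to-superordinate lookup (the `hierarchy` dict here is hypothetical); it contrasts directly with the marginalized score of Eq. (1) above.

```python
# Sketch of Eq. (3): semantic-aware scoring. Each concept is routed to exactly
# one visual superordinate, so only that subspace is consulted (illustrative only).
import torch
import torch.nn.functional as F

beta, gamma = 0.2, 0.1
hierarchy = {"red": "color", "cube": "shape", "metal": "material"}  # learned in curriculum-I
proj = {k: torch.nn.Linear(256, 64) for k in ("color", "shape", "material", "size")}
concept_emb = {"red": torch.randn(64), "cube": torch.randn(64), "metal": torch.randn(64)}

def prob_semantic_aware(o, concept):
    k = hierarchy[concept]                               # unique superordinate of the concept
    sim = F.cosine_similarity(proj[k](o), concept_emb[concept], dim=0)
    return torch.sigmoid((sim - beta) / gamma)

print(prob_semantic_aware(torch.randn(256), "cube"))
```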
A similar objective is utilized to optimize the learner during the visual superordinate abstraction,

$$ \max_{\theta}\; \mathbb{E}_{(q,a)}\big[\log p\big(a \mid \mathrm{Exec}'(q;\theta)\big)\big], \qquad (4) $$

where $\mathrm{Exec}'(\cdot;\theta)$ executes each program step and produces results by Eq. (3).
Learning the visual superordinates increases the independence of different visual subspaces and improves the learner’s robustness to perturbations. In addition, we devise a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme to enhance the discrimination and independence of concepts within each visual superordinate.
3.2 Quasi-center Visual Concept Clustering
Given the independent mappings learned by visual superordinate abstraction, the visual representations within each superordinate can still be mixed with each other. For instance, the learner can distinguish shape from color and size, but it may confuse cube, cylinder and sphere. The reason can be found in Eq. (1) and Eq. (3): the learner estimates the similarity between the mapped feature and the corresponding concept representation, yet ignores the other samples that share the same concept in the visual subspace. As a result, visual samples diffuse around a concept and present high variance with respect to the concept representation, which makes them vulnerable to noise.

As illustrated in Figure 3 (left), the visual representations around the color concepts may disperse within each cluster, and the clustering center may also be far from the corresponding concept representation. To this end, we design a memory cache to enhance the clustering of visual features around different concepts and to reduce the discrepancy between linguistic concepts and visual representations. Supposing that at the $t$-th training step the learner has seen a set of samples $S_c^{(t)}$ that share a concept $c$, a quasi-center of each concept cluster can be calculated as

$$ \mu_c^{(t)} = \frac{1}{\big|S_c^{(t)}\big|} \sum_{o_i \in S_c^{(t)}} g_{k(c)}(o_i), \qquad (5) $$

$$ \mu_c^{(t)} = \frac{\big|S_c^{(t-1)}\big|\,\mu_c^{(t-1)} + \sum_{o_i \in \Delta S_c^{(t)}} g_{k(c)}(o_i)}{\big|S_c^{(t-1)}\big| + \big|\Delta S_c^{(t)}\big|}, \qquad (6) $$

where $S_c^{(t)}$ are the samples that share $c$ up to the $t$-th training step and $k(c)$ is the superordinate that $c$ aligns with according to the learned linguistic hierarchy. Eq. (6) dynamically updates the clustering center at each training step, where $\Delta S_c^{(t)}$ are the samples that newly appear at the $t$-th training step.
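The running update in Eq. (6) is essentially an incremental mean over the cached samples of a concept; a small sketch with a hypothetical cache layout is given below.

```python
# Sketch of the quasi-center update (Eqs. (5)-(6)): an incremental mean over
# the visual samples cached for a concept (illustrative memory-cache layout).
import torch

class QuasiCenterCache:
    def __init__(self, dim):
        self.center = torch.zeros(dim)   # current quasi-center mu_c
        self.count = 0                   # number of cached samples |S_c|

    def update(self, new_samples: torch.Tensor):
        """new_samples: (m, dim) mapped features that share concept c at this step."""
        m = new_samples.shape[0]
        total = self.center * self.count + new_samples.sum(dim=0)
        self.count += m
        self.center = total / self.count  # Eq. (6): dynamic update of the center

cache = QuasiCenterCache(dim=64)
cache.update(torch.randn(5, 64))
cache.update(torch.randn(3, 64))
print(cache.center.shape, cache.count)   # torch.Size([64]) 8
```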
For simplicity, we calculate the Euclidean distance between a mapped sample and the existing clustering quasi-center,

$$ d_i^{c} = \big\| g_{k(c)}(o_i) - \mu_c^{(t)} \big\|_2, \qquad (7) $$

and balance it with the original cosine similarity by

$$ p(c \mid o_i) = \sigma\!\left(\frac{\cos\!\big(g_{k(c)}(o_i),\, e_c\big) - \beta}{\gamma} - \lambda\, d_i^{c}\right), \qquad (8) $$

where $\lambda$ is the decay coefficient of the distance.
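A possible instantiation of Eqs. (7)-(8) is sketched below; the exact way the distance term is combined with the cosine score, and the value of the decay coefficient, are assumptions rather than the authors' released code.

```python
# Sketch of Eqs. (7)-(8): the concept score is balanced by the Euclidean
# distance between the mapped sample and the cached quasi-center
# (illustrative; the exact combination rule is an assumption).
import torch
import torch.nn.functional as F

beta, gamma, lam = 0.2, 0.1, 0.01        # shift, scale, distance decay coefficient

def prob_with_clustering(mapped_obj, concept_emb, quasi_center):
    sim = F.cosine_similarity(mapped_obj, concept_emb, dim=0)   # original similarity term
    dist = torch.norm(mapped_obj - quasi_center, p=2)           # Eq. (7)
    return torch.sigmoid((sim - beta) / gamma - lam * dist)     # Eq. (8)

print(prob_with_clustering(torch.randn(64), torch.randn(64), torch.randn(64)))
```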
Adding cached visual samples, the learner infers the probability of concepts using Eq. (8) instead of Eq. (3). On the one hand, quasi-center visual concept clustering concentrates the distribution of visual representations: it reduces the distance between visual samples and the clustering center, yielding more consistent visual representations. On the other hand, the proposed clustering also updates the representations of the concept words, which increases the similarity between (the clustering center of) the visual samples and the corresponding concepts. This produces a better multi-modal embedding and increases the robustness of concept identification, as demonstrated in Figure 3 (right).
3.3 Superordinate Shortcut Learning
Quasi-center concept clustering concentrates the different visual clusters within a superordinate; meanwhile, we can analyze the dependence among superordinates. As shown in Figure 4, $z_A$ and $z_B$ are the visual representations extracted from object $o_i$ for two different superordinates $A$ and $B$. Due to the expressiveness of modern neural networks, a learner is likely to build a causal effect between these two variables that are expected to be independent (dashed line in Figure 4). Suppose there is a correlation between $A$ and $B$ in the training data and the learner figures out a shortcut from $z_A$ to $z_B$; then $z_B$ is simultaneously influenced by $o_i$ and $z_A$. Given the learned visual superordinate abstraction, the learner can explicitly model the shortcut from $A$ to $B$ by estimating the information it transmits during training. Then, during inference, the true causal effect from $o_i$ to $z_B$ can be recovered by blocking such a shortcut.
For an object $o_i$ and the superordinate space $A$, the learner infers the probability of sharing a concept $c_A$ corresponding to $A$ by Eq. (3), which gives the probability distribution $p(c_A \mid o_i)$ for all $c_A \in A$. The attribute of $o_i$ in $A$ is then given by $c_A^{*} = \arg\max_{c_A} p(c_A \mid o_i)$. Assuming that $c_A^{*}$ is known to the concept learner (i.e. the shaded node in Figure 4), it estimates the information transmitted from $A$ to $B$ by

$$ \tilde{p}(c_B \mid o_i) = \sigma\!\left(\frac{\cos\!\big(h_{A \to B}(e_{c_A^{*}}),\, e_{c_B}\big) - \beta}{\gamma}\right), \qquad (9) $$

where $e_{c_A^{*}}$ is the embedding of $c_A^{*}$ corresponding to subspace $A$, and $h_{A \to B}$ is the function that maps concepts of superordinate $A$ to superordinate $B$.
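A compact sketch of how Eq. (9) could be instantiated, together with the inference-time subtraction described in the next paragraph, is given below; the MLP shape of `h_AB`, the clamping, and the exact subtraction rule are assumptions made for illustration.

```python
# Sketch of superordinate shortcut learning (Eq. (9)) and debiased inference.
# h_AB is a small MLP that maps concept embeddings of superordinate A (e.g. color)
# into the space of superordinate B (e.g. shape); illustrative only.
import torch
import torch.nn.functional as F

beta, gamma = 0.2, 0.1
h_AB = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))

def shortcut_prob(e_cA_star, e_cB):
    """Eq. (9): information transmitted from the predicted concept in A to concept c_B."""
    sim = F.cosine_similarity(h_AB(e_cA_star), e_cB, dim=0)
    return torch.sigmoid((sim - beta) / gamma)

def debiased_prob(p_direct, e_cA_star, e_cB):
    """Inference: block the shortcut by subtracting its estimate from Eq. (3)'s score."""
    return torch.clamp(p_direct - shortcut_prob(e_cA_star, e_cB), min=0.0)

p = debiased_prob(torch.tensor(0.9), torch.randn(64), torch.randn(64))
print(p)
```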
When learning the shortcut from $A$ to $B$, the learner answers questions using Eq. (9) instead of Eq. (3). The resulting loss only updates $h_{A \to B}$, with the rest of the learner fixed. After training, the learner conducts regular inference with Eq. (3) and infers the true causal effect from $o_i$ to $z_B$ by subtracting $\tilde{p}(c_B \mid o_i)$ from the initial estimates given by Eq. (3). Algorithm 1 elaborates the training process of the proposed visual superordinate abstraction for robust concept learning.
Algorithm 1: Weakly-supervised learning for visual superordinate abstraction
4 Experiments
We first provide the implementation details in Section 4.1. Then we conduct experiments on the CLEVR dataset (https://cs.stanford.edu/people/jcjohns/clevr) [johnson2017clevr] to provide an overall comparison of the reasoning ability under the regular setting with the state of the arts in Section 4.2. We evaluate the generalization ability to new compositions on the CLEVR-CoGenT dataset [johnson2017clevr] in Section 4.3 and to perturbations on the CLEVR-Perturb test data in Section 4.4, followed by clustering visualization and sensitivity analysis.
4.1 Implementation Details
Our baseline model strictly follows the design of NSCL [mao2019neuro] for a fair comparison. We only use images and question-answer pairs for training and adopt curriculum learning to train the learner. Since the semantic parser in NSCL [mao2019neuro], trained without program annotations, has achieved nearly perfect parsing accuracy, we use the pre-trained parser and fix it while training our visual superordinate abstraction model.
We set the dimension of the word embedding and the positional embedding as and respectively. The dimension of the visual attribute subspaces is . ResNet34 is used to extract the general feature of objects in -dim, and the subspace mappings consist of a one-layer linear projection. We adopt the AdamW optimizer [loshchilov2017decoupled] for training, with an initial learning rate of . The batch size for training is set as . The scaling parameter and the shifting parameter are set to and , respectively. As for concept clustering, we only add the most likely positive sample into the cache at each training step, and the minimum number of samples required to start the clustering is set to . The distance decay coefficient is set to , and a sensitivity analysis is given in Section 4.6. The module for superordinate shortcut learning consists of two linear layers and a non-linear activation layer. We alternately update the concept learner and the shortcut module at the same frequency during training.
4.2 General Visual Reasoning
The CLEVR [johnson2017clevr] dataset contains 100k rendered images and about one million automatically generated question-answer pairs. The question types include querying attributes, comparing attributes (or numbers), counting, and logical reasoning such as existence. The attribute combinations during training are the same as those in testing. We compare our model with both implicit and explicit reasoning methods.
Table 1: Overall accuracy (%) on CLEVR and the CLEVR-CoGenT validation splits.

| | FiLM [perez2018film] | TbD [mascharka2018transparency] | NSCL [mao2019neuro] | MDETR [kamath2021mdetr] | MAC [hudson2018compositional] | Ours |
|---|---|---|---|---|---|---|
| CLEVR | 97.6 | 99.1 | 98.9 | 99.7 | 98.9 | 99.1 |
| CoGenT val-A | 98.3 | 98.8 | 97.9 | 99.8 | 96.9 | 98.0 |
| CoGenT val-B | 78.8 | 75.4 | 74.1 | 76.7 | 79.5 | 91.9 |
The overall results are in Table 1 (first row), and a more detailed comparison can be found in our supplementary material. The end-to-end pretraining method MDETR [kamath2021mdetr] achieves the highest accuracy, which indicates the superiority of a jointly trained object detector. Among the remaining methods, TbD [mascharka2018transparency] obtains the same highest score as our model. Though flexible, training TbD [mascharka2018transparency] requires annotated programs for question parsing. Compared with FiLM [perez2018film], MAC [hudson2018compositional] and NSCL [mao2019neuro], which do not use extra annotations, our method performs slightly better, which verifies the efficacy of the visual superordinate abstraction framework in regular reasoning.
4.3 Robustness to New Compositions
The bias in training data can heavily affect the independence of learned concepts, and CLEVR-CoGenT [johnson2017clevr] is proposed to diagnose this. There are two conditions for data splits: in Condition A all cubes are gray, blue, brown, or yellow and all cylinders are red, green, purple, or cyan; in Condition B these shapes swap color palettes. Both conditions contain spheres of all eight colors. The training split is under Condition A, and the validation splits val-A and val-B are under the two conditions respectively, as shown in Figure 5.

In Figure 4, by intervening on the shaded variable, the learner exposes the information transmitted through the shortcut from color to shape. Interestingly, the learned shortcut plotted in Figure 5 is consistent with the bias in the training data. Specifically, given a color in {gray, blue, brown, yellow}, the learner predicts a high probability (about 2/3) for the cube shape, approximately twice that of the sphere (about 1/3), while the cylinder has a near-zero chance of appearing. The opposite effect w.r.t. cube and cylinder can be observed for the remaining four colors. If we assume that the shape is uniformly distributed in the training data (prior 1/3 per shape), then the biased conditional distribution over (cube, sphere, cylinder) is exactly (2/3, 1/3, 0), or vice versa, for the colors appearing in the two splits; a short Bayes computation below confirms this.
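The worked check below follows directly from Bayes' rule under the stated uniform-shape assumption and the Condition-A color palette; it is a sanity check of the numbers above, not code from the paper.

```python
# Worked check of the CoGenT bias: with a uniform shape prior and the Condition-A
# palette (cubes and spheres may take a Condition-A color, cylinders may not),
# Bayes' rule gives p(shape | Condition-A color) = (2/3, 1/3, 0).
prior = {"cube": 1/3, "sphere": 1/3, "cylinder": 1/3}
# probability of drawing one specific Condition-A color given the shape
p_color_given_shape = {"cube": 1/4, "sphere": 1/8, "cylinder": 0.0}

joint = {s: prior[s] * p_color_given_shape[s] for s in prior}
Z = sum(joint.values())
posterior = {s: joint[s] / Z for s in joint}
print(posterior)   # {'cube': 0.666..., 'sphere': 0.333..., 'cylinder': 0.0}
```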
The learner is thus ready to use the disclosed bias during inference on split val-B.

An overall statistical comparison with the state-of-the-art methods is listed in Table 1. Due to the distribution discrepancy, almost all the methods present excellent reasoning ability on the val-A split, followed by a plunge on val-B. The val-A split is consistent with the CoGenT training data, where the color attribute correlates with the shape attribute. In this case, MDETR [kamath2021mdetr] and TbD [mascharka2018transparency] perform slightly better than the rest of the methods, including ours. On the val-B split, the correlation changes inversely, and all the comparing approaches are negatively affected. On the contrary, our model still obtains high accuracy without any fine-tuning on the val-B split. We attribute this to the learned superordinate shortcut, which reduces the spurious causal effect in the training data.
Table 2: Detailed comparison with NSCL on the val-B split of CLEVR-CoGenT (abs: visual superordinate abstraction, cc: quasi-center concept clustering, sl: superordinate shortcut learning).

| Model | abs | cc | sl | Overall | Count | Exist | Cnt () | Cnt () | Cnt () | Comp. Attr. | Query |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NSCL | ✗ | ✗ | ✗ | 74.1 | 71.8 | 85.7 | 72.5 | 82.2 | 80.8 | 80.7 | 66.7 |
| Ours | ✓ | ✗ | ✗ | 77.7 | 75.0 | 87.6 | 74.5 | 84.2 | 82.5 | 87.6 | 70.1 |
| Ours | ✓ | ✓ | ✗ | 78.5 | 76.6 | 88.0 | 77.5 | 84.2 | 82.5 | 87.7 | 70.8 |
| Ours | ✓ | ✓ | ✓ | 91.9 | 95.7 | 99.1 | 94.0 | 98.8 | 98.7 | 99.1 | 81.5 |
A detailed comparison with NSCL [mao2019neuro] on the val-B split of CLEVR-CoGenT is provided in Table 2. Under biased training, NSCL performs worse on new compositions of color and shape. In contrast, built upon the visual superordinate abstraction, our model achieves higher accuracy on all types of questions, with a clear relative overall gain; among the tasks, it surpasses NSCL by a large margin on comparing attributes. Causal inference via the superordinate shortcut brings a further substantial overall improvement: significantly, it markedly promotes the performance on counting, and reasons nearly perfectly on existence and comparing attributes.
In Figure 5, we analyzed the correlation between color and shape in the CLEVR-CoGenT training data. Here we train a shortcut from color to material in a similar way. The learned conditional probability distribution for different colors is provided in Figure 6. The result reveals that there is no bias related to this pair of variables, which is consistent with the ground-truth setting of the CLEVR-CoGenT training set.
4.4 Robustness to Perturbations
To examine the learner’s robustness to perturbations within one superordinate, we synthesize a CLEVR-Perturb test set of rendered images with automatically generated question-answer pairs. The values for shape, size and material are set the same as those in the CLEVR dataset [johnson2017clevr]. As for the colors, each color is slightly perturbed to a nearby value; the specific color shifts are listed in Figure 7. To precisely assess the robustness to color perturbations, we only generate questions that require no reasoning about color. For example, in the original setting, a possible question is “How many other things are there of the same shape as the tiny cyan matte object?”. Though the final step queries the shape, this question requires identifying “cyan” and will therefore not appear in the synthesized CLEVR-Perturb test set. We compare our model with NSCL [mao2019neuro], which performs best on the CLEVR dataset without extra annotations, and with our variant without concept clustering.
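A simple illustration of such a filtering rule is sketched below; the program format and function name are hypothetical, since the actual generation pipeline is not described in code in the paper.

```python
# Sketch of the CLEVR-Perturb question filter: discard any question whose parsed
# program touches a color concept, so reasoning never depends on (perturbed) colors.
COLOR_CONCEPTS = {"gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"}

def is_color_free(program):
    """program: list of steps such as {"op": "Filter", "concepts": ["cyan", "rubber"]}."""
    for step in program:
        if set(step.get("concepts", [])) & COLOR_CONCEPTS:
            return False
        if step.get("op") == "Query" and step.get("attribute") == "color":
            return False
    return True

q = [{"op": "Filter", "concepts": ["cyan", "rubber"]}, {"op": "Query", "attribute": "shape"}]
print(is_color_free(q))   # False: identifying "cyan" is required, so the question is dropped
```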

Table 3: Detailed comparison on the CLEVR-Perturb test set.

| Model | abs | sl | cc | Overall | Count | Exist | Cnt () | Cnt () | Cnt () | Comp. Attr. | Query |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NSCL | ✗ | ✗ | ✗ | 87.1 | 76.2 | 91.8 | 81.1 | 91.1 | 92.2 | 90.9 | 91.7 |
| Ours | ✓ | ✗ | ✗ | 92.9 | 85.5 | 96.6 | 87.8 | 96.0 | 95.8 | 95.7 | 95.7 |
| Ours | ✓ | ✓ | ✗ | 92.7 | 85.7 | 96.1 | 88.3 | 96.1 | 96.0 | 95.2 | 95.5 |
| Ours | ✓ | ✓ | ✓ | 93.6 | 87.3 | 96.9 | 89.0 | 96.5 | 96.0 | 95.7 | 96.0 |
Table 3 lists the statistical results. Comparing our variant without concept clustering against NSCL [mao2019neuro], we observe obvious relative improvements on all types of questions, most notably on counting and comparing numbers. This indicates that applying linguistic abstraction leads to enhanced visual abstraction. Equipped with concept clustering, our model further boosts the performance on counting-related questions, which shows the superiority of the proposed visual superordinate abstraction framework.
Referring to the qualitative example shown in Figure 1, the well-trained NSCL model performs nearly perfectly on the left scene, but much worse on the right scene. This reveals that the learner’s recognition of the material and shape attributes is affected by colors. Instead, our model identifies metal and sphere correctly. More importantly, our model produces nearly equal scores for different objects under each concept, which indicates that the learner abstracts only the features relevant to the current concept (e.g. metal), so objects of the same material receive similar scores. More examples can be found in the supplementary material.
Figure 8: Visualization of the visual samples within the ‘color’ superordinate: (a) NSCL [mao2019neuro], (b) Ours w.o. cc, (c) Ours.
4.5 Clustering within Superordinate
For the CLEVR-Perturb test set, we compare in detail the clusters formed by the visual samples around different concepts. Clustering results are shown in Figure 8. We observe that even the variant without concept clustering learns a clearly more discriminative subspace than NSCL [mao2019neuro]. Equipped with concept clustering, our model further enhances the discrimination within the superordinate. The better clustering results also account for the higher reasoning accuracy. Note that, for fairness, in the comparison in Table 3 we avoid querying about color on the CLEVR-Perturb test set. Furthermore, Figure 8 (a) and (c) demonstrate that our model improves the discrimination of different primitive concepts (e.g. ‘gray’, ‘cyan’, and ‘yellow’) within the ‘color’ superordinate, even when the input colors are perturbed.
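One common way to produce such 2-D cluster visualizations is sketched below; t-SNE is an assumption here (the paper does not specify the projection method), and the feature and label arrays are random placeholders.

```python
# Sketch of a clustering visualization within the 'color' superordinate:
# project mapped object features to 2-D and color points by their ground-truth
# concept (t-SNE is an assumption; the paper does not specify the projection).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(400, 64)                  # mapped samples g_color(o_i)
labels = np.random.randint(0, 8, size=400)           # one of eight color concepts

xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=8)
plt.title("Visual samples within the 'color' superordinate")
plt.savefig("color_superordinate_tsne.png")
```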

4.6 Sensitivity Analysis
Figure 9 presents the sensitivity of the proposed model w.r.t the decay coefficient in Eq. (8). The cached samples around a concept provide more visual information for reference in identification. We tune from to
in a logarithmic scale, and record the accuracy change in each question type. The results show that all the variants perform better on binary reasoning questions, including comparing numbers or attributes, reasoning existence, and querying attributes, than answering specific counting questions. We analyze that for the learner, counting is a more complicated discretization process where probabilities higher than a threshold are binarized and then added up. Not surprisingly, this complicated task benefits the most from improving the learner’s robustness, both by the proposed superordinate abstraction and by the concept clustering. Figure
9 illustrates that as the increase of , the performance is improved first and then becomes worse than the baseline. It indicates that there is a trade-off between the effect of linguistic tokens and visual samples within a superordinate.4.7 Reasoning Details
To provide an intuitive comparison between our model and NSCL [mao2019neuro], we present the reasoning details on the val-A and val-B splits of the CLEVR-CoGenT dataset [johnson2017clevr] in Figure 10 and Figure 11, respectively. They show that NSCL [mao2019neuro] performs almost perfectly on val-A, but mixes up ‘cube’ and ‘cylinder’ on val-B. Our model overcomes the bias on the val-B split.


5 Conclusion
In this paper, we propose a visual superordinate abstraction framework for concept learning. The learner acquires linguistic abstraction from soft-aligned questions, which contributes to the discrimination of the visual abstraction. On top of the framework, we devise a quasi-center concept clustering scheme and a superordinate shortcut learning scheme to address issues such as perturbations and biased training. Experiments under different settings verify the superiority of the proposed model.
A potential limitation of this paper is that the proposed visual superordinate abstraction has not been validated on large-scale real-world datasets. The synthesized CLEVR datasets provide ideal and controllable environments that allow the community to directly explore and evaluate different concept learners. In this paper, we mainly focus on analyzing the potential bottleneck of existing methods, and pinpoint that most of them ignore the valuable abstraction capability in human reasoning. In future work, we aim to further explore the capacity of the proposed visual superordinate abstraction in real-world scenarios.