
Visual Superordinate Abstraction for Robust Concept Learning

by   Qi Zheng, et al.

Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g. {red, blue, ...} ∈ ‘color’ subspace yet cube ∈ ‘shape’. In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e. visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view, and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings: it increases the overall answering accuracy by a relative 7.5% on reasoning with perturbations and 15.6% on compositional generalization tests.


1 Introduction

Concept learning actively extracts representations from a visual scene and connects them to linguistic tokens for identification, which resembles human cognition [inhelder2013early]. For example, we may focus on its color (e.g. reddish-yellow) and shape (e.g. round) appearance when identifying an orange. Concept learning is fundamental to vision-and-language tasks that perform multi-step inference over a scene’s entities and their relationships, such as Visual Question Answering (VQA) [antol2015vqa], Visual Commonsense Reasoning (VCR) [zellers2019recognition] and Vision-Language Navigation (VLN) [anderson2018vision].

In recent years, great progress has been made on concept learning to increase the accuracy of identification. In the beginning, most algorithms require explicit annotations, such as ground truth scene graphs, for training concept learners [mascharka2018transparency, yi2018neural].

Figure 1: Illustration of color perturbations. The left image shares consistent attributes with the training samples, and the right one is synthesized by adding perturbations to the color attribute. We present the scores of judging metal and sphere by NSCL [mao2019neuro] (blue bars) and by our method (red bars). The colored rectangles refer to the objects in the corresponding bounding boxes.

More recently, Mao et al. [mao2019neuro] proposed quasi-symbolic execution together with learning visual subspaces using only question-answer pairs for images. Though state-of-the-art methods achieve high accuracy on diagnostic datasets such as the CLEVR dataset [johnson2017clevr] (see Table 1), they suffer a heavy performance drop on out-of-distribution compositions [mascharka2018transparency, yi2018neural, mao2019neuro, marois2018transfer]. We also observe that they are vulnerable to attribute perturbations during inference.

A demonstration of attribute perturbation is shown in Figure 1. The left scene shares exactly the same setting as the training data, i.e. object attributes such as color, shape, and material follow the same distributions as in the training data. In the right scene, small perturbations are added to the color attribute, e.g. red → light red. To examine a learner’s robustness to color perturbation, we output the scores of judging metal and sphere from the well-trained NSCL model [mao2019neuro] (shown in blue bars). We find that it performs perfectly in the left scene but much worse in the right one. This reveals that the learner’s recognition of the material and shape attributes is affected by colors. We ascribe this phenomenon to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g. {red, blue, …} ∈ ‘color’ subspace yet cube ∈ ‘shape’.

An abstraction of such a hierarchy is vital to human reasoning. It is one of the fundamental capabilities that make us robust in understanding the real world and able to generalize well to unseen scenarios [inhelder2013early, murphy2004big, landauer1997solution, lund1996producing, lake2020word, tenenbaum2011grow, rosch1976basic, tanaka1991object]. Human learners capture the essential hypothesis spaces in parsimonious form, and the resulting hierarchical structure enables describing not only the specific situation at hand, but also a broader class of situations over which learning should generalize [tenenbaum2011grow]. As an attempt to introduce such a hierarchy into machine learning, Han et al. [han2020visual] propose a visual concept-metaconcept learner (VCML) that utilizes extra metaconcept supervision. Different from them, we learn the semantic hierarchy with only natural VQA data. To the best of our knowledge, we are the first to explore the intrinsic semantic hierarchy under natural weak supervision.

In this paper, we propose a visual superordinate abstraction framework that explicitly models semantic-aware visual subspaces, which are denoted as visual superordinates. With the weak supervision from visual question answering data, the concept learner first acquires the semantic hierarchy from a linguistic view in a simple curriculum. Then, in the following curriculum, it explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. Upon the abstraction framework, we propose a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme, which further enhance the discrimination and independence of concepts within each visual superordinate. Quasi-center visual concept clustering aims to model the relationships between the clusters of visual representations and the linguistic features. By introducing a quasi-center of the visual cluster, we simultaneously (i) reduce the distance between visual representations and the quasi-center, and (ii) increase the similarity between visual samples and the corresponding concept. As for superordinate shortcut learning, the learner rectifies its judgment by reducing spurious causal relations among superordinates.

Finally, comprehensive experiments are conducted to verify the effectiveness and generalization ability of the proposed concept learner. For the demonstration in Figure 1, our method identifies metal and sphere correctly even with color perturbations. Statistically, the proposed model achieves accuracy comparable to the state-of-the-art methods. On the more challenging CLEVR-Perturb setting, our method outperforms NSCL [mao2019neuro] by a relative improvement of 7.5%. On the CLEVR-CoGenT dataset, our model overcomes the implicit bias in the training data and performs best on split val-B without fine-tuning, surpassing the state of the art by a relative 15.6%.

2 Related Work

Concept learning

Recent exploration of elementary visual reasoning starts from [johnson2017clevr], which provides a diagnostic dataset to test the reasoning and generalization ability of a model by answering visual questions. As for joint learning of vision and natural language, existing methods differ mainly in their visual representations and question parsing process. Initial methods conduct reasoning primarily on convolutional feature maps. Johnson et al. [johnson2017inferring] build a reasoning system composed of a program generator and an execution engine, where the engine executes the decoded program sequence on feature maps obtained from CNNs [he2016deep]. Hu et al. [hu2017learning] propose end-to-end module networks that conduct reasoning by directly predicting instance-specific network layouts without an off-the-shelf parser. To further remove the need for annotated layout data, Hu et al. [hu2018explainable] replace the layout graph in [hu2017learning] with a stack-based data structure that allows fully differentiable optimization. Similarly, Mascharka et al. [mascharka2018transparency] propose a set of visual reasoning primitives that compose a model according to a given question.

Different from the methods reasoning on convolutional feature maps, Yi et al. [yi2018neural] parse a scene into a structural graph using trainable object and attribute detectors. Since the scene parser in [yi2018neural] requires ground-truth structural graphs for training, Mao et al. [mao2019neuro] propose quasi-symbolic execution and simultaneously learn the scene parser and the semantic parser using only question-answer pairs for images. Li et al. [li2020competence] propose a multi-dimensional Item Response Theory (mIRT) model for guiding the learning process with an adaptive curriculum to increase training efficiency. Perez et al. [perez2018film] design a feature-wise linear modulation layer to improve the reasoning ability of the vanilla baseline in [johnson2017inferring]. Hudson and Manning [hudson2018compositional] introduce a recurrent memory, attention, and composition (MAC) cell that maintains control and memory separately to balance transparency and versatility, without explicitly parsing questions into programs. Following MAC [hudson2018compositional], Wang et al. [wang2021interpretable] devise an object-centric compositional attention model to induce a symbolic concept space. To boost the quality of detected objects, Kamath et al. [kamath2021mdetr] pre-train an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, and then fine-tune it on visual reasoning datasets. We follow the neuro-symbolic reasoning process in [mao2019neuro], but different from their semantic-agnostic visual representations, we learn visual superordinates to increase robustness.

Causal inference

Causal analysis infers probabilities under both static conditions and changing conditions, which aims to figure out correlations in biased data. The structural causal model (SCM), a directed graph that reveals causal relationships between random variables, has been developed as a formal tool to model causation from statistical data and counterfactual reasoning. It has been widely used in medical, psychological, and social research [dunn2015evaluation, king2008political, mackinnon2007mediation, richiardi2013mediation] to determine the effect of a treatment or policy, and has recently been introduced into computer vision [nair2019causal, niu2020counterfactual, qi2020two, wang2020visual, yang2020deconfounded, tang2020unbiased] to enable counterfactual reasoning.

Though efficient, causal inference can hardly be used in concept learning models without semantic-aware visual subspaces. In a flat concept set, the distance between any two concepts is not comparable with that between another two, nor are the distances between a visual feature and different concepts comparable. For example, an object representation can be close to both “red” and “cube”, which gives no hint about its distance to other concepts. However, within a visual superordinate, closeness to “red” indicates a relatively large distance to “green”. Our visual superordinate abstraction framework models semantic-aware visual subspaces as independent variables, which enables these advanced strategies to be introduced into concept learning.

Figure 2: An overview of the learning process of the proposed Visual Superordinate Abstraction framework. (a) Originally, there is a set of visual concepts and detected objects from visual question answering data. (b) After a simple curriculum (C-I), the learner acquires linguistic abstraction from the weak supervision. It categorizes concepts (e.g. ‘red’, ‘circle’, ‘small’) into different levels of the linguistic hierarchy (e.g. color, shape and size). (c) In the following difficult curriculum (C-II), the learner constructs semantic-aware subspaces (i.e. visual superordinates) that align with the linguistic hierarchy. In each subspace, it extracts only one desired visual feature from the detected objects, which can be described with the closest concept. To form ideal visual superordinates, quasi-center visual concept clustering (Figure 3) and superordinate shortcut learning (Figure 4) are proposed and introduced in the following.

3 Method

To conduct neuro-symbolic reasoning, the pipeline generally includes a concept learner and two parsers for vision and language respectively. Following NSCL [mao2019neuro], in our method, visual parsing is done via pre-trained object detectors. The language parser converts a given question into a sequence of programs that consist of concepts and operations. For instance, the question “What is the size of the sphere left of the blue metal object?” is parsed into the program sequence Filter(ObjConcept 2 (blue metal), Scene) → Filter(RelConcept 1 (left), ·) → Filter(ObjConcept 1 (sphere), ·) → Query(Attribute 1 (size), ·), where · stands for the output of the previous step. The key reasoning step is judging whether an object shares a concept.
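To make the program sequence above concrete, the following is a minimal sketch of executing it on soft object masks. The three-object scene, the per-object scores, the `left_of` matrix, and the min/max soft logic are all illustrative assumptions, not the learner's actual outputs or the exact executor of NSCL.

```python
import numpy as np

# Hypothetical per-object concept probabilities for a 3-object scene
# (in the paper these scores come from the concept learner).
concept_prob = {
    "blue":   np.array([0.9, 0.1, 0.1]),
    "metal":  np.array([0.8, 0.2, 0.9]),
    "sphere": np.array([0.1, 0.9, 0.2]),
    "small":  np.array([0.2, 0.9, 0.1]),
    "large":  np.array([0.8, 0.1, 0.9]),
}
# left_of[i, j]: assumed probability that object j is left of object i.
left_of = np.array([[0.0, 0.9, 0.1],
                    [0.0, 0.0, 0.0],
                    [0.8, 0.9, 0.0]])

def filter_concepts(mask, concepts):
    # Soft filter: an object stays attended only as far as it matches every concept.
    for c in concepts:
        mask = np.minimum(mask, concept_prob[c])
    return mask

def relate(mask, relation):
    # Move attention to objects standing in `relation` to the attended ones.
    return np.minimum(mask[:, None], relation).max(axis=0)

def query_size(mask):
    # Answer with the size concept that best matches the most attended object.
    target = int(np.argmax(mask))
    return max(["small", "large"], key=lambda s: concept_prob[s][target])

mask = filter_concepts(np.ones(3), ["blue", "metal"])  # the blue metal object
mask = relate(mask, left_of)                           # objects left of it
mask = filter_concepts(mask, ["sphere"])               # spheres among them
answer = query_size(mask)                              # size of that sphere
```

Every step consumes the mask produced by the previous one, which is why a wrong concept score early in the sequence propagates to the final answer.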

3.1 Visual Superordinate Abstraction

Linguistic hierarchy

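Let $O = \{o_i\}_{i=1}^{N}$ denote the visual features of the objects in a scene, and let $C$ be the concept set to be learned. These concepts describe $K$ attributes in total, yet it is unknown which concept belongs to which attribute. The concept learner learns $K$ mapping functions $f_1, \dots, f_K$ and forms $K$ visual subspaces $S_1, \dots, S_K$, each $S_k = \{f_k(o_i)\}_{i=1}^{N}$.

In previous methods, $S_1, \dots, S_K$ are semantic-agnostic visual subspaces, thus the probability of the $i$-th object sharing concept $c$ is estimated as

$$P(c \mid o_i) = \sigma\!\left(\frac{\sum_{k=1}^{K} P(S_k \mid c)\, P(c \mid o_i, S_k) - \gamma}{\tau}\right), \qquad P(S_k \mid c) = b^{c}_{k}, \quad P(c \mid o_i, S_k) = \langle f_k(o_i), e_c \rangle, \tag{1}$$

where $e_c$ is the embedding of concept $c$ and $\sigma$ is the sigmoid function. $\gamma$ and $\tau$ are the shifting and the scaling parameters respectively. The prior probability $P(S_k \mid c)$ is given by the normalized vector $b^{c}$ along with the concept $c$. The conditional probability $P(c \mid o_i, S_k)$ is given by the cosine similarity in the $k$-th subspace, where the normalization denominator is omitted.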
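The full-probability estimate described above can be sketched numerically as follows. The linear mappings, the two-subspace setup, and the values of the shifting and scaling parameters are assumptions chosen for illustration only; the function name `full_probability` is hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def full_probability(obj_feat, mappings, concept_emb, belong, gamma, tau):
    """Score that an object shares a concept, summed over all subspaces.

    mappings: list of K matrices, one (assumed linear) map per visual subspace
    belong:   length-K normalized prior of the concept over the subspaces
    """
    sims = np.array([cosine(W @ obj_feat, concept_emb) for W in mappings])
    logit = (belong @ sims - gamma) / tau
    return 1.0 / (1.0 + np.exp(-logit))

# Toy setup: two subspaces reading different parts of a 3-dim object feature.
W_color = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
W_shape = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
obj = np.array([1.0, 0.0, -1.0])
red = np.array([1.0, 0.0])

# An undecided prior lets the unrelated shape subspace drag the score down.
p_uniform = full_probability(obj, [W_color, W_shape], red, np.array([0.5, 0.5]), 0.5, 0.25)
# A prior that puts all mass on the color subspace ignores the shape subspace.
p_onehot = full_probability(obj, [W_color, W_shape], red, np.array([1.0, 0.0]), 0.5, 0.25)
```

In this toy case the object matches "red" perfectly in the color subspace, yet the undecided prior yields a score below 0.5 because the shape subspace also contributes to the sum.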

The goal of our method is to first learn the linguistic hierarchy that naturally exists in visual reasoning questions. Different from the curriculum in NSCL [mao2019neuro], which arranges lessons by the depth of programs, we devise the curricula from easy question types to hard ones. For curriculum-I, we use programs of depth less than six and scenes with fewer than six objects, where the query-type question-answer pairs contain a soft alignment of the linguistic hierarchy. Our experiments show that, after curriculum-I, our model is capable of accurately defining the linguistic hierarchy, i.e. the affiliation of concepts to attributes illustrated in Figure 2 (b). With the linguistic hierarchy in hand, it becomes possible to further explore semantic-aware visual superordinates and the clustering inside them.
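The selection rule for curriculum-I can be sketched as a simple filter. The `Sample` record and its fields are hypothetical, and the number of program steps is used here as a stand-in for program depth, which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    program: list        # parsed program steps (hypothetical representation)
    num_objects: int     # number of objects in the scene
    question_type: str   # e.g. "query", "count", "exist"

def in_curriculum_one(sample, max_depth=6, max_objects=6):
    # Lesson I keeps shallow programs on small scenes (both strictly below six).
    return len(sample.program) < max_depth and sample.num_objects < max_objects

easy = Sample(["filter", "query"], num_objects=3, question_type="query")
deep = Sample(["filter"] * 7, num_objects=3, question_type="count")
crowded = Sample(["filter", "query"], num_objects=8, question_type="query")

lesson_one = [s for s in (easy, deep, crowded) if in_curriculum_one(s)]
```

Only the shallow question on the small scene is kept for the first lesson; the other two would be deferred to the later curriculum.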

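Training objectives. Given parsed questions from a pre-trained language parser, the concept learner is fully differentiable and is trained by maximizing the likelihood of the right answers,

$$\Theta^{*} = \arg\max_{\Theta} \; \Pr\!\left[\, a = \mathrm{Exec}(q, O; \Theta) \,\right], \tag{2}$$

where $\mathrm{Exec}$ represents the quasi-symbolic program execution of the parsed question $q$ on the scene objects $O$, $a$ is the right answer, and $\Theta$ is the set of parameters in the concept learner. The execution result of each step is given by Eq. (1).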

Semantic-aware visual superordinate abstraction.
Though the linguistic hierarchy is learned, the visual mappings $f_k$ are still semantic-agnostic, which leads to entanglement between visual representations. The cause lies in the full-probability reasoning given by Eq. (1): the learner must refer to every subspace to determine whether an object shares the concept $c$. However, as discussed in the Introduction, we argue that $c$ should belong exclusively to one visual subspace.

To this end, we utilize the acquired linguistic hierarchy to learn semantic-aware visual subspaces, i.e. the visual superordinates in Figure 2 (c). Upon the learned linguistic hierarchy, the concept learner can uniquely relate $c$ to a single visual subspace, given by the normalized vector $b^{c}$, and estimate the probability of the $i$-th object sharing the concept. For example, assuming that a concept $c$ belongs to subspace $S_k$, then

$$P(c \mid o_i) = \sigma\!\left(\frac{\langle f_k(o_i), e_c \rangle - \gamma}{\tau}\right), \tag{3}$$

where $f_k$ is the mapping into $S_k$, $e_c$ is the embedding of $c$, $\sigma$ is the sigmoid function, and $\gamma$ and $\tau$ are the shifting and scaling parameters. Comparing Eq. (1) and Eq. (3), we can see that $c$ uniquely matches $S_k$ and only refers to $f_k$ in judgment. This way, the subspaces become aware of superordinates such as color and shape.
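A minimal sketch of the superordinate-aware judgment is given below, assuming linear mappings and assuming that the subspace of a concept is picked as the largest entry of its belonging vector. The toy features and parameter values are illustrative only.

```python
import numpy as np

def superordinate_probability(obj_feat, mappings, concept_emb, belong, gamma, tau):
    # Refer only to the single subspace the concept is assigned to (assumed: argmax).
    k = int(np.argmax(belong))
    x = mappings[k] @ obj_feat
    sim = x @ concept_emb / (np.linalg.norm(x) * np.linalg.norm(concept_emb))
    return 1.0 / (1.0 + np.exp(-(sim - gamma) / tau))

# Toy setup: the color subspace reads dims 0-1, the shape subspace reads dims 2-3.
W_color = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
W_shape = np.array([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
cube = np.array([1.0, 0.0])
belong_cube = np.array([0.1, 0.9])   # "cube" mostly belongs to the shape subspace

original = np.array([0.0, 1.0, 1.0, 0.2])
perturbed = np.array([0.6, 0.8, 1.0, 0.2])   # only the color part differs

p_original = superordinate_probability(original, [W_color, W_shape], cube, belong_cube, 0.5, 0.25)
p_perturbed = superordinate_probability(perturbed, [W_color, W_shape], cube, belong_cube, 0.5, 0.25)
```

Because the judgment of "cube" never touches the color subspace, a color perturbation leaves the shape score unchanged in this idealized setting, which is the kind of independence the superordinates are meant to provide.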

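A similar objective is utilized to optimize the learner during the visual superordinate abstraction,

$$\Theta^{*} = \arg\max_{\Theta} \; \Pr\!\left[\, a = \mathrm{Exec}_{S}(q, O; \Theta) \,\right], \tag{4}$$

where $\mathrm{Exec}_{S}$ executes each program step of the parsed question $q$ and produces results by Eq. (3), $a$ is the right answer, and $\Theta$ is the set of parameters in the concept learner.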

Learning the visual superordinates increases the independence of different visual subspaces and improves the learner’s robustness to perturbations. In addition, we devise a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme to enhance the discrimination and independence of concepts within each visual superordinate.

3.2 Quasi-center Visual Concept Clustering

Given the independent mappings learned by visual superordinate abstraction, the visual representations within each superordinate can still be mixed with each other. For instance, the learner can distinguish shape from color and size, but it may confuse cube, cylinder and sphere. The reason can be found in Eq. (1) and Eq. (3): the learner estimates the similarity between the mapped feature and the corresponding concept representation, yet ignores the other samples that share the same concept in the visual subspace. As a result, visual samples diffuse around a concept and present high variance with respect to the concept representation, which makes them vulnerable to noise.

Figure 3: Demonstration of quasi-center visual concept clustering in the color superordinate. The figure marks the projections of concept words in the visual subspace and the quasi-centers of the visual concept clusters. Quasi-center concept clustering not only gathers the visual representations around a concept, but also reduces the distance between concept words (e.g. “red”) and the corresponding cluster centers. Best viewed in color.

As illustrated in Figure 3 (left), the visual representations around the color concepts may disperse within each cluster, and the clustering center may also be far from the corresponding concept representation. To this end, we design a memory cache to enhance the clustering of visual features around different concepts and reduce the discrepancy between linguistic concepts and visual representations. Supposing that at the $t$-th training step the learner has seen a set of samples $X_c^{t}$ that share a concept $c$, a quasi-center of each concept cluster can be calculated as

$$q_c^{t} = \frac{1}{|X_c^{t}|} \sum_{o \in X_c^{t}} f_k(o), \tag{5}$$

$$q_c^{t} = \frac{|X_c^{t-1}|\, q_c^{t-1} + \sum_{o \in x_c^{t}} f_k(o)}{|X_c^{t-1}| + |x_c^{t}|}, \tag{6}$$

where $X_c^{t}$ are the samples that share $c$ up to the $t$-th training step and $f_k$ maps into $S_k$, the superordinate that aligns with $c$ according to the learned linguistic hierarchy. Eq. (6) dynamically updates the clustering center at each training step, where $x_c^{t}$ are the samples that appear at the $t$-th training step.
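The memory cache can be sketched as a running mean per concept. This assumes that the quasi-center is the mean of the mapped samples seen so far and that only a count and the current center need to be stored; the class name `QuasiCenterCache` and its interface are hypothetical.

```python
import numpy as np

class QuasiCenterCache:
    """Keeps one running center per concept from mapped visual samples."""

    def __init__(self, dim):
        self.dim = dim
        self.center = {}   # concept -> current quasi-center
        self.count = {}    # concept -> number of samples seen so far

    def update(self, concept, mapped_samples):
        new = np.asarray(mapped_samples, dtype=float).reshape(-1, self.dim)
        if len(new) == 0:
            return self.center.get(concept)
        n_old = self.count.get(concept, 0)
        old = self.center.get(concept, np.zeros(self.dim))
        n_new = n_old + len(new)
        # Incremental mean: equals the mean over all samples seen so far.
        self.center[concept] = (n_old * old + new.sum(axis=0)) / n_new
        self.count[concept] = n_new
        return self.center[concept]

cache = QuasiCenterCache(dim=2)
first = cache.update("red", [[1.0, 0.0], [3.0, 0.0]]).copy()
second = cache.update("red", [[5.0, 3.0]]).copy()
```

Updating incrementally avoids storing every cached sample, at the cost of treating features computed at earlier training steps as still valid.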

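For simplicity, we calculate the Euclidean distance between a mapped sample $f_k(o_i)$ and the existing clustering quasi-center $q_c^{t-1}$ of concept $c$,

$$d(o_i, c) = \lVert f_k(o_i) - q_c^{t-1} \rVert_2, \tag{7}$$

and balance it with the original cosine similarity in Eq. (8), where $\lambda$ is the decay coefficient of the distance.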
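One possible way to balance the distance term against the cosine similarity is sketched below. The subtractive form, the function name, and all numeric values are assumptions made for illustration; the exact combination used in Eq. (8) is not recoverable from the text here.

```python
import numpy as np

def clustered_probability(x, concept_emb, center, lam, gamma, tau):
    # x: mapped visual sample; center: quasi-center of the concept cluster.
    cos = x @ concept_emb / (np.linalg.norm(x) * np.linalg.norm(concept_emb))
    dist = np.linalg.norm(x - center)
    # Assumed form: the distance lowers the similarity, weighted by lam.
    logit = (cos - lam * dist - gamma) / tau
    return 1.0 / (1.0 + np.exp(-logit))

red = np.array([1.0, 0.0])
center = np.array([1.0, 0.0])
near = np.array([1.0, 0.0])   # same direction as "red", at the center
far = np.array([3.0, 0.0])    # same direction as "red", far from the center

p_near = clustered_probability(near, red, center, lam=0.2, gamma=0.5, tau=0.25)
p_far = clustered_probability(far, red, center, lam=0.2, gamma=0.5, tau=0.25)
p_far_no_decay = clustered_probability(far, red, center, lam=0.0, gamma=0.5, tau=0.25)
```

The two samples have identical cosine similarity to the concept, so only the distance term separates them; setting the coefficient to zero recovers the purely cosine-based score.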

With the cached visual samples, the learner infers the probability for concepts using Eq. (8) instead of Eq. (3). On the one hand, quasi-center visual concept clustering concentrates the distribution of visual representations: it reduces the distance between visual samples and the clustering center, giving a more consistent visual representation. On the other hand, the proposed clustering also updates the representation of concept words, which increases the similarity between (the clustering center of) visual samples and the corresponding concepts. This produces a better multi-modal embedding and increases the robustness of concept identification, as demonstrated in Figure 3 (right).

3.3 Superordinate Shortcut Learning

Quasi-center concept clustering concentrates the visual clusters within a superordinate; meanwhile, we can analyze the dependence among superordinates. As shown in Figure 4, two visual representations are extracted from the same object, one for each of two different superordinates. Due to the expressiveness of modern neural networks, a learner is likely to build a causal effect between these two variables, which are expected to be independent (dashed line in Figure 4). Suppose that the two superordinates are correlated in the training data and the learner figures out a shortcut from one to the other. The judgment in the second superordinate is then simultaneously influenced by both visual representations. Given the learned visual superordinate abstraction, the learner can explicitly model this shortcut by estimating the cross-superordinate prediction during training. Then, during inference, the true causal effect from a visual representation to the judgment in its own superordinate can be recovered by blocking the shortcut.

Figure 4: Demonstration of superordinate shortcut learning. To measure whether there is a spurious causal effect between two superordinates, the learner explicitly learns a shortcut by inferring the judgment in one superordinate from the concept identified in the other. The stop-gradient operator indicates that no gradient propagates through the regular branch when training the shortcut.

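For an object in the space of one superordinate, the learner infers the probability of it sharing each concept of that superordinate by Eq. (3), which gives a probability distribution over all concepts of the superordinate. The attribute of the object in this superordinate is then given by the most probable concept. Assuming that this concept is known to the concept learner (i.e. the shaded node in Figure 4), it estimates the information passed from this superordinate to another one by Eq. (9), which applies a shortcut mapping to the embedding of the identified concept in its own subspace. The shortcut mapping is the function that maps concepts of the first superordinate to the second.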

When learning the shortcut between two superordinates, the learner answers questions using Eq. (9) instead of Eq. (3). The resulting loss only updates the shortcut mapping, with the rest of the learner fixed. After training, the learner conducts regular inference with Eq. (3) and infers the true causal effect by subtracting the shortcut estimate from the initial estimates given by Eq. (3).

Algorithm 1 elaborates the training process of the proposed visual superordinate abstraction for robust concept learning.

Algorithm 1: Weakly-supervised learning for visual superordinate abstraction

Initialization: linguistic prior, visual mappings, concept embeddings, curriculum, and additional mappings if learning the superordinate shortcut

for each sample in curriculum lesson 1:    // learn linguistic abstraction
    for each program in the parsed question:    // execute parsed question
        calculate the concept probability using Eq. (1)    // run on scene objects
    predict the answer
    update the learner with the answer loss
for each sample in curriculum lesson 2:    // learn visual abstraction
    for each program in the parsed question:
        if with concept clustering: calculate the concept probability using Eq. (8)
        else if training the shortcut branch: calculate it using Eq. (9)
        else: calculate it using Eq. (3)
    predict the answer
    if type(question) == count: update with the counting loss
    else: update with the loss for the other question types    // e.g., type(question) == query
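The inference-time correction of Section 3.3 can be sketched as a subtraction of the shortcut estimate from the regular estimate. The scores below are made up, and performing the subtraction on raw scores (rather than on probabilities) is an assumption.

```python
import numpy as np

shapes = ["cube", "sphere", "cylinder"]

# Hypothetical scores from the shape superordinate for one object.
shape_scores = np.array([1.2, 1.0, -0.5])

# Hypothetical shortcut estimate: what the object's color alone suggests
# about its shape, as learned from biased training data.
shortcut_scores = np.array([0.9, 0.2, -1.1])

# Regular inference is pulled toward the shape favored by the color bias.
biased = shapes[int(np.argmax(shape_scores))]

# Blocking the shortcut: remove the part of the score explained by color.
debiased = shapes[int(np.argmax(shape_scores - shortcut_scores))]
```

In this toy case the regular scores prefer "cube" only because the color suggests it; once the shortcut contribution is removed, "sphere" becomes the top prediction.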

4 Experiments

We first provide the details of implementation in Section 4.1. Then we conduct experiments on the CLEVR dataset [johnson2017clevr] to provide an overall comparison of the reasoning ability under the regular setting with state-of-the-art methods in Section 4.2. We evaluate the generalization ability to new compositions on the CLEVR-CoGenT dataset [johnson2017clevr] in Section 4.3 and to perturbations on CLEVR-Perturb test data in Section 4.4, followed by clustering visualization and sensitivity analysis.

4.1 Implementation Details

Our baseline model strictly follows the design of NSCL [mao2019neuro] for a fair comparison. We only use images and question-answer pairs for training and adopt curriculum learning to train the learner. Since the semantic parser without program annotations in NSCL [mao2019neuro] has achieved nearly perfect parsing accuracy, we use the pre-trained parser and fix it during training our visual superordinate abstraction model.

We set the dimension of word embedding and positional embedding as and respectively. The dimension of visual attribute subspaces is . ResNet34 is used to extract the general feature of objects in -dim, and subspace mappings consist of one-layer linear projection. We adopt the AdamW optimizer [loshchilov2017decoupled] for training, with initial learning rate of . The batch size for training is set as . The scaling parameter and the shifting parameter are set to and , respectively. As for concept clustering, we only add the most likely positive sample into the cache at each training step, and the minimum samples to start the clustering is set to . The distance decay coefficient is set to , and sensitivity analysis is given in Section 4.4. The module for superordinate shortcut learning consists of two linear layers and a non-linear activation layer. We alternatively update and at the same frequency during training.

4.2 General Visual Reasoning

The CLEVR [johnson2017clevr] dataset contains 100k rendered images and about one million automatically generated question-answer pairs. The question types include querying attributes, comparing attributes (or numbers), counting, and logical reasoning such as existence. The attribute combinations during training are the same as those in testing. We compare our model with both implicit and explicit reasoning methods.

| Dataset | FiLM [perez2018film] | TbD [mascharka2018transparency] | NSCL [mao2019neuro] | MDETR [kamath2021mdetr] | MAC [hudson2018compositional] | Ours |
|---|---|---|---|---|---|---|
| CLEVR | 97.6 | 99.1 | 98.9 | 99.7 | 98.9 | 99.1 |
| CoGenT val-A | 98.3 | 98.8 | 97.9 | 99.8 | 96.9 | 98.0 |
| CoGenT val-B | 78.8 | 75.4 | 74.1 | 76.7 | 79.5 | 91.9 |

Table 1: Overall question answering accuracy (%) on CLEVR and CLEVR-CoGenT datasets.

The overall results are in the first row of Table 1, and a more detailed comparison can be found in our supplementary material. The end-to-end pretraining method MDETR [kamath2021mdetr] achieves the highest accuracy, which indicates the superiority of a jointly trained object detector. Among the remaining methods, TbD [mascharka2018transparency] obtains the same score as our model, which is the highest. Though flexible, the training of TbD [mascharka2018transparency] requires annotated programs for question parsing. Compared with FiLM [perez2018film], MAC [hudson2018compositional] and NSCL [mao2019neuro], which do not use extra annotations, our method performs slightly better, which verifies the efficacy of the visual superordinate abstraction framework in regular reasoning.

4.3 Robustness to New Compositions

The bias in training data can heavily affect the independence of learned concepts. CLEVR-CoGenT [johnson2017clevr] is proposed to diagnose this. There are two conditions for the data splits: in Condition A all cubes are gray, blue, brown, or yellow and all cylinders are red, green, purple, or cyan; in Condition B these shapes swap color palettes. Both conditions contain spheres of all eight colors. The training split is under Condition A, and validation splits A and B are under Conditions A and B respectively, as shown in Figure 5.

Figure 5: Illustration of biased data. Image A shows an example from the CLEVR-CoGenT val-A split, and image B is from the val-B split. The bar graph shows the correlation between color and shape learned from the biased training data.

In Figure 4, by inferring shape from color, the learner makes the most of the information transmitted through the shortcut from color to shape. Interestingly, the learned shortcut plotted in Figure 5 is consistent with the bias in the training data. Specifically, given a color in {gray, blue, brown, yellow}, the learner predicts a high probability of the cube shape, approximately twice that of the sphere, while the cylinder has a very small chance of appearing (close to zero). The opposite effect w.r.t. cube and cylinder can be observed for the other four colors. If we assume that the shape is uniformly distributed in the training data, i.e. $P(\text{shape}) = 1/3$, then the biased distribution of $P(\text{shape} \mid \text{color})$ over (cube, sphere, cylinder) is exactly $(2/3, 1/3, 0)$ (or vice versa) for the colors appearing in the two splits. The learner is ready to use the disclosed bias in the inference on split val-B.
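The bias discussed above can be worked out with Bayes' rule. The sketch assumes, beyond the uniform shape prior stated in the text, that each shape's colors are drawn uniformly from its allowed palette under Condition A.

```python
from fractions import Fraction

cube_colors = {"gray", "blue", "brown", "yellow"}
cylinder_colors = {"red", "green", "purple", "cyan"}
# Condition A palettes: spheres may take any of the eight colors.
palette = {
    "cube": cube_colors,
    "cylinder": cylinder_colors,
    "sphere": cube_colors | cylinder_colors,
}
prior = {shape: Fraction(1, 3) for shape in palette}   # uniform shape prior

def shape_given_color(color):
    # P(shape | color) is proportional to P(shape) * P(color | shape).
    joint = {
        shape: prior[shape] * (Fraction(1, len(colors)) if color in colors else 0)
        for shape, colors in palette.items()
    }
    total = sum(joint.values())
    return {shape: value / total for shape, value in joint.items()}

gray = shape_given_color("gray")
red = shape_given_color("red")
```

Under these assumptions a gray object is a cube with probability 2/3 and a sphere with probability 1/3, and the roles of cube and cylinder swap for a red object, which matches the two-to-one ratio described in the text.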

An overall statistical comparison with the state-of-the-art methods is listed in Table 1. Due to the distribution discrepancy, almost all the methods present an excellent reasoning ability on the val-A split, followed by a plunge on val-B. The val-A split data is consistent with the CoGenT training data, where the color attribute correlates with the shape attribute. In this case, MDETR [kamath2021mdetr] and TbD [mascharka2018transparency] perform slightly better than the other methods, including ours. On the val-B split, the correlation is inverted, and all the compared approaches are negatively affected. In contrast, our model still obtains high accuracy on the val-B split without any fine-tuning. We attribute this to learning a superordinate shortcut, which reduces the spurious causal effect in the training data.

abs cc sl Overall Count Exist Cnt () Cnt () Cnt () Comp. Attr. Query
NSCL 74.1 71.8 85.7 72.5 82.2 80.8 80.7 66.7
Ours 77.7 75.0 87.6 74.5 84.2 82.5 87.6 70.1
Ours 78.5 76.6 88.0 77.5 84.2 82.5 87.7 70.8
Ours 91.9 95.7 99.1 94.0 98.8 98.7 99.1 81.5
Table 2: Complete comparison on the val-B split of CLEVR-CoGenT dataset. Questions of comparing numbers are divided into count-equal (i.e. Cnt(=)), count-greater-than (i.e. Cnt(>)), and count-less-than (i.e. Cnt(<)). “abs”, “cc” and “sl” are abbreviations for visual superordinate abstraction, quasi-center concept clustering and superordinate shortcut learning respectively.

A detailed comparison with NSCL [mao2019neuro] on the val-B split of CLEVR-CoGenT is provided in Table 2. Under biased training, NSCL performs worse on new compositions of color and shape. In contrast, owing to shortcut learning upon the visual superordinate abstraction, our model achieves higher accuracy on all types of questions and a relative gain in overall accuracy. Among the tasks, our model surpasses NSCL by a large margin on comparing attributes. Causal inference brings a further relative improvement in overall accuracy for our model. Notably, it improves the performance on counting, and reasons nearly perfectly on existence and comparing attributes.

In Figure 5, we analyze the correlation between color and shape in CLEVR-CoGenT training data. Here we train a shortcut from color to material in a similar way. The learned conditional probability distribution for different colors is provided in Figure 6. The result reveals that there is no bias related to this pair of variables, which is consistent with the ground-truth setting in the CLEVR-CoGenT training set.

Figure 6: The learned correlation between color and material superordinates from the training data.

4.4 Robustness to Perturbations

To examine the learner’s robustness to perturbations in one superordinate, we synthesize a CLEVR-Perturb test set that contains k images, each with about question-answer pairs. The values for shape, size and material are set the same as those in the CLEVR dataset [johnson2017clevr]. As for the colors, each color is slightly perturbed to a nearby value. The specific color shift is listed in Figure 7. To precisely assess the robustness to color perturbations, we only generate the questions that require no reasoning ability about color. For example, in the original setting, a possible question is “How many other things are there of the same shape as the tiny cyan matte object?”. Though the final step is querying shape, this question requires identifying “cyan” and will not appear in the synthesized CLEVR-Perturb test set. We compare our model with NSCL [mao2019neuro] that performs best on the CLEVR dataset without extra annotations, and with the baseline without concept clustering.
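The question selection rule for CLEVR-Perturb can be sketched as a filter over parsed programs. The program representation as (operation, arguments) pairs and the function name are hypothetical; the eight color names are those of CLEVR.

```python
COLOR_TOKENS = {"gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow",
                "color"}

def requires_color(program):
    # A question needs color reasoning if any step mentions a color concept
    # or the color attribute itself.
    return any(COLOR_TOKENS & set(args) for _, args in program)

# "How many other things are there of the same shape as the tiny cyan matte object?"
with_cyan = [("filter", ("tiny", "cyan", "matte")), ("same", ("shape",)), ("count", ())]
# The same question without the color word.
without_color = [("filter", ("tiny", "matte")), ("same", ("shape",)), ("count", ())]

kept = [p for p in (with_cyan, without_color) if not requires_color(p)]
```

The first program is dropped even though its final step queries shape, because identifying the referenced object depends on recognizing "cyan".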

Figure 7: Illustration of color perturbation. The left image is an original from CLEVR [johnson2017clevr], and the right one is synthesized by adding color perturbations. The specific colors used in the two settings are listed below the respective images.

Model  abs  sl  cc  Overall  Count  Exist  Cnt ()  Cnt ()  Cnt ()  Comp. Attr.  Query
NSCL                87.1     76.2   91.8   81.1    91.1    92.2    90.9         91.7
Ours                92.9     85.5   96.6   87.8    96.0    95.8    95.7         95.7
Ours                92.7     85.7   96.1   88.3    96.1    96.0    95.2         95.5
Ours                93.6     87.3   96.9   89.0    96.5    96.0    95.7         96.0
Table 3: Comparison on the CLEVR-Perturb test set. Questions of comparing numbers are divided into count-equal (i.e. Cnt(=)), count-greater-than (i.e. Cnt(>)), and count-less-than (i.e. Cnt(<)). “abs”, “cc” and “sl” are abbreviations for visual superordinate abstraction, quasi-center concept clustering and superordinate shortcut learning respectively.

Table 3 lists the statistical results. Comparing the variant model without concept clustering with NSCL [mao2019neuro], we can see obvious relative improvements on all types of questions (about for counting, for comparing numbers, for existence, for comparing attributes, and for querying). This indicates that applying linguistic abstraction leads to enhanced visual abstraction. Equipped with concept clustering, our model further boosts the performance on counting-related questions, which shows the superiority of the proposed visual superordinate abstraction framework.
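For reference, a relative improvement is computed as the accuracy difference divided by the baseline accuracy. The sketch below applies this to the NSCL row and the last “Ours” row of Table 3 (the one with the highest overall accuracy), using only the columns whose headers are unambiguous. The dictionary layout and the function name are ours; note that this row is not necessarily the variant without concept clustering discussed above.

```python
# Accuracies (%) copied from Table 3: the NSCL row and the last "Ours" row.
nscl = {"Overall": 87.1, "Count": 76.2, "Exist": 91.8,
        "Comp. Attr.": 90.9, "Query": 91.7}
ours = {"Overall": 93.6, "Count": 87.3, "Exist": 96.9,
        "Comp. Attr.": 95.7, "Query": 96.0}


def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Relative gain per question type, rounded to one decimal place.
gains = {name: round(relative_gain(ours[name], nscl[name]), 1) for name in nscl}

for name, gain in gains.items():
    print(f"{name}: +{gain}% relative")
```

With these numbers the overall relative gain comes out at roughly 7.5%, and counting shows the largest relative gain among the listed question types.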

Referring to the qualitative example shown in Figure 1, the well-trained NSCL model performs nearly perfectly on the left scene but much worse on the right scene. This reveals that its recognition of the material and shape attributes is affected by color. In contrast, our model identifies metal and sphere correctly. More importantly, we observe that for each concept our model assigns nearly equal probabilities to the different objects that share it. This indicates that the learner abstracts only the features relevant to the current concept (e.g. metal), so objects with the same material receive similar scores. More examples can be found in the supplementary material.

(a) NSCL [mao2019neuro] (b) Ours w.o. cc (c) Ours
Figure 8: Clustering of visual samples within the color superordinate. We compare NSCL [mao2019neuro] with our model and with our model without concept clustering. The color values used in (a)-(c) are exactly those used in the CLEVR-Perturb test set (please refer to Figure 7 for more details).

4.5 Clustering within Superordinate

For the CLEVR-Perturb test set, we compare in detail the clusters formed by the visual samples around different concepts. Clustering results are shown in Figure 8. We observe that even the variant without concept clustering learns a clearly more discriminative subspace than NSCL [mao2019neuro]. Equipped with concept clustering, our model further enhances the discrimination within the superordinate. The better clustering results also account for the higher reasoning accuracy. Note that, for fairness, in the comparison in Table 3 we avoid querying about color on the CLEVR-Perturb test set. Furthermore, Figure 8 (a)-(c) demonstrates that our model improves the discrimination of different primitive concepts (e.g. ‘gray’, ‘cyan’, and ‘yellow’) within the ‘color’ superordinate, even when the input colors are perturbed.

Figure 9: Question answering accuracy w.r.t. the decay coefficient, which ranges from to on a logarithmic scale. The performance of the proposed model without concept clustering serves as a reference.

4.6 Sensitivity Analysis

Figure 9 presents the sensitivity of the proposed model w.r.t. the decay coefficient in Eq. (8). The cached samples around a concept provide more visual information for reference in identification. We tune the coefficient from to on a logarithmic scale, and record the accuracy change for each question type. The results show that all the variants perform better on binary reasoning questions, including comparing numbers or attributes, reasoning about existence, and querying attributes, than on specific counting questions. We attribute this to the fact that, for the learner, counting is a more complicated discretization process in which probabilities higher than a threshold are binarized and then added up. Not surprisingly, this complicated task benefits the most from improving the learner’s robustness, both through the proposed superordinate abstraction and through the concept clustering. Figure 9 illustrates that as the decay coefficient increases, the performance first improves and then becomes worse than the baseline. This indicates that there is a trade-off between the effect of linguistic tokens and that of visual samples within a superordinate.
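The threshold-and-sum view of counting can be made concrete with a minimal sketch. The threshold of 0.5, the function names and the probability values below are assumptions chosen for illustration; they are not the settings used in our experiments.

```python
def count_objects(probabilities, threshold=0.5):
    """Binarize per-object probabilities at `threshold`, then add them up."""
    return sum(1 for p in probabilities if p > threshold)


def exists(probabilities, threshold=0.5):
    """Binary existence check: does at least one object pass the threshold?"""
    return any(p > threshold for p in probabilities)


# Hypothetical per-object probabilities of matching a concept.
clean = [0.90, 0.55, 0.10]
# The same objects after a small perturbation shifts the scores slightly.
perturbed = [0.85, 0.45, 0.10]

clean_count = count_objects(clean)
perturbed_count = count_objects(perturbed)
clean_exists = exists(clean)
perturbed_exists = exists(perturbed)
```

In this toy case the small shift moves one borderline score across the threshold, which changes the count while leaving the existence answer unchanged. This is one way to see why counting is more sensitive to perturbations than binary questions.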

4.7 Reasoning Details

To provide an intuitive comparison between our model and NSCL [mao2019neuro], we present the reasoning details on the val-A and val-B splits of the CLEVR-CoGenT dataset [johnson2017clevr] in Figure 10 and Figure 11, respectively. They show that NSCL [mao2019neuro] performs almost perfectly on val-A, but confuses ‘cube’ and ‘cylinder’ on val-B. Our model overcomes the bias on the val-B split.

Figure 10: Illustration of reasoning details on the CLEVR-CoGenT val-A set. The colored rectangles mark the objects in the corresponding bounding boxes.
Figure 11: Illustration of reasoning details on the CLEVR-CoGenT val-B set. The colored rectangles mark the objects in the corresponding bounding boxes.

5 Conclusion

In this paper, we propose a visual superordinate abstraction framework for concept learning. The learner acquires linguistic abstraction from soft-aligned questions, which in turn contributes to the discrimination of visual abstraction. On top of the framework, we devise a quasi-center concept clustering scheme and a superordinate shortcut learning scheme to address issues such as perturbations and biased training. Experiments under different settings verify the superiority of the proposed model.

A potential limitation of this paper is that the proposed visual superordinate abstraction has not been validated on large-scale real-world datasets. The synthesized CLEVR datasets provide ideal and controllable environments that allow the community to directly explore and evaluate different concept learners. In this paper, we mainly focus on analyzing the potential bottleneck of existing methods, and point out that most of them ignore the valuable abstraction capability in human reasoning. In future work, we aim to further explore the capacity of the proposed visual superordinate abstraction in real-world scenarios.