Visual reasoning (VR) is the ability of an autonomous system to construct a rich representation of a visual scene and perform multi-step inference over the scene’s constituents and their relationships. It stands among the key outstanding challenges in computer vision. Common tangible instantiations of VR include language-driven tasks such as Visual Question Answering (VQA) (antol2015vqa) and Visual Commonsense Reasoning (VCR) (zellers2019recognition). Recent advances in computer vision, representation learning, and natural language processing have enabled continued progress on VQA with a wide variety of modeling approaches (hudson2019learning; andreas2016neural; anderson2018bottom; hudson2018compositional; tan2019lxmert).
A defining characteristic of VR is the interaction between a perception system (object detection and scene representation learning) and a reasoning system (question interpretation and inference grounded in the scene). However, this interaction is difficult to capture and assess accurately. For example, the definition of VQA has evolved over time to eliminate language biases that impeded its robustness as a VR metric. The early VQA datasets were biased toward real-world language priors to the extent that many questions were answerable without looking at the image (agrawal2018don). Subsequent versions improved the balance but still mostly involved simple inference questions with little requirement for multi-step reasoning.
To facilitate progress in VR, hudson2019gqa proposed GQA, a procedurally generated VQA dataset of multi-step inference questions. Although GQA targets compositional multi-step reasoning, the current GQA Challenge primarily evaluates visual perception rather than reasoning of a VQA model. As we show in Section 4, a neuro-symbolic VQA model that has access to ground-truth scene graphs achieves 96% accuracy on GQA. Moreover, language interpretation (semantic parsing) alone does not capture the complexity of VR due to the language in questions being procedurally generated. As a result, while GQA is well suited as an evaluation environment for VR (for multi-modal pretraining tasks (tan2019lxmert; zhou2019unified)), a higher GQA accuracy does not necessarily imply a higher reasoning capability. In this work, we supplement GQA with a differentiable first-order logic framework -FOL that allows us to isolate and assess the reasoning capability of a VQA model separately from its perception.
The -FOL Framework: -FOL is a neuro-symbolic VR model. Neuro-symbolic models such as MAC (hudson2018compositional), Neural Module Networks (andreas2016neural), and Neural State Machines (hudson2019learning) implement compositional multi-step inference by modeling each step as a differentiable operator from a functional specification of the question (a program) or its approximation. This facilitates systematicity, compositionality, and out-of-distribution generalization in VQA because accurate inference of a given question commonly requires accurate inference over its constituents and entailed questions (vedantam2019probabilistic). They, however, commonly operate over the latent feature representations of objects and their relations, produced by the underlying perception module. This entanglement not only limits interpretability of the learned neuro-symbolic inference blocks, but also limits the reasoning techniques applicable for VQA improvement.
In contrast to SOTA neuro-symbolic approaches, -FOL fully disentangles visual representation learning of a VQA model from its inference mechanism, while still being end-to-end trainable with backpropagation (see Figure 1). This enables identifying GQA questions solvable via perception vs. reasoning and evaluating their respective contributions.
VQA Reasoning Evaluation Score: To assess the reasoning capability of a VQA model, we define the VQA reasoning evaluation score as the extent to which the model can answer a question despite imperfect visual perception. If the input image is noisy or the perception system is imperfect, the learned object representations do not contain enough information to determine certain attributes of the objects. This potentially impedes question answering and may require non-trivial reasoning. For example, an object detection module that misclassifies wolves as huskies may impede answering the question “Is there a husky in the living room?” Similarly, the question “What is behind the broken wooden chair?” relies on the information capturing “broken”, “wooden”, and “chair” attributes in the representation of the corresponding object. Many VQA models answer such questions nonetheless (by disregarding weak attribute signals when a strong “chair” signal is present in a single object in the scene), which exemplifies the kind of visual reasoning we aim to assess in VQA. In contrast, the questions that can be answered using a pre-trained perception system and parameter-less logical inference do not require reasoning per se as their visual representations contain all the information necessary to answer the question.
Contributions: This work makes three contributions:
We introduce differentiable first-order logic as a common formalism for compositional visual reasoning and use it as a foundation for the inference in -FOL.
We use -FOL to define a disentangled evaluation methodology for VQA systems to assess the informativeness of perception as well as the power of reasoning separately. To this end, we introduce a VQA reasoning evaluation score, an augmentation of GQA evaluation that eliminates questions primarily resolved by perception. With it, we evaluate two representatives from two families of VQA models: MAC (hudson2018compositional) and LXMERT (tan2019lxmert).
As a simple way of going beyond logical reasoning, we propose top-down calibration via the question context on top of FOL reasoning and show that it improves the accuracy of -FOL on the visually hard questions.
2 Related Work and Background
Visual Question Answering: VQA has been used as a front-line task to research and advance VR capabilities. The first release of the VQA dataset (antol2015vqa) initiated annual competitions and a wide range of modeling techniques aimed at addressing visual perception, language understanding, reasoning, and their combination (lu2019vilbert; tan2019lxmert; hudson2018compositional; hudson2019learning; anderson2018bottom; li2019unicoder; zhou2019unified). To reduce the annotation effort and control the problem complexity, the CLEVR (johnson2017clevr) and GQA (hudson2019gqa) tasks propose synthetic construction of scenes and questions, respectively.
Capturing and measuring the extent of human ability in VR accurately is a significant challenge in task design as well as modeling. Datasets have to account for language and real-world biases, such as non-visual and false-premise questions (ray2016question). VQA models, when uncontrolled, are known to “solve” the task by exploiting language priors (agrawal2016analyzing; zhou2015simple). Different techniques have been proposed to control this phenomenon. agrawal2018don adversarially separate the distributions of training and validation sets. goyal2017making balance the VQA dataset by asking human subjects to identify distractors – visually similar images that yield different answers for the same questions. Recently, selvaraju2020squinting augment the VQA dataset with human-annotated subquestions that measure a model’s reasoning consistency in answering complex questions. In this work, we propose another step to improve the accuracy of VQA reasoning assessment by capturing a “hard” subset of GQA questions where perception produces imperfect object representations.
Neuro-Symbolic Reasoning: -FOL is a neuro-symbolic reasoning model (garcez2019neural). In neuro-symbolic reasoning, answer inference is defined as a chain of differentiable modules wherein each module implements an “operator” from a latent functional program representation of the question. The approach is applicable to a wide range of tasks, including visual QA (andreas2016neural; vedantam2019probabilistic; hudson2018compositional), reading comprehension of natural language (chen2020neural), and querying knowledge bases, databases, or other structured sources of information (saha-etal-2019-complex; neelakantan2015neural; neelakantan2016learning; liang2016neural). The operators can be learned, as in MAC (hudson2018compositional), or pre-defined, as in NMN (andreas2016neural). In contrast to semantic parsing (cacm-liang2016learning) or program synthesis (gulwani2017program; parisotto2016neuro), the model does not necessarily emit a symbolic program, although it can involve one as an intermediate step to construct the differentiable pipeline (as in -FOL). Neuro-symbolic reasoning is also similar to neural program induction (NPI) (reed2015neural; cai2017making; pierrot2019learning), but the latter requires strong supervision in the form of traces, and the learned “operators” are not always composable or interpretable.
The main benefit of neuro-symbolic models is their compositionality. Because the learnable parameters of individual operators are shared for all questions and subsegments of the differentiable pipeline correspond to constituents of each question instance, the intermediate representations produced by each module are likely composable with each other. This, in turn, facilitates interpretability, systematicity, and out-of-distribution generalization – commonly challenging desiderata of reasoning systems (vedantam2019probabilistic). In Section 6, we demonstrate them in -FOL over VQA.
Neuro-symbolic models can be partially or fully disentangled from the representation learning of their underlying ground-world modality (vision in the case of VQA). Partial entanglement is the most common, wherein the differentiable reasoning operates on featurizations of the scene objects rather than raw pixels but the featurizations are in the uninterpretable latent space. Neural State Machine (NSM) (hudson2019learning) and the eXplainable and eXplicit Neural Modules (XNM) (shi2019explainable) are prominent examples of such frameworks. As for full disentanglement, there are the Neural-Symbolic Concept Learner (NS-CL) (mao2019neuro) and Neural-Symbolic VQA (NS-VQA) (yi2018neural), which separate scene understanding, semantic parsing, and program execution with symbolic representations in between, similar to -FOL. However, NS-CL, NS-VQA, and XNM are all based on operators that are heuristic realizations of the task-dependent domain-specific language (DSL) of their target datasets. In contrast, we propose a task-independent, mathematical formalism that is probabilistically derived from first-order logic independent of any specific DSL. This highlights two important differences between -FOL and NS-CL, NS-VQA, or XNM. First, compared to these frameworks, -FOL is more general-purpose: it can implement any DSL that is representable by FOL. Second, our proposed disentangled evaluation methodology in Section 4 requires the reasoning framework to be mathematically sound so that we can reliably draw conclusions based on it; this is the case for our FOL inference formalism. Furthermore, while NS-CL and NS-VQA have only been evaluated on CLEVR (with synthetic scenes and a limited vocabulary), -FOL is applied to real-life scenes in GQA.
Finally, we note that outside of VR, logic-based, differentiable neuro-symbolic formalisms have been widely used to represent knowledge in neural networks (serafini2016logic; socher2013reasoning; xu2018semantic). A unifying framework for many of such formalisms is Differentiable Fuzzy Logics (DFL) (van2020analyzing) which models quantified FOL within the neural framework. Despite the similarity in formulation, the inference in DFL is generally of exponential complexity, whereas -FOL proposes a dynamic programming strategy to perform inference in polynomial time, effectively turning it into program-based reasoning of recent VQA frameworks. Furthermore, while these frameworks have been used to encode symbolic knowledge into the loss function, -FOL is used to specify a unique feed-forward architecture for each individual instance in the dataset; in that sense, -FOL is similar to recent neuro-symbolic frameworks proposed to tackle the SAT problem (selsam2018learning; amizadeh2018learning; amizadeh2019pdp).
3 Differentiable First-Order Logic for VR
We begin with the formalism of differentiable first-order logic (DFOL) for VR systems, which forms the foundation for the -FOL framework. DFOL is a formalism for inference over statements about an image and its constituents. It has two important properties: (a) it disentangles inference from perception, so that the operation "filter all the red objects in the scene" can be split into determining the “redness” of every object and attending to the ones deemed sufficiently red, and (b) it is end-to-end differentiable, which allows training the perception system from inference results. In Section 4, we show how DFOL enables us to measure reasoning capabilities of VQA models.
3.1 Visual Perception
Given an image , let be a set of feature vectors representing a set of objects detected in . This detection can be done via different pre-trained models such as Faster-RCNN (ren2015faster) for object detection or Neural Motifs (neuralmotifs) or Graph-RCNN (yang2018graph) for scene graph generation (see footnote 1). As is common in VQA, we assume as given, and refer to it as the scene visual featurization.
Footnote 1: The featurization can also include features of relations between the objects. Relation features have been shown to be helpful in tasks such as image captioning and information retrieval (rscan).
Furthermore, we introduce the notion of neural visual oracle where and are neural models parametrized by vectors and , respectively. Conceptually, computes the likelihood of the natural language predicate holding for object (e.g. ). Similarly, calculates the likelihood of holding for a pair of objects and (). combined with the visual featurization forms the perception system of -FOL.
3.2 First-Order Logic over Scenes
Given objects in the scene, we denote by the upper-case letters categorical variables over the objects’ index set . The values are denoted by subscripted lower-case letters – states that is set to refer to the -th object in the scene. The -arity predicate defines a Boolean function on variables defined over . In the context of visual scenes, we use unary predicates to describe object names and attributes ( and ), and binary predicates to describe relations between pairs of objects (). Given the definitions above, we naturally define quantified first-order logical (FOL) formulae; for example, the formula
$$\exists X, \forall Y: \text{Chair}(X) \wedge \text{Left}(X, Y) \tag{1}$$
states that "There is a chair in the scene that is to the left of all other objects."
FOL is a more compact way to describe the visual scene than the popular scene graph (yang2018graph) notation, which can be seen as a Propositional Logic description of the scene, also known as grounding the formula. More importantly, while scene graph is only used to describe the scene, FOL allows us to perform inference over it. For instance, the formula in Eq. (1) also encodes the binary question "Is there a chair in the scene to the left of all other objects?" In other words, a FOL formula encodes both a descriptive statement and a hypothetical question about the scene. This is the key intuition behind -FOL and the common formalism behind its methodology.
Given a NL (binary) question and a corresponding FOL formula , the answer is the result of evaluating . We reformulate this probabilistically as
The naïve approach to calculate the probability in Eq. 2 requires evaluating every instantiation of , the number of which is exponential. Instead, we propose a dynamic programming strategy based on the intermediate notion of attention, which casts inference as a multi-hop execution of a functional program in polynomial time.
Assume is minimal and contains only the operators and (which are functionally complete). We begin by defining the concept of attention which in -FOL naturally arises by instantiating a variable in the formula to an object:
Given a FOL formula over the variables , the attention on the object w.r.t. is:
Similarly, one can compute the joint attention by fixing more than one variable to certain objects. For example, given the formula in Eq. 1, represents the probability that "The -th object in the scene is a chair that is to the left of all other objects." and represents the probability that "The -th object in the scene is to the right of a chair.".
Next, we define the attention vector on variable w.r.t. formula as . In a similar way, we define the attention matrix on two variables and w.r.t. formula as . Given these definitions, the following lemma gives us the first step toward calculating the likelihood in Eq. (2) from attention values in polynomial time:
Let be a FOL formula with leftmost variable that appears with logical quantifier . Then we have:
where is the attention vector on and is the quantifier-specific aggregation function defined as:
Furthermore, given two matrices and , we define the matrix -product w.r.t. the quantifier as:
where and are respectively the -th row of and the -th column of , and denotes the Hadamard product. In general, the -product can be used to aggregate attention tensors (multi-variate logical formulas) along a certain axis (a specific variable) according to the variable’s quantifier.
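The aggregation and product definitions above can be illustrated with a small pure-Python sketch. The displayed equations are not reproduced in this text, so the concrete forms below are an assumption: each object's attention is treated as an independent probability, which gives a product for "for all" and a complement-of-product for "exists". The names `aggregate` and `q_product` and the toy numbers are hypothetical.

```python
from math import prod

def aggregate(alpha, quantifier):
    """Collapse an attention vector into one probability for a quantifier.

    Assumed forms (independence across objects), not taken from the text.
    """
    if quantifier == "forall":       # every object satisfies the formula
        return prod(alpha)
    if quantifier == "exists":       # at least one object satisfies it
        return 1.0 - prod(1.0 - a for a in alpha)
    if quantifier == "not_exists":   # no object satisfies it
        return prod(1.0 - a for a in alpha)
    raise ValueError(f"unknown quantifier: {quantifier}")

def q_product(A, B, quantifier):
    """Entry (i, j) aggregates the Hadamard product of row i of A and column j of B."""
    cols_B = list(zip(*B))
    return [[aggregate([a * b for a, b in zip(row, col)], quantifier)
             for col in cols_B] for row in A]

relation = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical attention matrix on two variables
alpha_y = [[1.0], [0.5]]             # hypothetical attention vector, stored as a column
result = q_product(relation, alpha_y, "exists")
print(result)
```

With a single-column second argument, the product reduces a relation matrix and an attention vector on one variable to an attention vector on the other, which is the shape of computation the Relate step needs.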
Lemma 3.1 reduces the computation of the answer likelihood to computing the attention vector of the leftmost variable w.r.t. . The latter can be further calculated recursively in polynomial time as described below.
Lemma 3.2 (Base Case).
If only constitutes the literal , the attention vector is the vector.
Lemma 3.3 (Recursion Case).
We have three cases:
(A) Negation Operator: if , then we have:
(B) Filter Operator: if where is a unary predicate, then:
(C) Relate Operator: if where is the set of all binary predicates defined on variables and in , then we have:
where is the quantifier of variable in .
The attention vector and the attention matrix in Eqs. (11) and (3.3), respectively, form the leaves of the recursion tree and contain the probabilities of the atomic predicate holding for specific object instantiations. These probabilities are directly calculated by the visual oracle . In particular, we propose:
where and denote the sets of all unary and binary predicates in the model’s concept dictionary.
The recursion steps in Lemma 3.3 can be seen as operators that, given an input attention vector, produce an output attention vector. In fact, Eq. 11 and Lemma 3.3 are respectively the DFOL embodiments of the abstract Filter and Relate operations widely used in multi-hop VQA models. In other words, by abstracting the recursion steps in Lemma 3.3 into operators, we turn a descriptive FOL formula into an executable program which can be evaluated to produce the probability distribution of the answer. For example, by applying the steps in Lemmas 3.1-3.3 to Eq. 1, we get the following program to calculate its likelihood:
Algorithm 1 presents the final operationalization of question answering as inference over formulae in DFOL. For open questions such as “What is the color of the chair to the left of all objects?”, it translates them into a set of binary questions over the plausible set of answer options (all color names), which can be predefined or learned.
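To make the operator view concrete, the sketch below evaluates the question "Is there a chair in the scene to the left of all other objects?" on a toy three-object scene. It is an illustration under stated assumptions, not the paper's Algorithm 1: the operator forms assume independent per-object probabilities, the base-case attention is taken to be the all-ones vector, the diagonal of the relation matrix is set to 1 so that an object is not compared with itself, and all function names and oracle outputs are hypothetical.

```python
from math import prod

def filter_op(pred_attention, alpha):
    """Filter: Hadamard product of a unary-predicate attention and the incoming attention."""
    return [p * a for p, a in zip(pred_attention, alpha)]

def relate_forall(rel, alpha_y):
    """Relate under 'for all Y': object i must hold the relation with every attended j."""
    return [prod(r * a for r, a in zip(row, alpha_y)) for row in rel]

def exists(alpha):
    """Final aggregation for an existentially quantified leftmost variable."""
    return 1.0 - prod(1.0 - a for a in alpha)

# Hypothetical oracle outputs for a scene with three detected objects.
chair = [0.9, 0.1, 0.2]                 # P(object i is a chair)
left = [[1.0, 0.8, 0.9],                # P(object i is left of object j);
        [0.1, 1.0, 0.6],                # diagonal fixed to 1.0 (assumption:
        [0.05, 0.3, 1.0]]               # "all OTHER objects" skips self-comparison)

alpha_y = [1.0, 1.0, 1.0]               # base case: no constraint on Y yet (assumed all-ones)
alpha_x = relate_forall(left, alpha_y)  # X is left of every Y
alpha_x = filter_op(chair, alpha_x)     # ... and X is a chair
answer = exists(alpha_x)                # there exists such an X
print(alpha_x, answer)
```

An open question such as the color example above could reuse the same program once per candidate answer, scoring each resulting binary question and picking the most likely option.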
4 VQA Reasoning Evaluation Score
In this section, we describe our methodology of VQA reasoning evaluation. Given a VQA model over the visual featurization , our goal is to study and measure:
Q1: how informative a visual featurization is on its own to accomplish a certain visual reasoning task, and
Q2: how much the reasoning capabilities of a model can compensate for the imperfections in perception to accomplish a reasoning task.
To this end, we use the GQA dataset (hudson2019gqa) of multi-step functional visual questions. The GQA dataset consists of 22M questions defined over 130K real-life images. Each image in the Train/Validation splits is accompanied by the scene graph annotation, and each question in the Train/Validation/Test-Dev splits comes with its equivalent program. We translate the programs in GQA into a domain-specific language (DSL) built on top of the four basic operators Filter, Relate, Neg and introduced in the previous section. The DSL covers of the questions in GQA. See Appendix for its definition.
The DFOL formalism allows us to establish an upper bound on reasoning – the accuracy of a neuro-symbolic VQA model when the information in its visual featurization is perfect. To measure it, let be a golden visual oracle based on the information in the ground-truth GQA scene graphs. The parameter-less -FOL inference from Section 3 achieves 96% accuracy on the GQA validation split using the golden oracle and the golden programs. We manually inspected the remaining 4% and found that almost all involved errors in the scene graph or the golden program.
This result not only verifies the soundness of -FOL as a probabilistic relaxation of the GQA DSL, but also establishes that question understanding alone does not constitute the source of complexity in the compositional question answering on GQA. In other words, the main contributing factor to the performance of GQA models is the representation learning in their underlying perception systems. However, even with imperfect perception, many models successfully recover the right answer using language priors, real-world biases, and other non-trivial learned visual reasoning. Using -FOL, we present a metric to quantify this phenomenon.
Reasoning with Imperfect Perception: Let be a fixed scene featurization, often produced by a pre-trained Faster-RCNN model. Let be a GQA question and be its corresponding DFOL formula. The VQA Reasoning Evaluation is based on two key observations:
If the probabilistic inference over produces a wrong answer, the featurization does not contain enough information to correctly classify all attributes, classes, and relations involved in the evaluation of .
If is informative enough to enable correct probabilistic inference over , then is an “easy” question – the right answer is accredited to perception alone.
Let a base model be an evaluation of Algorithm 1 given some visual oracle trained and run over the features . Note that the inference process of described in Section 3 involves no trainable parameters. Thus, its accuracy stems entirely from the accuracy of on the attributes/relations involved in any given question (see footnote 2). Assuming a commonly reasonable architecture for the oracle (a deep feed-forward network over followed by sigmoid activation) trained end-to-end with backpropagation from the final answer through , the accuracy of thus indirectly captures the amount of information in directly involved in the inference of a given question – Q1 above.
Footnote 2: This is not the same as classification accuracy of in general because only a small fraction of objects and attributes in the scene are typically involved in any given question.
With this in mind, we arrive at the following procedure for quantifying the extent of reasoning of a VQA model :
Fix an architecture for as described above. We propose a standard in our experiments in Section 6.
Train the oracle on the Train split of GQA using backpropagation through from the final answer.
Let be a test set for GQA. Evaluate on using the trained oracle and ground-truth GQA programs.
Let and be respectively the set of successful and failed questions by ().
Measure the accuracy of on .
Measure the error of on .
The easy set and hard set define, respectively, GQA instances where visual featurization alone is sufficient or insufficient to arrive at the answer. By measuring a model’s accuracy on the hard set (or error on the easy set), we determine the extent to which it uses the information in the featurization to answer a hard question (or, resp., fails to do so on an easily solvable question) – Q2 above.
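The bookkeeping behind the procedure above is simple enough to sketch. The function below takes per-question correctness flags for the base model and for a candidate model on the same test set, splits the questions into easy and hard sets, and returns the candidate's accuracy on the hard set and error on the easy set. The function name and the toy flags are hypothetical; how the flags are produced (training the oracle, running the models) is outside this sketch.

```python
def reasoning_scores(base_correct, model_correct):
    """Return (accuracy on hard set, error on easy set) for a candidate model.

    base_correct[i]  -- True if the base model answered question i correctly
    model_correct[i] -- True if the candidate model answered question i correctly
    """
    if len(base_correct) != len(model_correct):
        raise ValueError("both models must be evaluated on the same questions")
    easy = [i for i, ok in enumerate(base_correct) if ok]      # perception suffices
    hard = [i for i, ok in enumerate(base_correct) if not ok]  # perception falls short
    if not easy or not hard:
        raise ValueError("both the easy and the hard set must be non-empty")
    acc_hard = sum(1 for i in hard if model_correct[i]) / len(hard)
    err_easy = sum(1 for i in easy if not model_correct[i]) / len(easy)
    return acc_hard, err_easy

# Toy example with five questions.
base = [True, True, True, False, False]
candidate = [True, False, True, True, False]
acc_hard, err_easy = reasoning_scores(base, candidate)
print(acc_hard, err_easy)
```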
Importantly, need not be a DFOL-based model, or even a neuro-symbolic model, or even based on any notion of a visual oracle – we only require it to take as input the same visual features . Thus, its accuracy on or error on is entirely attributable to its internal interaction between vision and language modalities. Furthermore, we can meaningfully compare ’s reasoning score to that of any VQA model that is based on the same featurization. (Although the comparison is not always “fair” as the models may differ in their pre-training data, it is still meaningful.)
5 Top-Down Contextual Calibration
We now present top-down contextual calibration as one way of augmenting logical reasoning to compensate for imperfect perception. Note that the FOL reasoning is a bottom-up process in the sense that every time the oracle is queried, it does not take into consideration the broad context of the question. Nevertheless, considering additional information such as the context of the question can be useful, especially when the visual perception is imperfect.
Every formula defines two conditional likelihoods on the attention values over the population of all images in the dataset: and . In general, the bottom-up process assumes these two distributions are well separated on the extremes for every . However, due to the imperfection of , that is not the case in practice. The Bayesian way to address this issue is to estimate these likelihoods and use the posterior instead of . This is the classical notion of calibration in binary classification (platt2000probabilities). In our framework, we have developed the neural version of the Beta Calibration (kull2017beta) to calculate the above posterior. In particular, we assume the likelihoods and can be modeled as two Beta distributions with parameters and , respectively. Then, the posterior becomes where:
is called the calibration function. Here , and is the prior. Furthermore, where is the Beta function. By , we denote the parameters of the calibration function that are applied after the -th operator of during the attention calculation. Instead of estimating these parameters for each possible and , we amortize the computation by modeling them as a function of question context using a Bi-LSTM (huang2015bidirectional):
where is an MLP with parameters and denotes the -th state of a Bi-LSTM parametrized by . Here denotes the context of the formula , which is defined as the sequence of the predicates present in the program. For example, for the formula in Eq. (1), we have . The word embedding of this context is then fed to the Bi-LSTM network as its input. Figure 2 (Left) shows our proposed top-down calibration mechanism and how it affects the DFOL reasoning process. To train this calibrator, we first train the Base model without the calibrator as before. We then freeze the weights of the visual oracle in the Base model, add the calibrator on top, and run backpropagation again through the resulting architecture on the training data to tune the weights of the calibrator.
Note that for parameter values and , the calibration function in Eq. (16) is just the Identity function; that is, the calibration function does nothing and the reasoning stays purely logical. However, as the parameters deviate from these values, so does the behavior of reasoning from the logical reasoning. Interestingly, depending on the values of its parameters, the behavior of the calibration function is quite often interpretable. In Figure 2 (Right), we have shown how the calibrator, for example, can sharpen, blur, suppress or excite visual attention values via the parameters of the calibration function. This behavior is indeed context-dependent and learned by the calibrator from data. For example, if the model sees the "broken wooden chair" phrase enough times but the visual featurization is not informative enough to always detect "broken" in the image, the calibrator may decide to excite visual attention values upon seeing that phrase so it can make up for the imperfection of the visual system and still answer the question correctly.
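The posterior under the two-Beta assumption can be worked out directly, and the sketch below does so. The closed form and the parameter names `a`, `b`, `c`, `d` are a derivation from that assumption plus Bayes' rule; they are not copied from Eq. (16), whose exact parametrization is not reproduced in this text. In this form the function is the identity when `a = b = 1` and `c * d = 1`.

```python
from math import exp, lgamma, log

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    return exp((a - 1) * log(x) + (b - 1) * log(1 - x) - log_beta_fn(a, b))

def calibration_params(a_pos, b_pos, a_neg, b_neg, prior):
    """Map the two Beta likelihoods and the prior to calibration parameters."""
    a = a_pos - a_neg
    b = b_neg - b_pos
    c = prior / (1.0 - prior)
    d = exp(log_beta_fn(a_neg, b_neg) - log_beta_fn(a_pos, b_pos))
    return a, b, c, d

def calibrate(alpha, a, b, c, d):
    """Posterior probability of the formula holding, given a raw attention value."""
    num = c * d * alpha ** a
    return num / (num + (1.0 - alpha) ** b)

# Check the closed form against Bayes' rule on hypothetical parameters.
alpha, prior = 0.6, 0.3
a_pos, b_pos, a_neg, b_neg = 5.0, 2.0, 2.0, 3.0
pos = prior * beta_pdf(alpha, a_pos, b_pos)
neg = (1.0 - prior) * beta_pdf(alpha, a_neg, b_neg)
direct = pos / (pos + neg)
closed = calibrate(alpha, *calibration_params(a_pos, b_pos, a_neg, b_neg, prior))
print(direct, closed)
```

In this sketch, equal exponents above 1 (for example `a = b = 3` with `c = d = 1`) push attention values toward 0 or 1, which corresponds to the sharpening behavior described above, while values near zero stay near zero regardless of moderate changes to `c` and `d`.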
It is important to note that even though the calibrator tries to pick up informative signals from the language priors, it does not simply replace the visual attention values by them. Instead, it modulates the visual attention via the language priors. So for example, if the attention values upon seeing "broken wooden chair" are close to zero for an image (indicating that the phrase cannot really be grounded in that image), then the calibration function will not raise the attention values significantly as shown in Figure 2 (Right), even though the calibrator has learned to "excite" visual attentions for that phrase. This soft thresholding behavior of is entirely learned from data. Finally, we note that modulating the visual attentions via the question context is only one way of filling in the gaps in perception. Other informative signals such as the visual context and the feature-level, cross-modal interaction of language and vision can be exploited to improve the accuracy of -FOL even further.
6 Experiments
In this section, we experimentally demonstrate how our framework can be used to evaluate the visual and the reasoning aspects of VQA in a decoupled manner. To this end, we have performed experiments using our framework and candidate VQA models on the GQA dataset.
The visual oracle: For the experiments in this section, we have chosen a feed-forward architecture with hidden layers and an output embedding layer that covers all the concepts in the GQA vocabulary. The weights of the embedding layer are initialized using GloVe (pennington2014glove).
The visual featurization: We use the standard Faster-RCNN object featurization that is released with the GQA dataset. The feature vectors are further augmented by the bounding box positional features for each detected object. For binary relations, we simply concatenate the feature vectors of the two objects involved after a linear projection. For the sake of meaningful comparison in this section, we have made sure all the participating models use the same Faster-RCNN object featurization.
Training setup: For training all -FOL models, we have used the Adam optimizer with learning rate and weight decay . The dropout ratio is set to . We have also applied gradient clipping with norm. For better convergence, we have implemented a curriculum training scheme where we start the training with short programs and over time add longer programs to the training data.
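The curriculum idea can be sketched as a generator that admits progressively longer programs into the training pool. This is only an illustration: the length thresholds, the example records, and the operator name "Exists" are hypothetical, and the text does not specify the actual schedule.

```python
def curriculum_stages(examples, max_lengths):
    """Yield one training subset per stage, admitting longer programs each time."""
    for limit in max_lengths:
        yield [ex for ex in examples if len(ex["program"]) <= limit]

# Hypothetical training records: a question id and its functional program.
examples = [
    {"question": "q1", "program": ["Filter", "Exists"]},
    {"question": "q2", "program": ["Filter", "Relate", "Filter", "Exists"]},
    {"question": "q3", "program": ["Filter", "Relate", "Neg", "Filter", "Relate", "Exists"]},
]

stage_sizes = [len(stage) for stage in curriculum_stages(examples, [2, 4, 6])]
print(stage_sizes)  # [1, 2, 3]
```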
Evaluation metrics: In addition to accuracy, we have also computed the consistency metric as defined by the GQA Challenge (hudson2019gqa).
6.1 How Informative is the GQA Visual Featurization?
Using the settings above, we have trained the Base model . Table 1 shows the accuracy and the consistency of this model evaluated on the (balanced) Test-Dev split. Since we wish to use the Base model to isolate only the visual informativeness of the data, we have used the golden programs (provided in GQA) for calculating the metrics for this experiment. Based on these results, the Faster-RCNN featurization is informative enough on its own to produce correct answers for 51.86% of the instances in the set without requiring any extra reasoning capabilities beyond FOL, whereas for the remaining 48.14% of the questions, the visual signal in the featurization is not informative enough to accomplish the GQA task. Another interesting data point is that for about 65% of the binary questions, the visual features are informative enough for question answering without any sophisticated reasoning model in place, which in turn can explain why many early classifier-based models for VQA work reasonably well on binary questions.
6.2 Evaluating the Reasoning Capabilities of Models
The Base model , from the previous section, can be further used to divide the test data into the hard and easy sets as illustrated in Section 4 (i.e. and ). In this section, we use these datasets to measure the reasoning power of candidate VQA models by calculating the metrics and as well as the consistency for each model. See Appendix for examples of challenging instances from and deceptively simple instances from .
For the comparison, we have picked two well-known representatives from the literature for which the code and checkpoints have been open-sourced. The first is the MAC network (hudson2018compositional), which belongs to the family of multi-hop, compositional neuro-symbolic models (hudson2019learning; andreas2016neural; vedantam2019probabilistic). The second is the LXMERT network (tan2019lxmert), which belongs to the family of Transformer-based vision-language models (lu2019vilbert; li2019unicoder). Both models consume the Faster-RCNN object featurization as their visual input and have been trained on GQA.
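Table 1: Accuracy and consistency of the Base model on the balanced Test-Dev split, computed with the golden programs.
|Split|Accuracy|Consistency|
|---|---|---|
|Open|42.73 %|88.74 %|
|Binary|65.08 %|86.65 %|
|All|51.86 %|88.35 %|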
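Table 2: Results of the two candidate models on the balanced Test-Dev split and on its hard and easy subsets. Each group reports a metric followed by the consistency; the Easy Test-Dev metric is an error rate.
|Model|Split|Test-Dev Accuracy|Test-Dev Consistency|Hard Test-Dev Accuracy|Hard Test-Dev Consistency|Easy Test-Dev Error|Easy Test-Dev Consistency|
|---|---|---|---|---|---|---|---|
|MAC|Open|41.66 %|82.28 %|18.12 %|74.87 %|26.70 %|84.54 %|
|MAC|Binary|71.70 %|70.69 %|58.77 %|66.51 %|21.36 %|75.37 %|
|MAC|All|55.37 %|79.13 %|30.54 %|71.04 %|23.70 %|82.83 %|
|LXMERT|Open|47.02 %|86.93 %|25.27 %|85.21 %|22.92 %|87.75 %|
|LXMERT|Binary|77.63 %|77.48 %|63.02 %|73.58 %|13.93 %|81.63 %|
|LXMERT|All|61.07 %|84.48 %|38.43 %|81.05 %|17.87 %|86.52 %|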
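Table 3: Results of -FOL and calibrated -FOL on the balanced Test-Dev split and on its hard and easy subsets. Each group reports a metric followed by the consistency; the Easy Test-Dev metric is an error rate.
|Model|Split|Test-Dev Accuracy|Test-Dev Consistency|Hard Test-Dev Accuracy|Hard Test-Dev Consistency|Easy Test-Dev Error|Easy Test-Dev Consistency|
|---|---|---|---|---|---|---|---|
|-FOL|Open|41.22 %|87.63 %|0.53 %|11.46 %|2.53 %|90.70 %|
|-FOL|Binary|64.65 %|85.54 %|4.42 %|61.11 %|2.21 %|86.33 %|
|-FOL|All|51.45 %|87.22 %|1.81 %|19.44 %|2.39 %|89.90 %|
|Calibrated -FOL|Open|41.22 %|86.37 %|0.53 %|11.46 %|2.53 %|89.45 %|
|Calibrated -FOL|Binary|71.99 %|79.28 %|37.82 %|70.90 %|9.20 %|84.45 %|
|Calibrated -FOL|All|54.76 %|84.48 %|12.91 %|57.72 %|6.32 %|88.51 %|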
Table 2 reports the statistics obtained by evaluating the two candidate models on the balanced Test-Dev split and on its hard and easy subsets according to the Base model. From these results, it is clear that LXMERT is significantly superior to MAC on the original balanced Test-Dev set. Moreover, comparing the hard-set accuracy values of the two models shows that the reasoning capability of LXMERT is significantly more effective than that of MAC when it comes to visually vague examples. This can be attributed to the fact that LXMERT, like many other models of its family, is massively pre-trained on large volumes of vision-language bi-modal data before it is fine-tuned for the GQA task. This pre-trained knowledge comes to the aid of the reasoning process when there are holes in the visual perception.
Another interesting observation is the comparison between the accuracy gap (i.e. ) and the consistency gap between the hard and easy subsets for each model/split row in the table. While the accuracy gap is quite large between the two subsets (as expected), the consistency gap is much smaller (yet significant) in comparison. This shows that the notion of visual hardness (or easiness) captured by the Base model partitioning is in fact consistent; in other words, even when VQA models struggle in the face of visually-hard examples in the hard set, their struggle is consistent across all logically-related questions (high hard consistency value in the table), which indicates that the captured notion of visual hardness is indeed meaningful. Furthermore, one may notice the smaller consistency gap of LXMERT compared to that of the MAC network, suggesting the consistent behavior of MAC is more sensitive to the hardness level of perception compared to that of LXMERT.
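The two gaps can be worked out directly from the "All" rows of Table 2. The sketch below is illustrative only: it assumes that the Easy Test-Dev column reports an error rate, so that the easy-set accuracy is 100 minus that value, and it takes each gap to be the easy-set value minus the hard-set value. The function name `gaps` is hypothetical.

```python
# "All" rows of Table 2, in percent:
# (hard accuracy, hard consistency, easy error, easy consistency)
TABLE2_ALL = {
    "MAC": (30.54, 71.04, 23.70, 82.83),
    "LXMERT": (38.43, 81.05, 17.87, 86.52),
}


def gaps(hard_acc, hard_cons, easy_err, easy_cons):
    """Return (accuracy gap, consistency gap) in percentage points."""
    easy_acc = 100.0 - easy_err  # assumption: the easy column is an error rate
    return round(easy_acc - hard_acc, 2), round(easy_cons - hard_cons, 2)


results = {model: gaps(*row) for model, row in TABLE2_ALL.items()}
for model, (acc_gap, cons_gap) in results.items():
    print(f"{model}: accuracy gap {acc_gap:.2f}, consistency gap {cons_gap:.2f}")
```

Under this reading, the accuracy gaps are roughly 45.8 points for MAC and 43.7 points for LXMERT, whereas the consistency gaps are roughly 11.8 and 5.5 points respectively, which matches the qualitative observation above.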
6.3 The Effect of Top-Down Contextual Calibration
Table 3 shows the result of applying the calibration technique from Section 5. Since we are using -FOL as an actual VQA model in this experiment, we have trained a simple sequence-to-sequence semantic parser to convert the natural language questions in the test set to programs. As shown in Table 3, the top-down calibration significantly improves the accuracy over -FOL. This improvement is even more significant when we look at the results on the hard set, confirming that exploiting even the simplest form of bi-modal interaction (in this case, the program context interacting with the visual attentions) can significantly improve the performance of reasoning in the face of imperfect perception. Nevertheless, this gain comes at a cost. Firstly, the consistency of the model over the entire set degrades. This is, however, to be expected; after all, we are moving from pure logical reasoning to something that is not always “logical”. Secondly, by looking at the error values on the easy set, we observe that the calibrated model starts making significant mistakes on cases that are actually visually informative. This reveals one of the important pitfalls that VQA models might fall into once they start deviating from objective logical reasoning to attain better overall accuracy.
The neuro-symbolic -FOL framework, based on the differentiable first-order logic defined over the VQA task, allows us to isolate and assess the reasoning capabilities of VQA models. Specifically, it identifies questions from the GQA dataset where the contemporary Faster-RCNN perception pipeline by itself produces imperfect representations that do not contain enough information to answer the question via straightforward sequential processing. Studying these questions, on the one hand, motivates efforts to improve visual perception and, on the other hand, provides insights into the reasoning capabilities of state-of-the-art VQA models in the face of imperfect perception, as well as the sensitivity of their consistent behavior to it. Furthermore, the accuracy and consistency on “visually imperfect” instances are a more accurate assessment of a model’s VR ability than dataset performance alone. In conclusion, we believe that the methodology of vision-reasoning disentanglement, realized in -FOL, provides an excellent tool to measure progress toward VR, and some form of it should ideally be adopted by VR leaderboards.
We would like to thank Pengchuan Zhang for insightful discussions and Drew Hudson for helpful input during her visit at Microsoft Research. We also thank anonymous reviewers for their invaluable feedback.
Appendix A: Proofs
Lemma 3.1: Let be the left most variable appearing in formula , then depending on the quantifier of , we will have:
Note that the key underlying assumption in deriving the above proofs is that the binary logical statements for all objects are independent random variables given the visual featurization of the scene, which is a viable assumption. ∎
Lemma 3.2: ∎
If where is a unary predicate:
If where is the set of all binary predicates defined on variables and in and is the left most variable in with quantifier :
Note that the key underlying assumption in deriving the above proofs is that all the unary and binary predicates and for all objects and are independent binary random variables given the visual featurization of the scene, which is a viable assumption. ∎
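To make the role of the independence assumption concrete, the sketch below shows the general probabilistic fact it enables: when per-object statements are independent, the likelihood of a universally or existentially quantified statement factors into a product over objects. This is an illustration of that standard identity under the stated assumption, not a reconstruction of the exact lemma statements, and the function names are hypothetical.

```python
import math
from typing import Sequence


def likelihood_forall(alpha: Sequence[float]) -> float:
    """P(statement holds for every object) under independence."""
    return math.prod(alpha)


def likelihood_exists(alpha: Sequence[float]) -> float:
    """P(statement holds for at least one object) under independence."""
    return 1.0 - math.prod(1.0 - a for a in alpha)


def likelihood_not_exists(alpha: Sequence[float]) -> float:
    """P(statement holds for no object) under independence."""
    return math.prod(1.0 - a for a in alpha)


# Per-object likelihoods that a predicate holds, for a scene with two objects.
alpha = [0.9, 0.5]
p_all = likelihood_forall(alpha)
p_any = likelihood_exists(alpha)
p_none = likelihood_not_exists(alpha)
```

Note that `p_any` and `p_none` always sum to one, as expected for complementary events.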
|GQA OP||T||Equivalent FOL Description||Equivalent DFOL Program|
Appendix B: The Language System
Our language system defines the pipeline that translates natural-language (NL) questions all the way to the DFOL language, which we can then run to find the answer to the question. However, as opposed to many similar frameworks in the literature, our translation process takes place in two steps. First, we parse the NL question into the task-dependent, high-level, domain-specific language (DSL) of the target task. We then compile the resulting DSL program into the task-independent, low-level DFOL language. This separation is important because the -FOL core reasoning engine executes the four task-independent basic operators of the DFOL language (i.e. Filter, Relate, Neg and the aggregation operator) and not the task-specific DSL operators. This distinguishes -FOL from similar frameworks in the literature as a general-purpose formalism; that is, -FOL can cover any reasoning task that is representable via first-order logic, and not just a specific DSL. This is mainly due to the fact that DFOL programs are equivalent to FOL formulas (up to reordering) as shown in Section 3.3. Figure 3 shows the proposed language system along with its different levels of abstraction.
For the GQA task, we train a neural semantic parser using the annotated programs in the dataset to accomplish the first step of translation. For the second step, we simply use a compiler, which converts each high-level GQA operator into a composition of DFOL basic operators. Table 4 shows this (fixed) conversion along with the equivalent FOL formula for each GQA operator.
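The second translation step can be pictured as a rule-based expansion of each DSL operator into a short sequence of DFOL operators. The sketch below is purely illustrative: the authoritative conversion is the one given in Table 4, and the rules in `RULES`, the operator names `GRelate`, `GExist` and `GVerifyAttr`, and the label `AggregateExists` are hypothetical placeholders chosen for this example.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # (operator name, natural-language argument)

# Hypothetical expansion rules; the actual mapping is defined in Table 4.
RULES: Dict[str, List[str]] = {
    "GSelect": ["Filter"],
    "GFilter": ["Filter"],
    "GRelate": ["Relate"],
    "GExist": ["AggregateExists"],
    "GVerifyAttr": ["Filter", "AggregateExists"],
}


def compile_program(dsl_program: List[Op]) -> List[Op]:
    """Expand every DSL operator into its sequence of DFOL operators."""
    compiled: List[Op] = []
    for name, argument in dsl_program:
        if name not in RULES:
            raise KeyError(f"no compilation rule for DSL operator {name!r}")
        for dfol_name in RULES[name]:
            # In this sketch, aggregation operators carry no NL argument.
            is_aggregation = dfol_name.startswith("Aggregate")
            compiled.append((dfol_name, "" if is_aggregation else argument))
    return compiled


# A toy DSL program for a question such as "Is the cat black?".
dsl_program = [("GSelect", "cat"), ("GVerifyAttr", "black")]
dfol_program = compile_program(dsl_program)
```

Because the expansion is a fixed lookup, only the semantic parser of the first step needs to be trained; supporting a new task would amount to supplying a new rule table for its DSL.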
Most operators in the GQA DSL are parameterized by a set of NL tokens that specify the arguments of the operation (e.g. "attr" in GFilter specifies the attribute that the operator is expected to filter the objects based upon). In addition to the NL arguments, both terminal and non-terminal operators take as input the attention vector(s) on the objects present in the scene (except for GSelect which does not take any input attention vector). However, in terms of their outputs, terminal and non-terminal operators are fundamentally different. A terminal operator produces a scalar likelihood or a list of scalar likelihoods (for "query" type operators). Because they are "terminal", terminal operators have logical quantifiers in their FOL description; this, in turn, prompts the aggregation operator in their equivalent DFOL translation. Non-terminal operators, on the other hand, produce attention vectors on the objects in the scene without calculating the aggregated likelihood.
Appendix C: Some Examples from the Hard and the Easy Sets
In this appendix, we visually demonstrate a few examples from the hard and the easy subsets of the GQA Test-Dev split. Figures 4, 5 and 6 show a few examples from the hard set with their corresponding questions, while Figures 7 and 8 show a few examples from the easy set. In these examples, the green rectangles represent where in the image the model is attending according to the attention vector . Here the formula represents either the entire question, for the easy set examples, or the partial question up to the point where the visual system failed to produce correct likelihoods, for the hard set examples. We have included the exact nature of the visual system’s failure for the hard set examples in the captions. As discussed in the paper, the visually hard-easy division here is with respect to the original Faster-RCNN featurization. This means that the "hard" examples presented here are not necessarily impossible in general, but are hard with respect to this specific featurization.
Furthermore, Figure 9 shows two examples from the hard set for which taking the context of the question into consideration via the calibration process helped to overcome the imperfection of the visual system and find the correct answer. Please refer to the caption for the details.