probnmn-clevr
Code for ICML 2019 paper "Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering" [long-oral]
We propose a new class of probabilistic neural-symbolic models that have symbolic functional programs as a latent, stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. First, the programs generated by our model are more understandable while requiring fewer teaching examples. Second, we show that one can pose counterfactual scenarios to the model, to probe its beliefs about the programs that could lead to a specified answer given an image. Our results on the CLEVR and SHAPES datasets verify our hypotheses, showing that the model achieves better program (and answer) prediction accuracy even in the low-data regime, and allows one to probe the coherence and consistency of its reasoning.
Building flexible learning and reasoning machines is a central challenge in Artificial Intelligence (AI). Deep representation learning
(LeCun et al., 2015) provides us with powerful, flexible function approximations that have resulted in state-of-the-art performance across multiple AI tasks such as recognition (Krizhevsky et al., 2012; He et al., 2015), machine translation (Sutskever et al., 2014), visual question answering (Agrawal et al., 2015), speech modeling (van den Oord et al., 2016), and reinforcement learning
(Mnih et al., 2015). However, many aspects of human cognition, such as systematic compositional generalization (e.g., understanding that “John loves Mary” could imply that “Mary loves John”) (Lake et al., 2017; Lake & Baroni, 2017), have proved harder to model. Symbol manipulation (Newell & Simon, 1976), on the other hand, lacks flexible learning capabilities but supports strong generalization and systematicity (Lake & Baroni, 2017). Consequently, many works have focused on building neural-symbolic models with the aim of combining the best of representation learning and symbolic reasoning (Valiant, 2003; Yi et al., 2018; Yin et al., 2018; Evans et al., 2018; Bader & Hitzler, 2005).
As we tackle more complex, higher-level tasks that involve reasoning with machine learning models (Andreas et al., 2016a, b; Weston et al., 2015), a natural desire is to provide instructions or guidance as a scaffolding. In such a context, symbols, by their very nature, are easier to specify than the parameters of a neural network. Thus, a promising direction for such interpretable reasoning systems is to specify the plan for the computations to be executed symbolically, and to learn to perform them using representation learning.
This neural-symbolic methodology has been extensively used to model and test reasoning capabilities in visual question answering (VQA) (Andreas et al., 2016b; Johnson et al., 2017; Hu et al., 2017; Yi et al., 2018; Mao et al., 2018; Mascharka et al., 2018) and, to some extent, in reinforcement learning (Andreas et al., 2016a; Das et al., 2018). Concretely, in the VQA task one is given an image i and a question x (“Is a square to the left of a green shape?”), for which we would like to provide an answer a (yes). In addition, one may also optionally be provided a program z for the question that specifies a reasoning plan. For example, one might ask a model to apply the find[green] operator, then transform[left], then And the result together with a find[square] operator in order to predict the answer, with the idea that the ‘neural’ part then kicks in to actually learn to execute these operations (Figure 1).
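As a toy illustration of such a plan (the grid world and hand-coded operator implementations below are invented for exposition and are not the paper's learned neural modules), the program for “Is a square to the left of a green shape?” can be executed as:

```python
# Toy, hand-coded stand-ins for the find / transform / And modules.
# The 3x3 "image" of shape-color cells is an illustrative assumption.
IMAGE = [
    ["red_circle",   "green_square", "blue_square"],
    ["green_circle", "red_square",   "green_square"],
    ["blue_circle",  "red_circle",   "green_circle"],
]

def find(attr):
    """Return a binary attention map over cells matching `attr`."""
    return [[int(attr in cell) for cell in row] for row in IMAGE]

def transform_left(att):
    """Mark cells that lie immediately to the left of attended cells."""
    return [[row[c + 1] if c + 1 < len(row) else 0 for c in range(len(row))]
            for row in att]

def and_(a, b):
    """Element-wise intersection of two attention maps."""
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def answer(att):
    return "yes" if any(any(row) for row in att) else "no"

# "Is a square to the left of a green shape?"
result = answer(and_(find("square"), transform_left(find("green"))))
```

In the actual model, each operator is a small neural network acting on soft attention maps rather than a hand-written rule; the composition structure, however, is the same.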
The scope of the current work is to provide a probabilistic formulation for such neural-symbolic models. We show that with such a formulation, one should expect the model to satisfy some natural desiderata for interpretable reasoning models. First, given a limited number of teaching examples of plans/programs for a given question/context, we show that one can better capture the association between questions and programs, providing more understandable and legible program explanations for novel questions. Inspired by Dragan et al. (2013), we call this notion data-efficient legibility, since the model’s programs (actions) in this case need to be legible, i.e., clearly convey the question (goal specification) the model has been given.
Secondly, the formulation makes it possible to probe deeper into the reasoning capabilities of such VQA models. Specifically, one can test whether the reasoning done by the system is 1) coherent: programs which lead to similar answers are consistent with each other and with the syntax; and 2) sensitive: a change in the answer should lead to a meaningful change in the underlying reasoning process.
Our probabilistic formulation addresses these desiderata in a principled manner, by modeling functional programs as a stochastic latent variable. This allows us to share statistics meaningfully between questions with associated programs and those without corresponding programs. This sharing aids data-efficient legibility. Secondly, probabilistic modeling brings to bear a rich set of inferential techniques to answer conditional queries on the beliefs of the model. After fitting a model for VQA, one can sample programs z conditioned on an answer and an image, to probe whether the reasoning done by the model is coherent (i.e., multiple programs leading to the answer yes are coherent) and sensitive to a different answer (say, no).
Given an image, we propose to model the conditional distribution p(x, a | i) (see Figure 1), where the model factorizes as p(x, a, z | i) = p(z) p(x | z) p(a | z, i). The implied generative process is as follows. First we sample a program z, which generates the question x. Further, given the program z and an image i we generate the answer a. Note that based on the symbolic program z, we dynamically instantiate the parameters η_z of a neural network (these are deterministic, given z), by composing smaller neural modules for each symbol in the program. This is similar to prior work on neural-symbolic VQA (Hu et al., 2017; Johnson et al., 2017)
using neural module networks (NMNs). In comparison to prior works, our probabilistic formulation (shorthand Prob-NMN) leads to better semi-supervised learning and reasoning capabilities.¹

¹ Note that this model assumes independence of programs from images, which corresponds to the weak sampling assumptions in concept learning (Tenenbaum, 1999). One can handle question premise, i.e., that people might ask a specific set of questions for an image, in such a model by reparameterizing the answer variable to include a relevance label.

Our technical contribution is in showing how to formulate semi-supervised learning with this deep generative neural-symbolic model using variational inference (Jordan et al., 1999). First, we derive variational lower bounds on the evidence for the model for the semi-supervised and supervised cases, and show how this motivates the semi-supervised objectives used in previous work with discrete structured latent spaces (Yin et al., 2018; Miao & Blunsom, 2016). Next, we show how to learn program execution, i.e., p(a | z, i), jointly with the remaining terms in the model by devising a stage-wise optimization algorithm.
Contributions. First, we provide tractable algorithms for training models with probabilistic latent programs that also learn to execute them in an end-to-end manner. Second, we take first steps towards deriving useful semi-supervised learning objectives for this class of structured sequential latent variable models (Miao & Blunsom, 2016; Yin et al., 2018). Third, our approach enables interpretable reasoning systems that are more legible with less supervision, and that expose their reasoning process for tests of coherence and sensitivity. Fourth, our system offers improvements on the CLEVR (Johnson et al., 2017) as well as SHAPES (Andreas et al., 2016b) datasets in the low question-program supervision regime. That is, our model answers questions more accurately, and with more legible (understandable) programs, than adaptations of prior work to our setting as well as baseline non-probabilistic variants of our model.
We first explain the model and its parameterization, then detail the training objectives along with a discussion of a stage-wise training procedure.
Let i be an input image and x a question comprised of a sequence of words (x_1, …, x_T), where each x_t ∈ V_x, the vocabulary comprising all question words; similarly, let a ∈ V_a be the answer (where V_a is the answer vocabulary), and z the prefix serialization of a program, i.e., a sequence of program tokens (z_1, …, z_P).
The model we describe below assumes z is a latent variable that is observed only for a subset of datapoints. We choose z as a latent variable since this should help share statistics across different observations meaningfully, leading to better legibility. Secondly, testing the coherence and sensitivity of reasoning skills can be cast as the problem of inferring the latent variable z, by either looking at multiple samples for the same answer (coherence) or comparing samples conditioned on different answers (sensitivity).
Concretely, the programs z express (prefix serializations of) tree-structured graphs. Computations are performed given this symbolic representation by instantiating, for each symbol, a corresponding neural network with its own parameters (Figure 1, right). That is, given a symbol in the program, say find[green], the model instantiates parameters η_find[green]. In this manner, the mapping from program to answer is operationalized as a = f(i; η_z), where η_z denotes the composition of the module parameters for the tokens in z.
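Recovering the tree structure from a prefix serialization only requires knowing each token's arity; a minimal sketch (the arity table below is a made-up example, not the CLEVR operator set):

```python
# Arity of each program token: how many sub-programs it consumes.
ARITY = {"and": 2, "transform[left]": 1, "find[green]": 0, "find[square]": 0}

def parse_prefix(tokens):
    """Consume tokens left-to-right, returning (subtree, remaining_tokens).
    Each subtree is a (token, children) pair."""
    head, rest = tokens[0], tokens[1:]
    children = []
    for _ in range(ARITY[head]):
        child, rest = parse_prefix(rest)
        children.append(child)
    return (head, children), rest

tree, leftover = parse_prefix(
    ["and", "find[square]", "transform[left]", "find[green]"])
assert leftover == []
```

A neural module network would then instantiate one module per tree node and wire child outputs to parent inputs, bottom-up.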
As explained in Section 1, our graphical model factorizes as p(x, a, z | i) = p(z) p(x | z) p(a | z, i) (see Figure 1 for the model in plate notation). The parameters η_z are a deterministic node in the graph, instantiated given the program z and η (which is independent of a particular program z, and represents the concatenation of the module parameters across all tokens in the program vocabulary). In addition to the generative model, we also parameterize an inference network q_φ(z | x) to map questions to latent structured programs. Overall, the generative parameters in the model are θ (including the module parameters η), while the inference parameters are φ.
We parameterize the terms in the above graphical model using neural networks. Firstly, the prior over programs p(z) is an LSTM (Hochreiter & Schmidhuber, 1997) sequence model pretrained using maximum likelihood on programs simulated from the program syntax (unless specified otherwise). The prior is learned first and kept fixed for the rest of the training process. Next, p_θ(x | z) is an LSTM sequence-to-sequence model (Sutskever et al., 2014), parameterized by θ, that maps programs to questions. Likewise, q_φ(z | x) is an LSTM sequence-to-sequence model that maps questions to programs. All the sequence models factorize autoregressively, e.g., q_φ(z | x) = ∏_t q_φ(z_t | z_{<t}, x). Finally, the parameters η for symbols parameterize small, deep convolutional neural networks which optionally take as input an attention map over the image (see Appendix for more details).
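The autoregressive factorization shared by these sequence models amounts to summing per-step categorical log-probabilities; a minimal sketch (the two-step distributions below are made-up numbers, standing in for the conditionals an LSTM would produce):

```python
import math

def sequence_log_prob(tokens, step_dists):
    """log-probability of a token sequence under per-step categorical
    distributions. step_dists[t] maps token -> probability at step t
    (in a real model, conditioned on the prefix and the input)."""
    return sum(math.log(step_dists[t][tok]) for t, tok in enumerate(tokens))

dists = [{"find[green]": 0.7, "find[square]": 0.3},   # step 1
         {"<eos>": 0.9, "find[green]": 0.1}]          # step 2
lp = sequence_log_prob(["find[green]", "<eos>"], dists)
# lp == log(0.7) + log(0.9)
```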
We assume access to a dataset D = {(x_n, i_n, a_n)}, where n ∈ N indexes the visual question answering dataset, with questions, corresponding images, and answers, while m ∈ M indexes a teaching dataset {(x_m, z_m)}, which provides the corresponding program for a question, explaining the steps required to solve it. For data-efficient legibility, we are interested in the setting where |M| ≪ |N|, i.e., we have few annotated examples of programs, which might be more expensive to specify.
Given this, learning in our model consists of estimating the parameters θ, φ, and η. We do this in a stage-wise fashion (shown below): we first initialize the parameters of the different parts of the model, and then perform joint training. In Section 2.3 we discuss why stage-wise training is beneficial.

Stage-wise optimization.
1) Question Coding: Optimizing the parameters θ and φ to learn a good code for questions in the latent program space.
2) Module Training: Optimizing the parameters η for learning to execute symbols using neural modules.
3) Joint Training: Learning all the parameters of the model jointly.
We describe each of the stages in detail below, and refer the reader to the Appendix for more details.
Question coding. In question coding, we operationalize our intuition for data-efficient legibility of the reasoning process given a question. We do this by fitting the model on the marginal evidence (after marginalizing answers a): Σ_{m∈M} log p(x_m, z_m) + Σ_{n∈U} log p(x_n), where U is the set of questions without any annotated programs.
The term log p(x_n) is straightforward to handle, by constructing a variational lower bound on the evidence and fitting an amortized inference network q_φ(z | x) (c.f. Kingma & Welling, 2013; Rezende et al., 2014):

log p(x) ≥ E_{q_φ(z|x)} [log p_θ(x | z)] − KL(q_φ(z | x) || p(z))    (1)

where the lower bound holds for any choice of q_φ.
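This bound can be checked numerically on a toy model with a fully enumerable discrete latent (all probabilities below are invented for illustration):

```python
import math

p_z = {"z1": 0.6, "z2": 0.4}          # prior p(z)
p_x_given_z = {"z1": 0.9, "z2": 0.2}  # likelihood p(x|z) for one fixed x

def elbo(q):
    """E_q[log p(x|z)] - KL(q || p(z)) for an enumerable q(z|x)."""
    return sum(q[z] * (math.log(p_x_given_z[z]) + math.log(p_z[z])
                       - math.log(q[z]))
               for z in q if q[z] > 0)

log_evidence = math.log(sum(p_z[z] * p_x_given_z[z] for z in p_z))
posterior = {z: p_z[z] * p_x_given_z[z] / math.exp(log_evidence) for z in p_z}

assert elbo({"z1": 0.5, "z2": 0.5}) <= log_evidence  # bound holds for any q
assert abs(elbo(posterior) - log_evidence) < 1e-9    # tight at the posterior
```

In the paper's setting z ranges over sequences, so the sum over z is intractable and the bound is instead estimated with samples from q_φ(z | x).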
This term alone does not capture the semantics of programs, in terms of how they relate to particular questions. For modeling legible programs, we would also like to make use of the labelled data {(x_m, z_m)} to learn associations between questions and programs, and provide legible explanations for novel questions. To do this, one can factorize the model and maximize the joint evidence log p(x, z) = log p(z) + log p(x | z). While in theory, given the joint, it is possible to estimate the posterior p(z | x), this is expensive and requires an (in general) intractable sum over all possible programs. Ideally, one would like to reuse the same variational approximation q_φ(z | x) that we are training on the unlabelled data, so that it learns from both labelled and unlabelled data (c.f. Kingma et al. (2014)). We prove the following lemma and use it to construct an objective that makes use of q_φ(z | x) and relates it to the evidence.
Lemma 2. Given observations x and z, let z_t, the token at the t-th timestep in the sequence, be distributed as a categorical with parameters π_t. Let us denote by Π = {π_t} the joint random variable over all π_t. Then, the following is a lower bound on the joint evidence log p(x, z):

log p(x, z) ≥ E_{q(Π|x)} [log p_θ(x | z)] + log q_φ(z | x) − KL(q(Π | x) || p(Π))    (2)

where p(Π) is a distribution over the sampling distributions implied by the prior p(z), and q(Π | x) is a variational posterior in which each q(π_t | x) is a delta distribution on the probability simplex.
See Appendix for a proof of the result. The result is an extension of the result for a related graphical model (with discrete categorical labels as the supervision) from (Kingma et al., 2014; Keng, 2017) to the case where the latent variables (i.e. the supervision observed) are sequences.
In practice, the bound above is quite loose, as the proof assumes a delta posterior, which makes the last (KL) term −∞. This means we have to resort to learning only the first two terms in Lemma 2 as an approximation:

log p_θ(x | z) + α log q_φ(z | x)    (3)

where α is a scaling on log q_φ(z | x) (which still keeps this a valid lower bound, modulo the −∞ term).
Interestingly, Lemma 2 sheds light on previous objectives proposed for semi-supervised learning with discrete sequence-valued latent variable models.
Connections to other objectives. To our knowledge, two previous works (Yin et al., 2018; Miao & Blunsom, 2016) have formulated semi-supervised learning with discrete (sequential) latent variable models, and they write the supervised question-program term differently. The lemma above provides a clarifying perspective on both objectives: it shows how the supervised term should be written (and suggests an additional term), and that the objective from Yin et al. (2018) is part of a loose lower bound on the evidence, providing some justification for the intuition presented in that work.²

² A promising direction for obtaining tighter bounds could be to change the parameterization of the variational distribution. Overall, learning q_φ(z | x) is challenging in the structured, discrete space of sequences, and a proper treatment of how to train this term in a semi-supervised setting is important for this class of models.
Empirically, we follow Kingma & Welling (2013) and Miao & Blunsom (2016) in up-weighting the log q_φ(z | x) term in the lower bound on log p(x, z) with the scaling factor α. With this, for discrete programs the bound remains intact, and it helps with training (see Appendix for a more detailed explanation). In addition, we follow prior work in using a factor β to scale the contribution of the KL term in the lower bound on log p(x), since this is important for learning meaningful representations when decoding sequences (c.f. Miao & Blunsom (2016); Yin et al. (2018); Bowman et al. (2016); Alemi et al. (2018)). While this violates the lower bound, Alemi et al. (2018) provide theoretical justifications for why this is desirable when decoding sequences, which we discuss in more detail in the Appendix.
We next explain the evidence formulation for the full graphical model; and then introduce the module training and joint training steps.
Module and Joint Training. For the full model (including the answers a), the evidence is Σ_{m∈M} log p(x_m, a_m, z_m | i_m) + Σ_{n∈U} log p(x_n, a_n | i_n). Similar to the previous section, one can derive a variational lower bound (Jordan et al., 1999) on log p(x, a | i) (c.f. Vedantam et al., 2018; Suzuki et al., 2017):

log p(x, a | i) ≥ E_{q_φ(z|x)} [log p_θ(x | z) + log p(a | z, i)] − KL(q_φ(z | x) || p(z))    (4)
Module Training. During module training, we first optimize the model only w.r.t. the parameters responsible for neural execution of the symbolic programs, namely η. Concretely, we maximize:

E_{q_φ(z|x)} [log p(a | z, i)]    (5)
The goal is to find a good initialization of the module parameters, say η_find[green], that binds the execution of a symbol like find[green] to the computations expected for it in the neural module network. Outside of a probabilistic context, a similar training scheme for neural modules has been presented in Johnson et al. (2017).
Joint Training. Having trained the question code and the neural module network parameters, we train all terms jointly, optimizing the complete evidence with the lower bound of Equation 4. We modify the above objective (across all the stages) by adding the scaling factors α, β, and γ for the corresponding terms, and write out the answer term of Equation 4, subsuming it into the expectation:

E_{q_φ(z|x)} [log p_θ(x | z) + γ log p(a | z, i)] − β KL(q_φ(z | x) || p(z))    (6)

where γ is a scaling factor on the answer likelihood, which carries fewer bits of information than the question. For answers a, which have probability mass functions, γ ≥ 1 still gives us a valid lower bound.³ The same values of α and β are used as in question coding (and for the same reasons explained above).

³ Similar scaling factors have been found to be useful in prior work (Vedantam et al., 2018) (Appendix A.3) in terms of shaping the latent space.
The first term, with an expectation over q_φ(z | x), is not differentiable with respect to φ. Thus we use the REINFORCE (Williams, 1992) estimator with a moving-average baseline to get a gradient estimate for φ (see Appendix for more details). We take the gradients (where available) for updating the rest of the parameters.
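A minimal sketch of such a score-function estimator with a moving-average baseline, on a toy one-parameter Bernoulli objective (the reward R(z) = z and the distribution are assumptions for illustration, not the paper's reward):

```python
import math
import random

random.seed(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def reinforce_grad(theta, reward, n_samples=50000, momentum=0.9):
    """Estimate d/dtheta E_{z ~ Bernoulli(sigmoid(theta))}[reward(z)]
    via REINFORCE with a moving-average baseline."""
    p = sigmoid(theta)
    baseline, grad = 0.0, 0.0
    for _ in range(n_samples):
        z = 1 if random.random() < p else 0
        r = reward(z)
        # grad of log p(z; theta) for a Bernoulli with logit theta is (z - p);
        # the baseline only depends on past samples, so the estimate stays unbiased.
        grad += (r - baseline) * (z - p)
        baseline = momentum * baseline + (1 - momentum) * r
    return grad / n_samples

est = reinforce_grad(0.0, reward=lambda z: z)
analytic = sigmoid(0.0) * (1 - sigmoid(0.0))  # d/dtheta E[z] = sigma'(theta) = 0.25
```

The baseline does not change the estimator's expectation, only its variance, which is what makes it usable for the discrete program samples here.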
In this section, we outline the difficulties that arise when we try to optimize Equation 6 directly, without following the three-stage procedure. Let us consider question coding: if we do not do question coding independently of the answer, learning the parameters of the neural module network becomes difficult, since the mapping from program to answer is implemented using the neural modules η_z. This optimization is discrete in the program choices, which hurts when q_φ(z | x) is uncertain (or has not converged). Next, training the joint model without first running module training is possible, but trickier, because the gradient from an untrained neural module network would pass into the inference network, adding noise to the updates. Indeed, we find that inference often deteriorates when trained with REINFORCE on a reward computed from an untrained network (Table 1).
We first explain differences with other discrete structured latent variable models proposed in the literature. We then connect our work to the broader context of research in reasoning and visual question answering (VQA). Next, we discuss interpretability in the context of VQA, followed by an examination of how the present work relates to broader themes in the literature around program induction and neural architecture search.
Discrete Structured Latent Variables.
There has been an emerging trend in building probabilistic, discrete, structured latent variable models. A lot of the approaches are loosely inspired by the variational autoencoder model
(Kingma & Welling, 2013; Rezende et al., 2014) in terms of the graphical structure, except that the observations as well as the latents are either sequences (Miao & Blunsom, 2016) or tree-structured programs (Yin et al., 2018). Our graphical model is richer than those considered by previous works, in the sense that our joint model must also learn to decode a latent program into answers by executing neural modules, while previous works only consider the problem of parsing text into programs (Yin et al., 2018) or generating (textual) summaries using a latent variable (Miao & Blunsom, 2016). Finally, as discussed in Section 2, our derivation of the lower bound on the question-program evidence provides some understanding of the objectives used in these prior works for semi-supervised learning (Yin et al., 2018; Miao & Blunsom, 2016).

Visual Question Answering and Reasoning. A number of approaches have studied visual question answering, motivated by multi-hop reasoning (Yi et al., 2018; Johnson et al., 2017; Hu et al., 2017, 2018; Hudson & Manning, 2018; Santoro et al., 2017; Perez et al., 2017). Some of these works build implicit, non-symbolic inductive biases that support compositional reasoning into the network (Perez et al., 2017; Hudson & Manning, 2018; Hu et al., 2018), while others take a more explicit symbolic approach (Johnson et al., 2017; Yi et al., 2018; Hu et al., 2017; Mascharka et al., 2018). Our high-level goal is centered around providing legible explanations and reasoning traces for programs, and thus we adopt a symbolic approach. Even in the realm of symbols, different approaches utilize different kinds of inductive biases in the mapping from symbols (programs) to answers. While Yi et al. (2018)
favor an approach that represents objects in a scene with a vectorized representation, and compute various operations as manipulations of the vectors, other works take a more modular approach
(Andreas et al., 2016b; Johnson et al., 2017; Hu et al., 2018; Mascharka et al., 2018) where a program instantiates a neural network with corresponding parameters. We study the latter approach since it is arguably more general and could conceivably transfer better to other tasks such as planning and control (Das et al., 2018), lifelong learning (Valkov et al., 2018; Gaunt et al., 2017), etc.

Different from all these prior works, we provide a probabilistic scaffolding that embeds previous neural-symbolic models, which we conceptualize should lead to better data-efficient legibility and the ability to debug coherence and sensitivity in reasoning. We are not aware of any prior work on VQA where it is possible to reason about the coherence or sensitivity of the reasoning performed by the model.
Interpretable VQA.
Given its importance as a scene understanding task, and as a general benchmark for reasoning, there has been a lot of work in trying to interpret VQA systems and explain their decisions
(Das et al., 2016; Lu et al., 2016; Park et al., 2018; Selvaraju et al., 2017). Interpretability approaches typically perform some kind of explicit attention (Bahdanau et al., 2014) over the question or the image (Lu et al., 2016), explaining with a heat map the regions or parts of the question the model used to arrive at an answer. Some other works develop post-hoc attribution techniques (Mudrakarta et al., 2018; Selvaraju et al., 2017) for providing explanations. In this work, we are interested in an orthogonal notion of interpretability: the legibility of the reasoning process used by the network given symbolic instructions for a subset of examples. More similar to our high-level motivation are approaches that take a neural-symbolic route, providing explanations in terms of the programs used for reasoning about a question (Andreas et al., 2016b; Hu et al., 2018), optionally including a spatial attention over the image to localize the function of the modules (Andreas et al., 2016b; Hu et al., 2017; Mascharka et al., 2018). In this work we augment the legibility of the programs/reasoning from these approaches using a probabilistic framework.

Program Induction and Neural Architecture Search. In program induction one is typically interested in learning to write programs given specifications of input-output pairs (Reed & de Freitas, 2015; Kalyan et al., 2018), and optionally language (Neelakantan et al., 2015; Guu et al., 2017) or vision (Gaunt et al., 2017). The key difference between program induction and our problem is that program induction assumes the tokens/instructions in the language are grounded to the executor, i.e., it is assumed that once a valid program is generated there is a black-box execution engine that can execute it, whereas we learn the execution engine from scratch, learning the grounding from program symbols to the parameters of a neural network for execution.
Neural architecture search (Zoph & Le, 2016) is another closely related problem where one is given input output examples for a machine learning task, and the goal is to find a “neural program” that performs well on the task. In contrast, ours can be seen as the problem of inducing programs conditioned on particular inputs (questions), where the number of samples seen per question is significantly lower than the number of samples one would observe for a machine learning task of interest.
Dataset. We report our results on the CLEVR (Johnson et al., 2017) dataset and the SHAPES datasets (Andreas et al., 2016b). The CLEVR dataset has been extensively used as a benchmark for testing reasoning in VQA models in various prior works (Johnson et al., 2017; Hu et al., 2017, 2018; Hudson & Manning, 2018; Santoro et al., 2017; Perez et al., 2017) and is composed of 70,000 images and around 700K questions, answers and functional programs in the training set, and 15,000 images and 150K questions in the validation set. We divide the CLEVR validation set into a val set with 20K questions and associated images, and a test set with 130K questions. The longest questions in the dataset are of length 44 while the longest programs are of length 25. The question vocabulary has 89 tokens while the program vocabulary has 40 tokens, with 28 possible answers.
We investigate our design choices on the smaller SHAPES dataset proposed in previous works (Andreas et al., 2016b; Hu et al., 2017) for visual question answering. The dataset is explicitly designed to test for compositional reasoning, and contains compositionally novel questions that the model must demonstrate generalization to at test time. Overall there are 244 unique questions with yes/no answers and 15,616 images (Andreas et al., 2016b). The dataset also has annotated programs for each of the questions. We use train, val, and test splits of 13,568, 1,024, and 1,024 triplets respectively. The longest questions in the dataset are of length 11 and shortest are of length 4, while the longest programs are of length 6 and shortest programs are of length 4. The size of the question vocabulary is 14 and the program vocabulary is 12.
Training. On SHAPES, to simulate a data-sparse regime, we restrict the set of question-aligned programs to 5, 10, 15, or 20% of unique questions, such that even at the highest level of supervision, programs for 80% of unique questions have never been seen during training. We train our program prior using a set of 1848 unique programs simulated from the syntax (more details in the Appendix).
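Simulating syntactically valid programs from the operator arities can be sketched as follows (the toy grammar below is invented for illustration; the actual SHAPES syntax differs):

```python
import random

random.seed(1)

# Toy operator arities standing in for a real program grammar.
ARITY = {"and": 2, "transform[left]": 1, "transform[up]": 1,
         "find[green]": 0, "find[square]": 0, "find[red]": 0}

def sample_program(max_depth=3):
    """Sample a prefix-serialized program; force a leaf once max depth is hit."""
    if max_depth == 0:
        return [random.choice([t for t, a in ARITY.items() if a == 0])]
    tok = random.choice(list(ARITY))
    prog = [tok]
    for _ in range(ARITY[tok]):
        prog += sample_program(max_depth - 1)
    return prog

def is_valid_prefix(tokens):
    """A prefix serialization is valid iff the arities consume exactly all tokens."""
    need = 1
    for tok in tokens:
        if need == 0 or tok not in ARITY:
            return False
        need += ARITY[tok] - 1
    return need == 0

programs = [sample_program() for _ in range(100)]
assert all(is_valid_prefix(p) for p in programs)
```

The prior p(z) is then fit by maximum likelihood on such simulated sequences, so that it assigns mass only to syntactically well-formed programs.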
In general, we find that performance at a given amount of supervision can be quite variable. In addition, the question coding and module training stages tend to show a fair amount of variance across multiple runs. To make fair comparisons, for every experiment, we run question coding across 5 different runs, pick the best-performing model, and then run module training (updating η) across 10 different runs. We then run the best model from this stage for joint training (sweeping across values of γ). At the end of this process we have the best model for an entire training run. We repeat this process across five random datasets and report the mean and variance at a given level of supervision.

With CLEVR, we report results when we train on 1000 question-program supervision pairs (roughly 0.14% of all question-program pairs), along with the rest of the question, image, and answer triplets (dropping the corresponding programs). Similar to SHAPES, we report our results across 20 different choices of the subset of 1000 questions to estimate statistical significance. In general, the CLEVR dataset has question lengths with very large variance, and we select the subset of 1000 questions with length at most 40. We do this to stabilize training (for both our method and the baseline) and also to simulate a realistic scenario where an end user would not want to annotate very long programs. For each choice of a subset of 1000 questions, we run our entire training pipeline (across the question coding, module training, and joint training stages) and report results on the top 15 runs out of the 20 (for all comparisons reported in the paper), based on question coding accuracy (see metrics below).
Metrics. For question coding, we report the accuracy of the programs predicted by the model (determined with exact string match), since we are interested in the legibility of the generated programs. In module training, we report the VQA accuracy obtained by the model, and finally, in joint training, we report both, VQA and program prediction accuracy since we are interested in both, getting the right answers and the legibility of the model’s reasoning trace.
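The program-prediction metric amounts to exact string match over serialized programs; a minimal sketch (the example predictions below are invented):

```python
def program_accuracy(predicted, ground_truth):
    """Fraction of predicted programs that exactly match the ground truth."""
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)

preds = ["and find[square] transform[left] find[green]",
         "find[green] answer"]
gold  = ["and find[square] transform[left] find[green]",
         "find[red] answer"]
acc = program_accuracy(preds, gold)  # 0.5
```

Note that exact match is a strict criterion: a program that differs in a single token, even a semantically equivalent one, counts as wrong.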
Baselines. We compare against an adaptation of the state-of-the-art semi-supervised learning approach proposed for neural module networks by Johnson et al. (2017). Johnson et al. fit the terms corresponding to q_φ(z | x) and p(a | z, i) in our model. Specifically, in question coding, they optimize Σ_{m∈M} log q_φ(z_m | x_m) (where m indexes data points with associated programs, i.e., the approach does not make use of “unlabelled” data). Next, in module training they optimize the answer likelihood given the predicted programs. In joint training, they optimize the same objective with respect to the module parameters as well as φ, using the REINFORCE gradient estimator. In contrast, we follow Algorithm 1 and maximize the corresponding evidence at every stage of training. We also report other baselines which are either ablations of our model or deterministic variants of our full model. See Appendix for the exact settings of the hyperparameters.
We focus on evaluating approaches by varying the fraction of question-aligned programs in Table 1. To put the numbers in context, a baseline LSTM + image model, which does not use module networks, gets an accuracy of 63.0% on Test (see Andreas et al. (2016b); Table 2). This indicates that SHAPES has highly compositional questions which are challenging to model via conventional methods for Visual Question Answering (Agrawal et al., 2015). Overall, we make the following observations:
Our Prob-NMN approach consistently improves performance in data-sparse regimes. While both methods tend to improve with greater program supervision, Prob-NMN quickly outpaces NMN (Johnson et al., 2017), achieving test VQA accuracies 30-35 points higher at moderate levels of program supervision. Notably, both methods perform similarly poorly on the test set given only 5% program supervision, suggesting this may be too few examples to learn compositional reasoning. Similarly, the program prediction accuracy is also significantly higher for our approach at the end of joint training, meaning that Prob-NMN is right for the right reasons.
Our question coding stage greatly improves initial program prediction. Our Prob-NMN approach to question coding achieves approximately double the question coding (program prediction) accuracy of the NMN approach (col. 1, question coding). This means that it effectively propagates groundings from question-aligned programs during the coding phase. Thus the initial programs produced by our approach are more legible at a lower amount of supervision than NMN's. Consequently, we also see improved VQA performance after the module and joint training stages, which are based on predicted programs.
Successful joint training improves program prediction. In general, we find that the accuracies obtained on program prediction deteriorate when the module training stage is weak (row 1). On the other hand, higher program prediction accuracies generally lead to better module training, which further improves the program prediction performance.
Figure 2 shows sample programs for each model. We see that limited supervision negatively affects NMN program prediction, with the 5% model resorting to simple Find[X]Answer structures. Interestingly, the mistakes made by the Prob-NMN model, e.g., predicting green under 5% supervision (top-right), are also made when reconstructing the question (which likewise substitutes green for blue). Further, when the token does get corrected to blue, the question also eventually gets reconstructed (partially correctly at 10%, then fully at 15%), and the program produces the correct answer. This indicates that there is high fidelity between the learnt question space, the answer space, and the latent space.
Effect of the Program Prior. Next, we explore the impact of regularizing the program posterior to be close to the prior, for different choices of the prior. Firstly, we disable the KL-divergence term by setting β = 0 in Equation 6, recovering a deterministic version of our model that still learns to reconstruct the question given a sampled program. Compared to our full model at 10% supervision, the performance on question coding drops to 23.14 ± 6.1% program prediction accuracy (from 60.18 ± 9.56%). Interestingly, the model performs better at question reconstruction, which improves from 83.60 ± 5.57% to 94.3 ± 1.85%. This seems to indicate that, without the KL term, the model focuses solely on reconstruction and fails to learn compositionality in the latent space, so supervised groundings are poorly propagated to unsupervised questions, as evidenced by the drop in program prediction accuracy. Thus, the probabilistic model helps better achieve our high-level goal of data-efficient legibility in the reasoning process. In terms of VQA accuracy on validation (at the end of joint training), we likewise see a drop in performance.
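To make the role of the β-weighted KL term concrete, here is a minimal sketch (not the paper's actual training code; `question_coding_loss` and its inputs are illustrative names):

```python
def question_coding_loss(recon_nll, kl, beta):
    """Per-example question-coding objective: question reconstruction loss
    plus a beta-weighted KL(q(z|x) || p(z)) regularizer toward the program
    prior. Setting beta = 0 disables the KL term, recovering the
    deterministic ablation that only reconstructs the question."""
    return recon_nll + beta * kl

# Full model regularizes toward the prior; the ablation does not.
full = question_coding_loss(recon_nll=2.0, kl=1.5, beta=0.1)      # 2.15
ablation = question_coding_loss(recon_nll=2.0, kl=1.5, beta=0.0)  # 2.0
```

With β = 0 the objective reduces to pure reconstruction, which matches the observed behavior: better reconstruction but much weaker program prediction.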
Table 1: Results on SHAPES with varying % of question-aligned program supervision. Validation metrics are reported during the three training stages; the final column is test accuracy (mean ± std).

| Model | % Prog. Sup. | [I] Coding: Reconstruction | [I] Coding: Prog. Pred. | [II] Module: VQA Acc. | [III] Joint: Prog. Pred. | [III] Joint: VQA Acc. | Test: VQA Acc. |
|---|---|---|---|---|---|---|---|
| NMN (Johnson et al., 2017) | 5 | - | 9.28 ± 1.91 | 61.56 ± 3.59 | 0.0 ± 0.0 | 63.08 ± 0.78 | 60.06 ± 3.88 |
| Prob-NMN (Ours) | 5 | 62.86 ± 5.31 | 17.12 ± 5.28 | 54.55 ± 11.31 | 28.12 ± 28.12 | 72.50 ± 8.35 | 71.95 ± 11.15 |
| NMN (Johnson et al., 2017) | 10 | - | 24.30 ± 2.39 | 60.31 ± 3.37 | 6.25 ± 10.83 | 66.51 ± 4.10 | 61.99 ± 0.96 |
| Prob-NMN (Ours) | 10 | 83.60 ± 5.57 | 60.18 ± 9.56 | 75.80 ± 3.62 | 90.62 ± 6.98 | 96.86 ± 2.48 | 94.53 ± 2.06 |
| NMN (Johnson et al., 2017) | 15 | - | 47.67 ± 5.02 | 69.47 ± 9.87 | 0.0 ± 0.0 | 62.43 ± 0.49 | 61.32 ± 2.36 |
| Prob-NMN (Ours) | 15 | 95.86 ± 0.20 | 84.85 ± 6.25 | 90.57 ± 3.44 | 95.31 ± 5.18 | 98.40 ± 1.63 | 97.02 ± 0.84 |
| NMN (Johnson et al., 2017) | 20 | - | 58.37 ± 3.30 | 66.17 ± 7.02 | 43.75 ± 43.75 | 80.68 ± 18.00 | 78.59 ± 19.27 |
| Prob-NMN (Ours) | 20 | 96.10 ± 0.27 | 90.22 ± 1.63 | 91.81 ± 1.58 | 96.87 ± 5.41 | 99.43 ± 0.61 | 96.97 ± 1.30 |
Table 2: Results on CLEVR with 0.143% program supervision (1000 aligned question-program pairs). Validation metrics are reported during the three training stages; the final column is test accuracy (mean ± std).

| Model | % Prog. Sup. | [I] Coding: Reconstruction | [I] Coding: Prog. Pred. | [II] Module: VQA Acc. | [III] Joint: Prog. Pred. | [III] Joint: VQA Acc. | Test: VQA Acc. |
|---|---|---|---|---|---|---|---|
| NMN (Johnson et al., 2017) | 0.143 | - | 62.47 ± 9.82 | 79.26 ± 4.03 | 63.08 ± 9.91 | 79.38 ± 4.21 | 79.31 ± 4.26 |
| Prob-NMN (Ours) | 0.143 | - | 93.15 ± 8.61 | 94.42 ± 3.77 | 93.87 ± 8.73 | 95.52 ± 4.15 | 95.39 ± 4.23 |
Effect of optimizing the true ELBO. Next, we empirically validate the intuition that, in the semi-supervised setting, approximate inference learned using Equation (1) for sequence models often leads to solutions that do not make meaningful use of the latent variable. For example, we find that at β = 1 (and 10% supervision), training with the true evidence lower bound deteriorates the question reconstruction accuracy from 83.60% to 8.74%, causing a large distortion of the original question, as predicted by the results of Alemi et al. (2018). Overall, in this setting the final accuracy on VQA at the end of joint training drops by around 20 points.
Finally, the N2NMN approach (Hu et al., 2017) evaluates their question-attention based module networks in the fully unsupervised setting, reaching 96.19% on TEST. However, the programs in this case become non-compositional (see Section 3), as the model leaks information from questions to answers via attention, meaning that programs no longer carry the burden of explaining the observed answers. This makes the modules illegible. In general, our approach makes the right independence assumptions (the answer is independent of the question given the program and the image), which helps legibility emerge, along with our careful design of the three-stage optimization procedure.
In this section we show that the probabilistic formulation can be used to check a model's coherence in reasoning (see Figure 3, top). Given the image and the answer yes, we observe that one is able to generate multiple, diverse reasoning patterns that lead to the answer by sampling programs from the posterior conditioned on the image and answer, showing a kind of systematicity in reasoning (Lake et al., 2017). On the other hand, when we change the answer to no (see Figure 3, bottom), keeping the image the same, we observe that the reasoning pattern changes in a meaningful way, yielding a program that evaluates to the desired answer. See the appendix for a description of how we perform the sampling for these experiments.
In this section, we report results on the CLEVR dataset for compositional visual question answering (Johnson et al., 2017).
Training details. In order to get stable training with sequence-to-sequence models on CLEVR, we found it useful to average the log-probabilities across timesteps rather than summing them. Without this, in the presence of a large variance in sentence lengths, we find that the model focuses on generating the longer sequences, which is usually hard without first understanding the structure of shorter sequences. Averaging across the time dimension helps with this, and is common practice when working with sequence models (Vinyals et al., 2015; Johnson et al., 2017; Hu et al., 2017; Lu et al., 2017). We use a batch size of 256, and a learning rate of 1e-3 during question coding, 1e-4 during module training, and 2e-5 during joint training. We validate every 500 iterations, and halve the learning rate when our primary metric (either program prediction accuracy or VQA accuracy, in the respective stages) does not improve within the next 3 validations. We set α to 100, β to 0.1, and γ to 10. We use the code and module design choices of the state-of-the-art TbD nets (Mascharka et al., 2018) to implement the program-to-answer mapping in our model.
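The timestep-averaged loss can be sketched as follows (a minimal illustration; `sequence_loss` and its inputs are assumed names, not the paper's code):

```python
def sequence_loss(token_logprobs, average=True):
    """Negative log-likelihood of a sequence from per-timestep token
    log-probabilities. Averaging over timesteps (rather than summing)
    keeps the loss scale comparable across short and long sequences,
    which stabilizes training when sentence lengths vary widely."""
    total = -sum(token_logprobs)
    return total / len(token_logprobs) if average else total

short = [-0.5, -0.5]   # a 2-token question
long = [-0.5] * 14     # a 14-token question
# Summed losses differ by 7x; averaged losses are identical,
# so long sequences no longer dominate the gradient.
```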
Results. Our results on the CLEVR dataset (Johnson et al., 2017) reflect similar trends as the results on SHAPES (Table 2). As explained in Section 4, we work with 1000 supervised question-program examples from the CLEVR dataset (0.143% of all question-program pairs). With this, at the end of question coding, Prob-NMN reaches an accuracy of 93.15 ± 8.61, while the baseline NMN approach reaches 62.47 ± 9.82. These gains for the Prob-NMN model are reflected in module training, where the Prob-NMN approach reaches an accuracy of 94.03 ± 3.95, while the baseline is at 79.09 ± 4.13. Finally, at the end of joint training these improve marginally to 95.52 ± 4.15 and 79.38 ± 4.21 respectively. Crucially, this is achieved with a program generation accuracy of 93.87 ± 8.73 by our approach, compared to a baseline accuracy of 63.08 ± 9.91. Thus, the programs generated by the Prob-NMN model are more legible than those by the semi-supervised NMN approach (Johnson et al., 2017).
In this work, we discussed a probabilistic, sequential latent variable model for visual question answering that jointly learns to parse questions into programs, reason about abstract programs, and execute them on images using modular neural networks. We demonstrate that the probabilistic formulation endows the model with desirable properties for interpretable reasoning systems, such as reasoning that is clearly legible given a minimal number of teaching examples, and the ability to probe the reasoning patterns of the model by testing their coherence (how consistent are the reasoning patterns which lead to the same decision?) and sensitivity (how sensitive is the decision to the reasoning pattern?). We test our model on the CLEVR dataset as well as a dataset of compositional questions about SHAPES, and find that handling stochasticity enables better generalization to compositionally novel inputs.
R.V. thanks Google for supporting this work via a PhD Fellowship. This work was supported in part by NSF, AFRL, DARPA, Siemens, Samsung, Google, Amazon, ONR YIPs and ONR Grants N00014-16-1-{2713,2793}. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
Given observations, let the token at each timestep of a sequence be distributed as a categorical with parameters produced by the sequence model, and denote by Π the joint random variable over all the per-timestep sampling distributions. Then, the following is a lower bound on the joint evidence:
(7)
where Π is distributed according to the sampling distributions implied by the prior, each of which is a delta distribution on the probability simplex.
We provide a proof for the lemma stated in the main paper, which explains how to obtain a lower bound on the evidence for the observed data samples.
Let us first focus on the prior on (serialized) programs. This is a sequence model which factorizes autoregressively, with each token conditioned on the tokens before it. We learn point estimates of the parameters of the prior (as in a regular sequence model); however, every sequence has an implicit distribution over its per-timestep sampling distributions. To see this, note that at every timestep (other than the first), the sampling distribution depends upon the previously sampled tokens, which are random variables. Let Π denote the joint distribution of all the sampling distributions across timesteps, and consider the programs realized from it. Since the token at a given timestep is drawn from the sampling distribution at that timestep, it depends on Π only through that timestep's component; that is, each token is conditionally independent of the remaining components of Π given its own sampling distribution.
With this (stochastic) prior parameterization, we can factorize the full graphical model with Π treated as a latent variable. Then, the marginal likelihood for this model is:
(8)
Notice that this equals the evidence for the original graphical model of interest. Next, we write, for a single data instance:
where we multiply and divide by a variational approximation and apply Jensen's inequality (due to the concavity of log) to get the following variational lower bound:
(9)
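As a sketch of this step, with r denoting the variational approximation over Π (notation assumed here for illustration), the bound reads:

```latex
\log p(z, q)
  = \log \int r(\Pi \mid z, q)\,
      \frac{p(z, q \mid \Pi)\, p(\Pi)}{r(\Pi \mid z, q)} \, d\Pi
  \;\ge\; \mathbb{E}_{r(\Pi \mid z, q)}\!\left[
      \log p(z, q \mid \Pi) + \log p(\Pi) - \log r(\Pi \mid z, q)
    \right]
```

This is the standard importance-weighting-plus-Jensen argument applied to the latent Π.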
Next, we explain how to parametrize the prior, which is the key assumption for deriving the result. At each timestep, the prior's sampling distribution is the one inferred by the sequence model, and we place a Dirac delta at this inferred distribution, so that each component of Π is a delta function on the probability simplex.
Further, note that we can write the second term as follows:
(10)
(11)
Each integral in the sum above can be simplified as follows:
Substituting into Equation 10, we get:
This gives us the final bound, written over all the observed data points:
(12)
∎
We optimize the full evidence lower bound using supervised learning for the reconstruction term and REINFORCE for the discrete program samples. When using REINFORCE, the gradient takes the form of an expectation of (R − b) times the score function of the program distribution, where b is a baseline. Notice that the scales of the gradients from the two terms are very different, since the order of the KL term is around the number of bits in a program, making it dominate in the early stages of training.
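The interplay of reward and baseline can be sketched with a toy Bernoulli example (a hedged illustration, not the paper's implementation; the function and its arguments are hypothetical):

```python
import random

def reinforce_grad(p, reward_fn, baseline, n=2000, seed=0):
    """Score-function (REINFORCE) estimate of d/dp E[R(z)] for a
    Bernoulli(p) sample z, with an action-independent baseline b:
    the mean of (R(z) - b) * d/dp log q(z). Subtracting b leaves the
    estimator unbiased while reducing its variance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = 1 if rng.random() < p else 0
        score = (1.0 / p) if z == 1 else (-1.0 / (1.0 - p))  # d/dp log q(z)
        total += (reward_fn(z) - baseline) * score
    return total / n

# True gradient of E[R] = p w.r.t. p is 1. In this toy, setting the
# baseline to E[R] = 0.5 makes every per-sample term exactly 1,
# illustrating how a good baseline collapses the variance.
```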
During question coding, we learn a latent representation of programs in the presence of a question reconstructor. Our reconstructor is a sequence model, which can maximize the likelihood of the reconstructed question without learning a meaningful latent representation of programs. Recent theory from Alemi et al. (2018) prescribes changes to the ELBO, setting β < 1 as a recipe to avoid learning degenerate latent representations in the presence of powerful decoders. Essentially, Alemi et al. identify that, up to a constant factor, the distortion D (the expected negative log-likelihood term) and the rate R (the KL-divergence term) bound the mutual information between the data and the latent variable as follows:
H − D ≤ I(x; z) ≤ R    (13)
where H is the entropy of the data, which is a constant. One can immediately notice that the standard (negative) ELBO corresponds to the sum D + R. This means that for the same value of the ELBO one can get different models which make drastically different use of the latent variable (based on the achieved values of D and R). Thus, one way to achieve a desired behavior (of, say, high mutual information between data and latent variable) is to set β to a value lower than the standard β = 1 (c.f. Eqn. 6 in Alemi et al. (2018)). This aligns with our goal in question coding, and so we use β = 0.1.
We provide more details on the modeling choices and hyperparameters used to optimize the models in the main paper.
Sequence to Sequence Models: All our sequence-to-sequence models described in the main paper (Section 2) are based on LSTM cells with a hidden state of 128 units, a single layer of depth, and a word embedding of 32 dimensions for both the question and the program vocabulary.
When sampling from a model, we sample up to a maximum sequence length of 15 for questions and 7 for programs.
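Length-capped sampling can be sketched as follows (illustrative; `next_token_fn` is a hypothetical stand-in for one decoder step):

```python
def sample_sequence(next_token_fn, end_token="<end>", max_len=15):
    """Sample tokens from a sequence model until the end token is produced
    or the maximum length is reached (e.g., 15 for questions, 7 for
    programs). Prevents unbounded generation from an untrained decoder."""
    seq = []
    while len(seq) < max_len:
        tok = next_token_fn(seq)
        if tok == end_token:
            break
        seq.append(tok)
    return seq
```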
Image CNN: The SHAPES images are 30×30 and are processed by a two-layer convolutional neural network: the first layer applies 10×10 filters at a stride of 10, and the second layer applies a 1×1 convolution at a stride of 1. Channel widths for both layers are 64 dimensional.

Moving average baseline: For a reward R, we use an action-independent baseline b to reduce variance, which tracks the moving average of the rewards seen so far during training. Concretely, the variance-reduced gradient estimate takes the form:

(R − b) ∇ log q(z | x)    (14)

where, given γ as the decay rate for the baseline,

b ← γ b + (1 − γ) R    (15)

is the update on the baseline performed at every step of training.
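The baseline update can be sketched as follows (a minimal illustration; the class name is assumed):

```python
class MovingAverageBaseline:
    """Action-independent REINFORCE baseline that tracks an exponential
    moving average of observed rewards, with decay rate gamma."""
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.value = 0.0

    def update(self, reward):
        # b <- gamma * b + (1 - gamma) * R, applied at every training step.
        self.value = self.gamma * self.value + (1 - self.gamma) * reward
        return self.value

b = MovingAverageBaseline(gamma=0.5)
b.update(1.0)  # 0.5
b.update(1.0)  # 0.75; converges toward the mean reward
```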
In addition, the variational parameters are also updated via the path derivative term of the objective.
Hyperparameters for training: We use the ADAM optimizer with a learning rate of 1e-3, a minibatch size of 576, and a moving average baseline for REINFORCE with a decay factor of 0.99. Typical values: β is set to 0.1, α is set to 100, and γ is chosen on a validation set among (1.0, 10.0, 100.0).
Simulating programs from known syntax:
We follow a simple two-stage heuristic procedure for generating a set of samples to train the program prior in our model. First, we build a list of possible next tokens given a current token, sample a random token from that list, and repeat the procedure from the new token to obtain a large set of candidate sequences. We then pass the candidate sequences through a second stage of filtering based on constraints from Hu et al. (2017).
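A minimal sketch of this two-stage procedure (the transition table and validity check below are illustrative placeholders, not the actual SHAPES/CLEVR module grammar or the constraints of Hu et al. (2017)):

```python
import random

# Stage 1: sample candidate token sequences from per-token next-token lists.
# This transition table is illustrative only.
NEXT = {
    "<start>": ["Find"],
    "Find": ["Transform", "And", "Answer"],
    "Transform": ["Transform", "Answer"],
    "And": ["Find"],
}

def sample_program(rng, max_len=7):
    seq, tok = [], "<start>"
    while len(seq) < max_len and tok != "Answer":
        tok = rng.choice(NEXT.get(tok, ["Answer"]))
        seq.append(tok)
    return seq

def is_valid(seq):
    # Stage 2: filter candidates by syntactic constraints
    # (here, simply: the program must terminate in an Answer token).
    return len(seq) > 0 and seq[-1] == "Answer"

rng = random.Random(0)
programs = [p for p in (sample_program(rng) for _ in range(100)) if is_valid(p)]
```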
Empirical vs. syntactic priors: While our default choice is a prior trained on programs simulated from the known syntax, it might not be possible to exhaustively enumerate all valid program strings in general. Here, we consider the special case where we train the prior on a set of unaligned, ground-truth programs from the dataset. Interestingly, we find that the performance of the question coding stage, especially at reconstructing the questions, improves significantly when we have the syntactic prior as opposed to the empirical prior (from 38.39 ± 11.92% to 54.86 ± 6.75% for 5% program supervision). In terms of program prediction accuracy, we also observe marginal improvements in the cases where we have 5% and 10% supervision respectively, from 56.23 ± 2.81% to 65.45 ± 11.88%. When more supervision is available, regularizing with respect to the broader, syntactic prior hurts performance marginally, which makes sense, as one can just treat program supervision as a supervised learning problem in this setting.
Our goal is to sample from the posterior over programs given an image and an answer, which we accomplish by sampling from the unnormalized joint distribution. To answer a query conditioned on an answer, one can simply filter out samples whose answer does not match the target, return the remaining sampled programs, and decode each program to produce the corresponding question. Note that it is also possible to fit a (more scalable) variational approximation retrospectively, having trained the generative model, following results from Vedantam et al. (2018). However, for the current purposes we found the above sampling procedure to be sufficient.
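The filtering step can be sketched as follows (hedged: `sample_joint` and the toy joint below are hypothetical stand-ins for sampling a program from the prior, decoding its question, and executing the program on the image):

```python
import random

def sample_conditioned_on_answer(target_answer, sample_joint, n=200):
    """Rejection-style sampling: draw (program, question, answer) triples
    from the joint distribution and keep only those whose executed answer
    matches the target, approximating the posterior over programs
    conditioned on that answer."""
    kept = []
    for _ in range(n):
        program, question, answer = sample_joint()
        if answer == target_answer:
            kept.append((program, question))
    return kept

# Toy joint: two programs that deterministically evaluate to yes/no.
_rng = random.Random(0)
def toy_joint():
    if _rng.random() < 0.5:
        return (["Find[red]", "Answer"], "is anything red ?", "yes")
    return (["Find[blue]", "Answer"], "is anything blue ?", "no")

yes_samples = sample_conditioned_on_answer("yes", toy_joint)
```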