Visual Dialogue without Vision or Dialogue

12/16/2018 ∙ by Daniela Massiceti, et al. ∙ University of Oxford

We characterise some of the quirks and shortcomings in the exploration of Visual Dialogue (VD) - a sequential question-answering task where the questions and corresponding answers are related through given visual stimuli. To do so, we develop an embarrassingly simple method based on Canonical Correlation Analysis (CCA) that, on the standard dataset, achieves near state-of-the-art performance for some standard metric. In direct contrast to current complex and over-parametrised architectures that are both compute and time intensive, our method ignores the visual stimuli, ignores the sequencing of dialogue, does not need gradients, uses off-the-shelf feature extractors, has at least an order of magnitude fewer parameters, and learns in practically no time. We argue that these results are indicative of issues in current approaches to Visual Dialogue relating particularly to implicit dataset biases, under-constrained task objectives, and over-constrained evaluation metrics, and consequently, discuss some avenues to ameliorate these issues.







1 Introduction

Caption: A man and a woman sit on the street in front of a large mural painting.
Q: How old is the baby? A: About 2 years old.
Q: What color is the remote? A: White.
Q: Where is the train? A: On the road.
Q: How many cows are there? A: Three.

Figure 1: Failures in visual dialogue. Visually-unrelated questions, and their visually-unrelated plausible answers.

Recent years have seen a great deal of interest in conversational AI, enabling natural language interaction between humans and machines, early pioneering efforts for which include ELIZA [Weizenbaum, 1966] and SHRDLU [Winograd, 1971]. This resurgence of interest builds on the ubiquitous successes of neural-network-based approaches in the last decade, particularly in the perceptual domains of vision and language.

A particularly thriving sub-area of interest in conversational AI is that of visually grounded dialogue, termed VD, involving an AI agent conversing with a human about visual content [Das et al., 2017a, b, Massiceti et al., 2018]. Specifically, it involves answering questions about an image, given some dialogue history—a fragment of previous questions and answers. Typical approaches for learning to do VD, as is standard practice in ML, involve defining an objective to achieve, procuring data with which to learn, and establishing a measure of success at the stated objective.

The objective for VD is reasonably clear at first glance—answer in sequence, a set of questions about an image. The primary choice of dataset, VisDial [Das et al., 2017a], addresses precisely this criterion, involving a large set of images, each paired with a dialogue—a set of question-answer pairs—collected by pairs of human annotators playing a game to understand an image through dialogue. And finally, evaluation measures on the objective are typically defined through some perceived value of a human-derived “ground-truth” answer in the system.

However, as we will demonstrate, certain quirks in the choices of the above factors can lead to unintentional behaviour (cf. Figure 1, taken from online demos of SOTA models—VisDial [Das et al., 2017a] and FlipDial [Massiceti et al., 2018]), which leverages implicit biases in data and methods, potentially misdirecting progress from the desired objectives. Intriguingly, we find that, in contrast to SOTA approaches that employ complex neural-network architectures with complicated training schemes over millions of parameters, taking many hours of time and expensive GPU compute resources, a simple CCA-based method, using only standard off-the-shelf feature extractors, no gradient computation, a few hundred thousand parameters, and just a few seconds on a CPU, achieves comparable performance—all without requiring the image or prior dialogue!

2 (Multi-View) CCA for VD

We begin with a brief preliminary for CCA [Hotelling, 1936] and its multi-view extension [Kettenring, 1971]. In (standard 2-view) CCA, given access to paired observations $\{(x_i, y_i)\}_{i=1}^N$, $x_i \in \mathbb{R}^{d_x}$, $y_i \in \mathbb{R}^{d_y}$, the objective is to jointly learn projection matrices $W_x \in \mathbb{R}^{d_x \times k}$ and $W_y \in \mathbb{R}^{d_y \times k}$, where $k \leq \min(d_x, d_y)$, that maximise the correlation between the projections, formally $\max_{W_x, W_y} \operatorname{corr}(W_x^\top x,\, W_y^\top y)$.

Multi-view CCA, a generalisation of CCA, extends this to associated data across $M$ domains, learning projections $\{W_j\}_{j=1}^M$. Kettenring [1971] shows these can be learnt by minimising the Frobenius norm between each pair of projected views, with additional constraints over the projection matrices [Hardoon et al., 2004]. Optimising the multi-view CCA objective then reduces to solving a generalised eigenvalue decomposition problem, $A v = \lambda B v$, where $A$ and $B$ are derived from the inter- and intra-view correlation matrices (cf. Appendix A) [Bach and Jordan, 2002].

Projection matrices $\{W_j\}$ are extracted from the corresponding rows (for view $j$) and the top $k$ columns of the (eigenvalue-sorted) eigenvector matrix of this eigen-decomposition. A sample $x_j$ from view $j$ is then embedded as $\operatorname{diag}(\lambda)^p\, W_j^\top x_j$, where $\lambda$ is the vector of top-$k$ eigenvalues. The scaling $p$ controls the extent of eigenvalue weighting, reducing to the standard objective at $p = 0$.² With this simple objective, one can tackle a variety of tasks at test time—ranking and retrieval across all possible combinations of multiple views—where the cosine similarity between (centred) embedding vectors captures correlation.

²There are cases where other values of $p$ have been shown to give better performance [Gong et al., 2014].
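The pipeline above can be sketched in a few lines. This is a minimal two-view illustration, not the paper's implementation: the function and variable names (`fit_two_view_cca`, `embed`, `cosine`) are our own, and we use the generalised eigenvalue formulation with the intra-view blocks kept on both sides, under a small ridge regularisation for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh


def fit_two_view_cca(X, Y, k=3, eps=1e-8):
    """Fit 2-view CCA by solving the generalised eigenvalue problem
    A v = lambda B v, where A holds the full (inter- and intra-view)
    correlation blocks and B only the intra-view blocks.

    X: (N, dx) and Y: (N, dy) are paired observations.
    Returns Wx (dx, k), Wy (dy, k) and the top-k eigenvalues.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    N, dx = X.shape
    dy = Y.shape[1]
    Cxx = X.T @ X / N + eps * np.eye(dx)   # regularised for stability
    Cyy = Y.T @ Y / N + eps * np.eye(dy)
    Cxy = X.T @ Y / N
    A = np.block([[Cxx, Cxy], [Cxy.T, Cyy]])
    B = np.block([[Cxx, np.zeros((dx, dy))],
                  [np.zeros((dy, dx)), Cyy]])
    vals, vecs = eigh(A, B)                # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]       # keep the k largest
    W = vecs[:, top]
    return W[:dx], W[dx:], vals[top]


def embed(W, lam, x, p=1.0):
    """Eigenvalue-weighted embedding: diag(lam)^p applied to W.T @ x."""
    return (lam ** p) * (W.T @ x)


def cosine(u, v):
    """Cosine similarity, used as the correlation score at test time."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

At test time, ranking a candidate answer against a question amounts to embedding both with their respective projections and comparing with `cosine`.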

For VD, given a dataset of images $I$ and associated question-answer $(Q, A)$ pairs, joint embeddings between question and answer (and optionally, the image) are learnt, with projection matrices $W_Q$, $W_A$ (and $W_I$), as appropriate. At test time, correlations can be computed between any, and all, combinations of inputs, helping measure suitability against the desired response.

3 Experimental Analyses

In order to employ CCA for VD, we begin by transforming the input images, questions, and answers into lower-dimensional feature spaces. For the images, we employ the standard pre-trained ResNet34 [He et al., 2016] architecture, extracting a 512-dimensional feature—the output of the avg-pool layer after conv5. For the questions and answers, we employ the FastText [Bojanowski et al., 2017] network to extract 300-dimensional embeddings for each of the words. We then simply average the word embeddings [Arora et al., 2017], with suitable padding or truncation (up to a maximum of 16 words), to obtain a 300-dimensional embedding for the question or answer.
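The sentence-embedding step reduces to a mean over word vectors. A minimal sketch, in which the dict-backed `word_vectors` lookup is a stand-in of our own for the actual FastText model:

```python
import numpy as np


def sentence_embedding(tokens, word_vectors, dim=300, max_len=16):
    """Average per-word embeddings into one fixed-size sentence vector,
    truncating to max_len tokens; unknown tokens map to zero vectors.

    word_vectors is a dict mapping token -> np.ndarray of shape (dim,),
    standing in for a FastText lookup here.
    """
    vecs = [word_vectors.get(t, np.zeros(dim)) for t in tokens[:max_len]]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```

The same function serves both questions and answers, since both are embedded into the same 300-dimensional space before CCA.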

Table 1: CCA vs. SOTA: number of learnable parameters and training time.

We then set the hyper-parameters for the CCA objective based on a simple grid search over feasible values, such that we learn a 300-dimensional embedding space that captures the correlations between the relevant domains. It is important to note that the SOTA approaches [Das et al., 2017a, b, Massiceti et al., 2018] also employ pre-trained feature extractors—the crucial difference between approaches is the complexity of the modelling and computation on top of such feature extraction, as starkly indicated in Table 1.

We then learn two joint embeddings—between just the answers and questions, denoted A-Q, and between the answers, questions, and images, denoted A-QI. Note that the answer is always present, since the stipulated task in VD is to answer a given question. The first allows us to explore the utility (or lack thereof) of the image in performing the VD task. The second serves as a useful indicator of how unique any question-image pairing is, in how it affects the ability to answer—performance closer to that of A-Q indicating fewer unique pairings. Also, when embedding all three of A, Q, and I, at test time, we only employ Q to compute a match against a potential answer.

Having now learnt an embedding, we evaluate our performance using the standard ranking measure employed for the VisDial dataset. Here, for a given image and an associated question, the dataset provides a set of 100 candidate answers, which includes the human-derived “ground-truth” answer. The task, then, is to rank each of the 100 candidates and observe the rank awarded to the “ground-truth” answer. In our case, we rank on correlation, computed as the cosine similarity between the centred embeddings of the question and each candidate answer. Then, over all questions, we compute the mean rank (MR), the mean reciprocal rank (MRR; the inverse of the harmonic mean of the ranks), and recall at the top 1, 5, and 10 candidates—measuring how often the “ground-truth” answer ranks within that range.
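These metrics are straightforward to compute once the ground-truth ranks are known. A small sketch with illustrative helper names of our own (`rank_of_ground_truth`, `ranking_metrics`):

```python
import numpy as np


def rank_of_ground_truth(q_emb, cand_embs, gt_index):
    """Rank (1-indexed) of the ground-truth candidate when all
    candidates are sorted by cosine similarity to the question."""
    sims = cand_embs @ q_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(q_emb) + 1e-12)
    order = np.argsort(-sims)              # best candidate first
    return int(np.where(order == gt_index)[0][0]) + 1


def ranking_metrics(ranks):
    """MR, MRR and recall@{1,5,10} from a list of ground-truth ranks."""
    r = np.asarray(ranks, dtype=float)
    return {
        "MR": r.mean(),
        "MRR": (1.0 / r).mean(),
        "R@1": (r <= 1).mean(),
        "R@5": (r <= 5).mean(),
        "R@10": (r <= 10).mean(),
    }
```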

Model            MR     R@1    R@5    R@10   MRR

SOTA (VisDial v0.9)
HCIAE-G-DIS      14.23  44.35  65.28  71.55  0.5467
CoAtt-GAN        14.43  46.10  65.69  71.74  0.5578
HREA-QIH-G       16.79  42.28  62.33  68.17  0.5242

CCA (VisDial v0.9)
A-Q              16.21  16.77  44.86  58.06  0.3031
A-QI (Q)         18.29  12.17  35.38  50.57  0.2427

CCA (VisDial v1.0 test)
A-Q              17.08  15.95  40.10  55.10  0.2832
A-QI (Q)         19.24  12.73  33.05  48.68  0.2393

Table 2: Results for SOTA vs. CCA on the VisDial dataset. CCA achieves comparable performance while ignoring both image and dialogue sequence.

The results, in Table 2, show that the simple CCA approach achieves comparable performance on the mean rank (MR) metric using the A-Q model, which uses neither the image nor the dialogue sequence! This solidifies the impression from Figure 1 that there exist implicit correlations between just the questions and answers in the data, which can be leveraged to perform “well” on a task that simply requires matching “ground-truth” answers. Our experiments indicate that, for the given dataset and task, one need not employ anything more complicated than an exceedingly simple method such as CCA on top of pre-trained feature extractors to obtain plausible results.

Q: What colour is the bear?  GT (rank 51): Floral white.  CCA top-3: (1) White and brown; (2) Brown and white; (3) Brown, black & white.
Q: Does she have long hair?  GT (rank 41): No.  CCA top-3: (1) No, it is short hair; (2) Short; (3) No it’s short.
Q: Can you see any passengers?  GT (rank 48): Not really.  CCA top-3: (1) No; (2) Zero; (3) No I can not.
Q: Are there people not on bus?  GT (rank 22): Few.  CCA top-3: (1) No people; (2) No, there are no people around; (3) I don’t see any people.

Figure 2: Qualitative results for the A-Q model showing the top-3 ranked answers for questions where the ground-truth answer is given a low rank—showing them to be perfectly feasible.

Moreover, another factor to consider is that the evaluation metric itself, through the chosen task of candidate-answer ranking, can be insufficient to draw any real conclusions about how well questions were answered. To see this, consider Figure 2, where we deliberately pick examples that rank the “ground-truth” answer poorly despite CCA’s top-ranked answers all being plausible alternatives. This clearly illustrates the limitations of assuming a single “ground-truth” answer when capturing the breadth of correct answers.

To truly judge the validity of the top-ranked answers, regardless of “ground truth”, would require thorough human-subject evaluation. However, as a cheaper, albeit heuristic, alternative, we quantify the validity of the top answers, in relation to the “ground truth”, using the correlations themselves. For any given question and candidate set of answers, we cluster the answers based on an automatic binary thresholding (ISODATA [Ridler and Calvard, 1978], 5 bins) of their correlation with the given question. We then compute two statistics based on this threshold: (i) the average variance of the correlations in the lower-ranked split, and (ii) the fraction of questions whose correlation with the “ground-truth” answer is higher than the threshold. The intention is that (i) quantifies how closely clustered the top answers are, and (ii) quantifies how often the “ground-truth” answer lies in this cluster. Low values for the former, and high values for the latter, would indicate that there exists an equivalence class of answers, all relatively close to the ground-truth answer in terms of their ability to answer the question. Our analysis on the VisDial v0.9 dataset reveals values of (i) 0.023 and (ii) 96.9%, supporting our claim that CCA recovers plausible answers.
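The thresholding step can be sketched with the classic Ridler-Calvard intermeans iteration; this is our own minimal reading of the heuristic (the paper's exact binning may differ), with hypothetical helper names:

```python
import numpy as np


def isodata_threshold(values, tol=1e-6, max_iter=100):
    """Ridler-Calvard (ISODATA) intermeans threshold: repeatedly move
    the threshold to the midpoint of the means of the two splits."""
    v = np.asarray(values, dtype=float)
    t = v.mean()
    for _ in range(max_iter):
        lo, hi = v[v < t], v[v >= t]
        if lo.size == 0 or hi.size == 0:
            break
        t_new = 0.5 * (lo.mean() + hi.mean())
        if abs(t_new - t) < tol:
            break
        t = t_new
    return t


def cluster_statistics(corrs_per_question, gt_corrs):
    """(i) average variance of the above-threshold (top-answer) split,
    and (ii) fraction of questions whose ground-truth correlation
    clears the threshold."""
    variances, gt_in_top = [], []
    for corrs, gt in zip(corrs_per_question, gt_corrs):
        c = np.asarray(corrs, dtype=float)
        t = isodata_threshold(c)
        variances.append(c[c >= t].var())
        gt_in_top.append(gt >= t)
    return float(np.mean(variances)), float(np.mean(gt_in_top))
```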

We note that the VisDial dataset was recently updated to version 1.0, in which the curators try to ameliorate some of the issues with the single-“ground-truth”-answer approach. They incorporate human-agreement scores for candidate answers, and introduce a modified evaluation which weighs the predicted rankings by these scores. We include our performance on the (held-out) test set of VisDial v1.0 in the bottom rows of Table 2. However, in making this change, the primary evaluation for this data has become an explicit classification task on the candidate answers—requiring access, at train time, to all 100 candidates for every question-image pair [see Table 1, pg 8, Das et al., 2017a, and the evaluation results of the Visual Dialog Challenge 2018]. For the stated goals of VD, this change can be construed as unsuitable, as it falls into the category of redefining the problem to match a potentially unsuitable evaluation measure—how can one get better ranks in the candidate-answer-ranking task? For this reason, although there exist approaches that use the updated data, we do not report comparisons to any of them.

Q: Are they adult giraffe?  GT: Yes.
Ranked answers: (1) Yes the giraffe seems to be an adult; (2) It seems to be an adult, yes; (3) The giraffe is probably an adult, it looks very big; (4) Young adult.

Q: Are there other animals?  GT: No.
Ranked answers: (1) No, there are no other animals; (2) No other animals; (3) There are no other animals around; (4) Don’t see any animals.

Q: Any candles on cake?  GT: Just a large “number one”.
Ranked answers: (1) There are no candles on the cake; (2) I actually do not see any candles on the cake; (3) No, no candles; (4) No candles.

Q: Is the cake cut?  GT: No, but the boy has sure had his hands in it!
Ranked answers: (1) No it’s not cut; (2) No the cake has not been cut; (3) Nothing is cut; (4) No, the cake is whole.

Figure 3: Example answers “generated” using the nearest-neighbours approach. For a given test question, a custom candidate set is constructed by choosing answers corresponding to the 100 closest (by correlation using A-Q) questions from the training data, and the best-correlated answers to the given question returned.

Although standard evaluation for VD involves ranking the given candidate answers, there remains the question of whether, given a question (relating to an image), the CCA approach really “answers” it. From one perspective, simply choosing from a given candidate set can seem a poor substitute for the ability to generate answers, in the vein of Das et al. [2017a], Massiceti et al. [2018]. To address this, we construct a simple “generative” model using our learned projections between questions and answers (A-Q model, cf. Figure 3). For a given question, we select the answers corresponding to its 100 nearest-neighbour questions, using solely the train set, to construct a custom candidate-answer set. We then compute their correlations with the given question, and sample the top-correlated answers as “generated” answers.³ We also compute our heuristic for the validity of the top-ranked answers in relation to the “ground truth” as before, obtaining average variance (i) 0.018 and fraction correct (ii) 87.2%, indicating reliable performance.

³This only additionally requires disk space to persist the training data—costing roughly $0.25/GB.
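The retrieval-based “generation” above is a two-stage nearest-neighbour lookup. A minimal sketch under our own naming (`generate_answers` is hypothetical; the paper's exact procedure may differ in details):

```python
import numpy as np


def generate_answers(q_emb, train_q_embs, train_answers, train_a_embs,
                     n_neighbours=100, top=4):
    """'Generate' answers by retrieval: gather the answers paired with
    the n_neighbours training questions most correlated with the query,
    then return the candidates best correlated with the query itself."""
    def cos(M, v):
        return M @ v / (np.linalg.norm(M, axis=1) * np.linalg.norm(v) + 1e-12)

    # Stage 1: nearest-neighbour questions in the shared embedding space.
    nn = np.argsort(-cos(train_q_embs, q_emb))[:n_neighbours]
    # Stage 2: rank their paired answers against the query question.
    scores = cos(train_a_embs[nn], q_emb)
    best = nn[np.argsort(-scores)[:top]]
    return [train_answers[i] for i in best]
```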

4 Discussion

We use the surprising equivalence from § 3 as evidence of several issues with current approaches to VD. The biggest concern our evaluation, and a similar one by Anand et al. [2018], reveals is that, for standard datasets in the community, visually grounded questions can be answered “well” without referring to the visual stimuli. This reveals an unwanted bias in the data, whereby correlations between question-answer pairs can be exploited to provide reasonable answers to visually-grounded questions. Moreover, the dataset also encodes an implicit bias that any given question must necessarily relate to a given image—as evidenced by visually-unrelated questions getting visually-unrelated, but plausible, answers (Figure 1). A particularly concerning implication of this is that current approaches to VD [Das et al., 2017a, b, Massiceti et al., 2018] may not actually be targeting the intended task.

Our simple CCA method also illustrates that the standard evaluation used for VD has certain shortcomings. Principal among these is the use of “candidate” answers for each question, with a particular subset of them (1 in VisDial v0.9, and K human-derived weighted choices in v1.0) deemed to be the “ground-truth” answers. However, as we show in Figure 2, such an evaluation can still be insufficient to capture the range of all plausible answers. Designing evaluations on the “match” of expected answers for natural language, though, is fraught with difficulty, as one needs to account for a high degree of syntactic variability with perhaps little semantic difference.

Responses to the issues observed here can take a variety of forms. For the objective itself, one could instead evaluate the effectiveness with which the dialogue enables a downstream task, as explored by some [Das et al., 2017b, De Vries et al., 2017, Khani et al., 2018, Lazaridou et al., 2016]. Also, to address implicit biases in the dataset, one could adopt synthetic, or simulated, approaches, such as Hermann et al. [2017], to help control for undesirable factors. Fundamentally, the important concern here is to evaluate visual dialogue on its actual utility—conveying information about the visual stimuli—as opposed to surface-level measures of suitability.

And finally, we believe an important takeaway from our analyses is that it is highly effective to begin exploration with the simplest possible tools one has at one’s disposal. This is particularly apposite in the era of deep neural networks, where the prevailing attitude appears to be that it is preferable to start exploration with complicated methods that aren’t well understood, as opposed to older, perhaps even less fashionable methods that have the benefit of being rigorously understood. Also, as shown in Table 1, choosing simpler methods can help minimise human effort and cost in terms of both compute and time, and crucially provide the means for cleaner insights into the problems being tackled.


Appendix A Multi-view Canonical Correlation Analysis

Among the several possible ways to formulate the CCA objective for multiple variables [Kettenring, 1971], we choose the Frobenius-norm-based objective as it provides better insight. Assume there are $M$ views, and let $X_j$ denote the (centred) observations from the $j$-th view. The objective is to jointly learn projection matrices $W_j$ for all the views such that the embeddings in the $k$-dimensional space are maximally correlated. This is achieved by optimising the following problem:

$$\min_{W_1, \ldots, W_M} \sum_{i < j} \left\| W_i^\top X_i - W_j^\top X_j \right\|_F^2 \quad \text{s.t.} \quad W_j^\top C_{jj} W_j = I_k, \qquad (1)$$

where $w_j^{(l)}$, the $l$-th column of projection matrix $W_j$, gives the $l$-th embedding dimension for view $j$. It turns out that optimising Eq. 1 reduces to solving a generalised eigenvalue decomposition problem [Bach and Jordan, 2002]:

$$\begin{bmatrix} C_{11} & \cdots & C_{1M} \\ \vdots & \ddots & \vdots \\ C_{M1} & \cdots & C_{MM} \end{bmatrix} v = \lambda \begin{bmatrix} C_{11} & & \\ & \ddots & \\ & & C_{MM} \end{bmatrix} v, \qquad (2)$$

where $C_{ij}$ is the correlation matrix obtained using the observations from the $i$-th and $j$-th views.