Implementation for "Large-scale Pretraining for Visual Dialog" https://arxiv.org/abs/1912.02379
Prior work in visual dialog has focused on training deep neural models on the VisDial dataset in isolation, which has led to great progress, but is limiting and wasteful. In this work, following recent trends in representation learning for language, we introduce an approach to leverage pretraining on related large-scale vision-language datasets before transferring to visual dialog. Specifically, we adapt the recently proposed ViLBERT (Lu et al., 2019) model for multi-turn visually-grounded conversation sequences. Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial with a VisDial-specific input representation and the masked language modeling and next sentence prediction objectives (as in BERT). Our best single model achieves state-of-the-art on Visual Dialog, outperforming prior published work (including model ensembles) by more than 1% absolute on NDCG and MRR. Next, we carefully analyse our model and find that additional finetuning using 'dense' annotations, i.e. relevance scores for all 100 answer options corresponding to each question on a subset of the training set, leads to even higher NDCG (more than 10% over our base model) but hurts MRR (more than 17% below our base model). This highlights a stark trade-off between the two primary metrics for this task, NDCG and MRR. We find that this is because dense annotations in the dataset do not correlate well with the original ground-truth answers to questions, often rewarding the model for generic responses (e.g. "can't tell").
Recent years have seen incredible progress in Visual Dialog [13, 1, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33], spurred in part by the initial efforts of Das et al. in developing a concrete task definition – given an image, dialog history consisting of a sequence of question-answer pairs, and a follow-up question about the image, to predict a free-form natural language answer to the question – along with a large-scale dataset and evaluation metrics. The state-of-the-art on the task has improved by more than absolute ( NDCG) and the original task has since been extended to challenging domains, e.g. video understanding and navigation assistants [35, 36, 37].
While this is promising, much of this progress has happened in isolation, wherein sophisticated neural architectures are trained and benchmarked solely on the VisDial dataset. This is limiting – since there is a significant amount of shared abstraction and visual grounding in related tasks in vision and language (e.g. captioning, visual question answering) that can benefit Visual Dialog – and wasteful – since it is expensive and dissatisfying to have to collect a large-scale dataset for every new task. In this work, we explore an approach to pretrain our core model on other publicly available vision and language datasets and then transfer to Visual Dialog.
Specifically, we adapt ViLBERT to Visual Dialog. ViLBERT uses two Transformer-based encoders, one for each of the two modalities – language and vision – and interaction between the two modalities is enabled by co-attention layers, i.e. attention over inputs from one modality conditioned on inputs from the other. Note that adapting ViLBERT to Visual Dialog is not as trivial as it may seem (or initially seemed to us). The Visual Dialog dataset has image-grounded conversation sequences that are up to rounds long. These are significantly longer than captions (which are sentences) from the Conceptual Captions dataset used to pretrain ViLBERT, and thus require a different input representation and careful reconsideration of the masked language modeling and next sentence prediction objectives used to train BERT and ViLBERT.
This adapted model outperforms prior published work by absolute and achieves state-of-the-art on Visual Dialog. Next, we carefully analyse our model and find that additional finetuning on ‘dense’ annotations (publicly available on visualdialog.org/data), i.e. relevance scores for all 100 answer options corresponding to each question on a subset of the training set, highlights an interesting trade-off – the model gets to NDCG (outperforming the 2019 VisDial Challenge winner), but an MRR of ( below our base model!). We find this happens because dense annotations in VisDial do not correlate well with the ground-truth answers to questions, often rewarding the model for generic, uncertain responses.
Concretely, our contributions are as follows:
We introduce an adaptation of the ViLBERT model for Visual Dialog, thus making use of the large-scale Conceptual Captions and Visual Question Answering (VQA) datasets for pretraining and learning powerful visually-grounded representations before finetuning on VisDial. Since captioning and VQA differ significantly from Visual Dialog in input size ( sentence descriptions vs. question-answer rounds), this requires rethinking the input representation to learn additional segment embeddings representing question-answer pairs. Our adapted model improves over prior published work by and sets a new state-of-the-art.
We next finetune our model on dense annotations i.e. relevance scores for all answer options corresponding to each question on a subset of the training set, leading to even higher NDCG – more than over our base model – but hurting MRR – more than below our base model! This highlights a stark trade-off between the two primary metrics for this task – NDCG and MRR. We demonstrate through qualitative and quantitative results that this happens because dense annotations do not correlate well with the original ground-truth answers, often rewarding the model for generic, uncertain responses.
Our work is related to and builds on prior work in visual dialog [13, 1, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 27, 28, 29, 25, 26, 30, 31, 32, 33], and self-supervised pretraining and transfer learning in computer vision and language [38, 39, 40, 2, 3, 4, 5, 6, 7, 8, 9].
Das et al. introduced the task of Visual Dialog – given an image, dialog history consisting of a sequence of question-answer pairs, and a follow-up question, predict a free-form natural language answer to the question – along with a dataset, evaluation metrics, and baseline models. Follow-up works on visual dialog have explored the use of deep reinforcement learning [14, 15, 28], transferring knowledge from discriminative to generative decoders, conditional variational autoencoders, generative adversarial networks, attention mechanisms for visual coreference resolution [20, 22], and modeling the questioner’s theory of mind. Crucially, all of these works train and evaluate on the VisDial dataset in isolation, without leveraging related visual grounding signals from other large-scale datasets in vision and language. We devise a unified model that can be pretrained on the Conceptual Captions and VQA datasets, and then transferred and finetuned on VisDial.
Self-Supervised Learning in Vision and Language. Building on the success of transfer learning in natural language understanding [2, 3, 4, 5, 6, 7, 8, 9] leading to state-of-the-art results on a broad set of benchmarks [41, 44], recent work has extended this to vision and language tasks [10, 45, 46, 47, 48, 49, 50]. These works pretrain single-stream [45, 48, 49] or two-stream [10, 46] Transformer-based models with self-supervised objectives, such as next-sentence prediction and masked language/image modeling, on large-scale image-text datasets and have led to compelling results in Visual Question Answering, Commonsense Reasoning, Natural Language Visual Reasoning, Entailment, Image-Text Retrieval [54, 55], and Referring Expressions.
Lu et al. introduced ViLBERT (along with code released at github.com/jiasenlu/ViLBERT_beta), which extended BERT to a two-stream multi-modal architecture for jointly modeling visual and linguistic inputs. Interaction between the two modalities was enabled through co-attention layers, i.e. attending to one modality conditioned on the other – attention over language conditioned on visual input, and attention over image regions conditioned on linguistic input. This was operationalized as swapping the key and value matrices between the visual and linguistic Transformer blocks. We next discuss our changes to adapt it for Visual Dialog followed by our training pipeline.
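The key/value swap described above can be illustrated with a minimal sketch. This is not the released ViLBERT code: it assumes a single attention head, identity projections in place of learned query/key/value matrices, and plain Python lists as vectors. The function names `attend` and `co_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def co_attention(lang, vis):
    """Each stream keeps its own queries but takes keys/values from the other.

    Assumption: identity projections stand in for the learned Q/K/V matrices.
    """
    lang_out = attend(lang, vis, vis)   # language attends over image regions
    vis_out = attend(vis, lang, lang)   # image regions attend over language
    return lang_out, vis_out


lang = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
vis = [[0.5, 0.5], [2.0, 0.0]]                # two toy region vectors
lang_out, vis_out = co_attention(lang, vis)
```

Each output row is a convex combination of the other stream's vectors, so the language stream's outputs live in the span of the visual inputs and vice versa.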
Input Representation. Recall that the model gets image , dialog history (including image caption ) , question , and a list of answer options as input, and is asked to return a sorting of . We concatenate the rounds of dialog history and follow-up question , with each question and answer separated by a <SEP> token. Similar to Wolf et al., we use different segment embeddings for questions and answers to help the model distinguish between the two and understand question and answer boundaries in the input. Captions and answers share the same segment embeddings. To represent the image, we follow [10, 58] and extract object bounding boxes and their visual features for top- detected objects in the image from a Faster R-CNN (with a ResNet-101 backbone) object detection network pretrained on the Visual Genome dataset. The feature vector for each detected object is computed as mean-pooled convolutional features from the regions of that object. A 5-d feature vector, consisting of normalized top-left and bottom-right object coordinates, and the fraction of image area covered, is projected to the same dimensions as the feature vector for the detected object, and added to it. The beginning of this image region sequence (consisting of object detection features) is demarcated by an IMG token with mean-pooled features from the entire image.
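As a rough illustration of this input representation, the sketch below builds the concatenated text sequence with per-token segment ids and computes the box geometry vector. Whitespace tokenization, the segment id values, and the function names `build_text_input` and `box_geometry` are assumptions made for this example, not details taken from the released code.

```python
Q_SEG, A_SEG = 0, 1  # assumed segment ids; captions share the answer segment


def build_text_input(caption, history, question, answer=None):
    """Concatenate caption, dialog rounds and follow-up question.

    history is a list of (question, answer) string pairs. Every piece is
    followed by a <SEP> token. Whitespace tokenization is a simplification.
    """
    pieces = [(caption, A_SEG)]
    for q, a in history:
        pieces.append((q, Q_SEG))
        pieces.append((a, A_SEG))
    pieces.append((question, Q_SEG))
    if answer is not None:          # candidate answer appended for scoring
        pieces.append((answer, A_SEG))
    tokens, segments = [], []
    for text, seg in pieces:
        words = text.split() + ["<SEP>"]
        tokens.extend(words)
        segments.extend([seg] * len(words))
    return tokens, segments


def box_geometry(x1, y1, x2, y2, img_w, img_h):
    """Normalized top-left and bottom-right corners plus covered area fraction."""
    area = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area]


tokens, segments = build_text_input(
    "a dog on a couch",
    [("is it big ?", "yes")],
    "what color ?",
    answer="brown",
)
geom = box_geometry(0, 0, 50, 100, img_w=100, img_h=200)
```

In the full model the geometry vector would be passed through a learned projection and added to the region feature; that projection is omitted here.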
To pretrain the model, we follow  and train on the Conceptual Captions (CC) dataset, which is a large corpus (with M samples) of aligned image-caption pairs. During pretraining, the sum of the masked language modeling (MLM) loss  and the masked image region (MIR) loss is optimized. To compute the MLM loss, a set of tokens in the input sequence are masked and the model is trained to predict these tokens given context. We mask around of the tokens in the input sequence. For the MIR loss, similar to the MLM loss, we zero out of the image features and the model learns to predict the semantic category of the masked out object (out of classes from Visual Genome [42, 58]). Both losses are equally weighted.
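The two masking steps can be sketched as follows. The masking probability is passed in as a parameter because the exact fraction is not recoverable from the text above; the value 0.15 in the usage lines is only a placeholder. Replacing every selected token with a `<MASK>` string is a simplification, and the helper names are hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob, rng, special=("<SEP>",)):
    """Replace a random subset of tokens with <MASK>; labels hold the originals."""
    masked, labels = [], []
    for tok in tokens:
        if tok not in special and rng.random() < mask_prob:
            masked.append("<MASK>")
            labels.append(tok)      # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)     # position is ignored by the MLM loss
    return masked, labels


def mask_regions(features, classes, mask_prob, rng):
    """Zero out a random subset of region features; labels hold object classes."""
    masked, labels = [], []
    for feat, cls in zip(features, classes):
        if rng.random() < mask_prob:
            masked.append([0.0] * len(feat))
            labels.append(cls)      # the model must predict this class
        else:
            masked.append(list(feat))
            labels.append(None)
    return masked, labels


def pretraining_loss(mlm_loss, mir_loss):
    """Both losses are equally weighted, as stated in the text."""
    return mlm_loss + mir_loss


rng = random.Random(0)
tokens = "a dog sitting on a couch <SEP>".split()
masked_tokens, token_labels = mask_tokens(tokens, 0.15, rng)  # 0.15 is a placeholder
regions = [[0.2, 0.4], [0.9, 0.1], [0.3, 0.3]]
masked_regions, region_labels = mask_regions(regions, ["dog", "couch", "wall"], 0.15, rng)
```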
The VQA dataset is quite related to Visual Dialog in that it can be interpreted as independent visually-grounded question-answer pairs with no dialog history, and thus is a natural choice for further pretraining prior to finetuning on VisDial. Similar to Lu et al. , we pretrain on VQA by learning a small decoder – a two-layer MLP – on top of the element-wise product between the image and text representations to predict a distribution over answers.
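A minimal sketch of such a decoder is given below: the image and text representations are fused by element-wise product and passed through a two-layer MLP that outputs a distribution over answers. The ReLU non-linearity, the layer sizes, and the random initialization are assumptions for illustration only.

```python
import math
import random


def init_linear(n_out, n_in, rng):
    """Small random weights and zero biases for one linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def vqa_head(img_vec, txt_vec, layer1, layer2):
    """Two-layer MLP over the element-wise product of the two representations."""
    fused = [i * t for i, t in zip(img_vec, txt_vec)]
    hidden = [max(0.0, h) for h in linear(fused, *layer1)]  # ReLU is assumed
    return softmax(linear(hidden, *layer2))


rng = random.Random(0)
dim, hidden_dim, num_answers = 4, 8, 5        # toy sizes, not the paper's
layer1 = init_linear(hidden_dim, dim, rng)
layer2 = init_linear(num_answers, hidden_dim, rng)
answer_dist = vqa_head([0.5, 1.0, -0.2, 0.3], [1.0, 0.1, 0.4, -0.7], layer1, layer2)
```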
To finetune on Visual Dialog, we use the MLM loss along with the next sentence prediction (NSP) loss and the MIR loss. For MLM, we mask of the tokens in the dialog sequence. For the MIR loss, similar to pretraining, we mask of the image features. Note that the discriminative task in visual dialog is to identify the ground-truth answer from a list of answer options consisting of popular, nearest neighbors, and random answers from the dataset. We achieve this through the NSP loss. The NSP head is trained to predict when the ground-truth answer is appended to the input sequence, and when a negative answer sampled from the remaining answer options is appended to it. Each image in VisDial has rounds of dialog, leading to sets of positive and negative samples for the NSP loss per mini-batch. Since these are fairly correlated samples, we randomly sub-sample out of these during training. At test time, we use log-probabilities from the NSP head to rank the 100 answer options at each round.
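The sampling and ranking logic can be sketched as below. The binary labels (1 for the ground-truth answer, 0 for a sampled negative) and the helper names are assumptions; the NSP log-probabilities in the usage lines are made-up toy values rather than model outputs.

```python
import math
import random


def make_nsp_samples(gt_index, num_options, rng):
    """One positive (ground-truth) and one sampled negative answer per round."""
    negatives = [i for i in range(num_options) if i != gt_index]
    return [(gt_index, 1), (rng.choice(negatives), 0)]


def rank_options(nsp_log_probs):
    """Return the 1-based rank of every option, best (highest log-prob) first."""
    order = sorted(range(len(nsp_log_probs)), key=lambda i: -nsp_log_probs[i])
    ranks = [0] * len(order)
    for position, option in enumerate(order, start=1):
        ranks[option] = position
    return ranks


rng = random.Random(0)
samples = make_nsp_samples(gt_index=2, num_options=100, rng=rng)

toy_log_probs = [math.log(p) for p in (0.1, 0.7, 0.2)]
ranks = rank_options(toy_log_probs)   # option 1 is ranked first
```

The rank assigned to the ground-truth option is what the retrieval metrics (MR, MRR, recall) are computed from.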
The authors of  recently released dense annotations (publicly available on visualdialog.org/data), i.e. relevance scores for all 100 answer options from corresponding to the question on a subset of the training set. These relevance scores range from 0 to 1 and are calculated as the ratio of number of human annotators who marked a particular answer option as correct to the total number of human annotators (). So means that the answer option was considered correct by human annotators. In our final stage of training, we utilize these dense annotations to finetune our model. Concretely, we use the NSP head to predict likelihood scores for each answer option, normalize these to form a probability distribution over the answers, and then compute a cross-entropy (CE) loss against the normalized ground-truth relevance scores.
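One way to read this loss is sketched below: the per-option scores are normalized with a softmax, the relevance scores are normalized to sum to one, and the cross-entropy between the two is returned. The softmax normalization and the handling of an all-zero relevance vector are assumptions, since the original formula is not recoverable from the text.

```python
import math


def log_softmax(scores):
    """Log of the softmax distribution, computed stably."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def dense_ce_loss(scores, relevance):
    """Cross-entropy between softmax(scores) and normalized relevance scores."""
    total = sum(relevance)
    if total == 0:
        return 0.0  # assumption: skip questions with no relevant option
    target = [r / total for r in relevance]
    log_probs = log_softmax(scores)
    return -sum(t * lp for t, lp in zip(target, log_probs))


relevance = [1.0, 0.5, 0.0, 0.5]                       # toy dense annotations
uniform = dense_ce_loss([0.0, 0.0, 0.0, 0.0], relevance)
aligned = dense_ce_loss([3.0, 1.0, -2.0, 1.0], relevance)
misaligned = dense_ce_loss([-2.0, 1.0, 3.0, 1.0], relevance)
```

Scores that agree with the relevance ordering give a lower loss than scores that put the mass on an irrelevant option.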
To compare to previous research, we conduct experiments on VisDial v . The dataset contains human-human dialogs on COCO -like images. We follow the original splits and use for training, for validation, and for testing. We next describe the various settings we experiment with.
Evaluation Metrics. We use the same evaluation metrics as in . Specifically, given the predicted ranking of answer options from the model at each round, we compute retrieval metrics – mean rank (MR) of the ground-truth answer, mean reciprocal rank (MRR), and recall@ (where ). Additionally, along with the release of dense annotations, i.e. relevance scores for all answer options, a new metric – NDCG – was introduced. NDCG accounts for multiple correct answers in the option set and penalizes low-ranked but correct answer options.
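These metrics can be computed as in the sketch below. The recall cutoffs (1, 5, 10) are placeholders because the values are not recoverable from the text above. The NDCG variant shown truncates at k, the number of options with non-zero relevance, and uses a log2 discount; this follows our understanding of the VisDial evaluation and should be treated as an assumption.

```python
import math


def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Mean rank, mean reciprocal rank and recall@k from 1-based GT ranks."""
    n = len(gt_ranks)
    metrics = {
        "mr": sum(gt_ranks) / n,
        "mrr": sum(1.0 / r for r in gt_ranks) / n,
    }
    for k in ks:
        metrics["r@%d" % k] = sum(1 for r in gt_ranks if r <= k) / n
    return metrics


def ndcg(ranks, relevance):
    """NDCG for one question; ranks[i] is the predicted 1-based rank of option i."""
    k = sum(1 for r in relevance if r > 0)

    def dcg(order):
        return sum(relevance[i] / math.log2(pos + 2) for pos, i in enumerate(order[:k]))

    predicted = sorted(range(len(ranks)), key=lambda i: ranks[i])
    ideal = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    ideal_dcg = dcg(ideal)
    return dcg(predicted) / ideal_dcg if ideal_dcg > 0 else 0.0


metrics = retrieval_metrics([1, 2, 4])
perfect = ndcg([1, 2, 3], [1.0, 0.5, 0.0])
reversed_score = ndcg([3, 2, 1], [1.0, 0.5, 0.0])
```

Because NDCG only looks at relevance scores, a model can rank the original ground-truth answer poorly and still score well, which is the trade-off analysed later in the paper.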
We begin with a ‘blind’ setting, where given the dialog history and follow-up question, and without access to the image, the model is tasked with predicting the answer. As such, we do not use the ViLBERT formulation for these experiments, and finetune the BERT model released in  and pretrained on BooksCorpus and English Wikipedia. For the MLM loss, we mask of tokens and sub-sample out of 20 sequences per mini-batch during training. We experiment with two variants – training only with NSP, and training with both NSP and MLM. See tab:val_results for language-only results (marked ‘L-only’). This experimental setting helps us put gains coming from switching to Transformer-based architectures (and before the added complexity of incorporating visual input) in perspective.
Varying number of dialog rounds. We train ablations of our language-only model (with both NSP and MLM losses) where we vary the number of rounds in dialog history, starting from , where the input sequence only contains the follow-up question and answer, to , , and rounds of dialog history. tab:round_results shows these results.
Zero-shot and ‘cheap’ finetuning. We report performance for ablations of our NSP+MLM model with no/minimal training. First, we conduct a zero-shot test where we initialize BERT with weights from Wikipedia and BooksCorpus pretraining and simply run inference on VisDial. Second, with the same initialization, we freeze all layers and finetune only the MLM and NSP loss heads. See tab:zero.
We finetune ViLBERT on VisDial with three different weight initializations – 1) from the best language-only weights (from sec:lonly) for the language stream (visual stream and co-attention layers initialized randomly), 2) from a model pretrained on CC  (as described in sec:cctraining), and 3) from a model pretrained on CC +VQA  (as described in sec:vqatraining). 1) helps us benchmark performance if the model learns visual grounding solely from VisDial, 2) quantifies effects of learning visual grounding additionally from CC, while 3) helps us quantify improvements with additional exposure to visually-grounded question-answering data. See tab:val_results for results.
Finally, we finetune our best model from sec:visdialft2 – marked ‘w/ CC+VQA’ in tab:val_results – on dense annotations, as described in sec:denseft. Note that computing the CE loss requires a separate forward pass for each of the answer options, since dialog history, question, answer are all concatenated together before passing as input. This is memory-expensive, and so in practice, we sub-sample and only use options, and use gradient accumulation to (artificially) construct a larger mini-batch. Finetuning with the CE loss only leads to significant improvements on NDCG but hurts other metrics (see tab:val_results). We discuss and analyse this in more detail later. But to control for this ‘metric-overfitting’, we also train a variant with both the CE and NSP losses.
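Gradient accumulation, as used here to build a larger effective mini-batch, can be illustrated on a toy problem. The sketch below uses a one-parameter linear model with a mean squared error loss, which is purely illustrative and unrelated to the actual model. It shows that summing suitably scaled micro-batch gradients before a single update matches the full-batch update.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def full_batch_step(w, batch, lr):
    return w - lr * grad_mse(w, batch)


def accumulated_step(w, micro_batches, lr):
    """Accumulate micro-batch gradients, then apply one parameter update."""
    total = sum(len(mb) for mb in micro_batches)
    grad = 0.0
    for mb in micro_batches:
        # Scale by the micro-batch share so the sum equals the full-batch mean.
        grad += grad_mse(w, mb) * len(mb) / total
    return w - lr * grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_full = full_batch_step(0.0, data, lr=0.01)
w_accum = accumulated_step(0.0, [data[:2], data[2:]], lr=0.01)
```

Only one micro-batch needs to be held in memory at a time, which is what makes the approach attractive when every answer option requires its own forward pass.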
[Results tables: only row labels survive – NSP + MLM; w/ CC; w/ CC + VQA; CE + NSP; Loss heads only; MReal - BDAI; MS D365 AI.]
We list findings from all our experiments in sec:experiments below.
Language-only performs well. The language-only model gets to on NDCG and on MRR (tab:val_results). This is high and already competitive with several prior published works (see tab:teststd).
Increasing dialog history rounds helps. As reported in tab:round_results, performance of the language-only model continues to go up with increasing dialog history rounds. We believe these improvements are largely indicative of the Transformer’s ability to model long-term dependencies.
Zero-shot model performs poorly. Running inference with the language-only model pretrained on BooksCorpus  and Wikipedia without any finetuning on VisDial only gets to on NDCG and on MRR. Finetuning the loss heads with all other layers frozen leads to an improvement of NDCG points over this. This low performance can be attributed to significantly longer input sequences in VisDial than the model was pretrained with.
VQA initialization helps more than CC. Finetuning ViLBERT on VisDial with weights initialized from VQA pretraining gets to on NDCG and on MRR, points more than CC pretraining. This is likely because images in VQA are from COCO (same as VisDial) as opposed to CC, and the task of visual question answering is more closely related to VisDial than image captioning.
Dense annotations boost NDCG, hurt MRR. Finetuning with the CE loss leads to on NDCG – a improvement over the ‘w/ CC + VQA’ base model – but on MRR, a decline below the base model. This is a somewhat surprising finding, and we dig deeper into this result in subsequent analysis in sec:analysis.
We report results from the Visual Dialog evaluation server (evalai.cloudcv.org/web/challenges/challenge-page/161/leaderboard/483) for our best models – ‘w/ CC + VQA’, ‘CE’ and ‘CE + NSP’ – on the unseen test-std split in tab:teststd. We compare against prior published results as well as top entries from the leaderboard. Our models outperform prior results and set a new state-of-the-art – ViLBERT with CC + VQA pretraining on MRR, R@, MR metrics, and further finetuning with a CE loss on dense annotations on NDCG. Finally, adding the NSP loss along with CE (as in sec:denseft2) offers a nice balance between optimizing metrics that reward both sparse (original ground-truth answers) and dense annotations.
As described in sec:results, finetuning on dense annotations leads to a significant increase in NDCG, but hurts the other metrics – MRR, R@, R@, R@ and MR – which depend on the original sparse annotations in VisDial i.e. follow-up answers provided in human-human dialog.
We begin by visualizing the distribution of dense relevance scores for these sparse ground-truth (GT) answers in fig:distribution_relevance_gt_answers and observe that GT answers have relevance , and have relevance . Thus, there is some degree of misalignment between dense and sparse annotations – answers originally provided during human-human dialog in VisDial were not always judged to be relevant by all humans during the post-hoc dense annotation phase.
Why are GT and dense annotations misaligned? We notice that many questions with discrepancy between GT and dense annotations are somewhat subjective. For example, in row 1, round 7 (fig:qual1), Q: ‘what color is the chair?’, the GT answer is ‘black’ but the chair is in shadow and it is difficult to accurately identify its color. Thus, we expect to see variance when multiple humans are polled for the answer. Instead, the GT answer is just one sample from the human answer distribution, not necessarily from its peak. In general, the dense annotations seem less wrong than GT (as they are sourced by consensus) since they are safer – often resolving to answers like ‘I cannot tell’ when there is uncertainty / subjectivity – but also uninformative – not conveying additional information, e.g. ‘I think but they are occluded so it is hard to tell’ – since such nuanced answers are not part of the list of answer options in VisDial.
Model performance on GT vs. dense annotations. tab:relevance_distribution_gt_answer shows mean ranks of these GT answers as predicted by three model variants – ViLBERT w/ CC + VQA, CE, and CE + NSP – grouped by dense relevance scores. The ‘CE’ model gets worse mean ranks than ‘w/ CC + VQA’ for all GT answers, since it is no longer trained with these GT answers during dense annotation finetuning. The CE model assigns low mean ranks to GT answers with higher relevance scores (), which translates to a high NDCG score (tab:val_results). But it assigns poor mean ranks to GT answers with relatively lower relevance scores (), and since GT answers have relevance scores , this hurts MRR, R@, MR for the CE model (tab:val_results).
Next, we consider the top- most-relevant answer options (occurring times) as per dense annotations in VisDial v val (not restricting ourselves to only GT answers). fig:all_val_top_50_options_avg_relevance shows the mean relevance scores for this set, and fig:all_val_top_50_options_mean_rank shows the mean ranks assigned to these answers by our models. The CE model gets better mean ranks in this set compared to Base, leading to high NDCG.
Qualitative examples. Finally, we present uniformly sampled example answer predictions on VisDial v val from our models along with the ground-truth dialog sequences in fig:qual1, fig:qual2, and fig:qual3. In these examples, consistent with the Visual Dialog task definition , at every round of dialog, the model gets the image, ground-truth human dialog history (including caption), and follow-up question as input, and predicts the answer. Specifically, the model ranks answer options. Here we show the top- prediction.
We make a few observations. 1) The Base model is surprisingly accurate, e.g. in row 2, round 1 (fig:qual1), Q: ‘can you see any people?’, predicted answer: ‘part of a person’, in row 2, round 10, Q: ‘anything else interesting about the photo?’, predicted answer: ‘the dog is looking up at the person with his tongue out’. 2) The CE model often answers with generic responses (such as ‘I cannot tell’), especially for questions involving some amount of subjectivity / uncertainty, e.g. in row 1, round 7, Q: ‘what color is the chair?’, predicted answer: ‘I cannot tell’ (the chair seems to be in shadow in the image), in row 2, round 7, Q: ‘does the dog look happy?’, predicted answer: ‘I can’t tell’ (subjective question). 3) This also highlights a consequence of misalignment between ground-truth and dense annotations. While the ground-truth answer provides one reasonable response for the question asked, it is answerer-specific to quite an extent and there may be other correct answers (annotated in the dense annotations). A negative effect of this misalignment is that when finetuned on dense annotations (CE), the model gets rewarded for generic answers (e.g. ‘cannot tell’). While being able to capture and reason about uncertainty is a desirable property models should have, it would be more helpful if these agents can convey more information with appropriate qualifiers (e.g. ‘I think but they are occluded so it is hard to tell’) than a blanket ‘I cannot tell’. We aim to study this in future work.
For the linguistic stream, we use the BERTBASE model , which has layers of Transformer blocks, with each block having attention heads and a hidden state size of . For the visual stream, we use layers of Transformer blocks with each block having attention heads with a hidden state size of . We then connect the first Transformer layers in the visual stream with the last Transformer layers in the linguistic stream with co-attentional layers as in .
For the linguistic stream, we set the maximum sequence length to . We do not train on dialog sequences which exceed the maximum sequence length and during inference, we truncate rounds starting from the first round to fit within the maximum length (we do not remove the caption). However, nearly all dialog sequences have less than tokens and we rarely had to truncate the dialog sequence.
All loss coefficients are set to . We use Adam  with the learning rate linearly increasing from to over iterations and decay it to over iterations.
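The schedule described above (a linear warmup followed by a decay) can be sketched as a function of the iteration count. All numeric values in the usage lines are hypothetical, because the actual learning rates and iteration counts are not recoverable from the text. The linear shape of the decay phase is also an assumption.

```python
def learning_rate(step, start_lr, peak_lr, warmup_steps, final_lr, decay_steps):
    """Linear warmup from start_lr to peak_lr, then linear decay to final_lr."""
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


# Hypothetical settings, chosen only to exercise the function.
schedule = dict(start_lr=0.0, peak_lr=1.0, warmup_steps=10, final_lr=0.1, decay_steps=90)
lr_start = learning_rate(0, **schedule)
lr_mid_warmup = learning_rate(5, **schedule)
lr_peak = learning_rate(10, **schedule)
lr_end = learning_rate(100, **schedule)
lr_after = learning_rate(500, **schedule)
```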
We introduced a model for Visual Dialog that enables pretraining on large-scale image-text datasets before transferring and finetuning on VisDial. Our model is an adaptation of ViLBERT , and our best single model is pretrained on BooksCorpus , English Wikipedia (at the BERT stage), and on Conceptual Captions , VQA  (at the ViLBERT stage), before finetuning on VisDial, optionally with dense annotations. Our model outperforms prior published results by absolute on NDCG and MRR, achieving state-of-the-art results, and providing a simple baseline for future ‘pretrain-then-transfer’ modeling approaches.
Through careful analysis of our results, we find that the recently released dense annotations for the task do not correlate well with the original ground-truth dialog answers, leading to a trade-off when models optimize for metrics that take into account these dense annotations (NDCG) vs. the original sparse annotations (MRR). This opens up avenues for future research into better evaluation for this task.
Finally, it is worth noting that our model is discriminative – it can pick a good answer from a list of answer options – but cannot generate an answer. In future work, we aim to remove this limitation and develop robust decoding techniques for a powerful generative visual dialog model.
Reproducibility. Code to replicate results from the paper is publicly available at github.com/vmurahari3/visdial-bert.
We thank Jiasen Lu and Stefan Lee for helpful discussions. The Georgia Tech effort is supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE. AD is supported in part by fellowships from Facebook, Adobe, and Snap Inc. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the US Government, or any sponsor.
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding with unsupervised learning,” 2018.
K. Nguyen and H. Daumé III, “Help, Anna! Visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning,” in EMNLP, 2019.