In recent years many methods have been proposed for vision-language tasks such as image and video captioning [13, 30, 55, 56, 61], multimodal retrieval [19, 27, 23, 57, 41, 54, 60], phrase grounding [47, 22, 45, 49], and visual question answering [15, 2, 65, 52, 63]. Language representations for these models tend to be obtained by averaging word embeddings ([57, 45, 44, 27]), feeding features representing each word into an LSTM ([49, 61, 60]), or using word-level or phrase-level attention models ([1, 12, 37, 5, 33]). The word embeddings used in these tasks include a simple one-hot encoding of each word in a vocabulary ([15, 56, 57]), pretrained dense vector representations like Word2Vec or GloVe, and Fisher vectors built on top of these dense representations ([27, 44, 57]). Although more modern embeddings such as FastText, ELMo, and BERT have shown significant performance improvements on language tasks such as sentiment analysis and question answering, many vision-language approaches still use the more dated feature representations.
While there are isolated cases where these language model and feature choices are compared for the same task model ([57, 20]), to our knowledge there exists no comprehensive comparison. To address this neglect of language feature exploration, we provide an all-inclusive experimental survey of embedding, language model, and training choices. We perform experiments using from-scratch, Word2Vec, WordNet retrofitted Word2Vec, FastText, Visual Word2Vec, HGLMM (300-D, 6K-D), InferSent, and BERT representations in addition to a new embedding, GrOVLE, on five vision-language tasks: image-sentence retrieval, visual question answering, phrase grounding, image captioning, and text-to-clip retrieval.
Our goal is to provide insight for vision-language applications based on extensive experiments varying the choices illustrated in Figure 1. Our findings show how to make these choices to take advantage of language features in vision-language work. For example, we find that using an Average Embedding language model, which ignores word ordering, tends to perform better than an LSTM. This suggests that the LSTM overfits to the task it is trained on. However, when training a word embedding from scratch an LSTM performs best. This result is most likely a product of the LSTM learning to predict the next word given previous words, and thereby learning context. Pretrained word vectors likely already provide some semblance of this context information since that is how they are typically trained. The take-aways from all experimental results are summarized in Figure 2.
Relying on word embeddings trained solely on large text corpora can have important consequences. For example, in Word2Vec the words “boy” and “girl” have higher cosine similarity to each other than either has to the word “child.” While this is a subtle difference, it can impact tasks such as image captioning where “girl” can be replaced by “child” when describing a visual scene, but not by “boy.” These nuances are not well captured when using text-only information. To address this, we introduce the Graph Oriented Vision-Language Embedding, GrOVLE, which has been learned specifically for vision-language tasks.
When building GrOVLE, we take into account the differences in the relationships between words when used to describe visual data. We introduce a new relational graph by extracting semantic relationships between words using the Visual Genome dataset, which is annotated with dense descriptions of entities, their attributes, and their relationships to other entities within an image. We use both WordNet and Visual Genome graphs to adapt Word2Vec, through the retrofitting process defined by Faruqui et al.
Finally, in addition to viewing embedding performance for each individual task, we asked: Can an embedding generalize across vision-language tasks? Inspired by multi-task training strategies like PackNet, we train the GrOVLE embedding on all the vision-language tasks in our experiments. The word representation becomes more powerful with task-specific knowledge, as the multi-task GrOVLE ultimately outperforms its single-task trained version, becoming a leading embedding amongst the five tasks. Note that unlike PackNet, GrOVLE operates directly on the word embeddings rather than model weights.
Below we summarize our primary contributions:
Comprehensive experiments exhaustively comparing different word representations, language models, and pretraining and adaptation steps across five common vision-language tasks, providing best practices for future work. See Figure 2 for a summary of our findings.
GrOVLE, a publicly available word embedding which has been specially trained for vision-language tasks (http://ai.bu.edu/grovle).
Key insight into the transferability of word embeddings across the five vision-language tasks through the use of multi-task training.
2 Related Work
To the best of our knowledge, the effect of pretrained embeddings in VL tasks has never before been systematically compared. Visual information has been used in limited ways to improve word embeddings, such as simply concatenating visual features or focusing on abstract scenes. Lazaridou et al. focus on leveraging first order semantic relationships by encouraging alignment between the visual and language embeddings for a predefined set of nouns describing objects. Word embeddings have also been improved by including additional constraints on the learning process or as a post-processing step. These models focus on improving some general sense of word similarity. GrOVLE is different in that it is directly optimized to work well on a variety of vision-language tasks. We focus on how 10 representations compare amongst model and training choices, some of which are considered state-of-the-art for language tasks, such as the recently introduced BERT.
Several vision-language approaches have also tried to improve their language model, rather than the word embeddings, as a way to improve performance. These have included building Fisher vectors on top of pretrained word embeddings [27, 34], constraining a coarse-to-fine word ordering [11, 54], or performing co-reference resolution to identify additional constraints between entities ([58, 45, 28, 6]). Attention mechanisms have also become a popular way to improve performance: word-level attention has been used in image captioning by learning the weights of words using an LSTM or a multi-layered perceptron [61, 12] before being passed to a language generation model. Dual attention has also been used to attend to the question in VQA using feed-forward neural networks. These approaches could be used in conjunction with this work to further improve performance.
3 Language Models
We present three language model options for which we provide experimental results for 8 of 10 different embeddings to determine which language model is best for each task and each embedding (sentence level embeddings cannot be incorporated into some of these architectures).
In Figure 3 an Average Embedding, Self-Attention, and LSTM language architecture are shown. The Average Embedding model mean pools the embeddings of all words W in a given sentence or phrase, forming a single representation. A sample’s pooled vector is then passed through a pair of fully connected layers as shown in the upper left corner of Figure 3.
A more complex language architecture is an LSTM; word representations are individually passed through an LSTM cell, each producing its own hidden state. LSTMs are typically thought of as a “better” architecture choice, since they model the relationship between words in a sentence by maintaining word ordering. We later show this assumption does not hold true across all vision-language tasks.
Lastly, we compare a Self-Attention model that is closely related to the Average Embedding architecture. The primary difference is the pooling layer, which now consists of two steps. First, a context vector C is concatenated with all word embeddings in W of a given sample. Our experiments use the average embedding as context. It is passed through a fully connected layer which applies Softmax to give context “scores” for each word in a sentence. Next, the inner product is taken of these weights and the original word embeddings from W to produce a context weighted sum which is then passed to a pair of fully connected layers.
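The two pooling schemes can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the concatenation order, and the randomly initialized fully connected parameters (`fc_weight`, `fc_bias`) are assumptions made for illustration, and the pair of fully connected layers that follows pooling is omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def average_pool(W):
    # W: (n_words, dim) word embeddings of one sentence or phrase.
    return W.mean(axis=0)

def self_attention_pool(W, fc_weight, fc_bias=0.0):
    # Context vector C: the average embedding, as in the experiments.
    C = average_pool(W)
    # Concatenate C with every word embedding -> (n_words, 2 * dim).
    paired = np.concatenate([W, np.tile(C, (W.shape[0], 1))], axis=1)
    # A fully connected layer yields one scalar per word; Softmax turns
    # the scalars into per-word context scores that sum to one.
    scores = softmax(paired @ fc_weight + fc_bias)
    # Context-weighted sum of the original word embeddings.
    return scores @ W, scores

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 300))             # 5 words, 300-D embeddings
fc_weight = rng.normal(size=600) * 0.01   # hypothetical FC parameters
pooled, scores = self_attention_pool(W, fc_weight)
```

With all-zero fully connected weights the scores are uniform and the Self-Attention pooling reduces to the Average Embedding, which makes the close relationship between the two architectures explicit.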
4 Experimental Setup
In this section we provide details of each vision-language task. The datasets and vision-language task models are described in the appendix, but are referenced in Table 1. We split our experiments into three parts: Pretrained Embeddings (Section 5), Adapted Embeddings (Section 6), and Multi-task Trained Embeddings (Section 7).
4.1 Compared Tasks and Metrics
Image-Sentence Retrieval. The goal is to retrieve relevant sentences given an image, or to retrieve relevant images given a sentence. It is evaluated using Recall@K for three values of K, resulting in six numbers which measure the performance of the model (three for image-to-sentence and three for sentence-to-image). We report the average of these six numbers as a measure of overall performance. All six numbers can be found in the appendix.
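As a rough illustration of how such a mean recall score can be computed, the sketch below assumes a simplified setting with one matching sentence per image (real datasets pair several sentences with each image). The similarity matrix and the cutoffs in `ks` are toy values chosen for this example, not the values used in the experiments.

```python
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j]: similarity of query i to candidate j. The correct
    # candidate for query i is assumed to be candidate i.
    order = np.argsort(-sim, axis=1)           # best candidate first
    gt = np.arange(sim.shape[0])[:, None]
    rank = np.argmax(order == gt, axis=1)      # 0-based rank of the match
    return 100.0 * np.mean(rank < k)

def mean_recall(sim, ks):
    # Average over both retrieval directions and all cutoffs.
    scores = [recall_at_k(s, k) for s in (sim, sim.T) for k in ks]
    return float(np.mean(scores))

# Toy image-by-sentence similarity matrix (hypothetical values).
sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.3, 0.7],
                [0.1, 0.2, 0.6]])
ks = (1, 2, 3)
overall = mean_recall(sim, ks)
```

Transposing the similarity matrix swaps the roles of queries and candidates, so the same function covers both image-to-sentence and sentence-to-image retrieval.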
Phrase Grounding. In phrase grounding the task is to find the location of a phrase given an image it is known to exist in. Performance is measured using accuracy, where a box is deemed to be successfully localized if it has at least 0.5 intersection over union (IOU) with the ground truth box.
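The localization criterion can be sketched as follows. The `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration; only the 0.5 IOU threshold comes from the text.

```python
def box_iou(a, b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    # A phrase counts as localized when its predicted box reaches the
    # IOU threshold with the ground truth box.
    hits = sum(box_iou(p, g) >= threshold
               for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(ground_truth)

predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [(0, 0, 10, 10), (5, 0, 15, 10)]
accuracy = grounding_accuracy(predicted, ground_truth)
```

In this toy example the second prediction overlaps its ground truth box with an IOU of one third, so only one of the two phrases counts as correctly grounded.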
Text-to-Clip. For text-to-clip, the goal is to locate the temporal region (the video clip) that is described by a query. Performance is measured using a mix of Recall@K, for two values of K, and the average IOU the predicted temporal location of a query phrase has with its ground truth temporal segments. We use the evaluation code provided by Hendricks et al. in our experiments. We report the average of these three metrics as an overall score; all metrics are reported in the appendix.
Image Captioning. The goal of image captioning is to produce natural language which describes an image scene with a well-formed sentence. The produced captions are evaluated against a set of reference sentences for each image. We report the commonly used evaluation metric BLEU-4, with CIDEr and METEOR results available in the appendix.
Visual Question Answering. In VQA, the goal is to produce a free-form natural language answer given an image and question. This open-ended task consists of three types of questions: yes/no, number, and other. The accuracy of the model is determined by the number of correctly answered questions. We evaluate on the test-dev set.
5 Pretrained Word Embeddings
We begin our exhaustive search across language feature choices with pretrained word embeddings. These offer an initial comparison across techniques that do not use forms of post-processing to adapt embeddings, but rather learn vectors with different model architectures and training objectives. Word2Vec, FastText, InferSent, and BERT are reviewed before results are discussed.
5.1 Word Level Representations
Word2Vec is one of the most widespread word embeddings in use since its release. It builds off of the probabilistic feed-forward Neural Network Language Model (NNLM), which is composed of input, projection, hidden, and output layers. The input is defined by a 1-out-of-V vector where V is the vocabulary size. The projection matrix is shared amongst all words and the computational complexity between hidden and output layers is reduced using a hierarchical Softmax where the vocabulary is represented as a Huffman binary tree.
Word2Vec introduced two variations of the NNLM model, with the primary distinction being that the non-linear hidden layer is removed and the projection layer is shared amongst all words, so that the word vectors are averaged. This leads to the first model, Continuous Bag of Words (CBOW), in which the current word is predicted given four previous and four future words. The second model, Skip-Gram, instead predicts the context words given the current word. Its objective thus maximizes the classification of a word based on the words surrounding it. Skip-Gram tends to perform better with a larger range of context words, but this also results in greater computational complexity.
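The difference between the two objectives is easiest to see in the training examples they consume. The sketch below only builds those examples and does not train a model; the helper names and the toy sentence are illustrative assumptions.

```python
def cbow_examples(tokens, window=4):
    # CBOW: predict the current word from its surrounding words.
    # Each example is (context words, target word).
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=4):
    # Skip-Gram: predict each surrounding word from the current word.
    # Each example is (current word, one context word).
    return [(target, c)
            for context, target in cbow_examples(tokens, window)
            for c in context]

tokens = "a girl rides a horse".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

Skip-Gram produces one example per (word, context word) pair, so widening the window grows its training set faster than CBOW's, which is consistent with the higher computational cost noted above.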
FastText is an extension of the Word2Vec model in which the atomic entities of the embeddings are no longer words, but are instead character n-grams. N can be decided given the task and time or space constraints. A word is represented as the sum of its character n-gram vectors in addition to the word vector itself. This change of reference can improve performance due to better representation of rare, misspelled, and out-of-vocabulary words, as the n-grams create more neighbors for use during training.
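A minimal sketch of this composition is given below. The boundary markers `<` and `>`, the fixed n-gram length, and the small lookup tables are assumptions for illustration; a real implementation would learn the vectors and typically use several n-gram lengths.

```python
import numpy as np

def char_ngrams(word, n=3):
    # Boundary markers let prefixes and suffixes be told apart
    # (a convention assumed here).
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word, ngram_vectors, word_vectors, dim=4):
    # Sum of the word's n-gram vectors plus its own vector when known.
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        vec = vec + ngram_vectors.get(gram, 0.0)
    return vec + word_vectors.get(word, 0.0)

# Hypothetical lookup tables with toy values.
ngram_vectors = {"<gi": np.ones(4), "irl": 2 * np.ones(4)}
word_vectors = {"girl": np.ones(4)}
known = word_vector("girl", ngram_vectors, word_vectors)
unseen = word_vector("girls", ngram_vectors, word_vectors)
```

The out-of-vocabulary word "girls" still receives a non-zero vector because it shares the n-grams `<gi` and `irl` with "girl".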
5.2 Sentence Level Representations
InferSent uses a bi-directional LSTM with max-pooling to create a sentence-level embedding. It is trained using the Natural Language Inference (NLI) task, in which the goal is to categorize natural language English sentence (premise, hypothesis) pairs into three classes: entailment, contradiction, and neutral. The NLI model architecture separately encodes each sentence of the input pair using a BiLSTM. Afterwards, the pair’s sentences form a shared representation composed of the concatenation of the two vectors, their element-wise product, and their absolute element-wise difference. This vector is then fed into a three-class classifier, defined by several FC layers and a Softmax.
BERT is currently the state-of-the-art word embedding model. Its language encoder is a bi-directional multi-layered Transformer which directly follows the original Transformer architecture. The embedding is trained on two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction. The goal of MLM is to predict the original vocabulary ID of a masked word given its context words. Next Sentence Prediction is the binary classification task of determining whether the second sentence of an input pair is the true next sentence of the first.
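The shared pair representation can be written in a few lines. This sketch assumes the two sentence vectors have already been produced by the encoder; the function names and the ordering of the concatenated parts (taken from the description above) are illustrative.

```python
import numpy as np

def max_pool(hidden_states):
    # hidden_states: (n_words, dim) BiLSTM outputs for one sentence.
    # Max-pooling keeps the largest value of each dimension.
    return hidden_states.max(axis=0)

def pair_features(u, v):
    # Concatenation of both sentence vectors, their element-wise
    # product, and their absolute element-wise difference.
    return np.concatenate([u, v, u * v, np.abs(u - v)])

premise = max_pool(np.array([[1.0, -3.0], [0.5, -2.0]]))
hypothesis = np.array([3.0, 4.0])
features = pair_features(premise, hypothesis)
```

For sentence vectors of dimension d the classifier input therefore has dimension 4d.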
|Task||Image-Sentence Retrieval||Phrase Grounding||Text-to-Clip||Image Captioning||VQA|
|Dataset||Flickr30K ||MSCOCO ||Flickr30K||ReferIt ||DiDeMo ||MSCOCO ||VQA |
|Method||Embedding Network ||CITE ||ARNet ||EtEMN |
|(a)||Training from scratch|
|Average Embedding + ft||71.5||78.2||70.85||53.29||32.58||–||–||–|
|Self-Attention + ft||71.9||79.9||70.94||53.54||33.26||–||–||–|
|LSTM + ft||69.0||78.2||70.55||53.58||33.94||28.5||94.0||61.35|
|Average Embedding + ft||73.0||80.7||70.62||53.24||32.01||–||–||–|
|Self-Attention + ft||73.1||80.6||71.23||53.87||33.17||–||–||–|
|LSTM + ft||68.5||80.1||71.09||53.95||32.51||28.3||93.2||61.66|
We start with an embedding learned from scratch with random initialization as our first baseline. Results demonstrate that while many previous works use scratch embeddings, this choice greatly impacts performance in vision-language tasks. Unsurprisingly, when comparing the first lines of Table 1(a,b), we find that using Word2Vec rather than an embedding trained from scratch tends to improve performance. This is more important when considering a larger vocabulary, as seen by comparing the experiments on DiDeMo (text-to-clip) and ReferIt (phrase grounding), whose embeddings trained from scratch using their smaller vocabulary compare favorably to Word2Vec.
The original Word2Vec embedding pretrained on Google News can be considered a second baseline. While FastText is a more modern embedding, Word2Vec only falls behind within a point or two across all tasks, and even outperforms or performs as well as FastText for certain tasks (text-to-clip, image captioning). This validates works which extend Word2Vec such as Retrofitting, HGLMM Fisher Vectors, and GrOVLE, as Word2Vec may still provide advantages with additional adaptations; results for adapted embeddings follow in Section 6.
Table 1 also contains a comparison of language model variants across the five vision-language tasks we evaluate on. We see that fine-tuning a word embedding on a vision-language task can have dramatic effects on the performance of the language model (5-10% increase to mean recall on image-sentence retrieval).
When comparing the architecture choices from Figure 3 we see that for retrieval-based tasks (where the output is not free-form text) the Average Embedding and Self-Attention models perform better than a simple LSTM-based approach, with Self-Attention being best on average. This is especially notable since these two models have fewer parameters and are faster to compute than an LSTM. Choosing to use a Self-Attention language model in future vision-language work will not only boost metrics, but will also be a more time-efficient option. The only apparent exception to this is the text-to-clip task. This may be because it is a video-based task which contains some temporal language in its queries, so the ordering of words may be especially important to identifying which video clip to select compared to other retrieval-based tasks. While all language models perform closely on ReferIt phrase grounding, this still suggests that there is no need to use the more complex LSTM language model without additional modification.
Lastly, sentence level embeddings InferSent and BERT are compared in Table 1(d); results are without fine-tuning. Fine-tuning would likely improve performance, but is difficult to incorporate due to size (the larger BERT model contains a total of 340M parameters while the well-known VGG-16 network uses 138M; fine-tuning the top layers of BERT still requires loading the full model). The two are comparable to each other with the exception of phrase grounding accuracy on Flickr30K Entities; BERT surprisingly outperforms InferSent by 11.55%. Neither InferSent nor BERT provides the best result on any task, and thus neither is a leading option for vision-language tasks.
InferSent and BERT reach comparable values to the best Word2Vec models for image-sentence retrieval on Flickr30K, while performing more poorly on the MSCOCO dataset. For the remaining retrieval tasks, metrics fall 1-3 points below the best performing model and embedding combination, again noting the unusual exception of InferSent on phrase grounding of Flickr30K Entities, which drops significantly below scratch performance.
6 Adapted Word Embeddings
Since the introduction of Word2Vec, several enhancement techniques have been proposed. In this section we explore adaptations of Word2Vec which use different methods to post-process embeddings. Extensions use language enhancements (WordNet retrofitting, HGLMM), visual enhancements (Visual Word2Vec), or both (GrOVLE). We now briefly discuss these enhancements.
6.1 Visual Word2Vec
Visual Word2Vec is a neural model designed to ground the original Word2Vec representation with visual semantics. Its goal is to maximize the likelihood of a visual context given the set of words used to describe it, thus pushing word representations used to describe the same visual scene closer together. Clusters are first learned offline using features from abstract clip-art scenes such as the locations of objects, pose, expressions, and gaze to provide surrogate class labels. Word vectors initialized with Word2Vec are then passed through a single hidden layer network. Afterwards, a learned output weight matrix and Softmax are applied to predict the visual semantic class the words belong to.
6.2 HGLMM Fisher Vectors
Another post-processed embedding we use for this set of experiments is the Hybrid Gaussian-Laplacian Mixture Model (HGLMM) representation built off of Fisher vectors for Word2Vec. While bag-of-words pooling is simple and commonly applied, Fisher vectors change this pooling technique and achieve state-of-the-art results on many applications. Fisher vectors instead concatenate the gradients of the log-likelihood of local descriptors (which in this case are the Word2Vec vectors) with respect to the HGLMM parameters. HGLMM is a weighted geometric mean of the Gaussian and Laplacian distributions and is fit using Expectation Maximization. Following [57, 44], we reduce the dimensions of the original encodings (18K-D) to 6K-D or 300-D using PCA, as it has been found to improve numerical stability on VL tasks (except for experiments on ReferIt, which we reduce to 2K-D due to its small vocabulary size).
6.3 GrOVLE: Graph Oriented Vision-Language Embedding
We provide a new embedding, GrOVLE, which adapts Word2Vec using two knowledge bases: WordNet and Visual Genome. This builds off of the retrofitting work of Faruqui et al., in which WordNet was one of the lexicon options. The Visual Genome relational graph is novel, as it creates a language graph that captures how words are used in visual contexts, unlike any of the language databases used in that work. We briefly review retrofitting and then detail the construction of our original Visual Genome word relation graph. GrOVLE provides a vision-language enhanced embedding and outperforms Visual Word2Vec across many tasks. The released version of GrOVLE is multi-task trained, creating an additional level of VL knowledge, later described in Section 7.
6.3.1 Retrofitting Word Embeddings
In this section we review the approach of Faruqui et al., which proposed a graph based learning technique to “retrofit” additional semantic knowledge onto pretrained word embeddings.
Given a vocabulary $V$ with words $w_1, \dots, w_n$ and its corresponding word embedding $\hat{Q}$, where $\hat{q}_i$ is the embedding for $w_i$, belief propagation is performed to obtain a new embedding $Q$ which minimizes the distances between the embedding representing each word and its neighbors. These neighbors are defined as edges between words in a graph. $L_2$ regularization is performed between the original and new word embeddings to help prevent overfitting. We find that this regularization is necessary whenever we are updating the word embeddings (we also use it during multi-task training described in Section 7). We use the same regularization parameters as Faruqui et al. and refer the reader to their work to view the final objective function.
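For intuition, the iterative update commonly used for retrofitting can be sketched as below. This is an assumption-laden illustration rather than the authors' code: it takes the weight on the original embedding to be 1 and the weight on each edge to be the inverse of the word's degree, which is our understanding of the settings in the retrofitting work, and the function name and toy graph are hypothetical.

```python
import numpy as np

def retrofit(Q_hat, edges, iterations=10):
    # Q_hat: (n_words, dim) original embeddings.
    # edges: dict mapping a word index to a list of neighbor indices.
    Q = Q_hat.copy()
    for _ in range(iterations):
        for i, nbrs in edges.items():
            if not nbrs:
                continue  # words without neighbors keep their vector
            # Assumed weights: alpha_i = 1 on the original vector and
            # beta_ij = 1 / degree(i) on each neighbor.
            beta = 1.0 / len(nbrs)
            Q[i] = (beta * Q[nbrs].sum(axis=0) + Q_hat[i]) / (beta * len(nbrs) + 1.0)
    return Q

# Toy graph: two linked words that start two units apart.
Q_hat = np.array([[0.0, 0.0], [2.0, 0.0]])
edges = {0: [1], 1: [0]}
Q = retrofit(Q_hat, edges, iterations=50)
```

Each update places a word halfway between its original vector and the mean of its current neighbors, so linked words move closer together while the pull toward the original embedding keeps them from collapsing onto one point.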
|Task||Image-Sentence Retrieval||Phrase Grounding||Text-to-Clip||Image Captioning||VQA|
|(a)||Word2Vec + wn |
|Average Embedding + ft||72.0||79.2||70.51||53.93||33.24||–||–||–|
|Self-Attention + ft||72.4||80.0||70.70||53.81||33.65||–||–||–|
|LSTM + ft||69.3||78.9||70.80||53.67||34.16||28.6||93.3||61.06|
|(b)||GrOVLE|
|Average Embedding + ft||72.3||80.2||70.77||53.99||33.71||–||–||–|
|Self-Attention + ft||72.1||80.5||70.95||53.75||33.14||–||–||–|
|LSTM + ft||69.7||78.8||70.18||53.99||34.47||28.3||92.5||61.22|
|(c)||Visual Word2Vec |
|Average Embedding + ft||66.8||78.7||70.61||53.14||31.73||–||–||–|
|Self-Attention + ft||68.8||79.2||71.07||53.26||31.15||–||–||–|
|LSTM + ft||66.7||74.5||70.70||53.19||32.29||28.8||94.0||61.15|
|(d)||HGLMM (300-D) |
|Average Embedding + ft||71.0||79.8||70.64||53.71||32.62||–||–||–|
|Self-Attention + ft||71.8||80.4||70.51||53.83||33.44||–||–||–|
|LSTM + ft||69.5||77.9||70.37||53.10||33.85||28.7||94.0||61.44|
|(e)||HGLMM (6K-D) |
|Average Embedding + ft||73.5||80.9||70.83||53.36||32.66||–||–||–|
|Self-Attention + ft||75.1||80.6||71.02||53.43||33.57||–||–||–|
|LSTM + ft||68.0||79.4||70.38||53.89||34.62||28.0||92.8||60.58|
6.3.2 Word Relation Graph Construction
Below we describe the methods we use to create the edges between words which share some semantic relation. We use these edges to retrofit the word embeddings with the process described in Section 6.3.1. Of the lexicons provided by Faruqui et al., we used only the WordNet graph, as it contains the largest vocabulary with the most edges. A joint lexicon is built with WordNet and Visual Genome as opposed to successively retrofitting the two; this minimized forgetting of the first and thus improved performance.
WordNet is a hierarchical lexical database which organizes nouns, adjectives, verbs, and adverbs into sets of synonyms (synsets) and uses semantic relations to associate them. As in Faruqui et al., we construct a graph by creating links between words if they have a synonym, hypernym, or hyponym relationship.
Visual Genome contains a wealth of language annotations for 108K images: descriptions of entities in an image, their attributes, relationships between multiple entities, and whole image and region-based QA pairs. Each instance in these annotations is considered a sample which we tokenize and remove stopwords from. We compute co-occurrence statistics over pairs of words within the sample for pairs that occur more than 50 times, resulting in 322,928 pairs for 12,849 words. For each word we compute a pointwise mutual information (PMI) score for all pairs it occurs in, and create links between the word and its top ten scoring partners. This creates a graph where words that occur frequently together when describing visual data are linked.
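The graph construction can be sketched as follows. The function name, the choice to count each word pair at most once per sample, and the toy samples are assumptions for illustration; the thresholds of 50 occurrences and ten links per word come from the text and are exposed as parameters.

```python
import math
from collections import Counter
from itertools import combinations

def build_pmi_graph(samples, min_count=50, top_k=10):
    # samples: list of token lists with stopwords already removed.
    word_counts, pair_counts = Counter(), Counter()
    for tokens in samples:
        vocab = sorted(set(tokens))   # count each word once per sample
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))
    n = len(samples)
    scored = {}
    for (a, b), c in pair_counts.items():
        if c > min_count:
            # PMI = log( p(a, b) / (p(a) * p(b)) ) with counts over samples.
            pmi = math.log(c * n / (word_counts[a] * word_counts[b]))
            scored.setdefault(a, []).append((pmi, b))
            scored.setdefault(b, []).append((pmi, a))
    # Keep the top_k highest scoring partners of every word.
    return {w: [v for _, v in sorted(nbrs, reverse=True)[:top_k]]
            for w, nbrs in scored.items()}

samples = ([["dog", "grass"]] * 3 + [["dog", "frisbee"]] * 2
           + [["grass", "tree"]] * 2)
graph = build_pmi_graph(samples, min_count=1, top_k=1)
```

In the toy data "dog" co-occurs more often with "grass" than with "frisbee", yet PMI links it to "frisbee", because "grass" is also frequent in samples without a dog; this is the sense in which PMI rewards specific rather than merely frequent pairings.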
|Task||Image-Sentence Retrieval||Phrase Grounding||Text-to-Clip||Image Captioning||VQA|
|GrOVLE w/o multi-task pretraining||64.7||75.0||70.53||52.15||34.45||28.5||92.7||61.46|
|+ multi-task pretraining w/o target task||65.8||76.4||70.82||52.21||34.57||28.8||93.3||61.47|
|+ multi-task pretraining w/ target task||66.2||80.2||70.87||52.64||34.82||28.5||92.7||61.53|
|+ multi-task pretraining w/ target task + ft||72.6||81.3||71.57||54.51||35.09||28.7||93.2||61.46|
|Task||Image-Sentence Retrieval||Phrase Grounding||Text-to-Clip||Image Captioning||VQA|
|Additional Models||SCAN ||QA R-CNN ||TGN ||BUTD ||BAN|
|Training from scratch||72.8||83.2||68.56||50.23||43.91||35.2||109.8||68.98|
|FastText + ft||72.5||83.8||69.27||53.01||44.21||35.2||110.3||69.91|
|GrOVLE (w/o multi-task pretraining) + ft||72.7||84.1||70.03||53.88||45.26||35.1||110.4||69.36|
|+ multi-task pretraining w/ target task + ft||76.2||84.7||71.08||54.10||43.61||35.7||111.6||69.97|
We see a small, but consistent improvement across most of the vision-language tasks using GrOVLE as seen in Table 2(b). These changes result in an embedding with comparable performance to the HGLMM 6K-D features, which are reported in Table 2(e). However, our word embedding tends to perform better when embeddings are the same size (300-D). For the generation-based tasks (captioning and VQA), the benefits of using adapted embeddings are less clear. This may simply be an artifact of the challenges in evaluating these tasks (e.g., the captions are improving in a way the metrics don’t capture). Also, models that more carefully consider the effect of each word in a caption may benefit more from our improved features (e.g., [41, 60]).
While Visual Word2Vec is an established visually-enhanced embedding, its published results did not include these vision-language tasks. Visual Word2Vec performs comparably amongst results for generation tasks (image captioning and VQA), but these tasks have little variance in results, with less than a point of difference across the adapted embeddings. The small gain provided in generation tasks by Visual Word2Vec does not outweigh the drops in performance across other tasks, such as the significant mean recall drop of 6.3 compared to HGLMM’s 6K-D Self-Attention result (line two of Table 2(c) and Table 2(e)) for image-sentence retrieval on Flickr30K. For comparison, GrOVLE’s Self-Attention result in Table 2(b) is only 3 points lower.
Finally, we report results using HGLMM of different dimensions. HGLMM 300-D features are used for a fairer comparison to other embeddings. While the HGLMM 6K-D representation primarily results in the highest performance, it performs more poorly on generation tasks and also results in high variance. For example, column one in Table 2(e) shows a range of 7.1 in mean recall, unlike GrOVLE which has a range of 2.6.
7 Multi-task Training
A drawback of using pretrained word embeddings like Word2Vec or the retrofitting process is that they are trained solely on text data. While our Visual Genome Graph provides some general information on how words in our vocabulary are used for visual data, it doesn’t provide any sense of visual similarity between semantically different words that may be necessary to perform a particular vision-language task. To address this, we fine-tune GrOVLE across the five VL tasks.
We provide results for four-task and five-task multi-task trained embeddings. The four-task experiments are performed with the final task embedding fixed to demonstrate how well the embeddings would generalize to new tasks. We also provide results for pretraining on five tasks with and without fine-tuning during the last task. Similarly to PackNet, for each dataset/task in the four-task and five-task experiments, we keep the most informative features frozen when training any subsequent task, diminishing the effect of catastrophic forgetting when fine-tuning on a new task. The number of features frozen per task is set from the embedding size and the number of tasks. We evenly split the features for tasks with multiple datasets. Features that were tuned on a task are ranked according to variance and frozen before training on the next dataset/task. The end result is a pretrained word embedding which can be “dropped in” to existing models to improve performance across many vision-language tasks.
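One possible reading of this freezing scheme is sketched below. Several details are assumptions made for illustration and are not confirmed by the text: variance is measured per embedding dimension across the vocabulary, the highest-variance dimensions are treated as the most informative, and frozen dimensions are protected by masking the gradient. The function names and toy values are hypothetical.

```python
import numpy as np

def select_features_to_freeze(embedding, frozen, k):
    # embedding: (vocab, dim) matrix after tuning on the current task.
    # frozen: boolean mask over dimensions frozen by earlier tasks.
    variance = embedding.var(axis=0)
    variance[frozen] = -np.inf          # never re-select frozen dimensions
    chosen = np.argsort(-variance)[:k]  # assumed: high variance = informative
    new_mask = frozen.copy()
    new_mask[chosen] = True
    return new_mask

def masked_update(embedding, gradient, frozen, lr=0.1):
    # Frozen dimensions receive no update on subsequent tasks.
    return embedding - lr * gradient * (~frozen)

E = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 2.0, 3.0, 0.0],
              [2.0, 4.0, 6.0, 0.0]])
mask = select_features_to_freeze(E, np.zeros(4, dtype=bool), k=2)
updated = masked_update(E, np.ones_like(E), mask)
```

A simple way to choose `k` under these assumptions would be the embedding size divided by the number of tasks, so that every task ends up owning an equal share of the dimensions.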
To verify that the multi-task GrOVLE performance improvements generalize across task model architecture, we provide results using additional task models in Table 4. More results can be found in the appendix.
Table 3 reports results of the multi-task training procedure described above. We use the best performing language model in our comparisons for each task: Self-Attention for image-sentence retrieval and phrase grounding, and the LSTM language model for text-to-clip, image captioning, and VQA. The first line of Table 3 reports the results of the original fixed GrOVLE embedding, which should be considered the baseline. The second line of Table 3 reports performance when the four-task pretrained GrOVLE is fixed when used in the target task, the task currently being run. The third and fourth lines of Table 3 report the results of our embedding when it was trained on all five tasks, and kept fixed or fine-tuned for the target task, respectively.
The results of lines three and four demonstrate that our improved embedding tends to transfer better when applied with fine-tuning during the target task. We find similar trends in performance improvements across tasks: larger gains occur for image-sentence retrieval with +7.9 mean recall for the Flickr30K dataset and +6.3 for MSCOCO. All other tasks have performance improvements of roughly one point or less, showing that while the vision-language tasks appear to transfer well without harming performance, they are leveraged most in image-sentence retrieval, with an exception of phrase grounding accuracy on ReferIt (+2.36%).
Table 4 provides more models per task and demonstrates consistent results: embeddings can significantly affect performance and GrOVLE variants are still the best embedding overall. As we move down the table we find even larger performance improvements made by using the five-task pretrained GrOVLE with fine-tuning than in Table 3. This multi-task variant is the best performing on most tasks (text-to-clip in Table 4 being the exception), thus we release this embedding for public use.
We believe there are five major findings in our experiments that researchers should keep in mind when considering the language component for vision-language tasks:
On retrieval-style tasks, the Average Embedding and Self-Attention language models tend to outperform a simple LSTM.
Fine-tuning a word embedding for a task can significantly impact performance.
For standard vision-language metrics, language features matter most on retrieval and grounding tasks, and less on text-to-clip and generation tasks.
Word embeddings trained on outside vision-language datasets and tasks generalize to other applications.
Multi-task trained GrOVLE is the leading embedding option for four of the five vision-language tasks when used with the best corresponding language model.
We have provided evidence that language and vision features should be treated equally when used in vision-language tasks. When using the best embedding, language model, and training choices, performance for tasks with more variance can greatly improve, and tasks with more stubborn performance metrics can be nudged further. These insights are proposed to benefit future vision-language work. Along with these findings, we have introduced GrOVLE, which incorporates hierarchical language relations from WordNet as well as language with visual context from Visual Genome. In addition to these adaptations, we perform multi-task training with five common vision-language tasks to further incorporate nuanced visual information. This provides a 300-D embedding with vision-language enhancements that is comparable to current embeddings and provides low variance results.
We would like to thank the reviewers for their helpful suggestions. This work is supported in part by DARPA and NSF awards IIS-1724237, CNS-1629700, CCF-1723379.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §1, §2, Table 4, §9.3, §9.3, §9.3, Table 19.
-  (2015) VQA: Visual Question Answering. In ICCV, Cited by: §1, §4.1.
-  A neural probabilistic language model. Journal of Machine Learning Research 3, pp. 1137–1155. Cited by: §5.1.
-  (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. External Links: Cited by: §1, §1, §5.1, Table 1.
-  (2018) Temporally grounding natural sentence in video. In EMNLP, Cited by: §1, Table 4, §9.2, §9.3, Table 14.
-  (2017) Query-guided regression network with context policy for phrase grounding. In ICCV, Cited by: §2.
-  (2018) Regularizing rnns for caption generation by reconstructing the past with the present. In arXiv:1803.11439v2, Cited by: Table 1, §9.1, §9.2, Table 15, Table 16, Table 17.
-  (2017) Supervised learning of universal sentence representations from natural language inference data. In , Cited by: §1, §5.2, Table 1.
-  (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, Cited by: §9.2.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In arXiv:1810.04805v1, Cited by: §1, §1, §2, §5.2, Table 1.
-  (2018) VSE++: improving visual-semantic embeddings with hard negatives. In BMVC, Cited by: §2.
-  (2018) Image captioning with word level attention. 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1278–1282. Cited by: §1, §2.
-  (2014) From captions to visual concepts and back. arXiv:1411.4952. Cited by: §1.
-  (2015) Retrofitting word vectors to semantic lexicons. In NAACL, Cited by: §1, §1, §2, §6.3.1, §6.3.2, §6.3.2, §6.3, Table 2.
-  (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, Cited by: §1.
-  (2017) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In , Cited by: Table 1, §9.1.
-  (2006) The IAPR TC-12 benchmark – a new evaluation resource for visual information systems. Cited by: §9.1.
-  (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §9.2.
-  Localizing moments in video with natural language. In ICCV, Cited by: §1, §4.1, §5.3, Table 1, §9.1, §9.2.
-  (2018) Discriminative learning of open-vocabulary object retrieval and localization by negative phrase augmentation. In EMNLP, Cited by: §1, Table 4, §9.3, Table 10.
-  (2017) Learning to reason: end-to-end module networks for visual question answering. CoRR, abs/1704.05526 3. Cited by: Table 1, §9.2.
-  (2016) Natural language object retrieval. In CVPR, Cited by: §1, §9.1.
-  (2018) Learning semantic concepts and order for image and sentence matching. In CVPR, Cited by: §1.
-  (2014) ReferItGame: referring to objects in photographs of natural scenes. In EMNLP, Cited by: Table 1, §9.1.
-  Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, Cited by: §2.
-  (2018) Bilinear attention networks. In NeurIPS, Cited by: Table 4, §9.3, Table 20.
-  (2015) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. In CVPR, Cited by: §1, §1, §2, §6.2, Table 2.
-  (2014) What are you talking about? text-to-image coreference. In CVPR, Cited by: §2.
-  (2016) Visual word2vec (vis-w2v): learning visually grounded word embeddings using abstract scenes. In CVPR, Cited by: §1, §2, §6.1, Table 2.
-  (2017) Dense-captioning events in videos. In ICCV, Cited by: §1.
-  (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. Cited by: §1, §6.3.2, §9.3.
-  (2015) Combining language and vision with a multimodal skip-gram model. In NAACL, Cited by: §2.
-  (2018) Stacked cross attention for image-text matching. In ECCV, Cited by: §1, Table 4, §9.3, §9.4, Table 9.
-  (2016) RNN fisher vectors for action recognition and image annotation. In ECCV, Cited by: §2.
-  (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: Table 1, §9.1.
-  (2018) Temporal modular networks for retrieving complex compositional activities in videos. In ECCV, Cited by: §9.2.
-  (2016) Hierarchical question-image co-attention for visual question answering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 289–297. External Links: Cited by: §1.
-  (2018) PackNet: adding multiple tasks to a single network by iterative pruning. In CVPR, Cited by: §1, §7.
-  (2013) Linguistic regularities in continuous space word representations. In NAACL, Cited by: §1, §1, §5.1, Table 1.
-  (1995) Wordnet: a lexical database for english. Communications of the ACM. Cited by: §6.3.2.
-  (2017) Dual attention networks for multimodal reasoning and matching. In CVPR, Cited by: §1, §2, §6.4, §9.2.
-  (2014) GloVe: global vectors for word representation. In EMNLP, Cited by: §1.
-  (2018) Deep contextualized word representations. In NAACL, Cited by: §1.
-  (2018) Conditional image-text embedding networks. In ECCV, Cited by: §1, Table 1, §6.2, §9.2, Table 11, Table 12, Table 13, Table 5.
-  (2017) Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, Cited by: §1, §2.
-  (2018) Revisiting image-language embeddings for open-ended phrase detection. arXiv:1811.07212. Cited by: §9.2, §9.3.
-  (2017-05) Flickr30K Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV 123 (1), pp. 74–93. Cited by: §1, Table 1, §9.1, §9.1.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §9.3, §9.3.
-  (2016) Grounding of textual phrases in images by reconstruction. In ECCV, Cited by: §1, §9.1.
-  (2016) Training region-based object detectors with online hard example mining. In CVPR, Cited by: §9.3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §9.2.
-  (2016) Solving Visual Madlibs with Multiple Cues. In BMVC, Cited by: §1.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010. Cited by: §5.2.
-  (2016) Order embeddings of images and language. In ICLR, Cited by: §1, §2.
-  (2015) Sequence to sequence – video to text. In ICCV, Cited by: §1.
-  (2015) Show and tell: a neural image caption generator. In CVPR, Cited by: §1, §9.2, §9.3, §9.4, Table 18.
-  (2017) Learning two-branch neural networks for image-text matching tasks. arXiv:1704.03470. Cited by: §1, §1, Table 1, §6.2, §9.1, §9.1, §9.2, §9.2, Table 5, Table 6, Table 7, Table 8.
-  (2016) Structured matching for phrase localization. In ECCV, Cited by: §2.
-  (2017) R-C3D: region convolutional 3d network for temporal activity detection. In ICCV, Cited by: §9.2.
-  (2019) Multilevel language and vision integration for text-to-clip retrieval. In AAAI, Cited by: §1, §6.4.
-  (2015) Show, attend and tell: neural image caption generation with visual attention. arXiv:1502.03044. Cited by: §1, §2.
-  (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, pp. 67–78. Cited by: Table 1, §9.1.
-  (2015) Visual Madlibs: Fill in the blank Image Generation and Question Answering. ICCV. Cited by: §1.
-  (2014) Improving lexical embeddings with semantic knowledge. In ACL, Cited by: §2.
-  (2016) Yin and Yang: balancing and answering binary visual questions. In CVPR, Cited by: §1.
-  (2018) Deep cross-modal projection learning for image-text matching. In ECCV, Cited by: §9.2.
Flickr30K. This dataset consists of 32K images obtained from the Flickr website, each of which has been annotated with five descriptive captions. We use the splits of Plummer et al., which separate the dataset into 30K/1K/1K train/test/validation images, for the image-sentence retrieval and phrase grounding tasks.
MSCOCO. This dataset contains 123K images in its training and validation sets (80K/40K images, respectively), each of which is annotated with five descriptive captions. For the image-sentence retrieval experiments, we use the test/validation splits from Wang et al., which consist of 1K images for each split, for a total of 2K images, randomly sampled from the validation set. For the image captioning experiments, we use the splits from Chen et al., which reserve 5K images each for validation and testing.
Flickr30K Entities. This dataset augments the Flickr30K dataset with 276K bounding boxes which are linked to noun phrases in the descriptive captions. We use the same splits as the Flickr30K dataset, resulting in 14.5K instances across the 1K images in the test set for the phrase grounding task. Following [47, 49, 57], when a phrase is linked to multiple boxes we use the union of those bounding boxes as its ground truth box.
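The union-of-boxes convention for phrases linked to several boxes can be sketched as follows. This is a minimal illustration, not the authors' code: the `(x_min, y_min, x_max, y_max)` box layout and the name `union_box` are assumptions made for the example.

```python
from typing import Iterable, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def union_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest axis-aligned box that encloses every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("a phrase must be linked to at least one box")
    x_min = min(b[0] for b in boxes)
    y_min = min(b[1] for b in boxes)
    x_max = max(b[2] for b in boxes)
    y_max = max(b[3] for b in boxes)
    return (x_min, y_min, x_max, y_max)


# A phrase such as "two dogs" linked to two separate boxes.
ground_truth = union_box([(10, 20, 50, 60), (40, 10, 80, 30)])
print(ground_truth)
```

A phrase linked to a single box is returned unchanged, so the same function can be applied to every annotated phrase.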
ReferIt. This dataset augments the 20K images from the IAPR TC-12 dataset with 120K region descriptions. We use the splits of Hu et al., which divide the images evenly into train/validation and test sets (10K each), resulting in about 60K instances in each split.
DiDeMo. This dataset consists of just over 10,000 videos, each of which has between 3 and 5 video segment descriptions. We use the splits provided by Hendricks et al., which divide the videos into sets of 8.4K/1K/1K for train/test/validation.
VQA v2. This dataset augments images from MSCOCO with QA pairs. The training, validation and test image sets contain 83K, 41K, and 81K images, respectively. This constitutes 444K, 214K, and 448K questions for the training/validation/testing splits. Each training and validation question has ten answers provided.
9.2 Task Methods
Image-Sentence Retrieval. We use a modified implementation of the Embedding Network provided by the authors in our experiments (https://github.com/lwwang/Two_branch_network). This model uses two branches, one for text and one for images, to learn a projection to a shared embedding space where Euclidean distance is used to measure similarity between images and sentences. We use the default parameters and data processing in the authors’ implementation, except that we compute the visual representation for each image using a 152-layer ResNet which has been trained on ImageNet. Additionally, we use 448x448 crops rather than the 224x224 pixel crops used by Wang et al., as done in prior work [66, 41]. Following [57, 66, 41], we keep the CNN parameters fixed for a fair comparison. By default this model uses an Average Embedding language model. When we use the LSTM language model, we use a 512-D hidden state. We set the regularization coefficient to 1e-4 when fine-tuning the Average Embedding and Self-Attention models and 1e-6 for the LSTM model.
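To make the default language model concrete, the sketch below averages word vectors into a sentence vector and ranks images by Euclidean distance. It is a simplified illustration under stated assumptions: the learned projection layers of the two branches are omitted, the vectors are toy values, and the function names are hypothetical rather than taken from the released implementation.

```python
import math


def average_embedding(tokens, word_vectors):
    """Mean of the word vectors of a sentence; word order is ignored."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        raise ValueError("no token of the sentence has a word vector")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_images(sentence_vec, image_vecs):
    """Indices of images sorted from closest to farthest from the sentence."""
    return sorted(range(len(image_vecs)),
                  key=lambda i: euclidean(sentence_vec, image_vecs[i]))


# Toy 2-D word vectors (assumed values, for illustration only).
word_vectors = {"dog": [1.0, 0.0], "runs": [0.0, 1.0]}
query = average_embedding(["dog", "runs"], word_vectors)
ranking = rank_images(query, [[5.0, 5.0], [0.4, 0.6], [1.0, 1.0]])
print(query, ranking)
```

Because the average is order-invariant, "dog runs" and "runs dog" map to the same sentence vector, which is the property that distinguishes this language model from an LSTM.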
Phrase Grounding. To evaluate our word embeddings on this task, we use the implementation of the CITE network (https://github.com/BryanPlummer/cite). This model learns a set of embeddings which share some parameters, each of which captures a different concept important for phrase grounding. Following Plummer et al., we use the parameters and feature representation learned from fine-tuning a 101-layer ResNet and Region Proposal Network. This model also uses an Average Embedding language model by default, and we use a 256-D hidden state for our LSTM experiments. We set the regularization coefficient to 1e-5 for both datasets.
Text-to-Clip. When we performed our experiments, none of the methods that outperform the baseline model of Hendricks et al. on the DiDeMo dataset had publicly available code for the text-to-clip task ([5, 36]). As a result, we used the CITE network for the text-to-clip task, since in our experiments it performed better than the baseline model as well as better than the phrase-region grounding Similarity Network and straightforward adaptations of the R-C3D model. We learn concept embeddings for this dataset and use the VGG features provided by Hendricks et al. as the visual representation. We use a 512-D hidden state for our LSTM models, and set the regularization coefficient to 5e-2. This dataset likely required additional regularization when fine-tuning its embeddings due to its relatively small size.
Image Captioning. We use a PyTorch implementation (https://github.com/chenxinpeng/ARNet) of the Auto-Reconstructor Network (ARNet) architecture provided by the authors. This model builds on the original Neural Image Captioning (NIC) architecture by adding an additional LSTM to reconstruct previous hidden states. We set the regularization coefficient of the NIC loss to 5e-2 when fine-tuning the word embeddings. ARNet’s additional stacked LSTM takes the current hidden state as input and attempts to generate the previous hidden state. This can be viewed as a “soft” zoneout strategy, as the model adaptively learns how to reconstruct the last hidden state at each time step, as opposed to the typical zoneout regularizer, which makes a binary choice between previous and current hidden states.
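The contrast between a binary zoneout choice and a "soft" reconstruction objective can be sketched as below. This is an illustrative toy rather than the ARNet implementation: the squared-error form of the penalty, the zoneout probability `p`, and both function names are assumptions made for the example.

```python
import random


def zoneout(h_prev, h_curr, p, rng):
    """Hard choice: each unit keeps its previous value with probability p."""
    return [hp if rng.random() < p else hc for hp, hc in zip(h_prev, h_curr)]


def reconstruction_penalty(h_prev, h_prev_reconstructed):
    """Soft alternative: squared error between the true previous hidden state
    and a reconstruction of it predicted from the current hidden state
    (assumed penalty form)."""
    return sum((a - b) ** 2 for a, b in zip(h_prev, h_prev_reconstructed))


rng = random.Random(0)
h_prev, h_curr = [1.0, 2.0], [3.0, 4.0]
mixed = zoneout(h_prev, h_curr, 0.5, rng)  # every unit equals h_prev or h_curr
penalty = reconstruction_penalty(h_prev, [1.0, 4.0])
print(mixed, penalty)
```

In the hard variant each unit is copied from exactly one of the two states, whereas the penalty is a continuous quantity that can be added to the captioning loss with a weighting coefficient.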
Visual Question Answering. We use the authors’ implementation (https://github.com/ronghanghu/n2nmn) of the End-to-End Module Networks as our VQA model. This network learns to decompose natural language questions into sub-tasks and assembles question-specific deep networks from neural modules to solve the corresponding sub-tasks. The training process of this model consists of two parts: the cloning expert and the policy search. Since the policy search improves the model by only 0.7% while adding significant training time, we report results using only the cloning expert. We use the default parameters in the implementation and follow the authors’ data pre-processing steps. When we include L2 regularization on the word embeddings, we set its weight to 5e-4. Note that we report results using the VQA v2 dataset, whereas Hu et al. reported results on VQA v1.
9.3 Additional Task Methods
Image-Sentence Retrieval. We also report results with the Stacked Cross Attention Network (SCAN) model using the authors’ provided implementation (https://github.com/kuanghuei/SCAN). Unlike the Embedding Network, this model uses the top 36 region-level features, which have been trained to capture image concepts on the Visual Genome dataset. A similarity score is computed between all combinations of words in a sentence and image regions, and then aggregated using a multi-step attention mechanism to obtain an overall matching score. For each dataset, we use the settings for the best performing single model reported in their paper, i.e., i-t AVG (λ1 = 4) for Flickr30K and t-i AVG (λ1 = 9) for MSCOCO.
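The word-region scoring described above can be approximated by the following sketch, in which each word attends over the image regions and the word-level relevance scores are averaged. It is a deliberately simplified illustration and not the SCAN implementation: the similarity normalization of the original model is omitted, `temperature` is a hypothetical parameter name for the softmax scaling, and the feature vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def text_to_image_avg(words, regions, temperature):
    """Average, over words, of the relevance between each word and the
    region vector obtained by attending over all regions for that word."""
    scores = []
    for w in words:
        sims = [cosine(w, r) for r in regions]
        weights = softmax([temperature * s for s in sims])
        attended = [sum(wt * r[d] for wt, r in zip(weights, regions))
                    for d in range(len(w))]
        scores.append(cosine(w, attended))
    return sum(scores) / len(scores)


words = [[1.0, 0.0], [0.0, 1.0]]
matching = text_to_image_avg(words, [[1.0, 0.0], [0.0, 1.0]], 9.0)
mismatched = text_to_image_avg(words, [[-1.0, 0.0], [0.0, -1.0]], 9.0)
print(matching > mismatched)
```

A larger temperature sharpens the attention so that each word is matched almost exclusively to its most similar region.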
Phrase Grounding. To supplement our results, we experiment with the implementation of the Query Adaptive R-CNN network from Plummer et al. This model adapts Faster R-CNN to the phrase grounding task. The implementation in Plummer et al. updates the VGG network used in the original paper with a 101-layer ResNet, but does not pretrain the model on Visual Genome or use the online hard negative mining done in the original paper. In addition, Plummer et al. reported better performance by randomly sampling 5 phrases associated with an image for each minibatch rather than using all annotated phrases. We compared this implementation using a VGG network to the grounding performance reported in the original paper and found it performed similarly on Flickr30K Entities despite these changes, but using a ResNet backbone as done in our experiments does boost performance by 3-8%.
Text-to-Clip. We provide additional results from the Temporally Grounding Natural Sentence in Video (TGN) model. The TGN model consists of 3 components: the encoder, the interactor and the grounder. Visual and language features are first projected into the same embedding space using the encoder. Next, the interactor computes the frame-by-word interactions using the encoded visual and language features. Finally, based on these interactions, the grounder scores and ranks the temporal segment candidates ending at each frame. We note that these results are obtained from our own implementation of the TGN model, as the authors have not released code. In our implementation, we adopt the same hyperparameter values as detailed in the original paper.
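The grounder's final step, scoring and ranking the candidate segments that end at each frame, can be illustrated with the sketch below. It is a schematic example only: the set of candidate lengths, the inclusive `(start, end)` convention, and the stand-in scoring function are assumptions, and in the actual model the scores come from the learned frame-by-word interactions.

```python
def candidates_ending_at(frame, lengths):
    """Inclusive (start, end) segments that end at `frame`, one per length;
    lengths that would start before frame 0 are dropped."""
    return [(frame - n + 1, frame) for n in lengths if frame - n + 1 >= 0]


def rank_segments(num_frames, lengths, score):
    """Enumerate candidates ending at every frame and sort them by score,
    highest first. `score` maps a (start, end) pair to a number."""
    candidates = [seg
                  for t in range(num_frames)
                  for seg in candidates_ending_at(t, lengths)]
    return sorted(candidates, key=score, reverse=True)


def toy_score(segment):
    """Stand-in score that peaks at the segment spanning frames 1 to 3."""
    return -abs(segment[0] - 1) - abs(segment[1] - 3)


ranked = rank_segments(num_frames=5, lengths=[1, 2, 3], score=toy_score)
print(ranked[0])
```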
Image Captioning. We provide results for two additional image captioning models: the vanilla show-and-tell Neural Image Captioning model (NIC) of Vinyals et al. and the popular Bottom-Up Top-Down (BUTD) model from Anderson et al. We set the L2 regularization coefficient to 5e-2 when fine-tuning the word embeddings for both models. We use a PyTorch implementation (https://github.com/yunjey/pytorch-tutorial) of the NIC model for this task. This model follows an encoder-decoder paradigm inspired by machine translation, in which the probability of a sentence given an image is maximized. A CNN encodes an image, which is then fed into a decoder LSTM to form a natural language sentence. Unlike the results reported in Vinyals et al., we use a single model rather than an ensemble, and use a 152-layer ResNet pretrained on ImageNet as our image encoder.
We also use a PyTorch implementation (https://github.com/ruotianluo/self-critical.pytorch) of the Bottom-Up Top-Down Attention image captioning model. BUTD uses a combination of visual attention mechanisms: bottom-up attention is implemented using Faster R-CNN to generate object region proposals and their respective features, which are then weighted by the top-down attention mechanism. The model also adds an attribute predictor to Faster R-CNN. The language model is implemented with two standard LSTMs, where the first layer serves as top-down attention and the second is the language generator. The attention LSTM takes the previous time step output, mean pooled image features, and previously generated word encoding as input. After a Softmax is applied to the output of the attention LSTM, the weighted visual features are passed to the generator LSTM.
Visual Question Answering. We provide additional VQA results using the Bilinear Attention Networks (BAN) model. The BAN model utilizes adaptive region-level features as the visual input. It extracts joint representations from each pair of visual and word features via low-rank bilinear pooling while computing their bilinear interactions using attention maps. We use the provided implementation (https://github.com/jnhwkim/ban-vqa) in our experiments and adopt the same hyperparameter settings as described in the original paper.
9.4 Discrepancies with Published Work
Where available, we use the authors’ publicly released code; nevertheless, some baseline results differ from the published values. The best published results for some of these models are obtained using ensemble methods, whereas our results use a single model. Even so, a single model with the five-task multi-task trained GrOVLE + ft is on par with ensemble results.
9.5 Comparison of Word2Vec and GloVe
When initially deciding the set of embeddings to use in our experiments, we did consider GloVe. However, the differences between Word2Vec and GloVe results were insignificant (some shown below). Thus, we did not include GloVe in the main paper, both because of space constraints and because it is also a dated embedding.
9.6 Image-Sentence Retrieval Extended Pretrained Embedding Metrics
|Embedding Network |
|(a)||Training from scratch|
|Average Embedding + ft||56.7||84.3||91.4||41.6||72.9||82.1||71.5||62.4||89.1||95.0||50.2||82.2||90.2||78.2|
|Self-Attention + ft||57.0||84.4||91.4||42.4||73.5||82.8||71.9||64.8||91.2||96.4||51.9||83.1||91.9||79.9|
|LSTM + ft||52.1||82.4||89.9||39.6||70.0||79.9||69.0||63.5||89.4||95.0||49.7||81.4||90.3||78.2|
|Average Embedding + ft||59.4||86.8||92.0||42.6||73.7||83.5||73.0||66.6||91.7||96.6||52.7||84.4||92.2||80.7|
|Self-Attention + ft||58.8||85.8||91.8||44.2||74.6||83.3||73.1||65.3||92.0||96.7||52.8||84.2||92.5||80.6|
|LSTM + ft||52.1||81.4||89.0||39.0||69.9||79.6||68.5||65.3||91.5||97.1||51.6||83.7||91.5||80.1|
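In the rows above, the seventh value of each dataset block is consistent with the arithmetic mean of the six values that precede it. A minimal sketch of that aggregation follows, assuming the six values are the per-direction recall scores; the name `mean_recall` is introduced here for illustration.

```python
def mean_recall(recalls):
    """Arithmetic mean of a list of recall scores (in percent)."""
    if not recalls:
        raise ValueError("at least one recall value is required")
    return sum(recalls) / len(recalls)


# First six values of the first Average Embedding + ft row above.
first_block = [56.7, 84.3, 91.4, 41.6, 72.9, 82.1]
print(round(mean_recall(first_block), 1))  # 71.5, the seventh value in that row
```

Reporting a single mean makes it easier to compare embeddings across both retrieval directions at once, at the cost of hiding which direction or cutoff drives a difference.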
9.7 Image-Sentence Retrieval Extended Adapted Embedding Metrics
|Embedding Network |
|(a)||Word2Vec + wn|
|Average Embedding + ft||57.7||85.3||91.5||42.2||73.2||82.3||72.0||63.6||90.8||95.6||51.1||83.2||91.1||79.2|
|Self-Attention + ft||57.6||86.2||92.1||42.5||73.3||82.7||72.4||64.0||91.5||96.8||51.4||84.3||91.7||80.0|
|LSTM + ft||53.5||82.8||89.9||39.3||70.2||80.5||69.3||63.8||90.6||95.7||50.2||82.0||90.9||78.9|
|Average Embedding + ft||57.6||85.1||92.0||42.6||73.6||82.6||72.3||65.2||91.8||96.5||52.1||83.9||92.1||80.2|
|Self-Attention + ft||56.9||84.2||91.7||43.2||73.9||82.8||72.1||67.6||91.4||96.3||52.0||83.7||92.1||80.5|
|LSTM + ft||54.1||82.7||91.1||39.7||70.2||80.1||69.7||65.0||89.6||95.8||49.7||82.0||90.8||78.8|
|Average Embedding + ft||50.0||79.7||87.0||37.0||68.3||78.6||66.8||61.7||90.6||95.8||50.0||82.7||91.2||78.7|
|Self-Attention + ft||51.3||82.3||89.5||40.9||69.1||79.9||68.8||61.6||91.4||96.7||50.2||83.1||92.4||79.2|
|LSTM + ft||50.5||78.3||88.6||36.2||67.7||78.7||66.7||56.2||87.3||94.8||42.5||77.3||87.8||74.5|
|Average Embedding + ft||56.6||84.2||90.8||41.4||72.0||81.2||71.0||65.5||90.7||96.0||51.5||83.4||91.5||79.8|
|Self-Attention + ft||56.4||84.7||91.3||42.1||73.3||82.2||71.8||66.2||91.0||96.3||51.8||84.7||92.6||80.4|
|LSTM + ft||54.1||82.0||90.2||40.2||70.4||80.2||69.5||61.5||89.9||95.3||48.9||81.5||90.4||77.9|
|Average Embedding + ft||60.5||86.4||92.9||43.8||73.9||83.3||73.5||67.2||91.7||97.5||53.0||84.0||92.2||80.9|
|Self-Attention + ft||61.6||88.4||94.5||46.4||75.7||84.1||75.1||65.4||93.0||97.4||52.6||83.6||90.6||80.6|
|LSTM + ft||51.4||80.7||89.4||39.1||68.7||78.6||68.0||65.0||90.7||96.1||51.2||82.8||90.9||79.4|
9.8 Image-Sentence Retrieval Extended Multi-task Trained GrOVLE Metrics
|Embedding Network |
|GrOVLE w/o multi-task pretraining||47.3||78.9||87.0||33.2||65.1||76.8||64.7||56.3||87.4||94.3||44.5||79.0||88.5||75.0|
|+ multi-task pretraining w/o target task||49.0||79.7||87.7||35.7||66.2||76.3||65.8||60.8||87.3||94.7||46.7||79.7||89.3||76.4|
|+ multi-task pretraining w/ target task||51.3||68.7||80.7||36.2||64.3||66.3||66.2||65.5||91.6||96.7||51.2||83.6||91.4||80.2|
|+ multi-task pretraining w/ target task + ft||58.2||85.8||91.9||42.1||73.8||84.0||72.6||66.8||93.4||97.9||51.8||85.0||92.8||81.3|
9.9 Image-Sentence Retrieval Additional Model Metrics
|Stacked Cross Attention Network (SCAN) |
|Training from scratch||60.8||86.8||92.0||43.0||72.1||81.9||72.8||69.9||94.3||97.4||56.6||87.1||94.0||83.2|
|Word2Vec + ft||59.7||83.4||90.9||41.2||70.6||79.8||70.9||71.9||94.1||98.1||58.2||87.8||93.8||84.0|
|FastText + ft||60.7||86.8||91.5||42.1||73.0||80.8||72.5||71.4||94.4||97.7||58.0||87.4||93.8||83.8|
|GrOVLE (w/o multi-task pretraining) + ft||61.0||86.7||92.0||42.2||72.7||81.3||72.7||72.3||94.0||97.9||58.4||87.7||94.4||84.1|
|+ multi-task pretraining w/ target task + ft||65.8||89.8||94.2||46.8||76.2||84.5||76.2||74.4||94.8||97.8||59.1||87.8||94.2||84.7|
9.10 Phrase Grounding Additional Model Metrics
|Query Adaptive R-CNN |
|Training from scratch||68.56||50.23|
|Word2Vec + ft||69.78||52.97|
|FastText + ft||69.27||53.01|
|GrOVLE (w/o multi-task pretraining) + ft||70.03||53.88|
|+ multi-task pretraining w/ target task + ft||71.08||54.10|
9.11 Text-to-Clip Extended Pretrained Embedding Metrics
|(a)||Training from scratch|
|Average Embedding + ft||15.65||55.00||27.10||32.58|
|Self-Attention + ft||15.81||55.48||28.48||33.26|
|LSTM + ft||15.49||59.29||25.04||33.94|
|Average Embedding + ft||15.69||53.72||26.62||32.01|
|Self-Attention + ft||15.60||55.93||27.99||33.17|
|LSTM + ft||14.80||58.02||24.71||32.51|
9.12 Text-to-Clip Extended Adapted Embedding Metrics
|(a)||Word2Vec + wn|
|Average Embedding + ft||16.05||55.89||27.79||33.24|
|Self-Attention + ft||16.05||57.73||27.16||33.65|
|LSTM + ft||16.36||59.81||26.32||34.16|
|Average Embedding + ft||16.53||56.05||28.56||33.71|
|Self-Attention + ft||15.60||58.16||25.67||33.14|
|LSTM + ft||15.79||61.65||25.98||34.47|
|Average Embedding + ft||14.05||56.90||24.23||31.73|
|Self-Attention + ft||14.12||55.23||24.11||31.15|
|LSTM + ft||14.03||58.52||24.31||32.29|
|Average Embedding + ft||15.96||54.67||27.24||32.62|
|Self-Attention + ft||16.23||56.07||28.01||33.44|
|LSTM + ft||15.89||59.84||25.81||33.85|
|Average Embedding + ft||15.43||55.79||26.76||32.66|
|Self-Attention + ft||15.60||57.82||27.30||33.57|
|LSTM + ft||16.41||60.86||26.59||34.62|
9.13 Text-to-Clip Extended Multi-task Trained GrOVLE Metrics
|GrOVLE w/o multi-task pretraining||16.34||60.84||26.17||34.45|
|+ multi-task pretraining w/o target task||16.94||58.90||27.88||34.57|
|+ multi-task pretraining w/ target task||16.96||59.40||28.09||34.82|
|+ multi-task pretraining w/ target task + ft||17.05||59.84||28.39||35.09|
9.14 Text-to-Clip Additional Model Metrics
|Temporal GroundNet (TGN) |
|Training from scratch||26.26||74.33||31.32||43.97|
|Word2Vec + ft||25.98||74.11||32.06||44.05|
|FastText + ft||26.13||74.23||30.53||43.64|
|GrOVLE (w/o multi-task pretraining) + ft||25.54||73.98||34.24||44.59|
|+ multi-task pretraining w/ target task + ft||24.91||73.58||32.37||43.62|
9.15 Image Captioning Extended Pretrained Embedding Metrics
|(a)||Training from scratch|
|LSTM + ft||26.7||89.7||24.3|
|LSTM + ft||28.5||94.0||24.8|
|LSTM + ft||28.3||93.2||24.8|
9.16 Image Captioning Extended Adapted Embedding Metrics
|(a)||Word2Vec + wn|
|LSTM + ft||28.6||93.3||24.9|
|LSTM + ft||28.3||92.5||24.8|
|LSTM + ft||28.8||94.0||24.9|
|LSTM + ft||28.7||94.0||24.9|
|LSTM + ft||28.0||92.8||24.7|
9.17 Image Captioning Extended Multi-task Trained GrOVLE Metrics
|GrOVLE w/o multi-task pretraining||28.5||92.7||24.7|
|+ multi-task pretraining w/o target task||28.8||93.3||24.7|
|+ multi-task pretraining w/ target task||28.5||92.7||24.7|
|+ multi-task pretraining w/ target task + ft||28.7||93.2||24.7|
9.18 Image Captioning Additional Model Metrics
|Neural Image Captioning (NIC) |
|Training from scratch||18.2||62.5||20.3|
|Word2Vec + ft||18.7||62.8||20.2|
|FastText + ft||17.9||61.6||17.9|
|GrOVLE (w/o multi-task pretraining) + ft||19.4||65.4||20.6|
|+ multi-task pretraining w/ target task + ft||19.4||65.1||20.9|
|Bottom-Up Top-Down Attention (BUTD) |
|Training from scratch||35.2||109.8||27.2|
|Word2Vec + ft||35.1||110.8||27.1|
|FastText + ft||35.2||110.3||27.1|
|GrOVLE (w/o multi-task pretraining) + ft||35.1||110.4||27.1|
|+ multi-task pretraining w/ target task + ft||35.7||111.6||27.3|
9.19 Visual Question Answering Additional Model Metrics
|Bilinear Attention Network|
|Training from scratch||68.68|
|Word2Vec + ft||69.91|
|FastText + ft||69.91|
|GrOVLE (w/o multi-task pretraining) + ft||69.36|
|+ multi-task pretraining w/ target task + ft||69.97|