Artificial intelligence (AI) is an inherently multi-modal problem: understanding and reasoning about multiple modalities, as humans do, seems crucial for achieving it. Language and vision are two vital interaction modalities for humans. Thus, modeling the rich interplay between language and vision is one of the fundamental problems in AI.
Language modeling is an important problem in natural language processing (NLP). A language model estimates the likelihood of a word conditioned on other (context) words in a sentence. There is a rich history of work on n-gram based language modeling [4, 17]. It has been shown that simple, count-based models trained on millions of sentences can give good results. However, in recent years, neural language models [3, 31] have been explored. Neural language models learn mappings from words (encoded using a dictionary) to a real-valued vector space (embedding), to maximize the log-likelihood of words given context. Embedding words into such a vector space helps deal with the curse of dimensionality, so that we can reason about similarities between words more effectively. One popular architecture for learning such an embedding is word2vec [30, 32]. This embedding captures rich notions of semantic relatedness and compositionality between words.
For tasks at the intersection of vision and language, it seems prudent to model semantics as dictated by both text and vision. It is especially challenging to model fine-grained interactions between objects using only text. Consider the relations “eats” and “stares at” in Fig. 1. When reasoning using only text, it might prove difficult to realize that these relations are semantically similar. However, by grounding the concepts in vision, we can learn that these relations are more similar than indicated by text. Thus, visual grounding provides a complementary notion of semantic relatedness. In this work, we learn word embeddings to capture this grounding.
Grounding fine-grained notions of semantic relatedness between words like “eats” and “stares at” into vision is a challenging problem. While recent years have seen tremendous progress in tasks like image classification, detection, semantic segmentation, action recognition, etc., modeling fine-grained semantics of interactions between objects is still a challenging task. However, we observe that it is the semantics of the visual scene that matter for inferring the visually grounded semantic relatedness, and not the literal pixels (Fig. 1). We thus use abstract scenes made from clipart to provide the visual grounding. We show that the embeddings we learn using abstract scenes generalize to text describing real images (Sec. 6.1).
Our approach considers visual cues from abstract scenes as context for words. Given a set of words and associated abstract scenes, we first cluster the scenes in a rich semantic feature space capturing the presence and locations of objects, pose, expressions, gaze, age of people, etc. Note that these features can be trivially extracted from abstract scenes. Using these features helps us capture fine-grained notions of semantic relatedness (Fig. 4). We then train to predict the cluster membership from pre-initialized word embeddings. The idea is to bring embeddings for words with similar visual instantiations closer, and push words with different visual instantiations farther apart (Fig. 1). The word embeddings are initialized with word2vec. The clusters thus act as surrogate classes. Note that each surrogate class may have images belonging to concepts which are different in text, but are visually similar. Since we predict the visual clusters as context given a set of input words, our model can be viewed as a multi-modal extension of the continuous bag of words (CBOW) word2vec model.
Contributions: We propose a novel model, visual word2vec (vis-w2v), to learn visually grounded word embeddings. We use abstract scenes made from clipart to provide the grounding. We demonstrate the benefit of vis-w2v on three tasks which are ostensibly in text, but can benefit from visual grounding: common sense assertion classification, visual paraphrasing, and text-based image retrieval. Common sense assertion classification is the task of modeling the plausibility of common sense assertions of the form (boy, eats, cake). Visual paraphrasing is the task of determining whether two sentences describe the same underlying scene or not. Text-based image retrieval is the task of retrieving images by matching accompanying text with textual queries. We show consistent improvements over baseline word2vec (w2v) models on these tasks. In fact, on the common sense assertion classification task, our models surpass the state of the art.
2 Related Work
Word embeddings learnt using neural networks [6, 32] have gained a lot of popularity recently. These embeddings are learnt offline and then typically used to initialize a multi-layer neural network language model [3, 31]. Similar to those approaches, we learn word embeddings from text offline, and finetune them to predict visual context. Xu et al. and Lazaridou et al. use visual cues to improve the word2vec representation by predicting real image representations from word2vec and by maximizing the dot product between image features and word2vec, respectively. While their focus is on capturing appearance cues (separating cats and dogs based on different appearance), we instead focus on capturing fine-grained semantics using abstract scenes. We study whether the model of Ren et al. and our vis-w2v provide complementary benefits in the appendix. Other works use visual and textual attributes (e.g., vegetable is an attribute for potato) to improve distributional models of word meaning [38, 39]. In contrast to these approaches, our set of visual concepts need not be explicitly specified; it is implicitly learnt in the clustering step. Many works use word embeddings as parts of larger models for tasks such as image retrieval, image captioning [18, 41], etc. These multi-modal embeddings capture regularities like compositional structure between images and words. For instance, in such a multi-modal embedding space, “image of blue car” - “blue” + “red” would give a vector close to “image of red car”. In contrast, we want to learn unimodal (textual) embeddings which capture multi-modal semantics. For example, we want to learn that “eats” and “stares at” are (visually) similar.
Surrogate Classification: There has been a lot of recent work on learning with surrogate labels due to interest in unsupervised representation learning. Previous works have used surrogate labels to learn image features [7, 9]. In contrast, we are interested in augmenting word embeddings with visual semantics. Also, while previous works have created surrogate labels using data transformations or sampling, we create surrogate labels by clustering abstract scenes in a semantically rich feature space.
Learning from Visual Abstraction: Visual abstractions have been used for a variety of high-level scene understanding tasks recently. Zitnick et al. [43, 44] learn the importance of various visual features (occurrence and co-occurrence of objects, expression, gaze, etc.) in determining the meaning or semantics of a scene. Other works learn the visual interpretation of sentences and the dynamics of objects in temporal abstract scenes. Antol et al. learn models of fine-grained interactions between pairs of people using visual abstractions. Lin and Parikh “imagine” abstract scenes corresponding to text, and use the common sense depicted in these imagined scenes to solve textual tasks such as fill-in-the-blanks and paraphrasing. Vedantam et al. classify common sense assertions as plausible or not by using textual and visual cues. In this work, we experiment with the tasks of Vedantam et al. and Lin and Parikh, which are two tasks in text that could benefit from visual grounding. Interestingly, by learning vis-w2v, we eliminate the need for explicitly reasoning about abstract scenes at test time, i.e., the visual grounding captured in our word embeddings suffices.
Language, Vision and Common Sense: There has been a surge of interest in problems at the intersection of language and vision recently. Breakthroughs have been made in tasks like image captioning [5, 8, 14, 16, 18, 20, 29, 33, 41], video description [8, 36], visual question answering [1, 11, 12, 27, 28, 35], aligning text and vision [16, 18], etc. In contrast to these tasks (which are all multi-modal), our tasks themselves are unimodal (i.e., in text), but benefit from using visual cues. Recent work has also studied how vision can help common sense reasoning [34, 37]. In comparison to these works, our approach is generic, i.e., can be used for multiple tasks (not just common sense reasoning).
Recall that our vis-w2v model grounds word embeddings into vision by treating vision as context. We first detail our inputs. We then discuss our vis-w2v model. We then describe the clustering procedure to get surrogate semantic labels, which are used as visual context by our model. We then describe how word-embeddings are initialized. Finally, we draw connections to word2vec (w2v) models.
Input: We are given a set of pairs of visual scenes and associated text, $D = \{(v, w)\}$, in order to train vis-w2v. Here $v$ refers to the image features and $w$ refers to the set of words associated with the image. At each step of training, we select a window $S_w \subseteq w$ to train the model.
Model: Our vis-w2v model (Fig. 2) is a neural network that accepts as input a set of words $S_w$ and a visual feature instance $v$. Each of the words $w_i \in S_w$ is represented via a one-hot encoding. A one-hot encoding enumerates over the set of words in a vocabulary (of size $N_V$) and places a 1 at the index corresponding to the given word. This one-hot encoded input is transformed using a projection matrix $W_I$ of size $N_V \times N_H$ that connects the input layer to the hidden layer, where the hidden layer has a dimension of $N_H$. Intuitively, $N_H$ decides the capacity of the representation. Consider an input one-hot encoded word $w_i$ whose $j$-th index is set to 1. Since $w_i$ is one-hot encoded, the hidden activation for this word ($H_{w_i}$) is the $j$-th row of the weight matrix $W_I$, i.e., $H_{w_i} = W_I^j$. The resultant hidden activation $H$ is then the average of the individual hidden activations $H_{w_i}$, as $W_I$ is shared among all the words in $S_w$:
$$H = \frac{1}{|S_w|} \sum_{w_i \in S_w} H_{w_i} \qquad (1)$$
Given the hidden activation $H$, we multiply it with an output weight matrix $W_O$ of size $N_H \times N_K$, where $N_K$ is the number of output classes. The output class $G(v)$ is a discrete-valued function $G(\cdot)$ of the visual features $v$ (more details in the next paragraph). We normalize the output activations to form a distribution using the softmax function. Given the softmax outputs, we minimize the negative log-likelihood of the correct class conditioned on the input words:
$$L = -\log P(G(v) \mid S_w, W_I, W_O) \qquad (2)$$
We optimize this objective using stochastic gradient descent (SGD) with a learning rate of 0.01.
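To make the training step described above concrete, the following is a minimal pure-Python sketch of one vis-w2v update: average the embedding rows of the window words, apply a softmax over the surrogate classes, and take one SGD step on the negative log-likelihood. The function names, the toy sizes, and the random initialization are illustrative assumptions and do not reproduce the C implementation used for the experiments.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W_I, W_O, window):
    """Average the embedding rows of the window words, then score each class."""
    n_h, n_k = len(W_I[0]), len(W_O[0])
    H = [sum(W_I[j][d] for j in window) / len(window) for d in range(n_h)]
    scores = [sum(H[d] * W_O[d][k] for d in range(n_h)) for k in range(n_k)]
    return H, softmax(scores)

def sgd_step(W_I, W_O, window, target, lr=0.01):
    """One SGD step on -log P(target | window); returns the loss before the update."""
    H, probs = forward(W_I, W_O, window)
    loss = -math.log(probs[target])
    n_h, n_k = len(H), len(probs)
    # Gradient w.r.t. the pre-softmax scores: probs - one_hot(target).
    d_scores = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Gradient w.r.t. the hidden activation, computed before W_O is modified.
    d_H = [sum(W_O[d][k] * d_scores[k] for k in range(n_k)) for d in range(n_h)]
    for d in range(n_h):
        for k in range(n_k):
            W_O[d][k] -= lr * H[d] * d_scores[k]
    # Only the rows of W_I that belong to window words receive a gradient.
    for j in window:
        for d in range(n_h):
            W_I[j][d] -= lr * d_H[d] / len(window)
    return loss

# Toy setup (hypothetical sizes): 6 vocabulary words, 4 hidden units, 3 clusters.
random.seed(0)
N_V, N_H, N_K = 6, 4, 3
W_I = [[random.uniform(-0.5, 0.5) for _ in range(N_H)] for _ in range(N_V)]
W_O = [[random.uniform(-0.5, 0.5) for _ in range(N_K)] for _ in range(N_H)]
untouched = list(W_I[5])  # word 5 never appears in the training window
losses = [sgd_step(W_I, W_O, window=[0, 2, 3], target=1) for _ in range(200)]
```

In this toy run the loss on the single training window decreases, while the embedding of the unseen word stays exactly at its initial value, which mirrors the observation that only words seen with visual context are refined.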
Output Classes: As mentioned above, the target classes for the neural network are a function $G(\cdot)$ of the visual features. What would be a good choice for $G$? Recall that our aim is to recover an embedding for words that respects similarities in visual instantiations of words (Fig. 1). To capture this visual similarity, we model $G$ as a grouping function. (Footnote 1: Alternatively, one could regress directly to the feature values. However, we found that the regression objective hurts performance.) In practice, this function is learnt offline using clustering with K-means. That is, the outputs from clustering are the surrogate class labels used in vis-w2v training. Since we want our embeddings to reason about fine-grained visual grounding (e.g., “stares at” and “eats”), we cluster in the abstract scenes feature space (Sec. 4). See Fig. 4 for an illustration of what clustering captures. The number of clusters $N_K$ in K-means modulates the granularity at which we reason about visual grounding.
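The surrogate labels can be illustrated with a small K-means sketch. It assumes each scene is already described by a numeric feature vector; the plain Lloyd iterations, the toy two-dimensional features, and the helper names are illustrative assumptions rather than the clustering setup used in the experiments.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(features, n_clusters, n_iters=20, seed=0):
    """Cluster feature vectors with Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, n_clusters)]
    labels = [0] * len(features)
    for _ in range(n_iters):
        # Assignment step: each scene goes to its nearest center.
        labels = [
            min(range(n_clusters), key=lambda k: squared_distance(f, centers[k]))
            for f in features
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(n_clusters):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy scene features (hypothetical): two visually distinct groups of scenes.
toy_features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
toy_labels, toy_centers = kmeans_labels(toy_features, n_clusters=2)
```

The cluster index of each scene then plays the role of the surrogate class that the words attached to that scene are trained to predict.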
Initialization: We initialize the projection matrix parameters with those from training w2v on large text corpora. The hidden-to-output layer parameters are initialized randomly. Using w2v is advantageous for us in two ways: i) w2v embeddings have been shown to capture rich semantics and generalize to a large number of tasks in text. Thus, they provide an excellent starting point to finetune the embeddings to account for visual similarity as well. ii) Training on a large corpus gives us good coverage in terms of the vocabulary. Further, since the gradients during backpropagation only affect parameters/embeddings for words seen during training, one can view vis-w2v as augmenting w2v with visual information when available. In other words, we retain the rich amount of non-visual information already present in it. (Footnote 2: We verified empirically that this does not cause calibration issues. Specifically, given a pair of words where one word was refined using visual information but the other was not (unseen during training), using vis-w2v for the former and w2v for the latter when computing similarities between the two outperforms using w2v for both.) Indeed, we find that random initialization does not perform as well as initialization with w2v when training vis-w2v.
Design Choices: Our model (Sec. 3) admits choices of the text $w$ in a variety of forms, such as full sentences or tuples of the form (Primary Object, Relation, Secondary Object). The exact choice of $w$ depends on what is natural for the task of interest. For instance, for common sense assertion classification and text-based image retrieval, $w$ is a phrase from a tuple, while for visual paraphrasing $w$ is a sentence. Given $w$, the choice of the training window $S_w$ is also a design parameter tuned for the task. It could include all of $w$ (e.g., when learning from a phrase in the tuple) or a subset of the words (e.g., when learning from an $n$-gram context window in a sentence). While the model itself is task agnostic, and only needs access to the words and visual context during training, the validation and test performances are calculated using the vis-w2v embeddings on a specific task of interest (Sec. 5). This is used to choose the hyperparameters $N_H$ (hidden layer size) and $N_K$ (number of clusters).
Connections to w2v: Our model can be seen as a multi-modal extension of the continuous bag of words (CBOW) w2v model. The CBOW w2v objective maximizes the likelihood $P(w \mid S_w, W_I, W_O)$ of a word $w$ given its context $S_w$, where $W_I$ and $W_O$ are the input and output weight matrices. In contrast, we maximize the likelihood $P(G(v) \mid S_w, W_I, W_O)$ of the visual context $G(v)$ given a set of words $S_w$ (Eq. 2).
We compare vis-w2v and w2v on the tasks of common sense assertion classification (Sec. 4.1), visual paraphrasing (Sec. 4.2), and text-based image retrieval (Sec. 4.3). We give details of each task and the associated datasets below.
4.1 Common Sense Assertion Classification
We study the relevance of vis-w2v to the common sense (CS) assertion classification task introduced by Vedantam et al. Given a common sense tuple of the form (primary object $t_P$, relation $t_R$, secondary object $t_S$), e.g., (boy, eats, cake), the task is to classify it as plausible or not. The CS dataset contains TEST assertions spanning a range of relations, a fraction of which are plausible, as indicated by human annotations. These TEST assertions are extracted from the MS COCO dataset, which contains real images and captions. Evaluating on this dataset allows us to demonstrate that visual grounding learnt from the abstract world generalizes to the real world. Vedantam et al. approach the task by constructing a multi-modal similarity function between TEST assertions whose plausibility is to be evaluated, and TRAIN assertions that are known to be plausible. The TRAIN dataset also contains 4260 abstract scenes made from clipart depicting relations between various objects (20 scenes per relation). Each scene is annotated with one tuple that names the primary object, relation, and secondary object depicted in the scene. Abstract scene features describing the interaction between objects, such as relative location, pose, absolute location, etc., are used for learning vis-w2v. More details of the features can be found in the appendix. We use the VAL set of the CS dataset to pick the hyperparameters. Since the dataset contains tuples of the form ($t_P$, $t_R$, $t_S$), we explore learning vis-w2v with separate models for each element, and with a shared model irrespective of whether the word belongs to $t_P$, $t_R$, or $t_S$.
4.2 Visual Paraphrasing
Visual paraphrasing (VP), introduced by Lin and Parikh, is the task of determining if a pair of descriptions describes the same scene or two different scenes. Their dataset contains pairs of descriptions, of which a third are positive (describe the same scene) and the rest are negatives. The pairs are divided into a TRAIN set and a TEST set. Each description contains three sentences. We use scenes and descriptions from Zitnick et al. to train vis-w2v models, similar to Lin and Parikh. The abstract scene feature set captures occurrence of objects, person attributes (expression, gaze, and pose), absolute spatial location and co-occurrence of objects, relative spatial location between pairs of objects, depth ordering (3 discrete depths), relative depth, and flip. We withhold a set of 1000 pairs (333 positive and 667 negative) from TRAIN to form a VAL set to pick hyperparameters. Our VP TRAIN set thus consists of the remaining pairs.
4.3 Text-based Image Retrieval
In order to verify if our model has learnt the visual grounding of concepts, we study the task of text-based image retrieval. Given a query tuple, the task is to retrieve the image of interest by matching the query and ground truth tuples describing the images using word embeddings. For this task, we study the generalization of vis-w2v embeddings learnt for the common sense (CS) task, i.e., there is no training involved. We augment the common sense (CS) dataset (Sec. 4.1) to collect three query tuples for each of the original 4260 CS TRAIN scenes. Each scene in the CS TRAIN dataset has annotations for which objects in the scene are the primary and secondary objects in the ground truth tuples. We highlight the primary and secondary objects in the scene and ask workers on AMT to name the primary and secondary objects, and the relation depicted by the interaction between them. Some examples can be seen in Fig. 3. Interestingly, some scenes elicit diverse tuples whereas others tend to be more constrained. This is related to the notion of Image Specificity. Note that the workers do not see the original (ground truth) tuple written for the scene from the CS TRAIN dataset. More details of the interface are provided in the appendix. We use the collected tuples as queries for performing the retrieval task. Note that the queries used at test time were never used for training vis-w2v.
5 Experimental Setup
We now explain our experimental setup. We first explain how we use our vis-w2v or baseline w2v (word2vec) model for the three tasks described above: common sense (CS), visual paraphrasing (VP), and text-based image retrieval. We also provide evaluation details. We then list the baselines we compare to for each task and discuss some design choices. For all the tasks, we preprocess raw text by tokenizing using the NLTK toolkit. We implement vis-w2v as an extension of the Google C implementation of word2vec (Footnote 3: https://code.google.com/p/word2vec/).
5.1 Common Sense Assertion Classification
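The task in common sense assertion classification (Sec. 4.1) is to compute the plausibility of a test assertion based on its similarity to a set of tuples ($\Omega = \{t^i\}_{i=1}^{I}$) known to be plausible. Given a tuple $t' = (t'_P, t'_R, t'_S)$ (Primary Object $t'_P$, Relation $t'_R$, Secondary Object $t'_S$) and a training instance $t^i$, the plausibility scores are computed as follows:
$$h(t', t^i) = W_P(t'_P)^T W_P(t^i_P) + W_R(t'_R)^T W_R(t^i_R) + W_S(t'_S)^T W_S(t^i_S) \qquad (3)$$
where $W_P$, $W_R$, $W_S$ represent the corresponding word embedding spaces. The final text score is given as follows:
$$f(t') = \frac{1}{|I|} \sum_{i \in I} \max\left(h(t', t^i) - \delta, 0\right) \qquad (4)$$
where $i$ sums over the entire set of training tuples. We use the value of $\delta$ used by Vedantam et al. for our experiments.
Vedantam et al. share embedding parameters across $t_P$, $t_R$, $t_S$ in their text-based model. That is, $W_P = W_R = W_S$. We call this the shared model. When $W_P$, $W_R$, $W_S$ are learnt independently for ($t_P$, $t_R$, $t_S$), we call it the separate model.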
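The text-based plausibility score can be sketched in a few lines of Python. The sketch assumes the score is a sum of dot products between corresponding tuple elements, thresholded by a margin and averaged over the training tuples, and that a multi-word element is represented by the average of its word embeddings. The function names, the toy embeddings, and the margin value are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def embed(phrase, table):
    """Average the embeddings of the words in a (possibly multi-word) element."""
    vectors = [table[word] for word in phrase.split()]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def tuple_similarity(test, train, W_P, W_R, W_S):
    """Sum of dot products between corresponding elements of two tuples."""
    return sum(dot(embed(a, W), embed(b, W))
               for a, b, W in zip(test, train, (W_P, W_R, W_S)))

def text_score(test, train_tuples, W_P, W_R, W_S, delta):
    """Average thresholded similarity of a test tuple to all plausible tuples."""
    total = sum(max(tuple_similarity(test, t, W_P, W_R, W_S) - delta, 0.0)
                for t in train_tuples)
    return total / len(train_tuples)

# Toy 2-d embeddings (hypothetical values, for illustration only).
emb = {
    "boy": [1.0, 0.0], "girl": [0.9, 0.1], "eats": [0.0, 1.0],
    "holds": [0.1, 0.9], "cake": [0.7, 0.7], "pie": [0.6, 0.8],
    "sky": [-1.0, 0.0], "flies": [0.0, -1.0], "rock": [-0.7, -0.7],
}
train_tuples = [("boy", "eats", "cake"), ("girl", "holds", "pie")]

# Shared model: the same embedding table is passed for all three roles.
plausible = text_score(("girl", "eats", "pie"), train_tuples, emb, emb, emb, delta=1.0)
implausible = text_score(("sky", "flies", "rock"), train_tuples, emb, emb, emb, delta=1.0)
```

A separate model would be obtained by passing three different embedding tables, one per role, in place of the single shared table.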
The approach of Vedantam et al. also has a visual similarity function, combining text and abstract scenes, that is used along with this text-based similarity. We use the text-based approach for evaluating both vis-w2v and baseline w2v. However, we also report results including the visual similarity function along with text similarity from vis-w2v. In line with Vedantam et al., we evaluate our results using average precision (AP) as the performance metric.
5.2 Visual Paraphrasing
In the visual paraphrasing task (Sec. 4.2), we are given a pair of descriptions at test time. We need to assign a score to each pair indicating how likely they are to be paraphrases, i.e., describing the same scene. Following Lin and Parikh, we average word embeddings (vis-w2v or w2v) for the sentences and plug them into their text-based scoring function. This scoring function combines term frequency, word co-occurrence statistics and averaged word embeddings to assess the final paraphrasing score. The results are evaluated using average precision (AP) as the metric. While training both vis-w2v and w2v for the task, we append the sentences from the visual paraphrasing train set to the original word embedding training corpus to handle vocabulary overlap issues.
5.3 Text-based Image Retrieval
We compare w2v and vis-w2v on the task of text-based image retrieval (Sec. 4.3). The task involves retrieving the target image from an image database, for a query tuple. Each image in the database has an associated ground truth tuple describing it. We use these to rank images by computing similarity with the query tuple. Given tuples of the form ($t_P$, $t_R$, $t_S$), we average the vector embeddings for all words in $t_P$, $t_R$, $t_S$. We then explore separate and shared models just as we did for common sense assertion classification. In the separate model, we first compute the cosine similarity between the query and the ground truth for $t_P$, $t_R$, $t_S$ separately and average the three similarities. In the shared model, we average the word embeddings for $t_P$, $t_R$, $t_S$ for query and ground truth and then compute the cosine similarity between the averaged embeddings. The similarity scores are then used to rank the images in the database for the query. We use standard metrics for retrieval tasks to evaluate: Recall@1 (R@1), Recall@5 (R@5), Recall@10 (R@10) and median rank (med R) of the target image in the returned result.
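The retrieval procedure can be sketched as follows. The sketch ranks database tuples by cosine similarity to a query tuple under both the separate and the shared variant, then computes Recall@K and the median rank. In the shared variant it averages the three element vectors, which is one reading of the description above. All names, the toy embeddings, and the toy database are hypothetical.

```python
import math
from statistics import median

def mean_vector(words, emb):
    """Average the embeddings of a list of words."""
    vectors = [emb[w] for w in words]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def similarity_separate(query, truth, emb_p, emb_r, emb_s):
    """Average of per-element cosine similarities, one embedding table per role."""
    sims = [cosine(mean_vector(q.split(), e), mean_vector(t.split(), e))
            for q, t, e in zip(query, truth, (emb_p, emb_r, emb_s))]
    return sum(sims) / len(sims)

def similarity_shared(query, truth, emb):
    """Cosine similarity between the averaged element embeddings of two tuples."""
    def tuple_vector(tup):
        elements = [mean_vector(el.split(), emb) for el in tup]
        return [sum(col) / len(elements) for col in zip(*elements)]
    return cosine(tuple_vector(query), tuple_vector(truth))

def target_rank(query, target_index, truths, similarity):
    """1-based rank of the target image when sorting by decreasing similarity."""
    scores = [similarity(query, t) for t in truths]
    order = sorted(range(len(truths)), key=lambda i: scores[i], reverse=True)
    return order.index(target_index) + 1

def recall_at_k(ranks, k):
    """Fraction of queries whose target is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Toy data (hypothetical embeddings and ground truth tuples).
emb = {
    "boy": [1.0, 0.0, 0.0], "girl": [0.9, 0.1, 0.0], "eats": [0.0, 1.0, 0.0],
    "stares": [0.1, 0.9, 0.1], "at": [0.0, 0.8, 0.2], "cake": [0.0, 0.0, 1.0],
    "pie": [0.1, 0.0, 0.9], "dog": [-1.0, 0.2, 0.0],
    "chases": [0.0, -1.0, 0.3], "ball": [0.5, 0.0, -1.0],
}
truths = [("boy", "eats", "cake"), ("dog", "chases", "ball")]
queries = [(("girl", "stares at", "pie"), 0), (("dog", "chases", "ball"), 1)]

shared_ranks = [target_rank(q, i, truths, lambda a, b: similarity_shared(a, b, emb))
                for q, i in queries]
separate_ranks = [target_rank(q, i, truths,
                              lambda a, b: similarity_separate(a, b, emb, emb, emb))
                  for q, i in queries]
median_rank = median(shared_ranks)
```

In the toy example the query (girl, stares at, pie) retrieves the scene annotated with (boy, eats, cake), since the corresponding embeddings were placed close together by construction.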
We describe some baselines in this subsection. In general, we consider two kinds of w2v models: those learnt from generic text, e.g., Wikipedia (w2v-wiki), and those learnt from visual text, e.g., MS COCO (w2v-coco), i.e., text describing images. Embeddings learnt from visual text typically contain more visual information. vis-w2v-wiki are vis-w2v embeddings learnt using w2v-wiki as an initialization to the projection matrix, while vis-w2v-coco are the vis-w2v embeddings learnt using w2v-coco as the initialization. In all settings, we are interested in studying the performance gains from using vis-w2v over w2v. Although our training procedure itself is task agnostic, we train separately on the common sense (CS) and the visual paraphrasing (VP) datasets. We study generalization of the embeddings learnt for the CS task on the text-based image retrieval task. Additional design choices pertaining to each task are discussed in Sec. 3.
We present results on common sense (CS), visual paraphrasing (VP), and text-based image retrieval tasks. We compare our approach to various baselines as explained in Sec. 5 for each application. Finally, we train our model using real images instead of abstract scenes, and analyze differences. More details on the effect of hyperparameters on performance (for CS and VP) can be found in the appendix.
| Approach | common sense AP (%) |
|---|---|
| vis-w2v-coco (shared) + vision | 74.2 |
| vis-w2v-coco (separate) + vision | 75.2 |
| w2v-wiki (from Vedantam et al.) | 68.4 |
| w2v-coco (from Vedantam et al.) | 72.2 |
| w2v-coco + vision (from Vedantam et al.) | 73.6 |
6.1 Common Sense Assertion Classification
We first present our results on the common sense assertion classification task (Sec. 4.1). We report numbers with a fixed hidden layer size $N_H$ (chosen to be comparable to Vedantam et al.) in Table 1. We use the number of clusters $N_K$ that gives the best performance on validation. We handle tuple elements ($t_P$, $t_R$, or $t_S$) with more than one word by placing each word in a separate window (i.e., a window size of $|S_w| = 1$). For instance, the element “lay next to” is trained by predicting the associated visual context thrice with “lay”, “next” and “to” as inputs. Overall, we find an increase of 2.6% with the vis-w2v-coco (separate) model over the w2v-coco model used by Vedantam et al. We achieve larger gains (5.8%) with vis-w2v-wiki over w2v-wiki. Interestingly, the tuples in the common sense task are extracted from the MS COCO dataset. Thus, this is an instance where vis-w2v (learnt from abstract scenes) generalizes to text describing real images.
Our vis-w2v-coco (both shared and separate) embeddings outperform the joint w2v-coco + vision model from Vedantam et al., which reasons about visual features for a given test tuple; our model does not. Note that both models use the same training and validation data, which suggests that our vis-w2v model captures the grounding better than their multi-modal text + visual similarity model. Finally, we sweep over the hidden layer size $N_H$ on the validation set and find that vis-w2v-coco (separate) gets the best AP of 75.4% on TEST at the selected value. This is our best performance on this task.
Separate vs. Shared: We next compare the performance when using the separate and shared vis-w2v models. We find that vis-w2v-coco (separate) does better than vis-w2v-coco (shared) (74.8% vs. 74.5%), presumably because the embeddings can specialize to the semantic roles words play when participating in $t_P$, $t_R$, or $t_S$. In terms of shared models alone, vis-w2v-coco (shared) achieves a gain in performance of 2.3% over the w2v-coco model of Vedantam et al., whose textual models are all shared.
What Does Clustering Capture? We next visualize the semantic relatedness captured by clustering in the abstract scenes feature space (Fig. 4). Recall that clustering gives us surrogate labels to train vis-w2v. For the visualization, we pick a relation and display other relations that co-occur the most with it in the same cluster. Interestingly, words like “prepare to cut”, “hold”, “give” occur often with “stare at”. Thus, we discover the fact that when we “prepare to cut” something, we also tend to “stare at” it. Reasoning about such notions of semantic relatedness using purely textual cues would be prohibitively difficult. We provide more examples in the appendix.
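The visualization above relies on counting which relations share clusters with a chosen relation. A minimal sketch of such a count is given below. The exact counting rule used for Fig. 4 is not specified here, so the rule in the sketch (each scene of another relation is counted once per query scene in the same cluster) is an assumption, as are the function name and the toy annotations.

```python
from collections import Counter, defaultdict

def cooccurring_relations(relations, labels, query, top_n=3):
    """Rank relations by how often their scenes share a cluster with `query`.

    relations[i] is the relation annotated for scene i and labels[i] is the
    cluster assigned to that scene.
    """
    by_cluster = defaultdict(list)
    for relation, label in zip(relations, labels):
        by_cluster[label].append(relation)
    counts = Counter()
    for members in by_cluster.values():
        n_query = members.count(query)
        if n_query == 0:
            continue  # this cluster contains no scene of the query relation
        for relation in members:
            if relation != query:
                counts[relation] += n_query
    return counts.most_common(top_n)

# Toy scene annotations and cluster assignments (hypothetical).
relations = ["stare at", "hold", "hold", "eat", "stare at", "give", "run"]
labels = [0, 0, 0, 1, 2, 2, 1]
top = cooccurring_relations(relations, labels, "stare at", top_n=2)
```

In this toy case, “hold” and “give” are returned for “stare at”, while “eat” and “run” never share a cluster with it and are therefore not listed.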
6.2 Visual Paraphrasing
We next describe our results on the visual paraphrasing (VP) task (Sec. 4.2). The task is to determine if a pair of descriptions describes the same scene. Each description has three sentences. Table 2 summarizes our results and compares performance to w2v. We vary the size of the context window and check performance on the VAL set. We obtain the best results with the entire description as the context window $S_w$ (the corresponding values of the hidden layer size $N_H$ and the number of clusters $N_K$ are selected on the VAL set). Our vis-w2v models give an improvement of 0.7% over both w2v-wiki and w2v-coco. In comparison to the w2v-wiki approach of Lin and Parikh, we get a larger gain of 1.2% with our vis-w2v-coco embeddings. (Footnote 4: Our implementation of the approach of Lin and Parikh performs 0.3% higher than that reported in their paper.) Lin and Parikh imagine the visual scene corresponding to text to solve the task. Their combined text + imagination model performs 0.2% better (95.5%) than our model. Note that our approach does not have the additional expensive step of generating an imagined visual scene for each instance at test time. Qualitative examples of success and failure cases are shown in Fig. 5.
Window Size: Since the VP task is on multi-sentence descriptions, it gives us an opportunity to study how the size of the window ($|S_w|$) used in training affects performance. We evaluate the gains obtained by using window sizes of the entire description, a single sentence, 5 words, and a single word. We find that description level windows and sentence level windows give equal gains. However, performance tapers off as we reduce the context to 5 words (0.6% gain) and a single word (0.1% gain). This is intuitive, since VP requires us to reason about entire descriptions to determine paraphrases. Further, since the visual features in this dataset are scene level (and not about isolated interactions between objects), the signal in the hidden layer is stronger when an entire sentence is used.
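The four window settings compared above can be sketched as a small helper that turns one tokenized description into training windows, each of which would be paired with the cluster label of the associated scene. How the 5-word windows are extracted is not stated here; the sketch assumes sliding windows within each sentence. The function name, the mode strings, and the example description are hypothetical.

```python
def make_windows(description, mode, n=5):
    """Split a description (a list of tokenized sentences) into training windows."""
    if mode == "description":
        # One window containing every word of every sentence.
        return [[token for sentence in description for token in sentence]]
    if mode == "sentence":
        # One window per sentence.
        return [list(sentence) for sentence in description]
    if mode == "ngram":
        # Sliding windows of n words within each sentence (assumed scheme);
        # a sentence shorter than n words yields a single shorter window.
        return [sentence[i:i + n]
                for sentence in description
                for i in range(max(1, len(sentence) - n + 1))]
    if mode == "word":
        # Every word forms its own window.
        return [[token] for sentence in description for token in sentence]
    raise ValueError("unknown mode: " + mode)

# Toy description with two short sentences (hypothetical example).
description = [
    ["jenny", "kicks", "the", "ball"],
    ["mike", "is", "sad", "because", "he", "fell"],
]
window_counts = {mode: len(make_windows(description, mode))
                 for mode in ("description", "sentence", "ngram", "word")}
```

Smaller windows produce more training instances per scene, but each instance carries less of the scene-level context that the cluster label summarizes.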
6.3 Text-based Image Retrieval
We next present results on the text-based image retrieval task (Sec. 4.3). This task requires visual grounding, as the query and the ground truth tuple can often differ in textual similarity but refer to the same scene (Fig. 3). As explained in Sec. 4.3, we study generalization of the embeddings learnt during the common sense experiments to this task. Table 3 presents our results. Note that vis-w2v here refers to the embeddings learnt using the CS dataset. We find that the best performing models are vis-w2v-wiki (shared) (as per R@1, R@5, med R) and vis-w2v-coco (separate) (as per R@10, med R). These get Recall@10 scores of 49.5% whereas the baseline w2v-wiki and w2v-coco embeddings give scores of 45.4% and 47.6%, respectively.
| Approach | R@1 (%) | R@5 (%) | R@10 (%) | med R |
6.4 Real Image Experiment
Finally, we test our vis-w2v approach with real images on the CS task, to evaluate the need to learn fine-grained visual grounding via abstract scenes. Thus, instead of semantic features from abstract scenes, we obtain surrogate labels by clustering real images from the MS COCO dataset using fc7 features from the VGG-16 CNN. We cross validate to find the best number of clusters and hidden units. We perform real image experiments in two settings: 1) We use all of the MS COCO dataset after removing the images whose tuples are in the CS TEST set of Vedantam et al. This gives us the collection of images used to learn vis-w2v. The MS COCO dataset has five captions for each image. We use all five captions with sentence level context windows to learn vis-w2v80K. (Footnote 5: We experimented with other choices but found this works best.) 2) We create a real image dataset by collecting 20 real images from MS COCO and their corresponding tuples, randomly selected for each of the 213 relations from the VAL set (Sec. 5.1). Analogous to the CS TRAIN set containing abstract scenes, this gives us a dataset of 4260 real images, each with an associated tuple, depicting the 213 CS VAL relations. We refer to this model as vis-w2v4K.
We report the gains in performance over w2v baselines in both settings 1) and 2) for the common sense task. We find that using real images gives a best-case performance of 73.7% starting from w2v-coco for vis-w2v80K (as compared to 74.8% using CS TRAIN abstract scenes). For vis-w2v4K-coco, the performance on the validation set actually goes down during training. If we train vis-w2v4K starting with generic text based w2v-wiki, we get a performance of 70.8% (as compared to 74.2% using CS TRAIN abstract scenes). This suggests that abstract scenes are better at visual grounding than real images, due to their rich semantic features.
Antol et al.  have studied generalization of classification models learnt on abstract scenes to real images. The idea is to transfer fine-grained concepts that are easier to learn in the fully-annotated abstract domain to tasks in the real domain. Our work can also be seen as a method of studying generalization. One can view vis-w2v as a way to transfer knowledge learnt in the abstract domain to the real domain, via text embeddings (which are shared across the abstract and real domains). Our results on commonsense assertion classification show encouraging preliminary evidence of this.
We next discuss some considerations in the design of the model. A possible design choice when learning embeddings could have been to construct a triplet loss function, where the similarity between a tuple and a pair of visual instances can be specified. That is, given a textual instance A, and two images B and C (where A describes B, and not C), one could construct a loss that enforces $s(A, B) > s(A, C)$ for some similarity function $s$, and learn joint embeddings for words and images. However, since we want to learn hidden semantic relatedness (e.g., “eats”, “stares at”), there is no explicit supervision available at train time on which images and words should be related. Although the visual scenes and associated text inherently provide information about related words, they do not capture the unrelatedness between words, i.e., we do not have negatives to help us learn the semantics.
We can also understand vis-w2v in terms of data augmentation. With infinite text data describing scenes, distributional statistics captured by w2v would reflect all possible visual patterns as well. In this sense, there is nothing special about the visual grounding. The additional modality helps to learn complementary concepts while making efficient use of data. Thus, the visual grounding can be seen as augmenting the amount of textual data.
We learn visually grounded word embeddings (vis-w2v) from abstract scenes and associated text. Abstract scenes, being trivially fully annotated, give us access to a rich semantic feature space. We leverage this to uncover visually grounded notions of semantic relatedness between words that would be difficult to capture using text alone or using real images. We demonstrate the visual grounding captured by our embeddings on three applications that are in text, but benefit from visual cues: 1) common sense assertion classification, 2) visual paraphrasing, and 3) text-based image retrieval. Our method outperforms word2vec (w2v) baselines on all three tasks. Further, our method can be viewed as a modality to transfer knowledge from the abstract scenes domain to the real domain via text. Our datasets, code, and vis-w2v embeddings are available for public use.
Acknowledgments: This work was supported in part by The Paul G. Allen Family Foundation via an award to D.P., ICTAS at Virginia Tech via an award to D.P., a Google Faculty Research Award to D.P., the Army Research Office YIP Award to D.P., and ONR grant N000141210903.
We present detailed performance results of Visual Word2Vec (vis-w2v) on all three tasks:
Common sense assertion classification (Sec. A)
Visual paraphrasing (Sec. B)
Text-based image retrieval (Sec. C)
Specifically, we study the effect of various hyperparameters like the number of surrogate labels, the number of hidden layer nodes, etc., on the performance of both vis-w2v-coco and vis-w2v-wiki. We remind the reader that vis-w2v-coco models are initialized with w2v learnt on visual text, i.e., MSCOCO captions in our case, while vis-w2v-wiki models are initialized with w2v learnt on generic Wikipedia text. We also show a few visualizations and examples to qualitatively illustrate why vis-w2v performs better in these tasks that are ostensibly in text, but benefit from visual cues. We conclude by presenting the results of training on real images (Sec. D). We also show a comparison to the model from Ren et al., who also learn word2vec with visual grounding.
Appendix A Common Sense Assertion Classification
Recall that the common sense assertion classification task  is to determine if a tuple of the form (primary object or P, relation or R, secondary object or S) is plausible or not. In this section, we first describe the abstract visual features used by . We follow it with results for vis-w2v-coco, both shared and separate models, by varying the number of surrogate classes . We next discuss the effect of number of hidden units which can be seen as the complexity of the model. We then vary the amount of training data and study performance of vis-w2v-coco. Learning separate word embeddings for each of these specific roles, i.e., P, R or S results in separate models while learning single embeddings for all of them together gives us shared models. Additionally, we also perform and report similar studies for vis-w2v-wiki. Finally, we visualize the clusters learnt for the common sense task through word clouds, similar to Fig. 4 in the main paper.
A.1 Abstract Visual Features
We describe the features extracted from abstract scenes for the task of common sense assertion classification. Our visual features are essentially the same as those used by: a) Features corresponding to primary and secondary object, i.e., P and S respectively. These include type (category ID and instance ID), absolute location modeled via Gaussian Mixture Model (GMM), orientation, attributes and poses for both P and S present in the scene. We use Gaussian Mixture at hands and foot locations to model pose, measuring relative positions and joint locations. Human attributes are age (5 discrete values), skin color (3 discrete values) and gender (2 discrete values). Animals have 5 discrete poses. Human pose features are constructed using keypoint locations. b) Features corresponding to relative location of P and S, once again modeled using Gaussian Mixture Models. These features are normalized by the flip and depth of the primary object, which results in the features being asymmetric. We compute these with respect to both P and S to make the features symmetric. c) Features related to the presence of other objects in the scene, i.e., category ID and instance ID for all the other objects. Overall the feature vector is of dimension .
A.2 Varying number of clusters
Intuition: We cluster the images in the semantic clipart feature space to get surrogate labels. We use these labels as visual context, and predict them using words to enforce visual grounding. Hence, we study the influence of the number of surrogate classes relative to the number of images. This is indicative of how coarse/detailed the visual grounding for a task needs to be.
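The step of turning visual features into surrogate labels can be sketched as follows. This is a minimal illustration, assuming k-means (Lloyd's algorithm) as the clustering method and a farthest-point initialization chosen only to keep the example deterministic; the toy features and the function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_labels(features, k, iters=20):
    """Return one cluster id per feature vector; the id serves as the surrogate label."""
    # Farthest-point initialization (an assumption made for this example).
    centers = [list(features[0])]
    while len(centers) < k:
        farthest = max(features, key=lambda f: min(sq_dist(f, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each scene goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(f, centers[j])) for f in features]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Four toy "scenes" forming two well-separated groups (hypothetical features).
scene_features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
surrogate = kmeans_labels(scene_features, k=2)
```

With few clusters, many dissimilar scenes share one label (coarse grounding); with as many clusters as scenes, every label is unique and carries no notion of relatedness.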
Setup: We train vis-w2v models by clustering visual features with and without dimensionality reduction through Principal Component Analysis (PCA), giving us the PCA and Orig settings, respectively. Notice that each of the elements of tuples, i.e., P, R or S could have multiple words, e.g., lay next to. We handle these in two ways: a) Place each of the words in separate windows and predict the visual context repeatedly. Here, we train by predicting the same visual context for lay, next, to thrice. This gives us the Words setting. b) Place all the words in a single window and predict the visual context for the entire element only once. This gives the Phrases setting. We explore the cross product of the Orig/PCA and Words/Phrases settings. PCA/Phrases (red in Fig. 6) refers to the model trained by clustering the dimensionality reduced visual features and handling multi-word elements by including them in a single window. We vary the number of surrogate classes from to in steps of , re-train vis-w2v for each , and report the accuracy on the common sense task. The number of hidden units is kept fixed to to be comparable to the text-only baseline reported in . Fig. 6 shows the performance on the common sense task as varies for both shared and separate models in four possible configurations each, as described above.
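The difference between the Words and Phrases settings can be made concrete with a small sketch of how training pairs might be formed for one tuple element. The function name, the pair representation, and the surrogate label value are hypothetical.

```python
def training_windows(element, surrogate_label, setting):
    """Return (window, label) training pairs for one tuple element such as 'lay next to'."""
    words = element.split()
    if setting == "Words":
        # One window per word: the same visual context is predicted once per word.
        return [([w], surrogate_label) for w in words]
    if setting == "Phrases":
        # A single window holding all words: the visual context is predicted once.
        return [(words, surrogate_label)]
    raise ValueError(f"unknown setting: {setting}")

words_pairs = training_windows("lay next to", 7, "Words")
phrase_pairs = training_windows("lay next to", 7, "Phrases")
```

For the element lay next to, the Words setting yields three training pairs with the same label, while the Phrases setting yields a single pair.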
As varies, the performance for both shared and separate models increases initially and then either saturates or decreases. For a given dataset, low values of result in the visual context being too coarse to learn the visual grounding. On the other hand, being too high results in clusters which do not capture visual semantic relatedness. We found the best model to have around clusters in both the cases.
Words models perform better than Phrases models in both cases. The common sense task involves reasoning about the specific role (P, R or S) each word plays. For example, (man, eats, sandwich) is plausible while (sandwich, eats, sandwich) or (man, sandwich, eats) is not. Potentially, vis-w2v could learn these roles in addition to learning the semantic relatedness between the words. This explains why separate models perform better than shared models, and why the Words setting outperforms the Phrases setting.
For lower , PCA models dominate over Orig models while the latter outperforms as increases. As low values of correspond to coarse visual information, surrogate classes in PCA models could be of better quality and thus help in learning the visual semantics.
A.3 Varying number of hidden units
Intuition: One of the model parameters for our vis-w2v is the number of hidden units . This can be seen as the capacity of the model. We vary while keeping the other factors constant during training to study its effect on performance of the vis-w2v model.
Setup: To understand the role of , we consider two vis-w2v models trained separately with set to and respectively. Additionally, both of these are separate models with Orig/Words configuration (see Sec. A.2). We particularly choose these two settings as the former is trained with a very coarse visual semantic information while the latter is the best performing model. Note that as  fix the number of hidden units to in their evaluation, we cannot directly compare the performance to their baseline. We, therefore, recompute the baselines for each value of and use it to compare our two models, as shown in Fig. 8.
Observations: Models of low complexity, i.e., low values of , perform the worst. This could be due to the inherent limitation of low to capture the semantics, even for w2v. On the other hand, high complexity models also perform poorly, although better than the low complexity models. The number of parameters to be learnt, i.e. and , increase linearly with . Therefore, for a finite amount of training data, models of high complexity tend to overfit, resulting in a drop in performance on an unseen test set. The baseline w2v models also follow a similar trend. It is interesting to note that the improvement of vis-w2v over w2v for less complex models (smaller ) is at (for ) as compared to (for ). In other words, lower complexity models benefit more from the vis-w2v enforced visual grounding. In fact, vis-w2v of low complexity , outperforms the best w2v baseline across all possible settings of model parameters. This provides strong evidence for the usefulness of visually grounding word embeddings in capturing visually-grounded semantics better.
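The linear growth in the number of parameters noted above can be checked with a short calculation. The sketch assumes a model with one input projection matrix and one output matrix and ignores biases; the vocabulary size, the number of output classes, and the function name are hypothetical values chosen only to show the trend.

```python
def num_parameters(vocab_size, hidden_units, num_classes):
    """Weights in a two-matrix model: input projection plus output layer (biases ignored)."""
    input_weights = vocab_size * hidden_units
    output_weights = hidden_units * num_classes
    return input_weights + output_weights

# Hypothetical sizes, not the values used in our experiments.
vocab_size, num_classes = 1000, 25
counts = {h: num_parameters(vocab_size, h, num_classes) for h in (50, 100, 200)}
```

Doubling the number of hidden units doubles the number of weights, while the amount of training data stays fixed.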
A.4 Varying size of training data
Intuition: We next study how varying the size of the training data affects performance of the model. The idea is to analyze whether more data about relations would help the task, or more data per relation would help the task.
Setup: We remind the reader that vis-w2v for common sense task is trained on CS TRAIN dataset that contains 4260 abstract scenes made from clipart depicting 213 relations between various objects (20 scenes per relation). We identify two parameters: the number of relations and the number of abstract scenes per relation . Therefore, CS TRAIN dataset originally has . We vary the training data size in two ways: a) Fix and vary . b) Fix and vary in steps of from to . These cases denote two specific situations–the former limits the model in terms of how much it knows about each relation, i.e. its depth, keeping the number of relations, i.e. its breadth, constant; while the latter limits the model in terms of how many relations it knows, i.e., it limits the breadth keeping the depth constant. Throughout this study, we select the best performing vis-w2v model with in the Orig/Words configuration. Fig. 6(a) shows the performance on the common sense task when is fixed while Fig. 6(b) is the performance when is fixed.
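The two ways of shrinking the training set can be sketched as below. This is an illustration under assumptions: uniform random sampling, the placeholder relation and scene identifiers, and the specific reduced sizes are hypothetical; only the 213 relations and 20 scenes per relation come from the description of CS TRAIN.

```python
import random

def subsample(scenes_by_relation, num_relations, scenes_per_relation, seed=0):
    """Keep `num_relations` relations and `scenes_per_relation` scenes for each kept relation."""
    rng = random.Random(seed)
    kept = rng.sample(sorted(scenes_by_relation), num_relations)
    return {r: rng.sample(scenes_by_relation[r], scenes_per_relation) for r in kept}

# Stand-in for CS TRAIN: 213 relations with 20 scene ids each (ids are placeholders).
cs_train = {f"rel_{i}": [f"rel_{i}_scene_{j}" for j in range(20)] for i in range(213)}

# a) All relations, fewer scenes per relation: limits depth, keeps breadth.
fewer_scenes = subsample(cs_train, num_relations=213, scenes_per_relation=10)
# b) Fewer relations, all scenes per relation: limits breadth, keeps depth.
fewer_relations = subsample(cs_train, num_relations=100, scenes_per_relation=20)
```

Both subsets contain roughly half of the original scenes, but they differ in whether relations or scenes per relation were removed.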
Observations: The performance increases with the increasing size of training data in both the situations when and is fixed. However, the performance saturates in the former case while it increases at an almost linear rate in the latter. This shows that breadth helps more than depth in learning visual semantics. In other words, training with more relations and fewer scenes per relation is more beneficial than training with fewer relations and more scenes per relation. To illustrate this, consider performance with approximately half the size of the original CS TRAIN dataset. In the former case, it corresponds to at while at in the latter. Therefore, we conclude that the model learns semantics better with more concepts (relations) over more instances (abstract scenes) per concept.
A.5 Cluster Visualizations
We show the cluster visualizations for a randomly sampled set of relations from the CS VAL set (Fig. 9). As in the main paper (Fig. 4), we analyze how frequently two relations co-occur in the same clusters. Interestingly, relations like drink from co-occur with relations like blow out and bite into which all involve action with a person’s mouth.
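One simple way to quantify how frequently two relations fall in the same clusters is to count shared clusters per pair. The sketch below is an assumption about how such a count could be computed; the cluster contents are invented for illustration and are not taken from Fig. 9.

```python
from collections import Counter
from itertools import combinations

def relation_cooccurrence(cluster_to_relations):
    """Count, for each unordered pair of relations, the clusters in which both appear."""
    counts = Counter()
    for relations in cluster_to_relations.values():
        for pair in combinations(sorted(set(relations)), 2):
            counts[pair] += 1
    return counts

# Hypothetical cluster contents, for illustration only.
clusters = {
    0: ["drink from", "blow out", "bite into"],
    1: ["drink from", "bite into", "sit on"],
    2: ["sit on", "stand next to"],
}
cooc = relation_cooccurrence(clusters)
```

Pairs with high counts, such as ("bite into", "drink from") in this toy input, are the ones a word cloud would place prominently together.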
Appendix B Visual Paraphrasing
The Visual Paraphrasing (VP) task  is to classify whether a pair of textual descriptions are paraphrases of each other. These descriptions have three sentences each. Table 4 presents results on VP for various settings of the model that are described below.
Model settings: We vary the number of hidden units for both vis-w2v-coco and vis-w2v-wiki models. We also vary our context window size to include entire description (Descs), individual sentences (Sents), window of size (Winds) and individual words (Words). As described in Sec. A.2, we also have Orig and PCA settings.
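The four context window settings can be illustrated by splitting one toy description in each way. This is a sketch under assumptions: the sliding, per-sentence windows used for Winds, the placeholder window size of 3, the example sentences, and the function name are all hypothetical.

```python
def context_windows(description, setting, window_size=None):
    """Split a description (a list of sentences) into context windows of words."""
    sentences = [s.split() for s in description]
    words = [w for s in sentences for w in s]
    if setting == "Descs":
        return [words]                # the entire description as one window
    if setting == "Sents":
        return sentences              # one window per sentence
    if setting == "Winds":
        # Sliding windows within each sentence (an assumption for this example).
        return [s[i:i + window_size]
                for s in sentences
                for i in range(max(1, len(s) - window_size + 1))]
    if setting == "Words":
        return [[w] for w in words]   # one window per word
    raise ValueError(f"unknown setting: {setting}")

desc = ["Mike kicks the ball", "Jenny is happy", "The sun shines"]
n_descs = len(context_windows(desc, "Descs"))
n_sents = len(context_windows(desc, "Sents"))
n_winds = len(context_windows(desc, "Winds", window_size=3))  # placeholder size
n_words = len(context_windows(desc, "Words"))
```

Moving from Descs to Words, each window carries less of the sentence-level context that the same surrogate label is predicted from.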
Observations: From Table 4, we see improvements over the text baseline . In general, PCA configuration outperforms Orig for low complexity models (). Using entire description or sentences as the context window gives almost the same gains, while performance drops when smaller context windows are used (Winds and Words). As VP is a sentence level task where one needs to reason about the entire sentence to determine whether the given descriptions are paraphrases, these results are intuitive.
Appendix C Text-based Image Retrieval
Recall that in Text-based Image Retrieval (Sec. 4.3 in main paper), we highlight the primary object (P) and secondary object (S) and ask workers on Amazon Mechanical Turk (AMT) to describe the relation illustrated by the scene with tuples. An illustration of our tuple collection interface can be found in Fig. 10. Each of the tuples entered in the text-boxes is treated as the query for text-based image retrieval.
Some qualitative examples of success and failure cases of vis-w2v-wiki with respect to w2v-wiki are shown in Fig. 11. We see that vis-w2v-wiki captures notions such as the relationship between holding and opening better than w2v-wiki.
Appendix D Real Image Experiments
We now present the results when training vis-w2v with real images from MSCOCO dataset by clustering using fc7 features from the VGG-16  CNN.
Intuition: We train vis-w2v embeddings with real images and compare them to those trained with abstract scenes, through the common sense task.
Setup: We experiment with two settings: a) Considering all the images from MSCOCO dataset, along with associated captions. Each image has around captions giving us a total of around captions to train. We call vis-w2v trained on this dataset as vis-w2v80k. b) We randomly select 213 relations from VAL set and collect 20 real images from MSCOCO and their corresponding tuples. This would give us real images with tuples, depicting the 213 CS VAL relations. We refer to this model as vis-w2v4k.
We first train vis-w2v80k with and use the fc7 features as is, i.e. without PCA, in the Sents configuration (see Sec. B). Further, to investigate the complementarity between visual semantics learnt from real and abstract scenes, we initialize vis-w2v-coco with vis-w2v-coco80k, i.e., we learn the visual semantics from the real scenes and train again to learn from abstract scenes. Table 5 shows the results for vis-w2v-coco80k, varying the number of surrogate classes .
We then learn vis-w2v4k with in the Orig/Words setting (see Sec. A). We observe that the performance on the validation set reduces for vis-w2v-coco4k. Table 6 summarizes the results for vis-w2v-wiki4k.
Observations: From Table 5 and Table 6, we see that there are indeed improvements over the text baseline of w2v. The complementarity results (Table 5) show that abstract scenes help us ground word embeddings through semantics complementary to those learnt from real images. Comparing the improvements from real images (best AP of ) to those from abstract scenes (best AP of ), we see that abstract visual features capture visual semantics better than real images for this task. It is often difficult to capture localized semantics in the case of real images. For instance, extracting semantic features of just the primary and secondary objects given a real image is a challenging detection problem in vision. On the other hand, abstract scenes offer these fine-grained semantic features, making them ideal for visually grounding word embeddings.
Appendix E Comparison to Ren et al.
We next compare the embeddings from our vis-w2v model to those from Ren et al. . Similar to ours, their model can also be understood as a multi-modal extension of the Continuous Bag of Words (CBOW) architecture. More specifically, they use global-level fc7 image features in addition to the local word context to estimate the probability of a word conditioned on its context.
We use their model to finetune the w2v-coco word embeddings using real images from the MS COCO dataset. This performs slightly worse on common sense assertion classification than our corresponding (real image) model (Sec. 6.4) (73.4% vs 73.7%), while our best model gives a performance of 74.8% when trained with abstract scenes. We then initialize the projection matrix in our vis-w2v model with the embeddings from Ren et al.’s model, and finetune with abstract scenes, following our regular training procedure. We find that the performance improves to 75.2% for the separate model. This is a 0.4% improvement over our best vis-w2v separate model. In contrast, using a curriculum of training with real image features and then with abstract scenes within our model yields a slightly lower improvement of 0.2%. This indicates that the global visual features incorporated in the model of Ren et al. and the fine-grained visual features from abstract scenes in our model provide complementary benefits, and that a combination yields richer embeddings.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
-  S. Antol, C. L. Zitnick, and D. Parikh. Zero-shot learning via visual abstraction. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV, pages 401–416, 2014.
-  Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
-  S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical report, 1998.
-  X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. CoRR, abs/1411.5654, 2014.
-  R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML, 2008.
-  C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (ICCV), 2015.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
-  A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27 (NIPS), 2014.
-  D. F. Fouhey and C. L. Zitnick. Predicting object dynamics in scenes. In CVPR, 2014.
-  H. Gao, J. Mao, J. Zhou, Z. Huang, and A. Yuille. Are you talking to a machine? dataset and methods for multilingual image question answering. ICLR, 2015.
-  D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. (JAIR), 47:853–899, 2013.
-  M. Jas and D. Parikh. Image Specificity. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
-  S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. In IEEE Transactions on Acoustics, Speech and Signal Processing, pages 400–401, 1987.
-  R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. page 13, 11 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
-  G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR, 2011.
-  A. Lazaridou, N. T. Pham, and M. Baroni. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 153–163, Denver, Colorado, May–June 2015. Association for Computational Linguistics.
-  T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
-  X. Lin and D. Parikh. Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR (to appear), Nov. 2015.
-  E. Loper and S. Bird. NLTK: The natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics, 2002.
-  S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
-  M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. CoRR, abs/1410.0210, 2014.
-  M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. CoRR, abs/1505.01121, 2015.
-  J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. CoRR, abs/1410.1090, 2014.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013.
-  T. Mikolov, J. Kopecky, L. Burget, O. Glembek, and J. Cernocky. Neural network based language models for highly inflective languages. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4725–4728. IEEE, 2009.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
-  M. Mitchell, X. Han, and J. Hayes. Midge: Generating descriptions of images. In Proceedings of the Seventh International Natural Language Generation Conference, INLG ’12, pages 131–133, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
-  R. Vedantam, X. Lin, T. Batra, C. L. Zitnick, and D. Parikh. Learning common sense through visual abstraction. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  M. Ren, R. Kiros, and R. S. Zemel. Image question answering: A visual semantic embedding model and a new dataset. CoRR, abs/1505.02074, 2015.
-  M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In IEEE International Conference on Computer Vision (ICCV), December 2013.
-  F. Sadeghi, S. K. Divvala, and A. Farhadi. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. In CVPR, pages 1456–1464, 2015.
-  C. Silberer, V. Ferrari, and M. Lapata. Models of semantic representation with visual attributes. In ACL (1), pages 572–582. The Association for Computer Linguistics, 2013.
-  C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 721–732, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
-  R. Xu, J. Lu, C. Xiong, Z. Yang, and J. J. Corso. Improving word representations via global visual context. 2014.
-  C. Zitnick, R. Vedantam, and D. Parikh. Adopting abstract images for semantic scene understanding. PAMI, 2014.
-  C. L. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In CVPR, 2013.
-  C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. In ICCV, 2013.