ViCo: Word Embeddings from Visual Co-occurrences

We propose to learn word embeddings from visual co-occurrences. Two words co-occur visually if both words apply to the same image or image region. Specifically, we extract four types of visual co-occurrences between object and attribute words from large-scale, textually-annotated visual databases like VisualGenome and ImageNet. We then train a multi-task log-bilinear model that compactly encodes word "meanings" represented by each co-occurrence type into a single visual word-vector. Through unsupervised clustering, supervised partitioning, and a zero-shot-like generalization analysis we show that our word embeddings complement text-only embeddings like GloVe by better representing similarities and differences between visual concepts that are difficult to obtain from text corpora alone. We further evaluate our embeddings on five downstream applications, four of which are vision-language tasks. Augmenting GloVe with our embeddings yields gains on all tasks. We also find that random embeddings perform comparably to learned embeddings on all supervised vision-language tasks, contrary to conventional wisdom.


1 Introduction

Word embeddings, i.e., compact vector representations of words, are an integral component in many language [46, 14, 23, 38, 36, 48, 43] and vision-language models [28, 52, 53, 2, 40, 41, 49, 12, 47, 6, 54, 16, 27]. These word embeddings, e.g., GloVe and word2vec, are typically learned from large-scale text corpora by modeling textual co-occurrences. However, text often consists of interpretations of concepts or events rather than a description of visual appearance. This limits the ability of text-only word embeddings to represent visual concepts.

To address this shortcoming, we propose to gather co-occurrence statistics of words based on images and learn word embeddings from these visual co-occurrences. Concretely, two words co-occur visually if both words are applicable to the same image or image region. We use four types of co-occurrences as shown in Fig. 1: (1) Object-Attribute co-occurrence between an object in an image region and the region’s attributes; (2) Attribute-Attribute co-occurrence of a region; (3) Context co-occurrence which captures joint object appearance in the same image; and (4) Object-Hypernym co-occurrence between a visual category and its hypernym (super-class).

Ideally, for reliable visual co-occurrence modeling of a sufficiently large vocabulary (a vocabulary size of K is typical for text-only embeddings), a dataset with all applicable vocabulary words annotated for each region in an image is required. While no visual dataset exists with such exhaustive annotations (many non-annotated words may still be applicable to an image region), large scale datasets like VisualGenome [17] and ImageNet [8] along with their WordNet [32] synset annotations provide a good starting point. We use ImageNet annotations augmented with WordNet hypernyms to compute Object-Hypernym co-occurrences while the remaining types of co-occurrence are computed from VisualGenome’s object and attribute annotations.

To learn ViCo, i.e., word embeddings from Visual Co-occurrences, we could concatenate GloVe-like embeddings trained separately for each co-occurrence type via a log-bilinear model. However, in this naïve approach, the dimensionality of the learned embeddings scales linearly with the number of co-occurrence types. To avoid this linear scaling, we extend the log-bilinear model by formulating a multi-task problem, where learning embeddings from each co-occurrence type constitutes a different task with compact trainable embeddings shared among all tasks. In this formulation the embedding dimension can be chosen independently of the number of co-occurrence types.

To test ViCo’s ability to capture similarities and differences between visual concepts, we analyze performance in unsupervised clustering, supervised partitioning (see supplementary material), and a zero-shot-like visual generalization setting. The clustering analysis is performed on a set of most frequent words in VisualGenome which we manually label with coarse and fine-grained visual categories. For the zero-shot-like setting, we use CIFAR-100 with different splits of the 100 categories into seen and unseen sets. In both cases, ViCo augmented GloVe outperforms GloVe, random vectors, vis-w2v, or their combinations. Through a qualitative analogy question answering evaluation, we also find the ViCo embedding space to better capture relations between visual concepts than GloVe.

We also evaluate ViCo on five downstream tasks – a discriminative attributes task, and four vision-language tasks. The latter includes Caption-Image Retrieval, VQA, Referring Expression Comprehension, and Image Captioning. Systems using ViCo outperform those using GloVe for almost all tasks and metrics. While learned embeddings are typically believed to be important for vision-language tasks, somewhat surprisingly, we find random embeddings compete tightly with learned embeddings on all vision-language tasks. This suggests that either by nature of the tasks, model design, or simply training on large datasets, the current state-of-the-art vision-language models do not benefit much from learned embeddings. Random embeddings perform significantly worse than learned embeddings in our clustering, partitioning, and zero-shot analysis, as well as the discriminative attributes task, which does not involve images.

To summarize our contributions: (1) We develop a multi-task method to learn a word embedding from multiple types of co-occurrences; (2) We show that the embeddings learned from multiple visual co-occurrences, when combined with GloVe, outperform GloVe alone in unsupervised clustering, supervised partitioning, and zero-shot-like analysis, as well as on multiple vision-language tasks; (3) We find that performance of supervised vision-language models is relatively insensitive to word embeddings, with even random embeddings leading to nearly the same performance as learned embeddings. To the best of our knowledge, our study provides the first empirical evidence of this unintuitive behavior for multiple vision-language tasks.

2 Related Work

Here we describe non-associative, associative, and the most recent contextual models of word representation.

Non-Associative Models. Semantic Differential (SD) [34] is among the earliest attempts to obtain vector representations of words. SD relies on human ratings of words on 50 scales between bipolar adjectives, such as ‘happy-sad’ or ‘slow-fast.’ Osgood et al. [34] further reduced the 50 scales to 3 orthogonal factors. However, the scales were often vague (e.g., is the word ‘coffee’ ‘slow’ or ‘fast’?) and provided a limited representation of the word meaning. Another approach involved acquiring word similarity annotations followed by applying Multidimensional Scaling (MDS) [21] to obtain low dimensional (typically 2-4) embeddings and then identifying meaningful clusters or interpretable dimensions [45]. Like SD, the MDS approach lacked representation power, and embeddings and their interpretations varied based on the words (e.g., food names [45], animals [44], etc.) to which MDS was applied.

Associative Models. The hypothesis underlying associative models is that word-meaning may be derived by modeling a word’s association with all other words. Early attempts involved factorization of word-document [7] or word-word [26] co-occurrence matrices. Since raw co-occurrence counts can span several orders of magnitude, transformations of the co-occurrence matrix based on Positive Pointwise Mutual Information (PPMI) [4] and Hellinger distance [22] have been proposed. Recent neural approaches like the Continuous Bag-of-Words (CBOW) and the Skip-Gram models [29, 31, 30] learn from co-occurrences in local context windows as opposed to global co-occurrence statistics. Unlike global matrix factorization, approaches based on local context windows use co-occurrence statistics rather inefficiently, because they must scan context windows across the corpus during training, but they perform better on word-analogy tasks. Levy and Goldberg [24] later showed that Skip-Gram with negative-sampling performs implicit matrix factorization of a PMI word-context matrix.
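To make the PPMI transformation mentioned above concrete, here is a minimal sketch in plain Python. The function name `ppmi` and the toy counts are illustrative assumptions and are not taken from the paper; the sketch only covers pairs with non-zero counts.

```python
import math
from collections import defaultdict

def ppmi(counts):
    """Map raw co-occurrence counts {(i, j): n} (n > 0) to PPMI values.

    PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ); PPMI clips negatives to 0.
    """
    total = sum(counts.values())
    row = defaultdict(float)  # marginal count of each first word
    col = defaultdict(float)  # marginal count of each second word
    for (i, j), n in counts.items():
        row[i] += n
        col[j] += n
    out = {}
    for (i, j), n in counts.items():
        pmi = math.log(n * total / (row[i] * col[j]))
        out[(i, j)] = max(0.0, pmi)
    return out

# Toy counts (hypothetical): frequent pairs get a positive score,
# rarer-than-chance pairs are clipped to zero.
counts = {("dog", "bark"): 8, ("dog", "meow"): 1,
          ("cat", "bark"): 1, ("cat", "meow"): 8}
scores = ppmi(counts)
```

The clipping compresses the dynamic range of raw counts, which is the motivation the paragraph gives for such transformations.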

Figure 2: Log-bilinear models and our multi-task extension. We show loss computation of different approaches for learning word embeddings $w_i$ and $w_j$ for words $i$ and $j$. The embeddings are denoted by colored vertical bars. (i) shows GloVe’s log-bilinear model. (ii) is our multi-task extension to learn from multiple co-occurrence matrices. Word embeddings $w_i$ and $w_j$ are projected into a dedicated space for each co-occurrence type $t$ through transformation $\phi_t$. Log-bilinear losses are computed in the projected embedding spaces. (iii) shows an approach where the different colored regions of $w_i$ (or $w_j$) are allocated to learn from different co-occurrence types. This approach, equivalent to training separate embeddings followed by concatenation, can be implemented in our multi-task formulation using a select transform (Tab. 1). Tab. 4 shows that an appropriate choice of $\phi_t$ (e.g., linear) in the multi-task framework leads to more compact embeddings than (iii) without sacrificing performance since the correlation between different co-occurrence types is utilized.

Our work is most closely related to GloVe [37] which combines the efficiency of global matrix factorization approaches with the performance obtained from modelling local context. We extend GloVe’s log-bilinear model to simultaneously learn from multiple types of co-occurrences. We also demonstrate that visual datasets annotated with words are a rich source of co-occurrence information that complements the representations learned from text corpora alone.

Visual Word Embeddings. There is some work on incorporating image representations into word embeddings. vis-w2v [18] uses abstract (synthetic) scenes to learn visual relatedness. The scenes are clustered and cluster membership is used as a surrogate label in a CBOW framework. Abstract scenes have the advantage of providing good semantic features for free but are limited in their ability to match the richness and diversity of natural scenes. However, natural scenes present the challenge of extracting good semantic features. Our approach uses natural scenes but bypasses image feature extraction by only using co-occurrences of annotated words. ViEW is another approach to visually enhance existing word embeddings. An autoencoder is trained on pre-trained word embeddings while matching intermediate representations to visual features extracted from a convolutional network trained on ImageNet. ViEW is also limited by the requirement of good image features.

Contextual Models. Embeddings discussed so far represent individual words. However, many language understanding applications demand representations of words in context (e.g., in a phrase or sentence), which in turn requires learning how to combine word or character level representations of neighboring words or characters. The past year has seen several advances in contextualized word representations through pre-training on language models such as ELMo [39], OpenAI GPT [42], and BERT [9]. However, building mechanisms for representing context is orthogonal to our goal of improving representations of individual words (which may be used as input to these models).

3 Learning ViCo

We describe the GloVe formulation for learning embeddings from a single co-occurrence matrix in Sec. 3.1 and introduce our multi-task extension to learn embeddings jointly from multiple co-occurrence matrices in Sec. 3.2. Sec. 3.3 describes how co-occurrence count matrices are computed for each of the four co-occurrence types.

3.1 GloVe: Log-bilinear Model

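Let $X_{ij}$ denote the co-occurrence count between words $i$ and $j$ in a text corpus. Also let $\mathcal{X}$ be the list of word pairs with non-zero co-occurrences. GloVe learns $d$-dimensional embeddings $w_i \in \mathbb{R}^d$ for all words $i$ by optimizing

$$\min_{w,\,b} \sum_{(i,j) \in \mathcal{X}} f(X_{ij}) \left( w_i^\top w_j + b_i + b_j - \log X_{ij} \right)^2 \tag{1}$$

where $f$ is a weighting function that assigns lower weight to less frequent, noisy co-occurrences and $b_i$ is a learnable bias term for word $i$.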

Intuitively, the program in Eq. (1) learns word embeddings such that for any word pair $(i, j)$ with non-zero co-occurrence, the dot product $w_i^\top w_j$ approximates the log co-occurrence count $\log X_{ij}$ up to an additive constant. The word meaning is derived by simultaneously modeling the degrees of association of a single word with a large number of other words [33]. We also refer the reader to [37] for more details.
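The log-bilinear objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the helper names are hypothetical, and the weighting function with `x_max=100` and `alpha=0.75` is the one from the original GloVe paper rather than something specified in this text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe-paper weighting (assumed here): down-weights rare pairs, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def log_bilinear_loss(w, b, counts, weight=glove_weight):
    """Sum over non-zero pairs of f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    loss = 0.0
    for (i, j), x in counts.items():
        err = dot(w[i], w[j]) + b[i] + b[j] - math.log(x)
        loss += weight(x) * err ** 2
    return loss

# Toy check: the dot product equals log(5), so a count of 5 is fit exactly.
w = {"a": [1.0, 0.0], "b": [math.log(5.0), 0.0]}
b = {"a": 0.0, "b": 0.0}
exact = log_bilinear_loss(w, b, {("a", "b"): 5.0})
off = log_bilinear_loss(w, b, {("a", "b"): 100.0})
```

With the embeddings above, `exact` is zero because the prediction matches the log count, while `off` is the squared gap between log 5 and log 100.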

Note the slight difference between the objective in Eq. (1) and the original GloVe objective: GloVe replaces $w_j$ and $b_j$ with $\tilde{w}_j$ (context vector) and $\tilde{b}_j$, which are also trainable. The GloVe vectors are obtained by averaging $w_i$ and $\tilde{w}_i$. However, as also noted in [37], given the symmetry in the objective, both vectors should ideally be identical. We did not observe a significant change in performance when using separate word and context vectors.

3.2 Multi-task Log-bilinear Model

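We now extend the log-bilinear model described above to jointly learn embeddings from multiple co-occurrence count matrices $X^t$, where $t$ refers to a type from the set of types $\mathcal{T}$. Also let $\mathcal{X}_t$ and $\bar{\mathcal{X}}_t$ be the lists of word pairs with non-zero and zero co-occurrences of type $t$, respectively. We learn ViCo embeddings $w_i \in \mathbb{R}^D$ for all words $i$ by minimizing the following loss function

$$\sum_{t \in \mathcal{T}} \sum_{(i,j) \in \mathcal{X}_t} \left( \phi_t(w_i)^\top \phi_t(w_j) + b_i^t + b_j^t - \log X_{ij}^t \right)^2 + \sum_{t \in \mathcal{T}} \sum_{(i,j) \in \bar{\mathcal{X}}_t} \max\left(0,\; \phi_t(w_i)^\top \phi_t(w_j) + b_i^t + b_j^t\right) \tag{2}$$

Here $\phi_t : \mathbb{R}^D \to \mathbb{R}^d$ is a co-occurrence type-specific transformation function that maps ViCo embeddings to a type-specialized embedding space. $b_i^t$ is a learned bias term for word $i$ and type $t$. We set function $f$ in Eq. (1) to the constant $1$ for all $X_{ij}$. Next, we discuss the transformations $\phi_t$, benefits of capturing different types of co-occurrences, use of the second term in Eq. (2), and training details. Fig. 2 illustrates (i) GloVe and versions of our model (ii,iii).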

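| Transform | Definition |
|---|---|
| select (200) | $\phi_t(w) = w[I_t]$, where $I_t$ are indices pre-allocated for $t$ in $\{1, \dots, 200\}$ |
| linear (50) | $\phi_t(w) = A_t w$, where $A_t$ is a learned matrix applied to the 50-dimensional $w$ |
| linear (100) | $\phi_t(w) = A_t w$, where $A_t$ is a learned matrix applied to the 100-dimensional $w$ |
| linear (200) | $\phi_t(w) = A_t w$, where $A_t$ is a learned matrix applied to the 200-dimensional $w$ |

Table 1: Description and parametrization of transforms. $\phi_t$ is a transform for co-occurrence type $t$. select corresponds to approach (iii) in Fig. 2 that concatenates separately trained 50-dimensional embeddings.
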
Figure 3: Rich sense of relatedness through multiple co-occurrences. Different notions of word relatedness exist but current word embeddings do not provide a way to disentangle those. Since ViCo is learned from multiple types of co-occurrences with dedicated embedding spaces for each (obtained through transformations $\phi_t$), it can provide a richer sense of relatedness. The figure shows cosine similarities computed in GloVe, ViCo(linear) and embedding spaces dedicated to different co-occurrence types (components of ViCo(select)). For example, ‘hosiery’ and ‘sock’ are related through an object-hypernym relation but not related through object-attribute or a contextual relation. ‘laptop’ and ‘desk’ on the other hand are related through context.

Transformations $\phi_t$. To understand the role of the transformations in learning from multiple co-occurrence matrices, consider the naïve approach of concatenating $d$-dimensional word embeddings learned separately for each type $t$ using Eq. (1). Such an approach would yield an embedding with $|\mathcal{T}| \times d$ dimensions. For instance, 4 co-occurrence types, each producing embeddings of size 50, leads to 200-dimensional final embeddings. Thus, a natural question arises: is it possible to learn a more compact representation by utilizing the correlations between different co-occurrence types?

Eq. (2) is a multi-task learning formulation where learning from each type of co-occurrence constitutes a different task. Hence, $\phi_t$ is equivalent to a task-specific head that projects the shared word embedding $w \in \mathbb{R}^D$ to a type-specialized embedding space $\mathbb{R}^d$. A log-bilinear model equivalent to Eq. (1) is then applied for each co-occurrence type in the corresponding specialized embedding space. We learn the embeddings $w$ and parameters of $\phi_t$ simultaneously for all $t$ in an end-to-end manner.

With this multi-task formulation the dimension $D$ of $w$ can be chosen independently of the number of types $|\mathcal{T}|$ or the per-type dimension $d$. Also note that the new formulation encompasses the naïve approach, which is implemented in this framework by setting $D = |\mathcal{T}| \times d$, and $\phi_t$ as a slicing operation that ‘selects’ $d$ non-overlapping indices allocated for type $t$. In our experiments, we evaluate this naïve approach and refer to it as the select transformation. We also assess linear transformations of different dimensions as described in Tab. 1. We find that 100 dimensional ViCo embeddings learned with linear transform achieve the best performance-compactness trade-off.
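The two families of transformations discussed above (slicing versus a learned linear map) can be illustrated with a short sketch. The factory names `make_select` and `make_linear`, the type keys, and the tiny dimensions are hypothetical choices for illustration only.

```python
def make_select(indices):
    """select: keep only the coordinates of the shared embedding reserved for one type."""
    return lambda w: [w[k] for k in indices]

def make_linear(A):
    """linear: project the shared embedding with a matrix given as a list of rows."""
    return lambda w: [sum(a * x for a, x in zip(row, w)) for row in A]

# Shared embedding with 4 dimensions and two illustrative co-occurrence types,
# each mapped to a 2-dimensional type-specialized space.
w = [1.0, 2.0, 3.0, 4.0]

# select: each type owns a disjoint slice, so the shared dimension must grow
# with the number of types.
select = {"obj_attr": make_select([0, 1]), "context": make_select([2, 3])}

# linear: every type reads all coordinates, so the shared dimension is a free choice.
linear = {"obj_attr": make_linear([[1, 0, 1, 0], [0, 1, 0, 1]])}

sel_oa = select["obj_attr"](w)
sel_c = select["context"](w)
lin_oa = linear["obj_attr"](w)
```

In the select case no coordinate is shared across types, which is why it is equivalent to concatenating separately trained embeddings; the linear case lets types share coordinates.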

Role of the max term. Optimizing only the first term given in Eq. (2) can lead to accidentally embedding a word pair from $\bar{\mathcal{X}}_t$ (zero co-occurrences) close together (high dot product). To suppress such spurious similarities, we include the max term, which encourages all word pairs $(i,j) \in \bar{\mathcal{X}}_t$ to have a small predicted log co-occurrence

$$\phi_t(w_i)^\top \phi_t(w_j) + b_i^t + b_j^t \le 0.$$

In particular, the second term in the objective linearly penalizes positive predicted log co-occurrences of word pairs that do not co-occur.
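A minimal sketch of the two-part multi-task loss follows: a squared log-bilinear error on co-occurring pairs plus a linear penalty on positive predictions for non-co-occurring pairs. The hinge form `max(0, prediction)` is an assumption read off the description above, and the function name, argument layout, and toy values are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multitask_loss(w, phi, bias, nonzero, zero):
    """w: word -> shared embedding; phi: type -> transform function;
    bias: type -> {word: bias}; nonzero: type -> {(i, j): count > 0};
    zero: type -> list of (i, j) pairs that never co-occur."""
    loss = 0.0
    for t, f in phi.items():
        def pred(i, j):
            # Predicted log co-occurrence in the type-specialized space.
            return dot(f(w[i]), f(w[j])) + bias[t][i] + bias[t][j]
        for (i, j), x in nonzero[t].items():
            loss += (pred(i, j) - math.log(x)) ** 2
        for i, j in zero[t]:
            # Assumed hinge: only positive predictions are penalized.
            loss += max(0.0, pred(i, j))
    return loss

# Toy example with a single type and the identity transform.
w = {"a": [1.0, 0.0], "b": [0.5, 1.0], "c": [2.0, 0.0]}
phi = {"t": lambda v: v}
bias = {"t": {"a": 0.0, "b": 0.0, "c": 0.0}}
nonzero = {"t": {("a", "c"): math.exp(2.0)}}  # prediction 2.0 matches log count
zero = {"t": [("a", "b")]}                    # prediction 0.5 is penalized
loss = multitask_loss(w, phi, bias, nonzero, zero)
```

Here the co-occurring pair contributes nothing and the non-co-occurring pair contributes its positive prediction, so `loss` is 0.5.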

Obj-Attr Attr-Attr Obj-Hyp Context Overall
Unique Words
Non-zero entries (in millions)
Table 2: Co-occurrence statistics showing the number of words and millions of non-zero entries in each co-occurrence matrix. For reference, GloVe uses a vocabulary of words with 8-40 billion non-zero entries.

Training details. Pennington et al. [37] report Adagrad to work best for GloVe. We found that Adam leads to faster initial convergence. However, fine-tuning with Adagrad further decreases the loss. For both optimizers, we use a learning rate of , a batch size of word pairs sampled from $\mathcal{X}_t$ and $\bar{\mathcal{X}}_t$ each for all $t \in \mathcal{T}$, and no weight decay.

Multiple notions of relatedness. Learning from multiple co-occurrence types leads to a richer sense of relatedness between words. Fig. 3 shows that the relationship between two words may be better understood through similarities in multiple embedding spaces than just one. For example, ‘window’ and ‘door’ are related because they occur in context in scenes, ‘hair’ and ‘blonde’ are related through an object-attribute relation, ‘crouch’ and ‘squat’ are related because both attributes apply to similar objects, etc.

3.3 Computing Visual Co-occurrence Counts

To learn meaningful word embeddings from visual co-occurrences, reliable co-occurrence count estimates are crucial. We use VisualGenome and ImageNet for estimating visual co-occurrence counts. Specifically, we use object and attribute synset (set of words with the same meaning) annotations in VisualGenome to get Object-Attribute ($oa$), Attribute-Attribute ($aa$), and Context ($c$) co-occurrence counts. ImageNet synsets and their ancestors in WordNet are used to compute Object-Hypernym ($oh$) counts. Tab. 2 shows the number of unique words and non-zero entries in each co-occurrence matrix.

Let $\mathcal{T} = \{oa, aa, c, oh\}$ denote the set of four co-occurrence types and $X_{ij}^t$ denote the number of co-occurrences of type $t \in \mathcal{T}$ between words $i$ and $j$. We denote a synset and its associated set of words as $S$. All co-occurrences are initialized to $0$. We now describe how each co-occurrence matrix is computed.

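  • Let $\mathcal{O}$ and $\mathcal{A}$ be the sets of object and attribute synsets annotated for an image region. For each region in VisualGenome, we increment $X_{ij}^{oa}$ by $1$, for each word pair $(i, j) \in S_o \times S_a$, and for all synset pairs $(S_o, S_a) \in \mathcal{O} \times \mathcal{A}$. $X_{ji}^{oa}$ is also incremented unless $i = j$.

  • For each region in VisualGenome, we increment $X_{ij}^{aa}$ by $1$, for each word pair $(i, j) \in S_{a_1} \times S_{a_2}$, and for all synset pairs $(S_{a_1}, S_{a_2}) \in \mathcal{A} \times \mathcal{A}$.

  • Let $\mathcal{C}$ be the union of all object synsets annotated in an image. For each image in VisualGenome, $X_{ij}^{c}$ is incremented by $1$, for each word pair $(i, j) \in S_{c_1} \times S_{c_2}$, and for all synset pairs $(S_{c_1}, S_{c_2}) \in \mathcal{C} \times \mathcal{C}$.

  • Let $\mathcal{H}$ be a set of object synsets annotated for an image in ImageNet and its ancestors in WordNet. For each image in ImageNet, $X_{ij}^{oh}$ is incremented by $1$, for each word pair $(i, j) \in S_{h_1} \times S_{h_2}$, and for all synset pairs $(S_{h_1}, S_{h_2}) \in \mathcal{H} \times \mathcal{H}$.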
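The counting rules in the list above can be sketched as follows for the Object-Attribute and Context types. This is a hedged illustration: the input layout (regions as pairs of synset lists, synsets as Python sets of words), the function names, and the unit increment are assumptions, and real VisualGenome annotations would first need to be parsed into this form.

```python
from collections import Counter
from itertools import product

def count_object_attribute(regions):
    """regions: iterable of (object_synsets, attribute_synsets) per image region;
    each synset is a set of words. Counts are kept symmetric."""
    X = Counter()
    for objects, attributes in regions:
        for s_o, s_a in product(objects, attributes):
            for i, j in product(s_o, s_a):
                X[(i, j)] += 1
                if i != j:  # avoid double-counting a word paired with itself
                    X[(j, i)] += 1
    return X

def count_context(images):
    """images: iterable of lists holding all object synsets annotated in one image.
    Ordered synset pairs are enumerated, so the result is symmetric by construction."""
    X = Counter()
    for synsets in images:
        for s1, s2 in product(synsets, repeat=2):
            for i, j in product(s1, s2):
                X[(i, j)] += 1
    return X

# Toy annotations (hypothetical).
regions = [
    ([{"dog", "canine"}], [{"brown"}]),
    ([{"dog", "canine"}], [{"brown"}, {"furry"}]),
]
X_oa = count_object_attribute(regions)
X_c = count_context([[{"laptop"}, {"desk"}]])
```

Because every word of a synset is counted, synonyms such as 'dog' and 'canine' receive identical Object-Attribute statistics in this sketch.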

4 Experiments

We analyze ViCo embeddings with respect to the following properties: (1) Does unsupervised clustering result in a natural grouping of words by visual concepts? (Sec. 4.1); (2) Do the word embeddings enable transfer of visual learning (e.g., visual recognition) to classes not seen during training? (Sec. 4.2); (3) How well do the embeddings perform on downstream applications? (Sec. 4.3); (4) Does the embedding space show word arithmetic properties? (Sec. 4.4).

Footnote 1: We also perform a supervised partitioning analysis which is included in the supplementary material. The results show that a supervised classification algorithm partitions words into visual categories more easily in the ViCo embedding space than in the GloVe or random vector space.

Data for clustering analysis. To answer (1) we manually annotate frequent words in VisualGenome with coarse (see legend in the t-SNE plots in Fig. 4) and fine categories (see appendix for the list of categories).

Data for zero-shot-like analysis. To answer (2), we use CIFAR-100 [20]. We generate 4 splits of the 100 categories into disjoint Seen (categories used for training visual classifiers) and Unseen (categories used for evaluation) sets. We use the following scheme for splitting: the list of 5 sub-categories in each of the 20 coarse categories (provided by CIFAR) is sorted alphabetically, and the first $k$ categories are added to Seen and the remaining $5 - k$ to Unseen for $k \in \{1, 2, 3, 4\}$.
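The splitting scheme can be sketched as below. The sketch assumes that the first `k` of the five alphabetically sorted sub-categories go to Seen, with `k` ranging over 1 to 4 to give the four splits; the function name is hypothetical and only two of CIFAR-100's twenty coarse categories are shown.

```python
def split_seen_unseen(coarse_to_fine, k):
    """Sort each coarse category's sub-categories alphabetically; the first k
    go to Seen and the rest to Unseen."""
    seen, unseen = [], []
    for fine in coarse_to_fine.values():
        ordered = sorted(fine)
        seen.extend(ordered[:k])
        unseen.extend(ordered[k:])
    return seen, unseen

# Two CIFAR-100 coarse categories with their five fine labels each.
coarse_to_fine = {
    "fish": ["trout", "shark", "ray", "flatfish", "aquarium_fish"],
    "flowers": ["tulip", "sunflower", "rose", "poppy", "orchid"],
}

seen, unseen = split_seen_unseen(coarse_to_fine, k=2)
splits = {k: split_seen_unseen(coarse_to_fine, k) for k in (1, 2, 3, 4)}
```

Applied to all 20 coarse categories, this would yield 20, 40, 60, or 80 seen classes depending on `k`.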

Figure 4: Unsupervised Clustering Analysis. (a,b) Qualitative evaluation with t-SNE: Plots show that ViCo augmented GloVe results in tighter, more homogeneous clusters than GloVe. Marker shape encodes the annotated coarse category and color denotes if the word is used more frequently as an object or an attribute; (c,d) Quantitative evaluation: Plots show clustering performance of different embeddings measured through V-Measure at different numbers of clusters. All ViCo based embeddings outperform GloVe for both fine and coarse annotations (Sec. 4.1). See Tab. 3 and Tab. 4 for average performance across cluster numbers. Best viewed in color on a screen.

4.1 Unsupervised Clustering Analysis

The main benefit of word vectors over one-hot or random vectors is the meaningful structure captured in the embedding space: words that are closer in the embedding space are semantically similar. We hypothesize that ViCo represents similarities and differences between visual categories that are missing from GloVe.

Qualitative evidence to support this hypothesis can be found in the t-SNE plots shown in Fig. 4, where concatenation of GloVe and ViCo embeddings leads to tighter, more homogeneous clusters of the 13 coarse categories than GloVe.

To test the hypothesis quantitatively, we cluster word embeddings with agglomerative clustering (cosine affinity and average linkage) and compare to the coarse and fine ground truth annotations using V-Measure, which is the harmonic mean of Homogeneity and Completeness scores. Homogeneity is a measure of cluster purity, assessing whether all points in the same cluster have the same ground truth label. Completeness measures whether all points with the same label belong to the same cluster.

Footnote 2: Analysis with other metrics and methods yields similar conclusions and is included in the supplementary material.
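For readers unfamiliar with V-Measure, the following sketch computes it from scratch using the standard entropy-based definitions of homogeneity and completeness. It is an illustrative re-implementation with hypothetical function names, not the evaluation code used for the reported numbers.

```python
import math
from collections import Counter

def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def v_measure(labels, clusters):
    """Harmonic mean of homogeneity and completeness."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    label_counts = Counter(labels)
    cluster_counts = Counter(clusters)
    h_labels = entropy(label_counts.values(), n)
    h_clusters = entropy(cluster_counts.values(), n)
    # Conditional entropies H(labels | clusters) and H(clusters | labels).
    h_l_given_c = -sum(m / n * math.log(m / cluster_counts[k])
                       for (l, k), m in joint.items())
    h_c_given_l = -sum(m / n * math.log(m / label_counts[l])
                       for (l, k), m in joint.items())
    homogeneity = 1.0 if h_labels == 0 else 1.0 - h_l_given_c / h_labels
    completeness = 1.0 if h_clusters == 0 else 1.0 - h_c_given_l / h_clusters
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# A clustering that matches the labels up to renaming scores 1;
# lumping everything into one cluster is complete but not homogeneous.
perfect = v_measure(["food", "food", "animal", "animal"], [1, 1, 0, 0])
lumped = v_measure(["food", "food", "animal", "animal"], [0, 0, 0, 0])
```

The second case shows why the harmonic mean is used: perfect completeness alone cannot compensate for zero homogeneity.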

Plots (c,d) in Fig. 4 compare random vectors, GloVe, variants of ViCo and their combinations (concatenation) for different numbers of clusters using V-Measure. Average performance across different cluster numbers is shown in Tab. 3 and Tab. 4. The main conclusions are as follows:

ViCo clusters better than other embeddings. Tab. 3 shows that ViCo alone outperforms GloVe, random, and vis-w2v based embeddings. GloVe+ViCo improves performance further, especially for coarse categories.

WordNet is not the sole contributor to strong performance of ViCo. To verify that ViCo’s gains are not simply due to the hierarchical nature of WordNet, we evaluate a version of ViCo trained on co-occurrences computed without using WordNet, i.e., using raw word annotations in VisualGenome instead of synset annotations and without Object-Hypernym co-occurrences. Tab. 3 shows that GloVe+ViCo(linear,100,w/o WordNet) outperforms GloVe for both coarse and fine categories.

ViCo outperforms existing visual word embeddings. Tab. 3 evaluates performance of existing visual word embeddings which are learned from abstract scenes [18]. wiki and coco are different versions of vis-w2v depending on the dataset (Wikipedia or MS-COCO [25, 5]) used for training word2vec for initialization. After initialization, both models are trained on an abstract scenes (clipart images) dataset [56]. ViCo(linear,100) outperforms both of these embeddings. GloVe+vis-w2v-wiki performs similarly to GloVe and GloVe+vis-w2v-coco performs only slightly better than GloVe, showing that the majority of the information captured by vis-w2v may already be present in GloVe.

Learned embeddings significantly outperform random vectors. Tab. 3 shows that random vectors perform poorly in comparison to learned embeddings. GloVe+random performs similarly to GloVe or worse. This implies that gains of GloVe+ViCo over GloVe are not just an artifact of increased dimensionality.

Linear achieves similar performance as Select with fewer dimensions. Tab. 4 illustrates the ability of the multi-task formulation to learn a more compact representation than select (concatenating embeddings learned from each co-occurrence type separately) without sacrificing performance. The 50-, 100-, and 200-dimensional ViCo embeddings learned with linear transformations all achieve performance similar to the 200-dimensional select embeddings.


4.2 Zero-Shot-like Analysis

The ability of word embeddings to capture relations between visual categories makes it possible to generalize visual models trained on limited visual categories to larger sets unseen during training. To assess this ability, we evaluate embeddings on their zero-shot-like object classification performance using the CIFAR-100 dataset. Note that our zero-shot-like setup is slightly different from a typical zero-shot setup because even though the visual classifier is not trained on unseen class images in CIFAR, annotations associated with images of unseen categories in VisualGenome or ImageNet may be used to compute word co-occurrences while learning word embeddings.

| Embeddings | Dim. | Fine | Coarse |
|---|---|---|---|
| random(100) | 100 | 0.34 | 0.15 |
| GloVe | 300 | 0.50 | 0.52 |
| GloVe+random(100) | 300+100 | 0.50 | 0.49 |
| vis-w2v-wiki [18] | 200 | 0.41 | 0.43 |
| vis-w2v-coco [18] | 200 | 0.45 | 0.4 |
| GloVe+vis-w2v-wiki | 300+200 | 0.5 | 0.52 |
| GloVe+vis-w2v-coco | 300+200 | 0.52 | 0.55 |
| ViCo(linear,100) | 100 | 0.60 | 0.59 |
| GloVe+ViCo(linear,100) | 300+100 | 0.61 | 0.65 |
| GloVe+ViCo(linear,100, w/o WN) | 300+100 | 0.54 | 0.58 |

Table 3: Comparing ViCo to other embeddings. All ViCo based embeddings outperform GloVe and random vectors. ViCo(linear,100) also outperforms vis-w2v. GloVe+vis-w2v performs similarly to GloVe while GloVe+ViCo outperforms both GloVe and ViCo. Using WordNet yields healthy performance gains but is not the only contributor to performance since GloVe+ViCo(linear,100, w/o WN) also outperforms GloVe. Best and second best numbers are highlighted in each column.

| Embeddings | Dim. | Fine | Coarse |
|---|---|---|---|
| ViCo(linear,50) | 50 | 0.57 | 0.56 |
| ViCo(linear,100) | 100 | 0.60 | 0.59 |
| ViCo(linear,200) | 200 | 0.59 | 0.60 |
| ViCo(select,200) | 200 | 0.59 | 0.60 |
| GloVe | 300 | 0.50 | 0.52 |
| GloVe+ViCo(linear,50) | 300+50 | 0.60 | 0.66 |
| GloVe+ViCo(linear,100) | 300+100 | 0.61 | 0.65 |
| GloVe+ViCo(linear,200) | 300+200 | 0.60 | 0.65 |
| GloVe+ViCo(select,200) | 300+200 | 0.57 | 0.63 |

Table 4: Effect of transformations on clustering performance. The table compares average performance across number of clusters. The linear variants achieve performance similar to select with fewer dimensions. In fact, when used in combination with GloVe, linear variants outperform select. Best and second best numbers are highlighted in each column.

Model. Let $f(I)$ be the features extracted from image $I$ using a CNN and let $w_c$ denote the word embedding for class $c$. Let $g$ denote a function that projects word embeddings into the space of image features. We define the score for class $c$ as $s_c(I) = \cos(f(I), g(w_c))$, where $\cos(\cdot, \cdot)$ is the cosine similarity. The class probabilities are defined as

$$p_c(I) = \frac{\exp\left(s_c(I) / \epsilon\right)}{\sum_{c'} \exp\left(s_{c'}(I) / \epsilon\right)}$$

where $\epsilon$ is a learnable temperature parameter. In our experiments, $f(I)$ is a -dimensional feature vector produced by the last linear layer of a 34-layer ResNet (modified to accept CIFAR images) and $g$ is a linear transformation.
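The scoring model above can be sketched without any deep learning library. The sketch assumes that the temperature divides the cosine scores inside a softmax; the function names, the identity projection, and the toy vectors are hypothetical stand-ins for the CNN features and the learned linear map.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_probabilities(image_feat, class_embeddings, project, temperature):
    """Softmax over cosine scores between image features and projected
    class word embeddings (temperature-scaled; an assumed form)."""
    scores = {c: cosine(image_feat, project(w))
              for c, w in class_embeddings.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp((s - m) / temperature) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Toy example: unseen classes are scored purely through their word embeddings.
image_feat = [1.0, 0.0]
class_embeddings = {"cat": [2.0, 0.0], "dog": [0.0, 3.0]}
probs = class_probabilities(image_feat, class_embeddings,
                            project=lambda w: w, temperature=0.1)
```

Since the classifier only needs a word embedding per class, swapping `class_embeddings` for the unseen classes at test time requires no retraining, which is what enables the zero-shot-like evaluation.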

Learning. The model (parameters of $f$, $g$, and $\epsilon$) is trained on images from the set of seen classes $\mathcal{S}$. We use the Adam [17] optimizer with a learning rate of . The model is trained with a batch size of for epochs.

Model Selection and Evaluation. The best model (among iteration checkpoints) is selected based on seen class accuracy (classifying only among classes in $\mathcal{S}$) on the test set. The selected model is evaluated on unseen category ($\mathcal{U}$) prediction accuracy computed on the test set.

Fig. 5 compares chance performance ($1/|\mathcal{U}|$), random vectors, GloVe, and GloVe+ViCo on four seen/unseen splits. We show mean and standard deviation computed across four runs ( models trained in all). The key conclusions are as follows:

ViCo generalizes to unseen classes better than GloVe. ViCo based embeddings, especially the 200-dim. select and linear variants, show healthy gains over GloVe. Note that this is not just due to higher dimensions of the embeddings since GloVe+random(200) performs worse than GloVe.

Learned embeddings significantly outperform random vectors. Random vectors alone achieve close to chance performance, while concatenating random vectors to GloVe degrades performance.

Select performs better than Linear. Compression to 100-dimensional embeddings using linear transformation shows a more noticeable drop in performance as compared to the select setting. However, GloVe+ViCo(linear,100) still outperforms GloVe in 3 out of 4 splits.

Figure 5: Zero-Shot Analysis. The histogram compares the transfer learning ability of a simple word embedding based object classification model. The $x$-axis denotes the number of CIFAR-100 classes ($|\mathcal{S}|$) used during training. During test, we evaluate the classifier on its ability to correctly classify among the remaining ($100 - |\mathcal{S}|$) unseen classes. Results show that GloVe+ViCo leads to better transfer to unseen classes than GloVe alone (Sec. 4.2).

| Embeddings | Dim. | Discr. Attr. (Avg. F1) | Im-Cap Retrieval Recall@1: Im2Cap | Cap2Im | VQA Accuracy: Overall | Y/N | Num. | Other | Ref. Exp. Loc. Accuracy: Val | TestA | TestB | Image Captioning: B1 | B4 | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| random | 300 | 50.03 ± 2.26 | 43.1 | 30.6 | 66.1 | 82.0 | 44.8 | 57.5 | 71.3 | 73.5 | 66.3 | 0.714 | 0.296 | 0.910 | 0.170 |
| GloVe | 300 | 63.85 ± 0.04 | 44.8 | 33.5 | 67.5 | 83.8 | 46.5 | 58.3 | 72.2 | 75.3 | 66.8 | 0.708 | 0.290 | 0.891 | 0.167 |
| GloVe + random | 300+100 | 63.88 ± 0.03 | 44.3 | 34.4 | 67.5 | 84.1 | 45.9 | 58.2 | 72.5 | 75.1 | 67.5 | 0.707 | 0.288 | 0.881 | 0.166 |
| GloVe + ViCo (linear) | 300+100 | 64.46 ± 0.17 | 46.3 | 34.2 | 67.7 | 84.4 | 46.6 | 58.4 | 72.7 | 75.5 | 67.5 | 0.711 | 0.291 | 0.894 | 0.168 |

Table 5: Comparing ViCo to GloVe and random vectors. GloVe+ViCo(linear) outperforms GloVe and GloVe+random for all tasks and outperforms random for all tasks except Image Captioning. While random vectors perform close to chance on the word-only task, they compete tightly with learned embeddings on vision-language tasks. This suggests that vision-language models are relatively insensitive to the choice of word embeddings. Best and second best numbers in each column are highlighted.

4.3 Downstream Task Evaluation

We now evaluate ViCo embeddings on a range of downstream tasks. Generally, we expect tasks requiring better word representations of objects and attributes to benefit from our embeddings. When using existing models, we initialize and freeze word embeddings so that performance changes are not due to fine-tuning embeddings of different dimensions. The rest of the model is left untouched except for the dimensions of the input layer where the size of the input features needs to match the embedding dimension.
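The embedding setup described above can be sketched in a few lines. This is a minimal illustration, not the actual training code: the helper names `random_vectors` and `concat_embeddings` are hypothetical, and randomly generated vectors stand in for real GloVe and ViCo tables. It only shows that concatenation changes the input dimension the downstream model must accept, and that the vectors themselves stay fixed.

```python
import random

def random_vectors(vocab, dim, seed=0):
    """Hypothetical stand-in for a loaded embedding table: word -> list of floats."""
    rng = random.Random(seed)
    return {w: [rng.gauss(0.0, 1.0) for _ in range(dim)] for w in vocab}

def concat_embeddings(*tables):
    """Concatenate per-word vectors from tables, keeping only shared words."""
    vocab = set(tables[0])
    for t in tables[1:]:
        vocab &= set(t)
    # Tuples are immutable, mirroring embeddings that are frozen during training.
    return {w: tuple(x for t in tables for x in t[w]) for w in sorted(vocab)}

vocab = ["apple", "banana", "red"]
glove = random_vectors(vocab, 300, seed=1)  # placeholder for real GloVe vectors
vico = random_vectors(vocab, 100, seed=2)   # placeholder for real ViCo(linear) vectors

frozen = concat_embeddings(glove, vico)
# The only model change: the input layer must accept this many features.
input_dim = len(next(iter(frozen.values())))
print(input_dim)  # 400
```

Replacing `vico` with a second call to `random_vectors` would give the GloVe+random baseline with the same input dimension, which is what makes the comparison independent of embedding size.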

Tab. 5 compares the performance of embeddings on a word-only discriminative attributes task and 4 vision-language tasks. On all tasks, GloVe+ViCo outperforms GloVe and GloVe+random. Unlike the word-only task, which depends solely on word representations, vision-language tasks are less sensitive to word embeddings, with the performance of random embeddings approaching that of learned embeddings (see the supplementary material for our hypothesis and test for why random vectors work well for vision-language tasks).

Discriminative Attributes [19] is one of the SemEval 2018 challenges. The task requires identifying whether an attribute word discriminates between two concept words. For example, the word “red” is a discriminative attribute for the word pair (“apple”, “banana”) but not for (“apple”, “cherry”). Samples are presented as tuples of attribute and concept words and the model makes a binary prediction. Performance is evaluated using class-averaged F1 scores.
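For concreteness, a class-averaged F1 score can be computed as below. This sketch assumes that "class averaged" means the unweighted mean of the per-class F1 scores (often called macro F1); the function names are illustrative and not taken from the task's official scorer.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def class_averaged_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# 1 = attribute is discriminative, 0 = it is not (toy labels).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = class_averaged_f1(y_true, y_pred)
```

Averaging over both classes keeps a classifier that always predicts one label from scoring well on an imbalanced split.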

Let $w_1$, $w_2$, and $a$ be the word embeddings (GloVe or ViCo) for the two concept words and the attribute word. We compute the scores $s_g$ and $s_v$ for GloVe and ViCo using a function $f(a, w_1, w_2)$ defined in terms of the cosine similarity $c(\cdot, \cdot)$. We then learn a linear SVM over $s_g$ for the GloVe-only model and over $s_g$ and $s_v$ for the GloVe+ViCo model.
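The feature computation can be sketched as follows. The exact form of the scoring function is not recoverable from the text here, so the difference of cosine similarities used in `score` is an assumption, as are the names `cosine`, `score`, and `features` and the toy 2-D vectors. A linear SVM would then be fit on the resulting feature rows (one feature for the GloVe-only model, two for GloVe+ViCo).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def score(a, w1, w2):
    # ASSUMPTION: the attribute is discriminative when it is closer to the
    # first concept than to the second, measured by cosine similarity.
    return cosine(a, w1) - cosine(a, w2)

def features(sample, tables):
    """sample = (concept1, concept2, attribute); returns one score per table."""
    c1, c2, attr = sample
    return [score(t[attr], t[c1], t[c2]) for t in tables]

# Toy embedding tables standing in for GloVe and ViCo (hypothetical values).
glove = {"apple": [1.0, 0.0], "banana": [0.0, 1.0], "red": [1.0, 0.0]}
vico = {"apple": [1.0, 1.0], "banana": [0.0, 1.0], "red": [1.0, 0.0]}

glove_only = features(("apple", "banana", "red"), [glove])
glove_vico = features(("apple", "banana", "red"), [glove, vico])
```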

Caption-Image Retrieval is a classic vision-language task requiring a model to retrieve images given a caption or vice versa. We use the open source VSE++ [10] implementation which learns a joint embedding of images and captions using a Max of Hinges loss that encourages attending to hard negatives and is geared towards improving top-1 Recall. We evaluate the model using Recall@1 on MS-COCO.
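A small sketch may help make the loss and the metric concrete. It assumes a square similarity matrix in which image i matches caption i, a single caption per image (MS-COCO provides several), and an illustrative margin of 0.2; the function names are hypothetical and this is not the VSE++ code itself.

```python
def max_of_hinges_loss(sim, margin=0.2):
    """Sum over matching pairs of hinge losses on the hardest negatives.

    sim[i][j] is the similarity of image i and caption j; pair (i, i) matches.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative caption for image i, and hardest negative image
        # for caption i. With no negatives, the default makes the hinge zero.
        hard_cap = max((sim[i][j] for j in range(n) if j != i), default=pos - margin)
        hard_img = max((sim[j][i] for j in range(n) if j != i), default=pos - margin)
        total += max(0.0, margin + hard_cap - pos)
        total += max(0.0, margin + hard_img - pos)
    return total

def recall_at_1(sim):
    """Fraction of queries whose top-ranked item is the matching one."""
    n = len(sim)
    im2cap = sum(1 for i in range(n)
                 if max(range(n), key=lambda j: sim[i][j]) == i) / n
    cap2im = sum(1 for j in range(n)
                 if max(range(n), key=lambda i: sim[i][j]) == j) / n
    return im2cap, cap2im

sim = [
    [0.9, 0.3, 0.1],
    [0.8, 0.5, 0.2],
    [0.1, 0.4, 0.7],
]
loss = max_of_hinges_loss(sim)
im2cap, cap2im = recall_at_1(sim)
```

Only the single hardest negative per query contributes to the loss, in contrast to a sum over all negatives, which is why the objective is aligned with top-1 retrieval.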

Visual Question Answering [3, 11] systems are required to answer questions about an image. We compare the performance of embeddings using Pythia [54, 15] which uses bottom-up top-down attention for computing a question-relevant image representation. Image features are then fused with a question representation using a GRU operating on word embeddings and fed into an answer classifier. Performance is evaluated using overall and by-question-type accuracy on the test-dev split of the VQA v2.0 dataset.

Referring Expression Comprehension consists of localizing an image region based on a natural language description. We use the open source implementation of MAttNet [55] to compare localization accuracy with different embeddings on the RefCOCO+ dataset using the UNC split. MAttNet uses an attention mechanism to parse the referring expression into phrases that inform the subject’s appearance, location, and relationship to other objects. These phrases are processed by corresponding specialized localization modules. The final region scores are a linear combination of module scores using predicted weights.
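The final scoring step can be illustrated with a short sketch. It assumes three modules (subject, location, relationship) and that the predicted weights are normalized with a softmax; both the normalization and the names `softmax` and `fuse_region_scores` are assumptions made for illustration rather than details confirmed here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_region_scores(module_scores, weight_logits):
    """Weighted sum of per-module region scores.

    module_scores[m][r] is the score of region r under module m;
    weight_logits[m] is the unnormalized predicted weight of module m.
    """
    weights = softmax(weight_logits)
    n_regions = len(module_scores[0])
    return [
        sum(w * scores[r] for w, scores in zip(weights, module_scores))
        for r in range(n_regions)
    ]

# Toy scores for two candidate regions (hypothetical values).
subject = [0.9, 0.1]
location = [0.6, 0.3]
relationship = [0.3, 0.2]

fused = fuse_region_scores([subject, location, relationship], [0.0, 0.0, 0.0])
best_region = max(range(len(fused)), key=lambda r: fused[r])
```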

Image Captioning involves generating a caption given an image. We use the Show and Tell model of Vinyals et al. [51] which feeds CNN extracted image features into an LSTM followed by beam search to sample captions. We report BLEU1 (B1), BLEU4 (B4), CIDEr (C), and SPICE (S) metrics [35, 50, 1] on the MS-COCO test set.

4.4 Exploring Embedding Space Structure

Previous work [31] has demonstrated linguistic regularities in word embedding spaces through analogy tasks solved using simple vector arithmetic. Tab. 6 shows qualitatively that ViCo embeddings possess similar properties, capturing relations between visual concepts well.

Analogy                  Answer Candidates                   GloVe      ViCo
car:land::aeroplane:?    ocean, sky, road, railway           ocean      sky
clock:circle::tv:?       triangle, square, octagon, round    triangle   square
park:bench::church:?     door, sofa, cabinet, pew            door       pew
sheep:fur::person:?      hair, horn, coat, tail              coat       hair
monkey:zoo::cat:?        park, house, church, forest         park       house
leg:trouser::wrist:?     watch, shoe, tie, bandana           bandana    watch
yellow:banana::red:?     strawberry, lemon, mango, orange    mango      strawberry
rice:white::spinach:?    blue, green, red, yellow            blue       green
train:railway::car:?     land, desert, ocean, sky            land       land
can:metallic::bottle:?   wood, glass, cloth, paper           glass      glass
man:king::woman:?        queen, girl, female, adult          queen      girl
can:metallic::bottle:?   wood, plastic, cloth, paper         plastic    wood
train:railway::car:?     road, desert, ocean, sky            road       ocean
Table 6: Answering Analogy Questions. Out of 30 analogy pairings tested, we found both GloVe and ViCo to be correct 19 times, only ViCo was correct 8 times, and only GloVe was correct 3 times. Correct answers are highlighted.
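One common way to answer such multiple-choice analogies with vector arithmetic is sketched below. For a question a:b::c:?, it forms the target vector b - a + c and returns the candidate with the highest cosine similarity to it. Whether the analogies above were answered with exactly this rule is an assumption, and the 3-D vectors are hand-made toy values rather than real GloVe or ViCo embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def solve_analogy(a, b, c, candidates, emb):
    """Answer a:b::c:? by picking the candidate closest to emb[b] - emb[a] + emb[c]."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    return max(candidates, key=lambda d: cosine(target, emb[d]))

# Toy embeddings (hypothetical): axes loosely mean "ground", "air", "vehicle".
emb = {
    "car": [1.0, 0.0, 1.0],
    "aeroplane": [0.0, 1.0, 1.0],
    "land": [1.0, 0.0, 0.0],
    "sky": [0.0, 1.0, 0.0],
    "ocean": [0.2, 0.1, 1.0],
    "road": [0.9, 0.0, 0.1],
    "railway": [0.8, 0.0, 0.3],
}

answer = solve_analogy("car", "land", "aeroplane",
                       ["ocean", "sky", "road", "railway"], emb)
print(answer)  # sky
```

Restricting the search to a fixed candidate list, as in the table, makes the comparison between embeddings independent of vocabulary size.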

5 Conclusion

This work shows that in addition to textual co-occurrences, visual co-occurrences are a surprisingly effective source of information for learning word representations. The resulting embeddings outperform text-only embeddings on unsupervised clustering, supervised partitioning, zero-shot generalization, and various supervised downstream tasks. We also develop a multi-task extension of GloVe’s log-bilinear model to learn a compact shared embedding from multiple types of co-occurrences. Type-specific embedding spaces learned as part of the model help provide a richer sense of relatedness between words.

Acknowledgments: Supported in part by NSF 1718221, ONR MURI N00014-16-1-2007, Samsung, and 3M.


References

  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV.
  • [2] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017) Localizing moments in video with natural language. In ICCV.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In ICCV.
  • [4] J. A. Bullinaria and J. P. Levy (2007) Extracting semantic representations from word co-occurrence statistics: a computational study. Behavior Research Methods.
  • [5] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • [6] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In CVPR.
  • [7] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman (1990) Indexing by latent semantic analysis. JASIS.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • [10] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2018) VSE++: improved visual-semantic embeddings. BMVC.
  • [11] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In CVPR.
  • [12] T. Gupta, K. Shih, S. Singh, and D. Hoiem (2017) Aligned image-word representations improve inductive transfer across vision-language tasks. In ICCV.
  • [13] M. Hasegawa, T. Kobayashi, and Y. Hayashi (2017) Incorporating visual features into word embeddings: a bimodal autoencoder-based approach. In IWCS.
  • [14] L. He, K. Lee, M. Lewis, and L. Zettlemoyer (2017) Deep semantic role labeling: what works and what’s next. In ACL.
  • [15] Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh (2018) Pythia.
  • [16] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR.
  • [18] S. Kottur, R. Vedantam, J. M. F. Moura, and D. Parikh (2016) Visual word2vec (vis-w2v): learning visually grounded word embeddings using abstract scenes. CVPR.
  • [19] A. Krebs, A. Lenci, and D. Paperno (2018) SemEval-2018 task 10: capturing discriminative attributes. In International Workshop on Semantic Evaluation.
  • [20] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • [21] J. B. Kruskal (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika.
  • [22] R. Lebret and R. Collobert (2014) Word embeddings through Hellinger PCA. In EACL.
  • [23] K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. EMNLP.
  • [24] O. Levy and Y. Goldberg (2014) Neural word embedding as implicit matrix factorization. In NIPS.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV.
  • [26] K. Lund and C. Burgess (1996) Producing high-dimensional semantic spaces from lexical co-occurrence.
  • [27] R. Luo and G. Shakhnarovich (2017) Comprehension-guided referring expressions. In CVPR.
  • [28] D. Massiceti, N. Siddharth, P. K. Dokania, and P. H. Torr (2018) FlipDial: a generative model for two-way visual dialogue. In CVPR.
  • [29] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
  • [30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS.
  • [31] T. Mikolov, W. Yih, and G. Zweig (2013) Linguistic regularities in continuous space word representations. In HLT-NAACL.
  • [32] G. A. Miller (1995) WordNet: a lexical database for English. ACM.
  • [33] G. Murphy (2004) The big book of concepts. MIT Press.
  • [34] C. E. Osgood, G. J. Suci, and P. H. Tannenbaum (1957) The measurement of meaning. University of Illinois Press.
  • [35] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL.
  • [36] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. In EMNLP.
  • [37] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In EMNLP.
  • [38] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power (2017) Semi-supervised sequence tagging with bidirectional language models. ACL.
  • [39] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. S. Zettlemoyer (2018) Deep contextualized word representations. In NAACL-HLT.
  • [40] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik (2017) Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV.
  • [41] B. A. Plummer, P. Kordas, M. H. Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik (2018) Conditional image-text embedding networks. In ECCV.
  • [42] A. Radford (2018) Improving language understanding by generative pre-training.
  • [43] H. Rashkin, M. Sap, E. Allaway, N. A. Smith, and Y. Choi (2018) Event2Mind: commonsense inference on events, intents, and reactions. In ACL.
  • [44] L. J. Rips, E. J. Shoben, and E. E. Smith (1973) Semantic distance and the verification of semantic relations. Journal of Verbal Learning and Verbal Behavior.
  • [45] B. H. Ross and G. L. Murphy (1999) Food for thought: cross-classification and category organization in a complex real-world domain. Cognitive Psychology.
  • [46] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. ICLR.
  • [47] K. J. Shih, S. Singh, and D. Hoiem (2016) Where to look: focus regions for visual question answering. In CVPR.
  • [48] G. Stanovsky, J. Michael, L. S. Zettlemoyer, and I. Dagan (2018) Supervised open information extraction. In NAACL-HLT.
  • [49] M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, and D. Forsyth (2018) Learning type-aware embeddings for fashion compatibility. In ECCV.
  • [50] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In CVPR.
  • [51] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR.
  • [52] L. Wang, Y. Li, J. Huang, and S. Lazebnik (2019) Learning two-branch neural networks for image-text matching tasks. TPAMI.
  • [53] X. Wang, Y. Ye, and A. Gupta (2018) Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR.
  • [54] Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956.
  • [55] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg (2018) MAttNet: modular attention network for referring expression comprehension. In CVPR.
  • [56] C. L. Zitnick and D. Parikh (2013) Bringing semantics into focus using visual abstraction. In CVPR.