
Better Text Understanding Through Image-To-Text Transfer

by   Karol Kurach, et al.

Generic text embeddings are successfully used in a variety of tasks. However, they are often learnt by capturing the co-occurrence structure of pure text corpora, which limits their ability to generalize. In this paper, we explore models that incorporate visual information into the text representation. Based on comprehensive ablation studies, we propose a conceptually simple, yet well-performing architecture. It outperforms previous multimodal approaches on a set of well-established benchmarks. We also improve the state-of-the-art results for image-related text datasets, using orders of magnitude less data.


1 Introduction

Figure 1: Two images are close in the visual space, which can be quantified via a CNN. Their descriptions convey the same concept, yet using an entirely different vocabulary. Our method improves the understanding of text by leveraging knowledge about visual properties of corresponding images. Photo credit: teaser_credit_left ; teaser_credit_right

The problem of finding good representations of text data is a very actively studied area of research. Many models learn representations directly by optimizing an end-to-end task in a supervised manner. This, however, often requires an enormous amount of labeled data, which is not available in many practical applications, and gathering such data can be very costly. A common solution that requires an order of magnitude fewer labeled examples is to reuse pre-trained embeddings.

A large body of work in this space is focused on training embeddings from pure text data. However, there are many types of relations and co-occurrences that are hard to grasp from pure text. Instead, they appear naturally in other modalities, such as images. In particular, from a similarity measure in the image space and pairs of images and sentences, a similarity measure for sentences can be induced, as illustrated in Figure 1.

In this work, we study how to build sentence embeddings from text-image pairs that score well on sentence similarity metrics. This extends previous works, for example Lazaridou15 or MaoPinterest16.

We propose a conceptually simple but well-performing model that we call Visually Enhanced Text Embeddings (VETE). It takes advantage of visual information from images in order to improve the quality of sentence embeddings. The model combines simple, already existing ingredients: using a pre-trained Convolutional Neural Network (CNN) for the image embedding, the sentence embeddings are obtained as the normalized sum of the word embeddings. These are trained end-to-end to be aligned with the corresponding image embeddings and not aligned with mismatching pairs, by optimizing the Pearson correlation.

Despite its simplicity, the model significantly outperforms pure-text models and the best multimodal model from MaoPinterest16 on a set of well-established text similarity benchmarks from the SemEval competition semeval. In particular, for image-related datasets, our model matches state-of-the-art results with substantially less training data. These results indicate that leveraging image data can significantly improve the quality of text embeddings, and that incorporating images as a source of information can result in text representations which effectively capture visual knowledge. We also conduct a detailed ablation study to quantify the effect of different factors on the embedding quality in the image-to-text transfer setup.

In summary, the contributions of this work are:

  • We propose a simple multimodal model that outperforms previous image-text approaches on a wide variety of text similarity tasks. Furthermore, the proposed model matches state-of-the-art results on image-related SemEval datasets, despite being trained with substantially less data.

  • We perform a comprehensive study of image-to-text transfer, comparing different model types, text encoders, loss types and datasets.

  • We provide evidence that the approach of learning sentence embeddings directly outperforms methods that learn word embeddings and then combine them.

2 Related Work

Many works study the use of multimodal data, in particular pairs of images and text. Most of them explore how these pairs of data can be leveraged for tasks that require knowledge of both modalities, like captioning or image retrieval

barnard2003matching ; jia2011learning ; kiros14 ; Lazaridou15 ; wang2016learning ; karpathy2015deep ; vendrov2015order .

While this line of work is very interesting as common embeddings can directly be applied to captioning or image retrieval tasks, the direct use of text embeddings for NLP tasks, using images only as auxiliary training data, is less explored.

Lazaridou15 propose to extend the skip-gram algorithm mikolov2013linguistic to incorporate image data; hill2014learning took a similar approach before. In the original skip-gram algorithm, each word embedding is optimized to increase the likelihood of the neighboring words conditioned on the center word. In addition to predicting contextual words, Lazaridou15's models maximize the similarity between image and word embeddings. More precisely, skip-gram maximizes

    \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t),

where p(w_{t+j} \mid w_t) is given by a softmax formulation:

    p(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}}^\top u_{w_t})}{\sum_{w' \in V} \exp(u_{w'}^\top u_{w_t})}.

Here u_w is the embedding vector of the word w, and u^\top v is the dot product between the vectors u and v. To inject visual information, Lazaridou15 add a max-margin loss between the image embedding and the word embedding:

    L_{vision}(w_t) = -\sum_{w'} \max\left(0,\; \gamma - \cos(u_{w_t}, i_{w_t}) + \cos(u_{w_t}, i_{w'})\right),

with i_w being the average embedding of the images paired with the word w, \gamma a margin, and w' ranging over negatively sampled words.

They show that they can augment word embeddings learnt from large text sources with visual information and in addition to image labeling and retrieval, they show that those embeddings perform better on word similarity metrics.

kiros14 also use a max margin loss to co-embed images and corresponding text. For the text embeddings, they use different models depending on the task. For the image captioning or image retrieval task, they employ an LSTM encoder. To explore the resulting word embedding properties, in particular arithmetics, they use a simpler word embedding model. Some interesting arithmetical properties of the embeddings are demonstrated, like “image of a blue car” - “blue” + “red” leading to an image of a red car. However, there is no quantitative evaluation of the quality of the text embeddings in terms of text similarity.

Some more recent works investigate phrase embeddings trained with visual signals and their quality in terms of text similarity metrics. For example, MaoPinterest16 use a Recurrent Neural Network (RNN) as a language model in order to learn word embeddings, which are then combined to create a phrase embedding. They propose three models. For their model A, they use a setup similar to the captioning model from Vinyals_2015_CVPR, with an RNN decoder conditioned on a pre-trained CNN embedding. The RNN (a GRU in this case) reads the text, trying to predict the next token. The initial state is set to a transformation of the last internal layer of a pre-trained VGGNet simonyan2014very; let that vector be v. Model B tries to match the final RNN state with v. Finally, model C extends the multimodal skip-gram of Lazaridou15 by adding a loss measuring the distance between the word embeddings and v. The authors' experiments show that model A performs best, and we use that model as a baseline in our experiments.

3 VETE Models

Figure 2: An overview of the VETE model. Images are fed into a pre-trained CNN. Their representation is then transformed via an affine map to match the dimension of the sentence embeddings. The sentences are encoded via a text embedding model. Finally, the embeddings are paired in two ways: matching pairs and incorrect pairs. For each set of pairs, we compute the cosine similarity. The loss signal comes from the Pearson correlation between the similarities and the pair labels: 1 for matching pairs and -1 for incorrect pairs. Only green-shaded modules are trained. Photo credit: graph_image; teaser_credit_left; teaser_credit_right

Our setup aims at directly transferring knowledge between image and text representations, and the main goal is to find reusable sentence embeddings. We make direct use of paired data, consisting of pictures and text describing them. We propose a model consisting of two separate encoders: one for images and another for text. An overview of the architecture is presented in Figure 2.

For the text encoder, we consider three families of models which combine words into text representations. For the bag-of-words (BOW) model, the sentence embedding is simply a normalized sum of vectors corresponding to the individual words. For the RNN model, we create a stacked recurrent neural network encoder (LSTM or GRU based). Finally, for the CNN model, the encoder includes a convolutional layer followed by a fully connected layer, as described in kim2014cnn .
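As an illustration, the BOW encoder can be sketched in a few lines of NumPy; the function name, the vocabulary lookup, and the zero-vector fallback for fully out-of-vocabulary input are our own illustrative choices, not details from the paper:

```python
import numpy as np

def bow_encode(tokens, word_emb, vocab):
    """BOW sentence embedding: the L2-normalized sum of the word vectors."""
    vecs = [word_emb[vocab[w]] for w in tokens if w in vocab]
    if not vecs:                                   # fully out-of-vocabulary input
        return np.zeros(word_emb.shape[1])
    s = np.sum(vecs, axis=0)
    return s / np.linalg.norm(s)
```

The normalization makes the cosine similarity between two sentence embeddings depend only on direction, matching how the embeddings are compared during training.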

For encoding images, we use a pre-trained InceptionV3 network szegedy2016rethinking which provides a 2048-dimensional feature vector for each image in the dataset (images are rescaled and cropped to 299x299 px).

Let g(i) denote the 2048-dimensional embedding vector for an image i and f(s) the d-dimensional embedding for a sentence s produced by the text encoder f. Throughout this paper, we will refer to the cosine similarity of two vectors u and v as \phi(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}. Informally speaking, our training goal is to maximize \phi(f(s), T(g(i))) when sentence s is paired with (i.e. describes) image i, and minimize this value otherwise.

Formally, let V be the vocabulary, d the embedding dimensionality and f the text encoding model. Let's define an affine transformation T that transforms the 2048-dimensional image embeddings to d dimensions. Our learnable parameters consist of a word embedding matrix W \in \mathbb{R}^{|V| \times d}, the internal parameters of the model f (when f is an RNN or CNN encoder), as well as the transformation T. For each batch of n image-sentence pairs of the form (i_k, s_k), we randomly shuffle the sentences and add the following "incorrect" pairs to the batch: (i_k, s_{\sigma(k)}), with \sigma a random permutation of \{1, \dots, n\}. If we denote by x the vector of similarities x_k = \phi(T(g(i_k)), f(s_k)) over all 2n pairs, and by y the vector of corresponding labels (1 for correct pairs, -1 for incorrect ones; any two distinct label values are equivalent, since the Pearson correlation is invariant to affine rescaling), then our training goal is to maximize the Pearson correlation between the vectors x and y.

We will denote models trained this way VETE-BOW, VETE-RNN and VETE-CNN, respectively.
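The training objective can be sketched as follows. This is a minimal NumPy version under our own simplifications (a single batch, a hypothetical `vete_loss` helper, labels 1/-1 for correct/incorrect pairs); a real implementation would differentiate this loss with respect to the encoder parameters and the affine map:

```python
import numpy as np

def cosine(u, v):
    """Row-wise cosine similarity between two matrices of vectors."""
    num = np.sum(u * v, axis=-1)
    return num / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))

def pearson(x, y):
    """Sample Pearson correlation between two 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def vete_loss(img_emb, sent_emb, T, b, rng):
    """Negative Pearson correlation between pair similarities and pair labels
    (1 for matching image-sentence pairs, -1 for shuffled, mismatching ones)."""
    n = img_emb.shape[0]
    proj = img_emb @ T + b                        # affine map into sentence space
    perm = rng.permutation(n)                     # shuffled sentences -> incorrect pairs
    sims = np.concatenate([cosine(proj, sent_emb), cosine(proj, sent_emb[perm])])
    labels = np.concatenate([np.ones(n), -np.ones(n)])
    return -pearson(sims, labels)
```

When the projected image embeddings match their sentences exactly, the matching similarities are 1 while the shuffled ones are near 0, so the correlation is strongly positive and the loss is negative.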

4 Experiments

4.1 Training Datasets

In our experiments we consider three training datasets: MS COCO, SBU and Pinterest5M. They are described in detail below. One important note is that we modify datasets that contain multiple captions per image (MS COCO and Pinterest5M) to keep only one caption. This was done to prevent the network from “cheating” by using the image feature vector only as a way of joining text pairs. To the best of our knowledge, we are the first to notice this issue in the evaluation of the quality of multimodal image-text models. It is known that training similarity models directly on text-text pairs yields good results wieting2015towards but here we want to investigate only the effect of knowledge transfer from images.
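The deduplication step can be sketched as follows; `dedup_captions` and the (image_id, caption) pair layout are illustrative assumptions, not the paper's data pipeline:

```python
def dedup_captions(pairs):
    """Keep only the first caption per image, so an image's feature vector cannot
    act as a bridge joining several text-text pairs during training."""
    seen, out = set(), []
    for image_id, caption in pairs:
        if image_id not in seen:
            seen.add(image_id)
            out.append((image_id, caption))
    return out
```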

Figure 3: Examples from the three datasets. Shown are images and all provided captions. Notice how the language varies from formal descriptions geared towards a general audience (MS COCO) to more informal posts (SBU, Pinterest5M).

MS COCO

The MS COCO dataset coco2014 contains images from a wide range of object categories. For each image, five high-quality captions are provided. We use the MS COCO 2014 dataset with the same train/validation/test split as the im2txt Tensorflow implementation im2txt_impl of im2txt_paper. Initially, our train/validation/test sets contain 586k/10k/20k examples, respectively. Then, we filter the sets to keep only one caption per image, so the "text part" of our final datasets is five times smaller.


SBU

The Stony Brook University dataset Ordonez:2011:im2text consists of 1M image-caption pairs collected from Flickr, with only one caption per image. We randomly split this dataset into train/validation/test sets with sizes 900k/50k/50k, respectively.


Pinterest5M

The original Pinterest40M dataset MaoPinterest16 contains 40M images. However, only a fraction of the image URLs had been released at the time of this submission, and some images are no longer available, so we were able to collect approximately 5M images from this dataset. For every image we keep only one caption. Then, we randomly split the data into train/validation/test sets.

The training data in all datasets is lowercased and tokenized using the Stanford Tokenizer stanfordtokenizer . We also wrap all sentences with "<S>" and "</S>" marking the beginning and the end of the sentence.
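A minimal sketch of this preprocessing, with plain whitespace splitting standing in for the Stanford Tokenizer:

```python
def preprocess(sentence, tokenize=str.split):
    """Lowercase, tokenize, and add sentence-boundary markers."""
    return ["<S>"] + tokenize(sentence.lower()) + ["</S>"]
```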

4.2 Hyperparameter selection and training

The performance of every machine learning model depends strongly on the choice of its hyperparameters. In order to compare our approach fairly to previous works, we follow the same hyperparameter search protocol for all models. We choose the average score on the SemEval 2016 datasets (cf. Section 4.3) as the validation metric and refer to it as "avg2016".

for i = 1, 2, …, 100 do
       Sample a set of hyperparameters in the allowed ranges;
       Run training and evaluate on "avg2016";
end for
Report the results on all benchmarks of the model that has the highest score on "avg2016";
Algorithm 1: Protocol for hyperparameter search.
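Algorithm 1 amounts to plain random search. A sketch, with a hypothetical `train_and_eval` callback standing in for a full training run and a toy uniform search space:

```python
import random

def random_search(train_and_eval, space, trials=100, seed=0):
    """Sample hyperparameters uniformly from the allowed ranges, train/evaluate
    each set, and keep the configuration with the best "avg2016" score."""
    rng = random.Random(seed)
    best_hp, best_score = None, float("-inf")
    for _ in range(trials):
        hp = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = train_and_eval(hp)      # would run training and return avg2016
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score
```

Sampling rates on a log10 scale (as in the toy space below) is a standard choice for learning-rate-like parameters.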

If a hyperparameter has a similar meaning in two models (e.g., learning rate, initialization scale, lr decay, etc.), the ranges searched were set to be the same. Additionally, we ensured that the parameters reported by the authors are included in the ranges.

In all models, we train using the Adam optimizer kingma2014adam for a fixed number of epochs, chosen separately for the smaller datasets (MS COCO, SBU) and for Pinterest5M. The final embeddings have a fixed dimensionality. For all VETE models we use the Pearson loss (see the ablation study in Section 5.2).

Model images2014 images2015 COCO-Test Pin-Test avg2014 avg2015
VETE-CNN 0.911
VETE-BOW 0.861 0.855 0.579 0.579 0.622

Table 1: Results of models trained only on MS COCO data with one sentence per image.

4.3 Evaluation

Our goal is to create good text embeddings that encode knowledge from the corresponding images. To evaluate the effect of this knowledge transfer from images to text, we use a set of textual semantic similarity datasets from the SemEval competitions semeval. Unfortunately, we could not compare our models directly on the Gold RP10K dataset introduced by MaoPinterest16, as it was not publicly released.

We also use two additional custom test sets: COCO-Test and Pin-Test. These were created from the MS COCO and Pinterest5M test datasets, respectively, by randomly sampling semantically related captions (from the same image) and non-related captions from different images. As opposed to the SemEval datasets, the similarity score is binary in this case. The goal was to check the performance on the same task as SemEval but with data from the same distribution of words as our training data.
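The construction of such a binary test set can be sketched as follows; the function name and data layout are illustrative assumptions:

```python
import random

def make_binary_pairs(captions_by_image, n_pairs, seed=0):
    """Sample (caption_a, caption_b, label) triples: label 1 for two captions of
    the same image, label 0 for captions drawn from two different images."""
    rng = random.Random(seed)
    multi = [img for img, caps in captions_by_image.items() if len(caps) >= 2]
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:                     # semantically related pair
            a, b = rng.sample(captions_by_image[rng.choice(multi)], 2)
            pairs.append((a, b, 1))
        else:                                      # non-related pair
            img_a, img_b = rng.sample(list(captions_by_image), 2)
            pairs.append((rng.choice(captions_by_image[img_a]),
                          rng.choice(captions_by_image[img_b]), 0))
    return pairs
```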

For every model type we select the best model according to the average score on the SemEval datasets. Then, we report the results on all other test datasets.

4.4 Results

Table 1 presents the scores obtained by models trained only on the MS COCO dataset. This allows us to compare the algorithms themselves, independently of the training data used. In Section 5.3, we analyze the robustness of our methods on two additional datasets (SBU and Pinterest5M).

As a direct comparison, we implement model A as described in MaoPinterest16, which we refer to as PinModelA. Our implementation uses a pre-trained InceptionV3 network for visual feature extraction, as in the VETE models. To understand the impact of adding information from images to text data, we also evaluate two models trained purely on text:

  • RNN-based language model This model learns sentence embeddings via an RNN based language model. It corresponds to the PureTextRnn baseline from MaoPinterest16 .

  • Word2Vec We trained Word2Vec word embeddings with gensim rehurek_lrec on a corpus consisting of the sentences from MS COCO.

All VETE models outperform pure-text baselines and PinModelA. Similarly to wieting2015towards , we observed that RNN-based encoders are outperformed by a simpler BOW model. We also show that this holds for CNN-based encoders. It is worth noting that it is mostly a domain adaptation issue, as both RNN and CNN encoders perform better than BOW on COCO-Test, where the data has the same distribution as the training data. We analyze the effect of changing the text encoder in Section 5.1.

To put our results in context, Table 2 compares them to other methods trained with much larger corpora. We used word embeddings obtained from three methods:

  • Glove: embeddings proposed in pennington2014glove , trained on a Common Crawl dataset with 840 billion tokens.

  • M-Skip-Gram: embeddings proposed in Lazaridou15

    , trained on Wikipedia and a set of images from ImageNet.

  • PP-XXL: the best embeddings from wieting2015towards, trained on 9M phrase pairs from PPDB.

For all three approaches, we consider two versions that differ in the vocabulary allowed at inference time: one with the vocabulary restricted to MS COCO (marked "R") and a non-restricted version ("NR") that uses the whole vocabulary of the given embeddings. The vocabulary size has a significant impact on the final score for the Pin-Test benchmark, where a substantial fraction of the tokens is not in the MS COCO vocabulary, which means that many sentences contain at least one missing token.

Finally, we also include the best results from the SemEval competition, where available. Note that those were obtained from heavily tuned and more complex models, trained without any data restrictions. Still, our VETE model is able to match their results.

Model images2014 images2015 COCO-Test Pin-Test
Glove (R)
Glove (NR)
M-Skip-Gram (R)
M-Skip-Gram (NR) 0.654
Best SemEval 0.871 N/A N/A
VETE-BOW (ours) 0.861 0.894

Table 2: Comparison of models on image-related text datasets. VETE and PinModelA were trained only on MS COCO.

5 Ablation studies

To analyze the impact of different components of our architecture, we perform ablation studies on the employed text encoder, loss type and training dataset. We also investigate the effect of training on word or sentence level. In all cases, we follow a similar protocol as in Section 4.2:

Randomly generate 100 sets of combinations for all hyperparameters;
for hyperparameter p (e.g., "loss type") do
       for each value v in the allowed range of p do
             Run training using the 100 sets of hyperparameters, keeping p = v fixed;
       end for
      Choose the best run based on the "avg2016" validation metric, and report its scores;
end for
Algorithm 2: Protocol for the hyperparameter ablation study.

5.1 Encoder

We study the impact of different text encoders on the VETE model. The results are summarized in Table 3. "RNN-GRU" and "RNN-LSTM" denote RNN encoders with GRU gru and LSTM lstm cells, respectively. For BOW, we try two options: using either the sum or the mean of the word embeddings. Both bag-of-words encoders perform better than the RNN encoders, although the RNNs are slightly better on the test data which has the same distribution as the training data.

Encoder      images2014   images2015   COCO-Test   Pin-Test
RNN-GRU
RNN-LSTM
BOW-SUM
BOW-MEAN

Table 3: Results of applying different text encoders in the VETE model. The training data is MS COCO, and the RNN-based models learned to model this distribution better. However, BOW generalizes better to other datasets.

Loss type                   Avg score
Covariance
Surrogate Kendall (SKT)
Rank loss
Pearson

Table 4: Comparison of different loss types with the VETE-BOW model.

5.2 Loss type

In this section, we describe the various loss types that we trained our model with. Consider two paired random variables X (the similarity score between two embeddings) and Y (the target label). The sample sets \{x_k\} and \{y_k\} stand for corresponding realizations of X and Y.

  • Covariance: \mathrm{cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])], estimated on the sample.

  • Pearson correlation: measures the linearity of the link between two variables, estimated on a sample; it is defined as

    r_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y},

    where \sigma_X and \sigma_Y are the standard deviations of X and Y.

  • Surrogate Kendall \tau (SKT): The Pearson correlation takes into account only linear dependencies. To mitigate this, we experimented with the Kendall \tau correlation, which is only rank-dependent. Unfortunately, it is not differentiable. We therefore used a differentiable approximation skt, obtained by replacing the sign of each pairwise difference product with a smooth surrogate:

    \mathrm{SKT}(X, Y) = \frac{2}{n(n-1)} \sum_{k < l} \tanh\left(\frac{(x_k - x_l)(y_k - y_l)}{\sigma}\right) for some \sigma > 0.

  • Rank loss: another cost function is the pairwise ranking loss. We follow closely the definition in kiros14.

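The loss candidates above can be sketched in NumPy. The tanh-based surrogate shown here is one common smooth stand-in for the sign function in Kendall's τ; the exact surrogate and temperature used in skt may differ:

```python
import numpy as np

def covariance(x, y):
    """Sample covariance between paired realizations."""
    return float(np.mean((x - x.mean()) * (y - y.mean())))

def pearson(x, y):
    """Sample Pearson correlation: covariance normalized by standard deviations."""
    return covariance(x, y) / (float(x.std()) * float(y.std()))

def surrogate_kendall(x, y, sigma=0.1):
    """Smooth stand-in for Kendall's tau: the sign of each pairwise difference
    product is replaced by scaled tanh terms, making the statistic differentiable."""
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    iu = np.triu_indices(len(x), k=1)              # each unordered pair once
    return float(np.mean(np.tanh(dx[iu] / sigma) * np.tanh(dy[iu] / sigma)))
```

As sigma shrinks, the tanh terms approach the sign function and the surrogate approaches the exact Kendall τ, at the cost of increasingly sharp gradients.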
Table 4 compares the effects of the various losses.

5.3 Dataset

We study the effect of the training dataset. The results of training the model on MS COCO, SBU and Pinterest5M are presented in Table 5. Each cell of the table contains the average score over the evaluation datasets (images2014, images2015, COCO-Test, Pin-Test). The quality of image captions varies significantly between the datasets, as can be seen in Figure 3. However, the relative ordering of the models is preserved: regardless of the dataset used for training, PinModelA is always worse than VETE-RNN, which in turn is worse than VETE-BOW.

Train Dataset Word2Vec PinModelA VETE-RNN VETE-BOW
Pinterest5M 0.408

Table 5: Comparison of average test scores when training on different datasets.

5.4 Sentence-level vs word-level embedding

Previous methods for transferring knowledge from images to text focused on improving word-level embeddings; a sentence representation could then be created by combining them. In our work, we learn sentence embeddings as a whole, but the best performing text encoder turned out to be BOW. This raises the following question: could the model perform equally well if we train it at the word level and only combine word embeddings at inference time? The comparison of these two approaches is presented in Table 6, which clearly shows the benefit of sentence-level training. This effect should be studied further, but one plausible explanation is that while separately training word embeddings forces each of them to be close to the corresponding images, training at the sentence level allows the word embeddings to become complementary, each of them explaining a part of the image and capturing co-occurrences.

Model images2014 images2015 COCO-Test Pin-Test
Sentence-level 0.861 0.855 0.894 0.579

Table 6: Comparison of training on a word-level vs sentence-level.

6 Conclusion

We studied how to improve text embeddings by leveraging multimodal datasets, using a pre-trained image model and paired text-image data. We showed that VETE, a simple approach which directly optimizes phrase embeddings to match corresponding image representations, outperforms previous multimodal approaches that are sometimes more complex and optimize word embeddings rather than sentence embeddings. We also showed that even for relatively complex similarity tasks at the sentence level, our proposed models can create very competitive embeddings, even compared to more sophisticated models trained on orders of magnitude more text data, especially when the vocabulary is related to visual concepts.

To our initial surprise, state-of-the-art encoder models, like LSTMs, performed significantly worse than much simpler encoders like bag-of-words models. While they achieve better results when evaluated on the same data distribution, their embeddings do not transfer well to other text distributions. General embeddings need to be robust to distribution shifts, and applying techniques that improve such robustness can probably further improve the results.

Using a multimodal approach in order to improve general text embeddings is under-explored and we hope that our results motivate further developments. For example, the fact that the best models are very simple suggests that there is a large headroom in that direction.