Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions

06/23/2016 ∙ by Fabio Carrara, et al. ∙ Consiglio Nazionale delle Ricerche

In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that an update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose Text2Vis, a neural network that generates a visual representation, in the visual feature space of the fc6-fc7 layers of a CNN trained on ImageNet, from a short descriptive text. Text2Vis optimizes two loss functions, using a stochastic loss-selection method. A visual-focused loss is aimed at learning the actual text-to-visual feature mapping, while a text-focused loss is aimed at modeling the higher-level semantic concepts expressed in language and at countering the tendency of the visual loss to overfit on non-relevant visual components. We report preliminary results on the MS-COCO dataset.




1 Introduction

Using a textual query to retrieve images is a very common cross-media search task, as text is the most efficient medium to describe the kind of image the user is searching for. The actual retrieval process can be implemented in a number of ways, depending on how the shared search space between text and images is defined. The search space can be based on textual features, visual features, or a joint space into which textual and visual features are projected.

Using textual features is the most common solution, especially at the Web scale. Each image is associated with a set of textual features extracted from its context of use (e.g., the text surrounding the image in the Web page, description fields in metadata), possibly enriched by means of classifiers that assign textual labels related to the presence of certain relevant entities or abstract properties in the image. The textual search space model can exploit the actual visual content of the image only when classifiers for the concepts of interest are available, thus requiring a large number of classifiers; it also requires reprocessing the entire image collection whenever a new classifier is made available.

On the other hand, the visual and joint search spaces represent each image through visual features extracted from its actual content. The method we propose in this paper adopts a visual space search model. A textual query is converted into a visual representation in a visual space, where the search is performed by similarity. An advantage of this model is that any improvement in the text representation model, and in its conversion to visual features, has immediate benefits on the image retrieval process, without requiring the whole image collection to be reprocessed.
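The visual space search model can be illustrated with a minimal sketch. It assumes that the query has already been converted into a visual vector and that every image in the collection has a precomputed visual vector; the distance used (Euclidean on l2-normalized vectors) is the one specified in Section 4.3. The function names and the toy 3-dimensional vectors are hypothetical and only serve the illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_images(query_vec, image_vecs):
    """Return image ids sorted by increasing distance to the query vector."""
    q = l2_normalize(query_vec)
    scored = [(euclidean(q, l2_normalize(v)), image_id)
              for image_id, v in image_vecs.items()]
    return [image_id for _, image_id in sorted(scored)]

# Hypothetical toy features (real fc6/fc7 vectors have 4096 dimensions).
collection = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.9, 0.1, 0.0],
}
# Stands in for the visual vector generated from a textual query.
query = [2.0, 0.1, 0.0]
ranking = rank_images(query, collection)
print(ranking)
```

Since the image vectors are never touched by the text model, replacing the model that produces `query` leaves `collection` valid, which is the advantage claimed above.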

A joint space model requires instead a reprocessing of all images whenever the textual model is updated, since the projection of images into the joint space is influenced also by the textual model part. It also requires managing and storing the additional joint space representations that are used only for the cross-media search.

In this paper we present preliminary results on learning Text2Vis, a neural network model that converts textual descriptions into visual representations in the same space as those extracted from deep Convolutional Neural Networks (CNNs), such as the ImageNet-trained network of [15]. Text2Vis achieves its goal by using a stochastic loss choice on two separate loss functions (as detailed in Section 3), one for the autoencoding of textual representations and one for the generation of visual representations. Preliminary results show that the produced visual representations capture the high-level concepts expressed in the textual description.

2 Related work

Deep Learning, and Deep Convolutional Neural Networks (DCNNs) in particular, have recently shown impressive performance on a number of multimedia information retrieval tasks [15, 23, 10]. Deep Learning methods learn representations of data with multiple levels of abstraction. As a result, the activation of the hidden layers has been used in the context of transfer learning and content-based image retrieval [5, 22] as high-level representations of the visual content. Somewhat similarly, distributional semantic models, such as those produced by Word2Vec [18] or GloVe [21], have been found useful in modeling semantic similarities among words by establishing a correlation between word meaning and position in a vector space.

In order to perform cross-media retrieval, the two feature spaces (text and images in our case) should become comparable, typically by learning how to properly map the different sources. This problem has been approached in different ways so far, which can be roughly grouped into three main variants, depending on whether the mapping is performed into a common space, the textual space, or the visual space.

Mapping into a common space: The idea of comparing texts and images in a shared space has been investigated by means of Cross-modal Factor Analysis and (Kernel) Canonical Correlation Analysis in [4]. In a similar vein, Corr-AE was proposed for cross-modal retrieval, allowing the search to be performed in both directions, i.e., from text to image and vice versa [8]. The idea is to train two autoencoders, one for the image domain and another for the textual domain, imposing restrictions between the two. As will be seen, the architecture we are presenting here bears resemblance to one of the architectures investigated in [8], the so-called Correspondence full-modal autoencoder (which is inspired by the multimodal deep learning method [19]). Unlike the multimodal architectures, though, we apply a stochastic criterion to jointly optimize for the two modalities, thus refraining from combining them into a single parametric loss.

Mapping into the textual space: The BoWDNN method trains a deep neural network (DNN) to map images directly into a bag-of-words (BoW) space, where the cosine similarity between BoW representations is used to generate the ranking [1]. Somewhat similarly, a dedicated area of related research is focused on generating captions describing the salient information of an image (see, e.g., [13, 7]).

Two other important examples along these lines are DeViSE [9] and ConSE [20]. Both methods build upon the higher layers of the convolutional neural network of [15]; the main difference lies in the way the two methods treat the last layer of the net. Whereas DeViSE replaces this last layer with a linear mapping (thus fine-tuning the whole network), ConSE directly takes the outputs of the last layer and learns a projection to the textual embedding space.

Mapping into the visual space: Our proposal Text2Vis belongs to this group where, to the best of our knowledge, the only example up to now was a method dubbed Word2VisualVec [6], which was reported very recently. There are some fundamental points where their method and ours differ, though. On the one hand, Word2VisualVec takes combinations of Word2Vec-like vectors as a starting point, thus reducing the dimensionality of the input space; we start directly from the bag-of-words vector encoding of the textual space, as we did not observe any improvement from pre-training the textual part. On the other hand, they build a deep network on top of the textual representation. As shall be seen, our Text2Vis is much shallower, as we found the net to be capable of mapping textual vectors into the visual space quite efficiently, provided that the model is properly regularized, an issue on which we focused our attention.

3 Generating visual representations of text

In this section we describe the architecture of our Text2Vis network. Our idea is to map textual descriptions to high-level visual representations. As the visual space we used the fc6 and fc7 layers of the Hybrid network [24] (i.e., an AlexNet [15] trained on both the ImageNet (http://image-net.org) and Places (http://places.csail.mit.edu/index.html) datasets). We tested two vectorial representations for the textual descriptions: Text2Vis_1 uses simple bag-of-words vectors that mark with a value of one the positions corresponding to words that appear in the textual description and leave all the others at zero; Text2Vis_N adds some information on text structure by also considering N-grams for a selection of part-of-speech patterns ('NOUN-VERB', 'NOUN-VERB-VERB', 'ADJ-NOUN', 'VERB-PRT', 'VERB-VERB', 'NUM-NOUN', and 'NOUN-NOUN'). Text2Vis_N is a first approach at modeling text structure in the input vectorial representation, which differentiates the task we aim at, search from detailed/complex textual descriptions, from traditional keyword search.
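The two input encodings can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the vocabulary is hypothetical, tokenization is a plain whitespace split, and the part-of-speech filtering of N-grams is not implemented (an N-gram is kept simply if it appears in the vocabulary).

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def encode(description, vocabulary, max_n=1):
    """Binary bag-of-words vector: 1.0 where a vocabulary term occurs, else 0.0.

    max_n=1 gives the unigram encoding; max_n>1 also marks vocabulary N-grams.
    """
    tokens = description.lower().split()
    vector = [0.0] * len(vocabulary)
    for n in range(1, max_n + 1):
        for term in ngrams(tokens, n):
            index = vocabulary.get(term)
            if index is not None:
                vector[index] = 1.0
    return vector

# Hypothetical vocabulary mapping terms to vector positions.
vocab = {"woman": 0, "pizza": 1, "knife": 2, "dog": 3, "woman cutting": 4}
caption = "a woman cutting a pizza with a knife"

unigram_vec = encode(caption, vocab)           # unigrams only
ngram_vec = encode(caption, vocab, max_n=2)    # unigrams and bigrams
print(unigram_vec, ngram_vec)
```

The only difference between the two vectors is the position of the bigram "woman cutting", which is an instance of the NOUN-VERB pattern listed above.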

We have also investigated the use of pre-trained word embeddings, representing the textual description as the average of the embeddings of the words composing the description (see Equation 1 in [6]), but we have not observed any improvement. Generating the word embeddings is an additional cost, and the fitness of the embeddings for the task depends on the type of documents they are learned from. For example, an 11% improvement in MAP is reported in [2] from learning embeddings from Flickr tags compared to learning them from Wikipedia pages. The direct use of bag-of-words vectors in Text2Vis removes both the need to select an appropriate document collection from which to learn the embeddings and the cost of learning them.

As described in the following, Text2Vis actually learns a description embedding space that is able to reconstruct both the original description and the visual description. To this end, we started with a simple regressor model (Figure 1, left) trained to directly predict the visual representation of the image associated with the textual input. We observed a strong tendency to overfit (Figure 1, right), which degrades the applicability of the method to unseen images.

Figure 1: Overfitting of a simple regressor model with one hidden layer of size 1024.

We explain this overfitting with the fact that a visual representation keeps track of every element that appears in the image, regardless of its semantic relevance within the image, while a (short) textual description is more likely focused on the visually relevant information, disregarding the secondary content of the image, as shown in Section 4.6. As the learning iterations proceed, the simple regressor model starts capturing secondary elements of the images that are not relevant for the main represented concept, but are somewhat characteristic of the training data.

Our Text2Vis proposal to counter such overfitting is to add a text-to-text autoencoding branch to the hidden layer (Figure 2, left), forcing the model to satisfy two losses: one visual (text-to-visual regression) and one linguistic (text-to-text autoencoder). The linguistic loss works at a higher level of abstraction than the visual one, acting as a regularization constraint on the model and preventing, as confirmed by our experiments, overfitting on the visual loss (Figure 2, right). As detailed in the next section, we implemented the use of the two losses with a stochastic process, in which at each iteration one of the two is selected for optimization.

Figure 2: Our proposed Text2Vis which controls overfitting by adding an autoencoding constraint on the hidden state.

3.1 Text2Vis

Text2Vis consists of two overlapped feedforward neural nets with a shared hidden layer. The shared hidden layer causes a regularization effect during the combined optimization; i.e., the hidden state is constrained to be a good representation to accomplish two different goals. The feedforward computation is described by the following equations:

$$h = \mathrm{ReLU}(W_1 t_{in} + b_1), \qquad \hat{v} = \mathrm{ReLU}(W_2 h + b_2), \qquad \hat{t} = \mathrm{ReLU}(W_3 h + b_3) \tag{1}$$

where $t_{in}$ represents the bag-of-words encoding for the textual descriptor given as input to the net, $h$ is the hidden representation, $\hat{v}$ and $\hat{t}$ are the visual and textual predictions, respectively, obtained from the hidden representation $h$, $\Theta = \{W_i, b_i\}_{i \in \{1,2,3\}}$ are the model parameters to be learned, and $\mathrm{ReLU}$ is the activation function, defined by

$$\mathrm{ReLU}(x) = \max(0, x) \tag{2}$$

Both predictions $\hat{v}$ and $\hat{t}$ are then compared with the expected outputs: (i) the visual representation $v$ corresponding to the fc6 or fc7 layers of [15], and (ii) a textual descriptor $t_{out}$ that is semantically equivalent to $t_{in}$. We used the mean squared error (MSE) as the loss function in both cases:

$$\mathcal{L}_v = \mathrm{MSE}(\hat{v}, v) \tag{3}$$

$$\mathcal{L}_t = \mathrm{MSE}(\hat{t}, t_{out}) \tag{4}$$
The model is thus multi-objective, and many alternative strategies could be followed at this point in order to set the parameters so that both criteria are jointly minimized. We instead propose a much simpler, yet effective, way of carrying out the optimization search, which consists of considering both branches of the net as independent, and randomly deciding at each iteration which of them is to be used for the gradient descent optimization.

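Let us thus define $\Theta_v$ (the parameters of the shared hidden layer and of the visual output layer) and $\Theta_t$ (the parameters of the shared hidden layer and of the textual output layer) as the model parameters of each independent branch. The optimization problem has two objectives (Equations 5 and 6), and at each iteration a random choice decides which of them is to be optimized. We call this heuristic the Stochastic Loss (SL) optimization.

$$\Theta_v^{*} = \arg\min_{\Theta_v} \mathcal{L}_v \tag{5}$$

$$\Theta_t^{*} = \arg\min_{\Theta_t} \mathcal{L}_t \tag{6}$$

where $\mathcal{L}_v$ and $\mathcal{L}_t$ denote the visual and the textual MSE loss, respectively.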
Note that the net is fed with a triple $\langle t_{in}, v, t_{out} \rangle$ at each iteration, consisting of the input text, the expected visual representation, and the expected output text. When $t_{out} = t_{in}$ the text-to-text branch is an autoencoder. It is also possible to have $t_{out} \neq t_{in}$, with the two pieces of text being semantically equivalent (e.g., "a woman cutting a pizza with a knife" and "a woman holds a knife to cut pizza"); in this case the text-to-text branch is reminiscent of the Skip-gram- and CBOW-like architectures. The text-to-image branch is, in any case, a regressor. The SL causes the model to be co-regularized. Nevertheless, since our final goal is to project the textual descriptor into the visual space, the text-to-text branch may be thought of as a regularization of the visual reconstruction (and, more specifically, of its internal encoding) that responds to constraints of a linguistic nature.
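The stochastic loss selection can be sketched on a toy problem. The sketch below is deliberately reduced and is not the authors' implementation: every layer is a single scalar weight without bias, the output units are linear, and plain stochastic gradient descent replaces the optimizer used in the experiments. The names `w1`, `w2`, `w3`, `sl_step`, and `p_visual` are hypothetical. What it preserves is the mechanism: at each iteration one loss is drawn at random, and only the shared layer and the selected branch are updated.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def forward(params, t_in):
    """Shared hidden unit followed by the visual and the textual output units."""
    h = relu(params["w1"] * t_in)
    return h, params["w2"] * h, params["w3"] * h

def mean_losses(params, data):
    """Mean squared error of the visual and of the textual branch over the data."""
    loss_v = loss_t = 0.0
    for t_in, v, t_out in data:
        _, v_hat, t_hat = forward(params, t_in)
        loss_v += (v_hat - v) ** 2
        loss_t += (t_hat - t_out) ** 2
    return loss_v / len(data), loss_t / len(data)

def sl_step(params, example, rng, p_visual=0.5, lr=0.05):
    """One Stochastic Loss step: pick a loss at random, update only its branch."""
    t_in, v, t_out = example
    h, v_hat, t_hat = forward(params, t_in)
    if rng.random() < p_visual:
        branch, err = "w2", v_hat - v        # visual loss selected
    else:
        branch, err = "w3", t_hat - t_out    # textual loss selected
    active = 1.0 if h > 0.0 else 0.0         # derivative of ReLU at the hidden unit
    grad_branch = 2.0 * err * h
    grad_shared = 2.0 * err * params[branch] * active * t_in
    params[branch] -= lr * grad_branch       # branch-specific weight
    params["w1"] -= lr * grad_shared         # shared hidden-layer weight

rng = random.Random(0)
# Toy triples (t_in, v, t_out): visual target 2*t_in, textual target t_in.
data = [(x, 2.0 * x, x) for x in (0.2, 0.5, 1.0)]
params = {"w1": 0.5, "w2": 0.5, "w3": 0.5}

initial = mean_losses(params, data)
for _ in range(2000):
    sl_step(params, rng.choice(data), rng)
final = mean_losses(params, data)
print(initial, final)
```

The shared weight `w1` receives gradient from whichever loss is drawn, so it has to serve both branches, which is the co-regularization effect described above. Changing `p_visual` changes how often each loss is selected.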

4 Experiments

4.1 Datasets

We used the Microsoft COCO dataset (MsCOCO [17], publicly available at http://mscoco.org/). MsCOCO was originally proposed for image recognition, segmentation, and caption generation. Although other datasets for image retrieval exist (e.g., the one proposed in [11]), they are more oriented to keyword-based queries. We believe MsCOCO to be a better fit for the scenario we want to explore, since the captions associated with the images are expressed in natural language, and are thus semantically richer than a short list of keywords composing a query.

MsCOCO contains 82,783 training images (Train2014), 40,504 validation images (Val2014), and about 40K and 80K test images corresponding to two different competitions [3] (Test2014 and Test2015). Because MsCOCO was proposed for caption generation, the captions are only accessible in the Train2014 and Val2014 sets, while they are not yet released for Test2014 and Test2015. We have thus taken the Train2014 set for training, and split the Val2014 set into two disjoint sets of 20K images each for validation and test.

Each image in MsCOCO has 5 different captions associated. Let $\langle I, C \rangle$ be any labeled instance in MsCOCO, where $I$ is an image and $C$ is a set of captions describing the content of $I$. Given an $\langle I, C \rangle$ pair, we define a labeled instance in our model as $\langle t_{in}, v, t_{out} \rangle$, where $v$ is the visual representation of the image $I$ taken from the fc6 layer (or fc7, in separate experiments) of the Hybrid network [24], and $t_{in}$ and $t_{out}$ are two textual descriptors from $C$ representing the input and output descriptors for the model, respectively. During training, $t_{in}$ and $t_{out}$ are chosen uniformly at random from $C$ (thus $t_{in}$ and $t_{out}$ are not required to be different). Note that the number of training instances one could extract from a given $\langle I, C \rangle$ amounts to 25, which increases the variability of the training set along the different epochs.
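The count of 25 follows from drawing the input and the output caption independently from the 5 captions of an image (5 x 5 ordered pairs, including the 5 cases in which both are the same caption). A short sketch of the sampling, with hypothetical captions and a hypothetical toy visual vector:

```python
import itertools
import random

def sample_instance(visual_vector, captions, rng):
    """Draw one (t_in, v, t_out) training triple for an image."""
    t_in = rng.choice(captions)
    t_out = rng.choice(captions)  # may coincide with t_in (autoencoding case)
    return t_in, visual_vector, t_out

# Hypothetical captions for one image (MsCOCO provides 5 per image).
captions = [
    "a woman cutting a pizza with a knife",
    "a woman holds a knife to cut pizza",
    "a person slicing a pizza on a table",
    "a lady cuts a large pizza",
    "a woman is about to cut a pizza",
]
visual_vector = [0.0, 0.3, 1.2]  # stands in for a 4096-dimensional fc6/fc7 vector

# Every ordered (t_in, t_out) pair that can be drawn for this image.
ordered_pairs = list(itertools.product(captions, repeat=2))

rng = random.Random(0)
t_in, v, t_out = sample_instance(visual_vector, captions, rng)
print(len(ordered_pairs))
```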

4.2 Training

We solve the optimization problems of Equations 5 and 6 using the Adam method [14] for stochastic optimization, with default parameters (learning rate $0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$). Note that there are two independent instances of the Adam optimizer, one associated with the visual branch parameters $\Theta_v$ (Equation 5) and the other with the textual branch parameters $\Theta_t$ (Equation 6). In this preliminary study we decided to set for the SL an equal selection probability for the visual loss $\mathcal{L}_v$ and the textual loss $\mathcal{L}_t$; different distributions will be investigated in future research.

We set the size of the training batch to 100 examples. We set the maximum number of iterations to 300,000, but apply early stopping when the model starts overfitting (as reflected in the validation error). The training set is shuffled each time a complete pass over all images is completed.

All the weight matrices $W_i$ have been initialized at random according to a truncated normal distribution centered at zero, with a standard deviation that depends on $n$, the number of columns of the matrix. The biases have all been initialized to 0.

The vocabulary size is 10,358 for Text2Vis_1 (the unigram variant) after removing terms appearing in fewer than 5 captions. For Text2Vis_N (the variant that also uses N-grams) we considered the 23,968 unigrams and N-grams appearing in at least 10 captions. Since the number of units in the hidden and output layers are 1024 and 4096, respectively, the total number of parameters of the models amounts to 25.4M in Text2Vis_1 and 53.3M in Text2Vis_N.
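The parameter totals can be checked with a few lines of arithmetic. The sketch assumes the architecture of Section 3.1, i.e., one weight matrix and one bias vector for each of the three layers (input to hidden, hidden to visual output, hidden to textual output); the function name is hypothetical.

```python
def count_parameters(vocab_size, hidden=1024, visual=4096):
    """Weights and biases of the three layers of the two-branch network."""
    input_to_hidden = vocab_size * hidden + hidden
    hidden_to_visual = hidden * visual + visual
    hidden_to_text = hidden * vocab_size + vocab_size
    return input_to_hidden + hidden_to_visual + hidden_to_text

unigram_params = count_parameters(10358)   # unigram vocabulary
ngram_params = count_parameters(23968)     # unigram and N-gram vocabulary
print(unigram_params, ngram_params)
```

Both totals round to the figures reported above (25.4M and 53.3M), with or without the bias terms, since the biases account for fewer than 30K parameters in either model.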

A TensorFlow implementation of Text2Vis is available at https://github.com/AlexMoreo/tensorflow-Tex2Vis.

Figure 3: Cumulative probability distribution of the difference in performance of our Text2Vis with respect to VisSim (upper plot) and VisReg (lower plot). Positive differences mean Text2Vis obtained a better ranking score than VisSim or VisReg (resp. 69.2% and 58.1% of cases, shadowed region).

4.3 Evaluation Measures

Image retrieval is performed by similarity search in the visual space, using Euclidean distance on the l2-normalized visual vectors to generate a ranking of images, sorted by closeness. We measure the retrieval effectiveness of the visual representations produced from textual descriptions by our Text2Vis network by means of the Discounted Cumulative Gain (DCG [12]), defined as:

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

where $rel_i$ quantifies the relevance of the retrieved element at rank position $i$ with respect to the query, and $p$ is the rank at which the metric is computed; we set $p = 25$ in our experiments, as was done in related research [11, 6].
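As an illustration of the metric, the following sketch computes DCG for a list of relevance values given in rank order. It assumes the exponential-gain form of DCG, with gain 2^rel - 1 and discount log2(i + 1), and a default cut-off of 25; both the exact form and the cut-off are assumptions of this sketch, and the function name is hypothetical.

```python
import math

def dcg(relevances, p=25):
    """Discounted Cumulative Gain of a ranked list of relevance values.

    relevances[0] is the relevance of the top-ranked element; only the
    first p elements contribute.
    """
    return sum((2.0 ** rel - 1.0) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))

# Fully relevant results at ranks 1 and 3, an irrelevant one at rank 2.
score = dcg([1.0, 0.0, 1.0])
print(score)
```

In this example rank 1 contributes 1 / log2(2) = 1 and rank 3 contributes 1 / log2(4) = 0.5, so the same relevant element counts half as much when it appears two positions lower.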

Because the $rel_i$ values are not provided in MsCOCO, we estimate them by using the ROUGE-L [16] metric. ROUGE-L is one of the evaluation measures for the MsCOCO caption generation competition [3] (https://github.com/tylin/coco-caption). We compute $rel_i = \text{ROUGE-L}(q, C_i)$, where $q$ is the query caption and $C_i$ are the 5 captions associated with the retrieved image at rank $i$. This caption-to-caption relevance model is thus aimed at measuring how much the concepts expressed in the query appear as relevant parts of the retrieved images.
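A caption-to-caption relevance score of this kind can be sketched with a longest-common-subsequence (LCS) F-measure. The sketch makes several assumptions that are not stated in the text: whitespace tokenization, the query treated as the candidate and the image captions as references, the best precision and best recall taken over the references, and a weight `beta = 1.2`. It is an illustration of an LCS-based score, not a reimplementation of the official evaluation code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def lcs_relevance(query, references, beta=1.2):
    """LCS-based F-measure between a query caption and reference captions.

    Assumes `references` is a non-empty list of strings.
    """
    q = query.lower().split()
    precisions, recalls = [], []
    for reference in references:
        r = reference.lower().split()
        lcs = lcs_length(q, r)
        precisions.append(lcs / len(q) if q else 0.0)
        recalls.append(lcs / len(r) if r else 0.0)
    prec, rec = max(precisions), max(recalls)
    if prec == 0.0 or rec == 0.0:
        return 0.0
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

query = "a woman cutting a pizza with a knife"
same = lcs_relevance(query, [query])
partial = lcs_relevance(query, ["a woman holds a knife to cut pizza"])
none = lcs_relevance(query, ["two dogs running outdoors"])
print(same, partial, none)
```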

4.4 Results

We compared the performance of the Text2Vis_1 (unigrams) and Text2Vis_N (unigrams and N-grams) models against: RRank, a lower-bound baseline that produces a random ranking of images for any query; VisSim, a direct similarity method that computes the Euclidean distances using the original fc6, or fc7, features of the image that is associated with the query caption in MsCOCO; and VisReg, the text-to-image regressor described in Figure 1.

Table 1 reports the averaged DCG scores obtained by the compared methods. These results show a significant improvement of our proposal with respect to the compared methods. When using fc6 as the visual space, the unigram model Text2Vis_1 obtains the best result, improving over both VisSim and VisReg; the N-gram model Text2Vis_N also improves over both baselines, by slightly smaller margins. When using fc7 as the visual space it is Text2Vis_N that obtains, yet by a small margin, the best result, again improving over both VisSim and VisReg, with Text2Vis_1 close behind.

Method        fc6      fc7
RRank         1.524    1.524
VisSim        2.150    2.180
VisReg        2.317    2.359
Text2Vis_1    2.350    2.382
Text2Vis_N    2.339    2.385

Table 1: Performance comparison of the different methods in terms of averaged DCG
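The relative improvements implied by Table 1 can be recomputed directly from the reported scores. The sketch below does only that arithmetic; the percentages it prints are derived from the rounded values in the table, so they may differ in the last digit from figures computed on unrounded scores. The two Text2Vis rows are labeled by their order in the table, and the function name is hypothetical.

```python
# Averaged DCG scores copied from Table 1, as (fc6, fc7) pairs.
dcg_scores = {
    "RRank": (1.524, 1.524),
    "VisSim": (2.150, 2.180),
    "VisReg": (2.317, 2.359),
    "Text2Vis (row 1)": (2.350, 2.382),
    "Text2Vis (row 2)": (2.339, 2.385),
}

def relative_improvement(score, baseline):
    """Improvement of score over baseline, as a percentage of the baseline."""
    return 100.0 * (score - baseline) / baseline

improvements = {}
for layer_index, layer in enumerate(("fc6", "fc7")):
    for model in ("Text2Vis (row 1)", "Text2Vis (row 2)"):
        for baseline in ("VisSim", "VisReg"):
            gain = relative_improvement(dcg_scores[model][layer_index],
                                        dcg_scores[baseline][layer_index])
            improvements[(layer, model, baseline)] = gain
            print(f"{layer}  {model} vs {baseline}: {gain:.1f}%")
```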

In addition to the averaged performance, we also investigated how often the ranking produced by Text2Vis is more relevant (according to DCG) than those produced by VisSim and VisReg. Figure 3 indicates that in 69.2% of the cases the ranking of Text2Vis was found more relevant than that of VisSim. The same happens in 58.1% of the cases when comparing Text2Vis to VisReg.

Figure 4: Validation loss for the visual and textual losses, optimizing on a linear combination of losses (blue) or using two optimizers with stochastic loss selection (SL, red).

Figure 5: Examples of search results from the three compared methods.

4.5 Why Stochastic Loss?

Text2Vis uses two independent optimizers to optimize the visual ($\mathcal{L}_v$) and the textual ($\mathcal{L}_t$) losses, based on a stochastic choice at each iteration (SL, Section 3.1). Previous approaches to multimodal learning relied instead on a unique aggregated loss (typically of the form $\mathcal{L} = \lambda \mathcal{L}_v + (1 - \lambda) \mathcal{L}_t$) that is minimized by a single optimizer [8, 19]. We compared the two approaches on the case of equal relevance of the two losses ($\lambda = 0.5$, uniform distribution for SL). SL better optimizes the two losses (Figure 4), and is less prone to overfit.

We deem that SL models the relative relevance of the combined losses in a more natural way, i.e., by selecting the losses in proportion to their assigned relevance, whereas the numeric aggregation is affected by the relative values of the losses and by the differences in their variation during the optimization (e.g., a loss that has a large improvement may compensate for another loss getting worse). SL is also computationally lighter than the aggregated loss, as SL updates only a part of the model at each iteration.
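The relation between the two strategies can be made explicit in one simplified case. The derivation below is a sketch under assumptions that differ from the experimental setting: plain stochastic gradient descent with a single step size $\eta$ (instead of two Adam optimizers), and a probability $p$ of selecting the visual loss. Gradients are taken with respect to all parameters $\Theta$, and are zero for the parameters of the branch that is not selected.

```latex
% Assumptions: plain SGD with step size \eta; the visual loss is selected with probability p.
\mathbb{E}[\Delta\Theta]
  = -\eta \left( p \, \nabla_{\Theta} \mathcal{L}_v + (1 - p) \, \nabla_{\Theta} \mathcal{L}_t \right)
  = -\eta \, \nabla_{\Theta} \left( p \, \mathcal{L}_v + (1 - p) \, \mathcal{L}_t \right)
```

Under these assumptions the expected SL step coincides with a step on an aggregated loss whose weights are the selection probabilities. This equivalence does not carry over to the setting of Section 4.2, where each loss has its own Adam optimizer with separate moment estimates, which may be one reason why the two strategies behave differently in practice.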

4.6 Visual comparison

Figure 5 shows a few samples (more results at https://github.com/AlexMoreo/tensorflow-Tex2Vis) that highlight the differences in results from the three compared methods. In all cases, the results from the VisSim method are dominated by the main visual features of the images: a face for the first query, the content of the screen for the second query, and an outdoor image with a light lower part, plants, people, and a bit of sky for the third one. The two text-based methods obtain results that more often contain the key elements of the description. For the first query, Text2Vis retrieves four relevant images out of five, one more than VisReg. For the other two queries the results are quite similar, with Text2Vis placing in second position an image that is a perfect match for the query, while VisReg places it in fifth position.

5 Conclusions

The preliminary experiments indicate that our method produces more relevant rankings than those produced by similarity search directly on the visual features of a query image. This is an indication that our text-to-image mapping produces better prototypical representations of the desired scene than the representation of a sample image itself. A simple explanation of this result is that textual descriptions strictly emphasize the relevant aspects of the scene the user has in mind, whereas the visual features, directly extracted from the query image, keep track of all the information contained in that image, causing the similarity search to be potentially confused by secondary elements of the scene. The Text2Vis model also improved, yet by a smaller margin, over the VisReg model, showing that an autoencoding branch in the network is useful to avoid overfitting on visual features. We also found that combining losses in a stochastic fashion, rather than numerically, improves both the effectiveness and the efficiency of the system. In the future we plan to compare Text2Vis against the recently proposed Word2VisualVec [6] model. We also intend to improve the modeling of word order information in Text2Vis, likely by adding a recurrent component to the network architecture.