Imagination, creating new images in the mind, is a fundamental capability of humans, studies of which date back to Plato’s ideas about memory and perception. Through imagery, we form mental images, picture-like representations in our mind, that encode and extend our perceptual and linguistic experience of the world. Recent work in neuroscience attempts to generate reconstructions of these mental images, as encoded in vector-based representations of fMRI patterns. In this work, we take the first steps towards implementing the same paradigm in a computational setup, by generating images that reflect the imagery of distributed word representations.
We introduce language-driven image generation, the task of visualizing the contents of a linguistic message, as encoded in word embeddings, by generating a real image. Language-driven image generation can serve as an evaluation tool, providing an intuitive visualization of what computational representations of word meaning encode. More ambitiously, effective language-driven image generation could complement image search and retrieval, producing images for words that are not associated with images in a certain collection, either because of sparsity or due to their inherent properties (e.g., artists and psychologists might be interested in images of abstract or novel words). In this work, we focus on generating images for distributed representations encoding the meaning of single words. However, given recent advances in compositional distributed semantics that produce embeddings for arbitrarily long linguistic units, we also see our contribution as the first step towards generating images depicting the meaning of phrases (e.g., blue car) and sentences. After all, language-driven image generation can be seen as the symmetric goal of recent research (e.g., [7, 8]) that introduced effective methods to generate linguistic descriptions of the contents of a given image.
To perform language-driven image generation, we combine various recent strands of research. Tools such as word2vec and GloVe have been shown to produce extremely high-quality vector-based word embeddings. At the same time, in computer vision, images are effectively represented by vectors of abstract visual features, such as those extracted by Convolutional Neural Networks (CNNs). Consequently, the problem of translating between linguistic and visual representations has been couched in terms of learning a cross-modal mapping function between vector spaces [4, 19]. Finally, recent work in computer vision, motivated by the desire to achieve a better understanding of what the layers of CNNs and other deep architectures have really learned, has proposed feature inversion techniques that map a representation in abstract visual feature space (e.g., from the top layer of a CNN) back onto pixel space, to produce a real image [24, 10].
Our language-driven image generation system takes a word embedding as input (e.g., the word2vec vector for grasshopper), projects it with a cross-modal function onto visual space (e.g., onto a representation in the space defined by a CNN layer), and then applies feature inversion to it (using the HOGgles method of Vondrick et al.) to generate an actual image (cell A18 in Figure 3). We test our system in a rigorous zero-shot setup, in which words and images of tested concepts are neither used to train cross-modal mapping, nor employed to induce the feature inversion function. So, for example, our system mapped grasshopper onto visual and then pixel space without having ever been exposed to grasshopper pictures.
Figure 3 illustrates our results (“answer key” for the figure provided as supplementary material). While it is difficult to discriminate among similar objects based on these images, the figure shows that our language-driven image generation method already captures the broad gist of different domains (food looks like food, animals are blobs in a natural environment, and so on).
2 Language-driven image generation
2.1 From word to visual vectors
Up to now, feature inversion algorithms [10, 22, 24] have been applied to visual representations directly extracted from images (hence the “inversion” name). We aim instead at generating an image conveying the semantics of a concept as encoded in a word representation. Thus, we need a way to “translate” the word representation into a visual representation, i.e., a representation lying in the visual space that conveys the corresponding visual semantics of the word.
Cross-modal mapping was first introduced in the context of zero-shot learning as a way to address the manual annotation bottleneck in domains where other vector-based representations (e.g., images or brain signals) must be associated with word labels [13, 19]. This is achieved by using training data to learn a mapping function from vectors in the domain of interest to vector representations of word labels. In our case, we are interested in the general ability of cross-modal mapping to translate a representation between different spaces, and specifically from a word to a visual feature space.
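The mapping is performed by inducing a function $f_{\text{proj}}: \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{d_2}$ from data points $(w_i, v_i)$, where $w_i \in \mathbb{R}^{d_1}$ is a word representation and $v_i \in \mathbb{R}^{d_2}$ the corresponding visual representation. The mapping function can then be applied to any given word vector $w_j$ to obtain its projection $\hat{v}_j = f_{\text{proj}}(w_j)$ onto visual space. Following previous work [13, 4], we assume that the mapping is linear. To estimate its parameters $M \in \mathbb{R}^{d_1 \times d_2}$, given word vectors $W$ paired with visual vectors $V$, we use Elastic-Net-penalized least squares regression, which linearly combines the L1 and L2 weight penalties of Lasso and Ridge regularization:
$$\hat{M} = \underset{M \in \mathbb{R}^{d_1 \times d_2}}{\arg\min}\; \|WM - V\|_F^2 + \lambda_1 \|M\|_1 + \lambda_2 \|M\|_F^2 \tag{1}$$
By modifying the weights of the L1 and L2 penalties, $\lambda_1$ and $\lambda_2$, we can derive different regression methods. Specifically, we experiment with plain regression ($\lambda_1 = 0$, $\lambda_2 = 0$), ridge regression ($\lambda_1 = 0$, $\lambda_2 \neq 0$), lasso regression ($\lambda_1 \neq 0$ and $\lambda_2 = 0$) and symmetric elastic net ($\lambda_1 = \lambda_2$, $\lambda_1 \neq 0$).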
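As an illustration of the linear cross-modal mapping, the following minimal sketch fits the two penalty settings that admit a closed-form solution: plain regression (no penalty) and ridge regression (L2 penalty only). The function names `fit_linear_map` and `project`, the argument `l2`, and the toy data are hypothetical choices made for this example and are not taken from our implementation. The lasso and elastic-net settings involve the L1 penalty and would require an iterative solver, which is omitted here.

```python
import numpy as np


def fit_linear_map(W, V, l2=0.0):
    """Fit M minimizing ||W M - V||_F^2 + l2 * ||M||_F^2.

    l2 == 0 gives plain least squares; l2 > 0 gives ridge regression.
    (Hypothetical helper: the L1-penalized variants are not covered.)
    """
    if l2 == 0.0:
        # Plain regression: least-squares solution via lstsq.
        M, *_ = np.linalg.lstsq(W, V, rcond=None)
        return M
    d1 = W.shape[1]
    # Ridge closed form: (W^T W + l2 * I)^{-1} W^T V.
    return np.linalg.solve(W.T @ W + l2 * np.eye(d1), W.T @ V)


def project(M, w):
    """Map a word vector (or a matrix of row vectors) onto visual space."""
    return w @ M


# Toy data standing in for (word vector, visual vector) training pairs.
rng = np.random.default_rng(0)
n_pairs, d1, d2 = 200, 10, 6
W = rng.normal(size=(n_pairs, d1))
M_true = rng.normal(size=(d1, d2))
V = W @ M_true + 0.01 * rng.normal(size=(n_pairs, d2))

M_plain = fit_linear_map(W, V)            # plain regression (no penalty)
M_ridge = fit_linear_map(W, V, l2=10.0)   # ridge regression (L2 penalty only)

w_new = rng.normal(size=d1)               # stands in for an unseen word vector
v_hat = project(M_plain, w_new)           # its estimated visual vector
```

Once the weights are estimated, projecting an unseen word vector is a single matrix product, which is what makes the zero-shot use of the mapping straightforward.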
2.2 From visual vectors to images
Convolutional Neural Networks have recently surpassed human performance on object recognition. Nevertheless, these models exhibit “intriguing properties” that are somewhat surprising given their state-of-the-art performance, prompting an effort to reach a deeper understanding of how they really work. Given that these models consist of millions of parameters, there is ongoing research on feature inversion of different CNN layers to attain an intuitive visualization of what each of them learned.
Several methods have been proposed for inverting CNN visual features. However, the exact nature of the task imposes certain constraints on the inversion method. For example, the original work of Zeiler and Fergus cannot be straightforwardly adapted to our task of generating images from word embeddings, since their DeConvNet method requires information related to the activations of the network in several layers. In this work, we adopt the framework of Vondrick et al., which casts the problem of inversion as paired dictionary learning. (The HOGgles method was originally introduced for visualizing HOG features. However, the method does not make feature-specific assumptions, and it has also recently been used to invert CNN features.)
Specifically, given an image $x_0$ and its visual representation $y = \phi(x_0)$, the goal is to find an image $\hat{x}$ that minimizes the reconstruction error:
$$\hat{x} = \underset{x}{\arg\min}\; \|\phi(x) - y\|_2^2$$
Given that there are no guarantees regarding the convexity of $\phi$, both images and visual representations are approximated by paired, over-complete bases, $U$ and $B$, respectively. Enforcing $U$ and $B$ to have paired representations through shared coefficients $\alpha$, i.e., $x_0 = U\alpha$ and $y = B\alpha$, allows the feature inversion to be done by estimating the coefficients $\alpha$ that minimize the reconstruction error. Practically, the algorithm proceeds by finding $U$, $B$ and $\alpha$ through a standard sparse coding method. For learning the parameters, the algorithm is presented with training data of the form $(x_i, y_i)$, where $x_i$ is an image patch and $y_i$ the corresponding visual vector associated with that patch.
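The inversion step with paired bases can be sketched as follows, assuming the two dictionaries have already been learned. The names `U` (image basis), `B` (feature basis), `sparse_code` and `invert_features`, as well as the sparsity weight `lam`, are hypothetical and chosen for this example. The sparse coding routine is a plain ISTA (iterative soft-thresholding) solver, swapped in here for simplicity; it is not the solver used by the HOGgles software.

```python
import numpy as np


def sparse_code(y, B, lam=0.01, n_iter=2000):
    """ISTA for min_a 0.5 * ||B a - y||_2^2 + lam * ||a||_1."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = a - step * (B.T @ (B @ a - y))   # gradient step on the squared error
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return a


def invert_features(y, U, B, lam=0.01):
    """Estimate shared coefficients in feature space, then decode with the image basis."""
    return U @ sparse_code(y, B, lam)


# Toy paired dictionaries (random here; in practice they are learned from patches).
rng = np.random.default_rng(0)
pixel_dim, feat_dim, n_atoms = 64, 20, 40    # n_atoms > feat_dim: over-complete
U = rng.normal(size=(pixel_dim, n_atoms))
B = rng.normal(size=(feat_dim, n_atoms))
B /= np.linalg.norm(B, axis=0)               # unit-norm feature atoms

alpha_true = np.zeros(n_atoms)
alpha_true[[3, 17]] = [1.0, -1.5]            # a sparse shared code
x0 = U @ alpha_true                          # "image"
y = B @ alpha_true                           # its "visual vector"

alpha_hat = sparse_code(y, B)
x_hat = invert_features(y, U, B)
```

The key design point is that the coefficients are estimated only from the feature vector, so the same routine applies whether that vector was extracted from an image or produced by cross-modal mapping from a word embedding.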
3 Experimental Setup
We refer to the words we generate images for as dreamed concepts. The dreamed word set comes from the concepts studied by McRae et al. in the context of property norm generation. This set contains 541 base-level concrete concepts (e.g., cat, apple, car) that span 20 broad categories (e.g., animal, fruit/vegetable, vehicle). For the purposes of the current experiments, 69 McRae concepts were excluded (either because of high ambiguity or for technical reasons), resulting in 472 dreamed words we test on.
We refer to the set of words associated with real pictures that are used for training purposes as seen concepts. The real picture set contains approximately 480K images extracted from ImageNet representing 5K distinct concepts. The seen concepts are used for training the cross-modal mapping. Importantly, the dreamed and seen concept sets do not overlap.
For all seen and dreamed concepts, we build 300-dimensional word vectors with the word2vec toolkit, using the CBOW method. (Other hyperparameters, adopted without tuning, include a context window size of 5 words to either side of the target, setting the sub-sampling option to 1e-05 and estimating the probability of target words by negative sampling, drawing 10 samples from the noise distribution.) CBOW, which learns to predict a target word from the ones surrounding it, produces state-of-the-art results in many linguistic tasks. Word vectors are induced from a language corpus (e.g., Wikipedia) of 2.8 billion words. (Corpus sources: http://wacky.sslmit.unibo.it, http://www.natcorp.ox.ac.uk)
The visual representations for the set of 480K seen concept images are extracted with the pre-trained CNN model of Krizhevsky et al. through the Caffe toolkit. CNNs trained on natural images learn a hierarchy of increasingly more abstract properties: the features in the bottom layers resemble Gabor filters, while features in the top layers capture more abstract properties of the dataset or tasks the CNN is trained for (e.g., the topmost layer captures a distribution over training labels). In this work, we experiment with feature representations extracted from two levels: pool-5, extracted from the 5th layer (6x6x256=9216 dimensions), and fc-7, extracted from the 7th layer (1x4096 dimensions). pool-5 is an intermediate pooling layer that should capture object commonalities. fc-7 is a fully-connected layer just below the topmost one, and as such it is expected to capture high-level discriminative features of different object classes.
Since each seen concept is associated with many images, we experiment with two ways to derive a unique visual representation. Inspired by categorization schemes in cognitive science, we will refer to them as the prototype and exemplar methods. The prototype visual vector of a concept is constructed by averaging the visual representations (either pool-5 or fc-7) of images tagged in ImageNet with the concept. The averaging method should smooth out noise and emphasize invariances in images associated with a concept. However, the constructed prototype does not correspond to an actual depiction of the concept. The exemplar visual vector, on the other hand, is a single visual vector that is picked as a good representative of the set, as it is the one with the highest average cosine similarity to all other vectors extracted from images labeled with the same concept.
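The two concept representation methods can be sketched in a few lines. The function names `prototype` and `exemplar` and the toy 2-dimensional feature vectors are illustrative assumptions; real inputs would be pool-5 or fc-7 vectors of all images labeled with one concept.

```python
import numpy as np


def prototype(vectors):
    """Average of all image vectors for a concept."""
    return vectors.mean(axis=0)


def exemplar(vectors):
    """The image vector with the highest average cosine similarity to the others."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                       # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)                # ignore self-similarity
    avg = sims.sum(axis=1) / (len(vectors) - 1)
    return vectors[np.argmax(avg)]


# Toy feature vectors for four images of one concept (one row per image).
feats = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.8, 0.3],
                  [0.0, 1.0]])

proto = prototype(feats)   # [0.675, 0.35], not one of the rows
exem = exemplar(feats)     # [0.8, 0.3], the third row
```

In this toy case the prototype is a new point that matches none of the input rows, while the exemplar is by construction one of the actual image vectors, which mirrors the trade-off described above.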
3.2 Model selection and parameter estimation
Visual feature type and concept representations
In order to determine the optimal visual feature type (between pool-5 and fc-7) and concept representation method (between prototype and exemplar), we set up a human study through CrowdFlower (http://www.crowdflower.com/). For 50 randomly chosen test concepts, we generate 4 images, each obtained by inverting the visual vector computed by combining a feature type with a concept representation method. For example, for pool-5+prototype, we generate an image by inverting the visual vector resulting from averaging the pool-5 feature vectors extracted from images labeled with the test concept (details on our implementation of feature inversion below). Participants are then asked to judge which of the 4 images is most likely to denote the test concept. For each test concept, we collect 20 judgments. Overall, participants showed a strong significant preference for the images generated from inverting pool-5 feature vectors (28/50), and in particular for those that were generated by inverting pool-5 feature vectors constructed with the exemplar protocol (18/50). (Throughout this paper, statistical significance is assessed against a fixed threshold with two-tailed exact binomial tests, corrected for multiple comparisons with the false discovery rate method.) The following experiments were thus carried out using the pool-5+exemplar visual space.
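The significance testing used throughout (two-tailed exact binomial tests with false discovery rate correction) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the two-tailed p-value sums the probability of all outcomes no more likely than the observed one, the false discovery rate correction is taken to be the Benjamini-Hochberg procedure, and the level `alpha=0.05` and the vote counts are placeholders rather than values from our experiments.

```python
from math import comb


def binom_two_tailed(k, n, p=0.5):
    """Exact two-tailed binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)          # tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Flag which tests are significant at false discovery rate alpha
    (alpha=0.05 is an assumed placeholder)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    # Find the largest 1-based rank r with p_(r) <= alpha * r / m.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            max_rank = rank
    significant = [False] * m
    for i in order[:max_rank]:
        significant[i] = True
    return significant


# Hypothetical vote counts (out of 20 judgments) for four two-choice items.
votes = [18, 15, 10, 3]
pvals = [binom_two_tailed(k, 20) for k in votes]
flags = benjamini_hochberg(pvals)
```

With these placeholder counts, the item with 15/20 votes has an uncorrected p-value below 0.05 but is no longer flagged after correction, which is the effect the correction is meant to have when many concepts are tested at once. For the 4-way choice described above, the chance level would be `p=0.25` instead of the default `p=0.5`.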
To learn the mapping of Equation 1, we use 5K training pairs $(w_c, v_c)$, where $w_c$ is the word vector and $v_c$ is the visual vector for the (seen) concept $c$, based on pool-5 features and exemplar representation. Specifically, we estimate the weights $M$ of the linear mapping by training the 4 regression methods described in Section 2.1 above, cross-validating the values of the penalty weights $\lambda_1$ and $\lambda_2$ on the training data. Model selection is performed by conducting a human study on the language-driven image generation task. For the same test set of 50 concepts as above, we obtain estimates $\hat{v}_c$ of their visual vectors by mapping their word vectors $w_c$ into visual space through the different mapping functions $f_{\text{proj}}$. We then generate an image by inverting the visual features $\hat{v}_c$. Participants are again asked to judge which of the 4 images is most likely to denote the test concept. For each concept we collected 20 judgments. Participants showed a preference for plain regression (9/50 significant tests in favor of this model), which we adopt in the rest of the paper.
Training data for feature inversion (Section 2.2 above) are created by using the PASCAL VOC 2011 dataset, which contains 15K images of 20 distinct objects. Note that the 20 PASCAL objects are not part of our dreamed concepts, and thus the feature inversion is performed in a zero-shot way (the inversion will be asked to generate an image for a concept that it has never encountered before). In order to increase the size of the training data, from each image we derived several image patches associated with different parts of the image and paired them with their equivalent visual representations. Both paired dictionary learning and feature inversion are conducted using the HOGgles software (https://github.com/CSAILVision/ihog) with default hyperparameters.
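The construction of (patch, visual vector) training pairs can be illustrated with the sketch below. It makes several assumptions that are not details of the HOGgles software: patches are taken with a regular sliding window, and `toy_features` is a hypothetical stand-in for the CNN feature extractor, used only to keep the example self-contained.

```python
import numpy as np


def extract_patches(image, size, stride):
    """Slide a size x size window over the image and collect the crops."""
    height, width = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, height - size + 1, stride)
            for c in range(0, width - size + 1, stride)]


def toy_features(patch):
    """Hypothetical stand-in for a CNN layer: mean intensity of each patch row."""
    return patch.mean(axis=1)


image = np.arange(64, dtype=float).reshape(8, 8)    # toy 8x8 grayscale image
patches = extract_patches(image, size=4, stride=2)  # 3 x 3 = 9 window positions
pairs = [(p.ravel(), toy_features(p)) for p in patches]
```

Each pair holds a flattened patch and its feature vector, which is the form of training data that paired dictionary learning expects.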
Figure 3 provides a snapshot of our results; we randomly picked 10 dreamed concepts from each of the 20 McRae categories, and we show the image we generated for them from the corresponding word embeddings, as described in Section 2. We stress again that the images of dreamed concepts were never used in any step of the pipeline, neither to train cross-modal mapping, nor to train feature inversion, so they are genuinely generated in a zero-shot manner, by leveraging their linguistic associations to seen concepts.
Not surprisingly, the images we generate are not as clear as those one would get by retrieving existing images. However, we see in the figure that concepts belonging to different categories are clearly distinguished, with the exception of food and fruit/vegetable (columns 12 and 13), that look very much the same (on the other hand, fruit and vegetable are also food, and word vectors extracted from corpora will likely emphasize this “functional” role of theirs).
We next present a series of user studies providing quantitative and qualitative insights into the information that subjects can extract from the visual properties of the generated images.
4.1 Experiment 1: Correct word vs. random confounder
The first experiment is a sanity check, evaluating whether the visual properties in the generated images are informative enough for subjects to guess the correct label against a random alternative.
Participants are presented with the generated image of a dreamed concept and are asked to judge if it is more likely to denote the correct word or a confounder randomly picked from the seen word set. Given that the confounder is a randomly picked item, the task is relatively easy. However, both confounders and dreamed concepts are concrete, basic-level concepts, so they are sometimes related just by chance. Moreover, the confounders were used to train the mapping and inversion functions, which could have introduced a systematic bias in their favour. We test the 472 dreamed concepts, collecting 20 ratings for each via CrowdFlower. Word order is randomized both across and within trials (the same setup is used in the following experiments, with image order also randomized).
Participants show a consistent preference for the correct word (dreamed concept) (median proportion of votes in favor: 75%). Preference for the correct word is significantly different from chance in 211/472 cases. Participants expressed a significant preference for the confounder in 10 cases only, and in the majority of those, dreamed concepts and their confounders shared similar properties, e.g., cape-tabletop (both made of textile), zebra-baboon (both mammals), oak-boathouse (existing in similar natural environments).
The experiment confirms that our method can generally capture at least those visual properties of dreamed concepts that can distinguish them from visually dissimilar random items.
4.2 Experiment 2: Correct image vs. image of similar concept
The second experiment ascertains to what extent subjects can pick the right generated image for a dreamed concept over a closely related alternative.
For each dreamed concept, we pick as confounder the closest semantic neighbor according to the subject-based conceptual distance statistics provided by McRae et al. In 379/472 cases, the confounder belongs to the category of the dreamed concept; hence, distinguishing the two concepts is quite challenging (e.g., mandarin vs. pumpkin). Participants were presented with the images generated from the dreamed concept and the confounder, and they were asked which of the two images is more likely to denote the dreamed concept.
Results and examples are provided in Table 1. In the vast majority of cases (409/472) the participants did not show a significant preference for either the correct image or the confounder. This shows that the current image generation pipeline does not yet capture fine-grained properties that would allow within-category discrimination. Still, within the subset of 63 cases for which subjects did express a significant preference, we observe a clear trend in favour of the correct image (41 vs. 22). Color and environment seem to be the fine-grained properties that determined many of the subjects’ right or wrong choices within this subset. Of the 63 pairs, 14 involve concepts from different categories, and 49 are same-category pairs. Of the former, in 11/14 the preference was for the right image. In 2 of the 3 wrong cases, the dreamed concept vs. confounder pairs have similar color (emerald vs. parsley, bowl vs. dish), while neither concept has a typical discriminative color in the third case (thermometer vs. marble). Even in the challenging same-category group, 30/49 pairs display the right preference. In particular, subjects distinguished objects that typically have different colors (e.g., flamingo vs. partridge), or live in different environments (e.g., turtle vs. tortoise). In the remaining 19 within-category cases in which the confounder was preferred, color seems again to play a crucial role in the confusion (e.g., alligator vs. crocodile, asparagus vs. spinach).
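Table 1: Experiment 2 cases with a significant preference, with example pairs (dreamed concept vs. confounder). In favor of dreamed concept: 8.6% (41/472). In favor of confounder: 4.6% (22/472).
| In favor of dreamed concept, same category | In favor of dreamed concept, different category | In favor of confounder, same category | In favor of confounder, different category |
|---|---|---|---|
| flamingo vs. partridge | helicopter vs. shotgun | alligator vs. crocodile | bowl vs. dish |
| turtle vs. tortoise | barn vs. cabinet | sailboat vs. boat | emerald vs. parsley |
| pumpkin vs. mandarin | whale vs. bison | asparagus vs. spinach | thermometer vs. marble |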
We next ran a follow-up experiment to find out to what extent the lack of precision of our algorithm should be attributed to noise in image generation from abstract visual features, independently of the linguistic origin of the signal. For these purposes, we replaced the visual feature vector produced by cross-modal mapping with the “gold-standard” visual vector for each dreamed/confounder concept (e.g., instead of mapping the partridge word vector onto visual space, we generated a partridge image by inverting a pool-5+exemplar vector directly extracted from a set of images labeled with this word obtained from ImageNet). We repeated the Experiment 2 setup using the images generated in this way. In this case, no significant preference emerged for 75.4% (356/472) of the pairs, in 17.6% (83/472) of the cases there was a significant preference for the correct image, and in 7% (33/472) for the confounder. The results in this setting are better than when visual features are derived from word representations, but not dramatically so. Since feature inversion is an active area of research in computer vision, we can thus expect that the quality of language-driven image generation will greatly improve simply by virtue of general advances in image generation methods.
4.3 Experiment 3: Judging macro-categories of objects
The previous experiments have shown that our language-driven image generation system visualizes properties that are salient and relevant enough to distinguish unrelated concepts (Experiment 1) but not closely related ones (Experiment 2). The last experiment takes high-level category structure explicitly into account in the design.
We group the McRae categories into three macro-categories, namely animal vs. organic vs. man-made, that are widely recognized in cognitive science as fundamental and unambiguous. Participants are given a generated image and are asked to pick the macro-category that best describes the object in it.
Again, the proportion of images for which participants’ preferences are not significant is high: 28% of the organic images, 47% of the man-made images and 56% of the animal images. However, when participants do show a significant preference, in the large majority of cases it is in favor of the correct macro-category: this is so for 98% of the organic images (70.5% of total), 90% of the man-made images (48% of total), and 59% of the animal ones (25.7% of total). Table 2 reports the confusion matrix across the macro-categories. Confusions arise where one would expect them: both man-made and animal images are more often confused with organic things than with each other.
Again, color (either of the object itself or of the environment) is the leading property distinguishing objects among the three macro-categories. As Figure 3 shows, orange, green and a darker mixture of colors characterize organic things, animals, and man-made objects respectively. Images that do not typically have these colors are harder to recognize. For instance, the few mistakes for organic images belong to the natural object category (e.g., rocks); all the other categories within this macro-category are in the vast majority of the cases judged correctly. In the man-made macro-category (Figure 2, left), the images of buildings are the most easily recognizable; as one can see in Figure 3, those images share the same pattern: two horizontal layers (land/dark and sky/blue) with a vertical structure cutting across them (the building itself). Similarly, vehicles display two layers with a small horizontal structure crossing them, and they are almost always correctly classified. Finally, within the animal macro-category (Figure 2, right), birds and fish are more often misclassified than other animals, with their typical environment probably playing a role in the confusion.
We introduced the new task of generating pictures visualizing the semantic content of linguistic expressions as encoded in word embeddings, proposing more specifically a method we dubbed language-driven image generation.
The current system seems capable of visualizing the typical color of object classes and aspects of their characteristic environment. Interestingly, vector-based word representations are notoriously bad at capturing color, and we do not expect them to be much better at characterizing environments, so our results suggest that, already in its current form, our system could also be used to enrich word representations, by highlighting aspects of concepts that are not salient in language but are probably learned by similarity-based generalization from the cross-modal mapping training examples. In this sense, language-driven image generation is more than a simple word embedding evaluation tool. At the same time, our system completely ignores visual properties related to shape. Shapes are not often expressed by linguistic means (although we all recognize the typical “gestalt” of, say, a mammal, it is very difficult to describe it in words), but in the same way in which we can capture color and environment, better visual representations or feature inversion methods might lead us in the future to associate, by means of images, typical shapes with shape-blind linguistic representations.
Currently we approach language-driven image generation as a two-step process. Inspired by recent work in caption generation that conditions word production on visual vectors, we plan to explore an end-to-end model that conditions the generation process on information encoded in the word embeddings of the word/phrase that we wish to produce an image for, building upon classic generative models of image generation [18, 5].
References
- [1] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, 2014.
- [2] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. Distributional semantics in Technicolor. In Proceedings of ACL, pages 136–145, Jeju Island, Korea, 2012.
- [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of CVPR, pages 248–255, Miami Beach, FL, 2009.
- [4] Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In Proceedings of NIPS, pages 2121–2129, Lake Tahoe, NV, 2013.
- [5] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
- [6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- [7] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, Boston, MA, 2015. In press.
- [8] Ryan Kiros, Ruslan Salakhutdinov, and Richard Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop, 2014.
- [9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, pages 1097–1105.
- [10] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of CVPR, 2015.
- [11] Ken McRae, George Cree, Mark Seidenberg, and Chris McNorgan. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559, 2005.
- [12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. http://arxiv.org/abs/1301.3781/, 2013.
- [13] Tom Mitchell, Svetlana Shinkareva, Andrew Carlson, Kai-Min Chang, Vincente Malave, Robert Mason, and Marcel Just. Predicting human brain activity associated with the meanings of nouns. Science, 320:1191–1195, 2008.
- [14] Gregory Murphy. The Big Book of Concepts. MIT Press, Cambridge, MA, 2002.
- [15] Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gallant. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19):1641–1646, 2011.
- [16] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, Doha, Qatar, 2014.
- [17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
- [18] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines. In Proceedings of AISTATS, pages 448–455, 2009.
- [19] Richard Socher, Milind Ganjoo, Christopher Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Proceedings of NIPS, pages 935–943, Lake Tahoe, NV, 2013.
- [20] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pages 1631–1642, Seattle, WA, 2013.
- [21] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- [22] Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. HOGgles: Visualizing object detection features. In Proceedings of ICCV, 2013.
- [23] Carl Vondrick, Hamed Pirsiavash, Aude Oliva, and Antonio Torralba. Acquiring visual classifiers from human imagination. arXiv preprint arXiv:1410.4627, 2014.
- [24] Matthew Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Proceedings of ECCV (Part 1), pages 818–833, Zurich, Switzerland, 2014.
Appendix A Answer Keys to Figure 1
We provide the concept names of the word embeddings used to generate the images of Figure 1 (Figure 1 is reproduced in this document for the reader's convenience). Due to lack of space, we split the concept names into 3 tables, Tables 1-3, which list the concept names of the word embeddings used to generate the man-made, organic and animal images respectively.