Attributes as Semantic Units between Natural Language and Visual Recognition

04/12/2016 ∙ by Marcus Rohrbach, et al.

Impressive progress has been made in the fields of computer vision and natural language processing. However, it remains a challenge to find the best point of interaction for these very different modalities. In this chapter we discuss how attributes allow us to exchange information between the two modalities and in this way lead to an interaction on a semantic level. Specifically, we discuss how attributes allow using knowledge mined from language resources for recognizing novel visual categories, how we can generate sentence descriptions about images and videos, how we can ground natural language in visual content, and finally, how we can answer natural language questions about images.




1 Introduction

Computer vision has made impressive progress in recognizing a large number of object categories Szegedy et al. [2015], diverse activities Wang and Schmid [2013], and most recently also in describing images and videos with natural language sentences Vinyals et al. [2015], Venugopalan et al. [2015b] and answering natural language questions about images Malinowski and Fritz [2014]. Given sufficient training data these approaches can achieve impressive performance, sometimes even on par with humans He et al. [2015]. However, humans have two key abilities most computer vision systems lack. On the one hand, humans can easily generalize to novel categories with no or very little training data. On the other hand, humans can rely on other modalities, most notably language, to incorporate knowledge into the recognition process. To do so, humans seem to rely on compositionality and transferability: they can break up complex problems into components and reuse previously learned components in other (recognition) tasks. In this chapter we discuss how attributes can form such components, which allow us to transfer and share knowledge, incorporate external linguistic knowledge, and decompose the challenging problems of visual description and question answering into smaller semantic units, which are easier to recognize and associate with textual representations.

Figure 1: Examples for textual descriptions and visual content. (a) Semantic attributes allow recognition of novel classes. (b) Sentence description for an image; image and caption from MS COCO Chen et al. [2015].

Let us first illustrate this with two examples. Attribute descriptions given in the form of hierarchical information (a mammal), properties (striped, black, and white), and similarities (similar to a horse) allow humans to recognize a visual category, even if they have never observed this category before. Given such a description in the form of attributes, most humans would be able to recognize the animal shown in Fig. 1(a) as a zebra. Furthermore, once humans know that Fig. 1(a) is a zebra, they can describe what it is doing within a natural sentence, even if they never saw example images with captions of zebras before (Fig. 1(b)). A promising way to handle these challenges is to have compositional models which allow interaction between multi-modal information at a semantic level.

One prominent way to model such a semantic level are semantic attributes. As the term “attribute” has a large variety of definitions in the computer vision literature, we define it for the course of this chapter as follows.

Definition 1

An attribute is a semantic unit, which has a visual and a textual representation.

The first part of this definition, the restriction to a semantic unit, is important to discriminate attributes from other representations which do not have a human-interpretable meaning, such as image gradients, bags of (visual) words, or hidden representations in deep neural networks. We will refer to these as features. Of course, for a specific feature one can try to find or associate it with a semantic meaning or unit, but typically this is unknown, and once one is able to identify such an association, one has found a representation for this semantic attribute. The restriction to a semantic unit allows connecting to other sources of information on a semantic level, i.e. a level of meaning. In the second part of the definition we restrict attributes to semantic units which can be represented both textually and visually. (There are attributes or semantic units which are not visual but textual, e.g. smells, tastes, and tactile sensory inputs, and ones which are visual but not textual, i.e. naturally difficult to describe in language: think of the many visual patterns beyond striped and dotted for which we do not have a name, or the visual differences between two people or faces which humans can clearly recognize but which might be difficult to put into words. We also like to note that some datasets such as Animals with Attributes Lampert et al. [2014] include non-visual attributes, e.g. smelly, which might still improve classification performance as they are correlated with visual features.) This restriction is specific to this chapter, as we want to exploit the connection between language and visual recognition. From this definition it should also be clear that attributes are not distinct from objects, but rather that objects are also attributes, as they obviously are semantic and have a textual and visual representation.

In this chapter we discuss some of the most prominent directions where language understanding and visual recognition interact. Namely, how knowledge mined from language resources can help visual recognition, how we can ground language in visual content, how we can generate language about visual content, and finally how we can answer natural language questions about images, which can be seen as a combination of grounding the question, recognition, and generating an answer. It is clear that these directions cannot cover all potential interactions between visual recognition and language. Other directions include generating visual content from language descriptions [e.g. Zitnick et al., 2013, Liang et al., 2013] or localizing images in text, i.e. finding where in a text an image is discussed. In the following we first analyze challenges for combining the visual and linguistic modalities; afterwards we provide an overview of this chapter, which includes a discussion of how the different sections relate to each other and to the idea of attributes.

1.1 Challenges for combining visual and linguistic modalities

One of the fundamental differences between the visual and the linguistic modality is the level of abstraction. The basic data unit of the visual modality is a (photographic) image or video, which always shows a specific instance of a category, or even more precisely a certain instance from a specific viewpoint, lighting, pose, time, etc. For example, Fig. 1(a) shows one specific instance of the category zebra from a side view, eating grass. In contrast to this, the basic semantic units of the linguistic modality are words (which are strings of characters, or phonemes for spoken language, but we will restrict ourselves to written linguistic expressions in this chapter). Although a word might refer to a specific instance, the word, i.e. the string, always represents a category of objects, activities, or attributes, abstracting from a specific instance. Interestingly, this difference, instance- versus category-level representation, is also what defines one of the core challenges in visual recognition and is also an important topic in computational linguistics. In visual recognition we are interested in defining or learning models which abstract over a specific image or video to understand the visual characteristics of a category. In computational linguistics, when automatically parsing a text, we frequently face the inverse challenge of trying to identify intra- and extra-linguistic references of a word or phrase (co-reference resolution and grounding: co-reference is when two or more words refer to the same thing or person within a text, while grounding looks at how words refer to things outside the text, e.g. images). These problems arise because words typically represent concepts rather than instances, and because anaphors, synonyms, hypernyms, or metaphorical expressions are used to refer to the identical object in the real world.

Understanding that the visual and linguistic modalities have different levels of abstraction is important when trying to combine both modalities. In Section 2 we use linguistic knowledge at the category rather than the instance level for visual knowledge transfer, i.e. we use linguistic knowledge at the level where it is most expressive, that is, at the level of its basic representation. In Section 3, when describing visual input with natural language, we put the point of interaction at a semantic attribute level and leave the concrete realization of sentences to a language model rather than inferring it from the visual representation, i.e. we recognize the most important components or attributes of a sentence, which are activities, objects, tools, locations, or scenes, and then generate a sentence based on these. In Section 4 we look at a model which grounds phrases that refer to a specific instance by jointly learning visual and textual representations. In Section 5 we answer questions about images by learning small modules which recognize visual elements; these modules are selected according to the question and linked to its most important components, e.g. question words/phrases (How many), nouns (dog), and qualifiers (black). By this composition into modules or attributes, we create an architecture which allows learning these attributes, which link the visual and textual modality, jointly across all questions and images.

1.2 Overview and outline

In this chapter we explain how linguistic knowledge can help to recognize novel object categories and composite activities (Section 2), how attributes help to describe videos and images with natural language sentences (Section 3), how to ground phrases in images (Section 4), and how compositional computation allows for effective question answering about images (Section 5). We conclude with directions for future work in Section 6.

All these directions have in common that attributes form a layer or composition which is beneficial for connecting textual and visual representations. In Section 2, for recognizing novel object categories and composite activities, attributes form the layer where the transfer happens. Attributes are shared across known and novel categories, while information mined from different language resources provides the associations between the known categories and attributes at training time, to learn attribute classifiers, and between the attributes and novel categories at test time, to recognize the novel categories.

When describing images and videos (Section 3), we first learn an intermediate layer of attribute classifiers, which are then used to generate natural language descriptions. This intermediate layer allows us to reason across sentences at a semantic level and in this way to build a model which generates consistent multi-sentence descriptions. Furthermore, we discuss how such an attribute classifier layer allows us to describe novel categories for which no paired image-caption data is available.

When grounding sentences in images, we argue that it makes sense to do this at the level of phrases rather than full sentences, as phrases form semantic units, or attributes, which can be well localized in images. Thus, in Section 4 we discuss how we localize short phrases or referential expressions in images.

In Section 5 we discuss the task of visual question answering, which connects the previous sections, as one has to ground the question in the image and then predict or generate an answer. Here we show how we can decompose the question into attributes, which in this case are small neural network components composed in a computation graph to predict the answer. This allows us to share and train the attributes across questions and images, while building a neural network which is specific to a given question.

The order of the following sections loosely follows the historical development: we start with work which appeared at the time when attributes started to become popular in computer vision Lampert et al. [2009], Farhadi et al. [2010]. The last section covers visual question answering, a problem which requires more complex interactions between language and visual recognition and which has only recently become a topic in the computer vision community Malinowski and Fritz [2014], Antol et al. [2015].

2 Linguistic knowledge for recognition of novel categories

While supervised training is an integral part of building visual, textual, or multi-modal category models, more recently knowledge transfer between categories has been recognized as an important ingredient to scale to a large number of categories as well as to enable fine-grained categorization. This development reflects the psychological point of view that humans are able to generalize to novel categories with only a few training samples [Moses et al., 1996, Bart and Ullman, 2005]. (We use “novel” throughout this chapter to denote categories with no or few labeled training instances.) This has recently gained increased interest in the computer vision and machine learning literature, which looks at zero-shot recognition (with no training instances for a class) [Lampert et al., 2014, Farhadi et al., 2009, Palatucci et al., 2009, Parikh and Grauman, 2011, Fu et al., 2014, Mensink et al., 2012, Frome et al., 2013], and one- or few-shot recognition [Thrun, 1996, Bart and Ullman, 2005, Raina et al., 2007]. Knowledge transfer is particularly beneficial when scaling to large numbers of classes where training data is limited [Mensink et al., 2012, Frome et al., 2013, Rohrbach et al., 2011], distinguishing fine-grained categories [Farrell et al., 2011, Duan et al., 2012], or analyzing compositional activities in videos [Fu et al., 2014, Rohrbach et al., 2012b].


Figure 2: Zero-shot recognition with the Direct Attribute Prediction model Lampert et al. [2009], which allows recognizing unseen classes using an intermediate layer of attributes. Instead of manually defined associations between classes and attributes (cyan lines), Rohrbach et al. [2010] reduce supervision by mining object-attribute associations from language resources, such as Wikipedia, WordNet, and image or web search.

Recognizing categories with no or only few labeled training instances is challenging. In this section we first discuss how we can build attribute classifiers using only category-labeled image data and different language resources, which allows recognizing novel categories (Section 2.1). Then, to further improve this transfer learning approach, we discuss how to additionally integrate instance similarity and labeled instances of the novel classes, if available (Section 2.2). Furthermore, we discuss what changes have to be made to apply similar ideas to composite activity recognition (Section 2.3).

2.1 Semantic relatedness mined from language resources for zero-shot recognition

Lampert et al. [2009, 2014] propose to use attribute-based recognition to recognize unseen categories based on their object-attribute associations. Their Direct Attribute Prediction (DAP) model is visualized in Fig. 2. Given images labeled with known category labels, and object-attribute associations between these categories and attributes, we can learn attribute classifiers for an image. This allows recognizing novel categories if we have associations between the novel categories and the attributes.
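The core of DAP can be sketched in a few lines. The toy example below is an illustrative simplification, not the authors' implementation: the attribute names, probabilities, and associations are made up, and the attribute-prior normalization of the full DAP model is omitted. An unseen class is scored by how well the attribute classifier outputs for an image match the class's attribute signature.

```python
# Toy sketch of Direct Attribute Prediction (DAP). Attribute names,
# probabilities, and class-attribute associations are illustrative
# assumptions; the real model additionally normalizes by attribute priors.

def dap_score(attr_probs, class_attrs):
    """Score an unseen class: multiply the predicted probability of each
    associated attribute being present (or absent, if not associated)."""
    score = 1.0
    for attr, present in class_attrs.items():
        p = attr_probs[attr]
        score *= p if present else (1.0 - p)
    return score

def dap_predict(attr_probs, associations):
    """Pick the unseen class whose attribute signature best matches the
    attribute classifier outputs for one image."""
    return max(associations, key=lambda c: dap_score(attr_probs, associations[c]))

# Attribute classifier outputs for one image (hypothetical values).
attr_probs = {"striped": 0.9, "black": 0.8, "white": 0.85, "spotted": 0.1}

# Class-attribute associations, e.g. defined manually or mined from language resources.
associations = {
    "zebra":   {"striped": True,  "black": True,  "white": True,  "spotted": False},
    "leopard": {"striped": False, "black": False, "white": False, "spotted": True},
}

print(dap_predict(attr_probs, associations))  # zebra
```

Note how the image itself never needs a zebra label: only the attribute classifiers (trained on known classes) and the zebra's attribute signature are required.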

To scale the approach to a larger number of classes and attributes, Rohrbach et al. [2010, 2012c, 2011] show how these previously manually defined attribute associations can be replaced with associations mined automatically from different language resources. Table 1(a) compares several language resources and measures which estimate semantic relatedness to determine if a class should be associated with a specific attribute. Yahoo Snippets Chen et al. [2006], Rohrbach et al. [2012c], which computes co-occurrence statistics on summary snippets returned by search engines, shows the best performance of all single measures. Rohrbach et al. [2012c] also discuss several fusion strategies to obtain more robust measures by expanding the attribute inventory with clustering and by combining several measures, which can achieve performance on par with manually defined associations (second-to-last versus last line in Table 1a).
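Several of the hit-count-based measures in Table 1 use the Dice coefficient over search-engine hit counts. A minimal sketch, with made-up hit counts standing in for real query results:

```python
# Dice coefficient over hit counts, as used by several measures in
# Table 1: 2 * hits(x AND y) / (hits(x) + hits(y)).
# The hit counts below are invented for illustration.

def dice_coefficient(hits_x, hits_y, hits_xy):
    """Semantic relatedness of two terms from (web/image) hit counts."""
    if hits_x + hits_y == 0:
        return 0.0
    return 2.0 * hits_xy / (hits_x + hits_y)

# Hypothetical counts for the queries "zebra", "striped", "zebra striped":
print(dice_coefficient(1_000_000, 5_000_000, 400_000))  # ~0.133
```

Thresholding such a relatedness score (or taking the top-scoring attributes per class) then yields the binary class-attribute associations used for zero-shot recognition.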

Language Resource Measure AUC in %
WordNet [Fellbaum, 1998], path Lin measure Lin [1998] Rohrbach et al. [2010] 60.5
Yahoo Web, hit count Mihalcea and Moldovan [1999] Dice coef. Dice [1945], Sørensen [1948] Rohrbach et al. [2010] 60.4
Flickr Img, hit count Rohrbach et al. [2010] Dice coef. Dice [1945], Sørensen [1948] Rohrbach et al. [2010] 70.1
Yahoo Img, hit count Rohrbach et al. [2010] Dice coef. Dice [1945], Sørensen [1948] Rohrbach et al. [2010] 71.0
Wikipedia Rohrbach et al. [2010] ESA [Gabrilovich and Markovitch, 2007, Zesch and Gurevych, 2010] Rohrbach et al. [2010] 69.7
Yahoo Snippets Chen et al. [2006] Dice/Snippets Rohrbach et al. [2012c] Rohrbach et al. [2012c] 76.0
Yahoo Img Expanded attr. Rohrbach et al. [2012c] 77.2
Combination Classifier fusion Rohrbach et al. [2012c] 75.9
Combination Expanded attr. Rohrbach et al. [2012c] 79.5
manual Lampert et al. [2009] Rohrbach et al. [2012c] 79.2
Test images: novel classes only / novel + known classes (Δ)
Object-Attribute Associations
Yahoo Img 71.0 73.2 (+2.2)
Classifier fusion 79.5 78.9 (-0.6)
manual 79.2 79.4 (+0.2)
Direct Similarity
Yahoo Img 79.9 76.4 (-2.5)
Classifier fusion 75.9 72.3 (-3.6)
Effect of adding images from known classes to the test set as distractors/negatives.
(a) Attribute-based zero-shot recognition. (b) Attributes versus direct-similarity, reported in Rohrbach et al. [2012c].
Table 1: Zero-shot recognition on the AwA dataset Lampert et al. [2009]. Results for different language resources used to mine associations. Trained on 92 images per class; mean area under the ROC curve (AUC) in %.

As an alternative to attributes, Rohrbach et al. [2010] also propose to directly transfer information from the most similar known classes, which does not require an intermediate level of attributes. While this achieves higher performance when the test set contains only novel objects, in the more adversarial setting, when the test set also contains images from the known categories, the direct-similarity-based approach drops significantly in performance, as can be seen in Table 1(b).

Approach/Language resource Top-5 error in %
leaf WordNet nodes Rohrbach et al. [2010] 72.8
inner WordNet nodes Rohrbach et al. [2010] 66.7
all WordNet nodes Rohrbach et al. [2010] 65.2
+ metric learning Mensink et al. [2012]   64.3
Part Attributes
Wikipedia Rohrbach et al. [2010] 80.9
Yahoo Holonyms Rohrbach et al. [2010] 77.3
Yahoo Image Rohrbach et al. [2010] 81.4
Yahoo Snippets Rohrbach et al. [2010] 76.2
all attributes Rohrbach et al. [2010] 70.3
Direct Similarity
Wikipedia Rohrbach et al. [2010] 75.6
Yahoo Web Rohrbach et al. [2010] 69.3
Yahoo Image Rohrbach et al. [2010] 72.0
Yahoo Snippets Rohrbach et al. [2010] 75.5
all measures Rohrbach et al. [2010] 66.6
Label embedding
DeViSe Frome et al. [2013]   68.2
Table 2: Large scale zero-shot recognition results. Flat error in % and hierarchical error in brackets. Note that Mensink et al. [2012], Frome et al. [2013] report on a different set of unseen classes than Rohrbach et al. [2010].

Rohrbach et al. [2011] extend zero-shot recognition from the 10 unseen categories in the AwA dataset to a setting of 200 unseen ImageNet Deng et al. [2009] categories. One of the main challenges in this setting is that no pre-defined attributes are available for this dataset. Rohrbach et al. propose to mine part attributes from WordNet Fellbaum [1998], as ImageNet categories correspond to WordNet synsets. Additionally, as the known and unknown classes are leaf nodes of the ImageNet hierarchy, inner nodes can be used to group leaf nodes, similar to attributes. Also, the closest known leaf-node categories can transfer information to the corresponding unseen leaf categories.

An alternative approach is DeViSE Frome et al. [2013], which learns an embedding into a semantic skip-gram word space Mikolov et al. [2013] trained on Wikipedia documents. Classification is achieved by projecting an image into the word space and taking the closest word as the label. Consequently, this also allows for zero-shot recognition.
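The classification step of such an embedding approach reduces to a nearest-neighbor search in the word space. A minimal sketch, assuming a tiny made-up 3-d "word space" (real skip-gram vectors have hundreds of dimensions, and the image-to-embedding projection is learned):

```python
# Nearest-word classification in a shared embedding space, in the spirit
# of DeViSE. The vectors below are invented 3-d stand-ins; a real system
# uses learned skip-gram word vectors and a learned image projection.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def nearest_label(image_embedding, word_vectors):
    """Return the word whose vector is closest (highest cosine
    similarity) to the projected image."""
    return max(word_vectors, key=lambda w: cosine(image_embedding, word_vectors[w]))

word_vectors = {
    "zebra": [0.9, 0.1, 0.2],
    "horse": [0.7, 0.3, 0.1],
    "car":   [0.0, 0.9, 0.8],
}
image_embedding = [0.85, 0.15, 0.25]  # hypothetical output of the projection
print(nearest_label(image_embedding, word_vectors))  # zebra
```

Because any word with a vector can serve as a label, unseen classes require no retraining, which is what enables zero-shot recognition here.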

Table 2 compares the different approaches. The hierarchical variant of Rohrbach et al. [2011] performs best, also compared to DeViSE Frome et al. [2013], which relies on more powerful CNN Krizhevsky et al. [2012] features. Further improvements can be achieved by metric learning Mensink et al. [2012]. As a different application, Mrowca et al. [2015] show how such hierarchical semantic knowledge allows improving large-scale object detection, not just classification. While the WordNet hierarchy is very reliable as it was manually created, the attributes are restricted to part attributes and the mining is not as reliable. To improve in this challenging setting, we discuss next how one can exploit instance similarity and a few labeled examples, if available.

Transferring knowledge from known categories to novel classes is challenging, as it is difficult to estimate visual properties of the novel classes. The approaches discussed in the previous section cannot exploit instance similarity or few labeled instances, if available. The approach Propagated Semantic Transfer (PST) Rohrbach et al. [2013a] combines four ideas to jointly handle the challenging scenario of recognizing novel categories. First, PST transfers information from known to novel categories by incorporating external knowledge, such as linguistic or expert-specified information, e.g. by a mid-level layer of semantic attributes as discussed in Section 2.1. Second, PST exploits the manifold structure of novel classes, similar to unsupervised learning approaches Weber et al. [2000], Sivic et al. [2005]. More specifically, it adapts the graph-based label propagation algorithm Zhu et al. [2003], Zhou et al. [2004] – previously used only for semi-supervised learning Ebert et al. [2010] – to zero-shot and few-shot learning. In this transductive setting, information is propagated between instances of the novel classes to obtain more reliable recognition, as visualized with the red graph in Fig. 3. Third, PST improves the local neighborhood in such graph structures by replacing the raw feature-based representation with a semantic object- or attribute-based representation. And fourth, PST generalizes from zero- to few-shot learning by integrating labeled training examples as certain nodes in its graph-based propagation. Another positive aspect of PST is that attribute or category models do not have to be retrained if novel classes are added, which can be an important aspect, e.g. in a robotic scenario.

2.2 Propagated semantic transfer


Figure 3: Recognition of novel categories. The approach Propagated Semantic Transfer Rohrbach et al. [2013a] combines knowledge transferred via attributes from known classes (left) with few labeled examples in a graph (red lines) which is built according to instance similarity.

Fig. 4 shows results on the AwA Lampert et al. [2009] dataset. We note that in contrast to the previous section the classifiers are trained on all training examples, not only 92 per class. Fig. 4(a) shows zero-shot results, where no training examples are available for the novel, or in this case unseen, classes. The table compares PST propagating on a graph based on attribute-classifier similarity versus image-descriptor similarity and shows a clear benefit of the former. This variant also outperforms DAP and IAP Lampert et al. [2014] as well as Zero-Shot Learning Fu et al. [2014]. Next we compare PST in the few-shot setting, i.e. we add a few labeled examples per class. In Fig. 4(b) we compare PST to two label propagation (LP) baselines Ebert et al. [2010]. We first note that PST (red curves) moves seamlessly from zero-shot to few-shot, while traditional LP (blue and black curves) needs at least one training example. We first examine the three solid lines. The black curve is the best LP variant from Ebert et al. [2010] and uses a similarity based on image features. LP in combination with a similarity metric based on the attribute classifier scores (blue curve) allows transferring knowledge residing in the classifiers trained on the known classes and gives a significant improvement in performance. PST (red curve) additionally transfers labels from the known classes and improves further. The dashed lines in Fig. 4(b) provide results for automatically mined associations between attributes and classes from language resources. It is interesting to note that these automatically mined associations achieve performance very close to the manually defined associations (dashed vs. solid).
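The graph-based propagation at the heart of PST can be sketched as a simple iterative update in the spirit of Zhou et al. [2004]: each node's score is repeatedly mixed with the weighted scores of its neighbors. The graph, weights, and seed scores below are made up for illustration; a real system builds the graph from attribute-classifier similarity.

```python
# Minimal label propagation sketch (one novel class, 3 instances).
# W: row-normalized similarity matrix; Y: initial scores per node, e.g.
# attribute-based zero-shot predictions or 1.0 for labeled few-shot
# examples. All values here are invented.

def propagate(W, Y, alpha=0.5, iters=50):
    """Iterate F <- alpha * W F + (1 - alpha) * Y until (near) convergence."""
    n = len(Y)
    F = Y[:]
    for _ in range(iters):
        F = [alpha * sum(W[i][j] * F[j] for j in range(n)) + (1 - alpha) * Y[i]
             for i in range(n)]
    return F

# Node 0 carries a confident transfer score; node 2 is connected to
# node 0 only through node 1, yet still receives support via the graph.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
Y = [1.0, 0.0, 0.0]
scores = propagate(W, Y)
print(scores)  # node 2 ends up with a small positive score
```

With alpha=0 the update degenerates to the initial transfer scores; larger alpha trusts the instance-similarity graph more.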

Approach AUC Acc.
DAP [Lampert et al., 2014] 81.4 41.4
IAP [Lampert et al., 2014] 80.0 42.2
Zero-Shot Learning [Fu et al., 2014] n/a 41.3
PST Rohrbach et al. [2013a]
    on image descriptors 81.2 40.5
    on attributes 83.7 42.7
(a) Zero-Shot, in %. (b) Few-Shot
Figure 4: Zero-shot results on the AwA dataset. Predictions with attributes and manually defined associations. Adapted from Rohrbach et al. [2013a].
(a) Zero-Shot recognition (b) Few-Shot recognition
Figure 5: Results on 200 unseen classes of ImageNet. Adapted from Rohrbach et al. [2013a].

Fig. 5 shows results on the classification task with 200 unseen ImageNet categories. In Fig. 5(a) we compare PST to zero-shot recognition without propagation, as discussed in Section 2.1. For zero-shot recognition, PST (red bars) improves performance over zero-shot without propagation (black bars) for all language resources and transfer variants. Similar to the AwA dataset, PST also improves over the LP baseline for few-shot recognition (Fig. 5b). The missing LP baseline on raw features is due to the fact that for the large number of images and high-dimensional features the graph construction is very time- and memory-consuming, if not infeasible. In contrast, the attribute representation is very compact and thus computationally tractable even with a large number of images.

2.3 Composite activity recognition with attributes and script data


Figure 6: Recognizing composite activities using attributes and script data.

Understanding activities in visual and textual data is generally regarded as more challenging than understanding object categories, due to the limited training data, challenges in defining the extent of an activity, and the similarities between activities Regneri et al. [2013]. However, long-term composite activities can be decomposed into shorter fine-grained activities Rohrbach et al. [2012b]. Consider for example the composite cooking activity prepare scrambled egg, which can be decomposed into attributes of fine-grained activities (e.g. open, fry), ingredients (e.g. egg), and tools (e.g. pan, spatula). These attributes can then be shared and transferred across composite activities, as visualized in Fig. 6, using the same approaches as for objects and attributes discussed in the previous section. However, the representations, both on the visual and on the language side, have to change. Fine-grained activities and associated attributes are visually characterized by fine-grained body motions and low inter-class variability. In addition to holistic features Wang and Schmid [2013], one should consequently exploit human pose-based Rohrbach et al. [2012a] and hand-centric Senina et al. [2014] features. As the previously discussed language resources do not provide good associations between composite activities and their attributes, Rohrbach et al. [2012b] collected textual descriptions (script data) of these activities with AMT. From this script data, associations can be computed based either on frequency statistics or, more discriminatively, on term frequency times inverse document frequency (tfidf).
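Mining composite-attribute associations from script data by tfidf can be sketched as follows: treat the collected descriptions of each composite activity as one document, then score each word by its frequency in that document, discounted by how many documents it appears in. The two-activity script data below is an invented stand-in for real AMT collections.

```python
# tfidf association mining from script data (toy sketch; the script
# data below is invented, not the collected AMT corpus).
import math
from collections import Counter

def tfidf_associations(script_data):
    """script_data: {composite activity: list of words from its
    descriptions}. Returns {composite: {word: tfidf score}}."""
    n_docs = len(script_data)
    df = Counter()                      # document frequency per word
    for words in script_data.values():
        df.update(set(words))
    scores = {}
    for comp, words in script_data.items():
        tf = Counter(words)
        scores[comp] = {w: tf[w] / len(words) * math.log(n_docs / df[w])
                        for w in tf}
    return scores

script_data = {
    "scrambled egg": ["crack", "egg", "fry", "pan", "stir"],
    "fruit salad":   ["wash", "fruit", "cut", "fruit", "stir"],
}
scores = tfidf_associations(script_data)
# "egg" is specific to scrambled egg and gets a positive score;
# "stir" occurs in every document, so its tfidf is 0.
```

This is why tfidf is more discriminative than raw frequency: words shared by all composites, which raw counts would rank highly, are suppressed.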

Table 3 shows results on the MPII Cooking 2 dataset Rohrbach et al. [2015d]. Comparing the first column (holistic Dense Trajectory features Wang and Schmid [2013]) with the second shows the benefit of adding the more semantic hand-centric Senina et al. [2014] and pose-based Rohrbach et al. [2012a] features. Comparing line (1) with line (2) or (3) shows the benefit of representing composite activities with attributes, as this allows sharing across composite activities. The best performance, 57.4% mean AP, is achieved in line (6) when combining compositional attributes with the Propagated Semantic Transfer (PST) approach (see Section 2.2) and script data to determine associations between composites and attributes.

Attribute training on: All Disjoint
Composites Composites
Activity representation (per column): holistic Wang and Schmid [2013]; combined Wang and Schmid [2013], Senina et al. [2014], Rohrbach et al. [2012a]
With training data for composites
Without attributes
   (1) SVM 39.8 41.1 - -
Attributes on gt intervals
   (2) SVM 43.6 52.3 32.3 34.9
Attributes on automatic segmentation
   (3) SVM 49.0 56.9 35.7 34.8
   (4) NN 42.1 43.3 24.7 32.7
   (5) NN+Script data 35.0 40.4 18.0 21.9
   (6) PST+Script data 54.5 57.4 32.2 32.5
No training data for composites
Attributes on automatic segmentation
   (7) Script data 36.7 29.9 19.6 21.9
   (8) PST + Script data 36.6 43.8 21.1 19.3
Table 3: Composite cooking activity classification on MPII Cooking 2 Rohrbach et al. [2015d], mean AP in %. Top left quarter: fully supervised; right column: reduced attribute training data; bottom section: no composite cooking activity training data; bottom right quarter: true zero-shot. Adapted from Rohrbach et al. [2015d].

3 Image and video description using compositional attributes

In this section we discuss how we can generate natural language sentences describing visual content, rather than just assigning labels to images and videos as discussed in the previous section. This intriguing task has recently received increased attention in the computer vision and computational linguistics communities Venugopalan et al. [2015b, c], Vinyals et al. [2015] and has a large number of potential applications, including human-robot interaction, image and video retrieval, and describing visual content for visually impaired people. In this section we focus on approaches which decouple visual recognition from sentence generation and introduce an intermediate semantic layer, which can be seen as a layer of attributes (Section 3.1). Introducing such a semantic layer has several advantages. First, it allows reasoning across sentences on a semantic level, which is, as we will see, beneficial for multi-sentence description of videos (Section 3.2). Second, we can show that learning reliable attributes leads to state-of-the-art sentence generation with high diversity in the challenging scenario of movie description (Section 3.3). Third, it leads to a compositional structure which allows describing novel concepts in images and videos (Section 3.4).

3.1 Translating image and video content to natural language descriptions

To address the problem of image and video description, Rohrbach et al. [2013b] propose a two-step translation approach which first predicts an intermediate semantic attribute layer and then learns how to translate from this semantic representation to natural sentences. Figure 7 gives an overview of this two-step approach for videos. First, a rich semantic representation of the visual content, including e.g. object and activity attributes, is predicted; a CRF models the relationships between the different attributes of the visual input. Second, the generation of natural language is formulated as a machine translation problem, using the semantic representation as the source language and the generated sentences as the target language. For this, a parallel corpus of videos, annotated semantic attributes, and textual descriptions makes it possible to adapt statistical machine translation (SMT) Koehn [2010] to translate between the two languages. Rohrbach et al. train and evaluate their approach on the videos of the MPII Cooking dataset Rohrbach et al. [2012a, b] and the aligned descriptions from the TACoS corpus Regneri et al. [2013]. According to automatic evaluation and human judgments, the two-step translation approach significantly outperforms retrieval and n-gram-based baseline approaches motivated by prior work. The approach can similarly be applied to the image description task; however, in both cases it requires an annotated semantic attribute representation. In Sections 3.3 and 3.4 we discuss how such attribute annotations can be extracted automatically from sentences. An alternative approach is presented by Fang et al. [2015], who mine visual concepts for image description by integrating multiple instance learning Maron and Lozano-Pérez [1998]. Similar to the work presented in the following, Wu et al. [2016] learn an intermediate attribute representation from the image descriptions; captions are then generated solely from this intermediate attribute representation.
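As a toy illustration of the two-step idea, the following sketch first maps classifier scores to a discrete semantic representation (SR) and then verbalizes it. The attribute vocabularies and the template are entirely hypothetical: the actual approach predicts the SR jointly with a CRF and generates sentences with a phrase-based SMT system trained on a parallel corpus, not a template.

```python
import numpy as np

# Toy attribute vocabularies for the semantic representation (SR); the real
# approach predicts these jointly with a CRF over video features.
ACTIVITIES = ["cut", "wash", "peel"]
OBJECTS = ["carrot", "cucumber", "knife"]

def predict_sr(activity_scores, object_scores):
    """Step 1: map per-attribute classifier scores to a discrete SR tuple."""
    return (ACTIVITIES[int(np.argmax(activity_scores))],
            OBJECTS[int(np.argmax(object_scores))])

def generate(sr):
    """Step 2: 'translate' the SR into a sentence.  A template stands in for
    the SMT system trained with the SR as source language."""
    activity, obj = sr
    return "The person %ss the %s." % (activity, obj)

sr = predict_sr(np.array([0.1, 0.2, 0.9]), np.array([0.8, 0.1, 0.1]))
print(generate(sr))  # -> "The person peels the carrot."
```

The point of the decoupling is that the SR can be supervised, inspected, and reasoned over independently of the surface realization.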

Figure 7: Video description. Overview of the two-step translation approach Rohrbach et al. [2013b] with an intermediate semantic layer of attributes (SR) for describing videos with natural language. From Rohrbach [2014].

3.2 Coherent multi-sentence video description with variable level of detail

Most approaches for automatic video description, including the one presented above, focus on generating single-sentence descriptions and are not able to vary the descriptions’ level of detail. One advantage of the two-step approach with an explicit intermediate layer of semantic attributes is that it allows reasoning on this semantic level. To generate coherent multi-sentence descriptions, Rohrbach et al. [2014] extend the two-step translation approach to model across-sentence consistency at the semantic level by enforcing a consistent topic, which is the prepared dish in the cooking scenario. To produce shorter or one-sentence summaries, Rohrbach et al. select the most relevant sentences on the semantic level using tf-idf (term frequency times inverse document frequency). For an example output on the TACoS Multi-Level corpus Rohrbach et al. [2014] see Figure 8. To perform multi-sentence description fully automatically, Rohrbach et al. propose a simple but effective method based on agglomerative clustering for automatic video segmentation. The most important component of good clustering is the similarity measure, and it turns out that the semantic attribute classifiers (see Fig. 7) are very well suited for this, in contrast to bag-of-words dense trajectories Wang et al. [2011]. This confirms the observation made in Section 2.2 that attribute classifiers seem to form a good space for distance computations.
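To illustrate why attribute classifier scores form a useful space for segmentation, here is a minimal single-linkage agglomerative clustering sketch over toy attribute score vectors. The data, the cosine distance, and the linkage choice are illustrative assumptions, not the exact setup of Rohrbach et al. [2014].

```python
import numpy as np

def agglomerative(segments, n_clusters):
    """Minimal single-linkage agglomerative clustering on attribute-classifier
    score vectors (one vector per video segment), using cosine distance."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    clusters = [[i] for i in range(len(segments))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cos_dist(segments[a], segments[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return clusters

# Toy attribute scores (activity/object classifier responses): the first two
# segments share one activity, the last two another.
segs = np.array([[0.9, 0.8, 0.1], [0.8, 0.9, 0.2],
                 [0.1, 0.2, 0.9], [0.2, 0.1, 0.8]])
print(agglomerative(segs, 2))  # -> [[0, 1], [2, 3]]
```

Segments with similar attribute responses end up in the same temporal cluster, which is exactly what makes the attribute space a good similarity measure for segmentation.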

To improve performance, Donahue et al. [2015] show that the second step, the SMT-based sentence generation, can be replaced with a deep recurrent network to better model visual uncertainty, while still relying on the multi-sentence reasoning on the semantic level. On the TACoS Multi-Level corpus this achieves 28.8% BLEU@4, compared to 26.9% for SMT with multi-sentence reasoning Rohrbach et al. [2014] and 24.9% for SMT without it Rohrbach et al. [2013b].

Figure 8: Coherent multi-sentence descriptions at three levels of detail, using automatic temporal segmentation. See Section 3.2 for details. From Rohrbach et al. [2014].

3.3 Describing movies with an intermediate layer of attributes

Two challenges arise when extending the idea presented above to movie description Rohrbach et al. [2015c], which addresses the problem of describing movies for blind people. First, and maybe more importantly, there are no annotated semantic attributes as for the kitchen data, and second, the data is visually more diverse and challenging. For the first challenge, Rohrbach et al. [2015c] propose to extract attribute labels from the descriptions using a semantic parsing approach and to train visual classifiers on them, building a semantic intermediate layer. To additionally accommodate the second challenge of increased visual difficulty, Rohrbach et al. [2015b] show how to improve the robustness of these attributes or “Visual Labels” in three steps. First, by distinguishing three semantic groups of labels (verbs, objects, and scenes) and using a corresponding feature representation for each: activity recognition with dense trajectories Wang and Schmid [2013], object detection with LSDA Hoffman et al. [2014], and scene classification with Places-CNN Zhou et al. [2014]. Second, by training each semantic group separately, which removes noisy negatives. And third, by selecting only the most reliable classifiers. While Rohrbach et al. use SMT for sentence generation in Rohrbach et al. [2015c], they rely on a recurrent network (LSTM) in Rohrbach et al. [2015b].
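The third step, retaining only reliable classifiers, can be sketched as a simple filter over held-out accuracies that preserves the per-group structure. The labels, accuracies, and threshold below are invented for illustration; the actual selection criterion in Rohrbach et al. [2015b] may differ.

```python
# Hypothetical validation accuracies for attribute ("visual label") classifiers,
# grouped by semantic type as in the Visual Labels approach.
classifiers = {
    "verb":   {"run": 0.71, "nod": 0.38, "drive": 0.66},
    "object": {"car": 0.80, "vodka": 0.35, "door": 0.62},
    "scene":  {"street": 0.77, "kitchen": 0.58},
}

def select_reliable(classifiers, threshold=0.5):
    """Keep only classifiers whose held-out accuracy exceeds `threshold`,
    preserving the per-group structure so each group can keep its own feature
    type (trajectories for verbs, detector scores for objects, Places-CNN
    for scenes)."""
    return {group: {label: acc for label, acc in labels.items()
                    if acc >= threshold}
            for group, labels in classifiers.items()}

print(select_reliable(classifiers))
```

Unreliable labels (here "nod" and "vodka") are dropped before sentence generation, so the language model is conditioned only on attributes the visual side can actually predict.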

SMT Rohrbach et al. [2015c]: Someone is a man, someone is a man.
S2VT Venugopalan et al. [2015a]: Someone looks at him, someone turns to someone.
Visual labels Rohrbach et al. [2015b]: Someone is standing in the crowd, a little man with a little smile.
Reference: Someone, back in elf guise, is trying to calm the kids.

SMT Rohrbach et al. [2015c]: The car is a water of the water.
S2VT Venugopalan et al. [2015a]: On the door, opens the door opens.
Visual labels Rohrbach et al. [2015b]: The fellowship are in the courtyard.
Reference: They cross the quadrangle below and run along the cloister.

SMT Rohrbach et al. [2015c]: Someone is down the door, someone is a back of the door, and someone is a door.
S2VT Venugopalan et al. [2015a]: Someone shakes his head and looks at someone.
Visual labels Rohrbach et al. [2015b]: Someone takes a drink and pours it into the water.
Reference: Someone grabs a vodka bottle standing open on the counter and liberally pours some on the hand.
Figure 9: Qualitative results on the MPII Movie Description (MPII-MD) dataset Rohrbach et al. [2015c]. The “Visual labels” approach Rohrbach et al. [2015b] which uses an intermediate layer of robust attributes, identifies activities, objects, and places better than related work. From Rohrbach et al. [2015b].

The Visual Labels approach outperforms prior work Venugopalan et al. [2015a], Rohrbach et al. [2015c], Yao et al. [2015] on the MPII-MD Rohrbach et al. [2015c] and M-VAD Torabi et al. [2015] datasets with respect to both automatic and human evaluation. Qualitative results are shown in Fig. 9. An interesting characteristic of the compared methods is the size of the output vocabulary: 94 for Rohrbach et al. [2015c], 86 for Venugopalan et al. [2015a] (an end-to-end LSTM approach without an intermediate semantic representation), and 605 for Rohrbach et al. [2015b]. Although the latter is still far below the 6,422 words of the human reference sentences, it clearly shows the higher diversity of the output of Rohrbach et al. [2015b].

3.4 Describing novel object categories


Figure 10: Describing novel object categories which are not contained in caption corpora (like otter). The Deep Compositional Captioner (DCC) Hendricks et al. [2016] uses an intermediate semantic attribute or “lexical” layer to connect classifiers learned on unpaired image datasets (ImageNet) with text corpora (e.g. Wikipedia). This allows it to compose descriptions about novel objects without any paired image-sentences training data. Adapted from Hendricks et al. [2015].

In this section we discuss how to describe novel object categories, which combines the challenges discussed for recognizing novel categories (Section 2) and generating descriptions (Section 3.1). State-of-the-art deep image and video captioning approaches (e.g. Vinyals et al. [2015], Mao et al. [2015], Donahue et al. [2015], Fang et al. [2015], Venugopalan et al. [2015b]) are limited to describing objects which appear in caption corpora such as MS COCO Chen et al. [2015], which consist of pairs of images and sentences. In contrast, labeled image datasets without sentence descriptions (e.g. ImageNet Deng et al. [2010]) and text-only corpora (e.g. Wikipedia) cover many more object categories.

Hendricks et al. [2016] propose the Deep Compositional Captioner (DCC) to exploit these vision-only and language-only unpaired data sources to describe novel categories, as visualized in Fig. 10. Similar to the attribute layer discussed in Section 3.1, Hendricks et al. extract words as labels from the descriptions to learn a “Lexical Layer”. The Lexical Layer is expanded with objects from ImageNet Deng et al. [2010]. To not only recognize but also generate descriptions about the novel objects, DCC transfers the word prediction model from the semantically closest known word in the Lexical Layer, where similarity is computed with Word2Vec Mikolov et al. [2013]. It is interesting to note that image captioning approaches such as Vinyals et al. [2015], Donahue et al. [2015] do use ImageNet data to (pre-)train their models (indicated with a dashed arrow in Fig. 10), but they do not make use of the semantic information, only of the learned representation.
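The word-transfer idea can be sketched as follows: for a novel word, find the closest known word in the embedding space and reuse its word-prediction weights. The embeddings and weight vectors below are made up for illustration; DCC's actual transfer operates on the parameters of its learned language and image-to-word models.

```python
import numpy as np

# Toy word embeddings (stand-ins for Word2Vec vectors).
emb = {"dog": np.array([1.0, 0.1]),
       "cat": np.array([0.9, 0.2]),
       "otter": np.array([0.8, 0.35])}

# Hypothetical word-prediction weights for words seen in paired image-sentence data.
weights = {"dog": np.array([0.5, -0.2, 0.1]),
           "cat": np.array([0.4, 0.3, -0.1])}

def transfer(novel_word, known_words):
    """Copy the word-prediction weights from the semantically closest known
    word, with similarity measured by cosine in the embedding space."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    closest = max(known_words, key=lambda w: cos(emb[novel_word], emb[w]))
    return closest, weights[closest].copy()

src, w = transfer("otter", ["dog", "cat"])
print(src)  # -> "cat": the known word whose weights are reused for "otter"
```

After the copy, the caption model can emit the novel word in contexts where it would have emitted the semantically closest known one.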

Figure 11: Qualitative results for describing novel ImageNet object categories. DCC Hendricks et al. [2016] compared to an ablation without transfer. X → Y: known word X is transferred to novel word Y. From Hendricks et al. [2015].

Fig. 11 shows several categories for which no captions exist at training time. With respect to quantitative measures, compared to a baseline without transfer, DCC improves METEOR from 18.2% to 19.1% and the F1 score, which measures whether the novel object word appears in the generated description, from 0 to 34.3%. Hendricks et al. also show similar results for video description.

4 Grounding text in images

(a) Without bounding box annotations at training or test time, GroundeR Rohrbach et al. [2015a] learns to ground free-form natural language phrases in images. (b) GroundeR reconstructs phrases by learning to attend to the right box at training time. (c) GroundeR localizes boxes at test time.
Figure 12: Unsupervised grounding by learning to associate visual and textual semantic units. From Rohrbach et al. [2015a].

In this section we discuss the problem of grounding natural language in images. Grounding in this case means that, given an image and a natural language sentence or phrase, we aim to localize the subset of the image which corresponds to the input phrase. For example, for the sentence “A little brown and white dog emerges from a yellow collapsible toy tunnel onto the lawn.” and the corresponding image in Fig. 12(a), we want to segment the sentence into phrases and locate the corresponding bounding boxes (or segments) in the image. While grounding has been addressed e.g. in Kong et al. [2014], Johnson et al. [2015], Barnard et al. [2003], Socher and Fei-Fei [2010], these works are restricted to a few categories. An exception is the work of Karpathy et al. Karpathy and Fei-Fei [2015], Karpathy et al. [2014], who aim to discover a latent alignment between phrases in text and bounding box proposals in the image. Karpathy et al. [2014] ground dependency-tree relations to image regions using multiple instance learning (MIL) and a ranking objective. Karpathy and Fei-Fei [2015] simplify the MIL objective to just the maximally scoring box and replace the dependency tree with a learned recurrent network. Unfortunately, these approaches have not been evaluated with respect to their grounding performance due to a lack of annotated datasets. Only recently were two datasets released: Flickr30k Entities Plummer et al. [2015] augments Flickr30k Young et al. [2014] with bounding boxes for all noun phrases present in the textual descriptions, and ReferItGame Kazemzadeh et al. [2014] provides localized referential expressions in images. Even more recently, at the time of writing, efforts are being made to also collect grounded referential expressions for the MS COCO Lin et al. [2014] dataset: the authors of ReferItGame are extending their annotations, and longer referential expressions have been collected by Mao et al. [2016]. Similar efforts are also made in the Visual Genome project Krishna et al. [2016], which provides densely annotated images with phrases.

In the following we focus on how to approach this problem, and the first question is: where is the best point of interaction between linguistic elements and visual elements? Following the approaches in Karpathy and Fei-Fei [2015], Karpathy et al. [2014], Plummer et al. [2015], a good way to do this is to decompose both sentence and image into concise semantic units or attributes which we can match to each other. For the data as shown in Figures 12(a) and 13, sentences can be split into phrases of typically a few words, and images are decomposed into a larger number of bounding box proposals Uijlings et al. [2013]. An alternative is to integrate phrase grounding into a fully convolutional network, for bounding box prediction Johnson et al. [2016] or segmentation prediction Hu et al. [2016]. In the following, we discuss approaches which focus on how to find the association between visual and linguistic components, rather than the actual segmentation into components. We first look at a setting that is unsupervised with respect to the grounding task, i.e. we assume that no bounding box annotations are available for training (Section 4.1), and then we show how to integrate supervision (Section 4.2). Section 4.3 discusses the results.

4.1 Unsupervised grounding

A man (red) walking by a sitting man (blue) on the street (green).
A white dog (red) is following a black dog (blue) along the beach (green).
Three people (red) on a walk down a cement path (blue) beside a field of wildflowers (green) with skyscrapers (magenta) in the background.
Figure 13: Qualitative results for GroundeR unsupervised Rohrbach et al. [2015a] on Flickr 30k Entities Plummer et al. [2015]. Compact textual semantic units (phrases, e.g. “a sitting man”) are associated with visual semantic units (bounding boxes). Best viewed in color.

Although many data sources contain images which are described with sentences or phrases, they typically do not provide the spatial localization of the phrases. This is true both for curated datasets such as MS COCO Lin et al. [2014] and for large user-generated content as e.g. in the YFCC 100M dataset Thomee et al. [2016]. Consequently, being able to learn from this data without grounding supervision would give access to a large amount and variety of training data. This setting is visualized in Fig. 12(a).

For this setting Rohrbach et al. [2015a] propose the approach GroundeR, which learns grounding by aiming to reconstruct a given phrase using an attention mechanism, as shown in Fig. 12(b). In more detail, given images paired with natural language phrases (or sentence descriptions), but without any bounding box information, we want to localize these phrases with a bounding box in the image (Fig. 12c). To do this, GroundeR learns to attend to a bounding box proposal and, based on the selected bounding box, reconstructs the phrase (Fig. 12b). Attention means that the model predicts a weighting over the bounding boxes and then takes the weighted average of the features from all boxes. A softmax over the weights encourages that only one or a few boxes receive high weights. As the second part of the model (Fig. 12b, bottom) can predict the correct phrase only if the first part of the model attended correctly (Fig. 12b, top), this can be learned without additional bounding box supervision. At test time we evaluate the grounding performance, i.e. whether the model assigned the highest weight to, and thus attended to, the correct bounding box. The model is able to learn these associations because its parameters are shared across all phrases and images. Thus, for a proper reconstruction, the visual semantic units and linguistic phrases have to match, i.e. the model learns what certain phrases mean in the image.
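A minimal numpy sketch of the attention step: score every box proposal against the encoded phrase, softmax the scores into weights, and form the weighted visual feature on which the reconstruction network would be conditioned. All features here are random stand-ins; the real model uses a learned phrase encoder, CNN box features, and a trained scoring network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(phrase_vec, box_feats, W):
    """Score each box proposal against the encoded phrase via a bilinear map,
    softmax the scores into attention weights, and return the
    attention-weighted visual feature used for phrase reconstruction."""
    scores = box_feats @ W @ phrase_vec   # one score per proposal
    alpha = softmax(scores)               # attention distribution over proposals
    attended = alpha @ box_feats          # weighted average of box features
    return alpha, attended

rng = np.random.default_rng(0)
phrase = rng.normal(size=4)               # encoded phrase, e.g. "a sitting man"
boxes = rng.normal(size=(10, 5))          # features of 10 box proposals
W = rng.normal(size=(5, 4))               # learned (here: random) bilinear map
alpha, attended = attend(phrase, boxes, W)
print(int(alpha.argmax()))                # box in which the phrase is grounded
```

At test time, grounding accuracy simply asks whether `alpha.argmax()` picks the ground-truth box.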

Compared approaches: SCRC Hu et al. [2015]; GroundeR semi-supervised (with 3.12% annot.) Rohrbach et al. [2015a]; GroundeR supervised Rohrbach et al. [2015a].
First image, referential expressions: anywhere but the people (red); first person in line (blue); group people center (green); very top left of whole image (magenta).
Second image, referential expressions: the street (red); tree to the far left (blue); top middle sky (green); white car far right bottom corner (magenta).
Figure 14: Qualitative grounding results on the ReferItGame dataset Kazemzadeh et al. [2014]. Different colors show different referential expressions for the same image. Best viewed in color.

4.2 Semi-supervised and fully supervised grounding

(a) Flickr 30k Entities dataset Plummer et al. [2015]

Approach                                        Accuracy
Unsupervised training
  GroundeR (VGG-CLS) Rohrbach et al. [2015a]    24.66
  GroundeR (VGG-DET) Rohrbach et al. [2015a]    32.42
Semi-supervised training, GroundeR (VGG-CLS) Rohrbach et al. [2015a]
  3.12% annotation                              33.02
  6.25% annotation                              37.10
  12.5% annotation                              38.67
Supervised training
  CCA embedding Plummer et al. [2015]           25.30
  SCRC (VGG+SPAT) Hu et al. [2015]              27.80
  GroundeR (VGG-CLS) Rohrbach et al. [2015a]    41.56
  GroundeR (VGG-DET) Rohrbach et al. [2015a]    47.70

(b) ReferItGame dataset Kazemzadeh et al. [2014]

Approach                                        Accuracy
Unsupervised training
  LRCN Donahue et al. [2015] (reported in Hu et al. [2015])          8.59
  CAFFE-7K Guadarrama et al. [2014] (reported in Hu et al. [2015])  10.38
  GroundeR (VGG+SPAT) Rohrbach et al. [2015a]   10.44
Semi-supervised training, GroundeR (VGG+SPAT) Rohrbach et al. [2015a]
  3.12% annotation                              15.03
  6.25% annotation                              19.53
  12.5% annotation                              21.65
Supervised training
  SCRC (VGG+SPAT) Hu et al. [2015]              17.93
  GroundeR (VGG+SPAT) Rohrbach et al. [2015a]   26.93
Table 4: Phrase grounding, accuracy in %. VGG-CLS: Pre-training the VGG network Simonyan and Zisserman [2015] for the visual representation on ImageNet classification data only. VGG-DET: VGG further fine-tuned for the object detection task on the PASCAL dataset Everingham et al. [2010] using Fast R-CNN Girshick [2015]. VGG+SPAT: VGG-CLS + spatial bounding box features (box location and size).

If grounding supervision (phrase-bounding box associations) is available, GroundeR Rohrbach et al. [2015a] can integrate it by adding a loss over the attention mechanism (Fig. 12b, “Attend”). Interestingly, this makes it possible to provide supervision for only a subset of the phrases (semi-supervised) or for all phrases (fully supervised).
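Conceptually, the combined objective can be sketched as below: the phrase-reconstruction loss is always present, and a cross-entropy term over the attention weights is added only for phrases that come with a ground-truth box. The loss values and weighting are placeholders, not the paper's exact formulation.

```python
import numpy as np

def grounding_loss(alpha, recon_nll, gt_box=None, weight=1.0):
    """GroundeR-style objective sketch.  `alpha` is the attention distribution
    over box proposals, `recon_nll` the phrase-reconstruction negative
    log-likelihood.  When a ground-truth box index is given (semi-/fully
    supervised phrases), a cross-entropy term over the attention is added."""
    loss = recon_nll
    if gt_box is not None:
        loss += weight * -np.log(alpha[gt_box] + 1e-12)
    return loss

alpha = np.array([0.7, 0.2, 0.1])   # attention over 3 proposals
print(grounding_loss(alpha, recon_nll=1.5))            # unsupervised phrase
print(grounding_loss(alpha, recon_nll=1.5, gt_box=0))  # supervised phrase
```

Because the supervised term is added per phrase, the same model trains seamlessly on any mix of annotated and unannotated phrases.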

For supervised grounding, Plummer et al. [2015] propose to learn a CCA embedding Gong et al. [2014] between phrases and the visual representation. The Spatial Context Recurrent ConvNet (SCRC) Hu et al. [2015] and the approach of Mao et al. [2016] use a caption generation framework to score phrases on a set of bounding box proposals. This allows ranking bounding box proposals for a given phrase or referential expression. Hu et al. [2015] show the benefit of transferring models trained on full-image description datasets, as well as of spatial (bounding box location and size) and full-image context features. Mao et al. [2016] show how to discriminatively train the caption generation framework to better distinguish different referential expressions.

4.3 Grounding results

In the following we discuss results on the Flickr 30k Entities dataset Plummer et al. [2015] and the ReferItGame dataset Kazemzadeh et al. [2014], which both provide ground truth alignments between noun phrases (within sentences) and bounding boxes. For the unsupervised models, the grounding annotations are only used at test time for evaluation, not for training. All approaches use the activations of the second-to-last layer of the VGG network Simonyan and Zisserman [2015] to encode the image inside the bounding boxes.

Table 4(a) compares the approaches quantitatively. The unsupervised variant of GroundeR nearly reaches the supervised performance of CCA Plummer et al. [2015] or SCRC Hu et al. [2015] on Flickr 30k Entities; successful examples are shown in Fig. 13. For the referential expressions of the ReferItGame dataset, the unsupervised variant of GroundeR reaches performance on par with prior work (Table 4b) and quickly gains performance when a few labeled training annotations are added (semi-supervised training). In the fully supervised setting, GroundeR improves significantly over the state of the art on both datasets, which is also reflected in the qualitative results shown in Fig. 14.

5 Visual question answering

Figure 15: To approach visual question answering, Andreas et al. [2016a] propose to dynamically create a deep network which is composed of different “modules” (colored boxes). These “modules” represent semantic units, i.e. attributes, which link linguistic units in the question with computational units to do the corresponding visual recognition. Adapted from Andreas et al. [2015].

Visual question answering is the problem of answering natural language questions about images, e.g. for the question “Where is the amber cat?” about the image shown in Fig. 15 we want to predict the corresponding answer on the floor, or just floor. This is a very interesting problem in several respects. On the one hand, it has many applications, such as visual search, human-robot interaction, and assisting blind people. On the other hand, it is also an interesting research direction as it requires relating textual and visual semantics. More specifically, it requires grounding the question in the image, e.g. by localizing the relevant part of the image (the amber cat in Fig. 15), and then recognizing and predicting an answer based on the question and the image content. Consequently, this problem requires more complex semantic interaction between language and visual recognition than the tasks in previous sections; specifically, it requires ideas from grounding (Section 4) and recognition (Section 2) or description (Section 3).

Approach       Y/N   Num   Other  All (test-dev)  All (test)
LSTM           78.7  36.6  28.1   49.8
ATT+LSTM       80.6  36.4  42.0   57.2
NMN            70.7  36.8  39.2   54.8
NMN+LSTM       81.2  35.2  43.3   58.0
NMN+LSTM+FT    81.2  38.0  44.0   58.6            58.7

LSTM: a question-only baseline
ATT: single find+describe for all questions
NMN: ablation w/o LSTM
NMN+LSTM: full model shown in Fig. 15
+FT: image features fine-tuned on captions Donahue et al. [2015]
Q: how many different lights in various different shapes and sizes? A: four (four)
Q: what color is the vase? A: green (green)
Q: is the bus full of passengers? A: no (no)
(a) Results from the evaluation server of Antol et al. [2015] in %. (b) Answers from Andreas et al. [2015] (ground truth answers in parentheses).
Figure 16: Results on the VQA dataset Antol et al. [2015]. Adapted from Andreas et al. [2015].

Most recent approaches to visual question answering learn a joint hidden embedding of the question and the image to predict the answer Malinowski et al. [2015], Ren et al. [2015], Gao et al. [2015], Antol et al. [2015], where all computation is shared and identical for all questions. An exception is proposed by Wu et al. [2016], who learn an intermediate attribute representation from image descriptions, similar to the work discussed in Sections 3.3 and 3.4. Interestingly, this intermediate layer of attributes makes it possible to query an external knowledge base for additional (textual) information not visible in the image. The embedded knowledge base information is combined with the attribute representation and the hidden representation of a caption-generation recurrent network (LSTM), and forms the input to an LSTM-based question-answer encoder-decoder Malinowski et al. [2015].

Andreas et al. [2016a] go one step further with respect to compositionality and propose to predict a compositional neural network structure from the question. As visualized in Fig. 15, the question “Where is the amber cat?” is decomposed into the network “modules” amber, cat, and, and where. These modules are semantic units, i.e. attributes, which connect the most relevant semantic components of the question (i.e. words or short phrases) with the corresponding computation to recognize them in the image. These Neural Module Networks (NMN) have different types of modules for different types of attributes; different types have different colors in Fig. 15. The find[amber] and find[cat] (green) modules take in CNN activations (VGG Simonyan and Zisserman [2015], last convolutional layer) and produce a spatial attention heatmap, while combine[and] (orange) combines two heatmaps into a single one, and describe[where] (blue) takes in a heatmap and CNN features to predict an answer. Note the distinction between different types, e.g. find versus describe, which perform different kinds of computation, and different instances, e.g. find[amber] versus find[cat], which learn different parameters. All parameters are initialized randomly and trained only from question-answer pairs. Interestingly, in this work attributes are not only distinguished with respect to their type, but are also composed with other attributes in a deep network whose parameters are learned end-to-end from examples, here question-answer pairs. In a follow-up work, Andreas et al. [2016b] learn not only the modules, but also which network structure is best from a set of parser proposals, using reinforcement learning.
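A toy sketch of the module composition for “Where is the amber cat?”: find modules yield spatial heatmaps, combine[and] intersects them, and describe[where] reads an answer off the attended location. The heatmaps and the answer lookup are invented for illustration; the actual modules are small learned networks over CNN activations.

```python
import numpy as np

# Toy 4x4 spatial grid; NMN modules operate on heatmaps of this kind.
def find(concept_map):
    """find[...] (green): produce an attention heatmap for one concept.
    In the real model this is computed from CNN activations."""
    return concept_map

def combine_and(h1, h2):
    """combine[and] (orange): intersect two heatmaps elementwise."""
    return h1 * h2

def describe_where(heatmap):
    """describe[where] (blue): predict an answer from the attended location;
    here just a toy lookup on the attended grid row."""
    row = np.unravel_index(heatmap.argmax(), heatmap.shape)[0]
    return "on the floor" if row >= 2 else "on the table"

amber = np.zeros((4, 4)); amber[3, 1] = 1.0; amber[0, 2] = 0.4
cat = np.zeros((4, 4)); cat[3, 1] = 0.9
# Network structure predicted from the question "Where is the amber cat?":
answer = describe_where(combine_and(find(amber), find(cat)))
print(answer)  # -> "on the floor"
```

The key point is that the same modules can be re-wired into different graphs for different questions, while their parameters are shared across all questions in which they appear.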

In addition to NMN, Andreas et al. [2016a, b] also incorporate a recurrent network (LSTM) to model common sense knowledge and dataset bias, which has been shown to be important for visual question answering Malinowski et al. [2015]. Quantitative results in Fig. 16(a) indicate that NMNs are indeed a powerful tool for question answering; a few qualitative results can be seen in Fig. 16(b).

6 Conclusions

In this chapter we presented several tasks and approaches where attributes enable a connection of visual recognition with natural language on a semantic level. For recognizing novel object categories or activities, attributes can build an intermediate representation which allows incorporating knowledge mined from language resources or script data (Section 2). For this scenario we saw that semantic attribute classifiers additionally build a good metric distance space, useful for constructing instance graphs and learning composite activity recognition models. In Section 3 we explained how an intermediate level of attributes can be used to describe videos with multiple sentences at a variable level of detail, and how it allows describing novel object categories. In Section 4 we presented approaches for unsupervised and supervised grounding of phrases in images. Different phrases are semantically overlapping, and the examined approaches try to relate these semantic units by jointly learning representations for the visual and language modalities. Section 5 discussed an approach to visual question answering which composes the most important attributes of a question in a compositional computation graph whose parameters are learned end-to-end, only by back-propagating from the answers.

While the discussed approaches take a step towards the challenges discussed in Section 1.1, many future steps lie ahead. While the approaches in Section 2 use many advanced semantic relatedness measures mined from diverse language resources, they are not jointly trained on textual and visual modalities. Regneri et al. [2013] and Silberer et al. [2013] take a step in this direction by looking at joint semantic representations of the textual and visual modalities. Section 3 presents compositional models for describing videos, but this is only a first step towards automatically describing a movie to a blind person as humans can do Rohrbach et al. [2015c], which will require an even higher degree of semantic understanding and transfer within and between modalities. Section 4 describes interesting ideas for grounding in images, and it will be interesting to see how these scale to the size of the Internet. Visual question answering (Section 5) is an interesting emerging direction with many challenges, as it requires solving all of the above, at least to some extent.

I would like to thank all my co-authors, especially those whose publications are presented in this chapter. Namely, Sikandar Amin, Jacob Andreas, Mykhaylo Andriluka, Trevor Darrell, Sandra Ebert, Jiashi Feng, Annemarie Friedrich, Iryna Gurevych, Lisa Anne Hendricks, Ronghang Hu, Dan Klein, Raymond Mooney, Manfred Pinkal, Wei Qiu, Michaela Regneri, Anna Rohrbach, Kate Saenko, Michael Stark, Bernt Schiele, György Szarvas, Stefan Thater, Ivan Titov, Subhashini Venugopalan, and Huazhe Xu. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).


  • Andreas et al. [2015] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. arXiv preprint arXiv:1511.02799, 2015.
  • Andreas et al. [2016a] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.
  • Andreas et al. [2016b] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2016b.
  • Antol et al. [2015] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
  • Barnard et al. [2003] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research (JMLR), 3:1107–1135, 2003.
  • Bart and Ullman [2005] E. Bart and S. Ullman. Single-example learning of novel classes using representation by similarity. In Proceedings of the British Machine Vision Conference (BMVC), 2005.
  • Chen et al. [2006] H.-H. Chen, M.-S. Lin, and Y.-C. Wei. Novel association measures using web search with double checking. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2006.
  • Chen et al. [2015] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • Deng et al. [2010] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In European Conference on Computer Vision (ECCV), 2010.
  • Dice [1945] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
  • Donahue et al. [2015] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Duan et al. [2012] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering Localized Attributes for Fine-grained Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • Ebert et al. [2010] S. Ebert, D. Larlus, and B. Schiele. Extracting Structures in Image Collections for Object Recognition. In European Conference on Computer Vision (ECCV), 2010.
  • Everingham et al. [2010] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
  • Fang et al. [2015] H. Fang, S. Gupta, F. N. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Farhadi et al. [2009] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • Farhadi et al. [2010] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • Farrell et al. [2011] R. Farrell, O. Oza, V. Morariu, T. Darrell, and L. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In International Conference on Computer Vision (ICCV), 2011.
  • Fellbaum [1998] C. Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
  • Frome et al. [2013] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In Conference on Neural Information Processing Systems (NIPS), 2013.
  • Fu et al. [2014] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Learning multimodal latent attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 36(2):303–316, 2014.
  • Gabrilovich and Markovitch [2007] E. Gabrilovich and S. Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2007.
  • Gao et al. [2015] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? dataset and methods for multilingual image question answering. In Conference on Neural Information Processing Systems (NIPS), 2015.
  • Girshick [2015] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
  • Gong et al. [2014] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In European Conference on Computer Vision (ECCV), 2014.
  • Guadarrama et al. [2014] S. Guadarrama, E. Rodner, K. Saenko, N. Zhang, R. Farrell, J. Donahue, and T. Darrell. Open-vocabulary object retrieval. In Robotics: Science and Systems, 2014.
  • He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In International Conference on Computer Vision (ICCV), 2015.
  • Hendricks et al. [2015] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. arXiv preprint arXiv:1511.05284v1, 2015.
  • Hendricks et al. [2016] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Hoffman et al. [2014] J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Conference on Neural Information Processing Systems (NIPS), 2014.
  • Hu et al. [2015] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Hu et al. [2016] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. arXiv preprint arXiv:1603.06180, 2016.
  • Johnson et al. [2015] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Johnson et al. [2016] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Karpathy and Fei-Fei [2015] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Karpathy et al. [2014] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Conference on Neural Information Processing Systems (NIPS), 2014.
  • Kazemzadeh et al. [2014] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • Koehn [2010] P. Koehn. Statistical Machine Translation. Cambridge University Press, 2010.
  • Kong et al. [2014] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? text-to-image coreference. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Krishna et al. [2016] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NIPS), 2012.
  • Lampert et al. [2009] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • Lampert et al. [2014] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 36(3):453–465, 2014.
  • Liang et al. [2013] C. Liang, C. Xu, J. Cheng, W. Min, and H. Lu. Script-to-movie: A computational framework for story movie composition. IEEE Transactions on Multimedia, 15(2):401–414, 2013.
  • Lin [1998] D. Lin. An information-theoretic definition of similarity. In International Conference on Machine Learning (ICML), 1998.
  • Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
  • Malinowski and Fritz [2014] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Conference on Neural Information Processing Systems (NIPS), 2014.
  • Malinowski et al. [2015] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In International Conference on Computer Vision (ICCV), 2015.
  • Mao et al. [2015] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In International Conference on Learning Representations (ICLR), 2015.
  • Mao et al. [2016] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Maron and Lozano-Pérez [1998] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. Conference on Neural Information Processing Systems (NIPS), 1998.
  • Mensink et al. [2012] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost. In European Conference on Computer Vision (ECCV), 2012.
  • Mihalcea and Moldovan [1999] R. Mihalcea and D. I. Moldovan. A method for word sense disambiguation of unrestricted text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 1999.
  • Mikolov et al. [2013] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Conference on Neural Information Processing Systems (NIPS), 2013.
  • Moses et al. [1996] Y. Moses, S. Ullman, and S. Edelman. Generalization to novel images in upright and inverted faces. Perception, 25:443–461, 1996.
  • Mrowca et al. [2015] D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko, and T. Darrell. Spatial semantic regularisation for large scale object detection. In International Conference on Computer Vision (ICCV), 2015.
  • Palatucci et al. [2009] M. Palatucci, D. Pomerleau, G. Hinton, and T. Mitchell. Zero-shot learning with semantic output codes. In Conference on Neural Information Processing Systems (NIPS), 2009.
  • Parikh and Grauman [2011] D. Parikh and K. Grauman. Relative attributes. In International Conference on Computer Vision (ICCV), 2011.
  • Plummer et al. [2015] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In International Conference on Computer Vision (ICCV), 2015.
  • Raina et al. [2007] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: Transfer learning from unlabeled data. In International Conference on Machine Learning (ICML), 2007.
  • Regneri et al. [2013] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. Grounding Action Descriptions in Videos. Transactions of the Association for Computational Linguistics (TACL), 2013.
  • Ren et al. [2015] M. Ren, R. Kiros, and R. Zemel. Image question answering: A visual semantic embedding model and a new dataset. In Conference on Neural Information Processing Systems (NIPS), 2015.
  • Rohrbach et al. [2014] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2014.
  • Rohrbach et al. [2015a] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. arXiv preprint arXiv:1511.03745, 2015a.
  • Rohrbach et al. [2015b] A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2015b.
  • Rohrbach et al. [2015c] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015c.
  • Rohrbach [2014] M. Rohrbach. Combining visual recognition and computational linguistics: linguistic knowledge for visual recognition and natural language descriptions of visual content. PhD thesis, Saarland University, 2014.
  • Rohrbach et al. [2010] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps Where - and Why? Semantic Relatedness for Knowledge Transfer. In Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • Rohrbach et al. [2011] M. Rohrbach, M. Stark, and B. Schiele. Evaluating Knowledge Transfer and Zero-Shot Learning in a Large-Scale Setting. In Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • Rohrbach et al. [2012a] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012a.
  • Rohrbach et al. [2012b] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele. Script data for attribute-based recognition of composite activities. In European Conference on Computer Vision (ECCV), 2012b.
  • Rohrbach et al. [2012c] M. Rohrbach, M. Stark, G. Szarvas, and B. Schiele. Combining language sources and robust semantic relatedness for attribute-based knowledge transfer. In Proceedings of the European Conference on Computer Vision Workshops (ECCV Workshops), volume 6553 of LNCS, 2012c.
  • Rohrbach et al. [2013a] M. Rohrbach, S. Ebert, and B. Schiele. Transfer Learning in a Transductive Setting. In Conference on Neural Information Processing Systems (NIPS), 2013a.
  • Rohrbach et al. [2013b] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In International Conference on Computer Vision (ICCV), 2013b.
  • Rohrbach et al. [2015d] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. International Journal of Computer Vision (IJCV), 2015d.
  • Senina et al. [2014] A. Senina, M. Rohrbach, W. Qiu, A. Friedrich, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. arXiv preprint arXiv:1403.6173, 2014.
  • Silberer et al. [2013] C. Silberer, V. Ferrari, and M. Lapata. Models of semantic representation with visual attributes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2013.
  • Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • Sivic et al. [2005] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. In International Conference on Computer Vision (ICCV), 2005.
  • Socher and Fei-Fei [2010] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • Sørensen [1948] T. Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr., 5:1–34, 1948.
  • Szegedy et al. [2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Thomee et al. [2016] B. Thomee, B. Elizalde, D. A. Shamma, K. Ni, G. Friedland, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
  • Thrun [1996] S. Thrun. Is learning the n-th thing any easier than learning the first? In Conference on Neural Information Processing Systems (NIPS), 1996.
  • Torabi et al. [2015] A. Torabi, C. Pal, H. Larochelle, and A. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070v1, 2015.
  • Uijlings et al. [2013] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.
  • Venugopalan et al. [2015a] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. arXiv preprint arXiv:1505.00487v2, 2015a.
  • Venugopalan et al. [2015b] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. In International Conference on Computer Vision (ICCV), 2015b.
  • Venugopalan et al. [2015c] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2015c.
  • Vinyals et al. [2015] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Wang and Schmid [2013] H. Wang and C. Schmid. Action recognition with improved trajectories. In International Conference on Computer Vision (ICCV), 2013.
  • Wang et al. [2011] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • Weber et al. [2000] M. Weber, M. Welling, and P. Perona. Towards automatic discovery of object categories. In Conference on Computer Vision and Pattern Recognition (CVPR), 2000.
  • Wu et al. [2016] Q. Wu, C. Shen, A. v. d. Hengel, P. Wang, and A. Dick. Image captioning and visual question answering based on attributes and their related external knowledge. arXiv preprint arXiv:1603.02814, 2016.
  • Yao et al. [2015] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. arXiv preprint arXiv:1502.08029v4, 2015.
  • Young et al. [2014] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL), 2:67–78, 2014.
  • Zesch and Gurevych [2010] T. Zesch and I. Gurevych. Wisdom of crowds versus wisdom of linguists - measuring the semantic relatedness of words. Natural Language Engineering, 16(1):25–59, 2010.
  • Zhou et al. [2014] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. In Conference on Neural Information Processing Systems (NIPS), 2014.
  • Zhou et al. [2004] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with Local and Global Consistency. In Conference on Neural Information Processing Systems (NIPS), 2004.
  • Zhu et al. [2003] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In International Conference on Machine Learning (ICML), 2003.
  • Zitnick et al. [2013] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. In International Conference on Computer Vision (ICCV), 2013.