Context-Aware Zero-Shot Learning for Object Recognition

by Éloi Zablocki, et al.

Zero-Shot Learning (ZSL) aims at classifying unlabeled objects by leveraging auxiliary knowledge, such as semantic representations. A limitation of previous approaches is that only intrinsic properties of objects, e.g. their visual appearance, are taken into account while their context, e.g. the surrounding objects in the image, is ignored. Following the intuitive principle that objects tend to be found in certain contexts but not others, we propose a new and challenging approach, context-aware ZSL, that leverages semantic representations in a new way to model the conditional likelihood of an object to appear in a given context. Finally, through extensive experiments conducted on Visual Genome, we show that contextual information can substantially improve the standard ZSL approach and is robust to unbalanced classes.



1 Introduction

Traditional Computer Vision models, such as Convolutional Neural Networks (CNNs), are designed to classify images into a set of predefined classes. Their performance has kept improving in the last decade, notably on object recognition benchmarks such as ImageNet (DBLP:conf/cvpr/DengDSLL009), where state-of-the-art models (DBLP:journals/corr/ZophVSL17; DBLP:journals/corr/abs-1802-01548) have outmatched humans. However, training such models requires hundreds of manually labeled instances for each class, which is a tedious and costly acquisition process. Moreover, these models cannot replicate humans’ capacity to generalize and to recognize objects they have never seen before. As a response to these limitations, Zero-Shot Learning (ZSL) has emerged as an important research field in the last decade (DBLP:conf/cvpr/FarhadiEHF09; DBLP:conf/eccv/MensinkVPC12; DBLP:journals/corr/FuYHXG15; DBLP:conf/cvpr/KodirovXG17). In the object recognition field, ZSL aims at labeling an instance of a class for which no supervised data is available, by using knowledge acquired from another disjoint set of classes, for which corresponding visual instances are provided. In the literature, these sets of classes are respectively called target and source domains, terms borrowed from the transfer learning community. Generalization from the source to the target domain is achieved using auxiliary knowledge that semantically relates classes of both domains, e.g. attributes or textual representations of the class labels.

Previous ZSL approaches only focus on intrinsic properties of objects, e.g. their visual appearance, by means of handcrafted features such as shape, texture, or color (DBLP:journals/pami/LampertNH14), or distributed representations learned from text corpora (DBLP:journals/pami/AkataPHS16; DBLP:conf/cvpr/LongLSSDH17). The underlying hypothesis is that the identification of entities of the target domain is made possible by the implicit principle of compositionality (a.k.a. Frege’s principle (frege)), according to which an object is formed by the composition of its attributes and characteristics, and by the fact that other entities of the source domain share the same attributes. For example, if textual resources state that an apple is round and that it can be red or green, this knowledge can be used to identify apples in images because these characteristics (‘round’, ‘red’) could be shared by classes of the source domain (e.g., ‘round’ like a ball, ‘red’ like a strawberry).

We believe that visual context, i.e. the other entities surrounding an object, also explains humans’ ability to recognize an object that has never been seen before. This assumption relies on the fact that scenes are compositional, in the sense that they are formed by the composition of the objects they contain. Some works in Computer Vision have exploited visual context to refine the predictions of classification (Mensink2014COSTACS) or detection (DBLP:conf/cvpr/BellZBG16) models. To the best of our knowledge, context has not been exploited in ZSL because it is impossible to directly estimate, from visual data only, the likelihood of a context for objects from the target domain. However, textual resources can be used to provide insights on the possible visual context in which an object is expected to appear. To illustrate this, knowing from language that an apple is likely to be found hanging on a tree or in the hand of someone eating it can be very helpful to identify apples in images. In this paper, our goal is to leverage visual context as an additional source of knowledge for ZSL, by exploiting the distributed word representations of the object class labels. More precisely, we adopt a probabilistic framework in which the probability to recognize a given object is split into three components: (1) a visual component based on its visual appearance (which can be derived from any traditional ZSL approach), (2) a contextual component exploiting its visual context, and (3) a prior component, which estimates the frequency of objects in the dataset. As a complementary contribution, we show that separating prior information in a dedicated component, along with simple yet effective sampling strategies, leads to a more interpretable model, able to deal with imbalanced datasets. Finally, as traditional ZSL datasets lack contextual information, we design a new dedicated setup based on the richly annotated Visual Genome dataset (visualgenome). We conduct extensive experiments to thoroughly study the impact of contextual information.

2 Related work

Zero-shot learning

While state-of-the-art image classification models (DBLP:journals/corr/ZophVSL17; DBLP:journals/corr/abs-1802-01548) restrict their predictions to a finite set of predefined classes, ZSL bypasses this important limitation by transferring knowledge acquired from seen classes (source domain) to unseen classes (target domain). Generalization is made possible through the medium of a common semantic space where all classes from both source and target domains are represented by vectors called semantic representations.

Historically, the first semantic representations that were used were handcrafted attributes (DBLP:conf/cvpr/FarhadiEHF09; DBLP:conf/cvpr/ParikhG11; DBLP:conf/eccv/MensinkVPC12; DBLP:journals/pami/LampertNH14). In these works, the attributes of a given image are determined and the class with the most similar attributes is predicted. Most methods represent class labels with binary vectors of visual features (e.g., ‘IsBlack’, ‘HasClaws’) (DBLP:conf/cvpr/LampertNH09; DBLP:conf/cvpr/LiuKS11; DBLP:journals/pami/FuHXG14; DBLP:journals/pami/LampertNH14). However, attribute-based methods do not scale efficiently since the attribute ontology is often domain-specific and has to be built manually. To cope with this limitation, more recent ZSL works rely on distributed semantic representations learned from textual datasets such as Wikipedia, using Distributional Semantic Models (mikolov2013; glove; elmo). These models are based on the distributional hypothesis (harris1954distributional), which states that textual items with similar contexts in text corpora tend to have similar meanings. This is of particular interest in ZSL: all object classes (from both source and target domains) are embedded into the same continuous vector space based on their textual context, which is a rich source of semantic information. Some models directly aggregate textual representations of class labels and the predictions of a CNN (DBLP:journals/corr/NorouziMBSSFCD13), whereas others learn a cross-modal mapping between image representations (given by a CNN) and pre-learned semantic embeddings (DBLP:conf/cvpr/AkataRWLS15; DBLP:conf/eccv/BucherHJ16). At inference, the predicted class of a given image is the nearest neighbor in the semantic embedding space. The cross-modal mapping is linear in most ZSL works (DBLP:conf/nips/PalatucciPHM09; DBLP:conf/icml/Romera-ParedesT15; DBLP:journals/pami/AkataPHS16; DBLP:conf/cvpr/QiaoLSH16); this is the case in the present paper.
Among these works, the DeViSE model (devise) uses a max-margin ranking objective to learn a cross-modal projection and fine-tune the lower layers of the CNN. Several models have built upon DeViSE with approaches that learn non-linear mappings between the visual and textual modalities (DBLP:conf/iccv/BaSFS15; DBLP:conf/cvpr/XianA0N0S16), or by using a common multimodal space to embed both images and object classes (DBLP:conf/cvpr/FuXKG15; DBLP:conf/cvpr/LongLSSDH17). In this paper, we extend DeViSE in two directions: by additionally leveraging visual context, and by reformulating it as a probabilistic model that allows coping with an imbalanced class distribution.

Figure 1: The goal is to find the class (in the target domain) of the object contained within the blue image region $\mathcal{V}$. Its context is formed of labeled objects from the source domain (red plain boxes) and of unlabeled objects from the target domain (red dashed boxes).

Visual context

The intuitive principle that some objects tend to be found in some contexts but not others is at the core of many works. In NLP, the visual context of objects can be used to build efficient word representations (zablockiaaai2018). In Computer Vision, it can be used to refine detection (DBLP:journals/pami/ChenSDHHY15; DBLP:journals/ijon/ChuC18) or segmentation (contextsegm2018) tasks.

Visual context can either be low-level (i.e. raw image pixels) or high-level (i.e. labeled objects). When visual context is exploited in the form of low-level information (DBLP:journals/ijcv/Torralba03; DBLP:journals/ijcv/WolfB06; DBLP:journals/cacm/TorralbaMF10), it often consists of global image features. For instance, in (DBLP:conf/cvpr/HeZC04), a Conditional Random Field is trained to combine low-level image features in order to assign a class to each pixel. In high-level approaches, the referential meaning of the context objects (i.e. class labels) is used. For example, DBLP:conf/iccv/RabinovichVGWB07 show that high-level context can be used at the post-processing level to reduce the ambiguities of a pre-learned object classification model, by leveraging co-occurrence patterns between objects that are computed from the training set. Moreover, the_role_of_context_selection study the role of context to classify objects: they investigate the importance of contextual indicators, such as object co-occurrence, relative scale and spatial relationships, and find that contextual information can sometimes be more informative than direct visual cues from objects. Spatial relations between objects can also be used in addition to co-occurrences, as in (DBLP:conf/cvpr/GalleguillosRB08; DBLP:conf/cvpr/ChenLFG18). In (DBLP:journals/corr/BengioDEILRSS13), co-occurrences are computed using external information collected from web documents. The model classifies all objects jointly; its inference method balances an image coherence term (given by an image classifier) against a semantic term (given by a co-occurrence matrix). However, the approach is fully supervised, and this setting cannot be applied to ZSL. The context-aware zero-shot learning task is related to graph generation tasks (DBLP:conf/cvpr/ZellersYTC18; DBLP:conf/eccv/YangLLBP18) and visual relationship detection (DBLP:conf/eccv/LuKBL16).

In conclusion, while many works in NLP and Computer Vision show the importance of visual context, its use in ZSL remains a challenge, which we propose to tackle in this paper.

3 Context-aware Zero-Shot Learning

Let $\mathcal{O}$ be the set of all object classes, divided into classes from the source domain $\mathcal{S}$ and classes from the target domain $\mathcal{T}$. The goal of our approach, context-aware ZSL, is to determine the class $i \in \mathcal{T}$ of an object contained in an image $I$, given its visual appearance $\mathcal{V}$ and its visual context $\mathcal{C}$. The image is annotated with bounding boxes, each containing an object. Given the zone $\mathcal{V}$, the context $\mathcal{C}$ consists of the surrounding objects in the image. Their classes can either belong to the source domain ($\mathcal{C} \cap \mathcal{S}$) or to the target domain ($\mathcal{C} \cap \mathcal{T}$). Note that the class of an object of $\mathcal{C} \cap \mathcal{T}$ is not accessible in ZSL; only its visual appearance is.

3.1 Model overview

We tackle this task by modeling the conditional probability $P(i \mid \mathcal{V}, \mathcal{C})$ of a class $i$ given both the visual appearance $\mathcal{V}$ and the visual context $\mathcal{C}$ of the object of interest. Given the absence of data in the target domain, we need to limit the complexity of the model for the sake of generalization. Accordingly, we suppose that $\mathcal{V}$ and $\mathcal{C}$ are conditionally independent given the class $i$; we show in the experiments (section 5) that this hypothesis is acceptable. By Bayes' rule, this hypothesis leads to the following expression:

$$P(i \mid \mathcal{V}, \mathcal{C}) \propto P(\mathcal{V} \mid i) \, P(\mathcal{C} \mid i) \, P(i) \qquad (1)$$

where each conditional probability expresses the probability of either the visual appearance $\mathcal{V}$ or the context $\mathcal{C}$ given class $i$, and $P(\cdot)$ denotes the prior distribution of the dataset. Each term of this equation is modeled separately.

The intuition behind our approach is illustrated in Figure 1, where the blue box contains the object of interest. Here, the class is apple, which belongs to the target domain $\mathcal{T}$. The visual component, which focuses on the zone $\mathcal{V}$, recognizes a tennis ball due to its yellow and round appearance; apple is ranked second. The prior component indicates that apple is slightly more frequent than tennis ball, but the frequency discrepancy may not be high enough to change the prediction of the visual component. In that case, the context component is discriminative: it ranks objects that are likely to be found in a kitchen, and reveals that an apple is far more likely to be found than a tennis ball in this context.
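The apple versus tennis ball example can be made concrete with a toy computation. The sketch below assumes the factorization "posterior proportional to visual likelihood times context likelihood times prior"; all numbers are invented for illustration and are not taken from the experiments.

```python
# Toy illustration of P(i | V, C) proportional to P(V | i) * P(C | i) * P(i).
# All probabilities below are invented for illustration only.

def posterior(visual, context, prior):
    """Multiply the three per-class terms, then normalize over classes."""
    unnorm = {c: visual[c] * context[c] * prior[c] for c in prior}
    total = sum(unnorm.values())
    return {c: s / total for c, s in unnorm.items()}

visual = {"apple": 0.4, "tennis ball": 0.6}    # the region looks more like a ball
context = {"apple": 0.9, "tennis ball": 0.1}   # a kitchen context favors the apple
prior = {"apple": 0.55, "tennis ball": 0.45}   # apple is slightly more frequent

post = posterior(visual, context, prior)
best = max(post, key=post.get)

# Without context (flat context term), the visual component still dominates.
flat = {c: 1.0 for c in prior}
no_context = posterior(visual, flat, prior)
best_no_context = max(no_context, key=no_context.get)
```

With these invented values, the prior alone does not overturn the visual prediction, while the context term does.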

Precisely modeling $P(\mathcal{C} \mid \cdot)$, $P(\mathcal{V} \mid \cdot)$ and $P(\cdot)$ is challenging due to the ZSL setting. Indeed, these distributions cannot be computed for classes of the target domain because of the absence of corresponding training data. Thus, to transfer the knowledge acquired from the source domain to the target domain, we use a common semantic space, namely Word2Vec (mikolov2013), where source and target class labels are embedded as vectors of $\mathbb{R}^d$, with $d$ the dimension of the space. It is worth noting that we propose to separately learn the prior class distribution $P(\cdot)$ with a ranking loss (in section 3.3). This allows dealing with imbalanced datasets, in contrast to ZSL models like DeViSE (devise). This intuition is experimentally validated in section 5.2.

3.2 Description of the model’s components

Due to both the ZSL setting and the variety of possible contexts and visual appearances of objects, it is not possible to directly estimate the different probabilities of equation 1. Hence, in what follows, we estimate quantities related to $P(\mathcal{C} \mid \cdot)$, $P(\mathcal{V} \mid \cdot)$ and $P(\cdot)$ using parametric energy functions (lecun2006tutorial). These quantities are learned separately, as described in section 3.3. Finally, we explain in section 3.4 how we combine them to produce the global probability $P(\cdot \mid \mathcal{V}, \mathcal{C})$.

Visual component

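The visual component models $P(\mathcal{V} \mid i)$ by computing the compatibility between the visual appearance $\mathcal{V}$ of the object of interest and the semantic representation $w_i$ of the class $i$.

Following previous ZSL works based on cross-modal projections (devise; DBLP:conf/eccv/BansalSSCD18), we introduce $f_{\theta_V}$, a parametric function mapping an image to the semantic space: $f_{\theta_V}(\mathcal{V}) = W_V \cdot \mathrm{CNN}(\mathcal{V}) + b_V \in \mathbb{R}^d$, where $\mathrm{CNN}(\mathcal{V})$ is a vector in $\mathbb{R}^{d_{visual}}$, output by a pretrained CNN truncated at the penultimate layer, $W_V$ is a projection matrix ($W_V \in \mathbb{R}^{d \times d_{visual}}$) and $b_V$ a bias vector (in our experiments, $d_{visual} = 2048$). The log-probability that the image region $\mathcal{V}$ corresponds to the class $i$ is set to be proportional to the cosine similarity between the projection $f_{\theta_V}(\mathcal{V})$ of $\mathcal{V}$ and the semantic representation $w_i$ of $i$:

$$\log P(\mathcal{V} \mid i; \theta_V) \propto \cos\big(f_{\theta_V}(\mathcal{V}), w_i\big) \qquad (2)$$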
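As a minimal sketch of the visual component, the snippet below projects a feature vector into the semantic space with an affine map and ranks classes by cosine similarity. The matrix `W`, the bias `b`, the feature vector and the class embeddings are tiny invented values; in the paper the features come from a pretrained CNN and the embeddings from Word2Vec.

```python
import math

def project(W, b, features):
    """Affine map f(x) = W x + b from the visual feature space to the semantic space."""
    return [sum(w * x for w, x in zip(row, features)) + bias
            for row, bias in zip(W, b)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    return dot / (norm_u * norm_v)

def rank_classes(W, b, features, class_vectors):
    """Sort class names by decreasing cosine similarity with the projected features."""
    z = project(W, b, features)
    scores = {name: cosine(z, w) for name, w in class_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy dimensions (assumed): 3 visual features, 2 semantic dimensions.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b = [0.0, 0.0]
class_vectors = {"apple": [1.0, 0.2], "tennis ball": [0.1, 1.0]}
ranking = rank_classes(W, b, [0.9, 0.1, 0.0], class_vectors)
```

Only `W` and `b` would be learned here; the feature extractor and the class embeddings stay fixed, which is what keeps the mapping linear.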


Context component

The context component models $P(\mathcal{C} \mid i)$ by computing a compatibility score between the visual context $\mathcal{C}$ and the semantic representation $w_i$ of class $i$. More precisely, the conditional probability is written:

$$\log P(\mathcal{C} \mid i; \theta_C) \propto f_{\theta_C}(h_{\mathcal{C}} \oplus w_i) \qquad (3)$$

where $h_{\mathcal{C}}$ is a vector representing the context, $\theta_C$ are parameters to learn, and $\oplus$ is the concatenation operator. To take non-linear and high-order interactions between $h_{\mathcal{C}}$ and $w_i$ into account, $f_{\theta_C}$ is modeled by a 2-layer Perceptron. We found that concatenating $h_{\mathcal{C}}$ with $w_i$ leads to better results than a cosine similarity, as done in equation 2 for the visual component.

To specify the modeling of $h_{\mathcal{C}}$, we propose various context models depending on which context objects are considered and how they are represented. Specifically, a context model is characterized by (a) the domain of context objects that are considered (i.e. source $\mathcal{S}$ or target $\mathcal{T}$) and (b) the way these objects are represented, either by a textual representation of their class label or by a visual representation of their image regions. Accordingly, we distinguish:
The low-level (L) approach, which computes a representation from the image region of a context object. This produces the context models $S_L$ and $T_L$, i.e. the sets of visual representations of the context objects from the source and target domains, respectively.

The high-level (H) approach, which considers semantic representations of the class labels of the context objects (only available for entities of the source domain). This produces the context models $S_H$ and $T_H$, i.e. the sets of word representations of the class labels of the context objects from the source and target domains, respectively.

Note that $T_H$ is not defined in the zero-shot setting, since class labels of objects from the target domain are unknown; yet it is used to define Oracle models (section 4.3).

These four basic sets of vectors can further be combined in various ways to form new context models (for instance: $S_L \cup T_L$, $S_H \cup T_L$, etc.). At last, $h_{\mathcal{C}}$ averages the representations of these vectors to build a global context representation. For example, $h_{S_H \cup T_L}$ equals:

$$h_{S_H \cup T_L} = \frac{1}{|S_H| + |T_L|} \left( \sum_{w \in S_H} w + \sum_{v \in T_L} v \right)$$

where $|\cdot|$ denotes the cardinality of a set of vectors.
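The construction of the context input can be sketched as follows: the vectors of the selected context objects are averaged into one context vector, which is then concatenated with the embedding of the candidate class before being scored. The helper names and the toy vectors are assumptions made for this sketch, and the scoring network itself is omitted.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def context_input(context_vectors, w_i):
    """Average the context vectors, then concatenate the result with the class embedding."""
    return mean_vector(context_vectors) + list(w_i)

# Invented 2-d vectors: two labeled context objects (word vectors) and one
# unlabeled context object (visual features already projected to the same space).
high_level = [[1.0, 0.0], [0.0, 1.0]]
low_level = [[0.5, 0.5]]
x = context_input(high_level + low_level, w_i=[0.2, 0.8])
```

Averaging keeps the input size fixed whatever the number of context objects, at the cost of discarding their spatial arrangement.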

Prior component

The goal of the prior component is to assess whether an entity is frequent or not in images. We estimate $P(i)$ from the semantic representation $w_i$ of class $i$:

$$\log P(i; \theta_P) \propto f_{\theta_P}(w_i) \qquad (4)$$

where $f_{\theta_P}$ is a 2-layer Perceptron that outputs a scalar.

3.3 Learning

In this section, we explain how we learn the energy functions of the context, visual and prior components. Each component (resp. context, visual, prior) of our model is assigned a training objective (resp. $\mathcal{L}_C$, $\mathcal{L}_V$, $\mathcal{L}_P$). As the components are independent by design, they are learned separately. This allows for a better generalization in the target domain, as shown experimentally (section 5.2). Besides, ensuring that some configurations are more likely than others motivates us to model each objective by a max-margin ranking loss, in which a positive configuration is assigned a lower energy than a negative one, following the learning to rank paradigm (wsabie). Unlike previous works (devise), which are generally based on balanced datasets such as ImageNet and thus are not concerned with prior information, we want to avoid any bias coming from the imbalance of the dataset in $\mathcal{L}_C$ and $\mathcal{L}_V$, and learn the prior separately with $\mathcal{L}_P$. In other terms, the visual (resp. context) component should focus exclusively on the visual appearance (resp. visual context) of objects. This is done with a careful sampling strategy of the negative examples within the ranking objectives, which we detail in the following. To the best of our knowledge, such a discussion of prior modeling in learning objectives, which is in our view paramount for imbalanced datasets such as Visual Genome, has not been conducted in previous research.

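Positive examples are sampled among entities of the source domain from the data distribution $P^*$: they consist of a single object for $\mathcal{L}_P$, an object/box pair for $\mathcal{L}_V$, and an object/context pair for $\mathcal{L}_C$. To sample negative examples from the source domain, we distinguish two ways:

(1) For the prior objective $\mathcal{L}_P$, negative object classes are sampled from the uniform distribution $U$:

$$\mathcal{L}_P = \mathbb{E}_{i \sim P^*} \, \mathbb{E}_{j \sim U} \left\lfloor \gamma_P - f_{\theta_P}(w_i) + f_{\theta_P}(w_j) \right\rfloor_+$$

where $f_{\theta_P}(w_i)$ is the scalar score of class $i$ (the negative of its energy), $\gamma_P$ is the margin and $\lfloor x \rfloor_+ = \max(0, x)$. Noting $\Delta_{ji} = f_{\theta_P}(w_j) - f_{\theta_P}(w_i)$, the contribution of two given objects $i$ and $j$ to this objective is, up to the constant factor of the uniform distribution:

$$P^*(i) \left\lfloor \gamma_P + \Delta_{ji} \right\rfloor_+ + P^*(j) \left\lfloor \gamma_P - \Delta_{ji} \right\rfloor_+$$

If $P^*(i) > P^*(j)$, i.e. when object class $i$ is more frequent than object class $j$, this term is minimized when $\Delta_{ji} = -\gamma_P$, i.e. $f_{\theta_P}(w_i) > f_{\theta_P}(w_j)$. Thus, $f_{\theta_P}$ captures prior information, as it learns to rank objects based on their frequency.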
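The claim that the prior objective ranks classes by frequency can be checked numerically. The sketch below assumes a hinge ranking loss in which a higher score means a more frequent class, and writes the pairwise contribution as p_i * max(0, gamma + delta) + p_j * max(0, gamma - delta), with delta the score of j minus the score of i. This form, the margin value and the frequencies are assumptions made for this illustration.

```python
def hinge(x):
    """Positive part max(0, x) used by max-margin ranking losses."""
    return max(0.0, x)

def pair_contribution(delta, p_i, p_j, gamma=1.0):
    """Contribution of the class pair (i, j), with delta = score(j) - score(i).

    The first term comes from i as positive and j as negative, the second
    from j as positive and i as negative (negatives drawn uniformly).
    """
    return p_i * hinge(gamma + delta) + p_j * hinge(gamma - delta)

# Scan delta on a grid and keep the value minimizing the contribution.
grid = [k / 100 for k in range(-300, 301)]
best_when_i_frequent = min(grid, key=lambda d: pair_contribution(d, p_i=0.7, p_j=0.3))
best_when_j_frequent = min(grid, key=lambda d: pair_contribution(d, p_i=0.3, p_j=0.7))
```

Under these assumptions the minimizer sits at minus the margin when i is the more frequent class and at plus the margin otherwise, so the more frequent class always ends up with the higher score.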

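(2) For the visual and context objectives, negative object classes are sampled from the prior distribution $P^*$:

$$\mathcal{L}_C = \mathbb{E}_{(i, \mathcal{C}) \sim P^*} \, \mathbb{E}_{j \sim P^*} \left\lfloor \gamma_C - f_{\theta_C}(h_{\mathcal{C}} \oplus w_i) + f_{\theta_C}(h_{\mathcal{C}} \oplus w_j) \right\rfloor_+$$

where $f_{\theta_C}(h_{\mathcal{C}} \oplus w_i)$ is the compatibility score between the context representation $h_{\mathcal{C}}$ and the class embedding $w_i$, and $\gamma_C$ is the margin; $\mathcal{L}_V$ is defined analogously with object/box pairs. Similarly, noting $\Delta_{ji} = f_{\theta_C}(h_{\mathcal{C}} \oplus w_j) - f_{\theta_C}(h_{\mathcal{C}} \oplus w_i)$, the contribution of two given objects $i$, $j$ and a context $\mathcal{C}$ to the objective $\mathcal{L}_C$ is:

$$P^*(i) \, P^*(j) \left[ P^*(\mathcal{C} \mid i) \left\lfloor \gamma_C + \Delta_{ji} \right\rfloor_+ + P^*(\mathcal{C} \mid j) \left\lfloor \gamma_C - \Delta_{ji} \right\rfloor_+ \right]$$

Minimizing this term does not depend on the relative order between $P^*(i)$ and $P^*(j)$; thus, $f_{\theta_C}$ does not take prior information into account. Moreover, $P^*(\mathcal{C} \mid i) > P^*(\mathcal{C} \mid j)$ implies that $f_{\theta_C}(h_{\mathcal{C}} \oplus w_i) > f_{\theta_C}(h_{\mathcal{C}} \oplus w_j)$.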

The alternative, as done in DeViSE (devise), is to sample negative classes uniformly in the source domain in the objective $\mathcal{L}_V$. Thus, if the prior is uniform, DeViSE directly models $P(\mathcal{V} \mid \cdot)$; otherwise, the objective cannot be analyzed straightforwardly. Besides, the contributions of visual and prior information are mixed. However, we show that learning the prior separately and forcing the context (resp. visual) component to focus exclusively on contextual (resp. visual) information is more efficient (section 5.2).
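The two negative sampling schemes discussed above differ only in the distribution from which negative classes are drawn. The sketch below contrasts them with Python's standard `random.choices`; the class names and frequencies are invented, and the function `sample_negatives` is a hypothetical helper, not part of any released code.

```python
import random

def sample_negatives(classes, k, rng, prior=None):
    """Draw k negative classes with replacement.

    Uniform sampling when `prior` is None (as for the prior objective, or in
    DeViSE); otherwise sampling proportional to class frequency (as for the
    visual and context objectives).
    """
    weights = None if prior is None else [prior[c] for c in classes]
    return rng.choices(classes, weights=weights, k=k)

classes = ["man", "tree", "kite"]
prior = {"man": 0.8, "tree": 0.15, "kite": 0.05}  # invented frequencies

rng = random.Random(0)
uniform_draws = sample_negatives(classes, 10000, rng)
prior_draws = sample_negatives(classes, 10000, rng, prior)
```

With frequency-proportional negatives, a frequent class appears as a negative about as often as it appears as a positive, which is the intuition for why the learned score no longer needs to encode frequency.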

3.4 Inference

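In this section, we detail the inference process. The goal is to combine the predictions of the individual components of the model to form the global probability distribution $P(\cdot \mid \mathcal{V}, \mathcal{C})$. In section 3.3, we detailed how to learn the functions of the visual, context and prior components, from which $P(\mathcal{V} \mid \cdot)$, $P(\mathcal{C} \mid \cdot)$ and $P(\cdot)$ are deduced respectively. However, the normalization constants in equations 2, 3 and 4, which depend on the object class in the general case, are unknown. As a simplifying hypothesis, we suppose that these normalization constants are scalars that we respectively note $\alpha_V$, $\alpha_C$ and $\alpha_P$. Writing $\tilde{P}$ for the unnormalized quantities given by the learned functions, this leads, up to an additive constant, to:

$$\log P(i \mid \mathcal{V}, \mathcal{C}) = \alpha_V \log \tilde{P}(\mathcal{V} \mid i) + \alpha_C \log \tilde{P}(\mathcal{C} \mid i) + \alpha_P \log \tilde{P}(i)$$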
To see whether this hypothesis is reasonable, we performed a post-hoc analysis of one of our models and plotted in Figure 2 the unnormalized log-probabilities of the three components for positive (red points) and negative (blue points) configurations of the test set of Visual Genome. We observe that positive and negative triplets are well separated, which empirically validates our initial hypothesis.
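The combination step of this section can be sketched as a weighted sum of per-component unnormalized log-scores, followed by a ranking of the classes. The sketch assumes one scalar weight per component; the weight values, the component names and the scores are invented, and in practice the weights would be tuned on a validation set.

```python
def combine(log_scores, weights):
    """Weighted sum of unnormalized log-scores.

    Only the components listed in `weights` are used, so dropping a
    component amounts to leaving its weight out.
    """
    classes = next(iter(log_scores.values())).keys()
    return {c: sum(w * log_scores[name][c] for name, w in weights.items())
            for c in classes}

def ranking(scores):
    """Class names sorted from most to least probable."""
    return sorted(scores, key=scores.get, reverse=True)

# Invented unnormalized log-scores for two candidate classes.
log_scores = {
    "visual": {"apple": 0.60, "tennis ball": 0.70},
    "context": {"apple": 0.90, "tennis ball": -0.50},
    "prior": {"apple": 0.10, "tennis ball": 0.05},
}

full = ranking(combine(log_scores, {"visual": 1.0, "context": 0.5, "prior": 0.2}))
no_context = ranking(combine(log_scores, {"visual": 1.0, "prior": 0.2}))
```

Because only the ranking matters at inference, additive constants and a common rescaling of all weights leave the prediction unchanged.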

Figure 2: 3D visualization of the unnormalized log-probabilities of each component (). Context model .

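The hyper-parameters weighting the components are selected on the validation set to compute the global probability. To build models that do not use a visual or contextual component, we simply select a subset of the probabilities and their respective hyper-parameters. For example, a model without context keeps only the visual and prior terms, each weighted by its own hyper-parameter.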
4 Experimental protocol

4.1 Data

To measure the role of context in ZSL, a dataset that presents annotated objects within a rich visual context is required. However, traditional ZSL datasets, such as AwA (zsldatasetawa), CUB-200 (zsldatasetbird) or LAD (zsldatasetlad), are made of images that each contain a single object, with little or no surrounding visual context. We instead use Visual Genome (visualgenome), a large-scale image dataset (108K images) annotated at a fine-grained level (3.8M object instances), covering various concepts (105K unique object names). This dataset is of particular interest for our work, as objects have richly annotated contexts (roughly 35 object instances per image on average, from the figures above). In order to shape the data to our task, we randomly split the set of images of Visual Genome into train/validation/test sets (70%/10%/20% of the total size). To build the set $\mathcal{O}$ of all object classes, we select classes which appear at least 10 times in Visual Genome and have an available Word2Vec representation. $\mathcal{O}$ contains 4842 object classes; it amounts to 3.4M object instances in the dataset. This dataset is highly imbalanced, as the 10% most represented classes amount to 84% of object instances. We define the level of supervision $p_{\mathrm{sup}}$ as the ratio of the size of the source domain over the total number of objects: $p_{\mathrm{sup}} = |\mathcal{S}| / |\mathcal{O}|$. For a given ratio, the source and target domains are built by randomly splitting $\mathcal{O}$ accordingly. Every object is annotated with a bounding box and we use this supervision in our model for entities of both source and target domains. To facilitate future work on context-aware ZSL, we publicly release data splits and annotations.
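The construction of the source and target domains for a given level of supervision can be sketched as a seeded random split of the class list. The function name, the seed and the placeholder class names are assumptions; the total of 4842 classes is the sum of the domain sizes reported in Table 1. The sketch ignores the constraint, stated in section 4.4, that ImageNet classes are placed in the source domain.

```python
import random

def split_classes(classes, p_sup, seed=0):
    """Randomly split classes into (source, target).

    The source domain receives round(p_sup * number of classes) classes.
    """
    shuffled = list(classes)
    random.Random(seed).shuffle(shuffled)
    n_source = round(p_sup * len(shuffled))
    return shuffled[:n_source], shuffled[n_source:]

# Placeholder class names standing in for the object vocabulary.
classes = [f"class_{k}" for k in range(4842)]

source_10, target_10 = split_classes(classes, p_sup=0.1)
source_50, target_50 = split_classes(classes, p_sup=0.5)
source_90, target_90 = split_classes(classes, p_sup=0.9)
```

With this rounding rule, the three levels of supervision give source domains of 484, 2421 and 4358 classes, matching the domain sizes listed in Table 1.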

4.2 Evaluation methodology and metrics

We adopt the conventional setting for ZSL, in which entities are retrieved only among the classes of the target domain $\mathcal{T}$. Besides, we also evaluate the ability of the model to retrieve entities of the source domain (with models tuned on the target domain).

The model’s prediction takes the form of a list of $n$ classes, sorted by probability; the rank of the correct class in that list is noted $r$. Depending on the setting, $n$ equals $|\mathcal{T}|$ or $|\mathcal{S}|$. We define the First Relevant (FR) metric with $\mathrm{FR} = \frac{2}{n-1}(r-1)$. To further evaluate the performance over the whole test set, the Mean First Relevant (MFR) metric is used (DBLP:journals/sigir/Fuhr17). It is computed by taking the mean value of FR scores obtained on each image of the test set. Note that the factor $\frac{2}{n-1}$ rescales the metric such that the MFR score of a random baseline is 100%, while the MFR of a perfect model would be 0%. The MFR metric has the advantage of being interval-scale-based, unlike more traditional Recall@$k$ or Mean Reciprocal Rank metrics (DBLP:conf/ictir/FerranteFP17), and thus can be averaged; this allows for meaningful comparison with a varying number of classes $n$.
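The metric can be sketched in a few lines. The snippet assumes the form FR = 2 (r - 1) / (n - 1), which is the affine rescaling of the rank r that gives 0% for a perfect model and 100% on average for a random ranking over n classes; the function names are chosen for this sketch.

```python
def first_relevant(rank, n):
    """Rescaled rank of the correct class: 0 when ranked first among n classes."""
    return 2.0 * (rank - 1) / (n - 1)

def mean_first_relevant(ranks, n):
    """MFR in percent: mean of the FR scores over all test examples."""
    return 100.0 * sum(first_relevant(r, n) for r in ranks) / len(ranks)

n = 5
perfect = mean_first_relevant([1, 1, 1], n)
# A random ranking places the correct class at each rank equally often.
random_like = mean_first_relevant([1, 2, 3, 4, 5], n)
worst = mean_first_relevant([5, 5], n)
```

Since the score is normalized by the number of candidate classes, results obtained with source and target domains of different sizes remain comparable.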

4.3 Scenarios and Baselines

Model scenarios

Model scenarios depend on the information that is used in the probabilistic setting: the context $\mathcal{C}$, the visual appearance $\mathcal{V}$, or both. When contextual information is involved, a context model is specified to represent $\mathcal{C}$. For clarity’s sake, we note our model M, parameterized by the information sources it uses; the full model combines them as explained in section 3.4.


To evaluate upper-limit performances for our models, we define Oracle baselines where classes of target objects are used, which is not allowed in the zero-shot setting. Note that every Oracle leverages visual information.
True Prior: This Oracle uses, for its prior component, the true prior distribution computed for all objects of both source and target domains on the full dataset, from the number $n_i$ of instances of the $i$-th class in images and the total number $N$ of images.
Visual Bayes: This Oracle uses the true prior distribution for its prior component as well. Its context component uses co-occurrence statistics between objects computed on the full dataset, based on the probability $p_{ij}$ that objects $i$ and $j$ co-occur in images, estimated from the number $n_{ij}$ of co-occurrences of $i$ and $j$.
Textual Bayes: Inspired by (DBLP:journals/corr/BengioDEILRSS13), this Oracle is similar to Visual Bayes, except that its prior and context components are based on textual co-occurrences instead of image co-occurrences: $p_{ij}$ is computed by counting co-occurrences of words $i$ and $j$ in windows of size 8 in the Wikipedia dataset, and the prior of the $i$-th class is computed as its number of instances divided by the total size of Wikipedia.
Semantic representations for all objects: M uses word embeddings of both source and target objects.

M): To study the validity of the hypothesis about the conditional independence of and , we introduce a baseline where we directly model . To do so, we replace, in the expression of (equation 6), by the concatenation of and projected in with a 2-layer Perceptron.
DeViSE: To evaluate the impact of our Bayesian model (equation 1) and our sampling strategy (section 3.3), we compare against DeViSE (devise). DeViSE is different from our visual model M because negative examples in its training objective are uniformly sampled, and the prior is not learned.
DeViSE: Similarly to the joint baseline M above, we define a baseline that does not rely on the conditional independence of $\mathcal{C}$ and $\mathcal{V}$, using the same sampling strategy as DeViSE.
M: To understand the importance of context supervision, i.e. annotations of context objects (boxes and classes), we design a baseline where no context annotations are used. The context is the whole image without the zone of the object, which is masked out. The associated context model is with ; is a parametric function to be learned. This baseline is inspired from (DBLP:journals/cacm/TorralbaMF10), where global image features are used to refine the prediction of an image model.

4.4 Implementation details

For each objective, at each iteration of the learning algorithm, 5 negative entities are sampled per positive example. Word representations are vectors of $\mathbb{R}^d$, learned with the Skip-Gram algorithm (mikolov2013) on Wikipedia. Image regions are cropped, rescaled to 299×299 pixels, and fed to an Inception-v3 CNN (inception), whose weights are kept fixed during training. This model is pretrained on ImageNet (imagenet). As a result, every ImageNet class that belongs to the total set of objects is included in the source domain. Models are trained with Adam (adam) and regularized with an L2 penalty; the weight of this penalty decreases when the level of supervision increases, as the model is less prone to overfitting. All hyper-parameters are cross-validated on classes of the target domain, on the validation set.

5 Results

                          Target domain            Source domain
Level of supervision      10%     50%     90%      10%     50%     90%
Domain size               4358    2421    484      484     2421    4358

Random                    100     100     100      100     100     100
M                         38.6    23.7    13.8     12.0    10.6    11.2
M                         20.5    10.7    6.0      1.5     2.6     3.6
M                         28.7    14.4    9.1      4.2     4.3     4.4
M                         18.1    9.0     5.2      1.1     1.9     2.4
Relative improvement (%)  11.6    16.4    12.1     23.7    27.3    31.5

Table 1: Evaluation of various information sources, with varying levels of supervision. MFR scores in %. The last row gives the relative improvement (in %) brought by context to the final model.

5.1 The importance of context

In this section, we evaluate the contribution of contextual information, with varying levels of supervision. We fix a simple context model and report MFR results with levels of supervision of 10%, 50% and 90% in Table 1 for every combination of information sources; we observe similar trends for the other context models. Results highlight that contextual knowledge acquired from the source domain can be transferred to the target domain, as the context-based model significantly outperforms the Random baseline. As expected, context is not as useful as visual information: the visual model obtains lower MFR scores, i.e. better performance, than the context-based model. However, Table 1 demonstrates that contextual and visual information are complementary: using both outperforms using either one alone. Interestingly, as the learned prior model is also able to generalize, we show that visual frequency can to some extent be learned from textual semantics, which extends previous work where word embeddings were shown to be a good predictor of textual frequency (frequence).

When the level of supervision increases, we observe that all models are better at retrieving objects of the target domain (i.e. MFR decreases), which is intuitive because models are trained on more data and thus generalize better to recognize entities from the target domain. Besides, when the level of supervision increases, the context is also more abundant. This explains: (1) the decreasing MFR values for the context-based model on the target domain, (2) the increasing relative improvement brought by context on the source domain. However, on the target domain, we note that this relative improvement does not increase monotonically with the level of supervision. A possible explanation is that the visual component improves faster than the context component, so the relative contribution brought by context to the final model decreases after a level of supervision of 50%. Since the highest relative improvement (in the target domain) is attained at 50%, we fix the standard level of supervision to 50% in the rest of the experiments; this amounts to 2421 classes in both source and target domains.

5.2 Modeling contextual information

Group        Model            Target   Source
Oracles      Textual Bayes    14.54    6.73
             M                7.57     2.53
             True Prior       4.92     2.63
             Visual Bayes     3.40     2.11
Baselines    DeViSE           10.73    3.62
             DeViSE           10.11    3.11
             M                10.07    1.85
             M                9.19     2.13
Our models   M                10.72    2.64
             M                9.01     2.05
             M                9.00     2.13
             M                8.96     1.92
             M                8.71     1.88
             M                8.60     1.93
             M                8.52     1.86
             M                8.31     1.79

Table 2: MFR performances (given in %) for all baselines and scenarios, on the target and source domains. Level of supervision: 50%. Oracle results, written in italics, are not taken into account to determine the best scores, written in bold.

In this section, we compare the different context models; results are reported in Table 2. First, the underlying hypotheses of our model are tested experimentally. (1) Modeling context and prior information with semantic representations (models M) is far more effective than using direct textual co-occurrences, as shown by the Textual Bayes baseline, which is the weakest model despite being an Oracle. (2) Moreover, we show that the hypothesis on the conditional independence of and is acceptable, as separately modeling and gives better results than jointly modeling them (i.e. M M). (3) Furthermore, we observe that our approach (M) is better than DeViSE at capturing the imbalanced class distribution of the source domain; indeed, True Prior M , whereas True Prior DeViSE on . Even if the improvement is only significant for the source domain , it indicates that using information sources separately is clearly a better basis for further integrating contextual information.

Second, as observed in the case of the context model (section 5.1), using contextual information is always beneficial. Indeed, all models with context M improve over M, which is the model with no contextual information, both on target and source domains. In more detail, we observe that performance increases when additional information is used: (1) when the bounding-box annotations are available: all of our models that use both and outperform the baseline M, which could also be explained by the useless noise outside the object boxes in the image and the difficulty of computing a global context from the raw image, (2) when context objects are labeled and high-level features are used instead of low-level features, e.g.  and , (3) when more context objects are considered (e.g. ), (4) when low-level information is used complementarily to high-level information (e.g. ). As a result, the best performance is attained for M, with a (resp. ) relative improvement in the target (resp. source) domain compared to M.

We note that there is still room for improvement to approach ground-truth distributions for objects of the target domain (e.g. towards word embeddings able to better capture visual context). Indeed, even if our models outperform True Prior and Visual Bayes on the source domain, these Oracle baselines are still better on the target domain, which shows that learning the visual context of objects from textual data is challenging.

5.3 Qualitative Experiments

Figure 3: Boxplot representing the distribution of the correct ranks (First Relevant in ) for five randomly selected classes of the target domain, with the context model . Below are listed, by order of frequency, the classes that co-occur the most with the object of interest (classes of in green; in red).

To gain a deeper understanding of contextual information, we compare in Figure 3 the predictions of M and the global model M. We randomly select five classes of the target domain and plot, for all instances of these classes in the test set of Visual Genome, the distribution of the predicted ranks of the correct class (in percentage); we also list the classes that appear the most in the context of these classes. We observe that, for certain classes (player, handle and field), contextual information helps to refine the predictions; for others (house and dirt), contextual information degrades the quality of the predictions.

First, we observe that visual context can guide the model towards a more precise prediction. For example, a player, without context, could be categorized as person, man or woman; but visual context provides important complementary information (e.g. helmet, baseball) that grounds person in a sport setting, and thus suggests that the person could be playing. Visual context is also particularly relevant when the object of interest has a generic shape. For example, handle, without context, is visually similar to many round objects; but the presence of objects like door or fridge in the context helps determine the nature of the object of interest.

To gain better insight into the role of context, we cherry-picked examples where the visual or the prior component is inaccurate and the context component is able to counterbalance the final prediction (Figure 4). In (i), for example, the visual component ranks flower at position 223. However, the context component assesses flower to be highly probable in this context, due to the presence of source objects like vase, water, stems or grass, but also target objects like the other flowers around. At the inference phase, probabilities are aggregated and flower is ranked first.

Figure 4: Qualitative examples where the global model M correctly retrieves the class ( classes only).
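The aggregation step described for example (i) can be illustrated with a minimal sketch. It assumes, as one plausible reading of the text, that the visual, context and prior components each output a strictly positive probability per class and that these are combined by a product (a sum of log-probabilities), which is what treating the information sources as independent would give. The function name `aggregate_and_rank` and all numbers below are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List


def aggregate_and_rank(components: Dict[str, Dict[str, float]]) -> List[str]:
    """Rank classes by the sum of per-component log-probabilities.

    Assumption: components are combined by a plain product of strictly
    positive probabilities; the paper's exact aggregation may differ.
    """
    classes = list(next(iter(components.values())))
    log_score = {
        c: sum(math.log(probs[c]) for probs in components.values())
        for c in classes
    }
    # Highest aggregated score first.
    return sorted(classes, key=lambda c: log_score[c], reverse=True)


# Hypothetical toy probabilities over three candidate classes.
toy = {
    "visual": {"flower": 0.05, "plant": 0.60, "tree": 0.35},
    "context": {"flower": 0.90, "plant": 0.05, "tree": 0.05},
    "prior": {"flower": 1 / 3, "plant": 1 / 3, "tree": 1 / 3},
}

visual_only = aggregate_and_rank({"visual": toy["visual"]})
combined = aggregate_and_rank(toy)
```

In this toy setting the visual component alone ranks flower last, while the combined score ranks it first because the context component strongly favours it.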

Our work is not without limitations. Indeed, some classes (such as house and dirt) have a wide range of possible contexts; in these cases, context is not a discriminating factor. This is confirmed by a complementary analysis: the Spearman correlation between the number of unique context objects and , the relative gain of M over M, is . In other words, contextual information is useful for specific objects, which appear in particular contexts; for objects that are too generic, adding contextual information can be a source of noise.
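The complementary analysis above relies on a Spearman rank correlation. A minimal, dependency-free sketch of that statistic is given here; the per-class values are hypothetical toy numbers chosen for illustration, not the paper's data, and the helper names are assumptions.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-class values: number of unique context objects and
# relative gain (in %) brought by context for the same classes.
n_unique_context = [12, 40, 150, 300, 800]
relative_gain = [25.0, 18.0, 20.0, -2.0, -10.0]

rho = spearman(n_unique_context, relative_gain)
```

A negative value of `rho` on real data would correspond to the reading given in the text: the more varied the contexts of a class, the less context helps.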

6 Conclusion

In this paper, we introduced a new approach for ZSL: context-aware ZSL, along with a corresponding model, using complementary contextual information that significantly improves predictions. Possible extensions could include spatial features of objects, and, more importantly, removing the dependence on the detection of object boxes to make it fully applicable to real-world images (e.g. by using a Region Proposal Network (faster)). Finally, designing grounded word embeddings that include more visual context information would also benefit such models.


This work is partially supported by the CHIST-ERA EU project MUSTER 1 (ANR-15-CHR2-0005) and the Labex SMART (ANR-11-LABX-65) supported by French state funds managed by the ANR within the Investissements d’Avenir program under reference ANR-11-IDEX-0004-02.


Appendix A Additional negative results

Figure 5: Qualitative analysis: negative examples where the use of the context leads to degraded predictions, i.e. examples where model M is worse than the simpler model M ( classes only).

As explained in Section 5.3, using contextual information can sometimes degrade predictions. We provide here additional examples, in which an object occurs in an environment where it is unexpected. For example, Figure 5 shows a picture of a kitchen where the object of interest to be predicted is “books”. Given only the surrounding environment, predicted objects are logically related to the environment of a kitchen (“freezer”, “oven”, …), and the correct label is badly ranked (because it is unexpected in such an environment). However, the model M retrieves the correct label, given only the region of interest. Finally, integrating contextual information in the final model M leads to worse performance than M.

Appendix B Generalized ZSL

In the previous sections, retrieval is done only among classes of the domain of interest; this is the classical zero-shot learning setting. We now report results obtained when both source and target object classes exist in the retrieval space: this setting amounts to generalized zero-shot learning. Results are reported in Table 3.

                          Target domain             Source domain
Supervision level         10%     50%     90%       10%     50%     90%
Domain size               4358    2421    484       484     2421    4358
Random                    100     100     100       100     100     100
M                         39.6    26.3    16.9      6.6     8.68    10.9
M                         21.0    11.8    6.9       0.9     2.3     3.5
M                         28.6    15.0    10.7      3.5     3.9     4.4
M                         18.2    9.4     6.0       0.8     1.8     2.4
Relative improvement      13.4    20.2    13.4      13.8    24.4    31.5
Table 3: Evaluation of various information sources, with varying levels of supervision. Generalized ZSL setting. MFR scores in %. The last row is the relative improvement (in %) of M over M.
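The difference between the classical and the generalized settings comes down to which classes are allowed in the retrieval space. The sketch below, with hypothetical class names and scores, computes the rank of the correct class first among target classes only and then among source and target classes together; `rank_within` is an illustrative helper, not a function from the paper.

```python
from typing import Dict, Set


def rank_within(scores: Dict[str, float], correct: str, candidates: Set[str]) -> int:
    """1-based rank of `correct` among `candidates`, by decreasing score."""
    better = sum(
        1 for c in candidates if c != correct and scores[c] > scores[correct]
    )
    return 1 + better


# Hypothetical model scores for one test object whose true class is "okapi".
scores = {"cat": 0.9, "dog": 0.8, "zebra": 0.5, "okapi": 0.3}
source_classes = {"cat", "dog"}      # seen during training
target_classes = {"zebra", "okapi"}  # unseen during training

classical_rank = rank_within(scores, "okapi", target_classes)
generalized_rank = rank_within(scores, "okapi", source_classes | target_classes)
```

Because the generalized retrieval space is a superset of the classical one, the rank of the correct class can only stay the same or get worse in the generalized setting.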

Appendix C MRR and top-k performances

ZSL models are usually evaluated with recall@k or MRR (mean reciprocal rank, i.e. harmonic mean). However, these metrics are not optimal to evaluate our models for two reasons:

  • Theoretically, recent research points out that RR is not an interval scale and thus MRR should not be used (Fuhr, Some Common Mistakes In IR Evaluation, And How They Can Be Avoided. SIGIR Forum 2017; Ferrante et al. Are IR evaluation measures on an interval scale? ICTIR 2017).

  • Practically, we make the size of the target domain vary (10%, 50%, 90%). MRR and top-k scores cannot be compared across these scenarios (e.g. top-5 among 100 entities is not comparable to top-5 among 1000).

Therefore, as explained in Section 4.2, we used MFR (mean first relevant): the arithmetic mean of rank numbers (linearly rescaled to have 100% for a random model and 0% for a perfect model). FR is an interval scale and thus can be averaged.

For completeness, we nevertheless report top-k and MRR scores in Table 4.

            Target domain                           Source domain
            Recall@1  Recall@5  Recall@10  MRR      Recall@1  Recall@5  Recall@10  MRR
Random      <.1       0.2       0.4        <.1      <.1       0.2       0.4        <.1
M           3.2       11.7      16.3       7.8      5.7       17.9      24.9       12.5
M           14.7      33.5      43.2       24.0     36.3      63.8      73.1       48.8
M           5.9       17.8      25.4       11.9     17.3      43.7      56.7       29.9
M           15.0      34.7      44.7       24.7     41.6      70.6      78.6       54.2
Table 4: Recall@k (k = 1, 5, 10) and MRR scores (both in percentage).
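The three metrics discussed in this appendix can be computed from the rank of the correct class for each test object. The following is a minimal sketch under stated assumptions: the MFR rescaling maps rank 1 to 0% and the expected rank of a uniformly random ranking, (N + 1) / 2, to 100%, which is one linear rescaling consistent with the description above but is not confirmed to be the exact formula used in the paper. Function names and the toy ranks are hypothetical.

```python
from typing import Sequence


def mfr(ranks: Sequence[int], n_classes: int) -> float:
    """Mean first relevant in %, assuming 0 = perfect, 100 = random on average."""
    mean_rank = sum(ranks) / len(ranks)
    random_mean_rank = (n_classes + 1) / 2  # expected rank of a random ranking
    return 100.0 * (mean_rank - 1) / (random_mean_rank - 1)


def mrr(ranks: Sequence[int]) -> float:
    """Mean reciprocal rank, in %."""
    return 100.0 * sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Share of test objects (in %) whose correct class is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def random_recall_at_k(k: int, n_classes: int) -> float:
    """Expected recall@k (in %) of a uniformly random ranking over n_classes."""
    return 100.0 * min(k, n_classes) / n_classes


# Hypothetical ranks of the correct class for three test objects, 9 classes.
toy_ranks = [1, 3, 5]
toy_mfr = mfr(toy_ranks, n_classes=9)
toy_mrr = mrr(toy_ranks)
toy_recall_at_3 = recall_at_k(toy_ranks, k=3)

# The random baseline for top-5 depends on the size of the retrieval space.
random_top5_small = random_recall_at_k(5, 100)
random_top5_large = random_recall_at_k(5, 1000)
```

The last two values illustrate the practical argument made above: the chance level of a top-5 score changes with the number of candidate classes, whereas the rescaled MFR keeps the same reference points for every domain size.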