A “black swan” was ironically used as a metaphor in the 16th century for an unlikely event because the western world had only seen white swans. Yet when the European settlers observed a black swan for the first time in Australia in 1697, they immediately knew what it was. This is because humans possess the ability to compose their knowledge of known entities to generalize to novel concepts. In the literature, this task is known as Compositional Zero-Shot Learning (CZSL) [redwine, aopp, tmn, symnet, cge], where the goal is to learn how to compose observed objects and their states and to generalize to novel state-object compositions.
The CZSL task is far more challenging than the standard Zero-Shot Learning (ZSL) task because a much larger number of compositions populates the search space. For example, the MIT states dataset [mitstates] contains 28175 possible compositions (115 states and 245 objects in total), with a test-time search space of 1662 compositions. In contrast, in standard ZSL benchmarks such as CUB [welinder2010cub] and AWA [lampert2013awa], the test-time search space contains 50 and 10 classes respectively. While standard benchmarks study CZSL in a closed space, assuming prior knowledge of the unseen compositions that might arise at test time, here we take a step further and analyze the more realistic Open World CZSL (OW-CZSL) setting, where we impose no constraint on the test-time search space. For instance, in OW-CZSL the test-time search space of the MIT states dataset contains all 28175 compositions. Addressing OW-CZSL requires building a discriminative representation space where seen and unseen compositions can be recognized despite the huge number of test compositions.
In this work, we tackle the OW-CZSL task with Compositional Cosine Graph Embeddings (Co-CGE), a graph-based approach for OW-CZSL. Co-CGE is based on two inductive biases. Our first inductive bias is a rich dependency structure of different states, objects and their compositions, e.g. learning the composition old dog is not only dependent on the state old and the object dog, but may also require being aware of other compositions like cute dog, old car, etc. We argue that such a dependency structure provides strong regularization while allowing the network to better generalize to novel compositions, and we model it through a compositional graph connecting states, objects and their compositions. Unlike previous works [symnet, aopp, tmn, redwine] that treat each state-object composition independently, our graph formulation allows the model to learn compositional embeddings that are globally consistent.
Our second inductive bias is the presence of distractors, i.e. less feasible compositions (e.g. ripe dog) that a model needs to either eliminate or isolate in the search space. For this purpose, we use similarities among primitive embeddings to assign a feasibility score to each unseen composition. We then use these scores as margins in a cosine-based cross-entropy loss, showing how the feasibility scores enforce a shared embedding space where unfeasible distractors are discarded, while visual and compositional domains are aligned. Since the distractors may pollute the learned representations of other unseen compositions in Co-CGE, we also inject the feasibility scores within the graph. In particular, we instantiate a weighted adjacency matrix, where the weights depend on the feasibility of each composition. Experiments show that Co-CGE is either superior to or competitive with the state of the art in CZSL while being much more effective on the challenging OW-CZSL task.
Our contributions are as follows: (1) We introduce Co-CGE, a graph formulation for the new OW-CZSL problem with an integrated feasibility estimation mechanism used to weight the graph connections; (2) We exploit the dependency between visual primitives and their compositional classes and propose a multimodal compatibility learning framework that embeds related states, objects and their compositions into a shared embedding space learned through cosine logits and feasibility-based margins; (3) We improve the state-of-the-art on MIT states[mitstates], UT Zappos [utzappos1] and the recently proposed C-GQA [cge] benchmarks on both CZSL and OW-CZSL.
This paper extends our previous works [cge] and [compcos], published in CVPR 2021, in many aspects. First, while being effective on standard CZSL, the CGE model of [cge] performs poorly in OW-CZSL, due to the noisy connections arising from the huge search space. We thus take the idea of estimating the feasibility of each composition from [compcos] and inject the feasibility scores both at the loss level and within the graph connections. Our model is based on a graph convolutional neural network (GCN) [gcn]. This means that the embeddings as well as the feasibility scores are influenced by all other compositions in the search space, rather than considered in isolation as was the case in [compcos]. We extend our OW-CZSL benchmark proposed in [compcos] to the new C-GQA dataset proposed in [cge], with 453 states and 870 objects and thus a total OW-CZSL search space of almost 400k compositions. Finally, we significantly improve the state of the art on the challenging OW-CZSL setting.
2 Related works
Compositionality can loosely be defined as the ability to decompose an observation into its primitives. These primitives can then be used for complex reasoning. One of the earliest attempts in computer vision in this direction can be traced to Hoffman [hoffman1984parts] and Biederman [biederman1987recognition], who theorized that visual systems can mimic compositionality by decomposing objects into their parts. Compositionality at a fundamental level is already included in modern vision systems. Convolutional Neural Networks (CNN) have been shown to exploit compositionality by learning a hierarchy of features [zeiler2014visualizing, lecun1989backpropagation]. Transfer learning [caruana1997multitask, choi2013adding, deng2014large, patricia2014learning] and few-shot learning [hariharan2017low, ravi2016optimization, mensink2012metric] exploit the compositionality of pretrained features to generalize to data-constrained environments. Visual scene understanding [johnson2015image, dai2017detecting, jae2018tensorize, lu2016visual] aims to understand the compositionality of concepts in a scene. Nevertheless, these approaches still require collecting data for new compositional classes.
ZSL and CZSL. Zero-Shot Learning (ZSL) aims to recognize novel classes not observed during training [LNH13] using side information describing novel classes e.g. attributes [LNH13], text descriptions [RALS16] or word embeddings [SGMN13]. Some notable approaches include learning a compatibility function between image and class embeddings [akata2013label, zhang2016learning] and learning to generate image features for novel classes [xian2018feature, Zhu_2018_CVPR].
Compositional Zero-Shot Learning (CZSL) aims to learn the compositionality of objects and their states from the training set and to generalize to unseen combinations of these primitives. Approaches in this direction can be divided into two groups. The first group is directly inspired by [hoffman1984parts, biederman1987recognition]. Some notable methods include learning a transformation upon individual classifiers of states and objects [redwine], modeling each state as a linear transformation of objects [aopp], learning a hierarchical decomposition and composition of visual primitives [yang2020learning] and modeling objects to be symmetric under attribute transformations [symnet]. The second group argues that compositionality requires learning a joint compatibility function with respect to the image, the state and the object [causal, tmn, wang2019task]. This is achieved by learning modular networks conditioned on each composition [tmn, wang2019task] that can be “rewired” for new compositions. Finally, a recent work from Atzmon et al. [causal] argues that achieving generalization in CZSL requires learning the visual transformation through a causal graph where the latent representations of primitives are independent of each other.
GCN. Graph Convolutional Networks (GCN) [gcn, gcnzs, gcnzsrethinking] are a special type of neural network that exploits the dependency structure of data (nodes) defined in a graph. Current methods [gcn] are limited in network depth due to over-smoothing at deeper layers. In the extreme case, this can cause all nodes to converge to the same value [li2018deeper]. Several works have tried to remedy this through dense skip connections [xu2018representation, li2019deepgcns], randomly dropping edges [rong2019dropedge] and applying a linear combination of neighbor features [wu2019simplifying, klicpera2019diffusion, klicpera2018predict]. A recent work in this direction from Chen et al. [gcnii] combines residual connections with identity mapping. GCNs have been shown to be promising for zero-shot learning. Wang et al. [gcnzs] propose to directly regress the classifier weights of novel classes with a GCN operating on an external knowledge graph (WordNet [wordnet]). Kampffmeyer et al. [gcnzsrethinking] improve this formulation by introducing a dense graph to learn a shallow GCN as a remedy for the Laplacian smoothing problem [li2018deeper].
Our method lies at the intersection of several discussed approaches. We learn a joint compatibility function similar to [causal, tmn, wang2019task] and utilize a GCN similar to [gcnzs, gcnzsrethinking]. However, we exploit the dependency structure between states, objects and compositions which has been overlooked by previous CZSL approaches [causal, tmn, wang2019task]. Instead of using a predefined knowledge graph like WordNet [wordnet], we propose a novel way to build a compositional graph and learn classifiers for all classes. In contrast to [causal] we explicitly promote the dependency between all primitives and their compositions in our graph. This allows us to learn embeddings that are consistent with the whole graph. Furthermore, our approach estimates the feasibility of each composition, exploiting this information to re-weight the graph connections and to model the presence of distractors within the training objective. Finally, unlike all existing methods [redwine, aopp, causal, tmn, wang2019task, yang2020learning], instead of using a fixed image feature extractor our model is trained end-to-end.
Open World Recognition. In our open world setting, all the combinations of states and objects can form a valid compositional class. This is different from an alternate definition of Open World Recognition (OWR) [bendale2015towardsowr, mancini2019knowledge] where the goal is to dynamically update a model trained on a subset of classes to recognize increasingly more concepts as new data arrives.
Our definition is related to the open set zero-shot learning (ZSL) [zslxian18benchmark] scenario of [fu2016semi, fu2019vocabulary], which expands the output space to a large vocabulary of semantic concepts. Both our work and [fu2019vocabulary] consider the lack of constraints in the output space for unseen concepts as a requirement for practical (compositional) ZSL methods. However, since we focus on the CZSL task, we have access to images of all primitives during training but not all their possible compositions. This implies that we can use the knowledge obtained from the visual world to model the feasibility of compositions and modify the representations in the shared visual-compositional embedding space. We explicitly model the feasibility of each unseen composition, incorporating this knowledge into our model during training.
3 Compositional Cosine Graph Embeddings
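Let $\mathcal{S}$ be the set of possible states, $\mathcal{O}$ the set of possible objects, and $\mathcal{C} = \mathcal{S} \times \mathcal{O}$ the set of all their possible compositions. $\mathcal{T} = \{(x_i, c_i)\}_{i=1}^{|\mathcal{T}|}$ is a training set where $x_i \in \mathcal{X}$ is a sample in the input (image) space $\mathcal{X}$ and $c_i \in \mathcal{C}^s$ is a composition in the subset $\mathcal{C}^s \subset \mathcal{C}$. $\mathcal{T}$ is used to train a model $f: \mathcal{X} \rightarrow \mathcal{C}^t$ predicting combinations in a space $\mathcal{C}^t \subseteq \mathcal{C}$, where $\mathcal{C}^t$ may include compositions not present in $\mathcal{C}^s$ (i.e. $\exists\, c \in \mathcal{C}^t$ such that $c \notin \mathcal{C}^s$).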
The CZSL task entails different challenges depending on the extent of the target set $\mathcal{C}^t$. If $\mathcal{C}^t$ is a subset of $\mathcal{C}$ and $\mathcal{C}^t \cap \mathcal{C}^s = \emptyset$, the task definition is that of [redwine], where the model needs to predict only unseen compositions at test time. In case $\mathcal{C}^s \subset \mathcal{C}^t$, we are in the generalized CZSL (GCZSL) scenario, and the output space of the model contains both seen and unseen compositions. Similar to generalized ZSL [zslxian18benchmark], the GCZSL scenario is more challenging due to the natural prediction bias of the model toward the compositions in $\mathcal{C}^s$, seen during training. Most recent works on CZSL consider the GCZSL scenario [tmn, symnet], and the set of unseen compositions in $\mathcal{C}^t$ is known a priori.
In our work, the output space is the whole set of possible compositions, $\mathcal{C}^t \equiv \mathcal{C}$, i.e. Open World Compositional Zero-shot Learning (OW-CZSL). Note that this task presents the same challenges of the GZSL setting while being far more difficult since i) $|\mathcal{C}^t| \gg |\mathcal{C}^s|$, thus it is hard to generalize from a small set of seen to a very large set of unseen compositions; and ii) there are a large number of distracting compositions in $\mathcal{C}^t$, i.e. compositions predicted by the model but not present in the actual test set that can be close to other unseen compositions, hampering their discriminability. We highlight that, despite being similar to Open Set Zero-shot Learning [fu2019vocabulary], we do not only consider objects but also states. Therefore, this knowledge can be exploited to identify unfeasible distracting compositions (e.g. rusty pie) and isolate them. In the following we describe how we tackle this problem by means of compositional graph embeddings.
3.1 Compositional Graph Embedding for CZSL
In this section, we focus on the closed world setting, where $\mathcal{C}^s \subset \mathcal{C}^t \subset \mathcal{C}$. Since in this scenario $|\mathcal{C}^t| \ll |\mathcal{C}|$ and the number of unseen compositions is usually lower than the number of seen ones, this problem presents several challenges. In particular, while learning a mapping from the visual to the compositional space, the model needs to avoid being overly biased toward seen class predictions.
States and objects are not independent: e.g. the appearance of the state sliced varies significantly with the object (e.g. apple or bread), and learning state and object classifiers separately is prone to overfitting to the labels observed during training. Therefore, we model the states and objects jointly via $f: \mathcal{X} \rightarrow \mathcal{C}^t$, which learns the compatibility between an image and a state-object composition. Given a specific input image $x$, we predict its label $c = (s, o)$ as the state-object composition that yields the highest compatibility score:

$$f(x) = \arg\max_{c \in \mathcal{C}^t} \; p(\omega(x), \varphi(c)) \tag{1}$$

where $\omega: \mathcal{X} \rightarrow \mathcal{Z}$ is the mapping from the image space to the $d$-dimensional shared embedding space $\mathcal{Z} \subset \mathbb{R}^d$, $\varphi: \mathcal{C}^t \rightarrow \mathcal{Z}$ embeds a composition to the same shared space and $p: \mathcal{Z} \times \mathcal{Z} \rightarrow \mathbb{R}$ is a compatibility scoring function.
We implement $\omega$ as a deep neural network, $\varphi$ as a graph convolutional neural network (GCN) [gcn] and $p$ as cosine similarity. This way we exploit deep image representations and propagate information from seen to unseen concepts through a graph, while at the same time avoiding bias on seen classes through cosine similarity scores. We name our model Compositional Cosine Graph Embeddings (Co-CGE).
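The prediction rule can be illustrated with a minimal NumPy sketch. It assumes the image and composition embeddings have already been computed; the function names, the 2-d toy embeddings and the composition list are hypothetical and only serve to show the arg-max over cosine compatibility scores.

```python
import numpy as np

def cosine_scores(image_emb, comp_embs):
    """Cosine similarity between one image embedding (d,) and n composition embeddings (n, d)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    comp_embs = comp_embs / np.linalg.norm(comp_embs, axis=1, keepdims=True)
    return comp_embs @ image_emb

def predict(image_emb, comp_embs, compositions):
    """Return the (state, object) pair with the highest compatibility score."""
    scores = cosine_scores(image_emb, comp_embs)
    return compositions[int(np.argmax(scores))]

# Hypothetical 2-d shared space with three candidate compositions.
compositions = [("old", "dog"), ("cute", "dog"), ("old", "car")]
comp_embs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
image_emb = np.array([5.0, 1.0])  # large norm on purpose: only the direction matters

prediction = predict(image_emb, comp_embs, compositions)
print(prediction)
```

Because the scores are cosine similarities, rescaling the image embedding leaves the prediction unchanged, which is the property the text relies on to limit the bias toward seen classes.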
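Compositional Graph Embeddings (CGE). We encode the dependency structure of states, objects and their compositions (both seen and unseen) through a compositional graph. We map compositions into the shared embedding space $\mathcal{Z}$ by modeling $\varphi$ as a GCN with $N$ nodes and $K$ layers, where the output of the $l$-th layer is:

$$V^{(l+1)} = \sigma\left(\hat{A}\, V^{(l)}\, W^{(l)}\right) \tag{2}$$

Here $\sigma$ is a non-linear activation function (i.e. ReLU), $V^{(l)} \in \mathbb{R}^{N \times d_l}$ is the matrix of the $d_l$-dimensional node representations at layer $l$, $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ is the trainable weight matrix at layer $l$ and $\hat{A} \in \mathbb{R}^{N \times N}$ is the column normalized adjacency matrix $A$. In the CZSL task, CGE defines the set of starting nodes as $\mathcal{V} = \mathcal{S} \cup \mathcal{O} \cup \mathcal{C}^t$, with $N = |\mathcal{S}| + |\mathcal{O}| + |\mathcal{C}^t|$ and $V^{(0)} \in \mathbb{R}^{N \times m}$. Given $v_i \in \mathcal{V}$, i.e. the $i$-th element of $\mathcal{V}$, the representation of its node in $V^{(0)}$ is:

$$V^{(0)}_i = \begin{cases} \dfrac{\varphi_0(s) + \varphi_0(o)}{2} & \text{if } v_i = (s, o) \in \mathcal{C}^t \\ \varphi_0(v_i) & \text{otherwise} \end{cases} \tag{3}$$

where $\varphi_0$ maps the primitives, i.e. objects and states, into their corresponding $m$-dimensional embedding vectors. The input embedding of a composition is initialized as the average of its primitive embeddings. All the node representations in $V^{(0)}$ are fixed and initialized with word embeddings, e.g. [word2vec]. A crucial element of the graph is the adjacency matrix $A$. CGE connects all states/objects to objects/states that form at least one composition in the dataset, all compositions to their corresponding primitives, and vice-versa. Formally, for two elements $v_i, v_j \in \mathcal{V}$, the value of the adjacency matrix at row $i$, column $j$ is:

$$a_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in \mathcal{C}^t \text{ or } (v_j, v_i) \in \mathcal{C}^t \\ 1 & \text{if } \gamma(v_i, v_j) \\ 1 & \text{if } \gamma(v_j, v_i) \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $\gamma(v_i, v_j)$ is true if $v_i = (s, o)$ and $v_j \in \{s, o\}$, i.e. $v_j$ is a primitive of the composition $v_i$. The first case in Eq. (4) denotes the connection between states and objects belonging to the same composition, while the second and the third rows denote the connections between compositions and their base primitives. We highlight that this formulation allows the model to propagate the information through the graph, obtaining better node embeddings for both the seen and unseen compositional labels. For example, the GCN allows an unseen composition, e.g. old dog, to aggregate information from its seen neighbor nodes, e.g. old, dog, cute dog, and old car.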
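A small NumPy sketch may help make the graph construction concrete. It builds the binary adjacency matrix from the three connection types described above, column-normalizes it, initializes composition nodes as the average of their primitives, and applies two GCN layers. The function names, the random stand-ins for word embeddings, the added self-loops (suggested by the ablation, not by the formula) and the absence of an activation after the last layer are all assumptions of this sketch.

```python
import numpy as np

def build_graph(states, objects, compositions, self_loops=True):
    """Nodes are states, objects and (state, object) tuples; primitive names must be unique."""
    nodes = list(states) + list(objects) + list(compositions)
    idx = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for s, o in compositions:
        c, si, oi = idx[(s, o)], idx[s], idx[o]
        A[si, oi] = A[oi, si] = 1.0   # state <-> object forming a composition
        A[c, si] = A[c, oi] = 1.0     # composition row: connected to its primitives
        A[si, c] = A[oi, c] = 1.0     # primitive rows: connected to the composition
    if self_loops:
        A += np.eye(len(nodes))       # assumption: self-connections on every node
    return nodes, idx, A

def column_normalize(A):
    """Divide each column by its sum (assumes no empty column)."""
    return A / A.sum(axis=0, keepdims=True)

def initial_features(nodes, word_emb):
    """Primitives use their word embedding; compositions average their two primitives."""
    rows = [(word_emb[v[0]] + word_emb[v[1]]) / 2 if isinstance(v, tuple) else word_emb[v]
            for v in nodes]
    return np.stack(rows)

def gcn_forward(A_hat, V0, weights):
    """V <- relu(A_hat V W) per layer; no activation after the last layer (assumption)."""
    V = V0
    for k, W in enumerate(weights):
        V = A_hat @ V @ W
        if k < len(weights) - 1:
            V = np.maximum(V, 0.0)
    return V

rng = np.random.default_rng(0)
states, objects = ["old", "cute"], ["dog", "car"]
compositions = [("old", "car"), ("cute", "dog"), ("old", "dog")]  # "old dog" may be unseen
word_emb = {w: rng.normal(size=4) for w in states + objects}      # stand-in for word2vec
nodes, idx, A = build_graph(states, objects, compositions)
A_hat = column_normalize(A)
V0 = initial_features(nodes, word_emb)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]       # 2-layer GCN
embeddings = gcn_forward(A_hat, V0, weights)
print(embeddings.shape)  # one 3-d embedding per node
```

The row of `embeddings` indexed by `idx[("old", "dog")]` is the composition embedding that would be compared against image features; it depends on old, dog and, through them, on the other compositions in the graph.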
Objective function. The final element of our model is the compatibility score $p$. We implement $p$ as the cosine similarity between the visual and compositional embeddings:

$$p(\omega(x), \varphi(c)) = \cos(\omega(x), \varphi(c)) = \frac{\omega(x)^\top \varphi(c)}{\lVert \omega(x) \rVert \, \lVert \varphi(c) \rVert} \tag{5}$$

The cosine similarity produces bounded scores, which prevents predictions from being influenced by the higher magnitude of the scores of seen training classes [hou2019learning] while generalizing better to new ones [gidaris2018dynamic]. This is a greater challenge for our model compared to CGE [cge] since we tailor it for the open world. Finally, we learn the mappings $\omega$ and $\varphi$ by minimizing the cross-entropy loss over the cosine logits:

$$\mathcal{L} = -\frac{1}{|\mathcal{T}|} \sum_{(x, c) \in \mathcal{T}} \log \frac{\exp\left(\frac{1}{T}\, p(x, c)\right)}{\sum_{y \in \mathcal{C}^s} \exp\left(\frac{1}{T}\, p(x, y)\right)} \tag{6}$$

where $T$ is a temperature value that scales the probability values for the cross-entropy loss [zhang2019adacos] and $p(x, c) = p(\omega(x), \varphi(c))$. By exploiting graph embeddings and bounding the classifier scores for seen and unseen compositions, our Co-CGE achieves outstanding performance on the closed world scenario. In the following we discuss how we extend Co-CGE to the more challenging OW-CZSL.
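As a rough illustration of a cross-entropy loss over temperature-scaled cosine logits, the following NumPy sketch computes the loss for a toy batch. The embeddings are hypothetical, the temperature value 0.05 is purely illustrative, and the set of composition embeddings passed in is assumed to be the one the softmax should range over.

```python
import numpy as np

def cosine_logits(img_embs, comp_embs):
    """Pairwise cosine similarities, shape (batch, n_compositions)."""
    x = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    c = comp_embs / np.linalg.norm(comp_embs, axis=1, keepdims=True)
    return x @ c.T

def cosine_cross_entropy(img_embs, comp_embs, labels, temperature):
    """Mean cross-entropy over temperature-scaled cosine logits."""
    logits = cosine_logits(img_embs, comp_embs) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

comp_embs = np.eye(3)                                     # three orthogonal toy compositions
img_embs = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])   # aligned with compositions 0 and 1
T = 0.05                                                  # illustrative temperature
loss_correct = cosine_cross_entropy(img_embs, comp_embs, np.array([0, 1]), T)
loss_wrong = cosine_cross_entropy(img_embs, comp_embs, np.array([1, 0]), T)
print(loss_correct, loss_wrong)
```

Since cosine scores live in $[-1, 1]$, the temperature is what lets the softmax become sharp: with $T = 0.05$ a cosine gap of 1 turns into a logit gap of 20.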
3.2 From Closed to Open World CZSL
The OW-CZSL setting requires avoiding distractors, i.e. unlikely concepts such as ripe dog. The similarity among objects and states can be used as a proxy to estimate the feasibility of each composition. We can then inject the estimated feasibility into Co-CGE both as margins in the loss function and as weights within the adjacency matrix.
Estimating Compositional Feasibility. Let us consider two objects, namely cat and dog. We know, from our training set, that cats can be small and dogs can be wet since we have at least one image for each of these compositions. However, the training set may not contain images of wet cats and small dogs, which we know are feasible in reality. We conjecture that similar objects share similar states while dissimilar ones do not. Hence, it is safe to assume that the states of cats can be transferred to dogs and vice-versa.
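With this idea in mind, we define the feasibility score of a composition $c = (s, o)$ with respect to the object $o$ as:

$$\rho_{obj}(s, o) = \max_{\hat{o} \in \mathcal{O}^s} \frac{\varphi(o)^\top \varphi(\hat{o})}{\lVert \varphi(o) \rVert \, \lVert \varphi(\hat{o}) \rVert} \tag{7}$$

with $\mathcal{O}^s$ being the set of objects associated with state $s$ in the training set $\mathcal{C}^s$, i.e. $\mathcal{O}^s = \{o \mid (s, o) \in \mathcal{C}^s\}$. Note that the score is computed as the cosine similarity between the object embedding produced by the graph and the most similar other object with the target state, thus the score is bounded in $[-1, 1]$. Training compositions get assigned the score of 1. Similarly, we define the score with respect to the state $s$ as:

$$\rho_{state}(s, o) = \max_{\hat{s} \in \mathcal{S}^o} \frac{\varphi(s)^\top \varphi(\hat{s})}{\lVert \varphi(s) \rVert \, \lVert \varphi(\hat{s}) \rVert} \tag{8}$$

with $\mathcal{S}^o$ being the set of states associated with the object $o$ in the training set $\mathcal{C}^s$, i.e. $\mathcal{S}^o = \{s \mid (s, o) \in \mathcal{C}^s\}$. The feasibility score for a composition $c = (s, o)$ is then:

$$\rho(c) = \rho(s, o) = g\left(\rho_{state}(s, o), \rho_{obj}(s, o)\right) \tag{9}$$

where $g$ is a mixing function, e.g. the max operation ($g = \max$) or the average ($g = \mathrm{mean}$), keeping the feasibility score bounded in $[-1, 1]$. Note that, while we focus on extracting feasibility from the visual information, external knowledge (e.g. knowledge bases [liu2004conceptnet], language models [wang2019language]) can be a complementary resource.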
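The feasibility computation can be sketched in a few lines of NumPy. The sketch assumes primitive embeddings are available as a dictionary, uses hand-picked 2-d vectors in place of GCN outputs, and returns the lowest score, -1, when a primitive has no seen partner; that fallback and all names are assumptions of this example.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def feasibility(state, obj, emb, seen, mix=np.mean):
    """Feasibility of (state, obj) from similarities among primitive embeddings."""
    if (state, obj) in seen:
        return 1.0                                    # training compositions score 1
    objs_with_state = [o for s, o in seen if s == state]
    states_with_obj = [s for s, o in seen if o == obj]
    # Assumption: a primitive with no seen partner gets the lowest score, -1.
    rho_obj = max((cos(emb[obj], emb[o]) for o in objs_with_state), default=-1.0)
    rho_state = max((cos(emb[state], emb[s]) for s in states_with_obj), default=-1.0)
    return float(mix([rho_state, rho_obj]))

# Hypothetical embeddings: cat and dog are close, tomato is far from both.
emb = {
    "cat": np.array([1.0, 0.1]), "dog": np.array([0.9, 0.2]), "tomato": np.array([-0.2, 1.0]),
    "wet": np.array([1.0, 0.0]), "small": np.array([0.8, 0.3]), "ripe": np.array([0.0, 1.0]),
}
seen = {("small", "cat"), ("wet", "dog"), ("ripe", "tomato")}

wet_cat = feasibility("wet", "cat", emb, seen)     # plausible unseen composition
ripe_dog = feasibility("ripe", "dog", emb, seen)   # implausible unseen composition
print(round(wet_cat, 3), round(ripe_dog, 3))
```

Passing `mix=np.max` switches from the average to the max mixing function; with the toy vectors above, wet cat scores far higher than ripe dog under either choice.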
A simple strategy to use the feasibility scores would be to consider all compositions above a threshold $\tau$ as valid and the others as distractors:

$$f_{HARD}(x) = \arg\max_{c \in \mathcal{C}^t,\; \rho(c) > \tau} \; p(\omega(x), \varphi(c)) \tag{10}$$

However, this strategy might be too restrictive in practice. For instance, tomatoes and dogs being far apart in the embedding space does not mean that a state for dog, e.g. wet, cannot be applied to a tomato. Therefore, considering the feasibility scores as the gold standard may lead to excluding valid compositions (see Figure 1). To sidestep this issue, we inject the feasibility scores directly into both the model and the training procedure. We argue that doing so enforces separation between the most and least feasible unseen compositions in the shared embedding space.
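For reference, hard thresholding amounts to masking out compositions before the arg-max. The sketch below assumes precomputed compatibility and feasibility scores; the numbers and the function name are hypothetical.

```python
import numpy as np

def predict_hard(scores, feasibility, threshold):
    """Arg-max restricted to compositions whose feasibility exceeds the threshold.

    If no composition passes the threshold, index 0 is returned (NumPy arg-max of all -inf).
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(feasibility, dtype=float) > threshold
    return int(np.argmax(np.where(mask, scores, -np.inf)))

scores = [0.7, 0.9, 0.5]        # compatibility of one image with three compositions
feasibility = [1.0, 0.1, 0.6]   # the best-scoring composition has low feasibility

print(predict_hard(scores, feasibility, threshold=0.3))   # composition 1 is masked out
print(predict_hard(scores, feasibility, threshold=-1.0))  # nothing is masked
```

The example shows the failure mode discussed in the text: if composition 1 were actually valid, a threshold of 0.3 would make it impossible to predict.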
Feasibility-aware objective. First, we integrate the feasibility scores directly within our objective function as margins, defining the new objective as:

$$\mathcal{L}_f = -\frac{1}{|\mathcal{T}|} \sum_{(x, c) \in \mathcal{T}} \log \frac{\exp\left(\frac{1}{T}\, p_f(x, c)\right)}{\sum_{y \in \mathcal{C}} \exp\left(\frac{1}{T}\, p_f(x, y)\right)} \tag{11}$$

with:

$$p_f(x, c) = \begin{cases} \cos(\omega(x), \varphi(c)) & \text{if } c \in \mathcal{C}^s \\ \cos(\omega(x), \varphi(c)) - \alpha \rho(c) & \text{otherwise} \end{cases} \tag{12}$$

where $\rho(c)$ are used as margins for the cosine similarities, and $\alpha > 0$ is a scalar factor. With Eq. (11) we include the full compositional space while training with the seen compositions data to raise awareness of the margins between seen and unseen compositions directly during training. Note that, since $\rho(c_i) \neq \rho(c_j)$ if $c_i \neq c_j$ and $c_i, c_j \notin \mathcal{C}^s$, we have a different margin, i.e. $-\alpha \rho(c)$, for each unseen composition $c$. In this way, we penalize the more feasible compositions less, pushing them closer to the seen ones, to which the visual embedding network is biased. At the same time, we force the network to push the representation of less feasible compositions away from the compositions in $\mathcal{C}^s$ in $\mathcal{Z}$. More feasible unseen compositions will then be more likely to be predicted by the model than the less feasible ones (which are more penalized). As an example (Figure 1, top part), the unfeasible composition ripe dog is more penalized than the feasible wet tomato during training, with the outcome that the optimization procedure does not force the model to reduce the region of wet tomato, while reducing the one of ripe dog (top-right pie).
We highlight that in this stage we do not explicitly bound the revised scores to $[-1, 1]$. Instead, we let the network implicitly adjust the cosine similarity scores during training. We also found it beneficial to linearly increase $\alpha$ up to a maximum value as the training progresses, rather than keeping it fixed. This permits the model to gradually introduce the feasibility margins within the objective while exploiting improved primitive embeddings to compute them.
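The following NumPy sketch illustrates one way to combine feasibility margins with a linearly increasing margin factor. It assumes that unseen logits are shifted by the negative of the scaled feasibility score, so that less feasible compositions receive larger training logits and are pushed away more strongly, as the discussion above describes. The warm-up length of 15 epochs follows the implementation details; the maximum margin 0.4, the temperature 0.05 and the toy scores are illustrative assumptions.

```python
import numpy as np

def alpha_schedule(epoch, alpha_max, warmup_epochs=15):
    """Linearly increase the margin factor from 0 to alpha_max, then keep it constant."""
    return alpha_max * min(epoch / warmup_epochs, 1.0)

def feasibility_margin_loss(cos_scores, labels, rho, seen_mask, alpha, temperature):
    """Cross-entropy over all compositions; unseen logits are shifted by -alpha * rho."""
    margins = np.where(seen_mask, 0.0, -alpha * rho)
    logits = (cos_scores + margins) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# One image of seen composition 0; compositions 1 and 2 are unseen with equal cosine score.
cos_scores = np.array([[0.8, 0.5, 0.5]])
labels = np.array([0])
seen_mask = np.array([True, False, False])
rho = np.array([1.0, 0.9, -0.6])                  # composition 1 is feasible, 2 is not
alpha = alpha_schedule(epoch=15, alpha_max=0.4)   # illustrative values

shifted = cos_scores + np.where(seen_mask, 0.0, -alpha * rho)
loss_plain = feasibility_margin_loss(cos_scores, labels, rho, seen_mask, 0.0, 0.05)
loss_margin = feasibility_margin_loss(cos_scores, labels, rho, seen_mask, alpha, 0.05)
print(shifted, loss_plain, loss_margin)
```

In this toy case the unfeasible composition ends up with a higher training logit than the feasible one, so it contributes most of the loss and receives most of the gradient that separates it from the seen composition.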
Feasibility-driven graph. Modelling the relationship between seen and unseen compositions via a GCN is more challenging in the open world scenario, since less feasible unseen compositions will also influence the graph structure. This leads to two problems. The first is that distractors will influence the embeddings of seen compositions, making them less discriminative. The second is that the gradient flow will push unfeasible compositions close to seen ones, making it harder to isolate distractors in the embedding space.

For these reasons, we modify the adjacency matrix of Eq. (4) in a weighted fashion, with the goal of reducing both the gradient flow on and the influence of less feasible compositions. To achieve this goal, we directly exploit the feasibility scores, defining the adjacency matrix as:

$$a_{ij} = \begin{cases} \rho(c) & \text{if } c = (v_i, v_j) \in \mathcal{C} \text{ or } c = (v_j, v_i) \in \mathcal{C} \\ 1 & \text{if } \gamma(v_i, v_j) \\ \rho(v_j) & \text{if } \gamma(v_j, v_i) \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

where $\gamma(v_i, v_j)$ is true if $v_i = (s, o)$ and $v_j \in \{s, o\}$. In Eq. (13), the connection between a state $s$ and an object $o$ corresponds to the feasibility of the composition $c = (s, o)$, such that the higher the feasibility of the composition, the stronger the connection between its two constituent primitives. Similarly, the influence of a composition on its primitives (third row) corresponds to the feasibility of the composition itself. We found that it is beneficial to let the embedding of a composition be fully influenced by the embeddings of its primitives $s$ and $o$ (Eq. (13), second row). The motivation is that the mapping between compositions and primitives is not bijective: one composition corresponds to only one state and one object, but states and objects build multiple compositions. So while a composition is surely connected with its constituent primitives (second row, value 1), a state and an object are more related to existing, feasible compositions (third row).
The formulation of Eq. (13) makes the connections in the graph dependent on the feasibility of the compositions. This allows the model to reduce the impact of less feasible compositions both in the forward and in the backward pass, making the shared embedding space more discriminative and less influenced by distractors.
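A feasibility-weighted adjacency matrix along these lines could be assembled as in the sketch below. The feasibility values and names are hypothetical, and the sketch uses the scores as given; how negative scores should be treated (e.g. clipped at zero) is not specified in the text and is left out.

```python
import numpy as np

def weighted_adjacency(states, objects, rho):
    """rho maps every (state, object) pair of the open world space to its feasibility score."""
    compositions = [(s, o) for s in states for o in objects]
    nodes = list(states) + list(objects) + compositions
    idx = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for s, o in compositions:
        c, si, oi, w = idx[(s, o)], idx[s], idx[o], rho[(s, o)]
        A[si, oi] = A[oi, si] = w     # first row: state <-> object, weighted by feasibility
        A[c, si] = A[c, oi] = 1.0     # second row: composition fully tied to its primitives
        A[si, c] = A[oi, c] = w       # third row: primitive <- composition, weighted
    return idx, A

states, objects = ["wet", "ripe"], ["dog", "tomato"]
rho = {("wet", "dog"): 1.0, ("ripe", "tomato"): 1.0,     # seen compositions
       ("wet", "tomato"): 0.6, ("ripe", "dog"): 0.05}    # hypothetical unseen scores
idx, A = weighted_adjacency(states, objects, rho)

# Influence of "ripe dog" on "ripe" versus influence of "ripe" on "ripe dog".
print(A[idx["ripe"], idx[("ripe", "dog")]], A[idx[("ripe", "dog")], idx["ripe"]])
```

The matrix is deliberately asymmetric: the unfeasible composition ripe dog still receives the full embeddings of ripe and dog, but contributes almost nothing back to them.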
Discussion. Our Co-CGE model uses a GCN to map compositions to the shared embedding space, and a cosine classifier to measure the compatibility between image features and composition embeddings. This formulation merges and extends our previous models CGE [cge] and CompCos [compcos]. In particular, as in CGE we model the relationship between seen and unseen compositions through a graph. This allows us to perform end-to-end training of the CNN backbone without overfitting, since the feature representation is regularized by the compositional graph.
Naïvely applying CGE is not effective in the open world scenario, where we need to model the feasibility of each composition. Thus, following CompCos, we estimate the feasibility score of each composition and use the scores as margins in the objective function, with a cosine similarity-based classifier. We improve CompCos by modeling the feasibility of each composition also within the model, defining a weighted adjacency matrix for the GCN whose weights depend on the feasibility scores. Moreover, the primitive embeddings used to compute the feasibility scores are produced by the GCN (thus influenced by the respective compositions) rather than learned in isolation, as in CompCos. These modifications allow Co-CGE to build a more discriminative shared embedding space where the compatibility function better isolates less feasible compositions. Finally, since the model is already robust enough to the presence of distractors in OW-CZSL, we do not need to use the hard masking of Eq. (10).
Table caption: results on MIT-States, UT Zappos and C-GQA. We measure best seen (S) and unseen (U) accuracy, best harmonic mean (HM), and area under the curve (AUC) on the compositions.
Datasets. We perform our experiments on three datasets (see Table I). We adopt the standard split of MIT-States [mitstates] from [tmn]. For the open world scenario, 26114 out of 28175 compositions (93%) are not present in any split of the dataset but are included in our open world setting. In UT Zappos [utzappos1, utzappos2] we follow the splits from [tmn]. Note that although 76 out of 192 possible compositions (40%) are not in any of the splits of the dataset, we consider them in our open world setting.
Both UT-Zappos and MIT-States have limitations. UT-Zappos [utzappos1, utzappos2] is arguably not entirely compositional, as states like Faux leather vs Leather are material differences not always observable as visual transformations. MIT-States instead contains images collected through an older search engine with limited human annotation, leading to significant label noise [causal]. To address the limitations of these two datasets, in our previous work [cge] we introduced a split built on top of the Stanford GQA dataset [gqa], i.e. the Compositional GQA (C-GQA) dataset. In this work we extend it to the OW-CZSL task. With 453 states and 870 objects, the resulting OW-CZSL search space has almost 400K compositions, making it far more challenging than the other benchmarks.
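The search space sizes quoted above follow directly from the numbers of states and objects. As a quick sanity check, using only figures reported in the text (the variable names are ours):

```python
# (number of states, number of objects) as reported in the text
datasets = {"MIT-States": (115, 245), "C-GQA": (453, 870)}
open_world_sizes = {name: s * o for name, (s, o) in datasets.items()}
print(open_world_sizes)

# Share of open world compositions that appear in no split of the dataset
mit_unused = 26114 / 28175
zappos_unused = 76 / 192
print(f"{mit_unused:.1%}, {zappos_unused:.1%}")
```

The C-GQA product is slightly below 400K, and the two shares round to the 93% and 40% given for MIT-States and UT Zappos.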
Metrics. In zero-shot learning, models being trained only on seen labels (compositions) causes an inherent bias against the unseen labels. As pointed out by [chao2016empirical, tmn], the model thus needs to be calibrated by adding a scalar bias to the activations of the novel compositions to find the best operating point and evaluate the GCZSL performance.
We adopt the evaluation protocol of [tmn] and report the Area Under the Curve (AUC) (in %) between the accuracy on seen and unseen compositions at different operating points with respect to the bias. The best unseen accuracy is calculated when the bias term is large, i.e. the model predicts only the unseen labels, also known as zero-shot performance. In addition, the best seen performance is calculated when the bias term is negative, i.e. the model predicts only the seen labels. As a balance between the two, we also report the best harmonic mean (HM). We emphasize that our C-GQA dataset splits and the MIT-States and UT-Zappos dataset splits from [tmn] do not violate the zero-shot assumption, as results are ablated on the validation set. We therefore advise future works to also use our splits.
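The calibration-based metrics can be sketched as follows. This is a simplified illustration, not the evaluation code of [tmn]: it assumes a fixed, hand-chosen list of bias values, a score matrix with one column per composition, and the trapezoidal rule for the area under the seen-unseen accuracy curve. All names and toy numbers are hypothetical.

```python
import numpy as np

def seen_unseen_accuracy(scores, labels, unseen_mask, bias):
    """Accuracy on seen and unseen samples after adding `bias` to unseen composition scores."""
    pred = (scores + bias * unseen_mask).argmax(axis=1)
    correct = pred == labels
    sample_is_unseen = unseen_mask[labels].astype(bool)
    return correct[~sample_is_unseen].mean(), correct[sample_is_unseen].mean()

def evaluate(scores, labels, unseen_mask, biases):
    points = [seen_unseen_accuracy(scores, labels, unseen_mask, b) for b in biases]
    seen = np.array([p[0] for p in points])
    unseen = np.array([p[1] for p in points])
    hm = [2 * s * u / (s + u) if s + u > 0 else 0.0 for s, u in zip(seen, unseen)]
    order = np.argsort(unseen, kind="stable")
    u, s = unseen[order], seen[order]
    auc = float(np.sum(np.diff(u) * (s[1:] + s[:-1]) / 2))   # trapezoidal rule
    return {"best_seen": float(seen.max()), "best_unseen": float(unseen.max()),
            "best_hm": float(max(hm)), "auc": auc}

# Toy problem: composition 0 is seen, composition 1 is unseen, four test images.
scores = np.array([[0.9, 0.2], [0.6, 0.5], [0.7, 0.6], [0.8, 0.3]])
labels = np.array([0, 0, 1, 1])
unseen_mask = np.array([0.0, 1.0])
biases = [-1.0, 0.0, 0.2, 0.6, 1.0]

results = evaluate(scores, labels, unseen_mask, biases)
print(results)
```

With a strongly negative bias every image is assigned to the seen composition (best seen accuracy), with a large positive bias every image is assigned to the unseen one (best unseen accuracy), and the intermediate biases trace the curve whose area is reported as AUC.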
Benchmark and Implementation Details. Following [tmn, symnet] we use a ResNet18 pretrained on ImageNet [deng2009imagenet] as feature extractor and fine-tune the whole architecture; we refer to this model as Co-CGE. For a fair comparison with the models that use a fixed feature extractor, we also perform experiments with a simplification of our model where we learn a 3-layer fully-connected (FC) network with ReLU [nair2010rectified], LayerNorm [ba2016layer] and Dropout [srivastava2014dropout] while keeping the feature extractor fixed, i.e. the fixed-feature variant of Co-CGE.
We initialize the embedding function $\varphi_0$ with 300-dimensional word2vec [word2vec] embeddings for UT Zappos and C-GQA, and with 600-dimensional word2vec+fasttext [fasttext] embeddings for MIT-States, following [xian2019semantic], keeping the same dimensions for the shared embedding space $\mathcal{Z}$. We train both $\omega$ and $\varphi$ using the Adam [kingma2014adam] optimizer with weight decay. For both Co-CGE and CompCos, the margin factor $\alpha$ and the temperature $T$ are set separately for MIT-States, UT Zappos and C-GQA. We linearly increase $\alpha$ from 0 to these values during training, reaching them after 15 epochs. We use the average as the mixing function $g$ to merge state and object feasibility scores for both our model and CompCos. For CompCos we additionally use $f_{HARD}$ as predictor, unless otherwise stated.
For $\varphi$ we use a shallow 2-layer GCN. Its hidden dimension is shared across the closed world experiments, except for UT Zappos, where we use a different dimension. For the OW-CZSL experiments, we found it beneficial to reduce the hidden dimension to that of the input embeddings, i.e. to 300 for UT Zappos and C-GQA, and 600 for MIT-States. Note that since the C-GQA search space is extremely large in the OW-CZSL setting, to test CGE and the closed world version of Co-CGE, we further reduce their hidden dimension.
We compare with four state-of-the-art methods: Attributes as Operators (AOP) [aopp], considering objects as vectors and states as matrices modifying them; LabelEmbed+ (LE+) [redwine, aopp], training a classifier merging state and object embeddings with an MLP; Task-Modular Neural Networks (TMN) [tmn], modifying the classifier through a gating function receiving as input the queried state-object composition; and SymNet [symnet], learning object embeddings showing symmetry under different state-based transformations. We also compare Co-CGE with our previous works, CGE [cge] and CompCos [compcos]. We implement our method in PyTorch [paszke2019pytorch] and train on an Nvidia V100 GPU. For baseline comparisons, we use the authors’ implementations where available.
4.1 Closed World CZSL
Comparison with the State of the Art. We experiment with the closed world setting on the test sets of all three datasets. Table II (top) shows models trained with the closed world assumption, while Table II (bottom) shows the open world models, not using any prior on the unseen test compositions during training but still predicting over a closed set.
Co-CGE achieves either comparable or superior results to the state of the art in all settings and metrics. In general, the results of Co-CGE are comparable or superior to CGE, while surpassing by a margin the other approaches in the closed world, e.g. on MIT-States Co-CGE vs CGE achieves an AUC of 6.6 vs 6.5. This signifies the importance of graph based methods for CZSL as both outperform the closest non graph baseline CompCos by a large margin. Similar observations apply to UT-Zappos, where Co-CGE is superior to all methods in AUC (33.9 vs 33.5 of CGE). However CGE outperforms our model for the best seen, unseen and HM. This signifies that while our model does not achieve the best accuracies, it is less biased between the seen and unseen classes leading to a better AUC.
Finally, on the challenging C-GQA dataset, Co-CGE achieves the best results in terms of AUC (3.8 vs 3.5 of CGE), HM (15.1% vs 14.4% of CGE), seen (+0.8 over CGE) and unseen (+0.4 over CGE) accuracies, with about twice the AUC of SymNet (1.8) and more than three times that of TMN (1.1). Note that, with a compositional space of over 9.3k concepts, C-GQA is significantly harder than the other datasets.
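The HM column rewards models whose seen and unseen accuracies are balanced. A minimal sketch, assuming HM denotes the harmonic mean of the seen and unseen accuracies; the function name and the accuracy values are hypothetical and not taken from Table II:

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen accuracy (both in percent)."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# Hypothetical accuracies, chosen only to illustrate the effect of bias.
biased = harmonic_mean(60.0, 20.0)    # strong on seen, weak on unseen
balanced = harmonic_mean(40.0, 40.0)  # same arithmetic mean, no bias
print(biased, balanced)
```

Both toy models have the same arithmetic mean, yet the balanced one obtains the higher harmonic mean, which is why a less biased model can score well on this metric without having the best accuracy on either split.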
|Connections in Graph||AUC||Best HM|
|a) Direct Word Embedding||5.9||19.4|
|b) c → p, p → c, no self-loop on y||7.6||18.6|
|c) c → p, p → c||8.1||22.7|
|d) CGE: c → p, p → c, and s ↔ o||8.6||23.3|
|e) Co-CGE: c → p, p → c, and s ↔ o||7.9||22.5|
Ablation Study. We perform an ablation study with respect to the various connections in our compositional graph on the validation set of MIT-States and report results in Table III. We start with the standard cross-entropy loss, as in CGE, and we then include the cosine similarity-based classifier.
In the Direct Word Embedding variant, i.e. row (a), our label embedding function is an average of the state and object word embeddings. We see that directly using the word embeddings of compositional labels as the classifier weights leads to an AUC of 5.9. In row (b) we represent a graph with connections between primitives (i.e. states and objects, p) and compositional labels (y), but remove the self connection for the compositional label. In this case, the final representation of compositional labels from the GCN only combines the hidden representations of states and objects, leading to an AUC of 7.6.
Row (c) represents the graph that has self connections for each compositional label in addition to the connections between primitives and compositional labels as in row (b). We see that this variant achieves an AUC of 8.1, indicating that the hidden representation of compositional classes is beneficial. Row (d) is the final CGE model, which additionally incorporates the connections between the state (s) and the object (o) of each pair to model the dependency between them. We observe that learning a representation that is consistent with states, objects and the compositional labels increases the AUC from 8.1 to 8.6, validating the choice of connections in the graph. Finally, if we employ a cosine classifier to replace the dot product classifier of CGE, we see in row (e) that the AUC and HM are comparable. Note that with this variant we can use the feasibility scores as margins in the objective, tailoring the method to OW-CZSL.
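The graph variants in rows (b) to (d) differ only in which entries of the adjacency matrix are set. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes that nodes are ordered as states, then objects, then compositions, and that primitives always keep their self-connection; the helper name `build_adjacency` and the toy vocabulary are illustrative.

```python
def build_adjacency(states, objects, pairs,
                    comp_self_loop=True, link_state_object=False):
    """Return a symmetric 0/1 adjacency matrix as a list of lists.

    Node order (an assumption of this sketch): states, objects, compositions.
    """
    n_s, n_o = len(states), len(objects)
    n = n_s + n_o + len(pairs)
    adj = [[0] * n for _ in range(n)]

    # Assumption: primitive nodes keep their self-connection in every variant.
    for i in range(n_s + n_o):
        adj[i][i] = 1

    for k, (s, o) in enumerate(pairs):
        c = n_s + n_o + k
        si = states.index(s)
        oi = n_s + objects.index(o)
        # Rows (b)-(d): composition <-> its state and its object.
        for p in (si, oi):
            adj[c][p] = adj[p][c] = 1
        # Rows (c)-(d): self-loop on the composition node.
        if comp_self_loop:
            adj[c][c] = 1
        # Row (d): state <-> object for every pair in the graph.
        if link_state_object:
            adj[si][oi] = adj[oi][si] = 1
    return adj


states = ["old", "cute"]
objects = ["dog", "car"]
pairs = [("old", "dog"), ("cute", "dog"), ("old", "car")]

row_b = build_adjacency(states, objects, pairs, comp_self_loop=False)
row_c = build_adjacency(states, objects, pairs)
row_d = build_adjacency(states, objects, pairs, link_state_object=True)
```

In this toy graph, the node for old dog is connected to old and dog in all three variants, gains a self-loop from row (c) onwards, and in row (d) old and dog are also linked to each other. A GCN would typically normalize this matrix before propagation, which the sketch omits.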
4.2 Open World CZSL
Comparing with the State of the Art. As shown in Table IV, the first clear effect of moving to the more challenging OW-CZSL setting is the severe decrease in performance for every method. The largest decrease in performance is on the best unseen metric, due to the presence of a large number of distractors. As an example, on MIT-States LE+ goes from 20.1% to 2.5% of best unseen accuracy and even the previous state of the art, CGE, loses 22.9%. Similarly, on C-GQA the best seen accuracy drops by 6.8 for SymNet, for CompCos and even the end-to-end trained CGE and Co-CGE lose 11.8 and respectively.
Compared to the baselines, our models, Co-CGE and Co-CGE are more robust to the presence of distractors, e.g. for the best HM on MIT-States, Co-CGE surpasses CompCos by 1.2 with the same feature extractor. This demonstrates the importance of explicitly modeling the different feasibility of the compositions in the whole compositional space, injecting them within the objective function and the graph connections. Similar considerations apply to Co-CGE, which achieves the best results on MIT-States and C-GQA with respect to all metrics. Remarkably, it achieves an AUC of 0.76 on C-GQA, which surpasses the closed world results of early CZSL methods, such as AoP (0.3) and LE+ (0.6). Over CompCos, the improvements are also clear in the accuracy on unseen classes (+1.8 on MIT-States, +0.8 on C-GQA) and harmonic mean (+1.8 on MIT-States, and on C-GQA). On UT Zappos the performance gap with the other approaches is more nuanced. This is because the vast majority of compositions in UT Zappos are feasible, thus it is hard to see a clear gain from injecting the feasibility scores into the training procedure. Nevertheless, Co-CGE achieves the best HM (40.8) and AUC (23.3). A good closed world model also performs well in this scenario, as shown by the performance of CGE, which achieves the best accuracy on unseen classes (47.7). However, the overall results being lower than in the closed world setting indicates that OW-CZSL remains an open challenge.
Finally, we observe that, because the C-GQA search space is huge (almost 400k compositions), achieving good OW-CZSL performance is extremely hard compared to the much smaller search spaces of MIT-States (almost 30k compositions) and UT Zappos (192). Table IV shows two interesting trends. The first is the importance of end-to-end training, with CGE and Co-CGE surpassing all other methods but Co-CGE. The second is that the results of SymNet are comparable to Co-CGE in AUC, and SymNet is the only method modeling states and objects separately at the classification level. Therefore, as the search space grows, it may be beneficial to predict each primitive independently to get an initial estimate of the composition present in the image.
In the following experiments, we use the MIT-States validation set to analyze the different choices for our feasibility scores. In particular, we investigate the impact of the feasibility-based margins within the loss functions (starting from CompCos), how they are computed, and how they should be injected within the graph connections. Finally, we check the potential benefit that limiting the output space during inference through feasibility-based masking may bring to different models.
Importance of the feasibility-based margins. We check the impact of including all compositions in the objective function (without any margin) and of including the feasibility margin but without any warmup strategy for .
As the results in Table V (Top) show, including all unseen compositions in the cross-entropy loss without any margin (i.e. ) increases the best unseen accuracy by 4% and the AUC by 0.5. This is a consequence of the training procedure: since we have no positive examples for unseen compositions, including unseen compositions during training makes the network push their representation far from seen ones in the shared embedding space. This strategy regularizes the model in the presence of a large number of unseen compositions in the output space. Note that this problem is peculiar to the open world scenario, since in the closed world the number of seen compositions is usually larger than the number of unseen ones. The CompCos () model performs worse than CompCos on seen compositions, as the loss treats all unseen compositions equally.
Results increase if we include the feasibility scores during training (i.e. ). The AUC goes from 1.7 to 2.0, with consistent improvements over the best seen and unseen accuracy. This is a direct consequence of using the feasibility to separate the unseen compositions from the unlikely ones. In particular, this brings a large improvement on S and moderate improvements on both U and HM.
Finally, linearly increasing (i.e. warmup ) further improves the harmonic mean due to both i) the improved margins estimated from the updated primitive embeddings and ii) the gradual inclusion of these margins in the objective. This strategy improves the balance between seen and unseen classes (as shown by the better harmonic mean) while slightly enhancing the discriminability on seen and unseen compositions in isolation.
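The interplay between the margin and its warmup can be sketched as follows. This is a hedged illustration in plain Python, not the objective of Eq. (11), which is not reproduced in this section. It assumes a cosine classifier with a temperature, a scalar weight `alpha` that scales the feasibility scores, and a sign convention in which less feasible unseen compositions receive a larger additive margin; the names `warmup_alpha` and `margin_cross_entropy` and all numeric values are hypothetical.

```python
import math


def warmup_alpha(epoch, warmup_epochs, alpha_max):
    """Linearly increase the margin weight from 0 to alpha_max."""
    return alpha_max * min(1.0, epoch / warmup_epochs)


def margin_cross_entropy(scores, target, feasibility, seen_mask, alpha,
                         temperature=0.05):
    """Cross-entropy over all compositions with margins on unseen ones.

    scores: cosine similarities between one image and every composition.
    feasibility: per-composition score, ignored for seen compositions.
    Assumed sign convention: margin = -alpha * feasibility, so that a low
    feasibility inflates the logit and the loss pushes that class away.
    """
    logits = []
    for score, feas, seen in zip(scores, feasibility, seen_mask):
        margin = 0.0 if seen else -alpha * feas
        logits.append((score + margin) / temperature)
    top = max(logits)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[target]


# Toy example: one seen composition (the target) and two unseen ones.
scores = [0.8, 0.3, 0.1]
feasibility = [0.0, 0.5, -1.0]
seen_mask = [True, False, False]

plain = margin_cross_entropy(scores, 0, feasibility, seen_mask, alpha=0.0)
with_margin = margin_cross_entropy(scores, 0, feasibility, seen_mask,
                                   alpha=warmup_alpha(20, 10, 0.4))
```

With `alpha=0.0` the function reduces to a plain cross-entropy over the whole compositional space, which corresponds to the no-margin variant discussed above. With a positive `alpha` the unfeasible third composition contributes more to the loss, so training is encouraged to move it further from the image embedding.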
Effect of Primitives. We can use either objects as in Eq. (7), states as in Eq. (8), or both as in Eq. (9) to estimate the feasibility score for each unseen composition. Here we show the impact of these choices on the results in Table V (Middle).
We observe that computing feasibility on the primitives alone is already beneficial (achieving an AUC of 1.9) since dominant states like caramelized and objects like dog provide enough information to transfer knowledge. In particular, computing the scores starting from state information () improves the best U and HM. Using similarities among objects () performs well on S while achieving slightly lower performance on U and HM.
Introducing both states and objects gives the best result, with an AUC of 2.1, as it combines the best of both. Merging object and state scores through their maximum () maintains the higher seen accuracy of the object-based scores, with a trade-off between the two on unseen compositions. However, merging object and state scores through their average leads to the best performance overall, with a significant improvement on unseen compositions (almost 1%) as well as on the harmonic mean. This is because the model is less prone to assigning either too low or too high feasibility scores to the unseen compositions, smoothing their scores. As a consequence, more meaningful margins are used in Eq. (11) and thus the network achieves a better trade-off between discriminating the seen compositions and separating them from unseen compositions (and distractors) in the shared embedding space.
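The three ways of scoring an unseen composition can be illustrated with a small sketch. It is written under assumptions, since Eqs. (7) to (9) are not reproduced in this section: the object-based score is taken as the highest cosine similarity between the queried object and any other object seen with the same state, the state-based score is defined symmetrically, and a composition with no comparable seen pair falls back to -1. All function names and the two-dimensional embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def object_feasibility(state, obj, seen_pairs, obj_emb):
    """Best similarity between `obj` and other objects seen with `state`."""
    others = [o for s, o in seen_pairs if s == state and o != obj]
    # Assumption: fall back to -1 (least similar) when nothing is comparable.
    return max((cosine(obj_emb[obj], obj_emb[o]) for o in others), default=-1.0)


def state_feasibility(state, obj, seen_pairs, state_emb):
    """Best similarity between `state` and other states seen with `obj`."""
    others = [s for s, o in seen_pairs if o == obj and s != state]
    return max((cosine(state_emb[state], state_emb[s]) for s in others),
               default=-1.0)


def feasibility(state, obj, seen_pairs, state_emb, obj_emb, mode="mean"):
    """Merge the two scores through their maximum or their average."""
    f_obj = object_feasibility(state, obj, seen_pairs, obj_emb)
    f_state = state_feasibility(state, obj, seen_pairs, state_emb)
    if mode == "max":
        return max(f_obj, f_state)
    return 0.5 * (f_obj + f_state)


# Toy embeddings, made up for this example.
obj_emb = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "house": [0.0, 1.0]}
state_emb = {"small": [1.0, 0.0], "tiny": [0.6, 0.8], "old": [0.0, 1.0]}
seen_pairs = [("small", "cat"), ("tiny", "dog"), ("old", "house")]

f_mean = feasibility("small", "dog", seen_pairs, state_emb, obj_emb)
f_max = feasibility("small", "dog", seen_pairs, state_emb, obj_emb, mode="max")
```

In this toy case small dog is scored through the similarity of dog to cat (which was seen as small) and of small to tiny (which was seen with dog). The average sits between the two primitive scores, which matches the smoothing effect described above.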
Effect of Graph Connections. For each graph connection, we have two choices: either keeping it unaltered (i.e. value 1) or replacing it with the feasibility scores (), as in Eq. (13). In Table V (bottom), we analyze these choices for the symmetric state-object connections (so), the connections from primitives to compositions (pc) and vice versa (cp).
A clear observation is the importance of keeping the influence of primitives on compositions (cp) unaltered (equal to 1). We conjecture that, since most unseen compositions will get feasibility scores lower than 1, their representations would be updated mainly through their self-connection. However, the seen compositions, for which we have supervision, fully exploit the representations of their primitives. This causes the representations of unseen compositions to have an inherent distribution shift, being i) poorer with respect to seen ones and ii) less discriminative. This is clearly shown in the table by the low HM and best unseen accuracy of Co-CGE whenever cp is different from 1. Keeping the cp connections at 1 allows the model to keep its discrimination capability on unseen classes and achieve a better trade-off between accuracy on seen and unseen classes, with an average improvement of 0.63 in AUC.
For the other two types of connections, (so) and (pc), the best results are achieved when their weights are set as in Eq. (13), using the feasibility scores. This allows the model to achieve the best AUC (2.5), HM (12) and seen accuracy (29.5) while being slightly inferior to the top method in best unseen accuracy (-0.4). Note that the advantage of this combination is also consistent for cp set through .
As a final experiment, we check the benefits of fine-tuning the whole representation end-to-end with the best performing combination (so and pc set through the feasibility scores, cp equal to 1). As expected, this leads to the best results for all metrics. In particular, the learned representations are more discriminative for both seen and unseen compositions, achieving an improvement of on both best seen and best unseen accuracies and a consequent gain of in HM.
Effect of Hard Thresholding. In Section 3, we described that the feasibility scores can also be used during inference to mask the predictions, i.e. using as prediction function Eq. (10) in place of Eq. (1), with the threshold computed empirically. We study the impact of this masking computed either with CompCos or with Co-CGE feasibility scores on Co-CGE and the closed world models Co-CGE, LE+, TMN, SymNet and CompCos, showing the results in Table VI. Note that, since seen compositions are not masked, best S performances do not change.
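The masking step can be sketched as follows. This is a minimal illustration in plain Python rather than the prediction function of Eq. (10): the function name `masked_prediction`, the toy scores and the threshold value are assumptions made for the example.

```python
def masked_prediction(scores, feasibility, seen_mask, threshold):
    """Return the index of the best-scoring composition after masking.

    Unseen compositions whose feasibility is below `threshold` are excluded;
    seen compositions are never masked.
    """
    best_idx, best_score = None, float("-inf")
    for idx, (score, feas, seen) in enumerate(
            zip(scores, feasibility, seen_mask)):
        if not seen and feas < threshold:
            continue  # filtered out as an unlikely distractor
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# Toy example with made-up values.
compositions = ["old dog", "runny cat", "small cat"]
scores = [0.40, 0.55, 0.50]        # compatibility with the test image
feasibility = [1.0, -0.3, 0.6]     # only used for unseen compositions
seen_mask = [True, False, False]

unmasked = masked_prediction(scores, feasibility, seen_mask, float("-inf"))
masked = masked_prediction(scores, feasibility, seen_mask, threshold=0.0)
print(compositions[unmasked], "->", compositions[masked])
```

Without masking the distractor runny cat wins; with a threshold of 0 it is removed from the output space and the prediction falls back to small cat. Since the seen composition is never filtered, accuracy on seen classes is unaffected by the mask itself.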
We observe that applying either our or the CompCos feasibility-based binary masks on top of all closed world approaches is beneficial. This is because the masks are able to filter out the less feasible compositions, rather than simply restricting the output space. In particular, the CompCos-based mask brings an average improvement of 0.3 in AUC, 1.4% in HM and 1.6% in best unseen accuracy, while the Co-CGE-based mask improves the AUC of the base model by 0.5 on average, the harmonic mean by 2.4 and the best unseen accuracy by 2.9. This suggests that Co-CGE estimates more precise feasibility scores than CompCos, and that it can better filter out compositions from the output space.
Interestingly, masking largely improves Co-CGE (+1.0 AUC), despite it being the best performing closed world method. On the other hand, SymNet does not benefit from CompCos-based masking (e.g. +0.1 HM) and benefits only marginally from the Co-CGE mask (i.e. +0.8% on unseen accuracy). This suggests that masking the output space with feasibility scores is more beneficial for models predicting objects and states together at inference time (e.g. LE+, Co-CGE) than for those predicting them in isolation (e.g. SymNet).
Finally, the improvements are minimal (i.e. +0.1% AUC) for our open world Co-CGE model, as it is already robust to the presence of distractors. Since masking requires tuning an additional hyperparameter, we do not apply it to Co-CGE and Co-CGE in the experiments.
4.3 Qualitative results
Influence of Feasibility Scores on Predictions. We analyze the reasons for the improvements of Co-CGE over Co-CGE, by showing output examples of both models on sample images of MIT states (top) and C-GQA (bottom). We compare predictions on unseen compositions for samples where the closed world model is “distracted” while the open world model predicts either the correct class label (green) or an incorrect but reasonable one (red) in Figure 1(a).
We observe that the closed world model is generally not capable of dealing with the presence of distractors. For instance, there are cases where the object prediction is correct (e.g. coiled elephant, wilted silk, full dog, rusty snake, viscous cheese, open television) but the associated state is not only wrong but also makes the composition unfeasible. In other cases, the state prediction is correct but the associated object is not, either because the composition is unfeasible (e.g. chocolate bag) or because the model is not able to discriminate the correct one (red parachute, yellow cereal box). In some cases, both state and object predictions are incorrect for Co-CGE, being either unfeasible (e.g. steaming necklace, glazed blanket, blue jeans) or confused in the large open world search space (molten coin, old truck). All these problems are less severe in our full Co-CGE model, since injecting the feasibility of each composition within the objective and the graph connections helps both in reducing the chance of predicting implausible distractors and in improving the structure of the compositional space, better discriminating the constituent visual primitives. This holds even when the predictions of Co-CGE are wrong, as they are either close to the ground truth (i.e. small snake, black television, sliced chicken) or refer to another concept in the image (i.e. young girl).
Retrieving compositions in the open world. In the OW-CZSL scenario there is no limitation on the output space of the model. Thus, we can predict arbitrary state-object compositions at test time and, possibly, retrieve the closest images to arbitrary concepts. Here we explore the latter scenario and check which images are the closest to the embeddings of random compositions of states and objects not present in the original datasets. The results are shown in Figure 1(b) for MIT-States (top) and C-GQA (bottom). When the composition is feasible (e.g. calm elephant, cut orange) the model retrieves an image depicting the exact concept. When the composition is inexact for the real world (e.g. closed vs folded bike, scratched vs broken tomato, creased vs wrinkled dog) the model still retrieves reasonable images, showing its ability to capture the underlying effect that a state is supposed to have on the particular object it refers to. Finally, when the composition has an unclear meaning, the model tends to retrieve images containing both state and object, even if present in isolation. This is the case of asphalt bench, where the bench is close to an asphalt road, of blue coffee, showing coffee in a blue cup, and of porcelain table, where the image shows a table with porcelain crockery.
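Retrieval of this kind amounts to ranking image embeddings by their similarity to the embedding of the queried composition. The sketch below assumes cosine similarity in the shared embedding space; the function name `retrieve`, the image identifiers and the two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, image_embeddings, k=1):
    """Return the names of the k images closest to the query embedding."""
    ranked = sorted(
        image_embeddings,
        key=lambda name: cosine(query_embedding, image_embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings standing in for projected image features.
image_embeddings = {
    "img_a": [0.9, 0.1],
    "img_b": [0.1, 0.9],
    "img_c": [0.7, 0.7],
}
query = [1.0, 0.0]  # embedding of an arbitrary state-object composition

top2 = retrieve(query, image_embeddings, k=2)
```

Because the query is an embedding rather than a class index, the same routine works for compositions that never appear in the dataset, which is what the open world setting allows.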
|Object (MIT-States)||Most Feasible (Top-3)||Least Feasible (Bottom-3)|
|cat||huge, tiny, small||cloudy, browned, standing|
|tomato||diced, peeled, mashed||full, fallen, heavy|
|house||ancient, painted, grimy||mashed, wilted, browned|
|banana||diced, browned, fresh||dull, barren, unpainted|
|knife||blunt, curved, wide||viscous, standing, runny|
|Object (C-GQA)||Most Feasible (Top-3)||Least Feasible (Bottom-3)|
|dog||drinking, blond, standing||ridged, jagged, dull|
|wine||pink, black, red||clumped, park, feathered|
|fruit||ripe, sliced, cooked||vacant, crouched, angled|
|jacket||short sleeved, tight, closed||analog, hard, rubber|
|window||decorative, beige, outdoor||scrambled, greasy, boiled|
Discovering Most and Least Feasible Compositions. The biggest advantage of our method is its ability to estimate the feasibility of each unseen composition, to later inject these estimates into the learning process and the model. Our procedure described in Section 3.2 needs to be robust enough to model which compositions should be more feasible in the compositional space and which should not, isolating the latter in the shared embedding space. We highlight that here we are focusing mainly on visual information to extract the relationships. This information can be coupled with knowledge bases (e.g. [liu2004conceptnet]) and language models (e.g. [wang2019language]) to further refine the scores.
Table VII shows the top-3 most feasible compositions and bottom-3 least feasible compositions given five randomly selected objects for MIT-States (top) and C-GQA (bottom). These object-specific results show a tendency of the model to relate feasibility scores to subgroups of classes. For instance, cooking states are considered unfeasible for standard objects (e.g. mashed house, boiled window), as are substance-specific states (e.g. runny knife). Similarly, states usually associated with substances are considered unfeasible for animals (e.g. runny cat). At the same time, sizes and actions are mostly linked with animals (e.g. small cat, drinking dog) while cooking states are correctly associated with food (e.g. diced tomato, sliced fruit).
Interestingly, in MIT-States the top states for knife are all present with blade as seen compositions, and in C-GQA the top states for dog are all present with animals as seen compositions (e.g. drinking cat, blond animal, standing cat). This shows that our model exploits the similarities between two objects to transfer these states, e.g. from blade to knife and from cat to dog, following Eq. (7). Furthermore, the state standing is considered unfeasible for cat in MIT-States while being feasible for dog in C-GQA. This is because the state standing has different meanings in the two datasets, i.e. it refers to buildings (e.g. standing tower) in MIT-States and to animals and persons (e.g. standing cat) in C-GQA. This highlights the strict dependency of the feasibility scores estimated by our model on the particular dataset, with the impossibility of capturing polysemy if the dataset does not account for it. These limitations can be overcome by integrating external information from knowledge bases [liu2004conceptnet] and language models [wang2019language].
In this work, we address the compositional zero-shot learning (CZSL) problem, where the goal is recognizing unseen compositions of seen state-object primitives. We focus on the open world compositional zero-shot learning (OW-CZSL) task, where all the combinations of states and objects could potentially exist. We propose a way to model the feasibility of a state-object composition by using the visual information available in the training set. This feasibility is independent of an external knowledge base and can be directly incorporated in the optimization process. We propose a novel model, Co-CGE, that models the relationship between primitives and compositions through a graph convolutional neural network. Co-CGE incorporates the feasibility scores in two ways: as margins for a cross-entropy loss and as weights for the graph connections. Experiments show that our approach is either superior or comparable to the state of the art in closed world CZSL while improving on it by a margin in the open world setting.
This work has been partially funded by the ERC (853489-DEXIM) and the DFG (2064/1–Project number 390727645).