Humans inherently possess the ability to compose their knowledge of known entities and generalize to novel concepts. Given a phrase such as green horse, people can immediately combine the known state green with the known object horse, although they have never seen such a thing. To equip an AI system with a similar ability, Compositional Zero-Shot Learning (CZSL) [misra2017red] has been proposed, which aims to recognize unseen compositions formed from a set of seen states and objects. In the CZSL setting, each composition comprises two components, namely a state and an object, and the compositions in the training and test sets are disjoint.
To infer unknown concepts such as green horse, CZSL aims to understand the meaning of the state green and the object horse after being trained on other compositional concepts that separately contain green or horse, e.g., green grass and young horse. The challenge of the task lies in the degree of interaction between state and object, which cannot be quantified and gives rise to varying contextuality across different state-object combinations. For instance, we cannot equate the state old in old car with that in old tiger, since they are fundamentally distinct in visual appearance, which greatly hinders the recognition of novel compositions.
Existing mainstream methods [li2020symmetry, misra2017red, purushwalkam2019task] focus on converting this problem into a general supervised recognition task by training two classifiers, one for states and one for objects. They aim to predict state and object directly from the original visual features, ignoring their entanglement. As a result, the classifiers cannot capture discriminative state and object features, which potentially limits recognition accuracy. Other methods [nan2019recognizing, nagarajan2018attributes] aim to learn a common embedding space into which both compositions and visual features are projected, and then reduce the distance between them (e.g., the Euclidean distance). However, these methods regard compositions only as whole entities and neglect the domain gap between training and testing compositions, so they are easily confused by similar images from unseen compositions (e.g., young cat and young tiger). Therefore, it is vital to excavate discriminative prototypes of state and object that separate the interaction between them, and to consider the domain transfer between training and testing samples.
To address the problems mentioned above, we propose a Siamese Contrastive Embedding Network (SCEN) for recognizing novel compositions, which aims to excavate discriminative prototypes of state and object separately, as shown in Fig. 1. Specifically, we first project the visual features into state-based and object-based contrastive spaces to obtain the prototypes of state and object. Then, to make the prototypes discriminative through contrastive constraints, we set up two specific databases, named the State-constant and Object-constant databases, as sources of positive samples. In addition, a shared irrelevant database is built as a negative sample set, which is embedded into both contrastive spaces. Benefiting from this learning paradigm, the proposed model can excavate discriminative prototypes that represent the corresponding components. Furthermore, considering that the distributions of seen and unseen compositions differ, we present a State Transition Module (STM), which generates virtual but reasonable samples to augment the diversity of the training data. In this way, the domain gap between the seen and unseen composition sets can be mitigated.
To sum up, our main contributions are as follows:
We propose a novel Siamese Contrastive Embedding Network (SCEN) to excavate prototypes of state and object for successfully recognizing both seen and unseen compositions.
We present a State Transition Module (STM) to produce virtual samples and augment the diversity of training compositions, guiding the model to generalize to compositions that do not appear during training and alleviating the performance drop when transferring to unseen compositions.
Comprehensive experimental results on three benchmark datasets demonstrate the effectiveness of our proposed approach, which outperforms the state-of-the-art CZSL methods.
2 Related Work
Compositional Zero-Shot Learning. The goal of Compositional Zero-Shot Learning [mikolov2013distributed, misra2017red, nagarajan2018attributes, li2020symmetry, purushwalkam2019task, gu2021class, wei2019adversarial] is to learn the compositionality of objects and their states from the training data and to generalize to unseen combinations of these primitives. Compared with typical Zero-Shot Learning [lampert2013attribute, wei2020lifelong, li2021generalized], which utilizes inherent semantic descriptions or attribute vectors to recognize unseen instances, CZSL exploits transferable knowledge through two compositional parts used as image labels: objects and states. There are two mainstream directions. The first is inspired by [biederman1987recognition, hoffman1984parts] and learns a single classifier for recognition together with a transformation module [misra2017red]. In addition, [nagarajan2018attributes] models each state as a linear transformation of objects, [yang2020learning] aims to learn disentangled and compositional primitives hierarchically, and [li2020symmetry] models objects to be symmetric under attribute transformations. Other methods try to learn a joint representation of the state-object compositions [atzmon2020causal, purushwalkam2019task, wang2019task]; some of them learn a modular network that is rewired for new compositions conditioned on each composition [purushwalkam2019task, wang2019task]. Recently, GCN [naeem2021learning] was proposed to utilize a causal graph to model the relationship between state and object. However, these methods ignore the interaction between state and object, which has a negative influence on composition recognition.
As for [atzmon2020causal], the authors tackle the CZSL problem through a causal graph in which the latent features of the primitives are independent of each other. However, this approach also neglects the discriminant analysis of state and object, and thus cannot excavate discriminative primitives for classification. In addition, there still exists a domain gap between seen and unseen compositions, even though they are made up of the same states and objects, which potentially limits the performance of the model.
Contrastive Learning. Inspired by noise contrastive estimation [gutmann2010noise, mnih2013learning, zhao2021graph], contrastive learning has attracted much attention and has led to major advances in self-supervised representation learning. An effective way to improve contrastive learning is to employ large numbers of negative examples and to design more semantically meaningful augmentations that create different views of images. SimCLR [chen2020simple] uses two data augmentation paths and a learnable non-linear transformation to train an encoder with a large batch by pulling together the feature embeddings of the same image. Momentum Contrast (MoCo) [he2020momentum] enables building a large and consistent dictionary on-the-fly and transfers well to downstream tasks. Aiming at improving generalization in real domains, a contrastive synthetic-to-real generalization model [chen2021contrastive] was proposed to prevent overfitting to the synthetic domain by leveraging pre-trained ImageNet knowledge. More recently, supervised contrastive learning [khosla2021supervised] was proposed to extend the self-supervised batch contrastive approach to the fully supervised setting, which can effectively leverage label information.
Building on the effectiveness of contrastive learning, this study differs from existing works in the following ways. First, we propose a Siamese Contrastive Embedding Network (SCEN) to excavate the discriminative prototypes of state and object separately. Second, we construct two contrastive spaces and utilize contrastive constraints to enforce the prototypes of state and object to be discriminative and generalizable. Third, we present a State Transition Module (STM) that produces virtual compositions during training to improve the generalization of the proposed model.
The goal of CZSL is to recognize the novel compositional samples whose labels are composed of a state (e.g., old) and an object (e.g., tiger). This is particularly challenging since various states significantly change the visual appearance of an object, which hinders the performance of the classifiers.
We propose a novel formulation to tackle the problem, namely Siamese Contrastive Embedding Network (SCEN), which constructs two independent embedding spaces and utilizes contrastive losses to guide corresponding feature extractors to excavate discriminative prototypes of state and object separately. The overview of our approach is shown in Fig. 3.
3.1 Problem Definition
In the CZSL setting, each image is described by two primitive concepts, i.e., a state and an object. Given $\mathcal{S}$ and $\mathcal{O}$ as the sets of states and objects, we can compose a set of state-object pairs, i.e., $\mathcal{Y} = \mathcal{S} \times \mathcal{O}$. We denote the training set as $\mathcal{T} = \{(x, y) \mid x \in \mathcal{X}, y \in \mathcal{Y}_s\}$, where $\mathcal{X}$ is the image set known in training and $\mathcal{Y}_s$ is a subset of $\mathcal{Y}$ containing the corresponding labels. In the traditional Zero-Shot Learning setting, training and testing labels are disjoint, i.e., $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$, where $\mathcal{Y}_s$ and $\mathcal{Y}_u$ are the subsets of $\mathcal{Y}$ that are seen and unseen in training. In this case, the model only needs to predict compositions drawn from $\mathcal{Y}_u$ in testing [misra2017red]. In this paper, we follow the Generalized ZSL setting [xian2018zero], where testing samples can be drawn from either seen or unseen compositions, i.e., $\mathcal{Y}_s \cup \mathcal{Y}_u$, which is more challenging on account of the larger prediction space and the dominant bias toward seen compositions [purushwalkam2019task]. To sum up, CZSL aims to learn a mapping function $\mathcal{X} \rightarrow \mathcal{Y}$ that is trained on $\mathcal{T}$, in which each label $y = (s, o)$ is composed of two primitive concepts drawn from $\mathcal{S}$ and $\mathcal{O}$.
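The label spaces in this problem definition can be illustrated with a small sketch. The vocabulary and the seen/unseen split below are hypothetical toy values, and treating every non-seen pair as unseen is a simplifying assumption (real benchmarks use a fixed subset of feasible pairs).

```python
from itertools import product

# Hypothetical toy vocabulary; the names are illustrative only.
states = ["old", "young", "green"]
objects = ["car", "tiger", "horse"]

# Full composition space: every (state, object) pair.
all_pairs = set(product(states, objects))

# Pairs observed in training; every other pair is treated as unseen here.
seen_pairs = {
    ("old", "car"),
    ("young", "tiger"),
    ("green", "car"),
    ("young", "horse"),
}
unseen_pairs = all_pairs - seen_pairs

# Traditional ZSL predicts only over unseen pairs, whereas generalized
# ZSL predicts over the union of seen and unseen pairs.
zsl_label_space = unseen_pairs
gzsl_label_space = seen_pairs | unseen_pairs
```

In this toy split, green horse is an unseen pair even though green and horse each occur in some seen pair, which is the situation CZSL targets.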
3.2 Siamese Contrastive Embedding Network
Because the state and object are entangled in an image, which affects the final classification, we design a Siamese Contrastive Embedding Network (SCEN) to better capture the discriminative prototypes of state and object, which can effectively improve recognition accuracy. The overall architecture is illustrated in Fig. 2. The SCEN is composed of a feature extractor , a State-Specific Encoder , an Object-Specific Encoder , and a State Transition Module (STM).
Specific database. Consider a training sample such as sliced apple in Fig. 1. From the training set, we know that the object apple appears in various states, such as caramelized, and that the state sliced does not modify only a single object, e.g., sliced banana. Therefore, sample points with such overlapping information may be related. Based on this idea, we set up three specific databases , , and to excavate discriminative state and object factors. is the set of compositions consisting of a constant state and various objects, named the State-constant database, while the Object-constant database is defined as the set of compositions made up of various states and a constant object. In addition, is the set of compositions formed from various states and objects that both differ from the state and object of the input instance. For instance, given an input image which consists of a state and an object , i.e., , the is denoted as:
Analogously, the State-specific database is denoted as:
And the irrelevant database is denoted as:
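As a rough illustration of how the three databases could be assembled for one anchor sample, consider the sketch below. The record layout `(image_id, state, object)`, the function name `build_databases`, and the choice to exclude the anchor's own composition from the positive sets are assumptions made for this example, not details given in the text.

```python
from typing import List, Tuple

# Hypothetical record layout: (image_id, state, object).
Sample = Tuple[str, str, str]


def build_databases(samples: List[Sample], state: str, obj: str):
    """Split training samples relative to an anchor labelled (state, obj)."""
    # Same state as the anchor, different object: state-constant database.
    state_constant = [s for s in samples if s[1] == state and s[2] != obj]
    # Same object as the anchor, different state: object-constant database.
    object_constant = [s for s in samples if s[2] == obj and s[1] != state]
    # Neither state nor object matches the anchor: irrelevant database.
    irrelevant = [s for s in samples if s[1] != state and s[2] != obj]
    return state_constant, object_constant, irrelevant


toy_samples = [
    ("img0", "sliced", "apple"),
    ("img1", "sliced", "banana"),
    ("img2", "caramelized", "apple"),
    ("img3", "ripe", "banana"),
]
d_state, d_object, d_irrelevant = build_databases(toy_samples, "sliced", "apple")
```

For the anchor sliced apple, sliced banana lands in the state-constant set, caramelized apple in the object-constant set, and ripe banana in the irrelevant set.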
Siamese Contrastive Space. With the specific databases in place, the visual feature , extracted by the feature extractor , is separately projected into two independent contrastive embedding spaces to extract the prototypes of state and object :
We expect and to each contain information that is useful to the corresponding classifier for composition recognition. Therefore, we use contrastive learning as a constraint to extract discriminative prototypes of state and object. However, a single general contrastive loss cannot extract both discriminative representations simultaneously, because state and object interact. To address this problem, we define a state-based contrastive loss and an object-based contrastive loss as constraints that force the model to extract discriminative primitives.
To be specific, we set as an anchor in the state-based contrastive space. Meanwhile, selected from is set as a positive point, while negative samples selected from the are denoted as . We aim to decrease the distance between the anchor and the positive instance , while increasing the distance between and each negative point , in order to extract a discriminative prototype of the state. Therefore, the state-specific loss function in the state-based contrastive space can be calculated as follows:
where is the temperature parameter for the contrastive embedding and is the number of negative samples. The larger we set, the longer the training process takes. However, a larger number of negative samples encourages the State-Specific Encoder to excavate a more representative state prototype, which generalizes better to novel compositions.
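A minimal sketch of a contrastive loss of this kind is given below, assuming an InfoNCE-style form with cosine similarity. The exact similarity measure and normalization used in this work are not recoverable from the text, so the functions `cosine` and `contrastive_loss` and the default temperature are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Assumed InfoNCE-style loss:
    -log( exp(sim(a, p) / t) / (exp(sim(a, p) / t) + sum_k exp(sim(a, n_k) / t)) )
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# The anchor matches the positive and differs from the negatives: small loss.
loss_aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# The anchor matches a negative instead of the positive: large loss.
loss_confused = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Under the same assumptions, the object-based loss would reuse `contrastive_loss` with the object prototype as anchor, a positive from the object-constant database, and the same shared negatives. Each additional negative adds one similarity evaluation per anchor, which is one way to see why more negatives lengthen training.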
Since the two spaces are Siamese, the object-based contrastive space is handled in the same way as the state-based one: we denote embedded from the same input visual features as an anchor in the object-based contrastive space, and as negative points embedded from the irrelevant database . Therefore, the object-specific loss function can be defined in the object-based contrastive space as:
where is the temperature parameter for the contrastive embedding and is the number of negative samples.
Considering that the prototypes of state and object should be optimized in the same direction, we share the irrelevant samples between the two contrastive spaces as negative points, which effectively avoids unbalanced optimization between and .
Finally, we introduce classification losses to guide classifiers to recognize prototypes of state and object, respectively, which is formulated as:
where and are fully connected layers trained with the cross-entropy loss to classify state and object, respectively. The prototypes of state and object can be further preserved in the composition under the supervision of the classification losses.
Thus, the total loss function in Siamese Contrastive Space can be formulated as:
State Transition Module. To make SCEN generalize to novel samples that do not appear in the training stage, we produce virtual samples to augment the diversity of training compositions, alleviating the domain gap between training and testing data. To this end, we propose a State Transition Module (STM), which consists of a State-Specific Encoder , an Object-Specific Encoder , a Generator and a Discriminator [goodfellow2014generative]. The architecture is shown in Fig. 3.
Consider two objects, namely apple and banana. From the training set, we know that apple can be ripe and banana can be caramelized, since each of these compositions appears in at least one training image. However, the compositions caramelized apple and ripe banana exist in the testing set but not in the training set. Therefore, we can conclude that an object may form new combinations with various states. Based on this observation, we use the Generator to produce virtual compositions from various states and a given object. However, generation with random combinations can produce many irrational compositions, which would actually widen the domain gap between seen and unseen data. For instance, cored banana and squished apple do not appear in the testing set and may not even exist in reality. Thus, we design a Discriminator to distinguish whether a composition is produced by the Generator .
In particular, the Generator takes as input the prototype of an object and another state to generate virtual compositions that never appear in training. Then, the Discriminator takes both real and generated samples as input and determines which are produced by the Generator . The Generator and the Discriminator can be optimized by the following adversarial objective:
where . The Generator tries to minimize the objective, while the Discriminator tries to maximize it.
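For orientation, the standard minimax objective of [goodfellow2014generative], adapted to this setting, would read as follows. The symbols are assumed for illustration ($G$ and $D$ for the Generator and Discriminator, $x$ for a real sample, $p_o$ for an object prototype, and $s'$ for the substituted state), and the exact form used in this work may differ.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x}\big[\log D(x)\big]
+ \mathbb{E}_{p_o,\, s'}\big[\log\big(1 - D(G(p_o, s'))\big)\big]
```

The first term rewards the Discriminator for accepting real samples, and the second rewards it for rejecting generated ones, which is why the Generator minimizes the same quantity that the Discriminator maximizes.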
The purpose of improving the performance of and is to generalize to novel compositions in testing, but the generated samples do not have labels as supervision. Thus, we re-encode the generated samples to extract the prototypes of state and object again, and design a re-classification loss to constrain them, which is formulated as follows:
The total loss function of State Transition Module is formulated as follows:
Eventually, the final loss of our proposed framework is formulated as:
where and are the weighting coefficients to balance the influence of each loss function, respectively.
In the training stage, the model is trained to estimate the likelihood for image conditioned on state and object . Inference takes place over both and . During inference, the model embeds an image as and extracts the prototypes of state and object, i.e., and , with the trained and , respectively. Then the state and object of the most similar prototypes are taken as the prediction. The inference rule can be parameterized as:
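One plausible reading of this inference step is sketched below: every candidate composition is scored by combining a state score and an object score, and the best-scoring pair is returned. Combining the two scores by a product, the function name `predict_composition`, and the toy scores are assumptions for illustration, since the actual rule is not recoverable from the text.

```python
from typing import Dict, Iterable, Tuple


def predict_composition(
    state_scores: Dict[str, float],
    object_scores: Dict[str, float],
    candidates: Iterable[Tuple[str, str]],
) -> Tuple[str, str]:
    """Return the candidate (state, object) pair with the highest joint score."""
    return max(
        candidates,
        key=lambda pair: state_scores[pair[0]] * object_scores[pair[1]],
    )


# Hypothetical per-primitive scores for one test image.
state_scores = {"old": 0.7, "young": 0.3}
object_scores = {"car": 0.4, "tiger": 0.6}

# Generalized setting: seen and unseen pairs compete in one search space.
prediction = predict_composition(
    state_scores,
    object_scores,
    [("old", "car"), ("young", "tiger"), ("old", "tiger")],
)
```

Restricting `candidates` to the seen pairs alone changes the answer in this toy case, which illustrates why the size of the search space matters in the generalized setting.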
In this section, we first introduce the datasets and evaluation protocols. Then, we present the implementation details and compare the experimental results with other state-of-the-art methods. Finally, ablation studies demonstrate the effectiveness of the proposed method.
4.1 Experimental Setup
Datasets. Our proposed method is evaluated on three CZSL benchmark datasets, i.e., MIT-States [isola2015discovering], UT-Zappos [yu2014fine], and C-GQA [naeem2021learning].
MIT-States contains 53753 images, e.g., young cat and rusty bike, with 115 states and 245 objects in total. MIT-States has 1962 available compositions, of which 1262 state-object pairs are seen in the training stage, leaving 700 pairs unseen. UT-Zappos contains 50025 images of shoes, e.g., Cotton Sandals and Suede Slippers, with 16 states and 12 objects. In UT-Zappos, there are 116 state-object pairs, 83 of which are used for training, while the other 33 pairs are unseen in training. The C-GQA dataset contains over 9500 compositions, which makes it the most extensive dataset for CZSL. The detailed information of each dataset is summarized in Tab. 1.
HM is the harmonic mean of U and S. The best results are marked in bold.
Evaluation Metrics. We evaluate performance according to the prediction accuracy on seen and unseen compositions. Following the setting of [purushwalkam2019task], we report four metrics: 1) Seen, the accuracy when testing only on seen compositions; 2) Unseen, the accuracy when testing only on unseen compositions; 3) Harmonic Mean (HM) of the two accuracies, which balances performance between seen and unseen compositions; and 4) Area Under the Curve (AUC), which quantifies the overall performance on both seen and unseen compositions at different operating points with respect to the bias. Following [purushwalkam2019task, chao2016empirical], we use a calibration bias to trade off between the prediction scores of seen and unseen pairs. As the calibration bias varies, we obtain a seen-unseen accuracy curve from which the AUC is computed.
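The metrics above can be sketched as follows. The function names, the trapezoidal integration, and the choice of seen accuracy as the horizontal axis are assumptions for illustration; published evaluation code may differ in details such as how the bias values are sampled.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen and unseen accuracy (0 if both are 0)."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


def calibrated_prediction(scores: Dict[Pair, float], unseen: Set[Pair], bias: float) -> Pair:
    """Add `bias` to the score of every unseen pair, then take the argmax."""
    adjusted = {p: s + (bias if p in unseen else 0.0) for p, s in scores.items()}
    return max(adjusted, key=adjusted.get)


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under (seen_acc, unseen_acc) points, one per bias value."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: a larger bias flips the prediction toward the unseen pair.
toy_scores = {("old", "car"): 0.6, ("old", "tiger"): 0.5}
toy_unseen = {("old", "tiger")}
pred_no_bias = calibrated_prediction(toy_scores, toy_unseen, 0.0)
pred_with_bias = calibrated_prediction(toy_scores, toy_unseen, 0.2)
```

Sweeping the bias from very negative to very positive moves the operating point from favoring seen pairs to favoring unseen pairs; evaluating seen and unseen accuracy at each bias yields the points passed to `area_under_curve`.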
Implementation Details. For each image, we extract a 512-dimensional visual feature vector using ResNet-18 [he2016deep]
pre-trained on the ImageNet dataset [russakovsky2015imagenet]. We separately extract 300-dimensional feature vectors for both states and objects with and . The model is implemented in PyTorch [paszke2019pytorch] and optimized by the ADAM optimizer [kingma2014adam] on an NVIDIA GTX 1080Ti GPU. In addition, we set the learning rate to 0.00004, the batch size to 128, and the number of negative samples to 10. For the MIT-States dataset, training takes approximately 3 hours for 800 epochs. For the UT-Zappos dataset, it takes around 1 hour for 500 epochs. For C-GQA, it takes around 4 hours for 1000 epochs.
4.2 Comparison with State-of-the-Arts
We compare our results with the state of the art in Tab. 2 and show that our Siamese Contrastive Embedding Network (SCEN) outperforms all previous methods on the three benchmark datasets, including the recently proposed C-GQA dataset [naeem2021learning]. Our detailed observations are as follows.
Generalized CZSL performance. For the CZSL task, our SCEN achieves a test AUC of 5.3% on MIT-States, which is the best result. In addition, our method significantly boosts the state-of-the-art harmonic mean from 17.2% to 18.4%. For state and object prediction accuracy, we observe an improvement from 27.9% to 28.2% for states and from 31.8% to 32.2% for objects.
Similar observations hold on UT-Zappos, where we improve on the state of the art with an AUC of 32.0% compared to 28.7% for CompCos. In addition, our model achieves the best harmonic mean of 47.8%, an improvement of around 4.5% over CompCos.
Finally, on the recently proposed splits of the C-GQA dataset, shown in Tab. 3, we also achieve the best test AUC of 4.0%. C-GQA contains a large number of compositions (over 9.3k concepts), which makes recognition more complex than on MIT-States and UT-Zappos. The state and object accuracies of our method are 13.6% and 27.9%, which are both higher than the state of the art. In addition, our best seen and unseen accuracies (28.9% and 12.1%) are also the best results on this new dataset.
Given the significant improvement on three challenging datasets, we conclude that the proposed Siamese Contrastive Embedding Network (SCEN) not only effectively extracts discriminative prototypes of state and object, but also improves the robustness of the model when recognizing unseen compositions.
4.3 Ablation Study
We now conduct an ablation study to evaluate the effectiveness of the Siamese Contrastive Embedding Network. We take a single classification model as the base model, which trains two classifiers to recognize states and objects separately. We then train three variants by adding the Siamese Contrastive space, adding the State Transition Module (STM), or adding both of them, which are denoted as base model, base model + , and base model + + , respectively. As shown in Tab. 4, every variant performs better than the base model. The combination of the two components achieves the largest improvement, which demonstrates that the components complement each other and work together to improve the performance of SCEN. In addition, the result of the base model with shows that our model successfully excavates discriminative prototypes of states and objects, which benefits composition recognition. Meanwhile, the significant improvement after adding indicates the necessity of the proposed State Transition Module, which can effectively alleviate the domain gap between training and testing data. Finally, the base model with both and achieves the best result, which shows that the two components jointly improve performance without interfering with each other.
4.4 Qualitative Results
We show qualitative results for novel compositions with top-3 predictions in Fig. 4. The first three columns present examples where the top prediction matches the label. For the MIT-States and UT-Zappos datasets, we notice that the remaining two predictions of the model capture at least one factor correctly, which illustrates the strong performance of our method. For the more complex C-GQA dataset, our model gives the correct answer within its top-3 predictions, which shows the robustness of the proposed framework.
Meanwhile, the model can predict more combinations of unseen compositions rather than being limited to seen compositions, which effectively alleviates the domain gap between seen and unseen samples.
In addition, the last two columns show wrong predictions. For instance, in column 4, row 2, the image of Slippers is misclassified as Sandals or Boots. This is because the number of training compositions is large, so the negative sample dataset may not contain all negative samples, such as Sandals and Boots, and the model therefore pays less attention to pairs that are not included in . Besides, because the compositional class accuracy depends on the number of groups associated with an object in the label space, the model may focus on one aspect of the object's state and ignore the state given by the label. For example, in column 4, row 1, the image of the cat exhibits both texture and age, which are both present in the label space of the dataset and in the output of the model. However, the label for this image only contains its age.
4.5 Hyper-Parameter Analysis
We perform an experiment to demonstrate the effect of the weighting coefficients and for the loss functions and in our proposed model, as shown in Fig. 5 and Fig. 6. With different and settings, the Harmonic Mean (HM) and AUC change to a certain degree, which indicates that and strongly affect the performance of the entire model. Based on this observation, we set and fix , changing the value of to observe the performance on different datasets. Finally, on the MIT-States, UT-Zappos, and C-GQA datasets, we set , , and , respectively, which achieve the best results.
In this paper, we propose a novel Siamese Contrastive Embedding Network (SCEN) to excavate discriminative prototypes of state and object for the CZSL task. We first project the visual feature into two contrastive spaces, for which we set up state-constant and object-constant databases. Meanwhile, we design state-specific and object-specific loss functions as constraints, forcing the prototypes to contain discriminative information about the corresponding component. In addition, we design a State Transition Module (STM) to produce virtual but rational compositions that never appear in training, which effectively augments the diversity of the training data. The proposed modules yield a robust model that can excavate prototypes for seen samples and generalize to novel compositions, where linear softmax classifiers can be trained to recognize compositions from both seen and unseen instances. The comparison and ablation experiments demonstrate that the proposed CZSL framework achieves state-of-the-art performance on three challenging datasets.
Our work was supported in part by the National Natural Science Foundation of China under Grant 62132016, Grant 62171343, Grant 62071361, in part by Key Research and Development Program of Shaanxi under Grant 2021ZDLGY01-03, and in part by the Fundamental Research Funds for the Central Universities ZDRC2102.