Image classification methods have advanced significantly in the past few years. This progress has largely been driven by the availability of large amounts of labelled data per class. However, data gathering can be time-consuming and expensive. Further, many rare classes may not have sufficient training data. This has led to the creation of “Zero-Shot Learning” (ZSL) methods, which aim to leverage other information, typically natural language descriptions of classes, to learn about classes with little or no directly labelled data available. Recent generative ZSL methods have gone further; instead of only classifying unseen classes, they also aim to generate samples from unseen classes [ijcai2017-246, Yizhe_ZSL_2018, xian2018feature, long2017zeroshot, 7907197, Mohamed_Creative_2019].
Despite recent advances in the field of generative ZSL, there are still significant challenges. Generative ZSL methods do not guarantee that the generated visual examples of unseen classes deviate meaningfully from seen classes. That is, there is a risk that the generated images are too similar to samples from the seen classes. Another problem arises when the model is forced to generate samples of unseen classes that arbitrarily deviate from seen classes. In this case, there is a risk that the generated images do not follow the description of unseen classes. Instead, the primary property of the generated images is that they deviate sufficiently from seen classes.
We believe paying closer attention to the details of the description is key to solving both of these issues. Inspired by this, we introduce a new model that aims to encourage the generative model to pay closer attention to the details. Specifically, the model includes a mapping from the generated visual features back to the original text or attributes of a class. By requiring that there exists a mapping from the generated visual features to the class specific description, we force the generator to pay closer attention to these inputs. This is implemented by adding an additional loss function which penalizes the generator and regularizer if the generated description is not similar to the input description. We call our proposed method Description Generator Regularized ZSL (DGRZSL). The approach is unsupervised and not tied to a specific generative ZSL approach so it can be added to any ZSL approach that uses the descriptions of seen and unseen categories with minimal modifications to the underlying generative ZSL approach.
In our zero-shot learning setting each data point consists of visual features, a class label, and a semantic representation of the class. These semantic representations are either textual or attribute based. In this section, we introduce notations to represent training and test data.
Let $t^s_i \in \mathcal{T}$ and $t^u_i \in \mathcal{T}$ represent semantic representations of seen and unseen classes, where $\mathcal{T}$ is the semantic space with distribution $p_t$. $N^s$ is the number of seen (training) image examples, $x^s_i \in \mathcal{X}$ is the visual features of the $i$-th image in the visual space $\mathcal{X}$ with distribution $p_x$, and $y^s_i$ is the corresponding category label. The available training data is denoted as $D^s = \{(x^s_i, y^s_i, t^s_i)\}_{i=1}^{N^s}$, where we have $K^s$ unique seen class labels. Additionally, we denote the sets of seen and unseen class labels as $\mathcal{Y}^s$ and $\mathcal{Y}^u$, where $\mathcal{Y}^s$ and $\mathcal{Y}^u$ do not have any labels in common. Then the zero-shot learning task is formulated as predicting the label $y^u \in \mathcal{Y}^u$ of an unseen class sample $x^u$. Generalized ZSL (GZSL) is formulated as predicting the label $y \in \mathcal{Y}^s \cup \mathcal{Y}^u$, which means the search space at test time includes labels from both seen and unseen classes [xian2020zeroshot].
Figure 1 shows an overview of our model, which we describe next. The basic generative ZSL model is based on a generative adversarial network [goodfellow2014generative] and was introduced in [Yizhe_ZSL_2018]. A generator network $G$ is trained to map samples of noise and a representation of the class into visual features. The noise, $z$, is sampled from a Gaussian distribution with zero mean and unit standard deviation. A discriminator network $D$ takes visual features as input. Its output is a classification of whether the input visual features were real or generated. The discriminator network can also include an additional output head which predicts the class label. The real/fake prediction of the discriminator for an input image $x$ and the predicted label of a seen class $y$ given the image are defined as $D^r(x)$ and $D^s(y \mid x)$, respectively. The architectures for these networks are as described in [Yizhe_ZSL_2018].
The contribution of this work is the addition of a semantic representation generator network (SR) and a corresponding loss to this model. The SR generator network learns to map from the visual features of a sample to the semantic representation of the class. An added loss function (described below) penalizes differences between the output of the SR network and the provided semantic representation of the class. This explicitly requires the generated visual features to contain more information about the semantic representation of a class and encourages better generalization to the unseen classes. The SR generator network consists of three fully connected layers with ReLU activations and generates semantic representations that describe the input visual features. We explain the loss function in detail in the following sections.
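To make the shape of the SR network concrete, the following is a minimal pure-Python sketch of a three-layer fully connected forward pass. It is an illustration under assumptions, not the authors' implementation: the names `init_sr` and `sr_forward`, the hidden width, the weight initialization, and the choice to leave the final layer linear (ReLU only after the first two layers) are all hypothetical.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(v, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random weights (assumed He-style scale) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_sr(visual_dim, hidden_dim, semantic_dim, seed=0):
    """Three fully connected layers: visual -> hidden -> hidden -> semantic."""
    rng = random.Random(seed)
    sizes = [visual_dim, hidden_dim, hidden_dim, semantic_dim]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def sr_forward(params, x):
    """Map visual features x to a semantic representation."""
    h = x
    for i, (weights, bias) in enumerate(params):
        h = linear(h, weights, bias)
        if i < len(params) - 1:  # assumption: no ReLU on the output layer
            h = relu(h)
    return h
```

Calling `sr_forward(init_sr(8, 16, 4), [0.5] * 8)` returns a list of four floats, one per dimension of the (toy) semantic space.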
Previous work [Mohamed_Creative_2019] identified a challenge with training generative ZSL models of this form. Specifically, the generator, $G$, never sees data from the unseen classes, neither visual features nor semantic representations. While this is, of course, the definition of zero-shot learning, it means that the generator sees very limited variability in semantic representations during training. In response, [Mohamed_Creative_2019] proposed augmenting the training process to include novel, hallucinated semantic representations of new classes, which the generator would try to generate samples for. To generate the new semantic representations, two seen classes $a$ and $b$ are picked at random, with $t^s_a$ and $t^s_b$ denoting their semantic representations. Then a random convex combination of these representations is used to create the hallucinated representation:
$$t^h = \alpha\, t^s_a + (1 - \alpha)\, t^s_b$$
where $\alpha$ is uniformly sampled from the interval [0.2, 0.8] [Mohamed_Creative_2019].
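The interpolation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `hallucinate`, the list-of-lists data layout, and the decision to draw two distinct classes are assumptions, while the [0.2, 0.8] sampling interval comes from the text.

```python
import random


def hallucinate(class_reps, rng, low=0.2, high=0.8):
    """Blend the semantic representations of two randomly chosen seen classes.

    class_reps: list of equal-length float lists, one per seen class.
    Returns the hallucinated representation, the sampled alpha, and the
    indices of the two source classes.
    """
    a, b = rng.sample(range(len(class_reps)), 2)  # assumption: distinct classes
    alpha = rng.uniform(low, high)
    t_a, t_b = class_reps[a], class_reps[b]
    t_h = [alpha * p + (1.0 - alpha) * q for p, q in zip(t_a, t_b)]
    return t_h, alpha, (a, b)
```

Keeping alpha away from 0 and 1 means the blended representation never coincides with either source class, so the generator is always asked about a description it has not seen verbatim.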
After training, the generative model can be used to generate visual features of unseen classes. These samples can then be used to train a classifier as in a regular classification task.
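As a rough illustration of this last step, the sketch below synthesizes features for each unseen class and labels a test feature by its nearest synthetic neighbour. The nearest-neighbour rule is a stand-in chosen for brevity, not necessarily the classifier used in the experiments, and `synthesize`, `nearest_neighbour`, and the `generator(t, z)` calling convention are hypothetical names.

```python
def synthesize(generator, unseen_reps, noise_samples):
    """Return (feature, label) pairs, one per unseen class and noise sample.

    generator: callable (semantic_rep, noise) -> visual feature list.
    unseen_reps: dict mapping class label -> semantic representation.
    """
    return [
        (generator(t, z), label)
        for label, t in unseen_reps.items()
        for z in noise_samples
    ]


def nearest_neighbour(train, x):
    """Predict the label of the synthetic feature closest to x."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(train, key=lambda pair: sq_dist(pair[0], x))[1]
```

Any standard classifier could replace `nearest_neighbour` here; the point is only that the synthetic features turn the zero-shot problem into an ordinary supervised one.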
3.1 SR Loss Function
Here we introduce the DGRZSL loss. This loss is in addition to other terms that are commonly used in existing generative ZSL approaches [xian2018feature, Yizhe_ZSL_2018, Mohamed_Creative_2019]. The main contribution of DGRZSL is the addition of a Semantic Representation (SR) network. The SR network maps from visual features to semantic representations. To constrain the output of this network, and to encourage the visual features to represent information present in the semantic features, the added loss function encourages the semantic representation produced from the generated features to match the input. In essence, the model ensures that the combination of the visual feature generator and the semantic representation generator forms an autoencoder.
The loss for our SR generator network consists of three terms, in which $x$ denotes the visual features, $t^s$ denotes the semantic representations of the seen classes, $p_{data}$ is the training data distribution of $(x, t^s)$, $t^h$ denotes the hallucinated semantic representations generated from $t^s$ as described above, and $\mathrm{sim}(\cdot, \cdot)$ is a function which measures the similarity between semantic representations. All terms encourage the semantic representations produced by $SR$ to be similar to the “correct” semantic representation, even in the case of hallucinated semantic representations. The first term considers visual features from the training data with known classes and hence known semantic representations. The second term considers generated visual features given semantic representations of known classes. Finally, the third term considers generated visual features with hallucinated semantic representations. While all terms encourage $SR$ to generate accurate semantic representations from the visual features, most critically, the last two terms also encourage $G$ to produce visual features which meaningfully capture the input semantic representations.
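As a hedged sketch, the three terms described above could be written as follows. The notation is an assumption made for illustration: $G$ is the visual feature generator, $SR$ the semantic representation network, $z$ the noise input, $t^s$ and $t^h$ the seen and hallucinated semantic representations, and $d(a, b) = 1 - \mathrm{sim}(a, b)$ a dissimilarity derived from the similarity function. Equal weighting of the terms is also assumed.

```latex
\mathcal{L}_{SR} =
  \mathbb{E}_{(x,\, t^s) \sim p_{data}} \big[ d\big(SR(x),\, t^s\big) \big]
  + \mathbb{E}_{t^s,\, z} \big[ d\big(SR(G(t^s, z)),\, t^s\big) \big]
  + \mathbb{E}_{t^h,\, z} \big[ d\big(SR(G(t^h, z)),\, t^h\big) \big]
```

Written this way, the first term trains only $SR$, while the second and third terms pass gradients through both $SR$ and $G$, which is what ties the generated features to the input description.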
The above SR loss function is used in conjunction with the standard generator and discriminator losses and adversarial training used in GAZSL [Yizhe_ZSL_2018] and CIZSL [Mohamed_Creative_2019].
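The sketch below shows one way the three SR terms might be computed and then combined with an adversarial generator loss. It is an illustration under stated assumptions: cosine similarity stands in for the similarity function, the terms are equally weighted, and `sr_loss`, `generator_objective`, and the weight `sr_weight` are hypothetical names that do not come from GAZSL or CIZSL.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two float lists (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def dissimilarity(u, v):
    """Assumed penalty: one minus cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def sr_loss(sr, generator, real_pairs, seen_reps, hallucinated_reps, noise):
    """Sum of the three SR terms.

    sr: callable visual features -> semantic representation.
    generator: callable (semantic_rep, noise) -> visual features.
    real_pairs: list of (visual features, semantic representation) pairs.
    noise: list of noise inputs, at least as long as each list of reps.
    """
    real_term = mean(dissimilarity(sr(x), t) for x, t in real_pairs)
    seen_term = mean(
        dissimilarity(sr(generator(t, z)), t) for t, z in zip(seen_reps, noise)
    )
    hallucinated_term = mean(
        dissimilarity(sr(generator(t, z)), t)
        for t, z in zip(hallucinated_reps, noise)
    )
    return real_term + seen_term + hallucinated_term


def generator_objective(adversarial_loss, sr_loss_value, sr_weight=1.0):
    """Add the SR penalty to an existing generator loss (weight is assumed)."""
    return adversarial_loss + sr_weight * sr_loss_value
```

With a perfect round trip (the SR network recovers exactly the representation fed to the generator) every term vanishes, so the penalty only acts when the generated features lose information about the description.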
Table 1: Results on the attribute-based datasets (AWA2, SUN, APY).
|Model Selection Set||Model||Top-1 Accuracy (%), AWA2||Top-1 Accuracy (%), SUN||Top-1 Accuracy (%), APY||Seen-Unseen H (%), AWA2||Seen-Unseen H (%), SUN||Seen-Unseen H (%), APY|
|Validation (Ours)||GAZSL [Yizhe_ZSL_2018]||56.33||60.76||27.18||28.36||25.59||14.77|
|Test (Original)||GAZSL [Yizhe_ZSL_2018]||66.44||61.31||45.38||34.49||27.71||23.80|
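Table 2: Results on the text-based datasets (CUB, NAB), easy and hard splits.
|Model Selection Set||Model||Top-1 Accuracy (%), CUB Easy||Top-1 Accuracy (%), CUB Hard||Top-1 Accuracy (%), NAB Easy||Top-1 Accuracy (%), NAB Hard||Seen-Unseen AUC (%), CUB Easy||Seen-Unseen AUC (%), CUB Hard||Seen-Unseen AUC (%), NAB Easy||Seen-Unseen AUC (%), NAB Hard|
|Validation (Ours)||GAZSL [Yizhe_ZSL_2018]||42.40||18.83||41.0||9.34||39.13||15.16||28.71||6.43|
|Test (Original)||GAZSL [Yizhe_ZSL_2018]||44.08||14.46||36.36||8.74||39.69||12.49||24.68||6.48|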
We evaluate our method on text-based datasets, including Caltech UCSD Birds-2011 (CUB) [WahCUB_200_2011] and North America Birds (NAB), which are split into easy and hard benchmarks by [elhoseiny2017link], and on attribute-based datasets, including AWA2 [xian2020zeroshot], SUN [conf/cvpr/PattersonH12], and APY. The metrics we consider in our experiments are Top-1 unseen class accuracy, Seen-Unseen Generalized Zero-shot performance measured by the area under the Seen-Unseen curve [chao2017empirical] (Seen-Unseen AUC), and the harmonic mean [xian2020zeroshot] (Seen-Unseen H).
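For readers unfamiliar with the two generalized metrics, the sketch below computes the harmonic mean of seen and unseen accuracy and a trapezoidal area under a seen-unseen curve. The function names and the representation of the curve as a list of (unseen accuracy, seen accuracy) points are assumptions; how the curve points are produced (for example by sweeping a calibration offset) is not shown.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Seen-Unseen H: harmonic mean of seen and unseen class accuracy."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


def seen_unseen_auc(points):
    """Trapezoidal area under a curve of (unseen_acc, seen_acc) points.

    Points are ordered by increasing unseen accuracy; for ties, the point
    with the higher seen accuracy comes first (assumes a decreasing curve).
    """
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    area = 0.0
    for (u0, s0), (u1, s1) in zip(ordered, ordered[1:]):
        area += (u1 - u0) * (s0 + s1) / 2.0
    return area
```

The harmonic mean rewards balance: a model with 0.8 seen and 0.2 unseen accuracy scores lower than one with 0.5 on both, even though their arithmetic means are equal.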
4.1 Baselines and Evaluation
The most relevant baselines for our method are the GAZSL [Yizhe_ZSL_2018] and CIZSL [Mohamed_Creative_2019] ZSL models on which our approach is built. These methods are state-of-the-art generative ZSL approaches. Both GAZSL [Yizhe_ZSL_2018] and CIZSL follow the same procedure to evaluate the performance of their models. A dataset is split into training and test sets. The training set is used to train the weights of the networks, and performance on the test set is evaluated every few epochs. The final reported performance is that of the model which achieved the best test set accuracy during training. A similar procedure was used to tune hyperparameters. However, this is an unrealistic representation of model performance, as model selection is done based on the test set itself. Instead, we propose to use a validation set, disjoint from the training and test sets, to select the final model. After training, performance on this validation set is used to select a model, and the selected model is then evaluated on the test set to give a fairer and more accurate picture of model performance. As a consequence, the results reported for the baseline models under our evaluation method differ from the results reported in their original papers [Mohamed_Creative_2019, Yizhe_ZSL_2018]. For transparency, we report results with both evaluation protocols. Our model outperforms both baseline models in most cases regardless of the evaluation method used. In what follows, we limit analysis to the results obtained by using the validation set for model selection.
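The difference between the two protocols can be illustrated with a small sketch. The function names, the dictionary layout of the checkpoint log, and the accuracy values in the usage note are all made up for illustration; they are not results from the experiments.

```python
def select_by_validation(history):
    """Pick the checkpoint with the best validation accuracy.

    history: list of dicts with keys 'epoch', 'val_acc', and 'test_acc'.
    The test accuracy of the returned checkpoint is what gets reported.
    """
    return max(history, key=lambda h: h["val_acc"])


def select_by_test(history):
    """Original protocol: pick the checkpoint with the best test accuracy."""
    return max(history, key=lambda h: h["test_acc"])
```

Given a hypothetical log such as `[{"epoch": 10, "val_acc": 0.40, "test_acc": 0.55}, {"epoch": 20, "val_acc": 0.45, "test_acc": 0.50}, {"epoch": 30, "val_acc": 0.42, "test_acc": 0.58}]`, validation-based selection reports the test accuracy at epoch 20, while test-based selection reports the higher value at epoch 30. The second number is optimistic because the test set was used to choose the checkpoint.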
4.2 ZSL Recognition Results
Table 1 summarizes the accuracy achieved by the proposed method and the two baseline models on the attribute-based datasets. DGRZSL outperforms the state-of-the-art baseline methods in top-1 accuracy in all cases, with an average improvement of 9.4% on the APY dataset. DGRZSL also displays significantly improved performance in the seen-unseen H metric, improving it by 29.65%, 1.63%, and 10.91% over the state of the art on AWA2, SUN, and APY, respectively.
Table 2 shows the results achieved by our model on the CUB and NAB datasets for their easy and hard splits, compared to the two baseline models. DGRZSL outperforms the other models on the easy splits when top-1 accuracy is used to evaluate the models. The advantage of the model becomes clearer when the seen-unseen AUC metric is used, as DGRZSL outperforms the other models on most benchmarks. The model is most successful on the easy splits, resulting in average improvements of 2.13% and 2.73% for top-1 accuracy and seen-unseen H, respectively. CUB_HARD is the only case where our method fails to improve upon the baselines. Refer to Figure 2 for the visualization of the seen-unseen curves for our model, GAZSL, and CIZSL on all four benchmarks.
We introduced the Description Generator Regularized ZSL (DGRZSL) model. DGRZSL includes an additional component which produces semantic representations of the underlying classes based on generated visual features. Combined with an additional regularization, this encourages the generated semantic representations to be consistent with the input to the visual feature generator for both seen and hallucinated classes. Our experiments showed that this modification improved generalization performance over state-of-the-art generative ZSL models in terms of both top-1 accuracy and seen-unseen metrics. Our evaluation on multiple benchmark datasets shows that DGRZSL performs well for different types of semantic representation, including both text-based and attribute-based class descriptions.