Learning Clusterable Visual Features for Zero-Shot Recognition

10/07/2020 ∙ by Jingyi Xu, et al. ∙ Stony Brook University

In zero-shot learning (ZSL), conditional generators have been widely used to generate additional training features. These features can then be used to train the classifiers for testing data. However, some testing data are considered "hard" as they lie close to the decision boundaries and are prone to misclassification, leading to performance degradation for ZSL. In this paper, we propose to learn clusterable features for ZSL problems. Using a Conditional Variational Autoencoder (CVAE) as the feature generator, we project the original features to a new feature space supervised by an auxiliary classification loss. To further increase clusterability, we fine-tune the features using a Gaussian similarity loss. The clusterable visual features are not only more suitable for CVAE reconstruction but are also more separable, which improves classification accuracy. Moreover, we introduce Gaussian noise to enlarge the intra-class variance of the generated features, which helps to improve the classifier's robustness. Our experiments on the SUN, CUB, and AWA2 datasets show consistent improvement over previous state-of-the-art ZSL results by a large margin. In addition to its effectiveness on zero-shot classification, experiments show that our method to increase feature clusterability benefits few-shot learning algorithms as well.





Object recognition has made significant progress in recent years, relying on massive labeled training data. However, collecting large numbers of labeled training images is unrealistic in many real-world scenarios. For instance, training examples for certain rare species can be challenging to obtain, and annotating ground truth labels for such fine-grained categories also requires expert knowledge. Motivated by these challenges, zero-shot learning (ZSL), where no labeled samples are required to recognize a new category, has been proposed to handle this data dependency issue.

Specifically, zero-shot learning aims to learn to classify when only images from seen classes (source domain) are provided while no labeled examples from unseen classes (target domain) are available. The seen and unseen classes are assumed to share the same semantic space, such as the semantic attribute space [Farhadi et al.2019, Akata et al.2015a, Parikh and Grauman.2011] or the word vector space [Mikolov et al.2013b, Mikolov et al.2013a], to transfer knowledge between the seen and unseen classes. Existing ZSL methods [Lampert, Nickisch, and Harmeling2014, Akata et al.2015b, Socher et al.2013, Frome et al.2018] typically utilize training data from the source domain to learn a compatible transformation from the visual space to the semantic space of the seen and unseen classes. Then, for test samples from the target domain, the visual features are projected into the semantic space using the learned transformation, in which nearest neighbour (NN) search is conducted to perform zero-shot recognition.
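The embedding-then-search procedure described above can be illustrated with a minimal sketch. It assumes a linear visual-to-semantic map and Euclidean distance; the matrix `W`, the class names and the two-dimensional vectors are hypothetical placeholders, not values from the paper.

```python
import math

def project(W, x):
    """Apply a linear visual-to-semantic map: returns W @ x for a row-major matrix W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def nearest_class(embedding, class_prototypes):
    """Return the class whose semantic vector is closest (Euclidean) to the embedding."""
    return min(class_prototypes, key=lambda c: math.dist(embedding, class_prototypes[c]))

# Hypothetical example: the identity matrix stands in for a learned transformation.
W = [[1.0, 0.0], [0.0, 1.0]]
unseen_prototypes = {"zebra": [1.0, 0.0], "whale": [0.0, 1.0]}
prediction = nearest_class(project(W, [0.9, 0.2]), unseen_prototypes)
```

In a real system the prototypes would be the attribute or word vectors of the unseen classes, and the map would be learned from seen-class data.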

Figure 1: Motivation of our method. (a) ResNet features. (b) Generated features. ResNet features contain hard samples while features synthesized by a CVAE are more clusterable and discriminative.

In addition to the above setting where the test samples are from target domain only, another more challenging setting is generalized zero-shot learning (GZSL), where the test samples may come from either the source or the target domain. For GZSL, a number of methods based on feature generation [Mishra et al.2018, Xian et al.2018b, Verma et al.2018, Felix et al.2018, Xian et al.2019] have been proposed to alleviate the data imbalance problem by generating additional training samples with a conditional generator, such as a conditional GAN [Xian et al.2018b, Felix et al.2018, Li et al.2019] or a conditional VAE [Verma et al.2018, Schonfeld et al.2019, Mishra et al.2018]. A discriminative classifier can be then trained with the artificial features from unseen classes to perform classification.

One of the main limitations of CVAE-based feature generation methods is the distribution shift between the real visual features and the synthesized features. The real visual features typically contain hard samples which are close to the decision boundary, while the CVAE-generated features are usually more clusterable and discriminative (Figure 1). As a result, the discriminative classifier trained on the generated features cannot generalize well to real unseen samples. To address the problem, Keshari et al. [Keshari, Singh, and Vatsa2020] proposed to generate challenging samples that are closer to other competing classes to increase the generalizability of the network. In this paper, we tackle the problem from a different perspective. Instead of generating challenging features to train a robust classifier, we propose to project the original visual features to a clusterable and separable feature space. The projected features are supervised by a classification loss from a discriminative classifier. The advantages of learning clusterable visual features are two-fold: first, since the features generated by a CVAE are clusterable in nature [Mishra et al.2018], the projected features will be more suitable for CVAE reconstruction; and second, during testing, the test samples are easier to classify correctly after being projected to the same discriminative feature space using the learned mapping function than the original hard ones.

To further increase the visual features’ clusterability, we utilize Gaussian similarity loss, which was first proposed by Kenyon-Dean et al. [Kenyon-Dean, Cianflone, and Page-Caccia2019], to fine-tune the visual features before the CVAE reconstruction. The fine-tuning step helps to derive a more clusterable feature space and further improves ZSL performance.

In addition to learning clusterable visual features, we generate hard features to improve the robustness of the classifier. In practice, we introduce Gaussian noise when minimizing the reconstruction loss to synthesize features with larger intra-class variance. In experiments, we show that our method, which simultaneously increases the features' clusterability and the classifier's generalizability, consistently and significantly improves on state-of-the-art zero-shot learning performance on various datasets. Remarkably, we achieve a 16.6% improvement on the SUN dataset and a 12.2% improvement on the AWA2 dataset over the previous best accuracy on unseen classes in the GZSL setting. In addition to zero-shot learning scenarios, we also apply our method to few-shot learning problems and show that more clusterable features can benefit few-shot learning as well.

In summary, our contributions are as follows:

  • For CVAE based feature generating methods, we propose to increase the clusterability of real features by projecting the original features to a discriminative feature space. The projected features can better mimic the distribution of generated features and are also easier to classify, which leads to better zero-shot learning performance.

  • We utilize Gaussian similarity loss to increase the clusterability of visual features and experimentally demonstrate that more clusterable features benefit both zero-shot and few-shot learning.

  • On both ZSL and GZSL settings, our method significantly improves the state-of-the-art zero-shot classification performance on CUB, SUN and AWA2 benchmarks.

Figure 2: The pipeline of our proposed method. Instead of reconstructing the original visual features directly, we use a mapping function to project the original visual features to a clusterable feature space. Another mapping function is introduced for fine-tuning. The projected features are supervised by the classification loss produced by a classifier and the reconstructed features are supervised by the same classifier to enforce the same distribution. The synthesized features are expected to exhibit larger intra-class variance, which will lead to a more robust classifier. Best viewed in color.

Related Work

Clustering-based Representation Learning

Clustering-based representation learning has attracted interest recently. A discriminative feature space is desirable for many image classification problems, especially facial recognition. Schroff et al. proposed a triplet loss [Florian, Dmitry, and James.2015], which minimizes the distance between an anchor and a positive sample while maximizing the distance between an anchor and a negative sample until a margin is met. However, since image triplets are the input of the network, carefully designed methods to sample training data are required. Another line of research avoids such a pair selection procedure by using a softmax classifier, with a variety of loss functions designed to enhance discriminative power [Deng, Guo, and Zafeiriou2019, Wen et al.2016, Wang et al.2018, Liu et al.2017]. Wen et al. proposed a center loss [Wen et al.2016], which penalizes the Euclidean distance between each feature vector and its class center to increase intra-class compactness. Beyond the domain of facial recognition, Kenyon-Dean et al. proposed clustering-oriented representation learning (COREL), which builds latent representations that exhibit the quality of natural clustering. The more clusterable latent space leads to better classification accuracy for image and news article classification. In this paper, we empirically show that more clusterable visual features can benefit zero-shot learning and few-shot learning as well.


Conventional zero-shot learning methods generally focus on learning robust visual-semantic embeddings [Romera-Paredes and Torr2015, Frome et al.2018, Bucher, Herbin, and Jurie2016, Kodirov, Xiang, and Gong2017]. Lampert et al. [Lampert, Nickisch, and Harmeling2014] proposed Direct Attribute Prediction (DAP), in which a probabilistic classifier is learned for each attribute independently. The trained estimators can then be used to map attributes to the class label at the inference stage. Bucher et al. [Bucher, Herbin, and Jurie2016] proposed to control the semantic embedding of images by jointly optimizing the attribute embedding and the classification metric in a multi-objective framework. Kodirov et al. proposed a Semantic Autoencoder (SAE) [Kodirov, Xiang, and Gong2017], which uses an additional reconstruction constraint to enhance ZSL performance.

In the more challenging GZSL task, in which test samples can be from either seen or unseen categories, semantic embedding methods suffer from extreme data imbalance problems. The mapping between visual and semantic spaces has a bias towards the semantic features of seen classes, thus hurting the classification of unseen classes significantly. Recent research addresses the lack of training data for unseen classes by synthesizing visual representations via generative models [Xian et al.2018b, Mishra et al.2018, Li et al.2019, Verma et al.2018, Keshari, Singh, and Vatsa2020, Bucher, Herbin, and Jurie2017, Huang et al.2019, Li, Min, and Fu.2019]. Xian et al. [Xian et al.2018b] propose to generate image features with a WGAN conditioned on class-level semantic information. The generator is coupled with a classification loss to generate sufficiently discriminative CNN features. LisGAN [Li et al.2019], built on top of WGAN, employs 'soul samples' as the representations of each category to improve the quality of generated features.

However, GAN-based losses suffer from mode collapse and instability in training. Hence, conditional variational autoencoders (CVAE) have been employed for stable training [Verma et al.2018, Yu and Lee2019, Mishra et al.2018]. Verma et al. [Verma et al.2018] incorporate a CVAE-based architecture with a discriminator that learns a mapping from the VAE generator's output to the class attribute, leading to an improved generator. Yu et al. [Yu and Lee2019] leverage a CVAE with a category-specific multi-modal prior by generating and learning simultaneously. The trained CVAE is provided with experience about both seen and unseen classes. We also deploy a CVAE to synthesize features for the unseen classes. Our method's novelty lies in projecting the original visual features to a clusterable and discriminative feature space. The projected features are more suitable for CVAE reconstruction and easier for the final classifier to classify correctly.


We first generate additional training samples with a CVAE-based architecture as the baseline of our model. In the following we describe how we learn clusterable visual features by using a discriminative classifier as well as fine-tuning with Gaussian similarity loss. Finally, we describe how we obtain a more robust classifier by synthesizing hard features to further improve ZSL performance.

Problem Setup: In zero-shot learning, we have a training set $\mathcal{D}_s = \{(x, y, a_y) \mid x \in \mathcal{X}, y \in \mathcal{Y}_s, a_y \in \mathcal{A}\}$, where $x$ are the visual features, $y$ denotes the class labels of the source seen classes $\mathcal{Y}_s$ and $a_y$ is the semantic descriptor, e.g., semantic attributes, of class $y$. In addition, we have a set of target unseen class labels $\mathcal{Y}_u$ which have no overlap with the source seen classes, i.e., $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$. For unseen classes, we are given their semantic features, but their visual features are missing. Zero-shot learning methods aim to learn a model which can classify the datapoints from the unseen classes $\mathcal{Y}_u$.

Baseline ZSL with Conditional Autoencoder

The conditional VAE proposed in [Mishra et al.2018] consists of a probabilistic encoder model $E$ and a probabilistic decoder model $G$. The encoder $E$ takes the sample $x$ as input and encodes the latent variable $z$. The encoded variable $z$, concatenated with the corresponding attribute vector $a$, is provided to the decoder $G$, which is trained to reconstruct the input sample $x$. The training loss is given by:

$$\mathcal{L}_{CVAE} = -\mathbb{E}_{p_E(z \mid x)}\left[\log p_G(x \mid z, a)\right] + \mathrm{KL}\left(p_E(z \mid x) \,\|\, p(z)\right) \quad (1)$$

The first term is the generator's reconstruction loss and the second term is the KL divergence loss that pushes the VAE posterior to be close to the prior. The latent code $z$ represents the class-independent component and the attribute vector $a$ represents the class-specific component.

Once the VAE is trained, one can synthesize samples of any class by sampling $z$ from the prior $p(z)$, specifying the class attribute vector $a$ and generating samples $\hat{x}$ with the generator. The generated samples can then be used to train a discriminative classifier, such as a support vector machine (SVM) or a softmax classifier.
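The two loss terms and the sampling step can be sketched in a few lines. This is only an illustration under stated assumptions: a diagonal Gaussian posterior, a standard normal prior, and a squared-error reconstruction term. All function names are hypothetical and the encoder and decoder networks themselves are omitted.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def reconstruction_loss(x, x_hat):
    """Squared-error reconstruction term (an assumed choice of likelihood)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def cvae_loss(x, x_hat, mu, log_var):
    """Reconstruction term plus KL term for one sample."""
    return reconstruction_loss(x, x_hat) + kl_to_standard_normal(mu, log_var)

def sample_decoder_input(attribute, latent_dim, rng):
    """Draw z from the standard normal prior and concatenate the class attribute vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return z + list(attribute)

# Hypothetical usage: a 4-dimensional latent code and a 3-dimensional attribute vector.
decoder_input = sample_decoder_input([1.0, 0.0, 1.0], latent_dim=4, rng=random.Random(0))
```

Feeding many such inputs through a trained decoder would yield synthetic features for the class described by the attribute vector.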

Discriminative Embedding Space for Reconstruction

Instead of training a CVAE to mimic the distribution of real visual features directly, we project the original features to a clusterable and discriminative feature space and reconstruct the projected features instead. The projected features are expected to have low intra-class variance and large inter-class distance, mimicking the CVAE feature distribution. To ensure such a distribution, we introduce a discriminative classifier and minimize the classification loss over the projected features:

$$\mathcal{L}_{cls}^{proj} = -\mathbb{E}_{x}\left[\log P\left(y \mid M(x); \theta_C\right)\right] \quad (2)$$

where $M$ is the mapping function and $x' = M(x)$ denotes the projected features. $C$ is the discriminative classifier with parameters $\theta_C$ and $P(y \mid x'; \theta_C)$ is the probability of predicting $x'$ with its true label $y$.

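To further enforce the same distribution between the projected features and the reconstructed features $\hat{x}'$, we minimize the classification error of the same classifier over the reconstructed features as well:

$$\mathcal{L}_{cls}^{rec} = -\mathbb{E}_{\hat{x}'}\left[\log P\left(y \mid \hat{x}'; \theta_C\right)\right] \quad (3)$$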
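The complete learning objective is given by:

$$\mathcal{L} = \mathcal{L}_{CVAE} + \lambda_1 \mathcal{L}_{cls}^{proj} + \lambda_2 \mathcal{L}_{cls}^{rec} \quad (4)$$

where $\mathcal{L}_{CVAE}$ is the CVAE training loss, and $\lambda_1$ and $\lambda_2$ are the loss weights for the classification losses produced by the projected features ($\mathcal{L}_{cls}^{proj}$) and the reconstructed features ($\mathcal{L}_{cls}^{rec}$) respectively.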

Finally, in the testing stage, the test samples are also projected to the discriminative space using the same mapping function and classified by the discriminative classifier. Compared to the original samples, the projected samples are more easily separated and more likely to be classified correctly.
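As a rough illustration of how the CVAE loss and the two classification losses might be combined, the sketch below adds weighted cross-entropy terms for the projected and reconstructed features to a precomputed CVAE term. The weight names `w_proj` and `w_rec` and their default values are placeholders, not the settings used in the paper.

```python
import math

def cross_entropy(logits, label):
    """-log softmax(logits)[label], computed with the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def total_loss(cvae_term, logits_projected, logits_reconstructed, label,
               w_proj=1.0, w_rec=1.0):
    """CVAE loss plus weighted classification losses from the same classifier."""
    cls_projected = cross_entropy(logits_projected, label)
    cls_reconstructed = cross_entropy(logits_reconstructed, label)
    return cvae_term + w_proj * cls_projected + w_rec * cls_reconstructed

# Hypothetical logits produced by one classifier for a projected and a reconstructed feature.
loss = total_loss(0.8, [2.0, 0.1, -1.0], [1.5, 0.3, -0.5], label=0, w_proj=0.5, w_rec=0.5)
```

Using the same classifier for both terms is what ties the reconstructed features to the distribution of the projected ones.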

Clusterable Feature Learning with Gaussian Similarity Loss

The Gaussian similarity loss was first proposed in [Kenyon-Dean, Cianflone, and Page-Caccia2019] to create representations which exhibit the quality of natural clustering. Here we adopt the Gaussian similarity loss to fine-tune the visual features to further increase the clusterability of the feature space.

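Neural networks for conventional classification tasks are trained using a categorical cross-entropy (CCE) loss. Specifically, the CCE loss seeks to maximize the log-likelihood of the $N$-sample training set from $K$ classes:

$$\mathcal{L}_{CCE} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(h_i^{\top} w_{y_i})}{\sum_{k=1}^{K} \exp(h_i^{\top} w_k)} \quad (5)$$

where $h_i$ is the $i$th feature sample with label $y_i$ and $w_k$ is the $k$th column of the classification matrix $W$. We algebraically reformulate the CCE loss as follows:

$$\mathcal{L}_{CCE} = -\frac{1}{N}\sum_{i=1}^{N} \left[\mathrm{sim}(h_i, w_{y_i}) - \log \sum_{k=1}^{K} \exp\left(\mathrm{sim}(h_i, w_k)\right)\right] \quad (6)$$

where $\mathrm{sim}(h_i, w_k)$ is the similarity function between $h_i$ and $w_k$, which is the dot product in the CCE loss.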

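Although the CCE loss is widely used for classification tasks, the representations learned by CCE are not naturally clusterable. Replacing the dot product operation in the CCE loss with a Gaussian similarity function, we get the Gaussian similarity loss, which leads to more clusterable latent representations [Kenyon-Dean, Cianflone, and Page-Caccia2019]. Specifically, for a feature sample $h_i$ and a class vector $w_k$, the Gaussian similarity function is defined based on the univariate normal probability density function and the standard radial basis function (RBF) as follows:

$$\mathrm{sim}_G(h_i, w_k) = \exp\left(-\gamma \lVert h_i - w_k \rVert_2^2\right) \quad (7)$$

where $\gamma$ is a free hyperparameter. Thus, for $N$ samples with labels $y_i$ and $K$ classes, the Gaussian similarity loss can be written as:

$$\mathcal{L}_{G} = -\frac{1}{N}\sum_{i=1}^{N} \left[\mathrm{sim}_G(h_i, w_{y_i}) - \log \sum_{k=1}^{K} \exp\left(\mathrm{sim}_G(h_i, w_k)\right)\right] \quad (8)$$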
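According to [Kenyon-Dean, Cianflone, and Page-Caccia2019], compared to the CCE loss, the Gaussian similarity loss helps to create naturally clusterable latent spaces. To fine-tune the original visual features with the Gaussian similarity loss, we use another mapping function $M'$ to transform the original features $x_i$ to a new space and minimize the Gaussian similarity loss of the transformed features w.r.t. a new classification matrix $W'$:

$$\mathcal{L}_{G}(M', W') = -\frac{1}{N}\sum_{i=1}^{N} \left[\mathrm{sim}_G(M'(x_i), w'_{y_i}) - \log \sum_{k=1}^{K} \exp\left(\mathrm{sim}_G(M'(x_i), w'_k)\right)\right] \quad (9)$$

where $\mathrm{sim}_G$ is the Gaussian similarity function and $w'_k$ is the $k$th column of $W'$. The transformed features can then be used for CVAE reconstruction.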
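To make the substitution concrete, the sketch below implements a softmax-style loss with a pluggable similarity function, so that the dot product recovers the usual CCE loss and an RBF-style Gaussian similarity gives one possible form of the Gaussian similarity loss. The exact formulation in the cited work may weight its terms differently; the value of `gamma` and the toy vectors are illustrative assumptions.

```python
import math

def dot_similarity(h, w):
    """Dot product, the similarity implicit in the standard CCE loss."""
    return sum(a * b for a, b in zip(h, w))

def gaussian_similarity(h, w, gamma=0.5):
    """exp(-gamma * ||h - w||^2); gamma is a free hyperparameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(h, w))
    return math.exp(-gamma * sq_dist)

def similarity_loss(features, labels, class_vectors, sim):
    """Mean over samples of -[ sim(h, w_y) - log sum_k exp(sim(h, w_k)) ]."""
    total = 0.0
    for h, y in zip(features, labels):
        scores = [sim(h, w) for w in class_vectors]
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[y]
    return total / len(features)

# Toy data: two class vectors and two features lying near them.
class_vectors = [[0.0, 0.0], [3.0, 0.0]]
features = [[0.1, 0.0], [2.9, 0.0]]
loss_correct = similarity_loss(features, [0, 1], class_vectors, gaussian_similarity)
loss_swapped = similarity_loss(features, [1, 0], class_vectors, gaussian_similarity)
```

Because the Gaussian similarity is bounded and depends only on distance to the class vector, lowering this loss pulls each feature toward its own class vector, which is the clustering behaviour the text relies on.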

Method CUB SUN AWA2
U S H U S H U S H
SJE [Akata et al.2015c] 23.5 59.2 33.6 14.7 30.5 19.8 8.0 73.9 14.4
ESZSL [Romera-Paredes and Torr2015] 12.6 63.8 21.0 11.0 27.9 15.8 5.9 77.8 11.0
ALE [Akata et al.2013] 23.7 62.8 34.4 21.8 33.1 26.3 14.0 81.8 23.9
SAE [Kodirov, Xiang, and Gong2017] 7.8 54.0 13.6 8.8 18.0 11.8 1.1 82.2 2.2
SYNC [Changpinyo et al.2016] 11.5 70.9 19.8 7.9 43.3 13.4 10.0 90.5 18.0
LATEM [Xian et al.2016] 15.2 57.3 24.0 14.7 28.8 19.5 11.5 77.3 20.0
DEM [Zhang, Xiang, and Gong2017] 19.6 57.9 29.2 20.5 34.3 25.6 30.5 86.4 45.1
AREN [Xie et al.2019] 38.9 78.7 52.1 19.0 38.8 25.5 15.6 92.9 26.7
DEVISE [Frome et al.2018] 23.8 53.0 32.8 16.9 27.4 20.9 17.1 74.7 27.8
SE-ZSL [Verma et al.2018] 41.5 53.3 46.6 40.9 30.5 34.9 58.3 68.1 62.8
f-CLSWGAN [Xian et al.2018b] 41.5 53.3 46.6 40.9 30.5 34.9 58.3 68.1 62.8
CADA-VAE [Schonfeld et al.2019] 51.6 53.5 52.4 47.2 35.7 40.6 55.8 75.0 63.9
JGM-ZSL [Gao et al.2018] 42.7 45.6 44.1 44.4 30.9 36.5 56.2 71.7 63.0
RFF-GZSL [Han, Fu, and Yang2020] 52.6 56.6 54.6 45.7 38.6 41.9 - - -
LisGAN [Li et al.2019] 46.5 57.9 51.6 42.9 37.8 40.2 47.0 77.6 58.5
OCD [Keshari, Singh, and Vatsa2020] 44.8 59.9 51.3 44.8 42.9 43.8 59.5 73.4 65.7
Ours 56.8 69.2 62.4 63.8 45.4 53.0 71.6 87.8 78.8

Table 1: Generalized zero-shot learning performance on the CUB, SUN and AWA2 datasets. U = Top-1 accuracy of the test unseen-class samples, S = Top-1 accuracy of the test seen-class samples, H = harmonic mean. We measure top-1 accuracy in %. The best performance is indicated in bold.

Method CUB SUN AWA2
SSE [Zhang and Saligrama2016] 43.9 51.5 61.0
ALE [Akata et al.2013] 54.9 58.1 62.5
DEVISE [Frome et al.2018] 52.0 56.5 59.7
SJE [Akata et al.2015c] 53.9 53.7 61.9
ESZSL [Romera-Paredes and Torr2015] 53.9 54.5 58.6
SYNC [Changpinyo et al.2016] 55.6 56.3 46.6
SAE [Kodirov, Xiang, and Gong2017] 33.3 40.3 54.1
GFZSL [Verma and Rai2009] 49.2 62.6 67.0
SE-ZSL [Verma et al.2018] 59.6 63.4 69.2
LAD [Jiang et al.2017] 57.9 62.6 67.8
CVAE-ZSL [Mishra et al.2018] 52.1 61.7 65.8
CDL [Jiang et al.2018] 54.5 63.6 67.9
OCD [Keshari, Singh, and Vatsa2020] 60.3 63.5 71.3
Ours 63.1 65.5 74.1

Table 2: Classification accuracy for conventional zero-shot learning for the proposed split (PS) on CUB, SUN and AWA2. The best performance is indicated in bold.

Introducing Gaussian Noise for Better Generalizability

With the above mapping function supervised with classification loss and fine-tuning with Gaussian similarity loss, we obtain discriminative and clusterable real visual features. Meanwhile, we introduce Gaussian noise when minimizing the reconstruction loss to synthesize hard features:


where $x'$ denotes the projected features and $\tilde{x}'$ the features perturbed with Gaussian noise; $\epsilon$ denotes the strength of the perturbation. Instead of reconstructing $x'$, the decoder is trained to reconstruct $\tilde{x}'$. The synthesized features will have larger intra-class variance and thus lead to a more robust final classifier. The CVAE training loss will then become:
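The noise injection step can be sketched as follows. The sketch assumes additive, isotropic Gaussian noise scaled by a strength parameter; the actual form of the perturbation in the paper is not recoverable from the text here, so this is only one plausible reading, and the function name `perturb` is hypothetical.

```python
import random

def perturb(features, strength, rng):
    """Add zero-mean, unit-variance Gaussian noise scaled by `strength` to each dimension."""
    return [f + strength * rng.gauss(0.0, 1.0) for f in features]

# Hypothetical projected feature; the decoder would be trained to reconstruct `target`.
rng = random.Random(0)
projected = [0.2, 0.7, 0.5]
target = perturb(projected, strength=0.1, rng=rng)
```

Training the decoder against perturbed targets spreads the generated features of each class over a wider region, which is the larger intra-class variance the text describes.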



We compare our proposed framework in both ZSL and GZSL settings on three benchmark datasets: CUB [Welinder et al.2010], SUN [Patterson and Hays2012] and AWA2 [Lampert, Nickisch, and Harmeling2014]. We present datasets, evaluation metrics, experimental results and comparisons with the state-of-the-art. Moreover, we perform few-shot learning experiments on the CUB and SUN datasets to further demonstrate that the clusterable features fine-tuned with Gaussian similarity loss can benefit few-shot learning as well.



Caltech-UCSD-Birds 200-2011 (CUB) [Welinder et al.2010] and SUN Attribute (SUN) are both fine-grained datasets. CUB contains 11,788 examples of 200 fine-grained species annotated with 312 attributes. SUN consists of 14,340 examples of 717 different scenes annotated with 102 attributes. We use a split of 150/50 for CUB and 645/72 for SUN respectively. The Animals with Attributes2 (AWA2) [Lampert, Nickisch, and Harmeling2014] is a coarse-grained dataset proposed for animal classification. It is the extension of the AWA [Lampert, Nickisch, and Harmeling2009] database containing 37,322 samples from 50 classes and 85 attributes. We adopt the standard 40/10 zero-shot split in our experiments. The statistics and protocols of the datasets are presented in Table 3.

Evaluation Protocol:

We report results on the proposed split (PS) of Xian et al. [Xian et al.2018a], which guarantees that no target classes are from ImageNet-1K, since it is used to pre-train the base network. For ZSL, we adopt the same metric, i.e., top-1 per-class accuracy, as [Xian et al.2018a] for fair comparison with other methods. For GZSL, we compute top-1 per-class accuracy on seen classes, denoted as $S$, top-1 per-class accuracy on unseen classes, denoted as $U$, and their harmonic mean, defined as $H = 2 \cdot S \cdot U / (S + U)$.
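As a quick check of the GZSL metric, the harmonic mean of seen and unseen accuracy can be computed directly. The sketch below uses the standard harmonic mean formula and the CUB accuracies reported for the proposed method in Table 1; the function name is illustrative.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen-class and unseen-class accuracy (both in %)."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# CUB accuracies of the proposed method from Table 1: S = 69.2, U = 56.8.
h_cub = harmonic_mean(69.2, 56.8)  # about 62.39, reported as 62.4
```

The harmonic mean is dominated by the smaller of the two accuracies, which is why methods that do well only on seen classes score poorly on it.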

Dataset Attribute-Dim Images Seen/Unseen Classes
CUB 312 11788 150/50
SUN 102 14340 645/72
AWA2 85 37322 40/10

Table 3: Datasets used in our experiments
Figure 3: Visualization of real visual features and synthesized features on the CUB dataset using t-SNE. (a) Original real features and synthesized features. (b) Projected real features and synthesized features. (c) Projected real features and synthesized features via introducing Gaussian noise. The real features and the synthesized features are drawn with distinct markers. Different colors represent different categories. With our proposed method, the real visual features are more discriminative and easily separated while the synthesized features exhibit larger intra-class variance.

Implementation Details:

Our method is implemented in PyTorch. We extract image features using the ResNet101 model [He et al.2016] pretrained on ImageNet with 224 × 224 input size. The extracted features are from the 2048-dimensional final pooling layer. Both the encoder and the decoder are multilayer perceptrons with one 4096-unit hidden layer. LeakyReLU and ReLU are the nonlinear activation functions in the hidden and output layers respectively. The mapping function is implemented with a fully connected layer and Sigmoid activation. The dimension of the latent space is chosen to be the same as the attribute vector. For the CUB and SUN datasets, the dimension of the projected visual feature space is set to 512, while for the AWA2 dataset it is set to 2048. We use the Adam solver with , and a learning rate of 0.001. We set , in eq. 4. For the CUB and SUN datasets, we set in eq.10. For the AWA2 dataset, is set to 1.

Conventional Zero-Shot Learning (ZSL):

Table 2 summarizes the results of conventional zero-shot learning. In these experiments, test samples can only belong to the unseen categories. It can be seen that our proposed method achieves state-of-the-art performance on all three datasets. The classification accuracies obtained on the PS protocol on CUB, SUN and AWA2 are 63.1%, 65.5%, and 74.1%, respectively. The proposed method improves the state-of-the-art performance by 2.9% on CUB, by 2.0% on SUN and by 2.8% on AWA2, which indicates the effectiveness of the framework. Our performance beats other CVAE-based zero-shot learning methods, such as SE-ZSL [Verma et al.2018] and CVAE-ZSL [Mishra et al.2018], by a large margin. Compared to [Keshari, Singh, and Vatsa2020], which synthesizes hard features to increase network generalizability, we also increase the clusterability of the real visual features to make them more easily separated and more reconstructable, which leads to a significant accuracy boost.

Generalized Zero-Shot Learning (GZSL):

In the GZSL setting, the testing samples can be from either seen or unseen classes. The setting is more challenging and more reflective of real-world application scenarios, since typically whether an image is from a source or target class is unknown in advance.

Table 1 shows the generalized zero-shot recognition results on the three datasets. We group the algorithms into non-generative models, i.e., SJE [Akata et al.2015c], ESZSL [Romera-Paredes and Torr2015], ALE [Akata et al.2013], SAE [Kodirov, Xiang, and Gong2017], SYNC [Changpinyo et al.2016], LATEM [Xian et al.2016], DEM [Zhang, Xiang, and Gong2017], DEVISE [Frome et al.2018], and generative models, i.e., SE-ZSL [Verma et al.2018], f-CLSWGAN [Xian et al.2018b], CADA-VAE [Schonfeld et al.2019], JGM-ZSL. We observe that generative models perform better than non-generative models in general. Synthesizing additional features helps to alleviate the imbalance between seen and unseen classes, which leads to higher accuracy for unseen class samples. In contrast, the non-generative methods mostly perform well on seen classes and obtain much lower accuracy on unseen classes, resulting in a low harmonic mean.

Compared with generative methods, our proposed method still achieves state-of-the-art performance on all three datasets, especially in terms of unseen class accuracy. In terms of the harmonic mean, we improve on OCD [Keshari, Singh, and Vatsa2020] by 11.1%, 9.2% and 13.1% on CUB, SUN and AWA2 respectively.

Ablation Studies

Effect of Different Components:

The proposed framework has multiple components for improving the performance of ZSL/GZSL. Here we conduct ablation studies to evaluate the effectiveness of each component individually. Fig. 4 summarizes the zero-shot recognition results of different submodels. By comparing the results of 'NA' and 'CLS', we can observe that the models improve considerably by projecting the original features to a clusterable space, especially for CUB and SUN. Since CUB and SUN are fine-grained datasets, the original visual features are typically hard to distinguish. Therefore, increasing the clusterability of real visual features has a significant effect. This can also be seen through the comparison of 'CLS' and 'CLS-GAUSSIAN', since fine-tuning with Gaussian similarity loss also leads to better clusterability. AWA2 is a coarse-grained dataset, so the original features are already well separated. Thus the biggest improvement comes from adding noise to obtain a more robust classifier.

Evaluation of Feature Clusterability:

Beyond zero-shot learning performance, here we analyze the clusterability of the visual features learned with our proposed method. Specifically, we apply one of the most commonly used clustering algorithms, K-Means, on the projected feature space. Then we evaluate the clustering performance by computing the mutual information (MI), which measures the agreement of the ground truth class assignments and the K-Means assignments. Higher mutual information indicates the clustering algorithm performs better and thus, a more clusterable feature space.
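The agreement measure can be sketched in plain Python. The text does not say whether the reported score is raw or normalized mutual information; the sketch below shows both, with normalization by the mean of the two label entropies as one common convention. The cluster assignments are assumed to come from a K-Means run that is not shown here.

```python
import math
from collections import Counter

def mutual_information(labels_true, labels_pred):
    """Mutual information (in nats) between two labelings of the same samples."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), n_tp in joint.items():
        mi += (n_tp / n) * math.log(n * n_tp / (count_true[t] * count_pred[p]))
    return mi

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(labels_true, labels_pred):
    """MI divided by the mean of the two entropies (an assumed normalization)."""
    denom = 0.5 * (entropy(labels_true) + entropy(labels_pred))
    if denom == 0.0:
        return 1.0
    return mutual_information(labels_true, labels_pred) / denom

# Hypothetical cluster assignments: a perfect clustering up to a relabeling of cluster ids.
score = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
```

Both measures are invariant to how the cluster ids are numbered, which matters because K-Means assigns ids arbitrarily.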

Method CUB SUN AWA2
MI Acc MI Acc MI Acc
NA 0.67 56.6 0.56 61.4 0.83 70.2
Cls 0.71 61.2 0.59 64.4 0.85 71.3
Cls + Gaussian 0.73 62.6 0.60 64.9 0.85 72.4

Table 4: Mutual information (MI) score and classification accuracy on CUB, SUN and AWA2 without our proposed method (NA), with the supervision of classification loss (Cls) and Gaussian similarity loss (Gaussian)

As seen in Table 4, the proposed mapping network supervised by the classification loss improves visual feature clusterability: the MI score rises from 0.67 to 0.71 on CUB, from 0.56 to 0.59 on SUN and from 0.83 to 0.85 on AWA2. Correspondingly, the zero-shot learning performance improves from 56.6% to 61.2%, from 61.4% to 64.4% and from 70.2% to 71.3% respectively. Moreover, the fine-tuning step with Gaussian similarity loss further improves MI by 0.02 on CUB and 0.01 on SUN, which brings 1.4% and 0.5% zero-shot learning performance improvement.

We can observe from the above experimental results that feature space clusterability and classification accuracy are strongly correlated: more clusterable feature space leads to higher classification accuracy.

Figure 4: Zero-shot learning performance with different components of our method on three datasets. The reported values are classification accuracy (%)

Few-shot Learning

We also apply our method to few-shot recognition problems to show that more clusterable features can benefit few-shot learning as well. A baseline approach [Chen et al.2019] to few-shot learning is to train a classifier using the features extracted from a support set. The trained classifier is then used to predict labels of query set images. We use two fine-grained datasets, CUB and SUN, where we extract image features by ResNet101 pretrained on ImageNet. We use base set features to train a mapping function supervised by Gaussian similarity loss. The mapping function is then applied to a novel set to obtain more clusterable features. We compare our method with the baseline, i.e., using features extracted from the pretrained ResNet directly. The results of our experiments as well as comparisons to some baselines [Schwartz et al.2018, Snell, Swersky, and Zemel2017] are summarized in Table 5. We can conclude that increasing the features' clusterability can improve the baseline performance by a large margin, especially for the 1-shot setting (16.4% on the CUB dataset and 6.1% on the SUN dataset).

Method CUB SUN
1-shot 5-shot 1-shot 5-shot
Ours-Baseline 70.2 92.6 76.5 93.1
ProtoNet 71.9 92.4 74.7 94.8
Δ-Encoder 82.2 92.6 82.0 93.0
Ours-Gaussian 86.6 95.8 82.6 93.8

Table 5: 1-shot/5-shot 5-way accuracy with ImageNet pretrained features (trained on disjoint categories)

Visualization of Synthesized Samples

To further demonstrate how our proposed method boosts zero-shot learning performance, we randomly sample ten unseen categories and visualize the real features and synthesized features using t-SNE [Akata et al.2013]. Figure 3 depicts the empirical distributions of the true visual features and the synthesized visual features with and without our proposed framework. We observe that the original true features contain hard samples close to another class and that some of them overlap (Figure 3(a)). The discriminative classifier trained with synthesized samples typically performs poorly on such a distribution. In contrast, the projected features are easily separated with larger inter-class distance (Figure 3(b)), which leads to better distribution alignment between real and synthesized features. Moreover, by introducing Gaussian noise, the synthesized features exhibit larger intra-class variance (Figure 3(c)), which leads to a more robust classifier for handling the projected test samples.


Conclusion

In this work, we have proposed to learn clusterable visual features to address the challenges of Zero-Shot Learning and Generalized Zero-Shot Learning within a CVAE-based framework. Specifically, we use a mapping function, supervised by a softmax classification loss, to project the original features to a new clusterable feature space. The projected clusterable visual features are not only more suitable for the generator to reconstruct, but are also more separable for the final classifier. To further increase the clusterability of the visual features, we use a Gaussian similarity loss to fine-tune the features before CVAE reconstruction. In addition, we introduce Gaussian noise to enlarge the intra-class variance of the synthesized features and obtain a more robust classifier. We evaluate the clusterability of visual features quantitatively and demonstrate experimentally that more clusterable features lead to better ZSL performance. Experimental results on three benchmarks under ZSL, GZSL, and FSL settings show that our method improves the state of the art by a large margin. In future work, we are interested in seeing how learning clusterable features can be applied to other tasks such as face recognition and person re-identification.


References

  • [Akata et al.2013] Akata, Z.; Perronnin, F.; Harchaoui, Z.; and Schmid, C. 2013. Label-embedding for attribute-based classification. In CVPR.
  • [Akata et al.2015a] Akata, Z.; Reed, S.; Walter, D.; Lee, H.; and Schiele, B. 2015a. Evaluation of output embeddings for fine-grained image classification. In CVPR.
  • [Akata et al.2015b] Akata, Z.; Reed, S.; Walter, D.; Lee, H.; and Schiele, B. 2015b. Evaluation of output embeddings for fine-grained image classification. In CVPR.
  • [Akata et al.2015c] Akata, Z.; Reed, S.; Walter, D.; Lee, H.; and Schiele, B. 2015c. Evaluation of output embeddings for fine-grained image classification. In CVPR.
  • [Bucher, Herbin, and Jurie2016] Bucher, M.; Herbin, S.; and Jurie, F. 2016. Improving semantic embedding consistency by metric learning for zero-shot classification. In ECCV.
  • [Bucher, Herbin, and Jurie2017] Bucher, M.; Herbin, S.; and Jurie, F. 2017. Generating visual representations for zero-shot classification. In ICCVW.
  • [Changpinyo et al.2016] Changpinyo, S.; Chao, W.-L.; Gong, B.; and Sha, F. 2016. Synthesized classifiers for zero-shot learning. In CVPR.
  • [Chen et al.2019] Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C. F.; and Huang, J.-B. 2019. A closer look at few-shot classification. In ICML.
  • [Deng, Guo, and Zafeiriou2019] Deng, J.; Guo, J.; and Zafeiriou, S. 2019. Arcface: Additive angular margin loss for deep face recognition. In CVPR.
  • [Farhadi et al.2019] Farhadi, A.; Endres, I.; Hoiem, D.; and Forsyth, D. 2019. Describing objects by their attributes. In CVPR.
  • [Felix et al.2018] Felix, R.; Kumar, V. B.; Reid, I.; and Carneiro, G. 2018. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV.
  • [Florian, Dmitry, and James.2015] Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR.
  • [Frome et al.2018] Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; and Mikolov, T. 2018. Devise: A deep visual-semantic embedding model. In ECCV.
  • [Gao et al.2018] Gao, R.; Hou, X.; Qin, J.; Liu, L.; Zhu, F.; and Zhang, Z. 2018. A joint generative model for zero-shot learning. In ECCV.
  • [Han, Fu, and Yang2020] Han, Z.; Fu, Z.; and Yang, J. 2020. Learning the redundancy-free features for generalized zero-shot object recognition. In CVPR.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
  • [Huang et al.2019] Huang, H.; Wang, C.; Yu, P. S.; and Wang, C.-D. 2019. Generative dual adversarial network for generalized zero-shot learning. In CVPR.
  • [Jiang et al.2017] Jiang, H.; Wang, R.; Shan, S.; Yang, Y.; and Chen, X. 2017. Learning discriminative latent attributes for zero-shot classification. In ICCV.
  • [Jiang et al.2018] Jiang, H.; Wang, R.; Shan, S.; and Chen, X. 2018. Learning class prototypes via structure alignment for zero-shot recognition. In ECCV.
  • [Kenyon-Dean, Cianflone, and Page-Caccia2019] Kenyon-Dean, K.; Cianflone, A.; and Page-Caccia, L. 2019. Clustering-oriented representation learning with attractive-repulsive loss. In AAAI.
  • [Keshari, Singh, and Vatsa2020] Keshari, R.; Singh, R.; and Vatsa, M. 2020. Generalized zero-shot learning via over-complete distribution. In CVPR.
  • [Kodirov, Xiang, and Gong2017] Kodirov, E.; Xiang, T.; and Gong, S. 2017. Semantic autoencoder for zero-shot learning. In CVPR.
  • [Lampert, Nickisch, and Harmeling2009] Lampert, C.; Nickisch, H.; and Harmeling, S. 2009. Learning to detect unseen object classes by between-class attribute transfer. In CVPR.
  • [Lampert, Nickisch, and Harmeling2014] Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2014. Attribute-based classification for zero-shot visual object categorization. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [Li et al.2019] Li, J.; Jing, M.; Lu, K.; Ding, Z.; Zhu, L.; and Huang, Z. 2019. Leveraging the invariant side of generative zero-shot learning. In CVPR.
  • [Li, Min, and Fu.2019] Li, K.; Min, M. R.; and Fu, Y. 2019. Rethinking zero-shot learning: A conditional visual classification perspective. In ICCV.
  • [Liu et al.2017] Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; and Song, L. 2017. Sphereface: Deep hypersphere embedding for face recognition. In CVPR.
  • [Mikolov et al.2013a] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. In arXiv preprint arXiv:1301.3781.
  • [Mikolov et al.2013b] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. In NeurIPS.
  • [Mishra et al.2018] Mishra, A.; Reddy, S. K.; Mittal, A.; and Murthy, H. A. 2018. A generative model for zero shot learning using conditional variational autoencoders. In CVPRW.
  • [Parikh and Grauman.2011] Parikh, D., and Grauman, K. 2011. Relative attributes. In ICCV.
  • [Patterson and Hays2012] Patterson, G., and Hays, J. 2012. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR.
  • [Romera-Paredes and Torr2015] Romera-Paredes, B., and Torr, P. 2015. An embarrassingly simple approach to zero-shot learning. In ICML.
  • [Schonfeld et al.2019] Schonfeld, E.; Ebrahimi, S.; Sinha, S.; Darrell, T.; and Akata, Z. 2019. Generalized zero- and few-shot learning via aligned variational autoencoders. In CVPR.
  • [Schwartz et al.2018] Schwartz, E.; Karlinsky, L.; Shtok, J.; Harary, S.; Marder, M.; Feris, R.; Kumar, A.; Giryes, R.; and Bronstein, A. M. 2018. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In NeurIPS.
  • [Snell, Swersky, and Zemel2017] Snell, J.; Swersky, K.; and Zemel, R. S. 2017. Prototypical networks for few-shot learning. In NeurIPS.
  • [Socher et al.2013] Socher, R.; Ganjoo, M.; Manning, C. D.; and Ng, A. Y. 2013. Zero-shot learning through cross-modal transfer. In NeurIPS.
  • [Verma and Rai2009] Verma, V. K., and Rai, P. 2009. A simple exponential family framework for zero-shot learning. In TKDE.
  • [Verma et al.2018] Verma, V. K.; Arora, G.; Mishra, A.; and Rai, P. 2018. Generalized zero-shot learning via synthesized examples. In CVPR.
  • [Wang et al.2018] Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Li, Z.; Gong, D.; Zhou, J.; and Liu, W. 2018. Cosface: Large margin cosine loss for deep face recognition. In CVPR.
  • [Welinder et al.2010] Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
  • [Wen et al.2016] Wen, Y.; Zhang, K.; Li, Z.; and Qiao, Y. 2016. A discriminative feature learning approach for deep face recognition. In ECCV.
  • [Xian et al.2016] Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, Q.; Hein, M.; and Schiele, B. 2016. Latent embeddings for zero-shot classification. In CVPR.
  • [Xian et al.2018a] Xian, Y.; Lampert, C. H.; Schiele, B.; and Akata, Z. 2018a. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. In TPAMI.
  • [Xian et al.2018b] Xian, Y.; Lorenz, T.; Schiele, B.; and Akata, Z. 2018b. Feature generating networks for zero-shot learning. In CVPR.
  • [Xian et al.2019] Xian, Y.; Sharma, S.; Schiele, B.; and Akata, Z. 2019. f-vaegan-d2: A feature generating framework for any-shot learning. In CVPR.
  • [Xie et al.2019] Xie, G.-S.; Liu, L.; Jin, X.; Zhu, F.; Zhang, Z.; Qin, J.; Yao, Y.; and Shao, L. 2019. Attentive region embedding network for zero-shot learning. In CVPR.
  • [Yu and Lee2019] Yu, H., and Lee, B. 2019. Zero-shot learning via simultaneous generating and learning. In NeurIPS.
  • [Zhang and Saligrama2016] Zhang, Z., and Saligrama, V. 2016. Learning joint feature adaptation for zero-shot recognition. In arXiv preprint arXiv:1611.07593.
  • [Zhang, Xiang, and Gong2017] Zhang, L.; Xiang, T.; and Gong, S. 2017. Learning a deep embedding model for zero-shot learning. In CVPR.