SR-GAN: Semantic Rectifying Generative Adversarial Network for Zero-shot Learning

04/15/2019 ∙ by Zihan Ye, et al.

Existing zero-shot learning (ZSL) methods may suffer from vague class attributes that overlap heavily across different classes. Unlike these methods, which ignore the discrimination among classes, in this paper we propose to classify unseen images by rectifying the semantic space under the guidance of the visual space. First, we pre-train a Semantic Rectifying Network (SRN) to rectify the semantic space with a semantic loss and a rectifying loss. Then, a Semantic Rectifying Generative Adversarial Network (SR-GAN) is built to generate plausible visual features of unseen classes from both the semantic features and the rectified semantic features. To guarantee the effectiveness of the rectified semantic features and the synthetic visual features, a pre-reconstruction network and a post-reconstruction network are proposed, which keep the visual and semantic features consistent. Experimental results demonstrate that our approach significantly outperforms the state of the art on four benchmark datasets.




1 Introduction

The classical pattern of object recognition classifies images into categories seen only in the training stage [1, 2]. In contrast, zero-shot learning (ZSL) aims at recognizing unseen image categories and has attracted a lot of attention [3, 4, 5, 6, 7, 8, 9, 10, 11] in recent years. By using intermediate semantic features (obtained from human-defined attributes) of both seen and unseen classes, previous methods infer the unseen classes of images. However, human-defined attributes overlap heavily for similar classes, which easily leads to failed predictions. In this paper, we propose a Semantic Rectifying Network (SRN) to make semantic features more distinguishable, and a Semantic Rectifying Generative Adversarial Network (SR-GAN) to synthesize data for unseen classes from both the corresponding semantic features and the rectified semantic features. Once the missing features are synthesized, the unseen classes can be classified by a supervised classification approach such as nearest neighbors.

ZSL is challenging [4] because the images to be predicted come from unseen classes. [12] proposes generalized zero-shot learning (GZSL) to improve the expandability of ZSL. Unlike ZSL, GZSL also includes seen classes at test time. Some studies [3, 13, 14, 5, 15, 16] project image features into the semantic space and assign an image to the class whose semantic feature is closest to the projected image feature. Recently, inspired by the generative ability of generative adversarial networks (GANs), [17, 8, 18] leverage GANs to generate synthesized visual features from semantic features and noise samples, and design a visual pivot regularization to simulate a visual distribution with greater inter-class discrimination. By generating the missing features for unseen classes, these methods convert ZSL into a conventional classification problem, so that classical methods such as nearest neighbors can be used.
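To make the final classification step concrete, the following is a minimal sketch of nearest-neighbor classification over synthesized features. The function names, the toy feature vectors and the class names are illustrative assumptions; in practice the synthesized features would come from a trained generator.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor_predict(query, synthesized):
    """Return the label of the synthesized feature closest to `query`.

    synthesized: list of (feature_vector, label) pairs, e.g. features
    generated for unseen classes (hypothetical data here).
    """
    best_label, best_dist = None, float("inf")
    for feature, label in synthesized:
        dist = euclidean(query, feature)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy synthesized features for two unseen classes (made-up values).
synthesized = [
    ([0.9, 0.1], "zebra"),
    ([1.0, 0.2], "zebra"),
    ([0.1, 0.8], "wolf"),
]
prediction = nearest_neighbor_predict([0.2, 0.9], synthesized)
```

Here the query is assigned to the class of its closest synthesized feature, which is the sense in which generating features turns ZSL into ordinary supervised classification.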

Figure 1: Schema of the proposed method. We visualize semantic features and corresponding rectified semantic features of 20 classes from APY by multidimensional scaling (MDS)  [19]. Classes overlapped in semantic space can be rectified by the proposed semantic rectifying network.
Figure 2: The architecture of the Semantic Rectifying Generative Adversarial Network. (a) The Semantic Rectifying Network (SRN) rectifies semantic features into a more discriminative space guided by the seen visual information. (b) The Semantic Rectifying Generative Adversarial Network uses the rectified semantic features to generate visual features, and discriminates whether a visual feature is real or fake. (c) The pre-reconstruction network allows our generator to learn a more exact distribution of visual features; the post-reconstruction network allows synthetic visual features to be translated back into their semantic features.

However, semantic features are difficult to define exactly because common properties such as color, shape, and texture overlap across many classes. For example, elephant and tiger are both large and have a tail. Thus, as shown in Fig. 1, the semantic features of some similar classes may cluster in a small region and become indistinguishable in the semantic space. For instance, wolf, cat, dog and even zebra are quite close in the semantic space because these classes have overlapping attributes or similar descriptions. Accordingly, synthesizing visual features from these indistinguishable semantic features is unreliable. Some works seek to address this problem by learning an extra space apart from the visual space. [9] and [10] propose to automatically mine latent and discriminative semantic features from visual features. [11] constructs an aligned space as a trade-off between the semantic and visual spaces. However, because it contains two heterogeneous kinds of information, the aligned space may be misled by noise from the images.

In this paper, we adopt another strategy, i.e., rectifying the semantic space into a more reasonable one under the guidance of visual features. We first design a Semantic Rectifying Network (SRN) to pre-rectify indistinguishable semantic features. Based on the generative adversarial network, as shown in Fig. 2, a Semantic Rectifying Generative Adversarial Network (SR-GAN) is then proposed to generate visual features from the rectified semantic features. As shown in Fig. 1, semantic features that are over-crowded in the original feature space become distinguishable in the rectified semantic space. Moreover, a pre-reconstruction network and a post-reconstruction network are proposed, which construct two cycles to preserve the consistency of semantic and visual features. We evaluate the proposed method on four datasets, i.e., AWA1, CUB, APY and SUN. The experimental results demonstrate that our approach outperforms the state of the art on both ZSL and GZSL tasks.

2 Methodology

2.1 Formulation

Given an image $I$, the proposed model can recognize it as a specific class even if that class is unseen. Following [4], we take an instance $\{x, s, y\}$ as input in the training stage, where $x$ is the visual feature of $I$ in the visual space $\mathcal{X}$, $s$ in the semantic space $\mathcal{S}$ is the semantic feature extracted from the attributes or descriptions of the class, and $y \in \mathcal{Y}^s$ denotes the corresponding seen class label. $\mathcal{Y}^s$ is the set of seen class labels. In the testing stage, given an image, ZSL recognizes it as an unseen class $y \in \mathcal{Y}^u$, while GZSL recognizes it as either a seen or an unseen class $y \in \mathcal{Y}^s \cup \mathcal{Y}^u$. As shown in Fig. 2, our framework consists of three components: (a) a Semantic Rectifying Network (SRN) that rectifies the semantic space with reference to the visual space; (b) a Semantic Rectifying Generative Adversarial Network (SR-GAN) that synthesizes pseudo visual features and performs zero-shot classification; (c) a pre-reconstruction network and a post-reconstruction network that keep the visual and semantic features consistent.

2.2 Semantic rectifying network

The primary obstacle of ZSL is that it is difficult to guarantee that the distributions of the visual space and the semantic space correspond. Specifically, vague class attributes and descriptions confuse the model and make it hard to generate convincing visual features. To this end, we design a Semantic Rectifying Network (SRN), denoted as $R$ and shown in Fig. 2(a), to rectify the class structures between the visual space and the semantic space. SRN consists of a multi-layer perceptron (MLP) activated by Leaky ReLU [20], and the output layer has a Sigmoid activation. We define the visual pivot vector $p_c$ for each class across the dataset, and for the $c$-th class we have

$$p_c = \frac{1}{N_c} \sum_{i=1}^{N_c} x_i^c, \quad (1)$$

where $N_c$ is the number of instances of class $c$, and $x_i^c$ is the $i$-th visual feature of class $c$. We argue that for any two classes, their visual pivots have a relationship similar to that of their semantic features. Thus, we use the cosine similarity function $\delta$ to provide the similarities between pairs of visual pivots and between pairs of semantic features, and obtain a rectifying loss for SRN (Eq. (2)), computed over the $n$ seen classes. The first term of Eq. (2) is the structure loss, expressing the directional distance between the rectified semantic features and the visual features, and the second term is a semantic loss, which measures the loss of semantic information after rectifying. Note that we fix the parameters of SRN after training it.
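As a rough illustration of the quantities described above, the sketch below computes per-class visual pivots and pairwise cosine similarities, and combines them into a two-term loss. The exact form of the loss is an assumption made for illustration (a squared difference of pairwise similarities for the structure term, and one minus the cosine similarity for the semantic term); the function names and toy data are hypothetical.

```python
import math

def visual_pivots(features, labels):
    """Mean visual feature (pivot) of each class."""
    sums, counts = {}, {}
    for x, c in zip(features, labels):
        if c not in sums:
            sums[c] = [0.0] * len(x)
            counts[c] = 0
        sums[c] = [a + b for a, b in zip(sums[c], x)]
        counts[c] += 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb)

def rectifying_loss(rectified, original, pivots):
    """ASSUMED two-term loss; all arguments are dicts keyed by class.

    structure term: pairwise similarities of rectified semantic features
                    should match pairwise similarities of visual pivots.
    semantic term:  rectified features should stay close to the originals.
    """
    classes = sorted(pivots)
    n = len(classes)
    structure = 0.0
    for i in classes:
        for j in classes:
            diff = cosine(rectified[i], rectified[j]) - cosine(pivots[i], pivots[j])
            structure += diff * diff
    structure /= n * n
    semantic = sum(1.0 - cosine(rectified[c], original[c]) for c in classes) / n
    return structure + semantic

# Two classes whose visual pivots are orthogonal but whose attributes coincide.
pivots = visual_pivots([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], ["a", "a", "b"])
original = {"a": [1.0, 1.0], "b": [1.0, 1.0]}
loss_overlapped = rectifying_loss(original, original, pivots)
separated = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
loss_separated = rectifying_loss(separated, original, pivots)
```

In this toy case, separating the two overlapping semantic vectors lowers the assumed loss even though it moves them away from the original attributes, which is the trade-off the two terms are meant to express.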

Require: the batch size, the learning rate, and the number of discriminator training loops.
Initialize the SRN randomly
for SRN training iterations do
    Sample seen visual features and semantic features
    Update the SRN by Eq. 2
end for
Fix the SRN and initialize the remaining networks (generator, discriminator and the two reconstruction networks) randomly
for SR-GAN training iterations do
    for each discriminator training loop do
        Sample a mini-batch of visual features, corresponding semantic features and random vectors
        Fix the generator and update the discriminator by Eq. 5
    end for
    Sample a mini-batch of visual features, corresponding semantic features and random vectors
    Fix the discriminator and update the generator by Eq. 3
    Update the generator and the pre-reconstruction network by Eq. 6
    Update the generator and the post-reconstruction network by Eq. 7
end for
Algorithm 1: Training procedure of our approach.
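The order of updates in Algorithm 1 can be sketched as a simple schedule. This is only an outline of the control flow: the step names are hypothetical labels rather than real update functions, and the default of five discriminator loops follows the value stated later in the text.

```python
def training_schedule(srn_iters, gan_iters, n_critic=5):
    """Yield the sequence of update steps in the order given by Algorithm 1.

    The step names are placeholders for the actual parameter updates.
    """
    # Stage 1: pre-train the semantic rectifying network, then freeze it.
    for _ in range(srn_iters):
        yield "update_srn"
    # Stage 2: alternate discriminator and generator updates.
    for _ in range(gan_iters):
        for _ in range(n_critic):
            yield "update_discriminator"
        yield "update_generator"            # adversarial generator loss
        yield "update_pre_reconstruction"   # pre-reconstruction loss
        yield "update_post_reconstruction"  # post-reconstruction loss

steps = list(training_schedule(srn_iters=2, gan_iters=1, n_critic=5))
```

Each outer SR-GAN iteration therefore performs several discriminator updates followed by three generator-side updates.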

2.3 Semantic rectifying GAN

Generative adversarial networks (GANs) have proven useful for ZSL [8] because of their ability to generate visual features from semantic features. However, indiscriminately feeding vague semantic features into a generator may undermine the generated visual features. With a pre-trained SRN, we can easily obtain more distinguishable semantic features. Therefore, we design a semantic rectifying GAN (SR-GAN) that translates these rectified semantic features into visual features. As shown in Fig. 2(b), the proposed SR-GAN has a generator $G$ and a discriminator $D$. $G$ takes three types of input, i.e., the original semantic feature $s$, the rectified semantic feature $R(s)$ produced by the SRN, and a random vector $z$ sampled from the normal distribution. $G$ consists of a four-layer MLP with a residual connection, where the first three layers use Leaky ReLU activation and the output layer is activated by Tanh. The loss of the generator is defined as:

$$L_G = -\mathbb{E}_{z}\left[D(G(s, R(s), z))\right] + L_{cls}(G(s, R(s), z)) + L_{vp}, \quad (3)$$

where $\mathbb{E}$ denotes the expected value. The first term of Eq. (3) is the standard generator loss of Wasserstein GAN (W-GAN) [21]. The second term $L_{cls}$ is a cross-entropy loss and the third term $L_{vp}$ is a visual pivot loss. The visual pivot loss is computed as the Euclidean distance between the prototypes of the synthesized features and of the real features for each class:

$$L_{vp} = \frac{1}{C} \sum_{c=1}^{C} \left\| \tilde{p}_c - p_c \right\|_2, \quad (4)$$

where $C$ is the number of classes, $p_c$ is the prototype (mean) of the real features of class $c$, and $\tilde{p}_c$ is the prototype of the synthesized features of class $c$.
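A minimal sketch of the visual pivot loss described above follows. It assumes that the per-class Euclidean distances are averaged over classes, which is one natural reading of the description; the function names and the toy batches are hypothetical.

```python
import math

def class_means(features, labels):
    """Prototype (mean feature vector) of each class."""
    grouped = {}
    for x, c in zip(features, labels):
        grouped.setdefault(c, []).append(x)
    return {
        c: [sum(col) / len(rows) for col in zip(*rows)]
        for c, rows in grouped.items()
    }

def visual_pivot_loss(real_feats, real_labels, fake_feats, fake_labels):
    """Average Euclidean distance between real and synthesized prototypes.

    Averaging over classes is an assumption made for this sketch.
    """
    real = class_means(real_feats, real_labels)
    fake = class_means(fake_feats, fake_labels)
    classes = sorted(set(real) & set(fake))
    total = 0.0
    for c in classes:
        total += math.sqrt(sum((r - f) ** 2 for r, f in zip(real[c], fake[c])))
    return total / len(classes)

# Toy batches: class 0 prototypes differ by 3, class 1 prototypes coincide.
real_feats, real_labels = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]], [0, 0, 1]
fake_feats, fake_labels = [[1.0, 3.0], [0.0, 4.0]], [0, 1]
pivot_loss = visual_pivot_loss(real_feats, real_labels, fake_feats, fake_labels)
```

The loss is zero only when every synthesized class prototype coincides with the corresponding real one.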


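The discriminator $D$ takes synthetic features or real features as input and has two output branches. One branch distinguishes whether the input is real or fake, and the other classifies the input into different classes in either the ZSL or the GZSL setting. Consequently, the loss of $D$ is defined as:

$$L_D = \mathbb{E}_{\tilde{x}}\left[D(\tilde{x})\right] - \mathbb{E}_{x}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x}}\left[\left(\left\| \nabla_{\hat{x}} D(\hat{x}) \right\|_2 - 1\right)^2\right] + L_{cls}, \quad (5)$$

where $x$ is a real visual feature, $\tilde{x}$ is a synthetic feature produced by the generator, and $\hat{x}$ is a point interpolated between a real and a synthetic feature. The first two terms are the standard discriminator loss of W-GAN, and the third term is the gradient penalty. The gradient penalty helps Wasserstein GAN avoid pathological behavior [21], and $\lambda$ denotes the penalty coefficient. The last term $L_{cls}$ is an auxiliary classification loss.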
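The W-GAN part of the discriminator loss can be illustrated with a deliberately simplified critic. The sketch below assumes a linear critic D(x) = w·x + b, for which the gradient with respect to the input is w everywhere, so the gradient penalty can be written in closed form without automatic differentiation. The penalty coefficient of 10 is a commonly used default and an assumption here, and the auxiliary classification term is omitted.

```python
import math

def linear_critic(w, b, x):
    """Score of a linear critic D(x) = w . x + b (simplifying assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def critic_loss_linear(w, b, real_batch, fake_batch, penalty_coef=10.0):
    """W-GAN critic loss with gradient penalty for a linear critic.

    For a linear critic the input gradient equals w at every point, so the
    penalty does not depend on where the interpolated points are sampled.
    """
    mean_real = sum(linear_critic(w, b, x) for x in real_batch) / len(real_batch)
    mean_fake = sum(linear_critic(w, b, x) for x in fake_batch) / len(fake_batch)
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    penalty = penalty_coef * (grad_norm - 1.0) ** 2
    return mean_fake - mean_real + penalty

# Gradient norm 5 is heavily penalized: 4 - 3 + 10 * (5 - 1)^2.
loss_steep = critic_loss_linear([3.0, 4.0], 0.0, [[1.0, 0.0]], [[0.0, 1.0]])
# Gradient norm 1 incurs (almost) no penalty: 0.8 - 0.6.
loss_unit = critic_loss_linear([0.6, 0.8], 0.0, [[1.0, 0.0]], [[0.0, 1.0]])
```

With a neural critic the gradient would instead be evaluated numerically at the interpolated points, but the penalty pushes its norm towards one in the same way.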

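| Method | AWA1 T1 | CUB T1 | APY T1 | SUN T1 | AWA1 u | AWA1 s | AWA1 H | CUB u | CUB s | CUB H | APY u | APY s | APY H | SUN u | SUN s | SUN H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAP [3] | 44.1 | 40.0 | 33.8 | 39.9 | 0 | 88.7 | 0 | 1.7 | 67.9 | 3.3 | 4.8 | 78.3 | 9.0 | 4.2 | 25.1 | 7.2 |
| CONSE [22] | 45.6 | 34.3 | 26.9 | 38.8 | 0.4 | 88.6 | 0.8 | 1.6 | 72.2 | 3.1 | 0 | 91.2 | 0 | 6.8 | 39.9 | 11.6 |
| SSE [23] | 60.1 | 43.9 | 34.0 | 51.5 | 7.0 | 80.5 | 12.9 | 8.5 | 46.9 | 14.4 | 0.2 | 78.9 | 0.4 | 2.1 | 36.4 | 4.0 |
| DEVISE [13] | 54.2 | 52.0 | 39.8 | 56.5 | 13.4 | 68.7 | 22.4 | 23.8 | 53.0 | 32.8 | 4.9 | 76.9 | 9.2 | 16.9 | 27.4 | 20.9 |
| SJE [24] | 65.6 | 53.9 | 32.9 | 53.7 | 11.3 | 74.6 | 19.6 | 23.5 | 59.2 | 33.6 | 3.7 | 55.7 | 6.9 | 14.7 | 30.5 | 19.8 |
| LATEM [15] | 55.1 | 49.3 | 35.2 | 55.3 | 7.3 | 71.7 | 13.3 | 15.2 | 57.3 | 24.0 | 0.1 | 73.0 | 0.2 | 14.7 | 28.8 | 19.5 |
| ESZSL [14] | 58.2 | 53.9 | 38.3 | 54.5 | 6.6 | 75.6 | 12.1 | 12.6 | 63.8 | 21.0 | 2.4 | 70.1 | 4.6 | 11.0 | 27.9 | 15.8 |
| ALE [5] | 59.9 | 54.9 | 39.7 | 58.1 | 16.8 | 76.1 | 27.5 | 23.7 | 62.8 | 34.4 | 4.6 | 73.7 | 8.7 | 21.8 | 33.1 | 26.3 |
| SYNC [25] | 54.0 | 55.6 | 23.9 | 56.3 | 8.9 | 87.3 | 16.2 | 11.5 | 70.9 | 19.8 | 7.4 | 66.3 | 13.3 | 7.9 | 43.3 | 13.4 |
| SAE [16] | 53.0 | 33.3 | 8.3 | 40.3 | 1.8 | 77.1 | 3.5 | 7.8 | 54.0 | 13.6 | 0.4 | 80.9 | 0.9 | 8.8 | 18.0 | 11.8 |
| GAZSL [8] | 68.2 | 55.8 | 41.13 | 61.3 | 19.2 | 86.5 | 31.4 | 23.9 | 60.6 | 34.3 | 14.17 | 78.63 | 24.01 | 21.7 | 34.5 | 26.7 |
| PSR [26] | - | 56.0 | 38.4 | 61.4 | - | - | - | 24.6 | 54.3 | 33.9 | 13.5 | 51.4 | 21.4 | 20.8 | 37.2 | 26.7 |
| CDL [11] | 69.9 | 54.5 | 43.0 | 63.6 | 28.1 | 73.5 | 40.6 | 23.5 | 55.2 | 32.9 | 19.8 | 48.6 | 28.1 | 21.5 | 34.7 | 26.5 |
| SR-GAN | 71.97 | 55.44 | 44.02 | 62.29 | 41.46 | 83.08 | 55.31 | 31.29 | 60.87 | 41.34 | 22.34 | 78.35 | 34.77 | 22.08 | 38.29 | 27.36 |

Table 1: Comparison with state-of-the-art methods on four datasets. T1 columns report Zero-Shot Learning top-1 accuracy; u, s and H columns report Generalized Zero-Shot Learning accuracy on unseen classes, accuracy on seen classes, and their harmonic mean.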
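| Method | AWA1 T1 | CUB T1 | APY T1 | SUN T1 | AWA1 u | AWA1 s | AWA1 H | CUB u | CUB s | CUB H | APY u | APY s | APY H | SUN u | SUN s | SUN H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SR-GAN:baseline | 69.13 | 52.88 | 39.98 | 60.49 | 34.04 | 84.29 | 48.48 | 27.53 | 58.96 | 37.53 | 19.39 | 77.70 | 31.03 | 21.46 | 36.16 | 26.93 |
| SR-GAN:baseline+rec | 69.11 | 53.14 | 40.79 | 61.18 | 35.05 | 83.41 | 49.36 | 27.96 | 61.92 | 38.53 | 20.38 | 79.80 | 32.47 | 21.88 | 38.26 | 27.83 |
| SR-GAN:baseline+SRN | 70.43 | 55.21 | 43.44 | 61.25 | 37.92 | 83.84 | 52.22 | 30.45 | 61.10 | 40.64 | 21.65 | 72.39 | 33.33 | 20.69 | 39.46 | 27.15 |
| SR-GAN:baseline+SRN+rec | 71.97 | 55.44 | 44.02 | 62.29 | 41.46 | 83.08 | 55.31 | 31.29 | 60.87 | 41.34 | 22.34 | 78.35 | 34.77 | 22.08 | 38.29 | 27.36 |

Table 2: Effects of different components on four datasets. T1 columns report Zero-Shot Learning top-1 accuracy; u, s and H columns report Generalized Zero-Shot Learning accuracy on unseen classes, accuracy on seen classes, and their harmonic mean.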

2.4 Pre-reconstruction and post-reconstruction

By the above process, the model is able to synthesize good visual features to some extent. However, a significant problem remains: the generated visual features have poor consistency with the input semantics. Accordingly, we propose a pre-reconstruction module and a post-reconstruction module to keep the visual and semantic features consistent. Specifically, as shown in Fig. 2(c), the pre-reconstruction network takes the real visual feature as input; the semantic feature and random noise it reconstructs are then fed into the generator $G$, and a consistency loss is built between the visual feature and the reconstructed visual feature. Conversely, the post-reconstruction network, keeping in step with the generator $G$, takes the generated feature as input and builds a consistency loss on the reconstructed semantic feature and random noise. The pre-reconstruction loss and the post-reconstruction loss are given by Eq. 6 and Eq. 7, respectively, in which features are combined by the concatenation operator.

The two reconstruction cycles can be considered as two auto-encoders [27]. The pre-reconstruction lets the generator learn a more convincing visual distribution by forcing it to restore the real visual feature from the encoded random vector. The post-reconstruction strengthens the relationship between the synthetic visual features and the corresponding class semantics by minimizing the difference between the reconstructed and the original semantic features. Finally, the loss of $G$ is modified by integrating the pre-reconstruction loss and the post-reconstruction loss.


We summarize the training procedure in Algorithm 1. We first train SRN and fix its parameters after training. Then we train the generator and the discriminator of SR-GAN in turn, but train the discriminator more times (5 by default, following [21]) than the generator. Note that we empirically update the parameters of $G$ several times, once each for Eq. 3, Eq. 6 and Eq. 7, as we find this very useful for obtaining a more reliable generator.

3 Experiment

3.1 Implementation details

Datasets. We evaluate our approach on four benchmark datasets for ZSL and GZSL: (1) Caltech-UCSD-Birds 200-2011 (CUB) [28] has 11,788 images of 200 bird classes annotated with 312 attributes; (2) Animals with Attributes (AWA1) [6] is coarse-grained and has 30,475 images, 50 classes, and 85 attributes; (3) Attribute Pascal and Yahoo (APY) [29] contains 15,339 images, 32 classes and 64 attributes; (4) SUN Attribute (SUN) [30] annotates 102 attributes on 14,340 images from 717 types of scene. For all four datasets, we use the widely used ZSL and GZSL splits proposed in [4].

Evaluation metrics. We use the evaluation metrics proposed in [4]. For ZSL, we measure the average per-class top-1 accuracy (T1) of unseen classes. For GZSL, we compute the average per-class top-1 accuracy of seen classes, denoted by $s$, and of unseen classes, denoted by $u$, and their harmonic mean, i.e. $H = \frac{2 \times u \times s}{u + s}$.
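The two metrics can be sketched in a few lines. The function names and the toy labels are illustrative; the harmonic mean follows the definition from [4], and the last two lines plug in the SR-GAN accuracies on AWA1 and the DAP accuracies on AWA1 from Table 1.

```python
def per_class_top1(y_true, y_pred):
    """Average of the per-class top-1 accuracies (not the overall accuracy)."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)

def harmonic_mean(u, s):
    """Harmonic mean H of unseen-class accuracy u and seen-class accuracy s."""
    if u + s == 0:
        return 0.0
    return 2 * u * s / (u + s)

# Toy example: "cat" is right 2 out of 3 times, "dog" 1 out of 1.
acc = per_class_top1(["cat", "cat", "cat", "dog"], ["cat", "cat", "dog", "dog"])

h_srgan_awa1 = harmonic_mean(41.46, 83.08)  # Table 1 lists H = 55.31
h_dap_awa1 = harmonic_mean(0.0, 88.7)       # Table 1 lists H = 0
```

Because the tabulated u and s are themselves rounded, the recomputed H agrees with the table only up to rounding. The harmonic mean is dominated by the smaller of the two accuracies, so a model that ignores unseen classes scores close to zero regardless of its seen-class accuracy.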


Figure 3: Seen class visualizations in APY, AWA1, and CUB.

3.2 Comparison to the state of the art

We compare the proposed method against several state-of-the-art methods in both the ZSL and GZSL settings in Table 1. SR-GAN achieves the best performance in most cases. For ZSL, we set two new state-of-the-art results on AWA1 and APY, and obtain comparable results on the other datasets. For GZSL, our approach achieves the best unseen-class accuracy $u$ and harmonic mean $H$ on all four datasets, which indicates that it improves the performance on unseen classes while maintaining a high accuracy on seen classes. Specifically, $H$ improves on all datasets, by 0.66 to 14.71 points over the best competing method, which indicates that our approach alleviates the seen-unseen bias problem better than other approaches under the GZSL scenario. Unlike other methods that use fixed semantic information, SR-GAN makes the original semantic features more discriminative, which is similar to CDL [11]. However, the aligned space in CDL is a compromise between the semantic and visual spaces. In addition, CDL computes the feature similarities in the semantic space, visual space and aligned space separately, which reduces the final performance. In contrast, our approach automatically explores the discrimination in the rectifying space while preserving the original semantic information to some extent.

3.3 Ablation study

We analyze the importance of every component of the proposed framework in Table 2. We denote the two reconstruction networks, the semantic rectifying network and the rest of our model as rec, SRN, and baseline, respectively, and evaluate three variants of our model obtained by removing different components. The performance of the baseline is unremarkable on all datasets except AWA1. With the help of the reconstruction networks (rec), the performance increases slightly, e.g. for the unseen-class accuracy $u$ on AWA1, SR-GAN:baseline+rec is better than the baseline alone (35.05% vs. 34.04%). This indicates that our reconstruction networks enhance the imagination of our model for unseen classes. With the SRN, the performance increases significantly: SR-GAN:baseline+SRN is better than the baseline on almost all accuracies across all datasets, which indicates that our SRN effectively rectifies semantic features into a more distinguishable space and thereby generates more realistic visual features.

3.4 Visualization of rectifying space

To validate that the proposed SRN makes semantic features more distinguishable, we take all seen classes of APY, AWA1 and CUB and visualize their semantic features, the corresponding rectified semantic features, and the pivots of their visual features in a 2-D plane by multidimensional scaling (MDS) [19]. The visualization results are shown in Figure 3. The original semantic features are not distinguishable enough, e.g. wolf, cat, dog and even zebra all gather in a very small region, which does not correspond to the visual space. In contrast, all classes keep a reasonable distance from each other in the rectified semantic space. This supports the claim that the proposed semantic rectifying network helps to distinguish semantic features.

4 Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61876121, 61472267, 61728205, 61502329, 61672371), Primary Research & Development Plan of Jiangsu Province (No. BE2017663), Aeronautical Science Foundation (20151996016), Jiangsu Key Disciplines of Thirteen Five-Year Plan (No. 20168765), Suzhou Institute of Trade & Commerce Research Project (KY-ZRA1805), and Foundation of Key Laboratory in Science and Technology Development Project of Suzhou (No. SZS201609).

5 Conclusion

In this paper, we propose a novel generative approach for zero-shot learning (ZSL) that synthesizes visual features from rectified semantic features produced by the proposed semantic rectifying network (SRN). SRN maps the original, indiscriminative semantic features to rectified semantic features that are more distinguishable. Additionally, to guarantee the effectiveness of the rectified semantic features and the synthetic visual features, a pre-reconstruction network and a post-reconstruction network are proposed, which preserve semantic details and keep the real visual distribution. Experimental results show that the proposed approach achieves state-of-the-art performance on the ZSL task and improves GZSL performance by 0.66% to 14.71%.


  • [1] F Lyu, F Hu, VS Sheng, Z Wu, Q Fu, and B Fu, “Coarse to fine: Multi-label image classification with global/local attention,” in IEEE ISC2, 2018.
  • [2] F Lyu, Q Wu, F Hu, and M Tan, “Attend and imagine: Multi-label image classification with visual attention and recurrent neural network,” IEEE Transactions on Multimedia, 2019.
  • [3] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling, “Attribute-based classification for zero-shot visual object categorization,” TPAMI, 2014.
  • [4] Yongqin Xian, Bernt Schiele, and Zeynep Akata, “Zero-shot learning—the good, the bad and the ugly,” in CVPR, 2017.
  • [5] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid, “Label-embedding for image classification,” TPAMI, 2016.
  • [6] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in CVPR, 2009.
  • [7] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele, “Learning deep representations of fine-grained visual descriptions,” in CVPR, 2016.
  • [8] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal, “A generative adversarial approach for zero-shot learning from noisy texts,” in CVPR, 2018.
  • [9] Yan Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang, “Discriminative learning of latent features for zero-shot recognition,” in CVPR, 2018.
  • [10] Jie Song, Chengchao Shen, Jie Lei, An-Xiang Zeng, Kairi Ou, Dacheng Tao, and Mingli Song, “Selective zero-shot classification with augmented attributes,” in ECCV, 2018.
  • [11] Huajie Jiang, Ruiping Wang, Shiguang Shan, and Xilin Chen, “Learning class prototypes via structure alignment for zero-shot recognition,” in ECCV, 2018.
  • [12] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha, “An empirical study and analysis of generalized zero-shot learning for object recognition in the wild,” in ECCV, 2016.
  • [13] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al., “Devise: A deep visual-semantic embedding model,” in NIPS, 2013.
  • [14] Bernardino Romera-Paredes and Philip Torr, “An embarrassingly simple approach to zero-shot learning,” in ICML, 2015.
  • [15] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele, “Latent embeddings for zero-shot classification,” in CVPR, 2016.
  • [16] Elyor Kodirov, Tao Xiang, and Shaogang Gong, “Semantic autoencoder for zero-shot learning,” in CVPR, 2017.
  • [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in NIPS, 2014.
  • [18] Zihan Ye, Fan Lyu, Jinchang Ren, Yu Sun, Qiming Fu, and Fuyuan Hu, “Dau-gan: Unsupervised object transfiguration via deep attention unit,” LNCS, 2018.
  • [19] Joseph B Kruskal, “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis,” Psychometrika, 1964.
  • [20] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, 2013.
  • [21] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  • [22] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean, “Zero-shot learning by convex combination of semantic embeddings,” in ICLR, 2014.
  • [23] Ziming Zhang and Venkatesh Saligrama, “Zero-shot learning via semantic similarity embedding,” in ICCV, 2015.
  • [24] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele, “Evaluation of output embeddings for fine-grained image classification,” in CVPR, 2015.
  • [25] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha, “Synthesized classifiers for zero-shot learning,” in CVPR, 2016.
  • [26] Yashas Annadani and Soma Biswas, “Preserving semantic relations for zero-shot learning,” in CVPR, 2018.
  • [27] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [28] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
  • [29] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth, “Describing objects by their attributes,” in CVPR, 2009.
  • [30] Genevieve Patterson, Chen Xu, Hang Su, and James Hays, “The sun attribute database: Beyond categories for deeper scene understanding,” IJCV, 2014.