Mitigating Generation Shifts for Generalized Zero-Shot Learning

by   Zhi Chen, et al.
The University of Queensland

Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize both seen and unseen samples, where unseen classes are not observable during training. It is natural to derive generative models and hallucinate training samples for unseen classes based on the knowledge learned from the seen samples. However, most of these models suffer from 'generation shifts', where the synthesized samples may drift from the real distribution of unseen data. In this paper, we conduct an in-depth analysis of this issue and propose a novel Generation Shifts Mitigating Flow (GSMFlow) framework, which comprises multiple conditional affine coupling layers for learning unseen data synthesis efficiently and effectively. In particular, we identify three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance decay, and structural permutation, and address them respectively. First, to reinforce the correlations between the generated samples and the respective attributes, we explicitly embed the semantic information into the transformations in each of the coupling layers. Second, to recover the intrinsic variance of the synthesized unseen features, we introduce a visual perturbation strategy to diversify the intra-class variance of generated data and thereby help adjust the decision boundary of the classifier. Third, to avoid structural permutation in the semantic space, we propose a relative positioning strategy to manipulate the attribute embeddings, guiding them to fully preserve the inter-class geometric structure. Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings. Our code is available at:


1. Introduction

Deep learning techniques have significantly boosted the performance of many tasks in computer vision (Luo et al., 2018b, a; Wang et al., 2020b; Luo et al., 2020c, b; Wang et al., 2020a; Luo et al., 2019, 2020a; Zhang et al., 2021a, b, c, 2017). However, the performance gains come at the cost of an enormous amount of labeled data. Generally, shallow classification algorithms need a certain amount of training data of every class, while deep learning dramatically amplifies the need. Data labeling is time-consuming and expensive, even if the raw data is usually plentiful. Moreover, it is unrealistic to require labeled data for every class, and therefore much attention from researchers has been drawn to Zero-Shot Learning (ZSL) as a solution (Akata et al., 2015a; Xian et al., 2016; Akata et al., 2015b; Yang et al., 2016; Li et al., 2019c). By incorporating side information, e.g., class-level semantic attributes, ZSL transfers semantic-visual relationships from the seen classes to unseen classes without any visual samples. While conventional ZSL aims to recognize only unseen classes, it is infeasible to assume that we will only come across samples from unseen classes. Hence, in this paper, we consider a more realistic and challenging task, Generalized Zero-Shot Learning (GZSL), to classify over both seen and unseen classes (Huang et al., 2019; Schonfeld et al., 2019; Xian et al., 2018b; Li et al., 2020).

Figure 1. An illustration of the generation shifts. a) Implicit semantic encoding causes the semantic inconsistency when generating unseen samples. b) The synthesized samples collapse to fixed modes. c) The synthesized samples fail to fully preserve the geometric relationships in semantic space.

GZSL can be roughly categorized into embedding-based methods (Xian et al., 2016; Akata et al., 2015a, b) and generative methods (Li et al., 2019a; Xian et al., 2018b; Schonfeld et al., 2019; Li et al., 2019b). Embedding-based methods usually cast the semantic and visual features into the same space, and thus the compatibility scores between the visual features and all classes are computed to make predictions. In contrast, generative methods (Long et al., 2017; Xian et al., 2018b; Schonfeld et al., 2019; Chen et al., 2020b; Shen et al., 2020; Li et al., 2021) cast the GZSL problem into a supervised classification task by generating synthesized visual features for unseen classes. Then a supervised classifier can be trained on both real seen visual features and the synthesized unseen visual features.

Figure 2. An illustration of the proposed GSMFlow framework. The conditional generative flow is comprised of a series of conditional affine coupling layers. Particularly, the perturbation is injected into the original visual features to complement the potential patterns, and the global semantics are computed with relative positioning to semantic anchors. For inference, a latent variable is inferred from the visual features of an image sample, conditioned on a global semantic vector. Inversely, given a latent variable drawn from a prior distribution and a global semantic vector, GSMFlow can generate a visual sample accordingly.

In spite of the advances achieved by the generative paradigm, the synthesized visual features are not guaranteed to cover the real distribution of unseen classes, since the unseen features are not exposed during training. Due to the gap between the synthesized and the real distributions of unseen features, the trained classifier tends to misclassify the unseen samples. Therefore, it is critical to analyze the causes and mitigate the resulting generation shifts. In this paper, as illustrated in Figure 1, we identify three common types of generation shifts in generative zero-shot learning methods:

  • Semantic inconsistency: The state-of-the-art GZSL approach (Shen et al., 2020) focuses on preserving the exact generation cycle consistency by a normalizing flow. It constructs a complex prior and disentangles the outputs into semantic and non-semantic vectors. Such implicit encoding may cause the generated samples to be incoherent with the given attributes and to deviate from the real distributions.

  • Variance decay: The synthesized samples commonly collapse into fixed modes and fail to capture the intra-class variance of unseen samples, which can originate from different poses, illumination and background.

  • Structural permutation: It is also hard to fully preserve the geometric relationships between different attribute categories in the shared subspace. If the relative position of unseen samples changes in the visual space, the recognition performance will inevitably degenerate.

To address the aforementioned generation shifts, in this paper, we propose a novel framework for generalized zero-shot learning, namely Generation Shifts Mitigating Flow (GSMFlow), as depicted in Figure 2. In particular, to tackle the semantic inconsistency, we explicitly embed the semantic information into the transformations in each of the coupling layers during inference and generation, enforcing the unseen visual feature generation to be semantically consistent. Furthermore, to mitigate the variance decay issue, a visual perturbation strategy is proposed to diversify training samples by dynamically injecting perturbation noise into the input samples. With this strategy, the decision boundary can be adjusted to suit the real unseen samples at test time. Moreover, to alleviate the semantic structural permutation, the relative positioning mechanism is adopted to correct the attribute representations by preserving the global geometrical relationship to the specific semantic anchors.

Specifically, the visual features of a real sample are first extracted from a backbone network, e.g., ResNet101. In the inference stage, the real sample is perturbed to be a virtual sample. The virtual sample is then input into a conditional generative flow. The inference process is conditioned on the manipulated global semantic vector to infer the latent factors drawn from a prior distribution. The learned generative flow has its inverse transformation as the generative network. Similarly, a global semantic vector is progressively injected into each of the inverse coupling layers to generate unseen visual features from latent factors. We then train a unified classifier with the real seen features and the generated unseen features, which aims to accurately recognize both seen and unseen classes at test time.

To sum up, the contributions of our work are listed as follows:

  • We propose a novel GSMFlow framework for GZSL, which explicitly incorporates the class-level semantic information into both the forward and the inverse transformations of the conditional generative flow. This encourages the synthesized samples to be more coherent with the respective semantic information.

  • We propose a visual perturbation strategy to recover the intrinsic variance in the synthesized unseen samples. By injecting perturbation noise into the seen training sample, we diversify the training samples to enrich the original feature space. As the generative flow is exposed to more diverse virtual samples, the synthesized samples of unseen classes can thus capture more visual potentials.

  • To preserve the geometric relationship between different semantic vectors in the semantic space, we choose different semantic anchors to revise the representation of the attributes.

  • Comprehensive experiments and in-depth analysis on four GZSL benchmark datasets demonstrate the state-of-the-art performance by the proposed GSMFlow framework in the GZSL tasks.

The rest of the paper is organised as follows. We briefly review related work in Section 2. GSMFlow is presented in Section 3, followed by the experiments in Section 4. Lastly, Section 5 concludes the paper.

2. Related Work

2.1. Traditional Zero-shot Learning

Traditional solutions towards zero-shot learning are mostly embedding-based methods. The pioneering method ALE (Akata et al., 2015a) proposes to employ embedding functions to measure the compatibility scores between a semantic embedding and a data sample. SJE (Akata et al., 2015b) extends ALE by using structured SVM (Tsochantaridis et al., 2005) and takes advantage of the structured outputs. DeViSE (Frome et al., 2013) constructs a deep visual semantic embedding model to map 4096-dimensional visual features from AlexNet (Krizhevsky et al., 2012) to 500 or 1000-dimensional skip-gram semantic embeddings. ESZSL (Romera-Paredes and Torr, 2015) theoretically gives the risk bound of the generalization error and connects zero-shot learning with the domain adaptation problem. More recently, SAE (Kodirov et al., 2017) develops a cycle embedding approach with an autoencoder to reconstruct the learned semantic embeddings into a visual space. Later, the cycle architecture is also investigated in generative methods (Chen et al., 2020a). However, these early methods have not achieved satisfactory results on zero-shot learning. In particular, when applied to the GZSL task, the unseen class performance is even worse. Recently, thanks to the advances of generative models, by generating missing visual samples of unseen classes, zero-shot learning can be converted into a supervised classification task.

2.2. Generative GZSL

A number of generative methods have been applied for GZSL, e.g., Generative Adversarial Nets (GANs) (Goodfellow et al., 2014), Variational Autoencoders (VAEs) (Kingma and Welling, 2013), and Alternating Back-Propagation algorithms (ABPs) (Han et al., 2017). f-CLSWGAN (Xian et al., 2018b) presents a WGAN-based (Arjovsky et al., 2017) approach to synthesize unseen visual features based on semantic information. CADA-VAE (Schonfeld et al., 2019) proposes to stack two VAEs, each for one modality, and aligns the latent spaces. The latent space, thus, can enable information sharing among different modality sources. LisGAN (Li et al., 2019a) is inspired by the multi-view property of images and improves f-CLSWGAN by encouraging the generated samples to approximate at least one visually representative view of samples. CANZSL (Chen et al., 2020a) considers the cycle-consistency principle of image generation and proposes a cycle architecture by translating synthesized visual features into semantic information. ABP-ZSL (Zhu et al., 2019) adopts the rarely studied generative models ABPs to generate visual features for unseen classes. GDAN (Huang et al., 2019) incorporates a flexible metric in the model’s discriminator to measure the similarity of features from different modalities.

The bidirectional conditional generative models enforce cycle-consistent generation and allow the generated images to truthfully reflect the conditional information (Zhu et al., 2017; Chen and Luo, 2019; Chen et al., 2020a). Instead of encouraging cycle consistency by adding reverse networks, generative flows are bidirectional and cycle-consistent in nature. The generative flows are designed to infer and generate within the same network. Also, the generative flows are lightweight compared to other methods, as no auxiliary networks are needed, e.g., a discriminator for GANs or a variational encoder for VAEs. Some conditional generative flows (Ardizzone et al., 2018, 2019) are proposed to learn image generation from class-level semantic information. IZF (Shen et al., 2020) adopts the invertible flow model (Ardizzone et al., 2018) for GZSL. It implicitly learns the conditional generation by inferring the semantic vectors. Such implicit encoding may cause the generated samples to be incoherent with the given attributes. Instead, we explicitly blend the semantic information into each of the coupling layers of the generative flow, learning the semantically consistent visual features. We also argue that the MMD regularization in IZF is inappropriate for GZSL, since the seen and unseen classes are coalesced. More explanation is given in Section 3.2.

2.3. Visual Perturbation

Our proposed visual perturbation approach is similar to data augmentation techniques. In many deep learning tasks, data augmentation (Shorten and Khoshgoftaar, 2019) is an effective strategy to improve model performance. Generally, bigger datasets better facilitate the learning of deep models (Sun et al., 2017). Common data augmentation techniques on image data include flipping, rotating, cropping, color jittering, edge enhancement, random erasing (Mikołajczyk and Grochowski, 2018), etc. Instead of augmenting the raw data samples, DeVries and Taylor (DeVries and Taylor, 2017) propose an augmentation strategy in the feature space. In the dense feature space, the augmentation is more direct and an unlimited number of augmented feature samples can be yielded.

Generative zero-shot learning is intrinsically a data augmentation approach to handle the zero-shot learning problem by augmenting the training data of the unseen classes. However, the augmentation power is limited by the training data of the seen classes, as the empirical distribution of training data cannot faithfully reflect the underlying real distribution of seen classes. In this case, we resort to a data augmentation strategy to complement the potential patterns of the seen classes and further transfer them to unseen classes.

It is noteworthy that the motivation of our approach is also different from these data augmentation techniques. These methods statically augment from the real samples for training, resulting in limited augmented samples. Instead of static augmentation, our perturbation noises are sampled dynamically, leading to unlimited augmented samples.

3. Methodology

This section begins with formulating the GZSL problem and introducing the notations. The proposed GSMFlow is outlined next, followed by the visual perturbation strategy and the relative positioning approach to mitigate the generation shifts problem. The model training and zero-shot recognition are introduced lastly.

3.1. Preliminaries

Consider two datasets: a seen set $\mathcal{S}$ and an unseen set $\mathcal{U}$. The seen set $\mathcal{S}$ contains $N^s$ training samples $x^s \in \mathcal{X}$ and the corresponding class labels $y^s$, i.e., $\mathcal{S} = \{(x^s_i, y^s_i)\}_{i=1}^{N^s}$. Similarly, $\mathcal{U} = \{(x^u_i, y^u_i)\}_{i=1}^{N^u}$, where $N^u$ is the number of unseen samples. There are $C^s$ seen and $C^u$ unseen classes, so that $y^s \in \{1, \ldots, C^s\}$ and $y^u \in \{C^s + 1, \ldots, C^s + C^u\}$. Note that the seen and unseen classes are mutually exclusive. There are $C^s + C^u$ class-level semantic vectors $a \in \mathcal{A}$, where the class-level semantic vectors for the seen and unseen classes are $\mathcal{A}^s$ and $\mathcal{A}^u$, respectively. $a_i$ represents the semantic vector for the $i$-th class. In the setting of GZSL, we only have access to the seen samples and semantic vectors during training. Hence, for brevity, in the demonstration of the training process, we will omit the superscript $s$ for all the seen samples, i.e., $x = x^s$, $a = a^s$.

3.2. Conditional Generative Flow

To model the distribution of unseen visual features, we resort to conditional generative flows (Ardizzone et al., 2019), a stream of simple yet powerful generative models.

A conditional generative flow learns inference and generation within the same network. Let $z$ be a random variable from a particular prior distribution $p_Z$, e.g., Gaussian, which has the same dimension as the visual features. We denote the inference process as the forward transformation $f(\cdot)$, whose inverse transformation $f^{-1}(\cdot)$ is the generation process. The transformation is composed of $L$ bijective transformations $f = f_L \circ \cdots \circ f_1$; the forward and inverse computations are compositions of these bijective transformations. Then, we can formalize the conditional generative flow as:

$$z = f(x; a, \theta), \tag{1}$$

where the transformation $f(\cdot)$ parameterized with $\theta$ learns to transform a sample $x$ in the target distribution to the prior data distribution $p_Z$ conditioned on the corresponding class-level semantic vector $a$. The forward transformation has its inverse transformation $f^{-1}(\cdot)$, which flows from the prior distribution $p_Z$ towards the target data distribution $p_X$:

$$x = f^{-1}(z; a, \theta). \tag{2}$$

With the change of variables formula, the bijective function can be trained through maximum log-likelihood:

$$\log p_\theta(x \mid a) = \log p_Z\big(f(x; a, \theta)\big) + \log \left| \det \frac{\partial f}{\partial x} \right|, \tag{3}$$

where the latter term denotes the logarithm of the absolute determinant of the Jacobian matrix.
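As a small worked illustration of the change of variables formula (not taken from the paper), consider a one-dimensional affine flow with a standard Gaussian prior. The scale $s$ and shift $t$ are hypothetical parameters; in the conditional setting they would depend on the semantic vector:

```latex
z = e^{s} x + t, \qquad
\log p(x) = \log \mathcal{N}(z; 0, 1) + \log \left| \frac{\partial z}{\partial x} \right|
          = -\tfrac{1}{2} z^{2} - \tfrac{1}{2} \log(2\pi) + s .
```

The Jacobian term reduces to $s$ because $\partial z / \partial x = e^{s}$. For instance, with $s = 0$, $t = 0$ and $x = 0$, the log-likelihood is $-\tfrac{1}{2}\log(2\pi) \approx -0.919$.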

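According to Bayes’ theorem, the posterior distribution $p(\theta \mid x, a)$ for the parameter $\theta$ is proportional to $p_\theta(x \mid a)\, p(\theta)$. The objective function can then be formulated as:

$$\mathcal{L}_{flow} = \mathbb{E}\big[ -\log p_\theta(x \mid a) \big] - \log p(\theta). \tag{4}$$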

Figure 3. The flowcharts of the conditional affine coupling layers. (a) The transformation flowchart when training the model. (b) The inverse direction for generating new samples.

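The composed bijective transformations are conditional affine coupling layers. As shown in Figure 3, in the forward transformation, each layer splits the input vector $u$ into two factors $[u_1, u_2]$. Note that in the first coupling layer, the input is $x$. Within each coupling layer, the internal functions $s_1(\cdot)$, $t_1(\cdot)$, $s_2(\cdot)$, and $t_2(\cdot)$ are formulated as:

$$v_1 = u_1 \odot \exp\big(s_1(u_2, a)\big) + t_1(u_2, a), \qquad v_2 = u_2 \odot \exp\big(s_2(v_1, a)\big) + t_2(v_1, a), \tag{5}$$

where $\odot$ is the element-wise multiplication and the outputs $v_1$ and $v_2$ are fed into the next affine transformation. In the last coupling layer, the output will be $z$. In the inverse direction, the conditional affine coupling module takes $v_1$ and $v_2$ as inputs:

$$u_2 = \big(v_2 - t_2(v_1, a)\big) \oslash \exp\big(s_2(v_1, a)\big), \qquad u_1 = \big(v_1 - t_1(u_2, a)\big) \oslash \exp\big(s_1(u_2, a)\big), \tag{6}$$

where $\oslash$ is the element-wise division. As proposed in Real NVP (Dinh et al., 2016), when combining coupling layers, the Jacobian determinant remains tractable and its logarithm is the sum of $s_1$ and $s_2$ over visual feature dimensions.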
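The forward and inverse passes of a conditional affine coupling layer can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sub-networks are stand-in random linear maps (the paper uses FC layers in PyTorch), the condition is injected by concatenation, and the names `make_subnet`, `ConditionalAffineCoupling` and `gaussian_nll` are hypothetical.

```python
import math
import random


def make_subnet(in_dim, cond_dim, out_dim, rng, scale=0.1):
    """Stand-in for a learned sub-network: a fixed random linear map (assumption)."""
    w = [[rng.uniform(-scale, scale) for _ in range(in_dim + cond_dim)]
         for _ in range(out_dim)]

    def net(u, a):
        inp = list(u) + list(a)  # condition by concatenating the semantic vector
        return [sum(wi * xi for wi, xi in zip(row, inp)) for row in w]

    return net


class ConditionalAffineCoupling:
    def __init__(self, dim, cond_dim, seed=0):
        rng = random.Random(seed)
        self.h = dim // 2                      # split point between the two factors
        h1, h2 = self.h, dim - self.h
        self.s1 = make_subnet(h2, cond_dim, h1, rng)
        self.t1 = make_subnet(h2, cond_dim, h1, rng)
        self.s2 = make_subnet(h1, cond_dim, h2, rng)
        self.t2 = make_subnet(h1, cond_dim, h2, rng)

    def forward(self, x, a):
        """Map an input to the layer output and return the log-determinant."""
        u1, u2 = x[:self.h], x[self.h:]
        s1, t1 = self.s1(u2, a), self.t1(u2, a)
        v1 = [u * math.exp(s) + t for u, s, t in zip(u1, s1, t1)]
        s2, t2 = self.s2(v1, a), self.t2(v1, a)
        v2 = [u * math.exp(s) + t for u, s, t in zip(u2, s2, t2)]
        log_det = sum(s1) + sum(s2)            # sum of scales over all dimensions
        return v1 + v2, log_det

    def inverse(self, v, a):
        """Undo forward(): recover the layer input from its output."""
        v1, v2 = v[:self.h], v[self.h:]
        s2, t2 = self.s2(v1, a), self.t2(v1, a)
        u2 = [(y - t) / math.exp(s) for y, s, t in zip(v2, s2, t2)]
        s1, t1 = self.s1(u2, a), self.t1(u2, a)
        u1 = [(y - t) / math.exp(s) for y, s, t in zip(v1, s1, t1)]
        return u1 + u2


def gaussian_nll(z, log_det):
    """Negative log-likelihood under a standard Gaussian prior on z."""
    return 0.5 * sum(v * v for v in z) + 0.5 * len(z) * math.log(2 * math.pi) - log_det
```

Because the inverse reuses the same scale and shift functions, no separate generator network is required; stacking several such layers and summing their log-determinants would give the log-likelihood term of the full flow.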

Discussion: Comparison to IZF. Compared to the state-of-the-art approach IZF (Shen et al., 2020), which also leverages a conditional generative flow in GZSL, our proposed GSMFlow differs in two aspects:

1) Explicit conditional encoding. In the training stage, IZF learns conditional encoding by disentangling the visual features into semantic and non-semantic vectors:

$$[\hat{a}, z] = f(x; \theta). \tag{7}$$

We argue that this approach of incorporating conditional information is implicit and may cause the generated samples to be incoherent with the given attributes and to deviate from the real distribution. Instead, we explicitly blend the semantic information into each of the bijective coupling layers. By repeatedly enhancing the impact of the semantic information in the model, the generated visual samples tend to be more semantically consistent.

2) Drawbacks of the negative MMD. In IZF, a negative Maximum Mean Discrepancy (MMD) regularization is applied to increase the discrepancy between the generated unseen distributions and the real distributions of the seen classes. We argue that this regularization is unsuitable for the two mixed distributions of seen and unseen classes. MMD is commonly used in domain adaptation to minimize the discrepancy between the source and the target domains. However, in GZSL, as the seen and unseen classes generally come from the same domain, their visual relationships are highly coalesced. Hence, generally separating the two distributions may cause unexpected distribution shifts, which is undesirable for generating realistic unseen samples.

In addition to the above differences, we also identify two problems, i.e., variance decay and structural permutation. Accordingly, the visual perturbation and the relative positioning strategies are proposed to handle the two problems.

3.3. Visual Perturbation

To mitigate the variance decay problem, we introduce a simple yet effective visual perturbation strategy to preserve the intra-class variance. We begin with sampling a perturbation vector $e \in \mathbb{R}^{d_v}$ from a Gaussian distribution with the same size $d_v$ as the visual features:

$$e \sim \mathcal{N}(0, I). \tag{8}$$

In the high-dimensional visual space, we aim to make the perturbation more diverse. Hence, we construct a dropout mask $m$ to selectively filter out some of the dimensions of the perturbation vectors:

$$m_i \sim \mathrm{Bernoulli}(p), \quad i = 1, \ldots, d_v, \tag{9}$$

where $p$ is the probability of keeping the perturbation in the $i$-th dimension. Then, we have the perturbation vector after applying the dropout mask:

$$\hat{e} = m \odot e. \tag{10}$$

A virtual sample $\tilde{x}$ can then be yielded by:

$$\tilde{x} = x + \lambda_1 \hat{e}, \tag{11}$$

where $\lambda_1$ is the coefficient that controls the degree of perturbation.
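The perturbation steps above (Gaussian noise, dropout mask, scaled addition) can be sketched as a single function. This is a minimal sketch that assumes unit-variance noise; the function name `perturb` and the parameter names `lam` and `keep_prob` are hypothetical.

```python
import random


def perturb(x, lam, keep_prob, rng):
    """Return a virtual sample x + lam * (e * m).

    e is Gaussian noise (unit variance is an assumption), m is a binary
    dropout mask that keeps each dimension with probability keep_prob,
    and lam controls the degree of perturbation.
    """
    e = [rng.gauss(0.0, 1.0) for _ in x]
    m = [1.0 if rng.random() < keep_prob else 0.0 for _ in x]
    return [xi + lam * ei * mi for xi, ei, mi in zip(x, e, m)]
```

Calling `perturb` afresh on every training iteration draws new noise each time, which matches the dynamic sampling described above rather than a fixed, precomputed augmentation set.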

While the visual perturbation reveals the underlying virtual samples, we may unexpectedly incorporate some noisy samples, resulting in distribution shifts. We argue that the prototype of each class should be invariant when introducing virtual samples. Thus, in order to avoid distribution shift, we aim to fix the prototype for each class when perturbing the real samples. A class prototype $x^p_c$ is defined as the mean vector of the samples in the same class from the empirical distribution:

$$x^p_c = \frac{1}{N_c} \sum_{i=1}^{N_c} x^{(c)}_i, \tag{12}$$

where $N_c$ is the sample number of the $c$-th class in the training set.

When generating a visual sample conditioned on the $c$-th class with the corresponding semantic vector $a_c$, the expected mean sample should be close to the class prototype, given as the latent input the mean vector of the prior distribution, i.e., all zeros $z^p = \mathbf{0}$:

$$\mathcal{L}_{proto} = \frac{1}{C^s} \sum_{c=1}^{C^s} \left\| f^{-1}(z^p; a_c, \theta) - x^p_c \right\|, \tag{13}$$

where $C^s$ is the number of seen classes.
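The prototype constraint can be sketched as follows. This is an illustrative sketch only: `generate` is a hypothetical callable standing in for the inverse flow, and the squared Euclidean distance is an assumption, since the choice of norm is not recoverable from the text above.

```python
def class_prototypes(features, labels):
    """Mean feature vector per class (the class prototype)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def prototype_loss(generate, prototypes, semantics):
    """Average distance between generate(all-zero latent, a_c) and prototype c.

    Uses the squared Euclidean distance (an assumption).
    """
    total = 0.0
    for c, proto in prototypes.items():
        z0 = [0.0] * len(proto)          # mean of a zero-centred prior
        x_hat = generate(z0, semantics[c])
        total += sum((g - p) ** 2 for g, p in zip(x_hat, proto))
    return total / len(prototypes)
```

The loss is zero exactly when the sample generated from the prior mean coincides with each class prototype, which is the invariance the constraint is meant to enforce.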

3.4. Relative Positioning

To enhance the inter-class relationship for unseen classes, we introduce the relative positioning technique, preserving the geometric information between different attribute categories in the shared subspace. We revise the semantic vectors by measuring the responses to a particular number of semantic anchors. The revised semantic vectors are defined as the global semantic vectors. We begin with constructing a semantic graph with the class-level semantic vectors. The edges $w_{ij}$ are defined as the cosine similarities between all semantic vectors:

$$w_{ij} = \frac{a_i \cdot a_j}{\|a_i\| \, \|a_j\|}, \tag{14}$$

where $w_{ij}$ refers to the similarity between the $i$-th class and the $j$-th class. Then, for each class, we calculate the sum of similarities to all other classes:

$$d_i = \sum_{j=1,\, j \neq i}^{C^s} w_{ij}, \tag{15}$$

where $C^s$ denotes the total number of seen classes. We define the three semantic anchors $a_{max}$, $a_{min}$, $a_{med}$ as the semantic vectors with the highest, lowest, and median sum similarities to other semantic vectors. The global semantic vectors are then acquired by computing the responses from these three semantic anchors.
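The anchor selection described above can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and taking the upper median when the number of classes is even is an assumption. How the responses to the anchors are computed is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def select_anchors(semantics):
    """Return the indices of the (highest, lowest, median) sum-similarity classes."""
    n = len(semantics)
    sums = [sum(cosine(semantics[i], semantics[j]) for j in range(n) if j != i)
            for i in range(n)]
    order = sorted(range(n), key=lambda i: sums[i])   # ascending sum similarity
    return order[-1], order[0], order[n // 2]         # upper median if n is even
```

For three toy vectors at 0, 30 and 90 degrees, the vector at 30 degrees is closest to the others overall, and the one at 90 degrees is the most isolated.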

The dimensionality of visual features is usually much higher than that of semantic vectors, e.g., 2,048 vs. 85 in the AWA dataset. Thus, the generation process can be potentially dominated by visual features. Section 4.7 discusses the impact of the semantic vectors’ dimensionality. To avoid this issue, we apply three functions to map the semantic responses to a higher dimension. The global semantic vector for each class can then be formulated as:


The global semantic vectors revised by relative positioning are fed into the conditional generation flow as the conditional information.

3.5. Training and Zero-shot Inference

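With the derived virtual samples $\tilde{x}$ and the global semantic vectors $a^g$, the objective functions $\mathcal{L}_{flow}$ and $\mathcal{L}_{proto}$ in Equation 4 and Equation 13 should be rewritten as:

$$\mathcal{L}_{flow} = \mathbb{E}\big[ -\log p_\theta(\tilde{x} \mid a^g) \big] - \log p(\theta), \qquad \mathcal{L}_{proto} = \frac{1}{C^s} \sum_{c=1}^{C^s} \left\| f^{-1}(z^p; a^g_c, \theta) - x^p_c \right\|. \tag{17}$$

Then, the overall objective function of the proposed GSMFlow is formulated as:

$$\mathcal{L} = \mathcal{L}_{flow} + \lambda_2 \mathcal{L}_{proto}, \tag{18}$$

where $\lambda_2$ is the coefficient of the prototype loss.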

After the conditional generative flow is trained on the seen classes with the virtual samples and the global semantic vectors, it is leveraged to generate visual features of unseen classes:

$$\hat{x}^u = f^{-1}(z; a^g, \theta), \quad z \sim p_Z, \tag{19}$$

where $a^g$ is the global semantic vector of an unseen class. A softmax classifier is then trained on the real visual features of seen classes and the synthesized visual features of unseen classes. For incoming test samples from either seen or unseen classes, the softmax classifier aims to predict the corresponding class label accurately.
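The synthesis step can be sketched as below: latent vectors are drawn from the prior and pushed through the inverse flow once per unseen class, and the labelled results form the unseen part of the classifier's training set. The callable `inverse_flow`, the function name, and the standard Gaussian prior are assumptions made for illustration.

```python
import random


def synthesize_unseen(inverse_flow, unseen_semantics, n_per_class, dim, seed=0):
    """Generate labelled features for unseen classes.

    inverse_flow(z, a) is a hypothetical stand-in for the trained generative
    direction of the flow; unseen_semantics maps a class label to its
    (global) semantic vector.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for label, a in unseen_semantics.items():
        for _ in range(n_per_class):
            z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # sample from the prior
            features.append(inverse_flow(z, a))
            labels.append(label)
    return features, labels
```

The returned features would be concatenated with the real seen-class features before fitting the final classifier.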

4. Experiments

| Methods | aPaY U | aPaY S | aPaY H | AWA U | AWA S | AWA H | CUB U | CUB S | CUB H | FLO U | FLO S | FLO H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| † LATEM (Xian et al., 2016) | 0.1 | 73.0 | 0.2 | 13.3 | 77.3 | 20.0 | 15.2 | 57.3 | 24.0 | 6.6 | 47.6 | 11.5 |
| † ALE (Akata et al., 2015a) | 4.6 | 73.7 | 8.7 | 14.0 | 81.8 | 23.9 | 23.7 | 62.8 | 34.4 | 13.3 | 61.6 | 21.9 |
| † SJE (Akata et al., 2015b) | 1.3 | 71.4 | 2.6 | 8.0 | 73.9 | 14.4 | 23.5 | 59.2 | 33.6 | - | - | - |
| † SAE (Kodirov et al., 2017) | 0.4 | 80.9 | 0.9 | 1.1 | 82.2 | 2.2 | 7.8 | 54.0 | 13.6 | - | - | - |
| † LFGAA (Liu et al., 2019) | - | - | - | 27.0 | 93.4 | 41.9 | 36.2 | 80.9 | 50.0 | - | - | - |
| † TCN (Jiang et al., 2019) | 24.1 | 64.0 | 35.1 | 61.2 | 65.8 | 63.4 | 52.6 | 52.0 | 52.3 | - | - | - |
| † DVBE (Min et al., 2020) | 32.6 | 58.3 | 41.8 | 63.6 | 70.8 | 67.0 | 53.2 | 60.2 | 56.5 | - | - | - |
| ‡ GAZSL (Zhu et al., 2018) | 14.2 | 78.6 | 24.0 | 35.4 | 86.9 | 50.3 | 31.7 | 61.3 | 41.8 | 28.1 | 77.4 | 41.2 |
| ‡ f-CLSWGAN (Xian et al., 2018b) | 32.9 | 61.7 | 42.9 | 56.1 | 65.5 | 60.4 | 43.7 | 57.7 | 49.7 | 59.0 | 73.8 | 65.6 |
| ‡ CANZSL (Chen et al., 2020a) | - | - | - | 49.7 | 70.2 | 58.2 | 47.9 | 58.1 | 52.5 | 58.2 | 77.6 | 66.5 |
| ‡ CADA-VAE (Schonfeld et al., 2019) | 31.7 | 55.1 | 40.3 | 55.8 | 75.0 | 63.9 | 51.6 | 53.5 | 52.4 | 51.6 | 75.6 | 61.3 |
| ‡ f-VAEGAN-D2 (Xian et al., 2019) | - | - | - | 57.6 | 70.6 | 63.5 | 48.4 | 60.1 | 53.6 | 56.8 | 74.9 | 64.6 |
| ‡ EUC-VAE (Chen et al., 2021) | 35.0 | 62.7 | 44.9 | 55.2 | 78.9 | 64.9 | 50.8 | 55.1 | 52.9 | 54.0 | 79.0 | 64.1 |
| ‡ TF-VAEGAN (Narayan et al., 2020) | - | - | - | 59.8 | 75.1 | 66.6 | 52.8 | 64.7 | 58.1 | 62.5 | 84.1 | 71.7 |
| ‡ E-PGN (Yu et al., 2020) | - | - | - | 52.6 | 83.5 | 64.6 | 52.0 | 61.1 | 56.2 | 71.5 | 82.2 | 76.5 |
| ‡ IZF (Shen et al., 2020) | 42.3 | 60.5 | 49.8 | 60.6 | 77.5 | 68.0 | 52.7 | 68.0 | 59.4 | - | - | - |
| ‡ GSMFlow | 42.0 | 62.3 | 50.2 | 64.5 | 82.1 | 72.3 | 61.4 | 67.4 | 64.3 | 86.6 | 87.8 | 87.2 |

Table 1. Performance comparison in accuracy (%) on four datasets. We report the accuracies of unseen, seen classes and their harmonic mean, which are denoted as U, S and H. The best results of the harmonic mean are highlighted in bold. † and ‡ represent embedding-based and generative methods, respectively.

In this section, we evaluate our approach GSMFlow in both generalized zero-shot learning and conventional zero-shot learning tasks. We first introduce the datasets and experimental settings and then compare GSMFlow with the state-of-the-art methods. Finally, we study the effectiveness of the proposed model with an ablation study and a hyper-parameter sensitivity analysis.

4.1. Datasets

We conduct experiments on four widely used benchmark datasets of image classification. They are two fine-grained datasets, i.e., Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011) and Oxford Flowers (FLO) (Nilsback and Zisserman, 2008), and two coarse-grained datasets, i.e., Attribute Pascal and Yahoo (aPaY) (Farhadi et al., 2009) and Animals with Attributes 2 (AWA) (Lampert et al., 2013). CUB consists of 11,788 images from 200 fine-grained bird species, of which 150 are selected as seen classes and 50 as unseen classes. FLO contains 8,189 images from 102 flower categories, 82 of which are chosen as seen classes. The class-level semantic vectors of CUB and FLO are extracted from the fine-grained visual descriptions (10 sentences per image), yielding 1,024-dimensional character-based CNN-RNN features (Reed et al., 2016) for each class. The aPaY dataset contains 18,627 images from 42 classes, with 30 seen classes and 12 unseen classes. Each class is annotated with 64 attributes. AWA is a considerably larger dataset with 30,475 images from 50 classes, which are annotated with 85 attributes. For the split, 40 of the total classes are selected as seen classes and ten as unseen classes. For each dataset, we follow the split setting proposed in (Xian et al., 2018a) and (Nilsback and Zisserman, 2008).

4.2. Implementation Details

Our framework is implemented with the open-source machine learning library PyTorch. The conditional generative flow consists of a series of affine coupling layers. Each affine coupling layer is implemented with two fully connected (FC) layers, and the first FC layer is followed by a LeakyReLU activation function. The hidden dimension of the FC layers is set to 2,048. The coefficient of the perturbation degree and the coefficient of the prototype loss are set within {0.02, 0.05, 0.15, 0.3, 0.5} and {1, 3, 10, 20, 30}, respectively. The dimension of global semantic vectors varies in {128, 256, 512, 1,024, 2,048} and the number of affine coupling layers varies in {1, 3, 5, 10, 20}. The corresponding results are given in Section 4.7. The function that maps the semantic responses to a higher dimension is implemented with an FC layer and a ReLU activation function. We use the Adam optimizer, and set the batch size to 256 and the learning rate to 3e-4. All the experiments are performed on a Lenovo workstation with two NVIDIA GeForce RTX 2080 Ti GPUs.

4.3. Evaluation Metrics

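To avoid the failure of classification accuracy under imbalanced class distributions, we adopt the average per-class Top-1 accuracy as the evaluation criterion for conventional ZSL and for the seen and unseen set performance in GZSL:

$$Acc_{\mathcal{C}} = \frac{1}{|\mathcal{C}|} \sum_{c=1}^{|\mathcal{C}|} \frac{\#\,\text{correct predictions in class } c}{\#\,\text{samples in class } c}, \tag{20}$$

where $|\mathcal{C}|$ is the number of testing classes. A prediction is counted as correct when the ground-truth class receives the highest probability among all candidate classes. Following (Xian et al., 2018a), the harmonic mean of the average per-class Top-1 accuracies on seen and unseen classes is used to evaluate the performance of generalized zero-shot learning. It is computed by:

$$H = \frac{2 \times U \times S}{U + S}, \tag{21}$$

where $U$ and $S$ denote the average per-class accuracies on unseen and seen classes, respectively.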
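The two metrics described above can be sketched in a few lines of Python. The function names are illustrative and not taken from the authors' code.

```python
def per_class_top1(y_true, y_pred):
    """Average of the per-class Top-1 accuracies (each class weighted equally)."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(u, s):
    """Harmonic mean H of the unseen accuracy u and the seen accuracy s."""
    return 0.0 if u + s == 0 else 2 * u * s / (u + s)
```

With three samples of one class all predicted correctly and a single sample of another class predicted wrongly, plain accuracy would be 0.75 while the per-class average is 0.5, which is why the per-class form is preferred for imbalanced test sets. As a check against Table 1, `harmonic_mean(42.0, 62.3)` rounds to 50.2, the aPaY value reported for GSMFlow.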


4.4. Comparisons with State-of-the-art Methods

Table 1 summarizes the performance comparison between our proposed GSMFlow and other state-of-the-art methods in the setting of GZSL. We choose the most representative state-of-the-art methods for comparison: the methods in the upper part of the table are embedding-based, and those in the lower part, together with our proposed method, are generative. It can be seen that our proposed framework achieves the best harmonic mean on all four datasets. Among the generative methods, IZF (Shen et al., 2020) also leverages normalizing flows as the base generative model. However, we show that by mitigating the generation shifts, our proposed GSMFlow improves the harmonic mean over IZF on every dataset for which IZF results are reported.


| Methods | aPaY | AWA | CUB | FLO |
|---|---|---|---|---|
| ALE (Akata et al., 2015a) | 39.7 | 62.5 | 54.9 | 48.5 |
| SJE (Akata et al., 2015b) | 31.7 | 61.9 | 54.0 | 53.4 |
| ESZSL (Romera-Paredes and Torr, 2015) | 38.3 | 58.6 | 51.9 | 51.0 |
| LFGAA (Liu et al., 2019) | - | 68.1 | 67.6 | - |
| DCN (Liu et al., 2018) | 43.6 | 65.2 | 56.2 | - |
| TCN (Jiang et al., 2019) | 43.6 | 65.2 | 56.2 | - |
| GAZSL (Zhu et al., 2018) | 41.1 | 70.2 | 55.8 | 60.5 |
| f-CLSWGAN (Xian et al., 2018b) | 40.5 | 65.3 | 57.3 | 69.6 |
| cycle-CLSWGAN (Felix et al., 2018) | - | 66.8 | 58.6 | 70.3 |
| CADA-VAE (Schonfeld et al., 2019) | - | 64.0 | 60.4 | - |
| f-VAEGAN-D2 (Xian et al., 2019) | - | 71.1 | 61.0 | 67.7 |
| DLFZRL (Tong et al., 2019) | 46.7 | 70.3 | 61.8 | - |
| TF-VAEGAN (Narayan et al., 2020) | - | 72.2 | 64.9 | 70.8 |
| E-PGN (Yu et al., 2020) | - | 73.4 | 72.4 | 85.7 |
| IZF (Shen et al., 2020) | 44.9 | 74.5 | 67.1 | - |
| GSMFlow | 49.2 | 72.7 | 76.4 | 86.9 |

Table 2. Conventional ZSL accuracy (%). The best results are formatted in bold.

4.5. Conventional Zero-shot Learning

Although GSMFlow is mainly proposed for GZSL, we also conduct experiments in the context of conventional ZSL to further validate the effectiveness of our proposed framework. Table 2 summarizes the conventional zero-shot learning performance. We achieve better performance than all the compared methods on the aPaY, CUB, and FLO datasets. On the AWA dataset, we also achieve comparatively good performance.
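The difference between the two evaluation settings lies in the candidate label set at test time: conventional ZSL restricts predictions to the unseen classes, whereas GZSL searches over seen and unseen classes together. The following is a minimal sketch of that distinction only; the class names, the scores, and the helper `predict` are hypothetical and are not taken from the paper.

```python
def predict(scores, candidate_classes):
    """Return the candidate class with the highest score."""
    return max(candidate_classes, key=lambda c: scores[c])


seen = ["dog", "cat"]        # hypothetical seen classes
unseen = ["horse", "sheep"]  # hypothetical unseen classes

# Hypothetical classifier scores for one test image of the unseen class "horse".
scores = {"dog": 0.40, "cat": 0.05, "horse": 0.35, "sheep": 0.20}

zsl_pred = predict(scores, unseen)          # conventional ZSL: unseen classes only
gzsl_pred = predict(scores, seen + unseen)  # GZSL: seen and unseen classes
print(zsl_pred, gzsl_pred)                  # horse dog
```

With these made-up scores the same image is classified correctly in the conventional setting but is pulled towards a seen class in the generalized setting, which is why GZSL results are typically lower and are reported with separate seen and unseen accuracies.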

| Settings | aPaY U | aPaY S | aPaY H | AWA U | AWA S | AWA H |
|---|---|---|---|---|---|---|
| IZF w/o constraints | 35.2 | 54.2 | 42.7 | 38.1 | 78.9 | 51.4 |
| GSMFlow w/o constraints | 36.5 | 60.0 | 45.4 | 53.3 | 67.6 | 59.6 |
| GSMFlow w/o VP | 38.4 | 62.3 | 47.6 | 62.2 | 71.3 | 66.4 |
| GSMFlow w/o RP | 39.6 | 61.3 | 48.1 | 62.9 | 80.4 | 70.6 |
| GSMFlow | 42.0 | 62.3 | **50.2** | 64.5 | 82.1 | **72.3** |

Table 3. Effects of different components on aPaY and AWA datasets. U, S and H represent unseen, seen and harmonic mean, respectively. The best results of harmonic mean are formatted in bold.

Figure 4. Class-wise performance comparison with and without visual perturbation. The vertical labels are the ground truth and the horizontal labels are the predictions.

4.6. Ablation Study

To analyze the contribution of each proposed component and the merit of addressing the three problems in generative zero-shot learning, i.e., semantic inconsistency, variance decay, and structural permutation, we conduct an ablation study on the proposed GSMFlow. We decompose the complete framework into four variants: IZF w/o constraints, the conditional generative flow adopted in the compared method (Shen et al., 2020); GSMFlow w/o constraints, the generative flow in our proposed framework without visual perturbation and global semantic learning; GSMFlow w/o VP, the complete framework without visual perturbation; and GSMFlow w/o RP, the framework in which the original class-level semantic vectors are used, i.e., without relative positioning.

Comparing IZF w/o constraints with GSMFlow w/o constraints contrasts the two ways of incorporating conditional information in the generative flow. Instead of mixing the conditional information with the prior input as in IZF, our explicit conditional strategy, which progressively injects the semantic information into the affine coupling layers during training, achieves higher accuracy on GZSL. Compared with GSMFlow w/o constraints, the results of the variant GSMFlow w/o VP show that, by mitigating the structural permutation issue, the relative positioning strategy that captures geometric relationships between semantic vectors significantly improves the GZSL performance. Likewise, GSMFlow w/o RP indicates the effectiveness of the visual perturbation strategy in preventing variance decay. Combining these components yields the best performance.
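To make the idea of injecting the condition into a coupling layer concrete, the following is a minimal sketch of a single conditional affine coupling layer in plain Python. It is not the authors' implementation: the class name, the fixed toy weight matrices, the use of single linear maps in place of the sub-networks, and the `tanh` bound on the log-scale are all assumptions made for illustration. What it shows is that the condition `a` enters the scale and translation of every layer, and that the transformation remains exactly invertible.

```python
import math


def linear(weights, vec):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class ConditionalAffineCoupling:
    """y1 = x1, y2 = x2 * exp(s(x1, a)) + t(x1, a), where a is the condition."""

    def __init__(self, w_scale, w_shift):
        self.w_scale = w_scale  # maps [x1; a] to the log-scale s
        self.w_shift = w_shift  # maps [x1; a] to the translation t

    def _params(self, x1, a):
        h = x1 + a  # list concatenation of the untouched half and the condition
        s = [math.tanh(v) for v in linear(self.w_scale, h)]  # bounded log-scale
        t = linear(self.w_shift, h)
        return s, t

    def forward(self, x, a):
        half = len(x) // 2
        x1, x2 = x[:half], x[half:]
        s, t = self._params(x1, a)
        y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, s, t)]
        return x1 + y2, sum(s)  # output and log|det Jacobian|

    def inverse(self, y, a):
        half = len(y) // 2
        y1, y2 = y[:half], y[half:]
        s, t = self._params(y1, a)  # y1 equals x1, so s and t are recovered exactly
        x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, s, t)]
        return y1 + x2


# Toy weights (hypothetical): 2 outputs, 4 inputs = 2 feature dims + 2 condition dims.
w_scale = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.4]]
w_shift = [[0.5, 0.0, -0.3, 0.2], [-0.1, 0.3, 0.0, 0.1]]
layer = ConditionalAffineCoupling(w_scale, w_shift)

x = [1.0, -2.0, 0.5, 3.0]  # toy visual feature
a = [1.0, 0.0]             # toy semantic condition
y, log_det = layer.forward(x, a)
x_rec = layer.inverse(y, a)
print(max(abs(u - v) for u, v in zip(x, x_rec)) < 1e-9)  # True
```

A full flow would stack several such layers and alternate which half is transformed; the sketch keeps one layer so that the role of the condition is easy to follow.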

Figure 5. Distribution comparison of the unseen classes in the AWA dataset. (a) The real distributions of the visual features extracted from the backbone. (b) The synthesized distributions by cGAN. (c) The synthesized distributions by GSMFlow.

Figure 6. Hyper-parameter sensitivity. The horizontal axis indicates the varying hyper-parameters for (a) perturbation weight, (b) prototype loss weight, (c) semantic vector dimensions, and (d) number of affine coupling layers. The vertical axis reports the corresponding performance.

To further investigate the performance boost from visual perturbation, Figure 4 compares the class-wise accuracy with and without visual perturbation through the confusion matrices of the two settings. Without visual perturbation, horses are easily misclassified as sheep and reach only 14% accuracy. The accuracy rises to 55% when we introduce the visual perturbation strategy. Similar observations hold for other classes.
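The effect of perturbation on intra-class variance can be illustrated with a small sketch. It assumes, for illustration only, that the perturbation is additive zero-mean Gaussian noise whose standard deviation plays the role of the perturbation degree; the exact form used by GSMFlow is not spelled out in this section, and the helper names `perturb` and `mean_variance` are hypothetical.

```python
import random
import statistics


def perturb(features, degree, rng):
    """Add zero-mean Gaussian noise with standard deviation `degree` to every dimension."""
    return [[v + rng.gauss(0.0, degree) for v in f] for f in features]


def mean_variance(features):
    """Mean over dimensions of the per-dimension population variance."""
    dims = list(zip(*features))
    return statistics.fmean([statistics.pvariance(d) for d in dims])


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# A "collapsed" class: 200 identical toy feature vectors, i.e. no intra-class variance.
collapsed = [[1.0, 2.0] for _ in range(200)]
perturbed = perturb(collapsed, degree=0.15, rng=rng)

print(mean_variance(collapsed))  # 0.0
print(mean_variance(perturbed))  # close to 0.15 ** 2
```

The perturbed set has a variance close to the square of the chosen degree, so a larger degree spreads each class more widely, which is the property the ablation attributes the reduced horse-versus-sheep confusion to.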

4.7. Hyper-parameter Analysis

GSMFlow mainly involves four hyper-parameters in model training, as shown in Figure 6. Varying the perturbation weight from 0.02 to 0.5 in Figure 6(a), we find that performance on aPaY and AWA peaks at around 0.02 and 0.15, respectively. The weight of the prototype loss does not have a significant impact on the performance: in Figure 6(b), we vary it between 1 and 30 and observe that the best results on aPaY and AWA2 are achieved at 3 and 10, respectively. As discussed in Section 3.4, the semantic vectors usually have lower dimensions than the visual features, which makes the visual features dominate the generation process in the affine coupling layers. In Figure 6(c), we report the impact of the dimensionality of the semantic vectors. Low-dimensional semantic vectors tend to jeopardize the performance, and on both datasets the best performance is achieved at 1,024 dimensions. We also investigate the impact of the number of conditional affine coupling layers, which directly influences the generative model size. In Figure 6(d), the best results are obtained with 3 and 5 layers for the aPaY and AWA datasets, respectively.
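A sensitivity study of this kind is commonly run as a one-at-a-time sweep, in which a single hyper-parameter is varied while the others stay fixed. The sketch below enumerates such a sweep over the candidate values listed in the implementation details. The key names are hypothetical, the defaults are set to the values the text reports as best on AWA, and holding the other hyper-parameters fixed is an assumption about the protocol rather than something the paper states.

```python
# Candidate values as listed in the implementation details.
grid = {
    "perturbation_degree": [0.02, 0.05, 0.15, 0.3, 0.5],
    "prototype_loss_weight": [1, 3, 10, 20, 30],
    "semantic_dim": [128, 256, 512, 1024, 2048],
    "num_coupling_layers": [1, 3, 5, 10, 20],
}

# Values held fixed while one hyper-parameter is varied (assumed protocol).
defaults = {
    "perturbation_degree": 0.15,
    "prototype_loss_weight": 10,
    "semantic_dim": 1024,
    "num_coupling_layers": 5,
}


def one_at_a_time(grid, defaults):
    """Yield (name, config) pairs that change a single hyper-parameter from the defaults."""
    for name, values in grid.items():
        for value in values:
            yield name, {**defaults, name: value}


configs = list(one_at_a_time(grid, defaults))
print(len(configs))  # 20: four hyper-parameters with five values each
```

Each configuration would then be passed to a training and evaluation routine, and the resulting accuracy plotted against the varied value, giving one curve per panel of Figure 6.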

4.8. t-SNE visualization

The quantitative results reported for GZSL (Section 4.4) and ZSL (Section 4.5) demonstrate that the visual samples of the unseen classes generated by GSMFlow are of good quality and effective for classification tasks. To gain further insight into the quality of the generated samples and validate the motivation illustrated in Figure 1, we compare the empirical distribution of the real unseen class data with the unseen class data synthesized by a conditional generative adversarial network (cGAN) and by our proposed GSMFlow, as depicted in Figure 5. It can be seen that the distributions of the synthesized unseen data from cGAN suffer from the generation shifts problem, which is undesirable for approximating optimal decision boundaries. Specifically, each class collapses to fixed modes without capturing enough variance. In contrast, thanks to the visual perturbation strategy, the samples synthesized by GSMFlow are much more diverse, which helps to approximate optimal decision boundaries. The class-wise relationships are also reflected. For example, compared with other animal species, horse and giraffe share some attribute values. Also, all the marine species (blue whale, walrus, dolphin, and seal) are well separated from other animals. As a result, the generated samples are semantically consistent.

5. Conclusion

In this paper, we propose a novel Generation Shifts Mitigating Flow (GSMFlow) framework, which is comprised of multiple conditional affine coupling layers for learning unseen data synthesis. There are three potential problems that trigger the generation shifts for this task, i.e., semantic inconsistency, variance decay, and structural permutation. First, we explicitly blend the semantic information into the transformations in each of the coupling layers, reinforcing the correlations between the generated samples and the corresponding attributes. Second, a visual perturbation strategy is introduced to diversify the generated data and thereby help adjust the decision boundary of the classifier. Third, to avoid structural permutation in the semantic space, we propose a relative positioning strategy to manipulate the attribute embeddings, guiding them to fully preserve the inter-class geometric structure. An extensive suite of experiments and analyses shows that GSMFlow can outperform existing generative approaches for GZSL that suffer from the generation shifts problem.


References
  • Akata et al. (2015a) Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. 2015a. Label-embedding for image classification. TPAMI 38, 7 (2015), 1425–1438.
  • Akata et al. (2015b) Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. 2015b. Evaluation of output embeddings for fine-grained image classification. In CVPR. 2927–2936.
  • Ardizzone et al. (2018) L. Ardizzone, J. Kruse, S. Wirkert, D. Rahner, E. W. Pellegrini, R. S. Klessen, L. Maier-Hein, C. Rother, and U. Köthe. 2018. Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730 (2018).
  • Ardizzone et al. (2019) L. Ardizzone, C. Lüth, J. Kruse, C. Rother, and U. Köthe. 2019. Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392 (2019).
  • Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. 2017. Wasserstein generative adversarial networks. In ICML.
  • Chen et al. (2021) Zhi Chen, Zi Huang, Jingjing Li, and Zheng Zhang. 2021. Entropy-based uncertainty calibration for generalized zero-shot learning. In Australasian Database Conference 2021.
  • Chen et al. (2020a) Z. Chen, J. Li, Y. Luo, Z. Huang, and Y. Yang. 2020a. Canzsl: Cycle-consistent adversarial networks for zero-shot learning from natural language. In WACV.
  • Chen and Luo (2019) Z. Chen and Y. Luo. 2019. Cycle-Consistent Diverse Image Synthesis from Natural Language. In ICMEW. IEEE, 459–464.
  • Chen et al. (2020b) Z. Chen, S. Wang, J. Li, and Z. Huang. 2020b. Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches. In ACM MM.
  • DeVries and Taylor (2017) T. DeVries and G. W. Taylor. 2017. Dataset augmentation in feature space. In ICLR Workshop.
  • Dinh et al. (2016) L. Dinh, J. Sohl-Dickstein, and S. Bengio. 2016. Density estimation using real nvp. arXiv preprint arXiv:1605.08803 (2016).
  • Farhadi et al. (2009) A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. 2009. Describing objects by their attributes. In CVPR.
  • Felix et al. (2018) R. Felix, V. B. Kumar, I. Reid, and G. Carneiro. 2018. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV. 21–37.
  • Frome et al. (2013) A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. 2013. Devise: A deep visual-semantic embedding model. In NeurIPS.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative adversarial nets. In NeurIPS. 2672–2680.
  • Han et al. (2017) T. Han, Y. Lu, S. Zhu, and Y. Wu. 2017. Alternating back-propagation for generator network. In AAAI, Vol. 31.
  • Huang et al. (2019) H. Huang, C. Wang, P. S. Yu, and C. Wang. 2019. Generative Dual Adversarial Network for Generalized Zero-shot Learning. In CVPR. 801–810.
  • Jiang et al. (2019) Huajie Jiang, Ruiping Wang, Shiguang Shan, and Xilin Chen. 2019. Transferable contrastive network for generalized zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision. 9765–9774.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
  • Kodirov et al. (2017) E. Kodirov, T. Xiang, and S. Gong. 2017. Semantic autoencoder for zero-shot learning. In CVPR. 3174–3183.
  • Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS.
  • Lampert et al. (2013) C. H. Lampert, H. Nickisch, and S. Harmeling. 2013. Attribute-based classification for zero-shot visual object categorization. TPAMI 36, 3 (2013), 453–465.
  • Li et al. (2019a) Jingjing Li, Mengmeng Jing, Ke Lu, Zhengming Ding, Lei Zhu, and Zi Huang. 2019a. Leveraging the Invariant Side of Generative Zero-Shot Learning. In CVPR.
  • Li et al. (2021) Jingjing Li, Mengmeng Jing, Ke Lu, Lei Zhu, and Heng Tao Shen. 2021. Investigating the bilateral connections in generative zero-shot learning. IEEE Transactions on Cybernetics (2021).
  • Li et al. (2019b) Jingjing Li, Mengmeng Jing, Ke Lu, Lei Zhu, Yang Yang, and Zi Huang. 2019b. Alleviating Feature Confusion for Generative Zero-shot Learning. In Proceedings of the 27th ACM International Conference on Multimedia. 1587–1595.
  • Li et al. (2019c) Jingjing Li, Mengmeng Jing, Ke Lu, Lei Zhu, Yang Yang, and Zi Huang. 2019c. From zero-shot learning to cold-start recommendation. In AAAI.
  • Li et al. (2020) Jingjing Li, Mengmeng Jing, Lei Zhu, Zhengming Ding, Ke Lu, and Yang Yang. 2020. Learning modality-invariant latent representations for generalized zero-shot learning. In ACM MM.
  • Liu et al. (2018) S. Liu, M. Long, J. Wang, and M. I. Jordan. 2018. Generalized zero-shot learning with deep calibration network. In NeurIPS. 2005–2015.
  • Liu et al. (2019) Y. Liu, J. Guo, D. Cai, and X. He. 2019. Attribute Attention for Semantic Disambiguation in Zero-Shot Learning. In ICCV. 6698–6707.
  • Long et al. (2017) Y. Long, L. Liu, L. Shao, F. Shen, G. Ding, and J. Han. 2017. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In CVPR.
  • Luo et al. (2020a) Yadan Luo, Zi Huang, Yang Li, Fumin Shen, Yang Yang, and Peng Cui. 2020a. Collaborative learning for extremely low bit asymmetric hashing. IEEE TKDE (2020).
  • Luo et al. (2020b) Yadan Luo, Zi Huang, Zijian Wang, Zheng Zhang, and Mahsa Baktashmotlagh. 2020b. Adversarial bipartite graph learning for video domain adaptation. In ACM MM.
  • Luo et al. (2020c) Yadan Luo, Zi Huang, Zheng Zhang, Ziwei Wang, Mahsa Baktashmotlagh, and Yang Yang. 2020c. Learning from the Past: Continual Meta-Learning with Bayesian Graph Neural Networks.
  • Luo et al. (2019) Yadan Luo, Zi Huang, Zheng Zhang, Ziwei Wang, Jingjing Li, and Yang Yang. 2019. Curiosity-driven reinforcement learning for diverse visual paragraph generation. In ACM MM. 2341–2350.
  • Luo et al. (2018a) Yadan Luo, Ziwei Wang, Zi Huang, Yang Yang, and Cong Zhao. 2018a. Coarse-to-fine annotation enrichment for semantic segmentation learning. In CIKM.
  • Luo et al. (2018b) Yadan Luo, Yang Yang, Fumin Shen, Zi Huang, Pan Zhou, and Heng Tao Shen. 2018b. Robust discrete code modeling for supervised hashing. Pattern Recognition 75 (2018), 128–135.
  • Mikołajczyk and Grochowski (2018) A. Mikołajczyk and M. Grochowski. 2018. Data augmentation for improving deep learning in image classification problem. In IIPhDW. IEEE.
  • Min et al. (2020) S. Min, H. Yao, H. Xie, C. Wang, Z. J. Zha, and Y. Zhang. 2020. Domain-aware Visual Bias Eliminating for Generalized Zero-Shot Learning. In CVPR.
  • Narayan et al. (2020) S. Narayan, A. Gupta, F. S. Khan, C. G. Snoek, and L. Shao. 2020. Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification. arXiv preprint arXiv:2003.07833 (2020).
  • Nilsback and Zisserman (2008) M. E. Nilsback and A. Zisserman. 2008. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing.
  • Reed et al. (2016) S. Reed, Z. Akata, H. Lee, and B. Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In CVPR.
  • Romera-Paredes and Torr (2015) B. Romera-Paredes and P. Torr. 2015. An embarrassingly simple approach to zero-shot learning. In ICML. 2152–2161.
  • Schonfeld et al. (2019) E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata. 2019. Generalized zero-and few-shot learning via aligned variational autoencoders. In CVPR. 8247–8255.
  • Shen et al. (2020) Y. Shen, J. Qin, L. Huang, L. Liu, F. Zhu, and L. Shao. 2020. Invertible zero-shot recognition flows. In ECCV. Springer, 614–631.
  • Shorten and Khoshgoftaar (2019) C. Shorten and T. M. Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6, 1 (2019), 1–48.
  • Sun et al. (2017) C. Sun, A. Shrivastava, S. Singh, and A. Gupta. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV. 843–852.
  • Tong et al. (2019) B. Tong, C. Wang, M. Klinkigt, Y. Kobayashi, and Y. Nonaka. 2019. Hierarchical disentanglement of discriminative latent features for zero-shot learning. In CVPR. 11467–11476.
  • Tsochantaridis et al. (2005) I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. 2005. Large margin methods for structured and interdependent output variables. JMLR (2005).
  • Wah et al. (2011) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. 2011. The caltech-ucsd birds-200-2011 dataset. (2011).
  • Wang et al. (2020a) Ziwei Wang, Zi Huang, and Yadan Luo. 2020a. Human Consensus-Oriented Image Captioning.

  • Wang et al. (2020b) Zijian Wang, Yadan Luo, Zi Huang, and Mahsa Baktashmotlagh. 2020b. Prototype-matching graph network for heterogeneous domain adaptation. In ACM MM.
  • Xian et al. (2016) Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. 2016. Latent embeddings for zero-shot classification. In CVPR. 69–77.
  • Xian et al. (2018a) Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. 2018a. Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. TPAMI (2018).
  • Xian et al. (2018b) Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. 2018b. Feature generating networks for zero-shot learning. In CVPR. 5542–5551.
  • Xian et al. (2019) Y. Xian, S. Sharma, B. Schiele, and Z. Akata. 2019. f-VAEGAN-D2: A feature generating framework for any-shot learning. In CVPR. 10275–10284.
  • Yang et al. (2016) Yang Yang, Yadan Luo, Weilun Chen, Fumin Shen, Jie Shao, and Heng Tao Shen. 2016. Zero-shot hashing via transferring supervised knowledge. In Proceedings of the 24th ACM international conference on Multimedia. 1286–1295.
  • Yu et al. (2020) Y. Yu, Z. Ji, J. Han, and Z. Zhang. 2020. Episode-Based Prototype Generating Network for Zero-Shot Learning. In CVPR. 14035–14044.
  • Zhang et al. (2021a) Peng-Fei Zhang, Zi Huang, and Xin-Shun Xu. 2021a. Privacy-preserving Learning for Retrieval. (2021).
  • Zhang et al. (2017) Peng-Fei Zhang, Chuan-Xiang Li, Meng-Yuan Liu, Liqiang Nie, and Xin-Shun Xu. 2017. Semi-relaxation supervised hashing for cross-modal retrieval. In ACM MM.
  • Zhang et al. (2021b) Peng-Fei Zhang, Yang Li, Zi Huang, and Xin-Shun Xu. 2021b. Aggregation-based Graph Convolutional Hashing for Unsupervised Cross-modal Retrieval. IEEE TMM (2021).
  • Zhang et al. (2021c) Peng-Fei Zhang, Yadan Luo, Zi Huang, Xin-Shun Xu, and Jingkuan Song. 2021c. High-order nonlocal Hashing for unsupervised cross-modal retrieval. World Wide Web (2021).
  • Zhu et al. (2017) J. Zhu, T. Park, P. Isola, and A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV. 2223–2232.
  • Zhu et al. (2018) Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. 2018. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR. 1004–1013.
  • Zhu et al. (2019) Y. Zhu, J. Xie, B. Liu, and A. Elgammal. 2019. Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In CVPR. 9844–9854.