A Gating Model for Bias Calibration in Generalized Zero-shot Learning

03/08/2022
by   Gukyeong Kwon, et al.
Georgia Institute of Technology

Generalized zero-shot learning (GZSL) aims at training a model that can generalize to unseen class data by only using auxiliary information. One of the main challenges in GZSL is a biased model prediction toward seen classes caused by overfitting on only available seen class data during training. To overcome this issue, we propose a two-stream autoencoder-based gating model for GZSL. Our gating model predicts whether the query data is from seen classes or unseen classes, and utilizes separate seen and unseen experts to predict the class independently from each other. This framework avoids comparing the biased prediction scores for seen classes with the prediction scores for unseen classes. In particular, we measure the distance between visual and attribute representations in the latent space and the cross-reconstruction space of the autoencoder. These distances are utilized as complementary features to characterize unseen classes at different levels of data abstraction. Also, the two-stream autoencoder works as a unified framework for the gating model and the unseen expert, which makes the proposed method computationally efficient. We validate our proposed method on four benchmark image recognition datasets. In comparison with other state-of-the-art methods, we achieve the best harmonic mean accuracy in SUN and AWA2, and the second best in CUB and AWA1. Furthermore, our base model requires at least 20% fewer parameters than state-of-the-art methods relying on generative models.


I Introduction

Advancement in machine learning has primarily been driven by a large amount of labeled data. In particular, a supervised learning framework which utilizes fully annotated data such as ImageNet [7] achieves state-of-the-art performance in diverse applications such as object recognition, detection, and segmentation [12, 33, 6]. However, supervised learning has clear limitations when generalizing in numerous real-world scenarios because of expensive data collection and annotation. Also, to generalize the supervised model to a new class, the model needs to be trained with a large amount of data for the new class even though the new class is similar to other trained classes. These limitations motivate the development of other learning paradigms that do not require fully annotated data.

Zero-shot learning (ZSL) aims at learning a model that generalizes to untrained classes [24, 23]. To achieve this goal, auxiliary information such as attributes of the unseen class is utilized. For example, in the application of image recognition, assume that a classifier is trained for a 'horse' class and a 'striped cat' class. If we have auxiliary information in the form of a textual description for a new class 'zebra', such as "a zebra is a horse with stripes", the classifier can associate the 'horse' features and 'stripe' features from training images to learn the new class 'zebra'. Depending on the evaluation setup, ZSL can be further categorized into standard ZSL and generalized zero-shot learning (GZSL). In standard ZSL, test images are drawn only from unseen classes. In contrast, GZSL focuses on achieving high accuracy for both seen and unseen class test images. In this paper, we specifically tackle the problem of GZSL for image recognition.

Fig. 1: Comparison between the non-gating method and the gating method.

One of the main challenges in GZSL is a biased model prediction caused by the inherently unbalanced training set. During the training of GZSL algorithms, both visual and attribute features are available for seen classes while only attribute features are provided for unseen classes. Hence, the unbalanced training set causes models to overfit on seen class data and perform well for seen classes but poorly for unseen classes. Several approaches [25, 5] have been proposed to overcome this challenge by calibrating prediction scores for seen classes. However, we still observe that most of the unseen classes are misclassified as seen classes. In these calibration methods, the classifier makes a prediction out of the search space that contains both seen and unseen classes. Thus, the prediction scores for unseen classes cannot completely avoid competing with the biased prediction scores for seen classes. We propose using a gating model to tackle the biased prediction challenge in GZSL.

In Fig. 1, we compare the standard (non-gating) method and the gating method in GZSL to highlight their differences. For both models, a visual representation, $x$, is obtained by passing an input image through the vision encoder. Assume that an unseen class image is given to the models, and let $N_s$ and $N_u$ denote the numbers of seen and unseen classes. The gating method consists of three components: a gating model, a seen expert, and an unseen expert. The seen expert and the unseen expert are trained to correctly classify seen and unseen classes, respectively. The gating model first performs unseen class detection, which aims at correctly predicting whether the image is from a seen or an unseen class. Based on the unseen class detection result, either the seen expert or the unseen expert is chosen to predict the class. While the standard non-gating method predicts a class out of all $N_s + N_u$ classes, the gating method predicts a class out of either the $N_s$ seen classes or the $N_u$ unseen classes. Thus, the gating method avoids comparing the biased seen class prediction scores with the unseen class prediction scores.
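
To make the gating pipeline concrete, the following is a minimal sketch of the inference flow described above. The function and argument names (gating_score, seen_expert, unseen_expert, threshold) are illustrative placeholders rather than the paper's implementation.

```python
def gated_prediction(x, gating_score, seen_expert, unseen_expert, threshold):
    """Route a visual feature x to the seen or the unseen expert.

    gating_score(x)  -> scalar that is large when x looks like an unseen class
    seen_expert(x)   -> class prediction among the N_s seen classes
    unseen_expert(x) -> class prediction among the N_u unseen classes
    """
    if gating_score(x) > threshold:
        # Detected as unseen: classify only among the N_u unseen classes.
        return unseen_expert(x)
    # Detected as seen: classify only among the N_s seen classes.
    return seen_expert(x)
```

In contrast, a non-gating method compares all $N_s + N_u$ scores at once, so the biased seen class scores compete directly with the unseen class scores.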

We propose a two-stream autoencoder-based gating model which possesses several advantages over other GZSL methods. In particular, we utilize representations from the latent space and the cross-reconstruction space to characterize the association between the query visual input and the attributes, and to perform accurate unseen class detection. Also, our two-stream autoencoder provides a unified framework for both the gating model and the unseen expert. The latent representations are trained to be class-discriminant and are directly utilized for unseen class classification. Therefore, no additional unseen expert needs to be trained, which leads to the computational efficiency of the proposed method. Furthermore, we show that both experts can be separately optimized and that the gating model can be easily combined with other state-of-the-art methods to improve the overall GZSL performance. In summary, the main contributions of this paper are threefold:

  1. We propose a gating model which prevents biased prediction toward seen classes and achieves state-of-the-art performance on four benchmark image recognition datasets.

  2. We validate that the proposed method can be easily combined with existing state-of-the-art methods to further improve the performance. Such generalizability is a contribution in itself and reflects the broad applicability of GatingAE.

  3. We achieve effective unseen class detection and classification in a unified framework, which significantly reduces the number of model parameters.

The structure of this paper is organized as follows. First, we comprehensively review the related works in Section II. Representation learning using the two-stream autoencoder is explained in Section III. In Section IV, we discuss the proposed unseen class detection method and classification. The experiments are discussed in Section V and we conclude the paper in Section VI.

II Related Works

We broadly review existing works related to ZSL and GZSL. We categorize the related works into three categories and explain details of the works in each category.

Joint representations for visual features and attributes Learning joint representations for visual and attribute data is essential to utilize the association between them and solve ZSL problems. In [10], semantic knowledge learned from text data is used as a type of attribute and aligned with visual representations in the joint embedding space. The authors in [14] use autoencoders to obtain representations for visual data and textual attributes. Also, a cross-modality distribution matching constraint is imposed to align representations from both modalities. The authors in [31] propose to learn a projection for each class of images to model the relationship between seen and unseen classes. In [4], the joint representation is learned by matching the graph structure of the semantic space and the model space. The authors in [29] propose to map images to the semantic embedding space through the convex combination of semantic embedding vectors. A latent probabilistic model and a low-rank semantics grouping method are respectively proposed in [48] and [44] to learn the statistical relationship between visual and attribute representations. The authors in [41, 34] propose to learn compatibility functions that can relate the visual features with attribute representations. In [17], the authors propose to learn the joint representations through contrastive learning and generalize the representations to unseen classes by imposing a transferability constraint. In [15], a dense attribute-based attention mechanism is proposed to align attributes with local visual features instead of global feature vectors from images. Our two-stream autoencoder shares representations from each stream to learn a joint embedding.

Generative models for feature generation Generative models such as generative adversarial networks (GANs) [11] and variational autoencoders (VAEs) [19] have been widely used to generate unseen class visual features and directly use them for training a classifier. The authors in [43] use a Wasserstein GAN (WGAN) [2] conditioned on attribute information to generate unseen visual features. [8] and [37] impose a multi-modal cycle consistency loss and a gradient matching loss on the WGAN, respectively, to generate class-discriminant unseen class data. [26] uses a diffusion regularization which aims at reducing the reluctant dimensions in the synthesized data and diffusing information to all the dimensions. In [45], a modified WGAN is used to generate visual prototypes in an episode-based training setup. [51] uses a single conditional generator trained via an alternating backpropagation algorithm to generate visual features. Instead of GANs, several methods are based on conditional VAEs to synthesize samples [28, 20]. In [38], a two-stream VAE is utilized to generate latent representations for unseen class samples, and these latent representations are used to train a classifier. Although generative model-based approaches have achieved successful performance in GZSL, generative models often require a large number of model parameters. Our proposed method does not rely on generative models such as VAEs or GANs and requires significantly fewer computational resources to train.

Calibration of biased prediction toward seen classes Several works have focused on preventing models from making a biased prediction toward seen classes and achieving high accuracy for both seen and unseen classes in GZSL. [39] proposes a gating model which estimates the local outlier probability for unseen class detection. The authors in [47] propose a gating model which uses not only a seen and an unseen expert but also a general classifier over all classes to make a prediction. In [3], adaptive confidence-based smoothing is utilized with a soft-gating model which combines prediction scores from the seen and the unseen expert. According to the taxonomy in that work, our proposed method can be categorized as a hard-gating model which uses either the seen or the unseen expert for each input. [27] filters out seen class samples by thresholding the entropy of the predicted scores and predicts the seen and the unseen classes separately. The authors in [25] propose using temperature scaling [13] and an entropy-based regularization to mitigate the overfitting on seen class data. [5] calibrates the seen class prediction by using calibrated stacking, which reduces the prediction score for seen classes.

The unseen class detection in the proposed method is largely inspired by techniques in anomaly detection, which share the goal of detecting unseen class samples. In particular, the unsupervised learning framework of the autoencoder is widely explored for anomaly detection. Following the classic work [16], the authors in [36, 50, 35] use the reconstruction error of the autoencoder to detect anomalies. In [52], the authors fit Gaussian mixture models (GMMs) to the representations of the autoencoder and measure the likelihood to detect anomalies. [1] detects unseen class samples using an autoregressive density estimation model learned in the latent space of the autoencoder. [22, 21] propose gradient-based representations obtained from the autoencoder to detect anomalies from a model perspective. Compared to most of the existing anomaly detection algorithms, we characterize unseen class samples using data from two different modalities, vision and attributes. Also, anomaly detection algorithms only need to detect unseen class samples, whereas the proposed algorithm learns to both detect and classify seen and unseen class data. Therefore, learning representations that can separate seen and unseen class samples while being discriminant for each class is a critical step in the proposed method.

III Representation Learning Using a Two-Stream Autoencoder

In this section, we define the problem of GZSL and explain the two-stream autoencoder for learning joint representations of visual features and attributes.

III-A Problem Setup

We first define notations for the training data. Let $\mathcal{X}^s$, $\mathcal{Y}^s$, and $\mathcal{A}^s$ denote the sets of visual features for seen class training images, seen classes, and seen class attributes, respectively. Since there is an associated attribute for each class, the sets of seen classes and their attributes can be written as $\mathcal{Y}^s = \{y^s_1, \dots, y^s_{N_s}\}$ and $\mathcal{A}^s = \{a^s_1, \dots, a^s_{N_s}\}$, where $N_s$ is the number of seen classes. If the class of a visual feature vector, $x \in \mathcal{X}^s$, is $y^s_i$, where $i$ is a class index, the training sample can be given as the pair $(x, a^s_i)$. We also have access to the unseen class attributes, $\mathcal{A}^u = \{a^u_1, \dots, a^u_{N_u}\}$, and their associated unseen classes, $\mathcal{Y}^u$, where $N_u$ is the number of unseen classes, but we do not have access to unseen class visual features during training. The sets of visual features for seen class and unseen class test images are denoted as $\mathcal{X}^s_{te}$ and $\mathcal{X}^u_{te}$, respectively. In contrast to standard ZSL, where test images come only from unseen classes, a test visual feature vector, $x$, is drawn from the union of the seen and unseen class test sets, $\mathcal{X}^s_{te} \cup \mathcal{X}^u_{te}$, in GZSL. The goal of GZSL is to learn a classifier, $f$, which predicts the correct label for $x$; it can be formulated as $\hat{y} = f(x; \theta)$, where $\theta$ denotes the model parameters.
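
To make the setup concrete, below is a minimal sketch of the data containers implied by this notation. The field names and comments are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GZSLData:
    """Containers mirroring the notation of Section III-A; shapes are illustrative."""
    X_seen_train: np.ndarray   # visual features of seen-class training images
    y_seen_train: np.ndarray   # seen-class labels, values in {1, ..., N_s}
    A_seen: np.ndarray         # (N_s, k) one attribute vector per seen class
    A_unseen: np.ndarray       # (N_u, k) one attribute vector per unseen class
    X_test: np.ndarray         # test features drawn from both seen and unseen classes
    # Note: unseen-class visual features are *not* available at training time.
```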

Fig. 2: Training of the two-stream autoencoder.

III-B Two-Stream Autoencoder

We use a two-stream autoencoder to learn representations that associate visual features with attributes. The two-stream autoencoder consists of a vision stream and an attribute stream. Each stream has an encoder and a decoder, denoted as $E_v$ and $D_v$ for the vision stream and $E_a$ and $D_a$ for the attribute stream.
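
For concreteness, a minimal PyTorch sketch of such a two-stream autoencoder is given below. The two-layer encoders/decoders with a ReLU after the first layer follow the implementation details in Section V-A, while the hidden and latent dimensions here are placeholders rather than the paper's values.

```python
import torch
import torch.nn as nn

class TwoStreamAutoencoder(nn.Module):
    """Vision and attribute streams sharing a common latent dimensionality.

    Each encoder/decoder is a two-layer linear network with a ReLU after
    the first layer (Section V-A). Dimensions below are placeholders.
    """
    def __init__(self, vis_dim=2048, att_dim=85, hid_dim=512, lat_dim=64):
        super().__init__()
        def mlp(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, d_out))
        self.enc_v, self.dec_v = mlp(vis_dim, lat_dim), mlp(lat_dim, vis_dim)
        self.enc_a, self.dec_a = mlp(att_dim, lat_dim), mlp(lat_dim, att_dim)

    def forward(self, x, a):
        z_v, z_a = self.enc_v(x), self.enc_a(a)
        # Within-modality reconstructions and cross-modal reconstructions.
        x_rec, a_rec = self.dec_v(z_v), self.dec_a(z_a)
        x_cross, a_cross = self.dec_v(z_a), self.dec_a(z_v)
        return z_v, z_a, x_rec, a_rec, x_cross, a_cross
```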

We train the autoencoder by imposing three different losses as shown in Fig. 2. The first loss is a reconstruction error, $\mathcal{L}_{recon}$. Assume that a vision input, $x$, whose class is $y^s_i$, and its associated attribute, $a^s_i$, are given to the autoencoder. The reconstructions of the vision and the attribute inputs can be denoted as $D_v(E_v(x))$ and $D_a(E_a(a^s_i))$, respectively. We measure the distance between the input and the reconstruction for each modality to obtain the reconstruction error. The reconstruction errors of the two modalities are combined as follows:

$\mathcal{L}_{recon} = \left\| x - D_v(E_v(x)) \right\| + \left\| a^s_i - D_a(E_a(a^s_i)) \right\|$   (1)

In addition, we impose a cross-reconstruction error to align representations from visual features and attributes. The cross-reconstruction error has been widely used in the context of multimodal representation learning [49]. In particular, we train the autoencoder to reconstruct one modality's input from the other modality's input, as depicted in Fig. 2 (2). The visual features and the attributes are sequentially processed by the vision encoder and the attribute decoder, and by the attribute encoder and the vision decoder, respectively. The cross-reconstruction error, $\mathcal{L}_{cross}$, is formulated as follows:

$\mathcal{L}_{cross} = \left\| x - D_v(E_a(a^s_i)) \right\| + \left\| a^s_i - D_a(E_v(x)) \right\|$   (2)

We empirically found that the choice of distance for $\mathcal{L}_{recon}$ and $\mathcal{L}_{cross}$ affects the quality of the representations for unseen class detection, which will be explained in Section IV.
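As a sketch, the two reconstruction terms can be computed as below, building on the `TwoStreamAutoencoder` module sketched above. The $\ell_1$ distance here is only an illustrative choice; the paper states that the distance was selected empirically but the selected norm is not reproduced here.

```python
import torch
import torch.nn.functional as F

def reconstruction_losses(model, x, a):
    """Compute L_recon (Eq. 1) and L_cross (Eq. 2) for a batch.

    x: visual features, shape (B, vis_dim); a: paired class attributes, shape (B, att_dim).
    The l1 distance below is an illustrative choice, not necessarily the paper's.
    """
    z_v, z_a, x_rec, a_rec, x_cross, a_cross = model(x, a)
    l_recon = F.l1_loss(x_rec, x) + F.l1_loss(a_rec, a)      # Eq. (1)
    l_cross = F.l1_loss(x_cross, x) + F.l1_loss(a_cross, a)  # Eq. (2)
    return l_recon, l_cross, z_v
```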

Finally, we train the model with a cross entropy loss, $\mathcal{L}_{ce}$, to obtain class-discriminant latent representations. As shown in Fig. 2 (3), we first obtain the visual latent representation as $z = E_v(x)$ and all the seen attribute latent representations as $z^s_j = E_a(a^s_j)$, where $j \in \{1, \dots, N_s\}$. When the class of the visual input is $y^s_i$, the loss is computed as

$\mathcal{L}_{ce} = -\log \dfrac{\exp\!\left(-\| z - z^s_i \|\right)}{\sum_{j=1}^{N_s} \exp\!\left(-\| z - z^s_j \|\right)}$   (3)

The term in the numerator contributes to minimizing the distance for the positive pair of visual and attribute representations, while the terms in the denominator enforce maximizing the distance for negative pairs.

The overall loss for the autoencoder, $\mathcal{L}$, is given as follows:

$\mathcal{L} = \mathcal{L}_{ce} + \lambda \left( \mathcal{L}_{recon} + \mathcal{L}_{cross} \right)$   (4)

where $\lambda$ is empirically determined to balance the cross entropy loss and the reconstruction losses. The two-stream autoencoder and the losses that we use are also commonly explored in other existing works [38, 46]. However, we highlight that we still achieve effective characterization of unseen classes with this generic model and these losses. The simplicity of our representation learning framework allows our bias calibration technique, which is based on the unseen class characterization, to be easily combined with other existing techniques.

Fig. 3: Unseen class detection using distance features in the latent space and the cross-reconstruction space of the two-stream autoencoder.

IV Unseen Class Detection and Classification

We primarily focus on obtaining descriptive features that can characterize unseen classes from the two-stream autoencoder. In particular, we use the distance between visual and attribute representations as a feature for unseen class detection. The attributes that describe the seen and the unseen classes are available during both training and testing. Therefore, we use the attribute representations of the autoencoder as references and compute the distance from the visual representation to the seen and the unseen attribute representations. Since the autoencoder is trained to align a seen class visual input with its attribute, a seen class visual input will reside closer to its seen attribute in the representation spaces than an unseen class visual input will. The unseen visual representations are not enforced to be aligned with either seen or unseen attribute representations during training. Therefore, from the perspective of the training objectives, the unseen visual representations do not need to reside close to any seen or unseen attribute representations. However, from the perspective of generalization, since the network has learned to align corresponding visual and attribute representations, we hypothesize that unseen visual representations are more likely to be aligned with their corresponding unseen attribute representations than with seen attribute representations. Hence, by comparing whether the visual representation is closer to the seen or to the unseen attribute representations, we can perform seen and unseen class detection.

We obtain the distance features in both the latent space and the cross-reconstruction space of the two-stream autoencoder. In particular, the latent space has a lower dimension than the cross-reconstruction space in our two-stream autoencoder. Therefore, distances obtained in these two spaces abstract features at different semantic levels. We use both the low- and the high-dimensional distance features to define an unseen class score, which indicates how likely the query sample is to belong to an unseen class. The detailed steps for the unseen class score calculation in both spaces and the final classification are discussed in the following subsections.

IV-A Unseen class detection in the latent space

We visualize unseen class detection using the latent representations on the left side of Fig. 3. The latent representation of the query visual feature, $x$, is obtained as $z = E_v(x)$. We also generate the latent representations for all the seen and unseen attributes, denoted as $z^s_j = E_a(a^s_j)$ and $z^u_k = E_a(a^u_k)$, respectively. We extract distance features by computing the minimum distance from the visual representation to the seen and to the unseen attribute representations. The distance features for the seen classes, $d^{lat}_s$, and the unseen classes, $d^{lat}_u$, are calculated as follows:

$d^{lat}_s = \min_{j \in \{1,\dots,N_s\}} \exp\!\left( \| z - z^s_j \| \right)$   (5)
$d^{lat}_u = \min_{k \in \{1,\dots,N_u\}} \exp\!\left( \| z - z^u_k \| \right)$   (6)

We exponentiate the distance, following the exponential form used in the cross entropy loss that aligns the latent representations during training. We utilize the ratio between these distances to obtain an unseen class score in the latent space, $s_{lat}$, which is defined as $s_{lat} = d^{lat}_s / d^{lat}_u$. A seen class visual input will result in a smaller $d^{lat}_s$, a larger $d^{lat}_u$, and consequently a smaller $s_{lat}$ than an unseen class visual input. Therefore, a high $s_{lat}$ indicates that the query input is likely to belong to an unseen class. We detect the query as an unseen class when the unseen class score is above a certain threshold. Otherwise, the query is detected as a seen class.
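
A minimal sketch of the latent-space score computation follows; it mirrors Eqs. (5)-(6) in the exponentiated form reconstructed above and should be read as one plausible realization, built on the `TwoStreamAutoencoder` sketch.

```python
import torch

@torch.no_grad()
def latent_unseen_score(model, x, seen_attrs, unseen_attrs):
    """Compute s_lat = d_s^lat / d_u^lat for a batch of visual features x.

    A large score suggests the query is closer to an unseen-class attribute
    than to any seen-class attribute in the latent space.
    """
    z = model.enc_v(x)                                   # (B, lat_dim)
    z_s = model.enc_a(seen_attrs)                        # (N_s, lat_dim)
    z_u = model.enc_a(unseen_attrs)                      # (N_u, lat_dim)
    d_s = torch.cdist(z, z_s).min(dim=1).values.exp()    # Eq. (5)
    d_u = torch.cdist(z, z_u).min(dim=1).values.exp()    # Eq. (6)
    return d_s / d_u                                     # high -> likely unseen
```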

IV-B Unseen class detection in the cross-reconstruction space

We can also obtain an unseen class score in the cross-reconstruction space, as shown on the right side of Fig. 3. Similar to the calculation of the unseen class score in the latent space, we input all the seen and unseen attributes to the trained attribute encoder. Then, we use the vision decoder to cross-reconstruct visual features from the attributes. The cross-reconstructions of the seen and unseen class attributes are denoted as $D_v(E_a(a^s_j))$ and $D_v(E_a(a^u_k))$, respectively. We extract the distance features in the cross-reconstruction space by comparing the query visual features with the cross-reconstructions from the attributes. The minimum distances from the query visual input to the seen cross-reconstructions, $d^{cr}_s$, and to the unseen cross-reconstructions, $d^{cr}_u$, are computed as follows:

$d^{cr}_s = \min_{j \in \{1,\dots,N_s\}} \left\| x - D_v(E_a(a^s_j)) \right\|$   (7)
$d^{cr}_u = \min_{k \in \{1,\dots,N_u\}} \left\| x - D_v(E_a(a^u_k)) \right\|$   (8)

The same distance used for the cross-reconstruction error during training is used here. We combine the two distance features by computing the ratio $s_{cr} = d^{cr}_s / d^{cr}_u$ and utilize it as an unseen class score from the cross-reconstruction space. When the query visual input is from a seen class, the input should be close to one of the seen class cross-reconstructions and achieve a smaller $d^{cr}_s$ than an unseen class input. Also, from the generalization perspective explained at the beginning of Section IV, an unseen class input is more likely to be aligned with one of the unseen cross-reconstructions and results in a lower $d^{cr}_u$ than a seen class input. Therefore, we can detect unseen classes by comparing $s_{cr}$ with a threshold.

IV-C Overall unseen class detection and classification

We finally use the distance features obtained in both the latent space and the cross-reconstruction space to detect unseen class samples. We obtain the final unseen class score, $s$, as

$s = \gamma\, s_{lat} + (1 - \gamma)\, s_{cr}$   (9)

where $\gamma$ is a hyperparameter that balances the two scores from the latent space and the cross-reconstruction space. We perform baseline experiments to compare the GZSL performance with three different unseen class scores, $s_{lat}$, $s_{cr}$, and $s$, and we use $s$, which shows the best performance in the baseline experiments, for the state-of-the-art comparison. For the seen expert, we train a supervised linear classifier with one layer, $f_s$, using the available visual features of seen class training images. For the unseen expert, we do not train any additional model but perform 1-nearest neighbor classification in the latent space to predict the class. To be specific, we measure the distances between the visual latent representation and all the unseen attribute latent representations, and the class of the closest unseen attribute representation is predicted as the label. The overall seen and unseen class detection and classification can be formulated as

$\hat{y} = \begin{cases} f_s(x), & \text{if } s \le \tau \\ y^u_{k^*}, & \text{if } s > \tau \end{cases}$   (10)

where $\hat{y}$ is the final class prediction for $x$, $\tau$ is the detection threshold, and $k^* = \arg\min_{k \in \{1,\dots,N_u\}} \| E_v(x) - E_a(a^u_k) \|$ is the index of the closest unseen attribute latent representation. The hyperparameters $\gamma$ and $\tau$ are found using the validation set provided in [42]. Also, by following the training protocol described in [3], we re-train the model from scratch using the union of the training and the validation sets after finding $\gamma$ and $\tau$. Since we utilize a compact two-stream autoencoder model for gating, the proposed method is called GatingAE.
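
Putting the pieces together, the overall decision rule of Eq. (10) can be sketched as follows, reusing the `latent_unseen_score` helper above. The weights `gamma` and `tau` and the linear seen expert `seen_clf` are placeholders for the validated hyperparameters and the trained classifier; in practice the returned seen- and unseen-class indices must be mapped back to the global label set.

```python
import torch

@torch.no_grad()
def gating_predict(model, seen_clf, x, seen_attrs, unseen_attrs, gamma, tau):
    """GatingAE inference for a batch of visual features x."""
    # Cross-reconstruction score s_cr (Eqs. 7-8): distance from x to the
    # vision-decoded attribute of every seen / unseen class.
    rec_s = model.dec_v(model.enc_a(seen_attrs))          # (N_s, vis_dim)
    rec_u = model.dec_v(model.enc_a(unseen_attrs))        # (N_u, vis_dim)
    d_s_cr = torch.cdist(x, rec_s).min(dim=1).values
    d_u_cr = torch.cdist(x, rec_u).min(dim=1).values
    s_cr = d_s_cr / d_u_cr

    s_lat = latent_unseen_score(model, x, seen_attrs, unseen_attrs)
    s = gamma * s_lat + (1.0 - gamma) * s_cr              # combined score, Eq. (9)

    # Seen expert: linear classifier over the N_s seen classes.
    seen_pred = seen_clf(x).argmax(dim=1)
    # Unseen expert: 1-NN to the unseen attribute latents (no extra parameters).
    z = model.enc_v(x)
    unseen_pred = torch.cdist(z, model.enc_a(unseen_attrs)).argmin(dim=1)

    # Eq. (10): indices refer to the seen or the unseen label set, respectively.
    return torch.where(s > tau, unseen_pred, seen_pred)
```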

V Experiments

We validate the effectiveness of the proposed gating model through rigorous baseline experiments. Also, we highlight the GZSL performance of GatingAE in comparison with other state-of-the-art methods. Finally, comprehensive ablation studies are conducted to experimentally support the advantages of GatingAE.

Fig. 4: Sample images from CUB, SUN, AWA1, and AWA2.
Model                 Seen Expert   CUB              SUN              AWA2             AWA1
                                    S    U    H      S    U    H      S    U    H      S    U    H
No gating             1-NN          64.4 36.6 46.8   35.0 19.0 24.7   87.8 25.9 40.0   85.3 23.8 37.2
GatingAE ($s_{lat}$)  1-NN          55.0 54.2 54.6   29.8 48.1 36.8   81.0 54.7 65.3   76.8 54.9 64.0
GatingAE ($s_{lat}$)  Linear CLF    58.6 54.2 56.4   35.7 48.1 40.9   83.1 54.7 66.0   78.7 54.9 64.7
GatingAE ($s_{cr}$)   1-NN          45.0 58.8 51.0   27.4 50.4 35.5   79.0 55.4 65.1   73.1 55.8 63.3
GatingAE ($s_{cr}$)   Linear CLF    47.1 58.8 52.3   32.4 50.4 39.5   80.9 55.4 65.8   74.9 55.8 63.9
GatingAE ($s$)        Linear CLF    58.1 54.9 56.4   38.1 45.4 41.4   81.3 57.3 67.2   72.8 59.7 65.6
TABLE I: Baseline comparison on the CUB, SUN, AWA2, and AWA1 datasets. S: seen class accuracy, U: unseen class accuracy, H: harmonic mean accuracy. The top two harmonic mean accuracies for each dataset are highlighted in bold.

V-A Experimental Setup

Datasets We validate the proposed method using four benchmark image recognition datasets: Caltech-UCSD Birds-200-2011 (CUB) [40], SUN Attribute (SUN) [30], Animals with Attributes 2 (AWA2) [42], and Animals with Attributes 1 (AWA1) [23]. Also, we use the proposed splits in [42] for all the datasets. CUB is a fine-grained dataset with bird images from 200 species. For the attributes, we use text representations obtained by averaging 10 sentence features per image [32]. SUN is also a fine-grained image dataset, which contains visual scene images from 717 classes. Each scene class is annotated with a 102-dimensional attribute representation. AWA2 and AWA1 are both coarse-scale image datasets, consisting of 37,322 and 30,475 animal images, respectively. Both datasets have 50 classes, and each class is annotated with an 85-dimensional attribute representation. As suggested in [42], we use 2048-dimensional image representations obtained from the top-layer pooling units of ResNet-101 [12] pre-trained on ImageNet [7] as the visual input for all four datasets. Sample images from the four datasets are visualized in Fig. 4.

Implementation details The encoder and the decoder of each stream in the autoencoder consist of two linear layers, with a ReLU after the first layer of the encoder and of the decoder. The hyperparameter $\lambda$ is selected by a grid search. We use the Adam optimizer [18] to train the two-stream autoencoder. For the seen expert, we train the one-layer linear classifier, also with the Adam optimizer.

Evaluation metrics We use the average per-class top-1 accuracy, a widely accepted evaluation metric for GZSL, to evaluate the proposed method. In particular, we separately calculate the average accuracy for seen classes and for unseen classes. We also report the harmonic mean (H) of the seen class accuracy (S) and the unseen class accuracy (U), which is calculated as $H = \frac{2\,S\,U}{S + U}$. For the evaluation of the unseen class detection performance, we use the area under the receiver operating characteristic curve (AUC) and the false positive rate at a true positive rate of 0.95 (FPR).
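
The evaluation protocol can be summarized in a few lines of code; the `roc_auc_score` call from scikit-learn is used here for the detection AUC as an illustrative choice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_class_top1(y_true, y_pred, classes):
    """Average per-class top-1 accuracy over the given class set.

    Assumes every class in `classes` appears at least once in y_true.
    """
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """H = 2 * S * U / (S + U), the GZSL harmonic mean accuracy."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Unseen class detection is scored as a binary problem: label 1 for unseen
# class samples and 0 for seen ones, scored by the unseen class score s, e.g.:
# auc = roc_auc_score(is_unseen, unseen_scores)
```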

V-B Baseline Comparison

We validate the effectiveness of the gating model through comprehensive baseline experiments in Table I. We compare the GZSL performance of four different models. All four models are based on the same two-stream autoencoder trained as described in Section III-B. However, the gating approach used at the inference stage differs among them. As shown in the first column of Table I, the first model (No gating) predicts the class through 1-nearest neighbor (1-NN) classification based on the latent representations of the autoencoder without any gating. To be specific, the predicted class is given as $\hat{y} = y_{j^*}$, where $j^* = \arg\min_{j} \| E_v(x) - E_a(a_j) \|$ and $j$ ranges over all seen and unseen classes. Since no gating is used, we note that the classification is made out of all $N_s + N_u$ classes. The second and the third models use the unseen class score in the latent space, $s_{lat}$ (GatingAE ($s_{lat}$)), and in the cross-reconstruction space, $s_{cr}$ (GatingAE ($s_{cr}$)), for gating, respectively. For these two models, we use both a 1-NN classifier and a linear classifier (Linear CLF) as seen experts and compare the performance. Finally, we combine the distance features obtained in the latent space and the cross-reconstruction space and use $s$ as the unseen class score for gating (GatingAE ($s$)), with a linear classifier as the seen expert. For all the gating models, a 1-NN classifier applied to the latent representations is used as the unseen expert. We report the average per-class top-1 accuracy for seen classes (S), unseen classes (U), and their harmonic mean (H) on the CUB, SUN, AWA2, and AWA1 datasets.

Effectiveness of the proposed gating model (No gating vs. GatingAE) We highlight the contribution of the gating model by comparing the performance of the No gating model with the GatingAE models using $s_{lat}$ and $s_{cr}$ separately as unseen class scores. For a fair comparison, we compare models using the 1-NN classifier as the seen expert. GatingAE ($s_{lat}$) with the 1-NN classifier significantly outperforms the No gating model by 7.8, 12.1, 25.3, and 26.8 in terms of the harmonic mean accuracy in CUB, SUN, AWA2, and AWA1, respectively. Furthermore, GatingAE ($s_{cr}$) with the 1-NN classifier shows higher harmonic mean accuracy than the No gating model by margins of 4.2, 10.8, 25.1, and 26.1 in the four datasets.

We believe two advantages of GatingAE mainly contribute to the significantly improved performance. First, GatingAE prevents the biased model prediction toward seen classes. The gating model separates the prediction search space, and the class is predicted only among seen classes or only among unseen classes for each sample. This prevents unseen class prediction scores from being directly compared with the biased seen class scores. In the No gating model, on the other hand, the biased seen class scores and the unseen class scores are directly compared and the class with the maximum score is predicted. This leads to the misclassification of unseen class samples into seen classes. We observe that the seen accuracy of the No gating model is roughly 1.8 to 3.6 times higher than the unseen accuracy of the same model across the four datasets. GatingAE avoids this biased prediction and achieves significantly improved harmonic mean accuracy. Second, the gating approach reduces the dimension of the prediction space for the classifiers. The seen and the unseen experts of GatingAE predict a class out of $N_s$ or $N_u$ classes, respectively, instead of the total $N_s + N_u$ classes as in the No gating model. The reduction of the prediction space allows each expert to focus on fewer classes, which leads to better accuracy. With these two advantages, GatingAE significantly improves the harmonic mean accuracy.

Method CUB SUN AWA2 AWA1
S U H S U H S U H S U H
LATEM [41] 57.3 15.2 24.0 28.8 14.7 19.5 77.3 11.5 20.0 71.7 7.3 13.3
DeViSE [10] 53.0 23.8 32.8 27.4 16.9 20.9 74.7 17.1 27.8 68.7 13.4 22.4
f-CLSWGAN [43] 57.7 43.7 49.7 36.6 42.6 39.4 68.9 52.1 59.4 61.4 57.9 59.6
ReViSE [14] 28.3 37.6 32.3 20.1 24.3 22.0 39.7 46.4 42.8 37.1 46.1 41.1
CADA-VAE [38] 53.5 51.6 52.4 35.7 47.2 40.6 75.0 55.8 63.9 72.8 57.3 64.1
TCN [17] 52.0 52.6 52.3 37.3 31.2 34.0 65.8 61.2 63.4 76.5 49.4 60.0
ABP [51] 54.8 47.0 50.6 36.8 45.3 40.6 72.6 55.3 62.6 67.1 57.3 61.8
COSMO [3] 57.8 44.4 50.2 37.7 44.9 41.0 - - - 80.0 52.8 63.6
GMN [37] 54.3 56.1 55.2 33.0 53.2 40.7 - - - 71.3 61.1 65.8
E-PGN [45] 61.1 52.0 56.2 - - - 83.5 52.6 64.6 83.4 62.1 71.2
3ME [9] 60.1 49.6 54.3 - - - - - - 65.7 55.5 60.2
DAZLE [15] 59.6 56.7 58.1 24.3 52.3 33.2 75.7 60.3 67.1 - - -
DVBE* [27] 60.2 53.2 56.5 37.2 45.0 40.7 70.8 63.6 67.0 - - -
GatingAE 58.1 54.9 56.4 38.1 45.4 41.4 81.3 57.3 67.2 72.8 59.7 65.6
GatingAE + f-CLSWGAN 58.1 55.4 56.7 38.1 45.3 41.4 81.3 60.3 69.3 72.3 62.5 67.2
TABLE II: State-of-the-art comparison on the CUB, SUN, AWA2, and AWA1 datasets. S: Seen class accuracy, U: Unseen class accuracy, H: Harmonic mean accuracy. The top two harmonic mean accuracies for each dataset are highlighted in bold.
Fig. 5: Scatter plot of seen and unseen accuracy for each state-of-the-art algorithm. For an ideal GZSL algorithm, the data point is expected to stay close to the middle gray dotted line and the top right corner.

Advantage of using an independently trained expert (1-NN vs. Linear CLF) We compare the performance of GatingAE using the 1-NN classifier and the linear classifier (Linear CLF) as seen experts. By comparing these two configurations, we emphasize that GatingAE can be easily combined with any independently trained expert to further improve the performance. As a case study, we train a linear classifier as a seen expert, independently from the gating model and the unseen expert, using the available visual training data. We show that the linear classifier improves the seen accuracy without sacrificing unseen class accuracy. In Table I, GatingAE ($s_{lat}$) and GatingAE ($s_{cr}$) combined with the independently trained linear classifiers achieve higher seen class accuracy than their counterparts with the 1-NN classifier by at least 1.9 and 1.8, respectively, in all four datasets, while not compromising the unseen class accuracy. This highlights the applicability of GatingAE with any contributions from seen or unseen experts to further improve the GZSL performance.

Complementary distance features for gating (GatingAE ($s_{lat}$) and GatingAE ($s_{cr}$) vs. GatingAE ($s$)) We compare GatingAE ($s_{lat}$) and GatingAE ($s_{cr}$) with GatingAE ($s$) to show the contribution of the descriptive distance features from both the latent space and the cross-reconstruction space to the GZSL performance. In particular, we compare GatingAEs using the linear classifiers as seen experts because they achieve better performance than GatingAEs using the 1-NN classifiers. In Table I, GatingAE ($s$) consistently achieves higher harmonic mean accuracy than GatingAE ($s_{lat}$) and GatingAE ($s_{cr}$) across all the datasets, except that GatingAE ($s_{lat}$) achieves the same harmonic mean accuracy as GatingAE ($s$) in CUB. We believe the better performance of GatingAE ($s$) results from the complementary distance features obtained in the latent space and the cross-reconstruction space. Considering that the latent space is lower dimensional than the cross-reconstruction space, the distance features from the different spaces contribute to gating at different levels of data abstraction. Therefore, by combining both features into $s$, GatingAE ($s$) utilizes the advantages of each feature and achieves higher harmonic mean accuracy than both GatingAE ($s_{lat}$) and GatingAE ($s_{cr}$).

V-C Comparison With State-of-the-art Algorithms

We compare GatingAE with 13 state-of-the-art GZSL algorithms and report the performance in Table II. A hyphen (-) indicates that the authors of the algorithm have not validated their method on the corresponding dataset. For a fair comparison with DVBE, we use their reported performance without finetuning the ResNet-101 backbone used for visual feature extraction. Excluding GatingAE + f-CLSWGAN, the base GatingAE achieves the best harmonic mean accuracy in SUN and AWA2, and the third highest harmonic mean accuracy in CUB and AWA1. Although GatingAE does not achieve the best performance in CUB and AWA1, it performs more robustly across datasets than the other algorithms. For instance, DAZLE achieves the highest harmonic mean accuracy in CUB, but its harmonic mean accuracy in SUN is only the ninth highest among the compared methods that report SUN results. Also, although E-PGN achieves the highest harmonic mean accuracy in AWA1, its harmonic mean accuracies in CUB and AWA2 are both only the fourth highest. GatingAE achieves the highest average rank over all four datasets in terms of the harmonic mean accuracy. In comparison with the state-of-the-art soft-gating model COSMO, GatingAE, which is based on hard gating, achieves better performance in all datasets. Since the soft-gating model predicts a class using a combination of seen and unseen class prediction scores, the bias toward seen classes still affects the classification of unseen classes. In contrast, GatingAE completely separates the classification of seen and unseen classes and mitigates the effect of the bias on unseen class classification.

We visualize the seen and the unseen accuracies of all the state-of-the-art methods in Fig. 5 to analyze the balance between the two. In particular, the x-axis and the y-axis of each scatter plot indicate the seen accuracy and the unseen accuracy of each method, respectively. The gray dotted line in the middle indicates equal seen and unseen accuracies. An ideal GZSL method should achieve high accuracy for both seen and unseen classes and should not be biased toward either. Therefore, the accuracy of an ideal method is expected to be plotted close to the top right corner while staying close to the dotted gray line. In CUB, GatingAE is one of the closest methods to both the dotted gray line and the top right corner. While DAZLE and GMN are located close to GatingAE in CUB, they are biased toward the unseen class accuracy and located far from the center line in SUN. Although several methods stay closer to the center line than GatingAE in AWA2, GatingAE still achieves the highest harmonic mean accuracy in AWA2. In AWA1, GatingAE shows comparable performance to E-PGN and GMN while being located close to the center dotted line. This shows that GatingAE achieves generalized high accuracy for both seen and unseen classes across all four datasets.

We also show that GatingAE can be easily combined with other state-of-the-art methods to further improve the performance. Since each expert can be independently improved in GatingAE, state-of-the-art methods can simply be utilized as a seen or an unseen expert. Furthermore, GatingAE can benefit from state-of-the-art methods based on generative models even when those methods do not achieve better GZSL performance than GatingAE. As a case study, we use f-CLSWGAN, which is one of the earliest GZSL methods based on a WGAN [2]. f-CLSWGAN generates unseen visual features to tackle the problem of GZSL. We use these generated unseen visual features from f-CLSWGAN to finetune and improve the unseen expert independently from the seen expert. We report the performance of GatingAE + f-CLSWGAN in Table II. The base GatingAE significantly outperforms f-CLSWGAN by margins of 6.7, 2.0, 7.8, and 6.0 in CUB, SUN, AWA2, and AWA1, respectively. However, GatingAE still benefits from f-CLSWGAN, and GatingAE + f-CLSWGAN achieves higher harmonic mean accuracy than either GatingAE or f-CLSWGAN individually. Since we only finetune the unseen expert, GatingAE + f-CLSWGAN improves the unseen class accuracy over GatingAE while keeping the seen accuracy intact. Also, GatingAE + f-CLSWGAN achieves the best performance in SUN and AWA2, and the second best performance in CUB and AWA1, in terms of the harmonic mean accuracy. Although we only show one case study using f-CLSWGAN, the same finetuning approach can be utilized with other GZSL algorithms based on generative models such as [8, 37, 28].

Gating Model            CUB                  SUN                  AWA1
                        H     AUC    FPR     H     AUC    FPR     H     AUC    FPR
MAX-SOFTMAX-3 [3]       43.6  0.734  0.796   38.4  0.610  0.923   53.1  0.886  0.568
CB-GATING-3 [3]         44.7  0.820  0.720   40.1  0.777  0.775   56.8  0.925  0.455
GatingAE ($s_{lat}$)    75.7  0.972  0.143   38.0  0.777  0.775   62.1  0.889  0.566
GatingAE ($s_{cr}$)     61.7  0.926  0.324   34.8  0.753  0.820   61.3  0.890  0.561
GatingAE ($s$)          74.9  0.970  0.156   38.8  0.779  0.762   62.7  0.894  0.550
TABLE III: Gating performance comparison between GatingAEs and gating models proposed in COSMO [3]. Ideally, higher harmonic mean accuracy (H), higher AUC, and lower false positive rate at true positive rate 0.95 (FPR) are desired. Top 2 scores in each evaluation metric are highlighted.

V-D Ablation Study

Gating performance comparison with COSMO We compare the gating performance of GatingAE with the state-of-the-art gating method COSMO [3] in Table III. In particular, the authors of COSMO split the validation set into a Gating-Train set and a Gating-Val set and report the gating performance on the Gating-Val set. Following the same protocol, we train the two-stream autoencoder on the original training set, tune the hyperparameters on the Gating-Train set, and finally report the gating performance on the Gating-Val set. We compare GatingAEs based on $s_{lat}$, $s_{cr}$, and $s$ with the two gating models proposed in COSMO, MAX-SOFTMAX-3 and CB-GATING-3. The gating performance is evaluated in terms of the harmonic mean accuracy, AUC, and FPR. We note that GatingAE achieves higher harmonic mean accuracy than COSMO on the test sets of CUB, SUN, and AWA1, as shown in Table II.

GatingAEs significantly outperform MAX-SOFTMAX-3 and CB-GATING-3 in terms of all evaluation metrics in CUB. This further supports the significant performance gap between GatingAE and COSMO on the test set of CUB shown in Table II. In SUN, while GatingAE ($s$) achieves slightly lower harmonic mean accuracy on the Gating-Val set than CB-GATING-3, it achieves better detection performance with higher AUC and lower FPR. In AWA1, the GatingAE variants achieve significantly higher harmonic mean accuracy while showing lower AUC and higher FPR than CB-GATING-3 on the Gating-Val set. As shown in Table II, GatingAE ($s$) outperforms COSMO by a margin of 2.0 harmonic mean accuracy on the test set of AWA1. Considering that the Gating-Val set is around three times smaller than the test set of AWA1, we argue that GatingAE ($s$) maintains its gating performance and learns better class-discriminant representations on the relatively large-scale test set of AWA1.

Analysis of the gating performance from each distance feature We decompose the unseen class scores used in GatingAE to understand the contribution of each distance feature to gating. In particular, we report the AUC scores obtained by separately using the latent space distance features, $d^{lat}_s$ and $d^{lat}_u$, and the cross-reconstruction space distance features, $d^{cr}_s$ and $d^{cr}_u$, as unseen class scores. Also, we compare the AUC scores from the individual distance features with those from $s_{lat}$, $s_{cr}$, and $s$, which combine the distance features. This highlights that the distance features are complementary to each other for gating. We report the AUC scores obtained on the test sets of CUB, SUN, AWA2, and AWA1 in Table IV. $s_{lat}$ and $s_{cr}$ show significant improvements in the AUC scores over $d^{lat}_s$, $d^{lat}_u$, $d^{cr}_s$, and $d^{cr}_u$. This shows that the distance features from seen and unseen classes are combined to effectively classify whether the query is from a seen or an unseen class. In addition, $s$ shows higher AUC than $s_{lat}$ and $s_{cr}$ in SUN, AWA2, and AWA1. In CUB, $s_{lat}$ performs marginally better than $s$. We argue that GatingAE combines all the complementary distance features from seen classes, unseen classes, the latent space, and the cross-reconstruction space to achieve accurate gating results, and consequently chooses the correct expert for tackling GZSL problems.

Unseen Class Score   CUB    SUN    AWA2   AWA1
$d^{lat}_s$          0.511  0.546  0.686  0.650
$d^{lat}_u$          0.596  0.520  0.459  0.516
$s_{lat}$            0.842  0.774  0.934  0.917
$d^{cr}_s$           0.496  0.550  0.725  0.697
$d^{cr}_u$           0.574  0.500  0.421  0.427
$s_{cr}$             0.808  0.769  0.933  0.907
$s$                  0.841  0.783  0.940  0.918
TABLE IV: AUC performance obtained by using the distances in the latent space and the cross-reconstruction space as unseen class scores.
Fig. 6: Qualitative analysis of the failure cases of GatingAEs using unseen class scores from different representation spaces in AWA2. Latent, Cross, and Combined refer to the class predictions of GatingAEs using $s_{lat}$, $s_{cr}$, and $s$, respectively.
Model f-CLSWGAN [43] CADA-VAE [38] GatingAE
# of parameters 19,514,062 7,398,716 5,860,138
TABLE V: Comparison of the number of model parameters between GatingAE and other generative model-based GZSL algorithms.

Qualitative analysis of complementary distance features We perform a qualitative analysis of the failure cases of GatingAEs using $s_{lat}$, $s_{cr}$, and $s$ in Fig. 6. In particular, unseen class query images are given to the GatingAEs, and we analyze cases where either GatingAE ($s_{lat}$) or GatingAE ($s_{cr}$) fails, and cases where both of them fail, to predict the correct classes. Through this analysis, we further highlight that the latent and the cross-reconstruction features capture unseen classes at different levels of data abstraction. The distance features from the low-dimensional latent space focus more on abstracted global features, while the cross-reconstruction features capture low-level local characteristics. In the first row of Fig. 6, only GatingAE ($s_{cr}$) misclassifies the unseen class Giraffe into the seen class Deer. Giraffe and Deer share local features of brown and white fur. However, they are clearly distinguished by global features of the Giraffe such as a long neck and long legs. The unseen class score $s_{lat}$ captures these global features that $s_{cr}$ misses and predicts the correct class. In the second row, Bobcat is misclassified as Leopard by GatingAE ($s_{lat}$) while being correctly classified by GatingAE ($s_{cr}$). Since both classes are in the cat family, they are differentiated only by low-level local features such as the sharpness of the ears and body patterns. We believe these local features are better captured by $s_{cr}$ than by $s_{lat}$. Finally, in the last row, we show two examples of the Dolphin class where both GatingAE ($s_{lat}$) and GatingAE ($s_{cr}$) misclassify them into the seen class Killer whale while GatingAE ($s$) correctly predicts the unseen class. Dolphin shares most of its characteristic features with Killer whale, which makes unseen class detection challenging. However, GatingAE ($s$) incorporates both local and global features from $s_{cr}$ and $s_{lat}$ and achieves the correct prediction. For all the query images shown in Fig. 6, GatingAE ($s$) predicts the correct classes when one or both of GatingAE ($s_{lat}$) and GatingAE ($s_{cr}$) fail. This shows that GatingAE ($s$) effectively combines the advantages of features abstracted at different levels.

Computational efficiency of GatingAE GatingAE is computationally efficient because of its compact two-stream linear autoencoder and the unified framework for the gating model and the unseen expert. To highlight the computational efficiency of GatingAE, we compare the number of trainable parameters of GatingAE with those of f-CLSWGAN and CADA-VAE in Table V. Several state-of-the-art methods such as [8, 37] are developed on top of f-CLSWGAN. Also, CADA-VAE uses a two-stream VAE, which is the architecture closest to our two-stream linear autoencoder. By comparing with these two models, which are based on relatively simple generative models, we emphasize that GatingAE is even simpler while achieving state-of-the-art performance. As shown in Table V, f-CLSWGAN and CADA-VAE require around 3.3 times and 1.3 times as many parameters as GatingAE, respectively. CADA-VAE uses the same number of layers and the same latent dimension as GatingAE. However, CADA-VAE has to learn additional parameters for a latent constraint and a classifier over all $N_s + N_u$ classes, while GatingAE only needs to train a classifier for the $N_s$ seen classes. In addition, the state-of-the-art soft-gating model COSMO uses f-CLSWGAN as an unseen expert. Hence, COSMO requires training more than 19 million parameters for f-CLSWGAN. In contrast, GatingAE uses the 1-NN classifier, which requires no additional trainable parameters, as the unseen expert. Therefore, GatingAE uses significantly fewer computational resources while outperforming these state-of-the-art methods.
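
Parameter counts of the kind reported in Table V can be obtained with a standard PyTorch one-liner; the snippet below is shown for completeness and is not a claim about how the authors computed their figures.

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage with the sketch above:
# count_trainable_parameters(TwoStreamAutoencoder())
```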

We also compare the training time per epoch of GatingAE and f-CLSWGAN using a single GeForce GTX TITAN X GPU. For f-CLSWGAN, we use the official code released by the authors to measure the training time. We use the AWA2 and AWA1 datasets for this experiment since they have roughly two to three times more training samples than CUB and SUN. In both AWA2 and AWA1, GatingAE requires only a small fraction of the per-epoch training time required by f-CLSWGAN. This is mainly because f-CLSWGAN is based on a GAN that is adversarially trained with more model parameters, in its generator and discriminator, than GatingAE. Given that f-CLSWGAN is based on one of the simplest GANs for feature generation in GZSL, we believe GatingAE has a strong advantage in computational efficiency compared to other state-of-the-art methods using more complicated GAN-based models [8, 37].

VI Conclusion

We propose a GZSL algorithm, GatingAE, which utilizes a two-stream autoencoder as a gating model to prevent biased prediction and achieve high accuracy for both seen and unseen class data. In particular, we utilize distance features obtained from the latent space and the cross-reconstruction space of the autoencoder for gating. Based on the gating result, either the seen or the unseen class expert is chosen to perform the target task. We thoroughly validate the gating performance and the overall GZSL performance in the application of image recognition. GatingAE achieves state-of-the-art performance on four benchmark image recognition datasets. Also, several advantages of GatingAE, such as complementary distance features for gating, the use of independently trained experts, and computational efficiency, are highlighted through baseline experiments and ablation studies. We plan to further explore the characterization of the bias present in the training data and the utilization of this bias information to calibrate predictions for learning with limited data.

References

  • [1] D. Abati, A. Porrello, S. Calderara, and R. Cucchiara (2019) Latent space autoregression for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 481–490. Cited by: §II.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §II, §V-C.
  • [3] Y. Atzmon and G. Chechik (2019) Adaptive confidence smoothing for generalized zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11671–11680. Cited by: §II, §IV-C, §V-D, TABLE II, TABLE III.
  • [4] S. Changpinyo, W. Chao, B. Gong, and F. Sha (2016) Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5327–5336. Cited by: §II.
  • [5] W. Chao, S. Changpinyo, B. Gong, and F. Sha (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In European Conference on Computer Vision, pp. 52–68. Cited by: §I, §II.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §I.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §I, §V-A.
  • [8] R. Felix, V. B. Kumar, I. Reid, and G. Carneiro (2018) Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §II, §V-C, §V-D, §V-D.
  • [9] R. Felix, M. Sasdelli, I. Reid, and G. Carneiro (2019) Multi-modal ensemble classification for generalized zero shot learning. arXiv preprint arXiv:1901.04623. Cited by: TABLE II.
  • [10] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §II, TABLE II.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §II.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §V-A.
  • [13] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §II.
  • [14] Y. Hubert Tsai, L. Huang, and R. Salakhutdinov (2017) Learning robust visual-semantic embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3571–3580. Cited by: §II, TABLE II.
  • [15] D. Huynh and E. Elhamifar (2020) Fine-grained generalized zero-shot learning via dense attribute-based attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4483–4493. Cited by: §II, TABLE II.
  • [16] N. Japkowicz (1999) Concept learning in the absence of counterexamples: an autoassociation-based approach to classification. Rutgers The State University of New Jersey-New Brunswick. Cited by: §II.
  • [17] H. Jiang, R. Wang, S. Shan, and X. Chen (2019) Transferable contrastive network for generalized zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9765–9774. Cited by: §II, TABLE II.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-A.
  • [19] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §II.
  • [20] V. Kumar Verma, G. Arora, A. Mishra, and P. Rai (2018) Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4281–4289. Cited by: §II.
  • [21] G. Kwon, M. Prabhushankar, D. Temel, and G. AlRegib (2020) Backpropagated gradient representations for anomaly detection. In European Conference on Computer Vision, pp. 206–226. Cited by: §II.
  • [22] G. Kwon, M. Prabhushankar, D. Temel, and G. AlRegib (2020) Novelty detection through model-based characterization of neural networks. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 3179–3183. Cited by: §II.
  • [23] C. H. Lampert, H. Nickisch, and S. Harmeling (2009) Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. Cited by: §I, §V-A.
  • [24] H. Larochelle, D. Erhan, and Y. Bengio (2008) Zero-data learning of new tasks.. In AAAI, Vol. 1, pp. 3. Cited by: §I.
  • [25] S. Liu, M. Long, J. Wang, and M. I. Jordan (2018) Generalized zero-shot learning with deep calibration network. Advances in Neural Information Processing Systems 31, pp. 2005–2015. Cited by: §I, §II.
  • [26] Y. Long, L. Liu, F. Shen, L. Shao, and X. Li (2017) Zero-shot learning using synthesised unseen visual data with diffusion regularisation. IEEE transactions on pattern analysis and machine intelligence 40 (10), pp. 2498–2512. Cited by: §II.
  • [27] S. Min, H. Yao, H. Xie, C. Wang, Z. Zha, and Y. Zhang (2020) Domain-aware visual bias eliminating for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12664–12673. Cited by: §II, TABLE II.
  • [28] A. Mishra, S. Krishna Reddy, A. Mittal, and H. A. Murthy (2018) A generative model for zero shot learning using conditional variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2188–2196. Cited by: §II, §V-C.
  • [29] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650. Cited by: §II.
  • [30] G. Patterson and J. Hays (2012) Sun attribute database: discovering, annotating, and recognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2751–2758. Cited by: §V-A.
  • [31] S. Rahman, S. Khan, and F. Porikli (2018) A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning. IEEE Transactions on Image Processing 27 (11), pp. 5652–5667. Cited by: §II.
  • [32] S. Reed, Z. Akata, H. Lee, and B. Schiele (2016) Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58. Cited by: §V-A.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §I.
  • [34] B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pp. 2152–2161. Cited by: §II.
  • [35] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli (2018) Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3379–3388. Cited by: §II.
  • [36] M. Sakurada and T. Yairi (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4. Cited by: §II.
  • [37] M. B. Sariyildiz and R. G. Cinbis (2019) Gradient matching generative networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2168–2178. Cited by: §II, §V-C, §V-D, §V-D, TABLE II.
  • [38] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8247–8255. Cited by: §II, §III-B, TABLE II, TABLE V.
  • [39] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. Advances in neural information processing systems 26, pp. 935–943. Cited by: §II.
  • [40] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §V-A.
  • [41] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele (2016) Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 69–77. Cited by: §II, TABLE II.
  • [42] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2018) Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence 41 (9), pp. 2251–2265. Cited by: §IV-C, §V-A.
  • [43] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018) Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5542–5551. Cited by: §II, TABLE II, TABLE V.
  • [44] B. Xu, Z. Zeng, C. Lian, and Z. Ding (2021) Semi-supervised low-rank semantics grouping for zero-shot learning. IEEE Transactions on Image Processing 30, pp. 2207–2219. Cited by: §II.
  • [45] Y. Yu, Z. Ji, J. Han, and Z. Zhang (2020) Episode-based prototype generating network for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14035–14044. Cited by: §II, TABLE II.
  • [46] Y. Yu, Z. Ji, J. Han, and Z. Zhang (2020) Episode-based prototype generating network for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14035–14044. Cited by: §III-B.
  • [47] H. Zhang and P. Koniusz (2018) Model selection for generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §II.
  • [48] Z. Zhang and V. Saligrama (2016) Zero-shot learning via joint latent similarity embedding. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6034–6042. Cited by: §II.
  • [49] Q. Zhao, L. Zong, X. Zhang, Y. Li, and X. Tang (2020) A multimodal clustering framework with cross reconstruction autoencoders. IEEE Access. Cited by: §III-B.
  • [50] C. Zhou and R. C. Paffenroth (2017) Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 665–674. Cited by: §II.
  • [51] Y. Zhu, J. Xie, B. Liu, and A. Elgammal (2019) Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9844–9854. Cited by: §II, TABLE II.
  • [52] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding gaussian mixture model for unsupervised anomaly detection. International Conference on Learning Representations. Cited by: §II.