Addressing target shift in zero-shot learning using grouped adversarial learning

03/02/2020 ∙ by Saneem Ahmed Chemmengath, et al. ∙ IBM

In this paper, we present a new paradigm for zero-shot learning (ZSL) that is trained by utilizing additional information (such as attribute-class mappings) for a specific set of unseen classes. We conjecture that such additional information about unseen classes is more readily available than unsupervised image sets. Further, on close examination of the underlying attribute predictors of popular ZSL algorithms, we find that they often leverage attribute correlations to make predictions. While attribute correlations that remain intact in the unseen (test) classes benefit the prediction of difficult attributes, changes in correlations can have an adverse effect on ZSL performance. For example, detecting the attribute 'brown' may be the same as detecting 'fur' over an animal image dataset captured in the tropics. However, such a model might fail on unseen images of Arctic animals. To address this effect, termed target-shift in ZSL, we utilize our proposed framework to design grouped adversarial learning. We introduce grouping of attributes to enable the model to continue to benefit from useful correlations, while restricting cross-group correlations that may be harmful for generalization. Our analysis shows that it is possible not only to constrain the model from leveraging unwanted correlations, but also to adjust them to a specific test setting using only the additional information (the already available attribute-class mapping). We show empirical results for zero-shot predictions on standard benchmark datasets, namely, the aPY, AwA2, SUN and CUB datasets. We further introduce to the research community a new experimental train-test split that maximizes target-shift, to further study its effects.







1 Introduction

Zero-shot learning (ZSL) algorithms are designed to train classifiers on seen classes for predicting any set of unseen classes. The models generalize by utilizing additional information, such as attributes, that persists between seen and unseen classes. Recent variants of the zero-shot paradigm, termed transductive ZSL, utilize the unlabelled test set as unsupervised additional information that can enhance generalization. However, obtaining a significant number of images from unseen classes of interest may not always be feasible. On the other hand, additional information about the label space is more readily available, such as class descriptions from knowledge banks (such as Wikipedia). For instance, one could cheaply construct an attribute-to-class mapping from such sources: for an unseen class Wolf, attributes describing it are readily found in common knowledge sources.

Figure 1: Traditional zero-shot learning algorithms utilize a set of seen classes (and associated information such as attributes-class mapping) to prepare a classifier for any set of unseen classes. This paper presents a variant of zero-shot learning that utilizes additional information from specific unseen classes (such as the already available attributes-class mapping) to create a tailored classifier. We show that such a paradigm of zero shot learning can be useful for correcting target shift in attributes.

This paper presents a new paradigm of zero-shot learning (ZSL) algorithms that designs test-class-specific classifiers using additional information from the label space. Similar to the popular transductive variant of ZSL, the proposed framework utilizes information about the unseen class labels (rather than images or features) to create a tailored classifier. Specifically, our experiments rely on attribute-class mappings to create an unseen-class-specific classifier that is robust to target shift in the attribute space. Given a scene understanding problem with images x and class labels y described by attributes a, we use the term target shift to describe a scenario where the probability distribution P(a) changes between the train and test settings.

Specifically, the contributions of this work are as follows:

  • We present a new zero-shot learning paradigm where the classifier can be tailored to a specific set of unseen classes by utilizing only additional information such as the attribute-class mapping. Specifically, we show that the proposed framework is effective in curtailing target-shift between attributes of seen and unseen classes.

  • We first present a principled analysis of the effects of target shift on a controlled synthetic dataset. With adversarial learning, we show that it is possible to prepare a learning model not only for test sets of unseen classes with de-correlated labels, but also for ones with different (opposite) correlations. The observations in such a controlled setting strongly motivate the proposed framework that uses adversarial learning.

  • Building on the analysis, we propose the grouped adversarial learning (gAL) paradigm for target shift, which distinguishes useful attribute correlations from harmful (shifting) ones and corrects for the latter. We show that the performance of a simple and popular baseline ZSL algorithm, ESZSL [31], can be improved with the proposed gAL. Further, our novel task grouping technique and adversarial task weighting scheme allow gAL to be universally applicable to any attribute-prediction based ZSL architecture that is end-to-end trainable.

  • We demonstrate the performance of gAL on four standard zero-shot learning benchmarks, namely, the Animals-with-Attributes-2 (AwA2) [39], Attribute Pascal and Yahoo (aPY) [11], Scene UNderstanding (SUN) [40], and Caltech UCSD Birds (CUB) [37] datasets. The proposed gAL improves the performance of the popular ESZSL algorithm by correcting for target-shift utilizing only the attribute-class mapping of the test classes. Further, we release a new experimental protocol (train-test split) that maximizes target-shift between the seen and unseen classes to further study this problem.

2 Related Work

Zero-shot learning has been extensively studied in the literature [39, 31, 24, 34, 6, 28, 36, 3, 32], with several variants such as transductive ZSL and generalized ZSL. Xian et al. [39] performed an extensive benchmarking of several SOTA algorithms under a common benchmark protocol, representation vectors and hyper-parameter tuning. They also showed that the performance of linear compatibility models is comparable with more complex hybrid models with joint representations.

Addressing the negative effects of label correlations has been studied in domain adaptation under target shift [45, 27], debiasing [44, 41, 43], privacy preservation [16, 19, 9], and multi-task learning [47, 30, 21]. In the target shift setting, the marginal label distributions differ between the train and test sets while the conditional distribution of inputs given labels remains the same. Changes in label distributions often make label correlations differ between train and test, so models end up capturing unwanted label correlations from the training set. In debiasing and privacy preservation settings, protected variables (debiasing) or sensitive/private variables (privacy) are correlated with the label in the data, and trained models reflect those biases or reveal the private information.

The work on target shift [45, 26, 27] uses importance reweighting of training instances to match the label distribution of the train set with that of the test set. This process performs poorly when the cardinality of the label set is large (curse of dimensionality). The setting also assumes that the labels appearing in the test set are strictly a subset of those in the train set, which does not hold in the zero-shot learning setting. In zero-shot learning, different label (attribute) combinations define a class, and the train and test sets contain different groups of classes.

The architectures of the adversarial learning frameworks for debiasing and privacy preservation are also relevant to our work. They learn representations that are invariant to protected or private information. In the case of fair classification [43, 44], information such as gender and zip code constitutes protected labels, and these approaches learn a model whose representations are invariant to those particular labels. Adversarial learning is also used in domain adaptation [14] to make learned representations invariant to the source and target domains. However, in such a setup, the invariant label is already known, and the setup usually deals with just one adversarial label. These methods do not deal with multiple adversaries at a time, as is the case when multiple attributes are involved in ZSL.

In multitask learning (MTL), several regularization-based methods have been proposed to mitigate the negative effects of label correlation [47, 21, 30]. These attempt to decorrelate label predictors using special regularizers that enforce predictors of different labels to use non-overlapping sets of features. The overall intent of these techniques is to decorrelate a multilabel classification model. However, such regularizers are not applicable to learned features in end-to-end trainable neural networks. Variants of zero-shot learning algorithms such as transductive ZSL focus on inefficiencies in the representation space. In this work, we show that it is possible to tailor ZSL algorithms to specific unseen classes using only additional information from the label space. Further, such an approach is helpful to curtail the target-shift in the underlying (implicit) attribute predictor of the simple and popular ESZSL algorithm.

3 Proposed Framework

In this section, we discuss our novel grouped adversarial learning technique to correct target shift in zero shot learning models.

3.1 Main Idea

In the vanilla ZSL setting, no information about the unseen (test) classes is given in the training phase, and one hopes that the trained model extends to any set of classes given their attribute-to-class mappings. In this paper we propose to use the attribute-to-class mapping of the unseen classes in the training phase to build a tailored ZSL model.

Many zero-shot learning (ZSL) algorithms learn to map an input instance to a predefined set of attributes that describe both seen and unseen classes. We use a to denote the attribute vector. In the training phase, models learn multilabel attribute predictors in the hope that they generalize well to unseen combinations of attributes. Looking past the task of predicting classes and considering only instances and attributes, we can view ZSL as a special case of the transfer learning problem. Here attribute distributions differ from seen to unseen classes, and we propose to view this as domain adaptation under target shift [27, 42, 26], where the attribute marginals P_tr(a) for the training set and P_te(a) for the test set are different but the conditional P(x|a) remains the same. Correcting for target shift requires access to P_te(a) at training time. We propose to estimate P_te(a) from the attribute-to-class mapping by assuming that the unseen classes are equally likely in the test set. One could also estimate it from unlabelled test data using BBSE [27].
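Under the equally-likely assumption, the test-time attribute marginals can be read directly off the attribute-to-class mapping as column means. A minimal sketch (the binary mapping matrix and class names below are illustrative, not from any benchmark):

```python
import numpy as np

# Rows: unseen classes, columns: binary attributes (illustrative values).
# S_unseen[c, i] = 1 if class c has attribute i in the attribute-to-class mapping.
S_unseen = np.array([
    [1, 0, 1],   # e.g. wolf:  furry, not aquatic, carnivore
    [0, 1, 1],   # e.g. seal:  not furry, aquatic, carnivore
    [1, 0, 0],   # e.g. hare:  furry, not aquatic, herbivore
])

# If all unseen classes are equally likely, P_te(a_i = 1) is simply the
# column mean of the mapping matrix.
p_te = S_unseen.mean(axis=0)
print(p_te)  # marginal probability of each attribute at test time
```

No test images are needed for this estimate, which is what makes the proposed paradigm cheaper than transductive ZSL.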

To the best of our knowledge, there is no prior work on correcting target shift for multilabel classification. Existing techniques such as importance reweighting for multiclass classification [10, 45] weigh each instance by the ratio of test to train label probability in the loss function. Such reweighting can be seen as a way of matching the label distributions of the train and test sets. This technique cannot be extended to a multilabel setting, as the number of unique combinations of attributes can be exponentially large, resulting in very noisy probability estimates (curse of dimensionality). The case of zero-shot learning is harder still: attribute combinations in the train set do not appear in the test set, so the importance weights all become zero (P_te(a) = 0 for every training combination a).
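A toy illustration of the reweighting scheme and of why it breaks down in the zero-shot case (the class priors and attribute combinations below are made up):

```python
import numpy as np

# Multiclass target shift: weight each training instance by P_te(y) / P_tr(y).
p_tr = np.array([0.5, 0.3, 0.2])      # class priors in the train set
p_te = np.array([0.2, 0.3, 0.5])      # class priors in the test set
weights = p_te / p_tr                  # importance weights: 0.4, 1.0, 2.5
print(weights)

# Multilabel / ZSL failure mode: every training attribute combination is
# absent from the test set, so its test probability -- and hence its
# importance weight -- is exactly zero.
train_combos = {(1, 0, 1), (0, 1, 1)}  # combinations seen in training
test_combos = {(1, 1, 0), (0, 0, 1)}   # disjoint combinations at test time
zsl_weights = [0.0 if c not in test_combos else 1.0 for c in train_combos]
print(zsl_weights)                     # all zero: nothing left to train on
```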

As it is impossible in the zero-shot setup to match the full attribute distributions of the seen (train) and unseen (test) classes, we propose to match all pairwise distributions P(a_i, a_j). A straightforward way to achieve this would be to train prediction models for all pairs of attributes, but it is not straightforward to combine all pairwise predictors into a single attribute-set prediction. To this end, we propose to use adversarial learning to correct target shift. A pairwise distribution can be factorized into three terms: the marginals P(a_i) and P(a_j), and the correlation coefficient ρ_ij between a_i and a_j. Correspondingly, target shift for a pair of attributes can be captured as shifts in the two marginal distributions plus a shift in the correlation between the attributes. We propose importance reweighting to correct changes in the marginal distributions and adversarial learning to correct the correlation shift.
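For binary attributes this factorization can be checked numerically: the correlation coefficient is a closed-form function of the joint and the two marginals, and agrees with the empirical Pearson correlation (the sampling scheme below is synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample two correlated binary attributes: a_j copies a_i 80% of the time,
# otherwise it is uniform, giving a correlation of about 0.8.
n = 100_000
a_i = rng.integers(0, 2, size=n)
copy = rng.random(n) < 0.8
a_j = np.where(copy, a_i, rng.integers(0, 2, size=n))

# P(a_i, a_j) decomposes into the marginals P(a_i), P(a_j) and the
# correlation coefficient rho between a_i and a_j.
p_i, p_j = a_i.mean(), a_j.mean()
p_joint = np.mean(a_i & a_j)                      # P(a_i = 1, a_j = 1)
rho = (p_joint - p_i * p_j) / np.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

# The closed form agrees with the empirical Pearson correlation.
rho_emp = np.corrcoef(a_i, a_j)[0, 1]
```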

We utilize a formulation of adversarial learning similar to the ones popularly used in unsupervised domain adaptation [14] and in debiasing prediction models [19, 16, 44]. For a prediction model for a_i, we use a_j as an adversarial task, and vice versa when decorrelating a_j against a_i. We show in the synthetic experiments that with an appropriate weighting of the adversarial task, one can correct the correlation shift for the pair of labels. The following are the benefits of using adversarial learning to correct correlation shift:

  • Adversarial learning can be applied to ZSL methods even when they do not explicitly predict attributes. We show later in the paper how adversarial learning can be applied to compatibility-learning ZSL methods, specifically ESZSL [31].

  • A predictor for one attribute can have several adversarial branches connected to it to simultaneously minimize all pairwise correlation shifts against this particular attribute, given the right weighting scheme.

Next, we present a controlled setting for using adversarial learning to correct target shift in a two-label scenario, using synthetic data with fixed correlation differences. We then extend this approach to a larger number of attributes using grouped adversarial learning (gAL).

3.2 Adversarial Learning for Target Shift

Figure 2: Synthetic experiments analysis: (a) Probabilistic data generation to create data with target shift. (b) Model accuracy on the test set with varying label correlation, when the model was trained on a training set with fixed correlation (vertical dotted line in the diagram). (c) Model weights on features. Note that the best models for the primary and auxiliary labels give equal positive weights to the first and last five features, respectively.

We analyze the performance of adversarial learning to correct target shift using synthetic data as it allows us to create training and test sets with varying correlations which is not otherwise possible with real data. This analysis makes two counter-intuitive observations and motivates the proposed formulation which is presented later in Sec.3.3.

Data Generation: Our synthetic dataset consists of 10-dimensional real-valued instances x with two binary labels y_p and y_a (primary and auxiliary, respectively). As shown in Fig. 2(a), we generate the data from a probabilistic generative system with different label distributions for the training and test sets while keeping the conditional P(x | y_p, y_a) the same throughout; the generated data thus exhibits target shift. To generate a data point:

  1. First, we sample the label pair (y_p, y_a) from the label distribution.

  2. The first 5 features of the instance are then sampled according to the primary label y_p from a mixture of two 5-dimensional multivariate Gaussian distributions with identity covariance matrix: if y_p = 1, the 5 features are sampled from the first Gaussian, otherwise from the second.

  3. Similarly, the last 5 features are sampled the same way from another pair of 5-dimensional multivariate Gaussian distributions, this time according to the auxiliary label y_a.
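The generative process above can be sketched as follows (the component separation and the mechanism for inducing a given label correlation are illustrative choices, not the paper's exact parameters):

```python
import numpy as np

def make_split(n, rho, rng):
    """Sample n instances with binary labels (y_p, y_a) at correlation rho
    and 10-d features drawn from label-conditioned Gaussians."""
    # For balanced binary labels, P(y_a == y_p) = (1 + rho) / 2.
    y_p = rng.integers(0, 2, size=n)
    agree = rng.random(n) < (1 + rho) / 2
    y_a = np.where(agree, y_p, 1 - y_p)

    # First 5 features depend on y_p, last 5 on y_a; identity covariance.
    mu = 0.8  # separation of the two mixture components (illustrative)
    x = rng.standard_normal((n, 10))
    x[:, :5] += np.where(y_p[:, None] == 1, mu, -mu)
    x[:, 5:] += np.where(y_a[:, None] == 1, mu, -mu)
    return x, y_p, y_a

rng = np.random.default_rng(0)
x_tr, yp_tr, ya_tr = make_split(1000, rho=0.5, rng=rng)
x_te, yp_te, ya_te = make_split(50_000, rho=-0.5, rng=rng)  # target shift
```

Since P(x | y_p, y_a) is shared by both splits while the label correlation flips, the two splits differ only through target shift.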

Setup: For both y_p and y_a, we choose the means of the corresponding Gaussian distributions such that the best linear classifier has positive and equal weights on all 5 features, ensuring that all 5 features are equally important. We avoid class-imbalance issues by keeping both label marginals balanced. We also keep the distance between the Gaussian distributions corresponding to the primary and auxiliary labels fixed, which fixes the Bayes accuracy of each label.

For these synthetic experiments, we are interested in the predictive performance on the primary label when the model is trained at a given label correlation and evaluated against multiple test sets with varying label correlations. We fix the label correlation in the train set and create test sets with correlations ranging from strongly negative to strongly positive. In each experiment, we train a model on the training set and test its performance on test sets that differ from the training set only in the label correlation. We sample 1000 training instances, and a much larger 50,000 test instances to avoid sampling bias in the evaluation.

We compare the following algorithms in this analysis:

  1. Baseline: A linear logistic regression classifier trained only on the primary label y_p.

  2. Sharing: A two-label MLP with one hidden layer (2 neurons, no activations) that predicts both y_p and y_a. The common hidden layer encourages sharing.

  3. Adv-λ: An adversarial learning model with one hidden layer of two neurons (as encoder), a label predictor for the primary label, and a discriminator that predicts the auxiliary label with adversarial weight λ.

Observations and Insights: Fig.2(b) illustrates the test accuracy on primary label prediction against label correlation in test set. Fig.2(c) visualizes model weights on 10-dimensional feature vector. All models are essentially linear functions. We make the following observations from this exercise:

  • Given a model trained on a dataset with a certain label correlation, applying adversarial learning with the right choice of λ not only improves its performance on test data with uncorrelated labels, but also on test datasets with opposite correlations (by opposite correlation we mean correlation with a different sign).

  • As shown in Fig. 2(b), the performance of the baseline model is monotonically affected by the change in correlation between y_p and y_a. Further, we observe that the performance is less affected when the correlation increases with the same polarity. A similar observation was made by [18] in the bias setting, where it is termed bias amplification. The adversarial models (Adv-λ), on the other hand, are more invariant to the various test-set label correlations that are a consequence of target shift.

  • In the target shift context, as shown in Fig. 2(b), sharing is useful when the label correlation expected at test time remains close to the training correlation or increases. However, the performance drops drastically if the target shift yields a lower or opposite label correlation in the test set.

  • The choice of the adversarial weight (hyperparameter λ) is critical to the performance of the model for a given test correlation. For instance, in this setup, a smaller λ is the best choice when the test set is uncorrelated, whereas a larger λ is more suitable for test correlations opposite to the training correlation. A suitable choice of λ can even cause the model to achieve higher accuracy on a target-shifted test set than on the training set.

  • Fig. 2(c) illustrates the weights the model assigns to the input feature vector. As expected, the weights on the features indicative of the primary label (first five features) are higher than those on the auxiliary features (last five). By design, for data generated with zero label correlation, the best model assigns zero weights to the features corresponding to the auxiliary label. However, we notice that the baseline model exploits the positive label correlation and assigns positive weights to the auxiliary features to make predictions.

  • As λ increases for the adversarial models, we observe that the model weights on the features corresponding to the auxiliary label are reduced. Furthermore, for larger values of λ, the model assigns negative weights to the features corresponding to y_a. Negative weights on the last five features imply that the model has captured an opposite correlation between the labels, even though no such correlation is observed in training.

3.3 Grouped Adversarial Learning (gAl)

We now describe our novel grouped adversarial learning for correcting the effects of target shift in the attribute predictors of zero-shot learning algorithms, where typically a large number of attributes (e.g., parts of animals or birds) are predicted for unseen classes.

3.3.1 Adversarial Weighting

With a large number of attributes, multiple adversarial branches need to be used: for any given attribute, there may be several attributes whose pairwise correlation shifts considerably between the train and test sets. In such a scenario, we attach as many adversarial branches to that attribute's encoder as needed to account for the changes in correlation. The choice of adversarial weights becomes crucial, and hyperparameter tuning to find the best adversarial weight for every pair of labels becomes intractable. We therefore design an adversarial weighting scheme from statistics of the data and the insights of the previous section.

Figure 3: Proposed model architecture for (a) the attribute prediction task and (b) zero-shot class prediction with grouped adversarial learning. Each attribute group's latent representation is adversarially trained with all remaining groups. The attribute predictions are multiplied with the attribute-class mapping prior matrix. This illustration shows three attribute groups; the model architecture in the experiments is extended according to the dataset.

As observed in the previous section, adversarial learning performs poorly when the correlation between a pair of labels is amplified in the test set relative to the training set, and helps when the test correlation moves in the opposite direction; moreover, the larger the change in correlation, the higher the desired adversarial weight. Keeping these in mind, we propose an adversarial weighting scheme w. For attributes a_i and a_j, with train and test correlation coefficients ρ_tr and ρ_te, w_ij is defined as:

    w_ij = max(0, sign(ρ_tr) (ρ_tr - ρ_te))    (1)

For positive values of w_ij, the corresponding pair of attributes is allowed to be adversarial to each other with adversarial weight w_ij. Attribute pairs whose w_ij values are zero either have the same correlation in the training and test sets, or have a higher correlation in the test set than in the train set with the same sign (positive or negative). From the multi-task learning (MTL) literature [4] and the analysis in the previous section, we observe that allowing such pairs of attributes to share representations helps performance. We allow such attributes to share hidden layers of the model, as is common with deep learning models.
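A weighting with these properties can be sketched as follows (the exact closed form here, w = max(0, sign(ρ_tr)(ρ_tr − ρ_te)), is one formulation consistent with the behavior described above; the correlation values are made up):

```python
import numpy as np

def adv_weight(rho_tr, rho_te):
    """Adversarial weight for an attribute pair: positive only when the
    test correlation weakens or flips relative to training (an assumed
    closed form with the properties described in the text)."""
    return max(0.0, np.sign(rho_tr) * (rho_tr - rho_te))

# Correlation flips sign: large weight, pair trained adversarially.
print(adv_weight(0.6, -0.4))   # 1.0
# Correlation amplified with the same sign: weight is zero, pair shares.
print(adv_weight(0.6, 0.9))    # 0.0
# No shift: weight is zero.
print(adv_weight(-0.3, -0.3))  # 0.0
```

Pairs with zero weight are exactly the ones allowed to share hidden layers; pairs with positive weight are adversarially constrained in proportion to the shift.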

3.3.2 Attribute Grouping

For the proposed ZSL with adversarial learning in the case of a large number of attributes, we propose a grouping strategy over similar attributes. As in previous work, group membership can be based on the semantic similarity of attributes (color, shape, habitat) [21, 20, 35, 22]. However, in the context of target shift, we hypothesize that grouping attributes based on correlation shift may be more beneficial. Specifically, the correlation shift w among attribute pairs in the same group should be low, and that across groups should be high. We form groups by clustering attributes with spectral co-clustering [8], using w as the distance measure.
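As a dependency-free stand-in for spectral co-clustering, the grouping criterion can be illustrated with a simple union-find over low-shift pairs (the shift matrix below is made up; the real groups come from spectral co-clustering on the benchmark statistics):

```python
import numpy as np

# Pairwise correlation-shift matrix for 4 attributes (illustrative values):
# low shift within {0, 1} and within {2, 3}, high shift across.
shift = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])

def group_attributes(shift, threshold):
    """Union-find grouping: merge attribute pairs whose shift is below
    the threshold, so that cross-group shift stays high."""
    parent = list(range(len(shift)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    n = len(shift)
    for i in range(n):
        for j in range(i + 1, n):
            if shift[i, j] < threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

print(group_attributes(shift, threshold=0.5))  # [[0, 1], [2, 3]]
```

Attributes in the same resulting group share a latent representation, while cross-group pairs (high shift) become adversarial.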

3.4 Model Architecture

As illustrated in Fig. 3(a), given a group membership of attributes and the weighting scheme, we propose a one-vs-all architecture for attribute prediction, with every group jointly predicting its member attributes while constrained by all other groups as adversarial branches. We reiterate that the proposed grouped adversarial learning (gAL) is designed to preserve model performance on test sets with target shift: model prediction continues to benefit from sharing within group members, while a suitable weighting scheme adversarially constrains the model from leveraging unintended cross-group correlations. The model first encodes an image through a trainable ConvNet backbone, such as ResNet-101. The resulting feature encoding is connected to one latent representation per group through fully connected layers (with ReLU). The latent representation of a group is responsible for predicting all of that group's member attributes, thus enabling sharing. Further, to each latent representation we connect adversarial arms, one per remaining group, with the same number of fully connected layers (with ReLU) as the primary arm. Note that each latent representation is updated by its prediction arm and adversarially updated by the remaining arms. We evaluate model performance solely on the primary task of each latent group representation.
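A shape-level sketch of this architecture may help fix the wiring (the layer sizes, group sizes, and single-layer arms below are illustrative simplifications; in practice the encoders are trained with reversed gradients from the adversarial heads):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: a 2048-d backbone feature, 3 attribute groups of
# sizes 4, 3 and 2, and a 64-d latent representation per group.
feat_dim, latent_dim = 2048, 64
group_sizes = [4, 3, 2]
n_groups = len(group_sizes)

# One encoder per group: backbone feature -> group latent (FC + ReLU).
W_enc = [rng.standard_normal((feat_dim, latent_dim)) * 0.01
         for _ in range(n_groups)]
# Primary head per group predicts its member attributes; each group also
# carries an adversarial head for every *other* group's attributes.
W_prim = [rng.standard_normal((latent_dim, s)) * 0.1 for s in group_sizes]
W_adv = [[rng.standard_normal((latent_dim, group_sizes[j])) * 0.1
          for j in range(n_groups) if j != g] for g in range(n_groups)]

x = rng.standard_normal((8, feat_dim))           # a batch of 8 features
latents = [relu(x @ W) for W in W_enc]           # one latent per group
primary = [sigmoid(z @ W) for z, W in zip(latents, W_prim)]
adversarial = [[sigmoid(z @ W) for W in heads]   # updated via reversed
               for z, heads in zip(latents, W_adv)]  # gradients in training

attr_pred = np.concatenate(primary, axis=1)      # all 9 attribute scores
```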

ESZSL: As illustrated in Fig. 3(b), the ESZSL [31] algorithm utilizes a multi-layer perceptron to directly predict the class label via a fixed attribute-to-class transformation matrix. This makes attribute prediction implicit to the model, thereby scaling each prediction by its importance to the unseen classes. While several other formulations of ZSL exist, we use ESZSL for its surprisingly good performance on ZSL benchmarks. Further, for the proposed approach, we replace the original closed-form convex solution with stochastic gradient descent to enable end-to-end training with any backbone architecture.
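The ESZSL decision rule itself is compact: scores are a bilinear map from features through a learned matrix to attributes, then through the fixed attribute-to-class matrix (the dimensions and random matrices below are illustrative; in ESZSL the middle matrix is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

d, a, z = 2048, 85, 10        # feature dim, #attributes, #unseen classes
V = rng.standard_normal((d, a)) * 0.01              # learned feature-to-attribute map
S = rng.integers(0, 2, size=(a, z)).astype(float)   # fixed attribute-class prior

x = rng.standard_normal((5, d))   # a batch of 5 test features
scores = x @ V @ S                # implicit attribute predictions (x @ V),
                                  # scaled by the class mapping S
pred = scores.argmax(axis=1)      # predicted unseen class per image
```

Because the attribute predictions x @ V are an intermediate product of this map, the grouped adversarial branches can be attached to them even though ESZSL never exposes them as an explicit output.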

Loss function: Apart from changing label correlations, target shift can change individual label priors (class imbalance). While adversarial learning with the right set of adversaries and the associated weighting scheme handles the change in label correlation due to target shift, it does not explicitly handle the change in label priors. Hence, we use balanced loss functions (such as balanced cross entropy) in our models.
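For reference, a balanced cross entropy reweights the positive and negative terms by the inverse label frequency (this exact weighting is one common choice, not necessarily the paper's):

```python
import numpy as np

def balanced_bce(y_true, y_prob, eps=1e-7):
    """Binary cross entropy with the positive/negative terms reweighted
    so that a shift in the label prior does not dominate the loss."""
    p = y_true.mean()                      # empirical positive rate
    w_pos, w_neg = 1.0 / max(p, eps), 1.0 / max(1.0 - p, eps)
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(w_pos * y_true * np.log(y_prob)
               + w_neg * (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

y = np.array([1.0, 0.0, 0.0, 0.0])        # imbalanced: one positive in four
good = balanced_bce(y, np.array([0.9, 0.1, 0.1, 0.1]))
bad = balanced_bce(y, np.array([0.1, 0.1, 0.1, 0.1]))  # misses the positive
```

The upweighted positive term keeps the rare positive's error from being drowned out by the many easy negatives.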

Optimization: Adversarial learning models can be optimized in two ways. One is an alternating optimization approach similar to GANs [15]. We use the other approach, which places a special gradient-flipping layer, called a gradient reversal layer [13], before the adversarial arms. This layer allows all layers to be trained simultaneously. The group latent representations get negated gradients from the adversarial arms, which push the latent representations to diminish the performance of the adversarial arms. The extent of the influence of the negative gradients is controlled by a single weight parameter, the adversarial weight.
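The gradient reversal layer is an identity in the forward pass and multiplies incoming gradients by minus the adversarial weight in the backward pass; a minimal standalone sketch (outside any autograd framework):

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; the backward pass scales gradients by
    -adv_weight, so the encoder is pushed to *hurt* the adversarial head
    it feeds while that head is still trained normally."""

    def __init__(self, adv_weight):
        self.adv_weight = adv_weight

    def forward(self, x):
        return x                                   # no change going forward

    def backward(self, grad_output):
        return -self.adv_weight * grad_output      # reversed, scaled gradient

grl = GradientReversal(adv_weight=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)                                 # identical to x
g = grl.backward(np.array([0.2, 0.2, 0.2]))        # negated, scaled
```

In a deep learning framework this is typically implemented as a custom autograd function, with adv_weight set per pair by the proposed weighting scheme.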

Dataset #attributes #seen classes (train+val) #unseen classes #seen images (train+val) #unseen images mean w mean w @top 50%
aPY 64 15+5 12 7415 7924 0.073 0.145
AWA2 85 27+13 10 29409 7913 0.161 0.319
CUB 312 100+50 50 8821 2967 0.019 0.036
SUN 102 580+65 72 12900 1440 0.016 0.033
aPY-CS 64 15+5 12 10990 4349 0.132 0.246
AWA2-CS 85 27+13 10 32486 4836 0.255 0.483
CUB-CS 312 100+50 50 8859 2929 0.041 0.076
SUN-CS 102 580+65 72 12900 1440 0.074 0.136
Table 1: Statistics of datasets with attribute correlation shift between train and test sets
Method aPY AWA2 CUB SUN
DAP [25] 33.8 46.1 40.0 39.9
IAP [25] 36.6 35.9 24.0 19.4
CONSE [29] 26.9 44.5 34.3 38.8
CMT [33] 28.0 37.9 34.6 39.9
SSE [46] 34.0 61.0 43.9 51.5
LATEM [38] 35.2 55.8 49.3 55.3
ALE [1] 39.7 62.5 54.9 58.1
DEVISE [12] 39.8 59.7 52.0 56.5
SJE [2] 32.9 61.9 53.9 53.7
SYNC [5] 23.9 46.6 55.6 56.3
SAE [23] 8.3 54.1 33.3 40.3
PSR [3] 38.4 63.8 56.0 61.4
ESZSL [31] 38.3 58.6 53.9 54.5
ESZSL* [31] 34.4 53.8 48.6 55.5
ESZSL-gAL 39.8 61.4 52.2 59.3
Table 2: Performance (% Average per-class Accuracy) of proposed approach and popular vanilla ZSL approaches with ResNet-101 pre-trained features. * Our ESZSL implementation of the closed-form solution.
Method aPY-CS AWA2-CS
ESZSL* [31] 20.5 36.9
ESZSL-gAL 23.6 40.3
Table 3: Performance (% Average per-class Accuracy) on our proposed high correlation-shift splits of the datasets.

4 Experimental Analysis

We utilize the Animals-with-Attributes-2 (AWA2) [39], Attribute Pascal and Yahoo (aPY) [11], Scene UNderstanding (SUN) [40], and Caltech UCSD Birds (CUB) [37] datasets to demonstrate the advantages of the proposed grouped Adversarial Learning (gAL) derived from our ZSL framework. For all the datasets, we follow the experimental protocol consistent with previous literature [39] (protocol details in Table 1). The focus of the experimental study is to evaluate the ability of gAL to counter the effects of target shift on the popular ESZSL [31] algorithm. We show that gAL substantially improves the performance of ESZSL and produces comparable results to other SOTA algorithms trained on the same feature representations (see Table 2).

4.1 Datasets and Protocol

The details of the experimental protocols for all four datasets are described in Table 1. Briefly, AWA2 consists of images of animals in natural settings, with 50 animal classes and 85 annotated attributes that describe each class. The aPY dataset is smaller, with 32 classes and 64 attributes. The CUB dataset consists of 200 classes with 312 attributes, and the SUN dataset comprises 717 classes with 102 attributes. In all cases, we utilize the "attribute to class prior" matrices provided with the datasets.

Protocol: The experimental protocol for all four datasets is described in [39]. The protocol is designed such that the validation set is also zero-shot in nature. We utilize the feature representations (of length 2048) obtained from a ResNet-101 [17] model pre-trained on ImageNet [7], as provided by the authors of [39].

Correlation-shift analysis and new splits: Table 1 also shows the mean difference in correlation, measured by w (Eq. 1), both over all attribute pairs and over the top 50% of attribute pairs. We highlight the significantly high change in correlation for the AWA2 and aPY datasets. In addition to the standard protocol used in the literature, we generate a new experimental split of train, validation and test sets such that the difference in correlation (measured by w) is maximized. Using a greedy selection approach, we obtain new splits, particularly of AWA2 and aPY, with significantly higher w while maintaining the same class count in each split. We present the performance of the baselines and the proposed approach on this new split in Table 3, which highlights the problems of target shift and the ability of gAL to correct for them.

(a) AWA2
(b) aPY
Figure 4: The performance (% Average per-class Accuracy) on both the proposed correlation-shift validation and test splits shows the importance of adversarial weight selection for obtaining the best model and for early stopping. Further, datasets with higher w require larger adversarial weights.

4.2 Analysis

Next, we discuss the experiments with the proposed approach gAL, reported in Tables 2 and 3:

  • With the gAL framework, the simple and popular ESZSL improves over our reproduced ESZSL baseline (ESZSL*), as shown in Table 2: aPY improves by 5.4%, AwA2 by 7.6%, CUB by 3.6% and SUN by 3.8%. Note that the baseline ESZSL is a closed-form solution, while ESZSL-gAL uses stochastic gradient descent. We compare against our reproduced results from a publicly available implementation; similar to [39], we are unable to exactly reproduce the originally reported numbers.

  • The SOTA algorithms represented in Table 2 (obtained from [39] and [3]) are a representative set that covers Linear Compatibility methods (ALE, DEVISE, SJE, SAE, ESZSL), Nonlinear Compatibility methods (CMT, LATEM, PSR), Learning Intermediate Attribute Classifiers (IAP, DAP), and hybrid models (SSE, CONSE). All approaches use the same ResNet-101 feature representations and the same training and model selection protocols. It is interesting to note the comparable performance of a linear compatibility approach, ESZSL-gAL, with a state-of-the-art nonlinear compatibility approach such as PSR.

  • The performance of ESZSL (closed-form solution) and the proposed ESZSL with gAL on the newly introduced correlation-shift splits is shown in Table 3. On the new splits, performance improves by 6.6% on AwA2-CS and by 3.1% on aPY-CS. Further, as shown in Table 1, the difference in correlation is significant for AwA2-CS and aPY-CS, at 0.483 and 0.246 respectively for the top 50% of attribute pairs.

  • The proposed data-driven attribute grouping scheme is based on correlation statistics of the dataset, which measure the directional correlation of the attributes. We use the popular spectral co-clustering method and empirically identify the goodness of fit for each dataset (AwA2: 8, aPY: 3, SUN: 8, CUB: 5). The AwA2 and CUB datasets provide semantic groupings of attributes (AwA2: 10, CUB: 28) that have been extensively utilized in previous literature [21]. However, we observe that the proposed co-clustering method yields clusters with better performance that still retain semantic relevance; for instance, the cluster {‘lean’, ‘swims’, ‘fish’, ‘arctic’, ‘coastal’, ‘ocean’, ‘water’} clearly represents the aquatic animal classes of the AwA2 dataset.

  • We attach a trainable layer (500 units) with ReLU activation to the base model. Next, we add a latent representation per group (100 units), fully connected to the group-specific attribute predictions (with sigmoid activation). All models are trained with a learning rate of . The model hyper-parameters (number and size of intermediate layers, batch-size, , ) were all picked from a large parameter sweep for best validation error.

  • In all the proposed models, we scale the adversarial weights proportionally to , and normalize them against the corresponding primary task such that the two are equally weighted. As mentioned before, the choice of is also essential to maintain performance on the primary task. Due to the large number of adversarial tasks for any given latent vector, training without the proposed adversarial weighting scheme leads to poor prediction performance. Fig. 4 shows the effect of the adversarial weight on model performance. Note that datasets with a higher mean require larger adversarial weights to achieve better test accuracy, as shown earlier in the synthetic data analysis.

  • As mentioned above, the weighting scheme used in adversarial training is critical for performance on target-shifted test sets. Further, we observe that models with adversarial arms but without suitable weighting quickly collapse to degenerate performance.

  • The proposed gAL approach constructs the attribute clusters (and correspondingly the model architecture) along with the adversarial weighting from the statistics of the test set. However, the best model is selected via early stopping on validation accuracy. The difference in performance between validation and test highlights the difficulty of early stopping and model selection in adversarial learning (see Fig. 4). Other unsupervised model-selection approaches may be explored here.
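The data-driven attribute grouping discussed above can be sketched with scikit-learn's implementation of spectral co-clustering [8]. The toy class-by-attribute matrix and the cluster count here are illustrative, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy class-by-attribute incidence matrix (rows: classes, cols: attributes).
# A small positive offset keeps the matrix strictly positive, as the
# bipartite spectral method expects.
class_attr = np.array([
    [1, 1, 0, 0],   # aquatic class:     e.g. 'swims', 'water'
    [1, 1, 0, 0],
    [0, 0, 1, 1],   # terrestrial class: e.g. 'fur', 'brown'
    [0, 0, 1, 1],
], dtype=float) + 0.05

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(class_attr)

# Attributes sharing a column label form one group for gAL.
groups = model.column_labels_
```

Each resulting attribute group then gets its own latent representation, with adversarial arms discouraging cross-group correlations.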
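The adversarial weight normalization described above might look like the following sketch. The proportionality to the observed correlation shift and the equal total weighting against the primary task follow the description in the text; the function names and the exact gradient-reversal-style objective are our assumptions:

```python
import numpy as np

def adversarial_weights(corr_shift, primary_weight=1.0):
    """Scale per-task adversarial weights proportionally to the observed
    correlation shift, normalized so the adversarial losses jointly carry
    the same total weight as the primary attribute-prediction loss."""
    corr_shift = np.asarray(corr_shift, dtype=float)
    w = corr_shift / corr_shift.sum()      # proportional to the shift
    return primary_weight * w              # total equals primary weight

def total_loss(primary_loss, adv_losses, weights):
    # Gradient-reversal-style objective for the shared representation:
    # minimize the primary loss while *maximizing* the weighted adversaries.
    return primary_loss - float(np.dot(weights, adv_losses))
```

Without this normalization, the many adversarial terms can dominate the primary loss, which matches the degenerate behavior reported above.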

5 Conclusion and Future Work

In this work, we show that adversarial learning, coupled with grouping and adversarial-weighting strategies, can be an effective way to curtail target shift in zero-shot learning settings. We propose a new paradigm for zero-shot learning that leverages additional information on the unseen class label distribution, such as the attribute-class mapping, to design and weight the proposed grouped adversarial learning. Specifically, we substantially improve the performance of the simple and popular ESZSL algorithm by correcting for target shift on four standard benchmark datasets. Our results indicate that a similar improvement can be achieved in any supervised zero-shot learning method with attribute predictions that seeks robustness to target shift, provided some indication of the expected attribute correlations at test time is available (such as domain expertise or explainability models). As discussed in previous literature, optimization of adversarial learning is challenging and requires large hyperparameter sweeps. A functional and flexible PyTorch implementation, along with all hyperparameter tuning heuristics, was built for the experimental evaluation of this work and has been open-sourced to the community.


  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2015) Label-embedding for image classification. IEEE transactions on pattern analysis and machine intelligence 38 (7), pp. 1425–1438. Cited by: Table 2.
  • [2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015) Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2927–2936. Cited by: Table 2.
  • [3] Y. Annadani and S. Biswas (2018) Preserving semantic relations for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7603–7612. Cited by: §2, Table 2, 2nd item.
  • [4] A. Argyriou, T. Evgeniou, and M. Pontil (2007) Multi-task feature learning. In Advances in neural information processing systems, pp. 41–48. Cited by: §3.3.1.
  • [5] S. Changpinyo, W. Chao, B. Gong, and F. Sha (2016) Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5327–5336. Cited by: Table 2.
  • [6] S. Changpinyo, W. Chao, B. Gong, and F. Sha (2016) Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5327–5336. Cited by: §2.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §4.1.
  • [8] I. S. Dhillon (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–274. Cited by: §3.3.2.
  • [9] H. Edwards and A. Storkey (2016) Censoring representations with an adversary. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §2.
  • [10] C. Elkan (2001) The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, Vol. 17, pp. 973–978. Cited by: §3.1.
  • [11] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth (2009) Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785. Cited by: 4th item, §4.
  • [12] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: Table 2.
  • [13] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189. Cited by: §3.4.
  • [14] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §2, §3.1.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.4.
  • [16] J. Hamm (2017) Minimax filter: learning to preserve privacy from inference attacks. The Journal of Machine Learning Research 18 (1), pp. 4704–4734. Cited by: §2, §3.1.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [18] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach (2018) Women also snowboard: overcoming bias in captioning models. arXiv preprint arXiv:1807.00517. Cited by: 2nd item.
  • [19] Y. Iwasawa, K. Nakayama, I. Yairi, and Y. Matsuo (2017) Privacy issues regarding the application of dnns to activity-recognition using wearables and its countermeasures by use of adversarial training. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 1930–1936. External Links: Document, Link Cited by: §2, §3.1.
  • [20] L. Jacob, J. Vert, and F. R. Bach (2009) Clustered multi-task learning: a convex formulation. In Advances in neural information processing systems, pp. 745–752. Cited by: §3.3.2.
  • [21] D. Jayaraman, F. Sha, and K. Grauman (2014) Decorrelating semantic visual attributes by resisting the urge to share. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1629–1636. Cited by: §2, §2, §3.3.2, 4th item.
  • [22] Z. Kang, K. Grauman, and F. Sha (2011) Learning with whom to share in multi-task feature learning.. In International Conference on Machine Learning, Vol. 2, pp. 4. Cited by: §3.3.2.
  • [23] E. Kodirov, T. Xiang, and S. Gong (2017) Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3174–3183. Cited by: Table 2.
  • [24] E. Kodirov, T. Xiang, and S. Gong (2017) Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3174–3183. Cited by: §2.
  • [25] C. H. Lampert, H. Nickisch, and S. Harmeling (2013) Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. Cited by: Table 2.
  • [26] Y. Lin, Y. Lee, and G. Wahba (2002) Support vector machines for classification in nonstandard situations. Machine learning 46 (1-3), pp. 191–202. Cited by: §2, §3.1.
  • [27] Z. Lipton, Y. Wang, and A. Smola (2018) Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pp. 3128–3136. Cited by: §2, §2, §3.1.
  • [28] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650. Cited by: §2.
  • [29] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean (2014) Zero-shot learning by convex combination of semantic embeddings. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: Table 2.
  • [30] B. Romera-Paredes, A. Argyriou, N. Berthouze, and M. Pontil (2012) Exploiting unrelated tasks in multi-task learning. In International Conference on Artificial Intelligence and Statistics, pp. 951–959. Cited by: §2, §2.
  • [31] B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pp. 2152–2161. Cited by: 3rd item, §2, 1st item, §3.4, Table 2, Table 3, §4.
  • [32] M. B. Sariyildiz and R. G. Cinbis (2019-06) Gradient matching generative networks for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [33] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pp. 935–943. Cited by: Table 2.
  • [34] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pp. 935–943. Cited by: §2.
  • [35] S. Thrun and J. O’Sullivan (1998) Clustering learning tasks and the selective cross-task transfer of knowledge. In Learning to learn, pp. 235–257. Cited by: §3.3.2.
  • [36] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 4. Cited by: §2.
  • [37] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: 4th item, §4.
  • [38] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele (2016) Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 69–77. Cited by: Table 2.
  • [39] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2018) Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence. Cited by: 4th item, §2, §3.4, 1st item, 2nd item, §4.1, §4.
  • [40] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. Cited by: 4th item, §4.
  • [41] Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig (2017) Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, pp. 585–596. Cited by: §2.
  • [42] Y. Yu and Z. Zhou (2008) A framework for modeling positive class expansion with single snapshot. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 429–440. Cited by: §3.1.
  • [43] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013) Learning fair representations. In International Conference on Machine Learning, pp. 325–333. Cited by: §2, §2.
  • [44] B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. arXiv preprint arXiv:1801.07593. Cited by: §2, §2, §3.1.
  • [45] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang (2013) Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827. Cited by: §2, §2, §3.1.
  • [46] Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision, pp. 4166–4174. Cited by: Table 2.
  • [47] Y. Zhou, R. Jin, and S. C. Hoi (2010) Exclusive lasso for multi-task feature selection. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 988–995. Cited by: §2, §2.