ATZSL: Defensive Zero-Shot Recognition in the Presence of Adversaries

10/24/2019 ∙ by Xingxing Zhang, et al.

Zero-shot learning (ZSL) has recently received extensive attention, particularly in fine-grained object recognition, retrieval, and image captioning. Because no training samples are available for unseen classes and the defense must itself transfer across domains, a learned ZSL model is especially vulnerable to adversarial attacks. Recent work has also shown that adversarially robust generalization requires more data, which further undermines the robustness of ZSL. Nevertheless, very few efforts have been devoted to this direction. In this paper, we make an initial attempt and propose a generic formulation (named ATZSL) that provides a systematic solution for learning a robust ZSL model. By casting ZSL into a min-max optimization problem, ATZSL achieves better generalization on various adversarial object recognition tasks while losing only negligible performance on clean images of unseen classes. To solve this problem, we design a defensive relation prediction network that bridges the seen and unseen class domains via attributes, so that both the prediction and the defense strategy generalize. Additionally, our framework can be extended to handle poisoned unseen class attributes. Extensive experiments demonstrate that ATZSL obtains a remarkably more favorable trade-off between model transferability and robustness than currently available alternatives under various settings.


1 Introduction

In many practical applications, we need models that can determine class labels for data belonging to unseen classes. The following are some popular application scenarios [49]:

  • The number of target classes is large. Generally, human beings can recognize at least 30,000 object classes [5]. However, collecting sufficient labeled instances for such a large number of classes is challenging. Thus, existing image datasets can only cover a small subset of these classes.

  • Target classes are rare. An example is fine-grained object recognition. Suppose we want to recognize flowers of different breeds [31]. It is hard and even prohibitive to collect sufficient image instances for each specific flower breed.

  • Target classes change over time. An example is recognizing images of products belonging to a certain style and brand. As products of new styles and new brands appear frequently, for some new products, it is difficult to find corresponding labeled instances [34].

  • Annotating instances is expensive and time consuming. For example, in the image captioning problem [47], each image in the training data should have a corresponding caption. This problem can be seen as a sequential classification problem. The number of object classes covered by the existing image-text corpora is limited, with many classes not being covered in practice.

Fig. 1: The visual images and class prototypes provided for several classes in benchmark dataset AWA2 [51].

To solve this problem, Zero-Shot Learning (ZSL) [37, 55, 7, 39, 52, 14, 44, 23] has been proposed. Its goal is to recognize data belonging to classes that have no labeled samples. Since its inception, ZSL has become a fast-developing field of machine learning with a wide range of applications in computer vision. Due to the lack of labeled samples in the unseen class domain, auxiliary information is necessary for ZSL to transfer knowledge from the seen to the unseen classes. As shown in Fig. 1, existing methods usually provide each class with one class prototype derived from text (e.g., an attribute vector [24, 29, 30] or a word vector [1, 15, 46]). This is inspired by the way human beings recognize the world. For example, with the knowledge that “a zebra looks like a horse, and with stripes”, we can recognize a zebra even without having seen one before, as long as we know what a “horse” is and what “stripes” look like.

Despite achieving outstanding performance on general recognition tasks, state-of-the-art classifiers remain highly vulnerable to small, imperceptible adversarial perturbations [13]. This vulnerability is also unavoidable in the ZSL scenario, where an unseen object can be perturbed by an undetectable amount. ZSL models may be even easier to attack than standard supervised models, since [43] proved that adversarially robust generalization requires more data, while ZSL has no training data at all for unseen classes. In particular, while most work focuses on transferability and discriminability, the robustness of ZSL models is usually ignored even though it is extremely important. We show empirically that adversarial perturbations in zero-shot recognition are harder for existing ZSL models to handle: without labeled samples in the unseen class domain, a ZSL model must transfer knowledge not only from the seen to the unseen classes, but also from clean to adversarial examples produced by unknown adversaries. Consequently, deploying such fragile ZSL models in practice is difficult and risky, especially in security-critical areas. For example, if a zero-shot activity recognition model is used in a self-driving car, adversarial examples of unseen activities could allow an attacker to make the car take unwanted actions.

Besides the visual space, adversarial examples in the semantic space are also unavoidable. The class prototypes of unseen classes, whether human-defined attributes or automatically extracted word vectors, are obtained independently of the visual samples and can therefore be poisoned undetectably. Yet these prototypes play a key role in transferring knowledge from the seen to the unseen classes, so even a small perturbation degrades recognition performance. For example, if the description is changed to “a zebra looks like a horse, and without stripes”, we can no longer recognize a zebra.

Motivated by the two key observations above in the visual and semantic spaces, in this work we focus on developing a robust ZSL model. Different from previous ZSL methods, we propose to train a novel ZSL model adversarially. By injecting adversarial examples into the model, it achieves promising generalization on unseen object recognition under various image or attribute attacks, while the recognition performance on clean data degrades only marginally thanks to the designed defensive relation prediction network. Moreover, the added overhead of generating and training on adversarial examples is negligible. We emphasize our contributions in four aspects:

  • To the best of our knowledge, our work describes the first algorithmic framework to jointly optimize the two key goals: ZSL robustness against adversarial attacks, and transferability to unseen classes.

  • We develop ATZSL, an Adversarially Trained Zero-Shot Learning model, to alleviate the instability of ZSL caused by small, carefully crafted perturbations of both images and attributes.

  • By casting ZSL into a constrained min-max optimization problem, our ATZSL transfers knowledge from clean to adversarial examples, achieving a competitive trade-off between performance in the clean and adversarial domains.

  • The seen and unseen class domains are bridged by a defensive relation prediction network in ATZSL. The improvements over alternative ZSL models are consistently significant under various settings.

Notation Description
$\mathcal{S}$, $\mathcal{U}$   Set of seen classes and set of unseen classes, respectively
$\mathcal{X}$, $\mathcal{A}$   Visual space and semantic space, respectively
$N$   Number of training samples
$C_s$, $C_u$   Number of seen classes and number of unseen classes, respectively
$(x_i, y_i)$   The $i$th labeled training sample: image and label
$x_j$   The $j$th unlabeled testing sample: image
$s_c$, $a^s_c$   The $c$th seen class and its class prototype
$u_c$, $a^u_c$   The $c$th unseen class and its class prototype
$\hat{x}_i$, $\hat{x}_j$   The $i$th training sample and $j$th testing sample, each with an adversarial perturbation
$\hat{a}^s_c$, $\hat{a}^u_c$   The $c$th seen class prototype and unseen class prototype, each with an adversarial perturbation
$\epsilon$   Attack magnitude
TABLE I: Key Notations

2 Related Work

In this section, we first briefly introduce related work on zero-shot learning and then review various adversarial attacks.

Zero-shot Learning. Depending on whether information about the testing data is used during model learning, existing ZSL models fall into inductive [2, 7, 27, 42] and transductive [16, 17, 26, 40] settings. Transduction in ZSL can be embodied in two progressive degrees: transductive for specific unseen classes [33] and transductive for specific testing samples [56]. Although transductive settings can rectify the domain shift caused by the different distributions of training and testing samples, they are problematic in many practical scenarios. Thus, we adopt an inductive setting in this work.

From the view of how the visual-semantic interaction is constructed, existing inductive ZSL methods fall into four categories. The first group learns a projection function from the visual to the semantic space with a linear [3, 30, 32] or non-linear [10, 39, 46, 53] model; testing unseen data are then classified by matching their visual representations against the unseen class prototypes in the class semantic embedding space. The second group chooses the reverse projection direction [2, 50, 54] to alleviate the hubness problem caused by nearest-neighbour search in a high-dimensional space [41]; testing unseen data are then classified by finding the most similar pseudo visual examples in the visual space. The third group combines the first two by adopting an encoder-decoder paradigm with a visual feature or class prototype reconstruction constraint [2, 27]; the projection function learned in this way has been verified to generalize better to the unseen classes. The last group learns an intermediate space to which both the visual and the semantic spaces are projected [8, 9, 21, 33].

However, all of them ignore the fact that each class in ZSL is provided with only one attribute vector, which is not enough to represent all the samples of that class. Consequently, the learned projection may not be effective enough to recognize samples from the same class. Besides, the provided semantic prototypes often include non-visual attributes, such as “smart”, “agility”, and “inactive” in the benchmark dataset AWA2 [51]. Based only on visual information, such attributes are almost impossible to predict, sometimes no better than random guessing, as observed in Fig. 2. Thus, the learned projection cannot generalize well to the unseen class domain, although it works normally in the seen class domain thanks to supervised training. Considering these facts, we choose to learn an intermediate space for zero-shot recognition. In particular, different from existing models, our ATZSL connects the visual and semantic embeddings in series instead of in parallel, to avoid amplifying small perturbations.

Fig. 2: The predictability of each binary attribute, measured with classification accuracy by fine-tuning the last layer of Resnet101 [19].
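The measurement behind Fig. 2 can be sketched as follows: freeze a pre-trained Resnet101 and fine-tune only a new final linear layer as a multi-label attribute classifier, then report per-attribute binary accuracy. This is a hedged reading of the protocol; the helper names, threshold, and training details below are ours, not the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def attribute_probe(num_attrs):
    """Resnet101 with all layers frozen except a new per-attribute linear head."""
    net = models.resnet101(pretrained=True)
    for p in net.parameters():
        p.requires_grad = False
    net.fc = nn.Linear(net.fc.in_features, num_attrs)  # only this layer is trained
    return net

@torch.no_grad()
def per_attribute_accuracy(net, loader):
    """Binary accuracy of each attribute over (image, attribute-target) batches."""
    correct, total = None, 0
    for x, targets in loader:                           # targets: (B, num_attrs) in {0, 1}
        preds = (torch.sigmoid(net(x)) > 0.5).float()
        hits = (preds == targets).float().sum(dim=0)
        correct = hits if correct is None else correct + hits
        total += targets.size(0)
    return correct / total                              # (num_attrs,) accuracy per attribute
```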

Adversarial Attacks. The robustness of a neural network can be evaluated by crafting adversarial examples at test time with a specific attack algorithm [6, 11, 18, 28, 36, 38, 45]. It is worth noting that adversarial attacks do not tamper with the targeted model but rather force it to produce incorrect outputs, and their effectiveness is determined mainly by the amount of information available to the adversary about the model. There is a large body of work on test-time attacks, which can be broadly classified into white-box and black-box attacks. A white-box adversary has total knowledge of the model used for inference; access to the internal model weights, as in FGSM [18] and DeepFool [38], usually yields a very strong attack since it works directly on the gradient of the network loss function. A black-box attack (e.g., ZOO [11] or NES-PGD [22]), on the contrary, assumes no knowledge about the targeted model and usually probes its vulnerability by forming an implicit approximation to the actual gradient based on a greedy local search.
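For concreteness, a one-step FGSM perturbation can be written in a few lines of PyTorch; the sketch below assumes a generic classifier and loss, and the budget value is only a placeholder.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps=8 / 255):
    """One-step FGSM: move each input along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    # Step by eps in the direction that increases the loss, then clip to valid pixels.
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```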

3 Adversarially Trained Zero-Shot Learning

In this section, we first set up the ZSL problem in the presence of adversaries (Section 3.1), then develop an ATZSL model to improve robustness of ZSL in the visual space (Section 3.2), and finally derive an algorithm to solve ATZSL (Section 3.3).

3.1 Problem Definition

Given labeled training samples belonging to the set of seen classes $\mathcal{S}$, ATZSL aims to learn a robust zero-shot classifier that is effective in the following two scenarios: i) ATZSL can classify not only a clean testing sample but also its adversarial counterpart belonging to the set of unseen classes $\mathcal{U}$; ii) ATZSL can classify a clean testing sample even when the unseen class prototypes are attacked. The key notations used throughout this paper are summarized in Table I.

3.2 ATZSL: Formulation

Fig. 3: The architecture of the proposed ATZSL, which performs robust zero-shot recognition with a defensive relation prediction network. $F$, $E$, and $R$ represent the feature extraction module, the attribute embedding module, and the relation prediction module, respectively.

The key to ZSL is to transfer knowledge from the seen to the unseen classes. Motivated by the fact that class prototypes (e.g., attribute vectors) can bridge the two domains, we transfer the strategy for predicting image-attribute relations. To this end, we design a relation prediction network that consists of three modules: a feature extraction module $F$, an attribute embedding module $E$, and a relation prediction module $R$, as illustrated in Fig. 3. The image $x$ and the attribute vector $a$ are fed into $F$ and $E$, respectively, which produce two embedding vectors $F(x)$ and $E(a)$. The concatenation of $F(x)$ and $E(a)$ is then fed into the relation prediction module $R$, which produces a relation score measuring the similarity between $x$ and $a$. Thus, the relation score between a sample $x_i$ and the $c$th seen class prototype $a^s_c$ is

$$r_c(x_i) = R\big([F(x_i);\, E(a^s_c)]\big), \qquad c = 1, \dots, C_s. \tag{1}$$

It is worth noting that we choose the concatenation operator, rather than the product operator (i.e., $F(x_i) \odot E(a^s_c)$) popularly used in previous works, to avoid amplifying small perturbations of the input. In effect, the attribute embedding $E(a^s_c)$ serves as an additional feature map that enhances the discriminability of the input image beyond the original feature map $F(x_i)$. More importantly, the concatenation operator can mitigate the domain shift caused by appearance variations of each attribute across classes.
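A minimal sketch of such a relation prediction network in PyTorch, assuming a ResNet backbone for $F$ and small MLPs for $E$ and $R$; the layer sizes here are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RelationNet(nn.Module):
    def __init__(self, attr_dim, embed_dim=512, hidden=400):
        super().__init__()
        backbone = models.resnet34(pretrained=True)
        backbone.fc = nn.Identity()              # F: image -> embed_dim features
        self.F = backbone
        self.E = nn.Sequential(                  # E: attribute vector -> embed_dim
            nn.Linear(attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim))
        self.R = nn.Sequential(                  # R: concatenated pair -> relation score
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, attrs):
        """x: (B, 3, H, W) images; attrs: (C, attr_dim) class prototypes.
        Returns a (B, C) matrix of relation scores r_c(x)."""
        fx = self.F(x)                                        # (B, D)
        ea = self.E(attrs)                                    # (C, D)
        B, C = fx.size(0), ea.size(0)
        pairs = torch.cat([fx.unsqueeze(1).expand(B, C, -1),
                           ea.unsqueeze(0).expand(B, C, -1)], dim=-1)
        return self.R(pairs).squeeze(-1)                      # (B, C)
```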

Relation Based Cross Entropy Loss. In our ATZSL framework, the probability of a sample $x_i$ belonging to the $c$th seen class (i.e., $y_i = s_c$) can be measured by the relation between $x_i$ and $a^s_c$, i.e.,

$$p(y_i = s_c \mid x_i) = \frac{\exp\big(r_c(x_i)/\tau\big)}{\sum_{c'=1}^{C_s}\exp\big(r_{c'}(x_i)/\tau\big)}, \tag{2}$$

where $\tau$ is a temperature used to mitigate overfitting [20], with $\tau = 1$ being a common option. In particular, $\tau > 1$ “softens” the softmax (raises the output entropy). As $\tau \to \infty$, the distribution approaches uniform, which corresponds to maximum uncertainty; as $\tau \to 0^{+}$, the probability collapses to a point mass (i.e., a one-hot distribution). Since $\tau$ does not change the maximum of the softmax function, the class prediction remains unchanged if $\tau$ is applied after convergence. Plugging the probability in Eq. (2) into the cross-entropy loss over the training samples from the seen classes yields

$$L(x_i, y_i; \theta) = -\sum_{c=1}^{C_s} \mathbb{1}[y_i = s_c]\,\log p(y_i = s_c \mid x_i), \tag{3}$$

where $\theta$ collects the parameters of $F$, $E$, and $R$, and $\mathbb{1}[y_i = s_c]$ equals 1 if $y_i = s_c$ and 0 otherwise.
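In code, the loss of Eqs. (2)-(3) is simply a temperature-scaled softmax cross-entropy over the relation scores; a minimal sketch (function names follow the reconstruction above, not the authors' release):

```python
import torch.nn.functional as F

def relation_cross_entropy(scores, labels, tau=1.0):
    """scores: (B, C_s) relation scores r_c(x); labels: (B,) seen-class indices.
    Dividing by the temperature tau before the softmax realizes Eq. (2)."""
    return F.cross_entropy(scores / tau, labels)
```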

ATZSL Formulation. To improve the robustness of our ZSL model (the first goal in Section 3.1), we further inject adversarial examples into the training set during model training [18, 28]. Specifically, when a clean image $x_i$ comes to the ZSL model, the attacker is allowed to perturb it into $\hat{x}_i$ within a bounded magnitude. Based on the loss function in Eq. (3), the overall optimization problem of the proposed ATZSL is

$$\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N}\Big[(1-\lambda)\,L(x_i, y_i; \theta) + \lambda \max_{\|\hat{x}_i - x_i\|_\infty \le \epsilon} L(\hat{x}_i, y_i; \theta)\Big], \tag{4}$$

where $\lambda \in [0, 1]$ and $\epsilon$ denotes the predefined bound on the attack magnitude. The inner maximization in Eq. (4) produces adversarial samples that maximally deteriorate the ZSL model's performance, while $\lambda$ balances the effects of clean and adversarial samples on model training.

During the test phase, the predicted class of a sample $x$ is given by $\arg\max_{c \in \mathcal{U}} r_c(x)$ in the standard ZSL case, and by $\arg\max_{c \in \mathcal{S} \cup \mathcal{U}} r_c(x)$ in the generalized ZSL case. The latter is more practical for real applications since prediction is made over both the seen and the unseen classes [57].
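The two test-time prediction rules translate into a small utility; this sketch assumes the hypothetical RelationNet interface above, with class indices ordered as in the concatenated prototype list.

```python
import torch

def predict(model, x, unseen_attrs, seen_attrs=None):
    """Standard ZSL: score only the unseen prototypes.
    Generalized ZSL: pass seen_attrs as well and take the argmax over the union."""
    attrs = unseen_attrs if seen_attrs is None else torch.cat([seen_attrs, unseen_attrs], dim=0)
    scores = model(x, attrs)          # (B, C) relation scores
    return scores.argmax(dim=1)       # predicted class indices
```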

3.3 ATZSL: Algorithm

It is not trivial to solve the optimization problem in Eq. (4), since the last term of the objective is itself a maximization problem. Treating that term as a new function $\phi(\theta) = \max_{\|\hat{x}_i - x_i\|_\infty \le \epsilon} L(\hat{x}_i, y_i; \theta)$, we can solve Eq. (4) as a standard minimization problem using Adam [25]. The main bottleneck is to determine the gradient of $\phi(\theta)$. We refer to the following theorem [12]:

Theorem 1.

(Danskin's theorem) Let $f(\theta, x)$, with $x \in \mathcal{C}$ ($\mathcal{C}$ a compact set), be a differentiable function of the two vector variables $\theta$ and $x$. Assume that for every $x \in \mathcal{C}$, $f(\theta, x)$ is convex in $\theta$, and that for any $\theta$, the maximizer $x^{*}(\theta) = \arg\max_{x \in \mathcal{C}} f(\theta, x)$ is unique. Then $\phi(\theta) = \max_{x \in \mathcal{C}} f(\theta, x)$ is differentiable, with $\nabla_\theta \phi(\theta) = \nabla_\theta f(\theta, x)\big|_{x = x^{*}(\theta)}$.

Based on this theorem, instead of computing the gradient of $\phi(\theta)$ directly, we can first find the worst-case adversarial image $\hat{x}_i^{*} = \arg\max_{\|\hat{x}_i - x_i\|_\infty \le \epsilon} L(\hat{x}_i, y_i; \theta)$ and then take $\nabla_\theta \phi(\theta) = \nabla_\theta L(\hat{x}_i^{*}, y_i; \theta)$ (and similarly for the parameters of each module). Thus, the last roadblock is how to find $\hat{x}_i^{*}$. Inspired by the way adversarial examples are generated in the IFGSM attack [28], we obtain $\hat{x}_i^{*}$ with a sign (stochastic) gradient ascent procedure, whose update at the $k$th step is

$$\hat{x}_i^{(k+1)} = \Pi_{\|\hat{x} - x_i\|_\infty \le \epsilon}\Big(\hat{x}_i^{(k)} + \alpha_k\,\mathrm{sign}\big(\nabla_{x} L(\hat{x}_i^{(k)}, y_i; \theta)\big)\Big), \qquad \hat{x}_i^{(0)} = x_i,$$

where $\alpha_k$ is the stepsize and $\Pi$ denotes projection onto the $\ell_\infty$ ball of radius $\epsilon$ around $x_i$. A recent study [4] has validated the convergence of the sign (stochastic) gradient descent algorithm. Algorithm 1 shows the detailed steps of our ATZSL algorithm against image attacks.
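An illustrative PyTorch sketch of this inner maximization (sign-gradient IFGSM with projection onto the $\ell_\infty$ ball); the step count and step size are placeholders rather than the paper's settings.

```python
import torch

def ifgsm_inner_max(model, loss_fn, x, y, attrs, eps=0.03, steps=5):
    """Approximate the maximizer of the loss over the l_inf ball of radius eps around x."""
    x_adv = x.clone().detach()
    alpha = eps / steps                       # per-step size so the total budget is eps
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv, attrs), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)              # keep pixels in a valid range
    return x_adv.detach()
```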

3.4 Extension

Besides improving the robustness of ZSL in the visual space, ATZSL can easily be extended to deal with attacks in the semantic space, corresponding to the second goal in Section 3.1. Specifically, an unseen class prototype provided in ZSL can be regarded as directly analogous to a testing image, and thus can be attacked just as unavoidably. To deal with such adversarial attribute attacks effectively, we modify our framework in Eq. (4) as follows:

$$\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N}\Big[(1-\lambda)\,L(x_i, y_i; \theta, A^s) + \lambda \max_{\|\hat{a}^s_c - a^s_c\|_1 \le \epsilon,\ \forall c} L(x_i, y_i; \theta, \hat{A}^s)\Big], \tag{5}$$

where $\lambda \in [0, 1]$, $A^s = \{a^s_c\}_{c=1}^{C_s}$ denotes the seen class prototypes, and $\hat{A}^s$ their adversarially perturbed versions. The inner maximization in Eq. (5) produces adversarial class prototypes that maximally deteriorate the ZSL model's performance. It is worth noting that, unlike the $\ell_\infty$ bound used for attacks in the visual space, an $\ell_1$ bound is more reasonable for the attack constraint here: $\ell_\infty$ bounds the maximum change of a single pixel for image attacks, whereas $\ell_1$ bounds the total change over all attributes for attribute attacks. The detailed algorithm for Eq. (5) is provided in Algorithm 2.

Input: Dataset $\{(x_i, y_i)\}_{i=1}^{N}$, seen class prototypes $A^s$, stepsize sequence $\{\alpha_k\}$, numbers of update steps $T$ and $K$, temperature $\tau$, and weight $\lambda$.
Initialize the network parameters $\theta$;
for $t = 1$ to $T$ do
       Sample a minibatch $(x, y)$ from the training set;
       Draw the attack magnitude $\epsilon$ from a truncated normal distribution;
       $\hat{x} \leftarrow x$;
       Compute the clean loss $L(x, y; \theta)$;
       for $k = 1$ to $K$ do
             $g \leftarrow \nabla_{x} L(\hat{x}, y; \theta)$;
             $\hat{x} \leftarrow \Pi_{\|\hat{x} - x\|_\infty \le \epsilon}\big(\hat{x} + \alpha_k\,\mathrm{sign}(g)\big)$;
       end for
      $\ell \leftarrow (1-\lambda)\,L(x, y; \theta) + \lambda\,L(\hat{x}, y; \theta)$;
       Update $\theta$ using Adam with $\nabla_\theta \ell$;
end for
Output: Model parameters $\theta$
Algorithm 1 ATZSL for image attacks
Input: Dataset $\{(x_i, y_i)\}_{i=1}^{N}$, seen class prototypes $A^s$, stepsize sequence $\{\alpha_k\}$, numbers of update steps $T$ and $K$, temperature $\tau$, and weight $\lambda$.
Initialize the network parameters $\theta$;
for $t = 1$ to $T$ do
       Sample a minibatch $(x, y)$ from the training set;
       Draw the attack magnitude $\epsilon$ from a truncated normal distribution;
       $\hat{A}^s \leftarrow A^s$;
       for $k = 1$ to $K$ do
             $g \leftarrow \nabla_{A^s} L(x, y; \theta, \hat{A}^s)$;
             $\hat{A}^s \leftarrow \Pi_{\|\hat{a}^s_c - a^s_c\|_1 \le \epsilon}\big(\hat{A}^s + \alpha_k\,\mathrm{sign}(g)\big)$;
       end for
      $\ell \leftarrow (1-\lambda)\,L(x, y; \theta, A^s) + \lambda\,L(x, y; \theta, \hat{A}^s)$;
       Update $\theta$ using Adam with $\nabla_\theta \ell$;
end for
Output: Model parameters $\theta$
Algorithm 2 ATZSL for attribute attacks
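To make Algorithm 1 concrete, one training iteration for image attacks might look as follows in PyTorch. The sketch reuses the hypothetical RelationNet, relation_cross_entropy, and ifgsm_inner_max helpers introduced earlier; the mixing weight, magnitude distribution, and step count are placeholders.

```python
import torch

def atzsl_train_step(model, optimizer, x, y, seen_attrs,
                     lam=0.5, tau=1.0, eps_dist=None, steps=5):
    """One ATZSL iteration: mix the clean loss with the worst-case adversarial loss (Eq. 4)."""
    loss_fn = lambda scores, labels: relation_cross_entropy(scores, labels, tau)
    # Draw this iteration's attack magnitude, e.g. from a truncated normal distribution.
    eps = float(eps_dist.sample()) if eps_dist is not None else 0.03
    # Inner maximization: by Danskin's theorem, the worst-case inputs can then be
    # treated as fixed when backpropagating into the model parameters.
    x_adv = ifgsm_inner_max(model, loss_fn, x, y, seen_attrs, eps=eps, steps=steps)
    clean_loss = loss_fn(model(x, seen_attrs), y)
    adv_loss = loss_fn(model(x_adv, seen_attrs), y)
    loss = (1 - lam) * clean_loss + lam * adv_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```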

4 Evaluation Setup and Metrics

In this section, we first describe the evaluation protocols, e.g., dataset splits and evaluation metrics, and then detail our experimental implementation.

4.1 Datasets and Protocols

Among the most widely used datasets for ZSL, we select one coarse-grained dataset, Animals with Attributes 2 (AWA2, http://cvml.ist.ac.at/AwA2/) [51], and one fine-grained dataset, CUB-200-2011 Birds (CUB, http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) [48]. Specifically, AWA2 has 85 attributes and contains 37,322 images from 50 different animal classes. On average, each class includes 746 images; the least populated class (mole) has 100 examples and the most populated class (horse) has 1,645. CUB has a large number of classes and attributes, containing 11,788 images from 200 different types of birds annotated with 312 attributes.

We adopt the rigorous protocol (http://www.mpi-inf.mpg.de/zsl-benchmark) proposed in [51], ensuring that none of the unseen classes appear in ImageNet 1K, since ImageNet 1K is used to pre-train the Resnet model; otherwise the zero-shot rule would be violated. Specifically, the AWA2 and CUB datasets consist of 27 and 100 training (seen) classes respectively, and 10 and 50 test (unseen) classes. Hyperparameter search is performed on disjoint validation sets of 13 and 50 classes respectively. In particular, this protocol involves two settings: Standard ZSL and Generalized ZSL. The latter has emerged recently; its test set contains samples from both seen and unseen classes, which makes it clearly more reflective of real-world application scenarios. By contrast, the test set in standard ZSL contains samples only from the unseen classes.

4.2 Evaluation Metrics

At the test phase of ZSL, we use the unified evaluation protocol proposed in [51]. Specifically, under the standard ZSL setting, we adopt the average per-class top-1 accuracy,

$$\mathrm{T1} = \frac{1}{C_u}\sum_{c=1}^{C_u}\frac{\#\ \text{correct predictions in class}\ c}{\#\ \text{samples in class}\ c}.$$

Under the generalized ZSL setting, we compute the harmonic mean (HM) of $\mathrm{Acc}_s$ and $\mathrm{Acc}_u$ to favor high accuracies on both the seen and the unseen classes:

$$\mathrm{HM} = \frac{2 \times \mathrm{Acc}_s \times \mathrm{Acc}_u}{\mathrm{Acc}_s + \mathrm{Acc}_u},$$

where $\mathrm{Acc}_s$ and $\mathrm{Acc}_u$ are the average per-class top-1 accuracies of recognizing the testing samples from the seen and the unseen classes, respectively.

Finally, to evaluate the overall performance (i.e., the trade-off between clean and adversarial data) under the standard ZSL setting, we further compute the harmonic mean $H_{T1}$ of $\mathrm{T1}_{clean}$ and $\mathrm{T1}_{adv}$, which denote the T1 metric in the clean and the adversarial scenarios, respectively. Similarly, letting $\mathrm{HM}_{clean}$ and $\mathrm{HM}_{adv}$ denote the HM metric in the clean and adversarial scenarios, the harmonic mean $H_{HM}$ of $\mathrm{HM}_{clean}$ and $\mathrm{HM}_{adv}$ is used to evaluate the overall performance under the generalized ZSL setting.
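A small NumPy sketch of these metrics, assuming predictions and labels are integer class indices (the function names are ours):

```python
import numpy as np

def per_class_top1(preds, labels):
    """Average per-class top-1 accuracy (T1)."""
    classes = np.unique(labels)
    return float(np.mean([np.mean(preds[labels == c] == c) for c in classes]))

def harmonic_mean(a, b):
    """Harmonic mean, used both for HM (seen vs. unseen) and for clean vs. adversarial."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0
```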

4.3 Implementation Details

By tuning on the validation set, we choose the hyperparameters (including the temperature $\tau$, the weight $\lambda$, and the attack step settings) and train ATZSL for 400 epochs on the two datasets with weight decay. The learning rate for Adam is annealed by 10% every 10 epochs. Specifically, i) For the feature extraction module, before being fed into the pre-trained Resnet34 network (https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py), a color image is first normalized per channel with the corresponding means and standard deviations. ii) For the attribute embedding module, the continuous attributes, with values between 0 and 100, are fed into the MLP network (continuous attributes are provided by the datasets and have been shown to be stronger than binary attributes). The size of the hidden layer (Fig. 3) is set to 300 and 400 for AWA2 and CUB respectively, and the output size is set to the same dimension (512) as the image embedding for both datasets. Additionally, we add weight decay ($\ell_2$ regularisation) and a ReLU non-linearity to the two fully-connected layers. iii) For the relation prediction module, the image and prototype embeddings are concatenated before being fed into an MLP with hidden layer size 400 for both datasets; the first fully-connected layer also uses ReLU. We add weight decay ($\ell_2$ regularisation) in the attribute embedding module because there is a hubness problem in cross-modal mapping for ZSL, which is best mitigated by mapping the semantic feature vector to the visual feature space with regularisation.

Compared Methods. We compare with several competitive and representative ZSL methods that have recently achieved state-of-the-art results, including DEM (https://github.com/lzrobots/DeepEmbeddingModel_ZSL) [54], FGN (http://datasets.d2.mpi-inf.mpg.de/xian/cvpr18xian.zip) [52], MC-ZSL (https://github.com/rfelixmg/frwgan-eccv18) [14], and LDF (https://github.com/shaoniangu/Zero_shot_learning_using_LDF_tensorflow) [32]. Specifically, DEM [54] and LDF [32] are the best performers in the second and last groups of ZSL methods described in Section 2, respectively, while FGN [52] and MC-ZSL [14] are Generative Adversarial Network (GAN) based ZSL models. In addition, to validate the effectiveness of the proposed adversarial training, we evaluate our ATZSL without adversarial samples, i.e., with $\lambda = 0$ in Eq. (4), dubbed Baseline.

During the test phase of each ZSL method, besides using clean data (i.e., “No attack”), we evaluate model robustness using five widely used white-box attacks (FGSM [18], IFGSM [28], DeepFool [38], CW [6], and WRM [45]) and one black-box attack (ZOO [11]). These attack methods have become standard in adversarial learning. IFGSM [28] is a straightforward extension of FGSM [18] that applies it multiple times with a small step size. Compared with FGSM [18], DeepFool [38] generates adversarial perturbations with a smaller norm. CW [6] introduces three attacks, for the $\ell_0$, $\ell_2$, and $\ell_\infty$ distance metrics, and in our experiments we adopt its untargeted variant. WRM [45] additionally considers a Lagrangian penalty formulation of perturbations of the underlying data distribution in a Wasserstein ball to achieve worst-case perturbations of the training data. ZOO [11] is a zeroth-order optimization based attack that directly estimates the gradients of the targeted model to generate adversarial examples; we also adopt its widely used untargeted variant against our model.

(a) Image attack on AWA2
(b) Attribute attack on AWA2
(c) Image attack on CUB
(d) Attribute attack on CUB
Fig. 4: Comparison (T1:%) of several ZSL methods on two datasets under the standard ZSL setting in various attack scenarios. On the horizontal axis, 'N', 'F', 'I', 'D', 'C', 'W', and 'Z' denote the No attack, FGSM, IFGSM, DeepFool, CW, WRM, and ZOO attack methods, respectively; the numbers next to 'F', 'I', and 'W' indicate the attack magnitudes.
Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 35.2 39.8 38.6 39.7 42.3 50.7(8.4)
30.5 32.8 29.5 28.9 34.5 47.2(12.7)
25.7 21.8 24.4 20.4 23.2 41.3(15.6)
IFGSM [28] 32.3 38.1 35.7 35.0 35.6 51.5(13.4)
23.6 30.4 30.9 28.7 27.7 48.3(17.4)
17.1 19.1 20.9 17.6 19.2 40.6(19.7)
DeepFool [38] - 30.8 34.3 27.9 30.2 38.4 51.2(12.8)
CW [6] - 34.1 32.0 29.3 28.9 34.0 47.5(13.4)
WRM [45] 33.1 30.3 35.2 39.9 38.2 49.6(9.7)
24.7 23.5 27.2 31.1 34.6 42.1(7.5)
17.1 19.1 20.2 27.4 25.4 39.9(12.5)
ZOO [11] - 30.9 30.3 31.1 26.0 33.7 45.6(11.9)
TABLE II: Comparative Results ($H_{T1}$:%) of Several ZSL Methods on AWA2 under the Standard ZSL Setting with Various Attacks in the Visual Space. '-' Represents that the Attack Magnitude Is Not Required for That Attack Method. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{T1}$

Attack Settings. i) During the training of ATZSL, we employ the IFGSM attacker [28] to find adversarial examples. For attacks in the visual space, the attack magnitude in each epoch is drawn from a truncated normal distribution over a fixed interval shared by both datasets; due to the different attribute dimensions, for attacks in the semantic space the underlying normal distributions differ between AWA2 and CUB. ii) During the test phase, we adopt five widely used white-box attacks (FGSM [18], IFGSM [28], DeepFool [38], CW [6], and WRM [45]) and one black-box attack (ZOO [11]). For the FGSM [18], IFGSM [28], and WRM [45] attacks, we fix the image perturbation magnitude to the same value for both datasets in the visual space; this magnitude has the same meaning for any dataset, since each pixel lies in a fixed range, and it is kept small enough that the perturbations remain imperceptible. In the semantic space, we fix the attribute perturbation magnitude separately for AWA2 and CUB, with a larger magnitude on CUB than on AWA2, since CUB has 312 attributes while AWA2 has only 85. Additionally, we fix the number of attack iterations for the IFGSM [28], DeepFool [38], CW [6], and WRM [45] attacks.
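For reference, drawing a per-epoch attack magnitude from a truncated normal distribution can be done with SciPy as below; the interval and distribution parameters are placeholders, since the exact values are not reproduced above.

```python
from scipy.stats import truncnorm

def sample_attack_magnitude(low, high, mean, std):
    """Draw one attack magnitude from Normal(mean, std) truncated to [low, high]."""
    a, b = (low - mean) / std, (high - mean) / std   # standardized truncation bounds
    return float(truncnorm.rvs(a, b, loc=mean, scale=std))

# Example with placeholder values:
eps = sample_attack_magnitude(low=0.0, high=0.1, mean=0.03, std=0.02)
```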

Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 30.0 32.6 29.3 29.4 33.9 46.8(12.9)
24.7 25.6 21.9 22.2 26.2 41.4(15.2)
22.3 22.8 17.2 15.7 21.7 38.6(15.8)
IFGSM [28] 27.3 29.7 27.5 28.6 30.2 42.0(11.8)
19.0 21.3 20.2 23.1 20.1 35.7(12.6)
17.1 17.5 16.5 20.2 17.4 31.8(11.6)
WRM [45] 28.9 30.7 33.8 30.6 31.3 49.6(15.8)
20.9 25.7 24.8 23.6 24.8 41.6(15.9)
17.9 21.9 18.4 17.1 21.7 38.6(16.7)
TABLE III: Comparative Results ($H_{T1}$:%) of Several ZSL Methods on AWA2 under the Standard ZSL Setting with Various Attacks in the Semantic Space. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{T1}$
Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 26.3 28.3 27.8 25.0 29.4 50.9(21.5)
23.1 21.5 21.8 18.8 24.9 48.7(23.8)
17.7 16.2 17.4 15.4 19.9 46.1(26.2)
IFGSM [28] 21.9 24.8 19.9 23.7 23.5 49.2(24.4)
15.8 20.3 19.4 17.1 17.5 43.5(23.2)
12.9 16.7 15.5 14.4 14.7 37.7(21.0)
DeepFool [38] - 19.2 14.5 19.1 12.8 21.8 45.2(23.4)
CW [6] - 19.6 16.2 17.2 16.8 21.3 40.2 (18.9)
WRM [45] 25.9 25.9 21.3 27.8 29.3 50.6(21.3)
19.3 18.2 21.0 17.7 21.9 43.4 (21.5)
15.8 16.6 16.2 15.3 20.7 40.3(19.6)
ZOO [11] - 17.0 16.2 22.5 20.9 17.2 41.8(19.3)
TABLE IV: Comparative Results ($H_{T1}$:%) of Several ZSL Methods on CUB under the Standard ZSL Setting with Various Attacks in the Visual Space. '-' Represents that the Attack Magnitude Is Not Required for That Attack Method. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{T1}$

5 Experimental Results and Analysis

We first present standard ZSL results on AWA2 and CUB datasets with various attacks in visual and semantic spaces. Then, we provide results for the generalized ZSL setting. Additionally, several ablation experiments are further conducted to demonstrate the effectiveness of our ATZSL. Finally, we also provide complexity analysis of our method to show its efficiency.

5.1 Standard ZSL

To evaluate the effectiveness of our proposed ATZSL under the standard ZSL setting, we first perform the recognition task on the AWA2 and CUB datasets without any attacks, and then with six typical attack methods, including FGSM [18], IFGSM [28], DeepFool [38], CW [6], WRM [45], and ZOO [11].

Standard ZSL Results on AWA2. We first evaluate the robustness and transferability of ATZSL trained with Eq. (4), where adversarial images are generated to train our ZSL model. Fig. 4(a) compares the average per-class top-1 accuracy (T1) on AWA2 with the other five ZSL methods, i.e., DEM [54], FGN [52], MC-ZSL [14], LDF [32], and our Baseline. The performance of standard ZSL without any attacks is denoted by 'N' on the horizontal axis, and the remaining points denote the performance under six types of attacks in the visual space, where we also use three attack magnitudes for the FGSM [18], IFGSM [28], and WRM [45] attacks. To illustrate the trade-off between clean and adversarial samples more directly, we further report the harmonic mean $H_{T1}$ of the clean and adversarial T1 in Table II. Our ATZSL obtains a remarkably more favorable trade-off between zero-shot recognition performance on clean and on adversarial images of AWA2 than currently available alternatives; the improvements over the strongest competitor range from 7.5% to 19.7%. Additionally, Fig. 4(a) shows that image attacks deteriorate the performance of every compared ZSL method relative to the no-attack scenario, yet our ATZSL suffers the smallest degradation across the various image attack scenarios. ATZSL therefore points to a feasible direction for improving the robustness of ZSL.

Then, we evaluate the robustness and transferability of ATZSL trained with Eq. (5), where adversarial attributes of the seen classes are generated to train our ZSL model. Fig. 4(b) compares the average per-class top-1 accuracy (T1) on AWA2 with the five ZSL methods. 'N' on the horizontal axis again denotes the performance of each method without any attacks, and the remaining points represent the performance under three types of attacks, i.e., FGSM [18], IFGSM [28], and WRM [45], each with three magnitudes. Correspondingly, Table III presents the harmonic mean $H_{T1}$, i.e., the trade-off of each method between clean and adversarial attributes. As expected, our ATZSL also obtains a more promising trade-off of zero-shot recognition performance on clean and adversarial attributes of AWA2 than all competitors, with improvements generally above 11.6%. Meanwhile, as shown in Fig. 4(b), ATZSL consistently shows the slightest performance degradation in the various attribute attack scenarios compared with the no-attack case.

Standard ZSL Results on CUB. Likewise, under the standard ZSL setting, we further evaluate ATZSL trained with Eq. (4) and Eq. (5) on the CUB dataset. First, we train ATZSL with Eq. (4) and test on both clean and adversarial images of CUB. The results in terms of average per-class top-1 accuracy (T1), compared with five representative ZSL methods, are plotted in Fig. 4(c), where 'N' on the horizontal axis denotes the performance without any attacks and the remaining points represent the performance under six types of attacks with three magnitudes. Table IV additionally presents the harmonic mean $H_{T1}$ of the clean and adversarial T1, showing the trade-off of each method between clean and adversarial images; the improvements obtained by our ATZSL over the strongest competitor are on average around 21.5%. Then, we train ATZSL with Eq. (5) and test on both clean and adversarial attributes of CUB. Fig. 4(d) shows the average per-class top-1 accuracy (T1) of each method under various attack settings, including the no-attack case and three kinds of attacks with three magnitudes. The trade-off of each method between clean and adversarial attributes is compared quantitatively in Table V via the harmonic mean $H_{T1}$. In this scenario, our method still improves the trade-off over the strongest competitor by an obvious margin (14.6%–26.8%). These results on CUB lead to conclusions similar to those on AWA2. Notably, the improvements obtained by ATZSL on CUB are larger than those on AWA2 for both image and attribute attacks, possibly because CUB samples are fine-grained, so we can generate more reliable adversarial samples and thus train a more robust ZSL model.

In summary, the four tables demonstrate that our ATZSL achieves a clearly more competitive trade-off of standard zero-shot recognition performance between clean and adversarial data. More importantly, these improvements do not sacrifice ATZSL's T1 performance in the no-attack scenario, as shown in the four figure panels.

Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 19.9 22.9 19.8 22.1 26.3 42.2(15.9)
13.7 6.4 7.8 9.0 11.1 31.2 (17.5)
0.6 2.5 2.7 5.3 5.5 24.5(19.0)
IFGSM [28] 16.9 18.9 16.1 16.9 21.7 40.2 (18.5)
4.4 9.9 5.2 2.5 7.3 25.9 (16.0)
1.2 2.0 0.8 2.2 3.7 18.3(14.6)
WRM [45] 18.9 14.5 15.0 17.5 17.7 40.5(21.6)
5.3 3.5 10.7 11.2 7.3 38.0(26.8)
2.2 0.6 1.4 2.2 3.9 27.8(23.9)
TABLE V: Comparative Results ($H_{T1}$:%) of Several ZSL Methods on CUB under the Standard ZSL Setting with Various Attacks in the Semantic Space. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{T1}$
(a) Image attack on AWA2
(b) Attribute attack on AWA2
(c) Image attack on CUB
(d) Attribute attack on CUB
Fig. 5: Comparison (HM:%) of several ZSL methods on two datasets under the generalized ZSL setting in various attack scenarios. On the horizontal axis, 'N', 'F', 'I', 'D', 'C', 'W', and 'Z' denote the No attack, FGSM, IFGSM, DeepFool, CW, WRM, and ZOO attack methods, respectively; the numbers next to 'F', 'I', and 'W' indicate the attack magnitudes.
Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 4.4 5.9 6.4 6.4 6.3 27.9(21.5)
3.5 2.7 4.2 5.7 4.1 23.5(17.8)
0.8 0.8 2.0 0.6 0.6 15.0(13.0)
IFGSM [28] 2.5 5.3 2.4 4.0 2.5 25.8(20.5)
0.4 2.5 0.8 2.7 4.3 19.7(15.4)
0.4 0.2 0.6 0.2 2.4 13.9(11.5)
DeepFool [38] - 2.5 4.6 4.2 2.5 6.8 25.1(18.3)
CW [6] - 6.7 2.5 6.4 5.3 7.7 23.7 (16.0)
WRM [45] 5.8 6.1 2.5 3.9 4.4 25.2 (19.1)
2.1 2.5 0.2 2.3 2.0 20.9(18.4)
0.6 0.6 0.2 0.4 0.2 15.1(14.5)
ZOO [11] - 4.9 4.4 0.6 8.3 7.3 25.2(16.9)
TABLE VI: Comparative Results ($H_{HM}$:%) of Several ZSL Methods on AWA2 under the Generalized ZSL Setting with Various Attacks in the Visual Space. '-' Represents that the Attack Magnitude Is Not Required for That Attack Method. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{HM}$

5.2 Generalized ZSL

Compared with standard ZSL, it is much more technically difficult to perform generalized ZSL since prediction needs to be made over both seen and unseen classes. Even so, we still perform generalized ZSL tasks using several ZSL methods on both AWA2 and CUB datasets, under various attack settings. In particular, we can directly use the ATZSL model trained for standard ZSL tasks in Section 5.1 for generalized ZSL during the test phase.

Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 6.3 12.2 7.8 9.0 8.0 38.5(26.3)
1.2 4.2 4.4 3.5 4.1 35.1(30.7)
0.4 2.2 0.6 0.4 0.8 29.4(27.2)
IFGSM [28] 5.8 3.5 7.3 9.0 5.5 36.5(27.5)
2.0 0.4 3.5 3.7 2.0 32.7(29.0)
1.0 0.4 0.6 2.3 0.8 23.6(21.3)
WRM [45] 10.4 7.0 6.4 8.0 6.6 36.1(25.7)
4.9 2.0 4.1 2.2 3.5 30.1(25.2)
2.5 0.8 0.6 0.6 1.4 23.8(21.3)
TABLE VII: Comparative Results ($H_{HM}$:%) of Several ZSL Methods on AWA2 under the Generalized ZSL Setting with Various Attacks in the Semantic Space. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{HM}$
Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 3.6 10.9 11.1 8.6 9.3 35.8(24.7)
2.4 3.7 2.3 3.7 4.4 28.2(23.8)
0.6 2.0 2.2 0.8 1.8 21.2(19.0)
IFGSM [28] 3.2 3.8 9.5 9.4 7.9 33.5(24.0)
1.7 0.6 4.2 2.7 4.4 24.0(19.6)
1.0 0.6 3.5 1.8 1.6 15.3(11.8)
DeepFool [38] - 3.6 7.9 5.0 4.2 10.6 29.8 (19.2)
CW [6] - 5.0 4.8 5.7 4.4 6.7 28.2(21.5)
WRM [45] 4.8 5.8 4.6 8.3 5.9 34.2 (25.9)
0.2 0.6 2.0 5.5 0.6 28.7 (23.2)
0.2 0.8 1.8 0.6 0.8 16.5(14.7)
ZOO [11] - 5.9 7.7 6.7 5.5 7.6 30.5 (22.8)
TABLE VIII: Comparative Results ($H_{HM}$:%) of Several ZSL Methods on CUB under the Generalized ZSL Setting with Various Attacks in the Visual Space. '-' Represents that the Attack Magnitude Is Not Required for That Attack Method. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{HM}$
Attack type Magnitude DEM [54] FGN [52] MC-ZSL [14] LDF [32] Baseline ATZSL(Ours)
FGSM [18] 3.6 13.9 16.7 14.1 17.0 37.9(20.9)
0.6 7.1 8.3 4.6 7.8 33.4(25.1)
0.2 4.4 3.5 0.6 5.3 28.8(23.5)
IFGSM [28] 3.6 14.4 12.8 17.1 14.9 34.5(17.4)
0.8 6.0 7.8 5.8 6.6 27.4 (19.6)
0.8 3.5 1.8 2.9 5.7 18.9(13.2)
WRM [45] 2.0 17.1 10.5 12.8 15.5 37.5 (20.4)
1.5 8.8 2.3 3.5 6.2 31.5(22.7)
0.6 1.0 0.6 2.0 4.2 28.2 (24.0)
TABLE IX: Comparative Results ($H_{HM}$:%) of Several ZSL Methods on CUB under the Generalized ZSL Setting with Various Attacks in the Semantic Space. The Number in Parentheses Measures the Improvement Obtained by Our ATZSL over the Strongest Competitor in Terms of $H_{HM}$

Generalized ZSL Results on AWA2. Under the generalized ZSL setting, we first evaluate ATZSL trained with Eq. (4) and Eq. (5) on the AWA2 dataset, corresponding to image and attribute attacks respectively. The two models are trained exactly as in the standard ZSL experiments on AWA2; during the test phase, we search over both the seen and the unseen class spaces, with the attack implementation the same as in Section 5.1. After attacking images in AWA2 with six types of attack methods and three magnitudes, Fig. 5(a) compares the harmonic mean (HM) with the five ZSL methods. Concretely, 'N' on the horizontal axis again denotes the performance without any attacks, and the other points the performance under attack. Unlike the ZSL baselines, which are very sensitive to image attacks, our ATZSL exhibits the slightest degradation across the various image attack scenarios. Meanwhile, Table VI reports the harmonic mean $H_{HM}$ of the clean and adversarial HM under each attack method, i.e., the trade-off of each method between clean and adversarial images. ATZSL again performs best on the generalized ZSL task even under various image attacks, with improvements over the strongest competitor ranging from 11.5% to 21.5%. Moreover, after attacking the attributes in AWA2 with three attack methods and three magnitudes, Fig. 5(b) plots the compared results in terms of the harmonic mean (HM), including the clean and adversarial HM for each attack method, and Table VII reports the corresponding $H_{HM}$. As expected, our ATZSL attains a more promising trade-off between clean and adversarial attributes (generally by more than 21.3%), while losing only negligible performance on clean attributes, as shown at 'N' on the horizontal axis of Fig. 5(b).

Fig. 6: Test unseen images from the CUB dataset attacked by various attack methods with five attack magnitudes.
Fig. 7: t-SNE of all class attributes on the CUB dataset attacked by various attack methods during the test phase with three attack magnitudes. In each panel, the dark color represents the seen class attributes without any attack, and the light color represents the unseen class attributes under no attack and under the three types of attacks, respectively.

Generalized ZSL Results on CUB. Likewise, we further evaluate the generalized ZSL performance of ATZSL trained with Eq. (4) and Eq. (5) on the CUB dataset. Attacking images with various methods, we plot the compared results in terms of the harmonic mean (HM) in Fig. 5(c), including the clean and adversarial HM for each attack method; the corresponding $H_{HM}$ values are reported in Table VIII. Then, attacking attributes with several methods, we plot the compared results in terms of HM in Fig. 5(d), with the corresponding $H_{HM}$ values presented in Table IX. These comparisons lead to conclusions consistent with generalized ZSL on AWA2: robustness and transferability can indeed be jointly optimized better by our ATZSL. Specifically, as reported in Table VIII and Table IX, even though the attacked unseen images and attributes may consequently become similar to seen class images and attributes, ATZSL still improves the overall performance (i.e., the harmonic mean $H_{HM}$) over the strongest alternative by an obvious margin (11.8%–25.9% for image attacks and 13.2%–25.1% for attribute attacks, respectively).

Besides the conclusions shared with standard ZSL, we make one additional observation for generalized ZSL. As shown in Fig. 4 and Fig. 5, the HM scores in generalized ZSL are much lower than the T1 scores in standard ZSL. This is not surprising, since the seen classes are included in the search space and act as distractors for samples from unseen classes.

To illustrate the effect of attacks on images directly, Fig. 6 shows one CUB image attacked by six kinds of methods during the test phase with five attack magnitudes. The attacks that require a magnitude parameter (i.e., FGSM [18], IFGSM [28], and WRM [45]) generally become perceptible once the attack magnitude becomes large. In addition, comparing the original image with its attacked versions shows that the gradient-based attack methods, including FGSM [18], IFGSM [28], and WRM [45], are very strong and thus challenging to defend against. Furthermore, we illustrate the attribute distribution before and after being attacked using t-SNE [35]. As shown in Fig. 7 for the CUB dataset with three attack magnitudes, a small adversarial perturbation is able to change the relationship between the seen and the unseen class attributes; if not defended against, this will deteriorate zero-shot recognition performance by transferring wrong information. More visual results are provided in the supplementary material, and the same conclusion holds for the AWA2 dataset. Therefore, research on the robustness of zero-shot learning is necessary and valuable.

(a) Image attack on AWA2
(b) Attribute attack on AWA2
(c) Image attack on CUB
(d) Attribute attack on CUB
Fig. 8: Comparison (T1:%) of several ZSL methods with their adapted versions on two datasets under the standard ZSL setting in various attack scenarios. On the horizontal axis, 'N', 'F', 'I', 'D', 'C', 'W', and 'Z' denote the No attack, FGSM, IFGSM, DeepFool, CW, WRM, and ZOO attack methods, respectively; the numbers next to 'F', 'I', and 'W' indicate the attack magnitudes.
(a) Image attack on AWA2
(b) Attribute attack on AWA2
(c) Image attack on CUB
(d) Attribute attack on CUB
Fig. 9: Comparison (HM:%) of several ZSL methods with their adapted versions on two datasets under the generalized ZSL setting in various attack scenarios. On the horizontal axis, 'N', 'F', 'I', 'D', 'C', 'W', and 'Z' denote the No attack, FGSM, IFGSM, DeepFool, CW, WRM, and ZOO attack methods, respectively; the numbers next to 'F', 'I', and 'W' indicate the attack magnitudes.

5.3 Ablation Study

Actually, we have already conducted a vital ablation study, namely the Baseline in Section 5.1 and Section 5.2. ATZSL consistently outperforms Baseline on the AWA2 and CUB datasets under both the standard and the generalized ZSL settings for image and attribute attacks, which verifies that training ZSL models with adversarial samples or attributes is indeed effective. To further verify this, we additionally train the state-of-the-art ZSL methods DEM [54] and LDF [32] with adversarial examples, named DEM-adv and LDF-adv. GAN based ZSL methods (e.g., FGN [52] and MC-ZSL [14]) are not considered since they are very complex and cannot easily be adapted for image attacks. The key to this adaptation is to generate adversarial images or attributes by maximizing the original objective function, as in Eq. (4) or Eq. (5). In particular, we adapt DEM [54] into an end-to-end model by training it together with the feature extraction module (Resnet34). Fig. 8 shows the T1 performance of the adapted competitors under the standard ZSL setting. For both datasets, there is an obvious gap between DEM-adv and DEM [54] (resp. LDF-adv and LDF [32]) in terms of T1 in every attack scenario. Likewise, under the generalized ZSL setting, Fig. 9 presents the HM performance of the adapted competitors on the two datasets under various attack scenarios; the similarly large gap between DEM-adv and DEM [54] (resp. LDF-adv and LDF [32]) re-verifies that adversarial data can indeed improve the robustness of ZSL models. More importantly, as shown in Fig. 4 and Fig. 5, under the no-attack scenario (denoted by 'N' on the horizontal axis), our Baseline already achieves higher performance than the competitors (e.g., DEM [54], LDF [32], FGN [52], and MC-ZSL [14]), so adversarial training on our baseline rather than on the others is more promising. This is confirmed in Fig. 8 and Fig. 9, where our ATZSL consistently outperforms the two adapted ZSL methods (DEM-adv and LDF-adv).

Dataset Attack type Forward (Clean) Finding adversarial data Forward (Adversarial) Backward Total
AWA2 Image attack 22 226 22 103 378
Attribute attack 2 17 2 9 31
CUB Image attack 7 70 7 34 119
Attribute attack 2 12 2 6 23
TABLE X: Training Time (sec.) of Each Part During An Iteration, Where the Numbers in Bold Are the Added Overhead Using Our ATZSL

Besides, we conduct two additional ablation studies: i) we train our ATZSL without clean samples or attributes (i.e., with $\lambda = 1$), named Baseline2, to verify the necessity of retaining clean data; ii) we replace the concatenation operator in Eq. (1) with the product operator for the feature maps, named Baseline3, to verify the advantage of concatenation claimed in Section 3.2. In particular, Baseline3 is configured slightly differently for convenient implementation. The related results are presented in Fig. 8 and Fig. 9, corresponding to the standard and the generalized ZSL settings respectively (the trade-off of each method is presented in the supplementary material). As expected, our ATZSL consistently performs best compared with our various baselines under all attacks. Specifically, the obvious gap between ATZSL and Baseline2 supports the first ablation, while the gap between Baseline and Baseline3 supports the second. In essence, clean and adversarial samples play an equally important role in training a robust ZSL model.

5.4 Complexity Analysis

To demonstrate the efficiency of our ATZSL in Eq. (4) and Eq. (5) for image and attribute attacks, respectively, we provide the computational costs during model training under the standard ZSL setting (the generalized ZSL setting is the same). Specifically, given the parameters of the three modules $F$, $E$, and $R$, a forward pass computes the pair-wise relation score between every training example and each seen class prototype; with $d_x$ and $d_a$ denoting the dimensions of the image embedding and the class prototype, obtaining all clean scores is efficient, costing time linear in the number of example-prototype pairs. Let $B$ denote the backward complexity of our network on all clean data. Finding the worst-case adversarial data relies only on the gradient with respect to the input rather than all variables, and thus costs far less than $K$ times $B$, where $K$ is the number of IFGSM update steps during training and is generally small. Consequently, with one additional forward pass on the generated adversarial data, the overall cost of our algorithm remains modest: practically acceptable, though higher than the original cost of the baseline. In particular, the complexity of our baseline is really low for ZSL compared with methods that adopt far more complicated nonlinear formulations.

To present the overhead of ATZSL more intuitively, we further provide the running time of each part in our algorithm. Concretely, with 2661 MiB and 739 MiB of GPU memory used for image and attribute attacks on the AWA2 dataset respectively (2717 MiB and 1207 MiB on CUB), Table X reports the wall-clock training time for one iteration of our ATZSL. The maximum numbers of iterations are 300 and 2,000 for image and attribute attacks, respectively. The added overhead of finding and training on the worst-case adversarial data is negligible in practice compared with the significant accuracy improvements these adversarial data bring. In addition, although ATZSL adds some overhead, it still takes comparable or even less time than GAN based ZSL methods (e.g., FGN [52] and MC-ZSL [14]).

6 Conclusion

In this work, we addressed the new problem of simultaneously achieving high robustness and transferability in zero-shot learning. To this end, we developed ATZSL, an adversarially trained zero-shot model that integrates the two goals in one unified constrained optimization framework. Specifically, a defensive relation prediction network was designed to transfer knowledge from the seen to the unseen class domains. Meanwhile, during the training phase, we injected adversarial images or attributes into our network to transfer knowledge from the clean to the adversarial domain, thereby learning a robust ZSL model. Extensive experiments endorse the effectiveness and efficiency of ATZSL: i) compared with state-of-the-art ZSL models and their variants, ATZSL performs consistently better on various adversarial data for zero-shot recognition while losing only negligible performance on clean data; ii) under both the standard and the generalized ZSL settings, ATZSL consistently remains robust against different attackers in both the visual and the semantic spaces.

We emphasize that this is, to our knowledge, the first attempt to study the robustness of ZSL, which is of real practical importance. Moreover, ATZSL could serve as a benchmark for later work, even though defense through adversarial training is a general technique. Building on ATZSL, we are currently trying to narrow the large accuracy gap between clean and adversarial samples on unseen classes using a multi-level distribution alignment constraint. This is a valuable problem since it remains challenging even for general classification tasks.

References

  • [1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015) Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2927–2936. Cited by: §1.
  • [2] Y. Annadani and S. Biswas (2018) Preserving semantic relations for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7603–7612. Cited by: §2, §2.
  • [3] A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran (2018) Zero-shot object detection. In European Conference Computer Vision, pp. 397–414. Cited by: §2.
  • [4] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018) SIGNSGD: compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 559–568. Cited by: §3.3.
  • [5] I. Biederman (1987) Recognition-by-components: a theory of human image understanding. Psychological Review 94 (2), pp. 115–147. Cited by: item -.
  • [6] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §2, §4.3, §4.3, TABLE II, TABLE IV, §5.1, TABLE VI, TABLE VIII.
  • [7] S. Changpinyo, W. Chao, B. Gong, and F. Sha (2016) Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5327–5336. Cited by: §1, §2.
  • [8] S. Changpinyo, W. Chao, B. Gong, and F. Sha (2018) Classifier and exemplar synthesis for zero-shot learning. arXiv preprint arXiv:1812.06423. Cited by: §2.
  • [9] S. Changpinyo, W. Chao, and F. Sha (2017) Predicting visual exemplars of unseen classes for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3476–3485. Cited by: §2.
  • [10] L. Chen, H. Zhang, J. Xiao, W. Liu, and S. Chang (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1043–1052. Cited by: §2.
  • [11] P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. Cited by: §2, §4.3, §4.3, TABLE II, TABLE IV, §5.1, TABLE VI, TABLE VIII.
  • [12] J. M. Danskin (2012) The theory of max-min and its application to weapons allocation problems. Vol. 5, Springer Science & Business Media. Cited by: §3.3.
  • [13] A. Fawzi, H. Fawzi, and O. Fawzi (2018) Adversarial vulnerability for any classifier. In Advances in neural information processing systems, pp. 1186–1195. Cited by: §1.
  • [14] R. Felix, V. B. Kumar, I. Reid, and G. Carneiro (2018) Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision, pp. 21–37. Cited by: §1, §4.3, TABLE II, TABLE III, TABLE IV, §5.1, §5.3, §5.4, TABLE V, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
  • [15] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §1.
  • [16] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong (2015) Transductive multi-view zero-shot learning. IEEE transactions on pattern analysis and machine intelligence 37 (11), pp. 2332–2345. Cited by: §2.
  • [17] Z. Fu, T. Xiang, E. Kodirov, and S. Gong (2018) Zero-shot learning on semantic class prototype graph. IEEE transactions on pattern analysis and machine intelligence 40 (8), pp. 2009–2022. Cited by: §2.
  • [18] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §2, §3.2, §4.3, §4.3, TABLE II, TABLE III, TABLE IV, §5.1, §5.1, §5.1, §5.2, TABLE V, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Fig. 2.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.2.
  • [21] Y. Hubert Tsai, L. Huang, and R. Salakhutdinov (2017) Learning robust visual-semantic embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3571–3580. Cited by: §2.
  • [22] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pp. 2142–2151. Cited by: §2.
  • [23] M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing (2019) Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11487–11496. Cited by: §1.
  • [24] P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa (2012) Online incremental attribute-based zero-shot learning. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3657–3664. Cited by: §1.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
  • [26] E. Kodirov, T. Xiang, Z. Fu, and S. Gong (2015) Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2452–2460. Cited by: §2.
  • [27] E. Kodirov, T. Xiang, and S. Gong (2017) Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3174–3183. Cited by: §2, §2.
  • [28] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §2, §3.2, §3.3, §4.3, §4.3, TABLE II, TABLE III, TABLE IV, §5.1, §5.1, §5.1, §5.2, TABLE V, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
  • [29] C. H. Lampert, H. Nickisch, and S. Harmeling (2009) Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. Cited by: §1.
  • [30] C. Lampert, H. Nickisch, and S. Harmeling (2014) Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. Cited by: §1, §2.
  • [31] J. Lei Ba, K. Swersky, S. Fidler, et al. (2015) Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4247–4255. Cited by: §1.
  • [32] Y. Li, J. Zhang, J. Zhang, and K. Huang (2018) Discriminative learning of latent features for zero-shot recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7463–7471. Cited by: §2, §4.3, TABLE II, TABLE III, TABLE IV, §5.1, §5.3, TABLE V, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
  • [33] S. Liu, M. Long, J. Wang, and M. I. Jordan (2018) Generalized zero-shot learning with deep calibration network. In Advances in Neural Information Processing Systems, pp. 2006–2016. Cited by: §2, §2.
  • [34] Y. Long, L. Liu, L. Shao, F. Shen, G. Ding, and J. Han (2017) From zero-shot learning to conventional supervised classification: unseen visual data synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1627–1636. Cited by: §1.
  • [35] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.2.
  • [36] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §2.
  • [37] T. Mensink, E. Gavves, and C. G. Snoek (2014) Costa: co-occurrence statistics for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2441–2448. Cited by: §1.
  • [38] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582. Cited by: §2, §4.3, §4.3, TABLE II, TABLE IV, §5.1, TABLE VI, TABLE VIII.
  • [39] P. Morgado and N. Vasconcelos (2017) Semantically consistent regularization for zero-shot recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6060–6069. Cited by: §1, §2.
  • [40] L. Niu, A. Veeraraghavan, and A. Sabharwal (2018) Webly supervised learning meets zero-shot learning: a hybrid approach for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7171–7180. Cited by: §2.
  • [41] M. Radovanović, A. Nanopoulos, and M. Ivanović (2010) Hubs in space: popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (Sep), pp. 2487–2531. Cited by: §2.
  • [42] B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pp. 2152–2161. Cited by: §2.
  • [43] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry (2018) Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pp. 5014–5026. Cited by: §1.
  • [44] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8247–8255. Cited by: §1.
  • [45] A. Sinha, H. Namkoong, and J. Duchi (2017) Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571. Cited by: §2, §4.3, §4.3, TABLE II, TABLE III, TABLE IV, §5.1, §5.1, §5.1, §5.2, TABLE V, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
  • [46] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pp. 935–943. Cited by: §1, §2.
  • [47] S. Venugopalan, L. Anne Hendricks, M. Rohrbach, R. Mooney, T. Darrell, and K. Saenko (2017) Captioning images with diverse objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5753–5761. Cited by: §1.
  • [48] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset. Cited by: §4.1.
  • [49] W. Wang, V. W. Zheng, H. Yu, and C. Miao (2019) A survey of zero-shot learning: settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology 10 (2), pp. 1–13. Cited by: §1.
  • [50] X. Wang, Y. Ye, and A. Gupta (2018) Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6857–6866. Cited by: §2.
  • [51] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2018) Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence. Cited by: Fig. 1, §2, §4.1, §4.1, §4.2.
  • [52] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018) Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5542–5551. Cited by: §1, §4.3, TABLE II, TABLE III, TABLE IV, §5.1, §5.3, §5.4, TABLE V, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
  • [53] Y. Yu, Z. Ji, Y. Fu, J. Guo, Y. Pang, and Z. Zhang (2018) Stacked semantic-guided attention model for fine-grained zero-shot learning. In Advances in Neural Information Processing Systems, pp. 5998–6007. Cited by: §2.
  • [54] L. Zhang, T. Xiang, and S. Gong (2017) Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030. Cited by: §2, §4.3, TABLE II, TABLE III, TABLE IV, §5.1, §5.3, TABLE V, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
  • [55] Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision, pp. 4166–4174. Cited by: §1.
  • [56] A. Zhao, M. Ding, J. Guan, Z. Lu, T. Xiang, and J. Wen (2018) Domain-invariant projection learning for zero-shot recognition. In Advances in Neural Information Processing Systems, pp. 1025–1036. Cited by: §2.
  • [57] P. Zhu, H. Wang, and V. Saligrama (2019) Generalized zero-shot recognition based on visually semantic embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2995–3003. Cited by: §3.2.