Semantic Similarity Based Softmax Classifier for Zero-Shot Learning

by Shabnam Daghaghi, et al.
Rice University

Zero-Shot Learning (ZSL) is a classification task in which we do not have even a single labeled training example from a set of unseen classes. Instead, we only have prior information (or a description) about seen and unseen classes, often in the form of physically realizable or descriptive attributes. The lack of any training example from a set of classes prohibits the use of standard classification techniques and losses, including the popular cross-entropy loss. Currently, state-of-the-art approaches encode the prior class information into dense vectors and optimize some distance between the learned projections of the input vector and the corresponding class vector (collectively known as embedding models). In this paper, we propose a novel architecture that casts zero-shot learning as a standard neural network with a cross-entropy loss. During training, our approach performs soft labeling by combining the observed training data for the seen classes with the attribute-based similarity information for the unseen classes, for which we have no training data. To the best of our knowledge, such similarity-based soft labeling has not been explored in the field of deep learning. We evaluate the proposed model on four benchmark datasets for zero-shot learning (AwA, aPY, SUN and CUB) and show that it consistently achieves significant improvement over state-of-the-art methods in both the Generalized ZSL and ZSL settings on all of these datasets.








1 Introduction and Previous Works

Supervised classifiers, specifically deep neural networks, need a large number of labeled samples to perform well. Deep learning frameworks are known to have limitations in the fine-grained classification regime and in detecting object categories with no labeled data [1, 2, 3, 4]. On the contrary, humans can recognize new classes using their previous knowledge. This power is due to the ability of humans to transfer their prior knowledge to recognize new objects [5, 6]. Zero-shot learning aims to achieve this human-like capability for learning algorithms, which naturally reduces the burden of labeling.

In the zero-shot learning problem, no training samples are available for a set of classes, referred to as unseen classes. Instead, semantic information (in the form of visual attributes or textual features) is available for the unseen classes [7, 8]. In addition, we have standard supervised training data for a different set of classes, referred to as seen classes, along with the semantic information of the seen classes. The key to solving the zero-shot learning problem is to leverage the classifier trained on seen classes to predict unseen classes by transferring knowledge, analogous to humans.

Early variants of ZSL assume that during inference, samples come only from unseen classes. Recent works [9, 10, 3] observe that such an assumption is not realistic. Generalized ZSL (GZSL) addresses this concern and considers a more practical variant: in GZSL there is no restriction on seen and unseen classes during inference, and we are required to discriminate between all the classes. Clearly, GZSL is more challenging because the trained classifier is generally biased toward seen classes.

In order to create a bridge between the visual space and the semantic attribute space, some methods utilize embedding techniques [11, 12, 2, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] and others use semantic similarity between seen and unseen classes [24, 25, 26]. The embedding based models follow three different directions: mapping the visual space to the semantic space [11, 12, 2, 13, 14], mapping the semantic space to the visual space [15, 16, 27, 28], and finding a latent space and then mapping both the visual and semantic spaces into this joint embedding space [17, 18, 19, 20, 21, 22, 23]. Semantic similarity based models, in contrast, represent each unseen class as a mixture of seen classes.

The loss functions in embedding based models only involve training samples from the seen classes; for the unseen classes, we do not have any samples. It is not difficult to see that this lack of training samples biases the learning process towards the seen classes only. One recently proposed technique to address this issue is augmenting the loss function with an unsupervised regularization, such as entropy minimization over the unseen classes [29].


Another recent methodology, which follows a different perspective, deploys Generative Adversarial Networks (GANs) to generate synthetic samples for unseen classes by utilizing their attribute information [30, 31, 32]. Although generative models boost the results significantly, they are difficult to train. Furthermore, the training requires generating a large number of samples, followed by training on a much larger augmented dataset, which hurts scalability.

Our Contribution:

We propose a traditional fully connected neural network architecture with a cross-entropy loss for the GZSL/ZSL problem. There are two key differences which allow such a model to predict unseen classes accurately without even a single training example. First, the Softmax layer consists of nodes for all the classes, both seen and unseen. Second, the weights in the final layer are the attributes themselves and are non-trainable: the weight vector of the node corresponding to a class is the attribute vector given for that class in the ZSL problem, and it does not change over the course of training.

The key novelty of our approach is soft labeling, which enables the training data from the seen classes to also train the unseen classes. We directly use the attribute similarity between the correct seen class and the unseen classes to create soft unseen labels for each training sample. As a result of this soft labeling, training instances for seen classes also serve as soft training instances for the unseen classes without increasing the training corpus. This soft labeling leads to implicit supervision for the unseen classes, which eliminates the need for any unsupervised regularization such as the entropy loss of [29].

Soft labeling along with the cross-entropy loss enables a simple MLP network to tackle the GZSL problem. Our proposed model, which we call Z-Softmax, is a simple (unlike GANs) and efficient (unlike visual-semantic pairwise embedding models) approach that outperforms the current state-of-the-art methods in the GZSL and ZSL settings on four benchmark datasets by a significant margin.

2 Problem Definition

In the zero-shot learning problem, a set of training data on seen classes and a set of semantic information (attributes) on both seen and unseen classes are given. The training dataset includes N samples {(x_i, y_i)}, where x_i is the visual feature vector of the i-th image and y_i is its one-hot-encoded true label. All training samples belong to the set of seen classes S, and during training there is no sample available from the set of unseen classes U. The total number of classes is K = |S| + |U|. Semantic information, or attributes a_k, is given for all K classes, and the collection of all attributes is represented by the attribute matrix A.

In the inference phase, our objective is to predict the correct classes (either seen or unseen) of the test dataset. The classic ZSL setting assumes that all test samples belong to unseen classes and tries to classify test samples only into the unseen classes U. In the more realistic GZSL setting, there is no assumption about the correct classes of the test data, and we aim at classifying samples into either seen or unseen classes, i.e., into S ∪ U.

3 Proposed Methodology

3.1 Network Architecture and Training Strategy

The overall proposed methodology is shown in Figure 1. We map the visual space to the semantic space, then compute the similarity scores between the true class attributes and the attribute/semantic representation θ(x) of the input x. Finally, the similarity scores are fed into a Softmax, and the probabilities of all classes are computed. Figure 2 depicts the details of our proposed network. As the input visual features, for all four benchmark datasets we use features extracted by a ResNet-101 pre-trained on ImageNet, provided by [3]. Unlike the model in [29], we do not fine-tune the CNN that generates the visual features. In this sense, our proposed model is also fast and straightforward to train.

Figure 1: The overall framework of the proposed Z-Softmax classifier. The semantic representation θ(x) is obtained via a visual-to-semantic mapping of the visual features x. Similarity scores (dot-products) of θ(x) and all class attributes A are calculated and passed through a Softmax to produce all class probabilities.
Figure 2: Architecture of the proposed MLP for the Z-Softmax classifier. Layers #1 and #2 provide the nonlinear embedding that maps visual features to the attribute space, and their weights W₁, W₂ are learned by SGD. The output layer, with non-trainable weights A, calculates the dot-products of the semantic representation of the input with all class attributes simultaneously.

As Figure 2 illustrates, our architecture first employs a nonlinear mapping to transfer the visual features x (e.g., CNN-extracted image features) to the semantic domain, θ(x). Then, the consistency (similarity) between the semantic representation and the class attributes is calculated by dot-products. As a result, a zero-shot classification task can be accomplished by assigning to the image instance the class with the maximum value of the consistency measure (dot-product). We propose to pass the consistency scores of all seen and unseen classes through a Softmax to produce class probabilities and tackle the GZSL problem. In our proposed framework, θ is a multi-layer perceptron (MLP) embedding with trainable parameters given by W (Figure 2).
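
As a concrete sketch, this forward pass can be written in a few lines of NumPy (the dimensions below are AwA-like and chosen for illustration; the weights are random stand-ins, not a trained model):

```python
import numpy as np

def softmax(s):
    # Numerically stable Softmax over the last axis.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def theta(x, W1, W2):
    # Two-layer MLP embedding: visual features -> attribute space.
    return np.tanh(np.tanh(x @ W1) @ W2)

rng = np.random.default_rng(0)
d_vis, d_hid, d_attr, K = 2048, 512, 85, 50   # AwA-like dimensions
W1 = rng.normal(0, 0.01, (d_vis, d_hid))      # trainable
W2 = rng.normal(0, 0.01, (d_hid, d_attr))     # trainable
A = rng.normal(size=(d_attr, K))              # class attributes (seen + unseen), non-trainable

x = rng.normal(size=(1, d_vis))               # one visual feature vector
s = theta(x, W1, W2) @ A                      # consistency scores for all K classes
p = softmax(s)                                # probabilities over seen and unseen classes
```

The only difference from a standard classifier is that the output weight matrix A is fixed to the class attributes rather than learned.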

The attribute matrix A includes both seen and unseen class attributes:

    A = [A_S, A_U]

The proposed method is a multi-class probabilistic classifier that produces a K-dimensional vector of probabilities p for each sample as follows:

    p = Softmax(s)

where s is the K-dimensional vector of all consistency scores of an input sample x. The following defines the similarity score (dot-product) between the semantic representation of sample x and attribute a_k:

    s_k = θ(x; W)ᵀ a_k,   k = 1, …, K

Each element of the vector p represents an individual class probability:

    p_k = exp(s_k) / Σ_{k'=1}^{K} exp(s_{k'})

This Softmax, as the activation function of the last layer of the network, is calculated over the total number of classes K. A natural choice to train a multi-class probabilistic classifier is the cross-entropy loss, which we later show naturally integrates our idea of soft labeling. During training, we aim at learning the nonlinear mapping, i.e., obtaining the network weights W through:

    W* = argmin_W  Σ_{i=1}^{N} ℓ(x_i, y_i; W) + λ₁‖W₁‖² + λ₂‖W₂‖²

where λ₁ and λ₂ are regularization factors which are obtained through hyperparameter tuning, and ℓ represents the categorical cross-entropy loss for each training sample as defined below:

    ℓ(x, y; W) = − Σ_{k=1}^{K} y_k log p_k

where K is the total number of classes.

The following is the unified cross-entropy loss for each sample (x_i, y_i) (or x for simplicity) and its separated terms involving seen and unseen classes:

    ℓ = − Σ_{k∈S} y_k log p_k − Σ_{k∈U} y_k log p_k

Plugging in ȳ_S and ȳ_U, the normalized versions of the seen and unseen label components y_S and y_U, into the above equation gives a weighted sum of a seen cross-entropy term and an unseen cross-entropy term:

    ℓ = (1 − p_u) [− Σ_{k∈S} ȳ_k log p_k] + p_u [− Σ_{k∈U} ȳ_k log p_k]

where p_u is the total label probability assigned to the unseen classes (defined in Section 3.2).

We deploy similarity based soft labeling of unseen classes that allows us to learn both seen and unseen signatures simultaneously via the above-mentioned simple architecture.

3.2 Soft Labeling

In the ZSL problem, we do not have any training instances from unseen classes, so the output nodes corresponding to unseen classes are always inactive during learning. Standard supervised training thus biases the network towards the seen classes only. Moreover, the available similarity information between the seen and unseen attributes is never utilized.

We propose soft labeling based on the similarity between semantic attributes. For each seen sample, we represent its relationship to the unseen categories by computing the semantic similarity (dot-product) between the seen class attribute and all the unseen class attributes. In the simplest form, for every training sample, we can find the unseen class nearest to the correct seen class label and assign a small probability (partial membership, or soft label) of this instance being from the closest unseen class. Note that each training sample's label comes from the set of seen classes; with soft labeling, we enrich the label with partial assignments to unseen classes.
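
A minimal NumPy sketch of this nearest-unseen soft labeling (the class counts, random attributes, and mass p_u below are illustrative, not values from the paper):

```python
import numpy as np

def nearest_unseen_soft_label(y_seen, A_seen, A_unseen, p_u=0.1):
    """Soft label: mass 1 - p_u on the true seen class, mass p_u on the
    unseen class whose attribute vector is most similar (dot-product)."""
    n_seen, n_unseen = A_seen.shape[0], A_unseen.shape[0]
    sims = A_seen[y_seen] @ A_unseen.T      # seen-to-unseen similarities
    nearest = int(np.argmax(sims))          # closest unseen class
    label = np.zeros(n_seen + n_unseen)
    label[y_seen] = 1.0 - p_u               # seen part of the soft label
    label[n_seen + nearest] = p_u           # unseen part of the soft label
    return label

rng = np.random.default_rng(1)
A_seen = rng.normal(size=(40, 85))          # e.g. 40 seen class attributes
A_unseen = rng.normal(size=(10, 85))        # e.g. 10 unseen class attributes
label = nearest_unseen_soft_label(3, A_seen, A_unseen, p_u=0.1)
```

The resulting label still sums to one, so it can replace the one-hot target in the standard cross-entropy loss without further changes.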

In a more general soft labeling approach, we propose assigning a distribution to all the unseen classes. A natural choice is to transform seen-to-unseen similarities (dot-product) to probabilities (soft labels) shown in Equation (9). The unseen distribution is obtained for each seen class by calculating dot-product of seen class attribute and all unseen classes’ attributes and squashing all these dot-product values by Softmax to acquire probabilities. In this case, we distribute the probability among all unseen classes based on the obtained unseen distribution. This proposed strategy results in a soft label for each seen image during training, which as we show later helps the network to learn unseen categories.

Our proposed model with a distribution on unseen classes (DU) is referred to as Z-Softmax DU throughout the rest of the paper. The nearest unseen method is a special case of unseen-distribution soft labeling where the probability of the nearest unseen class is p_u and the rest of the unseen classes are assigned zero. It is natural to implement a two or three nearest unseen approach; however, a better strategy is to utilize a temperature parameter [33, 29] on the similarity scores. The temperature parameter T controls the flatness of the unseen distribution: a higher temperature results in a flatter distribution over unseen categories, and a lower temperature creates a more ragged distribution with peaks on the nearest unseen classes. A small enough temperature basically recovers the nearest unseen approach. The impact of the temperature on the unseen distribution is depicted in Figure 4.a for a particular seen class. Soft labeling implicitly introduces unseen visual features into the network without generating fake unseen samples as in generative methods [30, 31, 32]. Hence, our proposed approach is able to reproduce the same effect as generative models without the need to create fake samples and train generative models, which are known to be difficult to train. Below is the formal description of the temperature Softmax:


    ŷ_j = p_u · exp(a_sᵀ a_j / T) / Σ_{j'∈U} exp(a_sᵀ a_{j'} / T),   j ∈ U        (9)

where T and p_u are the temperature parameter and the total probability assigned to the unseen distribution, respectively. Here, ŷ_j is the soft label (probability) of unseen class j for a training sample of seen class s. It should be noted that p_u is the sum of all unseen soft labels, i.e., Σ_{j∈U} ŷ_j = p_u.
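
The distribution-on-unseen soft labels can be sketched as follows (the attribute vectors and hyperparameter values are illustrative):

```python
import numpy as np

def unseen_soft_labels(a_seen, A_unseen, T=0.2, p_u=0.1):
    """Distribute total mass p_u over the unseen classes via a temperature
    Softmax of the seen-to-unseen attribute similarities (dot-products)."""
    logits = (A_unseen @ a_seen) / T
    z = np.exp(logits - logits.max())       # numerically stable Softmax
    return p_u * z / z.sum()                # soft labels summing to p_u

rng = np.random.default_rng(2)
a_seen = rng.normal(size=85)                # attribute vector of the true seen class
A_unseen = rng.normal(size=(10, 85))        # attribute vectors of the unseen classes
flat = unseen_soft_labels(a_seen, A_unseen, T=10.0)    # high T: flatter distribution
peaked = unseen_soft_labels(a_seen, A_unseen, T=0.01)  # low T: ~nearest unseen
```

Lowering T concentrates the mass on the nearest unseen class, recovering the NU variant; raising T flattens the distribution over all unseen classes.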

Utilizing Equations (8) and (9), where the ŷ_j are the soft labels of the unseen classes produced by the temperature Softmax, we obtain the following:

    ℓ = (1 − p_u) ℓ_S + p_u ℓ_U

Hence, minimizing ℓ is equivalent to minimizing the weighted sum of the cross-entropy over the seen classes, ℓ_S, and the cross-entropy over the unseen classes, ℓ_U. The hyperparameter p_u acts as a trade-off coefficient between the seen and unseen cross-entropy losses (Figure 3). We can see that the regularizer is a weighted cross-entropy on the unseen classes, which leverages the similarity structure between attributes, in contrast to the uniform entropy function of [29].
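
This decomposition is easy to verify numerically. The sketch below uses arbitrary illustrative probabilities and checks that the cross-entropy against the soft label equals (1 − p_u) times the cross-entropy of the normalized seen part plus p_u times that of the normalized unseen part:

```python
import numpy as np

rng = np.random.default_rng(3)
n_seen, n_unseen, p_u = 4, 3, 0.1

p = rng.dirichlet(np.ones(n_seen + n_unseen))   # model probabilities over all classes
q = rng.dirichlet(np.ones(n_unseen))            # normalized unseen soft-label distribution

y = np.zeros(n_seen + n_unseen)                 # soft label for a sample of seen class 1
y[1] = 1.0 - p_u
y[n_seen:] = p_u * q

ce = -(y * np.log(p)).sum()                     # unified cross-entropy
ce_seen = -np.log(p[1])                         # seen cross-entropy (normalized label)
ce_unseen = -(q * np.log(p[n_seen:])).sum()     # unseen cross-entropy (normalized label)
assert np.isclose(ce, (1 - p_u) * ce_seen + p_u * ce_unseen)
```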

3.2.1 Intuition

We illustrate the intuition with the AwA dataset [7], a ZSL benchmark dataset, and its proposed seen-unseen split [3]. Consider the seen class squirrel. We compute the unseen classes closest to the class squirrel in terms of attributes. We find that the closest class is rat and the second closest is bat, while other classes such as horse, dolphin, and sheep are not close. This is not surprising, as squirrel and rat share several attributes. It is naturally desirable to have a classifier that gives rat a higher probability than the other unseen classes. If we enforce this softly, we can ensure that the classifier is not blind towards unseen classes due to the lack of any training example.

From a learning perspective, without any regularization, we cannot hope for the classifier to classify unseen classes accurately. This problem was identified in [29], which proposed an entropy-based regularization in the form of the Deep Calibration Network (DCN). DCN uses a cross-entropy loss for the seen classes and regularizes the model with an entropy loss on the unseen classes. The authors of DCN postulate that minimizing the uncertainty (entropy) of the predicted unseen distribution of training samples enables the network to become aware of unseen visual features. While minimizing uncertainty is a good choice of regularization, it does not eliminate the possibility of being confident about the wrong unseen class. Clearly, in our example above, the uncertainty can be minimized even when the classifier gives high confidence to the unseen class dolphin on an image of the seen class squirrel. Furthermore, if several unseen classes are close to the correct class, we may not actually want low uncertainty. Utilizing similarity-based soft labeling implicitly regularizes the model in a supervised fashion: the similarity values naturally carry information about how much certainty we want for each specific unseen class. We believe that this supervised regularization is the critical reason why our model outperforms DCN by a significant margin.

4 Experiment

We conduct a comprehensive comparison of our proposed Z-Softmax NU and DU with the state-of-the-art methods in the GZSL and ZSL settings on four benchmark datasets (Table 1). Our model outperforms the state-of-the-art methods in both the GZSL and ZSL settings on all benchmark datasets.

4.1 Dataset

The proposed method is evaluated on four benchmark ZSL datasets, whose statistics are shown in Table 1. Animals with Attributes (AwA) [7, 8] is a coarse-grained benchmark dataset for ZSL/GZSL. It has 30475 image samples from 50 classes of different animals, and each class comes with side information in the form of attributes (e.g., animal size, color, specific features, place of habitat). The attribute space dimension is 85, and the dataset has a standard split of 40 seen and 10 unseen classes introduced in [8]. Caltech-UCSD-Birds-200-2011 (CUB) [34] is a fine-grained ZSL benchmark dataset. It has 11788 images from 200 different types of birds, and each class comes with 312 attributes. The standard ZSL split for this dataset has 150 seen and 50 unseen classes [17]. SUN Attribute (SUN) [35] is a fine-grained ZSL benchmark dataset consisting of 14340 images of different scenes, with each scene class annotated with 102 attributes. This dataset has a standard ZSL split of 645 seen and 72 unseen classes. Attribute Pascal and Yahoo (aPY) [36] is a small, coarse-grained ZSL benchmark dataset with 18627 images and 32 classes of different objects (e.g., aeroplane, bottle, person, sofa), with each class provided with 64 attributes. This dataset has a standard split of 20 seen and 12 unseen classes.

Dataset #Attributes #Seen Classes #Unseen Classes #Images
AwA 85 40 10 30475
CUB 312 150 50 11788
aPY 64 20 12 18627
SUN 102 645 72 14340
Table 1: Statistics of four ZSL benchmark datasets

4.2 Evaluation Metric

For the purpose of validation, we employ the validation splits provided along with the PS [3] to perform cross-validation for hyper-parameter tuning. The main objective of GZSL is to simultaneously improve the seen-sample accuracy and the unseen-sample accuracy, i.e., to manage the trade-off between these two metrics. As a result, the standard GZSL evaluation metric is the harmonic average of the seen and unseen accuracies, chosen to encourage the network not to be biased toward the seen classes. The harmonic average of accuracies is defined in Equation (11), where Acc_S and Acc_U are the seen and unseen accuracies, respectively:

    H = 2 · Acc_S · Acc_U / (Acc_S + Acc_U)        (11)
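
The metric itself is straightforward to compute; for instance (the inputs below are the rounded Z-Softmax DU seen/unseen accuracies for AwA from Table 2):

```python
def harmonic_accuracy(acc_seen, acc_unseen):
    """Harmonic average of seen and unseen accuracies, the standard GZSL metric."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

h = harmonic_accuracy(75.8, 50.7)   # close to the 60.7 reported in Table 2
```

Because it is a harmonic rather than arithmetic average, H collapses toward zero whenever either accuracy is near zero, penalizing classifiers that ignore the unseen classes.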

Dataset AwA aPY CUB SUN
Method U S H U S H U S H U S H
DAP [7] 0.0 88.7 0.0 4.8 78.3 9.0 1.7 67.9 3.3 4.2 25.1 7.2
ALE [37] 16.8 76.1 27.5 4.6 73.7 8.7 23.7 62.8 34.4 21.8 33.1 26.3
SJE [18] 11.3 74.6 19.6 3.7 55.7 6.9 23.5 59.2 33.6 14.7 30.5 19.8
ConSE [38] 0.4 88.6 0.8 0.0 91.2 0.0 1.6 72.2 3.1 6.8 39.9 11.6
Sync [39] 8.9 87.3 16.2 7.4 66.3 13.3 11.5 70.9 19.8 7.9 43.3 13.4
DeViSE [20] 13.4 68.7 22.4 4.9 76.9 9.2 23.8 53.0 32.8 16.9 27.4 20.9
CMT [2] 0.9 87.6 1.8 1.4 85.2 2.8 7.2 49.8 12.6 8.1 21.8 11.8
ZSKL [4] 18.9 82.7 30.8 10.5 76.2 18.5 21.6 52.8 30.6 20.1 31.4 24.5
DCN [29] 25.5 84.2 39.1 14.2 75.0 23.9 28.4 60.7 38.7 25.5 37.0 30.2
Z-Softmax NU 34.5 62.96 44.5 21.2 59.5 31.0 34.8 45.5 39.4 40.0 26.8 32.1
Z-Softmax NU (std) (1.9) (1.9) (1.4) (4.1) (1.4) (4.4) (0.47) (0.46) (0.31) (1.1) (0.75) (0.42)
Z-Softmax DU 50.7 75.8 60.7 23.7 56.3 33.1 42.3 51.2 46.3 44.0 30.2 35.8
Z-Softmax DU (std) (1.5) (1) (1) (3.9) (1.4) (3.9) (0.7) (0.9) (0.4) (0.6) (0.3) (0.3)
Table 2: Results of GZSL methods on ZSL benchmark datasets under Proposed Split (PS) [3]. U, S and H respectively stand for Unseen, Seen and Harmonic average accuracies. Nearest unseen (NU) and distribution on unseen (DU) are two variants of our Z-Softmax model.

4.3 Implementation Details

We utilized Keras with TensorFlow back-end [41] to implement our model.

The input to the model is the visual features of each image sample extracted by a pre-trained ResNet-101 [42] on ImageNet provided by [3]. The dimension of visual features is 2048.

To evaluate Z-Softmax, we follow the popular experimental framework and the Proposed Split (PS) in [3] for splitting classes into seen and unseen classes to compare GZSL/ZSL methods. Utilizing PS ensures that none of the unseen classes have been used in the training of ResNet-101 on ImageNet.

We cross-validate the total unseen probability p_u, the temperature T, the mini-batch size, the hidden layer size, and the activation function (from {tanh, sigmoid, hard-sigmoid, relu}) to tune our model. We ran our experiments on a machine with 56 vCPU cores (Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz) and 2 NVIDIA Tesla P100 GPUs, each with 16GB of memory. The code is provided in the supplementary material.

(a) AwA
(b) aPY
(c) CUB
(d) SUN
Figure 3: Plots of the seen (Acc_S), unseen (Acc_U) and harmonic average (H) accuracies versus the total probability (p_u) assigned to unseen classes, for all four ZSL datasets. The maximum obtained harmonic accuracy is also marked.

4.4 Generalized Zero-Shot Learning Results

To demonstrate the effectiveness of Z-Softmax model in GZSL setting, we comprehensively compare our proposed method with state-of-the-art discriminative GZSL models in Table 2. Since we use the standard split, the published results of other GZSL models are directly comparable.

We implemented two variants of soft labeling for our Z-Softmax model. In Z-Softmax NU, we assign a non-zero probability p_u to the nearest unseen class of the corresponding seen class of each training sample. The other version of our model is Z-Softmax DU, which distributes the probability p_u among the unseen classes based on their semantic similarity to the seen class. In Z-Softmax DU, the flatness/raggedness of the unseen probability distribution is controlled by the temperature parameter T. We report the accuracies of both Z-Softmax NU and DU in Table 2. To obtain statistically consistent results, the reported accuracies are averaged over 30 trials (using different initializations) after tuning the hyper-parameters with cross-validation.

As reported in Table 2, the unseen and harmonic accuracies of both variants of our model significantly outperform the current state-of-the-art GZSL methods on all four benchmark datasets. The proposed Z-Softmax DU also shows a meaningful improvement over Z-Softmax NU, highlighting the advantage of soft labeling via a similarity-based distribution. Overall, we achieve state-of-the-art performance on the ZSL benchmark datasets while keeping the model simple and efficient.

The soft labeling employed in Z-Softmax gives the model new flexibility to trade off seen and unseen accuracies during training and arrive at a higher value of H, which is the standard metric for GZSL. The assigned unseen soft labels (total unseen probability p_u) in both the NU and DU settings enable the classifier to gain more confidence in recognizing unseen classes, which in turn results in a considerably higher Acc_U. As the classifier is now discriminating between more classes, we get a marginally lower Acc_S. However, balancing Acc_U and Acc_S, at the cost of some seen accuracy, leads to a much higher H. This trade-off phenomenon is depicted in Figure 3 for all datasets. The flexibility provided by soft labeling is examined by obtaining accuracies for different values of p_u. In Figures 3.a and 3.b, as the total unseen probability p_u increases, Acc_U increases and Acc_S decreases, as expected. From the trade-off curves, there is an optimal p_u where H takes its maximum value, as shown in Figure 3. Maximizing H is the primary objective in a GZSL problem, and it can be achieved with the trade-off knob p_u.

It should be noted that both the AwA and aPY datasets (Figures 3.a and 3.b) are coarse-grained class datasets. In contrast, the CUB and SUN datasets are fine-grained, with hundreds of classes and a highly unbalanced seen-unseen split, and hence their accuracies behave differently with respect to p_u, as shown in Figures 3.c and 3.d. However, the harmonic average curve still has the same behavior and possesses a maximum value at an optimal p_u.

4.5 Zero-Shot Learning Results

We also evaluated our Z-Softmax DU model in the ZSL setting, where the test data is assumed to come from the unseen classes only. We observe in Table 3 that the results follow the same trend: again, Z-Softmax outperforms the state-of-the-art ZSL methods on all four benchmark datasets.

Method AwA aPY CUB SUN
DAP [7] 44.1 33.8 40.0 39.9
ALE [37] 59.9 39.7 54.9 58.1
SJE [18] 65.6 32.9 53.9 53.7
ESZSL[12] 58.2 38.3 53.9 54.5
ConSE [38] 45.6 26.9 34.3 38.8
Sync [39] 54.0 23.9 55.6 56.3
DeViSE [20] 54.2 39.8 52.0 56.5
CMT [2] 39.5 28.0 34.6 39.9
DCN [29] 65.2 43.6 56.2 61.8
Z-Softmax DU 68.8 53.1 59.8 61.9
Table 3: Zero-Shot Learning results under Proposed Split (PS) [3] with ResNet-101 features [42].
Figure 4: The impact of the temperature parameter T for the AwA dataset. (a) Unseen soft labels (before multiplying by p_u) produced by the temperature Softmax of Equation (9) for various T; (b) accuracies versus T for the proposed Z-Softmax DU classifier.

4.6 Illustration of Soft Labeling

Figure 4 shows the effect of T and the resulting assigned unseen distribution on the accuracies for the AwA dataset. A small T forces the distribution to concentrate on the nearest unseen class, while a large T spreads it over all the unseen classes and basically does not introduce helpful unseen class information to the classifier. The optimal value of T for the AwA dataset is 0.2, as depicted in Figure 4.b. The impact of T on the assigned unseen distribution is shown in Figure 4.a for the seen class squirrel of the AwA dataset. The unseen distribution at T = 0.2 well represents the similarities between the seen class (squirrel) and similar unseen classes (rat, bat, bobcat), and basically corroborates the result of Figure 4.b, where T = 0.2 is the optimal temperature. In the extreme cases, a very small T focuses the distribution almost entirely on the nearest unseen class, so the similarities of the other unseen classes are ignored, while a very large T flattens the unseen distribution, which results in high uncertainty and does not contribute helpful unseen class information to the learning.

5 Conclusion

We proposed a discriminative GZSL/ZSL classifier with a visual-to-semantic mapping and a cross-entropy loss. During training, while Z-Softmax learns a seen class, each sample is simultaneously soft labeled with information about similar unseen classes based on the semantic class attributes. Soft labeling offers a trade-off between seen and unseen accuracies and provides the capability to adjust these accuracies for a particular application. We achieve state-of-the-art performance, for both ZSL and GZSL, on all four ZSL benchmark datasets while keeping the model simple and efficient.


  • [1] Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, 2015.
  • [2] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 935–943. Curran Associates, Inc., 2013.
  • [3] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4582–4591, 2017.
  • [4] Hongguang Zhang and Piotr Koniusz. Zero-shot kernel learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7670–7679, 2018.
  • [5] Yanwei Fu and Leonid Sigal. Semi-supervised vocabulary-informed learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5337–5346, 2016.
  • [6] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [7] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.
  • [8] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
  • [9] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In European Conference on Computer Vision, pages 52–68. Springer, 2016.
  • [10] Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7):1757–1772, 2013.
  • [11] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. Zero-shot learning with semantic output codes. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1410–1418. Curran Associates, Inc., 2009.
  • [12] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2152–2161, Lille, France, 07–09 Jul 2015. PMLR.
  • [13] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Improving semantic embedding consistency by metric learning for zero-shot classiffication. In European Conference on Computer Vision, pages 730–746. Springer, 2016.
  • [14] Xun Xu, Timothy Hospedales, and Shaogang Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3):309–333, 2017.
  • [15] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2021–2030, 2017.
  • [16] Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2452–2460, 2015.
  • [17] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425–1438, 2016.
  • [18] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
  • [19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [20] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2121–2129. Curran Associates, Inc., 2013.
  • [21] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 69–77, 2016.
  • [22] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034–6042, 2016.
  • [23] Ziad Al-Halah, Makarand Tapaswi, and Rainer Stiefelhagen. Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5975–5984, 2016.
  • [24] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE International Conference on Computer Vision, pages 4166–4174, 2015.
  • [25] Zhenyong Fu, Tao Xiang, Elyor Kodirov, and Shaogang Gong. Zero-shot object recognition by semantic manifold distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2635–2644, 2015.
  • [26] Thomas Mensink, Efstratios Gavves, and Cees GM Snoek. Costa: Co-occurrence statistics for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2441–2448, 2014.
  • [27] Seyed Mohsen Shojaee and Mahdieh Soleymani Baghshah. Semi-supervised zero-shot learning by a clustering-based approach. arXiv preprint arXiv:1605.09016, 2016.
  • [28] Meng Ye and Yuhong Guo. Zero-shot classification with discriminative semantic representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7140–7148, 2017.
  • [29] Shichen Liu, Mingsheng Long, Jianmin Wang, and Michael I Jordan. Generalized zero-shot learning with deep calibration network. In Advances in Neural Information Processing Systems, pages 2005–2015, 2018.
  • [30] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A Murthy. A generative model for zero shot learning using conditional variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2188–2196, 2018.
  • [31] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1004–1013, 2018.
  • [32] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5542–5551, 2018.
  • [33] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [34] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [35] Genevieve Patterson and James Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758. IEEE, 2012.
  • [36] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785. IEEE, 2009.
  • [37] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826, 2013.
  • [38] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
  • [39] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5327–5336, 2016.
  • [40] François Chollet et al. Keras, 2015.
  • [41] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [42] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.