Zero-Shot Learning (ZSL) (Larochelle et al., 2008) is a difficult classification framework introduced to push the training, understanding, and evaluation of models towards those whose performance generalizes to new, unseen concepts. To perform well under this framework, learned models must be able to make “useful” inferences about unseen concepts (e.g., label them correctly), given parameters learned only from seen training concepts and additional semantic information. This is an important framework for evaluating and understanding models meant to be used in real-world scenarios, where a representative sample of labeled training data is expensive and/or not all relevant test cases are known at training time. Training models that perform well under this framework requires thinking beyond standard classification by incorporating ideas such as systematic generalization (Bahdanau et al., 2019) and compositionality (Tokmakov et al., 2018).
While the original ZSL setting introduced in Larochelle et al. (2008) was agnostic to the exact methodology, more recent image ZSL approaches almost uniformly use features from very large “backbone” neural networks such as InceptionV2 (Szegedy et al., 2016) and ResNet101 (He et al., 2016) pretrained on the Imagenet dataset (Russakovsky et al., 2015). In terms of absolute performance, this approach appears to be well-justified, as state-of-the-art results on various ZSL (Yosinski et al., 2014; Sun et al., 2017; Huh et al., 2016; Azizpour et al., 2015) and non-ZSL benchmarks (Li et al., 2019; Zhang et al., 2018b; He and Peng, 2017) all learn on top of these pretrained backbones. There is also a valid, intuitive analogy between pretraining on large datasets and how humans are exposed to large amounts of diverse data during their “training” (Biederman, 1987).
However, we have several concerns with this approach. First, relative success in transfer learning has been shown to depend heavily on the choice of backbone encoder pretrained on Imagenet (Xian et al., 2018). In addition, using different datasets in the pretraining phase has been shown to lead to very different transfer learning performance on related tasks (Cui et al., 2018). Next, while Imagenet features have been shown to work well for ZSL tasks with similar image datasets, there is no guarantee that a pretraining dataset exists, or is suitable, for ZSL settings in general. Moreover, it can be hard in practice to meaningfully evaluate a Zero-Shot learner, as some of the benchmark test classes have been shown to be present in the pretraining dataset (Xian et al., 2017).
Finally, we believe this approach misses the point. ZSL isn’t meant to be primarily an exercise of achieving the best absolute performance on a set of benchmarks; it should first and foremost be used as a framework for training, understanding, and evaluating models on their ability to reason about new, unseen concepts. Despite the absolute performance gains of the methods above that use Imagenet features, the use of backbones hyper-optimized for supervised performance on Imagenet and the Imagenet dataset itself represent nuisance variables in a larger effort to understand how to learn generalizable concepts from scratch.
For these reasons and more, we wish to return ZSL to its roots, distinguishing it from the bulk of recent ZSL literature, by “introducing” Zero-Shot Learning from scratch (ZFS). (Footnote 1: This important framework isn’t really new, but it has been ignored by the community, so we feel it is appropriate to give it a new name.) In this setting, the model is evaluated on its ability to perform classification using auxiliary attributes and labels, and is trained using only the data available in the training split of the target dataset. We believe that ZFS will provide researchers with a better experimental framework to understand the factors that improve performance in Zero-Shot generalization.
ZFS is a fundamentally harder setting than that of using encoders pretrained on Imagenet, as we need to learn all lower-level features from scratch in a way that will generalize to unseen concepts. As a good first step towards solving ZFS, we follow intuitions from works on compositionality and locality (Tokmakov et al., 2018; Stone et al., 2017), focusing on methods that encourage good local representations.
The contributions of our work are as follows:
- We introduce Zero-Shot Learning from scratch (ZFS), an extension of the ZSL task, which we believe will be an important benchmark for understanding the generalization properties of learning algorithms, models, and encoder backbones.
- We evaluate several supervised and unsupervised methods on their ability to learn features that generalize in the ZFS setting by training a prototypical network on top of those features (in a similar way to what was done in Snell et al., 2017, with Imagenet features).
- Motivated by compositionality and locality, we introduce a local objective to the above methods and show improvement on the ZFS task.
2 Zero-Shot Learning from scratch
Following the concerns outlined above about performing ZSL with pretrained Imagenet encoders, we introduce Zero-Shot Learning from Scratch (ZFS), which we believe provides a better experimental framework to understand the factors that contribute to generalization to unseen concepts at test-time. On top of the ZSL experimental framework outlined in Larochelle et al. (2008), ZFS simply adds one additional requirement:
ZFS requirement: No model parameters can contain information about (e.g., can be learned from) data outside that from the training split of the target dataset.
For example, for the Caltech-UCSD-Birds-200-2011 (CUB, Wah et al., 2011) dataset, the model to be evaluated under ZFS can only be learned using bird samples from the training split. ZFS removes nuisance variables related to pretraining from the analysis of why different learning algorithms might work, thus making comparisons more meaningful. ZFS also opens the door to using encoder backbones that are non-standard and not hyper-optimized for a different task, the suitability of which is not well understood or studied in the Imagenet-pretrained setting.
2.2 Hypothesis and Approach
For images, we hypothesize that good performance in ZFS requires representations that are locally compositional. For example, in the CUB dataset, locally compositional would mean that the bird representation is a composition of local “part” representations (e.g., representations of “yellow beak”, “black wings”, etc.). We and others (Zhu et al., 2019) believe that representations learned this way in this setting will generalize well, as samples in the train and test splits have similar marginal distributions over the parts, yet vary in their composition (i.e., in their joint distributions). Once good part representations are learned, a typical ZSL method such as prototypical networks (Snell et al., 2017) can be used to classify by matching their composition with given class-specific semantic features.
In this work, we train convolutional image encoders using either supervised or unsupervised learning, then use prototypical networks to perform zero-shot learning on top of these fixed representations. Prototypical networks require minimal parameter and hyper-parameter tuning for ZFS, are well-studied (Huang et al., 2019; Finn et al., 2017), and perform very close to the state of the art on Imagenet-pretrained benchmarks. We encourage the image encoder to extract semantically relevant features at earlier stages of the network by applying an auxiliary local loss to the local feature vectors of a convolutional layer (each of which corresponds to a patch of the image).
A visual representation of our approach is provided in Figure 1, while the training procedure is described in Algorithm 1. We introduce one of two additional auxiliary classification tasks onto the local features of one (or more) of the earlier layers of the encoder. The first auxiliary task, which we refer to as the Local Label-based Classifier (LC), consists of an additional simple classifier, trained with a standard cross-entropy loss over the training classes. The second auxiliary task, the Local Attribute-based Classifier (AC), trains a separate prototypical network to project local features to an embedding space that matches attribute embeddings. The local loss is then added to the model’s base loss (e.g., reconstruction loss for VAEs, discriminator loss for AAE), and all models are trained end-to-end using back-propagation and stochastic gradient descent.
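One possible way to write the combined objective described above is sketched here. The notation is introduced only for illustration and does not come from the text: f_{ij} is the local feature vector at spatial location (i, j) of the chosen H×W feature map, g_φ is the local label classifier, h_ψ and e_ω are the local-feature and attribute embedding functions, a_c is the attribute vector of class c, y is the label of the input image, and d is a distance (squared Euclidean in Snell et al., 2017). The uniform average over locations and the unweighted sum of the two loss terms are assumptions.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{base}} + \mathcal{L}_{\text{local}}

% Local Label-based Classifier (LC): cross-entropy over training classes at each location
\mathcal{L}_{\text{local}}^{\text{LC}}
  = -\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W}
    \log \operatorname{softmax}\big(g_\phi(f_{ij})\big)_{y}

% Local Attribute-based Classifier (AC): prototypical loss against embedded class attributes
\mathcal{L}_{\text{local}}^{\text{AC}}
  = -\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W}
    \log \frac{\exp\!\big(-d(h_\psi(f_{ij}),\, e_\omega(a_y))\big)}
              {\sum_{c \in \mathcal{C}_{\text{train}}} \exp\!\big(-d(h_\psi(f_{ij}),\, e_\omega(a_c))\big)}
```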
3 Related works
3.1 Zero-Shot Learning
Zero-Shot Learning (ZSL) can be formulated as a meta-learning problem (Vinyals et al., 2016). Because no examples from the test distribution are available to fine-tune the model (in contrast to Few-Shot Learning (FSL, Li et al., 2006)), methods that learn an initialization, such as MAML (Finn et al., 2017), are hard to apply. Almost all recent state-of-the-art methods (Akata et al., 2015; Changpinyo et al., 2016; Kodirov et al., 2017; Zhang et al., 2017; Sung et al., 2018) rely instead on metric learning in two steps: (1) learning a suitable embedding function that maps data samples and class attributes to a common subspace, and (2) performing nearest neighbor classification at test-time with respect to the embedded class attributes.
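Step (2) above can be illustrated with a minimal sketch in plain Python. It assumes the sample has already been embedded, that class attributes are mapped by a linear layer, and that squared Euclidean distance is used; the function names, weights, and toy classes are all hypothetical and chosen only to keep the example self-contained.

```python
import math


def embed(vector, weights):
    """Apply a linear map, given as a list of rows, to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_class(sample_embedding, class_attributes, attr_weights):
    """Return the class whose embedded attribute vector is closest to the sample."""
    best_label, best_dist = None, math.inf
    for label, attrs in class_attributes.items():
        dist = squared_distance(sample_embedding, embed(attrs, attr_weights))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy setup (hypothetical values): 3-dimensional attributes mapped to a 2-d space.
attr_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
unseen_classes = {"class_a": [1.0, 0.0, 0.0], "class_b": [0.0, 1.0, 1.0]}
sample = [0.1, 1.8]  # an already-embedded test image

prediction = nearest_class(sample, unseen_classes, attr_weights)
```

Only the attribute vectors of the unseen classes are needed at test-time, which is what allows classification without any labeled test-class images.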
Compositional representations have been a focus in the cognitive science literature (Biederman, 1987; Hoffman and Richards, 1984) with regard to the ability of intelligent agents to generalize to new concepts. There has recently been a renewed interest in learning compositional representations for applications such as computational linguistics (Tian et al., 2016) and generative models (Higgins et al., 2017b). For FSL, Tokmakov et al. (2018) introduced an explicit penalty to encourage compositionality when dealing with attribute representations of classes. Alet et al. (2018) encourage compositionality in meta-learning by using a modular architecture. In this work, we focus on implicit constraints to encourage compositionality, adding a local attribute or label classifier to provide additional local gradients to the encoder.
Local features have been extensively used to improve the performance of different models. Self-attention over local features resulted in large improvements in generative models (Zhang et al., 2018a). Attention over local features is commonly used in image captioning (Li et al., 2017), visual question answering (Kim et al., 2018) and fine-grained classification (Sun et al., 2018).
Deep InfoMax (DIM, Hjelm et al., 2018) is an unsupervised model that learns representations by maximizing the mutual information between local and global features. Performance of DIM is promising for unsupervised representation learning using linear or nonlinear classification probes, and extensions have achieved state-of-the-art results on numerous related tasks (Veličković et al., 2018; Bachman et al., 2019). Because of this, we hypothesize that DIM may learn good representations suitable for ZFS, and this is validated in our experiments.
We use the following standard ZSL benchmarks for ZFS. Animals with Attributes 2 (AwA2, Xian et al., 2018): 30,475 images, 50 different animal classes, attributes of dimension 85. Caltech-UCSD-Birds-200-2011 (CUB, Wah et al., 2011): 11,788 images, 200 different types of birds, attributes of dimension 312. SUN Attribute (SUN, Patterson and Hays, 2012): 14,340 images, 717 types of scenes, attributes of dimension 102. As in Xian et al. (2017), we use normalized versions of these attributes as semantic class representations.
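As a small illustration of attribute normalization, the sketch below scales a class attribute vector to unit length. The text only says the attributes are "normalized"; the choice of L2 normalization here is an assumption, and the function name and example vector are hypothetical.

```python
import math


def l2_normalize(attributes):
    """Scale an attribute vector to unit Euclidean norm (assumed normalization)."""
    norm = math.sqrt(sum(a * a for a in attributes))
    if norm == 0.0:
        # An all-zero vector has no direction; return a copy unchanged.
        return list(attributes)
    return [a / norm for a in attributes]


# Toy 2-dimensional attribute vector; real vectors have 85, 312, or 102 entries.
normalized = l2_normalize([3.0, 4.0])
```

Normalizing puts the class vectors of all three datasets on a comparable scale before they are embedded.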
We perform comparisons between a set of common representation learning methods, both supervised and unsupervised. All methods are trained using only the data in the train set, which includes images, class labels, and class attributes. The methods selected for comparison are: a fully supervised label classifier (FC), variational auto-encoders (VAE), β-variational auto-encoders (β-VAE), adversarial auto-encoders (AAE) and Deep InfoMax (DIM).
|Prototypical Networks (Snell et al., 2017)||
|VAE (Kingma and Welling, 2013)||
|β-VAE (Higgins et al., 2017a)||
|AAE (Makhzani et al., 2015)||
|DIM (Hjelm et al., 2018)||
Common Zero-Shot Learning methods use large encoders pretrained on Imagenet (Russakovsky et al., 2015). To simplify experiments, and because of capacity constraints, we choose to consider only smaller networks. We consider a small encoder derived from the DCGAN architecture (Radford et al., 2015), similar in capacity to those used in early few-shot learning models such as MAML (Finn et al., 2017). We also consider AlexNet (Krizhevsky et al., 2012) to gain insight into the impact of the encoder backbone. It is important to note that, overall, the encoders we use are significantly smaller than the “standard” backbones common in state-of-the-art Imagenet-pretrained ZSL methods. We believe restricting the encoder’s capacity decreases the overall complexity, but does not hinder our ability to understand from our experiments which methods work.
To perform a rigorous evaluation of zero-shot classification tasks, the sets of train and test classes need to be strictly disjoint. Xian et al. (2017) show that this does not hold for the most commonly used train/test splits in previous ZSL work, due to the presence of many test examples in the Imagenet training set. They propose a new data split addressing these issues, the Proposed Split (PS), which we use in our experiments. The number of train/test classes is: CUB - 150/50, AwA2 - 40/10, SUN - 645/72. All models are evaluated on Top-1 accuracy. We pretrain the encoder using each of the previously mentioned methods (strictly on the considered dataset, as per the ZFS requirement). We then train a prototypical network on top of the (fixed) learned representation. For each case, we consider the effect of adding local label-based classifiers (LC) and local attribute-based classifiers (AC) during the encoder training.
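The two evaluation ingredients above, disjoint class sets and Top-1 accuracy, can be sketched as follows. This is an illustrative helper in plain Python, not the evaluation code used for the experiments; the function names and toy labels are hypothetical, and the accuracy shown is the plain (not per-class averaged) variant.

```python
def check_disjoint(train_classes, test_classes):
    """Raise an error if any class appears in both the train and test sets."""
    overlap = set(train_classes) & set(test_classes)
    if overlap:
        raise ValueError(f"train/test classes overlap: {sorted(overlap)}")


def top1_accuracy(predictions, targets):
    """Fraction of samples whose predicted class equals the true class."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy labels (hypothetical): three of the four predictions are correct.
check_disjoint(["sparrow", "finch"], ["albatross", "heron"])
accuracy = top1_accuracy(
    ["albatross", "heron", "heron", "albatross"],
    ["albatross", "heron", "albatross", "albatross"],
)
```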
All models used in this paper have been implemented in PyTorch. We use a batch size of 64. All images have been resized to size. During training, random crops with aspect ratio were performed. During test, center crops with the same ratio were used. While most ZSL approaches do not use crops (because they use pre-computed features), this experimental setup was shown to be efficient in the field of text to image synthesis (Reed et al., 2016). All models are optimized with Adam and a learning rate of 0.0001. The final output of the encoder is of dimension 1024 across all models. Local experiments were performed by extracting features from the third layer in the network. These features have dimension for the AlexNet based encoder and for the simple CNN encoder.
Results of our proposed approach in the framework of ZFS are displayed in Table 1. For all methods considered (with the exception of AAE on AwA2), the addition of local information results in an increase in Top-1 accuracy. For the models whose loss is only based on the global representation (VAE, β-VAE, AAE), the label-based local task performs better than the attribute-based one, but not by a significant margin.
The best performing model is DIM, whose accuracy is comparable to that of the fully supervised model (FC) on CUB and significantly higher on AwA2 and SUN. Moreover, DIM is the only model for which the attribute-based local task shows significant improvement over the label-based one. DIM is by definition a model that strongly leverages local information, which supports our hypothesis that locality is a fundamental ingredient for generalization.
These results suggest that the proposed auxiliary losses can have a significant positive influence on models that already have a notion of locality, but are less effective when the model mostly relies on global information (as a reconstruction task necessarily needs to take into account the whole input).
5 Conclusion and future work
Motivated by the need for more realistic evaluation settings for Zero-Shot Learning methods, we proposed a new evaluation framework where training is strictly performed only on the benchmark data, with no pretraining on additional datasets. In the proposed setting, we hypothesize that the fundamental ingredients for successful transfer learning are locality and compositionality. We propose an auxiliary loss term that encourages these characteristics and evaluate a range of models on the CUB, AwA2 and SUN datasets. We observe that the proposed approach yields improvements in ZSL accuracy, which supports our hypothesis.
- Evaluation of output embeddings for fine-grained image classification.
- Modular meta-learning. arXiv preprint arXiv:1806.10166.
- Factors of transferability for a generic convnet representation. IEEE transactions on pattern analysis and machine intelligence 38 (9), pp. 1790–1802.
- Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910.
- Systematic generalization: what is required and can it be learned? In International Conference on Learning Representations.
- Recognition-by-components: a theory of human image understanding. Psychological review 94 (2), pp. 115.
- Synthesized classifiers for zero-shot learning. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4109–4118.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- Fine-grained image classification via combining vision and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5994–6002.
- Beta-vae: learning basic visual concepts with a constrained variational framework. In ICLR.
- Scan: learning hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389.
- Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
- Parts of recognition. Cognition 18 (1-3), pp. 65–96.
- Centroid networks for few-shot clustering and unsupervised few-shot classification. arXiv preprint arXiv:1902.08605.
- What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614.
- Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574.
- Auto-Encoding Variational Bayes. arXiv e-prints, pp. arXiv:1312.6114.
- Semantic autoencoder for zero-shot learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
- Zero-data learning of new tasks.
- One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 (4), pp. 594–611.
- An analysis of pre-training on object detection.
- Image caption with global-local attention. Thirty-First AAAI Conference on Artificial Intelligence.
- Adversarial autoencoders. CoRR abs/1511.05644.
- SUN attribute database: discovering, annotating, and recognizing scene attributes. In Proceeding of the 25th Conference on Computer Vision and Pattern Recognition (CVPR).
- Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
- Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.
- Teaching compositionality to cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5058–5067.
- Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pp. 843–852.
- Multi-attention multi-class constraint for fine-grained image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 805–821.
- Learning to compare: relation network for few-shot learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.
- Learning semantically and additively compositional distributional representations. arXiv preprint arXiv:1606.02461.
- Learning compositional representations for few-shot recognition. arXiv preprint arXiv:1812.09213.
- Deep graph infomax. arXiv preprint arXiv:1809.10341.
- Matching networks for one shot learning. CoRR abs/1606.04080.
- The caltech-ucsd birds-200-2011 dataset.
- Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence.
- Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly. CoRR abs/1707.00600.
- How transferable are features in deep neural networks? In Advances in neural information processing systems, pp. 3320–3328.
- Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
- Learning a deep embedding model for zero-shot learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 233–248.
- Learning for new visual environments with limited labels. CoRR abs/1901.09079.