Training classifiers on specific, non-generic domains requires a non-negligible data annotation effort. Although annotation might be easy for some target categories, it can become too costly for others due to the long-tail distribution of samples. This has motivated the development of models that can be trained on little data (few-shot learning) or no data at all (zero-shot learning). In zero-shot classification (ZSC) [5], we are given a set of labeled samples from a known set of categories, and the goal is to learn a model that is able to cast predictions over a set of categories not seen during training. Although the difficulties and particularities of evaluating different approaches have been identified [9], little attention has been paid to the effect of the choice of training class partition for a given problem. Given the large number and diversity of models proposed in the recent literature, we believe this is an important factor to consider when choosing between competing approaches. In this work, we start exploring this problem. Our preliminary experiments, using datasets of varying granularity and two simple baselines, confirm our hypothesis: performance differences reported in the literature might not be as significant as they seem, given the large variability observed across different subsets of training classes.
2 Experiments and Discussion
In ZSC we are given a training set $\mathcal{D}^{tr} = \{(x_i, y_i)\}_{i=1}^{N}$, with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}^{tr}$. The goal is to learn a mapping from $\mathcal{X}$ that can be used to classify samples over a different set of classes $\mathcal{Y}^{te}$. We consider the standard ZSC setting, where $\mathcal{Y}^{tr} \cap \mathcal{Y}^{te} = \emptyset$. Given a semantic representation $s_y$ (e.g. an attribute vector) for each class $y$, a common approach [1, 7] is to learn a compatibility function $f(x, y)$ that reflects the degree to which $x$ and $y$ agree on a given concept. Given a test sample $x$, its class is predicted as $\hat{y} = \arg\max_{y \in \mathcal{Y}^{te}} f(x, y)$.
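As a concrete illustration, the sketch below implements a linear compatibility function in the spirit of ESZSL [7], where $f(x, y) = x^\top V s_y$ and $V$ admits a closed-form ridge-regression solution. Function names and hyperparameter values are illustrative, not the exact setup of the experiments.

```python
import numpy as np

def fit_linear_compatibility(X, Y_onehot, S_train, lam=0.1, gamma=0.1):
    """ESZSL-style closed-form fit of V in f(x, y) = x^T V s_y.

    X: (d, n) feature matrix, one column per training sample.
    Y_onehot: (n, c) one-hot labels over the c training classes.
    S_train: (a, c) attribute matrix, one column per training class.
    """
    d = X.shape[0]
    a = S_train.shape[0]
    # Regularized closed form: V = (X X^T + lam I)^-1 X Y S^T (S S^T + gamma I)^-1
    A = X @ X.T + lam * np.eye(d)
    B = S_train @ S_train.T + gamma * np.eye(a)
    V = np.linalg.solve(A, X @ Y_onehot @ S_train.T) @ np.linalg.inv(B)
    return V  # (d, a)

def predict(V, X_test, S_test):
    """Score every unseen class and return the argmax per test sample."""
    scores = X_test.T @ V @ S_test  # (n_test, c_test)
    return scores.argmax(axis=1)
```

At test time, `S_test` holds the attribute vectors of the unseen classes, so the same learned $V$ transfers knowledge from seen to unseen categories.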
The work of Xian et al. [9] identified several problems in the evaluation methodology used in the ZSC literature. One key contribution of their work was the proposal of a fixed set of train/test class splits for different datasets. Although this addresses many of the evaluation problems identified in [9], it does not consider the effect of varying the training class partition. We believe that analyzing not only the mean but also the variability of zero-shot predictive performance under changing training configurations is an important step towards a more thoughtful evaluation of the different methods. Our work is a first step in that direction.
The table below shows the mean and standard deviation over different training partitions for two simple baselines, ESZSL [7] and SJE [1], on two fine-grained (SUN [6] and CUB [8]) and two coarse-grained (AWA1 [5] and AWA2 [9]) datasets, using different class partitions sampled at random. We observe a great deal of variability, both for the average accuracy and for the average per-class accuracy, which corrects for sample imbalance across classes. Differences in performance as reported in the literature might thus bias the selection of one method over another even when the difference is not statistically significant. The table also shows p-values of a Wilcoxon signed-rank test computed from 22 different partitions chosen at random. While for the fine-grained datasets we can reject the null hypothesis at a fairly low significance level, this is not the case in the coarse-grained regime. Although the difference in mean values seems high (by the standards observed in the literature), the variability observed in the experiments warns against choosing one method over the other.
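The paired significance test described above can be reproduced with SciPy. The accuracy values below are synthetic placeholders, not the measured results; only the testing procedure is illustrated.

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic per-partition accuracies for two methods evaluated on the
# same 22 random training class partitions (paired observations).
rng = np.random.default_rng(0)
acc_a = rng.normal(0.56, 0.02, size=22)           # method A
acc_b = acc_a + rng.normal(0.005, 0.02, size=22)  # method B, shifted slightly

# Wilcoxon signed-rank test on the paired differences: a high p-value
# means the apparent gap may not be statistically significant.
stat, p_value = wilcoxon(acc_a, acc_b)
print(f"p-value: {p_value:.3f}")
```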
Ensemble learning for ZSC.
Beyond identifying the variability problem, we ran experiments using standard ensemble techniques as an attempt to mitigate its effect. The idea is that, by combining more than one predictor into a single model, it is possible to reduce variance by averaging [3]. One popular approach is the Bootstrap Aggregation (Bagging) meta-algorithm [2]. It is based on learning different predictors from different subsets of the training samples and aggregating them via a suitable voting scheme. Given $m$ base predictors $f_1, \dots, f_m$, the hard voting scheme assigns the class predicted by the majority, i.e. $\hat{y} = \mathrm{mode}\{\hat{y}_1, \dots, \hat{y}_m\}$ with $\hat{y}_j = \arg\max_{y} f_j(x, y)$. In soft voting, the prediction is given by the highest aggregate score over all the models, i.e. $\hat{y} = \arg\max_{y} \sum_{j=1}^{m} f_j(x, y)$.
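A minimal sketch of the two voting schemes (function names are ours, not from a particular library):

```python
import numpy as np

def hard_vote(preds):
    """Majority vote. preds: (m, n) array of class predictions from m models."""
    out = np.empty(preds.shape[1], dtype=preds.dtype)
    for i in range(preds.shape[1]):
        classes, counts = np.unique(preds[:, i], return_counts=True)
        out[i] = classes[counts.argmax()]  # most frequent prediction for sample i
    return out

def soft_vote(scores):
    """Aggregate compatibility scores.

    scores: (m, n, c) scores from m models over c classes;
    sum over models, then argmax over classes.
    """
    return scores.sum(axis=0).argmax(axis=1)
```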
In the context of ZSC, we use different (random) subsets of training categories to generate the set of base predictors, i.e. we learn a set of $m$ predictors, each using a proportion $p$ of randomly chosen classes from the original training set. We use hard and soft ensembles for different values of $m$ and $p$, where a given $p$ means training each of the $m$ models on that fraction of the full set of training categories. Each sub-problem is trained on a different subset of training classes, and we sample 4 different sub-problems for each combination. Baseline performances, obtained by training on the full set of training categories, are reported for SUN, CUB, AWA1 and AWA2. We use the ResNet101 features [4] and continuous attribute vectors from [9], normalizing both to unit norm. We found that, as the proportion $p$ increases, performance approaches the baseline, which is to be expected since the sampled set of training categories tends to resemble the original one. The standard deviation may marginally decrease, but at a considerable loss in performance. This situation is more noticeable for AWA1 and AWA2, both coarse-grained datasets, than for the others. Table 1 shows the ensemble results (different combinations of voting schemes and accuracy metrics lead to similar conclusions). Beyond these observations, the use of ensembles does not lead to an increase in overall ZSC performance. Alternatives to this formulation are the topic of our current research.
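Sampling the class subsets that define each sub-problem can be sketched as follows (a simplified illustration of the procedure, not the exact experimental code):

```python
import numpy as np

def sample_class_subsets(train_classes, p, m, seed=0):
    """Draw m random class subsets, each containing a proportion p of the
    training classes; each subset defines one base predictor's sub-problem."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(p * len(train_classes))))
    return [sorted(rng.choice(train_classes, size=k, replace=False))
            for _ in range(m)]
```

Each base predictor is then trained only on the samples whose labels fall in its subset, and the trained models are combined with hard or soft voting at test time.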
Table 1: Ensemble results, mean (std) accuracy per dataset.

    SUN     55.61 (2.16)   56.81 (2.02)   56.77 (1.98)   57.03 (1.73)
    CUB     50.89 (2.92)   53.45 (2.84)   54.39 (2.84)   54.83 (2.72)
    AWA1    65.35 (6.52)   68.38 (7.49)   69.70 (7.63)   70.52 (7.31)
    AWA2    66.90 (3.70)   70.39 (4.23)   72.16 (4.26)   73.13 (4.52)
Mean (std) accuracy over random training class partitions:

                              SUN            CUB            AWA1           AWA2
    Avg. acc.          ESZSL  55.90 (1.95)   53.49 (2.10)   69.66 (9.94)   71.10 (10.94)
                       SJE    59.16 (2.37)   56.08 (3.03)   68.85 (7.96)   68.84 (11.16)
    Avg. per-class     ESZSL  55.92 (1.94)   53.81 (2.20)   69.34 (9.02)   71.48 (9.54)
    acc.               SJE    59.73 (2.17)   56.19 (2.44)   69.48 (8.27)   69.34 (9.63)
- (1) Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
- (2) L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
- (3) T. G. Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
- (4) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- (5) C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.
- (6) G. Patterson and J. Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758. IEEE, 2012.
- (7) B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.
- (8) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
- (9) Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.