The state-of-the-art performance in visual recognition obtained by Convolutional Neural Networks (CNNs) depends on the availability of a large set of annotated training data to learn the model. Since this is rarely the case for many practical tasks of interest (target-tasks), one usually adopts a transfer-learning approach [Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic, Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson, Popescu et al.(2015)Popescu, Etienne, and Le Borgne] which relies on a CNN pre-trained on a source-task with sufficient annotated data (often ImageNet [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei]), then truncated to provide the representations of the samples of the target-task. Then, even with few annotated data, the target-task can usually be learned with a linear classifier. Such approaches raise the question of the similarity between the source-task on which the representation has been learned and the target-task on which it is used. Although this similarity is not easy to formalize, one has the intuition that the closer the two tasks are, the better the representation will be adapted to the target-task. This consideration has led to several methods that aim to obtain more universal representations [Bilen and Vedaldi(2017), Conneau et al.(2017)Conneau, Kiela, Schwenk, Barrault, and Bordes, Kokkinos(2017), Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi, Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti], that is to say, representations that are better adapted to a large set of diverse target-tasks in a transfer-learning scenario.
The general idea of these methods is to diversify the classification problem of the source-task in order to obtain more features, able to adequately represent new target-datasets, from more domains, in a larger context. All these approaches vary the problem by creating new categories that have an existing label. However, most of them studied the effect of adding categories extracted from ImageNet, either generic categories [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Mettes et al.(2016)Mettes, Koelma, and Snoek, Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti] or specific ones [Zhou et al.(2014)Zhou, Lapedriza, Xiao, Torralba, and Oliva, Azizpour et al.(2015)Azizpour, Razavian, Sullivan, Maki, and Carlsson, Bilen and Vedaldi(2017), Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi] that are at the bottom of a hierarchy such as ImageNet, except [Joulin et al.(2016)Joulin, van der Maaten, Jabri, and Vasilache, Vo et al.(2015)Vo, Ginsca, Le Borgne, and Popescu, Vo et al.(2017)Vo, Ginsca, Le Borgne, and Popescu], which use web annotations with noisy labels. In general, the usage of specific categories tends to provide better performance than generic ones [Bilen and Vedaldi(2017), Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi], although combining them can significantly boost the universalizing capacity of the CNN [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]. Yet, even for the most specific categories, there plausibly exists a variety of semantics within the class that is not explored (e.g., one could imagine splitting the object-class according to the different poses of the object). Clearly, the limiting point is the availability of such finer annotations (e.g., poses, contexts, attributes) for existing specific classes.
In this article, we argue that exploring finer classes than the most specific existing ones can significantly increase the diversity of the problem, and therefore improve the universality of the representation learned in the internal layers of the CNN. The main difficulty is the lack of annotation below the most specific levels. We propose to rely on unsupervised learning (clustering) to determine these finer categories within each specific category. Our contribution is three-fold. First, we show that using finer categories rather than the most specific ones to learn CNNs improves the universality of the resulting representation, even when the finer classes are determined randomly within each specific class. Second, a K-means based approach leads to slightly better results, although the resulting clusters are strongly imbalanced. To fix this, our core contribution splits and merges the specific categories to automatically determine better balanced finer classes, leading to better results. Last, we show that CNNs learned with our approach provide a better complement to standard CNN representations than those learned on generic categories.
Let us note that if the target-task has enough data, the representation can be adapted to the target-task by fine-tuning. This is nevertheless out of the scope of this work, because it is a process complementary to transfer-learning itself that will always improve performance, and especially because fine-tuning modifies the representations, which introduces a bias that hides the real ability of a universalizing method [Huh et al.(2016)Huh, Agrawal, and Efros]. Hence, in this paper, we are only interested in studying the universality of the representations, independently of the many possible refinements of a full adaptation method on each target-task.
Previous works [Dong et al.(2013)Dong, Xia, Chen, Feng, Huang, and Yan, Dong et al.(2015)Dong, Chen, Feng, Jia, Huang, and Yan, Xiang et al.(2017)Xiang, Choi, Lin, and Savarese, Chen et al.(2017)Chen, Lu, and Fan] exploited sub-categories in the context of visual recognition. In [Dong et al.(2013)Dong, Xia, Chen, Feng, Huang, and Yan, Dong et al.(2015)Dong, Chen, Feng, Jia, Huang, and Yan], an object instance affinity graph is computed from intra-class similarities and inter-class ambiguities then the visual subcategories are detected by the graph shift algorithm. The process is nevertheless quite computationally demanding and applied to object detection on small target datasets only. In [Xiang et al.(2017)Xiang, Choi, Lin, and Savarese, Chen et al.(2017)Chen, Lu, and Fan], subcategories that are learned from extrapolated feature maps and fine-tuned on a target-dataset, are used within a CNN to improve region proposal for object detection. To the best of our knowledge, our paper is the first to propose the usage of subcategories determined by unsupervised learning on a source-task, in order to improve universality of representations. More related to universality, [Bilen and Vedaldi(2017), Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi, Rebuffi et al.(2018)Rebuffi, Bilen, and Vedaldi]
added annotated data from more domains as well as domain-specific neurons to an initial set of domain-agnostic ones. Contrary to them, our method only modifies the source-problem, at zero annotation cost. Our work is closer to the approach of [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti] that proposes to relabel specific categories into generic ones (that match the upper categorical-levels), to learn an additive CNN with the same architecture. Nevertheless, our approach goes the “opposite way”, that is, it creates finer classes than those at the bottom of a hierarchy (ImageNet), for which no annotation exists and to which their method thus cannot be applied.
We evaluated our proposal on the problem of universality, that is, in a transfer-learning scheme using multiple target-tasks (i.e., ten classification benchmarks from multiple domains, including actions, food, scenes, birds, aircraft, etc.). In particular, in comparable settings (using ILSVRC as source-task and two architectures, AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and DarkNet [Redmon and Farhadi(2017)]), we showed that our method outperforms state-of-the-art ones.
2 Proposed Method
We propose a new universalizing method that consists in training a network on a set of categories that are finer than those of the finest level of a hierarchy (e.g., the ImageNet hierarchy or any set of categories). In Sec. 2.1, we start by describing its general principle as well as two baselines that split the specific categories either randomly or by clustering their features. With such baselines, the number of finer classes must be fixed a priori; we thus propose a “bottom-up clustering-based merging” approach that determines a better splitting automatically (Sec. 2.2). Furthermore, we propose to combine the features learned on the specific categories with those learned on the finer ones to get an even more universal representation (Sec. 2.3).
2.1 FiNet: Network Trained on Finer-Classes
The leaf nodes of the ImageNet hierarchy represent the finest or most specific categories that are annotated. More generally, this is the case for the set of categories of any classification dataset. To go towards our goal of automatically obtaining finer categories (without annotations) from the finest ones, a baseline approach consists in using a random partitioning of the specific categories or a simple clustering of their image features. The first baseline randomly assigns every image of a specific category to one of K clusters. The second one first learns a CNN (denoted SpeNet) on the specific categories, uses one of its layers as a feature extractor for every image, then determines K clusters using K-means on these vectors. A more sophisticated way is our final method, presented in Sec. 2.2. Note that, in all cases, the splitting is performed on specific categories, which already contain quite similar samples/vectors. Once the finer classes are obtained, we train another network (denoted FiNet) on the same images used to train SpeNet, but labeled among the obtained finer-classes. The whole set of finer-classes forms the new finest level of the hierarchy. Our general principle is illustrated in Fig. 1 and presented more formally below.
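The two splitting baselines described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `random_split` and `kmeans_split`, the plain Lloyd iteration and the random initialization are assumptions made for illustration, and the features are assumed to be SpeNet activations stored as one row per image of a single specific category.

```python
import numpy as np

def _nearest_centroid(features, centroids):
    # Squared Euclidean distance from every feature to every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def random_split(n_images, k, rng):
    """Random baseline: assign each image of one category to one of k clusters."""
    return rng.integers(0, k, size=n_images)

def kmeans_split(features, k, rng, n_iter=50):
    """Clustering baseline: plain Lloyd's K-means on the features of one category.

    Hypothetical helper; any K-means implementation could be substituted.
    """
    # Initialize the centroids with k distinct images of the category.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = _nearest_centroid(features, centroids)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Hard assignment of every image to its nearest final centroid.
    return _nearest_centroid(features, centroids), centroids
```

Either function returns, for one specific category, a local cluster index per image; these indices still have to be turned into global finer-class labels before training FiNet.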
Let us consider a semantic hierarchy with hyponymy relations, that is to say, a set of categories organized according to “is-a” relations (e.g., the ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] hierarchy). This hierarchy is a directed acyclic graph made of several levels of nodes, defined by the set of all its nodes and the set of directed edges between them. Each node corresponds to one category at one level of the hierarchy, and each level has its own number of categories. A hierarchy edge indicates that one class subsumes another. Let us also consider an initial dataset containing a set of images labeled among the categories of a given level, each category having its own number of labeled images. Each image of the dataset is associated to one of these categories. For a given image, let us consider its representation extracted from a layer of the network trained on this dataset (i.e., SpeNet) and, for a given category, the set of features extracted from all the images belonging to that category.
In order to construct the finer-categories of each category of this level, we apply a clustering algorithm (e.g., K-means, MeanShift or BUCBAM presented in Sec. 2.2) to its set of features. The algorithm builds centroids, and the feature vector of each image is assigned to the nearest centroid (hard-coding [Liu et al.(2011)Liu, Wang, and Liu]), which forms the finer classes. This process is applied to all the categories, which gives the finer classes that form the nodes of the new finer level. This results in a new dataset in which each image is associated to one of the finer categories. Note that, by construction, every category subsumes all its finer-categories. The whole process results in a new hierarchical level that, together with the previous ones, forms the new hierarchy. Its set of edges corresponds to the union of the initial edges and the edges that connect each category to its finer ones. It is important to point out that, depending on the clustering algorithm, the number of finer classes will either depend on the category or be the same for all categories. This is discussed in the next section.
The new dataset is used to train the FiNet network (softmax cross-entropy loss minimized by SGD), which has as many neurons on its last layer as there are finer classes. FiNet is then used as a feature extractor for the images of the target-tasks.
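The construction of the new finest level can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `build_finer_level` and the integer encoding of categories are hypothetical, and the per-image cluster indices are assumed to come from any of the splitting methods (random, K-means or BUCBAM).

```python
def build_finer_level(specific_labels, cluster_ids):
    """Turn (specific category, local cluster) pairs into global finer classes.

    specific_labels[i] is the specific category of image i and cluster_ids[i]
    its cluster index inside that category. Returns the finer label of each
    image, the new hierarchy edges (specific category -> finer class) and the
    number of finer classes, i.e. the size of the last layer of FiNet.
    """
    pairs = sorted(set(zip(specific_labels, cluster_ids)))
    finer_id = {pair: idx for idx, pair in enumerate(pairs)}
    finer_labels = [finer_id[(c, k)] for c, k in zip(specific_labels, cluster_ids)]
    # Each specific category subsumes all the finer classes created from it.
    edges = [(c, finer_id[(c, k)]) for (c, k) in pairs]
    return finer_labels, edges, len(pairs)
```

For instance, with two specific categories where the first one is split into two clusters and the second one is left whole, the call `build_finer_level([0, 0, 0, 1, 1], [0, 1, 0, 0, 0])` yields three finer classes.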
2.2 Bottom-Up Clustering-Based Merging
We empirically observed (see Fig. 3) that clustering approaches with a fixed K for each category (e.g., K-means) usually lead to a FiNet that gives better universality results than approaches that adapt K to each category (e.g., Affinity-Propagation). Indeed, the latter tend to provide a set of finer classes with many clusters containing few images and a couple of clusters containing a large number of images, leading to an undesirable imbalanced dataset that penalizes the network training. Even if the use of fixed-K clustering methods leads to more balanced data, it remains sub-optimal since it sets the same number of clusters for all specific categories, while this number may depend on the content of each category. Furthermore, in fixed-K clustering methods, the value of K is a hyper-parameter that is cross-validated on the target-tasks, which are not accessible during the learning on the source-task in the context of universality [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]. Hence, the cross-validation of K should be performed only with the source-task, which is not trivial (the optimal K on the source-task is not necessarily the optimal one for the target-tasks). To overcome the drawbacks of fixed-K and adapted-K clustering methods, we propose a hybrid one called “Bottom-Up Clustering-BAsed Merging” (BUCBAM). It roughly starts with the clusters obtained by a fixed-K clustering and automatically sets the number of clusters for each specific category by enforcing a more balanced resulting set of finer-classes. Specifically, it consists of three main steps (illustrated on the left of Fig. 2): (i) splitting the specific categories into K clusters (with a large K); (ii) attaching small clusters to the closest bigger ones (to avoid imbalanced data); and (iii) merging the most similar ones, with respect to a proposed similarity criterion.
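Step (ii), attaching small clusters to the closest bigger ones, can be sketched as below for a single specific category. This is a hedged illustration, not the authors' code: the function name `prune_small_clusters`, the `min_size` parameter and the convention that a cluster is "small" when it has fewer than `min_size` images are assumptions; the nearest neighbour is searched with the Euclidean distance among the features of the large clusters.

```python
import numpy as np

def prune_small_clusters(features, assign, min_size):
    """Re-assign the samples of small clusters to the cluster of their
    nearest neighbour among the samples of the large clusters."""
    assign = assign.copy()
    ids, counts = np.unique(assign, return_counts=True)
    large = ids[counts >= min_size]
    if len(large) == 0:
        return assign  # nothing to attach the small clusters to
    in_large = np.isin(assign, large)
    large_feats, large_assign = features[in_large], assign[in_large]
    for i in np.flatnonzero(~in_large):
        # 1-nearest-neighbour search with squared Euclidean distance.
        d = ((large_feats - features[i]) ** 2).sum(axis=1)
        assign[i] = large_assign[d.argmin()]
    return assign
```

After this step, every remaining cluster of the category contains at least `min_size` images, provided at least one large cluster existed.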
Formally, BUCBAM starts with a large number K of finer-classes per category, K being the same for all categories. Let us assume we have K finer-categories obtained from each category of the previous level through a fixed-K clustering method, and let us consider the whole set of features extracted from the images of a given category through the SpeNet. The goal of BUCBAM is to get a number of clusters that depends on the images of each category. To do so, it first prunes out the small clusters (i.e., all the clusters containing fewer images than a minimum size) by re-assigning each of their samples to the cluster of its closest feature vector, where the closest vector is searched (e.g., with a k-NN algorithm and the Euclidean distance) in the set of features belonging to the other, large clusters. Pruning small clusters for all categories results in a reduced set of finer-classes per class. The last step of BUCBAM is to merge the similar clusters. To do that, a classifier is trained for each cluster, using the features of its samples as positives and the same number of samples from a diverse class as negatives, and evaluated on the images of all other clusters. The diverse category is created by randomly picking elements equiprobably from all the categories. The evaluation of the classifiers provides a similarity matrix for each category, which is used to merge similar clusters and leave dissimilar ones disjoint. More precisely, a first strategy is to consider two clusters symmetrically similar (BUCBAM-SS) if each of the two classifiers obtains, on the other cluster, a score above a high threshold (close to 1). Another strategy is to consider two clusters asymmetrically similar (BUCBAM-AS) if only one of these constraints is respected and the other score is greater than a medium threshold. In both cases (illustrated in Fig. 2), dissimilar clusters are left disjoint, as desired. Merging similar clusters for all classes results in the final set of finer-classes per category.
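The merging rule can be sketched from a precomputed similarity matrix. The sketch below makes several assumptions that are not stated in the text: entry `sim[i, j]` is taken to be the score of the classifier of cluster `i` on the images of cluster `j`, merges are propagated transitively with a union-find structure, and the names `merge_clusters`, `high` and `medium` are hypothetical. Passing `medium=None` gives the symmetric variant (BUCBAM-SS); passing a value gives the asymmetric one (BUCBAM-AS).

```python
import numpy as np

def merge_clusters(sim, high, medium=None):
    """Group the clusters of one category from their similarity matrix.

    Returns one group index per cluster; clusters sharing an index are merged.
    """
    n = sim.shape[0]
    parent = list(range(n))

    def find(i):
        # Root of the group of i, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            a, b = sim[i, j], sim[j, i]
            if medium is None:
                # Symmetric rule: both scores above the high threshold.
                similar = a >= high and b >= high
            else:
                # Asymmetric rule: one score above the high threshold,
                # the other above the medium threshold.
                similar = max(a, b) >= high and min(a, b) > medium
            if similar:
                parent[find(j)] = find(i)

    relabel = {}
    return np.array([relabel.setdefault(find(i), len(relabel)) for i in range(n)])
```

With a three-cluster matrix in which clusters 0 and 1 score highly on each other while cluster 2 is only recognized by the classifier of cluster 1's counterpart in one direction, the symmetric rule keeps cluster 2 apart whereas the asymmetric rule may absorb it, which matches the intended difference between the two variants.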
2.3 SpeFiNet: Combining Specific and Finer Features
Following the approach of [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti], which roughly consists in training initial features on an initial set of categories, then learning new features on a new set of categories and finally combining initial and new features, we propose to learn the new features with our FiNet (rather than a network trained on generic categories [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]) and combine them with the features of the initial SpeNet to get an even more universal representation. This method is denoted SpeFiNet in the following. Formally, the final SpeFiNet representation of an image of a target-task is obtained by normalizing the specific (SpeNet) and finer (FiNet) features of the image with a normalization function, then combining them with a fusion operator. In practice, for the normalization and the fusion, we respectively choose the L-infinity norm and the concatenation. To the best of our knowledge, we are the first to propose to combine a SpeNet (trained on specific categories) and a FiNet (trained on finer categories) to get more universal representations.
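The fusion described above, with L-infinity normalization followed by concatenation, can be sketched in a few lines. The function names and the small `eps` guard against all-zero vectors are assumptions added for illustration; the two inputs stand for the SpeNet and FiNet features of one target image.

```python
import numpy as np

def linf_normalize(x, eps=1e-12):
    """Divide a feature vector by its L-infinity norm (largest absolute value)."""
    return x / max(np.abs(x).max(), eps)

def spefinet_representation(spe_feat, fi_feat):
    """Concatenate the normalized SpeNet and FiNet features of one image."""
    return np.concatenate([linf_normalize(spe_feat), linf_normalize(fi_feat)])
```

Normalizing each part before concatenation keeps the two feature sets on a comparable scale, so that neither network dominates the classifier trained on the target-task.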
3 Experimental Results
Universalizing methods are evaluated in a transfer-learning scheme on multiple target-tasks [Cer et al.(2018)Cer, Yang, Kong, Hua, Limtiaco, John, Constant, Guajardo-Cespedes, Yuan, Tar, et al., Conneau et al.(2017)Conneau, Kiela, Schwenk, Barrault, and Bordes, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]. More precisely, a source-task is used to train a network that acts as a representation extractor on the data of the target-tasks. Each target-task is trained with a simple predictor on top of the representations extracted from the samples of the target-task. Note that, fine-tuning the representations on the target-tasks could always improve performances but induces a bias avoiding correct evaluation of universality [Conneau and Kiela(2018), Huh et al.(2016)Huh, Agrawal, and Efros, Subramanian et al.(2018)Subramanian, Trischler, Bengio, and Pal, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]. Hence, following the literature, simple predictors that do not modify the representations learned on the source-task are used. In particular, here for the target-tasks, we used a classification task with datasets from multiple visual domains (presented below) and for the predictor, we used a one-versus-all SVM classifier for each class. Even if [Conneau and Kiela(2018), Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti] initiated a work around universality evaluation, it seems to remain an open problem. Hence here, since we only have benchmarks that are evaluated in terms of accuracy and precision, we evaluate universalizing methods in terms of average of their performances on the multiple benchmarks.
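As a small worked example of this evaluation measure, the average over benchmarks can be computed as below. The eight scores are the per-benchmark values listed for MulDiP-Net in the AlexNet comparison of this paper; the function name `average_performance` is a hypothetical helper, and mixing accuracy and mAP scores in a single mean follows the protocol stated above.

```python
def average_performance(scores):
    """Mean of per-benchmark scores (accuracy or mAP, in percent)."""
    return sum(scores) / len(scores)

# Per-benchmark scores reported for MulDiP-Net with AlexNet (eight benchmarks).
muldip_alexnet = [69.8, 77.5, 58.3, 47.9, 43.7, 50.2, 37.4, 59.7]
avg = average_performance(muldip_alexnet)
print(round(avg, 1))  # 55.6, matching the last column of that row
```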
For the source-task, we used ILSVRC [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] and ILSVRC* (half of the former, detailed in [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot]). For the target-tasks, we used ten datasets from multiple domains, including general objects (VOC07 [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman], NWO [Chua et al.(2009)Chua, Tang, Hong, Li, Luo, and Zheng], CA101 [Fei-Fei et al.(2006)Fei-Fei, Fergus, and Perona], CA256 [Griffin et al.(2007)Griffin, Holub, and Perona]), scenes (MIT67 [Quattoni and Torralba(2009)]), actions (stACT [Yao et al.(2011)Yao, Jiang, Khosla, Lin, Guibas, and Fei-Fei]), birds (CUB [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie]), plants (FLO [Nilsback and Zisserman(2008)]), food (FOOD [Bossard et al.(2014)Bossard, Guillaumin, and Van Gool]) and airplanes (AIRC [Maji et al.(2013)Maji, Kannala, Rahtu, Blaschko, and Vedaldi]). The characteristics of all the datasets are detailed in supplementary material.
Our method consists in the combination of a SpeNet and a FiNet. For both networks, we used two architectures, namely the classical AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and the deeper DarkNet [Redmon and Farhadi(2017)]. They are respectively trained on the images of ILSVRC* and ILSVRC. SpeNet is thus trained to recognize the specific categories of the corresponding dataset. In contrast, FiNet is trained to recognize a set of finer-classes (whose number depends on the splitting method), for which we used four variants: (i) random splitting with fixed K, denoted Random-K; (ii) K-means clustering with fixed K, denoted Cluster-K; (iii) BUCBAM splitting with asymmetrically similar clusters merging, denoted BUCBAM-AS; and (iv) BUCBAM splitting with symmetrically similar clusters merging, denoted BUCBAM-SS. Note that the BUCBAM methods lead to a number of clusters that depends on the content of each category. In Sec. 3.2, we provide some statistics of the resulting dataset of each method, including the total number of finer-classes. In the Cluster-K and BUCBAM methods, we extract features from the penultimate layer to represent the samples of each class, which results in features of 4096 dimensions for AlexNet and 1000 for DarkNet. Specific to BUCBAM, the initial number of clusters K, the minimum cluster size and the similarity threshold are respectively set to 32, 15 and 0.8. Indeed, K has to be large, and we found that as long as K is larger than 20 our method provides the same splitting result. The minimum cluster size ensures that the network is trained with at least 15 images per class. We obtained similar results with . The similarity threshold is not critical since similar clusters generally provide very high (close to 1.0) classification scores.
|SPV [Azizpour et al.(2015)Azizpour, Razavian, Sullivan, Maki, and Carlsson]||66.6||74.7||54.7||53.2||37.4||45.1||36.0||51.9||52.4|
|SPV [Mettes et al.(2016)Mettes, Koelma, and Snoek, Tamaazousti et al.(2017b)Tamaazousti, Le Borgne, Popescu, Gadeski, Ginsca, and Hudelot]||67.7||73.0||54.3||50.5||37.1||44.9||36.8||50.3||51.8|
|AMECON [Chami et al.(2017)Chami, Tamaazousti, and Le Borgne]||61.1||58.7||40.6||45.8||24.3||32.7||26.1||36.4||44.5|
|WhatMakes [Huh et al.(2016)Huh, Agrawal, and Efros]||64.0||69.4||50.1||45.6||33.7||41.9||15.0||42.8||45.3|
|ISM [Wu et al.(2016)Wu, Li, Kong, and Fu]||62.5||68.8||50.7||28.5||37.9||42.6||34.0||50.0||46.9|
|GrowingBrain-RWA [Wang et al.(2017)Wang, Ramanan, and Hebert]||69.1||74.8||55.9||50.4||40.0||48.4||38.6||56.1||54.2|
|FSFT [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]||67.5||73.9||55.0||44.6||40.4||47.1||38.7||56.8||53.0|
|MuCaLe-Net [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot]||69.5||76.0||56.8||54.7||41.3||48.5||35.6||54.8||54.6|
|MulDiP-Net [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]||69.8||77.5||58.3||47.9||43.7||50.2||37.4||59.7||55.6|
|GenNet [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot]||83.2||91.5||78.1||73.2||64.4||72.6||52.5||78.9||48.5||46.2||68.9|
|MulDiP-Net [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]||84.1||92.7||80.1||73.9||66.4||74.5||61.2||82.1||53.5||49.3||71.8|
3.1 Comparison to the State-of-the-Art
We compare the results obtained by our method with those of the literature, in particular, all the methods re-implemented and reported in [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]. For fair comparisons, we followed their training configuration, and trained our method with an AlexNet network and ILSVRC* as source-task. Moreover, instead of using our more diverse set of ten target-datasets, we used the eight used in their paper. The results are reported in Table 1. We first observe that our methods are always better than the reference method used in [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti, Bilen and Vedaldi(2017), Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi, Rebuffi et al.(2018)Rebuffi, Bilen, and Vedaldi], namely SpeNet. In particular, our best method (BUCBAM) exhibits a boost of 6 points on average compared to SpeNet. Let us also note that SpeFiNet is always better than FiNet, itself better than SpeNet, regardless of the splitting method. More precisely, the BUCBAM splitting method is significantly better than the best Cluster one, without the high cost of cross-validating the K parameter. Compared to state-of-the-art methods, ours achieves the best performance, that is, almost 2 points of improvement compared to the most competitive MulDiP-Net method [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti], while it surpasses all other methods by more than 3 points. A last salient result is the fact that a SpeFiNet (whatever the splitting method) is significantly better than MuCaLe-Net [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot], which has been trained on the best generic categories (manually obtained from categorical-levels [Tamaazousti et al.(2016)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot]). This last result clearly demonstrates that combining features trained on specific categories with those trained on finer categories is better than combining them with those trained on generic categories.
Furthermore, since [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti] reported better results with a deeper network (DarkNet) trained on the full ILSVRC, we also implemented our method in the same configuration. Since MulDiP-Net provides the best results of the literature on the problem of universality, we only compare to it in this setting. The results are reported in Table 2. While the improvement is smaller, our method still beats the competitive MulDiP-Net. Moreover, a salient observation is that our method tends to be much better than theirs on the fine-grained classification benchmarks, which are more challenging. As in the previous setting, SpeFiNet is better than FiNet, which is itself better than SpeNet. We also compared to GenNet (which is the generic sub-component of MulDiP-Net [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti]) and we observe that FiNet-BUCBAM is better by 0.9 points.
3.2 Analysis and Comparison to Baselines
In this section, we perform an in-depth analysis of our method through an ablation study, a comparison to baselines and a visualization of some statistics. In particular, in the supplementary material we compared our method to multiple clustering baselines, namely Spectral Clustering and Affinity Propagation. The former provides a fixed set of clusters per category, while the latter leads to a dynamic set of clusters. A summary of the results is presented on the left of Fig. 3, where we plot a bar for each method that represents its average performance on the set of 10 benchmarks described in Sec. 3. We also tested the Mean-Shift algorithm with many different bandwidth values, but it always led to many clusters with one or two images and one cluster containing all the remaining images. Since this setting provides very low results, we did not report them. From these results, we observe that our BUCBAM method is better than all the baselines, including other existing algorithms. Rather than the average performances, the diagram on the right of Fig. 3 illustrates the detailed results (on the ten benchmarks) of the SpeFiNet-BUCBAM, FiNet-BUCBAM and SpeNet methods. We clearly observe that the diagram of our SpeFiNet-BUCBAM covers that of FiNet-BUCBAM, which itself covers that of the reference SpeNet method. In addition, we provide in the supplementary material the detailed results of all the methods on all the target-tasks.
In Figure 8, we visualize some of the clusters obtained by each splitting method (random, clustering and BUCBAM). To do so, we highlight three clusters for two specific categories (two blocks of three rows of five images). On the left, the clusters are determined from a random distribution within the full specific category, leading to clusters that contain its full diversity. On the contrary, with the K-means clustering (middle), the clusters exhibit a more coherent aspect. For example, for the goldfish category, one cluster shows close-up views of fish that are mostly seen in profile. We observe a similar behaviour for the banjo category. With the proposed BUCBAM method (right), the clusters are even more specific than in the K-means case. For instance, for the goldfish category, we clearly identify a cluster that represents “many goldfishes”, one that represents “one goldfish in a close-up view” and one with images in which the fish tank is visible. For the banjo class, we also clearly observe that our method identified a cluster that represents “person playing banjo” and even “person playing banjo in a concert”. Importantly, while the clustering method tends to result in duplicate clusters, ours tends to provide only dissimilar ones, thanks to our merging process.
In the supplementary material, we also provide some statistics of our method and baselines (i.e., histograms of the average number of clusters per category and histograms of the intra-class variance of the clusters) and more visualizations of the obtained clusters, through the visualization of some images in some clusters and of the features of each image of the clusters in a 2D space, after performing a PCA on their full features. This highlights the clear interest of our method, in terms of cluster relevance and balance of the resulting data, compared to the random and clustering baselines.
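The two kinds of diagnostics mentioned above, a 2D PCA projection of the features and a per-cluster variance statistic, can be sketched with NumPy alone. The function names are hypothetical, and the exact definition of intra-class variance used for the histograms is not given in the text; the sketch assumes the sum over dimensions of the per-dimension variance.

```python
import numpy as np

def pca_2d(features):
    """Project the features onto their two first principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def intra_cluster_variance(features, assign):
    """Total variance (summed over dimensions) of the features of each cluster."""
    return {int(c): float(features[assign == c].var(axis=0).sum())
            for c in np.unique(assign)}
```

The 2D coordinates returned by `pca_2d` can then be scattered with one colour per cluster to inspect how well the clusters of a category separate.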
In this paper, we tackled the problem of universality of representations with a new method relying on categories that are finer than the most specific ones of the ImageNet hierarchy. Since the latter are the finest that are annotated, we proposed a method that automatically adds a hierarchical level to the ImageNet hierarchy. A network trained on the categories of such a finer level provides a more universal representation than one trained on the upper levels. In practice, it leads to significantly better results in a transfer-learning scheme, on 10 publicly available datasets from diverse domains.
We also showed that a K-means and, surprisingly, a random partitioning of the leaf nodes of ImageNet already give interesting results, although below those of the proposed approach. This nevertheless suggests that the general principle highlighted in this article could be fruitful to design new CNN-based representations that are more universal in a transfer-learning context. Furthermore, it should be noted that our principle is limited neither to the ImageNet hierarchy nor to the classification task. Indeed, it could be applied to any hierarchy or dataset and to other tasks, such as detection, segmentation or keypoint estimation, as considered in [Wang et al.(2018)Wang, Russakovsky, and Ramanan].
The following reports some supplementary material that is not required to understand the main article but provides complements or illustrations. The additional elements were produced using the same version of the approach explained in our main paper and include the following items: (i) the detailed characteristics of the datasets used in this paper (Section A); (ii) detailed results of the comparison of our method with the baselines (Section B); and finally (iii) some illustrations of the clusters obtained by the different methods as well as some statistics (Section C).
Appendix A Datasets: Detailed Characteristics
In Table 3, we report the characteristics of all datasets used in the article to learn a CNN on a source-task and to estimate the performance of a universalizing method, that is to say, its performance on a set of target-tasks in the context of transfer-learning. As source-task, we used the most commonly used dataset, namely ILSVRC [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei], which is a subset of ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] that contains more than a million images labeled among 1,000 specific categories. We also follow the literature [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti] for fair comparisons and thus also used as source-task the ILSVRC* dataset, which corresponds to half of ILSVRC. Regarding the target-tasks, we follow the literature [Bilen and Vedaldi(2017), Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi, Rebuffi et al.(2018)Rebuffi, Bilen, and Vedaldi, Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti] and used ten target-datasets in a classification task. In particular, we used benchmarks from various domains, namely objects, actions and scenes, as well as fine-grained objects such as aircraft, birds, cars and plants. To show the visual variability of the target-datasets chosen to evaluate universalizing methods, we report in Figure 5 some example images from each of them.
Table 3: For each dataset, we detail seven characteristics. Each column of the table corresponds to one characteristic: (1) domain of the images; (2) number of categories; (3) average number of training images per category; (4) whether the dataset contains multiple categories per image (✓) or not (✗); (5) number of training examples; (6) number of test examples; and (7) the standard evaluation metric (Accuracy and mean Average Precision, respectively denoted by Acc. and mAP). Example images of each dataset are presented in Figure 5.
Appendix B Comparison to Baselines: Detailed Results
In Figure 3 of the main paper, we reported summary results of the comparison of our methods with several baselines. Here, we provide detailed results, that is to say, the results of all the methods on each benchmark as well as their average performance over all of them. This is reported in Table 4. Even if already mentioned in the main paper, let us recall the most salient results: (i) SpeFiNet is always better than FiNet, which is always better than SpeNet, regardless of the splitting method; (ii) the proposed BUCBAM splitting method gives better results than the best Kmeans one, at zero cost of parameter cross-validation; and (iii) the proposed BUCBAM is always better than all other methods, especially Random, Spectral and Affinity. Additionally, in Figure 6, we display the average performances of the FiNet-Kmeans and SpeFiNet-Kmeans variants for different numbers of clusters, which are compared to the performance of a classical SpeNet.
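To make the Random and Kmeans baselines compared above more concrete, the following is a minimal sketch of how the images of one specific category could be split into finer sub-categories. It is an illustration under stated assumptions, not the implementation used in the paper: the function names (`random_split`, `kmeans_split`, `finer_label`), the plain Lloyd iteration and the label-encoding scheme are all hypothetical, and the features are assumed to be plain lists of floats.

```python
import random


def random_split(n_samples, k, seed=0):
    """Assign each sample of one category to one of k sub-categories at random,
    keeping the sub-categories as balanced as possible (hypothetical helper)."""
    rng = random.Random(seed)
    labels = [i % k for i in range(n_samples)]
    rng.shuffle(labels)
    return labels


def _sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_split(features, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means on the feature vectors of one category.
    Assumption: the paper's Kmeans baseline may differ in initialization
    and stopping criterion."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every sample.
        labels = [min(range(k), key=lambda c: _sq_dist(f, centroids[c]))
                  for f in features]
        # Update step: mean of the members (an empty cluster keeps its centroid).
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def finer_label(category_id, sub_label, k):
    """Map (original category, sub-category) to one finer-level class index.
    Assumption: every category is split into the same number k of parts."""
    return category_id * k + sub_label
```

With such helpers, each image of category `c` assigned to sub-category `s` would receive the finer-level class `finer_label(c, s, k)`, and a network could then be trained on these finer classes.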
Appendix C Splitting Methods: Statistics and Visualization
In this section, we illustrate some interesting properties of the random, clustering and BUCBAM splitting methods. In particular, we first highlight some statistics in Figure 7. At the top, we plot, for each method, the histogram of the number of images per cluster for all the specific categories of the initial dataset. Note that the more the histogram forms a pointed spike, the more the data are balanced. Here we clearly observe that the random splitting method provides the most balanced data, while in contrast the other methods tend to produce clusters of various sizes, that is, highly imbalanced data. However, although they provide imbalanced data, it is important to mention that clustering and BUCBAM provide relevant clusters that are based on the semantics encoded in the image features, and which thus contain samples that are more visually similar. Let us also note that, compared to the histogram of clustering, which has a long tail and starts near zero, the histogram of BUCBAM is flatter and starts around 20, meaning that no tail is modeled in the data and very small clusters are not considered.
At the bottom of Figure 7, we report the histogram of the intra-class variance of the clusters obtained from all the specific categories by the three splitting methods. Here, it is important to note that a small width of the histogram means that the clusters contain very similar samples. While random provides the smallest peak width, this is due to the fact that it has almost the same number of images per cluster, and thus it is not relevant. In contrast, BUCBAM, which provides clusters of widely varying sizes, also yields a peak that is narrower than that of clustering, meaning that it provides clusters with more similar samples.
While we previously reported some global statistics of the clusters resulting from the different splitting methods, here we show the most representative samples of the clusters obtained by each splitting method. This is reported in Figure 8, in which we highlight three clusters (three rows of images) for five specific categories (five blocks of three rows of images). On the left, the clusters are determined from a random distribution within the full specific category, leading to clusters that contain its full diversity. Note, by the way, how diverse the specific categories are, and imagine how diverse generic categories (used in [Tamaazousti et al.(2016)Tamaazousti, Le Borgne, and Hudelot, Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot]) could be, which may explain why the GenNet of [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot] may not provide good results (since it is hard to discover relevant features from such categories). On the contrary, with our splitting method and more precisely the K-means clustering (middle), the clusters exhibit a more coherent aspect. For example, for the goldfish category, the clusters show close-up views of fish that are mostly seen in profile. We observe a similar behaviour for two of the clusters of the bicycle category. With the method we propose (right), the clusters are even more specific than in the K-means case. For instance, for the goldfish category, we clearly identify a cluster that represents “many goldfish”, another that represents “one goldfish in a close-up view”, and a third with images in which the fish tank is visible. Similarly, for the banjo class, we clearly observe that our method identified a cluster that represents “person playing banjo” and even one that represents “person playing banjo in a concert”. Importantly, while the clustering method tends to result in duplicate clusters, ours tends to provide only dissimilar ones, thanks to our merging process.
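As an illustration of the balance statistic discussed above, the sketch below computes the number of images per cluster and bins these sizes into a histogram. The helper names (`cluster_sizes`, `size_histogram`, `imbalance_ratio`), the bin width and the max/min ratio are assumptions introduced for this example only; the paper does not specify how the histograms of Figure 7 were computed.

```python
from collections import Counter


def cluster_sizes(labels):
    """Number of images in each cluster, given one cluster label per image."""
    return sorted(Counter(labels).values())


def size_histogram(sizes, bin_width):
    """Count clusters per size bin; the key b is the lower edge of the bin
    [b, b + bin_width)."""
    hist = Counter(size // bin_width for size in sizes)
    return {b * bin_width: hist[b] for b in sorted(hist)}


def imbalance_ratio(sizes):
    """Largest cluster size divided by the smallest one
    (1.0 means perfectly balanced clusters)."""
    return max(sizes) / min(sizes)


# Toy example: a balanced split versus a skewed one over 12 images.
balanced = cluster_sizes([0, 1, 2] * 4)
skewed = cluster_sizes([0] * 9 + [1] * 2 + [2])
print(size_histogram(balanced, 5), imbalance_ratio(balanced))
print(size_histogram(skewed, 5), imbalance_ratio(skewed))
```

A balanced split concentrates all clusters in a single bin (a pointed spike), whereas a skewed split spreads them over several bins.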
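The variance statistic above can be sketched as follows. We assume here, for illustration only, that the variance of a cluster is the mean squared Euclidean distance of its feature vectors to the cluster centroid; the exact definition used for Figure 7 is not given in the text, and the function name `intra_cluster_variance` is hypothetical.

```python
def intra_cluster_variance(features, labels):
    """Return {cluster label: mean squared distance of members to the centroid}.
    Assumption: this is one possible definition of intra-cluster variance."""
    clusters = {}
    for feat, lab in zip(features, labels):
        clusters.setdefault(lab, []).append(feat)
    variances = {}
    for lab, members in clusters.items():
        # Centroid: coordinate-wise mean of the members.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total = sum(
            sum((x - c) ** 2 for x, c in zip(feat, centroid))
            for feat in members
        )
        variances[lab] = total / len(members)
    return variances


# Toy example: cluster 0 is spread out, cluster 1 holds identical samples.
feats = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 10.0]]
print(intra_cluster_variance(feats, [0, 0, 1, 1]))
```

A histogram of these per-cluster values over all specific categories would then play the role of the bottom row of Figure 7, a narrower peak indicating clusters with more similar samples.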
Finally, we also computed a principal component analysis of the representations of each specific category and projected the vectors on the first two principal components, keeping a different color for each (new) finer category (Figure 9). As expected, with the random split, the vectors are uniformly distributed while the two other methods tend to form some groups. Although these results are qualitative, one can see that the proposed BUCBAM method exhibits slightly more grouped points than the K-means.
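The projection described above can be sketched with the standard library only. This is a minimal illustration, assuming that features are lists of floats and using power iteration with deflation to obtain the first two principal components; the function names are hypothetical, and a numerical library would normally be preferred for real feature dimensions.

```python
import math
import random


def _center(data):
    """Subtract the per-dimension mean from every vector."""
    n = len(data)
    mean = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, mean)] for row in data]


def _covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def _matvec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def _top_eigenvector(matrix, n_iter=500, seed=0):
    """Dominant eigenvector and eigenvalue by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = _matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # null matrix: keep the current unit vector
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, _matvec(matrix, v)))
    return v, eigenvalue


def pca_2d(data):
    """Project the vectors on their first two principal components."""
    centered = _center(data)
    cov = _covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        v, lam = _top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]
```

Each returned 2D point could then be drawn with a color given by its finer category, as in Figure 9. Note that the sign of each principal axis is arbitrary, so the plot may be mirrored from one run configuration to another.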
- [Azizpour et al.(2015)Azizpour, Razavian, Sullivan, Maki, and Carlsson] Hossein Azizpour, Ali Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Factors of transferability for a generic convnet representation. PAMI, 2015.
- [Bilen and Vedaldi(2017)] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv:1701.07275, 2017.
- [Bossard et al.(2014)Bossard, Guillaumin, and Van Gool] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
- [Cer et al.(2018)Cer, Yang, Kong, Hua, Limtiaco, John, Constant, Guajardo-Cespedes, Yuan, Tar, et al.] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
- [Chami et al.(2017)Chami, Tamaazousti, and Le Borgne] Ines Chami, Youssef Tamaazousti, and Hervé Le Borgne. Amecon: Abstract meta concept features for text-illustration. In ICMR, 2017.
- [Chen et al.(2017)Chen, Lu, and Fan] Tao Chen, Shijian Lu, and Jiayuan Fan. S-cnn: Subcategory-aware convolutional networks for object detection. PAMI, 2017.
- [Chua et al.(2009)Chua, Tang, Hong, Li, Luo, and Zheng] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. Nus-wide: A real-world web image database from national university of singapore. In ACM Conference on Image and Video Retrieval, CIVR, 2009.
- [Conneau and Kiela(2018)] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.
- [Conneau et al.(2017)Conneau, Kiela, Schwenk, Barrault, and Bordes] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364, 2017.
- [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- [Dong et al.(2013)Dong, Xia, Chen, Feng, Huang, and Yan] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, and Shuicheng Yan. Subcategory-aware object classification. In CVPR, 2013.
- [Dong et al.(2015)Dong, Chen, Feng, Jia, Huang, and Yan] Jian Dong, Qiang Chen, Jiashi Feng, Kui Jia, Zhongyang Huang, and Shuicheng Yan. Looking inside category: subcategory-aware object recognition. Transactions on Circuits and Systems for Video Technology, 2015.
- [Everingham et al.()Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge 2012.
- [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge. IJCV, 2010.
- [Fei-Fei et al.(2006)Fei-Fei, Fergus, and Perona] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. PAMI, 2006.
- [Griffin et al.(2007)Griffin, Holub, and Perona] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
- [Huh et al.(2016)Huh, Agrawal, and Efros] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer learning? arXiv:1608.08614, 2016.
- [Joulin et al.(2016)Joulin, van der Maaten, Jabri, and Vasilache] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
- [Kokkinos(2017)] Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
- [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- [Liu et al.(2011)Liu, Wang, and Liu] Lingqiao Liu, Lei Wang, and Xinwang Liu. In defense of soft-assignment coding. In ICCV, 2011.
- [Maji et al.(2013)Maji, Kannala, Rahtu, Blaschko, and Vedaldi] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
- [Mettes et al.(2016)Mettes, Koelma, and Snoek] Pascal Mettes, Dennis Koelma, and Cees G. M. Snoek. The imagenet shuffle: Reorganized pre-training for video event detection. In ICMR, 2016.
- [Nilsback and Zisserman(2008)] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. IEEE Computer Vision, Graphics & Image Processing, 2008.
- [Oquab et al.(2014)Oquab, Bottou, Laptev, and Sivic] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
- [Popescu et al.(2015)Popescu, Etienne, and Le Borgne] Adrian Popescu, Gadeski Etienne, and Hervé Le Borgne. Scalable domain adaptation of convolutional neural networks. preprint arXiv:1512.02013, 2015.
- [Quattoni and Torralba(2009)] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In CVPR, 2009.
- [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. CoRR, 2014.
- [Rebuffi et al.(2017)Rebuffi, Bilen, and Vedaldi] S-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
- [Rebuffi et al.(2018)Rebuffi, Bilen, and Vedaldi] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In CVPR, 2018.
- [Redmon and Farhadi(2017)] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017.
- [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015.
- [Subramanian et al.(2018)Subramanian, Trischler, Bengio, and Pal] Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR, 2018.
- [Tamaazousti et al.(2016)Tamaazousti, Le Borgne, and Hudelot] Youssef Tamaazousti, Hervé Le Borgne, and Céline Hudelot. Diverse concept-level features for multi-object classification. In ICMR, 2016.
- [Tamaazousti et al.(2017a)Tamaazousti, Le Borgne, and Hudelot] Youssef Tamaazousti, Hervé Le Borgne, and Céline Hudelot. Mucale-net: Multi categorical-level networks to generate more discriminating features. In CVPR, 2017a.
- [Tamaazousti et al.(2017b)Tamaazousti, Le Borgne, Popescu, Gadeski, Ginsca, and Hudelot] Youssef Tamaazousti, Hervé Le Borgne, Adrian Popescu, Etienne Gadeski, Alexandru Ginsca, and Céline Hudelot. Vision-language integration using constrained local semantic features. CVIU, 2017b.
- [Tamaazousti et al.(2018)Tamaazousti, Le Borgne, Hudelot, Seddik, and Tamaazousti] Youssef Tamaazousti, Hervé Le Borgne, Céline Hudelot, Mohamed El Amine Seddik, and Mohamed Tamaazousti. Learning more universal representations for transfer-learning. arXiv:1712.09708, 2018.
- [Vo et al.(2015)Vo, Ginsca, Le Borgne, and Popescu] Phong Vo, Alexandru Lucian Ginsca, Hervé Le Borgne, and Adrian Popescu. Effective training of convolutional networks using noisy web images. In proc. 13th International Workshop on Content-Based Multimedia Indexing (CBMI 2015), 2015.
- [Vo et al.(2017)Vo, Ginsca, Le Borgne, and Popescu] Phong D Vo, Alexandru Ginsca, Hervé Le Borgne, and Adrian Popescu. Harnessing noisy web images for deep representation. CVIU, 2017.
- [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
- [Wang et al.(2018)Wang, Russakovsky, and Ramanan] Jingyan Wang, Olga Russakovsky, and Deva Ramanan. The more you look, the more you see: towards general object understanding through recursive refinement. In WACV, 2018.
- [Wang et al.(2017)Wang, Ramanan, and Hebert] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. In CVPR, 2017.
- [Wu et al.(2016)Wu, Li, Kong, and Fu] Yue Wu, Jun Li, Yu Kong, and Yun Fu. Deep convolutional neural network with independent softmax for large scale face recognition. In ACM, 2016.
- [Xiang et al.(2017)Xiang, Choi, Lin, and Savarese] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Subcategory-aware convolutional neural networks for object proposals and detection. In WACV, 2017.
- [Yao et al.(2011)Yao, Jiang, Khosla, Lin, Guibas, and Fei-Fei] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
- [Zhou et al.(2014)Zhou, Lapedriza, Xiao, Torralba, and Oliva] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.