Learning More Universal Representations for Transfer-Learning

12/27/2017 ∙ by Youssef Tamaazousti, et al. ∙ MIT CEA 0

Transfer learning is commonly used to address the problem of the prohibitive need in annotated data when one want to classify visual content with a Convolutional Neural Network (CNN). We address the problem of the universality of the CNN-based representation of images in such a context. The state-of-the-art consists in diversifying the source problem on which the CNN is learned. It reduces the cost for the target problem but still requires a large amount of efforts to satisfy the source problem needs in annotated data. We propose an unified framework of the methods that improve the universality by diversifying the source problem. We also propose two methods that improve the universality but pay special attention to limit the need of annotated data. Finally, we propose a new evaluation protocol to compare the ability of CNN-based representation to tackle the problem of universality. It demonstrates the interest of our work on 10 publicly available benchmarks, relating to a variety of visual classification problems.



There are no comments yet.


page 1

page 4

page 10

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Humans are able to recognize a scene and the objects that composed it with disconcerting ease in comparison to a machine. According to Atkinson’s cognitive study [2], such capabilities result from the development of a powerful internal representation in their infancy, which they re-use later in life to solve multiple problems. In analogy, machines based on neural-networks also perceives the data with representations that they use to solve tasks. However, while vision in a human works well for multiple problems, machines are only able to solve one at a time. This latter motivated some recent works, that wanted to “mimic” such abilities of humans by learning universal models [26, 53] or universal representations [4, 37, 38]. By the former, we mean models that are able to solve every possible task (recognition, detection, segmentation, etc.), and by the latter, we mean representations that are able to encode every possible element (materials, objects, scenes, etc.), in every possible configuration (scale, occlusion, context, etc.). Note that, without a good encoding of the data, a task can not be solved efficiently, thus a first step toward universal models is to get universal representations. We are far from expecting a representation that could encode any information, thus the actual purpose of our work is to improve the universality of a given representation, at a minimal cost.

Fig. 1: A universal representation is a set of features learned on a source-problem (SP), that exhibit good performances when it is transfered on many target-problems (TPs). A universal representation is obtained by a reference method (A) that consists in training a network on a SP (a). When it is transfered on a large set of TPs (e), it leads to a high aggregated performance on it. In this paper, we propose to build a more universal representation by: (i) normalizing and combining the features of (A) with those of a network (yellow block) learned after a source-problem variation (SPV) (c) – that, transforms an initial SP (a) into a new one (b) – and (ii) re-training both networks on their initial SP through focused self fine-tuning

denoted FSFT (d). Since our method (B) is more universal than (A), it has a higher universality score according to the proposed evaluation-metric.

Given a reference universal representation, a straightforward way to increase its universality, is undoubtedly the use of more training data (tasks and domains). For instance, [4, 37, 38] proposed to simultaneously learn a wide set of different visual domains, by better considering the additional data, through the use of scaling parameters that learns the statistics of each domain. In the same vein, Subramanian et al. [45] considered multiple datasets with different tasks and proposed to find which of the available multi-task learning algorithms and set of tasks provides the best universal level. Note that, this approach of adding more data is limited by availability of data. Hence, here we are interested by increasing universality from a fixed set of data (tasks and domains). Conneau et al. [11, 9, 10] have also the same goal. However, while their approach consists to find the best algorithm, we argue that we could structure the training-data in a way that provides different tasks to learn more or better features.

The overall contribution of this article (illustrated in Fig. 1) consist in a method that increases universality from a fixed set of data. In the vein of MuCale [47], we proposed MuLDiP-Net that: (i) variates the initial source-problem (SP); (ii) uses its resulting SPs to train new features; and (iii) combines all the features to form a more universal representation. By construction, MulDiP-Net learns new discriminative features without any additional images or annotations. Regarding the variation, we propose a formalism that consist to start from a set of specific categories (e.g., rottweiler, pit-bull) and group them in more generic ones (e.g., dog), according to the upper-levels of a given hierarchy (e.g., WordNet, hierarchy obtained by clustering the specific categories, or categorical-levels relating to Human-categorization [24, 40]). The proposed formalism is quite general and makes MuCale a particular case of our method, that groups categories according to categorical-levels only. We also introduce a new method named focused self fine-tuning (FSFT), based on self-training [58] of CNNs on the same SP. This approach increases the universality without any additional data, and more interestingly, at the same network-capacity. FSFT can also be used to reduce the dimension of representations. Combined to MulDiP-Net, it thus improves the performances while reducing the network-capacity.

We also address the problem of universality evaluation. Rebuffi et al. [37] proposed an interesting Visual Decathlon Challenge (VDC) to measure universality on multiple domains. Furthermore, they proposed a VDC metric that coherently aggregates the scores on the multiple domains. However, they restricted their evaluation in a classical end-to-end scheme: learn the representation on the train-set and evaluate on the test-set. Such scheme do not completely match with the claim of Atkinson111In all the document, the “claim of Atkinson” states for: “the ability of humans to develop a universal and powerful internal representation of images in the early years of their development and re-use it (almost) as is later in life for solving multiple kind of different problems” [2]., that is, learning on one environment and use the learned representation on different problems. Hence, we propose to go further by evaluating universality in a transfer-learning (TL) scenario [58, 3]. TL, naturally fits with [2], with the source-problem that corresponds to the “infancy learning environment” and the target-problems to those “solved later in life”. Also, while the VDC metric respects some desirable criteria, we argue that more are required for universality, and thus proposed a new metric that respects more of them.

The present manuscript differs in many ways from our previous work on this topic [47]. In particular, it includes the following new contributions: (1) a new universalizing method based on self-training of neural-networks [58] (Sec. 3); (2) a more general formalism of the MulDiP-Net approach based on the principle of source-problem variation by grouping (Sec. 4), allowing the use of multiple variations and thus making [47] a special case of our approach; (3) the formalization of FSFT as a dimensionality reduction method and its combination with MulDiP-Net (Sec. 4.5), making this latter more performing while significantly reducing its network-capacity; (4) the evaluation of universality in a TL scheme, that better fits the claim of [2] and the highlight of four desirable criteria for universality evaluation as well as the proposition of a new metric (Sec. 5); and (5) more extensive and detailed experimental results, including a comparison to the state-of-the-art based on ten available target-problems (Sec. 6), as well as an in-depth analysis of our approach including an ablation study and the impact of some important component of our approach (Sec. 6.3 and supplementary material).

2 Related Work

2.1 Learning Universal Representations

With UberNet, Kokkinos [26] considered a unified architecture that is trained end-to-end to tackle several vision tasks (universal model) and addressed resulting technical challenges. In the same vein, Subramanian et al. [45] used multiple datasets with different NLP tasks and proposed to find which of the available multi-task learning algorithms and set of tasks are the best to learn universal representations. Alternatively, Bilen and Vedaldi [4] rather considered that a universal representation should be able to address multiple visual domains. They proposed to learn a compact representation able to perceive a wide set of domains, using scaling parameters to learn the statistics of each. It has been extended in [37, 38] by respectively using sequential and parallel residual adapters. All these works use more data, and it is undoubtedly the most straightforward way towards universality, but it is limited by its availability. It is thus interesting to improve universality from a fixed set of data, which is our goal as well as that of Conneau et al. [11, 9]. However, in contrast to their approach, that consists to find the best task and algorithm for universality, ours is to structure the data in a way that automatically provides different tasks to learn more or better features.

All the above works consider multiple source-problems (e.g., tasks [45], domains [37] or problems with same images but different categories as our work) to learn the universal representation. It is important to note that learning the representation with one network (using multi-task, multi-label or recursive [53] learning) is just a way to do it. In this paper we proposed to rely on multiple networks through independent learning, and the advantages and drawbacks of these methods will be detailed in Sec. 2.2.2.

2.2 Approaches for Universalizing Methods

Many works can be considered as universalizing methods, in the sense that they diversify and increase feature detectors, either by modifying the source-problem (SP) on which a network is learned, or relying on ensemble models that learn several networks on different problems. We propose a unified view of these approaches, using the notion of Source-Problem Variation (SPV, Sec. 4.1).

2.2.1 Learning one Network on a Modified SP

To diversify an initial problem, a first approach consists in adding new categories and corresponding annotated images. Three kinds of categories are added: specific [3, 4, 37, 59] (e.g. rottweiler), generic [28, 31, 49] (e.g. dog) and noisy [51, 25]. In some cases, the categories added contain data from multiple domains [4, 37]. These methods slightly increase the capacity of the network by adding parameters specific to each domain. In all cases, this approach can be quite powerful to increase the universality, but at a high cost since it requires many additional data and corresponding annotations. Another limitation lies in the learning methodology. Indeed, the network is often trained jointly on the generic and specific data, resulting into a mix of generic and specific features in the intermediate layers. Since a softmax loss is often used, it considers generic and specific categories as mutually exclusive, that is not the case. At the opposite, we propose to group the categories according to their level and separately learn the features on each to respect the real-world semantics. Alternatives to this proposal are the works of [23, 6] that respectively group hierarchically or by clustering. However (as shown in the experiments), for the sake of universality, it is preferable to group at categorical-levels as we propose.

2.2.2 Learning a Set of Networks on a Set of SPs

The methods of the second approach [1, 22, 32, 35, 55, 56] give an answer to the problem that in joint training, different kinds of features are undesirably mixed. Indeed, all the works in this approach use an ensemble-model on different source problems and a sequential training procedure. They train a network on an initial problem (that contains specific or generic categories) and they fine-tune it on another problem or on a set of smaller problems. As a consequence, even if the scope of the new representation is increased by this process, all the features learned on the new problems are biased toward those of the model previously learned on the initial problem. Thus, due to their sequential learning procedure, these approaches do not combine different types of knowledge (specific and generic categories) but only consider one of them, that of the last problem used for training. A consequence of the latter point is that, they need many models (i.e., more than ) to get significant diversity in the set of features, which is very costly. The proposed MulDiP-Net method, provides an answer to this limitation by carrying independent network training (one network per problem).

From the point-of-view of the strategy used to vary the source problem, all the methods [1, 22, 35, 56] from this approach re-label specific categories into non-semantic generic ones, that is to say categories that do not exist in the real world) [6]

and capture the common properties among many object classes independently of an actual common semantics. These generic categories are built using hierarchical clustering on low/mid-level features (obtained from a network trained on the initial SP) of images among the initial set of categories. As a consequence, these methods are dependent to the visual low/mid-level features that can lead to irrelevant categories when low/mid-level features fail to capture the dissimilarity between different categories. For the sake of universality, it is preferable to rely on a grouping process that uses explicit human categorization expertise in order to reflect semantically-real relations between categories.

2.3 Universality Evaluation

Inspired by studies on the visual brain (claim of [2]), authors of [4, 37, 38] evaluated universality of representations as their ability to simultaneously cover a large range of visual domains. Their evaluation consists in learning and testing on the same problem and only the visual domain differs. Moreover, since many problems are considered, they proposed a metric to aggregate the scores in each task, that respects their proposed criterion of significance. While such criterion is undoubtedly useful for universality evaluation, it should be noted that their classical evaluation framework does not completely match with [2]. In contrast to our transfer-learning scheme, their scheme do not consider the terms “re-used later in life” which means that multiple and different problems should be solved by a universal representation. Moreover, in addition to their criterion of significance and the coherent aggregation considered in [45], we proposed a total of six additional criteria important for universality evaluation, as well as two metrics that respect almost all the criteria. Nevertheless, while such TL scheme has been already used in the NLP community [11, 9, 8, 45] for universal representations, to the best of our knowledge, we are the first to propose it in the vision community and more importantly, to link it to the claim of [2].

2.4 Cognitive Studies in Computer Vision

A last line of work deals with the inspiration from cognitive studies in computer vision 

[13, 30, 34, 46]. Generally, their goal is to output basic-level concepts of an image from a set of predicted finer ones. An exception is the work of [46], that is closest to ours since they consider categorical-levels in their representation. As in our work, their system reflects the psychological hint stating that, even if humans tend to categorize objects at the subordinate-level, they are still aware of the other categorical-levels [40]

. However, the key difference is how we integrate that hint as well as the purpose of its consideration. Indeed, our goal is to diversify the features learned in CNNs, while they aim at solving the problem of generic categories that output low scores because of their high intra-class variance, in order to force their beneficial consideration. Moreover, we opt for an integration at three levels (data, learning and representation), while they do it only after the computation of their semantic representation 

[18, 48].

3 Focused Self Fine-Tunning

We propose a new method that takes advantage of the principle of the re-training of neural networks on the same problems, and thus does not need more data [3, 31, 49], nor increasing the network capacity [44, 1, 54, 47]. Our approach relates to the work of [58] who proposes an extensive study of the effect of different self-training methods (i.e., re-training a neural network on the same problem it was trained originally). They explored two learning-strategies: (i) frozen-based re-training (that we call Frozen Self-Training and denote FrST) and (ii) fine-tuning based re-training (that we call Self Fine-Tuning and denote SFT). More precisely, both methods re-train an initial network (that we denoted initNet), with being its set of trained weights. Both methods consist in three steps: (i) the weights of FrST and SFT are splitted into two sets ( and ), with containing the weights of the first layers and , the weights of the last ones (ii) the two sets of weights are initialized differently – in FrST and SFT, the first layers are initialized with the weights of the pre-trained initNet and the last layers are initialized randomly –; and finally (iii) the re-training of the weights – FrST retrains only last layers and “frozes” the first ones, while SFT retrains them all with the same learning-rate. Their extensive study leads to some interesting conclusions that motivate our approach. In particular, they showed that: (i) FrST hurts the performances of initNet because“initNet contained fragile co-adapted features on successive layers” (i.e., features that interacted with each other in a fragile way during initNet learning such that this co-adaptation could not be relearned by the randomly initialized upper layers alone) and (ii) SFT slightly increases the performance because “it aims to recover co-adapted features that were trained by the initNet”. Simply said, it is important to preserve some knowledge acquired during the learning of the original network, but it is a sub-optimal to completely focus the training on the last layers. As a consequence, we propose a new method called Focused Self Fine-Tuning (FSFT) that can be seen as an hybrid view of the two previous ones. As in [58], the re-training principle consists in dividing the weights learned on the initNet into two sets, initializing them differently (first with the pre-trained initNet, others randomly), jointly minimizing them but with different learning-rates. An illustration is given in Fig. 2.

More formally, let us consider a source training database containing images with their associated labels . Let us also consider a network that was trained on

by minimizing a loss function

with a certain optimization algorithm. Let note being its set of trained weights splitted in two sets, namely that contains the weights of the first layers and that contains those of the remaining layers. FSFT consists to re-train the network by minimizing the same loss-function as initNet, on the same training database , which is expressed by:



is the predicted probability vector for images

using learnable weights . The set of weights is initialized with those of the first layers of initNet (i.e., ) and is initialized randomly. The individual weights , that forms the full sets (with ), are updated through:


where, respectively corresponds to the learning-rate of the first layers (when ) and last ones (when ). More specifically, , with is a parameter that can be set by cross-validation. To summarize, our FSFT method preserves some knowledge acquired during the learning of the initNet (by initializing the weights of its first layers, with the firsts of initNet). It also focuses the training on the last layers (by completely training the last layers, while allowing, through factor , some slight change of the first ones).

Fig. 2: Focused Self Fine-Tunning (FSFT). An initial network (gray branch) denoted initNet (first layer’s weights denoted , last ones and the classifier ) is trained on a source-database, by minimizing the loss-function . Once initNet trained, another network (black branch) denoted FSFT (first layer’s weights denoted , last ones and the classifier ) is trained on the same source-database and solves the same problem (minimize same loss ). The weights (black block filled in gray) of the first layers of FSFT are initialized with the firsts of initNet () and weakly updated through training. The weights of the last layers (black blocks filled in white) are randomly initialized and fully trained.

4 Multi Discriminative-Problem Network

The previous method is efficient without needing additional data nor capacity but always solves the same problem. To learn more universal representations, one must solve different problems. In this section, we propose another universalizing method that combine different but complementary features learned on different problems. It consists in three components: (i) source problem variation (SPV) (Sec. 4.1) – that define a new source problems (SPs) – with a special emphasis on variation with grouping (Sec. 4.2); (ii) independent training of networks on a set of multiple SPs obtained by the SPV (Sec. 4.3); and (iii) extraction of features from the set of trained networks followed by a particular combination to form the more universal representation (Sec. 4.4). Since each SP is a discriminative model solved by a neural network, the method is named “Multi Discriminative Problem Network” (MulDiP-Net, illustrated in Fig. 3). In Sec. 4.5, we use the FSFT (Sec. 3) as a dimensionality reduction method in the MulDiP-Net.

Fig. 3: Illustration of our MulDiP-Net method. Let consider an initial source problem (SP) consisting in a set of images and their associated labels . MulDiP-Net consists in three phases: (i) variation () of the initial SP () into new ones (); (ii) training networks ( and ) on the whole set of SPs (); and (iii) normalization () followed by combination (

) of the features extracted from each trained network in order to form a more universal representation. More precisely, phase (i) is a source problem variation (SPV) (

) applied on the initial SP (), which outputs a new SP (). After applying SPV functions, we get a set of +1 SPs containing the new SPs and the initial one (only one variation illustrated here, thus ). Phase (ii) consists in the learning of one network for each of the +1 SPs, resulting in a set of +1 trained networks: . Each network (that is a composition of a features-extractor and a classifier ) is trained by minimizing the loss function computed using the output of the predictor and the ground-truth of the SP . Phase (iii) consists in the extraction of the more universal representations from images of a target-task . Specifically, it passes the images into the trained networks and gets features (through the extractors ) that are independently normalized () and fused () in order to output the final representation .

4.1 SPV: Source-Problem Variation

A source problem , consists in a set of pairs with being a training image and its associated label. The images are labeled according a label set of categories. By solving this source problem (SP), the CNN learns features that discriminate between the images of the different categories of the given SP. For instance, if we consider a network that solves a SP containing images of lemons and green-apples, the network will learn different features than one that solves a SP containing lemons and strawberries’s images. Hence, changing the SP to be solved by a CNN can lead to a change in the set of learned features. Motivated by this assumption222Assumption empirically validated by [59]. Indeed, they have shown that a CNN trained on scenes learns different features (i.e., object detectors) than one trained on objects (which learns object-part detectors)., we propose the principle of Source Problem Variation (SPV) that is a variation function that transforms an initial source problem into a new one . Such a function has the following form:

with the following constraints:


where corresponds to the size of images (i.e., width , height and depth with for RGB images), is an element of the new SP , with each training image labeled according the label-set containing categories. Regarding the constraints, the first ensures that at least one element of has to be variated compared to the elements of and the second one warrants that at least one element of has to be in . With these constraints, taking a dataset completely different than the initial one is not a SPV and in contrast, changing all the data (images and labels) without keeping any element from the initial SP is also not a SPV.

In practice, the variations can be of many nature and act both on the images and/or the labels ). We consider three types of variations: (i) adding images or categories to the initial SP; (ii) splitting the initial SP ; (iii) grouping categories (thus images) of the initial SP (Fig. 4).

This principle, although not explicitly described as a SPV, have been already applied in the literature. For instance, variations on the image set while preserving the set of categories have been proposed in [21, 43]; adding variations has been used by [4, 37, 38] (by adding data labeled among specific categories) and [31, 49] (by adding data labeled among generic categories); splitting variations on the set of categories has been performed in [22, 1, 56, 3, 55] and finally, variation of the categories by grouping them has been explored in [6, 23, 42]. However, note that a SPV can be applied for different goals and as explained in the beginning of this section, our goal is to learn different features than those learned on the initial SP, but complementary

when combined together. Thus, with respect to our goal, the SPV has to be associated to a learning procedure and learn new neurons. Beyond the stated aim, our work introduce a new type of SPV based on the

grouping of categories and their associated images.

Fig. 4: Illustration of three kind of SPV: Splitting, Adding and Grouping (see figure legend for description of each graphical elements). An initial source problem (SP) is illustrated in (A). (B) represents the output of the splitting SPV () which results into two diminished sets of training-data (each of them contains less images and categories) compared to the initial SP i.e., and , with ). In contrast, as illustrated in (C), an adding SPV () results in an increased set of training data (more images and categories), compared to (i.e., and )). The last example is our grouping SPV (), illustrated in (D). It results in the same amount of training-data (same set of images but labeled according less amount of categories), compared to (i.e., and ).

4.2 SPV Based on Grouping

Grouping-SPV consists in the grouping of images through the grouping of their initial categories. In practice, it can be done in many ways (e.g., randomly, based on clustering or semantically). Here, motivated by knowledge on human categorization, we propose a semantic grouping, through categorical-levels. The advantage of the semantic-grouping lies in its semantic aspect, which aims to get a new SP that is highly different than the initial one but relevant, which is not necessarily the case with random and clustering-based grouping. Compared to other SPVs, the grouping one has important advantages. Indeed, compared to adding SPV (which needs more annotated data, costly to obtain) grouping SPV does not need more annotated data, and compared to splitting SPV (which decreases the performances of the networks, since it decreases considerably the amount of training data), grouping SPV maintains exactly the same amount of images.

Semantic-grouping SPV consists in the grouping of specific categories into generic ones, according to a semantic knowledge. This semantic knowledge is generally represented in the form of hierarchies. Here, we focus on a particular semantic knowledge, named categorical-levels [24, 40, 50] that consists in a hierarchy of categories mostly used by Humans to categorize objects. Let us consider a semantic hierarchy with hyponymy relations (i.e., a set of categories organized according to “is-a” relations). In practice here we use categorical-levels, but hierarchical-levels (i.e., levels of the ImageNet [12] or WordNet) or clustering-levels could be used (more details in last paragraph of this section). Formally, the starting semantic hierarchy (ImageNet) is a directed acyclic graph consisting of a set of nodes and directed edges . Each node is a label and is a hierarchy-edge indicating that label subsumes label . Let us also consider an initial source-problem containing images labeled among specific categories belonging to , such that .

We now consider a categorical-level defined according to human cognition. Let us note a set of categories that belong to a categorical-level (i.e, subordinate level for =0, basic for =1 and superordinate for =2). It is important to note that the categories of do not correspond to a given level of the ImageNet hierarchy . Hence, consider that all are mapped into certain nodes of the hierarchy . Our purpose is thus, to group the categories of into generic categories. This latter is equivalent to get the partitioning of into subsets i.e., . To do so, we define a partitioning function according to a categorical-level as:

where is the set of all descendant nodes of the categories according to . Using the function, we obtain generic categories, with .

We can now define our re-labeling function relative to a given categorical-level by:

where is a function that outputs, for all categories the set of all its ancestors in and is defined as , with being a deductive function that associates to a category of its direct ancestor, that is to say, the category directly above according to and its corresponding iterated function (i.e composed with itself times and for which we assume that the image of the root node of is itself).

Simply said, while Eq. (4.2) partitions the set of specific categories into “unnamed” generic categories (i.e., do not have association to a humanly understandable word), Eq. (4.2) re-labels them into existing (and thus “named”) categories of the categorical-level . Importantly, images are also automatically re-labeled according the same process than the initial categories they belong. To recap, our grouping SPV (illustrated in Figure 5) is the combination of the partitioning and re-labeling functions.

While the above process is applied on the re-labeling of the specific categories into generic ones belonging to a categorical-level , it can also be applied to re-label into generic categories belonging to hierarchical-levels of or clustering-levels (hierarchies constructed based on data, through clustering). For the former, we need the following two assumptions about : descendants of leaf-nodes are themselves (i.e., ) and if a leaf-node is at a certain hierarchical-level with , its least common ancestor is itself (i.e., if ). For the latter, no assumptions are needed and practical details to construct the clustering-hierarchy are given in Sec. 6.1.3.

In summary, in MulDiP-Net, we consider the initial specific SP as well as those obtained by three grouping SPVs, namely, categorical, hierarchical and clustering-based ones. This latter, results in a set of SPs denoted .

Fig. 5: Illustration of our grouping SPV. Given a set of specific categories (leaf nodes of the hierarchy in white diamonds) and a set of generic categories (here ) at a certain level (here categorical denoted ), our grouping SPV consists in three steps: computation of all descendants of according to the (1); computation of the descendants that belong to the initial set of categories (2), producing a group ; and re-labeling of the categories (as well as their images) of the latter group (3) into the categories of . The first two points correspond to Eq. (4.2), while the last one corresponds to Eq. (4.2).

4.3 MulDiP-Net Training

To describe the training of our Multi Discriminative Problem Network (MulDiP-Net), let us first consider a set of SPs , with being the initial SP and being the SPs obtained from different SPVs functions applied on the initial SP (as depicted in the previous section). MulDiP-Net consists in training one network per SP , with a network architecture and a learning procedure , (i.e., including the initial SP). Indeed, each SP consists in a set of images labeled among categories . This latter forms a set of training-data which is used to train each network. It is worth noting that, we can have as many architectures and learning procedures (one per network) as the number of SPs in , but here we focus on the special case where they are all the same (333In all the document, indicates the architecture of a network, while indicates a trained network., for >). Note also that, our method does not depend on a particular architecture or learning procedure and can thus directly benefit from the advances in this domain.

Specifically, here we used common CNN architectures (AlexNet [28], VGG [44] and DarkNet [39]) and follow the classical learning procedure for

, that is to say, a random initialization of the weights (with a Gaussian distribution), optimizing a softmax loss-function with stochastic gradient descent (SGD). We have as many cost functions to minimize as the number of SPs in

and each cost function is minimized independently. Since the SPV functions are based on grouping, all the SPs contain the same images as the initial one thus each cost function is minimized on the same set of training images, but with different labels. There is thus no cost in terms of additional data, while the cost to annotate is dramatically reduced since it is reported on each category and not each image. On ILSVRC for instance, it means annotations instead of million. At convergence, we obtain a set of +1 networks. Such an “ensemble-like” learning method could be limited in terms of network capacity, but it exists solutions to decrease this capacity while maintaining high performance, as detailed in Sec. 3.

4.4 MulDiP-Net Representation

Let us consider a MulDiP-Net (i.e., a set of +1 trained networks) and an image of a target-task . To extract the representation from , we perform the following two steps: (i) extraction of the features of through each subnetwork of MulDiP-Net and (ii) normalization and combination of these features into a relevant representation (Fig. 3). For the extraction of the features, let recall that an architecture is a composition of classifiers and features-extractor (i.e., ). Thus, to get the features from an image, the classifier as well as some of the last layers are discarded, and the first layers of network filter the images (e.g, when = for AlexNet [28]). Hence, the features-extractor function (denoted ) outputs, for , a vector (if

points to a fully-connected layer) or a tensor of activations (if

indicates a convolutional layer). In the latter case, the tensor is flattened or pooled in order to get a vector. Formally, the MulDiP-Net representation for the query image is thus computed as:


where is the fusion operator among the +1 input vectors, and is a normalization function that returns normalized representations. In practice for the normalization function, we choose the L-infinite norm - and for the fusion functions, we used the concatenation (reasons discussed in next paragraph).

Our goal was to combine features trained separately on specific and generic labels in order to learn more universal representations at near-zero cost of annotation, because we hypothesized that their learned features can be complementary. However, because of the different training-data, the networks have different behaviors after training. Indeed, once they are trained, they react differently on the same input images. More precisely, the features learned on specific data tend to fire with higher values than those learned on generic data. Thus, naively concatenating these sub-representations would restrict the fused representation to the activations of the dominating features (those learned with specific labels) with an additional noise from dominated features (those learned with generic labels), which will result to a degradation of performance. Hence, the independent normalization step is crucial, because it aims to homogenize the scales of the sub-representations. The same problem of dominating values has been observed in [29, 54], who solved it with a similar normalization process.

4.5 FSFT as a Dimensionality Reduction Method

The more SPs we consider in MulDiP-Net, the more universal representation we get. However, because of its concatenation fusion, the dimensionality of the fused representation linearly increases with the amount of SPs considered. Also, MulDiP-Net is an ensemble-like method since it is implemented such that it contains as many networks as the number of SP considered. This point is generally translated in terms of the amount of parameters in the final model. Indeed, while the MulDiP-Net aims at building more universal representations, it contains +1 times more parameters than a standard network. In summary, the actual form of MulDiP-Net faces two problems: (ii) the high capacity of its resulting model; and (ii) the high dimensionality of its resulting representation. To alleviate these drawbacks, we proposed to rely on the FSFT method (introduced in Sec. 3) and even more, on its use as a dimensionality reduction method. To do so, we roughly re-train each of the subnetworks of MulDiP-Net on the same initial SP, but with the last layers (on which FSFT focuses the re-training) containing much less neurons. Compared to the classical FSFT, the “FSFT as a Dimensionality Reduction method” (FSFT-DR) results in: (i) a final model that has much lower capacity and (ii) a representation with a lower dimensionality, at a near-zero decrease of performance, but that still significantly increases the performances compared to the initial network (experiments of Sec. 6.3). Obviously such a behavior is also observed on the MulDiP+FSFT-DR (FSFT-DR applied on each subnetwork of MulDiP-Net).

More formally, let us consider a features-extractor trained on a certain SP noted with a network architecture that has a penultimate layer of dimensionality =. FSFT-DR consists to re-train (as depicted in Eq. 2 about the FSFT re-training) the last layers of (which results in a new features-extractor denoted ) on the same source-problem but with a different network-architecture that has a penultimate layer of a dimensionality = lower than those of (i.e., <). Note that in the classical formulation of FSFT =. Note also that, while by construction FSFT-DR reduces the dimensionality, it also reduces the number of parameters of the final network (since contains less neurons and thus less weights than ). It is all the more true since FSFT-DR is applied on the last layers, which are generally fully-connected ones and thus contain the majority of the parameters of the CNN. For instance, AlexNet contains 62.35 millions parameters with 3.75 millions in the convolutional layers and 58.6 millions in the fully-connected (FC) ones (about 94% of the total). Regarding the application of FSFT-DR on the MulDiP-Net method, let first consider a given set of +1 trained networks. FSFT-DR is applied on each subnetwork by setting the last layer of to be of size , with being the dimensionality of the original sub-representation . The rest of the pipeline remains the same than in MulDiP-Net, that is, extraction of the penultimate layer from each subnetwork and merging them after independent normalization. In summary, compared to MulDiP-Net, MulDiP+FSFT-DR (i) contains much less parameters and (ii) results into a smaller representation than the classical MulDiP-Net, while being much more performing.

5 Evaluation of Universalizing Methods

Criterion Avg RG VDC BC mNRG
Coherent aggr.
Merit bonus
Penalty malus
Penalty for damage

Indep. to outliers

Indep. to reference
Time consistency
TABLE I: Comparison of all the universality evaluation-metrics – Avg baseline, RG [45], VDC [37] and ours, namely BC and mNRG – according all the criteria mentioned (and highlighted in bold) in Sec. 5.

Universality is motivated by the claim of Atkinson’s cognitive study [2]. Inspired by this, authors of [4, 37, 38] evaluated the universality of representations by their ability to simultaneously cover a large range of domains, like objects, faces, animals, etc. However, their evaluation scheme does not completely match with [2] since learning and testing are conducted on the same problem and only the image distribution differs (learn on train images and evaluate on test ones). However, in [2], the terms “re-used later in life” mean that universality implies to work well on many but different problems than those used to learn the representation. Hence, in this paper, we propose to consider this through the scheme of transfer-learning (TL) that naturally fits with it. Indeed, in TL, the source-task is used to learn the representation (analogically, the universal one developed in the early years) and the target-tasks are used to evaluate it (analogically, the problems solved later in life). Moreover, to better match the claim of [2] (re-use “as-is”), we propose to not modify the representation (do not fine-tune it) for each target-task, but rather learn a simple task-predictor on top of it. Nevertheless, when humans develop their representation, they do not have access to future problems, thus the target-tasks should not be available during the learning of the representation. What makes our evaluation framework unique is that it lies in a TL scheme, closer to [2] than those of [4, 37, 38] which lies in a classical end-to-end learning scheme.

Evaluating universality requires to aggregate the scores of a method on multiple target-problems. This step is not straightforward, since metrics of each problem could be different (e.g., error-rate, mAP, F1-score) or even be increasing (e.g., recall) or decreasing (e.g., median rank). Thus, naively averaging the scores would result into incoherences. To get coherent-aggregation, Subramanian et al. [45] proposed to average the gain over the scores of a reference universal representation across the set of multiple problems: relative gain (RG) computed as: , with and being respectively the performance of a reference method and the method to evaluate on the problem . Such aggregation is coherent, but depends on a reference method that needs to be arbitrarily and manually chosen. Rebuffi et al. [37] went further by identifying one additional criterion for universality: the metric should better rewards significant gain compared to the reference scores (called significance criterion). To do so, they proposed the Visual Decathlon Challenge (VDC), which is computed as: , with and respectively being the error-rate of the actual method and the best reachable one on the problem , is just a normalizing factor, and models the significance.

In addition to the properties already put on evidence, we introduce four more criteria. The first criteria (merit-bonus) indicates that the metric should reward proportionally with the score of the reference method. For instance a jump of 1% is more rewarding if it starts from a reference at 90% than one at 10%. The second criteria (penalty for damage) is motivated by the fact that a method could improve the performance on some tasks but decrease it on others. In such case, the metric should penalize them. The third proposed criteria (penalty malus) lies with the fact that, the aggregated improvement (over the reference) of some methods, could be lower than the aggregated decrease. Such method should also be penalized. A last important criterion is that the metric should be independent to outlier methods, in the sense that it should detect methods that are well suited to a given benchmark (gain a lot of points) that could compensate (when aggregated) a insignificant gain on the other benchmarks.

To respect these constraints, we propose the median normalized relative gain (mNRG): . The numerator is the simple RG metric, thus it respects the same criteria (coherent aggregation and penalty for damage because negative scores are allowed). The denominator acts as a normalization and naturally handles the merit-bonus and penalty malus criteria. Finally, instead of an average to aggregate, we used the median to make our metric independent to outliers.

Note that in VDC, negative values are not allowed (because of the ), thus it does not respect the penalty for damage nor the penalty malus. In addition, the significance criterion is modeled through a power , which not only prevents the merit-bonus, but even exhibits the inverse behavior. Indeed, VDC will reward more a method that gains 3% starting from a reference at 10%, than one that gains 2.5% starting from 90%.

Finally, none of the above metrics is independent to a reference method, thus we propose an alternative metric that we call Borda Count (BC

), which is based on a voting method and the ordinal scale, with each benchmark considered as an independent voter. First, each voter can use its own measure to estimate the performance of the methods, which provides coherent aggregation. Second, it becomes insensitive to some outliers such as the exceptional fit of a method to a particular benchmark. Moreover, we consider that reporting the number of times a method

is better than a method can be a more reliable information than the average difference of scores (as long as these score are actually comparable). Note that, BC lacks the consistency with time since its score differs with the chosen comparison methods. Formally, let us consider methods to rank, relying on the information provided by problems. Each method is then ranked according to each problem, resulting into a rank with and . It is converted into a score that is itself averaged to give the final score of the method: .

All the metrics are compared in Table I according to all the mentioned criteria. None of them address all the criteria but the proposed mNRG is the most complete of them.

6 Experimental Results

6.1 Experimental Settings

6.1.1 Transfer-Learning Protocol

As mentioned above, the evaluation of all the methods is carried in a transfer-learning scheme on multiple target-problems. The methods are trained with standard architectures: AlexNet [28] (default), VGG [44] and DarkNet [39]). By default, we train on ILSVRC* (see Sec. 6.1.2) because it is smaller than ILSVRC, making the process faster. The network on the source-problem is used to extract the image signatures on the target-problem. Here we focus on the classification task for the target-problems and learn each of their classes with a one-vs-all linear SVM classifier. Note that, as mentioned in Sec. 5, to match the claim of [2], there is no fine-tuning. The cost parameter of the classifiers is optimized on each dataset through cross-validation with the usual train/val splits. The performances of each target-problem are evaluated with standard splittings and metrics, namely the mean Average Precision (mAP) for multi-label datasets and Accuracy (Acc.) for mono-label ones. To evaluate the universality, we use the proposed mNRG evaluation-metric (Sec. 5).

6.1.2 Source and Target-datasets

For the source-problem, we used two subsets of ImageNet [41]: ILSVRC and ILSVRC* that contains half of the former. The characteristics of the datasets are presented on top of Table II. Regarding the target-problems, we used ten image classification benchmarks from different visual domains (actions, scenes, objects, birds, plants, etc.). Specifically, five of them contain coarse categories – Pascal VOC 2007 (VOC07[14], Pascal VOC 2012 (VOC12[15], Caltech-101 (CA101) [16], Caltech-256 (CA256[19] and Nus-Wide Objects (NWO[7] –, three contain fine categories – Stanford Cars (CARS[27], CUB-200 Birds (CUB[52] and Flowers-102 (FLO[33] –, one contain scenes – MIT Indoor 67 (MIT67[36] – and one contain actions – Stanford Actions (stACT[57]. Their characteristics are presented at the bottom of Table II. For all the benchmarks, we follow standard protocols (i.e., common splits and metrics). For VOC12 and stCAR, we used the official evaluation-servers.

Datasets (1) (2) (3) (4) (5) (6)
ILSVRC objects 483 569,000 48,299 Acc.
ILSVRC objects 1K 1.2M 50,000 Acc.
VOC07 objects 20 5,011 4,952 mAP
VOC12 objects 20 11,540 10,991 mAP
NWO objects 31 21,709 14,546 mAP
CA101 objects 102 3,060 3,022 Acc.
CA256 objects 257 15,420 15,187 Acc.
MIT67 scenes 67 5,360 1,340 Acc.
stACT actions 40 4,000 5,532 Acc.
CUB birds 200 5,994 5,794 Acc.
stCA cars 196 8,144 8,041 Acc.
FLO plants 102 1,020 6,149 Acc.
TABLE II: Source-datasets (top) and target datasets (bottom) used in this paper. The columns report some characteristics: (1) images-domain; (2) number of categories; (3): multiple categories per image (✓) or not (✗); (4) number of training samples; (5): number of test samples; and (6) evaluation-metric (Acc. or mAP).
Method VOC07 VOC12 CA101 CA256 NWO MIT67 stACT CUB stCA FLO
mAP mAP Acc. Acc. mAP Acc. Acc. Acc. Acc. Acc.
REFERENCE 66.8 67.3 71.1 53.2 52.5 36.0 44.3 36.1 14.4 50.5 0.0
SPV [3, 4, 59] 66.6 67.5 74.7 54.7 53.2 37.4 45.1 36.0 13.7 51.9 +1.5
SPV [28, 31, 49] 67.7 68.1 73.0 54.3 50.5 37.1 44.9 36.8 14.6 50.3 +1.4
AMECON [6] 61.1 62.1 58.7 40.6 45.8 24.3 32.7 26.1 13.1 36.4 -17.7
WhatMakes [23] 64.0 62.7 69.4 50.1 45.6 33.7 41.9 15.0 12.5 42.8 -7.5
ISM [55] 62.5 65.4 68.8 50.7 28.5 37.9 42.6 34.0 13.3 50.0 -4.3
GrowBrain-WA [54] 68.4 68.3 73.1 54.7 49.3 38.4 46.5 37.5 14.7 54.8 +3.5
GrowBrain-RWA [54] 69.1 69.0 74.8 55.9 50.4 40.0 48.4 38.6 14.8 56.1 +6.0
MuCaLe-Net [47] 69.5 69.8 76.0 56.8 54.7 41.3 48.5 35.6 15.7 54.8 +7.7
FSFT (Ours) 67.5 67.4 73.9 55.0 44.6 40.4 47.1 38.7 15.8 56.8 +4.0
MulDiP+FSFT (Ours) 69.8 70.0 77.5 58.3 47.9 43.7 50.2 37.4 16.1 59.7 +9.8
TABLE III: Comparison of our methods (bottom) to state-of-the-art universalizing-methods (top). The comparison is carried in a transfer-learning scheme on ten target-datasets. Methods are compared in terms of their individual scores on each benchmark (with standard metrics) and especially their aggregated scores, with our nMRG (blue scores in last column). The universalizing methods are compared to a reference one for which we colored its scores in red. In each column, we highlight the highest score in bold, and the second one is underlined. All the methods have been learned with the same AlexNet architecture on the same initial SP (ILSVRC*).

6.1.3 Implementation Details

For the SPV, we started with ILSVRC (and ILSVRC*) as initial source-problem and variated at generic levels with three generic grouping methods: categorical, hierarchical and clustering. For the two first methods, we used the ImageNet hierarchy in the functions of partitioning (Eq. (4.2)) and re-labeling (Eq. (4.2)). Practically, in the first method, a part of the categorical-level categories of ILSVRC is obtained from the list released in [41] and the other part is re-labeled by ourselves as depicted in [47]. This latter results into generic categories for the specific ones of ILSVRC ( generic for the specific of ILVRC*). For the hierarchical-method, we follow the bottom-up approach of [23] and re-labeled the categories to higher levels of the ImageNet hierarchy. For the clustering method, we follow [6] and clustered the data of the categories with a Kmeans algorithm ( from to with steps of ). Regarding the extraction and combination of the features (Eq. 7), we used one label-set per grouping SPV (i.e., basic-level, hierarchical and the clustering with =). The latter choice results in a set of four SPs (including the initial one). For the extraction of the representations of images of target-tasks, we always use the penultimate layer from each subnetwork. Regarding the normalization step before combining the representations, we used the infinite-norm (). In the FSFT methods, we choose the following settings: , meaning that we focus the retraining on the two last layers; , as the learning-rate used to train the original network; and , meaning that we train the last layers times faster than the firsts. Note that, for FSFT we also use the penultimate layer as representation of the target-images.

6.2 Comparison to the State-Of-The-Art

We compare the proposed FSFT and MulDiP+FSFT to state-of-the-art methods that could be used as universalizing methods:

  • REFERENCE [28]: A CNN trained on the initial source-problem, that contains specific categories (here 483). Since it is a classical method that works quite well on many problems, we use it as reference to evaluate universality of other methods.

  • SPV [3, 4, 59]: A method that consists in a adding SPV followed by the training on the obtained SP. Specifically, we added specific categories (randomly obtained from the leaf nodes of ImageNet) with their 100K images. Simply said, the method is trains a network on specific classes.

  • SPV [28, 31, 49]: Same as the previous method but with a generic adding SPV that adds generic categories (with their 100K images). This results in training a network on specific and generic classes. The generic ones were obtained from random internal-nodes of the ImageNet hierarchy.

  • WhatMakes [23]: A grouping SPV followed by the training of a network on the obtained SP. Specifically, the grouping SPV corresponds to a relabeling of specific categories into internal hierarchical-levels of the ImageNet hierarchy. We performed it for all the levels and report the results for the best one (6 level starting from the leaf-node’s one).

  • AMECON [6]: Similar to the previous method but differs by its kind of grouping SPV. Indeed, the grouping is performed by clustering. Specifically, all the images of each specific categories are used to compute the mean features (obtained through the layer of the pre-trained reference network) for the categories. Then, a Kmeans algorithm is used to cluster this set of category mean features. We applied this method with different amount of clusters ( from 50 to 300 with a step of 50) and report the best results (=).

  • ISM [55]: This method trains an ensemble-model with networks, one for each SP obtained from a splitting SPV. Here, we applied the method with half splitting (we split the initial SP in two balanced subsets). Note that, we chose two subsets to limit the method to a maximum of two networks for fair comparisons. Once the networks trained, we normalize and concatenate the features extracted from each of them.

  • GrowBrain-WA [54]: This recent method consists to fine-tune a trained network on the same source-problem it was trained originally, by growing the network capacity (wider or deeper). The best setting is the width augmented (WA) growing that consists to add neurons to the layer. We also implemented their normalization and scaling step for the new and old layers, because they are crucial to make this method performing. The final representation corresponds to the -dimensional layer. GrowBrain-RWA is an extended version that performs a recursive growing of the network capacity. The best setting they report is to add neurons on the layer and on the one.

  • MuCaLe-Net [47]: It consists to perform a normalization step followed by the concatenation of the features extracted from two CNN-models, one trained on data labeled according specific categories and one according categorical-levels. It results in a -dimensional representation.

[37, 38] explicitly tackled the universal representations problem and could be used for comparison. However, it is important to note that their goal was to improve universality by adding data from multiple domains, while our goal is to improve universality from a fixed set of data (one domain), making their methods not comparable to ours, since they use more annotated data.

As depicted in Sec. 5, the methods are evaluated in terms of the proposed mNRG score, in a transfer-learning scheme on a set of ten target-datasets from different domains. The results of the comparison are presented in Table III. As expected, the methods that consist to add images and their annotations (SPV and SPV) as well as those that increase the capacity of the network (GrowBrain) aim to learn more universal representation compared to the reference method (positive mNRG score). Surprisingly, the ISM method is not as performing as reported in their context. This might be due to the fact that it is designed for very large source-problems, and the half-million images used here are not sufficient, highlighting a clear limitation of their method. The same behavior can be observed with AMECON and WhatMakes. The former could be because the specific categories we used (leaf node of ImageNet) are not as specific as the ones (captions) used in their paper. The latter was rather a study that highlighted that a network trained on data labeled among generic categories is almost as performing as one trained on specific categories, through the evaluation on three target-tasks distributed among three domains (general objects, actions and scenes). However, we clearly observe here that on more domains, and especially fine-grained objects, such behavior is not observed anymore (highly negative mNRG score). Finally, we observe that our MulDiP+FSFT method significantly performs better than all other methods (highest mNRG score), meaning that it is clearly the most performing universalizing method. Note that, it also significantly outperforms MuCaLe-Net (by 2 points of mNRG), which clearly highlight the interest of combining MulDiP-Net with the proposed FSFT method.

Another salient observation is that our FSFT only is quite powerful, especially because it outperforms the methods that consist to add data and their annotations (SPV and SPV) as well as one that increase the network-capacity (GrowBrain-WA). It is worth noting that FSFT does need more data neither more capacity. Thus, it increases universality at zero cost of capacity and annotation, which is quite promising.

Method VOC07 CA101 CA256 NWO MIT67 stACT CUB FLO
mAP Acc. Acc. mAP Acc. Acc. Acc. Acc.
REFERENCE 66.8 71.1 53.2 52.5 36.0 44.3 36.1 50.5 0.0
Ensemble 67.8 72.2 54.5 52.0 37.2 45.0 34.7 51.8 +2.3
Multi-Task 61.5 61.8 45.4 49.4 30.7 36.4 25.6 38.7 -16.2
Multi-Label 44.7 46.8 26.4 25.1 27.2 28.0 15.2 38.1 -45.0
Recursive 65.3 68.6 50.8 52.4 33.4 50.8 29.4 45.5 -4.8
MulDiP-Net* 69.5 76.0 56.8 54.7 41.3 48.5 35.6 54.8 +7.9
FrST 62.3 64.3 47.3 50.0 30.9 38.4 29.3 41.1 -11.6
SFT 67.0 71.5 52.5 49.8 36.2 44.0 36.4 50.1 -0.1
FSFT* 67.6 73.3 54.8 47.2 37.9 46.9 38.2 56.3 +3.4
FSFT* 67.5 73.9 55.0 44.6 40.4 47.1 38.7 56.8 +4.5
MulDiP+FSFT* 69.8 76.6 57.3 49.9 41.6 48.7 38.0 59.2 +8.8
MulDiP+FSFT* 69.8 77.5 58.3 47.9 43.7 50.2 37.4 59.7 +10.7
TABLE IV: Comparison of our universalizing-methods with baseline methods. Methods are compared in terms of their mNRG score (last column in blue). All the methods have been learned with the same AlexNet architecture on the same initial SP (ILSVRC*). Methods marked with * are ours.
Fig. 6: Illustration of the different baseline methods.

6.3 Comparison with Baseline Methods

Here, we take further experiments to analyse the performances achieved by the proposed methods (MulDiP-Net and FSFT), through their comparison to several baseline-methods. All the baselines are described below, illustrated in Fig. 6 and the results are reported in Table IV. Note that, in this section (and supplementary), the comparisons are conducted on eight of the ten target-datasets presented in Sec. 6.1 (VOC12 and stCAR removed because evaluation-servers limited to 1 run per day needed).

We first assess whether the gain of universality obtained by our MulDiP-Net method is caused by the ensemble-model component. To do so, we compare it to a baseline that consists in an ensemble-model with two network trained on the same specific SP but with different random initializations of the weights (Ensemble). While Ensemble is significantly increasing universality compared to the reference, our MulDiP-Net is provides much better results. This latter, means that the performances of our method does not come from the ensemble-model aspect only, but also by the combination of features learned with data labeled among generic and specific categories. Hence, another baseline is to compare our method without the ensemble-model, but by jointly training a network on the two SPs (generic and specific). Indeed, we tested three variants: (i) training the set of SPs with a sum of softmax losses, i.e., one for each SP (Multi-Task); (ii) training the set of SPs with a multi-label loss layer (e.g., hinge loss), where the labels for each image contain both annotations, namely generic and specific (Multi-Label); and (iii) recursively training the SPs by training the network on the generic SP, then continuing the training on the specific SP (Recursive). Globally, the first two baselines, which belongs to the joint-training approach, strongly hurts performance compared to the reference method. The Recursive method is almost as performing as the reference one and we assume this is due to catastrophic forgetting [17], i.e., when the network is trained on the second specific problem, it forgets the features learned on the former generic one. In summary, the latter baselines clearly demonstrate the utility of the independent-learning (in our method) compared to the joint one. In other words, the growing capacity of the ensemble-model aspect is crucial to make our method benefiting from multiple SPs.

For our FSFT method, we mainly compare it to the two methods (namely, Frozen Self Training (FrST) and Self Fine-Tuning (SFT)) studied by [58], that also re-train a network on the same problem. More precisely, FrST consists in retraining the last layers only with the previous layers being “frozen”, while FST re-trains all the layers with the same learning-rate. As highlighted in [58], we observe a performance drop of FrST compared to the reference. As mentioned, in Sec. 3, this is due to the fragile co-adaptation neurons learned in the original network. A slight drop of performance is also observed for FST, meaning that fine-tuning does not always recovers all the co-adapted neurons. In contrast, our FSFT method increases performance, even when we compact its representation to (FSFT), clearly highlighting its capacity to recover co-adapted neurons of the original network and even the training of others.

Finally, we also assess the utility of combining MulDiP-Net with FSFT and even considering FSFT as a dimensionality reduction method (FSFT), compared to the reference and especially the MulDiP-Net method. As shown above, MulDiP+FSFT performs much better than MulDiP-Net, but more surprisingly, when MulDiP-Net is combined with FSFT, it is almost as performing as the former, at a much lower capacity. Hence, by using FSFT as a dimensionality reduction method (FSFT), we not only get a jump of performance (compared to MulDiP-Net), but also significantly alleviate the higher capacity produced by the ensemble-model aspect of MulDiP-Net.

Let note that supplementary materials are available and roughly contain: (i) a comparison of our methods to the state-of-the-art according all the universality evaluation-metrics of the literature; (ii) an evaluation of the impact of more and different grouping SPVs used in our MulDiP-Net; and (iii) the evaluation of MulDiP-Net with more training data and deeper architectures.

7 Conclusion

In this paper, we proposed four contributions: (i) a new challenge of learning more universal representations from a fixed set of data (domains and tasks); (ii) the evaluation of universality in a more suitable scheme (transfer-learning), as well as a new metric respecting most of the highlighted desirable criteria; (iii) a new method based on the re-training of networks, by focusing the training on some parameters; and finally (iv) a new method based on a general formalism of source problem variation and training of multiple networks. We demonstrated the effectiveness of our universalizing methods, in a transfer-learning scheme, through our evaluation-metric, on ten target-datasets from different domains. An in-depth analysis has also been conducted to highlight some important insights of our methods. We hope that our contributions will further support the creation of other methods to get more universal representations and open doors for many less explored aspects of transfer-learning such as, learning the source-problem, using Human knowledge or even through more realistic multimodal and dynamic environments such as HOME [5].

Appendix A Comparison to State-of-the-Art According Other Universality Metrics

In the Sec. 5 of the main paper, we proposed the mNRG universality evaluation metric and used it for the comparison of our methods with state-of-the-art. However, since our metric is novel, it is important to also perform the same comparison according the metrics of the literature, and compare the advantages and drawbacks of each metric. Such comparison is provided in Table V and should be analyzed with Table 3 of the main paper, which contain the detailed results on each benchmark. First of all, from the results we see that our MulDiP+FSFT method is the best universalizing method regardless the evaluation-metric. Moreover, still regardless the evaluation-metric, our FSFT is quite promising, since it significantly outperforms , and GrowBrain-WA at zero cost of annotation and without adding any additional parameter.

Regarding the comparison of the metrics, we can observe that our mNRG metric respects some of the properties highlighted in Sec. 5 of the main paper (e.g., merit bonus, penalty for damage, penalty malus), which is not the case for the baseline Avg, the RG [45], the VDC [37], the BC and the aNRG (being our method with an average operator instead of the median). For instance, compared to Avg and RG that gives almost the same universality scores to GrowBrain-WA and FSFT, our mNRG is able to give more points to FSFT since it gives significantly better results than GrowBrain-WA on seven of the ten datasets. Moreover, beside the significantly better results on the seven benchmarks, mNRG gives only +0.5 compared to GrowBrain-WA, since it penalizes its loose of performance on the NWO dataset.

We also observe that the VDC do not penalize the methods that performs less than the reference on some benchmarks (e.g., it gives to our MulDiP+FSFT almost twice the universality-score than MuCaLe-Net, while compared to the former, the latter never decreases performance in any of the benchmarks), while our mNRG is able to penalize it (MulDiP+FSFT outperforms MuCaLe-Net by only 2.1 points in terms of our mNRG metric). This is even more visible on the ISM method that should clearly give negative universality scores. Indeed, compared to the reference, ISM gives lower results on nine of the ten benchmarks (higher results only on the MIT67 benchmark), but since VDC do not perform penalty for damage, it gives 0.0 points on the the nine benchmarks and 0.9 points on the MIT67 one, which undesirably results in a positive universality-score. Moreover, VDC perform neither penalty for damage nor penalty malus, and as a consequence, is unable to say which method between AMECON and WhatMakes performs worse. A metric that has such ability (say method A is worse than B, even if they are both lower than the reference) could be interesting in a case, were for example, the methods A and B have some practical advantages compared to the reference and we would like to know which of these practically advantageous methods should be used as a reference for improving universality. Finally, regarding the VDC metric, we observe that compared to the scores (around 2000) reported in their paper, the scores reported in this experiment are much lower (around 100). It is important to note that, this is due to the fact that our evaluation scheme (transfer-learning: training the representation in the source-problem and evaluate on target-problem unseen during training) is much more challenging than theirs (end-to-end learning: training the representation in the source-problem and evaluate on the test-set of the same source-problem). Simply said, while we do not have access to the target-problems during the learning of the representation, they have, making it easier.

We also observe that compared to aNRG, our mNRG is able to decrease the SPV to a similar universality score than SPV, since the former seems to be well suited on some datasets like CA101 and CA256 (compared to other absolute improvements). Finally, while not visible here, by construction, the Avg do not provide coherent aggregation. For the same reason, our BC do not provide the same results according the comparison methods, making it not consistent with time. Note however that, as the best universality metric (our mNRG), BC has some good advantages like penalty for damage or independence to outliers.

Method Avg RG VDC BC aNRG mNRG
REFERENCE 49.2 0.0 0.0 50 0.0 0.0
SPV [3, 4, 59] 50.1 +0.9 18.3 62 +2.3 +1.5
SPV [28, 31, 49] 49.7 +0.5 6.7 56 +1.4 +1.4
AMECON [6] 40.1 -9.1 0.0 17 -20.2 -17.7
WhatMakes [23] 43.8 -5.4 0.0 22 -10.8 -7.5
ISM [55] 45.4 -3.8 0.9 32 -8.8 -4.3
GrowBrain-WA [54] 50.6 +1.4 20.1 71 +3.0 +3.5
GrowBrain-RWA [54] 51.7 +2.5 50.9 87 +5.6 +6.0
MuCaLe-Net [47] 52.3 +3.1 69.6 92 +7.0 +7.7
FSFT (Ours) 50.7 +1.5 36.7 76 +3.0 +4.0
MulDiP+FSFT (Ours) 53.1 +3.9 136.9 103 +8.6 +9.8
TABLE V: Comparison to state-of-the-art, according different universality evaluation metrics (those mentioned in Sec. 5 of the main paper). Note that, for a set of methods and datasets, the best achievable BC score is , while the worse is .
Method Network VOC07 CA101 CA256 NWO MIT67 stACT CUB FLOW
mAP Acc. Acc. mAP Acc. Acc. Acc. Acc.
Net-S (Ref.) AlexNet 71.7 79.7 62.4 58.3 46.9 51.2 36.3 58.4 0.0
Net-G AlexNet 71.5 77.4 60.4 57.8 42.8 49.3 19.5 52.4 -7.7
MulDiP-Net AlexNet 74.4 82.5 65.2 60.8 47.4 54.2 36.1 62.5 +7.4
Net-S VGG-16 86.1 88.8 78.0 71.8 66.7 73.5 69.8 78.9 +44.8
Net-G VGG-16 85.7 87.6 76.9 70.3 65.8 72.2 67.0 75.0 +38.9
MulDiP-Net VGG-16 87.5 92.0 80.9 72.6 68.9 75.0 71.5 81.9 +55.3
Net-S DarkNet-20 82.7 91.0 78.4 70.5 64.8 72.2 59.5 80.0 +38.9
Net-G DarkNet-20 83.2 91.5 78.1 73.2 64.4 72.6 52.5 78.9 +40.6
MulDiP-Net DarkNet-20 84.1 92.7 80.1 73.9 66.4 74.5 61.2 82.1 +47.1
TABLE VI: MulDiP-Net performances with different network architectures and more training data. To compute the mNRG scores (last column in blue), we used the Net-S of AlexNet as reference. All the methods have been learned on the same initial SP (whole ILSVRC).
Fig. 7: Impact of the different grouping SPV and more levels considered in our MulDiP-Net. Net-S (red dashed line) is used as reference.

Appendix B MulDiP-Net with Different and More Grouping-SPVs

Our MulDiP-Net method is based on a grouping SPV using categorical-levels. Here, we assess what is the impact of using different grouping methods. In particular, we compared it to grouping based on hierarchical-levels [23] of WordNet, clustering ones [6] and also random. For every grouping method (Random, Clustering, WordNet and our Categorical), we also compare MulDiP-Net to each of its subnetworks alone – i.e., the one trained on specific classes (Net-S) and the one trained on the generic ones (Net-G). In the main paper, we always used only two levels for fair comparisons, but as depicted in Sec. 4.4 of the main paper, our method could benefit from multiple levels. Thus, we implemented MulDiP-Net with more levels, namely Multi-3: initial specific SP and generic SPs obtained from from categorical and Wordnet grouping SPVs; and Multi-4: same as Multi-3 with an additional clustering-based grouping SPV. The results are presented in Fig. 7.

From the results, a first observation is that, whatever the grouping SPV, the Net-G is much less performing than Net-S, which contradicts the work of [23] (limited to few domains on target-datasets). Even if below than Net-S, our categorical one is the best grouping SPV, clearly highlighting the interest to introduce a grouping inspired by cognitive studies. Second, whatever the grouping, MulDiP-Net always performs better than its subnetworks (Net-G and especially the reference Net-S), which demonstrates the interest of combining specific and generic knowledge, in the way we do it. Third, in MulDiP-Net, while the best results are achieved with our categorical grouping (confirming its interest), it is worth noting that, the performance of random grouping is very close to Net-S, which highlights the utility of semantic grouping SPV. Finally, it is clearly observable that, the more levels we use in MulDiP-Net, the better performance we get.

Appendix C MulDiP-Net with Deeper Networks and More Training-Data

Increasing network capacity (wider or deeper layers) can be a very efficient universalizing method, since it can learn to perceive more elements or configuration through its new features. However, it is important to note that, it is not easy to modify the architecture (many costly experiments are needed to set all the hyper-parameters as well as the architecture itself) and no certainty of convergence is promised. In all cases, our contribution is orthogonal to this domain, and our aim here is to demonstrate this orthogonality. To do so, we implemented the reference, as well as our MulDiP-Net method with three popular architectures, namely the basic AlexNet (5 convolutional and 2 fully-connected layers), the deep and wide VGG-16 (16 convolutional and 2 fully-connected layers) and the fast and very-deep DarkNet-20 (20 convolutional layers followed by average pooling). Another important question is whether our approach of learning from a fixed set of training data could benefit from more data if they are available (adding-data approach. Thus, in this experiment, instead of using ILSVRC* (containing half-million images and 483 categories) as the initial source-problem, we used the whole ILSVRC which contains 1.2M images and 1K categories. The results of these experiments are presented in Table VI.

Four observations can be made. First, even with twice more data than in Table 3, MulDiP-Net still significantly increases universality compared to the reference. This demonstrates the orthogonality of our approach with the works that adds more data (domains [4, 37, 38] or tasks [45]). Second, the deeper architecture do not learn the more universal representation (Net-S with VGG-16 is better than Net-S with DarkNet-20). This clearly highlights that, compared to diversifying the source-problem, naively increasing the capacity is not safe for improving universality. Third, we clearly observe that MulDiP-Net outperforms its subnetworks regardless the architecture, which demonstrates that our approach could benefit from the field of network architectures. Last but not least, we can observe that Net-G is always below Net-S, except for DarkNet. This is surprising since one could have the intuition that the finer categories we use for training, the better results we get. However, it seems that this depends on the architecture, or maybe on the ratio between the number of units in the representation and the number of classes used for training.


  • [1] Karim Ahmed, Mohammad Haris Baig, and Lorenzo Torresani. Network of experts for large-scale image categorization. In ECCV, 2016.
  • [2] Janette Atkinson. The developing visual brain. Oxford University Press UK, 2002.
  • [3] Hossein Azizpour, Ali Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Factors of transferability for a generic convnet representation. PAMI, 2015.
  • [4] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv:1701.07275, 2017.
  • [5] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron Courville. Home: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
  • [6] Ines Chami, Youssef Tamaazousti, and Hervé Le Borgne. Amecon: Abstract meta concept features for text-illustration. In International Conference on Multimedia Retrieval, ICMR, 2017.
  • [7] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao. Zheng. Nus-wide: A real-world web image database from national university of singapore. In CIVR, 2009.
  • [8] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. LREC, 2018.
  • [9] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. EMNLP, 2017.
  • [10] Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070, 2018.
  • [11] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. In Association for Computational Linguistics, ACL, 2017.
  • [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • [13] Jia Deng, Jonathan Krause, Alexander C Berg, and Li Fei-Fei. Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In CVPR, 2012.
  • [14] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge. IJCV, 2010.
  • [15] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge 2012, 2012.
  • [16] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. PAMI, 2006.
  • [17] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • [18] Alexandru Lucian Ginsca, Adrian Popescu, Hervé Le Borgne, Nicolas Ballas, Phong Vo, and Ioannis Kanellos. Large-scale image mining with flickr groups. In MM, 2015.
  • [19] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [21] Luis Herranz, Shuqiang Jiang, and Xiangyang Li. Scene recognition with cnns: objects, scales and dataset bias. In CVPR, 2016.
  • [22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
  • [23] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer learning? arXiv:1608.08614, 2016.
  • [24] Pierre Jolicoeur, Mark A Gluck, and Stephen M Kosslyn. Pictures and names: Making the connection. Cognitive Psychology, 1984.
  • [25] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
  • [26] Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
  • [27] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV Workshop, 2013.
  • [28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [29] Wei Liu, Andrew Rabinovich, and Alexander C Berg. Parsenet: Looking wider to see better. In ICLR workshop, 2016.
  • [30] Alexander Mathews, Lexing Xie, and Xuming He. Choosing basic-level concept names using visual and language context. In WACV, 2015.
  • [31] Pascal Mettes, Dennis Koelma, and Cees G. M. Snoek. The imagenet shuffle: Reorganized pre-training for video event detection. In ICMR, 2016.
  • [32] Venkatesh N Murthy, Vivek Singh, Terrence Chen, R Manmatha, and Dorin Comaniciu. Deep decision network for multi-class image classification. In CVPR, 2016.
  • [33] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In IEEE Computer Vision, Graphics & Image Processing, 2008.
  • [34] Vicente Ordonez, Wei Liu, Jia Deng, Yejin Choi, Alexander C Berg, and Tamara L Berg. Predicting entry-level categories. IJCV, 2015.
  • [35] Wanli Ouyang, Xiaogang Wang, Cong Zhang, and Xiaokang Yang. Factors in finetuning deep model for object detection with long-tail distribution. In CVPR, 2016.
  • [36] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In CVPR, 2009.
  • [37] S-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
  • [38] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In CVPR, CVPR, 2018.
  • [39] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017.
  • [40] Eleanor Rosch. Principles of categorization. Cognition and Categorization, 1978.
  • [41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • [42] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning cross-modal embeddings for cooking recipes and food images. In CVPR, CVPR, 2017.
  • [43] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [45] Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR, ICLR, 2018.
  • [46] Youssef Tamaazousti, Hervé Le Borgne, and Céline Hudelot. Diverse concept-level features for multi-object classification. In ICMR, 2016.
  • [47] Youssef Tamaazousti, Hervé Le Borgne, and Céline Hudelot. Mucale-net: Multi categorical-level networks to generate more discriminating features. In CVPR, 2017.
  • [48] Youssef Tamaazousti, Hervé Le Borgne, and Adrian Popescu. Constrained local enhancement of semantic features by content-based sparsity. In ICMR, 2016.
  • [49] Youssef Tamaazousti, Hervé Le Borgne, Adrian Popescu, Etienne Gadeski, Alexandru Ginsca, and Céline Hudelot. Vision-language integration using constrained local semantic features. CVIU, 2017.
  • [50] James W Tanaka and Marjorie Taylor. Object categories and expertise: Is the basic level in the eye of the beholder? Cognitive Psychology, 1991.
  • [51] Phong D Vo, Alexandru Ginsca, Hervé Le Borgne, and Adrian Popescu. Harnessing noisy web images for deep representation. CVIU, 2017.
  • [52] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [53] Jingyan Wang, Olga Russakovsky, and Deva Ramanan. The more you look, the more you see: towards general object understanding through recursive refinement. In WACV, 2018.
  • [54] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. In CVPR, 2017.
  • [55] Yue Wu, Jun Li, Yu Kong, and Yun Fu.

    Deep convolutional neural network with independent softmax for large scale face recognition.

    In ACM, 2016.
  • [56] Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. Hd-cnn: hierarchical deep convolutional neural networks for large scale visual recognition. In ICCV, 2015.
  • [57] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
  • [58] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
  • [59] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva.

    Learning deep features for scene recognition using places database.

    In NIPS, 2014.