1 Introduction
Convolutional neural networks (CNNs) have enabled transformational advances in classification, object detection, and segmentation, among other tasks. However, they have nontrivial complexity. State-of-the-art models contain millions of parameters and require implementation on expensive GPUs. This creates problems for applications with computational constraints, such as mobile devices or consumer electronics. Figure 1 illustrates the problem in the context of a smart home equipped with an ecology of devices, such as a camera that monitors package delivery and theft, a fridge that keeps track of its contents, a treadmill that adjusts fitness routines to the facial expression of the user, or a baby monitor that tracks the state of a baby. As devices are added to the ecology, the GPU server in the house must switch between a larger number of classification, detection, and segmentation tasks. Similar problems will be faced by mobile devices, robots, smart cars, etc.
Under the current deep learning paradigm, this task switching is difficult to perform. The predominant strategy is to use a different CNN to solve each task. Since only a few models can be cached in the GPU, and moving models in and out of cache adds too much overhead to enable real-time task switching, there is a need for very efficient parameter sharing across tasks. The individual networks should share most of their parameters, which would always reside on the GPU. The remaining small number of task-specific parameters would be switched per task. This problem is known as
multi-domain learning (MDL) and has been addressed with the architecture of Figure 1 [34, 38]. This consists of a set of fixed layers shared by all tasks and a set of task-specific adaptation layers, fine-tuned to each task. If the adaptation layers are much smaller than the fixed layers, many models can be cached simultaneously. Ideally, the fixed layers should be pretrained, e.g. on ImageNet, and used by all tasks without additional training, enabling the use of special-purpose chips to implement the majority of the computations. While the adaptation layers would still require a processing unit, the small amount of computation could enable the use of a CPU, making it cost-effective to implement each network on the device itself. In summary, MDL aims to maximize the performance of the network ecology while minimizing the ratio of task-specific to total parameters per network. [34, 38] have shown that the architecture of Figure 1 can match the performance of fully fine-tuning each network in the ecology, even when the adaptation layers contain only a small fraction of the total parameters. In this work, we show that adaptation layers can be shrunk substantially further, using a data-driven low-rank approximation. As illustrated in Figure 2, this is based on transformations that match the first- and second-order statistics of the layer inputs and outputs. Given principal component analyses (PCAs) of both input and output, the layer is approximated by a recoloring transformation: a projection into the input PCA space, followed by a reconstruction in the output PCA space. By controlling the intermediate PCA dimensions, the method enables low-dimensional approximations of different input and output dimensions. To correct the mismatch (between principal components) of two PCAs learned independently, a small mini-adaptation layer is introduced between the two PCA matrices and fine-tuned on the target task.
Since the overall transformation generalizes batch normalization, the method is denoted covariance normalization (CovNorm). CovNorm is shown to outperform, with both theoretical and experimental arguments, purely geometric methods for matrix approximation, such as the singular value decomposition (SVD) [35], fine-tuning of the original layers [34, 38], and adaptation based on batch normalization [2]. It is also quite simple, requiring two PCAs and the fine-tuning of a very small mini-adaptation layer per layer and task. Experimental results show that it can outperform full network fine-tuning while reducing adaptation layers to a small fraction of the total parameters. When all tasks can be learned together, adaptation layers can be reduced even further. This is achieved by combining the individual PCAs into a global PCA model, whose parameters are shared by all tasks, and only fine-tuning mini-adaptation layers in a task-specific manner.

2 Related work
MDL is a transfer learning problem, namely the transfer of a model trained on a
source learning problem to an ecology of target problems. This makes it related to several types of transfer learning problems, which differ mostly in terms of input space, or domain, and range space, or task.

Task transfer: Task transfer addresses the use of a model trained on a source task for the solution of a target task. The two tasks can be defined on the same or different domains. Task transfer is prevalent in deep learning, where a CNN pretrained on a large source dataset, such as ImageNet, is usually fine-tuned [21] to a target task. While extremely effective and popular, full network fine-tuning changes most network parameters, frequently all of them. MDL addresses this problem by considering multiple target tasks and extensive parameter sharing between them.
Domain Adaptation: In domain adaptation, the source and target tasks are the same, and a model trained on a source domain is transferred to a target domain. Domain adaptation can be supervised, in which case labeled data is available for the target domain, or unsupervised, where it is not. Various strategies have been used to address these problems. Some methods seek the network parameters that minimize some function of the distance between the feature distributions of the two domains [24, 4, 43]. Others introduce an adversarial loss that maximizes the confusion between the two domains [8, 45]. A few methods have also proposed to do the transfer at the image level, e.g. using GANs [11]
to map source images into (labeled) target images, then used to learn a target classifier
[3, 41, 14]. All these methods exploit the commonality of the source and target tasks to align the source and target domains. This is unlike MDL, where source and target tasks are different. Nevertheless, some mechanisms proposed for domain adaptation can be used for MDL. For example, [5, 28] use a batch normalization layer to match the statistics of source and target data, in terms of means and standard deviations. This is similar to an early proposal for MDL [2]. We show that these mechanisms underperform covariance normalization.
Multitask learning: Multitask learning [6, 49] addresses the solution of multiple tasks by the same model. It assumes that all tasks have the same visual domain. Popular examples include classification and bounding box regression in object detection [9, 37]
, joint estimation of surface normals and depth
[7] or segmentation [29], and joint representation in terms of attributes and facial landmarks [50, 33], among others. Multitask learning is sometimes also used to solve auxiliary tasks that strengthen the performance of a task of interest, e.g. by accounting for context [10], or representing objects in terms of classes and attributes [15, 29, 30, 25]. Recently, there have been attempts to learn models that solve many problems jointly [18, 19, 48]. Most multitask learning approaches emphasize learning the interrelationships between tasks. This is frequently accomplished by using a single network, combining domain-agnostic lower-level network layers with task-specific network heads and loss functions
[50, 7, 10, 15, 37, 19], or some more sophisticated form of network branching [25]. The branching architecture is incompatible with MDL, where each task has its own input, different from those of all other tasks. Even when multitask learning is addressed with multiple tower networks, the emphasis tends to be on inter-tower connections, e.g. through cross-stitching [29, 17]. In MDL, such connections are not feasible, because different networks can join the ecology of Figure 1 asynchronously, as devices are turned on and off.

Lifelong learning: Lifelong learning aims to learn multiple tasks sequentially with a shared model. This can be done by adapting the parameters of a network or adapting the network architecture. Since training data is discarded upon use, constraints are needed to force the model to remember what was previously learned. Methods that only change parameters use the model output on previous tasks [23], previous parameter values [22], or previous network activations [44] to regularize the learning of the target task. They are very effective at parameter sharing, since a single model solves all tasks. However, this model is not optimal for any specific task, and can perform poorly on all tasks, depending on the mismatch between source and target domains [36]. We show that they can significantly underperform MDL with CovNorm. Methods that adapt the network architecture usually add a tower per new task [40, 1]. These methods have much larger complexity than MDL, since several towers can be needed to solve a single task [40], and there is no sharing of fixed layers across tasks.
Multi-domain learning: This work builds on previous attempts at MDL, which have investigated different architectures for the adaptation layers of Figure 1. [2] used a BN layer [16] whose parameters are tuned per task. While performing well on simple datasets, this does not have enough degrees of freedom to support the transfer of large CNNs across very different domains. More powerful architectures were proposed by [38], who used a convolutional layer, and [34], who proposed a ResNet-style residual layer, known as a residual adaptation (RA) module. These methods were shown to perform surprisingly well in terms of recognition accuracy, equaling or surpassing the performance of full network fine-tuning, but can still require a substantial number of adaptation parameters relative to the network size. [35] addressed this problem by combining the adapters of multiple tasks into a large matrix, which is approximated with an SVD. This is then fine-tuned on each target dataset. Compressing adaptation layers in this way was shown to reduce adaptive parameter counts to approximately half of those of [34]. However, all tasks have to be optimized simultaneously. We show that CovNorm enables a further tenfold reduction in adaptation layer parameters, without this limitation, although some additional gains are possible with joint optimization.

3 MDL by covariance normalization
In this section, we introduce the CovNorm procedure for MDL with deep networks.
3.1 Multidomain learning
Figure 3 a) motivates the use of adaptation layers in MDL. The figure depicts two fixed weight layers, $W_1$ and $W_2$, with a nonlinear layer in between. Since the fixed layers are pretrained on a source dataset, typically ImageNet, all weights are optimized for the source statistics. For standard losses, such as the cross entropy, this is a maximum likelihood (ML) procedure that matches $W_1$ and $W_2$ to the statistics of the activations $x$ and $y$ on the source. However, when the CNN is used on a different target domain, the statistics of these variables change and the weights are no longer an ML solution. Hence, the network is suboptimal and must be fine-tuned on a target dataset. This is denoted full network fine-tuning and converts the network into an ML solution for the target, with the outcome of Figure 3 b). In the target domain, the intermediate random variables become $x'$ and $y'$, and the weights are changed accordingly, into $W_1'$ and $W_2'$.

While very effective, this procedure has two drawbacks, which follow from updating all weights. First, it can be computationally expensive, since modern CNNs have large weight matrices. Second, because the new weights are not optimal for the source, i.e. the CNN forgets the source task, there is a need to store and implement two CNNs to solve both tasks. This is expensive in terms of storage and computation and increases the complexity of managing the network ecology. A device that solves both tasks must store two CNNs and load them in and out of cache when it switches between the tasks. These problems are addressed by the MDL architecture of Figure 1, which is replicated in greater detail in Figure 3 c). It introduces an adaptation layer $A$ and fine-tunes this layer only, leaving $W_1$ and $W_2$ unchanged. In this case, the statistics of the network input are still those of the target, but the distributions along the network are now those of $x$ and the adapted output $\tilde{y} = Ax$. Since $W_1$ is fixed, nothing can be done about $x$. However, the fine-tuning of $A$ encourages the statistics of $\tilde{y}$ to match those of $y'$, i.e. $\tilde{y} \approx y'$. Even if $A$ cannot match the statistics exactly, the mismatch is reduced by repeating the procedure in subsequent layers, e.g. introducing a second adaptation layer after $W_2$, and optimizing the adaptation matrices as a whole.
3.2 Adaptation layer size
Obviously, MDL has limited interest if the adaptation layer has size similar to that of the fixed layers. In this case, each domain has as many adaptation parameters as the original network, all networks have twice the size, task switching is complex, and training complexity is equivalent to full fine-tuning of the original network. On the other hand, if the adaptation layer is much smaller, MDL is computationally light and task switching much more efficient. In summary, the goal is to introduce an adaptation layer as small as possible, but still powerful enough to match the statistics of its input and output. A simple solution is to make the adaptation layer a batch normalization layer [16]. This was proposed in [2] but, as discussed below, is not effective. To overcome this problem, [38] proposed a linear transformation, and [34] adopted the residual structure of [13], i.e. an adaptation layer with a skip connection. To maximize parameter savings, the transformation was implemented with a $1 \times 1$ convolutional layer in both cases. This can, however, still require a nontrivial number of parameters, especially in upper network layers. Let the fixed layer $W$ convolve a bank of $c$ filters of size $k \times k \times c$ with its input feature maps. Then, $W$ has $k^2c^2$ parameters, the adaptation layer input $x$ is $c$-dimensional, and the adaptation layer $A$ a $c \times c$ matrix. Since in upper network layers $k$ is usually small (e.g. $k = 3$) and $c$ large, $A$ can be only marginally smaller than $W$. [35] exploited redundancies across tasks to address this problem, creating a matrix with the adaptation layer parameters of multiple tasks and computing a low-rank approximation of this matrix with an SVD. The compression achieved with this approximation is limited, because the approximation is purely geometric, taking no account of the statistics of the layer input and output. In this work, we propose a more efficient solution, motivated by the interpretation of $A$ as converting the statistics of its input into those of its output. It is assumed that the fine-tuning of $A$ produces an output variable whose statistics match those of the fully fine-tuned network. This could leverage adaptation layers elsewhere in the network, but that is not important for the discussion that follows. The goal is to replace $A$ by a simpler matrix that performs the same statistical mapping. For simplicity, we drop the prime notation of Figure 3 in what follows, considering the problem of matching the statistics of the input $x$ and output $y$ of a matrix $A$.
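To make the parameter comparison concrete, the following sketch counts the weights of a fixed $k \times k$ convolution against a $1 \times 1$ adaptation layer. The dimensions are illustrative (e.g. $c = 512$ channels and $k = 3$, as in the upper layers of a VGG-style network), not figures from the paper's experiments.

```python
def fixed_layer_params(c_in, c_out, k):
    """Weights of a k x k convolution with c_in input and c_out output channels."""
    return c_out * c_in * k * k

def adapter_params(c):
    """Weights of a 1x1 convolutional adaptation layer on c channels (a c x c matrix)."""
    return c * c

c, k = 512, 3                        # hypothetical upper-layer dimensions
w = fixed_layer_params(c, c, k)      # weights in the fixed layer
a = adapter_params(c)                # weights in the adapter
ratio = a / w                        # 1/9 here: only marginally smaller than the fixed layer
```

With $k = 3$ the adapter is only nine times smaller than the fixed layer, which is why further compression of the adaptation layers matters.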
3.3 Geometric approximations
One possibility is to use a purely geometric solution [35]. Geometrically, the closest low-rank approximation of a matrix is given by the SVD, $A = USV^T$. More precisely, the minimum Frobenius norm approximation of rank $k$ is $\hat{A} = U_k S_k V_k^T$, where $S_k$ contains the $k$ largest singular values of $A$ and $U_k$, $V_k$ the corresponding singular vectors. This can be written as $\hat{A} = LR$, where $L = U_k S_k$ and $R = V_k^T$. If $A$ is $c \times c$, these matrices have a total of $2ck$ parameters. An even simpler solution is to define matrices $L$ and $R$ of the same sizes, replace $A$ by their product in Figure 3 c), and fine-tune the two matrices instead of $A$. We denote this the fine-tuned approximation (FTA). These approaches are limited by their purely geometric nature. Note that $c$ is determined by the source model and fixed. On the other hand, the rank $k$ should depend on the target dataset. Intuitively, if the target dataset is much smaller than the source, or if the target task is much simpler, it should be possible to use a smaller $k$ than otherwise. There is also no reason to believe that a single $k$, or even a single ratio $k/c$, is suitable for all network layers. While $k$ could be found by cross-validation, this becomes expensive when there are multiple adaptation layers throughout the CNN. We next introduce an alternative, data-driven, procedure that bypasses these difficulties.
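As a sketch of the geometric approach (illustrative dimensions; a random matrix stands in for a fine-tuned adapter), a rank-k SVD truncation factors the adapter into two thin matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
c, k = 64, 8                        # hypothetical layer width and target rank
A = rng.standard_normal((c, c))     # stand-in for a fine-tuned c x c adapter

# Rank-k SVD approximation: A ~= (U_k S_k) V_k^T, i.e. two c x k factors.
U, s, Vt = np.linalg.svd(A)
L = U[:, :k] * s[:k]                # c x k (singular values folded into U_k)
R = Vt[:k, :]                       # k x c
A_k = L @ R                         # 2*c*k parameters instead of c*c

# Eckart-Young: the Frobenius error equals the energy of the discarded
# singular values -- no rank-k matrix does better.
err = np.linalg.norm(A - A_k, "fro")
```

This is optimal in Frobenius norm, but, as the text argues, it is blind to which directions the data actually populates.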
3.4 Covariance matching
Assume that, as illustrated in Figure 2, $x$ and $y$ are Gaussian random variables of means $\mu_x, \mu_y$ and covariances $\Sigma_x, \Sigma_y$, respectively, related by $y = Ax$. Let the covariances have eigendecompositions

$\Sigma_x = P_x \Lambda_x P_x^T, \quad \Sigma_y = P_y \Lambda_y P_y^T$    (1)

where $P_x, P_y$ contain eigenvectors as columns and $\Lambda_x, \Lambda_y$ are diagonal eigenvalue matrices. We refer to the triplet $(\mu, P, \Lambda)$ as the PCA of the corresponding variable. Then, it is well known that the statistics of $x$ and $y$ are related by

$\mu_y = A\mu_x, \quad \Sigma_y = A \Sigma_x A^T$    (2)

and, combining (1) and (2), $P_y \Lambda_y P_y^T = A P_x \Lambda_x P_x^T A^T$. This holds when $A = CW$ or, equivalently,

$W = \Lambda_x^{-1/2} P_x^T$    (3)

$C = P_y \Lambda_y^{1/2}$    (4)

where $W$ is the “whitening matrix” of $x$ and $C$ the “coloring matrix” of $y$. It follows that (2) holds if $A$ is implemented with a sequence of two operations. First, $x$ is mapped into a variable $z$ of zero mean and identity covariance, by defining

$z = W(x - \mu_x).$    (5)

Second, $z$ is mapped into $y$ with

$y = Cz + \mu_y.$    (6)
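A minimal numerical check of the recoloring construction (illustrative NumPy, random covariances, a square layer of hypothetical dimension): building $A = CW$ from (3)-(4) maps any input with covariance $\Sigma_x$ to an output with covariance exactly $\Sigma_y$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                                  # hypothetical feature dimension

def random_cov(d, rng):
    B = rng.standard_normal((d, d))
    return B @ B.T + d * np.eye(d)     # symmetric positive definite

S_x = random_cov(d, rng)               # input covariance  Sigma_x
S_y = random_cov(d, rng)               # output covariance Sigma_y

lam_x, P_x = np.linalg.eigh(S_x)       # PCA: Sigma = P diag(lam) P^T
lam_y, P_y = np.linalg.eigh(S_y)

W = np.diag(lam_x ** -0.5) @ P_x.T     # whitening matrix of x, as in (3)
C = P_y @ np.diag(lam_y ** 0.5)        # coloring matrix of y, as in (4)
A = C @ W                              # recoloring transform

# Whitening: W Sigma_x W^T = I; recoloring: A Sigma_x A^T = Sigma_y.
white = W @ S_x @ W.T
S_out = A @ S_x @ A.T
```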
In summary, for Gaussian $x$, the effect of $A$ is simply the combination of a whitening of $x$ followed by a coloring with the statistics of $y$.

3.5 Covariance normalization
The interpretation of the adaptation layer as a recoloring operation (whitening + coloring) sheds light on the number of parameters effectively needed for the adaptation, since the PCAs capture the effective dimensions of $x$ and $y$. Let $k_x$ ($k_y$) be the number of eigenvalues significantly larger than zero in $\Lambda_x$ ($\Lambda_y$). Then, the whitening and coloring matrices can be approximated by

$W \approx \Lambda_{x,k_x}^{-1/2} P_{x,k_x}^T, \quad C \approx P_{y,k_y} \Lambda_{y,k_y}^{1/2}$    (7)

where $\Lambda_{x,k_x}$ ($\Lambda_{y,k_y}$) contains the $k_x$ ($k_y$) nonzero eigenvalues of $\Lambda_x$ ($\Lambda_y$), and $P_{x,k_x}$ ($P_{y,k_y}$) the corresponding eigenvectors. Hence, $A$ is well approximated by a pair of matrices ($W$ of size $k_x \times c$, $C$ of size $c \times k_y$) totaling $c(k_x + k_y)$ parameters.
On the other hand, the PCAs are only defined up to a permutation, which assigns an ordering to the eigenvalues/eigenvectors. When the input and output PCAs are computed independently, the principal components may not be aligned. This can be fixed by introducing a permutation matrix between $W$ and $C$. The assumption that all distributions are Gaussian also only holds approximately in real networks. To account for all this, we augment the recoloring operation with a mini-adaptation layer $M$ of size $k_y \times k_x$. This leads to the covariance normalization (CovNorm) transform

$A \approx CMW$    (8)

where $M$ is learned by fine-tuning on the target dataset. Beyond improving recognition performance, this has the advantage of further parameter savings. The direct implementation of (8) increases the parameter count to $c(k_x + k_y) + k_x k_y$. However, after fine-tuning, $M$ can be absorbed into one of the two other matrices, as shown in Figure 4. When $k_y < k_x$, $MW$ has dimension $k_y \times c$, and replacing the two matrices by their product reduces the total parameter count to $2ck_y$. In this case, we say that $M$ is absorbed into $W$. Conversely, if $k_x < k_y$, $M$ can be absorbed into $C$. Hence, the total parameter count is $2c\min(k_x, k_y)$. CovNorm is summarized in Algorithm 1.
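The absorption step can be sketched as follows (illustrative dimensions; random matrices stand in for the learned whitening $W$, coloring $C$, and mini-adaptation $M$ factors). With $k_y < k_x$, multiplying $M$ into $W$ leaves the same transform with fewer stored parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
c, k_x, k_y = 64, 12, 8                 # hypothetical layer width and PCA dims

W = rng.standard_normal((k_x, c))       # truncated whitening,  k_x x c
C = rng.standard_normal((c, k_y))       # truncated coloring,   c x k_y
M = rng.standard_normal((k_y, k_x))     # mini-adaptation layer (fine-tuned)

A_direct = C @ M @ W                    # direct implementation of the C M W form
params_direct = C.size + M.size + W.size

# Since k_y < k_x here, absorb M into W after fine-tuning: the transform
# is unchanged, but only two matrices of c*k_y parameters each are stored.
W_abs = M @ W                           # k_y x c
A_absorbed = C @ W_abs
params_absorbed = C.size + W_abs.size   # 2*c*min(k_x, k_y)
```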
3.6 The importance of covariance normalization
The benefits of covariance matching can be seen by comparison to previously proposed MDL methods. Assume, first, that $x$ and $y$ consist of independent features. In this case, $P_x$ and $P_y$ are identity matrices and (5)-(6) reduce to

$y_i = \frac{\sigma_{y_i}}{\sigma_{x_i}}(x_i - \mu_{x_i}) + \mu_{y_i}$    (9)

which is the batch normalization equation. Hence, CovNorm is a generalized form of the latter. There are, however, important differences. First, there is no batch. The normalizing distribution is now the distribution of the feature responses of the layer on the target dataset. Second, the goal is not to facilitate learning, but to produce a feature vector with statistics matched to the downstream fixed layer. This turns out to make a significant difference. Since, in regular batch normalization, the downstream layer is allowed to change, it can absorb any initial mismatch with the independence assumption. This is not the case for MDL, where the downstream layer is fixed. Hence, (9) usually fails, significantly underperforming (5)-(6).

Next, consider the geometric solution. Since CovNorm reduces to the product of two low-rank matrices, it should be possible to replace it with the fine-tuned approximation based on two matrices of this size. Here, there are two difficulties. First, the rank is not known in the absence of the PCA decompositions. Second, in our experience, even when the rank is set to the value used by PCA, the fine-tuned approximation does not work. As shown in the experimental section, when the matrices are initialized with Gaussian weights, performance can decrease significantly. This is an interesting observation because $A$ is itself initialized with Gaussian weights. It appears that a good initialization is more critical for the low-rank matrices.
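Returning to the batch normalization case: a toy illustration (made-up covariances) of why the diagonal matching in (9) falls short. Per-feature scaling can equalize the variances but leaves cross-feature correlations untouched, which only the full recoloring can fix.

```python
import numpy as np

S_x = np.array([[1.0, 0.9], [0.9, 1.0]])   # correlated input features
S_y = np.eye(2)                            # decorrelated target statistics

# Batch-norm style matching: scale each feature by sigma_y / sigma_x.
scale = np.sqrt(np.diag(S_y) / np.diag(S_x))
S_bn = np.diag(scale) @ S_x @ np.diag(scale)

# The per-feature variances now match S_y, but the 0.9 correlation survives.
```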
Finally, CovNorm can be compared to the SVD, $A = USV^T$. From (3)-(4), this holds whenever $U = P_y$, $V = P_x$, and $S = \Lambda_y^{1/2}\Lambda_x^{-1/2}$. The problem is that the singular value matrix $S$ conflates the variances of the input and output PCAs. The fact that $s_i = \sqrt{\lambda_{y_i}/\lambda_{x_i}}$ has two important consequences. First, it is impossible to recover the dimensions $k_x$ and $k_y$ by inspection of the singular values. Second, the low-rank criterion of selecting the largest singular values is not equivalent to CovNorm. For example, the principal components of $x$ with the largest eigenvalues $\lambda_{x_i}$ have the smallest singular values $s_i$. Hence, it is impossible to tell whether singular vectors of small singular value are the most important (PCA components of large variance for $x$) or the least important (noise). Conversely, the largest singular values can simply signal the least important input dimensions. CovNorm eliminates this problem by explicitly selecting the important input and output dimensions.

3.7 Joint training
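The conflation is easy to see in a diagonal toy example (made-up eigenvalues, principal components assumed aligned so that $A = \Lambda_y^{1/2}\Lambda_x^{-1/2}$): the largest singular value belongs to the noise direction.

```python
import numpy as np

# Input/output PCA eigenvalues: dimension 0 is informative, dimension 1 is noise.
lam_x = np.array([100.0, 1e-4])
lam_y = np.array([100.0, 1e-2])

# With aligned components, A is diagonal and its singular values are
# sqrt(lam_y / lam_x): they mix input and output variances.
s = np.sqrt(lam_y / lam_x)

# Keeping the largest singular value would keep the NOISE direction
# (s = 10) and discard the informative one (s = 1).
```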
[35] considered a variant of MDL where the different tasks of Figure 1 are all optimized simultaneously. This is the same as assuming that a joint dataset $D = \cup_i D_i$ is available. For CovNorm, the only difference with respect to the single-dataset setting is that the PCAs are now those of the joint data $D$. These can be derived from the PCAs of the individual target datasets $D_i$ with

$\mu = \sum_i \frac{|D_i|}{|D|}\mu_i, \quad \Sigma = \sum_i \frac{|D_i|}{|D|}\left(\Sigma_i + \mu_i\mu_i^T\right) - \mu\mu^T$    (10)

where $|D|$ is the cardinality of $D$. Hence, CovNorm can be implemented by fine-tuning $A$ to each $D_i$, storing the PCAs, using (10) to reconstruct the mean and covariance of $D$, and computing the global PCA. When tasks are available sequentially, this can be done recursively, combining the PCA of all previous data with the PCA of the new data. In summary, CovNorm can be extended to any number of tasks, with constant storage requirements (a single PCA), and no loss of optimality. This makes it possible to define two CovNorm modes.
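A sketch of the moment merging in (10) (synthetic data, two hypothetical target sets): per-dataset means and ML covariances combine into exactly the moments of the pooled data.

```python
import numpy as np

rng = np.random.default_rng(3)
D1 = rng.standard_normal((300, 4)) + 1.0   # hypothetical dataset 1
D2 = 2.0 * rng.standard_normal((500, 4))   # hypothetical dataset 2
D = np.vstack([D1, D2])                    # the joint dataset

def moments(X):
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)   # ML (1/N) covariance
    return mu, Sigma, len(X)

(m1, S1, n1), (m2, S2, n2) = moments(D1), moments(D2)
w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)          # |D_i| / |D|

# Merge first and second moments of the individual datasets, as in (10).
mu = w1 * m1 + w2 * m2
S = (w1 * (S1 + np.outer(m1, m1)) + w2 * (S2 + np.outer(m2, m2))
     - np.outer(mu, mu))

mu_direct, S_direct, _ = moments(D)        # moments of the pooled data
```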

independent: the adaptation layers of each network are adapted to its target dataset $D_i$. A PCA is computed for $D_i$ and the mini-adaptation layer fine-tuned to it. This requires the full set of recoloring parameters (per layer) to be task-specific, per dataset.

joint: a global PCA is learned from the joint dataset $D$ and shared across tasks. Only a mini-adaptation layer is fine-tuned per $D_i$, so only its $k_x k_y$ parameters (per layer) are task-specific, per dataset. All $D_i$ must be available simultaneously.
The independent model is needed if, for example, the devices of Figure 1 are produced by different manufacturers.
4 Experiments
In this section, we present results for both the independent and joint CovNorm modes.
Dataset: [34] proposed the decathlon dataset for the evaluation of MDL. However, this is a collection of relatively small datasets. While sufficient to train small networks, we found it hard to use with larger CNNs. Instead, we used a collection of seven popular vision datasets. SUN397 [47] is a scene classification dataset with 397 classes. MITIndoor [46] is an indoor scene dataset with 67 classes. FGVC-Aircraft [26] is a fine-grained classification dataset of airplane types. Flowers102 [32] is a fine-grained dataset with 102 flower categories. CIFAR100 [20] contains 32x32 tiny images from 100 classes. Caltech256 [12] contains images of 256 object categories, with at least 80 samples per class. SVHN [31] is a digit recognition dataset with 10 classes. In all cases, images are resized to the network input size, and the training and testing splits defined by the dataset are used, if available. Otherwise, fixed fractions of the data are used for training and testing.
Implementation: In all experiments, the fixed layers were extracted from a source VGG16 [42] model trained on ImageNet. This has 13 convolutional layers, with feature dimensions ranging from 64 to 512. In a set of preliminary experiments, we compared the MDL performance of the architecture of Figure 1 with these layers and adaptation layers implemented with 1) a $1 \times 1$ convolutional layer [38], 2) the residual adapters of [34], which combine batch normalization layers with a convolution as in 1), and 3) the parallel adapters of [35]. Since residual adapters produced the best results, we adopted this structure in all our experiments. However, CovNorm can be used with any of the other structures, or any other adaptation matrix. Note that the adapter could be absorbed into the fixed layer after fine-tuning, but we have not done so, for consistency with [34].
In all experiments, fine-tuning used a fixed initial learning rate, reduced when the loss stopped decreasing. After fine-tuning the residual layer, features were extracted at the input and output of the adapter, and the PCAs computed and used in Algorithm 1. Principal components were selected by the explained variance criterion. Once the eigenvalues were computed and sorted by decreasing magnitude, i.e. $\lambda_1 \geq \lambda_2 \geq \ldots$, the variance explained by the first $k$ eigenvalues is $v_k = \sum_{i=1}^{k} \lambda_i / \sum_i \lambda_i$. Given a threshold $T$, the smallest index $k$ such that $v_k \geq T$ was determined, and only the first $k$ eigenvalues/eigenvectors were kept. This set the dimensions $k_x$ and $k_y$ (depending on whether the procedure was applied to the input or the output of the layer). Unless otherwise noted, a single value of $T$, retaining most of the variance, was used.
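The selection rule can be sketched as follows (hypothetical eigenvalues and threshold, for illustration only):

```python
import numpy as np

def select_dim(eigvals, T):
    """Smallest k whose leading eigenvalues explain a fraction >= T of the variance."""
    lam = np.sort(np.asarray(eigvals))[::-1]   # decreasing magnitude
    v = np.cumsum(lam) / np.sum(lam)           # v_k = explained-variance ratio
    return int(np.searchsorted(v, T) + 1)

lam = [5.0, 3.0, 1.0, 0.5, 0.3, 0.2]           # made-up PCA eigenvalues
k = select_dim(lam, T=0.9)                      # the first 3 explain 90%
```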
ImNet  Airc  C100  DPed  DTD  GTSR  Flwr  OGlt  SVHN  UCF  avg acc  S  #par  
RA [34]  59.67%  61.87%  81.20%  93.88%  57.13%  97.57%  81.67%  89.62%  96.13%  50.12%  76.89%  2621  2 
DAN [39]  57.74%  64.12%  80.07%  91.3%  56.54%  98.46%  86.05%  89.67%  96.77%  49.38%  77.01%  2851  2.17 
Piggyback [27]  57.69%  65.29%  79.87%  96.99%  57.45%  97.27%  79.09%  87.63%  97.24%  47.48%  76.6%  2838  1.28 
CovNorm  60.37%  69.37%  81.34%  98.75%  59.95%  99.14%  83.44%  87.69%  96.55%  48.92%  78.55%  3713  1.25 
Benefits of CovNorm: We start with some independent MDL experiments that provide insight on the benefits of CovNorm over previous MDL procedures. While we only report results for MITIndoor and CIFAR100, they are typical of all target datasets. Figure 6 shows the ratio $k_y/k_x$ of effective output to input dimensions, as a function of the adaptation layer. It shows that the input of an adaptation layer typically contains more information than the output: the ratio is rarely one, almost always below it, and smallest for the top network layers.
We next compared CovNorm to batch normalization (BN) [2] and to geometric approximations based on the fine-tuned approximation (FTA) of Section 3.3. We also tested a mix of the geometric approaches (SVD+FTA), where $A$ was first approximated by the SVD and the resulting matrices fine-tuned on the target dataset, and a mix of PCA and FTA (PCA+FTA), where the mini-adaptation layer of CovNorm was removed and the PCA matrices fine-tuned on the target dataset, to minimize the PCA alignment problem. All geometric approximations were implemented with a range of low-rank values $k$, expressed as fractions of the layer dimension $c$. For CovNorm, the explained variance threshold $T$ was varied over a range of values. Figure 6 shows recognition accuracies vs. the percentage of parameters. Here, 100% corresponds to the adaptation layers of [34]: a network with residual adapters whose matrix is fine-tuned on the target dataset. This is denoted RA and shown as an upper bound. A second upper bound is shown for full network fine-tuning (FNFT), which requires many more parameters than RA. BN, which requires close to zero parameters, is shown as a lower bound.
Several observations are possible. First, all geometric approximations underperform CovNorm. For comparable sizes, the accuracy drop of the best geometric method (SVD+FTA) can be substantial. This is partly due to the use of a constant low rank throughout the network, which cannot match the effective, data-dependent, dimensions that vary across layers (see Figure 6). CovNorm eliminates this problem. We experimented with heuristics for choosing variable ranks but, as discussed below (Figure 7), could not achieve good performance. Among the geometric approaches, SVD+FTA outperforms FTA, which has performance drops on most datasets. It is interesting that, while $A$ is fine-tuned from random initialization, the process is not effective for the low-rank matrices of FTA. In several datasets, FTA could not match SVD+FTA. Even more surprising were the weaker results obtained when the random initialization was replaced by the two PCAs (PCA+FTA). Note the large difference between PCA+FTA and CovNorm, which differ only by the mini-adaptation layer $M$. This is explained by the alignment problem of Section 3.5. Interestingly, while mini-adaptation layers are critical to overcome this problem, they are as easy to fine-tune as the full matrix $A$. In fact, the addition of these layers (CovNorm) often outperformed the full matrix (RA). In some datasets, like MITIndoor, CovNorm matched the performance of RA with a small fraction of the parameters. Finally, as previously reported by [34], FNFT frequently underperformed RA. This is likely due to overfitting.
CovNorm vs SVD: Figure 7 provides empirical evidence for the vastly different quality of the approximations produced by CovNorm and the SVD. The figure shows a plot of the variance explained by the eigenvalues of the input and output distributions of an adaptation layer, and the corresponding plot for its singular values. Note how the PCA energy is packed into a much smaller number of coefficients than the singular value energy. This happens because the PCA only accounts for the subspaces populated by the data, restricting the low-rank approximation to these subspaces. Conversely, the geometric approximation must approximate the matrix behavior even outside of these subspaces. Note that the SVD is not only less efficient at identifying the important dimensions, but also makes it difficult to determine how many singular values to keep. This prevents the use of a layer-dependent number of singular values.
Comparison to previous methods: Table 1 summarizes the recognition accuracy and the percentage of adaptation layer parameters relative to the VGG model size, for various methods. All abbreviations are as above. Beyond MDL, we compare to learning without forgetting (LwF) [23], a lifelong learning method that shares all parameters among datasets. The table is split into independent and joint MDL. For joint learning, CovNorm is implemented with (10) and compared to the SVD approach of [35].
Several observations can be made. First, CovNorm adapts the number of parameters to the task, according to its complexity and how different it is from the source (ImageNet). For the simplest datasets, such as the digit recognition of SVHN, adaptation requires very few task-specific parameters. Datasets that are more diverse but ImageNet-like, such as Caltech256, require more. Finally, the largest adaptation layers are required by datasets that are either complex or quite different from ImageNet, e.g. the scene recognition tasks (MITIndoor, SUN397). Even here, adaptation requires only a small fraction of the model parameters, and the average overhead per dataset is small.
Second, for independent learning, all methods based on residual adapters significantly outperform BN and LwF. As shown by [34], RA outperforms FNFT. BN is uniformly weak; LwF performs very well on MITIndoor and Caltech256, but poorly on most other datasets. Third, CovNorm outperforms even RA, achieving higher recognition accuracy with fewer parameters. It also outperforms SVD+FTA and FTA while reducing parameter counts by a large factor. On a per-dataset basis, CovNorm outperforms RA on all datasets other than CIFAR100, and SVD+FTA and FTA on all of them. In all datasets, the parameter savings are significant. Fourth, for joint training, CovNorm is substantially superior to the SVD [35], with higher recognition rates on all datasets, the largest gains on SUN397, and far fewer parameters. Finally, comparing independent and joint CovNorm, the latter has slightly higher recognition for a slightly higher parameter count. Hence, the two approaches are roughly equivalent.
Results on Visual Decathlon: Table 2 presents results on the Decathlon challenge [34], composed of ten different datasets of small images (). Models are trained on the combined training and validation sets, and results are obtained from the online evaluation server. For fair comparison, we use the learning protocol of [34]. CovNorm achieves state-of-the-art performance in terms of classification accuracy, parameter size, and decathlon score.
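For reference, the decathlon score aggregates per-domain test errors into a single number. A minimal sketch consistent with the scoring rule of [34] (the function name and NumPy usage are ours): each domain contributes at most 1000 points, the maximum admissible error is twice that of a reference baseline, and the baseline itself scores 250 per domain.

```python
import numpy as np

def decathlon_score(errors, baseline_errors, gamma=2.0):
    """Visual Decathlon score over a list of domains.

    errors: per-domain test errors of the evaluated model (fractions).
    baseline_errors: per-domain test errors of the reference baseline.
    A domain scores alpha * max(0, E_max - E)^gamma, with E_max twice
    the baseline error and alpha chosen so a perfect model gets 1000.
    """
    errors = np.asarray(errors, dtype=float)
    e_max = 2.0 * np.asarray(baseline_errors, dtype=float)
    alpha = 1000.0 * e_max ** (-gamma)  # normalizer: zero error yields 1000/domain
    return float(np.sum(alpha * np.maximum(0.0, e_max - errors) ** gamma))
```

A perfect model over ten domains thus scores 10,000, while matching the baseline on every domain gives 2,500.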
5 Conclusion
CovNorm is an MDL technique with a very simple implementation. Compared to previous methods, it dramatically reduces the number of adaptation parameters without loss of recognition performance. It was used to show that large CNNs can be “recycled” across problems as diverse as digit, object, scene, or fine-grained recognition, with no loss, by tuning only a small fraction of their parameters.
6 Acknowledgments
This work was partially funded by NSF awards IIS-1546305 and IIS-1637941, a GRO grant from Samsung, and NVIDIA GPU donations.
References
 [1] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In CVPR, pages 7120–7129, 2017.
 [2] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.
 [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.
 [4] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
 [5] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò. AutoDIAL: Automatic domain alignment layers. In ICCV, pages 5077–5085, 2017.
 [6] R. Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
 [7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
 [8] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. International Conference on Machine Learning, 2014.
 [9] R. Girshick. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.
 [10] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with R*CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1080–1088, 2015.
 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [12] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [14] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
 [15] J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In Proceedings of the IEEE International Conference on Computer Vision, pages 1062–1070, 2015.
 [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [17] B. Jou and S.-F. Chang. Deep cross residual learning for multitask visual recognition. In Proceedings of the 2016 ACM on Multimedia Conference, pages 998–1007. ACM, 2016.
 [18] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 3, 2017.
 [19] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, volume 2, page 8, 2017.
 [20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [21] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [22] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4655–4665, 2017.
 [23] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [24] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. International Conference on Machine Learning, 2015.
 [25] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In CVPR, volume 1, page 6, 2017.
 [26] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
 [27] A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
 [28] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Boosting domain adaptation by discovering latent domains. arXiv preprint arXiv:1805.01386, 2018.
 [29] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
 [30] P. Morgado and N. Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, volume 9, page 10, 2017.
 [31] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [32] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.
 [33] R. Ranjan, V. M. Patel, and R. Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [34] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516, 2017.
 [35] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. arXiv preprint arXiv:1803.10082, 2018.
 [36] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.
 [37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
 [38] A. Rosenfeld and J. K. Tsotsos. Incremental learning through deep adaptation. arXiv preprint arXiv:1705.04228, 2017.
 [39] A. Rosenfeld and J. K. Tsotsos. Incremental learning through deep adaptation. IEEE transactions on pattern analysis and machine intelligence, 2018.
 [40] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 [41] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, volume 2, page 5, 2017.
 [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [43] B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
 [44] A. R. Triki, R. Aljundi, M. B. Blaschko, and T. Tuytelaars. Encoder based lifelong learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [45] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
 [46] M. Valenti, B. Bethke, D. Dale, A. Frank, J. McGrew, S. Ahrens, J. P. How, and J. Vian. The MIT indoor multi-vehicle flight testbed. In Robotics and Automation, 2007 IEEE International Conference on, pages 2758–2759. IEEE, 2007.
 [47] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3485–3492. IEEE, 2010.
 [48] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
 [49] Y. Zhang and Q. Yang. A survey on multitask learning. arXiv preprint arXiv:1707.08114, 2017.
 [50] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014.