Convolutional neural networks (CNNs) have enabled transformational advances in classification, object detection, and segmentation, among other tasks. However, they have non-trivial complexity: state-of-the-art models contain millions of parameters and require implementation on expensive GPUs. This creates problems for applications with computational constraints, such as mobile devices or consumer electronics. Figure 1 illustrates the problem in the context of a smart home equipped with an ecology of devices, such as a camera that monitors package delivery and theft, a fridge that keeps track of its contents, a treadmill that adjusts fitness routines to the facial expression of the user, or a baby monitor that tracks the state of a baby. As devices are added to the ecology, the GPU server in the house must switch between a larger number of classification, detection, and segmentation tasks. Similar problems will be faced by mobile devices, robots, smart cars, etc.
Under the current deep learning paradigm, this task switching is difficult to perform. The predominant strategy is to use a different CNN to solve each task. Since only a few models can be cached on the GPU, and moving models in and out of cache adds too much overhead to enable real-time task switching, there is a need for very efficient parameter sharing across tasks. The individual networks should share most of their parameters, which would always reside on the GPU. A remaining small number of task-specific parameters would be switched per task. This problem is known as multi-domain learning (MDL) and has been addressed with the architecture of Figure 1 [34, 38]. This consists of a set of fixed layers shared by all tasks and a set of task-specific adaptation layers fine-tuned to each task. If the adaptation layers are much smaller than the fixed layers, many models can be cached simultaneously. Ideally, the fixed layers should be pre-trained, e.g. on ImageNet, and used by all tasks without additional training, enabling the use of special-purpose chips to implement the majority of the computations. While adaptation layers would still require a processing unit, the small amount of computation could enable the use of a CPU, making it cost-effective to implement each network on the device itself.
In summary, MDL aims to maximize the performance of the network ecology while minimizing the ratio of task-specific to total parameters per network. [34, 38] have shown that the architecture of Figure 1 can match the performance of fully fine-tuning each network in the ecology, even when adaptation layers contain only a small fraction of the total parameters. In this work, we show that adaptation layers can be shrunk substantially further, using a data-driven low-rank approximation. As illustrated in Figure 2, this is based on transformations that match the second-order statistics of the layer inputs and outputs. Given principal component analyses (PCAs) of both input and output, the layer is approximated by a recoloring transformation: a projection into the input PCA space, followed by a reconstruction into the output PCA space. By controlling the intermediate PCA dimensions, the method enables low-dimensional approximations of different input and output dimensions. To correct the mismatch (between principal components) of two PCAs learned independently, a small mini-adaptation layer is introduced between the two PCA matrices and fine-tuned on the target task.
Since the overall transformation generalizes batch normalization, the method is denoted covariance normalization (CovNorm). CovNorm is shown to outperform, with both theoretical and experimental arguments, purely geometric methods for matrix approximation, such as the singular value decomposition (SVD), fine-tuning of the original layers [34, 38], or adaptation based on batch normalization. It is also quite simple, requiring two PCAs and the fine-tuning of a very small mini-adaptation layer per layer and task. Experimental results show that it can outperform full network fine-tuning while reducing adaptation layers to a small fraction of the total parameters. When all tasks can be learned together, adaptation layers can be reduced even further. This is achieved by combining the individual PCAs into a global PCA model shared by all tasks, and only fine-tuning mini-adaptation layers in a task-specific manner.
2 Related work
MDL is a transfer learning problem, namely the transfer of a model trained on a source learning problem to an ecology of target problems. This makes it related to several types of transfer learning problems, which differ mostly in terms of input space, or domain, and range space, or task.
Task transfer: Task transfer addresses the use of a model trained on a source task for the solution of a target task. The two tasks can be defined on the same or different domains. Task transfer is prevalent in deep learning, where a CNN pre-trained on a large source dataset, such as ImageNet, is usually fine-tuned to a target task. While extremely effective and popular, full network fine-tuning changes most network parameters, frequently all of them. MDL addresses this problem by considering multiple target tasks and extensive parameter sharing between them.
Domain Adaptation: In domain adaptation, the source and target tasks are the same, and a model trained on a source domain is transferred to a target domain. Domain adaptation can be supervised, in which case labeled data is available for the target domain, or unsupervised, where it is not. Various strategies have been used to address these problems. Some methods seek the network parameters that minimize some function of the distance between feature distributions in the two domains [24, 4, 43]. Others introduce an adversarial loss that maximizes the confusion between the two domains [8, 45]. A few methods have also proposed to do the transfer at the image level, e.g. using GANs to map source images into (labeled) target images, which are then used to learn a target classifier [3, 41, 14]. All these methods exploit the commonality of the source and target tasks to align source and target domains. This is unlike MDL, where source and target tasks are different. Nevertheless, some mechanisms proposed for domain adaptation can be used for MDL. For example, [5, 28] use a batch normalization layer to match the statistics of source and target data, in terms of means and standard deviations. This is similar to an early proposal for MDL. We show that these mechanisms underperform covariance normalization.
Multitask learning: Multitask learning [6, 49] addresses the solution of multiple tasks by the same model. It assumes that all tasks have the same visual domain. Popular examples include classification and bounding box regression in object detection [9, 37], joint estimation of surface normals and depth or segmentation, and joint representation in terms of attributes and facial landmarks [50, 33], among others. Multitask learning is sometimes also used to solve auxiliary tasks that strengthen the performance of a task of interest, e.g. by accounting for context, or representing objects in terms of classes and attributes [15, 29, 30, 25]. Recently, there have been attempts to learn models that solve many problems jointly [18, 19, 48].
Most multitask learning approaches emphasize the learning of the interrelationships between tasks. This is frequently accomplished by using a single network, combining domain-agnostic lower-level network layers with task-specific network heads and loss functions [50, 7, 10, 15, 37, 19], or some more sophisticated form of network branching. The branching architecture is incompatible with MDL, where each task has its own input, different from those of all other tasks. Even when multitask learning is addressed with multiple tower networks, the emphasis tends to be on inter-tower connections, e.g. through cross-stitching [29, 17]. In MDL, such connections are not feasible, because different networks can join the ecology of Figure 1 asynchronously, as devices are turned on and off.
Lifelong learning: Lifelong learning aims to learn multiple tasks sequentially with a shared model. This can be done by adapting the parameters of a network or adapting the network architecture. Since training data is discarded upon its use, constraints are needed to force the model to remember what was previously learned. Methods that only change parameters use the model outputs on previous tasks, previous parameter values, or previous network activations to regularize the learning of the target task. They are very effective at parameter sharing, since a single model solves all tasks. However, this model is not optimal for any specific task, and can perform poorly on all tasks, depending on the mismatch between source and target domains. We show that they can significantly underperform MDL with CovNorm. Methods that adapt the network architecture usually add a tower per new task [40, 1]. These methods have much larger complexity than MDL, since several towers can be needed to solve a single task, and there is no sharing of fixed layers across tasks.
Multi-domain learning: An early proposal for MDL was to adapt only batch normalization layers, i.e. a small number of parameters tuned per task. While performing well on simple datasets, this does not have enough degrees of freedom to support the transfer of large CNNs across different domains. More powerful architectures were later proposed, based on a convolutional adaptation layer or on a ResNet-style residual layer, known as a residual adaptation (RA) module. These methods were shown to perform surprisingly well in terms of recognition accuracy, equaling or surpassing the performance of full network fine-tuning, but can still require a substantial number of adaptation parameters relative to the network size. This problem has been addressed by combining the adapters of multiple tasks into a large matrix, which is approximated with an SVD and then fine-tuned on each target dataset. Compressing adaptation layers in this way was shown to reduce adaptation parameter counts by roughly half. However, all tasks have to be optimized simultaneously. We show that CovNorm enables a further ten-fold reduction in adaptation layer parameters, without this limitation, although some additional gains are possible with joint optimization.
3 MDL by covariance normalization
In this section, we introduce the CovNorm procedure for MDL with deep networks.
3.1 Multi-domain learning
Figure 3 a) motivates the use of adaptation layers in MDL. The figure depicts two fixed weight layers and a non-linear layer in between. Since the fixed layers are pre-trained on a source dataset, typically ImageNet, all weights are optimized for the source statistics. For standard losses, such as cross-entropy, this is a maximum likelihood (ML) procedure that matches the weights to the statistics of the intermediate activations on the source. However, when the CNN is used on a different target domain, the statistics of these variables change and the weights are no longer an ML solution. Hence, the network is sub-optimal and must be fine-tuned on a target dataset. This is denoted full network fine-tuning and converts the network into an ML solution for the target, with the outcome of Figure 3 b). In the target domain, the intermediate random variables change, and the weights are changed accordingly.
While very effective, this procedure has two drawbacks, which follow from updating all weights. First, it can be computationally expensive, since modern CNNs have large weight matrices. Second, because the new weights are not optimal for the source, i.e. the CNN forgets the source task, there is a need to store and implement two CNNs to solve both tasks. This is expensive in terms of storage and computation and increases the complexity of managing the network ecology. A device that solves both tasks must store two CNNs and load them in and out of cache when it switches between the tasks. These problems are addressed by the MDL architecture of Figure 1, which is replicated in greater detail in Figure 3 c). It introduces an adaptation layer and fine-tunes this layer only, leaving the fixed layers unchanged. In this case, the input statistics are those of the target domain, while the fixed layers remain matched to the source. Since the first fixed layer is unchanged, nothing can be done about its output statistics. However, the fine-tuning of the adaptation layer encourages the statistics of its output to match those required by the subsequent fixed layer. Even if the adaptation layer cannot match these statistics exactly, the mismatch is reduced by repeating the procedure in subsequent layers, e.g. introducing a second adaptation layer after the next fixed layer, and optimizing the adaptation matrices as a whole.
3.2 Adaptation layer size
Obviously, MDL has limited interest if the adaptation layer has size similar to that of the fixed layers. In this case, each domain has as many adaptation parameters as the original network, all networks have twice the size, task switching is complex, and training complexity is equivalent to full fine-tuning of the original network. On the other hand, if the adaptation layer is much smaller than the fixed layers, MDL is computationally light and task switching much more efficient. In summary, the goal is to introduce an adaptation layer that is as small as possible, but still powerful enough to match the required statistics. A simple solution is to make the adaptation layer a batch normalization layer. This was proposed previously but, as discussed below, is not effective. To overcome this problem, subsequent work proposed a linear transformation, or adopted a residual structure, i.e. an adaptation layer of the form identity plus a learned transform. To maximize parameter savings, the transform was implemented with a 1x1 convolutional layer in both cases.
This can, however, still require a non-trivial number of parameters, especially in upper network layers, where the fixed layers convolve small kernels with many feature maps: because the kernel support is small and the channel count large, a 1x1 adaptation layer can be only marginally smaller than the fixed layer itself. Redundancies across tasks have been exploited to address this problem, by creating a matrix with the adaptation layer parameters of multiple tasks and computing a low-rank approximation of this matrix with an SVD. The compression achieved with this approximation is limited, because the approximation is purely geometric, taking no account of the statistics of the layer's input and output. In this work, we propose a more efficient solution, motivated by the interpretation of the adaptation layer as converting the statistics of its input into those of its output. It is assumed that fine-tuning the adaptation layer produces an output variable whose statistics match those of the fully fine-tuned network. This could leverage adaptation layers elsewhere in the network, but that is not important for the discussion that follows; the only assumption is that the output statistics match. The goal is to replace the adaptation layer by a simpler matrix that maps its input into its output. For simplicity, we drop the primes and notation of Figure 3 in what follows, considering the problem of matching statistics between the input and output of a matrix.
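To make the size comparison concrete, here is a small sketch of the parameter counts of a fixed 3x3 convolutional layer versus a 1x1 adaptation layer with the same channel count. The 512-channel dimensions are hypothetical VGG-style values chosen for illustration, not figures from the paper:

```python
# Illustrative parameter counts: a fixed 3x3 conv layer vs. a 1x1 adaptation
# layer with the same (hypothetical) 512 input/output channels.
def conv_params(in_ch, out_ch, k):
    """Weight count of a conv layer with k x k kernels (biases ignored)."""
    return in_ch * out_ch * k * k

fixed = conv_params(512, 512, 3)    # fixed layer: 3x3 kernels
adapter = conv_params(512, 512, 1)  # 1x1 adaptation layer
ratio = adapter / fixed             # only a ~1/9 reduction over the fixed layer
```

Since a 1x1 adapter saves only the kernel-support factor, further compression must come from reducing the channel dimensions itself, which is what the low-rank methods below attempt.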
3.3 Geometric approximations
One possibility is to use a purely geometric solution. Geometrically, the closest low-rank approximation of a matrix is given by the SVD: the minimum Frobenius norm approximation of a given rank is obtained by keeping the largest singular values and the corresponding singular vectors. This can be written as the product of two low-rank matrices, with a total parameter count much smaller than that of the original matrix when the rank is small. An even simpler solution is to define two low-rank matrices directly, replace the adaptation layer by their product in Figure 3 c), and fine-tune the two matrices instead. We denote this as the fine-tuned approximation (FTA). These approaches are limited by their purely geometric nature. Note that the output dimension is determined by the source model and fixed. On the other hand, the intermediate (rank) dimension should depend on the target dataset. Intuitively, if the target dataset is much smaller than the source, or if the target task is much simpler, it should be possible to use a smaller rank than otherwise. There is also no reason to believe that a single rank, or even a single rank ratio, is suitable for all network layers. While the rank could be found by cross-validation, this becomes expensive when there are multiple adaptation layers throughout the CNN. We next introduce an alternative, data-driven, procedure that bypasses these difficulties.
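As a concrete sketch of the geometric approach, the following NumPy snippet forms the rank-k SVD factorization described above (the Eckart-Young optimal Frobenius-norm approximation). The matrix sizes and the helper name are illustrative assumptions:

```python
import numpy as np

def svd_lowrank(W, k):
    """Rank-k factorization W ~= B @ A via the SVD (geometric approximation).
    Returns A (k x n) and B (m x k), totaling (m + n) * k parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :k] * s[:k]   # m x k, columns scaled by the singular values
    A = Vt[:k, :]          # k x n
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
A, B = svd_lowrank(W, 64)  # full rank: the reconstruction is exact
assert np.allclose(B @ A, W)
```

Truncating to a smaller k trades Frobenius error for parameters, but note that k must be chosen by hand, which is exactly the limitation discussed above.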
3.4 Covariance matching
Assume that, as illustrated in Figure 2, $x$ and $y$ are Gaussian random variables of means $\mu_x, \mu_y$ and covariances $\Sigma_x, \Sigma_y$, respectively, related by $y = Ax$. Let the covariances have eigendecompositions
$$\Sigma_x = P_x \Lambda_x P_x^T, \qquad \Sigma_y = P_y \Lambda_y P_y^T, \quad (1)$$
where $P_x, P_y$ contain eigenvectors as columns and $\Lambda_x, \Lambda_y$ are diagonal eigenvalue matrices. We refer to the triplet $(\mu, P, \Lambda)$ as the PCA of the corresponding variable. Then, it is well known that the statistics of $x$ and $y$ are related by
$$\mu_y = A \mu_x, \qquad \Sigma_y = A \Sigma_x A^T, \quad (2)$$
and that the covariance relation holds, in particular, for
$$A = P_y \Lambda_y^{1/2} \Lambda_x^{-1/2} P_x^T, \quad (3)$$
which can be factored as
$$A = C_y W_x, \qquad W_x = \Lambda_x^{-1/2} P_x^T, \qquad C_y = P_y \Lambda_y^{1/2}, \quad (4)$$
where $W_x$ is the "whitening matrix" of $x$ and $C_y$ the "coloring matrix" of $y$. It follows that (2) holds if $A$ is implemented with a sequence of two operations. First, $x$ is mapped into a variable $z$ of zero mean and identity covariance, by defining
$$z = W_x (x - \mu_x) = \Lambda_x^{-1/2} P_x^T (x - \mu_x). \quad (5)$$
Second, $z$ is mapped into $y$ with
$$y = C_y z + \mu_y = P_y \Lambda_y^{1/2} z + \mu_y. \quad (6)$$
In summary, for Gaussian $x$, the effect of $A$ is simply the combination of a whitening of $x$ followed by a colorization with the statistics of $y$.
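The whitening-then-coloring factorization can be checked numerically. The sketch below builds the recoloring matrix from the two PCAs and verifies that it reproduces the output covariance; zero-mean data is assumed for simplicity and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
W = rng.standard_normal((n, n))     # the layer mapping x to y
X = rng.standard_normal((n, 5000))  # zero-mean inputs x (columns are samples)
Y = W @ X                           # outputs y = Wx

def pca(Z):
    """Eigendecomposition (P, lam) of the (uncentered) sample covariance."""
    cov = Z @ Z.T / Z.shape[1]
    lam, P = np.linalg.eigh(cov)
    return P, lam

Px, lx = pca(X)
Py, ly = pca(Y)
# Recoloring: whiten x with its PCA, then color with the PCA of y.
M = (Py * np.sqrt(ly)) @ (Px / np.sqrt(lx)).T
# By construction, M x has the same covariance as y.
cov_Mx = M @ (X @ X.T / X.shape[1]) @ M.T
cov_y = Y @ Y.T / Y.shape[1]
assert np.allclose(cov_Mx, cov_y, atol=1e-8)
```

Note that the covariance match holds for any alignment of the two eigenbases, which is why the alignment itself must be handled separately, as discussed next.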
3.5 Covariance normalization
The interpretation of the adaptation layer as a recoloring operation (whitening + coloring) sheds light on the number of parameters effectively needed for the adaptation, since the PCAs capture the effective dimensions of $x$ and $y$. Let $k_x$ ($k_y$) be the number of eigenvalues significantly larger than zero in $\Lambda_x$ ($\Lambda_y$). Then, the whitening and coloring matrices can be approximated by
$$W_x \approx \Lambda_{x,k}^{-1/2} P_{x,k}^T, \qquad C_y \approx P_{y,k} \Lambda_{y,k}^{1/2}, \quad (7)$$
where $\Lambda_{x,k}$ ($\Lambda_{y,k}$) contains the $k_x$ ($k_y$) non-zero eigenvalues of $\Lambda_x$ ($\Lambda_y$), and $P_{x,k}$ ($P_{y,k}$) the corresponding eigenvectors. Hence, for an $n$-dimensional layer, $A$ is well approximated by a pair of matrices ($W_x$ of size $k_x \times n$, $C_y$ of size $n \times k_y$) totaling $n(k_x + k_y)$ parameters.
On the other hand, the PCAs are only defined up to a permutation, which assigns an ordering to eigenvalues/eigenvectors. When the input and output PCAs are computed independently, the principal components may not be aligned. This can be fixed by introducing a permutation matrix between the coloring matrix $C_y$ and the whitening matrix $W_x$ in (4). The assumption that all distributions are Gaussian also only holds approximately in real networks. To account for all this, we augment the recoloring operation with a mini-adaptation layer $R$ of size $k_y \times k_x$, where $k_x$ and $k_y$ are the numbers of retained input and output principal components. This leads to the covariance normalization (CovNorm) transform
$$A \approx C_y R W_x, \quad (8)$$
where $R$ is learned by fine-tuning on the target dataset. Beyond improving recognition performance, this has the advantage of further parameter savings. The direct implementation of (8) increases the parameter count to $n k_x + k_x k_y + n k_y$. However, after fine-tuning, $R$ can be absorbed into one of the two other matrices, as shown in Figure 4. When $k_y < k_x$, the product $R W_x$ has dimension $k_y \times n$, and replacing the two matrices by their product reduces the total parameter count to $2 n k_y$. In this case, we say that $R$ is absorbed into $W_x$. Conversely, if $k_x < k_y$, $R$ can be absorbed into $C_y$. Hence, the total parameter count is $2 n \min(k_x, k_y)$. CovNorm is summarized in Algorithm 1.
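The absorption of the mini-adaptation layer can be illustrated with a small NumPy sketch (the dimensions are hypothetical, and R denotes the mini-adaptation layer): folding R into the whitening factor leaves the overall transform unchanged while storing only two low-rank matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, kx, ky = 512, 32, 16
Wx = rng.standard_normal((kx, n))  # whitening projection (input PCA side)
R = rng.standard_normal((ky, kx))  # fine-tuned mini-adaptation layer
Cy = rng.standard_normal((n, ky))  # coloring reconstruction (output PCA side)

full = Cy @ R @ Wx                 # the complete CovNorm transform (n x n)
# Since ky < kx here, absorb R into the whitening side:
RWx = R @ Wx                       # ky x n
assert np.allclose(Cy @ RWx, full)
params = Cy.size + RWx.size        # 2 * n * min(kx, ky) stored parameters
```

With these illustrative numbers, the two stored factors hold 16,384 values versus 262,144 for the full n x n matrix, a 16x reduction.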
3.6 The importance of covariance normalization
The benefits of covariance matching can be seen by comparison to previously proposed MDL methods. Assume, first, that $x$ and $y$ consist of independent features. In this case, $P_x$ and $P_y$ are identity matrices and (5)-(6) reduce to
$$y_i = \sigma_{y,i} \frac{x_i - \mu_{x,i}}{\sigma_{x,i}} + \mu_{y,i}, \quad (9)$$
which is the batch normalization equation. Hence, CovNorm is a generalized form of the latter. There are, however, important differences. First, there is no batch: the normalizing distribution is now the distribution of the feature responses of the layer on the target dataset. Second, the goal is not to facilitate the learning of subsequent layers, but to produce a feature vector with statistics matched to the downstream fixed layers. This turns out to make a significant difference. Since, in regular batch normalization, the subsequent layer is allowed to change, it can absorb any initial mismatch with the independence assumption. This is not the case for MDL, where the subsequent layers are fixed. Hence, (9) usually fails, significantly underperforming (5)-(6).
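For intuition, the diagonal special case can be written out directly: with identity PCA bases, the recoloring collapses to per-feature standardization plus rescaling, i.e. the batch-normalization form of (9). A minimal sketch with made-up statistics (the helper name is illustrative):

```python
import numpy as np

def covnorm_diag(x, mu_x, var_x, mu_y, var_y):
    """Match diagonal Gaussian statistics: whiten x, then color as y."""
    z = (x - mu_x) / np.sqrt(var_x)   # whitening (standardization)
    return np.sqrt(var_y) * z + mu_y  # coloring with the target statistics

x = np.array([1.0, 2.0, 3.0])
out = covnorm_diag(x, mu_x=2.0, var_x=1.0, mu_y=0.0, var_y=4.0)
# out is [-2.0, 0.0, 2.0]: centered, then rescaled to the target variance
```

Unlike batch normalization, the statistics here are fixed population estimates on the target data rather than per-batch quantities.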
Next, consider the geometric solution. Since CovNorm reduces to the product of two tall low-rank matrices, it should be possible to replace it with the fine-tuned approximation based on two matrices of this size. Here, there are two difficulties. First, the appropriate rank is not known in the absence of the PCA decompositions. Second, in our experience, even when the rank is set to the value used by PCA, the fine-tuned approximation does not work. As shown in the experimental section, when the matrices are initialized with Gaussian weights, performance can decrease significantly. This is an interesting observation because the full adaptation matrix is itself initialized with Gaussian weights. It appears that a good initialization is more critical for the low-rank matrices.
Finally, CovNorm can be compared to the SVD of the adaptation matrix, $A = U S V^T$. From (3), this holds whenever $U = P_y$, $V = P_x$, and $S = \Lambda_y^{1/2} \Lambda_x^{-1/2}$. The problem is that the singular value matrix conflates the variances of the input and output PCAs. The fact that $S = \Lambda_y^{1/2} \Lambda_x^{-1/2}$ has two important consequences. First, it is impossible to recover the dimensions $k_x$ and $k_y$ by inspection of the singular values. Second, the low-rank criterion of selecting the largest singular values is not equivalent to CovNorm. For example, the principal components of $x$ with largest eigenvalues $\lambda_{x,i}$ have the smallest singular values $s_i = (\lambda_{y,i}/\lambda_{x,i})^{1/2}$. Hence, it is impossible to tell if singular vectors of small singular values are the most important (PCA components of large variance for $x$) or the least important (noise). Conversely, the largest singular values can simply signal the least important input dimensions. CovNorm eliminates this problem by explicitly selecting the important input and output dimensions.
3.7 Joint training
A variant of MDL has been considered where the different tasks of Figure 1 are all optimized simultaneously. This is the same as assuming that a joint dataset $\mathcal{D} = \cup_i \mathcal{D}_i$ of all target datasets is available. For CovNorm, the only difference with respect to the single-dataset setting is that the PCAs are now those of the joint data. These can be derived from the PCAs of the individual target datasets with
$$\mu = \frac{1}{N} \sum_i N_i \mu_i, \qquad \Sigma = \frac{1}{N} \sum_i N_i \left( \Sigma_i + \mu_i \mu_i^T \right) - \mu \mu^T, \quad (10)$$
where $N_i$ is the cardinality of $\mathcal{D}_i$ and $N = \sum_i N_i$. Hence, CovNorm can be implemented by fine-tuning to each $\mathcal{D}_i$, storing the PCAs, using (10) to reconstruct the covariances of $\mathcal{D}$, and computing the global PCA. When tasks are available sequentially, this can be done recursively, combining the PCA of all previous data with the PCA of the new data. In summary, CovNorm can be extended to any number of tasks, with constant storage requirements (a single PCA), and no loss of optimality. This makes it possible to define two CovNorm modes.
independent: the adaptation layers of each network are adapted to its target dataset. A PCA is computed for the dataset and the mini-adaptation layer fine-tuned to it. This requires storing the task-specific low-rank factors and mini-adaptation layer (per layer) per dataset.
joint: a global PCA is learned from the joint dataset and shared across tasks. Only a mini-adaptation layer is fine-tuned per dataset. This requires only the mini-adaptation parameters (per layer) per dataset, but all datasets must be available simultaneously.
The independent model is needed if, for example, the devices of Figure 1 are produced by different manufacturers.
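The reconstruction of joint statistics in (10) is easy to verify numerically. The sketch below (synthetic data, hypothetical helper name) merges per-dataset means and covariances and checks the result against directly pooling the samples:

```python
import numpy as np

def merge_gaussians(means, covs, counts):
    """Combine per-dataset means/covariances into the global mean/covariance
    of the pooled data, following the second-moment identity of (10)."""
    N = sum(counts)
    mu = sum(n * m for n, m in zip(counts, means)) / N
    second = sum(n * (C + np.outer(m, m))
                 for n, m, C in zip(counts, means, covs)) / N
    return mu, second - np.outer(mu, mu)

# Verify against directly pooling the samples of two synthetic datasets.
rng = np.random.default_rng(2)
A = rng.standard_normal((100, 3)) + 1.0
B = 2.0 * rng.standard_normal((200, 3))
mu, cov = merge_gaussians(
    [A.mean(0), B.mean(0)],
    [np.cov(A.T, bias=True), np.cov(B.T, bias=True)],
    [len(A), len(B)],
)
pooled = np.concatenate([A, B])
assert np.allclose(mu, pooled.mean(0))
assert np.allclose(cov, np.cov(pooled.T, bias=True))
```

Because the identity is exact, nothing is lost by keeping only per-dataset PCAs, which is what enables the constant-storage recursive update described above.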
4 Experiments
In this section, we present results for both the independent and joint CovNorm modes.
Dataset:  proposed the decathlon dataset for evaluation of MDL. However, this is a collection of relatively small datasets. While sufficient to train small networks, we found it hard to use with larger CNNs. Instead, we used a collection of seven popular vision datasets. SUN 397  contains classes of scene images and more than a million images. MITIndoor  is an indoor scene dataset with classes and samples per class. FGVC-Aircraft Benchmark  is a fine-grained classification dataset of images of types of airplanes. Flowers102  is a fine-grained dataset with flower categories and to images per class. CIFAR100  contains tiny images, from classes. Caltech256  contains images of object categories, with at least samples per class. SVHN  is a digit recognition dataset with classes and more than samples. In all cases, images are resized to and the training and testing splits defined by the dataset are used, if available. Otherwise, is used for training and for testing.
Implementation: In all experiments, fixed layers were extracted from a source VGG16 model trained on ImageNet. This has convolution layers of a wide range of dimensions. In a set of preliminary experiments, we compared the MDL performance of the architecture of Figure 1 with these layers and adaptation layers implemented with 1) a 1x1 convolutional layer, 2) residual adapters combining batch normalization layers with a convolution as in 1), and 3) parallel adapters. Since residual adapters produced the best results, we adopted this structure in all our experiments. However, CovNorm can be used with any of the other structures, or any other adaptation matrix. Note that the adapter could be absorbed into the fixed layer after fine-tuning, but we have not done so, for consistency with prior work.
In all experiments, fine-tuning used a fixed initial learning rate, reduced when the loss stopped decreasing. After fine-tuning the residual layer, features were extracted at the input and output of the adaptation layer, and the PCAs computed and used in Algorithm 1. Principal components were selected by the explained variance criterion. Once the eigenvalues were computed and sorted by decreasing magnitude, i.e. $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$, the variance explained by the first $k$ eigenvalues is $v_k = \sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{n} \lambda_i$. Given a threshold $T$, the smallest index $k^*$ such that $v_{k^*} \geq T$ was determined, and only the first $k^*$ eigenvalues/eigenvectors were kept. This set the dimensions $k_x$ or $k_y$ (depending on whether the procedure was applied to the layer input or output). Unless otherwise noted, a fixed threshold close to one was used, so that nearly all of the variance was retained.
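The explained-variance selection rule can be sketched as follows; the helper name and the example spectrum are illustrative, and the threshold is an argument rather than the paper's value:

```python
import numpy as np

def select_dim(eigvals, T=0.95):
    """Smallest k such that the first k eigenvalues explain at least a
    fraction T of the total variance (eigvals sorted in decreasing order)."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, T) + 1)

lam = np.array([6.0, 2.0, 1.0, 0.5, 0.5])  # hypothetical spectrum
k = select_dim(lam, T=0.8)                 # first two explain 8/10 = 80%
```

Because the rule is applied per layer and per dataset, the retained dimensions automatically vary across the network, which is the key advantage over a single hand-picked rank.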
Benefits of CovNorm: We start with some independent MDL experiments that provide insight on the benefits of CovNorm over previous MDL procedures. While we only report results for MITIndoor and CIFAR100, they are typical of all target datasets. Figure 6 shows the ratio of effective output to input dimensions, $k_y/k_x$, as a function of the adaptation layer. It shows that the input of an adaptation layer typically contains more information than its output. Note that the ratio is rarely one, is almost always well below it, and is smallest for the top network layers.
We next compared CovNorm to batch normalization (BN) and to geometric approximations based on the fine-tuned approximation (FTA) of Section 3.3. We also tested a mix of the geometric approaches (SVD+FTA), where the adaptation matrix was first approximated by the SVD and the resulting factors fine-tuned on the target dataset, and a mix of PCA and FTA (PCA+FTA), where the mini-adaptation layer of CovNorm was removed and the PCA factors fine-tuned on the target dataset, to minimize the PCA alignment problem. All geometric approximations were implemented with a range of low-rank values, defined as fractions of the layer dimension. For CovNorm, the explained variance threshold was varied. Figure 6 shows recognition accuracies vs. the % of parameters. Here, 100% corresponds to the adaptation layers of a network with residual adapters whose full matrix is fine-tuned on the target dataset. This is denoted RA and shown as an upper bound. A second upper bound is shown for full network fine-tuning (FNFT). This requires more parameters than RA. BN, which requires close to zero parameters, is shown as a lower bound.
Several observations are possible. First, all geometric approximations underperform CovNorm. For comparable sizes, the accuracy drop of the best geometric method (SVD+FTA) can be large. This is partly due to the use of a constant low rank throughout the network, which cannot match the effective, data-dependent, dimensions that vary across layers (see Figure 6). CovNorm eliminates this problem. We experimented with heuristics for choosing variable ranks but, as discussed below (Figure 7), could not achieve good performance. Among the geometric approaches, SVD+FTA outperforms FTA, which has performance drops on most datasets. It is interesting that, while the full adaptation matrix is fine-tuned with random initialization, the process is not effective for the low-rank matrices of FTA. In several datasets, FTA could not match SVD+FTA.
Even more surprising were the weaker results obtained when the random initialization was replaced by the two PCAs (PCA+FTA). Note the large difference between PCA+FTA and CovNorm, which differ only by the mini-adaptation layer. This is explained by the alignment problem of Section 3.5. Interestingly, while mini-adaptation layers are critical to overcome this problem, they are as easy to fine-tune as the full matrix. In fact, the addition of these layers (CovNorm) often outperformed the full matrix (RA). On some datasets, like MITIndoor, CovNorm matched the performance of RA with a small fraction of the parameters. Finally, as previously reported, FNFT frequently underperformed RA. This is likely due to overfitting.
CovNorm vs SVD: Figure 7 provides empirical evidence for the vastly different quality of the approximations produced by CovNorm and the SVD. The figure shows a plot of the variance explained by the eigenvalues of the input and output distributions of an adaptation layer and the corresponding plot for its singular values. Note how the PCA energy is packed into a much smaller number of coefficients than the singular value energy. This happens because PCA only accounts for the subspaces populated by data, restricting the low-rank approximation to these subspaces. Conversely, the geometric approximation must approximate the matrix behavior even outside of these subspaces. Note that the SVD is not only less efficient in identifying the important dimensions, but also makes it difficult to determine how many singular values to keep. This prevents the use of a layer-dependent number of singular values.
Comparison to previous methods: Table 1 summarizes the recognition accuracy and the number of adaptation layer parameters, as a percentage of the VGG model size, for various methods. All abbreviations are as above. Beyond MDL, we compare to learning without forgetting (LwF), a lifelong learning method that learns a model sharing all parameters among datasets. The table is split into independent and joint MDL. For joint learning, CovNorm is implemented with (10) and compared to the SVD-based approach.
Several observations can be made. First, CovNorm adapts the number of parameters to the task, according to its complexity and how different it is from the source (ImageNet). For the simplest datasets, such as the 10-class digit dataset SVHN, adaptation requires very few task-specific parameters. Datasets that are more diverse but ImageNet-like, such as Caltech256, require more. Finally, the largest adaptation layers are required by datasets that are either complex or quite different from ImageNet, e.g. scene recognition tasks (MITIndoor, SUN397). Even here, adaptation requires only a small fraction of the model parameters. On average, CovNorm requires only a small number of additional parameters per dataset.
Second, for independent learning, all methods based on residual adapters significantly outperform BN and LwF. As previously reported, RA outperforms FNFT. BN is uniformly weak; LwF performs very well on MITIndoor and Caltech256, but poorly on most other datasets. Third, CovNorm outperforms even RA, achieving higher recognition accuracy with fewer parameters. It also outperforms SVD+FTA and FTA, while substantially reducing parameter counts. On a per-dataset basis, CovNorm outperforms RA on all datasets other than CIFAR100, and SVD+FTA and FTA on all of them. On all datasets, the parameter savings are significant. Fourth, for joint training, CovNorm is substantially superior to the SVD, with higher recognition rates on all datasets, the largest gains on SUN397, and far fewer parameters. Finally, comparing independent and joint CovNorm, the latter has slightly higher recognition for a slightly higher parameter count. Hence, the two approaches are roughly equivalent.
Results on Visual Decathlon: Table 2 presents results on the Decathlon challenge, composed of ten different datasets of small images. Models are trained with a combination of the training and validation sets, and results are obtained online. For fair comparison, we use the learning protocol of prior work. CovNorm achieves state-of-the-art performance in terms of classification accuracy, parameter size, and decathlon score.
CovNorm is an MDL technique with a very simple implementation. Compared to previous methods, it dramatically reduces the number of adaptation parameters without loss of recognition performance. It was used to show that large CNNs can be “recycled” across problems as diverse as digit, object, scene, or fine-grained classification, with no loss, by tuning only a small fraction of their parameters.
This work was partially funded by NSF awards IIS-1546305 and IIS-1637941, a GRO grant from Samsung, and NVIDIA GPU donations.
-  R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In CVPR, pages 7120–7129, 2017.
-  H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.
-  K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, volume 1, page 7, 2017.
-  K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
-  F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò. Autodial: Automatic domain alignment layers. In ICCV, pages 5077–5085, 2017.
-  R. Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
-  Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. International Conference on Machine Learning, 2014.
-  R. Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
-  G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with r* cnn. In Proceedings of the IEEE international conference on computer vision, pages 1080–1088, 2015.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
-  J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In Proceedings of the IEEE International Conference on Computer Vision, pages 1062–1070, 2015.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  B. Jou and S.-F. Chang. Deep cross residual learning for multitask visual recognition. In Proceedings of the 2016 ACM on Multimedia Conference, pages 998–1007. ACM, 2016.
-  A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 3, 2017.
-  I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, volume 2, page 8, 2017.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
-  S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4655–4665, 2017.
-  Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. International Conference on Machine Learning, 2015.
-  Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In CVPR, volume 1, page 6, 2017.
-  S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
-  A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
-  M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Boosting domain adaptation by discovering latent domains. arXiv preprint arXiv:1805.01386, 2018.
-  I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch Networks for Multi-task Learning. In CVPR, 2016.
-  P. Morgado and N. Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, volume 9, page 10, 2017.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.
-  R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516, 2017.
-  S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. arXiv preprint arXiv:1803.10082, 2018.
-  S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proc. CVPR, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2017.
-  A. Rosenfeld and J. K. Tsotsos. Incremental learning through deep adaptation. arXiv preprint arXiv:1705.04228, 2017.
-  A. Rosenfeld and J. K. Tsotsos. Incremental learning through deep adaptation. IEEE transactions on pattern analysis and machine intelligence, 2018.
-  A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, volume 2, page 5, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
-  A. R. Triki, R. Aljundi, M. B. Blaschko, and T. Tuytelaars. Encoder based lifelong learning. IEEE Conference Computer Vision and Pattern Recognition, 2017.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
-  M. Valenti, B. Bethke, D. Dale, A. Frank, J. McGrew, S. Ahrens, J. P. How, and J. Vian. The mit indoor multi-vehicle flight testbed. In Robotics and Automation, 2007 IEEE International Conference on, pages 2758–2759. IEEE, 2007.
-  J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3485–3492. IEEE, 2010.
-  A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
-  Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014.