Efficient Multi-Domain Network Learning by Covariance Normalization

06/24/2019 ∙ by Yunsheng Li, et al. ∙ University of California, San Diego

The problem of multi-domain learning of deep networks is considered. An adaptive layer is induced per target domain and a novel procedure, denoted covariance normalization (CovNorm), is proposed to reduce its parameters. CovNorm is a data-driven method of fairly simple implementation, requiring two principal component analyses (PCAs) and fine-tuning of a mini-adaptation layer. Nevertheless, it is shown, both theoretically and experimentally, to have several advantages over previous approaches, such as batch normalization or geometric matrix approximations. Furthermore, CovNorm can be deployed both when target datasets are available sequentially or simultaneously. Experiments show that, in both cases, it has performance comparable to a fully fine-tuned network, using as few as 0.13% of the model parameters per target domain.

1 Introduction

Convolutional neural networks (CNNs) have enabled transformational advances in classification, object detection and segmentation, among other tasks. However, they have non-trivial complexity. State of the art models contain millions of parameters and require implementation on expensive GPUs. This creates problems for applications with computational constraints, such as mobile devices or consumer electronics. Figure 1 illustrates the problem in the context of a smart home equipped with an ecology of devices such as a camera that monitors package delivery and theft, a fridge that keeps track of its content, a treadmill that adjusts fitness routines to the facial expression of the user, or a baby monitor that keeps track of the state of a baby. As devices are added to the ecology, the GPU server in the house must switch between a larger number of classification, detection, and segmentation tasks. Similar problems will be faced by mobile devices, robots, smart cars, etc.

Figure 1: Multi-domain learning addresses the efficient solution of several tasks, defined on different domains. Each task is solved by a different network, but all networks share a set of fixed layers, which contain the majority of the network parameters. These are complemented by small task-specific adaptation layers.

Under the current deep learning paradigm, this task switching is difficult to perform. The predominant strategy is to use a different CNN to solve each task. Since only a few models can be cached in the GPU, and moving models in and out of cache adds too much overhead to enable real-time task switching, there is a need for very efficient parameter sharing across tasks. The individual networks should share most of their parameters, which would always reside on the GPU. A remaining small number of task-specific parameters would be switched per task. This problem is known as multi-domain learning (MDL) and has been addressed with the architecture of Figure 1 [34, 38]. This consists of a set of fixed layers shared by all tasks and a set of task-specific adaptation layers fine-tuned to each task. If the adaptation layers are much smaller than the fixed layers, many models can be cached simultaneously. Ideally, the fixed layers should be pre-trained, e.g. on ImageNet, and used by all tasks without additional training, enabling the use of special purpose chips to implement the majority of the computations. While the adaptation layers would still require a processing unit, the small amount of computation could enable the use of a CPU, making it cost-effective to implement each network on the device itself.

Figure 2: Covariance normalization. Each adaptation layer A is approximated by three transformations: W, which implements a projection onto the PCA space of the input (principal component matrix P_x and eigenvalue matrix Λ_x); C, which reconstructs the output from its PCA space (matrices P_y and Λ_y); and a mini-adaptation layer M.

In summary, MDL aims to maximize the performance of the network ecology while minimizing the ratio of task-specific (adaptation) to total (fixed plus adaptation) parameters per network. [34, 38] have shown that the architecture of Figure 1 can match the performance of fully fine-tuning each network in the ecology, even when the adaptation layers contain as few as 10% of the total parameters. In this work, we show that adaptation layers can be shrunk substantially further, using a data-driven low-rank approximation. As illustrated in Figure 2, this is based on transformations that match the second-order statistics of the layer inputs and outputs. Given principal component analyses (PCAs) of both input and output, the layer is approximated by a recoloring transformation: a projection onto the input PCA space, followed by a reconstruction into the output PCA space. By controlling the intermediate PCA dimensions, the method enables low-dimensional approximations of different input and output dimensions. To correct the mismatch (between PCA components) of two PCAs learned independently, a small mini-adaptation layer is introduced between the two PCA matrices and fine-tuned on the target dataset.

Since the overall transformation generalizes batch normalization, the method is denoted covariance normalization (CovNorm). CovNorm is shown, with both theoretical and experimental arguments, to outperform purely geometric methods for matrix approximation, such as the singular value decomposition (SVD) [35], as well as fine-tuning of the original adaptation layers [34, 38] and adaptation based on batch normalization [2]. It is also quite simple, requiring two PCAs and the fine-tuning of a very small mini-adaptation layer per layer and task. Experimental results show that it can outperform full network fine-tuning while reducing adaptation layers to about 0.5% of the total parameters. When all tasks can be learned together, adaptation layers can be reduced further still as a fraction of the full model size. This is achieved by combining the individual PCAs into a global PCA model, of parameters shared by all tasks, and only fine-tuning mini-adaptation layers in a task-specific manner.

2 Related work

MDL is a transfer learning problem, namely the transfer of a model trained on a source learning problem to an ecology of target problems. This makes it related to different types of transfer learning problems, which differ mostly in terms of input space, or domain, and range space, or task.

Task transfer: Task transfer addresses the use of a model trained on a source task for the solution of a target task. The two tasks can be defined on the same or different domains. Task transfer is prevalent in deep learning, where a CNN pre-trained on a large source dataset, such as ImageNet, is usually fine-tuned [21] to a target task. While extremely effective and popular, full network fine-tuning changes most network parameters, frequently all of them. MDL addresses this problem by considering multiple target tasks and extensive parameter sharing between them.

Domain Adaptation: In domain adaptation, the source and target tasks are the same, and a model trained on a source domain is transferred to a target domain. Domain adaptation can be supervised, in which case labeled data is available for the target domain, or unsupervised, when it is not. Various strategies have been used to address these problems. Some methods seek the network parameters that minimize some measure of the distance between feature distributions in the two domains [24, 4, 43]. Others introduce an adversarial loss that maximizes the confusion between the two domains [8, 45]. A few methods have also proposed to do the transfer at the image level, e.g. using GANs [11] to map source images into (labeled) target images, which are then used to learn a target classifier [3, 41, 14]. All these methods exploit the commonality of source and target tasks to align source and target domains. This is unlike MDL, where source and target tasks are different. Nevertheless, some mechanisms proposed for domain adaptation can be used for MDL. For example, [5, 28] use a batch normalization layer to match the statistics of source and target data, in terms of means and standard deviations. This is similar to an early proposal for MDL [2]. We show that these mechanisms underperform covariance normalization.

Figure 3: a) original network, b) after fine-tuning, and c) with adaptation layer A. In all cases, W_i is a weight layer and f a non-linearity.

Multitask learning: Multi-task learning [6, 49] addresses the solution of multiple tasks by the same model. It assumes that all tasks share the same visual domain. Popular examples include classification and bounding box regression in object detection [9, 37], joint estimation of surface normals and depth [7] or segmentation [29], and joint representation in terms of attributes and facial landmarks [50, 33], among others. Multitask learning is sometimes also used to solve auxiliary tasks that strengthen the performance of a task of interest, e.g. by accounting for context [10], or by representing objects in terms of classes and attributes [15, 29, 30, 25]. Recently, there have been attempts to learn models that solve many problems jointly [18, 19, 48].

Most multitask learning approaches emphasize the learning of the interrelationships between tasks. This is frequently accomplished with a single network, combining domain-agnostic lower-level network layers with task-specific network heads and loss functions [50, 7, 10, 15, 37, 19], or with more sophisticated forms of network branching [25]. The branching architecture is incompatible with MDL, where each task has its own input, different from those of all other tasks. Even when multi-task learning is addressed with multiple tower networks, the emphasis tends to be on inter-tower connections, e.g. through cross-stitching [29, 17]. In MDL, such connections are not feasible, because different networks can join the ecology of Figure 1 asynchronously, as devices are turned on and off.

Lifelong learning: Lifelong learning aims to learn multiple tasks sequentially with a shared model. This can be done by adapting the parameters of a network or by adapting the network architecture. Since training data is discarded upon its use, constraints are needed to force the model to remember what was previously learned. Methods that only change parameters either use the model output on previous tasks [23], previous parameter values [22], or previous network activations [44] to regularize the learning of the target task. They are very effective at parameter sharing, since a single model solves all tasks. However, this model is not optimal for any specific task, and can perform poorly on all tasks, depending on the mismatch between source and target domains [36]. We show that they can significantly underperform MDL with CovNorm. Methods that adapt the network architecture usually add a tower per new task [40, 1]. These methods have much larger complexity than MDL, since several towers can be needed to solve a single task [40], and there is no sharing of fixed layers across tasks.

Multi-domain learning: This work builds on previous attempts at MDL, which have investigated different architectures for the adaptation layers of Figure 1. [2] used a BN layer [16] whose parameters are tuned per task. While performing well on simple datasets, this does not have enough degrees of freedom to support transfer of large CNNs across very different domains. More powerful architectures were proposed by [38], who used a convolutional layer, and [34], who proposed a ResNet-style residual layer, known as a residual adaptation (RA) module. These methods were shown to perform surprisingly well in terms of recognition accuracy, equaling or surpassing the performance of full network fine-tuning, but can still require a substantial number of adaptation parameters, typically around 10% of the network size. [35] addressed this problem by combining the adapters of multiple tasks into a large matrix, which is approximated with an SVD. This is then fine-tuned on each target dataset. Compressing adaptation layers in this way was shown to reduce adaptation parameter counts to approximately half of [34]. However, all tasks have to be optimized simultaneously. We show that CovNorm enables a further ten-fold reduction in adaptation layer parameters, without this limitation, although some additional gains are possible with joint optimization.

3 MDL by covariance normalization

In this section, we introduce the CovNorm procedure for MDL with deep networks.

3.1 Multi-domain learning

Figure 3 a) motivates the use of adaptation layers in MDL. The figure depicts two fixed weight layers, W_1 and W_2, and a non-linear layer f in between. Since the fixed layers are pre-trained on a source dataset D_s, typically ImageNet, all weights are optimized for the source statistics. For standard losses, such as cross-entropy, this is a maximum likelihood (ML) procedure that matches W_1 and W_2 to the statistics of the intermediate activations, denoted x and y, in D_s. However, when the CNN is used on a different target domain, the statistics of these variables change and the weights are no longer an ML solution. Hence, the network is sub-optimal and must be fine-tuned on a target dataset D_t. This is denoted full network fine-tuning and converts the network into an ML solution for D_t, with the outcome of Figure 3 b). In the target domain, the intermediate random variables become x' and y', and the weights are changed accordingly, into W'_1 and W'_2.

While very effective, this procedure has two drawbacks, which follow from updating all weights. First, it can be computationally expensive, since modern CNNs have large weight matrices. Second, because the new weights are not optimal for D_s, i.e. the CNN forgets the source task, there is a need to store and implement two CNNs to solve both tasks. This is expensive in terms of storage and computation and increases the complexity of managing the network ecology: a device that solves both tasks must store two CNNs and load them in and out of cache when it switches between the tasks. These problems are addressed by the MDL architecture of Figure 1, which is replicated in greater detail in Figure 3 c). It introduces an adaptation layer A and fine-tunes this layer only, leaving W_1 and W_2 unchanged. In this case, the statistics of the network input are those of D_t, but W_1 and W_2 remain matched to D_s. Since W_1 is fixed, nothing can be done about the statistics of its output x'. However, the fine-tuning of A encourages the statistics of Ax' to match those of y, the variable that W_2 was trained on, i.e. Ax' ≈ y in distribution and thus W_2 Ax' ≈ W_2 y. Even if A cannot match the statistics exactly, the mismatch is reduced by repeating the procedure in subsequent layers, e.g. introducing a second adaptation layer after W_2, and optimizing the adaptation matrices as a whole.
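
To make the setup concrete, the following is a minimal PyTorch sketch of this architecture, with illustrative layer sizes rather than the exact configuration used in the paper: two pre-trained layers are frozen and only a 1x1 convolutional adaptation layer inserted between them is trained.

    import torch
    import torch.nn as nn

    d = 256  # channel dimension (illustrative)

    # Fixed layers W1, W2, pre-trained on the source domain and frozen.
    W1 = nn.Conv2d(d, d, kernel_size=3, padding=1)
    W2 = nn.Conv2d(d, d, kernel_size=3, padding=1)
    for p in list(W1.parameters()) + list(W2.parameters()):
        p.requires_grad = False

    # Task-specific adaptation layer A: a 1x1 convolution, the only
    # component fine-tuned for the new target domain.
    A = nn.Conv2d(d, d, kernel_size=1, bias=False)

    block = nn.Sequential(W1, nn.ReLU(inplace=True), A, W2)
    optimizer = torch.optim.SGD(A.parameters(), lr=1e-3, momentum=0.9)

    x = torch.randn(8, d, 14, 14)   # a batch from the target domain
    loss = block(x).mean()          # placeholder loss
    loss.backward()                 # gradients only reach A
    optimizer.step()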

3.2 Adaptation layer size

Obviously, MDL has limited interest if A has size similar to the fixed layers. In this case, each domain has as many adaptation parameters as the original network, all networks have twice the size, task switching is complex, and training complexity is equivalent to full fine-tuning of the original network. On the other hand, if A is much smaller than the fixed layers, MDL is computationally light and task switching much more efficient. In summary, the goal is to introduce an adaptation layer A as small as possible, but still powerful enough to match the statistics of x and y. A simple solution is to make A a batch normalization layer [16]. This was proposed in [2] but, as discussed below, is not effective. To overcome this problem, [38] proposed a linear transformation y = Ax and [34] adopted the residual structure of [13], i.e. an adaptation layer y = (I + A)x. To maximize parameter savings, A was implemented with a 1x1 convolutional layer in both cases.

This can, however, still require a non-trivial number of parameters, especially in the upper network layers. Let the fixed layer convolve a bank of c filters of spatial size s x s with d-dimensional feature maps. Then the fixed layer has c x d x s x s parameters, its output is c dimensional, and A is a c x c matrix. Since, in the upper network layers, s is usually small (e.g. 3 x 3) while c is comparable to d, A can be only marginally smaller than the fixed layer. [35] exploited redundancies across tasks to address this problem, creating a matrix with the adaptation layer parameters of multiple tasks and computing a low-rank approximation of this matrix with an SVD. The compression achieved with this approximation is limited, because the approximation is purely geometric, not taking into account the statistics of x and y. In this work, we propose a more efficient solution, motivated by the interpretation of A as converting the statistics of x into those of y. It is assumed that the fine-tuning of A produces an output variable whose statistics match those of the corresponding source activation. This could leverage adaptation layers elsewhere in the network, but that is not important for the discussion that follows. The only assumption is that y = Ax. The goal is to replace A by a simpler matrix that maps x into y. For simplicity, we drop the primes and the notation of Figure 3 in what follows, considering the problem of matching the statistics between the input x and the output y of a matrix A.
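
The size argument can be made concrete with a short calculation, using channel widths typical of the upper VGG16 layers; the low-rank dimension is a hypothetical choice used only for comparison.

    c, d, s = 512, 512, 3                 # filters, input channels, kernel size
    k = 64                                # hypothetical low-rank dimension

    params_fixed = c * d * s * s          # fixed conv layer: 2,359,296
    params_A = c * d                      # full 1x1 adaptation matrix: 262,144
    params_lowrank = k * d + c * k        # two rank-k factors: 65,536

    print(f"fixed: {params_fixed}, A: {params_A} ({params_A / params_fixed:.1%}), "
          f"rank-{k}: {params_lowrank} ({params_lowrank / params_fixed:.2%})")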

3.3 Geometric approximations

One possibility is to use a purely geometric solution [35]. Geometrically, the closest low-rank approximation of a matrix A is given by the SVD, A = U S V^T. More precisely, the minimum Frobenius norm approximation of rank k, where k < rank(A), is obtained by keeping the k largest singular values of A and the corresponding singular vectors. This can be written as the product of a d x k matrix U_k S_k and a k x d matrix V_k^T. If A is d x d, these matrices have a total of 2kd parameters. An even simpler solution is to define two matrices of the same sizes, replace A by their product in Figure 3 c), and fine-tune the two matrices instead of A. We denote this as the fine-tuned approximation (FTA). These approaches are limited by their purely geometric nature. Note that d is determined by the source model (the output dimension of the preceding fixed layer) and is fixed. On the other hand, the dimension k should depend on the target dataset D_t. Intuitively, if D_t is much smaller than D_s, or if the target task is much simpler, it should be possible to use a smaller k than otherwise. There is also no reason to believe that a single k, or even a single ratio k/d, is suitable for all network layers. While k could be found by cross-validation, this becomes expensive when there are multiple adaptation layers throughout the CNN. We next introduce an alternative, data-driven, procedure that bypasses these difficulties.
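
For reference, the two geometric baselines can be sketched in a few lines of NumPy. The random matrix below merely stands in for a fine-tuned adaptation matrix, and the rank k is hand-picked, which is exactly the difficulty discussed above.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 256, 32                          # layer width and hand-picked rank
    A = rng.standard_normal((d, d))         # stands in for a fine-tuned adaptation matrix

    # SVD truncation: the best rank-k approximation in Frobenius norm.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]
    print("relative rank-k error:", np.linalg.norm(A - A_k) / np.linalg.norm(A))

    # FTA: two randomly initialized rank-k factors, fine-tuned in place of A.
    B = 0.01 * rng.standard_normal((d, k))  # d x k
    C = 0.01 * rng.standard_normal((k, d))  # k x d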

Figure 4: Top: CovNorm approximates the adaptation layer A by a sequence of whitening (W), mini-adaptation (M), and coloring (C) operations. Bottom: after fine-tuning, the mini-adaptation layer can be absorbed into C (shown in the figure) or W.

3.4 Covariance matching

Assume that, as illustrated in Figure 2, x and y are Gaussian random variables of means μ_x, μ_y and covariances Σ_x, Σ_y, respectively, related by y = Ax. Let the covariances have eigendecompositions

Σ_x = P_x Λ_x P_x^T,    Σ_y = P_y Λ_y P_y^T,    (1)

where P_x, P_y contain the eigenvectors as columns and Λ_x, Λ_y are diagonal eigenvalue matrices. We refer to the triplet (μ_x, P_x, Λ_x) as the PCA of x. Then, it is well known that the statistics of x and y are related by

μ_y = A μ_x,    Σ_y = A Σ_x A^T,    (2)

and, combining (1) and (2), P_y Λ_y P_y^T = A P_x Λ_x P_x^T A^T. This holds when A P_x Λ_x^{1/2} = P_y Λ_y^{1/2} or, equivalently,

A = P_y Λ_y^{1/2} Λ_x^{-1/2} P_x^T    (3)
  = C W,    (4)

where W = Λ_x^{-1/2} P_x^T is the "whitening matrix" of x and C = P_y Λ_y^{1/2} the "coloring matrix" of y. It follows that (2) holds if A is implemented with a sequence of two operations. First, x is mapped into a variable z of zero mean and identity covariance, by defining

z = W (x - μ_x) = Λ_x^{-1/2} P_x^T (x - μ_x).    (5)

Second, z is mapped into y with

y = C z + μ_y = P_y Λ_y^{1/2} z + μ_y.    (6)

In summary, for Gaussian x and y, the effect of A is simply the combination of a whitening of x followed by a colorization with the statistics of y.
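
The recoloring interpretation is easy to verify numerically. The NumPy sketch below, on synthetic Gaussian data, estimates the whitening and coloring matrices and checks that the recolored variable reproduces the output statistics (not the output samples themselves, which is why the mini-adaptation layer of the next section is needed).

    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 16, 20000
    A = rng.standard_normal((d, d))                    # adaptation layer
    x = rng.standard_normal((n, d)) @ rng.standard_normal((d, d)) + 2.0
    y = x @ A.T                                        # y = A x (row-wise)

    mu_x, mu_y = x.mean(0), y.mean(0)
    lam_x, P_x = np.linalg.eigh(np.cov(x.T))           # PCA of x
    lam_y, P_y = np.linalg.eigh(np.cov(y.T))           # PCA of y

    W = np.diag(lam_x ** -0.5) @ P_x.T                 # whitening matrix of x, eq. (5)
    C = P_y @ np.diag(lam_y ** 0.5)                    # coloring matrix of y, eq. (6)

    z = (x - mu_x) @ W.T                               # zero mean, identity covariance
    y_hat = z @ C.T + mu_y                             # recolored output

    print("max |cov(z) - I|:", np.abs(np.cov(z.T) - np.eye(d)).max())
    print("max |cov(y_hat) - cov(y)|:", np.abs(np.cov(y_hat.T) - np.cov(y.T)).max())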

3.5 Covariance normalization

The interpretation of the adaptation layer as a recoloring operation (whitening + coloring) sheds light on the number of parameters effectively needed for the adaptation, since the PCAs capture the effective dimensions of x and y. Let k_x (k_y) be the number of eigenvalues significantly larger than zero in Λ_x (Λ_y). Then, the whitening and coloring matrices can be approximated by

W ≈ Λ_x^{-1/2} P_x^T,    C ≈ P_y Λ_y^{1/2},    (7)

where Λ_x (Λ_y) now contains only the k_x (k_y) non-zero eigenvalues and P_x (P_y) the corresponding eigenvectors. Hence, A is well approximated by a pair of matrices (W of size k_x x d, C of size d x k_y) totaling (k_x + k_y)d parameters.

On the other hand, the PCAs are only defined up to a permutation, which assigns an ordering to eigenvalues/eigenvectors. When the input and output PCAs are computed independently, the principal components may not be aligned. This can be fixed by introducing a permutation matrix between C and W in (4). The assumption that all distributions are Gaussian also only holds approximately in real networks. To account for all this, we augment the recoloring operation with a mini-adaptation layer M of size k_y x k_x. This leads to the covariance normalization (CovNorm) transform

A ≈ C M W = P_y Λ_y^{1/2} M Λ_x^{-1/2} P_x^T,    (8)

where M is learned by fine-tuning on the target dataset D_t. Beyond improving recognition performance, this has the advantage of further parameter savings. The direct implementation of (8) increases the parameter count to (k_x + k_y)d + k_x k_y. However, after fine-tuning, M can be absorbed into one of the two other matrices, as shown in Figure 4. When k_y > k_x, the product C M has dimension d x k_x and replacing the two matrices by their product reduces the total parameter count to 2 k_x d. In this case, we say that M is absorbed into C. Conversely, if k_x > k_y, M can be absorbed into W. Hence, the total parameter count is 2 d min(k_x, k_y). CovNorm is summarized in Algorithm 1.

Data: source dataset D_s and target dataset D_t.
1. Insert an adaptation layer A in a CNN trained on D_s and fine-tune it on D_t.
2. Store the layer input and output PCAs (μ_x, P_x, Λ_x) and (μ_y, P_y, Λ_y), select the non-zero eigenvalues and corresponding eigenvectors from each PCA, and compute W and C with (7).
3. Add the mini-adaptation layer M and replace A by (8). Note that, as usual, the constant (mean) terms can be implemented with a vector of biases.
4. Fine-tune M on D_t, keeping C and W fixed, and absorb M into the larger of C and W.
Algorithm 1: Covariance Normalization
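
The following is a compact PyTorch sketch of Algorithm 1 on synthetic features that stand in for the input and output of a fine-tuned adaptation layer. The explained-variance threshold, the layer width, and the regression loss used to fine-tune M are illustrative choices; in the paper, M is fine-tuned with the task loss on D_t.

    import torch

    def pca(feats, threshold=0.99):
        """Mean, principal directions and eigenvalues kept by the explained-variance rule."""
        mu = feats.mean(0)
        lam, P = torch.linalg.eigh(torch.cov((feats - mu).T))
        lam, P = lam.flip(0), P.flip(1)                        # sort in decreasing order
        k = int((lam.cumsum(0) / lam.sum() < threshold).sum()) + 1
        return mu, P[:, :k], lam[:k]

    # Synthetic stand-ins for features at the input (x) and output (y) of a fine-tuned A.
    n, d = 2000, 256
    x = torch.randn(n, d) @ torch.randn(d, d)
    y = x @ (torch.randn(d, d) / d ** 0.5).T

    mu_x, P_x, lam_x = pca(x)
    mu_y, P_y, lam_y = pca(y)

    W = torch.diag(lam_x.rsqrt()) @ P_x.T                      # k_x x d whitening matrix
    C = P_y @ torch.diag(lam_y.sqrt())                         # d x k_y coloring matrix
    M = torch.nn.Parameter(torch.eye(C.shape[1], W.shape[0]))  # k_y x k_x mini-adaptation

    opt = torch.optim.SGD([M], lr=1e-2)                        # only M is fine-tuned
    for _ in range(50):
        y_hat = (x - mu_x) @ W.T @ M.T @ C.T + mu_y            # CovNorm transform, eq. (8)
        loss = ((y_hat - y) ** 2).mean()                       # stand-in for the task loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        if C.shape[1] > W.shape[0]:      # k_y > k_x: absorb M into C
            C = C @ M
        else:                            # otherwise absorb M into W
            W = M @ W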

3.6 The importance of covariance normalization

The benefits of covariance matching can be seen by comparison to previously proposed MDL methods. Assume, first, that x and y consist of independent features. In this case, P_x and P_y are identity matrices and (5)-(6) reduce to

y = Λ_y^{1/2} Λ_x^{-1/2} (x - μ_x) + μ_y,    (9)

which is the batch normalization equation, i.e. a per-feature scaling and shift. Hence, CovNorm is a generalized form of the latter. There are, however, important differences. First, there is no batch. The normalizing distribution is now the distribution of the feature responses at the input of the adaptation layer on the target dataset D_t. Second, the goal is not to facilitate the learning of the subsequent layer W_2, but to produce a feature vector with statistics matched to it. This turns out to make a significant difference. Since, in regular batch normalization, W_2 is allowed to change, it can absorb any initial mismatch with the independence assumption. This is not the case for MDL, where W_2 is fixed. Hence, (9) usually fails, significantly underperforming (5)-(6).
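
A quick numerical check of this special case, on synthetic data whose features are related by an independent per-feature map: the diagonal form (9) reproduces the output exactly, which is precisely a batch-normalization-style scale and shift.

    import numpy as np

    rng = np.random.default_rng(2)
    d, n = 8, 50000
    x = rng.standard_normal((n, d)) * rng.uniform(0.5, 2.0, d) + 1.0
    y = x * rng.uniform(0.5, 2.0, d) - 3.0           # independent per-feature map

    mu_x, mu_y = x.mean(0), y.mean(0)
    var_x, var_y = x.var(0), y.var(0)

    # Eq. (9): per-feature scaling and shifting.
    y_hat = np.sqrt(var_y / var_x) * (x - mu_x) + mu_y
    print("max error:", np.abs(y_hat - y).max())     # ~0 up to floating point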

Next, consider the geometric solution. Since CovNorm reduces to the product of two tall matrices, e.g. CM and W of sizes d x k_x and k_x x d, it should be possible to replace it with the fine-tuned approximation based on two matrices of this size. Here, there are two difficulties. First, k_x is not known in the absence of the PCA decompositions. Second, in our experience, even when the rank is set to the value used by PCA, the fine-tuned approximation does not work as well. As shown in the experimental section, when the matrices are initialized with Gaussian weights, performance can decrease significantly. This is an interesting observation, because A itself is initialized with Gaussian weights. It appears that a good initialization is more critical for the low-rank matrices.

Finally, CovNorm can be compared to the SVD, A = U S V^T. From (3), this holds with U = P_y, S = Λ_y^{1/2} Λ_x^{-1/2}, and V = P_x. The problem is that the singular value matrix S conflates the variances of the input and output PCAs. The fact that S = Λ_y^{1/2} Λ_x^{-1/2} has two important consequences. First, it is impossible to recover the dimensions k_x and k_y by inspection of the singular values. Second, the low-rank criterion of selecting the largest singular values is not equivalent to CovNorm. For example, the principal components of x with the largest eigenvalues λ_i^x have the smallest singular values sqrt(λ_i^y / λ_i^x). Hence, it is impossible to tell whether singular vectors of small singular value are the most important (PCA components of large variance for x) or the least important (noise). Conversely, the largest singular values can simply signal the least important input dimensions. CovNorm eliminates this problem by explicitly selecting the important input and output dimensions.
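
The conflation is easy to see with illustrative numbers: if the input and output PCAs were already aligned (P_x = P_y = I), the singular values would be sqrt(λ_i^y / λ_i^x), so a high-variance input direction mapped to a modest output variance receives a tiny singular value even though it is an important input dimension.

    import numpy as np

    lam_x = np.array([100.0, 1.0, 0.01])    # input PCA variances (illustrative)
    lam_y = np.array([1.0, 1.0, 1.0])       # output PCA variances (illustrative)
    s = np.sqrt(lam_y / lam_x)              # singular values when P_x = P_y = I
    print(s)                                # [ 0.1  1. 10. ]: largest input variance, smallest s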

3.7 Joint training

[35] considered a variant of MDL where the different tasks of Figure 1 are all optimized simultaneously. This is the same as assuming that a joint dataset D = ∪_i D_i is available. For CovNorm, the only difference with respect to the single dataset setting is that the PCAs are now those of the joint data D. These can be derived from the PCAs of the individual target datasets with

μ = (1 / N) sum_i |D_i| μ_i,    Σ = (1 / N) sum_i |D_i| (Σ_i + μ_i μ_i^T) - μ μ^T,    N = sum_i |D_i|,    (10)

where |D_i| is the cardinality of D_i and μ_i, Σ_i = P_i Λ_i P_i^T are its mean and covariance. Hence, CovNorm can be implemented by fine-tuning an adaptation layer on each D_i, storing the per-dataset PCAs, using (10) to reconstruct the mean and covariance of D, and computing the global PCA (a small sketch of this pooling is given after the mode definitions below). When tasks are available sequentially, this can be done recursively, combining the PCA of all previous data with the PCA of the new data. In summary, CovNorm can be extended to any number of tasks, with constant storage requirements (a single PCA), and no loss of optimality. This makes it possible to define two CovNorm modes.

  • independent: the adaptation layers of network i are adapted to the target dataset D_i. A PCA is computed for D_i and the mini-adaptation layer fine-tuned to D_i. This requires the full set of CovNorm parameters of Section 3.5, roughly 2 d min(k_x, k_y), per layer and per dataset.

  • joint: a global PCA is learned from D = ∪_i D_i and shared across tasks. Only a mini-adaptation layer is fine-tuned per D_i, which requires only its k_x k_y task-specific parameters per layer and per dataset. All D_i must be available simultaneously.

The independent mode is needed if, for example, the devices of Figure 1 are produced by different manufacturers.
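
As referenced above, the pooling of (10) amounts to a few lines of NumPy. The sanity check below compares it against computing the statistics of the concatenated data directly; dataset sizes and contents are synthetic, and biased covariance estimates are used so that the identity is exact.

    import numpy as np

    def pool_statistics(sizes, means, covs):
        """Combine per-dataset (mu_i, Sigma_i) into the global (mu, Sigma) of eq. (10)."""
        w = np.asarray(sizes, dtype=float)
        w = w / w.sum()
        mu = sum(wi * mi for wi, mi in zip(w, means))
        second = sum(wi * (Si + np.outer(mi, mi)) for wi, Si, mi in zip(w, covs, means))
        return mu, second - np.outer(mu, mu)

    rng = np.random.default_rng(3)
    data = [rng.standard_normal((n, 4)) + i for i, n in enumerate([1000, 3000, 500])]
    mu, Sig = pool_statistics(
        [len(di) for di in data],
        [di.mean(0) for di in data],
        [np.cov(di.T, bias=True) for di in data],
    )
    joint = np.concatenate(data)
    print(np.allclose(mu, joint.mean(0)), np.allclose(Sig, np.cov(joint.T, bias=True)))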

Figure 5: Ratio of effective output to input dimensions (k_y / k_x) for different network layers. Left: MITIndoor. Right: CIFAR100.
Figure 6: Accuracy vs. % of parameters used for adaptation. Left: MITIndoor. Right: CIFAR100.

4 Experiments

In this section, we present results for both the independent and joint CovNorm modes.

Dataset: [34] proposed the decathlon dataset for evaluation of MDL. However, this is a collection of relatively small datasets. While sufficient to train small networks, we found it hard to use with larger CNNs. Instead, we used a collection of seven popular vision datasets. SUN397 [47] contains 397 classes of scene images and more than 100,000 images. MITIndoor [46] is an indoor scene dataset with 67 classes and at least 100 images per class. FGVC-Aircraft Benchmark [26] is a fine-grained classification dataset of 10,000 images of 100 types of airplanes. Flowers102 [32] is a fine-grained dataset with 102 flower categories and 40 to 258 images per class. CIFAR100 [20] contains 60,000 tiny images from 100 classes. Caltech256 [12] contains images of 256 object categories, with at least 80 samples per class. SVHN [31] is a digit recognition dataset with 10 classes and more than 70,000 samples. In all cases, images are resized to the VGG input resolution of 224 x 224 and the training and testing splits defined by the dataset are used, if available. Otherwise, a random split into training and test sets is used.

Implementation: In all experiments, fixed layers were extracted from a source VGG16 [42] model trained on ImageNet. This has 13 convolutional layers, with channel widths ranging from 64 to 512. In a set of preliminary experiments, we compared the MDL performance of the architecture of Figure 1 with these layers and adaptation layers implemented with 1) a convolutional layer of kernel size 1 x 1 [38], 2) the residual adapters of [34], which complement a convolutional layer as in 1) with batch normalization layers and an identity skip connection, and 3) the parallel adapters of [35]. Since residual adapters produced the best results, we adopted this structure in all our experiments. However, CovNorm can be used with any of the other structures, or any other adaptation matrix A. Note that the adapter could be absorbed into the adjacent fixed layer after fine-tuning, but we have not done so, for consistency with [34].

In all experiments, fine-tuning started from a small initial learning rate, which was reduced when the loss stopped decreasing. After fine-tuning the residual layer, features were extracted at the input and output of A and the PCAs computed and used in Algorithm 1. Principal components were selected by the explained variance criterion. Once the eigenvalues were computed and sorted by decreasing magnitude, i.e. λ_1 ≥ λ_2 ≥ ... ≥ λ_d, the variance explained by the first k eigenvalues is v_k = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_d). Given a threshold T, the smallest index k* such that v_{k*} ≥ T was determined, and only the first k* eigenvalues/eigenvectors were kept. This set the dimensions k_x and k_y (depending on whether the procedure was applied to x or y). Unless otherwise noted, a fixed threshold T close to one was used, i.e. nearly all of the variance was retained.
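
The explained-variance criterion itself amounts to a few lines; the threshold below is a placeholder value.

    import numpy as np

    def select_dimension(eigenvalues, threshold):
        """Smallest k whose leading eigenvalues explain >= threshold of the variance
        (eigenvalues sorted in decreasing order)."""
        v = np.cumsum(eigenvalues) / np.sum(eigenvalues)
        return int(np.searchsorted(v, threshold)) + 1

    lam = np.array([5.0, 2.0, 1.0, 0.5, 0.1, 0.01])
    print(select_dimension(lam, 0.9))   # -> 3: the first three explain ~93% of the variance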

Table 1: Classification accuracy and % of adaptation parameters (with respect to VGG16 size) per target dataset. For independent CovNorm, the per-dataset parameter fractions are listed on a separate row.

Method            FGVC     MITIndoor  Flowers  Caltech256  SVHN     SUN397   CIFAR100  avg. acc.  params
FNFT              85.73%   71.77%     95.67%   83.73%      96.41%   57.29%   80.45%    81.58%     100%
Independent learning
BN [2]            43.6%    57.6%      83.07%   73.66%      91.1%    47.04%   64.8%     65.83%     0%
LwF [23]          66.25%   73.43%     89.12%   80.02%      44.13%   52.85%   72.94%    68.39%     0%
RA [34]           88.92%   72.4%      96.43%   84.17%      96.13%   57.38%   79.55%    82.16%     10%
SVD+FTA           89.07%   71.66%     95.67%   84.46%      96.04%   57.12%   78.28%    81.75%     5%
FTA               87.31%   70.26%     95.43%   83.82%      95.96%   56.43%   78.23%    81.06%     5%
CovNorm           88.98%   72.51%     96.76%   84.75%      96.23%   57.97%   79.42%    82.37%     0.53%
CovNorm params    0.34%    0.62%      0.35%    0.46%       0.13%    0.71%    1.1%      --         0.53%
Joint learning
SVD [35]          88.98%   71.7%      96.37%   83.63%      96%      56.58%   78.26%    81.65%     5%
CovNorm           88.99%   73.0%      96.69%   84.77%      96.22%   58.2%    79.22%    82.44%     0.51%
Figure 7: Variance explained by eigenvalues of a layer input and output, and similar plot for singular values. Left: MITIndoor. Right: CIFAR100.
ImNet Airc C100 DPed DTD GTSR Flwr OGlt SVHN UCF avg acc S #par
RA [34] 59.67% 61.87% 81.20% 93.88% 57.13% 97.57% 81.67% 89.62% 96.13% 50.12% 76.89% 2621 2
DAN [39] 57.74% 64.12% 80.07% 91.3% 56.54% 98.46% 86.05% 89.67% 96.77% 49.38% 77.01% 2851 2.17
Piggyback [27] 57.69% 65.29% 79.87% 96.99% 57.45% 97.27% 79.09% 87.63% 97.24% 47.48% 76.6% 2838 1.28
CovNorm 60.37% 69.37% 81.34% 98.75% 59.95% 99.14% 83.44% 87.69% 96.55% 48.92% 78.55% 3713 1.25
Table 2: Visual Decathlon results (per-dataset accuracy, average accuracy, decathlon score S, and parameter cost #par as a multiple of the base network).

Benefits of CovNorm: We start with some independent MDL experiments that provide insight on the benefits of CovNorm over previous MDL procedures. While we only report results for MITIndoor and CIFAR100, they are typical of all target datasets. Figure 5 shows the ratio of effective output to input dimensions, k_y / k_x, as a function of the adaptation layer. It shows that the input of A typically contains more information than the output. Note that the ratio is rarely one, almost always less than one, frequently much smaller, and smallest for the top network layers.

We next compared CovNorm to batch normalization (BN) [2] and to geometric approximations based on the fine-tuned approximation (FTA) of Section 3.3. We also tested a mix of the geometric approaches (SVD+FTA), where A was first approximated by the SVD and the resulting low-rank matrices fine-tuned on D_t, and a mix of PCA and FTA (PCA+FTA), where the mini-adaptation layer of CovNorm was removed and the matrices W and C fine-tuned on D_t, to minimize the PCA alignment problem. All geometric approximations were implemented with low-rank values set to a range of fixed fractions of the dimension of x or y. For CovNorm, the explained variance threshold was varied instead. Figure 6 shows recognition accuracies vs. the % of parameters. Here, 100% corresponds to the adaptation layers of [34]: a network with residual adapters whose matrix A is fine-tuned on D_t. This is denoted RA and shown as an upper bound. A second upper bound is shown for full network fine-tuning (FNFT), which requires about ten times more parameters than RA. BN, which requires close to zero parameters, is shown as a lower bound.

Several observations are possible. First, all geometric approximations underperform CovNorm. For comparable sizes, the accuracy drop of the best geometric method (SVD+FTA) is substantial. This is partly due to the use of a constant low rank throughout the network, which cannot match the effective, data-dependent dimensions that vary across layers (see Figure 5). CovNorm eliminates this problem. We experimented with heuristics for choosing variable ranks but, as discussed below (Figure 7), could not achieve good performance. Among the geometric approaches, SVD+FTA outperforms FTA, which has noticeable performance drops on most datasets. It is interesting that, while A is fine-tuned from a random initialization, the same procedure is not effective for the low-rank matrices of FTA. In several datasets, FTA could not match SVD+FTA.

Even more surprising were the weaker results obtained when the random initialization was replaced by the two PCAs (PCA+FTA). Note the large difference between PCA+FTA and CovNorm, which differ only by the mini-adaptation layer M. This is explained by the alignment problem of Section 3.5. Interestingly, while mini-adaptation layers are critical to overcome this problem, they are as easy to fine-tune as A. In fact, the addition of these layers (CovNorm) often outperformed the full matrix A (RA). In some datasets, like MITIndoor, CovNorm matched the performance of RA with only a small fraction of the parameters. Finally, as previously reported by [34], FNFT frequently underperformed RA. This is likely due to overfitting.

CovNorm vs SVD: Figure 7 provides empirical evidence for the vastly different quality of the approximations produced by CovNorm and the SVD. The figure shows a plot of the variance explained by the eigenvalues of the input and output distributions of an adaptation layer and the corresponding plot for its singular values. Note how the PCA energy is packed into a much smaller number of coefficients than the singular value energy. This happens because PCA only accounts for the subspaces populated by data, restricting the low-rank approximation to these subspaces. Conversely, the geometric approximation must approximate the matrix behavior even outside of these subspaces. Note that the SVD is not only less efficient in identifying the important dimensions, but also makes it difficult to determine how many singular values to keep. This prevents the use of a layer-dependent number of singular values.

Comparison to previous methods: Table 1 summarizes the recognition accuracy and the % of adaptation layer parameters relative to the VGG16 model size, for various methods. All abbreviations are as above. Beyond MDL, we compare to learning without forgetting (LwF) [23], a lifelong learning method that learns a single model sharing all parameters among datasets. The table is split into independent and joint MDL. For joint learning, CovNorm is implemented with (10) and compared to the SVD approach of [35].

Several observations can be made. First, CovNorm adapts the number of parameters to the task, according to its complexity and how different it is from the source (ImageNet). For the simplest datasets, such as the 10-class digit dataset SVHN, adaptation can require as few as 0.13% task-specific parameters. Datasets that are more diverse but ImageNet-like, such as Caltech256, require around 0.5% of the parameters. Finally, larger adaptation layers are required by datasets that are either complex or quite different from ImageNet, e.g. the scene recognition tasks (MITIndoor, SUN397). Even here, adaptation requires less than 1% of the parameters. On average, CovNorm requires 0.53% additional parameters per dataset.

Second, for independent learning, all methods based on residual adapters significantly outperform BN and LwF. As shown by [34], RA outperforms FNFT. BN is uniformly weak; LwF performs very well on MITIndoor and Caltech256, but poorly on most other datasets. Third, CovNorm outperforms even RA, achieving higher recognition accuracy with far fewer parameters. It also outperforms SVD+FTA and FTA by 0.6% and 1.3% on average, respectively, while reducing parameter sizes by roughly a factor of ten. On a per-dataset basis, CovNorm outperforms RA on all datasets other than CIFAR100, and SVD+FTA and FTA on almost all of them. In all datasets, the parameter savings are significant. Fourth, for joint training, CovNorm is substantially superior to the SVD [35], with higher recognition rates on all datasets, gains of up to 1.6% (SUN397), and close to ten times fewer parameters. Finally, comparing independent and joint CovNorm, the latter has slightly higher recognition accuracy at a comparable parameter count. Hence, the two approaches are roughly equivalent.

Results on Visual Decathlon: Table 2 presents results on the Visual Decathlon challenge [34], composed of ten different datasets of small images. Models are trained on the combination of training and validation sets and results are obtained from the online evaluation server. For a fair comparison, we use the learning protocol of [34]. CovNorm achieves state of the art performance in terms of classification accuracy, parameter size, and decathlon score S.

5 Conclusion

CovNorm is an MDL technique of very simple implementation. When compared to previous methods, it dramatically reduces the number of adaptation parameters without loss of recognition performance. It was used to show that large CNNs can be "recycled" across problems as diverse as digit, object, scene, or fine-grained recognition, with no loss, by tuning only a small fraction of their parameters.

6 Acknowledgment

This work was partially funded by NSF awards IIS-1546305 and IIS-1637941, a GRO grant from Samsung, and NVIDIA GPU donations.

References