Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting. FiT uses an automatically configured Naive Bayes classifier on top of a fixed backbone that has been pretrained on large image datasets. Parameter efficient FiLM layers are used to modulate the backbone, shaping the representation for the downstream task. The network is trained via an episodic fine-tuning protocol. The approach is parameter efficient which is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the state-of-the-art Big Transfer (BiT) algorithm at low-shot and on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
With the success of the commercial application of deep learning in many fields such as computer vision (Schroff et al., 2015), speech recognition (Xiong et al., 2018), and language translation (Wu et al., 2016), an increasing number of models are being trained on central servers and then deployed on remote devices, often to personalize a model to a specific user’s needs. Personalization requires models that can be updated inexpensively by minimizing the number of parameters that need to be stored and/or transmitted, and frequently calls for few-shot learning methods as the amount of training data from an individual user may be small (Massiceti et al., 2021). At the same time, for privacy, security, and performance reasons, it can be advantageous to use federated learning, where a model is trained on an array of remote devices, each with different data, which share gradient or parameter updates instead of training data with a central server (McMahan et al., 2017). In the federated learning setting, in order to minimize communication cost with the server, it is beneficial to have models with a small number of parameters that need to be updated for each training round conducted by remote clients. The amount of training data available to the clients is often small, again necessitating few-shot learning approaches.
Few-shot learning approaches fall broadly into two camps – meta-learning (Hospedales et al., 2020) and transfer learning (fine-tuning) (Yosinski et al., 2014). It is useful to characterise these approaches in terms of shared and updateable parameters. Shared parameters do not change as the model is retrained or updated, while updateable parameters are those that are either recomputed or learned as the model is updated or retrained. From a statistical perspective, shared parameters capture similarities between datasets, while updateable parameters efficiently capture the differences. In general, meta-learning approaches (Hospedales et al., 2020) to few-shot learning are trained in a multi-task manner and as a result share a large number of parameters and only update a small subset when adapting to a new task (Requeima et al., 2019). While meta-learners can achieve good accuracy on datasets that are similar to what they are meta-trained on, their accuracy suffers when tested on datasets that are significantly different (Dumoulin et al., 2021). Transfer learning algorithms can often outperform meta-learners, especially on diverse datasets and even at low-shot (Dumoulin et al., 2021; Tian et al., 2020). While some transfer learning algorithms are able to minimize the number of updateable parameters by only fine-tuning the final or a small subset of layers, the state of the art Big Transfer (BiT) (Dumoulin et al., 2021; Kolesnikov et al., 2019) algorithm requires every parameter in a large network to be updated.
In this work, we focus on designing deep learning network architectures and associated training protocols that allow image classification models to be updated with only a small subset of the total model parameters, without sacrificing prediction performance when there is only a small number of training examples available. This leads to reduced storage and transmission costs for updating personalized models on remote devices (Massiceti et al., 2021), distributed training in federated learning (McMahan et al., 2017), and efficient ensemble realization (Havasi et al., 2020), among other applications. To realize our goal of small model updates, we pursue a transfer learning approach that takes advantage of image classification backbones that have been pretrained on large upstream datasets (Kolesnikov et al., 2019). We freeze the backbone parameters such that they are shared when fine-tuning on a downstream dataset. Parameter efficient FiLM (Perez et al., 2018) adapter layers are used to modulate the backbone, shaping the representation for the downstream task. For a ResNet50 (He et al., 2016a), the FiLM layer parameter count is less than 0.05% of the overall model size, yet allows expressive adaptation. The last, novel piece of the proposed system is the use of an automatically configured Naive Bayes final layer classifier which outperforms the usual linear layer head. The system is trained end-to-end with an episodic fine-tuning protocol. We call this approach FiLM Transfer or FiT. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy at low-shot with two orders of magnitude fewer updateable parameters when compared to BiT (Kolesnikov et al., 2019), and competitive accuracy when more data is available. We also demonstrate the benefits of FiT on a low-shot real-world model personalization application and in a demanding few-shot federated learning scenario. Our contributions:
A parameter and data efficient network architecture for low-shot transfer learning consisting of an automatically configured Naive Bayes final layer classifier and parameter efficient FiLM layers that are used to adapt a fixed, pretrained backbone to a downstream dataset;
A meta-learning inspired episodic training protocol for low-shot fine-tuning requiring no data augmentation, no regularization, and a minimal set of hyper-parameters;
Superior classification accuracy at low-shot on standard downstream datasets and on the challenging VTAB-1k benchmark while using two orders of magnitude fewer updateable parameters when compared to the leading transfer learning method BiT;
Demonstration of superior parameter efficiency without sacrificing classification accuracy in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
In this section we detail the FiT algorithm focusing on the few-shot image classification scenario.
We denote input images x ∈ ℝ^{W×H×C_in}, where W is the width, H the height, and C_in the number of channels, and image labels y ∈ {1, …, C}, where C is the number of image classes indexed by c. Assume that we have access to a model that outputs p(y|x) and is comprised of a feature extractor backbone f_θ with parameters θ that has been pretrained on a large upstream dataset such as ImageNet (Russakovsky et al., 2015), producing embeddings f_θ(x) ∈ ℝ^d, where d is the output feature dimension, and a final layer classifier or head with weights ψ. Let D = {(x_n, y_n)}_{n=1}^{N} be the downstream dataset that we wish to fine-tune the model to.
For the backbone, we freeze the parameters θ to the values learned during upstream pretraining and add Feature-wise Linear Modulation (FiLM) (Perez et al., 2018) layers with parameters φ at strategic points within the network. A FiLM layer scales and shifts the activations a_{ij} arising from the j-th channel of a convolutional layer in the i-th block of the backbone as FiLM(a_{ij}; γ_{ij}, β_{ij}) = γ_{ij} a_{ij} + β_{ij}, where γ_{ij} and β_{ij} are scalars. The set of FiLM parameters φ is learned during fine-tuning. We add FiLM layers following the middle 3×3 convolutional layer in each ResNetV2 (He et al., 2016b) block and also one at the end of the backbone prior to the head. Fig. A.1 illustrates a FiLM layer operating on a convolutional layer and how a FiLM layer can be added to a ResNetV2 network block. FiLM layers can be similarly added to EfficientNet based backbones. A key advantage of FiLM layers is that they enable expressive feature adaptation while adding only a small number of parameters (Perez et al., 2018). For example, in a ResNet50 with a FiLM layer in every block, the set of FiLM parameters φ accounts for only 11,648 parameters, which is fewer than 0.05% of the parameters in the model. We show in Section 4 that FiLM layers allow the model to adapt to a broad class of datasets.
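As a concrete illustration, a FiLM layer is just a per-channel affine transform of a convolutional activation map. The NumPy sketch below is illustrative only; the actual implementation operates on framework tensors inside the backbone, and the function and variable names here are our own.

```python
import numpy as np

def film(activations, gamma, beta):
    """Feature-wise Linear Modulation (Perez et al., 2018).

    Scales and shifts each channel of a convolutional activation map:
    channel j is transformed as gamma[j] * a_j + beta[j].

    activations: array of shape (C, H, W); gamma, beta: arrays of shape (C,).
    """
    return gamma[:, None, None] * activations + beta[:, None, None]

# Identity initialization (gamma = 1, beta = 0) leaves the pretrained
# backbone's features unchanged, which is how the 1-shot case is handled.
a = np.random.randn(64, 8, 8)
assert np.allclose(film(a, np.ones(64), np.zeros(64)), a)
```

Only the scale and shift vectors are trained, roughly two parameters per channel per modulated layer, which is what keeps the FiLM parameter count in the low tens of thousands for a ResNet50.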
For the head, we use a Gaussian Naive Bayes classifier which can be automatically configured directly from data. If the data is normally distributed, Naive Bayes is generally more effective than logistic regression (Hastie et al., 2009; Efron, 1975), and especially so in the small data setting (Pohar et al., 2004). Also, when compared to the usual linear layer head, Naive Bayes has fewer free parameters to learn, and in Section 4 we show that it yields superior results. The probability of classifying a test point x* is:

p(y* = c | x*) = π_c N(f_θ(x*); μ_c, Σ_c) / Σ_{c′} π_{c′} N(f_θ(x*); μ_{c′}, Σ_{c′}),   (1)

where μ_c = (1/N_c) Σ_{n : y_n = c} f_θ(x_n) and π_c = N_c / N are the maximum likelihood estimates, N_c is the number of examples of class c in D, and N(f_θ(x); μ_c, Σ_c) is a multivariate Gaussian over f_θ(x) with mean μ_c and covariance Σ_c.
Estimating the mean μ_c for each class is straightforward and incurs a total storage cost of O(Cd). However, estimating the covariance Σ_c for each class is challenging when the number of examples per class is small and the embedding dimension d of the backbone is large. In addition, the O(Cd²) storage cost for the covariance matrices may be prohibitively high if d is large. In this work, we use three different approximations to the covariance in place of Σ_c in Eq. 1:

Σ_c^{QDA} = λ₁ Σ_c^{class} + λ₂ Σ^{task} + λ₃ I;   Σ^{LDA} = λ₁ Σ^{task} + λ₂ I;   Σ^{ProtoNets} = I.

In the above, Σ_c^{class} is the computed covariance of the examples in class c in D, Σ^{task} is the computed covariance of all the examples in D assuming they arise from a single Gaussian with a single mean, the λ are weights learned during training, and the identity matrix I is used as a regularizer. The primary difference between QDA and LDA is that QDA computes and stores a covariance matrix for each class in the dataset, while LDA shares a single covariance matrix across all classes. The numbers of shared and updateable model parameters for the three FiT variants as well as the BiT algorithm are detailed in Table A.1. Using the BiT-M-R50x1 (Kolesnikov et al., 2019) backbone on a 10-way classification task, BiT shares no parameters and all 23.52M parameters are updateable. FiT shares the 23.50M backbone parameters, and FiT-QDA, FiT-LDA, and FiT-ProtoNets have 21.01M, 32,140, and 32,128 updateable parameters, respectively. The backbone parameters θ are known and fixed from pretraining, but we must learn the FiLM parameters φ and the covariance weights λ as outlined in the next section.
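The head statistics can be estimated in a few lines. The NumPy sketch below computes the per-class mean embeddings and a regularized task-level covariance in the spirit of the LDA variant; the fixed blend weights stand in for the learned covariance weights and are an assumption for illustration, as are the function names.

```python
import numpy as np

def class_means(features, labels, num_classes):
    """Maximum likelihood estimate of the per-class mean embeddings.

    features: (N, d) array of backbone embeddings; labels: (N,) int array.
    """
    return np.stack([features[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def lda_covariance(features, lam=(0.5, 0.5)):
    """Single task-level covariance shared across classes, regularized
    by the identity matrix (in the spirit of FiT-LDA).

    lam: illustrative blend weights; in FiT these are learned during
    fine-tuning rather than fixed.
    """
    d = features.shape[1]
    task_cov = np.cov(features, rowvar=False, bias=True)  # single-mean ML estimate
    return lam[0] * task_cov + lam[1] * np.eye(d)
```

Sharing one d x d covariance across classes is what keeps the LDA head's updateable parameter count close to that of a plain prototype classifier.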
We learn φ and λ via fine-tuning, but employ an approach often used in meta-learning referred to as episodic training (Vinyals et al., 2016) to form the training batches. Note that we require ‘training’ data to compute μ_c, π_c, and Σ_c to configure the head, and a ‘test’ set to optimize φ and λ via gradient ascent. Thus, from the downstream dataset D, we derive two sets, D_train and D_test. If D is sufficiently large (≥ 1000 examples), we set D_train = D_test = D. Otherwise, in the few-shot scenario, we randomly split D into D_train and D_test such that the number of examples or shots in each class is roughly equal in both partitions and there is at least one example of each class in both. Refer to Algorithm A.1 for details. For each training iteration, we sample a task consisting of a support set S drawn from D_train and a query set Q drawn from D_test. First, S is formed by randomly choosing a subset of classes selected from the range of available classes in D_train. Second, the number of shots to use for each selected class is randomly selected from the range of available examples in each class of D_train, with the goal of keeping the examples per class as equal as possible. Third, Q is formed by using the classes selected for S and all available examples from D_test in those classes, up to a limit of 2000 examples. Refer to Algorithm A.2 for details. Episodic task sampling is crucial to achieving the best classification accuracy. If all of D_train and D_test are used for every iteration, overfitting occurs, limiting accuracy (see Table A.3 and Table A.4). The support set S is then used to compute μ_c, π_c, and Σ_c, and we then use Q to train φ and λ with maximum likelihood. We optimize the following objective:

L(φ, λ) = Σ_{(x*, y*) ∈ Q} log p(y* | x*, S, φ, λ).
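The episodic sampling procedure above can be sketched as follows. This is a simplified stand-in for Algorithm A.2 (the exact shot-balancing logic differs), and the names are our own.

```python
import random
from collections import defaultdict

def sample_task(train_set, test_set, way, max_query=2000):
    """Sample one fine-tuning episode: a support set drawn from the training
    partition and a query set drawn from the test partition.

    train_set/test_set: lists of (example, label) pairs.
    way: number of classes in the episode.
    """
    by_class = defaultdict(list)
    for x, y in train_set:
        by_class[y].append((x, y))

    # Step 1: randomly choose a subset of the available classes.
    classes = random.sample(list(by_class), min(way, len(by_class)))

    # Step 2: randomly choose a shot count per class and sample the support set.
    support = []
    for c in classes:
        shot = random.randint(1, len(by_class[c]))
        support += random.sample(by_class[c], shot)

    # Step 3: the query set uses the same classes, capped at max_query examples.
    query = [(x, y) for x, y in test_set if y in set(classes)][:max_query]
    return support, query
```

Resampling a fresh support/query split every iteration is what prevents the overfitting observed when all of the data is used in every step.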
Training hyperparameters include a learning rate and the number of training iterations. For the transfer learning experiments in Section 4, these are set to constant values across all datasets and do not need to be tuned on a validation set. We do not augment the training data, even in the 1-shot case. When there is only one shot per class, we leave the FiLM layer parameters at their initial values of γ = 1 and β = 0 and predict as described in the next paragraph. In Section 4 we show this can yield results that surpass those obtained when augmentation and training steps are applied to the 1-shot data.
Once the FiLM parameters φ and covariance weights λ have been learned, we use all of D as the support set to compute μ_c, π_c, and Σ_c for each class, and then Eq. 1 can be used to make a prediction for any unseen test input.
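Given the estimated class means and a covariance, prediction reduces to evaluating Gaussian log-densities of the test embedding, as in Eq. 1. The sketch below shares one covariance across classes, as in the LDA variant, and for simplicity assumes uniform class priors (a Naive Bayes head would additionally weight classes by their empirical frequencies); the function name is our own.

```python
import numpy as np

def predict(f_test, means, cov):
    """Classify an embedding with a shared-covariance Gaussian head.

    f_test: (d,) test embedding; means: (C, d) class means; cov: (d, d).
    Returns per-class log scores up to a class-independent constant;
    argmax gives the predicted class.
    """
    inv = np.linalg.inv(cov)
    diffs = means - f_test                       # (C, d)
    # -0.5 * Mahalanobis distance to each class mean
    return -0.5 * np.einsum('cd,de,ce->c', diffs, inv, diffs)
```

With the identity covariance this reduces to nearest-centroid (ProtoNets-style) classification, which is why the ProtoNets variant is the cheapest of the three heads.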
We take inspiration from residual adapters (Rebuffi et al., 2017, 2018), where parameter efficient adapters are inserted into a ResNet with frozen pretrained weights. The adapter parameters and the final layer linear classifier are then learned via fine-tuning. FiT differs from residual adapters by focusing on the few-shot scenario, and by using FiLM layers that have fewer parameters than a residual adapter, a Naive Bayes head, and an episodic training protocol. CNAPs (Requeima et al., 2019) also uses a frozen, pretrained backbone with FiLM layers added. The FiLM parameters and a linear head are generated by meta-trained hypernetworks. Simple CNAPs (Bateni et al., 2020) improves on CNAPs by using a Mahalanobis distance based classifier that uses a blend of class specific and task specific covariance matrices. FiT-LDA greatly improves over Simple CNAPs (Bronskill et al., 2021) in terms of classification accuracy and updateable parameters on the VTAB-1k benchmark, as a result of using fine-tuning instead of amortization networks and using a single dataset specific covariance matrix as opposed to class specific ones.
FLUTE (Triantafillou et al., 2021) jointly learns backbone parameters along with a set of dataset specific FiLM layer parameters during multi-task training. When asked to classify a novel input at test-time, the backbone parameters are frozen and a Blender network generates initial values for the FiLM parameters using a combination of those learned in the training phase and the test task. The initial FiLM parameters are then improved via fine-tuning through a nearest centroid final layer classifier. FiT differs from FLUTE in that (i) there is no initial multi-task learning phase; (ii) it uses more fine-tuning iterations; and (iii) it uses a Naive Bayes head. The work of Mudrakarta et al. (2019) has the same aim as this work. Instead of fine-tuning FiLM layers that are added to a pretrained network, batch normalization weights are tuned. While this work is similar in spirit, unlike FiT, it does not focus on the few-shot regime, employs a linear head, and uses a more conventional fine-tuning protocol.
In this section, we evaluate the classification accuracy and updateable parameter efficiency of FiT in a series of challenging benchmarks and application scenarios. First, we compare different variations of FiT to Big Transfer (BiT) (Kolesnikov et al., 2019), a state-of-the-art transfer learning algorithm, on several standard downstream datasets in the few-shot regime. Second, we evaluate FiT against BiT on the challenging VTAB-1k benchmark (Zhai et al., 2019), where BiT has been shown to outperform all meta-learners (Dumoulin et al., 2021; Bronskill et al., 2021). Third, we show how FiT can be used in a personalization scenario on the ORBIT (Massiceti et al., 2021) dataset, where a smaller updateable model is an important evaluation metric. Finally, we apply FiT to a few-shot federated learning scenario where minimizing the number of parameter updates and their size is a key requirement. In addition, we introduce the Relative Model Update Size (RMUS) metric, which is the ratio of the number of updateable parameters in one model to that of another. Training and evaluation details are in Section A.5. Source code for experiments can be found at: https://github.com/cambridge-mlg/fit.
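RMUS itself is a simple ratio. As a worked example, using the updateable parameter counts reported above for a 10-way task with the BiT-M-R50x1 backbone (23.52M for BiT and 32,140 for FiT-LDA):

```python
def rmus(updateable_a, updateable_b):
    """Relative Model Update Size: ratio of updateable parameter counts
    between two models."""
    return updateable_a / updateable_b

# BiT's model update is roughly 732 times larger than FiT-LDA's.
print(round(rmus(23_520_000, 32_140)))  # 732
```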
Fig. 1 shows the classification accuracy as a function of RMUS for FiT-LDA and BiT on four downstream datasets (CIFAR10, CIFAR100 (Krizhevsky et al., 2009), Flowers (Nilsback and Zisserman, 2008), and Pets (Parkhi et al., 2012)) that were used to evaluate the performance of BiT (Kolesnikov et al., 2019). Table A.2 contains complete tabular results with additional variants of FiT and BiT. All methods use the BiT-M-R50x1 (Kolesnikov et al., 2019) backbone that has been pretrained on the ImageNet-21K (Russakovsky et al., 2015) dataset. The key observations from Fig. 1 are:
For 10 or fewer shots (with the exception of 1-shot on CIFAR100), FiT-LDA outperforms BiT, often by a large margin.
On 3 out of 4 datasets, FiT-LDA outperforms BiT even when all of D is used for fine-tuning.
FiT-LDA outperforms BiT despite BiT having more than 100 times as many updateable parameters.
To avoid overfitting when D is small, Table A.3 indicates that it is better to split D into two disjoint partitions, D_train and D_test, and that the support and query sets should be randomly sub-sampled from them as opposed to using all of the data in each training iteration.
We also evaluate BiT-FiLM, a variant of BiT that uses the same training protocol as the standard version of BiT, but the backbone weights are frozen and FiLM layers are added in the same manner as FiT. The FiLM parameters and the linear head weights are learned during training. The results are shown in Table A.2 and the key observations are:
In general, at low-shot, the standard version of BiT outperforms BiT-FiLM. However, as the shot count increases, and especially when training on all of D, BiT-FiLM matches it in classification accuracy.
The above implies that FiLM layers have sufficient capacity to accurately fine-tune to downstream datasets, but the FiT head and training protocol are needed to achieve superior results.
While the accuracy of FiT-QDA and FiT-LDA is similar, the storage requirements for a covariance matrix for each class makes QDA impractical if model update size is an important consideration.
The accuracy of FiT-ProtoNets is slightly lower than FiT-LDA, but often betters BiT, despite BiT having more than 100 times as many updateable parameters.
The datasets used in this section were similar in content to the dataset used for pretraining and the performance of FiT-QDA and FiT-LDA was similar, indicating that the covariance per class was not that useful for these datasets. In the next section, we test on a wider variety of datasets, many of which differ greatly from the upstream data.
The VTAB-1k benchmark (Zhai et al., 2019) is a low to medium-shot transfer learning benchmark that consists of 19 datasets grouped into three distinct categories (natural, specialized, and structured). From each dataset, 1000 examples are drawn at random from the training split to use for the downstream dataset . After fine-tuning, the entire test split is used to evaluate classification performance. Table 1 shows the classification accuracy of the three variants of FiT and BiT using the BiT-M-R50x1 backbone for all variants. The key observations from our results are:
Both FiT-QDA and FiT-LDA outperform BiT on VTAB-1k.
The FiT-QDA variant has the best overall performance, showing that the class covariance is important to achieve superior results on datasets that differ from those used in upstream pretraining (e.g. the structured category of datasets). However, the updateable parameter cost is high.
FiT-LDA utilizes two orders of magnitude fewer updateable parameters compared to BiT, making it the preferred approach.
Table A.4 indicates that it is best to use all of D for both D_train and D_test (i.e. no split) and that the support and query sets should be randomly sub-sampled as opposed to using all of the data in each iteration.
In our experiments, we use ORBIT (Massiceti et al., 2021), a real-world few-shot video dataset recorded by people who are blind/low-vision. A blind or vision-impaired user collects a series of short videos on their smartphone of objects that they would like to recognize. The collected videos and associated labels are then uploaded to a central service to train a personalized classification model for that user. Once trained, the personalized model is downloaded to the user’s smartphone. The initial model download would include all the model parameters, both shared and updateable. However, models with a smaller number of updateable parameters are preferred in order to save model storage space on the central server and in transmitting any updated models to a user. The goal is to take a backbone pretrained on ImageNet (Deng et al., 2009) or other large-scale dataset and construct a personalized model for a user using only their individual data. We follow the object recognition benchmark task proposed by the authors, which tests a personalized model on two different video types: clean where only a single object is present and clutter where that object appears within a realistic, multi-object scene.
In Table 2, we compare FiT-LDA to several competitive transfer learning and meta-learning methods. We use the LDA variant of FiT, as it achieves higher accuracy than the ProtoNets variant while using far fewer updateable parameters than QDA. For transfer learning, we include FineTuner (Yosinski et al., 2014), which freezes the weights in the backbone and fine-tunes only the linear classifier head on an individual’s data. For meta-learning approaches, we include ProtoNets (Snell et al., 2017) and Simple CNAPs (Bateni et al., 2020), which are meta-trained on Meta-Dataset (Dumoulin et al., 2021). Training and evaluation details are in Section A.5.2. For this comparison, we show frame and video accuracy, averaged over all the videos from all tasks across all test users. We also report the number of shared and individual updateable parameters required to be stored or transmitted. The key observations from our results are:
FiT-LDA outperforms competitive meta-learning methods, Simple CNAPs and ProtoNets.
FiT-LDA also outperforms FineTuner in terms of the video accuracy and performs within error bars of it in terms of the frame accuracy.
The number of individual parameters for FiT-LDA is far smaller than in Simple CNAPs and of the same order of magnitude as FineTuner and ProtoNets.
Compared to a linear head, Naive Bayes reduces the size of the optimization space as there are fewer parameters to learn (only the FiLM parameters φ and covariance weights λ), making FiT-LDA more suitable for the few-shot setting.
In this section, we show how FiT can be used in the few-shot federated learning setting, where training data are split between client nodes, e.g. mobile phones or personal laptops, and each client has only a handful of samples. Model training is performed via numerous communication rounds between a server and clients. In each round, the server selects a fraction of clients making updates and then sends the current model parameters to these clients. For data privacy reasons, clients update models locally using only their personal data and then send their parameter updates back to the server. Finally, the server aggregates information from all the clients, updates the shared model parameters, and proceeds to the next round until convergence. In this setting, models with a smaller number of updatable parameters are preferred in order to reduce the client-server communication cost which is typically bandwidth-limited.
For our experiments, we use CIFAR100 (Krizhevsky et al., 2009), a relatively large-scale dataset compared to those commonly used to benchmark federated learning methods (Reddi et al., 2021; Shamsian et al., 2021). We employ the basic FedAvg (McMahan et al., 2017) algorithm. We train all models for a fixed number of communication rounds, with a subset of clients participating in each round and several update steps per client. Each client holds a subset of the classes, which are sampled randomly before the start of training. During an update, a client initializes their local model with the most recent FiLM parameters received from the server and then performs several training steps as described in Section 2, using only their own data. The Naive Bayes head allows a client to construct a local classifier at each update step, eliminating the need to initialize a shared classifier and transmit its parameters at each training round. In contrast, using a linear head in this setting would entail additional communication costs, as it would be passed at each client-server interaction. We use a ProtoNets head for simplicity, although QDA and LDA heads could also be used. After training, the global FiLM parameters are transmitted to all clients. More specific training and evaluation details are in Section A.5.3.
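A single server-side aggregation step in this setup only touches the FiLM parameters. The sketch below shows FedAvg-style averaging of client FiLM updates; the dict layout and optional per-client weighting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fedavg_film(client_film_params, client_weights=None):
    """One server aggregation round of FedAvg (McMahan et al., 2017),
    applied to FiLM parameters only.

    client_film_params: list of dicts mapping layer name -> np.ndarray
                        of that client's FiLM parameters.
    client_weights: optional per-client weights (e.g. local dataset sizes);
                    defaults to a uniform average.
    """
    n = len(client_film_params)
    if client_weights is None:
        w = np.ones(n) / n
    else:
        w = np.asarray(client_weights, dtype=float)
        w = w / w.sum()
    keys = client_film_params[0].keys()
    # Weighted average of each FiLM tensor across clients.
    return {k: sum(w[i] * client_film_params[i][k] for i in range(n))
            for k in keys}
```

Because only the FiLM tensors (tens of thousands of values) cross the network, each round's payload is orders of magnitude smaller than shipping a full backbone.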
We evaluate FiT in two scenarios, global and personalized. In the global setting, the aim is to construct a global classifier and report accuracy on the CIFAR100 (Krizhevsky et al., 2009) test set. We assume that the server knows which classes belong to each client, and it constructs a shared classifier by taking a mean over the prototypes produced by clients for a particular class. In the personalized scenario, we test a personalized model on test classes present in the individual’s training set and then report the mean accuracy over all clients. As opposed to the personalization experiments on ORBIT, where a personalized model is trained using only the client’s local data, in this experiment we initialize a personalized model with the learned global FiLM parameters and then construct a ProtoNets classifier with the individual’s data. Thus, the goal of the personalized setting is to estimate how advantageous distributed learning can be for training FiLM layers to build personalized few-shot models.
To evaluate the FiT model in a distributed learning setup, we define baselines which form upper and lower bounds on model performance. For the global scenario, we take a FiT model trained on all available data simultaneously as the upper bound baseline. To get the lower bound baseline, we train a FiT model for each client with their local data, then average the FiLM parameters of these individual models and construct a global ProtoNets classifier using the resulting FiLM parameters. The upper bound is therefore standard batch training, the performance of which we hope federated learning can approach. The lower bound is a simplistic version of federated learning with a single communication round, which federated averaging should improve over. For the personalized setting, the upper bound baseline is the same as in the global scenario, from which we form a personalized classifier by taking the subset of classes belonging to a client from the global classifier. The lower bound baseline is a FiT model trained for each client individually. The upper bound is again standard batch training, and the lower bound is derived from locally trained models which do not share information and therefore should be improved upon by federated learning.
Fig. 2 shows global and personalized classification accuracy as a function of communication cost for different numbers of clients and shots per client. By communication cost we mean the number of parameters transmitted during training. The key observations from our results are:
In the global setting, the federated learning model is only slightly worse than the upper bound baseline, while outperforming the lower bound model, often by a large margin. This shows that FiT can be efficiently used in distributed learning settings with different configurations.
In the personalized scenario, for a sufficient number of clients, the gap between the federated learning model and the upper bound model is significantly reduced as the number of shots increases. Distributed training strongly outperforms the lower bound baseline, surprisingly even in the case of clients with disjoint classes. This provides empirical evidence that collaborative distributed training can be helpful for improving personalized models in the few-shot data regime.
The low communication cost per round and relatively fast empirical convergence of FiT result in a parameter efficient training protocol, requiring only on the order of a few million parameters to be transferred during the whole training phase. In contrast, if BiT were used for federated learning, roughly 23.5M parameters (the full set of updateable BiT parameters) would be sent at each communication round, yielding an enormous communication cost.
In Section A.3.4, we show that distributed training of a FiT model can be efficiently used to learn from more extreme, non-natural image datasets like Quickdraw (Jongejan et al., 2016). Although the number of communication rounds must be increased for efficient transfer to Quickdraw, FiT still has orders of magnitude lower communication cost than BiT.
In this work, we proposed FiT, a parameter and data efficient few-shot transfer learning system that allows image classification models to be updated with only a small subset of the total model parameters. We demonstrated that FiT can outperform BiT, a state-of-the-art transfer learning method at low shot while using one hundred times fewer updateable parameters. We also showed the efficiency benefits of employing FiT in model personalization and federated learning applications. There has been considerable work on compressing models (Cheng et al., 2017b) and designing more parameter efficient networks (Tan and Le, 2019, 2021; Sandler et al., 2018) to reduce the model parameter count. These lines of research are complementary to FiT. Model compression can be used in conjunction with our work by compressing the subset of updateable parameters. Similarly, parameter efficient networks can serve as the backbones of our classification systems. We leave the combination of these technologies and FiT for future work.
The main limitation of this work is that it is computationally expensive and much slower to adapt to new tasks compared to meta-learning methods that can adapt with a single forward pass through a network (Requeima et al., 2019) or a small number of gradient steps (Finn et al., 2017). Thus FiT may be inappropriate for certain time critical applications and potentially use more energy than competitive approaches.
Image classification methods, including the work presented here, have the potential to be beneficial to society as they are readily applicable to medical image analysis, remote sensing for environmental work, and, as we demonstrated in Section 4.3, helping blind users find their personal items. Conversely, the same methods could be deployed in an adverse manner such as in military or police surveillance systems. Our system for improving the parameter efficiency of models has the potential benefits of lowering energy, bandwidth, and storage costs, and our federated learning approach may provide benefits in protecting user privacy. Despite the improvements presented in this work, image classification methods remain imperfect and can potentially capture biases that violate fairness principles. As a result, our method should be applied judiciously.
We thank Aristeidis Panos and Siddharth Swaroop for providing helpful and insightful comments. This work has been performed using resources provided by the Cambridge Tier-2 system operated by the University of Cambridge Research Computing Service https://www.hpc.cam.ac.uk funded by EPSRC Tier-2 capital grant EP/P020259/1.
Funding in direct support of this work: Aliaksandra Shysheya, John Bronskill, Massimiliano Patacchiola and Richard E. Turner are supported by an EPSRC Prosperity Partnership EP/T005386/1 between the EPSRC, Microsoft Research and the University of Cambridge.
Cheng, G., Han, J., and Lu, X. (2017). Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE 105 (10), pp. 1865–1883. Cited by: Table A.4, Table 1.
Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2017b). A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282. Cited by: §5.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. (2014). Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613. Cited by: Table A.4, Table 1.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §1.
Tan, M. and Le, Q. V. (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 6105–6114. Cited by: §5.
Tan, M. and Le, Q. V. (2021). EfficientNetV2: smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning, pp. 10096–10106. Cited by: §5.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1.
Fig. A.1 illustrates a FiLM layer operating on a convolutional layer and how a FiLM layer can be added to a ResNetV2 network block. FiLM layers can be similarly added to EfficientNet based backbones, amongst others.
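To make the mechanism concrete, the following is a minimal NumPy sketch of a FiLM layer (our illustrative code, not the released implementation): a learned per-channel scale and shift applied to a convolutional feature map.

```python
import numpy as np

def film(h, gamma, beta):
    """FiLM: scale and shift each channel of a feature map.

    h:     activations of shape (C, H, W)
    gamma: per-channel scale, shape (C,)
    beta:  per-channel shift, shape (C,)
    """
    return gamma[:, None, None] * h + beta[:, None, None]

# With gamma = 1 and beta = 0, FiLM reduces to the identity, which is the
# natural initialization before fine-tuning shapes the representation.
h = np.arange(12, dtype=float).reshape(3, 2, 2)
assert np.allclose(film(h, np.ones(3), np.zeros(3)), h)
```

Because only the gamma and beta vectors are task-specific, the per-layer overhead is just two parameters per channel.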
Table A.1 depicts the shared and updateable parameters for BiT and the 3 variants of FiT as a function of , , , and . The rightmost column provides the example updateable parameter count for all models in the case of a BiT-M-R50x1 backbone with , , , and .
For FiT-QDA and FiT-LDA, the means and covariances contribute to the updateable parameter count. We use a mean for every class, which contributes $Cd$ updateable parameters, where $C$ is the number of classes and $d$ is the feature dimension. A covariance matrix has $d^2$ values; however, it can be represented in Cholesky factorized form, which results in a lower triangular matrix and thus can be represented with $d(d+1)/2$ values, with the rest being zeros.
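The saving from the Cholesky representation can be checked numerically; the short NumPy sketch below (illustrative only) factorizes a positive-definite covariance and confirms that only the lower triangle carries information.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
# A symmetric positive-definite covariance matrix (for illustration).
sigma = A @ A.T + d * np.eye(d)

# Cholesky factorization: sigma = L @ L.T with L lower triangular.
L = np.linalg.cholesky(sigma)
assert np.allclose(L @ L.T, sigma)

# The strict upper triangle of L is zero, so d * (d + 1) / 2 values
# suffice instead of the d * d values of the full matrix.
n_values = d * (d + 1) // 2
assert np.count_nonzero(np.triu(L, k=1)) == 0
```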
In the case of FiT-LDA, where the covariance matrix $\Sigma$ is shared across all classes, a more compact representation is possible, resulting in considerable savings in updateable parameters:

$$\log p(y = c \mid x) = x^\top \Sigma^{-1} \mu_c - \tfrac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \pi_c + \mathrm{const}. \tag{A.1}$$

From Eq. A.1, it follows that to compute the probability of classifying a test point $x$, we need to store $\Sigma^{-1} \mu_c$, which has dimensionality $d$, and the scalar $-\tfrac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \pi_c$, which has dimensionality 1, for each class $c$, resulting in only $C(d+1)$ parameters required for the FiT-LDA head. Since the covariance matrix is not shared in the case of FiT-QDA, no additional savings are possible in that case.
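A minimal NumPy sketch of this compact LDA head (illustrative; the function and variable names are ours): the head stores one weight vector and one scalar bias per class, and prediction is a single linear map followed by an argmax.

```python
import numpy as np

def lda_head_params(means, cov, priors):
    """Compact LDA head: per-class weight w_c = Sigma^{-1} mu_c and
    scalar bias b_c = -0.5 mu_c^T Sigma^{-1} mu_c + log pi_c."""
    cov_inv = np.linalg.inv(cov)
    W = means @ cov_inv                          # (C, d) weight vectors
    b = -0.5 * np.einsum('cd,cd->c', W, means) + np.log(priors)
    return W, b                                  # C * (d + 1) stored values

def lda_predict(x, W, b):
    """Classify x by the largest linear discriminant score."""
    return int(np.argmax(W @ x + b))

# Toy example with two well-separated classes (hypothetical data).
means = np.array([[0.0, 0.0], [4.0, 4.0]])
W, b = lda_head_params(means, np.eye(2), np.array([0.5, 0.5]))
assert lda_predict(np.array([0.1, -0.2]), W, b) == 0
assert lda_predict(np.array([3.9, 4.2]), W, b) == 1
```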
This section contains additional results that would not fit into the main paper, including tabular versions of figures.
Table A.3 shows the few-shot results for all three variants of FiT with different ablations on how the downstream dataset is allocated during training. No Split indicates that the downstream dataset is not split into two disjoint partitions; the support and query sets are nevertheless sampled to form episodic training tasks as detailed in Algorithm A.2. Split indicates that the downstream dataset is split into two disjoint partitions as detailed in Algorithm A.1, which are then sampled into tasks as described in Algorithm A.2. Use All indicates that the dataset is not split and that the support and query sets are not subsampled; instead, all available data serves as both the support and query set for every task.
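A rough Python sketch of the Split option (the helper names and the 50/50 partition are our illustrative assumptions; Algorithms A.1 and A.2 give the exact procedure):

```python
import random

def split_dataset(indices, seed=0):
    """Split a dataset's example indices into two disjoint partitions,
    one pool for support sets and one for query sets (Split option)."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def sample_task(support_pool, query_pool, n_support, n_query, seed=0):
    """Sample one episodic training task from the two pools. Under the
    No Split option the two pools would be the same list."""
    rng = random.Random(seed)
    return rng.sample(support_pool, n_support), rng.sample(query_pool, n_query)

pool_s, pool_q = split_dataset(list(range(100)))
assert set(pool_s).isdisjoint(pool_q)
support, query = sample_task(pool_s, pool_q, n_support=5, n_query=10)
```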
Table A.3 shows that Use All is consistently the worst option. In general, in the few-shot case, Split either outperforms No Split (CIFAR10, Pets) or achieves the same level of performance (CIFAR100, Flowers102). As a result, we use the Split option when reporting the few-shot results.
Table A.4 shows the VTAB-1k results for all three variants of FiT with different ablations on how the downstream dataset is allocated during training. Refer to Section A.3.2 for the meanings of No Split, Split, and Use All. With some minor exceptions, the Use All case performs the worst. The performance of the No Split and Split options is very close, with No Split being slightly better when averaged over all of the datasets. As a result, we use the No Split option when reporting the VTAB-1k results.
To test distributed training of a FiT model on a more extreme, non-natural image dataset, we also include the results for federated training of FiT on the Quickdraw dataset. As there is no pre-defined train/test split for the Quickdraw dataset, we randomly choose samples from each of the classes and use them for testing. We train all federated training models for communication rounds, with clients per round, and update steps per client. Since Quickdraw is a more difficult dataset than CIFAR100, it requires more communication rounds for training. Each client has classes, which are sampled randomly at the start of training. In our experiments, we omit the -clients case, as the overall amount of data in the system is not enough to even train a robust global upper bound baseline model.
In this section, we provide implementation details for all of the experiments in Section 4.
All of the FiT few-shot and VTAB-1k transfer learning experiments were carried out on a single NVIDIA A100 GPU with 80GB of memory. The Adam optimizer [Kingma and Ba, 2015] with a constant learning rate of 0.0035, for 400 iterations, and =100 was used throughout. No data augmentation was used, and images were scaled to 384×384 pixels unless the image size was 32×32 pixels or less, in which case the images were scaled to 224×224 pixels. These hyperparameters were empirically derived from a small number of runs.
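The resizing rule above can be expressed as a small helper (a sketch; the function name is ours, not from the released code):

```python
def target_size(height, width):
    """Input resolution rule for FiT fine-tuning: scale images to
    384x384 unless the source images are 32x32 pixels or smaller,
    in which case scale to 224x224."""
    return (224, 224) if max(height, width) <= 32 else (384, 384)

assert target_size(32, 32) == (224, 224)   # e.g. CIFAR-sized inputs
assert target_size(500, 375) == (384, 384)  # larger natural images
```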
FiT-QDA, FiT-LDA, and FiT-ProtoNets take approximately 12, 10, and 9 hours, respectively, to fine-tune on all 19 VTAB datasets, and 5, 3, and 3 hours, respectively, to fine-tune all shots on the 4 low-shot datasets.
For the BiT few-shot experiments, we used the code supplied by the authors [Kolesnikov et al., 2020] with minor augmentations to read additional datasets. The BiT few-shot experiments were run on a single NVIDIA V100 GPU with 16GB of memory.
For the BiT VTAB-1k experiments, we used the three fine-tuned models for each of the datasets that were provided by the authors [Kolesnikov et al., 2020]. We evaluated all of the models on the respective test splits for each dataset and averaged the results of the three models. The BiT-HyperRule [Kolesnikov et al., 2019] was respected in all runs. These experiments were executed on a single NVIDIA GeForce RTX 3090 with 24GB of memory.
The personalization experiments were carried out on a single NVIDIA GeForce RTX 3090 with 24GB of memory. It takes approximately hours to train FiT-LDA personalization models for all the ORBIT [Massiceti et al., 2021] test tasks. We derived all hyperparameters empirically from a small number of runs. We used the ORBIT codebase (https://github.com/microsoft/ORBIT-Dataset) in our experiments, only adding the code for splitting test user tasks and slightly modifying the main training loop to make it suitable for FiT training.
For the personalization experiments, all methods use an EfficientNet-B0 as the feature extractor, as it has previously shown superior performance on the ORBIT dataset [Bronskill et al., 2021], and an image size of . FiT-LDA, FineTuner [Yosinski et al., 2014], and Simple CNAPs [Bateni et al., 2020] use a backbone pretrained on ImageNet [Deng et al., 2009], while ProtoNets [Snell et al., 2017] uses feature extractor weights meta-trained on Meta-Dataset [Dumoulin et al., 2021].
The FineTuner [Yosinski et al., 2014] results are from [Bronskill et al., 2021]. Meta-trained weights for Simple CNAPs [Bateni et al., 2020] and ProtoNets [Snell et al., 2017] are also taken from [Bronskill et al., 2021]. Using these weights, we test these models on the ORBIT test set and report the results.
FiLM layers in FiT-LDA are added to the feature extractor as described in Section 2, resulting in .
We follow the task sampling protocols described in [Massiceti et al., 2021], and train the FiT model for optimization steps using the Adam optimizer with a learning rate of . The ORBIT test tasks have a slightly different structure from standard few-shot classification tasks, so in Algorithm A.3 we provide a modified version of data splitting for the classifier head construction. In particular, each test user has a number of objects (classes) they want to recognize, with several videos recorded per object. Each video is split into clips, consecutive -frame parts of the video. A user test task comprises these clips, randomly sampled from different videos of the user's objects, together with the associated labels. Since clips sampled from the same video can be semantically similar, we split the test task so that clips from the same video can only be in either the support or the query set, except when there is only one video of an object available.
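The video-level constraint can be sketched as follows (a simplified illustration with hypothetical helper names; the actual procedure is Algorithm A.3): videos of each object are partitioned between support and query, and only objects with a single video fall back to splitting clips within that video.

```python
import random
from collections import defaultdict

def split_clips_by_video(clips, seed=0):
    """Assign clips to support/query so that clips from the same video
    never appear in both sets, except when an object has only one video.

    clips: list of (object_label, video_id) pairs.
    """
    rng = random.Random(seed)
    videos = defaultdict(set)
    for label, vid in clips:
        videos[label].add(vid)

    support, query = [], []
    for label, vids in videos.items():
        vids = sorted(vids)
        if len(vids) == 1:
            # Single video: fall back to splitting its clips across both sets.
            single = [c for c in clips if c[0] == label]
            rng.shuffle(single)
            half = len(single) // 2
            support += single[:half]
            query += single[half:]
        else:
            rng.shuffle(vids)
            s_vids = set(vids[:max(1, len(vids) // 2)])
            for c in clips:
                if c[0] == label:
                    (support if c[1] in s_vids else query).append(c)
    return support, query

clips = [('mug', 'v1'), ('mug', 'v1'), ('mug', 'v2'), ('mug', 'v2'),
         ('keys', 'v3'), ('keys', 'v3')]
support, query = split_clips_by_video(clips)
# 'mug' videos are separated; 'keys' (single video) is split across both.
mug_s = {v for (l, v) in support if l == 'mug'}
mug_q = {v for (l, v) in query if l == 'mug'}
assert mug_s.isdisjoint(mug_q)
```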
For each local update a new Adam optimizer is initialized. In each communication round, clients are randomly chosen for making model updates. All of the federated learning experiments were carried out on a single NVIDIA A100 GPU with 80GB of memory. In all experiments we use FiT with the BiT-M-R50x1 [Kolesnikov et al., 2019] backbone pretrained on the ImageNet-21K [Russakovsky et al., 2015] dataset and ProtoNets head. We derive all hyperparameters empirically from a small number of runs.
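A FedAvg-style communication round over the small set of updateable parameters might look like the toy sketch below (our illustration, with scalar parameters standing in for the FiLM and head tensors; not the actual training code). Only these parameters travel between clients and server, which is the source of the communication savings.

```python
import random

def federated_round(global_params, clients, local_update, n_sampled, seed=0):
    """One communication round: sample clients, run local updates on the
    updateable parameters only, and average the resulting parameters.

    global_params: dict name -> float (stand-in for FiLM/head tensors)
    local_update:  fn(params, client) -> updated params dict
    """
    rng = random.Random(seed)
    sampled = rng.sample(clients, n_sampled)
    updates = [local_update(dict(global_params), c) for c in sampled]
    # FedAvg: uniform average of the sampled clients' parameters.
    return {k: sum(u[k] for u in updates) / len(updates)
            for k in global_params}

# Toy check: each client nudges the single parameter toward its own value.
clients = [1.0, 2.0, 3.0, 4.0]
step = lambda p, c: {'gamma': p['gamma'] + 0.1 * (c - p['gamma'])}
new = federated_round({'gamma': 0.0}, clients, step, n_sampled=4)
assert abs(new['gamma'] - 0.25) < 1e-9  # average of 0.1 * {1, 2, 3, 4}
```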
We train all federated learning models with different numbers of clients and shots per client for communication rounds. We use a learning rate of at the start of training, decaying it by every communication rounds. Upper and lower bound baselines for both the global and personalized scenarios were trained for epochs using the Adam optimizer with a constant learning rate of . It takes around minutes to train the federated learning models, with slightly more training time required for models with a larger number of shots.
We train all federated learning models with different numbers of clients and shots per client for communication rounds. We use a constant learning rate of for training all federated learning models, except for the model with clients and shots, where we decay the learning rate by every communication rounds. Upper bound baseline models, which require training a global model using all available data, were trained for steps using the Adam optimizer with a constant learning rate of . Lower bound baseline models, which require training a personalized model for each individual, were trained for steps using the Adam optimizer with a learning rate of . As there are only a few samples per class per client, personalized models are trained in a few-shot regime, which leads to overfitting if trained for longer. It takes around hours to train the federated learning models, with slightly more training time required for models with a larger number of shots.