Privacy concerns in the machine learning pipeline have recently drawn considerable attention, especially for deep neural networks (DNNs), which are increasingly adopted on mobile devices (xu2019first; kdd/WangZBZCY18) and often require a large amount of sensitive data (e.g., images and input corpus) from mobile users to train. As one typical example, the recently released General Data Protection Regulation (GDPR) (gdpr) of the European Union strictly regulates whether and how companies can access the personal data owned by their users. In parallel, the research community has made substantial efforts to design novel paradigms of machine learning and large-scale data mining that preserve the privacy of end users. One promising direction is the emerging federated learning (mcmahan2016communication), which aims to train DNN models in a decentralized way, collaboratively from the contributions of many client devices, without gathering the private data from individual devices to the cloud.
Training neural network models on decentralized data and devices addresses the privacy issue to some extent, at the expense of efficiency. Once a neural network architecture is determined, there are newly developed methods to speed up the decentralized training process (konevcny2016federated; mcmahan2016communication). However, in most tasks the network architecture is not determined a priori, and it remains very difficult to search for the optimal network architecture(s) and train them efficiently in a decentralized setup. Indeed, it is well known that designing an efficient architecture is a labor-intensive process that may require a vast number of training attempts, which could become prohibitively time-consuming given decentralized training data. Making it worse, the hardware platforms on mobile devices are highly heterogeneous (AI-benchmark), so different network architectures are required to fit diverse resource budgets on the hardware. This becomes a bottleneck for practically deploying federated learning, given the increasingly important role of neural architecture search (NAS) in launching deep learning in reality.
To address this major challenge, this paper proposes a new paradigm for DNN training that enables automatic neural architecture search (NAS) on decentralized data, called federated NAS. The major goal is to address both automation and privacy issues while training DNNs with heterogeneous mobile devices. As shown in Figure 1, the basic guiding idea of federated NAS is to decouple the two primary logic steps of the NAS process, i.e., model search and model training, and distribute them separately on cloud and clients. Specifically, every single client uses only its local dataset to train and test a model, while the cloud coordinates all the clients and determines the searching direction without requiring the raw data.
Given the preceding conceptual principles, enabling neural architecture search in a federated setting is fundamentally challenging due to limited on-client hardware resources, i.e., computation and communication. NAS is known to be computation-intensive (e.g., thousands of GPU-hours (rl-nas)), given the large number of model candidates to be explored. Meanwhile, the communication cost between cloud and clients also scales up with the increased number of model candidates. It is also worth mentioning that in the federated NAS paradigm, the data distribution among clients is often non-iid and highly skewed (mcmahan2016communication), which can mislead the NAS algorithm into selecting a non-optimal DNN candidate.
We present the first framework for federated NAS, named FedNAS. FedNAS starts from an expensive pre-trained model, and iteratively adapts the model to a more compact one until it meets a user-specified resource budget. For each iteration, FedNAS generates a list of pruned model candidates, then re-trains (tunes) and tests them collaboratively across the cloud and clients. The most accurate one will be selected before moving to the next iteration. When terminated, FedNAS outputs a sequence of simplified DNN architectures that form the efficient frontier that strikes a balance on the trade-off of model accuracy and resource consumption.
By learning from and retrofitting the idea of using proxy tasks, i.e., insufficient candidate re-training, from previous work (netadapt; liu2018progressive; cai2018path; tan2019mnasnet), FedNAS provides several mechanisms to make federated NAS practical: the tuning of each DNN candidate is parallel (across clients), dynamic (across time), and heterogeneous (across models). (1) By parallel tuning, FedNAS works on different model candidates simultaneously and organizes the available clients into many groups. All clients in a group collaboratively train and test a DNN candidate, with their results (accuracy, gradients, etc.) uploaded and properly fused on the cloud. Different groups work on different DNN candidates in parallel to increase scalability by involving more available clients. To ensure the generality of each DNN candidate, FedNAS incorporates a principled client partition algorithm with regard to each client's data distribution and data size. (2) By dynamic training, FedNAS trains each candidate with more rounds as iterations go on, instead of using a fixed and large round number as in prior NAS work (netadapt). This is based on the observation that as the model is simplified to a smaller one, each DNN candidate requires more re-training to regain accuracy so that FedNAS can adapt the model in the right direction. (3) By model heterogeneity, FedNAS drops the non-optimal candidates early in the re-training stage (e.g., after 2 rounds), and only the optimal one is trained for the required round number (e.g., 10). This is based on the observation that the optimal DNN candidate often outperforms the others long before the re-training is done.
We comprehensively evaluated the performance of FedNAS on two datasets, ImageNet (iid) and Celeba (non-iid), as well as two CNN architectures, i.e., MobileNet and a simplified AlexNet. The results show that FedNAS achieves similar model accuracy to a state-of-the-art NAS algorithm that trains models on centralized data, and the three novel optimizations above can reduce the client cost by up to two orders of magnitude, e.g., 277× for computation time and 281× for bandwidth usage. FedNAS also provides flexible trade-offs between the generated model accuracy and the client cost.
In summary, the main contributions of this paper are:
To the best of our knowledge, we are the first to propose federated neural architecture search, a novel paradigm to automate the generation of DNN models on decentralized data.
We present FedNAS, a practical framework that enables efficient federated NAS. The core of FedNAS is to fully leverage the insufficient candidate tuning, an intrinsic NAS characteristic, and incorporate key optimizations to reduce on-client overhead.
We evaluate FedNAS with extensive experiments. Results show that FedNAS is able to generate a sequence of models under different resource budgets with accuracy as high as that of a traditional NAS algorithm, without centralized data, and significantly reduces computational and communication cost on clients compared to straightforward federated NAS designs.
2. Related Work
Neural architecture search (NAS) Designing neural networks is a labor-intensive process that requires a large amount of trial and error by experts. To address this problem, there is growing interest in automating the search for good neural network architectures. Originally, NAS was mostly designed to find the single most accurate architecture within a large search space, without regard for the model performance (e.g., size and computations) (liu2018progressive; rl-nas). In recent years, with more attention on deploying neural networks on heterogeneous platforms, researchers have been developing NAS algorithms (tan2019mnasnet; morphnet; netadapt; wu2019fbnet) to automate model simplification. The goal is to generate a sequence of simplified models from an expensive one with the best accuracy under corresponding resource budgets, i.e., the Pareto frontier of the accuracy-computation trade-off. FedNAS is motivated by and based on those prior efforts.
Accelerating NAS Despite the remarkable results, conventional NAS algorithms are prohibitively computation-intensive. The main bottleneck is the training of a large number of model candidates, which often takes up to thousands of GPU hours (zoph2018learning). As a trade-off, many NAS algorithms (netadapt; liu2018progressive; cai2018path; tan2019mnasnet; morphnet) propose to search for building blocks on proxy tasks, such as training for fewer epochs, starting with a smaller dataset, or learning with fewer blocks. Our work also utilizes and retrofits such proxy tasks as insufficient tuning during the search process. Recent work has explored weight sharing across models through a hypernetwork (brock2017smash; pham2018efficient) or an over-parameterized one-shot model (conf/icml/BenderKZVL18; cai2018proxylessnas) to amortize the cost of training. Those methods, however, target generating only one model and often break the high parallelism of NAS, making them unsuitable for our target scenario, i.e., generating multiple models under different resource budgets in federated settings. A few efforts have proposed distributed systems (ATM; Vizier; Katib) for automated machine learning tasks. Such work assumes the training data is centralized on the cloud instead of decentralized on clients. In comparison, we face some unique challenges: data distributions are non-iid and highly skewed, client devices are resource-constrained, etc.
Federated learning (FL) (mcmahan2016communication) is a distributed machine learning approach that enables training on a large corpus of decentralized data residing on devices like smartphones. By decentralization, FL addresses the fundamental problems of privacy, ownership, and data locality. Though our proposed FedNAS approach borrows the spirit of FL, all existing FL research focuses on training one specific model instead of the end-to-end procedure of automatic architecture search. As a result, their designs miss important optimizations that stem from intrinsic characteristics of NAS, such as early dropping of model candidates (more details in Section 3.3). Further enhancements to FL, e.g., improving the privacy guarantees through differential privacy (fl-dp) and secure aggregation (fl-secure), or reducing the communication cost between cloud and clients through weight compression (konevcny2016federated; mcmahan2016communication), are complementary and orthogonal to FedNAS.
We define our target problem and identify the challenges.
3.1. Problem Statement
Objective Intuitively, the goal of our proposed federated NAS framework is to provide optimal models to run on mobile devices in an automatic and privacy-preserving way. For automation, the framework can begin with a well-known network architecture, e.g., MobileNet, and generate a sequence of simplified models under different resource budgets without any manual effort from developers. To preserve privacy, the framework requires no training data (e.g., input corpus, images) to be uploaded to a centralized cloud or shared among devices. Such application scenarios abound: next-word prediction (fl-dp; hard2018federated), speech keyword spotting (leroy2019federated), image classification (liu2018secure), etc. As in traditional NAS frameworks, the goal of federated NAS can be formulated as follows:
$$\max_{M}\ Acc(M) \quad \text{subject to} \quad Res_j(M) \le Bud_j,\quad j = 1, \dots, m$$
where $M$ is a simplified NN model, $Acc(\cdot)$ computes the accuracy, $Res_j(\cdot)$ evaluates the resource consumption of the $j$-th resource type, and $Bud_j$ is the budget of the $j$-th resource and the constraint on the optimization. The resource type can be computational cost (MACs), latency, energy, memory footprint, etc., or a combination of these metrics. The main terminologies and symbols used in this work are summarized in Table 1. For simplicity, we only consider one resource type in this work (i.e., $m$=1).
| Notation | Definitions or Descriptions |
|---|---|
| iteration | Cloud loops for different decayed resource budgets |
| round | Cloud loops for fusing the gradients from different clients |
| epoch | Client loops for training on local dataset each round |
| short-term fine-tune | Insufficient re-training of model candidates during neural architecture search without convergence |
| long-term fine-tune | Sufficient re-training at the end of whole search process |
| global model | Global model maintained by cloud that achieves best accuracy under certain resource budget |
| DNN candidate | DNN candidate that is simplified from a global model |
Contributors A typical federated setting assumes that a substantial number of distributed devices are available for training, e.g., tens of thousands (bonawitz2019towards). A device can be a smartphone, a tablet, or even an IoT gadget depending on the target scenario. Each device holds a small number of data samples locally and has limited hardware resources (e.g., computational capacity and network bandwidth).
3.2. Federating State-of-the-Art NAS Algorithm
Intuitively, any NAS algorithm can be leveraged to work on decentralized data. We base our approach on one state-of-the-art algorithm, NetAdapt (netadapt). Besides its superior performance as reported, it has another advantage: NetAdapt generates multiple DNN candidates for each iteration, and selects one of them based on their performance. Those candidates can be trained and tested in parallel without any dependency. Indeed, this fits well with the federated setting where lots of devices run independently.
Figure 2 shows the workflow of NetAdapt (only the left part of the figure) and its federated version (the whole figure). Generally speaking, NetAdapt iterates over monotonically decreasing resource budgets, for each of which it generates multiple compressed DNN candidates, fine-tunes each candidate, and picks the optimal one with the highest accuracy. Finally it performs a long-term fine-tune on the optimal models until convergence. To enable NetAdapt to run on decentralized data, i.e., under federated settings, we can simply replace the training (both short-term and long-term fine-tune) and testing parts with an FL-like process, in which a model is trained on available clients and fused on the cloud for many rounds. Section 4 will present more details of how FedNAS works.
3.3. Challenges and Key Optimizations
The major challenge of federated NAS is the heavy on-client computational and communication cost. Consequently, the end-to-end process of federated NAS can be excessively time-consuming. Taking the communication cost for short-term fine-tune as an example, the total uplink bandwidth usage can be roughly estimated as
$$\sum_{t}\sum_{k} Size(M_{t,k}) \cdot N_{t,k} \cdot R_{t,k}$$
where $Size(\cdot)$ calculates the model gradient size, $N_{t,k}$ is the number of clients involved in training candidate $M_{t,k}$, $R_{t,k}$ is the number of rounds for which it is trained, and $t$ and $k$ iterate over all resource budgets (depending on the developer configurations) and DNN candidates (depending on the model architecture), respectively. Communication cost is known to be a major bottleneck in federated learning (mcmahan2016communication), and it will be further amplified by the large number of DNN candidates and resource budgets to be explored during NAS.
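To make the rough estimate above concrete, the following minimal Python sketch sums gradient size times participating clients times rounds over all resource budgets and candidates. The function name `estimate_uplink_bytes`, the assumption that the client and round counts are the same for every candidate, and the numbers in the usage lines are all hypothetical and chosen only for illustration.

```python
def estimate_uplink_bytes(grad_sizes, clients_per_candidate, rounds):
    """Rough uplink volume for short-term fine-tune.

    grad_sizes[t][k]: gradient size (bytes) of candidate k at iteration t.
    clients_per_candidate: clients training each candidate per round
        (assumed constant here for simplicity).
    rounds: fusion rounds per candidate (assumed constant here).
    """
    total = 0
    for sizes_at_t in grad_sizes:      # iterate over resource budgets
        for size in sizes_at_t:        # iterate over DNN candidates
            total += size * clients_per_candidate * rounds
    return total


# Two iterations: two candidates in the first, one in the second.
sizes = [[10, 20], [5]]
full = estimate_uplink_bytes(sizes, clients_per_candidate=4, rounds=3)
half = estimate_uplink_bytes(sizes, clients_per_candidate=2, rounds=3)
print(full, half)
```

Under these assumptions the cost is linear in both the client count and the round number, which is why reducing either quantity pays off directly.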
We identify the key opportunity as the question of how sufficiently each DNN candidate shall be tuned during the search process. While some prior work realized that the tuning can be short-term without reaching convergence, it did not explore how long such a process needs to be. From our study, we find that the tuning of each DNN candidate can be parallel (across clients), dynamic (across time), and heterogeneous (across models). In the following, we introduce three key optimizations provided by FedNAS, where the first one reduces the number of clients involved in training each candidate, and the other two reduce the number of training rounds.
Training candidates on partial clients in parallel One opportunity for speeding up federated NAS comes from the huge number of client devices that can participate in the training process. In a common FL setting, a device is available when it is idle, charged, connected to an unmetered network (e.g., WiFi), and so on. As reported by Google (bonawitz2019towards), tens of thousands of devices are available for FL at the same time. However, only hundreds of devices can be efficiently utilized in parallel due to the limitation of the state-of-the-art gradient fusion algorithm (e.g., FedAvg). By training and testing DNN candidates on separate groups of clients, we not only reduce the average computational and communication cost of each candidate, but also scale out better with the large number of available clients. This motivates us to organize all clients into a two-level hierarchy for high parallelism (more details in Section 4.2).
Dynamic round number We studied the performance of different DNN candidates at different resource iterations. As illustrated in Figure 3, each line represents the accuracy (y-axis) of a candidate after different numbers of training rounds (x-axis), and the red dashed one is the optimal candidate to be picked, as it achieves the highest accuracy after all rounds of training are done. By comparing the two subfigures, we find that at early iterations the DNN candidates, especially those with higher accuracy, reach a stable condition much earlier than at later iterations. This is because our algorithm starts with a pre-trained model: as it proceeds, the impacts of insufficient tuning accumulate and the model parameters become more and more random. This insight motivates us to use dynamic round numbers, e.g., a smaller one for early iterations and increasingly larger ones in later stages. While the round number becomes larger, it is worth noting that the model complexity decreases as the algorithm proceeds, which makes the optimization quite effective in reducing the overhead on clients.
Early dropping out non-optimal candidates Figure 3 also shows that the optimal candidate (the red dashed line) quickly outperforms others within 13 rounds. This guides us to another optimization: dropping the other candidates early while keeping only the optimal one trained for more rounds. Note that although the optimal candidate has already been picked within several rounds, it still needs to go through more rounds of training. Otherwise the model accuracy will quickly drop to a very low level and thus mislead the candidate selection afterwards, as confirmed by our experiments in Section 5.4.
In the next section we introduce the details of our federated NAS framework, FedNAS, which incorporates the aforementioned optimization techniques.
4. The FedNAS System
Overview The pseudo code of the FedNAS workflow is shown in Algorithm 1. FedNAS maintains a model called the global model among cloud and clients, which starts with an expensive one and will be iteratively adapted until it meets the required resource budget. The network architecture of the initial global model is given by the developers, e.g., MobileNet. It can be either a pre-trained model or actively trained through federated learning as part of FedNAS. The goal of each iteration (line 2–11) is to adapt the global model to a smaller one through the cooperation between cloud and clients, i.e., one that fits a tightened resource budget. How much the constraint tightens at each iteration is a configurable step (a concept similar to a learning rate) that can vary from iteration to iteration. The algorithm terminates when the final resource budget is satisfied. FedNAS outputs the final adapted model and also generates a sequence of simplified models at intermediate iterations (i.e., the highest-accuracy network picked at each iteration) that form the efficient frontier of accuracy-to-resource trade-offs.
The iteration begins with generating a set of pruned models as candidates based on the current global model (Section 4.1). Each candidate will be scheduled to a group of clients (Section 4.2), on which the model will be repeatedly i) downloaded to each client within the group (line 14–24); ii) trained and tested on the local dataset of that client (line 34–38); iii) collected to the cloud and fused into a new model, for many rounds (line 27–32). During this process, all candidates except the optimal one will be dropped (line 25–26, Section 4.3). The picked candidate represents the most accurate model under the current resource budget and thus becomes the next global model (line 10). Finally, the cloud performs federated learning on the final global model, or on other intermediate global models as specified by developers, until convergence (line 12, Section 4.4).
4.1. Model Pruning
FedNAS adapts a global model based on standard pruning approaches. More specifically, FedNAS reduces the number of filters in a single CONV (convolutional) or FC (fully-connected) layer to meet the resource budget of the current iteration, as CONV and FC are known to be the computationally dominant layers in most NN architectures (deepcache). To choose which filters to prune, FedNAS computes the L2-norm magnitude of each filter, and the one with the smallest value will be pruned first. More advanced methods can be adopted to replace the magnitude-based method, such as removing the filters based on their joint influence on the feature maps (yang2017designing).
By adapting, FedNAS generates one pruned model candidate per CONV or FC layer, so the number of candidates equals the total number of CONV and FC layers, e.g., 14 for MobileNet. For larger models, we can also speed up the adaptation process by treating a group of multiple layers as a single unit (instead of a single layer), e.g., a residual block in ResNet (resnet).
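The magnitude-based pruning described above can be sketched in plain Python as follows. This is a minimal illustration, not the FedNAS implementation: filters are represented as flat lists of weights, and the helper names (`filter_l2_norm`, `prune_filters`, `generate_candidates`) as well as the `keep_counts` argument (how many filters each layer must keep to meet the budget, assumed to be computed elsewhere) are hypothetical.

```python
import math


def filter_l2_norm(filt):
    """L2 norm of one filter given as a flat list of weights."""
    return math.sqrt(sum(w * w for w in filt))


def prune_filters(filters, num_to_keep):
    """Keep the num_to_keep filters with the largest L2 norm, in original order."""
    ranked = sorted(range(len(filters)),
                    key=lambda i: filter_l2_norm(filters[i]),
                    reverse=True)
    keep = sorted(ranked[:num_to_keep])
    return [filters[i] for i in keep]


def generate_candidates(layers, keep_counts):
    """Build one candidate per layer: prune only that layer, copy the others."""
    candidates = []
    for idx, keep in enumerate(keep_counts):
        cand = [list(layer) for layer in layers]
        cand[idx] = prune_filters(layers[idx], keep)
        candidates.append(cand)
    return candidates


# A toy two-layer model: 3 filters in the first layer, 2 in the second.
layers = [[[3.0, 4.0], [0.1, 0.1], [1.0, 0.0]], [[0.5], [2.0]]]
candidates = generate_candidates(layers, keep_counts=[2, 1])
print(len(candidates))
```

Each candidate differs from the global model in exactly one layer, which is what lets the candidates be tuned and compared independently.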
4.2. Clients Partitioning & Scheduling
The goal of this stage is to partition the clients that are available for training into different groups. Each group contains one or multiple clients, and the number of groups is given by developers. The partition starts once the cloud determines which clients are available and their associated information (i.e., data number and data distribution, see below) has been uploaded. Since the availability of clients is dynamic depending on the user behavior and device status, the partition needs to be performed at each iteration. Each candidate will be scheduled to one group for training (i.e., short-term fine-tune) and testing.
How to partition A good partition follows two principles. First, the total data numbers of the groups shall be close and balanced. This is to ensure that each candidate is tuned and tested on enough data to make the results trustworthy, and also to ensure high parallelism without being bottlenecked by large groups. Second, the data distribution of each group shall be representative of the dataset from all clients. Since in federated settings the data owned by each client is often non-iid, a random partition may lead to groups with biased data and make the resultant accuracy non-representative. In such a case, our algorithm may choose the wrong candidate.
To formalize the two principles above, we denote a partition as $P = \{G_1, \dots, G_n\}$, and the total data number within $G_i$ as $|G_i|$, which is simply summed over the data number of all clients within $G_i$. The partition is then chosen as
$$\min_{P}\ \frac{1}{n}\sum_{i=1}^{n} Dist(G_i, G_{all}) \quad \text{subject to} \quad \frac{|G_i|}{|G_j|} \le r,\quad \forall i, j$$
Here, $Dist(\cdot,\cdot)$ calculates the distance between the data distributions of two groups, $G_{all}$ is an imaginary group including the data from all clients, and $r$ is a configurable variable that controls how much imbalance of data sizes across different groups FedNAS can tolerate (default: 1.1). This problem can be approximately solved by a greedy algorithm: first sort all clients by their data number, then iteratively dispatch the largest remaining one to a group so that the data size balance is maintained (i.e., the inequality) while the smallest average distribution distance is achieved.
For classification tasks, which are the focus of this work, FedNAS uses the normalized number of each class type to represent the data distribution, i.e., a vector $(p_1, \dots, p_C)$ where $p_c$ equals the ratio of data samples labeled with the $c$-th class type. The distribution distance is computed as the Manhattan distance between two such vectors. Note that the ratio of different class types can be considered less privacy-sensitive than the gradients, which need to be uploaded many times, so it shall not compromise the original privacy level of the federated setting. Nevertheless, the distribution vectors can be further encrypted through secure multiparty computation (lindell2005secure).
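The greedy partition and the Manhattan distribution distance described above could look roughly like the following sketch. It is one possible reading rather than the authors' algorithm: the function names are hypothetical, and the balance rule used during construction (a group may only grow while its size is within `r` times the currently smallest group) is an assumption standing in for the inequality constraint.

```python
def normalize(counts):
    """Turn per-class counts into per-class ratios."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def manhattan(p, q):
    """Manhattan distance between two distribution vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def greedy_partition(client_counts, num_groups, r=1.1):
    """client_counts: {client_id: [samples per class]} -> list of client-id lists."""
    num_classes = len(next(iter(client_counts.values())))
    global_counts = [sum(c[j] for c in client_counts.values())
                     for j in range(num_classes)]
    global_dist = normalize(global_counts)

    groups = [[] for _ in range(num_groups)]
    group_counts = [[0] * num_classes for _ in range(num_groups)]

    # Dispatch the largest clients first.
    order = sorted(client_counts,
                   key=lambda cid: sum(client_counts[cid]), reverse=True)
    for cid in order:
        sizes = [sum(gc) for gc in group_counts]
        smallest = min(sizes)
        # Assumed balance rule: only groups within r x the smallest may grow.
        allowed = [g for g in range(num_groups) if sizes[g] <= r * smallest]

        def dist_after(g):
            merged = [a + b for a, b in zip(group_counts[g], client_counts[cid])]
            return manhattan(normalize(merged), global_dist)

        best = min(allowed, key=dist_after)
        groups[best].append(cid)
        group_counts[best] = [a + b for a, b in
                              zip(group_counts[best], client_counts[cid])]
    return groups


# Four single-class clients, two groups: each group should mix both classes.
clients = {"a": [10, 0], "b": [0, 10], "c": [10, 0], "d": [0, 10]}
print(greedy_partition(clients, num_groups=2))
```

Because only the chosen group changes at each step, minimizing that group's distance to the global distribution is equivalent to minimizing the average distance over all groups at that step.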
How to schedule Each candidate will be scheduled to a random group for training and testing. If all groups are busy, the cloud will wait until one has finished and schedule the next candidate to this group.
As an important configuration to be set by the developers, the number of groups determines the trade-off between the quality of neural architecture selection and the computational cost imposed on client devices. A larger group number promises higher parallelism so that the NAS process can be faster, but also means that less training and testing data is provisioned to each candidate. Our experiments in Section 5 will dig into such trade-offs and provide useful insights to developers in determining a proper group number.
4.3. Candidate Dropping and Selection
Short-term fine-tune on decentralized data Each candidate will be trained and tested on the scheduled group for many rounds, similar to the methodology of federated learning. At each round, every client within the group downloads the newest version of the candidate, then trains (local-tune) and tests the model. The training and testing datasets are both split from the local dataset of the client. The local-tune takes multiple epochs to reduce round number and communication cost (mcmahan2016communication). The training and testing results, i.e., gradients and accuracy, together with the dataset size, will be uploaded to the cloud. The gradients will be fused to update the model candidate on the cloud, and the accuracy will be fused as the metric to pick the optimal model candidate after all rounds.
Guided by our findings in Section 3.3, FedNAS reduces the on-client computational and communication cost during the short-term fine-tune process through dynamic round numbers and early dropping of candidates. More specifically, FedNAS trains each candidate with more rounds as iterations go on. For each round, FedNAS collects the local accuracy from clients and fuses them into a weighted accuracy for each candidate. The ones with the largest accuracy degradation (defined below) will be dropped and no longer tuned. For the remaining candidates, FedNAS collects the gradients from clients and fuses them into a new version of the candidate. As rounds go on, fewer and fewer candidates need to be tuned and tested. Note that the accuracy is fused first so that the gradients of the candidates dropped at this round do not need to be uploaded.
The goal of this short-term fine-tune is to regain the accuracy of each candidate. This step is important while adapting small networks with a large resource reduction because otherwise the accuracy will drop to zero, which can cause FedNAS to choose the wrong model candidate. One main difference between this stage and a standard FL process is that this stage takes a relatively small number of iterations (i.e., short-term) without requiring the model to converge.
Accuracy fusion and comparison The accuracy generated by each client will be uploaded to the cloud. For a given candidate and its scheduled group $G$, once the cloud receives the accuracy of all clients within $G$, it combines them into a new one, weighted by the testing data number on each client:
$$Acc = \frac{\sum_{i=1}^{n} d_i \cdot acc_i}{\sum_{i=1}^{n} d_i}$$
where $d_i$ and $acc_i$ are the testing data number and testing accuracy reported by the $i$-th client of group $G$, respectively, and $n$ is the client number of the group.
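With the accuracy of all model candidates computed at each round, FedNAS drops the models with the largest accuracy degradation. Since each candidate may have a different resource consumption (Section 4.1), we use the ratio of the accuracy degradation to the resource consumption reduction relative to the previous global model (i.e., the unpruned model at the beginning of this iteration):
$$\frac{Acc(M_{prev}) - Acc(M_{cand})}{Res(M_{prev}) - Res(M_{cand})}$$
where $M_{prev}$ denotes the previous global model and $M_{cand}$ the candidate.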
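The accuracy fusion and the degradation-based dropping described above can be sketched as follows. The function names and the integer `num_drop` parameter are hypothetical, and the rule of always keeping at least one candidate is an assumption added so that a winner always remains.

```python
def fuse_accuracy(reports):
    """reports: list of (num_test_samples, accuracy) from one group's clients."""
    total = sum(n for n, _ in reports)
    return sum(n * acc for n, acc in reports) / total


def degradation_ratio(prev_acc, prev_res, cand_acc, cand_res):
    """Accuracy lost per unit of resource saved, relative to the unpruned model."""
    return (prev_acc - cand_acc) / (prev_res - cand_res)


def drop_worst(ratios, num_drop):
    """ratios: {candidate_id: ratio}. Drop the num_drop candidates with the
    largest degradation ratio, always keeping at least one candidate."""
    num_drop = min(num_drop, len(ratios) - 1)
    ranked = sorted(ratios, key=ratios.get)   # smallest degradation first
    return ranked[:len(ranked) - num_drop]


# Two clients report accuracy on 100 and 300 test samples.
fused = fuse_accuracy([(100, 0.8), (300, 0.6)])
ratio = degradation_ratio(prev_acc=0.70, prev_res=100.0,
                          cand_acc=fused, cand_res=95.0)
survivors = drop_worst({"a": 0.001, "b": 0.01, "c": 0.004}, num_drop=1)
print(fused, ratio, survivors)
```

Normalizing by the resource reduction keeps a candidate that saves more resources from being penalized for a proportionally larger accuracy loss.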
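Model fusion For a given candidate and its scheduled group $G$, FedNAS fuses the gradients from all clients within $G$, weighted by the training data numbers used in local-tune:
$$g = \frac{\sum_{i=1}^{n} s_i \cdot g_i}{\sum_{i=1}^{n} s_i}$$
where $s_i$ and $g_i$ are the training data number and gradients uploaded from the $i$-th client of group $G$, respectively, and $n$ is the number of clients in the group.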
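A minimal sketch of this weighted gradient fusion, assuming each gradient is flattened into a list of floats (the name `fuse_gradients` and the data layout are illustrative choices, not taken from the FedNAS code):

```python
def fuse_gradients(updates):
    """updates: list of (num_train_samples, gradient) pairs from one group,
    where each gradient is a flat list of floats of equal length."""
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    fused = [0.0] * dim
    for n, grad in updates:
        weight = n / total            # clients with more data count more
        for j in range(dim):
            fused[j] += weight * grad[j]
    return fused


# A client with 1 sample and a client with 3 samples.
print(fuse_gradients([(1, [1.0, 2.0]), (3, [5.0, 6.0])]))
```

This is the same data-size weighting used by FedAvg-style aggregation, applied here per candidate and per group.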
Once FedNAS finishes the model search process above, a sequence of models has been generated, i.e., the global model picked at each iteration. As the final stage, FedNAS performs standard federated learning on the final model or on other intermediate models (called FL-tune) if needed by the developer. The goal of this stage is to make the obtained models converge. FedNAS can utilize any existing FL algorithm to run FL-tune, and currently it uses one of the state-of-the-art algorithms (mcmahan2016communication). When multiple models are demanded, FedNAS can still utilize the partitioned clients to train them in parallel.
In our experiments, we mainly evaluate three aspects of performance: 1) Section 5.2: does FedNAS generate high-accuracy models under different resource budgets? 2) Section 5.3: what is the computational and communication cost of FedNAS on clients? 3) Section 5.4: what are the impacts of the key designs of FedNAS?
5.1. Experiment Settings
| Dataset | Model | Task | Client number | Data per client |
|---|---|---|---|---|
| ImageNet (iid) | MobileNet (13 CONV, 1 FC) | Image classification | 1,500 | 915.0 |
| Celeba (non-iid) | Simplified AlexNet (6 CONV, 1 FC) | Face attrs classification | 9,343 | 21.4 |
Datasets As shown in Table 2, we tested FedNAS on 2 datasets commonly used for federated learning experiments: ImageNet (imagenet) (iid) and Celeba (celeba) (non-iid). For ImageNet, we randomly split it into 1,500 clients. For Celeba, we split it into 9,343 clients based on the identities of face images. We re-used the scripts of LEAF (leaf), a popular federated learning framework, to pre-process Celeba data and generate non-iid data. Each Celeba image is tagged with 40 binary attributes. We randomly select 3 of them (Smiling, Male, Mouth_Slightly_Open) and combine the 3 features into a classification task with 8 classes. The dataset on each client was further split to three parts: training set used for short-term fine-tune, validation set used to test the accuracy of DNN candidates, testing set used to evaluate the final accuracy of each simplified model (6:2:2).
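As an illustration of the Celeba setup above, the sketch below combines three binary attributes into one of 8 classes and splits a client's samples 6:2:2. The bit ordering of the attributes and the choice to split by position (rather than randomly) are assumptions made here for simplicity; the function names are hypothetical.

```python
def combine_attributes(smiling, male, mouth_slightly_open):
    """Map three binary attributes (0 or 1) to a class index in 0..7.
    The bit order is an arbitrary choice for this sketch."""
    return (smiling << 2) | (male << 1) | mouth_slightly_open


def split_622(samples):
    """Split a client's samples into train / validation / test at about 6:2:2."""
    n = len(samples)
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test


label = combine_attributes(smiling=1, male=0, mouth_slightly_open=1)
train, val, test = split_622(list(range(10)))
print(label, len(train), len(val), len(test))
```

Integer arithmetic is used for the split sizes so that small client datasets do not suffer from floating-point rounding.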
Models We applied FedNAS to two models: MobileNet (mobilenet) (for ImageNet, 224x224 input size), a widely used CNN for mobile applications, and a simplified AlexNet, which we call ConvNet (for Celeba, 128x128 input size), with sequential CONV, pooling, and final FC layers. We did not apply FedNAS to larger networks like ResNet or VGG because small and compact networks are more difficult to simplify; these large networks are also seldom deployed on mobile platforms.
Resource type We mainly used multiply-accumulate operations (MACs) as the metric to specify resource budgets. For MobileNet, we reduce the resource budget by 5% at each iteration with 0.98 decay. For ConvNet, we reduce the resource budget by 5% at each iteration with 0.93 decay.
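The budget schedule can be sketched as below, assuming that "5% at each iteration with 0.98 decay" means the first reduction equals 5% of the initial resource and every later reduction is 0.98 times the previous one (the convention used by NetAdapt). This reading, the function name, and the `max_iters` guard are assumptions of this sketch.

```python
def budget_schedule(initial_res, target_res,
                    init_reduction=0.05, decay=0.98, max_iters=1000):
    """Return the per-iteration resource budgets until target_res is reached."""
    budgets = []
    current = initial_res
    step = init_reduction * initial_res
    for _ in range(max_iters):        # guard: decaying steps may never reach the target
        if current <= target_res:
            break
        current = max(target_res, current - step)
        budgets.append(current)
        step *= decay                 # the reduction shrinks every iteration
    return budgets


# Budgets (in arbitrary MAC units) when shrinking a model from 100 to 90.
print(budget_schedule(100.0, 90.0))
```

With a decay below 1 the total achievable reduction is bounded, so a very low target may not be reachable; the guard simply stops the loop in that case.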
Alternatives We compare FedNAS with two state-of-the-art automatic network simplification approaches. Note that both of them are performed on centralized data.
NetAdapt (netadapt) is the basis of FedNAS. We directly reused its open-source code and kept the original parameter settings.
Multipliers (mobilenet) are simple but effective approaches to simplify networks. We used Width Multiplier to scale the number of filters by a percentage across all CONV and FC layers.
Hardware of Cloud and Clients All experiments were carried out on a high-end server with 12 Tesla P100 GPUs. To simulate the client-side computation cost, we used DL4J (DL4J) to obtain the training speed of MobileNet and ConvNet, as well as of each pruned DNN candidate, on a Samsung Note 10. The training speed is then plugged into our experiment platform as a way to simulate the on-client computation cost. The communication cost is also simulated by recording the data transmission between the cloud process and the client processes.
5.2. Analysis of Accuracy
Figure 4 shows the comparison of the models generated by FedNAS and the alternatives. Overall, FedNAS achieves similar performance to NetAdapt, and both of them significantly outperform Multipliers. Note that FedNAS trains models on decentralized data with much better user privacy. On Celeba, the model generated by FedNAS is up to 2.5× less complex (measured in MACs) with similar accuracy, or has 1.9% higher accuracy with the same complexity, compared to Multipliers. On ImageNet, the model generated by FedNAS is 1.5× less complex with 0.8% higher accuracy compared to Multipliers.
On ImageNet, we notice a performance gap between FedNAS and NetAdapt around 2% when MobileNet is simplified by more than 70%. This is because in our current default setting, the short-term fine-tune is conservative to keep the client cost low, so that sometimes the candidate is not sufficiently trained thus misleading the model selection. As we will show later, by varying the system configurations (e.g., round number and group number), the accuracy of FedNAS can be further improved to be closer to NetAdapt.
We then studied how the network architectures look like when adapting MobileNet to 50% MACs on ImageNet using different approaches. As illustrated in Figure 5, FedNAS generates similar network architecture as NetAdapt but different from Multipliers. This well explains the performance similarity/gap between FedNAS and the alternatives shown above.
5.3. Analysis of Client Cost
We studied the improvements and trade-offs brought by the key optimizations introduced in Section 3.3. By default, we drop 33% of the candidates after each round, so all non-optimal candidates are dropped after 3 rounds. The dynamic round numbers used are (1-5 iters: 5 rounds; 7-10: 10; 11-15: 15; >15: 20) for ImageNet and (1-5 iters: 2 rounds; 6-10: 5; 11-15: 8; >15: 10) for Celeba. The group numbers for ImageNet and Celeba are 15 and 20, respectively. These settings are consistent with the accuracy experiments in Figure 4. Here we report only the on-client cost of short-term fine-tuning during model search, excluding the cost of long-term fine-tuning at the last step and of any federated learning for the initial model. This is because short-term fine-tuning is often more computationally intensive, while the latter costs depend on further user specifications, e.g., which models are needed for deployment.
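The default settings above can be written down as a small lookup plus a dropping rule. The sketch below uses the Celeba round schedule quoted in the text; the function names are hypothetical, and the dropping rule assumes that a fixed share of the initial candidate pool is removed after every round while the best candidate is always kept.

```python
import math

# Celeba schedule from the text: (last iteration of the range, rounds).
CELEBA_ROUNDS = [(5, 2), (10, 5), (15, 8), (math.inf, 10)]


def rounds_for_iteration(iteration, schedule=CELEBA_ROUNDS):
    """Look up the number of fine-tuning rounds for a 1-based iteration."""
    for last_iteration, rounds in schedule:
        if iteration <= last_iteration:
            return rounds
    raise ValueError("schedule does not cover this iteration")


def survivors_per_round(num_candidates, drop_percent=33, num_rounds=3):
    """Candidates still alive after each round; the best one is always kept."""
    per_round = math.ceil(num_candidates * drop_percent / 100)
    alive = num_candidates
    history = []
    for _ in range(num_rounds):
        alive = max(1, alive - per_round)
        history.append(alive)
    return history


print([rounds_for_iteration(i) for i in (1, 6, 11, 16)])
print(survivors_per_round(9))
```

With nine candidates, this rule leaves 6, then 3, then 1 candidate, matching the statement that only the selected candidate remains after three rounds.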
Overall improvements As shown in Figure 6, all three techniques significantly reduce the on-client cost, both computational and communication. In a naive design of federated NAS with all optimizations disabled, the communication and computational costs are 277× and 281× higher on ImageNet, and 161× and 162× higher on Celeba, respectively. With a single technique disabled (dynamic round number / early dropping of candidates / group hierarchy), the cost can be up to 3.2× / 2.2× / 19.7× higher. We observe that the first two optimizations, which aim to reduce the round number, are more effective on ImageNet. This is because the ImageNet task is more complex than Celeba, so the model requires more short-term fine-tuning (more rounds), which leaves more headroom for optimization.
Note that, according to the experiments, our optimizations with the default settings have almost no effect on model accuracy. In fact, Figure 4 shows that FedNAS already achieves the accuracy upper bound defined by NetAdapt. Next, we study the trade-offs between accuracy and cost for two optimizations (early dropping of candidates and group hierarchy) by varying the default settings.
|Model||Drop ratio each round||Top-1 Accuracy (%)||Avg uplink cost per client (MB)|
|50% ConvNet||0% (no drop)||83.8 (0.0)||59.7 (0.0)|
|50% ConvNet||33% (default)||83.8 (0.0)||12.8 (-79%)|
|50% ConvNet||50%||83.5 (-0.3)||10.7 (-82%)|
|50% ConvNet||100%||82.1 (-1.7)||8.5 (-85%)|
|25% ConvNet||0% (no drop)||82.3 (0.0)||138.0 (0.0)|
|25% ConvNet||33% (default)||82.3 (0.0)||29.6 (-77%)|
|25% ConvNet||50%||81.6 (-0.7)||24.6 (-81%)|
|25% ConvNet||100%||77.8 (-4.5)||19.7 (-88%)|
|15% ConvNet||0% (no drop)||78.4 (0.0)||209.6 (0.0)|
|15% ConvNet||33% (default)||78.2 (-0.2)||44.9 (-79%)|
|15% ConvNet||50%||74.1 (-4.1)||37.4 (-81%)|
|15% ConvNet||100%||47.1 (-31.3)||30.0 (-86%)|
Trade-offs from drop round Table 3 shows the trade-offs from the timing of dropping the non-optimal candidates. The results show that by dropping 33% of the candidates at each round, FedNAS reduces the uplink cost by 77% to 79% with very little accuracy loss (at most 0.2%). More aggressive early dropping reduces the uplink cost further but sacrifices much more model accuracy. In the extreme case where all non-optimal candidates are dropped immediately, before the first round of model fusion (100% drop ratio), the model accuracy degrades by 31.3% when simplifying ConvNet to 15% complexity. The reason is that with insufficient training (few rounds), the accuracy of a candidate is not yet representative of the real performance of the corresponding network architecture, which leads FedNAS to pick the wrong candidate. The impact of such wrong picks accumulates over subsequent iterations.
|Model||Group number||Top-1 Accuracy (%)||Avg uplink cost per client (MB)|
|75% MobileNet||14 (default)||68.8 (0.0)||70.3 (0.0)|
|75% MobileNet||7||68.9 (+0.1)||140.7 (+100%)|
|75% MobileNet||28||68.6 (-0.2)||35.2 (-50%)|
|75% MobileNet||100||68.5 (-0.3)||9.8 (-86%)|
|50% MobileNet||14 (default)||67.4 (0.0)||218.1 (0.0)|
|50% MobileNet||7||67.6 (+0.4)||436.2 (+100%)|
|50% MobileNet||28||66.3 (-1.1)||109.0 (-50%)|
|50% MobileNet||100||66.0 (-1.4)||30.5 (-86%)|
Trade-offs from group number In essence, the group number determines how many clients and how much data are involved in training each model candidate. As shown in Table 4, with a smaller group number (7) on ImageNet, FedNAS’s accuracy does not improve much (up to 0.4%) compared to our default setting (14), but the client cost is much higher (e.g., 2× more uplink traffic). This confirms our observation in Section 3.3 that training and testing each model candidate requires only a subset of the clients and data. With a larger group number of 28, the accuracy drops by 1.1% when adapted to 50% complexity, but the uplink cost is also reduced by 50%. An even larger group number (100) reduces the cost by 86%, but the accuracy degradation increases to as much as 1.4%. In short, the group number provides a rich trade-off between the accuracy of the generated model and the on-client cost. Note, however, that once the group number exceeds the candidate number, increasing it further does not reduce the end-to-end architecture search time because of the dependency between sequential iterations. Due to the limitations of current federated learning platforms, we do not evaluate this neural architecture search time here and leave it as future work.
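The uplink costs reported in Table 4 are consistent with the per-client cost being roughly inversely proportional to the group number. The sketch below checks this relation against the default-setting costs from the table. The inverse-proportional rule is an assumption fitted to the reported numbers, not a formula stated in the text, and the function name is hypothetical.

```python
def scaled_uplink_mb(default_cost_mb, groups, default_groups=14):
    """Estimate uplink cost per client, assuming cost ~ 1 / group number."""
    return default_cost_mb * default_groups / groups


# Default-setting costs reported in Table 4 (MB per client).
for model, base in (("75% MobileNet", 70.3), ("50% MobileNet", 218.1)):
    for groups in (7, 28, 100):
        print(model, groups, round(scaled_uplink_mb(base, groups), 1))
```

Under this assumption, halving the group number doubles the estimated cost and moving from 14 to 100 groups cuts it by about 86%, in line with the relative changes listed in the table up to rounding.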
5.4. Ablation Studies
Impact of short-term fine-tuning Figure 7 shows the model accuracy with different round numbers (without long-term fine-tuning). In the extreme case of zero rounds, i.e., all candidates except the selected one are dropped without model fusion, the accuracy rapidly drops to almost random guessing. In this case, the algorithm picks the best candidate solely based on noise and thus performs poorly, and long-term fine-tuning cannot recover the accuracy because the model architecture is inferior. With a moderately smaller round number (e.g., 5 and 10), the model accuracy is largely preserved but is still lower than with the default setting. This demonstrates that although a small round number is often enough to pick the optimal candidate at the current iteration (the motivation for the early dropping optimization), more rounds are still needed to re-train the picked model before entering the next iteration. Otherwise, the pruning direction at later iterations will be misled.
Impact of long-term fine-tuning Figure 8 illustrates the importance of performing long-term fine-tuning using federated learning after the global models have been generated. It shows that short-term fine-tuning preserves the accuracy well at the beginning, but the accuracy drops faster as iterations go on because insufficient training accumulates. Long-term fine-tuning can increase the accuracy by up to another 20% at later stages. Although the raw accuracy drops faster at later iterations, FedNAS is still able to pick a good candidate and thus maintains performance close to NetAdapt, as shown above. Nevertheless, this shows that training under the default setting could be further improved by adding more rounds.
In this work, we have presented a novel framework, FedNAS, which can automatically generate neural architectures with training data decentralized across a large number of clients. To deal with the heavy cost of on-client computation and communication, FedNAS identifies insufficient candidate tuning as the key opportunity by looking into the intrinsic characteristics of NAS, and incorporates three key optimizations: parallel model tuning, dynamic training, and early dropping of candidates. Tested on both iid and non-iid datasets, FedNAS is able to generate neural networks with accuracy similar to that obtained by training on centralized data, with tolerable computational and communication cost on clients.