In the past decade, Deep Convolutional Neural Networks (DCNNs) have shown powerful representation and learning capabilities [Goodfellow:2016:DL:3086952, 0483bd9444a348c8b59d54a190839ec9], and have achieved unprecedented success in many commercial applications, such as computer vision [Krizhevsky:2012:ICD:2999134.2999257, DBLP:journals/corr/HeZRS15], natural language processing [mikolov2013efficient, DBLP:journals/corr/abs-1810-04805, DBLP:journals/corr/BojanowskiGJM16], and speech recognition [DBLP:journals/corr/HannunCCCDEPSSCN14, DBLP:journals/corr/ZhangCJ16, DBLP:journals/corr/OordDZSVGKSK16]. As with traditional machine learning algorithms, the success of deep neural networks is partially driven by the availability of big data. In the real world, however, with the exception of a few industries, most fields have only limited or poor-quality data. Worse still, due to industry competition, privacy and security concerns, and complicated administrative procedures, it is almost impossible to train centralized machine learning models by integrating the data scattered across countries and institutions [DBLP:journals/corr/abs-1902-04885].
At the same time, with increasing awareness of data privacy, the emphasis on data privacy and security has become a major issue worldwide. News about data leaks causes great concern among the public, the media, and governments. In response, countries across the world are strengthening laws protecting data security and privacy. An example is the General Data Protection Regulation (GDPR) [Voigt:2017:EGD:3152676] enforced by the European Union on May 25, 2018. Similar privacy and security acts are being enacted in the US and China.
To decouple the need for machine learning from the need for storing large data in a central database, a new machine learning framework called federated learning was proposed [DBLP:journals/corr/McMahanMRA16]. Federated learning provides a promising approach for model training without compromising data privacy and security [DBLP:journals/corr/abs-1902-01046, DBLP:journals/corr/McMahanMRA16, DBLP:journals/corr/KonecnyMYRSB16]. Unlike typical centralized training procedures, federated learning has the clients collaboratively learn a shared model, each using the training data on its own device and keeping that data local. In this scenario, model parameters (or gradients) are transmitted between the server side and the clients in each round. Due to this frequent exchange of data, coupled with network unreliability and connection instability, communication cost is the principal constraint restricting the performance of federated learning.
In this paper, we propose a new compression strategy called Residual Pooling Network (RPN) to address the communication cost problem through parameter selection and approximation. As we will show, our approach significantly reduces the quantity of data transmitted while still maintaining high-level model performance. More importantly, our approach is an end-to-end process, which makes it easy to deploy in real-world applications without human intervention.
Contributions: Our main contributions in this paper are as follows:
We propose a practical and promising approach to improve communication efficiency. Our approach reduces not only the amount of data uploaded but also the amount of data downloaded.
We propose a general solution for CNN-based model compression under the federated learning framework.
Unlike parameter-encryption based approaches, the proposed parameter approximation and parameter selection methods keep the model secure without compromising communication efficiency, and are suitable for deployment in large-scale systems.
The rest of this paper is organized as follows: We first review the related works on federated learning and current communication efficiency strategies. Then we introduce our approach in detail. After that, we present the experimental results, followed by the conclusion and a discussion of future work.
2 Related Work
2.1 Federated Learning
In the traditional machine learning approach, data collected by different clients (IoT devices, smartphones, etc.) is uploaded and processed centrally in a cloud-based server or data center. However, due to data privacy and security concerns, sending raw data to the central cloud is regarded as unsafe and may violate the General Data Protection Regulation (GDPR) [Voigt:2017:EGD:3152676]. To decouple the need for machine learning from the need for storing large data in a central database, a new machine learning framework called federated learning was proposed; a typical federated learning framework is shown in Figure 1.
In the federated learning scenario, each client updates its local model based on its local data and then sends the updated model parameters to the server side for aggregation; these steps are repeated over multiple rounds until the learning process converges. Since the training data remains on personal devices during the training process, data security can be guaranteed.
In general, the standard federated learning training process includes the following three steps [DBLP:journals/corr/abs-1902-04885, lim2019federated]:
Local Model Training: Denote by $t$ the current iteration round and by $K$ the number of clients. $M_k^t$ is the initial model of this round for client $k$, parameterized by $W_k^t$, that is to say $M_k^t = M(W_k^t)$; $D_k$ is the local dataset of client $k$. Based on $D_k$, each client respectively updates its local model parameters from $W_k^t$ to $W_k^{t+1}$; the updated local model parameters are then sent to the server side.
Global Aggregation: The server side aggregates the local model parameters of all clients and computes the global model. We can formalize the aggregation as follows:

$$ W_G^{t+1} = \mathcal{A}\left( W_1^{t+1}, \dots, W_K^{t+1} \right) \qquad (1) $$
where $\mathcal{A}$ is the aggregation operator. Sensible choices of aggregation operator include FedAvg [DBLP:journals/corr/McMahanMRA16], Byzantine Gradient Descent [DBLP:journals/corr/ChenSX17], Secure Aggregation [Bonawitz:2017:PSA:3133956.3133982], and automated update of weights.
Update Local Model: When the aggregation is complete, the server sends the global model parameters back to the clients, which replace their local models accordingly.
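The three steps above can be sketched end to end in a few lines. The following is a minimal illustration, not the paper's implementation: the "model" is a linear least-squares fit standing in for a real neural network, and the helper names (`local_update`, `fed_avg`) are ours.

```python
import numpy as np

def local_update(global_params, local_data, lr=0.1):
    """Step 1: starting from the global parameters, a client runs one
    gradient-descent step on its local data (here: a least-squares loss)."""
    w = global_params.copy()
    X, y = local_data
    grad = X.T @ (X @ w - y) / len(y)  # gradient of the MSE loss
    return w - lr * grad

def fed_avg(client_params, client_sizes):
    """Step 2: FedAvg-style aggregation -- a dataset-size-weighted average
    of the clients' updated parameters."""
    total = sum(client_sizes)
    return sum(n / total * w for w, n in zip(client_params, client_sizes))

# Step 3 is the broadcast: the aggregated parameters become every client's
# starting point for the next round.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w))

w_global = np.zeros(2)
for _ in range(200):  # communication rounds
    updates = [local_update(w_global, d) for d in clients]
    w_global = fed_avg(updates, [len(d[1]) for d in clients])
```

After enough rounds, `w_global` approaches the true parameters even though no client's raw data ever leaves its "device".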
2.2 Communication Costs
In the standard federated learning training procedure, all model parameters are exchanged between the server side and the clients. For complicated deep learning models such as CNNs, each update may comprise millions of parameters; coupled with network unreliability and connection instability, this makes communication cost a significant bottleneck, so we need to improve the communication efficiency of federated learning. The total number of bits that has to be transmitted during model training is given by:

$$ b_{total} = \sum_{t=1}^{T} \sum_{k=1}^{K_t} \left( b_{down}^{t,k} + b_{up}^{t,k} \right) \qquad (2) $$

where $T$ is the total number of iterations between the server side and the clients, $K_t$ is the number of clients selected by the server to update at round $t$, $M_G^t$ denotes the global model after $t$ aggregations, $b_{down}^{t,k}$ is the number of selected parameter bits downloaded to client $k$, and similarly $b_{up}^{t,k}$ is the number of selected parameter bits client $k$ uploads to the server. Using equation 2 as a reference, we can classify the current research topics on communication efficiency into the following four aspects:
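As a concrete reading of this accounting (the helper name and data layout are ours, chosen for illustration), the total is just a double sum over rounds and selected clients:

```python
def total_bits(rounds):
    """Eq. 2 as code: for every round and every client selected in that
    round, add the bits downloaded to the client plus the bits it uploads.
    `rounds` is a list of per-round lists of (bits_down, bits_up) pairs."""
    return sum(down + up for per_round in rounds for down, up in per_round)

MB = 8 * 2**20  # bits in one megabyte
# two rounds, two selected clients, a symmetric 1 MB exchange each way:
cost = total_bits([[(MB, MB), (MB, MB)],
                   [(MB, MB), (MB, MB)]])
```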
Iteration frequency: One feasible way to improve communication efficiency is to reduce the number of communication rounds ($T$ in equation 2) between the server side and the clients. McMahan et al. [DBLP:journals/corr/McMahanMRA16] proposed an iterative model averaging algorithm called Federated Averaging (FedAvg), and pointed out that each client can iterate the local SGD update multiple times before the averaging step, thus reducing the total number of communication rounds. Their experiments showed a 10x to 100x reduction in required communication rounds as compared to FedSGD.
Pruning: Another research topic is to use model compression algorithms to reduce data transmission ($b_{up}$ and $b_{down}$ in equation 2); this technique is commonly used in distributed learning [lim2019federated, wang2018atomo]. A naive implementation requires each client to send the full model parameters back to the server in each round, and vice versa. Inspired by deep learning model compression algorithms [DBLP:journals/corr/HanMD15], distributed model compression approaches have been widely studied. Strom et al. [strom2015scalable, tsuzuku2018variance] proposed an approach in which only gradients with magnitude greater than a given threshold are sent to the server; however, the threshold is difficult to define because different tasks require different configurations. In a follow-up work, Aji et al. [aji2017sparse] fixed the transmission ratio p and communicate only the fraction p of entries of each gradient with the largest magnitude.
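A fixed-ratio scheme in the spirit of Aji et al. can be sketched as follows (a simplified illustration; the function name is ours, and the residual accumulation used in the original work is omitted):

```python
import numpy as np

def sparsify_top_p(grad, p=0.01):
    """Keep only the fraction p of gradient entries with the largest
    magnitude; everything else is zeroed and never transmitted."""
    flat = grad.ravel()
    k = max(1, int(len(flat) * p))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of top-k |values|
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

g = np.array([[0.01, -3.0],
              [0.50, 0.002]])
s = sparsify_top_p(g, p=0.25)  # transmit only 1 of the 4 entries
```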
Konecný et al. [DBLP:journals/corr/KonecnyMYRSB16] proposed low-rank approximation and subsampling of the parameter matrix to reduce data transmission. Caldas et al. [DBLP:journals/corr/abs-1812-07210] extended the studies in [DBLP:journals/corr/KonecnyMYRSB16] by proposing lossy compression and federated dropout to reduce server-to-participant communication costs.
Importance-based Update: This strategy involves selective communication such that only the important or relevant updates are transmitted in each communication round. The authors in
propose the edge Stochastic Gradient Descent (eSGD) algorithm, which selects only a small fraction of important gradients to be communicated to the FL server for parameter update during each communication round. The authors in [CMFL2019] propose the Communication-Mitigated Federated Learning (CMFL) algorithm, which uploads only relevant local updates to reduce communication costs while guaranteeing global convergence.
Quantization: Quantization is also an important compression technique. Konecný et al. [DBLP:journals/corr/KonecnyMYRSB16] proposed probabilistic quantization, which reduces each scalar to one bit (relative to the maximum and the minimum value). Bernstein et al. [DBLP:journals/corr/abs-1802-04434, DBLP:journals/corr/abs-1810-05291] proposed signSGD, which quantizes each gradient update to its binary sign, thus reducing the size by a factor of 32. Sattler et al. [DBLP:journals/corr/abs-1903-02891] proposed a new compression technique called Sparse Ternary Compression (STC), which is suitable for the non-iid condition.
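The sign-quantization idea can be illustrated in a few lines (a sketch under our own naming, paired with the elementwise majority-vote aggregation described in the signSGD line of work):

```python
import numpy as np

def sign_quantize(grad):
    """Transmit only the sign of each gradient entry: 1 bit instead of a
    32-bit float per coordinate."""
    return np.sign(grad).astype(np.int8)

def majority_vote(sign_grads):
    """Server side: aggregate the clients' sign vectors by elementwise
    majority vote."""
    return np.sign(np.sum(sign_grads, axis=0)).astype(np.int8)

g1 = np.array([0.3, -1.2, 0.0001])
g2 = np.array([0.1, -0.5, -2.0])
g3 = np.array([-0.7, -0.1, 4.0])
agg = majority_vote([sign_quantize(g) for g in (g1, g2, g3)])
```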
The aforementioned approaches are all feasible strategies for improving communication efficiency during the federated learning training procedure: some reduce the amount of data transferred, while others reduce the number of iterations. However, all of these solutions either require manual hyper-parameter settings or break the original process into multiple steps, and therefore cannot implement an end-to-end workflow.
In this section, we formalize our problem definition and show how to use the Residual Pooling Network (RPN) to improve communication efficiency under the federated learning framework; our new federated learning framework is shown in Figure 3.
3.1 Symbol Definition
Suppose that our federated learning system consists of $K$ clients and one central server, as illustrated in Figure 1. $C_k$ denotes the $k$-th client, each of which has its own dataset, indicated by $D_k$ respectively. $M_G^t$ represents the global model after $t$ aggregations and is parameterized by $W_G^t$, that is to say:

$$ M_G^t = M(W_G^t) $$
For simplicity, and without loss of generality, suppose we have completed $t$ aggregations and are currently training on round $t+1$; client $k$ needs to update its local model from $M_k^t$ to $M_k^{t+1}$ based on the local dataset $D_k$. The local model objective function is as follows:

$$ F_k(w) = \frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} \ell(w; x_i, y_i) \qquad (4) $$

The goal of client $k$ in iteration $t+1$ is to find the optimal parameters $W_k^{t+1} = \arg\min_{w} F_k(w)$ that minimize the loss function of equation 4.
Let $R_k^{t+1}$ represent the residual network, which is given by:

$$ R_k^{t+1} = W_k^{t+1} - W_k^{t} $$

Executing spatial mean pooling on $R_k^{t+1}$ collapses each $n \times n$ kernel to a single value, yielding the compressed residual $\tilde{R}_k^{t+1}$.
3.2 Residual Pooling Network
Standard federated learning requires exchanging all parameters between the server side and the clients. However, much of the literature [strom2015scalable, tsuzuku2018variance, DBLP:journals/corr/HanMD15] has shown that not all parameters are sensitive to model updates in distributed model training. Under this assumption, we propose the Residual Pooling Network, which compresses data transmission through parameter approximation and parameter selection. RPN consists of the following two components:
Residual Network: The difference between the model parameters before and after the model update is called the residual network. The residual network is used to capture and evaluate the contribution of different filters: the sum of a kernel's elements is regarded as that filter's contribution, and only filters whose magnitude is greater than a given threshold are sent to the server side.
The purpose of the residual network is to capture the changes in parameters during training. The authors in [DBLP:journals/corr/HeZR014] showed that different filters respond differently to different image semantics. Given the different data distributions of federated clients, each client captures different semantics from its local dataset $D_k$; as a result, the per-layer changes of each local model differ, which makes it unnecessary to upload all model parameters.
Spatial Pooling: Spatial pooling is used to further compress the model parameters. After selecting the highly sensitive filters through the residual network, we use spatial mean pooling to compress each filter from size $n \times n$ to $1 \times 1$, as shown in Figure 2.
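The two components can be sketched for a single convolutional layer as follows (a minimal illustration with our own names; for simplicity the filter-selection threshold is not shown, i.e. every filter is kept):

```python
import numpy as np

def rpn_compress(w_before, w_after):
    """Residual network + spatial pooling for one conv layer.
    w_before, w_after: kernel tensors of shape (n_out, n_in, k, k) taken
    before and after local training. The residual captures the parameter
    change; mean pooling over the two spatial axes then collapses every
    k x k kernel to one scalar, a k*k-fold reduction."""
    residual = w_after - w_before          # the "residual network"
    return residual.mean(axis=(2, 3))      # spatial mean pooling -> (n_out, n_in)

rng = np.random.default_rng(1)
w_before = rng.normal(size=(8, 3, 3, 3))   # 8 filters, 3 input channels, 3x3
w_after = w_before + rng.normal(scale=0.01, size=w_before.shape)
compressed = rpn_compress(w_before, w_after)
```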
According to [DBLP:journals/corr/IoffeS15], as the parameters of the previous layers change, the distribution of each layer's inputs changes during model training; this phenomenon is referred to as internal covariate shift, and a feasible solution is to use batch normalization to normalize each layer's inputs. Similarly, in the federated learning scenario, local training changes the distribution of network activations because the network parameters change; spatial pooling of the convolutional layers normalizes each layer through the averaging operation, which reduces internal covariate shift to some extent.
3.3 Residual Pooling Network - Server perspective
The main function of the federated server is to aggregate the model parameters uploaded by the selected clients; when aggregation is completed, the newly updated global model parameters are sent back to the clients for the next iteration. In this section, we analyze our RPN algorithm from the perspective of the federated server. In our new strategy, we do not aggregate raw model parameters but residual pooling network parameters instead; since only compressed parameters are transmitted, the amount of data downloaded is significantly reduced.
First, we initialize the global model as $M_G^0$, send it to all clients, and set $t = 0$. In each round $t$ at the server side, for every selected client $k$, we send a triple consisting of the round number $t$, the initial global model parameters $W_G^0$, and the aggregated residual pooling parameters to the selected clients for local model training. After receiving the responses from all selected clients, the global aggregation can be expressed as follows:

$$ \tilde{R}_G^{t+1} = \mathcal{A}\left( \tilde{R}_1^{t+1}, \dots, \tilde{R}_{K_t}^{t+1} \right) $$
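The server step can be sketched as follows, assuming for illustration a plain FedAvg-style mean as the aggregation operator (one choice among those listed in Section 2.1; the helper name is ours):

```python
import numpy as np

def rpn_server_aggregate(client_rpns):
    """Average the pooled residuals uploaded by the selected clients.
    Each entry has shape (n_out, n_in): one scalar per k x k conv kernel,
    so the broadcast back to the clients is equally compressed."""
    return np.mean(client_rpns, axis=0)

r1 = np.full((2, 2), 0.2)   # pooled residual from client 1
r2 = np.full((2, 2), 0.4)   # pooled residual from client 2
agg = rpn_server_aggregate([r1, r2])
```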
Repeat this cycle until the model converges or the maximum number of iterations is reached. The entire federated server procedure is described in Algorithm 1.
3.4 Residual Pooling Network - Client perspective
Typical federated learning requires each client to update its local model based on its local dataset and then send the model parameters to the server side for aggregation; however, as previously discussed, not all parameters are sensitive to model updates in distributed model training. In this section, we analyze the RPN algorithm from the perspective of the client and show how the amount of uploaded data is significantly reduced.
Without loss of generality, suppose we start at round $t+1$. The first step is to recover the local model $M_k^t$ from $W_G^0$ and the sum of the aggregated residuals $\tilde{R}_G^i$ for $i$ from 1 to $t$, which means:

$$ W_k^t = W_G^0 + \sum_{i=1}^{t} \tilde{R}_G^i $$
After that, based on the local dataset $D_k$, we perform the local model update, changing the model from $M_k^t$ to $M_k^{t+1}$, and calculate the residual network $R_k^{t+1}$; we then compress $R_k^{t+1}$ using the spatial mean pooling technique to obtain the final RPN model $\tilde{R}_k^{t+1}$, and send it back to the server side for aggregation. The entire federated client procedure is described in Algorithm 2.
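The recovery step can be sketched as follows. One detail is our own assumption: since each pooled scalar summarizes a whole $k \times k$ kernel, we expand it back uniformly over those $k \times k$ positions when rebuilding the local kernel.

```python
import numpy as np

def recover_local_kernel(w_init, pooled_history):
    """Rebuild a client's working kernel from the initial global kernel
    plus the sum of all pooled residuals broadcast so far. Each (n_out,
    n_in) pooled scalar is expanded uniformly over its k x k spatial
    positions (our reading of the decompression step)."""
    w = w_init.copy()                     # shape (n_out, n_in, k, k)
    for pooled in pooled_history:         # each of shape (n_out, n_in)
        w += pooled[:, :, None, None]     # broadcast over the k x k kernel
    return w

w0 = np.zeros((2, 2, 3, 3))
history = [np.full((2, 2), 0.10), np.full((2, 2), -0.05)]
w = recover_local_kernel(w0, history)
```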
3.5 Algorithm Correctness Analysis
Lemma 1. Denote by $M_G^t$ the global model after $t$ aggregations, by $M_k^t$ the local model of client $k$ before training on round $t+1$, and by $M_k^{t+1}$ the corresponding local model after training on round $t+1$. Under the iid condition, we should have:

Equation 8 is satisfied because $\tilde{R}_k^{t+1}$ is obtained by applying mean pooling to $R_k^{t+1}$, which preserves the expected value of the parameter updates.
3.6 Performance Analysis
The overall communication cost consists of two parts: the upload from the clients to the server and the download from the server to the clients.
Upload: According to Algorithm 2, the total amount of uploaded data equals the sum, over all rounds and all selected clients, of the sizes of the compressed residuals $\tilde{R}_k^t$. Since we execute spatial mean pooling on the convolutional layers, suppose the kernel shape of the model is $c_{out} \times c_{in} \times n \times n$, where $c_{in}$ is the number of filter inputs, $c_{out}$ is the number of filter outputs, and $n$ is the kernel size; after compression, the kernel shape becomes $c_{out} \times c_{in} \times 1 \times 1$. Obviously, compared with standard federated learning, the amount of uploaded data is reduced by a factor of $n^2$ for each convolutional kernel.
Download: According to Algorithm 1, the total amount of downloaded data equals the sum, over all rounds and all selected clients, of the sizes of the aggregated RPN model $\tilde{R}_G^t$. Since $\tilde{R}_G^t$ is the aggregation of the RPN models collected from the selected clients, its kernel shape is also $c_{out} \times c_{in} \times 1 \times 1$; as previously discussed, the amount of downloaded data is likewise reduced by a factor of $n^2$ for each convolutional kernel.
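The per-layer reduction factor follows directly from the shapes (a small helper of our own, for illustration):

```python
def conv_compression_ratio(n_out, n_in, k):
    """A conv kernel bank of shape (n_out, n_in, k, k) is pooled down to
    (n_out, n_in), so each direction of transfer shrinks by k * k."""
    full = n_out * n_in * k * k
    pooled = n_out * n_in
    return full / pooled

ratio = conv_compression_ratio(128, 64, 3)  # any 3x3 conv layer: 9x smaller
```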
We evaluate our approach on three different tasks: image classification, object detection, and semantic segmentation, comparing the performance of standard federated learning with that of RPN. All experiments are run on a federated learning cluster consisting of one server and 10 clients, each equipped with 8 NVIDIA Tesla V100 GPUs.
4.1 Experiments on MNIST classification
We evaluate classification performance on the MNIST dataset using a 5-layer CNN model containing 1,199,882 parameters in total. After applying the RPN technique, the transmitted data drops to 1,183,242 parameters; since most of the parameters come from the fully connected layers, the compression effect is not obvious. We split the data into 10 parts randomly and send one part to each client as its training dataset; the data distribution of each client is shown in Table 4.
The results are shown in Figure 4; as we can see, RPN achieves almost the same effect as standard federated learning.
Our experimental performance is summarized in Table 1:
4.2 Experiments on Pascal-VOC object detection
In this section, we conduct the object detection task on the PASCAL VOC dataset. The PASCAL Visual Object Classes (PASCAL VOC) dataset is a well-known dataset for object detection, classification, segmentation of objects, and so on. It provides around 10,000 images for training and validation, containing bounding boxes with objects. Although the PASCAL VOC dataset contains only 20 categories, it is still considered a reference dataset for the object detection problem.
We use YOLOv3 as the detection model; YOLOv3 contains 107 layers, 75 of which are convolutional. As in the MNIST experiment, we split the data into 10 parts randomly and send one part to each client as its training data.
Our experimental results are shown in Figure 5. We use mean average precision (mAP) as our performance metric; mAP is a popular metric for measuring the accuracy of object detectors such as Faster R-CNN and SSD. As we can see, RPN shows some fluctuation in the early stage, but as the iterations continue the performance becomes stable, with only a shallow performance gap between RPN and standard federated learning.
Our experimental performance is summarized in Table 2. Since about 70% of the layers in the YOLOv3 architecture are convolutional, the effect of our RPN compression strategy is more obvious than in the MNIST classification: we reduce the transmitted model parameters from 61,895,776 to 12,382,560, with only a one-percent decline in performance.
4.3 Experiments on Pascal-VOC Semantic segmentation
In this section, we conduct the semantic segmentation task on the PASCAL VOC dataset and use a fully convolutional network (FCN) as our model. FCN can use different architectures for feature extraction, such as ResNet, VGGNet, and DenseNet; in our experiment, we use ResNet-50 as the base model.
Our experiment results are shown in Figure 6. We use meanIOU and pixel accuracy as our performance metrics.
Our experimental performance is summarized in Table 3: we reduce the transmitted model parameters from 23,577,621 to 13,325,281, with only a 2.6-percent decline in the meanIOU metric and a 1.16-percent decline in pixel accuracy.
[Table 3 header: Model | Parameters transmission | meanIOU | Pixel ACC]
5 Conclusion and Future Work
In this paper, we propose a new compression strategy to improve the communication efficiency of federated learning. We test our approach on three different tasks, including classification, object detection, and semantic segmentation; the experiments show that RPN not only reduces data transmission effectively but also achieves almost the same performance as standard federated learning. Most importantly, RPN is an end-to-end procedure, which makes it easy to deploy in real-world applications without human intervention. In the future, we will combine other compression strategies to further improve communication efficiency.
[Table 4 header: client | total image number | images per label 0-9]
[Table header: label | client 1 | client 2 | client 3 | client 4 | client 5 | client 6 | client 7 | client 8 | client 9 | client 10]