RPN: A Residual Pooling Network for Efficient Federated Learning

01/23/2020 · Anbu Huang et al., The Hong Kong University of Science and Technology

Federated learning is a new machine learning framework that enables different parties to collaboratively train a model while protecting data privacy and security. Due to model complexity, network unreliability, and connection instability, communication cost has become a major bottleneck for applying federated learning to real-world applications. Existing strategies either require manual tuning of hyper-parameters or break the original process into multiple steps, which makes end-to-end implementation difficult. In this paper, we propose a novel compression strategy called Residual Pooling Network (RPN). Our experiments show that RPN not only reduces data transmission effectively, but also achieves almost the same performance as standard federated learning. Our approach is an end-to-end procedure that can be readily applied to any CNN-based training scenario to improve communication efficiency, making it easy to deploy in real-world applications without human intervention.


1 Introduction

In the past decade, Deep Convolutional Neural Networks (DCNN) have shown powerful representation and learning capabilities [Goodfellow:2016:DL:3086952, 0483bd9444a348c8b59d54a190839ec9], and have achieved unprecedented success in many commercial applications, such as computer vision [Krizhevsky:2012:ICD:2999134.2999257, DBLP:journals/corr/HeZRS15], natural language processing [mikolov2013efficient, DBLP:journals/corr/abs-1810-04805, DBLP:journals/corr/BojanowskiGJM16], and speech recognition [DBLP:journals/corr/HannunCCCDEPSSCN14, DBLP:journals/corr/ZhangCJ16, DBLP:journals/corr/OordDZSVGKSK16]. As with traditional machine learning algorithms, the success of deep neural networks is partially driven by the availability of big data. In the real world, however, with the exception of a few industries, most fields have only limited or poor-quality data. Worse still, due to industry competition, privacy and security concerns, and complicated administrative procedures, it is almost impossible to train centralized machine learning models by integrating the data scattered across countries and institutions [DBLP:journals/corr/abs-1902-04885].

At the same time, with increasing awareness of data privacy, the emphasis on data privacy and security has become a major worldwide issue. News about public data leaks causes great concern in the media and among governments. In response, countries across the world are strengthening laws to protect data security and privacy. One example is the General Data Protection Regulation (GDPR) [Voigt:2017:EGD:3152676], enforced by the European Union since May 25, 2018. Similar privacy and security regulations are being enacted in the US and China.

To decouple the need for machine learning from the need for storing large amounts of data in a central database, a new machine learning framework called federated learning was proposed [DBLP:journals/corr/McMahanMRA16]. Federated learning provides a promising approach to model training without compromising data privacy and security [DBLP:journals/corr/abs-1902-01046, DBLP:journals/corr/McMahanMRA16, DBLP:journals/corr/KonecnyMYRSB16]. Unlike typical centralized training procedures, federated learning has each client collaboratively learn a shared model using the training data on its device, keeping the data local. In this scenario, model parameters (or gradients) are transmitted between the server and the clients in each round. Due to the frequent exchange of data between the central server and the clients, coupled with network unreliability and connection instability, communication cost is the principal constraint restricting the performance of federated learning.

In this paper, we propose a new compression strategy called Residual Pooling Network (RPN) to address the communication cost problem through parameter selection and approximation. As we will see below, our approach can significantly reduce the quantity of data transmitted while still maintaining a high level of model performance. More importantly, our approach is an end-to-end process, which makes it easy to deploy in real-world applications without human intervention.

Contributions: Our main contributions in this paper are as follows:

  • We propose a practical and promising approach to improve communication efficiency. Our approach reduces not only the amount of data uploaded but also the amount of data downloaded.

  • We propose a general solution for CNN-based model compression under the federated learning framework.

  • In place of parameter-encryption-based approaches, the proposed parameter approximation and parameter selection method preserves model security without compromising communication efficiency, and is suitable for deployment in large-scale systems.

The rest of this paper is organized as follows: we first review related work on federated learning and current communication-efficiency strategies. We then introduce our approach in detail. After that, we present the experimental results, followed by the conclusion and a discussion of future work.

2 Related Work

2.1 Federated Learning

In the traditional machine learning approach, data collected by different clients (IoT devices, smartphones, etc.) is uploaded and processed centrally in a cloud-based server or data center. However, due to data privacy and security concerns, sending raw data to a central cloud is regarded as unsafe and may violate the General Data Protection Regulation (GDPR) [Voigt:2017:EGD:3152676]. To decouple the need for machine learning from the need for storing large amounts of data in a central database, a new machine learning framework called federated learning was proposed; a typical federated learning framework is shown in Figure 1.

In the federated learning scenario, each client updates its local model based on its local data and then sends the updated model parameters to the server for aggregation; these steps are repeated over multiple rounds until the learning process converges. Since the training data remain on personal devices during training, data security can be guaranteed.

Figure 1: Federated Learning Architecture [DBLP:journals/corr/abs-1902-04885]

In general, the standard federated learning training process includes the following three steps [DBLP:journals/corr/abs-1902-04885, lim2019federated]:

  • Local Model Training: Denote by $t$ the current iteration round, by $K$ the number of clients, and by $M_k^t$ the initial model of this round for client $k$, parameterized by $W_k^t$, that is to say $M_k^t = M(W_k^t)$; $D_k$ is the local dataset of client $k$. Based on $D_k$, each client updates its local model parameters from $W_k^t$ to $W_k^{t+1}$; the updated local model parameters are then sent to the server side.

  • Global Aggregation: The server aggregates the local model parameters of all clients and computes the global model. We can formalize the aggregation as follows:

    $W_G^{t+1} = \mathcal{A}\big(W_1^{t+1}, W_2^{t+1}, \dots, W_K^{t+1}\big)$    (1)

    where $\mathcal{A}$ is the aggregation operator; sensible choices include FedAvg [DBLP:journals/corr/McMahanMRA16], Byzantine Gradient Descent [DBLP:journals/corr/ChenSX17], Secure Aggregation [Bonawitz:2017:PSA:3133956.3133982], and Automated Update of Weights (a minimal weighted-averaging sketch is given after this list).

  • Update Local Model: When the aggregation is complete, the server sends the global model parameters back to the clients, which replace their local models accordingly.
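To make the aggregation step concrete, the following is a minimal sketch of FedAvg-style weighted averaging, assuming parameters are held as NumPy arrays; the names `client_weights` and `num_samples` are illustrative and not taken from the paper.

```python
import numpy as np

def fedavg_aggregate(client_weights, num_samples):
    """FedAvg-style aggregation: average client parameters weighted by data size.

    client_weights: list of dicts mapping layer name -> np.ndarray
    num_samples:    list of local dataset sizes used as aggregation weights
    """
    total = float(sum(num_samples))
    return {
        name: sum(w[name] * (n / total) for w, n in zip(client_weights, num_samples))
        for name in client_weights[0]
    }

# Toy usage: three clients, each holding a single 3x3 convolution kernel.
clients = [{"conv1": np.random.randn(1, 1, 3, 3)} for _ in range(3)]
sizes = [100, 200, 300]
global_weights = fedavg_aggregate(clients, sizes)
print(global_weights["conv1"].shape)  # (1, 1, 3, 3)
```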

2.2 Communication Costs

In the standard federated learning training procedure, all model parameters are exchanged between the server and the clients. For a complicated deep learning model such as a CNN, each update may comprise millions of parameters; coupled with network unreliability and connection instability, this makes communication cost a significant bottleneck, so we need to improve the communication efficiency of federated learning. The total number of bits that have to be transmitted during model training is given by:

$B = \sum_{t=1}^{T} \sum_{k=1}^{K_t} \Big( b_{\mathrm{down}}\big(W_G^t\big) + b_{\mathrm{up}}\big(W_k^{t+1}\big) \Big)$    (2)

where $T$ is the total number of iterations between the server side and the clients, $K_t$ is the number of clients selected by the server to update at round $t$, $W_G^t$ denotes the global model after $t$ aggregations, $b_{\mathrm{down}}(W_G^t)$ is the number of selected parameter bits downloaded to the client side, and, similarly, $b_{\mathrm{up}}(W_k^{t+1})$ is the number of selected parameter bits of client $k$ uploaded to the server side. Using Equation 2 as a reference, we can classify current research on communication efficiency along the following four directions:


Iteration frequency: One feasible way to improve communication efficiency is to reduce the number of communication rounds (see $T$ in Equation 2) between the server and the clients. McMahan et al. [DBLP:journals/corr/McMahanMRA16] proposed an iterative model averaging algorithm called Federated Averaging (FedAvg), and pointed out that each client can iterate the local SGD update multiple times before the averaging step, thus reducing the total number of communication rounds. Their experiments showed a reduction in required communication rounds by 10–100x compared to FedSGD.


Pruning: Another research direction is to use model compression algorithms to reduce data transmission (see $b_{\mathrm{up}}$ and $b_{\mathrm{down}}$ in Equation 2); this technique is commonly used in distributed learning [lim2019federated, wang2018atomo]. A naive implementation requires each client to send the full model parameters back to the server in each round, and vice versa. Inspired by deep learning model compression algorithms [DBLP:journals/corr/HanMD15], distributed model compression approaches have been widely studied. Strom et al. [strom2015scalable, tsuzuku2018variance] proposed an approach in which only gradients with magnitude greater than a given threshold are sent to the server; however, it is difficult to choose the threshold, since different tasks require different configurations. In a follow-up work, Aji et al. [aji2017sparse] fixed the transmission ratio p and communicated only the fraction p of gradient entries with the largest magnitude, as sketched below.
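As a rough illustration of the fixed-ratio approach described above, here is a minimal sketch of top-p magnitude sparsification for a single gradient tensor; it is not the authors' code, and the index/value transmission format is an assumption.

```python
import numpy as np

def sparsify_top_p(grad, p=0.01):
    """Keep only the fraction p of gradient entries with the largest magnitude.

    Returns flat indices and values to transmit; the receiver treats all
    other entries as zero.
    """
    flat = grad.ravel()
    k = max(1, int(p * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest magnitudes
    return idx, flat[idx]

def densify(idx, values, shape):
    """Rebuild a dense gradient tensor from the transmitted indices and values."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

grad = np.random.randn(64, 32, 3, 3)
idx, vals = sparsify_top_p(grad, p=0.01)
sparse_grad = densify(idx, vals, grad.shape)
```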

Konecny et al. [DBLP:journals/corr/KonecnyMYRSB16] proposed low-rank approximation and subsampling of the parameter matrix to reduce data transmission. Caldas et al. [DBLP:journals/corr/abs-1812-07210] extended the studies in [DBLP:journals/corr/KonecnyMYRSB16] by proposing lossy compression and federated dropout to reduce server-to-participant communication costs.


Importance-based Update: This strategy involves selective communication, so that only the important or relevant updates are transmitted in each communication round. The authors in [216799] propose the edge Stochastic Gradient Descent (eSGD) algorithm, which selects only a small fraction of important gradients to be communicated to the FL server for parameter update in each round. The authors in [CMFL2019] propose the Communication-Mitigated Federated Learning (CMFL) algorithm, which uploads only relevant local updates to reduce communication costs while guaranteeing global convergence.


Quantization: Quantization is another important compression technique. Konecny et al. [DBLP:journals/corr/KonecnyMYRSB16] proposed probabilistic quantization, which reduces each scalar to one bit by stochastically rounding it to the maximum or minimum value. Bernstein et al. [DBLP:journals/corr/abs-1802-04434, DBLP:journals/corr/abs-1810-05291] proposed signSGD, which quantizes each gradient update to its binary sign, reducing the size by a factor of 32; a minimal sketch is given below. Sattler et al. [DBLP:journals/corr/abs-1903-02891] proposed a new compression technique called Sparse Ternary Compression (STC), which is suitable for the non-IID setting.
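The sign-quantization idea can be sketched as follows; the bit-packing scheme and the fixed rescaling factor are illustrative assumptions, not the exact signSGD implementation.

```python
import numpy as np

def quantize_sign(grad):
    """Quantize a gradient tensor to its elementwise sign, packed one bit per entry."""
    signs = grad >= 0                      # boolean sign mask
    packed = np.packbits(signs.ravel())    # 1 bit per entry instead of 32
    return packed, grad.shape

def dequantize_sign(packed, shape, scale=1.0):
    """Unpack the sign bits and map them back to +/- scale."""
    n = int(np.prod(shape))
    signs = np.unpackbits(packed)[:n].astype(np.float32)
    return (2.0 * signs - 1.0).reshape(shape) * scale

grad = np.random.randn(128, 64, 3, 3).astype(np.float32)
packed, shape = quantize_sign(grad)
approx_grad = dequantize_sign(packed, shape)
```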


The aforementioned approaches are all feasible strategies for improving communication efficiency during federated learning training: some reduce the amount of data transferred, while others reduce the number of iterations. However, these solutions either require manual setting of hyper-parameters or break the original process into multiple steps, and therefore cannot be implemented as an end-to-end workflow.

3 Methodology

In this section, we formalize our problem definition and show how to use the Residual Pooling Network (RPN) to improve communication efficiency under the federated learning framework; our new federated learning workflow is shown in Figure 3.

3.1 Symbol Definition

Suppose that our federated learning system consists of $K$ clients and one central server, as illustrated in Figure 1. $C_k$ denotes the $k$-th client, each of which has its own dataset, indicated by $D_k$. $M_G^t$ represents the global model after $t$ aggregations and is parameterized by $W_G^t$, that is to say:

$M_G^t = M\big(W_G^t\big)$    (3)

For simplicity, and without loss of generality, suppose we have completed $t$ aggregations and are currently training in round $t+1$; client $k$ needs to update the local model from $W_k^t$ to $W_k^{t+1}$ based on the local dataset $D_k$. The local objective function is as follows:

$\min_{W}\; F_k(W) = \frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} \ell\big(x_i, y_i; W\big)$    (4)

The goal of client $k$ in iteration $t+1$ is to find the optimal parameters $W_k^{t+1}$ that minimize the loss function of Equation 4.

Let $R_k^{t+1}$ represent the residual network, which is given by:

$R_k^{t+1} = W_k^{t+1} - W_k^t$    (5)

Executing spatial mean pooling on $R_k^{t+1}$ transforms it into the compressed residual $P_k^{t+1}$.

3.2 Residual Pooling Network

Standard federated learning requires exchanging all parameters between the server and the clients, but many studies [strom2015scalable, tsuzuku2018variance, DBLP:journals/corr/HanMD15] have shown that not all parameters are sensitive to model updates in distributed model training. Under this assumption, we propose the Residual Pooling Network, which compresses data transmission through parameter approximation and parameter selection. RPN consists of the following two components:

  • Residual Network: The difference between the model parameters before and after the model update is called the residual network. The residual network is used to capture and evaluate the contribution of different filters: the sum of a filter's kernel elements is regarded as the contribution of that filter, and only filters whose contribution magnitude is greater than a given threshold are sent to the server side.

    The purpose of the residual network is to capture the changes in parameters during training. The authors in [DBLP:journals/corr/HeZR014] showed that different filters respond differently to different image semantics. Since the data distributions of federated clients differ, each client captures different semantics of its local dataset $D_k$; as a result, the changes in each layer of the local model differ, so there is no need to upload all model parameters.

    Figure 2: Spatial mean pooling: we compute the mean value of each kernel and use this value as a new kernel of size (1, 1).
  • Spatial Pooling: Spatial pooling is used to further compress the model parameters. After we select highly sensitive filters through the residual network, spatial mean pooling compresses each filter from size $k \times k$ to $1 \times 1$, as shown in Figure 2 (a minimal sketch of the residual-plus-pooling step follows this list).

    According to [DBLP:journals/corr/IoffeS15], as the parameters of the previous layers change, the distribution of each layer's inputs changes during model training; this phenomenon is referred to as internal covariate shift, and a feasible remedy is batch normalization, which normalizes each layer's inputs. Similarly, in the federated learning scenario, the distribution of network activations changes after local training because the network parameters have changed; spatial pooling of the convolutional layers normalizes each layer through averaging, which reduces internal covariate shift to some extent.
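Below is a minimal sketch of the two RPN components applied to one convolutional layer: compute the residual between the weights after and before local training, then apply spatial mean pooling so that each k x k kernel collapses to a single value. The (out_channels, in_channels, k, k) tensor layout is an assumption for illustration.

```python
import numpy as np

def residual_network(w_after, w_before):
    """Residual of a convolutional layer: weight difference before/after local training."""
    return w_after - w_before

def spatial_mean_pool(residual):
    """Average each k x k kernel down to a single value (Figure 2)."""
    # residual has shape (out_channels, in_channels, k, k); keep a (..., 1, 1) kernel.
    return residual.mean(axis=(2, 3), keepdims=True)

w_before = np.random.randn(64, 32, 3, 3)
w_after = w_before + 0.01 * np.random.randn(64, 32, 3, 3)
rpn_update = spatial_mean_pool(residual_network(w_after, w_before))
print(rpn_update.shape)  # (64, 32, 1, 1): 9x fewer values per 3x3 kernel
```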

Figure 3: New federated learning workflow of our approach: (1) select clients for local model update; (2) recover the local model; (3) train the local model on the local dataset; (4) calculate the residual network; (5) apply spatial pooling; (6) send the RPN to the server side and aggregate; (7) send the aggregated RPN back to the selected clients, and repeat this cycle.

3.3 Residual Pooling Network - Server perspective


  Initialize the global model W_G^0, and set t = 0
  Send the initial model W_G^0 to each client
  for t = 0, 1, ..., T - 1 do
     Select the set S_t of participating clients
     for each client k in S_t in parallel do
        P_k^{t+1} <- ClientUpdate(k, t)
     end for
     RPN aggregation: P_G^{t+1} = (1 / |S_t|) * sum_{k in S_t} P_k^{t+1}
  end for
Algorithm 1 RPN: FL_Server

The main function of the federated server is to aggregate the model parameters uploaded by all selected clients; when aggregation is complete, the newly updated global parameters are sent back to the clients for the next iteration. In this section, we analyze our RPN algorithm from the perspective of the federated server. In our new strategy, we do not aggregate raw model parameters but residual pooling network parameters instead; since only compressed parameters are transmitted, the amount of data downloaded is significantly reduced.

First, we initialize the global model as $W_G^0$, send it to all clients, and set $t = 0$. In each round $t$ on the server side, for each selected client $k$ we send a triple (the client id $k$, the current round $t$, and the aggregated RPN $P_G^t$) for local model training; once we receive the responses from all selected clients (denote this set by $S_t$), the global aggregation can be expressed as follows:

$P_G^{t+1} = \frac{1}{|S_t|} \sum_{k \in S_t} P_k^{t+1}$    (6)

This cycle is repeated until the model converges or the maximum number of iterations is reached. The complete federated server procedure is given in Algorithm 1, and a sketch of the server loop is given below.
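To make the server procedure concrete, here is a hedged sketch of the server loop under the unweighted averaging of Equation 6; `client_update` stands in for the client procedure of Algorithm 2, and the payload passed to it is an assumption.

```python
def rpn_server(clients, rounds, client_update):
    """Server loop (sketch of Algorithm 1): collect pooled residuals and aggregate.

    clients:       list of client ids participating in every round (for simplicity)
    client_update: callable(client_id, round_idx, aggregated_rpns) returning a dict
                   mapping layer name -> pooled residual array
    """
    aggregated_rpns = []  # one aggregated RPN per completed round, broadcast to clients
    for t in range(rounds):
        # Each selected client trains locally and returns its compressed residual.
        updates = [client_update(k, t, aggregated_rpns) for k in clients]
        # Unweighted average over clients, layer by layer (Equation 6).
        agg = {name: sum(u[name] for u in updates) / len(updates) for name in updates[0]}
        aggregated_rpns.append(agg)
    return aggregated_rpns
```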

3.4 Residual Pooling Network - Client perspective

Typical federated learning requires each client to update its local model based on its local dataset and then send the model parameters to the server for aggregation; however, as previously discussed, not all parameters are sensitive to model updates in distributed training. In this section, we analyze the RPN algorithm from the perspective of the client and show how the amount of uploaded data is significantly reduced.

Without loss of generality, suppose we start at round $t$. The first step is to recover the local model $W_k^t$ from $W_G^0$ and the sum of the aggregated RPNs $P_G^i$, where $i$ runs from 1 to $t$, which means:

$W_k^t = W_G^0 + \sum_{i=1}^{t} P_G^i$    (7)

After that, based on the local dataset $D_k$, we perform the local model update, changing the model from $W_k^t$ to $W_k^{t+1}$, and calculate the residual network $R_k^{t+1}$. We then compress $R_k^{t+1}$ using the spatial mean pooling technique, obtain the final RPN update $P_k^{t+1}$, and send it back to the server for aggregation. The complete federated client procedure is given in Algorithm 2, followed by a corresponding sketch.


  Input: client id k; round t
  Output: model parameters P_k^{t+1} sent back to the server
  Recover model: W <- W_G^0
  for i = 1 to t do
     W <- W + P_G^i
  end for
  Let W_k^t = W, and train locally on D_k to obtain W_k^{t+1}
  Calculate residual model R_k^{t+1} = W_k^{t+1} - W_k^t
  Compress R_k^{t+1} with spatial pooling, and get P_k^{t+1}
  Send P_k^{t+1} back to server side
Algorithm 2 RPN: FL_Client (ClientUpdate)
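A corresponding hedged sketch of the client procedure of Algorithm 2 follows: recover the local model from the initial weights plus the accumulated aggregated RPNs (Equation 7), train locally, then pool the residual. The `train_locally` callable is hypothetical, only 4-D convolutional tensors are handled, and broadcasting the pooled (1, 1) values back over each k x k kernel during recovery is an assumption.

```python
import numpy as np

def apply_rpn(weights, rpn):
    """Add an aggregated RPN to the model; pooled (1, 1) kernels broadcast over k x k."""
    return {name: w + rpn[name] for name, w in weights.items()}

def rpn_client_update(initial_weights, aggregated_rpns, local_data, train_locally):
    """Client step for one round (sketch of Algorithm 2).

    initial_weights: W_G^0 as a dict mapping layer name -> np.ndarray (4-D conv weights)
    aggregated_rpns: list of aggregated RPN dicts from rounds 1..t
    train_locally:   hypothetical callable(weights, data) -> updated weights
    """
    # Recover the local model: W_k^t = W_G^0 + sum_i P_G^i  (Equation 7).
    w_t = {name: w.copy() for name, w in initial_weights.items()}
    for rpn in aggregated_rpns:
        w_t = apply_rpn(w_t, rpn)
    # Local training, residual computation, and spatial mean pooling.
    w_next = train_locally(w_t, local_data)
    return {
        name: (w_next[name] - w_t[name]).mean(axis=(2, 3), keepdims=True)
        for name in w_t
    }
```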

3.5 Algorithm Correctness Analysis

Lemma 1. Denote by $W_G^t$ the global model after $t$ aggregations, by $W_k^t$ the local model of client $k$ before training in round $t+1$, and by $W_k^{t+1}$ the corresponding local model after training in round $t+1$. Under the iid condition, we should have:

$\mathbb{E}\big[P_k^{t+1}\big] = \mathbb{E}\big[W_k^{t+1} - W_k^t\big]$    (8)

Equation 8 is satisfied because $P_k^{t+1}$ is the result of applying mean pooling to $R_k^{t+1} = W_k^{t+1} - W_k^t$, which makes

$\mathbb{E}\big[P_k^{t+1}\big] = \mathbb{E}\big[R_k^{t+1}\big]$    (9)

satisfied.


Lemma 2. Suppose we have $K$ clients in our federated learning cluster. Algorithm 1 and Algorithm 2 guarantee that the local model can be recovered:

$W_k^t = W_G^0 + \sum_{i=1}^{t} P_G^i$    (10)

3.6 Performance Analysis

The overall communication cost consists of two parts: the upload from the clients to the server and the download from the server to the clients.

  • Upload: According to Algorithm 2, the total amount of uploaded data is equal to:

    $B_{\mathrm{up}} = \sum_{t=1}^{T} \sum_{k \in S_t} \big| P_k^{t+1} \big|$    (11)

    Since we execute spatial mean pooling on the convolutional layers, suppose the kernel shape of the model is $n_{in} \times n_{out} \times k \times k$, where $n_{in}$ is the number of input channels, $n_{out}$ is the number of output channels, and $k$ is the kernel size. After compression, the kernel shape becomes $n_{in} \times n_{out} \times 1 \times 1$. Obviously, compared with standard federated learning, the amount of uploaded data is reduced by a factor of $k^2$ for each convolutional kernel (a small worked example is given after this list).

  • Download: According to Algorithm 1, the total amount of downloaded data is equal to:

    $B_{\mathrm{down}} = \sum_{t=1}^{T} \sum_{k \in S_t} \big| P_G^{t} \big|$    (12)

    $P_G^{t}$ is the aggregation of all RPN updates collected from the selected clients; its convolutional kernel shape is likewise $n_{in} \times n_{out} \times 1 \times 1$, so, as previously discussed, the amount of downloaded data is also reduced by a factor of $k^2$ for each convolutional kernel.
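As a small worked example (with a hypothetical layer, not one taken from the paper's models): a convolutional layer with 32 input channels, 64 output channels, and 3 x 3 kernels transmits 32 x 64 x 3 x 3 = 18,432 values under standard federated learning, but only 32 x 64 x 1 x 1 = 2,048 values after spatial mean pooling, a k^2 = 9x reduction.

```python
n_in, n_out, k = 32, 64, 3            # hypothetical convolutional layer
full = n_in * n_out * k * k           # 18,432 values sent by standard FL
pooled = n_in * n_out * 1 * 1         # 2,048 values sent by RPN
print(full, pooled, full // pooled)   # 18432 2048 9
```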

4 Experiments

We evaluate our approach on three different tasks: image classification, object detection, and semantic segmentation, comparing the performance of standard federated learning and RPN. All experiments are run on a federated learning cluster consisting of one server and 10 clients, each equipped with 8 NVIDIA Tesla V100 GPUs.

4.1 Experiments on MNIST classification

We evaluate classification performance on the MNIST dataset using a 5-layer CNN with 1,199,882 parameters in total. With the RPN technique, the number of transmitted parameters drops to 1,183,242 (a saving of 16,640 parameters); since most of the parameters come from the fully connected layers, the compression effect is limited. We split the data into 10 parts randomly and send one part to each client as its training dataset; the data distribution of each client is shown in Table 4.

The results are shown in Figure 4; as we can see, RPN achieves almost the same performance as standard federated learning.

Figure 4: Experiments on MNIST dataset

Our experimental results are summarized in Table 1:

Model   Parameters transmitted   Accuracy
FL      1,199,882                0.991
RPN     1,183,242                0.9912
Table 1: Experimental results on MNIST classification

4.2 Experiments on Pascal-VOC object detection

In this section, we conduct the object detection task on the PASCAL VOC dataset. The PASCAL Visual Object Classes (PASCAL VOC) dataset is a well-known benchmark for object detection, classification, segmentation, and related tasks; it contains around 10,000 images with object bounding boxes for training and validation. Although the PASCAL VOC dataset contains only 20 categories, it is still considered a reference dataset for the object detection problem.

We conduct the object detection task on the PASCAL VOC dataset with YOLOv3, which contains 107 layers, 75 of which are convolutional. As in the MNIST experiment, we split the data into 10 parts randomly and send one part to each client as its training data.

Our experimental results are shown in Figure 5. We use mean average precision (mAP) as our performance metric; mAP is a popular metric for measuring the accuracy of object detectors such as Faster R-CNN and SSD. As we can see, RPN suffers from some fluctuation in the early stage, but as the iterations continue the performance becomes stable, with only a small gap between RPN and standard federated learning.

Figure 5: Experiments on PASCAL VOC object detection

Our experimental results are summarized in Table 2. Since about 70% of the layers in the YOLOv3 architecture are convolutional, the effect of our RPN compression strategy is more pronounced than in the MNIST classification experiment: we reduce the transmitted parameters from 61,895,776 to 12,382,560, with only about a one percent decline in performance.

Model   Parameters transmitted   Accuracy (mAP)
FL      61,895,776               0.7077
RPN     12,382,560               0.7002
Table 2: Experimental results on PASCAL VOC object detection

4.3 Experiments on Pascal-VOC Semantic segmentation

In this section, we conduct the semantic segmentation task on the PASCAL VOC dataset, using a fully convolutional network (FCN) as our model. FCNs can use different architectures for feature extraction, such as ResNet, VGGNet, and DenseNet; in our experiment, we use ResNet-50 as the base model.

Our experimental results are shown in Figure 6. We use mean IoU and pixel accuracy as our performance metrics.

Figure 6: Experiments on PASCAL VOC semantic segmentation

Our experimental results are summarized in Table 3: we reduce the transmitted parameters from 23,577,621 to 13,325,281, with only a 2.6% relative decline in mean IoU and a 1.16% relative decline in pixel accuracy.

Model   Parameters transmitted   Mean IoU   Pixel Acc.
FL      23,577,621               0.5077     0.864
RPN     13,325,281               0.4942     0.854
Table 3: Experimental results on PASCAL VOC semantic segmentation

5 Conclusion and Future Work

In this paper, we propose a new compression strategy to improve the communication efficiency of federated learning. We test our approach on three different tasks: classification, object detection, and semantic segmentation. The experiments show that RPN not only reduces data transmission effectively, but also achieves almost the same performance as standard federated learning. Most importantly, RPN is an end-to-end procedure, which makes it easy to deploy in real-world applications without human intervention. In the future, we will combine RPN with other compression strategies to further improve communication efficiency.

client     total image number   label distribution: 0   1   2   3   4   5   6   7   8   9
client 1 5832 763 586 423 533 582 607 690 568 539 541
client 2 5230 649 798 518 330 429 380 320 731 626 449
client 3 6190 363 724 722 421 611 612 531 669 708 829
client 4 6832 518 831 605 1082 637 607 996 586 701 269
client 5 5903 420 570 588 763 288 688 611 641 600 734
client 6 5598 518 763 563 597 591 531 328 608 421 678
client 7 6239 611 524 855 628 789 664 511 709 790 158
client 8 6166 417 449 511 666 737 508 1080 563 563 672
client 9 5298 523 960 739 707 432 498 285 420 488 246
client 10 6712 1141 537 434 404 746 326 566 770 425 1373
Table 4: MNIST data distribution of each federated client
label   client 1   client 2   client 3   client 4   client 5   client 6   client 7   client 8   client 9   client 10
Aeroplane 53 32 71 66 58 61 38 45 39 60
Bicycle 21 62 33 48 39 56 57 68 42 40
Bird 61 68 103 52 50 75 37 43 59 50
Boat 31 42 71 52 23 36 43 54 29 40
Bottle 71 82 38 54 65 98 37 77 61 81
Bus 31 12 53 64 55 46 27 18 59 60
Car 123 108 45 64 72 151 77 48 89 103
Cat 75 89 101 161 77 87 89 58 113 94
Chair 170 65 79 91 106 118 68 91 117 67
Cow 12 25 39 33 41 29 17 18 29 20
Diningtable 34 28 45 72 52 61 37 28 49 30
Dog 208 102 113 178 58 69 91 82 119 98
Horse 43 25 31 49 72 28 19 31 27 42
Motorbike 16 72 21 40 53 39 72 68 49 52
Person 340 689 213 480 111 239 128 196 295 303
PottedPlant 45 24 65 83 28 31 39 18 45 51
Sheep 28 12 61 22 14 20 30 18 39 27
Sofa 42 39 29 37 35 20 51 54 32 30
Train 34 26 66 14 25 32 38 40 19 41
Tvmonitor 68 43 23 45 39 29 17 29 31 40
Table 5: PASCAL VOC data distribution of each federated client

References