Ternary Compression for Communication-Efficient Federated Learning

03/07/2020, by Jinjin Xu et al., University of Surrey and East China University of Science and Technology

Learning over massive data stored in different locations is essential in many real-world applications. However, sharing data is fraught with challenges due to the increasing demands on privacy and security with the growing use of smart mobile devices and IoT devices. Federated learning provides a potential solution to privacy-preserving and secure machine learning by jointly training a global model without uploading the data distributed on multiple devices to a central server. However, most existing work on federated learning adopts machine learning models with full-precision weights, and almost all these models contain a large number of redundant parameters that do not need to be transmitted to the server, incurring excessive communication costs. To address this issue, we propose a federated trained ternary quantization (FTTQ) algorithm, which optimizes the quantized networks on the clients through a self-learning quantization factor. A convergence proof of the quantization factors and the unbiasedness of FTTQ are given. In addition, we propose a ternary federated averaging protocol (T-FedAvg) to reduce the upstream and downstream communication of federated learning systems. Empirical experiments are conducted to train widely used deep learning models on publicly available datasets, and our results demonstrate the effectiveness of FTTQ and T-FedAvg compared with the canonical federated learning algorithms in reducing communication costs while maintaining the learning performance.


I Introduction

The number of Internet of Things (IoT) devices and smart mobile devices deployed, e.g., in the process industry has grown dramatically over the past decades, generating massive amounts of distributively stored data every moment. Meanwhile, recent achievements in deep learning [1], such as AlphaGo [2], rely heavily on the knowledge stored in big data. Naturally, adopting deep learning methods to effectively utilize the rich data held by the local clients of the process industry, e.g., branch factories, would provide strong support to industrial production. However, training deep learning models on distributed data is difficult, and uploading private data to the cloud is controversial, due to limitations on network bandwidth, budgets and security regulations, e.g., the GDPR [3].

Many research efforts have been devoted to related fields recently. Early work in this area mainly focused on training deep models on multiple machines to alleviate the computational burden of large data volumes, known as distributed machine learning [4]. These methods achieve satisfactory performance by splitting big data into small sets to accelerate model training. For example, data parallelism [4], model parallelism [4, 5] (see Fig. 1) and parameter servers [6, 7, 8] are commonly used in practice. Correspondingly, weight optimization strategies for multiple machines have also been proposed. For example, Zhang et al. [9] proposed an asynchronous mini-batch stochastic gradient descent algorithm (ASGD) for training deep neural networks (DNNs) on multi-GPU devices and achieved a 3.2-times speed-up on four GPUs over a single GPU without loss of precision. Recently, distributed machine learning algorithms for multiple datacenters located in different regions have been studied in [10]. However, little attention has been paid to data security and the impact of the data distribution on performance.

To address the drawbacks of distributed learning, researchers have proposed an interesting framework that trains a global model while keeping the private data local, known as federated learning [11, 12, 13]. The federated approach makes it possible to extract knowledge from the data distributed on local devices without uploading private data to a central server. Fig. 2 illustrates the simplified workflow and an application scenario in the process industry. Several extensions have been introduced to the standard federated learning system. Zhao et al. [14] observed the weight divergence caused by extreme data distributions and proposed sharing a small amount of data with other clients to enhance the performance of federated learning algorithms. Wang et al. [15] proposed adaptive federated learning systems under a given resource budget based on a control algorithm that balances the client updates and the global aggregation, and analyzed the convergence bound of distributed gradient descent. Recent comprehensive overviews of federated learning can be found in [16, 17], and the design ideas, challenges and future research directions of federated learning on massive numbers of mobile devices are presented in [18, 19].

Since users may pay more attention to privacy protection and data security, federated learning will play a key role in deep learning, although it faces challenges in terms of data distribution and communication costs.

I-1 Data Distribution

The data generated by different clients, e.g., factories, may be unbalanced in size and may not satisfy the independent and identically distributed (IID) assumption, resulting in unbalanced and/or non-IID datasets.

I-2 Communication Costs

Federated learning is affected by the rapidly growing depth of models and the number of multiply-accumulate operations (MACs) [20], because massive communication costs are incurred for uploading and downloading models, while upload and download speeds are typically asymmetric, e.g., a mean mobile download speed of 26.36 Mbps versus an upload speed of 11.05 Mbps in the UK in Q3-Q4 2017 [21].

Fig. 1: The diagram of model parallelism. The complete model is distributed and stored on multiple clients, and the model is stitched together after the training procedure on the clients is finished.

Obviously, high communication costs are one of the main obstacles to distributed and federated training. Although the initial research on model compression was not intended to reduce communication costs, it has been a source of inspiration for communication-efficient distributed learning. Neural network pruning is an early model compression method proposed in [22]. Parameter pruning and sharing [22, 23], low-rank factorization [24], transferred/compact convolutional filters [25] and knowledge distillation [26] are some of the main ideas reported in the literature. Reducing the communication costs by simultaneously maximizing the learning performance and minimizing the model complexity with a multi-objective evolutionary algorithm is reported in [27]. Recently, a layer-wise asynchronous model update approach has been proposed in [28] to reduce the number of parameters to be transmitted.

Gradient quantization has been proposed to accelerate data-parallel distributed learning [29]; gradient sparsification [30] and gradient quantization [31, 32] have been developed to reduce the model size; and an efficient federated learning method based on sparse ternary compression (STC) has been proposed [33], which is robust to non-IID data and communication-efficient for both upstream and downstream communications. However, since STC compresses the model only after local training is completed, the quantization process is not optimized during training.

To the best of our knowledge, most federated learning methods either use full-precision networks or streamline the models after training on the clients is finished, rather than simplifying the models during training. Therefore, deploying federated learning on the IoT devices widely used in the process industry remains difficult. To address this issue, we focus on compressing the client models during training to reduce the energy consumption at the inference stage and the communication costs of federated learning. The main contributions of this paper are summarized as follows:

Fig. 2: An illustration of the application of federated learning in the process industry. The branch factories iteratively train local models on private client data and send the trained local models to the server for aggregation to obtain the optimized global model.
  • A ternary quantization approach is introduced into the training and inference stages of clients. The trained ternary models are well suited for inference in network edge devices, e.g., wireless sensors.

  • A ternary federated learning protocol is presented to reduce the communication costs between the clients and the server, compressing both upstream and downstream communications. Note that quantizing the model weights can further enhance privacy protection, since it makes reverse engineering of the model more difficult.

  • A theoretical analysis of the proposed algorithm is provided and the performance of the algorithm is empirically verified on a deep feedforward network (MLP) and a deep residual network (ResNet) using widely used datasets, namely MNIST [34] and CIFAR10 [35].

The remainder of this paper is organized as follows. In Section II, we briefly review the standard federated learning protocol and several widely used network quantization approaches. Section III proposes a method for quantizing the client models in federated learning systems, called federated trained ternary quantization (FTTQ), based on the quantization algorithms mentioned earlier. On the basis of FTTQ, a ternary federated learning protocol that reduces both upstream and downstream communications is presented. In Section IV, a theoretical analysis of the proposed algorithms is provided. Experimental settings and results are presented in Section V to compare the new protocol with standard algorithms. Finally, conclusions and future directions are given in Section VI.

II Background and Methods

In this section, we first introduce some preliminaries of the standard federated learning workflow and its basic formulations. Subsequently, the definitions and main features of popular ternary quantization methods are presented, followed by a numeric example.

II-A Federated Learning Protocol

It is usually assumed that the data used by a distributed learning algorithm belongs to the same feature space, which may not be true in federated learning. As illustrated in Fig. 3, the basic protocol of federated learning proposed in [18] is round-based. Specifically, private storage, the server and the clients (usually mobile devices) are the main participants in the protocol, and each round consists of three main phases: selection, configuration and reporting.

Fig. 3: Federated learning workflow with massive numbers of mobile devices. First, the server selects suitable clients and deploys the configuration (global model structure) for training. Then the clients complete the learning process within the specified budget (e.g., time) and return the local models to the server. Finally, the server aggregates all local models to obtain the trained global model.

In this work, we assume supervised learning is used to train the models. The global model with parameters $w$, deployed on distributed client $k$, is trained on the local training dataset $D_k$, which consists of $n_k$ training sample pairs $(x_i, y_i)$, $i \in D_k$. The loss of a sample pair $(x_i, y_i)$ is denoted by $f_i(w) = \ell(x_i, y_i; w)$, where $\ell$ is the loss function. The total loss of a certain task on client $k$ is:

$F_k(w) = \frac{1}{n_k} \sum_{i \in D_k} f_i(w)$    (1)

We assume there are $K$ clients whose data is stored independently, and the aim of federated learning is to minimize the global loss $F(w)$. Therefore, the global objective function of the federated learning system can be defined as:

$F(w) = \sum_{k \in S} p_k F_k(w), \quad p_k = \frac{n_k}{\sum_{j \in S} n_j}$    (2)

where $S$ is the set of the $C \cdot K$ clients participating in the aggregation in the current round, and $C$ is the participation proportion, which is determined by the above three phases.

Theoretically, the participation ratio $C$ in (2) is calculated as the number of participating clients divided by the total number of clients $K$. Additionally, the local batch size $B$ and the number of local epochs $E$ are also important hyperparameters. In the experiments, we further study the effects of these parameters on the performance of the algorithm by setting their values manually.
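
To make the aggregation step in (2) concrete, the following NumPy sketch (our own illustration, not the authors' implementation; the client weights $p_k$ are taken as the relative local dataset sizes) averages a set of client models:

import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client models, p_k = n_k / n as in (2).

    client_weights: list of per-client layer lists (one numpy array per layer);
    client_sizes:   number of local samples n_k on each client.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    global_model = []
    for l in range(n_layers):
        layer = sum((n_k / total) * w[l] for w, n_k in zip(client_weights, client_sizes))
        global_model.append(layer)
    return global_model

# Toy usage: three one-layer "models" with unequal data sizes.
clients = [[np.ones(4)], [2 * np.ones(4)], [4 * np.ones(4)]]
print(fedavg_aggregate(clients, client_sizes=[10, 20, 70]))  # -> [array([3.3, 3.3, 3.3, 3.3])]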

It is easy to see that the communication costs depend heavily on the amount of information transferred between the server and the clients, and the dominating factor in this procedure is the size of the model parameters. One of the requirements for communication-efficient federated learning is that both upstream and downstream communications need to be compressed [33]. Note that the performance of federated learning may drop dramatically when the data distribution is highly disproportionate.

II-B Quantization

Quantization improves the energy and space efficiency of deep networks by reducing the number of bits per weight [36]. This is done by mapping the parameters from a continuous space to a discrete quantization space, which can greatly reduce model redundancy and save memory. It has been shown that the ternary weight network (TWN for short) [32] is able to reduce the Euclidean distance between the quantized parameters $\alpha W^t$ (with $W^t$ consisting of -1, +1 and 0) and the full-precision parameters $W$ by means of a scaling factor $\alpha$, compared with binary networks [31], thus making the accuracy of the quantized network close to that of the full-precision network:

$\alpha^*, W^{t*} = \arg\min_{\alpha, W^t} J(\alpha, W^t) = \|W - \alpha W^t\|_2^2, \quad \text{s.t. } \alpha \geq 0, \; W^t_i \in \{-1, 0, +1\}$    (3)

where $J$ represents the cost function of this optimization problem, and $\alpha^*$ and $W^{t*}$ are the optimal solutions to (3). Li et al. [32] introduce an approximate optimal solution with a threshold-based function to quantize all layers of a deep neural network:

$W^t_i = \begin{cases} +1, & W_i > \Delta \\ 0, & |W_i| \leq \Delta \\ -1, & W_i < -\Delta \end{cases}$    (4)

where $W$ and $W^t$ are the full-precision and quantized weights of a layer, and $\Delta^* \approx 0.7 \cdot E(|W|)$ and $\alpha^* = \frac{1}{|I_\Delta|}\sum_{i \in I_\Delta} |W_i|$ (with $I_\Delta = \{i \mid |W_i| > \Delta\}$), respectively, which provide a rule of thumb to calculate the optimal $\Delta$.

However, the ternary weights of TWN are limited to {-1, 0, +1} scaled by a single constant $\alpha$ per layer. In order to further improve the performance of quantized deep networks while maintaining the compression ratio, Zhu et al. [37] have proposed the trained ternary quantization (TTQ) algorithm. In TTQ, two quantization factors (a positive factor $w^p$ and a negative factor $w^n$) are adopted to scale the ternary weights in each layer.

Fig. 4: An example of how the TTQ algorithm works. Firstly, the normalized full-precision weights and biases are quantized to {-1, 0, +1} using the given per-layer threshold. Secondly, positive and negative quantization factors are used to scale the quantized weights. Finally, the calculated gradients are back-propagated to each layer. The right part in the dotted rectangle represents the inference stage.

The workflow of TTQ is illustrated in Fig. 4: the normalized full-precision weights are quantized by $w^p$, $w^n$ and the threshold $\Delta_l$, while the activations remain full-precision. Instead of using the optimized threshold of TWN, TTQ adopts a heuristic method to calculate $\Delta_l$:

$\Delta_l = t \times \max(|\tilde{w}_l|)$    (5)

where $t$ is a constant factor determined by experience and $\tilde{w}_l$ denotes the normalized full-precision weights of layer $l$.
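
As a rough illustration of this scheme (a sketch under the definitions above, not the reference TTQ implementation; the function name and the use of NumPy are our own), the following code applies the heuristic threshold of (5) and scales the positive and negative parts with two factors $w^p$ and $w^n$:

import numpy as np

def ttq_quantize(w, w_p, w_n, t=0.05):
    """Ternary quantization in the style of TTQ (illustrative sketch).

    w:   normalized full-precision weights of one layer, in [-1, 1]
    w_p: trainable positive scaling factor (> 0)
    w_n: trainable negative scaling factor (> 0)
    t:   heuristic constant of (5); threshold = t * max(|w|)
    """
    delta = t * np.max(np.abs(w))       # layer-wise threshold, eq. (5)
    w_q = np.zeros_like(w)
    w_q[w > delta] = w_p                # +1 entries are scaled by w_p
    w_q[w < -delta] = -w_n              # -1 entries are scaled by -w_n
    return w_q                          # entries in {-w_n, 0, +w_p}

# Example: quantize a small weight vector.
w = np.array([0.9, -0.8, 0.02, -0.03, 0.4])
print(ttq_quantize(w, w_p=0.7, w_n=0.6))   # [ 0.7 -0.6  0.   0.   0.7]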

III Proposed Algorithm

In this section, we first propose federated trained ternary quantization (FTTQ for short) to reduce the energy consumption of each client during inference as well as the upstream and downstream communications. Subsequently, a ternary federated averaging protocol (T-FedAvg for short) is suggested.

III-A Federated Trained Ternary Quantization

Since no direct data exchange is usually allowed between the clients in a federated learning system, weight divergence [14] may arise among the clients. For example, consider the $l$-th layer of the global model shared by client $a$ and client $b$: if $\max(|w_{l,a}|) = 5$ and $\max(|w_{l,b}|) = 50$, it is not necessarily true that the global model should be biased towards client $b$, which would happen if we used the same scaling factor for the two models. To address this issue, we start by scaling the weights to [-1, 1]:

$\tilde{w}_k = \mathrm{scale}(w_k) = \frac{w_k}{\max(|w_k|)}$    (6)

where $\mathrm{scale}(\cdot)$ is the scaling function and $\tilde{w}_k \in [-1, 1]$. However, magnitude imbalance [38] may be introduced when the entire weight set of a network is scaled at once, resulting in a significant loss of precision, since most of the elements are pushed to zero. Therefore, we scale the weights layer by layer.

Then, using the same strategy as TTQ, we calculate the quantization threshold from the scaled weights as follows:

$\Delta_k = t \times \max(|\tilde{w}_k|)$    (7)

where $t$ is a hyperparameter with a default setting of 0.05 on client $k$, and $\Delta_k > 0$.

However, according to (6) and (7), the thresholds of all layers are mostly the same, since the maximum absolute value of the scaled $\tilde{w}_k$ is 1 in most layers. Thus, the model capability may be affected by the homogeneity of the threshold. To avoid this issue, we propose an alternative threshold calculation criterion:

$\Delta_k = \frac{t}{n} \sum_{i=1}^{n} |\tilde{w}_k^{(i)}|$    (8)

where $n$ is the number of neurons and $\Delta_k$ is calculated layer-wise. Obviously, the threshold obtained by (8) is influenced by the layer sparsity and can be seen as an extension of (7):

$\Delta_k = t \cdot E(|\tilde{w}_k|)$    (9)

Notably, the threshold turns into the optimal solution proposed in [32] if we set the value of $t$ to 0.7. The calculation method of $\Delta_k$ is generally adjusted according to the performance.

Subsequently, several operations are applied to achieve layer-wise weight quantization, alleviating the computational burden and reducing the communication costs:

$m_k = \odot(|\tilde{w}_k| - \Delta_k)$    (10)
$b_k = \mathrm{sign}(\tilde{w}_k) \otimes m_k$    (11)
$\hat{w}_k = w_k^q \otimes b_k$    (12)

where $\odot$ is the step function and $\otimes$ is the Hadamard product, $w_k^q$ is an independent quantization vector which is trained together with the other parameters layer by layer, and $\hat{w}_k$ is the quantized ternary weights. Consequently, the mask matrix $m_k$ can be rewritten as the union of a positive index set and a negative index set of the local model:

$I_k^{pos} = \{i \mid \tilde{w}_k^{(i)} > \Delta_k\}$    (13)
$I_k^{neg} = \{i \mid \tilde{w}_k^{(i)} < -\Delta_k\}$    (14)

Different from the standard TTQ, we adopt a single quantization factor $w_k^q$ per layer, which is updated with its gradients together with the other parameters, instead of the previous two quantization factors in each layer, mainly for the following reasons.

III-A1 Stability

Large weight divergence will be encountered after synchronization if participating clients are initialized with different parameters, which leads to performance degeneration [14]. Hence, the weight divergence should be minimized at each layer in the federated learning environment.

III-A2 Energy Consumption

We present a proposition about the convergence trends of $w^p$ and $w^n$ and its proof in Section IV, followed by experimental results on the two factors with different initial values when training MLP and ResNet in Appendix A. It is worth noting that the trends of the positive and negative quantization factors of the TTQ algorithm are almost the same in all layers. Hence, the energy consumed in calculating the gradients of two quantization factors during back-propagation can be cut in half if only one quantization factor is retained, which is important for resource-constrained clients.

After quantizing the whole network, the loss function can be calculated and the errors back-propagated in the same way as for continuous weights, except that the weights are $\pm w_k^q$ or zero. The gradients of $w_k^q$ and of the latent full-precision model are calculated according to the rules in [37]. The new update rule is summarized in Algorithm 1. Consequently, FTTQ significantly reduces the size of the updates transmitted to the server, thus reducing the costs of upstream communication. However, the costs of downstream communication will not be reduced if no additional measures are taken, since the weights of the global model can no longer be decomposed into a coefficient and a ternary matrix after aggregation. To address this issue, a ternary federated learning protocol is presented in the next section.

Input: Full-precision parameters $w$, quantization vector $w^q$, loss function $\ell$, dataset $D$ with sample pairs $(x_i, y_i)$, learning rate $\eta$.
Output: Quantized model $\hat{w}$
init: All client parameters are initialized with $w$.
for each local training iteration do
       scale $w$ layer-wise by (6); compute $\Delta$ by (8); quantize $w$ to $\hat{w}$ by (10)-(12)
       compute the loss with $\hat{w}$ on a mini-batch; update $w$ and $w^q$ with their gradients and learning rate $\eta$
end for
Return $\hat{w}$ (including $w^q$, $b$)
Algorithm 1 Federated Trained Ternary Quantization (FTTQ)
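
The per-layer quantization step of Algorithm 1 can be summarized by the following NumPy sketch (our own illustration of (6)-(12); the helper name fttq_quantize_layer and the variable w_q for the single trained factor are ours):

import numpy as np

def fttq_quantize_layer(w, w_q, t=0.05):
    """FTTQ forward quantization of one layer (sketch of (6)-(12)).

    w:   full-precision weights of the layer
    w_q: the single trainable quantization factor of this layer
    t:   threshold hyperparameter (default 0.05)
    """
    w_scaled = w / np.max(np.abs(w))                      # scale to [-1, 1], eq. (6)
    delta = t * np.mean(np.abs(w_scaled))                 # sparsity-aware threshold, eq. (8)
    mask = (np.abs(w_scaled) > delta).astype(w.dtype)     # step function, eq. (10)
    ternary = np.sign(w_scaled) * mask                    # entries in {-1, 0, +1}, eq. (11)
    return w_q * ternary                                  # quantized weights, eq. (12)

# Example: a layer quantized with a trained factor w_q = 0.8.
w = np.array([0.5, -0.004, 1.2, -0.9, 0.001])
print(fttq_quantize_layer(w, w_q=0.8))                    # [ 0.8  0.   0.8 -0.8  0. ]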
Fig. 5: The diagram of the proposed T-FedAvg. The blue part runs on the clients with normalized full-precision weights, similar to the standard federated learning framework; the quantization factors, thresholds and ternary local models are then pushed to the server, as shown in the orange part; after that, the global model is obtained by server-side aggregation and normalization; finally, the global model is quantized and pushed back to all clients.

III-B Ternary Federated Averaging

The two-step scheme of the proposed ternary federated averaging protocol with private data is illustrated in Fig. 5. In general, the participating clients quantize the normalized local models and upload the thresholds, quantization factors and ternary models to the server. The server then aggregates all local models to obtain the global model. Finally, the server quantizes the normalized global model again using a fixed threshold and pushes the quantized global model back to all clients. The basic flow is described as follows.

III-B1 Upstream

Let $S = \{1, 2, \ldots, C \cdot K\}$ be the set of indices of the randomly selected clients that participate in the aggregation, where $C$ is the participation ratio and $K$ is the total number of clients in the federated learning system. The local scaled full-precision and quantized weights of client $k$ are denoted by $\tilde{w}_k$ and $\hat{w}_k$, respectively. We upload the trained $\hat{w}_k$ (i.e., $w_k^q$ and $b_k$) to the server instead of the updates after the local iterations, although the two are equivalent [11]. At the inference stage, only the quantized model is needed for prediction.

III-B2 Downstream

After each communication round, the server rebuilds all models received from the participating clients, and the global model is calculated by weighted averaging of the rebuilt local models. The server then quantizes the global model again with a constant threshold $t$ (default setting 0.05) and pushes the quantized model back to the clients.
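
A minimal sketch of this server-side step (our own illustration of the protocol, assuming for simplicity equal aggregation weights over the selected clients) is given below:

import numpy as np

def server_round(ternary_models, factors, t=0.05):
    """Rebuild, aggregate and re-quantize the global model (illustrative sketch).

    ternary_models: per-client lists of layer arrays with entries in {-1, 0, +1}
    factors:        per-client lists of layer-wise quantization factors w_q
    t:              fixed threshold used to re-quantize the global model (default 0.05)
    """
    n_clients = len(ternary_models)
    n_layers = len(ternary_models[0])
    global_q = []
    for l in range(n_layers):
        # rebuild each local layer as w_q * ternary and average over the clients
        rebuilt = [factors[c][l] * ternary_models[c][l] for c in range(n_clients)]
        layer = sum(rebuilt) / n_clients
        # normalize and re-quantize the aggregated layer before broadcasting it
        layer = layer / np.max(np.abs(layer))
        delta = t * np.max(np.abs(layer))        # equals t, since max(|layer|) = 1
        global_q.append(np.sign(layer) * (np.abs(layer) > delta))
    return global_q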

Input: Initial global model parameters $w_G$
Init: Broadcast $w_G$ to clients $k = 1, \ldots, K$; assign each client a unique dataset $D_k$. for round r = 1, …, R do
       for each client $c \in S_r$ (randomly selected participants) in parallel do
             Client $c$ does: 
                   download the quantized $w_G$ and initialize the local model; $\hat{w}_c \leftarrow$ FTTQ(local model, $D_c$); upload $\hat{w}_c$ (with $w_c^q$) to the server
             end
            
       end for
      Server does: 
             rebuild and aggregate $\{\hat{w}_c\}$ to obtain $w_G$; quantize $w_G$ with threshold $t$; broadcast the quantized $w_G$ to all clients
       end
      
end for
Algorithm 2 Ternary Federated Averaging

Unlike standard federated learning algorithms, our method compresses communication during both the upload and download phases, which brings major advantages when deploying DNNs on resource-constrained devices at the inference stage. Specifically, the clients reduce the local networks from 32-bit to 2-bit, push the 2-bit networks and the quantization parameters to the server, and then download the quantized global model from the server at the end of the round. For example, if we configure a federated learning environment with 20 clients and a global model that requires 25 MB of storage space, the total communication cost of standard federated learning is about 1 GB per round (upload and download). By contrast, our method reduces the cost to about 65 MB per round (upload and download), roughly 1/16 of the standard method. Note that quantizing the global model pushed to the clients also makes reverse engineering more difficult. The overall workflow of the proposed ternary federated protocol is summarized in Algorithm 2.
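
The savings quoted above follow from simple byte counting; the short calculation below (added for illustration, ignoring the small overhead of the per-layer factors and thresholds) reproduces the roughly 1/16 ratio:

# Communication per round for 20 clients and a 25 MB (32-bit) global model.
n_clients = 20
model_mb_32bit = 25.0

full_precision = n_clients * model_mb_32bit * 2        # upload + download
ternary = n_clients * model_mb_32bit * (2 / 32) * 2    # 2-bit weights, upload + download

print(full_precision)             # 1000.0 MB per round (~1 GB)
print(ternary)                    # 62.5 MB per round (~65 MB including factors/thresholds)
print(full_precision / ternary)   # 16.0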

IV Theoretical Analysis

In this section, we first formally analyze the properties of the two quantization factors $w^p$ and $w^n$ in TTQ, followed by a proof of the unbiasedness of FTTQ and T-FedAvg.

By default in this paper, the subscripts represent the indices of the elements in a network instead of the indices of the clients in the federated learning system.

IV-A The Convergence of Quantization Factors in TTQ

The experimental results on the convergence profiles of $w^p$ and $w^n$ for two widely used neural networks are presented in Appendix A, showing that the two factors converge to the same value. To prove this convergence theoretically, we first introduce the following assumption.

Assumption IV.1

The elements in the scaled full-precision weights $\tilde{w}$ are uniformly distributed between 0 and 1:

$\tilde{w}^{(i)} \sim U(0, 1)$    (15)

Then we have the following proposition.

Proposition IV.1

Given a one-layer online gradient system whose parameters are initialized element-wise from a symmetric distribution centered at 0, e.g., $U(-1, 1)$, and quantized by TTQ with two iteratively updated factors $w^p$, $w^n$ and a fixed threshold $\Delta$, we have:

$\lim_{e \to \infty} \left( |w^{p(e)}| - |w^{n(e)}| \right) = 0$    (16)

where $e$ is the training epoch and $w^{p(e)}$, $w^{n(e)}$ denote the values of the two factors at epoch $e$.

Proof IV.1

The converged $w^p$ and $w^n$ can be regarded as the optimal solution of the quantization factors, i.e., the values that minimize the Euclidean distance between the full-precision weights $\tilde{w}$ and the quantized weights $\hat{w}$, which is equal to $\|\tilde{w} - \hat{w}\|_2^2$. Then we have:

$J(w^p, w^n) = \sum_{i \in I^{pos}} \left(\tilde{w}_i - w^p\right)^2 + \sum_{i \in I^{neg}} \left(\tilde{w}_i + w^n\right)^2 + \sum_{i \in I^0} \tilde{w}_i^2$    (17)

where $I^{pos} = \{i \mid \tilde{w}_i > \Delta\}$, $I^{neg} = \{i \mid \tilde{w}_i < -\Delta\}$ and $I^0 = \{i \mid |\tilde{w}_i| \leq \Delta\}$, and according to (4) we have

$\hat{w}_i = \begin{cases} +w^p, & i \in I^{pos} \\ 0, & i \in I^0 \\ -w^n, & i \in I^{neg} \end{cases}$    (18)

Then the original problem can be transformed to

$\min_{w^p, w^n} \; |I^{pos}| (w^p)^2 - 2 w^p \sum_{i \in I^{pos}} \tilde{w}_i + |I^{neg}| (w^n)^2 + 2 w^n \sum_{i \in I^{neg}} \tilde{w}_i + c$    (19)

where $c$ is a constant independent of $w^p$ and $w^n$. Hence the optimal solution of (19) is obtained when

$w^{p*} = \frac{1}{|I^{pos}|} \sum_{i \in I^{pos}} \tilde{w}_i, \quad w^{n*} = -\frac{1}{|I^{neg}|} \sum_{i \in I^{neg}} \tilde{w}_i$    (20)

Since the weights are distributed symmetrically, $|w^{p*}|$ and $|w^{n*}|$ converge to the same value in expectation. This completes the proof.

IV-B The Unbiasedness of FTTQ

Here, we first prove the unbiasedness of FTTQ. To simplify the original problem, we adopt an assumption that is common in network initialization.

With Assumption IV.1, we prove Proposition IV.2:

Proposition IV.2

Let $\tilde{w}_k$ be the local scaled network parameters of one client in a given federated learning system, as defined in Assumption IV.1. If $\tilde{w}_k$ is quantized by the FTTQ algorithm, then we have

$E(\hat{w}_k) = E(\tilde{w}_k)$    (21)
Proof IV.2

According to (20), $w_k^q$ is calculated from the elements indexed by $I_k^{pos}$, where $\Delta_k$ is a fixed number once the parameters are generated under Assumption IV.1; hence the elements indexed by $I_k^{pos}$ obey a new uniform distribution between $\Delta_k$ and 1, and we have

$\tilde{w}_k^{(i)} \sim U(\Delta_k, 1), \quad i \in I_k^{pos}$    (22)

therefore, the probability density function $f(x)$ of $x \in (\Delta_k, 1)$ can be regarded as $\frac{1}{1 - \Delta_k}$.

According to Proposition IV.1 and (20), we have:

$w_k^q = \frac{1}{n_I} \sum_{i \in I_k^{pos}} |\tilde{w}_k^{(i)}|$    (23)

where $\tilde{w}_k^{(i)}$ is an arbitrary element indexed by $I_k^{pos}$ and $n_I$ represents the number of elements in $I_k^{pos}$.

We know that

(24)

and since

(25)

hence

(26)

and under Assumption IV.1, we have

(27)

then it is immediate that

(28)

Hence, the FTTQ quantizer output can be considered as an unbiased estimator of the input [39]. We can therefore guarantee the unbiasedness of FTTQ in federated learning systems when the weights are uniformly distributed. Furthermore, since the distribution of the network weights in most layers may be non-uniform due to stochastic errors from the data, the self-learned factor $w_k^q$ may reduce the quantization errors, which can be regarded as a form of non-uniform sampling.
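
As an illustrative numerical check of this property (our own addition, not part of the original analysis; it draws the weights from a symmetric uniform distribution, as in the setting of Proposition IV.1, and uses the mean magnitude of the retained weights as the quantization factor), the expectations of the input and of the quantized output can be compared empirically:

import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=1_000_000)   # scaled weights, symmetric around 0

t = 0.05
delta = t * np.mean(np.abs(w))               # threshold as in (8)
mask = np.abs(w) > delta
w_q = np.mean(np.abs(w[mask]))               # factor: mean magnitude of the kept weights
w_hat = w_q * np.sign(w) * mask              # ternary quantized weights

print(np.mean(w), np.mean(w_hat))            # both close to 0: the quantizer is unbiased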

IV-C The Properties of T-FedAvg

Here, we adopt the following assumption, which is widely used in the literature [14], to analyze the properties of T-FedAvg.

Assumption IV.2

When a federated learning system with $K$ clients and one server is established, all clients are initialized with the same global model.

Under the above assumption, Zhao et al. [14] proved the following conclusions:

Lemma IV.1

The weight divergence that leads to performance degeneration after several rounds of synchronization between the clients and the server mainly comes from two parts: the weight divergence inherited from the previous rounds of aggregation (measured relative to the weights obtained in the centralized setting), and the weight divergence resulting from the Earth mover's distance (EMD) between the data distribution on each client and the actual distribution of the whole data population.

Lemma IV.2

When all the clients are initialized with the same global model, the EMD between the data distribution on each client and the distribution of the whole data population becomes the main cause of the performance degeneration.

Since our method is unbiased and can reduce the Euclidean distance between the quantized network and the full-precision network, we can conclude that T-FedAvg can also perform well if the original FedAvg converges to the optimal solution.

V Experimental Results

This section evaluates the performance of the proposed method on widely used benchmark datasets. We set up multiple controlled experiments to examine the performance compared with the standard federated learning algorithm in terms of the test accuracy and communication costs. In the following, we present the experimental settings and the results.

V-A Settings

To evaluate the performance of the proposed network quantization and ternary protocol in federated learning systems, we first conduct experiments with 10 independent physical clients connected by a Local Area Network (LAN). Then, we test the obtained model in the simulation environment with the number of clients varying from 10 to 100.

The physical system consists of four CPU laptops connected wirelessly through a LAN to mimic low-power mobile devices; one of them acts as the server that aggregates the models, and the remaining laptops act as clients participating in the federated training. Each client communicates only with the server, and there is no information exchange between the clients.

For the simulations, we typically use 10 clients, in accordance with the number of classes in the datasets. A detailed description of the configuration is given below.

1) Compared algorithms. In this work, we compare the following algorithms:


  • Baseline: the centralized learning algorithm, e.g., the stochastic gradient descent (SGD) method, where all data is stored in a single computing center and the model is trained directly on the entire data.

  • FedAvg: the canonical federated learning approach presented in [11].

  • TTQ: the canonical trained ternary quantization method, in which the configuration is the same as the baseline, i.e., the data is stored in a centralized manner and a model is trained using all the data.

  • T-FedAvg: our proposed quantized federated learning approach.

2) Datasets. We select two representative benchmark datasets that are widely used for classification; no data augmentation is used in any of the experiments.


  • MNIST [34]: it contains 60000 training and 10000 test gray-scale handwritten digit images with 10 classes, where the dimension of each image is 28×28. Since the features of MNIST are easily extracted, this dataset is mainly used to train small networks.

  • CIFAR10 [35]: it contains 60000 colored images of 10 types of objects from frogs to planes, 50000 for training and 10000 for testing. It is a widely used benchmark dataset from which it is relatively difficult to extract features.

3) Models. To evaluate the performance of the above algorithms, two popular deep learning models are selected: MLP and ResNet, which represent tiny and large models, respectively. The detailed settings are as follows:


  • MLP: it is mainly used for training on small datasets, e.g., MNIST. The model contains two hidden layers with 30 and 20 neurons, respectively. For centralized and distributed training, the learning rate is set to the same value and the ReLU function is selected as the activation function.

  • ResNet18: it is a simplified version of the widely used ResNet [40], where the number of input and output channels of all convolutional layers is reduced to 64. It is a typical benchmark model for evaluating the performance of algorithms on large datasets.

4) Data distribution. The performance of federated learning is affected by the characteristics of the training data stored on the separate clients. To investigate the impact of different data distributions, several types of splits are generated:


  • IID data: each client holds an IID subset of the data containing all 10 classes.

  • Non-IID data: the union of the samples on all clients is the entire dataset, but the number of classes $N_c$ contained on each client is not equal to the total number of categories in the entire dataset (10 for MNIST and CIFAR10). We use the labels to assign samples of $N_c$ classes to each client (see the sketch after this list). In the extremely non-IID case, $N_c$ = 1 for each client, but this case is generally not considered, since there is no need to train a model (e.g., a classifier) if only one class is stored on each client.

  • Unbalancedness in data size: typically, the sizes of the datasets on different clients vary a lot. To investigate the influence of unbalancedness in data size in the federated learning environment, we split the entire dataset into several parts of distinctly different sizes.
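
A simple way to generate such label-based non-IID splits (an illustrative sketch with hypothetical helper names, not the authors' data pipeline) is to give each client shards drawn from $N_c$ classes only:

import numpy as np

def split_non_iid(labels, n_clients=10, n_classes_per_client=2, seed=0):
    """Assign sample indices so that each client only sees n_classes_per_client labels."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    client_indices = [[] for _ in range(n_clients)]
    for c in range(n_clients):
        # pick N_c classes for this client (classes may overlap between clients)
        chosen = rng.choice(classes, size=n_classes_per_client, replace=False)
        for cls in chosen:
            idx = np.where(labels == cls)[0]
            # hand the client a random shard of this class
            shard = rng.choice(idx, size=len(idx) // n_clients, replace=False)
            client_indices[c].extend(shard.tolist())
    return client_indices

# Toy usage: 10 labels with 10 samples each, 10 clients, 2 classes per client.
labels = np.repeat(np.arange(10), 10)
parts = split_non_iid(labels, n_clients=10, n_classes_per_client=2)
print([len(p) for p in parts])   # each client holds samples from only 2 classes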

5) Basic configuration. The basic configuration of the federated learning system in our experiments is set as follows:


  • Total number of clients: $K$ = 100.

  • Participation ratio per round: $C$ = 0.1.

  • Classes per client: $N_c$ = 10.

  • Local batch size: $B$ = 64.

  • Local epochs: $E$ = 5.

All experimental settings are summarized in Table I.

Models             MLP        ResNet
Dataset            MNIST      CIFAR10
Optimizer          SGD        Adam
Learning rate      0.0001     0.008
Baseline           92.75%     86.30%
Parameter amount   24330      607050
TABLE I: Models and hyperparameters.

The learning rates of the centralized and federated learning algorithms are the same and remain constant during training. Note that a small learning rate is set for training MLP to slow down the convergence speed for easier observation.

V-B Performance on IID Data

In this part, we conduct experiments on IID MNIST and CIFAR10 using the benchmark algorithms with MLP and ResNet mentioned above, where the baseline and TTQ are representatives of centralized approaches.

Specifically, the data used by the centralized methods is stored centrally in one computing center, while the data used by FedAvg and T-FedAvg is stored separately. To explore the best performance of FedAvg and T-FedAvg, the federated learning environment is set up with 10 fully participating clients, each holding an IID subset of the data containing all 10 classes.

The results and model weight widths are summarized in Table II. The test accuracies achieved by the baseline algorithm and TTQ are 92.75% and 92.87%, respectively, when trained on MNIST with MLP, and 86.30% and 85.73%, respectively, on CIFAR10 with ResNet. TTQ shows a slight deterioration in performance on CIFAR10 when the model complexity is reduced.

Methods     MNIST                  CIFAR10
            Accuracy    Width      Accuracy    Width
Baseline    92.75%      32 bit     86.30%      32 bit
FedAvg      92.37%      32 bit     85.72%      32 bit
TTQ         92.87%      2 bit      85.73%      2 bit
T-FedAvg    92.75%      2 bit      86.60%      2 bit
TABLE II: Test accuracies and weight widths achieved by different algorithms when trained on IID data.

When the data distribution is IID, FedAvg achieves accuracies of 92.37% and 85.72% on MNIST and CIFAR10, respectively. T-FedAvg, whose model is about 1/16 of the full-precision model size, achieves 92.75% on MNIST and the highest accuracy on CIFAR10, 86.60%. It is worth noting that, as the network becomes deeper, the quantization error declines, and the quantized model may even exceed the accuracy of the original model, as evidenced by the performance of T-FedAvg with ResNet.

The convergence speeds of the four methods for different local iterations are illustrated in Fig. 6, where the curves of the centralized methods (the baseline and TTQ) are plotted at intervals of $E$ epochs so that they correspond to the communication rounds of the federated learning environment. Overall, the convergence speed of our method is the fastest when trained on MNIST, and it is slightly slower than FedAvg on CIFAR10 in the initial phase, which depends on the performance of TTQ.

Fig. 6: Convergence speed of the four compared algorithms over communication rounds or epochs with the same models.

The test accuracies achieved by FedAvg and T-FedAvg for various batch sizes are shown in Fig. 7. We notice that our method outperforms FedAvg for small batch sizes. Since small batches mean more iterations and can thus reduce the quantization errors, this is beneficial for clients with limited computing resources. However, the performance of T-FedAvg is less robust than that of FedAvg for large local batch sizes. This may be attributed to the fact that insufficient local training leads to an accumulation of quantization errors when the local batch size is large.

Fig. 7: Maximum accuracies achieved by FedAvg and T-FedAvg when training MLP on MNIST (left) and ResNet18 on CIFAR10 (right), respectively, for 100 rounds with different batch sizes. Ten clients with full participation are involved in all experiments.

V-C Performance with Different $N_c$

The boxplots of the data distributions for different $N_c$ are depicted in Fig. 9, where the y-axis represents the sample label (0-9). As shown in the figure, the original distributions of the training and test data are IID. Specifically, the data distribution is IID in the case of $N_c$ = 10, which means that each client has an IID subset of the entire dataset (refer to the right plot). However, when $N_c$ = 2, the samples on each client are divided according to the label, which is non-IID, and have no overlap with other clients. Similarly, the samples on all clients are non-IID when $N_c$ = 5, but there is some overlap in data between clients. Clearly, the data distribution can be regarded as non-IID when $N_c$ is smaller than the total number of classes in the training data. In this case, the local stochastic gradient cannot be considered an unbiased estimate of the global gradient.

Fig. 8: Test accuracies achieved by MLP and ResNet trained on non-IID MNIST and CIFAR10 after a fixed number of rounds of FedAvg and T-FedAvg. The participation ratio is fixed to 1 and the training data is split among the clients with different $N_c$.
Fig. 9: Data distributions with different $N_c$. When $N_c$ = 2, there is no overlap in data between clients and each client contains two categories (left). When $N_c$ = 5, the samples on the 10 clients are drawn by label but there is some overlap between clients (middle). When $N_c$ = 10, the samples on the 10 clients are generated by random sampling (right). Note that only 3 clients are shown in the figures for $N_c$ = 2, 5 and 10.

Although FedAvg and T-FedAvg achieve satisfactory test accuracies on IID data, a significant degradation of the test performance of both methods is observed on non-IID data, as illustrated in Fig. 8 and Table III. Note that 10 clients are selected with full participation and no pre-trained model is used during training, i.e., the participation ratio is 1, in order to investigate the impact of $N_c$ in the non-IID setting.

Methods     MNIST                       CIFAR10
            $N_c$ = 2    $N_c$ = 5      $N_c$ = 2    $N_c$ = 5
FedAvg      86.69%       87.17%         52.10%       74.21%
T-FedAvg    87.10%       87.22%         52.35%       74.43%
TABLE III: Test accuracies achieved on non-IID data for different $N_c$.

However, as mentioned in previous work [11], federated learning suffers from extremely non-IID data distributions. When $N_c$ = 2 and 5, each client is randomly assigned 2 and 5 classes, respectively; both methods work well on MNIST, with an acceptable performance degradation. Nevertheless, a significant reduction in the test accuracy on CIFAR10 is observed for both methods when $N_c$ = 2, and increasing $N_c$ from 2 to 5 effectively alleviates the degeneration.

As is well known, the intricate features of CIFAR10 make it more difficult to train a model on than MNIST. Therefore, the performance gap on MNIST between $N_c$ = 2 and 5 is smaller than that on CIFAR10. During the training process, T-FedAvg outperforms the standard FedAvg in terms of convergence speed, and is similar to the baseline method at the early stage.

Theoretically, since T-FedAvg reduces the upstream and downstream communication costs, we can increase the number of clients or communication rounds within the same budget to alleviate the performance degeneration. Recently, a method that shares a small set of selected IID data among clients to alleviate the performance degeneration has also been proposed [14]. Although the method has certain limitations (e.g., the way the shared dataset is generated and the over-fitting problems it introduces), it is still a promising solution.

V-D Influence of the Participation Ratio

We investigate the effect of the participation ratio $C$ on T-FedAvg in this subsection. We fix the total number of clients and the local batch size to 100 and 64, respectively, throughout all experiments. Here, the experiments are done with MLP only, since the robustness of MLP to non-IID data (see Fig. 8) reduces the effect of model selection. Fig. 10 shows the test accuracies achieved by T-FedAvg during training on IID and non-IID MNIST in the federated learning environment with different participation ratios ($C$).

Fig. 10: Test accuracies achieved by T-FedAvg when training MLP on MNIST with IID distribution (left) and non-IID distribution (right) in fixed rounds at different participation ratios (0.1, 0.3, 0.5, 0.7).

As we can see, T-FedAvg is relatively robust to changes of the participation ratio on both IID and non-IID data. Although reducing the participation ratio has negative effects on the convergence speed and on the values reached within the fixed number of rounds, the negative effects are more pronounced on non-IID data (refer to the right panel of Fig. 10). A similar phenomenon has also been observed in [33]. We surmise that the performance degradation on non-IID data depends heavily on whether the features on the clients selected by the server for model aggregation are representative in the federated learning environment. It is common to increase the participation ratio to alleviate these negative impacts.

V-E Influence of Unbalancedness in Data Size

All experiments above were performed with a balanced split of the data, where all clients were assigned the same number of samples. In the following, we investigate the performance of the proposed algorithm under unbalancedness in the data size [33]. If we use $S$ to represent the set of the numbers of samples on the clients, we can define the degree of unbalancedness by the ratio $\gamma$:

$\gamma = \frac{\mathrm{median}(S)}{\mathrm{mean}(S)}$    (29)

where the median of $S$ is used because it is sometimes helpful for accommodating long-tailed distributions and possible outliers [41].

When $\gamma$ = 0.1, most of the samples are stored on a few clients, and when $\gamma$ = 1, almost all clients store the same number of samples. To simulate unbalanced data distributions, we vary $\gamma$ from 0.1 to 1, with an average of 30 out of 100 clients participating. The test accuracies achieved by FedAvg and T-FedAvg for various $\gamma$ are illustrated in Fig. 11.
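
A small sketch of how this degree of unbalancedness can be measured (our own illustration, taking $\gamma$ as the median-to-mean ratio of the client sample counts, as in (29)):

import numpy as np

def unbalancedness(sample_counts):
    """Degree of unbalancedness of a data split, taken here as median / mean."""
    s = np.asarray(sample_counts, dtype=float)
    return np.median(s) / np.mean(s)

balanced = [600] * 100                  # every client holds 600 samples
skewed = [5000] * 10 + [100] * 90       # a few clients hold most of the data

print(unbalancedness(balanced))         # 1.0
print(unbalancedness(skewed))           # ~0.17: most data concentrated on a few clients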

Fig. 11: Test accuracies achieved by MLP on MNIST and ResNet on CIFAR10 after 400 rounds of iterations with FedAvg and T-FedAvg, where the local batch size and participation ratio are set to 32 and 0.3, respectively.

We can see that the unbalancedness in data size does not have a significant impact on the performance of federated learning: both algorithms achieve satisfactory performance even when the data is mostly concentrated on a few clients.

V-F Comparison of Communication Costs

In this subsection, we compare the communication costs of FedAvg and T-FedAvg for a fixed number of rounds. The learning environment is configured as described in the settings above. Since both algorithms converge within 100 rounds on all datasets, we fix the number of rounds to 100. The results are shown in Table IV.

Methods     MLP                        ResNet
            Upload        Download     Upload          Download
FedAvg      742.49 MB     742.49 MB    18525.70 MB     18525.70 MB
T-FedAvg    46.41 MB      46.41 MB     1157.86 MB      1157.86 MB
TABLE IV: The total memory required to reach a targeted test accuracy on different tasks in an IID setting within 100 rounds. Note that 10 out of 100 clients ($C$ = 0.1) participate in the aggregation after 5 local training epochs per round.

We can see that the communication costs of T-FedAvg are reduced by nearly 94% in both the upload and download phases compared to the standard FedAvg. To the best of our knowledge, such a significant level of compression of the downstream communication has not previously been achieved in federated learning.

VI Conclusions and Future Work

Federated learning is effective for privacy preservation, but it is constrained by limited upstream and downstream bandwidths, and its performance may degrade seriously when the data distribution is extreme. To address these issues, we have proposed federated trained ternary quantization (FTTQ), a compression method adapted to federated learning based on the TTQ algorithm, to reduce the energy consumption at the inference stage on the clients. Furthermore, we have proposed a ternary federated learning protocol, which compresses both upstream and downstream communications to nearly one sixteenth of the standard method. The optimal solutions of the quantization factors and detailed proofs of the unbiasedness and convergence of the proposed methods are also given. Our experimental results on widely used benchmark datasets demonstrate the effectiveness of the proposed algorithms. Moreover, since we have reduced the downstream and upstream communication costs between the clients and the server, we can increase the number of clients or the number of communication rounds within the same budget to improve the performance of federated learning.

Our approach can be seen as an application of trained ternary quantization in which the global model is also quantized to reduce the communication costs. However, the large reduction in communication costs comes at some expense of the performance of federated learning, in particular when the data on the clients is extremely non-IID. Our future work will aim at finding more efficient approaches to improving the performance of federated learning on non-IID data.

Appendix A Trade-off between Model Capacity and Communication Costs

In this appendix, we analyze the weight distribution and the convergence of the quantization factors in the TTQ algorithm. Both MLP and ResNet are selected to enhance the reliability of the analysis. The initial values of the quantization factors are varied while the other hyperparameters (e.g., the batch size) are fixed, and the influence of the quantization factors is observed from two aspects: the convergence trend in each layer and the effect of the absolute difference between the quantization factor values.

Fig. 12: Convergence analysis of the quantization factors in the ternary MLP. The convergence trends under different initial values (top) and the effect of the gap between the positive and negative values (bottom).

Firstly, we conduct experiments on MLP, since it has only one quantized hidden layer. As shown at the top of Fig. 12, if we subtract the initial value from the value obtained at each iteration of a quantization factor, we can see that the gaps between the iterative values and the initial values of the positive and negative factors follow the same trend. Regardless of whether the initial values are the same, the offsets are consistent with respect to the respective initial values.

As we know, if the convergence values are the same as the initial values, TTQ degenerates into TWN and the learning capability of the ternary models may decline. Moreover, the converged values vary greatly due to the distinct data distributions of the clients in federated learning. To address this issue, we reduce the two quantization factors to one, and use (6) and (8) to constrain the threshold across clients.

To illustrate the effect of the gap between the positive and negative initial values, we fix the initial value of one of the factors and increase the initial value of the other, as shown at the bottom of Fig. 12. We can see that as the interval increases, the change in the values of the quantization factors becomes smaller, finally approaching 0 (convergence to the initial values). So we can conclude that the gradients of $w^p$ and $w^n$ will eventually be tiny, according to (13) and (14).

Similar phenomena can also be observed in the experiments conducted on ResNet. The results obtained with appropriately selected initial values are shown in Fig. 13. When the initial values of the two quantization factors are the same, the convergence profiles of $w^p$ and $w^n$ in a given layer are nearly symmetrical and the difference between their absolute values is almost zero at each epoch (refer to the left part of Fig. 13). When the initial values of $w^p$ and $w^n$ are different, it can be observed from the right part of Fig. 13 that $w^p$ and $w^n$ follow the same trend in a given layer, while the convergence trends of the two parameters fluctuate more.

Fig. 13: The convergence trend in a specific layer and the convergence values among layers with the same (left) and different (right) initial values of the quantization factors.

References