The advancement of machine learning (ML) is heavily dependent on the processing of massive amounts of data. The most timely and relevant data are often generated at different devices all over the world, e.g., data collected by mobile phones and video cameras. Because of communication and privacy constraints, gathering all these data for centralized processing is often impractical, motivating the need for ML training over widely distributed data (decentralized learning). For example, geo-distributed learning DBLP:conf/nsdi/HsiehHVKGGM17 trains a global ML model over data spread across geo-distributed data centers. Similarly, federated learning DBLP:conf/aistats/McMahanMRHA17 trains a centralized model over data from a large number of devices (e.g., mobile phones).
Key Challenges in Decentralized Learning. There are two key challenges in decentralized learning. First, training a model over decentralized data using traditional training approaches (i.e., those designed for centralized data) requires massive communication, which drastically slows down the training process because the communication is bottlenecked by the limited wide-area or mobile network bandwidth DBLP:conf/nsdi/HsiehHVKGGM17; DBLP:conf/aistats/McMahanMRHA17. Second, decentralized data are typically generated at different contexts, which can lead to significant differences in the distribution of the data across data partitions. For example, facial images collected by cameras will reflect the demographics of each camera’s location, and images of kangaroos will be collected only from cameras in Australia or zoos. Unfortunately, existing decentralized learning algorithms (e.g., DBLP:conf/nsdi/HsiehHVKGGM17; DBLP:conf/aistats/McMahanMRHA17; DBLP:conf/nips/SmithCST17; DBLP:conf/ICLR/LinHMWD18; DBLP:conf/icml/TangLYZL18) mostly focus on reducing communication, as they either (i) assume that data are independent and identically distributed (IID) across different partitions or (ii) conduct only a very limited study on non-IID data partitions. This leaves a key question mostly unanswered: What happens to different ML applications and decentralized learning algorithms when their training data partitions are not IID?
Our Goal and Key Findings. In this work, we aim to answer the above key question by conducting the first detailed empirical study of the impact of non-IID data partitions on decentralized learning. Our study covers various ML applications, ML models, training datasets, decentralized learning algorithms, and degrees of deviation from IID. We focus on deep neural networks (DNNs) as they are the most relevant solutions for our applications. Our study reveals three key findings:
Training over non-IID data partitions is a fundamental and pervasive problem for decentralized learning. All three decentralized learning algorithms in our study suffer from major model quality loss (or even divergence) when run on non-IID data partitions, across all applications, models, and training datasets in our study.
DNNs with batch normalization DBLP:conf/icml/IoffeS15 are particularly vulnerable to non-IID data partitions, suffering significant model quality loss even under BSP, the most communication-heavy approach to decentralized learning.
The degree of deviation from IID (the skewness) is a key determinant of the difficulty level of the problem. These findings reveal that non-IID data is an important yet heavily understudied challenge in decentralized learning, worthy of extensive study.
Solutions. As two initial steps towards addressing this vast challenge, we first show that among the many proposed alternatives to batch normalization, group normalization DBLP:conf/eccv/WuH18 avoids the skew-induced accuracy loss of batch normalization under BSP. With this fix, all models in our study perform well under BSP for non-IID data, and the problem can be viewed as a trade-off between accuracy and communication frequency. Intuitively, there is a tug-of-war among the different data partitions, with each partition pulling the model to reflect its data, and only frequent communication, tuned to the skew-induced accuracy loss, can preserve the overall model accuracy of the algorithms in our study. Accordingly, we present a system-level approach that adapts the communication frequency of decentralized learning algorithms to accuracy loss, by cross-validating model accuracy across data partitions (model traveling). Our experimental results show that this adaptive approach automatically reduces communication by 9.6× (under high skew) to 34.1× (under mild skew) while retaining the accuracy of BSP.
Contributions. We make the following contributions. First, we conduct a detailed empirical study on the problem of non-IID data partitions. To our knowledge, this is the first study to show that non-IID data partitions pose a fundamental and pervasive challenge for decentralized learning. Second, we make the new observation that the challenge of non-IID data partitions is particularly problematic for DNNs with batch normalization, even under BSP. We discuss the root cause of this problem and find that it can be addressed by using an alternative normalization technique. Third, we show that the difficulty of this problem varies with the data skew. Finally, we design and evaluate a system-level approach that adapts the communication frequency to reflect the skewness in the data, seeking to maximize communication savings while preserving model accuracy.
Our study focuses on label-based partitioning of data, in which the distribution of labels varies across partitions. The paper concludes with a broader taxonomy of regimes of non-IID data (sec:discussion), the study of which is left to future work.
2 Background and Setup
We provide background on decentralized learning and popular algorithms for this learning setting (subsec:decentral) and then describe our study’s experimental setup (subsec:setup).
2.1 Decentralized Learning
In a decentralized learning setting, we aim to train an ML model based on all the training data samples that are generated and stored in one of the partitions. The goal of the training is to fit the model to all data samples. Most decentralized learning algorithms assume the data samples are independent and identically distributed (IID) among the different partitions, and we refer to such a setting as the IID setting. Conversely, we call it the Non-IID setting if this assumption does not hold.
We evaluate three popular decentralized learning algorithms to see how they perform on different applications in the IID and Non-IID settings. These algorithms can be used with a variety of stochastic gradient descent (SGD) approaches, and aim to reduce communication, either among data partitions or between the data partitions and a centralized server.
DBLP:conf/nsdi/HsiehHVKGGM17, a geo-distributed learning algorithm that dynamically eliminates insignificant communication among data partitions. Each partition accumulates updates to each model weight locally, and communicates the accumulated update to all other data partitions only when its relative magnitude exceeds a predefined threshold (Algorithm 1 in Appendix A).
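The significance filter described above can be sketched as follows. This is a minimal, hypothetical simplification (function and variable names are ours, not from the paper; the actual algorithm is Algorithm 1 in Appendix A):

```python
import numpy as np

def significance_filter_step(local_update, accumulated, weights, threshold):
    """Accumulate a local update per weight and emit only the accumulated
    updates whose relative magnitude (vs. the current weight value)
    exceeds the significance threshold; emitted entries are reset."""
    accumulated = accumulated + local_update
    significant = np.abs(accumulated) > threshold * np.abs(weights)
    to_send = np.where(significant, accumulated, 0.0)      # communicated
    accumulated = np.where(significant, 0.0, accumulated)  # kept local
    return to_send, accumulated
```

For example, with unit weights and a 1% threshold, an accumulated update of 0.05 is communicated while an update of 0.001 keeps accumulating locally.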
DBLP:conf/aistats/McMahanMRHA17, a popular algorithm for federated learning that combines local SGD on each client with model averaging. Specifically, it selects a subset of the partitions in each epoch, runs a prespecified number of local SGD steps on each selected partition, and communicates the resulting models back to a centralized server. The server averages all these models and uses the averaged model as the starting point for the next epoch (Algorithm 2 in Appendix A).
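One round of the local-SGD-plus-averaging scheme described above can be sketched as follows (a simplified illustration with hypothetical names, assuming all partitions are selected; the full algorithm is Algorithm 2 in Appendix A):

```python
import numpy as np

def federated_averaging_round(global_w, partitions, local_steps, lr, grad_fn):
    """One round: each partition runs `local_steps` of SGD starting
    from the global model, then the server averages the local models."""
    local_models = []
    for data in partitions:
        w = global_w.copy()
        for _ in range(local_steps):
            w -= lr * grad_fn(w, data)  # local SGD step on this partition
        local_models.append(w)
    return np.mean(local_models, axis=0)  # server-side model averaging
```

With a toy quadratic loss per partition (gradient `w - c`), two partitions pulling towards c = 0 and c = 2 average out to a model halfway between their local optima after one round.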
, a popular algorithm that communicates only a pre-specified amount of gradients each epoch, with various techniques to retain model quality such as momentum correction, gradient clipping DBLP:conf/icml/PascanuMB13, momentum factor masking, and warm-up training DBLP:journals/corr/GoyalDGNWKTJH17 (Algorithm 3 in Appendix A).
In addition to these decentralized learning algorithms, we also show results for BSP (bulk synchronous parallel) DBLP:journals/cacm/Valiant90 in the IID and Non-IID settings. BSP is significantly slower than the above algorithms because it does not seek to reduce communication: all updates from each partition are accumulated and shared among all data partitions after each training iteration. As noted earlier, for decentralized learning there is a natural tension between the frequency of communication and the quality of the resulting model: differing distributions among the partitions pull the model in different directions, and more frequent communication helps mitigate this “tug-of-war” so that the model represents all the data well. Thus, BSP, with its full communication every iteration, is used as the target baseline for model quality for each application.
2.2 Experimental Setup
Our study consists of three dimensions: (i) ML applications/models, (ii) decentralized learning algorithms, and (iii) degree of deviation from IID. We explore all three dimensions with rigorous experimental methodologies. In particular, we make sure the accuracy of our trained ML models on IID data matches the reported accuracy in corresponding papers. To our knowledge, this is the first detailed empirical study on ML over non-IID data partitions.
We evaluate different deep learning applications, DNN model structures, and training datasets:
Image classification with four DNN models (AlexNet DBLP:conf/nips/KrizhevskySH12, GoogLeNet DBLP:conf/cvpr/SzegedyLJSRAEVR15, LeNet lecun1998gradient, and ResNet DBLP:conf/cvpr/HeZRS16) over two datasets (CIFAR-10 krizhevsky2009learning and ImageNet ILSVRC15). We use validation accuracy as the model quality metric.
Face recognition with the center-loss face model DBLP:conf/eccv/WenZL016 over the CASIA-WebFace DBLP:journals/corr/YiLLL14a dataset. We use verification accuracy on the LFW dataset LFWTech as the model quality metric.
For all applications, we tune the training parameters (e.g., learning rate, minibatch size, number of epochs, etc.) such that the baseline model (BSP in the IID setting) achieves the model quality of the original paper. We then use these training parameters in all other settings. We further ensure that training/validation accuracy has stopped improving by the end of all our experiments. Appendix B lists all major training parameters in our study.
Non-IID Data Partitions. We create non-IID data partitions by partitioning datasets using the labels
on the data, i.e., using image classes for image classification and person identities for face recognition. This partitioning emulates real-world non-IID settings, which often involve highly unbalanced label distributions across locations (e.g., kangaroos only in Australia or zoos, a person's face in only a few locations worldwide). We control the degree of deviation from IID by controlling the fraction of data that is non-IID. For example, 20% non-IID indicates that 20% of the dataset is partitioned by labels, while the remaining 80% is partitioned randomly.
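The label-based partitioning scheme above can be sketched as follows. This is an illustrative reconstruction (names and the exact assignment of label ranges to partitions are our assumptions, not the paper's code):

```python
import numpy as np

def partition_non_iid(labels, num_partitions, non_iid_frac, seed=0):
    """Split sample indices into partitions where `non_iid_frac` of the
    data is assigned by label (contiguous label ranges per partition)
    and the remaining fraction is assigned randomly."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(labels))
    rng.shuffle(idx)
    n_skew = int(len(idx) * non_iid_frac)
    skew_idx, iid_idx = idx[:n_skew], idx[n_skew:]
    # Label-sorted portion: contiguous label ranges go to each partition.
    skew_sorted = skew_idx[np.argsort(labels[skew_idx], kind="stable")]
    parts = [list(chunk) for chunk in np.array_split(skew_sorted, num_partitions)]
    # Remaining portion: spread randomly across partitions.
    for k, chunk in enumerate(np.array_split(iid_idx, num_partitions)):
        parts[k].extend(chunk)
    return [np.array(p) for p in parts]
```

With `non_iid_frac=1.0` and as many partitions as label groups, each partition receives disjoint labels; with `non_iid_frac=0.0`, the split is purely random.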
Hyper-Parameters Selection. The algorithms we study provide the following hyper-parameters (see Appendix A for the detail of these algorithms) to control the amount of communication (and hence the training time):
uses an initial threshold to determine if an update is significant. Starting from this initial value, the significance threshold decreases whenever the learning rate decreases.
uses a hyper-parameter to control the number of local SGD steps on each selected partition.
uses a hyper-parameter to control the sparsity of updates (only update magnitudes in the top percentile are exchanged). Following the original paper DBLP:conf/ICLR/LinHMWD18, the sparsity follows a warm-up schedule: 75%, 93.75%, 98.4375%, 99.6%, 99.9%. We use a hyper-parameter, the number of epochs for each warm-up sparsity, to control the duration of the warm-up. For example, if it is set to 4, the sparsity is 75% in epochs 1–4, 93.75% in epochs 5–8, 98.4375% in epochs 9–12, 99.6% in epochs 13–16, and 99.9% in epochs 17+.
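The warm-up schedule above maps an epoch number to a sparsity level; a small sketch (the function name and 1-indexed epoch convention are our assumptions):

```python
def warmup_sparsity(epoch, epochs_per_stage):
    """Warm-up sparsity schedule: 75%, 93.75%, 98.4375%, 99.6%, 99.9%,
    advancing one stage every `epochs_per_stage` epochs (1-indexed)."""
    schedule = [0.75, 0.9375, 0.984375, 0.996, 0.999]
    stage = min((epoch - 1) // epochs_per_stage, len(schedule) - 1)
    return schedule[stage]
```

For instance, with `epochs_per_stage=4`, epochs 1–4 use 75% sparsity and epoch 17 onwards uses the final 99.9%.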
We select the hyper-parameter of each decentralized learning algorithm so that each algorithm (i) achieves the same model quality as BSP in the IID setting and (ii) achieves communication savings similar to the other two algorithms. We study the sensitivity of our findings to this choice in subsec:decentral_parameter.
3 Non-IID Study: Results Overview
This paper seeks to answer the question of what happens to ML applications, ML models, and decentralized learning algorithms when their training data partitions are not IID. In this section, we provide an overview of our findings, showing that non-IID data partitions cause major model quality loss across all applications, models, and algorithms in our study. We discuss the results for image classification in subsec:overview_cifar10 and subsec:overview_imagenet, and for face recognition in subsec:overview_face.
3.1 Image Classification with CIFAR-10
We first present the model quality with different decentralized learning algorithms over the IID and Non-IID settings for image classification on the CIFAR-10 dataset. We use five partitions in this evaluation. As the CIFAR-10 dataset consists of ten object classes, each data partition has two object classes in the Non-IID setting. Figure 1 shows the results with four popular DNNs (AlexNet, GoogLeNet, LeNet, and ResNet). We select the hyper-parameter of each algorithm according to the criteria in subsec:setup. We make two major observations.
1) It is a pervasive problem. All three decentralized learning algorithms lose significant model quality for all four DNNs in the Non-IID setting. We see that while these algorithms retain the validation accuracy of BSP in the IID setting with 15–20× communication savings (agreeing with the results reported in the original papers for these algorithms), they lose 3% to 74% validation accuracy in the Non-IID setting. Simply running these algorithms for more epochs would not help, because the training/validation accuracy has already stopped improving. Furthermore, training completely diverges in some cases, such as with GoogLeNet and ResNet20 (one of the algorithms with ResNet20 also diverges in the IID setting). The pervasiveness of the problem is quite surprising, as we have a diverse set of decentralized learning algorithms and DNNs. This result shows that Non-IID data is a pervasive and challenging problem for decentralized learning, and this problem has been heavily understudied. sec:decentral discusses the cause of this problem.
2) Even BSP cannot completely solve this problem. We see that even BSP, with its full communication every iteration, cannot retain model quality for some DNNs in the Non-IID setting. In particular, the validation accuracy of ResNet20 in the Non-IID setting is 39% lower than in the IID setting. This finding suggests that, for some DNNs, it may not be possible to solve the Non-IID data challenge by communicating more frequently between data partitions. We find that this problem exists not only in ResNet20, but in all DNNs we study with batch normalization layers (ResNet10, BN-LeNet DBLP:conf/icml/IoffeS15, and Inception-v3 DBLP:conf/cvpr/SzegedyVISW16). We discuss this problem and potential solutions in sec:batch_norm.
3.2 Image Classification with ImageNet
We study image classification on the ImageNet dataset ILSVRC15 (1,000 image classes) to see if the Non-IID data problem exists in different datasets. We use two partitions in this experiment, so that each partition gets 500 image classes. We select the hyper-parameter of each algorithm according to the criteria in subsec:setup.
The same trend in different datasets. Figure 2 illustrates the validation accuracy in the IID and Non-IID settings. Interestingly, we observe the same problems on the ImageNet dataset, which has two orders of magnitude more classes than CIFAR-10. First, two of the three algorithms lose significant validation accuracy (8.1% to 27.2%) for both DNNs in the Non-IID setting, and while the remaining algorithm retains the validation accuracy for GoogLeNet in the Non-IID setting, it cannot converge to a useful model for ResNet10. Second, BSP also cannot retain the validation accuracy for ResNet10 in the Non-IID setting, which concurs with our observation in the CIFAR-10 study. These results show that the Non-IID data problem exists not only across decentralized learning algorithms and DNNs, but also across datasets.
3.3 Face Recognition
We further examine another popular ML application, face recognition, to see if the Non-IID data problem is a fundamental challenge across different applications. We again use two partitions in this evaluation, and we store different people's faces in different partitions in the Non-IID setting. We select the hyper-parameter of each algorithm according to the criteria in subsec:setup. It is worth noting that the verification process of face recognition is fundamentally different from image classification, as face recognition does not
use the classification layer (and thus the training labels) at all in the verification process. Instead, for each pair of verification images, the trained DNN is used to compute a feature vector for each image, and the distance between these feature vectors is used to determine if the two images belong to the same person.
The same problem in different applications. Figure 3 presents the LFW verification accuracy using different decentralized learning algorithms in the IID and Non-IID settings. Again, the same problem occurs in this application: the decentralized learning algorithms work well in the IID setting, but they lose significant accuracy in the Non-IID setting. In fact, two of the algorithms cannot converge to a useful model in the Non-IID setting, and their 50% accuracy comes from random guessing (the verification process is a series of binary questions). This result is particularly noteworthy because face recognition uses a vastly different verification process that does not rely on the training labels, which are used to create the Non-IID setting in the first place. We conclude that Non-IID data is a fundamental and pervasive problem across applications, datasets, models, and decentralized learning algorithms.
4 Problems of Decentralized Learning Algorithms
The results in sec:overview show that three diverse decentralized learning algorithms all suffer drastic accuracy loss in the Non-IID setting. We investigate the potential reasons for this (subsec:decentral_cause) and the sensitivity to hyper-parameter choice (subsec:decentral_parameter).
4.1 Reasons for Model Quality Loss
Gaia. We extract the Gaia-trained models from both partitions (denoted DC-0 and DC-1) for image classification on the ImageNet dataset, and then evaluate the validation accuracy of each model on the image classes in each partition. As Figure 4 shows, the validation accuracy is consistent across the two sets of image classes when the model is trained in the IID setting (the results for the IID DC-0 Model are shown; the IID DC-1 Model behaves the same). However, the validation accuracy varies drastically in the Non-IID setting (Non-IID DC-0 Model and Non-IID DC-1 Model). Specifically, both models perform well for the image classes in their respective partitions, but they perform very poorly for the image classes that are not in their respective partitions. This reveals that using Gaia in the Non-IID setting results in completely different models across data partitions, and each model is only good at recognizing the image classes in its own partition.
This raises the following question: How does Gaia produce completely different models in the Non-IID setting, given that it synchronizes all significant updates to ensure that the difference across models in each weight is insignificant (sec:background)? To answer this, we first compare each weight in the Non-IID DC-0 and DC-1 Models, and find that the average difference among all the weights is only 0.5% (reflecting that the threshold for significance in the last epoch was 1%). However, we find that given the same input image, the neuron
values are vastly different (an average difference of 173%). This finding suggests that small weight differences can result in completely different models. Mathematically, this is because weights can be both positive and negative: a small percentage difference in the individual weights of a neuron can lead to a large percentage difference in its value. As Gaia eliminates insignificant communication, it creates an opportunity for the model in each data partition to specialize for the image classes in that partition, at the expense of other classes.
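The sign-cancellation effect described above is easy to demonstrate with a toy neuron whose two weights nearly cancel (the specific numbers below are illustrative, not from the paper's models):

```python
import numpy as np

# Two nearly cancelling weights: a 0.5% relative change in each weight
# produces a ~100% relative change in the neuron's pre-activation value.
x = np.array([1.0, 1.0])
w_a = np.array([1.00, -0.99])            # neuron value: 1.00 - 0.99 = 0.01
w_b = w_a * np.array([1.005, 0.995])     # each weight moved by only 0.5%
neuron_a = w_a @ x                       # 0.01
neuron_b = w_b @ x                       # 1.005 - 0.98505 = 0.01995
rel_diff = abs(neuron_b - neuron_a) / abs(neuron_a)  # ~0.995, i.e. ~99.5%
```

This is why a 0.5% average weight difference between the two partitions' models is compatible with a 173% average difference in neuron values.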
DeepGradientCompression. While local model specialization explains why Gaia performs poorly in the Non-IID setting, it is still unclear why other decentralized learning algorithms exhibit the same problem. More specifically, FederatedAveraging and DeepGradientCompression always maintain one global model, so there is no room for local model specialization. To understand why these algorithms perform poorly, we study the average residual update delta with DeepGradientCompression. This number represents the magnitude of the gradients that have not yet been exchanged among the data partitions, because DeepGradientCompression communicates only a fixed number of gradients in each epoch (sec:background). Thus, it can be seen as the amount of gradient divergence among the data partitions.
Figure 5 depicts the average residual update delta for the first 20 training epochs when training ResNet20 over the CIFAR-10 dataset. We show only the first 20 epochs because the training diverges after that in the Non-IID setting. As the figure shows, the average residual update delta is an order of magnitude higher in the Non-IID setting (283%) than in the IID setting (27%). Hence, each partition generates large gradients in the Non-IID setting, which is not surprising as each partition sees vastly different training data. However, these large gradients are not synchronized, because DeepGradientCompression sparsifies the gradients at a fixed rate. When they are finally synchronized, they may have diverged so much from the global model that they lead to the divergence of the whole model. Our experiments also support this proposition, as we see that DeepGradientCompression diverges much more often in the Non-IID setting.
FederatedAveraging. The above analysis for DeepGradientCompression also applies to FederatedAveraging, which delays communication from each partition by a fixed number of local iterations. If the weights in different partitions diverge too much, the synchronized global model can lose accuracy or completely diverge DBLP:journals/corr/abs-1806-00582. We validate this by plotting the average local weight update delta for FederatedAveraging at each global synchronization (the average relative difference between each partition's local weights and the averaged global model weights). Figure 6 depicts this number for the first 25 training epochs when training AlexNet over the CIFAR-10 dataset. As the figure shows, the average local weight update delta in the Non-IID setting (48.5%) is much higher than in the IID setting (20.2%), which explains why Non-IID data partitions lead to major accuracy loss for FederatedAveraging. The difference is less pronounced than with DeepGradientCompression, so the impact on accuracy is smaller.
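A sketch of the divergence metric discussed above, under our reading of the definition (a per-weight relative difference averaged over all weights; the function name and the epsilon guard against zero weights are our assumptions):

```python
import numpy as np

def local_update_delta(local_weights, global_weights, eps=1e-12):
    """Average relative difference between a partition's locally trained
    weights and the averaged global model weights."""
    return np.mean(np.abs(local_weights - global_weights)
                   / (np.abs(global_weights) + eps))
```

For example, local weights [1.2, 0.8] against global weights [1.0, 1.0] give a delta of 0.2 (20%).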
4.2 Algorithm Hyper-Parameters
We now study the sensitivity of the non-IID problem to hyper-parameter choice. Table 1 presents the results of varying the algorithm's hyper-parameter (subsec:setup) when training on CIFAR-10; we leave the results for the other two algorithms and more models to Appendix C. As the table shows, we study seven hyper-parameter choices and compare the results with BSP. We make two main observations.
First, almost all hyper-parameter settings lose significant accuracy in the Non-IID setting (relative to BSP in the IID setting). Even with a relatively conservative hyper-parameter setting (the most communication-intensive of the choices shown), we still see 3.3% and 21.9% accuracy losses. On the other hand, the exact same hyper-parameter choice in the IID setting achieves close to BSP-level accuracy (within 1.1%). We see the same trend with much more aggressive hyper-parameter settings as well. This shows that the problem of Non-IID data partitions is not specific to particular hyper-parameter settings, and that hyper-parameter settings that work well in the IID setting may perform poorly in the Non-IID setting.
Second, more conservative hyper-parameter settings (which imply more frequent communication among the data partitions) often greatly decrease the accuracy loss in the Non-IID setting. For example, the validation accuracy with more conservative settings is significantly higher than with aggressive ones. This suggests that we may be able to use more frequent communication among the data partitions for higher model quality in the Non-IID setting, mitigating the “tug-of-war” among the partitions (subsec:decentral).
5 Batch Normalization: Problem and Solution
As sec:overview discusses, even BSP cannot retain model quality in the Non-IID setting for DNNs with batch normalization layers. In this section, we first discuss why batch normalization is particularly vulnerable in the Non-IID setting (subsec:batch_norm_problem) and then study alternative normalization techniques, including one—Group Normalization—that works better in this setting (subsec:batch_norm_solution).
5.1 The Problem of Batch Normalization in the Non-IID Setting
Batch normalization DBLP:conf/icml/IoffeS15 (BatchNorm) is one of the most popular mechanisms in deep learning, and it has been employed by default in most deep learning models (more than 11,000 citations). BatchNorm enables faster and more stable DNN training because it enables larger learning rates, which in turn make convergence much faster and help avoid sharp local minimum (hence, the model generalizes better).
How BatchNorm works. BatchNorm aims to stabilize a DNN by normalizing the input distribution of selected layers such that the inputs on each channel of the layer have zero mean and unit variance. Because the global mean and variance are unattainable with stochastic training, BatchNorm uses the minibatch mean and variance as estimates of the global mean and variance. Specifically, for each minibatch, BatchNorm calculates the minibatch mean and variance, and uses them to normalize each input in the minibatch DBLP:conf/icml/IoffeS15. Recent work shows that BatchNorm enables larger learning rates because (i) BatchNorm corrects large gradient updates that could result in divergence DBLP:conf/nips/BjorckGSW18 and (ii) BatchNorm makes the underlying problem's landscape significantly smoother DBLP:conf/nips/SanturkarTIM18.
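The training-time normalization described above can be sketched in a few lines (a minimal per-channel version without the learned scale/shift parameters of full BatchNorm):

```python
import numpy as np

def batchnorm_train(x, eps=1e-5):
    """Normalize a minibatch per channel with its own mean and variance,
    as BatchNorm does during training. x has shape (batch, channels).
    Returns the normalized batch and the minibatch statistics."""
    mu = x.mean(axis=0)    # per-channel minibatch mean
    var = x.var(axis=0)    # per-channel minibatch variance
    return (x - mu) / np.sqrt(var + eps), mu, var
```

At validation time, estimated global statistics are used in place of `mu` and `var`, which is exactly where the train/validation mismatch discussed below arises.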
BatchNorm and the Non-IID setting. While BatchNorm is effective in practice, its dependence on the minibatch mean and variance is known to be problematic in certain settings. BatchNorm uses the minibatch mean and variance during training, but typically uses an estimated global mean and variance for validation. If there is a major mismatch between these statistics, validation accuracy will be low because the input distribution during validation does not match the distribution during training. This can happen if the minibatch size is small or the sampling of minibatches is not IID DBLP:conf/nips/Ioffe17. The Non-IID setting in our study exacerbates this problem because each data partition sees very different training samples. Hence, the minibatch means and variances in each partition can vary significantly in the Non-IID setting, and the synchronized global model may not work well for any partition's data. Worse still, we cannot simply increase the minibatch size or sample minibatches more carefully to solve this problem, because in the Non-IID setting the underlying training data in each partition does not represent the global training dataset.
We validate whether there is indeed a major divergence in the minibatch means and variances among different partitions in the Non-IID setting. We calculate the divergence of the minibatch mean as the difference between the minibatch means of two partitions over their average, i.e., |mu_0 - mu_1| / ((mu_0 + mu_1) / 2) for two partitions with minibatch means mu_0 and mu_1. We average over every 100 minibatches in each partition to get a better estimate. Figure 7 depicts this divergence for each channel of the first layer of BN-LeNet, which is constructed by inserting BatchNorm into LeNet after each convolutional layer. As we can see, the divergence of the minibatch mean is significantly larger in the Non-IID setting (between 6% and 51%) than in the IID setting (between 1% and 5%). We observe the same trend in minibatch variances (not shown). As discussed earlier, this phenomenon is detrimental to training: each partition uses very different minibatch means and variances to normalize its model, yet the resulting global model can use only one global mean and variance, which cannot match all of these diverse minibatch statistics. As this problem has nothing to do with the frequency of communication among partitions, it explains why even BSP cannot retain model accuracy for BatchNorm in the Non-IID setting.
5.2 Alternatives to Batch Normalization
As the problem of BatchNorm in the Non-IID setting is due to its dependence on minibatches, the natural solution is to replace BatchNorm with alternative normalization mechanisms that are not dependent on minibatches. Unfortunately, most existing alternative normalization mechanisms have their own drawbacks. We first discuss the normalization mechanisms that have major shortcomings, and then we discuss a particular mechanism that may be used instead.
Weight Normalization DBLP:conf/nips/SalimansK16. Weight Normalization (WeightNorm) is a normalization scheme that normalizes the weights in a DNN, as opposed to the neurons (which is what BatchNorm and most other normalization techniques do). Because it normalizes the weights, WeightNorm does not depend on minibatches. However, while WeightNorm can effectively control the variance of the neurons, it still needs a mean-only BatchNorm in many cases to achieve the model quality and training speed of BatchNorm DBLP:conf/nips/SalimansK16. This mean-only BatchNorm makes WeightNorm vulnerable to the Non-IID setting again, because there is a large divergence in minibatch means among the partitions in the Non-IID setting (subsec:batch_norm_problem).
Layer Normalization DBLP:journals/corr/BaKH16. Layer Normalization (LayerNorm) is a technique inspired by BatchNorm. Instead of computing the mean and variance of a minibatch for each channel, LayerNorm computes the mean and variance across all channels for each sample. Specifically, if the inputs are four-dimensional (batch × channel × width × height), BatchNorm produces means and variances along the (batch, width, height) dimensions for each channel, whereas LayerNorm produces means and variances along the (channel, width, height) dimensions for each sample (per-sample mean and variance). As the normalization is done on a per-sample basis, LayerNorm does not depend on minibatches. However, LayerNorm makes a key assumption that all inputs make similar contributions to the final prediction, and this assumption does not hold for some models such as convolutional neural networks, where activated neurons should not be normalized together with non-activated neurons. As a result, BatchNorm still outperforms LayerNorm for these models DBLP:journals/corr/BaKH16.
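The difference in normalization axes between the two schemes can be seen directly from the shapes of the resulting statistics (a small sketch; the random input is illustrative):

```python
import numpy as np

# Input of shape (N, C, H, W): batch, channels, height, width.
x = np.random.default_rng(0).normal(size=(8, 4, 5, 5))

# BatchNorm: one mean/variance per channel, computed over (N, H, W).
bn_mean = x.mean(axis=(0, 2, 3))   # shape (C,) = (4,)

# LayerNorm: one mean/variance per sample, computed over (C, H, W).
ln_mean = x.mean(axis=(1, 2, 3))   # shape (N,) = (8,)
```

BatchNorm's statistics couple samples within a minibatch (and thus across data partitions), while LayerNorm's are entirely per-sample.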
Batch Renormalization DBLP:conf/nips/Ioffe17. Batch Renormalization (BatchReNorm) is an extension to BatchNorm that aims to alleviate the problem of small minibatches (i.e., inaccurate minibatch means and variances). BatchReNorm achieves this by incorporating the estimated global mean and variance during training, and introducing two hyper-parameters to bound the difference between the minibatch statistics and the global statistics. These two hyper-parameters are gradually relaxed so that the earlier training phase behaves more like BatchNorm and the later phase behaves more like BatchReNorm.
We evaluate BatchReNorm with BN-LeNet over CIFAR-10 to see if BatchReNorm can solve the problem of Non-IID data partitions. We replace all BatchNorm layers with BatchReNorm layers, and we carefully select the BatchReNorm hyper-parameters so that BatchReNorm achieves the highest validation accuracy in both the IID and Non-IID settings. Table 2 shows the Top-1 validation accuracy. We see that while BatchNorm and BatchReNorm achieve similar accuracy in the IID setting, they both perform worse in the Non-IID setting. In particular, while BatchReNorm performs much better than BatchNorm in the Non-IID setting (75.3% vs. 65.4%), BatchReNorm still loses accuracy compared to the IID setting. This is not surprising, because BatchReNorm still relies on minibatches to a certain degree, and prior work has shown that BatchReNorm's performance still degrades when the minibatch size is small DBLP:conf/nips/Ioffe17. Hence, BatchReNorm cannot completely solve the problem of Non-IID data partitions, which is a more challenging problem than small minibatches.
Group Normalization DBLP:conf/eccv/WuH18. Group Normalization (GroupNorm) is an alternative normalization mechanism that aims to overcome the shortcomings of BatchNorm and LayerNorm. GroupNorm divides adjacent channels into groups of a prespecified size and computes the per-group mean and variance for each input sample. Specifically, for a four-dimensional input of shape (N, C, H, W), GroupNorm partitions the C channels into multiple groups and computes means and variances over each group's channels and the spatial dimensions. Hence, GroupNorm does not depend on minibatches for normalization (the shortcoming of BatchNorm), and GroupNorm does not assume that all channels make equal contributions (the shortcoming of LayerNorm).
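The grouping scheme can be sketched in a few lines of NumPy; this is our illustrative version (again without the learnable gain/bias), not the original implementation:

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """GroupNorm over a four-dimensional input of shape (N, C, H, W).

    Channels are split into `num_groups` adjacent groups; mean and
    variance are computed per sample and per group, over the group's
    channels and all spatial positions. No minibatch statistics are
    involved, so the output is independent of the batch composition.
    """
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
```

Note the two extremes: with one group, this normalizes over all channels at once (LayerNorm-like); with one channel per group, each channel is normalized on its own. GroupNorm's intermediate group sizes are what let it avoid LayerNorm's equal-contribution assumption.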
We evaluate GroupNorm with BN-LeNet over CIFAR-10 to see if we can use GroupNorm as an alternative to BatchNorm in the Non-IID setting. We carefully select the group size that works best with this DNN. Figure 8 shows the Top-1 validation accuracy with GroupNorm and BatchNorm across decentralized learning algorithms. We make two main observations.
First, GroupNorm successfully recovers the accuracy loss of BatchNorm with BSP in the Non-IID setting. As the figure shows, GroupNorm with BSP achieves 79.2% validation accuracy in the Non-IID setting, which is as good as the accuracy in the IID setting. This shows that GroupNorm can be used as an alternative to BatchNorm to overcome the Non-IID data challenge for BSP. Second, GroupNorm also dramatically helps the decentralized learning algorithms improve model accuracy in the Non-IID setting. With GroupNorm, the accuracy losses for the three decentralized learning algorithms are 14.4%, 8.9%, and 8.7%, respectively. While these losses are still significant, they are better than their BatchNorm counterparts by an additive 10.7%, 19.8%, and 60.2%, respectively.
Summary. Overall, our study shows that GroupNorm DBLP:conf/eccv/WuH18
can be a good alternative to BatchNorm in the Non-IID setting, especially for computer vision tasks. For BSP, it fixes the problem entirely; for decentralized learning algorithms, it greatly decreases the accuracy loss. However, it is worth noting that BatchNorm is widely adopted in many DNNs; hence, more study should be done to see if GroupNorm can always replace BatchNorm across different applications and DNN models. As for other tasks, such as recurrent (e.g., LSTM DBLP:journals/neco/HochreiterS97) and generative (e.g., GAN DBLP:conf/nips/GoodfellowPMXWOCB14) models, other normalization techniques such as LayerNorm DBLP:journals/corr/BaKH16 can be good options because (i) they have been shown to be effective in these tasks and (ii) they do not depend on minibatches, and hence are unlikely to suffer the problems of BatchNorm in the Non-IID setting.
6 Degree of Deviation from IID
Our study in previous sections (sec:overview–sec:batch_norm) assumes a strict case of non-IID data partitions, where each training label exists exclusively in a single data partition. While this assumption may be a reasonable approximation for some applications (e.g., a person's face image may exist only in the data partition for the geo-region in which the person lives), it could be an extreme case for other applications. Here, we study how the problem of non-IID data changes with the degree of deviation from IID (the skewness) by controlling the fraction of data that are non-IID (subsec:setup). Figure 10 illustrates the CIFAR-10 Top-1 validation accuracy of AlexNet and GN-LeNet (our name for BN-LeNet with GroupNorm replacing BatchNorm, Figure 8) in the 20%, 40%, 60%, and 80% non-IID settings. We make two main observations.
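The partial non-IID settings can be produced by controlling what fraction of each label's data is assigned in a label-skewed manner, with the remainder shuffled uniformly. The following sketch is an illustrative construction (the function and its round-robin label assignment are ours, not necessarily the exact procedure of subsec:setup):

```python
import random
from collections import defaultdict

def partition(labels, num_partitions, non_iid_frac, seed=0):
    """Assign sample indices to partitions so that `non_iid_frac` of
    each label's data is label-skewed (each label mapped to one
    partition, round-robin) and the rest is spread uniformly (IID).
    non_iid_frac=1.0 gives the strict setting where each label lives
    in exactly one partition; 0.0 gives fully IID partitions.
    """
    rng = random.Random(seed)
    parts = defaultdict(list)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    iid_pool = []
    for y, idxs in by_label.items():
        rng.shuffle(idxs)
        cut = int(len(idxs) * non_iid_frac)
        parts[y % num_partitions].extend(idxs[:cut])  # skewed share
        iid_pool.extend(idxs[cut:])                   # IID share
    rng.shuffle(iid_pool)
    for i, idx in enumerate(iid_pool):
        parts[i % num_partitions].append(idx)
    return dict(parts)
```

For example, `partition(cifar_labels, 5, 0.2)` would give each of five partitions a 20% label-skewed share plus an 80% uniformly shuffled share, matching the mildest setting studied in this section.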
1) Partial non-IID data is also problematic. We see that for all three decentralized learning algorithms, partially non-IID data still causes major accuracy loss. Even with a small fraction of non-IID data such as 20%, we still see 5.8% and 3.4% accuracy loss for two of the algorithms on AlexNet (Figure 10(a)). The only exception is AlexNet with the third algorithm, which retains model accuracy in partial non-IID settings. However, the same technique suffers significant accuracy loss for GN-LeNet in partial non-IID settings (Figure 10(b)). We conclude that the problem of non-IID data does not occur only with exclusive non-IID data partitioning, and hence, the problem exists in the vast majority of practical decentralized settings.
2) The degree of deviation from IID often determines the difficulty of the problem. We observe that the degree of skew changes the landscape of the problem significantly. In most cases, model accuracy gets worse with higher degrees of skew, and the accuracy gap between 80% and 20% non-IID data can be as large as 7.4% (GN-LeNet). We see that while most decentralized learning algorithms can retain model quality within certain degrees of skew, there is usually a limit. For example, when training over 20% non-IID data, all three decentralized learning algorithms stay within 1.3% accuracy loss for GN-LeNet (Figure 10(b)). However, their accuracy losses become unacceptable when dealing with 40% or more non-IID data.
7 Our Approach
To address the problem of non-IID data partitions, we introduce a generic, system-level approach that enables communication-efficient decentralized learning over arbitrarily non-IID data partitions. We provide an overview of the approach (subsec:solution_overview), describe its key mechanisms (subsec:comm_control), and present evaluation results (subsec:solution_results).
7.1 Overview
Our goal is a system-level solution that (i) enables high-accuracy, communication-efficient decentralized learning over arbitrarily non-IID data partitions; and (ii) is general enough to be applicable to a wide range of ML applications, ML systems, and decentralized learning algorithms. To this end, we design our approach as a system-level module that can be integrated with various decentralized learning algorithms and ML systems.
Figure 11 overviews the design.
Estimate the degree of deviation from IID. As sec:skewness shows, knowing the degree of skew is very useful for determining an appropriate solution. To learn this key information, our system periodically moves the ML model from one data partition to another during training (model traveling, ❶ in Figure 11). It then evaluates how well the model performs on the remote data partition by measuring its accuracy on a subset of the training data on the remote node. As we already know the training accuracy of this model on its originating data partition, we can infer the accuracy loss on the remote data partition (❷). The accuracy loss is essentially the performance gap of the same model over two different data partitions, which can be used as an approximation of the degree of skew. For example, a remote data partition very likely has very different data characteristics if the model has reached 60% training accuracy on its local partition but achieves only 30% accuracy on the remote one. More importantly, accuracy loss directly captures the extent to which the model underperforms on the different data partition.
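The accuracy-loss signal from model traveling can be sketched as follows; `evaluate` is a placeholder for whatever routine measures a model's training accuracy on a subset of a partition's data:

```python
def accuracy_loss(model, local_data, remote_data, evaluate):
    """Degree-of-skew proxy from model traveling.

    `evaluate(model, data)` is a placeholder returning the training
    accuracy of `model` on a subset of `data`. The gap between the
    model's accuracy on its home partition and on the remote
    partition approximates how differently the two are distributed.
    """
    local_acc = evaluate(model, local_data)    # known at the source
    remote_acc = evaluate(model, remote_data)  # measured after traveling
    return local_acc - remote_acc

# Example from the text: 60% accuracy at home vs. 30% remotely gives
# a 30% accuracy loss, suggesting a heavily skewed remote partition.
```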
Adaptive communication control (❸). Based on the accuracy loss learned from model traveling, our system controls the tightness of communication among data partitions to retain model quality. It does so by automatically tuning the hyper-parameters of the decentralized learning algorithm (subsec:decentral_parameter). This tuning process essentially solves an optimization problem that aims to minimize communication among data partitions while keeping the accuracy loss within a reasonable threshold (subsec:comm_control provides further details).
Our approach handles non-IID data partitions in a manner that is transparent to ML applications and decentralized learning algorithms, and it controls communication based on the accuracy loss across partitions. Thus, we do not need to use the most conservative mechanism (e.g., BSP) all the time, and the system can adapt to whatever skew is present for the particular ML application and its training data partitions (sec:skewness).
7.2 Mechanism Details
We now discuss these mechanisms in detail.
Accuracy Loss. The accuracy loss between data partitions represents the degree of model divergence. As subsec:decentral_cause discusses, ML models in different data partitions tend to specialize for their training data, especially when we use decentralized learning algorithms to relax communication.
Figure 12 demonstrates the above observation by plotting the accuracy loss between different data partitions when training GoogLeNet over CIFAR-10. Two observations are in order. First, the accuracy loss changes drastically from the IID setting (0.4% on average) to the Non-IID setting (39.6% on average). This is expected, as each data partition sees very different training data in the Non-IID setting, which leads to very different models across data partitions. Second, more conservative hyper-parameters lead to a smaller accuracy loss in the Non-IID setting: the accuracy loss for tighter hyper-parameter settings is significantly smaller than for looser ones.
Based on the above observation, we can use accuracy loss (i) to estimate how much the models diverge from each other (reflecting training data differences); and (ii) to serve as an objective function for communication control. With accuracy loss, we do not need any domain-specific information from each ML application to learn and adapt to different degrees of deviation from IID, which makes our approach much more widely applicable.
Communication Control. The goal of communication control is to retain model quality while minimizing communication among data partitions. Specifically, given a set of hyper-parameters \theta_i for each iteration (or minibatch) i, the optimization problem is to minimize the total amount of communication for a data partition:

\min_{\{\theta_i\}} \; \sum_{i=1}^{T} C(\theta_i) + \frac{T}{p} \cdot M \quad (2)

where T is the total number of iterations to achieve the target model accuracy given all hyper-parameters throughout the training, C(\theta_i) is the amount of communication given \theta_i, p is the period size (in iterations) for model traveling, and M is the communication cost of the ML model (for model traveling).
In practice, however, it is impossible to optimize Equation 2 with one-pass training, because we cannot know T under different hyper-parameter choices unless we train the model multiple times. We solve this problem by optimizing a proxy problem, which aims to minimize communication while keeping the accuracy loss within a small threshold, so that we can control the model divergence caused by non-IID data partitions. Specifically, our target function for each tuning step is:

\min_{\theta_i} \; \alpha \cdot L(\hat{\theta}_i) + \beta \cdot C(\theta_i) \quad (3)

where L(\hat{\theta}_i) is the accuracy loss based on the previously selected hyper-parameter \hat{\theta}_i (we memoize the most recent value for each choice), and \alpha and \beta are given parameters that determine the weights of accuracy loss and communication, respectively. We can employ various auto-tuning algorithms with Equation 3 to select \theta_i, such as hill climbing, stochastic hill climbing russell2016artificial, and simulated annealing van1987simulated. Note that we do not make the model traveling period tunable here, to further simplify the tuning.
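As a concrete illustration, a plain hill-climbing tuner over an ordered list of candidate hyper-parameter settings might look as follows; the `measure` callback and the weighted-sum score are simplifications of the tuning loop described above, not the system's actual implementation:

```python
def hill_climb(candidates, measure, alpha=1.0, beta=1.0):
    """Greedy hill climbing over an ordered list of hyper-parameter
    settings (e.g., loosest communication first).

    `measure(theta)` is a placeholder returning a pair
    (accuracy_loss, communication_cost) observed when running a
    communication period with hyper-parameter theta. The score
    mirrors the proxy objective: a weighted sum of accuracy loss
    and communication.
    """
    def score(theta):
        loss, comm = measure(theta)
        return alpha * loss + beta * comm

    i = 0
    best = score(candidates[0])
    while i + 1 < len(candidates):
        nxt = score(candidates[i + 1])
        if nxt >= best:  # stop at a local minimum of the score
            break
        i, best = i + 1, nxt
    return candidates[i]
```

Stochastic hill climbing and simulated annealing differ only in occasionally accepting a worse-scoring neighbor, which helps escape local minima at the cost of extra measurement periods.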
Model Traveling Overhead. Using model traveling to learn accuracy loss can lead to heavy communication overhead if we need to do so for each pair of data partitions, especially when there is a large number of data partitions. For broadcast-based decentralized learning settings (e.g., geo-distributed learning), we leverage an overlay network to reduce the communication overhead of model traveling. Specifically, we use hubs to combine and broadcast models DBLP:conf/nsdi/HsiehHVKGGM17. The extra hops incurred are acceptable because model traveling is not latency sensitive. As for server-client decentralized learning settings (e.g., federated learning), the system only needs to control the communication frequency between the server and clients, and the overhead of model traveling can be folded into model downloading at the beginning of each communication round.
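To see why hubs help, consider a simple message-count model (ours, for illustration only): all-pairs model traveling needs one transfer per ordered pair of partitions, whereas a hub overlay needs each partition to send to and receive from its hub, plus hub-to-hub exchanges.

```python
def transfers_all_pairs(num_partitions):
    """Model transfers if every partition travels to every other."""
    return num_partitions * (num_partitions - 1)

def transfers_with_hubs(num_partitions, num_hubs):
    """Model transfers with a hub overlay: each partition uploads to
    its hub and later receives the combined broadcast (2 transfers
    per partition), and hubs exchange models pairwise among
    themselves.
    """
    return 2 * num_partitions + num_hubs * (num_hubs - 1)
```

With, say, 100 partitions and 4 hubs, the overlay replaces 9,900 pairwise transfers with a few hundred, at the cost of extra hops that model traveling can tolerate.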
7.3 Evaluation Results
We implement and evaluate our approach in a GPU parameter server system DBLP:conf/eurosys/CuiZGGX16 based on Caffe DBLP:journals/corr/JiaSDKLGGD14. We evaluate the aforementioned auto-tuning algorithms and find that hill climbing provides the best results. We compare against two baselines: (1) BSP, the most communication-heavy approach, which retains model quality in all Non-IID settings; and (2) Oracle, an ideal yet unrealistic approach that selects the most communication-efficient hyper-parameters that retain model quality, by running all possible settings in each scenario prior to the measured execution. Figure 13 shows the communication savings over BSP for both our approach and Oracle. Note that all results achieve the same validation accuracy as BSP. We make two main observations.
First, our approach is much more effective than BSP in handling Non-IID settings. Overall, it achieves 9.6×–34.1× communication savings over BSP in various Non-IID settings without sacrificing model accuracy. As expected, it saves more communication with less skewed data, because it can loosen communication in these settings (sec:skewness).
Second, our approach is not far from the ideal Oracle baseline. Overall, it requires only 1.1×–1.5× more communication than Oracle to achieve the same model accuracy. It cannot fully match the communication savings of Oracle because (i) it needs to perform model traveling periodically, which adds some overhead; and (ii) for some hyper-parameter settings, high accuracy loss at the beginning of training can still lead to a high-quality final model, which cannot be foreseen in a single pass. As Oracle requires many training runs in practice, we conclude that our approach is an effective, one-pass solution for decentralized learning over non-IID data partitions.
8 Related Work
To our knowledge, this is the first detailed study on the problem of non-IID data partitions for decentralized learning. Our study shows that this is a fundamental and pervasive problem, and we investigate various aspects of this problem, such as decentralized learning algorithms, batch normalization, and the degree of non-IID data, as well as present our approach. Here, we discuss related work.
Large-scale systems for centralized learning. There are many large-scale ML systems that aim to enable efficient ML training with centralized datasets (e.g., DBLP:conf/nips/RechtRWN11; LowGKBGH12; DBLP:conf/osdi/ChilimbiSAK14; DeanCMCDLMRSTYN12; HoCCLKGGGX13; CuiTWXDHHGGGX14; DBLP:conf/osdi/LiAPSAJLSS14; DBLP:journals/corr/ChenLLLWWXXZZ15; DBLP:journals/corr/GoyalDGNWKTJH17), where the data reside within a single data center. Some of them propose communication-efficient designs, such as relaxing synchronization requirements DBLP:conf/nips/RechtRWN11; HoCCLKGGGX13; DBLP:journals/corr/GoyalDGNWKTJH17 or sending fewer updates to parameter servers DBLP:conf/osdi/LiAPSAJLSS14; DBLP:conf/nips/LiASY14. These works assume the training data are centralized so they can be easily partitioned among the machines performing the training in an IID manner (e.g., by random shuffling). Hence, they are neither designed nor validated on non-IID data partitions.
Communication-efficient training for specific algorithms. A large body of prior work proposes ML training algorithms to reduce the dependency on intensive parameter updates to enable more efficient parallel training (e.g., DBLP:conf/nips/JaggiSTTKHJ14; DBLP:journals/jmlr/ZhangDW13; DBLP:conf/nips/ZinkevichWSL10; DBLP:conf/icml/TakacBRS13; DBLP:conf/uai/NeiswangerWX14; DBLP:conf/icml/ShamirS014; DBLP:conf/icml/ZhangL15a; DBLP:journals/jmlr/Shalev-Shwartz013). These works reduce communication overhead by proposing algorithm-specific approaches, such as solving a dual problem DBLP:journals/jmlr/Shalev-Shwartz013; DBLP:conf/icml/TakacBRS13; DBLP:conf/nips/JaggiSTTKHJ14 or employing a different optimization algorithm DBLP:conf/icml/RakhlinSS12; DBLP:conf/icml/ZhangL15a. The main drawback of these approaches is that they are not general and their applicability depends on the ML application. Besides, these works also assume centralized, IID data partitions. Thus, their effectiveness over decentralized, non-IID data partitions needs much more study.
Decentralized learning. As most training data are generated in a decentralized manner, prior work proposes communication-efficient algorithms (e.g., DBLP:conf/nsdi/HsiehHVKGGM17; DBLP:conf/aistats/McMahanMRHA17; DBLP:conf/ccs/ShokriS15; DBLP:conf/ICLR/LinHMWD18; DBLP:conf/icml/TangLYZL18) for ML training over decentralized datasets. However, the major focus of these works is to reduce communication overhead among data partitions, and they either (i) assume the data partitions are IID DBLP:conf/nsdi/HsiehHVKGGM17; DBLP:conf/ccs/ShokriS15; DBLP:conf/ICLR/LinHMWD18 or (ii) conduct only a limited study on non-IID data partitions DBLP:conf/aistats/McMahanMRHA17; DBLP:conf/icml/TangLYZL18. As our study shows, these decentralized learning algorithms lose significant model quality in the Non-IID setting (sec:overview). Some recent work DBLP:journals/corr/abs-1806-00582; DBLP:conf/nips/SmithCST17 investigates the problem of non-IID data partitions. For example, instead of training a global model to fit non-IID data partitions, federated multi-task learning DBLP:conf/nips/SmithCST17 proposes training local models for each data partition while leveraging other data partitions to improve model accuracy. However, this approach does not solve the problem for global models, which are essential when a local model is unavailable (e.g., a brand new partition without training data) or ineffective (e.g., a partition with too few training examples for a class). The closest to our work is Zhao et al.'s study DBLP:journals/corr/abs-1806-00582 of training over non-IID data, which shows significant accuracy loss in the Non-IID setting. While the result of their study aligns with our observations, our study (i) broadens the problem scope to a variety of decentralized learning algorithms, ML applications, DNN models, and datasets; (ii) explores the problem of batch normalization and possible solutions; and (iii) designs and evaluates our system-level approach.
9 Discussion: Regimes of Non-IID Data
Our study has focused on label-based partitioning of data, in which the distribution of labels varies across partitions. In this section, we present a broader taxonomy of regimes of non-IID data, as well as various possible strategies for dealing with non-IID data, the study of which is left to future work. We assume a general setting in which there may be many disjoint partitions, with each partition holding data collected from devices (mobile phones, video cameras, etc.) in a particular geographic region and time window.
Violations of Independence. Common ways in which data tend to deviate from being independently drawn from an overall distribution are:
Intra-partition correlation: If the data within a partition are processed in an insufficiently-random order, e.g., ordered by collection device and/or by time, then independence is violated. For example, consecutive frames in a video are highly correlated, even if the camera is moving.
Inter-partition correlation: Devices sharing a common feature can have correlated data across partitions. For example, neighboring geo-locations have the same diurnal effects (daylight, workday patterns), have correlated weather patterns (major storms), and can witness the same phenomena (eclipses).
Violations of Identicalness. Common ways in which data tend to deviate from being identically distributed are:
Quantity skew: Different partitions can hold vastly different amounts of data. For example, some partitions may collect data from fewer devices or from devices that produce less data.
Label distribution skew: Because partitions are tied to particular geo-regions, the distribution of labels varies across partitions. For example, kangaroos are only in Australia or zoos, and a person’s face is only in a few locations worldwide. The study in this paper focused on this setting.
Same label, different features: The same label can have very different “feature vectors” in different partitions, e.g., due to cultural differences, weather effects, standards of living, etc. For example, images of homes can vary dramatically around the world and items of clothing vary widely. Even within the U.S., images of parked cars in the winter will be snow-covered only in certain parts of the country. The same label can also look very different at different times, at different time scales: day vs. night, seasonal effects, natural disasters, fashion and design trends, etc.
Same features, different label: Because of personal preferences, the same feature vectors in a training data item can have different labels. For example, labels that reflect sentiment or next word predictors have personal/regional biases.
As noted in some of the above examples, non-IID-ness can occur over both time (often called concept drift) and space (geo-location).
Strategies for dealing with non-IID data. The above taxonomy of the many regimes of non-IID data partitions naturally leads to the question: what should the objective function of the DNN model be? In our study, we have focused on obtaining a global model that minimizes an objective function over the union of all the data. An alternative objective function might instead include some notion of "fairness" among the partitions in the final accuracy on their local data. There could also be different strategies for treating different non-IID regimes.
As noted in Section 8, multi-task learning approaches have been proposed for jointly training local models for each partition, but a global model is essential whenever a local model is unavailable or ineffective. A hybrid approach would be to train a “base” global model that can be quickly “specialized” to local data via a modest amount of further training on that local data. This approach would be useful for differences across space and time. For example, a global model trained under normal circumstances could be quickly adapted to natural disaster settings such as hurricanes, flash floods and forest fires.
As one proceeds down the path towards more local/specialized models, it may make sense to cluster partitions that hold similar data, with one model for each cluster. The goal is to avoid a proliferation of too many models that must be trained, stored, and maintained over time.
Finally, another alternative for handling non-IID data partitions is to use multi-modal training that combines DNNs with key attributes about the data partition pertaining to its geo-location. A challenge with this approach is determining what the attributes should be, in order to have an accurate yet reasonably compact model (otherwise, in the extreme, the model could devolve into local models for each geo-location).
10 Conclusion
As most timely and relevant ML data are generated at different places, decentralized learning provides an important path for ML applications to leverage such data. However, decentralized data are often generated in different contexts, which leads to a heavily understudied problem: non-IID training data partitions. We conduct a detailed empirical study of this problem, revealing three key findings. First, we show that training over non-IID data partitions is a fundamental and pervasive problem for decentralized learning, as all decentralized learning algorithms in our study suffer major accuracy loss in the Non-IID setting. Second, we find that DNNs with batch normalization are particularly vulnerable in the Non-IID setting, with even the most communication-heavy approach being unable to retain model quality. We further discuss the cause of and a potential solution to this problem. Third, we show that the difficulty of the non-IID data problem varies greatly with the degree of deviation from IID. Based on these findings, we present a system-level approach that minimizes communication while retaining model quality even under non-IID data. We hope that the findings and insights in this paper, as well as our open source code, will spur further research into the fundamental and important problem of non-IID data in decentralized learning.
Appendix A Details of Decentralized Learning Algorithms
This section presents pseudocode for the decentralized learning algorithms used in our study.
Appendix B Training Parameters
Model      | Minibatch size | Momentum | Weight decay | Learning rate schedule            | Total epochs
AlexNet    | 20             | 0.9      | 0.0005       | divided by 10 at epochs 64 and 96 | 128
GoogLeNet  | 20             | 0.9      | 0.0005       | divided by 10 at epochs 64 and 96 | 128
GN-LeNet   | 20             | 0.9      | 0.0005       | divided by 10 at epochs 64 and 96 | 128
ResNet-20  | 20             | 0.9      | 0.0005       | divided by 10 at epochs 64 and 96 | 128

Model      | Learning rate schedule | Total epochs
           | decay, power = 0.5     | 60
           | decay, power = 1       | 64

Model       | Minibatch size | Momentum | Weight decay | Learning rate schedule          | Total epochs
center-loss | 64             | 0.9      | 0.0005       | divided by 10 at epochs 4 and 6 | 7
Appendix C More Algorithm Hyper-Parameter Results
subsec:decentral_parameter presents hyper-parameter sensitivity results on two DNNs. Here, we expand the sensitivity study to the remaining algorithms. We make the same observation as in subsec:decentral_parameter for these algorithms. The results are shown in Tables 6, 7, and 8.