1 Introduction
Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model under the coordination of a cloud server without sharing raw data (McMahan et al., 2017). In this setting, the clients (e.g., millions of mobile-device users or hundreds of companies and organizations) train the model in parallel on their local data, and the cloud server updates the global model by aggregating the local models collected from the clients at communication iterations.
As a new paradigm of distributed machine learning, FL has data characteristics that differ significantly from those in traditional distributed optimization (Li et al., 2014; Lian et al., 2018; Tang et al., 2018a, b, 2019; Yu & Jin, 2019). On one hand, because clients tend to have diverse usage patterns, the amount of local data usually differs across clients, and each client’s local dataset represents only a certain aspect of the overall data distribution. That is, the data distributions in FL are unbalanced and non-i.i.d. (Hsieh et al., 2019; Mohri et al., 2019; Li et al., 2019; Kairouz et al., 2019). On the other hand, FL data usually vary periodically, for several practical reasons. Due to strict data protection regulations (European Parliament and Council of the European Union, 2016; Ginart et al., 2019) and resource constraints, clients in many cases cannot hold user data for a long time. Thus, the training data may change cyclically over time and follow a certain temporal pattern. For worldwide FL applications, the training data also vary periodically because the available clients often follow a diurnal pattern (Li et al., 2019). We use the concept of block-cyclicity to model the periodic variation of training data in FL. The training process is composed of several cycles, each of which contains several data blocks representing different training data distributions through the cycle. Within each data block, the clients with unbalanced and non-i.i.d. data jointly train the global model in parallel.
There has been some effort to develop convergence guarantees for FL algorithms, but none of the existing work has jointly considered the practical data characteristics, i.e., imbalance, non-i.i.d. distribution, and block-cyclic pattern. Only some of these data features have been investigated in the literature. Yu et al. (2019b) provided a theoretical analysis of federated averaging (FedAvg), also known as parallel restarted SGD, by assuming that the data distributions are non-i.i.d. but remain unchanged throughout the training process; Eichner et al. (2019) considered the block-cyclic data pattern, but assumed that there is only one client.
The data characteristics discussed above introduce two major biases in FL: (1) client bias: the model trained on a client is biased towards the client’s local training data; and (2) block bias: the model trained on the data from a block is skewed towards the data distribution in this block. To mitigate the client bias, we aggregate the local model updates from participating clients. To overcome the block bias, instead of training a single global model for all the blocks, we construct a series of block-specific predictors by aggregating the model updates from the corresponding block in different cycles. (Footnote 1: Throughout this paper, we use “block-specific global model” and “predictor” interchangeably.)
Based on these ideas, we first propose Multi-Model Parallel SGD (MMPSGD), which takes a block-mixed training strategy: the training process goes through the mixture of different blocks, but for each block, we average the historical global models generated over it to construct the block-specific predictor. For federated optimization with a strongly convex and smooth objective, MMPSGD converges to the optimal global model at a rate of , achieving a linear speedup in terms of the number of clients. MMPSGD performs well when the block-specific data distributions are not far from each other. To further improve the performance when the data distributions differ extremely across blocks, we propose Multi-Chain Parallel SGD (MCPSGD), which augments MMPSGD with a block-separate training strategy. With this strategy, we construct a set of block-separate global models using only the training data from the corresponding block. The critical step behind MCPSGD is that in each training round, we select the “better” block-specific global model from those generated by the block-mixed training process and the block-separate training process. We show that MCPSGD further ensures that each block-specific predictor converges to the block’s optimal model at a rate of while adding a slight communication overhead.
Our key contributions in this work can be summarized as follows: (1) To the best of our knowledge, we are the first to consider that the data distributions in FL are unbalanced, non-i.i.d., and block-cyclic. (2) Under these practical data characteristics, we propose MMPSGD and MCPSGD, both of which return a set of block-specific predictors and have a convergence guarantee of with respect to the optimal global model. MCPSGD further ensures that each block-specific predictor converges to the block’s optimal model. (3) We evaluate our algorithms on the CIFAR-10 dataset. Evaluation results demonstrate that our algorithms achieve a significant performance improvement, with 6% higher test accuracy than FedAvg, and remain robust to variations of critical parameters, whereas FedAvg fluctuates intensely due to the block-cyclic pattern in the training data.
2 Preliminaries
We consider a general distributed optimization problem:
(1) 
where is the number of clients, and is the weight of the th client such that and . Each function is defined by:
(2) 
where is the overall distribution of client ’s local data, and is a loss function on the data from . To simplify the weighted form, we further let represent to scale the local objective. Then, the global objective becomes an average of :
(3) 
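The rescaling step above can be checked numerically. The following is a minimal sketch, assuming the scaled local objective is the original one multiplied by the number of clients times the client weight; the function names and the quadratic toy losses are hypothetical and not taken from the paper.

```python
def global_objective(w, weights, local_objectives):
    """Weighted form: sum over clients of weight_i * f_i(w)."""
    return sum(p * f(w) for p, f in zip(weights, local_objectives))


def scale_local_objectives(weights, local_objectives):
    """Fold each weight into its objective (assumed scaling: n * p_i * f_i)."""
    n = len(weights)
    return [lambda w, p=p, f=f: n * p * f(w)
            for p, f in zip(weights, local_objectives)]


def averaged_objective(w, scaled_objectives):
    """Averaged form: plain mean of the scaled local objectives."""
    return sum(f(w) for f in scaled_objectives) / len(scaled_objectives)


# Toy example: three clients with quadratic losses and weights summing to 1.
weights = [0.5, 0.3, 0.2]
local_objectives = [lambda w, c=c: 0.5 * (w - c) ** 2 for c in (0.0, 1.0, 4.0)]
scaled = scale_local_objectives(weights, local_objectives)
```

Under this assumption both forms agree at every model value, so the later analysis can treat the global objective as a plain average.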
We next formalize the practical cyclicity of training data. There are cycles in total, each of which consists of different global data blocks. In FL, the cycles can be the days of model training, and the global data blocks can correspond to daytime and nighttime in each day. The th global data block further comprises local data blocks. Formally, we have
(4) 
where denotes the distribution of client ’s th local data block. Within a block, there will be rounds as well as iterations in total, where denotes the number of local iterations in each round. Under such a datacyclic model, for each client , its data samples are drawn from the distribution
(5)  
where indexes the cycles, indexes the blocks, and indexes the local iterations within a block.
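As a minimal sketch of the indexing behind equation (5), the snippet below maps a global iteration counter to its cycle, block, and position within the block. The helper names and parameters (`num_blocks`, `rounds_per_block`, `local_iters`) are illustrative assumptions rather than the paper's notation, and the sketch assumes zero-based indices and blocks of equal length.

```python
def locate_iteration(t, num_blocks, rounds_per_block, local_iters):
    """Map a zero-based global iteration t to (cycle, block, offset).

    Hypothetical helper: assumes every block spans
    rounds_per_block * local_iters iterations and that the blocks
    repeat in a fixed order within each cycle.
    """
    iters_per_block = rounds_per_block * local_iters
    iters_per_cycle = num_blocks * iters_per_block
    cycle, within_cycle = divmod(t, iters_per_cycle)
    block, offset = divmod(within_cycle, iters_per_block)
    return cycle, block, offset


def is_communication_iteration(t, local_iters):
    """An iteration is a communication iteration when t is a multiple of local_iters."""
    return t % local_iters == 0
```

For example, with two blocks, three rounds per block, and five local iterations per round, iteration 47 falls in the second cycle, second block, at offset 2.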
Now, the global data distribution is block-cyclic, and we rewrite the original global optimization objective in a block-cyclic way:
(6) 
where the block-specific function is an average of in the corresponding block, and . In this work, we make the following assumptions on the function .
Assumption 1 (Strong Convexity).
is strongly convex with modulus : for any ,
Assumption 2 (Smoothness).
is smooth with modulus : for any ,
To establish the convergence results, we further make some assumptions about the local gradients and the feasible space of model parameters.
Assumption 3 (Bounded Variance).
During local training, the variance of stochastic gradients on each client is bounded by :
Assumption 4 (Bounded Gradient Norm).
The expected norm of the stochastic gradients is bounded by :
Assumption 5 (Bounded Model Parameters).
The norm of any model parameters is bounded by :
The above assumptions have also been made in the literature to derive convergence results. Assumptions 1 and 2 are standard. Assumptions 3 and 4 were made in (Stich et al., 2018; Stich, 2019; Yu et al., 2019a, b). Assumption 5 was made in (Zinkevich, 2003; Eichner et al., 2019).
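For reference, the five assumptions are commonly written in the textbook forms below. This is a sketch only: the symbols ($f$ for the objective, $\nabla F_i(w;\xi)$ for client $i$'s stochastic gradient on sample $\xi$, $f_i$ for its local objective, and the constants $\mu$, $L$, $\sigma$, $G$, $D$) are assumed here for illustration and may differ from the notation and the exact statements used in this paper.

```latex
% Assumption 1 (strong convexity with modulus \mu), for any w, w':
f(w') \ge f(w) + \langle \nabla f(w),\, w' - w \rangle + \frac{\mu}{2}\,\|w' - w\|^2
% Assumption 2 (smoothness with modulus L), for any w, w':
f(w') \le f(w) + \langle \nabla f(w),\, w' - w \rangle + \frac{L}{2}\,\|w' - w\|^2
% Assumption 3 (bounded variance of stochastic gradients):
\mathbb{E}\,\|\nabla F_i(w;\xi) - \nabla f_i(w)\|^2 \le \sigma^2
% Assumption 4 (bounded gradient norm, often stated for the squared norm):
\mathbb{E}\,\|\nabla F_i(w;\xi)\|^2 \le G^2
% Assumption 5 (bounded model parameters):
\|w\| \le D
```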
Some related works have investigated the convergence of FL algorithms with either the non-i.i.d. data distribution or the cyclic data characteristics. To the best of our knowledge, none of the existing work has jointly considered these two characteristics. Yu et al. (2019b) proved that FedAvg achieves an convergence for a nonconvex objective under the assumption that the data distributions are non-i.i.d. but remain unchanged during the training process. This is a special case of our problem obtained by setting , indicating that there is only one data block. Eichner et al. (2019) observed that block-cyclic data characteristics can deteriorate the performance of both sequential SGD and parallel SGD. However, they proposed an approach with a convergence guarantee only for the sequential case. This is another special case of our problem obtained by setting , indicating that there is only one client in total.
In what follows, we discuss how to design FL algorithms that support arbitrary and while attaining an convergence guarantee. The convergence rate is independent of and is consistent with the result of Yu et al. (2019b), which does not consider the cyclic data feature. In addition, by setting , the convergence result reduces to as in Eichner et al. (2019), which does not consider the non-i.i.d. data distribution.
3 Algorithm Design
In this section, we propose two algorithms to construct a set of block-specific predictors, making a trade-off between performance guarantee and communication efficiency. To guarantee a minimax optimal error with respect to the single optimal global model and reduce communication overhead, we propose Multi-Model Parallel SGD, namely MMPSGD. (Footnote 2: Throughout this paper, we use minimax optimal error to denote the difference between the average loss of our predictors over blocks and the loss of the single optimal global model.) To further ensure that each predictor converges to the block’s optimal model at a slight additional communication cost, we propose Multi-Chain Parallel SGD, namely MCPSGD. Figure 1 illustrates the workflows of MMPSGD and MCPSGD.
3.1 MultiModel Parallel SGD
In this subsection, we design MMPSGD, which outputs a set of block-specific predictors with a minimax optimal error guarantee with respect to the single optimal global model. Because the data distributions are non-i.i.d. and block-cyclic, there exist client biases and block biases, as introduced in Section 1. To overcome these biases, we execute the training process over the mixture of blocks, but for each block, we average the historical global models generated over it to obtain the corresponding predictor.
We sketch MMPSGD in Algorithm 1. At the beginning, we initialize the learning rate , the number of local iterations , and the vector of predictors . The vector records the latest predictor for each block. In the training process, there are iterations and rounds in total. If is a multiple of , then the iteration is a communication iteration. At each communication iteration, the cloud server collects and aggregates the local models from all participating clients to obtain the new global model (Line 5), updates the block-specific predictor for the current block (Line 8), and pushes the new global model to all clients (Line 9). After receiving the latest global model, each client runs local SGD iterations in parallel according to the observed local gradients (Lines 11 to 12).
After iterations in total, the algorithm returns block-specific predictors . According to Line 8, we can verify that the final predictor for the block is the average of the historical global models calculated at the communication iterations belonging to the block , i.e.,
(7) 
where is the set of communication iterations corresponding to the block .
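The workflow described above can be sketched in a few lines. This is a simplified illustration rather than Algorithm 1 itself: it assumes a scalar model, quadratic local losses, a constant learning rate, and full client participation, and all names (`mmpsgd`, `client_targets`, and so on) are hypothetical.

```python
import random


def mmpsgd(client_targets, num_cycles, rounds_per_block, local_iters, lr,
           noise=0.0, seed=0):
    """Block-mixed training with one averaged predictor per block.

    client_targets[b][i] is the minimiser of client i's toy quadratic loss
    0.5 * (w - target)^2 while block b is active (an assumption of this sketch).
    """
    rng = random.Random(seed)
    num_blocks = len(client_targets)
    num_clients = len(client_targets[0])
    local_models = [0.0] * num_clients
    predictor_sum = [0.0] * num_blocks
    predictor_count = [0] * num_blocks

    for _cycle in range(num_cycles):
        for block in range(num_blocks):
            for _round in range(rounds_per_block):
                # Communication iteration: aggregate the local models.
                global_model = sum(local_models) / num_clients
                # Record the global model for the current block's predictor.
                predictor_sum[block] += global_model
                predictor_count[block] += 1
                # Push the new global model to all clients.
                local_models = [global_model] * num_clients
                # Each client runs local SGD on the current block's data.
                for _ in range(local_iters):
                    for i in range(num_clients):
                        grad = (local_models[i] - client_targets[block][i]
                                + rng.gauss(0.0, noise))
                        local_models[i] -= lr * grad
    # Each predictor is the average of the global models seen in its block.
    return [s / c for s, c in zip(predictor_sum, predictor_count)]
```

The training chain is shared by all blocks, while the averaging is kept separate per block, which matches the description of equation (7) as an average over the communication iterations of one block.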
3.2 MultiChain Parallel SGD
The block-specific predictors returned by MMPSGD only have a convergence guarantee with respect to the single optimal global model. In this subsection, we propose MCPSGD to further improve the performance of the predictors, requiring that each predictor also converges to the optimal block-specific model. With such a result, MCPSGD performs better when the datasets across the blocks are extremely heterogeneous. We note that, by results in learning theory (Yu et al., 2019b), a separate model trained only on the block’s data converges to the block’s optimal model. With this observation, we augment MMPSGD with a block-separate training strategy. The basic idea behind MCPSGD is to evaluate, for each block, the models from the block-mixed training chain (as in MMPSGD) and from a new block-separate training chain, and to use the “better” model (the model with the smaller average local loss) to update the block-specific predictor at each communication iteration.
We sketch MCPSGD in Algorithm 2. We first initialize the learning rates and for the block-mixed chain and the block-separate chains, respectively. We maintain a vector to record the latest block-specific predictors, and to record the latest block-separate models. In each communication iteration from the block , the cloud server updates the global block-mixed model and the global block-separate model by aggregating the corresponding local models collected from all clients (Lines 6 and 7), and then pushes the new global models and to all clients (Line 8). Each client evaluates the local losses of and over its local data, and sends them back to the cloud server (Line 9). The cloud server calculates the average local losses of and over all the clients, and selects the model with the smaller average loss as the interim model (Line 10). With this information, the cloud server updates the latest predictor for the current block (Line 12). Before entering a new data block (say, data block ), we reset the local block-separate model of each client to the latest global block-separate model in this block (Lines 13 to 16). After receiving the global block-mixed and block-separate models, each client runs local SGD steps in parallel until the next communication iteration (Line 18).
Finally, MCPSGD also returns predictors. According to Line 12 in Algorithm 2, the final predictor of the block should be:
(8) 
Compared with MMPSGD, MCPSGD needs to exchange extra model parameters and losses between the clients and the cloud server. Specifically, according to Theorems 1 and 2, the communication overhead of MCPSGD is times more than that in MMPSGD. Given that the number of blocks is usually small in FL, such communication overhead is acceptable in practical system deployment.
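The model-selection step of MCPSGD can be sketched as follows. This is a hedged illustration, not Algorithm 2: it assumes scalar models, assumes by analogy with equation (7) that the predictor is a running average of the interim models, and breaks ties in favour of the block-mixed model. All function names are hypothetical.

```python
def average_local_loss(model, client_loss_fns):
    """Average of the losses that the clients report for a model."""
    return sum(loss(model) for loss in client_loss_fns) / len(client_loss_fns)


def select_interim_model(mixed_model, separate_model, client_loss_fns):
    """Pick the model with the smaller average local loss.

    Ties go to the block-mixed model (an arbitrary choice in this sketch).
    """
    mixed_loss = average_local_loss(mixed_model, client_loss_fns)
    separate_loss = average_local_loss(separate_model, client_loss_fns)
    return mixed_model if mixed_loss <= separate_loss else separate_model


def update_predictor(predictor, count, interim_model):
    """Running average of interim models (assumed form of the predictor update)."""
    new_count = count + 1
    return predictor + (interim_model - predictor) / new_count, new_count


# Toy block with two clients whose quadratic losses are minimised at 0 and 2.
client_loss_fns = [lambda w, c=c: 0.5 * (w - c) ** 2 for c in (0.0, 2.0)]
interim = select_interim_model(3.0, 1.2, client_loss_fns)
```

In this toy case the block-separate model (1.2) is closer to the block optimum (1.0) than the block-mixed model (3.0), so it is selected. Each client only needs to report two scalar losses in addition to its two local models.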
4 Convergence Analysis
4.1 Convergence of MultiModel Parallel SGD
In this subsection, we bound the gap between the loss of the single optimal global model and the average loss of our final predictors over blocks. This is achieved by bounding the average loss over all the communication iterations , where denotes the global model in the communication iteration , and is the set of all communication iterations, i.e., . We note that the block index depends on given in equation (5).
The update of the global model is an aggregation of model updates in a series of successive iterations from to , each of which is further an aggregation over the local model updates from all clients. By some calculations, we can have the following relation between the global models from two successive communication iterations:
(9) 
where is the average of the local gradients of all clients at iteration , i.e., . Updating the model directly from the model in the previous communication iteration requires considering the accumulated gradient from multiple training iterations, rather than a single gradient . Thus, it is challenging to use the traditional convexity analysis technique to bound the average loss over the communication iterations. In contrast, we observe that the relation between the global models from two successive training iterations is easy to describe and analyze:
(10) 
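The two relations can be checked numerically on a toy problem. The sketch below assumes a constant learning rate, scalar models, and noiseless quadratic local losses; it tracks the average of the local models after every local step and verifies that it moves by the learning rate times the averaged local gradient at each step, and by the accumulated averaged gradients over a whole round. The function name and the toy targets are hypothetical.

```python
def run_local_iterations(start_model, targets, lr, local_iters):
    """Run local SGD on toy losses 0.5 * (w - target)^2, one target per client.

    Returns the average model after every step and the averaged local
    gradient used at every step.
    """
    local_models = [start_model] * len(targets)
    averages = [start_model]
    averaged_gradients = []
    for _ in range(local_iters):
        grads = [w - c for w, c in zip(local_models, targets)]
        averaged_gradients.append(sum(grads) / len(grads))
        local_models = [w - lr * g for w, g in zip(local_models, grads)]
        averages.append(sum(local_models) / len(local_models))
    return averages, averaged_gradients


averages, avg_grads = run_local_iterations(0.0, [1.0, 3.0, 8.0], lr=0.1, local_iters=5)
# Single-iteration relation: each average differs from the previous one
# by lr times that step's averaged gradient.
step_gaps = [abs(averages[t + 1] - (averages[t] - 0.1 * avg_grads[t]))
             for t in range(5)]
# Round-level relation: the models at two successive communication
# iterations differ by lr times the accumulated averaged gradients.
round_gap = abs(averages[-1] - (averages[0] - 0.1 * sum(avg_grads)))
```

Both gaps are zero up to floating-point error, which is why the analysis can work with the single-iteration relation and only later account for the difference between training iterations and communication iterations.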
Now, it is feasible to use the single-iteration gradient and the property of strong convexity to bound the average loss from all the training iterations. We next control the gap between over all training iterations and over all communication iterations by selecting appropriate hyperparameters. With these two steps, we can obtain the desired bound for and the convergence guarantee for the predictors returned by MMPSGD.
Theorem 1.
By setting and in MMPSGD, when , we have:
(11) 
Proof Sketch of Theorem 1.
We give a proof sketch here and defer the detailed proofs to Appendix A.
We first introduce a useful notation for the subsequent analysis:
(12) 
We then introduce some lemmas.
Lemma 2 (Bounding the variance).
Under Assumption 3, it follows that:
(14) 
Lemma 3 (Bounding the deviation of local model).
Under Assumption 4, the deviation between the local model and global model at each iteration is bounded by
Lemma 4 (Bounding the average of gradients).
Lemma 5 (Bounding the average loss of iterations).
Lemma 6 (Bounding the average loss of communication iterations).
By the above lemmas and the convexity of , and choosing and , when , we can obtain the convergence guarantee for MMPSGD:
(18)  
∎
Theorem 1 shows that MMPSGD converges at a rate of over block-cyclic data. The convergence rate is independent of the number of blocks , guaranteeing that the performance does not deteriorate as increases.
4.2 Convergence of MultiChain Parallel SGD
We now prove the convergence of MCPSGD. The main operation in MCPSGD is that in each communication iteration, for each block, we evaluate the models from the two chains and select the one with the smaller average local loss to update the predictor. With this operation, the final predictor performs at least as well as the model from either of the two chains (Lemma 8). We further show that the model from the block-mixed chain achieves a convergence rate of with respect to the single optimal global model, and the model from the block-separate chain has a convergence guarantee of with respect to the block’s optimal model (Lemma 7). Combining these steps yields the convergence result of MCPSGD.
Theorem 2.
By setting and , when , MCPSGD has the following convergence results:
(19) 
and for each block ,
(20) 
where is the block’s optimal model from the block .
Proof.
Please refer to Appendix B. ∎
5 Experiments
In this section, we consider an image classification task and present the evaluation results of MMPSGD and MCPSGD. We note that although our two algorithms focus on the convex objective in the theoretical part, they still work very well for nonconvex problems in practice.
Model and Dataset.
We take a convolutional neural network (CNN) from PyTorch’s tutorial and use the public CIFAR-10 dataset. The CNN is formed by two convolutional layers and three fully connected layers with ReLU activation, max pooling, and softmax. The CIFAR-10 dataset consists of 10 classes of 32×32 images with three RGB channels. There are 50,000 and 10,000 images for training and testing, respectively. To simulate the block-cyclic data in FL, we first partition both the training and test sets into heterogeneous blocks based on labels, where a block contains images of several labels, and different blocks may share some labels. Each data block is further distributed among clients in a non-i.i.d. and unbalanced way, where the local training sets on the clients are fetched from the block in sequence, and the sizes of the local training sets roughly follow a normal distribution with a mean of and a variance of . We implement our algorithms, test each block-specific predictor on the corresponding block’s test set, and calculate the average of the test accuracies.
Implementation Settings. For all the experiments, we set the local training batch size to 2. For MMPSGD, we use a learning rate of , and for MCPSGD, we set both learning rates for the block-mixed and block-separate chains. As the default settings, we set the number of cycles to 10 and the number of blocks to 5. Within each block, we set the total number of rounds to 200 and let each client run local iterations of SGD. Furthermore, in our experiments, we observed that later models tend to have better accuracy. Thus, we empirically took the exponentially weighted average of the historical global models to obtain the final predictors, where the base is , and the round number works as the exponent.
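The exponentially weighted average of historical global models can be sketched as follows, assuming the weight of the model from round r is the base raised to the power r. The base used in the experiments is not reproduced here, so the value in the example is purely illustrative, and both function names are hypothetical. The streaming variant avoids storing all models and avoids overflow for large round numbers.

```python
def exponentially_weighted_average(models, base):
    """Direct form: weight the model from round r (1-based) by base ** r."""
    weights = [base ** r for r in range(1, len(models) + 1)]
    total = sum(weights)
    dim = len(models[0])
    return [sum(w * m[j] for w, m in zip(weights, models)) / total
            for j in range(dim)]


def update_weighted_predictor(predictor, inv_rho, new_model, base):
    """Streaming form of the same average.

    inv_rho tracks (sum of weights so far) / (latest weight); start it at 0.0.
    """
    inv_rho = 1.0 + inv_rho / base
    rho = 1.0 / inv_rho
    predictor = [p + rho * (m - p) for p, m in zip(predictor, new_model)]
    return predictor, inv_rho


# Illustrative values only: three one-dimensional models and base 2.
history = [[1.0], [2.0], [4.0]]
direct = exponentially_weighted_average(history, base=2.0)
streamed, state = [0.0], 0.0
for model in history:
    streamed, state = update_weighted_predictor(streamed, state, model, base=2.0)
```

With a base greater than one, later models receive larger weights, which matches the observation that later models tend to be more accurate.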
Our Algorithms vs. FedAvg. We compare MMPSGD and MCPSGD with FedAvg over the block-cyclic data under the default settings. We also evaluate FedAvg over shuffled data as an ideal baseline, where we directly distribute the randomly shuffled training data among all clients without block partitioning.
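The contrast between block-partitioned and shuffled data can be sketched on synthetic labels. This is an assumption-laden illustration, not the experimental code: the label sets per block, the round-robin handling of shared labels, and the relative standard deviation `rel_std` of client dataset sizes are all hypothetical choices.

```python
import random


def partition_into_blocks(labels, block_label_sets):
    """Split sample indices into blocks by label.

    Samples of a label shared by several blocks are dealt out round-robin
    among those blocks (an assumption of this sketch).
    """
    blocks = [[] for _ in block_label_sets]
    seen_per_label = {}
    for idx, label in enumerate(labels):
        owners = [b for b, s in enumerate(block_label_sets) if label in s]
        if not owners:
            continue  # label not used by any block
        k = seen_per_label.get(label, 0)
        blocks[owners[k % len(owners)]].append(idx)
        seen_per_label[label] = k + 1
    return blocks


def split_among_clients(block_indices, labels, num_clients, rng, rel_std=0.3):
    """Give each client a contiguous, label-sorted slice of unequal size."""
    ordered = sorted(block_indices, key=lambda idx: labels[idx])
    mean_size = len(ordered) / num_clients
    raw = [max(1.0, rng.gauss(mean_size, rel_std * mean_size))
           for _ in range(num_clients)]
    scale = len(ordered) / sum(raw)
    sizes = [int(r * scale) for r in raw]
    sizes[-1] += len(ordered) - sum(sizes)  # rounding remainder goes to the last client
    shards, start = [], 0
    for size in sizes:
        shards.append(ordered[start:start + size])
        start += size
    return shards


def shuffled_split(num_samples, num_clients, rng):
    """Baseline: shuffle all indices and deal them out evenly."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[i::num_clients] for i in range(num_clients)]
```

Slicing a label-sorted block in sequence gives each client only a few labels, whereas the shuffled baseline gives every client a sample of all labels.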
We plot the evaluation results in Figure 3 and can observe that: (1) MMPSGD and MCPSGD achieve the best test accuracy of 65%; (2) FedAvg with block-cyclic data does not converge at all, fluctuating between 56% and 59%; and (3) FedAvg with the shuffled data achieves a best test accuracy of 62%, 3 percentage points lower than our algorithms. We can also clearly observe that when the number of rounds is smaller than 3000, the test accuracies of our algorithms grow in a staircase shape. This is because the performance of any block-specific predictor can be significantly improved in the first few cycles.
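Table 1: Number of rounds required to first reach a test accuracy of 60% under different numbers of local iterations.
Number of local iterations | 5 | 10 | 15 | 20
MMPSGD | 3,210 | 2,010 | 1,610 | 1,610
MCPSGD | 3,210 | 2,010 | 1,610 | 1,410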
Number of Local Iterations. We expect that our algorithms would converge in fewer rounds if we choose a larger number of local iterations . This is because the convergence rate is , where denotes the total number of local iterations, and the total number of rounds should be . However, increasing will enlarge the bias between the local models and the global model, so the number of rounds needed for convergence may not consistently decrease inversely with .
We evaluate MMPSGD and MCPSGD under different numbers of local iterations , where increases from 5 to 20 in steps of 5. Figure 3 plots the test accuracies, and Table 1 lists the number of rounds required to first achieve a test accuracy of 60%. We can see that both MMPSGD and MCPSGD generally converge in fewer rounds with a larger , and MCPSGD converges faster than MMPSGD with 20 local iterations (1,410 versus 1,610 rounds). These results conform to our expectation and analysis.
Number of Blocks. We expect that the performance of our algorithms does not deteriorate as the number of blocks increases. We let take different values for comparison, while fixing the total number of rounds in each cycle at 1000 (i.e., ). Figure 4 plots the results of our algorithms as well as the two baselines of FedAvg with cyclic data and FedAvg with shuffled data.
From Figure 4(a), we can see that the performance of FedAvg significantly deteriorates with a larger . In particular, the best accuracy of FedAvg is 61% at . But at , its accuracy fluctuates between 53% and 57%. From Figures 4(b) and 4(c), we can see that as increases, both of our algorithms perform even better. MMPSGD achieves the best test accuracies of 63% and 67% at and , respectively. MCPSGD is more stable, achieving the best test accuracy between and regardless of . We also observe some points with slight fluctuation in Figure 4(c), which is reasonable in nonconvex optimization: the two chains for each block may converge to different locally optimal models, and averaging them may yield a predictor that is neither of the locally optimal models.
Participation Rate of Clients. In the practical scenario of cross-device FL, only a subset of the clients is chosen to participate in each round of collaborative training. In this set of simulations, we set the total number of clients to 1000 and randomly select a certain fraction of them to participate in each round. The clients in our algorithms now denote the participating ones rather than the whole client pool. Figure 5 shows the evaluation results when the participation rate increases from 5% to 20% in steps of 5%. We can see that both of our algorithms are robust to the participation rate. In particular, the best test accuracies of MMPSGD are between 60% and 61%, whereas the best test accuracies of MCPSGD are higher, between 61% and 62%.
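Per-round client selection can be sketched as below, assuming uniform sampling without replacement from the client pool; the function name is hypothetical, and the exact sampling scheme used in the experiments is not specified beyond random selection of a fixed fraction.

```python
import random


def sample_participants(num_clients, participation_rate, rng):
    """Uniformly pick a fixed fraction of the client pool for one round."""
    num_selected = max(1, round(num_clients * participation_rate))
    return sorted(rng.sample(range(num_clients), num_selected))


# With 1000 clients and a 5% participation rate, 50 clients train per round.
rng = random.Random(0)
participants = sample_participants(1000, 0.05, rng)
```

The server then aggregates only the local models of the selected clients, so the number of clients in the analysis corresponds to the number of participants per round.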
6 Conclusion
In this work, we considered unbalanced, non-i.i.d., and block-cyclic data distributions in FL. Such data characteristics deteriorate the performance of conventional FL algorithms. To handle the problems introduced by cyclic data, we proposed MMPSGD and MCPSGD to obtain a series of block-specific predictors. Both MMPSGD and MCPSGD attain a convergence guarantee of , achieving a linear speedup with respect to the number of clients. MCPSGD can further guarantee that each block-specific predictor converges to the block’s optimal model at a rate of , while adding an acceptable communication overhead. Empirical studies on the CIFAR-10 dataset demonstrate the effectiveness and robustness of our algorithms.
References

Eichner et al. (2019) Eichner, H., Koren, T., McMahan, B., Srebro, N., and Talwar, K. Semi-cyclic stochastic gradient descent. In Proceedings of ICML, pp. 1764–1773, 2019.
European Parliament and Council of the European Union (2016) European Parliament and Council of the European Union. The General Data Protection Regulation (EU) 2016/679 (GDPR). https://eur-lex.europa.eu/eli/reg/2016/679/oj, April 2016. Took effect from May 25, 2018.
Ginart et al. (2019) Ginart, A. A., Guan, M., Valiant, G., and Zou, J. Making AI forget you: Data deletion in machine learning. In Proceedings of NeurIPS, 2019.
Hsieh et al. (2019) Hsieh, K., Phanishayee, A., Mutlu, O., and Gibbons, P. B. The non-iid data quagmire of decentralized machine learning. CoRR, abs/1910.00189, 2019.
Kairouz et al. (2019) Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and open problems in federated learning. CoRR, abs/1912.04977, 2019.
Li et al. (2014) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of OSDI, pp. 583–598, 2014.
Li et al. (2019) Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. CoRR, abs/1908.07873, 2019.
Lian et al. (2018) Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of ICML, pp. 3043–3052, 2018.
McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS, pp. 1273–1282, 2017.
Mohri et al. (2019) Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. In Proceedings of ICML, pp. 4615–4625, 2019.
Stich (2019) Stich, S. U. Local SGD converges fast and communicates little. In Proceedings of ICLR, 2019.
Stich et al. (2018) Stich, S. U., Cordonnier, J., and Jaggi, M. Sparsified SGD with memory. In Proceedings of NeurIPS, pp. 4452–4463, 2018.
Tang et al. (2018a) Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. Communication compression for decentralized training. In Proceedings of NeurIPS, pp. 7652–7662, 2018a.
Tang et al. (2018b) Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. $D^2$: Decentralized training over decentralized data. In Proceedings of ICML, pp. 4855–4863, 2018b.
Tang et al. (2019) Tang, H., Yu, C., Lian, X., Zhang, T., and Liu, J. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of ICML, pp. 6155–6165, 2019.
Yu & Jin (2019) Yu, H. and Jin, R. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic nonconvex optimization. In Proceedings of ICML, pp. 7174–7183, 2019.
Yu et al. (2019a) Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed nonconvex optimization. In Proceedings of ICML, pp. 7184–7193, 2019a.
Yu et al. (2019b) Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of AAAI, pp. 5693–5700, 2019b.
Zinkevich (2003) Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Fawcett, T. and Mishra, N. (eds.), Proceedings of ICML, pp. 928–936, 2003.
Appendix A Proof of Theorem 1
Proof of Lemma 1.
For any static model , we have
(21)  
Note that . We next focus on bounding and split it into three terms.
(22) 
Note that
(23)  
where depends on given in equation (5).
Proof of Lemma 2.
From Assumption 3, the variance of stochastic gradient is bounded by , then
(30)  
where (a) follows from the fact that has mean and is independent across clients. ∎
Proof of Lemma 3.
For any , there exists a largest such that for . Then, we have
(31)  
where (a)–(c) follow from the inequality , and (d) follows from Assumption 4. ∎
Proof of Lemma 4.
We focus on bounding the average of gradients. Although the sampling is block-cyclic, when we focus on a certain block in a certain cycle , it is equivalent to the non-cyclic case but with only iterations.