Decentralized optimization methods have been widely used in the machine learning community. Compared to centralized optimization methods, they enjoy several advantages, including aggregating the computing power of distributed machines, robustness to dynamic network topologies, and preserving data privacy (Yuan et al., 2016). These advantages make them an attractive option when data are collected by distributed agents and either communicating the data to a central processing agent is computationally prohibitive, or privacy concerns require each agent to keep its local data private.
After the seminal work of Tsitsiklis (1985) and Tsitsiklis et al. (1986), there has been fruitful development in distributed optimization. For non-smooth objective functions, consensus-based subgradient methods have been analyzed in Nedic and Ozdaglar (2009); Sundhar Ram et al. (2010); Nedic (2011). The dual averaging method proposed in Duchi et al. (2012) further explains the effect of the network topology on the convergence. Usually distributed subgradient methods converge at the rate of . For smooth (and possibly strongly convex) objectives, Shi et al. (2015a, b) propose an exact first-order algorithm EXTRA and its proximal variant, which attain an improved rate of for general smooth objectives, and a linear rate of for smooth and strongly convex objectives with (see also Nedich et al. (2016); Yuan et al. (2016)). Asynchronous decentralized (sub)gradient descent algorithms have also been proposed and analyzed (Nedic, 2011; Tsitsiklis et al., 1986). Another class of mainstream distributed algorithms is based on the dual method, which includes the classic idea of dual decomposition (Terelius et al., 2011) and the celebrated alternating direction method of multipliers (ADMM) (Boyd et al., 2011; Makhdoumi and Ozdaglar, 2017; Wei and Ozdaglar, 2013).
ADMM attains a convergence rate of for smooth problems and for smooth and strongly convex problems, but such results rely on strong assumptions such as having no constraints (Shi et al., 2014) or the local subproblem of each agent being easy to solve (Wei and Ozdaglar, 2013).
All the aforementioned methods can be categorized as projection-based methods, as they all require a projection back onto the feasible set at each iteration. Though commonly assumed to be easy, in numerous applications such a projection can be either computationally expensive (projection onto the trace norm ball or base polytopes (Fujishige and Isotani, 2011)) or even intractable (Collins et al., 2008). The Frank-Wolfe (FW) algorithm arises as a natural alternative in these scenarios. Unlike projection-based methods, FW assumes a linear oracle (LO) that solves a linear optimization problem over the feasible set, which can be significantly easier than the projection. We refer to algorithms that avoid projection as projection-free algorithms. The FW algorithm has been revisited in recent years for its projection-free property (Jaggi, 2013, 2011; Clarkson, 2010; Bach et al., 2012) and numerous improvements have been made. These include regularized FW (Harchaoui et al., 2015), linearly convergent variants under additional assumptions (Garber and Hazan, 2013; Lacoste-Julien and Jaggi, 2015), and stochastic variants (Hazan and Kale, 2012; Hazan and Luo, 2016).
Related Work. Despite the progress on centralized FW algorithms, results on distributed FW algorithms are surprisingly limited. Specialized versions of decentralized FW algorithms have been proposed. Wang et al. (2016) propose a distributed block-coordinate FW algorithm for block-separable feasible sets (Lacoste-Julien et al., 2013). Bellet et al. (2015) consider a Lasso-type distributed learning problem.
They neither assume nor exploit the fact that the global objective is a natural summation of each agent’s local objective, and their communication scheme also differs from the usual “within neighborhood” communication scheme and could be significantly more complicated. Lafond et al. (2016) consider a distributed FW algorithm, also for the Lasso-type problem, that leverages the sparsity of iterates to reduce communication overhead. To the best of our knowledge, the most recent distributed FW (DFW) algorithm for general smooth convex problems is by Wai et al. (2017), wherein the DFW convergence rate is for smooth objectives, and for smooth and strongly convex objectives under the assumption that the minimizer lies in the interior of the constraint set. This assumption is rarely realistic, for two reasons: it implies the problem is essentially unconstrained, so the constraint usually fails to impose structural properties (such as sparsity or low rank) on the solution; and for an unconstrained problem the vanilla distributed gradient descent algorithm suffices to solve the problem efficiently (Yuan et al., 2016). Whether such a restrictive assumption can be removed while retaining the same complexity remains an open question.
We should note that, for each of the previously discussed methods, the communication complexity and the projection/LO complexity needed to obtain an -optimal solution are the same, regardless of whether the method is projection-based or projection-free. In practice, however, the time consumed by a single communication and by a single LO/projection often differ by orders of magnitude, which could incur significant latency. Modern CPUs perform IO at over 10 GB/s, yet communication over TCP/IP is about 10 MB/s; this gap is even more significant when the LO is already cheap. Consider the matrix completion problem: in Section 5 we will show that for an iteration consisting of one round of communication and one LO, communication would take up over of the time.
This implies that in DFW the actual running time would be largely consumed by communication. To alleviate the problem of latency in communication-expensive applications, whether it is possible to trade a moderately worse LO complexity for a better communication complexity becomes another open question.
Contributions. In this paper we answer the above-mentioned questions with an (almost) affirmative answer. Our contributions are the following:
We propose a new distributed projection-free algorithm named Decentralized Conditional Gradient Sliding (DCGS), and show that it attains communication complexity and LO complexity for smooth convex objectives.
Without assuming that the minimizer lies in the interior of the feasible set, we show that DCGS attains communication complexity and LO complexity for smooth and strongly convex objectives.
Our algorithm builds upon a distributed version of the primal-dual algorithm and is hence modular. As a consequence, improvements on centralized FW can be easily exploited by DCGS. We demonstrate this advantage when the feasible set is polyhedral: for smooth and convex objectives the LO complexity can be reduced to , while for smooth and strongly convex objectives the LO complexity can be further reduced to , which matches the result of Wai et al. (2017), but without the restrictive assumption on the minimizer. Throughout this paper we use $\tilde{O}(\cdot)$ to hide any additional logarithmic factor.
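To make the linear oracle from the introduction concrete, the following is a minimal sketch for one common feasible set, the $\ell_1$ ball used in Lasso-type problems. The function name `l1_ball_linear_oracle` and the `radius` parameter are assumptions introduced for illustration only. The point is that the oracle needs a single pass over the gradient, whereas a Euclidean projection onto the same set typically requires sorting or thresholding.

```python
# Illustrative sketch: linear oracle (LO) over the l1 ball {s : ||s||_1 <= radius}.
# The name and signature are assumptions for this example, not taken from the paper.

def l1_ball_linear_oracle(gradient, radius):
    """Return a minimizer of <gradient, s> over the l1 ball of the given radius.

    A linear function over a polytope is minimized at a vertex. For the l1 ball
    this means putting all mass on the coordinate with the largest absolute
    gradient entry, with sign opposite to that entry.
    """
    best = max(range(len(gradient)), key=lambda i: abs(gradient[i]))
    vertex = [0.0] * len(gradient)
    if gradient[best] != 0.0:
        vertex[best] = -radius if gradient[best] > 0 else radius
    return vertex


grad = [0.5, -2.0, 1.0]
s = l1_ball_linear_oracle(grad, radius=3.0)
print(s)  # [0.0, 3.0, 0.0]
```

The returned point has a single non-zero entry, which is also why FW-type iterates on such sets tend to be sparse.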
2 Problem formulation
We consider an undirected graph , where denotes the vertex set and denotes the edge set. Each node is associated with an agent indexed also by , and has its local objective . We define the neighborhood of agent to be . Each agent can only communicate information with its neighbors. Naturally, is assumed to be connected or otherwise distributed optimization is impossible. Our objective is to minimize the summation of the local objectives, subject to the constraint that belongs to a closed compact convex set , that is:
We assume each function is (possibly )-strongly convex and -smooth, i.e., . Our algorithm could also be easily adapted to the setting where each has its own smoothness and strong convexity parameters. We present here only the homogeneous case for simplicity. The distributed formulation (1) can be reformulated as the following linearly constrained optimization problem. Suppose each agent keeps its local copy of the decision variable ; we can then impose a linear constraint on so that for all . Define the graph Laplacian to be:
Then (1) could be reformulated as:
Let denote the Kronecker product; here and . Since is assumed to be connected, (1) and (3) are equivalent. We can further reformulate the linearly constrained problem as a bilinear saddle point problem. Observe that (3) is equivalent to:
The bilinear saddle point problem (4) is well suited for the primal-dual algorithm proposed in Chambolle and Pock (2011). We present the original primal-dual algorithm applied to our problem in Algorithm 1.
Lan et al. (2017) observe that since is a summation split across agents, all the updates in the primal-dual algorithm can be performed in a distributed way. They propose a distributed primal-dual algorithm and show that to find an -optimal solution, one needs rounds of communication for a non-smooth convex objective and for a non-smooth strongly convex objective, which improves upon the previous results. However, their algorithm still lies in the category of projection-based algorithms, and they consider non-smooth problems, which differs from our setting.
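The consensus reformulation above can be checked numerically on a small example. The sketch below assumes the standard combinatorial Laplacian (node degree on the diagonal, $-1$ for every edge) and a cycle network like the one used later in the experiments; the helper names `cycle_laplacian` and `consensus_residual` are hypothetical. It applies the Kronecker-structured operator to the stacked local copies block by block, so that a zero result corresponds to all agents holding the same vector when the graph is connected.

```python
# Illustrative sketch, assuming the standard combinatorial graph Laplacian.
# Helper names are hypothetical and chosen for this example only.

def cycle_laplacian(m):
    """Laplacian of an m-node cycle: degree on the diagonal, -1 for each edge."""
    lap = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in ((i - 1) % m, (i + 1) % m):
            if j != i:
                lap[i][j] = -1
        lap[i][i] = -sum(lap[i])  # diagonal is still 0 here, so this is the degree
    return lap


def consensus_residual(lap, local_copies):
    """Apply (L kron I_d) to the stacked local copies.

    Block i of the result is sum_j L[i][j] * x_j, which agent i can compute
    using only its own copy and the copies of its neighbors.
    """
    m, d = len(local_copies), len(local_copies[0])
    return [
        [sum(lap[i][j] * local_copies[j][k] for j in range(m)) for k in range(d)]
        for i in range(m)
    ]


lap = cycle_laplacian(4)
agree = [[1.0, -2.0] for _ in range(4)]
disagree = [[1.0, -2.0], [1.0, -2.0], [0.0, -2.0], [1.0, -2.0]]
print(consensus_residual(lap, agree))     # every block is zero
print(consensus_residual(lap, disagree))  # some block is non-zero
```

Because each block only involves a node and its neighbors, evaluating this operator costs one round of within-neighborhood communication.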
In this section we present in Algorithm 2 the Decentralized Conditional Gradient Sliding for a general convex feasible set equipped with a linear oracle. At a high level, DCGS is closely related to the Conditional Gradient Sliding (CGS) algorithm proposed in Lan and Zhou (2016). However, CGS considers only the primal problem, whereas here we consider a primal-dual problem because we perform distributed optimization. As such, the analysis is significantly more involved.
The most important step of the DCGS algorithm is in Line 8, where we update the decision variable by calling the CG procedure defined in Line 12. If we define , then the CG procedure can be viewed as the FW algorithm applied to , with termination criterion , where the left-hand side is often called the Wolfe gap. Below we make a few remarks on the communication mechanism, the main technical challenges and the modularity of our algorithm.
Communication Mechanism. At each outer iteration , each local agent first computes based on an extrapolation of the two previous primal iterates, and broadcasts it to all of its neighbors . After one round of broadcasting, each agent uses received from its neighbors to perform the dual variable update , then broadcasts the updated dual variable to all of its neighbors . After the second round of broadcasting, each agent uses received from its neighbors and calls the CG procedure to update the primal variable . Each iteration incurs two rounds of communication within the network, hence the overall communication complexity is of the same order as the outer iteration complexity.
Trade-off between Communication and LO. If we set in Line 8, we are solving problem exactly. The outer iteration of DCGS then reduces to the primal-dual algorithm applied to our problem (4), implemented in a distributed fashion. By well-known results on the primal-dual algorithm (Chambolle and Pock, 2011; Lan et al., 2017), to get an -optimal solution one needs iterations for a convex objective and iterations for a strongly convex objective. From our previous discussion, this yields a communication complexity for DCGS of and respectively. However, in this extreme case the number of LO calls in the CG procedure would be prohibitively large.
Consequently, we need to choose carefully to ensure that the subproblem is solved in a controlled way: the outer iteration should converge at approximately the same speed as when the subproblem is solved exactly, while the LO complexity of the CG procedure remains relatively small.
Modularity. The CG procedure in the DCGS algorithm is where all the calls to the LO take place. We believe there is not much room for improvement in terms of the communication complexity, as our complexity in the general case matches that of the DFW algorithm under the additional (very strong) assumption that the optimal solution is in the interior of the feasible set. The room for improvement then lies in possibly reducing the LO complexity. If we treat the CG procedure as a module in the DCGS algorithm, can we replace it with a module that runs much faster for specific objectives or feasible sets, and obtain a better DCGS variant? The answer is affirmative. As an example we will show that a significant improvement in LO complexity can be made when the feasible set is polyhedral.
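The role of the CG procedure and its tolerance can be illustrated with a small sketch. Since the subproblem definition is not reproduced in this text, the code assumes the quadratic form used in Conditional Gradient Sliding (Lan and Zhou, 2016), namely $\phi(u) = \langle g, u\rangle + \frac{\beta}{2}\|u - c\|^2$, over an $\ell_1$ ball; the names `cg_procedure`, `l1_vertex`, `center`, `beta`, `eta` and `radius` are assumptions for this example. The inner loop is plain Frank-Wolfe with exact line search, stopped once the Wolfe gap drops below `eta`.

```python
# Illustrative sketch of a CG (Frank-Wolfe) inner loop with a Wolfe-gap stopping rule.
# The subproblem form and all names are assumptions, not the paper's exact procedure.

def l1_vertex(direction, radius):
    """Linear oracle over the l1 ball: a minimizer of <direction, s>."""
    best = max(range(len(direction)), key=lambda i: abs(direction[i]))
    vertex = [0.0] * len(direction)
    if direction[best] != 0.0:
        vertex[best] = -radius if direction[best] > 0 else radius
    return vertex


def cg_procedure(g, center, beta, eta, radius, max_iters=10000):
    """Approximately minimize phi(u) = <g, u> + (beta / 2) * ||u - center||^2.

    Stops when the Wolfe gap max_s <grad phi(u), u - s> is at most eta and
    returns the iterate together with the number of LO calls used.
    """
    u = [0.0] * len(g)  # the origin is feasible for the l1 ball
    for lo_calls in range(1, max_iters + 1):
        grad = [g_i + beta * (u_i - c_i) for g_i, u_i, c_i in zip(g, u, center)]
        v = l1_vertex(grad, radius)
        wolfe_gap = sum(gr * (u_i - v_i) for gr, u_i, v_i in zip(grad, u, v))
        if wolfe_gap <= eta:
            return u, lo_calls
        d = [v_i - u_i for v_i, u_i in zip(v, u)]
        # Exact line search for a quadratic; wolfe_gap > 0 guarantees d != 0.
        step = min(1.0, wolfe_gap / (beta * sum(d_i * d_i for d_i in d)))
        u = [u_i + step * d_i for u_i, d_i in zip(u, d)]
    return u, max_iters


g, center = [1.0, -3.0, 0.5], [0.2, 0.1, -0.4]
for eta in (0.5, 0.05):
    u, lo_calls = cg_procedure(g, center, beta=2.0, eta=eta, radius=1.0)
    print(f"eta={eta}: {lo_calls} LO calls, l1 norm {sum(abs(x) for x in u):.3f}")
```

A looser `eta` never uses more LO calls than a tighter one on the same subproblem, because the iterates are identical until the first stopping test succeeds. This is the knob behind the communication versus LO trade-off described above.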
4 Theoretical Results
4.1 General Feasible Set
In this section we set suitable parameters to DCGS for convex and strongly convex objectives. We will present its convergence results, communication and LO complexity. We also present a detailed comparison with results of DFW in (Wai et al., 2017).
Theorem 1 (Convergence for Smooth and Convex Objectives).
Set in Algorithm 2, where denotes the spectral norm of the Laplacian matrix . Assume each is -smooth. Then we have:
From (5) it is straightforward to establish communication complexity. Note that it only depends on the number of agents and the network topology, and is independent of the objective function , which is a feature that DFW does not have.
Corollary 1 (Complexity for Smooth and Convex Objectives).
Under the same conditions as in Theorem 1, to get a solution such that , the number of communications and LO for each agent are respectively bounded by:
Detailed Comparison. DFW in Wai et al. (2017) has communication and LO complexity both bounded by , where denotes the diameter of the feasible set and inversely relates to the spectral gap of the weighted communication matrix. If we set , and observe that , then it can be seen that our algorithm is at least times better in terms of the communication complexity, and at most worse in LO complexity. Suppose that, in an application, the agent network is set beforehand so that can be treated as constants. Then, as the objective becomes increasingly non-smooth, our algorithm outperforms DFW by a factor of in communication, with a worsened LO complexity. For tasks where communication is time consuming but the LO is much cheaper (e.g., matrix completion), such a trade-off can be significant, especially when we are not solving for a high-precision solution.
Theorem 2 (Convergence for Smooth and Strongly Convex Objectives).
Set in Algorithm 2. Assume is -strongly convex and -smooth. Then we have:
Corollary 2 (Complexity for Smooth and Strongly Convex Objective).
Under the same conditions as in Theorem 2, to get a solution such that , the number of communications and LO for each agent can be respectively bounded by:
Detailed Comparison. DFW in Wai et al. (2017) has both communication and LO complexity bounded by , but requires the minimizer to be bounded away from the boundary of the feasible set. Our complexity result does not rely on this unrealistic assumption, which often fails, especially when the constraint should be active to impose a structural assumption (e.g., sparsity) on the solution. It is then fair to compare with their result in the convex setting. Our result can be viewed as trading a moderately worse LO complexity, from to , for a better communication complexity, from to .
4.2 Polyhedral Feasible Sets
The trade-off between communication and LO is, however, almost unnecessary when the feasible set is polyhedral. Specifically, DCGS achieves the same communication and LO complexity (up to an additional logarithmic factor), regardless of whether the objective is strongly convex or not. This improvement is a direct result of the modularity of DCGS: we replace the CG procedure in DCGS with a faster one adapted from Lacoste-Julien and Jaggi (2015). We present the modified DCGS for a polyhedral feasible set in Algorithm 3.
Corollary 3 (Smooth and Convex: polyhedral set).
Improvements. Observe that the LO complexity is now of the same order of magnitude as in Wai et al. (2017) in terms of dependence on . If we set , the LO complexity of DCGS in this setting is at most worse than in Wai et al. (2017), and could even be better when the network is poorly connected (so that is large). Our complexity also depends on , which could be interpreted as the condition number of the polyhedral constraint set .
Corollary 4 (Smooth and strongly convex: polyhedral set).
Improvements. The LO complexity is now improved to , which is of the same order of magnitude as in Wai et al. (2017) in terms of dependence on , but this result makes no assumption on the minimizer. If we choose , our LO complexity reduces to: which has a clean interpretation in terms of its dependence on the condition number of the objective and the condition number of the polyhedral constraint.
5 Experimental Results
In this section we present experiments comparing DCGS with the existing distributed FW algorithm in Wai et al. (2017) and demonstrate the superiority of our algorithm. For both of the following experiments, we set the number of agents , and the associated network to be a -connected cycle, i.e., each agent is connected to the previous agent and the next agent .
Lasso. We compare DCGS and DFW applied to the Lasso problem on a synthetic dataset and the E2006-tfidf dataset (Kogan et al., 2009). The Lasso problem is formulated as:
A similar experiment was also conducted in Lacoste-Julien and Jaggi (2015), which showed linear convergence of an FW variant over a polyhedral set. For synthetic data, we generate samples, with sampled i.i.d. from and . We generate with randomly selected non-zero entries, and with . For the real dataset, and we randomly draw samples from the entire E2006-tfidf dataset. We set and distribute data evenly across agents. The results are presented in Figure 1.
We observe a significant performance improvement of DCGS over DFW. For the synthetic data, Figure 1(a) and 1(b) show that DCGS, with a moderate number of LO calls, converges to a high-quality solution whose loss is orders of magnitude better than that of DFW. The gap in communication complexity is even more significant. The DFW algorithm takes more than rounds of communication while DCGS only takes rounds. We observe a similar performance gap on the E2006-tfidf dataset in Figure 1(c) and 1(d).
Matrix completion. We compare DCGS and DFW applied to matrix completion problems on a synthetic dataset and the MovieLens 100K dataset (Harper and Konstan, 2015). We remark that matrix completion is in fact a communication-expensive task. As a toy example, consider a matrix which takes MB of memory. Sending this matrix at a network speed of 10 MB/s takes seconds, whereas an LO on a 4-core machine with an Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz processor and 16GB RAM takes less than seconds. This means that for algorithms such as DFW, over of the running time would be spent waiting for the communication to complete. We present our simulation results in Figure 2. For the synthetic dataset we generate the ground truth matrix with and rank . We randomly sample entries and observe with . For the MovieLens 100K dataset we want to recover the rating matrix with , and we observe ratings. We set and run the algorithms on the following objective:
For the synthetic data, Figure 2(a) and 2(b) show that DCGS and DFW need a comparable number of LO calls to converge to a moderate-precision solution; however, DFW takes significantly more rounds of communication (). Since communication is the main computational bottleneck, as discussed above, DCGS would significantly outperform DFW in terms of the actual running time. We observe a similar performance gap on the MovieLens 100K dataset in Figure 2(c) and 2(d). Our experimental results thus suggest the applicability of DCGS in communication-expensive applications.
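The back-of-the-envelope argument about communication dominating the running time can be written out as a short calculation. The matrix size, entry width and LO time below are hypothetical values chosen purely for illustration and are not the figures measured in the experiments; only the 10 MB/s network speed comes from the text. The helper name `communication_share` is likewise an assumption.

```python
# Illustrative arithmetic with hypothetical inputs; not the paper's measured values.

def communication_share(rows, cols, bytes_per_entry, network_mb_per_s, lo_seconds):
    """Fraction of one iteration (one communication round plus one LO) spent communicating."""
    megabytes = rows * cols * bytes_per_entry / 1e6
    comm_seconds = megabytes / network_mb_per_s
    return comm_seconds / (comm_seconds + lo_seconds)


# Hypothetical dense 1000 x 1000 matrix of 8-byte floats and a 0.05 s linear oracle.
share = communication_share(
    rows=1000, cols=1000, bytes_per_entry=8, network_mb_per_s=10.0, lo_seconds=0.05
)
print(f"communication share of one iteration: {share:.1%}")
```

Under these assumed numbers, sending the matrix takes 0.8 seconds, so a method that performs one communication round per LO call spends most of each iteration waiting on the network.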
In this paper, we propose a communication-efficient, distributed projection-free algorithm called DCGS. We show that DCGS is communication efficient in the convex and strongly convex settings without the restrictive assumptions of existing work, and demonstrate the superiority of DCGS in communication-expensive learning tasks such as matrix completion. We also show that DCGS can be further improved when the feasible set is polyhedral, which is also validated by our numerical experiments. Future research directions include developing an asynchronous DCGS variant and extending DCGS to non-convex settings.
- Bach et al. (2012) Bach, F., Lacoste-Julien, S., Obozinski, G., 2012. On the equivalence between herding and conditional gradient algorithms. ICML’12. Omnipress, USA, pp. 1355–1362.
- Bellet et al. (2015) Bellet, A., Liang, Y., Garakani, A. B., Balcan, M.-F., Sha, F., 2015. A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning. pp. 478–486.
- Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., Jan. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), 1–122.
- Chambolle and Pock (2011) Chambolle, A., Pock, T., May 2011. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40 (1), 120–145.
- Clarkson (2010) Clarkson, K. L., Sep. 2010. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Trans. Algorithms 6 (4), 63:1–63:30.
- Collins et al. (2008) Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P. L., Jun. 2008. Exponentiated gradient algorithms for conditional random fields and max-margin markov networks. J. Mach. Learn. Res. 9, 1775–1822.
- Duchi et al. (2012) Duchi, J. C., Agarwal, A., Wainwright, M. J., March 2012. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), 592–606.
- Fujishige and Isotani (2011) Fujishige, S., Isotani, S., 2011. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization.
- Garber and Hazan (2013) Garber, D., Hazan, E., Jan. 2013. A Linearly Convergent Conditional Gradient Algorithm with Applications to Online and Stochastic Optimization. ArXiv e-prints.
- Harchaoui et al. (2015) Harchaoui, Z., Juditsky, A., Nemirovski, A., Aug 2015. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming 152 (1), 75–112.
- Harper and Konstan (2015) Harper, F. M., Konstan, J. A., Dec. 2015. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst. 5 (4), 19:1–19:19.
- Hazan and Kale (2012) Hazan, E., Kale, S., 2012. Projection-free online learning. ICML’12. Omnipress, USA, pp. 1843–1850.
- Hazan and Luo (2016) Hazan, E., Luo, H., 2016. Variance-reduced and projection-free stochastic optimization. ICML’16. JMLR.org, pp. 1263–1271.
- Jaggi (2011) Jaggi, M., 2011. Sparse convex optimization methods for machine learning. Ph.D. thesis, ETH Zurich.
- Jaggi (2013) Jaggi, M., 2013. Revisiting frank-wolfe: Projection-free sparse convex optimization. ICML’13. JMLR.org, pp. I–427–I–435.
- Kogan et al. (2009) Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S., Smith, N. A., 2009. Predicting risk from financial reports with regression. NAACL ’09. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 272–280.
- Lacoste-Julien and Jaggi (2015) Lacoste-Julien, S., Jaggi, M., 2015. On the global linear convergence of frank-wolfe optimization variants. NIPS’15. MIT Press, Cambridge, MA, USA, pp. 496–504.
- Lacoste-Julien et al. (2013) Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P., 2013. Block-coordinate frank-wolfe optimization for structural svms. ICML’13. JMLR.org, pp. I–53–I–61.
- Lafond et al. (2016) Lafond, J., Wai, H. T., Moulines, E., March 2016. D-fw: Communication efficient distributed algorithms for high-dimensional sparse optimization. In: ICASSP’2016. pp. 4144–4148.
- Lan et al. (2017) Lan, G., Lee, S., Zhou, Y., Jan. 2017. Communication-Efficient Algorithms for Decentralized and Stochastic Optimization. ArXiv e-prints.
- Lan and Zhou (2016) Lan, G., Zhou, Y., 2016. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization 26 (2), 1379–1409.
- Makhdoumi and Ozdaglar (2017) Makhdoumi, A., Ozdaglar, A., Oct 2017. Convergence rate of distributed admm over networks. IEEE Transactions on Automatic Control 62 (10), 5082–5095.
- Nedic (2011) Nedic, A., June 2011. Asynchronous broadcast-based convex optimization over a network. IEEE Transactions on Automatic Control 56 (6), 1337–1351.
- Nedic and Ozdaglar (2009) Nedic, A., Ozdaglar, A., Jan 2009. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54 (1), 48–61.
- Nedich et al. (2016) Nedich, A., Olshevsky, A., Shi, W., Jul. 2016. Achieving Geometric Convergence for Distributed Optimization over Time-Varying Graphs. ArXiv e-prints.
- Shi et al. (2015a) Shi, W., Ling, Q., Wu, G., Yin, W., 2015a. Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25 (2), 944–966.
- Shi et al. (2015b) Shi, W., Ling, Q., Wu, G., Yin, W., Nov 2015b. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing 63 (22), 6013–6023.
- Shi et al. (2014) Shi, W., Ling, Q., Yuan, K., Wu, G., Yin, W., April 2014. On the linear convergence of the admm in decentralized consensus optimization. IEEE Transactions on Signal Processing 62 (7), 1750–1761.
- Sundhar Ram et al. (2010) Sundhar Ram, S., Nedic, A., Veeravalli, V. V., Dec 2010. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications 147 (3), 516–545.
- Terelius et al. (2011) Terelius, H., Topcu, U., Murray, R. M., 2011. Decentralized multi-agent optimization via dual decomposition. IFAC Proceedings Volumes 44 (1), 11245 – 11251, 18th IFAC World Congress.
- Tsitsiklis et al. (1986) Tsitsiklis, J., Bertsekas, D., Athans, M., Sep 1986. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control 31 (9), 803–812.
- Tsitsiklis (1985) Tsitsiklis, J. N., 1985. Problems in decentralized decision making and computation. Ph.D. thesis, Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
- Wai et al. (2017) Wai, H. T., Lafond, J., Scaglione, A., Moulines, E., Nov 2017. Decentralized Frank–Wolfe algorithm for convex and nonconvex problems. IEEE Transactions on Automatic Control 62 (11), 5522–5537.
- Wang et al. (2016) Wang, Y.-X., Sadhanala, V., Dai, W., Neiswanger, W., Sra, S., Xing, E., 20–22 Jun 2016. Parallel and distributed block-coordinate frank-wolfe algorithms. Vol. 48 of Proceedings of Machine Learning Research. PMLR, New York, New York, USA, pp. 1548–1557.
- Wei and Ozdaglar (2013) Wei, E., Ozdaglar, A., Jul. 2013. On the O(1/k) Convergence of Asynchronous Distributed Alternating Direction Method of Multipliers. ArXiv e-prints.
- Yuan et al. (2016) Yuan, K., Ling, Q., Yin, W., 2016. On the convergence of decentralized gradient descent. SIAM Journal on Optimization 26 (3), 1835–1854.
Appendix A Proof of Main Theorem
a.1 Proof of Theorem 1
In this subsection we prove the convergence result for smooth and convex objectives. Denote and recall the saddle point problem defined in (4). For , we define the primal-dual gap function to be:
Note that if is a saddle point of (4), then and for any . It is then natural to measure the quality of a solution to problem (4) by . To handle the unboundedness of here, we define the modified gap function to be:
In fact, we have the following proposition.
Proposition 1 (Lan et al., 2017).
If we have for any , then we must have and .
This claim is straightforward to establish from the definition of and an argument by contradiction. By construction of in Algorithm 2 we know that:
with the output satisfying the inequality . Since is strongly convex, we have: . Combining these two inequalities with some algebraic rearrangement yields the following:
Summing up the previous two inequalities and using the definition of we have:
We denote the right-hand side of the previous relation by , and handle the weighted sums of the first three terms in separately. For the first term:
where (21) follows from the definition of . In (22) we use the condition , which follows from our parameter setting. (23) comes from telescoping the first summation in (22) and the condition that . We can bound the weighted sum of the second term in (20) by:
We further rewrite the first summation term in (26) as the following:
The summation in the second line can in fact be upper bounded by as follows:
Our next objective is to bound the right-hand side of (31) by a linear function of . Collecting all the linear terms in after some rearrangement, we get: