Straggler Mitigation in Distributed Optimization Through Data Encoding

by Can Karakus et al., November 14, 2017

Slow-running, or straggler, tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling by embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternative approach in which we embed the redundancy directly in the data itself, and allow the computation to proceed completely obliviously to the encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample-path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data-replication strategies.


1 Introduction

Solving large-scale optimization problems has become feasible through distributed implementations. However, the efficiency can be significantly hampered by slow processing nodes, network delays or node failures. In this paper we develop an optimization framework based on encoding the dataset, which mitigates the effect of straggler nodes in the distributed computing system. Our approach can be readily adapted to the existing distributed computing infrastructure and software frameworks, since the node computations are oblivious to the data encoding.

In this paper, we focus on problems of the form

$$\min_{w \in \mathbb{R}^p} f(w) := \tfrac{1}{2}\|Xw - y\|^2, \tag{1}$$

where $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$ represent the data matrix and vector, respectively. The function $f$ is mapped onto a distributed computing setup depicted in Figure 1, consisting of one central server and $m$ worker nodes, which collectively store the row-partitioned matrix $X$ and vector $y$. We focus on batch, synchronous optimization methods, where delayed or failed nodes can significantly slow down the overall computation. Note that asynchronous methods are inherently robust to delays caused by stragglers, although their convergence rates can be worse than those of their synchronous counterparts. Our approach consists of adding redundancy by encoding the data $X$ and $y$ into $SX$ and $Sy$, respectively, where $S \in \mathbb{R}^{\beta n \times n}$ is an encoding matrix with redundancy factor $\beta > 1$, and solving the effective problem

$$\min_{w \in \mathbb{R}^p} \tilde f(w) := \tfrac{1}{2}\|S(Xw - y)\|^2 \tag{2}$$

instead. In doing so, we proceed with the computation in each iteration without waiting for the stragglers, with the idea that the inserted redundancy will compensate for the lost data. The goal is to design the matrix $S$ such that, when the nodes obliviously solve the problem (2) while waiting for only the fastest $k$ of the $m$ nodes (where $k$ is a design parameter), the achieved solution approximates the original solution $w^* = \arg\min_w f(w)$ sufficiently closely. Since in large-scale machine learning and data analysis tasks one is typically not interested in the exact optimum, but rather in a "sufficiently good" solution that achieves a low generalization error, such an approximation is acceptable in many scenarios. Note also that the use of such a technique does not preclude other, non-coding straggler-mitigation strategies (e.g., Yadwadkar et al. [2016], Wang et al. [2015], Ananthanarayanan et al. [2013], and references therein), which can still be implemented on top of the redundancy embedded in the system to potentially further improve performance.
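To make the encoding step concrete, the following minimal sketch forms an encoded pair $(SX, Sy)$ and compares the least-squares solutions of the original and encoded problems; the dimensions, the Gaussian choice of $S$, and its scaling are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch of the data-encoding idea; dimensions, the Gaussian S, and
# its scaling are illustrative assumptions, not the paper's exact construction.
import numpy as np

rng = np.random.default_rng(0)
n, p, beta = 200, 20, 2                       # samples, features, redundancy factor
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Encoding matrix S with redundancy factor beta, scaled so that E[S^T S] = I.
S = rng.standard_normal((beta * n, n)) / np.sqrt(beta * n)
X_enc, y_enc = S @ X, S @ y                   # workers only ever see rows of (X_enc, y_enc)

w_orig, *_ = np.linalg.lstsq(X, y, rcond=None)
w_enc, *_ = np.linalg.lstsq(X_enc, y_enc, rcond=None)

f = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
# The gap shrinks as S^T S concentrates around the identity (larger n, beta).
print("relative objective gap:", f(w_enc) / f(w_orig) - 1.0)
```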

Focusing on the gradient descent and L-BFGS algorithms, we show that under a spectral condition on the encoding matrix $S$, one can achieve an approximation of the solution of (1) by solving (2) without waiting for the stragglers. We show that with sufficient redundancy embedded, and with updates from a sufficiently large, yet strict, subset of the nodes in each iteration, it is possible to deterministically achieve linear convergence to a neighborhood of the solution, as opposed to convergence in expectation (see Fig. 4). Further, one can tighten the approximation guarantee by increasing the redundancy and the number of node updates waited for in each iteration. Another potential advantage of this strategy is privacy, since the nodes do not have access to the raw data itself, yet can still perform the optimization task over the encoded data to achieve an approximate solution.

Although in this paper we focus on quadratic objectives and two specific algorithms, in principle our approach can be generalized to more general, potentially non-smooth, objectives and to constrained optimization problems, as we discuss in Section 4 (adding a regularization term is also a simple generalization).

Our main contributions are as follows. (i) We demonstrate that gradient descent (with constant step size) and L-BFGS (with line search), applied in a coding-oblivious manner on the encoded problem, achieve (universal) sample-path linear convergence to an approximate solution of the original problem, using only a fraction of the nodes at each iteration. (ii) We present three classes of coding matrices, namely equiangular tight frames (ETFs), fast transforms, and random matrices, and discuss their properties. (iii) We provide experimental results demonstrating the advantage of the approach over uncoded ($S = I$) and data replication strategies, for ridge regression using synthetic data on an AWS cluster, as well as matrix factorization for the MovieLens 1M recommendation task.

Related work.

The use of data replication to mitigate the straggler problem has been proposed and studied in Wang et al. [2015], Ananthanarayanan et al. [2013], and references therein. Additionally, the use of coding in distributed computing has been explored in Lee et al. [2016], Dutta et al. [2016]. However, these works focus exclusively on coding at the computation level, i.e., certain linear computational steps are performed in a coded manner, and explicit encoding/decoding operations are performed at each step. Specifically, Lee et al. [2016] used MDS-coded distributed matrix multiplication, and Dutta et al. [2016] focused on breaking up large dot products into shorter dot products and performing redundant copies of the short dot products to provide resilience against stragglers. Tandon et al. [2016] consider a gradient descent method on an architecture where each data sample is replicated across nodes, and design a code such that the exact gradient can be recovered as long as fewer than a certain number of nodes fail. However, in order to recover the exact gradient under any potential set of stragglers, the required redundancy factor is on the order of the number of straggling nodes, which could mean a large overhead for a large-scale system. In contrast, we show that one can converge to an approximate solution with a redundancy factor that is independent of the network size and problem dimensions (e.g., the modest redundancy used in Section 5).

Our technique is also closely related to randomized linear algebra and sketching techniques Mahoney et al. [2011], Drineas et al. [2011], Pilanci and Wainwright [2015], used for dimensionality reduction of large convex optimization problems. The main difference between this literature and the proposed coding technique is that the former focuses on reducing the problem dimensions to lighten the computational load, whereas coding increases the dimensionality of the problem to provide robustness. As a result of the increased dimensions, coding can provide a much closer approximation to the original solution compared to sketching techniques.

2 Encoded Optimization Framework

Figure 1 shows a typical data-distributed computational model in large-scale optimization (left), as well as our proposed encoded model (right). Our computing network consists of $m$ machines, where machine $i$ stores the $i$th row block of the encoded data, $(S_iX, S_iy)$. The optimization process is oblivious to the encoding, i.e., once the data is stored at the nodes, the optimization algorithm proceeds exactly as if the nodes contained the uncoded, raw data $(X_i, y_i)$.

Figure 1: Left: Uncoded distributed optimization, where $X$ and $y$ are partitioned row-wise into blocks $(X_i, y_i)$ stored at the worker nodes. Right: Encoded distributed optimization, where node $i$ stores $(S_iX, S_iy)$ instead of $(X_i, y_i)$. The uncoded case corresponds to $S = I$.

In each iteration $t$, the central server broadcasts the current estimate $w_t$, and each worker machine $i$ computes and sends to the server the gradient term corresponding to its own partition, $\nabla f_i(w_t)$, where $f_i(w) := \tfrac{1}{2}\|S_i(Xw - y)\|^2$.

Note that this framework of distributed optimization is typically communication-bound, where communication over a few slow links constitutes a significant portion of the overall computation time. We consider a strategy where at each iteration $t$, the server only uses the gradient updates from the first $k$ nodes to respond in that iteration, thereby preventing such slow links and straggler nodes from stalling the overall computation:

$$\tilde g_t = \sum_{i \in A_t} \nabla f_i(w_t),$$

where $A_t \subseteq \{1,\dots,m\}$, with $|A_t| = k$, is the set of indices of the first $k$ nodes to respond at iteration $t$. (Similarly, $S_{A_t}$ denotes the submatrix of $S$ formed by the row blocks indexed by $A_t$.) Given the gradient approximation $\tilde g_t$, the central server then computes a descent direction through the history of gradients and parameter estimates. For the remaining nodes $i \notin A_t$, the server can either send an interrupt signal or simply drop their updates upon arrival, depending on the implementation.
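The per-iteration aggregation can be simulated in a single process as follows; the block names, the random permutation used to model the "first $k$ responders," and the partial-objective form are assumptions made for illustration.

```python
# Single-process sketch of one iteration of the k-of-m gradient aggregation.
# Worker i holds the i-th row block (X_i, y_i) of the *encoded* data; the
# random "first k responders" model below is an illustrative assumption.
import numpy as np

def coded_gradient(X_blocks, y_blocks, w, k, rng):
    """Sum the partial gradients of the first k workers to respond."""
    m = len(X_blocks)
    A_t = rng.permutation(m)[:k]           # indices of the k fastest workers
    g = np.zeros_like(w)
    for i in A_t:                          # gradient of 0.5*||X_i w - y_i||^2
        g += X_blocks[i].T @ (X_blocks[i] @ w - y_blocks[i])
    return g, A_t

# usage: split the encoded data into m row blocks, then call once per iteration
rng = np.random.default_rng(1)
m, n_enc, p = 8, 400, 20
X_enc = rng.standard_normal((n_enc, p)); y_enc = rng.standard_normal(n_enc)
X_blocks = np.array_split(X_enc, m); y_blocks = np.array_split(y_enc, m)
g, A_t = coded_gradient(X_blocks, y_blocks, np.zeros(p), k=6, rng=rng)
```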

Next, the central server chooses a step size $\alpha_t$, which can be constant, decaying, or determined through exact line search (which, for a quadratic loss, is no more expensive than backtracking line search, since it only requires a single matrix-vector multiplication) by having the workers compute the matrix-vector products needed for the step size. We again assume that the central server only hears from the fastest nodes, denoted by $B_t$, where in general $B_t \neq A_t$, to compute

(3)

where the required quantities are computed from the responses of the nodes in $B_t$, and $\gamma$ is a back-off factor of choice.
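For a quadratic loss, an exact line search along a direction $d$ over the blocks held by the responding nodes reduces to a closed-form ratio; the sketch below shows this textbook form with a back-off factor gamma, which may differ from the exact expression in (3).

```python
# Sketch of an exact line search for the quadratic objective along a descent
# direction d, restricted to the blocks in B_t; gamma is a back-off factor.
# This is the textbook exact minimizer for a quadratic, not necessarily the
# paper's exact equation (3).
import numpy as np

def exact_line_search(X_blocks, y_blocks, w, d, B_t, gamma=1.0):
    g = sum(X_blocks[i].T @ (X_blocks[i] @ w - y_blocks[i]) for i in B_t)
    curvature = sum(np.linalg.norm(X_blocks[i] @ d) ** 2 for i in B_t)
    return -gamma * (g @ d) / curvature   # minimizer of the restricted quadratic in the step size
```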

Our goal is to focus especially on the case $k < m$, and to design an encoding matrix $S$ such that, for any sequence of sets $A_t$, $B_t$, the iterates $w_t$ universally converge to a neighborhood of $w^*$. Note that in general, this scheme with $k < m$ is not guaranteed to converge for traditional batch methods like L-BFGS. Additionally, although the algorithm only works with the encoded function $\tilde f$, our goal is to provide a convergence guarantee in terms of the original function $f$.

3 Algorithms and Convergence Analysis

Let the smallest and largest eigenvalues of $X^\top X$ be denoted by $\mu$ and $M$, respectively.

Let $\epsilon \in (0,1)$ and $\eta \in (0,1]$ be given. In order to prove convergence, we will consider a family of matrices $S \in \mathbb{R}^{\beta n \times n}$, where $\beta$ is the aspect ratio (redundancy factor), such that for any $n$ and any $A \subseteq \{1,\dots,m\}$ with $|A| \geq \eta m$,

$$(1-\epsilon)\,I \preceq S_A^\top S_A \preceq (1+\epsilon)\,I \tag{4}$$

for sufficiently large $n$, where $S_A$ is the submatrix of $S$ associated with the subset $A$ (we drop the dependence on $n$ for brevity). Note that this is similar to the restricted isometry property (RIP) used in compressed sensing Candes and Tao [2005], except that (4) is only required for submatrices of the form $S_A$. Although this condition is needed to prove worst-case convergence results, in practice the proposed encoding scheme can work well even when it is not exactly satisfied, as long as the bulk of the eigenvalues of $S_A^\top S_A$ lie within a small interval around 1. We will discuss several specific constructions and their relation to property (4) in Section 4.
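A quick numerical way to probe property (4) for a candidate encoding matrix is to sample subsets $A$ of node blocks and compute the extreme eigenvalues of $S_A^\top S_A$; the helper below is such a sanity check (the block partition and the Gaussian example are assumptions for illustration, not the paper's code).

```python
# Empirical check of the spectral condition (4): for random subsets A of at
# least eta*m node blocks, look at the extreme eigenvalues of S_A^T S_A.
# A purely numerical sanity check, not a proof of the property.
import numpy as np

def check_condition4(S, m, eta, trials=100, seed=0):
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(S.shape[0]), m)   # rows held by each node
    k = int(np.ceil(eta * m))
    lo, hi = np.inf, -np.inf
    for _ in range(trials):
        A = rng.permutation(m)[:k]
        S_A = S[np.concatenate([blocks[i] for i in A])]
        eig = np.linalg.eigvalsh(S_A.T @ S_A)
        lo, hi = min(lo, eig[0]), max(hi, eig[-1])
    return lo, hi    # want these close to 1 (epsilon = max(1 - lo, hi - 1))

# example: i.i.d. Gaussian S with redundancy beta = 2, scaled so E[S^T S] = I
n, beta, m = 256, 2, 16
S = np.random.default_rng(1).standard_normal((beta * n, n)) / np.sqrt(beta * n)
print(check_condition4(S, m, eta=0.75))
```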

Gradient descent.

We consider gradient descent with a constant step size, i.e.,

$$w_{t+1} = w_t - \alpha\,\tilde g_t = w_t - \alpha \sum_{i \in A_t} \nabla f_i(w_t).$$

The following theorem characterizes the convergence of the encoded problem under this algorithm.

Theorem 1.

Let $w^* = \arg\min_w f(w)$, and let $w_t$ be computed using gradient descent with updates from a set of (fastest) workers $A_t$, with a constant step size $\alpha$ chosen sufficiently small, for all $t$. If $S$ satisfies (4) with a sufficiently small $\epsilon$, then for all sequences of $\{A_t\}$ with cardinality $|A_t| \geq \eta m$, the objective value $f(w_t)$ decays linearly from its initial value $f(w_0)$ to a neighborhood of the optimal value $f(w^*)$, where the contraction factor and the radius of the neighborhood are controlled by $\epsilon$, the step size, and the spectrum of $X^\top X$.

The proof is provided in Appendix B. It relies on the fact that the solution to the effective "instantaneous" problem corresponding to the subset $A_t$ lies in a bounded set around $w^*$, and therefore each gradient descent step attracts the estimate towards a point in this set, so that the iterates must eventually converge to this set. Note that in order to guarantee linear convergence, we need the contraction factor to be strictly less than 1, which can be ensured by property (4).

Theorem 1 shows that gradient descent over the encoded problem, based on updates from only $k < m$ nodes, results in deterministic linear convergence to a neighborhood of the true solution $w^*$, for sufficiently large $k$, as opposed to convergence in expectation. Note that by property (4), by controlling the redundancy factor $\beta$ and the number of nodes $k$ waited for in each iteration, one can control the approximation guarantee. For $\eta = 1$ and $S$ designed properly (see Section 4), $\epsilon = 0$ and the optimum value of the original function is reached.

Limited-memory-BFGS.

Although L-BFGS is originally a batch method, requiring updates from all nodes, stochastic variants have also been proposed recently Mokhtari and Ribeiro [2015], Berahas et al. [2016]. The key modification to ensure convergence is that the Hessian estimate must be computed from gradient components that are common to two consecutive iterations, i.e., from the nodes in $A_t \cap A_{t-1}$. We adapt this technique to our scenario. For $t \geq 1$, define the overlap set $\bar A_t := A_t \cap A_{t-1}$, and let

$$u_t := \sum_{i \in \bar A_t} \bigl(\nabla f_i(w_t) - \nabla f_i(w_{t-1})\bigr), \qquad s_t := w_t - w_{t-1}.$$

Then, once the gradient terms are collected, the descent direction is computed by $d_t = -H_t \tilde g_t$, where $H_t$ is the inverse Hessian estimate for iteration $t$, computed from the most recent pairs $(s_u, u_u)$ through the standard L-BFGS recursion; the number of pairs kept is the L-BFGS memory length. Once the descent direction is computed, the step size is determined through exact line search, using (3), with a back-off factor $\gamma$ chosen in terms of $\epsilon$ from (4).
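A compact sketch of how the descent direction could be formed with curvature pairs built from the overlap set, using the standard two-loop recursion; the function names, memory handling, and initial scaling below are standard choices assumed here, not the paper's exact pseudocode.

```python
# Sketch of the L-BFGS direction with curvature pairs built from the overlap
# set A_t ∩ A_{t-1}, in the spirit of multi-batch L-BFGS; the memory handling
# and the initial Hessian scaling are standard choices, not the paper's own.
import numpy as np
from collections import deque

def overlap_gradient(X_blocks, y_blocks, w, overlap):
    return sum(X_blocks[i].T @ (X_blocks[i] @ w - y_blocks[i]) for i in overlap)

def lbfgs_direction(g, memory):
    """Standard two-loop recursion; memory holds (s, u) curvature pairs."""
    q, alphas = g.copy(), []
    for s, u in reversed(memory):             # newest to oldest
        a = (s @ q) / (u @ s)
        alphas.append(a)
        q -= a * u
    if memory:                                # initial Hessian scaling
        s, u = memory[-1]
        q *= (s @ u) / (u @ u)
    for (s, u), a in zip(memory, reversed(alphas)):   # oldest to newest
        b = (u @ q) / (u @ s)
        q += (a - b) * s
    return -q

# per iteration: append (s_t, u_t) with s_t = w_{t+1} - w_t and u_t the
# difference of overlap_gradient(...) evaluated at w_{t+1} and w_t
memory = deque(maxlen=10)                     # L-BFGS memory length
print(lbfgs_direction(np.ones(5), memory))    # with empty memory this is just -g
```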

For our convergence result for L-BFGS, we need another assumption on the matrix $S$, in addition to (4). Defining $\bar A_t := A_t \cap A_{t-1}$ for $t \geq 1$, we assume that for some $\delta > 0$,

$$\lambda_{\min}\!\left(S_{\bar A_t}^\top S_{\bar A_t}\right) \geq \delta \tag{5}$$

for all $t$. Note that this requires waiting for sufficiently many nodes to finish, so that the overlap set $\bar A_t$ contains more than a $\tfrac{1}{\beta}$ fraction of all nodes and the matrix $S_{\bar A_t}$ can have full column rank. This is satisfied in the worst case if $2\eta - 1 \geq \tfrac{1}{\beta}$, and under the assumption that node delays are i.i.d., it is satisfied in expectation if $\eta^2 \geq \tfrac{1}{\beta}$. However, this condition is only required for a worst-case analysis, and the algorithm may perform well in practice even when it is not satisfied. The following lemma shows the stability of the Hessian estimate.

Lemma 1.

If (5) is satisfied, then there exist constants $c_2 \geq c_1 > 0$ such that for all $t$, the inverse Hessian estimate satisfies $c_1 I \preceq H_t \preceq c_2 I$.

The proof, provided in Appendix A, is based on the well-known trace-determinant method. Using Lemma 1, we can show the following result.

Theorem 2.

Let $w^* = \arg\min_w f(w)$, and let $w_t$ be computed using L-BFGS as described above, with gradient updates from the machines in $A_t$ and line search updates from the machines in $B_t$. If $S$ satisfies (4) and (5), then for all sequences of $\{A_t\}$, $\{B_t\}$ with sufficiently large cardinality, the objective value $f(w_t)$ decays linearly from its initial value $f(w_0)$ to a neighborhood of the optimal value $f(w^*)$, where the contraction factor and the radius of the neighborhood are controlled by $\epsilon$, $\delta$, and the spectrum of $X^\top X$.

The proof is provided in Appendix B. Similar to Theorem 1, the proof is based on the observation that the solution of the effective problem at iteration $t$ lies in a bounded set around the true solution $w^*$. As in gradient descent, coding enables deterministic linear convergence, unlike the stochastic and multi-batch variants of L-BFGS Mokhtari and Ribeiro [2015], Berahas et al. [2016].

Generalizations.

Although we focus on quadratic cost functions and two specific algorithms, our approach can potentially be generalized to objectives of the form $f(w) + \psi(w)$ for a simple convex function $\psi$, e.g., LASSO; to constrained optimization (see Karakus et al. [2017]); as well as to other first-order algorithms used for such problems, e.g., FISTA Beck and Teboulle [2009]. In the next section we demonstrate that the codes we consider have desirable properties that readily extend to such scenarios.

4 Code Design

We consider three classes of coding matrices: tight frames, fast transforms, and random matrices.

Tight frames.

A unit-norm frame for $\mathbb{R}^n$ is a set of vectors $\{\phi_i\}_{i=1}^{N}$ with $\|\phi_i\| = 1$ and $N \geq n$, such that there exist constants $c_2 \geq c_1 > 0$ for which, for any $u \in \mathbb{R}^n$,

$$c_1\|u\|^2 \leq \sum_{i=1}^{N} |\langle \phi_i, u\rangle|^2 \leq c_2\|u\|^2.$$

The frame is tight if the above is satisfied with $c_1 = c_2 =: c$. In this case, it can be shown that the constant equals the redundancy factor of the frame, i.e., $c = N/n$. If we form $S$ so that its rows constitute a tight frame, then we have $S^\top S = cI$, which ensures $\nabla \tilde f(w) = c\,\nabla f(w)$ for every $w$. Then for any solution $\tilde w$ to the encoded problem (with all nodes waited for),

$$0 = \nabla \tilde f(\tilde w) = c\,\nabla f(\tilde w).$$

Therefore, the solution $\tilde w$ to the encoded problem satisfies the optimality condition of the original problem as well, $\nabla f(\tilde w) = 0$,

and if $f$ is also strongly convex, then $\tilde w = w^*$ is the unique solution. Note that since the computation is coding-oblivious, this is not true in general for an arbitrary full-rank matrix $S$; this is, in addition to property (4), a desired property of the encoding matrix. In fact, since $S^\top S = cI$ implies $\tilde f(w) = c\,f(w)$ for all $w$, this equivalence extends beyond smooth unconstrained optimization: the minimizers of $\tilde f$ and $f$ over any convex constraint set coincide, and the same holds when a (suitably scaled) non-smooth convex term $h$ is added to the objective, where the corresponding optimality condition involves the subdifferential $\partial h$ of $h$. This means that tight frames are promising encoding matrix candidates for non-smooth and constrained optimization as well. In Karakus et al. [2017], it was shown that when the set of active nodes is static, equiangular tight frames allow for a close approximation of the solution for constrained problems.
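The gradient identity underlying this argument is easy to verify numerically for a simple tight frame; the sketch below stacks two orthonormal bases (giving $S^\top S = 2I$, a tight but not equiangular frame) and checks that the encoded gradient is an exact multiple of the original one.

```python
# Sketch: for a tight frame S (S^T S = c I), the encoded objective is an exact
# multiple of the original one, so their minimizers coincide. Here S stacks two
# orthonormal bases (a simple tight frame with c = 2, not an ETF).
import numpy as np
from scipy.linalg import hadamard

n, p = 64, 8
rng = np.random.default_rng(0)
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)

S = np.vstack([np.eye(n), hadamard(n) / np.sqrt(n)])      # 2n x n, S^T S = 2 I
assert np.allclose(S.T @ S, 2 * np.eye(n))

w = rng.standard_normal(p)
grad_orig = X.T @ (X @ w - y)
grad_enc = (S @ X).T @ (S @ X @ w - S @ y)
assert np.allclose(grad_enc, 2 * grad_orig)               # encoded gradient = c * original
```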

A tight frame is equiangular if $|\langle \phi_i, \phi_j \rangle|$ is constant across all pairs $i \neq j$.

Proposition 1 (Welch bound Welch [1974]).

Let $\{\phi_i\}_{i=1}^{N}$ be a unit-norm tight frame for $\mathbb{R}^n$. Then $\max_{i\neq j} |\langle \phi_i, \phi_j\rangle| \geq \sqrt{\tfrac{N-n}{n(N-1)}}$. Moreover, equality is satisfied if and only if $\{\phi_i\}$ is an equiangular tight frame.

Therefore, an ETF minimizes the correlation between its individual elements, making each submatrix as close to orthogonal as possible, which is promising in light of property (4). We specifically evaluate Paley Paley [1933], Goethals and Seidel [1967] and Hadamard ETFs Szöllősi [2013] (not to be confused with Hadamard matrix, which is discussed next) in our experiments. We also discuss Steiner ETFs Fickus et al. [2012] in Appendix D, which enable efficient implementation.
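As a small illustration of Proposition 1, the snippet below computes the coherence of the stacked-bases tight frame from the previous sketch and compares it with the Welch bound; since that frame is not equiangular, its coherence sits strictly above the bound.

```python
# Sketch: coherence of a simple tight frame vs. the Welch bound of Proposition 1.
# An ETF would meet the bound with equality; this frame is tight but not equiangular.
import numpy as np
from scipy.linalg import hadamard

n = 64
F = np.vstack([np.eye(n), hadamard(n) / np.sqrt(n)])   # 2n unit-norm frame vectors (rows)
N = F.shape[0]
G = np.abs(F @ F.T)
np.fill_diagonal(G, 0.0)
coherence = G.max()
welch = np.sqrt((N - n) / (n * (N - 1)))
print(coherence, welch)          # coherence > welch bound: tight but not equiangular
```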

Figure 2: Sample spectrum of $S_A^\top S_A$ for various constructions with high redundancy and a relatively small fraction of responding nodes (normalized).
Figure 3: Sample spectrum of $S_A^\top S_A$ for various constructions with low redundancy and a large fraction of responding nodes (normalized).
Fast transforms.

Another computationally efficient method for encoding is to use fast transforms: the Fast Fourier Transform (FFT), if $S$ is chosen as a subsampled DFT matrix, and the Fast Walsh-Hadamard Transform (FWHT), if $S$ is chosen as a subsampled real Hadamard matrix. In particular, one can insert rows of zeroes at random locations into the data pair $(X, y)$, and then take the FFT or FWHT of each column of the augmented matrix. This is equivalent to a randomized Fourier or Hadamard ensemble, which is known to satisfy the RIP with high probability Candes and Tao [2006].
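A self-contained sketch of FWHT-based encoding along these lines: zero rows are inserted at random positions and each column of the augmented matrix is transformed. The padding scheme and the $1/\sqrt{\beta n}$ normalization (which makes the effective encoding matrix satisfy $S^\top S = I$ here) are illustrative choices.

```python
# Sketch of FWHT-based encoding: embed [X | y] into zero-padded rows at random
# positions, then transform each column. The 1/sqrt(beta*n) normalization is an
# illustrative choice making the effective encoding matrix satisfy S^T S = I.
import numpy as np

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform along axis 0 (length a power of two)."""
    n = a.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            x = a[i:i + h].copy()
            a[i:i + h] = x + a[i + h:i + 2 * h]
            a[i + h:i + 2 * h] = x - a[i + h:i + 2 * h]
        h *= 2
    return a

def hadamard_encode(X, y, beta, rng):
    n, p = X.shape
    n_enc = beta * n                                  # assumed to be a power of two
    pos = rng.choice(n_enc, size=n, replace=False)    # random row positions
    Z = np.zeros((n_enc, p + 1))
    Z[pos, :p], Z[pos, p] = X, y                      # zero rows everywhere else
    Z = fwht(Z) / np.sqrt(n_enc)
    return Z[:, :p], Z[:, p]

rng = np.random.default_rng(0)
X, y = rng.standard_normal((256, 10)), rng.standard_normal(256)
X_enc, y_enc = hadamard_encode(X, y, beta=2, rng=rng)
```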

Random matrices.

A natural choice of encoding is an i.i.d. random matrix. Although such random matrices do not have the computational advantages of fast transforms or the optimality-preservation property of tight frames, their eigenvalue behavior can be characterized analytically. In particular, using existing results on the eigenvalue scaling of large i.i.d. Gaussian matrices Geman [1980], Silverstein [1985] and a union bound, the largest and smallest singular values of the submatrices $S_A$ can be shown to converge to explicit limits,

(6)
(7)

as $n \to \infty$. Hence, for sufficiently large redundancy and problem dimension, i.i.d. random matrices are good candidates for encoding as well. However, for finite problem dimensions, even if all nodes are waited for, this encoding scheme does not in general recover the optimum of the original problem exactly.
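As a quick empirical illustration of this singular-value behavior, the snippet below compares the extreme singular values of a subsampled i.i.d. Gaussian block $S_A$ with the classical Geman/Silverstein limits under the $1/\sqrt{\beta n}$ scaling assumed here; the exact normalization in (6)-(7) may differ.

```python
# Sketch: empirical extreme singular values of a subsampled i.i.d. Gaussian
# encoding block S_A vs. the classical Geman/Silverstein limits
# sqrt(eta) +/- sqrt(1/beta), valid under the 1/sqrt(beta*n) scaling used here
# (the paper's normalization in (6)-(7) may differ).
import numpy as np

n, beta, eta = 1000, 4, 0.75
rng = np.random.default_rng(0)
S = rng.standard_normal((beta * n, n)) / np.sqrt(beta * n)
rows = rng.permutation(beta * n)[: int(eta * beta * n)]   # rows held by the fastest nodes
sv = np.linalg.svd(S[rows], compute_uv=False)
print("empirical:", sv.max(), sv.min())
print("limits   :", np.sqrt(eta) + np.sqrt(1 / beta), np.sqrt(eta) - np.sqrt(1 / beta))
```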

Property (4) and redundancy requirements.

Using the analytical bounds (6)–(7) on i.i.d. Gaussian matrices, one can see that such matrices satisfy (4) with an $\epsilon$ that depends only on $\beta$ and $\eta$, independent of the problem dimensions or the number of nodes. Although we do not have tight eigenvalue bounds for subsampled ETFs, numerical evidence (Figure 3) suggests that they may satisfy (4) with less redundancy than random matrices, and thus we believe that the required redundancy in practice is even smaller for ETFs.

Note that our theoretical results focus on the extreme eigenvalues due to the worst-case analysis; in practice, most of the energy of the gradient lies in the eigen-space associated with the bulk of the eigenvalues, which the following proposition shows can be mostly equal to 1 (also see Figure 3). This means that even if (4) is not satisfied, the gradient (and the solution) can be approximated closely with a modest amount of redundancy. The following result is a consequence of the Cauchy interlacing theorem and the definition of tight frames.

Proposition 2.

If the rows of $S$ are chosen to form an ETF with redundancy $\beta$, normalized so that $S^\top S = I$, then for any $A$ with $|A| \geq \eta m$, $S_A^\top S_A$ has at least $n\bigl(1 - (1-\eta)\beta\bigr)$ eigenvalues equal to 1.

5 Numerical Results

Figure 4: Left: Sample evolution of the uncoded, replication, and Hadamard (FWHT)-coded cases. Right: Runtimes of the schemes for different values of the fraction of nodes waited for, for the same number of iterations for each scheme. Note that this essentially captures the delay profile of the network, and does not reflect the relative convergence rates of the different methods.
Ridge regression with synthetic data on AWS EC2 cluster.

We generate the elements of the data matrix $X$ and the vector $y$ i.i.d. at random, and solve the ridge regression problem $\min_w \tfrac{1}{2}\|Xw - y\|^2 + \tfrac{\lambda}{2}\|w\|^2$ for a fixed regularization parameter $\lambda$. We evaluate a column-subsampled Hadamard matrix (encoded using the FWHT for fast encoding), data replication with the same redundancy, and the uncoded scheme. We implement distributed L-BFGS as described in Section 3 on an Amazon EC2 cluster using the mpi4py Python package, with m1.small instances as worker nodes and a single c3.8xlarge instance as the central server. We assume the central server encodes and sends the data to the worker nodes (see Appendix D for a discussion of how to implement this step more efficiently).
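A single-machine sketch of this setup (without MPI): generate synthetic data, encode it, and compare the ridge solution from all data with the one computed from only the fastest $k$ of $m$ encoded blocks. The dimensions, $\lambda$, the Gaussian stand-in for the subsampled Hadamard matrix, and the $1/\sqrt{\eta}$ rescaling are illustrative assumptions.

```python
# Local sketch of the ridge experiment (no MPI): compare the exact ridge
# solution on the full data with the one obtained from only the fastest k of m
# encoded blocks. Dimensions, lambda, and the Gaussian stand-in for the
# subsampled Hadamard encoding are illustrative assumptions.
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p, lam, beta, m, k = 2048, 32, 1.0, 2, 16, 10
eta = k / m
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)

S = rng.standard_normal((beta * n, n)) / np.sqrt(beta * n)   # E[S^T S] = I
X_enc, y_enc = S @ X, S @ y

blocks = np.array_split(np.arange(beta * n), m)
A = rng.permutation(m)[:k]                                   # fastest k nodes
rows = np.concatenate([blocks[i] for i in A])

w_full = ridge(X, y, lam)
# Rescale the retained blocks by 1/sqrt(eta) so that S_A^T S_A ~ I on average.
w_coded = ridge(X_enc[rows] / np.sqrt(eta), y_enc[rows] / np.sqrt(eta), lam)
print(np.linalg.norm(w_coded - w_full) / np.linalg.norm(w_full))
```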

Figure 4 shows the results of our experiments, aggregated over 20 trials. As baselines, we consider the uncoded scheme, as well as a replication scheme, where each uncoded partition is replicated across nodes and the server uses the faster copy in each iteration. The right panel shows that one can speed up computation by reducing the fraction of nodes waited for from 1 to, for instance, 0.375, resulting in a substantial reduction in the runtime. Note that in this case uncoded L-BFGS fails to converge, whereas the Hadamard-coded case converges stably. We also observe that the data replication scheme converges on average, but in the worst case the convergence is much less smooth, since performance deteriorates whenever both copies of a partition are delayed.

Matrix factorization on Movielens 1-M dataset.

We next apply matrix factorization to the MovieLens 1M dataset Riedl and Konstan [1998] for the movie recommendation task. We are given $R$, a sparse matrix of movie ratings 1–5, with one row per user and one column per movie, where $R_{ij}$ is specified if user $i$ has rated movie $j$. We randomly withhold 20% of these ratings to form an 80/20 train/test split. The goal is to recover user vectors $u_i$ and movie vectors $v_j$ (of a chosen embedding dimension $r$) such that $R_{ij} \approx u_i^\top v_j + b_i + c_j + \mu$, where $b_i$, $c_j$, and $\mu$ are the user, movie, and global biases, respectively. The optimization problem is given by

$$\min_{\{u_i\},\{v_j\},\{b_i\},\{c_j\}} \sum_{(i,j)\ \text{observed}} \bigl(R_{ij} - u_i^\top v_j - b_i - c_j - \mu\bigr)^2 + \lambda\Bigl(\textstyle\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2 + \sum_i b_i^2 + \sum_j c_j^2\Bigr). \tag{8}$$

We choose the embedding dimension and regularization parameters so as to achieve a test RMSE of 0.861, close to the current best test RMSE on this dataset using matrix factorization (see http://www.mymedialite.net/examples/datasets.html).

Figure 5: Test RMSE for the two cluster sizes (left and right), where the server waits for the larger (top) or smaller (bottom) number of responses. "Perfect" refers to the case where all node responses are used.
Figure 6: Total runtime for the two cluster sizes and different numbers of responses waited for, under a fixed 100 iterations for each scheme.

Problem (8) is often solved using alternating minimization: minimizing first over all user vectors, and then over all movie vectors, in repetition. Each such step further decomposes by row and column, and is made smaller by the sparsity of $R$. To solve for the user vectors, we extract, for each user, the corresponding observed ratings and movie vectors, and solve the resulting sequence of regularized least squares problems distributedly using coded L-BFGS; we then repeat the same procedure for the movie vectors. As in the first experiment, distributed coded L-BFGS is implemented by having the master node encode the data locally and distribute the encoded data to the worker nodes (Appendix D discusses how to implement this step more efficiently). The overhead associated with this initial step is included in the overall runtime in Figure 6.
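For reference, a compact local sketch of this alternating-minimization outer loop on a small random ratings matrix (biases omitted, no distribution); each inner solve is the regularized least-squares problem that is distributed with coded L-BFGS in our experiments. Dimensions, $\lambda$, and the random data are illustrative.

```python
# Compact local sketch of the alternating-minimization loop (biases omitted,
# small random data); each inner ridge solve is the subproblem that the paper
# distributes with coded L-BFGS. Dimensions and lambda are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, r, lam = 100, 80, 8, 0.1
mask = rng.random((n_users, n_movies)) < 0.1            # observed entries
R = np.where(mask, rng.integers(1, 6, mask.shape), 0.0)

U = 0.1 * rng.standard_normal((n_users, r))
V = 0.1 * rng.standard_normal((n_movies, r))

for epoch in range(5):
    for i in range(n_users):                            # update user vectors
        j = np.where(mask[i])[0]
        if j.size:
            U[i] = np.linalg.solve(V[j].T @ V[j] + lam * np.eye(r), V[j].T @ R[i, j])
    for jdx in range(n_movies):                         # update movie vectors
        i = np.where(mask[:, jdx])[0]
        if i.size:
            V[jdx] = np.linalg.solve(U[i].T @ U[i] + lam * np.eye(r), U[i].T @ R[i, jdx])

rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
print("train RMSE:", rmse)
```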

The MovieLens experiment is run on a single 32-core machine with 256 GB of RAM. In order to simulate network latency, an artificial random delay is imposed each time a worker completes a task. Small problem instances are solved locally at the central server, using the built-in function numpy.linalg.solve. Additionally, parallelization is only applied to the ridge regression instances, in order to isolate the speedup gains of the L-BFGS distribution. To reduce overhead, we create a bank of encoding matrices for the Paley ETF and Hadamard ETF over a range of dimensions, and then, given a problem instance, subsample the columns of the appropriate matrix to match its dimensions. Overall, we observe that the encoding overhead is amortized by the speed-up of the distributed optimization.

Figure 6 gives the final performance of our distributed L-BFGS for the various encoding schemes, for each of the 5 epochs, which shows that the coded schemes are the most robust when only a small number of node responses are waited for. A full table of results is given in Appendix C.

Acknowledgments

This work was supported in part by NSF grants 1314937 and 1423271.

References

  • Ananthanarayanan et al. [2013] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, volume 13, pages 185–198, 2013.
  • Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • Berahas et al. [2016] A. S. Berahas, J. Nocedal, and M. Takác. A multi-batch l-bfgs method for machine learning. In Advances in Neural Information Processing Systems, pages 1055–1063, 2016.
  • Candes and Tao [2005] E. J. Candes and T. Tao.

    Decoding by linear programming.

    IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
  • Candes and Tao [2006] E. J. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
  • Drineas et al. [2011] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische mathematik, 117(2):219–249, 2011.
  • Dutta et al. [2016] S. Dutta, V. Cadambe, and P. Grover.

    Short-dot: Computing large linear transforms distributedly using coded short dot products.

    In Advances In Neural Information Processing Systems, pages 2092–2100, 2016.
  • Fickus et al. [2012] M. Fickus, D. G. Mixon, and J. C. Tremain. Steiner equiangular tight frames. Linear Algebra and Its Applications, 436(5):1014–1027, 2012.
  • Geman [1980] S. Geman. A limit theorem for the norm of random matrices. The Annals of Probability, pages 252–261, 1980.
  • Goethals and Seidel [1967] J. Goethals and J. J. Seidel. Orthogonal matrices with zero diagonal. Canad. J. Math, 1967.
  • Karakus et al. [2017] C. Karakus, Y. Sun, and S. Diggavi. Encoded distributed optimization. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2890–2894. IEEE, 2017.
  • Lee et al. [2016] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. Speeding up distributed machine learning using codes. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 1143–1147. IEEE, 2016.
  • Mahoney et al. [2011] M. W. Mahoney et al. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
  • Mokhtari and Ribeiro [2015] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. Journal of Machine Learning Research, 16:3151–3181, 2015.
  • Paley [1933] R. E. Paley. On orthogonal matrices. Studies in Applied Mathematics, 12(1-4):311–320, 1933.
  • Pilanci and Wainwright [2015] M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.
  • Riedl and Konstan [1998] J. Riedl and J. Konstan. Movielens dataset, 1998.
  • Silverstein [1985] J. W. Silverstein. The smallest eigenvalue of a large dimensional wishart matrix. The Annals of Probability, pages 1364–1368, 1985.
  • Szöllősi [2013] F. Szöllősi. Complex Hadamard matrices and equiangular tight frames. Linear Algebra and its Applications, 438(4):1962–1967, 2013.
  • Tandon et al. [2016] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis. Gradient coding. ML Systems Workshop (MLSyS), NIPS, 2016.
  • Wang et al. [2015] D. Wang, G. Joshi, and G. Wornell. Using straggler replication to reduce latency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review, 43(3):7–11, 2015.
  • Welch [1974] L. Welch. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information Theory, 20(3):397–399, 1974.
  • Yadwadkar et al. [2016] N. J. Yadwadkar, B. Hariharan, J. Gonzalez, and R. H. Katz. Multi-task learning for straggler avoiding predictive job scheduling. Journal of Machine Learning Research, 17(4):1–37, 2016.

Appendix A Lemmas

In the proofs, we will ignore the normalization constants on the objective functions for brevity. Let $\tilde f_t$ denote the effective objective determined by the set of nodes used at iteration $t$, and let $\tilde w_t$ denote the solution to the effective "instantaneous" problem at iteration $t$, i.e., $\tilde w_t = \arg\min_w \tilde f_t(w)$.

Stronger versions of the following lemma have been proved in Karakus et al. [2017], Pilanci and Wainwright [2015], but we include a weakened version of this result here for completeness.

Lemma 2.

For any and ,

Proof.

Define and note that

by triangle inequality, which implies

(9)

Now, for any , consider

where (a) follows by expanding and re-arranging the quadratic, using the fact that $\tilde w_t$ is its minimizer; (b) follows from the optimality of $w^*$ for $f$; (c) follows by the Cauchy-Schwarz inequality; and (d) follows by the definition of the matrix norm.

Since this is true for any , we choose , which gives

Plugging this back into (9) gives the claimed bound, which completes the proof. ∎

Lemma 3.

If

for all , for some , and for some , then

Proof.

Since for any ,

and similarly , we have

which can be re-arranged into the linear recursive inequality

where the constant is as defined above. By considering such inequalities for successive iterations, multiplying each by the appropriate factor, and summing, we get

Lemma 4.

is -strongly convex.

Proof.

It is sufficient to show that the minimum eigenvalue of is bounded away from zero. This can easily be shown by the fact that

for any unit vector . ∎

Lemma 5.

Let be a positive definite matrix, with the condition number (ratio of maximum eigenvalue to the minimum eigenvalue) given by . Then, for any unit vector ,

Proof.

Let be the subspace spanned by , and let be a matrix whose columns form an orthonormal basis for . Then and can be represented as

for some , which implies

where since is still a positive definite matrix it has the eigen-decomposition . Defining for , note that the quantity we are interested in can be equivalently represented as

where . Further note that for any unit vector ,

and since , the condition number of (the ratio of the two non-zero elements of ) cannot be larger than that of , which is (since otherwise one could find unit vectors and such that , which is a contradiction). Representing for some angle , can then be written as . Note that minimizing the inner product is equivalent to maximizing the function

over . By setting the derivative to zero, we find that the maximizing is given by . Therefore

which is the desired result.

Proof of Lemma 1.

First note that

(10)

by (5). Also consider

which implies

again by (4). Now, setting , consider the trace

which implies . It can also be shown (similar to Berahas et al. [2016]) that

which implies that the determinant of the Hessian estimate is bounded away from zero. Since its trace is bounded above and its determinant is bounded away from zero, there must exist constants $c_1, c_2 > 0$ such that $c_1 I \preceq H_t \preceq c_2 I$ for all $t$. ∎

Appendix B Proofs of Theorem 1 and Theorem 2

Throughout the section, we will consider a particular iteration , and denote

where and denote the minimum and maximum eigenvalues of a matrix.

We will also denote by $\tilde w_t$ the solution to the effective "instantaneous" problem at iteration $t$, i.e.,