# OverSketched Newton: Fast Convex Optimization for Serverless Systems

Motivated by recent developments in serverless systems for large-scale machine learning as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale smooth and strongly-convex problems in serverless systems. OverSketched Newton leverages matrix sketching ideas from Randomized Numerical Linear Algebra to compute the Hessian approximately. These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures. We establish that OverSketched Newton has a linear-quadratic convergence rate, and we empirically validate our results by solving large-scale supervised learning problems on real-world datasets. Experiments demonstrate a reduction of about 50% in total running time on AWS Lambda, compared to state-of-the-art distributed optimization schemes.

## 1 Introduction

Compared to first-order optimization algorithms, second-order methods—which use the gradient as well as Hessian information—enjoy superior convergence rates, both in theory and practice. For instance, Newton’s method converges quadratically for strongly convex and smooth problems, compared to the linear convergence of gradient descent. Moreover, second-order methods do not require step-size tuning and unit step-size provably works for most problems. These methods, however, are not vastly popular among practitioners, partly since they are more sophisticated, and partly due to higher computational complexity per iteration. In each iteration, they require the calculation of the Hessian followed by solving a linear system involving the Hessian, and this can be prohibitive when the training data is large.

Recently, there has been a lot of interest in distributed second-order optimization methods for more traditional server-based systems [1, 2, 3, 4, 5, 6]. These methods can substantially decrease the number of iterations, and hence reduce communication, which is desirable in distributed computing environments, at the cost of more computation per iteration. However, these methods are approximate, as they do not calculate the exact Hessian, due to data being stored distributedly among workers. Instead, each worker calculates an approximation of the Hessian using the data that is stored locally. Such methods, while still better than gradient descent, forego the quadratic convergence property enjoyed by exact Newton’s method.

In recent years, there has been tremendous growth in users performing distributed computing operations on the cloud due to extensive and inexpensive commercial offerings like Amazon Web Services (AWS), Google Cloud, Microsoft Azure, etc. Serverless platforms (such as AWS Lambda, Google Cloud Functions and Azure Functions) penetrate a large user base by provisioning and managing the servers on which the computation is performed. This abstracts away the need for maintaining servers, since this is done by the cloud provider and is hidden from the user; hence the name serverless. Moreover, allocation of these servers is done expeditiously, which provides greater elasticity and easy scalability. For example, up to ten thousand machines can be allocated on AWS Lambda in less than ten seconds [7, 8, 9]. Using serverless systems for large-scale computation has been the central theme of several recent works [10, 11, 12, 13].

Due to several crucial differences between HPC/server-based and serverless architectures, existing distributed algorithms cannot, in general, be extended to serverless computing. Unlike server-based computing, the number of inexpensive workers in serverless platforms is flexible, often scaling into the thousands [9]. This heavy gain in the computation power, however, comes with the disadvantage that the commodity workers in the serverless architecture are ephemeral, have low memory and do not communicate amongst themselves. (Footnote 1: For example, serverless nodes in AWS Lambda, Google Cloud Functions and Microsoft Azure Functions have a maximum memory of 3 GB, 2 GB and 1.5 GB, respectively, and a maximum runtime of 900 seconds, 540 seconds and 300 seconds, respectively; these numbers may change over time.) The workers read/write data directly from/to a single data storage entity (for example, cloud storage like AWS S3).

Furthermore, unlike HPC/server-based systems, nodes in serverless systems suffer degradation due to what is known as system noise. This can be a result of limited availability of shared resources, hardware failure, network latency, etc. [14, 15]. This results in job time variability, and hence a subset of much slower nodes, often called stragglers. These stragglers significantly slow the overall computation time, especially in large or iterative jobs. In [11], the authors plotted empirical statistics of worker job times on the AWS Lambda system, as shown in Fig. 1. Such experiments consistently demonstrate that a fraction of the workers takes significantly longer than the median job time, severely degrading the overall efficiency of the system.

Moreover, each call to serverless computing platforms such as AWS Lambda requires invocation time (during which AWS assigns Lambda workers), setup time (where the required Python packages are downloaded) and communication time (where the data is downloaded from the cloud). The ephemeral nature of the workers in serverless systems requires that new workers be invoked every few iterations and that data be communicated to them. Due to the aforementioned issues, first-order methods tend to perform poorly on distributed serverless architectures. In fact, their slower convergence is made worse on serverless platforms due to persistent stragglers: a subset of slow workers occurs in each iteration, so the tail times accumulate over the many iterations that first-order methods need.

In this paper, we argue that second-order methods are highly compatible with serverless systems that provide extensive computing power by invoking thousands of workers but are limited by the communication costs and hence the number of iterations [9, 10]. To address the challenges of ephemeral workers and stragglers in serverless systems, we propose and analyze a randomized and distributed second-order optimization algorithm, called OverSketched Newton. OverSketched Newton uses the technique of matrix sketching from Randomized Numerical Linear Algebra (RandNLA) [16, 17] to obtain a good approximation for the Hessian.

In particular, we use the sparse sketching scheme proposed by [11], which is based on the basic RandNLA primitive of approximate randomized matrix multiplication [18], for straggler-resilient Hessian calculation in serverless systems. Such randomization also provides inherent straggler resiliency while calculating the Hessian. For straggler mitigation during gradient calculation, we use the recently proposed technique of employing error-correcting codes to create redundant computation [19, 20]. We prove that OverSketched Newton has a linear-quadratic convergence when the Hessian is calculated using the sparse sketching scheme proposed in [11]. Moreover, we show that at least a linear convergence is assured when the optimization algorithm starts with any random point in the constraint set. Our experiments on AWS Lambda demonstrate empirically that OverSketched Newton is significantly faster than existing distributed optimization schemes and vanilla Newton’s method that calculates exact Hessian.

### 1.1 Related Work

Existing Straggler Mitigation Schemes: Strategies like speculative execution have been traditionally used to mitigate stragglers in popular distributed computing frameworks like Hadoop MapReduce [21] and Apache Spark [22]. Speculative execution works by detecting workers that are running slower than expected and then allocating their tasks to new workers without shutting down the original straggling task. The worker that finishes first communicates its results. This has several drawbacks. For example, constant monitoring of tasks is required, where the worker pauses its job and provides its running status. Additionally, it is possible that a worker will straggle only at the end of the task, say, while communicating the results. By the time the task is reallocated, the overall efficiency of the system would have suffered already.

Recently, many coding-theoretic ideas have been proposed to introduce redundancy into the distributed computation for straggler mitigation [19, 20, 23, 24, 25, 26, 27], many of them catering to distributed matrix-vector multiplication [19, 20, 23]. In general, the idea of coded computation is to generate redundant copies of the result of distributed computation by encoding the input data using error-correcting codes. These redundant copies can then be used to decode the output of the missing stragglers. We use tools from [20] to compute gradients in a distributed straggler-resilient manner using codes, and we compare the performance with speculative execution.
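To make the idea of coded computation concrete, the following minimal sketch uses a single-parity code with three workers. It is an illustrative assumption, not the coding scheme of [20]: the matrix is split into two row blocks, a third (parity) block holds their sum, and the product can be recovered from any two of the three worker outputs. The function names `encode` and `decode` are hypothetical, and the matrix is assumed to have an even number of rows.

```python
# Hypothetical illustration: a (3, 2) single-parity code for y = A x.
# This is NOT the scheme of [20]; it only shows how redundancy replaces a straggler.

def matvec(rows, x):
    """Plain matrix-vector product on lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

def encode(A):
    """Split A into two row blocks and append the parity block A1 + A2."""
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return [A1, A2, parity]

def decode(results):
    """Recover [A1 x; A2 x] from any two of the three worker outputs.

    `results` maps a worker index (0, 1 or 2) to its output; the straggler is absent.
    """
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:  # worker 1 straggled: A2 x = (A1 + A2) x - A1 x
        y1 = results[0]
        y2 = [p - a for p, a in zip(results[2], y1)]
    else:               # worker 0 straggled: A1 x = (A1 + A2) x - A2 x
        y2 = results[1]
        y1 = [p - a for p, a in zip(results[2], y2)]
    return y1 + y2

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
blocks = encode(A)
# Simulate worker 1 straggling: only workers 0 and 2 return.
returned = {i: matvec(blocks[i], x) for i in (0, 2)}
y = decode(returned)
```

The price of tolerating one straggler here is 50% extra computation; practical schemes trade redundancy against the number of stragglers tolerated in more refined ways.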

Approximate Newton Methods: Several works in the literature prove convergence guarantees for Newton’s method when the Hessian is computed approximately using ideas from RandNLA [28, 29, 30, 31, 32]. However, these algorithms are designed for a single machine. Our goal in this paper is to use ideas from RandNLA to design a distributed approximate Newton method for a serverless system that is resilient to stragglers.

Distributed Second-Order Methods: There has been a growing research interest in designing and analyzing distributed (synchronous) implementations of second-order methods [1, 2, 3, 4, 5, 6]. However, these implementations are tailored for server-based distributed systems. Our focus, on the other hand, is on serverless systems. Our motivation behind considering serverless systems stems from their usability benefits, cost efficiency, and extensive and inexpensive commercial offerings [9, 10]. We implement our algorithms using the recently developed serverless framework called PyWren [9]. While there are works that evaluate existing algorithms on serverless systems [12, 33], this is the first work that proposes a large-scale distributed optimization algorithm for serverless systems. We exploit the advantages offered by serverless systems while mitigating the drawbacks such as stragglers and additional overhead per invocation of workers.

## 2 OverSketched Newton

### 2.1 Problem Formulation and Newton’s Method

We are interested in solving a problem of the following form on serverless systems in a distributed and straggler-resilient manner:

$$f(w^*) = \min_{w \in \mathcal{C}} f(w), \tag{1}$$

where $f: \mathbb{R}^d \to \mathbb{R}$ is a closed and twice-differentiable convex function bounded from below, and $\mathcal{C}$ is a given convex and closed set. We assume that the minimizer $w^*$ exists and is uniquely defined. Let $\gamma$ and $\beta$ denote, respectively, the minimum and maximum eigenvalues of the Hessian of $f$ evaluated at the minimum, i.e., $\nabla^2 f(w^*)$. In addition, we assume that the Hessian $\nabla^2 f(w)$ is Lipschitz continuous with modulus $L$, that is, for any $\Delta \in \mathbb{R}^d$,

$$\|\nabla^2 f(w + \Delta) - \nabla^2 f(w)\|_{op} \le L \|\Delta\|_2, \tag{2}$$

where $\|\cdot\|_{op}$ is the operator norm.

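In Newton's method, the update $w_{t+1}$ is obtained by minimizing the second-order Taylor expansion of the objective function at $w_t$, within the constraint set $\mathcal{C}$, that is

$$w_{t+1} = \arg\min_{w \in \mathcal{C}} \left\{ f(w_t) + \nabla f(w_t)^T (w - w_t) + \frac{1}{2} (w - w_t)^T \nabla^2 f(w_t) (w - w_t) \right\}. \tag{3}$$

For the unconstrained case, that is when $\mathcal{C} = \mathbb{R}^d$, Eq. (3) becomes

$$w_{t+1} = w_t - [\nabla^2 f(w_t)]^{-1} \nabla f(w_t). \tag{4}$$

Given a good initial point $w_0$ such that

$$\|w_0 - w^*\|_2 \le \frac{\gamma}{8L}, \tag{5}$$

Newton's method satisfies the following update

$$\|w_{t+1} - w^*\|_2 \le \frac{2L}{\gamma} \|w_t - w^*\|_2^2, \tag{6}$$

that is, the error contracts quadratically.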
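As a small numerical illustration of the unconstrained update in Eq. (4), the sketch below runs Newton's method on a one-dimensional regularized logistic-type objective, $f(w) = \log(1 + e^{-w}) + \frac{\lambda}{2} w^2$. The objective, the value of `LAM` and the starting point are assumptions chosen only for illustration; the recorded gradient magnitudes shrink roughly quadratically from one iteration to the next.

```python
import math

LAM = 1.0  # hypothetical regularization weight

def grad(w):
    # Derivative of log(1 + exp(-w)) + (LAM / 2) * w**2.
    return -1.0 / (1.0 + math.exp(w)) + LAM * w

def hess(w):
    # Second derivative: sigmoid(w) * (1 - sigmoid(w)) + LAM, always >= LAM > 0.
    s = 1.0 / (1.0 + math.exp(-w))
    return s * (1.0 - s) + LAM

def newton(w0, iters):
    """Run the 1-D Newton update w <- w - f'(w) / f''(w) and record |f'(w)|."""
    w = w0
    history = [abs(grad(w))]
    for _ in range(iters):
        w = w - grad(w) / hess(w)
        history.append(abs(grad(w)))
    return w, history

w_star, history = newton(0.0, 6)
for t, g in enumerate(history):
    print(f"iteration {t}: |f'(w_t)| = {g:.3e}")
```

In higher dimensions the division by `hess(w)` becomes a linear solve with the Hessian matrix, which is the step whose cost motivates the approximate Hessians discussed next.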

In many applications like machine learning where the training data itself is noisy, using the exact Hessian is not necessary. Indeed, many results in the literature prove convergence guarantees for Newton’s method when the Hessian is computed approximately using ideas from RandNLA for a single machine [28, 29, 30, 31]. In particular, these methods perform a form of dimensionality reduction for the Hessian using random matrices, called sketching matrices. Many popular sketching schemes have been proposed in the literature, for example, sub-Gaussian, Hadamard, random row sampling, sparse Johnson-Lindenstrauss, etc. [36, 16].

Next, we present OverSketched Newton for solving problems of the form (1) distributedly in serverless systems using ideas from RandNLA.

### 2.2 OverSketched Newton

OverSketched Newton computes the full gradient in each iteration by using tools from error-correcting codes [19, 20, 24]. The exact method of obtaining the full gradient in a distributed straggler-resilient way depends on the optimization problem at hand. Our key observation is that, for several commonly encountered optimization problems, gradient computation relies on matrix-vector multiplication (see Sec. 3.2 for examples). We leverage the coded matrix multiplication technique from [20] to perform the large-scale matrix-vector multiplication in a distributed straggler-resilient manner. The main idea of coded matrix multiplication is explained in Fig. 2; detailed steps are provided in Algorithm 1.

Similar to the gradient computation, the particular method of obtaining the Hessian in a distributed straggler-resilient way depends on the optimization problem at hand. For several commonly encountered optimization problems, Hessian computation involves matrix-matrix multiplication for a pair of large matrices (see Sec. 3.2 for several examples). For computing the large-scale matrix-matrix multiplication in parallel in serverless systems, we propose to use a straggler-resilient scheme called OverSketch from [11]. OverSketch does blocked partitioning of input matrices, and hence, it is more communication efficient than existing coding-based straggler mitigation schemes that do naïve row-column partition of input matrices [25, 26]. We note that it is well known in HPC that blocked partitioning of input matrices can lead to communication-efficient methods for distributed multiplication [11, 37, 38].

OverSketch uses a sparse sketching matrix based on Count-Sketch [36]. It has similar computational efficiency and accuracy guarantees as that of the Count-Sketch, with two additional properties: it is amenable to distributed implementation; and it is resilient to stragglers. More specifically, the OverSketch matrix is given as [11]:

$$S = \frac{1}{\sqrt{N}} (S_1, S_2, \cdots, S_{N+e}), \tag{7}$$

where $S_i \in \mathbb{R}^{n \times b}$, for all $i \in \{1, \dots, N+e\}$, are i.i.d. Count-Sketch matrices with sketch dimension $b$, and $e$ is the maximum number of stragglers per $N+e$ blocks. Note that $S \in \mathbb{R}^{n \times (m + eb)}$, where $m = Nb$ is the required sketch dimension and $e$ is the over-provisioning parameter to provide resiliency against $e$ stragglers per $N+e$ workers. We assume that $e = \eta N$ for some constant redundancy factor $\eta > 0$.

Each of the Count-Sketch matrices $S_i$ is constructed (independently of the others) as follows. First, for every row $j$, $j \in \{1, \dots, n\}$, of $S_i$, independently choose a column $h(j) \in \{1, \dots, b\}$ uniformly at random. Then, select a uniformly random element from $\{-1, +1\}$, denoted as $\sigma(j)$. Finally, set $S_i(j, h(j)) = \sigma(j)$ and set $S_i(j, l) = 0$ for all $l \ne h(j)$. (See [36, 11] for details.) We can leverage the straggler resiliency of OverSketch to obtain the sketched Hessian in a distributed straggler-resilient manner. An illustration of OverSketch is provided in Fig. 3; see Algorithm 2 for details.
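The construction above can be sketched in a few lines of single-machine Python. This is a simplified simulation under stated assumptions, not the distributed blocked implementation of [11]: each Count-Sketch block is stored as a hash array and a sign array, every block stands in for one worker, and the blocks listed in `stragglers` are simply dropped. The function names and the test sizes are hypothetical.

```python
import random

def count_sketch(n, b, rng):
    """Return (h, sigma): row j of the n x b Count-Sketch holds sigma[j] in column h[j]."""
    h = [rng.randrange(b) for _ in range(n)]
    sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return h, sigma

def apply_sketch(sketch, A, b):
    """Compute S_i^T A (a b x d matrix) without materializing S_i."""
    h, sigma = sketch
    d = len(A[0])
    out = [[0.0] * d for _ in range(b)]
    for j, row in enumerate(A):
        target = out[h[j]]
        for k in range(d):
            target[k] += sigma[j] * row[k]
    return out

def gram(B):
    """Return B^T B for a list-of-rows matrix B."""
    d = len(B[0])
    return [[sum(r[p] * r[q] for r in B) for q in range(d)] for p in range(d)]

def oversketched_gram(A, N, e, b, stragglers, rng):
    """Approximate A^T A from N + e sketched blocks, ignoring straggling blocks."""
    n, d = len(A), len(A[0])
    total = [[0.0] * d for _ in range(d)]
    used = 0
    for i in range(N + e):
        sketch = count_sketch(n, b, rng)   # every block is launched ...
        if i in stragglers or used == N:   # ... but only the first N finishers are used
            continue
        G = gram(apply_sketch(sketch, A, b))
        for p in range(d):
            for q in range(d):
                total[p][q] += G[p][q] / N  # the 1/sqrt(N) scaling in (7), squared
        used += 1
    return total

rng = random.Random(0)
A = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
approx = oversketched_gram(A, N=8, e=2, b=20, stragglers={1, 5}, rng=rng)
```

Because each block is an independent unbiased estimate of $A^T A$, dropping up to $e$ of the $N+e$ blocks changes which samples are averaged but not the form of the estimator, which is where the straggler resiliency comes from.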

The model update for OverSketched Newton is given by

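$$w_{t+1} = \arg\min_{w \in \mathcal{C}} \left\{ f(w_t) + \nabla f(w_t)^T (w - w_t) + \frac{1}{2} (w - w_t)^T \hat{H}_t (w - w_t) \right\}, \tag{8}$$

where $\hat{H}_t = A_t^T S_t S_t^T A_t$, $A_t$ is the square root of the Hessian $\nabla^2 f(w_t)$ (that is, $\nabla^2 f(w_t) = A_t^T A_t$), and $S_t$ is an independent realization of (7) at the $t$-th iteration.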

### 2.3 Local and Global Convergence Guarantees

Next, we prove the following local convergence guarantee for OverSketched Newton, which uses the full gradient together with the sketch matrix in (7) for approximate Hessian computation. Such a convergence is referred to as “local” in the literature [28, 30] since it assumes that the initial starting point $w_0$ satisfies (5). Let $m$ be the resultant sketch dimension of $S$ obtained after ignoring the stragglers, that is, $m = Nb$, where $b$ is the sketch dimension of each Count-Sketch block and $N$ is the number of non-straggling blocks.

###### Theorem 2.1 (Local convergence).

Let $w^*$ be the optimal solution of (1) and $\gamma$ and $\beta$ be the minimum and maximum eigenvalues of $\nabla^2 f(w^*)$, respectively. Let and . Then, using an OverSketch matrix with a sketch dimension and the number of column-blocks , the updates for OverSketched Newton with initialization satisfying (5) follow

$$\|w_{t+1} - w^*\|_2 \le \frac{25L}{8\gamma} \|w_t - w^*\|_2^2 + \frac{5\epsilon\beta}{\gamma} \|w_t - w^*\|_2,$$

with probability at least

.

###### Proof.

See Section 5.1. ∎

Theorem 2.1 implies that the convergence is linear-quadratic in the error $\Delta_t = w_t - w^*$. Initially, when $\|\Delta_t\|_2$ is large, the first term of the RHS will dominate and the convergence will be quadratic, that is, $\|\Delta_{t+1}\|_2 \lesssim \frac{25L}{8\gamma} \|\Delta_t\|_2^2$. In later stages, when $\|\Delta_t\|_2$ becomes sufficiently small, the second term of the RHS will start to dominate and the convergence will be linear, that is, $\|\Delta_{t+1}\|_2 \lesssim \frac{5\epsilon\beta}{\gamma} \|\Delta_t\|_2$. At this stage, the sketch dimension can be increased to reduce $\epsilon$, which diminishes the effect of the linear term and improves the convergence rate in practice.
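The two regimes can be seen by iterating the right-hand side of the bound in Theorem 2.1 directly. In the sketch below, the constants `L_LIP`, `GAMMA`, `BETA` and `EPS` are hypothetical values chosen only for illustration, and the sequence is the worst-case bound rather than the error of an actual run. The per-iteration contraction ratio starts well above the linear coefficient $5\epsilon\beta/\gamma$ and settles at it once the quadratic term becomes negligible.

```python
# Hypothetical constants, chosen only for illustration (not taken from the paper).
L_LIP, GAMMA, BETA, EPS = 1.0, 1.0, 2.0, 0.01

def error_bound_sequence(e0, iters):
    """Iterate the worst-case bound e_{t+1} = a * e_t**2 + c * e_t."""
    a = 25.0 * L_LIP / (8.0 * GAMMA)  # coefficient of the quadratic term
    c = 5.0 * EPS * BETA / GAMMA      # coefficient of the linear term
    errors = [e0]
    for _ in range(iters):
        e = errors[-1]
        errors.append(a * e * e + c * e)
    return errors

# Start at the radius gamma / (8 L) allowed by the initialization condition (5).
errors = error_bound_sequence(GAMMA / (8.0 * L_LIP), 12)
ratios = [nxt / cur for cur, nxt in zip(errors, errors[1:])]
for t, r in enumerate(ratios):
    print(f"t = {t:2d}: e_(t+1) / e_t = {r:.6f}")
```

With these values the linear coefficient is 0.1, so halving `EPS` (a larger sketch) would halve the asymptotic contraction ratio.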

The above guarantee implies that when $w_0$ is sufficiently close to $w^*$, OverSketched Newton has a linear-quadratic convergence rate. However, in general, a good starting point may not be available. Thus, we prove the following “global” convergence guarantee, which shows that OverSketched Newton converges from any random initialization $w_0 \in \mathbb{R}^d$ with high probability. In the following, we focus our attention on the unconstrained case, i.e., $\mathcal{C} = \mathbb{R}^d$. We assume that $f$ is smooth and strongly convex in $\mathbb{R}^d$, that is,

$$kI \preceq \nabla^2 f(w) \preceq KI, \tag{9}$$

for some $k > 0$ and $K < \infty$. In addition, we use line-search to choose the step-size, and the update is given below

$$w_{t+1} = w_t + \alpha_t p_t,$$

where

$$p_t = -\hat{H}_t^{-1} \nabla f(w_t), \tag{10}$$

and, for some constant $\beta$,

$$\alpha_t = \max \alpha \quad \text{such that} \quad \alpha \le 1 \ \text{ and } \ f(w_t + \alpha p_t) \le f(w_t) + \alpha \beta\, p_t^T \nabla f(w_t). \tag{11}$$

Recall that $\hat{H}_t = A_t^T S_t S_t^T A_t$, where $A_t$ is the square root of the Hessian $\nabla^2 f(w_t)$, and $S_t$ is an independent realization of (7) at the $t$-th iteration.

Note that (11) can be solved approximately in single machine systems using Armijo backtracking line search [35]. In Section 2.4, we describe how to implement distributed line-search in serverless systems when the data is stored in the cloud.
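A minimal single-machine sketch of Armijo backtracking is given here: start from a unit step and shrink it until the sufficient-decrease condition holds. The default values `beta=0.1` and `shrink=0.5`, the function name `armijo_step` and the toy objective are assumptions made for illustration. The routine returns the largest step on the geometric grid $1, 1/2, 1/4, \dots$ that satisfies the condition, which approximates rather than exactly solves the maximization over $\alpha$.

```python
def armijo_step(f, w, p, grad_w, beta=0.1, shrink=0.5, max_halvings=30):
    """Backtrack from alpha = 1 until f(w + alpha p) <= f(w) + alpha * beta * p^T grad."""
    slope = sum(pi * gi for pi, gi in zip(p, grad_w))  # p^T grad f(w); negative for descent
    f_w = f(w)
    alpha = 1.0
    for _ in range(max_halvings):
        trial = [wi + alpha * pi for wi, pi in zip(w, p)]
        if f(trial) <= f_w + alpha * beta * slope:
            return alpha
        alpha *= shrink
    return alpha

# Toy objective f(w) = w_1^4 + w_2^2 (hypothetical, for illustration only).
def f(w):
    return w[0] ** 4 + w[1] ** 2

w = [2.0, 1.0]
g = [4.0 * w[0] ** 3, 2.0 * w[1]]                           # gradient (32, 2)
newton_dir = [-g[0] / (12.0 * w[0] ** 2), -g[1] / 2.0]      # exact (diagonal) Hessian
gradient_dir = [-g[0], -g[1]]                               # plain steepest descent

alpha_newton = armijo_step(f, w, newton_dir, g)
alpha_gradient = armijo_step(f, w, gradient_dir, g)
print(alpha_newton, alpha_gradient)
```

On this toy problem the Newton direction accepts the unit step immediately, while the raw gradient direction needs several halvings, which is consistent with the earlier remark that second-order methods need little step-size tuning.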

###### Theorem 2.2 (Global convergence).

Let $w^*$ be the optimal solution of (1) where $f$ satisfies the constraints in (9). Let and be positive constants. Then, using an OverSketch matrix with a sketch dimension and the number of column-blocks , the updates for OverSketched Newton, for any , satisfy

$$f(w_{t+1}) - f(w^*) \le (1 - \rho)\,(f(w_t) - f(w^*)),$$

with probability at least , where . Moreover, the step-size satisfies .

###### Proof.

See Section 5.2. ∎

Theorem 2.2 guarantees the global convergence of OverSketched Newton, starting with any initial estimate $w_0 \in \mathbb{R}^d$, to the optimal solution $w^*$ with at least a linear rate.

### 2.4 Distributed Line Search

In our experiments in Section 4, line-search was not required for the three synthetic and real datasets where OverSketched Newton was employed. However, that might not be true in general, as the global guarantee in Theorem 2.2 depends on line-search. In general, line-search may be required until the current model estimate is close to $w^*$, that is, until it satisfies the condition in (5). Here, we describe a line-search procedure for distributed serverless optimization, which is inspired by the line-search method from [5] for server-based systems. (Footnote 2: Note that codes can be used to mitigate stragglers during distributed line-search in a manner similar to the gradient computation phase.)

To solve for the step-size as described in the optimization problem in (11), we fix the constant $\beta$ and choose a finite candidate set $\mathcal{S}$ of step-sizes. After the master calculates the descent direction $p_t$ in the $t$-th iteration according to (10), the $i$-th worker calculates $f_i(w_t + \alpha p_t)$ for all values of $\alpha$ in the candidate set $\mathcal{S}$, where $f_i(\cdot)$ depends on the local data available at the $i$-th worker and $f(\cdot) = \sum_i f_i(\cdot)$. The master then sums the results from the workers to obtain $f(w_t + \alpha p_t)$ for all values of $\alpha$ in $\mathcal{S}$ and finds the largest $\alpha$ that satisfies the Armijo condition in (11).

Note that the line search requires an additional round of communication where the master communicates $p_t$ to the workers through the cloud. The workers then calculate the functions $f_i(\cdot)$ using local data and send the results back to the master. Finally, the master finds the right step-size and updates the model estimate $w_t$.
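The procedure above can be simulated on one machine as follows. This is a sketch under stated assumptions: the local objective is a least-squares loss over each worker's shard, the candidate set and the value `beta=0.1` are hypothetical choices, the workers are evaluated in a plain loop rather than in parallel, and the current objective value is obtained by evaluating the same worker routine at step-size zero.

```python
def local_losses(shard, w, p, candidates):
    """Worker side: evaluate f_i(w + alpha * p) for every candidate alpha.

    `shard` is a list of (x, y) pairs; f_i is a least-squares loss (an assumption).
    """
    out = []
    for alpha in candidates:
        trial = [wi + alpha * pi for wi, pi in zip(w, p)]
        out.append(sum(0.5 * (sum(a * b for a, b in zip(x, trial)) - y) ** 2
                       for x, y in shard))
    return out

def master_line_search(shards, w, p, grad_w, candidates, beta=0.1):
    """Master side: sum worker results and pick the largest alpha passing Armijo."""
    per_worker = [local_losses(s, w, p, candidates) for s in shards]  # parallel in practice
    totals = [sum(col) for col in zip(*per_worker)]                   # f(w + alpha p)
    f_w = sum(local_losses(s, w, p, [0.0])[0] for s in shards)        # f(w)
    slope = sum(pi * gi for pi, gi in zip(p, grad_w))                 # p^T grad f(w)
    for alpha, f_trial in sorted(zip(candidates, totals), reverse=True):
        if f_trial <= f_w + alpha * beta * slope:
            return alpha
    return min(candidates)

# Two workers, one sample each: f(w) = 0.5 * ((w_1 - 1)^2 + (w_2 - 2)^2).
shards = [[([1.0, 0.0], 1.0)], [([0.0, 1.0], 2.0)]]
w = [0.0, 0.0]
grad_w = [-1.0, -2.0]
candidates = [1.0, 0.25, 0.0625]  # hypothetical candidate set

alpha_good = master_line_search(shards, w, [1.0, 2.0], grad_w, candidates)
alpha_long = master_line_search(shards, w, [10.0, 20.0], grad_w, candidates)
print(alpha_good, alpha_long)
```

Evaluating all candidates in one round keeps the number of worker invocations per iteration fixed, at the cost of a few extra function evaluations per worker.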

## 3 OverSketched Newton on Serverless Systems

### 3.1 Logistic Regression using OverSketched Newton on Serverless Systems

The optimization problem for supervised learning using Logistic Regression takes the form

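$$\min_{w \in \mathbb{R}^d} \left\{ f(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right) + \frac{\lambda}{2} \|w\|_2^2 \right\}. \tag{12}$$

Here, $x_1, \dots, x_n \in \mathbb{R}^{d}$ and $y_1, \dots, y_n \in \{-1, +1\}$ are training sample vectors and labels, respectively. The goal is to learn the feature vector $w^* \in \mathbb{R}^{d}$. Let $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$ and $y = [y_1, \dots, y_n] \in \mathbb{R}^{n}$ be the feature and label matrices, respectively. Using Newton's method to solve (12) first requires evaluation of the gradient

$$\nabla f(w) = \frac{1}{n} \sum_{i=1}^{n} \frac{-y_i x_i}{1 + e^{y_i w^T x_i}} + \lambda w.$$

Calculation of $\nabla f(w)$ involves two matrix-vector products, $\alpha = X^T w$ and $\nabla f(w) = \frac{1}{n} X \beta + \lambda w$, where $\beta_i = \frac{-y_i}{1 + e^{y_i \alpha_i}}$ for all $i \in \{1, \dots, n\}$. These matrix-vector products are performed distributedly. Faster convergence can be obtained by second-order methods, which additionally compute the Hessian $\nabla^2 f(w) = \frac{1}{n} X \Lambda X^T + \lambda I_d$, where $\Lambda$ is a diagonal matrix with entries given by $\Lambda(i, i) = \frac{e^{y_i \alpha_i}}{(1 + e^{y_i \alpha_i})^2}$. The product $X \Lambda X^T$ is computed approximately in a distributed straggler-resilient manner using (7).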
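The sketch below spells out the gradient and exact Hessian of the regularized logistic loss on a single machine, organized around the product $\alpha = X^T w$, a per-sample weight vector, and a second product with $X$. It assumes labels in $\{-1, +1\}$ and a $d \times n$ data matrix whose columns are samples; the function names and the toy data are hypothetical. In the distributed algorithm the two matrix-vector products would be coded and the Hessian product would be sketched, neither of which is done here.

```python
import math

def logistic_loss(X, y, w, lam):
    """Regularized logistic loss; X is d x n (columns are samples), y in {-1, +1}."""
    d, n = len(X), len(X[0])
    margins = [y[i] * sum(X[k][i] * w[k] for k in range(d)) for i in range(n)]
    data_term = sum(math.log(1.0 + math.exp(-m)) for m in margins) / n
    return data_term + 0.5 * lam * sum(v * v for v in w)

def logistic_grad_hess(X, y, w, lam):
    """Return (gradient, exact Hessian) using the two-product structure."""
    d, n = len(X), len(X[0])
    alpha = [sum(X[k][i] * w[k] for k in range(d)) for i in range(n)]      # X^T w
    beta = [-y[i] / (1.0 + math.exp(y[i] * alpha[i])) for i in range(n)]   # per-sample weights
    grad = [sum(X[k][i] * beta[i] for i in range(n)) / n + lam * w[k]      # X beta / n + lam w
            for k in range(d)]
    lam_diag = [math.exp(y[i] * alpha[i]) / (1.0 + math.exp(y[i] * alpha[i])) ** 2
                for i in range(n)]                                         # diagonal of Lambda
    hess = [[sum(X[p][i] * lam_diag[i] * X[q][i] for i in range(n)) / n
             + (lam if p == q else 0.0)
             for q in range(d)] for p in range(d)]
    return grad, hess

# Toy data (hypothetical): d = 2 features, n = 3 samples.
X = [[1.0, -2.0, 0.5],
     [0.5, 1.0, -1.5]]
y = [1.0, -1.0, 1.0]
w = [0.3, -0.2]
lam = 0.1
grad, hess = logistic_grad_hess(X, y, w, lam)
```

A quick sanity check for code of this kind is to compare `grad` against central finite differences of `logistic_loss`, since a sign error in the per-sample weights is easy to make.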

We provide a detailed description of OverSketched Newton for large-scale logistic regression for serverless systems in Algorithm 3. Here, steps 4, 8, and 14 are computed in parallel on AWS Lambda. All other steps are simple vector operations that can be performed locally at the master, for instance, the user's laptop. We assume that the number of features is small enough to fit the Hessian matrix H locally at the master and compute the update efficiently. Steps 4 and 8 are executed in a straggler-resilient fashion using the coding scheme in [20], as illustrated in Fig. 2 and described in detail in Algorithm 1.

We use the coding scheme in [20] since the encoding can be implemented in parallel and requires less communication per worker compared to the other schemes, for example schemes in [19, 26], that use Maximum Distance Separable (MDS) codes. Moreover, the decoding scheme takes linear time and is applicable on real-valued matrices. Note that since the example matrix $X$ is constant in this example, the encoding of $X$ is done only once before starting the optimization algorithm. Thus, the encoding cost can be amortized over iterations. Moreover, decoding over the resultant product vector requires negligible time and space, even when $n$ is scaling into the millions.

The same is, however, not true for the matrix multiplication for Hessian calculation (step 14 of Algorithm 3), as the matrix $\Lambda$ changes in each iteration; thus, encoding costs will be incurred in every iteration if error-correcting codes are used. Moreover, encoding and decoding a huge matrix stored in the cloud incurs heavy communication cost and becomes prohibitive. Motivated by this, we use OverSketch in step 14, as described in Algorithm 2, to calculate an approximate matrix multiplication, and hence the Hessian, efficiently in serverless systems with inbuilt straggler resiliency. (Footnote 3: We also evaluate the exact Hessian-based algorithm with speculative execution, i.e., recomputing the straggling jobs, and compare it with OverSketched Newton in Sec. 4.)

### 3.2 Example Problems

In this section, we describe several commonly encountered optimization problems (besides logistic regression) that can be solved using OverSketched Newton.

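Ridge Regularized Linear Regression: The optimization problem is

$$\min_{w \in \mathbb{R}^d} \frac{1}{2n} \|X^T w - y\|_2^2 + \frac{\lambda}{2} \|w\|_2^2. \tag{13}$$

The gradient in this case can be written as $\nabla f(w) = \frac{1}{n} X (\beta - y) + \lambda w$, where $\beta = X^T w$, and where the training matrix $X$ and label vector $y$ were defined previously. The Hessian is given by $\nabla^2 f(w) = \frac{1}{n} X X^T + \lambda I$. For $n \gg d$, this can be computed approximately using the sketch matrix in (7).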

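Linear programming via interior point methods: The following linear program can be solved using OverSketched Newton

$$\min_{Ax \le b} c^T x, \tag{14}$$

where $x \in \mathbb{R}^{m}$, $c \in \mathbb{R}^{m}$, $b \in \mathbb{R}^{n}$, and $A \in \mathbb{R}^{n \times m}$ is the constraint matrix with $n > m$. In algorithms based on interior point methods, the following sequence of problems is solved using Newton's method

$$\min_{x \in \mathbb{R}^m} f(x) = \min_{x \in \mathbb{R}^m} \left( \tau c^T x - \sum_{i=1}^{n} \log(b_i - a_i x) \right), \tag{15}$$

where $a_i$ is the $i$-th row of $A$, and $\tau$ is increased geometrically such that when $\tau$ is very large, the logarithmic term does not affect the objective value and serves its purpose of keeping all intermediate solutions inside the constraint region. The update in the $t$-th iteration is given by $x_{t+1} = x_t - (\nabla^2 f(x_t))^{-1} \nabla f(x_t)$, where $x_t$ is the estimate of the solution in the $t$-th iteration. The gradient can be written as $\nabla f(x) = \tau c + A^T \beta$, where $\beta_i = 1/(b_i - \alpha_i)$ and $\alpha = Ax$.

The Hessian for the objective in (15) is given by

$$\nabla^2 f(x) = A^T \operatorname{diag}\left( \frac{1}{(b_i - \alpha_i)^2} \right) A. \tag{16}$$

The square root of the Hessian is given by $\nabla^2 f(x)^{1/2} = \operatorname{diag}\left( \frac{1}{|b_i - \alpha_i|} \right) A$. The computation of the Hessian requires $O(nm^2)$ time and is the bottleneck in each iteration. Thus, we can use sketching to mitigate stragglers while evaluating the Hessian efficiently, i.e., $\nabla^2 f(x) \approx (S^T \nabla^2 f(x)^{1/2})^T (S^T \nabla^2 f(x)^{1/2})$, where $S$ is the OverSketch matrix defined in (7).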

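Lasso Regularized Linear Regression: The optimization problem takes the following form

$$\min_{w \in \mathbb{R}^d} \frac{1}{2} \|Xw - y\|_2^2 + \lambda \|w\|_1, \tag{17}$$

where $X \in \mathbb{R}^{n \times d}$ is the measurement matrix, the vector $y \in \mathbb{R}^{n}$ contains the measurements, and $\lambda \ge 0$ is the regularization parameter. To solve (17), we consider its dual variation

$$\min_{\|X^T z\|_\infty \le \lambda,\ z \in \mathbb{R}^n} \frac{1}{2} \|y - z\|_2^2,$$

which is amenable to interior point methods and can be solved by optimizing the following sequence of problems where $\tau$ is increased geometrically

$$\min_{z} f(z) = \min_{z} \left( \frac{\tau}{2} \|y - z\|_2^2 - \sum_{j=1}^{d} \log(\lambda - x_j^T z) - \sum_{j=1}^{d} \log(\lambda + x_j^T z) \right),$$

where $x_j$ is the $j$-th column of $X$. The gradient can be expressed in a few matrix-vector multiplications as $\nabla f(z) = \tau (z - y) + X(\beta - \gamma)$, where $\beta_j = 1/(\lambda - \alpha_j)$, $\gamma_j = 1/(\lambda + \alpha_j)$, and $\alpha = X^T z$. Similarly, the Hessian can be written as $\nabla^2 f(z) = \tau I + X \Lambda X^T$, where $\Lambda$ is a diagonal matrix whose entries are given by $\Lambda_{jj} = 1/(\lambda - \alpha_j)^2 + 1/(\lambda + \alpha_j)^2$.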

Other common problems where OverSketched Newton is applicable include Linear Regression, Support Vector Machines (SVMs), Semidefinite programs, etc.

## 4 Experimental Results

In this section, we evaluate OverSketched Newton on AWS Lambda using real-world and synthetic datasets, and we compare it with state-of-the-art distributed optimization algorithms. (Footnote 4: A working implementation of OverSketched Newton is available at https://github.com/vvipgupta/OverSketchedNewton.) We use the serverless computing framework PyWren, developed recently in [9]. Our experiments are focused on logistic regression, a popular supervised learning problem, but they can be reproduced for other problems described in Section 3.2.

### 4.1 Comparison with Existing Second-order Methods

For comparison of OverSketched Newton with existing distributed optimization schemes, we choose the recently proposed Globally Improved Approximate Newton Direction (GIANT) [5] as a representative algorithm. This is because GIANT boasts a better convergence rate than many existing distributed second-order methods for linear and logistic regression when $n \gg d$. In GIANT, and other similar distributed second-order algorithms, the training data is evenly divided among workers, and the algorithms proceed in two stages. First, the workers compute partial gradients using local training data, which are then aggregated by the master to compute the exact gradient. Second, the workers receive the full gradient to calculate their local second-order estimate, which is then averaged by the master. An illustration is shown in Fig. 4.

For straggler mitigation in such server-based algorithms, [24] proposes a scheme for coding gradient updates called gradient coding, where the data at each worker is repeated multiple times to compute redundant copies of the gradient. See Figure 4(b) for illustration. Figure 4(a) illustrates the scheme that waits for all workers and Figure 4(c) illustrates the ignoring stragglers approach. We use the three schemes for dealing with stragglers illustrated in Figure 5 during the two stages of GIANT, and we compare their convergence with OverSketched Newton. We also evaluate exact Newton’s method with speculative execution for straggler mitigation, that is, reassigning and recomputing the work for straggling workers, and compare its convergence rate with OverSketched Newton.

###### Remark 1.

We note that the conventional distributed second-order methods for server-based systems—which distribute training examples evenly across workers (such as [1, 2, 3, 4, 5])—typically find a “local approximation” of second-order update at each worker and then aggregate it. OverSketched Newton, on the other hand, utilizes the massive storage and compute power in serverless systems to find a “global approximation”. As we demonstrate next using extensive experiments, it performs significantly better than existing server-based methods.

### 4.2 Experiments on Synthetic Data

In Figure 6, we present our experimental results on a randomly generated dataset with and for logistic regression on AWS Lambda. Each column $x_i$, for all $i \in \{1, \dots, n\}$, is sampled uniformly at random from the cube . The labels are sampled from the logistic model, that is, , where the weight vector and bias are generated randomly from the normal distribution.

The orange, blue and red curves demonstrate the convergence for GIANT with the full gradient (that waits for all the workers), gradient coding and mini-batch gradient (that ignores the stragglers while calculating gradient and second-order updates) schemes, respectively. The purple and green curves depict the convergence for the exact Newton’s method and OverSketched Newton, respectively. The gradient coding scheme is applied for one straggler, that is the data is repeated twice at each worker. We use 60 Lambda workers for executing GIANT in parallel. Similarly, for Newton’s method, we use 60 workers for matrix-vector multiplication in steps 4 and 8 of Algorithm 3, workers for exact Hessian computation and workers for sketched Hessian computation with a sketch dimension of in step 14 of Algorithm 3.

An important point to note from Fig. 6 is that the uncoded scheme (that is, the one that waits for all stragglers) has the worst performance. The implication is that good straggler/fault mitigation algorithms are essential for computing in the serverless setting. Secondly, the mini-batch scheme outperforms the gradient coding scheme by . This is because gradient coding requires additional communication of data to serverless workers (twice when coding for one straggler, see [24] for details) at each invocation to AWS Lambda. On the other hand, the exact Newton’s method converges much faster than GIANT, even though it requires more time per iteration.

The green plot shows the convergence of OverSketched Newton. Note that the number of iterations required for convergence is similar for OverSketched Newton and exact Newton (which computes the Hessian exactly), but OverSketched Newton converges in almost half the time due to significant time savings during the computation of the Hessian, which is the computational bottleneck in each iteration.

### 4.3 Experiments on Real Data

In Figure 7, we use the EPSILON classification dataset obtained from [39], with and . We plot training and testing errors for logistic regression for the schemes described in the previous section. Here, we use workers for GIANT, and workers for matrix-vector multiplications for gradient calculation in OverSketched Newton. We use gradient coding designed for three stragglers in GIANT. This scheme performs worse than uncoded GIANT that waits for all the stragglers due to the repetition of training data at workers. Hence, one can conclude that the communication costs dominate the straggling costs. In fact, it can be observed that the mini-batch gradient scheme that ignores the stragglers outperforms the gradient coding and uncoded schemes for GIANT.

During exact Hessian computation, we use serverless workers with speculative execution to mitigate stragglers (i.e., recomputing the straggling jobs) compared to OverSketched Newton that uses workers with a sketch dimension of . OverSketched Newton requires a significantly smaller number of workers, as once the square root of Hessian is sketched in a distributed fashion, it can be copied into local memory of the master due to dimension reduction, and the Hessian can be calculated locally. Testing error follows training error closely, and important conclusions remain the same as in Figure 6. OverSketched Newton significantly outperforms GIANT and exact Newton-based optimization in terms of running time.
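The idea of sketching the square root of the Hessian and then forming the Hessian locally can be illustrated as follows. This is a minimal pure-Python sketch under assumptions: a single Count-Sketch (one random ±1 entry per row of the sketch matrix) stands in for the sparse sketch of (7), and the function names `count_sketch` and `gram` are hypothetical. Given a square root A with many rows, the sketched matrix has only `m` rows, so the product that yields the approximate Hessian is small enough to compute on one machine.

```python
import random


def count_sketch(A, m, rng):
    """Return S^T A (m x d), where S has one random +/-1 entry per row."""
    d = len(A[0])
    SA = [[0.0] * d for _ in range(m)]
    for row in A:
        bucket = rng.randrange(m)          # which sketched row this data row lands in
        sign = rng.choice((-1.0, 1.0))     # random sign keeps the estimate unbiased
        for j in range(d):
            SA[bucket][j] += sign * row[j]
    return SA


def gram(M):
    """Return M^T M as a d x d list of lists."""
    d = len(M[0])
    return [[sum(r[i] * r[j] for r in M) for j in range(d)] for i in range(d)]


def sketched_hessian(A, m, seed=0):
    """Approximate A^T A by (S^T A)^T (S^T A) using an m-row sketch."""
    return gram(count_sketch(A, m, random.Random(seed)))
```

Here `gram(A)` plays the role of the exact Hessian and `sketched_hessian(A, m)` that of its approximation; the approximation quality depends on `m`, which is a free parameter in this sketch.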

We repeated the above experiments for classification on the web page dataset [39] with and . We used 30 workers for each iteration in GIANT and for any matrix-vector multiplications. Exact Hessian calculation invokes workers as opposed to workers for OverSketched Newton, where the sketch dimension was . The results are shown in Figure 8 and follow the trends observed so far.

### 4.4 Coded Computing versus Speculative Execution

In Figure 9, we compare the effect of two straggler mitigation schemes on the convergence rate during training and testing: speculative execution, that is, restarting the jobs with straggling workers, and coded computing. We regard OverSketch-based matrix multiplication as a coding scheme in which some redundancy is introduced during “over” sketching for matrix multiplication. There are four different cases, corresponding to gradient and Hessian calculation using either speculative execution or coded computing. For speculative execution, we wait for at least of the workers to return (this works well as the number of stragglers is generally less than ) and restart the jobs that have not returned by this point.
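The difference between the two mitigation schemes can be sketched with a toy timing model. Everything here is an assumption made for illustration: the function names, the `wait_frac` and `needed` parameters, and the rule that a relaunched job finishes at the earlier of its original run and its restarted copy. It is not a model of AWS Lambda itself.

```python
def speculative_time(times, rerun_times, wait_frac):
    """Completion time when late jobs are relaunched once wait_frac of workers return."""
    n = len(times)
    k = max(1, int(wait_frac * n))
    order = sorted(range(n), key=lambda i: times[i])
    cutoff = times[order[k - 1]]  # moment the k-th fastest worker returns
    finish = cutoff
    for i in order[k:]:
        # Assumed rule: a late job ends when either the original or its relaunch ends.
        finish = max(finish, min(times[i], cutoff + rerun_times[i]))
    return finish


def coded_time(times, needed):
    """Completion time when any `needed` workers suffice, so stragglers are ignored."""
    return sorted(times)[needed - 1]
```

With one slow worker, for example `times = [1, 1, 1, 10]`, the coded variant that needs any three results finishes as soon as the third worker returns, while speculative execution additionally pays for the relaunched job.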

For both exact Hessian and OverSketched Newton, using codes for distributed gradient computation outperforms straggler mitigation based on speculative execution. Moreover, computing the Hessian using OverSketch is significantly better than exact computation in terms of running time, since calculating the Hessian is the computational bottleneck in each iteration.

### 4.5 Comparison with Gradient Descent on AWS Lambda

In Figure 11, we compare gradient descent with OverSketched Newton for logistic regression on the EPSILON dataset. The statistics for OverSketched Newton were obtained as described in the previous section. We observed that for first-order methods, there is only a slight difference in convergence for a mini-batch gradient when the batch size is . Hence, for gradient descent, we use 100 workers in each iteration while ignoring the stragglers. The step size was chosen using backtracking line search, which determines the maximum amount to move along a given descent direction. As can be noted, OverSketched Newton significantly outperforms gradient descent.
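For reference, backtracking line search can be sketched as below. This is a generic Armijo-style version, not necessarily the exact variant used in the experiments; the function name and the default values of `alpha`, `shrink`, `c`, and `max_iter` are assumptions chosen for illustration.

```python
def backtracking_step(f, grad, w, direction, alpha=1.0, shrink=0.5, c=1e-4, max_iter=50):
    """Shrink the step size until the sufficient-decrease (Armijo) condition holds."""
    fw = f(w)
    # Directional derivative of f at w along `direction` (negative for a descent direction).
    slope = sum(g * d for g, d in zip(grad(w), direction))
    for _ in range(max_iter):
        trial = [wi + alpha * di for wi, di in zip(w, direction)]
        if f(trial) <= fw + c * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha  # fall back to the last (smallest) step tried
```

For the quadratic `f(w) = sum(v * v for v in w)` at `w = [1.0]` with the negative gradient as the direction, the unit step overshoots the minimizer, so the search halves the step once before accepting it.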

### 4.6 Comparison with Server-based Optimization

In Fig. 11, we compare OverSketched Newton on AWS Lambda with the existing distributed optimization algorithm GIANT in server-based systems (AWS EC2 in our case). The results are plotted on synthetically generated data for logistic regression. For server-based programming, we use the Message Passing Interface (MPI) with one c3.8xlarge master and t2.medium workers in AWS EC2. In [10], the authors observed that many large-scale linear algebra operations on serverless systems take at least more time compared to MPI-based computation on server-based systems. However, as shown in Fig. 11, we observe that OverSketched Newton outperforms MPI-based optimization that uses an existing state-of-the-art optimization algorithm. This is because OverSketched Newton exploits the flexibility and massive scale at its disposal in serverless systems, and thus produces a better approximation of the second-order update than GIANT.⁵

⁵ We do not compare with exact Newton in server-based systems since the training data is large and stored in the cloud. Thus, computing the exact Hessian would require a large number of workers (e.g., we use 10,000 workers for exact Newton on the EPSILON dataset), which is infeasible in server-based systems as it incurs a heavy cost.

## 5 Proofs

### 5.1 Proof of Theorem 2.1

As $w_{t+1}$ is the optimal solution of the RHS in (8), we have, for any $w$ in the feasible set of (8),

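$$
\begin{aligned}
& f(w_t) + \nabla f(w_t)^T (w - w_t) + \frac{1}{2}(w - w_t)^T \hat{H}_t (w - w_t) \\
&\quad \ge f(w_t) + \nabla f(w_t)^T (w_{t+1} - w_t) + \frac{1}{2}(w_{t+1} - w_t)^T \hat{H}_t (w_{t+1} - w_t), \\
\Rightarrow\; & \nabla f(w_t)^T (w - w_{t+1}) + \frac{1}{2}(w - w_t)^T \hat{H}_t (w - w_t) - \frac{1}{2}(w_{t+1} - w_t)^T \hat{H}_t (w_{t+1} - w_t) \ge 0, \\
\Rightarrow\; & \nabla f(w_t)^T (w - w_{t+1}) + \frac{1}{2}\left[(w - w_t)^T \hat{H}_t (w - w_{t+1}) + (w - w_{t+1})^T \hat{H}_t (w_{t+1} - w_t)\right] \ge 0.
\end{aligned}
$$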

Substituting $w$ by $w^*$ in the above expression and calling $\Delta_t = w_t - w^*$, we get

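$$
\begin{aligned}
& -\nabla f(w_t)^T \Delta_{t+1} + \frac{1}{2}\left[\Delta_{t+1}^T \hat{H}_t (2\Delta_t - \Delta_{t+1})\right] \ge 0, \\
\Rightarrow\; & \Delta_{t+1}^T \hat{H}_t \Delta_t - \nabla f(w_t)^T \Delta_{t+1} \ge \frac{1}{2}\Delta_{t+1}^T \hat{H}_t \Delta_{t+1}.
\end{aligned}
$$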

Now, due to the optimality of $w^*$, we have $\nabla f(w^*)^T \Delta_{t+1} \ge 0$. Hence, we can write

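$$
\Delta_{t+1}^T \hat{H}_t \Delta_t - \left(\nabla f(w_t) - \nabla f(w^*)\right)^T \Delta_{t+1} \ge \frac{1}{2}\Delta_{t+1}^T \hat{H}_t \Delta_{t+1}.
$$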

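Next, substituting $\nabla f(w_t) - \nabla f(w^*) = \left(\int_0^1 \nabla^2 f(w^* + p(w_t - w^*))\,dp\right)(w_t - w^*)$ in the above inequality, we get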

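$$
\Delta_{t+1}^T \left(\hat{H}_t - \nabla^2 f(w_t)\right) \Delta_t + \Delta_{t+1}^T \left(\nabla^2 f(w_t) - \int_0^1 \nabla^2 f(w^* + p(w_t - w^*))\,dp\right) \Delta_t \ge \frac{1}{2}\Delta_{t+1}^T \hat{H}_t \Delta_{t+1}.
$$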

Using the Cauchy-Schwarz inequality on the LHS above, we get

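$$
\|\Delta_{t+1}\|_2 \|\Delta_t\|_2 \left(\|\hat{H}_t - \nabla^2 f(w_t)\|_{op} + \int_0^1 \|\nabla^2 f(w_t) - \nabla^2 f(w^* + p(w_t - w^*))\|_{op}\,dp\right) \ge \frac{1}{2}\Delta_{t+1}^T \hat{H}_t \Delta_{t+1}.
$$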

Now, using the $L$-Lipschitz property of $\nabla^2 f(w)$ in (2) in the inequality above, we get

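$$
\begin{aligned}
\frac{1}{2}\Delta_{t+1}^T \hat{H}_t \Delta_{t+1} &\le \|\Delta_{t+1}\|_2 \|\Delta_t\|_2 \|\hat{H}_t - \nabla^2 f(w_t)\|_{op} + L \|\Delta_{t+1}\|_2 \|\Delta_t\|_2^2 \int_0^1 (1 - p)\,dp, \\
\Rightarrow\; \frac{1}{2}\Delta_{t+1}^T \hat{H}_t \Delta_{t+1} &\le \|\Delta_{t+1}\|_2 \left(\|\Delta_t\|_2 \|\hat{H}_t - \nabla^2 f(w_t)\|_{op} + \frac{L}{2}\|\Delta_t\|_2^2\right).
\end{aligned}
\tag{18}
$$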
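The factor $L/2$ in (18) can be traced as follows, assuming (2) is the usual operator-norm Lipschitz condition on the Hessian with constant $L$. For every $p \in [0, 1]$,

$$
\|\nabla^2 f(w_t) - \nabla^2 f(w^* + p(w_t - w^*))\|_{op} \le L \|(1 - p)(w_t - w^*)\|_2 = L (1 - p) \|\Delta_t\|_2,
\qquad
\int_0^1 (1 - p)\,dp = \frac{1}{2},
$$

so the integral term contributes at most $\frac{L}{2}\|\Delta_t\|_2$, which is then multiplied by $\|\Delta_{t+1}\|_2 \|\Delta_t\|_2$.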

To complete the proof, we will need the following two lemmas. The first lemma gives an upper bound on the first term of the RHS of (18).

###### Lemma 5.1.

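Let $\hat{H}_t = A_t^T S_t S_t^T A_t$, where $S_t$ is the sparse sketch matrix in (7) with sketch dimension and . Then, the following holds

$$
\|\hat{H}_t - \nabla^2 f(w_t)\|_{op} \le \epsilon \|\nabla^2 f(w_t)\|_{op} \tag{19}
$$

with probability at least .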

###### Proof.

Note that for the positive definite matrix , we have . Moreover,

$$
\|\hat{H}_t - \nabla^2 f(w_t)\|_{op} = \|A_t^T (S_t S_t^T - I) A_t\|_{op} \le \|A_t\|_{op}^2 \, \|S_t S_t^T - I\|_{op}.
$$

Next, we note that is the number of non-zero elements per row in the sketch in (7) after ignoring stragglers. Moreover, we use Theorem 8 in [40] to bound the singular values for the sparse sketch in (7) with sketch dimension and . It says that , where and the constants in and depend on and . Thus, , which implies that

$$
\|S_t x\|_2^2 \in \left(1 + \epsilon^2/9 \pm 2\epsilon/3\right) \|x\|_2^2,
$$

with probability at least . For , this leads to the following inequality

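$$
\|S_t x\|_2^2 \in (1 \pm \epsilon) \|x\|_2^2 \;\Rightarrow\; \left|x^T (S_t S_t^T - I) x\right| \le \epsilon \|x\|_2^2 \quad \forall\, x \in \mathbb{R}^n.
$$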

This implies that $\|S_t S_t^T - I\|_{op} \le \epsilon$ with probability , which proves the desired result. ∎
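The norm-preservation property used in this proof can be checked empirically. The sketch below is an illustration under assumptions: it models the sketch as `num_blocks` independent Count-Sketch blocks scaled by one over the square root of the number of blocks, which may not match the exact form of (7), and the names `sketch_norm_sq` and `distortion` are hypothetical. It reports the largest observed relative deviation of the sketched squared norm from the true squared norm.

```python
import random


def sketch_norm_sq(x, block_size, num_blocks, rng):
    """Squared norm of S^T x for S = [S_1, ..., S_N] / sqrt(N), each S_i a Count-Sketch."""
    total = 0.0
    for _ in range(num_blocks):
        y = [0.0] * block_size
        for xi in x:
            # Each coordinate is hashed to one bucket with a random sign.
            y[rng.randrange(block_size)] += rng.choice((-1.0, 1.0)) * xi
        total += sum(v * v for v in y)
    return total / num_blocks  # the 1/sqrt(N) scaling squares to 1/N


def distortion(x, block_size, num_blocks, trials, seed=0):
    """Largest observed | ||S^T x||^2 / ||x||^2 - 1 | over independent sketches."""
    rng = random.Random(seed)
    norm_sq = sum(v * v for v in x)
    worst = 0.0
    for _ in range(trials):
        ratio = sketch_norm_sq(x, block_size, num_blocks, rng) / norm_sq
        worst = max(worst, abs(ratio - 1.0))
    return worst
```

The returned value is an empirical stand-in for $\epsilon$ at a single vector `x`; the lemma itself is a uniform statement over all vectors, which this check does not establish.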

The next lemma provides a lower bound on the LHS of (18).

###### Lemma 5.2.

For the sketch matrix in (7) with sketch dimension , the following holds

$$
x^T S_t S_t^T x \ge (1 - \epsilon) \|x\|_2^2 \quad \forall\, x \in \mathbb{R}^n, \tag{20}
$$

with probability at least .

###### Proof.

The result follows directly by substituting , and in Theorem 4.1 of [11]. ∎

Using Lemma 5.1 to bound the RHS of (18), we have, with probability at least ,

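$$
\frac{1}{2}\Delta_{t+1}^T \hat{H}_t \Delta_{t+1} \le \|\Delta_{t+1}\|_2 \left(\epsilon \|\nabla^2 f(w_t)\|_{op} \|\Delta_t\|_2 + \frac{L}{2}\|\Delta_t\|_2^2\right).
$$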

Since and sketch dimension , using Lemma 5.2 in the above inequality, we get, with probability at least ,

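$$
\begin{aligned}
\frac{1}{2}(1 - \epsilon) \|A_t \Delta_{t+1}\|_2^2 &\le \|\Delta_{t+1}\|_2 \left(\epsilon \|\nabla^2 f(w_t)\|_{op} \|\Delta_t\|_2 + \frac{L}{2}\|\Delta_t\|_2^2\right), \\
\Rightarrow\; \frac{1}{2}(1 - \epsilon) \Delta_{t+1}^T \nabla^2 f(w_t) \Delta_{t+1} &\le \|\Delta_{t+1}\|_2 \left(\epsilon \|\nabla^2 f(w_t)\|_{op} \|\Delta_t\|_2 + \frac{L}{2}\|\Delta_t\|_2^2\right).
\end{aligned}
$$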

Now, since $\gamma$ and $\beta$ are the minimum and maximum eigenvalues of $\nabla^2 f(w^*)$, we get

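$$
\frac{1}{2}(1 - \epsilon) \|\Delta_{t+1}\|_2 \left(\gamma - L\|\Delta_t\|_2\right) \le \epsilon \left(\beta + L\|\Delta_t\|_2\right) \|\Delta_t\|_2 + \frac{L}{2}\|\Delta_t\|_2^2
$$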

by the Lipschitzness of $\nabla^2 f(w)$, that is, $\|\nabla^2 f(w_t) - \nabla^2 f(w^*)\|_{op} \le L\|\Delta_t\|_2$. Rearranging for $\|\Delta_{t+1}\|_2$, we get

 ||Δt+1||2≤