OverSketched Newton: Fast Convex Optimization for Serverless Systems

by   Vipul Gupta, et al.

Motivated by recent developments in serverless systems for large-scale machine learning as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale smooth and strongly-convex problems in serverless systems. OverSketched Newton leverages matrix sketching ideas from Randomized Numerical Linear Algebra to compute the Hessian approximately. These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures. We establish that OverSketched Newton has a linear-quadratic convergence rate, and we empirically validate our results by solving large-scale supervised learning problems on real-world datasets. Experiments demonstrate a reduction of about 50% in total running time on AWS Lambda, compared to state-of-the-art distributed optimization schemes.








1 Introduction

Compared to first-order optimization algorithms, second-order methods—which use the gradient as well as Hessian information—enjoy superior convergence rates, both in theory and practice. For instance, Newton's method converges quadratically for strongly convex and smooth problems, compared to the linear convergence of gradient descent. Moreover, second-order methods do not require step-size tuning, and a unit step-size provably works for most problems. These methods, however, are not widely used among practitioners, partly because they are more sophisticated, and partly due to their higher computational complexity per iteration. In each iteration, they require calculating the Hessian and then solving a linear system involving the Hessian, and this can be prohibitive when the training data is large.
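As a toy illustration of this behavior (not from the paper), consider plain Newton iteration with unit step-size on the strongly convex one-dimensional function f(w) = e^w + e^{-w}, whose minimizer is w* = 0; the error collapses quadratically in a handful of iterations:

```python
import math

def newton_1d(grad, hess, w, iters=5):
    """Plain Newton iteration with unit step-size (no tuning needed)."""
    for _ in range(iters):
        w = w - grad(w) / hess(w)
    return w

# f(w) = e^w + e^{-w} is smooth and strongly convex with minimizer w* = 0.
grad = lambda w: math.exp(w) - math.exp(-w)
hess = lambda w: math.exp(w) + math.exp(-w)

w = newton_1d(grad, hess, w=1.0)  # after 5 iterations the error is negligible
```

Starting from w = 1, the error roughly squares each step (0.24, 4e-3, 3e-8, ...), which is exactly the quadratic convergence discussed above.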

Recently, there has been a lot of interest in distributed second-order optimization methods for more traditional server-based systems [1, 2, 3, 4, 5, 6]. These methods can substantially decrease the number of iterations, and hence reduce communication, which is desirable in distributed computing environments, at the cost of more computation per iteration. However, these methods are approximate, as they do not calculate the exact Hessian, due to data being stored distributedly among workers. Instead, each worker calculates an approximation of the Hessian using the data that is stored locally. Such methods, while still better than gradient descent, forego the quadratic convergence property enjoyed by exact Newton’s method.

In recent years, there has been tremendous growth in users performing distributed computing operations on the cloud due to extensive and inexpensive commercial offerings like Amazon Web Services (AWS), Google Cloud, Microsoft Azure, etc. Serverless platforms—such as AWS Lambda, Cloud functions and Azure Functions—penetrate a large user base by provisioning and managing the servers on which the computation is performed. This abstracts away the need for maintaining servers since this is done by the cloud provider and is hidden from the user—hence the name serverless. Moreover, allocation of these servers is done expeditiously which provides greater elasticity and easy scalability. For example, up to ten thousand machines can be allocated on AWS Lambda in less than ten seconds [7, 8, 9]. Using serverless systems for large-scale computation has been the central theme of several recent works [10, 11, 12, 13].

Figure 1: Job times for 3000 AWS Lambda nodes, where the median job time is around 40 seconds, a small fraction of the nodes take around 100 seconds, and two nodes take as much as 375 seconds to complete the same job (figure borrowed from [11]).

Due to several crucial differences between HPC/server-based and serverless architectures, existing distributed algorithms cannot, in general, be extended to serverless computing. Unlike server-based computing, the number of inexpensive workers in serverless platforms is flexible, often scaling into the thousands [9]. This heavy gain in computation power, however, comes with the disadvantage that the commodity workers in the serverless architecture are ephemeral, have low memory, and do not communicate amongst themselves. (For example, serverless nodes in AWS Lambda, Google Cloud Functions and Microsoft Azure Functions have a maximum memory of 3 GB, 2 GB and 1.5 GB, respectively, and a maximum runtime of 900 seconds, 540 seconds and 300 seconds, respectively; these numbers may change over time.) The workers read/write data directly from/to a single data storage entity (for example, cloud storage like AWS S3).

Furthermore, unlike HPC/server-based systems, nodes in serverless systems suffer degradation due to what is known as system noise. This can be a result of limited availability of shared resources, hardware failure, network latency, etc. [14, 15]. This results in job time variability, and hence a subset of much slower nodes, often called stragglers. These stragglers significantly slow the overall computation, especially for large or iterative jobs. In [11], the authors plotted empirical statistics of job times for 3000 AWS Lambda workers, as shown in Fig. 1. Such experiments consistently demonstrate that a non-negligible fraction of workers take significantly longer than the median job time, severely degrading the overall efficiency of the system.

Moreover, each call to serverless computing platforms such as AWS Lambda requires invocation time (during which AWS assigns Lambda workers), setup time (where the required python packages are downloaded) and communication time (where the data is downloaded from the cloud). The ephemeral nature of the workers in serverless systems requires that new workers should be invoked every few iterations and data should be communicated to them. Due to the aforementioned issues, first-order methods tend to perform poorly on distributed serverless architectures. In fact, their slower convergence is made worse on serverless platforms due to persistent stragglers. The straggler effect incurs heavy slow down due to the accumulation of tail times as a result of a subset of slow workers occurring in each iteration.

In this paper, we argue that second-order methods are highly compatible with serverless systems that provide extensive computing power by invoking thousands of workers but are limited by the communication costs and hence the number of iterations [9, 10]. To address the challenges of ephemeral workers and stragglers in serverless systems, we propose and analyze a randomized and distributed second-order optimization algorithm, called OverSketched Newton. OverSketched Newton uses the technique of matrix sketching from Randomized Numerical Linear Algebra (RandNLA) [16, 17] to obtain a good approximation for the Hessian.

In particular, we use the sparse sketching scheme proposed by [11], which is based on the basic RandNLA primitive of approximate randomized matrix multiplication [18], for straggler-resilient Hessian calculation in serverless systems. Such randomization also provides inherent straggler resiliency while calculating the Hessian. For straggler mitigation during gradient calculation, we use the recently proposed technique of employing error-correcting codes to create redundant computation [19, 20]. We prove that OverSketched Newton has a linear-quadratic convergence when the Hessian is calculated using the sparse sketching scheme proposed in [11]. Moreover, we show that at least a linear convergence is assured when the optimization algorithm starts with any random point in the constraint set. Our experiments on AWS Lambda demonstrate empirically that OverSketched Newton is significantly faster than existing distributed optimization schemes and vanilla Newton’s method that calculates exact Hessian.

1.1 Related Work

Existing Straggler Mitigation Schemes: Strategies like speculative execution have been traditionally used to mitigate stragglers in popular distributed computing frameworks like Hadoop MapReduce [21] and Apache Spark [22]. Speculative execution works by detecting workers that are running slower than expected and then allocating their tasks to new workers without shutting down the original straggling task. The worker that finishes first communicates its results. This has several drawbacks. For example, constant monitoring of tasks is required, where the worker pauses its job and provides its running status. Additionally, it is possible that a worker will straggle only at the end of the task, say, while communicating the results. By the time the task is reallocated, the overall efficiency of the system would have suffered already.

Recently, many coding-theoretic ideas have been proposed to introduce redundancy into the distributed computation for straggler mitigation [19, 20, 23, 24, 25, 26, 27]

, many of them catering to distributed matrix-vector multiplication

[19, 20, 23]. In general, the idea of coded computation is to generate redundant copies of the result of distributed computation by encoding the input data using error-correcting-codes. These redundant copies can then be used to decode the output of the missing stragglers. We use tools from [20] to compute gradients in a distributed straggler-resilient manner using codes, and we compare the performance with speculative execution.

Approximate Newton Methods: Several works in the literature prove convergence guarantees for Newton’s method when the Hessian is computed approximately using ideas from RandNLA [28, 29, 30, 31, 32]. However, these algorithms are designed for a single machine. Our goal in this paper is to use ideas from RandNLA to design a distributed approximate Newton method for a serverless system that is resilient to stragglers.

Distributed Second-Order Methods: There has been a growing research interest in designing and analyzing distributed (synchronous) implementations of second-order methods [1, 2, 3, 4, 5, 6]. However, these implementations are tailored for server-based distributed systems. Our focus, on the other hand, is on serverless systems. Our motivation behind considering serverless systems stems from their usability benefits, cost efficiency, and extensive and inexpensive commercial offerings [9, 10]. We implement our algorithms using the recently developed serverless framework called PyWren [9]. While there are works that evaluate existing algorithms on serverless systems [12, 33], this is the first work that proposes a large-scale distributed optimization algorithm for serverless systems. We exploit the advantages offered by serverless systems while mitigating drawbacks such as stragglers and the additional overhead per invocation of workers.

2 OverSketched Newton

2.1 Problem Formulation and Newton’s Method

We are interested in solving a problem of the following form on serverless systems in a distributed and straggler-resilient manner:

min_{w ∈ C} f(w),    (1)

where f: R^d → R is a closed and twice-differentiable convex function bounded from below, and C ⊆ R^d is a given convex and closed set. We assume that the minimizer w* exists and is uniquely defined. Let β and γ denote, respectively, the minimum and maximum eigenvalues of the Hessian of f evaluated at the minimum, i.e., ∇²f(w*). In addition, we assume that the Hessian is Lipschitz continuous with modulus L, that is, for any Δ ∈ R^d,

‖∇²f(w + Δ) − ∇²f(w)‖₂ ≤ L ‖Δ‖₂,    (2)

where ‖·‖₂ is the operator norm.

In Newton's method, the update at the t-th iteration is obtained by minimizing the second-order Taylor expansion of the objective function at w_t, within the constraint set C, that is,

w_{t+1} = argmin_{w ∈ C} { f(w_t) + ∇f(w_t)ᵀ(w − w_t) + ½ (w − w_t)ᵀ ∇²f(w_t) (w − w_t) }.    (3)

For the unconstrained case, that is when C = R^d, Eq. (3) becomes

w_{t+1} = w_t − [∇²f(w_t)]⁻¹ ∇f(w_t).    (4)

Given a good initial point w_0 such that

‖w_0 − w*‖₂ ≤ β/(2L),    (5)

Newton's method satisfies the following update

‖w_{t+1} − w*‖₂ ≤ (L/β) ‖w_t − w*‖₂²,    (6)

implying quadratic convergence [34, 35].

In many applications like machine learning where the training data itself is noisy, using the exact Hessian is not necessary. Indeed, many results in the literature prove convergence guarantees for Newton’s method when the Hessian is computed approximately using ideas from RandNLA for a single machine [28, 29, 30, 31]. In particular, these methods perform a form of dimensionality reduction for the Hessian using random matrices, called sketching matrices. Many popular sketching schemes have been proposed in the literature, for example, sub-Gaussian, Hadamard, random row sampling, sparse Johnson-Lindenstrauss, etc. [36, 16].

Next, we present OverSketched Newton for solving problems of the form (1) distributedly in serverless systems using ideas from RandNLA.

2.2 OverSketched Newton

Input : Matrix A, vector x, and block size parameter b
Result: y = Ax, where y is the product of matrix A and vector x
1 Initialization: Divide A into row-blocks, each with b rows
2 Encoding: Generate the coded matrix, say A_c, in parallel using a 2D product code by arranging the row blocks of A in a 2D structure and adding blocks across rows and columns to generate parities; see Fig. 2 in [20] for an illustration
3 for i = 1 to (number of row-blocks of A_c) do
4       1. Worker W_i receives the i-th row-block of A_c, say A_c(i), and x from cloud storage
5       2. W_i computes y_i = A_c(i) x
6       3. Master receives y_i from worker W_i
7 end for
Decoding: Master checks if it has received results from enough workers to reconstruct y. Once it does, it decodes y from the available results using the peeling decoder
Algorithm 1 Straggler-resilient distributed computation of y = Ax using codes

OverSketched Newton computes the full gradient in each iteration by using tools from error-correcting codes [19, 20, 24]. The exact method of obtaining the full gradient in a distributed straggler-resilient way depends on the optimization problem at hand. Our key observation is that, for several commonly encountered optimization problems, gradient computation relies on matrix-vector multiplication (see Sec. 3.2 for examples). We leverage coded matrix multiplication technique from [20] to perform the large-scale matrix-vector multiplication in a distributed straggler-resilient manner. The main idea of coded matrix multiplication is explained in Fig. 3; detailed steps are provided in Algorithm 1.
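The encode/decode idea behind coded matrix-vector multiplication can be mimicked on a single machine (an illustrative sketch, not the authors' implementation): a parity chunk A1 + A2 lets the master recover the full product from any two of three worker results.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

# Encoding: split A into two row chunks and add a parity chunk.
A1, A2 = A[:3], A[3:]
chunks = [A1, A2, A1 + A2]          # chunk i would be assigned to worker i

results = [C @ x for C in chunks]   # in reality, computed by 3 serverless workers

# Decoding: suppose the worker holding A1 straggles and never returns.
b2, b3 = results[1], results[2]
b1 = b3 - b2                        # A1 x = (A1 + A2) x - A2 x
y = np.concatenate([b1, b2])        # reconstructed product A x
```

The same peeling idea scales to many chunks with a 2D product code, as in Algorithm 1.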

Figure 2: Coded matrix-vector multiplication: Matrix A is divided into two row chunks, A_1 and A_2. During encoding, the redundant chunk A_3 = A_1 + A_2 is created. Three workers obtain A_1, A_2 and A_3 from the cloud storage S3, respectively, multiply them by x, and write back the results to the cloud. The master can decode Ax from the results of any two workers, thus being resilient to one straggler.
Figure 3: OverSketch-based approximate Hessian computation: First, the matrix A—satisfying ∇²f(w) = AᵀA—is sketched in parallel using the sketch in (7). Then, each worker receives one block each of the sketched matrices, multiplies them, and communicates back its result for reduction. During reduction, stragglers can be ignored by virtue of "over" sketching. For example, here the desired sketch dimension is increased by one block size, providing resiliency against one straggler for each block of the approximate Hessian.

Similar to the gradient computation, the particular method of obtaining the Hessian in a distributed straggler-resilient way depends on the optimization problem at hand. For several commonly encountered optimization problems, Hessian computation involves matrix-matrix multiplication for a pair of large matrices (see Sec. 3.2 for several examples). For computing the large-scale matrix-matrix multiplication in parallel in serverless systems, we propose to use a straggler-resilient scheme called OverSketch from [11]. OverSketch does blocked partitioning of input matrices, and hence, it is more communication efficient than existing coding-based straggler mitigation schemes that do naïve row-column partition of input matrices [25, 26]. We note that it is well known in HPC that blocked partitioning of input matrices can lead to communication-efficient methods for distributed multiplication [11, 37, 38].

OverSketch uses a sparse sketching matrix based on the Count-Sketch [36]. It has computational efficiency and accuracy guarantees similar to those of the Count-Sketch, with two additional properties: it is amenable to distributed implementation, and it is resilient to stragglers. More specifically, the OverSketch matrix is given by [11]:

S = [S_1, S_2, …, S_N],    (7)

where the S_i, for all i ∈ {1, …, N}, are i.i.d. Count-Sketch matrices with sketch dimension b, and e is the maximum number of stragglers tolerated per N blocks. Note that N = m/b + e, where m is the required sketch dimension and e is the over-provisioning parameter that provides resiliency against e stragglers per N workers. We assume that e = ηN for some constant redundancy factor η < 1.

Each of the Count-Sketch matrices S_i is constructed (independently of the others) as follows. First, for every row j, j ∈ {1, …, n}, of S_i, independently choose a column h(j) ∈ {1, …, b}. Then, select a uniformly random element of {−1, +1}, denoted σ(j). Finally, set S_i(j, h(j)) = σ(j) and set all other entries of row j to zero. (See [36, 11] for details.) We can leverage the straggler resiliency of OverSketch to obtain the sketched Hessian in a distributed straggler-resilient manner. An illustration of OverSketch is provided in Fig. 3; see Algorithm 2 for details.
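The construction above can be sketched in a few lines (illustrative code, not the authors' implementation; the normalization constant is an assumption): each row of a Count-Sketch block has exactly one nonzero entry, equal to ±1, in a uniformly random column.

```python
import numpy as np

def count_sketch(n, b, rng):
    """Dense n x b Count-Sketch matrix: one random +/-1 per row."""
    S = np.zeros((n, b))
    cols = rng.integers(0, b, size=n)        # hash h(j) for each row j
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign sigma(j)
    S[np.arange(n), cols] = signs
    return S

def oversketch(n, m, b, e, rng):
    """OverSketch-style matrix: N = m/b + e i.i.d. Count-Sketch blocks."""
    N = m // b + e
    blocks = [count_sketch(n, b, rng) for _ in range(N)]
    # Scaling so that the surviving m/b blocks give an unbiased sketch
    # (normalization convention assumed here).
    return np.hstack(blocks) / np.sqrt(N - e)

rng = np.random.default_rng(1)
S = count_sketch(8, 3, rng)
```

In the distributed setting, each block S_i is applied by a different group of workers, which is what makes dropping up to e straggling blocks harmless.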

Input : Matrices A and B, required sketch dimension m, straggler tolerance e, block-size b. Define N = m/b + e
1 Sketching: Use the sketch S in Eq. (7) to obtain Ã = SᵀA and B̃ = SᵀB distributedly (see Algorithm 5 in [11] for details)
2 Block partitioning: Divide Ã and B̃ into matrices of b × b blocks
3 Computation phase: Each worker takes one block each of Ã and B̃ and multiplies them. For each block of the product ÃᵀB̃, this step invokes N workers, each computing one partial result
4 Termination: Stop the computation when any N − e out of the N workers return their results for each block of ÃᵀB̃
Reduction phase: Invoke workers to aggregate the results of the computation phase, where each worker calculates one block of the approximate product ÃᵀB̃
Algorithm 2 Approximate Hessian calculation on serverless systems using OverSketch

The model update for OverSketched Newton is given by

w_{t+1} = argmin_{w ∈ C} { f(w_t) + ∇f(w_t)ᵀ(w − w_t) + ½ (w − w_t)ᵀ Ĥ_t (w − w_t) },    (8)

where Ĥ_t = Aᵀ S_t S_tᵀ A, A is the square root of the Hessian ∇²f(w_t), and S_t is an independent realization of (7) at the t-th iteration.
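Concretely, given the square root A of the Hessian, the sketched Hessian is formed from the much smaller matrix SᵀA; a quick numerical check (illustrative NumPy, using a generic Gaussian sketch as a stand-in for the structured sketch in (7)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 500, 10, 100

A = rng.standard_normal((n, d))                 # square root: H = A^T A
S = rng.standard_normal((n, m)) / np.sqrt(m)    # generic sketch (stand-in for Eq. (7))

SA = S.T @ A                                    # m x d sketched square root
H_hat = SA.T @ SA                               # approximate Hessian A^T S S^T A

H = A.T @ A
rel_err = np.linalg.norm(H_hat - H, 2) / np.linalg.norm(H, 2)
```

The approximate Hessian is symmetric positive semi-definite by construction, and its spectral error shrinks as the sketch dimension m grows.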

2.3 Local and Global Convergence Guarantees

Next, we prove the following local convergence guarantee for OverSketched Newton, which uses the sketch matrix in (7) and the full gradient for approximate Hessian computation. Such a convergence is referred to as "local" in the literature [28, 30] since it assumes that the initial starting point satisfies (5). Let m̃ be the resultant sketch dimension obtained after ignoring the stragglers, that is, m̃ ≥ (N − e)b = m.

Theorem 2.1 (Local convergence).

Let w* be the optimal solution of (1), and let β and γ be the minimum and maximum eigenvalues of ∇²f(w*), respectively. Then, for a sketching accuracy ε > 0, using an OverSketch matrix with a sufficiently large sketch dimension m and number of column-blocks N, the updates for OverSketched Newton with initialization satisfying (5) follow, with high probability, a linear-quadratic recurrence of the form

‖w_{t+1} − w*‖₂ ≤ c₁ (L/β) ‖w_t − w*‖₂² + c₂ (εγ/β) ‖w_t − w*‖₂,

for absolute constants c₁ and c₂.

Proof. See Section 5.1. ∎

Theorem 2.1 implies that the convergence is linear-quadratic in the error Δ_t = ‖w_t − w*‖₂. Initially, when Δ_t is large, the first term of the RHS dominates and the convergence is quadratic, that is, Δ_{t+1} ≲ Δ_t². In later stages, when Δ_t becomes sufficiently small, the second term of the RHS starts to dominate and the convergence is linear, that is, Δ_{t+1} ≲ εΔ_t, where ε is the sketching accuracy. At this stage, the sketch dimension can be increased to reduce ε, diminishing the effect of the linear term and improving the convergence rate in practice.

The above guarantee implies that when w_t is sufficiently close to w*, OverSketched Newton has a linear-quadratic convergence rate. However, in general, a good starting point may not be available. Thus, we prove the following "global" convergence guarantee, which shows that OverSketched Newton converges from any random initialization of w_0 with high probability. In the following, we focus our attention on the unconstrained case, i.e., C = R^d. We assume that f is smooth and strongly convex on C, that is,

kI ⪯ ∇²f(w) ⪯ KI for all w,    (9)

for some 0 < k ≤ K. In addition, we use line-search to choose the step-size, and the update is given by

w_{t+1} = w_t + α_t p_t, where p_t = −Ĥ_t⁻¹ ∇f(w_t),    (10)

and, for some constant c ∈ (0, 1/2], the step-size α_t is the largest α ≤ 1 satisfying the Armijo condition

f(w_t + α p_t) ≤ f(w_t) + c α ∇f(w_t)ᵀ p_t.    (2.3)

Recall that Ĥ_t = Aᵀ S_t S_tᵀ A, where A is the square root of the Hessian ∇²f(w_t), and S_t is an independent realization of (7) at the t-th iteration.

Note that (2.3) can be solved approximately in single machine systems using Armijo backtracking line search [35]. In Section 2.4, we describe how to implement distributed line-search in serverless systems when the data is stored in cloud.

Theorem 2.2 (Global convergence).

Let w* be the optimal solution of (1), where f satisfies the strong-convexity and smoothness constraints in (9). Then, using an OverSketch matrix with a sufficiently large sketch dimension m and number of column-blocks N, the updates for OverSketched Newton, for any initialization w_0, satisfy, with high probability,

f(w_{t+1}) − f(w*) ≤ (1 − ρ)(f(w_t) − f(w*)),

where ρ ∈ (0, 1) is a constant that depends on k, K and the sketching accuracy. Moreover, the step-size α_t obtained from the line-search in (2.3) is bounded away from zero.


Proof. See Section 5.2. ∎

Theorem 2.2 guarantees the global convergence of OverSketched Newton, starting with any initial estimate w_0, to the optimal solution with at least a linear rate.

2.4 Distributed Line Search

In our experiments in Section 4, line-search was not required for the three synthetic and real datasets where OverSketched Newton was employed. However, that might not be true in general, as the global guarantee in Theorem 2.2 depends on line-search. In general, line-search may be required until the current model estimate is close to w*, that is, satisfies the condition in (5). Here, we describe a line-search procedure for distributed serverless optimization, which is inspired by the line-search method from [5] for server-based systems. (Note that codes can be used to mitigate stragglers during distributed line-search in a manner similar to the gradient computation phase.)

To solve for the step-size described in the optimization problem in (2.3), we choose a candidate set of step-sizes. After the master calculates the descent direction in the t-th iteration according to (10), the i-th worker calculates its local contribution to the objective value for every step-size in the candidate set, where this contribution depends on the data available locally at the i-th worker. The master then sums the results from the workers to obtain the objective value for each candidate step-size and finds the largest one that satisfies the Armijo condition in (2.3).

Note that line search requires an additional round of communication where the master communicates the descent direction to the workers through the cloud. The workers then calculate the required function values using local data and send the results back to the master. Finally, the master finds the right step-size and updates the model estimate.
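On a single machine, the Armijo backtracking variant referenced above looks like the following (a standard textbook sketch, not the distributed implementation described in this section):

```python
import numpy as np

def armijo_backtrack(f, grad, w, p, alpha0=1.0, beta=0.5, c=1e-4):
    """Return the largest alpha in {alpha0, alpha0*beta, ...} satisfying
    the Armijo condition f(w + alpha p) <= f(w) + c alpha grad(w)^T p."""
    alpha = alpha0
    g = grad(w)
    while f(w + alpha * p) > f(w) + c * alpha * (g @ p):
        alpha *= beta
    return alpha

# Quadratic test problem f(w) = 0.5 ||w||^2 with descent direction p = -grad(w).
f = lambda w: 0.5 * w @ w
grad = lambda w: w
w = np.array([3.0, -4.0])
p = -grad(w)
alpha = armijo_backtrack(f, grad, w, p)
```

In the distributed version, the candidate step-sizes are evaluated in one shot by the workers rather than sequentially at the master.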

3 OverSketched Newton on Serverless Systems

3.1 Logistic Regression using OverSketched Newton on Serverless Systems

The optimization problem for supervised learning using logistic regression takes the form

min_w (1/n) Σ_{i=1}^n log(1 + exp(−y_i wᵀx_i)) + (λ/2)‖w‖₂².    (12)

Here, x_i ∈ R^d and y_i ∈ {−1, +1} are the training sample vectors and labels, respectively. The goal is to learn the feature vector w. Let X = [x_1, x_2, …, x_n]ᵀ and y = [y_1, y_2, …, y_n]ᵀ be the feature and label matrices, respectively. Using Newton's method to solve (12) first requires evaluation of the gradient

∇f(w) = Xᵀα + λw, where α_i = −y_i / (n(1 + exp(y_i (Xw)_i))).

Calculation of ∇f(w) involves two matrix-vector products, Xw and Xᵀα. These matrix-vector products are performed distributedly. Faster convergence can be obtained by second-order methods, which additionally compute the Hessian ∇²f(w) = XᵀΛX + λI, where Λ is a diagonal matrix with entries Λ_ii = exp(y_i(Xw)_i) / (n(1 + exp(y_i(Xw)_i))²). The product XᵀΛX is computed approximately in a distributed straggler-resilient manner using (7).
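The two matrix-vector products and the diagonal-weighted Hessian described above can be written out directly (illustrative NumPy on a single machine; the exact 1/n and λ scaling conventions are assumptions consistent with the formulation here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_hess(X, y, w, lam):
    """Gradient and Hessian of the regularized logistic loss; y in {-1, +1}."""
    n = X.shape[0]
    u = X @ w                                 # first matrix-vector product
    alpha = -y * sigmoid(-y * u) / n
    g = X.T @ alpha + lam * w                 # second matrix-vector product
    d = sigmoid(u) * (1 - sigmoid(u)) / n     # diagonal entries of Lambda
    H = X.T @ (d[:, None] * X) + lam * np.eye(X.shape[1])
    return g, H

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 5))
y = np.sign(rng.standard_normal(50))
w = rng.standard_normal(5)
g, H = logistic_grad_hess(X, y, w, lam=0.1)
```

In OverSketched Newton, the two products involving X are the distributed coded matrix-vector multiplications, and X.T @ (d[:, None] * X) is the large matrix-matrix product that gets sketched.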

1 Input (stored in cloud storage): Example matrix X and label vector y, regularization parameter λ, number of iterations T, sketch S as defined in Eq. (7)
2 Initialization: Define w_0; encode X and Xᵀ as described in Algorithm 1
3 for t = 0 to T − 1 do
4       Compute u = X w_t in parallel using Algorithm 1
5       for i = 1 to n do
6             α_i = −y_i / (n(1 + exp(y_i u_i)))
7       end for
8       Compute g = Xᵀα + λ w_t in parallel using Algorithm 1
9       for i = 1 to n do
10            Λ_ii = exp(y_i u_i) / (n(1 + exp(y_i u_i))²)
11      end for
12      Form the diagonal matrix Λ = diag(Λ_11, …, Λ_nn)
13      A = Λ^{1/2} X
14      Compute Ĥ = AᵀSSᵀA + λI in parallel using Algorithm 2
15      w_{t+1} = w_t − Ĥ⁻¹ g
16 end for
Algorithm 3 OverSketched Newton: Logistic Regression for Serverless Computing

We provide a detailed description of OverSketched Newton for large-scale logistic regression on serverless systems in Algorithm 3. Here, steps 4, 8, and 14 are computed in parallel on AWS Lambda. All other steps are simple vector operations that can be performed locally at the master, for instance, the user's laptop. We assume that the number of features is small enough that the Hessian matrix H fits locally at the master and the update can be computed efficiently. Steps 4 and 8 are executed in a straggler-resilient fashion using the coding scheme in [20], as illustrated in Fig. 2 and described in detail in Algorithm 1.

We use the coding scheme from [20] since the encoding can be implemented in parallel and requires less communication per worker compared to other schemes, for example the schemes in [19, 26] that use Maximum Distance Separable (MDS) codes. Moreover, the decoding scheme takes linear time and is applicable to real-valued matrices. Note that since the example matrix X is constant throughout the optimization, the encoding of X is done only once before starting the algorithm; thus, the encoding cost can be amortized over iterations. Moreover, decoding the resultant product vector requires negligible time and space, even when the number of examples scales into the millions.

The same is, however, not true for the matrix multiplication for Hessian calculation (step 14 of Algorithm 3), as the matrix being multiplied changes in each iteration; thus, encoding costs would be incurred in every iteration if error-correcting codes were used. Moreover, encoding and decoding a huge matrix stored in the cloud incurs heavy communication cost and becomes prohibitive. Motivated by this, we use OverSketch in step 14, as described in Algorithm 2, to calculate an approximate matrix multiplication, and hence the Hessian, efficiently in serverless systems with inbuilt straggler resiliency. (We also evaluate the exact Hessian-based algorithm with speculative execution, i.e., recomputing the straggling jobs, and compare it with OverSketched Newton in Sec. 4.)

3.2 Example Problems

In this section, we describe several commonly encountered optimization problems (besides logistic regression) that can be solved using OverSketched Newton.

Ridge Regularized Linear Regression: The optimization problem is

min_w f(w) = (1/2n)‖Xw − y‖₂² + (λ/2)‖w‖₂²,    (13)

where the training matrix X and label vector y were defined previously. The gradient in this case can be written as ∇f(w) = (1/n)Xᵀ(Xw − y) + λw. The Hessian is given by ∇²f(w) = (1/n)XᵀX + λI. When the number of examples greatly exceeds the number of features, this can be computed approximately using the sketch matrix in (7).
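Since the ridge objective is quadratic, the Hessian is constant and a single exact Newton step from any starting point lands on the closed-form solution; a quick sanity check of the gradient and Hessian expressions above (illustrative code, with the 1/n scaling an assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 100, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

H = X.T @ X / n + lam * np.eye(d)          # constant Hessian
w = rng.standard_normal(d)                  # arbitrary starting point
g = X.T @ (X @ w - y) / n + lam * w         # gradient at w
w_newton = w - np.linalg.solve(H, g)        # one exact Newton step

w_star = np.linalg.solve(H, X.T @ y / n)    # closed-form ridge solution
```

With a sketched Hessian the single step is only approximate, which is exactly where the linear-quadratic analysis of Section 2.3 applies.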

Linear programming via interior point methods: The following linear program can be solved using OverSketched Newton

min_{w : Aw ≤ b} cᵀw,    (14)

where A is the constraint matrix whose number of rows greatly exceeds its number of columns. In algorithms based on interior point methods, the above problem is solved by minimizing the following sequence of problems using Newton's method:

min_w f(w) = λ cᵀw − Σ_i log(b_i − a_iᵀw),    (15)

where a_iᵀ is the i-th row of A. The parameter λ is increased geometrically such that, when λ is very large, the logarithmic term does not affect the objective value and serves its purpose of keeping all intermediate solutions inside the constraint region. The update in the t-th iteration is given by w_{t+1} = w_t − [∇²f(w_t)]⁻¹∇f(w_t), where w_t is the estimate of the solution in the t-th iteration. The gradient can be written as ∇f(w) = λc + Σ_i a_i/(b_i − a_iᵀw).

The Hessian for the objective in (15) is given by

∇²f(w) = AᵀD²A, where D = diag(1/(b_1 − a_1ᵀw), …, 1/(b_m − a_mᵀw)).    (16)

The square root of the Hessian is given by DA. The computation of the Hessian is the bottleneck in each iteration. Thus, we can use sketching to mitigate stragglers while evaluating the Hessian efficiently, i.e., ∇²f(w) ≈ (SᵀDA)ᵀ(SᵀDA), where S is the OverSketch matrix defined in (7).
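The structure ∇²f(w) = AᵀD²A means DA serves as the square root that gets sketched; a small consistency check of this factorization (illustrative code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
m_cons, d = 20, 4
A = rng.standard_normal((m_cons, d))
w = np.zeros(d)
b = np.abs(rng.standard_normal(m_cons)) + 1.0  # ensures A w < b at w = 0

slack = b - A @ w                               # b_i - a_i^T w, all positive
D = np.diag(1.0 / slack)
H = A.T @ D @ D @ A                             # barrier Hessian A^T D^2 A
sqrtH = D @ A                                   # square root to be sketched
```

Only the tall-and-skinny matrix DA needs to be sketched and multiplied distributedly, which is what makes the interior-point iterations amenable to OverSketch.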

Lasso Regularized Linear Regression: The optimization problem takes the following form

min_x (1/2)‖Ax − y‖₂² + λ‖x‖₁,    (17)

where A is the measurement matrix, the vector y contains the measurements, and λ > 0 is the regularization parameter. To solve (17), we consider its dual variation

min_z (1/2)‖z − y‖₂² subject to ‖Aᵀz‖_∞ ≤ λ,

which is amenable to interior point methods and can be solved by optimizing the following sequence of problems, where μ is increased geometrically:

min_z (μ/2)‖z − y‖₂² − Σ_j [log(λ − a_jᵀz) + log(λ + a_jᵀz)],

where a_j is the j-th column of A. The gradient can be expressed in a few matrix-vector multiplications as ∇f(z) = μ(z − y) + A(d₁ − d₂), where d₁ and d₂ have entries d₁,j = 1/(λ − a_jᵀz) and d₂,j = 1/(λ + a_jᵀz). Similarly, the Hessian can be written as ∇²f(z) = μI + ADAᵀ, where D is a diagonal matrix whose entries are given by D_jj = 1/(λ − a_jᵀz)² + 1/(λ + a_jᵀz)².

Other common problems where OverSketched Newton is applicable include Linear Regression, Support Vector Machines (SVMs), Semidefinite programs, etc.

4 Experimental Results

In this section, we evaluate OverSketched Newton on AWS Lambda using real-world and synthetic datasets, and we compare it with state-of-the-art distributed optimization algorithms. (A working implementation of OverSketched Newton is available at https://github.com/vvipgupta/OverSketchedNewton.) We use the serverless computing framework PyWren, developed recently in [9]. Our experiments focus on logistic regression, a popular supervised learning problem, but they can be reproduced for the other problems described in Section 3.2.

4.1 Comparison with Existing Second-order Methods

Figure 4: GIANT: The two stage second order distributed optimization scheme with four workers. First, master calculates the full gradient by aggregating local gradients from workers. Second, the master calculates approximate Hessian using local second-order updates from workers.
(a) Simple gradient descent, where each worker stores one-fourth of the whole data and sends back a partial gradient corresponding to its own data to the master
(b) Gradient coding described in [24], tolerating one straggler; to recover the global gradient, the master computes an appropriate linear combination of the results received from the non-straggling workers
(c) Mini-batch gradient descent, where the stragglers are ignored during gradient aggregation and the gradient is later scaled according to the size of mini-batch
Figure 5: Different gradient descent schemes in server-based systems in presence of stragglers

For comparison of OverSketched Newton with existing distributed optimization schemes, we choose the recently proposed Globally Improved Approximate Newton Direction (GIANT) [5] as a representative algorithm. This is because GIANT boasts a better convergence rate than many existing distributed second-order methods for linear and logistic regression when the number of training samples is much larger than the number of features. In GIANT, and other similar distributed second-order algorithms, the training data is evenly divided among workers, and the algorithms proceed in two stages. First, the workers compute partial gradients using local training data, which are then aggregated by the master to compute the exact gradient. Second, the workers receive the full gradient to calculate their local second-order estimates, which are then averaged by the master. An illustration is shown in Fig. 4.

Figure 6: Convergence comparison of GIANT (employed with different straggler mitigation methods), exact Newton’s method and OverSketched Newton for Logistic regression on AWS Lambda. The synthetic dataset considered has 300,000 examples and 3000 features.

For straggler mitigation in such server-based algorithms, [24] proposes a scheme for coding gradient updates called gradient coding, where the data at each worker is repeated multiple times to compute redundant copies of the gradient; see Figure 5(b) for an illustration. Figure 5(a) illustrates the scheme that waits for all workers, and Figure 5(c) illustrates the approach that ignores stragglers. We use the three schemes for dealing with stragglers illustrated in Figure 5 during the two stages of GIANT, and we compare their convergence with OverSketched Newton. We also evaluate the exact Newton's method with speculative execution for straggler mitigation, that is, reassigning and recomputing the work of straggling workers, and compare its convergence rate with OverSketched Newton.
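The gradient coding idea from [24] can be illustrated with the classic three-worker, one-straggler example (hypothetical code; the coefficients follow the construction in that paper): each worker holds two of three data partitions and sends a fixed linear combination of its two local partial gradients, and the master recovers the full gradient from any two workers.

```python
import numpy as np

rng = np.random.default_rng(6)
g1, g2, g3 = (rng.standard_normal(4) for _ in range(3))  # partial gradients

# Each worker sends a fixed linear combination of its two local gradients.
r1 = 0.5 * g1 + g2    # worker 1 holds partitions {1, 2}
r2 = g2 - g3          # worker 2 holds partitions {2, 3}
r3 = 0.5 * g1 + g3    # worker 3 holds partitions {1, 3}

full = g1 + g2 + g3
# Decoding from any two workers (the third may straggle):
from_12 = 2 * r1 - r2
from_13 = r1 + r3
from_23 = r2 + 2 * r3
```

The price of this resiliency is that every data partition is stored and processed twice, which is the extra communication cost discussed in Section 4.2.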

Remark 1.

We note that conventional distributed second-order methods for server-based systems, which distribute training examples evenly across workers (such as [1, 2, 3, 4, 5]), typically find a "local approximation" of the second-order update at each worker and then aggregate these approximations. OverSketched Newton, on the other hand, utilizes the massive storage and compute power of serverless systems to find a "global approximation". As we demonstrate next through extensive experiments, it performs significantly better than existing server-based methods.

4.2 Experiments on Synthetic Data

In Figure 6, we present our experimental results for logistic regression on AWS Lambda on a randomly generated dataset with 300,000 examples and 3000 features. Each column of the data matrix is sampled uniformly at random from a cube. The labels are sampled from the logistic model, where the weight vector and the bias are generated randomly from the normal distribution.
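Synthetic data of this kind can be generated along the following lines. This is a scaled-down sketch: the dimensions and the cube's range used here are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3000, 30  # scaled-down stand-ins for the paper's 300,000 x 3,000 dataset

# Features: each entry drawn uniformly from a cube (range assumed to be [-1, 1]).
X = rng.uniform(-1.0, 1.0, size=(n, d))

# Weight vector and bias drawn from the standard normal distribution.
w_true = rng.standard_normal(d)
b_true = rng.standard_normal()

# Labels sampled from the logistic model P(y = 1 | x) = sigmoid(w'x + b).
p = 1.0 / (1.0 + np.exp(-(X @ w_true + b_true)))
y = (rng.uniform(size=n) < p).astype(int)
```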

The orange, blue, and red curves show the convergence of GIANT with the full-gradient scheme (which waits for all the workers), the gradient coding scheme, and the mini-batch scheme (which ignores the stragglers while calculating gradient and second-order updates), respectively. The purple and green curves depict the convergence of the exact Newton's method and OverSketched Newton, respectively. The gradient coding scheme is applied for one straggler, that is, the data is repeated twice at each worker. We use 60 Lambda workers for executing GIANT in parallel. Similarly, for Newton's method, we use 60 workers for the matrix-vector multiplications in steps 4 and 8 of Algorithm 3, workers for exact Hessian computation, and workers for sketched Hessian computation with a sketch dimension of in step 14 of Algorithm 3.

An important point to note from Fig. 6 is that the uncoded scheme (that is, the one that waits for all stragglers) has the worst performance, which implies that good straggler/fault mitigation algorithms are essential for computing in the serverless setting. Secondly, the mini-batch scheme outperforms the gradient coding scheme by . This is because gradient coding requires additional communication of data to the serverless workers (twice as much when coding for one straggler; see [24] for details) at each invocation of AWS Lambda. Finally, the exact Newton's method converges much faster than GIANT, even though it requires more time per iteration.

The green plot shows the convergence of OverSketched Newton. Note that OverSketched Newton and the exact Newton's method (which computes the Hessian exactly) require a similar number of iterations to converge, but OverSketched Newton converges in almost half the time due to significant savings in computing the Hessian, which is the computational bottleneck in each iteration.

(a) Training error for logistic regression on the EPSILON dataset
(b) Testing error for logistic regression on the EPSILON dataset
Figure 7: Comparison of training and testing errors for several Newton-based schemes with straggler mitigation. Testing error closely follows training error.
(a) Training error for logistic regression on the web page dataset
(b) Testing error for logistic regression on the web page dataset
Figure 8: Comparison of training and testing errors for several Newton-based schemes with straggler mitigation on the web page dataset. Testing error closely follows training error.

4.3 Experiments on Real Data

In Figure 7, we use the EPSILON classification dataset obtained from [39], with and . We plot training and testing errors for logistic regression for the schemes described in the previous section. Here, we use workers for GIANT and workers for the matrix-vector multiplications in the gradient calculation of OverSketched Newton. For GIANT, we use gradient coding designed for three stragglers. This scheme performs worse than uncoded GIANT, which waits for all the stragglers, due to the repetition of training data at the workers. Hence, one can conclude that the communication costs dominate the straggling costs. In fact, the mini-batch scheme that ignores the stragglers outperforms both the gradient coding and uncoded schemes for GIANT.

During exact Hessian computation, we use serverless workers with speculative execution (i.e., recomputing the straggling jobs) to mitigate stragglers, compared to workers with a sketch dimension of for OverSketched Newton. OverSketched Newton requires a significantly smaller number of workers: once the square root of the Hessian is sketched in a distributed fashion, the result can be copied into the local memory of the master due to the dimension reduction, and the Hessian can then be calculated locally. Testing error closely follows training error, and the main conclusions remain the same as in Figure 6. OverSketched Newton significantly outperforms GIANT and exact Newton-based optimization in terms of running time.
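The dimension-reduction step can be illustrated with a simple count-sketch of the Hessian square root. This is a single-machine sketch: the paper's sparse sketch in (7) additionally over-provisions rows for straggler resiliency, and the dimensions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 2000, 20, 1000          # m is the sketch dimension (illustrative)
A = rng.standard_normal((n, d))   # square root of the Hessian, so H = A^T A

# Count-sketch: each row of A is added, with a random sign, to one of m buckets.
bucket = rng.integers(0, m, size=n)
sign = rng.choice([-1.0, 1.0], size=n)
SA = np.zeros((m, d))
np.add.at(SA, bucket, sign[:, None] * A)

# The sketched Hessian is d x d and cheap to form once SA fits in memory.
H_sketch = SA.T @ SA
H_exact = A.T @ A
rel_err = np.linalg.norm(H_sketch - H_exact) / np.linalg.norm(H_exact)
```

In the serverless setting, the workers compute blocks of SA in parallel, so the master only ever handles the small m x d matrix rather than the full n x d square root.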

We repeated the above experiments for classification on the web page dataset [39] with and . We used 30 workers for each iteration of GIANT and for all matrix-vector multiplications. Exact Hessian calculation invokes workers, as opposed to workers for OverSketched Newton, where the sketch dimension was . The results are shown in Figure 8 and follow the trends observed above.

4.4 Coded computing versus Speculative Execution

(a) Comparison for Training error
(b) Comparison for Testing error
Figure 9: Comparison of speculative execution and coded computing schemes for computing the gradient and Hessian for the EPSILON dataset. OverSketched Newton, that is, coding the gradients and sketching the Hessian, outperforms all other schemes.

In Figure 9, we compare the effect of two straggler mitigation schemes, namely speculative execution (that is, restarting the jobs of straggling workers) and coded computing, on the convergence rate during training and testing. We regard OverSketch-based matrix multiplication as a coding scheme in which some redundancy is introduced during "over" sketching for matrix multiplication. There are four cases, corresponding to gradient and Hessian calculation using either speculative execution or coded computing. For speculative execution, we wait for at least of the workers to return (this works well as the number of stragglers is generally less than ) and restart the jobs that have not returned by this point.
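The speculative-execution policy can be sketched with a thread pool. This is a local simulation of the policy above; the waiting fraction and the simulated straggler delays are our own assumptions, not the paper's exact values.

```python
import concurrent.futures as cf
import random
import time

def run_with_speculation(task, args_list, wait_frac=0.9):
    """Launch all tasks; once wait_frac of them finish, relaunch the
    stragglers and take whichever copy of each job finishes first."""
    with cf.ThreadPoolExecutor(max_workers=len(args_list)) as pool:
        futures = {pool.submit(task, a): i for i, a in enumerate(args_list)}
        results = {}
        for fut in cf.as_completed(futures):
            results[futures[fut]] = fut.result()
            if len(results) >= wait_frac * len(args_list):
                break
        # Speculatively re-execute the jobs that have not returned yet.
        missing = [i for i in range(len(args_list)) if i not in results]
        backups = {pool.submit(task, args_list[i]): i for i in missing}
        combined = {**futures, **backups}
        for fut in cf.as_completed(combined):
            i = combined[fut]
            if i not in results:
                results[i] = fut.result()
            if len(results) == len(args_list):
                break
    return [results[i] for i in range(len(args_list))]

def slow_square(x):
    # Simulated worker: occasionally a straggler.
    time.sleep(random.choice([0.01, 0.01, 0.01, 0.3]))
    return x * x
```

Unlike coded computing, this policy pays the straggler penalty in latency (a second round of waiting) rather than in redundant up-front work.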

For both exact Hessian and OverSketched Newton, using codes for distributed gradient computation outperforms speculative execution based straggler mitigation. Moreover, computing the Hessian using OverSketch is significantly better than exact computation in terms of running time as calculating the Hessian is the computational bottleneck in each iteration.

4.5 Comparison with Gradient Descent on AWS Lambda

In Figure 10, we compare gradient descent with OverSketched Newton for logistic regression on the EPSILON dataset. The statistics for OverSketched Newton were obtained as described in the previous section. We observed that for first-order methods, there is only a slight difference in convergence for mini-batch gradients when the batch size is . Hence, for gradient descent, we use 100 workers in each iteration while ignoring the stragglers. The step-size was chosen using backtracking line search, which determines the maximum step to take along a given descent direction. As can be seen, OverSketched Newton significantly outperforms gradient descent.
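Backtracking line search with the standard Armijo condition can be sketched as follows. This is a generic implementation; the parameters alpha and beta are conventional defaults, not values reported in the paper.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha=0.3, beta=0.8):
    """Shrink the step size t until the Armijo sufficient-decrease condition
    f(x + t*d) <= f(x) + alpha * t * <grad f(x), d> holds."""
    t = 1.0
    fx, gx = f(x), grad_f(x)
    while f(x + t * direction) > fx + alpha * t * (gx @ direction):
        t *= beta
    return t

# Example: one steepest-descent step on f(x) = ||x||^2.
f = lambda v: v @ v
grad = lambda v: 2 * v
x = np.array([3.0, -4.0])
d = -grad(x)
t = backtracking_line_search(f, grad, x, d)
assert f(x + t * d) < f(x)
```

The same routine applies unchanged whether the direction is the negative gradient (as in the gradient descent baseline) or an approximate Newton direction.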

Figure 10: Convergence rate comparison of gradient descent and OverSketched Newton for the EPSILON dataset.
Figure 11: Convergence rate comparison of AWS EC2 and AWS Lambda that use GIANT and OverSketched Newton, respectively, to solve a large scale logistic regression problem.

4.6 Comparison with Server-based Optimization

In Fig. 11, we compare OverSketched Newton on AWS Lambda with GIANT, an existing distributed optimization algorithm for server-based systems (AWS EC2 in our case). The results are plotted on synthetically generated data for logistic regression. For server-based programming, we use the Message Passing Interface (MPI) with one c3.8xlarge master and t2.medium workers in AWS EC2. In [10], the authors observed that many large-scale linear algebra operations on serverless systems take at least more time compared to MPI-based computation on server-based systems. However, as shown in Fig. 11, OverSketched Newton outperforms the MPI-based optimizer that uses an existing state-of-the-art optimization algorithm. This is because OverSketched Newton exploits the flexibility and massive scale at its disposal in serverless systems, and thus produces a better approximation of the second-order update than GIANT. (We do not compare with the exact Newton's method in server-based systems since the training data is large and stored in the cloud; computing the exact Hessian would thus require a large number of workers, e.g., we use 10,000 workers for the exact Newton's method on the EPSILON dataset, which is infeasible in server-based systems as it incurs a heavy cost.)

5 Proofs

5.1 Proof of Theorem 2.1

As is the optimal solution of the RHS in (8), we have, for any in ,

Substituting by in the above expression and calling , we get

Now, due to the optimality of , we have . Hence, we can write

Next, substituting in the above inequality, we get

Using the Cauchy-Schwarz inequality on the LHS above, we get

Now, using the Lipschitz property of in (2) in the inequality above, we get


To complete the proof, we will need the following two lemmas. The first lemma defines an upper bound on the first term of the RHS.

Lemma 5.1.

Let where is the sparse sketch matrix in (7) with sketch dimension and . Then, the following holds


with probability at least .


Note that for the positive definite matrix , we have . Moreover,

Next, we note that is the number of non-zero elements per row in the sketch in (7) after ignoring stragglers. Moreover, we use Theorem 8 in [40]

to bound the singular values for the sparse sketch

in (7) with sketch dimension and . It says that , where and the constants in and depend on and . Thus, , which implies that

with probability at least . For , this leads to the following inequality

This implies that with probability , which proves the desired result. ∎

The next lemma provides a lower bound on the LHS of (5.1).

Lemma 5.2.

For the sketch matrix in (7) with sketch dimension , the following holds


with probability at least .


The result follows directly by substituting , and in Theorem 4.1 of [11]. ∎

Using Lemma 5.1 to bound the RHS of (5.1), we have, with probability at least ,

Since and the sketch dimension , using Lemma 5.2 in the above inequality, we get, with probability at least ,

Now, since and are the minimum and maximum eigenvalues of , we get

by the Lipschitzness of , that is, . Rearranging for , we get