Optimization for Large-Scale Machine Learning with Distributed Features and Observations

October 31, 2016 · Alexandros Nathan et al. · Northwestern University

As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine learning and predictive modeling, distributed optimization methods have recently garnered ample attention in the literature. Although previous research has mostly focused on settings where either the observations or the features of the problem at hand are stored in distributed fashion, the situation where both are partitioned across the nodes of a computer cluster (doubly distributed) has barely been studied. In this work we propose two doubly distributed optimization algorithms. The first one falls under the umbrella of distributed dual coordinate ascent methods, while the second one belongs to the class of stochastic gradient/coordinate descent hybrid methods. We conduct numerical experiments in Spark using real-world and simulated data sets and study the scaling properties of our methods. Our empirical evaluation shows that both proposed algorithms outperform a block distributed ADMM method, which, to the best of our knowledge, is the only other existing doubly distributed optimization algorithm.


I Introduction

The collection and analysis of data is widespread nowadays across many industries. As the size of modern data sets exceeds the disk and memory capacities of a single computer, it is imperative to store and analyze them in a distributed fashion. Designing efficient and scalable distributed optimization algorithms is a challenging, yet increasingly important task. There exists a large body of literature studying algorithms where either the features or the observations associated with a machine learning task are stored in distributed fashion. Nevertheless, little attention has been given to settings where the data is doubly distributed, i.e., when both features and observations are distributed across the nodes of a computer cluster. This scenario arises in practice as a natural result of the data collection process.

In this work, we propose two algorithms that are amenable to the doubly distributed setting, namely D3CA (Doubly Distributed Dual Coordinate Ascent) and RADiSA (RAndom Distributed Stochastic Algorithm). These methods can solve a broad class of problems that can be posed as minimization of the sum of convex functions plus a convex regularization term (e.g. least squares, logistic regression, support vector machines).

D3CA builds on previous distributed dual coordinate ascent methods [1, 2, 3], allowing features to be distributed in addition to observations. The main idea behind distributed dual methods is to approximately solve many smaller sub-problems (also referred to herein as partitions) instead of solving a large one. Upon the completion of the local optimization procedure, the primal and dual variables are aggregated, and the process is repeated until convergence. Since each sub-problem contains only a subset of the original features, the same dual variables are present in multiple partitions of the data. This creates the need to aggregate the dual variables corresponding to the same observations. To ensure dual feasibility, we average them and retrieve the primal variables by leveraging the primal-dual relationship (3), which we discuss in Section III.

In contrast with D3CA, RADiSA is a primal method and is related to a recent line of work [4, 5, 6] on combining Coordinate Descent (CD) methods with Stochastic Gradient Descent (SGD). Its name has the following interpretation: the randomness is due to the fact that at every iteration, each sub-problem is assigned a random sub-block of local features; the stochastic component owes its name to the parameter update scheme, which follows closely that of the SGD algorithm. The work most pertinent to RADiSA is RAPSA [4]. The main distinction between the two methods is that RAPSA follows a distributed gradient (mini-batch SGD) framework, in that in each global iteration there is a single (full or partial) parameter update. Such methods suffer from high communication cost in distributed environments. RADiSA, which follows a local update scheme similar to D3CA, is a communication-efficient generalization of RAPSA, coupled with the stochastic variance reduced gradient (SVRG) technique [7].

The contributions of our work are summarized as follows:

  • We address the problem of training a model when the data is distributed across observations and features. We propose two doubly distributed optimization methods.

  • We perform a computational study to empirically evaluate the two methods. On all instances, both methods outperform the block splitting variant of ADMM [8], which, to the best of our knowledge, is the only other existing doubly distributed optimization algorithm.

The remainder of the paper is organized as follows: Section II discusses related work in distributed optimization; Section III provides an overview of the problem under consideration and presents the proposed algorithms; in Section IV we present the results of our numerical experiments, where we compare D3CA and two versions of RADiSA against ADMM.

II Related Work

Stochastic Gradient Descent Methods

SGD is one of the most widely used optimization methods in machine learning. Its low per-iteration cost and small memory footprint make it a natural candidate for training models with a large number of observations. Due to its popularity, it has been extensively studied in parallel and distributed settings. One standard approach to parallelizing it is the so-called mini-batch SGD framework, where worker nodes compute stochastic gradients on local examples in parallel, and a master node performs the parameter updates. Different variants of this approach have been proposed, both in the synchronous setting [9] and in the asynchronous setting with delayed updates [10]. Another notable work on asynchronous SGD is Hogwild! [11], where multiple processors carry out SGD updates independently, possibly overwriting each other's progress. A caveat of Hogwild! is that it places strong sparsity assumptions on the data. An alternative strategy that is more communication-efficient than the mini-batch framework is the Parallelized SGD (P-SGD) method [12], which follows the research direction set by [13, 14]. The main idea is to allow each processor to independently perform SGD on the subset of the data that corresponds to it, and then to average all solutions to obtain the final result. Note that in all the aforementioned methods, the observations are stored distributively, but not the features.
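To make the parameter-averaging idea behind P-SGD concrete, below is a minimal single-machine sketch (our own illustration, not any of the cited implementations): each simulated worker runs plain SGD for L2-regularized logistic regression on its own shard of observations, and the shard-level solutions are averaged once at the end. All function names, constants, and the choice of loss are assumptions made for the example.

import numpy as np

def local_sgd(X, y, lam=0.01, lr=0.1, epochs=5, seed=0):
    # Plain SGD for L2-regularized logistic regression on one shard of observations.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            margin = y[i] * X[i].dot(w)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
            w -= lr * grad
    return w

def parallelized_sgd(X, y, num_workers=4):
    # Each shard would live on a separate node in a real deployment.
    shards = np.array_split(np.arange(X.shape[0]), num_workers)
    solutions = [local_sgd(X[idx], y[idx], seed=k) for k, idx in enumerate(shards)]
    return np.mean(solutions, axis=0)  # a single round of communication

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 20))
    y = np.sign(X @ rng.standard_normal(20))
    w = parallelized_sgd(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w) == y))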

Coordinate Descent Methods

Coordinate descent methods have proven very useful in various machine learning tasks. In its simplest form, CD selects a single coordinate of the variable vector, and minimizes along that direction while keeping the remaining coordinates fixed [15]. More recent CD versions operate on randomly selected blocks, and update multiple coordinates at the same time [16]. Primal CD methods have been studied in the parallel [17] and distributed settings [18, 19]. Distributed CD as it appears in [19] can be conducted with the coordinates (features) being partitioned, but requires access to all observations. Recently, dual coordinate ascent methods have received ample attention from the research community, as they have been shown to outperform SGD in a number of settings [20, 21]. In the dual problem, each dual variable is associated with an observation, so in the distributed setting one would partition the data across observations. Examples of such algorithms include [1, 2, 3]. CoCoA [1], which serves as the starting point for D3CA, follows the observation partitioning scheme and treats each block of data as an independent sub-problem. Due to the separability of the problem over the dual variables, the local objectives that are maximized are identical to the global one. Each sub-problem is approximately solved using a dual optimization method; the Stochastic Dual Coordinate Ascent (SDCA) method [21] is a popular algorithm for this task. Following the optimization step, the locally updated primal and dual variables are averaged, and the process is repeated until convergence. Similar to SGD-based algorithms, dual methods have not yet been explored when the feature space is distributed.

SGD-CD Hybrid Methods

There has recently been a surge of methods combining SGD and CD [22, 4, 5, 23, 6]. These methods conduct parameter updates based on stochastic partial gradients, which are computed by randomly sampling observations and blocks of variables. With the exception of RAPSA [4], which is a parallel algorithm, all other methods are serial, and typically assume that the sampling process has access to all observations and features. Although this is a valid assumption in a parallel (shared-memory) setting, it does not hold in distributed environments. RAPSA employs an update scheme similar to that of mini-batch SGD, but does not require all variables to be updated at the same time. More specifically, in every iteration each processor randomly picks a subset of observations and a block of variables, and computes a partial stochastic gradient based on them. Subsequently, it performs a single stochastic gradient update on the selected variables, and then re-samples feature blocks and observations. Aside from the fact that RAPSA is not a doubly distributed optimization method, its parameter update is also quite different from that of RADiSA: RAPSA allows only one parameter update per iteration, whereas RADiSA permits multiple updates per iteration, thus leading to a great reduction in communication. Finally, RADiSA utilizes the SVRG technique, which is known to accelerate the rate of convergence of an algorithm.

ADMM-based Methods

A popular alternative for distributed optimization is the alternating direction method of multipliers (ADMM) [24]. The original ADMM algorithm is very flexible in that it can be used to solve a wide variety of problems, and is easily parallelizable. A block splitting variant of ADMM was recently proposed that allows both features and observations to be stored in distributed fashion [8]. One caveat of ADMM-based methods is their slow convergence rate. In our numerical experiments we show empirically the benefits of using RADiSA or D3CA over block splitting ADMM.

III Algorithms

In this section we present the D3CA and RADiSA algorithms. We first briefly discuss the problem of interest, and then introduce the notation used in the remainder of the paper.

Preliminaries

Figure 1: An illustration of the partitioning scheme under consideration. (a) and (b) show the definitions of X_[p][·] and X_[·][q], respectively.

In a typical supervised learning task, there is a collection of input-output pairs {(x_i, y_i)}, i = 1, …, n, where each x_i ∈ R^m represents an observation consisting of m features and is associated with a corresponding label y_i. This collection is usually referred to as the training set. The general objective under consideration can be expressed as the minimization of a finite sum of convex functions plus a smooth, convex regularization term (where λ > 0 is the regularization parameter and each f_i is parametrized by the pair (x_i, y_i)):

\min_{w \in \mathbb{R}^m} P(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w^T x_i) + \frac{\lambda}{2}\|w\|^2    (1)

An alternative approach for finding a solution to (1) is to solve its corresponding dual problem. The dual problem of (1) has the following form:

\max_{\alpha \in \mathbb{R}^n} D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} -f_i^{*}(-\alpha_i) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i x_i\Big\|^2    (2)

where f_i^* is the convex conjugate of f_i. Note that for certain non-smooth primal objectives used in models such as support vector machines and least absolute deviation, the convex conjugate imposes lower and upper bound constraints on the dual variables. One interesting aspect of the dual objective (2) is that there is one dual variable associated with each observation in the training set. Given a dual solution α ∈ R^n, it is possible to retrieve the corresponding primal vector w(α) by using

w(\alpha) = \frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i x_i    (3)

For any primal-dual pair of solutions w(α) and α, the duality gap is defined as P(w(α)) − D(α), and it is known that P(w(α)) ≥ D(α). Duality theory guarantees that at an optimal solution α* of (2) and w* of (1), P(w*) = D(α*).
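As a concrete illustration of these quantities, the short sketch below (our own, not code from the paper) evaluates the primal objective (1), the dual objective (2), and the resulting duality gap for the hinge loss, using the mapping (3). The feasibility condition y_i * alpha_i in [0, 1] is specific to the hinge-loss conjugate; all names and values are illustrative.

import numpy as np

def primal(w, X, y, lam):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return hinge.mean() + 0.5 * lam * w.dot(w)

def dual(alpha, X, y, lam):
    n = X.shape[0]
    w = X.T @ alpha / (lam * n)          # primal-dual relationship (3)
    # For the hinge loss, -f_i^*(-alpha_i) = y_i * alpha_i with y_i * alpha_i in [0, 1].
    return (y * alpha).mean() - 0.5 * lam * w.dot(w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, lam = 200, 10, 0.1
    X = rng.standard_normal((n, m))
    y = np.sign(rng.standard_normal(n))
    beta = rng.uniform(0.0, 1.0, n)      # feasible duals: y_i * alpha_i in [0, 1]
    alpha = y * beta
    w = X.T @ alpha / (lam * n)
    gap = primal(w, X, y, lam) - dual(alpha, X, y, lam)
    print("duality gap (always >= 0):", gap)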

Notation: We assume that the data is distributed across both observations and features over the computing nodes of a cluster. More specifically, we split the features into Q partitions and the observations into P partitions (for simplicity we assume that the resulting P · Q blocks are in one-to-one correspondence with the available nodes). We denote the labels of observation partition p by y_[p], and the observations of that partition restricted to feature block q by X_[p][q]. For instance, if we let P = 2 and Q = 2, the resulting partitions are (X_[1][1], y_[1]), (X_[1][2], y_[1]), (X_[2][1], y_[2]) and (X_[2][2], y_[2]). Furthermore, X_[p][·] represents all observations and features (across all q) associated with observation partition p (X_[·][q] is defined similarly) – Figure 1 illustrates this partitioning scheme. We let n_p denote the number of observations in partition p, such that Σ_p n_p = n, and we let m_q correspond to the number of features in partition q, such that Σ_q m_q = m. Note that partitions corresponding to the same observations all share the common dual variable α_[p]. In a similar manner, partitions containing the same features share the common primal variable w_[q]. In other words, for a given observation partition p and feature partition q, the partial solutions α_[p] and w_[q] represent aggregations of the local solutions obtained on partitions (p, 1), …, (p, Q) and (1, q), …, (P, q), respectively. At any iteration of D3CA, the global dual variable vector can be written as α = (α_[1], …, α_[P]), whereas for RADiSA the global primal vector has the form w = (w_[1], …, w_[Q]), i.e., the global solutions are formed by concatenating the partial solutions.
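The partitioning scheme can be mimicked on a single machine in a few lines of NumPy. The sketch below is purely illustrative (the function name and data are hypothetical): it cuts an n x m matrix into P observation blocks and Q feature blocks, so that partition (p, q) holds X_[p][q] together with the labels y_[p].

import numpy as np

def doubly_partition(X, y, P, Q):
    row_blocks = np.array_split(np.arange(X.shape[0]), P)   # observation blocks
    col_blocks = np.array_split(np.arange(X.shape[1]), Q)   # feature blocks
    parts = {}
    for p, rows in enumerate(row_blocks):
        for q, cols in enumerate(col_blocks):
            # X_[p][q] together with the labels y_[p] of observation block p
            parts[(p, q)] = (X[np.ix_(rows, cols)], y[rows])
    return parts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 6))
    y = np.sign(rng.standard_normal(10))
    parts = doubly_partition(X, y, P=2, Q=3)
    for key, (Xpq, ypq) in sorted(parts.items()):
        print(key, Xpq.shape, ypq.shape)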

Doubly Distributed Dual Coordinate Ascent

The D3CA framework, presented in Algorithm 1, hinges on CoCoA [1], but extends it to cater for the features being distributed as well. The main idea behind D3CA is to approximately solve the local sub-problems using a dual optimization method, and then aggregate the dual variables via averaging. The choice of averaging is reasonable from a dual feasibility standpoint when dealing with non-smooth primal losses – the LocalDualMethod guarantees that the dual variables are within the lower and upper bounds imposed by the convex conjugate, so their average will also be feasible. Although in CoCoA it is possible to recover the primal variables directly from the local solver, in D3CA, due to the averaging of the dual variables, we need to use the primal-dual relationship (3) to obtain them. Note that in the case where Q = 1, D3CA reduces to CoCoA.

Data: (X_[p][q], y_[p]) for p = 1, …, P and q = 1, …, Q

Initialize: α^(0) := 0, w^(0) := 0

1:for t = 1, 2, … do
2:     for all partitions (p, q) do in parallel
3:         Δα_[p][q] ← LocalDualMethod(α_[p]^(t−1), w_[q]^(t−1))
4:     end for
5:     for all p = 1, …, P do in parallel
6:         α_[p]^(t) ← α_[p]^(t−1) + (1/Q) Σ_{q=1}^{Q} Δα_[p][q]
7:     end for
8:     for all q = 1, …, Q do in parallel
9:         w_[q]^(t) ← (1/(λn)) Σ_{i=1}^{n} α_i^(t) x_{i[q]}
10:     end for
11:end for
Algorithm 1 Doubly Distributed Dual Coordinate Ascent (D3CA)

D3CA requires the input data to be doubly partitioned across the nodes of a cluster. In step 3, the algorithm calls the local dual solver, which is shown in Algorithm 2. The LocalDualMethod of choice is SDCA [21], with the only difference that the objective maximized in step 3 is divided by Q. The reason for this is that each dual variable now appears in Q partitions, so the 1/Q factor ensures that the sum of the local objectives adds up to (2). Step 6 of Algorithm 1 shows the dual variable update, which is equivalent to averaging the dual iterates coming from SDCA. Finally, step 9 retrieves the primal variables in parallel using the primal-dual relationship (3). The new primal and dual solutions are used to warm-start the next iteration. The performance of the algorithm turns out to be very sensitive to the regularization parameter λ. For values of λ that are small relative to the problem size, D3CA is not always able to reach the optimal solution. One modification we made to alleviate this issue was to add a step-size parameter when calculating the Δα_i's in the local dual method (Algorithm 2, step 3). In the case of linear Support Vector Machines (SVM), where step 3 admits a closed-form solution, we damp that closed-form update with a step-size parameter [25]; in our experiments the step-size decays with the global iteration counter. Although a step-size of this form does not resolve the problem entirely, the performance of the method does improve.
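For concreteness, the following is a rough single-partition sketch of an SDCA-style closed-form update for the hinge loss, damped by a step-size as described above. It is a sketch under assumptions: the exact local scaling used by D3CA (including the 1/Q factor) and the authors' step-size schedule are not reproduced, so the eta parameter and the function signature are illustrative.

import numpy as np

def local_sdca(X_pq, y_p, alpha, w_q, lam, n_total, iters, eta=1.0, seed=0):
    # X_pq: local observations x local features; alpha: local (signed) dual variables;
    # w_q: primal coordinates for feature block q; n_total: global number of observations.
    rng = np.random.default_rng(seed)
    alpha, w_q = alpha.copy(), w_q.copy()
    for _ in range(iters):
        i = rng.integers(X_pq.shape[0])
        x_i, y_i = X_pq[i], y_p[i]
        beta_i = y_i * alpha[i]                       # hinge-loss feasibility: beta_i in [0, 1]
        incr = lam * n_total * (1.0 - y_i * x_i.dot(w_q)) / max(x_i.dot(x_i), 1e-12)
        delta = (np.clip(beta_i + incr, 0.0, 1.0) - beta_i) * eta   # damped closed-form step
        alpha[i] += y_i * delta
        w_q += delta * y_i * x_i / (lam * n_total)    # keep the local primal image in sync
    return alpha, w_q

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 8))
    y = np.sign(rng.standard_normal(50))
    a, w = local_sdca(X, y, np.zeros(50), np.zeros(8), lam=0.1, n_total=50, iters=200)
    print("hinge objective:", np.maximum(0.0, 1.0 - y * (X @ w)).mean() + 0.05 * w.dot(w))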

Figure 2: An illustration of two iterations of RADiSA, with six overall partitions (P = 2 and Q = 3).

Input: α_[p], w_[q]

Data: Local (X_[p][q], y_[p])

Initialize: α ← α_[p], w ← w_[q], Δα ← 0

1:for h = 1, 2, …, H do
2:     choose i ∈ {1, …, n_p} uniformly at random
3:     find Δα_i maximizing (1/Q)[ −f_i^*(−(α_i + Δα_i)) − (λn/2) ||w + (1/(λn)) Δα_i x_{i[q]}||^2 ]
4:      α_i ← α_i + Δα_i
5:     (Δα)_i ← (Δα)_i + Δα_i
6:     w ← w + (1/(λn)) Δα_i x_{i[q]}
7:end for
8:Output: Δα and w
Algorithm 2 LocalDualMethod: Stochastic Dual Coordinate Ascent (SDCA)

In terms of parallelism, the sub-problems can be solved independently. These independent processes can be carried out either on separate computing nodes or on distinct cores of multi-core computing nodes. The only steps that require communication are steps 6 and 9. The communication steps can be implemented via reduce operations – in Spark we use treeAggregate, which aggregates partial results in a tree pattern and thus scales better than the standard reduce operation.
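As an illustration of this kind of reduction, the sketch below (not the authors' implementation; the data layout and names are hypothetical) uses PySpark's treeAggregate to sum alpha_i * x_i over a toy RDD, from which the driver forms w via the primal-dual relationship (3).

import numpy as np
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="treeAggregate-demo")
    rng = np.random.default_rng(0)
    n, m, lam = 1000, 20, 0.1
    # (x_i, alpha_i) pairs; in D3CA these records would already live on the cluster.
    records = [(rng.standard_normal(m), float(rng.uniform(-1.0, 1.0))) for _ in range(n)]
    rdd = sc.parallelize(records, numSlices=8)

    def seq_op(acc, rec):
        x_i, alpha_i = rec
        return acc + alpha_i * x_i

    def comb_op(a, b):
        return a + b

    # treeAggregate combines partial sums in a tree pattern, which avoids
    # funneling every partition's result directly to the driver.
    s = rdd.treeAggregate(np.zeros(m), seq_op, comb_op, depth=2)
    w = s / (lam * n)
    print("||w|| =", float(np.linalg.norm(w)))
    sc.stop()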

Random Distributed Stochastic Algorithm

Similar to D3CA, RADiSA, outlined in Algorithm 3, assumes that the data is doubly distributed across P · Q partitions. Before reaching step 1 of the algorithm, each block of variables w_[q] (shared by the partitions (p, q) for p = 1, …, P) is further divided into P non-overlapping sub-blocks. The reason for doing this is to ensure that at no time more than one processor is updating the same variables. Although the blocks remain fixed throughout the runtime of the algorithm, the random exchange of sub-blocks between iterations is allowed (step 5). The process of randomly exchanging sub-blocks can be seen graphically in Figure 2. For example, the two left-most partitions, which have been assigned the same coordinate block, exchange their two sub-blocks of that block from one iteration to the next. The notation in step 5 of the algorithm essentially implies that sub-blocks are partition-specific and, therefore, depend on both p and q.

A possible variation of Algorithm 3 is one that allows for complete overlap between the sub-blocks of variables. In this setting, however, concatenating all local variables into a single global solution (step 14) is no longer an option. Other techniques, such as parameter averaging, need to be employed in order to aggregate the local solutions. In our numerical experiments, we explore a parameter averaging version of RADiSA (RADiSA-avg).

The optimization procedure of RADiSA makes use of the Stochastic Variance Reduced Gradient (SVRG) method [7], which helps accelerate the convergence of the algorithm. SVRG requires a full-gradient computation (step 3), typically after a full pass over the data. Note that for models that can be expressed as a sum of functions, as in (1), it is possible to compute the gradient even when the data is doubly distributed. Although RADiSA by default computes a full gradient for each global iteration, delaying the gradient updates can be a viable alternative. Step 10 shows the standard SVRG step, which is applied to the sub-block of coordinates assigned to that partition. The total number of inner iterations is determined by the batch size b, which is a hyper-parameter. As is always the case with variants of the SGD algorithm, the learning rate (also known as the step-size) typically requires some tuning from the user in order to achieve the best possible results. In Section IV we discuss our choice of step-size. The final stage of the algorithm simply concatenates all the local solutions to obtain the next global iterate. The new global iterate is used to warm-start the subsequent iteration.
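To make the block-restricted SVRG step concrete, the single-machine sketch below applies the variance-reduced update of step 10 to one sub-block of coordinates, using L2-regularized logistic regression as the loss. The loss, the scheduling of sub-blocks across partitions, and all parameter values are illustrative assumptions rather than the paper's exact setup.

import numpy as np

def grad_i(w, x_i, y_i, lam):
    # Gradient of the i-th L2-regularized logistic loss term.
    return -y_i * x_i / (1.0 + np.exp(y_i * x_i.dot(w))) + lam * w

def svrg_block_pass(X, y, w, block, lam=0.01, lr=0.1, batch=50, seed=0):
    rng = np.random.default_rng(seed)
    w = w.copy()
    w_snap = w.copy()                                  # snapshot, i.e. the "tilde" iterate
    mu = np.mean([grad_i(w_snap, X[i], y[i], lam)      # full gradient (step 3)
                  for i in range(X.shape[0])], axis=0)
    for _ in range(batch):
        i = rng.integers(X.shape[0])
        v = grad_i(w, X[i], y[i], lam) - grad_i(w_snap, X[i], y[i], lam) + mu
        w[block] -= lr * v[block]                      # update only the assigned sub-block
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.standard_normal((500, 12))
    y = np.sign(X @ rng.standard_normal(12))
    w = np.zeros(12)
    block = np.arange(0, 4)                            # one sub-block of coordinates
    w = svrg_block_pass(X, y, w, block)
    print("updated coordinates:", w[block])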

Similar to D3CA, the sub-problems can be solved independently. As far as communication is concerned, only the gradient computation (step 3) and the parameter update (step 10) stages require coordination among the different processes. In Spark, these communication operations are likewise implemented via treeAggregate.

Input: batch size b, learning rate γ

Data: (X_[p][q], y_[p]) for p = 1, …, P and q = 1, …, Q

Initialize: w^(0) := 0

Partition each w_[q] into P sub-blocks, such that w_[q] = (w_[q],1, …, w_[q],P)

1:for t = 1, 2, … do
2:     w̃ ← w^(t−1)
3:     μ̃ ← ∇P(w̃)
4:     for all partitions (p, q) do in parallel
5:         Randomly pick a sub-block w_[q],r of w_[q], in a non-
6:          overlapping manner across the partitions sharing w_[q]
7:         v ← w̃_[q],r
8:         for s = 1, …, b do
9:              randomly pick i ∈ {1, …, n_p}
10:              v ← v − γ ( ∇_[q],r f_i(v) − ∇_[q],r f_i(w̃) + μ̃_[q],r )
11:              w_[q],r ← v
12:         end for
13:     end for
14:      w^(t) ← (w_[1], …, w_[Q]), where each w_[q] is the concatenation of its P updated sub-blocks
15:end for
Algorithm 3 Random Distributed Stochastic Algorithm (RADiSA)

IV Numerical Experiments

Figure 3: Relative optimality difference against elapsed time for three data sets with the following configurations of (P, Q): (4,2), (5,3) and (7,4).
Figure 4: Relative optimality difference against iteration count.

In this section we present two sets of experiments. The first set is adopted from [8], and we compare the block distributed version of ADMM with RADiSA and D3CA. In the second set of experiments we explore the scalability properties of the proposed methods. We implemented all algorithms in Spark and conducted the experiments in a Hadoop cluster with 4 nodes, each containing 8 Intel Xeon E5-2407 2.2GHz cores. For the ADMM method, we follow the approach outlined in [8], whereby the Cholesky factorization of the data matrix is computed once, and is cached for re-use in subsequent iterations. Since the computational time of the Cholesky decomposition depends substantially on the underlying BLAS library, in all subsequent figures reporting the execution time of ADMM, we have excluded the factorization time. This makes the reported times for ADMM lower than in reality.

The problem solved in [8] was lasso regression, which is not a model of the form (1). Instead, we trained one of the most popular classification models: the binary hinge-loss support vector machine (SVM). The data for the first set of experiments was generated according to a standard procedure outlined in [26]: the entries of the data matrix and of a true weight vector were sampled from the uniform distribution, each label was set to the sign of the corresponding inner product, and the sign of each label was then randomly flipped with probability 0.1. The features were standardized to have unit variance. We take each partition to be a dense matrix of fixed size (in [8] the partitions were larger, but due to the BLAS issue mentioned earlier we resorted to smaller problems in order to obtain comparable run-times across all methods), and set P and Q accordingly to produce problems at different scales, so that larger values of P and Q yield proportionally larger overall instances. The information about the three data sets is summarized in Table I. As far as hyper-parameter tuning is concerned, for ADMM we fixed the penalty parameter ρ. For RADiSA we set the step-size to a simple parametric form with a single constant, and select the constant that gives the best performance.
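The general recipe (though not the exact sizes used in the paper) can be reproduced with a few lines of NumPy; in the sketch below the uniform sampling ranges and the dimensions are assumptions made for illustration.

import numpy as np

def make_svm_data(n, m, flip_prob=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, m))
    w_true = rng.uniform(-1.0, 1.0, size=m)
    y = np.sign(X @ w_true)                  # labels from the sign of the inner product
    flips = rng.random(n) < flip_prob
    y[flips] *= -1.0                         # flip 10% of the labels
    X /= X.std(axis=0, keepdims=True)        # standardize features to unit variance
    return X, y

if __name__ == "__main__":
    X, y = make_svm_data(n=2000, m=300)
    print(X.shape, y.shape, "fraction of positive labels:", float(np.mean(y == 1)))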

To measure the training performance of the methods under consideration, we use the relative optimality difference metric, defined as

(P(w^(t)) − P(w*)) / P(w*),

where P(w^(t)) is the primal objective function value at iteration t, and P(w*) corresponds to the optimal objective function value, obtained by running an algorithm for a very long time.

Table I: Datasets for Numerical Experiments (Part 1) – number of nonzero entries (in millions) and number of cores used for each of the three simulated instances.

In Figure 3, we observe that RADiSA-avg performs best in all cases, with RADiSA coming in a close second, especially for smaller regularization values. Both variants of RADiSA and D3CA clearly outperform ADMM, which needs a much larger number of iterations to produce a satisfactory solution. We provide an additional comparison in Figure 4 that further demonstrates this point. We plot the relative optimality difference across 50 iterations. One note about RADiSA-avg is that its performance depends heavily on the number of observation partitions. The averaging step tends to dilute the updates, leading to a slower convergence rate. This is evident when training models on larger data sets than the ones shown in this round of experiments. Another important remark we should make is that when dealing with larger data sets, the behavior of D3CA is erratic for small regularization values. For large regularization values, however, it can produce good solutions.

In the second set of experiments we study the strong and weak scaling properties of our algorithms. The model under consideration is again the linear SVM. For the strong scaling case, the overall size of the data set does not change, but we increase the number of available computing resources. This means that as the overall number of partitions increases, the workload of each processor decreases. For RADiSA, we keep the overall number of data points processed constant as we increase the number of partitions, which implies that as the sub-problem/partition size decreases, so does the batch size b. One matter that requires attention is the step-size parameter. For all SGD-based methods, the magnitude of the step-size is inversely proportional to the batch size b. We adjust the step-size as the number of partitions increases by simply taking into account the number of observation partitions P. Note that D3CA does not require any parameter tuning. We test our algorithms on two real-world data sets that are available through the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). Table II summarizes the details of these data sets.

Dataset Observations Features Sparsity
real-sim 72,309 20,958 0.240%
news20 19,996 1,355,191 0.030%
Table II: Datasets for Numerical Experiments (Part 2 - Strong Scaling)

As we can see in Figure 5, RADiSA exhibits strong scaling properties in a consistent manner. In both data sets the run-time decreases significantly when introducing additional computing resources. It is interesting that the early configurations with P < Q perform significantly worse compared to the alternate configurations with P > Q. Let us consider the configurations (4,1) and (1,4). In each case, the number of variable sub-blocks is equal to four. This implies that the dimensionality of the sub-problems is identical for both partition arrangements. However, the second partition configuration has to process four times more observations per partition compared to the first one, resulting in an increased run-time. It is noteworthy that the difference in performance tails away as the number of partitions becomes large enough. Overall, to achieve consistently good results, it is preferable that P ≥ Q.

The strong scaling performance of D3CA is mixed. For the smaller data set (realsim), introducing additional computing resources deteriorates the run-time performance. On the larger data set (news20), increasing the number of partitions pays dividends up to a point, beyond which providing additional resources has little to no effect. The pattern observed in Figure 5 is representative of the behavior of D3CA on small versus large data sets (we conducted additional experiments to further attest to this). It is safe to conclude that D3CA is better suited to settings where the per-partition workload remains substantial.

Figure 5: Strong scaling on realsim and news20. The x-axis shows the various partition configurations for each level of parallelism, and the y-axis shows the total time in seconds needed to reach a 1% relative optimality difference. The run-times of the two methods are not directly comparable, because a different regularization value was used for each method.

In the weak scaling experiments the workload assigned to each processor stays constant as additional resources are used to solve a larger problem. Given that problems can grow either in terms of observations or features, we set up our experiments as follows. We generate artificial data sets in the same manner as outlined earlier, but we keep the size of each partition fixed. We vary the number of observation partitions P, and study the performance of our algorithms for two values of the number of feature partitions Q. We also consider two distinct sparsity levels. In terms of measuring performance, we use the following weak scaling efficiency metric. Let T_1 denote the time to complete a run when P = 1 (for fixed Q and sparsity level), and let T_P represent the time to solve the problem with P observation partitions (for the same values of Q and sparsity). The weak scaling efficiency is given as

E_P = T_1 / T_P.

Note that the termination criterion for a run is reaching a fixed relative optimality difference. Furthermore, we use different regularization values for RADiSA and D3CA. In Figure 6, we can see that neither of the two methods is able to maintain a constant scaling efficiency as P becomes larger. RADiSA manages to scale well at first, but its performance deteriorates once P grows large. We should note that the scaling efficiency seems to flatten out for large values of P and Q, which is a positive characteristic. As far as D3CA is concerned, it is interesting that the scaling efficiency is very similar across the different values of Q. Finally, sparsity has a negative impact on the scaling efficiency of both methods.

Figure 6: Weak scaling experiments. The x-axis shows the number of observation partitions P, whereas the y-axis shows the weak scaling efficiency metric for different configurations of Q and sparsity level. Different regularization values are used for RADiSA and D3CA. In the bottom-left figure, D3CA did not produce a satisfactory solution for one of the configurations.

V Conclusion

In this work we presented two doubly distributed algorithms for large-scale machine learning. Such methods can be particularly flexible, as they do not require each node of a cluster to have access to either all features or all observations of the training set. It is noteworthy that when massive data sets are already stored in a doubly distributed manner, our algorithms are the only option for model training. Our numerical experiments show that both methods outperform the block distributed version of ADMM. There is, nevertheless, room to improve both methods. The most important task would be to derive a step-size parameter for D3CA that guarantees the convergence of the algorithm for all regularization parameters. Furthermore, removing the bottleneck of the primal vector computation would result in a significant speedup. As far as RADiSA is concerned, one potential extension would be to incorporate a streaming version of SVRG [27], or a variant that does not require computation of the full gradient at early stages [28]. Finally, studying the theoretical properties of both methods is certainly a topic of interest for future research.

References