Structure-Aware Dynamic Scheduler for Parallel Machine Learning

12/19/2013, by Seunghak Lee et al., Carnegie Mellon University

Training large machine learning (ML) models with many variables or parameters can take a long time if one employs sequential procedures, even with stochastic updates. A natural solution is to turn to distributed computing on a cluster; however, naive, unstructured parallelization of ML algorithms does not usually lead to a proportional speedup and can even result in divergence, because dependencies between model elements can attenuate the computational gains from parallelization and compromise correctness of inference. Recent efforts to address this issue have benefited from exploiting the static, a priori block structures residing in ML algorithms. In this paper, we take this path further by exploring the dynamic block structures and workloads that arise during ML program execution, which offers new opportunities for improving convergence, correctness, and load balancing in distributed ML. We propose and showcase a general-purpose scheduler, STRADS, for coordinating distributed updates in ML algorithms, which harnesses the aforementioned opportunities in a systematic way. We provide theoretical guarantees for our scheduler, and demonstrate its efficacy versus static block structures on Lasso and Matrix Factorization.


1 Introduction

Sensory techniques and digital storage media have improved at a breakneck pace, leading to massive collections of data. The resultant so-called Big Data problems have been a common focus in the recent enthusiasm for scalable machine learning, and numerous algorithmic and system solutions have been proposed to alleviate the time-bottleneck due to Big Data by exploring various heuristic or principled strategies for data parallelism [3, 18, 20, 28].

However, another important aspect of Big ML is what we refer to as Big Model problems, in which models with millions if not billions of variables and/or parameters (e.g., as one would see in a deep network or a large-scale topic model) must be estimated from big (or even modestly-sized) data; such Big Model problems seem to have received relatively less attention from the community. In this paper, we investigate how to facilitate effective and sound parallelization of inference over a large number of variables and/or parameters in such models, an issue we call model parallelism. Model-parallel inference is necessary for many ultra-high-dimensional problems that have recently emerged in modern applications. For example, in genetics and personalized medicine, the number of model variables (e.g., candidate genetic variations) can easily exceed millions; and in e-commerce applications such as personalized ad recommendation, the so-called "interest genome" derived for every person from their multi-media social trace is also very high-dimensional. These high-dimensional problems must be solved quickly to be practically useful for patients or consumers; therefore, sequential computation even on a powerful, high-end single machine is usually not an option, and distributing the computation over a large number of processors in a cluster becomes a natural choice. It is important to note that model-parallelism is not the same as data-parallelism, and poses very different challenges — model-parallelism requires model variables to be partitioned for parallel updates with tight synchronization, whereas data-parallelism involves computation on (usually) independent data subsets. Our focus in this paper is to systematically study the algorithm, system, and theory issues that support model-parallelism.

A major challenge to model-parallelism is that many existing algorithms for ML are derived with the assumption of sequential iteration over variables — for example, optimization algorithms for Lasso [25], matrix factorization [27], sparse coding [8], and support vector machines [12]; MCMC algorithms for topic models [9], Bayesian nonparametric models [7], and direct posterior regularization models [5]. However, the convergence rates and correctness guarantees for these algorithms do not always extend to parallel execution over model variables. In other words, naive model-parallelization can slow down the convergence rate or even lead to failure of ML algorithms [2].

In this paper, we focus on the problem of how to parallelize ML algorithms over different model variables. Recent efforts toward parallel ML over model variables can be divided into two approaches: (1) unstructured distributed ML, and (2) structured distributed ML. The first approach includes algorithms that select model variables uniformly at random for parallel execution [2]; the second approach uses the problem structure to select which variables to update in parallel, thus speeding up per-iteration convergence rates [23], boosting iteration frequency by improving distributed system performance (e.g. minimizing network communications or disk I/O) [14, 15], and guaranteeing algorithm correctness [23]. While structured distributed ML has benefits over unstructured approaches, there is an additional cost to finding such structures. In this paper, we adopt the approach of structured distributed ML, but in a way that departs significantly from conventional strategies, as inspired by the following insights on how structures in an ML program can be explored and exploited.

Static Block Structures:

Static block structures, which are often assumed to be intrinsic to a model, discovered before the algorithm starts, and held fixed during execution, have been widely used for efficient parallelization of ML algorithms. Examples include block Gibbs sampling [17], structured mean field approximations [13], graph-partitioning for parallel executions in GraphLab [18], parallel coordinate descent for matrix factorization [27], and the block-greedy coordinate descent algorithm [23]. The key insight of this approach is that, if decoupled blocks of variables are updated in parallel, then both inconsistencies due to parallel variable updates and communications between different blocks are minimized, resulting in improved ML algorithm convergence rates and guaranteed correctness. However, static block structures must be discovered prior to starting an ML algorithm, resulting in a large, unavoidable upfront runtime cost. Furthermore, static block structures fail to capture the dynamic, changing aspects of ML algorithms driven by data (such as how variables and parameters change throughout execution), and thus cannot obtain a holistic view of an ML problem's structure.

Dynamic Block Structures:

In reality, model block structures are not completely static, but can dynamically change at runtime according to the values of parameters and variables due to data-driven updates. Notably, transient block structures can arise due to recently updated parameters and variables. Taking $\ell_1$-regularized regression as an example, let us consider parallel updates of two coefficients $\beta_j$ and $\beta_k$ at the $t$-th iteration. If $\beta_k$ stays zero at the $(t-1)$-th and $t$-th iterations, then $\beta_k$ does not affect the update of $\beta_j$ — even when the correlation (i.e. dependency) between $\mathbf{x}_j$ and $\mathbf{x}_k$ is large. Such dynamic, runtime structure discovery is critical for distributed ML algorithms, because static block structure, while useful, relies on finding separable blocks from the input data and the model's a priori topology — a challenging task on many real datasets [26]. Furthermore, dynamic block structures have computational advantages over their static counterparts — because dynamic structure is discovered online during algorithm execution, its cost can be amortized over multiple processors working in the background (as opposed to the lump-sum cost of static structure discovery).

Projected-Progress on Dynamic Block Structures:

While dynamic block structures ensure correct parallel execution, they do not directly expedite or improve ML algorithm convergence rates — for that, it is necessary to account for (1) each block's projected progress or importance (e.g. expected objective value improvement upon updates) and (2) the actual workload of each block (e.g. number or magnitude of variables to be updated), when dispatching blocks of variables to parallel workers. Continuing the $\ell_1$-regularized regression example, if we prioritize the coefficients $\beta_j$ that are changing the most with each update, we will speed up the decrease in the loss function per variable update. Moreover, by ensuring every worker gets a similar number of variables to update, we perform load-balancing, thus preventing situations where workers with fewer variables end up sitting idle.

Figure 1: Convergence rates of two different approaches for parallel Lasso: dynamic block structures (STRADS) and no structures (Shotgun [2]). We used the Alzheimer's disease dataset [10].

In this paper, we seek to explore Structure-Aware Parallelism (SAP) inspired by the above insights, at both the algorithmic front-end and system back-end levels. Accordingly, we present a new model-parallel ML strategy called STRADS, or STRucture-Aware Dynamic Scheduler. Figure 1 showcases the key advantage resulting from this new strategy: under a dynamic block-based parallel approach, the convergence of an ML algorithm can escape from the slow-progressing trajectory characteristic of a static block-based parallelism, thus arriving at a better solution much more quickly. This is made possible by the dynamic approach’s ability to adapt to changing structure and execution status, as an ML program (in this case, parallel Lasso) progresses.

More precisely, STRADS is a statistically motivated scheduler that executes distributed ML algorithms correctly and with high convergence rates by jointly considering dynamic block structures, load balancing, and the algorithmic progress made by updates on variable blocks. Figure 2 outlines the basic rationales behind STRADS, while Figure 3 sketches out the system architecture. We apply STRADS to two example applications: parallel Lasso and parallel matrix factorization (MF), using coordinate descent algorithms (we expect STRADS to apply to other algorithms and ML programs, which we will address in future case studies). In the Lasso example, we showcase the benefits of using dynamic block structures based on the runtime values of coefficients; and in the MF example, we demonstrate the advantages of load balancing for parallel execution. Furthermore, we provide a theoretical analysis that proves our scheduling scheme for Lasso is approximately optimal. Our experiments show that for Lasso and MF, STRADS yields faster convergence than unstructured and static block-based approaches, as well as better final objective function values (for Lasso).

Figure 2: Concept diagram for the SAP scheduling model underlying STRADS, explaining how blocks of model variables are selected, grouped and dispatched to workers.

Notation

We denote a matrix with $N$ samples and $J$ variables by $\mathbf{X} \in \mathbb{R}^{N \times J}$, and a vector with $N$ samples by $\mathbf{y} \in \mathbb{R}^{N}$. We represent a column index by a subscript and a row index by a superscript. We denote the iteration index by a parenthesized superscript, matrices by bold-faced uppercase letters, and vectors by bold-faced lowercase letters. We also denote the "dependency strength" (e.g. correlation) between the $j$-th and $k$-th variables by $d(j,k)$, and a set of blocks of variables by $\mathcal{B}$.

2 Structure-Aware Parallelism (SAP) for Dynamic Block Scheduling

We begin with an outline of the scheduling model upon which our proposed approach is built, which we call Structure-Aware Parallelism (SAP). Suppose there are $J$ model variables, and we have $P$ parallel workers to update them. SAP iterates over four steps:

  1. Draw a set $\mathcal{S}^{(t)}$ of variables to update at iteration $t$ from an "importance" distribution $p^{(t)}(j)$. The idea is to choose variables that give high expected improvement in the loss function.

  2. From the set $\mathcal{S}^{(t)}$, find a set of variable blocks $\mathcal{B} = \{B_1, \dots, B_K\}$ such that $d(j,k) < \theta$ for all $j \in B_a$ and $k \in B_b$ with $a \neq b$, where $\theta$ is a user-defined parameter and $d(j,k)$ is a user-defined measure of coupling between variables $j$ and $k$ (e.g. correlation or partial correlation).

  3. Merge blocks of variables in $\mathcal{B}$ until every block has a similar workload, thus achieving load balance. We denote the set of regrouped blocks by $\mathcal{B}'$, and we dispatch the blocks of variables in $\mathcal{B}'$ to workers.

  4. When the workers have finished and returned their updated blocks to SAP, SAP updates the importance distribution $p^{(t)}(j)$ and the dependency function $d(j,k)$ according to the updated blocks. SAP then iterates steps (1)-(4) until the ML algorithm converges.

The first step maximizes convergence rates by selecting variables that will contribute the most to loss function improvement. This is carried out by sampling variables from the distribution $p^{(t)}(j)$, which assigns higher probability to variables whose recent updates have had a higher impact on the loss function. Note that $p^{(t)}(j)$ changes across iterations, because the importance of each variable changes as the algorithm progresses — this is a key difference between dynamic and static block structures.

The second step ensures correctness of an ML algorithm, by decoupling only those blocks of variables with little to no interdependency. It is well-known that simultaneous updates to strongly coupled variables can cause interference; this not only slows down convergence but can even lead to algorithm failure (e.g. divergence) [2]. By organizing variables into nearly-independent blocks, we can control the degree of interference, thus guaranteeing correctness.

The third step performs load balancing. Because blocks can greatly vary in size, situations can arise where most workers end up waiting for the worker with the biggest block to finish — the “curse of the last reducer” [24]. We address this problem by merging blocks until all remaining blocks contain similar workloads.

The fourth step is a "progress monitoring" step, in which SAP estimates the progress each variable contributes to algorithm convergence. Depending on the ML algorithm being run, the definition of progress can vary: examples include the magnitude of change in each variable, or the change in residuals due to variable updates. SAP then uses this information to update $p^{(t)}(j)$ (e.g. by increasing the probability of faster-changing variables) and $d(j,k)$ (e.g. by removing dependencies between variables that have reached zero).

We note that SAP is only one scheme for dynamic block scheduling, and other designs are possible. The key advantage of our design is computational efficiency: step 1 minimizes the computational cost of scheduling by reducing the set of variables from which we must find block structures — essentially a bootstrap-based approach to structure discovery. This is important because the scheduler must be able to find block structures faster than workers consume them (i.e. the scheduler must not be a bottleneck).
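
To make the four steps concrete, here is a minimal, single-process Python sketch of one SAP-style scheduling loop. It is not the authors' implementation: the callbacks importance, dependency, and update_block, the threshold theta, the candidate sample size, and the greedy grouping and merging heuristics are illustrative assumptions standing in for the model-specific choices described above.

```python
import numpy as np

def sap_schedule(J, n_workers, importance, dependency, update_block,
                 theta=0.1, sample_size=100, n_iters=100):
    """Illustrative SAP loop: sample -> block -> balance -> dispatch."""
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        # Step 1: draw candidate variables from the importance distribution p(j).
        p = importance()                                   # length-J probabilities summing to 1
        S = rng.choice(J, size=min(sample_size, J), replace=False, p=p)

        # Step 2: group candidates so that variables in *different* blocks
        # are weakly coupled (d(j, k) < theta across blocks).
        blocks = []
        for j in S:
            hits = [B for B in blocks
                    if any(dependency(j, k) >= theta for k in B)]
            if not hits:
                blocks.append([j])
            else:
                hits[0].append(j)
                for B in hits[1:]:                         # j ties these blocks together
                    hits[0].extend(B)
                    blocks.remove(B)

        # Step 3: merge small blocks so every worker gets a similar workload.
        blocks.sort(key=len)
        while len(blocks) > n_workers:
            smallest = blocks.pop(0)
            blocks[0].extend(smallest)                     # fold into the next-smallest block
            blocks.sort(key=len)

        # Step 4: dispatch blocks; the callbacks are expected to refresh p and d.
        for B in blocks:                                   # run in parallel in a real system
            update_block(B)
```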

We now showcase how coordinate descent algorithms for two popular ML models, $\ell_1$-regularized regression and Matrix Factorization, can be cast into parallel versions via SAP.

2.1 Case 1: $\ell_1$-regularized Regression

  Choose initial $\boldsymbol{\beta}^{(0)}$, dependency threshold $\theta$ (a small positive constant), and $\eta$
  Set $c_j = 0$ for all $j$, where $c_j$ represents the iteration-counter for the $j$-th coefficient
  Set $\delta\beta_j^{(0)} = C$ for all $j$, where $C$ is a very large positive constant
  while not converged do
     Draw a set $\mathcal{S}$ of coefficients $\{\beta_j\}$ from a distribution $p(j) \propto (\delta\beta_j^{(c_j)})^2 + \eta$, where $\eta$ is a constant larger than zero
     Choose a set $B \subseteq \mathcal{S}$ of coefficients (i.e. one-variable blocks) from $\mathcal{S}$, such that $|\mathbf{x}_j^T \mathbf{x}_k| \leq \theta$ for all $\beta_j, \beta_k \in B$ with $j \neq k$, where $\mathbf{x}_j$ is the covariate corresponding to $\beta_j$
     In parallel on $P$ workers
      Get assigned coefficient $\beta_j$ from $B$
      Update $\beta_j$ using update rule (2)
      $\delta\beta_j \leftarrow \beta_j^{\text{new}} - \beta_j^{\text{old}}$; $c_j \leftarrow c_j + 1$
  end while
Algorithm 1 Parallel CD for Lasso, using SAP

The $\ell_1$-regularized regression (a.k.a. Lasso) [25] is used to discover a small subset of features or dimensions of $\mathbf{X}$ that are relevant to an output $\mathbf{y}$. $\ell_1$-regularized regression takes the form of an optimization program:

$$\min_{\boldsymbol{\beta}} \; L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) + \lambda \|\boldsymbol{\beta}\|_1 \qquad (1)$$

where $\lambda$ denotes the regularization parameter that can be tuned, and $L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta})$ is a non-negative convex loss function such as squared-loss or logistic-loss; we assume that $\mathbf{X}$ and $\mathbf{y}$ are standardized and consider (1) without an intercept. Throughout this paper, for simplicity but without loss of generality, we let $L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$. However, it is straightforward to use other loss functions such as logistic-loss using the same approach shown in [2].

By taking the gradient of (1), we obtain the coordinate descent (CD) algorithm [4] update rule for $\beta_j$:

$$\beta_j \leftarrow S\!\left( \mathbf{x}_j^T \Big( \mathbf{y} - \sum_{k \neq j} \mathbf{x}_k \beta_k \Big), \; \lambda \right) \qquad (2)$$

where $S(z, \lambda) = \operatorname{sign}(z)\max(|z| - \lambda, 0)$ is the soft-thresholding operator [4].
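
For concreteness, a small NumPy sketch of update rule (2) is shown below; the soft-thresholding operator and the single-coefficient update follow the standard coordinate descent form for squared loss with standardized covariates, while the function names are ours.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lambda) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def cd_update(X, y, beta, j, lam):
    """One coordinate descent update of beta_j via rule (2),
    assuming each column of X has unit L2 norm."""
    r_j = y - X @ beta + X[:, j] * beta[j]   # residual excluding beta_j's contribution
    beta[j] = soft_threshold(X[:, j] @ r_j, lam)
    return beta
```

In STRADS, many such single-coefficient updates are executed in parallel, one per dispatched one-variable block.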

SAP schedules parallel CD updates on the Lasso optimization program (1), according to the four steps:

  • Step 1: We use the probability $p(j) \propto (\delta\beta_j^{(t-1)})^2 + \eta$, where $\delta\beta_j^{(t-1)} = \beta_j^{(t-1)} - \beta_j^{(t-2)}$ represents the change in $\beta_j$ at the $(t-1)$-th iteration. Intuitively, convergence is improved when we update variables (coefficients) that change more rapidly per iteration, and thus we prioritize variables based on the change in their values. In Section 4, we provide a theoretical justification for the use of $p(j)$. A code sketch of these Lasso-specific choices follows this list.

  • Step 2: We define the dependency $d(j,k)$ for the parallel updates of (2). In this case, $d(j,k) = |\mathbf{x}_j^T \mathbf{x}_k|$, i.e., the correlation between the $j$-th and $k$-th covariates (note that we standardized $\mathbf{X}$). If the $j$-th and $k$-th covariates, $\mathbf{x}_j$ and $\mathbf{x}_k$, are highly correlated, then updating $\beta_j$ and $\beta_k$ in parallel will cause an interference effect that may dramatically attenuate improvement in the objective function [2]. SAP ensures that variables are grouped into blocks such that variables in different blocks have nearly independent covariates — thus keeping interference effects to a minimum.

  • Step 3: For parallel Lasso, we fix the size of blocks to one for application-specific reasons. It turns out to be non-trivial to choose an appropriate block size considering both load balance and the quality of updates (i.e., decrease of the objective value). Thus, choosing an appropriate block size at runtime is left for future work.

  • Step 4: After collecting the updated variables from workers, SAP uses them to update the distribution $p(j)$ from Step 1.
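
The Lasso-specific choices above might be sketched in Python as follows; the prioritization by squared recent change plus a small constant, and the pairwise-correlation check against a threshold theta, mirror Steps 1-3, while the function names and the greedy selection loop are illustrative assumptions.

```python
import numpy as np

def lasso_importance(delta_beta, eta=1e-6):
    """Step 1: p(j) proportional to (delta beta_j)^2 + eta."""
    w = delta_beta ** 2 + eta
    return w / w.sum()

def lasso_dependency(X, j, k):
    """Step 2: d(j, k) = |x_j^T x_k|; with standardized X this is a correlation."""
    return abs(X[:, j] @ X[:, k])

def select_dispatch_set(X, candidates, theta=0.1):
    """Steps 2-3: keep candidates whose pairwise correlation stays below theta,
    so the dispatched one-variable blocks are nearly independent."""
    chosen = []
    for j in candidates:
        if all(lasso_dependency(X, j, k) < theta for k in chosen):
            chosen.append(j)
    return chosen
```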

2.2 Case 2: Matrix Factorization

MF is often used for collaborative filtering, where the goal is to predict a user's unknown preferences, given his/her known preferences and the preferences of others. The input data is modeled as an incomplete matrix $\mathbf{A} \in \mathbb{R}^{M \times N}$, where $M$ is the number of users and $N$ is the number of items/preferences. The idea is to discover smaller rank-$K$ matrices $\mathbf{W} \in \mathbb{R}^{M \times K}$ and $\mathbf{H} \in \mathbb{R}^{K \times N}$ such that $\mathbf{W}\mathbf{H} \approx \mathbf{A}$. Thus, the product $\mathbf{W}\mathbf{H}$ can be used to predict the missing entries (user preferences). Formally, let $\Omega$ be the set of indices of observed entries in $\mathbf{A}$, $\Omega_i$ be the set of observed column indices in the $i$-th row of $\mathbf{A}$, and $\bar{\Omega}_j$ be the set of observed row indices in the $j$-th column of $\mathbf{A}$. Then, the MF problem is defined as the optimization program

$$\min_{\mathbf{W}, \mathbf{H}} \; \sum_{(i,j) \in \Omega} \left( A_{ij} - \mathbf{w}^{i} \mathbf{h}_{j} \right)^2 \qquad (3)$$

where $\mathbf{w}^{i}$ denotes the $i$-th row of $\mathbf{W}$ and $\mathbf{h}_{j}$ the $j$-th column of $\mathbf{H}$.

This optimization is solved via parallel CD, with the following update rules for $w^{i}_{k}$ and $h_{kj}$:

$$w^{i}_{k} \leftarrow \frac{\sum_{j \in \Omega_i} \left( R_{ij} + w^{i}_{k} h_{kj} \right) h_{kj}}{\sum_{j \in \Omega_i} h_{kj}^{2}} \qquad (4)$$

$$h_{kj} \leftarrow \frac{\sum_{i \in \bar{\Omega}_j} \left( R_{ij} + w^{i}_{k} h_{kj} \right) w^{i}_{k}}{\sum_{i \in \bar{\Omega}_j} \left( w^{i}_{k} \right)^{2}} \qquad (5)$$

where $R_{ij} = A_{ij} - \mathbf{w}^{i} \mathbf{h}_{j}$ for all $(i,j) \in \Omega$.

To solve the MF problem, SAP iterates through each rank $k = 1, \dots, K$, parallelizing the updates of $w^{i}_{k}$ over blocks of rows in $\mathbf{W}$, and parallelizing the updates of $h_{kj}$ over blocks of columns in $\mathbf{H}$. Specifically:

  • Step 1: For MF, prioritizing variables within a full column or row results in minimal benefit, hence we use a uniform distribution for $p(j)$.

  • Step 2: In MF, each coefficient can be independently updated without interference. Thus, the dependency function $d$ is identically zero, and any coefficients can be grouped together.

  • Step 3: Because the observed matrix entries often follow a power-law distribution, we perform load balancing by grouping rows and columns into larger blocks, such that the nonzero entries of $\mathbf{A}$ are equally distributed across blocks (a sketch of this step follows this list).

  • Step 4: Since $p$ and $d$ are constant functions, no modification is required.
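
Below is a minimal Python sketch of the load-balancing step referenced in Step 3: rows (or, symmetrically, columns) are assigned greedily to the block with the smallest accumulated nonzero count, so each worker receives a similar number of observed entries. The greedy heaviest-first heuristic and all names here are our illustrative choices; the paper does not prescribe a specific merging procedure.

```python
import heapq
import numpy as np
from scipy.sparse import random as sparse_random

def balance_blocks(nnz_per_row, n_blocks):
    """Group row indices into n_blocks so that the total number of
    nonzeros per block is roughly equal (greedy heaviest-first)."""
    order = np.argsort(nnz_per_row)[::-1]           # heaviest rows first
    heap = [(0, b, []) for b in range(n_blocks)]    # (load, block id, row list)
    heapq.heapify(heap)
    for i in order:
        load, b, rows = heapq.heappop(heap)         # currently lightest block
        rows.append(int(i))
        heapq.heappush(heap, (load + int(nnz_per_row[i]), b, rows))
    return [rows for _, _, rows in sorted(heap, key=lambda item: item[1])]

# Toy usage: skewed row degrees split across 8 workers.
A = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)
blocks = balance_blocks(np.diff(A.indptr), n_blocks=8)
```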

3 STRADS: an Efficient, Distributed Implementation of SAP

Figure 3: A high-level view of the STRADS architecture. STRADS begins by selecting model variables from an importance distribution $p$ (to improve the convergence rate), and then groups these variables into blocks according to a dependency function $d$ — the idea being to avoid scheduling highly-dependent variables in parallel on different blocks (to maintain correctness). STRADS then merges these blocks into larger blocks (for load balancing), and dispatches these load-balanced blocks to workers. The workers then report their updated blocks back to STRADS, which uses them to update the importance distribution $p$ and dependency function $d$. This process constitutes one iteration, which is repeated with a new set of blocks (i.e. dynamic structure). Our implementation of STRADS is fully distributed over multiple machines.

Now we describe STRADS (Figure 3), a distributed implementation of the SAP scheduling model, which can use any number of machines to provide scheduling for an arbitrary degree of parallelism. Having a distributed implementation ensures that STRADS will scale to meet the computation and memory demands of finding dynamic block structure on extremely large models and input data. The key ideas behind STRADS are (1) each scheduler thread is responsible for scheduling its own disjoint set of variables (and only those variables), and (2) the scheduler threads take turns to send blocks to the worker clients.

Implementation Overview

Suppose the user invokes $S$ STRADS scheduler threads (which can be on different machines) to solve an ML model with $J$ variables. STRADS proceeds as follows: First, each thread is randomly assigned a disjoint subset of the variables (with no overlaps) before the algorithm starts; these assignments remain fixed throughout. Next, all threads execute the four SAP steps — (1) select variables from $p_s(j)$, where $p_s(j)$ is the importance distribution over the variables assigned to thread $s$, (2) use those variables to form the set of dynamic variable blocks according to $d(j,k)$, (3) merge blocks to get a new set of blocks that are load-balanced, and distribute them to workers, and (4) receive the updated blocks from workers, and update $p_s(j)$ and $d(j,k)$. The STRADS scheduler threads take turns to dispatch to the workers: thread 1 dispatches first, then thread 2, and so on until thread $S$, before returning to thread 1.

In our experiments, we assume the entire input data is available to every machine, though we note that it can just as easily be stored in a distributed key-value store or parameter server. Each scheduler thread maintains and stores only the variables assigned to it. We implement STRADS in C++, using the Boost libraries and the 0MQ 3.2.4 library [11] for inter-machine network communications.
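
A toy Python sketch of the two key ideas (fixed disjoint variable assignments per scheduler thread, and round-robin dispatch turns) is given below; the real system uses C++ threads and 0MQ messaging, and the callback names here are placeholders.

```python
import numpy as np

def partition_variables(J, n_schedulers, seed=0):
    """Randomly split variable indices into disjoint, fixed subsets,
    one per scheduler thread (no overlaps, fixed for the whole run)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(J), n_schedulers)

def round_robin_dispatch(n_schedulers, n_rounds, prepare_blocks, dispatch):
    """Scheduler threads take turns: thread 0, 1, ..., S-1, then back to 0."""
    for r in range(n_rounds):
        s = r % n_schedulers
        blocks = prepare_blocks(s)   # thread s runs SAP steps 1-3 on its own variables
        dispatch(blocks)             # workers update the blocks and report back to s
```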

Programming Interface

Per the SAP model, STRADS requires users to define the model-specific functions $p$ and $d$, via the following interface (an illustrative sketch follows the list):

  • define_sampling(p), where p is a function object such that p(j) returns the probability of variable j. STRADS also provides p with an interface to access the input data, as well as the model variables (on the current STRADS thread); we shall not go into the details for space reasons.

  • define_dependency(d), where d is a function object such that d(j,k) returns the dependency between variables j and k. Like p, d has access to the model variables and input data through STRADS.
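
For illustration only, the snippet below shows what a user-supplied pair of callbacks might look like for the Lasso case, written as a Python analogue rather than the actual C++ interface; the arrays X and importance_weights are placeholder state, and the registration calls are shown only in comments.

```python
import numpy as np

# Placeholder state; in STRADS this lives with the scheduler threads.
X = np.random.randn(100, 50)
X /= np.linalg.norm(X, axis=0)                 # standardized covariates
importance_weights = np.ones(X.shape[1])       # e.g. (delta beta_j)^2 + eta

def p(j):
    """Sampling callback: probability of selecting variable j."""
    return importance_weights[j] / importance_weights.sum()

def d(j, k):
    """Dependency callback: correlation |x_j^T x_k| between covariates."""
    return abs(X[:, j] @ X[:, k])

# In the C++ API these would be registered via define_sampling(p)
# and define_dependency(d).
```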

Properties of STRADS

From a distributed systems perspective, the round-robin design of STRADS carries the following benefits: one, it makes effective use of distributed cluster memory — every scheduler thread only needs to store the state of the variables assigned to it. Two, the scheduler threads require almost no communication between each other; they just need to coordinate taking turns to serve workers. Three, the round-robin arrangement allows each scheduler thread more time to prepare for dispatch — if there are $S$ scheduler threads, then each thread has $S$-fold more time. This prevents situations where workers have to wait for the scheduler, and is essentially a form of hiding computational latency.

STRADS is essentially a bootstrap of the SAP model. Even though the importance distribution $p(j)$ is now split into $S$ distributions $p_s(j)$, since the number of variables greatly exceeds $S$ for Big Model problems, each $p_s(j)$ will be approximately similar in shape to the original $p(j)$. Furthermore, STRADS preserves algorithm correctness: because blocks from different scheduler threads will be updated at different iterations, there is no need to cross-check dependencies for blocks between threads. Load balancing is also unaffected, provided that the number of variables per thread is sufficiently large (so that enough blocks are produced). Thus, STRADS is a close, bootstrapped approximation to SAP scheduling for Big Models with a large number of variables and/or parameters.

4 Theoretical Analysis of Parallel CD Under SAP

The SAP model specifies a general-purpose dynamic block scheduler for distributed ML algorithms; given a specific ML algorithm, the user must input appropriate definitions for $p$ (important variable subsampling) and $d$ (dependency checking). To provide a theoretical analysis of parallel CD under SAP, let us consider the definitions for Lasso regression in Section 2.1 — under them, we will show that SAP approximately obtains the optimal Lasso convergence rate for $P$ worker threads. We formally re-state those definitions:

  1. Select a subset of variables in the $t$-th iteration: choose $L'$ Lasso coefficients (variables), i.e., $\mathcal{S}^{(t)} = \{\beta_{j_1}, \dots, \beta_{j_{L'}}\}$, where the indices are selected from the distribution $p(j) \propto (\delta\beta_j^{(t-1)})^2 + \eta$, where $\delta\beta_j^{(t-1)} = \beta_j^{(t-1)} - \beta_j^{(t-2)}$ and $\eta$ is a small constant.

  2. Group the coefficients into jobs to be dispatched in the $t$-th iteration, where each job contains exactly one coefficient. More precisely, find a set $B^{(t)} \subseteq \mathcal{S}^{(t)}$ of coefficients to be dispatched

    such that $|\mathbf{x}_j^T \mathbf{x}_k| \leq \theta$ for all $\beta_j, \beta_k \in B^{(t)}$ with $j \neq k$.

    Here $\mathbf{x}_j^T \mathbf{x}_k$ represents the correlation between the $j$-th covariate and the $k$-th covariate; we assume $\mathbf{X}$ has been standardized for Lasso.

  3. Dispatch $B^{(t)}$ to parallel workers.

  4. Receive the updated $\{\beta_j : j \in B^{(t)}\}$ from the workers, to be used in steps 1-2 in the next iteration.

Below, we present highlights from our theoretical results. Our analysis is based on the sampling distribution $p(j) \propto (\delta\beta_j^{(t)})^2$ (in practice, we approximate $\delta\beta_j^{(t)}$ with $\delta\beta_j^{(t-1)}$, since $\delta\beta_j^{(t)}$ is unavailable at the $t$-th iteration before computing $\beta_j^{(t)}$; we introduced $\eta$ to give all $\beta_j$'s non-zero probability to account for the approximation), and on the allowed model dependency threshold $\theta$ at each iteration — this is unlike the global condition for all iterations used in [2, 23].

For theoretical analysis, we rewrite problem (1) as $\min_{\boldsymbol{\beta} \geq 0} \frac{1}{2}\|\mathbf{y} - \tilde{\mathbf{X}}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j} \beta_j$, where $\tilde{\mathbf{X}}$ contains $2J$ features obtained by duplicating the original features with opposite sign (see appendix for details), and $\beta_j \geq 0$ for all $j$. We define the Lasso objective as $F(\boldsymbol{\beta})$, and the following theorem shows that $p(j) \propto (\delta\beta_j^{(t)})^2$ is approximately optimal for SAP.

Theorem 1.

Suppose $S^{(t)}$ is the set of indices of coefficients updated in parallel at the $t$-th iteration, and $\theta^{(t)}$ is sufficiently small such that $|\mathbf{x}_j^T \mathbf{x}_k \, \delta\beta_j^{(t)} \delta\beta_k^{(t)}| \leq \epsilon$ for all $j \neq k \in S^{(t)}$, where $\epsilon$ is a small positive constant. Then, the sampling distribution $p(j) \propto (\delta\beta_j^{(t)})^2$ approximately maximizes a lower bound to $\mathbb{E}_{S^{(t)}}\!\left[ F(\boldsymbol{\beta}^{(t)}) - F(\boldsymbol{\beta}^{(t)} + \Delta\boldsymbol{\beta}^{(t)}) \right]$, the expected decrease in the objective function after updating coefficients indexed by $S^{(t)}$, where $\delta\beta_j^{(t)}$ is defined as

$$\delta\beta_j^{(t)} = \max\left\{ -\beta_j^{(t)}, \; -\big( \tilde{\mathbf{x}}_j^T (\tilde{\mathbf{X}}\boldsymbol{\beta}^{(t)} - \mathbf{y}) + \lambda \big) \right\}. \qquad (6)$$

This means that our scheduling strategy for parallel Lasso approximately maximizes the lower bound on the progress per iteration (we defer the proof to the appendix).

We now discuss SAP's scalability with respect to the Shotgun algorithm, which selects coefficients uniformly at random [2]. First, SAP always achieves the maximum effective parallelization allowed by the input data, by actively minimizing the interference caused by parallel updates. In contrast, Shotgun's effective parallelization is reduced whenever the (randomly drawn) coefficients happen to be correlated, thus producing interference when updated in parallel. Furthermore, SAP always chooses coefficients so as to decrease the objective function as much as possible, whereas Shotgun is agnostic to coefficient importance. Because of these two factors, SAP has superior theoretical (and, as we shall show, empirical) scalability over Shotgun.
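
As a rough intuition check (not a reproduction of the paper's experiments), the following self-contained Python simulation, under our own synthetic setup, compares uniform random coordinate selection with selection proportional to the squared most-recent change; both run sequential coordinate descent on the same Lasso problem and report the final objective.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def run_cd(X, y, lam, n_updates, prioritized, seed=0):
    """Sequential CD on the Lasso objective; coordinates are chosen either
    uniformly at random or proportionally to (last change)^2 + eta."""
    rng = np.random.default_rng(seed)
    J = X.shape[1]
    beta = np.zeros(J)
    delta = np.full(J, 1.0)                       # optimistic initial priorities
    for _ in range(n_updates):
        if prioritized:
            w = delta ** 2 + 1e-8
            j = rng.choice(J, p=w / w.sum())
        else:
            j = rng.integers(J)
        r_j = y - X @ beta + X[:, j] * beta[j]
        new = soft_threshold(X[:, j] @ r_j, lam)
        delta[j], beta[j] = new - beta[j], new
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2000))
X /= np.linalg.norm(X, axis=0)                    # standardize columns
beta_true = np.zeros(2000); beta_true[:20] = rng.standard_normal(20)
y = X @ beta_true + 0.01 * rng.standard_normal(200)

print("uniform    :", run_cd(X, y, lam=0.05, n_updates=3000, prioritized=False))
print("prioritized:", run_cd(X, y, lam=0.05, n_updates=3000, prioritized=True))
```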

5 Experimental Results

We show that the SAP model (implemented as STRADS) outperforms unstructured model parallelism, which selects variables uniformly at random for parallel execution, as well as static block-structured parallelism, which does not change block structures during execution. We demonstrate this on two exemplar applications, parallel Lasso and parallel MF; experimental details follow:

Datasets

For parallel Lasso, we used one real and one synthetic dataset. Our real dataset was the Alzheimer's disease (AD) dataset [10], containing 463 samples and 508,999 covariates (single nucleotide polymorphisms) for $\mathbf{X}$, and real-valued APOE gene expression levels for $\mathbf{y}$. For synthetic data, we generated 450 samples with 1,000,000 features, and a real-valued output generated from 10,000 true non-zero coefficients. For parallel MF, we used the Netflix [6] and Yahoo-Music [16] datasets. The Netflix dataset contains 480,189 users versus 17,770 movies (100,480,507 non-zero entries), while the Yahoo-Music dataset contains 1,948,882 users versus 98,213 songs (115,579,440 non-zero entries).

Experimental platform and STRADS configurations

We ran the experiments on a compute cluster with the following machine specifications: 64 cores (AMD Opteron, 1.4 GHz), 3TB SATA drive, 128GB RAM, and a 10GbE network interface. The parallel Lasso and MF applications were tested on different platforms. We ran the parallel Lasso application in the distributed setting (multiple machines) using from 60 to 240 cores, and parallel MF in the single multi-core machine setting using from 4 to 16 cores. STRADS was configured as follows: for Lasso, we used fixed settings of the sample size, dependency threshold $\theta$, and $\eta$; for MF, we partitioned variables such that each block contains $M/P$ or $N/P$ variables, where $P$ is the number of cores.

5.1 Experiments on Parallel Lasso

Figure 4: Distributed parallel Lasso results for three scheduling models: SAP (scheduled dynamic block structures using STRADS), static block structures, and uniform random selection of variables (no structures). The first row shows results for the Alzheimer’s disease (AD) dataset, while the second row corresponds to our synthetic dataset. We vary the number of processor cores from 60 to 240.

Fig. 4 shows objective-versus-time plots for STRADS (the SAP model with dynamic block structures), a static correlation scheduler (static block structures), and a random scheduler (no block structures), over several machine configurations. The static block scheduler uses the following strategy: pick a set of variables uniformly at random, and dispatch only variables that are nearly independent (i.e., whose pairwise correlation falls below a threshold). For unstructured scheduling, we used the Shotgun approach [2], which selects variables uniformly at random; note that the original Shotgun paper was limited to a single multi-core machine, whereas our experiments bring Shotgun into the distributed setting.

The first row of Fig. 4 contains AD data results, while the second row contains synthetic data results, over 60, 120, and 240 cores. In all cases, STRADS converged much faster than the other two schedulers. We point out three phenomena observed in these experiments: first, STRADS consistently generates an early sharp drop in the objective function value; this is because after all variables have been updated at least once, STRADS has a full estimate of the importance distribution $p(j)$, so it can prioritize more important variables. This results in a dramatic reduction in objective value.

Second, STRADS exhibits not only a faster convergence rate, but also a substantially better objective function value at convergence. It is possible that the other two approaches would eventually achieve the same objective value as STRADS. In practice, however, algorithms are run with an automatic stopping condition — typically a minimum threshold on the change in objective value. Under such a stopping condition, STRADS achieves a better final objective value than the other schedulers.

Finally, we observe that static correlation scheduling only beats random scheduling by a significant margin when using a large number of cores (e.g., 240). The reason is that, with a low core count, random scheduling is unlikely to select highly correlated variables, and hence static block structures do not yield any benefit. Once the core count increases, the probability of picking multiple correlated variables goes up, and static correlation scheduling begins to show an advantage. However, STRADS dynamic scheduling based on variable importance yields an even greater improvement.

5.2 Experiments on Parallel Matrix Factorization

Figure 5: Single-machine parallel Matrix Factorization results for two scheduling models: SAP (using STRADS), and a model with no load balancing. The first row shows results for the Netflix dataset, while the second row corresponds to the Yahoo-Music dataset. We vary the number of processor cores from 4 to 16.

Fig. 5 compares, for 4 to 16 cores on a single machine, parallel MF using STRADS, versus a scheduler with no load balancing (that partitions the matrix rows and columns uniformly, without regard to the number of non-zero entries in each row/column). This experiment is intended to demonstrate the performance gains from load balancing through STRADS.

On the NetFlix dataset (first row of Fig. 5), STRADS exhibits a slightly better convergence rate for 4 and 8 cores, but an insubstantial benefit for 16 cores. The reason is one of sampling statistics: when using a small number of cores/blocks and uniformly sampling over rows and columns, the final distribution of block sizes (i.e. number of non-zero entries) exhibits a large variance — that is to say, some blocks can be much larger than others. Hence, the largest block becomes a severe bottleneck. However, once the number of cores/blocks is increased, the variance in block sizes drops, and the bottleneck is thus reduced.

For the Yahoo-Music dataset (second row of Fig. 5), STRADS exhibits much clearer benefits from load balancing. Moreover, unlike the NetFlix dataset, the gain due to load balancing actually increases with more cores. It turns out that the non-zero entries in the Yahoo-Music dataset are heavily biased towards a few items (i.e. strong power-law behavior) — hence without load balancing, algorithm performance is no better than a single thread due to bottlenecking on the extreme users. STRADS load balancing resolves this problem, allowing for full parallelism (which explains the widening gap w.r.t. the naive scheduler at higher core counts).

6 Related Work and Discussion

Variable scheduling is a key component of many distributed platforms such as Pregel [20], MapReduce [3], and GraphLab [18]. For example, GraphLab partitions graph data to minimize communication and synchronization costs between connected nodes; furthermore, GraphLab provides various consistency schemes to synchronize dependent parameters or variables. Pregel is designed to process large-scale graphs, and schedules computations using workflow graphs. Hadoop distributes the data to workers in a manner that limits communication due to map-reduce synchronization. Our work differs from these scheduling approaches in that we consider not only static information embedded in the data, but also dynamic information such as transient parameters or variables learned at runtime.

Algorithms for our two exemplar applications, parallel Lasso and MF, have been extensively studied in the literature: examples include randomized block-coordinate descent [22], dual decomposition [1], parallel stochastic gradient descent [19, 21], and parallel coordinate descent [2, 27]. These works differ from ours in the sense that we suggest a general-purpose dynamic scheduler to boost the performance and correctness of parallel ML algorithms, rather than an algorithm tailored to a specific application. In fact, we used existing algorithms for parallel Lasso and MF without any modification. In that regard, STRADS can be combined with any new developments in parallel Lasso or MF algorithms, so as to yield further performance improvements.

Future work includes harnessing STRADS to accelerate diverse Big Model applications. By considering the unique ML properties of each application, we can develop principles for analyzing intermediate variables/parameter values in the context of the data, in order to formulate the importance distribution and dependency function necessary for high performance model-parallelism with STRADS. Furthermore, we will explore principled ways to improve the efficiency of STRADS, such as increasing the size of blocks to be dispatched while still tightly controlling interference effects between model variables — in order to minimize communication costs between workers and scheduler and thus maximize CPU utilization.

Appendix: Proof of Theorem 1

Preliminaries

The $\ell_1$-regularized regression [25] takes the form of an optimization program:

$$\min_{\boldsymbol{\beta}} \; L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) + \lambda \|\boldsymbol{\beta}\|_1 \qquad (7)$$

where $\lambda$ denotes the regularization parameter, and $L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta})$ is a non-negative convex loss function. We assume that $\mathbf{X}$ and $\mathbf{y}$ are standardized and consider (7) without an intercept. For simplicity but without loss of generality, we let $L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$. However, it is straightforward to use other loss functions such as logistic-loss using the same approach shown in [2].

For theoretical analysis, we rewrite problem (7) as:

$$\min_{\boldsymbol{\beta} \geq 0} \; F(\boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{y} - \tilde{\mathbf{X}}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{2J} \beta_j \qquad (8)$$

where $\tilde{\mathbf{X}} = [\mathbf{X}, -\mathbf{X}]$ contains duplicated features with opposite sign such that $\tilde{\mathbf{x}}_j = \mathbf{x}_j$ and $\tilde{\mathbf{x}}_{j+J} = -\mathbf{x}_j$ for all $j = 1, \dots, J$, and $\beta_j \geq 0$ for all $j = 1, \dots, 2J$. Note that problems (7) and (8) are equivalent optimization problems [2]. To optimize problem (8), we can use the parallel coordinate descent method (Shotgun) proposed by [2], with the update rule $\beta_j \leftarrow \beta_j + \delta\beta_j$, where $\delta\beta_j$ is given by

$$\delta\beta_j = \max\left\{ -\beta_j, \; -\big( \nabla_j L(\boldsymbol{\beta}) + \lambda \big) \right\},$$

where $\nabla_j L(\boldsymbol{\beta}) = \tilde{\mathbf{x}}_j^T (\tilde{\mathbf{X}}\boldsymbol{\beta} - \mathbf{y})$.

Theorem 1.

Suppose $S^{(t)}$ is the set of indices of coefficients updated in parallel at the $t$-th iteration, and $\theta^{(t)}$ is sufficiently small such that $|\mathbf{x}_j^T \mathbf{x}_k \, \delta\beta_j^{(t)} \delta\beta_k^{(t)}| \leq \epsilon$ for all $j \neq k \in S^{(t)}$, where $\epsilon$ is a small positive constant. Then, the sampling distribution $p(j) \propto (\delta\beta_j^{(t)})^2$ approximately maximizes a lower bound to $\mathbb{E}_{S^{(t)}}\!\left[ F(\boldsymbol{\beta}^{(t)}) - F(\boldsymbol{\beta}^{(t)} + \Delta\boldsymbol{\beta}^{(t)}) \right]$, the expected decrease in the objective function after updating coefficients indexed by $S^{(t)}$, where $\delta\beta_j^{(t)}$ is defined as

$$\delta\beta_j^{(t)} = \max\left\{ -\beta_j^{(t)}, \; -\big( \tilde{\mathbf{x}}_j^T (\tilde{\mathbf{X}}\boldsymbol{\beta}^{(t)} - \mathbf{y}) + \lambda \big) \right\}. \qquad (9)$$
Proof.

From assumption 3.1 in [2], we have

where . To simplify notation, let us omit the superscript representing the $t$-th iteration.

Suppose the index $j$ of a coefficient is drawn from the sampling distribution $p(j)$, and a pair of indices $(j,k)$ is drawn from the product distribution $p(j)\,p(k)$. Taking the expectation with respect to $S^{(t)}$:

(10)
(11)
(12)
(13)
(14)
(15)

In the derivation above, we used $\mathbf{x}_j^T \mathbf{x}_k \, \delta\beta_j \delta\beta_k = 0$ whenever $|\mathbf{x}_j^T \mathbf{x}_k| > \theta$, because $\beta_j$ and $\beta_k$ cannot be updated in parallel in that case. Recall that we find the coefficients to be updated in parallel by solving:

find $B^{(t)} \subseteq S^{(t)}$ such that $|\mathbf{x}_j^T \mathbf{x}_k| \leq \theta$ for all $j \neq k \in B^{(t)}$.

Further, we used our assumption that $|\mathbf{x}_j^T \mathbf{x}_k \, \delta\beta_j \delta\beta_k| \leq \epsilon$ for all $j \neq k \in S^{(t)}$ for small $\epsilon$. Thus, the lower bound of the expected decrease of the objective is maximized when $p(j) \propto (\delta\beta_j)^2$. Furthermore, $\delta\beta_j^{(t)}$ is well approximated by $\delta\beta_j^{(t-1)}$, which is available before the update. Therefore, $p(j) \propto (\delta\beta_j^{(t-1)})^2$ gives us an approximately optimal distribution for maximizing the lower bound of the expected decrease of the objective. ∎

References

  • [1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–124, 2011.
  • [2] Joseph K Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. ICML, 2011.
  • [3] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
  • [4] J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
  • [5] Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 99:2001–2049, 2010.
  • [6] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 69–77. ACM, 2011.
  • [7] Jayanta K Ghosh and RV Ramamoorthi. Bayesian nonparametrics. Springer, 2003.
  • [8] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 399–406, 2010.
  • [9] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.
  • [10] Harvard Brain Tissue Resource Center. Downloaded from Sage Bionetworks: https://synapse.prod.sagebase.org/#Synapse:4505, 2013.
  • [11] Pieter Hintjens. ZeroMQ: Messaging for Many Applications. O’Reilly, 2013.
  • [12] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th international conference on Machine learning, pages 408–415. ACM, 2008.
  • [13] Tommi S Jaakkola. 10 tutorial on variational approximation methods. Advanced mean field methods: theory and practice, page 129, 2001.
  • [14] U Kang, Charalampos E Tsourakakis, and Christos Faloutsos. Pegasus: A peta-scale graph mining system implementation and observations. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 229–238. IEEE, 2009.
  • [15] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. Graphchi: Large-scale graph computation on just a pc. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 31–46, 2012.
  • [16] Yahoo! Labs. Webscope from yahoo! labs. http://webscope.sandbox.yahoo.com/catalog.php?datatype=r, 2013.
  • [17] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
  • [18] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 2012.
  • [19] Xin Luo, Huijun Liu, Gaopeng Gou, Yunni Xia, and Qingsheng Zhu. A parallel matrix factorization based recommender by alternating stochastic gradient decent. Engineering Applications of Artificial Intelligence, 25(7):1403–1412, 2012.
  • [20] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135–146. ACM, 2010.
  • [21] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, pages 1–26, 2011.
  • [22] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1–38, 2012.
  • [23] Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, and David Haglin. Feature clustering for accelerating parallel coordinate descent. NIPS, 2012.
  • [24] Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th international conference on World wide web, pages 607–614. ACM, 2011.
  • [25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
  • [26] Ian H Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.
  • [27] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 765–774. IEEE, 2012.
  • [28] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, 2010.