1 Introduction
Sensory techniques and digital storage media have improved at a breakneck pace, leading to massive collections of data. The resulting so-called Big Data problems have been a common focus of recent work on scalable machine learning, and numerous algorithmic and system solutions have been proposed to alleviate the time bottleneck due to Big Data by exploring various heuristic or principled strategies for data parallelism [3, 18, 20, 28]. However, another important aspect of Big ML is what we refer to as Big Model problems, in which models with millions if not billions of variables and/or parameters (e.g., as one would see in a deep network or a large-scale topic model) must be estimated from big (or even modestly-sized) data; such Big Model problems seem to have received relatively less attention from the community. In this paper, we investigate how to facilitate effective and sound parallelization of inference over a large number of variables and/or parameters in such models, an issue we call model parallelism.
Model-parallel inference is necessary for many ultra-high dimensional problems that have recently emerged in modern applications. For example, in genetics and personalized medicine, the number of model variables (e.g., candidate genetic variations) can easily exceed millions; and in e-commerce applications such as personalized ads recommendation, the so-called “interest genome” derived for every person from their multimedia social trace is also very high-dimensional. These high-dimensional problems must be solved quickly to be practically useful for patients or consumers. Sequential computation, even on a powerful, high-end single machine, is therefore usually not an option, and distributing the computation over a large number of processors in a cluster becomes a natural choice. It is important to note that model-parallelism is not the same as data-parallelism, and poses very different challenges: model-parallelism requires model variables to be partitioned for parallel updates with tight synchronization, whereas data-parallelism involves computation on (usually) independent data subsets. Our focus in this paper is to systematically study the algorithm, system, and theory issues that support model-parallelism.
A major challenge to model-parallelism is that many existing algorithms for ML are derived with the assumption of sequential iteration over variables: for example, optimization algorithms for Lasso [25], matrix factorization [27], sparse coding [8], and support vector machines [12]; MCMC algorithms for topic models [9], Bayesian nonparametric models [7], and direct posterior regularization models [5]. However, the convergence rates and correctness guarantees for these algorithms do not always extend to parallel execution over model variables. In other words, naive model-parallelization can slow down the convergence rate or even lead to failure of ML algorithms [2].
In this paper, we focus on the problem of how to parallelize ML algorithms over different model variables. Recent efforts toward parallel ML over model variables can be divided into two approaches: (1) unstructured distributed ML, and (2) structured distributed ML. The first approach includes algorithms that select model variables uniformly at random for parallel execution [2]; the second approach uses the problem structure to select which variables to update in parallel, thus speeding up per-iteration convergence rates [23], boosting iteration frequency by improving distributed system performance (e.g. minimizing network communications or disk I/O) [14, 15], and guaranteeing algorithm correctness [23]. While structured distributed ML has benefits over unstructured approaches, there is an additional cost to finding such structures. In this paper, we adopt the approach of structured distributed ML, but in a way that departs significantly from conventional strategies, as inspired by the following insights on how structures in an ML program can be explored and exploited.
Static Block Structures:
Static block structures, which are often assumed to be intrinsic to a model, discovered before the algorithm starts, and held fixed during execution, have been widely used for efficient parallelization of ML algorithms. Examples include block Gibbs sampling [17], structured mean field approximations [13], graph partitioning for parallel executions in GraphLab [18], parallel coordinate descent for matrix factorization [27], and the block-greedy coordinate descent algorithm [23]. The key insight of this approach is that, if decoupled blocks of variables are updated in parallel, then both inconsistencies due to parallel variable updates as well as communications between different blocks are minimized, resulting in improved ML algorithm convergence rates and guaranteed correctness. However, static block structures must be discovered prior to starting an ML algorithm, resulting in a large, unavoidable runtime cost. Furthermore, static block structures fail to capture the dynamic, changing aspects of ML algorithms driven by data (such as how variables and parameters change throughout execution), and thus cannot obtain a holistic view of an ML problem’s structure.
Dynamic Block Structures:
In reality, model block structures are not completely static, but can dynamically change at runtime according to the values of parameters and variables due to data-driven updates. Notably, transient block structures can arise due to recently updated parameters and variables. Taking $\ell_1$-regularized regression as an example, let us consider parallel updates of the coefficients $\beta_j$ and $\beta_k$ at the $t$-th iteration. If $\beta_j$ stays zero at the $(t-1)$-th and $t$-th iteration, then $\beta_j$ does not affect the update of $\beta_k$, even when the correlation (i.e. dependency) between the covariates $\mathbf{x}_j$ and $\mathbf{x}_k$ is large. Such dynamic, runtime structure discovery is critical for distributed ML algorithms, because static block structure, while useful, relies on finding separable blocks from the input data and the model’s a priori topology, which is a challenging task on many real datasets [26]. Furthermore, dynamic block structures have computational advantages over their static counterparts: because dynamic structure is discovered online during algorithm execution, its cost can be amortized over multiple processors working in the background (as opposed to the lump sum cost of static structure discovery).
Projected Progress on Dynamic Block Structures:
While dynamic block structures ensure correct parallel execution, they do not directly expedite or improve ML algorithm convergence rates. For that, it is necessary to account for (1) each block’s projected progress or importance (e.g. expected objective value improvement upon updates) and (2) the actual workload of each block (e.g. number or magnitude of variables to be updated), when dispatching blocks of variables to parallel workers. Continuing the $\ell_1$-regularized regression example, if we prioritize the coefficients $\beta_j$ that are changing the most with each update, we will speed up the decrease in the loss function per variable update. Moreover, by ensuring every worker gets a similar number of variables to update, we perform load balancing, thus preventing situations where workers with fewer variables end up sitting idle.
In this paper, we seek to explore Structure-Aware Parallelism (SAP) inspired by the above insights, at both the algorithmic front-end and system back-end levels. Accordingly, we present a new model-parallel ML strategy called STRADS, or STRucture-Aware Dynamic Scheduler. Figure 1 showcases the key advantage resulting from this new strategy: under a dynamic block-based parallel approach, the convergence of an ML algorithm can escape from the slow-progressing trajectory characteristic of static block-based parallelism, thus arriving at a better solution much more quickly. This is made possible by the dynamic approach’s ability to adapt to changing structure and execution status, as an ML program (in this case, parallel Lasso) progresses.
More precisely, STRADS is a statistically motivated scheduler that executes distributed ML algorithms correctly and with high convergence rates by jointly considering dynamic block structures, load balancing, and algorithmic progress made by updates on variable blocks. Figure 2 outlines the basic rationale behind STRADS, while Figure 3 sketches out the system architecture. We apply STRADS to two example applications: parallel Lasso and parallel matrix factorization (MF), using coordinate descent algorithms (we expect STRADS to apply to other algorithms on additional ML programs, which we will subject to future case studies). In the Lasso example, we showcase the benefits of using dynamic block structures based on the runtime values of coefficients; and in the MF example, we demonstrate the advantages of load balancing for parallel execution. Furthermore, we provide a theoretical analysis that proves our scheduling scheme for Lasso is approximately optimal. Our experiments show that for Lasso and MF, STRADS yields faster convergence than unstructured and static block-based approaches, as well as better final objective function values (for Lasso).
Notation
We denote a matrix with $N$ samples and $J$ variables as $\mathbf{X} \in \mathbb{R}^{N \times J}$, and a vector with $N$ samples as $\mathbf{y} \in \mathbb{R}^{N}$. We represent a column index by subscript, and a row index by superscript. We denote iteration index by parenthesized superscript, matrices by boldfaced uppercase letters, and vectors by boldfaced lowercase letters. We also denote the “dependency strength” (e.g. correlation) between the $j$-th and $k$-th variables by $d(x_j, x_k)$, and a set of blocks of variables by $\mathcal{B}$.
2 Structure-Aware Parallelism (SAP) for Dynamic Block Scheduling
We begin with an outline of the scheduling model upon which our proposed approach is built, which we call Structure-Aware Parallelism (SAP). Suppose there are $J$ model variables, and we have $P$ parallel workers to update them. SAP iterates over four steps:

1. Draw a set of variables $U$ to update at iteration $t$, from an “importance” distribution $p(j)$. The idea is to choose variables that give high expected improvement in the loss function.

2. From the set $U$, find a set of variable blocks $\mathcal{B}$ such that the dependency between any two distinct blocks in $\mathcal{B}$ stays below $\rho$, where $\rho$ is a user-defined parameter. Here, the dependency between two blocks is derived from $d(x_j, x_k)$ for variables $x_j$ and $x_k$ taken from the two blocks, and $d(x_j, x_k)$ is a user-defined measure of coupling between $x_j$ and $x_k$ (e.g. correlation or partial correlation).

3. Merge blocks of variables in $\mathcal{B}$ until every block has a similar workload, thus achieving load balance. We denote the set of regrouped blocks by $\mathcal{B}'$, and we dispatch the blocks of variables in $\mathcal{B}'$ to the $P$ workers.

4. When the workers have finished and returned their updated blocks to SAP, SAP updates the importance distribution $p(j)$ and the dependency function $d$ according to the updated blocks. SAP then iterates steps (1)-(4) until the ML algorithm converges.
The first step maximizes convergence rates, by selecting variables that will contribute the most to loss function improvement. This is carried out by sampling variables from the distribution $p(j)$, which assigns higher probability to variables whose recent updates have had higher impact on the loss function. Note that $p(j)$ changes across iterations, because the importance of each variable changes as the algorithm progresses. This is a key difference between dynamic and static block structures.
The second step ensures correctness of an ML algorithm, by decoupling only those blocks of variables with little to no interdependency. It is well-known that simultaneous updates to strongly coupled variables can cause interference; this not only slows down convergence but can even lead to algorithm failure (e.g. divergence) [2]. By organizing variables into nearly-independent blocks, we can control the degree of interference, thus guaranteeing correctness.
The third step performs load balancing. Because blocks can greatly vary in size, situations can arise where most workers end up waiting for the worker with the biggest block to finish — the “curse of the last reducer” [24]. We address this problem by merging blocks until all remaining blocks contain similar workloads.
The fourth step is a “progress monitoring” step, in which SAP estimates the progress each variable contributes to algorithm convergence. Depending on the ML algorithm being run, the definition of progress can vary: examples include the magnitude of change in each variable, or the change in residuals due to variable updates. SAP then uses this information to update $p(j)$ (e.g. by increasing the probability of faster-changing variables) and $d$ (e.g. by removing dependencies between variables that have reached zero).
We note that SAP is only one scheme for dynamic block scheduling, and other designs are possible. The key advantage of our design is computational efficiency: step 1 minimizes the computational cost of scheduling by reducing the set of variables from which we must find block structures, which is essentially a bootstrap-based approach to structure discovery. This is important because the scheduler must be able to find block structures faster than workers consume them (i.e. the scheduler must not be a bottleneck).
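To make the four steps concrete, the following is a minimal single-process sketch of steps 1-3 of one scheduling round. It is not the STRADS implementation: the function name `sap_round`, its arguments, the greedy dependency filter, and the lightest-block merging rule are all assumptions chosen for illustration, and step 4 (refreshing the importance weights) is left to the caller.

```python
import random


def sap_round(importance, dependency, workload, num_draw, rho, num_workers, rng):
    """Run steps 1-3 of one scheduling round and return load-balanced blocks.

    importance: dict variable -> nonnegative weight (unnormalized p(j))
    dependency: function (j, k) -> coupling strength d(j, k)
    workload:   dict variable -> cost of updating that variable
    """
    variables = list(importance)
    weights = [importance[j] for j in variables]

    # Step 1: draw candidates with probability proportional to importance
    # (sampling with replacement, then dropping repeats).
    candidates = []
    for j in rng.choices(variables, weights=weights, k=num_draw):
        if j not in candidates:
            candidates.append(j)

    # Step 2: greedily keep a candidate only if its coupling to every
    # already-kept variable is below the threshold rho.
    selected = []
    for j in candidates:
        if all(dependency(j, k) < rho for k in selected):
            selected.append(j)

    # Step 3: merge the single-variable blocks into at most num_workers
    # blocks, always adding the next-heaviest variable to the lightest block.
    blocks = [[] for _ in range(num_workers)]
    loads = [0.0] * num_workers
    for j in sorted(selected, key=lambda v: -workload[v]):
        lightest = loads.index(min(loads))
        blocks[lightest].append(j)
        loads[lightest] += workload[j]
    return [b for b in blocks if b]
```

Because every kept variable is weakly coupled to all others in this simplified version, any grouping of them into blocks is safe; a scheduler that forms multi-variable blocks in step 2 would need to check coupling between blocks instead.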
We now showcase how coordinate descent algorithms for two popular ML models, $\ell_1$-regularized regression and Matrix Factorization, can be cast into parallel versions via SAP.
2.1 Case 1: $\ell_1$-regularized Regression
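The $\ell_1$-regularized regression (a.k.a. Lasso) [25] is used to discover a small subset of features or dimensions that are relevant to an output $\mathbf{y}$. $\ell_1$-regularized regression takes the form of an optimization program:
$$\min_{\boldsymbol{\beta}} \; \ell(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) + \lambda \sum_{j=1}^{J} |\beta_j| \qquad (1)$$
where $\lambda$ denotes the regularization parameter that can be tuned, and $\ell(\cdot)$ is a nonnegative convex loss function such as squared-loss or logistic-loss; we assume that $\mathbf{X}$ and $\mathbf{y}$ are standardized and consider (1) without an intercept. Throughout this paper, for simplicity but without loss of generality, we let $\ell(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$. However, it is straightforward to use other loss functions such as logistic-loss using the same approach shown in [2].
By taking the gradient of (1), we obtain the coordinate descent (CD) algorithm [4] update rule for $\beta_j$:
$$\beta_j^{(t)} \leftarrow S\Big(\mathbf{x}_j^\top \mathbf{y} - \sum_{k \neq j} \mathbf{x}_j^\top \mathbf{x}_k \, \beta_k^{(t-1)}, \; \lambda\Big) \qquad (2)$$
where $S(\cdot, \lambda)$ is a soft-thresholding operator [4].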
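As a concrete illustration of a soft-thresholded coordinate update, the sketch below computes the new value of one coefficient. It assumes the squared loss $\frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$ and unit-norm (standardized) columns; the function names and the column-list data layout are our own choices for readability, not part of the original text.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def cd_update(x_cols, y, beta, j, lam):
    """Return the coordinate-descent value of beta[j].

    x_cols: list of columns of X, each a list of N floats with unit norm
    y:      list of N floats
    beta:   current coefficients (beta[j] itself is ignored)
    """
    xj = x_cols[j]
    z = 0.0
    for i in range(len(y)):
        # Prediction for sample i using every coefficient except beta[j].
        fitted_without_j = sum(
            x_cols[k][i] * beta[k] for k in range(len(beta)) if k != j
        )
        # Accumulate x_j^T (y - sum_{k != j} x_k beta_k).
        z += xj[i] * (y[i] - fitted_without_j)
    return soft_threshold(z, lam)
```

With orthonormal columns the update reduces to soft-thresholding the inner product of the column with the output, which makes small cases easy to check by hand.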
SAP schedules parallel CD updates on the Lasso optimization program (1), according to the four steps:

Step 1: We use a probability $p(j)$ determined by the recent change in $\beta_j$ between iterations, where $\beta_j^{(t)}$ represents $\beta_j$ at the $t$-th iteration. Intuitively, convergence is improved when we update variables (coefficients) that change more rapidly per iteration, and thus we prioritize variables based on their value change. In Section 4, we provide a theoretical justification for this choice of $p(j)$.

Step 2: We define the dependency $d$ for the parallel updates of (2). In this case, $d(x_j, x_k)$ is the correlation between the $j$-th and $k$-th covariates (note that we standardized $\mathbf{X}$). If the $j$-th and $k$-th covariates, $\mathbf{x}_j$ and $\mathbf{x}_k$, are highly correlated, then updating $\beta_j$ and $\beta_k$ in parallel will cause an interference effect that may dramatically attenuate improvement in the objective function [2]. SAP ensures that variables are grouped into blocks such that variables in different blocks have nearly independent covariates, thus keeping interference effects to a minimum.

Step 3: For parallel Lasso, we fix the size of blocks to one for application-specific reasons. It turns out that it is nontrivial to choose an appropriate block size that accounts for both load balance and quality of updates (i.e., decrease of objective value). Thus, choosing an appropriate block size at runtime is left for future work.

Step 4: After collecting the updated variables from workers, SAP uses them to update $p(j)$ from Step 1.
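The prioritization in Step 1 can be sketched as follows. The weight used here, squared coefficient change plus a small constant, is an assumed concrete form chosen for illustration, and the default value of `eta` is arbitrary; neither is taken from the text above. The helper names are likewise hypothetical.

```python
import random


def importance_weights(beta_prev, beta_curr, eta=1e-6):
    """Unnormalized sampling weights from the last change in each coefficient.

    Assumed form: (change)^2 + eta. The constant eta (arbitrary default)
    keeps every coefficient selectable even if it did not move.
    """
    return [(curr - prev) ** 2 + eta for prev, curr in zip(beta_prev, beta_curr)]


def draw_candidates(weights, num_draw, rng):
    """Draw up to num_draw distinct indices, each pick proportional to weight."""
    remaining = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(min(num_draw, len(remaining))):
        # Pick a position among the remaining indices, then remove it so
        # that no coefficient is scheduled twice in the same round.
        pos = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        chosen.append(remaining.pop(pos))
        w.pop(pos)
    return chosen
```

Coefficients that moved the most in the previous round dominate the draw, while stalled coefficients are still visited occasionally because of the additive constant.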
2.2 Case 2: Matrix Factorization
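MF is often used for collaborative filtering, where the goal is to predict a user’s unknown preferences, given his/her known preferences and the preferences of others. The input data is modeled as an incomplete matrix $\mathbf{A} \in \mathbb{R}^{N \times M}$, where $N$ is the number of users, and $M$ is the number of items/preferences. The idea is to discover smaller rank-$K$ matrices $\mathbf{W} \in \mathbb{R}^{N \times K}$ and $\mathbf{H} \in \mathbb{R}^{K \times M}$ such that $\mathbf{W}\mathbf{H} \approx \mathbf{A}$. Thus, the product $\mathbf{W}\mathbf{H}$ can be used to predict the missing entries (user preferences). Formally, let $\Omega$ be the set of indices of observed entries in $\mathbf{A}$, $\Omega^i$ be the set of observed column indices in the $i$-th row of $\mathbf{A}$, and $\Omega_j$ be the set of observed row indices in the $j$-th column of $\mathbf{A}$. Then, the MF problem is defined as an optimization program
$$\min_{\mathbf{W}, \mathbf{H}} \sum_{(i,j) \in \Omega} \big(a^i_j - \mathbf{w}^i \mathbf{h}_j\big)^2 + \lambda \big(\|\mathbf{W}\|_F^2 + \|\mathbf{H}\|_F^2\big) \qquad (3)$$
This optimization is solved via parallel CD, with the following update rules for $\mathbf{W}$ and $\mathbf{H}$:
$$w^i_k \leftarrow \frac{\sum_{j \in \Omega^i} \big(r^i_j + w^i_k h^k_j\big) h^k_j}{\lambda + \sum_{j \in \Omega^i} (h^k_j)^2} \qquad (4)$$
$$h^k_j \leftarrow \frac{\sum_{i \in \Omega_j} \big(r^i_j + w^i_k h^k_j\big) w^i_k}{\lambda + \sum_{i \in \Omega_j} (w^i_k)^2} \qquad (5)$$
where $r^i_j = a^i_j - \mathbf{w}^i \mathbf{h}_j$ for all $(i,j) \in \Omega$.
To solve the MF problem, SAP iterates through each rank $k$, parallelizing the updates (4) of $w^i_k$ over blocks of rows in $\mathbf{W}$, and parallelizing the updates (5) of $h^k_j$ over blocks of columns in $\mathbf{H}$. Specifically: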

Step 1: For MF, prioritizing variables within a full column or row results in minimal benefit, hence we use a uniform distribution for $p(j)$.

Step 2: In MF, each coefficient can be independently updated without interference. Thus, the dependency $d$ is zero for every pair of coefficients, and any coefficients can be grouped together.

Step 3: Because the observed matrix entries often follow a power-law distribution, we perform load balancing by grouping rows and columns into larger blocks, such that the nonzero entries of the input matrix are equally distributed.

Step 4: Since $p(j)$ and $d$ are constant functions, no modification is required.
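The load balancing in Step 3 can be illustrated with a simple greedy heuristic: visit rows from most to fewest observed entries and always place the next row in the currently lightest block. This longest-processing-time rule is our own stand-in for illustration; the text does not specify which grouping procedure STRADS uses, and the function names are hypothetical.

```python
import heapq


def balance_rows(nnz_per_row, num_blocks):
    """Group row indices into num_blocks blocks with similar nonzero counts."""
    # Min-heap of (current load, block id), so the lightest block pops first.
    heap = [(0, b) for b in range(num_blocks)]
    heapq.heapify(heap)
    blocks = [[] for _ in range(num_blocks)]
    # Place heavy rows first; light rows then fill the remaining gaps.
    for row in sorted(range(len(nnz_per_row)), key=lambda r: -nnz_per_row[r]):
        load, b = heapq.heappop(heap)
        blocks[b].append(row)
        heapq.heappush(heap, (load + nnz_per_row[row], b))
    return blocks


def block_loads(nnz_per_row, blocks):
    """Total number of nonzero entries handled by each block."""
    return [sum(nnz_per_row[r] for r in block) for block in blocks]
```

For example, with per-row counts `[100, 1, 1, 1, 1, 96]` and two blocks, the heuristic pairs the row with 96 entries with all four light rows, so both blocks carry 100 nonzeros, whereas splitting the rows into two contiguous halves would give loads of 102 and 98.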
3 STRADS: an Efficient, Distributed Implementation of SAP
Now we describe STRADS (Figure 3), a distributed implementation of the SAP scheduling model, which can use any number of machines to provide scheduling for an arbitrary degree of parallelism. Having a distributed implementation ensures that STRADS will scale to meet the computation and memory demands of finding dynamic block structure on extremely large models and input data. The key ideas behind STRADS are (1) each scheduler thread is responsible for scheduling its own disjoint set of variables (and only those variables), and (2) the scheduler threads take turns to send blocks to the worker clients.
Implementation Overview
Suppose the user invokes $S$ STRADS threads (which can be on different machines) to solve an ML model with $J$ variables. STRADS proceeds as follows: First, each thread is randomly assigned $J/S$ variables (with no overlaps) before the algorithm starts; these assignments remain fixed throughout. Next, all threads execute the four SAP steps: (1) select variables from $p_s(j)$, where $p_s(j)$ is the importance distribution over the variables assigned to thread $s$, (2) use those variables to form the set of dynamic variable blocks according to $d$, (3) merge blocks to get a new set of blocks that are load-balanced, and distribute them to workers, and (4) receive the updated blocks from workers, and update $p_s(j)$ and $d$. The STRADS scheduler threads take turns to dispatch to the workers: thread 1 dispatches first, then thread 2, and so on until thread $S$, before returning to thread 1.
In our experiments, we assume the entire input data is available to every machine, though we note that it can just as easily be stored in a distributed key-value store or parameter server. Each scheduler thread maintains and stores only the variables assigned to it. We implement STRADS in C++, using the Boost libraries and the 0MQ 3.2.4 library [11] for inter-machine network communications.
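The partitioning and turn-taking described above can be sketched in a single process as follows. This is an illustrative model only, not the C++/0MQ implementation: the helper names are hypothetical, scheduler threads are numbered from 0 rather than 1, and `make_blocks` stands in for whatever a thread does in SAP steps 1-3.

```python
import random
from itertools import cycle


def partition_variables(num_vars, num_threads, rng):
    """Randomly assign each variable to exactly one scheduler thread."""
    order = list(range(num_vars))
    rng.shuffle(order)
    # Striding over a shuffled order yields disjoint, near-equal partitions.
    return [order[s::num_threads] for s in range(num_threads)]


def round_robin_dispatch(partitions, make_blocks, num_rounds):
    """Yield (thread_id, blocks) pairs, cycling through the threads in order.

    make_blocks(thread_id, variables) is a caller-supplied function that
    builds the blocks a thread wants to dispatch from its own variables.
    """
    turn = cycle(range(len(partitions)))
    for _ in range(num_rounds):
        s = next(turn)
        yield s, make_blocks(s, partitions[s])
```

Since a thread only ever builds blocks from its own partition, blocks dispatched in different turns never need to be checked against each other.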
Programming Interface
Per the SAP model, STRADS requires users to define the model-specific functions $p$ (importance sampling) and $d$ (dependency), via the following interface:

define_sampling(p), where p is a function object such that p(j) returns the probability of variable j. STRADS also provides p with an interface to access the input data, as well as the model variables (on the current STRADS thread); we shall not go into the details for space reasons.

define_dependency(d), where d is a function object such that d(j,k) returns the dependency between variables j and k. Like p, d has access to the model variables and input data through STRADS.
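To show how the two callbacks fit together, here is a toy Python stand-in for the interface. The actual interface is C++ and its details are not given here, so the class `ToyScheduler` and its helper methods `sampling_distribution` and `compatible` are hypothetical; only the names `define_sampling` and `define_dependency` and the meaning of `p(j)` and `d(j, k)` come from the description above.

```python
class ToyScheduler:
    """Hypothetical stand-in that stores user-supplied p and d callbacks."""

    def __init__(self, variables):
        self.variables = list(variables)
        self._p = None
        self._d = None

    def define_sampling(self, p):
        # p(j) returns the (possibly unnormalized) importance of variable j.
        self._p = p

    def define_dependency(self, d):
        # d(j, k) returns the dependency between variables j and k.
        self._d = d

    def sampling_distribution(self):
        """Normalize p over this scheduler's variables (helper for the demo)."""
        raw = {j: self._p(j) for j in self.variables}
        total = sum(raw.values())
        return {j: w / total for j, w in raw.items()}

    def compatible(self, j, k, rho):
        """True if j and k are weakly coupled enough to update in parallel."""
        return self._d(j, k) < rho
```

A user would register, for instance, `sched.define_sampling(lambda j: delta[j] ** 2 + 1e-6)` and `sched.define_dependency(lambda j, k: abs(corr[j][k]))`, where `delta` and `corr` are the user's own data structures.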
Properties of STRADS
From a distributed systems perspective, the round-robin design of STRADS carries the following benefits. One, it makes effective use of distributed cluster memory: every scheduler thread only needs to store the state of the variables assigned to it. Two, the scheduler threads require almost no communication between each other; they just need to coordinate taking turns to serve workers. Three, the round-robin arrangement allows each scheduler thread more time to prepare for dispatch: if there are $S$ threads, then each thread has $S$-fold more time. This prevents situations where workers have to wait for the scheduler, and is essentially a form of hiding computational latency.
STRADS is essentially a bootstrap of the SAP model. Even though the importance distribution $p(j)$ is now split into $S$ distributions $p_s(j)$, the number of threads is far smaller than the number of variables in Big Model problems, so each $p_s(j)$ will be approximately similar in shape to the original $p(j)$. Furthermore, STRADS preserves algorithm correctness: because blocks from different scheduler threads will be updated at different iterations, there is no need to cross-check dependencies for blocks between threads. Load balancing is also unaffected, provided that enough variables are sampled by each thread (so that enough blocks are produced). Thus, STRADS is a close, bootstrapped approximation to SAP scheduling for Big Models with a large number of variables and/or parameters.
4 Theoretical Analysis of Parallel CD Under SAP
The SAP model specifies a general-purpose dynamic block scheduler for distributed ML algorithms; given a specific ML algorithm, the user must input appropriate definitions for $p(j)$ (important variable subsampling) and $d$ (dependency checking). To provide theoretical analysis of parallel CD under SAP, let us consider the definitions for Lasso regression in Section 2.1. Under them, we will show that SAP approximately obtains the optimal Lasso convergence rate for $P$ worker threads. We formally restate those definitions:
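1. Select a subset of variables in the $t$-th iteration: choose a set of candidate Lasso coefficients (variables), where the candidates are selected from the distribution $p(j) \propto (\delta\beta_j^{(t-1)})^2 + \eta$, where $\delta\beta_j^{(t-1)} = \beta_j^{(t-1)} - \beta_j^{(t-2)}$ and $\eta$ is a small constant.

2. Group the coefficients into jobs to be dispatched in the $t$-th iteration, where each job contains exactly one coefficient. More precisely, find a set of $P$ coefficients to be dispatched such that the dependency between any two selected coefficients $j \neq k$ stays below the threshold $\rho$. Here the dependency is the correlation between the $j$-th covariate and the $k$-th covariate; we assume $\mathbf{X}$ has been standardized for Lasso.

3. Dispatch the selected coefficients to the $P$ parallel workers.

4. Receive the updated coefficients from the workers, to be used in steps 1-2 in the next iteration.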
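Below, we present highlights from our theoretical results. Our analysis is based on the sampling distribution $p(j) \propto (\delta\beta_j^{(t)})^2$ (in practice, we approximate $\delta\beta_j^{(t)}$ with $\delta\beta_j^{(t-1)}$ since $\delta\beta_j^{(t)}$ is unavailable at the $t$-th iteration before computing $\beta_j^{(t)}$; we introduced $\eta$ to give all $\beta_j$'s nonzero probability to account for the approximation), and the allowed model dependency threshold $\rho$ at each iteration. This is unlike the global condition for all iterations used in [2, 23].
For theoretical analysis, we rewrite problem (1) as: $\min_{\boldsymbol{\beta}} \frac{1}{2}\|\mathbf{y} - \hat{\mathbf{X}}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{2J} \beta_j$, where $\hat{\mathbf{X}}$ contains $2J$ features by duplicating original features with opposite sign (see appendix for details), and $\beta_j \geq 0$, for all $j$. We define the Lasso objective as $F(\boldsymbol{\beta})$, and the following theorem shows that $p(j) \propto (\delta\beta_j^{(t)})^2$ is approximately optimal for SAP.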
Theorem 1.
Suppose is the set of indices of coefficients updated in parallel at the th iteration, and is sufficiently small such that , for all , where is a small positive constant. Then, the sampling distribution approximately maximizes a lower bound to the expected decrease in the objective function after updating coefficients indexed by , where is defined as
(6) 
This means that our scheduling strategy for parallel Lasso approximately maximizes the lower bound for the progress per iteration (we defer the proof to the appendix).
We now discuss SAP’s scalability with respect to the Shotgun algorithm, which selects the coefficients to update uniformly at random [2]. Firstly, SAP always achieves the maximum effective parallelization allowed by the input data, by actively minimizing the interference caused by parallel updates. In contrast, Shotgun’s effective parallelization is reduced whenever the (randomly drawn) coefficients happen to be correlated, thus producing interference when updated in parallel. Furthermore, SAP deliberately chooses coefficients that are expected to decrease the objective function, whereas Shotgun is agnostic to coefficient importance. Because of these two factors, SAP has superior theoretical (and as we shall show, empirical) scalability over Shotgun.
5 Experimental Results
We show that the SAP model (implemented as STRADS) outperforms unstructured model parallelism, which selects variables uniformly at random for parallel execution, as well as the static block-structured parallelism model, which does not change block structures during execution. We demonstrate this on two exemplar applications, parallel Lasso and parallel MF; experimental details follow:
Datasets
For parallel Lasso, we used one real and one synthetic dataset. Our real dataset was the Alzheimer’s disease (AD) dataset [10], containing 463 samples and 508,999 covariates (single nucleotide polymorphisms) for $\mathbf{X}$, and real-valued APOE gene expression levels for $\mathbf{y}$. For synthetic data, we generated 450 samples with 1,000,000 features, and a real-valued output with 10,000 true nonzero coefficients. For parallel MF, we used the NetFlix [6] and YahooMusic [16] datasets. The NetFlix dataset contains 480,189 users versus 17,770 movies (100,480,507 nonzero entries) while the YahooMusic dataset contains 1,948,882 users versus 98,213 songs (115,579,440 nonzero entries).
Experimental platform and STRADS configurations
We ran the experiments on a compute cluster, with the following machine specifications: 64 cores ( AMD Opteron 1.4 GHz), 3TB SATA drive, 128GB RAM, and 10GbE network interface. Parallel Lasso and MF applications were tested in different platforms. We ran the parallel Lasso application in the distributed setting (multiple machines) using from 60 to 240 cores, and parallel MF in the single multicore machine setting using from 4 to 16 cores. STRADS was configured as follows: for Lasso, we used , , and , and for MF, we partitioned variables such that each block contains or variables, where is the number of cores.
5.1 Experiments on Parallel Lasso
Fig. 4 shows objective vs. time plots for STRADS (SAP model for dynamic block structures), a static correlation scheduler (static block structures), and a random scheduler (no block structures), over several machine configurations. The static block scheduling uses the following strategy: pick a set of variables uniformly at random, and dispatch only variables that are nearly independent (i.e. pairwise correlation below a fixed threshold). As for unstructured scheduling, we used the Shotgun approach [2], which selects variables uniformly at random; note that the original Shotgun paper was limited to a single multicore machine, whereas our experiments bring Shotgun into the distributed setting.
The first row of Fig. 4 contains AD data results, while the second row contains synthetic data results, over 60, 120, and 240 cores. In all cases, STRADS converged much faster than the other two schedulers. We point out three phenomena observed in these experiments. First, STRADS consistently generates an early sharp drop in the objective function value; this is because after all variables have been updated at least once, STRADS has a full estimate of the importance distribution $p(j)$, so it can prioritize more important variables. This results in a dramatic reduction in objective value.
Second, STRADS exhibits not only a faster convergence rate, but also a substantially better objective function value when converged. It is possible that the other two approaches will eventually achieve the same objective that STRADS had. In practice however, algorithms are run with an automatic stopping condition — typically a minimum threshold on change in objective value. Under such a stopping condition, STRADS achieves a better final objective value than the other schedulers.
Finally, we observe that static correlation scheduling only beats random scheduling by a significant margin when using a large number of cores (e.g., 240). The reason is that, with a low core count, random scheduling is unlikely to select highly correlated variables, and hence static block structures do not yield any benefit. Once the core count increases, the probability of picking multiple correlated variables goes up, and static correlation scheduling begins to show an advantage. However, STRADS dynamic scheduling based on variable importance yields an even greater improvement.
5.2 Experiments on Parallel Matrix Factorization
Fig. 5 compares, for 4 to 16 cores on a single machine, parallel MF using STRADS, versus a scheduler with no load balancing (that partitions the matrix rows and columns uniformly, without regard to the number of nonzero entries in each row/column). This experiment is intended to demonstrate the performance gains from load balancing through STRADS.
On the NetFlix dataset (first row of Fig. 5), STRADS exhibits a slightly better convergence rate for 4 and 8 cores, but an insubstantial benefit for 16 cores. The reason is one of sampling statistics: when using a small number of cores/blocks and uniformly sampling over rows and columns, the final distribution of block sizes (i.e. number of nonzero entries) exhibits a large variance, that is to say, some blocks can be much larger than others. Hence, the largest block becomes a severe bottleneck. However, once the number of cores/blocks is increased, the variance in block sizes drops, and the bottleneck is thus reduced.
For the YahooMusic dataset (second row of Fig. 5), STRADS exhibits much clearer benefits from load balancing. Moreover, unlike the NetFlix dataset, the gain due to load balancing actually increases with more cores. It turns out that the nonzero entries in the YahooMusic dataset are heavily biased towards a few items (i.e. strong power-law behavior); hence without load balancing, algorithm performance is no better than a single thread due to bottlenecking on the extreme users. STRADS load balancing resolves this problem, allowing for full parallelism (which explains the widening gap w.r.t. the naive scheduler at higher core counts).
6 Related Work and Discussion
Variable scheduling is a key component of many distributed platforms such as Pregel [20], MapReduce [3] and GraphLab [18]. For example, GraphLab partitions graph data to minimize communication and synchronization costs between different connected nodes; furthermore, GraphLab provides various consistency schemes to synchronize dependent parameters or variables. Pregel is designed to process large scale graphs, and schedules computations using workflow graphs. Hadoop distributes the data to workers, in a manner that limits communication due to MapReduce synchronization. Our work differs from these scheduling approaches, in that we consider not only static information embedded in the data, but also dynamic information such as transient parameters or variables learned at runtime.
Algorithms for our two exemplar applications, parallel Lasso and MF, have been extensively studied in the literature: examples include randomized block-coordinate descent [22], dual decomposition [1], parallel stochastic gradient descent [19, 21], and parallel coordinate descent [2, 27]. These works differ from ours in the sense that we suggest a general-purpose dynamic scheduler to boost the performance and correctness of parallel ML algorithms, rather than an algorithm tailored to a specific application. In fact, we used existing algorithms for parallel Lasso and MF without any modification. In that regard, STRADS can be combined with any new developments in parallel Lasso or MF algorithms, so as to yield further performance improvements.
Future work includes harnessing STRADS to accelerate diverse Big Model applications. By considering the unique ML properties of each application, we can develop principles for analyzing intermediate variables/parameter values in the context of the data, in order to formulate the importance distribution and dependency function necessary for high performance model-parallelism with STRADS. Furthermore, we will explore principled ways to improve the efficiency of STRADS, such as increasing the size of blocks to be dispatched while still tightly controlling interference effects between model variables, in order to minimize communication costs between workers and scheduler and thus maximize CPU utilization.
Appendix: Proof of Theorem 1
Preliminaries
The ℓ1-regularized regression [25] takes the form of an optimization program:
(7) 
where denotes the regularization parameter, and is a nonnegative convex loss function. We assume that and are standardized and consider (7) without an intercept. For simplicity but without loss of generality, we let . However, it is straightforward to use other loss functions, such as the logistic loss, using the same approach shown in [2].
For theoretical analysis, we rewrite problem (7) as:
(8) 
where contains duplicated features with opposite sign such that , and for all and , and , for all . Note that problems (7) and (8) are equivalent optimization problems [2]. To optimize problem (8), we can use the parallel coordinate descent method (Shotgun) proposed by [2], and the update rule is , where is given by,
where .
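As a minimal sketch of this update, assume the squared loss with unit-norm feature columns, so that, following the Shotgun rule in [2], each nonnegative coefficient moves by the larger of its negated value and its negated partial derivative. The names `duplicate_features`, `objective`, `coordinate_delta`, `parallel_update`, `beta` and `lam` are illustrative assumptions and not part of STRADS or Shotgun; every update in a batch is computed from the same coefficient vector to mimic one parallel step.

```python
def duplicate_features(X):
    """Map each row x to [x, -x] so every coefficient can be kept nonnegative."""
    return [list(row) + [-v for v in row] for row in X]


def residual(X_dup, y, beta):
    """Return X_dup @ beta - y as a plain list."""
    return [sum(x * b for x, b in zip(row, beta)) - yi
            for row, yi in zip(X_dup, y)]


def objective(X_dup, y, beta, lam):
    """F(beta) = 0.5 * ||X_dup beta - y||^2 + lam * sum(beta), for beta >= 0."""
    r = residual(X_dup, y, beta)
    return 0.5 * sum(v * v for v in r) + lam * sum(beta)


def coordinate_delta(X_dup, y, beta, lam, j):
    """Shotgun-style step for coordinate j: max(-beta_j, -dF/dbeta_j).

    Assumes squared loss and unit-norm columns (step-size constant of 1).
    """
    r = residual(X_dup, y, beta)
    grad_j = sum(row[j] * v for row, v in zip(X_dup, r)) + lam
    return max(-beta[j], -grad_j)


def parallel_update(X_dup, y, beta, lam, indices):
    """Apply the updates for `indices` together, all computed from the same beta."""
    deltas = {j: coordinate_delta(X_dup, y, beta, lam, j) for j in indices}
    new_beta = list(beta)
    for j, d in deltas.items():
        new_beta[j] += d
    return new_beta


# Toy problem with two orthogonal, unit-norm features.
X_dup = duplicate_features([[1.0, 0.0], [0.0, 1.0]])
y = [2.0, -1.0]
beta0 = [0.0, 0.0, 0.0, 0.0]
beta1 = parallel_update(X_dup, y, beta0, lam=0.5, indices=[0, 3])
```

On this toy problem the two updated coordinates are orthogonal, so the joint step lowers the objective. With correlated columns, simultaneous updates can interfere with each other, which is the effect the dependency constraint is meant to control.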
Theorem 1.
Suppose is the set of indices of coefficients updated in parallel at the th iteration, and is sufficiently small such that , for all , where is a small positive constant. Then, the sampling distribution approximately maximizes a lower bound to the expected decrease in the objective function after updating coefficients indexed by , where is defined as
(9) 
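To make the role of the sampling distribution concrete, the sketch below assumes that each coefficient's weight is the square of its most recent update plus a small smoothing constant, so that coefficients which moved the most are the most likely to be scheduled next. The function names, the constant `eta`, and the use of Python's `random.choices` are assumptions made for illustration; this is not the STRADS implementation.

```python
import random


def sampling_distribution(deltas, eta=1e-6):
    """Normalize squared update magnitudes (plus eta) into probabilities."""
    weights = [d * d + eta for d in deltas]
    total = sum(weights)
    return [w / total for w in weights]


def sample_candidates(deltas, num_candidates, rng, eta=1e-6):
    """Draw candidate coefficient indices with replacement, then de-duplicate."""
    probs = sampling_distribution(deltas, eta)
    drawn = rng.choices(range(len(deltas)), weights=probs, k=num_candidates)
    return sorted(set(drawn))


rng = random.Random(0)
last_deltas = [0.0, 2.0, 0.0, 1.0]  # hypothetical most recent updates
probs = sampling_distribution(last_deltas)
candidates = sample_candidates(last_deltas, num_candidates=3, rng=rng)
```

The smoothing constant keeps every coefficient reachable, so a coefficient whose last update was zero can still be revisited occasionally.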
Proof.
From assumption 3.1 in [2], we have
where . To simplify notation, we omit the superscript denoting the th iteration.
Suppose the index of a coefficient is drawn from a sampling distribution , and a pair of indices is drawn from . Taking the expectation with respect to :
(10)  
(11)  
(12)  
(13)  
(14)  
(15) 
In (Preliminaries), we used if because and cannot be updated in parallel if . Recall that we find coefficients to be updated in parallel by solving:
such that for all . 
Further, in (Preliminaries) we used our assumption that , for all for small . Thus, from (Preliminaries) the lower bound of is maximized when . Furthermore, . Thus, because . Therefore, gives us an approximately optimal distribution for maximizing the lower bound of . ∎
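The selection step recalled in the proof, in which coefficients are updated in parallel only if every pair satisfies the dependency constraint, can be approximated greedily. The sketch below assumes that the dependency between two coefficients is the absolute inner product of their feature columns (their correlation when the columns are standardized) and that pairs above a threshold `rho` must not be scheduled together. This greedy filter is a stand-in heuristic rather than the optimization solved by the authors, and all names are hypothetical.

```python
def column(X, j):
    """Return column j of a row-major matrix."""
    return [row[j] for row in X]


def dependency(X, j, k):
    """Absolute inner product of columns j and k."""
    return abs(sum(a * b for a, b in zip(column(X, j), column(X, k))))


def select_parallel_set(X, candidates, rho):
    """Greedily keep candidates whose dependency with every kept index is <= rho."""
    selected = []
    for j in candidates:
        if all(dependency(X, j, k) <= rho for k in selected):
            selected.append(j)
    return selected


# Columns 0 and 1 are nearly collinear; column 2 is orthogonal to both.
X = [[1.0, 0.9, 0.0],
     [0.0, 0.1, 0.0],
     [0.0, 0.0, 1.0]]
parallel_set = select_parallel_set(X, candidates=[0, 1, 2], rho=0.5)
```

Here column 1 is dropped because it is too strongly coupled to column 0, while column 2 is kept. Candidates are visited in the order given, so a caller could place the highest-priority coefficients first.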
References
 [1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–124, 2011.
 [2] Joseph K Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. ICML, 2011.
 [3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
 [4] J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
 [5] Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 99:2001–2049, 2010.

 [6] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 69–77. ACM, 2011.
 [7] Jayanta K Ghosh and RV Ramamoorthi. Bayesian nonparametrics. Springer, 2003.
 [8] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 399–406, 2010.
 [9] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.
 [10] Harvard Brain Tissue Resource Center. Downloaded from Sage Bionetworks: https://synapse.prod.sagebase.org/#Synapse:4505, 2013.
 [11] Pieter Hintjens. ZeroMQ: Messaging for Many Applications. O’Reilly, 2013.
 [12] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th international conference on Machine learning, pages 408–415. ACM, 2008.
 [13] Tommi S Jaakkola. Tutorial on variational approximation methods. Advanced mean field methods: theory and practice, page 129, 2001.
 [14] U Kang, Charalampos E Tsourakakis, and Christos Faloutsos. Pegasus: A petascale graph mining system implementation and observations. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 229–238. IEEE, 2009.
 [15] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 31–46, 2012.
 [16] Yahoo! Labs. Webscope from yahoo! labs. http://webscope.sandbox.yahoo.com/catalog.php?datatype=r, 2013.

 [17] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
 [18] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 2012.

 [19] Xin Luo, Huijun Liu, Gaopeng Gou, Yunni Xia, and Qingsheng Zhu. A parallel matrix factorization based recommender by alternating stochastic gradient decent. Engineering Applications of Artificial Intelligence, 25(7):1403–1412, 2012.
 [20] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135–146. ACM, 2010.
 [21] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, pages 1–26, 2011.
 [22] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1–38, 2012.
 [23] Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, and David Haglin. Feature clustering for accelerating parallel coordinate descent. NIPS, 2012.
 [24] Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th international conference on World wide web, pages 607–614. ACM, 2011.
 [25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
 [26] Ian H Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.
 [27] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 765–774. IEEE, 2012.
 [28] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, 2010.