1 Introduction
Many statistical estimation tasks amount to solving an optimization problem of the form

(1)  $\min_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^{n} f_i(\mathbf{x}_i^\top \beta) + \frac{\lambda}{2}\|\beta\|_2^2$

where $\lambda > 0$ is the regularization parameter. The loss functions $f_i$ depend on labels $y_i$ and linearly on the coefficients, $\beta$, through a vector of covariates, $\mathbf{x}_i \in \mathbb{R}^p$. Furthermore, we assume all $f_i$ to be convex and smooth with Lipschitz continuous gradients. Concretely, when $f_i(\mathbf{x}_i^\top \beta) = (\mathbf{x}_i^\top \beta - y_i)^2$, Eq. (1) corresponds to ridge regression; for logistic regression $f_i(\mathbf{x}_i^\top \beta) = \log\left(1 + \exp(-y_i \mathbf{x}_i^\top \beta)\right)$.

For large-scale problems, it is no longer practical to solve even relatively simple estimation tasks such as (1) on a single machine. To deal with this, approaches to distributed data analysis have been proposed that take advantage of many cores or computing nodes on a cluster. A common idea linking many of these methods is stochastic optimization. Typically, each of the workers only sees a small portion of the data points and performs incremental updates to a global parameter vector. It is usually assumed that the number of data points, $n$, is very large compared with the number of features, $p$, or that the data is extremely sparse. In such settings – which are common, but not ubiquitous, in large datasets – distributed stochastic optimization algorithms perform well, but they may converge slowly otherwise.
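As a concrete illustration, the two instances of (1) mentioned above can be written down directly. This is a sketch of our own (the function names and the $\lambda/2$ scaling follow the formulation here; none of this code is from the paper):

```python
import numpy as np

def ridge_objective(beta, X, y, lam):
    """Eq. (1) with squared loss f_i(u) = (u - y_i)^2 (ridge regression)."""
    residuals = X @ beta - y
    return np.sum(residuals ** 2) + 0.5 * lam * np.sum(beta ** 2)

def logistic_objective(beta, X, y, lam):
    """Eq. (1) with logistic loss f_i(u) = log(1 + exp(-y_i u)), y_i in {-1, +1}."""
    margins = y * (X @ beta)
    return np.sum(np.log1p(np.exp(-margins))) + 0.5 * lam * np.sum(beta ** 2)
```

Both objectives depend on $\beta$ only through the products $\mathbf{x}_i^\top\beta$, which is the linearity the dual formulation below exploits.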
A fundamentally different approach to distributed learning is for each worker to only have access to a portion of the available features. Distributing according to the features can be a preferable alternative for several reasons. Firstly, for high-dimensional data, where $p$ is large relative to $n$, better scaling can be achieved. This setting is challenging, however, since most loss functions are not separable across coordinates. High-dimensional data is commonly encountered in the fields of bioinformatics, climate science and computer vision. Furthermore, for a variety of prediction tasks it is often beneficial to map input vectors into a higher-dimensional feature space, e.g. using deep representation learning or considering higher-order interactions. Secondly, privacy: individual blocks of features could correspond to sensitive information (such as medical records) which should be included in the predictive model but is not allowed to be communicated in an undisguised form.

Our contribution.
In this work we introduce DualLoco to solve problems of the form (1) in the distributed setting when each worker only has access to a subset of the features. DualLoco is an extension of the Loco algorithm [1] which was recently proposed for solving distributed ridge regression in this setting. We propose an alternative formulation where each worker instead locally solves a dual optimization problem. DualLoco has a number of practical and theoretical improvements over the original algorithm:

DualLoco is applicable to a wider variety of smooth, convex penalized loss minimization problems encompassing many widely used regression and classification loss functions, including ridge regression, logistic regression and others.

In §4 we provide a more intuitive and tighter theoretical result which crucially does not depend on specific details of the ridge regression model and has a weaker dependence on the number of workers, $K$.

We also show that adding (rather than concatenating) random features allows for an efficient implementation yet retains good approximation guarantees.
In §5 we report experimental results on high-dimensional real-world datasets from two different problem domains: climate science and computer vision. We compare DualLoco with CoCoA, a recently proposed state-of-the-art algorithm for distributed dual coordinate ascent [2]. Our experiments show that DualLoco demonstrates better scaling with $K$ than CoCoA while retaining a good approximation of the optimal solution. We provide an implementation of DualLoco in Apache Spark (http://spark.apache.org/). The portability of this framework ensures that DualLoco can be run in a variety of distributed computing environments.
2 Related work
2.1 Distributed Estimation
Recently, several asynchronous stochastic gradient descent (SGD) methods [3, 4] have been proposed for solving problems of the form (1) in a parallel fashion in a multi-core, shared-memory environment and have been extended to the distributed setting. For such methods, large speedups are possible with asynchronous updates when the data is sparse. However, in some problem domains the data collected is dense with many correlated features. Furthermore, the high-dimensional setting can result in slow convergence. In the distributed setting, such methods can be impractical since the cost of communicating updates can dominate other computational considerations.

Jaggi et al. proposed a communication-efficient distributed dual coordinate ascent algorithm (CoCoA resp. CoCoA+) [5, 2]. Each worker makes multiple updates to its local dual variables before communicating the corresponding primal update. This allows for trading off communication and convergence speed. Notably, they show that convergence is independent of the number of workers; thus CoCoA exhibits strong scaling with $K$.
Other recent work considers solving statistical estimation tasks using a single round of communication [6, 7]. However, all of these methods consider only distributing over the rows of the data where an i.i.d. assumption on the observations holds.
On the other hand, few approaches have considered distributing across the columns (features) of the data. This is a more challenging task for both estimation and optimization since the columns are typically assumed to have arbitrary dependencies and most commonly used loss functions are not separable over the features. Recently, Loco was proposed to solve ridge regression when the data is distributed across the features [1]. Loco requires a single round to communicate small matrices of randomly projected features which approximate the dependencies in the rest of the dataset (cf. Figure 1). Each worker then optimizes its own subproblem independently and finally sends its portion of the solution vector back to the master, where the portions are combined. Loco makes no assumptions about the correlation structure between features. It is therefore able to perform well in challenging settings where the features are correlated between blocks and is particularly suited to the case where $p \gg n$. Indeed, since the relative dimensionality of the local problems decreases when splitting by columns, they are easier in a statistical sense. Loco makes no assumptions about data sparsity, so it is also able to obtain speedups when the data is dense.
One-shot communication schemes are beneficial as the cost of communication consists of a fixed cost and a cost that is proportional to the size of the message. Therefore, it is generally cheaper to communicate a few large objects than many small objects.
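This trade-off can be made concrete with a toy latency-plus-bandwidth cost model (entirely our own illustration; the constants are arbitrary placeholder values, not measurements from the paper):

```python
def comm_cost(num_messages, bytes_per_message, latency_s=1e-3, bandwidth_bps=1e9):
    """Total transfer time: every message pays a fixed latency plus a
    size-proportional cost (constants are arbitrary illustration values)."""
    return num_messages * (latency_s + bytes_per_message / bandwidth_bps)

few_large = comm_cost(10, 10**8)      # 1 GB total in 10 large messages
many_small = comm_cost(10**5, 10**4)  # same 1 GB total in 100,000 small ones
```

With these constants the same total payload costs about 1 second as a few large messages but about 100 seconds as many small ones, since the per-message latency dominates.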
2.2 Random projections for estimation and optimization
Random projections are low-dimensional embeddings which approximately preserve an entire subspace of vectors. They have been extensively used to construct efficient algorithms when the sample size is large in a variety of domains such as nearest neighbours [8], matrix factorization [9], least squares [10, 11] and, recently, optimization [12].
We concentrate on the Subsampled Randomized Hadamard Transform (SRHT), a structured random projection [13]. The SRHT consists of a projection matrix $\Pi = \sqrt{p/\tau}\,\mathbf{D}\mathbf{H}\mathbf{S}$ [9] with the definitions: (i) $\mathbf{S} \in \mathbb{R}^{p \times \tau}$ is a subsampling matrix. (ii) $\mathbf{D} \in \mathbb{R}^{p \times p}$ is a diagonal matrix whose entries are drawn independently from $\{-1, 1\}$. (iii) $\mathbf{H} \in \mathbb{R}^{p \times p}$ is a normalized Walsh-Hadamard matrix. The key benefit of the SRHT is that, due to its recursive definition, the product between $\mathbf{X}$ and $\Pi$ can be computed in $O(np\log\tau)$ time while never constructing $\Pi$ explicitly.
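A minimal dense sketch of the SRHT, for illustration only: a real implementation would use the recursive fast transform rather than materializing $\mathbf{H}$, and the `srht` helper and its signature are our own, not an API from the paper:

```python
import numpy as np
from scipy.linalg import hadamard

def srht(A, tau, rng):
    """Apply a dense Subsampled Randomized Hadamard Transform to A's columns.

    A: (n, p) with p a power of two. Returns the (n, tau) sketch A @ Pi with
    Pi = sqrt(p / tau) * D H S: random signs D, normalized Walsh-Hadamard H,
    uniform column subsampling S.
    """
    n, p = A.shape
    D = rng.choice([-1.0, 1.0], size=p)            # random sign flips
    H = hadamard(p) / np.sqrt(p)                   # normalized Walsh-Hadamard
    cols = rng.choice(p, size=tau, replace=False)  # uniform subsampling
    return np.sqrt(p / tau) * (A * D) @ H[:, cols]
```

Since $\mathbf{D}\mathbf{H}$ is orthogonal, taking $\tau = p$ (no subsampling) preserves the row norms of $A$ exactly; for $\tau < p$ they are preserved approximately.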
For moderately sized problems, random projections have been used to reduce the dimensionality of the data prior to performing regression [14, 15]. However, after projection, the solution vector lies in the compressed space and so interpretability of the coefficients is lost. Furthermore, the projection of the low-dimensional solution back to the original high-dimensional space is in fact guaranteed to be a bad approximation of the optimum [16].
Dual Random Projections.
Recently, [16, 17] studied the effect of random projections on the dual optimization problem. For the primal problem in Eq. (1), defining $\mathbf{K} = \mathbf{X}\mathbf{X}^\top$, we have the corresponding dual

(2)  $\max_{\alpha \in \mathbb{R}^n} \; -\sum_{i=1}^{n} f_i^*(\alpha_i) - \frac{1}{2\lambda} \alpha^\top \mathbf{K} \alpha$

where $f_i^*$ is the Fenchel conjugate of $f_i$ and $\alpha \in \mathbb{R}^n$ is the vector of dual variables. For example, for squared loss functions $f_i(u) = (u - y_i)^2$, we have $f_i^*(v) = v^2/4 + y_i v$. For problems of this form, the dual variables can be directly mapped to the primal variables, such that for a vector $\alpha^*$ which attains the maximum of (2), the optimal primal solution has the form $\beta^* = -\frac{1}{\lambda}\mathbf{X}^\top \alpha^*$.
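As a check, the Fenchel conjugate of the squared loss mentioned above can be derived directly from the definition $f^*(v) = \sup_u \{uv - f(u)\}$ (a short derivation sketch; the sign conventions follow the usual Lagrangian derivation of the dual):

```latex
\begin{align*}
f_i(u) &= (u - y_i)^2, \\
f_i^*(v) &= \sup_{u \in \mathbb{R}} \left\{ uv - (u - y_i)^2 \right\}.
\end{align*}
% Setting the derivative to zero: v - 2(u - y_i) = 0, so u = y_i + v/2.
\[
f_i^*(v) = \Big(y_i + \tfrac{v}{2}\Big)v - \frac{v^2}{4} = \frac{v^2}{4} + y_i v .
\]
```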
Clearly, a similar dual problem to (2) can be defined in the projected space. Defining $\tilde{\mathbf{K}} = \mathbf{X}\Pi\Pi^\top\mathbf{X}^\top$ we have

(3)  $\max_{\alpha \in \mathbb{R}^n} \; -\sum_{i=1}^{n} f_i^*(\alpha_i) - \frac{1}{2\lambda} \alpha^\top \tilde{\mathbf{K}} \alpha$

Importantly, the vector of dual variables does not change dimension depending on whether the original problem (2) or the projected problem (3) is being solved. Under mild assumptions on the loss function, by mapping the solution to this new problem, $\tilde{\alpha}$, back to the original space one obtains a vector $\tilde{\beta} = -\frac{1}{\lambda}\mathbf{X}^\top\tilde{\alpha}$, which is a good approximation to $\beta^*$, the solution to the original problem (1) [16, 17].
3 The DualLoco algorithm
In this section we detail the DualLoco algorithm. DualLoco differs from the original Loco algorithm in two important ways. (i) The random features from each worker are summed, rather than concatenated, to obtain a $\tau$-dimensional approximation, allowing for an efficient implementation in a large-scale distributed environment. (ii) Each worker solves a local dual problem similar to (3). This allows us to extend the theoretical guarantees to a larger class of estimation problems beyond ridge regression (§4).

We consider the case where the $p$ features are distributed across $K$ different workers in non-overlapping subsets of equal size $p/K$ (this is for simplicity of notation only; in general the partitions can be of different sizes).
Since most loss functions of interest are not separable across coordinates, a key challenge addressed by DualLoco is to define a local minimization problem for each worker to solve independently and asynchronously while still maintaining important dependencies between features in different blocks and keeping communication overhead low. Algorithm 1 details DualLoco in full.
We can rewrite (1), making explicit the contribution from block $k$. Letting $\mathbf{X}_k \in \mathbb{R}^{n \times p/K}$ be the sub-matrix whose columns correspond to the coordinates in block $k$ (the “raw” features of block $k$) and $\mathbf{X}_{(-k)}$ be the remaining columns of $\mathbf{X}$, we have

(4)  $\min_{\beta} \; \sum_{i=1}^{n} f_i\left(\mathbf{x}_{k,i}^\top \beta_k + \mathbf{x}_{(-k),i}^\top \beta_{(-k)}\right) + \frac{\lambda}{2}\left(\|\beta_k\|_2^2 + \|\beta_{(-k)}\|_2^2\right)$

where $\mathbf{x}_{k,i}$ and $\mathbf{x}_{(-k),i}$ are the $i$-th rows of $\mathbf{X}_k$ and $\mathbf{X}_{(-k)}$ respectively. We replace $\mathbf{X}_{(-k)}$ in each block with a low-dimensional randomized approximation which preserves its contribution to the loss function. This procedure is described in Figure 1.

In Step 5, these matrices of random features are communicated and worker $k$ constructs the matrix

(5)  $\bar{\mathbf{X}}_k = \left[\mathbf{X}_k, \; \textstyle\sum_{k' \neq k} \mathbf{X}_{k'}\Pi_{k'}\right] \in \mathbb{R}^{n \times (p/K + \tau)}$

which is the concatenation of worker $k$'s raw features and the sum of the random features from all other workers. $\Pi_{k'}$ is the SRHT matrix introduced in §2.2.

As we prove in Lemma 2, summing the $\tau$-dimensional random projections from the individual blocks is equivalent to computing a single $\tau$-dimensional random projection of all of the remaining raw features in one go. The latter operation is impractical for very large $p$ and not applicable when the features are distributed. Therefore, summing the random features from each worker allows the dimensionality reduction to be distributed across $K$ workers. Additionally, the summed random feature representation can be computed and combined very efficiently. We elaborate on this aspect in §5.
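The block-summing identity is easy to check numerically for a generic linear sketch $\Pi$ whose rows are partitioned across workers (our own sketch; Lemma 2 treats the SRHT case, where each worker applies its own structured projection):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K, tau = 8, 12, 3, 5
X = rng.normal(size=(n, p))
Pi = rng.normal(size=(p, tau))        # any linear sketch, partitioned by rows

blocks = np.split(np.arange(p), K)    # feature partition across the K workers
# each worker projects only its own columns with the matching rows of Pi
partial = [X[:, b] @ Pi[b, :] for b in blocks]
summed = np.sum(partial, axis=0)      # a single tree-reduce sum in practice

assert np.allclose(summed, X @ Pi)    # same as projecting all of X at once
```

The identity is just block matrix multiplication, $\mathbf{X}\Pi = \sum_k \mathbf{X}_k \Pi_k$, which is why the sum can be computed where the data lives and combined in one reduction.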
For a single worker the local, approximate primal problem is then

(6)  $\min_{\bar{\beta}_k \in \mathbb{R}^{p/K + \tau}} \; \sum_{i=1}^{n} f_i\left(\bar{\mathbf{x}}_{k,i}^\top \bar{\beta}_k\right) + \frac{\lambda}{2}\|\bar{\beta}_k\|_2^2$

where $\bar{\mathbf{x}}_{k,i}$ is the $i$-th row of $\bar{\mathbf{X}}_k$. The corresponding dual problem for each worker in the DualLoco algorithm is

(7)  $\max_{\alpha \in \mathbb{R}^n} \; -\sum_{i=1}^{n} f_i^*(\alpha_i) - \frac{1}{2\lambda} \alpha^\top \bar{\mathbf{X}}_k \bar{\mathbf{X}}_k^\top \alpha$
The following steps in Algorithm 1 detail respectively how the solution to (7) and the final DualLoco estimates are obtained.
Step 6. LocalDualSolver.
Step 7. Obtaining the global primal solution.
Each worker maps its local dual solution to the primal solution corresponding only to the coordinates in block $k$. In this way, each worker returns coefficients corresponding only to its own raw features. The final primal solution vector is obtained by concatenating the local solutions. Unlike Loco, we no longer need to discard the coefficients corresponding to the random features for each worker. Consequently, computing estimates is more efficient (especially when $\tau$ is large).
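Step 7 can be sketched as follows (a hedged illustration assuming the dual-to-primal mapping $\beta = -\frac{1}{\lambda}\mathbf{X}^\top\alpha$; the sign convention and the function name are our own):

```python
import numpy as np

def local_primal_from_dual(X_raw, alpha_k, lam):
    """Worker k maps its local dual solution back to primal coordinates,
    using only its raw-feature columns; the coefficients belonging to the
    tau random features are simply never formed, so nothing is discarded."""
    return -(1.0 / lam) * X_raw.T @ alpha_k

rng = np.random.default_rng(1)
n, p_k, lam = 6, 4, 0.5
X_raw = rng.normal(size=(n, p_k))     # worker k's raw features only
alpha_k = rng.normal(size=n)          # worker k's local dual solution
beta_k = local_primal_from_dual(X_raw, alpha_k, lam)
# the master concatenates the K blocks: beta = np.concatenate([beta_1, ..., beta_K])
```

Because the mapping only touches the $p/K$ raw columns, each worker's message back to the master is a vector of length $p/K$, independent of $\tau$.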
4 DualLoco Approximation Error
In this section we bound the recovery error between the DualLoco solution and the solution to Eq. (1).
Theorem 1 (DualLoco error bound).
Consider a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with rank $r$. Assume that the loss is smooth and Lipschitz continuous. For a suitable subsampling dimension $\tau$, let $\beta^*$ be the solution to (1) and $\hat{\beta}$ be the estimate returned by Algorithm 1. We have, with probability at least $1 - \delta$,
where 
Proof.
Theorem 1 guarantees that the solution found by DualLoco will be close to the optimal solution obtained by a single worker with access to all of the data. Our result relies on the data having low rank $r$. In practice, this assumption is often fulfilled, in particular when the data is high-dimensional. For a large enough projection dimension, the bound has only a weak dependence on $K$ through the union bound used to determine the failure probability. The error is then mainly determined by the ratio between the rank $r$ and the random projection dimension $\tau$. When the rank of $\mathbf{X}$ increases for a fixed $\tau$, we need a larger projection dimension to accurately capture its spectrum. On the other hand, the failure probability increases with $K$ and decreases with $\tau$. However, this countering effect is negligible as typically $K \ll \tau$.
5 Implementation and Experiments
In this section we report on the empirical performance of DualLoco in two sets of experiments. The first demonstrates the performance of DualLoco in a large, distributed classification task. The second is an application of penalized regression to a problem in climate science where accurate recovery of the coefficient estimates is of primary interest.
Cross validation. In most practical cases, the regularization parameter $\lambda$ is unknown and has to be determined via $k$-fold cross validation (CV). The chosen algorithm is usually run entirely once for each fold and each candidate value of $\lambda$, leading to a runtime that is larger than that of a single run by roughly a factor of the number of folds times the number of candidate values (approximately, since the cross validation procedure also requires time for testing; for a single run we only count the time it takes to estimate the parameters). In this context, DualLoco has the advantage that steps 3 and 4 in Algorithm 1 are independent of $\lambda$. Therefore, these steps only need to be performed once per fold. In step 5, we then estimate the coefficients for each value in the provided sequence for $\lambda$. Thus, the runtime of DualLoco will increase by much less than this factor compared to the runtime of a single run. The performance of each value for $\lambda$ is then not only averaged over the random split of the training data set into $k$ parts but also over the randomness introduced by the random projections, which are computed and communicated once per fold. The procedure is provided in full detail in Algorithm 2 in Appendix C.
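The fold structure described above can be sketched as follows (hypothetical helper names: `project` stands in for the $\lambda$-independent steps 3–4 of Algorithm 1 and `fit` for the $\lambda$-dependent local solves; neither is the paper's API):

```python
import numpy as np

def cv_select_lambda(X, y, lambdas, n_folds, project, fit, rng):
    """Select lambda by k-fold CV, computing the (randomized) feature
    representation once per fold and reusing it for every lambda."""
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    errs = np.zeros(len(lambdas))
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        feats = project(X[train])               # once per fold, not per lambda
        for j, lam in enumerate(lambdas):       # the lambda sweep reuses feats
            beta = fit(feats, y[train], lam)
            errs[j] += np.mean((X[fold] @ beta - y[fold]) ** 2)
    return lambdas[int(np.argmin(errs))]
```

Hoisting `project` out of the inner loop is exactly what makes the CV overhead grow with the number of folds rather than with folds times candidate values.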
Implementation details. We implemented DualLoco using the Apache Spark framework (a software package will be made available under the Apache license). Spark is increasingly gaining traction in the research community as well as in industry due to its easy-to-use high-level API and the benefits of in-memory processing; it is up to 100× faster than Hadoop MapReduce. Additionally, Spark can be used in many different large-scale computing environments, and its various, easily integrated libraries for a diverse set of tasks greatly facilitate the development of applications.
When communicating and summing the random features in Spark, DualLoco leverages the treeReduce scheme as illustrated in Figure 2(a). Summing has the advantage that increasing the number of workers simply introduces more layers in the tree structure (Figure 2(b)) while the load on the driver remains constant, and the aggregation operation also benefits from parallel execution. Thus, increasing $K$ introduces only relatively little additional communication cost, which leads to the speedups demonstrated below.
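A serial imitation of the treeReduce pattern (our own sketch; in Spark each level's pairwise sums run on different executors and only the final result reaches the driver):

```python
import numpy as np

def tree_reduce_sum(mats):
    """Pairwise (tree) summation: each level halves the number of messages,
    so the driver only ever receives one fully combined result."""
    mats = list(mats)
    while len(mats) > 1:
        nxt = [mats[i] + mats[i + 1] for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2 == 1:
            nxt.append(mats[-1])      # an odd worker passes through unchanged
        mats = nxt
    return mats[0]
```

With $K$ workers the tree has depth $\lceil \log_2 K \rceil$, so doubling $K$ adds only one aggregation level instead of doubling the driver's load.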
In practice, we used the discrete cosine transform (DCT) provided in the FFT library jTransforms (https://sites.google.com/site/piotrwendykier/software/jtransforms); for the Hadamard transform, the dimension must be a power of two, whereas for the DCT there is no such restriction and very similar theoretical guarantees hold. We ran DualLoco as well as CoCoA on a high-performance cluster (CoCoA is also implemented in Spark, with code available from https://github.com/gingsmith/cocoa).
Competing methods. For the classification example, the loss function is the hinge loss. Although this problem is non-smooth, and therefore not covered by our theory, we still obtain good results, suggesting that Theorem 1 can be generalized to non-smooth losses. Alternatively, for classification the smoothed hinge or logistic losses could be used. For the regression problem we use the squared error loss and modify CoCoA accordingly. As the LocalDualSolver we use SDCA [18].
We also compared DualLoco against the reference implementation of distributed loss minimization in Spark's MLlib library using SGD and L-BFGS. However, after extensive cross-validation over the regularization strength (and the step size and mini-batch size in the case of SGD), we observed that the variance was still very large, so we omit the MLlib implementations from the figures. A comparison between CoCoA and variants of SGD and mini-batch SDCA can be found in [5].

Kaggle Dogs vs Cats dataset. This is a binary classification task consisting of images of dogs and cats (https://www.kaggle.com/c/dogsvscats). We resize all images to a common size and use Overfeat [19] – a pre-trained convolutional neural network – to extract fully dense feature vectors from one of its layers for each image. We train on a subset of the images and test on the remainder. The training data contains over 4 billion non-zero elements. All results we report in the following are averaged over five repetitions, and by “runtime” we refer to wall clock time.

Figure 3 shows the median normalized training and test prediction MSE of DualLoco and CoCoA for different numbers of workers (in practice, this choice will depend on the available resources in addition to the size of the data set). For DualLoco, we also vary the size of the random feature representation, choosing $\tau$ as a fraction of $p$; the corresponding errors are labeled accordingly. Note that combinations of $K$ and $\tau$ that would result in $\tau > p/K$ cannot be used. We ran CoCoA until a small duality gap was attained, so that the number of iterations varies for different numbers of workers. Notably, in one setting more iterations were needed than in the other cases, which is reflected in the very low training error in this case. The fraction of local points to be processed per round was set to 10%. We determined the regularization parameter via 5-fold cross validation.
While the differences in training errors between DualLoco and CoCoA are notable, the differences between the test errors are minor as long as the random feature representation is large enough. Choosing $\tau$ to be the smallest considered fraction of $p$ seems to be slightly too small for this data set. With the larger choices of $\tau$, the largest difference between the test errors of DualLoco and CoCoA is 0.9%. The averaged mean squared prediction errors and their standard deviations are collected in Table 1 in Appendix C.

Next, we compare the wall clock time needed to find the regularization parameter via 5-fold cross validation. For CoCoA, using the number of iterations needed to attain the tight duality gap would lead to runtimes of more than 24 hours when comparing the candidate values for $\lambda$. One might argue that a looser duality gap is sufficient for the cross validation runs, which would speed up the model selection procedure significantly as many fewer iterations would be required. Therefore, we use a looser duality gap during cross validation and the tight duality gap for learning the parameters once $\lambda$ has been determined. Figure 4 shows the runtimes when a small number of candidate values for $\lambda$ are compared; Figure 6(a) compares the runtimes when cross validation is performed over a larger set of values. The absolute runtime of CoCoA for a single run is smaller for the smaller numbers of workers and larger for the largest, so using more workers increased the amount of wall clock time necessary for job completion. The total runtime, including cross validation and a single run to learn the parameters with the determined value for $\lambda$, is smaller for DualLoco in all but one configuration.
Figures 5 and 6(b) show the relative speedup of DualLoco and CoCoA as $K$ increases. The speedup is computed by dividing the runtime for the smallest number of workers by the runtime achieved for the corresponding larger $K$. A speedup value smaller than 1 implies an increase in runtime. When considering a single run, we run CoCoA in two different settings: (i) We use the number of iterations that are needed to obtain the tight duality gap, which varies for different numbers of workers. Here, the speedup is smaller than 1 for all $K$. (ii) We fix the number of outer iterations to a constant number. As $K$ increases, the number of inner iterations decreases, making it easier for CoCoA to achieve a speedup. We found that although CoCoA attains a speedup when the number of workers is first increased, it suffers an increase in runtime when $K$ is increased further.
For the larger random feature representations we observe significant speedups for DualLoco as $K$ increases. As we split the design matrix by features, the number of observations remains constant for different numbers of workers. At the same time, the dimensionality of each worker's local problem decreases with $K$. Together with the efficient aggregation of the random features, this leads to shorter runtimes. For the smallest random feature representation, the communication costs dominate the costs of computing the random projection and of the LocalDualSolver, resulting in much smaller speedups.
Although CoCoA was demonstrated to obtain speedups for low-dimensional data sets [2], it is plausible that the same performance cannot be expected on a very high-dimensional data set. This illustrates that in such a high-dimensional setting, splitting the design matrix according to the columns instead of the rows is more suitable.
Climate data. This is a regression task where we demonstrate that the coefficients returned by DualLoco are interpretable. The data set contains the outcome of control simulations of the GISS global circulation model [20] and is part of the CMIP5 climate modeling ensemble. We aim to forecast the monthly global average temperature in February using the air pressure measured in January. Results are very similar for other months. The features are pressure measurements taken at $p$ geographic grid points in January. The time span of the climate simulation is 531 years and we use the results from two control simulations, yielding $n = 1062$ observations.
In Figure 7 we compare the coefficient estimates from four different methods. The problem is small enough to be solved on a single machine, so the full solution can be computed (using SDCA; cf. Figure 7(a)). This allows us to report the normalized parameter estimation mean squared error with respect to the full solution in addition to the normalized mean squared prediction error (MSE). The solution indicates that the pressure differential between Siberia (red area, top middle-left) and Europe and the North Atlantic (blue area, top left and top right) is a good predictor of the temperature anomaly. This pattern is concealed in Figure 7(b), which shows the result of up-projecting the coefficients estimated after a random projection of the columns. Using this scheme for prediction was introduced in [15]. Although the MSE is similar to that of the optimal solution, the recovered coefficients are not interpretable, as suggested by [16]. Thus, this method should only be used if prediction is the sole interest. Figure 7(c) shows the estimates returned by DualLoco, which recovers estimates that are close to the full solution. Finally, Figure 7(d) shows that CoCoA also attains accurate results.
Considering a longer time period or adding additional model variables such as temperature, precipitation or salinity rapidly increases the dimensionality of the problem while the number of observations remains constant. Each additional variable adds one dimension per grid point per month of simulation. Estimating very high-dimensional linear models is a significant challenge in climate science and one where distributing the problem across features instead of observations is advantageous. The computational savings are much larger when distributing across features: since $p \gg n$, reducing the local dimensionality yields larger gains than reducing the number of local observations would when distributing across observations.
6 Conclusions and further work
We have presented DualLoco, which addresses the challenging and rarely studied problem of statistical estimation when data is distributed across features rather than samples. DualLoco generalizes Loco to a wider variety of loss functions for regression and classification. We show that the estimated coefficients are close to the optimal coefficients that could be learned by a single worker with access to the entire dataset. The resulting bound is more intuitive and tighter than previous bounds, notably with a very weak dependence on the number of workers. We have demonstrated that DualLoco is able to recover accurate solutions for large-scale estimation tasks whilst also achieving better scaling than a state-of-the-art competitor, CoCoA, as $K$ increases. Additionally, we have shown that DualLoco allows for fast model selection using cross-validation.
The dual formulation is convenient for $\ell_2$-penalized problems, but other penalties are not as straightforward to handle. Similarly, the theory only holds for smooth loss functions. However, as demonstrated empirically, DualLoco also performs well with a non-smooth loss function.
As $n$ grows very large, the random feature matrices may become too large to communicate efficiently even when the projection dimension is very small. For such situations, there are a few simple extensions we aim to explore in future work. One possibility is to first perform row-wise random projections (cf. [21]) to further reduce the communication requirement. Another option is to distribute according to both rows and columns.
Contrary to stochastic optimization methods, the communication of DualLoco is limited to a single round. For fixed $n$, $\tau$ and $K$, the amount of communication is deterministic and can be fixed ahead of time. This can be beneficial in settings where there are additional constraints on communication (for example, when different blocks of features are distributed a priori across different physical locations).
Clearly with additional communication, the theoretical and practical performance of DualLoco could be improved. For example, [16] suggest an iterative dual random projection scheme which can reduce the error in Lemma 4 exponentially. A related question for future research involves quantifying the amount of communication performed by DualLoco in terms of known minimax lower bounds [22].
References
 [1] Christina Heinze, Brian McWilliams, Nicolai Meinshausen, and Gabriel Krummenacher. Loco: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469, 2014.

[2] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of The 32nd International Conference on Machine Learning, 2015.

[3] Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
 [4] John C. Duchi, Michael I. Jordan, and H. Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, 2013.
 [5] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.
 [6] Yuchen Zhang, John C Duchi, and Martin J Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. arXiv preprint arXiv:1305.5029, 2013.
 [7] Qiang Liu and Alex T Ihler. Distributed estimation, information loss and exponential families. In Advances in Neural Information Processing Systems, pages 1098–1106, 2014.
 [8] Nir Ailon and Bernard Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
 [9] Christos Boutsidis and Alex Gittens. Improved matrix algorithms via the Subsampled Randomized Hadamard Transform. 2012. arXiv:1204.0062v4 [cs.DS].
 [10] P Dhillon, Y Lu, D P Foster, and L Ungar. New Subsampling Algorithms for Fast Least Squares Regression. In Advances in Neural Information Processing Systems, 2013.
 [11] Brian McWilliams, Gabriel Krummenacher, Mario Lucic, and Joachim M Buhmann. Fast and robust least squares estimation in corrupted linear models. In Advances in Neural Information Processing Systems, pages 415–423, 2014.
 [12] Mert Pilanci and Martin J Wainwright. Newton sketch: A lineartime optimization algorithm with linearquadratic convergence. arXiv preprint arXiv:1505.02250, 2015.
 [13] Joel A Tropp. Improved analysis of the subsampled randomized Hadamard transform. November 2010. arXiv:1011.1595v4 [math.NA].
 [14] Ata Kabán. New bounds on compressive linear least squares regression. In Artificial Intelligence and Statistics, 2014.
 [15] Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized hadamard transform. In Advances in Neural Information Processing Systems 26, pages 369–377, 2013.
 [16] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Recovering optimal solution by dual random projection. arXiv preprint arXiv:1211.3046, 2012.
 [17] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 2014.
 [18] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
 [19] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR 2014). CBLS, April 2014.
 [20] Gavin A. Schmidt, Max Kelley, Larissa Nazarenko, Reto Ruedy, Gary L. Russell, Igor Aleinov, Mike Bauer, Susanne E. Bauer, Maharaj K. Bhat, Rainer Bleck, et al. Configuration and assessment of the GISS ModelE2 contributions to the CMIP5 archive. Journal of Advances in Modeling Earth Systems, 6(1):141–184, 2014.
 [21] Michael W Mahoney. Randomized algorithms for matrices and data. April 2011. arXiv:1104.5557v3 [cs.DS].
 [22] Yuchen Zhang, John Duchi, Michael Jordan, and Martin J Wainwright. Informationtheoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.
 [23] Joel A Tropp. Userfriendly tail bounds for sums of random matrices. April 2010. arXiv:1004.4389v7 [math.PR].
Appendix A Supplementary Results
Here we introduce two lemmas. The first describes the random projection construction which we use in the distributed setting.
Lemma 2 (Summing random features).
Consider the singular value decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, where $\mathbf{U}$ and $\mathbf{V}$ have orthonormal columns and $\boldsymbol{\Sigma}$ is diagonal; $C$ below is a fixed positive constant. In addition to the raw features, let $\bar{\mathbf{X}}_k$ contain the $\tau$ random features which result from summing the random projections from the other workers. Furthermore, assume without loss of generality that the problem is permuted so that the raw features of worker $k$'s problem are the first $p/K$ columns of $\mathbf{X}$. Finally, let $\tilde{\mathbf{V}}$ be such that $\bar{\mathbf{X}}_k = \mathbf{U}\boldsymbol{\Sigma}\tilde{\mathbf{V}}^\top$.

Then, with probability at least $1 - \delta$, $\big\|\tilde{\mathbf{V}}^\top\tilde{\mathbf{V}} - \mathbf{I}\big\|_2 \leq \varepsilon$.
Proof.
See Appendix B. ∎
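As a concrete illustration of the SRHT construction underlying Lemma 2, the following NumPy sketch reduces the columns of a matrix to m random features. This is an illustrative toy version, not the paper's implementation: the function name, the padding-to-a-power-of-two step, and the seeding are all assumptions made for the example.

```python
import numpy as np

def srht_sketch(X, m, seed=0):
    """Reduce the p columns of X (n x p) to m random features via an
    SRHT-style map: X -> sqrt(q/m) * (X D) H S, where D is a random sign
    diagonal, H the orthonormal Walsh-Hadamard matrix (after padding the
    feature dimension to the next power of two q), and S selects m
    columns uniformly without replacement."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = 1 << (p - 1).bit_length()            # pad to a power of two
    Xp = np.pad(X, ((0, 0), (0, q - p)))     # zero-pad extra columns
    D = rng.choice([-1.0, 1.0], size=q)      # random signs
    H = np.array([[1.0]])
    while H.shape[0] < q:                    # Sylvester construction of H
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    H /= np.sqrt(q)                          # orthonormal scaling
    cols = rng.choice(q, size=m, replace=False)
    return np.sqrt(q / m) * (Xp * D) @ H[:, cols]
```

When m equals the padded dimension q, the map is an orthogonal transform up to signs and so preserves the Frobenius norm exactly; for m < q it preserves it only in expectation.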
Definition 1.
For ease of exposition, we shall rewrite the dual problems so that we consider minimizing convex objective functions. More formally, the original problem is then given by
(9) 
The problem worker k solves is described by
(10) 
Recall that worker k's local design matrix is the concatenation of its raw features and the random features.
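The concatenation just described can be sketched as follows. For brevity this toy version uses Gaussian random projections in place of the SRHT; the function name and argument layout are illustrative assumptions, not the paper's code.

```python
import numpy as np

def local_design(blocks, k, m, seed=0):
    """Form worker k's local design matrix: its raw feature block
    concatenated with the sum of random projections of the other
    workers' feature blocks (Gaussian projections stand in for the
    SRHT here)."""
    rng = np.random.default_rng(seed)
    n = blocks[0].shape[0]
    summed = np.zeros((n, m))
    for j, Xj in enumerate(blocks):
        if j == k:
            continue
        Pi = rng.standard_normal((Xj.shape[1], m)) / np.sqrt(m)
        summed += Xj @ Pi                    # random features from worker j
    return np.hstack([blocks[k], summed])    # [raw | summed random features]
```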
To proceed we need the following result, which relates the solution of the original problem to that of the approximate problem solved by worker k.
Proof.
See [17]. ∎
For our main result, we rely heavily on the following variant of Theorem 1 in [17], which bounds the difference between the coefficients estimated by worker k and the corresponding coordinates of the optimal solution vector.
Lemma 4 (Local optimization error. Adapted from [17]).
For the following holds
with probability at least .
The proof closely follows that of Theorem 1 in [17], which we restate here while identifying the major differences.
Proof.
Let the quantities be as in Definition 1. For ease of notation, we shall omit the subscript k in the following.
By the SVD we have . So and . We can make the following definitions
Defining and plugging these into Lemma 3 we obtain
(12) 
We now bound the spectral norm of using Lemma 2. Recall that Lemma 2 bounds the difference between a matrix and its approximation by a distributed dimensionality reduction using the SRHT.
Using the Cauchy-Schwarz inequality we have for the l.h.s. of (12)
For the r.h.s. of (12), we can write
Combining these two expressions and inequality (12) yields
(13) 
From the definitions above we have
so, by the orthonormality of the columns, the corresponding terms simplify. Plugging this into (13) we obtain the stated result. ∎
Appendix B Proof of Row Summing Lemma
Proof of Lemma 2.
Let contain the first rows of and let be the matrix containing the remaining rows. Decompose the matrix products as follows
and  
with . Then
Since
is an orthogonal matrix, from Lemma 3.3 in
[13] and Lemma 5, summing the independent SRHTs applied by the individual workers is equivalent to applying a single SRHT. Therefore we can simply apply Lemma 1 of [15] to the above to obtain the result. ∎
Lemma 5 (Summed row sampling).
Let be an matrix with orthonormal columns. Let be a balanced, random partitioning of the rows of where each matrix has exactly rows. Define the quantity . For a positive parameter , select the subsample size
Let the sampling operation select a subset of the rows uniformly at random by sampling coordinates without replacement. Now denote the sum of the subsampled rows by
Then
and  
with failure probability at most
Proof.
Define as the row of and . Suppose and consider the matrix
In general, we can express as
By the orthonormality of , the cross terms cancel as , yielding
We can consider as a sum of random matrices
sampled uniformly at random without replacement from the family .
To use the matrix Chernoff bound in Lemma 6, we require the quantities , and . Noticing that , we can set .
Taking expectations with respect to the random partitioning and the subsampling within each partition, and using the fact that the columns are orthonormal, we obtain
Recall that we take samples in blocks, so we can define
Plugging these values into Lemma 6, the lower and upper Chernoff bounds respectively yield
Arguing similarly for the remaining term and using the identity above, we obtain the desired result. ∎
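A minimal sketch of the sampling scheme in Lemma 5: partition the rows into balanced blocks at random, subsample within each block without replacement, and sum the sampled rows. The helper name and seeding are illustrative assumptions.

```python
import numpy as np

def summed_row_sample(W, num_blocks, l, seed=0):
    """Partition the rows of W into num_blocks balanced blocks at random,
    subsample l rows without replacement within each block, and return
    one summed row per block, mirroring the scheme of Lemma 5."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    assert n % num_blocks == 0, "balanced partition requires num_blocks | n"
    partition = rng.permutation(n).reshape(num_blocks, n // num_blocks)
    rows = []
    for block in partition:
        picked = rng.choice(block, size=l, replace=False)
        rows.append(W[picked].sum(axis=0))   # one summed row per block
    return np.vstack(rows)
```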
For ease of reference, we also restate the Matrix Chernoff bound from [13, 23] but defer its proof to the original papers.
Lemma 6 (Matrix Chernoff from [13]).
Let be a finite set of positive-semidefinite matrices with dimension, and suppose that
Sample uniformly at random from without replacement. Compute
Then
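To make the quantities in the matrix Chernoff bound concrete, the following sketch (all names are illustrative, not from the paper) empirically compares the extreme eigenvalues of a sum of l matrices sampled without replacement from a PSD family against l times the extreme eigenvalues of the family mean.

```python
import numpy as np

def chernoff_extremes(family, l, trials=200, seed=0):
    """Empirically compare the extreme eigenvalues of a sum of l PSD
    matrices drawn without replacement from `family` with
    mu_min = l * lambda_min(mean) and mu_max = l * lambda_max(mean),
    the expectations appearing in the matrix Chernoff bound."""
    rng = np.random.default_rng(seed)
    mean = sum(family) / len(family)
    w = np.linalg.eigvalsh(mean)
    mu_min, mu_max = l * w[0], l * w[-1]
    lows, highs = [], []
    for _ in range(trials):
        idx = rng.choice(len(family), size=l, replace=False)
        Y = sum(family[i] for i in idx)      # sampled sum without replacement
        ev = np.linalg.eigvalsh(Y)
        lows.append(ev[0])
        highs.append(ev[-1])
    return mu_min, mu_max, np.mean(lows), np.mean(highs)
```

On average the sampled minimum eigenvalue falls below mu_min and the sampled maximum above mu_max, which is the regime the lower and upper Chernoff tails control.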
Appendix C Supplementary Material for Section 5
Algorithm     | K   | Test MSE          | Train MSE
DualLoco 0.5  | 12  | 0.0343 (3.75e-03) | 0.0344 (2.59e-03)
DualLoco 0.5  | 24  | 0.0368 (4.22e-03) | 0.0344 (3.05e-03)
DualLoco 0.5  | 48  | 0.0328 (3.97e-03) | 0.0332 (2.91e-03)
DualLoco 0.5  | 96  | 0.0326 (3.13e-03) | 0.0340 (2.67e-03)
DualLoco 0.5  | 192 | 0.0345 (3.82e-03) | 0.0345 (2.69e-03)
DualLoco 1    | 12  | 0.0310 (2.89e-03) | 0.0295 (2.28e-03)
DualLoco 1    | 24  | 0.0303 (2.87e-03) | 0.0307 (1.44e-03)
DualLoco 1    | 48  | 0.0328 (1.92e-03) | 0.0329 (1.55e-03)
DualLoco 1    | 96  | 0.0299 (1.07e-03) | 0.0299 (7.77e-04)
DualLoco 2    | 12  | 0.0291 (2.16e-03) | 0.0280 (6.80e-04)
DualLoco 2    | 24  | 0.0306 (2.38e-03) | 0.0279 (1.24e-03)
DualLoco 2    | 48  | 0.0285 (6.11e-04) | 0.0293 (4.77e-04)
CoCoA         | 12  | 0.0282 (4.25e-18) | 0.0246 (2.45e-18)
CoCoA         | 24  | 0.0278 (3.47e-18) | 0.0212 (3.00e-18)
CoCoA         | 48  | 0.0246 (6.01e-18) | 0.0011 (1.53e-19)
CoCoA         | 96  | 0.0254 (5.49e-18) | 0.0137 (1.50e-18)
CoCoA         | 192 | 0.0268 (1.23e-17) | 0.0158 (6.21e-18)