DUAL-LOCO: Distributing Statistical Estimation Using Random Projections

06/08/2015 · by Christina Heinze, et al.

We present DUAL-LOCO, a communication-efficient algorithm for distributed statistical estimation. DUAL-LOCO assumes that the data is distributed according to the features rather than the samples. It requires only a single round of communication where low-dimensional random projections are used to approximate the dependencies between features available to different workers. We show that DUAL-LOCO has bounded approximation error which depends only weakly on the number of workers. We compare DUAL-LOCO against a state-of-the-art distributed optimization method on a variety of real-world datasets and show that it obtains better speedups while retaining good accuracy.


1 Introduction

Many statistical estimation tasks amount to solving an optimization problem of the form

    β* = argmin_{β ∈ R^p} J(β) := Σ_{i=1}^n f_i(β^T x_i) + (λ/2) ||β||_2^2    (1)

where λ > 0 is the regularization parameter. The loss functions f_i depend on the labels y_i and linearly on the coefficients β through a vector of covariates x_i ∈ R^p. Furthermore, we assume all f_i to be convex and smooth with Lipschitz continuous gradients. Concretely, when f_i(β^T x_i) = (y_i − β^T x_i)^2, Eq. (1) corresponds to ridge regression; for logistic regression f_i(β^T x_i) = log(1 + exp(−y_i β^T x_i)).
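As a concrete illustration of the objective in (1), the following NumPy sketch (a minimal example of ours, assuming the unnormalized form of (1) stated above; the function and variable names are not from the paper) evaluates the penalized objective for the squared and logistic losses.

```python
import numpy as np

def primal_objective(X, y, beta, lam, loss="squared"):
    """Evaluate J(beta) = sum_i f_i(x_i^T beta) + (lam / 2) * ||beta||_2^2."""
    margins = X @ beta                        # linear predictions x_i^T beta
    if loss == "squared":                     # ridge regression
        losses = (y - margins) ** 2
    elif loss == "logistic":                  # logistic regression, labels in {-1, +1}
        losses = np.log1p(np.exp(-y * margins))
    else:
        raise ValueError("unknown loss")
    return losses.sum() + 0.5 * lam * (beta @ beta)

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta = rng.standard_normal(20)
y = np.sign(X @ beta + 0.1 * rng.standard_normal(100))
print(primal_objective(X, y, beta, lam=1.0, loss="logistic"))
```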

For large-scale problems, it is no longer practical to solve even relatively simple estimation tasks such as (1) on a single machine. To deal with this, approaches to distributed data analysis have been proposed that take advantage of many cores or computing nodes on a cluster. A common idea linking many of these methods is stochastic optimization. Typically, each of the workers only sees a small portion of the data points and performs incremental updates to a global parameter vector. It is typically assumed that the number of data points, n, is very large compared with the number of features, p, or that the data is extremely sparse. In such settings – which are common, but not ubiquitous in large datasets – distributed stochastic optimization algorithms perform well but may converge slowly otherwise.

A fundamentally different approach to distributing learning is for each worker to only have access to a portion of the available features. Distributing according to the features can be a preferable alternative for several reasons. Firstly, for high-dimensional data, where p is large relative to n, better scaling can be achieved. This setting is challenging, however, since most loss functions are not separable across coordinates. High-dimensional data is commonly encountered in the fields of bioinformatics, climate science and computer vision. Furthermore, for a variety of prediction tasks it is often beneficial to map input vectors into a higher-dimensional feature space, e.g. using deep representation learning or considering higher-order interactions. Secondly, privacy: individual blocks of features could correspond to sensitive information (such as medical records) which should be included in the predictive model but is not allowed to be communicated in an undisguised form.

Our contribution.

In this work we introduce Dual-Loco to solve problems of the form (1) in the distributed setting when each worker only has access to a subset of the features. Dual-Loco is an extension of the Loco algorithm [1] which was recently proposed for solving distributed ridge regression in this setting. We propose an alternative formulation where each worker instead locally solves a dual optimization problem. Dual-Loco has a number of practical and theoretical improvements over the original algorithm:

  • Dual-Loco is applicable to a wider variety of smooth, convex penalized loss minimization problems encompassing many widely used regression and classification loss functions, including ridge regression, logistic regression and others.

  • In §4 we provide a more intuitive and tighter theoretical result which crucially does not depend on specific details of the ridge regression model and has a weaker dependence on the number of workers, K.

  • We also show that adding (rather than concatenating) random features allows for an efficient implementation yet retains good approximation guarantees.

In §5 we report experimental results with high-dimensional real-world datasets corresponding to two different problem domains: climate science and computer vision. We compare Dual-Loco with CoCoA, a recently proposed state-of-the-art algorithm for distributed dual coordinate ascent [2]. Our experiments show that Dual-Loco demonstrates better scaling with K than CoCoA while retaining a good approximation of the optimal solution. We provide an implementation of Dual-Loco in Apache Spark (http://spark.apache.org/). The portability of this framework ensures that Dual-Loco can be run in a variety of distributed computing environments.

2 Related work

2.1 Distributed Estimation

Recently, several asynchronous stochastic gradient descent (SGD) methods [3, 4] have been proposed for solving problems of the form (1) in a parallel fashion in a multi-core, shared-memory environment and have been extended to the distributed setting. For such methods, large speedups are possible with asynchronous updates when the data is sparse. However, in some problem domains the data collected is dense with many correlated features. Furthermore, the high-dimensional setting can result in slow convergence. In the distributed setting, such methods can be impractical since the cost of communicating updates can dominate other computational considerations.

Jaggi et al. proposed a communication-efficient distributed dual coordinate ascent algorithm (CoCoA, resp. CoCoA+) [5, 2]. Each worker makes multiple updates to its local dual variables before communicating the corresponding primal update. This allows for trading off communication and convergence speed. Notably, they show that convergence is actually independent of the number of workers, thus CoCoA exhibits strong scaling with K.

Other recent work considers solving statistical estimation tasks using a single round of communication [6, 7]. However, all of these methods consider only distributing over the rows of the data where an i.i.d. assumption on the observations holds.

On the other hand, few approaches have considered distributing across the columns (features) of the data. This is a more challenging task for both estimation and optimization since the columns are typically assumed to have arbitrary dependencies and most commonly used loss functions are not separable over the features. Recently, Loco was proposed to solve ridge regression when the data is distributed across the features [1]. Loco requires a single round to communicate small matrices of randomly projected features which approximate the dependencies in the rest of the dataset (cf. Figure 1). Each worker then optimizes its own sub-problem independently and finally sends its portion of the solution vector back to the master where they are combined. Loco makes no assumptions about the correlation structure between features. It is therefore able to perform well in challenging settings where the features are correlated between blocks and is particularly suited to the case where p > n. Indeed, since the relative dimensionality of the local problems decreases when splitting by columns, they are easier in a statistical sense. Loco makes no assumptions about data sparsity, so it is also able to obtain speedups when the data is dense.

One-shot communication schemes are beneficial as the cost of communication consists of a fixed cost and a cost that is proportional to the size of the message. Therefore, it is generally cheaper to communicate a few large objects than many small objects.

2.2 Random projections for estimation and optimization

Figure 1: Schematic for the distributed approximation of a large data set with random projections, used by Dual-Loco.

Random projections are low-dimensional embeddings which approximately preserve an entire subspace of vectors. They have been extensively used to construct efficient algorithms when the sample-size is large in a variety of domains such as: nearest neighbours [8], matrix factorization [9], least squares [10, 11] and recently in the context of optimization [12].

We concentrate on the Subsampled Randomized Hadamard Transform (SRHT), a structured random projection [13]. The SRHT consists of a projection matrix composed of three components [9]: (i) S, a subsampling matrix; (ii) D, a diagonal matrix whose entries are drawn independently from {−1, 1}; (iii) H, a normalized Walsh–Hadamard matrix. The key benefit of the SRHT is that, due to the recursive definition of H, the product between H and a vector can be computed in O(τ log τ) time for a τ-dimensional input while never constructing H explicitly.
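The following is a minimal NumPy sketch of one standard SRHT construction (our own illustration: the exact scaling and the order in which S, H and D are applied follow common convention and may differ in detail from the implementation used for Dual-Loco).

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform along the last axis (length must be a power of two)."""
    a = a.copy()
    n = a.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            x = a[..., i:i + h].copy()
            y = a[..., i + h:i + 2 * h].copy()
            a[..., i:i + h] = x + y           # butterfly step of the recursion
            a[..., i + h:i + 2 * h] = x - y
        h *= 2
    return a

def srht(X, t_subs, rng):
    """Random features X @ Pi for one block X of shape (n, tau), tau a power of two."""
    n, tau = X.shape
    signs = rng.choice([-1.0, 1.0], size=tau)            # diagonal sign matrix D
    XH = fwht(X * signs) / np.sqrt(tau)                  # apply D, then the normalized H
    cols = rng.choice(tau, size=t_subs, replace=False)   # subsampling matrix S
    return XH[:, cols] * np.sqrt(tau / t_subs)           # rescale the surviving columns

# toy usage: one worker's block of 64 raw features compressed to 16 random features
rng = np.random.default_rng(0)
Xk = rng.standard_normal((100, 64))
print(srht(Xk, t_subs=16, rng=rng).shape)                # (100, 16)
```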

For moderately sized problems, random projections have been used to reduce the dimensionality of the data prior to performing regression [14, 15]. However, after projection the solution vector lies in the compressed space and so interpretability of the coefficients is lost. Furthermore, the projection of the low-dimensional solution back to the original high-dimensional space is in fact guaranteed to be a bad approximation of the optimum [16].
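The following toy comparison (our own construction, using a Gaussian projection in place of the SRHT) illustrates this point for ridge regression: the up-projected coefficients may predict reasonably well while remaining far from the optimal coefficients.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, t_subs, lam = 200, 2000, 400, 10.0
X = rng.standard_normal((n, p)) * (1.0 / np.sqrt(np.arange(1, p + 1)))  # decaying column scales
beta_true = np.zeros(p)
beta_true[:50] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# full ridge solution of (1) with the squared loss
beta_star = np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(p), X.T @ y)

# compress the columns, solve in the low-dimensional space, then project back up
Pi = rng.standard_normal((p, t_subs)) / np.sqrt(t_subs)
Xc = X @ Pi
gamma = np.linalg.solve(Xc.T @ Xc + 0.5 * lam * np.eye(t_subs), Xc.T @ y)
beta_up = Pi @ gamma

print("relative prediction difference:",
      np.linalg.norm(X @ beta_up - X @ beta_star) / np.linalg.norm(X @ beta_star))
print("relative coefficient error:",
      np.linalg.norm(beta_up - beta_star) / np.linalg.norm(beta_star))
```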

Dual Random Projections.

Recently, [16, 17] studied the effect of random projections on the dual optimization problem. For the primal problem in Eq. (1), the corresponding dual is

    α* = argmax_{α ∈ R^n} −D(α) := −Σ_{i=1}^n f_i^*(α_i) − (1/(2λ)) α^T X X^T α    (2)

where f_i^* is the conjugate Fenchel dual of f_i. For example, for squared loss functions f_i(u) = (u − y_i)^2, we have f_i^*(α_i) = α_i y_i + α_i^2 / 4. For problems of this form, the dual variables can be directly mapped to the primal variables, such that for a vector α* which attains the maximum of (2), the optimal primal solution has the form β* = −(1/λ) X^T α*.

Clearly, a similar dual problem to (2) can be defined in the projected space. Defining X̃ = X Π we have

    α̃ = argmax_{α ∈ R^n} −Σ_{i=1}^n f_i^*(α_i) − (1/(2λ)) α^T X̃ X̃^T α    (3)

Importantly, the vector of dual variables α does not change dimension depending on whether the original problem (2) or the projected problem (3) is being solved. Under mild assumptions on the loss function, by mapping the solution to this new problem, α̃, back to the original space one obtains a vector β̃ = −(1/λ) X^T α̃ which is a good approximation to β*, the solution to the original problem (1) [16, 17].
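The sketch below (our own illustration) makes the dual random projection recipe concrete for the squared loss, where both the original and the projected dual have closed-form solutions; a Gaussian projection stands in for the SRHT and the data are generated to have low effective rank.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, t_subs, lam = 200, 1000, 200, 10.0
Z = rng.standard_normal((n, 30))                                    # low-rank structure
X = Z @ rng.standard_normal((30, p)) + 0.01 * rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# exact solution of (1) for the squared loss: beta* = (X^T X + (lam/2) I)^{-1} X^T y
beta_star = np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(p), X.T @ y)

# projected dual (3): X_tilde = X Pi; for the squared loss the maximizer is
#   alpha_tilde = -(X_tilde X_tilde^T / lam + I/2)^{-1} y
Pi = rng.standard_normal((p, t_subs)) / np.sqrt(t_subs)
X_t = X @ Pi
alpha_t = -np.linalg.solve(X_t @ X_t.T / lam + 0.5 * np.eye(n), y)

# map the dual solution back through the ORIGINAL features: beta_tilde = -(1/lam) X^T alpha_tilde
beta_tilde = -(X.T @ alpha_t) / lam
print(np.linalg.norm(beta_tilde - beta_star) / np.linalg.norm(beta_star))
```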

3 The Dual-Loco algorithm

Input:  Data: X ∈ R^{n×p}, Y ∈ R^n; no. workers: K

Parameters: τ_subs, λ

1:  Partition {1, ..., p} into K subsets P_k of equal size and distribute the feature vectors in X accordingly over the K workers.
2:  for each worker k ∈ {1, ..., K} in parallel do
3:     Compute and send random features X_k Π_k.
4:     Receive random features and construct X̄_k according to (5).
5:     α̂_k ← LocalDualSolver(X̄_k, Y, λ)
6:     β̂_k ← −(1/λ) X_k^T α̂_k
7:     Send β̂_k to driver.
8:  end for

Output:  Solution vector: β̂ = [β̂_1^T, ..., β̂_K^T]^T

Algorithm 1 Dual-Loco

In this section we detail the Dual-Loco algorithm. Dual-Loco differs from the original Loco algorithm in two important ways. (i) The random features from each worker are summed, rather than concatenated, to obtain a τ_subs-dimensional approximation of the remaining blocks, allowing for an efficient implementation in a large-scale distributed environment. (ii) Each worker solves a local dual problem similar to (3). This allows us to extend the theoretical guarantees to a larger class of estimation problems beyond ridge regression (§4).

We consider the case where the p features are distributed across K different workers in non-overlapping subsets P_1, ..., P_K of equal size τ = p/K (this is for simplicity of notation only; in general the partitions can be of different sizes).

Since most loss functions of interest are not separable across coordinates, a key challenge addressed by Dual-Loco is to define a local minimization problem for each worker to solve independently and asynchronously while still maintaining important dependencies between features in different blocks and keeping communication overhead low. Algorithm 1 details Dual-Loco in full.

We can rewrite (1) making explicit the contribution from block k. Letting X_k be the sub-matrix whose columns correspond to the coordinates in P_k (the "raw" features of block k) and X_{(−k)} be the remaining columns of X, we have

    β* = argmin_{β ∈ R^p} Σ_{i=1}^n f_i(x_{i,k}^T β_k + x_{i,(−k)}^T β_{(−k)}) + (λ/2)(||β_k||_2^2 + ||β_{(−k)}||_2^2)    (4)

where x_{i,k} and x_{i,(−k)} are the i-th rows of X_k and X_{(−k)} respectively. We replace X_{(−k)} in each block with a low-dimensional randomized approximation which preserves its contribution to the loss function. This procedure is described in Figure 1.

In Steps 3 and 4, these matrices of random features are communicated and worker k constructs the matrix

    X̄_k = [X_k, Σ_{k'≠k} X_{k'} Π_{k'}] ∈ R^{n×(τ + τ_subs)}    (5)

which is the concatenation of worker k's raw features and the sum of the random features from all other workers. Π_{k'} ∈ R^{τ×τ_subs} is the SRHT matrix introduced in §2.2.

As we prove in Lemma 2, summing the τ_subs-dimensional random projections computed separately on each block is equivalent to computing a single τ_subs-dimensional random projection of the full matrix in one go. The latter operation is impractical for very large p and not applicable when the features are distributed. Therefore, summing the random features from each worker allows the dimensionality reduction to be distributed across the K workers. Additionally, the summed random feature representation can be computed and combined very efficiently. We elaborate on this aspect in §5.
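At its core, the summing step exploits the block structure of matrix multiplication: summing the per-block sketches X_k Π_k is algebraically identical to multiplying the full matrix by the stacked projection [Π_1; ...; Π_K], and Lemma 2 shows that this stacked construction enjoys SRHT-style guarantees. A small NumPy sketch of the identity (our own illustration, with Gaussian projections standing in for the SRHT):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, tau, t_subs = 50, 4, 32, 8                    # K workers, tau raw features each
blocks = [rng.standard_normal((n, tau)) for _ in range(K)]                      # X_k held by worker k
Pis = [rng.standard_normal((tau, t_subs)) / np.sqrt(t_subs) for _ in range(K)]  # local projections Pi_k

# each worker projects only its own block; the results are summed
summed = sum(Xk @ Pk for Xk, Pk in zip(blocks, Pis))

# algebraically this equals projecting the full matrix with the stacked projection
X_full = np.hstack(blocks)                          # (n, K * tau)
Pi_full = np.vstack(Pis)                            # (K * tau, t_subs)
assert np.allclose(summed, X_full @ Pi_full)
```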

For a single worker k the local, approximate primal problem is then

    min_{β̄_k ∈ R^{τ + τ_subs}} Σ_{i=1}^n f_i(β̄_k^T x̄_{i,k}) + (λ/2) ||β̄_k||_2^2    (6)

where x̄_{i,k} is the i-th row of X̄_k. The corresponding dual problem for each worker k in the Dual-Loco algorithm is

    α̂_k = argmax_{α ∈ R^n} −Σ_{i=1}^n f_i^*(α_i) − (1/(2λ)) α^T X̄_k X̄_k^T α    (7)

The following steps in Algorithm 1 detail respectively how the solution to (7) and the final Dual-Loco estimates are obtained.

Step 5. LocalDualSolver.

The LocalDualSolver computes the solution to (7), the local dual problem. The solver can be chosen to best suit the problem at hand; this choice will depend on the absolute sizes of n and τ + τ_subs as well as on their ratio. For example, we could use SDCA [18] or Algorithm 1 from [16].

Steps 6 and 7. Obtaining the global primal solution.

Each worker maps its local dual solution α̂_k to the primal solution corresponding only to the coordinates in P_k, i.e. β̂_k = −(1/λ) X_k^T α̂_k. In this way, each worker returns coefficients corresponding only to its own raw features. The final primal solution vector is obtained by concatenating the local solutions. Unlike Loco, we no longer need to discard the coefficients corresponding to the random features for each worker. Consequently, computing the estimates is more efficient (especially when τ_subs is large).
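Putting the pieces together, the following single-machine NumPy simulation of Algorithm 1 is a minimal sketch of ours for the squared-loss case: a closed-form solve stands in for the LocalDualSolver, Gaussian projections stand in for the SRHT, and the sign conventions follow the dual formulation given above.

```python
import numpy as np

def dual_loco_ridge(blocks, y, lam, t_subs, rng):
    """Single-machine simulation of Algorithm 1 for the squared loss.

    blocks: list of (n, tau) feature blocks, one per worker.
    """
    n = y.shape[0]
    # steps 3-4: each worker computes random features X_k Pi_k, which are then exchanged
    rand_feats = [Xk @ (rng.standard_normal((Xk.shape[1], t_subs)) / np.sqrt(t_subs))
                  for Xk in blocks]
    total = sum(rand_feats)
    betas = []
    for Xk, Rk in zip(blocks, rand_feats):
        # eq. (5): raw features concatenated with the sum of the OTHER workers' random features
        Xbar = np.hstack([Xk, total - Rk])
        # step 5: for the squared loss the local dual (7) has the closed-form maximizer
        #   alpha_k = -(Xbar Xbar^T / lam + I/2)^{-1} y
        alpha = -np.linalg.solve(Xbar @ Xbar.T / lam + 0.5 * np.eye(n), y)
        # step 6: keep only the primal coordinates of the raw features
        betas.append(-(Xk.T @ alpha) / lam)
    # step 7 and output: concatenate the local solutions
    return np.concatenate(betas)

# toy usage: compare against the full ridge solution on data with low effective rank
rng = np.random.default_rng(3)
n, K, tau, rank = 300, 4, 128, 30
Z = rng.standard_normal((n, rank))
blocks = [Z @ rng.standard_normal((rank, tau)) + 0.01 * rng.standard_normal((n, tau))
          for _ in range(K)]
X = np.hstack(blocks)
y = X @ rng.standard_normal(K * tau) + 0.1 * rng.standard_normal(n)
lam = 50.0
beta_full = np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(K * tau), X.T @ y)
beta_loco = dual_loco_ridge(blocks, y, lam, t_subs=64, rng=rng)
print(np.linalg.norm(beta_loco - beta_full) / np.linalg.norm(beta_full))
```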

4 Dual-Loco Approximation Error

In this section we bound the recovery error between the Dual-Loco solution and the solution to Eq. (1).

Theorem 1 (Dual-Loco error bound).

Consider a matrix X ∈ R^{n×p} with rank r. Assume that the loss f is smooth and Lipschitz continuous. For a subsampling dimension τ_subs that is sufficiently large relative to r, let β* be the solution to (1) and β̂ be the estimate returned by Algorithm 1. Then, with probability at least 1 − δ, the error ||β̂ − β*||_2 is bounded by a small constant multiple of ||β*||_2, where the constant is governed by the ratio between r and τ_subs and the explicit expression is inherited from Lemma 4.
Proof.

By Lemma 4 and applying a union bound we can decompose the global optimization error in terms of the error due to each worker as ||β̂ − β*||_2^2 = Σ_{k=1}^K ||β̂_k − β*_k||_2^2, which holds with probability 1 − δ. The final bound follows by setting the failure probability of each local bound to δ/K and noting that Σ_{k=1}^K ||β*_k||_2^2 = ||β*||_2^2. ∎

Theorem 1 guarantees that the solution of Dual-Loco will be close to the optimal solution obtained by a single worker with access to all of the data. Our result relies on the data X having rank r < p. In practice, this assumption is often fulfilled, in particular when the data is high dimensional. For a large enough projection dimension, the bound has only a weak dependence on K through the union bound used to determine the failure probability. The error is then mainly determined by the ratio between the rank and the random projection dimension. When the rank of X increases for a fixed τ_subs, we need a larger projection dimension to accurately capture its spectrum. On the other hand, the failure probability increases with K and decreases with τ_subs. However, this countering effect is negligible as typically K ≪ τ_subs.

5 Implementation and Experiments

In this section we report on the empirical performance of Dual-Loco in two sets of experiments. The first demonstrates the performance of Dual-Loco in a large, distributed classification task. The second is an application of penalized regression to a problem in climate science where accurate recovery of the coefficient estimates is of primary interest.

Cross validation. In most practical cases, the regularization parameter λ is unknown and has to be determined via F-fold cross validation (CV). The chosen algorithm is usually run entirely once for each fold and each of the L candidate values of λ, leading to a runtime that is approximately F · L times as large as the runtime of a single run ("approximately" since the cross validation procedure also requires time for testing; for a single run we only count the time it takes to estimate the parameters). In this context, Dual-Loco has the advantage that steps 3 and 4 in Algorithm 1 are independent of λ. Therefore, these steps only need to be performed once per fold. In steps 5 and 6, we then estimate β̂ for each value in the provided sequence for λ. Thus, the runtime of Dual-Loco will increase by much less than a factor of F · L compared to the runtime of a single run. The performance of each value of λ is then not only averaged over the random split of the training data set into F parts but also over the randomness introduced by the random projections, which are computed and communicated once per fold. The procedure is provided in full detail in Algorithm 2 in Appendix C.
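A minimal sketch of this cross-validation structure (our own illustration, loosely mirroring Algorithm 2: squared loss, Gaussian projections standing in for the SRHT, closed-form local solves standing in for SDCA): the random features are formed once per fold and reused for every candidate λ.

```python
import numpy as np

def local_ridge_dual_solve(Xbar, y, lam):
    """Closed-form local dual solve for the squared loss (a stand-in for SDCA)."""
    return -np.linalg.solve(Xbar @ Xbar.T / lam + 0.5 * np.eye(y.shape[0]), y)

rng = np.random.default_rng(4)
n, K, tau, t_subs = 120, 3, 64, 32
blocks = [rng.standard_normal((n, tau)) for _ in range(K)]
y = rng.standard_normal(n)
lambdas = [0.1, 1.0, 10.0]
folds = np.array_split(rng.permutation(n), 5)

cv_err = {lam: 0.0 for lam in lambdas}
for test in folds:
    train = np.setdiff1d(np.arange(n), test)
    # independent of lambda: random features are computed and summed once per fold
    Pis = [rng.standard_normal((tau, t_subs)) / np.sqrt(t_subs) for _ in range(K)]
    rand_feats = [Xk[train] @ Pk for Xk, Pk in zip(blocks, Pis)]
    total = sum(rand_feats)
    X_test = np.hstack([Xk[test] for Xk in blocks])
    for lam in lambdas:                      # only the cheap local solves are repeated per lambda
        beta = np.concatenate([
            -(Xk[train].T @ local_ridge_dual_solve(
                np.hstack([Xk[train], total - Rk]), y[train], lam)) / lam
            for Xk, Rk in zip(blocks, rand_feats)])
        cv_err[lam] += np.mean((y[test] - X_test @ beta) ** 2) / len(folds)

best_lam = min(cv_err, key=cv_err.get)
print(best_lam, cv_err)
```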

Implementation details. We implemented Dual-Loco using the Apache Spark framework (a software package will be made available under the Apache license). Spark is increasingly gaining traction in the research community as well as in industry due to its easy-to-use high-level API and the benefits of in-memory processing; it is reported to be up to 100× faster than Hadoop MapReduce. Additionally, Spark can be used in many different large-scale computing environments and the various, easily integrated libraries for a diverse set of tasks greatly facilitate the development of applications.

Figure 2: Schematic for the aggregation of the random features in Spark. (a) When concatenating the random features naively, every worker node (executor) sends its random features to the driver, from where they are broadcast to all workers. (b) Using the treeReduce scheme we can reduce the load on the driver by summing the random features from each worker node, as this operation is associative and commutative. Worker k is only required to subtract its own random features locally.

When communicating and summing the random features in Spark, Dual-Loco leverages the treeReduce scheme as illustrated in Figure 2(b). Summing has the advantage that increasing the number of workers simply introduces more layers in the tree structure (Figure 2(b)), while the load on the driver remains constant and the aggregation operation also benefits from a parallel execution. Thus, when increasing K only relatively little additional communication cost is introduced, which leads to speedups as demonstrated below.
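A minimal PySpark sketch of this aggregation pattern (our own illustration, assuming a local Spark installation; the toy data and dimensions are not from the paper). treeReduce applies here because matrix addition is associative and commutative.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "dual-loco-aggregation-sketch")
n, tau, t_subs, K = 1000, 256, 32, 8

def make_random_features(seed):
    """Generate one worker's toy block X_k and return its random features X_k Pi_k."""
    rng = np.random.default_rng(seed)
    Xk = rng.standard_normal((n, tau))
    Pk = rng.standard_normal((tau, t_subs)) / np.sqrt(t_subs)
    return Xk @ Pk

rdd = sc.parallelize(range(K), numSlices=K).map(make_random_features)

# the per-worker (n, t_subs) matrices are summed in a tree instead of being
# collected one by one at the driver
total = rdd.treeReduce(lambda a, b: a + b, depth=2)
print(total.shape)
sc.stop()
```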

In practice, we used the discrete cosine transform (DCT) provided in the FFT library jTransforms (https://sites.google.com/site/piotrwendykier/software/jtransforms); for the Hadamard transform the input dimension must be a power of two, while for the DCT there is no such restriction and very similar theoretical guarantees hold. We ran Dual-Loco as well as CoCoA on a high-performance cluster (CoCoA is also implemented in Spark, with code available from https://github.com/gingsmith/cocoa).

Competing methods. For the classification example, the loss function is the hinge loss. Although the problem is non-smooth, and therefore not covered by our theory, we still obtain good results suggesting that Theorem 1 can be generalized to non-smooth losses. Alternatively, for classification the smoothed hinge or logistic losses could be used. For the regression problem we use the squared error loss and modify CoCoA accordingly. As the LocalDualSolver we use SDCA [18].

We also compared Dual-Loco against the reference implementation of distributed loss minimization in the MLlib library in Spark using SGD and L-BFGS. However, after extensive cross-validation over regularization strength (and step size and mini-batch size in the case of SGD), we observed that the variance was still very large, and so we omit the MLlib implementations from the figures. A comparison between CoCoA and variants of SGD and mini-batch SDCA can be found in [5].

Kaggle Dogs vs Cats dataset. This is a binary classification task consisting of images of dogs and cats (https://www.kaggle.com/c/dogs-vs-cats). We resize all images to a common resolution and use Overfeat [19] – a pre-trained convolutional neural network – to extract fully dense feature vectors from an intermediate layer of the network for each image. We train on most of the images and test on the remainder. The training data is dense, with over 4 billion non-zero elements. All results we report in the following are averaged over five repetitions, and by "runtime" we refer to wall clock time.

Figure 3 shows the median normalized training and test prediction MSE of Dual-Loco and CoCoA for different numbers of workers K (in practice, this choice will depend on the available resources in addition to the size of the data set). For Dual-Loco, we also vary the size of the random feature representation τ_subs; the corresponding errors are labeled Dual-Loco 0.5, Dual-Loco 1 and Dual-Loco 2. Note that combinations of K and τ_subs that would result in τ_subs exceeding the block size τ cannot be used. We ran CoCoA until a small duality gap was attained, so that the number of iterations varies for different numbers of workers. Notably, for K = 48 more iterations were needed than in the other cases, which is reflected in the very low training error in this case. The fraction of local points to be processed per round was set to 10%. We determined the regularization parameter λ via 5-fold cross validation.

While the differences in training errors between Dual-Loco and CoCoA are notable, the differences between the test errors are minor as long as the random feature representation is large enough. The smallest setting (Dual-Loco 0.5) appears to be slightly too small for this data set; for the larger settings, the largest difference between the test errors of Dual-Loco and CoCoA is 0.9%. The averaged mean squared prediction errors and their standard deviations are collected in Table 1 in Appendix C.

Figure 3: Dogs vs Cats data: Median normalized training and test prediction MSE based on 5 repetitions.

Next, we compare the wall clock time needed to find the regularization parameter via 5-fold cross validation. For CoCoA, using the number of iterations needed to attain the stricter duality gap would lead to runtimes of more than 24 hours for the larger numbers of workers when comparing all candidate values for λ. One might argue that a looser duality gap is sufficient for the cross validation runs, which would speed up the model selection procedure significantly as far fewer iterations would be required. Therefore, for the larger numbers of workers we use a looser duality gap during cross validation and the stricter duality gap for learning the parameters once λ has been determined. Figure 4 shows the runtimes when the smaller grid of values for λ is compared; Figure 6(a) compares the runtimes when cross validation is performed over the larger grid. The absolute runtime of CoCoA for a single run is smaller for the smallest numbers of workers and larger otherwise, so using more workers increased the amount of wall clock time necessary for job completion. The total runtime, including cross validation and a single run to learn the parameters with the determined value of λ, is always smaller for Dual-Loco except for one combination of K and τ_subs.

Figure 4: Total wall clock time including 5-fold CV over the smaller grid of values for λ. For CoCoA, we use a looser duality gap (DG) for the CV runs for the larger numbers of workers.

Figures 5 and 6(b) show the relative speedup of Dual-Loco and CoCoA when increasing K. The speedup is computed by dividing the runtime for the smallest number of workers by the runtime achieved for the corresponding larger K. A speedup value smaller than 1 implies an increase in runtime. When considering a single run, we run CoCoA in two different settings: (i) we use the number of iterations that are needed to obtain the target duality gap, which varies for different numbers of workers; here, the speedup is smaller than 1 for all K. (ii) We fix the number of outer iterations to a constant. As K increases, the number of inner iterations decreases, making it easier for CoCoA to achieve a speedup. Even so, while CoCoA attains a modest speedup for the first increase in K, it suffers an increase in runtime for the largest values of K.

For Dual-Loco 0.5 and Dual-Loco 1 we observe significant speedups as K increases. As we split the design matrix by features, the number of observations remains constant for different numbers of workers. At the same time, the dimensionality of each worker's local problem decreases with K. Together with the efficient aggregation of the random features, this leads to shorter runtimes. In the case of Dual-Loco 2, the communication costs dominate the costs of computing the random projection and of the LocalDualSolver, resulting in much smaller speedups.

Although CoCoA was demonstrated to obtain speedups for low-dimensional data sets [2], it is plausible that the same performance cannot be expected on a very high-dimensional data set. This illustrates that in such a high-dimensional setting, splitting the design matrix according to the columns instead of the rows is more suitable.

Figure 5: Relative speedup for (a) a single run and (b) 5-fold CV over the smaller grid of values for λ.
Figure 6: 5-fold CV over the larger grid of values for λ: (a) total wall clock time and (b) relative speedup.

Climate data. This is a regression task where we demonstrate that the coefficients returned by Dual-Loco are interpretable. The data set contains the outcome of control simulations of the GISS global circulation model [20] and is part of the CMIP5 climate modeling ensemble. We aim to forecast the monthly global average temperature in February using the air pressure measured in January. Results are very similar for other months. The features are pressure measurements taken at p geographic grid points in January. The time span of the climate simulation is 531 years and we use the results from two control simulations, yielding n = 1062 observations.

In Figure 7 we compare the coefficient estimates for four different methods. The problem is small enough to be solved on a single machine so that the full solution can be computed (using SDCA; cf. Figure 7(a)). This allows us to report the normalized parameter estimation mean squared error with respect to the full solution in addition to the normalized mean squared prediction error (MSE). The solution indicates that the pressure differential between Siberia (red area, top middle-left) and Europe and the North Atlantic (blue area, top left and top right) is a good predictor for the temperature anomaly. This pattern is concealed in Figure 7(b), which shows the result of up-projecting the coefficients estimated following a random projection of the columns. Using this scheme for prediction was introduced in [15]. Although the MSE is similar to that of the optimal solution, the recovered coefficients are not interpretable, as suggested by [16]. Thus, this method should only be used if prediction is the sole interest. Figure 7(c) shows the estimates returned by Dual-Loco, which is able to recover estimates that are close to the full solution. Finally, Figure 7(d) shows that CoCoA also attains accurate results.

(a) Single machine: full solution
(b) Single machine: column-wise compression
(c) Distributed setting: Dual-Loco 10
(d) Distributed setting: CoCoA
Figure 7: Climate data: The regression coefficients are shown as maps, with the prime meridian (passing through London) corresponding to the left and right edges of the plot. The Pacific Ocean lies in the center of each map.

Considering a longer time period or adding additional model variables such as temperature, precipitation or salinity rapidly increases the dimensionality of the problem while the number of observations remains constant. Each additional variable adds as many dimensions per month of simulation as there are grid points. Estimating very high-dimensional linear models is a significant challenge in climate science, and one where distributing the problem across features instead of observations is advantageous. The computational savings are much larger when distributing across features since p ≫ n, so reducing the per-worker dimensionality yields larger gains than distributing across observations.

6 Conclusions and further work

We have presented Dual-Loco, which addresses the challenging and rarely studied problem of statistical estimation when data is distributed across features rather than samples. Dual-Loco generalizes Loco to a wider variety of loss functions for regression and classification. We show that the estimated coefficients are close to the optimal coefficients that could be learned by a single worker with access to the entire dataset. The resulting bound is more intuitive and tighter than previous bounds, notably with a very weak dependence on the number of workers. We have demonstrated that Dual-Loco is able to recover accurate solutions for large-scale estimation tasks whilst also achieving better scaling than a state-of-the-art competitor, CoCoA, as K increases. Additionally, we have shown that Dual-Loco allows for fast model selection using cross validation.

The dual formulation is convenient for ℓ2-penalized problems, but other penalties are not as straightforward. Similarly, the theory only holds for smooth loss functions. However, as demonstrated empirically, Dual-Loco also performs well with a non-smooth loss function.

As n grows very large, the random feature matrices may become too large to communicate efficiently, even when the projection dimension τ_subs is very small. For these situations, there are a few simple extensions we aim to explore in future work. One possibility is to first perform row-wise random projections (cf. [21]) to further reduce the communication requirement. Another option is to distribute according to both rows and columns.

Contrary to stochastic optimization methods, the communication of Dual-Loco is limited to a single round. For fixed n, τ_subs and K, the amount of communication is deterministic and can be fixed ahead of time. This can be beneficial in settings where there are additional constraints on communication (for example, when different blocks of features are distributed a priori across different physical locations).

Clearly with additional communication, the theoretical and practical performance of Dual-Loco could be improved. For example, [16] suggest an iterative dual random projection scheme which can reduce the error in Lemma 4 exponentially. A related question for future research involves quantifying the amount of communication performed by Dual-Loco in terms of known minimax lower bounds [22].

References

Appendix A Supplementary Results

Here we introduce two lemmas. The first describes the random projection construction which we use in the distributed setting.

Lemma 2 (Summing random features).

Consider the singular value decomposition X = U Σ V^T, where U ∈ R^{n×r} and V ∈ R^{p×r} have orthonormal columns and Σ ∈ R^{r×r} is diagonal; r = rank(X). In addition to the raw features of worker k, let X̄_k contain the random features which result from summing the random projections from the other workers, as in (5). Furthermore, assume without loss of generality that the problem is permuted so that the raw features of worker k's problem are the first τ columns of X and rows of V. Then, with probability at least 1 − δ, the spectral norm of the error incurred by this distributed dimensionality reduction is bounded by a quantity that decreases as the projection dimension τ_subs grows relative to r; the explicit bound is derived in Appendix B.

Proof.

See Appendix B. ∎

Definition 1.

For ease of exposition, we shall rewrite the dual problems so that we consider minimizing convex objective functions. More formally, the original problem is then given by

    α* = argmin_{α ∈ R^n} D(α) := Σ_{i=1}^n f_i^*(α_i) + (1/(2λ)) α^T X X^T α    (9)

The problem worker k solves is described by

    α̂_k = argmin_{α ∈ R^n} D_k(α) := Σ_{i=1}^n f_i^*(α_i) + (1/(2λ)) α^T X̄_k X̄_k^T α    (10)

Recall that X̄_k = [X_k, Σ_{k'≠k} X_{k'} Π_{k'}], i.e. the concatenation of the raw features and the summed random features for worker k.

To proceed we need the following result, which relates the solution of the original problem to that of the approximate problem solved by worker k.

Lemma 3 (Adapted from Lemma 1 [17]).

Let α* and α̂_k be as defined in Definition 1. We obtain

(11)
Proof.

See [17]. ∎

For our main result, we rely heavily on the following variant of Theorem 1 in [17], which bounds the difference between the coefficients estimated by worker k, β̂_k, and the corresponding coordinates of the optimal solution vector, β*_k.

Lemma 4 (Local optimization error. Adapted from [17]).

For each worker k ∈ {1, ..., K} the following bound on ||β̂_k − β*_k||_2 holds

with probability at least 1 − δ, with δ as in Lemma 2.

The proof closely follows the proof of Theorem 1 in [17], which we restate here, identifying the major differences.

Proof.

Let the quantities α*, α̂_k, D and D_k be as in Definition 1. For ease of notation, we shall omit the subscript k in the following.

By the SVD we have . So and . We can make the following definitions

Defining and plugging these into Lemma 3 we obtain

(12)

We now bound the spectral norm of the approximation error using Lemma 2. Recall that Lemma 2 bounds the difference between a matrix and its approximation by a distributed dimensionality reduction using the SRHT.

Using the Cauchy-Schwarz inequality we have for the l.h.s. of (12)

For the r.h.s. of (12), we can write

Combining these two expressions and inequality (12) yields

(13)

From the definition of and above and and , respectively we have

so and due to the orthonormality of . Plugging this into (13) and using the fact that we obtain the stated result. ∎

Appendix B Proof of Row Summing Lemma

Proof of Lemma 2.

Let V_k contain the first τ rows of V and let V_{(−k)} be the matrix containing the remaining rows, and decompose the relevant matrix products accordingly. Since the product of the normalized Walsh–Hadamard matrix and the diagonal sign matrix is an orthogonal matrix, it follows from Lemma 3.3 in [13] and Lemma 5 that summing the independent SRHTs, each mapping from R^τ to R^{τ_subs}, is equivalent to applying a single SRHT from R^{(K−1)τ} to R^{τ_subs}. Therefore we can simply apply Lemma 1 of [15] to the above to obtain the result. ∎

Lemma 5 (Summed row sampling).

Let be an matrix with orthonormal columns. Let be a balanced, random partitioning of the rows of where each matrix has exactly rows. Define the quantity . For a positive parameter , select the subsample size

Let denote the operation of uniformly at random sampling a subset, of the rows of by sampling coordinates from without replacement. Now denote as the sum of the subsampled rows

Then

and

with failure probability at most

Proof.

Define as the row of and . Suppose and consider the matrix

In general, we can express as

By the orthonormality of , the cross terms cancel as , yielding

We can consider as a sum of random matrices

sampled uniformly at random without replacement from the family .

To use the matrix Chernoff bound in Lemma 6, we require the quantities , and . Noticing that , we can set .

Taking expectations with respect to the random partitioning () and the subsampling within each partition (), using the fact that columns of are orthonormal we obtain

Recall that we take samples in blocks so we can define


Plugging these values into Lemma 6, the lower and upper Chernoff bounds respectively yield

Noting that , similarly for and using the identity for above obtains the desired result. ∎

For ease of reference, we also restate the Matrix Chernoff bound from [13, 23] but defer its proof to the original papers.

Lemma 6 (Matrix Chernoff from [13]).

Let 𝒳 be a finite set of positive-semidefinite matrices with dimension d, and suppose that max_{X ∈ 𝒳} λ_max(X) ≤ B. Sample {X_1, ..., X_ℓ} uniformly at random from 𝒳 without replacement. Compute

    μ_min := ℓ · λ_min(E X_1)   and   μ_max := ℓ · λ_max(E X_1).

Then

    P{λ_min(Σ_{j=1}^ℓ X_j) ≤ (1 − η) μ_min} ≤ d · [e^{−η} / (1 − η)^{1−η}]^{μ_min/B}   for η ∈ [0, 1), and
    P{λ_max(Σ_{j=1}^ℓ X_j) ≥ (1 + η) μ_max} ≤ d · [e^{η} / (1 + η)^{1+η}]^{μ_max/B}   for η ≥ 0.

Appendix C Supplementary Material for Section 5

Input:  Data: X ∈ R^{n×p}, Y ∈ R^n; no. workers: K; no. folds: F

Parameters: τ_subs, sequence of candidate values for λ

1:  Partition {1, ..., p} into K subsets P_k of equal size and distribute the feature vectors in X accordingly over the K workers.
2:  Partition {1, ..., n} into F folds of equal size.
3:  for each fold f ∈ {1, ..., F} do
4:     Communicate indices of training and test points.
5:     for each worker k ∈ {1, ..., K} in parallel do
6:        Compute and send random features X_k Π_k (restricted to the training points).
7:        Receive random features and construct X̄_k.
8:        for each candidate value of λ do
9:           α̂_{k,λ} ← LocalDualSolver(X̄_k, Y_train, λ)
10:           β̂_{k,λ} ← −(1/λ) X_k^T α̂_{k,λ}
11:           Compute the local test predictions ŷ_{k,λ} = X_k^test β̂_{k,λ}.
12:           Send ŷ_{k,λ} to driver.
13:        end for
14:     end for
15:     for each candidate value of λ do
16:        Compute ŷ_λ = Σ_{k=1}^K ŷ_{k,λ}.
17:        Compute MSE_{f,λ} with ŷ_λ and Y_test.
18:     end for
19:  end for
20:  for each candidate value of λ do
21:     Compute MSE_λ = (1/F) Σ_{f=1}^F MSE_{f,λ}.
22:  end for

Output:  Parameter λ attaining the smallest MSE_λ

Algorithm 2 Dual-Loco – cross validation
Algorithm K TEST MSE TRAIN MSE
Dual-Loco 0.5 12 0.0343 (3.75e-03) 0.0344 (2.59e-03)
Dual-Loco 0.5 24 0.0368 (4.22e-03) 0.0344 (3.05e-03)
Dual-Loco 0.5 48 0.0328 (3.97e-03) 0.0332 (2.91e-03)
Dual-Loco 0.5 96 0.0326 (3.13e-03) 0.0340 (2.67e-03)
Dual-Loco 0.5 192 0.0345 (3.82e-03) 0.0345 (2.69e-03)
Dual-Loco 1 12 0.0310 (2.89e-03) 0.0295 (2.28e-03)
Dual-Loco 1 24 0.0303 (2.87e-03) 0.0307 (1.44e-03)
Dual-Loco 1 48 0.0328 (1.92e-03) 0.0329 (1.55e-03)
Dual-Loco 1 96 0.0299 (1.07e-03) 0.0299 (7.77e-04)
Dual-Loco 2 12 0.0291 (2.16e-03) 0.0280 (6.80e-04)
Dual-Loco 2 24 0.0306 (2.38e-03) 0.0279 (1.24e-03)
Dual-Loco 2 48 0.0285 (6.11e-04) 0.0293 (4.77e-04)
CoCoA 12 0.0282 (4.25e-18) 0.0246 (2.45e-18)
CoCoA 24 0.0278 (3.47e-18) 0.0212 (3.00e-18)
CoCoA 48 0.0246 (6.01e-18) 0.0011 (1.53e-19)
CoCoA 96 0.0254 (5.49e-18) 0.0137 (1.50e-18)
CoCoA 192 0.0268 (1.23e-17) 0.0158 (6.21e-18)
Table 1: Dogs vs Cats data: Normalized training and test MSE: mean and standard deviations (based on 5 repetitions).