# Distributed Principal Component Analysis with Limited Communication

We study efficient distributed algorithms for the fundamental problem of principal component analysis and leading eigenvector computation on the sphere, when the data are randomly distributed among a set of computational nodes. We propose a new quantized variant of Riemannian gradient descent to solve this problem, and prove that the algorithm converges with high probability under a set of necessary spherical-convexity properties. We give bounds on the number of bits transmitted by the algorithm under common initialization schemes, and investigate the dependency on the problem dimension in each case.

## Authors

• 2 publications
• 12 publications
• 3 publications
• 50 publications
04/02/2015

### Quantum image classification using principal component analysis

We present a novel quantum algorithm for classification of images. The a...
05/15/2014

### Fast Ridge Regression with Randomized Principal Component Analysis and Gradient Descent

We propose a new two stage algorithm LING for large scale regression pro...
05/06/2020

### A Communication-Efficient Distributed Algorithm for Kernel Principal Component Analysis

Principal Component Analysis (PCA) is a fundamental technology in machin...
01/24/2019

### Overcomplete Independent Component Analysis via SDP

We present a novel algorithm for overcomplete independent components ana...
09/05/2020

### Communication-efficient distributed eigenspace estimation

Distributed computing is a standard way to scale up machine learning and...
02/16/2020

### Fair Principal Component Analysis and Filter Design

We consider Fair Principal Component Analysis (FPCA) and search for a lo...
11/04/2021

### Analog MIMO Communication for One-shot Distributed Principal Component Analysis

A fundamental algorithm for data analytics at the edge of wireless netwo...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The need to process ever increasing quantities of data has motivated significant research interest into efficient distributed algorithms for standard tasks in optimization and data analysis; see, e.g., Jaggi et al. (2014); Shamir et al. (2014); Garber et al. (2017). In this context, one of the fundamental metrics of interest is communication cost

, measured as the number of bits sent and received by the processors throughout computation. Communication-efficient variants are known for several classical tasks and algorithms, from basic ones such as mean estimation

Suresh et al. (2017); Konečnỳ and Richtárik (2018); Davies et al. (2021) to variants of gradient descent Konečnỳ et al. (2016); Horváth et al. (2019) or even higher-order optimization Alimisis et al. (2021); Islamov et al. (2021).

By contrast, much less is known concerning the communication cost for classical problems in data analysis, such as Principal Component Analysis (PCA) Hotelling (1933); Bishop (2006)

Ng et al. (2001). In this paper, we will focus on the closely-related problem of computing the leading eigenvector of the data covariance matrix, i.e., determining the first principal component, in the setting where -dimensional data comes from an unknown distribution, and is split among machines.

Given the fundamental nature of PCA, there is a rich body of work on solving sequential variants, both in the Euclidean domain Oja and Karhunen (1985); Shamir (2015, 2016) and, more recently, based on Riemannian optimization. This is because maximizing the Rayleigh quotient can be re-interpreted in terms of a hyper-sphere, leading to an unconstrained Riemannian optimization problem Absil et al. (2009); Zhang and Sra (2016); Zhang et al. (2016). These references have exclusively focused on the sequential setting, where the data can be stored and processed by a single node. However, in many settings the large volume of data requires processing by multiple nodes, which has inspired significant work on reducing the distribution overheads of PCA, e.g. Balcan et al. (2014); Garber et al. (2017), specifically on obtaining efficient algorithms in terms of round complexity.

More precisely, an impressive line of work considered the deterministic setting Kannan et al. (2014); Balcan et al. (2014); Boutsidis et al. (2016); Fan et al. (2019), where the input is partitioned arbitrarily between machines, for which both round-efficient and communication-efficient solutions are now known. The general approach is to first perform a careful low-rank approximation of the local data and then communicate iteratively to reconstruct a faithful estimate of the covariance matrix. Due to the worst-case setting, the number of communication rounds scales polynomially in the eigengap and in the precision , making them costly in practice.

Thus, on the practical side, it is common to assume a setting with stochastic rather than worst-case data partitioning. Specifically, Garber et al. Garber et al. (2017) consider this setting, and note that standard solutions based on distributed Power Method or Lanczos would require a number of communication rounds which increases with the sample size, which is impractical at scale. They circumvent this issue by replacing explicit Power Method iterations with a set of convex optimization problems, which can be solved in a round-efficient manner. Huang and Pan Huang and Pan (2020) proposed a round-efficient solution, by leveraging the connection to Riemannian optimization. Their proposal relies on a procedure which allows the update of node-local variables via a surrogate gradient, whose deviation from the global gradient should be bounded. Thus, communication could only be used to reach consensus on the local variables among the nodes. Unfortunately, upon close inspection, we have noted significant technical gaps in their analysis, which we discuss in detail in the next section.

The line of work in the stochastic setting focused primarily on latency cost

, i.e., the number of communication rounds for solving this problem. For highly-dimensional data and/or highly-distributed settings, it is known that

bandwidth cost, i.e., the number of bits which need to be transmitted per round, can be a significant bottleneck, and an important amount of work deals with it for other classic problems, e.g. Suresh et al. (2017); Horváth et al. (2019); Islamov et al. (2021). By contrast, much less is known concerning the communication complexity of distributed PCA.

#### Contribution.

In this paper, we provide a new algorithm for distributed leading eigenvector computation, which specifically minimizes the total number of bits sent and received by the nodes. Our approach leverages the connection between PCA and Riemannian optimization; specifically, we propose the first distributed variant of Riemannian gradient descent, which supports quantized communication. We prove convergence of this algorithm utilizing a set of necessary convexity-type properties, and bound its communication complexity under different initializations.

At the technical level, our result is based on a new sequential variant of Riemannian gradient descent on the sphere. This may be of general interest, and the algorithm can be generalized to Riemannian manifolds of bounded sectional curvatures using more sophisticated geometric bounds. We strengthen the gradient dominance property proved in Zhang et al. (2016), by proving that our objective is weakly-quasi-convex in a ball of some minimizer and combining that with a known quadratic-growth condition. We leverage these properties by adapting results of Bu and Mesbahi (2020) in the Riemannian setting and choosing carefully the learning-rate in order to guarantee convergence for the single-node version.

This algorithm is specifically-designed to be amenable to distribution: specifically, we port it to the distributed setting, and add communication-compression in the tangent spaces of the sphere, using a variant of the lattice quantization technique of Davies et al. (2021), which we port from the context of standard gradient descent. We provide a non-trivial analysis for the communication complexity of the resulting algorithm under different initializations. We show that, under random initialization, a solution can be reached with total communication that is quadratic in the dimension , and linear in , where

is the gap between the two biggest eigenvalues; when the initialization is done to the leading eigenvector of one of the local covariance matrices, this dependency can be reduced to linear in

.

## 2 Setting and Related Work

#### Setting.

We assume to be given total samples coming from some distribution , organized as a global data matrix , which is partitioned row-wise among processors, with node being assigned the matrix , consisting of consecutive rows, such that . As is common, let be the global covariance matrix, and the local covariance matrix owned by node . We denote by the eigenvalues of in descending order and by the corresponding eigenvectors. Then, it is well-known that, if the gap between the two leading eigenvalues of is positive, we can approximate the leading eigenvector by solving the following empirical risk minimization problem up to accuracy :

 x⋆=argminx∈Rd(−xTAx∥x∥2)=argminx∈Rd,∥x∥2=1(−xTAx)=argminx∈Sd−1(−xTAx), (1)

where is the -dimensional sphere.

We define , with and , with . Since the inner product is bilinear, we can write the global cost as the sum of the local costs:
.

In our analysis we use notions from the geometry of the sphere, which we recall in Appendix A.

#### Related Work.

As discussed, there has been a significant amount of research on efficient variants of PCA and related problems Bishop (2006); Oja and Karhunen (1985); Shamir (2015, 2016); Absil et al. (2009); Zhang and Sra (2016); Zhang et al. (2016). Due to space constraints, we focus on related work on communication-efficient algorithms. In particular, we discuss the relationship to previous round-efficient algorithms; to our knowledge, this is the first work to specifically focus on the bit complexity of this problem in the setting where data is randomly partitioned. More precisely, previous work on this variant implicitly assumes that algorithms are able to transmit real numbers at unit cost.

The straightforward approach to solve the minimization problem (1) in a distributed setting, where the data rows are partitioned, would be to use a distributed version of the Power Method, Riemannian gradient descent (RGD), or the Lanczos algorithm. In order to achieve an -approximation of the minimizer , the latter two algorithms require rounds, where the notation hides poly-logarithmic factors in . Distributed Lanczos and accelerated RGD would improve this by an factor. However, Garber et al. Garber et al. (2017) point out that, when is small, e.g. , then unfortunately the number of communication rounds would increase with the sample size, which renders these algorithms non-scalable.

Standard distributed convex approaches, e.g. Jaggi et al. (2014); Shamir et al. (2014), do not directly extend to our setting due to non-convexity and the unit-norm constraint. Garber et al. Garber et al. (2017) proposed a variant of the Power Method called Distributed Shift-and-Invert (DSI) which converges in roughly rounds, where is a bound on the squared -norm of the data. Huang and Pan Huang and Pan (2020) aimed to improve the dependency of the algorithm on and , and proposed an algorithm called Communication-Efficient Distributed Riemannian Eigensolver (CEDRE). This algorithm is shown to have round complexity , which does not scale with the sample size for , and has only logarithmic dependency on the accuracy .

#### Technical issues in Huang and Pan (2020).

Huang and Pan Huang and Pan (2020) proposed an interesting approach, which could provide the most round-efficient distributed algorithm to date. Despite the fact that we find the main idea of this paper very creative, we have unfortunately identified a significant gap in their analysis, which we now outline.

Specifically, one of their main results, Theorem 3, uses the local gradient dominance shown in Zhang et al. (2016); yet, the application of this result is invalid, as it is done without knowing in advance that the iterates of the algorithms continue to remain in the ball of initialization. This is compounded by another error on the constant of gradient dominance (Lemma 2), which we believe is caused by a typo in the last part of the proof of Theorem 4 in Zhang et al. (2016). This typo is unfortunately propagated into their proof. More precisely, the objective is indeed gradient dominated, but with a constant which vanishes when we approach the equator, in contrast with the which is claimed globally. (We reprove this result independently in Proposition 4). Thus, starting from a point lying in some ball of the minimizer can lead to a new point where the objective is more poorly gradient dominated.

This is a non-trivial technical problem which, in the case of gradient descent, can be addressed by a choice of the learning rate depending on the initialization. Given these issues, we perform a new and formally-correct analysis for gradient descent based on new convexity properties which guarantee convergence directly in terms of iterates, and not just function values. We would like however to note that our focus in this paper is on bit and not round complexity.

## 3 Step 1: Computing the Leading Eigenvector in One Node

### 3.1 Convexity-like Properties and Smoothness

We first analyze the case that all the matrices are simultaneously known to all machines. This practically means that we can run gradient descent in one of the machines (let it be the master), since we have all the data stored there. All the proofs for this section are in Appendix B.

 minx∈Sd−1−xTAx

where is the global covariance matrix. If , this problem has exactly two global minima: and . Let be an arbitrary point. Then can be written in the form . Fixing the minimizer , we have that a ball in around is of the form

 Ba={x∈Sd−1 | α1≥a}={x∈Sd−1 | ⟨x,v1⟩≥a} (2)

for some . Without loss of generality, we may assume (otherwise, consider the ball around to establish convergence to ). In this case, the region as defined in Def. 11 is identical to and .

We begin by investigating the convexity properties of the function . In particular, we prove that this function is weakly-quasi-convex (see Def. 13) with constant in the ball . See also Appendix A for a definition of the Riemannian gradient and the logarithm .

###### Proposition 1.

The function satisfies

for any with .

We continue by providing a quadratic growth condition for our cost function, which can also be found in (Xu et al., 2018, Lemma 2) in a slightly different form. Here dist is the intrinsic distance in the sphere, that we also define in Appendix A.

###### Proposition 2.

The function satisfies

 f(x)−f∗≥δ4dist2(x,x∗)

for any with .

According to Def. 14, this means that satisfies quadratic growth with constant .

Next, we prove that quadratic growth and weak-quasi-convexity imply a “strong-weak-convexity” property. This is similar to (Bu and Mesbahi, 2020, Proposition 3.1).

###### Proposition 3.

A function which satisfies quadratic growth with constant and is -weakly-quasi-convex with respect to a minimizer in a ball of this minimizer, satisfies for any in the same ball the inequality

Interestingly, using Proposition 3, we can recover the gradient dominance property proved in (Zhang et al., 2016, Theorem 4).

###### Proposition 4.

The function satisfies

for any with .

The proof of Proposition 4 can be done independently based only on Proposition 3 and (Bu and Mesbahi, 2020, Lemma 3.2). This verifies that our results until now are optimal regarding the quality of initialization since gradient dominance proved in Zhang et al. (2016) is optimal in that sense (i.e. the inequalities in the proof are tight).

###### Proposition 5.

The function is geodesically -smooth on the sphere.

Thus, the smoothness constant of , as defined in Def. 12, equals . Similarly, let denote the smoothness constant of , which equals twice the largest eigenvalue of . In order to estimate in a distributed fashion using only the local data matrices , we shall use the over-approximation .

### 3.2 Convergence

We now consider Riemannian gradient descent with learning rate starting from a point :

where is the exponential map of the sphere, defined in Appendix A.

Using Proposition 3 and a proper choice of , we can establish a convergence rate for the instrinsic distance of the iterates to the minimizer.

###### Proposition 6.

An iterate of Riemannian gradient descent for starting from a point and with step-size where , produces a point that satisfies

 dist2(x(t+1),x∗)≤(1−aμη)% dist2(x(t),x∗)for μ=δ/2.

Note that this result implies directly that if our initialization lies in the ball , the distance of to the center decreases and thus all subsequent iterates continue being in the initialization ball. This is essential since it guarantees that the convexity-like properties for continue to hold during the whole optimization process. The proof of Proposition 6 is in Appendix B.

This result is also valid in a more general setting when applied to an arbitrary that is -weakly-quasi convex and -smooth, and that satisfies quadratic growth with constant . The Euclidean analogue of this result can be found in (Bu and Mesbahi, 2020, Lemma 4.4).

## 4 Step 2: Computing the Leading Eigenvector in Several Nodes

We now present our version of distributed gradient descent for leading eigenvector computation and measure its bit complexity until reaching accuracy in terms of the intrinsic distance of an iterate from the minimizer ( is the leading eigenvector closest to the initialization point).

#### Lattice Quantization.

For estimating the Riemannian gradient in a distributed manner with limited communication, we use a quantization procedure developed in Davies et al. (2021). The original quantization scheme involves randomness, but we use a deterministic

version of it, by picking up the closest point to the vector that we want to encode. This is similar to the quantization scheme used by

Alistarh and Korhonen (2020) and has the following properties.

###### Proposition 7.

Davies et al. (2021); Alistarh and Korhonen (2020) Denoting by the number of bits that each machine uses to communicate, there exists a quantization function

which, for each , consists of an encoding function and a decoding one , such that, for all ,

• , if .

• , if .

• If , the cost of the quantization procedure in number of bits satisfies
.

In the following, the quantization takes place in the tangent space of each iterate , which is linearly isomorphic to . We denote by the specification of the function at . The vector inputs of the function are represented in the local coordinate system of the tangent space that the quantization takes place at each step. For decoding at , we use information obtained in the previous step, that we need to translate to the same tangent space. We do that using parallel transport , which we define in Appendix A.

#### Algorithm

We present now our main algorithm, which is inspired by quantized gradient descent firstly designed by Magnússon et al. (2020), and its similar version in Alistarh and Korhonen (2020).

1. Choose an arbitrary machine to be the master node, let it be .

2. Choose (we analyze later specific ways to do that).

3. Consider the following parameters

 σ:=1−cos(D)μη, K:=2√σ, θ:=√σ(1−√σ)4, √ξ:=θK+√σ, R(t)=γK(√ξ)tD

where is an over-approximation for (the way to practically choose is discussed later).

We assume that , otherwise we run the algorithm with .

In :

4. Compute the local Riemannian gradient at in each node.

5. Encode in each node and decode in the master node using its local information:

6. Sum the decoded vectors in the master node:

7. Encode the sum in the master and decode in each machine using its local information:

For :

8. Take a gradient step using the exponential map:

 x(t+1)=expx(t)(−ηq(t))

with step-size (the step-size is discussed later).

In :

9. Compute the local Riemannian gradient at in each node.

10. Encode in each node and decode in the master node using its (parallelly transported) local information from the previous step:

11. Sum the decoded vectors in the master node:

12. Encode the sum in the master and decode in each machine using its local information in the previous step after parallel transport:

 q(t+1)=Qx(t+1)(r(t+1),Γx(t+1)x(t)q(t),(1+θ2)R(t+1),θR(t+1)2).

#### Convergence

We first control the convergence of iterates simultaneously with the convergence of quantized gradients.

Note that

 √ξ=1−√σ2+√σ=1+√σ2≤√2√1+σ2=√1+σ2.
###### Lemma 8.

If , the previous quantized gradient descent algorithm produces iterates and quantized gradients that satisfy

The proof is a Riemannian adaptation of the similar one in Magnússon et al. (2020) and Alistarh and Korhonen (2020) and is presented in Appendix C. We recall that since the sphere is positively curved, it provides a landscape easier for optimization. It is quite direct to derive a general Riemannian method for manifolds of bounded curvature using more advanced geometric bounds, however this exceeds the scope of this work which focuses on leading eigenvector computation.

We now move to our main complexity result.

###### Theorem 9.

Let . Then, the previous quantized gradient descent algorithm needs at most

 b=O(nd1cos(D)δηlog(ncos(D)δη)log(Dϵ))

bits in total to estimate the leading eigenvector with an accuracy measured in intrinsic distance.

The proof is based on the previous Lemma 8 in order to count the number of steps that the algorithm needs to estimate the minimizer with accuracy and Proposition 7 to count the quantization cost in each round. We present the proof again in Appendix C.

## 5 Step 3: Dependence on Initialization

### 5.1 Uniformly random initialization

The cheapest choice to initialize quantized gradient descent is a point in the sphere chosen uniformly at random. According to Theorem 4 in Zhang et al. (2016), such a random point will lie, with probability at least , in a ball (see Def. 2) where

 a≥cp√d⟺dist(x(0),x∗)≤arccos(cp√d). (3)

Here, is a universal constant (we estimated numerically that is lower bounded by , see Appendix D, thus we can use ). By choosing the step-size as

 η=cp√dγ

and the parameter as

 D=arccos(cp√d),

we are guaranteed that , and with probability at least . Our analysis above therefore applies (up to probability ) and the general communication complexity result becomes

 O(ndcos(D)δηlogncos(D)δηlogDϵ)=O(ndηγδηlognηγδηlogDϵ)=O(ndη2γδlognη2γδlogDϵ).

Substituting , the number of bits satisfies (up to probability ) the upper bound

 b=O(nd2p2γδlogndγpδlogDϵ)=~O(nd2p2γδ).

### 5.2 Warm Start

A reasonable strategy to get a more accurate initialization is to perform an eigenvalue decomposition to one of the local covariance matrices, for instance (in the master node ), and compute its leading eigenvector, let it be . Then we communicate this vector to all machines in order to use its quantized approximation as initialization. We define:

 ~x(0)=Q(vi0,vi,∥vi0−vi∥,⟨vi0,x∗⟩2(√2+2)) x(0)=~x(0)∥~x(0)∥

where is the lattice quantization scheme in (i.e. we quantize the leading eigenvector of the master node as a vector in

and then project back to the sphere). The input and output variance in this quantization can be bounded by constants that we can practically estimate (see Appendix

D).

###### Proposition 10.

Assume that our data are i.i.d. and sampled from a distribution bounded in norm by a constant . Given that the eigengap and the number of machines satisfy

 δ≥Ω(√n√logdp), (4)

we have that the previous quantization costs many bits and is lower bounded by a constant with probability at least .

Thus, if bound 4 is satisfied, then the communication complexity becomes

 b=O(ndγδlognγδlogDϵ)=~O(ndγδ)

many bits in total with probability at least (notice that this can be further simplified using bound 4). This is because in Theorem 9 is upper bounded by a constant and the communication cost of quantizing does not affect the total communication cost. If we can estimate the specific relation in bound 4 (the constant hidden inside ), then we can compute estimations of the quantization parameters in the definition of .

Condition 4 is quite typical in this literature; see Huang and Pan (2020) and references therein (beginning of page 2), as we also briefly discussed in the introduction. Notice that appears in the numerator and not the denominator, only because we deal with the sum of local covariance matrices and not the average, thus our eigengap is times larger than the eigengap of the normalized covariance matrix.

## 6 Experiments

We evaluate our approach experimentally, comparing the proposed method of Riemannian gradient quantization against three other benchmark methods:

• Full-precision Riemannian gradient descent: Riemannian gradient descent, as described in Section 3.2, is performed with the vectors communicated at full (64-bit) precision.

• Euclidean gradient difference quantization: the ‘naïve’ approach to quantizing Riemannian gradient descent. Euclidean gradients are quantized and averaged before being projected to Riemannian gradients and used to take a step. To improve performance, rather than quantizing Euclidean gradients directly, we quantize the difference between the current local gradient and the previous local gradient, at each node. Since these differences are generally smaller than the gradients themselves, we expect this quantization to introduce lower error.

• Quantized power iteration: we also use as a benchmark a quantized version of power iteration, a common method for leading-eigenvector computation given by the update rule . can be computed in distributed fashion by communicating and summing the vectors . It is these vectors that we quantize.

All three of the quantized methods use the same vector quantization routine, for fair comparison.

We show convergence results (Figure 1

) for the methods on four real datasets: Human Activity from the MATLAB Statistics and Machine Learning Toolbox, and Mice Protein Expression, Spambase, and Libras Movement from the UCI Machine Learning Repository

Dua and Graff (2017). All results are averages over 10 runs of the cost function (for each method, the iterates are normalized to lie in the -ball, so a lower value of corresponds to being closer to the principal eigenvector).

All four datasets display similar behavior: our approach of Riemannian gradient quantization outperforms naïve Euclidean gradient quantization, and essentially matches the performance of the full-precision method while communicating only bits per coordinate, for a compression. Power iteration converges slightly faster (as would be expected from the theoretical convergence guarantees), but is much more adversely affected by quantization, and reaches a significantly suboptimal result.

We also tested other compression levels, though we do not present figures here due to space constraints: for fewer bits per coordinate (1-3), the performance differences between the four methods are in the same order but more pronounced; however, their convergence curves become more erratic due to increased variance from quantization. For increased number of bits per coordinate, the performance of the quantized methods reaches that of the full-precision methods at between 8-10 bits per coordinate depending on input, and the observed effect of quantized power iteration converging to a suboptimal result gradually diminishes. Our code is publicly available .

## 7 Discussion and Future Work

We proposed a communication efficient algorithm to compute the leading eigenvector of a covariance data matrix in a distributed fashion using only limited amount of bits. To the best of our knowledge, this is the first algorithm for this problem with bounded bit complexity. We focused on gradient descent since it provides robust convergence in terms of iterates, and can circumvent issues that arise due to a cost function that only has local convexity-like properties; second, it known to be amenable to quantization.

We would like briefly to discuss the round complexity of our method, emphasizing inherent round complexity trade-offs due to quantization. Indeed, the round complexity of our quantized RGD algorithm for accuracy is , where appears inside a logarithm. Specifically, if the initialization is uniformly random in the sphere (section 5.1), then . If we have warm start (section 5.2), then . Both statements hold with high probability.

For random initialization, the round complexity of distributed power method is (see the table at the beginning of page 4 in Garber et al. (2017)). This is because the convergence rate of power method is The dimension is again present in that rate, because , but it appears inside a logarithm in the final iteration complexity.

Since we want to insert quantization, the standard way to do so is by starting from a rate which is contracting (see e.g. Magnússon et al. (2020)). It may in theory be possible to add quantization in a non-contracting algorithm, but we are unaware of such a technique. Thus, we firstly prove that RGD with step-size is contracting (based on weak-convexity and quadratic growth) and then we add quantization, paying the price of having an extra in our round complexity. Thus, our round complexity is slightly worse than distributed power method, but this is in some sense inherent due to the introduction of quantization. Despite this, our algorithm can achieve low bit complexity and it is more robust in perturbations caused by bit limitations, as our experiments suggest.

We conclude with some interesting points for future work.

First, our choice of step-size is quite conservative since the local convexity properties of the cost function improve when approaching the optimum. Thus, instead of setting the step-size to be , where is a lower bound for , we can re-run the whole analysis with an adaptive step-size based on an improved value for from . This should improve the dependency on the dimension of our complexity results.

Second, the version of Riemannian gradient descent that we have chosen is expensive, since one needs to calculate the exponential map in each step using sin and cos calculations. Practitioners usually use cheaper retraction-based versions, and it would be of interest to study how our analysis can be modified if the exponential map is substituted by a retraction (i.e., any first-order approximation).

Third, it would be interesting to see whether the ideas and techniques presented in this work can also be used for other data science problems in a distributed fashion through communication-efficient Riemannian methods. This could be, for example, PCA for the top

eigenvectors using optimization on the Stiefel or Grassmann manifold Absil et al. (2009) and low-rank matrix completion on the fixed-rank manifold Vandereycken (2013).

We would like to thank the anonymous reviewers for helpful comments and suggestions. We also thank Aurelien Lucchi and Antonio Orvieto for fruitful discussions at an early stage of this work. FA is partially supported by the SNSF under research project No. 192363 and conducted part of this work while at IST Austria under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 805223 ScaleML). PD partly conducted this work while at IST Austria and was supported by the European Union’s Horizon 2020 programme under the Marie Skłodowska-Curie grant agreement No. 754411.

## References

• [1] P. Absil, R. Mahony, and R. Sepulchre (2009) Optimization algorithms on matrix manifolds. Princeton University Press. Cited by: Appendix A, §1, §2, §7.
• [2] F. Alimisis, P. Davies, and D. Alistarh (2021) Communication-efficient distributed optimization with quantized preconditioners. In International Conference on Machine Learning, Cited by: §1.
• [3] D. Alistarh and J. H. Korhonen (2020) Improved communication lower bounds for distributed optimisation. arXiv preprint arXiv:2010.08222. Cited by: §4, §4, §4, Proposition 7.
• [4] M. Balcan, V. Kanchanapally, Y. Liang, and D. Woodruff (2014) Improved distributed principal component analysis. arXiv preprint arXiv:1408.5823. Cited by: §1, §1.
• [5] C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: §1, §2.
• [6] C. Boutsidis, D. P. Woodruff, and P. Zhong (2016) Optimal principal component analysis in distributed and streaming models. In

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

,
pp. 236–249. Cited by: §1.
• [7] J. Bu and M. Mesbahi (2020) A note on nesterov’s accelerated method in nonconvex optimization: a weak estimate sequence approach. External Links: 2006.08548 Cited by: §1, §3.1, §3.1, §3.2.
• [8] P. Davies, V. Gurunathan, N. Moshrefi, S. Ashkboos, and D. Alistarh (2021) Distributed variance reduction with optimal communication. In International Conference on Learning Representations, Cited by: §1, §1, §4, Proposition 7.
• [9] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.
• [10] J. Fan, D. Wang, K. Wang, and Z. Zhu (2019)

Distributed estimation of principal eigenspaces

.
Annals of statistics 47 (6), pp. 3009. Cited by: §1.
• [11] D. Garber, O. Shamir, and N. Srebro (2017) Communication-efficient algorithms for distributed stochastic principal component analysis. In International Conference on Machine Learning, pp. 1203–1212. Cited by: §1, §1, §1, §2, §2, §7.
• [12] S. Horváth, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtárik (2019) Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115. Cited by: §1, §1.
• [13] H. Hotelling (1933) Analysis of a complex of statistical variables into principal components.. Journal of educational psychology 24 (6), pp. 417. Cited by: §1.
• [14] L. Huang and S. Pan (2020) Communication-efficient distributed PCA by Riemannian optimization. In International Conference on Machine Learning, pp. 4465–4474. Cited by: Appendix B, §D.2, §D.2, §1, §2, §2, §2, §5.2.
• [15] R. Islamov, X. Qian, and P. Richtárik (2021) Distributed second order methods with fast rates and compressed communication. arXiv preprint arXiv:2102.07158. Cited by: §1, §1.
• [16] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan (2014) Communication-efficient distributed dual coordinate ascent. arXiv preprint arXiv:1409.1458. Cited by: §1, §2.
• [17] R. Kannan, S. Vempala, and D. Woodruff (2014) Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory, pp. 1040–1057. Cited by: §1.
• [18] H. Karimi, J. Nutini, and M. Schmidt (2020) Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. External Links: 1608.04636 Cited by: Appendix A.
• [19] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §1.
• [20] J. Konečnỳ and P. Richtárik (2018) Randomized distributed mean estimation: accuracy vs. communication. Frontiers in Applied Mathematics and Statistics 4, pp. 62. Cited by: §1.
• [21] S. Li (2011) Concise Formulas for the Area and Volume of a Hyperspherical Cap. Asian Journal of Mathematics & Statistics 4 (1), pp. 66–70. External Links: Document Cited by: §D.1.
• [22] S. Magnússon, H. Shokri-Ghadikolaei, and N. Li (2020) On maintaining linear convergence of distributed learning and optimization under limited communication. IEEE Transactions on Signal Processing 68, pp. 6101–6116. Cited by: §4, §4, §7.
• [23] A. Ng, M. Jordan, and Y. Weiss (2001)

On spectral clustering: analysis and an algorithm

.
Advances in neural information processing systems 14, pp. 849–856. Cited by: §1.
• [24] E. Oja and J. Karhunen (1985)

On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix

.
Journal of mathematical analysis and applications 106 (1), pp. 69–84. Cited by: §1, §2.
• [25] O. Shamir, N. Srebro, and T. Zhang (2014) Communication-efficient distributed optimization using an approximate newton-type method. In International conference on machine learning, pp. 1000–1008. Cited by: §1, §2.
• [26] O. Shamir (2015) A stochastic pca and svd algorithm with an exponential convergence rate. In International Conference on Machine Learning, pp. 144–152. Cited by: §1, §2.
• [27] O. Shamir (2016) Fast stochastic algorithms for svd and pca: convergence properties and convexity. In International Conference on Machine Learning, pp. 248–256. Cited by: §1, §2.
• [28] A. T. Suresh, X. Y. Felix, S. Kumar, and H. B. McMahan (2017) Distributed mean estimation with limited communication. In International Conference on Machine Learning, pp. 3329–3337. Cited by: §1, §1.
• [29] B. Vandereycken (2013) Low-rank matrix completion by riemannian optimization. SIAM Journal on Optimization 23 (2), pp. 1214–1236. Cited by: §7.
• [30] Z. Xu, X. Cao, and X. Gao (2018) Convergence analysis of gradient descent for eigenvector computation. Cited by: Appendix B, §3.1.
• [31] H. Zhang, S. J. Reddi, and S. Sra (2016) Riemannian svrg: fast stochastic optimization on riemannian manifolds. arXiv preprint arXiv:1605.07147. Cited by: §1, §1, §2, §2, §3.1, §3.1, §5.1.
• [32] H. Zhang and S. Sra (2016) First-order methods for geodesically convex optimization. In Conference on Learning Theory, pp. 1617–1638. Cited by: §1, §2.

## Appendix A Geometry of the Sphere

We analyze here briefly some basic notions of the geometry of the sphere that we use in our algorithm and convergence analysis. We refer the reader to [1, p. 73–76] for a more comprehensive presentation.

#### Tangent Space:

The tangent space of the -dimensional sphere at a point is an -dimensional vector space, which generalizes the notion of tangent plane in two dimensions. We denote it by and a vector belongs in it, if and only if, it can be written as , where (for some ) is a smooth curve with . The tangent space at can be given also in an explicit way, as the set of all vectors in orthogonal to with respect to the usual inner product. Given a vector , we can always project it orthogonally in any tangent space of . Taking all vectors to be column vectors, the orthogonal projection in satisfies

 Pp(w)=(I−ppT)w.

#### Geodesics:

Geodesics on high-dimensional surfaces are defined to be locally length-minimizing curves. On the -dimensional sphere, they coincide with great circles. These can be computed explicitly and give rise to the exponential and logarithmic maps. The exponential map at a point is defined as with , with being the (unique) geodesic (i.e., length minimizing part of great circle) satisfying and . Defining the exponential map in this way makes it invertible with inverse . The exponential and logarithmic map are given by well-known explicit formulas:

 expp(v)=cos(∥v∥)p+sin(∥v∥)v∥v∥,logp(q)=arccos(⟨q,p⟩)Pp(q−p)∥Pp(q−p)∥. (5)

The distance between points and measured intrinsically on the sphere is

 dist(p,q)=∥logp(q)∥=arccos(⟨q,p⟩). (6)

Notice that , thus the distance of and is actually the angle between them.

The inner product inherited by the ambient Euclidean space provides a way to transport vectors from a tangent space to another one, parallelly to the geodesics. This operation is called parallel transport and constitutes an orthogonal isomorphism between the two tangent spaces. We denote for the parallel transport from to along the geodesic connecting and . If , then parallel transport is given by the formula

 Γqpu=(Id+cos(t∥v∥−1)vvT∥v∥2−sin(t∥v∥)pvt∥v∥)u.

Given a function , we can define its differential at a point in a tangent direction as

 df(p)v=limt→0f(α(t))−f(p)t,

where is a smooth curve defined such that and . Using this intrinsically defined Riemannian differential, we can define the Riemannian gradient at as a vector , such that

for any . In the case of the sphere, we can also compute the Riemannian gradient by orthogonally projecting the Euclidean gradient computed in the ambient space into the tangent space of :

#### Geodesic Convexity and Smoothness:

In this paragraph we define convexity-like properties in the context of the sphere, which we employ in our analysis.

###### Definition 11.

A subset is called geodesically uniquely convex, if every two points in are connected by a unique geodesic.

Open hemispheres satisfy this definition, thus they are geodesically uniquely convex. Actually, they are the biggest possible subsets of the sphere with this property.

###### Definition 12.

A smooth function is called geodesically -smooth if

for any .

###### Definition 13.

A smooth function defined in a geodesically uniquely convex subset is called geodesically -weakly-quasi-convex (for some ) with respect to a point if

Observe that weak-quasi-convexity implies that any local minimum of lying in the interior of is also a global one.

###### Definition 14.

A function is said to satisfy quadratic growth condition with constant , if

 f(x)−f∗≥μ2dist2(x,x∗)for all x∈A,

where is the minimum of and the global minimizer of closest to .

Quadratic growth condition is slightly weaker than the so-called gradient dominance one:

###### Definition 15.

A function is said to be -gradient dominated if

Gradient dominance with constant implies quadratic growth with the same constant. The proof can be done by a direct Riemannian adaptation of the argument from [18, p. 13–14].

#### Tangent Space Distortion:

The only non-trivial geometric result we use in this work is that the geodesics of the sphere spread more slowly than in a flat space. This is a direct application of the Rauch comparison theorem since the sphere has constant positive curvature.

###### Proposition 16.

Let a geodesic triangle on the sphere. Then we have

 dist(a,b)≤∥logc(a)−logc(b)∥.

Notice that in Euclidean space we have equality in this bound.

## Appendix B Properties of the Cost and Convergence of Single-Node Gradient Descent

In this appendix, we will prove the convergence of Riemannian gradient descent applied to

 minx∈Sd−1f(x)with f(x)=−xTAx

and a symmetric and positive definite matrix. In the following, we will denote the eigenvalues of by and their corresponding orthonormal eigenvectors by . Denote the spectral gap by .

Let denote the minimum of on which is attained at .

See 1

###### Proof.

For any , we can write

 x=d∑i=1αivi,Ax=d∑i=1λiαivi (7)

for some scalars . Recall that by definition (2) of .

With the orthogonal projector onto the tangent space , we get from (5) that

because .

Direct calculation now gives

 ⟨Px∇f(x),x−x∗⟩ =−2xTAx+2⟨Ax,x∗⟩−2f(x)∥x∥2+2f(x)⟨x,x∗⟩ =2f(x)+2λ1α1−2f(x)+2f(x)α1 =2α1(f(x)+λ1)=2α1(f(x)−f∗)≥0.

It is easy to verify that . We thus obtain

which gives the desired result since . ∎

See 2

###### Proof.

The proof follows the one in [30, Lemma 2]. Using the expansions in (7), we get

 xTAx=d∑i=1λiα2i=λ1α21+d∑i=2λiα2i≤λ1α21+λ2(1−α21)

since . From (6), we have that and so

 xTAx≤λ1cos2(dist(x,x∗))+λ2sin2(dist(x,x∗)).

Direct calculation now shows

 f(x)−f∗ =−xTAx+λ1≥λ1−λ1cos2(dist(x,x∗))−λ2sin2(dist(x,x∗)) =λ1sin2dist(x,x∗))−λ2sin2(dist(x,x∗))=δsin2(dist(x,x∗)).

Since with , we have that and are in the same hemisphere and thus . The desired result follows using for . ∎

See 3

###### Proof.

From quadratic growth and weak-quasi-convexity, we have

Now, again by weak-quasi-convexity

by substituting the previous inequality. ∎

See 4

###### Proof.

By Proposition 3, we have

since, in our case, . Using for all , we can write for any positive that

Combining with and using (6), we get