1 Introduction
A natural representation for many statistical and machine learning problems is to assume that the parameter of interest lies on a more general space than Euclidean space. Typical examples include diffusion matrices in large-scale diffusion tensor imaging (DTI), which are positive definite matrices, now commonly used in neuroimaging for clinical trials [1]. In computer vision, images are often preprocessed or reduced to a collection of subspaces [11, 27], or a digital image can be represented by a set of landmarks, forming landmark-based shapes [13]. One may also encounter data that are stored as orthonormal frames [8], surfaces [15], curves [16], and networks [14].

In addition, parallel inference has become popular for overcoming the computational burden arising from the storage, processing and computation of big data, resulting in a vast literature in statistics and machine learning dedicated to this topic. The general scheme in the frequentist setting is to divide the data into subsets and obtain estimates from each subset, which are then combined to form an ultimate estimate for inference [9, 30, 17]. In the Bayesian setting, the subset posterior distributions are first obtained in the dividing step, and these subset posterior measures, or the MCMC samples from each subset posterior, are then combined for final inference [20, 29, 28, 21, 25, 22]. Most of these methods are “embarrassingly parallel” and often do not require communication across different machines or subsets. Some communication-efficient algorithms have also been proposed, with prominent methods including [12] and [26].
Despite tremendous advancement in parallel inference, previous work largely focuses on Euclidean data and parameter spaces. To better address challenges arising from inference on big non-Euclidean data, or data with non-Euclidean parameters, there is a crucial need for valid and efficient inference methods, including parallel or distributed algorithms, that appropriately incorporate the underlying geometric structure.
For a majority of applications, the parameter spaces fall into the general category of manifolds, whose geometry is well-characterized. Although there is a recent literature on inference for manifold-valued data, including methods based on Fréchet means or model-based methods [3, 4, 5, 2, 18], and even scalable methods for certain models [23, 19, 24], there is still a vital lack of general parallel algorithms on manifolds. We aim to fill this critical gap by introducing our parallel inference strategy. The novelty of our paper lies in the fact that it generalizes to a wide range of loss functions for manifold optimization problems and that we can parallelize the algorithm by splitting the data across processors. Furthermore, our theoretical development does not rely on previous results; in fact, generalizing Theorem 1 to the manifold setting requires entirely different machinery from that of previous work.
Notably, our parallel optimization algorithm has several key features:

(1) Our parallel algorithm efficiently exploits the geometric information of the data or parameters.

(2) The algorithm minimizes expensive inter-processor communication.

(3) The algorithm has theoretical guarantees in approximating the true optimizer, characterized in terms of convergence rates.

(4) The algorithm has outstanding practical performance in simulation studies and real data examples.
Our paper is organized as follows. In Section 2 we review work related to parallel inference. We then present our proposed parallel optimization framework in Section 3 and theoretical convergence results for our parallel algorithm in Section 4. In Section 5, we consider a simulation study estimating Fréchet means on spheres and a real data example using the Netflix prize data set. The paper ends with a conclusion and discussion of future work in Section 6.
2 Related work
In the typical “big data” scenario, the entire data set usually cannot fit on one machine, so parallel inference algorithms with provably good theoretical convergence properties are crucial. In such a setting, we assume we have i.i.d. observations that are divided into subsets and stored on separate machines. While it is important to consider inference problems when the data are not i.i.d. across processors, we only consider the i.i.d. setting as a simplifying assumption for the theory.
For a given loss function, each machine has access to a local empirical loss function computed from its own subset of the data. The local loss functions are then combined into a global loss function. For our intended optimization routine, we are actually looking for the minimizer of an expected loss function; in the parallel setting we cannot investigate it directly and may only analyze it through the empirical losses. However, calculating the total loss function directly and exactly requires excessive inter-processor communication, which carries a huge computational burden as the number of processors increases. Thus, we must approximate the true parameter by an empirical risk minimizer.
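To make the decomposition concrete, the following is a minimal single-process Euclidean sketch (not the paper's implementation) using a hypothetical squared-error loss; the names `local_loss`, `global_loss`, and the split into `m` subsets are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)

# Hypothetical stand-in loss: squared error, so the empirical risk
# minimizer of the global loss is simply the sample mean.
def local_loss(theta, subset):
    return np.mean((subset - theta) ** 2)

m = 4                                  # number of machines
subsets = np.array_split(data, m)      # i.i.d. data divided across machines

def global_loss(theta):
    # Average of the local empirical losses. Evaluating this exactly
    # requires one round of inter-processor communication per call.
    return np.mean([local_loss(theta, s) for s in subsets])

theta_hat = data.mean()                # empirical risk minimizer
```

With equal subset sizes, the average of the local losses equals the loss computed on the full data set, which is exactly what a parallel algorithm tries to approximate without paying for repeated communication.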
In this work, we focus on generalizing a particular parallel inference framework, the Iterative Local Estimation Algorithm (ILEA) [12], to manifolds. This algorithm optimizes an approximate, surrogate loss function instead of the global loss function as a way to avoid processor communication. The idea of the surrogate function starts from the Taylor series expansion of the global loss function.
The global high-order derivatives are replaced by local high-order derivatives from the first machine.
So the approximation error is
The infinite sum in the expansion can be replaced by
We can omit the additive constant, which does not affect the minimizer. Thus the surrogate loss function is defined as
Thus, the surrogate minimizer approximates the empirical risk minimizer.
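The following sketch illustrates the ILEA surrogate idea in the Euclidean case under a simplified quadratic loss of our own choosing; `grad_local` and `surrogate_grad` are illustrative names, and a real implementation would distribute the gradient averaging over MPI. The surrogate gradient is the first machine's local gradient, shifted so that at the expansion point it agrees with the global gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
subsets = np.array_split(rng.normal(1.0, 1.0, size=800), 4)

def grad_local(theta, subset):
    # Gradient of the local squared-error loss mean((s - theta)^2).
    return 2.0 * (theta - subset.mean())

def surrogate_grad(theta, theta_bar):
    # Gradient of the ILEA-style surrogate: the first machine's gradient,
    # corrected so that at theta_bar it matches the global gradient.
    # Only one communication round (to form gN_bar) is needed.
    g1_bar = grad_local(theta_bar, subsets[0])
    gN_bar = np.mean([grad_local(theta_bar, s) for s in subsets])
    return grad_local(theta, subsets[0]) - (g1_bar - gN_bar)

# Minimize the surrogate by gradient descent from the expansion point 0.
theta = 0.0
for _ in range(200):
    theta -= 0.25 * surrogate_grad(theta, theta_bar=0.0)
```

For this quadratic loss the correction is exact, so minimizing the surrogate already recovers the global minimizer; for general losses the correction is only approximate and the scheme is iterated.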
Jordan et al. [12] show that the resulting surrogate minimizers have a provably good convergence rate to the empirical risk minimizer, given the following regularity conditions:

(1) The parameter space is a compact and convex subset of Euclidean space.

(2) The Hessian matrix is invertible at the true parameter; that is, there exist constants such that

(3) For any there exists such that

(4) For a ball around the true parameter there exist constants and a function such that

for all
These conditions lead to the following theorem:
Theorem 1.
Suppose that the standard regularity conditions hold and the initial estimator lies in a neighborhood of the true parameter. Then the minimizer of the surrogate loss function satisfies
with probability at least
where the constants are independent of the sample size and the number of machines.

3 Parallel optimization on manifolds
Our work aims to generalize the typical gradient descent optimization framework to manifold optimization. In particular, we use the ILEA framework as our working example for generalizing parallel optimization algorithms. Instead of working with a Euclidean parameter space, we have a manifold, and we consider a surrogate loss function, defined on a subset of the manifold, that approximates the global loss function. Here we choose to optimize the surrogate on a different machine at each iteration for efficient exploration, unlike the previous algorithm, where the surrogate function is always optimized on the first machine.
To generalize the idea of moving along a gradient on the manifold, we use a retraction map, which is not necessarily the exponential map one would typically use in manifold gradient descent, but shares several important properties with it. Namely, a retraction on the manifold is a smooth mapping with the following properties:

(i) where denotes the identity mapping on

We also demand that:

(ii) for any curves and where must coincide;

(iii) the triangle inequality holds, that is, for any it is the case that where is the length of the curve for
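As a concrete example, the following sketch (our own illustration, not code from the paper) implements the projection retraction on the unit sphere, which satisfies the base-point property above while being cheaper to evaluate than the exponential map:

```python
import numpy as np

def project_to_tangent(x, w):
    # Orthogonal projection of an ambient vector w onto the tangent space at x.
    return w - np.dot(x, w) * x

def sphere_retraction(x, v):
    # Projection retraction on the unit sphere: step in the ambient space and
    # renormalize. It is not the exponential map (which would follow a great
    # circle), but it satisfies R_x(0) = x and agrees with it to first order.
    y = x + v
    return y / np.linalg.norm(y)

x = np.array([1.0, 0.0, 0.0])
v = project_to_tangent(x, np.array([0.0, 0.5, 0.0]))
y = sphere_retraction(x, v)
```

The retraction of the zero tangent vector returns the base point, and the retracted point always lies back on the sphere.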
Our construction starts with Taylor’s formula for the loss function on the manifold.

Because we split the data across machines, evaluating the global derivatives requires excessive processor communication. We reduce this communication by replacing the global high-order derivatives with the local high-order derivatives. This gives us the following surrogate:

Then we have the following approximation error:

We replace with

Since we are not interested in the value of the surrogate but in its minimizer, we omit the additive constant and redefine accordingly. Then we can generalize the exponential map and the inverse exponential map to the retraction map and the inverse retraction map, which is also called the lifting, and redefine
Therefore we have the following generalization of the Iterative Local Estimation Algorithm (ILEA) to the manifold setting:
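Structurally, the generalized iteration can be sketched as below. This is our own simplified single-process illustration: `grad_local` plays the role of the lifted local gradients, `retract` is the chosen retraction, and the optimizing machine rotates across iterations as described above. We instantiate it trivially in Euclidean space (where the retraction is vector addition) only so the sketch runs end to end:

```python
import numpy as np

def manifold_ilea(theta0, subsets, grad_local, retract,
                  n_iters=5, inner_steps=100, step=0.1):
    # Sketch of the generalized ILEA iteration. At iteration t, machine
    # k = t mod m forms a surrogate whose gradient is its local gradient
    # corrected by one communicated global gradient at the current iterate,
    # and minimizes it by retraction-based gradient descent.
    m = len(subsets)
    theta = theta0
    for t in range(n_iters):
        k = t % m                      # rotate the optimizing machine
        g_bar = np.mean([grad_local(theta, s) for s in subsets], axis=0)
        correction = grad_local(theta, subsets[k]) - g_bar
        x = theta
        for _ in range(inner_steps):
            g = grad_local(x, subsets[k]) - correction  # surrogate gradient
            x = retract(x, -step * g)
        theta = x
    return theta

# Trivial Euclidean instantiation with a quadratic stand-in loss.
rng = np.random.default_rng(2)
subsets = np.array_split(rng.normal(3.0, 1.0, size=400), 4)
theta = manifold_ilea(np.zeros(1), subsets,
                      grad_local=lambda th, s: 2.0 * (th - s.mean()),
                      retract=lambda x, v: x + v)
```

On a genuine manifold, `retract` would be a retraction such as the spherical projection above, and the gradients would live in the tangent space at the current iterate, compared via the lifting.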
4 Convergence rates of the algorithm
To establish theoretical convergence rates for our algorithm, we must impose some regularity conditions on the parameter space, the loss function, and the population risk. We must establish these conditions specifically for manifolds instead of simply reusing the regularity conditions for Euclidean spaces. For example, on a manifold the Hessians are defined on different tangent spaces, meaning there cannot be any linear expressions of the second-order derivatives.
In the manifold for any we can define the vector field as We can also take the covariant derivative of along the retraction
(1) 
The expression (1) defines a linear map between tangent spaces, and we want to impose some conditions on this map. Finally, we impose the following regularity conditions on the parameter space, the loss function, and the population risk.

(1) The parameter space is a compact and convex subset of the manifold, which means that for any curves and must be within for any We also demand that there exists such that

where is the geodesic distance.

(2) The matrix is invertible at the true parameter; that is, there exist constants such that

(3) For any there exists such that

(4) There exist constants and a function such that for all and

where is the spectral norm of matrices. Moreover, satisfies for some constant
Given these conditions, we have the following theorem:
Theorem 2.
If the standard regularity conditions hold, the initial estimator lies in a neighborhood of the true parameter, and
where then any minimizer of the surrogate loss function satisfies
with probability at least where the constants are independent of the sample size and the number of machines.
5 Simulation study and data analysis
To examine the quality of our parallel algorithm, we first apply it to the estimation of Fréchet means on spheres, which has a closed-form expression for the extrinsic mean (the true empirical minimizer). In addition, we apply our algorithm to the Netflix movie-ranking data set as an example of optimization over Grassmannian manifolds in the low-rank matrix completion problem. In the following results, we demonstrate the utility of our algorithm both for high dimensional manifold-valued data (Section 5.1) and for Euclidean data with non-Euclidean parameters (Section 5.2). We wrote the code for our implementation in Python and carried out the parallelization through MPI [7]. Our code is available at https://github.com/michaelzhang01/parallel_manifold_opt.
5.1 Estimation of Fréchet means on manifolds
We first consider the problem of estimating Fréchet means [10] on manifolds. In particular, the manifold under consideration is the sphere, on which we wish to estimate both the extrinsic and the intrinsic mean [3]. Let be a general manifold and be a distance on it, which can be an intrinsic distance, obtained by employing a Riemannian structure, or an extrinsic distance, obtained via an embedding into some Euclidean space. Given a sample of points on the hypersphere, the sample Fréchet mean is defined as
(2) 
where is some distance on the sphere.
The extrinsic distance for our spherical example is the Euclidean distance under the identity embedding map, and the corresponding minimizer is called the extrinsic Fréchet mean on the sphere. We choose this example for our simulation because we know the true global optimizer: the standard Euclidean sample mean projected back onto the sphere. The intrinsic Fréchet mean, on the other hand, is defined with respect to the geodesic distance (the arc length). In this case we compare the estimator obtained from the parallel algorithm with the optimizer obtained from a gradient descent algorithm along the sphere applied to the entire data set. Although the spherical case may be an “easy” setting, as it has a Betti number of zero, we chose this example so that we have ground truth against which to compare our results. In fact, we perform favorably even when the dimensionality of the data is high, as we increase the number of processors.
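Both estimands can be computed directly on a small synthetic data set. This is our own illustration (the data below are not the von Mises data of the experiment): the extrinsic mean is the projected Euclidean mean in closed form, while the intrinsic mean is computed by Riemannian gradient descent with the exponential and inverse exponential (log) maps:

```python
import numpy as np

rng = np.random.default_rng(3)
raw = rng.normal(0.0, 0.1, size=(200, 3)) + np.array([0.0, 0.0, 1.0])
points = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Extrinsic Fréchet mean: the Euclidean sample mean projected onto the sphere.
euclid_mean = points.mean(axis=0)
extrinsic_mean = euclid_mean / np.linalg.norm(euclid_mean)

def sphere_log(x, y):
    # Inverse exponential map at x on the unit sphere.
    t = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    p = y - np.dot(x, y) * x
    n = np.linalg.norm(p)
    return t * p / n if n > 1e-12 else np.zeros_like(x)

def sphere_exp(x, v):
    # Exponential map at x on the unit sphere.
    n = np.linalg.norm(v)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * (v / n)

# Intrinsic Fréchet mean: Riemannian gradient descent on the geodesic loss.
mu = points[0]
for _ in range(100):
    mu = sphere_exp(mu, np.mean([sphere_log(mu, y) for y in points], axis=0))
```

For data this concentrated, the intrinsic and extrinsic means nearly coincide, which is why the closed-form extrinsic mean makes a convenient ground truth.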
For this example, we simulate one million observations from a von Mises distribution projected onto the unit sphere, with a randomly sampled mean direction and a fixed precision. For the extrinsic mean, the closed-form expression of the projected sample mean acts as a “ground truth” against which we can compare our results. In both the extrinsic and intrinsic mean examples, we run trials of our algorithm over 1, 2, 4, 6, 8 and 10 processors. For the extrinsic mean simulations we compare our results to the true global optimizer in terms of root mean squared error (RMSE), and for the intrinsic mean simulations we compare our distributed results to the single-processor results, also in terms of RMSE.
As we can see in Figure 1, even when we divide our observations across as many as 10 processors, we still obtain favorable results for the estimation of the Fréchet mean, in terms of RMSE against the ground truth in the extrinsic case and against the single-processor results in the intrinsic case. To visualize this comparison, Figure 2 shows an example of our method’s performance on two dimensional data, where our optimization results yield a very close estimate of the true global optimizer.
5.2 Real data analysis: the Netflix example
Next, we consider an application of our algorithm to the Netflix movie rating data set. This data set, with over a million entries, consists of ratings of movies by users, in which only a sparse subset of the user–movie pairs have ratings. To build a better recommendation system, we can frame the problem of predicting users’ ratings as a low-rank matrix completion problem: we learn a point on the Grassmannian manifold which minimizes, over the set of observed entries, the loss function
(3) 
where is a by matrix. Each user has the loss function, where is the Hadamard product, and

This results in the following gradient:

We can assume that the data are divided across the local machines, so that each machine has a local loss function, and the global loss function is

For the iterations, the global gradient is the corresponding combination of the local gradients. Instead of the logarithm map we will use the inverse retraction map,

which gives us the following surrogate function:

and its gradient:
To optimize our loss function, we move along the direction of steepest descent with a suitable step size by applying the retraction. We select the step size according to the modified Armijo algorithm of [6].
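A single descent step of this kind can be sketched as follows. This is our own illustration: the step size is fixed rather than chosen by the modified Armijo rule, and the ambient gradient is a random stand-in rather than the matrix completion gradient above:

```python
import numpy as np

def grassmann_step(U, euclid_grad, step):
    # One steepest-descent step on the Grassmann manifold of r-dimensional
    # subspaces of R^n, with a subspace represented by an n x r matrix U
    # having orthonormal columns. Project the ambient gradient onto the
    # tangent space at U, then retract with a QR decomposition (a standard
    # retraction; not the exponential map).
    H = euclid_grad - U @ (U.T @ euclid_grad)   # tangent-space projection
    Q, _ = np.linalg.qr(U - step * H)
    return Q

n, r = 20, 3
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.normal(size=(n, r)))    # random starting subspace
G = rng.normal(size=(n, r))                     # hypothetical ambient gradient
U_next = grassmann_step(U, G, step=0.1)
```

The QR retraction keeps the iterate on the manifold: the updated representative again has orthonormal columns.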
For our example we fix the matrix rank and the regularization parameter, and divide the data randomly across the processors. Figure 3 shows that we can perform distributed manifold gradient descent on this complicated problem and reach convergence fairly quickly.
6 Conclusion
We propose in this paper a communication-efficient parallel algorithm for general optimization problems on manifolds, applicable to many different manifolds and loss functions. Moreover, our proposed algorithm exploits the geometry of the underlying space efficiently and performs well in simulation studies and practical examples, all while having theoretical convergence guarantees.
In the age of “big data”, distributable inference algorithms are crucial, as we can no longer reliably expect entire data sets to fit on a single processor. Despite this, much of the previous work on parallel inference has focused only on data and parameters in Euclidean space. Realistically, much of the data we are interested in is better modeled on manifolds, and thus we need fast inference algorithms that are provably suitable for situations beyond the Euclidean setting. In future work, we aim to extend the situations under which parallel inference algorithms generalize to manifolds and to demonstrate further critical problems (in neuroscience or computer vision, for example) for which parallel inference is a crucial solution.
Acknowledgments
Lizhen Lin acknowledges the support from NSF grants IIS 1663870, DMS Career 1654579, ARO grant W911NF1510440 and a DARPA grant N660011714041. Bayan Saparbayeva was partially supported by DARPA N660011714041. Michael Zhang is supported by NSF grant 1447721.
References
 [1] Andrew L Alexander, Jee Eun Lee, Mariana Lazar, and Aaron S. Field. Diffusion tensor imaging of the brain. Neurotherapeutics, 4(3):316–329, 2007.
 [2] Abhishek Bhattacharya and Rabi Bhattacharya. Nonparametric Inference on Manifolds: With Applications to Shape Spaces. IMS Monograph #2. Cambridge University Press, 2012.
 [3] Rabi Bhattacharya and Lizhen Lin. Omnibus CLTs for Fréchet means and nonparametric inference on non-Euclidean spaces. Proceedings of the American Mathematical Society, 145:413–428, 2017.
 [4] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds. The Annals of Statistics, 31(1):1–29, 2003.
 [5] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds: II. Ann. Statist., 33:1225–1259, 2005.
 [6] Nicolas Boumal. Optimization and estimation on manifolds. PhD thesis, Université catholique de Louvain, 2014.
 [7] Lisandro Dalcín, Rodrigo Paz, and Mario Storti. MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108–1115, 2005.
 [8] T. Downs, J. Liebman, and W. Mackay. Statistical methods for vectorcardiogram orientations. International Symposium on Vectorcardiography, pages 216–222, 1971.
 [9] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57:592–606, 2012.
 [10] Maurice Fréchet. Lés élements aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10:215–310, 1948.

 [11] Jeffrey Ho, Kuang-Chih Lee, Ming-Hsuan Yang, and David Kriegman. Visual tracking using learned linear subspaces. In Computer Vision and Pattern Recognition, volume 1, pages I-782–I-789, June 2004.
 [12] Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 2018.
 [13] David G. Kendall. Shape manifolds, Procrustean metrics, and complex projective spaces. Bull. of the London Math. Soc., 16:81–121, 1984.
 [14] Eric Kolaczyk, Lizhen Lin, Steven Rosenberg, and Jackson Walters. Averages of Unlabeled Networks: Geometric Characterization and Asymptotic Behavior. ArXiv eprints, September 2017.
 [15] Sebastian Kurtek, Eric Klassen, John C. Gore, Zhaohua Ding, and Anuj Srivastava. Elastic geodesic paths in shape space of parameterized surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1717–1730, Sept 2012.
 [16] Sebastian Kurtek, Anuj Srivastava, Eric Klassen, and Zhaohua Ding. Statistical modeling of curves using shapes and related features. Journal of the American Statistical Association, 107(499):1152–1165, 2012.
 [17] Jason D. Lee, Yuekai Sun, Qiang Liu, and Jonathan E. Taylor. Communication-efficient sparse regression: a one-shot approach. CoRR, abs/1503.04337, 2015.
 [18] Lizhen Lin, Vinayak Rao, and David B. Dunson. Bayesian nonparametric inference on the Stiefel manifold. Statistica Sinica, 27:535–553, 2017.
 [19] Lester Mackey, Ameet Talwalkar, and Michael I Jordan. Distributed matrix completion and robust factorization. The Journal of Machine Learning Research, 16(1):913–960, 2015.
 [20] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson. Robust and scalable Bayes via a median of subset posterior measures. Journal of Machine Learning Research, 18(124):1–40, 2017.

 [21] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 623–632, 2014.
 [22] Christopher Nemeth, Chris Sherlock, et al. Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis, 13(2):507–530, 2018.
 [23] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for largescale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.
 [24] Hesamoddin Salehian, Rudrasis Chakraborty, Edward Ofori, David Vaillancourt, and Baba C. Vemuri. An efficient recursive estimator of the Fréchet mean on a hypersphere with applications to medical image analysis. In Mathematical Foundations of Computational Anatomy, volume 3, 2015.
 [25] Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.
 [26] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, 2014.
 [27] G. Prabhu Teja and Sundaram Ravi. Face recognition using subspaces techniques. In International Conference on Recent Trends In Information Technology, pages 103–107, April 2012.
 [28] Xiangyu Wang, Fangjian Guo, Katherine A. Heller, and David B. Dunson. Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pages 451–459. 2015.
 [29] Michael Minyi Zhang, Henry Lam, and Lizhen Lin. Robust and parallel Bayesian model selection. Computational Statistics and Data Analysis, 127:229–247, 2018.
 [30] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communicationefficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
Proof of Theorem 2
For let and consider the following “good events”:
Lemma 1.
Under standard regularity conditions, we have
where constants and are independent of
Proof.
Apply Lemma 6 of Zhang et al. [30], which we also generalize to the manifold setting.
Lemma 2.
Under the event we have
where
Proof.
We first prove that is strongly convex over the ball. Indeed, let us consider
From the first regularity condition we have
so
Then by our choice of
So finally we have
and so
and so
This means that is strongly convex.
Thus by the definition of strong convexity
then
Dividing each side by
Hence
∎
Then from the previous lemma
Lemma 3.
Under standard regularity conditions 2 and 4, there exist universal constants such that for
Now we apply Markov’s inequality, Jensen’s inequality and the union bound to obtain that there exist constants independent of such that