Communication Efficient Parallel Algorithms for Optimization on Manifolds

October 26, 2018 · Bayan Saparbayeva et al. · Princeton University and University of Notre Dame

The last decade has witnessed an explosion in the development of models, theory and computational algorithms for "big data" analysis. In particular, distributed computing has served as a natural and dominating paradigm for statistical inference. However, the existing literature on parallel inference almost exclusively focuses on Euclidean data and parameters. While this assumption is valid for many applications, it is increasingly common to encounter problems where the data or the parameters lie on a non-Euclidean space, such as a manifold. Our work aims to fill a critical gap in the literature by generalizing parallel inference algorithms to optimization on manifolds. We show that our proposed algorithm is both communication efficient and carries theoretical convergence guarantees. In addition, we demonstrate the performance of our algorithm on the estimation of Fréchet means for simulated spherical data and on the low-rank matrix completion problem over Grassmann manifolds applied to the Netflix prize data set.


1 Introduction

A natural representation for many statistical and machine learning problems is to assume that the parameter of interest lies on a space more general than Euclidean space. Typical examples include diffusion matrices in large-scale diffusion tensor imaging (DTI), which are positive definite matrices now commonly used in neuroimaging for clinical trials [1]. In computer vision, images are often preprocessed or reduced to a collection of subspaces [11, 27], or a digital image can be represented by a set of landmarks, forming landmark-based shapes [13]. One may also encounter data stored as orthonormal frames [8], surfaces [15], curves [16], and networks [14].

In addition, parallel inference has become popular for overcoming the computational burden arising from the storage and processing of big data, resulting in a vast literature in statistics and machine learning dedicated to this topic. The general scheme in the frequentist setting is to divide the data into subsets, obtain an estimate from each subset, and combine the subset estimates into a final estimate for inference [9, 30, 17] (see the sketch below). In the Bayesian setting, subset posterior distributions are first obtained in the dividing step, and these subset posterior measures, or the MCMC samples from each subset posterior, are then combined for final inference [20, 29, 28, 21, 25, 22]. Most of these methods are “embarrassingly parallel” and often do not require communication across machines or subsets. Some communication-efficient algorithms have also been proposed, with prominent methods including [12] and [26].
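To make the frequentist divide-and-combine recipe concrete, here is a minimal sketch. It is our own toy illustration: the sample-mean estimator, the data split, and the averaging rule are assumptions, not the methods of the cited papers.

```python
import numpy as np

def local_estimate(shard):
    """Toy local estimator: the sample mean of one data shard."""
    return shard.mean(axis=0)

def divide_and_combine(data, n_workers):
    """Embarrassingly parallel divide-and-combine.

    Each shard would be processed on its own machine with no inter-worker
    communication; the local estimates are then averaged into a final estimate.
    """
    shards = np.array_split(data, n_workers)
    local = [local_estimate(s) for s in shards]
    return np.mean(local, axis=0)

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=(100_000, 3))
print(divide_and_combine(data, n_workers=10))
```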

Despite tremendous advancement in parallel inference, previous work largely focuses on Euclidean data and parameter spaces. To better address the challenges arising from inference on big non-Euclidean data, or data with non-Euclidean parameters, there is a crucial need for valid and efficient inference methods, including parallel or distributed algorithms, that appropriately incorporate the underlying geometric structure.

For a majority of applications, the parameter spaces fall into the general category of manifolds, whose geometry is well characterized. Although there is a recent literature on inference for manifold-valued data, including methods based on Fréchet means or model-based methods [3, 4, 5, 2, 18], and even scalable methods for certain models [23, 19, 24], there is still a vital lack of general parallel algorithms on manifolds. We aim to fill this critical gap by introducing our parallel inference strategy. The novelty of our paper is that it generalizes to a wide range of loss functions for manifold optimization problems and that the algorithm can be parallelized by splitting the data across processors. Furthermore, our theoretical development does not rely on previous results; in fact, generalizing Theorem 1 to the manifold setting requires entirely different machinery from that of previous work.

Notably, our parallel optimization algorithm has several key features:

  1. It efficiently exploits the geometric information of the data or parameters.

  2. It minimizes expensive inter-processor communication.

  3. It has theoretical guarantees for approximating the true optimizer, characterized in terms of convergence rates.

  4. It has outstanding practical performance in simulation studies and real data examples.

Our paper is organized as follows. In Section 2 we review work related to parallel inference. We present our proposed parallel optimization framework in Section 3 and theoretical convergence results for our parallel algorithm in Section 4. In Section 5, we consider a simulation study estimating Fréchet means on spheres and a real data example using the Netflix prize data set. The paper ends with a conclusion and a discussion of future work in Section 6.

2 Related work

In the typical “big data” scenario, the entire data set usually cannot fit on one machine, so parallel inference algorithms with provably good theoretical convergence properties are crucial. In such a setting, we assume that we have i.i.d. observations that are divided into subsets and stored on separate machines. While it is important to consider inference problems in which the data are not distributed i.i.d. across processors, we only consider the i.i.d. setting as a simplifying assumption for the theory.

Given a loss function, each machine has access to a local loss function defined on its subset of the data. The local loss functions are combined into a global loss function. For our intended optimization routine, we are actually looking for the minimizer of an expected (population) loss function. In the parallel setting, we cannot investigate the population loss directly and may only analyze it through the global empirical loss. However, calculating the total loss function directly and exactly requires excessive inter-processor communication, which carries a huge computational burden as the number of processors increases. Thus, we approximate the true parameter by an empirical risk minimizer.
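In symbols, one standard way to write this setup is the following (the notation is ours, assuming m machines each holding n observations; the paper's original symbols were lost in extraction):

```latex
% Local, global, and population losses (assumed notation: m machines, n points each)
\begin{align*}
  \mathcal{L}_j(\theta) &= \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; X_{j,i})
    && \text{local loss on machine } j,\\
  \mathcal{L}_N(\theta) &= \frac{1}{m}\sum_{j=1}^{m} \mathcal{L}_j(\theta)
    && \text{global empirical loss, } N = mn,\\
  \mathcal{L}^*(\theta) &= \mathbb{E}\bigl[\ell(\theta; X)\bigr]
    && \text{population risk},\\
  \hat{\theta} &= \operatorname*{arg\,min}_{\theta\in\Theta} \mathcal{L}_N(\theta)
    && \text{empirical risk minimizer, approximating } \theta^* = \operatorname*{arg\,min}_{\theta\in\Theta}\mathcal{L}^*(\theta).
\end{align*}
```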

In this work, we focus on generalizing a particular parallel inference framework, the Iterative Local Estimation Algorithm (ILEA) [12], to manifolds. This algorithm optimizes an approximate, surrogate loss function instead of the global loss function as a way to avoid processor communication. The construction of the surrogate starts from the Taylor series expansion of the global loss around the current iterate.

In this expansion, the global higher-order derivatives are replaced by the local higher-order derivatives from the first machine, which introduces an approximation error that depends only on the difference between the global and local derivatives. The resulting infinite sum of local higher-order terms can in turn be replaced by the local loss minus its first-order Taylor expansion, and the additive constant can be omitted since it does not affect the minimizer. This defines the surrogate loss function, whose minimizer approximates the empirical risk minimizer.
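For reference, in the Euclidean setting of [12] the construction above yields a surrogate of the following form, written in our notation; the paper's original displays were lost in extraction, so treat this as a reconstruction rather than a quotation:

```latex
% Euclidean ILEA surrogate around the current iterate \bar{\theta}
\begin{equation*}
  \widetilde{\mathcal{L}}(\theta)
    = \mathcal{L}_1(\theta)
      - \bigl\langle \nabla\mathcal{L}_1(\bar{\theta}) - \nabla\mathcal{L}_N(\bar{\theta}),\, \theta \bigr\rangle,
  \qquad
  \tilde{\theta} = \operatorname*{arg\,min}_{\theta\in\Theta}\widetilde{\mathcal{L}}(\theta),
\end{equation*}
% so that only the single gradient \nabla\mathcal{L}_N(\bar{\theta}) must be
% communicated, while all higher-order information comes from machine 1.
```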

[12] show that the resulting surrogate minimizers converge to the empirical risk minimizer at a provably good rate under the following regularity conditions:

  1. The parameter space is a compact and convex subset of Besides, and

  2. The Hessian matrix of the population risk is invertible at the true parameter; that is, there exist constants such that

  3. For any there exists such that

  4. For a ball around the true parameter there exist constants and a function such that

    for all

which leads to the following theorem:

Theorem 1.

Suppose that the standard regularity conditions hold and the initial estimator lies in a neighborhood of the true parameter. Then the minimizer of the surrogate loss function satisfies

with probability at least

where the constants and are independent of

3 Parallel optimizations on manifolds

Our work aims to generalize the typical gradient descent optimization framework to manifold optimization. In particular, we use the ILEA framework as our working example for generalizing parallel optimization algorithms. Instead of working with a Euclidean parameter space, we work on a manifold and consider a surrogate loss function, defined on a subset of the manifold, that approximates the global loss function. Here we choose to optimize the surrogate on a selected machine that changes from iteration to iteration for more efficient exploration, unlike the previous algorithm, where the surrogate function is always optimized on the first machine.

To generalize the idea of moving along a gradient on the manifold, we use the retraction map, which is not necessarily the exponential map one would typically use in manifold gradient descent but shares several important properties with it (a concrete example on the sphere is sketched after the lists below). Namely, a retraction on the manifold is a smooth mapping from the tangent bundle to the manifold with the following properties:

  1. The restriction of the retraction to a point and its tangent space maps the zero tangent vector at that point back to the point itself.

  2. The differential of this restriction at the zero tangent vector is the identity mapping on the tangent space.

We also demand that

  1. For any curves and where must coincide,

  2. The triangle inequality holds, that is for any , it is the case that where is the length of the curve for
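As a concrete example (ours, not taken from the paper): on the unit sphere the normalization map is a retraction that satisfies the two defining properties above and is cheaper to evaluate than the exponential map.

```python
import numpy as np

def sphere_retraction(x, v):
    """Retraction on the unit sphere: R_x(v) = (x + v) / ||x + v||.

    Here x is a unit vector and v is a tangent vector at x (i.e. <x, v> = 0).
    R_x(0) = x, and the differential of R_x at 0 is the identity on the
    tangent space, so this is a valid (first-order) retraction.
    """
    y = x + v
    return y / np.linalg.norm(y)

def sphere_inverse_retraction(x, y):
    """Inverse retraction (lifting): the tangent vector v with R_x(v) = y.

    For the normalization retraction this is v = y / <x, y> - x,
    defined whenever <x, y> > 0.
    """
    return y / np.dot(x, y) - x

# quick check of the defining property R_x(R_x^{-1}(y)) = y
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.6, 0.8, 0.0])
v = sphere_inverse_retraction(x, y)
assert np.allclose(sphere_retraction(x, v), y)
```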

Our construction starts with Taylor's formula for the global loss on the manifold. Because we split the data across machines, evaluating the global derivatives requires excessive processor communication. We reduce the amount of communication by replacing the global higher-order derivatives with the corresponding local higher-order derivatives, which gives a surrogate to the global loss whose approximation error depends only on the difference between the global and local derivatives. As in the Euclidean case, the remaining infinite sum of local higher-order terms is replaced by the local loss minus its first-order expansion. Since we are not interested in the value of the surrogate but only in its minimizer, we omit the additive constant. Finally, we generalize the exponential map and the inverse exponential map to the retraction map and the inverse retraction map, which is also called the lifting, and redefine the surrogate accordingly.
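In the manifold setting the same construction can be written with the lifting in place of the Euclidean inner product. The display below is our reconstruction in the notation introduced above; the machine index i and the inner product in the tangent space at the current iterate are assumptions, since the paper's display equations were lost in extraction:

```latex
% Manifold surrogate at the current iterate \bar{\theta}, optimized on machine i
\begin{equation*}
  \widetilde{\mathcal{L}}(\theta)
    = \mathcal{L}_i(\theta)
      - \Bigl\langle \operatorname{grad}\mathcal{L}_i(\bar{\theta})
                     - \operatorname{grad}\mathcal{L}_N(\bar{\theta}),\;
        R_{\bar{\theta}}^{-1}(\theta) \Bigr\rangle_{\bar{\theta}},
  \qquad
  \theta_{t+1} = \operatorname*{arg\,min}_{\theta\in\Theta}\widetilde{\mathcal{L}}(\theta),
\end{equation*}
% where R_{\bar{\theta}}^{-1} is the inverse retraction (lifting) at \bar{\theta}
% and the inner product is taken in the tangent space at \bar{\theta}.
```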

Therefore we have the following generalization of the Iterative Local Estimation Algorithm (ILEA) to manifolds (an mpi4py sketch follows the pseudocode):

Initialize the estimate
for each iteration do
       Transmit the current iterate to the local machines
       for each local machine do
             Compute the local gradient of that machine's loss at the current iterate
             Transmit the local gradient to the machine selected for this iteration
            
      Calculate the global gradient on the selected machine
       Form the surrogate function on the selected machine
       Update the iterate by minimizing the surrogate
Return the final iterate
Algorithm 1: ILEA for Manifolds
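A minimal mpi4py sketch of this loop is given below. This is our own illustration, not the authors' released code: the toy spherical loss, the normalization retraction, the fixed step size, and the choice to rotate the optimizing machine as `t % n_machines` are all assumptions.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, n_machines = comm.Get_rank(), comm.Get_size()

def local_gradient(theta, shard):
    """Riemannian gradient of this machine's local loss at theta.
    Toy local loss: the extrinsic Fréchet-mean loss on the sphere,
    (1/n) * sum_i ||x_i - theta||^2, whose Euclidean gradient is projected
    onto the tangent space at theta."""
    egrad = 2.0 * (theta - shard.mean(axis=0))
    return egrad - np.dot(egrad, theta) * theta

def minimize_surrogate(theta_bar, shard, global_grad, n_steps=50, step=0.1):
    """Approximately minimize the surrogate
        L_local(theta) - <grad L_local(theta_bar) - global_grad, R^{-1}_{theta_bar}(theta)>
    by retraction-based gradient descent.  Simplification: the correction term
    is held fixed and re-projected at each iterate, ignoring transport between
    tangent spaces."""
    correction = local_gradient(theta_bar, shard) - global_grad
    theta = theta_bar.copy()
    for _ in range(n_steps):
        g = local_gradient(theta, shard) - correction
        g -= np.dot(g, theta) * theta                # keep the direction tangent
        theta = theta - step * g
        theta /= np.linalg.norm(theta)               # retraction back to the sphere
    return theta

# each machine holds its own shard of unit vectors
shard = np.random.default_rng(rank).normal(size=(1000, 3))
shard /= np.linalg.norm(shard, axis=1, keepdims=True)

theta = comm.bcast(shard[0].copy() if rank == 0 else None, root=0)   # initialize
for t in range(10):
    selected = t % n_machines                        # rotate the optimizing machine
    g_local = local_gradient(theta, shard)
    g_global = comm.reduce(g_local, op=MPI.SUM, root=selected)
    if rank == selected:
        theta = minimize_surrogate(theta, shard, g_global / n_machines)
    theta = comm.bcast(theta, root=selected)         # transmit the new iterate
```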

4 Convergence rates of the algorithm

To establish theoretical convergence rates for our algorithm, we have to impose regularity conditions on the parameter space, the loss function, and the population risk. These conditions must be formulated specifically for manifolds rather than simply reusing the regularity conditions for Euclidean spaces. For example, on a manifold the Hessians at different points are defined in different tangent spaces, so there are no global linear expressions for the second-order derivatives.

On the manifold we can define a vector field from the loss, and we can also take the covariant derivative of this vector field along the retraction

(1)

The expression (1) defines a linear map between tangent spaces, and we want to impose conditions on this map. Finally, we impose the following regularity conditions on the parameter space, the loss function, and the population risk.

  1. The parameter space is a compact and -convex subset of which means that for any curves and must be within for any and also demand that there exists such that

    where is the geodesic distance,

  2. The matrix defined in (1) is invertible at the true parameter; that is, there exist constants such that

  3. For any there exists such that

  4. There exist constants and a function such that for all and

    where the norm on matrices is the spectral norm. Moreover, the function satisfies for some constant

Given these conditions, we have the following theorem:

Theorem 2.

If the standard regularity conditions hold, the initial estimator lies in a neighborhood of the true parameter, and

where then any minimizer of the surrogate loss function satisfies

with probability at least where constants and are independent of

5 Simulation study and data analysis

To examine the quality of our parallel algorithm, we first apply it to the estimation of Fréchet means on spheres, for which the extrinsic mean (the true empirical minimizer) has a closed-form expression. In addition, we apply our algorithm to the Netflix movie-ranking data set as an example of optimization over Grassmann manifolds in the low-rank matrix completion problem. In the following results, we demonstrate the utility of our algorithm both for high dimensional manifold-valued data (Section 5.1) and for Euclidean data with non-Euclidean parameters (Section 5.2). We wrote the code for our implementation in Python and parallelized it with MPI [7]. Our code is available at https://github.com/michaelzhang01/parallel_manifold_opt.

5.1 Estimation of Fréchet means on manifolds

We first consider the estimation of Fréchet means [10] on manifolds. In particular, the manifold under consideration is the sphere, on which we wish to estimate both the extrinsic and intrinsic mean [3]. For a general manifold, the distance used may be an intrinsic distance, obtained from a Riemannian structure, or an extrinsic distance, obtained via an embedding into some Euclidean space. Given a sample of points on the hypersphere, the sample Fréchet mean is defined as

(2)   \hat{\mu} = \operatorname*{arg\,min}_{p} \frac{1}{n} \sum_{i=1}^{n} \rho^{2}(p, x_i)

where \rho is the chosen distance on the sphere and x_1, \dots, x_n are the sample points.

For our spherical example, the extrinsic distance is the Euclidean distance, with the embedding map taken to be the identity; we call the corresponding minimizer the extrinsic Fréchet mean on the sphere. We choose this example in our simulation because we know the true global optimizer: it is the standard Euclidean sample mean projected onto the sphere. The intrinsic Fréchet mean, on the other hand, is defined with respect to the geodesic distance (the arc length). In this case we compare the estimator obtained from the parallel algorithm with the optimizer obtained from a gradient descent algorithm along the sphere applied to the entire data set. Although the spherical case may be an “easy” setting, as it has a Betti number of zero, we chose this example so that we have ground truth with which to compare our results. We in fact perform favorably even when the dimensionality of the data is high, as we increase the number of processors.
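For reference, here is a small single-machine sketch of both estimators. It is our own illustration: the closed form for the extrinsic mean is the projected Euclidean mean mentioned above, while the intrinsic mean uses plain Riemannian gradient descent with an assumed step size and iteration count.

```python
import numpy as np

def extrinsic_frechet_mean(X):
    """Closed form: the Euclidean sample mean projected back onto the sphere."""
    m = X.mean(axis=0)
    return m / np.linalg.norm(m)

def intrinsic_frechet_mean(X, n_iter=100, step=0.5):
    """Minimize the mean squared geodesic distance by gradient descent on the sphere."""
    mu = extrinsic_frechet_mean(X)                  # warm start
    for _ in range(n_iter):
        # Riemannian gradient of (1/n) sum_i d_g(mu, x_i)^2 is -(2/n) sum_i Log_mu(x_i)
        cos_t = np.clip(X @ mu, -1.0, 1.0)
        dists = np.arccos(cos_t)                    # geodesic distances to mu
        tang = X - cos_t[:, None] * mu              # components tangent to mu
        norms = np.maximum(np.linalg.norm(tang, axis=1), 1e-12)
        logs = (dists / norms)[:, None] * tang      # Log_mu(x_i) for each point
        v = step * 2.0 * logs.mean(axis=0)          # steepest-descent direction
        nv = max(np.linalg.norm(v), 1e-12)
        mu = np.cos(nv) * mu + np.sin(nv) * (v / nv)   # exponential-map update
        mu /= np.linalg.norm(mu)
    return mu
```

With step = 0.5 the update reduces to the classical fixed-point iteration of averaging the log-mapped points and exponentiating.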

For this example, we simulate one million observations from a von Mises distribution projected onto the unit sphere, with the mean direction sampled randomly and a fixed precision. For the extrinsic mean example, the closed-form expression of the projected sample mean acts as a “ground truth” against which we can compare our results. In both the extrinsic and intrinsic mean examples, we run trials of our algorithm over 1, 2, 4, 6, 8 and 10 processors. For the extrinsic mean simulations we compare our results to the true global optimizer in terms of root mean squared error (RMSE), and for the intrinsic mean simulations we compare our distributed results to the single-processor results, also in terms of RMSE.

Figure 1: Extrinsic mean comparison (left) and intrinsic mean comparison (right) on spheres.

As we can see in Figure 1, even if we divide our observations across as many as 10 processors, we still obtain favorable results for the estimation of the Fréchet mean in terms of RMSE, relative to the ground truth in the extrinsic mean case and to the single-processor results in the intrinsic mean case. To visualize this comparison, Figure 2 shows an example of our method's performance on two-dimensional data, where the optimization results yield a very close estimate to the true global optimizer.

Figure 2: Extrinsic mean results for one (left) and ten (right) processors.

5.2 Real data analysis: the Netflix example

Next, we consider an application of our algorithm to the Netflix movie-rating dataset. This dataset of over a million entries consists of movies and users, of which only a sparse subset of user-movie pairs have ratings. In order to build a better recommendation system, we can frame the problem of predicting users' ratings for movies as a low-rank matrix completion problem: we learn a fixed-rank subspace, represented as a point on the Grassmann manifold, that minimizes over the set of observed entries the loss function

(3)

where the optimization variable is a matrix whose orthonormal columns span the low-rank subspace. Each user contributes a loss function, written through the Hadamard product with that user's observation mask, which yields a corresponding gradient. Assigning users to local machines, each machine has a local loss function and the global loss is their combination; at each iteration the global gradient is assembled from the local gradients. Instead of the logarithm map we use the inverse retraction map, which gives the surrogate function and its gradient. To optimize the surrogate, we move in the steepest-descent direction with a chosen step size and map back to the manifold with the retraction. We select the step size parameter according to the modified Armijo algorithm of [6].
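As a rough single-machine sketch of one descent cycle in this setting, consider the code below. It is our own illustration under assumptions not stated in the paper: a least-squares completion loss with ridge-regularized per-user weights, a QR-based retraction on the Grassmannian, and a fixed step size in place of the modified Armijo search.

```python
import numpy as np

def user_loss_and_grad(U, ratings, mask, lam=1e-2):
    """Per-user completion loss and its Euclidean gradient with respect to U.

    U      : d x r matrix with orthonormal columns (a point on the Grassmannian)
    ratings: length-d vector of this user's ratings (zeros where unobserved)
    mask   : length-d 0/1 vector marking observed entries
    The user's weight vector w is the ridge solution on the observed entries;
    the loss is ||mask * (U w - ratings)||^2 + lam * ||w||^2, and by the
    envelope theorem its gradient in U is 2 * residual * w^T.
    """
    Uo = U * mask[:, None]                       # rows of U at observed entries
    w = np.linalg.solve(Uo.T @ Uo + lam * np.eye(U.shape[1]), Uo.T @ ratings)
    resid = mask * (U @ w - ratings)
    return resid @ resid + lam * (w @ w), 2.0 * np.outer(resid, w)

def grassmann_step(U, egrad, step=0.1):
    """One steepest-descent step with a QR retraction on the Grassmannian."""
    rgrad = egrad - U @ (U.T @ egrad)            # project onto the horizontal space
    Q, _ = np.linalg.qr(U - step * rgrad)        # retraction: re-orthonormalize
    return Q

# toy usage: 200 movies, a rank-5 subspace, 50 users on one "machine"
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(200, 5)))
users = []
for _ in range(50):
    mask = (rng.random(200) < 0.1).astype(float)
    users.append((mask * rng.integers(1, 6, size=200), mask))
for _ in range(20):
    grad = sum(user_loss_and_grad(U, r, m)[1] for r, m in users) / len(users)
    U = grassmann_step(U, grad)
```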

For our example, we fixed the matrix rank and the regularization parameter and divided the data randomly across the processors. Figure 3 shows that we can perform distributed manifold gradient descent on this complicated problem and reach convergence fairly quickly.

Figure 3: Test set RMSE of the Netflix example over time, evaluated on 10 trials.

6 Conclusion

In this paper we propose a communication efficient parallel algorithm for general optimization problems on manifolds, applicable to many different manifolds and loss functions. Moreover, our proposed algorithm exploits the geometry of the underlying space efficiently and performs well in simulation studies and practical examples, all while carrying theoretical convergence guarantees.

In the age of “big data”, the need for distributable inference algorithms is crucial, as we can no longer reliably expect entire datasets to fit on a single processor. Despite this, much of the previous work on parallel inference has focused only on data and parameters in Euclidean space. Realistically, much of the data we are interested in is better modeled on manifolds, so we need fast inference algorithms that are provably suitable for settings beyond the Euclidean one. In future work, we aim to broaden the settings in which parallel inference algorithms generalize to manifolds and to demonstrate further important problems (in neuroscience or computer vision, for example) for which parallel inference is a crucial solution.

Acknowledgments

Lizhen Lin acknowledges the support from NSF grants IIS 1663870, DMS Career 1654579, ARO grant W911NF1510440 and a DARPA grant N66001-17-1-4041. Bayan Saparbayeva was partially supported by DARPA N66001-17-1-4041. Michael Zhang is supported by NSF grant 1447721.

References

  • [1] Andrew L Alexander, Jee Eun Lee, Mariana Lazar, and Aaron S. Field. Diffusion tensor imaging of the brain. Neurotherapeutics, 4(3):316–329, 2007.
  • [2] Abhishek Bhattacharya and Rabi Bhattacharya. Nonparametric Inference on Manifolds: With Applications to Shape Spaces. IMS Monograph #2. Cambridge University Press, 2012.
  • [3] Rabi Bhattacharya and Lizhen Lin. Omnibus CLTs for Fréchet means and nonparametric inference on non-Euclidean spaces. Proceedings of the American Mathematical Society, 145:413–428, 2017.
  • [4] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds. The Annals of Statistics, 31(1):1–29, 2003.
  • [5] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds: II. Ann. Statist., 33:1225–1259, 2005.
  • [6] Nicolas Boumal. Optimization and estimation on manifolds. PhD thesis, Université catholique de Louvain, 2014.
  • [7] Lisandro Dalcín, Rodrigo Paz, and Mario Storti. MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108 – 1115, 2005.
  • [8] T. Downs, J. Liebman, and W. Mackay. Statistical methods for vectorcardiogram orientations. International Symposium on Vectorcardiography, pages 216–222, 1971.
  • [9] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57:592–606, 2012.
  • [10] Maurice Fréchet. Lés élements aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10:215–310, 1948.
  • [11] Jeffrey Ho, Kuang-Chih Lee, Ming-Hsuan Yang, and David Kriegman. Visual tracking using learned linear subspaces. In Computer Vision and Pattern Recognition, volume 1, pages I-782–I-789, June 2004.
  • [12] Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 0(ja):0–0, 2018.
  • [13] David G. Kendall. Shape manifolds, Procrustean metrics, and complex projective spaces. Bull. of the London Math. Soc., 16:81–121, 1984.
  • [14] Eric Kolaczyk, Lizhen Lin, Steven Rosenberg, and Jackson Walters. Averages of Unlabeled Networks: Geometric Characterization and Asymptotic Behavior. ArXiv e-prints, September 2017.
  • [15] Sebastian Kurtek, Eric Klassen, John C. Gore, Zhaohua Ding, and Anuj Srivastava. Elastic geodesic paths in shape space of parameterized surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1717–1730, Sept 2012.
  • [16] Sebastian Kurtek, Anuj Srivastava, Eric Klassen, and Zhaohua Ding. Statistical modeling of curves using shapes and related features. Journal of the American Statistical Association, 107(499):1152–1165, 2012.
  • [17] Jason D. Lee, Yuekai Sun, Qiang Liu, and Jonathan E. Taylor. Communication-efficient sparse regression: a one-shot approach. CoRR, abs/1503.04337, 2015.
  • [18] Lizhen Lin, Vinayak Rao, and David B. Dunson. Bayesian nonparametric inference on the Stiefel manifold. Statistica Sinica, 27:535–553, 2017.
  • [19] Lester Mackey, Ameet Talwalkar, and Michael I Jordan. Distributed matrix completion and robust factorization. The Journal of Machine Learning Research, 16(1):913–960, 2015.
  • [20] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson. Robust and scalable Bayes via a median of subset posterior measures. Journal of Machine Learning Research, 18(124):1–40, 2017.
  • [21] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 623–632, 2014.
  • [22] Christopher Nemeth, Chris Sherlock, et al. Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis, 13(2):507–530, 2018.
  • [23] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.
  • [24] Hesamoddin Salehian, Rudrasis Chakraborty, Edward Ofori, David Vaillancourt, and Baba C. Vemuri. An efficient recursive estimator of the Fréchet mean on a hypersphere with applications to medical image analysis. In Mathematical Foundations of Computational Anatomy, volume 3, 2015.
  • [25] Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.
  • [26] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, 2014.
  • [27] G. Prabhu Teja and Sundaram Ravi. Face recognition using subspaces techniques. In International Conference on Recent Trends In Information Technology, pages 103–107, April 2012.
  • [28] Xiangyu Wang, Fangjian Guo, Katherine A. Heller, and David B. Dunson. Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pages 451–459. 2015.
  • [29] Michael Minyi Zhang, Henry Lam, and Lizhen Lin. Robust and parallel Bayesian model selection. Computational Statistics and Data Analysis, 127:229 – 247, 2018.
  • [30] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.

Proof to Theorem 2

For let and Consider the following "good events":

Lemma 1.

Under standard regularity conditions, we have

where constants and are independent of

Proof.

Apply Lemma 6 in Zhang et al. [30], which we also generalize to the manifold setting.

Lemma 2.

Under the event we have

where

Proof.

We first prove that is -strongly convex over the ball Indeed

Let us consider

From the first regularity condition we have

so

Then by our choice of

So finally we have

and so

and so

It means that is -strongly convex.

Thus by the definition of strong convexity

then

Dividing each side by

Hence

Then from the previous lemma

Lemma 3.

Under standard regularity conditions 2 and 4, there exist universal constants such that for

Now we apply Markov’s inequality, Jensen’s inequality and the union bound to obtain that there exist constants independent of such that