1 Introduction
Over the past years, there has been a surge of interest in machine learning in nonEuclidean domains. Representation learning in spaces of constant negative curvature, i.e. hyperbolic, has shown to outperform Euclidean embeddings significantly on data with latent hierarchical, taxonomic or entailment structures. Such data naturally arises in language modeling [29, 30, 9, 38, 13], recommendation systems [10], image classification, and fewshot learning tasks [21]
, to name a few. Grassmannian manifolds find applications in computer vision to perform videobased face recognition and image set classification
[19, 25]. The Lie groups of transformations andappear naturally when dealing with problems like pose estimation and skeletonbased action recognition
[16, 18]. Stiefel manifolds are used for emotion recognition and action recognition from video data [8, 17]. The space of symmetric positive definite (SPD) matrices characterize data from diffusion tensor imaging
[33], and functional magnetic resonance imaging [40]. Among those advances is a family of neural network methods, which are parameterized by weights constrained to a particular nonEuclidean space [29, 9, 18, 19, 13, 10, 21].Stochastic Gradient Descent (SGD) [35]
algorithm has been used for the backpropagation learning in connectionist neural network models. The SGD is known to generate undesirable oscillations around optimal values of model parameters
[34]. Using a momentum term [37] has been shown to improve the optimization convergence. More recently, adaptive learning rates [15, 22, 43] have been proposed, and are popular optimization algorithms used for iterative optimization while training neural networks.These adaptations of the SGD methods have been studies extensively under the lenses of Riemannian geometry and translated to nonEuclidean settings [6, 36, 5, 20]. Remannian stochastic optimization algorithms have been robustly integrated with many popular machine learning toolboxes. However, previous work in this space was mainly motivated by research use cases [42, 27, 28, 23]. Whereas practical aspects, such as deploying and maintaining machine learning models, were often overlooked.
We present TensorFlow RiemOpt, a Python library for optimization on Riemannian manifolds in TensorFlow [1], to help bridge the aforementioned gap. The library is designed with the aim for a seamless integration with the TensorFlow ecosystem, targeting not only research, but also endtoend machine learning pipelines with the recently proposed TensorFlow Extended platform [4].
2 Riemannian optimization
In this section we provide a brief overview of Riemannian optimization with a focus on stochastic methods. We refer to [12, 39, 2] for more rigorous theoretical treatment.
Optimization on Riemannian manifolds, or Riemannian optimization, is a class of methods for optimization of the form
(1) 
where is an objective function, subject to constraints which are smooth, in the sense that the search space
admits the structure of a differentiable manifold, equipped with a positivedefinite inner product on the space of tangent vectors
at each point . Conceptually, the approach of Riemannian optimization translates the constrained optimization problem (1) into performing optimization in the Euclidean space and projecting the parameters back onto the search space after each iteration [2].In particular, several first order stochastic optimization methods commonly used in the Euclidean domain, such as the SGD algorithm and related adaptive methods, have already been adapted to Riemannian settings.
The update equation of the (Euclidean) SGD for differentiable takes the form
(2) 
where denotes the gradient of objective evaluated at the minibatch taken at step , and is the learning rate. For a Riemannian manifold , Riemannian SGD [6] performs the update
(3) 
where denotes the projection operator from the ambient Euclidean space on the tangent space . And is the exponential map operator, which moves a vector back to the manifold from the tangent space at .
In a neighborhood of , the exponential map identifies a point on the geodesic, thus guarantees decreasing the objective function. Intuitively, the exponential map enables to perform an update along the shortest path in the relevant direction in unit time, while remaining in the manifold. In practice, when is not known in closedform or it is computationally expensive to evaluate, it is common to replace it by a retraction map , which is a firstorder approximation of the exponential.
Adaptive optimization techniques compute smoother estimates of the gradient vector using its first or second order moments. RMSProp
[15] is an example of a gradient based optimization algorithm which does this by maintaining an exponentially moving average of the squared gradient, which is an estimate of the second raw moment of the gradient. The update equations for RMSProp can be expressed as follows(4)  
where
is a hyperparameter,
is the learning rate, and denotes the Hadamard product.In the Euclidean settings, these estimates can be obtained by linearly combining previous moment vectors due to the inherent “flat” nature of the underlying space. However, since general Riemannian manifolds can be curved, it is not possible to simply add moment vectors at different points on the manifold, as the resulting vectors may not lie in the tangent spaces at either points.
Riemannian versions of adaptive SGD algorithms use the parallel transport to circumvent this issue [36, 5]. The parallel transport operator takes and outputs . Informally, is obtained by moving in a “parallel” fashion along the geodesic curve connecting and , where the intermediate vectors obtained through this process have constant norm and always belong to tangent spaces. Such transformation enables combining moment vectors computed at different points of the optimization trajectory.
Geometryaware constrained RMSProp (cRMSProp) [36] translates the Equation (4) to Riemannian settings as follows
(5)  
In practice, if is not known for a given Riemannian manifold or it is computationally expensive to evaluate, it is replaced by its firstorder approximation .
While many optimization problems are of the described form, technicalities of differential geometry and implementation of corresponding operators often pose a significant barrier for experimenting with these methods. ‹
3 Design goals
Working out and implementing gradients is a laborious and error prone task, particularly when the objective function acts on higher rank tensors. TensorFlow RiemOpt builds upon the TensorFlow framework [1], and leverages its automatic differentiation capabilities for computing gradients of composite functions. The design of TensorFlow RiemOpt was informed by three main objectives:

Interoperability with the TensorFlow ecosystem. All components, including optimization algorithms and manifoldconstrained variables, can be used as dropin replacements with the native TensorFlow API. This ensures transparent integration with the rich ecosystem of tools and extensions in both research and production settings. Specifically, TensorFlow RiemOpt was tested to be compatible with the TensorFlow Extended platform in graph and eager execution modes.

Computational efficiency. TensorFlow RiemOpt aims for providing closedform expressions for manifold operators. The library also implements numerical approximation as a fallback option, when such solutions are not available. TensorFlow RiemOpt solvers can perform updates on dense and sparse tensors efficiently.

Robustness and numerical stability. The library makes use of half, single, and double precision arithmetic where appropriate.
4 Implementation overview
The package implements concepts in differential geometry, such as manifolds and Riemannian metrics with the associated exponential and logarithmic maps, geodesics, retractions, and transports. For manifolds, where closedform expressions are not known, the library provides numerical approximations. The core module also exposes functions for assigning TensorFlow variables to manifold instances.
4.1 Manifolds
The module manifolds is modeled after the Manopt [7] API, and provides implementations of the following manifolds:

Cholesky – manifold of lower triangular matrices with positive diagonal elements [26]

Euclidian – an unconstrained manifold with the Euclidean metric

Grassmannian – manifold of dimensional linear subspaces of the dimensional space [12]

Hyperboloid – manifold of dimensional hyperbolic space embedded in the dimensional Minkowski space

Poincare – the Poincaré ball model of the hyperbolic space[29]

Product – Cartesian product of manifolds [14]

SPDAffineInvariant – manifold of symmetric positive definite (SPD) matrices endowed with the affineinvariant metric [33]

SPDLogCholesky – SPD manifold with the LogCholesky metric [26]

SPDLogEuclidean – SPD manifold with the LogEuclidean metric [3]

SpecialOrthogonal – manifold of rotation matrices

Sphere – manifold of unitnormalized points

StiefelEuclidean – manifold of orthonormal frames in the dimensional space endowed with the Euclidean metric

StiefelCanonical – Stiefel manifold with the canonical metric [44]

StiefelCayley – Stiefel manifold the retraction map via an iterative Cayley transform [24]
Each manifold is implemented as a Python class, which inherits the abstract base class Manifold. The minimal set of methods required for optimization includes:

inner(x, u, v) – inner product (i.e., the Riemannian metric) between two tangent vectors and in the tangent space at

proju(x, u) – projection of a tangent vector in the ambient space on the tangent space at point

retr(x, u) – retraction from point with given direction . Retraction is a firstorder approximation of the exponential map introduced in [6]

transp(x, y, v) – vector transport of a tangent vector at point to the tangent space at point . Vector transport is the firstorder approximation of parallel transport
Selected manifolds also support exact computations for additional operators:

exp(x, u) – exponential map of a tangent vector at point to the manifold

log(x, y) – logarithmic map of a point to the tangent space at

ptransp(x, y, v) – parallel transport of a vector from the tangent space at to the tangent space at
All methods support broadcasting of tensors with different shapes to compatible shapes for arithmetic operations.
4.2 Optimizers
The module optimizers provides TensorFlow 2 API for optimization algorithms on Riemannian surfaces, including recently proposed adaptive methods:
Algorithms are implemented to support dense and sparse updates, as well as serialization, which is crucial for compatibility with TensorFlow functions.
4.3 Layers
5 Usage
A simple illustrative example of using lowlevel API is depicted in Listing 1
. There, TensorFlow RiemOpt closely follows the Manopt semantics and naming convention. Geometric meaning of those operations is visualized on Figure
1.Constructing an optimization problem is demonstrated on Listing 2. Manifoldvalued variables can be transparently passed to standard TensorFlow functions. And, conversely, native TensorFlow tensors are treated by TensorFlow RiemOpt algorithms as data in the Euclidean space.
6 Advanced Usage
7 Related work
This section compares TensorFlow RiemOpt with related implementations on differential geometry and learning. While inspired by related work, the main difference of our library lies in the choice of the underlying framework and design objectives.
The library Geoopt [23]
is the most closely related to TensorFlow RiemOpt. Geoopt is a researchoriented toolbox, which builds upon the PyTorch
[31]library for tensor computation and GPU acceleration. Geoopt supports the Riemannian SGD, as well as adaptive optimization algorithms on Riemannian manifolds, and Markov chain Monte Carlo methods for sampling. However, PyTorch, being a researchfirst framework, lacks tooling for bringing machine learning pipelines to production, which limits Geoopt applicability in industrial settings.
McTorch [27] is a manifold optimization library for deep learning, that extends the PyTorch C++ code base to closely follow its architecture. McTorch supports Riemannian variants of stochastic optimization algorithms, and also implements a collection of neural network layers with manifoldconstrained parameters. McTorch shares the same limitations as Geoopt due to its dependency on PyTorch.
Pymanopt [42]
is a Python package that computes gradients and Hessianvector products on Riemannian manifolds, and provides the following solvers: steepest descent, conjugate gradients, the NelderMead algorithm, and the Riemannian trust regions. Pymanopt leverages on Theano
[41] symbolic differentiation and on TensorFlow automatic differentiation for computing derivatives. Pymanopt is a great generalpurpose tools for Riemannian optimization. However, it is not wellsuited for neural network applications due to lack of supports for SGD algorithms. It is also not intended for production use.Lastly but not least, there is Geomstats [28] Python package for computations and statistics on nonlinear manifolds. Geomstats supports a broad list of manifolds such as hyperspheres, hyperbolic spaces, spaces of symmetric positive definite matrices and Lie groups of transformations. It also provides a modular library of differential geometry concepts, which includes the parallel transports, exponential and logarithmic maps, LeviCivita connections, and Christoffel symbols. Geomstats focuses on research in differential geometry and education use cases, by providing lowlevel code that follows the ScikitLearn [32] semantics. Geomstats examples also include a modified version of the Keras framework with support of the Riemannian SGD algorithm. However, it lacks engineering maintenance.
8 Conclusion
We propose TensorFlow RiemOpt, a library that combines optimization on Riemannian manifolds with automated differentiation capabilities of the TensorFlow framework. The library enables researchers to experiment with different state of the art solvers for optimization problems in nonEuclidean domains, while also allowing practitioners to efficiently pursue largescale machine learning. The benefits of TensorFlow RiemOpt are most noticeable when it comes to taking Riemannian machine learning models from research to production, where it unlocks advantages of TensorFlow tooling, such as continuous training and model validation.
References
 [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for largescale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.
 [2] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
 [3] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Geometric means in a novel vector space structure on symmetric positivedefinite matrices. SIAM journal on matrix analysis and applications, 29(1):328–347, 2007.
 [4] D. Baylor, E. Breck, H.T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, et al. Tfx: A tensorflowbased productionscale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1387–1395, 2017.
 [5] G. Bécigneul and O.E. Ganea. Riemannian adaptive optimization methods. arXiv preprint arXiv:1810.00760, 2018.
 [6] S. Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
 [7] N. Boumal, B. Mishra, P.A. Absil, and R. Sepulchre. Manopt, a matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455–1459, 2014.
 [8] R. Chakraborty, B. C. Vemuri, et al. Statistics on the stiefel manifold: theory and applications. Annals of Statistics, 47(1):415–438, 2019.
 [9] B. P. Chamberlain, J. Clough, and M. P. Deisenroth. Neural embeddings of graphs in hyperbolic space. arXiv preprint arXiv:1705.10359, 2017.
 [10] B. P. Chamberlain, S. R. Hardwick, D. R. Wardrope, F. Dzogang, F. Daolio, and S. Vargas. Scalable hyperbolic recommender systems. arXiv preprint arXiv:1902.08648, 2019.
 [11] F. Chollet et al. Keras. https://keras.io, 2015.
 [12] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
 [13] O.E. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5350–5360, 2018.
 [14] A. Gu, F. Sala, B. Gunel, and C. Ré. Learning mixedcurvature representations in product spaces. In International Conference on Learning Representations, 2018.
 [15] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of minibatch gradient descent. Cited on, 14(8), 2012.

[16]
B. Hou, N. Miolane, B. Khanal, M. C. Lee, A. Alansary, S. McDonagh, J. V.
Hajnal, D. Rueckert, B. Glocker, and B. Kainz.
Computing cnn loss and gradients for pose estimation with riemannian geometry.
In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 756–764. Springer, 2018. 
[17]
Z. Huang and L. Van Gool.
A riemannian network for spd matrix learning.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, volume 31, 2017. 
[18]
Z. Huang, C. Wan, T. Probst, and L. Van Gool.
Deep learning on lie groups for skeletonbased action recognition.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pages 6099–6108, 2017.  [19] Z. Huang, J. Wu, and L. Van Gool. Building deep networks on grassmann manifolds. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
 [20] H. Kasai, P. Jawanpuria, and B. Mishra. Riemannian adaptive stochastic gradient algorithms on matrix manifolds. In International Conference on Machine Learning, pages 3262–3271. PMLR, 2019.
 [21] V. Khrulkov, L. Mirvakhabova, E. Ustinova, I. Oseledets, and V. Lempitsky. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6418–6428, 2020.
 [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [23] M. Kochurov, R. Karimov, and S. Kozlukov. Geoopt: Riemannian optimization in pytorch. arXiv preprint arXiv:2005.02819, 2020.
 [24] J. Li, F. Li, and S. Todorovic. Efficient riemannian optimization on the stiefel manifold via the cayley transform. In International Conference on Learning Representations, 2019.
 [25] Y. Li, R. Wang, Z. Huang, S. Shan, and X. Chen. Face video retrieval with image query via hashing across euclidean space and riemannian manifold. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4758–4767, 2015.
 [26] Z. Lin. Riemannian geometry of symmetric positive definite matrices via cholesky decomposition. SIAM Journal on Matrix Analysis and Applications, 40(4):1353–1370, 2019.
 [27] M. Meghwanshi, P. Jawanpuria, A. Kunchukuttan, H. Kasai, and B. Mishra. Mctorch, a manifold optimization library for deep learning. arXiv preprint arXiv:1810.01811, 2018.
 [28] N. Miolane, N. Guigui, A. Le Brigant, J. Mathe, B. Hou, Y. Thanwerdas, S. Heyder, O. Peltre, N. Koep, H. Zaatiti, et al. Geomstats: A python package for riemannian geometry in machine learning. Journal of Machine Learning Research, 21(223):1–9, 2020.
 [29] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems, 30:6338–6347, 2017.
 [30] M. Nickel and D. Kiela. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International Conference on Machine Learning, pages 3779–3788. PMLR, 2018.
 [31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, highperformance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
 [32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikitlearn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.
 [33] X. Pennec, P. Fillard, and N. Ayache. A riemannian framework for tensor computing. International Journal of computer vision, 66(1):41–66, 2006.
 [34] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
 [35] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 [36] S. K. Roy, Z. Mhammedi, and M. Harandi. Geometry aware constrained optimization techniques for deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4460–4469, 2018.
 [37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by backpropagating errors. nature, 323(6088):533–536, 1986.
 [38] F. Sala, C. De Sa, A. Gu, and C. Ré. Representation tradeoffs for hyperbolic embeddings. In International conference on machine learning, pages 4460–4469. PMLR, 2018.
 [39] S. T. Smith. Optimization techniques on riemannian manifolds. Fields Institute Communications, 3, 1994.
 [40] O. Sporns, G. Tononi, and R. Kötter. The human connectome: a structural description of the human brain. PLoS Comput Biol, 1(4):e42, 2005.
 [41] T. T. D. Team, R. AlRfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, et al. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
 [42] J. Townsend, N. Koep, and S. Weichwald. Pymanopt: A python toolbox for optimization on manifolds using automatic differentiation. The Journal of Machine Learning Research, 17(1):4755–4759, 2016.
 [43] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 [44] R. Zimmermann. A matrixalgebraic algorithm for the riemannian logarithm on the stiefel manifold under the canonical metric. SIAM Journal on Matrix Analysis and Applications, 38(2):322–342, 2017.
Comments
There are no comments yet.