An Optimal Transport (OT) problem can be briefly described as to find out the optimized transport plan (defined as transportation polytope) between two or more sets of subjects with certain constraints . It was firstly formalized by French mathematician Gaspard Monge in 1781, and was generalized by Kantorovich who provided a solution of Monge’s problem in 1942  and established its importance to logistics and economics.
As the solution of the OT problem provides the optimized transportation plan between probability distributions, and the advance in computer science allows us to perform a large amount of computation in a high dimensional space, the optimized distance, known as the Wasserstein distance , Monge-Kantorovich distance  and Earth Mover’s distance , has been treated as a target being analyzed in various aspects such as image processing [39, 16], pattern analysis [49, 11, 31] and domain adaption [9, 30, 47].
The OT-based method for comparing two probability densities and generative models are vital in machine learning research where data are often presented in the form of point clouds, histograms, bags-of-features, or more generally, even manifold-valued data set. In recent years, there has been an increase in the applications of the OT-based methods in machine learning. The authors of approached OT-based generative modeling, triggering fruitful research under the variational Bayesian concepts, such as Wassertein GAN [4, 22], Wasserstein Auto-encoders [45, 48], and Wasserstein variational inference  and their computationally efficient sliced version 
. Another reason that OT gains its popularity is convexity. As the classic Kantorovich OT problem is a constrained linear programming problem or a convex minimization problem where the minimal value of the transport cost objective function is usually defined as the divergence/distance between two distributions of loads, or the cost associated with the transportation between the source subjects and targets. Therefore, the convex optimization plays an essential role in finding the solutions of OT. The computation of the OT distance can be approached in principle by interior-point methods, and one of the best is from .
Although the methods for finding the solutions of OT have been widely investigated in the literature, one of the major problems is that these algorithms are excessively slow in handling large scale OT problems. Another issue with the classic Kantorovich OT formulation is that its solution plan merely relies on a few routes as a result of the sparsity of optimal couplings, and therefore fails to reflect the practical traffic conditions. These issues limit the wider applicability of OT-based distances for large-scale data within the field of machine learning until a regularized transportation plan was introduced by Cuturi  in 2013. By applying this new method (regularized OT), we are not only able to reduce the sparsity in the transportation plan, but also speed up the Sinkhorn algorithm with a linear convergence.
By offering a unique solution, better computational stability compared with the previous algorithms and being underpinned by the Sinkhorn algorithm, the entropy regularization method has successfully delivered OT approaches into modern machine learning aspects33]
, Wasserstein loss function, computer graphics  and discriminant analysis . Other algorithms that aim for high calculation speed in the area of big data have also been explored, such as the stochastic gradient-based algorithms  and fast methods to compute Wasserstein barycenters . Altschuler et al.  proposed the Greenkhorn algorithm, a greedy variant of the Sinkhorn algorithm that updates the rows and columns which violate most of the constraints.
In order to meet the requirements of various practical situations, many works have been done to define suitable regularizations. For newly introduced regularizations, Dessein et al.  extended the regularization in terms of convex functions. To apply OT to power functions, the Tsallis Regularized Optimal Transport (trot) distance problem was introduced in . Furthermore, in order to involve OT into series data, the order-preserving Wassertein distance with its regularizor was developed in. In addition, to maintain the locality in OT-assisted domain adaption, the Laplacian regularization was also proposed in . While entropy-based regularizations have achieved great success in terms of calculation efficiency, those problems without such regularization are still challenging. For example, to solve a Laplacian regularized OT problem, Courty et al. proposed a generalized conditional gradient algorithm, which is a variant of the classic conditional gradient algorithm. In this paper, we shall compare the experimental results of several entropy and non-entropy regularized OT problems based on previous studies and the new manifold optimization algorithm proposed in Section IV.
Non-entropy regularized OT problems arise the question about the development of a uniform and generalized method that is capable of efficiently and accurately calculating all sort of regularized OT problems. To answer this question, we first consider that all OT problems are constrained optimization problems on the transport plane space, namely the set of polytope. Such constrained problems can be regarded as the unconstrained problem on a specific manifold with certain constraints. The well-defined Riemannian optimization can provide better performance than the original constrained problem with the advantage of treating lower dimensional manifold as a new search space. Consequentially, those fundamental numerical iterative algorithms, such as the Riemannian gradient descent (RGD) and Riemannian trust region (RTR), can naturally solve the OT problems, achieving convergence under mild conditions.
The main purpose and contribution of this paper are to propose a manifold based framework for optimizing the transportation polytope for which the related Riemannian geometry will be explored. The “Coupling Matrix Manifold” provides an innovative method for solving OT problems under the framework of manifold optimization. The research on the coupling matrix manifold has rooted in our earlier paper 
in which the so-called multinomial manifold was explored in the context of tensor clustering. The optimization on multinomial manifolds has successfully been applied to several density learning tasks[23, 24, 25]. More recently, Douik and Hassibi  explored the manifold geometrical structure and the related convex optimization algorithms on three types of manifolds constructed by three types of matrices, namely the doubly stochastic matrices, symmetric stochastic matrices and positive stochastic matrices. The CMM introduced in this paper can be regarded as the generalization of their doubly positive stochastic manifolds. According to the mathematical and experimental results, the CMM framework unifies all types of OT solutions, providing closed form solutions compared with previous literature with higher efficiency, thus opening the door of solving OT problems under the manifold optimization framework.
The remainder of the paper is organized as follows. Section II introduces CMM and its Riemannian geometry,including the tangent space, Riemannian gradient, Riemannian Hessian, and Retraction operator, all the ingredients for the Riemannian optimization algorithms. In Section III, we review several OT problems with different regularizations from other studies. These regularization problems will be then converted into the optimization problem on CMM so that the Riemannian version of optimization algorithms (RGD and RTR) can be applied. In Section IV, we will conduct several numerical experiments to demonstrate the performance of the new Riemannian algorithms and compare the results with classic algorithms (i.e. Sinkhorn algorithm). Finally Section V concludes the paper with several recommendations for future research and applications.
Ii Coupling Matrix Manifolds–CMM
Throughout this paper, we use a bold lower case letter for a vector, a bold upper case letter for a matrix , and a calligraphy letter for a manifold . The embedded matrix manifold is a smooth subset of vector space embedded in the matrix space . For any , is the tangent space of the manifold at . and are the -dimensional vectors of zeros and ones, respectively, and is the set of all matrices with real and positive elements.
Ii-a The Definition of a Manifold
Two vectors and are coupled if . A matrix is called a coupling matrix for the coupled vectors and if and . The set of all the coupling matrices for the given coupled and is denoted by
The coupling condition
is vital in this paper as this condition ensures a non-empty transportation polytope so that the manifold optimization process can be naturally employed. This condition is checked in Lemma 2.2 of , and the proof of this lemma is based on the north-west corner rule algorithm described in .
The defined space is a subset of the classic transport plan space (or polytope)
where each entry of a plan is nonnegative. In practice, this constraint on does prevent the solution plan from being sparsity.
The subset forms a smooth manifold of dimension in its embedding space , named as the Coupling Matrix Manifold.
Define a mapping by
Clearly is a linear mapping from to with
Hence the null space of is
As there are only linearly independent constraints among and , the rank of the null space is . Hence the dimension of the range will be . According to the sub-immersion theorem (Proposition 3.3.4 in ), the dimension of the manifold is . ∎
Several special cases of the coupling matrix manifolds that have been explored recently are as follows:
When both and are discrete distributions, i.e., which are naturally coupled. In this case, we call the double probabilistic manifold, denoted by
The doubly stochastic multinomial manifold : This manifold is the special case of with and , e.g.
can be regarded as the two-dimensional extension of the multinomial manifold introduced in , defined as
Ii-B The Tangent Space and Its Metric
From now on, we only consider the coupling matrix manifold where and are a pair of coupled vectors. For any coupling matrix , the tangent space is given by the following proposition.
The tangent space can be calculated as
and its dimension is .
It is easy to prove Proposition 2 by differentiating the constraint conditions. We omit this.
Also it is clear that and consist of equations where only conditions are in general independent because . Hence the dimension of the tangent space is . ∎
where the operator means the element-wise division of two matrices in the same size.
Equivalently we may use the normalized Riemannian metric as follows
As one of building blocks for the optimization algorithms on manifolds, we consider how a matrix of size can be orthogonally projected onto the tangent space under its Riemannian metric .
The orthogonal projection from to takes the following form
where the symbol denotes the Hadamard product, and and are given by
where denotes the pseudo-inverse of , and .
We only present a simple sketch of the proof here. First, it is easy to verify that for any vectors and , is orthogonal to the tangent space . This is because for any , we have the following inner product induced by ,
Ii-C Riemannian Gradient and Retraction
The classical gradient descent method can be extended to the case of optimization on manifold with the aid of the so-called Riemannian gradient. As the coupling matrix manifold is embedded in the Enclidean space, the Riemannian gradient can be calculated via projecting the Euclidean gradient onto its tangent space. Given the Riemannian metric which is defined in (4), we can immediately formulate the following lemma, see [44, 14],
Suppose that is a real-valued smooth function defined on with its Euclidean gradient , then the Riemannian gradient can be calculated as
As , the directional derivative of along any tangent vector , according to the definition of Riemannian gradient, for the metric in (4) we have:
where the right equality comes from the definition of Euclidean gradient with the classic Euclidean metric . Clearly we have
where can be simply calculated according to the formula in (4), although is not in the tangent space . Considering its orthogonal decomposition according to the tangent space, we shall have
This completes the proof. ∎
As an important part of the manifold gradient descent process, the retraction function retracts a tangent vector back to the manifold. For Euclidean submanifolds, the simplest way to define a retraction is
In our case, to ensure , should be in the smaller neighbourhood of particularly when has smaller entries. This will result an inefficient descent optimization process. To provide a new retraction with high efficiency, following [44, 14], we define as the projection from the set of element-wise positive matrices onto the manifold under the Euclidean metric. Then we have the following lemma.
For any matrix , there exist two diagonal matrices and such that
where both and can be determined by the extended Sinkhorn-Knopp algorithm .
Based on the projection , we define the following retraction mapping for
Let be the projection defined in Lemma 5, the mapping given by
is a valid retraction on . Here is the element-wise exponential function and is any tangent vector at .
We need to prove that (i) and (ii) satisfies .
For (i), it is obvious that as for any .
As all , and are element-wise operations, the first order approximation of the exponential function gives
where . The next step is to show that when is very small. For this purpose, consider a smaller tangent vector such that . There exist two smaller diagonal matrices and that satisfy
where are identity matrices. By ignoring higher order small quantity, we have
As both and are on the coupling matrix manifold and is a tangent vector, we have
where and . Hence,
Hence is in the null space of the above matrix which contains . In general, there exists a constant such that and and this gives
Combining all results obtained above, we have as is sufficiently smaller. Hence, this completes the proof. ∎
Ii-D The Riemannina Hessian
Let and be the Euclidean gradient and Euclidean Hessian, respectively. The Riemennian Hessian can be expressed as
It is well known  that the Riemannian Hessian can be calculated from the Riemannian connection and Riemannian gradient via
Furthermore the connection on the submanifold can be given by the projection of the Levi-Civita connection , i.e., . For the Euclidean space endowed with the Fisher information, with the same approach used in , it can be shown that the Levi-Civita connection is given by
According to Lemma 4, the directional derivative can be expressed as
Taking in the expressions for and directly computing directional derivatives give all formulae in the theorem.
Iii Riemannian Optimization Applied to OT Problems
In this section, we illustrate the Riemannian optimization in solving various OT problems, starting by reviewing the framework of the optimization on Riemannian manifolds.
Iii-a Optimization on Manifolds
Early attempts to adapt standard manifold optimization methods were presented by  in which steepest descent, Newton and qusasi-Newtwon methods were introduced. The second-order geometry related optimization algorithm such as the Riemannian trust region algorithm was proposed in , where the algorithm was applied on some specific manifolds such as the Stiefel and Grassman manifolds.
This paper focuses only on the gradient descent method which is the most widely used optimization method in machine learning. Suppose that is a -dimensional Riemannian manifold. Let be a real-valued function defined on . Then, the optimization problem on has the form
For any and , there always exists a geodesic starting at with initial velocity , denoted by . With this geodesic the so-called exponential mapping is defined as
Thus the simplest Riemannian gradient descent (RGD) consists of the following two main steps:
Compute the Riemannian gradient of at the current position , i.e. ;
Move in the direction according to with a step-size .
Step 1) is straightforward as the Riemannian gradient can be calculated from the Euclidean gradient according to (8) in Lemma 4. However, it is generally difficult to compute the exponential map effectively as the computational processes require some second-order Riemannian geometrical elements to construct the geodesic, which sometimes is not unique on a manifold point. Therefore, instead of using the exponential map in RGD, an approximated method, namely the retraction map is commonly adopted. For coupling matrix manifold , a retraction mapping has been calculated in Lemma 6. Hence Step 2) in the RGD is defined by
Hence for any given OT-based optimization problem
conducting the RGD algorithm comes down to the computation of Euclidean gradient . Similarly, formulating the second-order Riemannian optimization algorithms based on Riemannian Hessian, such as Riemannian Newton method and Riemannian trust region method, boil down to calculating enable the calcuationn of the Euclidean Hessian. See Theorem 7.
Iii-B Computational Complexity of Coupling Matrix Manifold Optimization
In this section we give a simple complexity analysis on optimizing a function defined on the coupling matrix manifold by taking the RGD algorithm as an example. Suppose that we minimize a given objective function defined on . For the sake of simplicity, we consider the case of .
In each step of RGD, we first calculate the Euclidean gradient with the number of flops . In most cases shown in the next subsection, we have Before applying the GD step, we shall calculate the Riemannian gradient by the projection according to Lemma 6 which is implemented by the Sinkhorn-Knopp algorithm in Algorithm 1. The complexity of Sinkhorn-Knopp algorithm to have an -approximate solution .
If RGD is coducted iterations, the overall computational complexity will be
This means the new algorithm will be slower than the Sinkhorn-Knopp algorithm, however that is price we have to pay for the OT problems where Sinkhorn-Knopp algorithm is not feasible.
Iii-C Application Examples
As mentioned before, basic Riemannain optimization algorithms are constructed on the Euclidean gradient and Hessian of the objective function. In the first part of our application example, some classic OT problems are presented to illustrate the calculation process for their Riemannian gradient and Hessian.
Iii-C1 The Classic OT Problem
The objective function of the classic OT problem  is
where is the given cost matrix and gives the overall cost under the transport plan . The solution to this optimization problem is called the transport plan which induces the lowest overall cost . When the cost is measured by the distance between the source object and the target object, the best transport plan assists in defining the so-called Wasserstein distance between the source distribution and the target distribution.
Given that problem (12) is indeed a linear programming problem, it is straightforward to solve the problem by the linear programming algorithms. In this paper, we solve the OT problem under the Riemannian optimization framework. Thus, for the classic OT, obviously the Euclidean gradient and Hessian can be easily computed as:
Iii-C2 The Entropy Regularized OT Problem
It is obvious that this classic OT problem can be generalized to the manifold optimization process within our defined coupling matrix manifold where is not necessarily equal to 1, and the number of rows and the number of columns can be unequal. To improve the efficiency of the algorithm, we add an entropy regularization term. Hence, the OT problem becomes
where is the discrete entropy of the coupling matrix and is defined by:
In terms of matrix operation, has the form
where applies to each element of the matrix. The minimization is a strictly convex optimization process, and for the solution is unique and has the form:
where is computed entry-wisely , and and are obtained by the Sinkhorn-Knopp algorithm.
Now, for objective function
one can easily check that the Euclidean gradient is
where is a matrix of all 1s in size , and the Euclidean Hessian is, in terms of mapping differential, given by
Iii-C3 The Power Regularization for OT Problem
Dessein et al.  further extended the regularization to
where is an appropriate convex function. As an example, we consider the squared regularization proposed by 
and we apply a zero truncated operator in the manifold algorithm. It is then straightforward to prove that
The Tsallis Regularized Optimal Transport is used in  to define trot distance which comes with the following regularization problem
For the sake of convenience, we denote for any given constant . Then we have
Iii-C4 The Order-Preserving OT Problem
The order-preserving OT problem is proposed in  and is adopted by  for learning distance between sequences. This learning process takes the local order of temporal sequences and the learned transport defines a flexible alignment between two sequences. Thus, the optimal transport plan only assigns large loads to the most similar instance pairs of the two sequences.
For sequences and in the respective given orders, the distance matrix between them is
Define an matrix (distance between orders)
and the (exponential) similarity matrix
where is the scaling factor and
The (squared) distance between sequences and is given by
where the optimal transport plan is the solution to the following order-preserving regularized OT problem
where the KL-divergence is defined as
and specially and
are uniform distributions. Hence
Iii-C5 The OT Domain Adaption Problem
OT has also been widely used for solving the domain adaption problems. In this subsection, the authors of  formalized two class-based regularized OT problems, namely the group-induced OT (OT-GL) and the Laplacian regularized OT (OT-Laplace). As the OT-Laplace is found to be the best performer for domain adaption, we only apply our coupling matrix manifold optimization to it and thus we summarize its objective function here.
As pointed out in , this regularization aims at preserving the data graph structure during transport. Consider to be the source data points and the target data points, both are defined in . Obviously, and . The purpose of domain adaption is to transport the source towards the target so that the transported source and the target can be jointly used for other learning tasks.
Now suppose that for the source data we have extra label information . With this label information we sparsify similarities among the source data such that if for . That is, we define a similarity between two source data points if they do not belong to the same class or do not have the same labels. Then the following regularization is proposed
With a given transport plan , we can use the barycentric mapping in the target as the transported point for each source point . When we use the uniform marginals for both source and target and the cost, the transported source is expressed as
It is easy to verify that
where is the Laplacian of the graph