1 Introduction
Comparing, summarizing, and combining probability measures defined on a space is a fundamental task in statistics and machine learning. Given the support points of probability measures in a metric space and a transportation cost function (e.g., the Euclidean distance), the Wasserstein distance defines a distance between two measures as the minimal transportation cost between them. This notion of distance leads to a host of important applications, including text classification
[28], clustering [23, 24, 14, 21, 44], statistics [36, 37, 46, 19], and others [5, 39, 45]. Given a set of measures in the same space, the 2-Wasserstein barycenter is defined as the measure minimizing the sum of squared 2-Wasserstein distances to all measures in the set. For example, if a set of images (with common structure but varying noise) are modeled as probability measures, then the Wasserstein barycenter is a mixture of the images that shares this common structure. The Wasserstein barycenter captures the underlying geometric structure better than the barycenter defined by the Euclidean or other distances. As a result, the Wasserstein barycenter has applications in clustering [23, 24, 14, 13] and others [30, 41]. From the computational point of view, finding the barycenter of a set of discrete measures can be formulated as a linear program
[4]. Nonetheless, state-of-the-art linear programming solvers do not scale with the immense amount of data involved in barycenter calculations. Current research on the computation mainly follows two types of methods. The first type attempts to solve the linear program (or some equivalent problem) with scalable first-order methods. J. Ye et al. [53] use modified Bregman ADMM (B-ADMM), introduced by [50], to compute Wasserstein barycenters for clustering problems. L. Yang et al. [52] adopt symmetric Gauss-Seidel ADMM to solve the dual linear program, which reduces the computational cost of each iteration. S. Claici et al. [11] introduce a stochastic alternating algorithm that can handle continuous input measures. However, these methods are still computationally inefficient when the number of support points of the input measures and the number of input measures are large. Due to the nature of first-order methods, these algorithms often converge too slowly to reach high-accuracy solutions. The second, more mainstream, approach introduces an entropy regularization term into the linear programming formulation [13, 7]. M. Staib et al. [47] discuss the parallel computation issue and introduce a sampling method. P. Dvurechenskii et al. [16] study decentralized and distributed computation for the regularized problem. These methods are indeed suitable for large-scale problems due to their low computational cost and parsimonious memory usage. However, this advantage comes at the expense of solution accuracy: especially when the regularization term is weighted less in order to approximate the original problem more accurately, computational efficiency degenerates and the outputs become unstable [7]. See [40] for a detailed survey of related algorithms.
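To make the linear-programming viewpoint above concrete, the squared 2-Wasserstein distance between two small discrete measures can be computed with an off-the-shelf LP solver. The sketch below is our own illustration (the names `wasserstein2`, `x`, `y`, `a`, `b` are ours, not from the paper) using `scipy.optimize.linprog`; it is not intended to scale, which is exactly the limitation the paper addresses.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2(x, y, a, b):
    """Squared 2-Wasserstein distance between the discrete measures
    sum_i a_i * delta_{x_i} and sum_j b_j * delta_{y_j},
    solved as a transportation linear program."""
    n, m = len(a), len(b)
    # Cost matrix: squared Euclidean distances between support points.
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    # Marginal constraints: rows of the plan sum to a, columns sum to b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0      # row marginal of the plan
    for j in range(m):
        A_eq[n + j, j::m] = 1.0               # column marginal of the plan
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Two unit point masses at 0 and 1: all mass moves distance 1.
print(wasserstein2(np.array([[0.0]]), np.array([[1.0]]),
                   np.array([1.0]), np.array([1.0])))  # 1.0
```

Note that the equality constraints above are redundant (both marginal groups sum to the total mass), which modern solvers tolerate via presolve; Section 3 of the paper removes this redundancy explicitly.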
In this paper, we develop a new interior-point method (IPM), namely the Matrix-based Adaptive Alternating Interior-point Method (MAAIPM), to efficiently calculate Wasserstein barycenters. If the support is pre-specified, we apply one step of the Mizuno-Todd-Ye predictor-corrector IPM. The algorithm gains a quadratic convergence rate, shown by Y. Ye et al. [55], which is a distinct advantage of IPMs over first-order methods. In practice, we implement Mehrotra's predictor-corrector IPM [33], and add clever heuristics for choosing step lengths and centering parameters. If the support is also to be optimized, MAAIPM alternately updates the support and the linear program variables in an adaptive strategy. At the beginning, MAAIPM updates the support points by an unconstrained quadratic program after a small number of IPM iterations. At the end, MAAIPM updates the support after every IPM iteration and applies "jump" tricks to escape local minima. Under the framework of MAAIPM, we present two block matrix-based accelerated algorithms to quickly solve the Newton equations at each iteration. Despite a prevailing belief that IPMs are inefficient for large-scale cases, we show that this inefficiency can be overcome through careful manipulation of the block data structure of the normal equations. As a result, our stylized IPM has the following advantages.
Low theoretical complexity. The linear programming formulation of the Wasserstein barycenter has variables and constraints, where the integers , and will be specified later. Although MAAIPM is still a second-order method, in our two block matrix-based accelerated algorithms, every iteration of solving the Newton direction has a time complexity of merely or , whereas a standard IPM would need . For simplicity, let ; then the time complexity of our algorithm in each iteration is , instead of the standard IPM's complexity .
Practical effectiveness in speed and accuracy. Compared to regularized methods, IPMs attain high-accuracy solutions and a fast convergence rate by nature. Numerical experiments show that our algorithm converges to highly accurate solutions of the original linear program with the least computation time and the least number of iterations. Figure 1 shows the accuracy advantage of our methods in comparison to the well-developed Sinkhorn-type algorithms [13, 7].
Our approaches have further advantages in practical implementation. For example, when the support points of the distributions are different, the memory usage of our method is within a constant multiple of that of the most memory-efficient method, IBP, which is much less than the memory used by the commercial solver Gurobi. Our algorithm also inherits a natural structure that potentially fits parallel computing schemes well. These merits ensure that our algorithm is highly suitable for large-scale computation of Wasserstein barycenters.
The rest of the paper is organized as follows. In section 2, we briefly define the Wasserstein barycenter. In section 3, we present its linear programming formulation and introduce the IPM framework. In section 4, we present an IPM implementation that greatly reduces the computational cost of classical IPMs. In section 5, we present our numerical results.
2 Background and Preliminaries
In this section, we briefly recall the Wasserstein distance and the Wasserstein barycenter for a set of discrete probability measures [1, 15]. Let be the probability simplex in
. For two vectors
, define the set of matrices . Let denote the discrete probability measure supported on points in with weights respectively. The 2-Wasserstein distance between the two measures and is (1) 
where and
. Consider a set of probability distributions
where , and let . The Wasserstein barycenter (with support points) is another probability measure which is defined as a solution of the problem (2) 
Furthermore, define the simplex . For a given set of support points , define the distance matrices for . Then problem (2) is equivalent to
(3) 
Problem (3) is a nonconvex problem, where one needs to find the optimal support points X and the optimal weight vector of a barycenter simultaneously. However, in many real applications, the support X of a barycenter can be specified empirically from the support points of . Indeed, in some cases, all distributions in have the same set of support points and hence the barycenter should also take the same set of support points. In view of this, we will also focus on the case when the support X is given. Consequently, problem (3) reduces to the following problem:
(4) 
where denotes for simplicity. In the following sections, we refer to problem (4) as the Pre-specified Support Problem, and to problem (3) as the Free Support Problem.
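For a small instance, the Pre-specified Support Problem can be assembled and solved directly as one linear program over the N transport plans and the barycenter weights. The following is an illustrative sketch (all names are ours; `support` is a common 1-D support for all measures, with squared-distance costs) using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def barycenter_fixed_support(support, measures):
    """Fixed-support 2-Wasserstein barycenter weights via one LP.
    support: (m,) common support points; measures: list of (m,) weight vectors.
    Variables: N flattened m-by-m transport plans, then the weights w."""
    m, N = len(support), len(measures)
    D = (support[:, None] - support[None, :]) ** 2   # squared-distance costs
    nvar = N * m * m + m
    c = np.concatenate([np.tile(D.ravel(), N), np.zeros(m)])
    rows, cols, vals, b_eq = [], [], [], []
    r = 0
    for k in range(N):
        off = k * m * m
        for i in range(m):                # row sums of plan k equal measure k
            for j in range(m):
                rows.append(r); cols.append(off + i * m + j); vals.append(1.0)
            b_eq.append(measures[k][i]); r += 1
        for j in range(m):                # column sums of plan k equal w
            for i in range(m):
                rows.append(r); cols.append(off + i * m + j); vals.append(1.0)
            rows.append(r); cols.append(N * m * m + j); vals.append(-1.0)
            b_eq.append(0.0); r += 1
    for j in range(m):                    # w lies on the simplex
        rows.append(r); cols.append(N * m * m + j); vals.append(1.0)
    b_eq.append(1.0); r += 1
    A = np.zeros((r, nvar)); A[rows, cols] = vals
    res = linprog(c, A_eq=A, b_eq=b_eq, bounds=(0, None))
    return res.x[-m:], res.fun

# Barycenter of point masses at 0 and 2 on support {0, 1, 2}: all mass at the
# midpoint 1, total cost 1 + 1 = 2.
w, cost = barycenter_fixed_support(np.array([0.0, 1.0, 2.0]),
                                   [np.array([1.0, 0.0, 0.0]),
                                    np.array([0.0, 0.0, 1.0])])
print(np.round(w, 6), cost)
```

The dense constraint matrix here is only for exposition; it already exhibits the block structure that Section 3.1 vectorizes and that Section 4 exploits.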
3 General Framework for MAAIPM
3.1 Linear programming formulation and preconditioning.
Note that the Pre-specified Support Problem is a linear program. In this subsection, we focus on removing its redundant constraints. First, we vectorize the constraints and captured in to become
Thus, problem (4) can be formulated as the standard-form linear program:
(5) 
with , and , where is a block diagonal matrix: ; is a block diagonal matrix: ; and . Let , and . Then and . For efficient implementations of IPMs for this linear program, we need to remove redundant constraints.
Lemma 3.1
Let be obtained from by removing the th, th, , th rows of , and be obtained from by removing the th, th, , th entries of . Then 1) has full row rank; 2) satisfies if and only if satisfies .
Proof. We follow two steps: In step 1, we show that through a series of row transformations, we can transform matrix A into a matrix whose elements in th, th, , th rows are zeros, and elements in other positions are the same as A. In step 2, we prove that the matrix has full row rank.

Step 1
From the definition of matrix A, we have
(6) where , for . Let
and
Then
where , . It is easy to verify that elements in the first rows of and are zeros.

Step 2
As defined in the statement of Lemma 3.1, is obtained by removing the th, th, , th rows of A. That is,
where , and . Let . For , let
then
where , , , and . Let
then
where , , . Let be the matrix composing of the first columns of . That is,
Matrix satisfies two properties: (1) each row of has one and only one nonzero element (being 1), with the other elements being 0; (2) each column of has at most one nonzero element. Therefore, there exist permutation matrices and such that . Thus and .
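The rank deficiency that Lemma 3.1 removes can be observed numerically on a tiny instance. The construction below is our own illustrative encoding of the equality-constraint matrix of a fixed-support barycenter LP (all names are ours): the full matrix has rank deficiency exactly N, and deleting one row-sum constraint per input measure restores full row rank, consistent with the lemma.

```python
import numpy as np

def constraint_matrix(N, m):
    """Equality constraints of a fixed-support barycenter LP: for each of
    the N transport plans, m row-sum and m column-sum constraints (the
    latter coupled to the barycenter weights w), plus one simplex
    constraint on w."""
    nvar = N * m * m + m
    rows = []
    for k in range(N):
        off = k * m * m
        for i in range(m):                       # row sums of plan k
            e = np.zeros(nvar); e[off + i * m: off + (i + 1) * m] = 1.0
            rows.append(e)
        for j in range(m):                       # column sums of plan k, minus w_j
            e = np.zeros(nvar); e[off + j: off + m * m: m] = 1.0
            e[N * m * m + j] = -1.0
            rows.append(e)
    e = np.zeros(nvar); e[N * m * m:] = 1.0      # entries of w sum to 1
    rows.append(e)
    return np.vstack(rows)

N, m = 3, 4
A = constraint_matrix(N, m)
print(A.shape[0], np.linalg.matrix_rank(A))      # 25 rows, rank 22: deficiency N = 3
# Delete the first row-sum constraint of each plan: full row rank remains.
A_tilde = np.delete(A, [k * 2 * m for k in range(N)], axis=0)
print(A_tilde.shape[0], np.linalg.matrix_rank(A_tilde))  # 22 rows, rank 22
```

The dependency is easy to see: for each plan k, the sum of its row-sum constraints minus the sum of its column-sum constraints equals the simplex constraint, giving exactly N redundant rows.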
With this lemma, the primal and dual problems of problem (5) can be written as
(7) 
3.2 Framework of the Matrix-based Adaptive Alternating Interior-point Method (MAAIPM).
When the support points are not pre-specified, we need to solve the free support problem (3). As we just saw, when is fixed, the problem becomes a linear program. When are fixed, the problem is a quadratic optimization problem with respect to , and the optimal can be written in closed form as
(8) 
In other words, (3) can be reformulated as
(9) 
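For intuition, the closed-form update (8) is a mass-weighted average: with the transport plans fixed, each barycenter support point moves to the weighted mean of the input support points that send it mass. A minimal sketch of this update (function and variable names are ours, assuming plan k has shape (m_k, m) and its support has shape (m_k, d)):

```python
import numpy as np

def update_support(plans, supports):
    """Closed-form support update: each barycenter point becomes the
    mass-weighted mean of the input points it receives mass from.
    plans[k]: (m_k, m) transport plan; supports[k]: (m_k, d) points."""
    num = sum(P.T @ Y for P, Y in zip(plans, supports))   # (m, d) weighted sums
    den = sum(P.sum(axis=0) for P in plans)               # (m,) received mass
    return num / den[:, None]

# One barycenter point receiving equal mass from input points at 0 and 2:
plans = [np.array([[0.5], [0.5]])]
supports = [np.array([[0.0], [2.0]])]
print(update_support(plans, supports))   # [[1.0]], the weighted midpoint
```

Because each plan's column sums equal the barycenter weights, the denominator is just N times those weights; the update is an unconstrained quadratic minimization, which is why it is so cheap relative to the linear program.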
As stated above, (3) is a nonconvex problem, so it may contain saddle points and local minima. This makes finding a global optimizer difficult.
Example 3.1 (an example of a local minimum)
Set . Let be any positive integer and , , , and . Then , and and is a local minimum. But it is not a global minimum because a lower objective value occurs when , , and .
Example 3.2 (an example of a saddle point)
The alternating minimization strategy used in [15, 52, 53] alternates between optimizing by solving (8) and optimizing by solving (4). However, this alternating approach cannot avoid local minima or saddle points, and every iteration may require solving a linear program (4), which is expensive when the problem size is large.
To overcome these drawbacks, we propose the Matrix-based Adaptive Alternating IPM (MAAIPM), Algorithm 1. If the support is pre-specified, we solve a single linear program by the predictor-corrector IPM. If the support needs to be optimized, MAAIPM uses an adaptive strategy. At the beginning, because the primal variables are far from the optimal solution , MAAIPM updates of (8) after a small number of IPM iterations for . Then, MAAIPM updates after every IPM iteration and applies "jump" tricks to escape local minima. Since at the beginning MAAIPM updates after many IPM iterations, the primal-dual predictor-corrector IPM is more efficient there. At the end, is updated more often, and each update of changes the linear programming objective function, so the dual variables may become infeasible. However, the primal variables always remain feasible, so the primal IPM is more suitable at the end. Moreover, the primal IPM is better for applying "jump" tricks or other local-minima-escaping techniques, as shown in [54]. Figure 2 visualizes the primal variables and objective gradients in each iteration of MAAIPM.
In the predictor-corrector IPM, the main computational cost lies in solving the Newton equations, which can be reformulated as the normal equations
(10) 
where denotes and is in . This linear system in the matrix can be efficiently solved by the two methods proposed in the next section. In the primal IPM, MAAIPM combines following the central path with optimizing the support points, i.e., each iteration contains three parts: taking a Newton step on the logarithmic barrier function
(11) 
reducing the penalty , and updating the support (8). The Newton direction at the iteration is calculated by
(12) 
where . The main cost of the primal IPM lies in solving a linear system in , which again can be efficiently solved by the two methods described in the following section. Furthermore, we can also apply a warm-start technique to smartly choose the starting point of the next IPM after a "jump" [43]. Compared with primal-dual IPMs' warm-start strategies [27, 26], this technique saves searching time and only requires slightly more memory. When the termination criterion is suitably set, numerical studies show that MAAIPM outperforms previous algorithms in both speed and accuracy, no matter whether the support is pre-specified or not.
4 Efficient Methods for Solving the Normal Equations
In this section, we discuss efficient methods for solving normal equations of the form , where is a diagonal matrix with all diagonal entries positive. Let , and . First, through simple calculation, we have the following lemma on the structure of the matrix .
Lemma 4.1
can be written in the following format:
where is a diagonal matrix with positive diagonal entries; is a block-diagonal matrix with N blocks (the size of the ith block is ); is a diagonal matrix with positive diagonal entries; let , then , and ; .
Proof. Let be the diagonal vector of the matrix ; and . As in the preceding section, the structure of is:
where , and . Let
Then
Now we analyze the structure of each submatrix and rename them for conciseness. Let
where , and . Then
Each is a diagonal matrix with positive diagonal entries.
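The computational payoff of this block structure is that the large block-diagonal part can be eliminated first. The following is a generic sketch of that idea (our own illustration, not the authors' implementation; all names are ours): solving a symmetric system [[B, C], [Cᵀ, D]] by forming the small Schur complement of the block-diagonal B.

```python
import numpy as np

def solve_via_schur(B_blocks, C, D, r, s):
    """Solve [[B, C], [C^T, D]] [u; v] = [r; s], where B = diag(B_blocks)
    is block-diagonal, by eliminating B blockwise and then solving the
    small Schur system (D - C^T B^{-1} C) v = s - C^T B^{-1} r."""
    def apply_Binv(X):
        out = np.zeros_like(X)
        i = 0
        for Bk in B_blocks:                  # invert each small block separately
            nk = Bk.shape[0]
            out[i:i + nk] = np.linalg.solve(Bk, X[i:i + nk])
            i += nk
        return out
    S = D - C.T @ apply_Binv(C)              # small dense Schur complement
    v = np.linalg.solve(S, s - C.T @ apply_Binv(r))
    u = apply_Binv(r - C @ v)                # back-substitute for the large block
    return u, v

# Check against a dense solve on a small instance.
B_blocks = [2.0 * np.eye(3), 3.0 * np.eye(2)]
C = 0.1 * np.ones((5, 2))
D = 4.0 * np.eye(2)
B = np.zeros((5, 5)); B[:3, :3] = B_blocks[0]; B[3:, 3:] = B_blocks[1]
M = np.block([[B, C], [C.T, D]])
rhs = np.arange(7.0)
u, v = solve_via_schur(B_blocks, C, D, rhs[:5], rhs[5:])
print(np.allclose(np.concatenate([u, v]), np.linalg.solve(M, rhs)))  # True
```

The cost is dominated by the blockwise solves and the small Schur system, rather than by factoring the full matrix, which is the source of the per-iteration savings claimed for the two accelerated algorithms.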