# Interior-point Methods Strike Back: Solving the Wasserstein Barycenter Problem

Computing the Wasserstein barycenter of a set of probability measures under the optimal transport metric can quickly become prohibitive for traditional second-order algorithms, such as interior-point methods, as the support size of the measures increases. In this paper, we overcome the difficulty by developing a new adapted interior-point method that fully exploits the problem's special matrix structure to reduce the iteration complexity and speed up the Newton procedure. Different from regularization approaches, our method achieves a well-balanced tradeoff between accuracy and speed. A numerical comparison on various distributions with existing algorithms exhibits the computational advantages of our approach. Moreover, we demonstrate the practicality of our algorithm on image benchmark problems including MNIST and Fashion-MNIST.


## 1 Introduction

To compare, summarize, and combine probability measures defined on a space is a fundamental task in statistics and machine learning. Given support points of probability measures in a metric space and a transportation cost function (e.g., the Euclidean distance), the Wasserstein distance defines a distance between two measures as the minimal transportation cost between them. This notion of distance leads to a host of important applications, including text classification [28], clustering [23, 24, 14, 21, 44], statistics [36, 37, 46, 19], and others [5, 39, 45]. Given a set of measures in the same space, the 2-Wasserstein barycenter is defined as the measure minimizing the sum of squared 2-Wasserstein distances to all measures in the set. For example, if a set of images (with common structure but varying noise) are modeled as probability measures, then the Wasserstein barycenter is a mixture of the images that shares this common structure. The Wasserstein barycenter captures the underlying geometric structure better than the barycenter defined by the Euclidean or other distances. As a result, the Wasserstein barycenter has applications in clustering [23, 24, 14, 13] and elsewhere [30, 41].

From the computational point of view, finding the barycenter of a set of discrete measures can be formulated as a linear program [4]. Nonetheless, state-of-the-art linear programming solvers do not scale to the immense amount of data involved in barycenter calculations. Current research on computation mainly follows two lines. The first attempts to solve the linear program (or some equivalent problem) with scalable first-order methods. J. Ye et al. [53] use modified Bregman ADMM (B-ADMM), introduced by [50], to compute Wasserstein barycenters for clustering problems. L. Yang et al. [52] adopt symmetric Gauss-Seidel ADMM to solve the dual linear program, which reduces the computational cost of each iteration. S. Claici et al. [11] introduce a stochastic alternating algorithm that can handle continuous input measures. However, these methods remain computationally inefficient when the number of support points of the input measures and the number of input measures are both large. Due to the nature of first-order methods, these algorithms often converge too slowly to reach high-accuracy solutions.

The second, more mainstream, approach introduces an entropy regularization term into the linear programming formulation [13, 7]. M. Staib et al. [47] discuss parallel computation and introduce a sampling method. P. Dvurechenskii et al. [16] study decentralized and distributed computation for the regularized problem. These methods are indeed suitable for large-scale problems due to their low computational cost and parsimonious memory usage. However, this advantage comes at the expense of solution accuracy: when the regularization term is weighted less in order to approximate the original problem more closely, computational efficiency degrades and the outputs become unstable [7]. See [40] for a detailed survey of related algorithms.
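To make the accuracy-speed tradeoff concrete, here is a minimal sketch of the standard Sinkhorn matrix-scaling iteration for entropy-regularized optimal transport (the two-marginal case, not the barycenter variant of [13, 7]; the function name and loop count are our illustration). The regularization weight `eps` controls how closely the entropic cost approximates the true transport cost: smaller `eps` is more accurate but slower to converge and numerically less stable.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    """Entropy-regularized OT between histograms a and b with cost C.
    Returns the transport cost <P, C> of the scaled plan."""
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # alternating marginal-scaling updates
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]    # entropic transport plan
    return np.sum(P * C)
```

With `eps = 1.0` the returned cost noticeably overestimates the exact LP value; shrinking `eps` closes the gap at the price of more iterations, which is exactly the instability discussed above.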

In this paper, we develop a new interior-point method (IPM), namely the Matrix-based Adaptive Alternating Interior-Point Method (MAAIPM), to efficiently calculate Wasserstein barycenters. If the support is pre-specified, we apply one step of the Mizuno-Todd-Ye predictor-corrector IPM. The algorithm attains the quadratic convergence rate shown by Y. Ye et al. [55], which is a distinct advantage of IPMs over first-order methods. In practice, we implement Mehrotra's predictor-corrector IPM [33], and add clever heuristics for choosing step lengths and centering parameters. If the support is also to be optimized, MAAIPM alternately updates the support and the linear program variables in an adaptive strategy. At the beginning, MAAIPM updates the support points by an unconstrained quadratic program after every few IPM iterations. Toward the end, MAAIPM updates the support after every IPM iteration and applies "jump" tricks to escape local minima.

Under the framework of MAAIPM, we present two block matrix-based accelerated algorithms to quickly solve the Newton equations at each iteration. Despite a prevailing belief that IPMs are inefficient for large-scale cases, we show that such an inefficiency can be overcome through careful manipulation of the block-data structure of the normal equation. As a result, our stylized IPM has the following advantages.

**Low theoretical complexity.** The linear programming formulation of the Wasserstein barycenter has variables and constraints, where the integers , and will be specified later. Although MAAIPM is still a second-order method, in our two block matrix-based accelerated algorithms, every iteration of solving the Newton direction has a time complexity of merely or , whereas a standard IPM would need . For simplicity, let ; then the time complexity of our algorithm in each iteration is , instead of the standard IPM's .

**Practical effectiveness in speed and accuracy.** Compared to regularized methods, IPMs by nature attain high-accuracy solutions and a high convergence rate. Numerical experiments show that our algorithm converges to highly accurate solutions of the original linear program with the least computation time and the fewest iterations. Figure 1 shows the accuracy advantages of our methods in comparison with the well-developed Sinkhorn-type algorithms [13, 7].

Our approaches have further advantages in real implementations. For example, when the support points of the distributions differ, the memory usage of our method is within a constant multiple of that of the most memory-efficient method, IBP, and much less than the memory used by the commercial solver Gurobi. Our algorithms also inherit a natural structure that potentially fits parallel computing schemes well. These merits make our algorithm highly suitable for large-scale computation of Wasserstein barycenters.

The rest of the paper is organized as follows. In Section 2, we briefly define the Wasserstein barycenter. In Section 3, we present its linear programming formulation and introduce the IPM framework. In Section 4, we present an IPM implementation that greatly reduces the computational cost of classical IPMs. In Section 5, we present our numerical results.

## 2 Background and Preliminaries

In this section, we briefly recall the Wasserstein distance and the Wasserstein barycenter for a set of discrete probability measures [1, 15]. Let be the probability simplex in . For two vectors , define the set of matrices . Let denote the discrete probability measure supported on points in with weights respectively. The Wasserstein distance between the two measures U and V is

$$W_2(U,V):=\min\left\{\sqrt{\sum_{i=1}^{m_1}\sum_{j=1}^{m_2}\pi_{ij}\,\lVert q_i-p_j\rVert^2}\;:\;\Pi=[\pi_{ij}]\in M(a,b)\right\} \quad (1)$$

where and . Consider a set of probability distributions where , and let . The Wasserstein barycenter (with support points) is another probability measure, defined as a solution of the problem

$$\min_{P}\ \frac{1}{N}\sum_{t=1}^{N}\bigl(W_2(P,P^{(t)})\bigr)^2. \quad (2)$$
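Each term inside (2) is a transport linear program of the form (1). For small instances, these distances can be evaluated exactly with an off-the-shelf LP solver; the sketch below (the function name and the use of `scipy.optimize.linprog` are our illustration, not the paper's implementation) builds the marginal constraints of (1) explicitly.

```python
import numpy as np
from scipy.optimize import linprog

def w2_distance(a, b, Q, P):
    """Exact 2-Wasserstein distance between discrete measures (Q, a)
    and (P, b) via the transport LP in (1).
    Q: (m1, n) support of the first measure, P: (m2, n) support of
    the second; a, b: nonnegative weight vectors summing to one."""
    m1, m2 = len(a), len(b)
    # squared Euclidean cost C[i, j] = ||q_i - p_j||^2
    C = ((Q[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
    # equality constraints: row sums equal a, column sums equal b
    A_eq = np.zeros((m1 + m2, m1 * m2))
    for i in range(m1):
        A_eq[i, i * m2:(i + 1) * m2] = 1.0
    for j in range(m2):
        A_eq[m1 + j, j::m2] = 1.0
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return np.sqrt(res.fun)
```

This dense construction is only viable for small supports; the scalability issues it runs into are precisely what motivate the structured methods developed later in the paper.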

Furthermore, define the simplex . For a given set of support points , define the distance matrices for . Then problem (2) is equivalent to

$$\min_{w,\,X,\,\Pi^{(t)}}\ \sum_{t=1}^{N}\bigl\langle D^{(t)}(X),\Pi^{(t)}\bigr\rangle \quad \text{s.t.}\quad (w,\Pi^{(1)},\dots,\Pi^{(N)})\in S,\ \ x_1,\dots,x_m\in\mathbb{R}^n. \quad (3)$$

Problem (3) is a nonconvex problem, in which one needs to find the optimal support points X and the optimal weight vector of a barycenter simultaneously. However, in many real applications, the support X of a barycenter can be specified empirically from the support points of the input measures. Indeed, in some cases, all input distributions have the same set of support points, and hence the barycenter should take the same support. In view of this, we also focus on the case where the support X is given. Problem (3) then reduces to the following problem:

$$\min_{w,\,\Pi^{(t)}}\ \sum_{t=1}^{N}\bigl\langle D^{(t)},\Pi^{(t)}\bigr\rangle \quad \text{s.t.}\quad (w,\Pi^{(1)},\dots,\Pi^{(N)})\in S \quad (4)$$

where denotes for simplicity. In the following sections, we refer to problem (4) as the Pre-specified Support Problem and to problem (3) as the Free Support Problem.
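For small instances, the Pre-specified Support Problem (4) can be assembled and solved directly as one dense LP over the variables $(w, \Pi^{(1)}, \dots, \Pi^{(N)})$. The sketch below (our illustration, with the hypothetical helper name `barycenter_weights`) encodes the row-sum, column-sum, and simplex constraints that define the feasible set $S$.

```python
import numpy as np
from scipy.optimize import linprog

def barycenter_weights(support, measures):
    """Solve the fixed-support barycenter LP (4) on shared support.
    support: (m, n) support points; measures: list of (points, weights)
    pairs. Returns the barycenter weight vector w. Dense, small sizes only."""
    m = len(support)
    Ds = [((support[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
          for pts, _ in measures]                 # squared-distance matrices
    sizes = [D.shape[1] for D in Ds]
    n_var = m + sum(m * mt for mt in sizes)       # w first, then vec(Pi^(t))
    c = np.concatenate([np.zeros(m)] + [D.ravel() for D in Ds])
    blocks, offset = [], m
    for (_, a), mt in zip(measures, sizes):
        R = np.zeros((m, n_var))                  # row sums of Pi^(t) = w
        for i in range(m):
            R[i, offset + i * mt: offset + (i + 1) * mt] = 1.0
            R[i, i] = -1.0
        blocks.append((R, np.zeros(m)))
        Cm = np.zeros((mt, n_var))                # column sums = a^(t)
        for j in range(mt):
            Cm[j, offset + j: offset + m * mt: mt] = 1.0
        blocks.append((Cm, a))
        offset += m * mt
    Sm = np.zeros((1, n_var)); Sm[0, :m] = 1.0    # w on the simplex
    blocks.append((Sm, np.array([1.0])))
    A_eq = np.vstack([B for B, _ in blocks])
    b_eq = np.concatenate([v for _, v in blocks])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x[:m]
```

The constraint matrix built here is exactly the (redundant) matrix A analyzed in Section 3; the paper's contribution is to exploit its block structure rather than hand it to a generic solver.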

## 3 General Framework for MAAIPM

### 3.1 Linear programming formulation and preconditioning.

Note that the Pre-specified Support Problem is a linear program. In this subsection, we focus on removing its redundant constraints. First, we vectorize the constraints and captured in to obtain

$$(\mathbf{1}_{m_t}^{\top}\otimes I_m)\,\mathrm{vec}(\Pi^{(t)})=w,\qquad (I_{m_t}\otimes \mathbf{1}_m^{\top})\,\mathrm{vec}(\Pi^{(t)})=a^{(t)},\qquad t=1,\dots,N.$$

Thus, problem (4) can be formulated as the standard-form linear program:

$$\min\ c^{\top}x \quad \text{s.t.}\quad Ax=b,\ x\ge 0 \quad (5)$$

with , and , where is a block diagonal matrix: ; is a block diagonal matrix: ; and . Let , and . Then and . For efficient implementations of IPMs for this linear program, we need to remove redundant constraints.
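The Kronecker identities behind the vectorized constraints can be checked numerically. The snippet below (with arbitrary illustrative sizes m = 3, m_t = 4, and column-stacking vec) verifies that $(\mathbf{1}^{\top}\otimes I_m)$ extracts the row sums of $\Pi^{(t)}$ and $(I_{m_t}\otimes\mathbf{1}^{\top})$ extracts its column sums.

```python
import numpy as np

m, mt = 3, 4
rng = np.random.default_rng(0)
Pi = rng.random((m, mt))
v = Pi.ravel(order="F")                       # column-stacking vec(Pi)

row_op = np.kron(np.ones(mt), np.eye(m))      # 1_mt^T (x) I_m, shape (m, m*mt)
col_op = np.kron(np.eye(mt), np.ones(m))      # I_mt (x) 1_m^T, shape (mt, m*mt)

assert np.allclose(row_op @ v, Pi @ np.ones(mt))   # row sums -> w constraint
assert np.allclose(col_op @ v, Pi.T @ np.ones(m))  # column sums -> a^(t)
```

Stacking these two operators for every t, plus the simplex row, yields exactly the block-structured matrix A displayed in (6) below.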

###### Lemma 3.1

Let be obtained from by removing the -th, -th, …, -th rows of , and be obtained from by removing the -th, -th, …, -th entries of . Then 1) has full row rank;  2) satisfies if and only if satisfies .

Proof. We proceed in two steps. In Step 1, we show that through a series of row transformations we can transform the matrix A into a matrix whose elements in the -th, -th, …, -th rows are zero, while elements in all other positions are the same as in A. In Step 2, we prove that the resulting matrix has full row rank.

• Step 1

From the definition of matrix A, we have

$$A=\begin{bmatrix}
F_1 & & & & \\
 & F_2 & & & \\
 & & \ddots & & \\
 & & & F_N & \\
G_1 & & & & -I_m\\
 & G_2 & & & -I_m\\
 & & \ddots & & \vdots\\
 & & & G_N & -I_m\\
 & & & & \mathbf{1}_m^{\top}
\end{bmatrix} \quad (6)$$

where , for . Let

$$e_1=\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix}_{m\times 1},\qquad
T_i=\begin{bmatrix}1&1&\cdots&1\\0&0&\cdots&0\\\vdots& &&\vdots\\0&0&\cdots&0\end{bmatrix}_{m\times m_i},\qquad
S_i=\begin{bmatrix}1&1&\cdots&1\\ &1& &\\ & &\ddots&\\ & & &1\end{bmatrix}_{m\times m},\qquad i=1,\dots,N$$

and

$$L_1=\begin{bmatrix}
I_{m_1}& & & & & &\\
 &\ddots& & & & &\\
 & &I_{m_N}& & & &\\
-T_1& & &I_m& & &\\
 &\ddots& & &\ddots& &\\
 & &-T_N& & &I_m&\\
 & & & & & &1
\end{bmatrix},\qquad
L_2=\begin{bmatrix}
I_{m_1}& & & & & &\\
 &\ddots& & & & &\\
 & &I_{m_N}& & & &\\
 & & &S_1& & &e_1\\
 & & & &\ddots& &\vdots\\
 & & & & &S_N&e_1\\
 & & & & & &1
\end{bmatrix}$$

Then

$$L_2L_1A=\begin{bmatrix}
F_1& & & &\\
 &F_2& & &\\
 & &\ddots& &\\
 & & &F_N&\\
G^{(1)}_1& & & &H^{(1)}\\
 &G^{(1)}_2& & &H^{(1)}\\
 & &\ddots& &\vdots\\
 & & &G^{(1)}_N&H^{(1)}\\
 & & & &\mathbf{1}_m^{\top}
\end{bmatrix},$$

where , . It is easy to verify that elements in the first rows of and are zeros.

• Step 2

As defined in the statement of Lemma 3.1, is obtained by removing the -th, -th, …, -th rows of A. That is,

$$\bar{A}=\begin{bmatrix}
F_1& & & &\\
 &F_2& & &\\
 & &\ddots& &\\
 & & &F_N&\\
G^{(2)}_1& & & &H^{(2)}\\
 &G^{(2)}_2& & &H^{(2)}\\
 & &\ddots& &\vdots\\
 & & &G^{(2)}_N&H^{(2)}\\
 & & & &\mathbf{1}_m^{\top}
\end{bmatrix}$$

where , and . Let . For , let

$$U_i=I_{m_i}\otimes\begin{bmatrix}1&-1&\cdots&-1\\ &1& &\\ & &\ddots&\\ & & &1\end{bmatrix}_{m\times m},\qquad
U_{N+1}=\begin{bmatrix}1&-1&\cdots&-1\\ &1& &\\ & &\ddots&\\ & & &1\end{bmatrix}_{m\times m},\qquad
R_1=\begin{bmatrix}U_1& & &\\ &U_2& &\\ & &\ddots&\\ & & &U_{N+1}\end{bmatrix}$$

then

$$\bar{A}R_1=\begin{bmatrix}
F^{(3)}_1& & & &\\
 &F^{(3)}_2& & &\\
 & &\ddots& &\\
 & & &F^{(3)}_N&\\
G^{(3)}_1& & & &H^{(3)}\\
 &G^{(3)}_2& & &H^{(3)}\\
 & &\ddots& &\vdots\\
 & & &G^{(3)}_N&H^{(3)}\\
 & & & &\alpha^{\top}
\end{bmatrix}$$

where , , , and . Let

$$\tilde{K}=\begin{bmatrix}0& & &\\ &-1& &\\ & &\ddots&\\ & & &-1\end{bmatrix}_{m\times m},\qquad
K_i=\begin{bmatrix}I_m&\tilde{K}&\cdots&\tilde{K}\\ &I_m& &\\ & &\ddots&\\ & & &I_m\end{bmatrix}_{mm_i\times mm_i},\qquad
R_2=\begin{bmatrix}K_1& & &\\ &\ddots& &\\ & &K_N&\\ & & &I_m\end{bmatrix}$$

then

$$\bar{A}R_1R_2=\begin{bmatrix}
F^{(4)}_1& & & &\\
 &F^{(4)}_2& & &\\
 & &\ddots& &\\
 & & &F^{(4)}_N&\\
G^{(4)}_1& & & &H^{(3)}\\
 &G^{(4)}_2& & &H^{(3)}\\
 & &\ddots& &\vdots\\
 & & &G^{(4)}_N&H^{(3)}\\
 & & & &\alpha^{\top}
\end{bmatrix}$$

where , , . Let be the matrix composed of the first columns of . That is,

$$\tilde{A}=\begin{bmatrix}
F^{(4)}_1& & & &\\
 &F^{(4)}_2& & &\\
 & &\ddots& &\\
 & & &F^{(4)}_N&\\
G^{(4)}_1& & & &\\
 &G^{(4)}_2& & &\\
 & &\ddots& &\\
 & & &G^{(4)}_N&\\
 & & & &1
\end{bmatrix}$$

The matrix satisfies two properties: (1) each row of has one and only one nonzero element (equal to 1), with all other elements being 0; (2) each column of has at most one nonzero element. Therefore, there exist permutation matrices and such that . Thus and .

With this lemma, the primal and dual problems of problem (5) can be written as

$$\text{(Primal)}\quad \min\ c^{\top}x \ \ \text{s.t.}\ \ \bar{A}x=\bar{b},\ x\ge 0.\qquad
\text{(Dual)}\quad \max\ \bar{b}^{\top}\lambda \ \ \text{s.t.}\ \ \bar{A}^{\top}\lambda+s=c,\ s\ge 0. \quad (7)$$

### 3.2 Framework of Matrix-based Adaptive Alternating Interior-point Method (MAAIPM).

When the support points are not pre-specified, we need to solve the free support problem (3). As we just saw, when is fixed, the problem becomes a linear program. When are fixed, the problem is a quadratic optimization problem with respect to , and the optimal can be written in closed form as

$$x_i^{*}=\Bigl(\sum_{t=1}^{N}\sum_{j=1}^{m_t}\pi^{(t)}_{ij}\Bigr)^{-1}\sum_{t=1}^{N}\sum_{j=1}^{m_t}\pi^{(t)}_{ij}\,q^{(t)}_j,\qquad i=1,2,\dots,m. \quad (8)$$
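The update (8) is simply a transport-weighted average of the input support points. A direct sketch (assuming the plans $\Pi^{(t)}$ and supports are stored as numpy arrays; the function name is ours):

```python
import numpy as np

def update_support(plans, supports):
    """Closed-form support update (8): each barycenter point x_i is the
    transport-weighted average of all input support points.
    plans: list of (m, m_t) transport plans Pi^(t);
    supports: list of (m_t, n) support matrices of the input measures."""
    num = sum(Pi @ Q for Pi, Q in zip(plans, supports))   # (m, n) numerators
    den = sum(Pi.sum(axis=1) for Pi in plans)             # (m,) total mass per row
    return num / den[:, None]
```

Because this step is a cheap matrix product, the cost of the alternating scheme is dominated entirely by the linear programming step, which motivates the adaptive schedule described next.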

In other words, (3) can be reformulated as

$$\min\ c(x)^{\top}x \quad \text{s.t.}\quad \bar{A}x=\bar{b},\ x\ge 0. \quad (9)$$

As stated above, (3) is a nonconvex problem, so it may contain saddle points and local minima. This makes finding a global optimizer difficult.

###### Example 3.1 (an example of local minima)

Set . Let be any positive integer and , , , and . Then , and and is a local minimum. But it is not a global minimum because a lower objective value occurs when , , and .

###### Example 3.2 (an example of saddle point)

Let be any positive integer and , , , and , then , , and is a saddle point. Fixing , the and is an optimal basic solution of problem (4). Fixing and , is the solution of (8). It is not a local minimum, because a lower objective value of problem (4) can occur when , .

The alternating minimization strategy used in [15, 52, 53] alternates between optimizing via (8) and optimizing via (4). However, this alternating approach cannot avoid local minima or saddle points, and every iteration may require solving a linear program (4), which is expensive when the problem size is large.

To overcome these drawbacks, we propose the Matrix-based Adaptive Alternating IPM (MAAIPM), Algorithm 1. If the support is pre-specified, we solve a single linear program by a predictor-corrector IPM. If the support needs to be optimized, MAAIPM uses an adaptive strategy. At the beginning, because the primal variables are far from the optimal solution, MAAIPM updates via (8) only after several IPM iterations. Later, MAAIPM updates after every IPM iteration and applies "jump" tricks to escape local minima. Since at the beginning MAAIPM updates after many IPM iterations, a primal-dual predictor-corrector IPM is more efficient there. At the end, is updated more often, and each update of changes the linear programming objective so that the dual variables may become infeasible. However, the primal variables always remain feasible, so a primal IPM is more suitable at the end. Moreover, the primal IPM is better suited to "jump" tricks and other local-minima-escaping techniques, as shown in [54]. Figure 2 visualizes the primal variables and objective gradients in each iteration of MAAIPM.
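The adaptive schedule above can be summarized schematically (a sketch of the strategy in this subsection, not Algorithm 1 verbatim; phase-switching rules are left abstract):

```
# MAAIPM outline (schematic)
initialize support X and an interior point (x, lambda, s)
phase 1 (iterates far from optimum):
    run several primal-dual predictor-corrector IPM iterations
    then update X by the closed-form step (8)
phase 2 (iterates near optimum):
    after every primal IPM iteration, update X by (8)
    occasionally "jump" to a perturbed X to escape local minima,
    warm-starting the next IPM from the previous primal iterate
repeat until the objective decrease falls below tolerance
```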

In predictor-corrector IPM, the main computational cost lies in solving the Newton equations, which can be reformulated as the normal equations

$$\bar{A}(D^k)^2\bar{A}^{\top}\Delta\lambda^k=f^k, \quad (10)$$

where denotes and is in . This linear system can be efficiently solved by the two methods proposed in the next section. In the primal IPM, MAAIPM combines following the central path with optimizing the support points; i.e., each iteration contains three parts: taking a Newton step for the logarithmic barrier problem

$$\min\ c^{\top}x-\mu\sum_{i=1}^{n}\ln x_i \quad \text{s.t.}\quad \bar{A}x=\bar{b}, \quad (11)$$

reducing the penalty , and updating the support via (8). The Newton direction at the th iteration is calculated as

$$p^k=x^k+(X^k)^2\Bigl(\bar{A}^{\top}\bigl(\bar{A}(X^k)^2\bar{A}^{\top}\bigr)^{-1}\bigl(\bar{A}(X^k)^2c-\mu\bar{A}X^k\mathbf{1}\bigr)-c\Bigr)/\mu^k, \quad (12)$$

where . The main cost of the primal IPM lies in solving a linear system with , which again can be efficiently solved by the two methods described in the following section. Furthermore, we can apply a warm-start technique to smartly choose the starting point of the next IPM after a "jump" [43]. Compared with primal-dual IPMs' warm-start strategies [27, 26], this technique saves searching time and requires only slightly more memory. With a suitably chosen termination criterion, numerical studies show that MAAIPM outperforms previous algorithms in both speed and accuracy, whether or not the support is pre-specified.

## 4 Efficient Methods for Solving the Normal Equations

In this section, we discuss efficient methods for solving normal equations of the form , where is a diagonal matrix with all diagonal entries positive. Let , and . First, through a simple calculation, we have the following lemma on the structure of the matrix .
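As a point of reference, the naive way to solve a normal equation like (10) is to form the matrix explicitly and factor it with a dense Cholesky decomposition; this baseline (a sketch, not the paper's accelerated algorithms) is what the block matrix-based methods of this section are designed to beat.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_normal_eq(A, d, f):
    """Solve (A D A^T) y = f with D = diag(d), d > 0, via dense Cholesky.
    A must have full row rank so that A D A^T is positive definite.
    The structured methods below avoid forming A D A^T explicitly."""
    M = (A * d) @ A.T            # A D A^T, scaling columns of A by d
    c, low = cho_factor(M)       # Cholesky factorization of the SPD matrix
    return cho_solve((c, low), f)
```

The cost of this baseline is dominated by forming and factoring the dense matrix, which is exactly the bottleneck that the block elimination exploited in Lemma 4.1 removes.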

###### Lemma 4.1

can be written in the following format:

where is a diagonal matrix with positive diagonal entries; is a block-diagonal matrix with N blocks (the size of the i-th block is ); is a diagonal matrix with positive diagonal entries; Let , then , and ; .

Proof. Let be the diagonal vector of the matrix ; and . As in the preceding section, the structure of is:

$$\bar{A}=\begin{bmatrix}
F_1& & & &\\
 &F_2& & &\\
 & &\ddots& &\\
 & & &F_N&\\
G^{(2)}_1& & & &H^{(2)}\\
 &G^{(2)}_2& & &H^{(2)}\\
 & &\ddots& &\vdots\\
 & & &G^{(2)}_N&H^{(2)}\\
 & & & &\mathbf{1}_m^{\top}
\end{bmatrix}$$

where , and . Let

$$\bar{A}_1:=\bar{A}(1{:}M,\,:)=\begin{bmatrix}F_1& & &\\ &F_2& &\\ & &\ddots&\\ & & &F_N\end{bmatrix},$$
$$\bar{A}_2:=\bar{A}(M{+}1:M{+}(m{-}1)N,\,:)=\begin{bmatrix}G^{(2)}_1& & & &H^{(2)}\\ &G^{(2)}_2& & &H^{(2)}\\ & &\ddots& &\vdots\\ & & &G^{(2)}_N&H^{(2)}\end{bmatrix},$$
$$\bar{A}_3:=\bar{A}(M{+}(m{-}1)N{+}1,\,:)=\begin{bmatrix}\mathbf{1}_m^{\top}\end{bmatrix}.$$

Then

$$\bar{A}=\begin{bmatrix}\bar{A}_1\\\bar{A}_2\\\bar{A}_3\end{bmatrix}
\quad\text{and}\quad
\bar{A}D\bar{A}^{\top}=\begin{bmatrix}
\bar{A}_1D\bar{A}_1^{\top}&\bar{A}_1D\bar{A}_2^{\top}&\bar{A}_1D\bar{A}_3^{\top}\\
\bar{A}_2D\bar{A}_1^{\top}&\bar{A}_2D\bar{A}_2^{\top}&\bar{A}_2D\bar{A}_3^{\top}\\
\bar{A}_3D\bar{A}_1^{\top}&\bar{A}_3D\bar{A}_2^{\top}&\bar{A}_3D\bar{A}_3^{\top}
\end{bmatrix}.$$

Now we analyze the structure of each sub-matrix and rename them for conciseness. Let

$$D=\begin{bmatrix}D_1& & &\\ &D_2& &\\ & &\ddots&\\ & & &D_{N+1}\end{bmatrix},$$

where , and . Then

$$\bar{A}_1D\bar{A}_1^{\top}=\begin{bmatrix}F_1D_1F_1^{\top}& &\\ &\ddots&\\ & &F_ND_NF_N^{\top}\end{bmatrix}:=B_1.$$

Each is a diagonal matrix with positive diagonal entries.
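The diagonality of each block follows from the Kronecker form of the row-sum constraint operators and can be verified numerically (illustrative sizes; this assumes $F_i=\mathbf{1}_{m_i}^{\top}\otimes I_m$, as in the vectorized constraints of Section 3.1):

```python
import numpy as np

m, mi = 3, 4
rng = np.random.default_rng(1)
F = np.kron(np.ones(mi), np.eye(m))   # F_i = 1_{m_i}^T (x) I_m, shape (m, m*mi)
d = rng.random(m * mi) + 0.1          # positive diagonal of D_i
B = (F * d) @ F.T                     # the block F_i D_i F_i^T

# the off-diagonal part vanishes: B is diagonal with positive entries
assert np.allclose(B, np.diag(np.diag(B)))
assert np.all(np.diag(B) > 0)
```

Each column of F has a single nonzero entry, so F D F^T collapses to a sum of rank-one diagonal contributions; this is what makes B_1 trivially invertible in the block elimination.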

 ¯A2D¯A⊤1=⎡⎢ ⎢ ⎢ ⎢⎣G(2)1D