# On the Convergence of Projected Alternating Maximization for Equitable and Optimal Transport

This paper studies the equitable and optimal transport (EOT) problem, which has many applications such as fair division and optimal transport with multiple agents. In the case of discrete distributions, the EOT problem can be formulated as a linear program (LP). Since this LP is prohibitively large for general LP solvers, Scetbon et al. <cit.> suggested perturbing the problem by adding an entropy regularization. They proposed a projected alternating maximization algorithm (PAM) to solve the dual of the entropy-regularized EOT. In this paper, we provide the first convergence analysis of PAM. A novel rounding procedure is proposed to help construct the primal solution for the original EOT problem. We also propose a variant of PAM that incorporates an extrapolation technique and numerically improves the performance of PAM. Results in this paper may shed light on block coordinate (gradient) descent methods for general optimization problems.


## 1 Introduction

Optimal transport (OT) is a classical problem that has recently found many emerging applications in machine learning and artificial intelligence, including generative models [3], representation learning [19, 4], and word embeddings [2]. More recently, Scetbon et al. [21] proposed an equitable and optimal transport (EOT) problem that aims to fairly distribute the workload of OT when there are multiple agents. In this problem, multiple agents work together to move mass between two measures, and each agent has its own cost function. A central issue here is fairness, which aims at finding transportation plans such that the workloads of all the agents are equal to each other. This can be achieved by minimizing the largest transportation cost among all agents, which leads to a convex-concave saddle point problem. The EOT problem has wide applications in economics and machine learning, such as fair division or the cake-cutting problem [16, 6], multi-type resource allocation [15], internet minimal transportation time, and sequential optimal transport [21].

We now describe the EOT problem formally. Given two discrete probability measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^n b_j \delta_{y_j}$, EOT studies the problem of transporting mass from $\mu$ to $\nu$ by $N$ agents. Here, $\{x_i\}_{i\in[n]}$ and $\{y_j\}_{j\in[n]}$ are the support points of the two measures and $a, b \in \Delta_n$ are the corresponding weight vectors, where $\Delta_n$ denotes the probability simplex in $\mathbb{R}^n$. Moreover, throughout this paper, we assume that the weight vectors $a$ and $b$ are entrywise positive. For each agent $k \in [N]$, we denote its unique cost function as $c^k$ and its cost matrix as $C^k \in \mathbb{R}^{n\times n}$, where $C^k_{ij} = c^k(x_i, y_j)$. Moreover, we define the following coupling decomposition set

  \Pi^N_{a,b} := \Big\{ \pi = (\pi^k)_{k\in[N]} \ \Big|\ r\big(\textstyle\sum_k \pi^k\big) = a,\ c\big(\textstyle\sum_k \pi^k\big) = b,\ \pi^k_{ij} \ge 0,\ \forall i,j\in[n] \Big\},

where $r(\pi)$ and $c(\pi)$ denote the vectors of row sums and column sums of a matrix $\pi$, respectively. Mathematically, the EOT problem can be formulated as

  \min_{\pi\in\Pi^N_{a,b}} \max_{1\le k\le N}\ \langle \pi^k, C^k \rangle. \qquad (1)

When $N = 1$, (1) reduces to the standard OT problem. Note that (1) minimizes the point-wise maximum of a finite collection of linear functions. It is easy to see that (1) is equivalent to the following constrained problem:

  \min_{\pi\in\Pi^N_{a,b}} \max_{\lambda\in\Delta_N^+}\ \ell(\pi,\lambda) := \sum_{k=1}^N \lambda_k \langle \pi^k, C^k \rangle. \qquad (2)

The following proposition shows an important property of EOT: at the optimum of the minimax EOT formulation (2), the transportation costs of the agents are equal to each other.

###### Proposition 1

[21, Proposition 1] Assume that all cost matrices have the same sign. Let $\pi^*$ be an optimal solution of (2). It holds that

  \langle (\pi^*)^i, C^i \rangle = \langle (\pi^*)^j, C^j \rangle, \quad \forall i,j\in[N]. \qquad (3)

Note that Proposition 1 requires all cost matrices to have the same sign. When the cost matrices are all non-negative, (2) solves the transportation problem with multiple agents. When the cost matrices are all non-positive, the cost matrices are interpreted as the utility functions and (2) solves the fair division problem [16].

The discrete OT is a linear programming (LP) problem (in fact, an assignment problem) that can be solved with $O(n^3 \log n)$ complexity [25]. Due to this cubic dependence on the dimension $n$, it is challenging to solve large-scale OT problems in practice. A widely adopted compromise is to add an entropy regularizer to the OT problem [7]. The resulting problem is strongly convex and smooth, and its dual problem can be efficiently solved by the celebrated Sinkhorn algorithm [22, 7]. This strategy is now widely used in the OT community due to its computational advantages as well as its improved sample complexity [9]. Similar ideas were also used for computing the Wasserstein barycenter [5], the projection robust Wasserstein distance [20, 14, 12], and the projection robust Wasserstein barycenter [11]. Motivated by these previous works, Scetbon et al. [21] proposed to add an entropy regularizer to (2), and designed a projected alternating maximization algorithm (PAM) to solve its dual problem. However, the convergence of PAM has not been studied. Scetbon et al. [21] also proposed an accelerated projected gradient ascent algorithm (APGA) for solving a different form of the dual problem of the entropy regularized EOT. Since the objective function of this new dual form has a Lipschitz continuous gradient, APGA is essentially Nesterov's accelerated gradient method and thus its convergence rate is known. However, the numerical experiments conducted in [21] indicate that APGA performs worse than PAM. We will discuss the reasons in detail later.
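To make the entropy-regularization strategy concrete, the sketch below runs plain Sinkhorn iterations on a tiny random instance of the standard (single-agent) OT problem. The function name, regularization value, and iteration count are our own illustrative choices, not code from the paper.

```python
import numpy as np

def sinkhorn(a, b, C, eta=0.5, iters=500):
    """Sinkhorn iterations for entropy-regularized OT (illustrative sketch).

    Alternately rescales the Gibbs kernel K = exp(-C/eta) so that the
    resulting plan diag(u) K diag(v) matches the marginals a and b.
    """
    K = np.exp(-C / eta)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)           # match column marginals
        u = a / (K @ v)             # match row marginals
    return u[:, None] * K * v[None, :]

n = 4
rng = np.random.default_rng(0)
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)
C = rng.random((n, n))
pi = sinkhorn(a, b, C)
# marginals of the returned plan approximately match a and b
print(np.abs(pi.sum(1) - a).max(), np.abs(pi.sum(0) - b).max())
```

The row marginals are exact by construction (the last update rescales `u`); the column marginals converge linearly.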

Our Contributions. There are mainly three issues with the PAM and APGA algorithms in [21], and we will address all of them in this paper. Our results may shed light on designing new block coordinate descent algorithms. Our main contributions are given below.

• The PAM algorithm in [21] only returns the dual variables. How to find the primal solution of (2), i.e., the optimal transportation plan $\pi^*$, was not discussed in [21]. In this paper, we propose a novel rounding procedure to find the primal solution. Our rounding procedure is different from the one widely used in the literature [1].

• We provide the first convergence analysis of the PAM algorithm, and analyze its iteration complexity for finding an $\epsilon$-optimal solution to the EOT problem (2). In particular, we give an explicit bound on the number of arithmetic operations needed to find an $\epsilon$-optimal solution to (2). This matches the rate of the Sinkhorn algorithm for computing the Wasserstein distance [8].

• We propose a variant of PAM that incorporates the extrapolation technique used in Nesterov's accelerated gradient method. We name this variant Projected Alternating Maximization with Extrapolation (PAME). The iteration complexity of PAME is also analyzed. Though we are not yet able to prove a better complexity than that of PAM, we find that PAME performs much better than PAM numerically.

Notation.

For vectors $u$ and $v$ of the same dimension, $u \,./\, v$ denotes their entry-wise division. Throughout this paper, we assume the vector $b > 0$ entrywise, and we denote $\iota := \log(\min_j b_j)$. We use $1_n$ to denote the $n$-dimensional vector whose entries are all equal to one. We use $1_X$ to denote the indicator function of a set $X$, i.e., $1_X(x) = 1$ if $x \in X$, and $1_X(x) = 0$ otherwise. For a positive integer $N$, we denote $[N] := \{1, \ldots, N\}$. We also denote $c_\infty := \max_{k\in[N]} \|C^k\|_\infty$.

## 2 Projected Alternating Maximization Algorithm

The PAM algorithm proposed in [21] aims to solve the entropy regularized EOT problem, which is given by

  \min_{\pi\in\Pi^N_{a,b}} \max_{\lambda\in\Delta_N^+}\ \ell_\eta(\pi,\lambda) := \sum_{k=1}^N p^k_\eta(\pi^k, \lambda), \qquad (4)

where $\eta > 0$ is a regularization parameter, $p^k_\eta(\pi^k,\lambda) := \lambda_k\langle \pi^k, C^k\rangle - \eta H(\pi^k)$, and the entropy function is defined as $H(\pi) := -\sum_{i,j} \pi_{ij}(\log \pi_{ij} - 1)$. Note that (4) is a strongly-convex-concave minimax problem whose constraint sets are convex and bounded, and thus Sion's minimax theorem [24] guarantees that

  \min_{\pi\in\Pi^N_{a,b}} \max_{\lambda\in\Delta_N^+}\ \ell_\eta(\pi,\lambda) = \max_{\lambda\in\Delta_N^+} \min_{\pi\in\Pi^N_{a,b}}\ \ell_\eta(\pi,\lambda). \qquad (5)

Now we consider the dual problem of the inner minimization on the right-hand side of (5). First, we add the redundant constraint $\sum_{k,i,j}\pi^k_{ij} = 1$ and consider the dual of

  \min_{\pi\in\Pi^N_{a,b},\ \sum_{k,i,j}\pi^k_{ij}=1}\ \ell_\eta(\pi,\lambda). \qquad (6)

The reason for adding this redundant constraint is to guarantee that the dual objective function is Lipschitz smooth. It is easy to verify that the dual problem of (6) is given by

  \max_{f,g}\ \min_{\pi\in(\mathbb{R}^{n\times n}_+)^N,\ \sum_{k,i,j}\pi^k_{ij}=1}\ \sum_{k=1}^N \lambda_k\langle \pi^k, C^k\rangle - \eta H(\pi) + f^\top\Big(a - r\Big(\sum_k \pi^k\Big)\Big) + g^\top\Big(b - c\Big(\sum_k \pi^k\Big)\Big), \qquad (7)

where $f \in \mathbb{R}^n$ and $g \in \mathbb{R}^n$ are the dual variables associated with the two marginal constraints. It is noted that the inner minimization in (7) admits the following closed-form solution:

  \pi^k(f,g,\lambda) = \frac{\zeta^k(f,g,\lambda)}{\sum_k \|\zeta^k(f,g,\lambda)\|_1}, \quad \forall k\in[N], \qquad (8)

where

  \zeta^k(f,g,\lambda) = \exp\Big(\frac{f 1_n^\top + 1_n g^\top - \lambda_k C^k}{\eta}\Big), \quad \forall k\in[N]. \qquad (9)

By plugging (8) into (7), we obtain the following dual problem of (6):

  \max_{f\in\mathbb{R}^n,\ g\in\mathbb{R}^n}\ \langle f, a\rangle + \langle g, b\rangle - \eta\log\Big(\sum_{k=1}^N \|\zeta^k(f,g,\lambda)\|_1\Big) - \eta. \qquad (10)

Plugging (10) into (5), we see that the entropy regularized EOT problem (4) is equivalent to a pure maximization problem:

  \max_{f\in\mathbb{R}^n,\ g\in\mathbb{R}^n,\ \lambda\in\Delta_N}\ F(f,g,\lambda) := \langle f, a\rangle + \langle g, b\rangle - \eta\log\Big(\sum_{k=1}^N \|\zeta^k(f,g,\lambda)\|_1\Big) - \eta. \qquad (11)

The function $F$ is a smooth concave function with three block variables $(f, g, \lambda)$. We use $(f^*, g^*, \lambda^*)$ to denote an optimal solution of (11). The PAM algorithm proposed in [21] is essentially a block coordinate descent (BCD) algorithm for solving (11). More specifically, PAM updates the three block variables by the following scheme:

  f^{t+1} \in \arg\max_f F(f, g^t, \lambda^t), \qquad (12a)
  g^{t+1} \in \arg\max_g F(f^{t+1}, g, \lambda^t), \qquad (12b)
  \lambda^{t+1} := \mathrm{Proj}_{\Delta_N}\big(\lambda^t + \tau \nabla_\lambda F(f^{t+1}, g^{t+1}, \lambda^t)\big). \qquad (12c)

Each iteration of PAM consists of two exact maximization steps followed by one projected gradient step. Importantly, the two exact maximization problems (12a)-(12b) have multiple optimal solutions, and we choose the following ones:

  f^{t+1} = f^t + \eta \log\Big( a \,./\, r\Big(\sum_{k=1}^N \zeta^k(f^t, g^t, \lambda^t)\Big) \Big), \qquad (13)
  g^{t+1} = g^t + \eta \log\Big( b \,./\, c\Big(\sum_{k=1}^N \zeta^k(f^{t+1}, g^t, \lambda^t)\Big) \Big). \qquad (14)

Furthermore, the optimality conditions of (12a)-(12b) imply that

  a - \frac{r\big(\sum_{k=1}^N \zeta^k(f^{t+1}, g^t, \lambda^t)\big)}{\sum_k \|\zeta^k(f^{t+1}, g^t, \lambda^t)\|_1} = 0, \quad b - \frac{c\big(\sum_{k=1}^N \zeta^k(f^{t+1}, g^{t+1}, \lambda^t)\big)}{\sum_k \|\zeta^k(f^{t+1}, g^{t+1}, \lambda^t)\|_1} = 0, \quad \forall t. \qquad (15)
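The scheme (12), with the closed-form updates (13)-(14) and the gradient formula for the $\lambda$-block, can be sketched as follows. This is our own illustrative implementation under the formulas above (the simplex projection is the standard sort-based routine; the instance data are random), not the paper's Algorithm 1.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(y - css[rho] / (rho + 1.0), 0.0)

def pam_step(f, g, lam, a, b, Cs, eta, tau):
    """One PAM iteration: exact updates (13)-(14) for f and g, then the
    projected gradient step (12c) for lambda. Sketch only."""
    def zetas(f, g, lam):
        return [np.exp((f[:, None] + g[None, :] - lk * Ck) / eta)
                for lk, Ck in zip(lam, Cs)]
    Z = sum(zetas(f, g, lam))
    f = f + eta * np.log(a / Z.sum(axis=1))              # update (13)
    Z = sum(zetas(f, g, lam))
    g = g + eta * np.log(b / Z.sum(axis=0))              # update (14)
    zs = zetas(f, g, lam)                                # zeta^k at (f^{t+1}, g^{t+1}, lam^t)
    total = sum(z.sum() for z in zs)
    grad = np.array([(z / total * Ck).sum()              # grad_lambda F, entry k: <pi^k, C^k>
                     for z, Ck in zip(zs, Cs)])
    lam = project_simplex(lam + tau * grad)              # step (12c)
    return f, g, lam, zs

n, N = 3, 2
rng = np.random.default_rng(6)
a = np.full(n, 1.0 / n); b = np.full(n, 1.0 / n)
Cs = [rng.random((n, n)) for _ in range(N)]
f, g = np.zeros(n), np.zeros(n)
lam = np.full(N, 1.0 / N)
f, g, lam, zs = pam_step(f, g, lam, a, b, Cs, eta=0.5, tau=0.1)
# after the g-update, the column sums match b and the total mass is 1
print(np.abs(sum(zs).sum(axis=0) - b).max())
```

Note how the second identity in (15) holds exactly after the $g$-update: the column sums of $\sum_k \zeta^k$ equal $b$ by construction.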

However, we need to point out that PAM (12) only returns the dual variables $(f, g, \lambda)$. One can compute the primal variable $\pi(f,g,\lambda)$ using (8), but it is not necessarily a feasible solution. That is, $\pi(f,g,\lambda)$ computed from (8) does not necessarily satisfy $\pi(f,g,\lambda) \in \Pi^N_{a,b}$. How to obtain an optimal primal solution from the dual variables was not discussed in [21]. For the OT problem, i.e., the case $N = 1$, a rounding procedure for returning a feasible primal solution has been proposed in [1]. However, this rounding procedure cannot be applied to the EOT problem directly. In the next section, we propose a new rounding procedure that returns a primal solution based on the dual variables. This new rounding procedure involves a dedicated way to compute the margins.

### 2.1 The Rounding Procedure and the Margins

Given $\pi \in (\mathbb{R}^{n\times n}_+)^N$ and $a, b \in \Delta_n$ satisfying $\sum_{k,i,j}\pi^k_{ij} = 1$, we construct vectors $(a^k, b^k)_{k\in[N]}$ from the procedure

  (a^k, b^k)_{k\in[N]} = \mathrm{Margins}(\pi, a, b). \qquad (16)

The details of this procedure are given below. First, we set $a^k := r(\pi^k)$ for each $k \in [N]$, which immediately implies $\sum_{k=1}^N a^k = r\big(\sum_{k=1}^N \pi^k\big)$. We then construct $(b^k)_{k\in[N]}$ such that the following properties hold (these properties are required in our convergence analysis later):

1. $b^k \ge 0$ for all $k \in [N]$;

2. $\sum_{k=1}^N b^k = b$;

3. $\langle 1_n, b^k \rangle = \langle 1_n, a^k \rangle$ for all $k \in [N]$;

4. For any fixed $j \in [n]$, the quantities $b^k_j - [c(\pi^k)]_j$ have the same sign for all $k \in [N]$. That is, for any $j \in [n]$ and $k, k' \in [N]$, we have

  (b^k_j - [c(\pi^k)]_j) \cdot (b^{k'}_j - [c(\pi^{k'})]_j) \ge 0, \qquad (17)

which provides the following identity that is useful in our convergence analysis later:

  \sum_{k=1}^N \|b^k - c(\pi^k)\|_1 = \sum_{k=1}^N \sum_{j=1}^n |b^k_j - [c(\pi^k)]_j| = \sum_{j=1}^n \Big|\sum_{k=1}^N \big(b^k_j - [c(\pi^k)]_j\big)\Big| \qquad (18)
  = \sum_{j=1}^n \Big|b_j - \Big[c\Big(\sum_{k=1}^N \pi^k\Big)\Big]_j\Big| = \Big\|b - c\Big(\sum_{k=1}^N \pi^k\Big)\Big\|_1.

The procedure for constructing $(b^k)_{k\in[N]}$ satisfying these four properties is provided in Appendix A.
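To see why the same-sign property makes identity (18) work, consider a simple proportional split of $b$ among the agents according to their column masses. The helper below is our own illustration, not the Appendix A procedure (it enforces nonnegativity, the decomposition $\sum_k b^k = b$, and the sign property (17), but not the equal-mass property); it checks (18) numerically.

```python
import numpy as np

def proportional_margins(pis, b):
    """Split b among agents in proportion to their column sums.

    Illustrative only: yields b^k >= 0, sum_k b^k = b, and the same-sign
    property (17), since b^k_j - c(pi^k)_j = c(pi^k)_j * (b_j/total_j - 1)
    has a sign independent of k.
    """
    cols = [pi.sum(axis=0) for pi in pis]     # c(pi^k)
    total = sum(cols)                         # c(sum_k pi^k)
    return [b * ck / total for ck in cols]

n, N = 4, 3
rng = np.random.default_rng(5)
pis = [rng.random((n, n)) for _ in range(N)]
s = sum(p.sum() for p in pis)
pis = [p / s for p in pis]                    # total mass 1
b = np.full(n, 1.0 / n)
bks = proportional_margins(pis, b)
# identity (18): the per-agent column errors add up with no cancellation
lhs = sum(np.abs(bk - pi.sum(0)).sum() for bk, pi in zip(bks, pis))
rhs = np.abs(b - sum(pi.sum(0) for pi in pis)).sum()
print(abs(lhs - rhs))
```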

After $(a^k, b^k)_{k\in[N]}$ are constructed from (16), we adopt the rounding procedure proposed in [1] to output a primal feasible solution $\hat\pi \in \Pi^N_{a,b}$. The rounding procedure is described in Algorithm 2.

With this new procedure for rounding and computing the margins $(a^k, b^k)$, $k \in [N]$, we now formally describe our PAM algorithm in Algorithm 1.

### 2.2 Connections with BCD and BCGD Methods

We now discuss the connections between PAM and the block coordinate descent (BCD) and block coordinate gradient descent (BCGD) methods. For ease of presentation, we consider the following general convex optimization problem with $m$ block variables:

  \min_{x_i\in X_i,\ i=1,\ldots,m}\ J(x_1, x_2, \ldots, x_m), \qquad (20)

where the $X_i$ are convex sets and $J$ is convex and differentiable. The BCD method for solving (20) iterates as follows: for $i = 1, \ldots, m$,

  x_i^{t+1} = \arg\min_{x_i\in X_i}\ J(x_1^{t+1}, \ldots, x_{i-1}^{t+1}, x_i, x_{i+1}^t, \ldots, x_m^t), \qquad (21)

and it assumes that these subproblems are easy to solve. The BCGD method for solving (20) iterates as follows:

  x_i^{t+1} = \arg\min_{x_i\in X_i}\ \big\langle \nabla_{x_i} J(x_1^{t+1}, \ldots, x_{i-1}^{t+1}, x_i^t, x_{i+1}^t, \ldots, x_m^t),\ x_i - x_i^t \big\rangle + \frac{1}{\tau}\|x_i - x_i^t\|_2^2, \qquad (22)

where $\tau > 0$ is the step size. PAM (12) is a hybrid of BCD (21) and BCGD (22), in the sense that some block variables are updated by exactly solving a maximization problem (the $f$ and $g$ steps), while the remaining block variable is updated by taking a gradient step (the $\lambda$ step). Though this hybrid idea has been studied in the literature [10, 26], the existing convergence analysis requires the blocks corresponding to exact minimization to be strongly convex. In our problem (11), however, the negative of the objective function is merely convex. Hence we need to develop new proofs to analyze the convergence of PAM (Algorithm 1). How to extend our convergence results for PAM (Algorithm 1) to more general settings is a very interesting topic for future study.
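A minimal toy instance of this hybrid scheme, on a problem and with names entirely of our own choosing: one block is minimized exactly (BCD style, a linear solve), the other is updated by a gradient step (BCGD style).

```python
import numpy as np

# Minimize the jointly convex J(x1, x2) = ||A x1 - x2||^2 + ||x2 - c||^2.
# The x1-block is updated by exact minimization, the x2-block by a
# gradient step. Purely illustrative of the BCD/BCGD hybrid.
rng = np.random.default_rng(7)
d = 5
B = rng.standard_normal((d, d))
A = B.T @ B + np.eye(d)          # symmetric positive definite, hence invertible
c = rng.standard_normal(d)
x1, x2 = np.zeros(d), np.zeros(d)
tau = 0.2                        # step size for the gradient block
for _ in range(300):
    x1 = np.linalg.solve(A, x2)                 # exact block minimization in x1
    grad_x2 = 2 * (x2 - A @ x1) + 2 * (x2 - c)  # gradient of J in x2
    x2 = x2 - tau * grad_x2
# at the minimizer, x2 = c and A x1 = x2
print(np.linalg.norm(x2 - c), np.linalg.norm(A @ x1 - x2))
```

After the exact $x_1$-solve, the gradient step on $x_2$ contracts toward $c$ with factor $1 - 2\tau$, so the iterates converge geometrically on this instance.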

## 3 Convergence Analysis of PAM

In this section, we analyze the iteration complexity of Algorithm 1 for obtaining an $\epsilon$-optimal solution to the original EOT problem (2). The $\epsilon$-optimal solution to (2) is defined as follows.

###### Definition 2 (see, e.g., [17])

We call $(\hat\pi, \hat\lambda) \in \Pi^N_{a,b} \times \Delta_N$ an $\epsilon$-optimal solution to the EOT problem (2) if the following inequality holds:

  \max_{\lambda\in\Delta_N} \ell(\hat\pi, \lambda) - \min_{\pi\in\Pi^N_{a,b}} \ell(\pi, \hat\lambda) \le \epsilon.

Note that the left hand side of the inequality is the duality gap of (2).

### 3.1 Technical Preparations

We first give the partial gradients of $F$:

  [\nabla_f F(f,g,\lambda)]_i = a_i - \frac{\sum_{k,j} \exp\big((f_i + g_j - \lambda_k C^k_{ij})/\eta\big)}{\sum_k \|\zeta^k(f,g,\lambda)\|_1} = a_i - \Big[r\Big(\sum_k \pi^k(f,g,\lambda)\Big)\Big]_i, \qquad (23a)
  [\nabla_g F(f,g,\lambda)]_j = b_j - \frac{\sum_{k,i} \exp\big((f_i + g_j - \lambda_k C^k_{ij})/\eta\big)}{\sum_k \|\zeta^k(f,g,\lambda)\|_1} = b_j - \Big[c\Big(\sum_k \pi^k(f,g,\lambda)\Big)\Big]_j, \qquad (23b)
  [\nabla_\lambda F(f,g,\lambda)]_k = \frac{\sum_{i,j} C^k_{ij} \exp\big((f_i + g_j - \lambda_k C^k_{ij})/\eta\big)}{\sum_k \|\zeta^k(f,g,\lambda)\|_1} = \langle \pi^k(f,g,\lambda), C^k \rangle. \qquad (23c)

Since (13) and (14) renormalize the row sums and column sums of $\sum_k \zeta^k$ to be $a$ and $b$, respectively, we immediately have

  \sum_{k=1}^N \|\zeta^k(f^{t+1}, g^t, \lambda^t)\|_1 = 1, \quad \sum_{k=1}^N \|\zeta^k(f^{t+1}, g^{t+1}, \lambda^t)\|_1 = 1, \quad \forall t, \qquad (24)

which, combined with (8), yields

  \pi^k(f^{t+1}, g^t, \lambda^t) = \zeta^k(f^{t+1}, g^t, \lambda^t), \quad \pi^k(f^{t+1}, g^{t+1}, \lambda^t) = \zeta^k(f^{t+1}, g^{t+1}, \lambda^t), \quad \forall t. \qquad (25)

The following lemma gives an error bound for Algorithm 2 (see [1]).

###### Lemma 3 (Rounding Error)

Let $\pi \in \mathbb{R}^{n\times n}_+$, and let $a, b \in \mathbb{R}^n_+$ with $\|a\|_1 = \|b\|_1 = q$. Let $\hat\pi$ be the output of Algorithm 2 applied to $(\pi, a, b)$. The following inequality holds:

  \|\hat\pi - \pi\|_1 \le 2\big(\|r(\pi) - a\|_1 + \|c(\pi) - b\|_1\big).

Proof. The proof is a slight modification of [1, Lemma 7]. Note that Lines 2-5 in Algorithm 2 renormalize the row sums and column sums that are larger than the corresponding entries of $a$ and $b$. It is easy to verify that $\pi'$, $\pi''$, $\mathrm{err}_a := a - r(\pi'')$, and $\mathrm{err}_b := b - c(\pi'')$ are nonnegative with $\|\mathrm{err}_a\|_1 = \|\mathrm{err}_b\|_1$ and

  r(\hat\pi) = r(\pi'') + r\big(\mathrm{err}_a \mathrm{err}_b^\top / \|\mathrm{err}_a\|_1\big) = r(\pi'') + \mathrm{err}_a = a,

and likewise $c(\hat\pi) = b$. Denote $\Delta := \|\pi\|_1 - \|\pi''\|_1$. Since we remove mass from row $i$ of $\pi$ only when $r_i(\pi) > a_i$, and from column $j$ of $\pi'$ only when $c_j(\pi') > b_j$, we have

  \Delta = \sum_{i=1}^n \big(r_i(\pi) - a_i\big)_+ + \sum_{j=1}^n \big(c_j(\pi') - b_j\big)_+.

Firstly, a simple calculation shows

  \sum_{i=1}^n \big(r_i(\pi) - a_i\big)_+ = \frac{1}{2}\big[\|r(\pi) - a\|_1 + \|\pi\|_1 - q\big].

Secondly, the fact that the vector $c(\pi)$ is entrywise larger than $c(\pi')$ leads to

  \sum_{j=1}^n \big(c_j(\pi') - b_j\big)_+ \le \sum_{j=1}^n \big(c_j(\pi) - b_j\big)_+ \le \|c(\pi) - b\|_1.

Therefore we conclude

  \|\hat\pi - \pi\|_1 \le \Delta + \|\mathrm{err}_a \mathrm{err}_b^\top\|_1 / \|\mathrm{err}_a\|_1 = \Delta + q - \|\pi''\|_1 = 2\Delta + q - \|\pi\|_1
  \le \|r(\pi) - a\|_1 + 2\|c(\pi) - b\|_1 \le 2\big[\|r(\pi) - a\|_1 + \|c(\pi) - b\|_1\big].
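The rounding map analyzed above can be sketched for a single matrix as follows. This is our reading of the procedure of [1, Algorithm 2] (variable names are illustrative); the snippet also checks the error bound of Lemma 3 on a random instance.

```python
import numpy as np

def round_to_marginals(pi, a, b):
    """Round pi to the transportation polytope of (a, b), in the style of
    [1]: rescale overshooting rows, then overshooting columns, then fix
    the remaining deficit with a rank-one nonnegative patch."""
    x = np.minimum(a / pi.sum(axis=1), 1.0)
    pi1 = x[:, None] * pi                      # pi'
    y = np.minimum(b / pi1.sum(axis=0), 1.0)
    pi2 = pi1 * y[None, :]                     # pi''
    err_a = a - pi2.sum(axis=1)                # nonnegative by construction
    err_b = b - pi2.sum(axis=0)
    if err_a.sum() > 0:
        pi2 = pi2 + np.outer(err_a, err_b) / err_a.sum()
    return pi2

n = 4
rng = np.random.default_rng(2)
a = np.full(n, 1.0 / n); b = np.full(n, 1.0 / n)
pi = rng.random((n, n)); pi /= pi.sum()        # infeasible plan with unit mass
pi_hat = round_to_marginals(pi, a, b)
# feasibility and the Lemma 3 error bound
err = np.abs(pi_hat - pi).sum()
bound = 2 * (np.abs(pi.sum(1) - a).sum() + np.abs(pi.sum(0) - b).sum())
print(err <= bound)
```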

The following lemma shows that $\nabla_\lambda F(f, g, \cdot)$ is Lipschitz continuous.

###### Lemma 4

For any $f, g \in \mathbb{R}^n$ and $\lambda_1, \lambda_2 \in \Delta_N$, the following inequality holds:

  \|\nabla_\lambda F(f,g,\lambda_1) - \nabla_\lambda F(f,g,\lambda_2)\|_2 \le c_\infty^2 \|\lambda_1 - \lambda_2\|_2 / \eta, \qquad (26)

which immediately implies

  F(f,g,\lambda_1) \ge F(f,g,\lambda_2) + \langle \nabla_\lambda F(f,g,\lambda_2), \lambda_1 - \lambda_2 \rangle - \frac{c_\infty^2}{2\eta}\|\lambda_1 - \lambda_2\|_2^2. \qquad (27)

Proof. The proof essentially follows [21]. It is easy to verify that the $(q,k)$-th entry of the Hessian of $F$ with respect to $\lambda$ is

  \frac{\partial^2 F}{\partial\lambda_q \partial\lambda_k} = \frac{1}{\eta\nu^2}\Big[\sigma_{q,1}(\lambda)\sigma_{k,1}(\lambda) - \nu\,\sigma_{k,2}(\lambda)\,1_{k=q}\Big],

where $1_{k=q} = 1$ if $k = q$ and $0$ otherwise, for all $k, q \in [N]$, and

  \sigma_{k,p}(\lambda) = \sum_{i,j} (C^k_{ij})^p \exp\Big(\frac{f_i + g_j - \lambda_k C^k_{ij}}{\eta}\Big), \quad \nu = \sum_{k=1}^N \sum_{i,j} \exp\Big(\frac{f_i + g_j - \lambda_k C^k_{ij}}{\eta}\Big).

Let $v \in \mathbb{R}^N$ with $\|v\|_2 \le 1$, and denote by $\nabla^2_\lambda F$ the Hessian of $F$ with respect to $\lambda$ for fixed $(f, g)$. We obtain

  v^\top \nabla^2_\lambda F v = \frac{1}{\eta\nu^2}\Big[\Big(\sum_{k=1}^N v_k \sigma_{k,1}(\lambda)\Big)^2 - \nu \sum_{k=1}^N v_k^2 \sigma_{k,2}(\lambda)\Big]
  \le \frac{1}{\eta\nu^2}\Big(\sum_{k=1}^N v_k \sigma_{k,1}(\lambda)\Big)^2 - \frac{1}{\eta\nu^2}\Bigg(\sum_{k=1}^N |v_k| \sqrt{\sum_{i,j}\exp\Big(\frac{f_i+g_j-\lambda_k C^k_{ij}}{\eta}\Big)} \sqrt{\sum_{i,j}(C^k_{ij})^2\exp\Big(\frac{f_i+g_j-\lambda_k C^k_{ij}}{\eta}\Big)}\Bigg)^2
  \le \frac{1}{\eta\nu^2}\Bigg[\Big(\sum_{k=1}^N v_k \sigma_{k,1}(\lambda)\Big)^2 - \Big(\sum_{k=1}^N |v_k| \sum_{i,j}|C^k_{ij}|\exp\Big(\frac{f_i+g_j-\lambda_k C^k_{ij}}{\eta}\Big)\Big)^2\Bigg] \le 0,

where the last three inequalities come from the Cauchy-Schwarz inequality. Moreover, we have

  v^\top \nabla^2_\lambda F v = \frac{1}{\eta\nu^2}\Big[\Big(\sum_{k=1}^N v_k \sigma_{k,1}(\lambda)\Big)^2 - \nu \sum_{k=1}^N v_k^2 \sigma_{k,2}(\lambda)\Big] \ge -\frac{\sum_{k=1}^N v_k^2 \sigma_{k,2}(\lambda)}{\eta\nu} \ge -\frac{c_\infty^2}{\eta},

which completes the proof.
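As a quick sanity check of Lemma 4, one can evaluate $\nabla_\lambda F$ via (23c) on a small random instance and verify the Lipschitz bound (26). The instance below (dimensions, seed, $\eta$) is our own choice for illustration.

```python
import numpy as np

def grad_lambda_F(f, g, lam, Cs, eta):
    """[grad_lambda F]_k = <pi^k(f,g,lambda), C^k>, following (23c)."""
    zs = [np.exp((f[:, None] + g[None, :] - lk * Ck) / eta)
          for lk, Ck in zip(lam, Cs)]
    total = sum(z.sum() for z in zs)
    return np.array([(z / total * Ck).sum() for z, Ck in zip(zs, Cs)])

n, N, eta = 3, 2, 0.5
rng = np.random.default_rng(4)
Cs = [rng.random((n, n)) for _ in range(N)]
c_inf = max(np.abs(Ck).max() for Ck in Cs)     # c_infinity = max_k ||C^k||_inf
f = rng.standard_normal(n) * 0.1
g = rng.standard_normal(n) * 0.1
lam1 = np.array([0.3, 0.7]); lam2 = np.array([0.6, 0.4])
lhs = np.linalg.norm(grad_lambda_F(f, g, lam1, Cs, eta)
                     - grad_lambda_F(f, g, lam2, Cs, eta))
rhs = c_inf**2 * np.linalg.norm(lam1 - lam2) / eta
print(lhs <= rhs)   # the Lipschitz bound (26) holds on this instance
```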

The next lemma gives a bound for $\max_j g^t_j - \min_j g^t_j$.

###### Lemma 5

Let $(f^t, g^t, \lambda^t)$ be the sequence generated by Algorithm 1. For any $t \ge 0$ it holds that

  \max_j g^t_j - \min_j g^t_j \le c_\infty - \eta\iota, \qquad (28a)
  \max_j g^*_j - \min_j g^*_j \le c_\infty - \eta\iota. \qquad (28b)

Proof. We prove (28a) first. When $t = 0$, (28a) holds because of the initialization $g^0 = 0$. When $t \ge 1$, from (15) we have

  \sum_{k=1}^N e^{-\lambda^{t-1}_k C^k_{ij}/\eta} \ge \sum_{k=1}^N e^{-\lambda^{t-1}_k \|C^k\|_\infty/\eta} \ge \sum_{k=1}^N e^{-\|C^k\|_\infty/\eta} \ge N e^{-c_\infty/\eta}, \qquad (29)

where the second inequality is due to $\lambda^{t-1}_k \le 1$. Combining (29) and (24), we get

  e^{g^t_j/\eta} \cdot N e^{-c_\infty/\eta} \langle 1_n, e^{f^t/\eta} \rangle \le \sum_i e^{g^t_j/\eta} \Big(\sum_{k=1}^N e^{-\lambda^{t-1}_k C^k_{ij}/\eta}\Big) e^{f^t_i/\eta} = b_j \le 1,

which gives

  \max_j g^t_j \le c_\infty - \eta \log\big(N \langle 1_n, e^{f^t/\eta} \rangle\big). \qquad (30)

Moreover, note that $\lambda^{t-1}_k C^k_{ij} \ge 0$, and therefore $e^{-\lambda^{t-1}_k C^k_{ij}/\eta} \le 1$. This fact leads to:

  e^{g^t_j/\eta} \cdot \langle 1_n, e^{f^t/\eta} \rangle \ge \sum_i e^{g^t_j/\eta} \Big(\frac{1}{N}\sum_{k=1}^N e^{-\lambda^{t-1}_k C^k_{ij}/\eta}\Big) e^{f^t_i/\eta} = \frac{1}{N} b_j,

which gives

  \min_j g^t_j \ge \eta\iota - \eta \log\big(N \langle 1_n, e^{f^t/\eta} \rangle\big). \qquad (31)

Combining (30) with (31) yields (28a). The bound (28b) can be obtained similarly, by noting that $(f^*, g^*, \lambda^*)$ satisfies the optimality condition $\nabla_g F(f^*, g^*, \lambda^*) = 0$. We omit the details for brevity.

###### Lemma 6

Let $(f^t, g^t, \lambda^t)$ be generated by PAM (Algorithm 1). The following equality holds:

  \sum_{k=1}^N \|\pi^k(f^{t+1}, g^{t+1}, \lambda^t) - \pi^k(f^{t+1}, g^t, \lambda^t)\|_1 = \|c^t - b\|_1, \quad \forall t,

where $c^t := c\big(\sum_{k=1}^N \pi^k(f^{t+1}, g^t, \lambda^t)\big)$.

Proof. By (25), we have

  \sum_{k=1}^N \|\pi^k(f^{t+1}, g^{t+1}, \lambda^t) - \pi^k(f^{t+1}, g^t, \lambda^t)\|_1
  = \sum_{k=1}^N \sum_{i,j} \Big| e^{(f^{t+1}_i + g^{t+1}_j - \lambda^t_k C^k_{ij})/\eta} - e^{(f^{t+1}_i + g^t_j - \lambda^t_k C^k_{ij})/\eta} \Big|
  = \sum_{k=1}^N \sum_{i,j} [\pi^k(f^{t+1}, g^t, \lambda^t)]