1 Introduction
Optimal transport (OT) is a classical problem that has recently found many emerging applications in machine learning and artificial intelligence, including generative models
[3], representation learning [19, 4], and word embeddings [2]. More recently, Scetbon et al. [21] proposed the equitable and optimal transport (EOT) problem, which aims to fairly distribute the workload of OT when there are multiple agents. In this problem, multiple agents work together to move mass between two measures, and each agent has its own cost function. An important issue here is fairness: we seek transportation plans under which the workloads of all the agents are equal to each other. This can be achieved by minimizing the largest transportation cost among all agents, which leads to a convex-concave saddle point problem. The EOT problem has wide applications in economics and machine learning, such as fair division or the cake-cutting problem [16, 6], multi-type resource allocation [15], internet minimal transportation time, and sequential optimal transport [21]. We now describe the EOT problem formally. Given two discrete probability measures
and , EOT studies the problem of transporting mass from to by agents. Here, and are the support points of each measure, and , are the corresponding weights, where denotes the probability simplex in . Moreover, throughout this paper, we assume . For each agent , we denote its unique cost function by and its cost matrix by , where . Moreover, we define the following coupling decomposition set, where and denote the row sums and column sums of the matrix , respectively. Mathematically, the EOT problem can be formulated as
(1) 
When , (1) reduces to the standard OT problem. Note that (1) minimizes the pointwise maximum of a finite collection of functions. It is easy to see that (1) is equivalent to the following constrained problem:
(2) 
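Since the displayed equations (1) and (2) did not survive extraction, the following transcription (ours, following the notation of Scetbon et al. [21]; the symbols $C^k$, $\pi^k$, and $\Pi(a,b)$ are assumptions) sketches their standard form: EOT minimizes the worst per-agent cost over decompositions of a coupling, and the constrained reformulation introduces an epigraph variable.

```latex
% Our transcription of the EOT problem and its constrained (epigraph)
% reformulation, in the notation of Scetbon et al. [21]; the symbols may
% differ from the paper's rendered equations (1)-(2).
\min_{\substack{(\pi^1,\dots,\pi^N):\ \pi^k \ge 0,\\ \sum_{k=1}^{N}\pi^k \in \Pi(a,b)}}
\ \max_{1 \le k \le N}\ \langle C^k, \pi^k \rangle
\qquad\Longleftrightarrow\qquad
\min_{\substack{t,\ (\pi^1,\dots,\pi^N):\ \pi^k \ge 0,\\ \sum_{k=1}^{N}\pi^k \in \Pi(a,b)}}
\ t \quad \text{s.t.}\quad \langle C^k, \pi^k \rangle \le t,\ \ \forall\, k.
```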
The following proposition shows an important property of EOT: at the optimum of the minimax EOT formulation (2), the transportation costs of the agents are equal to each other.
Proposition 1
Note that Proposition 1 requires all cost matrices to have the same sign. When the cost matrices are all nonnegative, (2) solves the transportation problem with multiple agents. When the cost matrices are all nonpositive, the cost matrices are interpreted as the utility functions and (2) solves the fair division problem [16].
The discrete OT is a linear programming (LP) problem (in fact, an assignment problem) with a complexity of [25]. Due to this cubic dependence on the dimension , it is challenging to solve large-scale OT in practice. A widely adopted compromise is to add an entropy regularizer to the OT problem [7]. The resulting problem is strongly convex and smooth, and its dual problem can be efficiently solved by the celebrated Sinkhorn's algorithm [22, 7]. This strategy is now widely used in the OT community due to its computational advantages as well as improved sample complexity [9]. Similar ideas were also used for computing the Wasserstein barycenter [5], the projection robust Wasserstein distance [20, 14, 12], and the projection robust Wasserstein barycenter [11]. Motivated by these previous works, Scetbon et al. [21] proposed to add an entropy regularizer to (2), and designed a projected alternating maximization algorithm (PAM) to solve its dual problem. However, the convergence of PAM has not been studied. Scetbon et al. [21] also proposed an accelerated projected gradient ascent algorithm (APGA) for solving a different form of the dual problem of the entropy regularized EOT. Since the objective function of this new dual form has Lipschitz continuous gradient, APGA is essentially Nesterov's accelerated gradient method and thus its convergence rate is known. However, numerical experiments conducted in [21] indicate that APGA performs worse than PAM. We will discuss the reasons in detail later.
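To make the entropic-regularization strategy described above concrete, here is a minimal sketch of Sinkhorn's algorithm for the standard (single-agent) entropy-regularized OT problem. This illustrates the general technique, not the PAM/APGA algorithms of [21]; the function and variable names (`sinkhorn`, `eta`, `K`) are ours.

```python
import numpy as np

def sinkhorn(C, a, b, eta=0.1, n_iters=1000):
    """Minimal Sinkhorn iteration for the entropy-regularized OT problem
    min_{P in Pi(a,b)} <C, P> - eta * H(P).
    Alternately rescales the rows and columns of the Gibbs kernel."""
    K = np.exp(-C / eta)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # match column marginals
        u = a / (K @ v)                  # match row marginals
    return u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)

# Tiny usage example: a 2x2 cost favoring the diagonal.
a = np.array([0.5, 0.5]); b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = sinkhorn(C, a, b)
```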
Our Contributions. There are three main issues with the PAM and APGA algorithms in [21], and we address all of them in this paper. Our results may shed light on designing new block coordinate descent algorithms. Our main contributions are given below.

The PAM algorithm in [21] only returns the dual variables. How to find the primal solution of (2), i.e., the optimal transport plans , was not discussed in [21]. In this paper, we propose a novel rounding procedure to find the primal solution. Our rounding procedure is different from the one widely used in the literature [1].

We provide the first convergence analysis of the PAM algorithm, and analyze its iteration complexity for finding an optimal solution to the EOT problem (2). In particular, we show that it takes at most arithmetic operations to find an optimal solution to (2). This matches the rate of Sinkhorn's algorithm for computing the Wasserstein distance [8].

We propose a variant of PAM that incorporates the extrapolation technique used in Nesterov's accelerated gradient method. We name this variant Projected Alternating Maximization with Extrapolation (PAME). The iteration complexity of PAME is also analyzed. Although we are currently unable to prove a better complexity bound than that of PAM, we find that PAME performs much better than PAM numerically.
Notation.
For vectors
and with the same dimension, denotes their entrywise division. We denote . Throughout this paper, we assume vector , and we denote . We use to denote the dimensional vector whose entries are all equal to one. We use to denote the indicator function of the set , i.e., if , and otherwise. We denote . For an integer , we denote . We also denote .

2 Projected Alternating Maximization Algorithm
The PAM algorithm proposed in [21] aims to solve the entropy regularized EOT problem, which is given by
(4) 
where is a regularization parameter, , and the entropy function is defined as . Note that (4) is a strongly-convex-concave minimax problem whose constraint sets are convex and bounded, and thus Sion's minimax theorem [24] guarantees that
(5) 
Now we consider the dual problem of . First, we add a redundant constraint and consider the dual of
(6) 
The reason for adding this redundant constraint is to guarantee that the dual objective function is Lipschitz smooth. It is easy to verify that the dual problem of (6) is given by
(7) 
where and are the dual variables and . Note that problem (7) admits the following solution:
(8) 
where
(9) 
By plugging (8) into (7), we obtain the following dual problem of (6):
(10) 
Plugging (10) into (5), we see that the entropy regularized EOT problem (4) is equivalent to a pure maximization problem:
(11) 
Function is a smooth concave function with three block variables . We use to denote an optimal solution of (11), and we denote . The PAM algorithm proposed in [21] is essentially a block coordinate descent (BCD) algorithm for solving (11). More specifically, PAM updates the three block variables by the following scheme:
(12a)  
(12b)  
(12c) 
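Structurally, one iteration of this scheme can be sketched as follows. Since the displayed updates (12a)–(12c) are problem-specific, the exact-maximization oracles `exact_max_u` and `exact_max_v` and the gradient oracle `grad_lam` below are hypothetical placeholders; only the two-exact-steps-plus-one-projected-gradient skeleton is meant literally.

```python
import numpy as np

def proj_simplex(x):
    """Euclidean projection onto the probability simplex
    (standard sorting-based algorithm)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > css)[0][-1]
    return np.maximum(x - css[rho] / (rho + 1.0), 0.0)

def pam_iteration(u, v, lam, grad_lam, exact_max_u, exact_max_v, tau):
    """One iteration of the three-block scheme (12): two exact
    maximization steps followed by one projected gradient step."""
    u = exact_max_u(v, lam)    # block 1: exact maximization, cf. (12a)
    v = exact_max_v(u, lam)    # block 2: exact maximization, cf. (12b)
    lam = proj_simplex(lam + tau * grad_lam(u, v, lam))  # block 3, cf. (12c)
    return u, v, lam
```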
Each iteration of PAM consists of two exact maximization steps followed by one projected gradient step. Importantly, the two exact maximization problems (12a)–(12b) have multiple optimal solutions, and we choose to use the following ones:
(13)  
(14) 
Furthermore, the optimality conditions of (12a)–(12b) imply that
(15) 
However, we need to point out that the PAM (12) only returns the dual variables . One can compute the primal variable using (8), but it is not necessarily a feasible solution. That is, computed from (8) does not satisfy . How to obtain an optimal primal solution from the dual variables was not discussed in [21]. For the OT problem, i.e., , a rounding procedure for returning a feasible primal solution has been proposed in [1]. However, this rounding procedure cannot be applied to the EOT problem directly. In the next section, we propose a new rounding procedure for returning a primal solution based on the dual solution . This new rounding procedure involves a dedicated way to compute the margins.
2.1 The Rounding Procedure and the Margins
Given , , and satisfying , we construct vectors from the procedure
(16) 
The details of this procedure are given below. First, we set , which immediately implies . We then construct such that the following properties hold (these properties are required in our convergence analysis later):

;

;

;

For any fixed , the quantities have the same sign for all . That is, for any and , we have
(17)
which provides the following identity that is useful in our convergence analysis later:
(18)
The procedure for constructing satisfying these four properties is provided in Appendix A.
After are constructed from (16) with , we adopt the rounding procedure proposed in [1] to output a primal feasible solution . The rounding procedure is described in Algorithm 2.
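For reference, the rounding procedure of [1] (Algorithm 2 in that paper) can be sketched as follows. The function and variable names are ours, and this is a transcription of the standard procedure, not the modified margins computation proposed in this paper: rows and columns whose sums exceed the target marginals are scaled down, and the remaining mass deficit is filled in by a rank-one correction.

```python
import numpy as np

def round_to_marginals(F, r, c):
    """Rounding procedure in the style of Altschuler et al. [1]: map a
    nonnegative matrix F to a feasible point of the transportation
    polytope with row marginals r and column marginals c."""
    row_sums = F.sum(axis=1)
    x = np.minimum(1.0, r / np.maximum(row_sums, 1e-300))  # row scalings
    F = F * x[:, None]
    col_sums = F.sum(axis=0)
    y = np.minimum(1.0, c / np.maximum(col_sums, 1e-300))  # column scalings
    F = F * y[None, :]
    err_r = r - F.sum(axis=1)   # remaining row deficits (nonnegative)
    err_c = c - F.sum(axis=0)   # remaining column deficits (nonnegative)
    if err_r.sum() > 0:
        F = F + np.outer(err_r, err_c) / err_r.sum()  # rank-one fill-in
    return F
```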
With this new procedure for rounding and computing the margins , , we now formally describe our PAM algorithm in Algorithm 1.
(19) 
2.2 Connections with BCD and BCGD Methods
We now discuss the connections between PAM and the block coordinate descent (BCD) method and the block coordinate gradient descent (BCGD) method. For ease of presentation, we assume that we are dealing with the following general convex optimization problem with block variables:
(20) 
where and is convex and differentiable. The BCD method for solving (20) iterates as follows:
(21) 
and it assumes that these subproblems are easy to solve. The BCGD method for solving (20) iterates as follows:
(22) 
where is the step size. The PAM scheme (12) is a hybrid of BCD (21) and BCGD (22), in the sense that some block variables are updated by exactly solving a maximization problem (the and steps), while others are updated by taking a gradient step (the step). Although this hybrid idea has been studied in the literature [10, 26], the existing convergence analyses require the blocks corresponding to exact minimization to be strongly convex. In our problem (11), however, the negative of the objective function is merely convex. Hence we need to develop new convergence proofs to analyze the convergence of PAM (Algorithm 1). Extending our convergence results for PAM (Algorithm 1) to more general settings is an interesting topic for future study.
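The contrast between the BCD scheme (21) and the BCGD scheme (22) can be illustrated on a toy two-block problem (our own example, not from the paper): BCD minimizes each block subproblem exactly, while BCGD only takes a single gradient step per block.

```python
# Toy objective f(x, y) = (x - 1)^2 + (y + 2)^2 with two scalar blocks.

def bcd_step(x, y):
    """BCD (21): each block subproblem is minimized exactly."""
    x = 1.0   # argmin over x of f(x, y)
    y = -2.0  # argmin over y of f(x, y)
    return x, y

def bcgd_step(x, y, tau=0.25):
    """BCGD (22): each block takes one gradient step of size tau."""
    x = x - tau * 2.0 * (x - 1.0)
    y = y - tau * 2.0 * (y + 2.0)
    return x, y

# On this separable problem BCD finishes in one pass, while BCGD
# converges geometrically over many iterations.
x, y = 5.0, 5.0
for _ in range(50):
    x, y = bcgd_step(x, y)
```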
3 Convergence Analysis of PAM
In this section, we analyze the iteration complexity of Algorithm 1 for obtaining an optimal solution to the original EOT problem (2). The optimal solution to (2) is defined as follows.
Definition 2 (see, e.g., [17])
3.1 Technical Preparations
We first give the partial gradients of .
(23a)  
(23b)  
(23c) 
Since (13) and (14) renormalize the row sum and column sum of to be and , we immediately have
(24) 
which, combined with (8), yields
(25) 
Lemma 3 (Rounding Error)
Let with , , and . The following inequality holds:
Proof. The proof is a slight modification of [1, Lemma 7]. Note that Lines 2–5 in Algorithm 2 renormalize the rows and columns whose sums are larger than the corresponding and . It is easy to verify that , , and are nonnegative with and
and likewise . Denote . Since we remove mass from a row of when , and from a column when , we have
Firstly, a simple calculation shows
Secondly, the fact that the vector is entrywise larger than leads to
Therefore we conclude
The following lemma shows that is Lipschitz continuous.
Lemma 4
For any and , the following inequality holds
(26) 
which immediately implies
(27) 
Proof. The proof essentially follows [21]. It is easy to verify that the th entry of the Hessian of with respect to is
where iff and 0 otherwise, for all and
Let satisfying , and by denoting the Hessian of with respect to for fixed , we obtain
where the last three inequalities follow from the Cauchy–Schwarz inequality. Moreover, we have
which completes the proof.
The next lemma gives a bound for .
Lemma 5
Let be the sequence generated by Algorithm 1. For any it holds that
(28a)  
(28b) 
Proof. We prove (28a) first. When , (28a) holds because of the initialization . When , from (15) we have
(29) 
where the second inequality is due to . Combining (29) and (24), we get
which leads to
(30) 
Moreover, note that , and therefore . This fact leads to:
which gives
(31) 
Combining (30) with (31) yields (28a). The bound for (28b) can be obtained similarly, by noting that . We omit the details for brevity.
Lemma 6
Let be generated by PAM (Algorithm 1). The following equality holds.
Proof. By (25), we have