is currently generating an increasing attraction in statistics, machine learning and optimization communities. Statistical procedures based on optimal transport are availableBigot et al. (2012); Del Barrio et al. (2015); Ebert et al. (2017); Le Gouic and Loubes (2017)
as well as many applications in different areas of machine learning including unsupervised learningArjovsky et al. (2017); Bigot et al. (2017)2014), clustering Ho et al. (2017), text classification Kusner et al. (2015). Optimal transport distances lead to the concept of Wasserstein barycenter, which allows to define a mean of a set of complex objects, e.g. images, preserving their geometric structure Cuturi and Doucet (2014). In this paper we focus on the computational aspects of optimal transport, namely on approximating Wasserstein barycenter of a set of histograms.
Starting with Altschuler et al. (2017), several groups of authors addressed the question of Wasserstein distance approximation complexity Chakrabarty and Khanna (2018); Dvurechensky et al. (2018b); Blanchet et al. (2018); Lin et al. (2019). Implementable schemes are based on Sinkhorn’s algorithm, which was first applied to OT in Cuturi (2013), and accelerated gradient descent proposed as an alternative in Dvurechensky et al. (2018b). Much less is known about the complexity of approximating Wasserstein barycenter. The works Staib et al. (2017); Uribe et al. (2018a); Dvurechensky et al. (2018a), are in some sense close, but do not provide an explicit answer. Following Genevay et al. (2016), Staib et al. (2017)to achieve accuracy in approximation of the barycenter, where is some constant depending on the problem data, i.e. transportation cost matrices, is some distance characterizing the solution of the dual problem. Dvurechensky et al. (2018a) consider regularized barycenter, but do not show, how to choose the regularization parameter to achieve -accuracy.
Following Dvurechensky et al. (2018b), we study two alternative approaches for approximating Wasserstein barycenter based on entropic regularization Cuturi (2013). The first approach is based on Iterative Bregman Projection (IBP) algorithm Benamou et al. (2015), which can be considered as a general alternating projections algorithm and also as a generalization of the Sinkhorn’s algorithm Sinkhorn (1974). The second approach is based on constructing a dual problem and solving it by primal-dual accelerated gradient descent. For both approaches, we show, how the regularization parameter should be chosen in order to approximate the original, non-regularized barycenter.
We also address the question of scalability of computations in the Big Data regime, i.e. the size of histograms and the number of histograms are large. In this case the dataset of histograms can be distributedly produced or stored in a network of agents/sensors/computers with the network structure given by an arbitrary connected graph. In a special case of centralized architecture, i.e. if there is a central "master" node surrounded by "slave" nodes, parallel algorithms such as Staib et al. (2017) can be applied. In a more general setup of arbitrary networks it makes sense to use decentralized distributed algorithms in the spirit of distributed optimization algorithms Nedić et al. (2017); Scaman et al. (2017).
It is very hard to cover all the increasing stream of works on OT and we mention these books Villani (2008); Santambrogio (2015); Peyré and Cuturi (2018) as a starting point and the references therein. Approximation of Wasserstein barycenter was considered in Cuturi and Doucet (2014); Bonneel et al. (2015); Benamou et al. (2015); Staib et al. (2017); Puccetti et al. (2018); Claici et al. (2018); Uribe et al. (2018a); Dvurechensky et al. (2018a) using different techniques as Sinkhorn-type algorithm, first-order methods, Newton-type methods. Considering the primal-dual approach based on accelerated gradient descent, our paper shares some similarities with Cuturi and Peyré (2016) with the main difference that we are focused on complexity and scalability of computations and explicitly analyze the algorithm applied to the dual problem.
There is a vast amount of literature on accelerated gradient descent with the canonical reference being Nesterov (1983). Primal-dual extensions can be found in Lan et al. (2011); Tran-Dinh et al. (2018); Yurtsever et al. (2015); Chernov et al. (2016); Dvurechensky et al. (2016, 2017); Anikin et al. (2017); Lin et al. (2019). We are focused on the extensions amenable to the decentralized distributed optimization, so that these algorithms can be scaled for large problems.
Distributed optimization algorithms were considered by many authors with the classical reference being Bertsekas and Tsitsiklis (1989). Initial algorithms, such as Distributed Gradient Descent Nedic and Ozdaglar (2009), were relatively slow compared with their centralized counterparts. However, recent work has made significant advances towards a better understanding of the optimal rates of such algorithms and their explicit dependencies to the function and network parameters Lan et al. (2017); Scaman et al. (2017); Uribe et al. (2018b). These approaches has been extended to other scenarios such as time-varying graphs Rogozin et al. (2018); Maros and Jaldén (2018); Wu and Lu (2017). The distributed setup is particularly interesting for machine learning applications on the big data regime, where the number of data points and the dimensionality is large, due to its flexibility to handle intrinsically distributed storage and limited communication, as well as privacy constraints He et al. (2018); Wai et al. (2018).
We consider -regularized Wasserstein barycenter problem and obtain complexity bounds for finding an approximation to the regularized barycenter by two algorithms. The first one is Iterative Bregman Projections algorithm Benamou et al. (2015), for which we prove complexity proportional to to achieve accuracy . The second one is based on accelerated gradient descent (AGD) and has complexity proportional to . The benefit of the second algorithm is that it is better scalable and can be implemented in the decentralized distributed optimization setting over an arbitrary network.
We show, how to choose the regularization parameter in order to find an -approximation for the non-regularized Wasserstein barycenter and find the resulting complexity for IBP to be proportional to and for AGD to be proportional to .
As we can see from the complexity bounds for IBP and AGD, they depend on the regularization parameter quite badly regarding that this parameter has to be small. To overcome this drawback we propose a proximal-IBP method, which can be considered as a proximal method using IBP on each iteration to find the next iterate.
1 Problem Statement and Preliminaries
We define the probability simplex as. Given two discrete measures and from we introduce the set of their coupling measures as
For coupling measure we denote the negative entropy (up to an additive constant) as
Here and further by (
) we denote the element-wise logarithm (exponent) of matrix or vector, and for any . For two matrices we also define element-wise multiplication and element-wise division as and
respectively. Kullback-Leibler divergence for measuresis defined as the Bregman divergence associated with :
We also define a symmetric cost matrix , which element corresponds to the cost of moving bin to bin . denotes the maximal element of this matrix.
We refer to
as the maximum eigenvalue of a symmetric matrix, and as the minimal non-zero eigenvalue, and define the condition number of matrix as . We use symbol as a column of ones.
1.2 Wasserstein barycenters and entropic regularization
Given two probability measures and a cost matrix we define optimal transportation distance between them as
For a given set of probability measures and cost matrices we define their weighted barycenter with weights as a solution of the following convex optimization problem:
where , and
We use to denote . Using entropic regularization proposed in Cuturi (2013) we define regularized OT-distance for :
Respectively, one can consider regularized barycenter that is a solution of the following problem:
2 Complexity of WB by Iterative Bregman Projections
In this section we provide theoretical analysis of the Iterative Bregman Projections algorithm Benamou et al. (2015) for regularized Wasserstein barycenter and obtain iteration complexity with . Then we estimate the bias itroduced by regularization and estimate the value of to obtain an -approximation for the non-regularized barycenter. Combining this result with the iteration complexity of IBP, we obtain complexity for approximating non-regularized barycenter by the IBP algorithm. This algorithm can be implemented in a centralized distributed manner such that each node performs arithmetic operations and the number of communication rounds is .
2.1 Convergence of IBP for regularized barycenter
and construct its dual (see the details below). To solve the dual problem we reformulate the IBP algorithm as Algorithm 1111In the original paper Benamou et al. (2015) there were misprints in description of IBP. Correct author’s variant can be found in https://github.com/gpeyre/2014-SISC-BregmanOT. In Algorithm 1 we use different denotations. First of all, our corresponds to from Benamou et al. (2015) and our corresponds to from Benamou et al. (2015). Secondly, our transport plan matrix equals to transpose transport plan matrix from Benamou et al. (2015). Thirdly, we build a little bit different dual problem, by introducing additional constraint , see Lemma 1. This allows us to simplify calculations in line 3 of Algorithm 1.. Notably, our reformulation of the IBP algorithm allows to solve simultaneously the primal and dual problem and has an adaptive stopping criterion (see line 7), which does not require to calculate any objective values.
Before we move to the analysis of the algorithm let us discuss the scalability of this algorithm by using centralized distributed computations framework. This framework includes a master and slave nodes. Each -th slave node stores data , , and variables , . On each iteration , it calculates and sends it to the master node, which aggregates these products to the sum and sends this sum back to the slave nodes. Based on this information, slave nodes update and . So, the main computational cost of multiplying a matrix by a vector, can be distributed on slave nodes and the total working time will be smaller. It is not clear, how this algorithm can be implemented on a general network, for example when the data is produced by a distributed network of sensors without one master node. In contrast, as we illustrate in Section 3, the alternative accelerated-gradient-based approach can be implemented on an arbitrary network.
Our first step is to recall the IBP algorithm from Benamou et al. (2015), formulate the dual problem for (5) and show that our Algorithm 1 solves this dual problem and is equivalent to the IBP algorithm.
for and the following affine convex sets and
Algorithm 2 introduced in Benamou et al. (2015) consists in alternating projections to sets and w.r.t. Kullback–Leibler divergence, and is a generalization of Sinkhorn’s algorithm and a particular case of Dykstra’s projection algorithm.
As we will show below, this algorithm is equivalent to alternating minimization in dual problem of (5) presented in the next lemma.
The Lagrangian for (5) is equal to
where , , with convention . Using the change of variables and we obtain
Notice that and this condition allows uniquely reconstruct . Then by min-max theorem
By straightforward computations and the definition (10) of we obtain
and the minimum is attained at point . Thus we have
Let us prove it by induction. For it is obviously true. Assume it holds for some . Then
where comes from (12).
Thus for even
Lagrangian for this problem has form , hence the dual problem is
and as ,
Similarly, for odd
with Lagrangian and dual problem
The next lemma gives us explicit recurrent expressions for and defined in the previous lemma. Equation (15) immediately follows from (Benamou et al., 2015, Proposition 1), and equation (16) is a reformulation of (Benamou et al., 2015, Proposition 2).
For any , it holds
Bound can be derived in almost the same way as in Dvurechensky et al. (2018b). For it obviously holds. Let us denote by the minimal entry of :
As , we obtain for all
Hence at any update of it holds
For solution to the problem condition (16) also holds, and consequently the solution also meets derived bounds. ∎
Let us define an excess function
for further complexity analysis of Algorithm 1.
Let be generated by Algorithm 1. Then for any even we have
Gradient inequality of any convex function at point reads as
Applying the latter inequality to function at point we obtain
If is even, then the first term in r.h.s. vanishes. Notice that , thus
By Lemma 4 and , therefore
For any odd the following bound on the change of objective function holds:
If is odd, then satisfies (16) and . Therefore,
Here we used equations , i.e. and thus , and the following fact: if , , then
Indeed, let , then