On the Complexity of Approximating Wasserstein Barycenter

We study the complexity of approximating the Wasserstein barycenter of m discrete measures, or histograms of size n, by contrasting two alternative approaches that use entropic regularization. The first approach is based on the Iterative Bregman Projections (IBP) algorithm, for which our novel analysis gives a complexity bound proportional to mn^2/ε^2 to approximate the original non-regularized barycenter. Using an alternative accelerated-gradient-descent-based approach, we obtain a complexity proportional to mn^2.5/ε. As a byproduct, we show that the regularization parameter in both approaches has to be proportional to ε, which causes instability of both algorithms when the desired accuracy is high. To overcome this issue, we propose a novel proximal-IBP algorithm, which can be seen as a proximal gradient method that uses IBP on each iteration to make a proximal step. We also consider the question of scalability of these algorithms using approaches from distributed optimization and show that the first algorithm can be implemented in a centralized distributed setting (master/slave), while the second one is amenable to a more general decentralized distributed setting with an arbitrary network topology.

Introduction

Optimal transport (OT) Monge (1781); Kantorovich (1942) is currently attracting increasing attention in the statistics, machine learning and optimization communities. Statistical procedures based on optimal transport are available Bigot et al. (2012); Del Barrio et al. (2015); Ebert et al. (2017); Le Gouic and Loubes (2017), as are many applications in different areas of machine learning, including unsupervised learning Arjovsky et al. (2017); Bigot et al. (2017), semi-supervised learning Solomon et al. (2014), clustering Ho et al. (2017), and text classification Kusner et al. (2015). Optimal transport distances lead to the concept of the Wasserstein barycenter, which allows one to define a mean of a set of complex objects, e.g. images, that preserves their geometric structure Cuturi and Doucet (2014). In this paper we focus on the computational aspects of optimal transport, namely on approximating the Wasserstein barycenter of a set of histograms.

Starting with Altschuler et al. (2017), several groups of authors addressed the question of the complexity of approximating the Wasserstein distance Chakrabarty and Khanna (2018); Dvurechensky et al. (2018b); Blanchet et al. (2018); Lin et al. (2019). Implementable schemes are based on Sinkhorn's algorithm, which was first applied to OT in Cuturi (2013), and on accelerated gradient descent, proposed as an alternative in Dvurechensky et al. (2018b). Much less is known about the complexity of approximating the Wasserstein barycenter. The works Staib et al. (2017); Uribe et al. (2018a); Dvurechensky et al. (2018a) are in some sense close, but do not provide an explicit answer. Following Genevay et al. (2016), Staib et al. (2017) use stochastic gradient descent and estimate the convergence rate of their algorithm. From their rate, one can obtain the iteration complexity needed to achieve accuracy $\varepsilon$ in the approximation of the barycenter, up to a constant depending on the problem data, i.e. the transportation cost matrices, and a distance characterizing the solution of the dual problem. Dvurechensky et al. (2018a) consider the regularized barycenter, but do not show how to choose the regularization parameter to achieve $\varepsilon$-accuracy.

Following Dvurechensky et al. (2018b), we study two alternative approaches to approximating the Wasserstein barycenter based on entropic regularization Cuturi (2013). The first approach is based on the Iterative Bregman Projections (IBP) algorithm Benamou et al. (2015), which can be considered as a general alternating projections algorithm and also as a generalization of Sinkhorn's algorithm Sinkhorn (1974). The second approach is based on constructing a dual problem and solving it by primal-dual accelerated gradient descent. For both approaches, we show how the regularization parameter should be chosen in order to approximate the original, non-regularized barycenter.

We also address the question of scalability of computations in the big data regime, i.e. when both the size of the histograms and the number of histograms are large. In this case the dataset of histograms can be produced or stored in a distributed way over a network of agents/sensors/computers whose structure is given by an arbitrary connected graph. In the special case of a centralized architecture, i.e. when there is a central "master" node surrounded by "slave" nodes, parallel algorithms such as Staib et al. (2017) can be applied. In the more general setup of arbitrary networks it makes sense to use decentralized distributed algorithms in the spirit of distributed optimization algorithms Nedić et al. (2017); Scaman et al. (2017).

Related Work

It is very hard to cover the whole increasing stream of works on OT, and we mention the books Villani (2008); Santambrogio (2015); Peyré and Cuturi (2018), and the references therein, as a starting point. Approximation of the Wasserstein barycenter was considered in Cuturi and Doucet (2014); Bonneel et al. (2015); Benamou et al. (2015); Staib et al. (2017); Puccetti et al. (2018); Claici et al. (2018); Uribe et al. (2018a); Dvurechensky et al. (2018a) using different techniques such as Sinkhorn-type algorithms, first-order methods, and Newton-type methods. Considering the primal-dual approach based on accelerated gradient descent, our paper shares some similarities with Cuturi and Peyré (2016), with the main difference that we focus on the complexity and scalability of computations and explicitly analyze the algorithm applied to the dual problem.

There is a vast amount of literature on accelerated gradient descent, with the canonical reference being Nesterov (1983). Primal-dual extensions can be found in Lan et al. (2011); Tran-Dinh et al. (2018); Yurtsever et al. (2015); Chernov et al. (2016); Dvurechensky et al. (2016, 2017); Anikin et al. (2017); Lin et al. (2019). We focus on extensions amenable to decentralized distributed optimization, so that these algorithms can be scaled to large problems.

Distributed optimization algorithms were considered by many authors, with the classical reference being Bertsekas and Tsitsiklis (1989). Initial algorithms, such as Distributed Gradient Descent Nedic and Ozdaglar (2009), were relatively slow compared with their centralized counterparts. However, recent work has made significant advances towards a better understanding of the optimal rates of such algorithms and their explicit dependence on the function and network parameters Lan et al. (2017); Scaman et al. (2017); Uribe et al. (2018b). These approaches have been extended to other scenarios such as time-varying graphs Rogozin et al. (2018); Maros and Jaldén (2018); Wu and Lu (2017). The distributed setup is particularly interesting for machine learning applications in the big data regime, where both the number of data points and the dimensionality are large, due to its flexibility in handling intrinsically distributed storage, limited communication, and privacy constraints He et al. (2018); Wai et al. (2018).

Our contributions

  • We consider the $\gamma$-regularized Wasserstein barycenter problem and obtain complexity bounds for finding an approximation to the regularized barycenter by two algorithms. The first one is the Iterative Bregman Projections algorithm Benamou et al. (2015), for which we prove complexity proportional to $mn^2/(\gamma\varepsilon)$ to achieve accuracy $\varepsilon$. The second one is based on accelerated gradient descent (AGD) and has complexity proportional to $mn^{2.5}/\sqrt{\gamma\varepsilon}$. The benefit of the second algorithm is that it scales better and can be implemented in the decentralized distributed optimization setting over an arbitrary network.

  • We show how to choose the regularization parameter $\gamma$ in order to find an $\varepsilon$-approximation of the non-regularized Wasserstein barycenter, and find the resulting complexity to be proportional to $mn^2/\varepsilon^2$ for IBP and to $mn^{2.5}/\varepsilon$ for AGD.

  • As can be seen from the complexity bounds for IBP and AGD, their dependence on the regularization parameter $\gamma$ is quite bad, given that this parameter has to be small. To overcome this drawback we propose a proximal-IBP method, which can be considered as a proximal method that uses IBP on each iteration to find the next iterate; a schematic sketch of this outer/inner structure is given below.
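To convey the idea, here is a toy Python sketch (our illustration, not the exact method proposed in the paper) of the outer proximal-point structure; the function inner_ibp_step is a hypothetical stand-in for a call to IBP with a moderate, numerically stable regularization parameter.

import numpy as np

def inner_ibp_step(p_list, w, q_center, gamma):
    # Hypothetical stand-in for the inner IBP solver: it should return an
    # approximate minimizer of sum_l w_l W_gamma(p_l, q) + gamma * KL(q || q_center).
    # Here a toy surrogate (weighted average pulled toward q_center) is used,
    # so that the outer loop below runs end to end.
    q = sum(w_l * p_l for w_l, p_l in zip(w, p_list))
    q = np.sqrt(q * q_center)  # crude proximal pull toward the prox center
    return q / q.sum()

def proximal_ibp(p_list, w, gamma=0.1, n_outer=10):
    # Outer proximal loop: re-center a KL proximal term at the current
    # iterate and call the IBP-type inner solver on each iteration.
    q = np.full_like(p_list[0], 1.0 / p_list[0].size)
    for _ in range(n_outer):
        q = inner_ibp_step(p_list, w, q_center=q, gamma=gamma)
    return q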

1 Problem Statement and Preliminaries

1.1 Notation

We define the probability simplex as $\Delta_n := \{p \in \mathbb{R}^n_+ : \sum_{i=1}^n p_i = 1\}$. Given two discrete measures $p$ and $q$ from $\Delta_n$, we introduce the set of their coupling measures as

$$\Pi(p, q) := \left\{ X \in \mathbb{R}^{n \times n}_+ : X \mathbf{1} = p, \; X^\top \mathbf{1} = q \right\}.$$

For a coupling measure $X \in \Pi(p, q)$ we denote the negative entropy (up to an additive constant) as

$$H(X) := \langle X, \ln X \rangle = \sum_{i, j = 1}^n X_{ij} \ln X_{ij}.$$

Here and further, $\ln X$ ($\exp X$) denotes the element-wise logarithm (exponent) of a matrix or vector $X$, with the convention $0 \ln 0 := 0$. For two matrices $X, Y$ of the same size, we also define element-wise multiplication and element-wise division as $X \odot Y$ and $X \oslash Y$, respectively. The Kullback–Leibler divergence for measures $X, Y \in \mathbb{R}^{n \times n}_+$ is defined as the Bregman divergence associated with $H$:

$$\operatorname{KL}(X \,\|\, Y) := \sum_{i, j = 1}^n X_{ij} \ln \frac{X_{ij}}{Y_{ij}} - \sum_{i, j = 1}^n X_{ij} + \sum_{i, j = 1}^n Y_{ij}.$$

We also define a symmetric cost matrix $C \in \mathbb{R}^{n \times n}_+$, whose entry $C_{ij}$ is the cost of moving bin $i$ to bin $j$; $\|C\|_\infty$ denotes the maximal entry of this matrix.
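As a concrete illustration (our example, not from the paper), for histograms supported on $n$ equally spaced points of $[0, 1]$, the usual squared-distance cost matrix can be built as follows:

import numpy as np

n = 64
grid = np.linspace(0.0, 1.0, n)  # bin locations on [0, 1]
# C[i, j] is the squared distance between bin i and bin j
# (symmetric, non-negative, zero diagonal)
C = (grid[:, None] - grid[None, :]) ** 2
C_inf = C.max()  # the maximal entry of C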

We refer to $\lambda_{\max}(W)$ as the maximum eigenvalue of a symmetric matrix $W$ and to $\lambda^+_{\min}(W)$ as its minimal non-zero eigenvalue, and we define the condition number of the matrix $W$ as $\chi(W) := \lambda_{\max}(W) / \lambda^+_{\min}(W)$. We use the symbol $\mathbf{1}$ for a column of ones.

1.2 Wasserstein barycenters and entropic regularization

Given two probability measures $p, q \in \Delta_n$ and a cost matrix $C$, we define the optimal transportation distance between them as

$$\mathcal{W}(p, q) := \min_{X \in \Pi(p, q)} \langle C, X \rangle. \qquad (1)$$
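Since (1) is a linear program in the $n^2$ entries of $X$, for small $n$ it can be solved directly by a generic LP solver; the following sketch (our illustration using SciPy, not an algorithm from the paper) makes the definition concrete:

import numpy as np
from scipy.optimize import linprog

def ot_distance(p, q, C):
    # Solve min_{X in Pi(p, q)} <C, X> over x = vec(X) (row-major),
    # with equality constraints X 1 = p (rows) and X^T 1 = q (columns).
    n = len(p)
    A_rows = np.kron(np.eye(n), np.ones(n))  # row-sum constraints
    A_cols = np.kron(np.ones(n), np.eye(n))  # column-sum constraints
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
print(ot_distance(p, q, C))  # 0.25: mass 0.25 moves at unit cost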

For a given set of probability measures $p_1, \dots, p_m \in \Delta_n$ and cost matrices $C_1, \dots, C_m$, we define their weighted barycenter with weights $w \in \Delta_m$ as a solution of the following convex optimization problem:

$$\min_{q \in \Delta_n} \sum_{l=1}^m w_l \mathcal{W}_l(p_l, q), \qquad (2)$$

where $\mathcal{W}_l$ denotes the OT distance (1) with cost matrix $C_l$. Using the entropic regularization proposed in Cuturi (2013), we define the regularized OT distance for $\gamma > 0$:

$$\mathcal{W}_{\gamma, l}(p, q) := \min_{X \in \Pi(p, q)} \left\{ \langle C_l, X \rangle + \gamma H(X) \right\}. \qquad (3)$$

(3)

Respectively, one can consider the regularized barycenter, which is a solution of the following problem:

$$\min_{q \in \Delta_n} \sum_{l=1}^m w_l \mathcal{W}_{\gamma, l}(p_l, q). \qquad (4)$$
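To make definition (3) concrete, here is a minimal numpy sketch of the standard Sinkhorn iteration of Cuturi (2013) for a single pair of histograms; this is our illustration of the definition, not one of the algorithms analyzed below, and the fixed iteration count is a simplification.

import numpy as np

def sinkhorn(p, q, C, gamma, n_iter=1000):
    # Approximate W_gamma(p, q) = min_{X in Pi(p, q)} <C, X> + gamma * H(X)
    # by alternately rescaling the Gibbs kernel K = exp(-C / gamma).
    K = np.exp(-C / gamma)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)  # match column marginals
        u = p / (K @ v)    # match row marginals
    X = u[:, None] * K * v[None, :]  # transport plan diag(u) K diag(v)
    # tiny constant guards 0 * log(0) in the entropy term
    return np.sum(X * C) + gamma * np.sum(X * np.log(X + 1e-300))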

2 Complexity of WB by Iterative Bregman Projections

In this section we provide a theoretical analysis of the Iterative Bregman Projections algorithm Benamou et al. (2015) for the regularized Wasserstein barycenter and obtain its iteration complexity as a function of $\gamma$ and the desired accuracy. Then we estimate the bias introduced by the regularization and the value of $\gamma$ needed to obtain an $\varepsilon$-approximation of the non-regularized barycenter. Combining this result with the iteration complexity of IBP, we obtain complexity proportional to $mn^2/\varepsilon^2$ for approximating the non-regularized barycenter by the IBP algorithm. This algorithm can be implemented in a centralized distributed manner such that each node performs a number of arithmetic operations proportional to $n^2/\varepsilon^2$ and the number of communication rounds is proportional to $1/\varepsilon^2$.

2.1 Convergence of IBP for regularized barycenter

In this subsection we analyze the Iterative Bregman Projections algorithm (Benamou et al., 2015, Section 3.2) and estimate its complexity for solving problem (4). We slightly reformulate this problem as

$$\min \left\{ \sum_{l=1}^m w_l \left( \langle C_l, X_l \rangle + \gamma H(X_l) \right) : X_l \in \Pi(p_l, q), \; l = 1, \dots, m, \; q \in \Delta_n \right\} \qquad (5)$$

and construct its dual (see the details below). To solve the dual problem we reformulate the IBP algorithm as Algorithm 1. (In the original paper Benamou et al. (2015) there were misprints in the description of IBP; the authors' corrected variant can be found at https://github.com/gpeyre/2014-SISC-BregmanOT.) In Algorithm 1 we use different notation. First, our dual variables correspond to the logarithms of the corresponding scaling variables from Benamou et al. (2015). Secondly, our transport plan matrix equals the transpose of the transport plan matrix from Benamou et al. (2015). Thirdly, we build a slightly different dual problem by introducing an additional constraint, see Lemma 1; this allows us to simplify the calculations in line 3 of Algorithm 1. Notably, our reformulation of the IBP algorithm solves the primal and dual problems simultaneously and has an adaptive stopping criterion (see line 7), which does not require calculating any objective values.

0:  Cost matrices $C_1, \dots, C_m$, measures $p_1, \dots, p_m$, weights $w \in \Delta_m$, accuracy $\varepsilon' > 0$
1:  $t := 0$, $K_l := \exp(-C_l/\gamma)$, $u_l^0 := v_l^0 := 0$, $l = 1, \dots, m$
2:  repeat
3:     $s^t := \sum_{k=1}^m w_k \ln\left(K_k^\top e^{u_k^t}\right)$
4:     $v_l^{t+1} := s^t - \ln\left(K_l^\top e^{u_l^t}\right)$, $l = 1, \dots, m$
5:     $u_l^{t+1} := \ln p_l - \ln\left(K_l e^{v_l^{t+1}}\right)$, $l = 1, \dots, m$
6:     $t := t + 1$
7:  until $\sum_{l=1}^m w_l \left\| B_l(u_l^t, v_l^t)^\top \mathbf{1} - q^t \right\|_1 \le \varepsilon'$, where $B_l(u, v) := \operatorname{diag}(e^u) \, K_l \operatorname{diag}(e^v)$, $q^t := \exp(s^{t-1})$
7:  Output: $q^t$ and $X_l^t := B_l(u_l^t, v_l^t)$, $l = 1, \dots, m$
Algorithm 1 Dual Iterative Bregman Projection

Before we move to the analysis of the algorithm, let us discuss its scalability using the framework of centralized distributed computations. This framework includes a master node and m slave nodes. Each $l$-th slave node stores the data $C_l$, $p_l$ and the variables $u_l$, $v_l$. On each iteration $t$, it calculates the vector $\ln\left(K_l^\top e^{u_l^t}\right)$ and sends it to the master node, which aggregates these vectors into the weighted sum $s^t$ and sends this sum back to the slave nodes. Based on this information, the slave nodes update $u_l$ and $v_l$. Thus the main computational cost, multiplying a matrix by a vector, can be distributed over the slave nodes, and the total working time becomes smaller. It is not clear how this algorithm could be implemented over a general network, for example when the data is produced by a distributed network of sensors without a single master node. In contrast, as we illustrate in Section 3, the alternative accelerated-gradient-based approach can be implemented over an arbitrary network.
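In Python-like pseudocode, one iteration of this master/slave scheme could look as follows (our toy sketch, with the communication shown as plain function calls rather than real message passing):

import numpy as np

def slave_local_term(K_l, u_l):
    # Slave l: one local matrix-vector product, O(n^2) work.
    return np.log(K_l.T @ np.exp(u_l))

def ibp_iteration(K, p, w, u, v):
    # Each slave computes its local vector and sends it to the master.
    local = [slave_local_term(K_l, u_l) for K_l, u_l in zip(K, u)]
    # The master aggregates the m vectors and broadcasts the weighted sum.
    s = sum(w_l * t_l for w_l, t_l in zip(w, local))
    # Slaves update v and u locally; no further communication is needed.
    v = [s - t_l for t_l in local]
    u = [np.log(p_l) - np.log(K_l @ np.exp(v_l))
         for p_l, K_l, v_l in zip(p, K, v)]
    return u, v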

Theorem 1.

For a given $\varepsilon' > 0$, Algorithm 1 stops after a number of iterations $N$ satisfying

It returns approximate primal and dual solutions such that

and

(6)

where $q^*$ is a solution of problem (5).

Our first step is to recall the IBP algorithm from Benamou et al. (2015), formulate the dual problem for (5) and show that our Algorithm 1 solves this dual problem and is equivalent to the IBP algorithm.

Following the approach of Benamou et al. (2015), we present problem (5) in the form of a Kullback–Leibler projection:

$$\min_{X \in \mathcal{C}_1 \cap \mathcal{C}_2} \sum_{l=1}^m w_l \operatorname{KL}\left(X_l \,\|\, K_l\right) \qquad (7)$$

for $K_l := \exp(-C_l/\gamma)$, $X := (X_1, \dots, X_m)$, and the following affine convex sets $\mathcal{C}_1$ and $\mathcal{C}_2$:

$$\mathcal{C}_1 := \left\{ X : X_l \mathbf{1} = p_l, \; l = 1, \dots, m \right\}, \quad \mathcal{C}_2 := \left\{ X : X_1^\top \mathbf{1} = \dots = X_m^\top \mathbf{1} \right\}. \qquad (8)$$

Algorithm 2, introduced in Benamou et al. (2015), consists in alternating projections onto the sets $\mathcal{C}_1$ and $\mathcal{C}_2$ w.r.t. the Kullback–Leibler divergence; it is a generalization of Sinkhorn's algorithm and a particular case of Dykstra's projection algorithm.

0:  Cost matrices $C_1, \dots, C_m$, probability measures $p_1, \dots, p_m$, weights $w \in \Delta_m$, starting transport plans $X_l^0 := \exp(-C_l/\gamma)$, $l = 1, \dots, m$; $t := 0$
1:  repeat
2:     if $t$ is even then
3:        $X_l^{t+1} := \operatorname{diag}\left(p_l \oslash (X_l^t \mathbf{1})\right) X_l^t$, $l = 1, \dots, m$ (projection onto $\mathcal{C}_1$)
4:     else
5:        $X_l^{t+1} := X_l^t \operatorname{diag}\left(q^t \oslash (X_l^{t \top} \mathbf{1})\right)$ with $q^t := \prod_{l=1}^m \left(X_l^{t \top} \mathbf{1}\right)^{w_l}$ (projection onto $\mathcal{C}_2$)
6:     end if
7:     $t := t + 1$
8:  until converged
8:  Output: $X_l^t$, $l = 1, \dots, m$
Algorithm 2 Iterative Bregman Projections, see Benamou et al. (2015)
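A compact numpy version of these alternating projections in the scaling form (our sketch; the stopping rule here is a fixed iteration budget rather than the adaptive criterion of Algorithm 1) could look as follows:

import numpy as np

def ibp_barycenter(C_list, p_list, w, gamma, n_iter=500):
    # Iterative Bregman Projections for the regularized barycenter:
    # alternate KL projections onto C_1 (rows of X_l equal p_l) and
    # C_2 (all column marginals equal), for X_l = diag(u_l) K_l diag(v_l).
    K = [np.exp(-C / gamma) for C in C_list]
    n = p_list[0].size
    u = [np.ones(n) for _ in K]
    v = [np.ones(n) for _ in K]
    for _ in range(n_iter):
        # Projection onto C_1: fix the row marginals p_l.
        u = [p_l / (K_l @ v_l) for p_l, K_l, v_l in zip(p_list, K, v)]
        # Projection onto C_2: the common column marginal q is the
        # weighted geometric mean of the current column marginals.
        col = [K_l.T @ u_l for K_l, u_l in zip(K, u)]
        q = np.exp(sum(w_l * np.log(c_l) for w_l, c_l in zip(w, col)))
        v = [q / c_l for c_l in col]
    return q / q.sum()

For example, ibp_barycenter([C, C], [p1, p2], w=[0.5, 0.5], gamma=0.01) returns an approximation of the entropic barycenter of two histograms with a common cost matrix C.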

As we show below, this algorithm is equivalent to alternating minimization in the dual problem of (5), presented in the next lemma.

Lemma 1.

The dual problem of (5) is (up to a multiplicative constant)

$$\min_{\substack{u_1, \dots, u_m, \, v_1, \dots, v_m \in \mathbb{R}^n \\ \sum_{l=1}^m w_l v_l = 0}} \; \sum_{l=1}^m w_l \left( \ln \left\langle \mathbf{1}, B_l(u_l, v_l) \mathbf{1} \right\rangle - \langle u_l, p_l \rangle \right), \qquad (9)$$

where $\sum_{l=1}^m w_l v_l = 0$ is the additional constraint mentioned above, and

$$B_l(u_l, v_l) := \operatorname{diag}\left(e^{u_l}\right) K_l \operatorname{diag}\left(e^{v_l}\right), \quad l = 1, \dots, m. \qquad (10)$$

Moreover, the solution to (5) is given by the formula

$$X_l = \frac{B_l(u_l^*, v_l^*)}{\left\langle \mathbf{1}, B_l(u_l^*, v_l^*) \mathbf{1} \right\rangle}, \quad l = 1, \dots, m, \qquad (11)$$

where $(u^*, v^*)$ is a solution to the problem (9).

Proof.

The Lagrangian for (5) is equal to

where we use the convention $0 \ln 0 := 0$. Using a change of variables we obtain

(12)

Notice that this condition allows us to uniquely reconstruct the eliminated variable. Then by the min-max theorem

By straightforward computations and the definition (10) of $B_l$ we obtain

and the minimum is attained at this point. Thus we have

Consequently, the dual problem of (5) is equivalent to (9), and the solution of (5) has the form (11). ∎

The following lemma shows that Algorithm 2 is equivalent to alternating minimization in the dual problem (9) (which is a general fact for Dykstra's algorithm).

Lemma 2.

The sequence generated by Algorithm 2 has the form $X_l^t = B_l(u_l^t, v_l^t)$, where $B_l$ is defined in (10) and

$$u^{t+1} \in \operatorname*{arg\,min}_{u} f(u, v^t), \quad v^{t+1} = v^t \quad \text{for even } t, \qquad (13)$$

$$v^{t+1} \in \operatorname*{arg\,min}_{v : \sum_{l=1}^m w_l v_l = 0} f(u^t, v), \quad u^{t+1} = u^t \quad \text{for odd } t, \qquad (14)$$

where $f(u, v)$ denotes the objective in (9).
Proof.

Let us prove this by induction. For $t = 0$ it is obviously true. Assume it holds for some $t$. Then

Therefore,

where the last step comes from (12).

Thus, for even $t$,

Writing the Lagrangian for this problem, we obtain the dual problem

and since $\sum_{l=1}^m w_l v_l = 0$,

Similarly, for odd $t$,

with the corresponding Lagrangian and dual problem

Consequently, (13) and (14) hold. ∎

The next lemma gives explicit recurrent expressions for the variables $u^t$ and $v^t$ defined in the previous lemma. Equation (15) follows immediately from (Benamou et al., 2015, Proposition 1), and equation (16) is a reformulation of (Benamou et al., 2015, Proposition 2).

Lemma 3.

Equation (13) for even $t$ is equivalent to

$$u_l^{t+1} = u_l^t + \ln p_l - \ln\left(B_l(u_l^t, v_l^t) \mathbf{1}\right), \quad v_l^{t+1} = v_l^t, \quad l = 1, \dots, m, \qquad (15)$$

and equation (14) for odd $t$ is equivalent to

$$v_l^{t+1} = v_l^t + \ln q^t - \ln\left(B_l(u_l^t, v_l^t)^\top \mathbf{1}\right), \quad u_l^{t+1} = u_l^t, \quad l = 1, \dots, m, \qquad (16)$$

where $\ln q^t := \sum_{k=1}^m w_k \ln\left(B_k(u_k^t, v_k^t)^\top \mathbf{1}\right)$, $t \ge 0$.

Now we turn to the proof of Theorem 1. The next lemmas are preliminary steps towards the proof of correctness and the complexity bound for Algorithm 1.

Lemma 4.

For any $t \ge 0$, it holds that

where

(17)
Proof.

The bound can be derived in almost the same way as in Dvurechensky et al. (2018b). For $t = 0$ it obviously holds. Let us denote the minimal entry of $K_l$ by

Then we obtain, for all $l$,

therefore

Hence, at any update of the variables it holds that

The solution of the problem satisfies condition (16) as well, and consequently the solution also meets the derived bounds. ∎

Let us define the excess function

for the further complexity analysis of Algorithm 1.

Lemma 5.

Let the iterates $(u^t, v^t)$ be generated by Algorithm 1. Then for any even $t$ we have

(18)
Proof.

The gradient inequality for a convex function reads as

Applying the latter inequality to the dual objective, we obtain

If $t$ is even, then the first term on the right-hand side vanishes; thus

where

By Lemma 4, therefore,

Lemma 6.

For any odd $t$, the following bound on the change of the objective function holds:

where

Proof.

If $t$ is odd, then $v^{t+1}$ satisfies (16) and $u^{t+1} = u^t$. Therefore,

Here we used the equations defining the updates, and the following fact:

Indeed,