We consider the problem of finding a mean of probability measures. Observing a sample of probability measures from some probabilistic family, we are interested in their population mean. Since the geometric structure of the problem is not Euclidean, we rely on the optimal transport (OT) problem, formulated long ago by G. Monge [Monge (1781)] and later developed by L. Kantorovich [Kantorovich (1942)]. Nowadays, OT is a popular framework used in clustering [Ho et al. (2017)], text classification [Kusner et al. (2015)], image retrieval [Rubner et al. (2000)], computer vision [Ni et al. (2009)], economics and finance [Beiglböck et al. (2013); Rachev et al. (2011)]. The OT problem gave rise to a distance function usually called the Wasserstein distance, or the Monge–Kantorovich distance. This distance measures how much one object differs from another, even for non-linear objects such as probability measures or histograms. To define the notion of barycenter we consider the 2-Wasserstein distance between two probability measures and supported on a complete metric space with metric
where is the set of probability measures on the product space with respective marginals and . In the Wasserstein space of all probability measures with metric we define the Fréchet mean, which generalizes the usual mean to non-linear spaces [Fréchet (1948)]
where is the 2-Wasserstein distance. We refer to the minimizer of (2), which is also a probability measure, as the population (Fréchet) mean. In this work, we consider the specific case where the measures are discrete with finite support of size , which reduces problem (1) to a finite linear program requiring [Tarjan (1997)] arithmetic operations to solve. Approximation of a probability measure by a measure with finite support was studied in [Genevay et al. (2018); Mena and Weed (2019); Panaretos and Zemel (2019); Weed et al. (2019)]. To reduce the computational cost of solving the linear program, entropic regularization was proposed [Cuturi (2013)]. It reduces the computational complexity to (footnote 1: the estimate is the best known theoretical estimate for solving the OT problem [Blanchet et al. (2018); Jambulapati et al. (2019); Lee and Sidford (2014); Quanrud (2018)]). Moreover, entropic regularization improves the statistical properties of the Wasserstein distance itself [Klatt et al. (2018); Bigot et al. (2019)]. This regularization shows good results in generative models [Genevay et al. (2017)], multi-label learning [Frogner et al. (2015)], dictionary learning [Rolet et al. (2016)], image processing [Cuturi and Peyré (2016); Rabin and Papadakis (2015)], and neural imaging [Gramfort et al. (2015)]. A comprehensive survey of OT and Wasserstein barycenters is given in [Peyré and Cuturi (2018)]. In this paper we aim at constructing an -approximation of the population Wasserstein barycenter. To do so, we estimate the number of sampled measures needed to reach -precision in function value. We consider online and offline algorithms and compare their convergence rates. Both types of approaches (online and offline) have pros and cons, depending on the specifics of the problem, which are covered in this paper. Generally, starting from the work [Nemirovski et al. (2009)], SA (Stochastic Approximation) has been considered to be better than SAA (Sample Average Approximation). On the example of the population Wasserstein barycenter problem we demonstrate that SAA can be superior to SA for certain values of the regularization parameter.
The main reason is that computing the gradient of the dual function for is times cheaper than computing the gradient of the primal one. We emphasize that passing to the dual function is possible only in the SAA approach. Furthermore, in this paper we study an -confidence interval for the population Wasserstein barycenter defined w.r.t. entropy-regularized OT. We choose entropic regularization because it makes the OT objective strongly convex, which allows us to state convergence in argument.
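For reference, the two objects discussed in this introduction, the 2-Wasserstein distance (1) and the Fréchet mean (2), are commonly written as below. This is a sketch in assumed notation: the symbols $\mu$, $\nu$, $\pi$, $\Pi(\mu,\nu)$, $\mathcal{X}$, $d$, $\mathcal{P}_2(\mathcal{X})$ and $\mathbb{P}$ are chosen here for illustration and may differ from the ones used in the rest of the paper.

```latex
% Assumed notation: (\mathcal{X}, d) is a complete metric space,
% \Pi(\mu,\nu) is the set of couplings of \mu and \nu,
% \mathbb{P} is the distribution from which the measures \nu are sampled.
\begin{align}
W_2(\mu,\nu) &= \Big( \inf_{\pi \in \Pi(\mu,\nu)}
    \int_{\mathcal{X}\times\mathcal{X}} d^2(x,y)\, d\pi(x,y) \Big)^{1/2}, \tag{1}\\
\mu^{*} &\in \arg\min_{\mu \in \mathcal{P}_2(\mathcal{X})}
    \mathbb{E}_{\nu \sim \mathbb{P}}\, W_2^2(\mu,\nu). \tag{2}
\end{align}
```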
1.1 Related work
Consistency of the empirical barycenter with respect to its population counterpart as the number of measures tends to infinity was considered in many papers, e.g., [Le Gouic and Loubes (2015); Panaretos and Zemel (2019); Le Gouic and Loubes (2017); Bigot and Klein (2012); Rios et al. (2018)]. In [Bigot et al. (2017)] convergence of the empirical barycenter to its population counterpart in the Wasserstein metric was also studied. However, these works do not provide any rates of convergence. A rate of convergence can be found in [Boissard et al. (2015)] for the problem of template estimation. The authors provide a confidence interval for the population barycenter, approximating it by an iterated barycenter in the Wasserstein space. However, they only consider probability measures obtained by template deformations with specific properties, e.g., the expectation of a deformation function from the admissible family of deformations is the identity. Our aim is to drop these particular conditions. Without any assumptions on the process generating the observed measures, a rate of convergence of the empirical Wasserstein barycenter towards its population counterpart can be found in [Bigot et al. (2018)]. However, it is only valid when the measures have one-dimensional support. Our approaches to constructing a confidence interval rely heavily on the results of [Shalev-Shwartz et al. (2009)] for stochastic convex optimization.
We summarize our contribution as follows. To the best of our knowledge, this is the first paper that provides a confidence interval for the population Wasserstein barycenter together with the rates of convergence for computing its approximation. Our first result is that SAA achieves better rates of convergence than SA for approximating the population Wasserstein barycenter defined w.r.t. entropy-regularized OT, for a proper value of the regularization parameter. We comment on its value in Section 6. Our second result is a new regularization in the SAA approach for computing the population Wasserstein barycenter (Section 5). To the best of our knowledge, this regularization improves the convergence bounds compared to the state-of-the-art regularization from [Shalev-Shwartz et al. (2009)] for general convex functions. Finally, we show that for computing an approximation of the population Wasserstein barycenter, stochastic mirror descent and the risk minimization approach with our new regularization have better complexity bounds than SA and SAA.
1.3 Paper organization
The structure of the paper is the following. Sections 3 and 4 present the SA and SAA approaches, respectively, for finding a confidence interval for the barycenter w.r.t. regularized OT. In Section 5 we estimate the barycenter w.r.t. OT using other algorithms with a particular structure of the regularization constant, including a zero constant. In Section 6 we compare the rates of convergence. Finally, we present concluding remarks.
2 Wasserstein distance and Wasserstein barycenter: Notations and Properties
In this paper we consider discrete probability measures given in the probability simplex
Measures and are discrete if they can be presented in the form and , where is the Dirac measure at point ; and are histograms. Let , then we define transportation polytope
We define optimal transport (OT) problem between discrete probability measures as follows
Here is the cost matrix: is the cost of moving a unit of mass from point to point . When , where is the distance on the support points of the probability measures , then is known as the 2-Wasserstein distance on (footnote 2: we skip the sub-index 2 for simplicity). We consider entropic OT, introduced in [Peyré and Cuturi (2018)]
For statistical explanation of such regularization see [Rigollet and Weed (2018)]. Further we define the notion of population barycenter of probability measures w.r.t. regularized OT by using the notion of Fréchet mean
We refer to empirical barycenter as the empirical counterpart of
If , then is population Wasserstein barycenter (i.e. w.r.t. OT) and its empirical counterpart.
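The definitions of this section can be summarized in one display. The block below is a sketch in assumed notation, following the standard discrete setting of [Cuturi (2013); Peyré and Cuturi (2018)]: the symbols $S_n(1)$, $U(p,q)$, $C$, $\gamma$, $W$, $W_\gamma$, $m$ and $q_1,\dots,q_m$ are chosen here for illustration, and the entropic term is written in one common form (other works use an equivalent variant that differs by a linear term).

```latex
% Assumed notation: p, q are histograms of size n, C is the n x n cost matrix,
% \gamma \ge 0 is the regularization parameter, q_1, ..., q_m is the sample.
\begin{align*}
S_n(1) &= \Big\{ p \in \mathbb{R}^n_+ : \sum_{i=1}^n p_i = 1 \Big\}, \\
U(p,q) &= \big\{ \pi \in \mathbb{R}^{n\times n}_+ : \pi \mathbf{1} = p,\ \pi^\top \mathbf{1} = q \big\}, \\
W(p,q) &= \min_{\pi \in U(p,q)} \langle C, \pi \rangle, \\
W_\gamma(p,q) &= \min_{\pi \in U(p,q)} \Big\{ \langle C, \pi \rangle
    + \gamma \sum_{i,j=1}^n \pi_{ij} \log \pi_{ij} \Big\}, \\
p^{*}_\gamma &= \arg\min_{p \in S_n(1)} \mathbb{E}_q\, W_\gamma(p,q), \qquad
\hat{p}_\gamma = \arg\min_{p \in S_n(1)} \frac{1}{m} \sum_{k=1}^m W_\gamma(p,q_k).
\end{align*}
```

With $\gamma = 0$ the last line reduces to the population and empirical Wasserstein barycenters w.r.t. unregularized OT.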
For our convenience we also define the following notation
We write Õ(·) for complexity bounds that hide constants and logarithmic factors.
2.1 Dual formulation and properties
The structure of allows us to write Lagrangian dual function with dual variables and to the constraints and respectively
Here we denoted by the Lagrangian dual function for . Let be a solution of problem (2.1), then since dual function is strictly convex we have
is the smallest eigenvalue of a positive semi-definite matrix. Note that, from a theoretical point of view, we do not know any lower bound for that is better than exponentially small in . Moreover,
We use the notation for the gradient w.r.t. . [ properties] The entropy-regularized Wasserstein distance is
-strongly convex in w.r.t 2-norm: for any
-Lipschitz in w.r.t 2-norm: for any
where ( is Lipschitz constant w.r.t. -norm, .
The first proposition follows from [Kakade et al. (2009); Nesterov (2005)]. Indeed, according to [Peyré and Cuturi (2018); Nesterov (2005)] the gradient of function in (2.1) is -Lipschitz continuous in 2-norm. From [Kakade et al. (2009)] we may conclude that in this case is -strongly convex w.r.t. in 2-norm [Nesterov (2005)]. The second proposition follows from [Dvurechensky et al. (2018); Kroshnin et al. (2019b); Lin et al. (2019)]. Note that this result requires additional assumptions about the separation of the considered measures from zero. However, without loss of generality the general case can be reduced to this one [Dvurechensky et al. (2018); Kroshnin et al. (2019b)]. The next two sections present the construction of an -confidence interval for the population barycenter defined w.r.t. the regularized Wasserstein distance.
3 Population barycenter with respect to regularized OT. Online (SA) approach
In this section we assume that probability measures arrive in an online regime.
One benefit of the online approach is that the number of measures need not be fixed in advance, which makes it possible to adjust the precision of the computed barycenter. Moreover, storing a large number of measures on a computing node is not an issue if we have access to an online oracle, e.g., some measuring device. Using online-to-batch conversion [Shalev-Shwartz et al. (2009)] we build a confidence interval for the population barycenter defined w.r.t. regularized OT and provide complexity bounds for doing so.
The following algorithm computes an online sequence of measures by online stochastic gradient descent, which at each iteration calls the Sinkhorn algorithm to compute an approximation of the gradient of the entropy-regularized Wasserstein distance .
where is an output of the Sinkhorn algorithm [Dvurechensky et al. (2018); Peyré and Cuturi (2018)] after iterations. To approximate the population barycenter by the outputs of Algorithm 1 we use online-to-batch conversion [Shalev-Shwartz et al. (2009)] and define as the average of online outputs from Algorithm 1
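The online scheme described above (stochastic gradient steps with a Sinkhorn-based gradient, followed by averaging of the iterates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the function names, the step size 1/(gamma * k), the fixed number of Sinkhorn iterations, the positivity floor on p, and the toy sampler on three support points are all hypothetical choices made here. The gradient in p is taken as the centred dual potential gamma * log(u) returned by the Sinkhorn scalings.

```python
import math
import random


def project_simplex(x):
    """Euclidean projection of x onto the probability simplex (sort-based)."""
    s = sorted(x, reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, s_k in enumerate(s, start=1):
        cumsum += s_k
        t = (cumsum - 1.0) / k
        if s_k - t > 0:
            theta = t  # the last k passing the test defines the threshold
    return [max(x_i - theta, 0.0) for x_i in x]


def sinkhorn_gradient(p, q, cost, gamma, n_iter=200, floor=1e-12):
    """Approximate gradient of W_gamma(p, q) in p via Sinkhorn scalings."""
    n = len(p)
    # Assumption: keep p strictly positive so that log(u) stays finite.
    p = [max(p_i, floor) for p_i in p]
    total = sum(p)
    p = [p_i / total for p_i in p]
    kernel = [[math.exp(-cost[i][j] / gamma) for j in range(n)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(n_iter):
        u = [p[i] / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [q[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(n)]
    lam = [gamma * math.log(u_i) for u_i in u]  # dual potential up to a constant
    mean = sum(lam) / n
    return [lam_i - mean for lam_i in lam]


def online_barycenter(sample_measure, cost, gamma, n_steps):
    """Projected SGD on the simplex; returns the average of the iterates."""
    n = len(cost)
    p = [1.0 / n] * n
    avg = [0.0] * n
    for k in range(1, n_steps + 1):
        q = sample_measure()                    # one fresh measure per step
        grad = sinkhorn_gradient(p, q, cost, gamma)
        eta = 1.0 / (gamma * k)                 # assumed strongly convex step size
        p = project_simplex([p[i] - eta * grad[i] for i in range(n)])
        avg = [avg[i] + (p[i] - avg[i]) / k for i in range(n)]  # running average
    return avg


# Toy usage: random histograms on three points with squared-distance cost.
rng = random.Random(0)
points = [0.0, 1.0, 2.0]
cost = [[(a - b) ** 2 for b in points] for a in points]


def sample_measure():
    w = [rng.random() + 0.1 for _ in points]
    total = sum(w)
    return [w_i / total for w_i in w]


barycenter = online_barycenter(sample_measure, cost, gamma=0.5, n_steps=100)
```

The averaged output stays in the simplex because it is a convex combination of projected iterates. For larger supports or small gamma a log-domain Sinkhorn would be needed to avoid underflow in the kernel.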
Next we formulate two theorems that describe the precision of as an approximation of the population barycenter in function value and in argument. With probability for from (9) the following holds
where The proof follows the proof in [Kakade and Tewari (2009)], accounting for the accumulation of errors due to the inexactness of the gradient. [Barycenter confidence interval] With probability for from (9) the following holds
The proof follows directly from Theorem 3, strong convexity of and . Using the (not online) algorithm from [Juditsky and Nesterov (2014)] instead of the online algorithm avoids the accumulation of inexactness : the term in Theorems 3, 3 can be replaced by . From Theorem 3 we can immediately derive the number of probability measures taken as inputs of Algorithm 1 and the precision with which the Sinkhorn algorithm is run at each iteration of Algorithm 1. [Number of probability measures and auxiliary precision] To get the -confidence region:
it suffices to take the following number of probability measures (iterations)
and the following -precision
where we use . The proof follows from the complexity of Sinkhorn algorithm. To state the complexity of Sinkhorn we firstly define as (see also (8))
The complexity for Accelerated Sinkhorn can be improved [Guminov et al. (2019)]
Multiplying both of these estimates by the number of measures (iterations), taking the minimum and using Corollary 3 we get the statement of the theorem. Suppose that after iterations of the Sinkhorn algorithm we get approximate dual solutions and . According to [Franklin and Lorenz (1989)] the computed variables and (see (2.1)) converge to the true variables and in the Hilbert–Birkhoff metric :
Since all norms are equivalent in finite-dimensional spaces, we can obtain by a proper choice of (8). The number of iterations will be proportional to [Franklin and Lorenz (1989)]; however, in general the theoretical constant in front of the logarithm can be too large to give good theoretical results. Nevertheless, careful calculations yield a result like (14), where can be replaced by for some . This means that in (12) we may take to be . This is better than the direct bound from the definition.
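To make the Hilbert–Birkhoff (Hilbert projective) metric mentioned above concrete, here is a small sketch. It uses the standard definition for strictly positive vectors, the logarithm of the ratio between the largest and the smallest componentwise quotient; the function name `hilbert_metric` is chosen here for illustration. In the log domain (for dual variables a and b) the same quantity is max(a - b) - min(a - b).

```python
import math


def hilbert_metric(x, y):
    """Hilbert projective metric between two strictly positive vectors."""
    ratios = [x_i / y_i for x_i, y_i in zip(x, y)]
    return math.log(max(ratios) / min(ratios))


# The metric ignores positive rescaling, which matches the fact that
# Sinkhorn scaling vectors are only defined up to a multiplicative constant.
same_ray = hilbert_metric([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
swapped = hilbert_metric([1.0, 2.0], [2.0, 1.0])
```

Because it is only a metric on rays, converting a bound in this metric into a bound in a norm requires fixing a normalization of the vectors first.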
4 Barycenter with respect to regularized OT: Offline (SAA) approach
In this section we suppose that we sample measures in advance. We construct a confidence interval for the population barycenter by computing an approximation of the empirical barycenter. Moreover, we also provide the total complexity bounds for doing so. This offline setting can be relevant when we are interested in parallelization or decentralization.
We refer to as the approximated empirical barycenter of if it satisfies the following inequality for some precision
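A natural way to write this approximation condition is the following. It is a sketch in assumed notation: $\hat{p}_{\varepsilon'}$ for the approximate empirical barycenter, $\varepsilon'$ for the auxiliary precision, $m$ for the sample size, and $S_n(1)$ for the probability simplex are symbols chosen here for illustration.

```latex
% Assumed notation: W_\gamma is entropy-regularized OT, q_1, ..., q_m the sample.
\begin{equation*}
\frac{1}{m} \sum_{k=1}^m W_\gamma(\hat{p}_{\varepsilon'}, q_k)
 - \min_{p \in S_n(1)} \frac{1}{m} \sum_{k=1}^m W_\gamma(p, q_k)
 \le \varepsilon'.
\end{equation*}
```

In words, the empirical objective at the approximate barycenter exceeds its minimum over the simplex by at most the auxiliary precision.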
Now suppose that we somehow find . The following theorem estimates the precision for approximation of in function value. For from (15) with probability the following holds
Consider for any the following difference
From Theorem 6 from [Shalev-Shwartz et al. (2009)] with probability at least for the empirical minimizer the following holds
From Lipschitz continuity of we have
From strong convexity of we get
By using (19) and (20) for (17) and taking we get the first inequality of the theorem.
Then, using strong convexity of function , we formulate the result on convergence in argument. [Barycenter confidence interval] For from (15) with probability at least
The proof consists in applying strong convexity of to Theorem 4. From Theorem 20 we estimate the number of measures and the auxiliary precision of the fidelity term (15) needed to get an -confidence interval with probability [Number of probability measures and auxiliary precision] To get the -confidence region:
it suffices to take the following number of probability measures
and auxiliary -precision in (15):
The next theorem estimates the complexity of computing , which is an approximation of the population barycenter . The total complexity per node of the offline algorithm from [Kroshnin et al. (2019b)] is
where is the parameter of the architecture:
Moreover, for one node architecture (without parallelization) the complexity can be simplified
To calculate the total complexity we refer to Algorithm 6 in [Kroshnin et al. (2019b)] providing . For the reader's convenience we repeat the scheme of the proof. This algorithm belongs to the class of Fast Gradient Methods for Lipschitz-smooth functions and, consequently, has complexity [Nesterov (2018)]. Here is the constant for the dual function from (2.1) ( from the proof of Proposition 5) and is the radius for the dual solution (Lemma 8 from [Kroshnin et al. (2019b)]). Combining all of this we get the following number of iterations
where we denoted by the parameter of the architecture. Multiplying by the complexity of calculating the gradient for the dual function (which is ) and using Corollary 4 for definition of we get the following complexity per each node
Using Corollary 4 for the number of measures we get the first statement of the theorem. Using for the one-machine architecture we get the second statement, which finishes the proof. From the recent results of [Feldman and Vondrák (2019)] we may expect that the dependence on in (22) and (25) is in fact much better (logarithmic). Unfortunately, as far as we know, in general (for not small ) this is still a hypothesis. In the next section we construct a confidence interval for the population Wasserstein barycenter .
5 Population Wasserstein barycenter problem
In the previous sections we aimed at constructing a confidence interval for the population barycenter defined w.r.t. regularized OT . Now we drop the regularization of OT and seek the population barycenter w.r.t. OT . To do so, we first use the results from Sect. 3 and 4 (SA and SAA algorithms), and then consider two other methods, one of which is based on our new regularization. Since is not strongly convex, in this section we construct an -confidence interval in function value.
Throughout this section we use the following notation
5.1 SA and SAA
We start with application of the results obtained by SA in Sect. 3. We regularize by entropy with the definite regularization parameter . Here is a desired accuracy in function value [Dvurechensky et al. (2018); Peyré and Cuturi (2018); Weed (2018)]
Taking regularization parameter we ensure the following
The last inequality allows us to modify Theorem 3 and get the complexity of calculating as an -approximation for in function value. To ensure for from (9) with probability , it suffices to take the following number of probability measures (iterations) in restarted SGD from [Juditsky and Nesterov (2014)]
and the following auxiliary -precision . The total complexity will be
Similarly, we use the results of SAA (Theorem 23) to state the complexity bounds for computing an approximation of . For simplicity we provide results only for the one-machine architecture. To ensure with probability it suffices to take the following number of probability measures
and the following auxiliary -precision
The total complexity of the algorithm from [Kroshnin et al. (2019b)] on one machine (without parallelization/decentralization) is
5.2 Stochastic Mirror Descent
Another approach to constructing an -confidence interval in function value is to drop any regularization and use stochastic mirror descent with 1-norm and KL-prox structure, see, e.g., [Hazan et al. (2016); Nemirovski et al. (2009); Orabona (2019)]; for the inexact case see, e.g., [Gasnikov et al. (2016); Juditsky and Nemirovski (2012)]. (Footnote 3: by using the Dual Averaging scheme [Nesterov (2009)] we can rewrite Algorithm 2 in an online regime without including in the step-size policy. Note that the Mirror Descent and Dual Averaging schemes are very close to each other [Juditsky et al. (2019)].)
-th) component of a vector,is calculated with -precision (e.g., by Simplex Method or Interior Point Method)
where , see [Dvurechensky et al. (2018); Kroshnin et al. (2019b); Lin et al. (2019)]. Bound (27) is times better than the bound for Stochastic Gradient Descent with the Euclidean setup [Nemirovski et al. (2009); Shalev-Shwartz et al. (2009)]. We also note that the smoothed complexity of finding the exact is , see [Dadush and Huiberts (2018)] and references therein. To ensure with probability it suffices to make the following number of iterations of Algorithm 2
The total complexity is
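The KL-prox (entropic) mirror descent update used in this subsection has a simple closed form on the simplex: each coordinate is multiplied by the exponential of the negative scaled gradient and the result is renormalized. The sketch below illustrates only this update and the averaging of iterates. It is not the paper's Algorithm 2: the names `softmax`, `stochastic_mirror_descent`, `grad_oracle` and `grad_bound`, and the constant step size sqrt(2 ln n) / (grad_bound * sqrt(N)) are assumptions made here. In the text the stochastic subgradient is an (inexact) optimal dual variable of the OT linear program; the toy linear oracle in the usage lines merely stands in for it.

```python
import math


def softmax(z):
    """Map log-weights to a point of the simplex in a numerically stable way."""
    shift = max(z)
    w = [math.exp(z_i - shift) for z_i in z]
    total = sum(w)
    return [w_i / total for w_i in w]


def stochastic_mirror_descent(grad_oracle, sample, n, n_steps, grad_bound):
    """Entropic mirror descent on the simplex; returns the averaged iterate."""
    # Assumed constant step size for a gradient bound in the infinity-norm.
    eta = math.sqrt(2.0 * math.log(n)) / (grad_bound * math.sqrt(n_steps))
    z = [0.0] * n        # log-weights; softmax(z) is the uniform starting point
    avg = [0.0] * n
    for k in range(1, n_steps + 1):
        p = softmax(z)
        avg = [a + (p_i - a) / k for a, p_i in zip(avg, p)]  # running average
        g = grad_oracle(p, sample())
        # Multiplicative update p_i * exp(-eta * g_i), done in the log domain.
        z = [z_i - eta * g_i for z_i, g_i in zip(z, g)]
    return avg


# Toy usage: minimize the linear loss <c, p>, whose gradient is c itself.
toy_cost = [1.0, 0.0, 0.5]
p_avg = stochastic_mirror_descent(
    grad_oracle=lambda p, c: c,
    sample=lambda: toy_cost,
    n=3,
    n_steps=200,
    grad_bound=1.0,
)
```

Keeping log-weights rather than the iterate itself avoids taking the logarithm of a coordinate that has underflowed to zero.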
5.3 Empirical Risk Regularization
In the offline approach with the Euclidean setup one may use the regularization trick from [Shalev-Shwartz et al. (2009)]. (Footnote 4: the same paper [Shalev-Shwartz et al. (2009)] explains why regularization is needed in the offline approach in the non-strongly convex case. The problems of the SAA approach in the non-strongly convex case are also discussed in [Guigues et al. (2017); Shapiro and Nemirovski (2005)]. For a more complete understanding see [Shapiro et al. (2009); Sridharan (2012)].) We introduce a composite term in the r.h.s. of equation (2) ( is some initial vector from )
Assume that such that
Then the main result of Theorem 4 can be rewritten as follows [Shalev-Shwartz et al. (2009)] (footnote 5: note that in [Shalev-Shwartz et al. (2009)] a simple was used instead of . For the moment we do not know how to justify this replacement, which is why we write . Fortunately, when is big enough it does not matter): for from (28) with probability
Taking to be big enough, we choose (as in [Shalev-Shwartz et al. (2009)]) approximately and obtain with probability
Recently, it was shown [Feldman and Vondrák (2019)] that the dependence on in (29) can be improved to logarithmic. Another type of regularization allows us to improve (29). We refer to the Bregman divergence [Ben-Tal and Nemirovski (2015)]
, , .
We note that is 1-strongly convex in 1-norm and -Lipschitz continuous in 1-norm on . In [Bigot et al. (2017)] it was proposed to use entropy as a regularizer. For entropy we have the same strong convexity properties in 1-norm on , but we lose the upper bound on the Lipschitz constant. Using this we redefine as follows (footnote 6: to solve (30) we may use the same dual distributed tricks as in [Kroshnin et al. (2019a)] if we put the composite term in a separate node. But first we should regularize with . The complexity in terms of will be the same as in Theorem 23.)
Assume that such that
Taking to be big enough, we choose approximately , where , and obtain for from (31) with probability
To prove this estimate we use the same arguments as in [Shalev-Shwartz et al. (2009)], but replace strong convexity and smoothness in 2-norm by the same properties in 1-norm. Since we may conclude that (32) is times better than (29). Note that if we choose then (32) can be written as follows:
This fact can be easily extracted from [Shalev-Shwartz et al. (2009)], see formula (21) of that work in the proof of Claim 6.2. The same can be said about (29). We summarize the result in the next theorem. To ensure with probability we need to take the following number of probability measures
and find satisfying (31) with
The total complexity of the properly corrected algorithm from [Kroshnin et al. (2019b)] on one machine (without parallelization/decentralization) is
From the recent results of [Feldman and Vondrák (2019)] we may expect that the dependence on in Theorems 5.1, 5.3 is in fact much better (logarithmic). For the moment we do not have a rigorous proof of this, but we suspect that the original ideas in [Feldman and Vondrák (2019)] make it possible to prove it.
In this section we compare the approaches from Sections 3, 4, 5. For the reader's convenience we skip the details about high-probability bounds. The first reason is that we can fix , say as , and consider it to be a fixed parameter in all the bounds. The second reason is an intuition (going back to [Shalev-Shwartz et al. (2009)]) that all the bounds in this paper in fact have logarithmic dependence on , so up to notation we can ignore the dependence on . The main result is proving the possible superiority of SAA over SA for population (-entropy regularized) Wasserstein barycenter estimation even in the non-parallel (and non-decentralized) case. For this purpose we provide Table 1, where we used (7), to compare the total complexity of the algorithms. Here -precision is the precision in argument.
When is not too large, SA has the complexity given by the second term under the minimum. In this case SAA has a clear advantage, since its complexity is about times smaller than that of SA. Next, we compare the complexity results under a proper regularization of OT (with a specific or ). Table 2 presents the results. Here is the precision in function value.
We do not draw any conclusions about the comparison of Stochastic MD and Regularized ERM, since it depends on the comparison of and . However, both of these methods clearly outperform (according to the complexity results) the SA and SAA approaches based on entropic regularization of OT. The conclusions about the advantages of the SAA approach over the SA approach can be reinforced by using parallelization or distributed calculations. For that we can use estimate (24) instead of (25). The same can be said about the subsequent formulas that we used in Section 5.
We are grateful to Alexander Gasnikov and Vladimir Spokoiny who initiated this research. We thank Pavel Dvurechensky and Eduard Gorbunov for fruitful discussions as well. We thank Ohad Shamir for a useful reference. The work in the first part was funded by RFBR, project number 19-31-51001. In the second part (from Section 5) the work was funded by the Russian Science Foundation, project no. 18-71-10108.
- Beiglböck et al. (2013) Mathias Beiglböck, Pierre Henry-Labordere, and Friedrich Penkner. Model-independent bounds for option prices—a mass transport approach. Finance and Stochastics, 17(3):477–501, 2013.
- Ben-Tal and Nemirovski (2015) Aaron Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization (Lecture Notes). Personal web-page of A. Nemirovski, 2015. URL http://www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf.
- Bigot and Klein (2012) Jérémie Bigot and Thierry Klein. Characterization of barycenters in the wasserstein space by averaging optimal transport maps. arXiv:1212.2562, 2012.
- Bigot et al. (2017) Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Penalized barycenters in the wasserstein space. 2017.
- Bigot et al. (2018) Jérémie Bigot, Raúl Gouet, Thierry Klein, Alfredo Lopez, et al. Upper and lower risk bounds for estimating the wasserstein barycenter of random measures on the real line. Electronic journal of statistics, 12(2):2253–2289, 2018.
- Bigot et al. (2019) Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. arXiv preprint arXiv:1711.08947, 2019.
- Blanchet et al. (2018) Jose Blanchet, Arun Jambulapati, Carson Kent, and Aaron Sidford. Towards optimal running times for optimal transport. arXiv preprint arXiv:1810.07717, 2018.
- Boissard et al. (2015) Emmanuel Boissard, Thibaut Le Gouic, Jean-Michel Loubes, et al. Distribution’s template estimate with wasserstein metrics. Bernoulli, 21(2):740–759, 2015.
- Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4927-sinkhorn-distances-lightspeed-computation-of-optimal-transport.pdf.
- Cuturi and Peyré (2016) Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.
- Dadush and Huiberts (2018) Daniel Dadush and Sophie Huiberts. A friendly smoothed analysis of the simplex method. Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 390–403. ACM, 2018.
- Dvurechensky et al. (2018) Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1367–1376, 2018. arXiv:1802.04367.
- Feldman and Vondrák (2019) Vitaly Feldman and Jan Vondrák. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. arXiv preprint arXiv:1902.10710, 2019.
- Franklin and Lorenz (1989) Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717 – 735, 1989. ISSN 0024-3795. doi: http://dx.doi.org/10.1016/0024-3795(89)90490-4. URL http://www.sciencedirect.com/science/article/pii/0024379589904904. Special Issue Dedicated to Alan J. Hoffman.
- Fréchet (1948) Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l’institut Henri Poincaré, volume 10, pages 215–310, 1948.
- Frogner et al. (2015) Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
- Gasnikov et al. (2016) A. V. Gasnikov, A. A. Lagunovskaya, I. N. Usmanova, and F. A. Fedorenko. Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex. Automation and Remote Control, 77(11):2018–2034, Nov 2016. ISSN 1608-3032. doi: 10.1134/S0005117916110114. URL http://dx.doi.org/10.1134/S0005117916110114. arXiv:1412.3890.
- Gasnikov et al. (2015) Alexander Gasnikov, Pavel Dvurechensky, Dmitry Kamzolov, Yurii Nesterov, Vladimir Spokoiny, Petr Stetsyuk, Alexandra Suvorikova, and Alexey Chernov. Universal method with inexact oracle and its applications for searching equillibriums in multistage transport problems. arXiv preprint arXiv:1506.00292, 2015.
- Genevay et al. (2017) Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. arXiv preprint arXiv:1706.00292, 2017.
- Genevay et al. (2018) Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. arXiv preprint arXiv:1810.02733, 2018.
- Gramfort et al. (2015) Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In International Conference on Information Processing in Medical Imaging, pages 261–272. Springer, 2015.
- Guigues et al. (2017) Vincent Guigues, Anatoli Juditsky, and Arkadi Nemirovski. Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optimization Methods and Software, 32(5):1033–1058, 2017. doi: 10.1080/10556788.2017.1350177. URL https://doi.org/10.1080/10556788.2017.1350177.
- Guminov et al. (2019) S. V. Guminov, Yu. E. Nesterov, P. E. Dvurechensky, and A. V. Gasnikov. Accelerated primal-dual gradient descent with linesearch for convex, nonconvex, and nonsmooth optimization problems. Doklady Mathematics, 99(2):125–128, 2019.
- Hazan et al. (2016) Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Ho et al. (2017) Nhat Ho, XuanLong Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and Dinh Phung. Multilevel clustering via Wasserstein means. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1501–1509, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/ho17a.html.
- Jambulapati et al. (2019) Arun Jambulapati, Aaron Sidford, and Kevin Tian. A direct