Recently, optimal transport map (OTM) draws great attention in machine learning, statistics, and computer science due to its close relationship to generative models, including generative adversarial nets(Goodfellow et al., 2014)
, the “decoder” network in variational autoencoders(Kingma and Welling, 2013), among others. In a generative model, the goal is usually to generate a “fake” sample, which is indistinguishable from the genuine one. This is equivalent to find a transport map from random noises with distribution 2016; Liu et al., 2017), songs (Blaauw and Bonada, 2016; Engel et al., 2017) and videos (Liang et al., 2017; Vondrick et al., 2016). Besides generative models, OTM also plays essential roles in various machine learning applications, say color transfer (Ferradans et al., 2014; Rabin et al., 2014), shape match (Su et al., 2015) et al., 2017; Peyré et al., 2019) et al., 2019).
Despite its impressive performance, the computation of OTM is challenging for a large-scale sample with massive sample size and/or high dimensionality. Traditional methods for estimating the OTM includes finding a parametric map and using ordinary differential equations(Brenier, 1997; Benamou et al., 2002)
. To address the computational concern, recent developments of OTM estimation have been made based on solving linear programs(Rubner et al., 1997; Pele and Werman, 2009). Let and
be two samples from two continuous probability distributions functionsand , respectively. Estimating the OTM from to by solving a linear program requiring computational time for fixed (Peyré et al., 2019; Seguy et al., 2017). To alleviate the computational burden, some literature (Cuturi, 2013; Genevay et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017) pursued fast computation approaches of the OTM objective, i.e., the Wasserstein distance. Another school of methods aims to estimate the OTM efficiently when is small, including multi-scale approaches (Mérigot, 2011; Gerber and Maggioni, 2017) and dynamic formulations (Solomon et al., 2014; Papadakis et al., 2014). These methods utilize the space discretization, thus are generally not applicable in high-dimensional cases.
The random projection method (or known as the radon transformation method) is proposed to estimate OTMs efficiently when is large Pitie et al. (2005); Pitié et al. (2007). Such a method tackles the problem of estimating a -dimensional OTM iteratively by breaking down the problem into a series of subproblems, each of which finds a one-dimensional OTM using projected samples. Denote as the -dimensional unit sphere. In each iteration, a random direction is picked, and the one-dimensional OTM is then calculated between the projected samples and . The collection of all the one-dimensional maps serves as the final estimate of the target OTM. The sliced method modifies the random projection method by considering a large set of random directions from in each iteration (Bonneel et al., 2015; Rabin et al., 2011). The “mean map” of the one-dimensional OTMs over these random directions is considered as a component of the final estimate of the target OTM. We call the random projection method, the sliced method, and their variants as the projection-based approach. Such an approach reduces the computational cost of calculating an OTM from to , where is the number of iterations until convergence. However, there is no theoretical guideline on the order of . In addition, the existing projection-based approaches usually require a large number of iterations to convergence or even fail to converge. We speculate that the slow convergence is because a randomly selected projection direction may not be “informative”, leading to a one-dimensional OTM that failed to be a decent representation of the target OTM. We illustrate such a phenomenon through an illustrative example as follows.
An illustrative example. The left and right panels in Figure 1 illustrates the importance of choosing the “informative” projection direction in OTM estimation. The goal is to obtain the OTM which maps a source distribution (colored in red) to a target distribution (colored in green). For each panel, we first randomly pick a projection direction (black arrow) and obtain the marginal distributions of and (the bell-shaped curves), respectively. The one-dimensional OTM then can be calculated based on the marginal distributions. Applying such a map to the source distribution yields the transformed distribution (colored in blue). One can observe that the transformed distributions are significantly different from the target ones. Such an observation indicates that the one-dimensional OTM with respect to a random projection direction may fail to well-represent the target OTM. This observation motivates us to select the “informative” projection direction (red arrow), which yields a better one-dimensional OTM.
Our contributions. To address the issues mentioned above, this paper introduces a novel statistical approach to estimate large-scale OTMs. The proposed method, named projection pursuit Monge map (PPMM), improves the existing projection-based approaches from two aspects. First, PPMM uses a sufficient dimension reduction technique to estimate the most “informative” projection direction in each iteration. Second, PPMM is based on projection pursuit (Friedman and Stuetzle, 1981). The idea is similar to boosting that search for the next optimal direction based on the residual of previous ones. Theoretically, we show the proposed method can consistently estimate the most “informative” projection direction in each iteration, and the algorithm weakly convergences to the target large-scale OTM in a reasonable number of steps. The finite sample performance of the proposed algorithm is evaluated by two applications: Wasserstein distance estimation and generative model. We show the proposed method outperforms several state-of-the-art large-scale OTM estimation methods through extensive experiments on various synthetic and real-world datasets.
2 Problem setup and methodology
Optimal transport map and Wasserstein distance. Denote andand , respectively. The problem of finding a transport map such that and have the same distribution, has been widely-studied in mathematics, probability, and economics, see (Ferradans et al., 2014; Su et al., 2015; Reich, 2013) for examples of some new developments. Note that the transport map between the two distributions is not unique. Among all transport maps, it may be of interest to define the “optimal” one according to some criteria. A standard approach, named Monge formulation (Villani, 2008), is to find the OTM111Such a map is thus also called the Monge map. that satisfies
where is the set of all transport maps,
is the vector norm andis a positive integer. Given the existence of the Monge map, the Wasserstein distance of order is defined as
Denote as an estimator of . Suppose one observe and from and , respectively. The Wasserstein distance thus can be estimated by
Projection pursuit method. Projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985; Friedman, 1987; Ifarraguerri and Chang, 2000) is widely-used for high-dimensional nonparametric regression models which takes the form.
where is a hyper-parameter, is the univariate response, are covariates, and are i.i.d. normal errors. The goal is to estimate the unknown link functions and the unknown coefficients .
The additive model (1) can be fitted in an iterative fashion. In the th iteration, , denote the estimate of obtained from previous iterations. Denote , , the residuals. Then can be estimated by solving the following least squares problem
The above iterative process explains the intuition behind the projection pursuit regression. Given the model fitted in previous iterations, we fit a one dimensional regression model using the current residuals, rather than the original responses. We then add this new regression model into the fitted function in order to update the residuals. By adding small regression models to the residuals, we gradually improve fitted model in areas where it does not perform well.
The intuition of projection pursuit regression motivates us to modify the existing projection-based OTM estimation approaches from two aspects. First, in the th iteration, we propose to seek a new projection direction for the one-dimensional OTM in the subspace spanned by the residuals of the previously directions. On the contrary, following a direction that is in the span of used ones can lead to an inefficient one dimensional OTM. As a result, this “move” may hardly reduce the Wasserstein distance between and . Such inefficient “moves” can be one of the causes of the convergence issue in existing projection-based OTM estimation algorithms. Second, in each iteration, we propose to select the most “informative” direction with respect to the current residuals rather than a random one. Specifically, we choose the direction that explains the highest proportion of variations in the subspace spanned by the current residuals. Intuitively, this direction addresses the maximum marginal “discrepancy” between and among the ones that are not considered by previous iterations. We propose to estimate this most “informative” direction with sufficient dimension reduction techniques introduced as follows.
Sufficient dimension reduction. Consider a regression problem with univariate response and a -dimensional predictor . Sufficient dimension reduction for regression aims to reduce the dimension of while preserving its regression relation with . In other words, sufficient dimension reduction seeks a set of linear combinations of , say with some and , such that depends on only through , i.e., . Then, the column space of , denoted as is called a dimension reduction space (DRS). Furthermore, if the union of all possible DRSs is also a DRS, we call it the central subspace and denote it as . When exists, it is the minimum DRS. We call a sufficient dimension reduction method exclusive if it induces a DRS that equals to the central subspace. Some popular sufficient dimension reduction techniques include sliced inverse regression (SIR) (Li, 1991), principal Hessian directions (PHD) (Li, 1992)
, sliced average variance estimator (SAVE)(Cook and Weisberg, 1991), directional regression (DR) (Li and Wang, 2007), among others.
Estimation of the most “informative” projection direction. Consider estimating an OTM between a source sample and a target sample. We first form a regression problem by adding a binary response, which equals zero for the source sample and one for the target sample. We then utilize the sufficient dimension reduction technique to select the most “informative” projection direction. To be specific, we select the projection direction. The direction is most “informative” in the sense that, the projected samples and have the most substantial “ discrepancy.” The metric of the “discrepancy” depends on the choice of the sufficient dimension reduction technique. Figure 2 gives a toy example to illustrate this idea. In this paper, we opt to use SAVE for calculating , and hence the “discrepancy” metric is the difference between and
. Empirically, we find other sufficient dimension reduction techniques, like PHD and DR, also yield similar performance. The SIR method, however, yields inferior performance, since it only considers the first moment. The Algorithm1 below introduces our estimation method of “informative" projection direction in detail.
Projection pursuit Monge map algorithm. Now, we are ready to present our estimation method for large-scale OTM. The detailed algorithm, named projection pursuit Monge map, is summarized in Algorithm 2 below. In each iteration, the PPMM applies a one-dimensional OTM following the most “informative” projection direction selected by the Algorithm 1.
Computational cost of PPMM. In Algorithm 2, the computational cost mainly resides in the first two steps within each iteration. In step (a), one calculates using Algorithm 1, whose computational cost is of order . In step (b), one calculates a one-dimensional OTM using the look-up table, which is simply a sorting algorithm (Pitié et al., 2007; Peyré et al., 2019).
The computational cost for step (b) is of order . Suppose that the algorithm converges after iterations. The overall computational cost of Algorithm 2 is of order . Empirically, we find works reasonably well. When , the order of computational cost of PPMM is which is smaller than the computational cost of the naive method for calculating OTMs. When , the order of computational cost reduces to which is faster than the exiting projection-based methods given PPMM converges faster. The memory cost for Algorithm 2 mainly resides in the step (a), which is of the order .
3 Theoretical results
Exclusiveness of SAVE. For mathematical simplicity, we assume . When , one can use a first-order dimension reduction method like SIR to adjust means before applying SAVE.
Denote , , and
. For a univariate continuous response variable, one can approximate the central subspace by , which is the population version of the dimension reduction space of SAVE. To be specific, is the column space of matrix
where the above equation used the fact that .
Let be the projection onto the central space with respect to the inner project . For any nonzero vectors , such that is orthogonal to and , we assume
is a linear function of Z;
is a nonrandom number;
Let be an independent copy of . is non degenerate; that is, it is not equal almost surely to a constant.
Let be a univariate continuous response variable. Under Assumption 1, the dimension reduction space induced by SAVE is exclusive. In other words, .
Consistency of the most “informative” projection direction. Let and be the sample covariance matrix estimator of and , respectively. Denote
Denote and the eigenvectors correspond to the largest eigenvalues of and , respectively. Further, denote , the rank of .
Let be an i.i.d. sample of . We assume that
Denote and the th and th component of and , respectively. for all and ;
There are and such that, for any , and ,
Let be the eigenvalues of in descending order. There exist positive constants and such that
but for random variables.
Under Assumption 2, the SAVE estimator of most “informative” projection direction satisfies,
Weak convergence of PPMM algorithm. Denote as the -dimensional optimal transport map from to and as the PPMM estimator after iterations, i.e. . The following theorem gives the weak convergence results of the PPMM algorithm.
Suppose Assumption 1 and Assumption 2 hold. Let for some large enough positive constant , one has
Works are proving the convergence rates of the empirical optimal transport objectives (Boissard and others, 2011; Sriperumbudur et al., 2012; Boissard and Le Gouic, 2014; Weed and Bach, 2017). The convergence rate of the OTM has rarely been studied except for a recent paper (Hütter and Rigollet, 2019). We believe Theorem 3 is the first step in this direction.
4 Numerical experiments
4.1 Estimation of optimal transport map
Suppose that we observe i.i.d. samples from and from , respectively. We set , , , , , and , for .
We apply PPMM to estimate the OTM between and from and . In comparison, we also consider the following two projection-based competitors: (1) the random projection method (RANDOM) as proposed in (Pitie et al., 2005; Pitié et al., 2007); (2) the sliced method as proposed in (Bonneel et al., 2015; Rabin et al., 2011). The number of slices is set to be 10, 20, and 50. We assess the convergence of each method by the estimated Wasserstein distance of order 2 after each iteration, i.e. , where is the estimator of OTM after th iteration. For all three methods, we set the maximum number of iterations to be 200. Notice that, the Wasserstein distance between and admits a closed form,
which serves as the ground-truth. The results are presented in Figure 3.
In all three scenarios, PPMM (red line) converges to the ground truth within a small number of iterations. The fluctuations of the convergence curves observed in Figure 3 are caused by the non-equal sample means. This can be adjusted by applying a first-order dimension reduction method (e.g., SIR). We do not pursue this approach as the fluctuations do not cover the main pattern in Figure 3. When , RANDOM and SLICED converge to the ground truth but in a much slower manner. When and , neither RANDOM nor SLICED manages to converge within 200 iterations. We also find a large number of slices does not necessarily lead to a better estimation for the SLICED method. As we can see, PPMM is the only one among three that is adaptive to large-scale OTM estimation problems.
In Table 1 below, we compare the computational cost of three methods by reporting the CPU time per iteration over 100 replication.222 The experiments are implemented by an Intel 2.6 GHz processor. As we expected, the RANDOM method has the lowest CPU time per iteration due to it does not select projection direction. We notice that the CPU time per iteration of the SLICED method is proportional to the number of slices . Last but not least, the CPU time per iteration of PPMM is slightly larger than RANDOM but much smaller than SLICED.
|0.019 (0.008)||0.011 (0.008)||0.111 (0.019)||0.213 (0.024)||0.529 (0.031)|
|0.027 (0.011)||0.014 (0.008)||0.125 (0.027)||0.247 (0.033)||0.605 (0.058)|
|0.059 (0.036)||0.015 (0.008)||0.171 (0.037)||0.338 (0.049)||0.863 (0.117)|
In the Table 2 below, we report the mean convergence time over 100 replications for PPMM, RANDOM, SLICED, the refined auction algorithm (AUCTIONBF)(Bertsekas, 1992), the revised simplex algorithm (REVSIM) (Luenberger et al., 1984) and the shortlist method (SHORTSIM) (Gottschlich and Schuhmacher, 2014).333 AUCTIONBF, REVSIM and SHORTSIM are implemented by the R package “transport” (Schuhmacher et al., 2019). Table 2 shows that the PPMM is the most computationally efficient method thanks to its cheap per iteration cost and fast convergence.
|0.6 (0.1)||4.8 (1.7)||23.0 (2.6)||99.7 (10.4)||40.2 (4.0)||42.5 (3.2)|
|2.1 (0.3)||24.4 (3.2)||230.2 (28.4)||109.4 (12.5)||42.6 (5.3)||50.2 (6.6)|
|5.5 (0.4)||-||-||125.5 (13.3)||46.5 (5.6)||56.5 (7.1)|
4.2 Application to generative models
A critical issue in generative models is the so-called mode collapse, i.e., the generated “fake” sample fails to capture some modes present in the training data (Guo et al., 2019; Salimans et al., 2018). To address this issue, recent studies (Tolstikhin et al., 2017; Guo et al., 2019; Kolouri et al., 2018) incorporated generative models with the optimal transportation theory. As illustrated in Figure 4, one can decompose the problem of generating fake samples into two major steps: (1) manifold learning and (2) probability transformation. The step (1) aims to discover the manifold structure of the training data by mapping the training data from the original space to a latent space with . Notice that the probability distribution of the transformed data in may not be convex, leading to the problem of mode collapse. The step (2) then addresses the mode collapse issue through transporting the distribution in to the uniform distribution . Then, the generative model takes a random input from and sequentially applies the inverse transformations in step (2) and step (1) to generate the output. In practice, one may implement the step (1) and (2) using variational autoencoders (VAE) and OTM, respectively. As we can see, the estimation of OTM plays an essential role in this framework.
In this subsection, we apply PPMM as well as RANDOM and SLICED to generative models to study two datasets: MINST and Google doodle dataset. For the SLICED method, we set the number of slices to be 10, 20, and 50. For all three methods, we set the number of iterations is set to be . We use the squared Euclidean distance as the cost for the VAE model.
Left: random samples generated by PPMM. Right: linear interpolation between random pairs of images.
MNIST. We first study the MNIST dataset, which contains 60,000 training images and 10,000 testing images of hand written digits. We pull each image to a -dimensional vector and rescale the grayscale values from to . Following the method in (Tolstikhin et al., 2017), we apply VAE to encode the data into a latent space of dimensionality . Then, the OTM from the distribution in to is estimated by PPMM as well as RANDOM and SLICED.
First, we visually examine the fake sample generated with PPMM. In the left-hand panel of Figure 5, we display some random images generated by PPMM. The right-hand panel of Figure 5 shows that PPMM can predict the continuous shift from one digit to another. To be specific, let be the sample of two digits (e.g. 3 and 9) in the testing set. Let be the map induced by VAE and the OTM estimated by PPMM. Then, maps the sample distribution to . We linearly interpolate between and with equal-size steps. Then we transform the interpolated points back to the sample distribution to generate the middle columns in the right panel of Figure 5.
We use the “Frchet Inception Distance” (FID) (Heusel et al., 2017) to quantify the similarity between the generated fake sample and the training sample. Specifically, we first generate 1,000 random inputs from . We then apply PPMM, RANDOM, and SLICED to this input sample, yields the fake samples in the latent space . Finally, we calculate the FID between the encoded training sample in the latent space and the generated fake samples, respectively. A small value of FID indicates the generated fake sample is similar to the training sample and vice versa. The sample mean and sample standard deviation (in parentheses) of FID over 50 replications are presented in Table 3. Table 3 indicates PPMM significantly outperforms the other two methods in terms of estimating the OTM.
|MNIST||0.17 (0.01)||4.62 (0.02)||2.98 (0.01)||3.04 (0.01)||3.12 (0.01)|
|Doodle (face)||0.59 (0.09)||8.78 (0.04)||5.69 (0.01)||6.01 (0.01)||5.52 (0.01)|
|Doodle (cat)||0.24 (0.03)||8.93 (0.03)||5.99 (0.01)||5.26 (0.01)||5.33 (0.01)|
|Doodle (bird)||0.36 (0.03)||7.81 (0.03)||5.44 (0.01)||5.50 (0.01)||4.98 (0.01)|
Google doodle dataset. The Google Doodle dataset444https://quickdraw.withgoogle.com/data contains over 50 million drawings created by users with a mouse under 20 secs. We analyze a pre-processed version of this dataset from the quick draw Github account555https://github.com/googlecreativelab/quickdraw-dataset. In the dataset we use, the drawings are centered and rendered into grayscale images. We pull each image to a -dimensional vector and rescale the grayscale values from to . In this experiment, we study the drawings from three different categories: smile face, cat, and bird. These three categories contain 161,666, 123,202, and 133,572 drawings, respectively. Within each category, we randomly split the data into a training set and a validation set of equal sample sizes.
We apply VAE to the training set with a stopping criterion selected by the validation set. The dimension of the latent space is set to be 16. Let be two vectors in the validation set, be the map induced by VAE and be the OTM estimated by PPMM. Note that maps the sample distribution to . We then linearly interpolate between and with equal-size steps. The results are presented in Figure 6.
Then, we quantify the similarity between the generated fake samples and the truth by calculating the FID in the latent space. The sample mean and sample standard deviation (in parentheses) of FID over 50 replications are presented in Table 3. Again, the results in Table 3 justify the superior performance of PPMM over existing projection-based methods.
First, the PPMM can be extended to address the penitential heterogeneous in the dataset by assigning non-equal weights to the points in source and target samples. This is equivalent to calculate weighted variance-covariance matrices in Step 2 of Algorithm 1. Second, the PPMM method can be modified to allow the sizes of the source and target samples to be different. In such a scenario, we can replace the look-up table in the Step (b) of Algorithm 2 with an approximate lookup table. Recall that the one-dimensional lookup table is just sorting, the one-dimensional approximate look-up table can be achieved by combining sorting and linear interpolation. We validate the above extensions with a simulated experiment similar to the one in Section 4.1 except that we draw and points from and , respectively. We set and assign weights to the observations randomly. The estimation results are presented in Figure 7. In addition, the average convergence time is: PPMM(0.3s), RANDOM (1.4s), SLICED10 (14s) and SHORTSIM (74s).
Theorem 3 suggests that, for the PPMM algorithm, the number of iterations until converge, i.e., , is on the order of dimensionality . Here we use a simulated example to assess whether this order is attainable. We follow a similar setting as in Section 4.1 except that we increase from 10 to 100 with a step size of 10. Besides, we set the termination criteria to be a hard threshold, i.e., . In Figure 8, we report the sample mean (solid line) and standard deviation (vertical bars) of over 100 replications with respect to the increased . One can observe a clear linear pattern.
We would like to thank Xiaoxiao Sun, Rui Xie, Xinlian Zhang, Yiwen Liu, and Xing Xin for many fruitful discussions. We would also like to thank Dr. Xianfeng David Gu for his insightful blog about the Optimal transportation theory. Also, we would like to thank the UC Irvine Machine Learning Repository for dataset assistance. This work was partially supported by National Science Foundation grants DMS-1440037, DMS-1440038, DMS-1438957, and NIH grants R01GM113242, R01GM122080.
Appendix A Appendix
This appendix provides the proofs of the theoretical results for the main document.
a.1 Proof of Theorem 1
First, we presents some Lemmas to facilitate the proof of Theorem 1.
Let be an independent copy of . We denote
Let be the projection onto the central space with respect to the inner project , and let . Further, define two quantities
Denote the column space of matrix , then .
Proof of Lemma 1.
Follow the Theorem 2 in Li and Wang (2007) and notice , the matrix can be re-expressed as
First, let be a vector orthogonal to . We have and almost surely. Therefore, for . This implies that is orthogonal to , and hence .
On the other hand, let be a vector orthogonal to . Then, implies
The second equality implies that almost surely. Furthermore, Using the fact that and , the first inequality can be re-expressed as
The second to fourth terms are 0 since . Thus the first term must also be 0, almost surely, implying . that . We complete the proof by showing that .
Suppose the Assumption 1 (a) and (b) hold. Denote the column space of matrix , then .
Proof of Lemma 2.
Let be a vector orthogonal to . By assumption (a), for some . Multiply both sides by and then take unconditional expectation to obtain . Thus .
By Assumption 1 (a) and (b), , for some constant . Take unconditional expectations on both sides to obtain . Thus .
Because , we have
Substitute the above two lines into A.1, we have
which implies . Then, we have .
Let be a symmetric and positive semi-definite matrix which satisfies . Then, iff for all , .
Proof of Lemma 3.
Suppose that is a strict subspace of . Then for any , . Conversely, for , , , we have , and hence . ∎
Proof of Theorem 1.
We first show that . is symmetric and positive semi-definite according to its definition. Also, Lemma 2 shows under Assumption 1 (a) and (b).
Let , . Without loss of generality, we assume . Then
Because , the first term on the right hand side of (7) is nonnegative. By Assumption 1 (c), is non-degenerate. Therefore, is non-degenerate. Then, by Jensen’s inequality and notice ,
a.2 Proof of Theorem 2
Proof of Theorem 2.
Suppose Assumption 2 holds. By applying Theorem 3 and Proposition 3 in Fan et al. (2018), we arrive at
where and are some positive constants.
It can be shown that
Follow the classic asymptotic result in univariate OLS and use the union bound, we have
Then, we bound the first operator norm in (A.2) as
where the second term of the last equality is due to derived from Assumption 2. Similarly, we have
a.3 Proof of Theorem 3
We will work on the space of probability measures on with bounded th moment, i.e.
The following Lemma follows the Theorem 5.10 in Santambrogio (2015), which provides the weak convergence in Wasserstein distance. Hence we omit its proof.
Let be compact, and . Then if and only if .
Denote the empirical Wasserstein distance with true OTM . The following Lemma follows the Theorem 2.1 in Klein et al. (2017) guarantees that is a consistent estimator of . We refer to Klein et al. (2017) for its proof.
Under Assumption 2 (a) and (b), converges almost surely to as .
Proof of Theorem 3.
Notice that, we can decompose the empirical Wasserstein distance as