Optimal transport (OT) is a technique with applications across the machine learning, computer vision, and natural language processing communities. Applications include Wasserstein distance estimation [26], domain adaptation [40], multi-task learning [17], barycenter estimation [5], semantic correspondence [21], feature matching [33], and photo album summarization [20].
The OT problem has been extensively studied in the computer vision community as the earth mover’s distance (EMD) [32]. However, computing the EMD has cubic cost, which is expensive. Recently, the entropic regularized EMD problem was proposed, which can be solved by the Sinkhorn algorithm with quadratic cost [6]. Owing to the development of the Sinkhorn algorithm, researchers have replaced the EMD computation with its regularized counterpart.
More recently, a robust variant of OT was proposed and used for divergence estimation [25]. In the robust OT framework, the transportation plan is computed in the discriminative subspace of the two data matrices $X$ and $Y$, where the subspace is obtained by solving a dimensionality reduction problem. An advantage of the subspace robust approach is that it does not require prior information about the subspace. However, computing the subspace can be expensive if the dimensionality of the data is high. Given prior information such as feature groups, we can instead consider a computationally efficient formulation.
One of the most common forms of prior information is a feature group. Grouped features are popular in feature selection problems and have been extensively studied in Group Lasso [41]. The key idea of Group Lasso is to pre-specify the groups of variables and select a set of groups using the group norm (also known as the sum of norms). For example, if we use a pre-trained neural network as a feature extractor and compute OT using its features, we must carefully select the important layers. Specifically, each layer output is regarded as a grouped input. Therefore, using feature groups as a prior is a natural setup and is important when considering OT for deep neural networks (DNNs).
Figure 1: Transportation plans between two synthetic distributions of high-dimensional vectors, each consisting of two-dimensional true features concatenated with noisy features. (a) OT between the distributions of the true features, shown as a reference. (b) OT between the distributions of the full noisy vectors. (c) FROT transportation plan between the distributions of the full noisy vectors, where the true features and the noisy features are grouped separately.
This study proposes a feature selection variant of optimal transport for high-dimensional data that utilizes grouped feature prior information. Specifically, we propose a feature robust optimal transport (FROT) problem, in which we select distinct groups of features instead of determining distinct subsets as proposed in [25]. We formulate the FROT problem as a min–max optimization problem and transform it into a convex optimization problem, which can be accurately solved by the Frank–Wolfe algorithm [10, 16]. The sub-problem of FROT can be accurately solved by the Sinkhorn algorithm [6]. An advantage of FROT is that we can obtain a globally optimal solution owing to its convexity. Moreover, we can determine the significance of the features after solving the FROT problem without any additional cost; this aids in interpreting features. Therefore, the FROT formulation is suited for feature selection and layer selection in DNNs. Through synthetic experiments, we first demonstrate that the proposed FROT can determine important groups (i.e., features) and is robust to noise dimensions (see Figure 1). Then, we use FROT for high-dimensional feature selection problems. Furthermore, we apply FROT to a semantic correspondence problem [21] and show that the proposed algorithm improves semantic correspondence.
Contributions:
- We propose a feature robust optimal transport (FROT) problem and derive a simple and efficient Frank–Wolfe based algorithm. Furthermore, we propose a feature robust Wasserstein distance (FRWD).
- We apply FROT to the high-dimensional feature selection problem and show that FROT is consistent with the Wasserstein distance based feature selection algorithm at a lower computational cost than the original algorithm.
- We use FROT for the layer selection problem in semantic correspondence and show that the proposed algorithm outperforms existing baseline algorithms.
In this section, we briefly introduce the OT problem.
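Optimal transport (OT): We are given independent and identically distributed (i.i.d.) samples $X = \{x_i\}_{i=1}^{n}$ from a $d$-dimensional distribution $p(x)$ and i.i.d. samples $Y = \{y_j\}_{j=1}^{m}$ from a $d$-dimensional distribution $q(y)$. In the Kantorovich relaxation of OT, admissible couplings are defined by the set of transportation plans:
$$U(\mu, \nu) = \{\Pi \in \mathbb{R}_{+}^{n \times m} : \Pi 1_m = a,\ \Pi^\top 1_n = b\},$$
where $\Pi$ is called the transportation plan, $1_n$ is the $n$-dimensional vector whose elements are ones, and $a = (a_1, \ldots, a_n)^\top \in \mathbb{R}_{+}^{n}$ and $b = (b_1, \ldots, b_m)^\top \in \mathbb{R}_{+}^{m}$ are the weights. The OT problem between two discrete measures $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$ is to determine the optimal transportation plan of the following problem:
$$\min_{\Pi \in U(\mu, \nu)} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}\, c(x_i, y_j), \tag{1}$$
where $c(x, y)$ is a cost function. For example, the squared Euclidean distance is used, that is, $c(x, y) = \|x - y\|_2^2$. Solving the OT problem in Eq. (1) (also known as the earth mover’s distance) using linear programming requires $O(n^3)$ computation (for $n = m$), which is computationally expensive. To address this, the entropic-regularized optimal transport is used [6]:
$$\min_{\Pi \in U(\mu, \nu)} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}\, c(x_i, y_j) + \epsilon H(\Pi),$$
where $\epsilon \geq 0$ is the regularization parameter and $H(\Pi) = \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} (\log \pi_{ij} - 1)$ is the entropic regularization. If $\epsilon = 0$, the regularized OT problem reduces to the EMD problem. Owing to entropic regularization, the entropic regularized OT problem can be accurately solved using Sinkhorn iteration [6] with $O(nm)$ computational cost (see Algorithm 1).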
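The Sinkhorn iteration for the entropic-regularized problem can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's Algorithm 1: the function name `sinkhorn`, the fixed iteration count, the toy data, and the rescaling of the cost matrix are all assumptions made for the example.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.2, n_iter=500):
    """Sinkhorn iterations for entropic-regularized OT (illustrative sketch)."""
    K = np.exp(-C / eps)              # Gibbs kernel built from the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)               # rescale rows to match the weights a
        v = b / (K.T @ u)             # rescale columns to match the weights b
    return u[:, None] * K * v[None, :]   # Pi = diag(u) K diag(v)

# Toy data (hypothetical): two small point clouds in 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = rng.normal(size=(6, 2)) + 1.0
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean cost
C = C / C.max()                       # rescale so exp(-C / eps) stays well-conditioned
a = np.full(5, 1.0 / 5)               # uniform weights
b = np.full(6, 1.0 / 6)
Pi = sinkhorn(C, a, b)
cost = float((Pi * C).sum())          # transport cost under the regularized plan
```

Each iteration costs two matrix-vector products, which is where the quadratic per-iteration cost comes from. A smaller `eps` approximates the EMD more closely but needs more iterations and is more prone to numerical underflow.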
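Wasserstein distance: If the cost function is defined as $c(x, y) = d(x, y)^p$ with a distance function $d(x, y)$ and $p \geq 1$, then the $p$-Wasserstein distance between two discrete measures $\mu$ and $\nu$ is defined as
$$W_p(\mu, \nu) = \left( \min_{\Pi \in U(\mu, \nu)} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}\, d(x_i, y_j)^p \right)^{1/p}.$$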
3 Proposed Method
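This study proposes a feature robust optimal transport. We assume that the vectors are grouped as $x = (x^{(1)\top}, \ldots, x^{(L)\top})^\top$ and $y = (y^{(1)\top}, \ldots, y^{(L)\top})^\top$. Here, $x^{(l)} \in \mathbb{R}^{d_l}$ and $y^{(l)} \in \mathbb{R}^{d_l}$ are $d_l$-dimensional vectors, where $\sum_{l=1}^{L} d_l = d$. This setting is useful if we know the explicit group structure of the feature vectors a priori. In an application to $L$-layer neural networks, we consider $x^{(l)}$ and $y^{(l)}$ as outputs of the $l$th layer of the network. Specifically, for $L = d$ and $d_1 = \cdots = d_L = 1$, we consider each feature independently.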
3.1 Feature Robust Optimal Transport (FROT)
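The FROT formulation is given by
$$\min_{\Pi \in U(\mu, \nu)} \max_{\alpha \in \Sigma^L} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} \sum_{l=1}^{L} \alpha_l\, c(x_i^{(l)}, y_j^{(l)}),$$
where $\Sigma^L = \{\alpha \in \mathbb{R}_{+}^{L} : \alpha^\top 1_L = 1\}$ is the probability simplex.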
The underlying concept of FROT is to estimate the transportation plan using the distinct groups, that is, the groups in which the two sets of samples are far apart. We note that determining the transportation plan in non-distinct groups is difficult because the data samples of the two sets overlap. In contrast, in the distinct groups the two sets are different, which aids in determining an optimal transportation plan. This idea is intrinsically similar to the subspace robust Wasserstein distance [25], which estimates the transportation plan in the discriminative subspace. In contrast, our approach selects the important groups. Therefore, FROT can be regarded as a feature selection variant of the vanilla OT problem in Eq. (1), whereas the subspace robust one is its dimensionality reduction counterpart.
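FROT with Frank–Wolfe: An alternating approach can be used to estimate FROT, in which we alternately estimate $\Pi$ and $\alpha$. However, it can converge to a locally optimal solution owing to the non-convexity of the problem. Thus, we propose a convex optimization of FROT with Frank–Wolfe. Specifically, we introduce entropic regularization for $\alpha$ and rewrite FROT as a function of $\Pi$. Therefore, we solve the following problem for $\Pi$:
$$\min_{\Pi \in U(\mu, \nu)} \max_{\alpha \in \Sigma^L} J_\eta(\Pi, \alpha), \quad J_\eta(\Pi, \alpha) = \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} \sum_{l=1}^{L} \alpha_l\, c(x_i^{(l)}, y_j^{(l)}) - \eta H(\alpha),$$
where $\eta \geq 0$ is the regularization parameter and $H(\alpha) = \sum_{l=1}^{L} \alpha_l (\log \alpha_l - 1)$ is the entropic regularization for $\alpha$. An advantage of the entropic regularization is that the non-negative constraint is naturally satisfied and the entropic regularizer is a strongly convex function.
Proposition 1. The optimal solution of the optimization problem
$$\alpha^* = \arg\max_{\alpha \in \Sigma^L} J_\eta(\Pi, \alpha)$$
with a fixed admissible transportation plan $\Pi \in U(\mu, \nu)$ is given by
$$\alpha_l^* = \frac{\exp\left(\frac{1}{\eta} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}\, c(x_i^{(l)}, y_j^{(l)})\right)}{\sum_{l'=1}^{L} \exp\left(\frac{1}{\eta} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}\, c(x_i^{(l')}, y_j^{(l')})\right)}.$$
Using Proposition 1 together with the setting $\phi_l = \langle \Pi, C_l \rangle = \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} [C_l]_{ij}$, $[C_l]_{ij} = c(x_i^{(l)}, y_j^{(l)})$, the global problem is equivalent (up to an additive constant that does not depend on $\Pi$) to
$$\min_{\Pi \in U(\mu, \nu)} G_\eta(\Pi), \quad G_\eta(\Pi) = \eta \log \left( \sum_{l=1}^{L} \exp\left( \frac{1}{\eta} \langle \Pi, C_l \rangle \right) \right).$$
This function is the soft-maximum of the transportation costs in each group. The regularization parameter $\eta$ controls how “soft” the maximum is: if $\eta$ is small, $G_\eta$ is similar to the maximum, whereas if $\eta$ is large, the function becomes smooth.
Proposition 2. $G_\eta(\Pi)$ is a convex function relative to $\Pi$.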
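The behaviour of the soft-maximum can be checked numerically. The sketch below evaluates the soft-max weights and the log-sum-exp objective for a hypothetical vector `phi` of per-group transport costs; the helper names `soft_alpha` and `G` and the numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def soft_alpha(phi, eta):
    """Soft-max weights proportional to exp(phi_l / eta)."""
    z = phi / eta
    w = np.exp(z - z.max())           # subtract the max to avoid overflow
    return w / w.sum()

def G(phi, eta):
    """Soft-maximum eta * log(sum_l exp(phi_l / eta)), computed stably."""
    z = phi / eta
    zmax = z.max()
    return float(eta * (zmax + np.log(np.exp(z - zmax).sum())))

phi = np.array([0.2, 1.0, 0.5])       # hypothetical per-group transport costs
alpha_small, G_small = soft_alpha(phi, 0.01), G(phi, 0.01)     # close to the hard max
alpha_large, G_large = soft_alpha(phi, 100.0), G(phi, 100.0)   # nearly uniform weights
```

With a small regularization parameter the weights concentrate on the group with the largest cost and the objective approaches that largest cost; with a large parameter the weights flatten towards uniform.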
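The derived optimization problem is convex. Therefore, we can determine globally optimal solutions. We employ the Frank–Wolfe algorithm [10, 16], where we approximate $G_\eta(\Pi)$ by linear functions at $\Pi^{(t)}$ and move $\Pi$ towards the optimal solution in the convex set (see Algorithm 2).
The derivative of the loss function $G_\eta(\Pi)$ at $\Pi^{(t)}$ is given by
$$\left. \frac{\partial G_\eta(\Pi)}{\partial \Pi} \right|_{\Pi = \Pi^{(t)}} = \sum_{l=1}^{L} \alpha_l^{(t)} C_l = M_{\Pi^{(t)}},$$
where $[C_l]_{ij} = c(x_i^{(l)}, y_j^{(l)})$ is the cost matrix of the $l$th group and $\alpha_l^{(t)} = \exp\left(\frac{1}{\eta} \langle \Pi^{(t)}, C_l \rangle\right) / \sum_{l'=1}^{L} \exp\left(\frac{1}{\eta} \langle \Pi^{(t)}, C_{l'} \rangle\right)$.
Then, we update the transportation plan by solving the EMD problem:
$$\Pi^{(t+1)} = (1 - \gamma) \Pi^{(t)} + \gamma \hat{\Pi}, \quad \hat{\Pi} = \arg\min_{\Pi \in U(\mu, \nu)} \langle \Pi, M_{\Pi^{(t)}} \rangle,$$
where $\gamma = 2 / (2 + t)$. By the Frank–Wolfe algorithm, we can obtain the optimal solution. However, solving the EMD problem requires cubic computational cost, which can be expensive if $n$ and $m$ are large. To address this, we can solve the regularized OT problem instead.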
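The Frank–Wolfe procedure described above can be sketched as follows, with the linear subproblem solved by Sinkhorn iterations in place of the exact EMD. This is an illustrative sketch, not the paper's Algorithm 2: the function names, default parameters, the step size $2/(2+t)$, the toy data, and the rescaling of the cost matrices are assumptions.

```python
import numpy as np

def sinkhorn(C, a, b, eps, n_iter=500):
    """Entropic-regularized OT, used here in place of the exact EMD step."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def group_weights(Pi, Cs, eta):
    """Soft-max weights alpha_l proportional to exp(<Pi, C_l> / eta)."""
    phi = (Cs * Pi[None, :, :]).sum(axis=(1, 2))
    w = np.exp((phi - phi.max()) / eta)       # shift by the max for stability
    return w / w.sum()

def frot(Cs, a, b, eta=1.0, eps=0.2, n_outer=10):
    """Frank-Wolfe sketch; Cs has shape (L, n, m), one cost matrix per group."""
    Pi = np.outer(a, b)                       # feasible starting plan
    for t in range(n_outer):
        alpha = group_weights(Pi, Cs, eta)
        M = np.tensordot(alpha, Cs, axes=1)   # gradient: sum_l alpha_l C_l
        Pi_hat = sinkhorn(M, a, b, eps)       # regularized linear subproblem
        gamma = 2.0 / (2.0 + t)               # diminishing step size
        Pi = (1.0 - gamma) * Pi + gamma * Pi_hat
    return Pi, group_weights(Pi, Cs, eta)

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

# Toy data (hypothetical): one group of shifted 2-D features, one group of noise.
rng = np.random.default_rng(0)
n = 20
x_true, y_true = rng.normal(size=(n, 2)), rng.normal(size=(n, 2)) + 3.0
x_noise, y_noise = rng.normal(size=(n, 8)), rng.normal(size=(n, 8))
Cs = np.stack([sq_dists(x_true, y_true), sq_dists(x_noise, y_noise)])
Cs = Cs / Cs.max()                            # keep the Gibbs kernel well-conditioned
a = b = np.full(n, 1.0 / n)
Pi, alpha = frot(Cs, a, b)
```

Because every iterate is a convex combination of feasible (or nearly feasible) plans, the result stays in the set of admissible couplings up to the Sinkhorn tolerance. The returned `alpha` gives the group weights at the final plan.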
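We propose a $p$-feature robust Wasserstein distance ($p$-FRWD).
Proposition 3. For the distance function $d(x, y)$,
$$\mathrm{FRWD}_p(\mu, \nu) = \left( \min_{\Pi \in U(\mu, \nu)} \max_{\alpha \in \Sigma^L} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} \sum_{l=1}^{L} \alpha_l\, d(x_i^{(l)}, y_j^{(l)})^p \right)^{1/p}$$
is a distance for $p \geq 1$.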
3.2 Application 1: Feature Selection
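We consider $X$ and $Y$ as sets of samples from the two classes, respectively. An advantage of the FROT formulation is that we can determine the importance of each group of features. The optimal feature importance is given by
$$\hat{\alpha}_l = \frac{\exp\left(\frac{1}{\eta} \langle \hat{\Pi}, C_l \rangle\right)}{\sum_{l'=1}^{L} \exp\left(\frac{1}{\eta} \langle \hat{\Pi}, C_{l'} \rangle\right)},$$
where $\hat{\Pi}$ is the transportation plan obtained by solving the FROT problem and $[C_l]_{ij} = c(x_i^{(l)}, y_j^{(l)})$. Finally, we select the top-$K$ features according to the ranking given by $\hat{\alpha}$. Note that $\hat{\alpha}$ approaches a one-hot vector for small $\eta$ and the uniform vector $\hat{\alpha}_l \approx 1/L$ for large $\eta$.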
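The ranking step can be illustrated with each feature treated as its own group. In the sketch below, the independent coupling is used as a placeholder for the plan that FROT would return, which is a simplifying assumption made to keep the example short; the function names and toy data are likewise hypothetical.

```python
import numpy as np

def per_feature_costs(X, Y):
    """Cs[l, i, j] = (X[i, l] - Y[j, l])**2: one cost matrix per feature."""
    diff = X[:, None, :] - Y[None, :, :]          # shape (n, m, d)
    return np.transpose(diff ** 2, (2, 0, 1))     # shape (d, n, m)

def feature_importance(Pi, Cs, eta):
    """Soft-max importance of each feature given a transportation plan."""
    phi = (Cs * Pi[None, :, :]).sum(axis=(1, 2))  # phi_l = <Pi, C_l>
    w = np.exp((phi - phi.max()) / eta)
    return w / w.sum()

# Toy two-class data (hypothetical): only feature 0 separates the classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = rng.normal(size=(50, 5))
Y[:, 0] += 3.0
Cs = per_feature_costs(X, Y)
Pi = np.full((50, 50), 1.0 / (50 * 50))   # placeholder plan (independent coupling)
alpha = feature_importance(Pi, Cs, eta=1.0)
top2 = np.argsort(-alpha)[:2]             # indices of the two most important features
alpha_sharp = feature_importance(Pi, Cs, eta=0.01)   # close to one-hot
alpha_flat = feature_importance(Pi, Cs, eta=1e4)     # close to uniform
```

Sorting the negated weights yields the features in descending order of importance, so the first `K` indices form the selected set.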
3.3 Application 2: Semantic Correspondence
We apply the proposed FROT algorithm to semantic correspondence, which is the problem of determining the matching of objects in two images. That is, given a pair of input images with common objects, we formulate the semantic correspondence problem as estimating the transportation plan from the key points in one image to those in the other; this framework was proposed in [21]. Figure 2 shows an overview of our proposed framework.
Cost matrix computation:
In our framework, we employ a pre-trained convolutional neural network to extract dense feature maps from each convolutional layer. The dense feature map of the $l$th layer output of the $s$th image consists of $h_s w_s$ feature vectors of dimension $d_l$, where $w_s$ and $h_s$ are the width and the height of the $s$th image, respectively, and $d_l$ is the dimension of the $l$th layer’s feature map. Note that because the size of the dense feature map is different for each layer, we resample the feature maps to the size of the first layer’s feature map.
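The $l$th layer’s cost matrix for images $s$ and $s'$ is given by
$$[C_l]_{ij} = \| f_s^{(l,i)} - f_{s'}^{(l,j)} \|_2^2,$$
where $f_s^{(l,i)}$ denotes the $i$th feature vector of the $l$th layer of image $s$.
A potential problem of FROT is that the estimation depends significantly on the magnitude of the cost of each layer (i.e., each group). Hence, normalizing each cost matrix is important. Therefore, we normalize each feature vector by $f_s^{(l,i)} \leftarrow f_s^{(l,i)} / \| f_s^{(l,i)} \|_2$. Consequently, the cost matrix is given by $[C_l]_{ij} = 2 - 2 f_s^{(l,i)\top} f_{s'}^{(l,j)}$. Other distance functions can also be used as the cost.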
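The effect of the normalization can be verified directly: for unit-norm vectors, the squared Euclidean distance equals two minus twice the inner product. The sketch below assumes each layer's features are stored as rows of a matrix; the function name `normalized_cost` and the random features are hypothetical.

```python
import numpy as np

def normalized_cost(F_src, F_tgt):
    """Squared Euclidean cost between L2-normalized feature vectors (rows)."""
    F_src = F_src / np.linalg.norm(F_src, axis=1, keepdims=True)
    F_tgt = F_tgt / np.linalg.norm(F_tgt, axis=1, keepdims=True)
    return 2.0 - 2.0 * F_src @ F_tgt.T    # equals ||f - g||^2 for unit vectors

# Hypothetical features for one layer: 4 source and 3 target locations.
rng = np.random.default_rng(0)
F_src = rng.normal(size=(4, 8))
F_tgt = rng.normal(size=(3, 8))
C = normalized_cost(F_src, F_tgt)         # every entry lies in [0, 4]
```

Since every entry is bounded between 0 and 4 regardless of the layer, the cost matrices of different layers become comparable in magnitude.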
Computation of $a$ and $b$ with staircase re-weighting: For semantic correspondence, setting $a$ and $b$ is important because semantic correspondence can be affected by background clutter. Therefore, we generate the class activation maps (CAMs) [42] for the source and target images and use them as $a$ and $b$, respectively. For CAM, we choose the class with the highest classification probability and normalize the map to the range $[0, 1]$.
4 Related Work
In this section, we review divergence measures and optimal transport.
Divergence measure and optimal transport: Divergence measures can be categorized into two classes: $f$-divergences [1], including the Kullback–Leibler (KL) divergence [4] and the $\alpha$-divergence [28, 27], and integral probability metrics [24], such as the Wasserstein distance.
The KL divergence is a commonly used divergence. A naive approach for estimating the KL divergence between two densities $p$ and $q$ is to estimate the densities separately using some density estimators and then compute their ratio. However, density estimation is a difficult problem, and the resulting KL divergence estimate can be inaccurate. A more efficient approach is based on density ratio estimation, where we directly estimate the ratio of $p$ and $q$ without density estimation [34]. For the Jensen–Shannon (JS) divergence [11], we can use relative density ratio estimation as an alternative to standard density ratio estimation [39]. For non-overlapping distributions, the KL divergence can be infinite. Moreover, in this case, neural network training with the KL and JS divergences can be affected by vanishing gradients.
To address the instability problem in the KL and JS divergences, a distance based approach is promising. The maximum mean discrepancy (MMD) [13] is a kernel based measure defined as the difference between the means of two distributions in a reproducing kernel Hilbert space (RKHS), which can be accurately computed without optimization. Another type of distance based measure is the Wasserstein distance, which can be determined by solving the OT problem. An advantage of the Wasserstein distance is its robustness to noise; moreover, we can obtain the transportation plan, which is useful for many machine learning applications. To reduce the computational cost of the Wasserstein distance, the sliced Wasserstein distance is useful [18]. Recently, a tree variant of the Wasserstein distance was proposed [9, 19]; the sliced Wasserstein distance is a special case of this algorithm.
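To illustrate why MMD requires no optimization, the following sketch computes a biased estimate of the squared MMD from kernel means only. The Gaussian kernel, its bandwidth `sigma`, and the toy samples are assumptions chosen for the example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return float(gaussian_kernel(X, X, sigma).mean()
                 + gaussian_kernel(Y, Y, sigma).mean()
                 - 2.0 * gaussian_kernel(X, Y, sigma).mean())

# Toy samples (hypothetical): X and X2 share a distribution, Y is shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X2 = rng.normal(size=(100, 2))
Y = rng.normal(size=(100, 2)) + 3.0
same = mmd2(X, X2)        # small: both samples come from the same distribution
different = mmd2(X, Y)    # larger: the means differ
```

Unlike OT, this estimate yields a single scalar and no correspondence between individual samples.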
In addition to accelerating the computation, structured optimal transport (SOT) incorporates structural information directly into OT problems [alvarez2018structured]. Specifically, the authors formulate a submodular optimal transport problem and solve it with a saddle-point mirror prox algorithm. Recently, more complex structured information, such as hierarchical structure, has been introduced into the OT problem [alvarez2019unsupervised, yurochkin2019hierarchical]. These approaches successfully incorporate structured information into OT problems with respect to data samples. In contrast, FROT incorporates structured information into features.
The work most closely related to FROT is a robust variant of the Wasserstein distance called the subspace robust Wasserstein distance [25]. This method solves the OT problem in the most discriminative subspace, which is determined by solving dimensionality reduction problems. Owing to this robust formulation, it can successfully compute the Wasserstein distance from noisy data. FROT is a feature selection variant of the Wasserstein distance, whereas the subspace robust one is for dimensionality reduction.
OT applications: OT has received significant attention in several computer vision tasks. Applications include Wasserstein distance estimation [26], domain adaptation [40], multi-task learning [17], barycenter estimation [5], semantic correspondence [21], feature matching [33], photo album summarization [20], generative models [2, 3, 8, 36], and graph matching [37, 38]. Recently, OT was applied to the semantic correspondence problem, and it outperformed existing state-of-the-art semantic correspondence algorithms [21].
In this section, we first evaluate the FROT algorithm using synthetic datasets. Then, we demonstrate its performance on feature selection and semantic correspondence tasks.
5.1 Synthetic Data
We compare FROT with a standard OT using synthetic datasets. In these experiments, we initially generate two-dimensional vectors and . Here, we set , , . Then, we concatenate and to and , respectively to give , .
For FROT, we set and the number of iterations of the Frank–Wolfe algorithm as . The regularization parameter is set to for all methods. To show the proof-of-concepts, we set the true features as a group and the remaining noise features as another group.
Fig. 1(a) shows the correspondence between the true features obtained with the vanilla OT algorithm. Figs. 1(b) and 1(c) show the correspondences of OT and FROT on the noisy vectors, respectively. Although FROT identifies a good matching, OT fails to obtain a meaningful correspondence. We observed that the weight $\alpha_l$ corresponding to the true group is nearly one.
Table 1: Computational time (seconds) of each method.
| Data | d | n | Wasserstein | Linear | MMD | FROT |
|---|---|---|---|---|---|---|
| Colon | 2000 | 62 | 21.38 (4.09) | 0.00 (0.00) | 1.36 (0.15) | 0.41 (0.07) |
| Leukemia | 7070 | 72 | 79.86 (16.95) | 0.01 (0.00) | 5.03 (0.79) | 1.13 (0.14) |
| Prostate_GE | 5966 | 102 | 61.05 (13.67) | 0.02 (0.00) | 6.01 (1.17) | 1.04 (0.11) |
| GLI_85 | 22283 | 85 | 426.24 (21.45) | 0.04 (0.00) | 23.6 (1.21) | 3.44 (0.36) |
5.2 Feature selection
Here, we compare FROT with several baseline algorithms on feature selection problems. In this study, we employ high-dimensional datasets with few samples and two-class classification tasks (see Table 1). All the feature selection experiments were run on a Linux server with an Intel Xeon CPU E7-8890 v4 2.20 GHz and 2 TB RAM.
In our experiments, we first randomly split the data into a training set and a test set and used the training set for feature selection and for building a classifier. Note that we standardized each feature using the training set. Then, we used the remaining set for testing. The trial was repeated several times, and we report the averaged classification accuracy. As baseline methods, we computed the Wasserstein distance, the maximum mean discrepancy (MMD) [12], and the linear correlation (https://scikit-learn.org/stable/modules/feature_selection.html) for each dimension and sorted the scores in descending order. Then, we selected the top-$K$ features as important features. For FROT, we computed the feature importance and selected the features with the highest importance scores. In our experiments, we set and . Then, we trained a two-class SVM (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) with the selected features.
Fig. 3 shows the averaged classification accuracy relative to the number of selected features. FROT is consistent with the Wasserstein distance based feature selection and outperforms the linear correlation method and MMD on two datasets. Table 1 shows the computational time (in seconds) of the methods. FROT is about two orders of magnitude faster than the Wasserstein distance and is also faster than MMD. Note that although MMD is as fast as the proposed method, it cannot determine the correspondence between samples.
5.3 Semantic correspondence
We evaluated our FROT algorithm on semantic correspondence. In this study, we used the SPair-71k dataset [23], which consists of image pairs with variations in viewpoint and scale. For evaluation, we employed the percentage of correct key-points (PCK), which counts the number of accurately predicted key-points given a fixed threshold [23]. All the semantic correspondence experiments were run on a Linux server with an NVIDIA P100.
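The PCK metric can be sketched as the fraction of predicted key-points that fall within a threshold of their ground-truth locations. In the example below, the threshold is taken as a fraction `alpha` of an object bounding-box size; both this convention and the value 0.1 are assumptions for illustration, as are the function name and the coordinates.

```python
import numpy as np

def pck(pred, gt, bbox_size, alpha=0.1):
    """Fraction of key-points predicted within alpha * bbox_size of the truth."""
    dist = np.linalg.norm(pred - gt, axis=1)      # per-key-point pixel error
    return float(np.mean(dist <= alpha * bbox_size))

# Hypothetical predictions and ground truth for three key-points (pixels).
pred = np.array([[10.0, 10.0], [50.0, 52.0], [90.0, 40.0]])
gt = np.array([[12.0, 11.0], [50.0, 70.0], [88.0, 41.0]])
score = pck(pred, gt, bbox_size=100.0)            # threshold is 10 pixels here
```

In this toy case, the first and third key-points are within the threshold and the second is not, so two of the three predictions count as correct.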
For the proposed framework, we employed ResNet101 [14] pre-trained on ImageNet [7] for feature and activation map extraction. Note that we did not fine-tune the network. We compared the proposed method with several baselines. In particular, HPF [22] and OT-HPF [21] are state-of-the-art methods for semantic correspondence. HPF and OT-HPF require a validation dataset to select important layers, whereas FROT does not. OT is a simple optimal transport based method without layer selection.
Table 2 shows the per-class PCK results on the SPair-71k dataset. FROT outperforms most existing baselines, including HPF and OT. Moreover, FROT is consistent with OT-HPF [21], which requires a validation dataset to select important layers. In this experiment, setting gives favorable performance. Figure 4(a) shows an example of the matched key-points obtained using the FROT algorithm. Fig. 4(b) shows the corresponding feature importance. The lower the value of $\eta$, the smaller the number of layers used. The interesting finding here is that the selected important layer in this case is the third layer from the last.
In this paper, we proposed a feature robust optimal transport (FROT) for high-dimensional data, which jointly solves the feature selection and OT problems. An advantage of FROT is that it is a convex optimization problem, so a globally optimal solution can be accurately determined by the Frank–Wolfe algorithm. We then used FROT for high-dimensional feature selection and semantic correspondence problems. Through extensive experiments, we demonstrated that the proposed algorithm is consistent with state-of-the-art algorithms in both feature selection and semantic correspondence.
References
[1] (1966) A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142.
[2] (2017) Wasserstein generative adversarial networks. In ICML.
[3] (2019) Learning generative models across incomparable spaces. In ICML.
[4] (2012) Elements of information theory. John Wiley & Sons.
[5] (2014) Fast computation of Wasserstein barycenters. In ICML.
[6] (2013) Sinkhorn distances: lightspeed computation of optimal transport. In NIPS.
[7] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
[8] (2019) Max-sliced Wasserstein distance and its use for GANs. In CVPR.
[9] (2012) The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (3), pp. 569–592.
[10] (1956) An algorithm for quadratic programming. Naval Research Logistics Quarterly 3 (1-2), pp. 95–110.
[11] (2004) Jensen-Shannon divergence and Hilbert space embedding. In ISIT.
[12] (2007) A kernel statistical test of independence. In NIPS.
[13] (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773.
[14] (2016) Deep residual learning for image recognition. In CVPR.
[15] (2018) Attentive semantic alignment with offset-aware correlation kernels. In ECCV.
[16] (2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML.
[17] (2019) Wasserstein regularization for sparse multi-task regression. In AISTATS.
[18] Sliced Wasserstein kernels for probability distributions. In CVPR.
[19] (2019) Tree-sliced approximation of Wasserstein distances. In NeurIPS.
[20] (2019) LSMI-Sinkhorn: semi-supervised squared-loss mutual information estimation with optimal transport. arXiv preprint arXiv:1909.02373.
[21] (2020) Semantic correspondence as an optimal transport problem. In CVPR.
[22] (2019) Hyperpixel flow: semantic correspondence with multi-layer neural features. In ICCV.
[23] (2019) SPair-71k: a large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543.
[24] (1997) Integral probability metrics and their generating classes of functions. Advances in Applied Probability 29 (2), pp. 429–443.
[25] (2019) Subspace robust Wasserstein distances. In ICML.
[26] (2019) Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607.
[27] (2011) On the estimation of alpha-divergences. In AISTATS.
[28] (1961) On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics.
[29] (2017) Convolutional neural network architecture for geometric matching. In CVPR.
[30] (2018) End-to-end weakly-supervised semantic alignment. In CVPR.
[31] (2018) Neighbourhood consensus networks. In NeurIPS.
[32] The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40 (2), pp. 99–121.
[33] (2019) SuperGlue: learning feature matching with graph neural networks. arXiv preprint arXiv:1911.11763.
[34] (2008) Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS.
[35] (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media.
[36] (2019) Sliced Wasserstein generative models. In CVPR.
[37] (2019) Scalable Gromov-Wasserstein learning for graph partitioning and matching. arXiv preprint arXiv:1905.07645.
[38] (2019) Gromov-Wasserstein learning for graph matching and node embedding. In ICML.
[39] (2013) Relative density-ratio estimation for robust distribution comparison. Neural Computation 25 (5), pp. 1324–1370.
[40] (2018) Semi-supervised optimal transport for heterogeneous domain adaptation. In IJCAI.
[41] (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), pp. 49–67.
[42] Learning deep features for discriminative localization. In CVPR.
Proof of Proposition 1
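We optimize the function with respect to $\alpha$:
$$\max_{\alpha \in \Sigma^L} J_\eta(\Pi, \alpha), \quad J_\eta(\Pi, \alpha) = \sum_{l=1}^{L} \alpha_l \phi_l - \eta \sum_{l=1}^{L} \alpha_l (\log \alpha_l - 1),$$
where $\phi_l = \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}\, c(x_i^{(l)}, y_j^{(l)})$.
Because the entropic regularization is a strongly convex function and its negative counterpart is a strongly concave function, the maximization problem is a concave optimization problem.
We consider the following objective function with the Lagrange multiplier $\lambda$:
$$\tilde{J}_\eta(\Pi, \alpha) = \sum_{l=1}^{L} \alpha_l \phi_l - \eta \sum_{l=1}^{L} \alpha_l (\log \alpha_l - 1) + \lambda \left( \sum_{l=1}^{L} \alpha_l - 1 \right).$$
Note that owing to the entropic regularization, the non-negative constraint is automatically satisfied.
Taking the derivative with respect to $\alpha_l$, we have
$$\frac{\partial \tilde{J}_\eta(\Pi, \alpha)}{\partial \alpha_l} = \phi_l - \eta \log \alpha_l + \lambda = 0.$$
Thus, the optimal $\alpha_l$ has the form:
$$\alpha_l = \exp\left( \frac{\phi_l}{\eta} \right) \exp\left( \frac{\lambda}{\eta} \right) > 0.$$
Because $\alpha$ satisfies the sum-to-one constraint, we have
$$\exp\left( \frac{\lambda}{\eta} \right) = \frac{1}{\sum_{l'=1}^{L} \exp\left( \frac{\phi_{l'}}{\eta} \right)}.$$
Hence, the optimal $\alpha_l$ is given by
$$\alpha_l^* = \frac{\exp\left( \frac{\phi_l}{\eta} \right)}{\sum_{l'=1}^{L} \exp\left( \frac{\phi_{l'}}{\eta} \right)}.$$
Substituting $\alpha^*$ into Eq. (2), we have
$$J_\eta(\Pi, \alpha^*) = \eta \log \left( \sum_{l=1}^{L} \exp\left( \frac{\phi_l}{\eta} \right) \right) + \eta.$$
Therefore, dropping the constant $\eta$, which does not depend on $\Pi$, the final objective function is given by
$$G_\eta(\Pi) = \eta \log \left( \sum_{l=1}^{L} \exp\left( \frac{\phi_l}{\eta} \right) \right).$$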
Proof of Proposition 2
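Proof: Let $G_\eta(\Pi) = \eta \log \sum_{l=1}^{L} \exp\left(\frac{1}{\eta} \langle \Pi, C_l \rangle\right)$ with $[C_l]_{ij} = c(x_i^{(l)}, y_j^{(l)})$. For $\Pi_1, \Pi_2 \in U(\mu, \nu)$ and $\theta \in (0, 1)$, we have
$$\sum_{l=1}^{L} \exp\left( \frac{1}{\eta} \langle \theta \Pi_1 + (1 - \theta) \Pi_2, C_l \rangle \right) = \sum_{l=1}^{L} \left[ \exp\left( \frac{1}{\eta} \langle \Pi_1, C_l \rangle \right) \right]^{\theta} \left[ \exp\left( \frac{1}{\eta} \langle \Pi_2, C_l \rangle \right) \right]^{1 - \theta} \leq \left( \sum_{l=1}^{L} \exp\left( \frac{1}{\eta} \langle \Pi_1, C_l \rangle \right) \right)^{\theta} \left( \sum_{l=1}^{L} \exp\left( \frac{1}{\eta} \langle \Pi_2, C_l \rangle \right) \right)^{1 - \theta}.$$
Here, we use Hölder’s inequality with $p = 1/\theta$, $q = 1/(1 - \theta)$, and $1/p + 1/q = 1$.
Applying the logarithm to both sides of the inequality and multiplying by $\eta$, we have
$$G_\eta(\theta \Pi_1 + (1 - \theta) \Pi_2) \leq \theta G_\eta(\Pi_1) + (1 - \theta) G_\eta(\Pi_2).$$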
Proof of Proposition 3
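For the distance function $d(x, y)$, we prove that
$$\mathrm{FRWD}_p(\mu, \nu) = \left( \min_{\Pi \in U(\mu, \nu)} \max_{\alpha \in \Sigma^L} \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} \sum_{l=1}^{L} \alpha_l\, d(x_i^{(l)}, y_j^{(l)})^p \right)^{1/p}$$
is a distance for $p \geq 1$.
It is clear that $\mathrm{FRWD}_p(\mu, \nu)$ is symmetric and $\mathrm{FRWD}_p(\mu, \mu) = 0$.
Let $\mu$, $\nu$, and $\xi$ be three discrete measures; we prove that
$$\mathrm{FRWD}_p(\mu, \xi) \leq \mathrm{FRWD}_p(\mu, \nu) + \mathrm{FRWD}_p(\nu, \xi).$$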
To simplify the notations in this proof, we define the distance ”matrix” such that is the th row and th column element of the matrix , and . Moreover, note that , the ”matrix” where each element is the element of raised to the power .
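Consider the optimal transportation plan $P$ of $\mathrm{FRWD}_p(\mu, \nu)$ and the optimal transportation plan $Q$ of $\mathrm{FRWD}_p(\nu, \xi)$, where $\mu$, $\nu$, and $\xi$ are the three measures above. Similarly to the proof for the Wasserstein distance in [26], let $S = P\, \mathrm{diag}(1/\tilde{b})\, Q$, where $\tilde{b}$ is the weight vector of $\nu$ with its zero entries replaced by ones. We can show that $S \in U(\mu, \xi)$.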
By letting and , the right-hand side of this inequality can be rewritten as
by the Minkowski inequality.