# LSMI-Sinkhorn: Semi-supervised Squared-Loss Mutual Information Estimation with Optimal Transport

Estimating mutual information is an important machine learning and statistics problem. To estimate the mutual information from data, a common practice is to prepare a set of paired samples. However, in some cases, it is difficult to obtain a large number of data pairs. To address this problem, we propose squared-loss mutual information (SMI) estimation using a small number of paired samples and the available unpaired ones. We first represent SMI through the density-ratio function, where the expectation is approximated by the samples from the marginals and their assignment parameters. The objective is formulated using the optimal transport problem and quadratic programming. Then, we introduce the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm for efficient optimization. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. We also evaluate and show the effectiveness of the proposed LSMI-Sinkhorn on various types of machine learning problems such as image matching and photo album summarization.


## 1 Introduction

Mutual information (MI) represents the statistical independence between two random variables [4], and it is widely used in various types of machine learning applications including feature selection [24, 25], dimensionality reduction [23], and causal inference [26]. More recently, deep neural network (DNN) models have started using MI as a regularizer for obtaining better representations from data, such as infoVAE [30] and deep infoMax [11]. Another application is improving generative adversarial networks (GANs) [8]. For instance, Mutual Information Neural Estimation (MINE) [1] was proposed to maximize or minimize the MI in deep networks and alleviate the mode-dropping issue in GANs. In all these examples, MI estimation is the core component.

In various MI estimation approaches, the probability density-ratio function is considered to be one of the most important components:

 r(x,y) = \frac{p(x,y)}{p(x)p(y)}.

A straightforward method to estimate this ratio is to estimate the probability densities (i.e., p(x,y), p(x), and p(y)) and then compute their ratio. However, directly estimating probability densities is difficult, making this two-step approach inefficient. To address the issue, Suzuki et al. [25] proposed to directly estimate the density ratio without going through density estimation [24, 25]. Nonetheless, the abovementioned methods require a large number of paired samples to estimate the MI.

In practical settings, we can often obtain only a small number of paired samples. For example, it requires a massive amount of human labor to obtain one-to-one correspondences from one language to another, which prevents us from easily measuring the MI across languages. Hence, a research question arises:

Can we perform mutual information estimation using unpaired samples and a small number of data pairs?

To answer the above question, in this paper, we propose a semi-supervised MI estimation algorithm, particularly designed for the squared-loss mutual information (SMI) (a.k.a. the Pearson divergence between p(x,y) and p(x)p(y)) [24]. We first formulate the SMI estimation as an optimal transport problem with density-ratio estimation. Then, we propose the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to solve the problem. The algorithm is computationally efficient; we analyze its complexity in Section 3. We present the connection between the proposed algorithm and the Gromov-Wasserstein distance [15], which is a popular distance for measuring the discrepancy between different domains. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. Finally, for image matching and photo album summarization, we show the effectiveness of our proposed method.

We summarize the contributions of this paper as follows:

• We propose a semi-supervised mutual information estimation approach that does not require a large number of paired samples.

• We formulate the MI estimation as a combination of density-ratio fitting and optimal transport.

• We propose the LSMI-Sinkhorn algorithm, which can be computed efficiently and whose loss is guaranteed to be monotonically decreasing.

• We establish a connection between the proposed method and the Gromov-Wasserstein distance.

## 2 Problem Formulation

In this section, we formulate the problem of squared-loss mutual information (SMI) estimation using a small number of paired samples.

Let X ⊂ R^{d_x} be the domain of vector x and Y ⊂ R^{d_y} be the domain of vector y. Suppose we are given n independent and identically distributed (i.i.d.) paired samples:

 \{(x_i, y_i)\}_{i=1}^{n},

where we consider the number n of paired samples to be small.

In addition to the paired samples, we suppose to have n_x and n_y i.i.d. samples from the marginal distributions:

 \{x_i\}_{i=n+1}^{n+n_x} \overset{\mathrm{i.i.d.}}{\sim} p(x) \quad \text{and} \quad \{y_j\}_{j=n+1}^{n+n_y} \overset{\mathrm{i.i.d.}}{\sim} p(y),

where the numbers of unpaired samples n_x and n_y are much larger than the number n of paired samples.

We also denote the unpaired samples by \{x'_i\}_{i=1}^{n_x} and \{y'_j\}_{j=1}^{n_y}, respectively. It should be noted that the numbers of input dimensions d_x and d_y, as well as the numbers of unpaired samples n_x and n_y, can be different.

This paper aims to estimate the SMI (a.k.a. the Pearson divergence between p(x,y) and p(x)p(y)) [24] from the small number of paired samples \{(x_i, y_i)\}_{i=1}^{n} by leveraging the unpaired samples \{x'_i\}_{i=1}^{n_x} and \{y'_j\}_{j=1}^{n_y}.

The SMI between two random variables X and Y is defined as

 \mathrm{SMI}(X,Y) = \frac{1}{2} \iint \bigl(r(x,y) - 1\bigr)^2 \, p(x)\,p(y)\,\mathrm{d}x\,\mathrm{d}y, \qquad (1)

where

 r(x,y) = \frac{p(x,y)}{p(x)p(y)} \qquad (2)

is the density-ratio function. SMI is 0 if and only if X and Y are independent (i.e., p(x,y) = p(x)p(y)), and takes a positive value otherwise.

If we have an estimate of the density-ratio function, we can approximate the SMI as

 \widehat{\mathrm{SMI}}(X,Y) = \frac{1}{2(n+n_x)(n+n_y)} \sum_{i=1}^{n+n_x} \sum_{j=1}^{n+n_y} \bigl(r_\alpha(x_i, y_j) - 1\bigr)^2,

where r_\alpha(x,y) is an estimate of the true density-ratio function parameterized by \alpha.
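As a concrete illustration, the plug-in estimate above can be sketched as follows (a minimal sketch: the ratio model `r` is passed in as a plain function and is a hypothetical stand-in for a fitted r_α):

```python
import numpy as np

def smi_plugin(r, xs, ys):
    # Plug-in SMI estimate: average of (r(x, y) - 1)^2 / 2 over all
    # cross-combinations of the pooled x and y samples.
    total = 0.0
    for x in xs:
        for y in ys:
            total += (r(x, y) - 1.0) ** 2
    return total / (2.0 * len(xs) * len(ys))

# For independent variables the true ratio is identically 1, so the
# plug-in estimate with the ideal ratio model is exactly 0.
xs = np.random.randn(5)
ys = np.random.randn(5)
print(smi_plugin(lambda x, y: 1.0, xs, ys))  # 0.0
```

With a constant ratio model r ≡ c, each term contributes (c − 1)²/2, which makes the estimator easy to sanity-check.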

However, since we do not have enough paired samples to estimate the ratio function directly, estimating the SMI from the limited number of paired samples is very challenging. Our key idea is to align the unpaired samples with the help of the paired samples and use them to improve the SMI estimation accuracy.

## 3 Proposed Method

In this section, we propose an SMI estimation algorithm that uses a limited number of paired samples and a large number of unpaired samples.

### 3.1 Least-Squares Mutual Information with Sinkhorn (LSMI-Sinkhorn)

Model: We employ the following density-ratio model:

 r_\alpha(x, y) = \sum_{\ell=1}^{b} \alpha_\ell K(\tilde{x}_\ell, x)\, L(\tilde{y}_\ell, y) = \alpha^\top \varphi(x, y), \qquad (3)

where \alpha = (\alpha_1, \ldots, \alpha_b)^\top \in R^b, K(x, x') and L(y, y') are kernel functions, and \varphi(x, y) = (K(\tilde{x}_1, x) L(\tilde{y}_1, y), \ldots, K(\tilde{x}_b, x) L(\tilde{y}_b, y))^\top. Here, \{\tilde{x}_\ell\}_{\ell=1}^{b} and \{\tilde{y}_\ell\}_{\ell=1}^{b} are sets of basis vectors sampled from the observed x and y samples, respectively.
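A minimal sketch of this kernel-based ratio model, assuming Gaussian kernels for K and L (the kernels used in the experiments) and with the kernel widths and basis points chosen arbitrarily for illustration:

```python
import numpy as np

def gauss(a, b, sigma=1.0):
    # Gaussian kernel between two vectors.
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * sigma ** 2))

def phi(x, y, basis_x, basis_y):
    # Basis vector phi(x, y): the l-th entry is K(x_tilde_l, x) * L(y_tilde_l, y).
    return np.array([gauss(xb, x) * gauss(yb, y) for xb, yb in zip(basis_x, basis_y)])

def r_alpha(alpha, x, y, basis_x, basis_y):
    # Density-ratio model r_alpha(x, y) = alpha^T phi(x, y).
    return float(np.dot(alpha, phi(x, y, basis_x, basis_y)))

# With a single basis point equal to (x, y) itself, both kernels are 1,
# so the model output equals alpha_1.
bx, by = [np.zeros(2)], [np.zeros(2)]
print(r_alpha(np.array([2.0]), np.zeros(2), np.zeros(2), bx, by))  # 2.0
```

Because the model is linear in α, fitting it reduces to the quadratic problems derived below.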

In this paper, we optimize \alpha by minimizing the squared difference between the true density-ratio function and the model:

 \frac{1}{2} \iint \Bigl( \frac{p(x,y)}{p(x)p(y)} - r_\alpha(x,y) \Bigr)^2 p(x)\,p(y)\,\mathrm{d}x\,\mathrm{d}y = \text{Const.} + \frac{1}{2} \iint r_\alpha(x,y)^2\, p(x)\,p(y)\,\mathrm{d}x\,\mathrm{d}y - \iint r_\alpha(x,y)\, p(x,y)\,\mathrm{d}x\,\mathrm{d}y. \qquad (4)

The second term of Eq. (4) can be approximated using a large number of unpaired samples. However, approximating the third term requires paired samples from the joint distribution. Because we have only a limited number of paired samples in our setting, the approximation of the third term can be poor.

To deal with this issue, we propose utilizing the unpaired samples to approximate the expectation in the third term. Specifically, we first introduce \beta (0 \le \beta \le 1) and represent the third term as

 \iint r_\alpha(x,y)\, p(x,y)\,\mathrm{d}x\,\mathrm{d}y \approx \frac{\beta}{n} \sum_{i=1}^{n} r_\alpha(x_i, y_i) + (1-\beta) \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, r_\alpha(x'_i, y'_j),

where \beta is a tuning parameter balancing the paired and unpaired terms, and \Pi = (\pi_{ij}) \in R^{n_x \times n_y}. Note that if we set \pi_{ij} to the reciprocal of the total number of pairs when x'_i and y'_j are paired and to 0 otherwise, then we can recover the original empirical estimate.

Then, for the density-ratio model (Eq. (3)), the loss function (Eq. (4)) can be approximated as

 J(\alpha, \Pi) = \frac{1}{2} \alpha^\top H \alpha - \alpha^\top h_{\Pi,\beta},

where

 H = \frac{1}{(n+n_x)(n+n_y)} \sum_{i=1}^{n+n_x} \sum_{j=1}^{n+n_y} \varphi(x_i, y_j)\, \varphi(x_i, y_j)^\top,
 \qquad h_{\Pi,\beta} = \frac{\beta}{n} \sum_{i=1}^{n} \varphi(x_i, y_i) + (1-\beta) \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, \varphi(x'_i, y'_j).
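Because \varphi(x, y) is a separable product of kernel similarities, H factorizes into a Hadamard product of two small b × b Gram matrices. The sketch below verifies this against the naive double sum, under toy sizes and Gaussian kernels with an arbitrary width (all names here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_matrix(A, B, s=1.0):
    # Gaussian kernel similarities between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s * s))

# Pooled samples and basis points (toy sizes; widths arbitrary).
X, Y = rng.normal(size=(6, 2)), rng.normal(size=(7, 2))
BX, BY = X[:3], Y[:3]
K = kernel_matrix(X, BX)   # (6, 3): K(x_tilde_l, x_i)
L = kernel_matrix(Y, BY)   # (7, 3): L(y_tilde_l, y_j)

# H = mean over all (i, j) combinations of phi phi^T. Because
# phi(x_i, y_j) = K[i] * L[j] (elementwise), H factorizes into a
# Hadamard product of two small b x b Gram matrices.
H = (K.T @ K) * (L.T @ L) / (len(X) * len(Y))

# Check against the naive double sum over all combinations.
H_naive = np.zeros_like(H)
for i in range(len(X)):
    for j in range(len(Y)):
        p = K[i] * L[j]
        H_naive += np.outer(p, p)
H_naive /= len(X) * len(Y)
assert np.allclose(H, H_naive)
```

This factorization avoids the explicit double sum over all sample combinations, which matters when n_x and n_y are large.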

Since we want to estimate the density-ratio function by minimizing Eq. (4), the optimization problem is given as

 \min_{\Pi, \alpha} \; J(\Pi, \alpha) = \frac{1}{2} \alpha^\top H \alpha - \alpha^\top h_{\Pi,\beta} + \epsilon H(\Pi) + \frac{\lambda}{2} \|\alpha\|_2^2
 \quad \text{s.t.} \quad \Pi 1_{n_y} = n_x^{-1} 1_{n_x} \;\text{ and }\; \Pi^\top 1_{n_x} = n_y^{-1} 1_{n_y}, \qquad (5)

where H(\Pi) is the negative entropic regularization used to ensure that \Pi is non-negative, \epsilon \ge 0 is its regularization parameter, \|\alpha\|_2^2 is the \ell_2-regularization, and \lambda \ge 0 is its regularization parameter.

The objective function is not jointly convex. However, if we fix one of the model parameters, it becomes convex in the other. Thus, we employ the alternating optimization approach (see Algorithm 1).

Optimizing \Pi using the Sinkhorn algorithm: With \alpha fixed, we have the relationship

 \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, \alpha^\top \varphi(x'_i, y'_j) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, [C_\alpha]_{ij},

where C_\alpha \in R^{n_x \times n_y} is the cost matrix with [C_\alpha]_{ij} = \alpha^\top \varphi(x'_i, y'_j). It is evident that this representation can be considered an optimal transport problem if we maximize it with respect to \Pi [5]. It should be noted that the rank of C_\alpha is at most b, the number of basis functions, which allows the cost matrix to be computed efficiently.
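The low-rank structure of the cost matrix is easy to verify numerically. The sketch below uses arbitrary positive matrices as stand-ins for the kernel similarity matrices to the b basis points:

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny, b = 30, 40, 5

# Stand-ins for the kernel similarity matrices to the b basis points.
K = rng.random((nx, b))
L = rng.random((ny, b))
alpha = rng.random(b)

# [C_alpha]_{ij} = sum_l alpha_l K_{il} L_{jl}, i.e. C_alpha = K diag(alpha) L^T,
# so its rank is at most b regardless of n_x and n_y.
C = K @ np.diag(alpha) @ L.T
assert np.linalg.matrix_rank(C) <= b
```

Keeping C_α in this factored form means it never needs to be materialized at full n_x × n_y cost when b is small.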

Thus, the optimization problem for \Pi can be written as

 \min_{\Pi} \; -\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij} (1-\beta) [C_\alpha]_{ij} + \epsilon H(\Pi)
 \quad \text{s.t.} \quad \Pi 1_{n_y} = n_x^{-1} 1_{n_x} \;\text{ and }\; \Pi^\top 1_{n_x} = n_y^{-1} 1_{n_y},

and this optimization problem can be efficiently solved using the Sinkhorn algorithm [5, 21]. In this paper, we use the log-stabilized Sinkhorn [20]. Note that the optimization problem is convex if we fix \alpha.
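A minimal plain (non-log-stabilized) Sinkhorn sketch for an entropy-regularized transport problem of this form with uniform marginals; the paper uses the log-stabilized variant [20], which differs only in numerical handling:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, iters=1000):
    # Entropy-regularized OT: approximately minimize <Pi, C> plus an
    # entropic penalty with weight eps, subject to row sums a and
    # column sums b (plain variant, no log-stabilization).
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)   # scale columns to match b
        u = a / (K @ v)     # scale rows to match a
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
C = rng.random((4, 5))        # stand-in for -(1 - beta) * C_alpha
a = np.full(4, 1.0 / 4)       # uniform row marginal n_x^{-1} 1
b = np.full(5, 1.0 / 5)       # uniform column marginal n_y^{-1} 1
P = sinkhorn(C, a, b)
assert np.allclose(P.sum(axis=1), a)
assert np.allclose(P.sum(axis=0), b, atol=1e-6)
```

Each iteration only rescales rows and columns, which is why the per-iteration cost is linear in the number of entries of the cost matrix.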

Optimizing \alpha: Next, we update \alpha with \Pi fixed. The optimization problem for \alpha is equivalent to

 \min_{\alpha} \; \frac{1}{2} \alpha^\top H \alpha - \alpha^\top h_{\Pi,\beta} + \frac{\lambda}{2} \|\alpha\|_2^2. \qquad (6)

Since this optimization problem is a convex quadratic program, the solution is given analytically as

 \hat{\alpha} = (H + \lambda I_b)^{-1} h_{\Pi,\beta}, \qquad (7)

where I_b is the b \times b identity matrix. Note that the H matrix depends on neither \Pi nor \beta.
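The α-step is thus a single regularized linear solve; a minimal sketch of Eq. (7) with a hand-picked H and h:

```python
import numpy as np

def update_alpha(H, h, lam=1e-2):
    # Closed-form ridge solution of Eq. (6): alpha = (H + lam * I)^{-1} h.
    b = H.shape[0]
    return np.linalg.solve(H + lam * np.eye(b), h)

# Diagonal toy example: with lam = 0, alpha_l = h_l / H_ll.
H = np.array([[2.0, 0.0], [0.0, 1.0]])
h = np.array([1.0, 1.0])
alpha = update_alpha(H, h, lam=0.0)
print(alpha)  # [0.5 1. ]
```

Since H is fixed across iterations, a factorization of (H + λI_b) could also be cached and reused for repeated solves.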

Convergence Analysis: To optimize J, we alternately solve the two convex optimization problems. Thus, the following property holds.

###### Proposition 1

Algorithm 1 will monotonically decrease the objective function in each iteration.

Proof. See the supplementary material.
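The mechanism behind Proposition 1 — exact alternating minimization of a bi-convex objective can never increase it — can be illustrated on a toy bi-convex function (this toy J is purely illustrative and is not the LSMI-Sinkhorn objective):

```python
import numpy as np

# Toy bi-convex objective J(u, v) = (u*v - 1)^2 + 0.1*(u^2 + v^2):
# convex in u for fixed v and vice versa, so exact coordinate-wise
# minimization can never increase J (the mechanism behind Prop. 1).
def J(u, v):
    return (u * v - 1.0) ** 2 + 0.1 * (u ** 2 + v ** 2)

def argmin_u(v):
    # dJ/du = 2v(uv - 1) + 0.2u = 0  ->  u = 2v / (2v^2 + 0.2)
    return 2.0 * v / (2.0 * v * v + 0.2)

u, v = 5.0, 0.3
losses = [J(u, v)]
for _ in range(20):
    u = argmin_u(v)
    v = argmin_u(u)   # same formula by the symmetry of J in (u, v)
    losses.append(J(u, v))

# Each exact coordinate minimization can only decrease the objective.
assert all(l2 <= l1 + 1e-12 for l1, l2 in zip(losses, losses[1:]))
```

The same argument applies to the Π-step and α-step above: each solves its convex subproblem exactly, so the joint objective is non-increasing.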

Model Selection: The LSMI-Sinkhorn algorithm includes several tuning parameters (i.e., \lambda and \beta), and determining them is critical to obtain a good estimate of SMI. Accordingly, we use cross-validation with a hold-out set to select the model parameters.

First, the paired samples are divided into two subsets, D_{tr} and D_{te}. Then, we train the density-ratio model r_{\hat\alpha}(x, y) using D_{tr} and the unpaired samples \{x'_i\}_{i=1}^{n_x} and \{y'_j\}_{j=1}^{n_y}. The hold-out error can be calculated by approximating Eq. (4) using the hold-out samples D_{te} as

 \hat{J}_{te} = \frac{1}{2|D_{te}|^2} \sum_{x, y \in D_{te}} r_{\hat\alpha}(x, y)^2 - \frac{1}{|D_{te}|} \sum_{(x, y) \in D_{te}} r_{\hat\alpha}(x, y),

where |D_{te}| denotes the number of samples in D_{te}, \sum_{x, y \in D_{te}} denotes the summation over all combinations of x and y in D_{te}, and \sum_{(x, y) \in D_{te}} denotes the summation over all pairs (x, y) in D_{te}. We select the parameters with the smallest \hat{J}_{te}.
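The hold-out criterion can be sketched directly from its definition; here `r` is again a stand-in for the fitted ratio model:

```python
def holdout_error(r, pairs):
    # Hold-out approximation of Eq. (4): quadratic term over all
    # cross-combinations of x and y in the test set, minus the linear
    # term over the true test pairs.
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    m = len(pairs)
    quad = sum(r(x, y) ** 2 for x in xs for y in ys) / (2.0 * m * m)
    lin = sum(r(x, y) for x, y in pairs) / m
    return quad - lin

# For the constant model r == 1 the error is 1/2 - 1 = -1/2.
pairs = [(0.0, 0.0), (1.0, 1.0)]
print(holdout_error(lambda x, y: 1.0, pairs))  # -0.5
```

Lower values of this criterion correspond to ratio models that better separate the joint distribution from the product of marginals on held-out pairs.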

Relation to the Gromov-Wasserstein: For \beta = 0, \epsilon = 0, and \lambda = 0, by substituting the optimal \alpha, the loss function (Eq. (5)) can be represented as

 J(\Pi, \hat\alpha) = -\frac{1}{2} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{i'=1}^{n_x} \sum_{j'=1}^{n_y} \pi_{ij}\, \pi_{i'j'}\, s(x'_i, y'_j, x'_{i'}, y'_{j'}),

where s(x'_i, y'_j, x'_{i'}, y'_{j'}) = \varphi(x'_i, y'_j)^\top H^{-1} \varphi(x'_{i'}, y'_{j'}).

Thus, the optimization problem for \Pi can be written as

 \max_{\Pi} \; \frac{1}{2} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{i'=1}^{n_x} \sum_{j'=1}^{n_y} \pi_{ij}\, \pi_{i'j'}\, s(x'_i, y'_j, x'_{i'}, y'_{j'})
 \quad \text{s.t.} \quad \Pi 1_{n_y} = n_x^{-1} 1_{n_x}, \; \Pi^\top 1_{n_x} = n_y^{-1} 1_{n_y}, \; \pi_{ij} \ge 0. \qquad (8)

This can be considered a relaxed variant of the quadratic assignment problem (QAP). Since the Gromov-Wasserstein distance is also related to a QAP [15], the proposed method is related to Gromov-Wasserstein.

Relation to Least-Squares Object Matching: In this section, we show that the LSOM algorithm [28, 27] can be considered a special case of the proposed framework.

If \Pi is a permutation matrix and n_x = n_y = n', \Pi satisfies

 \Pi \in \{0, 1\}^{n' \times n'}, \quad \Pi 1_{n'} = 1_{n'}, \quad \text{and} \quad \Pi^\top 1_{n'} = 1_{n'}.

Note that for the Sinkhorn formulation we only assume \pi_{ij} \ge 0.

Then, the estimation of SMI using the permutation matrix can be written as

 \widehat{\mathrm{SMI}}(X,Y) = \frac{\beta}{2n} \sum_{i=1}^{n} r_\alpha(x_i, y_i) + \frac{1-\beta}{2n'} \sum_{i=1}^{n'} r_\alpha(x'_i, y'_{\pi(i)}) - \frac{1}{2},

where \pi(\cdot) is the permutation function. The optimization problem is written as

 \min_{\Pi, \alpha} \; \frac{1}{2} \alpha^\top H \alpha - \alpha^\top h_{\Pi,\beta} + \frac{\lambda}{2} \|\alpha\|_2^2
 \quad \text{s.t.} \quad \Pi 1_{n'} = 1_{n'}, \; \Pi^\top 1_{n'} = 1_{n'}, \; \Pi \in \{0, 1\}^{n' \times n'}. \qquad (9)

To solve this problem, we can use the Hungarian algorithm [14] instead of the Sinkhorn algorithm [5] to optimize \Pi. It is noteworthy that in the original LSOM algorithm, the permutation matrix is introduced to permute the Gram matrix, and the permutation is also included in the computation of H. In contrast, in our formulation, the permutation matrix depends only on h_{\Pi,\beta}. This is a small difference in formulation; however, owing to this difference, we can guarantee the monotonic decrease of the loss function of the proposed algorithm.

Since LSOM finds an exact alignment, it is more suited to finding exact matches among samples. In contrast, the proposed Sinkhorn formulation is more suited when there are no exact matches. Moreover, the LSOM formulation can only handle the same number of samples (i.e., n_x = n_y). In terms of computational complexity, the Hungarian algorithm requires O(n'^3) computation, while the Sinkhorn algorithm requires O(n_x n_y) per iteration.

Computational Complexity: The computational complexity of estimating \Pi consists of computing the cost matrix C_\alpha and running the Sinkhorn iterations. Owing to its rank-b factorization, the cost matrix can be computed in O(n_x n_y b) time, and each Sinkhorn iteration costs O(n_x n_y). For the \alpha computation, H needs to be computed only once at initialization, while forming h_{\Pi,\beta} costs O(n_x n_y b) per iteration; solving the b \times b linear system of Eq. (7) costs O(b^3), which is small for a moderate number of basis functions. Therefore, for small n and b, the overall computational complexity of each iteration is O(n_x n_y).

In contrast, the complexity of computing the objective function of the Gromov-Wasserstein distance is O(n_x^2 n_y^2) in general and O(n_x^2 n_y + n_x n_y^2) for some specific losses (e.g., the squared loss or the Kullback-Leibler loss) [18]. Moreover, the Gromov-Wasserstein problem is generally NP-hard for arbitrary inputs [18, 17].

## 4 Related Work

The proposed algorithm is related to MI estimation. Moreover, our LSMI-Sinkhorn algorithm is closely related to the Gromov-Wasserstein distance and kernelized sorting.

Mutual information estimation: To estimate the MI, the simplest approach is to estimate the probability densities p(x, y) from the paired samples, p(x) from \{x_i\}, and p(y) from \{y_j\}, and then compute their ratio. However, because probability density estimation is itself a difficult problem, this naive approach does not tend to work well. To handle this, density-ratio based approaches are promising [25, 24]. More recently, a deep learning based mutual information estimation algorithm has been proposed [1]. However, these approaches still require a large number of paired samples to estimate the models. Thus, if we have a limited number of paired samples, existing approaches are not efficient.

Most recently, the Wasserstein Dependency Measure (WDM), which measures the discrepancy between the joint probability p(x, y) and the product of its marginals p(x) and p(y), has been proposed and used for representation learning [16]. Since WDM can be used as an independence measure, it is closely related to LSMI-Sinkhorn. However, WDM focuses on finding a good representation by maximizing the dependency (i.e., maximizing the mutual information), while we focus on estimating the true SMI.

Gromov-Wasserstein and Kernelized Sorting: Given two sets of vectors in different spaces, the Gromov-Wasserstein distance [15] can be used to find the optimal alignment between them. This method considers the pairwise distances between samples within each set to build a distance matrix, and then finds a matching by minimizing the difference between the pairwise distance matrices:

 \min_{\Pi} \; \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{i'=1}^{n_x} \sum_{j'=1}^{n_y} \pi_{ij}\, \pi_{i'j'} \bigl(D(x_i, x_{i'}) - D(y_j, y_{j'})\bigr)^2
 \quad \text{s.t.} \quad \Pi 1_{n_y} = a, \; \Pi^\top 1_{n_x} = b, \; \pi_{ij} \ge 0,

where D(\cdot, \cdot) is a distance function, and a and b are the marginal weight vectors (e.g., uniform weights a = n_x^{-1} 1_{n_x} and b = n_y^{-1} 1_{n_y}).

Therefore, the alignment could be estimated first, followed by estimating the SMI from the aligned samples. However, the Gromov-Wasserstein distance must solve the quadratic assignment problem (QAP), which is generally NP-hard for arbitrary inputs [18, 17]. In this work, we estimate the SMI by simultaneously solving the alignment and fitting the density ratio, efficiently leveraging the Sinkhorn algorithm and properties of the squared loss. Moreover, we show that our approach can be considered an instance of the Gromov-Wasserstein formulation by properly setting the cost function. Recently, a semi-supervised Gromov-Wasserstein-based optimal transport method has been proposed and applied to heterogeneous domain adaptation problems [29]. Their approach can handle tasks similar to those addressed in this paper. However, their method cannot be used to measure independence.

Kernelized sorting [12, 19, 6] is closely related to the Gromov-Wasserstein distance. Specifically, kernelized sorting determines a set of paired samples by maximizing the Hilbert-Schmidt independence criterion (HSIC) between samples [9]. However, kernelized sorting can only handle the same number of samples (i.e., n_x = n_y).

## 5 Experiments

In this section, we evaluate the proposed algorithm using synthetic data and benchmark datasets.

### 5.1 Setup

For all methods, we use Gaussian kernels:

 K(x, x') = \exp\!\Bigl(-\frac{\|x - x'\|_2^2}{2\sigma_x^2}\Bigr), \qquad L(y, y') = \exp\!\Bigl(-\frac{\|y - y'\|_2^2}{2\sigma_y^2}\Bigr),

where \sigma_x and \sigma_y denote the kernel widths, which are set using the median heuristic [22]:

 \sigma_x = 2^{-1/2}\, \mathrm{median}\bigl(\{\|x_i - x_j\|_2\}_{i,j=1}^{n_x}\bigr), \qquad \sigma_y = 2^{-1/2}\, \mathrm{median}\bigl(\{\|y_i - y_j\|_2\}_{i,j=1}^{n_y}\bigr).
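A sketch of the median heuristic, assuming the median is taken over the distinct pairs i < j (conventions vary on whether the zero diagonal is included):

```python
import numpy as np

def median_heuristic(Z):
    # Kernel width via the median heuristic:
    # sigma = 2^{-1/2} * median of pairwise Euclidean distances (i < j).
    n = len(Z)
    d = [np.linalg.norm(Z[i] - Z[j]) for i in range(n) for j in range(i + 1, n)]
    return np.median(d) / np.sqrt(2.0)

# Points 0, 1, 2 on a line: pairwise distances {1, 2, 1}, median 1,
# so sigma = 1 / sqrt(2).
Z = np.array([[0.0], [1.0], [2.0]])
print(median_heuristic(Z))
```

This heuristic gives a data-dependent scale without any tuning, which is why both kernel widths can be fixed before cross-validation.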

We fix the number of basis functions b, the entropic regularization parameter \epsilon, the maximum number of iterations, and the stopping tolerance, and choose the parameters \lambda and \beta by cross-validation.

### 5.2 Convergence and Runtime

We first demonstrate the convergence of the loss function and the estimated SMI value. Here, we generate synthetic data and randomly choose a small number of paired samples and a large number of unpaired samples. The convergence curves are shown in Figure 1. The values of the loss and the SMI estimate converge quickly (within five iterations), which is consistent with Proposition 1.

Then, we compare the runtimes of the proposed LSMI-Sinkhorn and the Gromov-Wasserstein distance for CPU and GPU implementations. The data are sampled from two 2D random measures, with a varying number of unpaired samples and a small number of paired samples (the latter only for LSMI-Sinkhorn). For Gromov-Wasserstein, we use the CPU implementation from the Python Optimal Transport toolbox [7] and the PyTorch GPU implementation from [2]. We use the squared loss function and set the entropic regularization to 0.005 according to the original code. For LSMI-Sinkhorn, we implement the CPU and GPU versions using numpy and PyTorch, respectively. For a fair comparison, we use the log-stabilized Sinkhorn algorithm with the same early stopping criteria and maximum iterations as in Gromov-Wasserstein. As shown in Figure 2, in comparison to Gromov-Wasserstein, LSMI-Sinkhorn is more than one order of magnitude faster for the CPU version and several times faster for the GPU version. This is consistent with our computational complexity analysis. Moreover, the GPU version of our algorithm takes only 3.47s for the largest number of unpaired samples tested, indicating that it is suitable for large-scale applications.

### 5.3 SMI Estimation

For SMI estimation, we set up four baselines:

• LSMI (full): All paired samples are used for cross-validation and SMI estimation. It is considered the ground-truth value.

• LSMI: Only the (usually small) number n of paired samples is used for cross-validation and SMI estimation.

• LSMI (opt): Only the n paired samples are used for SMI estimation, but with the optimal parameters taken from LSMI (full). This can be seen as an upper bound of SMI estimation with a limited number of paired data, because the optimal parameters are usually unavailable.

• Gromov-SMI: The Gromov-Wasserstein distance is applied to the unpaired samples to find potential matchings. Then, the matched pairs and the existing paired samples are combined to perform cross-validation and SMI estimation.

Synthetic Data: In this experiment, we generated four types of paired samples: independent random normal data and three types of dependent data with linear and nonlinear relationships. We varied the number of paired samples n while fixing the numbers of unpaired samples for Gromov-SMI and the proposed LSMI-Sinkhorn. The model parameters \lambda and \beta were selected by cross-validation using the paired examples. The results are shown in Figure 3. In the random case, the data are nearly independent and our algorithm achieves a small SMI value. In the other cases, LSMI-Sinkhorn yields a better estimation of the SMI value, and its estimate lies near the ground truth as n increases. In contrast, Gromov-SMI produces a small estimate, which may be due to incorrect potential matchings.

UCI Datasets: We selected four benchmark datasets from the UCI machine learning repository. For each dataset, we split the features into two sets to serve as paired samples. To ensure high dependence between the two subsets of features, we utilized the same splitting strategy as [19], based on the correlation matrix. The experimental setting was otherwise the same as for the synthetic data. We show the SMI estimation results in Figure 4. Similarly, LSMI-Sinkhorn obtains better estimates on all four datasets. In most cases, Gromov-SMI tends to overestimate the value by a large margin, while the other baselines underestimate it.

### 5.4 Deep Image Matching

Next, we consider an image matching task with deep convolutional features. We use two commonly used image classification benchmarks: CIFAR10 [13] and STL10 [3]. We extracted 64-dim features from the last layer of a ResNet20 [10] pretrained on the training set of CIFAR10. The features are divided into two 32-dim parts, denoted by x and y. We shuffle the samples of y and attempt to match x and y given a limited number of paired samples and a large number of unpaired samples. Other settings are the same as in the above experiments.

To evaluate the matching performance, we used top-1 accuracy, top-2 accuracy (correct matching is achieved within the top-2 highest scores), and class accuracy (matched samples are in the same class). As shown in Figure 5, LSMI-Sinkhorn obtains high accuracy with only a few tens of supervised pairs. Additionally, the high class-matching performance implies that our algorithm can be applied to further applications such as semi-supervised image classification.

We then investigate the impact of the Sinkhorn regularization parameter \epsilon. With the number of paired samples fixed to 50, we show the matching accuracy of CIFAR10 and STL10 for varying \epsilon in Figure 6. The matching accuracy gradually drops as \epsilon increases. This is due to an intrinsic property of Sinkhorn regularization: with larger \epsilon, the assignment matrix \Pi becomes smoother, and thereby the matching accuracy drops.

### 5.5 Photo Album Summarization

Finally, we apply the proposed LSMI-Sinkhorn to the photo album summarization problem, where images are matched to a predefined structure according to the Cartesian coordinate system.

Color Feature: We first used 320 images collected from Flickr [19] and extracted the raw RGB pixels as the color feature. Figures 6(a) and 6(b) depict the semi-supervised summarization onto triangle and rectangular grids, with the corners of the grids fixed to green, orange, and black (triangle) and blue (rectangle) images. Similarly, we show the summarization results on an "AAAI 2020" grid with the center of each character fixed. These layouts show good color topology with respect to the fixed color images.

Semantic Feature: We then used CIFAR10 with the ResNet20 features to illustrate semantic album summarization. Figure 8 shows the layout of 1000 images onto the same triangle, rectangle, and "AAAI 2020" grids. For Figures 7(a) and 7(b), we fixed the corners of the grids to automobile, airplane, and horse (triangle) and dog (rectangle) images. For Figure 7(c), we fixed the corresponding character centers. Similar objects are aligned together by their semantic meanings rather than colors, with respect to the fixed images.

In comparison to previous summarization algorithms, LSMI-Sinkhorn has two advantages. First, the semi-supervised property enables interactive album summarization, which kernelized sorting [19, 6] and object matching [28] cannot provide. Second, we obtain a solution for general rectangular matching (n_x \ne n_y), e.g., 320 images onto a triangle grid or 1000 images onto the above grids, while most previous methods [19, 28] rely on the Hungarian algorithm [14] and are restricted to square matching, which is not as flexible as the proposed method.

## 6 Conclusion

In this paper, we proposed the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to estimate the SMI from a limited number of paired samples. To the best of our knowledge, this is the first semi-supervised SMI estimation algorithm. Through experiments on synthetic and real-world examples, we showed that the proposed algorithm can successfully estimate SMI with a small number of paired samples. Moreover, we demonstrated that the proposed algorithm can be used for image matching and photo album summarization.

## References

• Belghazi et al. [2018] Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Hjelm, D.; and Courville, A. 2018. Mutual information neural estimation. In ICML.
• Bunne et al. [2019] Bunne, C.; Alvarez-Melis, D.; Krause, A.; and Jegelka, S. 2019. Learning generative models across incomparable spaces. In ICML.
• Coates, Ng, and Lee [2011] Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In AISTATS.
• Cover and Thomas [2006] Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2nd edition.
• Cuturi [2013] Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS.
• Djuric, Grbovic, and Vucetic [2012] Djuric, N.; Grbovic, M.; and Vucetic, S. 2012. Convex kernelized sorting. In AAAI.
• Flamary and Courty [2017] Flamary, R., and Courty, N. 2017. Pot python optimal transport library.
• Goodfellow et al. [2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
• Gretton et al. [2005] Gretton, A.; Bousquet, O.; Smola, A.; and Schölkopf, B. 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT.
• He et al. [2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
• Hjelm et al. [2019] Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In ICLR.
• Jebara [2004] Jebara, T. 2004. Kernelized sorting, permutation, and alignment for minimum volume PCA. In COLT.
• Krizhevsky and others [2009] Krizhevsky, A., et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
• Kuhn [1955] Kuhn, H. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2):83–97.
• Mémoli [2011] Mémoli, F. 2011. Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11(4):417–487.
• Ozair et al. [2019] Ozair, S.; Lynch, C.; Bengio, Y.; Oord, A. v. d.; Levine, S.; and Sermanet, P. 2019. Wasserstein dependency measure for representation learning. NeurIPS.
• Peyré and Cuturi [2019] Peyré, G., and Cuturi, M. 2019. Computational optimal transport. Foundations and Trends® in Machine Learning 11(5-6):355–607.
• Peyré, Cuturi, and Solomon [2016] Peyré, G.; Cuturi, M.; and Solomon, J. 2016. Gromov–Wasserstein averaging of kernel and distance matrices. In ICML.
• Quadrianto et al. [2010] Quadrianto, N.; Smola, A.; Song, L.; and Tuytelaars, T. 2010. Kernelized sorting. IEEE Transactions on Pattern Analysis and Machine Intelligence 32:1809–1821.
• Schmitzer [2019] Schmitzer, B. 2019. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41(3):A1443–A1481.
• Sinkhorn [1974] Sinkhorn, R. 1974. Diagonal equivalence to matrices with prescribed row and column sums. Proceedings of the American Mathematical Society 45(2):195–198.
• Sriperumbudur et al. [2009] Sriperumbudur, B. K.; Fukumizu, K.; Gretton, A.; Lanckriet, G. R.; and Schölkopf, B. 2009. Kernel choice and classifiability for RKHS embeddings of probability distributions. In NIPS.
• Suzuki and Sugiyama [2010] Suzuki, T., and Sugiyama, M. 2010. Sufficient dimension reduction via squared-loss mutual information estimation. In AISTATS.
• Suzuki et al. [2009] Suzuki, T.; Sugiyama, M.; Kanamori, T.; and Sese, J. 2009. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics 10(S52).
• Suzuki, Sugiyama, and Tanaka [2009] Suzuki, T.; Sugiyama, M.; and Tanaka, T. 2009. Mutual information approximation via maximum likelihood estimation of density ratio. In ISIT.
• Yamada and Sugiyama [2010] Yamada, M., and Sugiyama, M. 2010. Dependence minimizing regression with model selection for non-linear causal inference under non-gaussian noise. In AAAI.
• Yamada and Sugiyama [2011] Yamada, M., and Sugiyama, M. 2011. Cross-domain object matching with model selection. In AISTATS.
• Yamada et al. [2015] Yamada, M.; Sigal, L.; Raptis, M.; Toyoda, M.; Chang, Y.; and Sugiyama, M. 2015. Cross-domain matching with squared-loss mutual information. IEEE transactions on Pattern Analysis and Machine Intelligence 37(9):1764–1776.
• Yan et al. [2018] Yan, Y.; Li, W.; Wu, H.; Min, H.; Tan, M.; and Wu, Q. 2018. Semi-supervised optimal transport for heterogeneous domain adaptation. In IJCAI.
• Zhao, Song, and Ermon [2017] Zhao, S.; Song, J.; and Ermon, S. 2017. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.