Mutual information (MI) measures the statistical dependence between two random variables, and it is widely used in machine learning applications including feature selection [24, 25], dimensionality reduction [23], and causal inference [26]. More recently, deep neural network (DNN) models have started using MI as a regularizer for obtaining better representations from data, as in InfoVAE [30] and Deep InfoMax [11]. Another application is improving generative adversarial networks (GANs) [8]. For instance, Mutual Information Neural Estimation (MINE) [1] was proposed to maximize or minimize the MI in deep networks and alleviate the mode-dropping issue in GANs. In all these examples, MI estimation is at the core of the application.
In various MI estimation approaches, the probability density-ratio function is considered to be one of the most important components:
$$r(x, y) = \frac{p(x, y)}{p(x)\,p(y)}.$$
A straightforward method to estimate this ratio is to estimate the probability densities (i.e., $p(x, y)$, $p(x)$, and $p(y)$) and then calculate their ratio. However, directly estimating probability densities is difficult, which makes this two-step approach inefficient. To address this issue, Suzuki et al. proposed to estimate the density ratio directly, avoiding density estimation [24, 25]. Nonetheless, the abovementioned methods require a large number of paired samples to estimate the MI.
In practical settings, we can often obtain only a small number of paired samples. For example, it requires a massive amount of human labor to obtain one-to-one correspondences from one language to another, which prevents us from easily measuring the MI across languages. Hence, a research question arises:
Can we perform mutual information estimation using unpaired samples and a small number of data pairs?
To answer the above question, in this paper, we propose a semi-supervised MI estimation algorithm, particularly designed for the squared-loss mutual information (SMI) (a.k.a. the $\chi^2$-divergence between $p(x, y)$ and $p(x)p(y)$). We first formulate SMI estimation as an optimal transport problem with density-ratio estimation. Then, we propose the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to solve the problem. The algorithm has the computational complexity of ; hence, it is computationally efficient. We present the connection between the proposed algorithm and the Gromov-Wasserstein distance, which is a popular distance for measuring the discrepancy between different domains. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. Finally, we show the effectiveness of the proposed method for image matching and photo album summarization.
We summarize the contributions of this paper as follows:
- We propose a semi-supervised mutual information estimation approach that does not require a large number of paired samples.
- We formulate MI estimation as a combination of density-ratio fitting and optimal transport.
- We propose the LSMI-Sinkhorn algorithm, which can be computed efficiently and whose loss is guaranteed to decrease monotonically.
- We establish a connection between the proposed method and the Gromov-Wasserstein distance.
2 Problem Formulation
In this section, we formulate the problem of squared-loss mutual information (SMI) estimation using a small number of paired samples.
Let $\mathcal{X}$ be the domain of vector $x$ and $\mathcal{Y}$ be the domain of vector $y$. Suppose we are given $n$ independent and identically distributed (i.i.d.) paired samples:
$$\{(x_i, y_i)\}_{i=1}^{n},$$
where the number of paired samples $n$ is assumed to be small.
In addition to the paired samples, we assume that we have $n_x$ and $n_y$ i.i.d. samples from the marginal distributions:
where the numbers of unpaired samples $n_x$ and $n_y$ are much larger than the number of paired samples $n$. For example, and .
We also denote and , respectively. It should be noted that the numbers of input dimensions and and the number of samples and can be different.
This paper aims to estimate the SMI (a.k.a. the $\chi^2$-divergence between $p(x, y)$ and $p(x)p(y)$) from the paired samples $\{(x_i, y_i)\}_{i=1}^{n}$ by leveraging the unpaired samples.
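The SMI between two random variables $X$ and $Y$ is defined as
$$\mathrm{SMI}(X, Y) = \frac{1}{2} \iint \left( r(x, y) - 1 \right)^2 p(x)\,p(y)\, dx\, dy,$$
where
$$r(x, y) = \frac{p(x, y)}{p(x)\,p(y)}$$
is the density-ratio function. SMI takes 0 if and only if $X$ and $Y$ are independent (i.e., $p(x, y) = p(x)p(y)$), and takes a positive value if they are not independent.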
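As a quick sanity check of the definition, the following sketch evaluates the discrete analogue of SMI on a small joint probability table. The function name `smi_discrete` and the two example tables are illustrative assumptions, not part of the paper, and the code assumes that every marginal probability is strictly positive.

```python
import numpy as np

def smi_discrete(P):
    """Discrete analogue of SMI for a joint probability table P (rows: x, cols: y).

    Assumes every marginal probability is strictly positive.
    """
    P = np.asarray(P, dtype=float)
    px = P.sum(axis=1, keepdims=True)   # marginal of x, shape (nx, 1)
    py = P.sum(axis=0, keepdims=True)   # marginal of y, shape (1, ny)
    r = P / (px * py)                   # density ratio p(x, y) / (p(x) p(y))
    return 0.5 * float(np.sum((r - 1.0) ** 2 * px * py))

# Hypothetical example tables.
independent = np.outer([0.3, 0.7], [0.4, 0.6])   # p(x, y) = p(x) p(y)
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])   # x determines y

smi_independent = smi_discrete(independent)      # numerically zero
smi_dependent = smi_discrete(dependent)          # strictly positive
```

The independent table gives a value that is zero up to rounding, while the fully dependent table gives a strictly positive value, in line with the property stated above.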
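If we have an estimate of the density-ratio function, we can approximate the SMI as
$$\widehat{\mathrm{SMI}}(X, Y) = \frac{1}{2n} \sum_{i=1}^{n} r_{\alpha}(x_i, y_i) - \frac{1}{2},$$
where $r_{\alpha}(x, y)$ is an estimate of the true density-ratio function parameterized by $\alpha$.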
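To see why an estimate of the density ratio suffices, the short derivation below starts from the standard SMI definition and uses $r(x, y)\,p(x)\,p(y) = p(x, y)$. It is a sketch in the usual least-squares MI notation; the symbols $r$, $r_{\alpha}$, and $n$ are assumed to match those of the paper.

```latex
\begin{align*}
\mathrm{SMI}(X, Y)
&= \frac{1}{2} \iint \bigl( r(x, y) - 1 \bigr)^2 p(x)\,p(y)\, dx\, dy \\
&= \frac{1}{2} \iint r(x, y)^2\, p(x)\,p(y)\, dx\, dy
   - \iint r(x, y)\, p(x)\,p(y)\, dx\, dy + \frac{1}{2} \\
&= \frac{1}{2} \iint r(x, y)\, p(x, y)\, dx\, dy - 1 + \frac{1}{2} \\
&= \frac{1}{2}\, \mathbb{E}_{p(x, y)}\bigl[ r(x, y) \bigr] - \frac{1}{2}
 \;\approx\; \frac{1}{2n} \sum_{i=1}^{n} r_{\alpha}(x_i, y_i) - \frac{1}{2}.
\end{align*}
```

The last step replaces the expectation over the joint distribution by an average over paired samples, which is why a small number of pairs makes the estimate unreliable.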
However, because we do not have enough paired samples to estimate the ratio function, estimating the SMI from the limited number of paired samples is very challenging. The key idea is to align the unpaired samples using the paired samples and use them to improve the SMI estimation accuracy.
3 Proposed Method
In this section, we propose an SMI estimation algorithm for the setting with a limited number of paired samples and a large number of unpaired samples.
3.1 Least-Squares Mutual Information with Sinkhorn (LSMI-Sinkhorn)
Model: We employ the following density-ratio model:
where , and are the kernel functions, , , and . and are the sets of basis vectors which are sampled from and , respectively.
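A minimal sketch of a product-kernel density-ratio model of the kind used in least-squares MI estimation is given below. It assumes Gaussian kernels and the linear-in-parameters form $r_{\alpha}(x, y) = \sum_{\ell} \alpha_{\ell} K(x, \tilde{x}_{\ell}) L(y, \tilde{y}_{\ell})$; the helper names, the choice of basis vectors, and the kernel widths are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between rows of A (m, d) and rows of B (b, d)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def ratio_model(alpha, x, y, x_basis, y_basis, sigma_x, sigma_y):
    """Evaluate r_alpha(x_i, y_i) = sum_l alpha_l K(x_i, xb_l) L(y_i, yb_l)."""
    K = gaussian_kernel(x, x_basis, sigma_x)   # (m, b)
    L = gaussian_kernel(y, y_basis, sigma_y)   # (m, b)
    return (K * L) @ alpha                     # (m,)

# Hypothetical toy data: 5 samples, 3 basis vectors taken from the samples.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(5, 3))
x_basis, y_basis = x[:3], y[:3]
alpha = np.ones(3)
r_vals = ratio_model(alpha, x, y, x_basis, y_basis, sigma_x=1.0, sigma_y=1.0)
```

Because the model is linear in `alpha`, a squared-loss fitting criterion stays quadratic in the parameters once the kernel features are computed.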
In this paper, we optimize by minimizing the difference between the true density-ratio function and its ratio model:
The second term of Eq. (3.1) can be approximated using a large number of unpaired samples. However, approximating the third term requires paired samples from the joint distribution. Because we have a limited number of paired samples in our setting, the approximation of the third term can be poor.
To deal with this issue, we propose using unpaired samples to approximate the expectation in the third term. Specifically, we first introduce () and we represent the third term as
where is a tuning parameter between the terms of paired and unpaired samples. Note that if we set where is one if and are paired and 0 otherwise, and is the total number of pairs, then we can recover the original empirical estimate.
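The mixing of paired and unpaired samples described above can be sketched as follows. The code assumes that the third term is estimated by a vector `h` that blends the paired-sample average with a coupling-weighted sum over all unpaired combinations, and that the coupling `Pi` is non-negative and sums to one. The function and variable names are hypothetical, and the exact normalization used in the paper may differ.

```python
import numpy as np

def mixed_h(K_pair, L_pair, K_unp, L_unp, Pi, beta):
    """Blend paired and unpaired estimates of E_{p(x,y)}[phi(x, y)].

    K_pair, L_pair: (n, b) kernel features of the paired samples.
    K_unp: (n_x, b) and L_unp: (n_y, b) kernel features of unpaired samples.
    Pi: (n_x, n_y) non-negative coupling assumed to sum to one.
    beta: weight in [0, 1] on the paired term.
    """
    h_paired = np.mean(K_pair * L_pair, axis=0)
    # sum_{i,j} Pi[i, j] * K_unp[i, l] * L_unp[j, l] for each basis index l
    h_unpaired = np.sum(K_unp * (Pi @ L_unp), axis=0)
    return beta * h_paired + (1.0 - beta) * h_unpaired

# If the "unpaired" samples are in fact the pairs and Pi = I / n,
# the blended estimate reduces to the ordinary paired estimate.
rng = np.random.default_rng(0)
K = rng.random((4, 3))
L = rng.random((4, 3))
Pi_identity = np.eye(4) / 4
h_mixed = mixed_h(K, L, K, L, Pi_identity, beta=0.3)
h_plain = np.mean(K * L, axis=0)
```

The usage example mirrors the remark in the text: with a coupling that puts mass only on true pairs, the original empirical estimate is recovered.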
Then, for the density-ratio model (Eq. (3)), the loss function (Eq. (3.1)) can be approximated as
Since we want to estimate the density-ratio function by minimizing Eq. (3.1), the optimization problem is given as
where is the negative entropic regularization to ensure non-negative, is the regularization parameter, is the regularization, and is the regularization parameter.
The objective function is not jointly convex. However, if we fix one of the two sets of model parameters, it becomes convex in the other. Thus, we employ an alternating optimization approach (see Algorithm 1).
Optimizing using the Sinkhorn algorithm: With fixing , because we have the relationship:
where , , and . It is evident that this representation can be considered to be an optimal transport problem if we maximize it with respect to . It should be noted that the rank of is at most , where is a constant (e.g., ), and the computational complexity for the cost matrix is .
Thus, the optimization problem for can be written as
and this optimization problem can be efficiently solved using the Sinkhorn algorithm [5, 21]. In this paper, we use the log-stabilized Sinkhorn . Note that the optimization problem is convex if we fix .
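For illustration, a plain log-domain Sinkhorn iteration for an entropy-regularized transport problem is sketched below. This is a simplified stand-in, not the stabilized sparse scaling algorithm cited in the text; the uniform marginals, the toy score matrix `C`, and all function names are assumptions made for the example. Maximizing a score is handled by passing its negation as the cost.

```python
import numpy as np

def logsumexp(Z, axis):
    """Numerically stable log(sum(exp(Z))) along one axis."""
    z_max = np.max(Z, axis=axis, keepdims=True)
    return np.squeeze(z_max, axis=axis) + np.log(np.sum(np.exp(Z - z_max), axis=axis))

def sinkhorn_log(M, a, b, eps, n_iter=500):
    """Entropic OT: min_P <P, M> + eps * sum P (log P - 1), s.t. P 1 = a, P^T 1 = b.

    Updates the dual potentials f and g in the log domain for stability.
    """
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iter):
        # Choose f so that the rows of P sum to a.
        f = eps * (log_a - logsumexp((g[None, :] - M) / eps, axis=1))
        # Choose g so that the columns of P sum to b.
        g = eps * (log_b - logsumexp((f[:, None] - M) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - M) / eps)

# Hypothetical score matrix to be maximized; the cost is its negation.
rng = np.random.default_rng(0)
C = rng.random((5, 4))
a = np.full(5, 1.0 / 5)
b = np.full(4, 1.0 / 4)
Pi = sinkhorn_log(-C, a, b, eps=0.5)
```

Working with the potentials instead of the scaling vectors avoids overflow and underflow when the regularization parameter is small, which is the motivation for log-domain variants.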
Optimizing : Next, we update with given . The optimization problem for is equivalent to
Since the optimization problem is a convex quadratic program, the solution is given analytically as
where $I$ is the identity matrix. Note that the $H$ matrix depends on neither $\Pi$ nor $\alpha$.
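The analytic update can be sketched as a regularized linear solve. The code below assumes an objective of the form $\frac{1}{2}\alpha^\top H \alpha - \alpha^\top h + \frac{\lambda}{2}\|\alpha\|_2^2$, whose minimizer is $(H + \lambda I)^{-1} h$; the names `solve_alpha`, `H`, `h`, and `lam`, as well as the toy data, are illustrative assumptions.

```python
import numpy as np

def solve_alpha(H, h, lam):
    """Minimize 0.5 * a^T H a - a^T h + 0.5 * lam * ||a||^2 in closed form."""
    b = H.shape[0]
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(H + lam * np.eye(b), h)

def objective(alpha, H, h, lam):
    """Value of the regularized quadratic objective at alpha."""
    return 0.5 * alpha @ H @ alpha - alpha @ h + 0.5 * lam * alpha @ alpha

# Hypothetical toy problem with a positive semi-definite H.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))
H = Phi.T @ Phi / 50
h = rng.normal(size=4)
lam = 0.1

alpha_hat = solve_alpha(H, h, lam)
best_value = objective(alpha_hat, H, h, lam)
perturbed_value = objective(alpha_hat + 0.01, H, h, lam)
```

Because the matrix on the left-hand side does not change across iterations under this assumption, it could be factorized once and reused for every update.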
Convergence Analysis: To optimize the objective, we simply need to alternately solve the two convex optimization problems. Thus, the following property holds.
Proposition 1. Algorithm 1 monotonically decreases the objective function in each iteration.
Proof. See the supplementary material.
Model Selection: The LSMI-Sinkhorn algorithm includes several tuning parameters (i.e., and ), and determining the model parameters is critical to obtaining a good estimate of SMI. Accordingly, we use cross-validation with a hold-out set to select the model parameters.
First, the paired samples are divided into two subsets and . Then, we train the density-ratio using and the unpaired samples: and . The hold-out error can be calculated by approximating Eq. (3.1) using the hold-out samples as
where denotes the number of samples in the set , denotes the summation over all combinations of and in , and denotes the summation over all pairs for and in . We select the parameter that has the smallest .
Relation to the Gromov-Wasserstein: For , , and , by substituting the optimal , the loss function (Eq. (3.1)) can be represented as
Thus, the optimization problem for can be written as
This can be considered as a relaxed variant of the quadratic assignment problem (QAP) with . Since the Gromov-Wasserstein distance is also related to the QAP, the proposed method is related to the Gromov-Wasserstein distance.
If is a permutation matrix and ,
where . Note that we only assume for the Sinkhorn formulation.
Then, the estimation of SMI using the permutation matrix can be written as
where is the permutation function. The optimization problem is written as
To solve this problem, we can use the Hungarian algorithm instead of the Sinkhorn algorithm for optimizing . It is noteworthy that in the original LSOM algorithm, the permutation matrix is introduced to permute the Gram matrix (i.e., ) and is also included within the computation. However, in our formulation, the permutation matrix depends only on . This is a small difference in formulation; however, owing to this difference, we can show that the loss function of the proposed algorithm decreases monotonically.
Since LSOM finds the alignment, this approach is more suited to find the exact match among samples. In contrast, the proposed Sinkhorn formulation is more suited when there are no exact matches. Moreover, the LSOM formulation can only handle the same number of samples (i.e., ). For computational complexity, the Hungarian algorithm requires while the Sinkhorn requires .
Computational Complexity: The computational complexity of estimating is based on the computation of the cost matrix and the Sinkhorn iterations. The computational complexity of is and that of Sinkhorn algorithm is . Therefore, the computational complexity of the Sinkhorn iteration is . For the computation, the complexity to compute is and that for is . However, to estimate the , the complexity should be but it is small. Therefore, the total computational complexity of the initialization needs and the iterations requires . In particular, for small and , the computational complexity is .
4 Related Work
The proposed algorithm is related to MI estimation. Moreover, our LSMI-Sinkhorn algorithm is closely related to the Gromov-Wasserstein distance and kernelized sorting.
Mutual information estimation: To estimate the MI, the simplest approach is to estimate the probability densities $p(x, y)$, $p(x)$, and $p(y)$ from the paired samples and from the samples of each variable, respectively. However, because density estimation is itself a difficult problem, this naive approach tends not to work well. To handle this, a density-ratio-based approach can be promising [25, 24]. More recently, a deep-learning-based mutual information estimation algorithm has been proposed. However, these approaches still require a large number of paired samples to estimate the models. Thus, if we have a limited number of paired samples, existing approaches are not effective.
Most recently, the Wasserstein Dependency Measure (WDM), which measures the discrepancy between the joint probability $p(x, y)$ and its marginals $p(x)$ and $p(y)$, has been proposed and used for representation learning [16]. Since WDM can be used as an independence measure, it is closely related to LSMI-Sinkhorn. However, that work focuses on finding a good representation by maximizing WDM (i.e., maximizing the mutual information), whereas we focus on estimating the true SMI.
Gromov-Wasserstein and Kernelized Sorting: Given two sets of vectors in different spaces, the Gromov-Wasserstein distance [15] can be used to find the optimal alignment between them. This method uses the pairwise distances between samples in the same set to build a distance matrix for each set, and then finds a matching by minimizing the difference between the pairwise distance matrices:
where , and .
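To make the matching criterion concrete, the sketch below evaluates the standard squared-loss Gromov-Wasserstein discrepancy $\sum_{i,j,k,l} (D_{ik} - D'_{jl})^2 \pi_{ij} \pi_{kl}$ for a given coupling. It is a brute-force evaluation intended for small inputs, not a solver, and the function names and toy data are assumptions for illustration.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diffs**2, axis=-1))

def gw_objective(Dx, Dy, Pi):
    """Squared-loss Gromov-Wasserstein discrepancy for coupling Pi.

    Dx: (n, n) and Dy: (m, m) intra-domain distance matrices; Pi: (n, m).
    Builds an (n, m, n, m) tensor, so it is only suitable for small n and m.
    """
    diff = Dx[:, None, :, None] - Dy[None, :, None, :]   # indexed (i, j, k, l)
    return float(np.einsum("ijkl,ij,kl->", diff**2, Pi, Pi))

# Hypothetical example: Y is a shuffled copy of X, so a perfect alignment exists.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
perm = rng.permutation(5)
Y = X[perm]                                  # y_j = x_{perm[j]}
Dx, Dy = pairwise_dist(X), pairwise_dist(Y)

Pi_true = np.zeros((5, 5))
Pi_true[perm, np.arange(5)] = 1.0 / 5        # mass on the true matches
Pi_uniform = np.full((5, 5), 1.0 / 25)       # uninformative coupling

gw_true = gw_objective(Dx, Dy, Pi_true)
gw_uniform = gw_objective(Dx, Dy, Pi_uniform)
```

The coupling that encodes the true correspondence drives the discrepancy to zero, while an uninformative coupling does not. Searching over couplings is what makes the problem a quadratic assignment problem.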
Therefore, the alignment can be estimated first, followed by estimating the SMI from the aligned samples. However, computing the Gromov-Wasserstein distance requires solving a quadratic assignment problem (QAP), which is generally NP-hard for arbitrary inputs [18, 17]. In this work, we estimate the SMI by simultaneously solving the alignment and fitting the density ratio, efficiently leveraging the Sinkhorn algorithm and properties of the squared loss. Moreover, we show that our approach can be considered an instance of Gromov-Wasserstein with a properly chosen cost function. Recently, semi-supervised Gromov-Wasserstein-based optimal transport has been proposed and applied to heterogeneous domain adaptation problems [29]. That approach can handle tasks similar to those considered in this paper. However, it cannot be used to measure independence.
Kernelized sorting [12, 19, 6] is closely related to Gromov-Wasserstein. Specifically, kernelized sorting determines a set of paired samples by maximizing the Hilbert-Schmidt independence criterion (HSIC) between samples. However, kernelized sorting can only handle the same number of samples (i.e., and ).
5 Experiments
In this section, we evaluate the proposed algorithm using synthetic data and benchmark datasets.
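For all methods, we use Gaussian kernels:
$$K(x, x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma_x^2}\right), \quad L(y, y') = \exp\left(-\frac{\|y - y'\|_2^2}{2\sigma_y^2}\right),$$
where $\sigma_x$ and $\sigma_y$ denote the widths of the kernels, which are set using the median heuristic.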
We set the number of basis , , the maximum number of iterations , and the stopping parameter . The parameters and are chosen by cross-validation.
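One common form of the median heuristic sets the kernel width to the median pairwise Euclidean distance between samples, as sketched below. The text does not say which variant is used (some implementations rescale the median or use squared distances), so this particular choice and the function name are assumptions.

```python
import numpy as np

def median_heuristic(X):
    """Kernel width as the median pairwise Euclidean distance between distinct rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt(np.sum(diffs**2, axis=-1))
    upper = np.triu_indices(X.shape[0], k=1)   # each unordered pair once
    return float(np.median(dists[upper]))

# Hypothetical 1-D example with pairwise distances 1, 3, and 2.
X_toy = np.array([[0.0], [1.0], [3.0]])
sigma_toy = median_heuristic(X_toy)
```

For large sample sizes, the median is often computed on a random subset of the samples to avoid forming the full distance matrix.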
5.2 Convergence and Runtime
We first demonstrate the convergence of the loss function and the estimated SMI value. Here, we generate synthetic data from and randomly choose as paired samples and as unpaired samples. The convergence curve is shown in Figure 1. The values of loss and SMI converge quickly (5 iterations). This is consistent with Proposition 1.
Then, we compare the runtimes of the proposed LSMI-Sinkhorn and Gromov-Wasserstein for CPU and GPU implementations. The data are sampled using two 2D random measures, where is the unpaired data and is the paired data (only for LSMI-Sinkhorn). For Gromov-Wasserstein, we use the CPU implementation from the Python Optimal Transport toolbox [7] and the PyTorch GPU implementation from. We use the squared loss function and set the entropic regularization to 0.005 according to the original code. For LSMI-Sinkhorn, we implement the CPU and GPU versions using NumPy and PyTorch, respectively. For a fair comparison, we use the log-stabilized Sinkhorn algorithm with the same early stopping criterion and the same maximum number of iterations as in Gromov-Wasserstein. As shown in Figure 2, in comparison to Gromov-Wasserstein, LSMI-Sinkhorn is more than one order of magnitude faster for the CPU version and several times faster for the GPU version. This is consistent with our computational complexity analysis. Moreover, the GPU version of our algorithm takes only 3.47s to compute unpaired samples, indicating that it is suitable for large-scale applications.
5.3 SMI Estimation
For SMI estimation, we set up four baselines:
- LSMI (full): paired samples are used for cross-validation and SMI estimation. It is considered the ground-truth value.
- LSMI: Only (usually small) paired samples are used for cross-validation and SMI estimation.
- LSMI (opt): paired samples are used for SMI estimation. However, we use the optimal parameters from LSMI (full) here. This can be seen as an upper bound for SMI estimation with a limited number of paired data because the optimal parameters are usually unavailable.
- Gromov-SMI: The Gromov-Wasserstein distance is applied to the unpaired samples to find potential matches (). Then, the matched pairs and existing paired samples are combined to perform cross-validation and SMI estimation.
Synthetic Data: In this experiment, we manually generated four types of paired samples: random normal, (Linear), (Nonlinear), and . We changed the number of paired samples while fixing and for Gromov-SMI and the proposed LSMI-Sinkhorn, respectively. The model parameters and are selected by cross-validation using the paired examples with and . The results are shown in Figure 3. In the random case, the data are nearly independent and our algorithm achieves a small SMI value. In other cases, LSMI-Sinkhorn yields a better estimation of the SMI value and it lies near the ground truth when increases. In contrast, Gromov-SMI has a small estimation value, which may be due to the incorrect potential matching when .
UCI Datasets: We selected four benchmark datasets from the UCI machine learning repository. For each dataset, we split the features into two sets as paired samples. To ensure high dependence between these two subsets of features, we utilized the same splitting strategy as  according to the correlation matrix. The experimental setting was the same as for the synthetic data, except . We show the SMI estimation results in Figure 4. Similarly, LSMI-Sinkhorn obtains better estimation values on all four datasets. In most cases, Gromov-SMI tends to overestimate the value by a large margin, while the other baselines underestimate the value.
5.4 Deep Image Matching
Next, we consider an image matching task with deep convolutional features. We use two commonly used image classification benchmarks: CIFAR10 [13] and STL10 [3]. We extracted 64-dimensional features from the last layer of ResNet20 [10] pretrained on the training set of CIFAR10. The features are divided into two 32-dimensional parts denoted by and . We shuffle the samples of and attempt to match and with limited pair samples and unpaired samples . Other settings are the same as in the above experiments.
To evaluate the matching performance, we used top-1 accuracy, top-2 accuracy (correct matching is achieved in the top-2 highest scores), and class accuracy (matched samples are in the same class). As shown in Figure 5, LSMI-Sinkhorn obtains high accuracy with only a few tens of supervised pairs. Additionally, the high class matching performance implies that our algorithm can be applied to further applications such as semi-supervised image classification.
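The three evaluation measures can be computed from a matrix of matching scores as sketched below. The sketch assumes that each row holds the scores of one sample against all candidates and that higher scores indicate better matches; the function names and the toy score matrix are hypothetical.

```python
import numpy as np

def top_k_accuracy(scores, true_match, k=1):
    """Fraction of rows whose correct column is among the k highest scores.

    scores[i, j]: matching score between x_i and y_j.
    true_match[i]: index j of the correct match for x_i.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.any(top_k == true_match[:, None], axis=1)
    return float(np.mean(hits))

def class_accuracy(scores, labels_x, labels_y):
    """Fraction of rows whose best-scoring match has the same class label."""
    pred = np.argmax(scores, axis=1)
    return float(np.mean(labels_x == labels_y[pred]))

# Hypothetical scores for three samples whose correct matches are 0, 1, 2.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.3, 0.5],
                   [0.1, 0.6, 0.3]])
true_match = np.array([0, 1, 2])
labels = np.array([0, 1, 1])

top1 = top_k_accuracy(scores, true_match, k=1)
top2 = top_k_accuracy(scores, true_match, k=2)
cls_acc = class_accuracy(scores, labels, labels)
```

In this toy case only the first sample is matched exactly, yet every sample is matched within its class, which illustrates why class accuracy can be high even when top-1 accuracy is modest.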
We then investigate the impact of Sinkhorn regularization . With fixed to be 50, we show the matching accuracy of CIFAR10 and STL10 w.r.t changing in Figure 6. Matching accuracy gradually dropped when the value of increased. This is due to the intrinsic property of Sinkhorn regularization: with larger , the assignment matrix becomes smoother, thereby the matching accuracy drops.
5.5 Photo Album Summarization
Finally, we apply the proposed LSMI-Sinkhorn to the photo album summarization problem, where images are matched to a predefined structure according to the Cartesian coordinate system.
Color Feature: We first used 320 images collected from Flickr and extracted the original RGB pixels as color features. Figures 7(a) and 7(b) depict the semi-supervised summarization to the triangle and grids with the corners of the grids fixed to green, orange, black (triangle), and blue (rectangle) images. Similarly, we show the summarization results on an “AAAI 2020” grid with the center of each character fixed. It can be seen that these layouts show good color topology according to the fixed color images.
Semantic Feature: We then used CIFAR10 with the ResNet20 features to illustrate semantic album summarization. Figure 8 shows the layout of 1000 images into the same triangle, , and “AAAI 2020” grids. For Figures 8(a) and 8(b), we fixed the corners of the grid to automobile, airplane, horse (triangle) and dog (rectangle) images. For Figure 8(c), we fixed the corresponding character centers. It can be seen that similar objects are aligned together by their semantic meanings rather than colors with respect to the fixed images.
In comparison to previous summarization algorithms, LSMI-Sinkhorn has two advantages. First, the semi-supervised property enables interactive album summarization, while kernelized sorting [19, 6] and object matching  cannot. Second, we obtained a solution for general rectangular matching (), e.g., 320 images to a triangle grid, 1000 images to a grid, while most previous methods [19, 28] relied on the Hungarian algorithm  to obtain square matching, which is not as flexible as the proposed method.
6 Conclusion
In this paper, we proposed the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to estimate the SMI from a limited number of paired samples. To the best of our knowledge, this is the first semi-supervised SMI estimation algorithm. Through experiments on synthetic and real-world examples, we showed that the proposed algorithm can successfully estimate SMI with a small number of paired samples. Moreover, we demonstrated that the proposed algorithm can be used for image matching and photo album summarization.
References
[1] Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Hjelm, D.; and Courville, A. 2018. Mutual information neural estimation. In ICML.
[2] Bunne, C.; Alvarez-Melis, D.; Krause, A.; and Jegelka, S. 2019. Learning generative models across incomparable spaces. In ICML.
[3] Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In AISTATS.
[4] Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2nd edition.
[5] Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS.
[6] Djuric, N.; Grbovic, M.; and Vucetic, S. 2012. Convex kernelized sorting. In AAAI.
[7] Flamary, R., and Courty, N. 2017. POT: Python Optimal Transport library.
[8] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
[9] Gretton, A.; Bousquet, O.; Smola, A.; and Schölkopf, B. 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT.
[10] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
[11] Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In ICLR.
[12] Jebara, T. 2004. Kernelized sorting, permutation, and alignment for minimum volume PCA. In COLT.
[13] Krizhevsky, A., et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
[14] Kuhn, H. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2):83–97.
[15] Mémoli, F. 2011. Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11(4):417–487.
[16] Ozair, S.; Lynch, C.; Bengio, Y.; Oord, A. v. d.; Levine, S.; and Sermanet, P. 2019. Wasserstein dependency measure for representation learning. In NeurIPS.
[17] Peyré, G., and Cuturi, M. 2019. Computational optimal transport. Foundations and Trends in Machine Learning 11(5-6):355–607.
[18] Peyré, G.; Cuturi, M.; and Solomon, J. 2016. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML.
[19] Quadrianto, N.; Smola, A.; Song, L.; and Tuytelaars, T. 2010. Kernelized sorting. IEEE Transactions on Pattern Analysis and Machine Intelligence 32:1809–1821.
[20] Schmitzer, B. 2019. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41(3):A1443–A1481.
[21] Sinkhorn, R. 1974. Diagonal equivalence to matrices with prescribed row and column sums. Proceedings of the American Mathematical Society 45(2):195–198.
[22] Sriperumbudur, B. K.; Fukumizu, K.; Gretton, A.; Lanckriet, G. R.; and Schölkopf, B. 2009. In NIPS.
[23] Suzuki, T., and Sugiyama, M. 2010. Sufficient dimension reduction via squared-loss mutual information estimation. In AISTATS.
[24] Suzuki, T.; Sugiyama, M.; Kanamori, T.; and Sese, J. 2009. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics 10(S52).
[25] Suzuki, T.; Sugiyama, M.; and Tanaka, T. 2009. Mutual information approximation via maximum likelihood estimation of density ratio. In ISIT.
[26] Yamada, M., and Sugiyama, M. 2010. Dependence minimizing regression with model selection for non-linear causal inference under non-Gaussian noise. In AAAI.
[27] Yamada, M., and Sugiyama, M. 2011. Cross-domain object matching with model selection. In AISTATS.
[28] Yamada, M.; Sigal, L.; Raptis, M.; Toyoda, M.; Chang, Y.; and Sugiyama, M. 2015. Cross-domain matching with squared-loss mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9):1764–1776.
[29] Yan, Y.; Li, W.; Wu, H.; Min, H.; Tan, M.; and Wu, Q. 2018. Semi-supervised optimal transport for heterogeneous domain adaptation. In IJCAI.
[30] Zhao, S.; Song, J.; and Ermon, S. 2017. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.