1 Introduction
Mutual information (MI) quantifies the statistical dependence between two random variables
[4], and it is widely used in various machine learning applications including feature selection
[24, 25], dimensionality reduction [23], and causal inference [26]. More recently, deep neural network (DNN) models have started using MI as a regularizer for obtaining better representations from data, as in InfoVAE
[30] and Deep InfoMax [11]. Another application is improving generative adversarial networks (GANs) [8]. For instance, Mutual Information Neural Estimation (MINE) [1] was proposed to maximize or minimize the MI in deep networks and alleviate the mode-dropping issue in GANs. In all these examples, MI estimation is the core component. In various MI estimation approaches, the probability density-ratio function is considered to be one of the most important components:

r(x, y) = p(x, y) / (p(x) p(y)).
A straightforward method to estimate this ratio is to estimate the probability densities (i.e., p(x, y), p(x), and p(y)) and then compute their ratio. However, directly estimating probability densities is difficult, making this two-step approach inefficient. To address this issue, Suzuki et al. [25] proposed to directly estimate the density ratio, avoiding density estimation altogether [24, 25]. Nonetheless, the above-mentioned methods require a large number of paired samples when estimating the MI.
In practical settings, however, we can often obtain only a small number of paired samples. For example, obtaining one-to-one correspondences from one language to another requires a massive amount of human labor, which prevents us from easily measuring the MI across languages. Hence, a research question arises:
Can we perform mutual information estimation using unpaired samples and a small number of data pairs?
To answer this question, in this paper, we propose a semi-supervised MI estimation algorithm, particularly designed for the squared-loss mutual information (SMI) (a.k.a. the Pearson divergence between p(x, y) and p(x)p(y)) [24]. We first formulate SMI estimation as an optimal transport problem with density-ratio estimation. Then, we propose the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to solve the problem. The algorithm has a computational complexity of O(n_x n_y); hence, it is computationally efficient. We present the connection between the proposed algorithm and the Gromov-Wasserstein distance [15], which is a popular distance for measuring the discrepancy between different domains. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. Finally, we show the effectiveness of the proposed method for image matching and photo album summarization.
We summarize the contributions of this paper as follows:

- We proposed a semi-supervised mutual information estimation approach that does not require a large number of paired samples.
- We formulated the MI estimation as a combination of density-ratio fitting and optimal transport.
- We proposed the LSMI-Sinkhorn algorithm, which can be computed efficiently and whose loss is guaranteed to decrease monotonically.
- We determined a connection between the proposed method and the Gromov-Wasserstein distance.
2 Problem Formulation
In this section, we formulate the problem of squared-loss mutual information (SMI) estimation using a small number of paired samples.
Let X ⊂ R^{d_x} be the domain of vector x and Y ⊂ R^{d_y} be the domain of vector y. Suppose we are given n independent and identically distributed (i.i.d.) paired samples

{(x_i, y_i)}_{i=1}^{n} drawn from a joint probability density p(x, y),

where we consider the number of paired samples n to be small.
In addition to the paired samples, we suppose to have n_x and n_y i.i.d. samples drawn from the marginal distributions:

{x_i}_{i=n+1}^{n+n_x} ~ p(x) and {y_j}_{j=n+1}^{n+n_y} ~ p(y),

where the numbers of unpaired samples n_x and n_y are much larger than the number of paired samples n. It should be noted that the input dimensions d_x and d_y and the numbers of samples n_x and n_y can be different.

This paper aims to estimate the SMI (a.k.a. the Pearson divergence between p(x, y) and p(x)p(y)) [24] from the small number of paired samples by leveraging the unpaired samples {x_i} and {y_j}, respectively.
The SMI between two random variables x and y is defined as

SMI(x, y) = (1/2) ∫∫ (r(x, y) − 1)^2 p(x) p(y) dx dy,   (1)

where

r(x, y) = p(x, y) / (p(x) p(y))   (2)

is the density-ratio function. SMI is zero if and only if x and y are independent (i.e., p(x, y) = p(x)p(y)), and takes a positive value otherwise.
If we have an estimate of the density-ratio function, we can approximate the SMI as

SMI_hat = (1/(2n)) Σ_{i=1}^{n} r_α(x_i, y_i) − 1/2,

where r_α(x, y) is an estimate of the true density-ratio function parameterized by α.
However, since we do not have a sufficient number of paired samples to estimate the ratio function, estimating the SMI from the limited number of paired samples is very challenging. Our key idea is to align the unpaired samples using the paired samples and use them to improve the SMI estimation accuracy.
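To make the definition concrete, the SMI of a discrete joint distribution can be computed exactly from Eqs. (1) and (2). The following is a minimal numpy sketch; the two example distributions are illustrative and not from the paper:

```python
import numpy as np

def smi_discrete(p_xy):
    """SMI of a discrete joint distribution p_xy[i, j] = P(X=i, Y=j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal P(X), column vector
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal P(Y), row vector
    prod = p_x * p_y                       # independence baseline p(x)p(y)
    ratio = p_xy / prod                    # density ratio r(x, y), Eq. (2)
    return 0.5 * np.sum(prod * (ratio - 1.0) ** 2)  # Eq. (1)

p_indep = np.outer([0.5, 0.5], [0.3, 0.7])  # independent -> SMI = 0
p_dep = np.array([[0.5, 0.0], [0.0, 0.5]])  # fully dependent -> SMI = 0.5
```

For `p_dep`, the ratio is 2 on the diagonal and 0 off it, so Eq. (1) evaluates to 0.5, while the independent case gives exactly zero.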
3 Proposed Method
In this section, we propose an SMI estimation algorithm that uses a limited number of paired samples and a large number of unpaired samples.
3.1 Least-Squares Mutual Information with Sinkhorn (LSMI-Sinkhorn)
Model: We employ the following density-ratio model:

r_α(x, y) = Σ_{l=1}^{b} α_l K(x, x̃_l) L(y, ỹ_l),   (3)

where K(x, x′) and L(y, y′) are kernel functions, α = (α_1, α_2, ..., α_b)^T are the model parameters, and b is the number of basis functions. {x̃_l}_{l=1}^{b} and {ỹ_l}_{l=1}^{b} are the sets of basis vectors, which are sampled from {x_i} and {y_j}, respectively.
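The model of Eq. (3) can be evaluated with a few lines of numpy. In this sketch the Gaussian kernels, their widths, the basis choice, and the data are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ratio_model(alpha, X, Y, Xb, Yb, sx=1.0, sy=1.0):
    """r_alpha(x_i, y_i) = sum_l alpha_l K(x_i, xb_l) L(y_i, yb_l)."""
    K = gauss_kernel(X, Xb, sx)  # (n, b) kernel values for x
    L = gauss_kernel(Y, Yb, sy)  # (n, b) kernel values for y
    return (K * L) @ alpha       # elementwise product, then weighted sum

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(5, 3))
Xb, Yb = X[:3], Y[:3]            # basis vectors sampled from the data
alpha = np.ones(3)
r = ratio_model(alpha, X, Y, Xb, Yb)
```

Note that x and y may have different dimensions, as in the problem setup; only the number of basis functions b must match between the two kernel blocks.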
In this paper, we optimize α by minimizing the squared difference between the true density-ratio function and the ratio model:

J(α) = (1/2) ∫∫ (r_α(x, y) − r(x, y))^2 p(x) p(y) dx dy
     = Const. + (1/2) ∫∫ r_α(x, y)^2 p(x) p(y) dx dy − ∫∫ r_α(x, y) p(x, y) dx dy.   (4)
For the second term of Eq. (4), we can approximate the expectation by using the large number of unpaired samples. However, approximating the third term requires paired samples from the joint distribution. Because we have only a limited number of paired samples in our setting, the approximation of the third term can be poor.

To deal with this issue, we propose to also utilize the unpaired samples for approximating the expectation in the third term. Specifically, we first introduce Π ∈ R^{n_x × n_y} (π_ij ≥ 0, Σ_{i,j} π_ij = 1) and represent the third term as

∫∫ r_α(x, y) p(x, y) dx dy ≈ (β/n) Σ_{i=1}^{n} r_α(x_i, y_i) + (1 − β) Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} π_ij r_α(x_i, y_j),

where 0 ≤ β ≤ 1 is a tuning parameter balancing the terms of paired and unpaired samples. Note that if we set π_ij = π̄_ij / n, where π̄_ij is one if x_i and y_j are paired and 0 otherwise, and n is the total number of pairs, then we can recover the original empirical estimate.
Then, for the density-ratio model (Eq. (3)), the loss function (Eq. (4)) can be approximated as

J_hat(Π, α) = (1/2) α^T H α − h_Π^T α,

where the matrix H ∈ R^{b×b} is computed from the unpaired samples and the vector h_Π ∈ R^b from the paired samples and the Π-weighted unpaired samples above.

Since we want to estimate the density-ratio function by minimizing Eq. (4), the optimization problem is given as

min_{Π, α}  J_hat(Π, α) + ε Σ_{i,j} π_ij log π_ij + (λ/2) ||α||_2^2
s.t.  Π 1_{n_y} = (1/n_x) 1_{n_x},  Π^T 1_{n_x} = (1/n_y) 1_{n_y},   (5)

where Σ_{i,j} π_ij log π_ij is the negative entropic regularization, which ensures that Π is non-negative, ε ≥ 0 is its regularization parameter, (λ/2)||α||_2^2 is the ℓ2-regularization for a stable solution, and λ ≥ 0 is its regularization parameter.
The objective function is not jointly convex in (Π, α). However, if we fix one of the variables, it becomes convex in the other. Thus, we employ an alternating optimization approach (see Algorithm 1).
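The alternating scheme of Algorithm 1 can be sketched abstractly as follows. This is an illustrative skeleton with toy blockwise updates, not the paper's exact algorithm; the point is that each step solves a convex subproblem, so the loss cannot increase:

```python
def alternate(update_pi, update_alpha, loss, pi, alpha, max_iter=20, tol=1e-9):
    """Alternate two partial minimizers until the loss stops decreasing."""
    prev = loss(pi, alpha)
    for _ in range(max_iter):
        pi = update_pi(alpha)        # e.g., the Sinkhorn step with alpha fixed
        alpha = update_alpha(pi)     # e.g., the closed-form step with pi fixed
        cur = loss(pi, alpha)
        if prev - cur < tol:         # monotone decrease -> stopping criterion
            break
        prev = cur
    return pi, alpha

# Toy separable loss: each update jumps straight to the blockwise optimum.
pi_opt, alpha_opt = alternate(
    update_pi=lambda a: 1.0,
    update_alpha=lambda p: 2.0,
    loss=lambda p, a: (p - 1.0) ** 2 + (a - 2.0) ** 2,
    pi=0.0, alpha=0.0,
)
```

Because both subproblems are minimized exactly, the loss sequence is non-increasing, which is the property formalized in Proposition 1.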
Optimizing Π using the Sinkhorn algorithm: When α is fixed, the Π-dependent part of the objective can be written as

−(1 − β) Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} π_ij [C]_ij + ε Σ_{i,j} π_ij log π_ij,

where [C]_ij = r_α(x_i, y_j) is a cost matrix built from the current ratio model. It is evident that this representation can be considered an optimal transport problem if we maximize it with respect to Π [5]. It should be noted that the rank of C is at most b, where b is a constant (the number of basis functions), and the computational complexity for the cost matrix is O(n_x n_y b).

Thus, the optimization problem for Π can be written as

max_Π  Σ_{i,j} π_ij [C]_ij − ε′ Σ_{i,j} π_ij log π_ij
s.t.  Π 1_{n_y} = (1/n_x) 1_{n_x},  Π^T 1_{n_x} = (1/n_y) 1_{n_y},

and this optimization problem can be efficiently solved using the Sinkhorn algorithm [5, 21]. In this paper, we use the log-stabilized Sinkhorn [20]. Note that the optimization problem is convex if we fix α.
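For reference, the basic (non-stabilized) Sinkhorn iteration for this kind of entropic subproblem looks as follows. The cost matrix and marginals here are random stand-ins; the paper itself uses the log-stabilized variant [20], which avoids underflow for small ε:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, n_iter=500):
    """Return a transport plan pi with row sums a and column sums b."""
    K = np.exp(-C / eps)            # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)           # scale to match the column marginals
        u = a / (K @ v)             # scale to match the row marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
C = rng.random((4, 6))              # stand-in cost matrix
a = np.full(4, 1 / 4)               # uniform row marginal (1/n_x)
b = np.full(6, 1 / 6)               # uniform column marginal (1/n_y)
pi = sinkhorn(C, a, b)
```

Each iteration only needs matrix-vector products with K, which is what makes the per-iteration cost O(n_x n_y).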
Optimizing α: Next, we update α with Π given. The optimization problem for α is equivalent to

min_α  (1/2) α^T H α − h_Π^T α + (λ/2) ||α||_2^2.   (6)

Since this optimization problem is a convex quadratic program, the solution is given analytically as

α_hat = (H + λ I_b)^{-1} h_Π,   (7)

where I_b is the b × b identity matrix. Note that the matrix H depends on neither Π nor α, and thus needs to be computed only once.

Convergence Analysis: To optimize Π and α, we simply need to alternately solve the two convex optimization problems. Thus, the following nice property holds true.
Proposition 1
Algorithm 1 monotonically decreases the objective function at each iteration.
Proof. See the supplementary material.
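The α-update of Eq. (7) is a standard regularized linear solve, analogous to ridge regression. Here is a minimal sketch with random stand-ins for H and h (in the algorithm they come from the kernel model):

```python
import numpy as np

def update_alpha(H, h, lam=1e-2):
    """Closed-form alpha of Eq. (7): solve (H + lam I) alpha = h."""
    b = H.shape[0]
    return np.linalg.solve(H + lam * np.eye(b), h)

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 4))
H = A.T @ A / 10                    # symmetric PSD stand-in, as H is in LSMI
h = rng.normal(size=4)
alpha = update_alpha(H, h)
```

Using `np.linalg.solve` rather than forming the explicit inverse is the usual numerically preferable choice; the solve costs O(b^3), which is negligible for small b.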
Model Selection: The LSMI-Sinkhorn algorithm includes several tuning parameters (e.g., ε and λ), and determining the model parameters is critical to obtaining a good estimate of the SMI. Accordingly, we use cross-validation with a holdout set to select the model parameters.
First, the paired samples are divided into two subsets, D_tr and D_te. Then, we train the density-ratio model using D_tr together with the unpaired samples {x_i} and {y_j}. The holdout error can be calculated by approximating Eq. (4) using the holdout samples as

J_hat_te = (1/(2|D_te|^2)) Σ_{x, y ∈ D_te} r_α(x, y)^2 − (1/|D_te|) Σ_{(x, y) ∈ D_te} r_α(x, y),

where |D_te| denotes the number of samples in the set D_te, Σ_{x, y ∈ D_te} denotes the summation over all combinations of x and y in D_te, and Σ_{(x, y) ∈ D_te} denotes the summation over the pairs in D_te. We select the parameters that yield the smallest J_hat_te.
Relation to the Gromov-Wasserstein distance: For β = 0 and linear kernels K(x, x′) = x^T x′ and L(y, y′) = y^T y′, substituting the optimal α of Eq. (7) shows that the loss function (Eq. (4)) can be represented as

J_hat(Π) = −(1/2) h_Π^T (H + λ I_b)^{-1} h_Π,

where h_Π depends linearly on Π. Thus, the optimization problem for Π can be written as

max_Π  h_Π^T (H + λ I_b)^{-1} h_Π
s.t.  Π 1_{n_y} = (1/n_x) 1_{n_x},  Π^T 1_{n_x} = (1/n_y) 1_{n_y}.   (8)

This can be considered a relaxed variant of the quadratic assignment problem (QAP). Since the Gromov-Wasserstein distance is also related to a QAP [15], the proposed method is related to the Gromov-Wasserstein distance.
Relation to Least-Squares Object Matching: In this section, we show that the LSOM algorithm [28, 27] can be considered a special case of the proposed framework.
If Π corresponds to a permutation matrix (scaled by 1/n) and n = n_x = n_y, then Π automatically satisfies the marginal constraints of Eq. (5); note that for the Sinkhorn formulation we only assume those marginal constraints on Π.

Then, the estimation of SMI using the permutation matrix can be written as

SMI_hat = (1/(2n)) Σ_{i=1}^{n} r_α(x_i, y_{π̂(i)}) − 1/2,

where π̂ is the permutation function. The optimization problem is written as

min_{Π, α}  J_hat(Π, α) + (λ/2) ||α||_2^2   s.t.  Π corresponds to a permutation matrix.   (9)

To solve this problem, we can use the Hungarian algorithm [14] instead of the Sinkhorn algorithm [5] for optimizing Π. It is noteworthy that in the original LSOM algorithm, the permutation matrix is introduced to permute a Gram matrix and is also included within the computation of H, whereas in our formulation the permutation matrix depends only on h_Π. This is a small difference in formulation; however, owing to this difference, we can show the monotonic decrease of the loss function of the proposed algorithm.

Since LSOM finds a hard alignment, it is more suited to finding exact matches among samples. In contrast, the proposed Sinkhorn formulation is more suited when there are no exact matches. Moreover, the LSOM formulation can only handle the same number of samples (i.e., n_x = n_y). Regarding computational complexity, the Hungarian algorithm requires O(n^3) while the Sinkhorn algorithm requires O(n^2) per iteration.
Computational Complexity: The computational complexity of estimating Π consists of computing the cost matrix and running the Sinkhorn iterations. Computing the cost matrix C costs O(n_x n_y b), and each Sinkhorn iteration costs O(n_x n_y). For the α computation, forming H costs O((n_x + n_y) b^2) and forming h_Π costs O(n_x n_y b); solving for α additionally costs O(b^3), but this is small because b is small. Therefore, the initialization requires O(n_x n_y b) computation, and each subsequent iteration requires O(n_x n_y b) for updating h_Π plus O(n_x n_y) for the Sinkhorn updates. In particular, for small b and a moderate number of alternating steps, the overall computational complexity is O(n_x n_y).
4 Related Work
The proposed algorithm is related to MI estimation. Moreover, our LSMI-Sinkhorn algorithm is closely related to the Gromov-Wasserstein distance and kernelized sorting.
Mutual information estimation: To estimate the MI, the simplest approach is to estimate the probability densities p(x, y) from the paired samples, p(x) from {x_i}, and p(y) from {y_j}, respectively. However, because estimating a probability density is itself a difficult problem, this naive approach tends not to work well. To handle this, density-ratio based approaches are promising [25, 24]. More recently, a deep learning based mutual information estimation algorithm has been proposed [1]. However, these approaches still require a large number of paired samples to estimate the models; with a limited number of paired samples, existing approaches are not efficient.

Most recently, the Wasserstein Dependency Measure (WDM), which measures the discrepancy between the joint probability p(x, y) and the product of its marginals p(x)p(y), has been proposed and used for representation learning [16]. Since WDM can be used as an independence measure, it is highly related to LSMI-Sinkhorn. However, WDM focuses on finding a good representation by maximizing the dependency measure (i.e., maximizing the mutual information), whereas we focus on estimating the true SMI.
Gromov-Wasserstein and Kernelized Sorting: Given two sets of vectors in different spaces, the Gromov-Wasserstein distance [15] can be used to find the optimal alignment between them. This method uses the pairwise distances between samples within each set to build a per-set distance matrix, and then finds a matching by minimizing the difference between the pairwise distance matrices:

min_Π  Σ_{i,k=1}^{n_x} Σ_{j,l=1}^{n_y} |D(x_i, x_k) − D′(y_j, y_l)|^2 π_ij π_kl
s.t.  Π 1_{n_y} = (1/n_x) 1_{n_x},  Π^T 1_{n_x} = (1/n_y) 1_{n_y},

where D and D′ are distance functions on the respective spaces and Π = (π_ij) is the transport plan.

Therefore, the alignment can be estimated first, followed by estimating the SMI from the aligned samples. However, the Gromov-Wasserstein distance requires solving a quadratic assignment problem (QAP), which is in general NP-hard for arbitrary inputs [18, 17]. In this work, we estimate the SMI by simultaneously solving the alignment and fitting the density ratio, efficiently leveraging the Sinkhorn algorithm and the properties of the squared loss. Moreover, we show that our approach can be considered an instance of the Gromov-Wasserstein framework by properly setting the cost function. Recently, a semi-supervised Gromov-Wasserstein-based optimal transport method has been proposed and applied to heterogeneous domain adaptation [29]. Their approach can handle tasks similar to those considered in this paper; however, it cannot be used to measure independence.

Kernelized sorting [12, 19, 6] is also closely related to the Gromov-Wasserstein distance. Specifically, kernelized sorting determines a set of paired samples by maximizing the Hilbert-Schmidt independence criterion (HSIC) between samples [9]. However, kernelized sorting can only handle the same number of samples (i.e., n_x = n_y).
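For completeness, here is a minimal sketch of the (biased) empirical HSIC [9] that kernelized sorting maximizes; the kernel choice and data are illustrative assumptions:

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 1))
K = np.exp(-(X - X.T) ** 2)               # Gaussian Gram matrix on X
L_dep = K.copy()                          # Y identical to X -> dependent
hsic_val = hsic(K, L_dep)
```

When the two samples are strongly dependent (here Y equals X), the statistic is clearly positive, whereas it concentrates near zero for independent samples as n grows.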
5 Experiments
In this section, we evaluate the proposed algorithm on synthetic data and benchmark datasets.
5.1 Setup
For all methods, we use Gaussian kernels:

K(x, x′) = exp(−||x − x′||^2 / (2σ_x^2)),  L(y, y′) = exp(−||y − y′||^2 / (2σ_y^2)),

where σ_x and σ_y denote the kernel widths, which are set using the median heuristic [22]. We fix the number of basis vectors b, the maximum number of iterations, and the stopping tolerance in advance; the parameters ε and λ are chosen by cross-validation.
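The median heuristic [22] sets the Gaussian kernel width to the median of the pairwise sample distances. A simple sketch:

```python
import numpy as np

def median_heuristic(X):
    """Median of pairwise Euclidean distances between the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    iu = np.triu_indices_from(d2, k=1)   # distinct pairs only
    return float(np.median(np.sqrt(d2[iu])))

sigma = median_heuristic(np.array([[0.0], [1.0], [2.0]]))
```

Here the pairwise distances are {1, 2, 1}, so `sigma` is 1.0.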
5.2 Convergence and Runtime
We first demonstrate the convergence of the loss function and the estimated SMI value. Here, we generate synthetic data and randomly choose a small number of paired samples and a large number of unpaired samples. The convergence curves are shown in Figure 1: the loss and the SMI value converge quickly (within five iterations), which is consistent with Proposition 1.
Then, we compare the runtimes of the proposed LSMI-Sinkhorn and the Gromov-Wasserstein distance for CPU and GPU implementations. The data are sampled from two 2-dimensional random measures, from which we draw the unpaired data and, for LSMI-Sinkhorn only, a small set of paired data. For the Gromov-Wasserstein distance, we use the CPU implementation from the Python Optimal Transport toolbox [7] and the PyTorch GPU implementation from [2]. We use the squared loss function and set the entropic regularization to 0.005 according to the original code. For LSMI-Sinkhorn, we implement the CPU and GPU versions using numpy and PyTorch, respectively. For a fair comparison, we use the log-stabilized Sinkhorn algorithm, the same early stopping criterion, and the same maximum number of iterations as in the Gromov-Wasserstein implementation. As shown in Figure 2, LSMI-Sinkhorn is more than one order of magnitude faster than the Gromov-Wasserstein distance in the CPU version and several times faster in the GPU version, which is consistent with our computational complexity analysis. Moreover, the GPU version of our algorithm takes only 3.47 s in the largest setting, indicating that it is suitable for large-scale applications.

5.3 SMI Estimation
For SMI estimation, we set up four baselines:

- LSMI (full): all paired samples are used for cross-validation and SMI estimation. This is considered the ground-truth value.
- LSMI: only the (usually small) set of paired samples is used for cross-validation and SMI estimation.
- LSMI (opt): the small set of paired samples is used for SMI estimation, but with the optimal parameters obtained from LSMI (full). This can be seen as an upper bound for SMI estimation with a limited number of paired samples, because the optimal parameters are usually unavailable.
- GromovSMI: the Gromov-Wasserstein distance is applied to the unpaired samples to find potential matchings. The matched pairs and the existing paired samples are then combined to perform cross-validation and SMI estimation.
Synthetic Data: In this experiment, we generated four types of paired samples: random (independent) data and three dependent cases, including linear and nonlinear relationships. We changed the number of paired samples n while fixing the numbers of unpaired samples used by GromovSMI and the proposed LSMI-Sinkhorn. The model parameters ε and λ are selected by cross-validation using the paired examples. The results are shown in Figure 3. In the random case, the data are nearly independent, and our algorithm achieves a small SMI value. In the other cases, LSMI-Sinkhorn yields a better estimate of the SMI value, which lies near the ground truth as n increases. In contrast, GromovSMI yields a small estimate, which may be due to incorrect potential matchings.
UCI Datasets: We selected four benchmark datasets from the UCI machine learning repository. For each dataset, we split the features into two sets to serve as paired samples. To ensure high dependence between the two subsets of features, we used the same splitting strategy as [19], based on the correlation matrix. The experimental setting was the same as for the synthetic data, except for the sample sizes. The SMI estimation results are shown in Figure 4. Again, LSMI-Sinkhorn obtains better estimates on all four datasets. In most cases, GromovSMI tends to overestimate the value by a large margin, while the other baselines underestimate it.
5.4 Deep Image Matching
Next, we consider an image matching task with deep convolutional features. We use two commonly used image classification benchmarks: CIFAR-10 [13] and STL-10 [3]. We extracted 64-dimensional features from the last layer of a ResNet-20 [10] pretrained on the training set of CIFAR-10. The features are divided into two 32-dimensional parts, denoted by x and y. We shuffle the samples of y and attempt to match x and y given a limited number of paired samples and a large number of unpaired samples. The other settings are the same as in the above experiments.
To evaluate the matching performance, we used top-1 accuracy, top-2 accuracy (correct matching is achieved within the two highest scores), and class accuracy (matched samples belong to the same class). As shown in Figure 5, LSMI-Sinkhorn obtains high accuracy with only a few tens of supervised pairs. Additionally, the high class-matching performance implies that our algorithm could be applied to further applications such as semi-supervised image classification.
We then investigate the impact of the Sinkhorn regularization parameter ε. With the number of paired samples fixed to 50, we show the matching accuracy on CIFAR-10 and STL-10 as ε varies in Figure 6. The matching accuracy gradually drops as ε increases. This is due to an intrinsic property of Sinkhorn regularization: with larger ε, the transport plan becomes smoother, and the matching accuracy therefore drops.
5.5 Photo Album Summarization
Finally, we apply the proposed LSMI-Sinkhorn to the photo album summarization problem, in which images are matched to a predefined structure in the Cartesian coordinate system.
Color Feature: We first used 320 images collected from Flickr [19] and extracted the raw RGB pixels as color features. Figures 6(a) and 6(b) depict the semi-supervised summarization onto triangle and rectangular grids, with the corners of the grids fixed to green, orange, and black images (triangle) and blue images (rectangle). Similarly, we show the summarization results on an "AAAI 2020" grid with the center of each character fixed. These layouts show good color topology with respect to the fixed color images.
Semantic Feature: We then used CIFAR-10 with the ResNet-20 features to illustrate semantic album summarization. Figure 8 shows the layout of 1000 images onto the same triangle, rectangle, and "AAAI 2020" grids. For Figures 7(a) and 7(b), we fixed the corners of the grids to automobile, airplane, and horse images (triangle) and dog images (rectangle). For Figure 7(c), we fixed the corresponding character centers. Similar objects are aligned together by their semantic meaning rather than their color, with respect to the fixed images.
Compared with previous summarization algorithms, LSMI-Sinkhorn has two advantages. First, its semi-supervised nature enables interactive album summarization, which kernelized sorting [19, 6] and object matching [28] cannot provide. Second, it yields a solution for general rectangular matching (n_x ≠ n_y), e.g., 320 images to a triangle grid or 1000 images to a rectangular grid, while most previous methods [19, 28] rely on the Hungarian algorithm [14] and are restricted to square matching, which is less flexible than the proposed method.
6 Conclusion
In this paper, we proposed the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to estimate the SMI from a limited number of paired samples. To the best of our knowledge, this is the first semi-supervised SMI estimation algorithm. Through experiments on synthetic and real-world examples, we showed that the proposed algorithm can successfully estimate the SMI with a small number of paired samples. Moreover, we demonstrated that the proposed algorithm can be used for image matching and photo album summarization.
References
 Belghazi et al. [2018] Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Hjelm, D.; and Courville, A. 2018. Mutual information neural estimation. In ICML.
 Bunne et al. [2019] Bunne, C.; AlvarezMelis, D.; Krause, A.; and Jegelka, S. 2019. Learning generative models across incomparable spaces. In ICML.
 Coates, Ng, and Lee [2011] Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of singlelayer networks in unsupervised feature learning. In AISTATS.
 Cover and Thomas [2006] Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2nd edition.
 Cuturi [2013] Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS.
 Djuric, Grbovic, and Vucetic [2012] Djuric, N.; Grbovic, M.; and Vucetic, S. 2012. Convex kernelized sorting. In AAAI.
 Flamary and Courty [2017] Flamary, R., and Courty, N. 2017. POT: Python Optimal Transport library.
 Goodfellow et al. [2014] Goodfellow, I.; PougetAbadie, J.; Mirza, M.; Xu, B.; WardeFarley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
 Gretton et al. [2005] Gretton, A.; Bousquet, O.; Smola, A.; and Schölkopf, B. 2005. Measuring statistical dependence with HilbertSchmidt norms. In ALT.
 He et al. [2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
 Hjelm et al. [2019] Hjelm, R. D.; Fedorov, A.; LavoieMarchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In ICLR.
 Jebara [2004] Jebara, T. 2004. Kernelized sorting, permutation, and alignment for minimum volume PCA. In COLT.
 Krizhevsky and others [2009] Krizhevsky, A., et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
 Kuhn [1955] Kuhn, H. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(12):83–97.
 Mémoli [2011] Mémoli, F. 2011. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11(4):417–487.
 Ozair et al. [2019] Ozair, S.; Lynch, C.; Bengio, Y.; Oord, A. v. d.; Levine, S.; and Sermanet, P. 2019. Wasserstein dependency measure for representation learning. NeurIPS.
 Peyré and Cuturi [2019] Peyré, G., and Cuturi, M. 2019. Computational optimal transport. Foundations and Trends® in Machine Learning 11(56):355–607.
 Peyré, Cuturi, and Solomon [2016] Peyré, G.; Cuturi, M.; and Solomon, J. 2016. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML.
 Quadrianto et al. [2010] Quadrianto, N.; Smola, A.; Song, L.; and Tuytelaars, T. 2010. Kernelized sorting. IEEE Transactions on Pattern Analysis and Machine Intelligence 32:1809–1821.
 Schmitzer [2019] Schmitzer, B. 2019. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41(3):A1443–A1481.
 Sinkhorn [1974] Sinkhorn, R. 1974. Diagonal equivalence to matrices with prescribed row and column sums. Proceedings of the American Mathematical Society 45(2):195–198.

 Sriperumbudur et al. [2009] Sriperumbudur, B. K.; Fukumizu, K.; Gretton, A.; Lanckriet, G. R.; and Schölkopf, B. 2009. Kernel choice and classifiability for RKHS embeddings of probability distributions. In NIPS.
 Suzuki and Sugiyama [2010] Suzuki, T., and Sugiyama, M. 2010. Sufficient dimension reduction via squared-loss mutual information estimation. In AISTATS.
 Suzuki et al. [2009] Suzuki, T.; Sugiyama, M.; Kanamori, T.; and Sese, J. 2009. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics 10(S52).
 Suzuki, Sugiyama, and Tanaka [2009] Suzuki, T.; Sugiyama, M.; and Tanaka, T. 2009. Mutual information approximation via maximum likelihood estimation of density ratio. In ISIT.
 Yamada and Sugiyama [2010] Yamada, M., and Sugiyama, M. 2010. Dependence minimizing regression with model selection for nonlinear causal inference under non-Gaussian noise. In AAAI.
 Yamada and Sugiyama [2011] Yamada, M., and Sugiyama, M. 2011. Cross-domain object matching with model selection. In AISTATS.
 Yamada et al. [2015] Yamada, M.; Sigal, L.; Raptis, M.; Toyoda, M.; Chang, Y.; and Sugiyama, M. 2015. Cross-domain matching with squared-loss mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9):1764–1776.
 Yan et al. [2018] Yan, Y.; Li, W.; Wu, H.; Min, H.; Tan, M.; and Wu, Q. 2018. Semi-supervised optimal transport for heterogeneous domain adaptation. In IJCAI.
 Zhao, Song, and Ermon [2017] Zhao, S.; Song, J.; and Ermon, S. 2017. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.