1 Introduction
Cluster analysis aims to gather data instances into groups, called clusters, where instances within one group are similar among themselves while instances in different groups are as dissimilar as possible. Clustering methods have become more and more popular recently due to their ability to provide new insights into unlabeled data that may be difficult or even impossible to capture for a human being. Clustering methods, however, do not take into account the possible existing relationships between the features that describe the data instances. For example, one may consider a data matrix extracted from text corpus where each document is described by the terms appearing in it. In this case, clustering documents may benefit from the knowledge about the correlation that exists between different terms revealing their probability of appearing in the same documents. This idea is the cornerstone of coclustering (Hartigan, 1972; Mirkin, 1996) where the goal is to perform clustering of both data points and features simultaneously. The obtained latent structure of data is composed of blocks usually called coclusters. Applications of coclustering include but are not limited to recommendation systems (George & Merugu, 2005; Deodhar & Ghosh, 2010; Xu et al., 2012), gene expression analysis (Cheng et al., 2008; Hanisch et al., 2002) and text mining (Dhillon et al., 2003a; Wang et al., 2009). As a result, these methods are of an increasing interest to the data mining community.
Coclustering methods are often distinguished into probabilistic methods (e.g., (Dhillon et al., 2003b; Banerjee et al., 2007; Nadif & Govaert, 2008; Wang et al., 2009; Shan & Banerjee, 2010)) and metric based (e.g., (Rocci & Vichi, 2008; Ding et al., 2006)
) methods. Probabilistic methods usually make an assumption that data was generated as a mixture of probability density functions where each one of them corresponds to one cocluster. The goal then is to estimate the parameters of the underlying distributions and the posterior probabilities of each cocluster given the data. Metric based approaches proceed in a different way and rely on introducing and optimizing a criterion commonly taking into account intra and interblock variances. This criterion, in its turn, is defined using some proper metric function that describes the geometry of data in the most precise way possible. Both metric and probabilistic approaches are known to have their own advantages and limitations: despite being quite efficient in modeling the data distribution, probabilistic methods are computationally demanding and hardly scalable; metric methods are less computationally demanding but present the need to choose the “right” distance that uncovers the underlying latent coclusters’ structure based on available data. Furthermore, the vast majority of coclustering methods require the number of coclusters to be set in advance. This is usually done using the computationally expensive exhaustive search over a large number of possible pairs of row and column clusters as in
(Keribin et al., 2015; Wyse & Friel, 2012; Wyse et al., 2014).In this paper, we address the existing issues of coclustering methods described above by proposing a principally new approach that efficiently solves the coclustering problem from both qualitative and computational points of view and allows the automatic detection of the number of coclusters. We pose the coclustering problem as the task of transporting the empirical measure defined on the data instances to the empirical measure defined on the data features. The intuition behind this process is very natural to coclustering and consists in capturing the associations between instances and features of the data matrix. The solution of optimal transportation problem is given by a doublystochastic coupling matrix which can be considered as the approximated joint probability distribution of the original data. Furthermore, the coupling matrix can be factorized into three terms where one of them reflects the posterior distribution of data given coclusters while two others represent the approximated distributions of data instances and features. We use these approximated distributions to obtain the final partitions. We also derive a kernelized version of our method that contrary to the original case, is based on an optimal transportation metric defined on the space of dissimilarity functions.
The main novelty of our work is twofold. To the best of our knowledge, the proposed approach is a first attempt to apply entropy regularized optimal transport for coclustering and to give its solution a coclustering interpretation. While Wasserstein distance has already been adapted to design clustering algorithms (Cuturi & Doucet, 2014; Irpino et al., 2014), our idea is to concentrate our attention on the solution of the optimal transport given by the coupling matrix and not to minimize the quantization error with respect to (w.r.t.) Wasserstein distance. We also note that using entropy regularization leads to a very efficient algorithm that can be easily parallelized (Cuturi, 2013)
. Second, we show that under some plausible assumptions the density estimation procedure appearing from the use of the optimal transport results in the variational inference problem with the minimization of the reversed KullbackLeibler divergence. The important implications of this difference w.r.t. other existing methods are explained in Section 3.
The rest of this paper is organized as follows. In Section 2, we briefly present the discrete version of the optimal transportation problem and its entropy regularized version. Section 3 proceeds with the description of the proposed approach, its theoretical analysis and algorithmic implementation. In Section 4, we evaluate our approach on synthetic and realworld data sets and show that it is accurate and substantially more efficient than the other stateoftheart methods. Last section concludes the paper and gives a couple of hints for possible future research.
2 Background and notations
In this section, we present the formalization of the MongeKantorovich (Kantorovich, 1942) optimization problem and its entropy regularized version.
2.1 Optimal transport
Optimal transportation theory was first introduced in (Monge, 1781) to study the problem of resource allocation. Assuming that we have a set of factories and a set of mines, the goal of optimal transportation is to move the ore from mines to factories in an optimal way, i.e., by minimizing the overall transport cost.
More formally, given two empirical probability measures^{1}^{1}1Due space limitation, we present only the discrete version of optimal transport. For more details on the general continuous case and the convergence of empirical measures, we refer the interested reader to the excellent monograph by (Villani, 2009). and defined as uniformly weighted sums of Dirac with mass at locations supported on two point sets and , the MongeKantorovich problem consists in finding a probabilistic coupling defined as a joint probability measure over with marginals and that minimizes the cost of transport w.r.t. some metric :
where is the Frobenius dot product, is a set of doubly stochastic matrices and is a dissimilarity matrix, i.e., , defining the energy needed to move a probability mass from to . This problem admits a unique solution and defines a metric on the space of probability measures (called the Wasserstein distance) as follows:
The Wasserstein distance has been successfully used in various applications, for instance: computer vision
(Rubner et al., 2000), texture analysis (Rabin et al., 2011), tomographic reconstruction (I. Abraham & Carlier, 2016), domain adaptation (Courty et al., 2014), metric learning (Cuturi & Avis, 2014) and clustering (Cuturi & Doucet, 2014; Irpino et al., 2014). This latter application is of a particular interest as Wasserstein distance is known to be a very efficient metric due to its capability of taking into account the geometry of data through the pairwise distances between samples. The success of algorithms based on this distance is also due to (Cuturi, 2013) who introduced an entropy regularized version of optimal transport that can be optimized efficiently using matrix scaling algorithm. We present this regularization below.2.2 Entropic regularization
The idea of using entropic regularization dates back to (Schrödinger, 1931). In (Cuturi, 2013), it found its application to the optimal transportation problem through the following objective function:
Second term in this equation allows to obtain smoother and more numerically stable solutions compared to the original case and converges to it at the exponential rate (Benamou et al., 2015). Another advantage of entropic regularization is that it allows to solve optimal transportation problem efficiently using SinkhornKnopp matrix scaling algorithm (Sinkhorn & Knopp, 1967).
In the next section, we explain the main underlying idea of our approach that consists in associating data instances with features through regularized optimal transport.
3 Coclustering through optimal transport
In this section we show how the coclustering problem can be casted in a principally new way and then solved using the ideas from the optimal transportation theory.
3.1 Problem setup
Let us denote by and
two random variables taking values in the sets
and , respectively, where subscripts and correspond to rows (instances) and columns (features). Similar to (Dhillon et al., 2003b), we assume that the joint probability distribution between and denoted by is estimated from the data matrix . We further assume that and consist of instances that are distributed w.r.t. probability measures , supported on where and , respectively.The problem of coclustering consists in jointly grouping the set of features and the set of instances into homogeneous blocks by finding two assignment functions and that map as follows: , where and
denote the number of row and columns clusters, and discrete random variables
and represent the partitions induced by and , i.e., and . To use discrete optimal transport, we also define two empirical measures and based on and as follows: and . We are now ready to present our method.3.2 Proposed approach
The main underlying idea of our approach is to use the optimal transportation presented above to find a probabilistic coupling of the empirical measures defined based on rows and columns of a given data matrix. More formally, for some fixed we solve the coclustering problem through the following optimization procedure:
(1) 
where the matrix is computed using the Euclidean distance, i.e., . The elements of the resulting matrix provides us with the weights of associations between instances and features: similar instances and features correspond to higher values in . Our intuition is to use these weights to identify the most similar sets of rows and columns that should be grouped together to form coclusters.
Following (Benamou et al., 2015), this optimization problem can be equivalently rewritten in the following way:
where is the Gibbs kernel.
Finally, we can rewrite the last expression as follows:
where is the intersection of closed convex subsets given by and . The solution of the entropy regularized optimal transport can be obtained using SinkhornKnopp algorithm and has the following form (Benamou et al., 2015):
(2) 
where and are the scaling coefficients of the Gibbs kernel .
In what follows, we show that under some plausible assumptions, we can interpret these two vectors as approximated rows and columns probability density functions.
3.3 Connection to variational inference
In order to justify our approach from the theoretical point of view, we first explain how the obtained solution can be used for coclustering. As mentioned in (Dhillon et al., 2003b) and later in (Banerjee et al., 2007), the coclustering can be seen as a density estimation problem where the goal is to approximate the real density by a simpler one depending on the obtained coclustering in a way that it preserves the loss in the mutual information given by where is the mutual information. This quantity is further shown to be equal to the KullbackLeibler divergence between the original distribution and where the latter has the following form:
From this point, one may instantly see that the solution of the optimal transport problem has a very similar form as it also represents the joint probability distribution that approximates the original probability distribution given by the Gibbs measure and also factorizes into three terms. The most important difference, however, lies in the asymmetry of the KL divergence: while (Dhillon et al., 2003b) and (Banerjee et al., 2007) concentrate on minimizing , our idea is different and consists in minimizing . This approach is known in the literature as the variational inference (Bishop, 2006) and exhibits a totally different behaviour compared to the minimization of . As shown by (Bishop, 2006), in variational inference the estimated distribution concentrates on the modes of data and remains compact, while the minimizer of tends to cover the whole surface of the original density and to overestimate its support. As , and and represent the observed and unobserved variables, respectively, the natural goal is to try to estimate the distribution of the data given the obtained coclusters by the simpler variational distribution . However, as the maximisation of is computationally impossible, it is common to introduce a free distribution on the parameters and in order to obtain the following decomposition:
where the lower bound
is maximized when the KL divergence is minimized.
Now, if we assume that follows the Gibbs distribution, i.e. , we can consider the original formulation of the regularized optimal transport as the variational inference problem:
where the optimal coupling equals to the estimated joint probability .
At this point, we know that the coupling matrix can be seen as an approximation to the original unknown posterior density function but the question how one can use it to obtain the clustering of rows and columns has not been answered yet. In order to solve the variational inference problem, it is usually assumed that the variables are independent and thus the variational distribution factorizes as . This assumption, however, goes against the whole idea of coclustering that relies on the existence of a deep connection between these two variables.
To this end, we propose to consider the factorization of that has the following form
This particular form follows the idea of structured stochastic variational inference proposed in (Hoffman & Blei, 2015) where a term depicting the conditional distribution between hidden and observed variables is added to the fully factorized traditional setting presented above. As stated in (Hoffman & Blei, 2015), this term allows arbitrary dependencies between observed and hidden variables which can increase the fidelity of the approximation.
Following (Bishop, 2006), the optimal estimated densities and are controlled by the direction of the smallest variance of and respectively. Furthermore, and are proportional to the joint densities and , i.e., and . Bearing in mind the equivalence between and , this brings us to the following important conclusions: (1) the matrices and can be seen as the approximated densities and ; (2) vectors and represent the approximated densities and obtained by summing and out of and , respectively.
According to (Laird, 1978), the nonparametric estimate of the mixing distribution is a piecewise step function where the number of steps depend on the number of components in the mixture. In the cluster analysis, we can assume that and consist of and components, respectively. Then, our goal is to detect these steps based on the estimates given by and to obtain the desired partitions.
3.4 Kernelized version and GromovWasserstein distance
In this part, we introduce the kernelized version of our method and compare it to the original formulation of our algorithm. In order to proceed, we first define two similarity matrices and associated to empirical measures thus forming metricmeasure spaces as in (Mémoli, 2011). Matrices and are defined by calculating the pairwise distances or similarities between rows and columns, respectively, without the restriction of them being positive or calculated based on a proper distance function satisfying the triangle inequality. The entropic GromovWasserstein discrepancy in this case is defined as follows (Peyré et al., 2016):
where is a coupling matrix between two similarity matrices and is an arbitrary lostfunction, usually the quadraticloss or KullbackLeibler divergence.
Based on this definition, one may define the problem of the entropic GromovWasserstein barycenters for similarity or distance matrices and as follows:
(3) 
where is the computed barycenter and , are the coupling matrices that align it with and , respectively. are the weighting coefficients summing to one, i.e., that determine our interest in more accurate alignment between and or and .
The intuition behind this optimization procedure for coclustering with respect to original formulation given in (1) is the following: while in (1) we align rows with columns directly, in (3) our goal is to do it via an intermediate representation given by the barycenter that is optimally aligned with both and . In this case, we obtain the solutions and that, similar to (2), can be decomposed as follows:
where and
are Gibbs kernels calculated between the barycenter and row and column similarity matrices using any arbitrary lossfunction
as explained before. Finally, based on the analysis presented above, we further use vectors and to derive row and column partitions.3.5 Detecting the number of clusters
In order to detect the steps (or jumps) in the approximated marginals, we propose to adapt a procedure introduced in (Matei & Meignen, 2012) for multiscale denoising of piecewise smooth signals. This method is of particular interest for us as it determines the significant jumps in the vectors and without knowing their number and location, nor a specific threshold to decide the significance of a jump. As the proposed procedure deals with nondecreasing functions, we first sort the values of and in the ascending order. Since the procedure is identical for both vectors, we only describe it for the vector .
We consider that the elements of are the local averages of a piecewise continuous function on the intervals defined by the uniform subdivision of step of the interval . More precisely: . The detection strategy is based on the following cost function: defined for each interval. Therefore, we get the list of the interval suspicious to contain a jump for the subdivision of order as follows:
This detection should be refined in order to get only significant jumps in our vector To this end we use the multiscale representation of as in (Harten, 1989) and we perform this detection on each scale. On the first scale, we get a coarse version of by averaging:
Now, by considering the coarse version of , we obtain a second list of suspicious intervals as before. After that, these two lists merge in the list as follows: a jump will be considered in the interval or if the interval is also detected as suspicious at the coarse scale. This procedure is iterated times and a jump is observed if a chain of detection exists from fine to coarse scales. Finally, the number of clusters is obtained by .
3.6 Algorithmic implementation
We now briefly summarize the main steps of both CCOT and CCOTGW methods and discuss their peculiarities with respect to each other. The pseudocode of both approaches in Matlab are presented in Algorithm 1 and Algorithm 2, respectively.
Ccot
First step of our algorithm consists in calculating the cost matrix and using it to obtain the optimal coupling matrix by applying the regularized optimal transport. In order to calculate , row and column instances should both lie in a space of the same dimension. This condition, however, is verified only if the matrix is squared which occurs rarely in the realworld applications. To overcome this issue, we first subsample the original data set in a way that allows us to equalize the number of rows and columns and operate with two sets of the same dimension. If we assume that then this new reduced data set is denoted by . We repeat the sampling procedure until every individual is picked at least once.
The next step is to perform for each the jump detection on the sorted vectors and to obtain two lists of the jumps locations and and to define the number of row and column clusters and . By using them, we obtain the resulting row partition:
The partition for columns is obtained in the same way. Finally, we apply the majority vote over all samples partitions to obtain and . Regarding complexity, both SinkhornKnopp algorithm used to solve the regularized optimal transport (Knight, 2008) and the proposed jump detection techniques are known to converge at the linear rate multiplied by the number of samples, i.e., . On the other hand, the calculation of modes of the clustering obtained on the generated samples for both features and data instances has the complexity . In the end, the complexity of the whole algorithm is . We also note that in the realworld applications, we usually deal with scenarios where (“big data”) or (“small” data) thus reducing the overall complexity to and , respectively. This makes our approach even more computationally attractive.
CcotGw
As it can be seen from Algorithm 2, CCOTGW allows to overcome the important disadvantage of CCOT that consists in the need to perform sampling to cluster all data objects. On the other hand, the computational complexity of CCOT is only , while for CCOTGW it scales as . We also note that CCOTGW offers a great flexibility in terms of the possible data representation used at its input. One may easily consider using any arbitrary kernel function to calculate similarity matrices or even learn them beforehand using multiplekernel learning approaches.
Data set  Algorithms  

Kmeans  NMF  DKM  TriNMF  GLBM  ITCC  RBC  CCOT  CCOTGW  
D1  
D2  
D3  –  –  –  –  
D4  –  – 
4 Experimental evaluations
In this section, we provide empirical evaluation for the proposed algorithms.
4.1 Synthetic data
Simulation setting
We simulate data following the generative process of the Gaussian Latent Block Models (for details see (Govaert & Nadif, 2013)) and we consider four scenarios with different number of coclusters, degree of separation and size. Table 1 and Figure 4 present the characteristics of theta simulated data sets and their visualization showing the different coclustering structures.
Data set  Overlapping  Proportions  

D1  [+]  Equal  
D2  [+]  Unequal  
D3  [++]  Equal  
D4  [++]  Unequal 
We use several stateoftheart coclustering algorithms as baselines including ITCC (Dhillon et al., 2003b), Double KMeans (DKM) (Rocci & Vichi, 2008), Orthogonal Nonnegative Matrix TriFactorizations (ONTMF) (Ding et al., 2006), the Gaussian Latent Block Models (GLBM) (Nadif & Govaert, 2008; Govaert & Nadif, 2013) and Residual Bayesian CoClustering (RBC) (Shan & Banerjee, 2010). We also report the results of Kmeans and NMF, run on both modes of the data matrix, as clustering baseline. To assess the performance of all compared methods, we compute the coclustering error (CCE) (Patrikainen & Meila, 2006) defined as follows:
where and are the partitions of instances and variables estimated by the algorithm; and are the true partitions and (resp. ) denotes the error rate, i.e., the proportion of misclassified instances (resp. features).
For all configurations, we generate 100 data sets and compute the mean and standard deviation of the CCE over all sets. As all the approaches we compared with are very sensitive to the initialization, we run them times with random initializations and retain the best result according to the corresponding criterion. RBC is initialized with Kmeans. Regarding CCOT we set to for all configurations except D4 which has the same number of rows and columns, and therefore does not require any sampling. For CCOTGW, we use Gaussian kernels for both rows and columns with computed as the mean of all pairwise Euclidean distances between vectors (Kar & Jain, 2011). Finally, we let both CCOT and CCOTGW detect automatically the number of coclusters, while for all other algorithms we set the number of clusters to its true value.
Coclustering performance
We report the mean (and standard deviation) of coclustering errors obtained in Table 2. Based on these results, we observe that on D1, which has a clear block structure, all algorithms perform equally well, however CCOTGW gives the best results, closely followed by CCOT and Kmeans. Regarding D2, D3 and D4, which have more complicated structure than D1, both CCOT and CCOTGW significantly outperform all other algorithms and this difference is all the more important on D3 and D4 where some of the compared algorithms are unable to find a partition with the desired number of clusters.
Furthermore, we argued that one of the strengths of our method is its ability to detect automatically the number of coclusters by applying a jump detection algorithm on and . From Figure 8 one can observe that the plots of these vectors, obtained with CCOT, with their elements sorted in the ascending order reveal clear steps that correspond to the correct number of clusters and also illustrate their proportions and the degree of overlapping. The same observation is valid for CCOTGW. Both approaches correctly identified the number of clusters in most cases and CCOT is slightly more accurate than CCOTGW when the proportions of coclusters are unbalanced.

To summarize, CCOT and CCOTGW outperform all the other baselines for the considered data structures and present two important advantages: (1) they do not suffer from the initialization issues, (2) they are able to detect automatically the number coclusters.
4.2 MovieLens
Data and setting
MovieLens100K^{2}^{2}2https://grouplens.org/datasets/movielens/100k/ is a popular benchmark data set that consists of usermovie ratings, on a scale of one to five, collected from a movie recommendation service gathering 100,000 ratings from 943 users on 1682 movies. In the context of coclustering, our goal is to find homogeneous subgroups of users and films in order to further recommend previously unseen movies that were highly rated by the users from the same group.
We set the regularization parameters for CCOT and CCOTGW using the crossvalidation; the number of samplings for CCOT is set to (as the dimensions of the data set are quite balanced); the weights for the barycenter in CCOTGW are set to .
Results
In what follows we only present figures and results obtained by CCOTGW as both algorithms return the same number of blocks and the partitions are almost identical (with a normalized mutual information between partitions above ). CCOTGW automatically detects a structure consisting of blocks, that corresponds to 9 user clusters and 15 movie clusters. From Figure 11, one can observe that the users and the movies are almost equally distributed across clusters, except for two user and three movie clusters which have a larger size than others.
Figure 14 shows the original data set as well as a summarized version where each block is represented by its mean rating value (the lighter the block, the higher the ratings), revealing a structure into homogeneous groups. One can observe that the first movie cluster consists of films for which all users agree on giving high ratings (most popular movies) while the last movie cluster consists of the movies with very low ratings. We also report the 5 best rated movies in those two clusters in Table 3. One can easily see that popular movies, such that both Star Wars episodes are in M1 while M5 is composed of movies that were less critically acclaimed.
M1  M15 

Star Wars (1977)  Amytiville: A New Generation (1993) 
The Lion King (1994)  Amytiville: It’s About Time (1992) 
Return of the Jedi (1983)  Ninjas: High Noon at Mega Mountain (1998) 
Contact (1997)  Sudden Manhattan (1996) 
Raiders of the lost ark (1981)  Dream Man (1995) 
We can make similar observations for the interpretation of user clusters. For instance, the last two user clusters include users that tend to give less good ratings to movies than the average population. Also, we note that block corresponds to users who liked movies from M10 better than the rest of the users. These observations are also very similar to the results reported by (Banerjee et al., 2007), where the authors proposed a detailed study of a blocks structure for this data set. Additional results can be found in the Supplementary material.
5 Conclusions and future perspectives
In this paper we presented a novel approach for coclustering based on the entropy regularized optimal transport. Our method is principally different from other coclustering methods and consists in finding a probabilistic coupling of the empirical measures defined based on the data instances and features. We showed how this procedure can be seen as the variational inference problem and that the inferred distribution can be used to obtain the row and feature partitions. The resulting algorithm is not only more accurate than other stateoftheart methods but also fast and capable of automatically detecting the number of coclusters. We also presented an extended version of our algorithm that makes use of the optimal transportation distance defined on similarity matrices associated to the rows’ and columns’ empirical measures.
In the future, our work can be continued in multiple directions. First, we would like to extend our method in order to deal with the online setting where the goal is to classify a new previously unseen observation without the need to do the coclustering of the data set that includes it. This can be done using a recent approach proposed in
(Perrot et al., 2016) that allows to update the learned coupling matrix using the outofsample observations without recomputing it using all the data. We believe that this extension will make our algorithm attractive for the exploitation in realtime industrial recommendation systems due to its computational efficiency. We would also like to study the generalization properties of our algorithm in a spirit similar to the results obtained in (Maurer & Pontil, 2010). This latter work presents a rare case where the generalization bounds are derived for some famous unsupervised learning algorithms.Acknowledgements
This work has been supported by the ANR project COCLICO, ANR12MONU0001.
References

Banerjee et al. (2007)
Banerjee, Arindam, Dhillon, Inderjit, Ghosh, Joydeep, Merugu, Srujana, and
Modha, Dharmendra S.
A generalized maximum entropy approach to bregman coclustering and
matrix approximation.
Journal of Machine Learning Research
, 8:1919–1986, December 2007.  Benamou et al. (2015) Benamou, JeanDavid, Carlier, Guillaume, Cuturi, Marco, Nenna, Luca, and Peyré, Gabriel. Iterative Bregman Projections for Regularized Transportation Problems. SIAM Journal on Scientific Computing, 2(37):A1111–A1138, 2015.
 Bishop (2006) Bishop, Christopher M. Pattern Recognition and Machine Learning (Information Science and Statistics). SpringerVerlag New York, Inc., 2006.
 Cheng et al. (2008) Cheng, KinOn, Law, NgaiFong, Siu, WanChi, and Liew, Alan W. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinformatics, 9:210, 2008.
 Courty et al. (2014) Courty, Nicolas, Flamary, Rémi, and Tuia, Devis. Domain adaptation with regularized optimal transport. In Proceedings ECML/PKDD 2014, pp. 1–16, 2014.
 Cuturi (2013) Cuturi, Marco. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings NIPS, pp. 2292–2300, 2013.
 Cuturi & Avis (2014) Cuturi, Marco and Avis, David. Ground metric learning. Journal of Machine Learning Research, 15(1):533–564, 2014.
 Cuturi & Doucet (2014) Cuturi, Marco and Doucet, Arnaud. Fast computation of wasserstein barycenters. In Proceedings ICML, pp. 685–693, 2014.
 Deodhar & Ghosh (2010) Deodhar, M. and Ghosh, J. Scoal: A framework for simultaneous coclustering and learning from complex data. ACM Transactions on Knowledge Discovery from Data, 4(3):1–31, 2010.
 Dhillon et al. (2003a) Dhillon, Inderjit S., Mallela, Subramanyam, and Kumar, Rahul. A divisive information theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3:1265–1287, 2003a.
 Dhillon et al. (2003b) Dhillon, Inderjit S., Mallela, Subramanyam, and Modha, Dharmendra S. Informationtheoretic coclustering. In Proceedings ACM SIGKDD, pp. 89–98, 2003b.
 Ding et al. (2006) Ding, C., Li, T., Peng, W., and Park, H. Orthogonal nonnegative matrix trifactorizations for clustering. In Proceedings ACM SIGKDD, pp. 126–135, 2006.
 George & Merugu (2005) George, T. and Merugu, S. A scalable collaborative filtering framework based on coclustering. In Proceedings ICDM, pp. 625–628, 2005.
 Govaert & Nadif (2013) Govaert, G. and Nadif, M. Coclustering. John Wiley & Sons, 2013.
 Hanisch et al. (2002) Hanisch, Daniel, Zien, Alexander, Zimmer, Ralf, and Lengauer, Thomas. Coclustering of biological networks and gene expression data. BMC Bioinformatics, 18(suppl 1):145–154, 2002.
 Harten (1989) Harten, Amiram. Eno schemes with subcell resolution. Journal of Computational Physics, 83:148–184, 1989.
 Hartigan (1972) Hartigan, J. A. Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
 Hoffman & Blei (2015) Hoffman, Matthew D. and Blei, David M. Stochastic structured variational inference. In Proceedings AISTATS, volume 38, pp. 361–369, 2015.
 I. Abraham & Carlier (2016) I. Abraham, R. Abraham, M. Bergounioux and Carlier, G. Tomographic reconstruction from a few views: a multimarginal optimal transport approach. Applied Mathematics and Optimization, pp. 1–19, 2016.
 Irpino et al. (2014) Irpino, Antonio, Verde, Rosanna, and De Carvalho, Francisco de A.T. Dynamic clustering of histogram data based on adaptive squared wasserstein distances. Expert Systems with Applications, 41(7):3351–3366, 2014.
 Kantorovich (1942) Kantorovich, Leonid. On the translocation of masses. In C.R. (Doklady) Acad. Sci. URSS(N.S.), volume 37(10), pp. 199–201, 1942.
 Kar & Jain (2011) Kar, Purushottam and Jain, Prateek. Similaritybased learning via data driven embeddings. In NIPS, pp. 1998–2006, 2011.
 Keribin et al. (2015) Keribin, Christine, Brault, Vincent, Celeux, Gilles, and Govaert, Gérard. Estimation and selection for the latent block model on categorical data. Statistics and Computing, 25(6):1201–1216, 2015.
 Knight (2008) Knight, Philip A. The sinkhornknopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, March 2008.
 Laird (1978) Laird, N. Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73:805–811, 1978.
 Matei & Meignen (2012) Matei, Basarab and Meignen, Sylvain. Nonlinear cellaverage multiscale signal representations: Application to signal denoising. Signal Processing, 92:2738–2746, 2012.
 Maurer & Pontil (2010) Maurer, Augusto and Pontil, Massimiliano. K dimensional coding schemes in hilbert spaces. IEEE Trans. Information Theory, 56(11):5839–5846, 2010.
 Mémoli (2011) Mémoli, Facundo. Gromovwasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.
 Mirkin (1996) Mirkin, Boris Grigorievitch. Mathematical classification and clustering. Nonconvex optimization and its applications. Kluwer academic publ, Dordrecht, Boston, London, 1996.
 Monge (1781) Monge, Gaspard. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences, pp. 666–704, 1781.
 Nadif & Govaert (2008) Nadif, M. and Govaert, G. Algorithms for modelbased block gaussian clustering. In DMIN’08, the 2008 International Conference on Data Mining, 2008.
 Patrikainen & Meila (2006) Patrikainen, A. and Meila, M. Comparing subspace clusterings. IEEE Transactions on Knowledge and Data Engineering, 18(7):902–916, July 2006.
 Perrot et al. (2016) Perrot, Michaël, Courty, Nicolas, Flamary, Rémi, and Habrard, Amaury. Mapping estimation for discrete optimal transport. In NIPS, pp. 4197–4205, 2016.
 Peyré et al. (2016) Peyré, Gabriel, Cuturi, Marco, and Solomon, Justin. Gromovwasserstein averaging of kernel and distance matrices. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 1924, 2016, pp. 2664–2672, 2016.
 Rabin et al. (2011) Rabin, Julien, Peyré, Gabriel, Delon, Julie, and Bernot, Marc. Wasserstein barycenter and its application to texture mixing. In Proceedings SSVM, volume 6667, pp. 435–446, 2011.
 Rocci & Vichi (2008) Rocci, R. and Vichi, M. Twomode multipartitioning. Computational Statistics and Data Analysis, 52(4):1984–2003, 2008.

Rubner et al. (2000)
Rubner, Yossi, Tomasi, Carlo, and Guibas, Leonidas J.
The earth mover’s distance as a metric for image retrieval.
International Journal on Computer Vision, 40(2):99–121, 2000.  Schrödinger (1931) Schrödinger, E. Uber die umkehrung der naturgesetze. Sitzungsberichte Preuss. Akad. Wiss. Berlin. Phys. Math., 144:144–153, 1931.
 Shan & Banerjee (2010) Shan, Hanhuai and Banerjee, Arindam. Residual bayesian coclustering for matrix approximation. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29  May 1, 2010, Columbus, Ohio, USA, pp. 223–234, 2010.
 Sinkhorn & Knopp (1967) Sinkhorn, Richard and Knopp, Paul. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967.
 Villani (2009) Villani, Cédric. Optimal transport : old and new. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
 Wang et al. (2009) Wang, Pu, Domeniconi, Carlotta, and Laskey, Kathryn. Latent dirichlet bayesian coclustering. Machine Learning and Knowledge Discovery in Databases, pp. 522–537, 2009.
 Wilcoxon (1945) Wilcoxon, F. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83, December 1945.
 Wyse & Friel (2012) Wyse, Jason and Friel, Nial. Block clustering with collapsed latent block models. Statistics and Computing, 22(2):415–428, 2012.
 Wyse et al. (2014) Wyse, Jason, Friel, Nial, and Latouche, Pierre. Inferring structure in bipartite networks using the latent block model and exact ICL. ArXiv eprints, 2014.
 Xu et al. (2012) Xu, Bin, Bu, Jiajun, Chen, Chun, and Cai, Deng. An exploration of improving collaborative recommender systems via useritem subgroups. In Proceedings WWW, pp. 21–30, 2012.