1. Introduction
Dimension reduction plays a key role in exploratory data analysis, data visualization, clustering and classification. Techniques range from the classical PCA and nonlinear manifold learning to deep autoencoders
[1, 2, 3, 4, 5, 6, 7]. These techniques focus on only one mode of the data, often the observations (columns) which are are measurements in a highdimensional feature space (rows), and exploit correlations among the features to reduce the dimension of the features and obtain the underlying lowdimensional geometry of the observations. Yet for many data matrices, for example in gene expression studies, recommendation systems, and worddocument analysis, correlations exist among both observations and features. In these cases, we seek a method that can exploit the correlations among both the rows and columns of a data matrix to better learn lowerdimensional representations of both. Biclustering methods, which extract distinct biclusters along both rows and columns, give a partial solution to performing simultaneous dimension reduction on the rows and columns of a data matrix. Such methods, however, can break up a smooth geometry into artificial clusters. A more general viewpoint to consider is that data matrices possess geometric relationships between their rows (features) and columns (observations) such that both modes lie on lowdimensional manifolds. Furthermore, the relationships between the rows may be informed by the relationships between the columns, and vice versa. Several recent papers [8, 9, 10, 11, 12, 13] exploit this coupled relationship to coorganize matrices and infer underlying row and column embeddings.Further complicating the story is that such matrices may suffer from missing values, due to measurement corruptions and limitations. These missing values can sabotage efforts to learn the low dimensional manifold underlying the data. Specifically, kernelbased methods rely on calculating a similarity matrix between observations, whose eigendecomposition yields a new embedding of the data. As the number of missing entries grows, the distances between points are increasingly distorted, resulting in poor representation of the data in the lowdimensional space [14]. Matrix completion algorithms assume the data is lowrank and fill in the missing values by fitting a global linear subspace to the data. Yet, this may fail when the data lies on a nonlinear manifold.
Manifold learning in the missing data scenario has been addressed by a few recent papers. Nonlinear Principle Component Analysis (NLPCA) [15]
uses an autoencoder neural network, where the middle layer serves to learn a lowdimensional embedding of the data, and the trained autoencoder is used to fill in missing values. Missing Data Recovery through Unsupervised Regression
[16] first fills in the missing data with linear matrix completion methods, then calculates a nonlinear embedding of the data and incorporates this embedding in an optimization problem to fill in the missing values. Recently [14] proposed MRMISSING which first calculates an initial distance matrix using only nonmissing entries and then uses the increaseonlymetricrepair (IOMR) method to fix the distance matrix so that it is a metric from which they calculate an embedding. None of these methods consider the comanifold setting, where the coupled structure of the rows and the columns can be used to fill in the data, and to calculate an embedding.In this paper, we introduce a new method for performing joint dimension reduction on the rows and columns of a data matrix, which we term comanifold learning, in the missing data setting. We build on two recent lines of work on coorganizing the rows and columns of a data matrix [8, 10, 12] and convex optimization methods for performing coclustering [17, 18]. The former provide a flexible framework for jointly organizing rows and columns but lacks algorithmic convergence guarantees. The latter provides convergence guarantees but does not take full advantage of the multiple scales of the data revealed in the solution.
Instead of inferring biclusters at a single scale, we use a multiscale optimization framework to fill in the data, imposing smoothness on both the rows and the columns at fine to coarse scales. The scale of the solution is encoded in a pair of joint cost parameters along the rows and columns. The solutions to the optimization for each such pair yields a smooth estimate of the data along both the rows and columns, whose values are used to fill in the missing values of the given matrix. We define a new multiscale metric based on the filledin matrix across all scales, which we use to calculate nonlinear embeddings of the rows and columns. Thus our approach yields three results: a collection of smoothed estimates of the matrix, pairwise distances on the rows and columns that better estimate the geometry of the complete data matrix, and corresponding nonlinear embeddings (see Figure 1). We will demonstrate in experimental results that our method reveals meaningful representations in coupled datasets with missing entries, whereas other methods are capable of revealing a meaningful representation only along one of the modes.
2. Coclustering an Incomplete Data Matrix
We seek a collection of complete matrix approximations of a partially observed data matrix that have been smoothed along their row and columns to varying degrees. This collection will serve in computing row and column multiscale metrics to better estimate the row and column pairwise distances of the complete data matrix. Let denote the set of indices , and let be a subset of the indices that correspond to observed entries of , and let denote the projection operator of matrices onto an index set , i.e. is if and is 0 otherwise.
We seek a minimizer of the following function.
(1) 
The quadratic term quantifies how well approximates on the observed entries, while the two roughness penalties, and , incentivize smoothness across the rows and columns of . The nonnegative parameters and tune the tradeoff between how well agrees with over and how smooth is with respect to its rows and columns. By tuning and , we obtain estimates of at varying scales of row and column smoothness.
In this paper, we use roughness penalties of the following forms
where denotes the th row (column) of the matrix . The index sets and denote the edge sets of row and column graphs that encode a preliminary datadriven assessment of the similarities between rows and columns of the data matrix. The function , which maps into , will be explained shortly. The convergence properties of our coclustering procedure will rely on the following two assumptions.
Assumption 2.1.
The row and column graphs and are connected, i.e. the row graph is connected if for any pair of rows, indexed by and with , there exists a sequence of indices such that . A column graph is connected under analogous conditions.
Assumption 2.2.
The function is (i) concave and continuously differentiable on , (ii) vanishes at the origin, i.e. , (iii) is increasing on , and (iv) has finite directional derivative at the origin.
Variations on the optimization problem of minimizing (1) have been previously proposed in the literature. When there is no data missing, i.e. and is a linear mapping, minimizing the objective in (1) produces a convex biclustering problem [17]. Additionally, if either or is zero, then we obtain convex clustering [19, 20, 21, 22]. If we take to be a nonlinear concave function, problem (1) reduces to an instance of concave penalized regressionbased clustering [23, 24, 25].
Replacing and by quadratic row and column Laplacian penalties
gives a version of matrix completion on graphs [26, 27]. [11] also use row and column Laplacian penalties to perform joint linear dimension reduction on the rows and columns of the data matrix. Our work generalizes both [11] and [17] in that we seek the flexibility of performing nonlinear dimension reduction on the rows and columns of the data matrix and seek more general manifold organization than coclustered structure.
2.1. CoClustering Algorithm
We now introduce a majorizationminimization (MM) algorithm [28] for solving the minimization in (1). The basic strategy behind an MM algorithm is to convert a hard optimization problem into a sequence of simpler ones. The MM principle requires majorizing the objective function by a surrogate function anchored at . Majorization is a combination of the tangency condition and the domination condition for all . The associated MM algorithm is defined by the iterates . It is straightforward to verify that the MM iterates generate a descent algorithm driving the objective function downhill, i.e. that for all .
The following function
majorizes our objective function (1) at , where is a constant that does not depend on and and are weights that depend on , i.e.
(2) 
We give a detailed derivation of this majorization in Appendix A.
Minimizing is equivalent to minimizing the objective function of the convex biclustering problem for which efficient algorithms have been introduced [17]. Thus, in the th iteration, our MM algorithm solves a convex biclustering problem where the missing values in have been replaced with the values of and the weights and have been computed based on according to (2).
Algorithm 1 summarizes our MM algorithm, coclustermissing, which returns a smooth output matrix , a filledin matrix as well as and , which are respectively the number of distinct rows and distinct columns in . The coclustermissing algorithm has the following convergence guarantee whose proof is in Appendix B.
Proposition 1.
In the rest of this paper, we use the following function which satisfies Assumption 2.2
where is a small positive number, e.g. . We briefly explain the rationale in our choice. By the monotone convergence theorem, as tends to zero, converges to the mapping . Thus, approximates a snowflake metric . When the approximate snowflake metric is employed in the penalty term, small differences between rows and columns are penalized significantly more than larger differences resulting in more aggressive smoothing of small noisy variations and less smoothing of more significant systematic variations. Note that the weights are continuously updated throughout the optimization as opposed to the fixed weights in [17]. This introduces a notion of the scale of the solution into the weights.
2.2. Coclustering at multiple scales
Initializing Algorithm 1 is very important as the objective function in (1) is not convex. The matrix is initialized to be the mean of all nonmissing values. The connectivity graphs and are initialized at the beginning using nearestneighbor graphs, and remain fixed throughout all considered scales. If we observed the complete matrix, employing a sparse Gaussian kernel is a natural way to quantify the local similarity between pairs of rows and pairs of columns. The challenge is that we do not have the complete data matrix but only the partially observed one . Therefore, we rely only on the observed values to calculate the knearestneighbor graph, using the distance used by [29]
in an image inpainting problem.
To obtain a collection of estimates at multiple scales, we need to solve the optimization problem for pairs of . We start with small values of and , where . We calculate the coclustering (Algorithm 1) and obtain the smooth estimate , the filledin data matrix , and and which are the number of distinct row and column clusters, respectively, identified at that scale. Keeping fixed, we keep increasing by power of 2 and biclustering the data until the algorithm converges to one cluster along the columns. We then increase by power of 2 and reset . We repeat this procedure at increasing scales of , until , indicating we have converged to a single global bicluster. Thus we traverse a solution surface at logarithmic scale [30]. This yields a collection of filledin matrices at all scales .
3. Comanifold learning
Kernelbased manifold learning relies on constructing a “good” similarity measure between points, and a dimension reduction method based on this similarity. The eigenvectors of these kernels is typically used as the new lowdimensional coordinates for the data. Here we leverage having calculated an estimate of the filledin matrix at multiple scales
, to define a new metric between rows and columns. This metric will encompass all biscales as defined by joint pairs of optimization cost parameters . Given a new metric we employ diffusion maps to obtain a new embedding of the rows and columns. Note that other methods can be used for embedding based on our new metric. The full algorithm is given in Algorithm 2.3.1. Multiscale metric
We define a new metric to estimate the geometry both locally and globally of the complete data matrix. For a given pair , we calculate the Euclidean distance between rows for the filledin matrix at that joint scale, weighted by the cost parameters:
where . Having solved for multiple paris from the solution surface, we sum over all the distances to obtain a multiscale distance on the data rows:
An analogous multiscale distance is computed for pairs of columns.
This metric takes advantage of solving the optimization for multiple pairs of cost parameters and filling in the missing values with increasingly smooth estimates. Note that if there are no missing values, this metric is just the Euclidean pairwise distance scaled by a scalar, so that we recover the embedding of the complete matrix. In our simulations, we set to favor local over global structure. As opposed to the partitiontree based metric of [12], this metric takes into account all joint scales of the data as the matrix is smoothed across rows and columns simultaneously, thus fully taking advantage of the coupling between both modes.
3.2. Diffusion maps
Having calculated a multiscale metric on the rows and columns throughout the joint optimization procedure, we can now construct a pair of lowdimensional embeddings based on these distances. Specifically we use diffusion maps [4], but any dimension reduction technique relying on the construction of a distance kernel could be used instead. We briefly review the construction of the diffusion maps for the rows (features) of a matrix but the same can be applied to the columns (observations). Given a distance between two rows of the matrix , we construct an affinity kernel on the rows. We choose a Gaussian kernel, but other kernels can be considered depending on the application:
where is a scale parameter. This kernel enhances locality, as pairs of samples whose distance exceed have negligible affinity. One possible choice for is to be the median of pairwise distances within the data.
We derive a rowstochastic matrix by normalizing the rows of
:where is a diagonal matrix whose elements are given by . The eigendecomposition of
yields a sequence of decreasing eigenvalues:
, and right eigenvectors . Retaining only the first eigenvalues and eigenvectors, the mapping embeds the rows into the Euclidean space :The embedding integrates the local connections found in the data into a global representation, which enables visualization of the data, organizes the data into meaningful clusters, and identifies outliers and singular samples. This embedding is also equipped with a noiserobust distance, the diffusion distance. For more details on diffusion maps, see
[4].4. Numerical Experiments
We applied our approach to three datasets, and evaluated results both qualitatively and quantitatively:

linkage A synthetic dataset with a onedimensional manifold along the rows and a twodimensional manifold along the columns. Let be points along a helix and let be a two dimensional surface. We analyze the matrix of Euclidean distances between the two spatially distant sets of points to reveal the underlying geometry of both rows and columns,
Other functions of the distance can also be used such as the elastic or Coulomb potential operator [31]. Missing values correspond to having access to only some of the distances between pairs of points across the two sets. Note that this is unlike MDS as we do not have pairwise distances between all datapoints, but rather distances between two sets of points with different geometries.

linkage2 A synthetic dataset with a clustered structure along the rows and a twodimensional manifold along the columns. Let be composed of points in 3 Gaussian clouds in 3D and let be a two dimensional surface as before.

lung500 A realworld dataset composed of 56 lung cancer patients and their gene expression [32]
. We selected the 500 genes with the greatest variance from the original collection of 12,625 genes. Subjects belong to one of four subgroups; they are either normal subjects (Normal) or have been diagnosed with one of three types of cancers: pulmonary carcinoid tumors (Carcinoid), colon metastases (Colon), and small cell carcinoma (Small Cell).
The rows and columns of the data matrix are randomly permuted so their natural order does not play a role in inferring the geometry. In Figure 2, we compare our embeddings to both NLPCA with missing data completion [15] and Diffusion maps (DM) [4] on the missing data, where both methods are applied to each mode separately, while our comanifold approach takes into account the coupled geometry. Comparing to Diffusion maps demonstrates how missing values corrupt the embedding. In all examples 50% of the entries are missing. For each of the three methods we display the embedding for both the rows (top) and the columns (bottom),
Both NLPCA and DM reveal the underlying 2D surface structure on the rows in only one of the linkage datasets, and err greatly on the other. DM correctly infers a 1D path for the linkage dataset but it is increasingly noisy. For NLPCA the 1D embedding is not as smooth and clean as the embedding inferred by the comanifold approach. Our method reveals the 2D surface in both cases. For the lung500 data, NLPCA and DM embed the cancer samples such that the normal subjects (yellow) are close to the Colon type (cyan), whereas our method separates the normal subjects from the cancer types. This is due to taking into account the coupled structure of the genes and the samples. All three methods reveal a smooth manifold structure to the genes, which is different than the assumed clustered structure a biclustering method would infer. For plots presenting the datasets and filledin values at multiple scales see Appendix C.
Manifold learning is not used only for data visualization but also for calculating new data representations that can then be used for signal processing and machine learning tasks. The left panel of Figure
3 compares clustering the embedding of the cancer patients in lung500 by each method for increasing percentage of missing values in the data, where we averaged over 30 realizations of missing entries. We use the Adjusted Rand Index (ARI) [33], to measure the similarity between the kmeans clustering of the embedding and the groundtruth labels. Our embedding (blue plot) gives the best clustering result and its performance is only slightly degraded by increasing the percentage of missing values, as opposed to Diffusion maps (red plot). This demonstrates that the metric we calculate is a good estimate of the metric of the complete data matrix. NLPCA (yellow plot) performs worst.The right panel of Figure 3 compares clustering the embedding of the three Gaussian clusters in linkage2 for increasing percentage of missing values in the data, where we averaged over 30 realizations of missing entries. Our embedding (blue plot) gives the best clustering result and its performance is unaffected by increasing the percentage of missing values, as opposed to Diffusion maps (red plot) which is greatly degraded by the missing values. NLPCA (yellow plot) does not perform as well as our approach, with performance decreasing as the percentage of missing values increases.
5. Conclusions
In this paper we presented a new method for learning nonlinear manifold representations of both the rows and columns of a matrix with missing data. We proposed a new optimization problem to obtain a smooth estimate of the missing data matrix, and solved this problem for different values of the cost parameters, which encode the smoothness scale of the estimate along the rows and columns. We leverage calculating these multiscale estimates into a new metric that aims to capture the geometry of the complete data matrix. This metric is then used in a kernelbased manifold learning technique to obtain new representations of both the rows and the columns. In future work we will investigate additional metrics in a general comanifold setting and relate them to optimal transport problem and Earth Mover’s Distance [34].
References
 [1] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.
 [2] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.
 [3] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
 [4] R. R. Coifman and S. Lafon, “Diffusion Maps,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.

[5]
P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in
Proceedings of the 25th International Conference on Machine learning (ICML08). ACM, 2008, pp. 1096–1103. 
[6]
S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive autoencoders: Explicit invariance during feature extraction,” in
Proceedings of the 28th International Conference on Machine Learning (ICML11), 2011, pp. 833–840.  [7] D. P. Kingma and M. Welling, “AutoEncoding Variational Bayes,” in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
 [8] M. Gavish and R. R. Coifman, “Sampling, denoising and compression of matrices by coherent matrix organization,” Applied and Computational Harmonic Analysis, vol. 33, no. 3, pp. 354 – 369, 2012.
 [9] J. I. Ankenman, “Geometry and analysis of dual networks on questionnaires,” Ph.D. dissertation, Yale University, 2014.

[10]
G. Mishne, R. Talmon, R. Meir, J. Schiller, M. Lavzin, U. Dubin, and R. R. Coifman, “Hierarchical coupledgeometry analysis for neuronal structure and activity pattern discovery,”
IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 7, pp. 1238–1253, 2016.  [11] N. Shahid, N. Perraudin, V. Kalofolias, G. Puy, and P. Vandergheynst, “Fast robust PCA on graphs,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 740–756, 2016.
 [12] G. Mishne, R. Talmon, I. Cohen, R. R. Coifman, and Y. Kluger, “Datadriven tree transforms and metrics,” IEEE Transactions on Signal and Information Processing over Networks, 2017, in press.
 [13] O. Yair, R. Talmon, R. R. Coifman, and I. G. Kevrekidis, “Reconstruction of normal forms by learning informed observation geometries from data,” Proceedings of the National Academy of Sciences, vol. 114, no. 38, pp. E7865–E7874, 2017.
 [14] A. C. Gilbert and R. Sonthalia, “Unrolling swiss cheese: Metric repair on manifolds with holes,” arXiv preprint arXiv:1807.07610, 2018.
 [15] M. Scholz, F. Kaplan, C. L. Guy, J. Kopka, and J. Selbig, “Nonlinear PCA: a missing data approach,” Bioinformatics, vol. 21, no. 20, pp. 3887–3895, 2005.
 [16] M. A. CarreiraPerpin and Z. Lu, “Manifold learning and missing data recovery through unsupervised regression,” in Data Mining (ICDM), 2011 IEEE 11th International Conference on, 2011, pp. 1014–1019.
 [17] E. C. Chi, G. I. Allen, and R. G. Baraniuk, “Convex Biclustering,” Biometrics, vol. 73, no. 1, pp. 10–19, 2017.
 [18] E. C. Chi, B. R. Gaines, W. W. Sun, H. Zhou, and J. Yang, “Provable convex coclustering of tensors,” arXiv:1803.06518 [stat.ME], 2018.
 [19] K. Pelckmans, J. De Brabanter, J. Suykens, and B. De Moor, “Convex clustering shrinkage,” in PASCAL Workshop on Statistics and Optimization of Clustering Workshop, 2005.
 [20] T. Hocking, J.P. Vert, F. Bach, and A. Joulin, “Clusterpath: An algorithm for clustering using convex fusion penalties,” in Proceedings of the 28th International Conference on Machine Learning (ICML11), June 2011, pp. 745–752.
 [21] F. Lindsten, H. Ohlsson, and L. Ljung, “Just relax and come clustering! A convexification of means clustering,” Linköpings Universitet, Tech. Rep., 2011.
 [22] E. C. Chi and K. Lange, “Splitting methods for Convex Clustering,” Journal of Computational and Graphical Statistics, vol. 24, no. 4, pp. 994–1013, 2015.

[23]
W. Pan, X. Shen, and B. Liu, “Cluster analysis: Unsupervised learning via supervised learning with a nonconvex penalty,”
Journal of Machine Learning Research, vol. 14, pp. 1865–1889, 2013.  [24] Y. Marchetti and Q. Zhou, “Solution path clustering with adaptive concave penalty,” Electronic Journal of Statistics, vol. 8, no. 1, pp. 1569–1603, 2014.
 [25] C. Wu, S. Kwon, X. Shen, and W. Pan, “A new algorithm and theory for penalized regressionbased clustering,” Journal of Machine Learning Research, vol. 17, no. 188, pp. 1–25, 2016.
 [26] V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst, “Matrix completion on graphs,” arXiv:1408.1717 [cs.LG], 2014.
 [27] N. Rao, H.F. Yu, P. K. Ravikumar, and I. S. Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2107–2115.
 [28] Y. Sun, P. Babu, and D. P. Palomar, “Majorizationminimization algorithms in signal processing, communications, and machine learning,” IEEE Transactions on Signal Processing, vol. 65, no. 3, pp. 794–816, 2017.
 [29] I. Ram, M. Elad, and I. Cohen, “Image processing using smooth ordering of its patches,” IEEE transactions on image processing, vol. 22, no. 7, pp. 2764–2774, 2013.
 [30] E. C. Chi and S. Steinerberger, “Recovering trees with convex clustering,” ArXiv eprints, 2018.
 [31] R. R. Coifman and M. Gavish, “Harmonic analysis of digital data bases,” in Wavelets and Multiscale analysis. Springer, 2011, pp. 161–197.

[32]
M. Lee, H. Shen, J. Z. Huang, and J. Marron, “Biclustering via sparse singular value decomposition,”
Biometrics, vol. 66, no. 4, pp. 1087–1095, 2010.  [33] L. Hubert and P. Arabie, “Comparing partitions,” Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985.
 [34] R. R. Coifman and W. E. Leeb, “Earth mover’s distance and equivalent metrics for spaces with hierarchical partition trees,” Yale University, Tech. Rep., 2013, technical report YALEU/DCS/TR1482.
 [35] R. R. Meyer, “Sufficient conditions for the convergence of monotonic mathematical programming algorithms,” Journal of Computer and System Sciences, vol. 12, no. 1, pp. 108–121, 1976.
Appendix A Derivation of Majorization
We first construct a majorization of the datafidelity term. It is easy to verify that the following function of
(3) 
where , majorizes the datafidelity term at .
We next construct a majorization of the penalty term. Recall that the firstorder Taylor approximation of a differentiable concave function provides a tight bound on the function. Therefore, under Assumption 2.2, we have the following inequality
Thus, we can majorize the penalty term with the function
(4) 
where is a constant that does not depend on and and (2) are weights that depend on . The sum of functions (3) and (4)
majorizes our objective function (1) at .
Appendix B Convergence
The MM algorithm generates a sequence of iterates that has at least one limit point, and the limit points are stationary points of the objective function
(6) 
To reduce notational clutter, we suppress the dependency of on and since they are fixed during Algorithm 1. We prove Proposition 1 in three stages. First, we show that all limit points of the MM algorithm are fixed points of the MM algorithm map. Second, we show that fixed points of the MM algorithm are stationary points of in (6). Finally, we show that the MM algorithm has at least one limit point.
b.1. Limit points are fixed points
The convergence theory of MM algorithms relies on the properties of the algorithm map that returns the next iterate given the last iterate. For easy reference, we state a simple version of Meyer’s monotone convergence theorem [35], which is instrumental in proving convergence in our setting.
Theorem 1.
Let be a continuous function on a domain and be a continuous algorithm map from into satisfying for all with . Then all limit points of the iterate sequence are fixed points of .
In order to apply Theorem 1, we need to identify elements in the assumption with specific functions and sets corresponding to the problem of minimizing (6
). Throughout the following proof, it will sometimes be convenient to work with the column major vectorization of a matrix. The vector
is obtained by stacking the columns of on top of each other.The function : Take and to be the objective function in (6) and majorize with given in (A). The function is continuous. Let denote the algorithm map for the MM algorithm. Since is strongly convex in , it has a unique global minimizer. Consequently, for all .
Continuity of the algorithm map : Continuity of follows from the fact that the solution to the convex biclustering problem is jointly continuous in the weights and data matrix [17][Proposition 2]. The weight is a continuous function of , since is continuous according to Assumption 2.2. The weight is likewise continuous in . The data matrix passed into the convex biclustering algorithm is , which is a continuous function of since the projection mapping is continuous.
b.2. Fixed points are stationary points
Let and , where denotes the Kronecker product. Then
The directional derivative of in the direction at a point is given by
A point is a stationary point of , if for all direction vectors
where .
A point is a fixed point of , if is in the subdifferential of , i.e.
(7) 
where the set on the right is the subdifferential .
If , then . On the other hand, if , then .
Fix an arbitrary direction vector . The inner product of with an element in the set on right hand side of (7) is given by
(8) 
where and .
Then the supremum of the right hand side of (8) over all and is nonnegative, because . Consequently, all fixed points of are also stationary points of .
b.3. The MM iterate sequence has a limit point
To ensure the existence of a limit point, we show that the function is coercive, i.e. for any sequence . Recall that according to Assumption 2.1 we assume that the row and column edge sets and form connected graphs. Therefore, if and only if [17, Proposition 3]. The edgeincidence matrix of the column graph encodes its connectivity and is defined as
The row edgeincidence matrix is defined similarly. Assume that nonempty, i.e. at least one entry of the matrix has been observed. Finally, assume that is also coercive.
Note that any sequence where . Note that and . Let be a diverging sequence, i.e. . There are two cases to consider.
Case I: Suppose that . Let
and let denote the smallest singular value of . Note that the null space of is the span of . Therefore, since
(9) 
Also note that
Since the mapping is a norm, and all finite dimensional norms are equivalent, there exists some such that
(10) 
By the triangle inequality
(11) 
Let then
(12) 
Putting inequalities (9), (10), (11), and (12) together gives us
(13) 
Since is increasing according to Assumption 2.2, it follows that
(14) 
Inequality (14) implies that
Consequently, since is increasing and implies that .
Case II: Suppose for some . Then . Note that we have the following inequality
where
Comments
There are no comments yet.