## 1 Introduction

In the matrix completion problem, we seek to recover or estimate a matrix, when only a fraction of its entries are observed. While it is impossible to complete an arbitrary matrix using only partial observations of its entries, it may be possible to fully recover matrix entries when the matrix has an appropriate underlying structure. For example, most low-rank matrices can be completed accurately with high probability, by solving a convex optimization problem

(Candés and Recht, 2009). Consequently, algorithms for low-rank matrix completion have enjoyed widespread use across many disciplines, including collaborative filtering and recommender systems (Koren et al., 2009), multi-task learning and classification (Amit et al., 2007; Argyriou et al., 2007; Wu and Lange, 2015)(Chen and Suter, 2004), statistical genetics (Chi et al., 2013), as well as remote sensing (Malek-Mohammadi et al., 2014).In this paper, we consider matrix completion under a structural assumption that is closely related to the low-rank assumption; i.e., we assume that the matrix entries vary “smoothly” with respect to a graphical organization of the rows and columns. For example, in the context of a movie recommendation system, we seek to complete a user-by-movies ratings matrix. We may have additional information about users, such as if pairs of users are friends on a social media application, as well as additional information from a movie database, such as the co-occurrence of certain film principles. We expect the entries of a movie ratings matrix to vary “smoothly” over a neighborhood of users, defined by a friendship graph, and over a neighborhood of movies, defined by a shared movie principles graph. When such local similarity structure exists, and is available, it behooves us to leverage this information to predict missing entries in a matrix.

In general, we wish to recover a matrix from a noisy and partially observed matrix when there exist similarities between pairs of rows and pairs of columns. Let the parameters and for and denote non-negative weights that quantify the similarities between pairs of rows and pairs of columns. Let denote the set of observed indices. Finally, let denote the projection operator onto the set of indices where the th entry of is if and is zero otherwise. With this notation in hand, we can pose this version of the matrix completion task as the following optimization problem:

(1.1) |

where

In the equations above, ( denotes the th row (column) of the matrix and are nonnegative regularization parameters. The first term quantifies the misfit between and over the observed entries . The second term is a penalty that incentivizes smoothness with respect to the row and column similarities. The two nonnegative parameters control the relative importance of minimizing the discrepancy between and over , and enforcing smoothness of with respect to the given row and column similarities. We refer to the matrix completion problem given in (1.1) as the biclustered matrix completion (BMC) problem. Several variations on (1.1) have been proposed in the literature prior to this work (Ma et al., 2011; Cai et al., 2011; Kalofolias et al., 2014; Rao et al., 2015; Shahid et al., 2016), and smoothness penalties similar to have been applied in penalized regression (Li and Li, 2008; Rañola et al., 2014; Hu and Allen, 2015; Li et al., 2016)

and functional principal components analysis

(Huang et al., 2009; Tian et al., 2012; Allen et al., 2014).### 1.1 Contributions

Our major contributions in this paper are two-fold. First, we derive some new properties of BMC problem, concerning the existence and uniqueness of a solution and as well as the solution’s limiting behavior as the penalty parameters tend to infinity. Second, we provide a computational framework for model selection, namely choosing . We survey the contents of this paper, emphasizing the main results.

#### Properties of BMC

Despite the widespread use of the graph smoothing penalties like in matrix completion, we present new results on basic properties of the regularizer , the BMC optimization problem (1.1), and the BMC solution. Many of these results, while intuitive, have been taken for granted without careful justification. A key consequence of these results is that they highlight when BMC also recovers a low-rank matrix. This fact suggests that BMC may be more computationally advantageous over other variants proposed in the literature. Additionally, these results also suggest strategies to sparsify the row and column weights in order to speed up estimation while still ensuring that the BMC problem is well defined. Specifically, we show that the BMC problem always has a solution and give conditions on the missingness pattern and row and column weights that guarantee the solution’s uniqueness. Furthermore, we show that as the regularization parameters diverge to infinity, the BMC solution converges to a limiting smooth estimate of the data matrix and also derive what this limit is.

#### Computational framework for model selection

The optimization problem in (1.1) is convex and differentiable. The solution for a fixed set of regularization parameters requires solving a linear system. As we will see later, this system admits a unique solution under conditions that can be easily verified. We study, in detail, the problem of solving this linear system for a fixed set of parameters, as well as choosing optimal parameters , i.e., perform model selection. The prevalent approach to choosing these parameters is by searching for a minimizer of a surrogate measure of predictive performance over a two-way grid of candidate parameters. Common surrogate measures include prediction error on hold-out sets, as in cross-validation, and various information criteria. Cross-validation in particular is popular since it is easy to implement (Rao et al., 2015; Shahid et al., 2016). While grid-search may be computationally feasible for choosing a single parameter, it can be prohibitively expensive when selecting two parameters since each grid point requires fitting a model for those parameters, and in the case of BMC fitting a model requires inverting a potentially very large matrix. Moreover, it requires pre-specifying a grid of regularization parameters.

Our second contribution is a novel scalable strategy for model selection in BMC problems based on directly minimizing the Bayesian Information Criterion (BIC). The BIC for (1.1) is continuously differentiable and is amenable to minimization by Quasi-Newton methods. To further scale up our procedure, we introduce a refinement based on the Hutchinson estimator to approximate the BIC, and then minimize this approximation. Our resulting procedure, which we call Iterative Model Selection (IMS), leads to drastic reduction in the computational time to select and does not require pre-specifying a grid of regularization parameter pairs to explore. Figure (a)a shows an example of the search path taken by IMS exploring the BIC surface on one of the simulated problems described in Section 7.1. IMS took 12 iterations to converge to the minimum. Consider searching the BIC surface over a 50-by-50 grid of candidate parameters. Figure (b)b shows the set of 50-by-50 grid points at which the BIC would have to be evaluated. Each evaluation requires solving a large linear system. As we show later in Section 5, the dominant calculation at each IMS iteration is solving the same linear system. While similar smoothing parameters would be chosen by the two procedures, this simple example illustrates how the naive grid-search may blindly evaluate the BIC at many points far from a minimum and therefore may unnecessarily solve far more linear systems than the IMS. In this example, grid-search would solve 2,500 linear systems, while IMS would solve 12 to arrive at essentially the same model.

To summarize, the IMS path sports the follows advantages over the standard grid-search: (i) In practice, it often takes a more direct route to a model minimizing the BIC leading to potentially many fewer linear system solves, (ii) it does not not require pre-specifying the grid, and (iii) consequently, model selection is not restricted to a finite set of pre-specified grid points. In short, by enabling the model search to go off the tuning parameter grid, we can perform similar and sometimes superior model selection while also reaping significant savings in computation time.

The rest of the paper is organized as follows. In Section 2, we review the relationship of the BMC problem to the prior art in matrix completion. In Section 3, we present new results on properties of the BMC solution. In Section 4, we discuss the problem of solving (1.1) for a fixed pair of regularization parameters. In Section 5, we frame the model selection problem and discuss how to efficiently search the regularization parameter space with IMS to select a model with good prediction accuracy. In Section 6, we elaborate on how to further scale up IMS using stochastic approximation strategies. In Section 7, we present an empirical comparison of IMS and standard grid-based regularization parameter selection methods on both simulated as well as a real data example from radiogenomics. In Section 8, we close with a discussion.

## 2 Relationship to Prior Art

To put BMC into context and clarify its connections to prior art, we review the two primary formulations of matrix completion in the literature: low-rank matrix completion (LMRC) and matrix completion on graphs (MCG).

#### Low-Rank Matrix Completion (LRMC)

In the noisy LRMC problem, we seek to recover a denoised matrix from a noisy and incomplete matrix by solving the following constrained optimization problem:

(2.1) |

This formulation balances the tradeoff between how well matches over the observed entries and model complexity of as measured by its rank. As we relax the bound on the rank by making it larger, we can better fit the data at risk of overfitting it.

Due to the rank constraint, (2.1

) is a combinatorial optimization problem and quickly becomes impractical to solve as the problem size increases. Fortunately, we can solve the following computationally tractable convex problem instead:

minimize | (2.2) |

As before in (1.1), the first term quantifies how well approximates over the observed entries . The second term denotes the nuclear norm of

, which is the sum of its singular values, and the nonnegative regularization parameter

trades off the emphasis on these two terms. Problem (2.2) is related to problem (2.1) through the fact that the nuclear norm of a matrix is the tightest convex approximation to its rank (Fazel, 2002). Remarkably, under suitable conditions on the missingness patterns defined by , the solution to the convex problem in (2.2) also coincides with those of the combinatorial problem in (2.1) with high probability (Candés and Plan, 2010).#### Matrix Completion on Graphs (MCG)

Given how successful the low-rank paradigm is, a natural strategy for incorporating information on row and column similarities would be to augment (2.2) with the penalty and solve the following convex optimization problem:

(2.3) |

With respect to BMC, the only difference between (1.1) and (2.3) is the addition of a nuclear norm penalty in (2.3). While the problem defined in (2.3) is also convex, including the nuclear norm penalty drastically complicates the estimation procedure. Solving LRMC is tractable because there exist polynomial time iterative solvers. Nonetheless, iterative solvers for (2.2) and consequently (2.3

) typically require computing an expensive singular value decomposition (SVD) to account for the nuclear norm regularizer. Considerable attention has been given to either formulate alternative non-convex optimization problems that omit the nuclear norm penalty entirely

(Burer and Monteiro, 2003; Srebro et al., 2005; Rao et al., 2015), or performing judiciously chosen low-rank SVD calculations (Mazumder et al., 2010; Cai et al., 2010). Moreover, there are now three tuning parameters that tradeoff the emphasis on the data-fit and the structure imposed on . Given the costs of including the nuclear norm penalty, a natural question to ask is how much added benefit is gained by including it?The following illustrative example provides some evidence that the penalty is typically sufficient for completion tasks when the matrices exhibit strong row and column clustering structure. Such matrices exhibit a checkerboard or biclustered structure under row and column reordering.

Figure 2 compares the results from performing LRMC and BMC on a digital replica of the oil painting ‘Composition A’ (Figure (a)a) by the Dutch painter Piet Mondrian^{1}^{1}1A jpeg file was obtained from http://www.wikiart.org/en/piet-mondrian/composition-a-1923. after adding noise and removing half its entries (Figure (b)b). To the eye, both LRMC (Figure (c)c) and BMC (Figure (d)d) appear to give reasonably good reconstructions. Further inspection reveals that the BMC predictions have lower mean squared error over the unobserved entries than LRMC. Details on this experiment and the MSE calculations are in the Supplementary Materials.

The comparable performance of LRMC and BMC on this example suggests that the nuclear norm penalty in the MCG problem may be an unnecessary computational complication when there is an underlying biclustering structure. Indeed, we will see next that the penalty shrinks solutions towards a low-rank matrix defined by the connectivity structure of the underlying row and column graphs.

## 3 Properties of the BMC Solution

The BMC formulation is related to recent work by Shahid et al. (2016); however, they present results from a signal processing perspective. In contrast, our perspective is on shrinkage estimation. Furthermore, the results on matrix completion are new. All proofs are in Section B of the Appendix.

To better understand the action of , we need to review some basic facts from algebraic graph theory. Let denote an undirected graph with a vertex set and an edge set . A weighted undirected graph also includes a non-negative weight function that is symmetric in its arguments, namely . The set is a connected component of if (i) there is a sequence of edges forming a path between every pair of vertices in and (ii) none of its vertices are connected to any vertices in its complement . Let denote the indicator function on the set of vertices , namely if and if . Recall that the graph Laplacian of is a symmetric positive semidefinite matrix given by

Define a weighted undirected row graph with and weights , and denote its graph Laplacian by . We use analogous notation for a weighted undirected column graph .

It is straightforward to show that the regularizer can be expressed in terms of the two graph Laplacians, as

The expression above explicitly characterizes the shrinkage action of in terms of the connectivity properties of and . We present a result which gives conditions under which the penalty .

###### Proposition 3.1.

Suppose that there are row connected components in and column connected components in Then the penalty if and only if for some for and .

Property 3.1 suggests that the penalty incentivizes approximations of whose rows and columns are spanned by the indicator functions of the connected components of the row and column graphs and . In other words, shrinks estimates to matrices that are piecewise constant on submatrices defined by the functions . We refer to these submatrices as biclusters or checkerboard patches. Indeed, suppose that the data matrix is a linear combination of the outer products of the indicator functions of the row and column connected components, namely

(3.1) |

Given Property 3.1, we intuitively expect that the BMC estimate should be able to exactly recover missing entries in this scenario. This is indeed the case, provided the missingness pattern is reasonable. We make explicit what we mean by reasonable in the following assumption, which will be invoked throughout the rest of this paper.

###### Assumption 3.1.

If there are row connected components in and column connected components in , then for all and .

In words, Assumption 3.1 states that every checkerboard patch defined by a pair of row and column connected components must have at least one observation. Under this assumption and the ideal scenario presented in (3.1), the BMC estimate of the missing entries is exact.

###### Proposition 3.2.

There are two important observations about the form of in (3.1). First, that can be expressed as in (3.1) corresponds to the checkerboard pattern we seek to recover. Second, such are low-rank, when the number of row clusters and the number of column clusters and consequently in (3.1) has rank at most . The second observation motivates employing the simpler BMC over MCG when the underlying matrix has a biclustered structure.

The penalty is already shrinking solutions towards a low-rank solution, likely rendering the addition of a nuclear norm penalty a computationally expensive redundancy. Of course, this is an ideal case when the data matrix has the form in (3.1). We bring it up mainly to understand (i) what is shrinking estimates towards, (ii) when the nuclear norm may be unnecessary, and consequently (iii) for what kind of data matrices BMC is best equipped to recover. These results suggest that BMC should perform well when the true underlying matrix has an approximately checkerboard pattern and row and column weights that are consistent with that pattern can be supplied. Experiments in Section 7 will confirm this suspicion. For now though, we turn our attention to the properties of the BMC problem and solution for a general data matrix and general set of row and column weights.

Our first main result concerns the existence and uniqueness of the solution to the BMC problem (1.1).

###### Theorem 3.1.

The interpretation of this result is that there is a unique solution to the biclustered matrix completion problem if and only if no bicluster induced by the row and column graph Laplacians is completely missing. On the other hand, the prediction error for the reconstruction can be arbitrarily poor if Assumption 3.1 fails to hold.

In order for Theorem 3.1 to be practical, however, we need a way to verify Assumption 3.1. We provide an algorithm based on breadth-first-search that accomplishes this in time linear in the size of the data. Details are given in the Section A of the Appendix. The next two results characterize the limiting behavior of as a function of the tuning parameters .

Since is shrinking estimates towards the checkerboard pattern induced by the clustering pattern in the row and column graphs, we intuitively expect that the estimate tends toward the solution of the following constrained optimization problem:

(3.2) |

Moreover, we anticipate that this limiting solution should be the result of averaging the observed entries over each checkerboard patch. This is indeed the case.

###### Proposition 3.3.

The next result verifies our intuition that the estimate tends towards to in (3.3) as and tend towards infinity.

## 4 Estimation

In this section, we discuss how to solve the estimation problem for a fixed set of parameters

in order to quantify the amount of work a standard grid-search method would incur. It is easier to work with vectorized quantities. Let

, namely is the vector obtained by stacking the columns of on top of each other. Then the objective in (1.1) can be written as(4.1) |

where the binary operator denotes the Kronecker product and is a diagonal matrix with a 1 in the th diagonal entry if the th entry in the matrix (column major ordering) is observed and 0 otherwise. More explicitly, if , then where . We have rewritten the two penalty expressions in terms of by invoking the identity .

Since the objective function in (4.1) is differentiable and convex, we seek the vector at which the gradient of the objective vanishes. Thus, the estimate is the solution to the following linear system obtained by setting the gradient equal to zero,

where . Note that under Assumption 3.1, this linear system is invertible and therefore .

#### Sparse Weights

Recall that is -by-. If the majority of the row and column weights are positive, the resulting Laplacian matrices will be dense and solving this -by- linear system will take a demanding operations. On the other hand, if most of the row and column weights are zero, namely the weights are sparse, then we can solve the linear system in substantially less time as discussed below. Fortunately, it is possible to construct sparse approximations to the weights graphs that lead to Laplacian regularized solutions that are close to the solutions one would obtain using the original dense Laplacian regularizers (Sadhanala et al., 2016). Unless stated otherwise, for the rest of the paper we will assume that the weights are sparse. In particular, we assume the number of positive weights is linear in the size of the graph. This can be achieved using a

-nearest neighbors (knn) sparsification strategy described in Section

7.2.Finally, we emphasize that there is a tension between minimizing computational costs and ensuring reliable estimation. If the weights are too sparse, Assumption 3.1 can fail to hold and there may not be a unique solution to the BMC problem (1.1). In practice, the knn sparsification for strikes a reasonable balance between these two goals. Again, we emphasize that we can easily check Assumption 3.1 to expedite the identification of a good sparsity level.

#### Computational Complexity

Since is symmetric and sparse, the solution to can be computed using either a direct solver, such as the sparse Cholesky factorization of , or an iterative solver, such as the preconditioned conjugate gradient.

In the direct approach, a triangular factorization of is computed and then forward and backward substitution are performed on two triangular systems to obtain the solution. The exact computational complexity of a sparse direct solver depends on the underlying sparsity pattern; however, a (knn) sparsity pattern in the matrix

is similar to the discretization of elliptic partial differential equations in two spatial dimensions. For these problems, the computational complexity for solving a linear system involving

, which is of size -by-, requires flops (George, 1973).On the other hand, if the problem at hand is very large, factorization methods may not be feasible. We recommend that the linear system be solved using Preconditioned Conjugate Gradient method, with the Incomplete Cholesky Factorization method as a preconditioner (Golub and Van Loan, 2012, Section 10.3). In numerical experiments, we have successfully used this approach for large-scale problems, which is directly available via MATLAB.

## 5 Model Selection

We now address the issue of choosing the penalty parameters. We seek the parameters that result in the model with the best prediction error. There are two general approaches to estimating this prediction error: covariance penalties that are analytic and sampling methods that are non-parametric. The former includes methods such as Mallow’s , Akaike’s information criterion (AIC), the Bayesian information criterion (BIC) (Schwarz, 1978), generalized cross-validation (GCV) (Craven and Wahba, 1978; Golub et al., 1979), and Stein’s unbiased risk estimate (SURE). The latter includes methods such as cross-validation (CV) and the bootstrap. Interested readers may consult Efron (2004) for an in depth discussion on the relationship between the two approaches. In this article, we will use the BIC and compare it to CV since each are widely used in practice. All proofs are in Section B of the Appendix.

To derive the BIC for BMC, we first need to compute the degrees of freedom for BMC. In general, the degrees of freedom of a model quantifies its flexibility. To derive the degrees of freedom for a model, we need a probabilistic model for the data generating process. Suppose we observe noisy measurements of a parameter , namely where and

are uncorrelated random variables with zero mean and common variance

. Let denote an estimate of . Then the degrees of freedom of is given bydf | (5.1) |

In the case of BMC, this is a straightforward calculation. We assume the vectorization of our data, , is given by , where the elements of are uncorrelated errors with zero mean and common variance . In the BMC problem, the vector plays the role of in the formula (5.1). Note that the degrees of freedom calculation does not require that the mean vector to have a checkerboard structure.

###### Proposition 5.1.

The degrees of freedom of the BMC estimate is given by , where

The degrees of freedom for BMC has several intuitive properties. As diverges, the degrees of freedom decreases monotonically to the degrees of freedom of , defined in (3.3), namely where and are the number of connected components of and respectively.

###### Proposition 5.2.

If Assumption 3.1 holds, then the degrees of freedom possesses the following properties: (i) for all positive, (ii) whenever and , and (iii) for any sequence such that .

The BIC for BMC is given by the following expression.

We can re-express the BIC in terms of to make the dependence of the BIC on and more explicit.

A naive approach fits models over a grid of values and chooses the pair of regularization parameters that minimize the BIC. But this approach does not leverage the differentiability of the BIC. Since the BIC is differentiable, we compute the gradient with respect to and employ gradient descent. Let denote the solution of the system and define the residual . Then the gradient is given by the following equation.

(5.2) |

where and . We provide a derivation in the Supplementary Materials. With the gradient in hand, for little additional cost we may employ accelerated first order methods, for example SparSa (Wright et al., 2008) or FISTA (Beck and Teboulle, 2009), or a Quasi-Newton method (Nocedal and Wright, 2006, Ch 6). In this article, we apply the Quasi-Newton method. Note that with a trivial modification we can extend the IMS strategy to iteratively minimizing the AIC.

#### Computational Complexity

The key cost in evaluating the BIC is computing the trace of and for (5.2) the key cost is computing the trace of and . Here, one can no longer take advantage of the sparsity of the system, since the inverse of is dense and therefore the computational complexity of computing the inverse is now . The same is also true for computing and . Moreover, storing a dense matrix is infeasible when and are large.

Grid based model selection using a BIC or AIC will require work, as each grid point requires trace of the inverse of an -by- matrix which can be prohibitively expensive for large-scale problems of interest. We tackle this computational challenge with a two-pronged strategy: (i) we approximate the trace computation using a Monte Carlo method, to reduce the cost of each objective function evaluation, and (ii) we use an optimization approach for minimizing the BIC.

## 6 Scaling up Iterative Model Selection

In this section, we briefly review Monte Carlo estimators for computing the trace of a matrix and discuss the computational costs associated with it. Next, we express the BIC minimization problem as a stochastic programming problem; we then approximate the objective function using a Monte Carlo estimator. This is called the Sample Average Approximation (SAA) method.

### 6.1 Monte Carlo Trace Estimator

In the application at hand, is a large sparse matrix; however, is a large dense matrix. Consequently, forming explicitly is not advisable. We turn to a matrix-free approach for estimating the trace. Hutchinson (1989)

introduced an unbiased estimator for the trace of a positive semidefinite matrix

,(6.1) |

Here are i.i.d. samples from a Rademacher distribution, i.e., has entries with equal probability. Other choices for distributions have been proposed (Avron and Toledo, 2011); in general, must have zero mean and identity covariance. Examples of alternative distributions include Gaussian random vectors and random samples from columns of unitary matrices (Avron and Toledo, 2011; Roosta-Khorasani and Ascher, 2015).

The quality of the Monte Carlo estimator given in (6.1) can be judged by two different metrics – variance and the number of samples for an -estimator. Hutchinson (1989) showed that the variance of the estimator given in (6.1) is . Therefore, if the off-diagonal entries of the matrix are large compared to its diagonal entries, the variance of the Hutchinson estimator can be quite large. An estimator is called an -estimator (for and ) if the relative error of the trace is less than with probability at least . Roosta-Khorasani and Ascher (2015) showed that the minimum number of samples from a Rademacher distribution for an -estimator is . This result implies that for an accurate estimator (i.e., small , many samples are required (). However, as we will demonstrate in the numerical experiments, a modest number of samples are sufficient for our purpose.

Employing the Hutchinson estimator requires repeated evaluation of quadratic forms , in order to estimate the . The inverse need not be computed explicitly; instead, we first solve , and then compute the inner product . Methods to solve linear system are discussed in Section 4.

### 6.2 Sample Average Approximation (SAA) method

Following Anitescu et al. (2012), we substitute the exact trace operation in the objective function with its Hutchinson estimate, to obtain a stochastic programming problem. Define

Therefore, the minimizer of the BIC is equivalent to the solution of the optimization problem .

The SAA to the objective function gives an unbiased estimator of the BIC function

(6.2) |

where are i.i.d. random vectors in with mean zero and identity covariance. Following the previous subsection, we choose

as Rademacher random matrix. As before, we compute the gradient of (

6.2).A natural question, is what is the relationship between the regularization parameters obtained by minimizing the approximate to regularization parameters obtained by minimizing the true BIC. Let denote a stationary point of , namely and suppose further that the Hessian of the BIC at is nonsingular, namely is nonsingular. Let denote a stationary point to the unbiased estimate of the BIC, namely

where we have made explicit the dependency of on the outcome to emphasize that is a random variable. Then with the assumptions of Shapiro et al. (2014, Theorem 5.14) with probability 1, the stationary point of is locally unique in a neighborhood of for sufficiently large.

Furthermore, define

and the empirical covariance matrix . From (Shapiro et al., 2014, Theorem 5.14), it can be shown that

### 6.3 Summary of Computational Costs for Model Selection

We now compare and contrast the computational costs of the various strategies for performing model selection in the BMC problem. Direct grid search to obtain model selection parameters using -fold cross-validation is expensive both in terms of the per function evaluation and number of function evaluations which amounts to a total computational cost of . The cost for direct grid search using the BIC is even worse; evaluating the BIC function over grid points costs . Using the Hutchinson approximation, however, substantially lowers the cost of the BIC function evaluation, because now we need to solve a sparse system rather than explicitly invert the matrix. The number of function evaluations, however, remains the same and therefore, the computational cost is , where is the number of vectors used in the Monte Carlo approximation. In the IMS, a Quasi-Newton approach is used to optimize the BIC to obtain the model parameters, where now, the objective function and the gradient have been approximated using the Hutchinson trace approach. The objective function costs , and the gradient evaluation requires an additional cost of . We note that the gradient computation only requires solving only one additional linear system involving , since intermediate computations involving the objective function can be reused for evaluating the gradient, details are provided in the Supplementary Materials. It is clear that the IMS is computationally cheaper than CV if the number of IMS iterations satisfies . This is indeed what we will see in our numerical experiments in Section 7. A summary of the computational costs is provided in Table 1. In the above analysis, we have assumed a direct solver has been used to solve systems involving ; an iterative solver may be computationally beneficial for large-scale systems and the cost is similar.

CV | BIC | BIC + Hutchison | IMS | |
---|---|---|---|---|

Obj. func. | ||||

Gradient | - | - | - |

## 7 Numerical Experiments

We now discuss numerical experiments to evaluate the exact and approximate IMS methods on simulated and real data. We also compare the IMS methods to standard grid-search strategies. All experiments were conducted in Matlab. To compare timing results, we record the time in seconds required to complete model selection. We perform computations in serial on a multi-core computer with twenty four 3.3 GHz Intel Xeon processors and 189 GB of RAM.

### 7.1 Simulated Data

We first compare IMS via Quasi-Newton optimization, BIC grid-search, and cross-validated grid-search on simulated data. We consider two versions of the Quasi-Newton optimization: (i) exact computation and (ii) Hutchinson estimation. Identical experiments with the AIC used in place of the BIC lead to similar results and are summarized in the Supplementary Materials.

In all simulated data experiments, the matrix that we seek to recover consists of four biclusters.

where is the matrix of all ones. We observe the noisy matrix , where are i.i.d. . We use the following row and column weights

where and . Thus, the weights introduce some erroneous smoothing across distinct biclusters. While the noise level and weights choices are admittedly not particularly challenging from an inferential perspective, our main objective in these studies is to understand the computational impact of the choices we make in deciding upon a model selection procedure.

We perform three different experiments. The first experiment evaluates the run-time versus accuracy tradeoff of IMS via the Quasi-Newton method when using the exact computation versus the Hutchinson estimator. The point of this experiment is to assess how gracefully the quality of the stochastic approximation of the BIC degrades as a function of the number of samples used to compute the approximation. The second and third experiments compare the run-time versus accuracy tradeoff of the Quasi-Newton method against two standard grid-search model selection methods: cross-validation grid-search and grid-search over the BIC surface. For all experiments, we use three different missingness fractions (0.1, 0.3 and 0.5); metrics are averaged over 30 replications.

Figure 3 compares exact computation and Hutchinson estimation in terms of runtime, BIC, and mean squared error (MSE) over missing entries and observed entries. Unexpectedly, regardless of the method, the prediction accuracy increases as the fraction of missing entries decreases. Remarkably, however, the Quasi-Newton method with Hutchinson estimation can recover the matrix as well as the Quasi-Newton method with exact computation even when the sample size is 5. This is a nontrivial windfall, as using just 5 samples takes significantly less time than the exact computation. In light of this result, we use a sample size of 5 whenever we employ the Hutchinson estimator in subsequent experiments with simulated data.

Figure 4 compares the Quasi-Newton method (exact computation and Hutchinson estimation) against cross-validation grid. We use 5-fold cross-validation on MSE over missing entries, and we test on three different levels of grid coarseness and three missingness fractions. All grid-searches occur over the range . We denote by Grid () the set of evenly spaced points on the interval that are then exponentiated. Thus, larger corresponds to a finer grid. The upper left and right panels in Figure 4 show that Quasi-Newton with Hutchinson estimation takes the least time and has the best performance in objective value BIC. The lower left and right panels in Figure 4, show that the parameters chosen using BIC criteria lead to models with better performance in MSE than those chosen using cross-validation on MSE. The ability to go off the parameter grid is evident. Even the finest grid-search, Grid (37), in our study cannot reach the optimal MSE achieved obtained via the IMS via Quasi-Newton methods.

Figure 5 compares the Quasi-Newton method (exact computation and Hutchinson estimation) against BIC grid-search with different levels of coarseness. The upper left panel in Figure 5 again shows that the Quasi-Newton direct optimization takes less time than searching over a finer grid, and the Quasi-Newton with Hutchinson estimation takes even less time. The lower left and right panels in Figure 5 show that the Quasi-Newton with Hutchinson method achieves lower prediction error than grid-search when 10% of the entries missing. Even though the finest grid-search, Grid (37), has better performance on average, when greater fractions of data are missing, the Quasi-Newton methods are not far behind. Employing the Quasi-Newton method with Hutchinson estimation is clearly attractive given its accuracy and superior run time.

### 7.2 Real Data Example from Radiogenomics

The goal in radiogenomics is to create a rational set of rules for recommending a patient for genomic testing based on a collection of radiographic findings (Rutman and Kuo, 2009; Colen et al., 2014). The key task is to identify associations between radiographic features and genomic pathway activity. In the case of glioma radiology, the Visually Accessible Rembrandt Images (VASARI) is a standard way of reporting MRI finding for gliomas (Gutman et al., 2013). In addition, computational approaches to identifying image features based on tumor volume and variation in voxel intensities are also used (Yang et al., 2015).

To set some notation, suppose on patients we obtain two sets of measurements: a matrix where the th row is the vector of radiographic features for the th patient and a matrix of gene expressions where the th row is the vector of pathway activities. To relate these computationally-derived image features with gene expression data, we consider the cross-covariance matrix . The th entry of is the quantifies the association between the th imaging feature and the th pathway. The objective is to identify correlated features such as tumor size with gene mutations and ultimately derive more principled rules for ordering genetic testing.

There is a missing data challenge, however, as patients may be missing annotation on some radiographic features and gene pathway expression levels. Nonetheless, similarity weights can be inferred for the radiographic features as well as for the gene pathways using measurements from different modalities. The availability of similarity structure suggests that the problem of identifying missing associations in a radiogenomics study may be accomplished using BMC. We now consider how BMC performs on a radiogenomics example involving a subset of patients from The Cancer Genome Atlas (TCGA).

Figure 6 shows the cross covariance matrix between 48 segmentation-based fractal texture (SFTA) features extracted from T1-post contrast MRI scans and 533 genomic pathways^{2}^{2}2Pathway data is available at https://gdac.broadinstitute.org/.. The 48 SFTA features were obtained by using the method of Costa et al. (2012)

. Both imaging features and pathways were recorded on the same set of 77 TCGA patients. Rows and columns have been reordered using single-linkage hierarchical clustering on the rows and columns independently. Reordering the rows and columns reveals that the data has a checkerboard pattern.

We construct weights in two stages analogous to the construction of

-nearest-neighbor graphs in spectral clustering

(von Luxburg, 2007). We describe how row weights are constructed; columns weights are construction similarly. Initial row weights consist of the exponentiated Pearson correlation between the row. Thus, only positive correlations led to strong smoothing shrinkage. Although a more sophisticated weight choice may take advantage of negative correlations, these simple weights are effective for our purposes, as our focus is on evaluating the computational performance of IMS. We then make the row weights sparse, or mostly zero, using the following rule. Fixing , we find the 5 largest values of . If is not among these top values, we set . We then repeat this step with fixed. We do this for all and . Approximations to this procedure should be employed and can be accomplished with approximate -nearest-neighbors algorithms (Slaney and Casey, 2008), as searching over all pairs requires computation that grows quadratically in the number rows.Since the cross-covariance matrix is larger than the simulated matrices considered earlier, we performed the following illustrative experiment to compare the performance using Quasi-Newton with Hutchinson estimation and conjugate gradient (CG) with Hutchinson estimation. It is important to note here that the CG method is used to solve a linear system in order to compute the gradient of the BIC and not used as a method to minimize the BIC. Results are shown in Figure 7. The experiment were repeated for 30 times under 3 different missing fractions: 0.1, 0.3 and 0.5. All methods produce equally good predictions at all levels of missingness. We see that for the larger matrix, however, that the CG method can provide some additional speed ups for larger matrices such as the radiogenomics cross-covariance considered here.

## 8 Discussion

In this paper, we revisited the matrix completion problem when we have additional information on row and column similarities. We posed this problem as a penalized convex least squares problem and established several properties of this problem formulation, in particular when this problem admits a unique solution. We also introduced an efficient algorithm called IMS for tuning the amount of regularization in practice. We showed that when rows and columns exhibit a strong clustering pattern, a pair of differentiable Laplacian penalties can recover a low-rank structure that is consistent with the row and column similarities. This motivates solving a differentiable convex optimization problem, that has been previously proposed in the literature, with two penalty terms instead of a nondifferentiable convex optimization problem with three penalty terms one of which is a nuclear norm penalty. Dropping the nuclear norm penalty has three advantages: (i) an expensive SVD calculation is avoided, (ii) model selection is reduced to searching for two tuning parameters, instead of three, and (iii) model selection via the BIC can be achieved by minimizing a differentiable function of two variables. We emphasize that what makes advantage (iii) possible is that the degrees of freedom has a differentiable closed form expression. If we included the nuclear norm penalty, we could derive an unbiased estimate of the degrees of freedom following (Candés et al., 2013), but the resulting estimate would not be differentiable and consequently the BIC could not be minimized via the Quasi-Newton method.

Exhaustively searching for a minimizer of a surrogate measure of predictive performance over a regular two-way grid of tuning parameters is typically inefficient. Ideally one would place grid points more densely in the model space where the optimal model resides. Unfortunately, this information is not known in advance. With IMS, we do not need to construct a grid of candidate regularization parameters and can even identify better predictive models by going off the parameter grid and not limiting the search to models defined by the parameters on the grid. Since the work required to fit a model at a grid point is essentially the same as the work required for a single IMS step, searching the space with IMS can lead to drastically fewer linear system solves in practice.

We can further expedite model selection by reducing the amount of work expended at each gradient calculation. In this paper, we proposed using the Hutchinson estimator to approximate the trace term in the BIC. This stochastic approximation to the BIC is then minimized using the Quasi-Newton method. Remarkably, our numerical experiments demonstrated that even coarse approximations regularly lead to the selection of models with prediction errors that rivaled those obtained by minimizing the exact BIC. This is significant because computations with the approximations take an order of magnitude less time than their exact counterparts.

As mentioned earlier, alternatives to the Quasi-Newton method could be employed in IMS. We leave it as future work to investigate how alternative modified first order methods might fare against the Quasi-Newton approach proposed here. We also leave it as future work to investigate how other stochastic approaches, such as stochastic Quasi-Newton method (Byrd et al., 2016), might fare against the SAA method proposed here.

Exploiting the special structure in this version of the matrix completion problem can lead to surprisingly effective computational gains in model selection. We close by noting that this simple but effective strategy applies in contexts outside of matrix completion and should be considered as an alternative to grid-search methods for automating and streamlining model selection when two or more tuning parameters need to be identified.

## Appendix A Verifying Assumption 3.1 in work linear in the data

Recall that the connected components of a graph can be determined in work that is linear in the number of vertices via a breadth-first-search. The brief description of the algorithm, along with the computational costs, is provided in Algorithm 1

## Appendix B Proofs of Results in Section 3 and Section 5

In this section we give proofs of the propositions and theorems within the paper.

### b.1 Property 3.1

We first recall a key fact about the number of connected components of a graph and the spectrum of its graph Laplacian matrix.

###### Proposition B.1 (Proposition 2 in von Luxburg (2007)).

Let be an undirected graph with non-negative weights. Then the multiplicity

of the eigenvalue 0 of

, the graph Laplacian of , equals the number of connected components in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors

of those components.We are now ready to prove Property 3.1.

###### Proof.

First assume that for some for and . Property B.1 implies that for all . Now invoke the linearity and cyclic permutation properties of the trace function to simplify the expression .

Analogously, , and consequently .

Now assume that . Let denote the set of rank-1 matrices defined by the connected components of and , namely . The set is a basis for the dimensional subspace of matrices span. Let denote the orthogonal complement of and suppose that where and . Consequently, and . Since , we conclude that .

∎

### b.2 Property 3.2

### b.3 Theorem 3.1

It is easier to work with vectorized quantities. Let , namely is the vector obtained by stacking the columns of on top of each other. We observe that and where denotes the Kronecker product. Then the objective in (1.1) can be written as

(B.1) |

We first establish that (B.1) always has a solution. Recall the edge-incidence matrix of the row graph encodes its connectivity and is defined as

(B.2) |

The column edge-incidence matrix is defined similarly. Recall that the Laplacian matrix of a graph can be written in terms of the edge-incidence matrix of the graph. Thus, Laplacian row matrix can be expressed as , and the Laplacian column matrix can be expressed as . With these facts in hand, we can rewrite (B.1) as the following least squares problem.

(B.3) |

where

Recall that a least squares problem always has a solution since its solution is a Euclidean projection onto a closed convex set, namely the column space of the design matrix.

Having established that (B.1) always has a solution, we next characterize its solutions. A vector is a solution to (B.1) if and only if it satisfies the linear system in (4). where is defined in Property 5.1. The linear system in (4) may not be invertible. Since and are positive semidefinite, Laub (2005, Theorem 13.12) guarantees that and are also positive semidefinite; therefore, is the sum of three positive semidefinite matrices, each of which may be rank deficient.

We now state the conditions under which the linear system (4) has a unique solution.

###### Lemma 1.

Assume that . Then is positive definite if and only if