## 1 Introduction

The problem of recovering a high-dimensional matrix

from very few (noisy) observations of its entries is commonly known as matrix completion, whose applications include, for examples, collaborative filtering, computer visions and positioning. From a statistical viewpoint, it is a high-dimensional missing data problem where a high percentage of matrix entries are missing. As in many missing data problems, the underlying missing (sampling/observation) mechanism plays an important role. Most existing work

(e.g., Candès and Recht, 2009; Keshavan et al., 2009; Recht, 2011; Rohde and Tsybakov, 2011; Koltchinskii et al., 2011) adopt a uniform observation mechanism, where each entry has the same marginal probability of being observed. This leads to significant simplifications, and enables the domain to move forward rapidly with various theoretical breakthroughs in the last decade. However, the uniform mechanism is often unrealistic. Recent works (Negahban and Wainwright, 2012; Klopp, 2014; Cai and Zhou, 2016; Cai et al., 2016; Bi et al., 2017; Mao et al., 2018) have been devoted to relaxing such an restrictive assumption by adopting other missing structures. The usage of these settings hinges on strong prior knowledge of the underlying problems. At a high level, many of them utilize some special forms of low-rank structures, e.g., rank-1 structure. In this paper, we aim at recovering the target matrix under a flexible high-dimensional low-rank sampling structure. This is achieved by a weighted empirical risk minimization, with application of inverse probability weighting (IPW) (e.g., Schnabel et al., 2016; Mao et al., 2018) to adjust for the effect of non-uniform missingness.Data arising in many applications of matrix completion, such as recommender systems, usually possesses complex “sampling” structure and its distribution are largely unknown. For example of a movie recommender system, some believe that users tend to rate movies that they prefer or dislike most, while often remain “silent” to other movies. Another example of the complex sampling regime is in the online merchandising, some users may purchase certain items regularly without often rating them, but evaluate products that they rarely buy with a higher chance. Similar to the widely adopted model that ratings are generated from a small number of hidden factors, it is reasonable to believe that the missingness is also governed by a small and possibly different set of hidden factors, which leads to a low-rank modeling of the missing structure.

Inspired by generalized linear models (GLM), we model the probabilities of observation by a high-dimensional low-rank matrix through a known function . That means, on the entry-wise level, we have . In GLM, the linear predictor is further modeled as a linear function of observed covariates. However, to reflect difficulties to attain (appropriate and adequate) covariate information and the complexity in the modeling of in some situations of the matrix completion, the predictor matrix is assumed completely hidden in this study. Despite being hidden, as demonstrated in this work, the low-rankness of together with the high dimensionality allows both identification and consistent estimation of , which facilitates IPW-based matrix completion. Motivated by the nature of matrix completions, we propose a novel parametrization where satisfies . Our proposal extends the work of Davenport et al. (2014), which aims to solve a binary matrix completion problem and pursues a different goal. Compared with Davenport et al. (2014), the proposed method does not regularize the estimation of , but only regularize the nuclear norm of the estimation of . This modification requires different algorithmic treatment. The underlying reason for such modification is to avoid bias caused by the nuclear-norm penalty.

There are three fundamental challenges that set our work aside from the existing works of matrix completion and the IPW-based estimator: (i) the high-dimensional nature of the sampling mechanism; (ii) the diminishing lower bound of the observation probabilities (as go to infinity) in common settings of matrix completion, and added issue to the instability of IPW; (iii) the effect of estimation error in IPW to the matrix completion procedure. Challenges (i) and (ii) are unique to our problem, and not found in the literature of missing data. The work related to Challenge (iii) is sparse in the literature of matrix completion. One notable example is Mao et al. (2018)

, which focuses on a low-dimensional parametric modeling of IPW with observable covariates.

We develop non-asymptotic upper bounds of the mean squared errors (MSE) for the proposed estimators of the observation probabilities and the target matrix. The theoretical analysis shows that the IPW-based matrix completion with the underlying probability offers a better upper bound than matrix completion that ignore the missing mechanism all together as in Klopp (2014). But, the IPW-based completed matrix using the aforementioned low-rank estimation of endures a slower convergence rate due to the high-dimensionality of and low levels of observation probabilities. Indeed, in many matrix completion problems, often goes to zero when . Not only does this inflate the estimation error of but also leads to unstable inverse probability weights, which reduces the convergence rate. To circumvent this issue, we propose to re-estimate by constraining the magnitude of its entries to a smaller threshold. Our result shows that the proposed constrained IPW estimator achieve the optimal rate (up to a logarithmic factor). We also compare the IPW-based completion based on the proposed constrained estimation, with the completion based on direct weight trimming (or winsorization), a known practice in the conventional missing value literature (e.g., Rubin, 2001; Kang and Schafer, 2007; Schafer and Kang, 2008) and show that the constrained estimation has both theoretical and empirical advantages.

The rest of the paper is organized as follows. The proposed model is constructed in Section 2. The corresponding estimation and computational algorithm for both the observation probabilities and the target matrix are developed in Section 3 and 4 separately, while the non-asymptotic upper bounds are given in Section 5. Numerical performances of the proposed method are illustrated in a simulation study in Section 6, and an application to a Yahoo! music rating dataset in Section 7. Concluding remarks are given in Section 8, while some technical details are delegated to a supplementary material.

## 2 Model and Method

### 2.1 General Setup

Let be an unknown high-dimensional matrix of interest, and be a contaminated version of according to the following additive noise model:

(2.1) |

where

are independently distributed random errors with zero mean and finite variance. In the setting of matrix completion, only a portion of

is observed. For the -th entry, define the sampling indicator if is observed, and otherwise, and assume are independent of .As for the sampling mechanism, we adopt a Bernoulli model where

are independent Bernoulli random variables with observation probabilities

, collectively denoted by a matrix . Similar to generalized linear models (GLM), the observation probabilities can be expressed in terms of an unknown matrix and a pre-specified monotone and differentiable function , i.e., for all . The matrix plays the same role as a linear predictor in GLM, while the function is an inverse link function. Two popular choices ofare inverse logit function

(logistic model) and the standard normal cumulative distribution function (probit model).

### 2.2 Low-rank Modeling of and

The above setup is general. Without additional assumption, it is virtually impossible to recover the hidden feature matrix and also the target matrix . A common and powerful assumption is that is a low-rank matrix, i.e., . Take the Yahoo! Webscope data set (to be analyzed in Section 7) as an example. This data set contains a partially observed matrix of ratings from 15,400 users to 1000 songs, and the goal is to complete the rating matrix. The low-rank assumption reflects the belief that users’ ratings are generated by a small number of factors, representing several standard preference profiles for songs. This viewpoint has been proven useful in the modeling of recommender systems (e.g., Candès and Plan, 2010; Cai et al., 2010).

The same idea could be adapted to the missing pattern, despite that the factors that induce the missingness may be different from those that generate the ratings. To this end, we assume is also low-rank, and in addition, it can be decomposed as

(2.2) |

with being a

-vector of ones, and

. Here is the mean of , i.e, . Although this parametrization holds for any matrix, this allows different treatments in the estimations of and . See Section 3 for details. Further, the low-rank assumption of can be translated to the low-rank assumption of .We note that the rank of is not the same as the rank of due to the nonlinear transformation . Generally, the low-rank structure of implies a specific low-dimensional nonlinear structure of . For a common high missingness scenario, most entries of are significantly negative, at where many common choices of the inverse link function can be well-approximated by a linear function. So our modeling can be regarded as a low-rank modeling of in certain sense.

There are a few related but more specialized models. Srebro and Salakhutdinov (2010) and Negahban and Wainwright (2012) utilize an independent row and column sampling mechanism, leading to a rank-1 structure for . Cai et al. (2016) consider a block structure for and hence , which can be regarded as a special case of the low-rank modeling. Mao et al. (2018) considered the case when the missingness is dependent on observable covariates, and adopted a low-rank modeling with a known row space of . The proposal in this paper is for the situation when the missingness is dependent on some hidden factors, which reflects situations when obvious covariates are unknown or not available.

### 2.3 IPW-based Matrix Completion: Motivations and Challenges

Write the Hadamard product as and the Frobenius norm as . To recover the target matrix , many existing matrix completion techniques assume uniform missing structure and hence utilize an unweighted/uniform empirical risk function (e.g., Candès and Plan, 2010; Koltchinskii et al., 2011; Mazumder et al., 2010)

, which is an unbiased estimator of the risk

(up to a multiplicative constant) under uniform missingness. The work of Klopp (2014) is a notable exception that considers the use of under non-uniform missingness.For any matrix , we denote and . Under general missingness (uniform or non-uniform), one can show that, for any ,

Clearly, uniquely minimizes the risk . If were known, an unbiased estimator of would be

(2.3) |

which involves IPW, and motivates the use of IPW in matrix completion as in Mao et al. (2018). In addition, our theoretical analysis shows that the nuclear-norm-regularized empirical risk estimator (to be defined in details later) based on (assuming the use of true observation probabilities) improves upon existing error upper bound of corresponding estimator based on achieved by Klopp (2014). See Section 5.3 for details. However, the inverse probability weights are often unknown and have to be estimated in practice. The proposed estimation of will be addressed carefully. Now we dicuss some challenges with the estimation particularly related to the matrix completion settings.

Despite the popularity of IPW in missing data literature, it is known to often produce unstable estimation due to occurrences of small probabilities (e.g., Rubin, 2001; Kang and Schafer, 2007; Schafer and Kang, 2008). This problematic scenario is indeed common for matrix completion problems where one intends to recover a target matrix from very few observations. Theoretically, a reasonable setup should allow some to go to zero as , leading to diverging weights and a non-standard setup of IPW. Due to these observations, a careful construction of the estimation procedure is required.

For uniform sampling ( for some probability ), one only has to worry about a small common probability (or that diminishes in an asymptotic sense.) Although small increases the difficulty of estimation, changes only up to a multiplicative constant. However, for non-uniform setting, it is not as straightforward due to the heterogeneity among . To demonstrate the issue, we now briefly look at the Yahoo! Webscope dataset described in Section 7. One sign of the strong heterogeneity in is a large , where and . We found that the corresponding ratio of estimated probabilities based on the rank-1 structure of Negahban and Wainwright (2012) was , and that based on our proposed method (without re-estimation, to be described below) was . These huge ratios are signs of strong heterogeneity in the probability of the observation. From our analysis, strong heterogeneity could jeopardize the convergence rate of our estimator, and so will be addressed rigorously in our framework.

## 3 Estimation of

### 3.1 Regularized Maximum Likelihood Estimation

We develop the estimation of based upon the framework of regularized maximum likelihood. Given the inverse of link function , the log-likelihood function with respect to the indicator matrix is given by

for any , where is the indicator of an event . Due to the low-rank assumption of , one natural candidate of estimators is the maximizer of the regularized log-likelihood , where represents the nuclear norm and is a tuning parameter. It is also common to enforce an additional max-norm constraint for some in the maximization (e.g., Davenport et al., 2014). Note that the nuclear norm penalty flavors , corresponding to that for all . Nevertheless, this does not align well with common settings of matrix completion under which the average probability of observations is small, and hence results in a large bias. In view of this, we instead adopt the following estimator of :

(3.1) |

Note that the mean of the linear predictor is not penalized. With , the corresponding estimator of is defined as . The constraint ensures the identifiability of and . Apparently, the constraints in are analogous to , where , but on the parameters and respectively.

Davenport et al. (2014) considered a regularized maximum likelihood approach for a binary matrix completion problem. But their goal was different from ours, as they aimed to recover a binary rating matrix in lieu of the missing structure, they considered a regularization on (instead of ) via . In addition, this constraint required a prior knowledge of the true rank of , which is not required in our proposed method (3.1). As for the scaling parameter , Davenport et al. (2014) considered an independent of the dimensions and to restrict the “spikiness” of . As explained earlier, in our framework, should be allowed to go to zero as . To this end, we allow and to depend on the dimensions and . See more details in Section 5.

### 3.2 Computational algorithm and tuning parameter selection

To solve the optimization (3.1), we begin with the observation that is a smooth concave function, which allows the usage of an iterative algorithm called accelerated proximal gradient algorithm (Beck and Teboulle, 2009). Given a pair from a previous iteration, a quadratic approximation of the objective function is formed:

where is an algorithmic parameter determining the step size of the proximal gradient algorithm, and is chosen by a backtracking method (Beck and Teboulle, 2009). Here for any matrices and of same dimensions.

In this iterative algorithm, a successive update of can be obtained by

where the optimization with respect to and can be performed separately. For , one can derive a closed-form update

As for , we need to perform the minimization

which is equivalent to

(3.2) |

We apply a three-block extension of the alternative direction method of multipliers (ADMM) (Chen et al., 2016) to an equivalent form of (3.2):

(3.3) |

Write . The augmented Lagrangian for (3.3) is given by

where is an algorithmic parameter and, if the constraint holds and otherwise. The detailed algorithm to solve (3.3) is summarized in Algorithm 1. It is noted that, in general, the multi-block ADMM may fail to converge for some (Chen et al., 2016). In those cases, an appropriate selection of is crucial. However, we are able to show that the form of our ADMM algorithm belongs to a special class (Chen et al., 2016) in which convergence is guaranteed for any . Therefore, we simply set . We summarize the corresponding convergence result in the following theorem whose proof is provided in the supplementary material.

Notice that the ADMM algorithm is nested within the proximal gradient algorithm. But, from our practical experiences, both the number of inner iterations (ADMM) and outer iterations (proximal gradient) are small, usually less than twenty in our numerical experiments.

We now discuss the choice of tuning parameters. For and , they can be chosen according to prior knowledge of the problem setup, if available. In practice when prior knowledge is not available, one can choose large values for these parameters. Once these parameters are large enough, our method is not sensitive to their specific values. A more principled way to tune and is a challenging problem and beyond the scope of this work. As for

, we adopt Akaike information criterion (AIC) where the degree of freedom is approximated by

.is the singular value soft-thresholding operator defined as

, is the singular value decomposition (SVD) of a matrix

.### 3.3 Constrained estimation

To use of (2.3), a naive idea is to obtain , where is an operator defined by for any , and then replace by . However, this direct implementation is not robust to extremely small probabilities of observation, and our theoretical analysis shows that this could lead to slower convergence rate of the estimator of . In the literature of missing data, a simple solution to robustify is weight truncation (e.g., Potter, 1990; Scharfstein et al., 1999), i.e., winsorizing small probabilities.

In the estimation of defined in (3.1), assuming , a large has an adverse effect on the estimation. In the setting of diverging (due to diminishing ), the convergence rate of becomes slower and the estimator obtained after direct winsorization will also be affected. That is, even though the extreme probabilities could be controlled by winsorizing, the unchanged entries of (in the procedure of winsorizing) may already suffer from a slower rate of convergence. This results in a larger estimation error bound under certain settings of missingness, which will be discussed theoretically in Section 5.

A seemingly better strategy is to impose a tighter constraint directly in the minimization problem (3.1). That is to adopt the constraint where . Theoretically, one can better control the errors on those entries of magnitude smaller than . However, the mean-zero constraint of no longer makes sense as the constraint may have shifted the mean.

We propose a re-estimation of with a different constraint level :

(3.4) |

Note that we only re-compute but not , which allows us to drop the mean-zero constraint. Thus we have . The corresponding algorithm for optimization (3.4) can be derived similarly as in Davenport et al. (2014), and is provided in Section LABEL:sec:appest of the supplementary material. Similar to many IPW method, the tuning of is a challenging problem. In what follows, we write and .

## 4 Estimation of

Now, we come back to (2.3) and replace by to obtain a modified empirical risk:

(4.1) |

where . Since is a high-dimensional parameter, a direct minimization of would often result in over-fitting. To circumvent this issue, we consider a regularized version:

(4.2) |

where is a regularization parameter. Again, the nuclear norm regularization encourages low-rank solution. Based on (4.2), our estimator of is defined as

(4.3) |

where is an upper bound on . The above contains as special cases (i) the matrix completion , with unconstrained probability estimator , by setting and (ii) the estimator , with constrained probability estimator , when .

We use an accelerated proximal gradient algorithm (Beck and Teboulle, 2009) to solve (4.3). For the choice of tuning parameter in (4.3), we adopt a -fold cross-validation to select the remaining tuning parameters. Due to the non-uniform missing mechanism, we use a weighted version of validation errors. The specific details are shown in Algorithm 2.

## 5 Theoretical Properties

Let , and be the spectral norm, the maximum norm and -norm of a matrix respectively. We use the symbol to represent the asymptotic equivalence in order, i.e., is equivalent to and . We define the average squared distance between two matrices as . The average squared errors of and are then defined as and respectively. To measure the similarity between two probability matrices, we adopt the Hellinger distance defined as follows. For any two matrices , where for . In the literature of matrix completion, most discussions related to optimal convergence rate are only accurate up to certain polynomial orders of . For convenience, we use the notation to represent some polynomial of .

### 5.1 Probabilities of observation

In this subsection, we investigate the asymptotic properties of and defined in Section 3. To this end, we introduce the following conditions on the missing structure.

C1. The indicators are mutually independent, and independent of . For and ,

follows a Bernoulli distribution with probability of success

. Furthermore, is monotonic increasing and differentiable.C2. The hidden feature matrix where , and . Here and are allowed to depend on the dimensions and . This also implies that there exists a lower bound (allowed to depend on ) such that .

For clear presentation, we assume and choose the logit function as the inverse link function in the rest of Section 5, while corresponding results under general settings of , and are delegated to Section S1.2 in the supplementary material. We first establish the convergence results for , and , respectively. To simplify notations, let and .

###### Theorem 2.

Suppose Conditions C1-C2 hold, and . Consider where is the solution to (3.1). There exist positive constants such that for , we have with probability at least ,

(5.1) |

where

The three upper bounds in (5.1) all consist of a trivial bound and a more dedicated bound . The trivial upper bounds , and can be easily derived from the constraint set . For extreme settings of increasing , the more dedicated bound is diverging and the trivial bounds may provide better control. For example, under the extreme scenario where the target matrix is still recoverable (Candès and Plan, 2010; Koltchinskii et al., 2011; Mao et al., 2018), we have . Then and for which implies the trivial bounds are of the smallest order compared with . Thus it is necessary that we keep these trivial upper bounds in the right hand sides of (5.1). The term can be controlled by either the nuclear norm and the rank of . For a range of non-extreme scenarios, i.e., or , the second term in achieves the smallest order once .

We now consider the constrained estimation for , and . For any matrix , define the winsorizing operator by where

(5.2) |

Write and , and and respectively. It is noted that serves as a “bridge” between the underlying and the empirical . Write

Comments

There are no comments yet.