1 Introduction
Matrix completion (MC) has recently gained substantial popularity among researchers and practitioners due to its wide applications, as well as various related theoretical advances (Candès and Recht, 2009; Candès and Plan, 2010; Koltchinskii et al., 2011; Klopp, 2014). Perhaps the most well-known example of an MC problem is the Netflix prize problem (Bennett and Lanning, 2007), the goal of which is to predict missing entries of a partially observed matrix of movie ratings. Two challenges common to MC problems are the high dimensionality of the matrix and the huge proportion of missing entries. For instance, the Netflix data contain less than 1% of the entries of a matrix with roughly 480,000 rows (customers) and 18,000 columns (movies). With technological advances in data collection, we confront increasingly large matrices nowadays.
Without any structural assumption on the target matrix, MC is well known to be an ill-posed problem. A popular and often well-justified assumption is low-rankness, which however leads to a challenging and non-convex rank minimization problem (Srebro et al., 2005). The seminal works of Candès and Recht (2009); Candès and Tao (2010); Gross (2011) showed that, when the entries are observed without noise, perfect recovery of a low-rank matrix can be achieved by convex optimization with a near-minimal order of sample size, with high probability. As for the noisy setting, some earlier works
(Candès and Plan, 2010; Keshavan et al., 2010; Chen et al., 2019b) focused on arbitrary, not necessarily random, noise. In general, the arbitrariness may prevent asymptotic recovery even in a probabilistic sense. Recently, a significant number of works (e.g., Bach, 2008; Koltchinskii et al., 2011; Negahban and Wainwright, 2011; Rohde and Tsybakov, 2011; Negahban and Wainwright, 2012; Klopp, 2014; Cai and Zhou, 2016; Fan et al., 2019; Xia and Yuan, 2019) targeted more amenable random error models, under which (near-)optimal estimators have been proposed. Among these works, the trace regression model is one of the most popular due to its regression formulation. Assume independent pairs $(X_i, Y_i)$, for $i = 1, \dots, n$, are observed, where the $X_i$'s are random design matrices of dimension $d_1 \times d_2$ and the $Y_i$'s are response variables in $\mathbb{R}$. The trace regression model assumes
$$Y_i = \operatorname{tr}(X_i^\top A^\star) + \epsilon_i, \qquad (1.1)$$
where $\operatorname{tr}(B)$ denotes the trace of a matrix $B$, and $A^\star \in \mathbb{R}^{d_1 \times d_2}$ is an unknown target matrix. Moreover, the noise variables $\epsilon_i$ are i.i.d. and independent of the design matrices. In the MC setup, the design matrices $X_i$ are assumed to lie in the set of canonical bases
$$\mathcal{X} = \big\{ e_k(d_1)\, e_l(d_2)^\top : k = 1, \dots, d_1;\ l = 1, \dots, d_2 \big\}, \qquad (1.2)$$
where $e_k(d_1)$ is the $k$-th unit vector in $\mathbb{R}^{d_1}$ and $e_l(d_2)$ is the $l$-th unit vector in $\mathbb{R}^{d_2}$. Most methods then apply a regularized empirical risk minimization (ERM) framework with a quadratic loss. It is well known that the quadratic loss is most suitable for light-tailed (sub-Gaussian) errors, and it leads to non-robust estimations. In the era of big data, a thorough and accurate data cleaning step, as part of data preprocessing, becomes virtually impossible. In this regard, one could argue that robust estimations are more desirable, due to their reliable performance even in the presence of outliers and violations of model assumptions. While robust statistics is a well-studied area with a rich history
(Davies, 1993; Huber, 2011), many robust methods were developed for small data by today's standards, and are deemed too computationally intensive for big data or complex models. This work can be treated as part of the general effort to broaden the applicability of robust methods to modern data problems.
1.1 Related Work
Many existing robust MC methods adopt regularized ERM and assume observations are obtained from a low-rank-plus-sparse structure $L + S + E$, where the low-rank matrix $L$ is the target uncontaminated component; the sparse matrix $S$ models the gross corruptions (outliers) located at a small proportion of entries; and $E$ is an optional (dense) noise component. As gross corruptions are already taken into account, many methods with the low-rank-plus-sparse structure are based on a quadratic loss. Chandrasekaran et al. (2011); Candès et al. (2011); Chen et al. (2013); Li (2013) considered the noiseless setting (i.e., no $E$) with an elementwise sparse $S$. Chen et al. (2011) studied the noiseless model with a columnwise sparse $S$. Under the model with elementwise sparse $S$, Wong and Lee (2017) looked into the setting of arbitrary (not necessarily random) noise $E$, while Klopp et al. (2017) and Chen et al. (2020) studied random (sub-Gaussian) noise models for $E$. In particular, it was shown in Proposition 3 of Wong and Lee (2017) that, in the regularized ERM framework, a quadratic loss with an elementwise penalty on the sparse component is equivalent to a direct application of a Huber loss without the sparse component. Roughly speaking, this class of robust methods, based on the low-rank-plus-sparse structure, can be understood as regularized ERMs with the Huber loss.
Another class of robust MC methods is based on the absolute deviation loss, formally defined in (2.1). The minimizer of the corresponding risk has an interpretation as a median (see Section 2.1), and so the regularized ERM framework that applies the absolute deviation loss is coined median matrix completion (Elsener and van de Geer, 2018; Alquier et al., 2019). In the trace regression model, if the medians of the noise variables are zero, the median MC estimator can be treated as a robust estimator of $A^\star$. Although the median is one of the most commonly used robust statistics, median MC methods were not studied until recently. Elsener and van de Geer (2018) derived the asymptotic behavior of the trace-norm regularized estimators under both the absolute deviation loss and the Huber loss. Their convergence rates match the rate obtained in Koltchinskii et al. (2011) under certain conditions. More complete asymptotic results were developed in Alquier et al. (2019), which derives the minimax rates of convergence for general Lipschitz loss functions, including the absolute deviation loss.
To the best of our knowledge, the only existing computational algorithm for median MC in the literature is that of Alquier et al. (2019), an alternating direction method of multipliers (ADMM) algorithm developed for quantile MC with median MC as a special case. However, this algorithm is slow and not scalable to large matrices due to the non-smooth nature of both the absolute deviation loss and the regularization term.
Despite the computational challenges, the absolute deviation loss has a few appealing properties compared to the Huber loss. First, the absolute deviation loss is tuning-free, while the Huber loss has a tuning parameter, which is equivalent to the tuning parameter of the entrywise penalty in the low-rank-plus-sparse model. Second, the absolute deviation loss is generally more robust than the Huber loss. Third, the minimizer of the expected absolute deviation loss is naturally tied to the median, and is generally more interpretable than the minimizer of the expected Huber loss (which may vary with its tuning parameter).
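As a small illustration of the tuning-free and median-recovering nature of the absolute deviation loss (a sketch we add for concreteness, not part of the original text; the grid search is purely illustrative):

```python
import numpy as np

def abs_dev(r):
    # Absolute deviation loss: no tuning parameter.
    return np.abs(r)

def huber(r, delta=1.0):
    # Huber loss: requires a threshold parameter delta.
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

# The minimizer of the empirical absolute-deviation risk is the sample
# median, unaffected by the gross outlier below.
y = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
grid = np.linspace(-5.0, 10.0, 3001)
risks = np.abs(y[None, :] - grid[:, None]).mean(axis=1)
minimizer = grid[np.argmin(risks)]    # equals the median of y, i.e. 2.0
```

Note how the outlier 100.0 shifts the mean substantially but leaves the absolute-deviation minimizer at the median.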
1.2 Our Goal and Contributions
Our goal is to develop a robust and scalable estimator for median MC in large-scale problems. The proposed estimator approximately solves the regularized ERM with the non-differentiable absolute deviation loss. It is obtained through two major stages: (1) a fast and simple initial estimation via embarrassingly parallel computing, and (2) a refinement stage based on pseudo data. As pointed out earlier (with more details in Section 2.2), the existing computational strategy (Alquier et al., 2019) does not scale well with the dimensions of the matrix. Inspired by Mackey et al. (2015), a simple strategy is to divide the target matrix into small submatrices, perform median MC on every submatrix in an embarrassingly parallel fashion, and then naively concatenate all estimates of these submatrices to form an initial estimate of the target matrix. Most computations are then performed on much smaller submatrices, and hence this computational strategy is much more scalable. However, since low-rankness is generally a global (whole-matrix) structure, the lack of communication between the computations of different submatrices leads to suboptimal estimation (Mackey et al., 2015). The key innovation of this paper is a fast refinement stage which, via the idea of pseudo data, transforms the regularized ERM with absolute deviation loss into a regularized ERM with quadratic loss, for which many fast algorithms exist. Motivated by Chen et al. (2019a), we develop the pseudo data based on a Newton–Raphson iteration in expectation. The construction of the pseudo data requires only a rough initial estimate (see Condition (C5) in Section 3), which is obtained in the first stage. Compared to Huber-loss-based methods (the sparse-plus-low-rank model), the underlying absolute deviation loss is non-differentiable, leading to computational difficulty for large-scale problems.
The proposed strategy involves a novel refinement stage to efficiently combine and improve the embarrassingly parallel submatrix estimations.
We are able to theoretically show that this refinement stage can improve the convergence rate of the suboptimal initial estimator to nearoptimal order, as good as the computationally expensive median MC estimator of Alquier et al. (2019). To the best of our knowledge, this theoretical guarantee for distributed computing is the first of its kind in the literature of matrix completion.
2 Model and Algorithms
2.1 Regularized Least Absolute Deviation Estimator
Let $A^\star \in \mathbb{R}^{d_1 \times d_2}$ be an unknown high-dimensional matrix. Assume the pairs of observations $(X_i, Y_i)$, $i = 1, \dots, n$, satisfy the trace regression model (1.1) with noise $\epsilon_i$. The design matrices are assumed to be i.i.d. random matrices taking values in the canonical bases (1.2). Let $\pi_{kl}$ be the probability of observing (a noisy realization of) the $(k, l)$-th entry of $A^\star$, and collect these probabilities in $\pi = (\pi_{kl})$. Instead of the uniform sampling scheme where all $\pi_{kl}$ are equal (Koltchinskii et al., 2011; Rohde and Tsybakov, 2011; Elsener and van de Geer, 2018), our setup allows sampling probabilities to differ across entries, as in Klopp (2014); Lafond (2015); Cai and Zhou (2016); Alquier et al. (2019). See Condition (C1) for more details. Overall,
$(X_i, Y_i, \epsilon_i)$, $i = 1, \dots, n$, are i.i.d. tuples of random variables. For notational simplicity, we let $(X, Y, \epsilon)$ be a generic independent tuple of random variables with the same distribution as $(X_1, Y_1, \epsilon_1)$. Without additional specification, the noise variable is not identifiable. For example, one can subtract a constant from all entries of $A^\star$ and add this constant to the noise. To identify the noise, we assume its median is zero, which naturally leads to an interpretation of $A^\star$ as a median: the entry $A^\star_{kl}$ is the median of $Y$ given $X = e_k(d_1) e_l(d_2)^\top$. If the noise distribution is symmetric and light-tailed (so that the expectation exists), then $\mathbb{E}(\epsilon) = 0$ and $A^\star$ is also the mean matrix, which aligns with the target of common MC techniques (Elsener and van de Geer, 2018). Let $f$ be the probability density function of the noise. For the proposed method, the required condition on $f$ is specified in Condition (C3) of Section 3, which is fairly mild and is satisfied by many heavy-tailed distributions whose expectation may not exist. Define a hypothesis class $\mathcal{A} \subseteq \mathbb{R}^{d_1 \times d_2}$ of candidate matrices such that $A^\star \in \mathcal{A}$. In this paper, we use the absolute deviation loss instead of the common quadratic loss (e.g., Candès and Plan, 2010; Koltchinskii et al., 2011; Klopp, 2014). According to Section 4 of the Supplementary Material of Elsener and van de Geer (2018), $A^\star$ is also characterized as the minimizer of the population risk:
$$A^\star = \operatorname*{arg\,min}_{A \in \mathcal{A}} \mathbb{E}\,\big|Y - \operatorname{tr}(X^\top A)\big|. \qquad (2.1)$$
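To make the observation scheme concrete, the following minimal NumPy sketch simulates data from the trace regression model (1.1) under the canonical-basis design (1.2). The dimensions, rank, and noise distribution are illustrative placeholders, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 50, 40, 2000

# Hypothetical rank-3 target matrix A* (product of Gaussian factors).
A_star = rng.normal(size=(d1, 3)) @ rng.normal(size=(3, d2))

# Under the canonical-basis design (1.2), X_i = e_k(d1) e_l(d2)^T, so
# tr(X_i^T A*) simply picks out the (k, l) entry of A*.
rows = rng.integers(0, d1, size=n)
cols = rng.integers(0, d2, size=n)
eps = rng.standard_cauchy(size=n)   # heavy-tailed noise with median 0
Y = A_star[rows, cols] + eps        # responses from model (1.1)

# Sanity check of the trace identity for one draw.
k, l = rows[0], cols[0]
X = np.outer(np.eye(d1)[k], np.eye(d2)[l])
```

The last two lines verify that $\operatorname{tr}(X^\top A^\star) = A^\star_{kl}$ for a canonical-basis design matrix, which is the identity the sketch relies on.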
To encourage a low-rank solution, one natural candidate is the following regularized empirical risk estimator (Elsener and van de Geer, 2018; Alquier et al., 2019):
$$\widehat{A} = \operatorname*{arg\,min}_{A \in \mathcal{A}}\left\{\frac{1}{n}\sum_{i=1}^n \big|Y_i - \operatorname{tr}(X_i^\top A)\big| + \lambda \|A\|_*\right\}, \qquad (2.2)$$
where $\|\cdot\|_*$ denotes the nuclear norm and $\lambda > 0$ is a tuning parameter. The nuclear norm is a convex relaxation of the rank, which facilitates both the optimization and the analysis of statistical properties (Candès and Recht, 2009).
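Under the canonical-basis design, the objective in (2.2) is cheap to evaluate, since $\operatorname{tr}(X_i^\top A)$ reduces to a single entry of $A$. A minimal sketch (the helper name is ours, assuming NumPy):

```python
import numpy as np

def objective(A, rows, cols, Y, lam):
    """Objective of (2.2): empirical absolute-deviation risk plus a
    nuclear-norm penalty. Under the canonical-basis design (1.2),
    tr(X_i^T A) = A[rows[i], cols[i]]."""
    fit = np.abs(Y - A[rows, cols]).mean()
    return fit + lam * np.linalg.norm(A, ord="nuc")
```

For example, at $A = 0$ with a single observation $Y_1 = 3$ and $\lambda = 0$, the objective equals 3 (pure absolute-deviation risk, zero penalty).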
Due to the non-differentiability of the absolute deviation loss, the objective function in (2.2) is the sum of two non-differentiable terms, rendering common computational strategies based on the proximal gradient method (e.g., Mazumder et al., 2010; Wong and Lee, 2017) inapplicable. To the best of our knowledge, there is only one existing computational algorithm for (2.2), based on a direct application of the alternating direction method of multipliers (ADMM) (Alquier et al., 2019). However, this algorithm is slow and not scalable in practice when the sample size and the matrix dimensions are large, possibly due to the non-differentiable nature of the loss.
We aim to derive a computationally efficient method for estimating the median matrix in largescale MC problems. More specifically, the proposed method consists of two stages: (1) an initial estimation via distributed computing (Section 2.2) and (2) a refinement stage to achieve nearoptimal estimation (Section 2.3).
2.2 Distributed Initial Estimator
As in many large-scale problems, it is natural to harness distributed computing to overcome computational barriers. Motivated by Mackey et al. (2015), we divide the underlying matrix into several submatrices, estimate each submatrix separately in an embarrassingly parallel fashion, and then combine the results to form a computationally efficient (initial) estimator of $A^\star$.
For notational convenience, suppose there exist integers $m_1$, $m_2$, $s_1$ and $s_2$ such that $d_1 = m_1 s_1$ and $d_2 = m_2 s_2$. (Otherwise, the following description can easily be extended, which leads to slightly different sizes for several submatrices.) We divide the row indices evenly into $m_1$ subsets, each containing $s_1$ indices, and similarly divide the column indices evenly into $m_2$ subsets. We then obtain $m_1 m_2$ submatrices, denoted by $A^{\star(j,k)}$ for $j = 1, \dots, m_1$ and $k = 1, \dots, m_2$. See Figure 1 for a pictorial illustration. Let $\Omega^{(j,k)}$ be the index set of the observed entries within the $(j,k)$-th submatrix and $n_{jk}$ the corresponding number of observed entries. Next, we apply the ADMM algorithm of Alquier et al. (2019) to each submatrix and obtain the corresponding median estimator:
$$\widehat{A}^{(j,k)} = \operatorname*{arg\,min}_{A}\left\{\frac{1}{n_{jk}}\sum_{i \in \Omega^{(j,k)}} \big|Y_i - \operatorname{tr}\big(X_i^{(j,k)\top} A\big)\big| + \lambda_{jk}\|A\|_*\right\}, \qquad (2.3)$$
where $X_i^{(j,k)}$ is the corresponding sub-design matrix of dimensions $s_1 \times s_2$ and $\lambda_{jk} > 0$ is a tuning parameter. Note that the most computationally intensive subroutine in the ADMM algorithm of Alquier et al. (2019) is (repeated applications of) the SVD. For submatrices of dimension $s_1 \times s_2$, the computational complexity of a single SVD reduces from $O(d_1 d_2 \min(d_1, d_2))$ to $O(s_1 s_2 \min(s_1, s_2))$.
After obtaining $\widehat{A}^{(j,k)}$ for all $(j,k)$, we put these estimators of the submatrices back together according to their original positions in the target matrix (see Figure 1) to form an initial estimator $\widehat{A}_{\mathrm{init}}$.
This computational strategy is conceptually simple and easily implementable. However, despite the low-rank estimation of each submatrix, combining them directly cannot guarantee the low-rankness of $\widehat{A}_{\mathrm{init}}$. Also, the convergence rate of $\widehat{A}_{\mathrm{init}}$ is not guaranteed to be (near-)optimal when $s_1$ and $s_2$ are of smaller order than $d_1$ and $d_2$, respectively; see Theorem 1(i) in Section 3. However, for computational benefits, it is desirable to choose small $s_1$ and $s_2$. In the next section, we leverage this initial estimator and formulate a refinement stage.
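The partition-and-concatenate scheme above can be sketched as follows (a toy illustration we add for clarity; the per-block median MC solver is elided, and the helper names are ours):

```python
import numpy as np

def split_blocks(M, m1, m2):
    """Partition a d1 x d2 matrix into an m1 x m2 grid of submatrices
    (assumes m1 divides d1 and m2 divides d2, as in the text)."""
    return [np.hsplit(band, m2) for band in np.vsplit(M, m1)]

def merge_blocks(blocks):
    """Concatenate submatrix estimates back into their original positions."""
    return np.vstack([np.hstack(band) for band in blocks])

M = np.arange(24.0).reshape(4, 6)
blocks = split_blocks(M, 2, 3)    # six 2 x 2 submatrices
# (each block would be completed in parallel by the median MC solver here)
M_init = merge_blocks(blocks)     # reassembled in original positions
```

Splitting and merging are exact inverses, so all estimation error comes from the per-block solvers, not the reassembly.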
2.3 The Idea of Refinement
The proposed refinement stage is based on a form of pseudo data, which leverages the idea of the Newton–Raphson iteration. To describe this idea, we start from the stochastic optimization problem (2.1). Write the loss function as $\ell(A; X, Y) = |Y - \operatorname{tr}(X^\top A)|$. To solve this stochastic optimization problem, the population version of the Newton–Raphson iteration takes the following form
$$\operatorname{vec}(A_1) = \operatorname{vec}(A_0) - H^{-1}\,\mathbb{E}\big[\partial \ell(A_0; X, Y)\big], \qquad (2.4)$$
where $(X, Y)$ is defined in Section 2.1 to be independent of the data; $\operatorname{vec}(A)$ is the vectorization of the matrix $A$; $A_0$ is an initial estimator (to be specified below); and $\partial \ell$ is the subgradient of $\ell$ with respect to $\operatorname{vec}(A)$. One can show that the population Hessian matrix takes the form $H = 2 f(0) \operatorname{diag}(\pi)$, where we recall that $\pi$ is the vector of observation probabilities, and $\operatorname{diag}(\cdot)$ transforms a vector into a diagonal matrix whose diagonal is that vector. Also, it can be shown that $\mathbb{E}[\partial \ell(A_0; X, Y)] = -\mathbb{E}\big[\operatorname{sign}\big(Y - \operatorname{tr}(X^\top A_0)\big)\operatorname{vec}(X)\big]$. Recall that $f$ is the density function of the noise $\epsilon$.
By using $H = 2 f(0)\operatorname{diag}(\pi)$ in (2.4), we obtain the following approximation. When the initial estimator $A_0$ is close to the minimizer $A^\star$,
$$\operatorname{vec}(A_1) \approx \operatorname{vec}(A_0) + \frac{1}{2 f(0)}\,\operatorname{diag}(\pi)^{-1}\,\mathbb{E}\big[\operatorname{sign}\big(Y - \operatorname{tr}(X^\top A_0)\big)\operatorname{vec}(X)\big] = \operatorname*{arg\,min}_{a}\,\mathbb{E}\big[\big(\tilde{Y} - \operatorname{vec}(X)^\top a\big)^2\big], \qquad (2.5)$$
where we define the theoretical pseudo data
$$\tilde{Y} = \operatorname{tr}(X^\top A_0) + \frac{\operatorname{sign}\big(Y - \operatorname{tr}(X^\top A_0)\big)}{2 f(0)}.$$
(Here $\mathbf{1}$ denotes the vector of dimension $d_1 d_2$ with all elements equal to 1.) Clearly, the right-hand side of (2.5) is the vectorization of the solution to $\min_{A} \mathbb{E}\big(\tilde{Y} - \operatorname{tr}(X^\top A)\big)^2$. From this heuristic argument, we can approximate the population Newton update by a least squares solution based on the pseudo data $\tilde{Y}$, when we start from an $A_0$ close enough to $A^\star$. Without the knowledge of $f(0)$, the theoretical pseudo data cannot be used; however, $f(0)$ can be easily estimated by the kernel density estimator
$$\hat{f}(0) = \frac{1}{n h}\sum_{i=1}^n H\!\left(\frac{Y_i - \operatorname{tr}(X_i^\top A_0)}{h}\right),$$
where $H$ is a kernel function which satisfies Condition (C4) and $h > 0$ is the bandwidth. For each $i$, we define the actual pseudo data used in our proposed procedure to be
$$\tilde{Y}_i = \operatorname{tr}(X_i^\top A_0) + \frac{\operatorname{sign}\big(Y_i - \operatorname{tr}(X_i^\top A_0)\big)}{2 \hat{f}(0)},$$
and write $\tilde{Y} = (\tilde{Y}_1, \dots, \tilde{Y}_n)^\top$. For finite samples, regularization is imposed to estimate the high-dimensional parameter $A^\star$. By using the pseudo data, one natural candidate for the estimator of $A^\star$ is given by
$$\widehat{A}_1 = \operatorname*{arg\,min}_{A}\left\{\frac{1}{n}\sum_{i=1}^n \big(\tilde{Y}_i - \operatorname{tr}(X_i^\top A)\big)^2 + \lambda_1 \|A\|_*\right\}, \qquad (2.6)$$
where $\|\cdot\|_*$ is the nuclear norm and $\lambda_1 > 0$ is the tuning parameter. If the pseudo data were replaced by the original responses $Y_i$, the optimization (2.6) would be a common nuclear-norm regularized empirical risk estimator with quadratic loss, which has been well studied in the literature (Candès and Recht, 2009; Candès and Plan, 2010; Koltchinskii et al., 2011; Klopp, 2014). Therefore, with the pseudo data in hand, corresponding computational algorithms can be adopted to solve (2.6). Note that the pseudo data are based on an initial estimator $A_0$. In Section 3, we show that any initial estimator that fulfills Condition (C5) can be improved by (2.6), which is therefore called a refinement step. It is easy to verify that the initial estimator in Section 2.2 fulfills this condition. Note that an initial estimator such as the one of Section 2.2 introduces complicated dependencies among the entries of the pseudo data, which brings new challenges in analyzing (2.6), as opposed to the common estimator based on independent responses.
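A simplified sketch of one refinement step is given below. It combines a kernel density estimate of $f(0)$, the pseudo data construction, and a single singular value thresholding (SVT) step as a stand-in for fully solving the nuclear-norm regularized problem; the function names and the one-step SVT simplification are ours, not the paper's:

```python
import numpy as np

def biweight(u):
    """Biweight kernel on [-1, 1] (one kernel satisfying Condition (C4))."""
    return np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)

def fhat0(residuals, h):
    """Kernel density estimate of the noise density at zero, f(0)."""
    return biweight(residuals / h).mean() / h

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def refine_once(A_prev, rows, cols, Y, h, tau):
    """One refinement step: form pseudo data from the previous iterate,
    then take one SVT step on the pseudo-data-imputed matrix (a crude
    surrogate for solving the regularized quadratic problem exactly)."""
    resid = Y - A_prev[rows, cols]
    f0 = fhat0(resid, h)
    Y_tilde = A_prev[rows, cols] + np.sign(resid) / (2.0 * f0)
    Z = A_prev.copy()
    Z[rows, cols] = Y_tilde      # overwrite observed entries with pseudo data
    return svt(Z, tau)
```

The key point reflected in the code is that, once the pseudo data are formed, the remaining computation is a smooth quadratic problem handled by standard low-rank machinery (here, one SVT step).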
From our theory (Section 3), the refined estimator (2.6) improves upon the initial estimator. Depending on how poor the initial estimator is, a single refinement step may not be enough to achieve a (near-)optimal estimator. But this can be remedied by reapplying the refinement step repeatedly. In Section 3, we show that a finite number of applications of the refinement step suffices. In our numerical experiments, 4–5 applications usually produce enough improvement. Write $\widehat{A}_1$ given in (2.6) as the estimator from the first iteration; we can then construct an iterative procedure to estimate $A^\star$. In particular, let $\widehat{A}_t$ be the estimator in the $t$-th iteration. Define
$$\hat{f}_t(0) = \frac{1}{n h_t}\sum_{i=1}^n H\!\left(\frac{Y_i - \operatorname{tr}(X_i^\top \widehat{A}_{t-1})}{h_t}\right),$$
where $H$ is the same smoothing kernel used to estimate $f(0)$ in the first step and $h_t$ is the bandwidth for the $t$-th iteration. Similarly, for each $i$, define
$$\tilde{Y}_i^{(t)} = \operatorname{tr}(X_i^\top \widehat{A}_{t-1}) + \frac{\operatorname{sign}\big(Y_i - \operatorname{tr}(X_i^\top \widehat{A}_{t-1})\big)}{2 \hat{f}_t(0)}. \qquad (2.7)$$
We propose the following estimator
$$\widehat{A}_t = \operatorname*{arg\,min}_{A}\left\{\frac{1}{n}\sum_{i=1}^n \big(\tilde{Y}_i^{(t)} - \operatorname{tr}(X_i^\top A)\big)^2 + \lambda_t \|A\|_*\right\}, \qquad (2.8)$$
where $\lambda_t > 0$ is the tuning parameter in the $t$-th iteration. To summarize, we list the full procedure in Algorithm 1.
3 Theoretical Guarantee
To begin with, we introduce several pieces of notation. For a given matrix $A$, denote by $\sigma_j(A)$ the $j$-th largest singular value of $A$. Let $\|A\|$, $\|A\|_\infty$, $\|A\|_F$ and $\|A\|_*$ be the spectral norm (operator norm), the infinity norm, the Frobenius norm and the trace norm of a matrix, respectively. Define the class of matrices $\mathcal{A}$ as in Section 2.1, and denote the rank of a matrix $A$ by $r(A)$ for simplicity. With these notations, we describe the following conditions, which are useful in our theoretical analysis.
(C1) For each $i$, the design matrix $X_i$ takes values in the canonical basis $\mathcal{X}$ defined in (1.2). There exist positive constants $c_1$ and $c_2$ such that $c_1 \le d_1 d_2\, \pi_{kl} \le c_2$ for all $(k, l)$.
(C2) The local dimensions of each block satisfy suitable growth conditions relative to $d_1$ and $d_2$, and the numbers of observations in the blocks are comparable, i.e., $n_{jk} \asymp n/(m_1 m_2)$ for all $(j, k)$.
(C3) The density function $f$ of the noise is Lipschitz continuous (i.e., $|f(u) - f(v)| \le c_L |u - v|$ for any $u, v$ and some constant $c_L > 0$). Moreover, there exists a constant $c_f > 0$ such that $f(u) \ge c_f$ in a neighborhood of zero. Also, the median of $\epsilon_i$ is zero for each $i$.
Theorem 1 (Alquier et al. (2019), Theorem 4.6, Initial estimator).
Suppose that Conditions (C1)–(C3) hold. For each $(j, k)$, assume that there exists a matrix of rank at most $r$ in the hypothesis class that approximates the corresponding submatrix of $A^\star$ sufficiently well, with a universal constant.
(i) Then there exist universal constants such that, with appropriately chosen tuning parameters, each submatrix estimator in (2.3) satisfies
(3.1) 
with probability at least .
(ii) Moreover, by putting these estimators together, for the same constant as in (i), the initial estimator satisfies
with probability at least .
From Theorem 1, we can guarantee the convergence of the submatrix estimators when the numbers of observations per block are sufficiently large. For the initial estimator, under Condition (C3) and the assumption that all the submatrices are low-rank, the number of observations must exceed a constant multiple of the effective dimension to ensure convergence. As for the rate of convergence, it is slower than the classical optimal rate when $s_1$ and $s_2$ are of smaller order than $d_1$ and $d_2$, respectively.
(C4) Assume the kernel function $H$ is integrable with $\int H(u)\, du = 1$. Moreover, assume that $H(u) = 0$ if $|u| > 1$. Further, assume that $H$ is differentiable and its derivative is bounded.
(C5) The initial estimator $\widehat{A}_{\mathrm{init}}$ satisfies $\|\widehat{A}_{\mathrm{init}} - A^\star\|_\infty = O_{\mathbb{P}}(a_n)$ for some initial rate $a_n \to 0$.
For notational consistency, denote the initial rate by $a_{n,0}$ and define
(3.2) 
Theorem 2 (Repeated refinement).
Suppose that Conditions (C1)–(C5) hold. By choosing the bandwidth $h_t$ according to the rate defined in (3.2) and taking the tuning parameter $\lambda_t$
with a sufficiently large leading constant, we have
(3.3) 
When the iteration number $t = 1$, (3.3) describes a one-step refinement of the initial estimator. On the right-hand side of (3.3), both the first term and the second term appear in the error bounds of existing works (Elsener and van de Geer, 2018; Alquier et al., 2019). The bound has an extra third term due to the initial estimator. After one round of refinement, one can see that the third term in (3.3) shrinks faster than the convergence rate of the initial estimator (see Condition (C5)).
As the iteration number $t$ increases, Theorem 2 shows that the estimator can be refined repeatedly, until a near-optimal rate of convergence is achieved. It can be shown that once the iteration number exceeds a certain threshold, the term associated with the initial estimator is dominated by the first term, and the convergence rate becomes near-optimal (optimal up to a logarithmic factor). Note that the required number of iterations is usually small due to the logarithmic transformation.
3.1 Main Lemma and Proof Outline
For the $t$-th refinement, consider the residuals of the pseudo data and the associated stochastic term formed by weighting the design matrices with these residuals. To provide the upper bound in Theorem 2, we follow standard arguments, as used in the corresponding key theorems in, e.g., Koltchinskii et al. (2011); Klopp (2014). The key is to control the spectral norm of this stochastic term. A specific challenge of our setup is the dependency among the residuals. We tackle this with the following lemma:
Lemma 1.
Suppose that Conditions (C1)–(C5) hold. For any iteration $t$, choose the bandwidth $h_t$ according to the rate defined in (3.2). Then we have
We now give a proof outline of Lemma 1 for the first refinement ($t = 1$). The same argument can be applied iteratively to obtain the repeated refinement results stated in Lemma 1.
In our proof, we decompose the stochastic term into three components and control their spectral norms separately.
For the first component, we first bound its entries for fixed indices by separating the random variables involved, and then apply the exponential inequality in Lemma 1 of Tony and Liu (2011). To control the spectral norm, we take the supremum over the indices, and the corresponding uniform bound can be derived using an $\epsilon$-net argument. The same technique can be used to handle the second component. Therefore, we can bound the spectral norms of the first two components for any initial estimator that satisfies Condition (C5).
As for the third component, we first note that it is not difficult to control a simplified version in which the estimated density value $\hat{f}(0)$ is replaced by the true value $f(0)$. To control our target term, we provide Proposition 1 in the supplementary materials, which relates the two quantities.
4 Experiments
4.1 Synthetic Data
We conducted a simulation study in which the dimensions were held fixed. In each simulated data set, the target matrix was generated as $A^\star = U V^\top$, where the entries of $U$ and $V$ were all drawn independently from the standard normal distribution and the inner dimension was set to a small value, so that $A^\star$ was a low-rank matrix. The missing rate was held fixed, and we adopted the uniform missing mechanism, under which all entries had the same chance of being observed. We considered the following four noise distributions:
Normal: .

Cauchy: .

Exponential: .
We note that the Cauchy distribution is very heavy-tailed and its first moment (expectation) does not exist. For each of these four settings, we repeated the simulation 500 times.
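The data-generating process described above can be sketched as follows (dimensions, rank, and observation probability are illustrative placeholders, not the paper's settings; we shift the exponential noise by $\log 2$ so that its median is zero, matching the identification assumption on the noise):

```python
import numpy as np

rng = np.random.default_rng(2024)
d1 = d2 = 400          # illustrative sizes
r = 3                  # illustrative rank

# Low-rank target: product of two standard normal factor matrices.
A_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))

# Uniform missing mechanism: every entry observed with the same probability.
mask = rng.random((d1, d2)) < 0.2

# Noise settings; the exponential is centered so that its median is 0.
noise = {
    "normal": rng.standard_normal((d1, d2)),
    "cauchy": rng.standard_cauchy((d1, d2)),
    "exponential": rng.exponential(1.0, (d1, d2)) - np.log(2.0),
}
Y_obs = {name: (A_star + eps)[mask] for name, eps in noise.items()}
```

Note that the Cauchy setting produces occasional enormous responses, which is exactly the regime where the median MC target remains well defined while mean-based targets do not.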
Denote the proposed median MC procedure given in Algorithm 1 by DLADMC (Distributed Least Absolute Deviations Matrix Completion). Due to Theorem 1(ii), we fixed the regularization constants at conservative values; from our experience, smaller constants lead to similar results. The bandwidths were set according to the rates defined in (3.2). In addition, all the tuning parameters in Algorithm 1 were chosen by validation: namely, we minimized the absolute deviation loss evaluated on an independently generated validation set of the same dimensions. For the kernel function $H$, we adopted the commonly used biweight kernel function,
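The biweight kernel is $H(u) = \tfrac{15}{16}(1 - u^2)^2$ for $|u| \le 1$ and $0$ otherwise. A quick numerical check (ours, for illustration) that it meets the requirements of Condition (C4), namely that it integrates to one and is compactly supported:

```python
import numpy as np

def biweight(u):
    """Biweight kernel H(u) = (15/16) * (1 - u^2)^2 for |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)

# Riemann sum of the kernel over its support [-1, 1]; the kernel
# vanishes at the endpoints, so a plain uniform sum is accurate.
u = np.linspace(-1.0, 1.0, 200_001)
du = u[1] - u[0]
total = biweight(u).sum() * du
```

The compact support and bounded derivative make the bandwidth conditions in the theory easy to satisfy in practice.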