I Introduction
The problem of recovering a low rank matrix from an incomplete or noisy sampling of its entries arises in a variety of applications, including collaborative filtering [1] and sensor network localization [2, 3]. In many applications, the observations are not only missing, but are also highly discretized, e.g. binaryvalued (1bit) [4, 5], or multiplevalued [6]. For example, in the Netflix problem where a subset of the users’ ratings is observed, the ratings take integer values between 1 and 5. Although one can apply existing matrix completion techniques to discretevalued observations by treating them as continuousvalued, performance can be improved by treating the values as discrete [4].
In this paper we consider the problem of completing a matrix from a subset of its entries, where instead of observing continuousvalued entries, we observe a subset of 1bit measurements. Given , a subset of indices , and a twice differentiable function
, we observe (“w.p.” stands for “with probability”)
(1) 
One important application is the binary quantization of , where is a noise matrix with i.i.d entries. If we take
to be the cumulative distribution function of
, then the model in (1) is equivalent to observing(2) 
Recent work in the 1bit matrix completion literature has followed the probabilistic model in (1)(2) for the observed matrix and has estimated via solving a constrained maximum likelihood (ML) optimization problem. Under the assumption that is lowrank, these works have used convex relaxations for the rank via the trace norm [4] or maxnorm [5]
. An upper bound on the matrix estimation error is given under the assumptions that the entries are sampled according to a uniform distribution
[4], or in [5], following a nonuniform distribution.In this paper, we follow [4, 5] in seeking an ML estimate of but use an exact rank constraint on rather than a convex relaxation for the rank. We follow the sampling model of [7] for which includes the uniform sampling of [4] as well as nonuniform sampling. We provide an upperbound on the Frobenius norm of matrix estimation error, and show that our bound yields faster convergence rate with matrix dimensions than the existing results of [4, 5] when the fraction of revealed 1bit observations is fixed independent of the matrix dimensions. Lastly, we present an iterative algorithm for solving our nonconvex optimization problem with a certificate of global optimality under mild conditions. Our algorithm outperforms [4, 5] in the presented simulation example.
Notation: For matrix with th entry , we use the notation for the entrywise infinitynorm, for the Frobenius norm and for its operator norm. We use to denote the th row and to denote the th column. Taking to be a set, we use to denote the cardinality of . The notation represents the set of integers . We denote by
the vector of all ones, by
the unit vector , and by the indicator function, i.e. when is true, else .Ii Model Assumptions
We wish to estimate unknown using a constrained ML approach. We use to denote the optimization variable. Then the negative loglikelihood function for the given problem is
(3) 
Note that (3) is a convex function of when the function is logconcave. Two common choices for which the function is logconcave are: Logit model with logistic function and parameter , or equivalently in (2) is logistic with scale parameter ; Probit model with where and is the cumulative distribution function of . We assume that is a lowrank matrix with rank bounded by , and that the true matrix satisfies , which helps make the recovery of wellposed by preventing excessive “spikiness” of the matrix. We refer the reader to [4, 5] for further details.
The constrained ML estimate of interest is the solution to the optimization problem (s.t.: subject to):
(4) 
In many applications, such as sensor network localization, collaborative filtering, or DNA haplotype assembly, the rank is known or can be reliably estimated [8].
We now discuss our assumptions on the set . Consider a bipartite graph , where the edge set is related to the index set of revealed entries as iff . Abusing the notation, we use for both the graph and its biadjacency matrix where if , if . We denote the association of to by . Without loss of generality we take . We assume that each row of has nonzero entries (thus ) with the following properties on its SVD:

The left and right top singular vectors of are and , respectively. This implies that , where
denotes the largest singular value of
, and that each column of has nonzero entries. 
We have , where denotes the second largest singular value of and is some universal constant.
Thus we require to have a large enough spectral gap. As discussed in [7], an ErdösRenyi random graph with average degree satisfies this spectral gap property with high probability, and so do stochastic block models for certain choices of inter and intracluster edge connection probabilities. Thus, this sampling scheme is more general than a uniform sampling assumption, used in [4], and it also includes the stochastic block model [7] resulting in nonuniform sampling.
Iii Performance Upperbound
We now present a performance bound for the solution to (4). With , define
(5) 
(6) 
where is the bound on the entrywise infinitynorm of (see (4)). For the logit model, we have , and . For the probit model we obtain , . For further reference, define the constraint set
(7) 
Theorem III.1
Proof of this theorem is given in Sec. VI. Of particular interest is the case where is fixed and we let and become large, with fixed. In this case we have the following Corollary.
Corollary III.2
Iiia Comparison with previous work
Consider , with fraction of its entries sampled, such that (also assumed in [4, 5]) and . Then , and . The bounds proposed in [4] (and [5] in case of uniform sampling) yields
(11) 
whereas, applying our result (Corollary 3.2), we obtain
(12) 
Comparing (11) and (12), we see our method has faster convergence rate in for fixed rank and fraction of revealed entries . Notice that if the number of missing entries scales with according to , [4] yields bounded error while our bound grows with ; in our case we need to be of order at least . We believe this to be an artifact of our proof, as our numerical results (Fig. 1) show our method outperforms [4], especially for low values of and higher values of rank .
Iv Optimization
We will solve the optimization problem (4) using a logbarrier penalty function approach [9, Sec. 11.2]. The constraint translates to the logbarrier penalty function . This leads to the regularized objective function
(13) 
and the optimization problem
(14) 
We can account for the rank constraint in (14) via the factorization technique of [10, 11, 12] where instead of optimizing with respect to in (4), is factorized into two matrices and such that . One then chooses and optimizes with respect to the factors . The reformulated objective function is then given by
(15) 
where denotes the th row of , and the th row of . The parameter sets the accuracy of approximation of via the logbarrier function. We solve this factored version using a gradient descent method with backtracking line search, in a sequence of central path following solutions [9, Sec. 11.2], where one gradually reduces toward 0. Initial values of are randomly picked and scaled to satisfy . Starting with a large , we solve for via central path following and use 5fold cross validation error over as the stopping criterion in selecting .
Remark IV.1
The hard rank constraint results in a nonconvex constraint set. Thus, (4) and (14) are nonconvex optimization problems; similarly for minimization of (15) for which the rank constraint is implicit in the factorization of . However, the following result is shown in [10, Proposition 4], based on [11], for nonconvex problems of this form. If is a local minimum of the factorized problem, then is the global minimum of problem (14), so long as and are rankdeficient. (Rank deficiency is a sufficient condition, not necessary.) This result is utilized in [12] and [5] for problems of this form.
V Numerical Experiments
Va Synthetic Data
In this section, we test our method on synthetic data and compare it with the methods of [4, 5]. We set and construct as where and are matrices with i.i.d. entries drawn from a uniform distribution on (as in [5, 4]). Then we scale to achieve . We pick , vary matrix sizes or 400. We generate the set of revealed indices via the Bernoulli sampling model of [4] with fraction of revealed entries. We consider the probit (, as in [5, 4]) model. For Fig. 1, we take and vary . The resulting relative meansquare error (MSE) averaged over 20 Monte Carlo runs is shown in Fig. 1. As expected, the performance improves with increasing . For comparison, we have also implemented the methods of [4, 5], labeled “trace norm”, and “maxnorm” respectively. As we can see our proposed approach significantly outperforms [4, 5], especially for low values of and high values of .
In Fig. 2 we show the relative MSE for , for the probit model using our approach. We also plot the line in Fig. 2 to show the scale of the upper bound established in Section III. As we can see, the empirical estimation errors follow approximately the same scaling, suggesting that our analysis is tight, up to a constant.
We additionally plot the MSE for and in Fig. 3, with varying and keeping , under the probit model. This enables us to study the performance of the model under nonuniform sampling. Note that when , the spectral gap is largest and MSE is the smallest, and as gets larger, the spectral gap decreases, leading to larger MSE.
% Accuracy (Logit Model)  

% Training  Proposed  Tracenorm  Maxnorm 
95  72.30.7  72.40.6  71.50.7 
10  60.40.6  58.50.5  58.40.6 
5  53.70.8  49.20.7  50.30.2 
VB MovieLens (100k) Dataset
As in [4], we consider the MovieLens (100k) dataset (http://www.grouplens.org/node/73). This dataset consists of 100,000 movie ratings from 943 users on 1682 movies, with ratings on a scale from 1 to 5. Following [4], these ratings were converted to binary observations by comparing each rating to the average rating for the entire dataset. We used three splits of the data into training/test subsets and used 20 random realizations of these splits. The performance is evaluated by checking to see if the estimate of accurately predicts the sign of the test set ratings (whether the observed ratings were above or below the average rating). As in [4], we determine the needed parameter values by performing a grid search and selecting the values that lead to the best performance; we fixed , and varied (i.e. central path following), and rank . Our performance results are shown in Table I using a logistic model for three approaches: proposed, [4, 5]. These results support our findings on synthetic data that our method is preferable over [4, 5] for sparser data.
Vi Proof of Theorem iii.1
Our proof is based on a secondorder Taylor series expansion and a matrix concentration inequality.
Let and . The objective function is continuous in and the set is compact, therefore, achieves a minimum in . If minimizes subject to the constraints, then where . By the secondorder Taylor’s theorem, expanding around we have
(16) 
where for some , with corresponding matrix . We need several auxiliary results before we can prove Theorem III.1.
Let . Note that by our notation,
We then have
(20) 
where . Let . Therefore,
(21) 
Lemma VI.1
[13, Theorem 8.4] Take any two numbers and such that . Suppose that
is a matrix whose entries are independent random variables that satisfy, for some
,Suppose that for some . Then
where is a constant that depends only on and is a positive universal constant. The same result is true when and
is symmetric or skewsymmetric, with independent entries on and above the diagonal, all other assumptions remaining the same. Lastly, all results remain true if the assumption
is changed to .Lemma VI.2
Let , and . Then with probability at least , we have
where , is a constant that depends only on and is a positive universal constant.
Using (20), we have
(22) 
Let . Then we have , and . Applying Lemma VI.1 to with , we obtain with probability at least for some positive constants and . W note that for any matrix of rank , with denoting the nuclear norm. Hence , yielding the desired result.
Lemma VI.3
Let and . Then for any and any , we have
We need a result similar to [7, Theorem 4.1] regarding closeness of a fixed matrix to its sampled version, which is proved therein for square matrices under an incoherence assumption on . In Lemma VI.4 we prove a similar result for rectangular with bounded . Define
where for a matrix , denotes the largest norm of the rows in , i.e, .
For , , and define the operator as
Lemma VI.4
Let satisfy assumptions (A1) and (A2) in Section IV. Let with rank. Then we have
(24)  
(25) 
By definition of , there exist and for some such that , and . Since , we have , but this fact is not needed in our proof. By the variational definition of operator norm,
We also have where denotes the Hadamard (elementwise) product. Letting and respectively denote the th column of and , we write
We therefore have
Normalize to unit norm as , and similarly for . Let where is a unit norm vector orthogonal to . Then . Hence
(26) 
where we used the facts that and . Since is the top left singular vector of , we have
Using the above inequality in (26) we obtain
(27) 
We have . Hence, . Therefore,
(28) 
where we used . Similarly, we have
(29) 
It then follows from (27)(29) that
Lemma VI.5
Let . Then we have
Comments
There are no comments yet.