Matrix approximation is a common task in machine learning. Given a few observed entries of a matrix $M$, matrix approximation constructs a matrix $\hat{M}$ that approximates $M$ at its unobserved entries. In general, the problem of completing a matrix based on a few observed entries is ill-posed, as there are an infinite number of matrices that perfectly agree with the observed entries of $M$. Thus, we need additional assumptions, such as that $M$ is a low-rank matrix. More formally, we approximate a matrix $M \in \mathbb{R}^{n_1 \times n_2}$ by a rank-$r$ matrix $\hat{M} = UV^\top$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, and $r \ll \min(n_1, n_2)$. In this note, we assume that $M$ behaves as a low-rank matrix in the vicinity of certain row-column combinations, instead of assuming that the entire $M$ is low-rank. We therefore construct several low-rank approximations of $M$, each being accurate in a particular region of the matrix. Smoothing the local low-rank approximations, we express $\hat{M}$ as a linear combination of low-rank matrices that approximate the unobserved matrix $M$. This mirrors the theory of non-parametric kernel smoothing, which was primarily developed for continuous spaces, and generalizes well-known compressed sensing results to our setting.
2 Global and Local Low-Rank Matrix Approximation
We describe in this section two standard approaches for low-rank matrix approximation (LRMA). The original (partially observed) matrix is denoted by $M \in \mathbb{R}^{n_1 \times n_2}$, and its low-rank approximation by $\hat{M} = UV^\top$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, $r \ll \min(n_1, n_2)$.
Incomplete SVD is a popular approach for constructing a low-rank approximation $\hat{M} = UV^\top$ by minimizing the Frobenius norm of the residual restricted to the set $A$ of observed entries of $M$:

$$(U, V) = \operatorname*{argmin}_{U, V} \sum_{(a,b) \in A} \left( [UV^\top]_{a,b} - M_{a,b} \right)^2.$$
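As a concrete illustration, this objective can be minimized with stochastic gradient descent over the observed entries. The sketch below is a minimal NumPy version; the function name, learning rate, regularization weight, and iteration count are illustrative choices, not part of the original description.

```python
import numpy as np

def incomplete_svd(M, observed, r, lr=0.03, reg=0.0, iters=4000, seed=0):
    """Rank-r factorization M ~ U V^T, fit only on the observed entries.

    M        : (n1, n2) array; entries outside `observed` are never touched.
    observed : list of (a, b) index pairs in the training set A.
    r        : target rank, r << min(n1, n2).
    """
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = 0.1 * rng.standard_normal((n1, r))
    V = 0.1 * rng.standard_normal((n2, r))
    for _ in range(iters):
        for a, b in observed:
            err = U[a] @ V[b] - M[a, b]      # residual at one observed entry
            gU = err * V[b] + reg * U[a]     # gradient w.r.t. row U[a]
            gV = err * U[a] + reg * V[b]     # gradient w.r.t. row V[b]
            U[a] -= lr * gU
            V[b] -= lr * gV
    return U, V
```

On a small matrix that is exactly rank-$r$, this drives the training residual on the observed entries close to zero.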
Another popular approach is minimizing the nuclear norm of a matrix (defined as the sum of its singular values) satisfying constraints constructed from the training set:

$$\hat{M} = \operatorname*{argmin}_{X} \|X\|_* \quad \text{s.t.} \quad \left\| P_A(X) - P_A(M) \right\|_F \le \delta,$$

where $P_A$ is the projection defined by $[P_A(X)]_{a,b} = X_{a,b}$ if $(a,b) \in A$ and $0$ otherwise, $\|\cdot\|_F$ is the Frobenius norm, and $\delta \ge 0$ is a tolerance parameter.
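A minimal NumPy sketch of the two building blocks of this formulation, assuming the standard iterative treatment: the projection $P_A$, and singular-value shrinkage (the proximal operator of the nuclear norm) that solvers such as singular value thresholding build on. Function names are illustrative.

```python
import numpy as np

def project_observed(X, observed):
    """P_A(X): keep X at the observed indices, zero elsewhere."""
    P = np.zeros_like(X)
    for a, b in observed:
        P[a, b] = X[a, b]
    return P

def svt_step(X, tau):
    """Shrink the singular values of X by tau: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Iterating $X \leftarrow \texttt{svt\_step}(X + P_A(M - X), \tau)$ is the basic shrinkage scheme; $\tau$ trades off the nuclear norm against the data-fit constraint.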
In order to facilitate a local low-rank matrix approximation, we need to pose an assumption that there exists a metric structure over $[n_1] \times [n_2]$, where $[n]$ denotes the set of integers $\{1, \ldots, n\}$. Formally, the distance $d((a,b), (a',b'))$ reflects the similarity between the rows $a$ and $a'$ and the columns $b$ and $b'$. In the global matrix factorization setting above, we assume that the matrix $M$ has a low-rank structure. In the local setting, however, we assume that the model is characterized by multiple low-rank matrices. Specifically, we assume a mapping $T : [n_1] \times [n_2] \to \mathbb{R}^{n_1 \times n_2}$ that associates with each row-column combination $(a,b)$ a low-rank matrix $T(a,b)$ that describes the entries of $M$ in the neighborhood of $(a,b)$; in particular, this applies to the observed entries $A$:

$$M_{a,b} = T(a,b)_{a,b} \quad \text{for all } (a,b) \in A.$$

Note that in contrast to the global estimate in global LRMA, our model now consists of multiple low-rank matrices, each describing the original matrix $M$ in a particular neighborhood. Figure 1 illustrates this model.
Without additional assumptions, it is impossible to estimate the mapping $T$ from a set of partial observations. Our additional assumption is that the mapping $T$ is slowly varying. Since the domain of $T$ is discrete, we assume that $T$ is Hölder continuous. Following common approaches in non-parametric statistics, we define a smoothing kernel $K_h(s, t)$, with $s, t \in [n_1] \times [n_2]$, as a non-negative symmetric unimodal function that is parameterized by a bandwidth parameter $h > 0$. A large value of $h$ implies that $K_h(s, \cdot)$ has a wide spread, while a small $h$ corresponds to a narrow spread of $K_h(s, \cdot)$. We use, for example, the Epanechnikov kernel, defined as $K_h(s, t) = \frac{3}{4}\left(1 - d(s, t)^2\right) \mathbf{1}\left[d(s, t) < h\right]$. We denote by $K_h^{(a,b)}$ the matrix whose $(i,j)$-entry is $K_h((a,b), (i,j))$.
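The kernel and the matrix $K_h^{(a,b)}$ are straightforward to compute. The sketch below assumes NumPy and, for the kernel matrix, a product form over rows and columns (an assumption the experiments later also make); `kernel_matrix`, `row_dist`, and `col_dist` are illustrative names for precomputed pairwise row and column distances.

```python
import numpy as np

def epanechnikov(d, h):
    """K_h(s, t) = 3/4 (1 - d(s, t)^2) * 1[d(s, t) < h]."""
    d = np.asarray(d, dtype=float)
    return 0.75 * (1.0 - d * d) * (d < h)

def kernel_matrix(row_dist, col_dist, a, b, h):
    """Matrix K_h^{(a,b)} whose (i, j)-entry is K_h((a, b), (i, j)),
    assuming the kernel factorizes over rows and columns (product form)."""
    return np.outer(epanechnikov(row_dist[a], h),
                    epanechnikov(col_dist[b], h))
```

Note that entries farther than $h$ from the anchor $(a,b)$ in either coordinate receive weight zero, which is what makes the local problems sparse.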
The local versions of the two estimators above weight the observed residuals by the kernel:

$$\hat{T}(a,b) = UV^\top, \quad (U, V) = \operatorname*{argmin}_{U, V} \sum_{(i,j) \in A} K_h\big((a,b), (i,j)\big) \left( [UV^\top]_{i,j} - M_{i,j} \right)^2,$$

or, in the nuclear-norm formulation,

$$\hat{T}(a,b) = \operatorname*{argmin}_{X} \|X\|_* \quad \text{s.t.} \quad \left\| K_h^{(a,b)} \odot \big( P_A(X) - P_A(M) \big) \right\|_F \le \delta,$$

where $\odot$ denotes the component-wise product of two matrices, $[A \odot B]_{i,j} = A_{i,j} B_{i,j}$.
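The kernel-weighted incomplete SVD for a single anchor point can be sketched as follows (a minimal NumPy version; `local_svd`, the SGD schedule, and hyperparameters are illustrative choices):

```python
import numpy as np

def local_svd(M, observed, weights, r, lr=0.03, iters=4000, seed=0):
    """Fit one local model T(a, b) by kernel-weighted incomplete SVD.

    weights[i, j] = K_h((a, b), (i, j)) for the chosen anchor (a, b);
    entries with zero weight drop out of the objective entirely.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = 0.1 * rng.standard_normal((n1, r))
    V = 0.1 * rng.standard_normal((n2, r))
    # Only kernel-supported observations contribute to the local objective.
    active = [(i, j) for i, j in observed if weights[i, j] > 0]
    for _ in range(iters):
        for i, j in active:
            err = weights[i, j] * (U[i] @ V[j] - M[i, j])  # weighted residual
            gU, gV = err * V[j], err * U[i]
            U[i] -= lr * gU
            V[j] -= lr * gV
    return U, V
```

With uniform weights this reduces to the global incomplete SVD; with a compactly supported kernel only the anchor's neighborhood is fit.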
The two optimization problems above describe how to estimate $\hat{T}(a,b)$ for a particular choice of $(a,b)$. Conceptually, this technique can be applied at each test entry $(a,b)$, resulting in the matrix approximation $\hat{M}$ with $\hat{M}_{a,b} = \hat{T}(a,b)_{a,b}$. However, this requires solving a non-linear optimization problem for each test index $(a,b)$ and is thus computationally prohibitive. Instead, we use Nadaraya-Watson local regression with a set of $q$ local estimates $\hat{T}(a_1,b_1), \ldots, \hat{T}(a_q,b_q)$, in order to obtain a computationally efficient estimate $\hat{\hat{M}}$ for all $(a,b)$:

$$\hat{\hat{M}}_{a,b} = \sum_{i=1}^{q} \frac{K_h\big((a_i,b_i), (a,b)\big)}{\sum_{j=1}^{q} K_h\big((a_j,b_j), (a,b)\big)} \, \hat{T}(a_i,b_i)_{a,b}. \tag{5}$$
Equation (5) is simply a weighted average of the local estimates $\hat{T}(a_i,b_i)_{a,b}$, where the weights ensure that local models anchored at indices close to $(a,b)$ contribute more than those anchored further away from $(a,b)$.
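The weighted average can be computed for all entries at once. A minimal NumPy sketch, assuming each local estimate and its kernel weight matrix are already materialized as dense $n_1 \times n_2$ arrays (the `eps` guard is an illustrative addition for entries no anchor covers):

```python
import numpy as np

def nadaraya_watson(local_estimates, weights, eps=1e-12):
    """Entrywise kernel-weighted average of q local estimates.

    local_estimates[i] : (n1, n2) array, the i-th local model T(a_i, b_i)
    weights[i][s, t]   : K_h((a_i, b_i), (s, t))
    """
    num = sum(w * t for w, t in zip(weights, local_estimates))
    den = sum(weights) + eps   # guard against entries covered by no anchor
    return num / den
```

Each output entry is a convex combination of the local estimates at that entry, with weights proportional to the kernel values.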
Note that the local version can be faster than global SVD since (a) the local low-rank approximations are independent of one another, so they can be computed in parallel, and (b) the rank used in the local SVD models can be significantly lower than the rank used in a global one. If the kernel has limited support ($K_h(s,t)$ is often zero), the local regularized SVD problems are also sparser than the global SVD problem, resulting in additional speedup.
We compare local-LRMA to global-LRMA and other state-of-the-art techniques on popular recommendation-system datasets: MovieLens 10M and Netflix. We split the data into train and test sets at a 9:1 ratio. A default prediction value of 3.0 was used whenever we encountered a test user or item without training observations. We use the Epanechnikov kernel, assuming a product form $K_h((a,b), (c,d)) = K_{h_1}(a,c) \, K'_{h_2}(b,d)$. For the distance function $d$, we use the arccos distance, defined as $d(x, y) = \arccos\left( \frac{\langle x, y \rangle}{\|x\| \, \|y\|} \right)$. Anchor points were chosen uniformly at random among the observed training entries. $L_2$ regularization is used for the local low-rank approximation.
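The arccos distance is computed between vector representations of rows (and, analogously, of columns); one natural instantiation, assumed here for illustration, is to take these vectors from the factor rows of a preliminary global factorization:

```python
import numpy as np

def arccos_distance(F):
    """Pairwise arccos distance d(i, j) = arccos(<f_i, f_j> / (||f_i|| ||f_j||))
    between the rows of a factor matrix F (e.g. from a preliminary global fit)."""
    G = F / np.linalg.norm(F, axis=1, keepdims=True)   # unit-normalize rows
    C = np.clip(G @ G.T, -1.0, 1.0)                    # cosine similarities
    return np.arccos(C)
```

The `np.clip` guards against round-off pushing cosines slightly outside $[-1, 1]$, which would make `arccos` return NaN.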
Figure 2 graphs the RMSE of local-LRMA and global-LRMA, as well as that of the recently proposed DFC (Divide-and-Conquer matrix Factorization), as a function of the number of anchor points $q$. Both local-LRMA and global-LRMA improve as the rank increases, but local-LRMA with a modest rank outperforms global-LRMA at any rank. Moreover, local-LRMA outperforms global-LRMA on average even with a few anchor points (though the performance of local-LRMA improves further as the number of anchor points increases).