Deep Matrix Factorization with Spectral Geometric Regularization

11/17/2019
by   Amit Boyarski, et al.

Deep Matrix Factorization (DMF) is an emerging approach to the problem of reconstructing a matrix from a subset of its entries. Recent works have established that gradient descent applied to a DMF model induces an implicit regularization on the rank of the recovered matrix. Despite these promising theoretical results, empirical evaluation of vanilla DMF on real benchmarks exhibits poor reconstructions which we attribute to the extremely low number of samples available. We propose an explicit spectral regularization scheme that is able to make DMF models competitive on real benchmarks, while still maintaining the implicit regularization induced by gradient descent, thus enjoying the best of both worlds.


1 Introduction

Matrix completion deals with the recovery of the missing values of a matrix from a subset of its entries,

(1)   $\mathbf{M} \odot \mathbf{X} = \mathbf{M} \odot \mathbf{A}.$

Here $\mathbf{X}$ stands for the unknown matrix, $\mathbf{A}$ for the ground truth matrix, $\mathbf{M}$ is a binary mask representing the input support, and $\odot$ denotes the Hadamard product. Since problem (1) is ill-posed, it is common to assume that $\mathbf{A}$ belongs to some low dimensional subspace. Under this assumption, the matrix completion problem can be cast via the least-squares variant,

(2)   $\min_{\mathbf{X}} \; \tfrac{1}{2}\,\big\|\mathbf{M} \odot (\mathbf{X} - \mathbf{A})\big\|_F^2 + \lambda\, \mathrm{rank}(\mathbf{X}).$

Relaxing the intractable rank penalty to its convex envelope, namely the nuclear norm, leads to a convex problem whose solution coincides with that of (2) under some technical conditions (Candès and Recht, 2009). Another way to enforce low rank is to explicitly parametrize $\mathbf{X}$ in factorized form, $\mathbf{X} = \mathbf{W}_1\mathbf{W}_2$; the rank of $\mathbf{X}$ is then upper-bounded by the minimal dimension of the factors. Developing this idea further, $\mathbf{X}$ can be parametrized as a product of several matrices, $\mathbf{X} = \mathbf{W}_1\mathbf{W}_2\cdots\mathbf{W}_k$, a model we denote as deep matrix factorization (DMF). Gunasekar et al. (2017); Arora et al. (2019) investigated the minimization of overparametrized DMF models using gradient descent, and came to the following conclusion (which we state formally in Section 2): whereas in some restrictive settings minimizing a DMF model with gradient descent is equivalent to nuclear norm minimization (i.e., the convex relaxation of (2)), in general the two models produce different results, with the former enforcing a stronger regularization on the rank of $\mathbf{X}$. This regularization gets stronger as the depth $k$ increases. In light of these results, we shall henceforth use "DMF" to refer to the aforementioned model coupled with the specific algorithm used for its minimization, namely gradient descent.
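To make the convex-relaxation route concrete, here is a minimal NumPy sketch of nuclear-norm-style completion via iterative singular-value soft-thresholding (a soft-impute-type heuristic). It illustrates the relaxation discussed above rather than the DMF approach studied in this paper; the function name, the threshold value and the iteration count are our own choices.

```python
import numpy as np

def soft_impute(A_obs, M, lam=1.0, n_iters=200):
    """Complete a matrix by repeatedly soft-thresholding its singular values.
    A_obs holds the observed values (anything outside the mask is ignored),
    M is the binary mask of observed entries."""
    X = np.zeros_like(A_obs)
    for _ in range(n_iters):
        Z = M * A_obs + (1 - M) * X            # keep observed entries, impute the rest
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)           # shrink singular values (nuclear-norm proximal step)
        X = (U * s) @ Vt
    return X
```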

Oftentimes, additional information is available in the form of a graph that neatly encodes structural (geometric) information about $\mathbf{A}$. For example, we can constrain $\mathbf{X}$ to belong to a subspace spanned by the eigenvectors of some graph Laplacian, i.e., to be band-limited on the graph. Such information is generally overlooked by purely algebraic quantities (e.g., the rank), and becomes invaluable in the data-poor regime, where the theorems governing reconstruction guarantees (e.g., (Candès and Recht, 2009)) do not hold. Our work leverages recent advances in DMF theory to marry the two concepts: a framework for matrix completion that is explicitly motivated by geometric considerations, while implicitly promoting low rank via its DMF structure.

Contributions.

Our contributions are as follows:

  • We propose task-specific DMF models that follow from geometric considerations, and study their dynamics.

  • We show that with our proposed models it is possible to obtain state-of-the-art results on several recommendation systems datasets, making this one of the first successful applications of deep linear networks to real problems.

  • Our findings call into question the quality of the side information available in various recommendation systems datasets, as well as the ability of contemporary methods to utilize it in a meaningful and efficient way.

2 Preliminaries

Spectral graph theory.

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a (weighted) graph specified by its vertex set $\mathcal{V}$ and edge set $\mathcal{E}$, with its adjacency matrix denoted by $\mathbf{W}$. Given a function $\mathbf{x} \in \mathbb{R}^{|\mathcal{V}|}$ on the vertices, we define the following quadratic form (also known as the Dirichlet energy), measuring the variability of the function on the graph,

(3)   $\mathbf{x}^\top \mathbf{L}\, \mathbf{x} = \tfrac{1}{2} \sum_{i,j} w_{ij}\, (x_i - x_j)^2.$

The matrix $\mathbf{L}$ is called the (combinatorial) graph Laplacian, and is given by $\mathbf{L} = \mathbf{D} - \mathbf{W}$, where $\mathbf{D} = \mathrm{diag}(\mathbf{W}\mathbf{1})$ is the degree matrix. $\mathbf{L}$ is symmetric and positive semi-definite and therefore admits a spectral decomposition $\mathbf{L} = \boldsymbol{\Phi} \boldsymbol{\Lambda} \boldsymbol{\Phi}^\top$ with orthonormal eigenvectors $\boldsymbol{\phi}_i$ and non-negative eigenvalues $\lambda_i$. Since $\mathbf{L}\mathbf{1} = \mathbf{0}$, zero is always an eigenvalue of $\mathbf{L}$. The graph Laplacian is a discrete generalization of the continuous Laplace-Beltrami operator, and therefore has similar properties. One can think of the eigenpairs $(\boldsymbol{\phi}_i, \lambda_i)$ as the graph analogues of "harmonic" and "frequency".

A function on the vertices of the graph whose spectral coefficients $\hat{x}_i = \langle \mathbf{x}, \boldsymbol{\phi}_i \rangle$ are small for large $\lambda_i$ demonstrates a "smooth" behaviour on the graph, in the sense that the function values on nearby nodes are similar. A standard approach to promoting such smooth functions on graphs is to use the Dirichlet energy (3) to regularize some loss term. For example, this approach gives rise to the popular bilateral and non-local means filters (Gadde et al., 2013). Structural information about the graph is encoded in the spectrum of the Laplacian. For example, the number of connected components in the graph is given by the multiplicity of the zero eigenvalue, and the second eigenvalue (counting multiple eigenvalues separately) is a measure of the connectivity of the graph (Spielman, 2009).
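As a small illustration of these definitions, the following sketch builds a combinatorial Laplacian, computes its eigendecomposition, and evaluates the Dirichlet energy (3) of a low-frequency signal; the toy graph and variable names are ours.

```python
import numpy as np

def combinatorial_laplacian(W):
    """L = D - W for a symmetric, non-negative adjacency matrix W."""
    D = np.diag(W.sum(axis=1))
    return D - W

def dirichlet_energy(L, x):
    """x^T L x = 1/2 * sum_ij w_ij (x_i - x_j)^2: variability of x on the graph."""
    return float(x @ L @ x)

# Toy example: a path graph on 4 nodes.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = combinatorial_laplacian(W)
lam, Phi = np.linalg.eigh(L)       # eigenpairs: "frequencies" and "harmonics"
print(lam[0])                      # ~0: the constant vector is always an eigenvector
smooth = Phi[:, 1]                 # a low-frequency signal has small Dirichlet energy
print(dirichlet_energy(L, smooth))
```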

Product graphs and functional maps.

Let $\mathcal{G}_r = (\mathcal{V}_r, \mathcal{E}_r)$, $\mathcal{G}_c = (\mathcal{V}_c, \mathcal{E}_c)$ be two graphs, with $\mathbf{L}_r = \boldsymbol{\Phi} \boldsymbol{\Lambda}_r \boldsymbol{\Phi}^\top$, $\mathbf{L}_c = \boldsymbol{\Psi} \boldsymbol{\Lambda}_c \boldsymbol{\Psi}^\top$ being their corresponding graph Laplacians. The bases $\boldsymbol{\Phi}, \boldsymbol{\Psi}$ can be used to represent functions on these graphs. We define the Cartesian product of $\mathcal{G}_r$ and $\mathcal{G}_c$, denoted by $\mathcal{G}_r \,\square\, \mathcal{G}_c$, as the graph with vertex set $\mathcal{V}_r \times \mathcal{V}_c$, on which two nodes $(i, i')$ and $(j, j')$ are adjacent if either $i = j$ and $(i', j') \in \mathcal{E}_c$, or $i' = j'$ and $(i, j) \in \mathcal{E}_r$. The Laplacian of $\mathcal{G}_r \,\square\, \mathcal{G}_c$ is given by the tensor sum of $\mathbf{L}_r$ and $\mathbf{L}_c$,

(4)   $\mathbf{L}_\square = \mathbf{L}_r \oplus \mathbf{L}_c = \mathbf{L}_r \otimes \mathbf{I} + \mathbf{I} \otimes \mathbf{L}_c,$

and its eigenvalues are given by the Cartesian sum of the eigenvalues of $\mathbf{L}_r$ and $\mathbf{L}_c$, i.e., all combinations $\lambda_i + \mu_j$, where $\lambda_i$ is an eigenvalue of $\mathbf{L}_r$ and $\mu_j$ is an eigenvalue of $\mathbf{L}_c$. Let $\mathbf{X}$ be a function defined on $\mathcal{G}_r \,\square\, \mathcal{G}_c$. Then it can be represented using the bases of the individual Laplacians, $\mathbf{X} = \boldsymbol{\Phi}\, \mathbf{C}\, \boldsymbol{\Psi}^\top$. In the shape processing community, such a $\mathbf{C}$ is called a functional map, as it is used to map between the functional spaces of $\mathcal{G}_r$ and $\mathcal{G}_c$. For example, given two functions, $\mathbf{f}$ on $\mathcal{G}_r$ and $\mathbf{g}$ on $\mathcal{G}_c$, one can use $\mathbf{C}$ to map between their spectral representations $\boldsymbol{\Phi}^\top \mathbf{f}$ and $\boldsymbol{\Psi}^\top \mathbf{g}$. We shall henceforth interchangeably switch between the terms "signal on the product graph" and "functional map".

We will call a functional map smooth if it maps close points on one graph to close points on the other. A simple way to construct a smooth map is via a linear combination of eigenvectors of $\mathbf{L}_\square$ corresponding to small eigenvalues ("low frequencies"). Notice that while the eigenvectors of $\mathbf{L}_\square$ are outer products of the columns of $\boldsymbol{\Phi}$ and $\boldsymbol{\Psi}$, their ordering with respect to the eigenvalues of $\mathbf{L}_\square$ might be different from their lexicographic order.
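The sketch below (a toy example of ours, with $\boldsymbol{\Phi}$, $\boldsymbol{\Psi}$ and $\mathbf{C}$ named as in the text) builds the Cartesian-product Laplacian as a Kronecker sum, checks that its spectrum is the Cartesian sum of the factor spectra, and assembles a smooth signal $\mathbf{X} = \boldsymbol{\Phi}\mathbf{C}\boldsymbol{\Psi}^\top$ from low-frequency coefficients.

```python
import numpy as np

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def product_laplacian(Lr, Lc):
    """Laplacian of the Cartesian product graph: the tensor (Kronecker) sum."""
    m, n = Lr.shape[0], Lc.shape[0]
    return np.kron(Lr, np.eye(n)) + np.kron(np.eye(m), Lc)

# Two tiny graphs: a path on 3 nodes (rows) and a single edge on 2 nodes (columns).
Wr = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
Wc = np.array([[0, 1], [1, 0]], float)
Lr, Lc = laplacian(Wr), laplacian(Wc)

# The product spectrum is the Cartesian sum of the factor spectra.
lr, Phi = np.linalg.eigh(Lr)
lc, Psi = np.linalg.eigh(Lc)
assert np.allclose(np.sort(np.add.outer(lr, lc).ravel()),
                   np.linalg.eigvalsh(product_laplacian(Lr, Lc)))

# A signal X on the product graph, written in the factor eigenbases ("functional map" C).
C = np.zeros((3, 2)); C[0, 0] = 1.0      # keep only the lowest frequencies -> smooth X
X = Phi @ C @ Psi.T
```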

Implicit regularization of DMF.

Let $\mathbf{X} = \mathbf{W}_1 \mathbf{W}_2 \cdots \mathbf{W}_k \in \mathbb{R}^{m \times n}$ be a matrix parametrized as a product of $k$ matrices (which can be interpreted as $k$ linear layers of a neural network), and let $\ell(\mathbf{X})$ be an analytic loss function. Arora et al. (2018, 2019) analyzed the evolution of the singular values and singular vectors of $\mathbf{X}(t)$ throughout the gradient flow $\dot{\mathbf{W}}_j(t) = -\tfrac{\partial}{\partial \mathbf{W}_j} \ell(\mathbf{X}(t))$, i.e., gradient descent with an infinitesimal step size, with balanced initialization,

(5)   $\mathbf{W}_{j+1}^\top(0)\, \mathbf{W}_{j+1}(0) = \mathbf{W}_j(0)\, \mathbf{W}_j^\top(0), \qquad j = 1, \dots, k-1.$

As a first step, we state that $\mathbf{X}(t)$ admits an analytic singular value decomposition.

Lemma 1.

(Lemma 1 in Arora et al. (2019)) The product matrix $\mathbf{X}(t)$ can be expressed as:

(6)   $\mathbf{X}(t) = \mathbf{U}(t)\, \mathbf{S}(t)\, \mathbf{V}^\top(t),$

where $\mathbf{U}(t)$, $\mathbf{S}(t)$ and $\mathbf{V}(t)$ are analytic functions of $t$; and for every $t$, the matrices $\mathbf{U}(t)$ and $\mathbf{V}(t)$ have orthonormal columns, while $\mathbf{S}(t)$ is diagonal (the elements on its diagonal may be negative and may appear in any order).

The diagonal elements of $\mathbf{S}(t)$, which we denote by $\sigma_r(t)$, are the signed singular values of $\mathbf{X}(t)$; the columns of $\mathbf{U}(t)$ and $\mathbf{V}(t)$, denoted $\mathbf{u}_r(t)$ and $\mathbf{v}_r(t)$, are the corresponding left and right singular vectors (respectively). Using the above lemma, Arora et al. (2019) characterized the evolution of the singular values as follows:

Theorem 1.

(Theorem 3 in (Arora et al., 2019)) The signed singular values of the product matrix $\mathbf{X}(t)$ evolve by:

(7)   $\dot{\sigma}_r(t) = -k \, \big(\sigma_r^2(t)\big)^{1 - 1/k}\, \big\langle \nabla \ell\big(\mathbf{X}(t)\big),\, \mathbf{u}_r(t)\, \mathbf{v}_r^\top(t) \big\rangle.$

If the matrix factorization is non-degenerate, i.e., has depth $k \ge 2$, the singular values need not be signed (we may assume $\sigma_r(t) \ge 0$ for all $t$).

The above theorem implies that the evolution rates of the singular values depend on their magnitude exponentiated by $2 - 2/k$. As $k$ (the depth) increases, the gap between their convergence rates grows, thereby inducing an implicit regularization on the effective rank of $\mathbf{X}$.
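To visualize the dynamics predicted by Theorem 1, the following self-contained sketch runs plain gradient descent on a depth-3 factorization with a masked least-squares loss and a small balanced (scaled-identity) initialization, and prints the leading singular values of the product. Matrix sizes, step size and iteration counts are arbitrary choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 30
U, _ = np.linalg.qr(rng.standard_normal((m, 3)))
V, _ = np.linalg.qr(rng.standard_normal((n, 3)))
A = U @ np.diag([3.0, 2.0, 1.0]) @ V.T                 # rank-3 ground truth
M = (rng.random((m, n)) < 0.5).astype(float)           # mask of observed entries

def prod(mats):
    P = np.eye(m)
    for W in mats:
        P = P @ W
    return P

k, alpha, step = 3, 0.3, 0.05
Ws = [alpha * np.eye(m) for _ in range(k)]             # balanced initialization, cf. (5)

for it in range(10001):
    X = prod(Ws)
    G = M * (X - A)                                    # dl/dX for the masked least-squares loss
    Ws = [Ws[j] - step * prod(Ws[:j]).T @ G @ prod(Ws[j + 1:]).T for j in range(k)]
    if it % 2000 == 0:                                 # leading singular values grow, the rest stay near zero
        print(it, np.round(np.linalg.svd(X, compute_uv=False)[:5], 3))
```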

3 DMF with spectral geometric regularization

We assume that we are given a set of samples from the unknown matrix $\mathbf{A}$, encoded by a binary mask $\mathbf{M}$, and two graphs $\mathcal{G}_r, \mathcal{G}_c$, encoding relations between the rows and the columns, respectively. Denote the Laplacians of these graphs and their spectral decompositions by $\mathbf{L}_r = \boldsymbol{\Phi} \boldsymbol{\Lambda}_r \boldsymbol{\Phi}^\top$, $\mathbf{L}_c = \boldsymbol{\Psi} \boldsymbol{\Lambda}_c \boldsymbol{\Psi}^\top$. We denote the Cartesian product of $\mathcal{G}_r$ and $\mathcal{G}_c$ by $\mathcal{G} = \mathcal{G}_r \,\square\, \mathcal{G}_c$, and will henceforth refer to it as our reference graph. Our approach relies on a minimization problem of the form

(8)   $\min_{\mathbf{X}} \; \mathcal{E}_{\mathrm{data}}(\mathbf{X}) + \mathcal{E}_{\mathrm{dir}}(\mathbf{X}) \quad \text{s.t.} \quad \mathrm{rank}(\mathbf{X}) \le \rho,$

with $\mathcal{E}_{\mathrm{data}}$ denoting a data term of the form

(9)   $\mathcal{E}_{\mathrm{data}}(\mathbf{X}) = \tfrac{1}{2}\, \big\|\mathbf{M} \odot (\mathbf{X} - \mathbf{A})\big\|_F^2,$

and $\mathcal{E}_{\mathrm{dir}}$ is the Dirichlet energy of $\mathbf{X}$ on $\mathcal{G}$, given by (see (4))

(10)   $\mathcal{E}_{\mathrm{dir}}(\mathbf{X}) = \mathrm{tr}\big(\mathbf{X}^\top \mathbf{L}_r \mathbf{X}\big) + \mathrm{tr}\big(\mathbf{X}\, \mathbf{L}_c\, \mathbf{X}^\top\big).$

(Note that it is possible to weigh the two terms differently, as we do in some of our experiments.)

To that end, we parametrize $\mathbf{X}$ via a matrix product and eliminate the rank constraint,

(11)   $\min_{\mathbf{W}_1, \dots, \mathbf{W}_k} \; \mathcal{E}_{\mathrm{data}}(\mathbf{W}_1 \cdots \mathbf{W}_k) + \mathcal{E}_{\mathrm{dir}}(\mathbf{W}_1 \cdots \mathbf{W}_k).$

Since (11) is now a DMF model, this parametrization renders the rank constraint redundant: according to Theorem 1, it is captured by the implicit regularization induced by gradient descent.
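As a sketch of objective (11) (with our own per-term weights, cf. the remark above on weighting the two terms), the following function evaluates the masked data term plus the Dirichlet energies of the product matrix on the row and column graphs, and returns the gradient with respect to the product; the gradient with respect to each individual factor then follows by the chain rule exactly as in the DMF sketch above.

```python
import numpy as np

def sgmc_loss_and_grad(Ws, A, M, Lr, Lc, mu_r=1.0, mu_c=1.0):
    """Masked least-squares data term plus the Dirichlet energy of X = W1...Wk
    on the rows/columns graphs (a sketch of (11); mu_r, mu_c are our own weights)."""
    X = Ws[0]
    for W in Ws[1:]:
        X = X @ W
    data = 0.5 * np.sum((M * (X - A)) ** 2)
    dir_r = np.trace(X.T @ Lr @ X)              # smoothness along the rows graph
    dir_c = np.trace(X @ Lc @ X.T)              # smoothness along the columns graph
    loss = data + mu_r * dir_r + mu_c * dir_c
    grad_X = M * (X - A) + 2.0 * mu_r * (Lr @ X) + 2.0 * mu_c * (X @ Lc)
    return loss, grad_X
```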

To interpret this matrix factorization geometrically, we regard one of the factors as a signal living on a latent product graph $\tilde{\mathcal{G}}$. Via the linear transformation given by the remaining factors, this signal is transported onto the reference graph $\mathcal{G}$, where it is assumed to be both low-rank and smooth (see Figure 1). Notice that the latent graph is used only for the purpose of illustrating the geometric interpretation, and there is no need to find it explicitly. Nevertheless, it is possible to promote particular properties of it via spectral constraints that can sometimes improve the performance. We demonstrate these extensions in the sequel.

To give a concrete example, suppose $\mathbf{A}$ is a row- and column-permuted version of some low-rank matrix, and the reference Laplacian is the 2D Euclidean (grid) Laplacian. Then, via an appropriate reordering of the rows and columns of $\mathbf{A}$, it is possible to obtain a signal which is both smooth on $\mathcal{G}$ and low-rank. (On a side note, that is exactly the goal of the well-known and closely related seriation problem (Recanati, 2018).)

For later reference, let us rewrite (11) in the spectral domain, for a product of three factors $\mathbf{X} = \mathbf{W}_1 \mathbf{W}_2 \mathbf{W}_3$. We will denote the Laplacians of the latent graph factors comprising $\tilde{\mathcal{G}}$ by $\tilde{\mathbf{L}}_r, \tilde{\mathbf{L}}_c$ and their eigenbases by $\tilde{\boldsymbol{\Phi}}, \tilde{\boldsymbol{\Psi}}$. Using those eigenbases and the eigenbases $\boldsymbol{\Phi}, \boldsymbol{\Psi}$ of the reference Laplacians $\mathbf{L}_r, \mathbf{L}_c$, we can write,

(12)   $\mathbf{W}_1 = \boldsymbol{\Phi}\, \mathbf{P}\, \tilde{\boldsymbol{\Phi}}^\top,$
(13)   $\mathbf{W}_2 = \tilde{\boldsymbol{\Phi}}\, \mathbf{C}\, \tilde{\boldsymbol{\Psi}}^\top,$
(14)   $\mathbf{W}_3 = \tilde{\boldsymbol{\Psi}}\, \mathbf{Q}\, \boldsymbol{\Psi}^\top.$

Under this reparametrization we get

(15)   $\mathbf{X} = \mathbf{W}_1 \mathbf{W}_2 \mathbf{W}_3 = \boldsymbol{\Phi}\, \mathbf{P} \mathbf{C} \mathbf{Q}\, \boldsymbol{\Psi}^\top.$

With some abuse of notation, (11) becomes

(16)   $\min_{\mathbf{P},\, \mathbf{C},\, \mathbf{Q}} \; \mathcal{E}_{\mathrm{data}}(\mathbf{P}, \mathbf{C}, \mathbf{Q}) + \mathcal{E}_{\mathrm{dir}}(\mathbf{P}, \mathbf{C}, \mathbf{Q}),$

with

(17)   $\mathcal{E}_{\mathrm{data}}(\mathbf{P}, \mathbf{C}, \mathbf{Q}) = \tfrac{1}{2}\, \big\|\mathbf{M} \odot \big(\boldsymbol{\Phi}\, \mathbf{P} \mathbf{C} \mathbf{Q}\, \boldsymbol{\Psi}^\top - \mathbf{A}\big)\big\|_F^2$

and

(18)   $\mathcal{E}_{\mathrm{dir}}(\mathbf{P}, \mathbf{C}, \mathbf{Q}) = \mathrm{tr}\big((\mathbf{P} \mathbf{C} \mathbf{Q})^\top \boldsymbol{\Lambda}_r\, \mathbf{P} \mathbf{C} \mathbf{Q}\big) + \mathrm{tr}\big(\mathbf{P} \mathbf{C} \mathbf{Q}\, \boldsymbol{\Lambda}_c\, (\mathbf{P} \mathbf{C} \mathbf{Q})^\top\big).$
Figure 1: An illustration of the geometric interpretation of (11). A low-rank signal that lives on a latent product graph $\tilde{\mathcal{G}}$ is transported onto the reference product graph $\mathcal{G}$. The transported signal will be smooth on the target graph due to the Dirichlet energy.

3.1 Extensions

Additional regularization via spectral filtering.

We propose a stronger explicit regularization by demanding that both $\mathbf{X}$ and the latent signal $\mathbf{W}_2$ be smooth on their respective graphs. Since we do not know the Laplacian of the latent graph $\tilde{\mathcal{G}}$, we smooth $\mathbf{W}_2$ via spectral filtering, i.e., through direct manipulation of its spectral representation $\mathbf{C}$. To that end, we pass $\mathbf{C}$ through a bank of pre-chosen spectral filters $\{(\mathbf{F}_i, \mathbf{G}_i)\}$, i.e., diagonal positive semi-definite matrices, and transport the filtered signals to $\mathcal{G}$ according to

(19)   $\mathbf{X}_i = \boldsymbol{\Phi}\, \mathbf{P}\, \mathbf{F}_i\, \mathbf{C}\, \mathbf{G}_i\, \mathbf{Q}\, \boldsymbol{\Psi}^\top.$

In particular, we use ideal low-pass filters,

(20)   $\mathbf{F}_i = \mathrm{diag}(\mathbf{1}_{p_i}, \mathbf{0}), \qquad \mathbf{G}_i = \mathrm{diag}(\mathbf{1}_{q_i}, \mathbf{0}),$

where $\mathbf{1}_{p}$ denotes a vector with $p$ ones followed by zeros. For these manipulations to take effect, we replace the data term in (16) with the following loss function,

(21)   $\ell_Z(\mathbf{P}, \mathbf{C}, \mathbf{Q}) = \sum_i \tfrac{1}{2}\, \big\|\mathbf{M} \odot (\mathbf{X}_i - \mathbf{A})\big\|_F^2.$

Despite the fact that the filters in (19) are separable, they are coupled through the loss (21). This results in an overall inseparable spectral filter that still retains a DMF structure, since (19) is a deeper DMF with two fixed (non-trainable) layers. While the theory developed by Arora et al. (2019) does not cover the case of a multi-layer DMF where only a subset of the layers is trainable, our empirical evaluations encourage us to conjecture that the implicit rank regularization is still in place. This additional regularization allows us to obtain decent reconstruction errors even when the number of measurements is extremely small, as we show in Figure 5.
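A minimal sketch of the filtered loss idea, written with the same symbols as above; the cutoff pairs and the exact placement of the filters are our reading of (19)-(21), given as an illustration rather than a verbatim implementation.

```python
import numpy as np

def lowpass(p, n):
    """Ideal low-pass filter: a diagonal matrix with p ones followed by zeros, cf. (20)."""
    return np.diag((np.arange(n) < p).astype(float))

def filtered_data_loss(P, C, Q, Phi, Psi, A, M, cutoffs):
    """Sum of masked data terms over a bank of separable low-pass filters applied
    around the spectral representation C; the filters are coupled through the sum."""
    loss = 0.0
    for p, q in cutoffs:
        Xi = Phi @ P @ lowpass(p, C.shape[0]) @ C @ lowpass(q, C.shape[1]) @ Q @ Psi.T
        loss += 0.5 * np.sum((M * (Xi - A)) ** 2)
    return loss
```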

Regularization of the individual layers.

Another extension we explore is imposing further regularization on the individual layers. For example, one could ask the reference Laplacian $\mathbf{L}_r$ and its latent counterpart $\tilde{\mathbf{L}}_r$ to be jointly diagonalized by $\mathbf{W}_1$. Using (12)-(13) we get,

(22)   $\mathbf{W}_1^\top \mathbf{L}_r\, \mathbf{W}_1 = \tilde{\boldsymbol{\Phi}}\, \mathbf{P}^\top \boldsymbol{\Lambda}_r\, \mathbf{P}\, \tilde{\boldsymbol{\Phi}}^\top.$

Thus, we can approximately enforce this constraint with the following penalty term,

(23)   $\big\|\mathrm{off}\big(\mathbf{P}^\top \boldsymbol{\Lambda}_r\, \mathbf{P}\big)\big\|_F^2,$

where $\mathrm{off}(\cdot)$ denotes the off-diagonal elements. A similar treatment of the columns graph gives,

(24)   $\big\|\mathrm{off}\big(\mathbf{Q}\, \boldsymbol{\Lambda}_c\, \mathbf{Q}^\top\big)\big\|_F^2.$

We again emphasize that while these penalty terms are not a function of the product matrix, we are encouraged by our experimental results and by the results of Arora et al. (2018) to think that Theorem 1 can be extended to account for these terms as well. We leave these extensions to future work.
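Such penalties amount to driving the off-diagonal part of a small matrix to zero, which is straightforward to implement; a sketch with hypothetical argument names follows.

```python
import numpy as np

def offdiag_penalty(B):
    """Squared Frobenius norm of the off-diagonal part of B. Adding, e.g.,
    offdiag_penalty(P.T @ Lam_r @ P) to the loss softly enforces a
    joint-diagonalization constraint of the kind discussed above."""
    off = B - np.diag(np.diag(B))
    return float(np.sum(off ** 2))
```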

4 Experimental study on synthetic data

Figure 2: In these experiments we generated band-limited (low-rank and smooth) matrices using the synthetic Netflix graphs (see Fig. 10) to test the dependence of SGMC and DMF on the rank of the underlying matrix and on the number of training samples. Left: reconstruction error (on the test set) vs. the rank of the ground-truth matrix. As the rank increases, the reconstruction error increases, but it increases more slowly for SGMC than for DMF. For the training set we used a fixed fraction of the points, chosen at random (the same training set for all experiments). Middle: reconstruction error (on the test set) vs. the density of the sampling set, as a percentage of the number of matrix elements, for a random low-rank matrix. As we increase the number of samples, the gap between DMF and SGMC shrinks. Still, even at the highest sampling density shown, SGMC performs better for the same number of iterations. Right: effective rank (Roy and Vetterli, 2007) vs. training set density, for a random low-rank matrix. Even in extremely data-poor regimes, SGMC was able to recover the effective rank of the ground-truth matrix, whereas DMF underestimates it.

The goal of this section is to compare our approach with vanilla DMF on a simple example of a community-structured graph. We exhaustively compare the following distinct methods:

  • Deep matrix factorization (DMF):

    (25)   $\min_{\mathbf{W}_1, \dots, \mathbf{W}_k} \; \tfrac{1}{2}\, \big\|\mathbf{M} \odot (\mathbf{W}_1 \cdots \mathbf{W}_k - \mathbf{A})\big\|_F^2$
  • Spectral geometric matrix completion (SGMC): The proposed approach defined by the optimization problem (16).

  • Functional Maps (FM, SGMC1): This method is like SGMC with a single trainable layer, i.e., we optimize only for $\mathbf{C}$, while $\mathbf{P}$ and $\mathbf{Q}$ are set to the identity.

We use the graphs taken from the synthetic Netflix dataset. Synthetic Netflix is a small synthetic dataset constructed by (Kalofolias et al., 2014) and (Monti et al., 2017), in which the user and item graphs have a strong community structure. It is useful for conducting controlled experiments to understand the behavior of geometry-exploiting algorithms. In all our tests we use a randomly generated band-limited matrix on the product graph $\mathcal{G}$. For the complete details please refer to the captions of the relevant figures.
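Band-limited test matrices can be generated directly from the two graph eigenbases. The sketch below shows one plausible construction (the exact procedure behind the figures is not spelled out here, so the bandwidth, rank and scaling are our own choices).

```python
import numpy as np

def bandlimited_matrix(Lr, Lc, bandwidth, rank, rng):
    """Random matrix that is simultaneously low-rank and band-limited on the
    product graph: a rank-`rank` combination of the `bandwidth` lowest harmonics."""
    _, Phi = np.linalg.eigh(Lr)
    _, Psi = np.linalg.eigh(Lc)
    coeffs = rng.standard_normal((bandwidth, rank)) @ rng.standard_normal((rank, bandwidth))
    C = np.zeros((Lr.shape[0], Lc.shape[0]))
    C[:bandwidth, :bandwidth] = coeffs          # low-rank coefficients on low frequencies only
    return Phi @ C @ Psi.T
```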

Performance evaluation.

To evaluate the performance of the algorithms in this section, we report the root mean squared error,

(26)   $\mathrm{RMSE}(\mathbf{X}^\star) = \sqrt{\dfrac{\sum_{ij} (\mathbf{M}_{\mathrm{test}})_{ij}\, \big(\mathbf{X}^\star_{ij} - \mathbf{A}_{ij}\big)^2}{\sum_{ij} (\mathbf{M}_{\mathrm{test}})_{ij}}},$

computed on the complement of the training set. Here $\mathbf{X}^\star$ is the recovered matrix and $\mathbf{M}_{\mathrm{test}}$ is the binary mask representing the support of the set on which the RMSE is computed.
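In code, the test metric (26) is a short function over the test mask; a sketch:

```python
import numpy as np

def masked_rmse(X, A, M_test):
    """RMSE of the recovered matrix X against the ground truth A, evaluated only
    on the entries selected by the binary mask M_test, cf. (26)."""
    err = M_test * (X - A)
    return float(np.sqrt(np.sum(err ** 2) / M_test.sum()))
```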

We explore the following aspects:

Figure 3: In this experiment, we study the robustness of SGMC in the presence of noisy graphs. We perturbed the edges of the graphs by adding random Gaussian noise with zero mean and tunable standard deviation to the adjacency matrix. We discarded the edges that became negative as a result of the noise, and symmetrized the adjacency matrix. SGMC1/SGMC2/SGMC3 stand for SGMC with one, two and three trainable spectral factors, respectively (SGMC1 trains only $\mathbf{C}$, while SGMC3 trains $\mathbf{P}$, $\mathbf{C}$ and $\mathbf{Q}$). Left: with clean graphs, all SGMC variants perform well. As the noise increases, the regularization induced by the depth kicks in and there is a clear advantage for SGMC3. For large noise, SGMC3 and DMF achieve practically the same performance. Middle & Right: eigenvalues of the graph Laplacians for different noise levels. Notice the steps in the spectra reflecting the community structure of the graphs. Even for moderately large amounts of noise, the structure of the lower part of the spectrum is preserved, and the effect on the low-frequency (smooth) signal remains small.


Figure 4: In these experiments, we plot the dynamics of the singular values of the product matrix during the gradient descent iterations. We show the convergence of the singular values at different sampling densities (increasing from left to right) for SGMC and DMF. We use the synthetic Netflix graphs, on which we generate a random low-rank matrix. In accordance with Figure 2, we see that SGMC is able to recover the rank even in a very data-poor regime, whereas DMF demands a significantly higher sample complexity.

Sampling density.

We investigate the effect of the number of samples on the reconstruction error and on the effective rank of the recovered matrix (Roy and Vetterli, 2007). We demonstrate that in the data-poor regime the implicit regularization of DMF is too strong, resulting in poor recovery, whereas incorporating geometric regularization through SGMC achieves superior performance. These experiments are summarized in Figure 2.

Initialization.

In all of our experiments we initialize with the balanced initialization (5), using scaled identity matrices $\mathbf{W}_j(0) = \alpha \mathbf{I}$. We explore the effect of the initialization scale $\alpha$ in Figure 11 (in Appendix A).

Rank of the underlying matrix.

We explore the effect of the rank of the underlying matrix, showing that as the rank increases it becomes harder for both SGMC and DMF to recover the matrix. A remarkable property of SGMC is that it obtains a decent approximation of the effective rank of the matrix even with an extremely low number of samples. These experiments are summarized in Figure 2.

Noisy graphs.

We study the effect of noisy graphs on the performance of SGMC. Figure 3 demonstrates that SGMC is able to utilize graphs with substantial amounts of noise before its performance drops to the level of vanilla DMF (which does not rely on any knowledge of the row/column graphs).

Dynamics.

Figure 4 shows the dynamics of the singular values of the product matrix $\mathbf{X}$ during the optimization. We visually verify that they behave according to (7) and that the convergence rate of the relevant singular values is much higher in SGMC than in DMF.

5 Results on recommender systems datasets

We demonstrate the effectiveness of our approach on the following datasets: Synthetic Netflix, Flixster, Douban, Movielens-100K (ML-100K) and Movielens-1M (ML-1M), as referenced in Tables 1 and 2. The datasets include user ratings for items (such as movies) and additional features. For all the datasets we use the user and item graphs taken from Monti et al. (2017). The ML-1M dataset was taken from Berg et al. (2017), for which we constructed 10-nearest-neighbor graphs for users/items from the features, and used a Gaussian kernel for the edge weights. See Table 4 in Appendix A for a summary of the dataset statistics. For all the datasets, we report the results for the same test splits as those of (Monti et al., 2017) and (Berg et al., 2017). The compared methods are referenced in Table 1.

Proposed baselines.

We report the results obtained using the methods discussed above, with the addition of the following method:

  • SGMC-Z: a variant of SGMC that uses (21) as a data term. For this method we choose a maximal spectral cutoff (which can be larger than the number of eigenvectors used by SGMC) and a skip determining the spectral resolution.

In addition, we add the diagonalization terms (23) and (24), each weighted by its own coefficient, to the SGMC/SGMC-Z methods. The optimization is carried out using gradient descent with a fixed step size (i.e., a fixed learning rate), which is provided for each experiment, alongside all the other hyper-parameters, in Table 5.

Initialization.

All our methods are deterministic and did not require multiple runs to account for initialization. We always initialize the matrices with scaled identities, $\alpha \mathbf{I}$. In Figure 11 we report results on the Synthetic Netflix and ML-100K datasets for different values of $\alpha$. According to (Gunasekar et al., 2017; Li et al., 2017), the scale of the initialization plays an important role in the generalization error of DMF; we chose the values of $\alpha$ used for DMF on Synthetic Netflix and on the real-world datasets in accordance with Figure 11 and our own experimentation. In the cases where only one of the bases was available, such as in the Douban and Flixster-user-only benchmarks, we set the basis corresponding to the absent graph to the identity.

Stopping condition.

Our stopping condition for the gradient descent iterations is based on a validation set. We use a fraction of the available entries for training (i.e., to construct the mask $\mathbf{M}$) and the rest for validation. The split was chosen at random. We stop the iterations when the RMSE (26), evaluated on the validation set, changes by less than a small tolerance between two consecutive iterations. Since we did not apply any optimization to the choice of the validation set, we also report the best RMSE achieved on the test set via early stopping. In this regard, the number of iterations is yet another hyper-parameter that has to be tuned for best performance.
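A sketch of this validation-based stopping rule (the tolerance and the function names are illustrative, and `masked_rmse` refers to the sketch given earlier for (26)):

```python
import numpy as np

def train_until_plateau(gd_step, current_X, A, M_val, tol=1e-5, max_iters=100_000):
    """Run gradient-descent updates until the validation RMSE changes by less
    than `tol` between two consecutive iterations."""
    prev = np.inf
    for it in range(max_iters):
        gd_step()                                  # one gradient-descent update of the factors
        rmse = masked_rmse(current_X(), A, M_val)  # validation error of the current product matrix
        if abs(prev - rmse) < tol:
            return it, rmse
        prev = rmse
    return max_iters, prev
```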

Figure 5: Comparison of test RMSE in the presence of cold-start users on the ML-100K dataset. The x-axis corresponds to the number of cold-start users. Red, blue and green correspond to the DMF, SGMC and SGMC-Z methods, respectively, as also shown in the legend. Different marker shapes indicate different maximum numbers of ratings available per cold-start user.

5.1 Cold start analysis

A particularly interesting scenario in the context of recommender systems is the presence of cold-start users, i.e., users who have not yet rated enough movies. We analyze the performance of our method in the presence of such cold-start users on the ML-100K dataset. To generate a dataset consisting of cold-start users, we sort the users according to the number of ratings each user provided, and retain at most a fixed number of ratings (chosen randomly) for the bottom users (i.e., the users who provided the fewest ratings). We repeat this for several choices of the number of cold-start users and of the maximal number of retained ratings per user, and run our algorithms, DMF, SGMC and SGMC-Z, with the same hyper-parameter settings used for obtaining Table 1. We use the official ML-100K test set for evaluation. As before, we use a fraction of the training samples as a validation set for determining the stopping condition. The results presented in Figure 5 suggest that SGMC and SGMC-Z outperform DMF significantly, indicating the importance of the geometry as data becomes scarcer. As expected, the performance drops as the number of ratings per user decreases. Furthermore, SGMC-Z consistently outperforms SGMC by a small margin. We note that even in the presence of cold-start users with very few ratings, SGMC-Z still outperforms the full-data performance of (Monti et al., 2017), demonstrating the strength of the geometric prior and the implicit low-rank regularization induced by SGMC-Z.

Model                                  Synthetic Netflix   Flixster   Douban   ML-100K
MC (Candès and Recht, 2009)
GMC (Kalofolias et al., 2014)
GRALS (Rao et al., 2015)
RGCNN (Monti et al., 2017) [a]
GC-MC (Berg et al., 2017) [b]
FM (ours)
DMF (Arora et al., 2019), (ours) [c][d]
SGMC (ours)
SGMC-Z (ours) [c]

  • (a) This number corresponds to the inseparable version of MGCNN.

  • (b) This number corresponds to GC-MC.

  • (c) Early stopping.

  • (d) Initialization with a different value of $\alpha$.

Table 1: RMSE test set scores for runs on Synthetic Netflix (Monti et al., 2017), Flixster (Jamali and Ester, 2010), Douban (Ma et al., 2011), and Movielens-100K (Harper and Konstan, 2016). For Flixster, we show results for both user/item graphs (right number) and user graph only (left number). Baseline numbers are taken from (Monti et al., 2017; Berg et al., 2017).

Scalability.

All the experiments presented in the paper were conducted on a machine with 64GB of CPU memory and an NVIDIA RTX 2080 Ti GPU. Most of our large-scale experiments take up to 10-30 minutes until convergence, and are therefore rather quick. In this work we focused on the conceptual idea of solving matrix completion via the framework of deep matrix factorization with geometric regularization, paying little attention to the issue of scalability. The dependence of our method on an eigenvalue decomposition might hinder its scalability prospects. While this did not pose a problem for the small datasets we used in this report, it is nonetheless an issue that we intend to address in future work. We believe that our approach can be carefully transformed into the spatial domain, where the sparse structure of the Laplacian matrix can be exploited to tackle scalability issues.

5.2 Discussion

A few remarkable observations can be extracted from Table 1. First, on the Douban and ML-100K datasets, vanilla DMF shows performance competitive with all the other methods. This suggests that the geometric information is not very useful for these datasets. Second, the proposed SGMC algorithms outperform the other methods, despite their simple and fully linear architecture. This suggests that the other geometric methods do not exploit the geometry properly, and that this fact is obscured by their cumbersome architectures. Third, while some of the experiments reported in Table 1 show only slight margins in favor of SGMC/SGMC-Z compared to DMF, the results in the Synthetic Netflix column, those reported on Synthetic Movielens-100K (Table 3 in Appendix A) and those reported in Figure 2 suggest that when the geometric model is accurate, our methods demonstrate superior results. Table 2 in Appendix A presents the results on Movielens-1M. First, we can deduce that the vanilla DMF model is able to match the performance of complex alternatives. Furthermore, using graphs produces slight improvements over the DMF baseline and overall provides competitive performance compared to heavily engineered methods. On Synthetic Netflix, we notice that by using SGMC we outperform Monti et al. (2017) by a significant margin, reducing the test RMSE by half. Furthermore, it can be observed that DMF performs poorly on both synthetic datasets compared to SGMC/SGMC-Z, raising a question as to the quality of the graphs provided with those datasets on which DMF performed comparably.

A compelling argument for this behaviour is given by Table 4 in Appendix A. We can see that in the real datasets we tested on, the number of available samples is far below the density required by DMF to achieve good performance, in accordance with our findings in Section 4. With high-quality graphs, we would have expected SGMC to outperform DMF by a large margin. Our conclusion is that while geometric matrix completion algorithms may seem like gold, in the absence of enough data and good geometric priors, they are just fool's gold.

6 Related work

Geometric matrix completion.

There is a vast literature on classical approaches to matrix completion, and covering it is beyond the scope of this paper. In recent years, the advent of deep learning platforms equipped with efficient automatic differentiation tools has allowed the exploration of sophisticated models that incorporate intricate regularizations. Some of these contemporary approaches to matrix completion fall under the umbrella term of geometric deep learning, which generalizes standard (Euclidean) deep learning to domains such as graphs and manifolds. For example, graph convolutional neural networks (GCNNs) follow the architecture of standard CNNs, but replace the Euclidean convolution operator with linear filters constructed from the graph Laplacian. We distinguish between graph-based approaches, which make use of the bipartite graph structure of the rating matrix (e.g., Berg et al. (2017)), and geometric matrix completion techniques, which make use of side information in the form of graphs encoding relations between rows/columns (Kovnatsky et al., 2014; Kalofolias et al., 2014; Monti et al., 2017).

More recently, it has been demonstrated that some graph CNN architectures can be greatly simplified and still perform competitively on several graph analysis tasks (Wu et al., 2019). Such simple techniques have the advantage of being easier to analyze and reproduce. One of the simplest notable approaches is deep linear networks, i.e., networks comprising only linear layers. While these networks are still mostly used for theoretical investigations, we note the recent results of Bell-Kligler et al. (2019), who successfully employed such a network for the task of blind super-resolution kernel estimation.

Product manifold filter & Zoomout.

The inspiration for our paper stems from techniques for finding shape correspondence, in particular the functional maps framework and its variants (Ovsjanikov et al., 2012, 2016). Most notable are the work of Litany et al. (2017), who combined functional maps with joint diagonalization to solve partial shape matching problems, and the product manifold filter (PMF) (Vestner et al., 2017a, b) and ZoomOut (Melzi et al., 2019), two greedy algorithms for correspondence refinement via the gradual introduction of high frequencies.

7 Conclusion

In this work we have proposed a simple spectral technique for matrix completion, building upon recent practical and theoretical results in geometry processing and deep linear networks. We have shown, through extensive experimentation on real and synthetic datasets, that combining the implicit regularization of DMF with explicit, and possibly noisy, geometric priors can be extremely useful in data-poor regimes. Our work is a step towards building interpretable models that are grounded in theory, and demonstrates that such simple models need not be considered only for theoretical study. With the proper glasses, they can be made useful.

References

  • S. Arora, N. Cohen, and E. Hazan (2018) On the optimization of deep networks: implicit acceleration by overparameterization. External Links: 1802.06509 Cited by: §2, §3.1.
  • S. Arora, N. Cohen, W. Hu, and Y. Luo (2019) Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655. Cited by: Table 2, §1, §2, §2, §3.1, Table 1, Lemma 1, Theorem 1.
  • S. Bell-Kligler, A. Shocher, and M. Irani (2019) Blind super-resolution kernel estimation using an internal-GAN. In Advances in Neural Information Processing Systems 32, pp. 284–293. Cited by: §6.
  • R. v. d. Berg, T. N. Kipf, and M. Welling (2017) Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263. Cited by: Table 2, Table 1, §5, §6.
  • E. J. Candès and B. Recht (2009) Exact matrix completion via convex optimization. Foundations of Computational mathematics 9 (6), pp. 717. Cited by: §1, §1, Table 1.
  • G. K. Dziugaite and D. M. Roy (2015) Neural network matrix factorization. CoRR abs/1511.06443. External Links: Link, 1511.06443 Cited by: Table 2.
  • A. Gadde, S. K. Narang, and A. Ortega (2013) Bilateral filter: graph spectral interpretation and extensions. In 2013 IEEE International Conference on Image Processing, pp. 1222–1226. Cited by: §2.
  • S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro (2017) Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6151–6159. Cited by: §1, §5.
  • F. M. Harper and J. A. Konstan (2016) The movielens datasets: history and context. Acm transactions on interactive intelligent systems (tiis) 5 (4), pp. 19. External Links: Link Cited by: Table 1.
  • M. Jamali and M. Ester (2010) A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the fourth ACM conference on Recommender systems, pp. 135–142. Cited by: Table 1.
  • V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst (2014) Matrix completion on graphs. arXiv preprint arXiv:1408.1717. Cited by: §4, Table 1, §6.
  • Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37. External Links: ISSN 0018-9162 Cited by: Table 2.
  • A. Kovnatsky, M. M. Bronstein, X. Bresson, and P. Vandergheynst (2014) Functional correspondence by matrix completion. External Links: 1412.8070 Cited by: §6.
  • J. Lee, S. Kim, G. Lebanon, Y. Singer, and S. Bengio (2016) LLORMA: local low-rank matrix approximation. Journal of Machine Learning Research 17 (15), pp. 1–24. Cited by: Table 2.
  • Y. Li, T. Ma, and H. Zhang (2017) Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. arXiv preprint arXiv:1712.09203. Cited by: §5.
  • O. Litany, E. Rodolà, A. M. Bronstein, and M. M. Bronstein (2017) Fully spectral partial shape matching. In Computer Graphics Forum, Vol. 36, pp. 247–258. Cited by: §6.
  • H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King (2011) Recommender systems with social regularization. In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 287–296. Cited by: Table 1.
  • S. Melzi, J. Ren, E. Rodola, M. Ovsjanikov, and P. Wonka (2019) ZoomOut: spectral upsampling for efficient shape correspondence. arXiv preprint arXiv:1904.07865. Cited by: §6.
  • F. Monti, M. Bronstein, and X. Bresson (2017) Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems, pp. 3697–3707. Cited by: Figure 10, §4, §5.1, §5.2, Table 1, §5, §6.
  • M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. Guibas (2012) Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG) 31 (4), pp. 30. Cited by: §6.
  • M. Ovsjanikov, E. Corman, M. Bronstein, E. Rodolà, M. Ben-Chen, L. Guibas, F. Chazal, and A. Bronstein (2016) Computing and processing correspondences with functional maps. In SIGGRAPH ASIA 2016 Courses, pp. 9. Cited by: §6.
  • N. Rao, H. Yu, P. K. Ravikumar, and I. S. Dhillon (2015) Collaborative filtering with graph information: consistency and scalable methods. In Advances in neural information processing systems, pp. 2107–2115. Cited by: Table 1.
  • A. Recanati (2018) Relaxations of the seriation problem and applications to de novo genome assembly. Ph.D. Thesis. Cited by: footnote 2.
  • O. Roy and M. Vetterli (2007) The effective rank: a measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pp. 606–610. Cited by: Figure 2, §4.
  • R. Salakhutdinov, A. Mnih, and G. Hinton (2007) Restricted boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, New York, NY, USA, pp. 791–798. External Links: ISBN 978-1-59593-793-3 Cited by: Table 2.
  • R. Salakhutdinov and A. Mnih (2007) Probabilistic matrix factorization. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, USA, pp. 1257–1264. External Links: ISBN 978-1-60560-352-0 Cited by: Table 2.
  • S. Sedhain, A. K. Menon, S. Sanner, and L. Xie (2015) AutoRec: autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, New York, NY, USA, pp. 111–112. External Links: ISBN 978-1-4503-3473-0 Cited by: Table 2.
  • D. Spielman (2009) Spectral graph theory. Lecture Notes, Yale University, pp. 740–0776. Cited by: §2.
  • M. Vestner, Z. Lähner, A. Boyarski, O. Litany, R. Slossberg, T. Remez, E. Rodola, A. Bronstein, M. Bronstein, R. Kimmel, et al. (2017a) Efficient deformable shape correspondence via kernel matching. In 2017 International Conference on 3D Vision (3DV), pp. 517–526. Cited by: §6.
  • M. Vestner, R. Litman, E. Rodolà, A. Bronstein, and D. Cremers (2017b) Product manifold filter: non-rigid shape correspondence via kernel density estimation in the product space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3327–3336. Cited by: §6.
  • F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger (2019) Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153. Cited by: §6.
  • Y. Zheng, B. Tang, W. Ding, and H. Zhou (2016) A neural autoregressive approach to collaborative filtering. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 764–773. Cited by: Table 2.

Appendix A Appendix

Ablation study.

We study the effects of the different hyper-parameters of the algorithms on the final reconstruction of the matrix, performing an ablation study on DMF, SGMC and SGMC-Z. The results are summarized in Figures 6, 7 and 8. It is interesting to note that in the case of DMF and SGMC, overparametrization consistently improves the performance (see Figure 8), but only up to a certain point, beyond which further overparametrization does not seem to affect the reconstruction error. Notice that in Table 5, one pair of weights controls the Dirichlet energy of the rows and columns, while another pair governs the weights of the row/column diagonalization terms.

Synthetic MovieLens-100K.

While the experiments reported in Table 1 showed only slight margins in favor of the methods using geometry, we further experimented with a synthetic model generated from the ML-100K dataset. The purpose of this experiment is to investigate whether the results are due to the DMF model or due to the geometry as incorporated by SGMC/SGMC-Z. The synthetic model was generated by projecting the rating matrix on the first 50 eigenvectors of the user and item graph Laplacians, and then matching the ratings histogram with that of the original ML-100K dataset. This nonlinear operation increased the rank of the matrix from 50 to a somewhat larger value. See Figure 9 in the Appendix for a visualization of the full matrix, the singular value distribution and the users/items graphs. The test set and training set were generated randomly and are of the same size as those of the original dataset. The results reported in Table 3 and those in the Synthetic Netflix column of Table 1 clearly indicate that SGMC/SGMC-Z outperform DMF, suggesting that when the geometric model is accurate it is possible to use it to improve the results.

Model ML-1M
PMF (Salakhutdinov and Mnih, 2007)
I-RBM (Salakhutdinov et al., 2007)
BiasMF (Koren et al., 2009)
NNMF (Dziugaite and Roy, 2015)
LLORMA-Local (Lee et al., 2016)
I-AUTOREC (Sedhain et al., 2015)
CF-NADE (Zheng et al., 2016)
GC-MC (Berg et al., 2017)
DMF (Arora et al., 2019), (ours)
SGMC (ours)
Table 2: Comparison of test RMSE scores on Movielens-1M dataset. Baseline scores are taken from (Zheng et al., 2016; Berg et al., 2017)
Model Synthetic ML-100K
DMF
SGMC
SGMC-Z
Table 3: Comparison of average RMSE of DMF, SGMC and SGMC-Z baselines calculated on 5 randomly generated Synthetic Movielens-100K datasets.

Dataset (available graphs)          Users   Items   Features   Ratings   Density   Rating levels
Flixster (users/items)
Douban (users)
MovieLens-100K (users/items)
MovieLens-1M (users/items)
Synthetic Netflix (users/items) [a]
Synthetic ML-100K (users/items)

Table 4: Number of users, items and ratings for Flixster, Douban, Movielens-100K, Movielens-1M, Synthetic Netflix and Synthetic Movielens-100K datasets used in our experiments and their respective rating density and rating levels.
  • (a) The ratings are not integer-valued.

Figure 6: Ablating two of the hyper-parameters of SGMC on the ML-100K dataset. The rest of the parameters were set to the ones reported in Table 5. The green X denotes the baseline from Table 1.
Figure 7: Ablating two of the hyper-parameters of SGMC-Z on the ML-100K dataset. The rest of the parameters were set to the ones reported in Table 5.
Figure 8: Effect of overparametrization: SGMC (left) and DMF (right). The x-axis indicates the degree of overparametrization and the y-axis the RMSE. The green X denotes the baseline from Table 1.
Figure 9: Synthetic Movielens-100k. Top-left: Full matrix. Top-right: singular values of the full matrix. Bottom left & right: items & users graph. Both graphs are constructed using 10 nearest neighbors.
Figure 10: Synthetic Netflix. Top-left: Full matrix. Top-right: singular values of the full matrix. Bottom left & right: items & users graph. Taken from (Monti et al., 2017).
Dataset Method
DMF
FM
SGMC
SGMC-Z
DMF
Flixster SGMC
SGMC-Z
Flixster SGMC
(users only) SGMC-Z
DMF
Douban SGMC
SGMC-Z
DMF
ML-100K SGMC
SGMC-Z
DMF
ML-1M SGMC
DMF
SGMC
SGMC-Z
Table 5: Hyper-parameter settings for the algorithms: DMF, SGMC and SGMC-Z, reported in Tables 1, 2, 3.
Figure 11: Reconstruction error (on the test set) vs. the scale of the initialization. For each method we initialized the factors with scaled identities $\alpha \mathbf{I}$ for a range of values of $\alpha$. SGMC consistently outperforms DMF for any initialization.