Recovery Guarantees for Quadratic Tensors with Limited Observations

10/31/2018 ∙ Hongyang Zhang et al. ∙ University of Wisconsin-Madison, Stanford University

We consider the tensor completion problem of predicting the missing entries of a tensor. The commonly used CP model has a triple product form, but an alternative family of quadratic models, which sum pairwise products instead of taking a triple product, has emerged from applications such as recommendation systems. Non-convex methods are the method of choice for learning quadratic models, and this work examines their sample complexity and error guarantees. Our main result is that, with a number of samples that is only linear in the dimension, all local minima of the mean squared error objective are global minima and recover the original tensor accurately. The techniques lead to simple proofs showing that convex relaxation can also recover quadratic tensors given a linear number of samples. We substantiate our theoretical results with experiments on synthetic and real-world data, showing that quadratic models have better performance than CP models in scenarios with a limited number of observations.


1 Introduction

Tensors provide a natural way to model higher order data [1, 2, 3, 4, 5]. They have applications in recommendation systems [6, 7, 8], knowledge base completion [9, 10, 11], predicting geo-location trajectories [12] and so on. Most tensor datasets encountered in the above settings are not fully observed. This leads to tensor completion, the problem of predicting the missing entries, given a small number of observations from the tensor [2, 13]. In order to recover the missing entries, it is important to take into account the data efficiency of the tensor completion model.

One of the most well-known tensor models is the CANDECOMP/PARAFAC or CP decomposition [2]. For a third-order tensor, the CP model expresses the tensor as a sum of rank-one tensors, i.e., tensor products of three vectors. The tensor completion problem of learning a CP decomposition has received a lot of attention recently [14, 15, 16]. It is commonly believed that reconstructing a third-order d-dimensional tensor in polynomial time requires on the order of d^{3/2} samples [14, 17]. This requirement holds even for low-rank tensors, for which a number of samples linear in d is information-theoretically sufficient for recovery. The sample requirement of CP decomposition limits its usefulness for sparsely observed tensors in practice [10, 12, 18]. While regularization may be helpful when there are limited observations, adding strong regularization also hurts optimization performance.

On the other hand, an alternative family of quadratic tensor models has emerged from applications in recommendation systems [6] and knowledge base completion [11]. The pairwise interaction model has demonstrated strong performance for the personalized tag recommendation problem [6, 7, 8]. In this model, the (i, j, k)-th entry of a tensor is viewed as the sum of pairwise inner products: T_ijk = ⟨u_i, v_j⟩ + ⟨u_i, w_k⟩ + ⟨v_j, w_k⟩, where u_i, v_j, w_k correspond to the embeddings of the coordinates. As another example, the translating embedding model [9] for knowledge base completion can be (implicitly) viewed as solving tensor completion with a quadratic model. Suppose that u_i, v_j are the embeddings of two entities and w_k is the embedding of a relation; then the smaller ‖u_i + w_k − v_j‖ is, the more likely it is that the two entities are related by the relation. To be concrete, we formalize the notion of quadratic tensors as:

T_ijk = f(u_i, v_j, w_k),

where u_i, v_j, w_k are the embedding vectors and f denotes a quadratic function. Both the pairwise interaction model and the translational embedding model correspond to specific choices of f.

It is known that for the special case of pairwise interaction tensors, a number of samples linear in the dimension is enough to recover the tensor via convex relaxation [19]. However, in practice non-convex methods are the predominant choice for training quadratic models, because non-convex methods such as alternating minimization and gradient descent scale to very large datasets. Despite this practical success, it has been a major challenge to theoretically analyze the performance of non-convex methods. In this work, we present the first recovery guarantee of non-convex methods for learning quadratic tensors. Beyond the motivation of quadratic tensors, our work joins a line of recent work on understanding the optimization landscape of non-convex low-rank problems [20, 21, 22, 23]. Our results show that, under a suitable rank condition, quadratic tensor completion enjoys the property that all local minima of its natural non-convex formulation are global minima.

Main Results.

Assume that we observe m entries of T uniformly at random, and denote the set of observed entries by Ω. Consider the natural least squares minimization problem

min over U, V, W of (1/|Ω|) Σ_{(i,j,k) ∈ Ω} ( f(u_i, v_j, w_k) − T_ijk )² + R(U, V, W),

where R includes weight decay and other regularizers (see Section 3 for the precise definition). Note that the objective is in general non-convex, since it generalizes the matrix completion setting (for example, when f(x, y, z) = ⟨x, y⟩ the problem reduces to matrix completion). We show that, as long as the number of observed entries is linear in the dimension, all local minima can be used to reconstruct the ground truth accurately.

Theorem 1 (informal). Assume that the rows of the ground-truth factors U*, V*, W* have norm bounded by Λ. Let ε be the desired accuracy. For the regularized objective above, as long as the number of observed entries m is linear in the dimension d (up to factors depending on the rank, Λ, and 1/ε), all local minima of the objective can be used to reconstruct a tensor T̂ whose average squared error against T is at most ε.

In the incoherent setting, where the row norms of the factors are bounded by a small constant, every tensor entry is bounded by a constant. Our results imply that the average recovery error is on the order of the desired accuracy ε; hence we recover most tensor entries up to a small relative error. Our result applies to any quadratic tensor, whereas the previous result based on convex relaxation only applies to pairwise interaction tensors [19]. An additional advantage is that our approach does not require a low-rank assumption for recovery.

Our technique is based on over-parameterizing the search space, i.e., optimizing over factors whose rank is larger than the true rank. We show that after over-parameterization, the training objective has no bad local minimum, hence any local minimum achieves small training error. The regularizer is then used to ensure that the generalization error over the entire tensor is small, given just a linear number of samples from T. Since the result applies to any local minimum, it conceptually has implications for any non-convex method, such as alternating least squares and gradient descent. Meanwhile, our techniques lead to simple proofs showing that convex relaxation methods can also recover quadratic tensors accurately given a linear number of samples. See Section 3 for details.

Experiments.

We substantiate our theoretical results with experiments on synthetic and real-world tensors. Our synthetic experiments validate our theory that non-convex methods and convex relaxation can recover quadratic tensors with a linear number of samples. We also show that non-convex methods such as alternating least squares scale gracefully as the dimension of the tensor grows, since the number of iterations needed remains small.

We then compare the CP model and the quadratic model, both solved using regularized alternating least squares, on three real-world datasets. The first dataset consists of about 10 million movie ratings over time (Movielens-10M); the task is to predict movie ratings by completing the missing entries of the tensor. We found that the quadratic model outperforms CP decomposition on this task.

The second dataset consists of a word tri-occurrence tensor comprising the most frequent 2000 English words. We learn word embeddings from the tensor using both the quadratic model and the CP model, and evaluate the embeddings on standard NLP tasks. The quadratic model is more accurate than the CP model.

Lastly, we consider recovering a hyperspectral image given incomplete pixel observations. In this experiment, we vary the fraction of observations available and compare the generalization performance of the quadratic model and the CP model, both trained with regularized alternating least squares. We observe that when the fraction of observations is small (less than 1% of the entries), the quadratic model achieves lower test error than the CP model. On the other hand, when the fraction of observations becomes an order of magnitude larger, the CP model recovers the hyperspectral image more accurately.

The experimental results indicate that the quadratic model is better suited than the CP model to sparse, high-dimensional datasets. The image experiment makes explicit that the improvement stems from the quadratic model's better generalization performance.


In conclusion, we show that given just a linear number of samples from a quadratic tensor, any local minimum of the natural non-convex formulation can recover the tensor accurately. Empirically, quadratic models enjoy superior performance compared to the CP model when solved with regularized alternating least squares on the non-convex formulation. Together, these results indicate that the quadratic model may be a viable alternative in practical settings where only a limited number of observations is available.

Organization.

The rest of the paper is organized as follows. In Section 2 we define the quadratic model more formally and review related work. In Section 3 we present our main theoretical result. In Section 4 we evaluate the non-convex formulation for solving quadratic models on both synthetic and real world datasets. We conclude in Section 5 and describe a few questions for future work. The Appendix contains missing proofs from Section 3.

2 Preliminaries

2.1 Notations

Given a positive integer n, let [n] denote the set of integers from 1 to n. For a matrix U, let u_i denote the i-th row vector of U, for any i. We write M ⪰ 0 to denote that M is positive semi-definite. Denote by S^{d×d} the set of symmetric matrices of size d by d, and by S_+^{d×d} the set of d by d positive semidefinite matrices. Let ‖·‖ denote the Euclidean norm of a vector and the spectral norm of a matrix. Let ‖·‖_F denote the Frobenius norm (the square root of the sum of squared entries) of a matrix or a tensor. Let ‖·‖_1 denote the ℓ_1 norm of a matrix or tensor, i.e., the sum of the absolute values of its entries. For two matrices A and B of the same size, we define the inner product ⟨A, B⟩ = Σ_{i,j} A_{ij} B_{ij}. For three matrices U, V, W, denote by [U; V; W] the three matrices stacked vertically.

Given an objective function g over p variables, we use ∇g to denote the gradient of g, and ∇²g to denote the Hessian matrix of g, which is of size p by p.

We write a ≲ b if there exists an absolute constant C such that a ≤ C·b.

2.2 Quadratic Models

We now define the quadratic model more formally, with examples. Recall that T is a third-order tensor, composed by a quadratic function f over three factor matrices U, V, W. (Footnote 1: We assume that the three dimensions all have the same size d in order to simplify the notation; it is not hard to extend our results to the more general case where different dimensions have different sizes. Also, we focus on third-order tensors for ease of presentation; it is straightforward to extend the quadratic model to higher orders.) In the introduction we defined f as a function on real values; we now overload the notation and define f to work over vectors as well. More specifically, for vectors x, y, z,

f(x, y, z) = ⟨A, S S^⊤⟩,  where S = [x; y; z].

Recall that [x; y; z] is the matrix with the rows x, y, z stacked vertically. Here the 3 × 3 kernel matrix A encodes the similarity/dissimilarity represented by f between the input vectors. Different choices of A represent different quadratic models; for example, the pairwise interaction model below corresponds to A with 1/2 on the off-diagonal entries and 0 on the diagonal. We assume that A is a symmetric matrix without loss of generality, since we can always symmetrize A without changing f. The coefficients of A are assumed to be fixed constants that do not grow with the dimension. We now describe two quadratic models which are known from the literature.

The Pairwise Interaction Tensor Model [6] was proposed in the context of tag recommendation, e.g., suggesting a set of tags that a user is likely to use for an item. The pairwise model scores the triple (i, j, k) with the following measure:

f(u_i, v_j, w_k) = ⟨u_i, v_j⟩ + ⟨u_i, w_k⟩ + ⟨v_j, w_k⟩.

For this model, the kernel matrix A has 1/2 on all off-diagonal entries and 0 on the diagonal entries. In the tag recommendation setting, u_i, v_j, and w_k correspond to embeddings for the i-th user, j-th item, and k-th tag, respectively. The pairwise interaction model captures two-way interactions between users, items, and tags to predict whether user i is likely to use tag k for item j.

The Translational Embedding Model (a.k.a. TransE) [9] is well studied in the knowledge base completion problem, e.g., inferring relations between entities. The TransE model scores a triple (i, j, k) with

f(u_i, v_j, w_k) = ‖u_i + w_k − v_j‖².

Intuitively, the smaller ‖u_i + w_k − v_j‖ is, the more likely it is that entities i and j are related by relation k.

The idea here is that if adding the embedding for Italy to the embedding for the capital of relationship results in a vector close to the embedding for Rome, then Rome and Italy are likely to be linked by the capital of relation.
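To make the kernel-matrix view concrete, here is a small Python/NumPy sketch (our illustration; the helper names are not from the paper) that evaluates f(x, y, z) = ⟨A, SS^⊤⟩ and instantiates the 3 × 3 kernels reproducing the two scores above; the specific kernel entries are derived from the formulas rather than quoted from the paper.

    import numpy as np

    def quadratic_entry(A, x, y, z):
        """f(x, y, z) = <A, S S^T>, where S has x, y, z as its rows (3 x k)."""
        S = np.vstack([x, y, z])
        return np.sum(A * (S @ S.T))

    # Pairwise interaction model: 1/2 on the off-diagonal entries, 0 on the diagonal.
    A_pairwise = 0.5 * (np.ones((3, 3)) - np.eye(3))
    # Squared-distance TransE-style score ||x + z - y||^2 written as a quadratic form.
    A_transe = np.array([[ 1., -1.,  1.],
                         [-1.,  1., -1.],
                         [ 1., -1.,  1.]])

    k = 4
    rng = np.random.default_rng(0)
    x, y, z = (rng.standard_normal(k) for _ in range(3))
    assert np.isclose(quadratic_entry(A_pairwise, x, y, z), x @ y + x @ z + y @ z)
    assert np.isclose(quadratic_entry(A_transe, x, y, z), np.sum((x + z - y) ** 2))

Both assertions hold exactly, which is one way to check a candidate kernel matrix against the intended quadratic score.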

2.3 Related Work

We first review existing approaches for analyzing non-convex low-rank problems. One line of work focuses on the geometry of the non-convex problem and shows that, as long as the current solution is not optimal, a direction of improvement can be found [20, 21, 22, 23, 24]. There are a few technical difficulties in applying this line of work to our setting. One difficulty is asymmetry: our setting requires recovering three different sets of parameters, and existing analyses of alternating least squares do not seem to apply because of this asymmetry either [25]. A second difficulty is that multiple factor matrices correspond to the same quadratic tensor in our setting, so it is not clear which factor matrices gradient descent converges to. Another line of work builds on an interesting connection between SDPs and their Burer-Monteiro low-rank formulations [21]. However, those results do not directly apply to our setting because our non-convex formulation is unconstrained. Recent work has applied this connection to analyzing over-parameterization in one-hidden-layer neural networks with quadratic activations [26]. Our techniques are inspired by this work; however, our setting is fundamentally different, because we need to take into account the incoherence of the factor matrices, and hence we add an incoherence regularizer [20, 21]. We refer the reader to Section 3 for more technical details.

Next we briefly review related work on tensor completion. One approach is to flatten the tensor into a matrix, or to treat each slice of the tensor as a low-rank matrix individually, and then apply matrix completion methods [27, 28, 29]. Other models, such as RESCAL [3] and Tucker-based methods [2], have also been considered for tensor completion tasks. We refer the interested reader to a recent survey for more information [13].

Latent variable models:

We mention a few related works on matrix factorization that inspired this work. [C15, SLLS16] propose a latent variable model approach for collaborative filtering: every user and item is represented by a latent representation, and a kernel function determines a user's preference for an item given their latent representations. [SLLS16] characterize general conditions under which it is possible to learn the latent representations. [GL16] consider computing graph embeddings for the link prediction problem; they consider different operators for learning edge features and find the inner product to perform best in their experiments.

Algorithms for Computing CP Decompositions:

We focus on CP decomposition here and refer the reader to Kolda and Bader [2] for other tensor decomposition models such as the Tucker decomposition. There has been a line of recent research on provably recovering random and smoothed tensors using algorithms such as gradient descent and alternating minimization [anandkumar2014guaranteed, GM17, 18], which are by far the most popular algorithms for CP decomposition [2]. The sum-of-squares framework has also emerged as a powerful theoretical tool for understanding tensor decompositions [MSS16, tang2015guaranteed, hopkins2016fast, SS17]. A long line of work goes back to simultaneous diagonalization and higher-order SVD based approaches [de2006link], and recent work has attempted to make these algorithms more robust to noise and more scalable [BCMV14, SRT15, KCL15, colombo2016tensor]. Recently, there has also been significant interest in developing more computationally efficient and scalable algorithms for CP decomposition [PFS12, wang2015fast, SWZ16].

3 Recovery Guarantees

In this section, we consider the recovery of quadratic tensors under partial observations. Recall that we observe m entries uniformly at random from an unknown quadratic tensor T. Let Ω denote the indices of the observed entries. Given the observed entries {T_ijk : (i, j, k) ∈ Ω}, our goal is to recover T accurately.

First, we review the definition of local optimality conditions.

(Local minimum) Suppose that X is a local minimum of a twice-differentiable function g; then ∇g(X) = 0 and ∇²g(X) ⪰ 0.

We focus on the following non-convex least squares formulation with variables U, V, W ∈ R^{d×r}, which model the true parameters U*, V*, W* ∈ R^{d×k}; the search rank r may be larger than the true rank k. In this setting, we assume that the quadratic function f is already known. This is essentially without loss of generality, since our approach also applies to the case when f is unknown, using the same proof technique. The objective is

L(U, V, W) = (1/|Ω|) Σ_{(i,j,k) ∈ Ω} ( f(u_i, v_j, w_k) − T_ijk )² + R(U, V, W) + ⟨M, X X^⊤⟩,  where X = [U; V; W].

Let us unpack the above function. The first term corresponds to the natural MSE over the observed entries Ω. Next we have the regularizer R: its role is to penalize any row of U, V, or W whose norm is larger than Λ, the bound from our assumption on the ground-truth factors, and it is not hard to verify that R is twice differentiable. Last, M is a random PSD matrix with spectral norm at most δ, so the term ⟨M, XX^⊤⟩ can be viewed as a small perturbation of the loss surface. This perturbation will be important for ruling out unlikely cases in our analysis, as we will see later. Our main result is described below.
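As a hedged illustration of the three terms just described (a sketch under our own assumptions, not the paper's reference implementation; in particular, the fourth-power hinge used for the row-norm penalty is only one twice-differentiable choice), the objective can be written as:

    import numpy as np

    def quadratic_entry(A, x, y, z):
        S = np.vstack([x, y, z])
        return np.sum(A * (S @ S.T))

    def objective(U, V, W, A, omega, vals, lam, Lam, M):
        """Sketch: MSE over observed entries + row-norm penalty + <M, X X^T>."""
        X = np.vstack([U, V, W])                      # stacked factors, shape (3d, r)
        mse = np.mean([(quadratic_entry(A, U[i], V[j], W[l]) - t) ** 2
                       for (i, j, l), t in zip(omega, vals)])
        # penalize rows whose norm exceeds Lam; the 4th-power hinge keeps the
        # penalty twice differentiable (the exact functional form is an assumption)
        excess = np.maximum(np.linalg.norm(X, axis=1) - Lam, 0.0)
        penalty = lam * np.sum(excess ** 4)
        perturb = np.sum(M * (X @ X.T))               # <M, X X^T>, M a small random PSD matrix
        return mse + penalty + perturb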

Theorem 3. Let T be a quadratic tensor defined by factors U*, V*, W* ∈ R^{d×k} and a quadratic function f. Assume that every row of U*, V*, and W* has norm at most Λ. We are given a uniformly random subset Ω of m entries from T. Under appropriate choices of the search rank, the regularization parameters, and the perturbation level δ, for any local minimum (U, V, W) of L, with high probability over the randomness of Ω and M, the reconstructed tensor T̂ with entries T̂_ijk = f(u_i, v_j, w_k) satisfies an average-error bound that decreases with m; in particular, m linear in d (up to factors depending on k, Λ, and the target accuracy ε) suffices to recover T with average squared error at most ε.

Note that Theorem 1 simply follows from Theorem 3 by setting the parameters appropriately.

For a concrete example of the recovery guarantee, suppose that the entries of U*, V*, W* are all sampled independently from a Gaussian distribution. In this case it is easy to verify the bound Λ on the row norms. Hence, when m is linear in d, the average recovery error of T̂ is small relative to the typical magnitude of the entries of T (which is determined by the quadratic model), so our theorem shows that most tensor entries are accurately recovered up to a small relative error.

Next we give an overview of the technical insight. The first complication in analyzing such an objective is that the three factors are asymmetric. To simplify the analysis, we first reduce the problem to a symmetric one by viewing the search space as the stacked matrix X = [U; V; W] instead. We then show that all local minima of the objective are global minima. Here is where we crucially use the random perturbation matrix M: it is necessary to rule out a probability-zero event which may contain non-global minima. While the idea of adding a random perturbation is inspired by the work of Du and Lee [26], the adaptation to our setting is new and requires careful analysis. In the last part, we use the regularizer R to argue that all local minima are incoherent and have small Frobenius norm. Based on these two facts, we use Rademacher complexity to bound the generalization error. We now go into the details of the proof.

Local optimality.

Before proceeding, we introduce several notations. Denote by X = [U; V; W] ∈ R^{3d×r} the three factors stacked vertically. For each triple (i, j, k), denote by A_{ijk} ∈ R^{3d×3d} a sensing matrix such that f(u_i, v_j, w_k) = ⟨A_{ijk}, XX^⊤⟩. Specifically, A_{ijk} restricted to the row and column indices {i, d + j, 2d + k} is equal to A (the 3 × 3 kernel matrix of f), and is zero otherwise. We can rewrite the objective more concisely as follows,

L(X) = (1/|Ω|) Σ_{(i,j,k) ∈ Ω} ( ⟨A_{ijk}, X X^⊤⟩ − T_ijk )² + R(X) + ⟨M, X X^⊤⟩,

where X = [U; V; W]. We will use the following proposition in the proof.

Proposition (Proposition 4 in Bach et al. [30]). Let h be a twice differentiable convex function over symmetric matrices. If the function X ↦ h(XX^⊤) has a local minimum at a rank-deficient matrix X, then XX^⊤ is a global minimum of h over the PSD cone.
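A quick numerical sanity check of this reformulation (a sketch with our own variable names; the pairwise-interaction kernel is used as the example f):

    import numpy as np

    def sensing_matrix(A, i, j, l, d):
        """A_ijl in R^{3d x 3d}: the 3 x 3 kernel A placed at rows/columns {i, d+j, 2d+l}."""
        M = np.zeros((3 * d, 3 * d))
        idx = [i, d + j, 2 * d + l]
        M[np.ix_(idx, idx)] = A
        return M

    d, k = 6, 3
    rng = np.random.default_rng(1)
    U, V, W = (rng.standard_normal((d, k)) for _ in range(3))
    X = np.vstack([U, V, W])
    A = 0.5 * (np.ones((3, 3)) - np.eye(3))          # pairwise-interaction kernel
    i, j, l = 2, 4, 1
    lhs = np.sum(sensing_matrix(A, i, j, l, d) * (X @ X.T))
    rhs = U[i] @ V[j] + U[i] @ W[l] + V[j] @ W[l]
    assert np.isclose(lhs, rhs)                      # <A_ijl, X X^T> = f(u_i, v_j, w_l)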

Now we are ready to show that there are no bad local minima in the landscape of L.

Lemma (no bad local minima). In the setting of Theorem 3, with high probability any local minimum of L is a global minimum.

Proof.

We will show that X is rank deficient; by the proposition above, this implies that XX^⊤ is a global minimum of the associated convex objective, and hence X is a global minimum of L. Assume for contradiction that X has full rank r. By local optimality, ∇L(X) = 0, which lets us express the perturbation matrix M in terms of the sensing matrices {A_{ijk} : (i, j, k) ∈ Ω}, the residuals, the gradient of the regularizer, and a symmetric matrix in the null space of X^⊤:

(1)

Denote by S the set of PSD matrices that can arise on the right-hand side of Equation (1) for some full-rank X. In the above definition, the last term is a symmetric matrix in the null space of X^⊤ (recall that A_{ijk} is symmetric for any (i, j, k)). The set S is a finite union of manifolds, and M ∈ S by the gradient condition. Because the dimension of a finite union of manifolds is equal to the maximal dimension among them, it suffices to bound the dimension of each manifold in the union.

Counting the degrees of freedom contributed by the observed entries, the regularizer, and the null-space component when X has full rank r, one finds that the dimension of S is strictly less than the dimension of the space of 3d × 3d PSD matrices; this is where the assumption on the (over-parameterized) search rank is used. However, a random PSD matrix M falls into such a lower-dimensional set with probability zero. Hence, with high probability, the rank of X is less than r. The proof is complete. ∎

Rademacher complexity.

Next we bound the generalization error using Rademacher complexity. We first introduce some notation. For norm bounds ν and B, denote by K the set of stacked factor matrices X = [U; V; W] whose rows have norm at most ν and whose Frobenius norm is at most B. Denote by T_K the set of quadratic tensors constructed from matrices in K.

We bound the Rademacher complexity of T_K in the following lemma.

Lemma 3. In the setting of Theorem 3, the empirical Rademacher complexity of T_K with respect to m uniformly random entries decreases at the rate 1/√m, up to factors that depend polynomially on d, the norm bounds, and the coefficients of f.

We leave the proof to Appendix A. Based on the above Lemma, we are ready to prove our main result.

Proof of Theorem 3.

By the no-bad-local-minima lemma, any local minimum (U, V, W) of L is a global minimum. In particular, the objective value at (U, V, W) is at most the objective value at the ground-truth factors, which is small because the ground truth fits the observed entries exactly, incurs no regularization penalty, and contributes only a small perturbation term since the spectral norm of M is at most δ. Recall that T̂ is the reconstructed tensor; it follows that the MSE of T̂ on the observed entries Ω is small.

Next, the regularizer guarantees that every row of U, V, and W has norm not much larger than Λ, which in turn bounds the Frobenius norm of X = [U; V; W]. This implies that T̂ belongs to the set T_K of quadratic tensors defined above. By Lemma 3, the Rademacher complexity of all quadratic tensors in T_K is bounded and decreases with m. To summarize, the MSE of T̂ on Ω is small and the Rademacher complexity of the function class is small; hence, by a standard generalization bound, the MSE of T̂ on the entire tensor T is bounded with high probability over the randomness of Ω (see, e.g., Bartlett and Mendelson [31] for more details). We obtain the desired conclusion by choosing a sufficiently small perturbation level δ. ∎

Polynomial time algorithms.

Next we discuss algorithms for minimizing the objective L. Apart from gradient descent, L can also be minimized via alternating least squares (ALS): fixing V and W, the objective is a regularized least-squares problem over U, and similarly for V and W. Hence ALS alternately solves regularized least-squares problems and terminates after a predefined maximum number of iterations, or when the error stops decreasing. Each iteration involves a bounded amount of computation and can be substantially faster when the observed tensor is sparse, in which case the computational complexity essentially depends only on the number of observed entries. We validate the performance of gradient descent and ALS on synthetic data in Section 4.1. It is an interesting open question to analyze the convergence of gradient descent or ALS for quadratic tensors.
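For concreteness, here is a minimal ALS sketch for the pairwise interaction model (one particular choice of f; this is our illustration rather than the paper's reference code). Fixing V and W, each row u_i solves a ridge regression in which the feature vector of an observation (i, j, l) is v_j + w_l and the target is T_ijl − ⟨v_j, w_l⟩; the other two factors are updated symmetrically.

    import numpy as np

    def als_update_U(U, V, W, omega, vals, lam):
        """One ALS pass over U for the pairwise model; V and W are held fixed."""
        d, k = U.shape
        G = [np.zeros((k, k)) for _ in range(d)]   # per-row Gram matrices
        b = [np.zeros(k) for _ in range(d)]        # per-row right-hand sides
        for (i, j, l), t in zip(omega, vals):
            phi = V[j] + W[l]                      # feature vector for row i
            r = t - V[j] @ W[l]                    # residual target
            G[i] += np.outer(phi, phi)
            b[i] += r * phi
        U_new = np.zeros_like(U)
        for i in range(d):
            U_new[i] = np.linalg.solve(G[i] + lam * np.eye(k), b[i])
        return U_new

    def als(T_obs, d, k, lam=0.1, iters=30, seed=0):
        """T_obs: dict mapping observed (i, j, l) -> value."""
        rng = np.random.default_rng(seed)
        U, V, W = (rng.standard_normal((d, k)) / np.sqrt(k) for _ in range(3))
        omega, vals = list(T_obs.keys()), list(T_obs.values())
        for _ in range(iters):
            U = als_update_U(U, V, W, omega, vals, lam)
            # the pairwise model is symmetric in its arguments, so the same
            # routine updates V and W after permuting the roles of the factors
            V = als_update_U(V, U, W, [(j, i, l) for (i, j, l) in omega], vals, lam)
            W = als_update_U(W, U, V, [(l, i, j) for (i, j, l) in omega], vals, lam)
        return U, V, W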

Lastly, the recovery problem can also be solved via convex relaxation:

minimize Tr(Z)
s.t. Z ⪰ 0 and ⟨A_{ijk}, Z⟩ = T_ijk for all (i, j, k) ∈ Ω,

where we recall that ⟨A_{ijk}, Z⟩ corresponds to the (i, j, k)-th entry of the quadratic tensor defined by Z. Note that the objective function is convex and the feasible region is convex and bounded, hence the problem can be solved in polynomial time (see, e.g., Bubeck [32]). Combining Lemma 3 and the proof of Theorem 3, we obtain the following recovery guarantee for the above convex relaxation method.

Let T be a quadratic tensor defined by factors U*, V*, W* ∈ R^{d×k} and a quadratic function f, and assume that every row of the factors has norm at most Λ. Let Ω be a set of m entries sampled uniformly at random from T, and let {T_ijk : (i, j, k) ∈ Ω} be the corresponding observations. When m is linear in d (up to factors depending on k, Λ, and the target accuracy), solving the above program with standard convex optimization methods (such as the cutting plane method) returns a solution Ẑ in polynomial time, and Ẑ can be used to reconstruct a tensor T̂ whose average squared error against T is small.
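A small CVXPY sketch of the trace-minimization relaxation above (our reconstruction, with exact-fit constraints as in the synthetic experiments of Section 4.1; the function and variable names are ours):

    import cvxpy as cp

    def solve_trace_min(A, omega, vals, d):
        """Minimize trace(Z) over PSD Z (standing in for X X^T) subject to
        fitting the observed entries exactly: <A_ijl, Z> = T_ijl on omega.
        A is the 3 x 3 kernel matrix of f (a NumPy array)."""
        Z = cp.Variable((3 * d, 3 * d), PSD=True)
        constraints = []
        for (i, j, l), t in zip(omega, vals):
            idx = [i, d + j, 2 * d + l]
            # <A_ijl, Z> only touches the 3 x 3 principal submatrix indexed by idx
            expr = sum(A[a, b] * Z[idx[a], idx[b]] for a in range(3) for b in range(3))
            constraints.append(expr == t)
        problem = cp.Problem(cp.Minimize(cp.trace(Z)), constraints)
        problem.solve()
        return Z.value

Generic SDP solvers only scale to moderate d, which is one reason the non-convex methods above remain the practical choice for large tensors.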

Discussions.

One interesting question regarding Theorem 3 is whether the search rank can be reduced from its current over-parameterized value to the true rank k, which does not grow with the dimension. Here we describe an interesting connection between this question and the notion of matrix rigidity [33]. Concretely, we ask the following question.

Question. Let B be a random matrix where every entry is sampled independently from a fixed distribution (e.g., standard Gaussian), and let M = BB^⊤. Suppose that we are allowed to arbitrarily change a limited number of entries of M, obtaining M'; in other words, M and M' differ in at most a bounded number of entries. What is the minimum possible rank of M'?

It is trivial to see that one can reduce the rank by zeroing out entire rows of M; hence the minimum possible rank decreases by roughly one for every row's worth of changed entries. If the answer to the above question is that the rank remains high after a limited number of changes, then Theorem 3 would hold with a much smaller search rank. To see this, in the proof of the no-bad-local-minima lemma we can use M as the random perturbation and scale it down so that its spectral norm is below the desired threshold. Since each A_{ijk} has only a constant number of nonzero entries, the stationarity condition (c.f. Equation (1)) modifies the perturbation matrix in at most a number of entries proportional to m. If the rank of the modified matrix remains large, the rank-deficiency argument still goes through with a much smaller search rank. Since accurate recovery only requires m linear in d, the search rank could then be reduced accordingly.

The above question is equivalent to asking about the rigidity of a random PSD matrix. It turns out that understanding the rigidity of random matrices is technically challenging, and there is an ongoing line of work aimed at improving our understanding in this area. We refer the interested reader to the work of Goldreich and Tal [33] for more information.

4 Experiments

In this section, we describe our experiments on both synthetic and real-world data. For synthetic data, we validate our theoretical results and show that the number of samples needed to recover the tensor grows only linearly in the dimension, using two non-convex methods: gradient descent and alternating least squares (ALS). We then evaluate the quadratic model, solved using alternating least squares, on real-world tasks in three diverse domains: a) predicting movie ratings on the Movielens-10M dataset; b) learning word embeddings from a tensor of word tri-occurrences; c) recovering the hyperspectral image "Riberia" given incomplete pixel values. On the Movielens-10M dataset the quadratic model outperforms CP decomposition. In the word embedding experiment the quadratic model outperforms CP decomposition by more than 20% across standard NLP benchmarks for evaluating word embeddings. In the hyperspectral image experiment, we explicitly vary the fraction of sampled pixels; we observe that the quadratic model outperforms the CP model when there are limited observations, whereas the CP model excels when more observations are available.

4.1 Synthetic Data

Both gradient descent and ALS are common paradigms for solving non-convex problems, and our goal in this section is to evaluate their performance on synthetic data. The ALS approach minimizes the mean squared error objective by iteratively fixing two sets of factors and solving the regularized least squares problem over the third factor. In addition, we also evaluate a semidefinite-programming-based approach which solves a trace minimization problem, similar to the approach of Chen et al. [19].

We now describe our setup. Let U*, V*, W* ∈ R^{d×k}, where every entry is sampled independently from the standard normal distribution. We sample a uniformly random subset of m entries from the quadratic tensor T defined by these factors. Let the set of observed entries be Ω; the goal is to recover T given the observations. We measure the test error of the reconstructed tensor T̂ as the normalized Frobenius error on the unobserved entries:

test error = ‖ P_Ω̄(T̂ − T) ‖_F / ‖ P_Ω̄(T) ‖_F,    (2)

where P_Ω̄ restricts a tensor to the held-out (unobserved) entries.
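A minimal version of this synthetic setup, using the pairwise interaction model as the quadratic function f (our choice for illustration; the constants below are arbitrary):

    import numpy as np

    def make_pairwise_tensor(U, V, W):
        """T[i, j, l] = <u_i, v_j> + <u_i, w_l> + <v_j, w_l>."""
        UV, UW, VW = U @ V.T, U @ W.T, V @ W.T
        return UV[:, :, None] + UW[:, None, :] + VW[None, :, :]

    def test_error(T_hat, T, test_mask):
        """Normalized Frobenius error on the held-out entries, as in Equation (2)."""
        return np.linalg.norm((T_hat - T)[test_mask]) / np.linalg.norm(T[test_mask])

    d, k, m = 100, 5, 500                      # dimension, rank, number of samples
    rng = np.random.default_rng(0)
    U, V, W = (rng.standard_normal((d, k)) for _ in range(3))
    T = make_pairwise_tensor(U, V, W)
    observed = np.zeros(T.shape, dtype=bool)
    observed.flat[rng.choice(T.size, size=m, replace=False)] = True
    # ... fit T_hat from T[observed] via ALS, gradient descent, or the SDP,
    # then report test_error(T_hat, T, ~observed).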

Accuracy.

We first examine how many samples ALS and the SDP require in order to recover T accurately. We set the number of samples to m = c·d for a constant c, and fix the rank k of the random factors. For each value of c and d, we repeat the experiment three times and report the median value with error bars. Because ALS is more scalable, we are able to test it on much larger dimensions d. Figure 1 shows that the sample complexity of both the SDP and ALS grows linearly with d: below a certain constant c, both the SDP and ALS fail to recover T, but given slightly more samples they recover T very accurately.

Figure 1: (a) Test error vs. dimension for ALS; (b) test error vs. dimension for SDP. Both ALS and the SDP recover a quadratic tensor with random factors once the number of samples grows linearly with the dimension, providing evidence that their sample complexity is linear in d.

ALS also converges within a small number of iterations: we observe that ALS achieves low test error within 30 iterations, where each iteration requires solving sparse regularized least-squares problems. Figure 2 shows how the error decays with the number of iterations. This makes ALS highly scalable for solving the problem on large tensors.

Figure 2: Training and test error for ALS vs the number of iterations. ALS achieves low test error within 30 iterations.

We also repeat the same experiment for gradient descent, running it for 20000 iterations. Figure 3 shows that the number of samples gradient descent needs also grows linearly with the dimension. Our experiments suggest that the constants in the sample complexity are slightly better for ALS than for gradient descent, and ALS also seems to converge faster to a solution with small error.

Figure 3: Gradient descent requires a number of samples that grows linearly with the dimension to recover a quadratic tensor with random factors, providing evidence that its sample complexity is also linear in d.

Robustness.

In this experiment, we examine the robustness of the SDP relaxation and ALS to noise in the input tensor. We fix the dimension d and the number of samples m, and to each training data point we add a Gaussian perturbation with zero mean and a chosen variance. Figure 4 shows the test error of the SDP and ALS as the noise variance increases. Even with substantial noise, both methods are still able to learn the underlying tensor with small error. While the performance of ALS is consistently better than that of the SDP relaxation, we suspect that this is because we formulated the SDP to fit the training examples exactly; it is possible that a different formulation would obtain results similar to ALS. For example, one could minimize the mean squared error while constraining the trace to be less than a desired value (see, e.g., [YUTC17]). Interestingly, even at the largest noise level we tried, ALS can still recover T with low test error, showing that it is extremely robust to noise.

Figure 4: Test error of the SDP and ALS when each training data point is independently corrupted by Gaussian noise with zero mean and increasing variance. Both algorithms achieve a high degree of noise robustness and obtain low test error even at large noise levels.

Evidence for Conjecture [conj_sample].

In this experiment, we examine evidence for Conjecture [conj_sample] on synthetic data. Table 1 reports the incoherence of the optimal SDP solution. We observe that the incoherence does not increase as the dimension d increases; moreover, the sum of the singular values other than the top k singular values is close to zero in all the experiments, providing evidence for Conjecture [conj_sample]. If we further orthogonalize the underlying random factors, then we observe that the SDP recovers the original factors up to rotations. For general random factors, we do not always recover the original factors in the experiment; for example, this happens if the factors in one group have larger norms than the factors in the other two groups.

Table 1: Evidence for Conjecture [conj_sample] on random data: the incoherence of the optimal SDP solution does not increase as the dimension increases.

4.2 Movie Ratings Prediction on Movielens-10M Dataset

The Movielens-10M dataset (Footnote 2: https://grouplens.org/datasets/movielens/10m/) contains about 10 million ratings (each between 0 and 5) given by users to movies, along with a time stamp for each rating. We test both CP decomposition and the quadratic model on the tensor completion task of predicting missing ratings given a subset of the ratings. We also compare with a matrix factorization based method which ignores the temporal information, to evaluate whether the temporal information in the time stamps is useful.

Table 2: Test MSE on the Movielens-10M dataset for two sampling rates (corresponding to different training and test splits) and factorization ranks 10 and 20, comparing the matrix model, the CP model, and the quadratic model. The quadratic model yields the best results across all settings, with the gap being larger at the lower sampling rate.

Methodology.

We split the ratings into a training and a test set. We perform this split with two different sampling rates, corresponding to a larger and a much smaller fraction of the entries being placed in the training set; the smaller sampling rate evaluates the performance of the algorithms given very little data. To construct the tensor of ratings, we bin the time window into 20-week-long intervals, which gives a users × movies × time-bins tensor whose third mode is the temporal mode. We then use CP decomposition and the quadratic model, both with regularization, to predict the missing ratings. For the matrix method, we run regularized matrix factorization on the users × movies matrix of ratings. We use alternating minimization with random initialization and tune the regularization parameter for all algorithms. We repeat the experiment for 3 random splits of the dataset at each sampling rate. The evaluation metric is the mean squared error (MSE) on the test entries.
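A sketch of this preprocessing step (the column names and exact binning arithmetic are assumptions on our part; only the 20-week bin width comes from the description above):

    import numpy as np
    import pandas as pd

    def build_rating_tensor(ratings_csv):
        """Turn (user, movie, timestamp, rating) rows into sparse tensor entries."""
        df = pd.read_csv(ratings_csv)          # assumed columns: userId, movieId, rating, timestamp
        df["user"] = df["userId"].astype("category").cat.codes
        df["movie"] = df["movieId"].astype("category").cat.codes
        # bin timestamps (seconds) into 20-week-long intervals: the temporal mode
        bin_width = 20 * 7 * 24 * 3600
        df["timebin"] = ((df["timestamp"] - df["timestamp"].min()) // bin_width).astype(int)
        indices = df[["user", "movie", "timebin"]].to_numpy()
        values = df["rating"].to_numpy(dtype=float)
        return indices, values                 # observed entries of the ratings tensor

    def split_train_test(indices, values, train_frac, seed=0):
        """Random split of the observed entries into training and test sets."""
        rng = np.random.default_rng(seed)
        mask = rng.random(len(values)) < train_frac
        return (indices[mask], values[mask]), (indices[~mask], values[~mask])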

Results.

The means and standard deviations of the MSE are reported in Table 2. There are two key observations to be made. First, the quadratic model consistently outperforms the CP model for the choices of rank (Footnote 3: We found that going to a higher rank did not improve the performance of either model.) and sampling rate we explored. The difference in performance is also larger in the regime with the lower sampling rate, and we hypothesize that this is due to the superior generalization ability of the quadratic model compared with the CP model. Second, the quadratic model also improves over the baseline which ignores the temporal information in the ratings and uses matrix factorization. This is expected, as a user's like or dislike for a genre of movies may change over time, or a movie's ratings may change after the time of its release.

4.3 Learning Word Embeddings

Word embeddings are vector representations of words, where the vectors and their geometry encode both syntactic and semantic information about the words. In this section, we construct word embeddings from the factors obtained by tensor factorization of a suitably normalized tensor of word tri-occurrences, and compare the quality of the word embeddings learned by the quadratic model and by CP decomposition. This experiment tests whether the quadratic model returns meaningful factors, in addition to accurately predicting the missing entries.

Methodology.

We construct a 2000 × 2000 × 2000 tensor of word tri-occurrences of the 2000 most frequent words in English by using a sliding window of length 3 on a 1.5-billion-word Wikipedia corpus; hence the (i, j, k)-th entry of the tensor is the number of times words i, j, and k occur together in a window of length 3. As in previous work [34, 18], we construct a normalized tensor by applying an element-wise nonlinearity to each entry of the tensor. We then find the factors of a rank-100 factorization of the normalized tensor for both the quadratic model and CP decomposition using ALS. The embedding for the i-th word is then obtained by concatenating the i-th rows of U, V, and W, and then normalizing to unit norm.
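The construction can be sketched as follows; the log(1 + count) normalization is a common choice in this line of work and is an assumption on our part, since the exact nonlinearity did not survive in the text above.

    import numpy as np
    from collections import Counter

    def tri_occurrence_counts(tokens, vocab):
        """Count length-3 sliding windows over the corpus, restricted to the vocabulary,
        and apply an element-wise log(1 + count) normalization (assumed choice).
        Returns a sparse dict {(i, j, k): value}; a dense 2000^3 array would be too large."""
        word2id = {w: i for i, w in enumerate(vocab)}
        counts = Counter()
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            if a in word2id and b in word2id and c in word2id:
                counts[(word2id[a], word2id[b], word2id[c])] += 1
        return {idx: np.log1p(v) for idx, v in counts.items()}

    def embeddings_from_factors(U, V, W):
        """Concatenate the i-th rows of U, V, W and normalize each embedding to unit norm."""
        E = np.concatenate([U, V, W], axis=1)
        return E / np.linalg.norm(E, axis=1, keepdims=True)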

Evaluation.

In addition to reporting the MSE, we evaluate the learned embeddings on standard word analogy and word similarity tasks. The word analogy tasks [37, 38] consist of analogy questions of the form "cat is to kitten as dog is to ___?", which can be answered by simple vector arithmetic on the word vectors: to answer this particular analogy, we take the vector for kitten, subtract the vector for cat, add the vector for dog, and then find the word whose vector is closest to the result. There are two standard datasets of analogy questions, one of which contains more syntactic analogies [38] and the other more semantic analogies [37]. The metric here is the percentage of analogy questions that the algorithm answers correctly. The other task we test is word similarity [39, 40], where the goal is to evaluate how semantically similar two words are; this is done by taking the cosine similarity of the word vectors. The evaluation metric is the correlation between the similarity scores assigned by the algorithm and the similarity scores assigned by humans.
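For reference, the analogy evaluation described above ("a is to b as c is to ?") reduces to a nearest-neighbor search around vec(b) − vec(a) + vec(c); a sketch:

    import numpy as np

    def answer_analogy(E, word2id, a, b, c):
        """E: row-normalized embedding matrix; returns the id of the predicted word d
        for the analogy 'a is to b as c is to d'."""
        query = E[word2id[b]] - E[word2id[a]] + E[word2id[c]]
        query /= np.linalg.norm(query)
        sims = E @ query                        # cosine similarities (rows have unit norm)
        for w in (a, b, c):                     # exclude the three query words themselves
            sims[word2id[w]] = -np.inf
        return int(np.argmax(sims))

Word-similarity scores are computed analogously as the inner product of two row-normalized embeddings, and then correlated with the human judgments.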

Results.

The results are shown in Table 3. The quadratic model significantly outperforms the CP model both on the MSE metric and on the NLP tasks which directly evaluate the embeddings. In order to take a closer look at the analogy questions that the quadratic model is better at answering than the CP model, we also evaluated the word embeddings on subsets of analogy questions which test specific analogies; the results are reported in Table [specific_tasks]. The promising performance on the Country/State capitals task indicates that the quadratic model may be better at capturing relations for rarer words, for which less data is observed, due to its superior generalization capabilities, though a proper evaluation of this would require testing on a larger dictionary than 2000 words. Our results indicate a significant improvement over the CP model, and a promising direction for future work is to use the quadratic model in conjunction with other algorithms and models for learning word embeddings from tensors [liu2015learning, 18, bailey2017word] to yield higher-quality word embeddings for downstream NLP tasks.

Metric CP model Quadratic model
MSE 0.5893 0.4253
Syntactic analogy 30.61% 46.14%
Semantic analogy 42.37% 54.76%
Word similarity 0.51 0.60
Table 3: Results for word embedding experiments. The quadratic model significantly outperforms the CP model across all tasks.

4.4 Recovering Hyperspectral Images

Since the quadratic model is a special case of the CP model, in principle it cannot represent an arbitrary tensor. (Footnote 4: Given three factors U, V, W ∈ R^{d×k}, the pairwise model defines the tensor T = Σ_{r=1}^{k} (U_r ⊗ V_r ⊗ 1 + U_r ⊗ 1 ⊗ W_r + 1 ⊗ V_r ⊗ W_r), where 1 denotes the all-ones vector and U_r, V_r, W_r denote the r-th columns of the factors. Hence any tensor defined by the pairwise model can be factorized into at most 3k rank-one tensors.) When there are enough observations, the quadratic model may therefore not perform as well as the CP model, due to its limited representational power. On the other hand, when the number of observations is limited, the quadratic model still outperforms the CP model. We describe such an example for the task of completing a hyperspectral image.

We consider recovering the hyperspectral image "Riberia" [41], which has previously been considered in the context of tensor factorization [42, 43]. The image is a third-order tensor in which each slice corresponds to the same scene imaged at a different wavelength.

As has been done in previous works [42, 43], we resize the image by downsampling. We observe a fraction of the entries of the tensor, and the task is to estimate the remaining entries. We fix the rank of the CP model and the quadratic model, and measure performance in terms of the normalized Frobenius error of the recovered tensor on the missing entries (c.f. Equation (2)); we observe no improvement from using higher ranks for either model in our experiments. We vary the fraction of samples and compare the performance of the CP model and the quadratic model, fine-tuning the regularization parameter to achieve the best performance for both models. The results are reported in Table 4.

We see that the relative performance of the CP model and the quadratic model varies depending on the fraction of samples available. While the CP model achieves the best results with 10% of the samples, the quadratic model outperforms the CP model when the fraction of samples is less than 1%. In the most parsimonious setting, the quadratic model incurs less than half the error of the CP model.

Percentage of samples CP model Quadratic model
1.064 0.488
0.495 0.424
0.358 0.353
0.116 0.216
Table 4: Results for completing the hyperspectral image Riberia, with rows ordered by increasing fraction of observed entries. We report the test error, taking the median over three runs for each experiment.

5 Conclusions and Future Work

In this work, we showed that for a natural non-convex formulation, all local minima are global minima and can be used to recover quadratic tensors from a linear number of samples. The same techniques show that convex relaxation methods also recover quadratic tensors given a linear number of samples. We experimented with a diverse set of real-world datasets, showing that the quadratic model outperforms the CP model when the number of observations is limited.

There are several immediate open questions. First, is it possible to show a convergence guarantee with a small number of iterations for gradient descent or alternating least squares? Second, is it possible to achieve results similar to Theorem 3 with a smaller search rank, rather than the over-parameterized rank used here? We believe that solving this may require novel techniques.

References

  • Anandkumar et al. [2014] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
  • Kolda and Bader [2009] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
  • Nickel et al. [2011a] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011a.
  • Sun et al. [2006] Jimeng Sun, Dacheng Tao, and Christos Faloutsos. Beyond streams and graphs: dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 374–383. ACM, 2006.
  • Sutskever et al. [2009] Ilya Sutskever, Joshua B Tenenbaum, and Ruslan R Salakhutdinov. Modelling relational data using bayesian clustered tensor factorization. In Advances in neural information processing systems, pages 1821–1828, 2009.
  • Rendle and Schmidt-Thieme [2010] Steffen Rendle and Lars Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the third ACM international conference on Web search and data mining, pages 81–90. ACM, 2010.
  • Rendle and Freudenthaler [2014] Steffen Rendle and Christoph Freudenthaler. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 273–282. ACM, 2014.
  • Shi et al. [2014] Yue Shi, Martha Larson, and Alan Hanjalic. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR), 47(1):3, 2014.
  • Bordes et al. [2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
  • García-Durán et al. [2016] Alberto García-Durán, Antoine Bordes, Nicolas Usunier, and Yves Grandvalet. Combining two and three-way embedding models for link prediction in knowledge bases. Journal of Artificial Intelligence Research, 55:715–742, 2016.
  • Nguyen [2017] Dat Quoc Nguyen. An overview of embedding models of entities and relationships for knowledge base completion. arXiv preprint arXiv:1703.08098, 2017.
  • Wang et al. [2014] Yilun Wang, Yu Zheng, and Yexiang Xue. Travel time estimation of a path using sparse trajectories. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 25–34. ACM, 2014.
  • Song et al. [2017] Qingquan Song, Hancheng Ge, James Caverlee, and Xia Hu. Tensor completion algorithms in big data analytics. arXiv preprint arXiv:1711.10105, 2017.
  • Jain and Oh [2014] Prateek Jain and Sewoong Oh. Provable tensor factorization with missing data. In NIPS, pages 1431–1439, 2014.
  • Montanari and Sun [2016] Andrea Montanari and Nike Sun. Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 2016.
  • Potechin and Steurer [2017] Aaron Potechin and David Steurer. Exact tensor completion with sum-of-squares. In COLT, pages 1619–1673, 2017.
  • Barak and Moitra [2016] Boaz Barak and Ankur Moitra. Tensor prediction, rademacher complexity and random 3-XOR. In COLT, 2016.
  • Sharan and Valiant [2017] Vatsal Sharan and Gregory Valiant. Orthogonalized als: A theoretically principled tensor decomposition algorithm for practical use. In ICML, pages 3095–3104, 2017.
  • Chen et al. [2013] Shouyuan Chen, Michael R Lyu, Irwin King, and Zenglin Xu. Exact and stable recovery of pairwise interaction tensors. In Advances in Neural Information Processing Systems, pages 1691–1699, 2013.
  • Ge et al. [2016] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
  • Boumal et al. [2016] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex burer-monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.
  • Bhojanapalli et al. [2016] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
  • Ge et al. [2017] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.
  • Zhang et al. [2018] Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu. A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery. In International conference on machine learning, pages 5857–5866, 2018.
  • Li et al. [2016] Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Recovery guarantee of non-negative matrix factorization via alternating updates. In Advances in neural information processing systems, pages 4987–4995, 2016.
  • Du and Lee [2018] Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
  • Gandy et al. [2011] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
  • Nickel et al. [2011b] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011b.
  • Farias and Li [2017] Vivek F Farias and Andrew A Li. Learning preferences with side information. Technical report, 2017.
  • Bach et al. [2008] Francis Bach, Julien Mairal, and Jean Ponce. Convex sparse matrix factorizations. arXiv preprint arXiv:0812.1869, 2008.
  • Bartlett and Mendelson [2002] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bubeck et al. [2015] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • Goldreich and Tal [2016] Oded Goldreich and Avishay Tal. Matrix rigidity of random toeplitz matrices. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 91–104. ACM, 2016.
  • Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • Ledoux and Talagrand [2013] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
  • Davenport et al. [2014] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
  • Mikolov et al. [2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
  • Mikolov et al. [2013b] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013b.
  • Bruni et al. [2012] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 136–145. Association for Computational Linguistics, 2012.
  • Finkelstein et al. [2001] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM, 2001.
  • Foster et al. [2006] David H Foster, Kinjiro Amano, Sérgio MC Nascimento, and Michael J Foster. Frequency of metamerism in natural scenes. Josa a, 23(10):2359–2372, 2006.
  • Signoretto et al. [2011] Marco Signoretto, Raf Van de Plas, Bart De Moor, and Johan AK Suykens. Tensor versus matrix completion: a comparison with application to spectral data. IEEE Signal Processing Letters, 18(7):403–406, 2011.
  • Kasai and Mishra [2015] Hiroyuki Kasai and Bamdev Mishra. Riemannian preconditioning for tensor completion. arXiv preprint arXiv:1506.02159, 2015.

Appendix A Proof of Lemma 3

In this section, we fill in the missing proofs for Theorem 3. We present the proof of Lemma 3, which bounds the Rademacher complexity of T_K, the set of quadratic tensors defined in Section 3.

Proof of Lemma 3.

Let Ω̃ denote a set of m independent uniform samples of entries (sampled with replacement). The expected generalization gap over Ω can be related to the corresponding quantity over Ω̃. Hence,

the expected gap is bounded by the expected supremum, over tensors in T_K, of the deviation between the empirical average over the samples and the average over all entries, (3)

by concavity of the supremum operation and of the square function. Let ε_1, …, ε_m denote i.i.d. Rademacher random variables. By the symmetry of the sampling and of the Rademacher variables, Equation (3) is bounded by a symmetrized Rademacher average of the form

E sup over X in K of (1/m) Σ_t ε_t ( ⟨A_{i_t j_t k_t}, X X^⊤⟩ − T_{i_t j_t k_t} )², (4)

where (i_t, j_t, k_t) denotes the t-th sample. By our assumption on K, the rows of X have bounded norm; since the coefficients of f are constants, every entry ⟨A_{ijk}, XX^⊤⟩ of a tensor in T_K is bounded by a constant, and so is every entry of T. Therefore the squared deviation is a Lipschitz function of ⟨A_{ijk}, XX^⊤⟩ on the relevant range. By the contraction principle (Theorem 4.12 in Ledoux and Talagrand [35]), Equation (4) is at most a constant times the expected supremum of the linear process (1/m) Σ_t ε_t ⟨A_{i_t j_t k_t}, XX^⊤⟩, which is in turn controlled by (1/m) times the expected spectral norm of Σ_t ε_t A_{i_t j_t k_t}, multiplied by a bound on the trace norm of XX^⊤ over K.

To handle this expectation, we use the following fact (c.f. Lemma 1 in Davenport et al. [36] and the proof therein). Let S be a set of uniformly random samples of entries of a d by d matrix, let E_t be the indicator matrix of the t-th sample (its sampled entry is 1 and all other entries are 0), and let ε_t denote Rademacher random variables; then the expected spectral norm of Σ_t ε_t E_t admits the bound given in that lemma.

To see how to use the above fact in our setting, observe that A_{ijk} contains nine nonzero entries, for every (i, j, k). If we divide the 3d by 3d matrix into nine d by d submatrices, then each A_{ijk} has exactly one nonzero entry, with a fixed value, in each submatrix. Hence we can use the fact above to bound the contribution of each d by d submatrix separately. (Footnote 5: For diagonal blocks, similar results can be obtained based on the proof of Lemma 1 in Davenport et al. [36]; details omitted.) Summing the contributions of the nine blocks gives the claimed bound on the Rademacher complexity. Combined with Equations (3) and (4), we obtain the desired conclusion. Hence the proof is complete. ∎