In relational learning, matrix factorization (MF) is a popular approach for dimension reduction that represents the rows (entities of one type) and columns (entities of another type) as two low-rank matrices. The optimization is usually achieved by minimizing the reconstruction error between the low-rank model and the original data. Since the low-rank model can encode certain patterns latent in the original data, MF has been recognized as an effective pattern recognition technique and has been widely used in a variety of tasks, such as relation prediction [4, 5], data compression, clustering, feature learning, and topic modeling.
When the relation of interest is sparse, it is often desirable to predict whether a relation exists between two entities, a task known as relation prediction. Relation prediction plays a vital role in many real-world learning systems, such as recommender systems [4, 10], language understanding [11, 12], and social network mining [13, 14]. For relation prediction, which can be seen as a classification task, it is crucial to account for the missing entries, since they provide valuable signal about negative instances. For example, in recommendation, early collaborative filtering approaches predicted ratings by modeling the observed data only; later on, researchers found that ignoring missing data in this way leads to poor performance in real top-N recommender systems. Another example is the learning of word embeddings, where negative sampling is performed on missing data to add negative signal about word-context co-occurrence, a crucial step for ensuring the semantics of the learned embeddings.
Nevertheless, it is non-trivial to leverage the missing data, since its scale can be several orders of magnitude larger than that of the observed entries. For example, in video recommendation data, users may watch only hundreds of videos on average among millions of videos, making the missing data three orders of magnitude larger than the observed data in the user-video matrix. This large-scale missing data poses efficiency challenges for learning the MF model. To this end, existing works have resorted to either sampling part of the missing data as negative signal (a.k.a. negative sampling [20, 21]) or modeling all missing data in a simplified way, by assigning them a uniform weight as negatives [22, 23]. Both solutions have pros and cons: negative sampling has controllable efficiency, but its effectiveness may suffer from low-quality negative examples and slow convergence [24, 25]; modeling all missing data is costly, but it can be more effective [1, 18]. To pursue high effectiveness, we focus on learning from all missing data in this work.
The Singular Value Decomposition (SVD) is a representative method for whole-data based MF. It assigns the same weight to all entries in the data matrix, regardless of whether they are observed or not (by default, the unobserved entries are assigned a value of zero). This assumption gives the optimization problem a nice structure and yields an analytical closed-form solution. However, considering that the amount of missing data can be much larger than the observed data in real applications, it is more desirable to assign the missing data a lower weight to address the class imbalance issue. To this end, the Weighted Alternating Least Squares (WALS) method assigns a lower weight to all missing data, which is more flexible than the default setting of 1. However, we argue that WALS implicitly assumes that all missing data have the same likelihood of being negative, which may not be true in real applications. For example, in recommendation, we know that popular items are more likely to be known by users, and thus a missing entry on a popular item is more likely to be a true negative. Lastly, it is worth mentioning that the uniform weight design in WALS is mainly due to efficiency concerns, since it allows for a clever speedup of ALS learning that avoids the high complexity brought by modeling all missing data. If we were to use non-uniform weights on missing data, the speedup trick of WALS would no longer be applicable, and the optimization complexity would become unaffordable.
In this work, we enhance 1) the flexibility of MF by allowing the use of non-uniform weights on missing data, and 2) the practicability of weighted MF by developing an efficient optimization algorithm. In short, we allow each missing entry to be assigned an individualized weight, which encodes its prior of being a negative instance; the learning task takes the whole data matrix into account, but its time complexity depends on the number of observed entries only, rather than the matrix size (which is row# × column#). These two enhancements make it easy to address large-scale relation prediction with a more expressive modeling of missing data, which has not been possible with traditional MF methods like SVD and ALS. Our solution is achieved in three steps: 1) we perform truncated SVD on the weight matrix of missing data, using a more compact low-rank model to represent (or approximate) the weights of missing entries; 2) we perform ALS optimization on each element of the user and item latent vectors, rather than in the traditional vector-wise manner [27, 22]; 3) we leverage the low-rank weights to design memoization strategies that reduce the time complexity significantly. Through comprehensive experiments on two real-world recommendation benchmarks, we verify the correctness and efficiency of our fast eALS method, and the effectiveness of using non-uniform weights on missing data for the recommendation task.
A preliminary version of this work has been published as a conference paper in SIGIR 2016. This paper is significantly different from its preliminary version in the methodology. Specifically, this work approaches a generic problem setting where any weighting strategy can be applied on missing data, whereas our previous work can only deal with a simpler case where the missing entries of a column have the same weight; moreover, the recent work can also be seen as a simpler case of this work where the missing entries of a row have the same weight. As such, the Preliminaries (Section II), Proposed Methods (Section III), and Experiments (Section IV) have been re-written to support our solution to the new generic problem. The key contributions of this paper are summarized as follows:
- We highlight the problem of optimizing MF with non-uniform weights on missing data and present an element-wise ALS algorithm to solve it.
- We propose a fast eALS algorithm that solves the weighted MF problem with low-rank weights on missing data. The algorithm has a low time complexity that is proportional to the number of observed entries and independent of the number of missing entries.
- We perform extensive experiments on two real-world datasets and demonstrate the correctness, efficiency, and effectiveness of the proposed algorithm. The code for our experiments is available at: https://github.com/duxy-me/ext-als.
II Preliminaries
This section provides some preliminaries about MF and formalizes the problem to solve in this paper. Moreover, we discuss the efficiency challenge in solving the problem. Note that part A of this section has been presented in the preliminary version  (cf. Section 3.1) and other parts are new. Before starting the section, we first introduce some notations.
We denote the original data matrix as $\mathbf{R} \in \mathbb{R}^{M \times N}$, where $M$ and $N$ denote the number of rows and columns in the data matrix, respectively. We use the set $\mathcal{R}$ to denote the set of observed entries in R, i.e., the entries whose values are non-zero. Matrices $\mathbf{P} \in \mathbb{R}^{M \times K}$ and $\mathbf{Q} \in \mathbb{R}^{N \times K}$ denote the latent factor matrices for rows and columns, respectively; that is, they are the results or model parameters of MF. We use the vector $\mathbf{p}_u$ to denote the $u$-th row of matrix P, and we use the set $\mathcal{R}_u$ to denote the column indices with a nonzero value on row $u$, i.e., $\mathcal{R}_u = \{i \mid r_{ui} \neq 0\}$. We use the symbols $\mathbf{q}_i$ and $\mathcal{R}_i$ to denote the analogous quantities for the column side. Throughout the paper, we use uppercase bold font to denote a matrix, lowercase bold font to denote a vector, and lowercase italic font to denote a scalar; for example, P denotes a matrix, $\mathbf{p}_u$ denotes the $u$-th row vector in P, and $p_{uf}$ denotes the $(u,f)$-th entry in P.
II-A The MF Model
MF maps both rows and columns into a low-dimensional latent space (of dimension $K$) such that their interactions are modeled as an inner product in that space. Mathematically, each element $r_{ui}$ of R is estimated as:
$$\hat{r}_{ui} = \langle \mathbf{p}_u, \mathbf{q}_i \rangle = \mathbf{p}_u^{\top}\mathbf{q}_i, \tag{1}$$
where $\mathbf{p}_u$ and $\mathbf{q}_i$ are model parameters, which can be understood as the latent feature vectors for row $u$ and column $i$, respectively. The model estimation can be seen as reconstruction for an observed entry, or prediction for an unobserved one. For example, in recommendation, $\hat{r}_{ui}$ denotes the predicted rating of user $u$ on item $i$ (the larger the better), and ranking all items by $\hat{r}_{ui}$ can be used to select top-N recommendations for $u$. In matrix-wise representation, the model can be expressed as $\hat{\mathbf{R}} = \mathbf{P}\mathbf{Q}^{\top}$, which implies the low-rank assumption on the data matrix R.
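To make the model concrete, the following minimal NumPy sketch evaluates the inner-product model both entry-wise and matrix-wise and ranks the columns for one row. The toy sizes, the random initialization, and the helper names `predict` and `top_n` are assumptions made for illustration only; in a real recommender one would also exclude items the user has already rated.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 4, 6, 3                        # rows, columns, latent dimension (toy sizes)
P = rng.normal(scale=0.1, size=(M, K))   # row u holds the latent vector p_u
Q = rng.normal(scale=0.1, size=(N, K))   # row i holds the latent vector q_i

def predict(P, Q):
    """Matrix-wise model: R_hat = P Q^T, so R_hat[u, i] = <p_u, q_i>."""
    return P @ Q.T

def top_n(P, Q, u, n):
    """Indices of the n columns with the largest predicted score for row u."""
    scores = Q @ P[u]
    return np.argsort(-scores)[:n]

R_hat = predict(P, Q)
print(R_hat.shape)                 # (4, 6)
print(top_n(P, Q, u=0, n=3))       # three highest-scoring columns for row 0
```

Because `R_hat` is a product of an M×K and a K×N matrix, its rank is at most K, which is the low-rank assumption mentioned above.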
II-B Problem Formulation
Since MF performs dimension reduction on the original data matrix (the latent dimension $K$ is typically set to be much smaller than the numbers of rows $M$ and columns $N$), the objective function for model learning is usually formed as an error-based regression loss [16, 27]. In this work, we learn MF parameters by minimizing the following objective function:
$$J = \sum_{u=1}^{M}\sum_{i=1}^{N} w_{ui}\,(r_{ui} - \hat{r}_{ui})^2 + \lambda\left(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\right), \tag{2}$$
where $w_{ui}$ denotes the weight of the training instance $r_{ui}$, $\mathbf{W} = [w_{ui}]_{M \times N}$ is the matrix form of all weights, and $\lambda$ is a hyper-parameter that controls the regularization strength to prevent overfitting. In this problem formulation, we consider all data entries in R and assign each data entry an individualized weight $w_{ui}$, which is a generic setting that gives practitioners the flexibility to design the weighting strategy. Many previous efforts on MF do not deal with this generic problem setting, but instead use a specific weighting strategy. Here we discuss the three most common strategies:
Strategy 1. Zero weight on missing entries. This strategy applies a zero weight to missing entries, i.e., $w_{ui} = 0$ for every missing entry $(u,i)$. Since only observed entries are used as training instances, the learning time complexity is low and depends only on the number of observed entries. This is a typical setting for the rating prediction task [16, 30], which aims to predict the values of missing entries in a user-item rating matrix. When the data follows the missing at random (MAR) assumption, such a setting can provide unbiased estimation. However, the MAR assumption does not hold in many real-world applications; for example, a user is more likely to rate movies of her interest, rather than a random set of movies. In this case, the missing entries contain valuable signal about negative instances, and thus ignoring them will lead to suboptimal performance, especially for predicting whether a relation exists between two entities.
Strategy 2. Uniform weight on all entries. This strategy applies a uniform weight of 1 to all data entries, i.e., $w_{ui} = 1$ for all $(u,i)$. The SVD method can be directly applied to find the optimal solution for this problem. When the number of missing entries is of the same scale as the number of observed entries, such a setting may yield good performance. However, many real-world applications need to deal with sparse matrices that are highly imbalanced; for example, the observed ratings only take  of the rating matrix in the Netflix challenge data (https://en.wikipedia.org/wiki/Netflix_Prize). For such highly imbalanced learning scenarios, a uniform weighting strategy will make the parameter estimation process dominated by the missing entries, resulting in suboptimal performance.
Strategy 3. Uniform weight on missing entries. This strategy assigns all missing entries the same weight $w_0$, which can differ from the weight for observed entries:
$$w_{ui} = \begin{cases} w_{ui}, & \text{if } (u,i) \in \mathcal{R}; \\ w_0, & \text{otherwise}, \end{cases}$$
where $w_{ui}$ denotes the weight of the observed entry $(u,i)$, $\mathcal{R}$ is the set of observed entries, and $w_0$ denotes the uniform weight for all missing entries. When dealing with sparse data, $w_0$ can be set to a smaller number than $w_{ui}$ to alleviate the imbalanced learning issue. Hu et al. demonstrated that this strategy yields better performance than a uniform weight on all entries in the recommendation task. However, its deficiency is that it assumes all missing entries provide the same level of negative signal, which severely limits the fidelity of modeling real-world scenarios. For example, in recommender systems, we know that exposed but unclicked items (e.g., display ads) are more likely to be true negatives, and should thus be assigned a higher weight than others. Another reasonable intuition is that the missing entries of active users (who have consumed many items) are more likely to be true negatives.
In this work, we do not assume any weighting strategy on the data entries, and provide a solution for solving the generic problem of Eq. (2). In other words, our solution subsumes the above-mentioned works that define various weighting strategies.
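The weighted objective and the three weighting strategies discussed above can be written down directly. The sketch below is illustrative only: it assumes dense NumPy arrays, that zeros in R mark missing entries, and that the regularizer is the squared Frobenius norm of P and Q; the function names and the default weights `w_obs` and `w_miss` are hypothetical.

```python
import numpy as np

def weighted_mf_loss(R, W, P, Q, lam):
    """Weighted squared error over ALL entries plus L2 regularization."""
    err = R - P @ Q.T
    return float(np.sum(W * err ** 2) + lam * (np.sum(P ** 2) + np.sum(Q ** 2)))

def strategy_weights(R, strategy, w_obs=1.0, w_miss=0.1):
    """Weight matrix W for Strategies 1-3; zeros in R mark missing entries."""
    observed = (R != 0)
    if strategy == 1:                      # zero weight on missing entries
        return np.where(observed, w_obs, 0.0)
    if strategy == 2:                      # uniform weight of 1 on all entries
        return np.ones(R.shape)
    if strategy == 3:                      # uniform (smaller) weight on missing entries
        return np.where(observed, w_obs, w_miss)
    raise ValueError("strategy must be 1, 2 or 3")

rng = np.random.default_rng(0)
R = np.array([[5, 0, 0, 1], [0, 3, 0, 0], [4, 0, 0, 2]], dtype=float)
P = rng.normal(scale=0.1, size=(3, 2))
Q = rng.normal(scale=0.1, size=(4, 2))
losses = [weighted_mf_loss(R, strategy_weights(R, s), P, Q, lam=0.01) for s in (1, 2, 3)]
print(losses)
```

The generic setting studied in this work corresponds to passing an arbitrary non-negative `W` to `weighted_mf_loss`, of which the three strategies are special cases.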
II-C Efficiency Discussion
One key reason that previous work assumes a specific weighting strategy for missing data is efficiency. Here we analyze the learning time complexity for the three strategies, where $M$ and $N$ are the numbers of rows and columns, $K$ is the latent dimension, and $|\mathcal{R}|$ is the number of observed entries:
- For Strategy 1, the training set contains only $|\mathcal{R}|$ observed entries, so a standard optimization method like stochastic gradient descent (SGD) can be applied, which has a time complexity of $O(|\mathcal{R}|K)$. This level of complexity is rather low, requiring only a traversal of all training instances and an update of the latent vectors $\mathbf{p}_u$ and $\mathbf{q}_i$ at each visit of an instance $(u,i)$.
- For Strategy 2, since SVD can be directly applied to find the global optimum, the learning complexity depends on the SVD solver. The commonly used Lanczos Bidiagonalization (LBD) method has a complexity linear with respect to the number of observed entries. As such, its actual running time is of the same magnitude as the SGD solver for Strategy 1, and we can denote the analytical time complexity of SVD as $O(|\mathcal{R}|K)$.
- For Strategy 3, the training set contains all data entries, for which standard optimization methods like SGD have a time complexity of $O(MNK)$. This level of complexity is rather high and unaffordable for real-world large-scale applications that may have millions of rows and columns (e.g., the user-item matrix in recommendation). Fortunately, the uniform weight constraint brings opportunities for speedup by memoizing some intermediate variables. Hu et al. leveraged memoization tricks and proposed an ALS-based algorithm (named WALS), reducing the time complexity to $O((M+N)K^3 + |\mathcal{R}|K^2)$. Note that the $(M+N)K^3$ term is brought by the matrix inversion operation, which is inevitable when optimizing a latent vector (i.e., $\mathbf{p}_u$ or $\mathbf{q}_i$) as a whole in ALS. Even so, the $|\mathcal{R}|K^2$ part is still more costly than SGD, which only requires $O(|\mathcal{R}|K)$ time. As a result, even with the speedup, WALS may still be prohibitive on large data, where a large $K$ is crucial since it can lead to better representation ability and better performance. Lastly, it is worth mentioning that the speedup design in WALS is only applicable when the missing entries have the same weight. When this structure is broken, WALS degrades to ALS with a complexity of $O((M+N)K^3 + MNK^2)$.
In this paper, we propose a new solution to efficiently solve the weighted MF problem. Distinct from the above-mentioned efforts that perform vector-wise (i.e., $\mathbf{p}_u$) or matrix-wise (i.e., P) optimization, we perform optimization at the element level (i.e., $p_{uf}$), which we name element-wise ALS (or eALS for short). Furthermore, we apply a low-rank model to represent the weights of missing entries. Unifying the two designs, our solution achieves a time complexity governed by the number of observed entries, while being more flexible on the weights of missing entries.
III Proposed Methods
We first present a vanilla element-wise ALS learner, which differs from the conventional vector-wise ALS [27, 22]. By performing optimization on each element of the parameter matrices P and Q, we not only avoid the expensive matrix inversion operation in optimization, but also allow for a more flexible design of memoization strategies for further speedup. Next, we propose to represent the weights of missing entries with a low-rank model, which not only reduces the space needed to store the weights of missing data, but also opens the door to speeding up the eALS learner. Lastly, based on the low-rank weights, we elaborate the fast eALS algorithm and discuss several of its properties. Note that part A of this section has been presented in the preliminary version  (cf. Section 3.3) and other parts are different.
III-A Vanilla Element-wise ALS Learner
One bottleneck of the previous WALS solution lies in the matrix inversion operation, which is due to the updating of the latent vector for a row (column) as a whole (more explanations see Section 3.2 of ). As such, it is a natural thought to avoid this operation by optimizing parameters at the element level. Specifically, we follow the coordinate descent setting [33, 34] that optimizes each element of the latent vector while leaving the others fixed.
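First, we differentiate the objective function Eq. (2) with respect to $p_{uf}$:
$$\frac{\partial J}{\partial p_{uf}} = -2\sum_{i=1}^{N} w_{ui}\,(r_{ui} - \hat{r}^{f}_{ui})\,q_{if} + 2\,p_{uf}\sum_{i=1}^{N} w_{ui}\,q_{if}^2 + 2\lambda\,p_{uf},$$
where $\hat{r}^{f}_{ui} = \hat{r}_{ui} - p_{uf}q_{if}$, which can be understood as the model prediction in the absence of latent factor $f$. With the other variables fixed, the optimal solution of $p_{uf}$ is obtained at the point where $\partial J / \partial p_{uf} = 0$. Solving this equation, we have:
$$p_{uf} = \frac{\sum_{i=1}^{N} w_{ui}\,(r_{ui} - \hat{r}^{f}_{ui})\,q_{if}}{\sum_{i=1}^{N} w_{ui}\,q_{if}^2 + \lambda}.$$
In a similar way, we can get the solver for the item latent factor $q_{if}$:
$$q_{if} = \frac{\sum_{u=1}^{M} w_{ui}\,(r_{ui} - \hat{r}^{f}_{ui})\,p_{uf}}{\sum_{u=1}^{M} w_{ui}\,p_{uf}^2 + \lambda}.$$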
With the above closed-form solution for one variable given the others fixed, we obtain a learning algorithm by iteratively applying it to all parameters until convergence. It is worth noting that the objective function is non-convex in all parameters jointly. As such, this element-wise ALS solver can only find a local minimum (a critical point where the gradient vanishes). The same holds for the conventional ALS [22, 35] and for gradient descent methods that optimize this objective function. As a consequence, the initialization of model parameters will affect the results. In our experience, eALS's performance is relatively stable with a Gaussian random initialization.
Time Complexity. By updating one element $p_{uf}$ or $q_{if}$ at a time, we avoid the expensive matrix inversion operation, which is compulsory in traditional ALS. In this way, we eliminate the $O(K^3)$ term in the time complexity. Furthermore, by caching the model predictions, we can further reduce the time complexity from $O(MNK^2)$ (i.e., the time complexity of directly implementing the update rules) to $O(MNK)$, where $M$ and $N$ are the numbers of rows and columns and $K$ is the latent dimension. This time complexity is of the same level as evaluating all points in the data matrix R. Algorithm 1 shows the vanilla eALS algorithm with the cache on $\hat{r}_{ui}$, which has a time complexity of $O(MNK)$.
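The element-wise updates with a prediction cache can be sketched in a few lines of NumPy. This is an illustrative dense implementation under stated assumptions, not the authors' Algorithm 1 verbatim: it assumes dense R and W, non-negative weights, a positive regularizer, and it updates factor f for all rows at once, which gives the same result as looping over rows and then factors because rows of P are independent given Q. The names `loss` and `eals_epoch` are hypothetical.

```python
import numpy as np

def loss(R, W, P, Q, lam):
    return float(np.sum(W * (R - P @ Q.T) ** 2) + lam * (np.sum(P ** 2) + np.sum(Q ** 2)))

def eals_epoch(R, W, P, Q, lam):
    """One element-wise ALS sweep (P, then Q), updating P and Q in place.

    Assumes W >= 0 and lam > 0, so every denominator is strictly positive.
    """
    K = P.shape[1]
    R_hat = P @ Q.T                                   # prediction cache, built once
    for f in range(K):
        R_hat_f = R_hat - np.outer(P[:, f], Q[:, f])  # prediction without factor f
        P[:, f] = ((W * (R - R_hat_f)) @ Q[:, f]) / (W @ (Q[:, f] ** 2) + lam)
        R_hat = R_hat_f + np.outer(P[:, f], Q[:, f])  # refresh the cache
    for f in range(K):
        R_hat_f = R_hat - np.outer(P[:, f], Q[:, f])
        Q[:, f] = ((W * (R - R_hat_f)).T @ P[:, f]) / (W.T @ (P[:, f] ** 2) + lam)
        R_hat = R_hat_f + np.outer(P[:, f], Q[:, f])

rng = np.random.default_rng(0)
M, N, K = 6, 8, 3
R = rng.integers(1, 6, size=(M, N)) * (rng.random((M, N)) < 0.3)   # zeros mark missing entries
W = np.where(R != 0, 1.0, 0.05 + 0.1 * rng.random((M, N)))         # non-uniform weights on missing data
P = rng.normal(scale=0.1, size=(M, K))
Q = rng.normal(scale=0.1, size=(N, K))
history = [loss(R, W, P, Q, lam=0.1)]
for _ in range(20):
    eals_epoch(R, W, P, Q, lam=0.1)
    history.append(loss(R, W, P, Q, lam=0.1))
print(history[0], history[-1])
```

Each update touches all M×N entries for one factor, so a sweep costs on the order of MNK operations, and no K×K matrix is ever inverted.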
III-B Low-Rank Representation on Weights of Missing Data
The original problem of weighted MF assigns an individualized weight to each data point, which requires $O(MN)$ space to store all weights (denoted as the weight matrix W), where $M$ and $N$ are the numbers of rows and columns. This is very costly and unrealistic for large-scale applications. To be more specific, let us consider an intuitive example in recommendation that needs to deal with 1 million users and 1 million items. Assuming we use a 4-byte float type to express a weight, the space to store all weights is $10^{6} \times 10^{6} \times 4$ bytes $= 4 \times 10^{12}$ bytes, i.e., about 4 TB. Such a large consumption of space poses a great challenge for the infrastructure, not to mention that the space cost grows with the product of the number of users and the number of items.
Since storing an individualized weight for each data point is practically infeasible, we consider a more compact way to represent W. Specifically, we perform truncated SVD on W; when the truncation rank equals the rank of W, the two resulting matrices reconstruct W without any error:
$$\mathbf{W} = \mathbf{A}\mathbf{B}^{\top},$$
where $\mathbf{A} \in \mathbb{R}^{M \times Z}$, $\mathbf{B} \in \mathbb{R}^{N \times Z}$, and $Z$ denotes the rank of the weight matrix W. If $Z$ is much smaller than $M$ and $N$, using A and B to reconstruct W takes less space than directly storing W. If $Z$ is large such that the space complexity of $O((M+N)Z)$ is still unaffordable, one can use truncated SVD with a smaller predefined rank, at the cost of approximating W with some error: the smaller the rank, the larger the approximation error. This is a tradeoff between the space cost and the precision of the low-rank representation. In either case, with $Z$ denoting the rank actually used, the space cost to store the weights is $O((M+N)Z)$. Since common SVD solvers such as the Lanczos Bidiagonalization (LBD) method have a complexity linear with respect to the number of observed entries, and $Z$ is usually a small number (e.g., 1 for column-oriented or row-oriented weighting schemes), the cost of truncated SVD is much lower than that of the matrix factorization algorithms (which are shown in Table I). As such, using truncated SVD on the weights will not significantly increase the actual runtime of our method.
In a sparse matrix, the number of missing entries is usually several orders of magnitude larger than the number of observed entries. As such, we apply truncated SVD on the weights of missing entries only, and use the original weights of observed entries as they are. Such a weighting strategy can be expressed as follows:
$$w_{ui} = \begin{cases} w_{ui}, & \text{if } (u,i) \in \mathcal{R}; \\ \mathbf{a}_u^{\top}\mathbf{b}_i, & \text{otherwise}, \end{cases}$$
where $w_{ui}$ denotes the weight of the observed entry $(u,i)$, $\mathcal{R}$ is the set of observed entries, and $\mathbf{a}_u$ and $\mathbf{b}_i$ denote the $u$-th row vector of A and the $i$-th row vector of B, respectively. In the next subsection, we present a fast algorithm to accelerate eALS by leveraging the low-rank structure of the weights of missing data.
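As a small illustration of the low-rank weight representation, the sketch below factorizes a weight matrix with NumPy's dense SVD and splits the singular values evenly between the two factors. The helper names `low_rank_weights` and `approx_error`, the toy sizes, and the even split of singular values are assumptions for illustration; a sparse or iterative solver would be used at scale.

```python
import numpy as np

def low_rank_weights(W_miss, Z):
    """Rank-Z factors A (M x Z) and B (N x Z) such that A @ B.T approximates W_miss."""
    U, s, Vt = np.linalg.svd(W_miss, full_matrices=False)
    root = np.sqrt(s[:Z])                 # split each singular value between A and B
    return U[:, :Z] * root, Vt[:Z].T * root

def approx_error(W, Z):
    A, B = low_rank_weights(W, Z)
    return float(np.linalg.norm(W - A @ B.T))

rng = np.random.default_rng(0)
M, N = 5, 8
popularity = rng.random(N) + 0.1
W_col = np.tile(popularity, (M, 1))       # column-oriented weights: rank 1 by construction
A1, B1 = low_rank_weights(W_col, Z=1)
print(np.allclose(A1 @ B1.T, W_col))      # True: a rank-1 W is reconstructed exactly

W_gen = rng.random((M, N))                # generic weights: error shrinks as Z grows
errors = [approx_error(W_gen, Z) for Z in (1, 3, 5)]
print(errors)
```

Storing `A1` and `B1` takes (M + N)·Z numbers instead of M·N. Note that for a generic W, a truncated approximation is not guaranteed to keep every reconstructed weight non-negative, which is worth checking before use.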
III-C Fast eALS Algorithm
The fast algorithm presented in this part reduces the time complexity from being governed by the matrix size $MN$ to being governed by the number of observed entries $|\mathcal{R}|$, successfully avoiding the heavy burden brought by optimizing over the missing data.
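First, we reformulate the objective function by separating the terms on observed data and missing data. With $\mathbf{a}_u^{\top}\mathbf{b}_i$ denoting the low-rank weight of a missing entry $(u,i)$, whose value $r_{ui}$ is zero, we have:
$$J = \sum_{(u,i)\in\mathcal{R}} w_{ui}\,(r_{ui} - \hat{r}_{ui})^2 + \sum_{(u,i)\notin\mathcal{R}} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}_{ui}^2 + \lambda\left(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\right).$$
As we can see, the first term focuses on the observed data only and leads to a low complexity in optimization. The major cost comes from the second term, which operates on all missing data. Next, we elaborate the derivation for optimizing the user latent factor $p_{uf}$; its counterpart for the item latent factor $q_{if}$ can be derived similarly.
First, computing the derivative of $J$ with respect to $p_{uf}$ and setting it to zero, we obtain the update rule of $p_{uf}$:
$$p_{uf} = \frac{\sum_{i\in\mathcal{R}_u} w_{ui}\,(r_{ui} - \hat{r}^{f}_{ui})\,q_{if} - \sum_{i\notin\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,q_{if}}{\sum_{i\in\mathcal{R}_u} w_{ui}\,q_{if}^2 + \sum_{i\notin\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,q_{if}^2 + \lambda},$$
where $\hat{r}^{f}_{ui} = \hat{r}_{ui} - p_{uf}q_{if}$, i.e., the prediction without the component of latent factor $f$. With careful inspection, we can find that the major cost of executing the update rule comes from the two sums over missing data, i.e., $\sum_{i\notin\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,q_{if}$ (in the numerator) and $\sum_{i\notin\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,q_{if}^2$ (in the denominator). We term the evaluation of these two costly terms the numerator problem and the denominator problem, respectively, and show how to solve both efficiently.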
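1. Solving the numerator problem. First, we write the sum over missing entries as a sum over all entries minus a sum over observed entries, and expand the former using element-wise operations:
$$\sum_{i\notin\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,q_{if} = \sum_{i=1}^{N}\sum_{z=1}^{Z} a_{uz}b_{iz}\sum_{k\neq f} p_{uk}q_{ik}q_{if} - \sum_{i\in\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,q_{if}.$$
Then, we re-arrange the sum operations in the first term and obtain its equivalent form:
$$\sum_{i=1}^{N}\sum_{z=1}^{Z} a_{uz}b_{iz}\sum_{k\neq f} p_{uk}q_{ik}q_{if} = \sum_{k\neq f} p_{uk}\sum_{z=1}^{Z} a_{uz}\sum_{i=1}^{N} b_{iz}q_{ik}q_{if}.$$
As we can see, the main computational bottleneck is the term $\sum_{i=1}^{N} b_{iz}q_{ik}q_{if}$, which needs to scan over all items. Nevertheless, a nice property is that this term is independent of $u$, which means that if we sequentially update all elements in P, and then Q, this term can be pre-computed and reused for the updating of all elements in P without computing it on-the-fly (the reverse also applies). To achieve this, we define a 3-dimensional tensor $\mathbf{S}^{q} \in \mathbb{R}^{Z\times K\times K}$, in which each element is defined as $s^{q}_{zkf} = \sum_{i=1}^{N} b_{iz}q_{ik}q_{if}$; the tensor is computed after updating Q and is cached during the updates of P. With this cache, the numerator problem can be approached as:
$$\sum_{i\notin\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,q_{if} = \sum_{k\neq f} p_{uk}\sum_{z=1}^{Z} a_{uz}\,s^{q}_{zkf} - \sum_{i\in\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,q_{if},$$
whose first term costs only $O(KZ)$ time given the cache, rather than requiring a direct scan over all $N$ items.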
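2. Solving the denominator problem. We can apply a similar cache strategy to the costly sum over missing data in the denominator:
$$\sum_{i\notin\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,q_{if}^2 = \sum_{z=1}^{Z} a_{uz}\,s^{q}_{zff} - \sum_{i\in\mathcal{R}_u} \mathbf{a}_u^{\top}\mathbf{b}_i\,q_{if}^2,$$
where $s^{q}_{zff} = \sum_{i=1}^{N} b_{iz}q_{if}^2$ denotes the $(z,f,f)$-th element of the cache computed from Q.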
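We can apply a similar derivation to the item latent factor $q_{if}$ to obtain its update rule:
$$q_{if} = \frac{\sum_{u\in\mathcal{R}_i} w_{ui}\,(r_{ui} - \hat{r}^{f}_{ui})\,p_{uf} - \sum_{u\notin\mathcal{R}_i} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,p_{uf}}{\sum_{u\in\mathcal{R}_i} w_{ui}\,p_{uf}^2 + \sum_{u\notin\mathcal{R}_i} \mathbf{a}_u^{\top}\mathbf{b}_i\,p_{uf}^2 + \lambda}.$$
To speed up the computation of the two sums over missing data in the numerator and the denominator, we similarly define a 3-dimensional cache computed from P, in which each element is $s^{p}_{zkf} = \sum_{u=1}^{M} a_{uz}p_{uk}p_{uf}$. With this cache, the two costly terms can be efficiently computed as:
$$\sum_{u\notin\mathcal{R}_i} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,p_{uf} = \sum_{k\neq f} q_{ik}\sum_{z=1}^{Z} b_{iz}\,s^{p}_{zkf} - \sum_{u\in\mathcal{R}_i} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}^{f}_{ui}\,p_{uf},$$
$$\sum_{u\notin\mathcal{R}_i} \mathbf{a}_u^{\top}\mathbf{b}_i\,p_{uf}^2 = \sum_{z=1}^{Z} b_{iz}\,s^{p}_{zff} - \sum_{u\in\mathcal{R}_i} \mathbf{a}_u^{\top}\mathbf{b}_i\,p_{uf}^2.$$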
Algorithm 2 summarizes the accelerated algorithm for our eALS method. Since each parameter update of eALS finds the optimal value for the parameter given the current status of the other parameters, the training objective function is guaranteed not to increase during training (we omit the rigorous proof here since it is obvious). As such, for the stopping criterion, one can either check the objective function value, or rely on hold-out validation data to monitor the metrics of interest.
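The memoization idea can be sketched as follows. This is an illustrative NumPy version written for this exposition, not a transcription of Algorithm 2: the function name `fast_eals_update_rows`, the dense storage of R (zeros mark missing entries), and the tensor layout `S[z, k, f]` are assumptions. Sums over missing entries are obtained as a cached sum over all columns minus a correction over the observed columns of each row, so the inner loops only touch observed entries.

```python
import numpy as np

def fast_eals_update_rows(R, W_obs, A, B, P, Q, lam):
    """Update P in place with Q fixed, using low-rank weights a_u^T b_i on missing entries."""
    M, K = P.shape
    # Cache S[z, k, f] = sum_i b_iz * q_ik * q_if; independent of u, computed once per sweep.
    S = np.einsum('iz,ik,if->zkf', B, Q, Q)
    for u in range(M):
        obs = np.flatnonzero(R[u])            # observed columns of row u
        q, r, w = Q[obs], R[u, obs], W_obs[u, obs]
        c = B[obs] @ A[u]                     # low-rank weights at the observed positions
        r_hat = q @ P[u]                      # predictions on observed entries only
        T = np.tensordot(A[u], S, axes=1)     # T[k, f] = sum_z a_uz * S[z, k, f]
        for f in range(K):
            r_hat_f = r_hat - P[u, f] * q[:, f]
            # Sum over ALL columns of (a_u^T b_i) * r_hat^f_ui * q_if, read from the cache.
            all_num = P[u] @ T[:, f] - P[u, f] * T[f, f]
            miss_num = all_num - np.sum(c * r_hat_f * q[:, f])
            miss_den = T[f, f] - np.sum(c * q[:, f] ** 2)
            num = np.sum(w * (r - r_hat_f) * q[:, f]) - miss_num
            den = np.sum(w * q[:, f] ** 2) + miss_den + lam
            P[u, f] = num / den
            r_hat = r_hat_f + P[u, f] * q[:, f]

rng = np.random.default_rng(0)
M, N, K, Z = 6, 9, 3, 2
R = rng.integers(1, 6, size=(M, N)) * (rng.random((M, N)) < 0.3)
W_obs = np.ones((M, N))                       # weights of observed entries
A, B = rng.random((M, Z)), rng.random((N, Z)) # non-negative low-rank weight factors
P = rng.normal(scale=0.1, size=(M, K))
Q = rng.normal(scale=0.1, size=(N, K))
for _ in range(10):
    fast_eals_update_rows(R, W_obs, A, B, P, Q, lam=0.1)        # update P with Q fixed
    fast_eals_update_rows(R.T, W_obs.T, B, A, Q, P, lam=0.1)    # update Q by symmetry
```

The column update reuses the same routine on the transposed problem with the roles of A/B and P/Q swapped, which mirrors the symmetry of the two update rules.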
In this subsection, we discuss several properties of our fast eALS algorithm, including analyzing its time complexity, the fast computation of objective function, and how to do parallel learning.
III-D1 Time Complexity Analysis
The time complexities of the key steps have been annotated in Algorithm 2. In summary, the complexity of one eALS iteration combines the complexity of updating the row latent factors and the complexity of updating the column latent factors. As we can see, even though eALS models all missing data, the essential time complexity is controlled by the number of observed entries $|\mathcal{R}|$, rather than the matrix size $M \times N$. Thus the overall time complexity is in proportion to $|\mathcal{R}|$ and Z, which makes eALS extremely efficient for large-scale applications.
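Table I: Time complexity of one iteration for MF methods that model all missing data.
| Method | Time complexity |
|---|---|
| WALS (Hu et al.) | |
| IALS1 (Pilászy et al.) | |
| ii-SVD (Volkovs et al.) | |
| RCD (Devooght et al.) | |
| eALS (Algorithm 2) | |
$|\mathcal{R}|$ denotes the number of non-zeros in the data matrix R. $M$ and $N$ denote the number of rows and columns of data matrix R, respectively. $K$ denotes the latent dimension of MF. $Z$ denotes the rank size of the weights of missing data.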
There are some other MF methods that model all the missing data. Their time complexities (of one iteration) are shown in Table I. Note that besides our proposed eALS, the other MF methods shown in the table only support uniform weights on missing entries. As such, there is an additional term $Z$ in our method, which denotes the rank size of the weights of missing data. For a fair comparison with other methods in time complexity, we assume $Z$ to be 1 and use eALS to optimize the same objective function. First, our model is $K$ times faster than the vector-wise ALS [22, 35], and it has the same time complexity as RCD. Moreover, it is faster than ii-SVD, another recent solution for item recommendation with implicit feedback. It is worth noting that RCD applies gradient descent on a randomly chosen latent vector to learn a whole-data based MF. To adaptively find a good learning rate for faster convergence, RCD runs a line search in each gradient step. Therefore, the major advantages of eALS over RCD are its high efficiency and simplicity.
III-D2 Fast Computation of the Objective Function
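The value of the objective function is an important indicator of the training process. A direct calculation requires evaluating every entry in the R matrix, which takes $O(MNK)$ time and is very time-consuming. To address this problem, we leverage the low-rank weighting scheme and the intermediate variables cached in Algorithm 2, devising a set of similar element-wise computations on R for acceleration. Here, we reformulate the major cost of the objective function (i.e., the loss on the missing data):
$$\sum_{(u,i)\notin\mathcal{R}} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}_{ui}^2 = \sum_{u=1}^{M}\sum_{i=1}^{N} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}_{ui}^2 - \sum_{(u,i)\in\mathcal{R}} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}_{ui}^2.$$
It is obvious that the major computation comes from the first term, due to the iterations over all rows and columns. Thus we accelerate it with the transformation in Eq. (19):
$$\sum_{u=1}^{M}\sum_{i=1}^{N} \mathbf{a}_u^{\top}\mathbf{b}_i\,\hat{r}_{ui}^2 = \sum_{u=1}^{M}\sum_{z=1}^{Z} a_{uz}\sum_{k=1}^{K}\sum_{f=1}^{K} p_{uk}\,p_{uf}\,s^{q}_{zkf}, \tag{19}$$
where $s^{q}_{zkf} = \sum_{i=1}^{N} b_{iz}q_{ik}q_{if}$ is the cache over columns used in Algorithm 2. As can be seen, with the help of the cache, evaluating this term takes $O(MK^2Z)$ time, which is several orders of magnitude smaller than the direct computation of $O(MNK)$.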
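The identity behind the fast loss computation can be checked numerically. The sketch below compares a direct evaluation of the missing-data loss with a cached evaluation; the function names, the dense storage of R (zeros mark missing entries), and the cache layout `Sq[z, k, f]` are assumptions made for this illustration.

```python
import numpy as np

def missing_loss_direct(R, A, B, P, Q):
    """Sum over missing entries of (a_u^T b_i) * r_hat_ui^2, touching all M*N entries."""
    C = A @ B.T
    R_hat = P @ Q.T
    missing = (R == 0)
    return float(np.sum(C[missing] * R_hat[missing] ** 2))

def missing_loss_fast(R, A, B, P, Q):
    """Same quantity: cached sum over all entries minus a correction on observed entries."""
    Sq = np.einsum('iz,ik,if->zkf', B, Q, Q)             # cache over columns
    all_entries = np.einsum('uz,uk,uf,zkf->', A, P, P, Sq)
    rows, cols = np.nonzero(R)                           # observed entries only
    c_obs = np.sum(A[rows] * B[cols], axis=1)
    r_hat_obs = np.sum(P[rows] * Q[cols], axis=1)
    return float(all_entries - np.sum(c_obs * r_hat_obs ** 2))

rng = np.random.default_rng(0)
M, N, K, Z = 7, 9, 3, 2
R = rng.integers(1, 6, size=(M, N)) * (rng.random((M, N)) < 0.3)
A, B = rng.random((M, Z)), rng.random((N, Z))
P, Q = rng.normal(size=(M, K)), rng.normal(size=(N, K))
direct = missing_loss_direct(R, A, B, P, Q)
fast = missing_loss_fast(R, A, B, P, Q)
print(direct, fast)
```

The fast variant never forms the M×N prediction matrix; apart from building the cache, its cost depends on M, K, Z and the number of observed entries.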
III-D3 Parallel Learning
The key operations of eALS are easily parallelizable. First, the computations of the two caches are based on standard matrix multiplication operations (lines 8 and 20), which are straightforward to parallelize. Second, when updating the latent vectors P (lines 10-17), the cache computed from Q is fixed, and the parameters of different rows are independent of each other. Therefore, it is practicable to update rows in parallel thanks to this independence property. Specifically, eALS can leverage multiple workers to update the model parameters for disjoint sets of rows concurrently. Similarly, this parallel strategy can also be applied when updating the latent vectors Q (lines 22-29).
It is worth noting that the updates in SGD are inherently sequential and hard to separate, so parallelizing SGD requires sophisticated strategies to control the accuracy loss caused by conflicting updates, and this loss constrains the achievable degree of parallelism. Thus the ease of parallelization is an important advantage of our proposed eALS over the commonly used SGD learner. With coordinate descent, by parallelizing the key operations, our proposed eALS is embarrassingly parallel without any approximation loss.
IV Experiments
In this section, we perform experiments to verify the correctness, efficiency, and effectiveness of our fast eALS algorithm. All experiments are conducted on two real-world rating datasets, which are commonly used in recommendation systems. We first introduce the experimental settings, followed by the verification of correctness and efficiency, and the effectiveness of eALS in recommendation by modeling missing data with non-uniform weights. Note that part A, D and Table IV of this section have been presented in the preliminary version , and other parts are new.
IV-A Experimental Settings
Two publicly accessible rating datasets are selected to evaluate the methods: Yelp (we used the Yelp Challenge dataset downloaded in October 2015, which contained 1.6 million reviews) and Amazon Movies (http://snap.stanford.edu/data/web-Amazon-links.html). The Yelp dataset contains users' ratings on businesses (most of which are restaurants) and the Amazon dataset contains users' ratings on movies, where each rating is in the range of 1 to 5. We construct the data matrix by defining each row as a user, each column as an item, and each entry as the user's rating score on the item; if a user did not rate an item, the corresponding entry in R is defined as missing data. Table II summarizes the statistics of our experimented datasets (which can be downloaded from https://github.com/hexiangnan/sigir16-eals/tree/master/data). We can see that both datasets are extremely sparse with the sparsity ratio over . This provides empirical evidence on the necessity of using low-rank weights rather than storing the whole weight matrix as it is. Take the Amazon dataset as an example: storing the whole dataset matrix takes the space of  (), which is very space-consuming.
IV-A2 Evaluation Protocols
We split the data into a training set and a testing set using the leave-one-out protocol, a widely used method in recommendation papers [20, 39]. Specifically, the last rating of each user is held out for testing, and the remaining data are used for training.
Besides validating the model training with the loss value, we also evaluate the performance with Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), which judge the ranking quality of top-N recommendation; here we set N to 100 unless otherwise mentioned. More specifically, for a user, we rank all unrated items by their prediction scores, and take the top-100 items as the recommended list. If the testing item appears in the recommended list, it is treated as a hit, and the HR is set to 1. We can see that HR does not account for the position of a hit: as long as the testing item appears in the recommended list, it is treated as a success. NDCG addresses this deficiency by assigning higher rewards to hits at top positions and scoring successively lower-position hits with marginal fractional utility. More details about the two metrics can be found in .
Since the evaluation on each user produces a ranking list, we calculate HR and NDCG for each user and report the average score of all users. Clearly, a higher score denotes a better performance, and both measures are in the range of 0 to 1.
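For the leave-one-out protocol, both metrics reduce to simple per-user formulas. The sketch below assumes a single held-out item per user with binary relevance, so that the ideal DCG is 1 and NDCG equals 1/log2(position + 2) for a 0-based hit position; this is a common convention and may differ in detail from the exact definition used in the cited reference. The function name and arguments are hypothetical.

```python
import math

def hr_ndcg_at_n(scores, test_item, rated_items, n=100):
    """HR@n and NDCG@n for one user with a single held-out test item.

    scores: predicted score for every item; rated_items: training items to exclude.
    """
    candidates = [i for i in range(len(scores)) if i not in rated_items]
    ranked = sorted(candidates, key=lambda i: -scores[i])[:n]
    if test_item not in ranked:
        return 0.0, 0.0
    position = ranked.index(test_item)          # 0-based rank of the hit
    return 1.0, 1.0 / math.log2(position + 2)   # ideal DCG is 1 for a single relevant item

scores = [0.9, 0.1, 0.8, 0.7]
print(hr_ndcg_at_n(scores, test_item=2, rated_items={0}, n=2))   # hit at the top position
print(hr_ndcg_at_n(scores, test_item=3, rated_items={0}, n=2))   # hit at the second position
print(hr_ndcg_at_n(scores, test_item=1, rated_items={0}, n=2))   # miss
```

Averaging the two returned values over all users gives the reported HR and NDCG.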
IV-B Correctness Verification
Here we verify the correctness of our proposed eALS algorithm. Since SVD is known to optimize the unweighted squared loss and can find the global minimum, we set up eALS to optimize the same objective function as SVD to verify its correctness. Specifically, we set $w_{ui}$ for all observed entries to 1, the regularization parameter $\lambda$ to 0, and A and B to vectors in which all entries are 1; the number of latent factors is set to 64 for both methods. For SVD, we use the Python toolkit sparsesvd (https://pypi.python.org/pypi/sparsesvd/), which solves the SVD problem with the LBD method. For eALS, we run it for 100 iterations.
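The reference point for this check is that a rank-K truncated SVD attains the minimum unweighted squared loss, which equals the sum of the squared discarded singular values. The following sketch illustrates this with NumPy's dense `numpy.linalg.svd` on a toy matrix; it is a stand-in for illustration and not the sparsesvd-based setup used in the experiments. The helper name `svd_factors` and the toy data are assumptions.

```python
import numpy as np

def svd_factors(R, K):
    """P and Q such that P @ Q.T is the best rank-K approximation of R in squared error."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :K] * s[:K], Vt[:K].T

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(8, 10)).astype(float)
K = 3
P, Q = svd_factors(R, K)
loss_svd = float(np.sum((R - P @ Q.T) ** 2))            # unweighted loss, no regularization
singular_values = np.linalg.svd(R, compute_uv=False)
print(loss_svd, float(np.sum(singular_values[K:] ** 2)))

# Any other rank-K factors, e.g. random ones, cannot achieve a lower loss.
P_rand, Q_rand = rng.normal(size=(8, K)), rng.normal(size=(10, K))
loss_rand = float(np.sum((R - P_rand @ Q_rand.T) ** 2))
```

An eALS run configured with unit weights everywhere and no regularization should approach `loss_svd` from above as training proceeds.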
shows the loss achieved by the two methods on the training set and their HR and NDCG scores on the testing set. We can see that eALS achieves almost identical performance to SVD. On Yelp, the training loss of eALS is slightly higher than that of SVD, which is caused by insufficient training of eALS: eALS updates model parameters iteratively, while SVD finds the global optimum with a closed-form solution. The t-tests show no significant difference between the two methods. On Amazon, eALS sufficiently converges in 100 iterations, and both training loss and testing scores show that eALS achieves the same performance as SVD. To verify that eALS finds exactly the same solution as SVD, we further employ the point-wise measure mean absolute error (MAE) to evaluate the difference between the two methods’ predictions on observed entries: on Yelp, the MAE is ; and on Amazon, the MAE is . Such a tiny error is acceptable, considering that eALS and SVD are implemented in different programming languages with different float precision settings. Overall, these results verify the correctness of our fast eALS algorithm.
Furthermore, we empirically test whether our fast eALS method (Algorithm 2) speeds up the vanilla eALS (Algorithm 1) without sacrificing accuracy. To rule out randomness that could affect the results, we apply the same initialization to their model parameters . Figure 1 shows the training process of the two algorithms with the number of latent factors set to 1 and 5 on Yelp. We can see that the fast eALS algorithm obtains exactly the same result as the vanilla eALS, including the training loss of each epoch. This is as expected, since we derive the fast eALS with rigorous mathematical operations and without any approximation. This further justifies the correctness of the fast eALS algorithm, which is non-trivial to implement since it maintains several caches for speed-up that need to be carefully updated. Interested readers can check out our implementation at: https://github.com/duxy-me/ext-als.
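The kind of cache involved can be illustrated in the simplest setting, where the weight on a missing entry depends only on the item. The sketch below is not Algorithm 2; it only shows, under that assumption and with hypothetical names (Q for item factors, c for item weights, p for one user's factors), why a cached K x K matrix gives exactly the same value as a naive sum over all items, with no approximation.

```python
def weighted_gram(Q, c):
    """Cache S = sum_i c[i] * q_i q_i^T, a K x K matrix shared by all users."""
    K = len(Q[0])
    S = [[0.0] * K for _ in range(K)]
    for q, w in zip(Q, c):
        for f in range(K):
            for k in range(K):
                S[f][k] += w * q[f] * q[k]
    return S

def naive_term(p, Q, c):
    """sum_i c[i] * (p . q_i)^2, visiting every item: O(N * K) per user."""
    return sum(w * sum(pf * qf for pf, qf in zip(p, q)) ** 2
               for q, w in zip(Q, c))

def cached_term(p, S):
    """The same quantity as p^T S p: O(K^2) per user once S is cached."""
    K = len(p)
    return sum(p[f] * S[f][k] * p[k] for f in range(K) for k in range(K))

# Toy factors (made up for illustration): 3 items, K = 2.
Q = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
c = [0.1, 0.2, 0.3]
p = [0.4, -0.7]

S = weighted_gram(Q, c)
print(naive_term(p, Q, c), cached_term(p, S))
```

The two printed values agree up to floating-point rounding, which mirrors the observation that a cache-based learner reproduces the result of the vanilla one. The cost of the cached form no longer depends on the number of items, but the cache must be refreshed whenever an item factor changes.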
IV-C Efficiency Study
We investigate the actual speedup brought by our design of the fast eALS algorithm. All experiments in this subsection are run on the same machine (Intel Xeon 2.67GHz CPU and 24GB RAM) for fair comparison on the efficiency. Figure 2 shows the training time of the vanilla eALS and our fast eALS with different settings of and . In the figure, the x-axis denotes the setting of , the y-axis denotes the training time per iteration, and different lines indicate different settings of . We have the following key observations:
The fast eALS is about two orders of magnitude faster than the vanilla eALS algorithm. For example, on Yelp, the vanilla eALS takes 98 seconds to train a small model of , while the fast eALS takes only 0.8 seconds to train the same model even with a large of 64. The speedup is more significant on the larger Amazon data, where the vanilla eALS takes over 42,000 seconds (i.e., about half a day) to train a model of , while the fast eALS takes only 300 seconds to train the same model with a large of 64. This acceleration is over 100 times, which is highly valuable in practice and is difficult to achieve with engineering efforts alone. Intuitively, to obtain the same speedup by parallelization, one would need over 100 machines and an effective distributed system with negligible network cost and linear scale-up in the number of machines, which is very difficult to achieve in practice .
The running time of fast eALS exhibits a linear relationship with respect to , which can be seen clearly from the inset box of the figure. For example, on Yelp, for , the y-axis value of is twice that of , and the same holds for the Amazon dataset. Moreover, the running time exhibits a quadratic relationship with respect to . These results are as expected, verifying the analytical time complexity of the fast eALS algorithm: .
Furthermore, we compare the efficiency of the fast eALS algorithm with two whole data-based MF methods — the Randomized block Coordinate Descent (RCD)  method and the WALS method . Since both methods only support uniform weighting on missing data, we set the of eALS to 1 for a fair comparison. Table IV shows the average training time per iteration of the three methods.
In Table IV, training times are reported in seconds, minutes, and hours.
Analytically, WALS has the time complexity of , while eALS and RCD have the same time complexity, which is times smaller than that of WALS. As can be seen from the table, with the increase of , WALS takes much longer than eALS and RCD. Specifically, when is 512, WALS requires 11.6 hours for one iteration on Amazon, while eALS takes only 12 minutes. Although eALS is not empirically shown to be times faster than ALS, owing to the more efficient matrix inversion implementation (we used the fastest known algorithm  with time complexity around ), the speed-up is already very significant. Moreover, as RCD and eALS have the same analytical time complexity, their actual running times are of the same order of magnitude; the minor differences can be caused by implementation details, such as the data structures used.
IV-D Effectiveness in Item Recommendation
In this subsection, we explore the effectiveness of eALS in the real-world task of item recommendation. As mentioned in Section IV-A, this is a personalized ranking task, and we employ the leave-one-out protocol to evaluate the performance with NDCG and HR. We aim to answer the following research questions:
RQ1. Is the non-uniform weighting strategy on missing data effective in improving performance?
RQ2. How does eALS perform compared with existing whole data-based MF methods for recommendation?
Next, we describe experimental results to answer the two questions. Note that our findings are consistent across the number of latent factors , thus we show the results of only, a relatively large number that retains good model capability.
RQ1: Non-Uniform Weights on Missing Data. Existing MF methods for recommendation assign a uniform weight to missing data for the ease of efficient optimization. This implies that all unrated items for a user have an equal probability of being negative, which may not be true. Since the visual interfaces of many Web systems tend to showcase popular items, when all other factors are equal, popular items are more likely to be known by users in general. As such, it is reasonable to think that a miss on a popular item is more likely to be truly irrelevant (as opposed to unknown) to the user. To account for this effect, we design the weights for missing entries based on item popularity:
where denotes the frequency of item , in the training set: . The weight for each observed entry (i.e., in Eq. (9)) is set to 1, and is a hyper-parameter that determines the overall weight of missing data. The exponent controls the significance level of popular items over unpopular ones: when the weights of popular items are promoted to strengthen the difference against unpopular ones, while setting within the lower range of suppresses the weight of popular items and has a smoothing effect. This weighting strategy basically assumes that the missing entries of popular items carry more negative signal. Clearly, the rank of such a weight matrix (on missing entries only) is 1, and with truncated SVD, we can get its low-rank representation as:
Our fast eALS implementation is initialized with this setting of the weights on missing entries. It is worth noting that other non-uniform weighting strategies can also be applied here, such as the user-oriented scheme proposed in , or we can combine user-oriented and item-oriented schemes in Eq. (21). Since the aim of this experiment is to show the effectiveness of customizing weights in eALS, rather than demonstrating state-of-the-art recommendation performance, we leave further exploration of the weighting scheme as future work. Note that this initialization leads to the same recommendation method as our preliminary work . As such, the results presented below are also the same.
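A popularity-based weighting of this kind can be sketched as below. The exact formula is an assumption here: we take the form c_i = c_0 f_i^alpha / sum_j f_j^alpha, with f_i the normalized frequency of item i, as used in the preliminary work; the names popularity_weights, item_counts, c0, and alpha are hypothetical. The sketch also shows why the weight matrix on missing entries has rank 1: it is the outer product of an all-ones user vector and the item weight vector.

```python
def popularity_weights(item_counts, c0, alpha):
    """Assumed form: c_i = c0 * f_i**alpha / sum_j f_j**alpha."""
    total = sum(item_counts)
    f = [n / total for n in item_counts]       # normalized item frequency
    powered = [x ** alpha for x in f]
    norm = sum(powered)
    return [c0 * x / norm for x in powered]

def weight_matrix(num_users, item_weights):
    """Outer product of an all-ones user vector and the item weights."""
    ones = [1.0] * num_users
    return [[a * w for w in item_weights] for a in ones]

# Toy interaction counts (made up for illustration): 3 items.
item_counts = [8, 1, 1]

uniform = popularity_weights(item_counts, c0=3.0, alpha=0.0)
skewed = popularity_weights(item_counts, c0=3.0, alpha=1.0)
W = weight_matrix(num_users=2, item_weights=skewed)
print(uniform, skewed)
```

With alpha set to 0, every item receives c0 divided by the number of items, that is, a uniform weight; with a positive alpha, the popular first item receives a larger share of the same total c0. Every row of W is identical, so the matrix has rank 1.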
, where the weights of missing data follow a uniform distribution (controlled by); we vary to study how the overall weight of missing data affects the performance. The optimal on Yelp (Figure 3(a)) is around 512, and that on Amazon (Figure 3(c)) is around 64. Correspondingly, the weights of each zero entry are and respectively (). Both datasets exhibit similar patterns: when is smaller than the optimal value, the performance drops significantly. In other words, when the weights of zero entries are close to 0, the performance degrades, which reflects the importance of weighting the missing data. Moreover, too large also leads to poor performance. That is why the traditional SVD technique , which assigns the same weight to all entries, is suboptimal here.
Then, we vary with the optimal (in the case of ) to check the performance change. As demonstrated in Figures 3(b) and 3(d), the optimal is around 0.4 on both datasets. Below 0.4, the performance of eALS gradually improves as increases, but when increases above 0.5, the performance of eALS becomes worse. This reveals that weighting missing data according to item popularity is important for recommendation. We further verify the improvement with the one-sample paired t-test. The results (p-value ) for both metrics on the two datasets indicate the effectiveness of our method.
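For reference, the statistic behind a paired t-test can be computed as in the following sketch. It assumes that the paired samples are per-user scores of two methods, which is our reading rather than something stated above; the function name and the toy scores are hypothetical. Only the t statistic and the degrees of freedom are computed; turning them into a p-value requires the CDF of Student's t distribution, which the Python standard library does not provide.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic of a one-sample t-test on the paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1                          # statistic, degrees of freedom

# Toy per-user scores (made up for illustration).
method_a = [0.5, 0.7, 0.6, 0.9]
method_b = [0.4, 0.5, 0.5, 0.6]
t_stat, dof = paired_t_statistic(method_a, method_b)
print(t_stat, dof)
```

A large positive statistic indicates that the first method scores consistently higher than the second across the paired samples.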
In the following experiments, we fix and according to the best performance evaluated by HR, i.e., for Yelp and for Amazon.
RQ2: Performance Comparison. We compare eALS with two whole data-based MF methods which are originally designed for the item recommendation task:
- WALS . This is the weighted ALS method that optimizes the whole-data based MF. It assigns the same weight to all missing data. We tuned the carefully and reported the best performance.
- RCD . This is the state-of-the-art implicit MF method that optimizes the same objective function as ALS but with a faster coordinate descent learner. We similarly tuned the ; for the line search related parameters, we use the suggested values in the authors’ implementation (https://github.com/rdevooght/MF-with-prior-and-updates).
The recommendation accuracy of each training iteration is shown in Figure 4. All improvements are statistically significant as evidenced by the one-sample paired t-test (). First of all, eALS performs best upon convergence. We believe that the improvements come mainly from the weighting strategy on the missing data: our proposed eALS assigns adaptive weights to the missing data, while both WALS and RCD apply uniform weights.
Second, RCD converges more slowly than eALS and WALS. This is caused by the difference between global optimization and local optimization. In each iteration, RCD updates to a suboptimal point, which may move in a wrong direction relative to the global optimum. The effect of local optimization is also verified in Figure 4(d): RCD attains high NDCG in a short time, while its low HR and the later oscillation demonstrate that the high NDCG of RCD is unstable. Another reason may be RCD’s adaptive strategy. In the early iterations, the optimizer tends to learn rapidly by using a large learning rate, which may lead to unexpected (i.e., suboptimal) results. Nevertheless, WALS outperforms RCD in most situations, which suggests that ALS is better than a gradient descent learner in these tasks.
V Related Work
Matrix Factorization is a representative method that represents a data matrix as two low-dimensional matrices. The decomposition process can distill co-occurrence patterns in data . Moreover, the reconstructed low-rank model can be used to recover missing information, as in recommendation, where it predicts users’ ratings on unknown items [4, 44].
However, in many real-world applications, the data matrices can be highly sparse. For example, in recommendation, handling missing data is particularly important for learning from implicit data, since the missing entries provide valuable negative signal. Along this line, we can categorize previous works into two types: sample-based learning and whole-data based learning:
- The first type samples negative instances from missing data [20, 10, 45, 46]. For example, the BPR method proposed by Rendle et al.  randomly samples negative instances from missing entries, maximizing the margin between the model prediction of observed entries and that of sampled negatives. Recently, He et al.  developed adversarial training methods for BPR to increase the robustness of the learned model. With negative sampling, the number of negative instances is greatly reduced, so the overall time complexity is controllable . However, the downside is that these methods usually have a slower convergence rate and their performance is highly dependent on the design of the sampler [18, 25, 24].
- The second type treats all missing entries as negative instances [22, 34, 15, 47]. For example, the WALS method proposed by Hu et al.  models all missing entries as negative instances with a label of 0, assigning them a lower weight in point-wise regression learning. Recently, Ding et al.  developed a pairwise learning framework to model the margin between observed entries (based on view histories) and all missing entries. These methods model negative instances with a higher coverage, but the downside is that the learning algorithm could be much slower.
To pursue model effectiveness, we focus on whole-data based learning in this work, aiming to develop an efficient solution to address the inefficiency issue. For this line of research, several previous efforts have been made, such as [34, 23, 22, 36, 34, 37]. We find that these methods have a common limitation: they weight all missing entries with the same weight. This design is mainly motivated by efficiency, since fast learning algorithms can be obtained with this constraint. However, it decreases the modeling flexibility and may result in suboptimal performance. The works that are closest to ours are [35, 15, 28], which consider applying non-uniform weights on missing entries. However, these methods support only simple weighting schemes, either row-based or column-based, and cannot be extended to more complex schemes. This work addresses the research gap by developing efficient learning algorithms for any weighting scheme on missing data. Lastly, it is worth noting that the algorithm proposed in the recent work  is a special case of our fast eALS method, since it can be exactly recovered by setting the weights on missing entries to be user-oriented.
In this paper, we studied the problem of learning MF with non-uniform weights on missing data. Targeting the squared loss, we first proposed to apply ALS optimization at the element level, namely, eALS. To address the efficiency challenge in solving the weighted MF problem, we then proposed a low-rank weighting strategy on missing data, which not only saves the space for storing weights but also allows us to further speed up the eALS method. To this end, we developed a fast eALS algorithm through a careful use of memoization caches, for which the time complexity is determined by the number of observed entries only rather than the whole data matrix. We conducted extensive experiments on two public rating datasets, verifying the correctness, efficiency, and effectiveness of our proposed fast eALS method.
We believe that optimizing MF with missing data is a fundamental problem in learning on sparse matrices. While most existing works assign a uniform weight to missing data, this work opens the door to designing complex weighting schemes for missing data. This will benefit a wide variety of tasks that can be solved with MF. In the future, we plan to extend eALS to MF with side information, such as spatial contexts , user reviews , visual content , and knowledge graphs. Moreover, we will consider applying non-uniform weights for missing data to more generic embedding models, such as collective factorization  and neural factorization machines . Lastly, we are interested in applying our method to other tasks, such as knowledge graph completion and word representation learning.
The authors thank the anonymous reviewers for their reviewing efforts. This research is supported by the National Natural Science Foundation of China (Grant No. 61772275, 61732007, 61321491, 61202320, 61501063), the Outstanding Youth Science Foundation (No. 61722204), the Scientific Research Foundation of Science and Technology Department of Sichuan Province (Grant No. 2016JY0240), and the Collaborative Innovation Center of Novel Software Technology and Industrialization. This research is also part of NExT++, supported by the National Research Foundation, Prime Minister’s Office, Singapore under its IRC@Singapore Funding Initiative. This work is a significant extension of , which appeared in the Proceedings of SIGIR 2016.
-  X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua, “Fast matrix factorization for online recommendation with implicit feedback,” in SIGIR 2016, pp. 549–558.
-  Z. Ma, A. E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational bayesian matrix factorization for bounded support data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 876–889, 2015.
-  C. Li, J. Xing, A. Sun, and Z. Ma, “Effective document labeling with very few seed words: A topic model approach,” in CIKM, 2016, pp. 85–94.
-  X. Luo, M. Zhou, S. Li, Z. You, Y. Xia, and Q. Zhu, “A nonnegative latent factor model for large-scale sparse matrices in recommender systems via alternating direction method,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 3, pp. 579–592, 2016.
-  J. Tang, X. Shu, G. Qi, Z. Li, M. Wang, S. Yan, and R. Jain, “Tri-clustered tensor completion for social-aware image tag refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, pp. 1662–1674, 2017.
-  J. C. S. de Souza, T. M. L. Assis, and B. C. Pal, “Data compression in smart distribution systems via singular value decomposition,” IEEE Transactions on Smart Grid, vol. 8, no. 1, pp. 275–284, 2017.
-  X. He, M.-Y. Kan, P. Xie, and X. Chen, “Comment-based multi-view clustering of web 2.0 items,” in Proc. of WWW ’14, 2014, pp. 771–782.
-  Z. Ma, Y. Lai, W. B. Kleijn, Y.-Z. Song, L. Wang, and J. Guo, “Variational bayesian learning for dirichlet process mixture of inverted dirichlet distributions in non-gaussian image feature modeling,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2018.
-  C. Li, Y. Duan, H. Wang, Z. Zhang, A. Sun, and Z. Ma, “Enhancing topic modeling for short texts with auxiliary word embeddings,” ACM Transactions on Information Systems, vol. 36, no. 2, p. 11, 2017.
-  X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” in WWW 2017, pp. 173–182.
-  W. Lei, X. Wang, M. Liu, I. Ilievski, X. He, and M. Kan, “SWIM: A simple word interaction model for implicit discourse relation recognition,” in IJCAI, 2017, pp. 4026–4032.
-  J. Tang, X. Shu, Z. Li, G.-J. Qi, and J. Wang, “Generalized deep transfer networks for knowledge propagation in heterogeneous domains,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 12, no. 4s, p. 68, 2016.
-  Z. Zhao, H. Lu, D. Cai, X. He, and Y. Zhuang, “User preference learning for online social recommendation,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 9, pp. 2522–2534, 2016.
-  L. Liao, X. He, H. Zhang, and T.-S. Chua, “Attributed social network embedding,” IEEE Transactions on Knowledge and Data Engineering, 2018.
-  F. Yuan, X. Xin, X. He, G. Guo, W. Zhang, C. Tat-Seng, and J. M. Jose, “fBGD: Learning embeddings from positive unlabeled data with BGD,” in UAI, 2018.
-  Y. Koren, “Factorization meets the neighborhood: A multifaceted collaborative filtering model,” in KDD 2008, pp. 426–434.
-  P. Cremonesi, Y. Koren, and R. Turrin, “Performance of recommender algorithms on top-n recommendation tasks,” in RecSys 2010, pp. 39–46.
-  X. Xin, F. Yuan, X. He, and J. Jose, “Allvec: Learning word representations without negative sampling,” in ACL 2018, pp. 1853–1862.
-  J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T.-S. Chua, “Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention,” in SIGIR 2017, pp. 335–344.
-  S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” in UAI 2009, pp. 452–461.
-  C. Yang, C. Zhang, X. Chen, J. Ye, and J. Han, “Did you enjoy the ride: Understanding passenger experience via heterogeneous network embedding,” in ICDE, 2018.
-  Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in ICDM 2008, pp. 263–272.
-  R. Devooght, N. Kourtellis, and A. Mantrach, “Dynamic matrix factorization with priors on unknown values,” in KDD 2015, pp. 189–198.
-  S. Rendle and C. Freudenthaler, “Improving pairwise learning for item recommendation from implicit feedback,” in WSDM 2014, pp. 273–282.
-  J. Ding, F. Feng, X. He, G. Yu, Y. Li, and D. Jin, “An improved sampler for bayesian personalized ranking by leveraging view data,” in WWW 2018, pp. 13–14.
-  V. Klema and A. Laub, “The singular value decomposition: Its computation and some applications,” IEEE Transactions on automatic control, vol. 25, no. 2, pp. 164–176, 1980.
-  N. Srebro and T. Jaakkola, “Weighted low-rank approximations,” in ICML 2003, pp. 720–727.
-  H. Li, X. Diao, J. Cao, and Q. Zheng, “Collaborative filtering recommendation based on all-weighted matrix factorization and fast optimization,” IEEE Access, vol. 6, pp. 25248–25260, 2018.
-  Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma, “Explicit factor models for explainable recommendation based on phrase-level sentiment analysis,” in SIGIR, 2014, pp. 83–92.
-  S. Wang, J. Tang, Y. Wang, and H. Liu, “Exploring hierarchical structures for recommender systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 6, pp. 1022–1035, 2018.
-  B. M. Marlin, R. S. Zemel, S. Roweis, and M. Slaney, “Collaborative filtering and the missing at random assumption,” in UAI 2007, pp. 267–276.
-  L. Komzsik, The Lanczos method: evolution and application. SIAM, 2003.
-  S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme, “Fast context-aware recommendations with factorization machines,” in SIGIR 2011, pp. 635–644.
-  I. Bayer, X. He, B. Kanagal, and S. Rendle, “A generic coordinate descent framework for learning from implicit feedback,” in WWW 2017, pp. 1341–1350.
-  R. Pan, Y. Zhou, B. Cao, N. Liu, R. Lukose, M. Scholz, and Q. Yang, “One-class collaborative filtering,” in ICDM 2008, pp. 502–511.
-  I. Pilászy, D. Zibriczky, and D. Tikk, “Fast als-based matrix factorization for explicit and implicit feedback datasets,” in RecSys 2010, pp. 71–78.
-  M. Volkovs and G. W. Yu, “Effective latent models for binary feedback in recommender systems,” in SIGIR 2015, pp. 313–322.
-  R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in KDD 2011, pp. 69–77.
-  X. He, Z. He, X. Du, and T. Chua, “Adversarial personalized ranking for recommendation,” in SIGIR, 2018, pp. 355–364.
-  S. Rendle, D. Fetterly, E. J. Shekita, and B.-y. Su, “Robust large-scale machine learning in the cloud,” in KDD 2016, pp. 1125–1134.
-  D. Coppersmith and S. Winograd, “Matrix multiplication via arithmetic progressions,” in STOC 1987, pp. 1–6.
-  X. He, M. Gao, M.-Y. Kan, Y. Liu, and K. Sugiyama, “Predicting the popularity of web 2.0 items based on user comments,” in SIGIR 2014, pp. 233–242.
-  Z. Ma, J. Xue, A. Leijon, Z. Tan, Z. Yang, and J. Guo, “Decorrelation of neutral vector variables: Theory and applications,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 1, pp. 129–143, 2018.
-  H. Zhang, F. Shen, W. Liu, X. He, H. Luan, and T.-S. Chua, “Discrete collaborative filtering,” in SIGIR 2016, pp. 325–334.
-  X. He, X. Du, X. Wang, F. Tian, J. Tang, and T. Chua, “Outer product-based neural collaborative filtering,” in IJCAI, 2018, pp. 2227–2233.
-  Y. Zhang, Q. Ai, X. Chen, and W. B. Croft, “Joint representation learning for top-n recommendation with heterogeneous information sources,” in CIKM, 2017, pp. 1449–1458.
-  J. Ding, G. Yu, X. He, Y. Quan, Y. Li, T. Chua, D. Jin, and J. Yu, “Improving implicit recommender systems with view data,” in IJCAI, 2018, pp. 3343–3349.
-  C. Yang, L. Bai, C. Zhang, Q. Yuan, and J. Han, “Bridging collaborative filtering and semi-supervised learning: a neural approach for poi recommendation,” in KDD, 2017, pp. 1245–1254.
-  X. He, T. Chen, M.-Y. Kan, and X. Chen, “Trirank: Review-aware explainable recommendation by modeling aspects,” in CIKM 2015, pp. 1661–1670.
-  S. Wang, Y. Wang, J. Tang, K. Shu, S. Ranganath, and H. Liu, “What your images reveal: Exploiting visual contents for point-of-interest recommendation,” in WWW, 2017, pp. 391–400.
-  Q. Ai, V. Azizi, X. Chen, and Y. Zhang, “Learning heterogeneous knowledge base embeddings for explainable recommendation,” Algorithms, no. 137, 2018.
-  X. He and T. Chua, “Neural factorization machines for sparse predictive analytics,” in SIGIR, 2017, pp. 355–364.