Many tasks in data mining and related fields can be formalized as matching between objects in two heterogeneous domains, including collaborative filtering, link prediction, image tagging, and web search. Machine learning techniques, referred to as learning-to-match in this paper, have been successfully applied to these problems. Among them, a class of state-of-the-art methods, named feature-based matrix factorization, formalizes the task as an extension of matrix factorization by incorporating auxiliary features into the model. Unfortunately, making those algorithms scale to real-world problems is challenging, and simple parallelization strategies fail due to the complex cross-talk patterns between sub-tasks. In this paper, we tackle this challenge with a novel parallel and efficient algorithm for feature-based matrix factorization. Our algorithm, based on coordinate descent, can easily handle hundreds of millions of instances and features on a single machine. The key recipe of this algorithm is an iterative relaxation of the objective to facilitate parallel updates of parameters, with guaranteed convergence on minimizing the original objective function. Experimental results demonstrate that the proposed method is effective on a wide range of matching problems, with efficiency significantly improved over the baselines while accuracy is retained.
Many application tasks can be formalized as matching between objects in two heterogeneous domains, in which the associations between some objects and information on those objects are given. We refer to the objects from one domain as queries and those from the other as targets, with the distinction usually clear from the context. For example, in collaborative filtering, given some items, one tries to find the users who best match the items, using the preferences of some users on some items as well as the features of users and items. Another example is image tagging, in which one wants to associate tags (keywords) with images based on some tagged images as well as the features of tags and images. Recent years have seen great success in employing machine learning techniques, referred to as learning-to-match in this paper, to solve such matching problems.
Among existing approaches, a family of factorization models that use feature spaces to encode additional information stands out as state-of-the-art in matching tasks. Examples include factorization machines [17, 18], feature-based latent factor models for link prediction [3, 14], and regression-based latent factor models. We refer to this class of methods as feature-based matrix factorization (FMF) in this paper. The basic idea of FMF is to formalize the task as an extension of plain matrix factorization that incorporates the features of objects into the model. In this way, one can make full use of the information available in the task to improve accuracy. In fact, FMF is the best performer on many real-world matching tasks. In collaborative filtering, FMF models using user feedback [9, 10], attributes [1, 23], and content [3, 24] have outperformed other models, including plain matrix factorization. In web search, FMF models for calculating matching scores (relevance) between queries and documents have significantly enhanced relevance ranking [25, 26]. FMF models have also been successfully employed in link prediction, and have been adopted by the champion teams in KDD Cup 2012 [5, 18].
The learning of the FMF model can be conducted with a coordinate descent algorithm or a stochastic gradient descent algorithm. Since a matching problem is usually of a very large scale, with hundreds of millions of objects and features or more, it can easily become hard for FMF to manage. It is therefore necessary to develop a parallel and efficient algorithm for FMF. This is exactly the problem we attempt to address in this paper.
Making FMF scalable and efficient is much more difficult than it appears, due to the following two challenges. First, training requires simultaneous access to all the features, and thus the existing techniques for parallelization of matrix factorization [8, 27, 29] are not directly applicable. Second, the computational complexity of the coordinate descent algorithm is still too high, and it can easily fail to run on a single machine when the scale of the problem becomes large, calling for techniques that significantly accelerate the computation. By making use of repeating patterns, coordinate descent can be scaled up for the least-squares and probit losses, but this approach provides no guarantee for general convex loss functions. Existing parallel coordinate descent algorithms cannot be directly applied here because of the complex feature dependencies. The Hogwild! algorithm for parallel stochastic gradient descent can be applied here, but it is a generic algorithm and thus is still inefficient for FMF.
In this paper, we try to tackle the two challenges by developing a parallel and efficient algorithm tailored for learning-to-match. The algorithm, referred to as parallel and efficient algorithm for learning-to-match (PL2M), parallelizes and accelerates the coordinate descent algorithm through (1) iteratively relaxing the objective to facilitate parallel updates of parameters, and (2) avoiding repeated calculations caused by features. The main contributions of this paper are as follows.
We propose the parallel and efficient algorithm for feature-based matrix factorization, which iteratively relaxes the objective for parallel updates of parameters, and neatly avoids repeated calculations caused by features, for any general convex loss functions.
We theoretically prove the convergence of the proposed algorithms on minimizing the original objective function, which is further verified by our extensive experiments. The parallel algorithm can automatically adjust the rate of parallel updates according to the conditions in learning.
We empirically demonstrate the effectiveness and efficiency of the proposed algorithm on four benchmark datasets. The parallel algorithm achieves nearly linear speedup, and the proposed acceleration helps the parallel algorithm run about 5 times faster than the Hogwild! algorithm on average when run with multiple threads.
Given the importance of the FMF models and the difficulty of their parallelization, the work in this paper represents a significant contribution to the study of learning-to-match. To the best of our knowledge, this is the first effort on the scalability of general FMF models.
The rest of the paper is organized as follows. Section 2 gives a formal description of the generalized matrix factorization and Section 3 explains the efficient coordinate descent algorithm. Section 4 describes parallelization of the coordinate descent algorithm. Related work is introduced in Section 5. Experimental results are provided in Section 6. Finally, the paper is concluded in Section 7.
In this section, we give a formal definition of learning-to-match and a formulation of a feature-based matrix factorization. We also present our motivation of parallelizing this learning task.
Learning-to-match can be formally defined as follows. Let be the instances in the query domain and be the instances in the target domain, where and are query and target instances (feature vectors) respectively. For some query-target pairs, the corresponding matching scores are given as training data, where is the set of indices for all observed query-target pairs. Our problem is to learn to predict the matching score between any pair of query and target .
The setting is rather general and it subsumes many application problems. For example, in collaborative filtering, a user’s preference over an item can be interpreted as the matching score between the user and the item. In social link prediction, the likelihood of a link between nodes on the network can be regarded as the matching score between the nodes. Web search, or document retrieval in general, can also be formalized as a problem of first matching a given query against documents and then ranking the documents based on the matching scores.
The goal of learning-to-match is to make accurate prediction by effectively using the information on the given relations between instances (e.g., similar users may prefer similar items), as well as the information on the features of instances (e.g., users may prefer items with similar properties).
The query and target instances (feature vectors) are in two heterogeneous feature spaces, and a direct match between them is generally impossible. Instead, we map the feature vectors in the two domains into a latent space and perform matching on the images of the feature vectors in the latent space. We calculate the matching score of a query-target pair as
where and are transformation matrices that map feature vectors from the feature spaces into the latent space. and are latent factors of query instance and target instance . In this paper, we use to denote the th column of matrix and to denote the th column of matrix . We refer to the model in Equation (1) as the model of feature-based matrix factorization. The model can be also interpreted as linear matching function of latent factors, in which the latent factor of each instance is also linearly constructed from feature vectors. The query latent factor can be expressed as . The target latent factor can be expressed similarly.
The model, as shown in Figure 1, contains many existing models of feature-based matrix factorization as special cases [17, 18, 3, 14, 1]. When no “informative” features are available for objects of either domain, the feature matrices contain only the indices of the objects. In such cases and become identity matrices of sizes and , and the feature-based matrix factorization model naturally degenerates to plain matrix factorization.
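The scoring model and its reduction to plain matrix factorization can be illustrated with a small NumPy sketch. The names `U`, `V`, `x`, `z`, and `fmf_score` are assumptions made for this illustration, since the original symbols are not recoverable here. Each column of a transformation matrix is taken to be the latent vector of one feature, following the column convention described above.

```python
import numpy as np

def fmf_score(U, V, x, z):
    """Matching score of one query-target pair (illustrative sketch).

    U: (d, n_query_features) and V: (d, n_target_features) are the
    transformation matrices; x and z are the feature vectors.
    """
    p = U @ x  # query latent factor: feature-weighted sum of columns of U
    q = V @ z  # target latent factor, built the same way from V
    return float(p @ q)

rng = np.random.default_rng(0)
d, n_users, n_items = 4, 3, 5
U = rng.normal(size=(d, n_users))
V = rng.normal(size=(d, n_items))

# With indicator-only (one-hot) features, the score is the inner product of
# one column of U and one column of V, i.e. plain matrix factorization.
x = np.eye(n_users)[1]
z = np.eye(n_items)[2]
score = fmf_score(U, V, x, z)
mf_score = float(U[:, 1] @ V[:, 2])
```

Adding further nonzero entries to `x` or `z` (attributes, feedback, content) changes only the feature vectors, not the scoring function.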
The objective of the learning task then becomes the following, where and are not model parameters but only auxiliary variables determined by and :
Here is a strongly convex loss function that measures the difference between the prediction and the ground truth . The loss function can be square loss for regression
or logistic loss for classification
where . is a regularization term based on the elastic net, which includes L2 (ridge) and L1 (lasso) regularization as its special cases.
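As a rough illustration of the objective, the sketch below sums a per-pair loss over the observed pairs and adds an elastic-net penalty on both transformation matrices. Several details are assumptions rather than the paper's exact definitions: the 1/2 scaling of the square loss, labels in {0, 1} for the logistic loss, the penalty weights `lam1` and `lam2`, and all function names.

```python
import numpy as np

def square_loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    # log(1 + exp(y_hat)) - y * y_hat for y in {0, 1}, computed stably
    return np.logaddexp(0.0, y_hat) - y * y_hat

def elastic_net(W, lam1, lam2):
    # lam2 = 0 gives the lasso penalty, lam1 = 0 gives the ridge penalty
    return lam1 * np.abs(W).sum() + 0.5 * lam2 * (W ** 2).sum()

def objective(pairs, y, X, Z, U, V, loss, lam1, lam2):
    """pairs: observed (query, target) index pairs; X, Z hold one instance per row."""
    P = X @ U.T  # latent factors of all queries, shape (n_queries, d)
    Q = Z @ V.T  # latent factors of all targets, shape (n_targets, d)
    i, j = np.asarray(pairs).T
    y_hat = np.einsum("nd,nd->n", P[i], Q[j])
    data_term = loss(np.asarray(y, dtype=float), y_hat).sum()
    return data_term + elastic_net(U, lam1, lam2) + elastic_net(V, lam1, lam2)

# Tiny example with indicator features and a one-dimensional latent space.
X = np.eye(2)
Z = np.eye(2)
U = np.array([[1.0, 0.0]])
V = np.array([[2.0, 3.0]])
value_lasso = objective([(0, 0)], [2.0], X, Z, U, V, square_loss, 1.0, 0.0)
value_ridge = objective([(0, 0)], [2.0], X, Z, U, V, square_loss, 0.0, 2.0)
```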
The use of features in learning-to-match is crucial for the accuracy of the task. Usually an FMF model, when properly optimized, can produce a higher prediction accuracy than a model of plain matrix factorization (MF). This is because the FMF model can leverage more information for the prediction, particularly the feature information, while the MF model can only rely on relations between instances, which are usually very sparse. For example, in the task of recommendation, only about entries are observed. In the Tencent Weibo dataset at KDD Cup 2012, about of users in the test set have no following records in the training set. As a result, MF cannot achieve satisfactory results in these tasks, while FMF models give the best results.
In fact, it has been observed that FMF models (with different types of features used) achieve state-of-the-art results on many different tasks, outperforming MF models by a big margin. For example, in collaborative filtering, user feedback (SVD++), user attributes, and product attributes are incorporated into models to further improve prediction accuracy. In web search [25, 26], term vectors of queries and documents are used as features to significantly improve relevance ranking. FMF models also give the best results in link prediction in KDD Cup 2012 [14, 3, 4].
The success of the FMF models strongly indicates the necessity of scaling up the corresponding learning algorithms, given that the existing algorithms still cannot easily handle large datasets. By making use of repeating patterns, coordinate descent can be scaled up for the least-squares and probit losses, but this approach provides no guarantee for general convex loss functions. Other parallel coordinate descent algorithms cannot be directly applied to FMF, because it is difficult for them to handle the complex feature dependencies in FMF. The Hogwild! algorithm for parallel stochastic gradient descent can be applied here, but it is a generic algorithm and thus is still inefficient for FMF. To the best of our knowledge, our work in this paper is the first effort on the scalability of learning-to-match, i.e., feature-based matrix factorization.
In this section, we propose an acceleration of the coordinate descent algorithm for solving the feature-based matrix factorization problem. We prove the convergence of the accelerated algorithm, and we give its time complexity in Section 4.3.
Let be the gradient of each instance over prediction and be constant of , such that,
That is, . We can exploit the standard technique to learn the model using coordinate descent (CD), as shown in Algorithm 1. (In this paper, all matrix operations in the algorithms take advantage of sparsity; e.g., summations are implicitly over nonzero entries only. The same holds for the time complexity analyses and the implementations of all algorithms.) Here, is defined using the following thresholding function to handle the optimization of the L1 norm
The regularization term only affects the result through function , and thus makes most part of the algorithm independent of regularization. Note that we implicitly assume is buffered and kept up to date, when is needed in the algorithm.
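To make the role of the thresholding function concrete, here is a sketch of the standard soft-thresholding step for a single coordinate under an elastic-net penalty. It assumes the coordinate minimizes a quadratic model with first-order coefficient `g` and curvature `h`. The names and the exact parameterization are assumptions, and the paper's own function may be written differently.

```python
def soft_threshold(a, lam):
    """Shrink a toward zero by lam, returning 0 inside [-lam, lam]."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def cd_step(w, g, h, lam1, lam2):
    """New value of one coordinate under a quadratic model plus elastic net.

    Minimizes over w_new:
        g * (w_new - w) + 0.5 * h * (w_new - w) ** 2
        + lam1 * abs(w_new) + 0.5 * lam2 * w_new ** 2
    and requires h + lam2 > 0.
    """
    return soft_threshold(h * w - g, lam1) / (h + lam2)

# A strong gradient survives the threshold; a weak signal is set exactly to zero.
w_kept = cd_step(0.0, -3.0, 1.0, 1.0, 1.0)
w_zeroed = cd_step(0.5, 0.0, 1.0, 1.0, 0.0)
```

Because the penalty enters only through this one function, the rest of a coordinate descent loop can stay unchanged when the regularizer is swapped.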
The time complexity of one update in Algorithm 1 is , where and denote numbers of nonzero entries in feature matrices and , and denote numbers of query and target instances, and denotes average number of nonzero features for each pair . We note that the time complexity is the same as the time complexity of stochastic gradient optimization for Equation (2). From the analysis, we can see that the time complexity of algorithm is increased by order of when average number of nonzero features increases. This can greatly hamper the learning of matching model, when a large number of features are used.
We give an efficient algorithm for learning-to-match by avoiding the repeated calculations caused by features. Little work has focused on accelerating FMF models. The most relevant work scales up a specific coordinate descent algorithm by making use of repeating patterns of features. However, it is specialized for the least-squares and probit losses. Although its idea is similar to avoiding repeated calculations in our work, we extend the idea to any general convex loss function.
One can see that there exist repeated calculations of summations for the same query or target when calculating and in Algorithm 1, which gives us a chance to speed up the algorithm. We introduce two auxiliary variables and calculated by
where is the set of observed target instances associated with query instance . The key idea of efficient CD is to make use of and to save duplicated summations in Algorithm 1. Since the gradient value is changed after each update, it is not trivial to let unchanged. Our algorithm keeps making updated to ensure the convergence of the algorithm. The efficient algorithm for learning-to-match is shown in Algorithm 2.
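The saving from the auxiliary variables can be sketched for a single gradient evaluation. The naive version revisits every nonzero feature of a query once per observed pair, while the aggregated version first sums the per-pair contributions for each query and then touches each nonzero feature entry only once. All names here (`g`, `X`, `Q`, `A`) are assumptions for this sketch, and it does not cover the bookkeeping needed to keep the aggregates up to date after each coordinate update.

```python
import numpy as np

def naive_gradient(pairs, g, X, Q):
    """Gradient w.r.t. the query transformation, one pair and one feature at a time."""
    grad = np.zeros((Q.shape[1], X.shape[1]))
    for (i, j), g_ij in zip(pairs, g):
        for k in np.flatnonzero(X[i]):
            grad[:, k] += g_ij * X[i, k] * Q[j]
    return grad

def aggregated_gradient(pairs, g, X, Q):
    """Same gradient, using one aggregate per query instance."""
    A = np.zeros((X.shape[0], Q.shape[1]))
    for (i, j), g_ij in zip(pairs, g):
        A[i] += g_ij * Q[j]  # sum over the observed targets of query i
    return A.T @ X           # each feature entry of X is used once

rng = np.random.default_rng(1)
n_queries, n_targets, n_features, d = 6, 5, 8, 3
X = rng.normal(size=(n_queries, n_features)) * (rng.random((n_queries, n_features)) < 0.4)
Q = rng.normal(size=(n_targets, d))
pairs = [(i, j) for i in range(n_queries) for j in range(n_targets) if (i + j) % 2 == 0]
g = rng.normal(size=len(pairs))

grad_naive = naive_gradient(pairs, g, X, Q)
grad_fast = aggregated_gradient(pairs, g, X, Q)
```

The two functions return the same matrix; only the amount of repeated work differs, and the gap grows with the number of observed targets per query.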
Next, we prove the convergence of Algorithm 2, which greatly reduces the time complexity of learning. Suppose that the th row of is changed by . After the change, the loss function can be bounded by
Intuitively, updating corresponds to minimizing the quadratic upper bound
of the original convex loss which is re-estimated each round. Formally, all the values of are zero in the beginning. We need to sequentially update for different ’s to minimize . Assuming that we have already updated and need to decide , we can calculate the upper bound as follows
The first order term of this equation is exactly the update rule in Algorithm 2. Using to denote the change on the th row after carrying out the update, we arrive at the following inequality
Note that we start from , and we have after the update. The inequality in Equation (11) shows that the original loss function decreases after each round of update, and hence this proves the convergence of Algorithm 2 for any differentiable convex loss function.
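The descent argument can be checked numerically on a one-parameter toy problem. The sketch below assumes the logistic loss with labels in {0, 1}, whose second derivative with respect to the prediction is bounded by 1/4, and repeatedly minimizes the resulting quadratic upper bound. The model, data, and names are illustrative assumptions, not the paper's Algorithm 2.

```python
import numpy as np

def logistic_loss(y, y_hat):
    return np.logaddexp(0.0, y_hat) - y * y_hat

def minimize_upper_bound(x, y, w=0.0, rounds=20, h=0.25):
    """Fit y_hat = w * x by minimizing a quadratic upper bound each round."""
    losses = [float(logistic_loss(y, w * x).sum())]
    for _ in range(rounds):
        y_hat = w * x
        g = 1.0 / (1.0 + np.exp(-y_hat)) - y  # gradient of the loss w.r.t. y_hat
        G = float((g * x).sum())              # first-order term of the bound
        H = h * float((x ** 2).sum())         # curvature of the bound
        w -= G / H                            # exact minimizer of the bound
        losses.append(float(logistic_loss(y, w * x).sum()))
    return w, losses

x = np.array([1.0, -2.0, 0.5, 3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
w_fit, losses = minimize_upper_bound(x, y)
```

Since the bound touches the loss at the current point and lies above it elsewhere, the recorded losses never increase from one round to the next.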
We propose a parallel and efficient learning-to-match algorithm (PL2M) to further improve the scalability and efficiency by deriving an adaptive estimation of the conflicts caused by parallel updates. Specifically, we consider parallelizing and accelerating Algorithm 2. The statistics calculation and preprocessing steps in Algorithm 2 can be naturally separated into several independent tasks and thus fully parallelized. However, there is strong dependency within update steps, making the parallelization of it a difficult task. We will discuss how we solve the problem next.
Let be a set of feature indices to be updated in parallel. Assume that the statistics of is up to date as in Algorithm 2 and we want to change for in parallel. For simplicity of notation, we use to represent the change in . The value of after this change will be
In a specific case in which for , the third line in Equation (12) becomes zero. This means that the features in the selected set do not appear in the same instance. In such case, the loss can be separated into independent parts and the original update rule can be applied in parallel. Not surprisingly, such condition does not hold in many real world scenarios. We need to remove these troublesome cross terms in the second line, by deriving an adaptive estimation of the conflicts caused by parallel updates, more specifically, by the inequality:
With the inequality, we can bound as follows
Obviously this new upper bound can be separated into independent parts and optimized in parallel. Moreover, the sum is common for all features in set and is only needed to be calculated once. With this result, we give a parallel and efficient algorithm for learning-to-match, shown in Algorithm 3.
The relaxation of into is performed iteratively in the optimization, and it still attempts to optimize the original objective in Equation (2), a situation analogous to the Expectation-Maximization algorithm for finding a maximum-likelihood solution. Let be the change in after each parallel update. Since each parallel update optimizes , we have the following inequality
It indicates that decreases after each parallel update. It then follows that the parallel procedure for optimizing the original loss function in Algorithm 3 always converges.
The update rule depends on the statistics . With the following notation
It can be shown that the parallel update of is shrunken by compared to sequential update. Intuitively depends on the co-occurrence between features . When features in rarely co-occur, will be close to one, which means that we can update “aggressively”. When features in co-occur frequently, will get small and we need to update more “conservatively”. In an extreme case in which no feature co-occurs with each other, and we get perfect parallelization without any loss of update efficiency. In another extreme case in which we have duplicated features ( ), , which is extremely conservative given the size of . The advantage of our algorithm is that it automatically adjusts its “level of conservativeness” by the condition in learning, and thus it always ensures the convergence of the algorithm regardless of the number of threads and the nature of dataset.
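The two extreme cases can be reproduced with a small sketch on a least-squares linear model. It removes the cross terms with a Cauchy-Schwarz bound weighted by how many features of the parallel set are nonzero in each instance. This is one standard way to obtain a separable bound and is an assumption here, since the paper's exact inequality is not recoverable from the text. The per-coordinate factor `theta` and all names are likewise illustrative.

```python
import numpy as np

def loss(A, y, w):
    return 0.5 * float(np.sum((A @ w - y) ** 2))

def parallel_step(A, y, w, S):
    """Update all coordinates in S at once using a separable upper bound."""
    S = np.asarray(S)
    r = A @ w - y                          # per-instance gradient w.r.t. prediction
    A_S = A[:, S]
    c = np.count_nonzero(A_S, axis=1)      # features of S present in each instance
    G = A_S.T @ r                          # first-order term per coordinate
    H_seq = (A_S ** 2).sum(axis=0)         # curvature of a sequential update
    H_par = (c[:, None] * A_S ** 2).sum(axis=0)  # inflated by co-occurrence
    active = H_par > 0
    delta = np.zeros(len(S))
    delta[active] = -G[active] / H_par[active]
    theta = np.ones(len(S))
    theta[active] = H_seq[active] / H_par[active]  # shrink factor in (0, 1]
    w_new = w.copy()
    w_new[S] += delta
    return w_new, theta

# Features that never co-occur: no shrinkage is needed.
_, theta_disjoint = parallel_step(np.eye(2), np.array([1.0, 2.0]), np.zeros(2), [0, 1])

# Two duplicated features: each update is halved, i.e. scaled by 1 / |S|.
A_dup = np.array([[1.0, 1.0], [2.0, 2.0]])
y_dup = np.array([1.0, 2.0])
w_dup, theta_dup = parallel_step(A_dup, y_dup, np.zeros(2), [0, 1])
```

In the duplicated case the halved steps land exactly on a minimizer, whereas applying both full sequential steps at once would overshoot it.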
The changes in loss function can be analyzed accordingly. Let us consider the simple case in which and only regularization is involved. The change of loss after parallel update can be bounded by
As this inequality indicates, compared to the ideal case in which features do not co-occur, each parallel update’s contribution to the loss change is scaled by . The above analysis also intuitively justifies that controls the efficiency of the update.
The time complexity of the efficient algorithm (Algorithm 2) is only of . It is linear to numbers of nonzero entries of feature matrices and number of observed entries of . Recall the time complexity of the coordinate descent algorithm (Algorithm 1), which is .
The speedup on updates in Algorithm 2 is as follows:
This corresponds to average number of observed target instances per query instance. Similarly, on the updates, the speedup is about times. Therefore, the overall speedup of Algorithm 2 over Algorithm 1 is at least,
In application tasks such as collaborative filtering and link prediction, this can be at the level of to . When is close to (or smaller than) (datasets like Yahoo! Music, Tencent Weibo and Movielens-10M), our algorithm runs as fast as the algorithm of plain matrix factorization even though it uses extra features.
For the parallel and efficient learning-to-match algorithm (PL2M) described in Algorithm 3, using threads, the computation cost of one round of updates is . This is because all parts of the algorithm are parallelized. This analysis does not consider the synchronization cost. In real-world settings, we need to take the synchronization cost into consideration, and the corresponding time complexity becomes , where denotes the variance of computation costs across parallel tasks. Assume that we have tasks and the time costs of the tasks are . We define , since the training is delayed by the slowest task. To achieve maximum speedup, we need to schedule the tasks so that the load is balanced across tasks, which is always feasible when , , and are large. Therefore, our algorithm can gain almost times speedup.
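The effect of load balance on the round time can be illustrated with a greedy assignment of tasks to threads, where the round finishes only when the most loaded thread does. The longest-task-first heuristic below is an illustrative assumption and is not claimed to be the scheduling used in the paper's implementation.

```python
import heapq

def schedule(costs, n_threads):
    """Assign tasks to threads greedily, longest task first.

    Returns (assignment, per_thread_load). The round time is the
    maximum load, because the slowest thread delays the whole round.
    """
    heap = [(0.0, t) for t in range(n_threads)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_threads)]
    for task, cost in sorted(enumerate(costs), key=lambda tc: -tc[1]):
        load, t = heapq.heappop(heap)  # currently least loaded thread
        assignment[t].append(task)
        heapq.heappush(heap, (load + cost, t))
    per_thread_load = [0.0] * n_threads
    for load, t in heap:
        per_thread_load[t] = load
    return assignment, per_thread_load

costs = [4, 3, 3, 2, 2, 2]
assignment, loads = schedule(costs, n_threads=2)
round_time = max(loads)
```

With many small tasks the maximum load approaches the average load, which is the regime in which a near-linear speedup is plausible.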
In real world applications, there is a trade-off between the size of parallel coordinate set and the parameter , especially when different features have different levels of sparsity in the dataset. When we increase the size of parallel coordinate set , we can divide the task into threads in a more balanced way. On the other hand, will decrease as we increase , making the update more conservative. Thus a parallel coordinate set needs to be chosen to balance convergence and acceleration. In fact, we need to empirically choose such that each instance is covered by only a few nonzero features and the task size is large enough to run in a fairly balanced way.
In this paper, we fix and randomly partition elements from the feature indices to generate a set of disjoint subsets in each round. We note that there can be more sophisticated scheduling strategies to select , which is beyond the scope of this paper and can be an interesting topic for future research.
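The random partitioning step can be sketched as follows: shuffle the feature indices once per round and cut the permutation into consecutive chunks, which yields disjoint subsets that together cover every feature. The function name and the `set_size` parameter are assumptions for this illustration.

```python
import numpy as np

def random_partition(n_features, set_size, rng):
    """Split feature indices into disjoint random subsets of at most set_size."""
    perm = rng.permutation(n_features)
    return [perm[start:start + set_size] for start in range(0, n_features, set_size)]

rng = np.random.default_rng(0)
subsets = random_partition(10, 4, rng)
sizes = [len(s) for s in subsets]
```

Calling the function again with the same generator produces a different partition, so co-occurring features are not always grouped together across rounds.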
Table 1: Datasets used in the experiments.
| Dataset | Task | Features |
|---|---|---|
| Yahoo! Music | Collaborative Filtering | User Feedback, Taxonomy |
| Tencent Weibo | Social Link Prediction | Social Network, User Profile, |
| Flickr | Image Tagging | MAP, SIFT Descriptors of Image |
| Movielens-10M | Collaborative Filtering | User Feedback |
MF models  are arguably the most successful approach to learning-to-match. They have been applied to a wide range of real world problems, especially FMF models, which achieve state-of-the-art results, outperforming the models of MF in many different tasks, with different types of features used. In collaborative filtering, user feedback information (SVD++) , user attribute information , and product attribute information  are incorporated into models to further enhance the accuracies in prediction. In web search [25, 26], term vectors of queries and documents are utilized as features to significantly improve relevance ranking. FMF models also give the best results in link prediction in KDD Cup 2012 [14, 3, 4]. These works demonstrate the effectiveness of the learning-to-match models, but also create necessity for parallelization of the learning algorithms.
There has been much effort on parallelizing plain matrix factorization. For example, Gemulla et al. propose a method of distributed stochastic gradient descent for MF. Yu et al. introduce a parallel coordinate descent algorithm for MF. An alternating least squares method has been proposed for MF as well. Recently, Zhuang et al. improved the efficiency of parallel stochastic gradient descent for MF through better scheduling of updates. Liu et al. propose a distributed algorithm for nonnegative matrix factorization for web dyadic data analysis. Probabilistic Latent Semantic Indexing has been parallelized for Google News recommendation. However, all the methods for parallelizing plain matrix factorization rely on the fact that the rows and columns can be naturally separated and the parameters can be independently updated, and therefore they cannot work on FMF due to the complex feature dependencies in the updating steps.
Little work has focused on accelerating coordinate descent for FMF. The most recent one scales up coordinate descent by making use of repeating patterns of features. However, it is specialized for the least-squares and probit losses. Although the idea of avoiding repeated calculations is similar, our algorithm takes a completely different approach and can handle any general convex loss function. Other parallel coordinate descent algorithms cannot be directly applied to FMF, because it is difficult for them to handle the complex feature dependencies in FMF.
As a general parallelization technique, the Hogwild! algorithm can be applied to our problem. However, its time complexity is , the same as Algorithm 1, due to the repeated calculations. Using the same number of threads, as analyzed in the time complexity sections, it is theoretically times slower than our parallel algorithm. In experiments, our parallel algorithm runs about 5 times faster than Hogwild! on average. Another thread of related work is parallelization of coordinate descent algorithms. There have been studies on parallelizing coordinate descent for linear regression [2, 20, 21], other than matrix factorization. The convergence of these algorithms depends on the spectrum of the covariance matrix, which changes in each round in our learning setting (due to the changes in and ), and thus the algorithms cannot be directly applied to our problem. Our algorithm uses parallel updates to minimize an upper bound that is re-estimated each round to ensure convergence, which can also be viewed as a kind of majorization-minimization algorithm.
In this section, we introduce our experimental results on several matching tasks using benchmark datasets. We first conduct comparison on accuracies between feature-based matrix factorization and plain matrix factorization. We then make comparisons on accuracies and efficiencies between our method of parallel learning-to-match and the baselines, including Hogwild! . Finally, we conduct analysis on the efficiency of our parallel learning algorithm.
Four datasets representing different types of learning-to-match tasks are chosen. Details of the datasets are summarized in Table 1.
The first dataset is Yahoo! Music Track1 (http://kddcup.yahoo.com/datasets.php) from the Yahoo! Music website. The dataset is among the largest public datasets for collaborative filtering. We use the official split of the dataset for experiments. As features, we use the implicit feedback of users as well as the taxonomical information between the tracks, albums and artists, in addition to the indicators of users and tracks. Because it is an item rating dataset, we choose square loss as the loss function and use Root Mean Square Error (RMSE) as the evaluation measure.
The second dataset is Tencent Weibo (microblog) (http://kddcup2012.org/c/kddcup2012-track1/data), for social link prediction. The task is to predict a potential list of celebrities that a user will follow. The dataset is split into training and test data by time, with the test data further split into public and private sets for independent evaluations. We use the training set for learning and the public test set for evaluation. We use logistic loss as the loss function and MAP@K as the evaluation metric, which is officially adopted in the KDD Cup competition (http://kddcup2012.org/c/kddcup2012-track1/details/Evaluation). The matrix data is extremely sparse, with only two positive links per user on average. Furthermore, about of users in the test set have no following records in the training set. However, a lot of additional information is available, including social network and interaction (i.e., retweeting and commenting) records, profiles of users, categories of celebrities, and tags/keywords of users. This information is used as features for the task.
The third dataset is for automatically annotating images crawled from Flickr (http://www.flickr.com). The dataset contains million images and each image is associated with four tags on average. We select the most frequently occurring tags as the tag set. We randomly select images as the test set and use the rest of the images as the training set. We use the bag-of-words vector of SIFT descriptors as features for images, and indicator vectors as features for tags. Logistic loss is chosen as the loss function. In testing, we generate a ranked list of tags and use P@K (Precision at K) and MAP as evaluation metrics.
The fourth dataset is also for collaborative filtering, provided by Movielens (http://www.movielens.org/). We use the official split of the dataset for experiments. This dataset is added because Hogwild! cannot run on the Yahoo! Music dataset due to its high time complexity. In addition to the indicators of users and movies, the implicit feedback of users is used as features. As with the Yahoo! Music dataset, we choose square loss as the loss function and RMSE as the evaluation metric.
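For reference, the evaluation measures named above can be sketched as follows. RMSE and P@K are standard. The average precision at K shown here normalizes by the smaller of K and the number of relevant items, which is one common convention and an assumption in this sketch. The official KDD Cup definition linked above should be consulted for the exact variant.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error over paired ratings."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision_at_k(ranked, relevant, k):
    """Mean of the precision values at the ranks of relevant items in the top k."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(k, len(relevant))
    return total / denom if denom else 0.0

def mean_average_precision_at_k(rankings, relevants, k):
    """MAP@K: average of AP@K over all queries (e.g. users or images)."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)

ranked_tags = ["a", "b", "c", "d"]
true_tags = {"a", "c"}
p_at_3 = precision_at_k(ranked_tags, true_tags, 3)
ap_at_3 = average_precision_at_k(ranked_tags, true_tags, 3)
```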
We have implemented our parallel and efficient algorithm for learning-to-match (PL2M) using OpenMP (http://www.openmp.org). The experiments are conducted on a machine with an Intel Xeon E5-2680 CPU (8 cores at 2.70GHz, supporting 16 threads) and 128GB of memory. We utilize up to 15 working threads and reserve one thread for scheduling.
We compare the performance of PL2M with those of the serial algorithm for learning-to-match (denoted as L2M) and the Hogwild! algorithm. To simplify the notation, we use PL2M- to refer to the parallel algorithm for learning-to-match with parallel set (e.g., PL2M-5K means the parallel algorithm with ). Hogwild! is the only existing method that can be directly applied to our problem, as mentioned in Section 5. We have also implemented Hogwild! using OpenMP. All matrix operations mentioned in the algorithms take advantage of data sparsity. PL2M, L2M, and Hogwild! share the same code for elementary operations.
We empirically set and for L2M and PL2M through all our experiments. To make fair comparison, the parameters of Hogwild! including learning rate, and are tuned with cross validation on training set.
We compare FMF and MF to investigate the effectiveness of the features. We first compare FMF (trained by L2M) and MF in terms of test RMSE on the Yahoo! Music dataset in Figure 3(a). From the result, we can see that the FMF model converges faster and achieves better results than the MF model. This result is consistent with the results reported in [9, 18] and confirms the importance of using features in this problem. The test error first decreases and then increases again as training proceeds, indicating that training can be stopped after about 5 rounds.
The results on the Tencent Weibo dataset are shown in Table 2. MF gives performance similar to the popularity-based algorithm, which only considers the popularity of each target node, and thus its result is not reported. Here the suffix ALL stands for the FMF model using all the available features shown in Table 1. We also evaluate the performance of the FMF model with only social network information, with suffix SNS. From Table 2, we can see that this dataset is extremely biased toward popular nodes. However, it is still possible to improve the results using social network information, and the auxiliary features help to achieve the best performance. Note that PL2M-500-ALL achieves the best result on the Tencent Weibo dataset (in fact, our method is the same as the champion system on this dataset).
The performance on the Flickr test set is shown in Table 3. Because the training and test images do not overlap, MF cannot be used to make predictions, and thus we adopt popularity scores as the baseline. The results show that FMF improves upon the popularity method and assigns relevant tags using image content features.
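To make the popularity baseline concrete, the following sketch ranks targets (e.g., tags) by how often they occur in the training data and recommends the same top-k list to every query. The text does not specify how popularity scores are computed, so the plain frequency count, the function names, and the toy data here are all assumptions for illustration.

```python
from collections import Counter


def popularity_scores(training_pairs):
    """Count how often each target (e.g., a tag) occurs in (query, target) pairs."""
    return Counter(target for _query, target in training_pairs)


def recommend_top_k(scores, k):
    """Return the k most popular targets; every query receives the same list."""
    return [target for target, _count in scores.most_common(k)]


# Toy training data (hypothetical, not taken from the Flickr dataset).
toy_pairs = [
    ("img1", "sky"), ("img2", "sky"), ("img3", "sky"),
    ("img1", "tree"), ("img2", "tree"),
    ("img3", "cat"),
]
toy_scores = popularity_scores(toy_pairs)
top_tags = recommend_top_k(toy_scores, 2)
print(top_tags)
```

Because such a baseline ignores the query entirely, it needs no overlap between training and test images, which is why it remains usable where plain MF is not.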
The test RMSE curves of the different algorithms on the Movielens-10M dataset are shown in Figure 3(b). The FMF model converges faster and achieves better results than the MF model, which demonstrates the importance of using features in this problem. The test error first decreases and then increases again as training proceeds, which indicates that training can be stopped after about 7 rounds.
We compare PL2M and L2M in terms of accuracy and efficiency.
Figure 2 gives the training loss curves of PL2M and L2M. From the figure, we can observe that, at the beginning of training, the loss of PL2M is always bounded below by that of L2M. This is consistent with our theoretical result on convergence in Section 4. That is, if PL2M and L2M start with the same initial values, PL2M can perform at most as well as L2M.
The situation changes, however, as training goes on. On the Tencent Weibo dataset, PL2M-500 converges slightly better than L2M after rounds. On the Movielens-10M dataset, although the training loss curves of L2M, PL2M-500, and PL2M-50 are almost the same, the training loss of PL2M is lower in the end. This may be due to the fact that the loss function is non-convex for and . After several updates, for example, 20 rounds, the values of and are quite different, so that the two methods finally converge to different local minima.
From these figures, we can also observe that PL2M-50K converges more slowly than PL2M-5K on the Yahoo! Music dataset, PL2M-5K converges more slowly than PL2M-500 on the Tencent Weibo dataset, and PL2M-500 converges more slowly than PL2M-50 on the Flickr dataset. PL2M-500 converges slightly more slowly than PL2M-50 on the Movielens-10M dataset. These observations are consistent with our previous theoretical result that smaller leads to faster convergence.
We compare PL2M and Hogwild!. Since Hogwild! is based on stochastic gradient descent, some parameters such as the learning rate need to be tuned. After tuning the parameters of Hogwild! using cross-validation on the training set, including learning rate, coefficient , and coefficient , we obtain its performance in Table 4, including test error, running time of one round of training, and number of rounds needed to reach the best test error. Both Hogwild! and PL2M use threads.
We can see that the running time of each round of PL2M is much shorter than that of Hogwild!, while their test errors are similar. The difference in running time is smaller on Tencent Weibo than on the Movielens-10M and Flickr datasets because the Tencent Weibo data has fewer features. This is consistent with our theoretical results on time complexity in Section 3 and Section 4.
Furthermore, Hogwild! needs more training rounds than PL2M to achieve its best test error. For example, Hogwild! needs rounds but PL2M needs only rounds to achieve their best test errors on Movielens-10M. Therefore, the total running time for PL2M to reach its best performance is much smaller than that of Hogwild!.
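The reasoning behind this conclusion can be written out explicitly. The symbols $R$ (number of rounds needed to reach the best test error) and $t_{\text{round}}$ (running time of one round) are notation introduced here for illustration; they correspond to the quantities reported in Table 4, and the sketch assumes the per-round time is roughly constant across rounds.

```latex
\[
T_{\text{total}} = R \cdot t_{\text{round}},
\qquad
\frac{T_{\text{total}}^{\text{Hogwild!}}}{T_{\text{total}}^{\text{PL2M}}}
= \frac{R^{\text{Hogwild!}}}{R^{\text{PL2M}}}
  \cdot
  \frac{t_{\text{round}}^{\text{Hogwild!}}}{t_{\text{round}}^{\text{PL2M}}}
\]
```

Since Hogwild! both needs more rounds and takes longer per round, each factor on the right-hand side exceeds one, so the gap in total running time is larger than either the gap in rounds or the gap in per-round time alone.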
Finally, we evaluate the scalability of the parallel learning-to-match algorithm (PL2M). We measure the average running time of PL2M-5K on the Yahoo! Music dataset, PL2M-500 on the Tencent Weibo dataset, PL2M-50 on the Flickr dataset, and PL2M-50 on Movielens-10M with varying numbers of threads, and evaluate the improvement in efficiency.
As shown in Figure 4, the speedup curves are similar on the Yahoo! Music, Tencent Weibo, and Flickr datasets, but the curve flattens earlier on the Movielens-10M dataset. This is because Movielens-10M is smaller than the others and PL2M-50 already runs very fast on it, needing only about 18 seconds when 8 threads are used. Although the speedup gained by parallelization is smaller than on the other datasets, the parallel algorithm still provides acceleration.
On the first three datasets, PL2M achieves almost linear speedup with fewer than threads, but the speedup gain slows down with more threads. We observe that the working threads are still fully occupied when more than 8 threads are used. We conjecture that this turning point is due to the fact that the machine has only 8 physical cores. From the results, we find that PL2M gains about times speedup using threads, confirming the scalability of the parallel algorithm.
In summary, the speedup gained by the parallel algorithm is significant, and thus it can easily handle hundreds of millions of instances and features on a single machine.
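For reference, speedup and parallel efficiency are conventionally computed from running times as sketched below. The function names and the timing values are hypothetical placeholders chosen to show a curve that is close to linear at first and then flattens; they are not measurements from Figure 4.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup of a parallel run relative to the single-thread run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, num_threads):
    """Parallel efficiency: speedup divided by the number of threads used."""
    return speedup(serial_seconds, parallel_seconds) / num_threads


# Hypothetical running times in seconds, keyed by thread count
# (illustrative only, NOT the paper's measurements).
illustrative_timings = {1: 160.0, 2: 82.0, 4: 43.0, 8: 24.0, 16: 20.0}

serial = illustrative_timings[1]
for threads, seconds in sorted(illustrative_timings.items()):
    s = speedup(serial, seconds)
    e = efficiency(serial, seconds, threads)
    print(f"{threads:2d} threads: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency near 1 corresponds to the almost linear regime; a sharp drop once the thread count exceeds the number of physical cores matches the turning point conjectured above.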
We have proposed a parallel and efficient algorithm for learning-to-match, more specifically for feature-based matrix factorization, a general and state-of-the-art approach. Our algorithm (1) employs iterative relaxations to resolve the conflicts caused by parallel updates, with a provable convergence guarantee on minimizing the original objective function, and (2) accelerates the computation by avoiding the repeated calculations caused by features, for general convex loss functions. As a result, our algorithm can easily handle data with hundreds of millions of objects and features on a single machine. Extensive experimental results show that our algorithm is both effective and efficient compared to the baselines.
As future work, we plan to (1) extend the algorithm from the current multi-threaded setting to a distributed setting, (2) find better scheduling strategies for making parallel updates with a guaranteed bound on speedup, and (3) apply the technique developed in this paper to the parallelization of other learning methods, such as Markov Chain Monte Carlo (MCMC) methods for learning-to-match problems.