It is well known that the performance of recommender systems is unstable under cold-start and data-sparsity settings [30, 35]. Exploiting side information is an effective way to tackle these problems, and many approaches have been developed over the years. Among them, Factorization Machines (FMs) [28, 29], which can easily leverage any side information (both user- and item-related) for recommendation, have gained increasing attention recently [10, 38, 22].
FMs first map each attribute into a latent space and then concatenate the embedded vectors of attributes into a high-dimensional sparse vector. (An attribute can be a user ID, an item ID, or another contextual attribute, e.g., user age. In this paper, an attribute is represented by a $k$-dimensional feature vector, indicating that $k$ features are used to describe an attribute in the embedded space.) In particular, FMs predict the target mainly by modeling all the second-order interactions among attributes using a factorized parametrization. Despite their promising performance, traditional FM methods suffer from two limitations: 1) the attribute interactions are modeled in a linear way (i.e., the predicted target is linear w.r.t. each model parameter), which is insufficient for capturing the non-linear and complex inherent structure of real-world data; and 2) FMs model the second-order interactions between attributes via the inner product of their factorized vectors. The problem is that the inner product does not satisfy the triangle inequality in the vector space, a property that is crucial for modeling fine-grained relationships between attributes, and this results in sub-optimal performance [36, 13]. (The triangle inequality states that “the distance between two points cannot be larger than the sum of their distances from a third point.” Specifically, for real-valued vectors $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$ and a distance function $d$, it requires that $d(\mathbf{x}, \mathbf{z}) \le d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$.)
To overcome the first problem, researchers have explored deep neural networks for non-linear transformations. For example, He et al. developed a Bi-Interaction operation upon the factorized features, followed by several multi-layer perceptrons (MLPs) to capture the non-linear attribute interactions. However, this method is still unable to address the second problem, leading to inferior performance as observed in our experiments (see Section 5.1). To tackle the second problem, Pasricha et al. developed a new FM method, which replaces the inner product in FMs with a distance function (i.e., the squared Euclidean distance) for recommendation. Specifically, they employed the squared Euclidean distance to estimate the similarity between each pair of feature vectors. It is worth noting that the Euclidean distance function computes the distance between each pair of the factorized feature vectors independently and then sums them up. It thus ignores the possible inherent correlations between different features of an attribute. Taking the two typical examples shown in Figure 1, we map the attributes into a 2-D feature space. There are 1) a linear correlation in Figure 1(a), indicating a positive correlation between two features, such as the product brand price and product brand quality for the attribute of product brand, and 2) a non-linear correlation in Figure 1(b), e.g., the complex correlation between music rhythm and music melody for the attribute of music elements. When the Euclidean distance function is used to compute the similarity between two features with a certain correlation, such as the ones shown in Figure 1, it often fails to capture such relationships between features. As a result, it is incapable of modeling the fine-grained feature interactions of attributes.
Motivated by the above observations, in this paper we devise a novel FM framework equipped with generalized metric learning techniques (dubbed GML-FM). Based on this framework, we study two distance functions: the Mahalanobis distance and a DNN-based distance. The Mahalanobis distance based method adopts a positive semi-definite matrix to project the features into a new space such that the features obey certain linear constraints. In this way, linear correlations between features, such as the ones in Figure 1(a), can be captured by this matrix. To model more complex correlations, such as the one shown in Figure 1(b), the DNN-based distance function is designed to capture non-linear feature correlations, benefiting from both metric learning and the strong representation capability of neural networks. Notice that the values of the traditional inner product can cover the whole real line, while the values of distance functions are all non-negative, which limits the representation capability of FMs. To tackle this problem, we introduce a learnable weight for the interaction of each attribute pair. This strategy can greatly enhance the performance of distance functions (see Section 5.2).
Extensive experiments have been conducted on four public benchmark datasets, including three widely used Amazon datasets and the MovieLens dataset. Comparisons with several state-of-the-art methods demonstrate the effectiveness of our method. To further explore the superiority of our method over a variety of baselines under sparse settings, we collected a new large-scale dataset on second-hand trading from Mercari (https://mercari.com/), which is highly sparse (i.e., most items are purchased only once) and contains rich side information (e.g., item condition, shipping duration). Experiments on this dataset also validate the effectiveness of our method.
In summary, our main contributions are four-fold:
We propose a novel FM framework equipped with generalized metric learning techniques to effectively model the fine-grained feature interactions inside attributes. This framework can generalize the traditional inner product based and the recently proposed FMs with the Euclidean distance.
Based on the proposed method, we further design an effective solution to simplify the model equations and verify that our proposed method can be implemented in an efficient way.
We collect a new large-scale second-hand trading dataset to facilitate the study of the cold-start and sparsity problems in recommendation. To the best of our knowledge, this Mercari dataset is the largest second-hand trading dataset for recommendation in literature.
We conduct extensive experiments on three Amazon datasets, the MovieLens dataset and two Mercari datasets to validate the effectiveness of the proposed method. Moreover, we have released the code to facilitate future research in this direction (https://github.com/guoyang9/GML-FM).
The rest of the paper is structured as follows. We define some preliminaries in Section 2. We then detail our framework and its simplified form in Section 3. The experimental setup and result analysis are presented in Sections 4 and 5, respectively. In Section 6, we briefly review the related work. We finally conclude our work and discuss future directions in Section 7.
2 Preliminaries
We first briefly recapitulate the key definitions of metric learning in the literature, and then introduce the main developments of Factorization Machines. Both are building blocks of our proposed method.
2.1 Metric Learning
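Given a data collection $\mathcal{X}$, where each data sample $\mathbf{x}_i \in \mathcal{X}$ is over the input space $\mathbb{R}^d$, metric learning aims to learn an appropriate distance metric between all data pairs that satisfies some distance constraints, such as pair-wise distance constraints. In general, two sets of data pairs are given. The first one is the set of known similar pairs,
$$\mathcal{S} = \{(\mathbf{x}_i, \mathbf{x}_j) : \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are similar}\},$$
and the other one is the set of known dissimilar pairs,
$$\mathcal{D} = \{(\mathbf{x}_i, \mathbf{x}_j) : \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are dissimilar}\}.$$
Specifically, traditional approaches attempt to learn a Mahalanobis distance metric that makes the distance in the learned space smaller for similar pairs and larger for dissimilar pairs. The distance function is defined as,
$$d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^\top \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j)},$$
where $\mathbf{M}$ is a positive semi-definite matrix. A typical method is to globally solve the following convex optimization problem,
$$\min_{\mathbf{M}} \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}} d_{\mathbf{M}}^2(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{D}} d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) \ge 1, \quad \mathbf{M} \succeq 0,$$
where $\mathbf{M} \succeq 0$ denotes that matrix $\mathbf{M}$ is positive semi-definite.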
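As a small illustration of the Mahalanobis distance described above, the following sketch evaluates it for a pair of toy samples. The function name, the sample values and the matrix `M_corr` are hypothetical choices made only for this example; they do not come from the paper.

```python
import numpy as np

def mahalanobis_distance(x_i, x_j, M):
    """sqrt((x_i - x_j)^T M (x_i - x_j)) for a positive semi-definite M."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.sqrt(diff @ M @ diff))

# Hypothetical toy data: two 2-D samples.
x_a = np.array([1.0, 2.0])
x_b = np.array([3.0, 1.0])

# With M = I the Mahalanobis distance reduces to the Euclidean distance.
d_identity = mahalanobis_distance(x_a, x_b, np.eye(2))

# A PSD matrix with off-diagonal entries couples the two coordinates,
# so correlated features change the resulting distance.
M_corr = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
d_corr = mahalanobis_distance(x_a, x_b, M_corr)
```

With the identity matrix the result equals the ordinary Euclidean distance, whereas the off-diagonal entries of `M_corr` let the two coordinates interact.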
2.2 Factorization Machines
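Factorization Machines can work with any real-valued feature vectors for prediction. Given an input feature vector $\mathbf{x} \in \mathbb{R}^n$, FMs estimate the target by,
$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij}\, x_i x_j,$$
where $w_0$ is the global bias, $w_i$ models the strength of the $i$-th attribute $x_i$, and $w_{ij}$ models the second-order interactions between the $i$-th and $j$-th attributes. In the original FMs, $\mathbf{v}_i$ denotes the factorized feature vector for attribute $x_i$, and $w_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$ represents the inner product of $\mathbf{v}_i$ and $\mathbf{v}_j$. To model more complex interactions between attributes, NFM introduces deep learning into FMs and employs fully connected layers to learn non-linear feature interactions,
$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathrm{MLP}\Big(\sum_{i=1}^{n} \sum_{j=i+1}^{n} x_i \mathbf{v}_i \odot x_j \mathbf{v}_j\Big),$$
where $\mathrm{MLP}(\cdot)$ and $\odot$ denote several fully connected deep layers and the element-wise product, respectively. This approach is equivalent to the inner product followed by non-linear transformations. However, the inner product violates the triangle inequality and thus cannot capture fine-grained attribute interactions well. With this observation, Pasricha et al. recently proposed the TransFM method, which employs the squared Euclidean distance function to replace the inner product for sequential recommendation,
$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \|\mathbf{v}_i + \mathbf{v}'_i - \mathbf{v}_j\|_2^2\, x_i x_j,$$
where $\mathbf{v}_i$ and $\mathbf{v}'_i$ are the embedding and translation feature vectors for attribute $x_i$, respectively.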
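The inner-product FM prediction described in this subsection can be sketched as follows. This is a minimal illustration rather than the LibFM implementation: the function names and the random toy inputs are assumptions, and the second function uses the well-known reformulation of the pairwise term that avoids the nested loop.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    n = len(x)
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            pairwise += (V[i] @ V[j]) * x[i] * x[j]
    return float(w0 + w @ x + pairwise)

def fm_predict_fast(x, w0, w, V):
    """Same value without the nested loop, in time linear in n."""
    xv = V * x[:, None]                  # row i is x_i * v_i
    sum_sq = xv.sum(axis=0) ** 2         # (sum_i x_i v_if)^2 per factor f
    sq_sum = (xv ** 2).sum(axis=0)       # sum_i (x_i v_if)^2 per factor f
    return float(w0 + w @ x + 0.5 * (sum_sq - sq_sum).sum())

# Hypothetical toy inputs: n = 6 attributes, k = 3 latent features.
rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.random(n)
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

y_loop = fm_predict(x, w0, w, V)
y_fast = fm_predict_fast(x, w0, w, V)
```

Both functions return the same prediction up to floating-point error; the second one is the form usually preferred in practice because it scales linearly with the number of attributes.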
3 Proposed Method
Traditional FMs model the interactions between two attributes by the inner product, which does not satisfy the triangle inequality. Recently, TransFM was proposed to replace the inner product with the squared Euclidean distance function to model the interactions, and it achieved better performance. The reason is that the Euclidean distance obeys the triangle inequality and can thus better capture the fine-grained relationships between feature vectors. However, existing FMs only consider the interactions between attributes, while ignoring the interactions between the features of each attribute (recall that attributes are represented as latent feature vectors in FMs to estimate their interactions). We take a toy example for explanation. Suppose that the brand attribute is represented by a 2-dimensional vector in FMs when predicting whether a user will purchase a product, and that the two dimensions are brand price and brand quality, respectively. Existing FMs all model the interactions between the brand and other attributes, while ignoring the correlations between the features of the brand itself, which are also important for making decisions. In fact, the inherent structure of real-world data is much more complex. To learn from such data effectively, it is thus crucial to consider a deeper level of interaction, i.e., the interactions between the features of an attribute.
In this section, we first introduce the overall framework of the proposed method, and then present two exemplar methods based on this framework to solve the above problems (e.g., linear and non-linear correlations between features). We then provide an optimization approach for the proposed method and the learning strategy adopted in this work. Finally, we theoretically prove that our method can be generalized to the vanilla FMs. The main notations involved in this paper are summarized in Table I.
|Notation|Definition and Description|
|$n$|Length of the concatenated attribute vector|
|$k$|Dimension of the embedded features for an attribute|
|$\eta$|Learning rate for gradient descent|
|$w_0$|Global bias in FMs|
|$w_i$|Weight of the $i$-th attribute|
|$w_{ij}$|Weight of the interaction of the $i$-th and $j$-th attribute|
|$\mathbf{x}$|Vector of the concatenated attributes|
|$\mathbf{v}_i$|Vector of the $i$-th attribute|
|$\mathbf{h}$|Vector for computing the transformation weight|
|$\mathbf{b}_l$|Vector of the $l$-th layer’s learnable bias|
|$\Theta$|Vector of learnable parameters|
|$\mathbf{L}$|Matrix to parameterize the linear transformation|
|$\mathbf{M}$|Matrix for constraining the Mahalanobis distance|
|$\mathbf{W}_l$|Matrix of the $l$-th layer’s learnable weights|
|$\mathcal{T}$|Set of all the training instances|
|$\mathcal{S}$|Set of similar data pairs|
|$\mathcal{D}$|Set of dissimilar data pairs|
3.1 Model Formulation
In order to model the feature correlations, we propose to replace the Euclidean distance function in FMs with a generalized metric learning based approach, and dub this method GML-FM. Similar to FMs, our proposed method can also take any real-valued feature vector as input. Formally, given an input vector $\mathbf{x} \in \mathbb{R}^n$, the target is estimated by,
$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} D(\mathbf{v}_i, \mathbf{v}_j)\, x_i x_j,$$
where $\mathbf{v}_i$ and $\mathbf{v}_j$ are the factorized feature embeddings and $D(\cdot, \cdot)$ is the generalized distance function.
However, distance functions are inherently limited to non-negative values, unlike the inner product, which limits the expressiveness of FMs. To solve this problem, we introduce a transformation weight $w_{ij}$ to convert the values of the second-order interactions to real numbers of either sign. Concretely, to avoid introducing more parameters and over-fitting, we leverage the existing embedded features $\mathbf{v}_i$ and $\mathbf{v}_j$ by combining them via the element-wise product, which is then converted to the transformation weight by a trainable vector $\mathbf{h}$,
$$w_{ij} = \mathbf{h}^\top (\mathbf{v}_i \odot \mathbf{v}_j).$$
The transformation weight increases the representation ability of distance-based FM methods, which is otherwise limited because distances are all non-negative real values. With it, the second-order interactions based on distance functions can achieve the same functionality as the ones based on the inner product. Therefore, the previous target function can be rewritten as,
$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij}\, D(\mathbf{v}_i, \mathbf{v}_j)\, x_i x_j.$$
Note that we do not explicitly define the similar and dissimilar attribute sets; instead, we leave the framework itself to automatically learn the correlations between attributes. In the next subsection, we present two instances of the generalized distance function $D(\cdot, \cdot)$, which can generalize the inner product as well as the Euclidean distance function.
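The formulation above can be sketched directly as a nested loop over attribute pairs. This is an illustrative, unoptimized sketch and not the released implementation: the function names, the pluggable `distance` argument and all toy values are assumptions made for this example.

```python
import numpy as np

def squared_euclidean(v_i, v_j):
    """Default distance for this sketch; any non-negative distance can be plugged in."""
    diff = v_i - v_j
    return float(diff @ diff)

def gml_fm_predict(x, w0, w, V, h, distance=squared_euclidean):
    """Bias + linear term + weighted pairwise distances between attribute embeddings."""
    n = len(x)
    y = w0 + float(w @ x)
    for i in range(n):
        for j in range(i + 1, n):
            # Transformation weight built from the existing embeddings and a
            # trainable vector h; unlike the distance it may be negative.
            w_ij = float(h @ (V[i] * V[j]))
            y += w_ij * distance(V[i], V[j]) * x[i] * x[j]
    return y

# Hypothetical toy example: three attributes, two of them active.
x = np.array([1.0, 1.0, 0.0])
w0 = 0.5
w = np.array([0.1, 0.2, 0.3])
V = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [0.0, 3.0]])
h = np.array([1.0, -2.0])

y_hat = gml_fm_predict(x, w0, w, V, h)
```

In this toy case the only active pair has a transformation weight of -2 and a squared distance of 2, so the pairwise term contributes -4 even though the distance itself is non-negative.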
3.2 Generalized Metric Learning based FMs
The Euclidean distance function is limited by its deficiency in modeling feature correlations. As these correlations play an important role in the target prediction, we thus devote our efforts to generalizing the Euclidean distance function to a generalized metric learning based one. In the following, we introduce two generalized metric learning based methods, which correspond to the linear and non-linear correlations between features, respectively.
3.2.1 Mahalanobis Distance Function
To effectively model the linear correlations between features, we adopt the Mahalanobis distance and form the distance function by,
$$D(\mathbf{v}_i, \mathbf{v}_j) = (\mathbf{v}_i - \mathbf{v}_j)^\top \mathbf{M} (\mathbf{v}_i - \mathbf{v}_j),$$
where $\mathbf{M}$ is a transformation matrix. In particular, if $\mathbf{M}$ is a diagonal matrix, the coordinate axes are orthogonal and the different features are treated as independent. Nevertheless, different features may have positive or other linear correlations (recall the positive correlation between brand price and brand quality for the attribute of product brand), in which case the coordinate axes are not orthogonal. Therefore, it is sub-optimal to set $\mathbf{M}$ to be a diagonal matrix. Besides, the distance function should be non-negative, which cannot be guaranteed by randomly initializing $\mathbf{M}$. Due to these two considerations, $\mathbf{M}$ is constrained to be a positive semi-definite matrix ($\mathbf{M} \succeq 0$), which can be learned automatically from the training data. In the following, we show how to obtain such a positive semi-definite matrix.
For features with linear correlations, in order to correctly model the interactions, it is common to perform a linear transformation before computing the Euclidean distance,
$$D(\mathbf{v}_i, \mathbf{v}_j) = \|\mathbf{L}\mathbf{v}_i - \mathbf{L}\mathbf{v}_j\|_2^2,$$
where $\mathbf{L}$ is a real-valued matrix that parameterizes the transformation. Furthermore, the linear transformation can be expressed by,
$$\mathbf{M} = \mathbf{L}^\top \mathbf{L}.$$
In this way, any matrix $\mathbf{M}$ obtained from a real-valued matrix $\mathbf{L}$ is guaranteed to be positive semi-definite (i.e., to have no negative eigenvalues). This can be verified through the following proof.
Proof. For any real-valued vector $\mathbf{z}$, the condition $\mathbf{z}^\top \mathbf{M} \mathbf{z} = \mathbf{z}^\top \mathbf{L}^\top \mathbf{L} \mathbf{z} = \|\mathbf{L}\mathbf{z}\|_2^2 \ge 0$ is always satisfied. Therefore, matrix $\mathbf{M}$ is positive semi-definite. ∎
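The argument above can be checked numerically. The sketch below draws a random real-valued matrix, forms its Gram matrix, and confirms that the quadratic form equals the squared Euclidean distance after the linear transformation. The variable names and the random seed are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4

L = rng.normal(size=(k, k))   # unconstrained real-valued matrix
M = L.T @ L                   # positive semi-definite by construction

v_i = rng.normal(size=k)
v_j = rng.normal(size=k)
diff = v_i - v_j

# Quadratic form with M versus squared Euclidean distance after applying L.
d_mahalanobis = float(diff @ M @ diff)
d_transformed = float(np.sum((L @ v_i - L @ v_j) ** 2))

# Smallest eigenvalue of the symmetric matrix M (non-negative up to rounding).
min_eigenvalue = float(np.linalg.eigvalsh(M).min())
```

Because the matrix is parameterized through its factor, no explicit projection onto the set of positive semi-definite matrices is needed while the factor is updated by gradient descent.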
3.2.2 DNN-based Distance Function
The aforementioned method models the feature interactions in a linear way, which is insufficient for capturing non-linear or other more complex correlations between features. For example, if we would like to recommend a piece of music to a user based on its elements attribute, which can be represented by rhythm and melody, it may not be optimal to combine these two features independently or linearly, since different types of rhythm can co-exist with the same melody and vice versa. To capture such complex correlations, it is better to model the feature interactions in a non-linear way. For modeling the interactions and obtaining better fused features, we resort to DNNs, which apply multiple layers of non-linear transformations to model feature interactions and have been proven to be very effective. Specifically, the original $\mathbf{v}$ is transformed by a deep neural network,
$$\hat{\mathbf{v}} = \sigma\big(\mathbf{W}_L(\cdots \sigma(\mathbf{W}_1 \mathbf{v} + \mathbf{b}_1) \cdots) + \mathbf{b}_L\big), \tag{7}$$
where $\mathbf{W}_l$ and $\mathbf{b}_l$ denote the weight matrix and bias vector for the $l$-th fully connected layer, respectively, and the scalar $L$ is the number of deep layers used in the network. A dropout layer is deployed between every two contiguous fully connected layers to prevent overfitting. For simplicity, all the learnable weights and biases are set to be the same size in this work, namely, $\mathbf{W}_l \in \mathbb{R}^{k \times k}$ and $\mathbf{b}_l \in \mathbb{R}^{k}$. Here $\sigma$ is the activation function, which could be sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), etc. In this work, we take tanh as the activation function for all layers, which maps the input into the range of -1 to 1. With this deep non-linear neural network, it is expected that more complex correlations between features inside attributes can be captured.
After this procedure, the distance function becomes,
$$D(\mathbf{v}_i, \mathbf{v}_j) = \|\hat{\mathbf{v}}_i - \hat{\mathbf{v}}_j\|_2^2,$$
where both $\hat{\mathbf{v}}_i$ and $\hat{\mathbf{v}}_j$ are the features learned via non-linear transformations from $\mathbf{v}_i$ and $\mathbf{v}_j$, respectively.
In particular, the Euclidean distance can be easily recovered by setting all the weight matrices in Equation 7 to be identity matrices, the bias vectors to be zeros and the activation functions to be the identity function.
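A minimal sketch of the DNN-based distance is given below, together with a check of the special case in which it collapses to the squared Euclidean distance. The helper names and random toy parameters are assumptions for illustration, and dropout is omitted because the sketch only covers the forward computation.

```python
import numpy as np

def transform(v, weights, biases, activation=np.tanh):
    """Pass an embedding through a stack of fully connected layers."""
    out = v
    for W, b in zip(weights, biases):
        out = activation(W @ out + b)
    return out

def dnn_distance(v_i, v_j, weights, biases, activation=np.tanh):
    """Squared Euclidean distance between the two transformed embeddings."""
    diff = (transform(v_i, weights, biases, activation)
            - transform(v_j, weights, biases, activation))
    return float(diff @ diff)

rng = np.random.default_rng(0)
k = 4
v_i = rng.normal(size=k)
v_j = rng.normal(size=k)

# Two tanh layers with k x k weights and k-dimensional biases (toy values).
weights = [rng.normal(size=(k, k)) for _ in range(2)]
biases = [rng.normal(size=k) for _ in range(2)]
d_tanh = dnn_distance(v_i, v_j, weights, biases)

# Identity weights, zero biases and identity activation give back
# the plain squared Euclidean distance.
d_identity = dnn_distance(v_i, v_j, [np.eye(k)], [np.zeros(k)],
                          activation=lambda z: z)
d_euclidean = float(np.sum((v_i - v_j) ** 2))
```

Since tanh bounds every transformed coordinate to the interval from -1 to 1, the distance in this configuration is bounded above by 4k.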
A visual comparison of our method and three state-of-the-art FM models is shown in Figure 2. As can be observed, all three previous methods (i.e., FM, NFM and TransFM) focus on the interactions at the attribute level, i.e., inter-attribute interactions. In contrast, our proposed GML-FM method takes the feature-level interactions inside attributes into consideration, i.e., intra-attribute interactions. In Section 5, we empirically demonstrate that the feature-level interactions are important for better modeling the complex and rich interactions of real-world data.
3.3 Proposed Efficient Solution
If the proposed method is computed in a straightforward way, the time cost is $O(k^2 n^2)$, where $k$ is the embedding size and $n$ is the length of the concatenated attribute vector, which is too expensive. In the following, we provide an approach to simplifying the model equations and analyze theoretically how it reduces the time complexity of the proposed method.
We first present a simplified general form of the second-order interaction and then provide the proposed approaches for both the Mahalanobis and DNN distance functions, which can significantly simplify the computation and reduce the time complexity. The general form of the second-order interaction of our proposed method is given in Equation 9,
$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij}\, D(\mathbf{v}_i, \mathbf{v}_j)\, x_i x_j = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\, D(\mathbf{v}_i, \mathbf{v}_j)\, x_i x_j, \tag{9}$$
where the indices $i$ and $j$ both start from 1 instead of forming a nested summation, which is critical for simplifying the original model equation. The equality holds because $w_{ij}$ and $D(\mathbf{v}_i, \mathbf{v}_j)$ are symmetric in $i$ and $j$, and because the distance function $D$ is zero if the two inputs are the same (i.e., $D(\mathbf{v}_i, \mathbf{v}_i) = 0$). In the following, we illustrate the two proposed generalized metric learning based FMs and the corresponding simplified forms.
3.3.1 Mahalanobis Distance Function
We show the derivation of the Mahalanobis distance function with the transformation weight in Equation 10. With the simplification of Equation 9, the second-order interaction of the original Mahalanobis distance function based FMs can be rewritten as,
where $\mathrm{diag}(\cdot)$ is an operation that converts a vector to a diagonal matrix. After this simplification, the summations over $i$ and $j$ can be computed independently or sequentially, which greatly reduces the complexity compared with the nested summation of the original model equations.
Notice that the time complexity of the two terms on the right-hand side (RHS) of Equation 10 is equal. Therefore, we only analyze the time complexity of the first term as an example. This term is a computation of two sums, and the main cost comes from the second one. The computation of the second sum is composed of three vector element-wise or inner products and one vector-matrix multiplication. The time complexity is therefore $O(k^2 n)$. Recall that the original time complexity of GML-FM is $O(k^2 n^2)$. Since $k$ is usually much smaller than $n$ ($k$ is usually in the tens or hundreds, whereas $n$ is often tens of thousands, or even tens of millions), the proposed solution can significantly reduce the original time complexity of the proposed method.
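The effect of this kind of simplification can be verified numerically. The sketch below compares a nested evaluation of the weighted Mahalanobis interaction term with one possible factored evaluation whose cost grows linearly in the number of attributes. The grouping of terms used here is an assumption of this example and is not necessarily identical to the authors' Equation 10; it relies only on the symmetry of the matrix and on the transformation weight being a weighted inner product of the two embeddings.

```python
import numpy as np

def second_order_naive(x, V, h, M):
    """Nested evaluation of sum_{i<j} w_ij * D(v_i, v_j) * x_i * x_j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w_ij = h @ (V[i] * V[j])
            diff = V[i] - V[j]
            total += w_ij * (diff @ M @ diff) * x[i] * x[j]
    return float(total)

def second_order_fast(x, V, h, M):
    """Same quantity for a symmetric M, with cost linear in n."""
    xV = V * x[:, None]                      # row i is x_i * v_i
    hV = V * h                               # row i is h (element-wise) v_i
    s = xV.sum(axis=0)                       # sum_j x_j v_j, shape (k,)
    C = V.T @ xV                             # sum_j x_j v_j v_j^T, shape (k, k)
    q = np.einsum("ik,kl,il->i", V, M, V)    # v_i^T M v_i for every i
    term_a = np.sum(x * q * (hV @ s))
    term_b = np.sum(x * np.einsum("ik,kl,il->i", hV, C @ M, V))
    return float(term_a - term_b)

# Hypothetical toy inputs.
rng = np.random.default_rng(0)
n, k = 8, 3
x = rng.random(n)
V = rng.normal(size=(n, k))
h = rng.normal(size=k)
L = rng.normal(size=(k, k))
M = L.T @ L                                  # symmetric positive semi-definite

naive = second_order_naive(x, V, h, M)
fast = second_order_fast(x, V, h, M)
```

The factored version accumulates the sums over one index once and reuses them for every value of the other index, which is where the saving over the nested loop comes from.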
3.3.2 Generalized Metric Distance with Deep Neural Networks
It is well known that training DNNs is very expensive compared to traditional shallow models. We thus propose to simplify the original DNN-based model equations. Similar to the previous case, the detailed derivation under this setting is,
where $\hat{\mathbf{v}}_i$ and $\hat{\mathbf{v}}_j$ are produced by deep neural networks,
Within the deep neural networks, the matrix-vector multiplication of weight matrices and input features is the main operation, which can be computed in $O(k^2)$ (we set all the weight matrices to be in $\mathbb{R}^{k \times k}$). In short, as above, the overall time complexity for evaluating the GML-FM method is $O(k^2 n)$.
3.4 Learning Strategy
GML-FM can be applied to a variety of prediction tasks, including classification, regression and ranking. In this work, we adopt a commonly used regression objective function (i.e., the squared loss),
$$\mathcal{L} = \sum_{\mathbf{x} \in \mathcal{T}} \big(\hat{y}(\mathbf{x}) - y(\mathbf{x})\big)^2,$$
where $\mathcal{T}$ represents all the training instances and $y(\mathbf{x})$ denotes the corresponding target of $\mathbf{x}$. To optimize the objective, we employ stochastic gradient descent (SGD), a universal solver for optimizing machine learning models. SGD updates the model parameters in the direction of the negative gradient. Normally, a mini-batch of training instances is selected for training and optimizing the model parameters,
$$\Theta \leftarrow \Theta - \eta \frac{\partial \mathcal{L}}{\partial \Theta},$$
where $\Theta$ denotes the learnable parameters (e.g., the weight matrices in deep neural networks), and $\eta$ is the learning rate that decides the step size of gradient descent.
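The mini-batch update can be sketched on a deliberately simple model. The example below fits only a bias and linear weights to synthetic targets. The data, the parameter names and the choice to average the gradient over the batch are assumptions of this illustration rather than details taken from the paper.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Sum of squared errors over a set of instances."""
    return float(np.sum((y_hat - y) ** 2))

def sgd_step(params, grads, lr):
    """Move every parameter against its gradient by a step of size lr."""
    return {name: value - lr * grads[name] for name, value in params.items()}

# Synthetic regression data (hypothetical): targets from a known linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.3

params = {"w0": 0.0, "w": np.zeros(5)}
loss_before = squared_loss(params["w0"] + X @ params["w"], y)

for _ in range(200):
    batch = rng.choice(len(X), size=32, replace=False)
    residual = params["w0"] + X[batch] @ params["w"] - y[batch]
    # Gradient of the squared loss, averaged over the mini-batch.
    grads = {"w0": 2.0 * residual.mean(),
             "w": 2.0 * X[batch].T @ residual / len(batch)}
    params = sgd_step(params, grads, lr=0.05)

loss_after = squared_loss(params["w0"] + X @ params["w"], y)
```

The same update rule applies unchanged to the embeddings, the trainable vector and the layer weights of a larger model once their gradients are available, for example from an automatic differentiation framework.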
3.5 Relation with FMs
In this subsection, we prove that the vanilla FMs are a special case of our GML-FM method by setting = 0 and to be the Euclidean distance. By expanding the squared function and constraining all the to be a constant, e.g., 1 (which gives more geometry meaning since the vector is an orthogonal basis), we can derive that,
where and are constants. It is worth pointing out that, to the best of our knowledge, this is the first work proving that metric learning based FMs can generalize the vanilla FMs.
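The expansion step mentioned above can be written out as follows. This is a sketch under the stated assumption that every embedding has unit norm; the symbols $\mathbf{v}_i$ and $\mathbf{v}_j$ for the attribute embeddings are notational assumptions, and the exact constants in the statement above are not reproduced here.

```latex
\begin{aligned}
\|\mathbf{v}_i - \mathbf{v}_j\|_2^2
  &= \|\mathbf{v}_i\|_2^2 + \|\mathbf{v}_j\|_2^2
     - 2\langle \mathbf{v}_i, \mathbf{v}_j \rangle \\
  &= 2 - 2\langle \mathbf{v}_i, \mathbf{v}_j \rangle
  \qquad \text{if } \|\mathbf{v}_i\|_2 = \|\mathbf{v}_j\|_2 = 1 .
\end{aligned}
```

Under this constraint the squared Euclidean distance is an affine function of the inner product, which is why a distance-based interaction term can reproduce the inner-product term of the vanilla FMs up to constants.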
4 Experimental Setup
We evaluate the effectiveness of our method on two common recommendation tasks: rating prediction and top-n recommendation. The former attempts to predict the rating that a user would give to an unseen item, and the latter ranks unseen items for a target user based on his/her preference. For the latter, our method ranks the items according to the predicted ratings.
The experiments are conducted on three datasets, including two publicly available datasets (i.e., the Amazon Product dataset [9, 20] and the MovieLens dataset) and a newly collected dataset, which is used to further explore the effectiveness of the proposed method under a very sparse setting. Details of these three datasets are as follows:
Amazon. The Amazon dataset (http://jmcauley.ucsd.edu/data/amazon/) contains product reviews, ratings and metadata, with the user interactions spanning from May 1996 to July 2014. In our experiments, we adopted the 5-core version of the datasets and used three categories: Auto, Office and Clothing. For attribute extraction, we leveraged the user ID, item ID and item sub-category as the attributes.
MovieLens. The MovieLens dataset (https://grouplens.org/datasets/movielens/) has long been recognized as a benchmark dataset for evaluating recommendation algorithms. We used one of the most stable versions, MovieLens 1M, which contains rich side information, including user gender, user age, user occupation, and item genres.
Mercari. This dataset is collected from a second-hand product trading platform, Mercari (https://mercari.com/), and includes 265.4 million items, 10.7 million buyers, 8.4 million sellers, and user behaviors spanning from Nov. 2016 to Oct. 2018. It contains user information (ID, status), product metadata (product ID, seller ID, price, brand, category, condition, size, description, status), product shipping information (methods, origin, duration, payer), and user behaviors (liking, listing, purchasing, and clicking). Two categories are adopted: Ticket and Books. We utilized the purchase interactions and kept the users with at least five items in their purchase history. For side information, we selected a set of item features, including category, condition, shipping method, shipping origin and shipping duration.
The basic statistics are summarized in Table II.
4.2 Compared Baselines
We compared the proposed method with two groups of recommendation methods: MF-based and FM-based. The former group contains MF, PMF, NCF and BPR-MF, which consider the user-item interactions only, without any side information. The latter group includes the state-of-the-art FM methods FM, NFM, AFM and TransFM, which leverage rich side information for recommendation. Specifically, MF and PMF are adopted to evaluate rating prediction, NCF and BPR-MF are designed for evaluating top-n recommendation, and the other four baselines are used for both tasks.
Unique Baselines for Rating Prediction:
MF  factorizes the rating matrix into latent vectors of users and items, and uses the inner product between users and items to estimate the interactions.
Unique Baselines for Top-n Recommendation:
NCF  extends MF methods with deep neural networks, where the inner product between users and items is replaced with non-linear interactions. NCF is devised in a point-wise learning to rank fashion.
BPR-MF  leverages the Bayesian Personalized Ranking (BPR) framework with MF as the underlying model. It adopts the pair-wise learning to rank strategy.
Common Baselines for Both Tasks:
FM. This is the standard FM method, which was originally proposed for recommendation. In the experiments, we used the official implementation LibFM (http://libfm.org/) and adopted the SGD optimizer in accordance with the other methods.
NFM  extends the inner product interaction in FMs with non-linear multi-layer perceptron. This work brings together the effectiveness of linear factorization machines with the strong representation ability of non-linear neural networks.
AFM  automatically learns the importance of each attribute interaction via deep neural networks (i.e., attention networks).
TransFM is a recently proposed metric learning based FM. It replaces the inner product in FM with the Euclidean distance function. We adapted it from sequential recommendation to the general recommendation setting by removing the constraint that two items have to be sequentially adjacent.
4.3 Evaluation Protocols
As the objectives of rating prediction and top-n recommendation are different, we used separate settings for these two tasks.
4.3.1 Rating Prediction
For the rating prediction task, as the dataset contains positive instances only, we randomly sampled two negative instances for each positive instance (i.e., user-item pairs for which the current user has no interaction) to ensure the generalization of the predictive model. We set the score of positive instances to 1 and that of negative instances to -1 for the implicit feedback setting. Moreover, the dataset is randomly split into 70% training, 20% validation and 10% testing, where the validation set is used for tuning the hyper-parameters and the final results are reported on the testing set.
The evaluation metric we adopted is the commonly used Root Mean Square Error (RMSE), where a lower score denotes the better model performance.
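For reference, RMSE is the square root of the mean squared difference between predicted and true targets. A minimal sketch follows; the toy targets use the +1/-1 labels of the implicit-feedback setting described above, and the predicted values are made up for illustration.

```python
import math

def rmse(predictions, targets):
    """Root mean square error between two equally long sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions for two positive (+1) and two negative (-1) instances.
targets = [1.0, -1.0, 1.0, -1.0]
predictions = [0.5, -1.0, 1.0, 0.0]
score = rmse(predictions, targets)
```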
4.3.2 Top-n Recommendation
To evaluate the model performance, we followed the widely used leave-one-out evaluation [24, 11], where the latest interaction data of each user is used for testing and all the previous interactions are used for training. As all the datasets contain positive interactions only, we randomly sampled two negative instances to pair with one positive instance in the training set to ensure the generalization of the models [10, 38]. For fair comparison, we leveraged the same positive and negative instances for all the models.
We applied two standard metrics in evaluation: Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), where the former indicates the percentage of test items that are recommended correctly, while the latter considers the position of positive items in the ranking list. Since it is too time-consuming to rank all the items for each user during testing, we followed the common strategy [11, 16] of randomly selecting 99 items that are not purchased by the target user, ranking the test item among them, and truncating the ranking list at the top 10 for both metrics. We calculated the two metrics for each user and reported the average scores.
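Under the leave-one-out protocol with one held-out item and 99 sampled negatives per user, both metrics depend only on the rank of the held-out item. The sketch below uses the common single-positive forms of HR@10 and NDCG@10. The function names, the optimistic handling of ties and the toy scores are assumptions made for this illustration.

```python
import math

def rank_of_positive(pos_score, neg_scores):
    """0-based rank of the held-out item among the sampled candidates."""
    return sum(1 for s in neg_scores if s > pos_score)

def hit_ratio_at_k(rank, k=10):
    """1 if the held-out item appears in the top-k list, else 0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k=10):
    """With a single relevant item, NDCG reduces to 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

# Two hypothetical users, each with one positive and 99 negative scores.
users = [
    (0.9, [0.95] + [0.1] * 98),       # positive ranked second
    (0.5, [1.0] * 12 + [0.0] * 87),   # positive outside the top 10
]
ranks = [rank_of_positive(pos, negs) for pos, negs in users]
hr = sum(hit_ratio_at_k(r) for r in ranks) / len(ranks)
ndcg = sum(ndcg_at_k(r) for r in ranks) / len(ranks)
```

Averaging over users gives the reported score; in this toy case one of the two users scores a hit, so the hit ratio is 0.5.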
4.4 Parameter Settings
The learning rate is tuned over {0.0001, 0.001, 0.01, 0.1}, and the dropout ratio over the range [0, 1.0] with a step size of 0.1. The batch size is fixed to 256. We carefully tuned the number of deep layers from 0 to 3 and the embedding size from 4 to 512. The code has been released for the reproducibility of this work.
5 Experimental Results
In the result tables, the significance markers denote statistical significance under a two-sided t-test, compared to the best baseline. The best performance is highlighted in boldface.
In this section, we report and analyze the experimental results. Particularly, we focus on the following research questions:
RQ1: Can our model outperform the state-of-the-art recommendation baselines on both tasks?
RQ2: Are the Mahalanobis distance and DNN-based distance (in different layers) helpful for the final results compared to the Euclidean one?
RQ3: How does the embedding size affect both the proposed model and baselines’ performance?
RQ4: How does the proposed method perform with different attributes?
RQ5: What do the proposed metric learning based and the previous FM methods learn in the embedded latent space?
5.1 Performance Comparison (RQ1)
Table III and Table IV show the performance of our methods and the baselines across all the datasets on the rating prediction and top-n recommendation tasks, respectively. Our method has two variants: GML-FM with the Mahalanobis distance function and GML-FM with the DNN-based distance function. In addition, we conducted a pairwise significance test between our method and the best-performing baseline. Note that a smaller RMSE in Table III and a larger HR or NDCG in Table IV denote better performance. The key observations are as follows.
Firstly, our proposed methods, especially GML-FM with the DNN-based distance, outperform all the baselines across the six datasets consistently and significantly (the only exception is a slightly worse result on the MovieLens dataset for the rating prediction task). In particular, the sparser the dataset is, the larger the improvement our method achieves. Specifically, the sparsity of the three datasets MovieLens, Amazon-Office and Mercari-Ticket is 95.53%, 99.55% and 99.97%, respectively, and the corresponding absolute improvements in HR on the top-n recommendation task over the best baseline are 0.08%, 6.32% and 16.13%, respectively. This demonstrates the advantage of our method on sparse datasets.
Secondly, GML-FM with the Mahalanobis distance also outperforms all the baselines on most occasions. However, its performance is inferior to that of GML-FM with the DNN-based distance, which suggests that the feature correlations in real-world data are often non-linear and quite complex.
In addition, across all baselines equipped with FMs, the neural network based method NFM and the metric learning method TransFM achieve the best or competitive performance in most cases. Specifically, NFM achieves better on rating prediction task while TransFM achieves the best on top-n recommendation task. This validates the effectiveness of deep neural networks and the metric learning over the inner product in FMs, respectively.
Finally, the FM-based baselines surpass the MF-based ones (i.e., MF and PMF for rating prediction, NCF and BPR for top-n recommendation) on four sparser datasets (i.e., Amazon-Clothing, Amazon-Auto, Mercari-Ticket and Mercari-Books). It is expected because FM-based methods exploit more side information, which can improve the recommendation accuracy, especially for sparse datasets. This point has been widely proved in studies [10, 38]. It is also worth noting that although the HR of BPR-MF is worse than other methods on top-n recommendation task, its NDCG is very competitive except for the two most sparse datasets. It is because NDCG takes the position of positive items into consideration, and pairwise learning methods are more suitable for the ranking task.
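The sparsity figures quoted in the first observation are presumably computed in the usual way, as the fraction of unobserved user-item pairs. A minimal sketch under that assumption follows; the counts used below are hypothetical and are not the statistics of any dataset in this paper.

```python
def sparsity(num_interactions, num_users, num_items):
    """Fraction of the user-item matrix that has no observed interaction."""
    return 1.0 - num_interactions / (num_users * num_items)


# Hypothetical counts chosen only to illustrate the formula.
toy_sparsity = sparsity(num_interactions=4470, num_users=1000, num_items=100)
print(f"{toy_sparsity:.2%}")
```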
5.2 Ablation Study (RQ2)
To validate the effectiveness of the Mahalanobis distance and of different numbers of deep neural network layers, we evaluated variants of our proposed method and reported their performance on two datasets (i.e., MovieLens and Mercari-Ticket) for the two tasks in Table V. The main observations are three-fold:
|Variant|Rating Prediction| |Top-n Recommendation| | | |
| |RMSE (MovieLens)|RMSE (Mercari-Ticket)|HR (MovieLens)|NDCG (MovieLens)|HR (Mercari-Ticket)|NDCG (Mercari-Ticket)|
|w/o. weight & M|0.6861|1.0693|0.6435|0.3702|0.1699|0.0743|
|w/. M only|0.6815|0.9627|0.6091|0.3446|0.0423|0.0181|
|w/. weight & M|0.6469|0.7736|0.6608|0.3742|0.5349|0.2478|
First, the introduced transformation weight is critical for enhancing the model's representation capability. In particular, on the Mercari-Ticket dataset, the absolute improvement of HR is 35.46% and 49.26% for the Euclidean and Mahalanobis distance functions on the top-n recommendation task, respectively. Note that the values in the row '#layers 0' are the results of the Euclidean distance with the transformation weight.
Second, the performance of the Mahalanobis distance function is inferior to that of the Euclidean one when the transformation weight is removed. However, with the introduction of the transformation weight, the Mahalanobis method consistently outperforms the Euclidean one. The reason is that, with the transformation weight, the Mahalanobis distance is better suited to capturing the inherent correlations between features than the simple Euclidean distance.
Third, when the number of layers increases from 0 to 2, the performance improves consistently. Nevertheless, when the number of layers is set to 3, a large deterioration can be observed. This is mainly because more parameters lead to over-fitting. According to this table, two deep layers are a reasonable choice on most occasions.
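As a quick consistency check on the first observation, the Mahalanobis improvement can be recomputed from the Table V rows shown above, assuming the fifth numeric column is HR on Mercari-Ticket. The second line is only the HR value implied by the quoted 35.46% Euclidean improvement; it is not read from a row shown here.

```latex
\begin{align*}
\Delta \mathrm{HR}_{\text{Mahalanobis}} &= 0.5349 - 0.0423 = 0.4926 \quad (49.26\%),\\
\mathrm{HR}_{\text{Euclidean, with weight}} &= 0.1699 + 0.3546 = 0.5245 .
\end{align*}
```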
5.3 Effects of Embedding Size (RQ3)
To analyze the effect of the embedding size on our proposed method and the other six baselines, we present the results of these methods with increasing embedding size on four datasets: Amazon-Clothing, Amazon-Auto, Amazon-Office, and MovieLens. Note that we used the GML-FM with deep neural networks in this experiment. Figure 3 and Figure 4 show how the performance changes with increasing embedding size on the rating prediction and top-n recommendation tasks, respectively. It can be observed that, for almost all the methods, the performance first improves as the embedding size increases and then starts to converge or deteriorate. In general, models with small embedding sizes lack representation ability, while overly large embedding sizes can lead to over-fitting.
From these two figures, we can observe that our method (i.e., the red curve) surpasses all the baselines by a large margin under most embedding sizes, except for NFM (i.e., the green curve) on the MovieLens dataset. This demonstrates that the feature-level interactions inside attributes are important for improving recommendation performance under sparse settings (MovieLens is the densest of all the experimented datasets, and NFM does not take feature-level interactions into consideration). This matters in practice, since the cold-start and data sparsity problems are the main obstacles for today's recommender systems.
Another observation is that the performance of GML-FM is more stable than that of the other methods as the embedding size increases. Besides, GML-FM is less prone to over-fitting when the embedding size grows larger on all datasets. This implicitly shows the robustness of GML-FM compared with the baselines.
5.4 Attribute Effect Exploration (RQ4)
As the newly collected Mercari dataset contains rich types of side information, we conducted this study to explore the effect of the side information on its two sub-datasets on the top-n recommendation task. The results of our proposed method with different attributes are shown in Table VI. The attributes are as follows: 'base' refers to user and item, 'cty' refers to item category, 'cdn' refers to item condition (e.g., 70% new), and 'shp' refers to shipping information (e.g., shipping duration: 2 days, shipping method: air flight). Note that the item condition and shipping information are unique to our collected dataset.
From Table VI, we can observe that the method without any side information performs poorly (38.29% and 29.52% absolute degradation of HR on Mercari-Ticket and Mercari-Books, respectively, compared to the model with all attributes). After adding the item category attribute, a large improvement can be observed. With the additional information of the item condition, the performance degrades slightly; in contrast, with the item shipping information, the performance improves. This indicates that the product condition does not provide discriminative features for purchase prediction, whereas the shipping method is strongly related to the shipping duration and costs, which are more important to users. Finally, it is interesting to find that with both the condition and shipping attributes (i.e., all the contextual information), the performance can be further improved. This suggests that, in general, more side information improves the performance, and it reflects the complex interactions among all the features.
5.5 Case Study (RQ5)
To intuitively understand what the proposed method and the other FM-based baselines learn in the latent space, we visualize the item ID embeddings of two users in Figure 5 and Figure 6, respectively. In particular, we used two groups of item ID embeddings and reduced their dimension to 2 for visualization. The two groups of items are: 1) positive samples: the items that the user has interacted with in the training set, shown in brown; and 2) negative samples: randomly sampled items that the user has not interacted with in the training set (equal in number to the first group), shown in blue. As shown in Figure 5 and Figure 6, the metric learning based methods (i.e., TransFM and GML-FM) demonstrate a clear advantage over the inner product based ones (i.e., FM and NFM): the positive samples under the metric learning based methods are grouped into clusters, while those under the inner product based methods do not show obvious patterns. Compared to TransFM, the proposed GML-FM clusters the positive samples more clearly to one side. The reason why there are no sharp borders between positive and negative samples is that we used the positive samples in the training set, while some negative ones may have interactions with the user in the testing set. Concretely, for user ID 709 in Figure 5, the positive items clustered by GML-FM are mainly on the right side, and for user ID 1050 in Figure 6, they are in the top-left area. This demonstrates that the feature interactions inside attributes are important for learning better attribute representations, and that GML-FM captures these interactions more effectively than TransFM.
6 Related Work
6.1 Factorization Machines
Factorization Machines (FMs) are a general model class that works with any real-valued feature vector. Different from SVMs, FMs model all attribute interactions using factorized parameters, which enables them to estimate interactions even in tasks with huge sparsity where SVMs fail. FMs generalize a variety of well-known models, including SVMs [37], MF [31], PITF [27], and FPMC [25].
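The factorized parametrization mentioned above can be sketched as follows, using the standard second-order FM form in which the weight of each attribute pair is the inner product of two latent vectors. The function names and the toy parameters are illustrative assumptions; the second function is a naive reference that enumerates all pairs.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM score, computed in O(n * k) with the usual identity
    sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Reference version that enumerates every attribute pair explicitly."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


# Toy input: four attributes (three active), latent dimension k = 2.
x = [1.0, 0.0, 1.0, 1.0]
w0 = 0.1
w = [0.2, -0.1, 0.0, 0.3]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5]]
fast_score = fm_predict(x, w0, w, V)
naive_score = fm_predict_naive(x, w0, w, V)
```

Because the pair weights are inner products of shared latent vectors rather than free parameters, an interaction can be scored even if that exact pair never co-occurs in the training data, which is why FMs cope with sparse inputs.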
FMs can be naturally applied to context-aware recommender systems, where rich context information (e.g., user mood when watching movies) is well exploited in [26]. However, traditional FMs usually treat all attribute interactions equally, which is not reasonable, since interactions are not equally important and some may even introduce noise into the score prediction. Several studies have been devoted to solving this problem by leveraging gradient boosting [6], attention mechanisms [38], and Bayesian personalized feature interaction selection [5]. For example, Xiao et al. [38] presented an attention-based FM to automatically learn the importance of different attribute interactions for the final score prediction. The learned attention weights guide FMs to treat different interactions discriminatively. Besides, in order to model non-linear and complex second-order attribute interactions, He et al. [10] introduced deep neural networks into FMs, where a Bi-Interaction operation is first applied to the attribute interactions, followed by several fully connected deep layers.
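The Bi-Interaction step followed by a fully connected layer can be sketched as below. This is a minimal illustration that assumes the pooling form used by NFM (the element-wise products of all pairs of weighted embeddings, summed into one k-dimensional vector); the single ReLU layer and all weights are hypothetical.

```python
def bi_interaction(x, V):
    """Pool all pairwise element-wise products x_i v_i * x_j v_j (i < j)
    into one k-dimensional vector, using the linear-time identity."""
    k = len(V[0])
    pooled = []
    for f in range(k):
        s = sum(xi * vi[f] for xi, vi in zip(x, V))
        s_sq = sum((xi * vi[f]) ** 2 for xi, vi in zip(x, V))
        pooled.append(0.5 * (s * s - s_sq))
    return pooled


def dense_relu(h, W, b):
    """One fully connected layer with ReLU activation (weights are illustrative)."""
    return [
        max(0.0, sum(wj * hj for wj, hj in zip(row, h)) + bias)
        for row, bias in zip(W, b)
    ]


# Toy input: four attributes (three active), latent dimension k = 2.
x = [1.0, 0.0, 1.0, 1.0]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5]]
pooled = bi_interaction(x, V)
hidden = dense_relu(pooled, W=[[1.0, 1.0], [-1.0, 0.0]], b=[0.0, 0.0])
```

Summing the pooled vector recovers the second-order term of a plain FM, so the deep layers stacked on top are what add the non-linearity.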
In addition, FMs have been developed and extended to many different tasks [4, 19, 14]. For instance, Lu et al. [19] proposed an efficient multi-linear FM model to address the multi-task multi-view problem. Petroni et al. [23] developed a novel MF model based upon FMs for open relation extraction, which can effectively integrate various side information such as metadata. Effective yet efficient higher-order interactions are also studied in [4, 3].
To the best of our knowledge, previous FM-based methods focus on attribute-level interactions (i.e., inter-attribute interactions), while the feature-level interactions inside attributes (i.e., intra-attribute interactions) are left untapped. However, modeling only the attribute interactions is insufficient to capture finer-grained feature interactions, resulting in sub-optimal performance. In this work, we attempt to bridge this gap by proposing a generalized metric learning based FM method to model the feature interactions of an attribute.
6.2 Metric Learning
Metric learning has attracted more and more attention over the past few years. Various applications utilize it to learn an appropriate distance metric for specific decision making. For example, [39] employed semi-definite programming to learn a Mahalanobis distance metric for clustering. Different from clustering, where all the objects with the same label are grouped together, [36, 7] designed efficient distance metric learning algorithms that keep an object and its k-nearest neighbors close in the mapped feature space. One typical way to implement this is to automatically learn a distance metric that pulls similar pairs together and pushes dissimilar pairs apart.
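The two ideas in this paragraph, a learnable Mahalanobis metric and a pull/push objective, can be sketched as follows. The sketch assumes the metric matrix is parametrized as M = LᵀL so that it stays positive semi-definite, and it uses a simple hinge on one (anchor, similar, dissimilar) triple; the margin and all names are illustrative choices rather than the formulation of any cited work.

```python
def mahalanobis_sq(x, y, L):
    """Squared Mahalanobis distance (x - y)^T M (x - y) with M = L^T L,
    computed as the squared Euclidean norm of L (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    proj = [sum(l * d for l, d in zip(row, diff)) for row in L]
    return sum(p * p for p in proj)


def triplet_hinge(d_similar, d_dissimilar, margin=1.0):
    """Zero once the dissimilar pair is at least `margin` farther than the similar pair."""
    return max(0.0, margin + d_similar - d_dissimilar)


identity = [[1.0, 0.0], [0.0, 1.0]]
stretch = [[2.0, 0.0], [0.0, 1.0]]  # weights the first feature more heavily

d_euclid = mahalanobis_sq([1.0, 2.0], [4.0, 6.0], identity)  # plain squared Euclidean
d_metric = mahalanobis_sq([1.0, 2.0], [4.0, 6.0], stretch)
loss = triplet_hinge(d_similar=2.0, d_dissimilar=2.5)
```

With the identity matrix the distance reduces to the squared Euclidean distance, which is why the Euclidean case can be viewed as a special case of the Mahalanobis one.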
Recently, several efforts have been dedicated to combining metric learning with recommendation, such as collaborative filtering [17]. For instance, Bachrach et al. [2] presented an order-preserving transformation that maps the maximal inner product search problem into a Euclidean nearest neighbor search problem. Hsieh et al. [13] proposed a collaborative metric learning method to connect metric learning and collaborative filtering. This method encodes user-item relations and user-user/item-item similarity into a joint metric space, and provides a suite of feature fusion techniques to make it feasible. This idea is further extended by LRML [33], which models the latent relations that describe each user-item interaction, and MAML [18], which considers users' varying preferences on items.
As metric learning shows clear advantages over the inner product in recommendation, Pasricha et al. [22] recently adapted it to model side information with FMs. Specifically, the Euclidean distance function is employed to estimate the distance between two attribute embedding vectors, where one of them is the sum of the embedding vector and a translation vector of the current attribute. Different from the work in [22], we exploit metric learning techniques in a novel way, leveraging them to model the feature interactions inside attributes.
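The translation-based interaction described above can be sketched as follows. This is a minimal illustration that assumes the squared Euclidean distance and the usual sum over attribute pairs weighted by the attribute values; the function names and toy vectors are hypothetical and are not taken from the cited implementation.

```python
def translated_sq_distance(v_i, t_i, v_j):
    """Squared Euclidean distance between the translated embedding (v_i + t_i) and v_j."""
    return sum((a + t - b) ** 2 for a, t, b in zip(v_i, t_i, v_j))


def pairwise_translation_term(x, V, T):
    """Sum of translated distances over all attribute pairs i < j, weighted by x_i * x_j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += translated_sq_distance(V[i], T[i], V[j]) * x[i] * x[j]
    return total


# Toy input: two active attributes with 2-dimensional embeddings and translations.
x = [1.0, 1.0]
V = [[0.0, 0.0], [3.0, 4.0]]
T = [[1.0, 0.0], [0.0, 0.0]]
term = pairwise_translation_term(x, V, T)
```

Note that this distance compares whole attribute embeddings, whereas the method proposed in this paper targets the correlations between the individual features inside those embeddings.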
7 Conclusion and Future Work
In this work, we present a novel FM framework equipped with generalized metric learning techniques, namely GML-FM, to model the finer-grained feature correlations in FMs. More concretely, we present two methods under the framework: one based on the Mahalanobis distance and one based on deep neural networks. The former adopts the Mahalanobis distance function, which contains a learnable positive semi-definite matrix, and is able to capture the linear correlations between features. The latter utilizes deep neural networks to capture the non-linear feature correlations. Furthermore, we design a transformation weight, which extends the values of metric learning based FMs to cover the whole real number space and thereby increases the representation capability. In addition, we propose an efficient approach to reducing the computational complexity. Another contribution is that we collect a new large-scale second-hand trading dataset to facilitate the study of cold-start and data sparsity problems in recommender systems. Extensive experiments on several benchmark datasets and the newly developed dataset validate the effectiveness of the proposed method.
In the future, we will explore pairwise learning techniques for GML-FM by enhancing it with the Bayesian Personalized Ranking (BPR) approach. Furthermore, as the Mercari dataset provides rich user-submitted queries and the corresponding click information, we will adapt and apply the GML-FM method to product search tasks to explore the effectiveness of the proposed method in other domains.
This work is supported by the National Natural Science Foundation of China, No.: 61902223, No.: 61772310, No.: U1936203; the Shandong Provincial Natural Science Foundation, No.: ZR2019JQ23; and the Innovation Teams in Colleges and Universities in Jinan, No.: 2018GXRC014.
References
[1] (2018) HiTR: hierarchical topic model re-estimation for measuring topical diversity of documents. TKDE.
[2] (2014) Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In RecSys, pp. 257–264.
[3] (2016) Higher-order factorization machines. In NIPS, pp. 3351–3359.
[4] (2016) Polynomial networks and factorization machines: new insights and efficient training algorithms. In ICML, pp. 850–858.
[5] (2019) Bayesian personalized feature interaction selection for factorization machines. In SIGIR.
[6] (2014) Gradient boosting factorization machines. In RecSys, pp. 265–272.
[7] (2005) Neighbourhood components analysis. In NIPS, pp. 513–520.
[8] (2016) The movielens datasets: history and context. TIIS 5 (4), pp. 19.
[9] (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pp. 507–517.
[10] (2017) Neural factorization machines for sparse predictive analytics. In SIGIR, pp. 355–364.
[11] (2017) Neural collaborative filtering. In WWW, pp. 173–182.
[12] (2018) Correlated matrix factorization for recommendation with implicit feedback. TKDE 31 (3), pp. 451–464.
[13] (2017) Collaborative metric learning. In WWW, pp. 193–201.
[14] (2016) Field-aware factorization machines for ctr prediction. In RecSys, pp. 43–50.
[15] (2015) Adam: a method for stochastic optimization. In ICLR.
[16] (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, pp. 426–434.
[17] (2018) Scalable content-aware collaborative filtering for location recommendation. TKDE 30 (6), pp. 1122–1135.
[18] (2019) User diverse preference modeling by multimodal attentive metric learning. In MM, pp. 1526–1534.
[19] (2017) Multilinear factorization machines for multi-task multi-view learning. In WSDM, pp. 701–709.
[20] (2015) Image-based recommendations on styles and substitutes. In SIGIR, pp. 43–52.
[21] (2008) Probabilistic matrix factorization. In NIPS, pp. 1257–1264.
[22] (2018) Translation-based factorization machines for sequential recommendation. In RecSys, pp. 63–71.
[23] (2015) CORE: context-aware open relation extraction with factorization machines. In EMNLP, pp. 1763–1773.
[24] (2009) BPR: bayesian personalized ranking from implicit feedback. In UAI, pp. 452–461.
[25] Factorizing personalized markov chains for next-basket recommendation. In WWW, pp. 811–820.
[26] (2011) Fast context-aware recommendations with factorization machines. In SIGIR, pp. 635–644.
[27] Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, pp. 81–90.
[28] (2010) Factorization machines. In ICDM, pp. 995–1000.
[29] (2012) Factorization machines with libfm. TIST 3 (3), pp. 57.
[30] (2002) Methods and metrics for cold-start recommendations. In SIGIR, pp. 253–260.
[31] (2005) Maximum-margin matrix factorization. In NIPS, pp. 1329–1336.
[32] (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR 15 (1), pp. 1929–1958.
[33] (2018) Latent relational metric learning via memory-based attention for collaborative ranking. In WWW, pp. 729–739.
[34] (1982) Similarity, separability, and the triangle inequality. Psychological Review 89 (2), pp. 123.
[35] (2018) Addressing interpretability and cold-start in matrix factorization for recommender systems. TKDE 31 (7), pp. 1253–1266.
[36] (2009) Distance metric learning for large margin nearest neighbor classification. JMLR 10 (Feb), pp. 207–244.
[37] (2018) Efficient multi-class probabilistic svms on gpus. TKDE.
[38] (2017) Attentional factorization machines: learning the weight of feature interactions via attention networks. In IJCAI, pp. 3119–3125.
[39] (2003) Distance metric learning with application to clustering with side-information. In NIPS, pp. 521–528.