1. Introduction
In the big data era, people are overwhelmed by the huge amount of information on the Internet, making recommender systems (RSs) an indispensable tool for finding information of interest. Collaborative filtering (CF), which predicts a user's preferences based on other users similar to him/her, has been the most popular recommendation method in the last decade (Herlocker et al., 1999; Koren, 2008). In recent years, researchers have tried to incorporate auxiliary information, or side information, to enhance CF. For example, social connections among users (Ma et al., 2011; Zhao et al., 2017a), reviews of items (McAuley and Leskovec, 2013; Ling et al., 2014), metadata attached to commodities (Wang et al., 2018), and locations of users and items (Ye et al., 2011; Zheng et al., 2012) have been shown to be effective for improving recommendation performance. However, a major limitation of most existing methods is that the various types of side information are processed independently (Haghighat et al., 2016), leading to information loss across them. This limitation becomes more and more severe, because modern websites record rich heterogeneous side information about their users and contents (Pan, 2016), and it would be a huge loss to their business if the side information could not be fully utilized. For example, on Yelp (https://www.yelp.com/), a website recommending businesses to users, users can follow other users to form a social network, businesses have categories and locations, and users can write reviews on businesses. If each type of side information is processed in isolation, information that exists across different types will be neglected. Therefore, a unifying framework is needed to fuse all side information for producing effective recommendations.
Heterogeneous information networks (HINs) (Sun et al., 2011; Shi et al., 2017) have been proposed as a general data representation tool for different types of information, such as scholarly networks (Sun et al., 2011)
and knowledge graphs
(Wang et al., 2015a). Thus, HINs can also be used to model rich side information for RSs (Yu et al., 2014; Shi et al., 2015). Figure 1 shows an example HIN on Yelp, and Figure 2 shows a network schema defined over the entity types User, Review, Aspect, Business, etc. Based on the network schema, we can design metapaths (Sun et al., 2011; Shi et al., 2017), which are sequences of node types, to compute the similarities between users and businesses for generating recommendations. For example, we can define a complicated metapath, $U \rightarrow R \rightarrow A \leftarrow R \leftarrow U \rightarrow B$, to measure the similarity between a user and a business based on similar reviews written by the two users about the same aspect. In summary, we can unify rich side information with a HIN and design metapaths to compute user-item similarities induced from different semantics for making effective recommendations.

There are two major issues facing existing HIN-based RSs. The first issue is that metapaths are not enough for representing the rich semantics of HIN-based RSs. We refer to it as the semantic limitation. Figure 1 shows a concrete example, where a metapath captures two users' similarity because both users write reviews that mention the same aspect (seafood). However, if we want to capture the similarity induced by the two users' reviews mentioning the same aspect (such as seafood) and at the same time rating the same business (such as Royal House), then no metapath can capture this semantic. Thus, we need a better way to capture such complicated semantics. Recently, Huang et al. (Huang et al., 2016) and Fang et al. (Fang et al., 2016) proposed to use metagraphs (or metastructures) for computing the similarity between homogeneous types of entities (e.g., using Person to search Person) over HINs, which are more powerful than metapaths in capturing complex semantics. However, they did not explore metagraphs for entities of heterogeneous types, which are essential for RSs.
In this paper, we extend metagraphs to capture complex semantic similarities between users and items (businesses) in recommendation scenarios.
The second issue is about similarity fusion, i.e., how to fuse the similarities produced by different semantics between users and items for HIN-based RSs. Our goal is to achieve accurate predictions of users' ratings on items, which can be formulated as a matrix completion problem on the user-item rating matrix. Currently, there are two principled ways to do so. One way to predict missing ratings in a HIN is to use metapaths to generate many ad hoc similarities among users and items, and then learn a weighting mechanism to explicitly combine the similarities from different metapaths to approximate the user-item rating matrix (Shi et al., 2015). This approach does not utilize the latent features derivable from a metapath; thus, a similarity matrix could be too sparse to contribute to the final ensemble. The other way is to first factorize each user-item similarity matrix to obtain user and item latent features, and then use all the latent features to recover the user-item rating matrix (Yu et al., 2014). This method solves the sparsity problem associated with each similarity matrix. However, it does not fully utilize the latent features, because it captures only linear interactions among them and ignores the interactions across different metapaths. Therefore, existing HIN-based recommendation methods (Yu et al., 2014; Shi et al., 2015) suffer from information loss in various ways.
To address the above challenges, we propose a new systematic way to fuse various side information in a HIN for recommendation. First, instead of using metapaths (Yu et al., 2014; Shi et al., 2015), we introduce the concept of metagraph to the recommendation problem, which allows us to incorporate more complex semantics into HIN-based RSs. Second, instead of computing the recovered matrices directly, we utilize the latent features from all metagraphs. Based on matrix factorization (MF) (Mnih and Salakhutdinov, 2007; Koren, 2008) and the factorization machine (FM) (Rendle, 2012)
, we propose an “MF + FM” framework for our metagraph based RS in HINs. We first compute the user-item similarity matrix from each metagraph and then factorize it to obtain a set of user and item vectors representing the latent features of users and items. Finally, after obtaining sets of user and item latent features from the metagraphs, we use FM to assemble them to predict the missing ratings that users would give to the items. This method enables us to capture nonlinear interactions among all of the latent features, which has been demonstrated to be effective in FM-based RSs
(Rendle, 2012). To further improve the performance of the “MF + FM” framework, we propose to use group lasso (Jacob et al., 2009) with FM (denoted FMG) to learn the parameters for selecting the metagraphs that contribute most to recommendation effectiveness. Besides, we also adopt a nonconvex variant of the group lasso regularization (Candès et al., 2008). This leads to a nonconvex and nonsmooth optimization problem, which is difficult to solve. We propose two algorithms to solve it efficiently: one is based on the proximal gradient algorithm (Parikh and Boyd, 2014) and the other on the stochastic variance reduced gradient algorithm
(Xiao and Zhang, 2014). As a result, we can automatically determine, for different applications, which metagraphs are most effective and how the features in each group of user and item features from a metagraph should be weighted.

Experimental results on two large real-world datasets, Amazon and Yelp, show that our framework significantly outperforms recommendation methods that are solely based on MF, FM, or metapaths in HINs. Preliminary results of this paper have been reported in (Zhao et al., 2017b). In this full version, in addition to matrix factorization (MF), we explore nuclear norm regularization (NNR) to obtain latent features in Section 3.3.2. Furthermore, we adopt nonconvex regularization to boost metagraph selection performance (Section 4.2.2) and design a new optimization algorithm, which is more efficient than the one used in (Zhao et al., 2017b) (Section 4.3.2). Finally, additional experiments are performed to support the above research in Sections 5.4, 5.6 and 5.8. Moreover, we also give practical suggestions on applying our framework to other RS scenarios and other HIN-based prediction problems in Section 7. Our code is available at https://github.com/HKUSTKnowComp/FMG.
Notation
We denote vectors and matrices by lowercase and uppercase boldface letters, respectively. In this paper, a vector always denotes a row vector. For a vector $\mathbf{w}$, $\|\mathbf{w}\|_2$ is its $\ell_2$-norm. For a matrix $\mathbf{X}$, its nuclear norm is $\|\mathbf{X}\|_* = \sum_i \sigma_i(\mathbf{X})$, where the $\sigma_i(\mathbf{X})$'s are the singular values of $\mathbf{X}$; $\|\mathbf{X}\|_F$ is its Frobenius norm and $\|\mathbf{X}\|_1$ is its $\ell_1$-norm. For two matrices $\mathbf{X}$ and $\mathbf{Y}$, $\mathbf{X} \odot \mathbf{Y}$ denotes their element-wise multiplication. For a smooth function $f$, $\nabla f(\mathbf{x})$ is its gradient at $\mathbf{x}$.

2. “MF + FM” Framework
The proposed framework is illustrated in Figure 3. The input to the MF part is a HIN, e.g., the one in Figure 1. To address the semantic limitation issue, we design metagraphs instead of metapaths to capture the complex semantics between users and items in a HIN, e.g., those in Figures 4 and 5. Let there be $L$ metagraphs. The MF part, introduced in Section 3, computes from the metagraphs $L$ user-item similarity matrices, denoted by $\mathbf{C}_1, \ldots, \mathbf{C}_L$. Since these similarity matrices tend to be very sparse, we apply low-rank matrix approximation to factorize each similarity matrix into two low-dimensional matrices representing the latent features of users and items. The output of the MF part is thus $L$ groups of latent features for users and items. Since existing methods only compute metapath based similarities, we design a new algorithm to compute the user-item similarities from metagraphs.
The objective of the FM part is to utilize the latent features to learn a recommendation model that is more effective than previous HIN-based RSs. This addresses the similarity fusion issue. FMG (see Section 4) has two advantages over previous methods: 1) FM can capture nonlinear interactions among features (Rendle, 2012), which is more effective than the linear ensemble model adopted in previous HIN-based RSs (Yu et al., 2014); 2) by introducing group lasso regularization, we can automatically select the useful features, and in turn the useful metagraphs, for a recommendation application, avoiding laborious feature and metagraph engineering when a new HIN is encountered. Specifically, for an observed user-item pair, we first concatenate the user's and the item's latent features from all of the metagraphs to create a feature vector, using the observed rating as the label. We then train our FMG model with group lasso regularization to select the useful features in groups, where each group corresponds to one metagraph. The selected features are shown in grey in Figure 3. Finally, to efficiently train FMG, we propose two algorithms, one based on the proximal gradient algorithm (Parikh and Boyd, 2014) and the other on the stochastic variance reduced gradient algorithm (Xiao and Zhang, 2014) (see Section 4.3).
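The two-stage flow above can be sketched end to end in a few lines; the similarity matrices below are random toy stand-ins, and a truncated SVD substitutes for the MF/NNR solvers described in Section 3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in: L = 2 metagraph based user-item similarity matrices (5 users, 4 items).
sims = [rng.poisson(0.5, size=(5, 4)).astype(float) for _ in range(2)]

def factorize(S, rank=2):
    """Return user/item latent features from one similarity matrix."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :rank] * np.sqrt(s[:rank]), Vt[:rank].T * np.sqrt(s[:rank])

feats = [factorize(S) for S in sims]   # MF part: one (U, V) pair per metagraph

# FM part (input side): for one user-item pair, concatenate the user's and the
# item's latent features from all metagraphs into one feature vector.
u, i = 3, 1
x = np.concatenate([np.r_[U[u], V[i]] for U, V in feats])
```

The resulting vector has length $2 \times \text{rank} \times L$; FM then learns weights over it, group by group.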
Remark 2.1.
Random walk based graph embedding (Perozzi et al., 2014), which can be more powerful than matrix factorization, has recently been used in HINs to learn the embeddings of users and items (Dong et al., 2017). However, it is designed for metapaths and cannot be directly used for metagraphs here. Therefore, in this extended version, we still adopt MF to learn latent features from the metagraph based similarity matrices.
Remark 2.2.
The main contribution of this paper is to solve the information fusion problem in HINs with the proposed “MF + FM” framework. More importantly, the designed pipeline and methods can be applied not only to RSs, but also to other HIN-based problems, like malware detection in software systems (Hou et al., 2017; Fan et al., 2018a) or opioid user detection (Fan et al., 2018b). Throughout this paper, we also give practical suggestions on how to apply our framework to existing RSs or other HIN-based problems.
3. Matrix Factorization for Metagraph based Feature Extraction
In this section, we elaborate on the MF part for metagraph based feature extraction. First, we introduce the construction of metagraphs for RSs in HINs in Section
3.1. Then we show how to compute the user-item similarity matrices in Section 3.2. Finally, in Section 3.3, we obtain latent features from these matrices using MF-based approaches. The main novelty of our approach is the design of the MF part, which extracts and combines the latent features from each metagraph before they are fed to the FM part. Besides, as existing methods can only compute similarity matrices of metapaths, we show how the computation can be extended to metagraphs.

3.1. Construction of Metagraphs
We first give the definitions of HIN, Network Schema for HIN, and Metagraph (Sun et al., 2011; Huang et al., 2016; Fang et al., 2016). Then we introduce how to compute metagraph based similarities between users and items in a HIN.
Definition 1 (Heterogeneous Information Network).
A heterogeneous information network (HIN) is a graph $G = (V, E)$ with an entity type mapping $\phi: V \rightarrow A$ and a relation type mapping $\psi: E \rightarrow R$, where $V$ denotes the entity set, $E$ denotes the link set, $A$ denotes the entity type set, and $R$ denotes the relation type set, and the number of entity types $|A| > 1$ or the number of relation types $|R| > 1$.
Definition 2 (Network Schema).
Given a HIN $G = (V, E)$ with the entity type mapping $\phi: V \rightarrow A$ and the relation type mapping $\psi: E \rightarrow R$, the network schema for network $G$, denoted by $T_G = (A, R)$, is a graph in which nodes are entity types from $A$ and edges are relation types from $R$.
In Figures 1 and 2, we show, respectively, an example of a HIN and its network schema based on the Yelp dataset. We can see that it has different types of nodes, e.g., User, Review, Restaurant, and different types of relations, e.g., Write and CheckIn. The network schema defines the relations between node types, e.g., User CheckIn Restaurant, Restaurant LocateIn City. Thus, we can see that a HIN is a flexible way of representing various information in a unified manner. The definition of metagraph is given below.
Definition 3 (Metagraph).
A metagraph $M$ is a directed acyclic graph (DAG) with a single source node $n_s$ (i.e., with in-degree 0) and a single sink (target) node $n_t$ (i.e., with out-degree 0), defined on a HIN $G = (V, E)$. Formally, $M = (V_M, E_M)$, where $V_M$ and $E_M$ are constrained by $A$ and $R$, respectively.
As introduced above (Fang et al., 2016; Huang et al., 2016), compared to a metapath, a metagraph can capture more complex semantics underlying the similarities between users and items. In fact, a metapath is a special case of a metagraph. Thus, in this paper, we introduce the concept of metagraph to HIN-based RSs. In Figures 4 and 5, we show the metagraphs used in our experiments on the Yelp and Amazon datasets, respectively. In these figures, a superscript $-1$ denotes the reverse of a relation, e.g., the reverse of the CheckIn relation between a user and a business in Figure 4. From Figures 4 and 5, we can see that each metagraph has only one source node and one target node, representing a user and an item in the recommendation scenario.
Since there can be many metagraphs in a HIN and they are not all equally effective, we give three guidelines for the selection of metagraphs. 1) All metagraphs should be designed from the network schema. 2) Domain knowledge is helpful in selecting good metagraphs, because some metagraphs correspond to traditional recommendation strategies that have been proven to be effective (Yu et al., 2014; Shi et al., 2015). For example, two of the metagraphs in Figure 4 represent, respectively, social recommendation and the well-known user-based CF. In practice, an understanding of existing recommendation strategies and application semantics is essential to the design of good metagraphs. 3) It is better to construct shorter metagraphs. In (Sun et al., 2011), the authors have shown that longer metapaths tend to decrease the performance because they introduce noise.
3.2. Computation of Similarity Matrices
We use two metagraphs on Yelp to illustrate the computation of metagraph based similarities. In previous work, commuting matrices (Sun et al., 2011) have been employed to compute the count-based similarity matrix of a metapath. Suppose we have a metapath $P = (A_1, A_2, \ldots, A_l)$, where the $A_i$'s are node types, and denote the adjacency matrix between type $A_i$ and type $A_{i+1}$ by $\mathbf{W}_{A_i A_{i+1}}$. Then the commuting matrix for $P$ is defined by the multiplication of a sequence of adjacency matrices:

$$\mathbf{C}_P = \mathbf{W}_{A_1 A_2} \mathbf{W}_{A_2 A_3} \cdots \mathbf{W}_{A_{l-1} A_l},$$
where $[\mathbf{C}_P]_{ij}$, the entry in the $i$-th row and $j$-th column, represents the number of path instances between object $a_i \in A_1$ and object $a_j \in A_l$ under $P$. For example, for the metapath $(U, B, U, B)$ in Figure 4, $\mathbf{C}_P = \mathbf{W}_{UB} \mathbf{W}_{UB}^\top \mathbf{W}_{UB}$, where $\mathbf{W}_{UB}$ is the adjacency matrix between type $U$ and type $B$, and $[\mathbf{C}_P]_{ij}$ represents the number of instances of the metapath between user $u_i$ and business $b_j$. In this paper, for a metagraph $M$, the similarity between a source object and a target object is defined as the number of instances of $M$ connecting them. In the remainder of this paper, we adopt the term similarity matrix instead of commuting matrix for clarity.
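As a minimal illustration of a commuting-matrix computation (the tiny adjacency matrices below are invented toy data):

```python
import numpy as np

# Toy HIN adjacency matrices: 2 users, 3 reviews, 2 businesses.
W_UR = np.array([[1, 1, 0],      # user wrote review
                 [0, 0, 1]])
W_RB = np.array([[1, 0],         # review rates business
                 [0, 1],
                 [1, 0]])

# Commuting matrix for the metapath (U, R, B): entry (i, j) counts the
# path instances from user i to business j.
C_URB = W_UR @ W_RB
```

In practice these matrices are extremely sparse, so sparse matrix products make the same computation scale to real datasets.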
From the above introduction, we can see that a metapath based similarity matrix is easy to compute. However, for metagraphs, the problem is more complicated. For example, consider the metagraph in Figure 4 with two parallel parts: there are two ways to pass through it, namely $(U, R, A, R, U, B)$ and $(U, R, B, R, U, B)$. Note that $R$ represents the entity type Review in the HIN. In the first path, $(R, A, R)$ means that if two reviews mention the same $A$ (Aspect), they have some similarity. Similarly, in the second, $(R, B, R)$ means that if two reviews rate the same $B$ (Business), they have some similarity too. We should decide how similarity is defined when there are multiple ways to pass through the metagraph from the source node to the target node. We can require a flow to pass through either path or both paths in order to be counted in the similarity computation. The former strategy is equivalent to simply splitting a metagraph into multiple metapaths, thus suffering from information loss. Thus, we adopt the latter, but it requires one more matrix operation in addition to multiplication, i.e., the element-wise (Hadamard) product. Algorithm 1 depicts the algorithm for computing the count-based similarity of this metagraph. After obtaining the similarity of the two parallel parts, we can get the whole similarity matrix by multiplying the remaining sequence of adjacency matrices along the metagraph. In practice, not limited to this example, any metagraph defined in this paper can be computed by two operations (Hadamard product and multiplication) on the corresponding matrices.
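A sketch of this two-operation recipe on invented toy matrices, where two reviews count as similar only if they share both the aspect and the business:

```python
import numpy as np

# Toy adjacency matrices: 2 users, 3 reviews, 2 aspects, 2 businesses.
W_UR = np.array([[1, 0, 0],      # user wrote review
                 [0, 1, 1]])
W_RA = np.array([[1, 0],         # review mentions aspect
                 [1, 0],
                 [0, 1]])
W_RB = np.array([[1, 0],         # review rates business
                 [1, 0],
                 [0, 1]])
W_UB = np.array([[1, 0],         # user rated business
                 [0, 1]])

# Review-review similarity requiring BOTH the same aspect AND the same
# business: Hadamard product of the two path-based review-review counts.
S_RR = (W_RA @ W_RA.T) * (W_RB @ W_RB.T)

# Multiply the remaining sequence of matrices along the metagraph.
C = W_UR @ S_RR @ W_UR.T @ W_UB
```

Dropping the Hadamard product and summing the two path counts instead would be the "either path" strategy, i.e., splitting the metagraph into two metapaths.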
By computing the similarities between all users and items for the $l$-th metagraph $M_l$, we can obtain a user-item similarity matrix $\mathbf{C}_l \in \mathbb{R}^{m \times n}$, where $[\mathbf{C}_l]_{ij}$ represents the similarity between user $u_i$ and item $b_j$ along the metagraph, and $m$ and $n$ are the numbers of users and items, respectively. Note that $[\mathbf{C}_l]_{ij}$ is the number of instances of $M_l$ connecting $u_i$ and $b_j$, and 0 when no instance connects them. By designing $L$ metagraphs, we can get $L$ different user-item similarity matrices, denoted by $\mathbf{C}_1, \ldots, \mathbf{C}_L$.
3.3. Latent Feature Generation
In this part, we elaborate on how to generate latent features for users and items from the user-item similarity matrices. Since the similarity matrices are usually very sparse, using them directly as features would lead to a high-dimensional learning problem and thus to overfitting. Motivated by the recent success of low-rank matrix completion for RSs (Mnih and Salakhutdinov, 2007; Koren, 2008; Candès and Recht, 2009), we propose to generate latent features using matrix completion methods.
Specifically, the nonzero elements in a similarity matrix are treated as observations and the others are taken as missing values. Then we find a low-rank approximation to this matrix. Matrix factorization (MF) (Koren, 2008; Mnih and Salakhutdinov, 2007) and nuclear norm regularization (NNR) (Candès and Recht, 2009) are two popular approaches for matrix completion. Generally, MF leads to nonconvex optimization problems, while NNR leads to convex ones. NNR is easier to optimize and has better theoretical guarantees on the recovery performance than MF. Empirically, NNR usually has better performance, and the recovered rank is often much higher than that of MF (Yao and Kwok, 2015). In this paper, we generate metagraph based latent features with both methods and conduct experiments to compare their performance (shown in Section 5.6). The technical details of the two methods are introduced in the remainder of this section.
3.3.1. Matrix Factorization
Consider a user-item similarity matrix $\mathbf{C} \in \mathbb{R}^{m \times n}$, and let the observed positions be indicated by 1's in $\mathbf{\Omega} \in \{0, 1\}^{m \times n}$, i.e., $[\mathbf{\Omega}]_{ij} = 1$ if $[\mathbf{C}]_{ij} > 0$ and 0 otherwise. $\mathbf{C}$ is factorized as a product of $\mathbf{U} \in \mathbb{R}^{m \times F}$ and $\mathbf{V} \in \mathbb{R}^{n \times F}$ by solving the following optimization problem:
$$\min_{\mathbf{U}, \mathbf{V}} \; \frac{1}{2} \left\| \mathbf{\Omega} \odot \left( \mathbf{U} \mathbf{V}^\top - \mathbf{C} \right) \right\|_F^2 + \frac{\mu}{2} \left( \|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2 \right), \qquad (1)$$
where $F$ is the desired rank of $\mathbf{U} \mathbf{V}^\top$, and $\mu$ is the hyperparameter controlling regularization.
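A minimal gradient-descent sketch of problem (1); the solver, step size, and toy matrix below are ours and stand in for whatever optimizer an actual implementation would use:

```python
import numpy as np

def mf_latent_features(C, rank=2, mu=0.1, lr=0.01, epochs=2000, seed=0):
    """Fit C ~ U @ V.T on the observed (nonzero) entries of C, per problem (1)."""
    rng = np.random.default_rng(seed)
    m, n = C.shape
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    omega = (C != 0).astype(float)          # indicator of observed positions
    for _ in range(epochs):
        E = omega * (U @ V.T - C)           # residual on observed entries only
        U, V = U - lr * (E @ V + mu * U), V - lr * (E.T @ U + mu * V)
    return U, V

C = np.array([[5., 0., 3.],
              [4., 1., 0.]])
U, V = mf_latent_features(C)
```

The rows of `U` and `V` are the user and item latent features fed to the FM part.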
3.3.2. Nuclear Norm Regularization
Although MF is simple, (1) is not a convex optimization problem, so there is no rigorous guarantee on the recovery performance. This motivates our adoption of the nuclear norm, which is defined as the sum of all singular values of a matrix and is the tightest convex envelope of the rank function. This leads to the following nuclear norm regularization (NNR) problem:
$$\min_{\mathbf{X}} \; \frac{1}{2} \left\| \mathbf{\Omega} \odot \left( \mathbf{X} - \mathbf{C} \right) \right\|_F^2 + \mu \left\| \mathbf{X} \right\|_*, \qquad (2)$$
where $\mathbf{X}$ is the low-rank matrix to be recovered. Nice theoretical guarantees have been developed for (2), showing that $\mathbf{X}$ can be exactly recovered given sufficient observations (Candès and Recht, 2009). These advantages make NNR popular for low-rank matrix approximation (Candès and Recht, 2009). Thus, we also adopt (2) to generate latent features, using the state-of-the-art AIS-Impute algorithm (Yao and Kwok, 2015) to optimize (2). It has a fast $O(1/T^2)$ convergence rate, where $T$ is the number of iterations, with low per-iteration time complexity. In the iterations, an SVD decomposition $\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$ is maintained ($\boldsymbol{\Sigma}$ only contains the nonzero singular values). When the algorithm stops, we take $\mathbf{U} \boldsymbol{\Sigma}^{1/2}$ and $\mathbf{V} \boldsymbol{\Sigma}^{1/2}$ as the user and item latent features, respectively.

3.4. Complexity Analysis
We analyze the time complexity of the MF part, which includes similarity matrix computation and latent feature generation. For similarity matrix computation, the core part is matrix multiplication. Because the adjacency matrices tend to be very sparse, they can be implemented very efficiently as sparse matrices. Moreover, for MF and NNR, according to (Yao and Kwok, 2015; Yao et al., 2018), the computation cost in each iteration is $O(\|\mathbf{\Omega}\|_1 F)$ and $O(\|\mathbf{\Omega}\|_1 F + (m + n) F^2)$, respectively, where $\|\mathbf{\Omega}\|_1$ is the number of nonzero elements in the similarity matrix, $m$ and $n$ are the dimensions of the similarity matrix, and $F$ is the rank used to factorize the similarity matrix.
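To make the NNR route of Section 3.3.2 concrete, here is a minimal soft-impute-style iteration, a didactic stand-in for AIS-Impute (the shrinkage value and toy matrix are invented, and the toy matrix happens to be fully observed):

```python
import numpy as np

def soft_impute(C, mu=0.1, iters=50):
    """Approximate problem (2): iteratively fill missing entries and
    soft-threshold the singular values by mu."""
    mask = (C != 0)
    X = np.zeros_like(C)
    for _ in range(iters):
        Z = np.where(mask, C, X)                 # keep observed, impute the rest
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - mu, 0.0)              # nuclear-norm proximal step
        X = (U * s) @ Vt
    k = max(int((s > 0).sum()), 1)               # recovered rank
    # Split X's SVD into user and item latent features, as described above.
    return X, U[:, :k] * np.sqrt(s[:k]), Vt[:k].T * np.sqrt(s[:k])

C = np.outer([1., 2.], [1., 2., 3.])             # rank-1 toy similarity matrix
X, U_feat, V_feat = soft_impute(C)
```

AIS-Impute accelerates exactly this kind of iteration by exploiting the sparse-plus-low-rank structure of `Z`, rather than recomputing a dense SVD.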
4. Factorization Machine for Fusing Metagraph based Features
In this section, we describe the FM part for fusing multiple groups of metagraph based latent features. We first show in Section 4.1 how FM makes predictions based on the metagraph based latent features. Then we introduce two regularization terms in Section 4.2, which achieve automatic metagraph selection. In Section 4.3, we present the objective function and propose two optimization methods for it. Note that existing HIN-based RS methods (Yu et al., 2014; Shi et al., 2015) only use a linear combination of different metapath based features and hence ignore the interactions among features. To overcome this limitation, we apply FM to capture the nonlinear (i.e., second-order) interactions among metagraph based latent features when fusing various side information in a HIN.
4.1. Combining Latent Features with FM
In this section, we introduce our FM-based algorithm for fusing different groups of latent features. As described in Section 3.3, we obtain $L$ groups of latent features of users and items, denoted by $\mathbf{U}^{(1)}, \mathbf{V}^{(1)}, \ldots, \mathbf{U}^{(L)}, \mathbf{V}^{(L)}$, from the $L$ metagraph based user-item similarity matrices. For the $n$-th sample in the observed ratings, i.e., a pair consisting of user $u_i$ and item $b_j$, we concatenate all of the corresponding user and item features from the $L$ metagraphs:
$$\mathbf{x}^n = \big[ \mathbf{u}_i^{(1)}, \ldots, \mathbf{u}_i^{(L)}, \; \mathbf{v}_j^{(1)}, \ldots, \mathbf{v}_j^{(L)} \big], \qquad (3)$$
where $\mathbf{x}^n \in \mathbb{R}^d$ with $d = 2 \sum_{l=1}^{L} F_l$, and $F_l$ is the rank of the factorization of the similarity matrix for the $l$-th metagraph obtained with (1) or (2). $\mathbf{u}_i^{(l)}$ and $\mathbf{v}_j^{(l)}$, respectively, represent the user and item latent features generated from the $l$-th metagraph, and $\mathbf{x}^n$ is a $d$-dimensional vector representing the feature vector of the $n$-th sample after concatenation.
Given all of the features in (3), the predicted rating for the $n$-th sample based on FM (Rendle, 2012) is computed as follows:
$$\hat{y}^n(\mathbf{x}^n) = b + \sum_{i=1}^{d} w_i x_i^n + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i^n x_j^n, \qquad (4)$$
where $b$ is the global bias, and $\mathbf{w} \in \mathbb{R}^d$ represents the first-order weights of the features. $\mathbf{V} \in \mathbb{R}^{d \times K}$ represents the second-order weights for modeling the interactions among the features, and $\mathbf{v}_i$ is the $i$-th row of the matrix $\mathbf{V}$, which describes the $i$-th variable with $K$ factors. $x_i^n$ is the $i$-th feature in $\mathbf{x}^n$. The parameters can be learned by minimizing the mean square loss:
$$\min_{b, \mathbf{w}, \mathbf{V}} \; \frac{1}{N} \sum_{n=1}^{N} \left( y^n - \hat{y}^n(\mathbf{x}^n) \right)^2, \qquad (5)$$
where $y^n$ is the observed rating of the $n$-th sample, and $N$ is the number of all observed ratings.
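The pairwise term of (4) can be evaluated in O(dK) time via Rendle's reformulation; this sketch (with random toy parameters) checks it against the naive double loop:

```python
import numpy as np

def fm_predict(x, b, w, V):
    """FM prediction (4): bias + linear part + pairwise interactions, using
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * (||x @ V||^2 - sum_i x_i^2 ||v_i||^2)."""
    xV = x @ V
    pairwise = 0.5 * (xV @ xV - ((x ** 2) * (V ** 2).sum(axis=1)).sum())
    return b + w @ x + pairwise

rng = np.random.default_rng(0)
d, K = 6, 3
x, w = rng.standard_normal(d), rng.standard_normal(d)
V, b = rng.standard_normal((d, K)), 0.5

# Naive O(d^2 K) evaluation of (4), for comparison.
naive = b + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                        for i in range(d) for j in range(i + 1, d))
```

With metagraph based features, `x` is the concatenated vector from (3), so the linear time in `d` matters once many metagraphs are used.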
4.2. Metagraph Selection with Group Lasso
There are two problems with applying FM to metagraph based latent features. The first is noise: when there are too many metagraphs, useless ones impair the predictive capability of FM, since the strategies some metagraphs represent may be ineffective in practice. The second is computational cost. All of the features are generated by MF, which means that the design matrix (i.e., the features fed to FM) is dense. This increases the computational cost of learning the model parameters and of online recommendation. To alleviate these two problems, we propose two novel regularizers to automatically select useful metagraphs from the data. They fall into convex and nonconvex categories, and either of them enables our model to automatically select useful metagraphs during the training process.
4.2.1. Convex Regularization
The convex regularizer is the $\ell_{2,1}$-norm regularization, i.e., group lasso regularization (Jacob et al., 2009), which is a feature selection method on groups of variables. Given $G$ predefined non-overlapping groups on the parameter $\mathbf{p}$, the regularization is defined as follows:

$$\phi(\mathbf{p}) = \sum_{g=1}^{G} \left\| \mathbf{p}_g \right\|_2, \qquad (6)$$
where $\|\cdot\|_2$ is the $\ell_2$-norm. In our model, the groups correspond to the metagraph based features. For example, $\mathbf{U}^{(l)}$ and $\mathbf{V}^{(l)}$ are the user and item latent features generated from the $l$-th metagraph. For a pair of user $u_i$ and item $b_j$, the latent features are $\mathbf{u}_i^{(l)}$ and $\mathbf{v}_j^{(l)}$, and there are two corresponding groups of variables in $\mathbf{w}$ and $\mathbf{V}$ according to (4). Thus, with $L$ metagraphs, $\mathbf{w}$ and $\mathbf{V}$ each have $2L$ groups of variables.
For the first-order parameters $\mathbf{w}$ in (4), group lasso is applied to each group of variables in $\mathbf{w}$. Then we have:

$$\phi_w(\mathbf{w}) = \sum_{g=1}^{2L} \left\| \mathbf{w}_g \right\|_2, \qquad (7)$$
where $\mathbf{w}_g$ models the weights for one group of user or item features from one metagraph. For the second-order parameters $\mathbf{V}$ in (4), we have the regularizer:

$$\phi_V(\mathbf{V}) = \sum_{g=1}^{2L} \left\| \mathbf{V}_g \right\|_F, \qquad (8)$$
where $\mathbf{V}_g$, the $g$-th block of $\mathbf{V}$, corresponds to the $g$-th group of metagraph based features in a sample, and $\|\cdot\|_F$ is the Frobenius norm.
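The proximal operator behind (7) and (8) shrinks each group's norm and can zero out a whole group at once, which is exactly what deselects a metagraph; a sketch with an invented group layout:

```python
import numpy as np

def prox_group_lasso(p, groups, tau):
    """Prox of tau * sum_g ||p_g||_2: scale each group by max(0, 1 - tau/||p_g||)."""
    out = p.copy()
    for idx in groups:                       # idx: indices of one group's entries
        norm = np.linalg.norm(p[idx])
        out[idx] = 0.0 if norm <= tau else (1.0 - tau / norm) * p[idx]
    return out

# Two groups (e.g., features from two metagraphs): a strong one and a weak one.
w = np.array([3.0, 4.0, 0.1, 0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
w_new = prox_group_lasso(w, groups, tau=1.0)
```

The weak group is zeroed in its entirety, while the strong group is only shrunk; the same operator applies block-wise to $\mathbf{V}$ with the Frobenius norm in place of the vector norm.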
4.2.2. Nonconvex Regularization
While convex regularizers usually make optimization easy, they often lead to biased estimation. For example, in sparse coding, the solution obtained with the $\ell_1$ regularizer is often not as sparse and accurate as that obtained with the capped-$\ell_1$ penalty (Zhang, 2010). Besides, in low-rank matrix learning, the estimated rank obtained with the nuclear norm regularizer is often very high (Yao et al., 2018). To alleviate these problems, a number of nonconvex regularizers, which are variants of the convex $\ell_1$-norm, have recently been proposed (Yao and Kwok, 2016; Yao et al., 2018). Empirically, these nonconvex regularizers usually outperform the convex ones. Motivated by the above observations, we propose to use nonconvex variants of (7) and (8) as follows:

$$\phi_\kappa(\mathbf{w}, \mathbf{V}) = \sum_{g=1}^{2L} \kappa\left( \|\mathbf{w}_g\|_2 \right) + \sum_{g=1}^{2L} \kappa\left( \|\mathbf{V}_g\|_F \right), \qquad (9)$$
where $\kappa$ is a nonconvex penalty function. We choose $\kappa$ to be the log-sum penalty (LSP) (Candès et al., 2008), $\kappa(\alpha) = \log(1 + \alpha / \theta)$ with $\theta > 0$, as it has been shown to give the best empirical performance for learning sparse vectors (Yao and Kwok, 2016) and low-rank matrices (Yao et al., 2018).
4.2.3. Comparison with Existing Methods
Yu et al. studied recommendation techniques based on HINs (Yu et al., 2014); they applied matrix factorization to generate latent features from metapaths and predicted the rating with a weighted ensemble of the dot products of the user and item latent features from every single metapath: $\hat{y}(u_i, b_j) = \sum_{l=1}^{L} \theta_l \, \mathbf{u}_i^{(l)} \big( \mathbf{v}_j^{(l)} \big)^\top$, where $\hat{y}(u_i, b_j)$ is the predicted rating for user $u_i$ and item $b_j$, $\mathbf{u}_i^{(l)}$ and $\mathbf{v}_j^{(l)}$ are the latent features of $u_i$ and $b_j$ from the $l$-th metapath, $L$ is the number of metapaths used, and $\theta_l$ is the weight for the $l$-th metapath's latent features. However, this prediction method is inadequate, as it fails to capture the interactions between features across different metapaths and between features within the same metapath, degrading the prediction performance. In addition, previous works on FM (Rendle, 2012; Hong et al., 2013; Yan et al., 2014) only considered selecting a single row or column of the second-order weight matrix, while our regularizer selects a block of rows or columns (defined by metagraphs). Moreover, we are the first to adopt nonconvex regularization, i.e., (9), for weight selection in FM.
4.3. Model Optimization
Combining (5) and (9), we define our FM with Group lasso (FMG) model with the following objective function:

$$\min_{b, \mathbf{w}, \mathbf{V}} \; h(b, \mathbf{w}, \mathbf{V}) = \frac{1}{N} \sum_{n=1}^{N} \left( y^n - \hat{y}^n(\mathbf{x}^n) \right)^2 + \lambda \, \phi_\kappa(\mathbf{w}, \mathbf{V}). \qquad (10)$$
Note that when $\kappa(\alpha) = \alpha$ in (9), we get back (7) and (8). Thus, we directly use the nonconvex regularization in (10).
We can see that $h$ is nonsmooth due to the use of $\phi_\kappa$, and nonconvex due to the nonconvexity of both the loss and $\kappa$. To alleviate the difficulty of optimization, inspired by (Yao and Kwok, 2016), we propose to reformulate (10) as follows:
$$\min_{b, \mathbf{w}, \mathbf{V}} \; \bar{h}(b, \mathbf{w}, \mathbf{V}) = \bar{\ell}(b, \mathbf{w}, \mathbf{V}) + \lambda \kappa_0 \Big( \sum_{g=1}^{2L} \|\mathbf{w}_g\|_2 + \sum_{g=1}^{2L} \|\mathbf{V}_g\|_F \Big), \qquad (11)$$

where $\kappa_0 = \kappa'(0)$, and

$$\bar{\ell}(b, \mathbf{w}, \mathbf{V}) = \frac{1}{N} \sum_{n=1}^{N} \left( y^n - \hat{y}^n(\mathbf{x}^n) \right)^2 + \lambda \sum_{g=1}^{2L} \big( \kappa(\|\mathbf{w}_g\|_2) - \kappa_0 \|\mathbf{w}_g\|_2 \big) + \lambda \sum_{g=1}^{2L} \big( \kappa(\|\mathbf{V}_g\|_F) - \kappa_0 \|\mathbf{V}_g\|_F \big)$$

is the augmented loss.
Note that (11) is equivalent to (10) based on Proposition 2.1 in (Yao and Kwok, 2016). A very important property of the augmented loss $\bar{\ell}$ is that it is still smooth. As a result, while we are still optimizing a nonconvex regularized problem, we only need to deal with convex regularizers.
In Section 4.3.1, we show how the reformulated problem can be solved by the state-of-the-art proximal gradient algorithm (Li and Lin, 2015); moreover, the transformation enables us to design a more efficient optimization algorithm with a convergence guarantee based on variance reduction methods (Xiao and Zhang, 2014). Finally, the time complexity of the proposed algorithms is analyzed in Section 4.3.3.
4.3.1. Nonmonotonous Accelerated Proximal Gradient (nmAPG) Algorithm
To tackle the nonconvex nonsmooth objective (11), we propose to adopt the proximal gradient (PG) algorithm (Parikh and Boyd, 2014) and, specifically, the state-of-the-art nonmonotonous accelerated proximal gradient (nmAPG) algorithm (Li and Lin, 2015). It targets optimization problems of the form:

$$\min_{\mathbf{x}} \; F(\mathbf{x}) = f(\mathbf{x}) + g(\mathbf{x}), \qquad (12)$$

where $f$ is a smooth (possibly nonconvex) loss function and $g$ is a regularizer (which can be nonsmooth and nonconvex). To guarantee the convergence of nmAPG, we also need $\inf F > -\infty$, $\lim_{\|\mathbf{x}\| \to \infty} F(\mathbf{x}) = \infty$, and the existence of at least one solution to the proximal step, i.e., $\text{prox}_{\alpha g}(\mathbf{x}) = \arg\min_{\mathbf{u}} \frac{1}{2} \|\mathbf{u} - \mathbf{x}\|_2^2 + \alpha g(\mathbf{u})$, where $\alpha$ is a scalar step size (Li and Lin, 2015).

The motivation of nmAPG is twofold. First, the nonsmoothness comes from the proposed regularizers, which can be handled efficiently if the corresponding proximal steps have cheap closed-form solutions. Second, acceleration is useful for significantly speeding up first-order optimization algorithms (Yao and Kwok, 2016; Li and Lin, 2015; Yao et al., 2017), and nmAPG is the state-of-the-art accelerated algorithm that can deal with general nonconvex problems with a sound convergence guarantee. The whole procedure is given in Algorithm 2. Note that while both regularizers in (11) are nonsmooth, they are imposed on $\mathbf{w}$ and $\mathbf{V}$ separately. Thus, we can compute the proximal operators for the two regularizers independently, following (Parikh and Boyd, 2014):
$$\text{prox}_{\alpha \lambda \kappa_0 \phi_w}(\mathbf{w}), \qquad \text{prox}_{\alpha \lambda \kappa_0 \phi_V}(\mathbf{V}). \qquad (13)$$

These are performed in steps 5 and 10 of Algorithm 2. The closed-form solutions of the proximal operators can be obtained easily from Lemma 1 below. Thus, each proximal operator can be solved in one pass over all groups.
4.3.2. Stochastic Variance Reduced Gradient (SVRG) Algorithm
While nmAPG is an efficient algorithm for (11), it is still a batch-gradient based method, which may not be efficient when the sample size is large. In this case, the stochastic gradient descent (SGD) (Bertsekas, 1999) algorithm is preferred, as it updates the learning parameters incrementally. However, the gradient in SGD is very noisy. To ensure the convergence of SGD, a decreasing step size must be used, possibly making it even slower than batch-gradient methods.

Recently, the stochastic variance reduced gradient (SVRG) (Xiao and Zhang, 2014) algorithm has been developed. It avoids the diminishing step size by introducing variance reduction techniques into the gradient updates. As a result, it combines the best of both worlds, i.e., incremental updates of the learning parameters with a non-diminishing step size, and achieves significantly faster convergence than SGD. Besides, it has been extended to problems of the form (12) with nonconvex objectives (Reddi et al., 2016; Allen-Zhu and Hazan, 2016). This allows the loss function to be smooth (and possibly nonconvex), but the regularizer still needs to be convex. Thus, instead of working on the original problem (10), we work on the transformed problem (11).
To use SVRG, we first define an augmented loss for the i-th sample so that the smooth part of the objective in (11) is the average of these per-sample losses. The whole procedure is depicted in Algorithm 3. A full gradient is computed in step 4, a mini-batch is constructed in step 6, and the variance reduced gradient is computed in step 7. Finally, the proximal steps can be executed separately based on (13) in step 8. As mentioned above, the nonconvex variant of SVRG (Reddi et al., 2016; Allen-Zhu and Hazan, 2016) cannot be directly applied to (10). Instead, we apply it to the transformed problem (11), where the regularizer becomes convex and the augmented loss is still smooth. Thus, Algorithm 3 is guaranteed to generate a critical point of (11).
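As a rough illustration of the full-gradient snapshot and variance reduced update described above, the following sketch implements a generic proximal SVRG loop. It is a simplified stand-in, not Algorithm 3 itself: a plain least-squares loss and a quadratic regularizer replace the FMG objective, the per-sample mini-batch has size one, and the names `grad_i`, `prox`, and the step size are our own illustrative choices.

```python
import numpy as np

def prox_svrg(grad_i, prox, x0, n, step, epochs=30, inner=None, rng=None):
    """Proximal SVRG sketch for min_x (1/n) sum_i f_i(x) + g(x).

    grad_i(x, i): gradient of the i-th sample loss at x.
    prox(x, step): proximal step for the regularizer g.
    The variance reduced gradient is grad_i(x) - grad_i(snapshot) + full_grad,
    which allows a constant (non-diminishing) step size.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    inner = inner if inner is not None else n   # epoch length, usually of order n
    x = x0.copy()
    for _ in range(epochs):
        snapshot = x.copy()
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x = prox(x - step * v, step)
    return x

# Toy usage: ridge-regularized least squares, f_i = 0.5*(a_i^T x - b_i)^2,
# g(x) = 0.05*||x||^2, whose prox is a simple rescaling.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
prox = lambda x, s: x / (1.0 + 0.1 * s)
x_hat = prox_svrg(grad_i, prox, np.zeros(5), n=50, step=0.02)
```

On this strongly convex toy problem the iterate approaches the closed-form ridge solution; in the paper's nonconvex setting only convergence to a critical point is guaranteed.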
4.3.3. Complexity Analysis
For nmAPG in Algorithm 2, the main computation cost per iteration is incurred in performing the proximal steps (steps 5 and 10) and in evaluating the function value (steps 7 and 11). For SVRG in Algorithm 3, the full gradient is computed in step 5, and steps 6-10 then perform the mini-batch updates. Usually, the number of inner iterations shares the same order as the number of training samples (Xiao and Zhang, 2014; Reddi et al., 2016; Allen-Zhu and Hazan, 2016), and we follow this setting in our experiments. As a result, one iteration of SVRG takes more time than one iteration of nmAPG. However, due to its stochastic updates, SVRG empirically converges much faster, as shown in Section 5.8.
5. Experiments
In this section, we conduct extensive experiments to demonstrate the effectiveness of the proposed framework. We first introduce the datasets, evaluation metric, and experimental settings in Section 5.1. In Section 5.2, we report the recommendation performance of the proposed framework compared to several state-of-the-art recommendation methods, including MF-based and HIN-based methods. We analyze the influence of the parameter λ, which controls the weight of the convex regularization term, in Section 5.3, and the influence of the nonconvex regularization term in Section 5.4. To further understand the impact of metagraphs on performance, we discuss the performance of each single metagraph in Section 5.5. In Section 5.6, we compare the performance of NNR and MF in extracting the features. In Section 5.7, we show the influence of the rank K in FMG. Finally, the two optimization algorithms described in Section 4.3 are compared in Section 5.8, and their scalability is studied in Section 5.9.

5.1. Setup
To demonstrate the effectiveness of HIN for recommendation, we mainly conduct experiments on two datasets with rich side information. The first is Yelp, provided for the Yelp challenge.^{3}^{3}3https://www.yelp.com/dataset_challenge Yelp is a website where a user can rate local businesses or post photos and reviews about them. The ratings fall in the range of 1 to 5, where higher ratings mean users like the businesses and lower ratings mean they dislike them. Based on the collected information, the website can recommend businesses according to the users' preferences. The second dataset is Amazon Electronics,^{4}^{4}4http://jmcauley.ucsd.edu/data/amazon/ provided in (He and McAuley, 2016). Amazon relies heavily on RSs to present interesting items to its users. Many domains of the Amazon dataset are provided in (He and McAuley, 2016), and we choose the electronics domain for our experiments. We extract subsets of entities from Yelp and Amazon to build HINs that include diverse types and relations. The subsets of the two datasets each include around 200,000 ratings in the user-item rating matrices. Thus, we denote them as Yelp200K and Amazon200K, respectively. Besides, to better compare our framework with existing HIN-based methods, we also use the datasets provided in (Shi et al., 2015), which we denote as CIKM-Yelp and CIKM-Douban. Note that all four datasets are used to compare the recommendation performance of different methods, as shown in Section 5.2. To evaluate other aspects of our model, we conduct experiments only on the first two datasets, i.e., Yelp200K and Amazon200K.
The statistics of our datasets are shown in Table 1. For detailed information on CIKM-Yelp and CIKM-Douban, we refer the readers to (Shi et al., 2015). Note that i) the numbers of types and relations in the first two datasets, i.e., Amazon200K and Yelp200K, are much larger than those in previous works (Yu et al., 2013, 2014; Shi et al., 2015); and ii) the densities of the rating matrices of the four datasets, shown in Table 2, are much smaller than those in previous works (Yu et al., 2013, 2014; Shi et al., 2015). Thus, our datasets are more challenging than those of the previous works.
Table 1. Statistics of the Amazon200K and Yelp200K datasets.

Dataset     Relation (A-B)      #A       #B       #(A-B)   Avg. degrees of A/B
Amazon200K  User-Review         59,297   183,807  183,807  3.1/1
            Business-Category   20,216   682      87,587   4.3/128.4
            Business-Brand      9,533    2,015    9,533    1/4.7
            Review-Business     183,807  20,216   183,807  1/9.1
            Review-Aspect       183,807  10       796,392  4.3/79,639.2
Yelp200K    User-Business       36,105   22,496   191,506  5.3/8.5
            User-Review         36,105   191,506  191,506  5.3/1
            User-User           17,065   17,065   140,344  8.2/8.2
            Business-Category   22,496   869      67,940   3/78.2
            Business-Star       22,496   9        22,496   1/2,499.6
            Business-State      22,496   18       22,496   1/1,249.8
            Business-City       22,496   215      22,496   1/104.6
            Review-Business     191,506  22,496   191,506  1/8.5
            Review-Aspect       191,506  10       955,041  5/95,504.1
Table 2. Densities of the rating matrices of the four datasets.

          Amazon200K  Yelp200K  CIKM-Yelp  CIKM-Douban
Density   0.015%      0.024%    0.086%     0.630%
To evaluate the recommendation performance, we adopt the root-mean-square error (RMSE) as our metric, which is the most popular one for rating prediction in the literature (Koren, 2008; Ma et al., 2011; Mnih and Salakhutdinov, 2007). It is defined as

RMSE = sqrt( (1/|R_test|) Σ_{(i,j) ∈ R_test} (R_ij − R̂_ij)² ),

where R_test is the set of all test samples, R̂_ij is the predicted rating for user i and item j, and R_ij is the corresponding observed rating in the test set. A smaller RMSE value means better performance.
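As a minimal sketch, the RMSE defined above can be computed over a test set as follows (the toy ratings are illustrative, not from the datasets):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy usage: one prediction off by 1.0 and one exact give RMSE sqrt(0.5).
print(rmse([4.0, 5.0], [3.0, 5.0]))  # -> 0.7071067811865476
```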
We compare the following baseline models to our approaches.


FMR (Rendle, 2012): The factorization machine with only the user-item rating matrix. We adopt the method in Section 4.1.1 of (Rendle, 2012) to model the rating prediction task. We use the code provided by the authors.^{5}^{5}5http://www.libfm.org/

HeteRec (Yu et al., 2014): It is based on metapath based similarity between users and items. A weighted ensemble model is learned from the latent features of users and items generated by applying matrix factorization to the similarity matrices of different metapaths. We implemented it based on (Yu et al., 2014).

SemRec (Shi et al., 2015): A metapath based recommendation technique on a weighted HIN, which is built by connecting users and items with the same ratings. Different models are learned from different metapaths, and a weighted ensemble method is used to predict the users' ratings. We use the code provided by the authors.^{6}^{6}6https://github.com/zzqsmall/SemRec

FMG(LSP): Same as FMG, except that the nonconvex group regularizer in (9) is used in place of the convex group lasso regularizer.
Note that it is reported in (Shi et al., 2015) that SemRec outperforms the method in (Yu et al., 2013), which uses metapath based similarities as regularization terms in matrix factorization. Thus, we do not include (Yu et al., 2013) in the comparison. All experiments were run on a server (OS: CentOS release 6.9; CPU: Intel i7, 3.4GHz; RAM: 32GB).
On Amazon200K and Yelp200K, we use the metagraphs in Figures 5 and 4 for HeteRec, SemRec, FMG, and FMG(LSP), while on CIKM-Yelp and CIKM-Douban, we use the metapaths provided in (Shi et al., 2015) for these four methods. To get the aspects (e.g., those in Figures 4 and 5) from review texts, we use the topic modeling software Gensim (Řehůřek and Sojka, 2010) to extract topics from the review texts and use the extracted topics as aspects. The number of topics is empirically set to 10 (the number of aspects in Table 1).
In Section 5.2, we use the four datasets in Table 2 to compare the recommendation performance of our models and the baselines. For the experimental settings, we randomly split each dataset into 80% for training, 10% for validation, and the remaining 10% for testing. The process is repeated five times, and the average RMSE over the five rounds is reported. For the parameters of our models, we use a single regularization weight λ in Eq. (10) for simplicity, and λ is tuned to obtain the optimal value on the validation sets. As in (Zhao et al., 2017b), the remaining parameters are set for good performance and computational efficiency. From Section 5.3 to Section 5.8, to explore the influence of different settings of the proposed framework, we create two smaller datasets, Amazon50K and Yelp50K, in which 50,000 ratings are sampled from Amazon200K and Yelp200K, respectively. We then randomly split each dataset into 80% for training and 20% for testing, and report the performance on the test sets. Finally, in Section 5.9, we conduct experiments on the FM part with the two optimization algorithms presented in Section 4.3 to demonstrate the scalability of the proposed framework. Datasets of different scales are created from Amazon200K and Yelp200K, with the parameters fixed as above.
5.2. Recommendation Effectiveness
The RMSEs of all of the evaluated methods are shown in Table 3. The relative decrease of RMSE achieved by FMG compared to the baselines is shown in Table 4. For CIKM-Yelp and CIKM-Douban, we directly report the performance of SemRec from (Shi et al., 2015), since the same amount of training data is used in our experiment. Besides, the results of SemRec on Amazon200K are not reported, as the program crashed due to its large memory demand.
          Amazon200K     Yelp200K       CIKM-Yelp      CIKM-Douban
RegSVD    2.9656±0.0008  2.5141±0.0006  1.5323±0.0011  0.7673±0.0010
FMR       1.3462±0.0007  1.7637±0.0004  1.4342±0.0009  0.7524±0.0011
HeteRec   2.5368±0.0009  2.3475±0.0005  1.4891±0.0005  0.7671±0.0008
SemRec    —              1.4603±0.0003  1.1559(*)      0.7216(*)
FMG       1.1953±0.0008  1.2583±0.0003  1.1167±0.0011  0.7023±0.0011
FMG(LSP)  1.1980±0.0010  1.2593±0.0005  1.1255±0.0012  0.7035±0.0013
Table 3. Recommendation performance of all approaches in terms of RMSE. The lowest RMSEs (according to a pairwise t-test with 95% confidence) are highlighted.
Table 4. Relative decrease of RMSE achieved by FMG compared to each baseline.

          Amazon200K  Yelp200K  CIKM-Yelp  CIKM-Douban
RegSVD    60.0%       50.0%     27.1%      8.5%
FMR       11.0%       28.7%     11.0%      6.7%
HeteRec   52.8%       46.4%     25.0%      8.4%
SemRec    —           13.8%     3.4%       2.7%
First, we can see that our FMG models, both the convex and nonconvex variants, consistently outperform all baselines on the four datasets. This demonstrates the effectiveness of the proposed framework shown in Figure 3. Note that the performance of FMG and FMG(LSP) is very close, but FMG(LSP) needs fewer features to achieve it, which supports our motivation of using nonconvex regularization for selecting features. In the following two sections, we compare the two regularizers in detail.
Second, from Table 3, we can see that compared to RegSVD and FMR, which use only the rating matrix, SemRec and FMG, which use side information from metagraphs, are significantly better. In particular, the sparser the rating matrix, the more obvious the benefit produced by the additional information. For example, on Amazon200K, FMG outperforms RegSVD by 60.0%, while on CIKM-Douban the RMSE decrease is 8.5%. Note that the performance of HeteRec is worse than that of FMR, despite the fact that we tried our best to tune the model. This aligns with our discussion in Section 4 that a weighted ensemble of dot products of latent features may cause information loss among the metagraphs and fail to reduce the noise caused by having too many metagraphs. These results demonstrate the effectiveness of the proposed FMG for fusing various types of side information for recommendation.
Comparing the results of FMG and SemRec, we find that the performance gap between them is not that large, which means that SemRec is still a good method for rating prediction, especially compared to the other three baselines. The good performance of SemRec may be attributed to the fact that it incorporates rating values into the HIN to create a weighted HIN, which can better capture metagraph or metapath based similarities between users and items.
5.3. The Impact of Convex Regularizer
In this part, we study the impact of the group lasso regularizer on FMG. Specifically, we show the trend of RMSE when varying λ in (10), which controls the weight of group lasso. The RMSEs on Amazon50K and Yelp50K are shown in Figures 6(a) and (b), respectively. We can see that as λ increases, RMSE first decreases and then increases, demonstrating that values of λ that are too large or too small hurt the rating prediction performance; the best performance is achieved at a moderate λ on both datasets. Next, we further analyze the effect of λ in terms of sparsity and the metagraphs selected by group lasso.
5.3.1. Sparsity of w and V

We study the sparsity of the learned parameters w and V after learning. We define NNZ (number of nonzeros) as NNZ = n_z / (n_w + n_V), where n_z is the total number of nonzero elements in w and V, and n_w and n_V are the numbers of entries in w and V, respectively. The smaller the NNZ, the fewer the nonzero elements in w and V, and the fewer the metagraph based features left after training. The trend of NNZ with different λ's is shown in Figure 7. We can see that as λ increases, NNZ becomes smaller, which aligns with the effect of group lasso. Note that the trend is nonmonotonous due to the nonconvexity of the objective.
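A minimal sketch of the NNZ computation above, assuming w and V are stored as dense NumPy arrays; the tolerance for treating an entry as zero is an implementation detail we add to handle numerical near-zeros.

```python
import numpy as np

def nnz_ratio(w, V, tol=1e-10):
    """NNZ = n_z / (n_w + n_V): fraction of nonzero entries in w and V."""
    n_z = int(np.sum(np.abs(w) > tol)) + int(np.sum(np.abs(V) > tol))
    return n_z / (w.size + V.size)

# Toy usage: 2 nonzeros in w, 2 nonzeros in V, 12 entries in total.
w = np.array([0.0, 1.5, 0.0, -0.3])
V = np.zeros((4, 2))
V[1] = [0.2, -0.1]
print(nnz_ratio(w, V))  # -> 0.3333333333333333
```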
5.3.2. The Selected Metagraphs
In this part, we analyze the features selected by FMG. From Figures 6(a) and (b), we can see that RMSE and sparsity are good under the best-performing λ on Amazon50K and Yelp50K. Thus, we show the selected metagraphs and their user and item features under these configurations. Recall that in Eq. (4), we introduce w and V to capture, respectively, the first-order weights of the features and the second-order weights of their interactions. Thus, after training, the nonzero values in w and V represent the selected features, i.e., the selected metagraphs. We list in Table 5 the selected metagraphs corresponding to nonzero values in w and V from the perspectives of both users and items.
Table 5. The metagraphs selected by FMG and FMG(LSP) on Amazon50K and Yelp50K.

                      User-Part                   Item-Part
                      first-order  second-order   first-order  second-order
Amazon50K  FMG
           FMG(LSP)
Yelp50K    FMG
           FMG(LSP)
From Table 5, we can observe that metagraphs that connect users to items through similar users (including friends) tend to perform better than those that go through similar items. On Yelp50K, the user-similarity style metagraphs in Figure 4 tend to be selected, while the item-similarity ones are removed. This means that on Yelp, recommendations by friends or similar users are better than those by similar items. Similar observations can be made on Amazon50K, i.e., the item-similarity style metagraphs in Figure 5 tend to be removed. Furthermore, on both datasets, the metagraphs with complex structures in Figures 4 and 5 are found to be important for the item latent features. This demonstrates the importance of the semantics captured by metagraphs, which are ignored in previous metapath based RSs (Yu et al., 2013, 2014; Shi et al., 2015).
5.4. Impact of Nonconvex Regularizer
In this part, we study the performance of the nonconvex regularizer. We conduct experiments on Amazon50K and Yelp50K datasets to compare the results of the convex and nonconvex regularizers.
The results are reported in the same manner as in Section 5.3. The RMSEs of the nonconvex regularizer on Amazon50K and Yelp50K are shown in Figures 6(a) and (b), respectively. We observe that the trend for the nonconvex regularizer is similar to that for the convex one: the best performance is again achieved at a moderate value of λ on both datasets.
As in Section 5.3.1, we also use NNZ to show the performance of FMG(LSP) in Figure 7. We can see that as λ increases, NNZ becomes smaller. Note that the trend is also nonmonotonous due to the nonconvexity of the objective. Besides, the NNZ of the parameters of FMG(LSP) is much smaller than that of FMG when the best performance is achieved on both Amazon50K and Yelp50K. This is due to the nonconvexity of LSP, which can induce higher sparsity of the parameters with little loss in performance.
Next, we analyze the features selected by FMG(LSP). As with FMG, we show the selected metagraphs when the best performance in Figure 6 is achieved on Amazon50K and Yelp50K. The results are also shown in Table 5, and the observations are very similar to those for FMG, i.e., metagraphs going through similar users are better than those going through similar items. On Yelp50K, the user-similarity style metagraphs tend to be selected while the item-similarity ones are removed, and on Amazon50K the item-similarity ones tend to be removed as well.
Besides the sparsity trends and the selected metagraphs, we emphasize an interesting discovery here. From Figure 7, we can see that on both Amazon50K and Yelp50K, the NNZ of FMG(LSP) is smaller than that of FMG when each obtains its best performance. In other words, to obtain the best performance, nonconvex regularizers can induce higher sparsity, which means they can select useful features more effectively, i.e., they can achieve comparable performance with fewer selected metagraphs.
5.5. Recommendation Performance with a Single Metagraph
In this part, we compare the performance of the different metagraphs separately on Amazon50K and Yelp50K. In the training process, we use only one metagraph for the user and item features, predict with FMG, and evaluate the results obtained by the corresponding metagraph. Specifically, we run experiments to compare the RMSE of each metagraph in Figures 4 and 5. The RMSE of each metagraph is shown in Figure 8. Note that, for comparison, we also show the RMSE when all metagraphs are used.
From Figure 8, we can see that on both Amazon50K and Yelp50K, the performance is the best when all metagraph based user and item features are used, which demonstrates the usefulness of the semantics captured by the designed metagraphs in Figures 4 and 5. Besides, we can see that the worst-performing metagraph on Yelp50K and one of the three worst on Amazon50K both connect users to items through similar items. This aligns with the observation in the above two sections that metagraphs going through similar users are better than those going through similar items. These consistent observations across the three sections can be regarded as domain knowledge, indicating that we should design more metagraphs of the user-similarity style.
Finally, the metagraphs with complex structures rank among the best three on both Yelp50K and Amazon50K, which demonstrates the usefulness of the complex semantics they capture.
5.6. Feature Extraction Methods
In this part, we compare the performance of the two feature extraction methods described in Section 3.2, i.e., NNR and MF. Note that the rank parameter of MF and the regularization parameter of NNR lead to different numbers of latent features for different similarity matrices. Figure 9 shows the performance with different total lengths of the input features. We can see that the latent features from NNR have slightly better performance than those from MF, while the feature dimension resulting from NNR is much larger. These observations support our motivation for using these two methods, stated in Section 3.3: NNR usually has better performance, but the recovered rank is often much higher than that of MF. Thus, we conclude that NNR is better for extracting features if we want the best performance, while MF is more suitable for a trade-off between performance and efficiency.
5.7. Rank of the Second-Order Weight Matrix
In this part, we show the performance trend when varying K, the rank of the second-order weight matrix V in the FMG model (see Section 4). For the sake of efficiency, we conduct these experiments on Amazon50K and Yelp50K and employ the MF-based latent features. We vary K over a range of values, and the results are shown in Figure 10. We can see that the performance improves with larger K on both datasets and stabilizes once K is large enough. Thus, we fix K at a value in the stable region for all other experiments.
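For context on the role of K, the second-order term parameterized by the rank-K matrix V can be evaluated in O(Kd) rather than O(Kd²) time using the standard factorization machine identity (Rendle, 2012). The sketch below is a generic illustration of that identity, not the FMG implementation; all variable names are our own.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction with rank-K second-order weights V (d x K).

    Uses the identity
        sum_{i<j} <v_i, v_j> x_i x_j
          = 0.5 * sum_k [ (sum_i V[i,k] x_i)^2 - sum_i V[i,k]^2 x_i^2 ],
    so the pairwise interaction term costs O(K*d) instead of O(K*d^2).
    """
    linear = w0 + w @ x
    s = V.T @ x                    # shape (K,): sum_i V[i,k] * x_i
    s_sq = (V ** 2).T @ (x ** 2)   # shape (K,): sum_i V[i,k]^2 * x_i^2
    return float(linear + 0.5 * np.sum(s ** 2 - s_sq))

# Toy usage with random features; the rows of V are the K-dimensional factors.
rng = np.random.default_rng(0)
d, K = 6, 3
x = rng.normal(size=d)
w0, w, V = 0.5, rng.normal(size=d), rng.normal(size=(d, K))
y_hat = fm_predict(x, w0, w, V)
```

A larger K gives the interaction matrix VVᵀ more expressive power at a linearly growing cost, which is consistent with the performance plateau observed in Figure 10.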
5.8. Optimization Algorithm
In this part, we compare the SVRG and nmAPG algorithms proposed in Section 4.3. Besides, we also use SGD as a baseline, since it is the most popular algorithm for factorization machine based models (Rendle, 2012; Hong et al., 2013). Again, we use the Amazon50K and Yelp50K datasets. As suggested in (Xiao and Zhang, 2014), we compare the efficiency of the algorithms based on RMSE w.r.t. the number of gradient computations divided by the number of training samples, i.e., the number of effective passes over the data.
The results are shown in Figure 11. We can observe that SGD is the slowest of the three algorithms and SVRG is the fastest. Although SGD can be faster than nmAPG at the beginning, the diminishing step size used to guarantee convergence of stochastic algorithms eventually drags SGD down to be the slowest. SVRG is also a stochastic gradient method, but it avoids the diminishing step size through its variance reduction technique, which makes it even faster than nmAPG. Finally, as both SVRG and nmAPG are guaranteed to produce a critical point of (10), they have the same empirical prediction performance. Therefore, in practice, we suggest using SVRG as the solver because of its faster speed and empirically good performance.
5.9. Scalability
In this part, we study the scalability of our framework. We extract a series of datasets of different scales from Amazon200K and Yelp200K according to the number of observations in the user-item rating matrix.
The time costs on Amazon and Yelp are shown in Figure 12. For simplicity, we only show the results of FMG with the SVRG and nmAPG algorithms. From Figure 12, the training time is almost linear in the number of observed ratings, which aligns with the analysis in Section 4.3.3 and demonstrates that our framework can be applied to large-scale datasets.
6. Related Work
In this section, we review existing works related to HIN, RS with side information, and FM.
6.1. Heterogeneous Information Networks (HINs)
HINs have been proposed as a general representation for many realworld graphs or networks (Shi et al., 2017; Sun et al., 2011; Kong et al., 2013b; Joshua, 2012; Sun and Han, 2013).
A metapath is a sequence of entity types defined by the HIN network schema. Based on metapaths, several similarity measures, such as PathCount (Sun et al., 2011), PathSim (Sun et al., 2011), and PCRW (Lao and Cohen, 2010), have been proposed, and research has shown that they are useful for entity search and as similarity measures in many real-world networks. Following the development of metapaths, many data mining tasks have been enabled or enhanced, including recommendation (Yu et al., 2013, 2014; Shi et al., 2015), similarity search (Sun et al., 2011; Shi et al., 2014), clustering (Wang et al., 2015a; Sun et al., 2013), classification (Kong et al., 2013a; Wang et al., 2015b, 2017; Jiang et al., 2017), link prediction (Sun et al., 2012; Zhang et al., 2014), malware detection (Hou et al., 2017; Fan et al., 2018a), and opioid user detection (Fan et al., 2018b).
Recently, the metagraph (or metastructure) has been proposed to capture complicated semantics in HINs that metapaths cannot handle (Huang et al., 2016; Fang et al., 2016). However, in existing research, the metagraph is limited to entity similarity problems where the entities have the same type. In this paper, we extend metagraphs to the recommendation problem, where we need to compute the similarity between heterogeneous types of entities, i.e., users and items.
6.2. Recommendation with Heterogeneous Side Information
Modern recommender systems are able to capture rich side information, such as social connections among users and metadata and reviews associated with items. Previous works have explored different methods of incorporating heterogeneous side information to enhance CF based recommender systems. For example, (Ma et al., 2011) and (Zhao et al., 2017a) incorporate social relations into low-rank and local low-rank matrix factorization, respectively, to improve recommendation performance. In (McAuley and Leskovec, 2013; Ling et al., 2014), review texts are analyzed together with ratings for the rating prediction task. (Ye et al.,