Graph Enhanced Representation Learning for News Recommendation

03/31/2020 · by Suyu Ge et al., Tsinghua University

With the explosion of online news, personalized news recommendation becomes increasingly important for online news platforms to help their users find interesting information. Existing news recommendation methods achieve personalization by building accurate news representations from news content and user representations from their direct interactions with news (e.g., click), while ignoring the high-order relatedness between users and news. Here we propose a news recommendation method which can enhance the representation learning of users and news by modeling their relatedness in a graph setting. In our method, users and news are both viewed as nodes in a bipartite graph constructed from historical user click behaviors. For news representations, a transformer architecture is first exploited to build news semantic representations. Then we combine it with the information from neighbor news in the graph via a graph attention network. For user representations, we not only represent users from their historically clicked news, but also attentively incorporate the representations of their neighbor users in the graph. Improved performances on a large-scale real-world dataset validate the effectiveness of our proposed method.




1. Introduction

Both the overwhelming number of newly published news and the huge volume of online news consumption pose challenges to online news aggregation platforms. Thus, how to target different users' news reading interests and avoid showcasing excessive irrelevant news becomes an important problem for these platforms (Phelan et al., 2011; Liu et al., 2010). A possible solution is personalized news recommendation, which depicts user interests from previous user-news interactions (Li et al., 2011; Bansal et al., 2015). However, unlike general personalized recommendation, news recommendation is unique in several aspects. The fast iteration of online news makes traditional ID-based recommendation methods such as collaborative filtering (CF) suffer from the data sparsity problem (Guo et al., 2014). Meanwhile, the rich semantic information in news texts distinguishes news recommendation from recommendation in other domains (e.g., music, fashion and food). Therefore, a precise understanding of textual content is also vital for news recommendation.

Figure 1. A user-news bipartite graph.

Existing news recommendation methods achieve personalized news ranking by building accurate news and user representations. They usually build news representations from news content (Bansal et al., 2015; Lian et al., 2018; Zhu et al., 2019; Wu et al., 2019c). Based on that, user representations are constructed from click behaviors, e.g., by aggregating the representations of clicked news. For example, Wang et al. (2018) proposed DKN, which formed news representations from titles via a convolutional neural network (CNN) and then utilized an attention network to select important clicked news for user representations. Wu et al. (2019b) further enhanced personalized news representations by incorporating user IDs as attention queries to select important words in news titles; the same attention query was used to select important clicked news when forming user representations. Compared with traditional collaborative filtering methods (Konstan et al., 1997; Ren et al., 2017; Ling et al., 2014), which suffer from heavy cold-start problems (Lika et al., 2014), these methods gain a competitive edge by learning semantic news representations directly from news content. However, most of them build news representations only from news content and user representations only from users' historically clicked news. When the news content such as titles is short and vague, and the historical behaviors of a user are sparse, it is difficult for them to learn accurate news and user representations.

Our work is motivated by several observations. First, a bipartite graph can be established from user-news interactions: both users and news are viewed as nodes, and the interactions between them are viewed as edges. Within this graph, news clicked by the same user are defined as neighbor news; similarly, users who share common clicked news are denoted as neighbor users. Figure 1 illustrates such a graph: two news items are neighbors because they are both clicked by the same user, and two users are neighbor users because they clicked the same news. Second, news representations may be enhanced by considering neighbor news in the graph. For example, suppose two neighbor news both relate to politics, but the expression "The King" in one of them is vague without any external information. By linking it to its neighbor news, which is more detailed and explicit, we may infer that the title refers to president Trump. Thus, when forming the representation of one news, its neighbor news may be modeled simultaneously as complementary information. Third, neighbor users in the graph may share similar news preferences, and incorporating such similarities may further enrich target user representations. For instance, two users who share common clicked political news are likely both interested in political news. If the click history of one of them is very sparse, it is challenging to form an accurate representation for that user; explicitly introducing information from the neighbor user may enrich the representation and lead to better recommendation performance.

In this paper, we propose to incorporate the graph relatedness of users and news to enhance their representation learning for news recommendation. First, we utilize the transformer (Vaswani et al., 2017) to build news semantic representations from textual content. In this way, the multi-head self-attention network encodes word dependencies in titles at both short and long distances. We also add topic embeddings of news, since topics may contain important information. Then we further enhance news representations by aggregating neighbor news via a graph attention network. To enrich neighbor news representations, we utilize both their semantic representations and their ID embeddings. For user representations, besides attentively building them from user ID embeddings and historically clicked news, our approach also leverages graph information: we use the attention mechanism to aggregate the ID embeddings of neighbor users. Finally, recommendation is made by taking the dot product between user and news representations. We conduct extensive experiments on a large real-world dataset. The improved performance over a set of well-known baselines validates the effectiveness of our approach.

2. Related Work

Neural news recommendation has received attention from both the data mining and natural language processing fields (Zheng et al., 2018; Wang et al., 2017; Hamilton et al., 2017). Many previous works handle this problem by learning news and user representations from textual content (Wu et al., 2019b; An et al., 2019; Zhu et al., 2019; Wu et al., 2019a). From this viewpoint, user representations are built upon clicked news representations using certain aggregation techniques (e.g., attentive aggregation or sequential encoding). For instance, Okura et al. (2017) incorporated a denoising autoencoder to form news representations, then explored various types of recurrent networks to encode users. An et al. (2019) attentively encoded news by combining title and topic information; they learned news representations via CNN and formed user representations from clicked news via a gated recurrent unit (GRU) network. Zhu et al. (2019) exploited a long short-term memory (LSTM) network to encode clicked news, then applied a single-directional attention network to select important click history for user representations. Though effective in extracting information from textual content, the works above neglect the relatedness between neighbor users (or items) in the interaction graph. Different from these methods, our approach exploits both contextual meaning and neighbor relatedness in the graph.

Recently, graph neural networks (GNNs) have received wide attention, and a surge of attempts has been made to develop GNN architectures for recommender systems (Ying et al., 2018; Wu et al., 2019d; Hamilton et al., 2017). These models leverage both node attributes and graph structure by representing users and items with combinations of neighbor node embeddings (Song et al., 2019). For instance, Wang et al. (2019b) combined a knowledge graph (KG) with collaborative signals via a graph attention network, thus enhancing user and item representations with entity information in the KG. Ying et al. (2018) introduced graph convolution to web-scale recommendation, forming node representations of users and items from visual and annotation features. In most works, representations are initially formed via node embeddings and then optimized by receiving propagation signals from the graph (Wang et al., 2019c; Wu et al., 2019d). Although node embeddings have been enhanced with item relations (Xin et al., 2019), visual features (Ying et al., 2018) or knowledge graphs (Wang et al., 2019a), the rich semantic information in textual content may not be fully exploited. Different from these works, our approach learns the node embeddings of news directly from their textual content. We utilize the transformer architecture to model context dependency in news titles, thus improving the node embeddings by forming context-aware news representations.

3. Our Approach

(a) Overview of the model.
(b) Transformer submodule.
Figure 2. An illustration of our proposed GERL approach. Dashed lines represent graph connectivity established from click behaviors, and solid lines represent the information flow among different modules.

In this section, we introduce our Graph Enhanced Representation Learning (GERL) approach, illustrated in Figure 2, which consists of a one-hop interaction learning module and a two-hop graph learning module. The one-hop interaction learning module represents the target user from historically clicked news and represents the candidate news based on its textual content. The two-hop graph learning module learns neighbor embeddings of news and users with a graph attention network.

3.1. Transformer for Context Understanding

Motivated by Vaswani et al. (2017), we utilize the transformer to form accurate context representations from news titles and topics. News titles are usually clear and concise. Hence, to avoid the performance degradation caused by excessive parameters, we simplify the transformer to a single layer of multi-head attention (we also tried the original transformer architecture, but its performance was sub-optimal).

We then introduce the modified transformer from bottom to top. The bottom layer is the word embedding, which converts the words in a news title into a sequence of low-dimensional embedding vectors. Denote a news title with M words as [w_1, w_2, …, w_M]; through this layer it is converted into the embedded vector sequence [e_1, e_2, …, e_M].

The following layer is a word-level multi-head self-attention network. Interactions between words are important for learning news representations. For instance, in the title "Sparks gives Penny Toler a fire from the organization", the interaction between "Sparks" and "organization" helps understand the title. Moreover, a word may relate to more than one word in the title. For example, the word "Sparks" interacts with both "fire" and "organization". Thus, we employ multi-head self-attention to form contextual word representations. The representation of the i-th word learned by the k-th attention head is computed as:

α_{i,j}^k = exp(e_i^T Q^k e_j) / Σ_{m=1}^{M} exp(e_i^T Q^k e_m),    h_i^k = V^k Σ_{j=1}^{M} α_{i,j}^k e_j,

where Q^k and V^k are the projection matrices in the k-th self-attention head, and α_{i,j}^k indicates the relative importance of the relatedness between the i-th and j-th words. The multi-head representation h_i of the i-th word is the concatenation of the representations produced by the K separate self-attention heads, i.e., h_i = [h_i^1; h_i^2; …; h_i^K]. To mitigate overfitting, we add dropout (Srivastava et al., 2014) after the self-attention.
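The per-head computation above can be sketched in NumPy. This is a minimal illustration, not the trained model: the random matrices stand in for the learned projections Q^k and V^k, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(E, Qs, Vs):
    """E: (M, d) word embeddings; Qs: list of (d, d) score projections;
    Vs: list of (d_head, d) value projections. Returns (M, K * d_head)."""
    heads = []
    for Q, V in zip(Qs, Vs):
        scores = E @ Q @ E.T              # (M, M): relatedness of word pairs
        alpha = softmax(scores, axis=-1)  # attention weights per word
        heads.append((alpha @ E) @ V.T)   # (M, d_head) per-head output
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
M, d, K, d_head = 6, 32, 4, 8
E = rng.normal(size=(M, d))
Qs = [rng.normal(size=(d, d)) * 0.1 for _ in range(K)]
Vs = [rng.normal(size=(d_head, d)) * 0.1 for _ in range(K)]
H = multi_head_self_attention(E, Qs, Vs)
print(H.shape)  # (6, 32): K * d_head = 32 concatenated head outputs per word
```

The concatenation of K = 4 heads of dimension 8 mirrors the paper's h_i = [h_i^1; …; h_i^K].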

Next, we utilize an additive word attention network to model the relative importance of different words and aggregate them into a title representation. For instance, the word "fire" is more important than the other words in the above example. The attention weight of the i-th word is computed as:

a_i = q_w^T tanh(W_w h_i + b_w),    α_i = exp(a_i) / Σ_{j=1}^{M} exp(a_j),

where W_w, b_w and q_w are trainable parameters in the word attention network. The news title representation is then calculated as t = Σ_{i=1}^{M} α_i h_i.
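The additive attention pooling can likewise be sketched in a few lines; the random W, b, q below are stand-ins for the trained parameters W_w, b_w, q_w.

```python
import numpy as np

def additive_attention(H, W, b, q):
    """H: (M, d) contextual word vectors -> single (d,) pooled vector.
    a_i = q^T tanh(W h_i + b), then softmax over words, then weighted sum."""
    a = np.tanh(H @ W.T + b) @ q       # (M,) unnormalized attention scores
    alpha = np.exp(a - a.max())
    alpha = alpha / alpha.sum()        # softmax over the M words
    return alpha @ H                   # weighted sum of word vectors

rng = np.random.default_rng(1)
M, d, d_att = 5, 16, 8
H = rng.normal(size=(M, d))
W = rng.normal(size=(d_att, d))
b = rng.normal(size=d_att)
q = rng.normal(size=d_att)
t = additive_attention(H, W, b, q)
print(t.shape)  # (16,): one title vector, same dimension as the word vectors
```

The same pooling pattern is reused later for clicked news and for graph neighbors, only with different trained parameters.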

Since the topics of a user's clicked news may also reveal their preferences, we model news topics via an embedding matrix. Denote the output of this embedding matrix as v; then the final representation of the news is the concatenation of the title vector and the topic vector, i.e., [t; v].

3.2. One-hop Interaction Learning

The one-hop interaction learning module learns candidate news and click behaviors of target users. More specifically, it can be decomposed into three parts: (1) Candidate news semantic representations; (2) Target user semantic representations; (3) Target user ID representations.

Candidate News Semantic Representations. Since understanding the content of candidate news is crucial for recommendation, we propose to utilize the transformer to form an accurate representation of it. Given a candidate news, the one-hop (denoted by the superscript (1)) output of the transformer module (denoted by the subscript s for semantic) is n_s^(1).

Target User Semantic Representations. The news reading preferences of a user can be clearly revealed by their clicked news. Thus, we propose to model user representations from the content of their clicked news. Besides, different news may have varied importance for modeling user interests. For example, the news "crazy storms hit Los Angeles" is less important than the news "6 most popular music dramas" in modeling user interests. Thus, we apply an additive attention mechanism to aggregate clicked news vectors into user representations. Given a target user with N clicked news, we first get their transformer-encoded outputs [r_1, r_2, …, r_N]. Then the attention weight of the i-th clicked news is calculated as:

a_i = q_u^T tanh(W_u r_i + b_u),    α_i = exp(a_i) / Σ_{j=1}^{N} exp(a_j),

where W_u, b_u and q_u are the trainable parameters of the news attention network. The one-hop user semantic representation is then calculated as u_s^(1) = Σ_{i=1}^{N} α_i r_i.

Target User ID Representations. Since user IDs identify each user uniquely, we incorporate them as latent representations of user interests (Lv et al., 2011; Marlin and Zemel, 2004). We use a trainable ID embedding matrix U ∈ R^{N_u×d} to represent each user ID as a low-dimensional vector, where N_u is the number of users and d is the dimension of the ID embedding. For a given user, the one-hop ID embedding vector is denoted as u_id^(1).

3.3. Two-hop Graph Learning

The two-hop graph learning module mines the relatedness between neighbor users and neighbor news from the interaction graph. For a given target user, neighbor users usually have different levels of similarity with him/her, and the same holds among neighbor news. To exploit this kind of similarity, we aggregate neighbor news and user information with a graph attention network (Song et al., 2019). The graph information utilized here is heterogeneous, including both semantic representations and ID embeddings. The two-hop graph learning module also has three parts: (1) Neighbor user ID representations; (2) Neighbor news ID representations; (3) Neighbor news semantic representations.

Neighbor User ID Representations. Since adding neighbor user information may complement target user representations, we aggregate the ID embeddings of neighbor users via an additive attention network. Given a user and a list of P neighbor users, we first get their ID embeddings via the same user ID embedding matrix U, denoted as [u_1, u_2, …, u_P]. Then the attention weight of the i-th neighbor user is calculated as:

a_i = q_g^T tanh(W_g u_i + b_g),    α_i = exp(a_i) / Σ_{j=1}^{P} exp(a_j),

where W_g, b_g and q_g are trainable parameters in the neighbor user attention network. The two-hop neighbor user ID representation is then calculated as u_id^(2) = Σ_{i=1}^{P} α_i u_i.
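The neighbor-user branch amounts to an embedding lookup followed by the same additive attention pooling. A minimal sketch, with assumed shapes and random matrices in place of the trained U, W_g, b_g, q_g:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
num_users, d, d_att = 100, 16, 8
U = rng.normal(size=(num_users, d))        # shared user-ID embedding matrix
W = rng.normal(size=(d_att, d))
b = rng.normal(size=d_att)
q = rng.normal(size=d_att)

neighbor_ids = [4, 17, 42]                 # assumed neighbor users of the target
H = U[neighbor_ids]                        # (3, d) looked-up ID embeddings
alpha = softmax(np.tanh(H @ W.T + b) @ q)  # attention over the neighbors
u_id_2hop = alpha @ H                      # two-hop neighbor-user representation
print(u_id_2hop.shape)  # (16,)
```

Because the same embedding matrix U is shared with the one-hop branch, the neighbor signal and the target user's own ID live in the same space.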

Neighbor News ID Representations. News clicked by the same user reveal certain preferences of that user and thus may share some common characteristics. To model this kind of similarity, we utilize an attention network to learn neighbor news ID representations. For a news with a list of P neighbor news, we first transform the neighbors with the news ID embedding matrix M ∈ R^{N_n×d}, where N_n is the number of news and d is the dimension of the ID embedding. The output is [m_1, m_2, …, m_P]. Upon it, we apply an additive attention layer to combine the neighbor ID embeddings into a unified output vector; the attention calculation is analogous to that for neighbor users above. The final two-hop neighbor news ID representation is denoted as n_id^(2).

Neighbor News Semantic Representations. Although the ID embeddings of news are unique and inherently represent the neighbor news, they encode news information only implicitly. Moreover, the IDs of some newly published news may not be included in the predefined news ID embedding matrix M. Thus, we propose to attentively learn their context representations via the transformer simultaneously. For the neighbor news list, the transformer outputs are [s_1, s_2, …, s_P]. An attention layer is then applied to model the varied importance of the neighbor news. The final neighbor news semantic representation, the output of this attention layer, is denoted as n_s^(2).

3.4. Recommendation and Model Training

The final representations of users and news are the concatenations of the outputs from the one-hop interaction learning module and the two-hop graph learning module, i.e., u = [u_s^(1); u_id^(1); u_id^(2)] and n = [n_s^(1); n_id^(2); n_s^(2)]. The rating score of a user-news pair is predicted by the inner product of the user and news representations, i.e., ŷ = u^T n. Through this operation, the ID representations and semantic representations are optimized in the same vector space.
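The final assembly is simple dimension bookkeeping: concatenate the three user-side and three news-side vectors and take their inner product. A sketch with assumed names and a toy dimension:

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)
# Hypothetical module outputs for one user-news pair (names are ours):
# user side: one-hop semantic, one-hop ID, two-hop neighbor-user ID
u_sem1, u_id1, u_id2 = (rng.normal(size=d) for _ in range(3))
# news side: one-hop semantic, two-hop neighbor-news ID, two-hop neighbor semantic
n_sem1, n_id2, n_sem2 = (rng.normal(size=d) for _ in range(3))

u = np.concatenate([u_sem1, u_id1, u_id2])   # final user representation
n = np.concatenate([n_sem1, n_id2, n_sem2])  # final news representation
score = float(u @ n)                          # inner-product click score
print(u.shape, n.shape)  # (24,) (24,)
```

Because the score is a plain inner product, the user-side and news-side blocks at matching positions are pushed into a shared vector space during training.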

Motivated by (Huang et al., 2013; Zhai et al., 2016), we formulate the click prediction problem as a pseudo (K+1)-way classification task. We regard each clicked news as a positive sample and K sampled unclicked news as its negatives, and minimize the negative log-likelihood of the positive class:

L = − Σ_i log( exp(ŷ_i^+) / ( exp(ŷ_i^+) + Σ_{j=1}^{K} exp(ŷ_{i,j}^−) ) ),

where ŷ_i^+ is the predicted score of the i-th positive sample and ŷ_{i,j}^− is the predicted score of its j-th associated negative sample.
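This loss is a softmax cross-entropy over each positive sample and its K negatives. A minimal NumPy sketch of the batched computation (toy scores, not real model outputs):

```python
import numpy as np

def sampled_softmax_loss(pos_scores, neg_scores):
    """pos_scores: (B,) scores of positives; neg_scores: (B, K) scores of the
    K sampled negatives per positive. Returns the mean negative log-likelihood."""
    logits = np.concatenate([pos_scores[:, None], neg_scores], axis=1)  # (B, K+1)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_p_pos.mean())

loss = sampled_softmax_loss(
    np.array([2.0, 1.0]),
    np.array([[0.1, -0.3, 0.0, 0.2],   # K = 4 negatives, matching the paper's
              [0.5,  0.4, 0.1, 0.0]])  # negative sampling ratio
)
print(loss > 0)  # True: the negative log-likelihood is positive here
```

Driving the positive score above its negatives drives this loss toward zero, which is exactly the ranking behavior the model needs at serving time.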

4. Experiments

4.1. Datasets and Experimental Settings

We constructed a large-scale real-world dataset by randomly sampling user logs from MSN News; its statistics are shown in Table 1. The logs were collected from Dec. 13th, 2018 to Jan. 12th, 2019 and split by time, with the logs of the last week used for testing, 10% of the rest for validation, and the remainder for training.

In our experiments, we construct the neighbors of a candidate news by randomly sampling from the click logs of its previous readers. For a target user, since there may exist many neighbor users, we rank them by the number of clicked news they share with the target user, retain the top ones, and use them as graph inputs. Here we set the node degree to 15 (a moderate value, chosen due to the limit of computational resources) and use zero padding for cold-start users and newly published news. The dimensions of the word embedding, topic embedding and ID embedding are set to 300, 128 and 128 respectively. We use pretrained GloVe embeddings (Pennington et al., 2014) to initialize the word embedding matrix. There are 8 heads in the multi-head self-attention network, and the output dimension of each head is 16. The negative sampling ratio is set to 4. The maximum number of clicked news per user is set to 50, and the maximum news title length is set to 30. To mitigate overfitting, we apply dropout (Srivastava et al., 2014) with a rate of 0.2 after the outputs of the transformer and the ID embedding layers. Adam (Kingma and Ba, 2014) is used as the optimizer with a batch size of 128. These hyperparameters are selected according to performance on the validation set.
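The neighbor-user selection described above (rank by shared clicks, keep the top ones, zero-pad the rest) can be sketched as follows; the helper name and toy logs are our own illustration:

```python
def top_k_neighbor_users(target_clicks, all_user_clicks, k=15):
    """Rank other users by the number of news clicked in common with the
    target user, keep the top k, and pad with None (zero padding in the
    model) when fewer than k genuine neighbors exist."""
    target = set(target_clicks)
    overlap = {user: len(target & set(clicks))
               for user, clicks in all_user_clicks.items()}
    # Users with no shared clicks are not neighbors in the bipartite graph.
    ranked = [u for u, c in sorted(overlap.items(), key=lambda x: -x[1]) if c > 0]
    ranked = ranked[:k]
    return ranked + [None] * (k - len(ranked))

logs = {"u2": ["n1", "n3"], "u3": ["n9"], "u4": ["n1", "n2", "n3"]}
print(top_k_neighbor_users(["n1", "n2", "n3"], logs, k=3))  # ['u4', 'u2', None]
```

In the model, the None slots correspond to zero vectors, so cold-start users with few neighbors still produce fixed-size graph inputs.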

For evaluation, we use the average AUC, MRR, nDCG@5 and nDCG@10 scores over all impressions. We independently repeat each experiment 5 times and report the average results.
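As a reference for how the per-impression ranking metrics behave, here is a minimal sketch of MRR and nDCG@k (AUC is the standard ROC-AUC); this is our illustration, not the authors' evaluation code.

```python
import numpy as np

def mrr(labels, scores):
    """Mean reciprocal rank of the positive items within one impression."""
    order = np.argsort(scores)[::-1]                       # rank by score, descending
    ranks = np.where(np.asarray(labels)[order] == 1)[0] + 1
    return float((1.0 / ranks).mean())

def ndcg_at_k(labels, scores, k):
    """Normalized discounted cumulative gain at cutoff k."""
    order = np.argsort(scores)[::-1][:k]
    gains = np.asarray(labels)[order]
    dcg = (gains / np.log2(np.arange(2, gains.size + 2))).sum()
    ideal = np.sort(labels)[::-1][:k]                      # best possible ordering
    idcg = (ideal / np.log2(np.arange(2, ideal.size + 2))).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0

labels = [1, 0, 0, 1, 0]          # one impression: clicked vs. not clicked
scores = [0.9, 0.8, 0.3, 0.7, 0.1]
print(round(mrr(labels, scores), 3), round(ndcg_at_k(np.array(labels), scores, 5), 3))
```

The reported numbers in the paper are these metrics averaged over all test impressions.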

# users                242,175    # samples             32,563,990
# news                 249,038    # positive samples       805,411
# sessions             377,953    # negative samples     31,758,579
# avg. words per title   10.99    # topics                     285
Table 1. Statistics of our dataset.

4.2. Performance Evaluation

In this section, we evaluate the performance of our approach by comparing it with several baseline methods and a variant of our own method, listed as follows:

  • NGCF (Wang et al., 2019c): a graph neural network based collaborative filtering method for general recommendation, which uses ID embeddings as node representations.

  • LibFM (Rendle, 2012): a feature based model for general recommendation using matrix factorization.

  • Wide&Deep (Cheng et al., 2016): a general recommendation model which has both a linear wide channel and a deep dense-layer channel.

  • DFM (Lian et al., 2018): a neural news model utilizing an inception module to learn user features and a dense layer to merge them with item features.

  • DSSM (Huang et al., 2013): a sparse textual feature based model which learns news representation via multiple dense layers.

  • DAN (Zhu et al., 2019): a CNN based news model which learns news representations from news titles. An attentional LSTM is used to learn user representations.

  • GRU (Okura et al., 2017): a deep news model using an auto-encoder to learn news representations and a GRU network to learn user representations.

  • DKN (Wang et al., 2018): a CNN based news model enhanced by the knowledge graph. They utilize news-level attention to form user representations.

  • GERL-Graph: Our model without the two-hop graph learning.

Methods AUC MRR nDCG@5 nDCG@10
NGCF (Wang et al., 2019c) 55.45±0.16 17.19±0.05 17.23±0.10 22.08±0.09
LibFM (Rendle, 2012) 61.83±0.10 19.31±0.06 20.45±0.08 25.69±0.08
Wide&Deep (Cheng et al., 2016) 64.62±0.14 20.71±0.12 22.43±0.15 27.99±0.15
DFM (Lian et al., 2018) 64.72±0.19 20.75±0.14 22.60±0.20 28.22±0.19
DSSM (Huang et al., 2013) 65.49±0.18 20.93±0.13 22.93±0.22 28.65±0.27
DAN (Zhu et al., 2019) 65.52±0.13 21.25±0.18 23.14±0.21 28.73±0.15
GRU (Okura et al., 2017) 65.69±0.19 21.29±0.10 23.16±0.11 28.75±0.11
DKN (Wang et al., 2018) 65.88±0.13 21.46±0.21 23.23±0.25 28.84±0.21
GERL-Graph 67.74±0.13 22.71±0.15 25.03±0.13 30.65±0.15
GERL 68.55±0.12* 23.33±0.10* 25.82±0.14* 31.44±0.12*
Table 2. The performance scores and standard deviations of different methods. *The improvement is significant at the level p < 0.002.

For fair comparison, we extract the TF-IDF (Jones, 2004) feature from the concatenation of the clicked or candidate news titles and topics as sparse feature inputs for LibFM, Wide&Deep, DFM and DSSM. For DSSM, the negative sampling ratio is also set to 4. We try to tune all baselines to their best performances. The experimental results are summarized in Table 2, and we have several observations:

First, methods which represent news directly from news texts (e.g., DAN, GRU, DKN, GERL-Graph, GERL) usually outperform feature based methods (e.g., LibFM, Wide&Deep, DFM, DSSM). The possible reason is that although feature based methods incorporate news content, the information they exploit from news texts is limited, which may lead to sub-optimal news recommendation results.

Second, compared with NGCF, which also exploits neighbor information in the graph, our method achieves better results. This is because NGCF is an ID-based collaborative filtering method, which may suffer significantly from the cold-start problem. This result further proves the effectiveness of introducing textual understanding into graph neural networks for news recommendation.

Third, among methods that involve the textual content of news (e.g., DAN, GRU, DKN), our GERL-Graph consistently outperforms the baselines. This may be because the multi-head attention in the transformer module learns contextual dependency accurately. Moreover, our approach utilizes the attention mechanism to select important words and news.

Fourth, our GERL approach which combines both textual understanding and graph relatedness learning outperforms all other methods. This is because GERL encodes neighbor user and news information by attentively exploiting the interaction graph. The result validates the effectiveness of our approach.

4.3. Effectiveness of Graph Learning

Figure 3. Effectiveness of two-hop graph learning.

To validate the effectiveness of the two-hop graph learning module, we remove each component of the representations in the module to examine its relative importance, and illustrate the results in Figure 3 (when a component is removed, a trainable dense layer transforms the remaining user or news vector to keep the dimensions uniform). Several observations can be made. First, adding neighbor user information improves performance more significantly than adding neighbor news information. In our GERL-Graph variant, candidate news can be directly modeled through titles and topics, while target users are represented only by their clicked news; when a user's history is sparse, they may not be well represented. Hence, adding the IDs of neighbor users helps our model learn better user representations. Second, the improvement brought by neighbor news semantic representations outweighs that brought by neighbor news IDs. This is intuitive, since news titles carry more explicit and concrete meaning than IDs. Third, combining all parts of the graph learning module leads to the best performance: by adding graph information from both neighbor users and neighbor news, our model forms better representations for recommendation.

4.4. Ablation Study on Attention Mechanism

Next, we explore the effectiveness of two categories of attention by removing certain parts of them; to keep vector dimensions unchanged, we use average pooling to aggregate information instead. First, we verify the two types of attention inside the transformer in Figure 4(a). We conclude that both the additive attention and the self-attention are beneficial for news context understanding: self-attention encodes interactions between words, and additive attention helps select important words. Of the two, self-attention contributes more to model performance, as it models both short-distance and long-distance word dependency and forms diverse word representations with multiple attention heads. We also verify the model-level attention, i.e., the attention inside the one-hop interaction learning module and that in the two-hop graph learning module. From Figure 4(b), we observe that the attention in the one-hop module is more important: one-hop attention selects important clicked news, thus modeling user preferences directly, while two-hop attention models the relative importance of neighbors, which may only represent interests implicitly. Using both attentions simultaneously yields the best performance.

4.5. Hyperparameter Analysis

Here we explore the influence of two hyperparameters. One is the number of attention heads in the transformer module; the other is the degree of graph nodes in the graph learning module.

Number of Attention Heads. In the transformer module, the number of self-attention heads is crucial for learning context representations. We illustrate its influence in Figure 5(a). An evident increase can be observed as the number of heads grows to 8, since the rich textual meanings may not be fully exploited when there are few heads. However, performance drops slightly when the head number increases beyond 8. This may happen because news titles are concise and brief, so too many parameters may be sub-optimal. Based on the above discussion, we set the number of heads to 8.

Degree of Graph Nodes. In the graph learning module, the degree of user and news nodes decides how many similar neighbors our model will learn from. We increase the node degree and showcase its influence in Figure 5(b). As illustrated, performance improves when more neighbors are taken as model inputs, which is intuitive because more relatedness information from the graph is incorporated. Meanwhile, the increasing trend flattens as the degree grows larger. Therefore, we choose the moderate value of 15 as the node degree.

(a) Transformer attention.
(b) Model attention.
Figure 4. Effectiveness of attention mechanism.
(a) Attention head number.
(b) Node degree.
Figure 5. Influence of two hyperparameters.

5. Conclusion

In this paper, we propose a graph enhanced representation learning approach for news recommendation. Our approach consists of a one-hop interaction learning module and a two-hop graph learning module. The one-hop interaction learning module forms news representations via the transformer architecture and learns user representations by attentively aggregating their clicked news. The two-hop graph learning module enhances the representations of users and news by aggregating their neighbor embeddings via a graph attention network. Both the IDs and the textual content of news are utilized to enrich the neighbor embeddings. Experiments conducted on a real-world dataset show improved recommendation performance, validating the effectiveness of our approach.

The authors would like to thank Microsoft News for providing technical support and data in the experiments. This work was supported by the National Key Research and Development Program of China under Grant number 2018YFC1604002, the National Natural Science Foundation of China under Grant numbers U1836204, U1705261, U1636113, U1536201, and U1536207, and the Tsinghua University Initiative Scientific Research Program under Grant number 20181080368.


  • M. An, F. Wu, C. Wu, K. Zhang, Z. Liu, and X. Xie (2019) Neural news recommendation with long- and short-term user representations. In ACL, Florence, Italy, pp. 336–345. Cited by: §2.
  • T. Bansal, M. Das, and C. Bhattacharyya (2015) Content driven user profiling for comment-worthy recommendations of news and blog articles. In RecSys., pp. 195–202. Cited by: §1, §1.
  • H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In DLRS, pp. 7–10. Cited by: 3rd item, Table 2.
  • G. Guo, J. Zhang, and D. Thalmann (2014) Merging trust in collaborative filtering to alleviate data sparsity and cold start. Knowledge-Based Systems 57, pp. 57–68. Cited by: §1.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584. Cited by: §2, §2.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In CIKM, pp. 2333–2338. Cited by: §3.4, 5th item, Table 2.
  • K. S. Jones (2004) A statistical interpretation of term specificity and its application in retrieval. Journal of documentation. Cited by: §4.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl (1997) GroupLens: applying collaborative filtering to usenet news. Communications of the ACM 40 (3), pp. 77–87. Cited by: §1.
  • L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan (2011) SCENE: a scalable two-stage personalized news recommendation system. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 125–134. Cited by: §1.
  • J. Lian, F. Zhang, X. Xie, and G. Sun (2018) Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach.. In IJCAI, pp. 3805–3811. Cited by: §1, 4th item, Table 2.
  • B. Lika, K. Kolomvatsos, and S. Hadjiefthymiades (2014) Facing the cold start problem in recommender systems. Expert Systems with Applications 41 (4), pp. 2065–2073. Cited by: §1.
  • G. Ling, M. R. Lyu, and I. King (2014) Ratings meet reviews, a combined approach to recommend. In RecSys, pp. 105–112. Cited by: §1.
  • J. Liu, P. Dolan, and E. R. Pedersen (2010) Personalized news recommendation based on click behavior. In IUI, pp. 31–40. Cited by: §1.
  • Y. Lv, T. Moon, P. Kolari, Z. Zheng, X. Wang, and Y. Chang (2011) Learning to model relatedness for news recommendation. In WWW, pp. 57–66. Cited by: §3.2.
  • B. Marlin and R. S. Zemel (2004) The multiple multiplicative factor model for collaborative filtering. In ICML, pp. 73. Cited by: §3.2.
  • S. Okura, Y. Tagami, S. Ono, and A. Tajima (2017) Embedding-based news recommendation for millions of users. In KDD, pp. 1933–1942. Cited by: §2, 7th item, Table 2.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §4.1.
  • O. Phelan, K. McCarthy, M. Bennett, and B. Smyth (2011) Terms of a feather: content-based news recommendation and discovery using twitter. In ECIR, pp. 448–459. Cited by: §1.
  • Z. Ren, S. Liang, P. Li, S. Wang, and M. de Rijke (2017) Social collaborative viewpoint regression with explainable recommendations. In WSDM, pp. 485–494. Cited by: §1.
  • S. Rendle (2012) Factorization machines with libfm. TIST 3 (3), pp. 57. Cited by: 2nd item, Table 2.
  • W. Song, Z. Xiao, Y. Wang, L. Charlin, M. Zhang, and J. Tang (2019) Session-based social recommendation via dynamic graph attention networks. In WSDM, pp. 555–563. Cited by: §2, §3.3.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR 15 (1), pp. 1929–1958. Cited by: §3.1, §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §1, §3.1.
  • H. Wang, F. Zhang, X. Xie, and M. Guo (2018) DKN: deep knowledge-aware network for news recommendation. In WWW, pp. 1835–1844. Cited by: §1, 8th item, Table 2.
  • H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, and Z. Wang (2019a) Knowledge-aware graph neural networks with label smoothness regularization for recommender systems. In KDD, pp. 968–977. Cited by: §2.
  • X. Wang, X. He, Y. Cao, M. Liu, and T. Chua (2019b) KGAT: knowledge graph attention network for recommendation. In KDD, pp. 950–958. Cited by: §2.
  • X. Wang, X. He, M. Wang, F. Feng, and T. Chua (2019c) Neural graph collaborative filtering. In SIGIR, pp. 165–174. Cited by: §2, 1st item, Table 2.
  • X. Wang, L. Yu, K. Ren, G. Tao, W. Zhang, Y. Yu, and J. Wang (2017) Dynamic attention deep model for article recommendation by learning human editors’ demonstration. In KDD, pp. 2051–2059. Cited by: §2.
  • C. Wu, F. Wu, M. An, J. Huang, Y. Huang, and X. Xie (2019a) Neural news recommendation with attentive multi-view learning. arXiv preprint arXiv:1907.05576. Cited by: §2.
  • C. Wu, F. Wu, M. An, J. Huang, Y. Huang, and X. Xie (2019b) NPA: neural news recommendation with personalized attention. In KDD, pp. 2576–2584. Cited by: §1, §2.
  • C. Wu, F. Wu, M. An, Y. Huang, and X. Xie (2019c) Neural news recommendation with topic-aware news representation. In ACL, pp. 1154–1159. Cited by: §1.
  • S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan (2019d) Session-based recommendation with graph neural networks. In AAAI, Vol. 33, pp. 346–353. Cited by: §2.
  • X. Xin, X. He, Y. Zhang, Y. Zhang, and J. Jose (2019) Relational collaborative filtering: modeling multiple item relations for recommendation. In SIGIR, Cited by: §2.
  • R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In KDD, pp. 974–983. Cited by: §2.
  • S. Zhai, K. Chang, R. Zhang, and Z. M. Zhang (2016) DeepIntent: learning attentions for online advertising with recurrent neural networks. In KDD, pp. 1295–1304. Cited by: §3.4.
  • G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li (2018) DRN: a deep reinforcement learning framework for news recommendation. In WWW, pp. 167–176. Cited by: §2.
  • Q. Zhu, X. Zhou, Z. Song, J. Tan, and L. Guo (2019) DAN: deep attention neural network for news recommendation. In AAAI, Vol. 33, pp. 5973–5980. Cited by: §1, §2, 6th item, Table 2.