Why Do We Click: Visual Impression-aware News Recommendation

09/26/2021 ∙ by Jiahao Xun, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ National University of Singapore ∙ Zhejiang University

There is soaring interest in news recommendation due to information overload. To accurately capture users' interests, we propose to model multi-modal features, in addition to the news titles that are widely used in existing works, for news recommendation. Moreover, existing research pays little attention to the click decision-making process when designing multi-modal modeling modules. In this work, inspired by the fact that users make their click decisions mostly based on the visual impression they perceive when browsing news, we propose to capture such visual impression information with visual-semantic modeling for news recommendation. Specifically, we devise a local impression modeling module that simultaneously attends to decomposed details in the impression when understanding the semantic meaning of the news title, which explicitly approximates the process of users reading news. In addition, we inspect the impression from a global view and incorporate structural information, such as the arrangement of different fields and the spatial positions of different words on the impression, into the modeling of multiple modalities. To support research on visual impression-aware news recommendation, we extend the text-dominated news recommendation dataset MIND by adding snapshot impression images, and we will release it to nourish the research field. Extensive comparisons with state-of-the-art news recommenders, along with in-depth analyses, demonstrate the effectiveness of the proposed method and the promising capability of modeling visual impressions for content-based recommenders.


1. Introduction

Nowadays, online content sharing platforms have changed the way people read news, in a mobile and digital manner. News production sources have expanded enormously on such platforms, e.g., Microsoft News (https://www.msn.com/en-us/news) and Google News (https://news.google.com), so users can suffer from information overload due to the overwhelming amount of news. To mitigate information overload and improve user experience, personalized news recommender systems are devised to make it easy for users to find the news of their interests. The challenging and open-ended nature of news recommendation lends itself to diverse advances in the literature (Okura et al., 2017; Wang et al., 2020, 2018; Wu et al., 2019b, c, e; Zhu et al., 2019; Yao et al., 2021).

Figure 1. An illustration of impression-aware news recommendation. (a) The interface that users are browsing. (b) Before making click decisions, users typically have the semantic understanding of news title and visual impression in mind. (c) Impression-aware recommendation takes the fine-grained visual cues and the global structures into account.

Recently, Okura et al. (Okura et al., 2017) learn to represent a user's historically interacted news via a denoising autoencoder and RNNs in the recommender system of Yahoo! JAPAN (https://www.yahoo.co.jp), and Wang et al. (Wang et al., 2020) learn to obtain multi-level user representations with stacked dilated convolutions. Despite the significant progress made with these advances, they solely use the textual contents of news titles to represent users' interests and ignore digital news's multi-modal nature. As shown in Figure 1, online news might contain a variety of modalities or fields, i.e., title, body, video, soundtrack, image, and category. Thus, we draw inspiration from many other domains (Li et al., 2021; Jin et al., 2021; Zhang et al., 2021a; Lu et al., 2021; Zhang et al., 2020d; Jin et al., 2019; Zhang et al., 2020b; Li et al., 2019, 2020; Tian et al., 2021; Zhang et al., 2020c; Zhang et al., 2021b) and propose to incorporate multi-modal information for an in-depth understanding of users' preferences on news.

Recently, advances in other domains and applications have demonstrated the great success of multi-modal recommender systems (Arapakis et al., 2009; Chelliah et al., 2019; Kuo et al., 2013; Wang et al., 2015; Wei et al., 2020, 2019; Yu et al., 2019; Zhao et al., 2018). For example, Wei et al. (Wei et al., 2019) propose to model individual user-item interactions for each modality and use graph convolutional networks (Kipf and Welling, 2017) to learn modality-specific representations. Following this work, Wei et al. (Wei et al., 2020) propose to refine the user-item graph connections for each modality and thus leverage modality-specific network structures, which also helps denoise the implicit feedback. Zhao et al. (Zhao et al., 2018) propose to learn multi-modal heterogeneous network representations and incorporate user profiles, social relationships, textual descriptions, and video posters for video recommendation. Despite their successes on real-world datasets, we argue that these methods have two major deficiencies. Firstly, they typically introduce all available modalities without evaluating or explaining which modalities are essential for click-through-rate (CTR) prediction; introducing more features is not necessarily more effective, since additional features can lead to expensive computation, more over-fitting, and even more noise. Secondly, most of them leverage modalities with generic architectures and few recommendation-specific or application-specific designs.

Towards this end, we propose to investigate which modalities we should incorporate for news recommendation and to design fusion modules with highly application-specific insights. We aim to answer the question "why do users click" and start from the perspective that a user's click decision is based mostly on his/her inherent interest and the visual impression delivered by the news, i.e., the visual-semantic information he/she perceives when browsing the news application. In this paper, we treat the visual region of the news displayed on the user interface of news applications as the visual impression (as shown in Figure 1), and our work aims to model such multi-modal visual impression information to improve click-through-rate prediction. We contend that other modalities or fields, such as the news body and soundtrack, are inaccessible before users click the news; recommender systems might draw false conclusions when spuriously connecting these modalities to users' interests. Furthermore, we leverage layout information such as relative positions, relative sizes, and styles as guidance for multi-modal fusion, i.e., a news recommendation-specific design. To be specific, we devise the IMpression-aware multi-modal news Recommendation framework, denoted as IMRec. IMRec comprises two key components: (1) a global impression module that fuses the multi-modal content features under the guidance of the news layout and enhances the global item representations; and (2) a local impression module that models the correlation between each title word and the other impression units, such as the visual appearances of title words and images. In this way, our model bridges the gap between semantic understanding and visual impression for each news article in a fine-grained manner.

To the best of our knowledge, this work is one of the first initiatives to investigate impression-aware recommendation, and there is currently no news recommendation dataset suitable for this research. To this end, we construct a large-scale impression-aware news recommendation dataset, IM-MIND, by adding snapshot impression images to the text-dominated benchmark MIND (Wu et al., 2020a). We conduct in-depth experimental analyses of both quantitative and qualitative results, which demonstrate the effectiveness and necessity of modeling visual impressions for news recommendation. The highlights of this work are summarized as follows:

  • We discuss, at an intuitive level, why users click a news article, and propose to investigate impression-aware news recommendation, which better guides modality selection and model design for click-through-rate prediction.

  • We propose the IMRec framework, which comprehensively exploits visual impression features in a global-local manner and bridges the gap between the semantic meaning of news titles and the visual impression users perceive before clicking.

  • We contribute the visual images of news impressions to the MIND dataset to facilitate this line of research, and demonstrate the effectiveness of the IMRec framework with extensive experiments.

Notably, for ease of modeling, we currently model the visual impression of each news article independently, because the surrounding news articles can only be determined after the candidates are finally ranked and displayed to users. During training and inference, we simulate the visual impression of each news article with the platform's UI rendering software.

2. Related Works

2.1. News Recommendation

In recent years, the explosively growing amount of digital news calls for effective news recommender systems that enable personalized news suggestions. Both the natural language processing and data mining research fields (Phelan et al., 2011; Zheng et al., 2018; Wu et al., 2019a) have witnessed the success of deep learning based models in extracting semantic content features and mining user preferences accordingly (Zhu et al., 2019; Wu et al., 2020b, 2019d, 2019c, 2019b; Wang et al., 2018, 2020; Okura et al., 2017; Hu et al., 2020; An et al., 2019). Diverse models concerning RNNs (Okura et al., 2017), attention mechanisms (Wu et al., 2019c, b; Zhu et al., 2019), dilated convolution (Wang et al., 2020), graph neural networks (Hu et al., 2020), and knowledge-aware networks (Wang et al., 2018) have been explored. Typically, Wu et al. (Wu et al., 2019e) leverage both self-attention (Vaswani et al., 2017) and additive attention (Bahdanau et al., 2015) to represent words within one news article and multiple news articles within the user's historical interactions. FIM (Wang et al., 2020) is a state-of-the-art recommendation model that captures fine-grained interest matching signals using dilated convolutions. However, most of these works solely model the news title and disregard other modalities that might strongly contribute to users' click behavior, such as the news cover image. Towards this end, we propose to incorporate the necessary modalities and design news recommendation-specific architectures.

2.2. Multi-modal Recommendation

Online content sharing platforms are becoming rich in modalities thanks to rapidly developing network communication technologies. Therefore, as a nascent research field, multi-modal recommendation has recently attracted increasing attention (Arapakis et al., 2009; Yu et al., 2019; Zhang et al., 2020a), with applications in domains such as music recommendation (Kuo et al., 2013), location recommendation (Wang et al., 2015), movie recommendation (Zhao et al., 2018), micro-video recommendation (Wei et al., 2019, 2020), and fashion recommendation (Chelliah et al., 2019). Notably, MMGCN (Wei et al., 2019) and GRCN (Wei et al., 2020) construct a user-item bipartite graph and conduct information propagation and embedding learning for each modality; GRCN differs from MMGCN by refining each graph's connections and denoising implicit feedback at the fine-grained modality level. Despite these great successes, we argue that most of these methods model multiple modalities without the guidance of domain knowledge. To be specific, most architectures disregard a fundamental question, "why do users click", and fail to model the impression that can be essential for users' decision making. In this paper, we propose an impression-aware recommendation framework designed especially for news recommendation by explicitly modeling users' click decision-making processes.

3. Methods

3.1. Problem Formulation

Following the common practice in modern news recommender systems (Wu et al., 2019c, e), we formulate news recommendation as a sequential recommendation problem and specifically focus on news click-through-rate (CTR) prediction. We use $u$ to denote one user and $[h_1, h_2, \ldots, h_N]$ to denote the sequence of news historically clicked by user $u$ on an online news platform, ordered by click time beforehand. News CTR prediction aims to predict whether the user will click a candidate news $c$, with the binary label denoted as $y \in \{0, 1\}$. A deep learning based news recommendation model takes a pair of user and candidate news $(u, c)$ as input and predicts a probability $\hat{y}$ indicating how likely the click is to happen. During testing and serving, candidate news are ranked by these probabilities and displayed on the news platform at positions consistent with the ranks.
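To make the formulation concrete, the sketch below shows the two-tower scoring interface implied by this setup, in PyTorch-style code. The method names (`news_encoder`, `user_encoder`) and the dot-product scorer are our illustrative assumptions, not released code.

```python
import torch

def score_candidate(model, clicked_news, candidate_news):
    """Sketch of CTR scoring: encode the click history, build the user
    representation, and score one candidate news against it."""
    history = torch.stack([model.news_encoder(n) for n in clicked_news])
    u = model.user_encoder(history)            # user representation from history
    r_c = model.news_encoder(candidate_news)   # candidate news representation
    # Dot-product ranking score; training normalizes it against sampled
    # negatives (Section 3.4), so no sigmoid is applied here.
    return u @ r_c
```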

Figure 2. Schematic illustration of the impression-aware news recommendation framework applied to NRMS. We represent each news article using local impression modeling, which explicitly captures multi-modal visual cues within the visual impression and accordingly enhances the semantic understanding of the news title, and global impression modeling, which models the visual impression as a whole by further taking the arrangement of different fields and the relative positions of title words into consideration.

3.2. Impression-aware Recommendation

To explicitly model users' click decision-making processes (depicted in Figure 1) and bridge the gap between the semantic understanding and the visual impression of news, we devise the impression-aware recommendation framework, denoted as IMRec. As depicted in Figure 2, NRMS with the IMRec framework (denoted as NRMS-IM) incorporates the local details of the visual impression into the semantic understanding of news titles. This design is inspired by users' browsing process: users not only read the meaning of titles but also perceive many impression details, such as the visual appearance of words and regions in the news cover image. We denote such a process as local impression modeling. Moreover, once users have captured all the details, they might construct a holistic recognition of the news; accordingly, we incorporate the fused representation of all modalities to further enhance the details. We denote this holistic modeling as global impression modeling, which introduces structural information from a global view, such as the arrangement of different fields and the relative positions of words. In the following sections, we formally describe these two processes based on the sequence model NRMS.

3.2.1. Local Impression Modeling

Local impression modeling aims to capture local impression details while simultaneously understanding the semantic meaning of news titles. Towards this end, we devise an impression decomposition process that explicitly extracts meaningful cues from the impression image beforehand, and an impression-semantic reasoning module that bridges the modality and structural gaps between impression and semantics.

Impression Decomposition. To ease the modeling of local details in the impression image, we propose to extract meaningful cues in a pre-processing manner, which yields gains analogous to those observed in many other domains (Lu et al., 2019; Anderson et al., 2018). However, different from previous works that commonly employ an object detector (Ren et al., 2015) designed for natural scenes, we propose to first divide the impression image into several salient parts and extract cues from the corresponding feature maps. Since the impression image is well structured, we obtain the news title part, the news cover image part, and the news category part with simple edge detection techniques. For the news title part, we view each word region as an individual cue that users can potentially attend to when understanding the semantic meaning of the title, and denote its vectorial representation as $m_j$. For the news cover image part, we view each region vector in the feature map extracted by a pre-trained CNN as a cue. For the news category part, we directly view the whole region, with its vectorial representation, as a cue. Details of the pre-processing and the pre-trained architectures used can be found in Section 5.1.

To ease modeling, we group all the pre-extracted cues together to construct an impression cue memory $M = [m_1, m_2, \ldots, m_C]$, where $C$ is the total number of cues. Since the representations of different cues are obtained with the same feature extractor, they naturally lie on the same embedding hypersphere, and we treat them equally in the following modeling.
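A minimal sketch of this decomposition in PyTorch is given below. The backbone truncation, the crude equal-width split of the title feature map (the paper splits by title lines and then by word lengths), and all tensor shapes are assumptions for illustration; Section 5.1 gives the actual settings.

```python
import torch
import torchvision.models as models

# Keep the convolutional trunk of ResNet-101, dropping avg-pool and fc.
backbone = torch.nn.Sequential(*list(models.resnet101(pretrained=True).children())[:-2])
backbone.eval()

@torch.no_grad()
def decompose_impression(title_crop, image_crop, category_crop, n_words):
    """Each *_crop is a (1, 3, H, W) tensor cut from the impression image
    (the paper locates the parts with simple edge detection)."""
    cues = []
    # Title part: split the feature map into per-word regions and mean-pool.
    fmap = backbone(title_crop)                       # (1, C, h, w)
    for word_map in fmap.chunk(n_words, dim=-1):      # crude equal-width split
        cues.append(word_map.mean(dim=(-2, -1)).squeeze(0))
    # Cover-image part: pool the feature map into a 3x3 grid of region cues.
    grid = torch.nn.functional.adaptive_avg_pool2d(backbone(image_crop), (3, 3))
    cues.extend(grid.flatten(2).squeeze(0).t())       # 9 cues of dimension C
    # Category part: one cue for the whole region.
    cues.append(backbone(category_crop).mean(dim=(-2, -1)).squeeze(0))
    return torch.stack(cues)                          # impression cue memory M
```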

Impression-Semantic Reasoning. Given one news title comprised of a sequence of $L$ words, we first embed the sequence into low-dimensional representations $E = [e_1, e_2, \ldots, e_L]$. To capture the correlations between impression and semantics, we view the impression cues as external knowledge and follow the memory network schema (Sukhbaatar et al., 2015):

$a_{i,j} = \mathrm{softmax}_j\big( (W_q e_i)^{\top} (W_k m_j) \big)$  (1)
$\tilde{e}_i = W_v e_i + \sum_{j=1}^{C} a_{i,j} m_j$  (2)

where $W_q$, $W_k$, and $W_v$ denote linear transformations with bias terms, and $a_{i,j}$ denotes the extent to which the user attends to impression cue $m_j$ when reading word $e_i$. Adding the sum of all attended cues to the linearly transformed word embedding yields the impression-aware word representation $\tilde{e}_i$. To further reason on the impression-semantic joint representations, we next leverage the semantic dependencies implied by self-attention weights:

$\hat{e}_i = \sum_{k=1}^{L} \alpha_{i,k} \, W_2 \tilde{e}_k, \quad \alpha_{i,k} = \mathrm{softmax}_k\big( \tilde{e}_i^{\top} W_1 \tilde{e}_k \big)$  (3)

where $W_1$ and $W_2$ are linear transformations, and $\alpha_{i,k}$ denotes the extent to which the model attends to the impression-semantic representation of the $k$-th word to enhance the final representation of the $i$-th word. The holistic representation $t$ of one news title is obtained by summing over all words with additive attention weights (Bahdanau et al., 2015):

$q_i = \tanh(W_a \hat{e}_i + b_a)$  (4)
$\beta_i = \mathrm{softmax}_i\big( q_i^{\top} v \big)$  (5)
$t = \sum_{i=1}^{L} \beta_i \hat{e}_i$  (6)

where $W_a$ transforms $\hat{e}_i$ into a hidden space and the query vector $v$ computes the attention weights for aggregation.
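The following sketch wires Eqs. (1)-(6) together as a single PyTorch module. The parameter names mirror the equations as reconstructed above and are our own; dimensions follow the implementation details in Section 5.1 only loosely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalImpressionModule(nn.Module):
    """Memory attending over impression cues (Eqs. 1-2), word-level
    self-attention (Eq. 3), and additive attention pooling (Eqs. 4-6)."""
    def __init__(self, d_word, d_cue, d_hidden=200):
        super().__init__()
        self.W_q = nn.Linear(d_word, d_cue)    # query over cues
        self.W_k = nn.Linear(d_cue, d_cue)     # key for each cue
        self.W_v = nn.Linear(d_word, d_cue)    # word-embedding transform
        self.W_1 = nn.Linear(d_cue, d_cue)     # self-attention key projection
        self.W_2 = nn.Linear(d_cue, d_cue)     # self-attention value projection
        self.W_a = nn.Linear(d_cue, d_hidden)  # additive attention projection
        self.v = nn.Parameter(torch.randn(d_hidden))

    def forward(self, E, M):
        # E: (L, d_word) word embeddings; M: (C, d_cue) impression cue memory.
        a = F.softmax(self.W_q(E) @ self.W_k(M).t(), dim=-1)       # Eq. (1)
        E_im = self.W_v(E) + a @ M                                 # Eq. (2)
        s = F.softmax(E_im @ self.W_1(E_im).t(), dim=-1)
        E_hat = s @ self.W_2(E_im)                                 # Eq. (3)
        beta = F.softmax(torch.tanh(self.W_a(E_hat)) @ self.v, dim=0)
        return beta @ E_hat               # title representation t, Eqs. (4)-(6)
```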

3.2.2. Global Impression Modeling.

Local impression modeling captures impression cues separately, which means we disregard the correlations and interactions between different impression cues. A straightforward remedy is to model them with traditional multi-modal fusion techniques. However, directly employing off-the-shelf techniques might lose structural information, such as the location arrangement of different fields and the spatial positions of different words. Therefore, instead of fusing the different cues separately, we propose to encode the impression image as a whole with pre-trained extractors. Given the global impression embedding $g$ and the title representation $t$, we have:

$z = \sigma\big( W_g [t \,; g] \big)$  (7)
$r = z \odot t + (1 - z) \odot g$  (8)

where $W_g$ is a linear transformation, $\sigma$ denotes the sigmoid function, and $z$ serves as a gate that controls how much information we let through from $t$ and $g$ by taking both into consideration. Such a gate is reasonable in the sense that users might not be equally interested in the impression and the textual semantics, and $z$ indicates a tradeoff between these two factors. The gated output $r$ serves as the final representation of the news.
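A compact sketch of the gate in Eqs. (7)-(8) follows; the projection of the 2048-d global feature into the model space is our assumption for dimensional compatibility.

```python
import torch
import torch.nn as nn

class GlobalImpressionGate(nn.Module):
    """Sigmoid gate trading off the title representation t against the
    global impression embedding g (Eqs. 7-8)."""
    def __init__(self, d_model, d_global=2048):
        super().__init__()
        self.proj = nn.Linear(d_global, d_model)   # map ResNet feature to model space
        self.W_g = nn.Linear(2 * d_model, d_model)

    def forward(self, t, g):
        g = self.proj(g)
        z = torch.sigmoid(self.W_g(torch.cat([t, g], dim=-1)))  # Eq. (7)
        return z * t + (1.0 - z) * g                # news representation r, Eq. (8)
```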

3.3. User Encoder

For the user encoder, we employ an off-the-shelf sequence modeling tool, i.e., the self-attention mechanism, to capture the correlations between the different news $[r_1, r_2, \ldots, r_N]$ historically clicked by the user. This can be formulated as:

$\hat{r}_i = \sum_{k=1}^{N} \alpha_{i,k} \, W_V r_k, \quad \alpha_{i,k} = \mathrm{softmax}_k\big( (W_Q r_i)^{\top} (W_K r_k) \big)$  (9)

where $W_Q$, $W_K$, and $W_V$ denote linear transformations. In practice, we use multi-head self-attention for better performance and concatenate the outputs of the multiple heads. Similarly, we obtain the final user representation $u$ by aggregating all enhanced item representations with additive attention weights:

$\gamma_i = \mathrm{softmax}_i\big( \tanh(W_u \hat{r}_i + b_u)^{\top} v_u \big)$  (10)
$u = \sum_{i=1}^{N} \gamma_i \hat{r}_i$  (11)
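A sketch of Eqs. (9)-(11) using PyTorch's built-in multi-head attention is shown below; the head count and per-head size follow Section 5.1 (3 heads of dimension 50), while the batching convention is our choice.

```python
import torch
import torch.nn as nn

class UserEncoder(nn.Module):
    """Multi-head self-attention over clicked-news vectors (Eq. 9) followed
    by additive attention pooling (Eqs. 10-11)."""
    def __init__(self, d_model=150, n_heads=3, d_hidden=200):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.W_u = nn.Linear(d_model, d_hidden)
        self.v_u = nn.Parameter(torch.randn(d_hidden))

    def forward(self, R):
        # R: (B, N, d_model) representations of the N historically clicked news.
        R_hat, _ = self.mhsa(R, R, R)                                 # Eq. (9)
        gamma = torch.softmax(torch.tanh(self.W_u(R_hat)) @ self.v_u, dim=1)
        return (gamma.unsqueeze(-1) * R_hat).sum(dim=1)     # u, Eqs. (10)-(11)
```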

3.4. Training and Discussion

Given one candidate news $c$ for which we should predict how likely a user will click it, we first transform the user and the candidate into dense vectors $u$ and $r_c$ using IMRec, treat the inner product $\hat{y} = u^{\top} r_c$ as the click indicator, and take the binary label $y$ as the expected output. Motivated by Wu et al. (Wu et al., 2019e) and Wang et al. (Wang et al., 2020), we use negative sampling techniques and the cross-entropy loss for model training:

$\mathcal{L} = - \sum_{i=1}^{P} \log \frac{\exp(\hat{y}_i^{+})}{\exp(\hat{y}_i^{+}) + \sum_{j=1}^{K} \exp(\hat{y}_{i,j}^{-})}$  (12)

where $P$ is the number of positive training samples, $K$ is the number of negative training samples for each positive sample, and $\hat{y}_{i,j}^{-}$ denotes the score of the $j$-th negative sample in the same group as the $i$-th positive sample.
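Equation (12) reduces to a standard cross-entropy over each positive score and its K grouped negative scores, as the sketch below shows; the tensor shapes are our assumptions.

```python
import torch
import torch.nn.functional as F

def imrec_loss(pos_scores, neg_scores):
    """Negative-sampling cross-entropy of Eq. (12).
    pos_scores: (P,) scores of the positive samples.
    neg_scores: (P, K) scores of the K negatives grouped with each positive."""
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)  # (P, K+1)
    target = torch.zeros(pos_scores.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, target)
```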

3.5. IMRec Applied to FIM

Notably, the local impression modeling and global impression modeling modules are model-agnostic and can be applied to other CTR prediction models with ease. In the experiments, we extend another SOTA method, FIM (Wang et al., 2020), a non-sequence model that employs dilated CNNs and computes the matching between each historically interacted news and the target news at a fine-grained level, to an impression-aware version, FIM-IM. Thereby, we demonstrate the plug-and-play capability of the proposed modules. Specifically, in the FIM-IM model, only the memory network schema of the local impression module is applied to the initial word embeddings, due to the high computation cost of FIM. For global impression modeling, we linearly transform the global impression features into low-dimensional representations, based on which we directly compute matching scores between each historically interacted news and the candidate news; these matching scores are concatenated with the last layer's output before prediction. Given the integrated matching vector $s$ of a user and candidate news pair and the corresponding global impression representation $g_c$ of the candidate news, we calculate the final click probability as follows:

$\hat{y} = \sigma\big( w^{\top} [\, s \,; W_g g_c \,] + b \big)$  (13)

where $w$, $W_g$, and $b$ are learnable parameters, and $[\cdot \,; \cdot]$ denotes the concatenation operation. The loss function is consistent with that of the NRMS-IM model.
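A sketch of this prediction head under the notation of Eq. (13); the dimension of FIM's matching vector is left symbolic since it depends on the FIM backbone, and all names are ours.

```python
import torch
import torch.nn as nn

class FIMImpressionHead(nn.Module):
    """Concatenate FIM's integrated matching vector s with a projected
    global impression feature g_c and output a click probability (Eq. 13)."""
    def __init__(self, d_match, d_global=2048, d_proj=128):
        super().__init__()
        self.W_g = nn.Linear(d_global, d_proj)     # project impression feature
        self.w = nn.Linear(d_match + d_proj, 1)    # final scorer (w, b)

    def forward(self, s, g_c):
        fused = torch.cat([s, self.W_g(g_c)], dim=-1)
        return torch.sigmoid(self.w(fused)).squeeze(-1)   # click probability
```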

4. Datasets

4.1. Dataset Construction

Figure 3. Sampled cases of news visual impression.

To the best of our knowledge, there is no existing news recommendation dataset suitable for impression-aware news recommendation. Therefore, we automatically construct two benchmark datasets (to be released at https://github.com/JiahaoXun/IMRec) based on the MIND-News dataset (Wu et al., 2020a), following the styles, sizes, and spatial arrangement of different fields according to the visual impressions presented in (Wu et al., 2020a) and the HTML code of the Microsoft News platform.

To extract news impressions, we crawled the cover image from each given news URL and then combined the news images with the texts (title and category) to generate news visual impressions, as shown in Figure 3. Each news card is 615×195 px with a white background. All news images were resized to 200×165 px and pasted at the region from (15, 15) to (215, 180). The news title starts at the location (215, 180) of the background with 10.5 px line spacing; each title spans at most 3 lines, with at most 27 characters per line, rendered in the seguisb font at size 27. The news category starts at the location (227.725, 142.5) of the background, rendered in the segoeui font at size 24. If a news article has no image or its image URL is unavailable, the image area is left empty. Since the MIND-News dataset has large and small versions, we generate visual impressions for both and construct IM-MIND-Large and IM-MIND-Small accordingly.
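The construction can be reproduced roughly as below with Pillow, using the layout constants reported above. The exact title x-offset, the line-advance rule, the font file paths, and the text-wrapping policy are assumptions, since only partial coordinates are given.

```python
from PIL import Image, ImageDraw, ImageFont

def render_impression(image_path, title_lines, category, out_path):
    """Render one 615x195 news card: cover image, up to 3 title lines, category."""
    card = Image.new("RGB", (615, 195), "white")
    if image_path:                                   # leave area empty if no image
        cover = Image.open(image_path).resize((200, 165))
        card.paste(cover, (15, 15))
    draw = ImageDraw.Draw(card)
    title_font = ImageFont.truetype("seguisb.ttf", 27)   # assumed font file paths
    cat_font = ImageFont.truetype("segoeui.ttf", 24)
    y = 15.0                                         # assumed title start offset
    for line in title_lines[:3]:                     # at most 3 lines, <=27 chars each
        draw.text((227, y), line[:27], font=title_font, fill="black")
        y += 27 + 10.5                               # font size + 10.5 px line spacing
    draw.text((227.725, 142.5), category, font=cat_font, fill="gray")
    card.save(out_path)
```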

4.2. Dataset Statistics

Statistic                    IM-MIND-Small   IM-MIND-Large
Users                        94057           876956
News                         65238           130379
News w/ image                27244           54421
Avg. clicked news            21.66           17.03
Avg. clicked news w/ image   8.82            6.35
Avg. title lines             2.76            2.76
Avg. words per line          4.12            4.11
Table 1. Statistics of IM-MIND-Small and IM-MIND-Large. News w/ image denotes news that contain a cover image.

The detailed statistics of the IM-MIND-Small and IM-MIND-Large datasets are summarized in Table 1, where News w/ image denotes news that contain a cover image. The whole dataset contains 876956 users and 130379 news articles, among which 54421 news have available cover images. The avg./max./min./med. numbers of clicked news per user are 17.35, 801, 0, and 10, and the avg./max./min./med. numbers of clicked news with an image are 6.54, 356, 0, and 4. The avg./max. numbers of title lines are 2.76 and 3, and the avg./max./min. numbers of words per title line are 4.12, 15, and 0 (punctuation only).

Datasets Metric DeepFM DKN NPA LSTUR NRMS FIM NRMS-IM FIM-IM Improv.
MIND-Small AUC 0.6542 0.6290 0.6465 0.6587 0.6585 0.6572 0.6619 0.6661 1.12%
NDCG@5 0.3378 0.3099 0.3314 0.3395 0.3414 0.3424 0.3465 0.3526 2.98%
NDCG@10 0.4025 0.3741 0.3947 0.4015 0.4051 0.4044 0.4097 0.4146 2.35%
MRR 0.3084 0.2837 0.3001 0.3078 0.3097 0.3091 0.3132 0.3199 3.29%
MIND-Large AUC 0.6591 0.6715 0.6752 0.6801 0.6762 0.6845 0.6866 0.6912 0.98%
NDCG@5 0.3446 0.3531 0.3581 0.3629 0.3575 0.3682 0.3688 0.3725 1.17%
NDCG@10 0.4070 0.4171 0.4217 0.4265 0.4224 0.4313 0.4317 0.4364 1.18%
MRR 0.3140 0.3206 0.3261 0.3290 0.3227 0.3313 0.3305 0.3364 1.54%
Table 2. Overall performance comparison with state-of-the-art news recommenders.

5. Experiments

We analyze the IMRec framework and demonstrate its effectiveness by answering the following research questions:


  • RQ1: How does IMRec perform compared with existing state-of-the-art news recommender systems?

  • RQ2: Do the local and global impression modeling modules both contribute to the effectiveness of base models in a model-agnostic manner?

  • RQ3: How does IMRec perform in practical news recommendation scenarios (e.g., cold-start setting, unseen users)?

  • RQ4: How does IMRec improve the performance internally?

5.1. Experimental Settings

Implementation Details. The word embeddings are 100-dimensional and initialized with pre-trained GloVe vectors (Pennington et al., 2014). We use a pre-trained ResNet-101 (He et al., 2016) from torchvision (https://pytorch.org/vision/stable/models.html) to extract local and global visual impression features. Specifically, for the visual word impressions, we remove the last two layers of ResNet-101 and obtain a feature map of size (512, 28, 28) with solely the title region as input. We vertically divide the feature map by the lines of the title and further horizontally divide each resulting map by the lengths of the words to obtain per-word feature maps, which we mean-pool into a 512-dimensional impression feature for each word. For the cover image impression, we use the same pipeline except that we equally divide the feature map into 9 regions and mean-pool each region's feature map into a region feature. For the global impression, we remove the last layer of ResNet-101 and obtain a 2048-dimensional vector representing the global impression of the whole news card. The negative sampling ratio $K$ is set to 4. Adam (Kingma and Ba, 2015) is used as the optimizer, the batch size is 32, and the initial learning rate is 1e-4. These hyper-parameters apply to both NRMS-IM and FIM-IM.


  • NRMS-IM. The self-attention networks have 3 heads, and the output dimension of each head is 50. The dimension of the additive attention query vectors is 200. The maximum length of the tokenized word sequence of a news title is set to 15. At most 60 browsed news are kept to represent the user's recent reading behaviors.

  • FIM-IM. The maximum length of the tokenized word sequence of a news title is set to 30, and at most 50 browsed news are kept to represent the user's recent reading behaviors. Other hyper-parameter settings follow the original paper (Wang et al., 2020).

Evaluation Criteria. Following (Wu et al., 2019c), we employ three widely used metrics for evaluation, i.e., AUC (Area Under the ROC Curve), NDCG (Normalized Discounted Cumulative Gain), and MRR (Mean Reciprocal Rank).
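For reference, a minimal per-impression implementation of the ranking metrics is sketched below, consistent with the public MIND evaluation protocol (metrics are computed per impression and then averaged); the helper names are ours, and at least one click per impression is assumed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # per-impression AUC

def mrr_score(labels, scores):
    """Mean reciprocal rank of the clicked items within one impression."""
    ranked = np.asarray(labels)[np.argsort(scores)[::-1]]
    rr = ranked / np.arange(1, ranked.size + 1)
    return float(rr.sum() / ranked.sum())

def ndcg_score(labels, scores, k):
    """NDCG@k for the binary click labels of one impression."""
    labels = np.asarray(labels, dtype=float)
    ranked = labels[np.argsort(scores)[::-1]][:k]
    ideal = np.sort(labels)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    return float((ranked * discounts[:ranked.size]).sum()
                 / (ideal * discounts[:ideal.size]).sum())

# Example for one impression: report roc_auc_score(labels, scores),
# mrr_score(labels, scores), and ndcg_score(labels, scores, 5 or 10),
# then average each metric over all impressions in the test set.
```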

IM-MIND-Small IM-MIND-Large
Model AUC NDCG@5 MRR AUC NDCG@5 MRR
NRMS-IM 0.6619 0.3465 0.3132 0.6866 0.3688 0.3305
- L_IM 0.6612 0.3440 0.3108 0.6837 0.3674 0.3295
- G_IM 0.6591 0.3427 0.3124 0.6809 0.3616 0.3257
NRMS 0.6585 0.3414 0.3097 0.6762 0.3575 0.3227
FIM-IM 0.6661 0.3526 0.3199 0.6912 0.3725 0.3364
- L_IM 0.6640 0.3479 0.3144 0.6909 0.3704 0.3349
- G_IM 0.6629 0.3512 0.3183 0.6902 0.3689 0.3323
FIM 0.6572 0.3424 0.3091 0.6845 0.3682 0.3313
Table 3. Ablation studies by selectively discarding the local impression modeling module (- L_IM) and global impression modeling module (- G_IM). We study both NRMS-IM and FIM-IM to reveal the modal-agnostic capability of the proposed modules.

Comparison Baseline Methods. For a comprehensive comparison to NRMS-IM and FIM-IM, we incorporate state-of-the-art baseline methods concerning both manual feature-based approaches and neural recommendation ones:


  • DeepFM (Xue et al., 2017). DeepFM combines a deep neural network and a factorization machine in parallel. We implement it using the same features as LibFM.

  • DKN (Wang et al., 2018)

    . DKN leverages entity embeddings from knowledge graphs as external knowledge for news recommendation.

  • NPA (Wu et al., 2019c). NPA uses user ID embeddings to weight each word/news and thus captures important features.

  • LSTUR (An et al., 2019). LSTUR takes the topic/subtopic as input of news encoder and uses GRU to fuse interacted news and the user embedding.

  • NRMS (Wu et al., 2019b). NRMS uses the multi-head self-attention to encode both news and users.

  • FIM (Wang et al., 2020). FIM employs dilated CNN and computes the matching between each historically interacted news and the target news in a fine-grained level.

All User Seen User Unseen User
Model AUC N@5 N@10 MRR AUC N@5 N@10 MRR AUC N@5 N@10 MRR
All News NRMS-IM 0.6866 0.3688 0.4317 0.3305 0.6901 0.3684 0.4317 0.3298 0.6630 0.3712 0.4317 0.3353
NRMS 0.6762 0.3575 0.4224 0.3227 0.6801 0.3577 0.4229 0.3225 0.6496 0.3557 0.4190 0.3243
FIM-IM 0.6912 0.3725 0.4364 0.3364 0.6941 0.3716 0.4359 0.3351 0.6717 0.3784 0.4397 0.3452
FIM 0.6845 0.3682 0.4313 0.3312 0.6877 0.3676 0.4312 0.3302 0.6625 0.3723 0.4321 0.3379
News w/ image NRMS-IM 0.6984 0.4273 0.4869 0.3812 0.7010 0.4246 0.4880 0.3816 0.6794 0.4175 0.4785 0.3785
NRMS 0.6909 0.4149 0.4798 0.3733 0.6939 0.4163 0.4813 0.3741 0.6691 0.4050 0.4691 0.3673
FIM-IM 0.6991 0.4225 0.4866 0.3802 0.7021 0.4240 0.4882 0.3811 0.6768 0.4117 0.4751 0.3740
FIM 0.6974 0.4218 0.4862 0.3815 0.7005 0.4232 0.4876 0.3822 0.6742 0.4113 0.4758 0.3759
News w/o image NRMS-IM 0.6724 0.4819 0.5379 0.4208 0.6764 0.4814 0.5380 0.4203 0.6443 0.4859 0.5374 0.4246
NRMS 0.6575 0.4688 0.5266 0.4098 0.6620 0.4690 0.5273 0.4100 0.6257 0.4676 0.5220 0.4085
FIM-IM 0.6782 0.4895 0.5450 0.4296 0.6810 0.4880 0.5442 0.4281 0.6588 0.4998 0.5507 0.4398
FIM 0.6697 0.4806 0.5378 0.4210 0.6731 0.4797 0.5374 0.4201 0.6464 0.4870 0.5405 0.4275
Table 4. Analysis on different user/news groups. IMRec framework shows consistent improvement across various scenarios.

5.2. Overall Results (RQ1)

Table 2 lists the comparison results of NRMS-IM and FIM-IM with state-of-the-art neural recommendation methods on the MIND-Small and MIND-Large datasets. From the results, we can find that:


  • Overall, the results across multiple evaluation metrics consistently indicate that NRMS-IM and FIM-IM both achieve better results than various SOTA designs. We note that these improvements are significant and comparable to the gains of recent SOTAs over their predecessors (e.g., FIM over NRMS).

  • Surprisingly, DeepFM achieves competitive performance on the MIND-Small dataset and outperforms many advanced designs such as LSTUR with its GRU and NPA with its attention mechanism. However, it achieves significantly inferior results on the large-scale MIND-Large dataset. The reason might be that FM-based methods fail to handle highly sparse and complex correlations. In contrast, NRMS-IM and FIM-IM achieve consistently convincing results on both datasets.

  • Compared to DKN, which also exploits additional information (i.e., entities in a knowledge graph) to enhance news representation learning, FIM-IM shows a clear advantage on both datasets. Notably, FIM-IM improves over DKN by +0.0371 AUC (relatively 5.9%), +0.0427 NDCG@5 (relatively 13.7%), +0.0405 NDCG@10 (relatively 10.8%), and +0.0362 MRR (relatively 12.8%) on the MIND-Small dataset. These results show that, compared to further enhancing the semantic understanding itself as DKN does, it might be more promising to introduce visual impressions that explicitly approximate the user's click decision process.

  • Compared to the other attention-based approaches, i.e., NPA and NRMS, NRMS-IM also exhibits better performance, especially on the large-scale MIND-Large dataset. These results indicate that modeling semantic-impression correlations (memory attending) can help improve semantic-semantic correlation modeling.

  • FIM-IM takes the strong baseline FIM, which uses CNNs as building blocks, and equips it with the impression modeling modules. FIM-IM achieves state-of-the-art results with substantial improvement, demonstrating that the proposed local/global impression modeling can improve a ranking baseline with an arbitrary architecture in a plug-and-play manner.

Percentage AUC NDCG@5 NDCG@10 MRR
100% 0.6866 0.3688 0.4317 0.3305
75% 0.6846 0.3673 0.4307 0.3297
50% 0.6832 0.3653 0.4287 0.3273
25% 0.6815 0.3652 0.4284 0.3275
0% 0.6762 0.3575 0.4224 0.3227
Table 5. Performance on NRMS-IM by varying the percentage of visual impression used in training.

5.3. Model Analysis (RQ2, RQ3)

5.3.1. Analysis on key building blocks (Ablation Study).

Local impression modeling and global impression modeling are the two key components of the IMRec framework. We conduct an ablation study on them to reveal the efficacy of the architectures and the benefits of incorporating local/global impression information. Specifically, we selectively discard the local impression modeling module and the global impression modeling module from NRMS-IM to generate the ablation architectures - L_IM and - G_IM, respectively. We also conduct the same ablation study on the FIM-IM model to show the model-agnostic capability of these two modules. The results are shown in Table 3. We can observe that:


  • Removing either L_IM or G_IM leads to performance degradation, and removing both modules (i.e., the base model) leads to the worst performance. These results demonstrate the effectiveness of the proposed two modules as well as the benefits of introducing visual impressions for news recommendation. We attribute this superiority to the fact that we can explicitly get close to the click decision-making process by modeling the interactions of visual impression and semantic understanding of news titles.

  • Removing G_IM leads to a larger performance drop than removing L_IM. This suggests that modeling the visual cues in the impression image separately, without capturing their spatial arrangement, is inferior to modeling the visual impression as a whole.

  • The results are consistent across different baselines, which demonstrates that the proposed two modules can easily boost a recommendation model in a plug-and-play and model-agnostic manner.

Figure 4. Visualization of impression-semantic correlations by plotting the memory attending weights.

5.3.2. Analysis on different user/news groups.

In real-world news recommendation platforms, there are always unseen users beyond the training set and news without cover images. To reveal the effectiveness of impression modeling in different recommendation scenarios, we conduct an in-depth analysis of these two factors. The results are shown in Table 4. For brevity, we use News w/ image to denote news that contain a cover image and News w/o image to denote news without one. We can see that:


  • We observe a consistent improvement of NRMS-IM/FIM-IM over the base models across various scenarios, which further demonstrates the effectiveness of the IMRec framework and, in particular, its generalization capability across different settings.

  • The results of both models on unseen users are worse than those on seen users, which is reasonable. However, we notice that the improvement of NRMS-IM over NRMS on unseen users is consistently more significant than on seen users. For example, NRMS-IM achieves a 0.0155 (relatively 4.36%) NDCG@5 improvement on unseen users (All News setting) but a 0.0107 (relatively 2.99%) NDCG@5 improvement on seen users (All News setting). Since incorporating multi-modal content is essential for the cold-start setting (unseen users), these results show that introducing visual impressions is a promising direction for news recommendation.

  • Both models yield better results on News w/ image than on News w/o image. We attribute this phenomenon to the fact that users might click a news article without a cover image and eventually find it less attractive; in other words, click behaviors on News w/ image are less noisy, and users' interests in such news are more consistent and easier to capture. Interestingly, the improvement from impression modeling on News w/o image is more significant than on the other groups. Considering that interests in News w/o image are generally harder to capture using title texts alone, these results indicate the advantage of impression modeling in dealing with such news (e.g., the visual appearance of words might help attract users' attention). Overall, impression modeling yields improvement on all news.

5.3.3. Analysis on the percentage of visual impressions used in training.

For this experiment, we disregard the visual impressions of a randomly sampled 25%, 50%, or 75% of news during training. In other words, news whose visual impressions are masked are represented solely by the semantic meaning of their titles. We conduct the experiment on the IM-MIND-Large dataset with NRMS-IM. As shown in Table 5, the metrics grow monotonically as the percentage of news with visual impressions increases, which suggests that the IMRec framework boosts the performance of the base model by effectively modeling the visual impression.

5.4. Qualitative Analysis (RQ4)

The above analyses quantitatively show the effectiveness of impression-aware news recommendation. We take a further step to reveal how the IMRec framework internally improves the performance of semantic-only news recommender systems. As shown in Figure 4, we plot the memory attending weights of each impression word on each textual semantic word in the local impression module, which explicitly indicates the semantic-impression correlations at a fine-grained level. We note that the cases are sampled from the IM-MIND-Large dataset and are unseen by NRMS-IM during training. Since news with a cover image will intuitively enhance the semantic representation by providing an additional modality, we disregard such cases here and focus on the impression words, which are harder to leverage. Based on the visualization, we can find that:


  • Impression words at the left beginning of each line (e.g., review in the second case, lock in the third case) typically obtain larger attention weights than the others. This finding is intuitive, as users read the impression title in a left-to-right manner. The IMRec framework automatically captures such a visual correlation pattern and accordingly enhances the semantic word representations.

  • Semantic words are more likely to attend to impression words that are spatially close to their own impression counterparts. For example, in the third case, the semantic words Richardson and take both attend to the impression word holidays, all of which are at the beginnings of lines. This result further demonstrates that the IMRec framework captures impression cues beyond the sequential dependencies in semantics.

  • A few impression words obtain most of the attention, showing that the IMRec framework succeeds in capturing the critical points in the impression rather than roughly attending to all impression cues.

6. Conclusion and Future Work

In this work, we investigate users' decision-making process when browsing and clicking news, and propose a visual impression-aware modeling framework, IMRec, for multi-modal news recommendation. IMRec explicitly approximates users' news reading process by simultaneously attending to local details within the impression when understanding the news title. Furthermore, IMRec fuses the multi-modal local details by considering their global arrangement on the impression. We contribute visual images of news impressions to the MIND dataset to promote this line of research. Extensive experiments demonstrate the efficacy of IMRec: both NRMS-IM and FIM-IM achieve better results than various state-of-the-art designs.

To the best of our knowledge, this work is one of the first initiatives to incorporate visual impressions for news recommendation. By modeling visual impressions, we can safely disregard unnecessary modalities that are inaccessible before users click the news and design application-specific modules for mining users' interests. We believe this idea can inspire other researchers and open up a promising direction for recommendation. We deliberately avoid more complex impression modeling designs in this paper to fairly show that introducing the visual impression itself brings many benefits; incorporating more advanced techniques to boost performance is promising future work. Moreover, since few works in other recommendation domains investigate users' click decision-making process, we plan to extend our idea to these domains in the future.

7. Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant No.2020YFC0832505, National Natural Science Foundation of China under Grant No.61836002, No.62072397 and Zhejiang Natural Science Foundation under Grant No.LR19F020006.

References

  • An et al. (2019) Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu, and Xing Xie. 2019. Neural News Recommendation with Long- and Short-term User Representations.. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers.
  • Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018.
  • Arapakis et al. (2009) Ioannis Arapakis, Yashar Moshfeghi, Hideo Joho, Reede Ren, David Hannah, and Joemon M. Jose. 2009. Integrating facial expressions into user profiling for the improvement of a multimodal recommender system.. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, ICME 2009, June 28 - July 2, 2009, New York City, NY, USA.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate.. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Chelliah et al. (2019) Muthusamy Chelliah, Soma Biswas, and Lucky Dhakad. 2019. Principle-to-program: Neural Fashion Recommendation with Multi-modal Input.. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hu et al. (2020) Linmei Hu, Siyong Xu, Chen Li, Cheng Yang, Chuan Shi, Nan Duan, Xing Xie, and Ming Zhou. 2020. Graph Neural News Recommendation with Unsupervised Preference Disentanglement.. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020.
  • Jin et al. (2019) Weike Jin, Zhou Zhao, Mao Gu, Jun Yu, Jun Xiao, and Yueting Zhuang. 2019. Multi-interaction network with object relation for video question answering. In Proceedings of the 27th ACM international conference on multimedia. 1193–1201.
  • Jin et al. (2021) Weike Jin, Zhou Zhao, Pengcheng Zhang, Jieming Zhu, Xiuqiang He, and Yueting Zhuang. 2021. Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1114–1124.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization.. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
  • Kuo et al. (2013) Fang-Fei Kuo, Man-Kwan Shan, and Suh-Yin Lee. 2013. Background music recommendation for video based on multimodal latent semantic analysis.. In Proceedings of the 2013 IEEE International Conference on Multimedia and Expo, ICME 2013, San Jose, CA, USA, July 15-19, 2013.
  • Li et al. (2019) Juncheng Li, Siliang Tang, Fei Wu, and Yueting Zhuang. 2019. Walking with mind: Mental imagery enhanced embodied qa. In Proceedings of the 27th ACM International Conference on Multimedia. 1211–1219.
  • Li et al. (2021) Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang, Fei Wu, Yi Yang, and Yueting Zhuang. 2021. Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference.
  • Li et al. (2020) Juncheng Li, Xin Wang, Siliang Tang, Haizhou Shi, Fei Wu, Yueting Zhuang, and William Yang Wang. 2020. Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12123–12132.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada.
  • Lu et al. (2021) Yujie Lu, Shengyu Zhang, Yingxuan Huang, Luyao Wang, Xinyao Yu, Zhou Zhao, and Fei Wu. 2021. Future-Aware Diverse Trends Framework for Recommendation. In Proceedings of the Web Conference 2021. 2992–3001.
  • Okura et al. (2017) Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based News Recommendation for Millions of Users.. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation.. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL.
  • Phelan et al. (2011) Owen Phelan, Kevin McCarthy, Mike Bennett, and Barry Smyth. 2011. Terms of a Feather: Content-Based News Recommendation and Discovery Using Twitter.. In Advances in Information Retrieval - 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks.. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada.
  • Tian et al. (2021) Qi Tian, Kun Kuang, Kelu Jiang, Fei Wu, and Yisen Wang. 2021. Analysis and Applications of Class-wise Robustness in Adversarial Training. arXiv preprint arXiv:2105.14240 (2021).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need.. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA.
  • Wang et al. (2020) Heyuan Wang, Fangzhao Wu, Zheng Liu, and Xing Xie. 2020. Fine-grained Interest Matching for Neural News Recommendation.. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020.
  • Wang et al. (2018) Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. DKN: Deep Knowledge-Aware Network for News Recommendation.. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018.
  • Wang et al. (2015) Xiangyu Wang, Yi-Liang Zhao, Liqiang Nie, Yue Gao, Weizhi Nie, Zheng-Jun Zha, and Tat-Seng Chua. 2015. Semantic-Based Location Recommendation With Multimodal Venue Semantics. IEEE Trans. Multim. (2015).
  • Wei et al. (2020) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, and Tat-Seng Chua. 2020. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback.. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020.
  • Wei et al. (2019) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video.. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019.
  • Wu et al. (2019b) Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019b. Neural News Recommendation with Attentive Multi-View Learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019.
  • Wu et al. (2019c) Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019c. NPA: Neural News Recommendation with Personalized Attention.. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019.
  • Wu et al. (2019a) Chuhan Wu, Fangzhao Wu, Mingxiao An, Yongfeng Huang, and Xing Xie. 2019a. Neural News Recommendation with Topic-Aware News Representation.. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers.
  • Wu et al. (2019d) Chuhan Wu, Fangzhao Wu, Mingxiao An, Tao Qi, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019d. Neural News Recommendation with Heterogeneous User Behavior.. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019.
  • Wu et al. (2019e) Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang, and Xing Xie. 2019e. Neural News Recommendation with Multi-Head Self-Attention.. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019.
  • Wu et al. (2020b) Chuhan Wu, Fangzhao Wu, Xiting Wang, Yongfeng Huang, and Xing Xie. 2020b. Fairness-aware News Recommendation with Decomposed Adversarial Learning. CoRR (2020).
  • Wu et al. (2020a) Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, and et al. 2020a. MIND: A Large-scale Dataset for News Recommendation.. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020.
  • Xue et al. (2017) Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep Matrix Factorization Models for Recommender Systems.. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017.
  • Yao et al. (2021) Jiangchao Yao, Feng Wang, KunYang Jia, Bo Han, Jingren Zhou, and Hongxia Yang. 2021. Device-Cloud Collaborative Learning for Recommendation. arXiv preprint arXiv:2104.06624 (2021).
  • Yu et al. (2019) Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. 2019. Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning.. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019.
  • Zhang et al. (2020a) Shengyu Zhang, Tan Jiang, Tan Wang, Kun Kuang, Zhou Zhao, Jianke Zhu, Jin Yu, Hongxia Yang, and Fei Wu. 2020a. Devlbert: Learning deconfounded visio-linguistic representations. In Proceedings of the 28th ACM International Conference on Multimedia. 4373–4382.
  • Zhang et al. (2020b) Shengyu Zhang, Ziqi Tan, Zhou Zhao, Jin Yu, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, and Fei Wu. 2020b. Comprehensive information integration modeling framework for video titling. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2744–2754.
  • Zhang et al. (2021a) Shengyu Zhang, Dong Yao, Zhou Zhao, Tat-Seng Chua, and Fei Wu. 2021a. Causerec: Counterfactual user sequence synthesis for sequential recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 367–377.
  • Zhang et al. (2020c) Zhu Zhang, Zhou Zhao, Zhijie Lin, Xiuqiang He, et al. 2020c. Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding. Advances in Neural Information Processing Systems 33 (2020), 18123–18134.
  • Zhang et al. (2020d) Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. 2020d. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10668–10677.
  • Zhang et al. (2021b) Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, and Zhou Zhao. 2021b. Learning to Rehearse in Long Sequence Memorization. arXiv preprint arXiv:2106.01096 (2021).
  • Zhao et al. (2018) Zhou Zhao, Qifan Yang, Hanqing Lu, Tim Weninger, Deng Cai, Xiaofei He, and Yueting Zhuang. 2018. Social-Aware Movie Recommendation via Multimodal Network Learning. IEEE Trans. Multim. (2018).
  • Zheng et al. (2018) Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A Deep Reinforcement Learning Framework for News Recommendation.. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018.
  • Zhu et al. (2019) Qiannan Zhu, Xiaofei Zhou, Zeliang Song, Jianlong Tan, and Li Guo. 2019. DAN: Deep Attention Neural Network for News Recommendation.. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019.