
MAMO: Memory-Augmented Meta-Optimization for Cold-start Recommendation

by   Manqing Dong, et al.

A common challenge for most current recommender systems is the cold-start problem. Because user-item interactions are lacking, fine-tuned recommender systems cannot handle new users or new items. Recently, several works have introduced the meta-optimization idea into recommendation, i.e. predicting a user's preference from only a few past interactions. The core idea is to learn a globally shared initialization parameter for all users and then learn local parameters for each user separately. However, most meta-learning based recommendation approaches adopt model-agnostic meta-learning for parameter initialization, where the globally shared parameter may lead the model into local optima for some users. In this paper, we design two memory matrices that store task-specific memories and feature-specific memories. Specifically, the feature-specific memories guide the model toward a personalized parameter initialization, while the task-specific memories guide the model in quickly predicting the user preference. We adopt a meta-optimization approach to optimize the proposed method. We test the model on two widely used recommendation datasets and consider four cold-start situations. The experimental results show the effectiveness of the proposed method.



1. Introduction

Personalized recommender systems play an increasingly important role in web and mobile applications. Despite the success of traditional matrix-factorization based recommendation methods (Bokde et al., 2015) and more recent deep-learning based techniques (Zhang et al., 2019), a common challenge for most recommendation methods is the cold-start problem (Wei et al., 2017). Because user-item interactions are lacking, approaches that rely on such interactions cannot handle new users or new items, which correspond to the user cold-start and item cold-start problems.

The traditional way to address the cold-start problem is to bring auxiliary information into the recommender system, e.g., content-based recommender systems (Roy and Guntuku, 2016; Wei et al., 2016) and cross-domain recommenders (Li et al., 2018; Wang et al., 2019). For example, to address the item cold-start problem, Wei et al. (2016) propose a hybrid model in which item features are learned from item descriptions via a stacked denoising autoencoder and then combined with the collaborative filtering model timeSVD++. Wang et al. (2019) propose a cross-domain latent feature mapping model, in which a neighborhood-based cross-domain latent feature mapping method learns a feature mapping function for each cold-start user. However, the major limitation of such approaches is that the learned model may recommend the same items to users with similar content, thereby neglecting personal interests.

Inspired by recent works in few-shot learning (Wang and Yao, 2019), meta-learning (Vanschoren, 2018) has been introduced into recommender systems to solve the cold-start problem, i.e. predicting a user's preferences from only a few past interactions. Most current meta-learning based recommender systems (Chen et al., 2018a; Lee et al., 2019; Zhao et al., 2019) adopt optimization-based algorithms such as model-agnostic meta-learning (MAML) (Finn et al., 2017) for their promising performance in learning the initial configuration for new tasks. Generally, the recommendation for a user is regarded as a learning task. The core idea is to learn a global parameter that initializes the parameters of the personalized recommender models. The personalized parameter is locally updated to learn a specific user's preference, and the global parameter is updated by minimizing the loss over the training tasks across users. The learned global parameter is then used to guide the model settings for new users. For example, Lee et al. (2019) use several fully connected neural networks as recommender models. They first learn user and item embeddings from the user and item profiles, and then feed them into the recommender model to get predictions. They define two groups of global parameters: the parameters for learning embeddings and the parameters of the recommender models. They locally update the recommender model for personalized recommendation and globally update the two groups of parameters for the initialization of new users. Du et al. (2019) also construct a deep neural network as the recommender model. They define a global parameter for initializing the recommender model, and then locally update the model parameters by introducing update and stop controllers.

The aforementioned approaches have proven promising in applying meta-optimization to cold-start scenarios while remaining competitive in warm-start scenarios. However, they have the following limitations. Most of them are built upon the MAML algorithm and its variants, which cope well with data sparsity but suffer from a variety of issues such as instability, slow convergence, and weak generalization. In particular, they often suffer from gradient degradation and end up in a local optimum when handling users whose gradient descent directions differ from those of the majority of users in the training set. To address this issue, we propose Memory-Augmented Meta-Optimization (MAMO) for cold-start recommendation. In detail, 1) to address the problem of local optima, we design a feature-specific memory that provides a personalized bias term when initializing the model parameters. Specifically, the feature-specific memory includes two memory matrices: one stores the user profile memory to provide retrievable attention values, and the other caches the fast gradients of the previous training sets, which are read according to the retrievable attention values. 2) We further design a task-specific memory cube, i.e. the user preference memory, to capture the potential commonality of user preferences on different items. It is used as fast weights for the recommender model to alleviate the need for storing copies of neural activity patterns. 3) Extensive experiments on two widely used datasets show that MAMO performs favorably against state-of-the-art methods. The code is publicly available for reproduction.

The rest of the paper is organized as follows. We present the proposed approach in Section 2; Section 3 presents the settings, experimental results, and analysis of the experiments; we review the related work in Section 4, followed by the conclusion in Section 5.

2. Proposed Approach

2.1. Overview

2.1.1. Problem Definition

We consider the recommendation for a user as one task. Given a user with profile and rated items , where each item is associated with a description file and corresponding ratings , our goal is predicting the rating by user for a new item . Here, stands for the support set, and stands for the query set.

2.1.2. Motivation

Most existing meta-learning based recommendation techniques attempt to learn a global meta parameter to guide the model initialization , i.e. , where is learned across all the users in the training set. It provides a uniform parameter initialization that governs how the trained recommendation model predicts for new users. Because the global parameter works uniformly across all users, it may fail to discern the intrinsic discrepancies among a variety of user patterns, resulting in poor generalization. The model may also get stuck in a local optimum and encounter gradient degradation. Instead of learning a single initialization of the model parameters, we propose an adaptive meta-learning recommendation model, Memory-Augmented Meta-Optimization (MAMO), which aims to improve the stability, generalization, and computational cost of the model by learning multi-level personalized model parameters.

2.1.3. Model Structure

The proposed model includes two parts: the recommender model for predicting the user preference and the memory-augmented meta-optimization learner for initializing recommender model parameters. The recommender model predicts the recommendation scores, where the model parameters will be locally updated for single users to provide personalized recommendation. The meta-learner, which includes global parameters and memories , will provide personalized initialization for the recommender model parameters , and will be globally updated during the training process for users . The learned meta-learner is further used to initialize the recommender model for new users .

Figure 1. The training phase of MAMO

2.2. Recommender Model

Similar to many previous works, we assume that the user preference arises from a complex combination of personal information such as user profiles, user rating records, and item profiles.

2.2.1. Embedding: ,

We use the user profile to represent the initial user preference for the following reasons. First, a user profile normally includes general information such as age group and location. In a cold-start scenario, where there are only limited user-item interaction records, this information indicates potential user preferences to the recommender. Second, the traditional one-hot user representation (by a unique id) relies heavily on collaborative filtering techniques, and the fine-tuned model is hard to adapt to new users. Similarly, we use to denote the item profile for learning the item embedding.

To learn the user and item embedding vectors and from and , the simplest way is to construct a multi-layer fully connected neural network, i.e.


where is the embedding size; denotes the fully connected layers; and stand for the dimension of the user profile vector and item profile vector; and are the parameters of the fully connected layers for learning user and item embedding, respectively. We will omit similar notations in the rest of the paper.
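As a rough illustration of the embedding step described above, the following Python sketch maps a profile vector to an embedding through a small stack of fully connected layers. The helper names (`init_layer`, `dense`, `embed`), the ReLU activation, the layer sizes, and the uniform initialization are all assumptions made for the example; the text only states that fully connected layers are used.

```python
import random


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer (small uniform init is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def dense(x, layer):
    """Apply one layer with an assumed ReLU activation: max(0, W x + b)."""
    weights, bias = layer
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def embed(profile, layers):
    """Map a profile vector to an embedding by applying the layers in order."""
    hidden = profile
    for layer in layers:
        hidden = dense(hidden, layer)
    return hidden


rng = random.Random(0)
# Hypothetical sizes: 6-dim user profile, 5-dim item profile, 4-dim embeddings.
user_layers = [init_layer(6, 8, rng), init_layer(8, 4, rng)]
item_layers = [init_layer(5, 8, rng), init_layer(8, 4, rng)]

user_embedding = embed([1.0, 0.0, 0.0, 1.0, 0.5, 0.2], user_layers)
item_embedding = embed([0.0, 1.0, 1.0, 0.0, 0.3], item_layers)
```

The user and item networks keep separate parameters, which matches the two parameter groups mentioned in the text; everything else about the architecture is illustrative.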

2.2.2. Recommendation:

Given the user embedding , and a list of rated item embeddings for , we get the prediction of preference score for each item by:


where is the concatenation of the user embedding and the item embedding, and denotes the fully connected layers. is a matrix that includes the fast weights of the recommender model for user , which is extracted from the task-specific memory according to the user profile , i.e. . The task-specific memory for user , will be locally updated (i.e. during the learning on support set of user ) for personalized recommendation. We will introduce more details about the task-specific memory in the further parts.
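The prediction step can be sketched as follows: the concatenated user and item embeddings are transformed by the user's fast-weight matrix and then passed through an output layer. This is a minimal sketch; the function names, the single ReLU plus linear output layer, and the toy numbers are assumptions, since the exact layer structure is not recoverable from the text here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def predict(user_embedding, item_embedding, fast_weights, out_weights, out_bias):
    """Score one item for one user.

    The user's fast-weight matrix transforms the concatenated embeddings;
    an assumed ReLU and a linear output layer produce the rating estimate.
    """
    joint = user_embedding + item_embedding            # list concatenation
    hidden = [max(0.0, h) for h in matvec(fast_weights, joint)]
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias


# Toy fast weights (hypothetical values): pick the first and last joint features.
fast_weights = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]]
score = predict([1.0, 0.0], [0.0, 1.0], fast_weights, [2.0, 1.5], 0.5)
```

With these toy values both hidden units equal 1, so the score is 2.0 + 1.5 + 0.5 = 4.0.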

2.3. Memory-Augmented Meta-Optimization

2.3.1. Feature-specific Memory: .

Recall that the parameters used for extracting the user and item embeddings are and . A traditional meta-optimization approach learns the global parameters and for initialization, i.e. , . To address the single-initialization problem, we introduce feature-specific memories, i.e. the user embedding memory and the profile memory , to help the model initialize a personalized parameter . The profile memory stores information relevant to user profiles and provides the retrieval attention values . The retrieval attention values are used to extract information from the user embedding memory , where each row of keeps the corresponding fast gradients (or bias terms). Together, the two memory matrices generate a personalized bias term when initializing , i.e. . Here the bias term can be regarded as a personalized initialization parameter that guides the global parameter to adapt quickly to the case of user ; is a hyper-parameter controlling how much of the bias term is considered when initializing .

In detail, given a user profile , the profile memory calculates the attention values by



The function computes the cosine similarity between the user profile and the user profile memory, which is then normalized by the softmax function. The dimension denotes the number of user preference types, which is predefined before the training process. We obtain the personalized bias term by


where stores the memory of the fast gradients. Notice that the user embedding model may comprise more than one neural layer and more than one parameter, which means is not a numerical matrix but stores all the fast gradients with the same shape as the parameters in the user embedding layers, so here denotes the dimension of the parameters in user embedding layers.
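To make the read operation concrete, here is a minimal Python sketch of the two steps described above: cosine similarity between the user profile and each row of the profile memory, normalized by softmax, followed by an attention-weighted sum over the rows of the fast-gradient memory. The names `attention` and `bias_term`, the flattening of the stored gradients into one list per row, and the toy values are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(profile, profile_memory):
    """One attention weight per preference type (row of the profile memory)."""
    return softmax([cosine(profile, row) for row in profile_memory])


def bias_term(attn, gradient_memory):
    """Attention-weighted sum of the stored fast-gradient rows."""
    width = len(gradient_memory[0])
    return [sum(a * row[j] for a, row in zip(attn, gradient_memory))
            for j in range(width)]


# Hypothetical toy memories with two preference types.
profile_memory = [[1.0, 0.0], [0.0, 1.0]]
gradient_memory = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # flattened fast gradients

attn = attention([1.0, 0.0], profile_memory)
bias = bias_term(attn, gradient_memory)
```

Because the example profile matches the first memory row, the first attention weight is the larger one and the bias term leans toward the first stored gradient row.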

Before the training process, the two memory matrices are randomly initialized, and will be updated during the training process. Specifically, profile memory will be updated by:


where is the cross product of and , and is a hyper-parameter that controls how much new profile information is added. Here we apply an attention mask when adding the new information, so that the new profile information is attentively added to the memory matrix. Similarly, the parameter memory will be updated by


where denotes the training loss, and is the hyper-parameter to control how much new information is kept.
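One plausible reading of the two update rules is an attention-masked moving average: the new information (the profile, or the loss gradient) is spread across memory rows in proportion to the attention values and then blended with the old memory. The sketch below encodes that reading only; the convex-combination form, the helper names `outer` and `blend`, and the toy numbers are assumptions, not the paper's exact equations.

```python
def outer(weights, vector):
    """Outer product: row k is weights[k] * vector (the attention mask)."""
    return [[w * v for v in vector] for w in weights]


def blend(memory, new_info, rate):
    """Assumed update form: rate * new_info + (1 - rate) * memory."""
    return [[rate * n + (1.0 - rate) * m for m, n in zip(m_row, n_row)]
            for m_row, n_row in zip(memory, new_info)]


attn = [0.75, 0.25]                 # attention values for one user
profile = [1.0, 2.0]                # that user's profile vector
loss_gradient = [0.4, -0.8, 0.0]    # flattened gradient of the training loss

profile_memory = [[0.0, 0.0], [0.0, 0.0]]
gradient_memory = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Both memories receive attention-masked new information.
profile_memory = blend(profile_memory, outer(attn, profile), rate=0.5)
gradient_memory = blend(gradient_memory, outer(attn, loss_gradient), rate=0.5)
```

In this toy run the row with the larger attention value absorbs more of the new profile, which is the effect the attention mask is meant to have.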

2.3.2. Task-specific Memory:

The user preference matrix serves as fast weights, or a transform matrix, that the recommender model applies to the user and item embeddings (see equation (2)). It is extracted from the memory cube , where , as in the feature-specific memory, denotes the number of user preference types. Similar to the feature-specific memory, which follows the idea of the Neural Turing Machine (NTM) (Graves et al., 2014), the memory cube has a read head to retrieve the memory and a write head to update the memory. We attentively retrieve the preference matrix from by:


where is learned from equation 3. The write head will write the updated personal preference memory matrix to the after the learning on the support set.



where denotes the tensor product, and is a hyper-parameter that controls how much new preference information is added.
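The read and write heads of the memory cube can be sketched in the same spirit. In this hedged example the cube is a list of preference matrices, one per preference type; the read head returns their attention-weighted sum, and the write head blends an attention-scaled copy of the user's preference matrix back into each slice. The function names, the blending form, and the parameter name `gamma` are assumptions.

```python
def read_preference(attn, cube):
    """Read head: attention-weighted sum of the per-type preference matrices."""
    rows, cols = len(cube[0]), len(cube[0][0])
    return [[sum(a * cube[k][r][c] for k, a in enumerate(attn))
             for c in range(cols)]
            for r in range(rows)]


def write_preference(attn, cube, preference, gamma):
    """Write head (assumed form): blend attn[k] * preference into slice k."""
    return [[[gamma * a * preference[r][c] + (1.0 - gamma) * cube[k][r][c]
              for c in range(len(preference[0]))]
             for r in range(len(preference))]
            for k, a in enumerate(attn)]


# Hypothetical cube with two preference types, each a 2x2 matrix.
cube = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 2.0], [2.0, 0.0]]]
attn = [0.5, 0.5]

preference = read_preference(attn, cube)
# In training, `preference` would first be updated locally on the support set;
# it is written back unchanged here only to keep the example short.
updated_cube = write_preference(attn, cube, preference, gamma=0.5)
```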

Input: Training user set ; User profile ; Item profile ; User ratings ; Hyper-parameters , , , , ,
Output: Meta parameters , , , , ,
Randomly initialize the meta parameters , , ;
Randomly initialize the memories , , ;
while not done do
    for  do
        Calculate bias term by Eq. (3-4);
        Initialize the local parameters , , by Eq. (9-10);
        Initialize the preference memory by Eq. (7);
        for  do
            Get user and item embedding and by Eq. (1);
            Get prediction of by Eq. (2);
            Local update ;
            Local update , , by: ;
        end for
    end for
    Update feature-specific memory , by Eq. (5-6);
    Update task-specific memory by Eq. (8);
    Update global parameters , , by: ;
end while
Algorithm 1. Training process of MAMO

2.3.3. Local Update

Traditionally, the parameters of a neural network are initialized by randomly sampling from a statistical distribution. Given sufficient training data, randomly initialized parameters can usually converge to a good local optimum, but this may take a long time (Du et al., 2019). In the cold-start scenario, random initialization combined with limited training data can lead to serious over-fitting, which leaves the trained recommender unable to generalize well. Inspired by recent works on meta-training (Lee et al., 2019), we initialize the recommender parameters from the global initial values.

At the beginning of the training process, we randomly initialize the global parameters, i.e. , , , , , . For each user , we have support set and query set. During the local learning phase (i.e. learning on the support set), we initialize the local recommender parameters by:


where is obtained via equation (3-4). Then we obtain the task-specific memory matrix by equation (7). The prediction of ratings for items is based on equation (2). The optimization goal in local training is to minimize the loss of the recommendation for a single user, i.e. updating the local parameters by minimizing the prediction loss . Thus, the local parameters will be updated by:


where could be either , , or ; is the learning rate for updating the local parameters. The preference matrix will also be locally updated via back-propagation.
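The local phase can be illustrated with a toy linear model: start from the global parameters shifted by the personalized bias term, then take a few gradient steps on the user's support set. This is a sketch under stated assumptions; in particular, subtracting the scaled bias (`g - tau * b`), the squared-error loss, the linear predictor, and all names and numbers are illustrative choices rather than the paper's exact formulation.

```python
def mse(params, xs, ys):
    """Mean squared error of a linear predictor on (xs, ys)."""
    errors = [sum(p * v for p, v in zip(params, x)) - y for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


def mse_gradient(params, xs, ys):
    """Gradient of the mean squared error with respect to params."""
    grad = [0.0] * len(params)
    for x, y in zip(xs, ys):
        error = sum(p * v for p, v in zip(params, x)) - y
        for j, v in enumerate(x):
            grad[j] += 2.0 * error * v / len(xs)
    return grad


def personalized_init(global_params, bias, tau):
    """Assumed form: shift the global parameters by the scaled bias term."""
    return [g - tau * b for g, b in zip(global_params, bias)]


def local_update(params, xs, ys, learning_rate, steps):
    """Plain gradient descent on the user's support set."""
    for _ in range(steps):
        grad = mse_gradient(params, xs, ys)
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params


support_x = [[1.0, 0.0], [0.0, 1.0]]   # toy support-set features
support_y = [1.0, 2.0]                 # toy support-set ratings

theta_init = personalized_init([0.0, 0.0], bias=[1.0, -1.0], tau=0.1)
theta_user = local_update(theta_init, support_x, support_y,
                          learning_rate=0.1, steps=5)
```

After the local steps, the support-set loss of `theta_user` is lower than that of `theta_init`, which is all the local phase is meant to achieve.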

2.3.4. Global Update

The aim of the meta-optimization is to minimize the expected loss on the local query set for . Here, the parameters of the meta-learner include: shared initial parameters , and ; feature-specific memories, i.e. profile memory and user embedding memory ; and the task-specific memory . The gradients related to the meta-testing loss, which we call the meta-gradient, can be computed via back-propagation. The meta-gradient may involve higher-order derivatives, which are expensive to compute when the neural networks are deep. Therefore, MAML (Finn et al., 2017) takes one-step gradient descent for meta-optimization. We follow a similar idea: after the local training on the support set, we update the global parameters according to the loss on the query sets.

Suppose the recommender model is denoted as combined with a task-specific memory for user , where . After the local training on support set, we get the model with updated parameters . Our goal is minimizing the training loss for users on query sets for . Then, the global parameters are updated by


Meanwhile, the feature-specific memories and are updated by equations (5) and (6), and the task-specific memory is updated by equation (8). The pseudo code of the training process is listed in Algorithm 1.
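The interplay of local and global updates can be seen in a deliberately tiny example. Each "user" below is a one-parameter regression task; the global parameter is adapted on the support set and then moved along the query-set gradient evaluated at the adapted parameter. Note that this uses a first-order approximation of the meta-gradient (higher-order terms are dropped) and omits the memory updates entirely; the task data, learning rates, and function names are all hypothetical.

```python
def grad(theta, xs, w_true):
    """Gradient of mean((theta*x - w_true*x)^2) with respect to theta."""
    return sum(2.0 * (theta - w_true) * x * x for x in xs) / len(xs)


def loss(theta, xs, w_true):
    """Mean squared error of the prediction theta*x against w_true*x."""
    return sum(((theta - w_true) * x) ** 2 for x in xs) / len(xs)


def adapt(phi, support_xs, w_true, local_lr):
    """Local update: one gradient step on the support set."""
    return phi - local_lr * grad(phi, support_xs, w_true)


def meta_step(phi, tasks, local_lr, meta_lr):
    """Global update using query gradients at the adapted parameters
    (first-order approximation; an assumption of this sketch)."""
    total = 0.0
    for w_true, support_xs, query_xs in tasks:
        theta = adapt(phi, support_xs, w_true, local_lr)
        total += grad(theta, query_xs, w_true)
    return phi - meta_lr * total


def mean_query_loss(phi, tasks, local_lr):
    """Average query loss after local adaptation from phi."""
    losses = [loss(adapt(phi, s, w, local_lr), q, w) for w, s, q in tasks]
    return sum(losses) / len(losses)


# Three toy users: (true slope, support inputs, query inputs).
tasks = [(1.8, [1.0, 2.0], [3.0]),
         (2.0, [1.0, 2.0], [3.0]),
         (2.2, [1.0, 2.0], [3.0])]

phi = 0.0
loss_before = mean_query_loss(phi, tasks, local_lr=0.05)
for _ in range(20):
    phi = meta_step(phi, tasks, local_lr=0.05, meta_lr=0.01)
loss_after = mean_query_loss(phi, tasks, local_lr=0.05)
```

In this toy setting the global parameter settles near 2.0, the value from which one local step brings every user closest to their own slope.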

3. Experiments

3.1. Datasets

3.1.1. Datasets.

We use two widely used, publicly available datasets for evaluation: MovieLens 1M and Book-crossing, both of which include user information and item information. MovieLens 1M includes around 1 million ratings from about 6 thousand users for over 3 thousand movies, with ratings ranging from 1 to 5. For the Book-crossing dataset, we filter out the records whose users or items lack relevant profiles. The processed dataset includes about 600 thousand ratings from around 50 thousand users for over 50 thousand books, with ratings between 1 and 10. See appendix A.1 for more details.

3.1.2. Data preprocessing

We use different strategies when learning the feature representations. Features with a single categorical value, such as location and occupation, are denoted by an index and represented with a randomly initialized embedding vector; we process numerical features such as publication year and age group in a similar way. For features that may have multiple categories, such as movie genres and directors, we use a one-hot representation and then transform it into a vector with the same dimension as the other feature vectors.

3.1.3. Training and testing dataset.

Each user is regarded as a sample in the dataset. We randomly split the users into training and testing users with a ratio of 80:20. The history records of a single user are further divided into a support set and a query set. We trim the rating records of each user to 20 records to force the model to learn from few samples. In the default settings, we take the first 15 items of a user as the support set and the others as the query set. Note that we sort these items by the user review time, so that the rated items in the query set can logically be regarded as 'new items' for the user. We further discuss whether the ratio of the training dataset or the number of cases in the support set affects the model performance.
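The splitting procedure above can be sketched as follows. The record layout `(item_id, rating, timestamp)`, the function names, and the choice to keep the earliest 20 records when trimming are assumptions; the text specifies only the 80:20 user split, the cap of 20 records, the time ordering, and the 15-item support set.

```python
import random


def split_users(user_ids, train_ratio=0.8, seed=0):
    """Randomly split users into training and testing users."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]


def split_records(records, max_records=20, support_size=15):
    """Sort one user's records by time, trim them, and split support/query.

    Each record is a hypothetical (item_id, rating, timestamp) tuple.
    Keeping the earliest records when trimming is an assumption.
    """
    ordered = sorted(records, key=lambda r: r[2])[:max_records]
    return ordered[:support_size], ordered[support_size:]


train_users, test_users = split_users(range(10))

# Toy history: 25 records with descending timestamps.
history = [(item, 3.0, 1000 - item) for item in range(25)]
support, query = split_records(history)
```

Because the records are ordered by time before the split, every query item was rated after every support item, which is what lets the query items play the role of new items.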

3.1.4. Cold-start scenarios.

We further consider the recommendation performance in four scenarios: 1) existing users for existing items (W-W); 2) existing users for cold items (W-C); 3) cold users for existing items (C-W); and 4) cold users for cold items (C-C). For the MovieLens dataset, we classify users as warm or cold according to their first comment time. The first comment time in the MovieLens dataset ranges from 2000-04-26 to 2003-03-21. We found that most users (about 90%) provided their first comments before 2000-12-03. Dividing the users at this date gives 5,400 warm users and 640 cold users. As for the items, we regard items with fewer than 10 ratings as cold items, which gives 1,683 warm items and 1,645 cold items. For the Book-crossing dataset, where the rating time is not provided, we assume that the order of the user ids is consistent with the order of user registration time. Thus, we take the first 90 percent of the users as warm users and the others as cold users. Similarly, we regard items with more than 5 ratings as warm items, which gives 628 warm items, and the others as cold items.
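For the MovieLens rules above, assigning a user-item pair to one of the four scenarios might look like the sketch below. The function name and argument layout are hypothetical; the cutoff date and the 10-rating threshold come from the text. The Book-crossing rules (user-id order, more than 5 ratings) would need different arguments and are not covered here.

```python
from datetime import date


def movielens_scenario(first_rating_date, item_rating_count,
                       user_cutoff=date(2000, 12, 3), min_item_ratings=10):
    """Return 'W-W', 'W-C', 'C-W', or 'C-C' for one user-item pair.

    Users whose first rating precedes the cutoff are warm; items with
    fewer than `min_item_ratings` ratings are cold.
    """
    user = "W" if first_rating_date < user_cutoff else "C"
    item = "W" if item_rating_count >= min_item_ratings else "C"
    return f"{user}-{item}"


examples = [
    movielens_scenario(date(2000, 5, 1), 50),
    movielens_scenario(date(2000, 5, 1), 3),
    movielens_scenario(date(2001, 1, 15), 50),
    movielens_scenario(date(2001, 1, 15), 3),
]
```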

Figure 2. Parameter sensitivity in different settings (panels (a)-(h)).

3.2. Parameter Studies

Here we show the parameter studies on several key parameters of our method: the model set-up parameters for constructing the training and testing datasets; the model structure parameters for constructing the recommender models; and the hyper-parameters used during the learning process. We present the results under two evaluation metrics, i.e. Mean Absolute Error (MAE) and normalized discounted cumulative gain (nDCG):


where is the number of instances in the query set for each user; calculates the actual rating values sorted by predicted rating values; and calculates the sorted actual rating values, i.e. the best achievable values, for . MAE evaluates the accuracy of rating prediction, where a lower value indicates better model performance. nDCG considers the preference ordering performance on observed query sets, where a higher value indicates better performance. The experiments are conducted on the MovieLens dataset.
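For readers who want to reproduce the two metrics, a common formulation is sketched below. The exponential gain `2**r - 1`, the base-2 logarithmic discount, and the absence of a cutoff rank are assumptions borrowed from the standard nDCG definition; the paper's exact variant may differ.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def dcg(ratings):
    """Discounted cumulative gain of ratings in the given order
    (exponential gain and log2 discount are assumed)."""
    return sum((2 ** r - 1) / math.log2(rank + 2)
               for rank, r in enumerate(ratings))


def ndcg(y_true, y_pred):
    """DCG of actual ratings ordered by prediction, over the ideal DCG."""
    by_prediction = [t for _, t in sorted(zip(y_pred, y_true),
                                          key=lambda pair: pair[0],
                                          reverse=True)]
    ideal = dcg(sorted(y_true, reverse=True))
    return dcg(by_prediction) / ideal if ideal > 0 else 0.0


actual = [5, 3, 1]
error = mae(actual, [4, 3, 2])
perfect_order = ndcg(actual, [4.5, 3.0, 2.0])
reversed_order = ndcg(actual, [1.0, 2.0, 3.0])
```

A ranking that orders the items exactly by their actual ratings scores 1.0, and any other ordering of distinct ratings scores lower.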

Method             Metric  MovieLens-1M                    Book-crossing
                           W-W     W-C     C-W     C-C     W-W     W-C     C-W     C-C
MeLU               MAE     0.7661  0.9361  0.7884  0.9299  0.7799  2.1047  1.8701  2.1475
                   nDCG    0.8904  0.7990  0.8810  0.8011  0.9572  0.8441  0.8527  0.8410
MetaCS-DNN         MAE     0.9047  1.1694  1.0625  1.2012  1.6206  2.1457  2.2648  2.3088
                   nDCG    0.9090  0.7715  0.8559  0.7680  0.8860  0.8365  0.8202  0.8160
Meta               MAE     0.7567  0.8860  0.8443  0.9130  1.5147  1.9767  1.8738  2.0921
                   nDCG    0.8870  0.8323  0.8401  0.8148  0.8781  0.8463  0.8490  0.8422
Item-level RUM     MAE     0.8874  1.3424  1.2655  1.3509  1.4611  2.2857  2.2608  2.3197
                   nDCG    0.9102  0.7692  0.7420  0.7680  0.8920  0.8265  0.8202  0.8160
Feature-level RUM  MAE     0.8739  1.2439  1.2488  1.2773  1.3945  2.0837  2.0322  2.1260
                   nDCG    0.9120  0.7721  0.7642  0.7547  0.8909  0.8430  0.8338  0.8273
MAMO               MAE     0.8725  0.9306  0.8967  0.8894  1.4879  1.7379  1.6217  1.8188
                   nDCG    0.8866  0.8315  0.8799  0.8709  0.8879  0.8402  0.8565  0.8384
Table 1. Comparison Results

3.2.1. Model set-up parameters.

Here we consider the model performance under different split ratios for the training-testing and support-query sets. The experimental results (Figure 2 (a)) suggest that a higher ratio of training users improves the model performance, while the performance when training with only half of the users is still acceptable, which indicates the effectiveness of the meta-training approach. As for the sample size of the support set, where we cap the number of rating records per user at 20, we compare the results for support sample sizes between 5 and 15, using the last 5 samples of each user as the query set for a fair comparison. According to Figure 2 (b), a larger sample size provides better results, while the model still provides acceptable predictions with a smaller sample size, which shows its capacity in cold-start scenarios.

3.2.2. Model structure parameters.

The parameter settings of deep neural networks, especially the number of layers and the dimension of each layer, can affect the model performance. Generally, a network with a large dimension and a complex structure may provide accurate predictions but is data hungry and slow to converge. In our work, we use fully connected layers to learn the user and item embeddings. According to Figure 2 (c)-(d), which show the model performance under different numbers of embedding layers and different embedding sizes , the number of neural layers has only a slight impact on the model performance, although layers that are too shallow or too deep may lead to bad results. A moderate embedding size provides acceptable prediction results and requires less training time.

3.2.3. Hyper-parameters.

We have hyper-parameters and (ranging from 0 to 1) for globally updating the profile memory and user embedding memory in the feature-specific memory, and for globally updating the task-specific memory . These hyper-parameters control how much new information is added to the memory matrices. Figure 2 (e)-(f) show the parameter sensitivity for and . For parameter , values around 0.5 give better model performance because of the update strategy of the memory matrices: since the profile memory is randomly initialized and is not updated by back-propagation, it requires more new information to cover the distribution of the user profiles. For the user embedding memory, in contrast, too much new information may disrupt the existing memory, and provides the best performance. The parameter for the task-specific memory (not listed in the figures) shows a pattern similar to that of , providing the best results with values around 0.1. We have and for updating the local and global parameters, and for providing the personalized bias term. Figure 2 (g)-(h) show the results for and , where small learning rates provide the best results, since a large value may make it difficult for the model to converge. As for the parameter , which is not listed in the figures, we find that values around 0.1 perform best.

3.3. Comparison Results

3.3.1. Comparison Methods

We adopt the following representative state-of-the-art methods that apply the meta-learning idea to recommender systems. The first three methods are based on meta-optimization, and the last method, which includes an item-level model and a feature-level model, is based on memory networks.

  • MeLU (Lee et al., 2019): obtains rating predictions by feeding the concatenation of the user and item embeddings into fully connected layers. The parameters of the fully connected layers are locally updated for personalized recommendation. The global parameter of the recommender and the parameters of the embedding layers are globally updated for all the users.

  • MetaCS-DNN (Bharadhwaj, 2019): follows an idea similar to MeLU when constructing the recommender model, i.e. several fully connected layers. The difference is that the local and global parameters involve all parameters from the inputs to the predictions.

  • Meta (Du et al., 2019): uses a deep neural network as the recommender model and updates it via a meta-optimization approach. Specifically, it introduces update and stop controllers to update the local parameters.

  • RUM (Chen et al., 2018b): uses neural matrix factorization as the recommender model, where the score for an item is the product of the user and item embeddings. The user embedding is learned from a user’s intrinsic embedding and a memory embedding. The memory embedding is learned from either item-level or feature-level memories, giving item-level RUM and feature-level RUM. The item-level RUM stores and extracts the memory according to different items, while the feature-level RUM considers the similarity of the different features. Here, the memory matrix is initialized from each user’s history records and is locally updated for personalized recommendation.

The parameter settings for our method and the comparison methods can be found in appendix A.2.

3.3.2. Comparison Results

The comparison results are listed in Table 1. Generally, we can see that the meta-optimization based methods show outstanding performance in cold scenarios, while RUM outperforms other methods in warm situations. This can be attributed to the stronger capability of the meta-optimization approach in capturing general preferences. Since the memory matrices designed in RUM store only personal history information, RUM needs more history records to establish an efficient memory matrix. Comparing MetaCS-DNN and MeLU, which utilize different strategies for updating embedding parameters, the results show that locally updating the embedding parameters performs better than globally updating the embedding parameters. We can see that our method performs stably and well across the different scenarios, which shows the effectiveness of the memory-augmented meta-optimization strategy: learning a personalized initialization of local parameters from the memories.

3.4. Discussion

MAMO has two groups of memories. The first group includes the user profile memory and the user embedding memory , which provide a personalized bias term when initializing the local parameters. The second group includes the task-specific memory . In this section, we discuss the impact of these two groups of memories. In addition, we predefine user types when constructing these memories. We discuss how to set an appropriate and present an example of the standard users for the user preference patterns.

Figure 3. The major difference between MAML (a) and our method (b). Red lines stand for the global/meta-optimization; blue lines stand for the local learning/adaptation.

3.4.1. Impact of Personalized Bias Term

The basic idea of the meta-optimization process is to learn a shared global parameter for initializing the personalized local parameters (see Figure 3 (a)). To address the potential local-optimum issues caused by the single globally shared parameter in most current meta-optimization based works, we introduce two memory matrices that provide a personalized bias term when initializing the local parameters for learning the user embedding. Specifically, the profile memory stores types of user profiles, and stores the corresponding bias terms. Then, the local parameters are locally updated for personalized recommendation (see Figure 3 (b)). The comparison methods MeLU and MetaCS-DNN both use MAML for meta-optimization. According to the comparison results, our strategy outperforms these two methods in cold scenarios, especially for cold users. A likely reason is that our optimization approach takes the user profile into consideration: the profile memory automatically clusters users with similar profiles to provide fast bias terms.

Figure 4. The performance during the training and testing processes with different support sample sizes on different recommender models. Panels: (a) training, |S|=10; (b) testing, |S|=10; (c) training, |S|=5; (d) testing, |S|=5.

3.4.2. Impact of Preference Memory

We also introduce the task-specific memory that contains types of preference matrices during the recommendation process. To see whether it helps the recommendation process, we compare it with different designs of recommender models. The first predicts the ratings by passing the concatenation of the user and item embeddings through several fully connected layers, denoted by DNN. The second follows the idea of neural matrix factorization (He et al., 2017), which multiplies the user embedding and the item embedding to obtain the predictions, denoted by NMF. The results are shown in Figure 4, where (a) and (c) show the performance on the query dataset during the training process, and (b) and (d) show the performance on the query dataset during the testing process. We can observe that DNN and NMF converge fast in the training process but perform poorly on the testing dataset with a small support sample size, while our method needs more steps to store the preference information but learns fast during the testing process.

3.4.3. Discussion of Preference Type

In this work, we predefine the number of user types when constructing the proposed memories, which can be regarded as the number of clusters of users based on the user profiles. Figure 5 (a) shows the model performance under different numbers of user types. Values around 2-4 achieve the best performance. A larger number of types can provide more accurate guidance for recommendation, but at a higher computation cost (each type in the memory stores parameters with the same shape as all the network parameters in the user embedding networks). In Figure 5 (b), we present the case when . We selected two representative user profiles which are most similar to , the distributions are from the top 100 users that belong to these two types.

Figure 5. (a) Performance with different numbers of user types. (b) Example for representative users when .

3.4.4. Limitations and Future work

In cold-start scenarios, auxiliary information is valuable for providing potential recommendations. Our work targets situations in which users or items have limited interaction histories, and it uses the idea of few-shot learning to provide recommendations. We assume that enough essential profile details of the users or items are available. During data preprocessing, we filter out the records that have no relevant profiles. However, in real-world cases the auxiliary information is not always attainable, and our model becomes inefficient under such circumstances. In future work, we plan to leverage knowledge from multi-modal studies (Baltrušaitis et al., 2018) to address this issue. For example, (Wu and Goodman, 2018) design a multimodal variational autoencoder that learns a joint feature distribution under any missing modalities.

4. Related Work

4.1. Cold-start Problem in Recommender Systems

Accurately recommending items to users remains a tricky and significant problem for recommenders. Several solutions to this problem have been proposed (Mishra et al., 2017), involving second-domain knowledge transfer (Li et al., 2018), auxiliary and contextual information (Roy and Guntuku, 2016; Yao et al., 2018), active learning (Zhu et al., 2019) and deep learning (Ebesu and Fang, 2017).

Commonly, it is reported that more information leads to better recommendation results. (Mirbakhsh and Ling, 2015) augments matrix factorization with clusters for cross-domain recommendation. Similarly, (Li et al., 2018) presents a cross-domain recommendation model based on partial least squares regression (PLSR) analysis. Both PLSR-CrossRec and PLSR-Latent can use source-domain ratings to better predict the ratings of cold-start users. Apart from leveraging auxiliary information from another domain, some researchers treat representative item content as auxiliary information for learning representative latent factors when coping with the cold-start problem. Based on implicit feedback, (Roy and Guntuku, 2016) presents an approach named visual-CLiMF to learn representative latent factors for cold-start videos, where emotional aspects of items are incorporated into the latent factor representations of video contents. To make better use of content features in the latent space, (Chou et al., 2016) proposes a next-song recommender system for cold-start recommendation that learns and updates a mapping between the audio feature space and the item latent space. In addition, active learning has been applied to the cold-start problem for both users (Elahi et al., 2016) and items (Anava et al., 2015). (Zhu et al., 2019) combines the active learning approach with items' attribute information to deal with the item cold-start problem. More recent works take advantage of deep learning (Wei et al., 2016; Ebesu and Fang, 2017; Yuan et al., 2016). (Ma et al., 2020) proposes a multiplex interaction-oriented service recommendation approach (MISR), which merges different interactions into a deep neural network to search for latent information so that new users can be better handled. (Wei et al., 2016) proposes a hybrid recommendation model that uses a deep neural network to extract the content features of items and then feeds them into the timeSVD++ collaborative filtering model.

4.2. Meta-learning for Recommender Systems

Meta-learning, also known as learning to learn, learns across various tasks with few training samples in each task, so that the learned model can be easily adapted to new tasks (Vanschoren, 2018). In recent years, meta-learning has attracted much attention as a way to alleviate cold-start problems (Chen et al., 2018a; Lee et al., 2019; Zhao et al., 2019), and most of these works adopt an optimization-based meta-learning approach and choose model-agnostic meta-learning (MAML) (Finn et al., 2017) for model training. Generally, those works define a recommendation model with its own parameter, and a meta-learner. Modeling each user's preference is regarded as a learning task, and the users are divided into training and testing users. The initialization of the model parameter is defined by some function of the meta-learner; in the simplest case, the meta-learner's parameter is used directly as the initialization. During the training process, the parameter of the recommendation model is locally updated by minimizing the recommendation loss on the support set (the history ratings by a user) for a single user, and is then used to compute the recommendation loss on the query set. The parameter of the meta-learner is globally updated by minimizing the sum of the recommendation losses for all users in the training dataset. The meta-learner then guides the learning process for new users by providing the initialization of the model parameter, which can expedite the learning process of the recommendation models for testing users.
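The local and global updates described above can be sketched on a toy problem. This is a minimal illustration and not the implementation of any of the cited works: it assumes a one-parameter model that predicts a rating as `theta * x`, uses the first-order approximation of MAML, and all names (`phi`, `alpha`, `beta`, `make_user`) are hypothetical.

```python
def grad_mse(theta, data):
    """Gradient of the mean squared error of y_hat = theta * x w.r.t. theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)

def mse(theta, data):
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def local_update(phi, support, alpha=0.1, steps=1):
    """Start from the shared initialization phi and adapt on one user's support set."""
    theta = phi
    for _ in range(steps):
        theta -= alpha * grad_mse(theta, support)
    return theta

def meta_train(users, phi=0.0, alpha=0.1, beta=0.05, epochs=200):
    """users: list of (support, query) pairs. Returns the learned initialization."""
    for _ in range(epochs):
        meta_grad = 0.0
        for support, query in users:
            theta_u = local_update(phi, support, alpha)
            # First-order approximation: take the query-set gradient at theta_u
            # and ignore how theta_u itself depends on phi.
            meta_grad += grad_mse(theta_u, query)
        phi -= beta * meta_grad  # global update over all training users
    return phi

def make_user(w):
    """Toy user whose true rating function is y = w * x."""
    support = [(1.0, w * 1.0), (2.0, w * 2.0)]
    query = [(3.0, w * 3.0)]
    return support, query

train_users = [make_user(w) for w in (1.0, 2.0, 3.0)]
phi = meta_train(train_users)
```

In this toy setting every user starts from the same `phi`, which is exactly the shared-initialization behavior that the personalized bias terms are designed to relax.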

The major differences among the above works lie in the design of the recommender model and the meta-optimization approach. For example, (Lee et al., 2019) targets a rating prediction problem, feeding the concatenation of the user and item embeddings into fully connected layers to get predictions. They locally update the parameters of the neural layers to get personalized recommendations and globally update the parameters of the embedding learning and a global parameter of the neural layers. (Du et al., 2019) constructs a deep neural network as the recommender model and designs an input gate and a forget gate for updating the local parameters. One limitation of current meta-optimization approaches for recommendation is that most works learn a shared global initialization parameter, which means the initialization parameter is the same for all users. This may lead the recommender model to local optima and slow convergence. To address this issue, our work introduces two memory matrices to provide personalized bias terms when initializing the parameters.

4.3. Memory-augmented Neural Networks

With the ability to express, store, and manipulate records explicitly, dynamically, and effectively, external memory networks (Sukhbaatar et al., 2015) have gained popularity in recent years in fields such as question-answering systems (Xiong et al., 2016) and knowledge tracking (Ghazvininejad et al., 2018). One branch of memory neural networks is based on the Neural Turing Machine (NTM) (Graves et al., 2014): the memory is normally stored in a matrix, with a read head to extract information from the memory and a write head to update the memory at each time step. Such features make the NTM a good module for meta-learning or few-shot learning. Following this idea, (Santoro et al., 2016) proposes Memory-augmented Neural Networks (MANN). Similar to NTM, they design a read controller to learn a key vector $k_t$ from the inputs at time $t$, and $k_t$ is used to learn the reading weights $w^r_t$ for the rows in the memory matrix $M_t$; the retrieved vector $r_t$ is the sum of the rows in the memory matrix weighted by these attention values, i.e. $r_t = \sum_i w^r_t(i) M_t(i)$. Then, for updating the memory, the authors propose usage weights $w^u_t$ to evaluate the usage of the rows in the memory matrix. The writing weights $w^w_t$ are learned from the previous reading weights and the usage weights. Last, the memory is updated by $M_t(i) = M_{t-1}(i) + w^w_t(i) k_t$.
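A minimal sketch of the read and write steps described above is given below. It assumes cosine similarity followed by a softmax for the reading weights and takes the writing weights as given; the usage-weight bookkeeping of (Santoro et al., 2016) is omitted, and all function names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, key):
    """Return (reading_weights, retrieved_vector) for one key vector."""
    weights = softmax([cosine(key, row) for row in memory])
    dim = len(memory[0])
    retrieved = [sum(w * row[j] for w, row in zip(weights, memory))
                 for j in range(dim)]
    return weights, retrieved

def write(memory, write_weights, key):
    """Additive write: row i receives write_weights[i] * key. Returns a new matrix."""
    return [[m + w * k for m, k in zip(row, key)]
            for row, w in zip(memory, write_weights)]

memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
key = [1.0, 0.0]
weights, retrieved = read(memory, key)
new_memory = write(memory, [0.0, 1.0, 0.0], key)
```

Here the first memory row is the closest to the key, so it receives the largest reading weight, and the write touches only the row selected by the writing weights.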

The work by (Chen et al., 2018b) is one of the first to integrate MANN into recommendation tasks. They consider the matrix factorization model as a neural network, where the score of an item for a user is predicted from the user and item embeddings. They propose a memory-enhanced user embedding, which is learned from the user's intrinsic embedding and a memory embedding. The memory embedding reads from the user's personalized memory matrix to retrieve the information related to an item according to the item embedding. Another work by (Li et al., 2019) adds a gate mechanism to control how much past memory and current memory are kept. Different from these works, where the memory matrices are designed for each user and cannot be shared among different users, we propose two globally shared memories: a feature-specific memory and a task-specific memory. The feature-specific memory includes the user profile memory and the user embedding memory. Given a user, this memory extracts the parameter memory according to the similarity between the user profile and the user profile memory. The task-specific memory stores the preference information, which serves as fast weights for the recommender model, to alleviate the need for storing copies of neural activity patterns.
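The feature-specific memory lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than the method's actual equations: the attention is taken to be a softmax over cosine similarities, the bias term is subtracted from the global parameters with a hypothetical scale `tau`, and the memory contents are made-up numbers.

```python
import math

def attention(profile, profile_memory):
    """Softmax over cosine similarities between a profile and each memory row."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-8)
    scores = [cosine(profile, row) for row in profile_memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def personalized_init(global_params, profile, profile_memory, param_memory, tau=0.5):
    """Shift the shared initialization by a profile-dependent bias term.

    Assumption: the bias is the attention-weighted sum of the rows of
    param_memory, and it is applied as global_params - tau * bias.
    """
    att = attention(profile, profile_memory)
    bias = [sum(a * row[j] for a, row in zip(att, param_memory))
            for j in range(len(global_params))]
    return [g - tau * b for g, b in zip(global_params, bias)]

# Two user types (rows), each paired with a stored parameter bias (hypothetical values).
profile_memory = [[1.0, 0.0], [0.0, 1.0]]
param_memory = [[0.2, 0.0, 0.0], [0.0, 0.0, 0.4]]
global_params = [1.0, 1.0, 1.0]

init_a = personalized_init(global_params, [1.0, 0.0], profile_memory, param_memory)
init_b = personalized_init(global_params, [0.0, 1.0], profile_memory, param_memory)
```

Because both memories are shared across users, two users with different profiles obtain different starting points from the same global parameters without any per-user storage.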

5. Conclusion

A common challenge for most current recommender systems is the cold-start problem. Recently, inspired by the progress of few-shot learning, some works have brought the meta-optimization idea into recommendation. The core idea is to learn a globally shared initialization parameter of the recommender model for all users and then update the local parameters to learn a personalized recommender model for each user. A common limitation of most current works is that the globally shared initialization parameter is the same for all users, which may lead the recommendation model into local optima for users with distinct preference patterns. To address this issue, we design two memory matrices to provide personalized initialization for recommender models: a feature-specific memory that provides a personalized bias term when initializing the recommender model, and a task-specific memory that guides the recommendation process. We evaluate the proposed methods on two publicly available datasets and provide details of the implementation. The experimental results show the effectiveness of our method.


  • Anava et al. (2015) Oren Anava, Shahar Golan, Nadav Golbandi, Zohar Karnin, Ronny Lempel, Oleg Rokhlenko, and Oren Somekh. 2015. Budget-constrained item cold-start handling in collaborative filtering recommenders via optimal design. In Proceedings of the 24th International Conference on World Wide Web. 45–54.
  • Baltrušaitis et al. (2018) Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 41, 2 (2018), 423–443.
  • Bharadhwaj (2019) Homanga Bharadhwaj. 2019. Meta-Learning for User Cold-Start Recommendation. In 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  • Bokde et al. (2015) Dheeraj Bokde, Sheetal Girase, and Debajyoti Mukhopadhyay. 2015. Matrix factorization model in collaborative filtering algorithms: A survey. Procedia Computer Science 49 (2015), 136–146.
  • Chen et al. (2018a) Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. 2018a. Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876 (2018).
  • Chen et al. (2018b) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018b. Sequential recommendation with user memory networks. In Proceedings of the eleventh ACM international conference on web search and data mining. ACM, 108–116.
  • Chou et al. (2016) Szu-Yu Chou, Yi-Hsuan Yang, Jyh-Shing Roger Jang, and Yu-Ching Lin. 2016. Addressing cold start for next-song recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. 115–118.
  • Du et al. (2019) Zhengxiao Du, Xiaowei Wang, Hongxia Yang, Jingren Zhou, and Jie Tang. 2019. Sequential Scenario-Specific Meta Learner for Online Recommendation. arXiv preprint arXiv:1906.00391 (2019).
  • Ebesu and Fang (2017) Travis Ebesu and Yi Fang. 2017. Neural Semantic Personalized Ranking for item cold-start recommendation. Information Retrieval Journal 20, 2 (2017), 109–131.
  • Elahi et al. (2016) Mehdi Elahi, Francesco Ricci, and Neil Rubens. 2016. A survey of active learning in collaborative filtering recommender systems. Computer Science Review 20 (2016), 29–50.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1126–1135.
  • Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401 (2014).
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web. 173–182.
  • Lee et al. (2019) Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-Learned User Preference Estimator for Cold-Start Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1073–1082.
  • Li et al. (2018) Cheng-Te Li, Chia-Tai Hsu, and Man-Kwan Shan. 2018. A Cross-Domain Recommendation Mechanism for Cold-Start Users Based on Partial Least Squares Regression. ACM Transactions on Intelligent Systems and Technology (TIST) 9, 6 (2018), 1–26.
  • Li et al. (2019) Yunxiao Li, Jiaxing Song, Xiao Li, and Weidong Liu. 2019. Gated Sequential Recommendation with Dynamic Memory Network. In 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  • Ma et al. (2020) Yutao Ma, Xiao Geng, and Jian Wang. 2020. A Deep Neural Network With Multiplex Interactions for Cold-Start Service Recommendation. IEEE Transactions on Engineering Management (2020).
  • Mirbakhsh and Ling (2015) Nima Mirbakhsh and Charles X Ling. 2015. Improving top-n recommendation for cold-start users via cross-domain information. ACM Transactions on Knowledge Discovery from Data (TKDD) 9, 4 (2015), 1–19.
  • Mishra et al. (2017) Nitin Mishra, Vimal Mishra, and Saumya Chaturvedi. 2017. Tools and techniques for solving cold start recommendation. In Proceedings of the 1st International Conference on Internet of Things and Machine Learning. 1–6.
  • Roy and Guntuku (2016) Sujoy Roy and Sharath Chandra Guntuku. 2016. Latent factor representations for cold-start video recommendation. In Proceedings of the 10th ACM conference on recommender systems. 99–106.
  • Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning. 1842–1850.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems. 2440–2448.
  • Vanschoren (2018) Joaquin Vanschoren. 2018. Meta-learning: A survey. arXiv preprint arXiv:1810.03548 (2018).
  • Wang et al. (2019) Xinghua Wang, Zhaohui Peng, Senzhang Wang, Philip S. Yu, Wenjing Fu, Xiaokang Xu, and Xiaoguang Hong. 2019. CDLFM: cross-domain recommendation for cold-start users via latent feature mapping. Knowledge and Information Systems (2019), 1–28.
  • Wang and Yao (2019) Yaqing Wang and Quanming Yao. 2019. Few-shot learning: A survey. arXiv preprint arXiv:1904.05046 (2019).
  • Wei et al. (2016) Jian Wei, Jianhua He, Kai Chen, Yi Zhou, and Zuoyin Tang. 2016. Collaborative filtering and deep learning based hybrid recommendation for cold start problem. In 2016 IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 874–877.
  • Wei et al. (2017) Jian Wei, Jianhua He, Kai Chen, Yi Zhou, and Zuoyin Tang. 2017. Collaborative filtering and deep learning based recommendation system for cold start items. Expert Systems with Applications 69 (2017), 29–39.
  • Wu and Goodman (2018) Mike Wu and Noah Goodman. 2018. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems. 5575–5585.
  • Xiong et al. (2016) Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In International conference on machine learning. 2397–2406.
  • Yao et al. (2018) Lina Yao, Quan Z Sheng, Xianzhi Wang, Wei Emma Zhang, and Yongrui Qin. 2018. Collaborative location recommendation by integrating multi-dimensional contextual information. ACM Transactions on Internet Technology (TOIT) 18, 3 (2018), 1–24.
  • Yuan et al. (2016) Jianbo Yuan, Walid Shalaby, Mohammed Korayem, David Lin, Khalifeh AlJadda, and Jiebo Luo. 2016. Solving cold-start problem in large-scale recommendation engines: A deep learning approach. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 1901–1910.
  • Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 5.
  • Zhao et al. (2019) Liang Zhao, Yang Wang, Daxiang Dong, and Hao Tian. 2019. Learning to Recommend via Meta Parameter Partition. arXiv preprint arXiv:1912.04108 (2019).
  • Zhu et al. (2019) Yu Zhu, Jinghao Lin, Shibi He, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2019. Addressing the item cold-start problem by attribute-driven active learning. IEEE Transactions on Knowledge and Data Engineering (2019).

Appendix A

A.1. Dataset Details

A.1.1. Movielens.

The user profile includes gender (male or female), age group (under 18, 18-25, 26-35, 36-45, 46-50, 51-56, 56+), and 21 different occupations (e.g. student, engineer). The item profile includes the movie's release year, genres (e.g. action or adventure), director (one or more directors), and content rating (e.g. R, PG). The mean rating value is 3.58, and the ratings were made between 2000-04-26 and 2003-03-01. In the raw data, each user has at least 20 ratings. We sort the items rated by a user according to their review time and trim the dataset so that each user has 20 rating records, which forces the model to learn from few-shot cases.

A.1.2. Book-crossing

The raw Book-crossing data contains many missing and misleading values. The user information includes user age and location. The original user age ranges from 0 to 237, which is unrealistic. Thus, we restrict the age to the interval from 5 to 110. The location includes city, state, and country. We keep only the country information, both because of missing values and for privacy considerations. The filtered data has users from 65 different countries. The item information includes the publication year (ranging from 1500 to 2010), author (25593 unique authors), and publisher (5254 unique publishers). The Book-crossing dataset does not contain time information for the ratings, so we assume that the records in the public dataset are stored in order of review time, and we keep the order of the items rated by a user when building our dataset. We also keep 20 rating records for each user for few-shot learning.
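The trimming and age filtering described for the two datasets can be sketched as below. The record layout, the function names, and the choice to keep the earliest records are assumptions for illustration; the text does not state which 20 records are kept.

```python
from collections import defaultdict

def trim_ratings(ratings, n_keep=20):
    """Keep n_keep ratings per user, ordered by time.

    ratings: iterable of (user_id, item_id, rating, timestamp) tuples.
    Assumptions: users with fewer than n_keep ratings are dropped, and the
    earliest n_keep records are the ones kept.
    """
    by_user = defaultdict(list)
    for user_id, item_id, rating, timestamp in ratings:
        by_user[user_id].append((timestamp, item_id, rating))
    trimmed = {}
    for user_id, records in by_user.items():
        if len(records) < n_keep:
            continue
        records.sort(key=lambda r: r[0])  # stable sort by review time
        trimmed[user_id] = [(item, rating) for _, item, rating in records[:n_keep]]
    return trimmed

def valid_age(age, low=5, high=110):
    """Age filter with the interval used for Book-crossing (5 to 110)."""
    return age is not None and low <= age <= high
```

For a dataset without timestamps, such as Book-crossing, the position of each record in the file could be passed as the `timestamp` field so that the stored order is preserved.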

A.2. Compared methods

A.2.1. Code.

The code of MeLU and Meta is provided by the authors. We modify the input and evaluation modules to fit our experimental settings. For MetaCS-DNN, which is based on an idea similar to MeLU, we modify the code of MeLU to implement it. As for RUM, the code is not published; thus, we implement it with PyTorch.

A.2.2. Parameter configuration.

For MeLU, we use the default parameter settings in the published code. For Meta, we run the code with the parameter settings for the Movielens dataset. For the other two comparison methods, we use the suggested parameters when reproducing them. For MetaCS-DNN, the global update learning rate is set to 0.4. For RUM, the learning rate of SGD is determined by grid search over [1, 0.1, 0.01, 0.001, 0.0001], and the number of memory slots K is empirically set to 20. The MERGE parameter is searched in the range [0, 1] with step 0.1. The embedding dimension and regularization parameters are determined by grid search over [10, 20, 30, 40, 50] and [0.1, 0.01, 0.001, 0.0001], respectively.

A.2.3. The evaluation metrics in cold-scenarios.

We provided our definitions of cold users and cold items in Section 3.1.4 and evaluated the performance under two evaluation metrics (see Eq. 13 and Eq. 14). Notice that the users could be either warm users or cold users, and the items rated by a user could be either warm items or cold items, so we label each rating record as belonging to one of four cold-start scenarios. For example, for a cold user, the rated items are labeled as either warm items (C-W) or cold items (C-C). The calculation of the metric in Eq. 13 in the four cold-start scenarios is easy: we can simply calculate the mean value over the ratings in each of the four scenarios (i.e. W-W, W-C, C-W, and C-C). For the metric in Eq. 14, however, the query set for a user may contain fewer records than the metric requires in some scenarios. Thus, for each scenario, we concatenate the results, separate them into small clips, calculate the metric for each clip, and then take the mean value over the clips as the result.
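The clip-based evaluation described above can be sketched as follows. The text here does not name the ranking metric, so the sketch assumes nDCG@k with linear gain; the clip size, the handling of a short final clip, and all function names are likewise assumptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain (an assumed variant)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(true_ratings, predicted_scores, k):
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_ratings[i] for i in order[:k]]
    ideal = sorted(true_ratings, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

def clipped_metric(records, clip_size, k):
    """Pool one scenario's records, split them into clips, and average the metric.

    records: list of (true_rating, predicted_score) pairs concatenated over
    all users for one scenario (e.g. C-W).
    Assumption: a final clip with fewer than k records is dropped.
    """
    clips = [records[i:i + clip_size] for i in range(0, len(records), clip_size)]
    clips = [c for c in clips if len(c) >= k]
    scores = [ndcg_at_k([t for t, _ in c], [p for _, p in c], k) for c in clips]
    return sum(scores) / len(scores) if scores else float("nan")
```

Pooling before clipping means that a single clip can mix records from several users, which is the price paid for obtaining clips long enough to rank.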

A.3. Parameter settings.

Our code is implemented with PyTorch 1.4.0 in Python 3.7 and runs on a Linux server with an NVIDIA TITAN X. The processed datasets take about 2GB of hard disk space. The default activation function is LeakyReLU. For the Movielens dataset: the dimension of the embedding is set to 100; the default number of layers is 2; the local learning rate is 0.01 and the global learning rate is 0.05; the hyper-parameters , , and are set to 0.5, 0.05, 0.1, and 0.1, respectively. For the Book-crossing dataset: the dimension of the embedding is set to 50; the default number of layers is 2; the local learning rate is 0.01 and the global learning rate is 0.01; the hyper-parameters , , and are set to 0.5, 0.1, 0.1, and 0.15, respectively. The random seed may affect the results: sometimes the results may be better or worse than those we present. The running time for one epoch over all users is about half an hour, so one strategy is to update the global parameters after learning from a batch of training users. During the test process, the model updates as in Algorithm 2.


Input: Testing user set ; User profile ; Item profile ; User ratings ; Hyper-parameters , , , , , ; Meta parameters , , , , ,
Output: Predicted user preference
for each user in the testing user set do
      Calculate the bias term by Eq. (3-4);
      Initialize the local parameters , , by Eq. (9-10);
      Initialize the preference memory by Eq. (7);
      for each rating record in the user's support set do
            Get the user and item embeddings by Eq. (1);
            Get the prediction by Eq. (2);
            Local update ;
            Local update , , by: ;
      end for
      for each rating record in the user's query set do
            Get the user and item embeddings by Eq. (1);
            Get the prediction by Eq. (2);
      end for
end for
Algorithm 2 Testing process of MAMO