ICMT: Item Cluster-Wise Multi-Objective Training for Long-Tail Recommendation

by   Yule Wang, et al.

Item recommendation based on historical user-item interactions is of vital importance for web-based services. However, the data used to train a recommender system (RS) suffers from severe popularity bias. Interactions of a small fraction of popular (head) items account for almost the whole training data. Normal training methods based on such biased data tend to repeatedly generate recommendations from the head items, which further exacerbates the data bias and hinders the exploration of potentially interesting niche (tail) items. In this paper, we explore the central theme of long-tail recommendation. Through an empirical study, we find that head items are very likely to be recommended because the gradients coming from head items dominate the overall gradient update process, which further affects the optimization of tail items. To this end, we propose a general framework, namely Item Cluster-Wise Multi-Objective Training (ICMT), for long-tail recommendation. First, disentangled representation learning is utilized to identify the popularity impact behind user-item interactions. Then item clusters are adaptively formulated according to the disentangled popularity representation. After that, we consider the learning over the whole training data as a weighted aggregation of multiple item cluster-wise objectives, which can be resolved through a Pareto-Efficient solver for a harmonious overall gradient direction. Besides, a contractive loss focusing on model robustness is derived as a regularization term. We instantiate ICMT with three state-of-the-art recommendation models and conduct experiments on three real-world datasets. The proposed ICMT significantly improves the overall recommendation performance, especially on tail items.




I Introduction

Recommender systems play a crucial role in online services and platforms by addressing the problem of information overload [rendle2010factorization, mnih2007probabilistic, he2017neural]. A recommender system (RS) is trained using historical user-item interactions with the target of providing the most interesting items given the current user state. However, there is a self-loop in the training of an RS [chen2020bias]. The exposure mechanism of the RS affects the collection of user-item interactions, which are then fed back as the training data for the RS itself. Such a self-loop leads to severe popularity bias in the training data. Specifically, the item frequency distribution in the training data is an extreme long-tail distribution [AbdollahpouriBM17]. A small fraction of popular (head) items accounts for almost the whole training dataset. Normal learning-to-rank methods [rendle2012bpr] based on such biased data push head items towards much higher ranking scores than other items. As a result, popular items are repeatedly recommended, which further intensifies the popularity bias and the “rich get richer” Matthew effect [chen2020bias].

Nevertheless, recommending tail items plays an important role in improving system performance. From the user's perspective, repetitive popular recommendations quickly become boring, and the tail contains potentially relevant items that can lead to greater user satisfaction [li2017two, adomavicius2011improving]. For service providers, recommending tail items can yield more marginal profit than recommending head items [anderson2006long]. Generally speaking, the recommendation task is a typical exploitation-exploration problem. Long-tail recommendation benefits both users and service providers through better exploration, which eventually turns into larger profits in the long run [jannach2015recommenders].

Existing methods that focus on long-tail recommendation are usually based on metrics like recommendation diversity and novelty [wu2019pd, zhou2010solving, ribeiro2014multiobjective]. However, these metrics are infeasible to directly optimize. Recommendations based on promoting such metrics could lead to a huge sacrifice of accuracy [wang2016multi]. Besides, the definition of diversity or novelty is still an open research problem without a standard benchmark [ge2010beyond].

Fig. 1: Empirical study on Gowalla. (a) Gradient norm distribution. (b) Gradient conflict.

In this paper, we analyze the popularity bias problem of RS from an optimization perspective. We conduct an empirical study on the Gowalla dataset (https://snap.stanford.edu/data/loc-gowalla.html), on which we train the state-of-the-art LightGCN [he2020lightgcn] model. Figure 1(a) visualizes the norm of the gradients coming from different items. Figure 1(b) shows the gradients from a popular item and two tail items in the dataset. We make the following observations:

  • Head items have much larger gradient norm than tail items, indicating that the overall gradient direction is actually dominated by head items.

  • There are potential conflicts between gradients coming from head items and tail items. That is to say, updating model parameters based on gradients dominated by head items could potentially sacrifice the learning of tail items.

Motivated by the above observations, we propose Item Cluster-Wise Multi-Objective Training (ICMT) for long-tail recommendation to address the popularity bias. Note that ICMT is a general learning framework and can be instantiated with different specific models, such as Probabilistic Matrix Factorization (PMF) [koren2009matrix], NeuMF [he2017neural], etc. More precisely, a universal popularity embedding is first involved in the ranking score prediction. This popularity embedding is then disentangled from the user interest embedding to model the popularity impact. Based on the disentangled representations, we split items into different clusters according to their correlation with the popularity embedding. We then consider the learning on each item cluster as an optimization objective. As a result, the learning over the whole training data can be seen as a weighted aggregation of multiple cluster-wise objectives. We then utilize a Pareto-Efficient (PE) solver to adaptively learn the weight of each objective. Through the PE solver, we can find a solution in which every cluster-wise objective is optimized without hurting the others. In other words, the learning of head items would not affect the learning of tail items. Finally, a contractive loss focusing on model robustness is introduced as a regularization term to further prevent the potential overfitting of head items.

To summarize, this work makes the following contributions:

  • We propose to tackle the long-tail recommendation task from an item cluster-wise optimization perspective. We show that head items are highly likely to be recommended due to the domination of their gradients, providing new directions to address the popularity bias of RS.

  • We propose a general long-tail recommendation framework ICMT which is featured with popularity disentanglement, cluster-wise multi-objective optimization, and robust contractive regularization.

  • We instantiate ICMT with three state-of-the-art recommendation models and conduct experiments on three real-world datasets. Experimental results demonstrate that ICMT significantly alleviates the popularity bias problem in recommender systems.

II Related Work

II-A Methods for long-tail recommendation

Due to the popularity bias and the exposure mechanism of RS, tail items usually have much less training data. As a result, generating recommendations from head items is a conservative but effective way to improve recommendation accuracy (e.g., recall) [adomavicius2011improving]. To get rid of the conformity influence, some existing methods argue that other metrics like diversity [kaminskas2016diversity, hurley2013personalised, chen2020improving] and novelty [ribeiro2012pareto, ribeiro2014multiobjective] should be considered simultaneously as an additional regularization term. For example, [wang2016multi] proposed a metric based on the unpopularity of items. [ribeiro2012pareto, ribeiro2014multiobjective, shi2013trading] consider diversity as the item difference within one recommendation list and novelty as the difference across lists. However, such metrics are usually infeasible to optimize directly. Also, [jang2020cities, zhang2021model] utilized knowledge transfer from many-shot head items to enhance the quality of tail-item embeddings. Inverse propensity scoring (IPS) is a practical choice for industrial products [huang2006correcting], since it is relatively easy to reweight training samples and ameliorate the distribution shift problem. Nevertheless, it suffers severely from high variance.

Besides, there are also methods based on additional knowledge input such as side-information, user feedback, and niche item clustering to relieve the cold start problem of tail items [bai2017dltsr, kim2019sequential]. However, none of the above work emphasized easing the neglect of tail items during the gradient update process.

II-B Multi-Objective Optimization in RS

In addition to recommendation accuracy, which is the main objective of recommendation, some research has focused on other objectives such as availability, profitability, and usefulness [jambor2010optimizing, mcnee2006being]. Metrics about long-tail recommendation such as diversity and novelty are also considered as objectives [ribeiro2012pareto, ribeiro2014multiobjective]. Recently, user-oriented objectives such as user sentiment have been considered for better recommendation [musto2017multi, rodriguez2012multiple]. For a commercial RS, CTR (Click Through Rate) and GMV (Gross Merchandise Volume) are included in [nguyen2017multi, lin2019pareto] to gain higher profits.

The optimization methods for multiple objectives can be divided into two categories: heuristic search [zitzler2001spea2] and scalarization [desideri2009multiple, xiao2017fairness]. Evolutionary algorithms are popular choices for heuristic search; they deal simultaneously with a set of possible solutions in a single run [wang2016multi]. But these algorithms usually depend heavily on heuristic experience [kendall2018multi, zitzler2001spea2]. Scalarization methods transform multiple objectives into a single one with a weighted sum of all objective functions [xiao2017fairness]. Then the overall objective function is optimized to be Pareto-Efficient, where no single objective can be further improved without hurting the others [lin2019pareto].

In this paper, we aim to address the long-tail recommendation task from an optimization perspective. Unlike existing methods which introduce new metrics as the objectives [lin2019pareto, ribeiro2012pareto, ribeiro2014multiobjective], we consider the learning on a cluster of items as an objective. After that, we focus on finding an optimal solution in which the learning over one item cluster would not affect the learning on the other one.

III Methodologies

Fig. 2: A holistic illustration of ICMT. The reasons behind user-item interactions are first disentangled into popularity and interest factors. Then, interest factors are modeled through a base recommendation model (e.g., PMF, NeuMF, LightGCN), in which the user interest embeddings and item embeddings are obtained. Thereafter, items are clustered according to the correlation between item embeddings and the popularity embedding. Then the training objective is formulated as a weighted sum of cluster-wise objectives, which is optimized by a PE solver. Finally, a robust contractive loss is added as regularization.

In this section, we describe details of the proposed ICMT framework. Figure 2 illustrates the structure of ICMT, which contains a base recommendation model with the structural disentanglement of popularity, a PE solver to find the non-conflict gradient directions for multiple item cluster-wise objectives and a hybrid training loss consisting of weighted binary cross-entropy (BCE) and robust contractive regularization.

III-A Base Model with Popularity Disentanglement

A traditional recommender base model mainly consists of a user side $f(\cdot)$ and an item side $g(\cdot)$. For each user-item pair $(u, i)$, the core task is to map user $u$ and item $i$ into a user interest embedding $\mathbf{e}_u \in \mathbb{R}^d$ and an item embedding $\mathbf{e}_i \in \mathbb{R}^d$ respectively, where $d$ denotes the embedding size. It can be formulated as:

$$\mathbf{e}_u = f(u), \quad \mathbf{e}_i = g(i). \tag{1}$$

In traditional base models, when user and item embeddings are obtained, a scoring function $s(\cdot, \cdot)$ is utilized to calculate the relevance score for a user and an item (e.g., inner product [mnih2007probabilistic, he2020lightgcn]), which can be formulated as:

$$\hat{y}_{ui} = s(\mathbf{e}_u, \mathbf{e}_i). \tag{2}$$

Recently, many works design novel $f$ and $g$ to extract better features. For example, neural matrix factorization (NeuMF) [he2017neural] adopts MLPs as $f$ and $g$ to extract user and item features respectively, and LightGCN [he2020lightgcn] introduces the graph convolutional mechanism on $f$ and $g$. However, the popularity impact is ignored by this approach, where different reasons for an interaction are bundled together as unified user representations. Therefore, we separate the representations of user interest and popularity by assigning each user a unique interest embedding $\mathbf{e}_u$ while maintaining a general popularity embedding $\mathbf{e}_p$. The latter represents the public preference that participates with the item embedding in every interaction, which is denoted as function $s_p(\mathbf{e}_p, \mathbf{e}_i)$. Through aggregation, the prediction function of ICMT can be formulated as:

$$\hat{y}_{ui} = s(\mathbf{e}_u, \mathbf{e}_i) + \alpha \, s_p(\mathbf{e}_p, \mathbf{e}_i), \tag{3}$$

where $\alpha$ is a weighting parameter controlling the ratio of popularity impact. During the inference stage, only the interest part $s(\mathbf{e}_u, \mathbf{e}_i)$ is used for the final recommendation, so that the popularity embedding is disentangled. Without loss of generality, we implement $s_p$ as matrix factorization in ICMT.
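As a rough illustration of the aggregation in Eq.(3), the sketch below scores all items for one user. It assumes inner-product scoring for both the interest part and the popularity part, and an additive combination weighted by a scalar `alpha`; the function name `icmt_scores` and the toy dimensions are hypothetical and not taken from the paper.

```python
import numpy as np

def icmt_scores(user_emb, item_embs, pop_emb, alpha, training=True):
    """Score every item for one user.

    interest part:   <e_u, e_i>  (inner-product scoring, as in matrix factorization)
    popularity part: <e_p, e_i>  (one popularity embedding shared by all users)
    """
    interest = item_embs @ user_emb
    if not training:
        # Inference: rank by the interest part only.
        return interest
    popularity = item_embs @ pop_emb
    # Assumed additive aggregation of the two parts.
    return interest + alpha * popularity

rng = np.random.default_rng(0)
d, n_items = 8, 5
e_u = rng.normal(size=d)             # user interest embedding
E_i = rng.normal(size=(n_items, d))  # item embeddings, one row per item
e_p = rng.normal(size=d)             # shared popularity embedding

train_scores = icmt_scores(e_u, E_i, e_p, alpha=0.01, training=True)
infer_scores = icmt_scores(e_u, E_i, e_p, alpha=0.01, training=False)
```

The only difference between the two calls is the popularity term, which is what allows it to be dropped at inference time.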

III-B Item Cluster-Wise Pareto Efficiency

In this paper, we consider training the RS from implicit feedback (i.e., observed interactions are considered as positive samples while negative samples are sampled from missing interactions). For the base model extracting user interest and item representations, we start from the normal training setting, in which each training sample has equal weight (i.e., 1). The total loss function is defined on the whole training data as

$$\mathcal{L}(\theta) = \sum_{(u,i) \in \mathcal{D}^+ \cup \mathcal{D}^-} \ell(u, i; \theta), \tag{4}$$

where $\mathcal{D}^+$ and $\mathcal{D}^-$ denote the set of positive samples and the set of sampled negative samples, respectively. $\ell(u, i; \theta)$ is the specific loss (i.e., cross-entropy) of pair $(u, i)$ and $\theta$ denotes the base model parameters. Then we expand the loss on positive examples (we do not consider the loss on negative examples since we assume that they are sampled uniformly) as

$$\mathcal{L}^+(\theta) = \sum_{(u,i) \in \mathcal{D}^+} \ell(u, i; \theta) = \sum_{i \in \mathcal{I}} \sum_{u \in \mathcal{U}_i} \ell(u, i; \theta), \tag{5}$$

where $\mathcal{I}$ denotes the whole item set and $\mathcal{U}_i$ is the set of users who have interacted with item $i$.

Due to the popularity bias in the training data, the size of $\mathcal{U}_i$ is extremely imbalanced across items. For head items, $\mathcal{U}_i$ contains many more samples than for tail items. As a result, when we perform updates according to $\mathcal{L}^+(\theta)$, the majority of gradients come from the loss on head items. When there are conflicts between gradients coming from head items and gradients coming from tail items, the normal training setting would sacrifice the learning on tail items to achieve a lower overall loss. That is to say, the overall gradient direction is actually dominated by head items.
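This effect can be reproduced on a toy example. The sketch below assumes inner-product scoring with a log-sigmoid loss on positive pairs and treats the user embeddings as the shared parameters; the interaction counts (200 versus 5), the embedding scale, and all names are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def item_grad_wrt_users(user_embs, item_emb, users_of_item):
    """Gradient of L_i = sum over u in U_i of -log(sigmoid(e_u . e_i)) w.r.t. all user embeddings."""
    grad = np.zeros_like(user_embs)
    for u in users_of_item:
        score = user_embs[u] @ item_emb
        # d/dx of -log(sigmoid(x)) is -(1 - sigmoid(x)); chain rule gives the item embedding.
        grad[u] = -(1.0 - sigmoid(score)) * item_emb
    return grad

rng = np.random.default_rng(0)
n_users, d = 300, 16
user_embs = rng.normal(scale=0.1, size=(n_users, d))
head_emb, tail_emb = rng.normal(scale=0.1, size=(2, d))

head_users = np.arange(0, 200)    # hypothetical head item: 200 interactions
tail_users = np.arange(195, 200)  # hypothetical tail item: 5 interactions

g_head = item_grad_wrt_users(user_embs, head_emb, head_users)
g_tail = item_grad_wrt_users(user_embs, tail_emb, tail_users)

head_norm = np.linalg.norm(g_head)
tail_norm = np.linalg.norm(g_tail)
# Cosine similarity of the flattened gradients; a negative value indicates a conflict.
cosine = (g_head.ravel() @ g_tail.ravel()) / (head_norm * tail_norm)
print(head_norm, tail_norm, cosine)
```

Because the head item contributes forty times as many loss terms, its gradient norm is much larger, so an unweighted sum of the two gradients points almost entirely in the head item's direction.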

To address this problem, we propose to consider the learning on each item as an objective and formulate the training over the whole (positive) data as a weighted sum of these multiple item-wise objectives:

$$\mathcal{L}^+(\theta) = \sum_{i \in \mathcal{I}} w_i \, \mathcal{L}_i(\theta), \quad \mathcal{L}_i(\theta) = \sum_{u \in \mathcal{U}_i} \ell(u, i; \theta). \tag{6}$$

$\mathcal{L}_i(\theta)$ can be seen as the initial item-wise optimization objective. $w_i$ is the weight for objective $\mathcal{L}_i$.

We aim to find a set of weights $\{w_i\}$ so that each item-wise objective can be optimized without hurting the others. In real-world scenarios, there is usually a huge number of items. As a result, naively considering each $\mathcal{L}_i$ as an optimization objective could heavily increase the computational complexity of finding $\{w_i\}$. To address this problem, we first split the whole item set into $K$ clusters and then consider the learning on each cluster of items.

Besides, in the total learning space $\theta$, there are some parameters $\theta_s$ which are shared between objectives and some parameters $\theta_i$ which are only related to item $i$. We just need to consider $\theta_s$ for the selection of objective weights, since $\theta_i$ would not affect the learning of other items. It is worth mentioning that though $\mathbf{e}_p$ is an embedding shared among all the objectives, its purpose is to evenly capture the popularity preference among all the items. Therefore, ICMT does not apply the re-weighting strategy to $\mathbf{e}_p$ and keeps it in the parameter set $\theta_i$. Taking all the above factors into account, we formulate our loss function to train $\theta_s$ on positive samples as

$$\mathcal{L}^+(\theta_s) = \sum_{k=1}^{K} \omega_k \, \mathcal{L}_{C_k}(\theta_s), \quad \mathcal{L}_{C_k}(\theta_s) = \sum_{i \in \mathcal{C}_k} \mathcal{L}_i(\theta_s), \tag{7}$$

where $\omega_k$ is the weight for the $k$-th item cluster and $\mathcal{C}_k$ denotes the set of items in this cluster. When $\omega_k = 1$ for every $k$, we recover the normal training weight setting. $\mathcal{L}_{C_k}$ is the learning objective for cluster $k$. For the training of item-specific parameters $\theta_i$, we still use the $\mathcal{L}^+$ shown in Eq.(5).

In the following, we will first describe the item clustering strategy and then describe how to find for each cluster objective.

III-B1 Adaptive Item Clustering

We cluster the items based on the observation that items whose embeddings are similar to the popularity embedding should be separated from items whose embeddings are dissimilar to it, so that they form different item clusters. We implement this idea by taking the element-wise product of each item embedding $\mathbf{e}_i$ with the popularity embedding $\mathbf{e}_p$. The final item clustering embedding $\mathbf{c}_i$ is defined as follows:

$$\mathbf{c}_i = \frac{\mathbf{e}_i \odot \mathbf{e}_p}{\mathbf{e}_i^\top \mathbf{e}_p}, \tag{8}$$

where $\odot$ denotes the element-wise product, and this product is divided by the inner product (i.e., $\mathbf{e}_i^\top \mathbf{e}_p$) to normalize the scale impact. The clustering embedding mainly correlates with the item's similarity to the popularity embedding in terms of direction. Then, we perform the K-Means clustering algorithm [likas2003global] using this item clustering embedding $\mathbf{c}_i$.
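A minimal sketch of this clustering step is given below, assuming NumPy arrays for the item embeddings and the popularity embedding. The small `eps` guard against a near-zero inner product and the bare-bones Lloyd's iteration standing in for K-Means are additions for illustration; neither the function names nor these details come from the paper.

```python
import numpy as np

def clustering_embedding(item_embs, pop_emb, eps=1e-8):
    """Element-wise product of each item embedding with the popularity embedding,
    divided by their inner product (one row per item)."""
    prod = item_embs * pop_emb               # element-wise product, shape (n_items, d)
    inner = prod.sum(axis=1, keepdims=True)  # inner product of each item with pop_emb
    # Added safeguard: avoid dividing by a (near-)zero inner product.
    inner = np.where(np.abs(inner) < eps, eps, inner)
    return prod / inner

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns a cluster index for each point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
E_i = rng.normal(size=(50, 8))   # toy item embeddings
e_p = rng.normal(size=8)         # toy popularity embedding
C = clustering_embedding(E_i, e_p)
labels = kmeans(C, k=3)
```

One consequence of the normalization is that the entries of each clustering embedding sum to 1, so the clusters depend on how the agreement with the popularity embedding is distributed across dimensions rather than on embedding magnitude.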

III-B2 Pareto-Efficient Solver

In the following, we describe how to find $\omega_k$ for each cluster objective. First, we provide a brief introduction to Pareto-Efficiency and some related concepts.

Given a system that aims to minimize a series of objective functions $\mathcal{L}_1, \dots, \mathcal{L}_K$, Pareto-Efficiency is a state in which it is impossible to improve one objective without hurting other objectives. Formally, we provide the following definition:

Definition 1

For a minimization task of multiple objectives, let $\theta_a$ and $\theta_b$ denote two solutions. $\theta_a$ dominates $\theta_b$ if and only if $\mathcal{L}_1(\theta_a) \le \mathcal{L}_1(\theta_b), \dots, \mathcal{L}_K(\theta_a) \le \mathcal{L}_K(\theta_b)$.

Then the concept of Pareto-Efficiency is defined as:

Definition 2

A solution $\theta^*$ is Pareto-Efficient if and only if there is no other solution that dominates $\theta^*$.

It is worth mentioning that Pareto-Efficient solutions are not unique, and the set of all such solutions is named the “Pareto Frontier”. In this paper, we aim to find $\{\omega_k\}$ so that the solution of each cluster-wise objective is Pareto-Efficient, i.e., Item Cluster-Wise Pareto-Efficiency.
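The two definitions can be made concrete with a small check over candidate loss vectors. The sketch below uses the common variant of dominance that additionally requires strict improvement on at least one objective, so that a solution does not dominate an identical one; the loss values and function names are made up for illustration.

```python
import numpy as np

def dominates(losses_a, losses_b):
    """True if a is no worse than b on every objective and strictly better on at least one."""
    a, b = np.asarray(losses_a), np.asarray(losses_b)
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_front(loss_vectors):
    """Indices of the solutions that no other solution dominates."""
    return [i for i, a in enumerate(loss_vectors)
            if not any(dominates(b, a) for j, b in enumerate(loss_vectors) if j != i)]

# Hypothetical (head-cluster loss, tail-cluster loss) pairs for four candidate solutions.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.6, 0.6), (0.2, 0.9)]
front = pareto_front(candidates)
print(front)
```

Here the third candidate is dominated by the second, while the remaining three trade one objective against the other and therefore all lie on the frontier.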

According to the definition of Pareto-Efficiency, we can use the Karush-Kuhn-Tucker (KKT) conditions [wu2007karush] to describe the property of $\omega_k$:

  • $\sum_{k=1}^{K} \omega_k = 1$ and $\omega_k \ge 0$ for all $k$

  • For the shared parameters $\theta_s$: $\sum_{k=1}^{K} \omega_k \nabla_{\theta_s} \mathcal{L}_{C_k}(\theta_s) = 0$

As a result, the task of finding $\omega_k$ can be formulated as

$$\min_{\omega_1, \dots, \omega_K} \left\| \sum_{k=1}^{K} \omega_k \nabla_{\theta_s} \mathcal{L}_{C_k}(\theta_s) \right\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} \omega_k = 1, \ \omega_k \ge 0 \ \ \forall k. \tag{9}$$

The optimization problem defined in Eq.(9) is equivalent to finding a minimum-norm point in the convex hull of the set of input points (i.e., the gradients $\nabla_{\theta_s} \mathcal{L}_{C_k}(\theta_s)$), which has been extensively studied [makimoto1994efficient].

Given the current $\theta_s$ in one training step, we utilize the Frank-Wolfe algorithm [jaggi2013revisiting] to solve the convex optimization problem of Eq.(9). Algorithm 1 shows the details of this process. Its time complexity is mostly determined by the number of objectives, the number of iterations, and the dimension of the parameter space $\theta_s$. Usually, the number of objectives is limited, therefore the running time of Algorithm 1 is negligible compared to the model training time cost.

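1: Input: shared parameters $\theta_s$, cluster-wise gradients $\nabla_{\theta_s} \mathcal{L}_{C_k}(\theta_s)$ for $k = 1, \dots, K$, iteration limit
2: Output: $\omega$: $[\omega_1, \dots, \omega_K]$
3: initialize $\omega = [1/K, \dots, 1/K]$
4: precompute $M$ with $M_{jk} = \nabla_{\theta_s} \mathcal{L}_{C_j}(\theta_s)^\top \nabla_{\theta_s} \mathcal{L}_{C_k}(\theta_s)$
5: repeat
6:     $t = \arg\min_j \sum_k \omega_k M_{jk}$
7:     $\gamma = \arg\min_{\gamma \in [0,1]} \big((1-\gamma)\omega + \gamma e_t\big)^\top M \big((1-\gamma)\omega + \gamma e_t\big)$, where $e_t$ is a vector whose elements are 1 at index $t$ and 0 otherwise.
8:     $\omega = (1-\gamma)\omega + \gamma e_t$
9: until $\gamma \approx 0$ or Number of Iterations Limit
Algorithm 1 PE-solver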
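The following NumPy sketch shows one way a Frank-Wolfe min-norm solver for Eq.(9) can look. It is an illustration under stated assumptions rather than the authors' implementation: each cluster-wise gradient with respect to the shared parameters is assumed to be flattened into one row of `grads`, and the names `pe_solver`, `max_iter`, and `tol` are hypothetical.

```python
import numpy as np

def pe_solver(grads, max_iter=100, tol=1e-6):
    """Frank-Wolfe search for the min-norm point in the convex hull of the rows of `grads`.

    grads: array of shape (K, m), one flattened gradient per objective.
    Returns weights w with w >= 0 and sum(w) == 1.
    """
    grads = np.asarray(grads, dtype=float)
    K = grads.shape[0]
    gram = grads @ grads.T               # gram[j, k] = g_j . g_k
    w = np.full(K, 1.0 / K)              # start from uniform weights
    for _ in range(max_iter):
        t = int(np.argmin(gram @ w))     # vertex minimizing the linearized objective
        e_t = np.zeros(K)
        e_t[t] = 1.0
        diff = e_t - w
        denom = diff @ gram @ diff       # squared norm of the move in gradient space
        if denom <= 1e-12:
            break
        # Exact line search on the quadratic objective, clipped to [0, 1].
        gamma = float(np.clip(-(w @ gram @ diff) / denom, 0.0, 1.0))
        if gamma < tol:
            break
        w = w + gamma * diff
    return w

# Two directly conflicting gradients: the min-norm combination cancels them out.
w_conflict = pe_solver([[2.0, 0.0], [-1.0, 0.0]])
# Two aligned gradients: all weight goes to the shorter one.
w_aligned = pe_solver([[1.0, 0.0], [3.0, 0.0]])
print(w_conflict, w_aligned)
```

In the conflicting case the solver down-weights the larger gradient so that the combined direction no longer favors it, which is the behavior the cluster-wise weighting relies on.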

III-C Robust Contractive Regularization

With head items dominating the gradient update process, it has been demonstrated that many state-of-the-art recommendation models are actually fragile and vulnerable to small fluctuations and changes from head items [yu2019vaegan]. In this section, we propose a simple yet effective penalty term to encourage model robustness and further reduce the side effects of popularity bias.

For a robust recommendation model, the user representation $\mathbf{e}_u$ should change only slightly when there is a tiny fluctuation of the item-specific parameter $\theta_i$. Meanwhile, to refrain from the popularity impact, the item representation $\mathbf{e}_i$ should also be less disturbed by the popularity embedding parameter $\mathbf{e}_p$. To this end, we define the contractive loss as the sum of the squared Jacobian matrix of $\mathbf{e}_u$ with respect to $\theta_i$ and of $\mathbf{e}_i$ with respect to $\mathbf{e}_p$:

$$\mathcal{L}_c = \left\| \frac{\partial \mathbf{e}_u}{\partial \theta_i} \right\|_F^2 + \left\| \frac{\partial \mathbf{e}_i}{\partial \mathbf{e}_p} \right\|_F^2. \tag{10}$$

This contractive loss is added as a regularization term to encourage more robust model training.

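III-D Training Details

1: Input: training data $\mathcal{D}^+$, $\mathcal{D}^-$, recommendation model, number of item clusters $K$, learning rate $\eta$ and all other hyperparameters
2: Output: parameters in the whole learning space $\theta$: {$\theta_s$, $\theta_i$}
3: repeat
4:     sample a mini-batch of $(u, i)$ from $\mathcal{D}^+$ and $\mathcal{D}^-$
5:     for each batch do
6:         compute $\mathbf{e}_u$, $\mathbf{e}_i$ and $\mathbf{e}_p$ through the base recommendation model
7:         compute $\hat{y}_{ui}$ according to Eq.(3)
8:         compute $\mathbf{c}_i$ according to Eq.(8)
9:         generate item clusters utilizing the K-Means algorithm according to $\mathbf{c}_i$
10:        run Alg. 1 to update $\omega$ = PEsolver($\theta_s$, $\{\mathcal{L}_{C_k}\}_{k=1}^{K}$)
11:        compute the training loss for $\theta_s$ according to Eq.(12)
12:        compute the training loss for $\theta_i$ according to Eq.(13)
13:        update $\theta_s$ with a gradient step of size $\eta$ on the loss of Eq.(12)
14:        update $\theta_i$ with a gradient step of size $\eta$ on the loss of Eq.(13)
15:    end for
16: until converge
17: generate recommendation according to $s(\mathbf{e}_u, \mathbf{e}_i)$
Algorithm 2 Training and inference of ICMT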

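In this paper, we use the BCE loss as the base loss to train the recommendation model. More precisely, the specific loss for a pair $(u, i)$ is formulated as

$$\ell(u, i; \theta) = -y_{ui} \log \sigma(\hat{y}_{ui}) - (1 - y_{ui}) \log\big(1 - \sigma(\hat{y}_{ui})\big), \tag{11}$$

where $y_{ui}$ is the label for the pair $(u, i)$ (i.e., $y_{ui} = 1$ if there is an interaction between user $u$ and item $i$, otherwise $y_{ui} = 0$), $\sigma(\cdot)$ is the sigmoid function, and $\hat{y}_{ui}$ is calculated according to Eq.(3).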
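For reference, a numerically stable way to evaluate a binary cross-entropy loss on a raw (pre-sigmoid) score is sketched below. The function name `bce_loss` is hypothetical; the use of `np.logaddexp` is an implementation choice for stability, not something prescribed by the paper.

```python
import numpy as np

def bce_loss(y, score):
    """Binary cross-entropy on a raw score s with label y in {0, 1}:
    -y * log(sigmoid(s)) - (1 - y) * log(1 - sigmoid(s)).

    Uses the identities -log(sigmoid(s)) = log(1 + exp(-s)) and
    -log(1 - sigmoid(s)) = log(1 + exp(s)) to avoid overflow.
    """
    y = np.asarray(y, dtype=float)
    s = np.asarray(score, dtype=float)
    return y * np.logaddexp(0.0, -s) + (1.0 - y) * np.logaddexp(0.0, s)

# A positive pair is cheap when its score is high and expensive when it is low.
loss_good = bce_loss(1, 10.0)
loss_bad = bce_loss(1, -10.0)
print(loss_good, loss_bad)
```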

Considering the robust contractive regularization, the final training function of ICMT regarding the shared parameter is formulated as


where is calculated according to Eq.(7), and are regularization coefficients. For item-specific parameter , the training function of ICMT is formulated as


where is calculated according to Eq.(5).

In each training step of ICMT, the popularity impact, unique user interests, and items are first mapped into latent representations $\mathbf{e}_p$, $\mathbf{e}_u$, and $\mathbf{e}_i$ by the recommendation model. The items are clustered through the K-Means algorithm according to their clustering embedding $\mathbf{c}_i$ in the form of Eq.(8). We then run the PE-solver according to Algorithm 1 to get the weight $\omega_k$ for item cluster $k$. Then, we calculate the prediction score $\hat{y}_{ui}$ according to Eq.(3). For the shared parameters $\theta_s$, we perform the update by minimizing the objective of Eq.(12), while for the item-specific parameters $\theta_i$, we perform the update by minimizing the objective of Eq.(13). Algorithm 2 illustrates the overall training and inference procedure of ICMT.

IV Experimental Setup

In this section, we conduct experiments aiming to answer the following research questions:

RQ1: How does the proposed ICMT perform compared with normal training and other long-tail recommendation algorithms?

RQ2: How do the components and hyper-parameters of ICMT affect the recommendation performance?

RQ3: Can the PE-solver find Pareto-Efficient solutions for multiple item cluster-wise objectives?

RQ4: How is the interpretability of ICMT?

IV-A Experimental Settings

Dataset        Last.Fm   Gowalla   Yelp2018
Users          1270      15612     31646
Items          4475      13701     21098
Interactions   156882    546299    1331183
Density        0.0276    0.00255   0.00199
TABLE I: Dataset statistics.

IV-A1 Datasets

We conduct experiments on three publicly accessible datasets: Last.Fm (https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip), Gowalla, and Yelp2018 (https://www.kaggle.com/yelp-dataset/yelp-dataset/version/7). The datasets vary in domains, platforms, and sparsity. Table I summarizes the statistics of the three datasets.

Last.Fm: This is a widely used dataset which contains

million ratings between users and movies. We binarize the ratings into implicit feedback. Interacted items are considered as positive samples. Due to the sparsity of the dataset, we use the 10-core setting, i.e., retaining users and items which have at least ten interactions.

Gowalla: This is the check-in dataset obtained from Gowalla, where users share their locations by checking-in behavior [liang2016modeling]. To ensure the quality of the dataset, we use the 20-core setting.

Yelp2018: This dataset is adopted from the 2018 edition of the Yelp challenge. Wherein, the local businesses like restaurants and bars are viewed as the items. Similarly, we use the 20-core setting to ensure that each user and item have at least twenty interactions.

IV-A2 Evaluation protocols

We adopt cross-validation to evaluate the performance. The ratio of training, validation, and test set is 8:1:1. The ranking is performed among the whole item set. Each experiment is repeated 5 times and the average performance is reported.

The recommendation quality is measured both in terms of overall accuracy and in terms of long-tail performance, which reflects the alleviation of popularity bias. The overall accuracy is measured with two metrics: Recall and Normalized Discounted Cumulative Gain (NDCG). Recall@N measures how many ground-truth items are included in the top-N positions of the recommendation list. NDCG is a rank-sensitive metric that assigns higher weights to top positions in the recommendation list [jarvelin2002cumulated].
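A per-user sketch of the two accuracy metrics is given below. It assumes binary relevance, Recall normalized by the number of ground-truth items, and the usual logarithmic discount for NDCG; these conventions and the function names are assumptions, since the exact formulas are not spelled out here. The tail variants can be obtained with the same functions by restricting `relevant` to tail items.

```python
import numpy as np

def recall_at_n(ranked_items, relevant, n):
    """Share of ground-truth items that appear in the top-n positions."""
    hits = sum(1 for item in ranked_items[:n] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_n(ranked_items, relevant, n):
    """DCG of the top-n list divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:n]) if item in relevant)
    ideal = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(relevant), n)))
    return dcg / ideal if ideal > 0 else 0.0

ranked = [3, 1, 7, 5]    # hypothetical ranking for one user, best first
relevant = {1, 5, 9}     # hypothetical held-out items of that user
r = recall_at_n(ranked, relevant, n=3)
g = ndcg_at_n(ranked, relevant, n=3)
print(r, g)
```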

For the evaluation of long-tail performance, we first split items into $\mathcal{I}_H$ and $\mathcal{I}_T$ according to the 20%/80% Pareto Principle. $\mathcal{I}_H$ represents the set of head items and $\mathcal{I}_T$ denotes the set of tail items. Here 20% means 20% of the total number of items, rather than 20% of interactions. We then adopt the following four metrics.

Recall-Tail and NDCG-Tail: Recall-Tail@N measures how many tail items belonging to $\mathcal{I}_T$ are included in the top-N positions of the recommendation list and then interacted with by the user. Similarly, NDCG-Tail@N assigns higher weights to top positions.

Coverage and APT: Coverage measures how many different items appear in the top-N recommendation list. A more readily interpretable but closely related metric of success we will use for evaluation is the Average Percentage of Tail items (APT) in the recommendation lists. More precisely, Coverage@N and APT@N are defined as:

$$\text{Coverage@N} = \frac{\left| \bigcup_{u \in \mathcal{U}_t} L_N(u) \right|}{|\mathcal{I}_t|}, \qquad \text{APT@N} = \frac{1}{|\mathcal{U}_t|} \sum_{u \in \mathcal{U}_t} \frac{\left| \{ i \mid i \in L_N(u) \cap \mathcal{I}_T \} \right|}{N},$$

where $|\mathcal{U}_t|$ and $|\mathcal{I}_t|$ are the number of users and items in the test set, and $L_N(u)$ represents the list of top-N recommended items for each user $u$ in the test set.
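The two list-level metrics can be computed directly from the recommended lists, as in the sketch below. It assumes Coverage is the fraction of distinct items that appear in at least one top-N list and APT is the per-user share of tail items averaged over users; the dictionary layout, the item ids, and the function names are illustrative.

```python
def coverage_at_n(top_n_lists, n_items):
    """Fraction of the item catalogue that appears in at least one top-N list."""
    recommended = set()
    for items in top_n_lists.values():
        recommended.update(items)
    return len(recommended) / n_items

def apt_at_n(top_n_lists, tail_items):
    """Average share of tail items per recommendation list."""
    shares = [sum(1 for i in items if i in tail_items) / len(items)
              for items in top_n_lists.values()]
    return sum(shares) / len(shares)

top2 = {"u1": [0, 1], "u2": [0, 4], "u3": [1, 0]}  # hypothetical top-2 lists
tail = {3, 4, 5}                                  # hypothetical tail item ids
cov = coverage_at_n(top2, n_items=6)
apt = apt_at_n(top2, tail)
print(cov, apt)
```

In this toy case three of six items are ever recommended and only one of the six recommendation slots is filled by a tail item, which is the kind of concentration on head items that both metrics are meant to expose.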

IV-A3 Baselines

We instantiate the proposed ICMT with three state-of-the-art recommendation models:

  • PMF [mnih2007probabilistic]: Probabilistic Matrix Factorization models the conditional probability of latent factors given the observed ratings and includes Gaussian priors as regularization.

  • NeuMF [he2017neural]: Neural Matrix Factorization is one notable deep learning-based recommendation model. It combines matrix factorization and multi-layer perceptrons (MLP) to learn high-order interaction signals.

  • LightGCN [he2020lightgcn]: LightGCN is a graph-based model that learns user and item representations by linearly propagating them on the interaction graph. The user and item embeddings are formulated as the aggregation of hidden vectors in all layers.

Each model is trained with the following model-agnostic frameworks:

  • Normal Training: This is the normal training procedure with simple BCE loss, as shown in Eq.(1).

  • IPS [gruson2019offline]: IPS re-weights each interaction according to item popularity. Specifically, weight for an interaction is set as the inverse of corresponding item popularity value.

  • PO-EA [ribeiro2014multiobjective]: PO-EA utilizes an evolutionary algorithm to find Pareto-Efficient solutions for multiple objectives like accuracy, diversity, and novelty.

  • MORS [wang2016multi]: MORS [wang2016multi] proposes a novel multi-objective evolutionary algorithm to find trade-off solutions between recommending accurate and niche items simultaneously.

  • FA-reg [abdollahpouri2017controlling]: Fairness-aware regularization (FA-reg) introduces a flexible regularization-based framework to enhance the long-tail coverage of recommendation lists in a learning-to-rank algorithm.

  • ICMT: our proposed learning framework.

                Last.Fm                             Gowalla                             Yelp2018
Base    Method  RT@20    NT@20    Cov@20   AT@20    RT@20    NT@20    Cov@20   AT@20    RT@20    NT@20    Cov@20   AT@20
PMF     Normal  0.0014   0.0011   0.3692   0.1252   0.0399   0.0193   0.3428   0.1140   0.0048   0.0024   0.3286   0.0611
        IPS     0.0009   0.0009   0.3034   0.0713   0.0292   0.0154   0.3064   0.0926   0.0029   0.0015   0.2845   0.0387
        PO-EA   0.0011   0.0009   0.3106   0.0863   0.0337   0.0171   0.3293   0.1068   0.0039   0.0028   0.2992   0.0543
        MORS    0.0015   0.0012   0.3714   0.1246   0.0420   0.0203   0.3640   0.1233   0.0055   0.0030   0.3437   0.0738
        FA-reg  0.0016   0.0012   0.3914   0.1373   0.0418   0.0203   0.3722   0.1332   0.0056   0.0031   0.3512   0.0745
        ICMT    0.0019*  0.0014*  0.4785*  0.1542*  0.0448*  0.0226*  0.4112*  0.1451*  0.0070*  0.0036*  0.3818*  0.0862*
NeuMF   Normal  0.0023   0.0025   0.6235   0.2336   0.0249   0.0123   0.5908   0.2166   0.0052   0.0028   0.5617   0.0969
        IPS     0.0018   0.0020   0.5580   0.1778   0.0197   0.0099   0.5053   0.1531   0.0032   0.0021   0.4740   0.0695
        PO-EA   0.0021   0.0022   0.5853   0.1911   0.0215   0.0121   0.5392   0.2086   0.0044   0.0025   0.5273   0.0745
        MORS    0.0031   0.0031   0.6188   0.2447   0.0284   0.0144   0.6477   0.2589   0.0061   0.0029   0.5897   0.1089
        FA-reg  0.0029   0.0030   0.6242   0.2429   0.0293   0.0133   0.6581   0.2650   0.0062   0.0029   0.5875   0.1081
        ICMT    0.0054*  0.0032*  0.6418*  0.2770*  0.0348*  0.0171*  0.6724*  0.3188*  0.0067*  0.0030*  0.6350*  0.1339*
LGC     Normal  0.0034   0.0025   0.4273   0.1835   0.0425   0.0197   0.4195   0.1257   0.0071   0.0035   0.3767   0.0708
        IPS     0.0022   0.0013   0.3508   0.1065   0.0364   0.0135   0.3547   0.0920   0.0042   0.0021   0.2955   0.0545
        PO-EA   0.0034   0.0027   0.4115   0.1790   0.0388   0.0160   0.3710   0.1178   0.0064   0.0029   0.3035   0.0639
        MORS    0.0049   0.0030   0.4248   0.1951   0.0433   0.0215   0.4291   0.1391   0.0094   0.0048   0.3923   0.0919
        FA-reg  0.0039   0.0026   0.4211   0.1807   0.0435   0.0219   0.4307   0.1331   0.0081   0.0043   0.3841   0.0923
        ICMT    0.0057*  0.0033*  0.4418*  0.2043*  0.0497*  0.0234*  0.4860*  0.1600*  0.0106*  0.0053*  0.4677*  0.1168*
* denotes significance p-value < 0.01 compared with normal training.

TABLE II: Top-20 long-tail performance on three datasets. RT, NT, Cov and AT are short for Recall-Tail, NDCG-Tail, Coverage and APT respectively. Boldface denotes the highest score.
| Base | Method | Last.Fm R@20 | Last.Fm NG@20 | Gowalla R@20 | Gowalla NG@20 | Yelp2018 R@20 | Yelp2018 NG@20 |
|---|---|---|---|---|---|---|---|
| PMF | Normal | 0.0303 | 0.0309 | 0.1719 | 0.1463 | 0.0507 | 0.0377 |
| PMF | IPS | 0.0278 | 0.0283 | 0.1306 | 0.1219 | 0.0382 | 0.0291 |
| PMF | PO-EA | 0.0282 | 0.0292 | 0.1459 | 0.1304 | 0.0456 | 0.0345 |
| PMF | MORS | 0.0301 | 0.0302 | 0.1614 | 0.1398 | 0.0480 | 0.0360 |
| PMF | FA-reg | 0.0292 | 0.0298 | 0.1623 | 0.1385 | 0.0474 | 0.0361 |
| PMF | ICMT | 0.0329* | 0.0327* | 0.1754* | 0.1489* | 0.0513* | 0.0386* |
| NeuMF | Normal | 0.0236 | 0.0229 | 0.1148 | 0.0920 | 0.0388 | 0.0280 |
| NeuMF | IPS | 0.0207 | 0.0209 | 0.0865 | 0.0771 | 0.0302 | 0.0219 |
| NeuMF | PO-EA | 0.0215 | 0.0207 | 0.0917 | 0.0848 | 0.0335 | 0.0234 |
| NeuMF | MORS | 0.0228 | 0.0206 | 0.1066 | 0.0949 | 0.0388 | 0.0279 |
| NeuMF | FA-reg | 0.0210 | 0.0208 | 0.1090 | 0.0918 | 0.0385 | 0.0283 |
| NeuMF | ICMT | 0.0237 | 0.0231* | 0.1237* | 0.0953* | 0.0416* | 0.0297* |
| LGC | Normal | 0.0413 | 0.0391 | 0.1873 | 0.1616 | 0.0580 | 0.0437 |
| LGC | IPS | 0.0396 | 0.0383 | 0.1494 | 0.1278 | 0.0423 | 0.0312 |
| LGC | PO-EA | 0.0404 | 0.0382 | 0.1669 | 0.1439 | 0.0479 | 0.0369 |
| LGC | MORS | 0.0407 | 0.0382 | 0.1719 | 0.1542 | 0.0546 | 0.0415 |
| LGC | FA-reg | 0.0406 | 0.0386 | 0.1762 | 0.1544 | 0.0531 | 0.0415 |
| LGC | ICMT | 0.0432* | 0.0404* | 0.1898* | 0.1626* | 0.0599* | 0.0448* |
  * denotes significance (p-value < 0.01) compared with normal training.

TABLE III: Top-20 overall performance on three datasets. R and NG are short for Recall and NDCG respectively. Boldface denotes the highest score.

IV-A4 Parameter Settings.

All methods are trained with the Adam optimizer [kingma2014adam], except NeuMF-based models, which use the RMSprop optimizer. The batch size is set to 512. The learning rate is set as . We evaluate on the validation set every 3000 batches of updates. For a fair comparison, the embedding size is set to 64 for all models. For NeuMF and LightGCN, we use a three-layer structure. The node-dropout and message-dropout in LightGCN are set as on all datasets. For the hyperparameters of ICMT, , , and are searched in {1e-4, 1e-3, 2e-3, 5e-3, 1e-2} on all three datasets. We set item cluster number unless otherwise mentioned. Note that the model hyperparameters are kept exactly the same across all training frameworks for a fair comparison.

IV-B Performance Comparison (RQ1)

Table II and Table III show the long-tail and overall top-N recommendation performance, respectively, on Last.Fm, Gowalla and Yelp2018. We make the following observations from the results:

(1). According to Table II, ICMT achieves the best long-tail recommendation performance among all methods. This observation confirms that the proposed ICMT is effective in alleviating the popularity bias and generating better recommendations from tail items. Base models trained with ICMT achieve average gains in Recall-Tail@20 / NDCG-Tail@20 / Coverage@20 / APT@20 of 42.45% / 30.11% / 15.45% / 33.23%.

(2). As shown in Table III, although ICMT is proposed to tackle the long-tail recommendation task, it outperforms normal training and the other long-tail recommendation methods in terms of overall accuracy in all cases. This demonstrates that ICMT achieves a better trade-off between long-tail recommendation and overall accuracy than other frameworks. The average performance gain on the overall accuracy metrics Recall@20 / NDCG@20 is 4.03% / 2.98%. This improvement mainly comes from promoting high-quality niche items while demoting irrelevant head items.

(3). We conduct one-sample t-tests, and the obtained results (i.e., p-value < 0.01) indicate that the improvements of ICMT on both long-tail recommendation metrics and overall metrics are statistically significant.

(4). We also analyze the trade-off between head items and tail items in ICMT. Figure 3 visualizes the average recommendation accuracy (i.e., NDCG@20 and Recall@20) on head items and tail items on the Gowalla dataset with LightGCN as the recommendation model. Points closer to the top-right corner indicate better performance, with both higher head and tail accuracy. The proposed ICMT achieves the highest tail accuracy while sacrificing only a little head accuracy compared with normal training. Other methods cannot achieve such performance: in most cases, they lead to a larger decrease in head accuracy while obtaining a smaller gain in tail accuracy. This result demonstrates that ICMT achieves a better trade-off between head items and tail items than other methods. The performance gain of ICMT mainly comes from the growth on niche items, with only a minor loss on head items.

To conclude, the proposed ICMT significantly improves the long-tail recommendation performance compared with normal training, and meanwhile, enhances the overall accuracy.
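
As a sanity check on the reported overall gains, the average improvement of ICMT over normal training can be recomputed from the rounded entries of Table III. The sketch below assumes that the gain is the mean of the per-setting relative improvements over the nine model/dataset pairs; this averaging scheme is an assumption, since the text does not spell it out, and the helper name `mean_relative_gain` is hypothetical.

```python
# (Normal, ICMT) pairs copied from Table III, ordered Last.Fm, Gowalla, Yelp2018.
RECALL_AT_20 = [
    (0.0303, 0.0329), (0.1719, 0.1754), (0.0507, 0.0513),  # PMF
    (0.0236, 0.0237), (0.1148, 0.1237), (0.0388, 0.0416),  # NeuMF
    (0.0413, 0.0432), (0.1873, 0.1898), (0.0580, 0.0599),  # LGC
]
NDCG_AT_20 = [
    (0.0309, 0.0327), (0.1463, 0.1489), (0.0377, 0.0386),  # PMF
    (0.0229, 0.0231), (0.0920, 0.0953), (0.0280, 0.0297),  # NeuMF
    (0.0391, 0.0404), (0.1616, 0.1626), (0.0437, 0.0448),  # LGC
]


def mean_relative_gain(pairs):
    """Average of (icmt - normal) / normal over all settings, in percent."""
    gains = [(icmt - normal) / normal for normal, icmt in pairs]
    return 100.0 * sum(gains) / len(gains)


recall_gain = mean_relative_gain(RECALL_AT_20)
ndcg_gain = mean_relative_gain(NDCG_AT_20)
print(f"Recall@20 gain: {recall_gain:.2f}%  NDCG@20 gain: {ndcg_gain:.2f}%")
```

With the rounded table entries this gives roughly 4.0% for Recall@20 and 3.0% for NDCG@20, close to the two reported figures; small differences are expected because the table values are rounded to four decimals.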

Fig. 3: NDCG@20 and Recall@20 Profile of head and tail items on Gowalla Dataset
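
The long-tail metrics used in this comparison can be computed from per-user top-K lists along the following lines. This is a minimal sketch under assumed definitions: Coverage@K as the fraction of catalogue items that appear in at least one top-K list, APT@K as the average share of tail items per list, and Recall-Tail@K as recall restricted to tail items in the ground truth. The function names and the toy data are illustrative, and the exact normalisation used in the experiments may differ.

```python
def coverage_at_k(top_k, num_items):
    """Fraction of catalogue items recommended to at least one user."""
    recommended = set()
    for items in top_k.values():
        recommended.update(items)
    return len(recommended) / num_items


def apt_at_k(top_k, tail_items):
    """Average share of tail items in each user's top-K list."""
    shares = [sum(i in tail_items for i in items) / len(items)
              for items in top_k.values()]
    return sum(shares) / len(shares)


def recall_tail_at_k(top_k, ground_truth, tail_items):
    """Recall over tail ground-truth items, averaged over users that have any."""
    recalls = []
    for user, items in top_k.items():
        relevant_tail = ground_truth.get(user, set()) & tail_items
        if relevant_tail:
            hits = len(relevant_tail & set(items))
            recalls.append(hits / len(relevant_tail))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy example: 6 items in total, items 3-5 are tail items, K = 2.
top_k = {"u1": [0, 3], "u2": [0, 1]}
ground_truth = {"u1": {3, 4}, "u2": {1}}
tail_items = {3, 4, 5}

cov = coverage_at_k(top_k, num_items=6)
apt = apt_at_k(top_k, tail_items)
rt = recall_tail_at_k(top_k, ground_truth, tail_items)
```

Restricting the same recall computation to head items instead of tail items yields the head-accuracy axis of a head/tail profile such as the one in Figure 3.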

IV-C Ablation Study and Hyper-parameter Study (RQ2)

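| Dataset | Method | Long-Tail: RT@20 | Long-Tail: NT@20 | Long-Tail: Cov@20 | Long-Tail: AT@20 | Overall: R@20 | Overall: NG@20 |
|---|---|---|---|---|---|---|---|
| Last.Fm | Default | 0.0037 | 0.0033 | 0.4418 | 0.2043 | 0.0432 | 0.0404 |
| Last.Fm | w/o CR | | | 0.4340 | 0.1890 | 0.0421 | 0.0389 |
| Last.Fm | w/o PD | 0.0035 | 0.0030 | 0.4248 | 0.1829 | 0.0404 | 0.0381 |
| Last.Fm | w/o CL | 0.0036 | 0.0030 | 0.4373 | 0.1925 | 0.0412 | 0.0391 |
| Gowalla | Default | 0.0497 | 0.0234 | 0.4860 | 0.1600 | 0.1898 | 0.1626 |
| Gowalla | w/o CR | | | 0.4677 | 0.1508 | 0.1885 | 0.1616 |
| Gowalla | w/o PD | 0.0467 | 0.0213 | | | 0.1864 | 0.1597 |
| Gowalla | w/o CL | 0.0495 | 0.0232 | 0.4768 | 0.1573 | 0.1885 | 0.1614 |
| Yelp2018 | Default | 0.0106 | 0.0053 | 0.4677 | 0.1168 | 0.0599 | 0.0448 |
| Yelp2018 | w/o CR | | | 0.4659 | 0.1012 | 0.0594 | 0.0444 |
| Yelp2018 | w/o PD | 0.0097 | 0.0052 | 0.4411 | | 0.0590 | 0.0441 |
| Yelp2018 | w/o CL | 0.0100 | 0.0049 | 0.4627 | 0.1119 | 0.0596 | 0.0447 |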
TABLE IV: Ablation study of method components on three datasets. Boldface denotes the highest scores. w/o denotes without. indicates a severe performance drop (more than 10%).
Fig. 4: Effect of item cluster number .

IV-C1 Ablation Study

In this part, we conduct an ablation study to analyze the functionality of the three components of ICMT, i.e., cluster-wise re-weighting (CR), popularity disentanglement (PD), and contractive loss (CL). Table IV shows the performance of ICMT and its variants on all three datasets with LightGCN as the base recommendation model. We introduce the variants and analyze their effects in turn:

(1). Remove Cluster-wise Re-weighting (w/o CR): The most significant long-tail accuracy degradation occurs without the re-weighting strategy, which implies that niche items tend to obtain higher weights because of their weak relationship with the popularity embedding. This indicates that clustering the items and treating the recommendation target as a multi-objective optimization problem can greatly improve the performance of the model in the long-tail space.

(2). Remove Popularity Disentanglement (w/o PD): After removing the popularity embedding and performing clustering on the item embedding distribution, we find that Coverage and APT become significantly worse, meaning that the vanilla user embedding is biased by item popularity. In contrast, with the popularity embedding in place, the user interest embedding in ICMT alone reflects the users’ true preference, which helps explore more niche items and improves the long-tail performance.

(3). Remove Contractive Loss (w/o CL): The long-tail results degrade without the contractive regularization, indicating that niche items are boosted by regularizing with the contractive Jacobian gradients.

To sum up, the combination of these three strategies (i.e., ICMT) yields the best performance, indicating that all three components of ICMT are effective and work collaboratively to improve the long-tail recommendation performance and overall accuracy.

IV-C2 Hyper-parameter Study

(1). Effect of Item Cluster Number : In this part, we use LightGCN as the base recommendation model since it has the best overall accuracy. Here we choose to conduct the experiment. Figure 4 illustrates NDCG@20, NDCG-Tail@20, and APT@20 under different cluster numbers on the Last.Fm and Yelp2018 datasets. The general observation is that the overall recommendation accuracy stays at the same level while the long-tail performance shows bell-shaped curves. Increasing the cluster number from 1 (i.e., normal training) to 2 leads to the largest long-tail performance improvement. Beyond that, the long-tail performance keeps diminishing as the cluster number increases. These results indicate that adaptively assigning the items to two clusters leads to the most satisfactory performance, whereas with more item clusters the performance of ICMT on long-tail items is compromised. The reason could be that too many clusters make ICMT focus more on the balance among tail clusters than on the balance between head and tail items.

(2). Effect of Popularity Factor Weight : To evaluate the impact of the popularity factor weight , we vary the weight in the range of {0, 1e-4, 1e-3, 2e-3, 5e-3, 1e-2}. The experimental results are summarized in Figure 5. We can observe that the overall NDCG@20 reaches its peak when , demonstrating that properly disentangling the popularity factor closely reflects the true user interest. Meanwhile, as increases, the long-tail performance keeps rising swiftly. This implies that more niche items are discovered when we put more emphasis on decomposing the popularity factor. To sum up, for a higher overall accuracy, we choose as our default setting.

(3). Effect of Contractive Loss Weight : From Figure 6, we can observe that a small value of promotes the accuracy, especially in the long-tail space, owing to its ability to balance the gradients from all items and suppress the dominance of the popularity embedding. However, although APT keeps increasing, a larger weight on the robust regularization does not necessarily lead to better accuracy, because information from the gradients is lost. Therefore, we set to (1e-3, 1e-3, 2e-3) for Last.Fm, Gowalla, and Yelp2018, respectively, to achieve the best overall performance.

Fig. 5: Effect of Popularity Factor Weight .
Fig. 6: Effect of Contractive Loss Weight .

IV-D PE-Solver Investigation (RQ3)

In this part, we conduct experiments to show whether the PE-solver generates reasonable Pareto-Efficient solutions.

Fig. 7: PE-solver investigation. (a) The Pareto Frontier and searched solutions. (b) Long-tail weights learning curves. Vertical bars mark convergence epochs.

IV-D1 Pareto Frontier and the Searched PE Point

On the Gowalla dataset with all three recommendation models, we first generate the Pareto Frontiers of head and tail losses by running the Pareto MTL algorithm [lin2019pareto] with different trade-off preference vectors, as shown in Figure 7(a). It can be observed that the obtained Pareto Frontiers under different constraints follow Pareto-Efficiency, i.e., no point achieves both a lower short-head loss and a lower long-tail loss than another point. When the model focuses more on head items, the short-head loss is lower while the long-tail loss increases, and vice versa.

As for the points found by the PE-solver, we can see that on all recommendation models they mainly lie in the middle part of the Pareto Frontiers. This observation indicates that the PE-solver is consistent with our aim of balancing the trade-off between head items and tail items.
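
Pareto-Efficiency on the two losses can be checked mechanically: a point is kept if no other point is at least as good on both losses and strictly better on one. The sketch below filters a set of (head loss, tail loss) pairs in this way. The loss values are made up for illustration and are not taken from Figure 7(a), and `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Return the non-dominated (head_loss, tail_loss) pairs, sorted by head loss.

    Lower is better for both coordinates.
    """
    def dominates(p, q):
        # p dominates q: no worse on both losses and strictly better on one
        return p[0] <= q[0] and p[1] <= q[1] and (p[0] < q[0] or p[1] < q[1])

    front = [q for q in points if not any(dominates(p, q) for p in points)]
    return sorted(front)


# Hypothetical (head_loss, tail_loss) pairs from runs with different preferences.
candidates = [(0.20, 0.90), (0.30, 0.60), (0.45, 0.45), (0.50, 0.70), (0.70, 0.30)]
front = pareto_front(candidates)
```

Here `(0.50, 0.70)` is dropped because `(0.30, 0.60)` is better on both losses. Along the remaining points the tail loss falls as the head loss rises, which is the trade-off shape described above.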

IV-D2 The Learning of Weights

To clarify the training process and reveal the attention paid to long-tail items, we further plot the learning curves of the average weights assigned to long-tail items, as shown in Figure 7(b). We use LightGCN as the recommendation model and visualize the trend on all three datasets. We can observe that the weights obtained from the PE-solver tend to focus on tail items. After fluctuating at the early training stage, the weight for tail items flattens and then converges to a value in the range [1.04, 1.16]. In contrast, normal training ignores these PE weights and treats all items in the same way, leading to overfitting on head items. Hence, the proposed ICMT can effectively alleviate the popularity bias of RS by assigning adaptive weights to head and tail cluster-wise objectives.
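
To make the idea of adaptive objective weights concrete, the sketch below computes the min-norm convex combination of two gradients, the closed-form step used in MGDA-style multi-objective solvers. It is offered as an illustration only and is not the PE-solver itself: the solver used by ICMT may add constraints or rescale the weights (the tail weights reported here lie in [1.04, 1.16], whereas the weights below sum to 1), and the gradient vectors are toy values.

```python
def min_norm_weights(g_head, g_tail):
    """Weights (w_head, w_tail) with w_head + w_tail = 1 that minimise
    ||w_head * g_head + w_tail * g_tail||^2 (two-objective closed form)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    diff = [t - h for h, t in zip(g_head, g_tail)]  # g_tail - g_head
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: every convex combination gives the same vector.
        return 0.5, 0.5
    # Unconstrained minimiser for w_head, then clipped to [0, 1].
    w_head = dot(diff, g_tail) / denom
    w_head = min(1.0, max(0.0, w_head))
    return w_head, 1.0 - w_head


# Toy gradients: the head objective has a much larger gradient norm.
g_head = [4.0, 0.0]
g_tail = [0.0, 1.0]
w_head, w_tail = min_norm_weights(g_head, g_tail)
```

In this toy case the head gradient, which has the larger norm, receives the smaller weight (1/17), so the combined direction is not dominated by the head objective. This mirrors the qualitative behaviour described above, where the learned weights favour tail items.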

Fig. 8: Case study of a single user. The upper part is the recommendation list from LightGCN-Normal and the lower part is the recommendation list from LightGCN-ICMT. H denotes that the item is a head item while T denotes that the item is a tail item. denotes the average weight assigned to each item. The blue check mark denotes that the recommended item belongs to the ground-truth items in the test set. The arrow indicates that there is some relation between the two items (e.g., the same tag).

IV-E Case Study (RQ4)

To show the interpretability of ICMT, we randomly select a user from the Last.Fm dataset and retrieve the two top-5 recommendation lists produced for this user by LightGCN-Normal and LightGCN-ICMT given the same interaction history. Figure 8 illustrates the recommendation details. We observe that the recommendation list from LightGCN-Normal contains popular ground-truth items but does not contain any tail items. The list from LightGCN-ICMT contains two tail (unpopular) items, Farben Lehre and Plavi Orkestar, thanks to the higher average weights assigned to tail items in the user's interaction history. One of the two recommended tail items belongs to the ground-truth test set, which improves both the long-tail and the overall performance.

Note that the two recommendation lists have several items in common (e.g., Erasure), which indicates that LightGCN-ICMT can also capture the preference for popular items.

V Conclusion

In this paper, we propose to tackle the long-tail recommendation task from a multi-objective optimization perspective. We find that head items are repetitively recommended because they tend to have larger gradient norms and thus dominate the gradient updates. Learning parameters based on such gradients could sacrifice the performance on long-tail items. To alleviate this phenomenon, we propose a general learning framework, namely ICMT, which features popularity disentanglement, cluster-wise multi-objective re-weighting, and robust contractive regularization. We instantiate ICMT with three state-of-the-art recommendation models and conduct extensive experiments on three real-world datasets. The results demonstrate the effectiveness of ICMT. Future work includes generalizing ICMT to other tasks such as multi-class classification and long-tail document retrieval.