I. Introduction
Recommender systems play a crucial role in online services and platforms by addressing the problem of information overload [rendle2010factorization, mnih2007probabilistic, he2017neural]. A recommender system (RS) is trained on historical user-item interactions with the goal of providing the most interesting items given the current user state. However, there is a self-loop in the training of an RS [chen2020bias]: the exposure mechanism of the RS affects the collection of user-item interactions, which are then circled back as the training data for the RS itself. Such a self-loop leads to severe popularity bias in the training data. Specifically, the item frequency distribution in the training data is an extreme long-tail distribution [AbdollahpouriBM17]: a small fraction of popular (head) items accounts for almost the whole training dataset. Normal learning-to-rank methods [rendle2012bpr] trained on such biased data push head items towards much higher ranking scores than other items. As a result, popular items are repetitively recommended, which further intensifies the popularity bias and the “rich get richer” Matthew effect [chen2020bias].
Nevertheless, recommending tail items plays an important role in improving system performance. From the user’s perspective, he/she can easily get bored with repetitive popular recommendations, and among the tail items there are potentially relevant items that would lead to larger user satisfaction [li2017two, adomavicius2011improving]. For service providers, recommendations from tail items can bring higher marginal profit than head items [anderson2006long]. Generally speaking, the recommendation task is a typical exploitation-exploration problem. Long-tail recommendation benefits both users and service providers with better exploration, which eventually turns into larger profits in the long run [jannach2015recommenders].
Existing methods that focus on long-tail recommendation are usually based on metrics like recommendation diversity and novelty [wu2019pd, zhou2010solving, ribeiro2014multiobjective]. However, these metrics are infeasible to optimize directly, and recommendations based on promoting them can come with a huge sacrifice of accuracy [wang2016multi]. Besides, the definition of diversity or novelty is still an open research problem without a standard benchmark [ge2010beyond].
In this paper, we analyze the popularity bias problem of RS from an optimization perspective. We conduct an empirical study on the Gowalla dataset (https://snap.stanford.edu/data/loc-gowalla.html), on which we train the state-of-the-art LightGCN [he2020lightgcn] model. Figure 1(a) visualizes the $\ell_2$ norm of the gradients coming from different items. Figure 1(b) shows the gradients from one popular item and two tail items in the dataset. We make the following observations:

Head items have a much larger gradient norm than tail items, indicating that the overall gradient direction is actually dominated by head items.

There are potential conflicts between gradients coming from head items and those coming from tail items. That is to say, updating model parameters based on gradients dominated by head items can sacrifice the learning of tail items.
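As a toy illustration of the two observations above (the gradient vectors are made up for illustration, not values from the paper), the dominance and conflict can be checked with inner products: the head gradient's norm dwarfs the tail gradient's, and a negative inner product means the head-item update direction increases the tail item's loss.

```python
def dot(a, b):
    """Inner product of two gradient vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical gradients of the loss w.r.t. shared parameters.
g_head = [0.9, 0.4, -0.2]     # large norm: item with many interactions
g_tail = [-0.05, 0.02, 0.04]  # small norm: item with few interactions

norm_head = dot(g_head, g_head) ** 0.5
norm_tail = dot(g_tail, g_tail) ** 0.5

# The summed gradient is dominated by the head item ...
print(norm_head > 10 * norm_tail)  # True
# ... and the two directions conflict (negative inner product).
print(dot(g_head, g_tail) < 0)     # True
```

With such a conflict, a step along the summed gradient reduces the head item's loss while increasing the tail item's, which is exactly the situation the paper's cluster-wise reweighting targets.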
Motivated by the above observations, we propose Item Cluster-Wise Multi-Objective Training (ICMT) for long-tail recommendation to address the popularity bias. Note that ICMT is a general learning framework and can be instantiated with different specific models, such as Probabilistic Matrix Factorization (PMF) [mnih2007probabilistic], NeuMF [he2017neural], etc. More precisely, a universal popularity embedding is first involved in the ranking score prediction. This popularity embedding is then disentangled from the user interest embedding to model the popularity impact. Based on the disentangled representations, we split items into different clusters according to their correlation with the popularity embedding and consider the learning on each item cluster as an optimization objective. As a result, the learning over the whole training data can be seen as a weighted aggregation of multiple cluster-wise objectives. We then utilize a Pareto-Efficient (PE) solver to adaptively learn the weight of each objective. Through the PE solver, we can find a solution in which every cluster-wise objective is optimized without hurting the others; in other words, the learning of head items does not harm the learning of tail items. Finally, a contractive loss focusing on model robustness is introduced as a regularization term to further prevent potential overfitting to head items.
To summarize, this work makes the following contributions:

We propose to tackle the long-tail recommendation task from an item cluster-wise optimization perspective. We show that head items are highly likely to be recommended due to the domination of their gradients, providing new directions to address the popularity bias of RS.

We propose ICMT, a general long-tail recommendation framework featuring popularity disentanglement, cluster-wise multi-objective optimization, and robust contractive regularization.

We instantiate ICMT with three state-of-the-art recommendation models and conduct experiments on three real-world datasets. Experimental results demonstrate that ICMT significantly alleviates the popularity bias problem in recommender systems.
II. Related Work
II-A. Methods for Long-Tail Recommendation
Due to the popularity bias and the exposure mechanism of RS, tail items usually have much less training data. As a result, generating recommendations from head items is a conservative but effective way to improve recommendation accuracy (e.g., recall) [adomavicius2011improving]. To get rid of the conformity influence, some existing methods argue that other metrics like diversity [kaminskas2016diversity, hurley2013personalised, chen2020improving] and novelty [ribeiro2012pareto, ribeiro2014multiobjective] should be considered simultaneously as additional regularization terms. For example, [wang2016multi] proposed a metric based on the unpopularity of items. [ribeiro2012pareto, ribeiro2014multiobjective, shi2013trading] consider diversity as the item difference within one recommendation list and novelty as the difference across lists. However, such metrics are usually infeasible to optimize directly. Also, [jang2020cities, zhang2021model] utilized knowledge transfer from many-shot head items to enhance the quality of tail-item embeddings. Inverse propensity scoring (IPS) is a practical choice for industrial products [huang2006correcting], since it is relatively easy to reweight training samples and ameliorate the distribution-shift problem. Nevertheless, it suffers severely from high variance.
Besides, there are also methods that rely on additional knowledge input such as side information, user feedback, and niche item clustering to relieve the cold-start problem of tail items [bai2017dltsr, kim2019sequential]. However, none of the above works emphasizes easing the neglect of tail items during the gradient update process.
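The IPS idea mentioned above can be sketched in a few lines (the interaction log and item names here are made up): each interaction is reweighted by the inverse of its item's popularity, so rare items get large weights, which is also why the estimator's variance explodes for very unpopular items.

```python
from collections import Counter

# Hypothetical interaction log of (user, item) pairs.
interactions = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (1, "b")]

# Item popularity = number of observed interactions per item.
popularity = Counter(item for _, item in interactions)

# Inverse-propensity weight for each interaction.
weights = {(u, i): 1.0 / popularity[i] for u, i in interactions}

print(weights[(1, "a")])  # 0.25 -- popular item downweighted
print(weights[(1, "b")])  # 1.0  -- tail item upweighted
```

A tail item seen once receives a weight many times larger than a head item's, so a single noisy tail interaction can dominate a gradient step; this is the high-variance problem noted above.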
II-B. Multi-Objective Optimization in RS
Although recommendation accuracy is the main objective of recommendation, some research has also focused on other objectives such as availability, profitability, and usefulness [jambor2010optimizing, mcnee2006being]. Besides, metrics related to long-tail recommendation such as diversity and novelty are also considered as objectives [ribeiro2012pareto, ribeiro2014multiobjective]. Recently, user-oriented objectives such as user sentiment have been considered for better recommendation [musto2017multi, rodriguez2012multiple]. For a commercial RS, CTR (Click-Through Rate) and GMV (Gross Merchandise Volume) are included in [nguyen2017multi, lin2019pareto] to gain higher profits.
The optimization methods for multiple objectives fall into two categories: heuristic search [zitzler2001spea2] and scalarization [desideri2009multiple, xiao2017fairness]. Evolutionary algorithms are popular choices for heuristic search, dealing simultaneously with a set of possible solutions in a single run [wang2016multi], but they usually depend heavily on heuristic experience [kendall2018multi, zitzler2001spea2]. Scalarization methods transform multiple objectives into a single one via a weighted sum of all objective functions [xiao2017fairness]. The overall objective function is then optimized to be Pareto-Efficient, where no single objective can be further improved without hurting the others [lin2019pareto]. In this paper, we aim to address the long-tail recommendation task from an optimization perspective. Unlike existing methods which introduce new metrics as the objectives [lin2019pareto, ribeiro2012pareto, ribeiro2014multiobjective], we consider the learning on a cluster of items as an objective. We then focus on finding an optimal solution in which the learning over one item cluster does not harm the learning on the others.
III. Methodologies
In this section, we describe the details of the proposed ICMT framework. Figure 2 illustrates the structure of ICMT, which contains a base recommendation model with structural disentanglement of popularity, a PE solver that finds non-conflicting gradient directions for multiple item cluster-wise objectives, and a hybrid training loss consisting of weighted binary cross-entropy (BCE) and robust contractive regularization.
III-A. Base Model with Popularity Disentanglement
A traditional recommender base model mainly consists of a user-side encoder $f_U$ and an item-side encoder $f_I$. For each user-item pair $(u, i)$, the core task is to map user $u$ and item $i$ into a user interest embedding $\mathbf{e}_u \in \mathbb{R}^d$ and an item embedding $\mathbf{e}_i \in \mathbb{R}^d$, respectively, where $d$ denotes the embedding size. It can be formulated as:
(1)  $\mathbf{e}_u = f_U(u), \quad \mathbf{e}_i = f_I(i).$
In traditional base models, once the user and item embeddings are obtained, a scoring function $s(\cdot, \cdot)$ is utilized to calculate the relevance score for a user and an item (e.g., the inner product [mnih2007probabilistic, he2020lightgcn]), which can be formulated as:
(2)  $\hat{y}_{ui} = s(\mathbf{e}_u, \mathbf{e}_i) = \mathbf{e}_u^{\top} \mathbf{e}_i.$
Recently, many works design novel $f_U$ and $f_I$ to extract better features. For example, neural matrix factorization (NeuMF) [he2017neural] adopts MLPs as $f_U$ and $f_I$ to extract user and item features respectively, and LightGCN [he2020lightgcn] introduces the graph convolution mechanism into $f_U$ and $f_I$. However, this approach ignores the engagement of popularity impact: different reasons behind an interaction are bundled together into a unified user representation. Therefore, we separate the representations of user interest and popularity by assigning each user a unique interest embedding $\mathbf{e}_u$ while maintaining a general popularity embedding $\mathbf{e}_p$. The latter represents the public preference and participates with the item embedding in every interaction through the score $s(\mathbf{e}_p, \mathbf{e}_i)$. Through aggregation, the prediction function of ICMT can be formulated as:
(3)  $\hat{y}_{ui} = s(\mathbf{e}_u, \mathbf{e}_i) + \alpha \, s(\mathbf{e}_p, \mathbf{e}_i),$
where $\alpha$ is a weighting parameter controlling the ratio of popularity impact. During the inference stage, only the interest part $s(\mathbf{e}_u, \mathbf{e}_i)$ is used for the final recommendation, where the popularity embedding has been disentangled. Without loss of generality, we implement $s$ as the inner product (matrix factorization) in ICMT.
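A minimal sketch of the disentangled prediction in Eq. (3) with inner-product scoring (the embedding values and the weight `alpha` are made up): training scores include the popularity term, while inference drops it.

```python
def inner(a, b):
    """Inner-product scoring function s(., .)."""
    return sum(x * y for x, y in zip(a, b))

e_u = [0.2, -0.1, 0.5]  # user interest embedding (hypothetical)
e_p = [0.8, 0.7, 0.6]   # shared popularity embedding (hypothetical)
e_i = [0.3, 0.4, 0.1]   # item embedding (hypothetical)
alpha = 0.5             # weight of the popularity impact

# Training-time score, Eq. (3): interest term + weighted popularity term.
score_train = inner(e_u, e_i) + alpha * inner(e_p, e_i)

# Inference-time score: popularity disentangled, interest term only.
score_infer = inner(e_u, e_i)

print(score_train, score_infer)
```

Because the popularity term is shared across all users, it soaks up the conformity signal during training; dropping it at inference is what removes the popularity boost from the final ranking.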
III-B. Item Cluster-Wise Pareto Efficiency
In this paper, we consider training the RS from implicit feedback (i.e., observed interactions are considered positive samples while negative samples are drawn from missing interactions). For the base model extracting user interest and item representations, we start from the normal training setting, in which each training sample has equal weight (i.e., 1). The total loss function is defined on the whole training data as
(4)  $\mathcal{L} = \sum_{(u,i) \in \mathcal{D}^{+} \cup \mathcal{D}^{-}} \ell(u, i; \Theta),$
where $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$ denote the set of positive samples and the set of sampled negative samples, respectively, $\ell(u, i; \Theta)$ is the specific loss (i.e., cross-entropy) of the pair $(u, i)$, and $\Theta$ denotes the base model parameters. Then we expand the loss on the positive examples (we do not consider the loss on negative examples since we assume they are sampled uniformly) as
(5)  $\mathcal{L}^{+} = \sum_{i \in \mathcal{I}} \sum_{u \in \mathcal{U}_i} \ell(u, i; \Theta),$
where $\mathcal{I}$ denotes the whole item set and $\mathcal{U}_i$ is the set of users who have interacted with item $i$.
Due to the popularity bias in the training data, the distribution of $|\mathcal{U}_i|$ is extremely imbalanced: for head items, $\mathcal{U}_i$ contains many more samples than for tail items. As a result, when we perform updates according to $\mathcal{L}^{+}$, the majority of gradients come from the loss on head items. When there are conflicts between gradients coming from head items and gradients coming from tail items, the normal training setting sacrifices the learning on tail items to achieve a lower overall loss. That is to say, the overall gradient direction is actually dominated by head items.
To address this problem, we propose to consider the learning on each item as an objective and formulate the training over the whole (positive) data as a weighted sum of these multiple item-wise objectives:
(6)  $\mathcal{L}^{+} = \sum_{i \in \mathcal{I}} w_i \mathcal{L}_i, \quad \mathcal{L}_i = \sum_{u \in \mathcal{U}_i} \ell(u, i; \Theta),$
where $\mathcal{L}_i$ can be seen as the initial item-wise optimization objective and $w_i$ is the weight for objective $\mathcal{L}_i$.
We aim to find a set of weights $\{w_i\}$ such that each item-wise objective can be optimized without hurting the others. In real-world scenarios, there is usually a huge number of items, so naively considering each $\mathcal{L}_i$ as an optimization objective would heavily increase the computational complexity of finding $\{w_i\}$. To address this problem, we first split the whole item set into $K$ clusters and then consider the learning on each cluster of items.
Besides, within the total parameter space $\Theta$, some parameters $\Theta_s$ are shared between objectives while others $\theta_i$ are related only to item $i$. We only need to consider $\Theta_s$ for the selection of objective weights, since $\theta_i$ does not affect the learning of other items. It is worth mentioning that although the popularity embedding $\mathbf{e}_p$ is shared among all the objectives, its purpose is to evenly capture the popularity preference among all the items. Therefore, ICMT does not apply the reweighting strategy to $\mathbf{e}_p$ and keeps it in the item-specific parameter set $\Theta_{it}$. Taking all the above factors into account, we formulate the loss function for training the shared parameters on positive samples as
(7)  $\mathcal{L}^{+}_{s} = \sum_{k=1}^{K} w_k \mathcal{L}_k, \quad \mathcal{L}_k = \sum_{i \in \mathcal{C}_k} \sum_{u \in \mathcal{U}_i} \ell(u, i; \Theta),$
where $w_k$ is the weight for the $k$-th item cluster and $\mathcal{C}_k$ denotes the set of items in this cluster. When $w_k = 1$ for all $k$, we recover the normal training setting. $\mathcal{L}_k$ is the learning objective for cluster $k$. For the training of the item-specific parameters $\Theta_{it}$, we still use the unweighted loss $\mathcal{L}^{+}$ as shown in Eq. (5).
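The split between the reweighted shared-parameter loss of Eq. (7) and the unweighted item-specific loss of Eq. (5) can be sketched as follows (the per-interaction losses, cluster assignments, and weights are made up):

```python
# Hypothetical per-interaction losses, keyed by item.
item_losses = {"a": [0.2, 0.3, 0.1], "b": [0.7], "c": [0.9]}
clusters = {0: ["a"], 1: ["b", "c"]}  # cluster 0: head, cluster 1: tail
w = {0: 0.3, 1: 0.7}                  # weights produced by the PE solver

# Eq. (7): cluster-wise weighted loss, used to update shared parameters.
loss_shared = sum(
    w[k] * sum(l for item in items for l in item_losses[item])
    for k, items in clusters.items()
)

# Eq. (5): plain unweighted loss, used to update item-specific parameters.
loss_item = sum(l for ls in item_losses.values() for l in ls)

print(round(loss_shared, 4), round(loss_item, 4))
```

Here the tail cluster's larger weight rebalances its small summed loss against the head cluster, while item-specific parameters still see every interaction at its natural weight.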
In the following, we first describe the item clustering strategy and then describe how to find $w_k$ for each cluster objective.
III-B1. Adaptive Item Clustering
We cluster the items based on the observation that items whose embeddings are similar to the popularity embedding should be separated from items whose embeddings are dissimilar to it. We implement this idea by taking the element-wise product of each item embedding $\mathbf{e}_i$ with $\mathbf{e}_p$. The final item clustering embedding is defined as follows:
(8)  $\mathbf{c}_i = \dfrac{\mathbf{e}_i \odot \mathbf{e}_p}{\mathbf{e}_i^{\top} \mathbf{e}_p},$
where $\odot$ denotes the element-wise product, which is divided by the inner product (i.e., $\mathbf{e}_i^{\top} \mathbf{e}_p$) to normalize the scale impact. The clustering embedding $\mathbf{c}_i$ mainly correlates with the item's similarity to the popularity embedding in terms of direction. Then, we run the K-Means clustering algorithm [likas2003global] on these item clustering embeddings.

III-B2. Pareto-Efficient Solver
In the following, we describe how to find $w_k$ for each cluster objective. First, we provide a brief introduction to Pareto-Efficiency and some related concepts.
Given a system that aims to minimize a series of objective functions $\mathcal{L}_1, \ldots, \mathcal{L}_K$, Pareto-Efficiency is a state in which it is impossible to improve one objective without hurting the others. Formally, we provide the following definition:
Definition 1. For a minimization task with multiple objectives, let $\theta^{a}$ and $\theta^{b}$ denote two solutions. $\theta^{a}$ dominates $\theta^{b}$ if and only if $\mathcal{L}_k(\theta^{a}) \le \mathcal{L}_k(\theta^{b})$ for all $k \in \{1, \ldots, K\}$, with strict inequality for at least one $k$.
Then the concept of Pareto-Efficiency is defined as:
Definition 2. A solution $\theta^{*}$ is Pareto-Efficient if and only if there is no other solution that dominates $\theta^{*}$.
It is worth mentioning that Pareto-Efficient solutions are not unique, and the set of all such solutions is called the “Pareto Frontier”. In this paper, we aim to find weights $\{w_k\}$ so that the solution of each cluster-wise objective is Pareto-Efficient, aka Item Cluster-Wise Pareto-Efficiency.
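Definitions 1 and 2 translate directly into code; a small sketch (the objective values are made up) that checks dominance and extracts the Pareto Frontier from a finite set of candidate solutions:

```python
def dominates(a, b):
    """True if objective vector a dominates b (minimization):
    a is no worse on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_frontier(solutions):
    """Solutions not dominated by any other solution (Definition 2)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Hypothetical cluster-wise objective values (L_1, L_2) of three solutions.
sols = [(0.2, 0.9), (0.5, 0.5), (0.6, 0.6)]
print(pareto_frontier(sols))  # (0.6, 0.6) is dominated by (0.5, 0.5)
```

Note that both remaining points are Pareto-Efficient even though neither is uniformly better: improving one objective from either point must hurt the other.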
According to the definition of Pareto-Efficiency, we can use the Karush-Kuhn-Tucker (KKT) conditions [wu2007karush] to describe the property of such a solution:
- $w_k \ge 0$ for all $k$, and $\sum_{k=1}^{K} w_k = 1$;
- for the shared parameters $\Theta_s$: $\sum_{k=1}^{K} w_k \nabla_{\Theta_s} \mathcal{L}_k = 0$.
As a result, the task of finding $\{w_k\}$ can be formulated as
(9)  $\min_{w_1, \ldots, w_K} \Big\| \sum_{k=1}^{K} w_k \nabla_{\Theta_s} \mathcal{L}_k \Big\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} w_k = 1, \; w_k \ge 0 \;\; \forall k.$
The optimization problem defined in Eq. (9) is equivalent to finding a minimum-norm point in the convex hull of the set of input points (i.e., the gradients $\nabla_{\Theta_s} \mathcal{L}_k$), which has been extensively studied [makimoto1994efficient].
Given the current gradients in one training step, we utilize the Frank-Wolfe algorithm [jaggi2013revisiting] to solve the convex optimization problem of Eq. (9). Algorithm 1 shows the details of this process. Its running time is mostly determined by the number of objectives $K$ and the number of iterations, and scales with the dimension of the parameter space $\Theta_s$. Since the number of objectives is usually small, the running time of Algorithm 1 is negligible compared to the model training cost.
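A sketch of the min-norm Frank-Wolfe iteration for Eq. (9) (a generic reimplementation under assumed naming, not the paper's exact Algorithm 1): at each step, pick the objective whose gradient has the smallest inner product with the current combination, then take an exact line-search step toward that vertex of the simplex.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def min_norm_weights(grads, iters=100):
    """Frank-Wolfe solver for min_w || sum_k w_k g_k ||^2 over the simplex.

    grads: list of K gradient vectors (lists of floats).
    Returns weights w with sum(w) == 1 and w_k >= 0.
    """
    K = len(grads)
    w = [1.0 / K] * K
    for _ in range(iters):
        # Current convex combination v = sum_k w_k g_k.
        v = [sum(w[k] * grads[k][j] for k in range(K))
             for j in range(len(grads[0]))]
        # Linear minimization: vertex e_t minimizing <g_t, v>.
        t = min(range(K), key=lambda k: dot(grads[k], v))
        # Exact line search between v and g_t.
        d = [vj - gj for vj, gj in zip(v, grads[t])]
        dd = dot(d, d)
        if dd == 0:
            break
        gamma = max(0.0, min(1.0, dot(d, v) / dd))
        w = [(1 - gamma) * wk + (gamma if k == t else 0.0)
             for k, wk in enumerate(w)]
    return w

# Two conflicting cluster gradients: the solver balances them so the
# combined update direction no longer favors either cluster.
w = min_norm_weights([[2.0, 0.0], [-1.0, 0.0]])
print(w)  # ≈ [1/3, 2/3], giving a combined gradient of ≈ 0
```

With gradients $[2,0]$ and $[-1,0]$, the minimum-norm point of the segment between them is the origin, reached at weights $(1/3, 2/3)$; the solver finds this in one line-search step.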
III-C. Robust Contractive Regularization
With head items dominating the gradient update process, it has been demonstrated that many state-of-the-art recommendation models are actually fragile and vulnerable to small fluctuations and changes originating from head items [yu2019vaegan]. In this section, we propose a simple yet effective penalty term to encourage model robustness and further reduce the side effects of popularity bias.
For a robust recommendation model, the user representation $\mathbf{e}_u$ should change only slightly when there is a tiny fluctuation of the item-specific parameters $\theta_i$. Meanwhile, to refrain from the popularity impact, the item representation $\mathbf{e}_i$ should also be less disturbed by the popularity embedding $\mathbf{e}_p$. To this end, we define the contractive loss as the sum of the squared (Frobenius) norms of the Jacobian of $\mathbf{e}_u$ with respect to $\theta_i$ and of $\mathbf{e}_i$ with respect to $\mathbf{e}_p$:
(10)  $\mathcal{L}_c = \Big\| \dfrac{\partial \mathbf{e}_u}{\partial \theta_i} \Big\|_F^2 + \Big\| \dfrac{\partial \mathbf{e}_i}{\partial \mathbf{e}_p} \Big\|_F^2.$
This contractive loss is added as a regularization term to encourage more robust model training.
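To make the squared-Jacobian penalty of Eq. (10) concrete, here is a finite-difference sketch of one of its two terms (the toy linear encoder and its dimensions are made up; in practice this quantity would be computed with automatic differentiation, not finite differences):

```python
def jacobian_sq_norm(f, x, eps=1e-5):
    """Squared Frobenius norm of the Jacobian of f at x, via finite differences."""
    fx = f(x)
    total = 0.0
    for j in range(len(x)):
        xp = list(x)
        xp[j] += eps  # perturb the j-th input coordinate
        for a, b in zip(f(xp), fx):
            total += ((a - b) / eps) ** 2
    return total

# Toy "encoder": the item embedding depends linearly on the popularity embedding,
# with sensitivities 0.5 and 0.1 per coordinate.
def item_embedding(e_p):
    return [0.5 * e_p[0], 0.1 * e_p[1]]

penalty = jacobian_sq_norm(item_embedding, [0.3, 0.7])
print(round(penalty, 4))  # 0.5**2 + 0.1**2 = 0.26
```

Penalizing this quantity shrinks the encoder's sensitivities, so a fluctuation of the popularity embedding perturbs the item representation less, which is exactly the contraction Eq. (10) asks for.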
III-D. Training Details
In this paper, we use the BCE loss as the base loss to train the recommendation model. More precisely, the specific loss for a pair $(u, i)$ is formulated as
(11)  $\ell(u, i) = -\big[ y_{ui} \log \sigma(\hat{y}_{ui}) + (1 - y_{ui}) \log(1 - \sigma(\hat{y}_{ui})) \big],$
where $y_{ui}$ is the label for the pair (i.e., $y_{ui} = 1$ if there is an interaction between user $u$ and item $i$, otherwise $y_{ui} = 0$), $\sigma(\cdot)$ is the sigmoid function, and $\hat{y}_{ui}$ is calculated according to Eq. (3). Considering the robust contractive regularization, the final training objective of ICMT for the shared parameters $\Theta_s$ is formulated as
(12)  $\mathcal{L}_{\Theta_s} = \mathcal{L}^{+}_{s} + \lambda_c \mathcal{L}_c + \lambda_r \|\Theta_s\|_2^2,$
where $\mathcal{L}^{+}_{s}$ is calculated according to Eq. (7), and $\lambda_c$ and $\lambda_r$ are regularization coefficients. For the item-specific parameters $\Theta_{it}$, the training objective of ICMT is formulated as
(13)  $\mathcal{L}_{\Theta_{it}} = \mathcal{L}^{+} + \lambda_c \mathcal{L}_c + \lambda_r \|\Theta_{it}\|_2^2,$
where $\mathcal{L}^{+}$ is calculated according to Eq. (5).
In each training step of ICMT, the popularity impact, unique user interests, and items are first mapped into the latent representations $\mathbf{e}_p$, $\mathbf{e}_u$, and $\mathbf{e}_i$ by the recommendation model. The items are clustered by the K-Means algorithm according to their clustering embeddings in the form of Eq. (8). We then run the PE solver according to Algorithm 1 to get the weight $w_k$ for each item cluster. Then, we calculate the prediction score according to Eq. (3). For the shared parameters $\Theta_s$, we perform updates by minimizing $\mathcal{L}_{\Theta_s}$, while for the item-specific parameters $\Theta_{it}$, we perform updates by minimizing $\mathcal{L}_{\Theta_{it}}$. Algorithm 2 illustrates the overall training and inference procedure of ICMT.
IV. Experimental Setup
In this section, we conduct experiments aiming to answer the following research questions:
RQ1: How does the proposed ICMT perform compared with normal training and other long-tail recommendation algorithms?
RQ2: How do the components and hyperparameters of ICMT affect the recommendation performance?
RQ3: Can the PE solver find Pareto-Efficient solutions for multiple item cluster-wise objectives?
RQ4: How interpretable is ICMT?
IV-A. Experimental Settings
Table I: Statistics of the evaluation datasets.

Dataset       Last.Fm   Gowalla   Yelp2018
Users         1270      15612     31646
Items         4475      13701     21098
Interactions  156882    546299    1331183
Density       0.0276    0.00255   0.00199
IV-A1. Datasets
We conduct experiments on three publicly accessible datasets: Last.Fm (https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip), Gowalla, and Yelp2018 (https://www.kaggle.com/yelp-dataset/yelp-dataset/version/7). The datasets vary in domains, platforms, and sparsity. Table I summarizes the statistics of the three datasets.
Last.Fm: This is a widely used music dataset containing interactions between users and artists. We binarize the records into implicit feedback; interacted items are considered positive samples. Due to the sparsity of the dataset, we use the 10-core setting, i.e., retaining users and items which have at least ten interactions.
Gowalla: This is a check-in dataset obtained from Gowalla, where users share their locations by checking in [liang2016modeling]. To ensure the quality of the dataset, we use the 20-core setting.
Yelp2018: This dataset is adopted from the 2018 edition of the Yelp challenge, wherein local businesses like restaurants and bars are viewed as the items. Similarly, we use the 20-core setting to ensure that each user and item has at least twenty interactions.
IV-A2. Evaluation Protocols
We adopt cross-validation to evaluate the performance. The ratio of the training, validation, and test sets is 8:1:1. The ranking is performed over the whole item set. Each experiment is repeated 5 times and the average performance is reported.
The recommendation quality is measured both in terms of overall accuracy and long-tail performance reflecting the alleviation of popularity bias. The overall accuracy is measured with two metrics: Recall and Normalized Discounted Cumulative Gain (NDCG). Recall@N measures how many ground-truth items are included in the top-N positions of the recommendation list. NDCG is a rank-sensitive metric that assigns higher weights to top positions in the recommendation list [jarvelin2002cumulated].
For the evaluation of long-tail performance, we first split the items into a head set $H$ and a tail set $T$ according to the 20%/80% Pareto Principle: $H$ represents the set of head items and $T$ denotes the set of tail items. Here 20% refers to 20% of the total number of items, rather than 20% of the interactions. We then adopt the following four metrics.
Recall-Tail and NDCG-Tail: Recall-Tail@N measures how many tail items belonging to $T$ are included in the top-N positions of the recommendation list and then interacted with by the user. Similarly, NDCG-Tail@N assigns higher weights to top positions.
Coverage and APT: Coverage measures how many different items appear in the top-N recommendation lists. A more readily interpretable but closely related metric we use for evaluation is the Average Percentage of Tail items (APT) in the recommendation lists. More precisely, Coverage@N and APT@N are defined as:
(14)  $\text{Coverage@N} = \dfrac{\big| \bigcup_{u=1}^{M} L_u^N \big|}{|\mathcal{I}|},$
(15)  $\text{APT@N} = \dfrac{1}{M} \sum_{u=1}^{M} \dfrac{|\{ i \in L_u^N : i \in T \}|}{N},$
where $M$ and $|\mathcal{I}|$ are the numbers of users and items in the test set, and $L_u^N$ represents the list of top-N recommended items for user $u$ in the test set.
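Under the above definitions, Coverage@N and APT@N can be computed as follows (the recommendation lists and the tail set here are made up for illustration):

```python
def coverage_at_n(rec_lists, num_items):
    """Fraction of all items that appear in at least one top-N list."""
    return len(set().union(*rec_lists)) / num_items

def apt_at_n(rec_lists, tail_items, n):
    """Average fraction of tail items within each user's top-N list."""
    return sum(len(set(lst) & tail_items) / n for lst in rec_lists) / len(rec_lists)

# Hypothetical top-2 lists for three users over items 0..5; tail set {3, 4, 5}.
recs = [[0, 3], [0, 1], [4, 5]]
tail = {3, 4, 5}

print(coverage_at_n(recs, 6))   # 5 distinct recommended items out of 6
print(apt_at_n(recs, tail, 2))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```

Higher Coverage means more of the catalog is exposed; higher APT means tail items occupy a larger share of each user's list, directly reflecting the alleviation of popularity bias.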
IV-A3. Baselines
We instantiate the proposed ICMT with three state-of-the-art recommendation models:

PMF [mnih2007probabilistic]: Probabilistic Matrix Factorization models the conditional probability of latent factors given the observed ratings and includes Gaussian priors as regularization.

NeuMF [he2017neural]: Neural Matrix Factorization is a notable deep-learning-based recommendation model. It combines matrix factorization and multi-layer perceptrons (MLPs) to learn high-order interaction signals.

LightGCN [he2020lightgcn]: LightGCN is a graph-based model that learns user and item representations by linearly propagating them on the interaction graph. The user and item embeddings are formulated as the aggregation of the hidden vectors in all layers.
Each model is trained with the following model-agnostic frameworks:

Normal Training: This is the normal training procedure with the plain BCE loss, as shown in Eq. (11).

IPS [gruson2019offline]: IPS reweights each interaction according to item popularity. Specifically, the weight for an interaction is set to the inverse of the corresponding item's popularity value.

POEA [ribeiro2014multiobjective]: POEA utilizes an evolutionary algorithm to find Pareto-Efficient solutions for multiple objectives like accuracy, diversity, and novelty.

MORS [wang2016multi]: MORS proposes a novel multi-objective evolutionary algorithm to find trade-off solutions that recommend accurate and niche items simultaneously.

FAreg [abdollahpouri2017controlling]: Fairness-aware regularization (FAreg) introduces a flexible regularization-based framework to enhance the long-tail coverage of recommendation lists in a learning-to-rank algorithm.

ICMT: our proposed learning framework.
Table II: Long-tail recommendation performance on the three datasets (RT = Recall-Tail, NT = NDCG-Tail, Cov = Coverage, AT = APT).

Base  Methods | Last.Fm: RT@20  NT@20  Cov@20  AT@20 | Gowalla: RT@20  NT@20  Cov@20  AT@20 | Yelp2018: RT@20  NT@20  Cov@20  AT@20
PMF  Normal  0.0014  0.0011  0.3692  0.1252  0.0399  0.0193  0.3428  0.1140  0.0048  0.0024  0.3286  0.0611 
IPS  0.0009  0.0009  0.3034  0.0713  0.0292  0.0154  0.3064  0.0926  0.0029  0.0015  0.2845  0.0387  
POEA  0.0011  0.0009  0.3106  0.0863  0.0337  0.0171  0.3293  0.1068  0.0039  0.0028  0.2992  0.0543  
MORS  0.0015  0.0012  0.3714  0.1246  0.0420  0.0203  0.3640  0.1233  0.0055  0.0030  0.3437  0.0738  
FAreg  0.0016  0.0012  0.3914  0.1373  0.0418  0.0203  0.3722  0.1332  0.0056  0.0031  0.3512  0.0745  
ICMT  0.0019*  0.0014*  0.4785*  0.1542*  0.0448*  0.0226*  0.4112*  0.1451*  0.0070*  0.0036*  0.3818*  0.0862*  
NeuMF  Normal  0.0023  0.0025  0.6235  0.2336  0.0249  0.0123  0.5908  0.2166  0.0052  0.0028  0.5617  0.0969 
IPS  0.0018  0.0020  0.5580  0.1778  0.0197  0.0099  0.5053  0.1531  0.0032  0.0021  0.4740  0.0695  
POEA  0.0021  0.0022  0.5853  0.1911  0.0215  0.0121  0.5392  0.2086  0.0044  0.0025  0.5273  0.0745  
MORS  0.0031  0.0031  0.6188  0.2447  0.0284  0.0144  0.6477  0.2589  0.0061  0.0029  0.5897  0.1089  
FAreg  0.0029  0.0030  0.6242  0.2429  0.0293  0.0133  0.6581  0.2650  0.0062  0.0029  0.5875  0.1081  
ICMT  0.0054*  0.0032*  0.6418*  0.2770*  0.0348*  0.0171*  0.6724*  0.3188*  0.0067*  0.0030*  0.6350*  0.1339*  
LGC  Normal  0.0034  0.0025  0.4273  0.1835  0.0425  0.0197  0.4195  0.1257  0.0071  0.0035  0.3767  0.0708 
IPS  0.0022  0.0013  0.3508  0.1065  0.0364  0.0135  0.3547  0.0920  0.0042  0.0021  0.2955  0.0545  
POEA  0.0034  0.0027  0.4115  0.1790  0.0388  0.0160  0.3710  0.1178  0.0064  0.0029  0.3035  0.0639  
MORS  0.0049  0.0030  0.4248  0.1951  0.0433  0.0215  0.4291  0.1391  0.0094  0.0048  0.3923  0.0919  
FAreg  0.0039  0.0026  0.4211  0.1807  0.0435  0.0219  0.4307  0.1331  0.0081  0.0043  0.3841  0.0923  
ICMT  0.0057*  0.0033*  0.4418*  0.2043*  0.0497*  0.0234*  0.4860*  0.1600*  0.0106*  0.0053*  0.4677*  0.1168* 

* denotes significance (p-value < 0.01) compared with normal training.
Table III: Overall recommendation accuracy on the three datasets (R = Recall, NG = NDCG).

Base  Methods | Last.Fm: R@20  NG@20 | Gowalla: R@20  NG@20 | Yelp2018: R@20  NG@20
PMF  Normal  0.0303  0.0309  0.1719  0.1463  0.0507  0.0377 
IPS  0.0278  0.0283  0.1306  0.1219  0.0382  0.0291  
POEA  0.0282  0.0292  0.1459  0.1304  0.0456  0.0345  
MORS  0.0301  0.0302  0.1614  0.1398  0.0480  0.0360  
FAreg  0.0292  0.0298  0.1623  0.1385  0.0474  0.0361  
ICMT  0.0329*  0.0327*  0.1754*  0.1489*  0.0513*  0.0386*  
NeuMF  Normal  0.0236  0.0229  0.1148  0.0920  0.0388  0.0280 
IPS  0.0207  0.0209  0.0865  0.0771  0.0302  0.0219  
POEA  0.0215  0.0207  0.0917  0.0848  0.0335  0.0234  
MORS  0.0228  0.0206  0.1066  0.0949  0.0388  0.0279  
FAreg  0.0210  0.0208  0.1090  0.0918  0.0385  0.0283  
ICMT  0.0237  0.0231*  0.1237*  0.0953*  0.0416*  0.0297*  
LGC  Normal  0.0413  0.0391  0.1873  0.1616  0.0580  0.0437 
IPS  0.0396  0.0383  0.1494  0.1278  0.0423  0.0312  
POEA  0.0404  0.0382  0.1669  0.1439  0.0479  0.0369  
MORS  0.0407  0.0382  0.1719  0.1542  0.0546  0.0415  
FAreg  0.0406  0.0386  0.1762  0.1544  0.0531  0.0415  
ICMT  0.0432*  0.0404*  0.1898*  0.1626*  0.0599*  0.0448* 

* denotes significance (p-value < 0.01) compared with normal training.
IV-A4. Parameter Settings
All methods are learned with the Adam optimizer [kingma2014adam], except for the NeuMF-based models, which use the RMSprop optimizer. The batch size is set to 512. The learning rate is set as […]. We evaluate on the validation set every 3000 batches of updates. For a fair comparison, the embedding size is set to 64 for all models. For NeuMF and LightGCN, we utilize a three-layer structure. The node dropout and message dropout in LightGCN are set as […] on all datasets. For the hyperparameters of ICMT, $\alpha$, $\lambda_c$, and $\lambda_r$ are searched within {1e-4, 1e-3, 2e-3, 5e-3, 1e-2} on all three datasets. We set the item cluster number to $K = 2$ unless otherwise mentioned. Note that the model hyperparameters are kept exactly the same across all training frameworks for a fair comparison.

IV-B. Performance Comparison (RQ1)
Tables II and III show the performance of top-N recommendation on Last.Fm, Gowalla, and Yelp2018, respectively. We make the following observations from the results:
(1). According to Table II, ICMT achieves the best long-tail recommendation performance among all methods. This observation confirms that the proposed ICMT is effective in alleviating the popularity bias and generating better recommendations from tail items. The base models with ICMT achieve average relative Recall-Tail@20 / NDCG-Tail@20 / Coverage@20 / APT@20 gains of 42.45% / 30.11% / 15.45% / 33.23%.
(2). As shown in Table III, although ICMT is proposed to tackle the long-tail recommendation task, it outperforms normal training and the other long-tail recommendation methods in terms of overall accuracy in all cases. This demonstrates that ICMT achieves a better trade-off between long-tail recommendation and overall accuracy than the other frameworks. The performance gains on the overall accuracy metrics NDCG@20 / Recall@20 are 4.03% / 2.98%. This improvement mainly comes from promoting high-quality niche items while downgrading irrelevant head items.
(3). We conduct one-sample t-tests, and the obtained results (i.e., p-value < 0.01) indicate that the improvements of ICMT on both the long-tail metrics and the overall metrics are statistically significant.
(4). We also analyze the trade-off between head items and tail items in ICMT. Figure 3 visualizes the average recommendation accuracy (i.e., NDCG@20, Recall@20) on head items and tail items on the Gowalla dataset with LightGCN as the recommendation model. Generally, points towards the top-right indicate better performance, with both higher head and tail accuracy. The proposed ICMT clearly achieves the highest tail accuracy while sacrificing only a little head accuracy compared with normal training. The other methods cannot achieve such performance: in most cases, they incur a larger decrease in head accuracy while obtaining a smaller gain in tail accuracy. This result demonstrates that ICMT achieves a better trade-off between head items and tail items than the other methods. The performance gain of ICMT mainly comes from the growth on niche items without losses across head items.
To conclude, the proposed ICMT significantly improves the long-tail recommendation performance compared with normal training and, meanwhile, enhances the overall accuracy.
IV-C. Ablation Study and Hyperparameter Study (RQ2)
Table IV: Ablation study with LightGCN as the base model.

Dataset  Methods  Long-Tail Metrics (RT@20, NT@20, Cov@20, AT@20)  Overall Metrics (R@20, NG@20)
Last.Fm  Default  0.0037  0.0033  0.4418  0.2043  0.0432  0.0404 
w/o CR  0.4340  0.1890  0.0421  0.0389  
w/o PD  0.0035  0.0030  0.4248  0.1829  0.0404  0.0381  
w/o CL  0.0036  0.0030  0.4373  0.1925  0.0412  0.0391  
Gowalla  Default  0.0497  0.0234  0.4860  0.1600  0.1898  0.1626 
w/o CR  0.4677  0.1508  0.1885  0.1616  
w/o PD  0.0467  0.0213  0.1864  0.1597  
w/o CL  0.0495  0.0232  0.4768  0.1573  0.1885  0.1614  
Yelp2018  Default  0.0106  0.0053  0.4677  0.1168  0.0599  0.0448 
w/o CR  0.4659  0.1012  0.0594  0.0444  
w/o PD  0.0097  0.0052  0.4411  0.0590  0.0441  
w/o CL  0.0100  0.0049  0.4627  0.1119  0.0596  0.0447 
IV-C1 Ablation Study
In this part, we conduct an ablation study to analyze the functionality of the three components of ICMT (i.e., cluster-wise reweighting (CR), popularity disentanglement (PD), and contractive loss (CL)). Table IV shows the performance of ICMT and its variants on all three datasets with LightGCN as the base recommendation model. We introduce the variants and analyze their effects respectively:
(1). Remove Cluster-wise Reweighting (w/o CR): The most significant long-tail accuracy degradation occurs without the reweighting strategy, which implies that niche items tend to obtain higher weights because of their weak relationship with the popularity embedding. This shows that clustering the items and treating the recommendation target as a multi-objective optimization problem greatly improves the performance of the model on the long tail.
(2). Remove Popularity Disentanglement (w/o PD): After removing the popularity embedding and performing clustering on the item embedding distribution, we find that Coverage and APT become significantly worse, meaning that the vanilla user embedding is biased by item popularity. In contrast, with the popularity embedding in place, the user interest embedding in ICMT alone reflects the users' true preferences, and the model can thus explore more niche items and improve long-tail performance.
(3). Remove Contractive Loss (w/o CL): The long-tail results degrade without the contractive regularization, indicating that niche items are boosted by regularizing with contractive Jacobian gradients.
To sum up, the combination of the three strategies (i.e., the full ICMT) yields the best performance, showing that all three components of ICMT are effective and work collaboratively to improve both long-tail recommendation performance and overall accuracy.
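As a rough illustration of how the three components fit together, the toy sketch below combines a disentangled item score, per-cluster reweighted losses, and a contractive penalty. All shapes, the squared-error objective, and the fixed head/tail weights are illustrative assumptions, not the paper's actual implementation (in ICMT the head/tail weights come from the PE-solver rather than being fixed).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def icmt_style_loss(user, items, head_w=1.0, tail_w=1.1, pop_w=1e-3, cl_w=1e-3):
    """Toy sketch of an ICMT-style objective.

    Each item is (interest_emb, popularity_emb, is_tail, label).
    Hypothetical simplification: squared error + fixed cluster weights.
    """
    head_losses, tail_losses = [], []
    contractive = 0.0
    for interest, pop, is_tail, label in items:
        # Popularity disentanglement: full score uses interest plus a
        # down-weighted popularity component.
        emb = [i + pop_w * p for i, p in zip(interest, pop)]
        score = dot(user, emb)
        loss = (label - score) ** 2  # stand-in per-item objective
        (tail_losses if is_tail else head_losses).append(loss)
        # Contractive term: for this linear scorer, the Jacobian of the score
        # w.r.t. the item embedding is the user vector, so we penalize its
        # squared norm.
        contractive += dot(user, user)
    head = sum(head_losses) / max(len(head_losses), 1)
    tail = sum(tail_losses) / max(len(tail_losses), 1)
    # Cluster-wise reweighting: combine the head/tail objectives.
    return head_w * head + tail_w * tail + cl_w * contractive
```

Removing any ingredient here mirrors the w/o CR, w/o PD, and w/o CL variants above: dropping the weights collapses the two clusters into one objective, dropping `pop_w` mixes popularity into the interest score, and dropping `cl_w` removes the gradient regularization.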
IV-C2 Hyperparameter Study
(1). Effect of Item Cluster Number: In this part, we use LightGCN as the base recommendation model since it achieves the best overall accuracy. We vary the number of item clusters and report NDCG@20, NDCG-Tail@20, and APT@20 in Figure 4 on the Last.Fm and Yelp2018 datasets. The general observation is that the overall recommendation accuracy stays at the same level while the long-tail performance shows a bell-shaped curve. Increasing the cluster number from 1 (i.e., normal training) to 2 yields the largest long-tail improvement; beyond that, long-tail performance keeps diminishing as the cluster number grows. These results indicate that adaptively assigning the items into two clusters leads to the most satisfactory performance, whereas with more item clusters the long-tail performance of ICMT is compromised. The reason could be that too many clusters make ICMT focus on the balance among tail clusters rather than the balance between head and tail items.
(2). Effect of Popularity Factor Weight: To evaluate the impact of the popularity factor weight, we vary it in the range {0, 1e-4, 1e-3, 2e-3, 5e-3, 1e-2}. The experimental results are summarized in Figure 5. We observe that the overall NDCG@20 first rises and then reaches its peak, demonstrating that properly disentangling the popularity factor lets the model better capture true user interest. Meanwhile, as the weight increases, the long-tail performance keeps rising swiftly, which implies that more niche items are uncovered when we put more emphasis on decomposing the popularity factor. To sum up, for higher overall accuracy, we adopt the peak-performing weight as our default setting.
(3). Effect of Contractive Loss Weight: From Figure 6, we observe that a small weight promotes accuracy, especially in the long-tail space, owing to its ability to balance the gradients from all items and suppress the dominance of the popularity embedding. However, although a larger weight keeps promoting APT, heavier robust regularization does not necessarily lead to better accuracy because it loses information from the gradients. Therefore, we set the weight to 1e-3, 1e-3, and 2e-3 for Last.Fm, Gowalla, and Yelp2018 respectively to achieve the best overall performance.
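One simple way to form the item clusters studied in (1) is to split items by popularity rank. This is a deliberate simplification for illustration: ICMT itself assigns items to clusters adaptively rather than by fixed popularity cutoffs, and the `item_counts` data below is hypothetical.

```python
def popularity_clusters(item_counts, k=2):
    """Assign items to k clusters by popularity rank (simplified sketch;
    ICMT clusters adaptively rather than by fixed rank cutoffs).

    item_counts: dict mapping item_id -> interaction count.
    Returns a dict mapping item_id -> cluster index (0 = most popular).
    """
    ranked = sorted(item_counts, key=item_counts.get, reverse=True)
    size = max(len(ranked) // k, 1)  # items per cluster (last one absorbs the rest)
    return {item: min(rank // size, k - 1) for rank, item in enumerate(ranked)}

# Hypothetical interaction counts with a long-tail shape.
counts = {"a": 100, "b": 90, "c": 5, "d": 3, "e": 1}
print(popularity_clusters(counts, k=2))  # prints {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 1}
```

With k = 2 this reproduces the head/tail split that the experiments above find most effective; larger k fragments the tail into several small clusters, matching the observed drop in long-tail performance.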
IV-D PE-Solver Investigation (RQ3)
In this part, we conduct experiments to examine whether the PE-solver generates reasonable Pareto-efficient solutions.
IV-D1 Pareto Frontier and the Searched PE Point
On the Gowalla dataset with all three recommendation models, we first generate the Pareto frontiers of head and tail losses by running the Pareto MTL algorithm [lin2019pareto] with different trade-off preference vectors, as shown in Figure 7(a). The obtained Pareto frontiers under different constraints satisfy Pareto efficiency, i.e., no point achieves both a lower short-head loss and a lower long-tail loss than any other point. When the model focuses more on head items, the short-head loss decreases while the long-tail loss increases, and vice versa.
As for the points found by the PE-solver, we can see that on all recommendation models these points mainly lie in the middle part of the Pareto frontiers. This observation indicates that the PE-solver matches our aim of balancing the trade-off between head items and tail items.
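The Pareto-efficiency property described above can be checked mechanically: a set of (short-head loss, long-tail loss) points forms a Pareto frontier exactly when no point is dominated by another. A minimal sketch with hypothetical loss values:

```python
def dominates(a, b):
    """a dominates b if a is no worse on both losses and they are not equal."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def is_pareto_frontier(points):
    """True if no (head_loss, tail_loss) point is dominated by another."""
    return not any(dominates(p, q) for p in points for q in points if p is not q)

# Hypothetical (short-head loss, long-tail loss) trade-off points.
frontier = [(0.10, 0.50), (0.20, 0.30), (0.40, 0.10)]
print(is_pareto_frontier(frontier))                    # prints True
print(is_pareto_frontier(frontier + [(0.30, 0.40)]))   # prints False: (0.30, 0.40)
                                                       # is dominated by (0.20, 0.30)
```

The first set trades head loss against tail loss just as described above; adding the interior point breaks the frontier because another point beats it on both losses.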
IV-D2 The Learning of Weights
To clarify the training process and reveal the attention paid to long-tail items, we further plot the learning curves of the average weights assigned to long-tail items, as shown in Figure 7(b). We use LightGCN as the recommendation model and visualize the trend on all three datasets. The weights obtained from the PE-solver tend to emphasize tail items: after fluctuating in the early training stage, the weight for tail items flattens and converges to a value in [1.04, 1.16]. In contrast, normal training neglects these PE weights and treats all items in the same way, leading to overfitting on head items. Hence, the proposed ICMT can effectively mitigate the popularity bias of the RS by assigning adaptive weights to the head and tail cluster-wise objectives.
IV-E Case Study (RQ4)
To show the interpretability of ICMT, we randomly select a user from the Last.Fm dataset and retrieve the two top-5 recommendation lists produced by LightGCN-Normal and LightGCN-ICMT given the same interaction history. Figure 8 illustrates the recommendation details. The list from LightGCN-Normal contains popular ground-truth items but no tail items, while the list from LightGCN-ICMT contains two tail (unpopular) items, Farben Lehre and Plavi Orkestar, thanks to the higher average weights assigned to tail items in the user's interaction history. One of the two recommended tail items belongs to the ground-truth test set, which improves both the long-tail and the overall performance.
Note that the two recommendation lists have several items in common (e.g., Erasure), which indicates that LightGCN-ICMT can also capture the user's preference for popular items.
V Conclusion
In this paper, we propose to tackle the long-tail recommendation task from a multi-objective optimization perspective. We find that head items are repetitively recommended because they tend to have larger gradient norms and thus dominate the gradient updates; learning parameters based on such gradients can sacrifice the performance on long-tail items. To alleviate this phenomenon, we propose a general learning framework, named ICMT, which features popularity disentanglement, cluster-wise multi-objective reweighting, and robust contractive regularization. We instantiate ICMT with three state-of-the-art recommendation models and conduct extensive experiments on three real-world datasets. The results demonstrate the effectiveness of ICMT. Future work includes generalizing ICMT to other tasks such as multi-class classification and long-tail document retrieval.