Be Causal: De-biasing Social Network Confounding in Recommendation

05/17/2021 ∙ by Qian Li, et al. ∙ University of Technology Sydney

In recommendation systems, the missing-not-at-random (MNAR) problem results in selection bias, ultimately degrading recommendation performance. A common practice for addressing MNAR is to treat missing entries from the so-called "exposure" perspective, i.e., modeling how an item is exposed (provided) to a user. Most existing approaches use heuristic models or re-weighting strategies on observed ratings to mimic the missing-at-random setting. However, little research has been done to reveal how the ratings go missing from a causal perspective. To bridge this gap, we propose an unbiased and robust method called DENC (De-bias Network Confounding in Recommendation) inspired by confounder analysis in causal inference. In general, DENC provides a causal analysis of MNAR from both the inherent factors (e.g., latent user or item factors) and the auxiliary network's perspective. In particular, the proposed exposure model in DENC controls for the social network confounder while preserving the observed exposure information. We also develop a deconfounder model through balanced representation learning to retain the primary user and item features, which enables DENC to generalize well on rating prediction. Extensive experiments on three datasets validate that our proposed model outperforms state-of-the-art baselines.




1. Introduction

Recommender systems aim to handle the information explosion while meeting users’ personalized interests, and have received extensive attention from both research communities and industry. The power of a recommender system relies heavily on whether the observed user feedback on items “correctly” reflects users’ preferences. However, such feedback often contains only a small portion of observed entries (e.g., explicit ratings), leaving a large number of missing ratings to be predicted. To handle partially observed feedback, a common assumption for model building is that the feedback is missing at random (MAR), i.e., the probability of a rating being missing is independent of its value. When the observed data follows MAR, statistical analysis on only the observed data can yield “correct” prediction without introducing bias

(Marlin and Zemel, 2009; Lim et al., 2015). However, this MAR assumption usually does not hold in reality, and the missing pattern exhibits a missing-not-at-random (MNAR) phenomenon. Generally, MNAR is related to selection bias. For instance, in movie recommendation, instead of randomly choosing movies to watch, users are prone to watching those that are highly recommended, while in advertisement recommendation, whether an advertisement is presented to a user is purely subject to the advertiser’s provision, rather than random. In these scenarios, the missing pattern of the data mainly depends on whether users are exposed to items, and consequently, the ratings are in fact missing not at random (He et al., 2016). These findings shed light on the origination of selection bias from MNAR (Sportisse et al., 2020). Therefore, selection bias cannot be ignored in practice and has to be modeled properly for reliable recommendation prediction. How to model the missing-data mechanism and debias the rating prediction forms the main motivation of this research.

Existing MNAR-aware Methods

There are abundant methods for addressing the MNAR problem on implicit or explicit feedback. For implicit feedback, traditional methods (Hu et al., 2008) take the uniformity assumption, assigning a uniform weight to down-weight the missing data under the premise that each missing entry is equally likely to be negative feedback. This is a strong assumption and limits models’ flexibility in real applications. Recently, researchers have tackled MNAR data directly by simulating the generation of the missing pattern under different heuristics (Hernández-Lobato et al., 2014). Among these works, probabilistic models are presented as a proxy to relate missing feedback to various factors, e.g., item features. For explicit feedback, a widely adopted mechanism is to exploit the dependencies between rating missingness and the potential ratings (e.g., 1-5 star ratings) (Koren and Bell, ). That is, high ratings are less likely to be missing than low ratings. However, these methods involve heuristic alterations to the data, which are neither empirically verified nor theoretically proven (Saito, 2020).

A couple of methods have recently been studied for addressing MNAR (Hernández-Lobato et al., 2014; Liang et al., 2016; Schnabel et al., 2016) by treating missing entries from the so-called “exposure” perspective, i.e., indicating whether or not an item is exposed (provided) to a user. For example, ExpoMF resorts to modeling the probability of exposure (Hernández-Lobato et al., 2014), up-weighting the rating-prediction loss for entries with high exposure probability. However, ExpoMF can lead to poor prediction accuracy for rare items compared with popular items. Likewise, recent works (Liang et al., 2016; Schnabel et al., 2016) resort to propensity scores to model exposure. The propensity score, introduced in causal inference, indicates the probability that a subject receives a treatment or action. Exposing a user to an item in a recommendation system is analogous to exposing a subject to a treatment. Accordingly, these works adopt the propensity score to model the exposure probability and re-weight the prediction error for each observed rating with the inverse propensity score. The ultimate goal is to calibrate the MNAR feedback into missing-at-random feedback that can guide unbiased rating prediction.

Whilst the state-of-the-art propensity-based methods are validated to alleviate the MNAR problem to some extent, they still suffer from several major drawbacks: 1) they merely exploit the user/item latent vectors from the ratings for mitigating MNAR, but fail to disentangle the different causes of MNAR from a causal perspective; 2) technically, they rely heavily on propensity score estimation to mitigate the MNAR problem, and the performance is sensitive to the choice of propensity estimator (Wang et al., 2019), which is notoriously difficult to tune.

Figure 1. The causal view of the MNAR problem: treatment and outcome are terms from the theory of causal inference, denoting an action taken (e.g., exposure) and its result (e.g., rating), respectively. The confounder (e.g., social network) is the common cause of treatment and outcome.

The proposed approach

To overcome these obstacles, we instead address the fundamental MNAR issue in recommendation from a novel causal inference perspective, to attain a robust and unbiased rating prediction model. From a causal perspective, we argue that the selection bias (i.e., MNAR) in a recommendation system is attributed to the presence of confounders. As explained in Figure 1, confounders are factors (or variables) that affect both the treatment assignments (exposure) and the outcomes (rating). For example, friendships (a social network) can influence both users’ choice of which movies to watch and their subsequent ratings. Users who choose to watch a movie are more likely to rate it than those who do not. So the social network is indeed a confounding factor that affects which movies a user is exposed to and how the user rates them. The confounding factor results in a distribution discrepancy between the partially observed ratings and the complete ratings, as shown in Figure 2. Without considering this distribution discrepancy, a rating model trained on the observed ratings fails to generalize well to the unobserved ratings. With this fact in mind, our idea is to analyze the confounding effect of social networks on rating and exposure, and in turn fundamentally alleviate the MNAR problem to predict valid ratings.

Figure 2. The training space of conventional recommendation models is the observed rating space , whereas the inference space is the entire exposure space . The discrepancy of data distribution between and leads to selection bias in conventional recommendation models.

In particular, we attempt to study the MNAR problem in recommendation from a causal view and propose an unbiased and robust method called DENC (De-bias Network Confounding in Recommendation). To sufficiently account for the selection bias in MNAR, we model the underlying factors (i.e., inherent user-item information and the social network) that generate the observed ratings. In light of this, as shown in Figure 4, we construct a causal-graph-based recommendation framework by disentangling three determinants of the ratings, i.e., inherent factors, confounder and exposure. Each determinant corresponds to one of three specific components in DENC: the deconfounder model, the social network confounder and the exposure model, all of which jointly determine the rating outcome.

In summary, the key contributions of this research are as follows:

  • Fundamentally different from previous works, DENC is the first method to achieve unbiased rating prediction by disentangling the determinants of selection bias from a causal view.

  • The proposed exposure model is capable of revealing the exposure assignment and accounting for the confounding factors derived from the social network, which remedies selection bias in a principled manner.

  • We develop a deconfounder model via balanced representation learning that embeds inherent factors independent of the exposure, thereby mitigating the distribution discrepancy between the observed rating space and the inference space.

  • We conduct extensive experiments to show that our DENC method outperforms state-of-the-art methods. The generalization ability of DENC is also validated under different degrees of confounding.

2. Related Work

2.1. MNAR-aware Methods

2.1.1. Traditional Heuristic Models

Early works on explicit feedback formulate recommendation as a rating prediction problem in which the large volume of unobserved ratings (i.e., missing data) is assumed to be extraneous to user preference (Hu et al., 2008). Following this unreliable assumption, numerous recommenders have been developed, including basic matrix factorization recommenders (Rendle and Schmidt-Thieme, 2008) and sophisticated ones such as SVD++ (Koren and Bell, ). As missing-data techniques from statistical analysis, especially the MNAR formulation, have found widespread application, there has been much interest in understanding their impact on recommendation systems. Previous research has shown that for explicit-feedback recommenders, users’ ratings are MNAR (Marlin and Zemel, 2009). Marlin and Zemel (Marlin and Zemel, 2009) first study the effect of violating the missing-at-random assumption in recommendation methods; they propose statistical models to address the MNAR problem of missing ratings, based on the heuristic that users are more likely to supply ratings for items that they like. The work of (Hernández-Lobato et al., 2014) also focuses on the MNAR problem, proposing a probabilistic matrix factorization model (ExpoMF) for collaborative filtering that learns from observational data. However, these heuristic methods are neither empirically verified nor theoretically proven (Schnabel et al., 2016; Liang et al., 2016).

2.1.2. Propensity-based Model

The basic idea of propensity scoring methods is to turn the outcomes of an observational study into a pseudo-randomized trial by re-weighting samples, similarly to importance sampling. Typically, using the Inverse Propensity Weighting (IPW) estimator, Liang (Liang et al., 2016) proposes a framework consisting of an exposure model and a preference model. The exposure model is estimated by Poisson factorization; the preference model is then fit on weighted click data, where each click is weighted by the inverse of its exposure probability to alleviate popularity bias. Based on the Self-Normalized Inverse Propensity Scoring (SNIPS) estimator, the model in (Schnabel et al., 2016) is developed either directly on observed ratings from a missing-completely-at-random sample estimated by SNIPS, or indirectly through user and item covariates. These works re-weight the observational click data as though it came from an “experiment” in which users are randomly shown items. Thus, they still adopt re-weighting strategies to mimic the missing-completely-at-random setting, as most heuristic models do (Yang et al., 2018). Besides, these works are sensitive to the choice of propensity score estimator (Wang et al., 2019). In contrast, our work relies solely on the observed ratings: we require neither ratings from a gold-standard randomized exposure experiment nor external covariates; moreover, we consider another important bias in the recommendation scenario, namely, social confounding bias.

2.2. Social Network-based Methods

The effectiveness of social networks has been demonstrated by a vast number of social recommenders. Purushotham (Purushotham et al., 2012) explored how traditional factorization methods can exploit network connections; this brings the latent preferences of connected users closer to each other, reflecting that friends have similar tastes. Other research has incorporated social information directly into various collaborative filtering methods. TrustMF (Yang et al., 2016) adopts collaborative filtering to map users into low-dimensional latent feature spaces in terms of their trust relationships; the remarkable performance of the proposed model reflects that individuals in a social network affect each other while reviewing items. SocialMF (Jamali and Ester, 2010) incorporates trust propagation into the matrix factorization model, assuming that each user’s factors depend on the factor vectors of his/her direct neighbors in the social network. However, despite the remarkable contribution of social network information to various recommendation methods, it has not yet been utilized to control for confounding bias in causal-inference-based recommenders.

3. DENC Method

3.1. Notations

We first give some preliminaries of our method and the notation used. Suppose we have a rating matrix R describing the numerical ratings of users on items. Let U and I be the sets of users and items, respectively. For each user-item pair (u, i), we use o_ui ∈ {0, 1} to indicate whether user u has been exposed to item i. We use r_ui to represent the rating given by user u to item i.

3.2. A Causal Inference Perspective

Viewing recommendation from a causal inference perspective, we argue that exposing a user to an item in recommendation is an intervention analogous to exposing a patient to a treatment in a medical study. Following the potential outcome framework in causal inference (Rubin, 1974), we reformulate the rating prediction as follows.

Problem 1 (Causal View for Recommendation).

For every user-item pair (u, i) with a binary exposure o_ui, there are two potential rating outcomes, one under exposure and one without. We aim to estimate the ratings had all items been exposed to all users, i.e., to estimate the potential outcome under exposure for all users u and items i.

Figure 3. The causal graph for recommendation.

As we can only observe the outcome when user u is exposed to item i, i.e., o_ui = 1, we target the question of what would happen to an unobserved rating if we set the exposure variable to 1. In our setting, the confounder derived from the social network among users is denoted as a common cause that affects both the exposure and the outcome. We aim to disentangle the underlying factors in the observed ratings and social networks, as shown in Figure 3. The intuition behind Figure 3 is that the observed rating outcomes are generated as a result of both inherent and confounding factors. The inherent factors refer to user preferences and inherent item properties, and the auxiliary factors are the confounding factors from the social network. By disentangling the determinants that cause the observed ratings, we can separate their effects from the selection bias of confounders and exposure, which allows us to attain an unbiased rating estimator with superior generalization ability.

Following the causal graph in Figure 3, we now design our DENC method to incorporate the three determinants in Figure 4. Each component corresponds to one specific determinant: the social network confounder, the exposure model and the deconfounder model, which jointly determine the rating outcome.

Figure 4. Our DENC method consists of four components: the social network confounder, the exposure model, the deconfounder model and the rating model.

3.3. Exposure Model

To cope with the selection bias caused by users or external social relations, we build on causal inference theory and, guided by the treatment assignment mechanism, propose a novel exposure model that computes the probability of the exposure variable specific to each user-item pair. This model helps us understand the generation of the Missing Not At Random (MNAR) pattern in ratings, and thus remedies selection bias in a principled manner. For example, a user may go to watch a movie because of a friend’s strong recommendation. We therefore propose to mitigate the selection bias by exploiting network connectivity information that indicates the extent to which a user’s exposure is affected by its neighbors.

3.3.1. Social Network Confounder

To control the selection bias arising from the external social network, we propose a confounder representation model that quantifies the common biased factors affecting both the exposure and the rating outcome.

We now discuss how the confounder representation is chosen and learned. Let G = (U, E) denote the social relationships among users, where an edge denotes a friend relationship between two users. We resort to the node2vec method (Grover and Leskovec, 2016) to learn a network embedding from the diverse connectivity provided by the social network. More details on node2vec can be found in Section A.4 in the appendix. To mine the deep social structure of G, for every source user u, node2vec generates the network neighborhood N(u) through a sampling strategy that explores its neighborhoods in both a breadth-first and a depth-first manner. The representation f(u) for user u can be learned by minimizing the negative log-likelihood of preserving the network neighborhood N(u):

min_f − Σ_{u ∈ U} log Pr(N(u) | f(u))

The final output sufficiently explores the diverse neighborhoods of each user, and thus represents the extent to which a user’s exposure is influenced by his or her friends in graph G.
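As a concrete sketch of the neighborhood sampling behind this embedding, the following simplified walk generator corresponds to node2vec with its return and in-out parameters p = q = 1 (i.e., unbiased walks, omitting the BFS/DFS interpolation); the toy friendship graph and hyperparameters are illustrative assumptions, and the walks would then be fed to a skip-gram model to produce the user embeddings.

```python
import random

def sample_walks(adj, walk_length=10, walks_per_node=5, seed=0):
    """Sample random walks over a social graph given as an adjacency dict.

    This is the p = q = 1 special case of node2vec's biased walks, kept
    minimal for illustration; the full algorithm biases each step to
    interpolate between breadth-first and depth-first exploration.
    """
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # dead end: stop the walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy friendship graph over users 0-3 (undirected edges listed both ways).
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
walks = sample_walks(adj)
# Co-occurrence of users within these walks defines the neighborhoods N(u);
# training a skip-gram model on the walks yields the embeddings f(u).
```

The number and length of walks control how thoroughly each user's neighborhood is explored before the skip-gram step.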

3.3.2. Exposure Assignment Learning

The exposure under the recommendation scenario is not randomly assigned: users often express their preferences over the social network, which in turn affects their friends’ exposure. In this section, to characterize the Missing Not At Random (MNAR) pattern in ratings, we resort to causal inference (Pearl, 2009) to build an exposure mechanism influenced by the social network.

To begin with, we are interested in the binary exposure variable that defines whether an item is exposed (1) or unexposed (0) to a user. Based on the informative confounder learned from the social network, we introduce the notion of propensity to capture the exposure in the language of causal inference.

Definition 1 (Propensity).

Given an observed rating and the confounder in (1), the propensity of the corresponding exposure for a user-item pair is defined as


In view of the foregoing, we model the exposure mechanism by the probability of the exposure being assigned 0 or 1.


where the index set ranges over the observed ratings. An exposed item can result in either an observed or an unobserved rating: 1) for an observed rating, we know for certain that the item is exposed; 2) an unobserved rating may represent negative feedback on an exposed item (i.e., the user saw the item but was reluctant to rate it). In light of this, based on (2), we have


The unknown exposures follow the distribution


By substituting Eq. (4) and Eq. (5) into Eq. (3), we attain the exposure assignment for the overall rating data as


Inspired by (Pan et al., 2008), we assume a uniform scheme when no side information is available. Following most causal inference methods (Shalit et al., 2017; Pearl, 2009), a widely adopted parameterization for the propensity is a logistic regression network, i.e.,


Based on Eq. (7), the overall exposure in Eq. (6) can be written as a function of the model parameters, i.e.,


where the social network confounder is learned by the pre-trained node2vec algorithm. As in supervised learning, the parameters can be optimized by minimizing the negative log-likelihood.
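A minimal sketch of such a logistic propensity and its negative log-likelihood follows. The feature construction (concatenating the user's confounder embedding with an item vector), the weight vector w and bias b are illustrative assumptions; the text only states that the propensity is parameterized by a logistic regression network over the social-network confounder.

```python
import numpy as np

def propensity(z_u, z_i, w, b):
    """Logistic model for the exposure propensity of a user-item pair.

    z_u: confounder embedding of the user (e.g., from node2vec);
    z_i: a latent item vector. Concatenation, w and b are illustrative
    assumptions about the exact parameterization.
    """
    x = np.concatenate([z_u, z_i])
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))  # sigmoid

def nll(pi, o):
    """Negative log-likelihood of one observed exposure o in {0, 1}."""
    return -(o * np.log(pi) + (1 - o) * np.log(1 - pi))

# With zero weights the model is uninformative and outputs 0.5.
z_u, z_i = np.zeros(2), np.zeros(2)
pi = propensity(z_u, z_i, w=np.zeros(4), b=0.0)
```

Summing `nll` over all user-item pairs gives the negative log-likelihood minimized to fit the exposure parameters.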

3.4. Deconfounder Model

Traditional recommendation methods learn latent factor representations for users and items by minimizing errors on the observed ratings, e.g., matrix factorization. Due to selection bias, such learned representations do not necessarily minimize the error on unobserved rating prediction. Inspired by (Shalit et al., 2017), we propose to learn a balanced representation that is independent of the exposure assignment, so that it captures inherent, invariant features of users and items. These invariant features must also lie in the inference space shown in Figure 2, which allows unknown ratings to be inferred consistently from observed ones. This makes sense in theory: if the learned representation is hard to distinguish across different exposure settings, it captures invariant features of users and items.

According to Figure 3, we can define two latent vectors to represent the inherent factors of a user and an item, respectively. Recall that different parameter values in Eq. (6) can generate different exposure assignments for the observed rating data. Following this intuition, we construct two different exposure assignments and, correspondingly, two settings of the representations, each including the inherent factors of users and items. Figure 3 also indicates that the inherent factors of a user and an item remain unchanged even if the exposure variable is altered from 0 to 1, and vice versa. That means the representations should be independent of the exposure assignment. Accordingly, minimizing the discrepancy between the two representation distributions ensures that the learned factors embed no information about the exposure variable, thus reducing selection bias. The penalty term for such a discrepancy is defined as


Inspired by (Müller, 1997), we employ the Integral Probability Metric (IPM) to estimate the discrepancy between the two representation distributions. Given a function family F and two probability distributions p and q, the (empirical) IPM is defined as

IPM_F(p, q) = sup_{f ∈ F} | E_{x∼p} f(x) − E_{x∼q} f(x) |

where F is a class of real-valued bounded measurable functions. We choose F as the family of 1-Lipschitz functions, which turns the IPM into the Wasserstein-1 distance, i.e.,


where the set of push-forward functions transforms the representation distribution of the exposed pairs into that of the unexposed pairs; the computation involves a pairwise distance matrix between the exposed and unexposed user-item pairs. Based on the discrepancy defined in (12), we reformulate the penalty term in (9) as


We adopt the efficient approximation algorithm proposed by (Shalit et al., 2017) to compute the gradient of (12) for training the deconfounder model. In particular, a mini-batch of exposed and one of unexposed user-item pairs are sampled, and the pairwise distance matrix between the two mini-batches is calculated. After computing the distance matrix, we can approximate the discrepancy and its gradient with respect to the model parameters (for a more detailed calculation, refer to Algorithm 2 in the appendix of Shalit et al. (2017)). In conclusion, the latent factors learned by the deconfounder model embed no information about the exposure variable; all the confounding factors are retained in the social network confounder.
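The discrepancy computation can be illustrated with an entropy-regularized (Sinkhorn) approximation of the Wasserstein-1 distance between exposed and unexposed representations, in the spirit of the approximation used by Shalit et al. (2017); the regularization strength `eps`, the iteration count and the toy data are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def sinkhorn_distance(X1, X0, eps=0.1, n_iter=200):
    """Entropy-regularized approximation of the Wasserstein-1 distance
    between representations of exposed (X1) and unexposed (X0)
    user-item pairs. Rows are representation vectors."""
    n, m = len(X1), len(X0)
    # Pairwise Euclidean distance matrix between the two groups.
    M = np.linalg.norm(X1[:, None, :] - X0[None, :, :], axis=-1)
    K = np.exp(-M / eps)                       # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u = np.ones(n) / n
    for _ in range(n_iter):                    # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]            # approximate transport plan
    return float((T * M).sum())                # transport cost

# Toy representations: identical groups vs. a uniformly shifted group.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
d_same = sinkhorn_distance(X, X)         # small: distributions match
d_shift = sinkhorn_distance(X, X + 5.0)  # large: distributions differ
```

During training, this scalar (computed on mini-batches) would serve as the balance penalty whose gradient is backpropagated into the representation network.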

3.5. Learning

3.5.1. Rating prediction

Having obtained the final user and item representations from the deconfounder model, we use their inner product as the inherent-factor term to estimate the rating. As shown in the causal structure in Figure 4, the other component affecting the rating prediction is the social network confounder. A simple way to incorporate these components into a recommender system is through a linear model as follows.


where the coefficient describes how much the confounder contributes to the rating. To define an unbiased loss function over the biased observations, we leverage the IPS strategy (Schnabel et al., 2016) to weight each observation by its propensity. By Definition 1, the intuition of inverse propensity weighting is to down-weight commonly observed ratings while up-weighting rare ones.
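The linear rating model and its inverse-propensity-scored loss can be sketched as follows. The coefficient `gamma`, the projection `w_c` of the confounder embedding, and the array shapes are illustrative assumptions; the paper specifies only that the confounder enters a linear model and that observed ratings are reweighted by inverse propensity.

```python
import numpy as np

def predict_rating(p_u, q_i, z_u, gamma, w_c):
    """Linear rating model: inner product of the inherent user/item
    factors plus a linearly weighted confounder term (gamma, w_c are
    illustrative parameters)."""
    return p_u @ q_i + gamma * (w_c @ z_u)

def ips_loss(r_true, r_pred, pi, o):
    """Inverse-propensity-scored squared error over observed ratings:
    each observed entry (o = 1) is weighted by 1 / pi, down-weighting
    commonly exposed pairs and up-weighting rare ones."""
    mask = o == 1
    return float(np.mean((r_true[mask] - r_pred[mask]) ** 2 / pi[mask]))

# With all propensities equal to 1 the loss reduces to the plain MSE
# over the observed entries; halving a propensity doubles its weight.
r_true = np.array([4.0, 2.0, 5.0])
r_pred = np.array([3.0, 2.0, 5.0])
o = np.array([1, 1, 0])
loss_mar = ips_loss(r_true, r_pred, np.ones(3), o)
loss_ips = ips_loss(r_true, r_pred, np.array([0.5, 1.0, 1.0]), o)
```

In training, the propensities would come from the exposure model of Section 3.3 and the factors from the deconfounder model.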


3.5.2. Optimization

To this end, the objective function of our DENC method for rating prediction is derived as:


where the trainable parameters are regularized by a squared-norm term to alleviate overfitting, and the remaining coefficients are trade-off hyper-parameters. To optimize the objective function, we adopt Stochastic Gradient Descent (SGD) (Bottou, 2010) as the optimizer due to its efficiency.

4. Experiments

To thoroughly understand the nature of the MNAR issue and the proposed unbiased DENC, we conduct experiments to answer the following research questions:


  • (RQ1) How is the confounder bias caused by the social network manifested in real-world recommendation datasets?

  • (RQ2) Does our DENC method achieve the state-of-the-art performance in debiasing recommendation task?

  • (RQ3) How does the embedding size of each component (e.g., social network confounder and deconfounder model) in our DENC method impact the debiasing performance?

  • (RQ4) How do the missing social relations impact the debiasing performance of our DENC method?

4.1. Setup

4.1.1. Evaluation Metrics

We adopt two popular metrics, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), to evaluate performance. Since improvements in MAE or RMSE can have a significant impact on the quality of the Top-K recommendations (Koren, 2008), we also evaluate our DENC with Precision@K and Recall@K for ranking performance (we consider items with a rating greater than or equal to 3.5 as relevant).
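These metrics can be sketched as follows; the 3.5 relevance threshold follows the setup above, while selecting the top-K items by predicted rating is a standard (assumed) choice.

```python
import numpy as np

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def precision_recall_at_k(y, yhat, k, threshold=3.5):
    """Precision@K / Recall@K for one user: items with a true rating
    >= threshold count as relevant; the top-K items by predicted
    rating are the recommendations."""
    order = np.argsort(-yhat)[:k]        # indices of the top-K predictions
    relevant = y >= threshold
    hits = relevant[order].sum()
    precision = hits / k
    recall = hits / max(relevant.sum(), 1)
    return float(precision), float(recall)
```

Per-user precision/recall values would then be averaged over all test users.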

4.1.2. Datasets

We conduct experiments on three datasets: one semi-synthetic dataset and two benchmark datasets, Epinions and Ciao (Tang et al., 2012), both available at tangjili/trust.html. We keep all user-item interaction records in the original datasets instead of discarding items that have sparse interactions with users (models can benefit from preprocessed datasets in which every item interacts with at least a certain number of users, since such preprocessing reduces sparsity). The semi-synthetic dataset is generated by incorporating the social network into the MovieLens dataset. The details of these datasets are given in Section A.1 in the appendix.

4.1.3. Baselines

We compare our DENC against three groups of methods for rating prediction: (1) Traditional methods, including NRT (Li et al., 2017a) and PMF (Mnih and Salakhutdinov, 2008). (2) Social network-based methods, including GraphRec (Fan et al., 2019), DeepFM+ (Guo et al., ), SocialMF (Jamali and Ester, 2010), SREE (Li et al., 2017b) and SoReg (Ma et al., 2011). (3) Propensity-based methods, including CausE (Bonner and Vasile, 2018) and D-WMF (Wang et al., 2018). More implementation details of baselines and parameter settings are included in Section A.2 in the appendix.

Epinions Ciao MovieLens-1M
# users 22,164 7,317 6,040
# items 296,277 104,975 3,706
# ratings 922,267 283,319 1,000,209
density-R (%) 0.0140 0.0368 4.4683
# relations 355,754 111,781 9,606
density-SR (%) 0.0724 0.2087 0.0263
Table 1. Statistics of the datasets. Density for ratings (density-R) is #ratings / (#users × #items) × 100%; density for social relations (density-SR) is #relations / #users² × 100%.
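The density columns can be reproduced from the raw counts in the table; the snippet below recomputes them, assuming density-R = #ratings / (#users × #items) × 100 and density-SR = #relations / #users² × 100 (the latter denominator is an inference that matches the reported values up to last-digit rounding).

```python
# Raw counts copied from Table 1.
stats = {
    "Epinions":     dict(users=22164, items=296277, ratings=922267,  relations=355754),
    "Ciao":         dict(users=7317,  items=104975, ratings=283319,  relations=111781),
    "MovieLens-1M": dict(users=6040,  items=3706,   ratings=1000209, relations=9606),
}

for name, s in stats.items():
    density_r = 100.0 * s["ratings"] / (s["users"] * s["items"])
    density_sr = 100.0 * s["relations"] / s["users"] ** 2
    print(f"{name}: density-R={density_r:.4f}%  density-SR={density_sr:.4f}%")
```

For Epinions this yields density-R ≈ 0.0140% and density-SR ≈ 0.0724%, matching the table.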

4.1.4. Parameter Settings

We implement all baseline models on a Linux server with a Tesla P100 PCI-E 16GB GPU. (Our code is currently shared on GitHub; due to the double-blind submission policy, we leave the link void for now and will activate it after paper acceptance.) Datasets for all models are split into training/test sets with a proportion of 80/20, and 20% of the training set serves as the validation set; for CausE, whose training additionally requires a debiased dataset, we sample 10% of the training set to build one, in which items are sampled to be uniformly exposed to users.

We optimize all models with Stochastic Gradient Descent (SGD) (Bottou, 2010). For a fair comparison, a grid search is conducted to choose the optimal parameter settings, e.g., the dimension of the user/item latent vectors for matrix factorization-based models and the embedding dimension for neural network-based models. The embeddings are initialized with the Xavier scheme (Glorot and Bengio, 2010); the embedding size, batch size and learning rate are all chosen by grid search. The maximum number of epochs is set to 2000, with an early stopping strategy. Moreover, we employ three hidden layers for the neural components of NRT, GraphRec and DeepFM+. Like our DENC method, DeepFM+ uses node2vec to train the social network embeddings; hence, its node2vec embedding size is set the same as in our DENC for a fair comparison.

Unless otherwise specified, the unique hyperparameters of DENC are set as follows: the three trade-off coefficients are tuned by grid search, as are the dimension of the node2vec embedding and the dimension of the inherent factors; their influence is reported in Section 4.4.

Traditional Social network-based Propensity-based Ours
Dataset Metrics PMF NRT SocialMF SoReg SREE GraphRec DeepFM+ CausE D-WMF DENC
Epinions MAE 0.9505 0.9294 0.8722 0.8851 0.8193 0.7309 0.5782 0.5321 0.3710 0.2684
RMSE 1.2169 1.1934 1.1655 1.1775 1.1247 0.9394 0.6728 0.7352 0.6299 0.5826
Ciao MAE 0.8868 0.8444 0.7614 0.7784 0.7286 0.6972 0.3641 0.4209 0.2808 0.2487
RMSE 1.1501 1.1495 1.0151 1.0167 0.9690 0.9021 0.5886 0.8850 0.5822 0.5592
MovieLens-1M MAE 0.8551 0.8959 0.8674 0.9255 0.8408 0.7727 0.5786 0.4683 0.3751 0.2972
RMSE 1.0894 1.1603 1.1161 1.1916 1.0748 0.9582 0.6730 0.8920 0.6387 0.5263
MovieLens-1M MAE 0.8086 0.8801 0.8182 0.8599 0.7737 0.7539 0.5281 0.4221 0.3562 0.2883
RMSE 1.0034 1.1518 1.0382 1.1005 0.9772 0.9454 0.6477 0.8333 0.6152 0.5560
MovieLens-1M MAE 0.7789 0.7771 0.7969 0.8428 0.7657 0.7423 0.3672 0.4042 0.3151 0.2836
RMSE 0.9854 0.9779 1.0115 1.0792 0.9746 0.9344 0.5854 0.8173 0.5962 0.5342
Table 2. Performance comparison: bold numbers are the best results. Strongest baselines are highlighted with underlines.

4.2. Understanding Social Confounder (RQ1)

We first conduct an experiment to understand to what extent the confounding bias caused by social networks is manifested in real-world recommendation datasets. The social network, as a confounder, biases the interactions between users and items. We aim to verify two scenarios: (1) users in the social network interact with more items than users outside the social network; (2) a user-neighbour pair inside the social network has more commonly interacted items than a user pair outside the social network. Intuitively, an unbiased platform should expect users to interact with items broadly, so that interactions are likely to be evenly distributed. Thus, we investigate the social confounder bias by analyzing the interaction statistics for these two scenarios on the Epinions and Ciao datasets.

Figure 5. Scenario (1): the distribution of the number of items interacted with by a user, on (a) Ciao and (b) Epinions. The smooth probability curves visualize how the number of items is distributed.
Figure 6. Scenario (2): the distribution of the number of items commonly interacted with by a user pair, on (a) Ciao and (b) Epinions.

For the first scenario, we construct two user sets of equal size: one within the social network and one outside it. The former is constructed by randomly sampling users connected in the social network, and the latter is randomly sampled from the remaining users. Following these guidelines, we sample 70 users for each set. Figure 5 depicts the distributions of the number of items interacted with by users in the two sets. The smooth curves are continuous distribution estimates produced by kernel density estimation. Apparently, the distribution for users outside the social network is significantly skewed: most of these users interact with few items. For example, on Ciao, more than 90% of them interact with fewer than 50 items. By contrast, most users in the social network tend to interact with items more frequently, as confirmed by their more even distribution. In general, the two distribution curves differ markedly, which reflects that the social network influences the interactions between users and items. In addition, the degree of bias varies across datasets: Epinions is less biased than Ciao.

For the second scenario, based on the same two user sets, we further analyze the number of commonly interacted items per user pair. Particularly, we randomly sample four one-hop neighbours for each user in the social-network set to construct user pairs. (According to the statistics, 90% of users have at least four one-hop neighbours in Ciao and Epinions.) Since users in the other set have no neighbours, for each of them we randomly select another four users from that set to construct four user pairs. Recall that both sets have 70 users, so we have 280 user pairs for each set. Figure 6 presents the distribution of how many items are commonly interacted with by the users in each pair. (For example, if a user and one of her one-hop neighbours have three commonly interacted items, the pair contributes to the bin at 3 on the x-axis of Figure 6.) Figure 6 indicates that most user-neighbour pairs in the social network have fewer than 10 items in common, whereas user pairs outside the social network have almost no items in common, i.e., fewer than one on average. We conclude that social networks encourage users to share more items with their neighbours, compared with users who are not socially connected.
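The two sampling scenarios above can be sketched as follows; the function and variable names are illustrative, not taken from the paper's implementation:

```python
import random

def build_user_sets(social, interactions, n=70, seed=0):
    """Sample n users inside the social network and n users outside it.

    `social` maps each socially connected user to her one-hop neighbours;
    `interactions` maps every user to the set of items she rated.
    """
    rng = random.Random(seed)
    in_net = [u for u in interactions if social.get(u)]
    out_net = [u for u in interactions if not social.get(u)]
    return (rng.sample(in_net, min(n, len(in_net))),
            rng.sample(out_net, min(n, len(out_net))))

def common_items(pairs, interactions):
    """Scenario (2): number of commonly interacted items per user pair."""
    return [len(interactions[u] & interactions[v]) for u, v in pairs]

# Toy example: users a, b are socially connected; c, d are isolated.
social = {"a": ["b"], "b": ["a"]}
interactions = {"a": {1, 2, 3}, "b": {2, 3}, "c": {4}, "d": {5}}
u_in, u_out = build_user_sets(social, interactions, n=2)
print(sorted(u_in), sorted(u_out))               # ['a', 'b'] ['c', 'd']
print(common_items([("a", "b")], interactions))  # [2]
```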

4.3. Performance Comparison (RQ2)

We compare the rating prediction of DENC with nine recommendation baselines on three datasets: Epinions, Ciao and MovieLens-1M. Table 2 reports the comparison, where the confounder in MovieLens-1M is assigned three different settings, i.e., -0.35, 0 and 0.35. Analyzing Table 2, we make the following observations.


  • Overall, our DENC consistently yields the best performance among all methods across the five dataset settings. For instance, DENC improves over the best baseline w.r.t. MAE/RMSE by 10.26%/4.73%, 3.21%/2.30%, and 7.79%/11.24% on Epinions, Ciao and MovieLens-1M (confounder -0.35), respectively. The results indicate the effectiveness of DENC on rating prediction: it adopts a principled causal-inference approach that leverages both the inherent factors and the auxiliary social network information to improve recommendation performance.

  • Among the three kinds of baselines, propensity-based methods serve as the strongest baselines in most cases. This justifies the effectiveness of modeling the missing pattern in rating data via estimated propensity scores, which offer better guidance for identifying the unobserved confounder effect on ratings. However, propensity-based methods still perform worse than our DENC, as they ignore the social network information; exploiting the social network helps alleviate the confounder bias in the rating outcome. The importance of social networks is further verified by the fact that most social network-based methods consistently outperform PMF on all datasets.

  • All baseline methods perform better on Ciao than on Epinions, because Epinions is significantly sparser (rating densities of 0.0140% versus 0.0368%). Nevertheless, DENC still achieves satisfactory performance on Epinions, competitive with its counterparts on Ciao. This demonstrates that the exposure model of DENC has an outstanding capability of identifying the missing pattern in rating prediction, so that biased user-item pairs in Epinions can be captured and their effect alleviated. In addition, the performance of DENC on the three MovieLens-1M settings is stable across different levels of confounder bias, which verifies its robust debiasing capability.

4.4. Ablation Study (RQ3)

In this section, we conduct experiments to evaluate the parameter sensitivity of our DENC method. We focus on two important hyperparameters, which enter the corresponding loss terms: the social network (node2vec) embedding size and the inherent-factor embedding size. Based on the hyperparameter setup in Section 4.1.4, we vary the value of one hyperparameter while keeping the others unchanged.

Figure 7. Our DENC: parameter sensitivity of the two embedding sizes against MAE and RMSE on the Ciao ((a), (b)) and Epinions ((c), (d)) datasets.

Figure 7 lays out the performance of DENC under different parameter settings. On both datasets, DENC is stable over a wide range of the two embedding sizes. Its performance improves as the inherent-factor embedding size increases from roughly 0 to 15 and degrades thereafter; the optimum is reached at an inherent-factor embedding size of approximately 15 and a social embedding size of approximately 45. DENC is less sensitive to the social embedding size than to the inherent-factor size: the MAE/RMSE curves are clearly concave as the inherent-factor size varies from 0 to 50 in Figure 7, whereas they change only gently, with a downward trend, as the social embedding size varies over the same range. This is reasonable, since the inherent-factor size controls the dimension of the disentangled user-item representation attained by the deconfounder model, which influences the essential user-item interaction, while the social embedding size only controls the auxiliary social information.

4.5. Case Study (RQ4)

We first investigate how missing social relations affect the performance of DENC. We randomly mask a percentage of social relations to simulate missing connections in the social network, fixing the social network confounder for the MovieLens-1M dataset, and consider missing percentages of {20%, 50%, 80%}. Note that we do not consider a missing percentage of 100%, i.e., a completely unobserved social network: since the social network is viewed as a proxy variable of the confounder, it must provide at least partially known information. Following this guideline, we first investigate how the debias capability of our DENC method varies under the different missing percentages. Second, we report the ranking performance of DENC under Precision@K and Recall@K to evaluate our model thoroughly.
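The masking protocol can be sketched with an illustrative helper (not the authors' code):

```python
import random

def mask_social_relations(edges, ratio, seed=0):
    """Randomly drop `ratio` of the social edges (e.g. 0.2, 0.5, 0.8)
    and return the surviving edge list."""
    rng = random.Random(seed)
    edges = list(edges)
    keep = len(edges) - int(ratio * len(edges))
    return rng.sample(edges, keep)

# Toy social network: a complete graph on 10 users has 45 edges.
edges = [(u, v) for u in range(10) for v in range(u + 1, 10)]
for r in (0.2, 0.5, 0.8):
    print(r, len(mask_social_relations(edges, r)))
```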

Figure 8. Our DENC: debias performance w.r.t. different missing percentages of social relations, measured by (a) MAE, (b) RMSE, (c) Precision@20, and (d) Recall@20.

Figure 8 illustrates the debias performance w.r.t. different missing percentages of social relations on the three datasets. As shown, missing social relations clearly degrade the debias performance of DENC: all four metrics in Figure 8 consistently worsen as the missing percentage increases from 0% to 80%, which is consistent with common observation. This indicates that the underlying social network plays a significant role in recommendation, as it captures the preference correlations between users and their neighbours.

Figure 9. Performance of DENC in terms of Precision@K and Recall@K under different K on (a) Epinions, (b) Ciao, and (c) MovieLens-1M.

Based on the evaluation under Precision@K and Recall@K, Figure 9 shows that DENC achieves stable Top-K recommendation performance when K (the length of the ranking list) varies from 10 to 40, recommending more relevant items within the top positions as the ranking list grows.
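Precision@K and Recall@K here follow their standard definitions; a minimal per-user sketch (illustrative names):

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision@K and Recall@K for a single user.

    `ranked` is the model's ranked item list, `relevant` the set of
    ground-truth relevant items for that user.
    """
    hits = len(set(ranked[:k]) & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall_at_k([7, 3, 9, 1], {3, 1, 5}, k=2)
print(p, r)  # 0.5 and 1/3: one of the top-2 items is relevant
```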

5. Conclusion and Future Work

In this paper, we have studied the missing-not-at-random problem in recommendation and addressed the confounding bias from a causal perspective. Instead of merely relying on inherent information to account for selection bias, we developed a novel social-network-embedding-based de-bias recommender for unbiased rating prediction by correcting the confounder effect arising from social networks. We evaluated our DENC method on two real-world and one semi-synthetic recommendation datasets, with extensive experiments demonstrating the superiority of DENC over state-of-the-art baselines. In future work, we will explore the effect of different exposure policies on the recommendation system using intervention analysis in causal inference. Another promising direction is to explore the selection bias arising from other confounding factors, e.g., user demographic features: a user's nationality affects which restaurants he is likely to visit (i.e., exposure) and, meanwhile, how he will rate them (i.e., outcome).


  • S. Bonner and F. Vasile (2018) Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112. Cited by: 8th item, §4.1.3.
  • L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Cited by: §3.5.2, §4.1.4.
  • P. Cui, X. Wang, J. Pei, and W. Zhu (2018) A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering 31 (5), pp. 833–852. Cited by: §A.4.
  • W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin (2019) Graph neural networks for social recommendation. In The World Wide Web Conference, pp. 417–426. Cited by: 6th item, §4.1.3.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §4.1.4.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §A.4, §3.3.1.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247. Cited by: 7th item, §4.1.3.
  • X. He, H. Zhang, M. Kan, and T. Chua (2016) Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 549–558. Cited by: §1.
  • K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li (2012) Rolx: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1231–1239. Cited by: §A.4, §A.4.
  • J. M. Hernández-Lobato, N. Houlsby, and Z. Ghahramani (2014) Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning, pp. 1512–1520. Cited by: §1, §1, §2.1.1.
  • Y. Hu, Y. Koren, and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. Cited by: §1, §2.1.1.
  • M. Jamali and M. Ester (2010) A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the fourth ACM conference on Recommender systems, pp. 135–142. Cited by: 3rd item, §2.2, §4.1.3.
  • Y. Koren and R. Bell. Advances in collaborative filtering. In Recommender systems handbook, pp. 77–118. Cited by: §1, §2.1.1.
  • Y. Koren (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 426–434. Cited by: §4.1.1.
  • P. Li, Z. Wang, Z. Ren, L. Bing, and W. Lam (2017a) Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 345–354. Cited by: 2nd item, §4.1.3.
  • W. Li, M. Gao, W. Rong, J. Wen, Q. Xiong, R. Jia, and T. Dou (2017b) Social recommendation using euclidean embedding. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 589–595. Cited by: 5th item, §4.1.3.
  • D. Liang, L. Charlin, and D. M. Blei (2016) Causal inference for recommendation. In Causation: Foundation to Application, Workshop at UAI, Cited by: §1, §2.1.1, §2.1.2.
  • D. Lim, J. McAuley, and G. Lanckriet (2015) Top-n recommendation with missing implicit feedback. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 309–312. Cited by: §1.
  • H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King (2011) Recommender systems with social regularization. In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 287–296. Cited by: 4th item, §4.1.3.
  • B. M. Marlin and R. S. Zemel (2009) Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pp. 5–12. Cited by: §1, §2.1.1.
  • A. Mnih and R. R. Salakhutdinov (2008) Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264. Cited by: 1st item, §4.1.3.
  • A. Müller (1997) Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pp. 429–443. Cited by: §3.4.
  • R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang (2008) One-class collaborative filtering. In 2008 Eighth IEEE International Conference on Data Mining, pp. 502–511. Cited by: §3.3.2.
  • J. Pearl (2009) Causality. Cambridge university press. Cited by: §3.3.2, §3.3.2.
  • S. Purushotham, Y. Liu, and C. J. Kuo (2012) Collaborative topic regression with social matrix factorization for recommendation systems. arXiv preprint arXiv:1206.4684. Cited by: §2.2.
  • S. Rendle and L. Schmidt-Thieme (2008) Online-updating regularized kernel matrix factorization models for large-scale recommender systems. In Proceedings of the 2008 ACM conference on Recommender systems, pp. 251–258. Cited by: §2.1.1.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies.. Journal of educational Psychology 66 (5), pp. 688. Cited by: §3.2.
  • Y. Saito (2020) Asymmetric tri-training for debiasing missing-not-at-random explicit feedback. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 309–318. Cited by: §1.
  • T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims (2016) Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352. Cited by: §1, §2.1.1, §2.1.2, §3.5.1.
  • U. Shalit, F. D. Johansson, and D. Sontag (2017) Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pp. 3076–3085. Cited by: §3.3.2, §3.4, §3.4, footnote 1.
  • A. Sportisse, C. Boyer, and J. Josse (2020) Imputation and low-rank estimation with missing not at random data. Statistics and Computing 30 (6), pp. 1629–1643. Cited by: §1.
  • J. Tang, H. Gao, and H. Liu (2012) MTrust: Discerning multi-faceted trust in a connected world. In Proceedings of the fifth ACM international conference on Web search and data mining, pp. 93–102. Cited by: §4.1.2.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) Line: large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pp. 1067–1077. Cited by: 1st item, §A.4.
  • D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. Cited by: 2nd item, §A.4.
  • X. Wang, R. Zhang, Y. Sun, and J. Qi (2019) Doubly robust joint learning for recommendation on data missing not at random. In International Conference on Machine Learning, pp. 6638–6647. Cited by: §1, §2.1.2.
  • Y. Wang, D. Liang, L. Charlin, and D. M. Blei (2018) The deconfounded recommender: a causal inference approach to recommendation. arXiv preprint arXiv:1808.06581. Cited by: 9th item, §4.1.3.
  • B. Yang, Y. Lei, J. Liu, and W. Li (2016) Social collaborative filtering by trust. IEEE transactions on pattern analysis and machine intelligence 39 (8), pp. 1633–1647. Cited by: §2.2.
  • L. Yang, Y. Cui, Y. Xuan, C. Wang, S. Belongie, and D. Estrin (2018) Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 279–287. Cited by: §2.1.2.

Appendix A Appendix

a.1. Datasets

The statistics of the baseline datasets are given in Table 1. In Epinions and Ciao, rating values are integers from 1 (like least) to 5 (like most). Since observed ratings are very sparse (rating density 0.0140% for Epinions and 0.0368% for Ciao), rating prediction on these two datasets is challenging.

In addition, we simulate a semi-synthetic dataset based on MovieLens, a benchmark dataset of user-movie ratings without social network information. For MovieLens-1M, we first construct a social network by placing an edge between each pair of users independently with probability 0.5. Recall that the social network is viewed as the confounder (common cause) that affects both the exposure variables and the ratings. We generate the exposure assignment with the confounder set to one of three levels, -0.35, 0, or 0.35. The exposure and the rating outcome are then simulated as follows: the exposure indicating whether an item is exposed to a user is drawn from a Bernoulli distribution parameterized by the confounder; a nonzero confounder perturbs the original MovieLens rating to produce the semi-synthetic rating, with a parameter controlling the amount of social network confounding; and the rating of a user remains unchanged if s/he is not connected in the social network.
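A rough sketch of this semi-synthetic construction follows; the sigmoid exposure link, the additive rating shift, and the clipping to [1, 5] are illustrative assumptions standing in for the paper's exact equations:

```python
import numpy as np

def simulate(ratings, in_network, conf=0.35, seed=0):
    """Illustrative semi-synthetic generator.

    ratings:    (n_users, n_items) original MovieLens ratings
    in_network: 0/1 vector marking users connected in the synthetic network
    conf:       confounder level, one of {-0.35, 0, 0.35}
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    # Exposure: Bernoulli parameterized by the confounder (sigmoid link assumed).
    p = 1.0 / (1.0 + np.exp(-(0.5 + conf * in_network[:, None])))
    exposure = rng.random((n_users, n_items)) < p
    # Connected users' ratings are shifted by the confounder; others unchanged.
    out = ratings + conf * in_network[:, None]
    out = np.clip(np.rint(out), 1, 5)
    return exposure, out
```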

a.2. Baselines

We compare our DENC against three groups of methods: matrix factorization, social network-based, and propensity-based. For each group, we select representative baselines, detailed as follows.


  • PMF (Mnih and Salakhutdinov, 2008): The method utilizes the user-item rating matrix and models latent factors of users and items with Gaussian distributions;

  • NRT (Li et al., 2017a): A deep-learning method that adopts a multi-layer perceptron to model user-item interactions for rating prediction.

  • SocialMF (Jamali and Ester, 2010): It considers the social information by adding the propagation of social relation into the matrix factorization model.

  • SoReg (Ma et al., 2011): It models social information as regularization terms to constrain the Matrix Factorization framework.

  • SREE (Li et al., 2017b): It embeds users, items, and users' social relations into a Euclidean space.

  • GraphRec (Fan et al., 2019): A state-of-the-art social recommender that models social information with a Graph Neural Network; it organizes user behaviors as a user-item interaction graph.

  • DeepFM+ (Guo et al., 2017): DeepFM is a state-of-the-art recommender that integrates Deep Neural Networks and Factorization Machines (FM). To incorporate social information, we change the output of the FM component in DeepFM+ to a linear combination of the original FM function (Guo et al., 2017) and the pre-trained node2vec user embeddings. We also change the task of DeepFM from click-through-rate (CTR) prediction to rating prediction.

  • CausE (Bonner and Vasile, 2018): It first fits an exposure-variable embedding with Poisson factorization, then integrates the embedding into PMF for rating prediction.

  • D-WMF (Wang et al., 2018): A propensity-based model that uses Poisson factorization to infer latent confounders, then augments Weighted Matrix Factorization to correct for the potential confounding bias.

a.3. Model Variants Configuration

To better understand our DENC method, we further evaluate its key components, namely the Exposure model and the Social network confounder. We remove one component at a time and compare the resulting performance against the intact DENC method. In the following, we denote the two variants as (1) DENC w/o Exposure, which removes the Exposure model, and (2) DENC w/o Confounder, which removes the Social network confounder. Note that we do not evaluate removing the Deconfounder: it models the inherent factors of the user-item information, and removing that information from a recommender inevitably results in poor performance. We record the evaluation results in Table 3 and make the following findings:


  • By comparing DENC with DENC w/o Exposure, we find that the Exposure model is important for capturing missing patterns and thus boosting recommendation performance. Removing it leads to a drastic degradation of MAE/RMSE by 20.41%/24.08% on Epinions and 18.93%/24.34% on Ciao, respectively.

  • We observe that without the Social network confounder, the performance of DENC w/o Confounder deteriorates significantly, with MAE/RMSE degradations of 16.10%/20.50% on Epinions and 13.83%/11.31% on Ciao, respectively.

  • The Exposure model has a greater impact on DENC than the Social network confounder. This is reasonable: the Exposure model simulates the missing patterns, and the Social network confounder can then debias the potential confounding bias under the guidance of those patterns.

Dataset    Model                 MAE     RMSE
Epinions   DENC w/o Exposure     0.4725  0.8234
           DENC w/o Confounder   0.4294  0.7876
           DENC                  0.2684  0.5826
Ciao       DENC w/o Exposure     0.4380  0.8026
           DENC w/o Confounder   0.3870  0.6723
           DENC                  0.2487  0.5592
Table 3. Experimental results of the two ablated variants against the full DENC.

a.4. Investigation on Different Network Embedding Methods

We construct network embedding with node2vec (Grover and Leskovec, 2016) that has the capacity of learning richer representations by adding flexibility in exploring neighborhoods of nodes. Besides, by adjusting the weight of the random walk between breadth-first and depth-first sampling, embeddings generated by node2vec can balance the trade-off between homophily and structural equivalence (Henderson et al., 2012), both of which are essential feature expressions in recommendation systems. The key characteristic of node2vec is its scalability and efficiency as it scales to networks of millions of nodes.
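The biased second-order walk at the core of node2vec can be sketched as follows; this is a simplified, illustrative implementation of the p/q sampling, not the library code:

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, seed=0):
    """One biased random walk; p/q trade breadth-first vs depth-first moves.

    Returning to the previous node is weighted by 1/p, moving outward by 1/q,
    and staying within the previous node's neighbourhood by 1.
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = [(1.0 / p) if x == prev
                   else (1.0 if x in adj[prev] else 1.0 / q)
                   for x in nbrs]
        walk.append(rng.choices(nbrs, weights=weights)[0])
    return walk

# Toy graph; embeddings would then be trained on such walks via skip-gram.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(node2vec_walk(adj, 0, 6, p=0.5, q=2.0))
```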

For comparison, we further investigate how alternative network embedding methods, namely LINE (Tang et al., 2015) and SDNE (Wang et al., 2016), impact the performance of DENC.


  • LINE (Tang et al., 2015) preserves both first-order and second-order proximities; it suits arbitrary types of information networks and easily scales to millions of nodes.

  • SDNE (Wang et al., 2016) is a deep-learning-based network embedding method; like LINE, it jointly exploits first-order and second-order proximity to preserve the network structure.

We train the three embedding methods with an embedding size of 10, while the batch size and number of epochs are set to 1024 and 50, respectively. The experimental results are given in Table 4.

Dataset    Embedding  MAE     RMSE    Precision@20  Recall@20
Epinions   node2vec   0.2684  0.5826  0.2832        0.2501
           LINE       0.4241  0.6307  0.1736        0.1534
           SDNE       0.4021  0.6137  0.1928        0.1837
Ciao       node2vec   0.2487  0.5592  0.2703        0.2212
           LINE       0.5218  0.7605  0.1504        0.1209
           SDNE       0.4538  0.6274  0.2082        0.1594
Table 4. Experimental results of DENC under node2vec, LINE, and SDNE embeddings.

The results show that, under the same experimental settings, DENC performs worse with embeddings trained by LINE or SDNE than with node2vec on both datasets. Although LINE considers higher-order proximity, unlike node2vec it cannot balance the representation between homophily and structural equivalence (Henderson et al., 2012), under which connectivity information and network-structure information are captured jointly. The results thus show that DENC benefits from a balanced representation that learns both kinds of information. SDNE builds a deep-learning representation on higher-order proximity, but compared with node2vec it suffers from higher time complexity, caused mainly by its deep architecture: the input vector dimension of its auto-encoder can expand to millions (Cui et al., 2018). We therefore consider it reasonable that DENC with SDNE embeddings cannot outperform the node2vec counterpart under the same number of training epochs, since SDNE requires more iterations to obtain a finer representation.