Online Evaluation Methods for the Causal Effect of Recommendations

by   Masahiro Sato, et al.

Evaluating the causal effect of recommendations is an important objective because the causal effect on user interactions can directly lead to an increase in sales and user engagement. To select an optimal recommendation model, it is common to conduct A/B testing to compare model performance. However, A/B testing of causal effects requires a large number of users, making such experiments costly and risky. We therefore propose the first interleaving methods that can efficiently compare recommendation models in terms of causal effects. In contrast to conventional interleaving methods, we measure the outcomes of both items on an interleaved list and items not on the interleaved list, since the causal effect is the difference between outcomes with and without recommendations. To ensure that the evaluations are unbiased, we either select items with equal probability or weight the outcomes using inverse propensity scores. We then verify the unbiasedness and efficiency of online evaluation methods through simulated online experiments. The results indicate that our proposed methods are unbiased and that they have superior efficiency to A/B testing.





1. Introduction

A recommendation is a treatment that can affect user behavior. An increase in user actions, such as purchases or views, caused by the recommendation is the treatment effect (also called the causal effect). Because this leads to improved sales or user engagement, the causal effect of recommendations is important for businesses. While most recommendation methods aim for accurate predictions of user behaviors, there may be a discrepancy between the accuracy and the causal effect of recommendations (Sato et al., 2019). Several recent works have thus proposed recommendation methods to rank items by the causal effect of recommendations (Bodapati, 2008; Sato et al., 2016, 2019, 2020b, 2021).

Online experiments are commonly conducted to compare model performance and select the best recommendation model. However, evaluating the causal effect is not straightforward; we cannot naively compare the outcomes of recommended items because the causal effect is the difference between the potential outcomes with and without the treatment (Rubin, 1974; Imbens and Rubin, 2015). A/B testing that compares the total user actions on all items, not only recommended items, can reveal the difference in the average causal effect (see Section 3.2). Nevertheless, it suffers from large fluctuations due to the variability in natural user behaviors for non-recommended items: some users tend to purchase more items than others. A large number of users is required to compensate for such fluctuations, making online experiments costly and risky.

In this paper, we propose efficient online evaluation methods for the causal effect of recommendations based on interleaving. Interleaving generates a list from the lists ranked by the two models to be compared (Chapelle et al., 2012). Whereas previous interleaving methods only measure the outcomes of items in the intersection of the original and interleaved lists, our proposed methods also measure the outcomes of items in the original lists but not in the interleaved list. We propose an interleaving method that selects items with equal probability for unbiased evaluation. With unequal selection probabilities, the evaluation might be biased due to confounding (Hernán and Robins, 2020) between recommendation and potential outcomes, leading to inaccurate judgments of the recommendation models. We remove the possible bias by properly weighting the outcomes based on the inverse propensity score (IPS) method used in causal inference (Rosenbaum and Rubin, 1983; Lunceford and Davidian, 2004). This enables the use of a more general interleaving framework that only requires non-zero probabilities to be selected for any item in the original lists. As an instance of the framework, we propose a causal balanced interleaving method that balances the number of items chosen from the two compared lists. To verify the unbiasedness and efficiency of the proposed interleaving methods, we simulate online experiments to compare ranking models.

The contributions of this paper are summarized as follows.

  • We propose the first interleaving methods to compare recommendation models in terms of their causal effect.

  • We verify the unbiasedness and efficiency of the proposed methods through simulated online experiments.

2. Related Work

2.1. Interleaving Methods

Interleaving is an online evaluation method for comparing two ranking models by observing user interactions with an interleaved list that is generated from lists ranked by the two models to be compared (Chapelle et al., 2012). Several interleaving methods have been proposed for evaluating information retrieval systems. Balanced interleaving (Joachims, 2002, 2003) generates an interleaved list such that, within any prefix of the interleaved list, the numbers of items taken from the tops of the two compared rankings are the same or differ by at most one. Team draft interleaving (Radlinski et al., 2008) alternately selects items from the compared rankings, analogously to selecting teams for a friendly team-sports match. Probabilistic interleaving (Hofmann et al., 2011) selects items according to probabilities that depend on the item ranks. Optimized interleaving (Radlinski and Craswell, 2013) makes the properties required for interleaving in information retrieval explicit and then generates interleaved lists by solving an optimization problem that fulfills those properties. Interleaving methods have been extended to multileaving, which compares multiple rankings simultaneously (Schuth et al., 2014, 2015). Multileaving has also been applied to the evaluation of a news recommender system (Iizuka et al., 2019). The objective of previous interleaving methods is to evaluate how accurately the rankings reflect queries or user preferences, whereas our goal is to evaluate rankings in terms of the causal effect. To the best of our knowledge, at present there are no interleaving methods for causal effects.

2.2. Recommendation Methods for the Causal Effect

Recommendations can affect users’ opinions (Cosley et al., 2003) and induce users’ actions (Dias et al., 2008; Jannach and Jugovac, 2019). However, users’ actions on recommended items could have occurred even without the recommendations (Sharma et al., 2015). Building recommendation models that target the causal effect is challenging because the ground truth data of causal effects are not observable (Holland, 1986). One approach is to train prediction models for both recommended and non-recommended outcomes and then to rank the items based on the difference between the two predictions (Bodapati, 2008; Sato et al., 2016). Another approach is to optimize models directly for the causal effect. ULRMF and ULBPR (Sato et al., 2019) are respectively pointwise and pairwise optimization methods that use label transformations and training data sampling designed for causal effect optimization. DLCE (Sato et al., 2020b) is an unbiased learning method for the causal effect that uses an IPS-based unbiased learning objective. There are also neighborhood methods for causal effects (Sato et al., 2021) that are based on a matching estimator in causal inference. These prior works on causal effects evaluated methods offline and did not discuss protocols for online evaluation. In this study, we develop online evaluation methods and compare some of the aforementioned recommendation methods in simulated online experiments.

Another line of work in the area of causal recommendation aims for debiasing (Chen et al., 2020). Several methods have been proposed to learn users’ true preferences from biased (missing-not-at-random) feedback data (Schnabel et al., 2016; Saito et al., 2020; Wang et al., 2020; Bonner and Vasile, 2018). (Note that CausE proposed by Bonner and Vasile (Bonner and Vasile, 2018) can be used for causal effect ranking (Sato et al., 2019), although their original work tackles unbiased prediction of the treated outcome and only refers to the prediction of the causal effect.) Wang et al. (Wang et al., 2020) suggested that they want to recommend items that have a low probability of exposure and that would be rated high if exposed. Their approach might be regarded as indirectly targeting the causal effect of recommendations, assuming that recommendations increase exposures. To take this approach, it might also be important to model the influence of recommendations on exposures (Sato et al., 2020a). These methods can be regarded as predicting interactions with recommendations (i.e., the treated outcome $Y^1_{ui}$ defined in the next section). Hence, we can evaluate them using previous interleaving methods.

3. Evaluation Methods for the Causal Effect of Recommendations

3.1. Causal Effect of Recommendations

In this subsection, we define the causal effect of recommendations. Let $\mathcal{U}$ and $\mathcal{I}$ be sets of users and items, respectively. Let $Y_{ui}$ denote the interaction (e.g., purchase or view) of user $u$ with item $i$. User interactions may differ depending on whether the item is recommended or not. We denote the binary indicator for the recommendation (also called the treatment assignment) by $Z_{ui}$. Let $Y^1_{ui}$ and $Y^0_{ui}$ be hypothetical user interactions (also called potential outcomes (Rubin, 1974)) when item $i$ is recommended to $u$ ($Z_{ui}=1$) and when it is not recommended ($Z_{ui}=0$), respectively. The causal effect of recommending item $i$ to user $u$ is defined as the difference between the two potential outcomes, $\tau_{ui} = Y^1_{ui} - Y^0_{ui}$, which takes ternary values, $\tau_{ui} \in \{-1, 0, 1\}$. Using potential outcomes, the observed interaction can be expressed as

$$Y_{ui} = Z_{ui} Y^1_{ui} + (1 - Z_{ui}) Y^0_{ui},$$

i.e., $Y_{ui} = Y^1_{ui}$ if $i$ is recommended ($Z_{ui}=1$) and $Y_{ui} = Y^0_{ui}$ if it is not recommended ($Z_{ui}=0$). Note that $Y^1_{ui}$ and $Y^0_{ui}$ cannot both be observed at a specific time; hence, $\tau_{ui}$ is not directly observable.
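These definitions can be sketched numerically. The following Python snippet (array and variable names are ours, not from the paper) generates synthetic binary potential outcomes and checks that the observed interaction equals $Y^1_{ui}$ on treated pairs and $Y^0_{ui}$ on untreated pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 100, 50

# Hypothetical (potential) outcomes: with and without recommendation.
y1 = rng.integers(0, 2, size=(n_users, n_items))
y0 = rng.integers(0, 2, size=(n_users, n_items))
tau = y1 - y0                    # causal effect, ternary in {-1, 0, 1}

# Treatment assignment: 1 if the item is recommended to the user.
z = rng.integers(0, 2, size=(n_users, n_items))

# Observed interaction: only one potential outcome is realized per pair.
y_obs = z * y1 + (1 - z) * y0
```

Only `y_obs` and `z` are observable in practice; `tau` is computable here solely because the outcomes are synthetic.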

The recommendation model $M$ generates a recommendation list $L^M_u$ for each user $u$. The average causal effect of model $M$ is then defined as

$$\bar{\tau}^M = \frac{1}{|\mathcal{U}| K} \sum_{u \in \mathcal{U}} \sum_{i \in L^M_u} \tau_{ui},$$

where $K$ is the size of the recommendation list. In this work, we evaluate models using the above metric. (This metric is identical to the causal precision@$K$ in (Sato et al., 2020b).) That is, when comparing two models $M_A$ and $M_B$, we regard $M_A$ as superior to $M_B$ when $\bar{\tau}^{M_A} > \bar{\tau}^{M_B}$.

3.2. A/B testing for the Causal Effect

For A/B testing, we randomly select non-overlapping subsets of users $\mathcal{U}_A$ and $\mathcal{U}_B$ (i.e., $\mathcal{U}_A, \mathcal{U}_B \subset \mathcal{U}$ and $\mathcal{U}_A \cap \mathcal{U}_B = \emptyset$) and apply models $M_A$ and $M_B$ to each subset. Let $K$ be the size of the recommendation list, which we assume to be constant. The subset average causal effect is then defined as

$$\bar{\tau}^{M_A}_{\mathcal{U}_A} = \frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in L^{M_A}_u} \tau_{ui},$$

and analogously for $\bar{\tau}^{M_B}_{\mathcal{U}_B}$. This converges to $\bar{\tau}^{M_A}$ as $|\mathcal{U}_A|$ increases.

The typical evaluation metrics for A/B testing are either based on total user interactions (such as sales or user engagement) or only on interactions with recommended lists (such as click-through rates or conversion rates) (Jannach and Jugovac, 2019). Here we show that the former is a valid evaluation for the causal effect. The total user interactions divided by the number of recommendations can be expressed as

$$\frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in \mathcal{I}} Y_{ui} = \frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \Bigl( \sum_{i \in L^{M_A}_u} Y^1_{ui} + \sum_{i \notin L^{M_A}_u} Y^0_{ui} \Bigr) = \bar{\tau}^{M_A}_{\mathcal{U}_A} + \frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in \mathcal{I}} Y^0_{ui}.$$

Because the rightmost term in the final equation does not depend on the model, we can compare $\bar{\tau}^{M_A}_{\mathcal{U}_A}$ and $\bar{\tau}^{M_B}_{\mathcal{U}_B}$ by comparing the total interactions of the two groups. On the other hand, the average interactions with the recommended lists can be expressed as

$$\frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in L^{M_A}_u} Y_{ui} = \frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in L^{M_A}_u} Y^1_{ui},$$

which reflects only the treated outcomes. Hence, the evaluation based only on interactions with recommended lists is not a valid test for the causal effect.
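The contrast between the two metrics can be made concrete with a small sketch (function and variable names are ours). Under the simulation convention that recommended items realize $Y^1$ and all other items realize $Y^0$, the total-interaction metric decomposes into the list's average causal effect plus a model-independent baseline term:

```python
import numpy as np

def ab_metrics(y1, y0, rec_mask, K):
    """AB-total and AB-list metrics for one user group.

    y1, y0   : potential-outcome matrices (users x items)
    rec_mask : 1 where the item is on the user's recommended list
    K        : recommendation list size
    """
    n_users = y1.shape[0]
    y_obs = rec_mask * y1 + (1 - rec_mask) * y0
    ab_total = y_obs.sum() / (n_users * K)              # all interactions
    ab_list = (y_obs * rec_mask).sum() / (n_users * K)  # list items only
    return ab_total, ab_list

rng = np.random.default_rng(1)
y1 = rng.integers(0, 2, size=(200, 30)).astype(float)
y0 = rng.integers(0, 2, size=(200, 30)).astype(float)
K = 10
# Recommend the first K items to every user (a stand-in for a model's lists).
rec = np.zeros_like(y1)
rec[:, :K] = 1.0

ab_total, ab_list = ab_metrics(y1, y0, rec, K)
tau_list = ((y1 - y0) * rec).sum() / (200 * K)  # subset average causal effect
baseline = y0.sum() / (200 * K)                 # model-independent term
```

Here `ab_total` equals `tau_list + baseline` exactly, while `ab_list` reflects only the treated outcomes, illustrating why the latter cannot rank models by causal effect.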

Although A/B testing with Eq. (3.2) can be used for unbiased model comparisons, it may have large variance due to the variability in natural user behaviors (i.e., the potential outcomes under no recommendations, $Y^0_{ui}$). If users in $\mathcal{U}_A$ tend to purchase more items than those in $\mathcal{U}_B$, the term $\frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in \mathcal{I}} Y^0_{ui}$ becomes larger than its counterpart for $\mathcal{U}_B$, thereby altering the comparison in Eq. (3.2). To minimize such discrepancies, a sufficiently large number of users needs to be recruited for A/B testing. We thus introduce more efficient evaluation methods in the next subsection.

3.3. Interleaving for the Causal Effect

In this subsection, we propose interleaving methods for the online evaluation of the causal effects of recommendations. Previous interleaving methods measure outcomes only on the interleaved lists: they capture only $Y^1_{ui}$ and lack information on $Y^0_{ui}$. Further, if the item selection for the interleaved list is not randomized, the naive estimate from the observed outcomes might be biased due to confounding between recommendations and potential outcomes. We need to remedy this bias for valid comparison.

Here we describe the problem setting of interleaving for the causal effect. For each user $u$, we construct the interleaved list $I_u$ from the compared lists $L^A_u$ and $L^B_u$. We observe outcomes $Y_{ui}$ for all items $i \in L^A_u \cup L^B_u$. Note that $Y_{ui} = Y^1_{ui}$ if item $i$ is in the interleaved list ($i \in I_u$ or, equivalently, $Z_{ui} = 1$) and $Y_{ui} = Y^0_{ui}$ if it is not in the list ($i \notin I_u$ or, equivalently, $Z_{ui} = 0$). We want to compare the average causal effects of lists $L^A$ and $L^B$:

$$\bar{\tau}^A = \frac{1}{|\mathcal{U}| K} \sum_{u \in \mathcal{U}} \sum_{i \in L^A_u} \tau_{ui}, \qquad \bar{\tau}^B = \frac{1}{|\mathcal{U}| K} \sum_{u \in \mathcal{U}} \sum_{i \in L^B_u} \tau_{ui}.$$

We need to estimate the above values from observed outcomes because we cannot directly observe $\tau_{ui}$.

If the items in $L^A_u$ and $L^B_u$ are randomly assigned to the interleaved list independent of the potential outcomes, that is, $(Y^1_{ui}, Y^0_{ui}) \perp Z_{ui}$, the case can be regarded as a randomized controlled trial (RCT) (Rubin, 1974; Imbens and Rubin, 2015). (For our interleaving methods, the independence is required only for the items in the union of $L^A_u$ and $L^B_u$.) We can then simply estimate $\bar{\tau}^A$ as the difference between the average outcomes of the items of $L^A_u$ on and not on the interleaved list:

$$\hat{\tau}^A = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \left( \frac{\sum_{i \in L^A_u} Z_{ui} Y_{ui}}{\sum_{i \in L^A_u} Z_{ui}} - \frac{\sum_{i \in L^A_u} (1 - Z_{ui}) Y_{ui}}{\sum_{i \in L^A_u} (1 - Z_{ui})} \right),$$

and analogously for $\hat{\tau}^B$. One way to realize such a randomized assignment is to select the $K$ items of the interleaved list from $L^A_u \cup L^B_u$ with equal probability, so that every item is included with probability $K / |L^A_u \cup L^B_u|$. We call this method equal probability interleaving (EPI).
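A minimal sketch of an equal-probability selection (the function name is ours): the interleaved list is drawn uniformly without replacement from the union of the compared lists, so every item shares the same inclusion propensity.

```python
import random

def equal_probability_interleave(list_a, list_b, k, rng=random):
    """Sample k items uniformly (without replacement) from the union of the
    two compared lists; each item is included with probability k / |union|."""
    union = list(dict.fromkeys(list_a + list_b))  # de-duplicate, keep order
    if k > len(union):
        raise ValueError("k must not exceed the size of the union")
    return rng.sample(union, k)
```

For example, with `list_a = [1, 2, 3]`, `list_b = [3, 4, 5]`, and `k = 3`, every item in the union has inclusion propensity 3/5.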

The independence requirement heavily restricts the potential design space of interleaving methods. We thus derive estimates that are applicable to more general cases. Denote the probability (also called the propensity) of item $i$ being included in the interleaved list for user $u$ by $e_{ui} = P(Z_{ui} = 1 \mid x_{ui})$, where $x_{ui}$ are covariates. We assume that 1) the covariates $x_{ui}$ contain all confounders of $Z_{ui}$ and $(Y^1_{ui}, Y^0_{ui})$, and 2) the treatment assignment is not deterministic ($0 < e_{ui} < 1$ for all $i \in L^A_u \cup L^B_u$). (Taken together, these two assumptions are called strongly ignorable treatment assignment (Rosenbaum and Rubin, 1983).) Assumption 1 is equivalent to conditional independence: $(Y^1_{ui}, Y^0_{ui}) \perp Z_{ui} \mid x_{ui}$. When we design an interleaving method, we know the covariates that affect $Z_{ui}$, so Assumption 1 can always be satisfied. (Confounders are covariates that affect both $Z_{ui}$ and the potential outcomes, and they are a subset of the covariates that affect $Z_{ui}$; hence, including the latter in $x_{ui}$ is a sufficient condition for Assumption 1.) Therefore, the only restriction for interleaving methods is Assumption 2 (also called positivity).

Under these assumptions, we can construct an unbiased estimator using IPS weighting (Lunceford and Davidian, 2004):

$$\hat{\tau}^A_{\mathrm{IPS}} = \frac{1}{|\mathcal{U}| K} \sum_{u \in \mathcal{U}} \sum_{i \in L^A_u} \left( \frac{Z_{ui} Y_{ui}}{e_{ui}} - \frac{(1 - Z_{ui}) Y_{ui}}{1 - e_{ui}} \right),$$

and analogously for $\hat{\tau}^B_{\mathrm{IPS}}$. This estimator is unbiased since

$$\mathbb{E}\left[ \frac{Z_{ui} Y_{ui}}{e_{ui}} - \frac{(1 - Z_{ui}) Y_{ui}}{1 - e_{ui}} \,\middle|\, x_{ui} \right] = \frac{e_{ui}\, \mathbb{E}[Y^1_{ui} \mid x_{ui}]}{e_{ui}} - \frac{(1 - e_{ui})\, \mathbb{E}[Y^0_{ui} \mid x_{ui}]}{1 - e_{ui}} = \mathbb{E}[\tau_{ui} \mid x_{ui}],$$

and taking the expectation over users and items then yields $\mathbb{E}[\hat{\tau}^A_{\mathrm{IPS}}] = \bar{\tau}^A$.
We propose a general framework for interleaving as follows.

  1. Construct interleaved lists using an interleaving method that satisfies positivity (Assumption 2).

  2. Conduct online experiments and obtain the outcomes $Y_{ui}$ for all items $i \in L^A_u \cup L^B_u$.

  3. Estimate $\bar{\tau}^A$ and $\bar{\tau}^B$ by Eq. (8) and compare them.
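The per-list IPS estimate used in step 3 can be sketched as follows (a minimal implementation with our own naming; `e` holds the per-item propensities determined by the interleaving method of step 1):

```python
import numpy as np

def ips_list_effect(items, y, z, e, K):
    """IPS estimate of the average causal effect of one user's ranked list.

    items   : indices of the list's items into y, z, e
    y, z, e : observed outcome, treatment indicator, and propensity for
              each item in the union of the compared lists
    K       : list size
    """
    y, z, e = (np.asarray(a, dtype=float)[items] for a in (y, z, e))
    # Treated items are up-weighted by 1/e, untreated items by 1/(1 - e).
    return (z * y / e - (1 - z) * y / (1 - e)).sum() / K
```

Averaging this quantity over users, and over repeated randomized assignments, recovers the list's average causal effect when the propensities are correct.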

Input: Compared lists $L^A$ and $L^B$
Output: Interleaved list $I$ of size $K$ ($K = |L^A| = |L^B|$)
1   $I \leftarrow \emptyset$   // initialize interleaved list
2   $t \leftarrow \mathrm{random}(\{A, B\})$   // randomly select whether to start from A or B
3   while $|I| < K$ do
4       if $t = A$ then
5           $i \leftarrow \mathrm{random}(L^A \setminus I)$   // randomly choose one item from $L^A$ not yet in $I$
6           $t \leftarrow B$
7       else
8           $i \leftarrow \mathrm{random}(L^B \setminus I)$   // randomly choose one item from $L^B$ not yet in $I$
9           $t \leftarrow A$
10      $I \leftarrow I \cup \{i\}$
Algorithm 1 Causal Balanced Interleaving (CBI).

As an example of a valid interleaving method that satisfies positivity, we propose causal balanced interleaving (CBI), the pseudo-code for which is shown in Algorithm 1. CBI alternately selects items from the two lists so as to balance the number of items chosen from each. The item choice in each round is not deterministic, in order to satisfy the positivity required for causal effect estimates. The propensity depends on whether an item is in the intersection $L^A_u \cap L^B_u$: an item included in both lists has a greater probability of being chosen. The propensity also depends on the cardinality of the union of the compared lists, $|L^A_u \cup L^B_u|$, because a smaller cardinality implies that each item has a greater chance of being selected. The possible values of the covariates are limited: membership in the intersection is binary, and $K \leq |L^A_u \cup L^B_u| \leq 2K$. Hence, we can easily compute the propensity numerically by repeating Algorithm 1 a sufficient number of times and recording the inclusion frequency for each combination of covariates.
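The numerical propensity computation described above can be sketched as follows: a direct Python transcription of Algorithm 1 (function names are ours) plus a Monte Carlo estimate of each item's inclusion propensity.

```python
import random
from collections import Counter

def cbi(list_a, list_b, k, rng=random):
    """Causal balanced interleaving: alternate between the two lists,
    each turn drawing a random not-yet-selected item from the active list."""
    interleaved = []
    turn = rng.choice("AB")  # randomly choose the starting list
    while len(interleaved) < k:
        src = list_a if turn == "A" else list_b
        interleaved.append(rng.choice([i for i in src if i not in interleaved]))
        turn = "B" if turn == "A" else "A"
    return interleaved

def mc_propensities(list_a, list_b, k, n_runs=20000, rng=random):
    """Estimate e_ui = P(item in interleaved list) by repeated simulation."""
    counts = Counter()
    for _ in range(n_runs):
        counts.update(cbi(list_a, list_b, k, rng))
    return {i: counts[i] / n_runs for i in set(list_a) | set(list_b)}
```

Running `mc_propensities([1, 2, 3], [3, 4, 5], 3)` illustrates the properties noted above: item 3, which is in both lists, receives a markedly higher propensity than items appearing in only one list, and all propensities are strictly between 0 and 1.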

4. Experiments

4.1. Experimental Setup

We experimented with the following online evaluation methods. For reproducibility, the code is available at

  • AB-total: A/B testing evaluated by the total user interactions, as expressed in Eq. (3.2).

  • AB-list: A/B testing evaluated by user interactions only with items on the recommended list, as in Eq. (3.2).

  • EPI-RCT: Interleaving that selects items from $L^A_u \cup L^B_u$ with equal probability and evaluation using Eq. (7).

  • CBI-RCT: Interleaving by Algorithm 1 and evaluation using Eq. (7), that is, no bias correction by IPS.

  • CBI-IPS: Interleaving by Algorithm 1 and evaluation using Eq. (8).

Through the experiments, we aim to answer the following research questions: RQ1) Which methods produce valid (unbiased) estimates of the true differences in average causal effects (Section 4.2.1)? RQ2) Are the proposed interleaving methods more efficient (i.e., do they require fewer experimental users) than A/B testing (Section 4.2.2)? We first prepared semi-synthetic datasets that contain both potential outcomes $Y^1_{ui}$ and $Y^0_{ui}$ for all user-item pairs. Because we observe $Y^1_{ui}$ if $Z_{ui} = 1$ and $Y^0_{ui}$ if $Z_{ui} = 0$, both potential outcomes are necessary to simulate user outcomes under various ranking models and online evaluation methods. Following the procedure described in (Sato et al., 2021), we generated two datasets: one based on the Dunnhumby dataset and the other based on the MovieLens-1M (ML-1M) dataset (Harper and Konstan, 2015). The details and rationale of the ML-1M dataset are described in Section 5.1 of (Sato et al., 2021), and those of the Dunnhumby dataset in Section 5.1.1 of (Sato et al., 2020b). Each dataset comprises independently generated training and testing data. The testing data were used to simulate online evaluation, and the training data were used to train the following models (with hyper-parameters for CP@10, described in the ancillary files): the causality-aware user-based neighborhood methods (CUBN) with outcome similarity (-O) and treatment similarity (-T) (Sato et al., 2021), the uplift-based pointwise and pairwise learning methods (ULRMF and ULBPR) (Sato et al., 2019), the Bayesian personalized ranking method (BPR) (Rendle et al., 2009), and the user-based neighborhood method (UBN) (Ning et al., 2015). We compared two models among CUBN-T, ULRMF, and BPR on the Dunnhumby data and two models among CUBN-O, ULBPR, and UBN on the ML-1M data; we intended to compare models of different families, i.e., one of {CUBN-T, CUBN-O} with one of {ULBPR, ULRMF}. The average causal effect and the average treated outcomes of the trained models are listed in Table 1. The superior models in terms of the average causal effect do not necessarily have higher average treated outcomes. That is, we may mistakenly select a poor model in terms of the causal effect if we only evaluate the outcomes of recommended items.

                Dunnhumby-Original              MovieLens-1M
                CUBN-T   ULRMF    BPR           CUBN-O   ULBPR    UBN
$\bar{\tau}$    0.0507   0.0347   0.0295        0.332    0.280    -0.186
$\bar{Y}^1$     0.1359   0.1396   0.1869        0.341    0.285    0.308
Table 1. Averages of causal effect ($\bar{\tau}$) and potential outcomes under treatment ($\bar{Y}^1$) with recommendation lists of size $K = 10$.

Our protocol for simulating online experiments is the following. First, we randomly select a subset of users and generate lists using the compared models. For the A/B testing methods (AB-total, AB-list), we further split the subset into two groups, $\mathcal{U}_A$ and $\mathcal{U}_B$, and the lists of $M_A$ and $M_B$ are recommended to each group, respectively. For the interleaving methods (EPI-RCT, CBI-RCT, CBI-IPS), we generate interleaved recommendation lists using EPI or CBI. In the simulation, recommendation means that $Z_{ui}$ is set to $1$, and user outcomes are observed by calculating $Y_{ui} = Z_{ui} Y^1_{ui} + (1 - Z_{ui}) Y^0_{ui}$ with the potential outcomes $Y^1_{ui}$ and $Y^0_{ui}$. Using the observed outcomes, we estimate the difference in the average causal effects of the compared models, $\bar{\tau}^A - \bar{\tau}^B$. We repeated the above protocol 10,000 times and recorded the estimated differences for each online evaluation method. The size of the recommendation list $K$ was set to 10.
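A single interleaving run of this protocol can be sketched in simplified form as follows (our own naming, using an equal-probability interleaved list and the IPS form of the estimate with its constant propensity):

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_epi_run(list_a, list_b, y1, y0, k):
    """One simulated user: build an equal-probability interleaved list,
    observe outcomes, and return the per-user estimate of tau_A - tau_B."""
    union = np.array(sorted(set(list_a) | set(list_b)))
    e = k / len(union)                                 # constant propensity
    chosen = rng.choice(union, size=k, replace=False)  # interleaved list
    z = np.isin(union, chosen).astype(float)           # treatment indicator
    y = z * y1[union] + (1 - z) * y0[union]            # observed outcomes
    ips = z * y / e - (1 - z) * y / (1 - e)            # per-item IPS terms
    pos = {item: j for j, item in enumerate(union)}
    tau_a = sum(ips[pos[i]] for i in list_a) / k
    tau_b = sum(ips[pos[i]] for i in list_b) / k
    return tau_a - tau_b
```

Averaging the return value over many simulated runs should recover the true difference in the average causal effects of the two lists.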

4.2. Results and Discussion

4.2.1. Validity of the evaluation methods

We evaluated the validity of the online evaluation methods using random subsets of 1,000 users. The means and standard deviations of the estimated differences are shown in Table 2. The means obtained by EPI-RCT and CBI-IPS are close to the true differences. The means obtained by AB-total are also close to the true values for Dunnhumby but deviate slightly for ML-1M. AB-list often yields estimates that differ substantially from the true values but are similar to the differences in treated outcomes, $\bar{Y}^1$, shown in Table 1. This is expected because AB-list evaluates $\bar{Y}^1$, not $\bar{\tau}$, as expressed in Eq. (3.2). Further, the CBI-RCT estimates also deviate from the true differences in most cases; this is due to the bias induced by the uneven recommendation probabilities in interleaving. (In the comparisons of CUBN-O & UBN and ULBPR & UBN, the results of CBI-RCT and CBI-IPS are identical: there was no overlap between $L^A_u$ and $L^B_u$ in these comparisons, so the propensity was constant, IPS was unnecessary, and the two methods were equivalent.) Conversely, CBI-IPS successfully removes the bias and produces estimates centered around the true values.

          Dunnhumby-Original                                         MovieLens-1M
          CUBN-T & BPR      CUBN-T & ULRMF    ULRMF & BPR            CUBN-O & UBN      CUBN-O & ULBPR    ULBPR & UBN
Truth      0.0212            0.0160            0.0052                 0.5177            0.0512            0.4665
AB-total   0.0210 ± 0.0399   0.0159 ± 0.0399   0.0051 ± 0.0397       0.5301 ± 1.2048   0.0635 ± 1.2102   0.4789 ± 1.2052
AB-list   -0.0510 ± 0.0071  -0.0037 ± 0.0065  -0.0471 ± 0.0073       0.0325 ± 0.0104   0.0550 ± 0.0104  -0.0226 ± 0.0100
EPI-RCT    0.0212 ± 0.0069   0.0159 ± 0.0075   0.0053 ± 0.0076       0.5178 ± 0.0137   0.0512 ± 0.0083   0.4666 ± 0.0135
CBI-RCT    0.0429 ± 0.0067   0.0192 ± 0.0067   0.0188 ± 0.0076       0.5179 ± 0.0126   0.0444 ± 0.0066   0.4667 ± 0.0126
CBI-IPS    0.0213 ± 0.0063   0.0160 ± 0.0066   0.0051 ± 0.0070       0.5179 ± 0.0126   0.0512 ± 0.0075   0.4667 ± 0.0126
Table 2. Estimated differences between the causal effects of the compared models (mean ± standard deviation over 10,000 simulated runs). The results highlighted in bold indicate that the true values are within the 95% confidence intervals of the mean estimates.

4.2.2. Efficiency of the interleaving methods

We compared the efficiency of AB-total, EPI-RCT, and CBI-IPS, all of which were shown to be valid in the previous section. We simulated user subsets of various sizes in {10, 14, 20, 30, 50, 70, 100, 140, 200, 300, 500, 700, 1000, 1400, 2000} and evaluated the ratio of false judgments (i.e., cases where the sign of the estimated difference is the opposite of the true sign). Figure 1 shows the ratio of false judgments as a function of the number of users. As the number of users increases, the false ratios of CBI-IPS and EPI-RCT decrease more rapidly than that of AB-total. For the Dunnhumby dataset, AB-total requires around 30 times more users than CBI-IPS and EPI-RCT to achieve the same false ratio. For the ML-1M dataset, AB-total did not reach the same false ratio within the experimental range of subset sizes. These results demonstrate the superior efficiency of the proposed interleaving methods. Furthermore, CBI-IPS tends to be slightly more efficient than EPI-RCT, as expected from the smaller standard deviations in Table 2. This is probably because CBI balances the number of items selected from the two compared lists.

(a) CUBN-T & BPR in Dunnhumby.
(b) CUBN-T & ULRMF in Dunnhumby.
(c) ULRMF & BPR in Dunnhumby.
(d) CUBN-O & UBN in ML-1M.
(e) CUBN-O & ULBPR in ML-1M.
(f) ULBPR & UBN in ML-1M.
Figure 1. Dependence on the number of users.

5. Conclusions

In this paper, we proposed the first interleaving methods for comparing recommender models in terms of causal effects. To realize unbiased model comparisons, our methods either select items with equal probabilities or weight the outcomes using IPS. We simulated online experiments and verified that our interleaving methods and an A/B testing method are unbiased and that our interleaving methods are substantially more efficient than the A/B testing method. In the future, we plan to extend our methods to multileaving. Online experimentation in real recommendation services will also be important future work.


  • A. V. Bodapati (2008) Recommendation systems with purchase data. Journal of marketing research 45 (1), pp. 77–93. Cited by: §1, §2.2.
  • S. Bonner and F. Vasile (2018) Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, New York, NY, USA, pp. 104–112. External Links: ISBN 9781450359016, Link, Document Cited by: §2.2, footnote 1.
  • O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue (2012) Large-scale validation and analysis of interleaved search evaluation. ACM Trans. Inf. Syst. 30 (1). External Links: ISSN 1046-8188, Link, Document Cited by: §1, §2.1.
  • J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, and X. He (2020) Bias and debias in recommender system: a survey and future directions. arXiv preprint arXiv:2010.03240. Cited by: §2.2.
  • D. Cosley, S. K. Lam, I. Albert, J. A. Konstan, and J. Riedl (2003) Is seeing believing? how recommender system interfaces affect users’ opinions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’03, New York, NY, USA, pp. 585–592. External Links: ISBN 1581136307, Link, Document Cited by: §2.2.
  • M. B. Dias, D. Locher, M. Li, W. El-Deredy, and P. J.G. Lisboa (2008) The value of personalised recommender systems to e-business: a case study. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys ’08, New York, NY, USA, pp. 291–294. External Links: ISBN 9781605580937, Link, Document Cited by: §2.2.
  • F. M. Harper and J. A. Konstan (2015) The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4). External Links: ISSN 2160-6455, Link, Document Cited by: §4.1.
  • M. Hernán and J. Robins (2020) Causal inference: what if. Boca Raton: Chapman & Hill/CRC. Cited by: §1.
  • K. Hofmann, S. Whiteson, and M. de Rijke (2011) A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, New York, NY, USA, pp. 249–258. External Links: ISBN 9781450307178, Link, Document Cited by: §2.1.
  • P. W. Holland (1986) Statistics and causal inference. Journal of the American statistical Association 81 (396), pp. 945–960. Cited by: §2.2.
  • K. Iizuka, T. Yoneda, and Y. Seki (2019) Greedy optimized multileaving for personalization. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, New York, NY, USA, pp. 413–417. External Links: ISBN 9781450362436, Link, Document Cited by: §2.1.
  • G. W. Imbens and D. B. Rubin (2015) Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press, New York, NY, USA. External Links: ISBN 0521885884 Cited by: §1, §3.3.
  • D. Jannach and M. Jugovac (2019) Measuring the business value of recommender systems. ACM Trans. Manage. Inf. Syst. 10 (4). External Links: ISSN 2158-656X, Link, Document Cited by: §2.2, §3.2.
  • T. Joachims (2002) Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, New York, NY, USA, pp. 133–142. External Links: ISBN 158113567X, Link, Document Cited by: §2.1.
  • T. Joachims (2003) Evaluating retrieval performance using clickthrough data. In Text Mining, Theoretical Aspects and Applications, J. Franke, G. Nakhaeizadeh, and I. Renz (Eds.), pp. 79–96. Cited by: §2.1.
  • J. K. Lunceford and M. Davidian (2004) Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine 23 (19), pp. 2937–2960. Cited by: §1, §3.3.
  • X. Ning, C. Desrosiers, and G. Karypis (2015) A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook, F. Ricci, L. Rokach, and B. Shapira (Eds.), pp. 37–76. Cited by: §4.1.
  • F. Radlinski and N. Craswell (2013) Optimized interleaving for online retrieval evaluation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, New York, NY, USA, pp. 245–254. External Links: ISBN 9781450318693, Link, Document Cited by: §2.1.
  • F. Radlinski, M. Kurup, and T. Joachims (2008) How does clickthrough data reflect retrieval quality?. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, New York, NY, USA, pp. 43–52. External Links: ISBN 9781595939913, Link, Document Cited by: §2.1.
  • S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, Arlington, Virginia, USA, pp. 452–461. External Links: ISBN 9780974903958 Cited by: §4.1.
  • P. R. Rosenbaum and D. B. Rubin (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55. Cited by: §1, footnote 4.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies.. Journal of educational Psychology 66 (5), pp. 688. Cited by: §1, §3.1, §3.3.
  • Y. Saito, S. Yaginuma, Y. Nishino, H. Sakata, and K. Nakata (2020) Unbiased recommender learning from missing-not-at-random implicit feedback. In Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM ’20, New York, NY, USA, pp. 501–509. External Links: ISBN 9781450368223, Link, Document Cited by: §2.2.
  • M. Sato, H. Izumo, and T. Sonoda (2016) Modeling individual users’ responsiveness to maximize recommendation impact. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, UMAP ’16, New York, NY, USA, pp. 259–267. External Links: ISBN 978-1-4503-4368-8, Link, Document Cited by: §1, §2.2.
  • M. Sato, J. Singh, S. Takemori, T. Sonoda, Q. Zhang, and T. Ohkuma (2019) Uplift-based evaluation and optimization of recommenders. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, New York, NY, USA, pp. 296–304. External Links: ISBN 9781450362436, Link, Document Cited by: §1, §2.2, §4.1, footnote 1.
  • M. Sato, J. Singh, S. Takemori, T. Sonoda, Q. Zhang, and T. Ohkuma (2020a) Modeling user exposure with recommendation influence. In Proceedings of the 35th Annual ACM Symposium on Applied Computing, SAC ’20, New York, NY, USA, pp. 1461–1464. External Links: ISBN 9781450368667, Link, Document Cited by: footnote 1.
  • M. Sato, S. Takemori, J. Singh, and T. Ohkuma (2020b) Unbiased learning for the causal effect of recommendation. In Fourteenth ACM Conference on Recommender Systems, RecSys ’20, New York, NY, USA, pp. 378–387. External Links: ISBN 9781450375832, Link, Document Cited by: §1, §2.2, §4.1, footnote 2.
  • M. Sato, S. Takemori, J. Singh, and Q. Zhang (2021) Causality-aware neighborhood methods for recommender systems. pp. 603–618. External Links: Link, Document, ISBN 978-3-030-72113-8 Cited by: §1, §2.2, §4.1.
  • T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims (2016) Recommendations as treatments: debiasing learning and evaluation. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1670–1679. Cited by: §2.2.
  • A. Schuth, R. Bruintjes, F. Büttner, J. van Doorn, C. Groenland, H. Oosterhuis, C. Tran, B. Veeling, J. van der Velde, R. Wechsler, D. Woudenberg, and M. de Rijke (2015) Probabilistic multileave for online retrieval evaluation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, New York, NY, USA, pp. 955–958. External Links: ISBN 9781450336215, Link, Document Cited by: §2.1.
  • A. Schuth, F. Sietsma, S. Whiteson, D. Lefortier, and M. de Rijke (2014) Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM ’14, New York, NY, USA, pp. 71–80. External Links: ISBN 9781450325981, Link, Document Cited by: §2.1.
  • A. Sharma, J. M. Hofman, and D. J. Watts (2015) Estimating the causal impact of recommendation systems from observational data. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC ’15, New York, NY, USA, pp. 453–470. External Links: ISBN 978-1-4503-3410-5, Link, Document Cited by: §2.2.
  • Y. Wang, D. Liang, L. Charlin, and D. M. Blei (2020) Causal inference for recommender systems. In Fourteenth ACM Conference on Recommender Systems, RecSys ’20, New York, NY, USA, pp. 426–431. External Links: ISBN 9781450375832, Link, Document Cited by: §2.2, footnote 1.