1. Introduction
A recommendation is a treatment that can affect user behavior. The increase in user actions, such as purchases or views, caused by a recommendation is its treatment effect (also called the causal effect). Because this increase leads to improved sales or user engagement, the causal effect of recommendations is important for businesses. While most recommendation methods aim for accurate prediction of user behaviors, there may be a discrepancy between the accuracy and the causal effect of recommendations (Sato et al., 2019). Several recent works have thus proposed recommendation methods that rank items by the causal effect of recommendations (Bodapati, 2008; Sato et al., 2016, 2019, 2020b, 2021).
Online experiments are commonly conducted to compare model performance and select the best recommendation model. However, evaluating the causal effect is not straightforward; we cannot naively compare the outcomes of recommended items because the causal effect is the difference between the potential outcomes with and without the treatment (Rubin, 1974; Imbens and Rubin, 2015). A/B testing that compares the total user actions on all items, not only recommended items, can reveal the difference in the average causal effect (see Section 3.2). Nevertheless, it suffers from large fluctuations due to the variability in natural user behaviors for nonrecommended items: some users tend to purchase more items than others. A large number of users is required to compensate for such fluctuations, making online experiments costly and risky.
In this paper, we propose efficient online evaluation methods for the causal effect of recommendations based on interleaving. Interleaving generates a list from the lists ranked by the two models to be compared (Chapelle et al., 2012). Whereas previous interleaving methods only measure the outcomes of items in the intersection of the original and interleaved lists, our proposed methods also measure the outcomes of items in the original lists but not in the interleaved list. We propose an interleaving method that selects items with equal probability for unbiased evaluation. With unequal selection probabilities, the evaluation might be biased due to confounding (Hernán and Robins, 2020) between recommendation and potential outcomes, leading to inaccurate judgments of the recommendation models. We remove the possible bias by properly weighting the outcomes based on the inverse propensity score (IPS) method used in causal inference (Rosenbaum and Rubin, 1983; Lunceford and Davidian, 2004). This enables the use of a more general interleaving framework that only requires nonzero probabilities to be selected for any item in the original lists. As an instance of the framework, we propose a causal balanced interleaving method that balances the number of items chosen from the two compared lists. To verify the unbiasedness and efficiency of the proposed interleaving methods, we simulate online experiments to compare ranking models.
The contributions of this paper are summarized as follows.

We propose the first interleaving methods to compare recommendation models in terms of their causal effect.

We verify the unbiasedness and efficiency of the proposed methods through simulated online experiments.
2. Related Work
2.1. Interleaving Methods
Interleaving is an online evaluation method that compares two ranking models by observing user interactions with an interleaved list generated from the lists ranked by the two models to be compared (Chapelle et al., 2012). Several interleaving methods have been proposed for evaluating information retrieval systems. Balanced interleaving (Joachims, 2002, 2003) generates an interleaved list from the two rankings to be compared such that any top-$k$ prefix of the interleaved list contains the top $k_a$ items of one ranking and the top $k_b$ items of the other, where $k_a$ and $k_b$ are the same or differ by at most one. Team draft interleaving (Radlinski et al., 2008) alternately selects items from the compared rankings, analogously to selecting teams for a friendly team-sports match. Probabilistic interleaving (Hofmann et al., 2011) selects items according to probabilities that depend on the item ranks. Optimized interleaving (Radlinski and Craswell, 2013) makes the properties required for interleaving in information retrieval explicit and then generates interleaved lists by solving an optimization problem that fulfills those properties. Interleaving methods have been extended to multileaving, which compares multiple rankings simultaneously (Schuth et al., 2014, 2015). Multileaving has also been applied to the evaluation of a news recommender system (Iizuka et al., 2019). The objective of previous interleaving methods is to evaluate how accurately the rankings reflect queries or user preferences, whereas our goal is to evaluate rankings in terms of the causal effect. To the best of our knowledge, at present there are no interleaving methods for causal effects.
2.2. Recommendation Methods for the Causal Effect
Recommendations can affect users’ opinions (Cosley et al., 2003) and induce users’ actions (Dias et al., 2008; Jannach and Jugovac, 2019). However, users’ actions on recommended items could have occurred even without the recommendations (Sharma et al., 2015). Building recommendation models that target the causal effect is challenging because the ground truth of the causal effect is not observable (Holland, 1986). One approach is to train prediction models for both recommended and nonrecommended outcomes and then rank the items by the difference between the two predictions (Bodapati, 2008; Sato et al., 2016). Another approach is to optimize models directly for the causal effect. ULRMF and ULBPR (Sato et al., 2019) are pointwise and pairwise optimization methods, respectively, that use label transformations and training-data sampling strategies designed for causal effect optimization. DLCE (Sato et al., 2020b) is an unbiased learning method for the causal effect that uses an IPS-based unbiased learning objective. There are also neighborhood methods for causal effects (Sato et al., 2021) that are based on a matching estimator in causal inference. These prior works on causal effects evaluated methods offline and did not discuss protocols for online evaluation. In this study, we develop online evaluation methods and compare some of the aforementioned recommendation methods in simulated online experiments.
Another line of work in the area of causal recommendation aims at debiasing (Chen et al., 2020). Several methods have been proposed to learn users’ true preferences from biased (missing-not-at-random) feedback data (Schnabel et al., 2016; Saito et al., 2020; Wang et al., 2020; Bonner and Vasile, 2018). (Note that CausE, proposed by Bonner and Vasile (2018), can be used for causal effect ranking (Sato et al., 2019), although their original work tackles unbiased prediction of the treated outcome and only briefly refers to prediction of the causal effect.) Wang et al. (2020) suggested recommending items that have a low probability of exposure and that would be rated highly if exposed. Their approach might be regarded as indirectly targeting the causal effect of recommendations, assuming that recommendations increase exposure. To take this approach, it might also be important to model the influence of recommendations on exposure (Sato et al., 2020a). These methods can be regarded as predicting interactions under recommendation (i.e., $Y^1_{ui}$, defined in the next section). Hence, we can evaluate them using previous interleaving methods.
3. Evaluation Methods for the Causal Effect of Recommendations
3.1. Causal Effect of Recommendations
In this subsection, we define the causal effect of recommendations. Let $\mathcal{U}$ and $\mathcal{I}$ be the sets of users and items, respectively. Let $Y_{ui}$ denote the interaction (e.g., purchase or view) of user $u$ with item $i$. User interactions may differ depending on whether the item is recommended or not. We denote the binary indicator for the recommendation (also called the treatment assignment) by $Z_{ui}$. Let $Y^1_{ui}$ and $Y^0_{ui}$ be the hypothetical user interactions (also called potential outcomes (Rubin, 1974)) when item $i$ is recommended to $u$ ($Z_{ui} = 1$) and when it is not recommended ($Z_{ui} = 0$), respectively. The causal effect of recommending item $i$ to user $u$ is defined as the difference between the two potential outcomes, $\tau_{ui} = Y^1_{ui} - Y^0_{ui}$, which takes ternary values, $\tau_{ui} \in \{-1, 0, 1\}$. Using potential outcomes, the observed interaction can be expressed as

(1)  $Y_{ui} = Z_{ui} Y^1_{ui} + (1 - Z_{ui}) Y^0_{ui},$

i.e., $Y_{ui} = Y^1_{ui}$ if $i$ is recommended ($Z_{ui} = 1$) and $Y_{ui} = Y^0_{ui}$ if it is not recommended ($Z_{ui} = 0$). Note that $Y^1_{ui}$ and $Y^0_{ui}$ cannot both be observed at a specific time; hence, $\tau_{ui}$ is not directly observable.
The recommendation model $M$ generates a recommendation list $L^M_u$ for each user. The average causal effect of model $M$ is then defined as

(2)  $\bar{\tau}_M = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{|L^M_u|} \sum_{i \in L^M_u} \tau_{ui}.$

In this work, we evaluate models using the above metric. (This metric is identical to the causal precision@$K$ in (Sato et al., 2020b).) That is, when comparing two models $M_A$ and $M_B$, we regard $M_A$ as superior to $M_B$ when $\bar{\tau}_{M_A} > \bar{\tau}_{M_B}$.
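The definitions above can be illustrated with a small sketch. Everything here is hypothetical toy data, and the function names are our own, not from the paper's code; the sketch simply evaluates Eq. (1) and Eq. (2) on fixed potential outcomes.

```python
def observed_outcome(z, y1, y0):
    # Eq. (1): Y = Z * Y^1 + (1 - Z) * Y^0
    return z * y1 + (1 - z) * y0

def average_causal_effect(rec_lists, Y1, Y0):
    # Eq. (2): mean over users of the per-list mean causal effect tau = Y^1 - Y^0.
    per_user = []
    for u, items in rec_lists.items():
        taus = [Y1[u][i] - Y0[u][i] for i in items]
        per_user.append(sum(taus) / len(taus))
    return sum(per_user) / len(per_user)

# Hypothetical potential outcomes for two users over four items.
Y1 = {"u1": [1, 1, 0, 0], "u2": [1, 0, 1, 0]}
Y0 = {"u1": [1, 0, 0, 0], "u2": [0, 0, 1, 0]}
rec = {"u1": [0, 1], "u2": [0, 1]}  # the model recommends items 0 and 1

print(average_causal_effect(rec, Y1, Y0))  # 0.5
```

Note that user u1 buys item 0 with or without the recommendation ($\tau = 0$), so only item 1 contributes to u1's causal effect.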
3.2. A/B testing for the Causal Effect
For A/B testing, we randomly select nonoverlapping subsets of users $\mathcal{U}_A$ and $\mathcal{U}_B$ (i.e., $\mathcal{U}_A \cap \mathcal{U}_B = \emptyset$ and $\mathcal{U}_A, \mathcal{U}_B \subseteq \mathcal{U}$) and apply models $M_A$ and $M_B$ to each subset, respectively. Let $K$ be the size of the recommendation list, which we assume to be constant, and let $L^A_u$ denote the list generated by $M_A$ for user $u$. The subset average causal effect is then defined as

(3)  $\hat{\tau}_A = \frac{1}{|\mathcal{U}_A|} \sum_{u \in \mathcal{U}_A} \frac{1}{K} \sum_{i \in L^A_u} \tau_{ui}.$

This converges to $\bar{\tau}_{M_A}$ as $|\mathcal{U}_A|$ increases.
The typical evaluation metrics for A/B testing are either based on total user interactions (such as sales or user engagement) or only on interactions with the recommended lists (such as click-through rates or conversion rates) (Jannach and Jugovac, 2019). Here we show that the former is a valid evaluation for the causal effect. The total user interactions divided by the number of recommendations can be expressed as

(4)  $\frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in \mathcal{I}} Y_{ui} = \frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \left( \sum_{i \in L^A_u} Y^1_{ui} + \sum_{i \notin L^A_u} Y^0_{ui} \right) = \hat{\tau}_A + \frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in \mathcal{I}} Y^0_{ui}.$

Because the rightmost term in the final equation does not depend on the model, we can compare $M_A$ and $M_B$ by comparing the total interactions of $\mathcal{U}_A$ and $\mathcal{U}_B$. On the other hand, the average interactions with the recommended lists can be expressed as

(5)  $\frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in L^A_u} Y_{ui} = \frac{1}{|\mathcal{U}_A| K} \sum_{u \in \mathcal{U}_A} \sum_{i \in L^A_u} Y^1_{ui},$

which is the average treated outcome rather than the causal effect. Hence, the evaluation based only on interactions with recommended lists is not a valid test for the causal effect.
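A toy sketch of the two A/B metrics; the helper names and data are illustrative assumptions, not the paper's code. It constructs a case where the total-interaction metric of Eq. (4) separates the models while the list-only metric of Eq. (5) cannot: item 0 is causally persuadable ($Y^1 = 1$, $Y^0 = 0$), item 1 is bought regardless ($Y^1 = Y^0 = 1$), and item 2 is a baseline purchase for everyone.

```python
def ab_total(Y_by_user, K):
    # Eq. (4)-style metric: all observed interactions per user, divided by K.
    return sum(sum(ys) for ys in Y_by_user) / (len(Y_by_user) * K)

def ab_list(Y_by_user, rec_list, K):
    # Eq. (5)-style metric: interactions with recommended items only.
    return sum(sum(ys[i] for i in rec_list) for ys in Y_by_user) / (len(Y_by_user) * K)

K = 1
# Observed outcomes (Eq. (1)) for one user per group:
Y_group_A = [[1, 1, 1]]  # model A recommended item 0 -> Y^1 there, Y^0 elsewhere
Y_group_B = [[0, 1, 1]]  # model B recommended item 1 -> Y^1 there, Y^0 elsewhere

print(ab_total(Y_group_A, K), ab_total(Y_group_B, K))          # 3.0 2.0: A wins
print(ab_list(Y_group_A, [0], K), ab_list(Y_group_B, [1], K))  # 1.0 1.0: tie
```

The total-interaction comparison reflects the causal difference (model A lifted item 0 from 0 to 1), whereas both recommended lists show identical interactions.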
Although A/B testing with Eq. (4) can be used for unbiased model comparisons, it may have large variance due to the variability in natural user behaviors (i.e., the potential outcomes under no recommendation, $Y^0_{ui}$). If users in $\mathcal{U}_A$ tend to purchase more items than those in $\mathcal{U}_B$, the last term of Eq. (4) becomes larger for $\mathcal{U}_A$ than for $\mathcal{U}_B$, thereby altering the comparison. To minimize such discrepancies, a sufficiently large number of users needs to be recruited for A/B testing. We thus introduce more efficient evaluation methods in the next subsection.

3.3. Interleaving for the Causal Effect
In this subsection, we propose interleaving methods for the online evaluation of the causal effects of recommendations. Previous interleaving methods only measure outcomes in the interleaved lists: they only include $Y^1_{ui}$ and lack information on $Y^0_{ui}$. Further, if the item selection for the interleaved list is not randomized, the naive estimate from the observed outcomes might be biased due to confounding (Hernán and Robins, 2020) between the recommendation and the potential outcomes. We need to remedy this bias for valid comparison.
Here we describe the problem setting of interleaving for the causal effect. For each user $u$, we construct the interleaved list $I_u$ from the compared lists $L^A_u$ and $L^B_u$. We observe outcomes $Y_{ui}$ for all items $i \in L^A_u \cup L^B_u$. Note that $Y_{ui} = Y^1_{ui}$ if item $i$ is in the interleaved list ($i \in I_u$, or equivalently, $Z_{ui} = 1$) and $Y_{ui} = Y^0_{ui}$ if it is not in the list ($i \notin I_u$, or equivalently, $Z_{ui} = 0$). We want to compare the average causal effects of lists $L^A$ and $L^B$:

(6)  $\bar{\tau}_A = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{K} \sum_{i \in L^A_u} \tau_{ui}, \quad \bar{\tau}_B = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{K} \sum_{i \in L^B_u} \tau_{ui}.$

We need to estimate the above values from the observed outcomes because we cannot directly observe $\tau_{ui}$.
If the items in $L^A_u$ and $L^B_u$ are randomly assigned to the interleaved list independently of the potential outcomes, that is, $\{Y^1_{ui}, Y^0_{ui}\} \perp Z_{ui}$, the case can be regarded as a randomized controlled trial (RCT) (Rubin, 1974; Imbens and Rubin, 2015). (For our interleaving methods, the independence is required only for the items in the union of $L^A_u$ and $L^B_u$.) We can then simply estimate $\bar{\tau}_A$ as the difference in average outcomes between the items of $L^A_u$ on and not on the interleaved list:

(7)  $\hat{\tau}_A = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \left( \frac{\sum_{i \in L^A_u} Z_{ui} Y_{ui}}{\sum_{i \in L^A_u} Z_{ui}} - \frac{\sum_{i \in L^A_u} (1 - Z_{ui}) Y_{ui}}{\sum_{i \in L^A_u} (1 - Z_{ui})} \right).$

One way to realize such a randomized assignment is to select the items of the interleaved list from $L^A_u \cup L^B_u$ with equal probability: $P(Z_{ui} = 1) = K / |L^A_u \cup L^B_u|$. We call this method equal probability interleaving (EPI).
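A minimal sketch of EPI and the RCT-style estimate, under the assumption that Eq. (7) is computed per user as a difference of group means; the function names are hypothetical and the degenerate-split handling is our own choice.

```python
import random

def epi_interleave(list_a, list_b, k, rng):
    # Equal Probability Interleaving: draw k items uniformly at random from
    # the union, so every item has the same inclusion probability k / |union|.
    union = sorted(set(list_a) | set(list_b))
    return set(rng.sample(union, k))

def rct_estimate(list_m, interleaved, outcomes):
    # Eq. (7)-style per-user estimate: average outcome of the items of L^M
    # that entered the interleaved list, minus the average outcome of those
    # that did not (valid only under the randomized assignment above).
    treated = [outcomes[i] for i in list_m if i in interleaved]
    control = [outcomes[i] for i in list_m if i not in interleaved]
    if not treated or not control:
        return 0.0  # degenerate split for this user; handle as appropriate
    return sum(treated) / len(treated) - sum(control) / len(control)

rng = random.Random(0)
interleaved = epi_interleave([1, 2, 3], [3, 4, 5], k=3, rng=rng)
print(rct_estimate([1, 2, 3], interleaved, {1: 1, 2: 1, 3: 0, 4: 0, 5: 1}))
```

Averaging the per-user estimates over many users yields the estimate of $\bar{\tau}_A$.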
The independence requirement heavily restricts the potential design space of interleaving methods. We thus derive estimates that are applicable to more general cases. Denote the probability (also called the propensity) of item $i$ being included in the interleaved list of user $u$ by $e_{ui} = P(Z_{ui} = 1 \mid X_{ui})$, where $X_{ui}$ denotes covariates. We assume that 1) the covariates $X_{ui}$ contain all confounders of $Z_{ui}$ and the potential outcomes, and 2) the treatment assignment is not deterministic ($0 < e_{ui} < 1$ for $i \in L^A_u \cup L^B_u$). (Taken together, these two assumptions are called strongly ignorable treatment assignment (Rosenbaum and Rubin, 1983).) Assumption 1 is equivalent to conditional independence: $\{Y^1_{ui}, Y^0_{ui}\} \perp Z_{ui} \mid X_{ui}$. When we design an interleaving method, we know the covariates that affect $Z_{ui}$, so Assumption 1 can always be satisfied. (Confounders are covariates that affect both $Z_{ui}$ and the potential outcomes, and they are a subset of the covariates that affect $Z_{ui}$; hence, including the latter in $X_{ui}$ is a sufficient condition for Assumption 1.) Therefore, the only restriction for interleaving methods is Assumption 2 (also called positivity).
Under these assumptions, we can construct an unbiased estimator using IPS weighting (Lunceford and Davidian, 2004):

(8)  $\hat{\tau}_A = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{K} \sum_{i \in L^A_u} \left( \frac{Z_{ui} Y_{ui}}{e_{ui}} - \frac{(1 - Z_{ui}) Y_{ui}}{1 - e_{ui}} \right).$

This estimator is unbiased since

(9)  $\mathbb{E}\!\left[ \frac{Z_{ui} Y_{ui}}{e_{ui}} - \frac{(1 - Z_{ui}) Y_{ui}}{1 - e_{ui}} \right] = \mathbb{E}\!\left[ \mathbb{E}\!\left[ \frac{Z_{ui} Y^1_{ui}}{e_{ui}} - \frac{(1 - Z_{ui}) Y^0_{ui}}{1 - e_{ui}} \,\middle|\, X_{ui} \right] \right] = \mathbb{E}\!\left[ \frac{e_{ui} Y^1_{ui}}{e_{ui}} - \frac{(1 - e_{ui}) Y^0_{ui}}{1 - e_{ui}} \right] = \mathbb{E}\left[ \tau_{ui} \right],$

where the second equality uses $\mathbb{E}[Z_{ui} \mid X_{ui}] = e_{ui}$ and the conditional independence of Assumption 1.
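The IPS estimate and its unbiasedness can be checked numerically. The sketch below assumes hypothetical propensities and potential outcomes, and it takes the exact expectation of Eq. (8) over all treatment assignments of two independent items, mirroring the argument of Eq. (9).

```python
from itertools import product

def ips_estimate(list_m, k, z, y, propensity):
    # Eq. (8): IPS estimate of the average causal effect of list L^M for one
    # user; z[i] is the treatment indicator and propensity[i] = P(z[i] = 1).
    total = 0.0
    for i in list_m:
        e = propensity[i]
        total += z[i] * y[i] / e - (1 - z[i]) * y[i] / (1 - e)
    return total / k

# Hypothetical setup: two items with unequal propensities.
L, K = [0, 1], 2
Y1, Y0 = [1, 1], [0, 1]          # true average causal effect: (1 + 0) / 2 = 0.5
e = {0: 0.25, 1: 0.8}

expectation = 0.0
for z0, z1 in product([0, 1], repeat=2):
    prob = (e[0] if z0 else 1 - e[0]) * (e[1] if z1 else 1 - e[1])
    z = {0: z0, 1: z1}
    y = {i: Y1[i] if z[i] else Y0[i] for i in L}   # Eq. (1)
    expectation += prob * ips_estimate(L, K, z, y, e)

print(expectation)  # close to 0.5 (up to floating point), the true average effect
```

Although individual estimates can be far from the truth (the weights $1/e$ inflate single outcomes), the expectation over assignments matches the true average causal effect.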
We propose a general framework for interleaving as follows.

Construct interleaved lists using an interleaving method that satisfies positivity (Assumption 2).

Conduct online experiments and obtain the outcomes $Y_{ui}$.

Estimate $\bar{\tau}_A$ and $\bar{\tau}_B$ by Eq. (8) and compare them.
As an example of a valid interleaving method that satisfies positivity, we propose causal balanced interleaving (CBI), the pseudocode for which is shown in Algorithm 1. CBI alternately selects items from each list to balance the number of items chosen from the two compared lists. The item choice in each round is not deterministic, in order to satisfy the positivity required for the causal effect estimates. The propensity depends on whether an item is in the intersection $L^A_u \cap L^B_u$: if an item is included in both lists, it has a greater probability of being chosen. The propensity also depends on the cardinality of the union of the compared lists, $|L^A_u \cup L^B_u|$, because a smaller cardinality implies that each item has a greater chance of being selected. The possible values of these covariates are limited: the intersection indicator is binary, and $K \le |L^A_u \cup L^B_u| \le 2K$. Hence, we can easily compute the propensity numerically by repeating Algorithm 1 a sufficient number of times and recording the selection frequency for each combination of covariates.
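Since Algorithm 1 is not reproduced here, the following is only a sketch consistent with the description above (balanced, non-deterministic selection from the two lists); the actual algorithm may differ in its details. It also shows the numerical propensity computation by repeated simulation.

```python
import random
from collections import Counter

def cbi_interleave(list_a, list_b, k, rng):
    # Sketch of CBI: in each round the two lists are processed in random
    # order, and a uniformly random not-yet-selected item is taken from
    # each, so selections stay balanced between the lists and no item
    # choice is deterministic (positivity holds for every item in the union).
    selected = set()
    union = set(list_a) | set(list_b)
    lists = [list(list_a), list(list_b)]
    while len(selected) < k and selected != union:
        rng.shuffle(lists)
        for lst in lists:
            if len(selected) >= k:
                break
            remaining = [i for i in lst if i not in selected]
            if remaining:
                selected.add(rng.choice(remaining))
    return selected

def estimate_propensities(list_a, list_b, k, n_runs=10000, seed=0):
    # The propensity e_ui = P(i in the interleaved list) has no simple
    # closed form here, so estimate it by repeating the interleaving.
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        for item in cbi_interleave(list_a, list_b, k, rng):
            counts[item] += 1
    return {i: counts[i] / n_runs for i in set(list_a) | set(list_b)}

p = estimate_propensities([1, 2], [2, 3], k=2)
# Item 2 sits in both lists, so its propensity exceeds that of items 1 and 3.
```

In practice, the propensities only need to be computed once per covariate combination (intersection membership and union size), not per user.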
4. Experiments
4.1. Experimental Setup
We experimented with the following online evaluation methods. (For reproducibility, our code is available at https://github.com/masatoh73/causalinterleaving.)

AB-total: A/B testing evaluated by the total user interactions, as expressed in Eq. (4).

AB-list: A/B testing evaluated by user interactions only with items on the recommended list, as in Eq. (5).

EPI-RCT: Interleaving that selects items from $L^A_u \cup L^B_u$ with equal probability (EPI) and evaluation using the RCT estimate of Eq. (7).

CBI-RCT: Interleaving with CBI and evaluation using the RCT estimate of Eq. (7), which ignores the unevenness of the propensities.

CBI-IPS: Interleaving with CBI and evaluation using the IPS estimate of Eq. (8).
Through the experiments, we aim to answer the following research questions: RQ1) Which methods produce valid (unbiased) estimates of the true differences in average causal effects (Section 4.2.1)? RQ2) Are the proposed interleaving methods more efficient (i.e., do they require fewer experimental users) than A/B testing (Section 4.2.2)? We first prepared semi-synthetic datasets that contain both potential outcomes $Y^1_{ui}$ and $Y^0_{ui}$ for all user-item pairs. Because we observe $Y^1_{ui}$ if $Z_{ui} = 1$ and $Y^0_{ui}$ if $Z_{ui} = 0$, both potential outcomes are necessary to simulate user outcomes under various ranking models and online evaluation methods. Following the procedure described in (Sato et al., 2021), we generated two datasets: one based on the Dunnhumby dataset (https://www.dunnhumby.com/careers/engineering/sourcefiles) and the other based on the MovieLens-1M (ML-1M) dataset (Harper and Konstan, 2015) (https://grouplens.org/datasets/movielens). The details and rationale of the ML-1M dataset are described in Section 5.1 of (Sato et al., 2021), and those of the Dunnhumby dataset are described in Section 5.1.1 of (Sato et al., 2020b). Each dataset comprises independently generated training and testing data. The testing data were used to simulate online evaluation, and the training data were used to train the following models (we used the hyperparameters for CP@10, described in the ancillary files at http://arxiv.org/abs/2012.09442): the causality-aware user-based neighborhood methods (CUBN) with outcome similarity (CUBN-O) and treatment similarity (CUBN-T) (Sato et al., 2021), the uplift-based pointwise and pairwise learning methods (ULRMF and ULBPR) (Sato et al., 2019), the Bayesian personalized ranking method (BPR) (Rendle et al., 2009), and the user-based neighborhood method (UBN) (Ning et al., 2015). We compared two models among CUBN-T, ULRMF, and BPR on the Dunnhumby data and two models among CUBN-O, ULBPR, and UBN on the ML-1M data (we intended to compare models of different families, i.e., one of {CUBN-T, CUBN-O} with one of {ULBPR, ULRMF}). The average causal effects and the average treated outcomes of the trained models are listed in Table 1. The superior models in terms of the average causal effect do not necessarily have higher average treated outcomes. That is, we may mistakenly select a poor model in terms of the causal effect if we only evaluate the outcomes of the recommended items.

Table 1. Average causal effect and average treated outcome of the trained models.

                      Dunnhumby-Original        MovieLens-1M
                      CUBN-T  ULRMF   BPR       CUBN-O  ULBPR   UBN
Avg. causal effect    0.0507  0.0347  0.0295    0.332   0.280   -0.186
Avg. treated outcome  0.1359  0.1396  0.1869    0.341   0.285   0.308
Our protocol for simulating online experiments is as follows. First, we randomly select a subset of users and generate lists using the compared models. For the A/B testing methods (AB-total, AB-list), we further split the subset into two groups, $\mathcal{U}_A$ and $\mathcal{U}_B$, and the lists of $M_A$ and $M_B$ are recommended to each group, respectively. For the interleaving methods (EPI-RCT, CBI-RCT, CBI-IPS), we generate interleaved recommendation lists using EPI or CBI. In the simulation, recommendation means that $Z_{ui}$ is set to $1$, and user outcomes are observed by calculating Eq. (1) with the potential outcomes $Y^1_{ui}$ and $Y^0_{ui}$. Using the observed outcomes, we estimate the difference in the average causal effects of the compared models, $\bar{\tau}_A - \bar{\tau}_B$. We repeated the above protocol 10,000 times and recorded the estimated differences for each online evaluation method. The size of the recommendation list $K$ was set to 10.
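The protocol above can be sketched end to end for one toy configuration: disjoint lists, equal-probability interleaving (constant propensity $e = 0.5$), and deterministic hypothetical potential outcomes. This is an illustration, not the released experiment code; all names and numbers are our own.

```python
import random

# Interleave, observe via Eq. (1), estimate tau_A - tau_B via IPS (Eq. (8)).
rng = random.Random(42)
K = 2
N_USERS = 200
L_A, L_B = [0, 1], [2, 3]        # disjoint lists of the two compared models
Y1 = [1, 1, 0, 0]                # potential outcomes, same for every user
Y0 = [0, 1, 1, 0]                # true tau per item: [1, 0, -1, 0]

est = 0.0
for _ in range(N_USERS):
    union = L_A + L_B
    interleaved = set(rng.sample(union, K))   # EPI: e = K / |union| = 0.5
    for i in union:
        z = 1 if i in interleaved else 0
        y = z * Y1[i] + (1 - z) * Y0[i]       # Eq. (1): observed outcome
        term = z * y / 0.5 - (1 - z) * y / 0.5
        est += term / K if i in L_A else -term / K
est /= N_USERS

true_diff = (sum(Y1[i] - Y0[i] for i in L_A)
             - sum(Y1[i] - Y0[i] for i in L_B)) / K
print(f"estimate {est:.2f}, truth {true_diff:.2f}")  # estimate fluctuates around 1.0
```

Per run the estimate is noisy, but averaging over users (here 200) concentrates it around the true difference, which is the behavior measured in Section 4.2.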
4.2. Results and Discussion
4.2.1. Validity of the evaluation methods
We evaluated the validity of the online evaluation methods using random subsets of 1,000 users. The means and standard deviations of the estimated differences are shown in Table 2. The means obtained by EPI-RCT and CBI-IPS are close to the true differences. The means obtained by AB-total are also close to the true values for Dunnhumby but deviate slightly for ML-1M. AB-list often yields estimates that differ substantially from the true values but are similar to the differences in treated outcomes shown in Table 1. This is expected because AB-list evaluates $Y^1_{ui}$, not $\tau_{ui}$, as expressed in Eq. (5). Further, the CBI-RCT estimates also deviate from the true differences in most cases. (In the comparisons of CUBN-O & UBN and ULBPR & UBN, the results of CBI-RCT and CBI-IPS are identical: there was no overlap between $L^A_u$ and $L^B_u$ in these comparisons, and the propensity was constant. Hence, IPS was not necessary, and CBI-RCT and CBI-IPS were equivalent.) The deviation is due to the bias induced by the uneven probability of recommendation in interleaving. Conversely, CBI-IPS successfully removes the bias and produces estimates centered around the true values.

Table 2. Estimated differences in the average causal effects (means ± standard deviations over 10,000 simulated runs). The true values lie within the 95% confidence intervals of the mean estimates for AB-total, EPI-RCT, and CBI-IPS in all comparisons, and for CBI-RCT only in the two comparisons where it coincides with CBI-IPS.

          Dunnhumby-Original                                      MovieLens-1M
          CUBN-T & BPR      CUBN-T & ULRMF    ULRMF & BPR         CUBN-O & UBN     CUBN-O & ULBPR    ULBPR & UBN
Truth     0.0212            0.0160            0.0052              0.5177           0.0512            0.4665
AB-total  0.0210 ± 0.0399   0.0159 ± 0.0399   0.0051 ± 0.0397    0.5301 ± 1.2048  0.0635 ± 1.2102   0.4789 ± 1.2052
AB-list   -0.0510 ± 0.0071  -0.0037 ± 0.0065  -0.0471 ± 0.0073   0.0325 ± 0.0104  0.0550 ± 0.0104   -0.0226 ± 0.0100
EPI-RCT   0.0212 ± 0.0069   0.0159 ± 0.0075   0.0053 ± 0.0076    0.5178 ± 0.0137  0.0512 ± 0.0083   0.4666 ± 0.0135
CBI-RCT   0.0429 ± 0.0067   0.0192 ± 0.0067   0.0188 ± 0.0076    0.5179 ± 0.0126  0.0444 ± 0.0066   0.4667 ± 0.0126
CBI-IPS   0.0213 ± 0.0063   0.0160 ± 0.0066   0.0051 ± 0.0070    0.5179 ± 0.0126  0.0512 ± 0.0075   0.4667 ± 0.0126
4.2.2. Efficiency of the interleaving methods
We compared the efficiency of AB-total, EPI-RCT, and CBI-IPS, all of which were shown to be valid in the previous section. We simulated user subsets of various sizes in {10, 14, 20, 30, 50, 70, 100, 140, 200, 300, 500, 700, 1000, 1400, 2000} and evaluated the ratio of false judgments, i.e., the fraction of runs in which the sign of the estimated difference is the opposite of the truth. Figure 1 shows the ratio of false judgments according to the number of users. As the number of users increases, the false ratios of CBI-IPS and EPI-RCT decrease more rapidly than that of AB-total does. For the Dunnhumby dataset, AB-total requires around 30 times more users than CBI-IPS and EPI-RCT to achieve the same false ratio. For the ML-1M dataset, AB-total did not reach the same false ratio within the experimental range of subset sizes. These results demonstrate the superior efficiency of the proposed interleaving methods. Furthermore, CBI-IPS tends to be slightly more efficient than EPI-RCT, as expected from the smaller standard deviations shown in Table 2. This is probably because CBI balances the number of items selected from the two compared lists.
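The false-judgment criterion used above can be stated compactly; the function name and the sample numbers below are illustrative, not taken from the experiments.

```python
def false_judgment_ratio(estimates, true_diff):
    # A judgment is false when the sign of the estimated difference in
    # average causal effects disagrees with the sign of the true difference.
    wrong = sum(1 for est in estimates if (est > 0) != (true_diff > 0))
    return wrong / len(estimates)

# Four hypothetical runs against a true difference of +0.02: one run
# estimated a negative difference, so one judgment in four is false.
print(false_judgment_ratio([0.03, -0.01, 0.02, 0.04], 0.02))  # 0.25
```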
5. Conclusions
In this paper, we proposed the first interleaving methods for comparing recommendation models in terms of their causal effects. To realize unbiased model comparisons, our methods either select items with equal probability or weight the outcomes using IPS. We simulated online experiments and verified that our interleaving methods and an A/B testing method are unbiased, and that our interleaving methods are substantially more efficient than the A/B testing method. In the future, we plan to extend our methods to multileaving. Online experimentation in real recommendation services will also be important future work.
References
Bodapati (2008). Recommendation systems with purchase data. Journal of Marketing Research 45(1), pp. 77–93.

Bonner and Vasile (2018). Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18), pp. 104–112.

Chapelle et al. (2012). Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems 30(1).

Chen et al. (2020). Bias and debias in recommender system: a survey and future directions. arXiv preprint arXiv:2010.03240.

Cosley et al. (2003). Is seeing believing? How recommender system interfaces affect users' opinions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '03), pp. 585–592.

Dias et al. (2008). The value of personalised recommender systems to e-business: a case study. In Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys '08), pp. 291–294.

Harper and Konstan (2015). The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems 5(4).

Hernán and Robins (2020). Causal Inference: What If. Chapman & Hall/CRC, Boca Raton.

Hofmann et al. (2011). A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), pp. 249–258.

Holland (1986). Statistics and causal inference. Journal of the American Statistical Association 81(396), pp. 945–960.

Iizuka et al. (2019). Greedy optimized multileaving for personalization. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19), pp. 413–417.

Imbens and Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York.

Jannach and Jugovac (2019). Measuring the business value of recommender systems. ACM Transactions on Management Information Systems 10(4).

Joachims (2002). Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 133–142.

Joachims (2003). Evaluating retrieval performance using clickthrough data. In Text Mining: Theoretical Aspects and Applications, pp. 79–96.

Lunceford and Davidian (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23(19), pp. 2937–2960.

Ning et al. (2015). A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook, pp. 37–76.

Radlinski and Craswell (2013). Optimized interleaving for online retrieval evaluation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13), pp. 245–254.

Radlinski et al. (2008). How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), pp. 43–52.

Rendle et al. (2009). BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09), pp. 452–461.

Rosenbaum and Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), pp. 41–55.

Rubin (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), p. 688.

Saito et al. (2020). Unbiased recommender learning from missing-not-at-random implicit feedback. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM '20), pp. 501–509.

Sato et al. (2016). Modeling individual users' responsiveness to maximize recommendation impact. In Proceedings of the 2016 Conference on User Modeling, Adaptation and Personalization (UMAP '16), pp. 259–267.

Sato et al. (2019). Uplift-based evaluation and optimization of recommenders. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19), pp. 296–304.

Sato et al. (2020a). Modeling user exposure with recommendation influence. In Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC '20), pp. 1461–1464.

Sato et al. (2020b). Unbiased learning for the causal effect of recommendation. In Fourteenth ACM Conference on Recommender Systems (RecSys '20), pp. 378–387.

Sato et al. (2021). Causality-aware neighborhood methods for recommender systems. pp. 603–618.

Schnabel et al. (2016). Recommendations as treatments: debiasing learning and evaluation. In Proceedings of the 33rd International Conference on Machine Learning (ICML '16), pp. 1670–1679.

Schuth et al. (2014). Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14), pp. 71–80.

Schuth et al. (2015). Probabilistic multileave for online retrieval evaluation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15), pp. 955–958.

Sharma et al. (2015). Estimating the causal impact of recommendation systems from observational data. In Proceedings of the Sixteenth ACM Conference on Economics and Computation (EC '15), pp. 453–470.

Wang et al. (2020). Causal inference for recommender systems. In Fourteenth ACM Conference on Recommender Systems (RecSys '20), pp. 426–431.