
Causal Embeddings for Recommendation: An Extended Abstract

Recommendations are commonly used to modify users' natural behavior, for example, to increase product sales or the time spent on a website. This results in a gap between the ultimate business objective and the classical setup, where recommendations are optimized to be coherent with past user behavior. To bridge this gap, we propose a new learning setup for recommendation that optimizes for the Incremental Treatment Effect (ITE) of the policy. We show that this is equivalent to learning to predict recommendation outcomes under a fully random recommendation policy, and we propose a new domain adaptation algorithm that learns from logged data containing outcomes from a biased recommendation policy and predicts recommendation outcomes according to random exposure. We compare our method against state-of-the-art factorization methods, as well as newer approaches to causal recommendation, and show significant improvements.





1 Introduction

In recent years, online commerce has outpaced the growth of traditional commerce. As such, research work on recommender systems has also grown significantly, with recent Deep Learning (DL) approaches achieving state-of-the-art results. Broadly, these DL approaches frame the recommendation task as either:

  • A distance learning problem between pairs of products or pairs of users and products, measured with Mean Squared Error (MSE) and Area Under the Curve (AUC), as in the work of [Grbovic et al.2015, Vasile et al.2016, Pennington et al.2014].

  • A next item prediction problem that models user behavior and predicts the next action, measured with ranking metrics such as Precision@K and Normalized Discounted Cumulative Gain (NDCG), as presented in [Hidasi et al.2015, Hidasi et al.2016].

However, we argue that both approaches fail to model the inherently interventionist nature of recommendation, which should not only model organic user behavior but also attempt to influence it optimally according to a preset objective.

Ideally, the change in user behavior should be measured against a case where no recommendations are shown. This is not an easy problem, since we do not know what the user would have done in the absence of recommendations, which makes it a natural fit for the causal / counterfactual inference paradigm.

Using a causal vocabulary, we are interested in finding the treatment recommendation policy that maximizes the reward obtained from each user with respect to the control recommendation policy. This objective is traditionally denoted as the Individual Treatment Effect (ITE) [Rubin1974].

In our work, we introduce a modification to the classical matrix factorization approach that leverages both a large sample of biased recommendation outcomes and a small sample of randomized recommendation outcomes in order to create user and product representations. We show that, using our method, the pairwise distance between user and item representations is more strongly aligned with the corresponding ITE of recommending a particular item to the user than in both traditional matrix factorization and causal inference approaches.

1.1 Causal Vocabulary

Below we briefly introduce the causal vocabulary and notation that we will be using throughout the paper.

The Causal Inference Objective.

In the classical setup, we want to determine the causal effect of one single action which constitutes the treatment versus the control case where no action or a placebo action is undertaken (do vs. not do). In the stochastic setup, we want to determine the causal effect of a stochastic treatment policy versus the baseline control policy. In this case, both treatment and control are distributions over all possible actions. We retrieve the classical setup as a special case.

Recommendation Policy.

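We assume a stochastic policy $\pi_x$ that associates with each user $u_i$ and product $p_j$ a probability for the user $u_i$ to be exposed to the recommendation of product $p_j$:

$$p_j \sim \pi_x(\cdot \mid u_i)$$

For simplicity, we assume that showing no products is also a valid intervention in the product set $\mathcal{P}$.

Policy Rewards.

The reward $r_{ij}$ is distributed according to an unknown conditional distribution $r$ that depends on $u_i$ and $p_j$:

$$r_{ij} \sim r(\cdot \mid u_i, p_j)$$

The reward $R^{\pi_x}$ associated with a policy $\pi_x$ is equal to the sum of the rewards collected across all incoming users by using the associated personalized product exposure probability:

$$R^{\pi_x} = \sum_{i,j} r_{ij} \, \pi_x(p_j \mid u_i) \, p(u_i) = \sum_{i,j} R_{ij}^{\pi_x}$$

Individual Treatment Effect.

The Individual Treatment Effect (ITE) value of a policy $\pi_x$ for a given user $u_i$ and a product $p_j$ is defined as the difference between its reward and the reward of the control policy $\pi_c$:

$$ITE_{ij}^{\pi_x} = R_{ij}^{\pi_x} - R_{ij}^{\pi_c}$$

We are interested in finding the policy $\pi^*$ with the highest sum of ITEs:

$$\pi^* = \arg\max_{\pi_x} ITE^{\pi_x}, \qquad ITE^{\pi_x} = \sum_{i,j} ITE_{ij}^{\pi_x}$$

Optimal ITE Policy.

It is easy to show that, starting from any control policy $\pi_c$, the best incremental policy $\pi^*$ is the policy that deterministically shows each user $u_i$ the product $p_i^*$ with the highest personalized reward $r_i^*$:

$$\pi^*(p_j \mid u_i) = \begin{cases} 1 & \text{if } p_j = p_i^* \\ 0 & \text{otherwise} \end{cases}$$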

Note: This assumes non-fatigability, i.e. non-diminishing returns from recommending the same product repeatedly to the user (no user state / repeated action effects at play).
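As a minimal sketch of the optimal ITE policy described above, the following Python snippet builds the deterministic policy from a small, made-up reward matrix and compares its total reward with that of a uniform control policy. The reward values, the function names `optimal_policy` and `policy_reward`, and the simplification that every user arrives with equal weight are illustrative assumptions, not part of the paper.

```python
def optimal_policy(rewards):
    """Put probability 1 on each user's highest-reward product, 0 elsewhere."""
    policy = []
    for user_rewards in rewards:
        best = max(range(len(user_rewards)), key=lambda j: user_rewards[j])
        policy.append([1.0 if j == best else 0.0 for j in range(len(user_rewards))])
    return policy


def policy_reward(policy, rewards):
    """Sum over users and products of reward times exposure probability.

    Assumption: every user has the same weight, so p(u_i) is dropped.
    """
    return sum(pi * r
               for policy_row, reward_row in zip(policy, rewards)
               for pi, r in zip(policy_row, reward_row))


# Hypothetical personalized rewards r_ij for 2 users and 3 products.
rewards = [[0.1, 0.5, 0.2],
           [0.3, 0.1, 0.4]]

uniform = [[1.0 / 3] * 3 for _ in rewards]  # control: fully random exposure
best = optimal_policy(rewards)              # treatment: deterministic argmax

# Total ITE of the deterministic policy relative to the uniform control.
ite = policy_reward(best, rewards) - policy_reward(uniform, rewards)
```

In this toy setting the deterministic policy collects the per-user maximum reward, so its total ITE against any stochastic control policy is non-negative.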

IPS Solution For $\pi^*$

In order to find the optimal policy $\pi^*$, we need to find for each user $u_i$ the product with the highest personalized reward $r_i^*$.

In practice we do not observe $r_{ij}$ directly, but $y_{ij} \sim r_{ij} \, \pi_x(p_j \mid u_i)$.

The current approach to estimating $r_{ij}$ consists of using Inverse Propensity Scoring (IPS)-based methods to predict the unobserved reward $r_{ij}$:

$$\hat{r}_{ij} \approx \frac{y_{ij}}{\pi_c(p_j \mid u_i)}$$

This assumes we have incorporated randomization in the current policy $\pi_c$. Even with randomization, the main shortcoming of IPS-based estimators is that they do not handle large shifts in exposure probability between treatment and control policies well (products with low probability under the logging policy will tend to have higher predicted rewards).
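The sensitivity of IPS to small propensities can be illustrated with a short sketch. The log below is entirely hypothetical, and `ips_reward_estimates` is an illustrative helper name; the snippet assumes a single user context and known logging propensities.

```python
def ips_reward_estimates(logs, n_products):
    """Estimate each product's reward from (product, propensity, outcome) triples.

    Every logged outcome is divided by the probability with which the logging
    policy showed that product, then averaged over all logged events.
    """
    totals = [0.0] * n_products
    for product, propensity, outcome in logs:
        totals[product] += outcome / propensity
    return [total / len(logs) for total in totals]


# Hypothetical log: product 0 is shown 90% of the time, product 1 only 10%.
logs = [
    (0, 0.9, 1),  # click on the frequently shown product
    (0, 0.9, 0),
    (0, 0.9, 0),
    (1, 0.1, 1),  # a single click on the rarely shown product
]
estimates = ips_reward_estimates(logs, n_products=2)
```

Here the single click on the rarely shown product is weighted by 1 / 0.1 = 10, so its estimate (2.5) exceeds 1 even though a click probability cannot. With few samples, low-propensity products can therefore receive inflated predicted rewards.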

Addressing the variance issues of IPS.

It is easy to observe that, in order to obtain minimum variance, we should collect data using fully randomized recommendations, i.e. when $\pi_c = \pi_{rand}$. However, this means zero recommendation performance and therefore cannot be a solution in practice.

Our question: Could we learn from $\pi_c$ a predictor for performance under $\pi_{rand}$ and use it to compute the optimal product recommendations $p_i^*$?

2 Our Approach: Causal Embeddings (CausE)

We are interested in building a good predictor for recommendation outcomes under random exposure for all the user-product pairs, which we denote as $\hat{y}_{ij}^{rand}$. We assume that we have access to a large sample $S_c$ from the logging policy $\pi_c$ and a small sample $S_t$ from the randomized treatment policy $\pi_t^{rand}$ (e.g. the logging policy $\pi_c$ uses $\epsilon$-greedy randomization).

To this end, we propose a multi-task objective that jointly factorizes the matrix of observations $y_{ij}^c \in S_c$ and the matrix of observations $y_{ij}^t \in S_t$. Our approach is inspired by the work in [Rosenfeld et al.2016] and shares similarities with other domain-adaptation-based models for counterfactual inference, such as the work in [Johansson et al.2016, Shalit et al.2017].

2.1 Predicting Rewards Via Matrix Factorization

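By using a matrix factorization model, we assume that both the expected factual control and treatment rewards can be approximated as linear predictors over the shared user representations $u_i$, as shown in Fig. 1:

$$y_{ij}^c \approx \langle p_j^c, u_i \rangle, \qquad y_{ij}^t \approx \langle p_j^t, u_i \rangle$$

Figure 1: The joint MF problem.

As a result, we can approximate the ITE of a user-product pair $(i, j)$ as the difference between the two, see eq. 1 below:

$$\widehat{ITE}_{ij} = \langle p_j^t, u_i \rangle - \langle p_j^c, u_i \rangle = \langle w_j^\Delta, u_i \rangle, \qquad w_j^\Delta = p_j^t - p_j^c \qquad (1)$$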
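The ITE approximation as a difference of two inner products can be checked numerically with a small sketch. The vectors below are made-up values, and `dot` and `estimated_ite` are illustrative names rather than anything defined in the paper.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def estimated_ite(user, prod_t, prod_c):
    """ITE estimate: inner product of the user vector with the
    treatment-minus-control product vector."""
    delta = [t - c for t, c in zip(prod_t, prod_c)]
    return dot(delta, user)


# Hypothetical 2-dimensional embeddings for one user and one product.
user = [1.0, 2.0]
prod_t = [0.5, 0.25]   # treatment representation of the product
prod_c = [0.25, 0.25]  # control representation of the same product

ite = estimated_ite(user, prod_t, prod_c)
# The same value, computed as the difference of the two predicted rewards.
ite_as_difference = dot(prod_t, user) - dot(prod_c, user)
```

Both routes give the same number because the inner product is linear in its first argument, which is why the distance between the two product vectors controls the size of the estimated effect.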

Proposed joint optimization solution

The joint optimization objective naturally has two terms, one measuring the performance of the solution on the treatment sample and the other on the control sample. The novel part of the objective comes from the additional constraint on the distance between the treatment and control vectors for the same action/item, which can be directly linked to the ITE of the item. We list each of the terms of the overall objective below.

Sub-objective #1: Treatment Loss Term

We define the first part of our joint prediction objective as the supervised predictor for $y_{ij}^t$, trained on the limited sample $S_t$, as shown in eq. 2 below:

$$L_t = \sum_{(i,j,y_{ij}) \in S_t} l_{ij}^t = L(U\Theta_t, Y_t) + \Omega(\Theta_t) \qquad (2)$$

where:

  • $\Theta_t$ is the parameter matrix of treatment product representations.

  • $U$ is the fixed matrix of the user representations.

  • $Y_t$ is the observed rewards matrix.

  • $L$ is an arbitrary loss function.

  • $\Omega(\cdot)$ is a regularization term over the model parameters.

Linking the control and treatment effects

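Additionally, we can use the translation factor $w_j^\Delta = p_j^t - p_j^c$ so that the model built from the treatment data can predict outcomes from the control distribution:

$$y_{ij}^c \approx \langle p_j^t - w_j^\Delta, u_i \rangle$$

Sub-objective #2: Control Loss Term

Now we want to leverage our ample control data, and we can use our treatment product representations through a translation:

$$L_c = \sum_{(i,j,y_{ij}) \in S_c} l_{ij}^c = L(U(\Theta_t - W^\Delta), Y_c) + \Omega(\Theta_t - W^\Delta) + \Omega(W^\Delta)$$

which can be written equivalently as:

$$L_c = L(U\Theta_c, Y_c) + \Omega(\Theta_c) + \Omega(\Theta_t - \Theta_c)$$

where we regularize the control embeddings $\Theta_c$ against the treatment embeddings $\Theta_t$ ($W^\Delta = \Theta_t - \Theta_c$). Since the estimated ITE in eq. 1 is a function of $W^\Delta$, regularizing $W^\Delta$ effectively puts a constraint on the magnitude of the ITE term.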

Overall Joint Objective

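By putting the two tasks together ($L_t$ and $L_c$) and regrouping the loss functions and the regularizer terms, we have that:

$$L_{CausE}^{prod} = L(\Theta_t, \Theta_c) + \Omega_{disc}(\Theta_t - \Theta_c) + \Omega_{embed}(\Theta_t, \Theta_c)$$

where $L(\cdot)$ is the reconstruction loss function for the concatenation matrix of $\Theta_t$ and $\Theta_c$, $\Omega_{disc}(\cdot)$ is a regularization function that weights the discrepancy between the treatment and control product representations, and $\Omega_{embed}(\cdot)$ is a regularization function that weights the representation vectors.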
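The three parts of the joint product objective (reconstruction loss, discrepancy regularizer, embedding regularizer) can be sketched as follows. This is only one possible instantiation: squared error for the reconstruction loss and squared L2 norms for both regularizers are assumptions made here for illustration, and `cause_prod_loss`, `lam_disc`, and `lam_embed` are hypothetical names.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cause_prod_loss(users, theta_t, theta_c, sample_t, sample_c,
                    lam_disc, lam_embed):
    """Joint objective: reconstruction + discrepancy + embedding terms.

    sample_t and sample_c hold (user index, product index, outcome) triples
    from the randomized and the logged policy respectively.
    """
    # Reconstruction loss: treatment events use theta_t, control events theta_c.
    recon = sum((dot(users[i], theta_t[j]) - y) ** 2 for i, j, y in sample_t)
    recon += sum((dot(users[i], theta_c[j]) - y) ** 2 for i, j, y in sample_c)
    # Discrepancy: squared distance between treatment and control product vectors.
    disc = sum((t - c) ** 2
               for row_t, row_c in zip(theta_t, theta_c)
               for t, c in zip(row_t, row_c))
    # Embedding regularizer: squared norm of all product vectors.
    embed = sum(v * v for matrix in (theta_t, theta_c) for row in matrix for v in row)
    return recon + lam_disc * disc + lam_embed * embed


# Hypothetical toy problem with one user and one product.
users = [[1.0, 0.0]]
theta_t = [[1.0, 0.0]]
theta_c = [[0.5, 0.0]]
sample_t = [(0, 0, 1.0)]
sample_c = [(0, 0, 0.0)]

loss = cause_prod_loss(users, theta_t, theta_c, sample_t, sample_c,
                       lam_disc=2.0, lam_embed=0.0)
```

Raising `lam_disc` pulls the control and treatment vectors of each product together, which under eq. 1 shrinks the estimated ITE; setting it to zero decouples the two factorizations.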

Figure 2: The final joint MF objective.

Question: How about user shift?

Suppose the current recommendation solution targets a subset of users, for example active buyers on a website, while the new recommendation policy mainly targets newly signed-up users (modulo randomization, which should give non-zero probabilities to all user-product pairs).

Generalization of the objective to user shift

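Our objective function can be altered to allow the user representations to change as well. Writing $\Gamma_t$ and $\Gamma_c$ for the treatment and control user representation matrices, we obtain the equation below:

$$L_{CausE}^{user} = L(\Gamma_t, \Gamma_c) + \Omega_{disc}(\Gamma_t - \Gamma_c) + \Omega_{embed}(\Gamma_t, \Gamma_c)$$

Putting the loss functions associated with the user and product dimensions together ($L_{CausE}^{prod}$, $L_{CausE}^{user}$), we reach the final loss function for our method, where $\Theta_t$ and $\Theta_c$ are the treatment and control product representation matrices:

$$L_{CausE} = L(\Gamma_t, \Gamma_c, \Theta_t, \Theta_c) + \Omega_{disc}(\Gamma_t - \Gamma_c, \Theta_t - \Theta_c) + \Omega_{embed}(\Gamma_t, \Gamma_c, \Theta_t, \Theta_c)$$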

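Input: Mini-batches of $S_c$ and $S_t$, regularization parameters $\lambda_{embed}$ and $\lambda_{disc}$, learning rate $\eta$
Output: $\Gamma_t, \Gamma_c, \Theta_t, \Theta_c$ - User and Product Control and Treatment Matrices
Random initialization of $\Gamma_t, \Gamma_c, \Theta_t, \Theta_c$;
while not converged do
       read batch of training samples;
       for each product $p_j$ in the batch do
              Update product vector: $\theta_j \leftarrow \theta_j - \eta \nabla L_{CausE}^{prod}$
       end for
       for each user $u_i$ in the batch do
              Update user vector: $\gamma_i \leftarrow \gamma_i - \eta \nabla L_{CausE}^{user}$
       end for
end while
Algorithm 1 CausE Algorithm: Causal Embeddings For Recommendations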
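The loop in Algorithm 1 can be sketched in plain Python. This is a simplified illustration rather than the authors' implementation: it keeps a single shared user matrix (the product-only variant of Section 2.1), uses squared error with L2 regularizers, processes one event at a time instead of mini-batches, and runs for a fixed number of epochs instead of testing convergence. All function names, hyperparameter values, and the toy data are assumptions.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sgd_update(user, own, other, y, lr, lam_embed, lam_disc):
    """One in-place SGD step on a single (user, product, outcome) event.

    `own` is the product vector that explains this event; `other` is the same
    product's vector in the other (control or treatment) matrix and is held
    fixed during this step.
    """
    err = dot(user, own) - y  # derivative of 0.5 * squared error w.r.t. prediction
    for k in range(len(user)):
        grad_own = err * user[k] + lam_embed * own[k] + lam_disc * (own[k] - other[k])
        grad_user = err * own[k] + lam_embed * user[k]
        own[k] -= lr * grad_own
        user[k] -= lr * grad_user


def train_cause_prod(sample_c, sample_t, n_users, n_products, dim=4,
                     lam_embed=0.01, lam_disc=0.1, lr=0.05, epochs=50, seed=0):
    """Fit shared user vectors plus control and treatment product vectors."""
    rng = random.Random(seed)

    def init(rows):
        return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(rows)]

    users, theta_c, theta_t = init(n_users), init(n_products), init(n_products)
    # Tag each event with the matrix that explains it and the matrix it is tied to.
    events = [(i, j, y, theta_c, theta_t) for i, j, y in sample_c]
    events += [(i, j, y, theta_t, theta_c) for i, j, y in sample_t]
    for _ in range(epochs):
        rng.shuffle(events)
        for i, j, y, own, other in events:
            sgd_update(users[i], own[j], other[j], y, lr, lam_embed, lam_disc)
    return users, theta_c, theta_t


# Hypothetical toy data: (user index, product index, binary outcome).
sample_c = [(0, 0, 0.0), (1, 1, 1.0), (0, 1, 0.0), (1, 0, 0.0)]  # large biased log
sample_t = [(0, 0, 1.0)]                                          # small randomized log
users, theta_c, theta_t = train_cause_prod(sample_c, sample_t,
                                           n_users=2, n_products=2)
```

Predictions under random exposure would then be read from `theta_t`, for example `dot(users[i], theta_t[j])`. Extending the sketch to user shift would mean keeping separate control and treatment user matrices and tying them with the same kind of discrepancy term.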

3 Experimental Results

3.1 Experimental Setup

The task is to predict the outcomes under the treatment policy $\pi_t$, where all of the methods have available at training time a large sample of observed recommendation outcomes from $\pi_c$ and a small sample from $\pi_t$. Essentially, this is a classical conversion-rate prediction problem, so we measure Mean Squared Error (MSE) and Negative Log-Likelihood (NLL). We report the lift over the average conversion rate from the test dataset.
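To make the idea of lift over the average conversion rate concrete, here is a hedged sketch. It assumes the baseline is a constant predictor equal to the test-set average conversion rate and defines lift as the relative reduction of an error metric with respect to that baseline. This sign convention, the helper names, and the toy labels are assumptions; the exact definition used in the paper may differ.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def nll(y_true, y_pred, eps=1e-12):
    """Mean negative log-likelihood of binary labels, with clipped probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def lift_over_average_cr(metric, y_true, y_pred):
    """Relative improvement of an error metric over the average-conversion-rate
    baseline (assumed convention: positive means better than the baseline)."""
    avg_cr = sum(y_true) / len(y_true)
    baseline = metric(y_true, [avg_cr] * len(y_true))
    return (baseline - metric(y_true, y_pred)) / baseline


# Hypothetical test labels and model predictions.
y_true = [1, 0, 0, 0]
y_pred = [0.75, 0.25, 0.25, 0.25]

mse_lift = lift_over_average_cr(mse, y_true, y_pred)
nll_lift = lift_over_average_cr(nll, y_true, y_pred)
```

On this toy input the baseline MSE is 0.1875 and the model MSE is 0.0625, so the MSE lift under the assumed convention is 2/3.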


Columns: MovieLens10M (SKEW): MSE lift, NLL lift, AUC | Netflix (SKEW): MSE lift, NLL lift, AUC
Table 1: Results for MovieLens10M and Netflix on the Skewed (SKEW) test datasets. All three versions of the CausE algorithm outperform both the standard and the IPS-weighted causal factorization methods, with CausE-avg and CausE-prod-C also out-performing BanditNet. We can observe that our best approach CausE-prod-C outperforms the best competing approaches WSP2V-blend by a large margin (21% MSE and 20% NLL lifts on the MovieLens10M dataset) and BN-blend (5% AUC lift on MovieLens10M).

3.2 Baselines

We compare our method with the following baselines:

Matrix Factorization Baselines:
  • Bayesian Personalized Ranking (BPR): To compare our approach against a ranking-based method, we use BPR for matrix factorization on implicit feedback data [Rendle et al.2009].

  • Supervised-Prod2Vec (SP2V): As a second factorization baseline, we use a Factorization Machine-like method [Rendle2010] that approximates $y_{ij}$ as a sigmoid over a linear transform of the inner product between the user and product representations.
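The SP2V prediction described above, a sigmoid over a linear transform of the user-product inner product, can be written in a few lines. The parameter names `scale` and `bias` for the linear transform are assumptions made for illustration, as are the example vectors.

```python
import math


def sp2v_predict(user, product, scale=1.0, bias=0.0):
    """Predicted outcome: sigmoid(scale * <user, product> + bias)."""
    z = scale * sum(u * p for u, p in zip(user, product)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical embeddings: an orthogonal pair and an aligned pair.
p_orthogonal = sp2v_predict([1.0, 0.0], [0.0, 1.0])
p_aligned = sp2v_predict([1.0, 0.0], [2.0, 0.0])
```

With a zero inner product and zero bias the prediction is exactly 0.5, and it increases monotonically with the inner product when `scale` is positive.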

Causal Inference Baselines:
  • Weighted-SupervisedP2V (WSP2V): We employ the SP2V algorithm on propensity-weighted data. This method is similar to the Propensity-Scored Matrix Factorization (PMF) from [Schnabel et al.2016], but with a cross-entropy reconstruction loss instead of MSE/MAE.

  • BanditNet (BN): To utilize BanditNet [Joachims et al.2018] as a baseline, we use SP2V as our target policy. For the existing policy $\pi_c$, we model the behavior of the recommendation system as a popularity-based solution, described by the marginal probability of each product in the training data.
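A popularity-based model of the existing policy, as used for the causal inference baselines above, can be sketched as the marginal frequency of each product in the training events. The event format, the product identifiers, and the helper name `popularity_propensities` are hypothetical.

```python
from collections import Counter


def popularity_propensities(events):
    """Marginal probability of each product among (user, product, outcome) events."""
    counts = Counter(product for _, product, _ in events)
    total = sum(counts.values())
    return {product: count / total for product, count in counts.items()}


# Hypothetical training events: product "a" appears three times, "b" once.
events = [(0, "a", 1), (1, "a", 0), (2, "b", 1), (3, "a", 0)]

propensities = popularity_propensities(events)
# Inverse-propensity weights that a propensity-weighted learner could apply.
weights = {product: 1.0 / p for product, p in propensities.items()}
```

Rarely shown products receive the largest weights, which is the same mechanism that drives the variance issue of IPS discussed in Section 1.1.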

3.3 Experimental Datasets

We use the Netflix and MovieLens10M explicit rating datasets (1-5). In order to validate our method, we preprocess them as follows: we binarize the ratings $y_{ij}$ by setting 5-star ratings to 1 (click) and everything else to zero (view only), and generate a skewed dataset (SKEW) with a 70/10/20 train/validation/test event split that simulates rewards collected from uniform exposure $\pi_t^{rand}$, following a protocol similar to the one presented in previous counterfactual estimation work such as [Liang et al.2016, Swaminathan and Joachims2015] and described in detail in the long version of our paper [Bonner and Vasile2018].
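The binarization and the 70/10/20 event split can be sketched as follows. The snippet deliberately does not implement the SKEW resampling that simulates uniform exposure, since that protocol is only described in the long version of the paper; the plain random split, the function names, and the toy ratings are assumptions.

```python
import random


def binarize(ratings):
    """Map (user, product, rating) to (user, product, 1 if rating == 5 else 0)."""
    return [(user, product, 1 if rating == 5 else 0)
            for user, product, rating in ratings]


def split_events(events, percents=(70, 10, 20), seed=0):
    """Shuffle events and cut them into train/validation/test by percentage."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_valid = len(shuffled) * percents[1] // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical explicit ratings on a 1-5 scale: (user, product, rating).
ratings = [(u, p, 1 + (u + p) % 5) for u in range(5) for p in range(2)]
events = binarize(ratings)
train, valid, test = split_events(events)
```

Integer percentages are used for the split sizes to avoid floating-point rounding surprises when computing the cut points.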

3.3.1 Experimental Setup: Exploration Sample

We define 5 possible setups of incorporating the exploration data:

  • No adaptation (no) - trained only on $S_c$.

  • Blended adaptation (blend) - trained on the blend of the $S_c$ and $S_t$ samples.

  • Test adaptation (test) - trained only on the $S_t$ samples.

  • Product adaptation (prod) - separate treatment embedding for each product based on the $S_t$ sample.

  • Average adaptation (avg) - average treatment product obtained by pooling all of the $S_t$ sample into a single vector.

3.4 Results

Table 1 displays the results of running all the approaches on the datasets. Our proposed CausE method significantly outperforms all baselines across both datasets, demonstrating that it has a better capacity to leverage the small test distribution sample $S_t$. We observe that, out of the three CausE variants, CausE-prod-C, the variant that uses the regularized control matrix, clearly outperforms the others. Further, Figure 3 highlights how CausE is able to make better use of increasing quantities of test distribution data present in the training data compared with the baselines.

Figure 3: Change in MSE lift as more test set is injected into the blend training dataset.

4 Conclusions

We have introduced a novel method for factorizing matrices of user implicit feedback that optimizes for causal recommendation outcomes. We show that the objective of optimizing for causal recommendations is equivalent to factorizing a matrix of user responses collected under uniform exposure to item recommendations. We propose the CausE algorithm, which is a simple extension to current matrix factorization algorithms that adds a regularizer term on the discrepancy between the item vectors used to fit the biased sample $S_c$ and the vectors that fit the uniform exposure sample $S_t$.


References

  • [Bonner and Vasile2018] Stephen Bonner and Flavian Vasile. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 104–112. ACM, 2018.
  • [Grbovic et al.2015] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1809–1818, New York, NY, USA, 2015. ACM.
  • [Hidasi et al.2015] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
  • [Hidasi et al.2016] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 241–248. ACM, 2016.
  • [Joachims et al.2018] Thorsten Joachims, Artem Grotov, Adith Swaminathan, and Maarten de Rijke. Deep learning with logged bandit feedback. Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [Johansson et al.2016] Fredrik D Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. arXiv preprint arXiv:1605.03661, 2016.
  • [Liang et al.2016] Dawen Liang, Laurent Charlin, and David M Blei. Causal inference for recommendation. 2016.
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, 2014. Association for Computational Linguistics.
  • [Rendle et al.2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.
  • [Rendle2010] Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.
  • [Rosenfeld et al.2016] Nir Rosenfeld, Yishay Mansour, and Elad Yom-Tov. Predicting counterfactuals from large historical data and small randomized trials. arXiv preprint arXiv:1610.07667, 2016.
  • [Rubin1974] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
  • [Schnabel et al.2016] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1670–1679, 2016.
  • [Shalit et al.2017] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3076–3085. JMLR.org, 2017.
  • [Swaminathan and Joachims2015] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.
  • [Vasile et al.2016] Flavian Vasile, Elena Smirnova, and Alexis Conneau. Meta-prod2vec: Product embeddings using side-information for recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 225–232. ACM, 2016.