1 Introduction
In recent years, online commerce has grown faster than traditional commerce. As such, research work on recommender systems has also grown significantly, with recent Deep Learning (DL) approaches achieving state-of-the-art results. Broadly, these DL approaches frame the recommendation task as either:

A distance learning problem between pairs of products or pairs of users and products, measured with Mean Squared Error (MSE) and Area Under the Curve (AUC), as in the work of [Grbovic et al.2015, Vasile et al.2016, Pennington et al.2014].

A next item prediction problem that models user behavior and predicts the next action, measured with ranking metrics such as Precision@K and Normalized Discounted Cumulative Gain (NDCG), as presented in [Hidasi et al.2015, Hidasi et al.2016].
However, we argue that both approaches fail to model the inherently interventionist nature of recommendation, which should not only model the organic user behavior but also attempt to influence it optimally according to a preset objective.
Ideally, the change in user behavior should be measured against a case where no recommendations are shown. This is not an easy problem, since we do not know what the user would have done in the absence of recommendations, and it is a natural fit for the causal / counterfactual inference paradigm.
Using a causal vocabulary, we are interested in finding the treatment recommendation policy that maximizes the reward obtained from each user with respect to the control recommendation policy. This objective is traditionally denoted as the Individual Treatment Effect (ITE) [Rubin1974].
In our work, we introduce a modification to the classical matrix factorization approach which leverages both a large sample of biased recommendation outcomes and a small sample of randomized recommendation outcomes in order to create user and product representations. We show that, using our method, the associated pairwise distance between user and item pairs is more strongly aligned with the corresponding ITE of recommending a particular item to the user than in both traditional matrix factorization and causal inference approaches.
1.1 Causal Vocabulary
Below we briefly introduce the causal vocabulary and notation that we will be using throughout the paper.
The Causal Inference Objective.
In the classical setup, we want to determine the causal effect of one single action which constitutes the treatment versus the control case where no action or a placebo action is undertaken (do vs. not do). In the stochastic setup, we want to determine the causal effect of a stochastic treatment policy versus the baseline control policy. In this case, both treatment and control are distributions over all possible actions. We retrieve the classical setup as a special case.
Recommendation Policy.
We assume a stochastic policy $\pi_x$ that associates to each user $u_i$ and product $p_j$ a probability for the user $u_i$ to be exposed to the recommendation of product $p_j$:
$$p_j \sim \pi_x(\cdot \mid u_i)$$
For simplicity we assume showing no products is also a valid intervention in the product set $\mathcal{P}$.
Policy Rewards.
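Reward $r_{ij}$ is distributed according to an unknown conditional distribution $r$ depending on $u_i$ and $p_j$:
$$r_{ij} \sim r(\cdot \mid u_i, p_j)$$
The reward $R^{\pi_x}$ associated with a policy $\pi_x$ is equal to the sum of the rewards collected across all incoming users by using the associated personalized product exposure probability:
$$R^{\pi_x} = \sum_{ij} r_{ij}\, \pi_x(p_j \mid u_i)\, p(u_i) = \sum_{ij} R_{ij}^{\pi_x}$$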
Individual Treatment Effect.
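The Individual Treatment Effect (ITE) value of a policy $\pi_x$ for a given user $u_i$ and a product $p_j$ is defined as the difference between its reward and the control policy reward:
$$ITE_{ij}^{\pi_x} = R_{ij}^{\pi_x} - R_{ij}^{\pi_c}$$
We are interested in finding the policy $\pi^*$ with the highest sum of ITEs:
$$\pi^* = \arg\max_{\pi_x} \{ ITE^{\pi_x} \}$$
where:
$$ITE^{\pi_x} = \sum_{ij} ITE_{ij}^{\pi_x}$$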
Optimal ITE Policy.
It is easy to show that, starting from any control policy $\pi_c$, the best incremental policy $\pi^*$ is the policy that shows deterministically to each user the product $p_i^*$ with the highest personalized reward $r_i^*$:
$$\pi^*(p_j \mid u_i) = \pi_{det}(p_j \mid u_i) = \begin{cases} 1 & \text{if } p_j = p_i^* \\ 0 & \text{otherwise} \end{cases}$$
Note: This assumes non-fatigability, i.e. non-diminishing returns of recommending the same product repeatedly to the user (no user state / repeated action effects at play).
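As a small worked illustration of the definitions above, the following sketch computes the reward of a policy as the exposure-weighted sum of per-pair rewards and checks that showing each user their best product yields a positive summed ITE over a control policy. All numbers, and the names `policy_reward` and `deterministic_policy`, are made up for illustration and are not taken from the paper.

```python
# Toy setup: every number here is an illustrative assumption.
# r[i][j] is the expected reward of showing product j to user i.
r = [[0.10, 0.30, 0.05],
     [0.20, 0.10, 0.40]]
p_user = [0.5, 0.5]  # probability that user i is the incoming user


def policy_reward(pi, r, p_user):
    """Sum over users and products of reward * exposure prob. * user prob."""
    return sum(r[i][j] * pi[i][j] * p_user[i]
               for i in range(len(r))
               for j in range(len(r[i])))


def deterministic_policy(r):
    """Show each user, with probability 1, their highest-reward product."""
    pi = []
    for row in r:
        best = max(range(len(row)), key=row.__getitem__)
        pi.append([1.0 if j == best else 0.0 for j in range(len(row))])
    return pi


# An arbitrary stochastic control policy (rows sum to 1).
pi_c = [[1 / 3, 1 / 3, 1 / 3],
        [0.6, 0.3, 0.1]]
pi_det = deterministic_policy(r)

reward_c = policy_reward(pi_c, r, p_user)
reward_det = policy_reward(pi_det, r, p_user)

# Summed ITE of the deterministic policy with respect to the control policy.
ite_det = reward_det - reward_c
print(reward_c, reward_det, ite_det)
```

Because the reward is linear in the exposure probabilities, putting all probability mass on each user's best product cannot be beaten by any other stochastic policy in this toy setting.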
IPS Solution For $\pi^*$
In order to find the optimal policy $\pi^*$ we need to find for each user $u_i$ the product with the highest personalized reward $r_i^*$.
In practice we do not observe $r_{ij}$ directly, but $y_{ij} \sim r_{ij}\, \pi_x(p_j \mid u_i)$.
The current approach to estimating $r_{ij}$ consists in using Inverse Propensity Scoring (IPS)-based methods to predict the unobserved reward $r_{ij}$:
$$\hat{r}_{ij} \approx \frac{y_{ij}}{\pi_c(p_j \mid u_i)}$$
This assumes we have incorporated randomization in the current policy $\pi_c$. Even with randomization, the main shortcoming of IPS-based estimators is that they do not handle well big shifts in exposure probability between treatment and control policies (products with low probability under the logging policy will tend to have higher predicted rewards).
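The variance problem can be made concrete with a short simulation. The sketch below assumes a single user-product pair with a fixed true reward, logs exposures under a given propensity, and forms the per-event IPS terms (observed outcome divided by propensity). The function name `ips_terms`, the propensity values and the sample size are illustrative assumptions.

```python
import random


def ips_terms(r_true, propensity, n_events, rng):
    """Simulate logged events and return the per-event IPS terms y / propensity.

    The product is shown with probability `propensity`; when shown, the user
    clicks with probability `r_true`. Unshown products yield y = 0.
    """
    terms = []
    for _ in range(n_events):
        shown = rng.random() < propensity
        clicked = shown and (rng.random() < r_true)
        y = 1.0 if clicked else 0.0
        terms.append(y / propensity)
    return terms


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
r_true = 0.2

# Same true reward, very different exposure probabilities under logging.
frequent = ips_terms(r_true, propensity=0.5, n_events=20000, rng=rng)
rare = ips_terms(r_true, propensity=0.01, n_events=20000, rng=rng)

print("frequent: mean", mean(frequent), "variance", variance(frequent))
print("rare:     mean", mean(rare), "variance", variance(rare))
```

Both estimates are unbiased for the true reward, but each click on the rarely shown product contributes a term of 100, so its variance is far larger. This is the sense in which rarely exposed products can end up with inflated predicted rewards.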
Addressing the variance issues of IPS.
It is easy to observe that in order to obtain minimum variance we should collect data using fully randomized recommendations, i.e. when:
$$\pi_c = \pi^{rand}$$
However, this means zero recommendation performance and therefore cannot be a solution in practice.
Our question: Could we learn from $\pi_c$ a predictor for performance under $\pi^{rand}$ and use it to compute the optimal product recommendations $p_i^*$?
2 Our Approach: Causal Embeddings (CausE)
We are interested in building a good predictor for recommendation outcomes under random exposure for all the user-product pairs, which we denote as $\hat{y}_{ij}^{rand}$. We make the assumption that we have access to a large sample $S_c$ from the logging policy $\pi_c$ and a small sample $S_t$ from the randomized treatment policy $\pi_t^{rand}$ (e.g. the logging policy $\pi_c$ uses $\epsilon$-greedy randomization).
To this end, we propose a multi-task objective that jointly factorizes the matrix of observations $y_{ij}^c \in S_c$ and the matrix of observations $y_{ij}^t \in S_t$. Our approach is inspired by the work in [Rosenfeld et al.2016] and shares similarities with other domain-adaptation based models for counterfactual inference such as the work in [Johansson et al.2016, Shalit et al.2017].
2.1 Predicting Rewards Via Matrix Factorization
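By using a matrix factorization model, we assume that both the expected factual control and treatment rewards can be approximated as linear predictors over the shared user representations $u_i$, as shown in Fig. 1.
As a result, we can approximate the ITE of a user-product pair as the difference between the two, see eq. 1 below:
$$\widehat{ITE}_{ij} = \langle p_j^t, u_i \rangle - \langle p_j^c, u_i \rangle = \langle w_j^{\Delta}, u_i \rangle \tag{1}$$
where $p_j^t$ and $p_j^c$ are the treatment and control representations of product $j$ and $w_j^{\Delta} = p_j^t - p_j^c$.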
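A minimal numerical sketch of eq. 1, assuming random embeddings rather than trained ones: the estimated ITE is the difference between the treatment and control linear predictors, which equals a single inner product with the difference of the two product vectors. The array names `U`, `theta_t` and `theta_c` and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, dim = 4, 3, 5

# Random stand-ins for learned representations (illustration only).
U = rng.normal(size=(n_users, dim))           # one row per user
theta_t = rng.normal(size=(n_products, dim))  # treatment product vectors
theta_c = rng.normal(size=(n_products, dim))  # control product vectors

# Linear predictors of the treatment and control rewards.
y_hat_t = U @ theta_t.T
y_hat_c = U @ theta_c.T

# Estimated ITE as the difference of the two predictors ...
ite_hat = y_hat_t - y_hat_c
# ... which is the same as one inner product with the difference vectors.
ite_via_delta = U @ (theta_t - theta_c).T

# Product with the highest estimated ITE for each user.
best_product = ite_hat.argmax(axis=1)
print(np.allclose(ite_hat, ite_via_delta), best_product)
```

The equality holds because the inner product is linear in the product vector, which is what lets a constraint on the difference vectors act as a constraint on the estimated ITE.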
Proposed joint optimization solution
The joint optimization objective naturally has two terms, one measuring the performance of the solution on the treatment sample and the other on the control sample. The novel part of the objective comes from the additional constraint on the distance between the treatment and control vectors for the same action/item, which can be directly linked to the ITE of the item. We list each of the terms of the overall objective below.
Sub-objective #1: Treatment Loss Term
We define the first part of our joint prediction objective as the supervised predictor for $y_{ij}^t$, trained on the limited sample $S_t$, as shown in eq. 2 below:
$$L_t = \sum_{(i,j,y_{ij}) \in S_t} l_{ij}^t = L(U\Theta_t, Y_t) + \Omega(\Theta_t) \tag{2}$$
where:

$\Theta_t$ is the parameter matrix of treatment product representations.

$U$ is the fixed matrix of the user representations.

$Y_t$ is the observed rewards matrix.

$L$ is an arbitrary loss function.

$\Omega(\cdot)$ is a regularization term over the model parameters.
Linking the control and treatment effects
Additionally, we can use the translation factor $w_j^{\Delta}$ in order to use the model built from the treatment data to predict outcomes from the control distribution $S_c$:
$$y_{ij}^c \approx \langle p_j^t - w_j^{\Delta}, u_i \rangle$$
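Sub-objective #2: Control Loss Term
Now we want to leverage our ample control data $S_c$ and we can use our treatment product representations through a translation:
$$L_c = \sum_{(i,j,y_{ij}) \in S_c} l_{ij}^c = L(U(\Theta_t - W^{\Delta}), Y_c) + \Omega(\Theta_t - W^{\Delta}) + \Omega(W^{\Delta})$$
which can be written equivalently as:
$$L_c = L(U\Theta_c, Y_c) + \Omega(\Theta_c) + \Omega(\Theta_t - \Theta_c) \tag{3}$$
where we regularize the control $\Theta_c$ against the treatment embeddings $\Theta_t$ ($W^{\Delta} = \Theta_t - \Theta_c$). As shown in eq. 4 below, $\widehat{ITE}$ is a function of $W^{\Delta}$. Therefore, by regularizing $W^{\Delta}$ we are effectively putting a constraint on the magnitude of the $\widehat{ITE}$ term.
$$\widehat{ITE} = U\Theta_t - U\Theta_c = U W^{\Delta} \tag{4}$$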
Overall Joint Objective
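By putting the two tasks together ($L_t$ and $L_c$) and regrouping the loss functions and the regularizer terms, we have that:
$$L_{CausE}^{prod} = L(U\Theta_t, Y_t) + L(U\Theta_c, Y_c) + \Omega_{embed}(\Theta_t, \Theta_c) + \Omega_{disc}(\Theta_t - \Theta_c) \tag{5}$$
where $L(\cdot)$ is the reconstruction loss function for the concatenation matrix of $Y_t$ and $Y_c$, $\Omega_{disc}(\cdot)$ is a regularization function that weights the discrepancy between the treatment and control product representations and $\Omega_{embed}(\cdot)$ is a regularization function that weights the representation vectors.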
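To make the structure of the product-level joint objective concrete, here is a minimal sketch under stated assumptions: squared error as the reconstruction loss, squared L2 norms for both regularizers, fixed user vectors, synthetic data, and plain gradient descent. None of these choices is confirmed by the text, which leaves the loss and regularizers arbitrary. The names `cause_prod_loss`, `lam_embed`, `lam_disc` and the observation masks are hypothetical.

```python
import numpy as np


def cause_prod_loss(U, theta_t, theta_c, Y_t, M_t, Y_c, M_c,
                    lam_embed=0.01, lam_disc=0.1):
    """Joint loss and its gradients w.r.t. the two product matrices.

    U is users x dim and each theta is products x dim, so predictions are
    U @ theta.T. M_t and M_c are 0/1 masks of the observed pairs in the
    treatment and control samples. Squared error and L2 penalties are
    assumptions made for this sketch.
    """
    err_t = M_t * (U @ theta_t.T - Y_t)
    err_c = M_c * (U @ theta_c.T - Y_c)
    diff = theta_t - theta_c

    recon = (err_t ** 2).sum() + (err_c ** 2).sum()
    embed = lam_embed * ((theta_t ** 2).sum() + (theta_c ** 2).sum())
    disc = lam_disc * (diff ** 2).sum()

    grad_t = 2 * err_t.T @ U + 2 * lam_embed * theta_t + 2 * lam_disc * diff
    grad_c = 2 * err_c.T @ U + 2 * lam_embed * theta_c - 2 * lam_disc * diff
    return recon + embed + disc, grad_t, grad_c


# Synthetic data: treatment vectors are a small shift of the control vectors.
rng = np.random.default_rng(0)
n_users, n_products, dim = 30, 8, 4
U = rng.normal(size=(n_users, dim))
true_c = rng.normal(size=(n_products, dim))
true_t = true_c + 0.1 * rng.normal(size=(n_products, dim))
Y_c, Y_t = U @ true_c.T, U @ true_t.T

# Large control sample, small treatment sample.
M_c = (rng.random((n_users, n_products)) < 0.8).astype(float)
M_t = (rng.random((n_users, n_products)) < 0.1).astype(float)

theta_t = np.zeros((n_products, dim))
theta_c = np.zeros((n_products, dim))
initial_loss, _, _ = cause_prod_loss(U, theta_t, theta_c, Y_t, M_t, Y_c, M_c)

step = 0.005  # small fixed step size chosen for this toy problem
for _ in range(300):
    _, g_t, g_c = cause_prod_loss(U, theta_t, theta_c, Y_t, M_t, Y_c, M_c)
    theta_t -= step * g_t
    theta_c -= step * g_c

final_loss, _, _ = cause_prod_loss(U, theta_t, theta_c, Y_t, M_t, Y_c, M_c)
print(initial_loss, final_loss)
```

The discrepancy penalty is what lets the sparsely observed treatment vectors borrow strength from the control vectors: where a product has few treatment observations, its treatment vector is pulled toward its control vector.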
Question: How about user shift?
The current recommendation solution may target a subset of users, for example active buyers on a website, while the new recommendation policy mainly targets newly signed-up users (modulo randomization, which should give non-zero probabilities for all user-product pairs).
Generalization of the objective to user shift
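Our objective function can be altered to allow the user representations to change, which gives the equation below:
$$L_{CausE}^{user} = L(\Gamma_t P, Y_t) + L(\Gamma_c P, Y_c) + \Omega_{embed}(\Gamma_t, \Gamma_c) + \Omega_{disc}(\Gamma_t - \Gamma_c)$$
where $\Gamma_t$ and $\Gamma_c$ are the treatment and control user representation matrices and $P$ is the fixed matrix of product representations.
Putting the loss functions associated with the user and product dimension together ($L_{CausE}^{prod}$, $L_{CausE}^{user}$), we reach the final loss function for our method:
$$L_{CausE} = L(\Gamma\Theta, Y) + \Omega_{embed}(\Gamma, \Theta) + \Omega_{disc}(\Gamma_t - \Gamma_c, \Theta_t - \Theta_c) \tag{6}$$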
3 Experimental Results
3.1 Experimental Setup
The task is predicting the outcomes $y_{ij}^t$ under treatment policy $\pi_t^{rand}$, where all of the methods have available at training time a large sample of observed recommendation outcomes from $\pi_c$ and a small sample from $\pi_t^{rand}$. Essentially this is a classical conversion-rate prediction problem, so we measure Mean-Squared Error (MSE) and Negative Log-Likelihood (NLL). We report lift over the average conversion rate of the test dataset:
Table 1
Method  MovieLens10M (SKEW)  Netflix (SKEW)
        MSE lift  NLL lift  AUC  MSE lift  NLL lift  AUC
BPR-no
BPR-blend
SP2V-no
SP2V-blend
SP2V-test
WSP2V-no
WSP2V-blend
BN-blend
CausE-avg
CausE-prod-T
CausE-prod-C
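The lift metrics can be sketched as follows. The exact lift formula is not given in this text, so the sketch assumes lift is the relative reduction of a lower-is-better metric with respect to a baseline that always predicts the average conversion rate of the test set. The toy labels and predictions are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between binary outcomes and predicted rates."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def nll(y_true, y_pred, eps=1e-12):
    """Mean negative log-likelihood of binary outcomes."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def lift(metric_model, metric_baseline):
    """Assumed convention: relative reduction of a lower-is-better metric."""
    return (metric_baseline - metric_model) / metric_baseline


# Toy test set (illustrative values only).
y_test = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
avg_cr = sum(y_test) / len(y_test)  # average conversion rate baseline
baseline_pred = [avg_cr] * len(y_test)
model_pred = [0.7, 0.1, 0.2, 0.1, 0.6, 0.1, 0.3, 0.1, 0.1, 0.2]

mse_lift = lift(mse(y_test, model_pred), mse(y_test, baseline_pred))
nll_lift = lift(nll(y_test, model_pred), nll(y_test, baseline_pred))
print(mse_lift, nll_lift)
```

Under this convention a positive lift means the model beats the constant average-rate predictor, and a lift of zero means it does no better.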
3.2 Baselines
We compare our method with the following baselines:
Matrix Factorization Baselines:

Bayesian Personalized Ranking (BPR): To compare our approach against a ranking-based method, we use Bayesian Personalized Ranking (BPR) for matrix factorization on implicit feedback data [Rendle et al.2009].

Supervised-Prod2Vec (SP2V): As a second factorization baseline we use a Factorization Machine-like method [Rendle2010] that approximates $y_{ij}$ as a sigmoid over a linear transform of the inner product between the user and product representations.
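The SP2V prediction rule described above can be sketched in a few lines. The parameter names `scale` and `bias` for the linear transform are hypothetical, and the vectors are toy values.

```python
import math


def sp2v_predict(user_vec, product_vec, scale=1.0, bias=0.0):
    """Sigmoid over a linear transform of the user-product inner product.

    `scale` and `bias` are hypothetical names for the parameters of the
    linear transform; in a trained model they would be learned.
    """
    dot = sum(u * p for u, p in zip(user_vec, product_vec))
    return 1.0 / (1.0 + math.exp(-(scale * dot + bias)))


# Toy vectors (illustration only).
aligned = sp2v_predict([1.0, 0.5], [0.8, 0.4])
opposed = sp2v_predict([1.0, 0.5], [-0.8, -0.4])
neutral = sp2v_predict([0.0, 0.0], [1.0, 2.0])
print(aligned, opposed, neutral)
```

The sigmoid keeps the output in (0, 1), so it can be read as a predicted click probability and trained with a cross-entropy loss.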
Causal Inference Baselines:

Weighted-Supervised-P2V (WSP2V): We employ the SP2V algorithm on propensity-weighted data. This method is similar to the Propensity-Scored Matrix Factorization (PMF) from [Schnabel et al.2016] but with cross-entropy reconstruction loss instead of MSE/MAE.

BanditNet (BN): To utilize BanditNet [Joachims et al.2018] as a baseline, we use SP2V as our target policy. For the existing policy $\pi_c$, we model the behavior of the recommendation system as a popularity-based solution, described by the marginal probability of each product in the training data.
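Two ingredients of the causal inference baselines can be sketched together: a popularity-based logging policy, estimated as the marginal frequency of each product in the logged data, and a cross-entropy loss whose terms are weighted by inverse propensity. This is a simplified illustration, not the implementation of WSP2V or BanditNet; the function names and the toy log are assumptions.

```python
import math
from collections import Counter


def popularity_propensities(logged_products):
    """Marginal probability of each product in the logged data."""
    counts = Counter(logged_products)
    total = len(logged_products)
    return {product: count / total for product, count in counts.items()}


def weighted_cross_entropy(events, propensity, eps=1e-12):
    """Mean cross-entropy with each event weighted by 1 / propensity.

    `events` is a list of (product, label, predicted_probability) tuples.
    """
    total = 0.0
    for product, y, p_hat in events:
        weight = 1.0 / propensity[product]
        p_hat = min(max(p_hat, eps), 1.0 - eps)  # avoid log(0)
        total -= weight * (y * math.log(p_hat) + (1 - y) * math.log(1.0 - p_hat))
    return total / len(events)


# Toy log: product "a" is shown four times as often as product "b".
logged = ["a"] * 8 + ["b"] * 2
propensity = popularity_propensities(logged)

loss_popular = weighted_cross_entropy([("a", 1, 0.6)], propensity)
loss_rare = weighted_cross_entropy([("b", 1, 0.6)], propensity)
print(propensity, loss_popular, loss_rare)
```

With identical labels and predictions, the event on the rarely shown product carries four times the loss, which is how propensity weighting compensates for exposure bias in the log.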
3.3 Experimental Datasets
We use the Netflix and MovieLens10M explicit rating datasets (ratings from 1 to 5). In order to validate our method, we preprocess them as follows: we binarize the ratings $y_{ij}$ by setting 5-star ratings to 1 (click) and everything else to zero (view only), and generate a skewed dataset (SKEW) with a 70/10/20 train/validation/test event split that simulates rewards collected from uniform exposure $\pi_t^{rand}$, following a protocol similar to the one presented in previous counterfactual estimation work such as [Liang et al.2016, Swaminathan and Joachims2015] and described in detail in the long version of our paper [Bonner and Vasile2018].
3.3.1 Experimental Setup: Exploration Sample
We define 5 possible setups of incorporating the exploration data:

No adaptation (no): trained only on $S_c$.

Blended adaptation (blend): trained on the blend of the $S_c$ and $S_t$ samples.

Test adaptation (test): trained only on the $S_t$ samples.

Product adaptation (prod): separate treatment embedding for each product based on the $S_t$ sample.

Average adaptation (avg): average treatment product by pooling all the $S_t$ sample into a single vector.
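The rating binarization, the 70/10/20 event split and the first three setups can be sketched as below. This does not reproduce the SKEW protocol, which is only described in the long version of the paper. The random ratings, the arbitrary division of the training events into a control and an exploration sample, and the function names are all assumptions for illustration.

```python
import random


def binarize(rating):
    """5-star ratings become 1 (click); everything else becomes 0 (view only)."""
    return 1 if rating == 5 else 0


def split_events(events, rng, fractions=(0.7, 0.1, 0.2)):
    """Shuffle events and split them into train / validation / test."""
    events = list(events)
    rng.shuffle(events)
    n_train = round(fractions[0] * len(events))
    n_valid = round(fractions[1] * len(events))
    return (events[:n_train],
            events[n_train:n_train + n_valid],
            events[n_train + n_valid:])


def training_events(setup, s_c, s_t):
    """Select training data for the 'no', 'blend' and 'test' setups."""
    if setup == "no":
        return list(s_c)
    if setup == "blend":
        return list(s_c) + list(s_t)
    if setup == "test":
        return list(s_t)
    raise ValueError(f"unsupported setup: {setup}")


rng = random.Random(0)
# Random stand-in ratings for 10 users x 10 products (not real data).
ratings = [(u, p, rng.randint(1, 5)) for u in range(10) for p in range(10)]
events = [(u, p, binarize(r)) for u, p, r in ratings]

train, valid, test = split_events(events, rng)

# Arbitrary stand-in for the large control and small exploration samples.
s_c, s_t = train[:60], train[60:]
for setup in ("no", "blend", "test"):
    print(setup, len(training_events(setup, s_c, s_t)))
```

The prod and avg setups are not covered here because they change the model (per-product or pooled treatment embeddings) rather than the choice of training events.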
3.4 Results
Table 1 displays the results for running all the approaches on the datasets. Our proposed CausE method significantly outperforms all baselines across both datasets, demonstrating that it has a better capacity to leverage the small test distribution sample $S_t$. We observe that, out of the three CausE variants, CausE-prod-C, the variant that uses the regularized control matrix, clearly outperforms the others. Further, Figure 3 highlights how CausE is able to make better use of increasing quantities of test distribution data present in the training data compared with the baselines.
4 Conclusions
We have introduced a novel method for factorizing matrices of user implicit feedback that optimizes for causal recommendation outcomes. We show that the objective of optimizing for causal recommendations is equivalent to factorizing a matrix of user responses collected under uniform exposure to item recommendations. We propose the CausE algorithm, which is a simple extension to current matrix factorization algorithms that adds a regularizer term on the discrepancy between the item vectors used to fit the biased sample $S_c$ and the vectors that fit the uniform exposure sample $S_t$.
References
 [Bonner and Vasile2018] Stephen Bonner and Flavian Vasile. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 104–112. ACM, 2018.
 [Grbovic et al.2015] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1809–1818, New York, NY, USA, 2015. ACM.
 [Hidasi et al.2015] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

 [Hidasi et al.2016] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 241–248. ACM, 2016.
 [Joachims et al.2018] Thorsten Joachims, Artem Grotov, Adith Swaminathan, and Maarten de Rijke. Deep learning with logged bandit feedback. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
 [Johansson et al.2016] Fredrik D Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. arXiv preprint arXiv:1605.03661, 2016.
 [Liang et al.2016] Dawen Liang, Laurent Charlin, and David M Blei. Causal inference for recommendation. 2016.

 [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, 2014. Association for Computational Linguistics.
 [Rendle et al.2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.
 [Rendle2010] Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.
 [Rosenfeld et al.2016] Nir Rosenfeld, Yishay Mansour, and Elad Yom-Tov. Predicting counterfactuals from large historical data and small randomized trials. arXiv preprint arXiv:1610.07667, 2016.
 [Rubin1974] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

 [Schnabel et al.2016] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pages 1670–1679, 2016.
 [Shalit et al.2017] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3076–3085. JMLR.org, 2017.
 [Swaminathan and Joachims2015] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.
 [Vasile et al.2016] Flavian Vasile, Elena Smirnova, and Alexis Conneau. Meta-Prod2vec: Product embeddings using side-information for recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 225–232. ACM, 2016.