# Unsupervised Domain Adaptation Meets Offline Recommender Learning

To construct a well-performing recommender offline, eliminating the selection bias of the rating feedback is critical. A promising current solution to this challenge is the causality-based approach using propensity scoring. However, the performance of existing propensity-based algorithms can be significantly affected by propensity estimation bias. To alleviate this problem, we formulate missing-not-at-random recommendation as an unsupervised domain adaptation problem and derive a propensity-agnostic generalization error bound. We further propose a corresponding algorithm that minimizes the bound via adversarial learning. Empirical evaluation using the Yahoo! R3 dataset demonstrates the effectiveness and real-world applicability of the proposed approach.


## 1 Introduction

The main objective of recommender systems is to obtain a well-performing rating predictor from sparse observed rating feedback. A major challenge in this process is that the missing mechanism of most real-world rating datasets is missing-not-at-random (MNAR). The MNAR mechanism arises from two factors. The first is the past recommendation policy. If the past policy recommended popular items with high probability, then the ratings observed under that policy include more data on popular items [5]. The other is users’ self-selection. For example, users tend to rate items for which they have positive preferences, and the ratings for negative preferences are more likely to be missing [24, 17].

The MNAR problem makes it difficult to learn rating predictors from observable data [29], because naive methods often lead to sub-optimal and biased recommendations under MNAR settings [24, 25]. One of the most established solutions to the problem is the propensity-based approach. It defines the probability of each instance being observed as the propensity score and obtains an unbiased estimator for the true metric of interest by weighting each data point by the inverse of its propensity [24, 16, 30]. In general, this unbiasedness is desirable; however, it is ensured only when the true propensities are available, and the performance of propensity-based algorithms is widely known to be highly sensitive to the propensity estimation method [23, 22]. In real-world recommender systems, the true propensities are mostly unknown, which leads to severe bias in the estimation of the loss function of interest.

To overcome this limitation of previous propensity-based methods, we establish a new theory of MNAR recommendation inspired by the theoretical framework of unsupervised domain adaptation. Similar to causal inference, unsupervised domain adaptation addresses problem settings in which the feature distributions of the training and test sets differ. Moreover, methods of unsupervised domain adaptation generally utilize distance metrics measuring the dissimilarity between probability distributions and do not depend on propensity weighting techniques [7, 6, 20, 21]. Thus, they are expected to be useful for solving the problem caused by propensity estimation. However, the connection between the MNAR recommendation and unsupervised domain adaptation has not yet been thoroughly investigated.

To bridge the two potentially related fields, we first define a discrepancy metric to measure the distance between the two missing mechanisms inspired by domain discrepancy measures for unsupervised domain adaptation [3, 4]. Then, we derive a generalization error bound depending on the naive loss on the MNAR feedback and the discrepancy between the ideal missing-completely-at-random (MCAR) and the common MNAR missing mechanisms. Our theoretical bound is independent of the propensity score; thus, the bias problem relating to the propensity scoring is eliminated. Moreover, we propose an algorithm called Domain Adversarial Matrix Factorization. The proposed algorithm simultaneously minimizes the naive loss on the MNAR feedback and the discrepancy measure in an adversarial manner. Finally, we conduct an experiment on a standard real-world dataset to empirically demonstrate the effectiveness of the proposed approach.

## 2 Preliminaries

In this section, we introduce the notations and formulation of the MNAR recommendation with explicit feedback. Then, we describe previous estimators and their limitations.

### 2.1 Notation and Formulation

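In this study, $\mathcal{U}$ is a set of users ($u \in \mathcal{U}$), and $\mathcal{I}$ is a set of items ($i \in \mathcal{I}$). We also denote the set of all user and item pairs as $\mathcal{D} = \mathcal{U} \times \mathcal{I}$. Let $R \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}$ be a true rating matrix, where each entry $R_{u,i}$ represents the true rating of user $u$ to item $i$.

The objective of this study is to develop an algorithm to obtain an optimal predicted rating matrix $\hat{R}$, where each entry $\hat{R}_{u,i}$ is the predicted rating for $(u, i)$. To achieve this objective, we formally define the ideal loss function that an optimal algorithm should minimize as follows:

$$
\mathcal{L}^{\ell}_{\mathrm{ideal}}(\hat{R}) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \ell(R_{u,i}, \hat{R}_{u,i}) \tag{1}
$$

where $\ell(\cdot, \cdot)$ is an arbitrary loss function. For example, when $\ell(x, y) = |x - y|$, Eq. (1) is called the mean-absolute-error (MAE); in contrast, when $\ell(x, y) = (x - y)^2$, it is called the mean-squared-error (MSE).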
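
As a concrete illustration of Eq. (1), the following minimal sketch evaluates the ideal loss with the absolute and squared errors. The function names and the toy rating values are assumptions made for this example only; they do not come from the original experiments.

```python
def ideal_loss(true_ratings, predicted_ratings, loss):
    """Average `loss` over every user-item pair, as in Eq. (1)."""
    pairs = [
        (r, r_hat)
        for true_row, pred_row in zip(true_ratings, predicted_ratings)
        for r, r_hat in zip(true_row, pred_row)
    ]
    return sum(loss(r, r_hat) for r, r_hat in pairs) / len(pairs)


def absolute_error(r, r_hat):
    return abs(r - r_hat)


def squared_error(r, r_hat):
    return (r - r_hat) ** 2


# Toy 2-user x 3-item matrices (made-up values, fully observed).
R = [[5, 3, 1], [4, 2, 1]]
R_hat = [[4, 3, 2], [4, 4, 1]]

mae = ideal_loss(R, R_hat, absolute_error)  # ideal loss with the absolute error
mse = ideal_loss(R, R_hat, squared_error)   # ideal loss with the squared error
```

Note that this quantity needs the full true rating matrix, which is exactly what is unavailable in practice.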

In real-world recommender systems, calculating the ideal loss function is impossible because most of the rating data are missing. To precisely formulate this missing mechanism, we utilize two other matrices. The first is the propensity matrix denoted as $P$. Each entry $P_{u,i}$ of this matrix is the propensity score of $(u, i)$, representing the probability of the feedback of the pair being observed. Next, let $O \in \{0, 1\}^{|\mathcal{U}| \times |\mathcal{I}|}$ be an observation matrix where each entry $O_{u,i}$ is a Bernoulli random variable with its expectation $\mathbb{E}[O_{u,i}] = P_{u,i}$. If $O_{u,i} = 1$, then the rating of the pair is observed; otherwise, it is unobserved. Throughout this study, we assume $\sum_{(u,i) \in \mathcal{D}} O_{u,i} = M$ for all the observation matrices.

Under the formulation, constructing an effective estimator for the ideal loss function that can be estimated using only a set of observable feedback is critical to developing an effective recommendation algorithm.

### 2.2 Naive Estimator

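Given a set of observed feedback data, the most basic estimator for the ideal loss is the naive estimator defined as follows:

$$
\hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}(\hat{R} \mid O) = \frac{1}{M} \sum_{(u,i) \in \mathcal{D}} O_{u,i} \cdot \ell(R_{u,i}, \hat{R}_{u,i}) \tag{2}
$$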

The naive estimator is the averaged loss values over the observed rating feedback. This estimator is valid when the missing mechanism of the rating data is missing-at-random (MAR), because, under the MAR settings, the estimator is unbiased against the ideal loss function.
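
A short derivation makes this concrete. It is a sketch under two assumptions: every pair is observed with the same probability $M / |\mathcal{D}|$ (the uniform case used in Lemma 2), and $M$ is treated as fixed.

$$
\mathbb{E}_{O}\left[\hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}(\hat{R} \mid O)\right]
= \frac{1}{M} \sum_{(u,i) \in \mathcal{D}} \mathbb{E}[O_{u,i}] \cdot \ell(R_{u,i}, \hat{R}_{u,i})
= \frac{1}{M} \sum_{(u,i) \in \mathcal{D}} \frac{M}{|\mathcal{D}|} \cdot \ell(R_{u,i}, \hat{R}_{u,i})
= \mathcal{L}^{\ell}_{\mathrm{ideal}}(\hat{R})
$$

When the observation probabilities vary across pairs, the factor $\mathbb{E}[O_{u,i}]$ no longer cancels against $1/M$, and the expectation becomes a reweighted average of the losses.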

However, several previous studies have shown that, under general MNAR settings, the naive estimator is biased [25, 24]. Thus, it is unsuitable for learning a recommendation algorithm, and one should instead rely on an estimator that addresses this bias.
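
The size of this bias can be checked on a toy example. The sketch below computes the expectation of the naive estimator exactly by replacing each observation indicator with its propensity; it assumes that $M$ equals the sum of the propensities (the expected number of observations), and all rating and propensity values are made up for illustration.

```python
def expected_naive_loss(true_ratings, predicted_ratings, propensities, loss):
    """Expectation of the naive estimator over O ~ P, taking M = sum of propensities."""
    m = sum(propensities)
    weighted = sum(
        p * loss(r, r_hat)
        for r, r_hat, p in zip(true_ratings, predicted_ratings, propensities)
    )
    return weighted / m


def squared_error(r, r_hat):
    return (r - r_hat) ** 2


# Six flattened user-item pairs. The predictor is accurate on high ratings only.
R = [5, 5, 4, 2, 1, 1]
R_hat = [5, 5, 4, 4, 4, 4]

uniform = [0.5] * 6                                # every pair equally likely to be observed
self_selected = [0.9, 0.9, 0.6, 0.3, 0.15, 0.15]   # liked items are rated more often (MNAR)

ideal = sum(squared_error(r, h) for r, h in zip(R, R_hat)) / len(R)
naive_uniform = expected_naive_loss(R, R_hat, uniform, squared_error)
naive_mnar = expected_naive_loss(R, R_hat, self_selected, squared_error)
```

With uniform propensities the expected naive loss matches the ideal loss, whereas under self-selection it is smaller, because the poorly predicted low ratings are rarely observed.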

### 2.3 Inverse Propensity Score Estimator

To improve the naive estimator, several previous works applied the IPS estimation to the recommendation settings [24, 16]. In the context of causal inference, the propensity scoring estimator is widely used to estimate causal effects of treatments from observational data [18, 19, 11]. One can derive an unbiased estimator for the loss function of interest with the true propensity score as follows:

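$$
\hat{\mathcal{L}}^{\ell}_{\mathrm{IPS}}(\hat{R} \mid O) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} O_{u,i} \cdot \frac{\ell(R_{u,i}, \hat{R}_{u,i})}{P_{u,i}} \tag{3}
$$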

This estimator is unbiased against the ideal loss and thus is considered to be more desirable than the naive estimator in terms of bias. However, this unbiasedness is valid only when the true propensity score is available; the IPS estimator can have a bias with an inaccurate propensity estimator (see Lemma 5.1 of [24]). The bias problem of the IPS estimator often occurs in most real-world recommender systems. This is because the missing mechanism of the rating feedback can depend on user self-selection as well as past recommendation policy; it is challenging to accurately estimate the missing probability of each instance [17, 24, 30].
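
The sensitivity to the propensity estimate can be illustrated with a small sketch of Eq. (3). Because the estimator is linear in the observation indicators, its expectation is obtained by passing the true propensities in place of the indicators. The names `P_true` and `P_wrong` and all numbers are hypothetical.

```python
import random


def ips_estimate(observed, true_ratings, predicted_ratings, propensities, loss):
    """IPS estimator of Eq. (3) for one vector of observation indicators."""
    total = sum(
        o * loss(r, r_hat) / p
        for o, r, r_hat, p in zip(observed, true_ratings, predicted_ratings, propensities)
    )
    return total / len(true_ratings)


def squared_error(r, r_hat):
    return (r - r_hat) ** 2


R = [5, 5, 4, 2, 1, 1]
R_hat = [5, 5, 4, 4, 4, 4]
P_true = [0.9, 0.9, 0.6, 0.3, 0.15, 0.15]
P_wrong = [0.5] * 6  # a misspecified estimate that ignores self-selection

ideal = sum(squared_error(r, h) for r, h in zip(R, R_hat)) / len(R)

# Expectation over O: each indicator O_{u,i} is replaced by its true propensity.
expected_with_true = ips_estimate(P_true, R, R_hat, P_true, squared_error)
expected_with_wrong = ips_estimate(P_true, R, R_hat, P_wrong, squared_error)

# One sampled realization uses the same formula but is noisy.
rng = random.Random(0)
O = [1 if rng.random() < p else 0 for p in P_true]
one_draw = ips_estimate(O, R, R_hat, P_true, squared_error)
```

In this toy setting the expectation recovers the ideal loss only when the weights are the true propensities; with the misspecified weights it is biased.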

In fact, most of the previous studies estimate the propensity score for the propensity-based matrix factorization model using some amount of test data [24, 29]. However, this kind of propensity estimation is actually infeasible because of the costly annotation process [8]. Therefore, in the next section, we explore the theory and algorithm that are independent of the propensity score aiming to alleviate the problem of propensity estimation bias. In addition, we investigated the effect of using different propensity estimators on the performance of the propensity-based matrix factorization method in the experimental part.

## 3 Proposed Method

In this section, we first derive the generalization error bound of the ideal loss function based on the discrepancy measure between two different propensity matrices. Our bound is propensity-agnostic; thus, the problem relating to the propensity estimation is eliminated in this bound. Then, we propose Domain Adversarial Matrix Factorization (DAMF), which minimizes the theoretical upper bound via the adversarial learning procedure. The optimization of the proposed algorithm is independent of the propensity score; thus, the benefit of the proposed method is emphasized in situations with unknown propensities. Note that all the proofs in this section can be found in the supplementary materials.

### 3.1 Theoretical Bound

First, we define the discrepancy measure for the recommendation settings.

###### Definition 1.

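($\mathcal{H}\Delta\mathcal{H}$-divergence for recommendation) Let $\mathcal{H}$ be a class of predicted rating matrices and let $\ell$ be a loss function. Then, the $\mathcal{H}\Delta\mathcal{H}$-divergence between the two propensity matrices $P$ and $P'$ is defined as follows:

$$
d_{\mathcal{H}\Delta\mathcal{H}}(P, P') = \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P') \right| \tag{4}
$$

where

$$
\mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) = \mathbb{E}_{O \sim P}\left[\hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}(\hat{R}, \hat{R}' \mid O)\right] = \frac{1}{M} \sum_{(u,i) \in \mathcal{D}} P_{u,i} \cdot \ell(\hat{R}_{u,i}, \hat{R}'_{u,i})
$$

Note that the $\mathcal{H}\Delta\mathcal{H}$-divergence for recommendation is independent of the true rating matrix. Therefore, one can calculate this divergence for any given pair of propensity matrices without the true rating information.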
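
For a small finite hypothesis class, the divergence of Eq. (4) can be computed by brute force. The sketch below is only an illustration under that assumption: the hypothesis set `H`, the weight vectors, and the value of `M` are made up, and the exhaustive search over pairs would not scale to a realistic model class.

```python
from itertools import product


def pairwise_loss(r_hat_a, r_hat_b, weights, m, loss):
    """(1/M) * sum of weights[k] * loss between two predicted rating vectors."""
    return sum(w * loss(a, b) for a, b, w in zip(r_hat_a, r_hat_b, weights)) / m


def hdh_divergence(hypotheses, weights_p, weights_q, m, loss):
    """Eq. (4): largest absolute gap between the two pairwise losses over all hypothesis pairs."""
    return max(
        abs(pairwise_loss(a, b, weights_p, m, loss) - pairwise_loss(a, b, weights_q, m, loss))
        for a, b in product(hypotheses, repeat=2)
    )


def absolute_error(a, b):
    return abs(a - b)


# Three candidate prediction vectors over four flattened user-item pairs.
H = [[5, 4, 2, 1], [4, 4, 4, 4], [1, 2, 4, 5]]
P_uniform = [0.5, 0.5, 0.5, 0.5]
P_skewed = [0.9, 0.7, 0.3, 0.1]
M = 2  # both weight vectors sum to two expected observations

d_same = hdh_divergence(H, P_uniform, P_uniform, M, absolute_error)
d_diff = hdh_divergence(H, P_uniform, P_skewed, M, absolute_error)
```

No true ratings appear anywhere in the computation. Passing 0/1 observation vectors as the weights gives the empirical counterpart of the divergence.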

However, in reality, the true propensity matrices ($P$ and $P'$) are unobserved. Thus, one has to estimate the divergence using their realizations ($O$ and $O'$). The following lemma shows that the estimated $\mathcal{H}\Delta\mathcal{H}$-divergence converges to the true divergence at a rate of $O(\sqrt{|\mathcal{D}|} / M)$.

###### Lemma 1.

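Let any pair of propensity matrices $P$ and $P'$ and their realizations $O$ and $O'$ be given, and let the loss function $\ell$ be bounded above by a positive constant $\Delta$. Then, for any $\delta \in (0, 1)$, the following inequality holds with a probability of at least $1 - \delta$:

$$
\left| d_{\mathcal{H}\Delta\mathcal{H}}(P, P') - d_{\mathcal{H}\Delta\mathcal{H}}(O, O') \right| \le \frac{\Delta}{M} \sqrt{2 |\mathcal{D}| \log \frac{4 |\mathcal{H}|^2}{\delta}} \tag{5}
$$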

Then, we state the generalization error bound based on an ideal MCAR observation.

###### Lemma 2.

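(Generalization Error Bound under MCAR observation.) Let an MCAR observation matrix $O_{\mathrm{MCAR}}$, where

$$
(P_{\mathrm{MCAR}})_{u,i} = \mathbb{E}[O_{u,i}] = \frac{M}{|\mathcal{D}|}, \quad \forall (u, i) \in \mathcal{D},
$$

and any finite hypothesis space of predictions $\mathcal{H}$ be given. The loss function $\ell$ is bounded above by a positive constant $\Delta$. Then, for any hypothesis $\hat{R} \in \mathcal{H}$ and for any $\delta \in (0, 1)$, the following inequality holds with a probability of at least $1 - \delta$:

$$
\mathcal{L}^{\ell}_{\mathrm{ideal}}(\hat{R}) \le \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}(\hat{R} \mid O_{\mathrm{MCAR}}) + \frac{\Delta}{M} \sqrt{\frac{|\mathcal{D}|}{2} \log \frac{2 |\mathcal{H}|}{\delta}} \tag{6}
$$

The next lemma relates the losses under two different propensity matrices.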

###### Lemma 3.

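Assume that the loss function $\ell$ obeys the triangle inequality. Then, for any given predicted rating matrix $\hat{R} \in \mathcal{H}$ and two propensity matrices $P$ and $P'$, the following inequality holds:

$$
\mathcal{L}^{\ell}(\hat{R} \mid P) \le \mathcal{L}^{\ell}(\hat{R} \mid P') + d_{\mathcal{H}\Delta\mathcal{H}}(P, P') + \lambda \tag{7}
$$

where

$$
\lambda = \min_{\hat{R} \in \mathcal{H}} \left\{ \mathcal{L}^{\ell}(\hat{R} \mid P) + \mathcal{L}^{\ell}(\hat{R} \mid P') \right\}
$$

Finally, using the $\mathcal{H}\Delta\mathcal{H}$-divergence for recommendation, we derive the propensity-agnostic generalization error bound of the ideal loss function.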

###### Theorem 1.

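(Propensity-agnostic Generalization Error Bound) Let two observation matrices $O_{\mathrm{MCAR}}$ and $O_{\mathrm{MNAR}}$ having MCAR and MNAR missing mechanisms, respectively, and any finite hypothesis space of predictions $\mathcal{H}$ be given. The loss function $\ell$ is bounded above by a positive constant $\Delta$. Then, for any hypothesis $\hat{R} \in \mathcal{H}$ and for any $\delta \in (0, 1)$, the following inequality holds with a probability of at least $1 - \delta$:

$$
\begin{aligned}
\mathcal{L}^{\ell}_{\mathrm{ideal}}(\hat{R}) \le{} & \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}(\hat{R} \mid O_{\mathrm{MNAR}}) + d_{\mathcal{H}\Delta\mathcal{H}}(O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}}) \\
& + \frac{\Delta}{M} \left( \sqrt{\frac{|\mathcal{D}|}{2} \log \frac{4 |\mathcal{H}|}{\delta}} + \sqrt{2 |\mathcal{D}| \log \frac{8 |\mathcal{H}|^2}{\delta}} \right) + \lambda
\end{aligned} \tag{8}
$$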

The propensity-agnostic generalization error bound in Eq. (8) consists of the following four terms.

• The naive loss on the MNAR rating feedback. One can straightforwardly minimize this term from observable data.

• The empirical $\mathcal{H}\Delta\mathcal{H}$-divergence between the MCAR and the MNAR observations. One can estimate this term from observable data because an unlabeled MCAR observation can be obtained by randomly sampling user-item pairs.

• The variance term that is independent of the propensity score. It converges to zero as the number of observed feedback $M$ increases.

• The combined loss $\lambda$ of the ideal hypothesis $\hat{R}^*$. This term is incalculable. If there is no hypothesis that performs well for both the MCAR and MNAR settings, one cannot hope to find a good rating predictor by minimizing the first and second terms.

As previously explained, the bound is independent of the propensity score, and the problems relating to propensity score estimation are avoided.
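
To give a feel for the magnitude of the variance term in Eq. (8), the following sketch evaluates it for a few hypothetical problem sizes. The density of 2%, the hypothesis-class size, the confidence level, and the loss bound `loss_bound=16.0` (the largest squared error on a 1 to 5 rating scale) are all assumptions chosen for illustration.

```python
import math


def variance_term(num_pairs, num_observed, num_hypotheses, delta, loss_bound):
    """Third term on the right-hand side of Eq. (8)."""
    first = math.sqrt(num_pairs / 2 * math.log(4 * num_hypotheses / delta))
    second = math.sqrt(2 * num_pairs * math.log(8 * num_hypotheses ** 2 / delta))
    return loss_bound / num_observed * (first + second)


# The observed fraction M / |D| is held at 2% while the matrix grows.
terms = []
for num_pairs in (10**4, 10**6, 10**8):
    m = num_pairs // 50
    terms.append(
        variance_term(num_pairs, m, num_hypotheses=1000, delta=0.05, loss_bound=16.0)
    )
```

At a fixed density the term shrinks in proportion to $1 / \sqrt{|\mathcal{D}|}$, so it only becomes small once the number of observed ratings is large relative to $\sqrt{|\mathcal{D}|}$.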

### 3.2 Algorithm

Here, we describe the detailed algorithm of the proposed DAMF. Inspired by Theorem 1, we consider minimizing the following objective:

$$
\min_{\hat{R} \in \mathcal{H}} \ \underbrace{\hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}(\hat{R} \mid O_{\mathrm{MNAR}})}_{\text{loss on MNAR feedback}} + \beta \underbrace{d_{\mathcal{H}\Delta\mathcal{H}}(O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}})}_{\text{disc between MCAR and MNAR}}
$$

where $\beta$ is the trade-off hyperparameter between the naive loss on the MNAR feedback and the discrepancy between the MCAR and MNAR observation mechanisms. This optimization criterion consists of the two controllable terms of the theoretical bound in Eq. (8). The minimization of the first term (loss on MNAR feedback) can easily be conducted. On the other hand, that of the second term (disc between MCAR and MNAR) is difficult because it requires optimization over pairs of hypotheses.

Therefore, in this work, we introduce a discriminator that classifies item latent factors into two classes, rare and popular, aiming to derive item latent factors from which item popularity bias is eliminated. We adopt this approach because item popularity bias is the most problematic type of bias in recommender systems [31], and a similar optimization approach has shown promising results in the neural word embedding literature [9].

The proposed algorithm consists of the following three parts: representation, domain, and prediction parts. First, the representation part seeks user and item latent factors, denoted as $U$ and $V$, that satisfy the following two objectives: (i) high predictive power on the ratings and (ii) low predictive power on the item popularities.

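Next, in the prediction part, rating predictions are made via the following linear transformation:

$$
\hat{R}(U_u, V_i; \theta_{\mathrm{pred}}) = W_p^{\top} [U_u^{\top}, V_i^{\top}]^{\top} + b_p
$$

where $\theta_{\mathrm{pred}} = (W_p, b_p)$ is a vector-scalar parameter pair for the prediction part. The loss function to derive the parameters of the prediction part is as follows:

$$
\hat{\mathcal{L}}^{\ell}_{\mathrm{pred}}(\hat{R}; \theta_{\mathrm{pred}}) = \frac{1}{M} \sum_{(u,i) \in \mathcal{D}} O_{u,i} \cdot \ell(R_{u,i}, \hat{R}(U_u, V_i; \theta_{\mathrm{pred}}))
$$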

In the domain part, predictions for the item popularity are made via the following linear transformation followed by a sigmoid:

$$
f(V_i; \theta_{\mathrm{dom}}) = \sigma(W_d^{\top} V_i + b_d)
$$

where $\theta_{\mathrm{dom}} = (W_d, b_d)$ is a vector-scalar parameter pair for the domain part and $\sigma(\cdot)$ is the sigmoid function. The outputs of the domain layer are confidence scores representing how rare each item is. The loss to derive the parameters of the domain part takes the following binary cross entropy form:

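$$
\hat{\mathcal{L}}_{\mathrm{dom}}(f; \theta_{\mathrm{dom}}) = \frac{1}{M_{\mathrm{rare}}} \sum_{(u,i) \in \mathcal{D}} O^{\mathrm{rare}}_{u,i} \cdot \log(f(V_i; \theta_{\mathrm{dom}})) + \frac{1}{M_{\mathrm{pop}}} \sum_{(u,i) \in \mathcal{D}} O^{\mathrm{pop}}_{u,i} \cdot \log(1 - f(V_i; \theta_{\mathrm{dom}}))
$$

When $i$ is a popular item and $O_{u,i} = 1$, then $O^{\mathrm{pop}}_{u,i} = 1$ and $O^{\mathrm{rare}}_{u,i} = 0$; otherwise, $O^{\mathrm{pop}}_{u,i} = 0$ and $O^{\mathrm{rare}}_{u,i} = 1$. In addition, $M_{\mathrm{pop}} = \sum_{(u,i) \in \mathcal{D}} O^{\mathrm{pop}}_{u,i}$ and $M_{\mathrm{rare}} = \sum_{(u,i) \in \mathcal{D}} O^{\mathrm{rare}}_{u,i}$.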
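
The domain loss above can be sketched in a few lines. This is an illustrative reading rather than the reference implementation: it sums only over observed pairs, keeps the sign convention of the formula as written, and the names `w_dom` and `b_dom` as well as the toy latent factors are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(item_factors, popular_items, observed_pairs, w_dom, b_dom):
    """Mean log f over rare observations plus mean log(1 - f) over popular ones.

    Assumes at least one observed pair falls in each of the two classes.
    """
    rare_terms, pop_terms = [], []
    for _user, item in observed_pairs:
        z = sum(w * v for w, v in zip(w_dom, item_factors[item])) + b_dom
        f = sigmoid(z)  # confidence that the item is rare
        if item in popular_items:
            pop_terms.append(math.log(1.0 - f))
        else:
            rare_terms.append(math.log(f))
    return sum(rare_terms) / len(rare_terms) + sum(pop_terms) / len(pop_terms)


V = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.5]}    # item latent factors
popular = {0}
observed = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 2)]   # (user, item) pairs with O = 1

uninformative = domain_loss(V, popular, observed, w_dom=[0.0, 0.0], b_dom=0.0)
separating = domain_loss(V, popular, observed, w_dom=[-3.0, 0.0], b_dom=0.0)
```

With all-zero parameters every item receives a confidence of 0.5, which gives the value $2 \log 0.5$; parameters that separate the popular item from the rare ones move the value toward zero.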

Following the framework of domain adversarial training [6, 7, 9], the rating predictor and the popularity discriminator are trained in a minimax manner as follows:

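$$
\min_{U, V, \theta_{\mathrm{pred}}} \max_{\theta_{\mathrm{dom}}} \ \hat{\mathcal{L}}^{\ell}_{\mathrm{pred}}(\hat{R}; \theta_{\mathrm{pred}}) - \beta \hat{\mathcal{L}}_{\mathrm{dom}}(f; \theta_{\mathrm{dom}}) \tag{9}
$$

where $\beta$ is the trade-off hyperparameter between the prediction and domain loss. Given fixed latent factors $U$ and $V$, and parameters of the prediction part $\theta_{\mathrm{pred}}$, the optimization of the discriminator is as follows:

$$
\max_{\theta_{\mathrm{dom}}} \ -\beta \hat{\mathcal{L}}_{\mathrm{dom}}(f; \theta_{\mathrm{dom}}) \tag{10}
$$

Then, given fixed parameters of the domain part $\theta_{\mathrm{dom}}$, the optimization of $U$, $V$, and $\theta_{\mathrm{pred}}$ is as follows:

$$
\min_{U, V, \theta_{\mathrm{pred}}} \ \hat{\mathcal{L}}^{\ell}_{\mathrm{pred}}(\hat{R}; \theta_{\mathrm{pred}}) - \beta \hat{\mathcal{L}}_{\mathrm{dom}}(f; \theta_{\mathrm{dom}}) \tag{11}
$$

We implement the proposed algorithm in TensorFlow and optimize $(U, V)$, $\theta_{\mathrm{pred}}$, and $\theta_{\mathrm{dom}}$ iteratively using the Adam optimizer [12].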

The detailed training procedure of DAMF is described in Algorithm 1.
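
The alternating updates of Eqs. (10) and (11) can be sketched as follows. This is not Algorithm 1 and not the TensorFlow implementation: it is a minimal pure-Python illustration that swaps Adam for plain stochastic gradient steps with hand-derived gradients, uses the squared loss, and follows the sign conventions of Eqs. (9) to (11) as stated. The function name, default hyperparameters, and toy data are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_damf_sketch(ratings, popular, n_users, n_items,
                      k=2, beta=0.1, lr=0.05, epochs=200, seed=0):
    """ratings: list of (user, item, rating); popular: set of popular item ids."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    w_p, b_p = [rng.gauss(0, 0.1) for _ in range(2 * k)], 0.0  # prediction part
    w_d, b_d = [0.0] * k, 0.0                                  # domain part
    m = len(ratings)
    m_pop = sum(1 for _, i, _ in ratings if i in popular)
    m_rare = m - m_pop  # both classes are assumed to be non-empty

    def dom_grad_z(item):
        # Derivative of one observation's domain-loss term w.r.t. the logit.
        f = sigmoid(dot(w_d, V[item]) + b_d)
        return -f / m_pop if item in popular else (1.0 - f) / m_rare

    history = []
    for _ in range(epochs):
        # Eq. (10): update the domain part with U, V, and theta_pred fixed.
        # Maximizing -beta * L_dom is a descent step on L_dom.
        for _, i, _ in ratings:
            g = dom_grad_z(i)
            for j in range(k):
                w_d[j] -= lr * beta * g * V[i][j]
            b_d -= lr * beta * g
        # Eq. (11): update U, V, and theta_pred with the domain part fixed.
        sq_err = 0.0
        for u, i, r in ratings:
            err = dot(w_p, U[u] + V[i]) + b_p - r
            sq_err += err ** 2
            g_pred = 2.0 * err / m
            g_dom = dom_grad_z(i)
            for j in range(k):
                grad_u = g_pred * w_p[j]
                grad_v = g_pred * w_p[k + j] - beta * g_dom * w_d[j]
                grad_wu = g_pred * U[u][j]
                grad_wv = g_pred * V[i][j]
                U[u][j] -= lr * grad_u
                V[i][j] -= lr * grad_v
                w_p[j] -= lr * grad_wu
                w_p[k + j] -= lr * grad_wv
            b_p -= lr * g_pred
        history.append(sq_err / m)  # training MSE measured during this epoch
    return U, V, (w_p, b_p), history


toy_ratings = [(0, 0, 5), (1, 0, 4), (2, 0, 5), (0, 1, 2), (1, 2, 1), (2, 1, 3)]
U, V, theta_pred, history = train_damf_sketch(toy_ratings, {0}, n_users=3, n_items=3)
```

The item factors receive gradients from both the prediction loss and the domain loss, which is where the adversarial coupling enters; the user factors and the prediction parameters are driven by the prediction loss alone.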

## 4 Experiments

We conducted an empirical evaluation to compare the proposed method to other existing baselines.

### 4.1 Experimental Setup

We used the Yahoo! R3 dataset. This dataset contains approximately 300,000 MNAR five-star user-song ratings from 15,400 users against 1,000 songs, and the test data were collected by asking a subset of 5,400 users to rate 10 randomly selected songs. Thus, the test data are regarded as a MAR dataset.

We compared MF [13], MF with IPS estimator (MF-IPS) [24], and the proposed DAMF. We tuned the dimensions of the latent factors within the range of , and the L2-regularization parameter within the range of for all the methods. The trade-off hyperparameter was tuned within the range of for the proposed method. The combination of hyperparameters that minimized the loss on the validation sets was selected using the Optuna software [1]. In addition, for the proposed method, we set the top 20% most frequent items in the training set as popular items and the remainder as rare.
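
The popular/rare split can be sketched as below. Only the 20% threshold comes from the setup above; the function name, the rounding rule, and the handling of ties are assumptions, and items with no observed rating do not appear in either set.

```python
from collections import Counter


def split_popular_items(observed_pairs, popular_fraction=0.2):
    """Return (popular, rare) item-id sets, ranking items by observed rating count."""
    counts = Counter(item for _user, item in observed_pairs)
    ranked = [item for item, _count in counts.most_common()]
    n_popular = max(1, round(popular_fraction * len(ranked)))
    popular = set(ranked[:n_popular])
    return popular, set(ranked) - popular


# Toy (user, item) pairs: item 0 is rated four times, the others at most twice.
pairs = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (0, 2), (1, 3), (2, 4)]
popular, rare = split_popular_items(pairs)
```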

We estimated the propensity score for MF-IPS in the following manner (NB represents Naive Bayes).

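$$
\begin{aligned}
\text{user propensity}: &\quad \hat{P}_{u,*} = \frac{\sum_{i \in \mathcal{I}} O_{u,i}}{\max_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} O_{u,i}} \\
\text{item propensity}: &\quad \hat{P}_{*,i} = \frac{\sum_{u \in \mathcal{U}} O_{u,i}}{\max_{i \in \mathcal{I}} \sum_{u \in \mathcal{U}} O_{u,i}} \\
\text{user-item propensity}: &\quad \hat{P}_{u,i} = \hat{P}_{u,*} \cdot \hat{P}_{*,i} \\
\text{NB with uniform prior}: &\quad \hat{P}(O_{u,i} = 1 \mid R_{u,i} = r) = P(R = r \mid O = 1)\, P(O = 1)
\end{aligned}
$$

In contrast to previous works [24, 29], we did not use any data in the test set for the propensity estimation, in order to imitate the real-world situation. However, we also report the results of MF-IPS with the following propensity estimator for reference only.

$$
\text{NB with true prior}: \quad \hat{P}(O_{u,i} = 1 \mid R_{u,i} = r) = \frac{P(R = r \mid O = 1)\, P(O = 1)}{P(R = r)}
$$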

NB with true prior is, in reality, infeasible in most of the real-world problems, because it requires the MCAR explicit feedback to estimate the prior rating distribution.
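
The estimators that use only training data can be sketched as follows. The function name and the toy matrices are hypothetical, and ratings are read only where the observation indicator is 1.

```python
def propensity_estimates(O, R):
    """O: 0/1 observation matrix (list of rows); R: ratings, read only where O is 1."""
    n_users, n_items = len(O), len(O[0])
    user_counts = [sum(row) for row in O]
    item_counts = [sum(O[u][i] for u in range(n_users)) for i in range(n_items)]

    # Counts normalized by the most active user / most rated item.
    user_prop = [c / max(user_counts) for c in user_counts]
    item_prop = [c / max(item_counts) for c in item_counts]
    user_item_prop = [[pu * pi for pi in item_prop] for pu in user_prop]

    # Naive Bayes with uniform prior: P(R = r | O = 1) * P(O = 1) per rating value r.
    observed = [R[u][i] for u in range(n_users) for i in range(n_items) if O[u][i]]
    p_obs = len(observed) / (n_users * n_items)
    nb_uniform = {r: observed.count(r) / len(observed) * p_obs for r in set(observed)}
    return user_prop, item_prop, user_item_prop, nb_uniform


O = [[1, 1, 0], [1, 0, 0]]
R = [[5, 3, 0], [5, 0, 0]]  # zeros are placeholders for unobserved entries
user_prop, item_prop, user_item_prop, nb_uniform = propensity_estimates(O, R)
```

An item or user with no observations receives an estimated propensity of zero, so some form of lower bound would be needed before dividing by these values in an IPS-style loss.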

### 4.2 Results

Table 1 provides the averaged MSE and MAE and their standard errors over five different simulations on the Yahoo! R3 dataset.

First, consistent with the previous works [24, 29], MF-IPS with true prior information performed well and outperformed the other methods in terms of both MAE and MSE. However, MF-IPS with the other propensity estimators cannot outperform the vanilla MF. The results suggest that MF-IPS is potentially an effective debiasing method but is highly sensitive to the way of propensity estimation. In particular, using only MNAR training data for the propensity estimation does not lead to a well-performing recommender.

In contrast, the proposed DAMF algorithm significantly outperforms the other baseline methods without estimating the propensity score. The results validate the effectiveness of the proposed approach under situations where the true propensities are unknown, or the costly MCAR data is unavailable.

## 5 Related Work

### 5.1 Causal-based Recommendation

To address the bias of the MNAR explicit feedback, several related works assume the missing data model and rating model and estimate parameters using the iterative procedure [17, 10]. However, these methods are highly complex and do not perform well on real-world rating datasets [24, 30].

The causality-based methods were proposed to solve the limitations [24, 16, 30, 29]. Among them, the most basic means is called the Inverse Propensity Score (IPS) estimation established in the context of causal inference [18, 19, 11]. This estimation method provides the unbiased estimator of the true metric of interest by utilizing the propensity score defined as the probability of observing each instance. It has been shown that the unbiased estimator can be derived by weighting each data by the inverse of its propensity. The rating predictor based on the IPS estimator empirically outperformed the naive matrix factorization [13] and the probabilistic generative model [10]. These causality-based methods can remove the bias of the naive methods, but the performance of these methods largely depends on the propensity score estimation model. In fact, ensuring the performance of the propensity estimator is challenging in real-world recommendations because users are free to choose which items to rate, and one cannot control the missing mechanism [10]. In the empirical evaluations of propensity-based methods [24, 29], the MCAR test data is used to estimate the propensity score, but in reality, the use of MCAR data is infeasible in most situations because gathering a sufficient amount of MCAR data takes time and cost for the annotation process.

Another approach for the MNAR recommendation is Causal Embeddings For Recommendations (CausE) proposed in [5]. This method jointly trains two matrix factorization models to reduce the effect of selection bias and empirically outperformed the propensity-based methods in terms of binary classification metrics. However, this method also requires some amount of MCAR rating feedback that is costly and inaccessible in most real-world recommender systems.

Therefore, a method that is independent of both the propensity score and MCAR data is needed for real-world applications, but such a method has not yet been proposed in the context of the MNAR recommendation.

### 5.2 Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) aims to train a predictor that works well on a target domain by using only labeled source data and unlabeled target data during training [20, 14]. The major challenge of this field is that the feature distributions and the labeling functions (mappings from the feature space to the outcome space) can differ between the source and target domains. Thus, a predictor trained using only the labeled source data does not generalize well on the target domain. Therefore, measuring the discrepancy between the two domains is essential to achieve the desired performance on the target domain [14, 15]. Several discrepancy measures for the difference in the feature distributions between the source and target domains have been proposed [3, 14, 15, 32]. For example, the $\mathcal{H}$-divergence and the $\mathcal{H}\Delta\mathcal{H}$-divergence [4, 3] have been used to construct many prediction methods in UDA such as DANN, ADDA, and MCD [7, 6, 28, 21]. These methods are built on the adversarial learning framework and can be theoretically explained as minimizing the empirical errors and the discrepancy measures between the source and target domains. The optimization of these methods does not depend on the propensity score. Thus, methods of UDA are considered to be beneficial for constructing an effective recommender with biased rating feedback, because one cannot access the true propensities in most real-world recommender systems.

The work that is most related to ours is [2]. In that study, a propensity-agnostic lower bound on the performance of treatment policies is derived. The bound is based on the well-established $\mathcal{H}$-divergence and can be optimized through domain adversarial learning. The proposed DACPOL procedure empirically outperforms the propensity-based treatment policy optimization algorithm called POEM [26, 27] in situations where the past treatment policies (propensities) are unknown.

Our proposed method shares a similar structure with the method proposed in [2] but is the first extension of the domain adversarial learning to develop a method to alleviate the bias of the MNAR recommendation without true propensity information.

## 6 Conclusion

In this study, we explored the problem of learning rating predictors from MNAR explicit feedback. First, we derived the generalization error bound of the loss function of interest inspired by the theoretical framework of unsupervised domain adaptation. The bound is propensity-agnostic; thus, problems related to the propensity estimation are eliminated in this bound. Then, we proposed Domain Adversarial Matrix Factorization that simultaneously minimizes the naive loss of the MNAR feedback and the discrepancy between two missing mechanisms. Finally, we conducted an experiment on the standard real-world dataset and showed that the proposed method significantly outperformed the baseline methods under a realistic situation where the true propensities are inaccessible. Important future research directions are the user-dependent discrepancy estimation for the proposed algorithm and extension of the proposed method to the recommendation using implicit feedback. Moreover, several disconnections between theory and algorithm still exist, although the benefit of the proposed algorithm was empirically shown. Bridging the gap between the theory and algorithm is another important and interesting theme.

Acknowledgement. The author would like to thank Suguru Yaginuma and Kazuki Taniguchi for their helpful comments and discussions.

## References

• [1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pages 2623–2631, New York, NY, USA, 2019. ACM.
• [2] Onur Atan, William R Zame, and Mihaela van der Schaar. Learning optimal policies from observational data. arXiv preprint arXiv:1802.08679, 2018.
• [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
• [4] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144, 2007.
• [5] Stephen Bonner and Flavian Vasile. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, pages 104–112, New York, NY, USA, 2018. ACM.
• [6] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
• [7] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
• [8] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198–206. ACM, 2018.
• [9] Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: Frequency-agnostic word representation. In Advances in neural information processing systems, pages 1334–1345, 2018.
• [10] José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning, pages 1512–1520, 2014.
• [11] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
• [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• [13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
• [14] Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. Unsupervised domain adaptation based on source-guided discrepancy. arXiv preprint arXiv:1809.03839, 2018.
• [15] Jongyeong Lee, Nontawat Charoenphakdee, Seiichi Kuroki, and Masashi Sugiyama. Domain discrepancy measure using complex models in unsupervised domain adaptation. arXiv preprint arXiv:1901.10654, 2019.
• [16] Dawen Liang, Laurent Charlin, and David M Blei. Causal inference for recommendation. In Causation: Foundation to Application, Workshop at UAI, 2016.
• [17] Benjamin M Marlin and Richard S Zemel. Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pages 5–12. ACM, 2009.
• [18] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
• [19] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688, 1974.
• [20] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2988–2997, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
• [21] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
• [22] Yuta Saito. Eliminating bias in recommender systems via pseudo-labeling. arXiv preprint arXiv:1910.01444, 2019.
• [23] Yuta Saito, Hayato Sakata, and Kazuhide Nakata. Doubly robust prediction and evaluation methods improve uplift modeling for observational data. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 468–476. SIAM, 2019.
• [24] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1670–1679, New York, New York, USA, 20–22 Jun 2016. PMLR.
• [25] Harald Steck. Training and testing of recommender systems on data missing not at random. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 713–722. ACM, 2010.
• [26] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
• [27] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in neural information processing systems, pages 3231–3239, 2015.
• [28] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
• [29] Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Doubly robust joint learning for recommendation on data missing not at random. In International Conference on Machine Learning, pages 6638–6647, 2019.
• [30] Yixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. The deconfounded recommender: A causal inference approach to recommendation. CoRR, abs/1808.06581, 2018.
• [31] Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, pages 279–287, New York, NY, USA, 2018. ACM.
• [32] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pages 7404–7413, 2019.

## Appendix A Omitted Proofs

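### A.1 Proof of Lemma 1

###### Proof.

First,

$$
\begin{aligned}
& \left| d_{\mathcal{H}\Delta\mathcal{H}}(P, P') - d_{\mathcal{H}\Delta\mathcal{H}}(O, O') \right| \\
&= \left| \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P') \right| - \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O') \right| \right| \\
&\le \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P') \right| - \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O') \right| \right| \\
&\le \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \left( \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) \right) - \left( \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P') - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O') \right) \right| \\
&\le \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) \right| + \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P') - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O') \right|
\end{aligned}
$$

The deviations in the last line can be bounded by following the same logic as the proof of Theorem 5.2 in [24]:

$$
\begin{aligned}
& \mathbb{P}\left( \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) \right| \ge \epsilon \right) \le \delta \\
\Leftrightarrow\ & \mathbb{P}\left( \bigcup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left\{ \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) \right| \ge \epsilon \right\} \right) \le \delta \\
\Leftarrow\ & \sum_{\hat{R}, \hat{R}' \in \mathcal{H}} \mathbb{P}\left( \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) \right| \ge \epsilon \right) \le \delta \\
\Leftarrow\ & 2 |\mathcal{H}|^2 \exp\left( \frac{-2 M^2 \epsilon^2}{|\mathcal{D}| \Delta^2} \right) \le \delta
\end{aligned} \tag{12}
$$

Therefore, applying this with $\delta / 2$ in place of $\delta$ and solving for $\epsilon$, each of the following inequalities holds with a probability of at least $1 - \delta / 2$.

$$
\sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O) \right| \le \frac{\Delta}{M} \sqrt{\frac{|\mathcal{D}|}{2} \log \frac{4 |\mathcal{H}|^2}{\delta}} \tag{13}
$$

$$
\sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P') - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid O') \right| \le \frac{\Delta}{M} \sqrt{\frac{|\mathcal{D}|}{2} \log \frac{4 |\mathcal{H}|^2}{\delta}} \tag{14}
$$

Combining Eq. (12), Eq. (13), and Eq. (14) with the union bound completes the proof. ∎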

### A.2 Proof of Lemma 2

###### Proof.

Replacing the propensity $P_{u,i}$ in Eq. (16), Theorem 5.2 of [24] with $M / |\mathcal{D}|$ completes the proof. ∎

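### A.3 Proof of Lemma 3

###### Proof.

Let $\hat{R}^* = \arg\min_{\hat{R} \in \mathcal{H}} \left\{ \mathcal{L}^{\ell}(\hat{R} \mid P) + \mathcal{L}^{\ell}(\hat{R} \mid P') \right\}$. Then,

$$
\begin{aligned}
\mathcal{L}^{\ell}(\hat{R} \mid P)
&\le \mathcal{L}^{\ell}(\hat{R}, \hat{R}^* \mid P) + \mathcal{L}^{\ell}(\hat{R}^* \mid P) \\
&\le \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}^* \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}^* \mid P') \right| + \mathcal{L}^{\ell}(\hat{R}, \hat{R}^* \mid P') + \mathcal{L}^{\ell}(\hat{R}^* \mid P) \\
&\le \mathcal{L}^{\ell}(\hat{R} \mid P') + \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P) - \mathcal{L}^{\ell}(\hat{R}, \hat{R}' \mid P') \right| + \mathcal{L}^{\ell}(\hat{R}^* \mid P') + \mathcal{L}^{\ell}(\hat{R}^* \mid P) \\
&= \mathcal{L}^{\ell}(\hat{R} \mid P') + d_{\mathcal{H}\Delta\mathcal{H}}(P, P') + \lambda
\end{aligned}
$$

∎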

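### A.4 Proof of Theorem 1

###### Proof.

First, we obtain the following inequality by replacing $P$ and $P'$ with $P_{\mathrm{MCAR}}$ and $P_{\mathrm{MNAR}}$, respectively, in Eq. (7).

$$
\mathcal{L}^{\ell}_{\mathrm{ideal}}(\hat{R}) \le \mathcal{L}^{\ell}(\hat{R} \mid P_{\mathrm{MNAR}}) + d_{\mathcal{H}\Delta\mathcal{H}}(P_{\mathrm{MCAR}}, P_{\mathrm{MNAR}}) + \lambda \tag{15}
$$

where

$$
\mathcal{L}^{\ell}_{\mathrm{ideal}}(\hat{R}) = \mathcal{L}^{\ell}(\hat{R} \mid P_{\mathrm{MCAR}})
$$

by definition. Then, by the same argument as in Lemma 2 and by Lemma 1, each of the following inequalities holds with a probability of at least $1 - \delta / 2$.

$$
\mathcal{L}^{\ell}(\hat{R} \mid P_{\mathrm{MNAR}}) \le \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}(\hat{R} \mid O_{\mathrm{MNAR}}) + \frac{\Delta}{M} \sqrt{\frac{|\mathcal{D}|}{2} \log \frac{4 |\mathcal{H}|}{\delta}} \tag{16}
$$

$$
\left| d_{\mathcal{H}\Delta\mathcal{H}}(P_{\mathrm{MCAR}}, P_{\mathrm{MNAR}}) - d_{\mathcal{H}\Delta\mathcal{H}}(O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}}) \right| \le \frac{\Delta}{M} \sqrt{2 |\mathcal{D}| \log \frac{8 |\mathcal{H}|^2}{\delta}} \tag{17}
$$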

Combining Eq. (15), Eq. (16), and Eq. (17) with the union bound completes the proof. ∎