In the offline learning and evaluation of recommender systems, the dependency of the feedback data on the underlying exposure mechanism is often overlooked. When users express their preferences for products explicitly (such as by providing ratings) or implicitly (such as by clicking), the feedback is conditioned on the products to which they were exposed. In most cases, the previous exposures were decided by some underlying mechanism such as the historical recommender system. This dependency causes two dilemmas for machine learning in recommender systems, for which satisfactory solutions have yet to be found. Firstly, the majority of supervised learning models only handle the dependency between the label (user feedback) and the features, yet in the actual feedback data, the exposure mechanism can alter the dependency pathways (Figure 1). In Section 2, we show from a theoretical perspective that directly applying supervised learning to feedback data can result in inconsistent detection of user preferences. Secondly, an unbiased model evaluation requires the product exposure to be determined by the candidate recommendation model, which is almost never satisfied when using feedback data only. The second dilemma also reveals a major gap between evaluating models by online experiments and by historical data, since offline evaluations are more likely to be biased toward the historical exposure mechanism, as it determined which products the users could express their preferences on. The disagreement between online and offline evaluations may partly explain the controversial observations made in several recent papers, where deep recommendation models are outperformed by classical collaborative filtering approaches in offline evaluations [dacrema2019we; rendle2020neural], despite their many successful deployments in real-world applications [cheng2016wide; covington2016deep; ying2018graph; zhou2018deep; zheng2018drn; zhang2019deep].
To settle the above dilemmas for recommender systems, we draw on the idea of counterfactual modelling from the observational studies and causal inference literature [morgan2015counterfactuals; pearl2009causal; rosenbaum2010design] to redesign the learning and evaluation methods. Briefly put, counterfactual modelling answers "what if" questions, e.g. what the feedback data would be if the candidate model were deployed. Our key purpose in introducing counterfactual methods is to account for the dependency between the feedback data and the exposure. Relevant proposals have been made in several recent papers [schnabel2016recommendations; liang2016causal; joachims2017unbiased; agarwal2018counterfactual; liang2016modeling; yang2018unbiased; hernandez2014probabilistic]; however, most of them rely on excessive data or model assumptions (such as the missing-data model we describe in Section 2) that may not be satisfied in practice. Many of these assumptions are essentially unavoidable due to a fundamental discrepancy between recommender systems and observational studies. In observational studies, the exposure (treatment) status is fully observed, and the exposure mechanism is completely decided by the covariates (features) [rosenbaum1983central; austin2011introduction]. For recommender systems, the exposure is only partially captured by the feedback data. The complete exposure status can only be retrieved from the system's backend log, whose access is highly restricted and which rarely exists for public datasets. Also, the exposure mechanism can depend on intractable randomness, e.g. burst events, special offers, interference with other modules such as advertisement, as well as relevant features that are not attainable from feedback data. In Figure 1, we show the causal diagrams for the three different views of the recommender system.
A direct consequence of the above differences is that the exposure mechanism is not identifiable from feedback data, i.e. we can modify the conditional distribution characterized by the exposure mechanism without disturbing the observation distribution. Therefore, the existing methods have to make problem-specific or unjustifiable assumptions in order to bypass or simply ignore the identifiability issue.
Our solution is to acknowledge the uncertainty brought by the identifiability issue and treat it as an adversarial component. We propose a minimax setting where the candidate model is optimized over the worst-case exposure mechanism. By applying duality arguments and relaxations, we show that the minimax problem can be converted to an adversarial game between two recommendation models. Our approach is novel and principled. We summarize our contributions as follows.
We provide the first theoretical analysis showing an inconsistency issue of supervised learning for recommender systems, caused by the unknown exposure mechanism.
We propose a minimax setting for counterfactual recommendation and convert it to a tractable two-model adversarial game. We prove the generalization bounds for the proposed adversarial learning, and provide analysis for the minimax optimization.
We carry out extensive simulation and real data experiments to demonstrate our performance, and deploy online experiments to fully illustrate the benefits of the proposed approach.
Let the collected user-item pairs, with feature vectors $x_u$ for users and $x_i$ for items, be the training data, where non-positive interactions may come from negative sampling. The feature vectors can be one-hot encodings or embeddings, so our approach is fully compatible with deep learning models that leverage representation learning and are trained under negative sampling. Recommendation models, denoted e.g. by $f$ and $g$, take $x_u$ and $x_i$ (and the exposure status if available) as input. We use $f(x_u, x_i)$ as shorthand for the output score, and the loss is computed with respect to the observed feedback. Our notation also applies to sequential recommendation by encoding the previously-interacted items into the user feature vector $x_u$.
We use $o$ to denote the exposure status; the exposure mechanism depends on the underlying model that generated the historical recommendations. The user response is independent of the exposure mechanism whenever the exposure status $o$ is observed. We point out that the stochasticity in the exposure can also be induced by exogenous factors (unobserved confounders) that bring extra random perturbations. We do not explicitly differentiate the explicit and implicit feedback settings unless specified.
Supervised learning for feedback data.
Let $y$ be the implicit feedback. Setting aside the exposure for a moment, the goal of supervised learning is to determine the optimal recommendation function that minimizes a surrogate loss inducing the widely-adopted margin-based loss. Now we take account of the (unobserved) exposure status by first letting:
to denote the joint distribution for positive and negative feedback under either exposure status. The surrogate loss, which now depends on both the model and the exposure mechanism, is denoted accordingly. In the following claim, we show that if we fix the exposure mechanism and optimize, the optimal loss and the corresponding optimal model admit explicit characterizations in terms of the user preference and the exposure mechanism.
When the exposure mechanism is given and fixed, the optimal loss is:
where the two distributions correspond to the positive and negative feedback, and the divergence is the f-divergence induced by the convex, lower-semicontinuous function $f$. Also, the optimal model that achieves the infimum is given by a fixed transformation of the user-preference distribution.
We conclude that: 1. when the exposure mechanism is given, the optimal loss is a function of both the user preference and the exposure mechanism; 2. the optimal model depends only on the user preference, which does not depend on the exposure mechanism (as mentioned at the beginning of this section). Both conclusions are practically reasonable, as the optimal recommendation model should detect only the user preference regardless of the exposure mechanism. The optimal loss, on the other hand, depends on the joint distribution in which the underlying exposure mechanism plays a part.
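Claim 1 ties the optimal loss to an f-divergence between the positive- and negative-feedback distributions. As a minimal numeric sketch (the distributions and the choice of generator here are hypothetical, not those induced by the paper's margin loss), an f-divergence between discrete distributions can be computed directly:

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_z q(z) * f(p(z) / q(z)) for discrete distributions."""
    return sum(qz * f(pz / qz) for pz, qz in zip(p, q))

# The KL divergence is the f-divergence with convex generator f(t) = t*log(t).
kl_gen = lambda t: t * math.log(t)

p_pos = [0.7, 0.2, 0.1]  # hypothetical positive-feedback distribution over 3 items
p_neg = [0.3, 0.4, 0.3]  # hypothetical negative-feedback distribution

d = f_divergence(p_pos, p_neg, kl_gen)
```

Identical distributions give divergence zero, consistent with the interpretation that the optimal loss shrinks as positive and negative feedback become indistinguishable.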
However, when the exposure mechanism is unknown, the conclusions from Claim 1 no longer hold and the optimal model will depend on the exposure mechanism. As a consequence, if the same feedback data were collected under different exposure mechanisms, the recommendation model may infer the user preference differently. The inconsistency is caused by not accounting for the unknown exposure mechanism in supervised learning. We mention that another line of research studies the user preference and exposure in an interactive online fashion, e.g. using contextual bandits and reinforcement learning [li2010contextual; zheng2018drn]. Their discussion is beyond the scope of this paper.
The propensity-weighting approach.
In causal inference, the probability of exposure given the observed features (covariates) is referred to as the propensity score [rosenbaum1983central]. The propensity-weighting approach uses weights based on the propensity score to create a synthetic sample in which the distribution of observed features is independent of exposure [hirano2001estimation; austin2011introduction]. It especially appeals to us because we want the feedback data to be made independent of the exposure mechanism. The propensity-weighted loss is constructed by dividing each observed sample's loss by its propensity score, and by taking the expectation with respect to the exposure distribution, we recover the ordinary loss:
where the second expectation is taken with respect to the empirical distribution. The propensity-weighted empirical distribution is then obtained (after scaling) by reweighting each sample by its inverse propensity under the underlying exposure mechanism, which can be thought of as the synthetic sample distribution after eliminating the influence of the underlying exposure mechanism. It is straightforward to verify that, after scaling, the expected propensity-weighted loss recovers the exposure-free loss.
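The propensity-weighted loss above can be sketched in a few lines (the per-sample losses and propensities are hypothetical placeholders for the model's outputs and the exposure probabilities):

```python
def ips_loss(losses, propensities):
    """Inverse-propensity-weighted empirical loss: each observed sample's loss is
    divided by its probability of exposure, removing the exposure bias in expectation."""
    assert all(p > 0 for p in propensities), "propensities must be bounded away from zero"
    return sum(l / p for l, p in zip(losses, propensities)) / len(losses)

# hypothetical per-sample losses and exposure probabilities
losses = [0.2, 0.5, 0.1, 0.4]
props = [0.8, 0.5, 0.9, 0.4]

w = ips_loss(losses, props)
```

With all propensities equal to one, the weighted loss reduces to the ordinary empirical mean; propensities below one inflate the contribution of rarely-exposed samples.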
The hidden assumption of the missing-data (click) model
A number of prior works deal with the unidentifiable exposure mechanism by assuming a missing-data model [saito2020unbiased; ai2018unbiased; liang2016modeling; wang2018modeling], which is also referred to as the click model:
While the click model greatly simplifies the problem, since the exposure mechanism can now be characterized explicitly, it relies on a hidden assumption that is rarely satisfied in practice. We use $r$ to denote the relevance and $y$ to denote the click. The click-model factorization $P(y = 1 \mid x) = P(r = 1 \mid x)\,P(o = 1 \mid x)$ implies:
which suggests that being relevant is independent of being exposed given the features. This is rarely true (or at least cannot be verified) in many real-world problems, unless the features contain every single factor that may affect the exposure and the user preference. We aim to provide a robust solution whenever the hidden assumption of the missing-data (click) model is dubious or violated.
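To see why the click-model factorization conflates preference with exposure, consider a toy calculation (all probabilities hypothetical): two items with identical relevance can show very different observed click rates once their exposure differs.

```python
# Two hypothetical items with the same relevance probability but different exposure.
p_rel = {"item_a": 0.8, "item_b": 0.8}  # user preference (unknown in practice)
p_exp = {"item_a": 0.9, "item_b": 0.1}  # exposure decided by the deployed system

# Under the click-model factorization P(click) = P(relevant) * P(exposed),
# the observed click rates differ sharply even though preferences are identical.
p_click = {k: p_rel[k] * p_exp[k] for k in p_rel}
```

Supervised learning on clicks alone would rank item_a far above item_b, despite the two items being equally relevant.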
Let the ideal exposure-eliminated sample distribution be the propensity-reweighted distribution determined by the underlying exposure mechanism and the data distribution. For notational simplicity, and without overloading the original meaning by too much, from this point on we treat the relevant distributions as distributions on the sample space consisting of all the observed data. Since we make no data or model assumptions that would allow us to accurately recover the ideal distribution, we introduce a minimax formulation to characterize the uncertainty. We optimize against the worst possible choice of a hypothetical exposure-eliminated distribution, whose discrepancy from the ideal one can only be determined by the data up to a neighborhood. Among the divergence and distribution distance measures, we choose the Wasserstein distance for our problem, which is defined as:
where $c(\cdot, \cdot)$ is a convex, lower semicontinuous transportation cost function with $c(z, z) = 0$, and the infimum is taken over the set of all couplings whose marginals are the two given distributions. Intuitively, the Wasserstein distance can be interpreted as the minimum cost of transporting mass between probability measures. We choose the Wasserstein distance over the alternatives exactly because we wish to understand how to transport from the empirical data distribution to an ideal synthetic data distribution in which the observations are independent of the exposure mechanism. Hence, we consider the local minimax empirical risk minimization (ERM) problem:
where we directly account for the uncertainty induced by the lack of identifiability in the exposure mechanism, and optimize under the worst possible setting. However, the formulation in (5) is, first of all, a constrained optimization problem. Secondly, the constraint is expressed in terms of the hypothetical exposure-eliminated distribution. After applying a duality argument, we express the dual problem via the exposure mechanism in the following Claim 2, where we use $\hat{p}$ to denote an estimate of the propensity score.
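Before stating the dual, the transport intuition behind the Wasserstein distance can be checked with a tiny one-dimensional computation (the samples are hypothetical; for a convex cost in 1-D, the optimal plan simply matches sorted samples):

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical distributions
    with cost c(x, y) = |x - y|: the optimal transport plan matches sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# transporting mass between two hypothetical samples of propensity weights
d = wasserstein_1d([0.1, 0.2, 0.3], [0.2, 0.3, 0.5])
```

The distance is zero exactly when the two empirical distributions coincide, and grows with the total mass that must be moved.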
Suppose that the transportation cost is continuous and the propensity scores are all bounded away from zero. Then
where $\lambda$ is a positive constant and the density function is that associated with the underlying exposure distribution.
We defer the proof to Appendix A.2. If we consider the relaxation for each fixed $\lambda$ (see the appendix), the minimax objective admits a desirable formulation in which $\lambda$ becomes a tuning parameter:
To make sense of (6), observe that while the adversarial propensity model acts against the candidate model through the inverse weights in the first term, it cannot arbitrarily increase the objective function, since the second term acts as a regularizer that keeps it close to the true exposure mechanism. Compared with the primal problem in (5), the relaxed dual formulation in (6) gives the desired unconstrained optimization problem. Also, we point out that the exposure mechanism is often induced by the recommender system that was operating during data collection, which we shall leverage as domain knowledge to further convert (6) into a more tractable formulation. Let $g$ be the recommendation model that underlies the exposure mechanism, and assume for now that the propensity score is given by a transformation of $g$'s output. We leave the inclusion and manipulation of the unobserved factors to Section 3.2. The objective in (6) can then be converted into a two-model adversarial game:
Before we move on to discuss the implications of (7), its practical implementations and the minimax optimization, we first show and discuss the theoretical guarantees for the generalization error, in comparison to the standard ERM setting, after introducing the adversarial component.
3.1 Theoretical property
Before we state the main results, we need to characterize the loss function corresponding to the adversarial objective, as well as the complexity of our hypothesis space. For the first purpose, we introduce the cost-regulated loss. For the second purpose, we consider the entropy integral of the hypothesis class, defined via the covering number of the class in terms of the supremum norm. Suppose that the loss is uniformly bounded. We now state our main theoretical result on the worst-case generalization bound under the minimax setting; the proof is delegated to Appendix A.3.
Suppose the mapping from the model to the propensity score is one-to-one and surjective. Then, under the conditions specified in Claim 2, for all $\lambda \ge 0$ and $\delta \in (0, 1)$, the following inequality holds with probability at least $1 - \delta$:
where the leading factor is a positive constant and the remaining term is a simple linear function with positive weights.
The above generalization bound holds for all $\lambda$ and $\delta$, and we show that when they are decided by certain data-dependent quantities, the result can be converted into simplified forms that reveal more direct connections with the propensity-weighted loss and the standard ERM results (with the proof provided in Appendix A.4).
Corollary 1 shows that the proposed approach attains the same rate as standard ERM. Also, the first result reveals an extra bias term induced by the adversarial setting, and the second result characterizes how the additional uncertainty is reflected in the propensity-weighted empirical loss.
3.2 Practical implementations
Directly optimizing the minimax objective in (7) is infeasible, since the true exposure mechanism is unknown and the Wasserstein distance is hard to compute when the model is complicated, such as a neural network [panaretos2019statistical]. Nevertheless, understanding the comparative roles of the two models can help us construct practical solutions.
Recall that our goal is to optimize the candidate model $f$. The auxiliary model $g$ is introduced to characterize the adversarial exposure mechanism, so we are less interested in recovering the true exposure mechanism. That being said, the Wasserstein term only serves to establish certain regularizations on $g$ such that it is constrained by the underlying exposure mechanism. Relaxing or tightening the regularization term should not significantly impact the solution, since we can always adjust the regularization parameter $\lambda$. Hence, we are motivated to design tractable regularizers that approximate or even replace the Wasserstein term, as long as the constraint on $g$ is established under the same principle. Similar ideas have also been applied to train generative adversarial networks (GAN): the optimal classifier depends on the unknown data distribution, so in practice, people use alternative tractable classifiers that fit the problem [goodfellow2016nips]. We list several alternative regularizers below.
In the explicit feedback setting, the exposure status is partially observed, so the loss of $g$ on the partially-observed exposure data can be used as the regularizer.
For content-based recommendation, the exposure often has a high correlation with popularity, as popular items are more likely to be recommended. The regularizer may therefore leverage the empirical item popularity.
In the implicit feedback setting, if all the other choices are impractical, we may simply use the loss of $g$ on the feedback data as the regularizer. The loss-based regularizer is meaningful because the exposure is often determined by some other recommendation model. If $g$ happens to recover that underlying model, we can expect similar performance from the two on the same feedback data, since the exposure mechanism is determined by that model itself.
We focus on the third example because it applies to almost all cases without requiring excessive assumptions. The practical adversarial objective is therefore given by:
In the next step, we study how to handle the unobserved factors that also play a part in the exposure mechanism. As we mentioned in Section 1, having unobserved factors is practically inevitable. In particular, we leverage Tukey's factorization proposed in the missing-data literature [franks2016non]. In the presence of unobserved factors, Tukey's factorization suggests that we additionally characterize the relationship between the exposure mechanism and the outcome [franks2019flexible] (see the appendix for detailed discussions). Relating the outcome to the exposure mechanism has also been explored in the recommendation literature [schnabel2016recommendations].
For clarity, we employ a simple logistic regression to model this relationship, in which the propensity is adjusted by a sigmoid function of the outcome. We now reach the final form of the adversarial game:
We place the logistic-regression parameters in the minimization problem for the following reason: by our design, this component merely characterizes the potential impact of the unobserved factors, which we do not consider to act adversarially. Otherwise, the adversarial model could be too strong for the candidate model to learn anything useful.
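Putting the pieces of this section together, here is a minimal numeric sketch of the practical adversarial objective: the candidate model's per-sample losses are inverse-weighted by sigmoid propensities produced by the adversary, whose own loss on the feedback data enters as the regularizer. All names and the stand-in losses are hypothetical, not the paper's exact parameterization:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adversarial_objective(f_losses, g_scores, lam):
    """Practical objective: inverse-propensity weights come from g via a sigmoid,
    and g's own empirical loss on the feedback data acts as the regularizer.
    f minimizes this quantity while g maximizes it."""
    props = [sigmoid(s) for s in g_scores]  # worst-case propensities from g
    weighted = sum(l / p for l, p in zip(f_losses, props)) / len(f_losses)
    g_loss = sum(-math.log(p) for p in props) / len(props)  # stand-in loss for g
    return weighted - lam * g_loss

obj = adversarial_objective([0.2, 0.5, 0.1], [1.0, -0.5, 0.3], lam=0.1)
```

Increasing the regularization weight pulls the objective down by penalizing propensities that stray from fitting the feedback data, which is exactly how the regularizer restrains the adversary.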
3.3 Minimax optimization and robust evaluation
To handle the adversarial training, we adopt a sequential optimization setup where the players take turns updating their models. Without loss of generality, we treat the objective in (8) as a function of the two models, $f$ and $g$. When the objective is nonconvex-nonconcave, the classical Minimax Theorem no longer holds [terkelsen1973some]. Consequently, which player goes first has important implications. Here, we choose to train $g$ first, because $g$ can then choose the worst candidate from the uncertainty set in order to undermine $f$. We adopt the two-timescale gradient descent ascent (GDA) [heusel2017gans] schema that is widely applied to train adversarial objectives (Algorithm 1). However, the existing analyses of GDA's convergence to a local Nash equilibrium assume simultaneous training [heusel2017gans; ratliff2013characterization; prasad2015two], so their guarantees do not apply here. Instead, we keep training until the objective stops changing under updates of either $f$ or $g$.
Consequently, the stationary points of Algorithm 1 may not attain a local Nash equilibrium. Nevertheless, when the timescales of the two models differ significantly (by adjusting the initial learning rates and discounts), it has been shown that the stationary points belong to the local minimax solutions up to some degenerate cases [jin2019local]. Local minimaxity captures the optimal strategies in the sequential game if both models are only allowed to change their strategies locally. Hence, Algorithm 1 leads to solutions that are locally optimal. Finally, the role of the logistic-regression component is less important in the sequential game, and we do not observe significant differences from updating it before or after $f$ and $g$.
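The alternating updates of Algorithm 1 can be sketched on a toy decoupled objective where the minimax solution is known in closed form (the objective and learning rates are illustrative only, not the paper's models):

```python
def two_timescale_gda(grad_f, grad_g, theta, phi, lr_f=0.05, lr_g=0.01, steps=2000):
    """Sequential gradient ascent (adversary phi) / descent (candidate theta)
    with different timescales, mirroring the alternating updates of Algorithm 1."""
    for _ in range(steps):
        phi = phi + lr_g * grad_g(theta, phi)      # g ascends toward the worst case
        theta = theta - lr_f * grad_f(theta, phi)  # f descends against the current g
    return theta, phi

# Toy objective L(theta, phi) = (theta - 1)^2 - (phi - 2)^2:
# the minimax solution is theta* = 1, phi* = 2.
g_f = lambda t, p: 2.0 * (t - 1.0)
g_g = lambda t, p: -2.0 * (p - 2.0)

theta, phi = two_timescale_gda(g_f, g_g, theta=0.0, phi=0.0)
```

On this decoupled toy problem both iterates converge to the minimax point; for the actual nonconvex-nonconcave objective, only the local minimaxity discussed above is available.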
Recommenders are often evaluated by the mean squared error (MSE) on explicit feedback, and by information retrieval metrics such as DCG and NDCG on implicit feedback. After training, we obtain the candidate model $f$ as well as the adversarial $g$, which gives the worst-case propensity score function specialized for $f$. Therefore, instead of pursuing unbiased evaluation, we consider a robust evaluation that uses the worst-case propensity scores. It frees the offline evaluation from the potential impact of the exposure mechanism, and thus provides a robust view of the true performance. For instance, a robust NDCG can be computed by inverse-weighting each interaction's DCG contribution with the worst-case propensity score.
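A self-normalized version of a propensity-weighted NDCG@k for next-item recommendation might look as follows (the ranks and propensities are hypothetical, and the paper's exact weighting may differ):

```python
import math

def robust_ndcg_at_k(ranks, props, k=10):
    """Propensity-weighted NDCG@k for next-item recommendation: each held-out
    item's DCG contribution is inverse-weighted by its (worst-case) propensity,
    then self-normalized so that the metric stays in [0, 1]."""
    num = sum((1.0 / math.log2(r + 1)) / p for r, p in zip(ranks, props) if r <= k)
    den = sum(1.0 / p for p in props)  # ideal DCG is 1 per user (one relevant item)
    return num / den

# hypothetical predicted ranks of each user's held-out item, with propensities
ranks = [1, 3, 12, 2]
props = [0.9, 0.5, 0.8, 0.4]
score = robust_ndcg_at_k(ranks, props, k=10)
```

Users whose held-out item had low exposure probability contribute more to the metric, which counteracts the tendency of standard NDCG to reward the historical exposure mechanism.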
4 Relation to other work
The propensity-weighting method was proposed and intensively studied in the observational studies and causal inference literature [austin2011introduction; austin2015moving]. A recent line of work that introduces adversarial training to address the identifiability issue builds on covariate-balancing methods [kallus2018deepmatch; yoon2018ganite]. Adversarial training is widely applied in areas such as generative modeling [goodfellow2016nips], model defense [tramer2017ensemble], adversarial robustness [xie2019feature] and distributionally robust optimization (DRO) [rahimian2019distributionally]. Compared with GAN, we study the sampling distribution instead of the generating distribution, and GAN does not involve counterfactual modelling. DRO often focuses on the feature distribution, while we study the propensity score distribution. Using the Wasserstein distance as regularization is also common in the literature [shafieezadeh2019regularization; gao2017wasserstein]. Here, we introduce the adversarial setting for the identifiability issue, whereas model defense and adversarial robustness study the training and modelling properties under deliberate adversarial behaviors.
Counterfactual modelling for recommenders often relies on certain data or model assumptions (such as the click model assumption) to make up for the identifiability issue, and is thus vulnerable when the assumptions are violated in practice [schnabel2016recommendations; liang2016causal; joachims2017unbiased; agarwal2018counterfactual; liang2016modeling; yang2018unbiased; hernandez2014probabilistic; saito2020unbiased; ai2018unbiased; wang2018modeling]. Adversarial training for recommenders often borrows the GAN setting by assuming a generative distribution for certain components [wang2017irgan; he2018adversarial]. Here, we do not assume a generative nature for recommender systems.
5 Experiment and Result
We conduct simulation studies, real-data analysis, and online experiments to demonstrate the various benefits of the proposed adversarial counterfactual learning and evaluation approach.
In the simulation study, we generate the synthetic data using real-world explicit feedback dataset so that we have access to the oracle exposure mechanism. We then show that models trained by our approach achieve superior unbiased offline evaluation performances.
In the real-world data analysis, we demonstrate that the models trained by our approach also achieve larger improvements, even under the standard offline evaluation.
By conducting online experiments, we verify that our robust evaluation is more accurate than the standard offline evaluation when compared with the actual online evaluations.
As for the baseline models, since we are proposing a high-level learning and evaluation approach that is compatible with almost all existing recommendation models, we consider well-known baseline models to demonstrate the effectiveness of our approach. Specifically, we employ the popularity-based recommendation (Pop), matrix factorization collaborative filtering (CF), the multi-layer perceptron-based CF model (MLP), neural CF (NCF) and generalized matrix factorization (GMF) as representatives for content-based recommendation. We also consider the prevailing attention-based model (Attn) as a representative for sequential recommendation. We choose $f$ and $g$ from among the above baseline models for our adversarial counterfactual learning. To fully demonstrate the effectiveness of the proposed adversarial training, we also experiment with the non-adversarially trained propensity-score method (PS), where we first optimize $g$ only on the regularization term until convergence, keep it fixed, and then train $f$ in the regular propensity-weighted ERM setting. For notational convenience, we refer to our learning approach as ACL.
We examine the various methods on the widely-adopted next-item recommendation task. In particular, all but the last two user-item interactions are used for training, the second-to-last interaction is used for validation, and the last interaction is used for testing. All the detailed data processing, experiment setup, model configuration, parameter tuning, training procedure, validation, testing and sensitivity analysis are provided in Appendix A.6.
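The leave-last-out protocol described above can be sketched as:

```python
def leave_last_out(user_seqs):
    """Next-item split: all but the last two interactions train,
    the second-to-last validates, and the last interaction tests."""
    splits = {}
    for user, seq in user_seqs.items():
        assert len(seq) >= 3, "need at least 3 interactions per user"
        splits[user] = {"train": seq[:-2], "valid": seq[-2], "test": seq[-1]}
    return splits

# hypothetical interaction sequence (item ids in chronological order)
s = leave_last_out({"u1": [10, 4, 7, 2, 9]})
```

This per-user temporal split avoids leaking future interactions into training, which a random split would not guarantee.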
|Hit@10|39.60 (.1)|39.24 (.3)|39.68 (.3)|39.00 (.2)|39.47 (.1)|39.10 (.2)|40.32 (.1)|39.08 (.2)|
|NDCG@10|20.26 (.1)|20.10 (.2)|20.33 (.2)|19.33 (.3)|19.58 (.2)|19.30 (.1)|20.81 (.2)|19.61 (.1)|
|Hit@10|31.90 (.3)|30.61 (.2)|33.82 (.1)|30.01 (.3)|31.36 (.2)|33.50 (.1)|33.45 (.2)|32.51 (.2)|
|NDCG@10|16.65 (.2)|15.72 (.2)|17.81 (.1)|15.24 (.2)|16.40 (.2)|17.28 (.2)|17.50 (.1)|16.85 (.1)|
|Hit@10|33.76 (.1)|38.27 (.2)|39.43 (.2)|39.00 (.2)|26.62 (.3)|30.90 (.2)|31.78 (.2)|29.59 (.3)|
|NDCG@10|17.75 (.1)|18.59 (.2)|20.09 (.2)|19.28 (.3)|14.29 (.2)|16.43 (.1)|16.58 (.2)|14.94 (.2)|
the original baseline models without using the propensity-score approach or ACL. We use bold font and underscores to mark the best and second-best outcomes. The mean and standard deviation are computed over ten repetitions, and the complete numerical results are deferred to Appendix A.6.
|config|Attn / Pop|GMF / GMF|Attn / Attn|
|Hit@10|42.18 (.2)|60.97 (.1)|61.01 (.2)|63.37 (.3)|63.97 (.1)|82.66 (.2)|81.97 (.1)|64.32 (.2)|83.64 (.1)|
|NDCG@10|21.99 (.1)|32.59 (.1)|32.09 (.3)|33.49 (.1)|33.82 (.2)|55.27 (.1)|54.51 (.1)|33.70 (.1)|55.71 (.2)|
|config|GMF / Pop|GMF / GMF|Attn / Attn|
|Hit@10|25.26 (.2)|52.97 (.3)|81.86 (.3)|81.87 (.3)|83.12 (.3)|71.89 (.3)|82.64 (.2)|83.64 (.2)|72.02 (.2)|
|NDCG@10|15.35 (.1)|31.54 (.2)|58.38 (.2)|57.33 (.4)|58.96 (.2)|59.75 (.2)|58.84 (.2)|59.11 (.1)|59.45 (.1)|
|config|Attn / Pop|GMF / GMF|Attn / Attn|
|Hit@10|43.36 (.1)|60.32 (.2)|62.17 (.2)|63.11 (.3)|63.78 (.1)|72.63 (.2)|73.39 (.1)|64.17 (.2)|73.82 (.3)|
|NDCG@10|22.73 (.2)|37.73 (.1)|37.65 (.1)|38.78 (.3)|38.69 (.1)|48.98 (.1)|49.92 (.3)|39.53 (.1)|49.99 (.1)|
Synthetic data analysis. We use the explicit feedback data from the MovieLens-1M and Goodreads datasets (all data sources, processing steps and other detailed descriptions are provided in the appendix). We train a baseline CF model and use the optimized hidden factors to generate a synthetic exposure mechanism (with the details presented in Appendix A.6.3), which we treat as the oracle exposure. The implicit feedback data are then generated according to the oracle exposure as well as the optimized hidden factors. Unbiased offline evaluation is now possible because we have access to the exposure mechanism. Also, to set a reasonable benchmark under our simulation setting, we provide additional experiments where $g$ is given by the oracle exposure model. The results are provided in Table 1. We see that when trained with the proposed approach, the baseline models yield their best performances (other than the oracle-enhanced counterparts) under the unbiased offline evaluation and outperform the rest of the baselines, which reveals the first appeal of our approach.
Real data analysis. In addition to using the MovieLens-1M and Goodreads data in the implicit feedback setting, we further include the LastFM music recommendation (implicit feedback) dataset. From the results in Table 2, we observe that the models trained by our approach achieve the best outcomes, even under the standard evaluation where the exposure mechanism is not considered. The better performance under standard evaluation suggests the second appeal of adversarial counterfactual learning: even though it optimizes toward the minimax setting, the robustness does not come at the cost of performance under the standard evaluation.
|MSE on metric|Standard|Popularity debiased|Propensity model debiased|Robust|
Online experiment analysis. To examine the practical benefits of the proposed robust learning and evaluation approach in real-world experiments, we carry out several online A/B tests on Walmart.com, a major e-commerce platform in the U.S., in a content-based item recommendation setting. We are provided with the actual online testing and evaluation results. All the candidate models were trained offline using the proposed approach. We compare the standard offline evaluation, the popularity-debiased offline evaluation (where the item popularity is used as the propensity score), the propensity-score model approach and our robust evaluation against the actual online evaluations. In Table 3, we see that our proposed evaluation approach is indeed a more robust approximation to the online evaluation. This reveals the third appeal of the proposed approach: it is capable of narrowing the gap between online and offline evaluations.
We thoroughly analyze the drawbacks of supervised learning for recommender systems and propose a theoretically-grounded adversarial counterfactual learning and evaluation framework. We provide detailed theoretical and empirical results to illustrate the benefits of the proposed approach.
Scope and limitation. The improvement brought by our approach ultimately depends on the properties of the feedback data, e.g. to what extent is the identifiability issue causing uncertainties in the data. Also, we observe empirically that the propensity model can experience undesired behaviors during the adversarial training as a consequence of using suboptimal tuning parameters. Therefore, it remains to be studied how the optimization dynamics can impact the two-model interactions for the proposed adversarial counterfactual learning.
To the best of our knowledge, the approaches discussed in this paper raise no major ethical concerns and societal consequences. Researchers and practitioners from the recommender system domain may benefit from our research since robust offline learning and evaluation has been a significant challenge in real-world applications. The worst possible outcome when the proposed approach fails is that it reduces to the standard offline learning as the propensity model stops making the desired impact. Finally, the proposed approach aims at solving the identifiability issues of the data, the extent of which depends on the properties of the data.
The work is supported by the Walmart U.S. eCommerce. The authors declare that there is no conflict of interest.
Appendix A.1 Proof for Claim 1
When taking the exposure mechanism into account, minimizing the loss is implicitly solving the following problem, where
For any fixed exposure mechanism, we have
For each point, introduce the corresponding shorthand quantities.
Notice that the objective is a convex function of the density ratio, since the supremum (the negative of the infimum) over a set of affine functions is convex. Since the generator is convex and continuous, we get:
which is exactly the f-divergence induced by .
Also, upon achieving the infimum in (A.1), the optimal model is obtained by solving the corresponding first-order condition. ∎
Appendix A.2 Proof for Claim 2 and the relaxation
We first prove the dual formulation for the minimax ERM stated in Claim 2, and then discuss the relaxation of the dual problem.
For the estimation of the ideal exposure-eliminated sample, the primal constraint is equivalent to a reformulated constraint on the propensity-weighted distribution.
The key observation is that when the reference measure is given by the empirical distribution that assigns uniform weights to all samples, the Wasserstein distance is convex (since the cost function is convex) and the constraint set is well-defined.
Since we assume that the propensity scores are all bounded away from zero, the relevant density ratios exist and are well-behaved, so we are able to establish the duality results, since Slater's condition holds. Introducing the dual variable and a copy of the sample variable, we have:
where in the last line we use shorthand notation for the corresponding quantities. Then notice that
and we then show that the opposite direction also holds, so equality is always attained. Let the relevant space be the space of measurable conditional distributions (Markov kernels) between the sample spaces; then
In the next step, we consider the space of all measurable mappings between the sample spaces. Since all the mappings are measurable, the underlying spaces are regular, and the integrands are at least semi-continuous, using standard measure-theoretic arguments for exchanging the integration and the supremum, we get
where the variable on the LHS represents the mapping, and the variable on the RHS still denotes elements of the sample space. Now let the support of the conditional distribution be given accordingly. Then, according to (A.5), we have:
Finally, notice that
so according to (A.2), we reach the final result:
To reach the relaxation given in (5), we use the alternative expression for the Wasserstein distance obtained from the Kantorovich-Rubinstein duality [villani2008optimal]. We denote the Lipschitz constant of a function by $\mathrm{Lip}(\cdot)$. When the cost function is 1-Lipschitz continuous, the distance is also referred to as the Wasserstein-1 distance. Without loss of generality, we consider a norm-induced cost, with which the Wasserstein distance is equivalent to:
where the supremum is taken over all 1-Lipschitz functions. In practice, when the reference measure is the empirical distribution that assigns uniform weights to all the samples, the change of measure with importance-weighting estimators introduces constant factors, and the induced cost function satisfies the required Lipschitz property. Therefore, the Wasserstein distance between the empirical and the reweighted distributions can be bounded by the corresponding propensity-weighted quantity. Hence, for each fixed multiplier in (A.8), we obtain a relaxation of the result in Claim 2. In practice, the specific forms of the cost functions do not matter, because the Wasserstein distance is intractable and we use the data-dependent surrogates discussed in Section 3.2.
Appendix A.3 Proof for Theorem 1
Following the same arguments as in the proof of Claim 2, we obtain a result similar to (A.8):
Introducing the relevant shorthand, notice that
Since the loss is uniformly bounded, McDiarmid's inequality for bounded random variables first gives
Then let the Rademacher random variables be i.i.d. and independent of the data, and consider an i.i.d. copy of each sample.
Applying the symmetrization argument, we see that
It is clear that each summand is zero-mean, and we now show that it is sub-Gaussian as well.
For any two elements of the hypothesis class, we show the bounded difference:
Hence we see that the process is sub-Gaussian with respect to the corresponding metric. Therefore, the expectation can be bounded using the standard technique for Rademacher complexity and Dudley's entropy integral [talagrand2014upper]:
Appendix A.4 Proof for Corollary 1
To obtain the first result, let the data-dependent tuning parameter be given by
Then, according to the definition of the cost-regulated loss, we have
It is easy to verify that
as well as