Optimizing cluster-based randomized experiments under a monotonicity assumption

03/07/2018, by Jean Pouget-Abadie et al.

Cluster-based randomized experiments are popular designs for mitigating the bias of standard estimators when interference is present and classical causal inference and experimental design assumptions (such as SUTVA or ITR) do not hold. Without exact knowledge of the interference structure, it can be challenging to understand which partitioning of the experimental units is optimal for minimizing the estimation bias. In this paper, we introduce a monotonicity condition under which a novel two-stage experimental design allows us to determine which of two cluster-based designs yields the least biased estimator. We then consider the setting of online advertising auctions and show that reserve price experiments satisfy the monotonicity condition, so that the proposed framework and methodology apply. We validate our findings on an advertising auction dataset.


1 Introduction

Randomized experiments, or A/B tests, are at the core of many product decisions at large technology companies. Under the commonly assumed Stable Unit Treatment Value Assumption (SUTVA), these A/B tests provide unbiased estimates of the effect of assigning all units to a particular intervention over an alternative condition (Imbens and Rubin, 2015). SUTVA is an assumption of no interference between units: a unit's outcome in the experiment does not depend on the treatment assignment of any other unit.

In many A/B tests, however, this assumption is not tenable. Consider an intervention on a user of a messaging platform: the resulting change in her behavior (e.g. an increase in time spent on the platform or in the number of messages sent, or a decrease in response time) would affect the friends she chooses to communicate with on the platform. The same cascading phenomenon can also occur in more subtle ways in a social feed setting. Changes to a feed ranking algorithm, and the resulting behavioral changes (e.g. a higher click-through rate, or more feedback and interaction time with the content on the feed), will invariably affect the content on that unit's friends' social feeds (Eckles et al., 2016; Gui et al., 2015).

In particular, the same is true in an advertiser auction setting, where modifications to the ecosystem can impact auctions and bidders not originally assigned to the intervention (Basse et al., 2016). Suppose that one bidder changes her strategy as a result of being assigned to a higher reserve price, or her usual bid no longer meets the reserve. The bidders she competes with now face a different bid distribution — the auction is now more competitive if she increases her bid to meet the new reserve, or less competitive if she fails to meet the reserve. These bidders might react to this new bid distribution by updating their own bidding strategy, even though they were not originally assigned to the intervention. This effect could potentially affect the other auctions they participate in.

When SUTVA does not hold, we say there is interference between units, and many fundamental results of the causal inference literature no longer hold. For example, the difference-in-means estimator under a completely randomized assignment is no longer unbiased (Imbens and Rubin, 2015). When the estimand is the difference of outcomes under two extreme assignments (one assigning all units to the intervention, and the other assigning none), a common approach to mitigating the bias of standard estimators in the face of interference is to run cluster-based randomized designs (Ugander et al., 2013; Walker and Muchnik, 2014; Eckles et al., 2017). These randomized designs assign units to treatment or control in groups, in order to limit the amount of interaction between different treatment buckets.

If it can be shown that there is no interaction across treatment buckets, we recover many of the results stated under SUTVA. In practice, however, such a grouping of units may not exist, and A/B test practitioners often settle for the best partitioning they can find. The problem is often formulated as the balanced partitioning of a weighted graph on the experimental units, where an edge is drawn between two units that are liable to interfere with one another. This is a challenging task, both algorithmically and empirically: clustering a graph into balanced partitions is known to be NP-hard, even if we tolerate some unevenness between partitions (Andreev and Racke, 2006); furthermore, the correct graph representation of the interference mechanism is not always clear.

While the literature on finding balanced partitioning of weighted graphs and analysing cluster-based randomized designs is extensive (Middleton and Aronow, 2011; Donner and Klar, 2004; Eckles et al., 2017), there are relatively few prior works that tackle the following question: can we determine which of two balanced partitionings produces less biased estimates of the total treatment effect, without assuming the exact structure of interference is known? The objective of this paper is to show we can in fact identify the better of two clusterings through experimentation under an assumption on the interference mechanism, which we call monotonicity.

Even when the exact structure of interference is not known, monotonicity can be established under a theoretical model. For example, some interference mechanisms are self-exciting: assigning any unit to the intervention boosts the outcomes of neighboring units. Examples range from vaccination campaigns to social feed ranking algorithms. In both cases, the units in the vicinity of a unit assigned to the intervention tend to benefit over those surrounded by units in the control bucket. Interference mechanisms that exhibit this self-exciting property are a particular example of monotone mechanisms (cf. Section 2.2). When monotonicity holds, we show that it is feasible to compare two balanced partitionings of the experimental units by running a straightforward modification of an experiment-of-experiments design (Saveski et al., 2017; Pouget-Abadie et al., 2017).

We make the following contributions: we present an experiment-of-experiments design for comparing cluster-based randomized designs. We define a monotonicity assumption under which we can determine which clustering induces the least biased estimates of the total treatment effect using this comparative design. While our technique applies to the general problem of experimental design under interference with a monotonicity assumption, we prove that pricing experiments in the context of ad exchanges are monotone, and thus our framework applies to this illustrative example. (While pricing experiments are done in the context of ad exchanges (AdE, 2018), we note that our paper is a theoretical study of the subject and does not include any real treatments of ad campaigns.) Finally, we report an empirical simulation study of our algorithms on a publicly-available dataset for online ads.

In Section 2, we establish the theoretical framework by defining the monotonicity assumption, describing the suggested experiment-of-experiments design, and proposing a test for interpreting its results. In Section 3, we explain how this framework can be applied to a real-world setting, by showing that reserve-price experiments on advertising auctions are monotone. Finally, we validate these findings on a Yahoo! ad auction dataset in Section 4.

2 Theory

In this section, we set the notation for the estimand, estimates, and cluster-based randomized designs that we study. We then define the monotonicity assumption, introduce our experiment-of-experiments design, and suggest an approach to analysing its results.

2.1 Cluster-based randomized designs

Let $n$ be the number of experimental units, let the vector $Y \in \mathbb{R}^n$ denote the outcome metric of interest, and let the vector $Z \in \{0,1\}^n$ denote the assignment of units to treatment ($Z_i = 1$) or control ($Z_i = 0$). Recall that under the potential outcomes framework, $Y(Z)$ denotes the potential outcomes of the units under assignment $Z$. Under the Stable Unit Treatment Value Assumption (SUTVA), this simplifies to $Y_i(Z) = Y_i(Z_i)$. The estimand of interest here is the Total Treatment Effect (TTE), defined as the difference of outcomes between one assignment assigning all units to treatment, and another assigning none:

$$\tau \;\triangleq\; \frac{1}{n}\sum_{i=1}^{n}\big[Y_i(\vec{1}) - Y_i(\vec{0})\big]. \qquad (1)$$

A completely randomized (CR) design assigns a fixed number of units chosen completely at random to treatment and the remaining units to control. A clustering $\mathcal{C}$ is a partition of the experimental units into clusters. A cluster-based randomized (CBR) design is a randomized assignment of units to treatment and control at the cluster level: if a cluster is assigned to treatment (resp. control), then all units in that cluster are assigned to treatment (resp. control). We will use the notation $\mathbb{E}_{\mathcal{C}}[\hat{\tau}]$ to denote the expected value of an estimator $\hat{\tau}$ under a $\mathcal{C}$-cluster-based randomized design. Recall that $Z$ represents the assignment of units to treatment and control, resulting from assigning the clusters of $\mathcal{C}$ uniformly at random to treatment or control.

Let $m$ be the number of clusters of $\mathcal{C}$, and let $m_t$ (resp. $m_c$) be the number of clusters assigned to treatment (resp. control). Let $W \in \{0,1\}^m$ be the assignment vector over clusters, where $Z_i = W_{\mathcal{C}(i)}$. In practice, we will use the Horvitz-Thompson (HT) estimator, defined below:

$$\hat{\tau}_{HT} \;\triangleq\; \frac{1}{n}\sum_{i=1}^{n} Y_i(Z)\left(\frac{Z_i}{m_t/m} - \frac{1 - Z_i}{m_c/m}\right). \qquad (2)$$

Under SUTVA, the HT estimator is an unbiased estimator of the total treatment effect under any $\mathcal{C}$-CBR assignment (Middleton and Aronow, 2011): $\mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] = \tau$.

When SUTVA does not hold, this property is no longer guaranteed, and $\hat{\tau}_{HT}$ may be biased. Our objective is to minimize the bias, defined below, with respect to the clustering $\mathcal{C}$, without assuming any explicit knowledge of the interference mechanism or the value of the estimand $\tau$:

$$B(\mathcal{C}) \;\triangleq\; \big|\,\mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] - \tau\,\big|. \qquad (3)$$
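As a concrete illustration of the estimator in Eq. 2, the following Python sketch draws a cluster-based assignment and computes the HT estimate from observed outcomes. It is a minimal sketch with hypothetical variable names, not code from the paper; the synthetic outcomes assume a true effect of 0.5 and no interference.

import numpy as np

def cbr_assignment(clusters, rng):
    # Assign half the clusters to treatment uniformly at random, and propagate the
    # cluster-level assignment W to the unit-level assignment Z.
    ids = np.unique(clusters)
    w = (rng.permutation(len(ids)) < len(ids) // 2).astype(int)
    lookup = dict(zip(ids, w))
    return np.array([lookup[c] for c in clusters])

def horvitz_thompson(y, z, clusters):
    # HT estimate of the total treatment effect under a cluster-based design (Eq. 2):
    # each unit is weighted by the inverse of its cluster-level exposure probability.
    ids = np.unique(clusters)
    w = np.array([z[clusters == c][0] for c in ids])
    m, m_t = len(ids), int(w.sum())
    p_t, p_c = m_t / m, (m - m_t) / m
    n = len(y)
    return (y[z == 1].sum() / p_t - y[z == 0].sum() / p_c) / n

rng = np.random.default_rng(0)
clusters = np.repeat(np.arange(10), 20)                       # 10 clusters of 20 units
z = cbr_assignment(clusters, rng)
y = 1.0 + 0.5 * z + rng.normal(scale=0.1, size=len(z))        # SUTVA outcomes, effect = 0.5
print(horvitz_thompson(y, z, clusters))                       # close to 0.5 in expectation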

2.2 A monotonicity assumption

Choosing the partitioning of our experimental units in a way that minimizes the bias of our estimators (cf. Eq. 3) when running a cluster-based experiment is a difficult task: without the ground truth, we cannot observe the bias directly. However, under a specific monotonicity property, common to many randomized experiments, the task of choosing the better of two clusterings becomes straightforward.

Definition 1.

Let $\mathcal{P}$ be the set of all possible clusterings of our units. For a subset $\mathcal{S} \subseteq \mathcal{P}$ of possible clusterings, we say that the interference model is $\mathcal{S}$-increasing if and only if

$$\forall\, \mathcal{C} \in \mathcal{S}, \quad \mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] \le \tau,$$

and it is $\mathcal{S}$-decreasing if and only if

$$\forall\, \mathcal{C} \in \mathcal{S}, \quad \mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] \ge \tau.$$

An $\mathcal{S}$-monotone model is one that is either $\mathcal{S}$-increasing or $\mathcal{S}$-decreasing.

A monotone model is one for which the expectation of the HT estimator is either always a lower bound or always an upper bound of the estimand under any $\mathcal{C}$-CBR design with $\mathcal{C} \in \mathcal{S}$. It is sufficient for $\mathcal{S}$ to contain the partitions we wish to compare: we do not have to prove monotonicity beyond those partitions. Before delving into examples of monotone interference mechanisms, we introduce the following proposition, which highlights why monotonicity is useful for reasoning about bias.

Proposition 1.

If the interference model is $\mathcal{S}$-increasing, then for all $\mathcal{C}_1, \mathcal{C}_2 \in \mathcal{S}$, it holds that

$$\mathbb{E}_{\mathcal{C}_1}[\hat{\tau}_{HT}] \ge \mathbb{E}_{\mathcal{C}_2}[\hat{\tau}_{HT}] \iff B(\mathcal{C}_1) \le B(\mathcal{C}_2).$$

If the interference model is $\mathcal{S}$-decreasing, then for all $\mathcal{C}_1, \mathcal{C}_2 \in \mathcal{S}$, it holds that

$$\mathbb{E}_{\mathcal{C}_1}[\hat{\tau}_{HT}] \le \mathbb{E}_{\mathcal{C}_2}[\hat{\tau}_{HT}] \iff B(\mathcal{C}_1) \le B(\mathcal{C}_2).$$

Proposition 1 is a simple consequence of Definition 1: if we know that two cluster-based estimates are both lower bounds of the estimand, then the greater of the two must be less biased. The same reasoning applies if they both upper-bound the estimand. It is sufficient to compare the expectation of our estimators to determine which is less biased.

The crux of our framework therefore relies on reasoning about monotonicity. Many commonly studied parametric models of interference are in fact monotone. Consider the following linear model of interference (e.g. studied in (Eckles et al., 2017)):

$$Y_i(Z) \;=\; \alpha_i + \beta_i Z_i + \gamma\,\rho_i(Z) + \epsilon_i, \qquad (4)$$

where, for all $i$, $\epsilon_i$ is a noise term independent of $Z$, and $\rho_i(Z)$ is the proportion of $i$'s neighborhood that is treated. This expresses each unit's outcome as a linear function of a fixed effect, a heterogeneous treatment effect, and a network effect proportional to the fraction of its neighborhood that is treated. As shown in the following proposition, this model is monotone.

Proposition 2.

For all $\mathcal{C} \in \mathcal{P}$, let $\bar{\rho}_{\mathcal{C}}$ be the average proportion of a unit $i$'s neighborhood included in its assigned cluster $\mathcal{C}(i)$. Then,

$$\mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] - \tau \;=\; -\,\gamma\,\big(1 - \bar{\rho}_{\mathcal{C}}\big).$$

It follows that if $\gamma \ge 0$, the interference model is $\mathcal{P}$-increasing, otherwise it is $\mathcal{P}$-decreasing.

We can also extend the above to heterogeneous network effect parameters $\gamma_i$. A proof can be found in Section A.

Proposition 3.

Let $\rho_i^{\mathcal{C}}$ be the proportion of unit $i$'s neighborhood included in its assigned cluster $\mathcal{C}(i)$. For all $\mathcal{C} \in \mathcal{P}$,

$$\mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] - \tau \;=\; \frac{1}{n}\sum_{i=1}^{n}\gamma_i\,\big(\rho_i^{\mathcal{C}} - 1\big).$$

It follows that if $\gamma_i \ge 0$ for all $i$, then the interference model is $\mathcal{P}$-increasing, and if $\gamma_i \le 0$ for all $i$, then it is $\mathcal{P}$-decreasing. If the sign of $\gamma_i$ is not consistent across units, then monotonicity depends on the clustering: if all units with a given sign are perfectly clustered ($\rho_i^{\mathcal{C}} = 1$), e.g. all units with $\gamma_i \le 0$, then the mechanism is once again monotone.
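To build intuition for Propositions 2 and 3, the following sketch simulates the linear model of Eq. 4 on a ring graph with contiguous clusters and a common network effect. The setup (graph, parameter values, Bernoulli cluster assignment) is a hypothetical example of our own; it checks numerically that the average HT estimate falls short of the estimand by roughly $\gamma(1 - \bar{\rho}_{\mathcal{C}})$.

import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 20                        # 200 units on a ring, 20 contiguous clusters of 10
alpha, beta, gamma = 1.0, 0.5, 0.8    # fixed effect, direct effect, network effect
neighbors = [((i - 1) % n, (i + 1) % n) for i in range(n)]
clusters = np.arange(n) // (n // k)

def outcomes(z):
    # Linear interference model of Eq. 4 with a homogeneous network effect gamma.
    rho = np.array([np.mean([z[j] for j in neighbors[i]]) for i in range(n)])
    return alpha + beta * z + gamma * rho

def ht(y, z):
    # HT estimate with Bernoulli(1/2) cluster-level exposure probabilities.
    return (y[z == 1].sum() / 0.5 - y[z == 0].sum() / 0.5) / n

tau = np.mean(outcomes(np.ones(n)) - outcomes(np.zeros(n)))      # = beta + gamma
estimates = []
for w in rng.integers(0, 2, size=(2000, k)):                     # cluster-level coin flips
    z = w[clusters]
    estimates.append(ht(outcomes(z), z))
rho_bar = np.mean([np.mean([clusters[j] == clusters[i] for j in neighbors[i]])
                   for i in range(n)])
print(tau, np.mean(estimates), tau - gamma * (1 - rho_bar))      # bias close to -gamma * (1 - rho_bar)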

More sophisticated interference mechanisms, without an immediate parametric form, are also monotone. For example, we show that the interference mechanism present in reserve price experiments in an advertiser auction setting is monotone (under certain conditions). See Section 3 for more details. For these complex interference mechanisms, it can also be easier to establish the following sufficient (but not necessary) condition:

Proposition 4.

We say an interference mechanism verifies the self-excitation property for a set of partitions $\mathcal{S}$ if, for all units $i$ and partitions $\mathcal{C} \in \mathcal{S}$,

$$\mathbb{E}_{\mathcal{C}}\big[Y_i(Z) \mid Z_i = 1\big] \le Y_i(\vec{1}) \quad \text{and} \quad \mathbb{E}_{\mathcal{C}}\big[Y_i(Z) \mid Z_i = 0\big] \ge Y_i(\vec{0}).$$

An $\mathcal{S}$-self-exciting mechanism is $\mathcal{S}$-increasing. A self-de-exciting mechanism, with the inequalities flipped, is $\mathcal{S}$-decreasing.

The proof is included in Section A. The two inequalities capture the following phenomenon: conditioned on a unit's treatment status, if its outcome is greatest when its neighborhood is entirely in treatment, and lowest when its neighborhood is entirely in control, then the experiment always under-estimates the true treatment effect. These inequalities only need to hold in expectation over the assignments $Z$, even though, in practice, we can often show that they hold for every assignment (cf. Section 3).

We say the interference mechanism is self-exciting because these inequalities are verified when units benefit from being surrounded by units in treatment. A successful messaging feature launch is a straightforward example of a self-exciting process, as is any pricing mechanism that penalizes treated bidders and boosts the utility of their competitors.

2.3 An experiment-of-experiments design

Figure 1: A hierarchical experimental design, which assigns the experimental units to one of two cluster-based randomized designs, $\mathcal{C}_1$ and $\mathcal{C}_2$, completely at random (CR). $\hat{\tau}_1$ and $\hat{\tau}_2$ represent the treatment effect estimates under each design respectively.

Under monotonicity, Proposition 1 states that we can determine the least-biased of two $\mathcal{S}$-increasing or $\mathcal{S}$-decreasing cluster-based designs, without knowledge of the estimand, by comparing the expectation of their estimates. However, only one cluster-based design can ever be applied to the set of experimental units in its entirety, and the comparison of $\mathbb{E}_{\mathcal{C}_1}[\hat{\tau}_{HT}]$ with $\mathbb{E}_{\mathcal{C}_2}[\hat{\tau}_{HT}]$ cannot be done directly.

This resembles the fundamental problem of causal inference, which states that units cannot be placed in both the treatment and control buckets, and is solved through randomization. Inspired by (Saveski et al., 2017; Pouget-Abadie et al., 2017), we suggest randomly assigning different units to either clustering algorithm, resulting in a two-step hierarchical randomized design. The procedure, described in pseudo-code in Algorithm 1, is as follows:

  • Assign units completely at random to two design buckets, one for each clustering algorithm. Let $D \in \{1, 2\}^n$ be the vector representing that assignment.

  • Within each design bucket, cluster the remaining units together according to the appropriate partition: if $D_i = D_j = a$ and units $i$ and $j$ belong to the same cluster of $\mathcal{C}_a$, then $i$ and $j$ belong to the same cluster in design bucket $a$. The resulting partitions are $\mathcal{C}_1'$ and $\mathcal{C}_2'$.

  • Within each design bucket, assign the resulting clusters to treatment and control. Let $Z$ be the resulting assignment vector. This is possible because no unit belongs to both $\mathcal{C}_1'$ and $\mathcal{C}_2'$.

Input: Partitions $\mathcal{C}_1, \mathcal{C}_2$ of the units into clusters.
Output: $Z$, encoding the assignment of each unit to a treatment or control bucket.
Choose $D \in \{1, 2\}^n$ uniformly at random, encoding the assignment of units to design arms $1$ and $2$;
for $a \in \{1, 2\}$ do
      Let $\mathcal{C}_a'$ be the clustering on $\{i : D_i = a\}$ induced by $\mathcal{C}_a$;
      Assign units in design arm $a$ to treatment and control with a $\mathcal{C}_a'$-cluster-based design;
end for
return the resulting assignment vector $Z$;
Algorithm 1: Experiment-of-experiments design
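A minimal Python sketch of Algorithm 1 follows; the function and variable names are ours, not the paper's. Units are split completely at random into two design arms, each candidate partition is restricted to its own arm, and a cluster-level coin flip assigns treatment within each arm. Each arm then yields its own HT estimate, computed only on that arm's units.

import numpy as np

def experiment_of_experiments(c1, c2, rng):
    # c1, c2 map each unit to a cluster id under the two candidate partitions.
    # Returns the design-arm vector D (values 1 or 2) and the treatment vector Z.
    n = len(c1)
    d = np.where(rng.permutation(n) < n // 2, 1, 2)           # arm assignment, completely at random
    z = np.zeros(n, dtype=int)
    for arm, partition in ((1, c1), (2, c2)):
        units = np.where(d == arm)[0]
        induced = partition[units]                            # clustering restricted to this arm
        for c in np.unique(induced):
            z[units[induced == c]] = rng.integers(0, 2)       # cluster-level coin flip
    return d, z

rng = np.random.default_rng(2)
c1 = np.arange(100) // 10                                     # 10 contiguous clusters
c2 = np.arange(100) % 10                                      # 10 interleaved clusters
d, z = experiment_of_experiments(c1, c2, rng)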

Algorithm 1 provides us with two estimates, $\hat{\tau}_1$ and $\hat{\tau}_2$, of the causal effect, one from each design arm. The resulting clusterings $\mathcal{C}_1'$ and $\mathcal{C}_2'$ may be unbalanced. This is of minor importance, as the HT estimator (cf. Eq. 2) is unbiased (under SUTVA) for unbalanced clusterings, and balancedness is required only to control its variance. In practice, $\mathcal{C}_1'$ and $\mathcal{C}_2'$ are not required to have the same number of clusters, but we expect the cluster sizes to be large enough for each cluster to have at least one unit in each design arm after the first stage with high probability.

From the comparison of $\hat{\tau}_1$ and $\hat{\tau}_2$, we seek to order $\mathbb{E}_{\mathcal{C}_1}[\hat{\tau}_{HT}]$ and $\mathbb{E}_{\mathcal{C}_2}[\hat{\tau}_{HT}]$. Under arbitrary interference structures, these proxy estimates are not guaranteed to have the same ordering, the key condition for Proposition 1. Intuitively, $\hat{\tau}_1$ and $\hat{\tau}_2$ represent the treatment effect estimates for two "weakened" versions of each partitioning $\mathcal{C}_1$ and $\mathcal{C}_2$. This is where a completely randomized assignment helps. Because the assignment of units to design arms is done completely at random, it affects each partitioning in the same way, and we expect the ordering to stay the same. We formalize this with the following property:

Property 1.

An interference mechanism is said to be $(\mathcal{C}_1, \mathcal{C}_2)$-transitive if

$$\mathbb{E}_{\mathcal{C}_1}[\hat{\tau}_{HT}] \ge \mathbb{E}_{\mathcal{C}_2}[\hat{\tau}_{HT}] \iff \mathbb{E}[\hat{\tau}_1] \ge \mathbb{E}[\hat{\tau}_2].$$

As a sanity check, we can also confirm that the property holds for SUTVA. The property can also be shown for the linear interference mechanisms introduced in Prop. 3:

Proposition 5.

Under SUTVA, it holds that $\mathbb{E}[\hat{\tau}_1] = \mathbb{E}[\hat{\tau}_2] = \tau$.

Hence, the no-interference case is trivially $(\mathcal{C}_1, \mathcal{C}_2)$-transitive. Furthermore, the linear model of interference in Prop. 3 is $(\mathcal{C}_1, \mathcal{C}_2)$-transitive if the same number of units is assigned to each design arm.

A full proof can be found in Section A. For more complex mechanisms of interference, as is the case for reserve price experiments, we use simulations to confirm the intuition that transitivity holds. See Section 4 for more details.

As is common with A/B tests, we do not have access to the expectation of our estimators, and rely on approximations to the variance, such as Neyman's variance estimator. In order to meaningfully compare the estimates we obtain, we must apply a method of choice to determine when their ordering is significant. For example, we can make a normal approximation to the distribution of the estimates, using Neyman's estimator to upper-bound the variance, to estimate the probability that one estimate is greater than the other at a given significance level:

Proposition 6.

For $a \in \{1, 2\}$, recall the definition of the Neymanian variance estimator for cluster-based randomized designs:

$$\hat{\sigma}_a^2 \;\triangleq\; \frac{m_a^2}{n_a^2}\left(\frac{s_{t,a}^2}{m_{t,a}} + \frac{s_{c,a}^2}{m_{c,a}}\right), \qquad (5)$$

where $m_a$ (resp. $n_a$) is the number of clusters (resp. units) in design arm $a$, $m_{t,a}$ and $m_{c,a}$ are the numbers of treated and control clusters in arm $a$, and $s_{t,a}^2$ and $s_{c,a}^2$ are the sample variances of the cluster-level total outcomes among the treated and control clusters of arm $a$. Assume that the interference mechanism is transitive and $\mathcal{S}$-increasing, such that a larger expected estimate corresponds to a smaller bias. If $\alpha$ is the level of significance chosen, we state that $\mathcal{C}_1$ is a significantly better clustering than $\mathcal{C}_2$ if and only if

$$\Phi\!\left(\frac{\hat{\tau}_1 - \hat{\tau}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}\right) \ge 1 - \alpha,$$

where $\Phi$ is the cdf of the standard normal distribution.

A similar reasoning applies to $\mathcal{S}$-decreasing mechanisms. If the Gaussian approximation is not appropriate, the distribution of the estimators can equally be approximated by a bootstrap analysis, or by a more sophisticated model-based imputation method (Imbens and Rubin, 2015). More details can be found in Section A.
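Under the normal approximation of Prop. 6, the comparison reduces to a one-sided test on the difference of the two arm estimates. The sketch below is a hedged illustration: the variance function uses the standard Neyman form on cluster-level total outcomes, which may differ from the exact constants of Eq. 5, and all names are ours.

import numpy as np
from scipy.stats import norm

def neyman_variance(y, z, clusters):
    # Neyman-style variance estimate for a cluster-based HT estimator, computed from the
    # sample variances of cluster-total outcomes in treatment and control (requires at
    # least two treated and two control clusters).
    ids = np.unique(clusters)
    totals = np.array([y[clusters == c].sum() for c in ids])
    w = np.array([z[clusters == c][0] for c in ids])
    m, n = len(ids), len(y)
    s_t, s_c = totals[w == 1].var(ddof=1), totals[w == 0].var(ddof=1)
    return (m ** 2 / n ** 2) * (s_t / w.sum() + s_c / (m - w.sum()))

def significantly_better(tau_1, tau_2, var_1, var_2, alpha=0.05):
    # For a transitive, increasing mechanism, declare clustering 1 better than clustering 2
    # if the (approximately Gaussian) difference tau_1 - tau_2 is significantly positive.
    z_score = (tau_1 - tau_2) / np.sqrt(var_1 + var_2)
    return norm.cdf(z_score) >= 1 - alpha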

3 Application to reserve price experiments

Online advertising exchanges provide an interface for bidders to participate in a set of auctions for advertising online. These ads can appear within the company's own content, in a social feed, below a search query, or on the webpage of an affiliated publisher. These auctions provide the vast majority of revenue to these platforms, and are thus the subject of experimentation and optimization. Platforms run experiments and monitor different metrics, including revenue and estimates of bidders' welfare. One such welfare metric is the sum of the bids of advertisers; another is the sum of the bidders' estimated utilities, obtained via a separate utility estimator.

One possible parameter subject to optimization is the method of determining reserve prices. Online marketplaces can choose to implement a reserve price, which sets the minimum bid required for a bid to be valid and compete with others. It may vary from bidder to bidder, and from auction to auction. A higher reserve may improve revenue, but if it is too high, then too many bids are discarded and ad opportunities can go unsold.

Modifications to a reserve price rule are prime examples of experiments where SUTVA does not hold. A change in reserve price for one bidder affects the bidding problem facing another bidder, even when her own reserve is unchanged (e.g., reducing competition when the reserve given to the first bidder is higher). Although we ignore them here, budget constraints are another factor: if a budget-constrained bidder faces higher reserve prices, then she may adjust her bids to re-optimize her return on investment. Working without budget constraints, we establish conditions under which the resulting interference mechanism within reserve price experiments is monotone, both in the case of a single-item second-price auction and in the Vickrey-Clarke-Groves auction setting for positional ads. See (Varian and Harris, 2014) for a reference.

3.1 Single-item second price auctions

We consider a single-item second-price auction with $n$ bidders: the highest bidder wins the auction and is charged the maximum of her reserve price and the second-highest bid. The second-price auction is truthful (bidding true values is a dominant-strategy equilibrium), and we will assume that the bidders are rational.

Consider two reserve price mechanisms, $r^c$ (control) and $r^t$ (treatment). Suppose that the reserve price mechanism corresponding to treatment always sets a higher reserve price than the reserve price mechanism corresponding to control: $r^t_i \ge r^c_i$ for every bidder $i$. By symmetry, the following argument would also work if the treatment and control labels were switched.

We suppose the bidders have unobserved values $v_i$ for winning the auction. We randomly assign bidders to either the treatment or control reserve price mechanism, with $Z$ the resulting assignment. The chosen metric of interest is a bidder's utility, denoted by $u_i$. For a second-price auction, $u_i = 0$ if bidder $i$ does not win the auction, and $u_i = v_i - p_i$ when she wins the auction and pays price $p_i$. The bidder welfare of an auction is the sum of each bidder's utility, $\sum_i u_i$, and the estimand is given by:

$$\tau \;=\; \frac{1}{n}\sum_{i=1}^{n}\big[u_i(\vec{1}) - u_i(\vec{0})\big].$$

The reserve price experiment for second-price auctions verifies the self-excitation property (cf. Prop. 4). The idea is that assigning a bidder to the intervention can only make her less competitive, by discarding her bid from the auction. Thus, the higher the number of treated units, the lower the competition for the remaining bidders, and the higher their utility.

Theorem 1.

Consider a set of rational agents with no budget constraints. Let the outcome of interest be each agent's welfare. The interference mechanism of a reserve price experiment, assigning treated units to a higher personalized reserve price, for a single-item second-price auction is self-exciting, and thus monotone.

Proof.

Consider bidder i’s outcome under and under any assignment such that . There are three possible cases:

  • Bidder wins the auction in neither assignment. Her utility is therefore constant.

  • Bidder wins the auction in only one assignment. It must be that bidder wins under but not . Her utility is under and greater than under .

  • Bidder wins the auction under both assignments. If the second highest bid is the same under both assignments, bidder ’s utility is constant. Otherwise, the second highest bid under can only be lower than the second highest bid under . Thus bidder ’s payment is lower and her utility is higher under assignment than under assignment .

By symmetry, we reach a similar conclusion when comparing assignments and any assignment such that . ∎

It follows that the reserve price experiment is $\mathcal{P}$-increasing, and any cluster-based randomized design under-estimates the bidder welfare estimand.
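The case analysis above can be replayed numerically. The sketch below is a toy setup of our own: a single-item second-price auction with personalized reserves and truthful bids. Raising the reserve of one bidder (the treated bidder) never lowers the utility of the other bidders, which is exactly the self-excitation property.

import numpy as np

def second_price_utilities(values, reserves):
    # Single-item second-price auction with personalized reserves and truthful bids.
    # A bid is valid only if it meets the bidder's reserve; the winner pays the maximum
    # of her own reserve and the highest competing valid bid.
    bids = np.where(values >= reserves, values, -np.inf)       # bids below reserve are discarded
    utilities = np.zeros(len(values))
    if not np.any(np.isfinite(bids)):
        return utilities                                       # no valid bid: item goes unsold
    winner = int(np.argmax(bids))
    competing = np.delete(bids, winner)
    second = competing.max() if np.any(np.isfinite(competing)) else 0.0
    price = max(reserves[winner], second, 0.0)
    utilities[winner] = values[winner] - price
    return utilities

values = np.array([1.0, 0.8, 0.6])
control = np.zeros(3)                                          # no reserve for anyone
treated = np.array([0.0, 0.9, 0.0])                            # bidder 1 faces a higher reserve
print(second_price_utilities(values, control))                 # winner pays 0.8
print(second_price_utilities(values, treated))                 # bidder 1 discarded, winner pays 0.6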

3.2 Positional ad auctions

Figure 2: The average click-through rate (CTR) observed in the Yahoo! Search Auction dataset, described in Section 4, is approximately a decreasing and convex function of the slot rank. The confidence intervals were too small to be meaningfully displayed in the figure.

In practice, ad auctions are also multi-item, used for selling more than one ad position on a user's view. We now extend the previous results to a multi-item setting with $K$ items (or "slots"). We assume the common positional ad setting, where each slot $k$ has an inherent click-through rate $\theta_k$, which we can suppose is ordered: $\theta_1 \ge \theta_2 \ge \dots \ge \theta_K$ (Varian, 2007). Each bidder is only ever allocated at most one item, with value $v_i$ for getting a click. As a result, bidder $i$'s utility for winning slot $k$ is $u_i = \theta_k v_i - p_i$, where $p_i$ is the required payment. We assume for simplicity that all bidders have the same ad quality, and thus the same click-through rate for a given ad slot.

The Vickrey-Clarke-Groves (VCG) auction takes place in two parts. First, a value-maximising allocation is chosen (based on bids): here, the highest bids win the highest slots, among the bids that clear their reserves. Bidders are then charged the externality they impose on all other bidders. In other words, assuming that bidder $i$ obtains the $k$-th slot, bidder $i$ pays the value lost by the bidders ranked below her, each of whom is pushed down one slot by her presence, where $r_j$ denotes the reserve imposed on bidder $j$ with value $v_j$. We can prove that the self-excitation property holds under a convexity assumption.

Theorem 2.

Consider a set of rational agents with no budget constraints. Let the outcome of interest be each agent's welfare. The interference mechanism of a reserve price experiment, assigning treated units to a higher personalized reserve price, for a VCG auction in the positional ad setting with no quality effects is self-exciting, and thus monotone, if the click-through rate function is convex: $\theta_{k-1} - \theta_k \ge \theta_k - \theta_{k+1}$ for all slots $k$.

This convexity assumption is verified empirically in the literature and in the Yahoo! auction dataset introduced in Section 4 (cf. Figure 2). (Our own dataset could potentially suffer from endogeneity, where weaker bidders are consistently assigned to lower slots. The assumption is, however, supported elsewhere in the literature (Brooks, 2004; Richardson et al., 2007).) The intuition behind the proof is similar: the more of a bidder's competitors are treated, the fewer of them are able to compete, and thus the higher her utility. We prove this through a case analysis. Let $r_i(Z)$ be the reserve that bidder $i$ faces under assignment vector $Z$: $r_i(Z) = r^t_i$ if $Z_i = 1$ and $r_i(Z) = r^c_i$ otherwise.

Proof.

Consider the outcomes of bidder $i$ under two assignments $Z$ and $Z'$ that differ only in the treatment of one other bidder $l \ne i$: $Z_l = 0$, $Z'_l = 1$, and $Z'_j = Z_j$ for all $j \ne l$. By transitivity, if we can show $u_i(Z) \le u_i(Z')$ for every such pair, then bidder $i$'s utility is largest when all of her competitors are treated and smallest when none are, which is the self-excitation property. There are three possible cases:

  • The allocation of bidders to slots does not change, and thus prices do not change. Bidder $i$'s utility is constant.

  • Bidder $i$ is allocated to slot $k$ under both assignments, but bidder $l$'s bid, with $l$ ranked below $i$, is discarded when $l$ is treated. Discarding that bid shifts the bids below slot $k$ up by one slot, so every displaced bid entering bidder $i$'s payment can only decrease. The difference of bidder $i$'s outcomes under the two treatment assignments is therefore non-positive, hence $u_i(Z) \le u_i(Z')$.

  • Bidder $l$'s bid, with $l$ ranked above $i$, is discarded when $l$ is treated, and thus bidder $i$ is allocated to slot $k-1$ instead of slot $k$. In that case, the difference of bidder $i$'s outcomes can be rearranged into a sum of terms of the form $\big[(\theta_{j-1} - \theta_j) - (\theta_j - \theta_{j+1})\big](v_i - b_j)$, where the index $j$ ranges over the slots below bidder $i$'s. Each individual term of the sum is non-negative: the first factor by convexity of the click-through rates, and the second because bidder $i$ outbids the bidders ranked below her. Hence $u_i(Z) \le u_i(Z')$. ∎
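For completeness, here is a toy implementation of the positional VCG allocation described above. It is a sketch under our simplifying assumptions (identical ad qualities, truthful bids, reserves acting only as bid filters, and payments equal to the externality on lower-ranked bidders); the exact charging rule with reserves used in the paper may differ. In this toy example, raising one bidder's reserve so that her bid is discarded weakly increases every other bidder's utility.

import numpy as np

def positional_vcg_utilities(values, reserves, ctr):
    # Positional VCG sketch (identical ad qualities, truthful bids). Bids below the
    # bidder's personalized reserve are discarded; the remaining bidders are ranked by
    # bid; the bidder in slot r pays the externality she imposes on lower-ranked bidders.
    K = len(ctr)
    theta = np.append(np.asarray(ctr, dtype=float), 0.0)        # theta[K] = 0 sentinel
    valid = np.where(values >= reserves)[0]                      # reserve acts as a bid filter
    order = valid[np.argsort(-values[valid])]                    # ranking of the valid bidders
    ranked = np.append(values[order], np.zeros(K + 1))           # pad so index K always exists
    utilities = np.zeros(len(values))
    for r, i in enumerate(order[:K]):
        payment = sum((theta[j - 1] - theta[j]) * ranked[j] for j in range(r + 1, K + 1))
        utilities[i] = theta[r] * values[i] - payment
    return utilities

ctr = np.array([0.3, 0.2, 0.1])                                  # decreasing, convex CTR curve
values = np.array([1.0, 0.8, 0.6, 0.5])
print(positional_vcg_utilities(values, np.zeros(4), ctr))                       # no reserves
print(positional_vcg_utilities(values, np.array([0.0, 0.0, 0.7, 0.0]), ctr))    # bidder 2 discarded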

4 Experimental Data and Validation

In this section, we validate our design strategy for comparing two graph partitions for the purpose of experimentation under interference, by applying it to an advertising auction dataset released by Yahoo!.

4.1 The Yahoo! Search Auction dataset

Per keyphrase:
    nbr of bids:   min 1, median 2, max 7041
    bid value:     min ¢, median ¢, max $
    impressions:   min 1, median 3, max
    clicks:        min 0, max 7041
Per bidder:
    nbr of bids:   min 1, median 9, max
    bid value:     min ¢, median ¢, max $
    impressions:   min , median , max
    clicks:        min , max
Table 1: Summary statistics for the Yahoo! dataset, aggregated by keyphrase or by bidder, per day, over the entire 4-month period. Bid values are given in USD unless specified otherwise.

is the value of the cumulative distribution function of impressions for a single impression.

The Yahoo! Search Marketing Advertiser Bid-Impression-Click data on competing Keywords dataset is a publicly-available dataset released by Yahoo! (available for download at https://webscope.sandbox.yahoo.com/), containing bid, impression, click, and revenue data between advertiser-keyphrase pairs over a period of 4 months. The advertiser and keyphrase are anonymized, represented as a randomly-chosen string. A sample line of the dataset is reproduced below (the account ID and keyword IDs have been shortened for the sake of exposition in this sample line; the bid value is given in 1/100¢):

day id rank keyphrase bid impress. clicks
1 a3d2 2 f3e4,j6r3, 100.0 1.0 0.0
Table 2: Sample line in the Yahoo! dataset

The dataset contains the bidding activities of the different bidders. There are a total of keywords represented, for a total of unique keyphrases (or lists of keywords). Table 1 contains a series of summary statistics computed over keyphrase-day pairs and bidder-day pairs, namely the total number of bids, the total bid value, the total number of impressions, and the total number of clicks per keyphrase (or per bidder) and per day.

We can represent the Yahoo! dataset by a set of bipartite graphs between bidders, identified by their account_id, and the keyphrases. The bid bipartite graph on a given day draws a weighted edge between every bidder-keyphrase pair such that the bidder bids on the keyphrase on that day, with weight equal to the bid amount. We can aggregate these graphs over the entire time period (4 months) by summing their edge weights together. We can also consider the impression, rank, and click graphs, where the weight of an edge is given respectively by the number of impressions, the rank, or the number of clicks received by the bidder on the keyphrase.

The dataset only provides data aggregated at the granularity of a single day, reporting the average bid and the total numbers of impressions and clicks for each (bidder, keyphrase, day) triplet. Hence, we define a keyphrase-day pair as a single auction, where each bidder's bid is set to the reported average bid for that keyphrase-day pair. For the sake of simplicity, we will only consider a setting with the first few ad positions, which account for the majority of clicks.
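A sketch of how one might load the raw data and group rows into keyphrase-day auctions follows. The column names mirror the sample line in Table 2, but the file name, separator, and exact schema are assumptions about the Webscope download rather than facts from the paper.

import pandas as pd

# Column names follow the sample line in Table 2; the path and separator are assumptions.
cols = ["day", "account_id", "rank", "keyphrase", "avg_bid", "impressions", "clicks"]
df = pd.read_csv("ydata-search-auctions.txt", sep="\t", names=cols)

# Treat each (keyphrase, day) pair as a single auction, with each bidder's bid equal to
# her reported average bid for that pair (cf. Section 4.1).
auctions = {
    key: dict(zip(group["account_id"], group["avg_bid"]))
    for key, group in df.groupby(["keyphrase", "day"])
}

# Aggregate the bid bipartite graph over the whole period: the edge weight between a
# bidder and a keyphrase is the sum of her bids on that keyphrase across days.
bid_graph = df.groupby(["account_id", "keyphrase"])["avg_bid"].sum()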

4.2 Simulating a reserve price experiment

Figure 3: Weighted ratio of edges cut across partitions for successive runs of the R-LDG algorithm on the weighted bid graph, for the two numbers of partitions considered.

While the Yahoo! Search Auction dataset provides us with a set of bidders, keyphrases, and the bids, impressions, and clicks that link them, it does not provide us with an actual intervention on the auction ecosystem. We must therefore simulate the impact of a change in the reserve price given to each bidder.

While many possible units of randomization exist for an auction experiment (keyphrases, bidders, browsers, users, various pairings of these units, etc.), the reserve price experiment we consider randomizes on bidders. On large auction platforms, the reserve price might be set through the application of machine learning methods. In our context, we choose a random non-zero reserve price for each bidder, calibrating the spread of the distribution such that some bidders will not always be able to match the reserve price for all auctions. All bidders assigned to the intervention will face their non-zero reserve price, fixed for every auction for simplicity. All bidders assigned to the control bucket will not face a reserve price.

Within the same auction for a given keyphrase, two participating bidders may face distinct reserves and be assigned to different treatment buckets. A bidder-cluster-based randomized experiment is thus used to mitigate the possible interference between bidders, our units of randomization, within a single auction.

To validate our experiment-of-experiments design, we must find candidate balanced graph partitions to compare, a problem known to be NP-hard even when we slightly relax the balancedness assumption (Andreev and Racke, 2006). In the last several years, there has been good progress in developing scalable distributed balanced partitioning algorithms for graphs with billions of edges (Tsourakakis et al., 2014; Aydin et al., 2016). These algorithms have enabled practitioners to apply large-scale graph mining to large-scale randomized experimental studies (Ugander and Backstrom, 2013; Saveski et al., 2017; Rolnick et al., 2016). Of the numerous heuristic algorithms for finding such partitions, the Restreaming Linear Deterministic Greedy (R-LDG) algorithm (Nishimura and Ugander, 2013) is a popular choice. It consists of repeatedly applying a greedy algorithm, originally proposed in (Stanton and Kliot, 2012), which assigns each node $v$ to one of $k$ partitions according to the following objective:

$$\arg\max_{j \in \{1, \dots, k\}} \; \big|P_j^t \cap N(v)\big| \left(1 - \frac{|P_j^t|}{C_j}\right),$$

where $P_j^t$ is the set of nodes assigned to partition $j$ at step $t$ of the algorithm, $C_j$ is the maximum capacity of partition $j$, and $N(v)$ is the set of neighbors of node $v$ in the graph.

We can apply this clustering algorithm to any of the bipartite graphs introduced in Section 4.1, aggregated over the entire time period, resulting in a set of mixed bidder-keyphrase clusters. The bidder-only clusters are obtained from the previous clustering by simply removing the keyphrase nodes from consideration. The algorithm's objective must be slightly modified to accommodate weighted graphs, by replacing the neighborhood count $|P_j^t \cap N(v)|$ with the weighted sum $\sum_{u \in P_j^t \cap N(v)} w_{uv}$. Furthermore, we must also modify the balance requirement, since only the bidder side of the bipartite graph clustering is required to be balanced. We therefore replace the capacity term $1 - |P_j^t| / C_j$ with $1 - |B_j^t| / C_j^B$, where $B_j^t$ is the set of bidder nodes in partition $j$ at step $t$ and $C_j^B$ is the maximum number of allowed bidder nodes in partition $j$. The final objective is given by:

$$\arg\max_{j \in \{1, \dots, k\}} \; \Big(\sum_{u \in P_j^t \cap N(v)} w_{uv}\Big) \left(1 - \frac{|B_j^t|}{C_j^B}\right).$$
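A sketch of one streaming pass of this modified greedy step is shown below. The data structures (adjacency sets, an edge-weight dictionary, the set of bidder nodes, a per-partition bidder capacity) are hypothetical names of ours; restreaming simply repeats the pass, starting from the previous assignment.

import collections

def rldg_pass(nodes, neighbors, weight, bidder_nodes, k, capacity, assignment=None):
    # One (re)streaming pass of the weighted, bidder-balanced LDG heuristic.
    # neighbors[v]: set of v's neighbors; weight[(u, v)]: bid weight of the edge;
    # bidder_nodes: set of bidder nodes; capacity: max number of bidders per partition.
    assignment = dict(assignment or {})
    bidders_in = collections.Counter(assignment[v] for v in bidder_nodes if v in assignment)
    for v in nodes:
        if v in bidder_nodes and v in assignment:
            bidders_in[assignment[v]] -= 1                      # restreaming: free v's old slot
        def score(j):
            pull = sum(weight.get((u, v), weight.get((v, u), 0.0))
                       for u in neighbors[v] if assignment.get(u) == j)
            return pull * (1.0 - bidders_in[j] / capacity)
        assignment[v] = max(range(k), key=lambda j: (score(j), -bidders_in[j]))
        if v in bidder_nodes:
            bidders_in[assignment[v]] += 1
    return assignment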

Figure 3 plots the proportion of edges cut, weighted by the bid amount, over consecutive runs of the R-LDG algorithm for the two cluster counts considered. We adopt three main vectors of comparison between candidate partitions to determine the efficacy of our proposed experiment-of-experiments design:

  • Quality: comparing partitions of the graph that differ in their estimated quality, for example by looking at the number of edges cut, for a fixed number of clusters. As an extreme example, we will compare a random graph partitioning to a partitioning obtained by running the R-LDG algorithm to convergence.

  • Number of partitions: comparing two partitions of the graph obtained by running the same clustering algorithm for a different number of partitions. As an example, we will consider two R-LDG clusterings obtained with different numbers of clusters.

  • Metric: comparing partitions of the graph that are obtained by applying the same algorithm on different bipartite graphs. As an example, we will compare a R-LDG clustering of the bid graph with an R-LDG clustering of the impressions graph.

The dataset does not provide the budgets of the bidders or their perceived ad quality, hence we will adopt the same simplifying assumptions as in Section 3: no quality effects between bidders and no budget constraints. Furthermore, we assume bids are unchanged as a result of the experiment (which would be valid for rational, non-budget-limited bidders).

4.3 Validating the empirical optimization

Figure 4: Distribution of the expectation of the HT estimator under $\mathcal{C}_1$ and $\mathcal{C}_2$, and under the induced clusterings $\mathcal{C}_1'$ and $\mathcal{C}_2'$. The red segment represents the total treatment effect estimand. (Top) $\mathcal{C}_1$ is a R-LDG clustering, $\mathcal{C}_2$ is a random clustering. (Middle) $\mathcal{C}_1$ and $\mathcal{C}_2$ are R-LDG clusterings with different numbers of partitions. (Bottom) $\mathcal{C}_1$ is a R-LDG clustering of the bid graph, whereas $\mathcal{C}_2$ is a R-LDG clustering of the impressions graph.

We first compare a partitioning of the graph obtained by running the modified R-LDG algorithm (cf. Section 4.2) against a completely random balanced partitioning of the graph. We fix a subset of auctions with few bidders per auction, in order to showcase the framework in a setting where the monotonicity and transitivity properties can be established and there is a clear difference between the two clusterings. The reduction in cut size, measured by the ratio of the weighted sum of inter-cluster edges over the sum of all edge weights, over the iterations of the algorithm is shown in Figure 3: the weighted cut drops well below its random-partition level within a few iterations of the R-LDG algorithm.
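The cut-quality metric used here and in Figure 3 can be computed directly from the aggregated bid graph; a small helper of our own follows.

def weighted_cut_ratio(edges, assignment):
    # Fraction of the total edge weight that crosses partitions.
    # edges: iterable of (u, v, weight) triples; assignment: dict node -> partition id.
    cut = total = 0.0
    for u, v, w in edges:
        total += w
        if assignment[u] != assignment[v]:
            cut += w
    return cut / total if total > 0 else 0.0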

We validate the monotonicity assumption, as well as the transitivity assumption, for reserve price experiments. In Figure 4 (top), we plot four distributions as well as the Total Treatment Effect estimand (cf. Eq. 1), obtained by taking the difference between assigning all units to a higher reserve price and assigning none. Namely, we plot the distribution of the HT estimator's expectation (cf. Eq. 2) under each cluster-based design, $\mathbb{E}_{\mathcal{C}_1}[\hat{\tau}_{HT}]$ and $\mathbb{E}_{\mathcal{C}_2}[\hat{\tau}_{HT}]$, where $\mathcal{C}_1$ is the R-LDG clustering and $\mathcal{C}_2$ is the random clustering. We also plot the distribution of the expectation of the experiment-of-experiments (EoE) estimators, $\mathbb{E}[\hat{\tau}_1]$ and $\mathbb{E}[\hat{\tau}_2]$.

We find that they all under-estimate the true treatment effect, as expected from the $\mathcal{P}$-increasing property. As expected, the HT estimator is more biased under the random clustering than under the R-LDG clustering. Furthermore, we find that the transitivity property holds (cf. Property 1): the EoE estimate of the "random" design also under-estimates the total treatment effect more severely than the EoE estimate of the "R-LDG" design.

We repeat the experiment to compare two R-LDG clusterings obtained with different numbers of partitions (cf. Figure 4 (middle)). We find that one of the two clusterings is less biased but exhibits higher variance, and that the transitivity property holds. Finally, in Figure 4 (bottom), we compare a clustering of the impressions bipartite graph with a clustering of the bid bipartite graph. The transitivity property is again verified, and moreover we see that clustering the bid bipartite graph may be a better heuristic in this setting, although the difference between the two clusterings is very slight. The code is available for download at https://jean.pouget-abadie.com/kdd2018code.

5 Discussion

We showed that, under a certain monotonicity assumption, we can determine which of two clusterings yields the least biased estimator by running an experiment-of-experiments design. We noted that commonly-studied parametric models of interference verify this monotonicity assumption. Moreover, we proved that the interference mechanism resulting from the impact of a reserve price experiment on social utility is monotone, and hence our framework applies. Finally, we validated our framework on a simulated reserve price experiment, grounded in a publicly-available Yahoo! search ad dataset.

There are several questions worth investigating that we did not tackle in this paper. Notably, while we explored the case of rational bidders participating in positional ad auctions without budget constraints or quality effects to establish monotonicity, can these assumptions be relaxed or generalized? What other kinds of experiments are monotone (or self-exciting)? Is it possible to generalize Theorem 2 to other Vickrey-Clarke-Groves auctions, Generalized Second Price auctions, or budgeted bidders? Can the monotonicity assumption be validated empirically, either through an experimental design or an observational data study? It seems randomized saturation designs (Baird et al.) would be a good place to start for testing monotonicity experimentally. Finally, our framework relied on the transitivity of the experiment-of-experiments estimators: namely, that they conserve the ordering of the expectations of the estimators under each clustering. Whilst we validated this assumption either theoretically (cf. Prop. 5) or through simulation (cf. Section 4.3), can we characterize the clustering-experiment pairs that are transitive, and can the assumption be tested empirically?

Appendix A Proofs

A.1 Proof of Propositions 2 and 3

Assume that $Y_i(Z) = \alpha_i + \beta_i Z_i + \gamma_i\,\rho_i(Z) + \epsilon_i$, where $\rho_i(Z)$ is the proportion of unit $i$'s neighborhood that is treated. Recall the definition of the estimand: $\tau = \frac{1}{n}\sum_i \big[Y_i(\vec{1}) - Y_i(\vec{0})\big]$. Plugging in the expression for $Y_i$, we obtain $\tau = \frac{1}{n}\sum_i (\beta_i + \gamma_i)$. The estimator is given by $\hat{\tau}_{HT} = \frac{1}{n}\sum_i Y_i(Z)\big(\tfrac{Z_i}{m_t/m} - \tfrac{1-Z_i}{m_c/m}\big)$, where $m_t$ (resp. $m_c$) is the number of clusters in treatment (resp. control). Plugging in the expression for $Y_i$ and taking the expectation over the cluster-level assignment, we obtain

$$\mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] \;=\; \frac{1}{n}\sum_{i=1}^{n}\big(\beta_i + \gamma_i\,\rho_i^{\mathcal{C}}\big),$$

where $\rho_i^{\mathcal{C}}$ is the proportion of unit $i$'s neighborhood included in its assigned cluster. We obtain the desired result by taking the difference between these two quantities. Prop. 2 follows by substituting $\gamma_i = \gamma$ for all $i$.

A.2 Proof of Proposition 4

The proposition can be established by rewriting the definition of $\mathcal{S}$-increasing interference mechanisms,

$$\mathbb{E}_{\mathcal{C}}[\hat{\tau}_{HT}] \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{\mathcal{C}}\big[Y_i(Z) \mid Z_i = 1\big] - \mathbb{E}_{\mathcal{C}}\big[Y_i(Z) \mid Z_i = 0\big]\Big) \;\le\; \frac{1}{n}\sum_{i=1}^{n}\big(Y_i(\vec{1}) - Y_i(\vec{0})\big) \;=\; \tau,$$

such that a sufficient condition for the model to be $\mathcal{S}$-increasing is $\mathbb{E}_{\mathcal{C}}[Y_i(Z) \mid Z_i = 1] \le Y_i(\vec{1})$ and $\mathbb{E}_{\mathcal{C}}[Y_i(Z) \mid Z_i = 0] \ge Y_i(\vec{0})$ for all units $i$. If increasing the number of treated units in a unit's neighborhood increases that unit's outcome, holding that unit's treatment assignment constant, then the two previous inequalities hold.

A.3 Proof of Proposition 5

Recall that for $a \in \{1, 2\}$, our estimator can be written as

$$\hat{\tau}_a \;=\; \frac{1}{n_a}\sum_{i\,:\,D_i = a} Y_i(Z)\left(\frac{Z_i}{m_{t,a}/m_a} - \frac{1 - Z_i}{m_{c,a}/m_a}\right),$$

where $m_{t,a}$ (resp. $m_{c,a}$) is the number of treated (resp. control) clusters in design arm $a$, $m_a$ is the total number of clusters in arm $a$, and $n_a$ is the number of units in design arm $a$. We begin by first considering the no-interference case. We have that, conditionally on the design-arm assignment $D$, $\mathbb{E}[\hat{\tau}_a \mid D]$ is the average treatment effect over the units of arm $a$. By the law of iterated expectations over $D$, we have $\mathbb{E}[\hat{\tau}_1] = \mathbb{E}[\hat{\tau}_2] = \tau$.

We now consider the linear model suggested in Eq. 4, where we assume heterogeneous network effects $\gamma_i$. From the proof of Proposition 3, we have that

$$\mathbb{E}[\hat{\tau}_a \mid D] \;=\; \frac{1}{n_a}\sum_{i\,:\,D_i = a}\big(\beta_i + \gamma_i\,\rho_i^{\mathcal{C}_a'}\big),$$

where $\rho_i^{\mathcal{C}_a'}$ is the proportion of unit $i$'s neighborhood included in its induced cluster. Note that, in expectation over $D$, each neighbor of $i$ falls in arm $a$ with probability close to $n_a/n$, so that $\mathbb{E}_D[\rho_i^{\mathcal{C}_a'} \mid D_i = a]$ is proportional to $\rho_i^{\mathcal{C}_a}$ with a factor depending only on $n_a$ and $n$. It follows that, if $n_1 = n_2$, the difference $\mathbb{E}[\hat{\tau}_1] - \mathbb{E}[\hat{\tau}_2]$ is a positive multiple of $\mathbb{E}_{\mathcal{C}_1}[\hat{\tau}_{HT}] - \mathbb{E}_{\mathcal{C}_2}[\hat{\tau}_{HT}]$, so the two orderings coincide. We conclude that the linear model of interference is transitive.

A.4 Discussion for Proposition 6

Under unspecified models of interference, theoretical bounds on the power of even the simplest randomized experiment are hard to come by. While the joint assumption of monotonicity and transitivity allows us to design a sensible test for detecting the better of two partitions, it is not sufficient to bound the test's power without stronger assumptions. We thus rely on simulations, like the ones run in Section 4, or theoretical approximations, like the one suggested in Prop. 6. It approximates $\hat{\tau}_a$, for $a \in \{1, 2\}$, by two independently-distributed Gaussian variables of mean $\mathbb{E}[\hat{\tau}_a]$ and variance $\hat{\sigma}_a^2$, given in Eq. 5. Their difference therefore has the distribution $\mathcal{N}\big(\mathbb{E}[\hat{\tau}_1] - \mathbb{E}[\hat{\tau}_2],\; \hat{\sigma}_1^2 + \hat{\sigma}_2^2\big)$. Recall that Neyman's variance estimator is an upper bound of the true variance, under SUTVA, in expectation over the assignment (cf. (Imbens and Rubin, 2015)). We prove in the lemma below that this still holds true for a hierarchical assignment.

Lemma 1.

Under SUTVA, Neymann’s variance estimator is an upper-bound in expectation of the true variance of the HT estimator:

Proof.

By Eve’s law, . From (Imbens and Rubin, 2015), the first term can is equal to:

where , the cluster-level outcomes. The second term can be shown to be equal to .

Since , we must prove: . This follows from an application of the Cauchy-Schwarz inequality for balanced clusters: , where are the cluster sizes, equal to in the balanced case. ∎

In order to determine the better of two clusterings, we can perform two one-sided t-tests. A Bayesian alternative is to compute the posterior distribution of the difference of the two estimates, using a conjugate Gaussian prior. In order to assess the impact of assuming the two estimates are independent Gaussians, we suggest running a sensitivity analysis, by considering the result of the test for different values of the correlation coefficient.

References

  • AdE [2018] Ad exchange auction model. https://support.google.com/adxseller/answer/152039?hl=en&ref_topic=2904831, February 2018.
  • Andreev and Racke [2006] Konstantin Andreev and Harald Racke. Balanced graph partitioning. Theory of Computing Systems, 39(6):929–939, 2006.
  • Aydin et al. [2016] Kevin Aydin, MohammadHossein Bateni, and Vahab S. Mirrokni. Distributed balanced partitioning via linear embedding. In WSDM, 2016.
  • [4] Sarah Baird, J Aislinn Bohren, Craig McIntosh, and Berk Ozler. Designing experiments to measure spillover effects. SSRN:2505070.
  • Basse et al. [2016] Guillaume W Basse, Hossein Azari Soufiani, and Diane Lambert. Randomization and the pernicious effects of limited budgets on auction experiments. In Artificial Intelligence and Statistics, pages 1412–1420, 2016.
  • Brooks [2004] Nico Brooks. The atlas rank report: How search engine rank impacts traffic. Insights, Atlas Institute Digital Marketing, 2004.
  • Donner and Klar [2004] Allan Donner and Neil Klar. Pitfalls of and controversies in cluster randomization trials. American Journal of Public Health, 94(3):416–422, 2004.
  • Eckles et al. [2016] Dean Eckles, René F Kizilcec, and Eytan Bakshy. Estimating peer effects in networks with peer encouragement designs. PNAS, 113(27):7316–7322, 2016.
  • Eckles et al. [2017] Dean Eckles, Brian Karrer, and Johan Ugander. Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference, 5(1), 2017.
  • Gui et al. [2015] Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. Network a/b testing: From sampling to estimation. In WWW, 2015.
  • Imbens and Rubin [2015] Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • Middleton and Aronow [2011] Joel A Middleton and Peter M Aronow. Unbiased estimation of the average treatment effect in cluster-randomized experiments. SSRN:1803849, 2011.
  • Nishimura and Ugander [2013] Joel Nishimura and Johan Ugander. Restreaming graph partitioning: simple versatile algorithms for advanced balancing. In KDD, 2013.
  • Pouget-Abadie et al. [2017] Jean Pouget-Abadie, Martin Saveski, Guillaume Saint-Jacques, Weitao Duan, Ya Xu, Souvik Ghosh, and Edoardo M Airoldi. Testing for arbitrary interference on experimentation platforms. arXiv:1704.01190, 2017.
  • Richardson et al. [2007] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the click-through rate for new ads. In WWW, 2007.
  • Rolnick et al. [2016] David Rolnick, Kevin Aydin, Shahab Kamali, Vahab S. Mirrokni, and Amir Najmi. Geocuts: Geographic clustering using travel statistics. CoRR, abs/1611.03780, 2016. URL http://arxiv.org/abs/1611.03780.
  • Saveski et al. [2017] Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo Airoldi. Detecting network effects: Randomizing over randomized experiments. In KDD, 2017.
  • Stanton and Kliot [2012] Isabelle Stanton and Gabriel Kliot. Streaming graph partitioning for large distributed graphs. In KDD, 2012.
  • Tsourakakis et al. [2014] Charalampos E. Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. FENNEL: streaming graph partitioning for massive scale graphs. In WSDM, 2014.
  • Ugander and Backstrom [2013] Johan Ugander and Lars Backstrom. Balanced label propagation for partitioning massive graphs. In WSDM, pages 507–516, 2013.
  • Ugander et al. [2013] Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. Graph cluster randomization: Network exposure to multiple universes. In KDD, 2013.
  • Varian [2007] Hal R Varian. Position auctions. International Journal of industrial Organization, 25(6):1163–1178, 2007.
  • Varian and Harris [2014] Hal R Varian and Christopher Harris. The vcg auction in theory and practice. American Economic Review, 104(5):442–45, 2014.
  • Walker and Muchnik [2014] David Walker and Lev Muchnik. Design of randomized experiments in networks. Proceedings of the IEEE, 102(12):1940–1951, 2014.