Experimental Design in Two-Sided Platforms: An Analysis of Bias

by Ramesh Johari, et al.
Stanford University

We develop an analytical framework to study experimental design in two-sided platforms. In the settings we consider, customers rent listings; rented listings are occupied for some amount of time, then become available. Platforms typically use two common designs to study interventions in such settings: customer-side randomization (CR), and listing-side randomization (LR), along with associated estimators. We develop a stochastic model and associated mean field limit to capture dynamics in such systems, and use our model to investigate how performance of these estimators is affected by interference effects between listings and between customers. Good experimental design depends on market balance: we show that in highly demand-constrained markets, CR is unbiased, while LR is biased; conversely, in highly supply-constrained markets, LR is unbiased, while CR is biased. We also study a design based on two-sided randomization (TSR) where both customers and listings are randomized to treatment and control, and show that appropriate choices of such designs can be unbiased in both extremes of market balance, and also yield low bias in intermediate regimes of market balance.




1 Introduction

We develop a framework to study experiments (also known as A/B tests) that two-sided platform operators routinely employ to improve the platform. Experiments are used to test all types of interventions that affect the interactions between participants in the market; examples include features that change the process by which buyers search for sellers, or interventions that alter the information the platform shares with buyers about sellers. We are particularly motivated by marketplaces where customers do not purchase goods, but rather rent (or book) them for some amount of time. This covers a broad array of platforms, e.g., lodging (e.g., Airbnb and Booking.com), freelancing (e.g., Upwork), and many services (tutoring, dogwalking, child care, etc.). While we explicitly model such a rental platform, the model we describe also captures features of a platform where goods are bought, and supply must be replenished for future demand.

Our model consists of a fixed number of listings; customers arrive sequentially over (continuous) time. For example, on a lodging site, listings include hotel rooms, private rooms, houses for rent, etc.; and customers are travelers looking to book. In online labor platforms, a freelancer offering work is a listing, and a client looking to hire a freelancer is a customer. Naturally, an arriving customer can only rent available listings (i.e., those that are not currently rented). The customer forms their consideration set from available listings and then, according to a choice model, chooses which listing to rent from this set (including an outside option). We allow the choice set formation process, the utility of a customer for a listing, and the utility of a customer for the outside option to be heterogeneous across listings and customers. In our paper, we employ the multinomial logit choice model; however, since we allow for arbitrary heterogeneity, this admits quite a general class of demand models. Once a listing is rented, it is occupied and becomes unavailable until the end of the occupancy time.

In this paper, we consider interventions by the platform that change the parameters governing the choice probability of the customer, such as those described above; we refer to the new choice parameters as the treatment model, and the baseline as the control model. (Footnote 1: The same framework that we employ in this paper can be used to consider interventions that change other parameters, such as customer arrival rates or the time that listings remain occupied when rented; such application is outside the scope of our current work.) We assume the platform wants to use an experiment to assess the difference between the rate at which rentals would occur if all choices were made according to the treatment parameters (the global treatment condition), and the corresponding rate if all choices were made according to the control parameters (the global control condition). This is the global treatment effect, or GTE. In particular, we imagine the quantity of interest is the steady-state (or long-run) GTE, i.e., the long-run average difference in rental rates. (Footnote 2: Our framework can also be used to evaluate other metrics of interest based on experimental outcomes; for simplicity we focus on rate of rental in this work.)

Most platforms employ one of two simple designs for testing such an intervention: either customer-side randomization (what we call the CR design) or listing-side randomization (what we call the LR design). In the CR design, customers are randomized to treatment or control. All customers in treatment make choices according to the treatment choice model, and all customers in control make choices according to the control choice model. In the LR design, listings are randomized to treatment or control, and the utility of a listing is then determined by its treatment condition. As a result, in the LR design, in general each arriving customer will consider some listings in the treatment condition and some listings in the control condition. As an example, suppose the platform decides to test an intervention that shows badges for certain listings. In the CR design, all treatment customers see the badges, and no control customers see the badges. In the LR design, all customers see the badges on treated listings, and do not see them on control listings.

Each of these designs is associated with a natural estimator. In the CR design, the platform measures the rate of rental by treatment customers, and compares to the rate of rental by control customers; this is what we call the naive CR estimator. In the LR design, the platform measures the rate at which treatment listings are rented, and compares to the rate at which control listings are rented; this is what we call the naive LR estimator.

To develop some intuition for the potential biases, first consider an idealized static setting where listings are instantly replenished upon being rented; in other words, every arriving customer sees the full set of listings as available. As a result, in the CR design there is no interference between treatment and control customers, and consequently the CR estimator is unbiased for the true GTE. On the other hand, in the LR design, every arriving customer considers both treatment and control listings when choosing whether to rent, creating a linkage across listings through customer choice. In other words, in the LR design there is interference between treatment and control, and in general the LR estimator will be biased for the true GTE.
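The static intuition can be checked on a small numerical instance. The sketch below assumes a homogeneous market in which a customer books with probability a / (eps + a), where a is the aggregate attractiveness of the listings the customer sees and eps is the value of the outside option; the treatment raises a. The function names, the parameter values, and the per-unit-mass normalization of the listing-side estimator are illustrative assumptions, not definitions taken from the paper.

```python
def book_prob(attract, eps):
    """Probability that a customer books when total attractiveness is `attract`."""
    return attract / (eps + attract)


def static_example(a_ctrl=1.0, a_treat=2.0, eps=1.0, frac_listings=0.5):
    # Global treatment effect per arriving customer.
    gte = book_prob(a_treat, eps) - book_prob(a_ctrl, eps)

    # Customer-side randomization: each customer faces a market that is
    # entirely treatment or entirely control, so the two groups do not interact.
    cr = book_prob(a_treat, eps) - book_prob(a_ctrl, eps)

    # Listing-side randomization: every customer faces a mix of listings,
    # so treated and control listings compete inside the same choice.
    mix_t = frac_listings * a_treat
    mix_c = (1.0 - frac_listings) * a_ctrl
    denom = eps + mix_t + mix_c
    # Assumed normalization: rentals per unit mass of listings in each group.
    lr = (mix_t / denom) / frac_listings - (mix_c / denom) / (1.0 - frac_listings)
    return gte, cr, lr


gte, cr, lr = static_example()
print(f"GTE={gte:.4f}  CR={cr:.4f}  LR={lr:.4f}")
```

With these illustrative values the customer-side comparison reproduces the global effect (about 0.167), while the listing-side comparison returns 0.4, because treated listings gain bookings partly at the expense of control listings.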

Now return to our dynamic model, where the limited inventory of listings is enforced, i.e., listings remain unavailable for some time after rental. In this case, observe that on top of the preceding discussion, there is a dynamic linkage between customers: the set of listings available for consideration by a customer depends on the listings considered and rented by previously arriving customers. This dynamic effect introduces a new form of bias into estimation, and its analysis is a distinctive feature of our work. In particular, because of this dynamic bias, in general the naive CR estimator will be biased as well.

Our paper develops a dynamic model of two-sided markets with inventory dynamics, and uses this framework to compare and contrast both the designs and estimators above, as well as a novel class of more general designs based on two-sided randomization (of which the two examples above are special cases). In more detail, our contributions and the organization of the paper are as follows.

Benchmark model and formal mean field limit. Our first main contribution is to develop a general, flexible theoretical model to capture the dynamics described above. In Section 3, we present a model that yields a continuous-time Markov chain in which the state at any given time is the number of currently available listings of each type. In Section 4, we then suggest a formal mean field analog of this continuous-time Markov chain, by considering a limit where the number of listings in the system approaches infinity. Scaling by the number of listings yields a continuum mass of listings in the limit. In the mean field model, the state at a given time is the mass of available listings, and this mass evolves via a system of ODEs. Using a Lyapunov argument, we show this system is globally asymptotically stable, and give a succinct characterization of the resulting asymptotic steady state of the system as the solution to an optimization problem.

Designs and estimators: Two-sided, customer-side, and listing-side randomization. In Section 5, we develop a more general form of experimental design, called two-sided randomization (TSR); an analogous idea was independently proposed recently by [2] (see also Section 2). In a TSR design, both customers and listings are randomized to treatment and control. However, the intervention is only applied when a treatment customer considers a treatment listing; otherwise, if the customer is in control or the listing is in control, the intervention is not seen by the customer. (In the example above, a customer would see the badge on a listing only if the customer were treated and the listing were treated.) Notably, the CR and LR designs are the special cases of TSR where all listings are treated (CR), or all customers are treated (LR). We also define natural naive estimators for each design.

Analysis of bias: The role of market balance. Finally, in Sections 6 and 7, we study the bias of the different designs and estimators proposed. Our main theoretical results characterize how the bias depends on the relative volume of supply and demand in the market. In particular, in the highly demand-constrained regime (where customers arrive slowly and/or listings replenish quickly), the model approaches the static benchmark described above: the naive CR estimator becomes unbiased, while the naive LR estimator is biased. On the other hand, in the highly supply-constrained regime (where customers arrive rapidly and/or listings replenish slowly), the dynamic bias above suggests a more complicated story. Remarkably, however, we find that in this regime the naive LR estimator becomes unbiased, while the naive CR estimator is biased. We show how to interpret these findings via examples in Section 6.

Given these findings, it is natural to ask whether good performance can be achieved in moderately balanced markets by "interpolating" between the naive CR and LR estimators. We show that a naive TSR estimator achieves this interpolation, and also propose a more sophisticated TSR estimator that exhibits substantially improved performance in numerical examples. This latter estimator explicitly aims to correct for interference in regimes of moderate market balance.

Motivated by the common practice of running short-run experiments, we also study the transient behavior of these different designs and estimators. We note that in highly demand-constrained markets, the naive CR estimator is unbiased even in the transient phase; informally, this is because all arriving customers see the same (full) set of available listings. In general, however, numerical investigation of transient performance reveals that the best design can vary depending on market balance and the time horizon of interest. More generally, when studying these designs and estimators there will be an important tradeoff between reducing bias and increasing variance. We leave these directions for future work.

Taken together, our work sheds light on what experimental designs and associated estimators should be used by two-sided platforms depending on market conditions, to alleviate the biases from interference that arise in such contexts. We view our work as a starting point towards a comprehensive framework for experimental design in two-sided platforms; we discuss some directions for future work in Section 8.

2 Related work

SUTVA. The types of interference described in these experiments are violations of the Stable Unit Treatment Value Assumption (SUTVA) in causal inference [11]. SUTVA requires that the (potential outcome) observation on one unit should be unaffected by the particular assignment of treatments to the other units. A large number of recent works have investigated experiment design in the presence of interference, particularly in the context of markets and social networks.

Interference in marketplaces. Biases from interference can be large: [5] empirically show in an auction experiment that the presence of interference among bidders caused the estimate of the treatment effect to be wrong by a factor of two. This evidence is corroborated by [9], who similarly finds through simulations that a marketplace experiment changing search and recommendation algorithms can be off by a factor of two. Inspired by the goal of reducing such bias, other work has developed approaches to bias characterization and reduction both theoretically (e.g., as in [4] in the context of auctions with budgets), as well as via simulation (e.g., as in [10] who explores the performance of designs).

Our work complements this line, by developing a mathematical framework for the study of estimation bias in dynamic platforms. Key to our analysis is the use of a mean field model to model both transient and steady-state behavior of experiments. A related approach is taken in [16], where a mean field analysis is used to study equilibrium effects of an experimental intervention where treatment is incrementally applied in a marketplace (e.g., a small pricing change).

Interference in social networks. The bulk of the literature on experimental design with interference considers interference that arises through some underlying social network: e.g., [12] studies the identification of treatment responses under interference; [15] introduces a graph cluster based randomization scheme and analyzes the bias and variance of the design; and many other papers, including [1, 3, 13], focus on estimating the spillover effects created by interference. In general, our work is distinct because the interference pattern is endogenous to the experiment, and dynamically evolving over time.

Other experimental designs. In practice, platforms currently mitigate the effects of interference either through clustering techniques that change the unit of observation to reduce spillovers among units [7], similar to some of the works mentioned above (e.g., [10, 15]), or through switchback testing [14], in which the treatment is turned on and off over time. Both cause a substantial increase in estimation variance due to a reduction in effective sample size, and thus the naive CR and LR designs remain popular workhorses in the platform experimentation toolkit.

Two-sided randomization. Finally, a closely related paper is [2]. Independently of our own work, the authors there propose a more general multiple randomized design of which TSR is a special case. They focus on a static model and provide an elegant and complete statistical analysis under a local interference assumption. By contrast, we focus on a dynamic platform with market-wide interference patterns, and on a mean field analysis of bias.

3 A Markov chain model of platform dynamics

In this section, we first introduce the basic dynamic platform model that we study in this paper, with a finite number $N$ of listings. In the next section, we describe a formal mean field limit of our model inspired by the regime where $N \to \infty$, which we use as the substrate for the remainder of our analysis in the paper. This mean field limit model then serves as the framework within which we study the bias of different experimental designs and associated estimators.

We consider a two-sided platform where we refer to the supply side as listings and the demand side as customers. There are a fixed number of listings in the marketplace. Customers arrive over time and at the time of arrival, the customer can choose from the set of available listings in the market and decide whether to rent the corresponding listing. If the customer chooses a listing, then they rent the listing for a random time period during which it is unavailable for other customers. At the end of this rental, the listing again becomes available for use for other customers.

The formal details of our model are as follows. Note: we use boldface to denote vectors throughout the paper.

Time. The system evolves in continuous time $t \geq 0$.

Listings. The system consists of a fixed number $N$ of listings. We refer to "the $N$'th system" as the instantiation of our model with $N$ listings present. We use a superscript "$(N)$" to denote quantities in the $N$'th system where appropriate. Let $m^{(N)}(\theta)$ denote the total number of listings of type $\theta$ in the $N$'th system. For each $\theta$, we assume that $\lim_{N \to \infty} m^{(N)}(\theta)/N = \rho(\theta) > 0$. Note that $\sum_{\theta} \rho(\theta) = 1$.

We allow for heterogeneity in the listings. Each listing $\ell$ has a type $\theta_\ell \in \Theta$, where $\Theta$ is a finite set (the listing type space). Note that in general, the type may encode both observable and unobservable covariates; in particular, our analysis does not presume that the platform is completely informed about the type of each listing. For example, in a lodging site $\theta_\ell$ may encode observed characteristics of a house such as the number of bedrooms, but also characteristics that are unobserved by the platform because they may be difficult or impossible to measure. (Footnote 3: Our analysis does not consider improved estimation via use of observed covariates; this remains an interesting direction for future investigation.)

State description. At each time $t$, each listing can be either available (i.e., available for rent) or occupied (i.e., occupied by a customer who previously rented it). The system state at time $t$ in the $N$'th system is described by $\boldsymbol{\sigma}^{(N)}_t = (\sigma^{(N)}_t(\theta))_{\theta \in \Theta}$, where $\sigma^{(N)}_t(\theta)$ denotes the number of listings of type $\theta$ available in the system at time $t$. Let $S^{(N)}_t = \sum_{\theta} \sigma^{(N)}_t(\theta)$ be the total number of listings available for rent at time $t$. In our subsequent development, we develop a model that makes $\boldsymbol{\sigma}^{(N)}_t$ a continuous-time Markov process.

Customers. Customers arrive to the platform sequentially and decide whether or not to rent, and if so, which type of listing to rent. Each customer $j$ has a type $\gamma_j \in \Gamma$, where $\Gamma$ is a finite set (the customer type space) that represents customer heterogeneity. As with listings, the type may encode both observable and unobservable covariates, and again, our analysis does not presume that the platform is completely informed about the type of each customer. Customers of type $\gamma$ arrive according to a Poisson process of rate $\lambda^{(N)}_\gamma$; these processes are independent across types. Let $\lambda^{(N)} = \sum_{\gamma} \lambda^{(N)}_\gamma$ be the total arrival rate of customers. Let $T_j$ denote the arrival time of the $j$'th customer.

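We assume that $\lim_{N \to \infty} \lambda^{(N)}/N = \lambda > 0$, and that for each $\gamma$, we have $\lim_{N \to \infty} \lambda^{(N)}_\gamma / \lambda^{(N)} = \phi_\gamma > 0$. Note that $\sum_{\gamma} \phi_\gamma = 1$.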

Consideration sets. In practice, when customers arrive to a platform, they typically form a consideration set of possible listings to rent; the initial formation of the consideration set may depend on various aspects of the search and recommendation algorithms employed by the platform. To simplify the model, we capture this process by assuming that on arrival, each listing of type $\theta$ available at time $t$ is included in the arriving customer's consideration set independently with probability $\alpha_\gamma(\theta) > 0$ for a customer of type $\gamma$. For example, $\alpha_\gamma(\theta)$ can capture the possibility that the platform's search ranking is more likely to highlight available listings of type $\theta$ that are more attractive for a customer of type $\gamma$, making these listings more likely to be part of the customer's consideration set; this effect is made clear via our choice model presented below. After the consideration set is formed, a choice model is then applied to the consideration set to determine whether a booking (if any) is made.

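Formally, the customer choice process unfolds as follows. Suppose that customer $j$ arrives at time $T_j$. For each listing $\ell$, let $C_{j\ell} = 0$ if the listing is unavailable at $T_j$. Otherwise, if listing $\ell$ is available, then let $C_{j\ell} = 1$ with probability $\alpha_{\gamma_j}(\theta_\ell)$, and let $C_{j\ell} = 0$ with probability $1 - \alpha_{\gamma_j}(\theta_\ell)$, independently of all other randomness. Then the consideration set of customer $j$ is $\{\ell : C_{j\ell} = 1\}$.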

Customer choice. Customers choose at most one listing to rent; they can also choose not to rent at all. We assume that customers have a utility for each listing that depends on its type: a type $\gamma$ customer has utility $v_\gamma(\theta) > 0$ for a type $\theta$ listing. (Note that all utilities are positive.) Let $q_{j\ell}$ denote the probability that arriving customer $j$ of type $\gamma_j$ rents listing $\ell$ of type $\theta_\ell$.

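In this paper we assume that customers make choices according to the well-known multinomial logit choice model. In particular, given the realization of $\mathbf{C}_j = (C_{j\ell})_{\ell}$, we have:

$$q_{j\ell} = \frac{C_{j\ell}\, v_{\gamma_j}(\theta_\ell)}{\epsilon^{(N)}_{\gamma_j} + \sum_{\ell'=1}^{N} C_{j\ell'}\, v_{\gamma_j}(\theta_{\ell'})}. \tag{1}$$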


Here $\epsilon^{(N)}_\gamma > 0$ is the value of the outside option for type $\gamma$ customers in the $N$'th system. In particular, the probability that customer $j$ does not book any listing at all grows with $\epsilon^{(N)}_{\gamma_j}$. We let the outside option scale with $N$; this is motivated by the observation that in practical settings, the probability a customer does not make a rental should remain bounded away from zero even for very large systems. In particular, we assume that $\lim_{N \to \infty} \epsilon^{(N)}_\gamma / N = \epsilon_\gamma > 0$.
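The consideration-set and multinomial logit steps described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the list-based representation of listings, and the numerical values are assumptions made for the example.

```python
import random


def sample_consideration(available, incl_prob, rng):
    """0/1 indicators: an available listing is considered with probability incl_prob[l]."""
    return [1 if (available[l] and rng.random() < incl_prob[l]) else 0
            for l in range(len(available))]


def mnl_probs(consider, utility, outside):
    """Multinomial logit over the considered listings plus an outside option.

    Returns (list of rental probabilities per listing, probability of no rental).
    """
    denom = outside + sum(c * u for c, u in zip(consider, utility))
    rent = [c * u / denom for c, u in zip(consider, utility)]
    return rent, outside / denom


rng = random.Random(0)
consider = sample_consideration([True, False, True], [0.9, 0.9, 0.9], rng)
rent, no_rental = mnl_probs(consider, [1.0, 2.0, 3.0], outside=4.0)
```

A listing that is unavailable or not sampled into the consideration set gets rental probability zero, and a larger outside option raises the probability of no rental, matching the description in the text.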

For later reference, we define:

$$q_j(\theta) = \mathbb{E}\Big[\sum_{\ell : \theta_\ell = \theta} q_{j\ell}\Big], \tag{2}$$


where the expectation is over the randomness in $\mathbf{C}_j$. With this definition, $q_j(\theta)$ is the probability that customer $j$ rents an available listing of type $\theta$, where the probability is computed prior to realization of the consideration set.

Dynamics: A continuous-time Markov chain. The system evolves as follows. Initially all listings are available. (Footnote 4: As the system we study is irreducible and we analyze its steady state behavior, it would not matter if we chose a different initial condition.) Every time a customer arrives, the choice process described above unfolds. Any occupied listing remains occupied (and therefore unavailable for further rental) for an exponential time with parameter $\tau > 0$, independent of all other randomness. Once this time expires, the listing returns to being available.

For simplicity in our analysis, we treat $\tau$ as a constant. We note here that it is straightforward to generalize all our analysis to the case where $\tau$ also depends on listing type, i.e., each listing type $\theta$ has a parameter $\tau(\theta)$ that governs how long the listing is occupied once rented. We omit the details of this generalization in favor of simplicity of presentation. An even more general model might allow $\tau$ to depend on both listing type and the type of the customer who made the rental; such a generalization remains an interesting open direction.

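Our preceding specification turns $\boldsymbol{\sigma}^{(N)}_t$ into a continuous-time Markov process on a finite state space $\mathcal{S}^{(N)} = \{\boldsymbol{\sigma} : \sigma(\theta) \in \{0, 1, \ldots, m^{(N)}(\theta)\} \text{ for all } \theta\}$. We now describe the transition rates of this Markov process. For a state $\boldsymbol{\sigma} \in \mathcal{S}^{(N)}$, $\sigma(\theta)$ represents the number of available listings of type $\theta$.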

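There are only two types of transitions possible: either (i) a listing that is currently occupied becomes available, or (ii) a customer arrives, and rents a listing that is currently available. (If a customer arrives but does not rent anything, the state of the system is unchanged.) Let $\mathbf{e}_\theta$ denote the unit basis vector in the direction $\theta$, i.e., $e_\theta(\theta) = 1$, and $e_\theta(\theta') = 0$ for $\theta' \neq \theta$. The rate of the first type of transition is:

$$R(\boldsymbol{\sigma}, \boldsymbol{\sigma} + \mathbf{e}_\theta) = \big(m^{(N)}(\theta) - \sigma(\theta)\big)\, \tau, \tag{3}$$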


since there are $m^{(N)}(\theta) - \sigma(\theta)$ booked listings of type $\theta$, and each remains occupied for an exponential time with mean $1/\tau$, independently of all other randomness.

The second type of transition requires some more steps to formulate. In principle, our choice model suggests that the identity of both the arriving customer and individual listings affect system dynamics; however, our state description only tracks the aggregate number of listings of each type available at each time $t$. The key here is that our entire specification depends on customers only through their type, and depends on listings only through their type.

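Formally, suppose a customer $j$ of type $\gamma$ arrives to find the system in state $\boldsymbol{\sigma}$. For each $\theta$ let $D_\gamma(\theta \mid \boldsymbol{\sigma})$ be a Binomial$(\sigma(\theta), \alpha_\gamma(\theta))$ random variable, independently across $\theta$. Recall that for each available listing $\ell$, each $C_{j\ell}$ is a Bernoulli$(\alpha_\gamma(\theta_\ell))$ random variable. Recalling $q_j(\theta)$ as defined in (2), it is straightforward to check that:

$$q_j(\theta) = \mathbb{E}\left[\frac{D_\gamma(\theta \mid \boldsymbol{\sigma})\, v_\gamma(\theta)}{\epsilon^{(N)}_\gamma + \sum_{\theta'} D_\gamma(\theta' \mid \boldsymbol{\sigma})\, v_\gamma(\theta')}\right] \triangleq r^{(N)}_\gamma(\theta \mid \boldsymbol{\sigma}). \tag{4}$$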


In other words, the probability an arriving customer of type $\gamma$ rents a listing of type $\theta$ when the state is $\boldsymbol{\sigma}$ is given by $r^{(N)}_\gamma(\theta \mid \boldsymbol{\sigma})$; and this probability depends on the past history only through the state $\boldsymbol{\sigma}$ (ensuring the Markov property holds).

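With this definition at hand, for states $\boldsymbol{\sigma}$ with $\sigma(\theta) > 0$, the rate of the second type of transition is:

$$R(\boldsymbol{\sigma}, \boldsymbol{\sigma} - \mathbf{e}_\theta) = \sum_{\gamma} \lambda^{(N)}_\gamma\, r^{(N)}_\gamma(\theta \mid \boldsymbol{\sigma}). \tag{5}$$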


Note that the resulting Markov chain is irreducible, since customers have positive probability of sampling into, and renting from, their consideration set, and every listing in the consideration set has positive probability of being rented.
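The two transition types can be simulated directly with a standard Gillespie-style loop. The sketch below is a simplified illustration that assumes a single listing type and a single customer type; the function name, argument names, and parameter values are hypothetical and chosen only for the example.

```python
import random


def simulate_rentals(n_listings, arrival_rate, incl_prob, utility, outside,
                     release_rate, horizon, seed=0):
    """Simulate the availability chain with one listing type and one customer type.

    Returns (rentals per unit time, number of listings available at the end).
    """
    rng = random.Random(seed)
    available = n_listings          # all listings start available
    t, rentals = 0.0, 0
    while True:
        occupied = n_listings - available
        total_rate = arrival_rate + release_rate * occupied
        t += rng.expovariate(total_rate)   # time to the next event
        if t > horizon:
            break
        if rng.random() < arrival_rate / total_rate:
            # Customer arrival: sample the consideration set, then apply logit choice.
            considered = sum(rng.random() < incl_prob for _ in range(available))
            p_rent = considered * utility / (outside + considered * utility)
            if rng.random() < p_rent:
                available -= 1
                rentals += 1
        else:
            # One occupied listing finishes its rental and becomes available.
            available += 1
    return rentals / horizon, available


rate, available = simulate_rentals(50, 50.0, 0.5, 2.0, 50.0, 1.0, horizon=20.0)
```

Because listings of the same type are exchangeable, tracking only the count of available listings is enough; the number of considered listings is then a binomial draw, which is the aggregation used for the second transition type.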

Steady state. Since the Markov process defined above is irreducible on a finite state space, there is a unique steady state distribution $\pi^{(N)}$ on $\mathcal{S}^{(N)}$ for the process. In terms of this steady state distribution, note that

$$\sum_{\boldsymbol{\sigma} \in \mathcal{S}^{(N)}} \pi^{(N)}(\boldsymbol{\sigma}) \sum_{\theta} \sigma(\theta)$$

is the steady state expected number of available listings. Thus we refer to this quantity as the steady state availability in the $N$'th system, and we refer to $N$ minus this quantity as the steady state occupancy in the $N$'th system.

4 A mean field model of platform dynamics

The continuous-time Markov process described in the preceding section is challenging to analyze directly because the customers' choices involving consideration sets induce complex dynamics. Instead, to make progress we consider a formal mean field limit of that process, motivated by the regime where $N \to \infty$, in which the evolution of the system becomes deterministic. We do not prove the mean field limit in this paper, though we conjecture that using relatively standard techniques such a limit can be established. Instead, in this section we formally describe a fluid model via a system of ordinary differential equations (ODEs) that is the analogue of the finite system Markov process.

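We consider a continuum model with a unit mass of listings. The total mass of listings of type $\theta$ in the system is $\rho(\theta)$ (recall that $\sum_{\theta} \rho(\theta) = 1$). We represent the state at time $t$ by $\mathbf{s}_t = (s_t(\theta))_{\theta \in \Theta}$; $s_t(\theta)$ represents the mass of listings of type $\theta$ available at time $t$. The state space for this model is:

$$\mathcal{S} = \{\mathbf{s} : 0 \leq s(\theta) \leq \rho(\theta) \text{ for all } \theta \in \Theta\}.$$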

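We first present the intuition behind our mean field model. Consider a state $\mathbf{s} \in \mathcal{S}$ with $s(\theta) > 0$ for all $\theta$. We view this state as analogous to a state $\boldsymbol{\sigma} \approx N \mathbf{s}$ in the $N$'th system. We consider the system dynamics defined by (3)-(5). Note that the rate at which occupied listings of type $\theta$ become available is $(m^{(N)}(\theta) - \sigma(\theta))\tau$, from (3). If we divide by $N$, then this rate becomes $(\rho(\theta) - s(\theta))\tau$ as $N \to \infty$. On the other hand, note that for large $N$, if $D_\gamma(\theta \mid \boldsymbol{\sigma})$ is Binomial$(\sigma(\theta), \alpha_\gamma(\theta))$, then $D_\gamma(\theta \mid \boldsymbol{\sigma})/N$ concentrates on $\alpha_\gamma(\theta) s(\theta)$. Thus the choice probability $r^{(N)}_\gamma(\theta \mid \boldsymbol{\sigma})$ becomes approximately:

$$p_\gamma(\theta \mid \mathbf{s}) \triangleq \frac{\alpha_\gamma(\theta)\, v_\gamma(\theta)\, s(\theta)}{\epsilon_\gamma + \sum_{\theta'} \alpha_\gamma(\theta')\, v_\gamma(\theta')\, s(\theta')}. \tag{6}$$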


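(Here we use the fact that $\epsilon^{(N)}_\gamma / N \to \epsilon_\gamma$ as $N \to \infty$.) This is the mean field multinomial logit choice model for our system. Now the rate at which listings of type $\theta$ become occupied is $\sum_{\gamma} \lambda^{(N)}_\gamma\, r^{(N)}_\gamma(\theta \mid \boldsymbol{\sigma})$, from (5). If we divide by $N$, this rate becomes $\lambda \sum_{\gamma} \phi_\gamma\, p_\gamma(\theta \mid \mathbf{s})$ as $N \to \infty$.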
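The concentration step behind the mean field choice probability can be checked numerically. The sketch below assumes a single listing type and a single customer type, with an outside option that scales linearly in the number of listings; the function names and parameter values are illustrative assumptions.

```python
import random


def finite_choice_prob(n, s, incl_prob, utility, eps, draws, rng):
    """Monte Carlo estimate of E[D v / (n eps + D v)], D ~ Binomial(round(n s), incl_prob)."""
    avail = round(n * s)
    total = 0.0
    for _ in range(draws):
        d = sum(rng.random() < incl_prob for _ in range(avail))
        total += d * utility / (n * eps + d * utility)
    return total / draws


def mean_field_choice_prob(s, incl_prob, utility, eps):
    """Limit obtained by replacing D / n with its mean incl_prob * s."""
    x = incl_prob * utility * s
    return x / (eps + x)


rng = random.Random(1)
for n in (20, 200, 2000):
    approx = finite_choice_prob(n, 0.6, 0.5, 2.0, 1.0, 200, rng)
    print(n, round(approx, 4), mean_field_choice_prob(0.6, 0.5, 2.0, 1.0))
```

As the number of listings grows, the simulated rental probability should settle near the mean field value, since the binomial count divided by the number of listings has vanishing variance.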

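Inspired by the preceding observations, we define the following system of differential equations for the evolution of $\mathbf{s}_t$:

$$\frac{d}{dt} s_t(\theta) = \big(\rho(\theta) - s_t(\theta)\big)\tau - \lambda \sum_{\gamma} \phi_\gamma\, p_\gamma(\theta \mid \mathbf{s}_t), \quad \theta \in \Theta. \tag{7}$$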


This is our formal mean field model. In the remainder of this section, we show that this system has a unique solution for any initial condition; and further, by constructing an appropriate Lyapunov function, we show that there exists a unique limit point to which all trajectories converge (regardless of initial condition). This limiting point is the unique steady state of the mean field limit, and can be used as a large system approximation of the steady state of the $N$'th finite system. (Footnote 5: Establishing such a result rigorously requires showing that steady state can be interchanged with the limit as $N \to \infty$; again, we conjecture such a result can be proven using relatively standard techniques, but we omit any further technical development of such a result in this version of the paper.)
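A forward Euler integration gives a quick way to explore the mean field dynamics numerically. The sketch below assumes the balance form described above (a release term proportional to the occupied mass, minus a booking term given by the arrival rate times the logit choice probability); the dictionary-based layout, the function names, and the step size are illustrative assumptions rather than the paper's implementation.

```python
def choice_prob(s, incl, util, eps_g, theta):
    """Mean field logit probability that one customer type rents a type-theta listing."""
    denom = eps_g + sum(incl[t] * util[t] * s[t] for t in s)
    return incl[theta] * util[theta] * s[theta] / denom


def rhs(s, rho, phi, incl, util, eps, lam, release_rate):
    """Time derivative of the available mass of each listing type."""
    return {
        th: (rho[th] - s[th]) * release_rate
            - lam * sum(phi[g] * choice_prob(s, incl[g], util[g], eps[g], th)
                        for g in phi)
        for th in s
    }


def integrate(s0, rho, phi, incl, util, eps, lam, release_rate, dt=0.01, steps=5000):
    """Forward Euler; the state is clipped to [0, rho] to guard against step overshoot."""
    s = dict(s0)
    for _ in range(steps):
        d = rhs(s, rho, phi, incl, util, eps, lam, release_rate)
        s = {th: min(rho[th], max(0.0, s[th] + dt * d[th])) for th in s}
    return s


# One listing type "h", one customer type "c", all parameters equal to 1.
rho, phi = {"h": 1.0}, {"c": 1.0}
incl, util, eps = {"c": {"h": 1.0}}, {"c": {"h": 1.0}}, {"c": 1.0}
s_inf = integrate({"h": 1.0}, rho, phi, incl, util, eps, lam=1.0, release_rate=1.0)
```

In this one-type instance the steady state solves (1 - s) = s / (1 + s), that is s^2 + s - 1 = 0, so the trajectory should approach (sqrt(5) - 1) / 2, roughly 0.618.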

4.1 Existence and uniqueness of mean field trajectory

First, we show the straightforward result that the system of ODEs defined in (7) possesses a unique solution. This follows by an elementary application of the Picard-Lindelöf theorem from the theory of differential equations. The proof is in Appendix A.

Proposition 1.

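Fix an initial state $\hat{\mathbf{s}} \in \mathcal{S}$. The system (7) has a unique solution $\{\mathbf{s}_t : t \geq 0\}$ satisfying $s_t(\theta) \geq 0$ and $s_t(\theta) \leq \rho(\theta)$ for all $t \geq 0$ and $\theta \in \Theta$, and $\mathbf{s}_0 = \hat{\mathbf{s}}$.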

4.2 Existence and uniqueness of mean field steady state

Next, we show that the system of ODEs in (7) has a unique limit point, to which all trajectories converge regardless of the initial condition. We refer to this as the steady state of the mean field system. We prove the result via the use of a convex optimization problem; the objective function of this problem is a Lyapunov function for the mean field dynamics that guarantees global asymptotic stability of the steady state.

Formally, we have the following result. The proof is in Appendix A.

Theorem 1.

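There exists a unique steady state $\mathbf{s}^* \in \mathcal{S}$ for (7), i.e., a unique vector $\mathbf{s}^* \in \mathcal{S}$ solving the following system of equations:

$$\big(\rho(\theta) - s^*(\theta)\big)\tau = \lambda \sum_{\gamma} \phi_\gamma\, p_\gamma(\theta \mid \mathbf{s}^*), \quad \theta \in \Theta. \tag{8}$$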


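This limit point has the property that $0 < s^*(\theta) < \rho(\theta)$ for all $\theta$, i.e., it is in the interior of $\mathcal{S}$. Further, this limit point is globally asymptotically stable, i.e., all trajectories $\mathbf{s}_t$ converge to $\mathbf{s}^*$ as $t \to \infty$, for any initial condition $\mathbf{s}_0 \in \mathcal{S}$.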

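The limit point $\mathbf{s}^*$ is the unique solution to the following optimization problem:

$$\text{minimize} \quad W(\mathbf{s}) \triangleq \lambda \sum_{\gamma} \phi_\gamma \log\Big(\epsilon_\gamma + \sum_{\theta} \alpha_\gamma(\theta)\, v_\gamma(\theta)\, s(\theta)\Big) - \tau \sum_{\theta} \rho(\theta) \log s(\theta) + \tau \sum_{\theta} s(\theta) \tag{9}$$

$$\text{subject to} \quad 0 \leq s(\theta) \leq \rho(\theta), \quad \theta \in \Theta. \tag{10}$$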

The function $W$ appearing in the theorem statement is not convex; our proof proceeds by first noting that it suffices to restrict attention to $\mathbf{s}$ such that $s(\theta) > 0$ for all $\theta$, then making the transformation $y(\theta) = \log s(\theta)$. The objective function redefined in terms of these transformed variables is strictly convex, and this allows us to establish the desired result.
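The link between the optimization problem and the steady state can be checked numerically in a one-type example. The objective below is an assumed single-type form, chosen so that its first-order condition is exactly the balance between the release rate and the booking rate; it is a reconstruction for illustration and should not be read as the paper's exact statement. The search is carried out over the logarithm of the state, mirroring the change of variables used in the proof.

```python
import math


def objective(s, lam=1.0, release_rate=1.0, rho=1.0, eps=1.0, a=1.0):
    """Assumed one-type objective; `a` stands for inclusion probability times utility."""
    return (lam * math.log(eps + a * s)
            - release_rate * rho * math.log(s)
            + release_rate * s)


def argmin_log_scale(f, lo=-20.0, hi=0.0, iters=200):
    """Ternary search over y = log(s); valid when f(exp(y)) is convex in y."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(math.exp(m1)) < f(math.exp(m2)):
            hi = m2
        else:
            lo = m1
    return math.exp((lo + hi) / 2.0)


s_star = argmin_log_scale(objective)
# Residual of the steady-state balance (rho - s) * rate = lam * a s / (eps + a s).
residual = (1.0 - s_star) - s_star / (1.0 + s_star)
```

With all parameters equal to 1, the minimizer should coincide with the root of s^2 + s - 1 = 0 on the unit interval, and the balance residual should vanish up to the numerical precision of the search.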

5 Experiments: Designs and estimators

In this section, we leverage the framework developed in the previous section to undertake a study of experimental designs a platform might employ to test interventions in the marketplace. For simplicity, we focus on interventions that change the choice probability of one or more types of customers for one or more types of listings, and we assume the platform is interested in estimating the resulting rate at which rentals take place. However, we believe the same approach we employ here can be applied to study other types of interventions and platform objectives as well.

Formally, the platform’s goal is to design experiments with associated estimators to assess the performance of the intervention (the treatment), relative to the status quo (the control). In particular, the platform is interested in determining the steady-state rate of rental when the entire market is in the treatment condition (i.e., global treatment), compared to the steady-state rate of rental when the entire market is in the control condition (i.e., global control). We refer to the difference of these two rates as the global treatment effect. It is important to emphasize that this is a steady-state quantity as typically a platform is interested in the long-run effect of an intervention.

Two types of canonical experimental designs are employed in practice: listing-side randomization (denoted LR) and customer-side randomization (denoted CR). In the former design, listings are randomized to treatment or control; in the latter design, customers are randomized to treatment or control. Each design also has an associated natural "naive" estimator of rental probability. As we discuss, these estimators will typically be biased, due to interference effects.

The LR and CR designs are special cases of a more general two-sided randomization (TSR) design, where both listings and customers are randomized to treatment and control simultaneously. (TSR designs were also independently introduced and studied in recent work by [2]; see Section 2 for discussion.) In the next subsection we develop the relevant formalism for these designs; we then define natural "naive" estimators that are commonly used for the LR and CR designs, as well as an interpolation between these two as an estimator for a TSR design. In the remainder of the paper we study the bias of these different designs and estimators under different market conditions.

5.1 Experimental design

Treatment condition. We consider a binary treatment: every customer and listing in the market will either be in treatment or control. (Generalization of our model to more than two treatment conditions is relatively straightforward.) We model the treatment condition by expanding the set of customer and listing types. For every customer type , we create two new customer types ; and for every listing type , we create two new listing types . The types are control types; the types are treatment types.

Two-sided randomization. We assume that a fraction of customers are randomized to treatment, and a fraction are randomized to control, independently; and we assume that a fraction of listings are randomized to treatment, and a fraction are randomized to control, independently. This is the two-sided randomization (TSR) design: randomization takes place on both sides of the market simultaneously.
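The independent randomization on each side of the market can be sketched in a few lines. This is a minimal illustration for a finite population, not the mean field model itself; the function name `assign_two_sided` and the parameter names `a_c` and `a_l` (the treatment fractions for customers and listings) are hypothetical labels chosen for this sketch.

```python
import random

def assign_two_sided(n_customers, n_listings, a_c, a_l, seed=0):
    """Independently assign each customer and each listing to
    treatment (True) or control (False).

    a_c and a_l are assumed names for the treatment fractions on the
    customer side and the listing side, respectively.
    """
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so a fraction of 1.0 treats everyone.
    customers = [rng.random() < a_c for _ in range(n_customers)]
    listings = [rng.random() < a_l for _ in range(n_listings)]
    return customers, listings

# Treating every listing leaves randomization on the customer side only.
customers, listings = assign_two_sided(10_000, 10_000, a_c=0.5, a_l=1.0)
```

Setting one of the two fractions to 1 removes randomization on that side, which is how the one-sided designs arise as special cases.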

Treatment as a choice probability shift. Examples of interventions that platforms may wish to test include the introduction of higher quality photos for a hotel listing on a lodging site, or showing previous job completion rates of a freelancer on an online labor market. These interventions change the choice probability of listings by customers. In particular, we continue to assume the multinomial logit choice model, and we assume that for a type customer and a type listing that have been given the intervention, the utility becomes ; the utility of the outside option becomes ; and the probability of inclusion in the consideration set becomes .

In the designs that we consider, a key feature is that the intervention is applied only when a treated customer interacts with a treated listing. For example, when an online labor marketplace decides to show previous job completion rates of a freelancer as an intervention, only treated customers can see these rates, and they only see them when they consider treated freelancers. We model this by redefining quantities in the experiment as follows:


This definition is a natural way to incorporate randomization on each side of the market. However, we remark here that it is not necessarily canonical; for example, an alternate design would be one where the intervention is applied when either the customer has been treated or the listing has been treated. Even more generally, the design might randomize whether the intervention is applied, based on the treatment condition of the customer and the listing. In all likelihood, the relative advantages of these designs would depend not only on the bias they yield in any resulting estimators, but also on the variance characteristics of those estimators. We leave further study and comparison of these designs to future work.
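The rule that the intervention is applied only when a treated customer meets a treated listing can be sketched as follows. This is a simplified illustration under stated assumptions: a finite set of available listings, a multinomial logit form in which a listing with utility u is chosen with probability u divided by the outside option value plus the sum of utilities, and an intervention that shifts only the listing utility. The names `effective_utility`, `choice_probabilities`, `v`, `v_tilde`, and `eps` are hypothetical, and the consideration-set probability and outside-option shift mentioned above are omitted.

```python
def effective_utility(customer_treated, listing_treated, v, v_tilde):
    """Utility seen by a customer for a listing: the treated value
    v_tilde applies only when BOTH sides are in treatment."""
    return v_tilde if (customer_treated and listing_treated) else v

def choice_probabilities(customer_treated, listings_treated, v, v_tilde, eps):
    """Multinomial-logit-style choice probabilities over the available
    listings; eps is the (assumed) value of the outside option."""
    utils = [effective_utility(customer_treated, t, v, v_tilde)
             for t in listings_treated]
    denom = eps + sum(utils)
    return [u / denom for u in utils]

# A control customer sees control utilities even for a treated listing.
p_control = choice_probabilities(False, [True, False], v=1.0, v_tilde=2.0, eps=1.0)
# A treated customer sees the intervention only on the treated listing.
p_treated = choice_probabilities(True, [True, False], v=1.0, v_tilde=2.0, eps=1.0)
```

The alternate design mentioned above, where the intervention applies when either side is treated, would correspond to replacing `and` with `or` in `effective_utility`.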

Customer-side and listing-side randomization. Two special cases of the TSR design are as follows. When , all listings are in the treatment condition; in this case, randomization only takes place on the customer side of the market. This is the customer-side randomization (CR) design. When , all customers are in the treatment condition; in this case, randomization only takes place on the listing side of the market. This is the listing-side randomization (LR) design.

System dynamics. With the specification above, it is straightforward to adapt our mean field system of ODEs, cf. (7), and the associated choice model (6), to this setting. The key changes are as follows:

  1. The mass of control (resp., treatment) listings of type (resp., ) becomes (resp., ). In other words, abusing notation, we define , and .

  2. The arrival rate of control (resp., treatment) customers of type (resp., ) becomes (resp., ). Thus we define , and .

  3. The choice probabilities are defined as in (6), with the relevant quantities defined according to (11)-(13).

Using Proposition 1 and Theorem 1, we know that there exists a unique solution to the resulting system of ODEs; and that there exists a unique limit point to which all trajectories converge, regardless of initial condition. This limit point is the steady state for a given experimental design. For a TSR experiment with treatment customer fraction , and treatment listing fraction , we use the notation to denote the ODE trajectory, and we use to denote the steady state.

Rate of rental. In our subsequent development, it will be useful to have a shorthand notation for the rate at which rentals of listings of treatment condition are made by customers of treatment condition , in the interval . In particular, we define:


Further, since is globally asymptotically stable, bounded, and converges to as , we have:


Global treatment effect. Recall we assume the steady-state rate of rental is the quantity of interest to the platform. In particular, the platform is interested in the change in this rate from the global control condition () to the global treatment condition ().

In the global control condition, the steady state rate at which guests rent is: , and in the global treatment condition, the steady state rate at which guests rent is . Thus the global treatment effect is .

We remark that the rate of rental decisions made by arriving customers will change over time, even if the market parameters are constant over time (including the arrival rates of different customer types, as well as the utilities that customers have for each listing type). This transient change in rental rates is driven by changes in the state ; in general, such fluctuations will lead the transient rate of rental to differ from the steady-state rate, for all values of and (including global treatment and global control). It is for this reason that we specifically aim to measure the global treatment effect as a comparison of the steady state behavior in the global treatment and global control counterfactual worlds, to capture, informally, the long run change in behavior due to an intervention.

5.2 Estimators: Transient and steady state

Thus the goal of the platform is to use the experiment to estimate the GTE. In this section we consider estimators the platform might use to estimate this quantity. We first consider the CR and LR designs, and we define "naive" estimators that the platform might use to estimate the global treatment effect. These designs and estimators are those most commonly used in practice. We define these estimators during the transient phase of the experiment, as that is the most practically relevant regime (since A/B tests are run for a fixed duration in practice). We then also define associated steady-state versions of these estimators. Finally, we combine these estimation approaches in a natural heuristic that can be employed for any general TSR design.


Estimators for the CR design. We start by considering the CR design, i.e., where and . A simple naive approach is to measure the rate at which rentals are made in a given interval of time by control customers, and compare this to the analogous rate for treatment customers. Formally, suppose the platform runs the experiment for the interval , with a fraction of customers in treatment. The rate at which customers of treatment condition rent in this period is . The naive CR estimator is the difference between treatment and control rates, where we correct for differences in the size of the control and treatment groups, by scaling with the respective masses:


We let denote the steady-state naive CR estimator.

Estimators for the LR design. Analogously, we can define a naive estimator for the LR design, i.e., where and . Formally, suppose the platform runs the experiment for the interval , with a fraction of listings in treatment. The rate at which listings with treatment condition are rented in this period is . The naive LR estimator is the difference between treatment and control rates, again scaled by the mass of listings in each group:


We let denote the corresponding steady-state naive LR estimator.

Estimators for the TSR design. As with the CR and LR designs, it is possible to design a natural naive estimator for the TSR design as well. In particular, we have the following definition of the naive TSR estimator:


To interpret this estimator, observe that the first term is the normalized rate at which treatment customers booked treatment listings in the experiment; we normalize this by , since a mass of customers are in treatment, and a mass of listings are in treatment. This first term estimates the global treatment rate of rental. The sum is the total rate at which control rentals took place: either because the customer was in the control group, or because the listing was in the control group, or both. (Recall that in the TSR design, the intervention is only seen when treatment customers interact with treatment listings.) This is normalized by the complementary mass, . This second term estimates the global control rate of rental. As before, we can define a steady-state version of this estimator as , with the steady-state versions of the respective quantities on the right hand side of (18).

It is straightforward to check that as , we have , the naive estimator. Similarly, as , we have , the naive estimator. In this sense, the naive TSR estimator naturally "interpolates" between the naive CR estimator and the naive LR estimator. We exploit this interpolation to choose and as a function of market conditions in the next section (in particular, dependent on the imbalance between demand and supply). More generally, inspired by the idea of interpolating between the naive CR estimator and the naive LR estimator, we also explore an alternative, more sophisticated estimator.
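The verbal description of the naive TSR estimator can be turned into a small function. This is a hedged sketch rather than a transcription of the paper's formula, which is not reproduced in this text: it assumes the treatment term is normalized by the product of the two treatment fractions and the control term by one minus that product. The argument names (`q_tt`, `q_tc`, `q_ct`, `q_cc`, where the first letter is the customer's condition and the second the listing's, and `a_c`, `a_l` for the treatment fractions) are hypothetical.

```python
def naive_tsr_estimate(q_tt, q_tc, q_ct, q_cc, a_c, a_l):
    """Naive two-sided estimate of the global treatment effect.

    q_xy: rental rate of customers in condition x (t = treatment,
          c = control) booking listings in condition y.
    a_c, a_l: assumed names for the customer and listing treatment fractions.
    """
    treated_mass = a_c * a_l
    if not 0.0 < treated_mass < 1.0:
        raise ValueError("need 0 < a_c * a_l < 1 for both terms to be defined")
    # Treated customers booking treated listings estimate global treatment.
    treatment_term = q_tt / treated_mass
    # Every other pairing sees no intervention and estimates global control.
    control_term = (q_tc + q_ct + q_cc) / (1.0 - treated_mass)
    return treatment_term - control_term

# With every listing treated (a_l = 1) there are no control listings, so
# q_tc = q_cc = 0 and the estimate depends only on the customer-side split.
estimate = naive_tsr_estimate(q_tt=0.3, q_tc=0.0, q_ct=0.2, q_cc=0.0,
                              a_c=0.5, a_l=1.0)
```

In this special case the expression reduces to the treatment rate divided by the treated customer fraction minus the control rate divided by the control customer fraction, which is the customer-side comparison described earlier.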

6 Analysis of bias: Examples

In the remainder of the paper, we study the behavior of the CR, LR, and TSR designs and associated naive estimators proposed in the previous section. We are particularly interested in characterizing the bias: i.e., the extent to which the estimators we have defined under- or overestimate the true GTE. (Because we work in the mean field limit, an unbiased estimator is one that would actually be consistent in a statistical sense. Again, a rigorous proof of such a fact is outside the scope of this paper.)

In this section, we start with a simple discussion via example that illustrates the main effects that cause bias. Throughout the discussion, we assume that there are two listings in total in the market, and that listings are homogeneous (i.e., of identical type). Further, we also assume that arriving customers are homogeneous (i.e., of identical type). We let denote the utility of a customer for a listing, and suppose the platform considers an intervention that changes this to . Finally, we assume that every arriving customer includes any listing that is available in her consideration set.

An important operational finding of our work is that the market balance has a significant influence in determining which estimator and design is bias-optimal. When becomes large, the market is relatively supply-constrained: customers are arriving much faster than occupied listings become available. When becomes small, the market is demand-constrained, with few customers arriving and many available listings. We divide our discussion of this example into these two extreme cases. Our findings are illustrative of the insights we obtain theoretically in the next section.

6.1 Highly demand-constrained markets

Consider a hypothetical limit where each listing becomes instantly available again after being rented (i.e., but remains fixed). This is the demand-constrained extreme, where capacity constraints on listings become irrelevant. Note that on arrival of a customer, both listings are always available, and therefore, in her consideration set. In this limit, observe that the steady-state rate at which customers rent listing becomes:


(The factor 2 appears in the denominator as there are two listings.) Since the intervention changes to , the GTE is:

Now suppose we consider a CR design that randomizes a fraction of arriving customers to treatment. Observe that in this demand-constrained extreme, every arriving control (resp., treatment) customer sees the full global control (resp., treatment) market condition; there is no dynamic influence of one customer's rental behavior on any other customer. This suggests the naive CR estimator should correctly recover the global treatment effect. Indeed, the steady-state naive CR estimator becomes:

In other words, the naive CR estimator is perfectly unbiased.

On the other hand, consider an LR design where listing 1 is (randomly) assigned to treatment, and listing 2 is (randomly) assigned to control. In this design, the steady-state naive LR estimator (with ) becomes:

It is clear that in general this will not be equal to the GTE, because there is interference between the two listings: every arriving customer sees a market environment that is neither quite global treatment nor global control, and the estimates reflect this imperfection. Even with immediate replenishment, treatment listings compete for customers and "cannibalize" rentals from control listings, causing the naive LR estimator to be biased. (Note that such a violation would arise for virtually any reasonable choice model that could be considered.)
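The demand-constrained example can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's derivation: both listings are always available, a customer picks a listing with utility u with probability u divided by the outside option value plus the sum of listing utilities, and the global treatment effect is taken to be the change in the total rental rate. All names (`arrival_rate`, `outside_value`, `v`, `v_tilde`, `a_c`, `a_l`) and all numerical values are hypothetical.

```python
def total_rental_rate(arrival_rate, outside_value, utilities):
    """Total rental rate when every listing in `utilities` is always available."""
    total = sum(utilities)
    return arrival_rate * total / (outside_value + total)

arrival_rate, outside_value = 1.0, 1.0   # illustrative values only
v, v_tilde = 1.0, 2.0                    # control and treated utilities

# Global treatment versus global control, two listings in each world.
gte = (total_rental_rate(arrival_rate, outside_value, [v_tilde, v_tilde])
       - total_rental_rate(arrival_rate, outside_value, [v, v]))

def naive_cr(a_c):
    """Customer-side design: each customer sees a pure treatment or control market."""
    q_treat = a_c * total_rental_rate(arrival_rate, outside_value, [v_tilde, v_tilde])
    q_ctrl = (1 - a_c) * total_rental_rate(arrival_rate, outside_value, [v, v])
    return q_treat / a_c - q_ctrl / (1 - a_c)

def naive_lr():
    """Listing-side design: listing 1 treated, listing 2 control (a_l = 1/2)."""
    a_l = 0.5
    denom = outside_value + v_tilde + v   # every customer sees a mixed market
    q_treat = arrival_rate * v_tilde / denom
    q_ctrl = arrival_rate * v / denom
    return q_treat / a_l - q_ctrl / (1 - a_l)
```

With these illustrative numbers the customer-side estimate matches the global treatment effect (about 0.133) for any treatment fraction, while the listing-side estimate is 0.5: the treated listing draws rentals away from the control listing, which inflates the measured difference.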

6.2 Highly supply-constrained markets

Now we consider the opposite extreme, where the market is heavily supply constrained; in particular, we consider the hypothetical limit where but remains fixed. Now in this case, note that a listing that becomes available will nearly instantaneously be booked; therefore, virtually every arriving customer will find at most one of the two listings available, and their decision of whether to book will be entirely determined by comparison of that available listing against the outside option. In particular, as a result when the steady-state rate at which listing is rented approaches .

We thus require a more refined estimate of this rental rate as . Suppose is large, and suppose listing becomes available. Based on the intuition above, we make the approximation that the listing will be considered in isolation by a succession of customers until it is rented. Customers arrive at rate , and rent an available listing with probability ; in other words, in this regime listings compete only with the outside option, and not with each other. Therefore the mean time until such a rental occurs is ; and once booked, the listing remains rented for mean time , at which time it becomes available again. Therefore for large , the steady-state rate at which a listing is rented is approximately:


As expected, this rate approaches as . The GTE is thus:

With this observation in hand, suppose we again consider the same LR design where listing 1 is (randomly) assigned to treatment, and listing 2 is (randomly) assigned to control. Since in (20) there is no influence of one listing on the other, observe that the naive LR estimator (with ) becomes:

In other words, the naive LR estimator is perfectly unbiased. This is intuitive: in the limit where is large, since listings do not compete with each other for rentals, there is no interference when we implement the LR design.

On the other hand, consider the naive CR design where a fraction of arriving customers are randomized to treatment. In this case we wish to establish the rate at which rentals are made by treatment and control customers respectively. Suppose listing was occupied by a treatment customer, and becomes available. Define:

This is the probability an arriving customer rents the available listing. Customers arrive at rate , so a mean time elapses until a rental is made; the listing then remains occupied for mean time . Conditional on a rental, the rental was made by a treatment guest with probability:

Thus the mean time between treatment rentals of listing is a geometrically distributed multiple of , with parameter ; in other words, the mean rate at which listing is rented by treatment customers is:

We can use the same logic for the rental rate of control customers, and so we find the naive CR estimator is:

In general, this estimator will be biased, i.e., not equal to the GTE. The issue is that in this case, customers have a dynamic influence on each other across the treatment groups: when a listing becomes available, whether or not it is available for booking by a subsequent control customer depends on whether or not a treatment customer had previously booked the listing. In this case, customers compete among each other for listings. This interference across customer groups leads to the biased expression for the naive CR estimator.

We note that the GTE and the naive LR estimator both converge to zero in the limit where ; this is because the rental rate of each listing becomes in this limit. The naive CR estimator does not converge to zero in general, however, as .
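The supply-constrained approximation can also be explored numerically. The sketch below follows the renewal argument in this subsection under stated assumptions: each listing is considered in isolation, a customer rents an available listing with probability u divided by the outside option value plus u, and a rented listing is released at a constant rate so that its mean occupancy time is the reciprocal of that rate. The names `arrival_rate`, `release_rate`, `outside_value`, `v`, `v_tilde`, and `a_c`, and all numerical values, are hypothetical.

```python
def listing_rate(arrival_rate, release_rate, rent_prob):
    """Approximate steady-state rental rate of one listing: the reciprocal of
    (mean wait for a renting customer + mean occupancy time)."""
    return 1.0 / (1.0 / (arrival_rate * rent_prob) + 1.0 / release_rate)

def supply_constrained_estimates(arrival_rate, release_rate, outside_value,
                                 v, v_tilde, a_c):
    """Return (gte, naive_lr, naive_cr) for the two-listing example."""
    p_ctrl = v / (outside_value + v)                # control rental probability
    p_treat = v_tilde / (outside_value + v_tilde)   # treated rental probability

    r_treat = listing_rate(arrival_rate, release_rate, p_treat)
    r_ctrl = listing_rate(arrival_rate, release_rate, p_ctrl)
    gte = 2 * r_treat - 2 * r_ctrl                  # two listings in total

    # Listing-side design: one treated and one control listing (fraction 1/2).
    naive_lr = r_treat / 0.5 - r_ctrl / 0.5

    # Customer-side design: each arrival is treated with probability a_c.
    p_mix = a_c * p_treat + (1 - a_c) * p_ctrl
    total = 2 * listing_rate(arrival_rate, release_rate, p_mix)
    q_treat = total * a_c * p_treat / p_mix         # share rented by treated customers
    q_ctrl = total * (1 - a_c) * p_ctrl / p_mix
    naive_cr = q_treat / a_c - q_ctrl / (1 - a_c)
    return gte, naive_lr, naive_cr

gte, naive_lr, naive_cr = supply_constrained_estimates(
    arrival_rate=1e6, release_rate=1.0, outside_value=1.0,
    v=1.0, v_tilde=2.0, a_c=0.5)
```

With a very large arrival rate, the global treatment effect and the listing-side estimate are both close to zero and agree with each other under this approximation, while the customer-side estimate stays near 0.57 for these illustrative values, reflecting competition between treatment and control customers for the same listings.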

6.3 Discussion: Violation of SUTVA

Our simple example illustrates that the naive estimator is biased when , and unbiased when ; and the naive estimator is unbiased when , while it is biased when . These findings can be interpreted through the lens of the classical potential outcomes model; in that model, an important result is that when the stable unit treatment value assumption (SUTVA) holds, then naive estimators of the sort we consider will be unbiased for the true treatment effect. SUTVA requires that the treatment condition of units other than a given customer or listing should not influence the potential outcomes of that given customer or listing. The discussion above illustrates that in the limit where , there is no interference across customers in the CR design; this is why the naive CR estimator is unbiased. Similarly, in the limit where , there is no interference across listings in the LR design; this is why the naive LR estimator is unbiased. On the other hand, the cases where each estimator is biased involve interference across experimental units.

7 Analysis of bias: Results

We establish two key theoretical results in this section: in the limit of a highly supply-constrained market (where ), the naive LR estimator becomes an unbiased estimator of the GTE, while the naive CR estimator is biased. On the other hand, in the limit of a highly demand-constrained market (where ), the naive CR estimator becomes an unbiased estimator of the GTE, while the naive LR estimator is biased. In other words, each of the two naive designs is respectively optimal in the limits of extreme market imbalance. These results match the findings in our simple example in the preceding section. At the same time, we find empirically that neither estimator performs well in the region of moderate market balance.

Inspired by this finding, we consider TSR designs and associated estimators that naturally interpolate between the two naive designs depending on market balance. We first consider the naive TSR estimator. Given the findings above, we show that a simple approach to adjusting and as a function of market balance yields performance that balances between the naive CR estimator and the naive LR estimator. Nevertheless, we show there is significant room for improvement, by adjusting for the types of experimental interference that arise using observations from the experiment. In particular, we propose a heuristic for a novel interpolating estimator for the TSR design that aims to correct these biases, and yields surprisingly good empirical performance. We conclude with a brief discussion of transient performance of the estimators considered, and some insights derived through numerical investigation.

7.1 Theory: Steady-state bias in unbalanced markets

In this section, we theoretically study the bias of the steady-state naive CR and LR estimators in the limits where the market is extremely unbalanced (either demand-constrained or supply-constrained). The key tool we employ is a characterization of the asymptotic behavior of as defined in (15) in the limits where and . We use this characterization in turn to quantify the asymptotic bias of the naive estimators relative to the GTE.

7.1.1 Highly demand-constrained markets

We start by considering the behavior of naive estimators in the limit where . The following proposition characterizes the behavior of as . The proof is in Appendix A.

Proposition 2.

Fix all system parameters except and , and consider a sequence of systems in which . Then along this sequence,


The expression on the right hand side depends on both and through and respectively. In particular, we recall that , , and , . In our subsequent discussion in this regime, to emphasize the dependence of on below, we will write . With this definition, we have , .

The proposition shows that in this limit, the (scaled) rate of rental behaves as if the available listings of type was exactly for every and treatment condition . It is as if every arriving customer sees the entire mass of listings as being available, as in our simplified example in the previous section; in that example, , and so rentals are immediately replenished.

We use the preceding result to study the bias of the steady-state naive CR and LR estimators in the limit where . Consider a sequence of systems where . Using the preceding result, we observe that:


We now use Proposition 2 to show that the steady-state naive CR estimator is unbiased in the limit as , while the steady-state naive LR estimator remains biased. First we consider a CR experiment paired with the naive CR estimator. Using Proposition 2, it follows that:

Now note that and when ; similarly, , and when . Thus, from the definition of the design in (11)-(13) and the definition of the choice probability in (37), the choice probability of a control customer for a treatment listing at is the same as the choice probability of a control customer for a control listing at :

These choice probabilities are the same because (1) all listings are in treatment in the CR design, with the mass of each type equal to ; and (2) control customers have the same choice model parameters for these listings regardless of whether they are in treatment or control. Thus it follows that as , i.e., the steady-state naive CR estimator is asymptotically unbiased.

On the other hand, consider the steady-state naive LR estimator. Observe that:

In general, this limit will not be equivalent to the GTE; i.e., the naive LR estimator is asymptotically biased. The reason is that is different from both (all listings in treatment) and (all listings in control): in the LR design, there is a positive mass of listings in both treatment and control, and this means the choice probabilities do not match those in either global treatment (in the first term) or global control (in the second term). This is exactly the same interference between listings of different treatment conditions that we saw in the simple example in the preceding section, in which listings compete for customers.

Based on the preceding discussion, we observe that the difference between and the GTE does not converge to zero in general as ; i.e., the naive LR estimator is biased. However, the naive CR estimator is unbiased in this limit. We summarize in the following theorem.

Theorem 2.

Consider a sequence of systems where . Then for all such that , . However, for , generically over parameter values (here, "generically" means for all parameter values, except possibly for a set of parameter values of Lebesgue measure zero), we have .

7.1.2 Heavily supply-constrained markets

We now characterize the behavior of naive estimators in the limit where . We start with the next proposition, where we study the behavior of as . The proof is in Appendix A. To state the proposition, we define:

Proposition 3.

Fix all system parameters except and , and consider a sequence of systems in which . Along this sequence,