Causally Driven Incremental Multi Touch Attribution Using a Recurrent Neural Network

02/01/2019
by   Ruihuan Du, et al.

This paper describes a practical system for Multi Touch Attribution (MTA) for use by a publisher of digital ads. We developed this system for JD.com, an eCommerce company, which is also a publisher of digital ads in China. The approach has two steps. The first step ("response modeling") fits a user-level model for purchase of a product as a function of the user's exposure to ads. The second ("credit allocation") uses the fitted model to allocate the incremental part of the observed purchase due to advertising, to the ads the user is exposed to over the previous T days. To implement step one, we train a Recurrent Neural Network (RNN) on user-level conversion and exposure data. The RNN has the advantage of flexibly handling the sequential dependence in the data in a semi-parametric way. The specific RNN formulation we implement captures the impact of advertising intensity, timing, competition, and user-heterogeneity, which are known to be relevant to ad-response. To implement step two, we compute Shapley Values, which have the advantage of having axiomatic foundations and satisfying fairness considerations. The specific formulation of the Shapley Value we implement respects incrementality by allocating the overall incremental improvement in conversion to the exposed ads, while handling the sequence-dependence of exposures on the observed outcomes. The system is in production at JD.com, and scales to handle the high dimensionality of the problem on the platform (attribution of the orders of about 300M users, for roughly 160K brands, across 200+ ad-types, served about 80B ad-impressions over a typical 15-day period).


1 Introduction

As digital ads proliferate and the measurability of advertising increases, the issue of Multi Touch Attribution (MTA) has become one of paramount importance to advertisers and digital publishers. MTA pertains to the question of how much the marketing touchpoints a user was exposed to contribute to an observed action by the consumer. Understanding the contribution of various marketing touchpoints is an input to good campaign design, to optimal budget allocation, and to understanding why one campaign worked and another did not. Wrong attribution results in misallocation of resources, inefficient prioritization of touchpoints, and consequently lower returns on marketing investments. Consequently, having a good model of attribution is now recognized as critical for marketing planning, design and growth. According to eMarketer estimates, among US companies with at least 100 employees using more than one digital marketing channel, about 85% utilized some form of digital attribution model in 2018, emphasizing the importance of having good solutions to the problem from the perspective of industry (Benes, 2018).

Because the various touchpoints can interact in complex ways to affect the final outcome, the problem of parsing the individual contributions and allocating credit is a complex one. Given the complexity, many firms and platforms use rule-based methods such as last-touch, first-touch, equally-weighted, or time-decayed attribution (IAB, 2018). Because these rules may not always reflect actuality, modern approaches propose data-driven attribution schemes that use rules derived from actual marketplace data to allocate credit. This paper proposes a data-driven MTA system for use by a publisher of digital ads.[1] We developed this system for JD.com, an eCommerce company, which is also a publisher of digital ads in China. The advertising marketplace of JD features thousands of advertisers buying billions of impressions of ads of more than 200 types for over 300M users, and is a data-rich environment. Hence, as a practical matter, we need a system that scales to handle high dimensionality and leverages the large quantities of user-level data available to the platform.

[1] While the model is presented from the perspective of a publisher (in our case, an eCommerce platform), one can also view this from the perspective of an advertiser who wishes to assign credit for the orders he obtains across the various ads he buys.

Our approach has two steps. The first step (“response modeling”) fits a user-level model for purchase of a brand’s product as a function of the user’s exposure to ads. The second (“credit allocation”) uses the fitted model to allocate the incremental part of the observed purchase due to advertising, to the ads the user is exposed to over the previous $T$ days.

To implement step one, our goal is to develop a response model that captures the following aspects of ad-response,

  1. Responsiveness of current purchases to a sequence of past advertising exposures: that is, we would like to develop a response model that allows for both current and past advertising to matter in driving current purchases. This is consistent with a large literature that has documented that advertising has long-lived effects, and emphasized how both the stock as well as the flow of advertising affects consumer response (e.g., Bagwell, 2007). In addition, various findings in this literature motivate the need to handle the effect of the history of past exposures flexibly. For instance, the effect of history can operate in complex ways, by changing not just the baseline level of purchase probability, but also the marginal effect of current ad-exposures (e.g., Bass et al., 2007; Naik et al., 1998).

  2. Responsiveness of current purchases to the intensity of ad-exposure: that is, the model should allow the effect of a brand’s advertising on a user to depend on the number of exposures, not just whether there was exposure, with the effect being possibly non-linear (e.g., Dube et al., 2005). Therefore, we need to capture an intensive margin by which advertising can affect behavior, in addition to accommodating an extensive margin for the effects of ads.

  3. Responsiveness of current purchases to the timing of ad-exposures: that is, the model should allow the effect of past exposures to differ based on the timing of those exposures. This is motivated by accumulating evidence that the effect of advertising exposure is long-lasting but decays in human memory (e.g., Sahni, 2015). Therefore, we expect advertising to have a lasting effect, with the effect highest at the time of exposure, ceteris paribus, and decaying over time. A realistic model should accommodate a role for time in incorporating the effect of an exposure on purchase, and allow this decay to occur in a flexible way that can be learned from the data.

  4. Responsiveness of current purchases to competitive ad-exposures: that is, the model should accommodate a role for both own and competing brand’s current and past ad-exposures to affect current purchases. Allowing for cross-brand ad-effects to matter is important in a competitive marketplace with competing brands, and is also critical to capturing the incremental contribution to a brand’s own-advertising efforts correctly (e.g., Anderson and Simester, 2013).

  5. Capturing heterogeneity across users: that is, the model should allow for the ad-response to differ by consumer types. One motivation is based on business considerations: advertisers use the output from the model to design their targeting strategies, and often desire estimates of attribution split by consumer segments. Another motivation is based on inference: a typical concern about measuring ad-response is user selection into exposure. Including a flexible accommodation for heterogeneity into the model mitigates the selection concern somewhat by “controlling for” observables that drive selection into exposure (e.g., Varian, 2016).

Given the scale of the data, and the large number of ad-types, a fully non-parametric model that reflects these considerations is not feasible. Instead, our approach is to develop a flexible specification that fits the data and incorporates these aspects of ad-response. We train a Recurrent Neural Network (RNN) for this purpose. The RNN is trained on user-level conversion and exposure data, and is architected to capture the impact of advertising intensity, timing, competition, and user-heterogeneity outlined above. The model is set up as a binary classification problem, outputting a probability that a user buys a product associated with a brand in a given time-period. It takes as inputs in its lowermost layer the impressions served to a user of a focal brand’s and its competitors’ ads, split by ad-type, and separately for each of the $T$ time-periods prior to the date of attribution. This allows a flexible way of handling the intensity of ad-exposure and competition over the past $T$ periods when attributing an order. A separate fully connected layer in the model takes as input a set of user characteristics, which shifts the “intercept” of the logistic output layer, giving it a “fixed-effects” interpretation.

The specifics of the application to advertising-response motivate a bi-directional formulation of the RNN. We allow for a hidden layer with backward recurrence, augmented with a hidden layer with forward recurrence. This improves the fit of the model, because the fact that a user saw a particular set of ads in period $t+1$ is useful for predicting his response in period $t$. For example, if a user bought a brand in period $t$, he may not search for that brand in period $t+1$, and so may not be exposed to search ads in $t+1$. So the knowledge that he did not see search ads in $t+1$ is useful to predict whether he will buy in period $t$. More generally, this suggests that the sequence of future ad-impressions can help predict current conversion. Adding a layer with forward recurrence serves as a semi-parametric summary of future activity that is helpful to predict current actions.

While RNNs are not new to this paper, it is worth emphasizing why this class of models is of value for the MTA problem. Compared to other frameworks, RNNs represent a more flexible way to handle the sequential dependence in the data. Sequential dependence is key to ad-response, because what we need to capture from the data is how exposures to past touchpoints cumulatively build up to affect the final outcome. RNNs do this well by allowing for continuous, high-dimensional hidden states (compared to lower-dimensional, discrete ones in other models with hidden states), combined with a distributed representation of those states that allows them to store information about the past efficiently. This enables the RNN to handle long-term, higher-order and non-Markovian dependencies in a semi-parametric manner (see Graves, 2012; Lipton, 2015 for overviews).[2]

[2] Comparing Hidden Markov Models to RNNs, Lipton (2015) says, “Hidden Markov models (HMMs), which model an observed sequence as probabilistically dependent upon a sequence of unobserved states, were described in the 1950s and have been widely studied since the 1960s (Stratonovich, 1960). However, traditional Markov model approaches are limited because their states must be drawn from a modestly sized discrete state space $S$. The dynamic programming algorithm that is used to perform efficient inference with hidden Markov models scales in time $O(|S|^2)$ (Viterbi, 1967). Further, the transition table capturing the probability of moving between any two time-adjacent states is of size $|S|^2$. Thus, standard operations become infeasible with an HMM when the set of possible hidden states grows large. Further, each hidden state can depend only on the immediately previous state. While it is possible to extend a Markov model to account for a larger context window by creating a new state space equal to the cross product of the possible states at each time in the window, this procedure grows the state space exponentially with the size of the window, rendering Markov models computationally impractical for modeling long-range dependencies (Graves et al., 2014).” Increasingly, some researchers view HMMs as special cases of RNNs (Wessels and Omlin, 2000; Buys et al., 2018).
Well-known results in theoretical computer science also show that recurrent neural nets have attractive universal approximation properties: any function that can be computed by a digital computer is also in-principle computable by a recurrent neural net architecture.[3]

[3] By a “net” we mean an architecture in which neurons are allowed to synchronously update their states according to some combination of past activation values. While earlier literature had suggested that nets can achieve universality if one allowed for an infinite number of neurons (e.g., Franklin and Garzon (1990)), or allowed for higher-order connections (where current states update their values as multiplications or products of past activations, e.g., Sun et al. (1991)), results by Siegelmann and Sontag (1991, 1995) are even more favorable: recurrent neural nets can achieve universality using only a finite number of neurons, and using only first-order, non-multiplicative connections. In particular, any function computable by a Turing Machine can be computed by such a net. “Turing Machines” are mathematically simple computational devices that help formalize the notion of computability. Under the Church-Turing thesis in computer science, for every computable problem, there exists a Turing Machine that computes it; and conversely, every problem not computable by a Turing Machine is also not computable by finite means. See for instance https://plato.stanford.edu/entries/turing-machine/ for historical perspectives.

Thus, relatively “simple” recurrent architectures can capture very complex functional dependencies in the data, especially if we allow for a large number of neurons. This makes them attractive for our situation. The main disadvantages of RNNs are that they require more data and take longer to train. This problem is mitigated in implementations at modern tech platforms, which are data-rich and have access to large computational resources.

To implement step two, we focus on incrementality-based allocation for advertising. To understand the motivation for this, note that each ad generates an incremental increase in the overall probability of purchase, and the set of ad-exposures as a whole generates an incremental improvement in the propensity to purchase. Our approach is to allocate the incremental improvement due to the ads to each ad-type. This takes into account that even if the user had not seen the ads, the user may have some baseline propensity to buy anyway due to tastes, prior experiences, or spillovers from competitive advertising. Logically, the part of observed orders that would have occurred anyway should not be allocated to the focal brand’s advertising efforts. To allocate credit, we compute Shapley Values, which have the advantage of having axiomatic foundations and satisfying fairness considerations (Shapley, 1953; Roth, 1988). The specific formulation of the Shapley Value we implement respects incrementality by allocating the overall incremental improvement in conversion to the exposed ads, while handling the sequence-dependence of exposures on the observed outcomes.

Computing the Shapley Values is computationally intensive. We present a scalable algorithm (implemented in a distributed MapReduce framework) that is fast enough to allow computation in reasonable amounts of time so as to make productization feasible. The algorithm takes predictions from the response model trained on the data as an input, and allocates credit over tuples of ad-exposures and time periods. Allocation at the tuple-level has the advantage of handling the role of the sequence in an internally consistent way. Once allocation of credit at this level is complete, we sum across all time periods associated with an ad-type to develop an ex-post credit allocation to each ad-type. This explicit aggregation has the advantage that aggregation biases are reduced when using the model to allocate credit at a more aggregate level, such as over advertising channels (e.g., search and display).[4]

[4] Aggregating responses to the channel level reduces the complexity of the algorithm, and enables pooling of data, but masks the differential contribution of various touchpoints to final conversion, because implicitly, such a response model assumes that the effects of all touchpoints within the channel are similar. By training the response model and implementing the allocation at the ad-type and time-period level, and then exactly aggregating these allocations to the channel level, we mitigate such concerns to a large extent.

In combination, the RNN response-model and the Shapley Value credit system represent a coherent, theory- and data-driven attribution framework for the platform. We present details and an illustration of the framework using data from one product category (cell-phones) at JD.com. This is a single product category version of the full framework. The full framework is in production at the firm; it accommodates all product categories on the site, and scales to handle the high dimensionality of the problem on the platform (attribution of the orders of about 300M users, for roughly 160K brands, across 200+ ad-types, served about 80B ad-impressions over a typical 15-day period).

The rest of the paper discusses the relevant literature, the details of the model, details of the cell-phone data, and results. The last section concludes.

2 Relationship to the Literature

The problem of attribution of credit to underlying marketing activities is not new. The previous literature on the topic is divided into two streams: (a) empirical papers that develop statistical response models to measure the effect of marketing touchpoints on consumer purchase behavior and engagement; (b) papers that combine an empirically specified response model with an allocation scheme for allocating credit to the touchpoints. This paper is part of the second stream.

Early research on response models used market-level data and time-series based aggregate “marketing mix” models to assess the effect of marketing touchpoints in print, TV and internet channels on sales and engagement (e.g., Naik et al., 2005; de Haan et al., 2016; Kireyev et al., 2016). More recent work has leveraged access to user-level browsing and conversion data to develop individual-level models of responsiveness. Notable examples in this stream include Shao and Li (2011) (who use a bagged logistic regression model and a semi-parametric model that allows up to second-order dependence in consumer behavior); Li and Kannan (2014) and Xu et al. (2014) (who use customized Markovian models of consumer channel choice and conversion); Zhang et al. (2014) (who use a hazard-based survival model that allows for time decay in ad-exposures); Abhishek et al. (2015) (who use an HMM of user exposure and conversion); and Anderl et al. (2016) (who model customer purchase and browsing behavior as a Markov graph with up to fourth-order dependence).

Broadly speaking, the recent response modeling literature has focused on developing generative models of consumer behavior that capture the dependence in the effects of touchpoints to the extent possible, while making simplifying assumptions to feasibly handle the high dimensionality of the measurement problem. Relative to this stream, this paper uses an RNN trained on user-level data as a response model. Compared to past frameworks, the model handles complex patterns of dependence in a more flexible way. The specific formulation of the RNN also allows the sequences of touchpoints to have differential effects on final conversion, which is novel. The setup also accommodates in one framework the role of intensity, timing, competition, and user-heterogeneity, which have typically not been considered together in the previous literature.

Amongst papers in the second stream, Dalessandro et al. (2012) were the first to propose using the Shapley value as a credit allocation mechanism for the MTA problem. They call this “causally-motivated” attribution because of the causal interpretation associated with the “marginality” property of the Shapley Value rule. Dalessandro et al. (2012) compute Shapley values for channels by fitting to the data logistic regression models similar to Shao and Li (2011). Yadagiri et al. (2015) present an important extension of this work, allowing the statistical model to be semi-parametric, but restricting outcomes to depend on the composition, but not the order, of previous touchpoints. Anderl et al. (2016) leverage the Markov graph-based approach proposed by Archak et al. (2010) for credit allocation. This approach is computationally attractive, but lacks the fairness properties of the Shapley Value.[5] Like Shao and Li (2011); Dalessandro et al. (2012); Yadagiri et al. (2015), we use the Shapley value for credit allocation. The specifics of our implementation differ from these papers in three respects. First, we present a way to obtain the incremental contribution of a focal firm’s advertising to observed orders, and to allocate that incremental contribution to the underlying ad-slots. Previous approaches have allocated total orders. Our view is that the incremental part is the more intuitive allocation, as it is the component that is due to advertising. Second, allowing the conversion to depend on the order of exposures in the response model requires us to develop a way to implement credit allocation to an ad-slot that depends on its order in the temporal sequence of exposures. We present an algorithm that computes Shapley Values over tuples of ad-slots and locations in the sequence to do so. This aspect, which arises because “order matters,” is not an explicit consideration in previous approaches. Third, we implement attribution at a more disaggregated “ad-slot” level, compared to the more aggregate channel-level attribution of past approaches. This makes the problem considered here more high dimensional than considered previously.

[5] The credit allocated to an ad-slot is the “removal effect,” computed as the change in probability of reaching the conversion state from the start state when the slot is removed from the Markov graph. The removal effect can be thought of as the marginal contribution of an ad-slot. The Shapley Value, in contrast, allocates credit based on a transformation of the marginal contributions.

This paper is also related to a game-theoretic literature that devises payment rules for multi-channel ads. Notable papers include Agarwal et al. (2009); Wilbur and Zhu (2009); Jordan et al. (2011); Hu et al. (2016); Berman (2018), who propose efficient contracts when there are interactions across publishers, advertisers and publishers are strategic, and information is possibly asymmetric. The response model and attribution methods presented here can form an input to the creation of the payment contracts suggested in this theory.

A limitation of our approach, and indeed of all the response models cited previously, is the lack of exogenous variation in user exposure to advertising. This can contaminate the learning of marginal effects from the data due to issues associated with nonrandom targeting and selection into ad-exposure. A typical solution to the problem, randomization of users into ad-exposures across all the ad-types followed by training the model on data generated by the randomization, is infeasible at the scale required for practical implementation, due to the cost and complexity of such randomization. Extant papers that have trained ad-response models on data with full or quasi-randomization (Sahni, 2015; Barajas et al., 2016; Nair et al., 2017; Zantedeschi et al., 2017) have done so at smaller scale, over a limited number of users and ad-types. The approach to this problem adopted here is to include a large set of user features in the model, so that we convert a “selection on unobservables” problem into a “selection on observables” problem. Given the feature set is large and accommodated flexibly, controlling for these observables may mitigate the selection issue to a great extent (e.g., Varian, 2016), albeit not perfectly.

3 Model Framework

3.1 Problem Statement: Defining Multi Touch Attribution in terms of Incrementality

Let $u$ denote users; $t$ denote time (days); and $b$ denote brands. Let $p$ index an “ad-position,” i.e., a particular location on the publisher’s inventory or at an external site at which the user can see advertisements linked to a given brand. For instance, a particular ad-slot showing a display ad on the top frame of the JD app home-page would be one ad-position, and a particular ad-slot showing a search-ad in response to a keyword search on the JD app would be another ad-position. Consider an order made by user $u$ for brand $b$ on day $t$. Let $S_{ubt}$ denote the set of ad-positions at which user $u$ was exposed to ads for brand $b$ over the $T$ days preceding the order (from $t-T+1$ to $t$).

We formulate the multi touch attribution problem as developing a set of credit-allocations for all $p \in S_{ubt}$, so that the allocation for $p$ represents the contribution of brand $b$’s ads at position $p$ to the expected incremental benefit generated by brand $b$’s advertising on the observed order. Define $\Delta_{ubt}$ as the change in the probability of the order occurring due to the user’s exposure to $b$’s ads in the positions in $S_{ubt}$. We look for a set of fractions $f_p$, $p \in S_{ubt}$, such that,

$\sum_{p \in S_{ubt}} f_p = 1,$   (1)

so that the credit allocated to position $p$ is $f_p \Delta_{ubt}$.

3.2 Problem Solution: Response Model Trained on User Data + Shapley Values

We solve the problem in two steps. To allocate the orders on date $t$:

  • In step 1, we train a response model for purchase behavior using individual user-level data observed during the $T$ days up to $t$.

  • In step 2, we take the model as given, and for each order observed on date $t$ we compute Shapley Values for the ad-positions in $S_{ubt}$. We set the fractions $f_p$ from these Shapley values and aggregate across orders to obtain the overall allocation for brand $b$ on date $t$.

Figure (1) shows the architecture of the system.

  • Notes: The Figure shows the architecture of the attribution system. In the first stage, a response-model is trained on historical data. In the response model, ad-impressions and price indices are used as inputs into an LSTM layer with recurrence. User characteristics are processed through a fully connected layer. These together feed into an output layer that forms a prediction. Results from the trained model are used to compute Shapley values for all orders observed at a disaggregated level. These are then aggregated as desired to obtain attributions.

Figure 1: Attribution System Architecture

3.2.1 Shapley Values

Motivation

The Shapley value has many advantages as a fair allocation scheme in situations of joint generation of outcomes. These advantages have been articulated in a long history of economic literature on co-operative games starting with Shapley (1953). See Roth (1988) for a summary and historical perspectives. The key idea is that any fair allocation should avoid waste and allocate all of the total benefit to the constituent units that generated the benefit (“allocative efficiency”). This “adding-up” requirement is encapsulated in equation (1). Further, fairness suggests that two units that contribute the same to every possible configuration in which the constituent units can influence the final outcome should be allocated the same credit (“symmetry of credit”); and logically, that a unit which contributes nothing to any configuration should be allocated zero credit (“dummy unit”).

In addition to these requirements, a fair allocation should also satisfy the “marginality principle,” encapsulating the idea that credit should be proportional to contribution. Specifically, marginality requires that the share of total joint benefit allocated to any constituent unit should depend only on that unit’s own contribution to the joint benefit. As Young (1988) points out, a sharing rule that does not satisfy the marginality principle is subject to serious distortions. If one unit’s credit depends on another’s contributions, the first unit can be viewed favorably (or unfavorably) on the basis of the performance of the second. This affects how these units are rewarded or punished and distorts performance. The difficulty is that finding a sharing rule that simultaneously satisfies marginality and efficiency is non-trivial. When the marginality principle is imposed, the sum of individual units’ marginal contributions will not equal the total overall benefit in general. If there are increasing returns from joint production, the sum of marginal contributions will be too high; and if there are decreasing returns, it will be too low.[6]

[6] Paraphrasing Young (1988): “One seemingly innocuous remedy is to compute the marginal product of all factor inputs and then adjust them by some common proportion so that total output is fully distributed. This proportional to marginal product principle is the basis of several classical allocation schemes [...] the proportional to marginal product principle does not resolve the ‘adding up’ problem in a satisfactory way. The reason is that the rule does not base the share of a factor solely on that factor’s own contribution to output, but on all factors’ contribution to output. For example, if one factor’s marginal contribution to output increases while another’s decreases, the share attributed to the first factor may actually decrease; that is, it may bear some of the decrease in productivity associated with the second factor.”

The Shapley value aligns these requirements in an elegant way. Theorem 1 in Young (1988) shows the remarkable property that the Shapley value is the unique sharing rule that is efficient, symmetric and satisfies the marginality principle. This is the reason for the Shapley value’s appeal as a credit-allocation scheme.
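To see concretely how these requirements come together, consider a small two-unit illustration (the numbers here are hypothetical, not from our data). Suppose two ad-positions jointly generate incremental benefit $v(\{1,2\}) = 0.40$, with $v(\{1\}) = 0.10$, $v(\{2\}) = 0.20$, and $v(\emptyset) = 0$. Averaging each position's marginal contribution over the two possible orders in which the positions can be added gives

$\phi_1 = \tfrac{1}{2}\left[v(\{1\}) - v(\emptyset)\right] + \tfrac{1}{2}\left[v(\{1,2\}) - v(\{2\})\right] = \tfrac{1}{2}(0.10) + \tfrac{1}{2}(0.20) = 0.15$

$\phi_2 = \tfrac{1}{2}\left[v(\{2\}) - v(\emptyset)\right] + \tfrac{1}{2}\left[v(\{1,2\}) - v(\{1\})\right] = \tfrac{1}{2}(0.20) + \tfrac{1}{2}(0.30) = 0.25$

so that $\phi_1 + \phi_2 = 0.40 = v(\{1,2\})$: the full joint benefit is distributed (efficiency), while each unit's share depends only on its own marginal contributions (marginality).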

The next three sub-sections discuss how we compute the Shapley values in the ad-position attribution problem. First, we discuss how we define the expected incremental benefit generated by a set of ad-positions for a brand’s order. Then, we discuss how we allocate that benefit to ad-position-day tuples, and aggregate these to generate credit allocations across ad-positions. Finally, we provide an illustrative example.

Defining the Expected Incremental Benefit Generated by a Set of Ad-Positions to an Observed Order

Let $Y_{ubt}$ denote a binary random variable for whether user $u$ purchases brand $b$ on day $t$. An order is a realization $Y_{ubt} = 1$ with associated own-brand ad-exposures at the positions in $S_{ubt}$. The expected incremental benefit generated by brand $b$’s advertising on the order is,

$\Delta_{ubt} = \Pr(Y_{ubt} = 1 \mid S_{ubt}) - \Pr(Y_{ubt} = 1 \mid S_{ubt} = \emptyset)$   (2)

The first term in equation (2) represents the probability of the order occurring given $u$’s exposures to brand $b$’s ads at the ad-positions in $S_{ubt}$ over the preceding $T$ days. The second term represents the counterfactual probability of the order occurring if $u$ did not have any exposures to brand $b$’s ads at those ad-positions over the preceding $T$ days (denoted $S_{ubt} = \emptyset$). Holding everything else fixed, this difference represents the expected incremental contribution of the ad-positions in $S_{ubt}$ to the order. We can think of $\Delta_{ubt}$ as a causal effect of brand $b$’s advertising over the past $T$ days on user $u$’s propensity to place the observed order on day $t$.

Allocating Incremental Benefit to a Position-Day Tuple

To allocate $\Delta_{ubt}$ to the ad-positions in $S_{ubt}$, we first allocate it to each ad-position-day tuple $(p, \tau)$ at which $u$ saw ads of brand $b$ over the last $T$ days. We then sum the allocations across days for the tuples that each ad-position is associated with, to obtain the overall allocation of $\Delta_{ubt}$ to that ad-position $p$.

To do this, let $M_{ubt}$ be the set of ad-position-day combinations at which user $u$ saw ads for brand $b$ during the $T$ days preceding the order. Denote the cardinality of $M_{ubt}$ as $m$.[7] For a given tuple $(p, \tau) \in M_{ubt}$, let $A$ denote a generic element from the power-set of $M_{ubt} \setminus \{(p, \tau)\}$, i.e., a sub-set of the ad-position-day combinations at which the user saw ads for brand $b$ during the $T$ days, excluding tuple $(p, \tau)$. Let the cardinality of $A$ be denoted $a$.

[7] For example, if user $u$ saw ads for brand $b$ at ad-position $p_1$ on days $\tau_1$ and $\tau_2$, and at positions $p_2$ and $p_3$ on day $\tau_1$, $M_{ubt}$ would be $\{(p_1, \tau_1), (p_1, \tau_2), (p_2, \tau_1), (p_3, \tau_1)\}$ and $m = 4$.

Define the function $v(\cdot)$ as,

$v(A) = \Pr(Y_{ubt} = 1 \mid A) - \Pr(Y_{ubt} = 1 \mid \emptyset)$   (3)

i.e., $v(A)$ represents the expected incremental benefit from user $u$ seeing ads for brand $b$ at the ad-position-day combinations in $A$, holding everything else fixed. By construction, $v(M_{ubt}) = \Delta_{ubt}$. So, by allocating $v(M_{ubt})$ across the ad-position-day tuples in $M_{ubt}$, we allocate the same total incremental benefit generated by brand $b$’s advertising as we would by allocating $\Delta_{ubt}$ across the ad-positions in $S_{ubt}$. Also, by construction, $v(\emptyset) = 0$.

For each tuple $(p, \tau) \in M_{ubt}$, we need the allocations $\phi_{(p,\tau)}$ to satisfy two conditions. First, that $\sum_{(p,\tau) \in M_{ubt}} \phi_{(p,\tau)} = \Delta_{ubt}$, so that the allocations sum to the full incremental benefit of the ads on the order (i.e., satisfy allocative efficiency). Second, that the allocation for a given tuple is a function of only its own marginal effects with respect to $v(\cdot)$ (i.e., satisfies the marginality principle). These are the Shapley values for the tuples, defined as,

$\phi_{(p,\tau)} = \sum_{A \subseteq M_{ubt} \setminus \{(p,\tau)\}} \frac{a!\,(m - a - 1)!}{m!} \left[ v(A \cup \{(p,\tau)\}) - v(A) \right]$   (4)

Computing the Shapley values requires a way to estimate the marginal effects in equation (4) from the data, as well as an algorithm that scales to handle the high dimensionality of $M_{ubt}$. This is discussed in the subsequent sections.
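For intuition, the computation in equation (4) can be sketched in a few lines of Python. This is a minimal sketch assuming only a generic value function $v(\cdot)$; in the production system, $v(\cdot)$ is built from the fitted response model as in equations (8)-(9) below, and the toy value function in the usage example is purely hypothetical.

```python
from itertools import combinations
from math import factorial

def exact_shapley(M, v):
    """Exact Shapley values over ad-position-day tuples, per equation (4).

    M: list of (ad_position, day) tuples at which the user saw the brand's ads.
    v: callable mapping a frozenset of tuples A to its incremental benefit v(A),
       with v(frozenset()) == 0.
    Returns {tuple: Shapley value}; the values sum to v(frozenset(M)).
    """
    m = len(M)
    phi = {}
    for pt in M:
        others = [x for x in M if x != pt]
        total = 0.0
        for a in range(len(others) + 1):
            weight = factorial(a) * factorial(m - a - 1) / factorial(m)
            for combo in combinations(others, a):
                A = frozenset(combo)
                total += weight * (v(A | {pt}) - v(A))
        phi[pt] = total
    return phi

# Usage with a toy (hypothetical) value function that depends only on |A|:
M = [(1, 2), (2, 2), (1, 3)]
v = lambda A: 0.02 * len(A) + (0.01 if len(A) > 1 else 0.0)
phi = exact_shapley(M, v)
print(phi)                                 # symmetric tuples get equal credit
print(sum(phi.values()), v(frozenset(M)))  # both 0.07: allocative efficiency
```

Exact enumeration costs $O(2^m)$ evaluations of $v(\cdot)$ per tuple, which motivates the Monte Carlo approximation used in Algorithm 1 below when $m$ is large.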

Once the Shapley values are computed, we sum them across all days to obtain the allocation of that order to ad-position $p$ as,

$c_{up} = \sum_{\tau \in D_p} \phi_{(p,\tau)}$   (5)

where $D_p$ is the set of days in $M_{ubt}$ that are associated with ad-position $p$.

The final step is to do this across all orders observed for brand $b$ on day $t$. To do this, we sum $c_{up}$ across all ad-positions and all users $u$ who bought brand $b$ on day $t$. This gives the overall incremental contribution of the ad-positions to the brand’s orders. To allocate this to ad-position $p$, we simply compute how much ad-position $p$ contributed to this sum. To see this mathematically, we sum $c_{up'}$ across all positions $p'$ for all $u$ that made an order (i.e., with $Y_{ubt} = 1$) to compute the term in the denominator in equation (6); and we sum $c_{up}$ for only ad-position $p$ across all $u$ that made an order to compute the term in the numerator in equation (6). Dividing the numerator by the denominator, we allocate to ad-position $p$ a proportion computed as,

$\pi_{pbt} = \dfrac{\sum_{u:\, Y_{ubt}=1} c_{up}}{\sum_{p'} \sum_{u:\, Y_{ubt}=1} c_{up'}}$   (6)

Each element $\pi_{pbt}$ represents the contribution of ad-position $p$ to the total incremental orders obtained on day $t$ by the brand due to its advertising on the positions. The set $\{\pi_{pbt}\}$ thus represents a set of attributions that can be reported back to the advertiser.

Linking to a Response-Model

Let $n_{ubp\tau}$ be the number of impressions of brand $b$’s ad seen by user $u$ at ad-position $p$ on day $\tau$. Collect all the impressions of the user for the brand’s ad across positions on day $\tau$ in $\mathbf{n}_{ub\tau}$; collect the impression vectors across all the brands for that user on day $\tau$ in $\mathbf{n}_{u\tau}$; and stack the entire vector of impressions across all days and brands in a vector $\mathbf{n}_u$. Let $r_{b\tau}$ be a price-index for brand $b$ on day $\tau$, representing an average price for products of brand $b$ faced by users on day $\tau$.[8] Collect the price indices for all brands on day $\tau$ in a vector $\mathbf{r}_\tau$ and stack these in a vector $\mathbf{r}$. Finally, let $X_u$ represent a vector of user characteristics collected at baseline. The probability of purchase on day $t$ is modeled as a function of the user characteristics, and the ad-impressions and price-indices of brand $b$ and all other brands in the product category over the last $T$ days as,

$\Pr(Y_{ubt} = 1) = h\left(X_u, \mathbf{n}_u, \mathbf{r};\, \hat{\theta}\right)$   (7)

The probability model is parametrized by a vector $\hat{\theta}$ which will be learned from the data.[9]

[8] We compute this as a share-weighted average of the list prices of the SKUs associated with the brand on that day.

[9] The “hat” notation on $\hat{\theta}$ emphasizes that the response parameters are learned in a first stage from the data.

We use equation (7), which provides an expression for $\Pr(Y_{ubt} = 1)$, along with the definition of the marginal effects in equation (3), to compute the Shapley values defined in equation (4). To obtain the marginal effects from the response model, we define an operator $\Gamma(\cdot)$ on $\mathbf{n}_u$ that takes a set $A$ as defined in Section 3.2.1 as an input.[10] Given $A$, $\Gamma$ sets all the impressions of brand $b$, apart from those in the ad-position-day tuples in $A$, to 0; it leaves the impressions of all other brands unchanged.

[10] Recall from Section 3.2.1 that we use $A$ to refer to a sub-set of ad-position-day combinations at which the user saw ads for brand $b$ during the $T$ days, excluding tuple $(p, \tau)$.

Mathematically, taking $\mathbf{n}_u$ and $A$ as input, $\Gamma$ outputs a transformed vector $\tilde{\mathbf{n}}_u = \Gamma(\mathbf{n}_u; A)$ computed as,

$\tilde{n}_{ub'p\tau} = \begin{cases} n_{ub'p\tau} & \text{if } b' \neq b \text{ or } (p, \tau) \in A, \\ 0 & \text{otherwise} \end{cases}$   (8)

With $\Gamma$ as defined above, we can compute the Shapley value using the response model as,

$\phi_{(p,\tau)} = \sum_{A \subseteq M_{ubt} \setminus \{(p,\tau)\}} \frac{a!\,(m - a - 1)!}{m!} \Big[ h\big(X_u, \Gamma(\mathbf{n}_u; A \cup \{(p,\tau)\}), \mathbf{r};\, \hat{\theta}\big) - h\big(X_u, \Gamma(\mathbf{n}_u; A), \mathbf{r};\, \hat{\theta}\big) \Big]$   (9)

In effect, what we obtain in the square brackets in equation (9) is the change in the predicted probability of purchase of brand $b$ on day $t$ by user $u$ when the tuple $(p, \tau)$ is added to the set of ad-position-day combinations in $A$, holding everything else (including competitor advertising) fixed at the values observed in the data for that order.

Illustrative Example

Suppose there are only two brands $b \in \{1, 2\}$, three ad-positions $p \in \{1, 2, 3\}$, and three days $\tau \in \{1, 2, 3\}$. Suppose user $u$, who made an order for brand 1 on day 3, saw 4 ads for brand 1 at ad-position 1 on each of days 2 and 3; 7 ads for brand 1 at position 2 on day 2; and 10 ads for brand 2 at ad-position 3 on day 2. Stacking impressions day by day (and within a day, brand by brand across the three positions), $\mathbf{n}_u = (0,0,0,\;0,0,0,\;4,7,0,\;0,0,10,\;4,0,0,\;0,0,0)$. Suppose we would like to evaluate the Shapley value of tuple $(p, \tau) = (1, 3)$ of brand 1. For this order, $M$ is the set $\{(1,2), (2,2), (1,3)\}$, i.e., the tuples at which the user saw ads for brand 1. The cardinality of the set is $m = 3$. $M \setminus \{(1,3)\}$ is the set $\{(1,2), (2,2)\}$, with corresponding power set $\{\emptyset, \{(1,2)\}, \{(2,2)\}, \{(1,2),(2,2)\}\}$. In equation (9) for the Shapley values, $A$ is an element from this power set. To evaluate the terms in the square brackets, we need to evaluate the probability $h(\cdot)$ at transformed values of $\mathbf{n}_u$ corresponding to $A$ and $A \cup \{(1,3)\}$, for each $A$. Consider one particular value, $A = \{(1,2)\}$. The cardinality of $A$ is $a = 1$.

  • To transform $\mathbf{n}_u$ given $A$, we apply $\Gamma(\mathbf{n}_u; A)$. Applying $\Gamma$ transforms $\mathbf{n}_u$ as follows: as per equation (8), for $b = 1$ and tuples $(2,2)$ and $(1,3)$, set $\tilde{n}_{u,1,2,2} = 0$ and $\tilde{n}_{u,1,1,3} = 0$. All other elements are unchanged. Therefore, $\Gamma(\mathbf{n}_u; A) = (0,0,0,\;0,0,0,\;4,0,0,\;0,0,10,\;0,0,0,\;0,0,0)$. What the transformation has done is to set to 0 the impressions for brand 1 at ad-position-day combinations $(2,2)$ and $(1,3)$, which are the tuples not in $A$ at which $u$ saw ads for brand 1 during the $T$ days.

  • To transform $\mathbf{n}_u$ given $A \cup \{(1,3)\}$, we apply $\Gamma(\mathbf{n}_u; A \cup \{(1,3)\})$. Applying $\Gamma$ transforms $\mathbf{n}_u$ as follows: for $b = 1$ and tuple $(2,2)$, set $\tilde{n}_{u,1,2,2} = 0$. All other elements are unchanged. Therefore, $\Gamma(\mathbf{n}_u; A \cup \{(1,3)\}) = (0,0,0,\;0,0,0,\;4,0,0,\;0,0,10,\;4,0,0,\;0,0,0)$. What the transformation has done is to set to 0 the impressions for brand 1 at ad-position-day combination $(2,2)$, which is the only tuple not in $A \cup \{(1,3)\}$ at which $u$ saw ads for brand 1 during the $T$ days.

Evaluating $h(\cdot)$ at these two transformed vectors and differencing generates the term in the square brackets in equation (9). Repeating for all possible $A$, weighting, and summing per (9) gives the Shapley value for brand 1 for tuple $(1,3)$.
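A minimal Python sketch of the masking operator $\Gamma$ for this example follows. The day-major, brand-then-position stacking of the 18-element vector is an assumption made to match the example above; the production system's actual data layout may differ.

```python
import numpy as np

B, P, D = 2, 3, 3   # brands, ad-positions, days in the example

def gamma(n, brand, A):
    """Masking operator of equation (8): zero out the focal brand's
    impressions at every (position, day) tuple not in A; impressions
    of all other brands are left unchanged."""
    n = n.reshape(D, B, P).copy()   # index as [day-1, brand-1, position-1]
    for d in range(1, D + 1):
        for p in range(1, P + 1):
            if (p, d) not in A:
                n[d - 1, brand - 1, p - 1] = 0
    return n.reshape(-1)

n_u = np.array([0, 0, 0,  0, 0, 0,    # day 1: brand 1, brand 2
                4, 7, 0,  0, 0, 10,   # day 2
                4, 0, 0,  0, 0, 0])   # day 3

print(gamma(n_u, brand=1, A={(1, 2)}))
# [0 0 0 0 0 0 4 0 0 0 0 10 0 0 0 0 0 0]   (matches the first bullet)
print(gamma(n_u, brand=1, A={(1, 2), (1, 3)}))
# [0 0 0 0 0 0 4 0 0 0 0 10 4 0 0 0 0 0]   (matches the second bullet)
```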

An Efficient Algorithm for Fast, Large-Scale Computation

Exact computation of Shapley values as described above is computationally intensive. Shapley values have to be calculated separately for each order, and the number of orders can run into the millions on a given day on an eCommerce platform like JD.com. Additionally, Shapley values have to be computed for each ad-position-day tuple for each order. When $m$ is large, this latter step also becomes computationally intensive, requiring Monte Carlo simulation methods to approximate the calculation.

We seek an implementation that scales to accommodate a large number of brands and orders, and generates reports in a matter of hours, which is important for business purposes. Our implementation switches between exact and approximate solutions for the Shapley values depending on the cardinality of $M_{ubt}$, and is implemented in a MapReduce framework so it runs in a parallel, distributed environment on a cluster. Algorithm 1 presents details.

Input: trained response model $h(\cdot;\hat{\theta})$; orders $\{(u, b, t): Y_{ubt} = 1\}$; impression vectors $\mathbf{n}_u$; price indices $\mathbf{r}$; user characteristics $X_u$

Output: the contribution of each ad-position to each brand

Function Map(order $(u, b, t)$)

1: determine $M_{ubt}$ according to $\mathbf{n}_u$ and $b$
2: $m \leftarrow |M_{ubt}|$; initialize $\phi_{(p,\tau)} \leftarrow 0$ for all $(p,\tau) \in M_{ubt}$
3: if $m$ is small enough for the exact method then
4:     for each tuple $(p,\tau) \in M_{ubt}$ do
5:         for each $A \subseteq M_{ubt} \setminus \{(p,\tau)\}$, with $a = |A|$, do
6:             $\phi_{(p,\tau)} \mathrel{+}= \frac{a!\,(m-a-1)!}{m!}\,[v(A \cup \{(p,\tau)\}) - v(A)]$
7: else
8:     $\Pi \leftarrow$ the set of permutations of $M_{ubt}$
9:     $K \leftarrow$ number of draws for Monte Carlo approximation
10:    downsample $\Pi$ to keep $K$ elements
11:    for each permutation $\pi \in \Pi$ do
12:        for each tuple $(p,\tau) \in M_{ubt}$ do
13:            $A \leftarrow$ the set of tuples preceding $(p,\tau)$ in $\pi$
14:            $\phi_{(p,\tau)} \mathrel{+}= [v(A \cup \{(p,\tau)\}) - v(A)]\,/\,K$
15: for each ad-position $p$ in $S_{ubt}$ do
16:     $c_{up} \leftarrow \sum_{\tau \in D_p} \phi_{(p,\tau)}$
17:     emit key-value pair $\langle (b, p),\; c_{up} \rangle$

Function Reduce(key $(b, p)$, values $\{c_{up}\}$)

1: $c \leftarrow 0$
2: for each $c_{up}$ in values do
3:     $c \mathrel{+}= c_{up}$
4: emit $\langle (b, p),\; c \rangle$; the shares $\pi_{pbt}$ of equation (6) follow by normalizing across $p$

Algorithm 1 Distributed Contribution Computation
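The Monte Carlo branch of Algorithm 1 can be sketched compactly in Python. This is a sketch under the assumption that permutations are sampled uniformly at random; averaging each tuple's marginal contribution over sampled permutations is an unbiased estimator of equation (4). The number of draws is a tuning choice, not a value reported in the paper.

```python
import random

def mc_shapley(M, v, num_draws=200, seed=0):
    """Monte Carlo approximation of the Shapley values in equation (4),
    for orders where |M| is too large to enumerate all subsets.

    For each sampled permutation, every tuple is credited with its marginal
    effect given the tuples that precede it in that permutation."""
    rng = random.Random(seed)
    phi = {pt: 0.0 for pt in M}
    for _ in range(num_draws):
        perm = list(M)
        rng.shuffle(perm)
        A = frozenset()
        for pt in perm:
            phi[pt] += (v(A | {pt}) - v(A)) / num_draws
            A = A | {pt}
    return phi
```

Because each evaluation of $v(\cdot)$ is a forward pass through the trained response model, caching $v(A)$ across repeated subsets is the main practical lever for speed.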

3.2.2 Response Model

The purpose of the response model is to provide a data-driven way to estimate the marginal effects in equation (4). The architecture of the RNN is presented in Figure (2). Though the model training is done simultaneously across all brands, the picture is drawn only for one brand. The input vector of ad-impressions, $\mathbf{n}_u$, and the input vector of price-indexes, $\mathbf{r}$, are fed through an LSTM layer with recurrence. The user characteristics, $X_u$, are processed through a separate, fully-connected layer. The outputs from the LSTM cells and the fully-connected layer jointly impact the predicted outcome $\hat{Y}_{ubt}$. Combining this with the observed outcome, $Y_{ubt}$, we obtain the log-likelihood, which forms the loss function for the model. The RNN finds a set of parameters or weights $\hat{\theta}$ that maximizes the log-likelihood.

  • Notes: The Figure shows the computational graph for the RNN model for ad-response. Though the model training is done simultaneously across all brands, the picture is drawn only for one brand. The input vector of ad-impressions, $\mathbf{n}_u$, and the input vector of price-indexes, $\mathbf{r}$, are fed through an LSTM layer with recurrence. The user characteristics, $X_u$, are processed through a separate, fully-connected layer. The outputs from the LSTM cells and the fully-connected layer jointly impact the predicted outcome $\hat{Y}_{ubt}$. Combining this with the observed outcome, $Y_{ubt}$, we obtain the log-likelihood, which forms the loss function for the model.

Figure 2: Computational Graph for RNN

As noted before, we utilize a bi-directional formulation in which we allow for a hidden layer with backward recurrence, augmented with a hidden layer with forward recurrence. The layer with forward recurrence serves as a semi-parametric summary of future activity that is helpful to predict current actions. This is shown in Figure (2), where the superscript “fw” indicates forward recurrence and “bw” indicates backward recurrence. The use of “future” ad-impressions for predicting current behavior in the response model requires some elaboration when the model is used to compute the “causal” or marginal effects in the Shapley Values. Note that the causal effects the response model has to deliver for computing the Shapley values for a user are always differences in the user’s predicted probabilities of purchase in a period, under different retrospective counterfactual ad-impression sequences. This means the use of future ad-impressions to predict behavior in the bi-directional model poses no conceptual difficulty in developing the causal effects required for computing Shapley Values.

The model is implemented in TensorFlow. We use the non-peephole implementation of the LSTM (Hochreiter and Schmidhuber, 1997); regularize via dropout; and optimize via stochastic gradient descent using the Adam Optimizer (Kingma and Ba, 2014). We initialize the forward and backward LSTM cells to 0; the LSTM weights orthogonally with multiplicative factor 1.0; and the other parameters using truncated normal draws. The model training is stopped when the error in a validation dataset stabilizes.
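For concreteness, a minimal tf.keras sketch of this architecture for a single brand is below. The layer widths, dropout rate, and input dimensions are illustrative assumptions (the paper does not report them), and the production model is trained simultaneously across all brands.

```python
import tensorflow as tf

T, SEQ_DIM, X_DIM = 15, 302, 64   # assumed: 15 days; 301 ad-positions + 1 price index; user features

# Per-day sequence input: impression counts by ad-position plus the price index.
seq_in = tf.keras.Input(shape=(T, SEQ_DIM), name="impressions_and_prices")

# Bi-directional LSTM: backward recurrence plus the forward-recurrence
# summary of future exposures described in the text.
h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2))(seq_in)

# Baseline user characteristics through a separate fully connected layer,
# shifting the intercept of the logistic output (the "fixed-effects" role).
user_in = tf.keras.Input(shape=(X_DIM,), name="user_characteristics")
u = tf.keras.layers.Dense(32, activation="relu")(user_in)

# Logistic output layer: predicted purchase probability for the brand.
out = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Concatenate()([h, u]))

model = tf.keras.Model(inputs=[seq_in, user_in], outputs=out)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy",   # negative log-likelihood
              metrics=[tf.keras.metrics.AUC()])
```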

4 Experiments and Application to Cell-phone Product Category

We present an application of the model using individual-level data on ad-exposures and purchases from the cell-phone product category on JD.com during a 15-day window in 2017. We first present some model-free evidence documenting the quantitative relevance in our data of some of the considerations outlined in the introduction for a good response model. Then we present model performance metrics and results.

Data and Summary Statistics

To create the training data, we sample users who saw, during the 15-day window, at least one ad-impression related to a product in the cell-phone product category sold on JD.com. Within this overall sample, we define the positive sample as the set of users who purchased a product of any brand in the cell-phone category during the time window. We define the negative sample as the set of users in the overall sample who did not purchase any product in the cell-phone category during the 15-day time window. Table 1 provides summary statistics of the training dataset. There are roughly 75M users, 3.4M orders, and 7B ad-impressions. There are 301 ad-positions. We aggregate brands to 31. Table (2) shows market shares on the basis of units sold and revenue generated. Huawei is the largest brand, followed by Apple (second largest in terms of revenue), Xiaomi, Meizhu, Vivo and others.

Number of users in Overall Sample: 75,768,508
Number of users in Positive Sample: 2,100,687
Number of users in Negative Sample: 73,667,821
Number of ad-impressions in product category over 15 days: 7,153,997,856
Number of orders made in product category over 15 days: 3,477,621
Number of orders made on day 15: 175,937
Number of brands: 31
Number of ad-positions: 301

  • Notes: Descriptive statistics of the training dataset, which comprises individual-level data on ad-exposures and purchases from the cell-phone category on JD.com during a 15-day window in 2017. The positive sample is the set of users who purchased a product of any brand in the cell-phone category during the time window. The negative sample is the set of users in the overall sample who did not purchase any product in the cell-phone category during the 15-day time window.

Table 1: Summary Statistics of Training Data
                 Huawei   Xiaomi   Apple   Meizhu   Vivo   Others
By Units Sold    29.5%    25.3%    8.4%    6.2%     3.2%   27.4%
By RMB Sold      27.2%    18.4%    24.7%   4.2%     5.3%   20.2%
  • Notes: Brand-level market shares in the training data based on units sold and by money spent (RMB) in the cell-phone category on JD.com during a 15-day window in 2017.

Table 2: Market Shares in Category by Brand
Brand Positive Sample Negative Sample
Mean SD Mean SD
Huawei 201.6 328.6 34.7 144.9
Xiaomi 220.3 371.6 26.3 111.7
Apple 147.2 253.1 16.5 63.3
Meizhu 173.4 309.9 14.3 65.8
Vivo 82.4 147.9 15.2 46.6
Others 97.6 197.7 8.1 43.2
Table 3: Summary Statistics of Ad-Exposures by Brand
Motivating Patterns in Data

Table (3) shows summary statistics of ad-exposures split by brand, separately for the positive sample and the negative sample. Ad-exposures are seen to be much higher in the positive sample. For example, mean exposures in the positive sample (over the 15-day window) are 737% higher for Xiaomi (220.3 vs. 26.3), and 792% higher for Apple (147.2 vs. 16.5). While some of this can be driven by brands targeting users who are more likely to buy their products, the large differences across the positive and negative samples by brand suggest that the sequence of ad-exposures over the 15-day window matters in explaining purchases on day 15.

The figure “Intensity of Advertising Exposure Matters for Conversion” (below) plots the probability of purchase of a brand on day 15 as a function of the number of impressions of ads for Apple (left panel) or Xiaomi (right panel) seen by the user in the past 15 days. There is evidence of a robust positive association, suggesting that the intensity of own-advertising exposure matters for conversion. The figure “Timing of Advertising Exposure Matters for Conversion” shows the probability of purchase of a brand on day 15 as a function of the days since the user saw impressions of ads for that brand. Plots are presented separately for Apple (left panel) and Xiaomi (right panel). To represent exposure timing, we plot on the $x$-axis the average days elapsed since a user saw ad-impressions of that brand, obtained by weighting the day of each ad-impression by the number of impressions over the 15 days. Specifically, for each user $u$ and brand $b$, the $x$-value is computed as $\bar{d}_{ub} = \sum_{\tau=1}^{15} (15-\tau)\, n_{ub\tau} \,/\, \sum_{\tau=1}^{15} n_{ub\tau}$, where $n_{ub\tau}$ is the number of impressions of ads of brand $b$ seen by user $u$ on day $\tau$, $\tau = 1, \ldots, 15$. The response curve is U-shaped, with more recent exposures associated with higher purchase probability of the brand on day 15. This suggests decay in the ad-response. The plot also shows the effect decays close to zero in 15 days, providing some data-based justification for using this cutoff.
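A small sketch of this recency measure (assuming, as in the reconstruction above, that the order day is day 15 and that recency is measured in days elapsed):

```python
import numpy as np

def avg_days_elapsed(n, order_day=15):
    """Impression-weighted average days elapsed before the order day.

    n: length-15 array; n[tau - 1] is the user's impression count for
    the brand on day tau."""
    days = np.arange(1, len(n) + 1)
    return ((order_day - days) * n).sum() / n.sum()

n = np.zeros(15)
n[13], n[4] = 10, 5          # 10 impressions on day 14, 5 on day 5
print(avg_days_elapsed(n))   # (1*10 + 10*5) / 15 = 4.0
```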

The figure “Exposure to Competitor Advertising Matters for Conversion” plots the probability of purchase of a brand on day 15 as a function of the number of impressions of ads for Apple (left panel) seen by the user over the 15-day window. The probability of purchase of Apple is plotted separately by a median split of the number of impressions seen of Xiaomi ads. The blue dots depict the probabilities for those who saw more than the median number of impressions of Xiaomi ads over the 15-day window, and the red dots depict the probabilities for those who saw less than the median. There is evidence of separation: the association of purchase probabilities with Apple ad-impressions is steeper for those who saw less than the median number of Xiaomi ad-impressions over the 15-day window. The right panel depicts the analogous plot for Xiaomi. The pattern is similar, suggesting the importance of allowing competitor ads to matter in affecting purchases.

Finally, in Figure (3), we assess the importance of including user characteristics in the model as a control for user heterogeneity and selection into ad-exposure. We do this informally by comparing the marginal effects for a search ad-position for linear models with and without “fixed-effects.” Compared to a model without them, the model with fixed-effects allows flexible user heterogeneity in the intercept. We choose a search ad-position because it receives some of the highest ad-impressions in the data, and also because the issue of selection is likely to be severe in the case of search ads (i.e., those who like a brand are more likely to search for it and see the search ads, while also being more likely to buy the product without the exposure). If we find that the predicted marginal effects are “more reasonable” under the fixed-effects model compared to the base model, that provides some evidence for the value of including user heterogeneity. Past “A/B” testing at JD.com has shown that the search ad-position produces positive marginal lift across many historical campaigns. So, we use the extent to which the predicted results are positive as a metric of reasonableness.

To do this, we let $y_u$ be an indicator of whether user $u$ bought a product of a given brand on day 15 (to economize on notation, the index for the brand is suppressed). We first train two models for $y_u$: (1) a linear model $y_u = \alpha + \sum_p \beta_p x_{up} + \epsilon_u$, where $x_{up}$ is the number of impressions seen by user $u$ of that brand at ad-position $p$ over the 15-day window; and, (2) a linear fixed-effects model $y_u = \alpha_u + \sum_p \beta_p x_{up} + \epsilon_u$, which is the same as the previous model, except the intercept is $u$-specific. Once the two models are trained, we store them in memory. The predictions from the two models are denoted $\hat{y}^{(1)}_u$ and $\hat{y}^{(2)}_u$ respectively. Our goal is to compare predictions between the two models for a focal search ad-position, denoted by $p^*$. The marginal value of seeing ads at position $p^*$ depends on the sequence of ads preceding it, so in order to do the comparison between models, we need to pick a set of sequences on which to base the comparison. We would like to pick the sequences that occur most often in the data, and for which we have enough data, so that we can assess the comparison with a reasonable amount of precision. With these considerations, we filter to the sub-set of users in the data who saw ad-impressions at at least 10 ad-positions over the 15-day window (10 is the median across users in the data). For each user $u$ in this sub-set, we let $Q_u$ denote the sequence of ad-positions that $u$ saw over the 15-day window (so, by construction, $|Q_u| \geq 10$). Denote by $q$ a 9-element permutation from the set of ad-positions excluding $p^*$. We then estimate the marginal value of seeing ads at position $p^*$ when it occurs as the 10th in the sequence, given that the first 9 ad-positions seen are $q$, as,

$\hat{m}^{(1)}(q) = \text{mean}\big[\hat{y}^{(1)}_u \mid Q_u \text{ begins with } (q, p^*)\big] - \text{mean}\big[\hat{y}^{(1)}_u \mid Q_u \text{ begins with } q \text{ followed by a position other than } p^*\big]$

The first term is obtained by averaging the predicted $\hat{y}^{(1)}_u$ using the observed ad-impressions for all users whose sequence begins with $(q, p^*)$, and the second term is obtained by averaging the predicted $\hat{y}^{(1)}_u$ using the observed impressions for all users whose sequence begins with $q$ followed by a different position. For each 9-position sequence $q$, $\hat{m}^{(1)}(q)$ thus provides an estimate, using the linear model, of the incremental benefit of seeing ads at position $p^*$ next, rather than at another position next. We do the same thing using the fixed-effects model, computing $\hat{m}^{(2)}(q)$ analogously for each $q$. Then, we plot a histogram of the distribution of $\hat{m}^{(1)}(q)$ and $\hat{m}^{(2)}(q)$ across $q$ for ad-position $p^*$. This is shown in Figure (3).

Looking at Figure (3), we see that both models put significant probability mass on the positive support, but fewer of the marginal effects are negative under the fixed-effects model. We take this as supportive evidence that including controls for selection is important to generate reasonable measures of ad-effectiveness, apart from allowing effects to be estimated separately by user segment.

Figure: Intensity of Advertising Exposure Matters for Conversion (left panel: Apple; right panel: Xiaomi)

  • Notes: The Figure shows a plot of the probability of purchase of a brand on day 15 as a function of the number of impressions of ads for Apple (left panel) or Xiaomi (right panel) seen by the user in the past 15 days. To construct the plot, define the negative sample as the set of users in the overall sample who did not purchase any product in the cell-phone category during the 15-day time window. At each value of the $x$-axis (number of own-brand ad-impressions), the probability of purchase of the brand on the $y$-axis is computed as the number of users who bought the brand’s products on day 15, divided by the total number of users in the negative sample. The $x$-axis is capped at 2,000 impressions to account for bots, crawlers, non-individual buyers, etc.

Figure: Timing of Advertising Exposure Matters for Conversion (left panel: Apple; right panel: Xiaomi)

  • Notes: The Figure shows a plot of the probability of purchase of a brand on day 15 as a function of the days since the user saw impressions of ads for that brand. Plots are presented separately for Apple (left panel) and Xiaomi (right panel). To construct the plot, define the negative sample as the set of users in the overall sample who did not purchase any product in the cell-phone category during the 15-day time window. At each value of the $x$-axis, the probability of purchase of the brand on the $y$-axis is computed as the number of users who bought the brand’s products on day 15, divided by the total number of users in the negative sample. The $x$-axis represents the average days elapsed since the user saw ad-impressions of that brand, obtained by weighting the day of each ad-impression by the number of impressions over the 15 days. Specifically, for each user $u$ and brand $b$, the $x$-value is computed as $\bar{d}_{ub} = \sum_{\tau=1}^{15} (15-\tau)\, n_{ub\tau} \,/\, \sum_{\tau=1}^{15} n_{ub\tau}$, where $n_{ub\tau}$ is the number of impressions of ads of brand $b$ seen by user $u$ on day $\tau$, $\tau = 1, \ldots, 15$.

Figure: Exposure to Competitor Advertising Matters for Conversion (left panel: Apple; right panel: Xiaomi)

  • Notes: The Figure shows a plot of the probability of purchase of a brand on day 15 as a function of the number of impressions of ads for Apple (left panel) seen by the user over the 15-day window. To construct the plot, define the negative sample as the set of users in the overall sample who did not purchase any product in the cell-phone category during the 15-day time window. At each value of the $x$-axis (number of own-brand ad-impressions), the probability of purchase of the brand on the $y$-axis is computed as the number of users who bought the brand’s products on day 15, divided by the total number of users in the negative sample. The $x$-axis is capped at 2,000 impressions to account for bots, crawlers, non-individual buyers, etc. The probability of purchase of Apple is plotted separately by a median split of the number of impressions seen of Xiaomi ads. The blue dots depict the probabilities for those who saw more than the median number of impressions of Xiaomi ads over the 15-day window, and the red dots depict the probabilities for those who saw less than the median. There is evidence of separation: the association of purchase probabilities with Apple ad-impressions is steeper for those who saw less than the median number of Xiaomi ad-impressions over the 15-day window. The right panel depicts the analogous plot for Xiaomi.

  • Notes: Histogram of estimated marginal effects for a search ad-position using linear models with and without “fixed-effects.” Both models put significant probability mass on the positive support, but fewer of the marginal effects are negative under the fixed-effects model.

Figure 3: Comparison of Marginal Effects from a Linear Response Model With and Without User Fixed Effects
Model Performance

Table 4 shows the accuracy, recall, and precision of the model. At the brand level, the model has precision ranging from 69% to 76% and recall ranging from 12% to 34% for the top brands, showing it fits the data well; a sketch of how these metrics are computed follows the table.

Brand     Accuracy   Recall   Precision
Huawei    0.997      0.340    0.714
Xiaomi    0.997      0.308    0.705
Apple     0.998      0.218    0.696
Meizhu    0.999      0.186    0.727
Vivo      0.999      0.119    0.762
Others    1.000      0.095    0.726

Table 4: RNN Model Performance Metrics by Brand
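The per-brand figures in Table 4 are standard binary-classification quantities; a minimal sketch of their computation (using scikit-learn, purely for illustration) is below. Because conversions are rare, accuracy is mechanically close to 1 for every brand, which is why precision and recall are the informative columns.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def brand_metrics(y_true, y_pred) -> dict:
    """y_true, y_pred: binary arrays of observed and predicted day-15
    conversions for a single brand on held-out validation data."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
    }
```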

Figure 4 compares the accuracy, precision, recall, and AUC (area under the curve) statistics of the model against two benchmark specifications. (All statistics are computed on a validation dataset held out from the training dataset.) The first is a unidirectional LSTM RNN, which is exactly the same as the preferred model but without the forward recurrence. The second is a flexible logistic model, which specifies the probability of a user purchasing brand $b$ on day $t$ as a semi-parametric logistic function of the ad-impressions and price-indexes on the same day only (i.e., $\Pr(y_{ubt} = 1) = \mathrm{logit}^{-1}\!\left(f(x_{ubt})\right)$, where $x_{ubt}$ collects the same-day ad-impressions and price-indexes).

Looking at the results, we see the bi-directional RNN has the highest AUC amongst the models, and has accuracy, precision, and recall statistics that are comparable or higher. The poor performance of the logistic model in particular emphasizes the importance of accounting for dependence over time when fitting the data. The plots also show the speed of convergence of the models as a function of the number of training steps; the bi-directional RNN converges in fewer training steps. This is helpful in production, which typically requires frequent model updating.

(To get a sense of this, the training times for 30,000 steps for the three models on our cluster are 11.21 hrs for the bi-directional RNN, 9.68 hrs for the unidirectional RNN, and 12.48 hrs for the logistic model.)
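To fix ideas, a minimal sketch of the class of models being compared is below, written against the modern tf.keras API (the production system used TensorFlow v1.6; the layer sizes and input dimension are placeholder assumptions, not the deployed architecture):

```python
import tensorflow as tf

def build_response_model(input_dim: int, hidden: int = 64,
                         bidirectional: bool = True) -> tf.keras.Model:
    """Sequence model mapping a 15-day exposure history to a per-day
    purchase probability. bidirectional=False gives a unidirectional
    LSTM benchmark; the flexible logistic benchmark would drop the
    recurrence entirely and score each day independently."""
    inputs = tf.keras.Input(shape=(15, input_dim))  # one feature vector per day
    rnn = tf.keras.layers.LSTM(hidden, return_sequences=True)
    hidden_seq = tf.keras.layers.Bidirectional(rnn)(inputs) if bidirectional else rnn(inputs)
    # Per-day purchase probability from the (bi-directional) hidden state.
    probs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(1, activation="sigmoid"))(hidden_seq)
    model = tf.keras.Model(inputs, probs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```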

[Figure panels: (a) Accuracy, (b) Precision, (c) Recall, (d) AUC]
Figure 4: Performance of Bi-directional RNN Relative to Benchmark Specifications

Table 5 also benchmarks Algorithm 1 for the distributed computation of Shapley values. Recall that this algorithm shifts from exact computation of Shapley values to a Monte Carlo simulation approximation when the number of ad-positions over which to allocate credit is large. This “mixed” method improves computational speed, which is important for high-frequency reporting of results in deployment. To assess the performance of the algorithm, we pick 6,000 orders and run the algorithm on these data under various configurations. The experiment is repeated 5 times for each configuration, and the average across the 5 replications is reported. (The computational environment is a Spark 2.3 cluster accessed via pyspark, running TensorFlow v1.6, with an 8-core CPU, 100 workers, and 8 GB of memory per worker, without GPUs.) The first row in the table reports the number of orders we are able to attribute per minute using the three methods: the mixed method is about 2,300% faster than exact computation, and about 14% faster than a simulation-only method. The second row documents that this efficiency gain does not come at the cost of high error: the average error of the mixed method is low relative to exact computation, and an order of magnitude smaller than that of full simulation. (We compute error as the mean squared difference, across orders, in the total attributed value computed in Algorithm 1 between the evaluated algorithm and exact computation, i.e., $\frac{1}{N}\sum_{o=1}^{N}\left(v_o - v_o^{\mathrm{exact}}\right)^2$, where $v_o$ is the total value attributed to order $o$ by the evaluated algorithm.)
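To illustrate the exact-versus-simulation trade-off behind the mixed method, here is a self-contained sketch (not the production Algorithm 1; the cutoff and sample count are placeholder assumptions) that enumerates all orderings when the number of ad-positions is small and falls back to sampled orderings otherwise:

```python
import itertools
import random

def shapley_values(positions, value_fn, exact_cutoff=8, n_samples=2000):
    """positions: list of ad-positions the order was exposed to.
    value_fn: maps a frozenset of positions to the model-predicted
    incremental conversion probability under that exposure set.
    Returns the Shapley value of each position."""
    n = len(positions)
    # Exact: all n! orderings; approximate: a Monte Carlo sample of orderings.
    if n <= exact_cutoff:
        orderings = list(itertools.permutations(positions))
    else:
        orderings = [random.sample(positions, n) for _ in range(n_samples)]
    phi = {p: 0.0 for p in positions}
    for ordering in orderings:
        members, prev = set(), value_fn(frozenset())
        for p in ordering:
            members.add(p)
            cur = value_fn(frozenset(members))
            phi[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: total / len(orderings) for p, total in phi.items()}
```

The speedup of the mixed method in Table 5 comes from replacing the factorial enumeration with a fixed number of sampled orderings once the number of positions exceeds the cutoff.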

Algorithm                     Exact              Approximate   Mixed
Orders processed per minute   4.2                88.85         101.24
Error                         0 (by definition)  0.3190        0.0064

Table 5: Processing Efficiency and Prediction Error when using Exact, Approximate, and Mixed Shapley Value Computation
Model Results

To explore the results from the model, we first discuss Figure 7, which emphasizes the importance of accounting for the incrementality of advertising in the allocation of orders. The figure assesses how much of the probability of an observed order occurring is driven by advertising. For each observed order in the cell-phone category on day-15, we compute the predicted probability of that order occurring with and without advertising exposure, and take the difference as the incremental probability associated with advertising. We then compute the ratio of this incremental probability to the total predicted probability with advertising. The figure presents a histogram of this ratio across all orders. There is a substantial spread from 0 to 1, with some orders that seem driven primarily by advertising, and some that would have occurred anyway irrespective of the brand’s advertising. It is this counterfactual outcome, what the order probability would have been in the absence of advertising, that the incrementality-based allocation seeks to reflect.
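A minimal sketch of this incrementality ratio, assuming a fitted response model with a generic predict interface (the names, and the simplifying assumption that every feature in the sequence is an ad-exposure feature, are illustrative):

```python
import numpy as np

def incrementality_ratio(model, exposure_seq: np.ndarray) -> float:
    """exposure_seq: the user's 15-day feature sequence for the ordered brand,
    shaped as the model expects. Zeroing the sequence gives the no-advertising
    counterfactual under the assumption that all features are ad-exposure
    features; in practice only the ad-related entries would be zeroed."""
    p_with = float(model.predict(exposure_seq))                    # P(order | observed ads)
    p_without = float(model.predict(np.zeros_like(exposure_seq)))  # P(order | no ads)
    return (p_with - p_without) / p_with  # share of P(order) driven by ads
```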

  • Notes: The Figure shows the eCDF of the Shapley values computed on the basis of the RNN model for all the orders on day-15. The median Shapley value is about 0.25, suggesting the median ad-position to which exposure occurs contributes about 25% of the incremental benefit from advertising. About 70% of the Shapley values are below 0.5, so most ad-positions contribute less than half of the incremental benefit generated from advertising. About 15% of the Shapley values are above 0.8, suggesting these ad-positions contribute more than 80% of the incremental benefit generated from advertising.

Figure 5: Empirical CDF of Computed Shapley Values
  • Notes: The Figure shows the Shapley Value from the RNN model at each ad-position indexed on the x-axis, averaged across all orders on day-15 for which that position was the last clicked. This allows benchmarking the Shapley Value based attribution against “last-click” attribution, which allocates 100% of the credit for the order to the last-clicked ad-position. The Shapley values are all seen to be less than 1, showing that under the model, the last-clicked ad-positions do not obtain full credit. Since the Shapley values are all less than 0.6, the RNN model suggests that last-clicked ads contribute at most 60% of the incremental conversion generated by advertising. Finally, cart and payment page positions, which may get a lot of credit under “last-click” or “last-visit” attribution schemes on eCommerce sites, are seen to be allocated relatively little credit by the model.

Figure 6: Shapley Values from RNN Model at each Ad-position, Averaged Across all Orders in which that Ad-position was the “Last-clicked”
  • Notes: The Figure shows how much of the probability of purchase is driven by advertising. For each order on day-15, we compute the predicted probability of that order occurring with and without advertising exposure, and take the difference as the incremental probability associated with advertising. We then compute the ratio of this incremental probability to the total predicted probability with advertising. The figure presents a histogram of this ratio across all orders.

Figure 7: Distribution Across Orders of How Much of the Probability of Purchase is Incrementally Driven by Advertising

Figure 5 shows the empirical CDF of the Shapley values computed on the basis of the RNN model for all the orders on day-15. The median Shapley value is about 0.25, suggesting the median ad-position to which exposure occurs contributes about 25% of the incremental benefit from advertising. About 70% of the Shapley values are below 0.5, so most ad-positions contribute less than half of the incremental benefit generated from advertising. About 15% of the Shapley values are above 0.8, suggesting these ad-positions contribute more than 80% of the incremental benefit generated from advertising. This shows the Shapley values have discriminatory power: they help identify the top ad-positions that contribute most to observed outcomes.

Finally, Figure 6 compares the credit allocation based on Shapley Values to rule-based “last-clicked” attribution. To do this, Figure 6 shows the Shapley Value from the RNN model at each ad-position indexed on the x-axis, averaged across all orders on day-15 for which that position was the last clicked. This allows benchmarking the Shapley Value based attribution against “last-click” attribution, which allocates 100% of the credit for the order to the last-clicked ad-position. The Shapley values are all seen to be less than 1, showing that under the model, the last-clicked ad-positions do not obtain full credit. Since the Shapley values are all less than 0.6, the RNN model suggests that last-clicked ads contribute at most 60% of the incremental conversion generated by advertising. Further, cart and payment page positions (which correspond to ads shown on these positions), which may get a lot of credit under “last-click” or “last-visit” attribution schemes on eCommerce sites, are seen to be allocated relatively little credit by the model.
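The quantity plotted in Figure 6 is simple to compute once per-order Shapley values are in hand; a sketch follows (pandas; the column names are assumptions):

```python
import pandas as pd

def avg_shapley_for_last_clicked(df: pd.DataFrame) -> pd.Series:
    """df: one row per (order, ad-position) with assumed columns
       'position'     -- the ad-position identifier,
       'shapley'      -- the position's Shapley value for that order,
       'last_clicked' -- 1 if this position was the order's last-clicked ad.
    Returns, per position, the mean Shapley value over orders in which that
    position was last-clicked; last-click attribution would assign 1.0 everywhere."""
    return df[df["last_clicked"] == 1].groupby("position")["shapley"].mean()
```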

As a final assessment, we compare the left panel of the figure below, which shows the proportion of ad-impressions in the training data across the ad-positions, to the right panel, which shows the average (across orders) of the Shapley Values for the same ad-positions. In both panels, the ad-positions are indexed in decreasing order of their share of total impressions, so ad-position 1 receives more impressions than ad-position 2, and so on. Comparing the two panels, we can see that the distribution of Shapley Values across positions does not follow the same pattern as that of impressions, suggesting that the effect picked up by the model is not purely driven by the intensity of advertising expenditures by advertisers (which drive impressions). Further, we observe that some positions that receive fewer ad-impressions have higher Shapley Values than positions that receive more. This suggests that advertising expenditure allocations overall may not be optimal from the advertiser’s perspective, and could be improved by better incorporation of attribution using a model such as the one presented here. A more formal assessment of this issue, however, requires a method for advertiser budget allocation across ad-positions, which is outside the scope of this paper.

[Impression Share by Ad-Position]

[Shapley Values by Ad-Position]

  • Notes: The left panel of the figure shows the proportion of ad-impressions in the training data across the ad-positions. The ad-positions are indexed in decreasing order of their share of total impressions. The right panel of the figure shows the average (across orders) of the Shapley Values for the same ad-positions. Comparing the two panels, we can see that the Shapley Values do not simply map out the intensity of impressions.

5 Implementation and Extension to Larger Scale

For actual implementation, the model has to be scaled to all brands and all categories on JD.com. From a model training and updating perspective, it is intractable to maintain a separate model for each product category (there are more than 175). For production, we extend the model presented above to accommodate all product categories in one unified framework. Because this model has larger scale, we impose some parameter restrictions to reduce the dimensionality of the problem. First, we allow categories to have separate parameters, but restrict the weights for all brands within a product category to be similar. Second, on the basis of pre-training data, we create a set of features that characterize each brand (e.g., brand rank within JD.com), and include them as covariates that shift the intercept of the output layer. This allows for heterogeneity across brands within each product category. Third, for each user $u$, brand $b$, and day $t$, we include in the input vector the ad-impressions of brand $b$ at all the ad-positions as before; but we summarize the competitive ad-impressions by including only (a) the ad-impressions of all brands other than $b$ in the same product category, by ad-position; and (b) the ad-impressions of all brands in other categories, by ad-position. Thus, letting $P$ denote the number of ad-positions, the input in this model is a $3P$-dimensional vector, with the first $P$ entries corresponding to impressions of the focal brand, the next $P$ corresponding to all other brands in the same product category as the focal brand, and the last $P$ corresponding to all brands in other categories. Fourth, to increase the informativeness of users’ ad-exposure data, in some specifications we include information on whether the user clicked on ads in the response model. Finally, to address the issue of selection more directly, we use the pre-training dataset to develop a predicted baseline propensity of each user to buy a particular brand, using the user’s characteristics as features. We include the predicted baseline propensities in the unified model as controls. Due to business confidentiality reasons, we do not reveal the exact details of this implementation.
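A sketch of how the summarized competitive-exposure input described above could be assembled for one (user, day) observation; the names, mappings, and the 0-indexed position convention are assumptions based on the description, not the production code:

```python
import numpy as np

def unified_input(impressions: dict, focal_brand: str,
                  category_of: dict, P: int) -> np.ndarray:
    """impressions: maps (brand, position) -> impression count for the user-day.
    category_of: maps brand -> product category.
    Returns the 3P-vector [focal brand | same-category competitors | other
    categories], each block holding per-position impression counts."""
    x = np.zeros(3 * P)
    focal_cat = category_of[focal_brand]
    for (brand, pos), n in impressions.items():  # pos in {0, ..., P-1}
        if brand == focal_brand:
            x[pos] += n              # own-brand block
        elif category_of[brand] == focal_cat:
            x[P + pos] += n          # same-category competitors, summed over brands
        else:
            x[2 * P + pos] += n      # all brands in other categories
    return x
```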

6 Conclusions

A practical system for data-driven MTA for use by an ad-publishing platform is presented. The system combines a flexible response model trained on user-level data with Shapley Values over ad-types for credit attribution. A bi-directional RNN, customized to modeling user purchase behavior and ad-response, is developed as the response model; it has the advantage of being semi-parametric, of reflecting several salient aspects of ad-response, and of handling high dimensionality and long-term dependence. The use of the Shapley Value provides a way to allocate credit at a disaggregate level that respects the sequential nature of advertising response. The Shapley Value is based on fairness considerations, taking the advertising policies of the advertisers as given. It is possible that advertisers re-optimize their advertising policies in response to the allocations. The optimal allocation contract that endogenizes the equilibrium response of advertisers remains an open question (see, for instance, Abhishek et al. (2017); Berman (2018)).

References

  • Abhishek et al. (2017) Abhishek, V., S. Despotakis, and R. Ravi (2017): “Multi-Channel Attribution: The Blind Spot of Online Advertising,” working paper, Tepper School of Business.
  • Abhishek et al. (2015) Abhishek, V., P. Fader, and K. Hosanagar (2015): “Media Exposure through the Funnel: A Model of Multi-Stage Attribution,” working paper, Wharton School of Business.
  • Agarwal et al. (2009) Agarwal, N., S. Athey, and D. Yang (2009): “Skewed Bidding in Pay-per-Action Auctions for Online Advertising,” The American Economic Review, 99, 441–447.
  • Anderl et al. (2016) Anderl, E., I. Becker, F. von Wangenheim, and J. H. Schumann (2016): “Mapping The Customer Journey: Lessons Learned From Graph-Based Online Attribution Modeling,” International Journal of Research in Marketing, 33, 457–474.
  • Anderson and Simester (2013) Anderson, E. T. and D. Simester (2013): “Advertising in a Competitive Market: The Role of Product Standards, Customer Learning, and Switching Costs,” Journal of Marketing Research, 50, 489–504.
  • Archak et al. (2010) Archak, N., V. S. Mirrokni, and S. Muthukrishnan (2010): “Mining Advertiser-specific User Behavior Using Adfactors,” in Proceedings of the 19th International Conference on World Wide Web, New York, NY, USA: ACM, WWW ’10, 31–40.
  • Bagwell (2007) Bagwell, K. (2007): “The Economic Analysis of Advertising,” Elsevier, vol. 3 of Handbook of Industrial Organization, 1701–1844.
  • Barajas et al. (2016) Barajas, J., R. Akella, M. Holtan, and A. Flores (2016): “Experimental Designs and Estimation for Online Display Advertising Attribution in Marketplaces,” Marketing Science, 35, 465–483.
  • Bass et al. (2007) Bass, F. M., N. Bruce, S. Majumdar, and B. P. S. Murthi (2007): “Wearout Effects of Different Advertising Themes: A Dynamic Bayesian Model of the Advertising-Sales Relationship,” Marketing Science, 26, 179–195.
  • Benes (2018) Benes, R. (2018): “Who Is Using Multitouch Attribution?” https://www.emarketer.com/content/who-is-using-multitouch-attribution.
  • Berman (2018) Berman, R. (2018): “Beyond the Last Touch: Attribution in Online Advertising,” Marketing Science, forthcoming.
  • Buys et al. (2018) Buys, J., Y. Bisk, and Y. Choi (2018): “Bridging HMMs and RNNs through Architectural Transformations,” IRASL Workshop, NIPS, https://irasl.gitlab.io/.
  • Dalessandro et al. (2012) Dalessandro, B., C. Perlich, O. Stitelman, and F. Provost (2012): “Causally Motivated Attribution for Online Advertising,” in Proceedings of the Sixth International Workshop on Data Mining for Online Advertising and Internet Economy, New York, NY, USA: ACM, ADKDD ’12, 7:1–7:9.
  • de Haan et al. (2016) de Haan, E., T. Wiesel, and K. Pauwels (2016): “The Effectiveness Of Different Forms Of Online Advertising For Purchase Conversion In A Multiple-Channel Attribution Framework,” International Journal of Research in Marketing, 33, 491–507.
  • Dube et al. (2005) Dube, J.-P., G. Hitsch, and P. Manchanda (2005): “An Empirical Model of Advertising Dynamics,” Quantitative Marketing and Economics, 3, 107–144.
  • Franklin and Garzon (1990) Franklin, S. and M. Garzon (1990): Neural Computability, Ablex, Norwood, NJ, vol. 1, 128–144.
  • Graves (2012) Graves, A. (2012): Supervised Sequence Labelling with Recurrent Neural Networks, vol. 385 of Studies in Computational Intelligence, Springer-Verlag Berlin Heidelberg, 1 ed.
  • Graves et al. (2014) Graves, A., G. Wayne, and I. Danihelka (2014): “Neural Turing Machines,” CoRR, abs/1410.5401.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S. and J. Schmidhuber (1997): “Long Short-Term Memory,” Neural Computation, 9, 1735–1780.
  • Hu et al. (2016) Hu, Y. J., J. Shin, and Z. Tang (2016): “Incentive Problems in Performance-Based Online Advertising Pricing: Cost per Click vs. Cost per Action,” Management Science, 62, 2022–2038.
  • IAB (2018) IAB (2018): “IAB Attribution Hub,” https://www.iab.com/guidelines/iab-attribution-hub/.
  • Jordan et al. (2011) Jordan, P., M. Mahdian, S. Vassilvitskii, and E. Vee (2011): “The Multiple Attribution Problem in Pay-Per-Conversion Advertising,” in Algorithmic Game Theory, ed. by G. Persiano, Berlin, Heidelberg: Springer Berlin Heidelberg, 31–43.
  • Kingma and Ba (2014) Kingma, D. P. and J. Ba (2014): “Adam: A Method for Stochastic Optimization,” CoRR, abs/1412.6980.
  • Kireyev et al. (2016) Kireyev, P., K. Pauwels, and S. Gupta (2016): “Do Display Ads Influence Search? Attribution And Dynamics In Online Advertising,” International Journal of Research in Marketing, 33, 475–490.
  • Li and Kannan (2014) Li, H. A. and P. Kannan (2014): “Attributing Conversions in a Multichannel Online Marketing Environment: An Empirical Model and a Field Experiment,” Journal of Marketing Research, 51, 40–56.
  • Lipton (2015) Lipton, Z. C. (2015): “A Critical Review of Recurrent Neural Networks for Sequence Learning,” CoRR, abs/1506.00019.
  • Naik et al. (1998) Naik, P. A., M. K. Mantrala, and A. G. Sawyer (1998): “Planning Media Schedules in the Presence of Dynamic Advertising Quality,” Marketing Science, 17, 214–235.
  • Naik et al. (2005) Naik, P. A., K. Raman, and R. S. Winer (2005): “Planning Marketing-Mix Strategies in the Presence of Interaction Effects,” Marketing Science, 24, 25–34.
  • Nair et al. (2017) Nair, H. S., S. Misra, W. J. Hornbuckle, R. Mishra, and A. Acharya (2017): “Big Data and Marketing Analytics in Gaming: Combining Empirical Models and Field Experimentation,” Marketing Science, 36, 699–725.
  • Roth (1988) Roth, A. (1988): The Shapley Value: Essays in Honor of Lloyd S. Shapley, Cambridge University Press.
  • Sahni (2015) Sahni, N. (2015): “Effect of Temporal Spacing between Advertising Exposures: Evidence from Online Field Experiments,” Quantitative Marketing and Economics, 13, 203–247.
  • Shao and Li (2011) Shao, X. and L. Li (2011): “Data-driven Multi-touch Attribution Models,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM, KDD ’11, 258–264.
  • Shapley (1953) Shapley, L. S. (1953): A Value for N-Person Games, Princeton University Press, chap. Annals of Mathematical Studies, 307–317.
  • Siegelmann and Sontag (1995) Siegelmann, H. and E. Sontag (1995): “On the Computational Power of Neural Nets,” Journal of Computer and System Sciences, 50, 132–150.
  • Siegelmann and Sontag (1991) Siegelmann, H. T. and E. D. Sontag (1991): “Turing Computability With Neural Nets,” Applied Mathematics Letters, 4, 77–80.
  • Stratonovich (1960) Stratonovich, R. L. (1960): “Conditional Markov Processes,” Theory of Probability and its Applications, 5, 156–178.
  • Sun et al. (1991) Sun, G., H. Chen, and Y. Lee (1991): “Turing Equivalence Of Neural Networks With Second Order Connection Weights,” IJCNN-91-Seattle International Joint Conference on Neural Networks.
  • Varian (2016) Varian, H. R. (2016): “Causal Inference in Economics and Marketing,” Proceedings of the National Academy of Sciences, 113, 7310–7315.
  • Viterbi (1967) Viterbi, A. (1967): “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Transactions on Information Theory, 13, 260–269.
  • Wessels and Omlin (2000) Wessels, T. and C. W. Omlin (2000): “Refining Hidden Markov Models With Recurrent Neural Networks,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000, vol. 2, 271–276.
  • Wilbur and Zhu (2009) Wilbur, K. C. and Y. Zhu (2009): “Click Fraud,” Marketing Science, 28, 293–308.
  • Xu et al. (2014) Xu, L., J. A. Duan, and A. Whinston (2014): “Path to Purchase: A Mutually Exciting Point Process Model for Online Advertising and Conversion,” Management Science, 60, 1392–1412.
  • Yadagiri et al. (2015) Yadagiri, M. M., S. K. Saini, and R. Sinha (2015): “A Non-parametric Approach to the Multi-channel Attribution Problem,” in Web Information Systems Engineering - WISE 2015, Cham: Springer International Publishing, 338–352.
  • Young (1988) Young, H. P. (1988): Individual Contribution And Just Compensation, Cambridge University Press, 267–278, The Shapley Value: Essays in Honor of Lloyd S. Shapley.
  • Zantedeschi et al. (2017) Zantedeschi, D., E. M. Feit, and E. T. Bradlow (2017): “Measuring Multichannel Advertising Response,” Management Science, 63, 2706–2728.
  • Zhang et al. (2014) Zhang, Y., Y. Wei, and J. Ren (2014): “Multi-touch Attribution in Online Advertising with Survival Theory,” in 2014 IEEE International Conference on Data Mining, 687–696.