1 Introduction
As digital ads proliferate and the measurability of advertising increases, the issue of Multi Touch Attribution (MTA) has become one of paramount importance to advertisers and digital publishers. MTA pertains to the question of how much each of the marketing touchpoints a user was exposed to contributes to an observed action by the consumer. Understanding the contribution of the various marketing touchpoints is an input to good campaign design, to optimal budget allocation, and to understanding why one campaign worked and another did not. Incorrect attribution results in misallocation of resources, inefficient prioritization of touchpoints, and consequently lower returns on marketing investments. Consequently, having a good model of attribution is now recognized as critical for marketing planning, design and growth. According to eMarketer estimates, among US companies with at least 100 employees using more than one digital marketing channel, about 85% utilized some form of digital attribution model in 2018, emphasizing the importance of good solutions to the problem from the perspective of industry (Benes, 2018).

Because the various touchpoints can interact in complex ways to affect the final outcome, the problem of parsing the individual contributions and allocating credit is a complex one. Given this complexity, many firms and platforms use rule-based methods such as last-touch, first-touch, equally-weighted, or time-decayed attribution (IAB, 2018). Because these rules may not always reflect actuality, modern approaches propose data-driven attribution schemes that use rules derived from actual marketplace data to allocate credit. This paper proposes a data-driven MTA system for use by a publisher of digital ads.[1] We developed this system for JD.com, an eCommerce company that is also a publisher of digital ads in China. The advertising marketplace of JD features thousands of advertisers buying billions of impressions of ads of more than 200 types for over 300 million users, and is a data-rich environment. Hence, as a practical matter, we need a system that scales to handle high dimensionality and leverages the large quantities of user-level data available to the platform.

[1] While the model is presented from the perspective of a publisher (in our case, an eCommerce platform), one can also see this from the perspective of an advertiser who wishes to assign credit for the orders he obtains across the various ads he buys.
Our approach has two steps. The first step ("response modeling") fits a user-level model for purchase of a brand's product as a function of the user's exposure to ads. The second ("credit allocation") uses the fitted model to allocate the incremental part of the observed purchase due to advertising to the ads the user was exposed to over the previous $T$ days.
To implement step one, our goal is to develop a response model that captures the following aspects of ad-response:

Responsiveness of current purchases to a sequence of past advertising exposures: that is, we would like to develop a response model that allows for both current and past advertising to matter in driving current purchases. This is consistent with a large literature that has documented that advertising has long-lived effects, and emphasized how both the stock as well as the flow of advertising affects consumer response (e.g., Bagwell, 2007). In addition, various findings in this literature motivate the need to handle the effect of the history of past exposures flexibly. For instance, the effect of history can operate in complex ways, by changing not just the baseline level of purchase probability, but also the marginal effect of current ad-exposures (e.g., Bass et al., 2007; Naik et al., 1998).
Responsiveness of current purchases to the intensity of ad-exposure: that is, the model should allow the effect of a brand's advertising on a user to depend on the number of exposures, not just whether there was exposure, with the effect being possibly nonlinear (e.g., Dube et al., 2005). Therefore, we need to capture an intensive margin by which advertising can affect behavior, in addition to accommodating an extensive margin for the effects of ads.

Responsiveness of current purchases to the timing of ad-exposures: that is, the model should allow the effect of past exposures to differ based on the timing of those exposures. This is motivated by accumulating evidence that the effect of advertising exposure is long-lasting but decays in human memory (e.g., Sahni, 2015). Therefore, we expect advertising to have a lasting effect, with the effect highest at the time of exposure, ceteris paribus, and decaying over time. A realistic model should accommodate a role for time in incorporating the effect of an exposure on purchase, and allow this decay to occur in a flexible way that can be learned from the data.

Responsiveness of current purchases to competitive ad-exposures: that is, the model should accommodate a role for both own and competing brands' current and past ad-exposures to affect current purchases. Allowing cross-brand ad-effects to matter is important in a competitive marketplace with competing brands, and is also critical to correctly capturing the incremental contribution of a brand's own advertising efforts (e.g., Anderson and Simester, 2013).

Capturing heterogeneity across users: that is, the model should allow ad-response to differ by consumer type. One motivation is based on business considerations: advertisers use the output from the model to design their targeting strategies, and often desire estimates of attribution split by consumer segments. Another motivation is based on inference: a typical concern about measuring ad-response is user selection into exposure. Including a flexible accommodation for heterogeneity in the model mitigates the selection concern somewhat by "controlling for" observables that drive selection into exposure (e.g., Varian, 2016).
Given the scale of the data and the large number of ad-types, a fully nonparametric model that reflects these considerations is not feasible. Instead, our approach is to develop a flexible specification that fits the data and incorporates these aspects of ad-response. We train a Recurrent Neural Network (RNN) for this purpose. The RNN is trained on user-level conversion and exposure data, and is architected to capture the impact of advertising intensity, timing, competition, and user-heterogeneity outlined above. The model is set up as a binary classification problem, outputting a probability that a user buys a product associated with a brand in a given time-period. It takes as inputs in its lowermost layer the impressions served to a user of a focal brand's and its competitors' ads, split by ad-type, and separately for each of the $T$ time-periods prior to the date of attribution. This allows a flexible way of handling the intensity of ad-exposure and competition over the past $T$ periods in the attribution of an order in the current period. A separate fully connected layer in the model takes as input a set of user characteristics, which shifts the "intercept" of the logistic output layer, giving it a "fixed-effects" interpretation.

The specifics of the application to advertising-response motivate a bidirectional formulation of the RNN. We allow for a hidden layer with backward recurrence, augmented with a hidden layer with forward recurrence. This improves the fit of the model, because the fact that a user saw a particular set of ads in a period $t' > t$ is useful to predict his response in period $t$. For example, if a user bought a brand in $t$, he may not search for that brand in period $t+1$, and hence not be exposed to search ads in $t+1$. So the knowledge that he did not see search ads in $t+1$ is useful to predict whether he will buy in period $t$. More generally, this suggests that the sequence of future ad-impressions can help predict current purchases. Adding a layer with forward recurrence serves as a semiparametric summary of future activity that is helpful to predict current actions.
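To fix ideas, the architecture just described can be caricatured in a few lines of numpy. The sketch below is a toy illustration, not the production model: the layer sizes are arbitrary, the read-out concatenates the two recurrent summaries, and the user-characteristics layer is reduced to a linear shift of the logit, mimicking the "fixed-effects" interpretation in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def birnn_purchase_prob(impressions, user_feats, params):
    """Toy bidirectional RNN. `impressions` is a (T, K) array of per-period
    ad counts (own and competitor ad-types); `user_feats` shifts the logit
    intercept, mimicking the separate fully connected user layer."""
    Wf, Uf, Wb, Ub, w_out, w_user, b = params
    T, K = impressions.shape
    H = Wf.shape[0]
    h_fwd = np.zeros((T, H))
    h_bwd = np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                      # backward recurrence: past -> present
        h = np.tanh(Wf @ impressions[t] + Uf @ h)
        h_fwd[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):            # forward recurrence: future -> present
        h = np.tanh(Wb @ impressions[t] + Ub @ h)
        h_bwd[t] = h
    # Combine the two directional summaries with the user shift and squash.
    logit = w_out @ np.concatenate([h_fwd[-1], h_bwd[0]]) + w_user @ user_feats + b
    return sigmoid(logit)
```

Because the output layer is logistic, the model returns a purchase probability strictly between 0 and 1 for any input sequence.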
While RNNs are not new to this paper, it is worth emphasizing why this class of models is of value for the MTA problem. Compared to other frameworks, RNNs represent a more flexible way to handle the sequential dependence in the data. Sequential dependence is key to ad-response, because what we need to capture from the data is how exposure to past touchpoints cumulatively builds up to affect the final outcome. RNNs do this well by allowing for continuous, high-dimensional hidden states (compared to lower-dimensional, discrete ones in other models with hidden states), combined with a distributed representation of those states that allows them to store information about the past efficiently. This enables the RNN to handle long-term, higher-order and non-Markovian dependencies in a semiparametric manner (see Graves, 2012; Lipton, 2015 for overviews).[2] Well-known results in theoretical computer science also show that recurrent neural nets have attractive universal approximation properties: any function that can be computed by a digital computer is also in principle computable by a recurrent neural net architecture.[3] While earlier literature had suggested that nets can achieve universality if one allowed for an infinite number of neurons (e.g., Franklin and Garzon, 1990), or allowed for higher-order connections (where current states update their values as multiplications or products of past activations, e.g., Sun et al., 1991), results by Siegelmann and Sontag (1991, 1995) are even more favorable: recurrent neural nets can achieve universality using only a finite number of neurons, and using only first-order, non-multiplicative connections. In particular, any function computable by a Turing Machine can be computed by such a net. "Turing Machines" are mathematically simple computational devices that help formalize the notion of computability. Under the Church-Turing thesis in computer science, for every computable problem, there exists a Turing Machine that computes it; and conversely, every problem not computable by a Turing Machine is also not computable by finite means (see for instance https://plato.stanford.edu/entries/turing-machine/ for historical perspectives). Thus, relatively "simple" recurrent architectures can capture very complex functional dependencies in the data, especially if we allow for a large number of neurons. This makes them attractive to our situation. The main disadvantage of RNNs is that they require more data, and take longer to train. This problem is mitigated in implementations on modern tech platforms, which are data-rich and have access to large computational resources.

[2] Comparing Hidden Markov Models to RNNs, Lipton (2015) says, "Hidden Markov models (HMMs), which model an observed sequence as probabilistically dependent upon a sequence of unobserved states, were described in the 1950s and have been widely studied since the 1960s (Stratonovich, 1960). However, traditional Markov model approaches are limited because their states must be drawn from a modestly sized discrete state space $S$. The dynamic programming algorithm that is used to perform efficient inference with hidden Markov models scales in time $O(|S|^2)$ (Viterbi, 1967). Further, the transition table capturing the probability of moving between any two time-adjacent states is of size $|S|^2$. Thus, standard operations become infeasible with an HMM when the set of possible hidden states grows large. Further, each hidden state can depend only on the immediately previous state. While it is possible to extend a Markov model to account for a larger context window by creating a new state space equal to the cross product of the possible states at each time in the window, this procedure grows the state space exponentially with the size of the window, rendering Markov models computationally impractical for modeling long-range dependencies (Graves et al., 2014)." Increasingly, some researchers view HMMs as special cases of RNNs (Wessels and Omlin, 2000; Buys et al., 2018).

[3] By a "net" we mean an architecture in which neurons are allowed to synchronously update their states according to some combination of past activation values.
To implement step two, we focus on incrementality-based allocation for advertising. To understand the motivation for this, note that each ad generates an incremental increase in the overall probability of purchase, and the set of ad-exposures as a whole generates an incremental improvement in the propensity to purchase. Our approach is to allocate the incremental improvement due to the ads to each ad-type. This takes into account that even if the user had not seen the ads, the user may have some baseline propensity to buy anyway, due to tastes, prior experiences, or spillovers from competitive advertising. Logically, the part of observed orders that would have occurred anyway should not be allocated to the focal brand's advertising efforts. To allocate credit, we compute Shapley Values, which have the advantage of having axiomatic foundations and satisfying fairness considerations (Shapley, 1953; Roth, 1988). The specific formulation of the Shapley Value we implement respects incrementality by allocating the overall incremental improvement in conversion to the exposed ads, while handling the sequence-dependence of exposures on the observed outcomes.
Computing the Shapley Values is computationally intensive. We present a scalable algorithm (implemented in a distributed MapReduce framework) that is fast enough to allow computation in reasonable amounts of time, so as to make productization feasible. The algorithm takes predictions from the response model trained on the data as an input, and allocates credit over tuples of ad-exposures and time periods. Allocation at the tuple level has the advantage of handling the role of the sequence in an internally consistent way. Once allocation of credit at this level is complete, we sum across all time periods associated with an ad-type to develop an ex-post credit allocation to each ad-type. This explicit aggregation has the advantage that aggregation biases are reduced when using the model to allocate credit at a more aggregate level, such as over advertising channels (e.g., search and display).[4]

[4] Aggregating responses to the channel level reduces the complexity of the algorithm, and enables pooling of data, but masks the differential contribution of various touchpoints to final conversion, because implicitly, such a response model assumes that the effects of all touchpoints within the channel are similar. By training the response model and implementing the allocation at the ad-type and time-period level, and then exactly aggregating these allocations to the channel level, we mitigate such concerns to a large extent.
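The exact aggregation step (summing tuple-level credit to ad-types, and ad-type credit to channels) can be sketched as follows. The ad-type names, channel map, and credit numbers here are invented for illustration; the point is that aggregation is a pair of exact sums, so no credit is lost or double-counted on the way up.

```python
from collections import defaultdict

def aggregate_credit(tuple_credit, channel_of):
    """Sum Shapley credit over (ad_type, day) tuples to the ad-type level,
    then exactly aggregate ad-type credit to the channel level."""
    by_ad_type = defaultdict(float)
    for (ad_type, _day), credit in tuple_credit.items():
        by_ad_type[ad_type] += credit        # sum over days for each ad-type
    by_channel = defaultdict(float)
    for ad_type, credit in by_ad_type.items():
        by_channel[channel_of[ad_type]] += credit
    return dict(by_ad_type), dict(by_channel)
```

Because both steps are plain sums, total credit is conserved at every level of aggregation, which is the property the text relies on when reporting channel-level attributions.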
In combination, the RNN response-model and the Shapley Value credit system represent a coherent, theory- and data-driven attribution framework for the platform. We present details and an illustration of the framework using data from one product category (cellphones) at JD.com. This is a single-product-category version of the full framework. The full framework is in production at the firm; it accommodates all product categories on the site, and scales to handle the high dimensionality of the problem on the platform (attribution of the orders of about 300M users, for roughly 160K brands, across 200+ ad-types, served about 80B ad-impressions over a typical 15-day period).
The rest of the paper discusses the relevant literature, the details of the model, details of the cellphone data, and results. The last section concludes.
2 Relationship to the Literature
The problem of attribution of credit to underlying marketing activities is not new. The previous literature on the topic is divided into two streams: (a) empirical papers that develop statistical response models to measure the effect of marketing touchpoints on consumer purchase behavior and engagement; (b) papers that combine an empirically specified response model with an allocation scheme for allocating credit to the touchpoints. This paper is part of the second stream.
Early research on response models used market-level data and time-series based aggregate "marketing mix" models to assess the effect of marketing touchpoints in print, TV and internet channels on sales and engagement (e.g., Naik et al., 2005; de Haan et al., 2016; Kireyev et al., 2016). More recent work has leveraged access to user-level browsing and conversion data to develop individual-level models of responsiveness. Notable examples in this stream include Shao and Li (2011) (who use a bagged logistic regression model and a semiparametric model that allows up to second-order dependence in consumer behavior); Li and Kannan (2014) and Xu et al. (2014) (who use customized Markovian models of consumer channel choice and conversion); Zhang et al. (2014) (who use a hazard-based survival model that allows for time decay in ad-exposures); Abhishek et al. (2015) (who use an HMM of user exposure and conversion); and Anderl et al. (2016) (who model customer purchase and browsing behavior as a Markov graph with up to fourth-order dependence).

Broadly speaking, the recent response modeling literature has focused on developing generative models of consumer behavior that capture the dependence in the effects of touchpoints to the extent possible, while making simplifying assumptions to feasibly handle the high dimensionality of the measurement problem. Relative to this stream, this paper uses an RNN trained on user-level data as a response model. Compared to past frameworks, the model handles complex patterns of dependence in a more flexible way. The specific formulation of the RNN also allows the sequences of touchpoints to have differential effects on final conversion, which is novel. The setup also accommodates in one framework the roles of intensity, timing, competition, and user-heterogeneity, which have typically not been considered together in the previous literature.
Amongst papers in the second stream, Dalessandro et al. (2012) was the first to propose using the Shapley value as a credit allocation mechanism for the MTA problem. They call this "causally motivated" attribution because of the causal interpretation associated with the "marginality" property of the Shapley Value rule. Dalessandro et al. (2012) compute Shapley values for channels by fitting to the data logistic regression models similar to Shao and Li (2011). Yadagiri et al. (2015) present an important extension of this work, allowing the statistical model to be semiparametric, but restricting outcomes to depend on the composition, but not the order, of previous touchpoints. Anderl et al. (2016) leverage the Markov graph-based approach proposed by Archak et al. (2010) for credit allocation. This approach is computationally attractive, but lacks the fairness properties of the Shapley Value.[5] Like Shao and Li (2011), Dalessandro et al. (2012) and Yadagiri et al. (2015), we use the Shapley value for credit allocation. The specifics of our implementation differ from these papers on three aspects. First, we present a way to obtain the incremental contribution from a focal firm's advertising to observed orders, and to allocate that incremental contribution to the underlying ad-slots. Previous approaches have allocated total orders. Our view is that the incremental part is the more intuitive allocation, as it is the component that is due to advertising.

[5] The credit allocated to an ad-slot is the "removal effect," computed as the change in the probability of reaching the conversion state from the start state when the slot is removed from the Markov graph. The removal effect can be thought of as the marginal contribution of an ad-slot. The Shapley Value, in contrast, allocates credit based on a transformation of the marginal contributions.
Second, allowing the conversion to depend on the order of exposures in the response model requires us to develop a way to implement credit allocation to an ad-slot that depends on its order in the temporal sequence of exposures. We present an algorithm that computes Shapley Values over tuples of ad-slots and locations in the sequence to do so. This aspect, which arises because "order matters," is not an explicit consideration in previous approaches. Third, we implement attribution at a more disaggregated "ad-slot" level, compared to the more aggregate channel-level attribution of past approaches. This makes the problem considered here higher-dimensional than those considered previously.
This paper is also related to a game-theoretic literature that devises payment rules for multi-channel ads. Notable papers include Agarwal et al. (2009); Wilbur and Zhu (2009); Jordan et al. (2011); Hu et al. (2016); and Berman (2018), who propose efficient contracts when there are interactions across publishers, advertisers and publishers are strategic, and information is possibly asymmetric. The response model and attribution methods presented here can form an input to the creation of the payment contracts suggested in this theory.
A limitation of our approach, and indeed of all the response models cited previously, is the lack of exogenous variation in user exposure to advertising. This can contaminate the learning of marginal effects from the data, due to issues associated with non-random targeting and selection into ad-exposure. A typical solution to the problem (randomization of users into ad-exposures across all the ad-types, followed by training the model on the data generated by the randomization) is infeasible at the scale required for practical implementation, due to the cost and complexity of such randomization. Extant papers that have trained ad-response models on data with full or quasi-randomization (Sahni, 2015; Barajas et al., 2016; Nair et al., 2017; Zantedeschi et al., 2017) have done so at a smaller scale, over a limited number of users and ad-types. The approach to this problem adopted here is to include a large set of user features into the model, so that by including these, we convert a "selection on unobservables" problem into a "selection on observables" problem. Given that the feature set is large and accommodated flexibly, controlling for these observables may mitigate the selection issue to a great extent (e.g., Varian, 2016), albeit not perfectly.
3 Model Framework
3.1 Problem Statement: Defining Multi Touch Attribution in terms of Incrementality
Let $i$ denote users; $t$ denote time (days); and $b$ denote brands. Let $p$ index an "ad-position," i.e., a particular location on the publisher's inventory or at an external site at which the user can see advertisements linked to a given brand. For instance, a particular ad-slot showing a display ad on the top frame of the JD app homepage would be one ad-position, and a particular ad-slot showing a search-ad in response to a keyword search on the JD app would be another ad-position. Consider an order made by user $i$ for brand $b$ on day $t$. Let $S_{ibt}$ denote the set of ad-positions at which user $i$ was exposed to ads for brand $b$ over the $T$ days up to the order (from $t-T+1$ to $t$).
We formulate the multi touch attribution problem as developing a set of credit-allocations $\{\phi_p\}$ for all $p \in S_{ibt}$, so that the allocation for $p$ represents the contribution of brand $b$'s ads at position $p$ to the expected incremental benefit generated by brand $b$'s advertising on the observed order. Define $\Delta_{ibt}$ as the change in the probability of the order occurring due to the user's exposure to $b$'s ads in the positions in $S_{ibt}$. We look for a set of allocations $\{\phi_p\}_{p \in S_{ibt}}$ such that,

$\sum_{p \in S_{ibt}} \phi_p = \Delta_{ibt}$    (1)
3.2 Problem Solution: Response Model Trained on User Data + Shapley Values
We solve the problem in two steps. To allocate the orders on date $t$:

In step 1, we train a response model for purchase behavior using individual user-level data observed over the period prior to $t$.

In step 2, we take the model as given, and for each order observed on date $t$ we compute Shapley Values for the ad-positions $p \in S_{ibt}$. We set the allocations $\phi_p$ to these Shapley values, and aggregate across orders to obtain the overall allocation for brand $b$ on date $t$.
Figure 1 shows the architecture of the system.
3.2.1 Shapley Values
Motivation
The Shapley value has many advantages as a fair allocation scheme in situations of joint generation of outcomes. These advantages have been articulated in a long history of economic literature on cooperative games, starting with Shapley (1953). See Roth (1988) for a summary and historical perspectives. The key idea is that any fair allocation should avoid waste and allocate all of the total benefit to the constituent units that generated the benefit ("allocative efficiency"). This "adding-up" requirement is encapsulated in equation (1). Further, fairness suggests that two units that contribute the same to every possible configuration in which the constituent units can influence the final outcome should be allocated the same credit ("symmetry of credit"); and logically, that a unit which contributes nothing to any configuration should be allocated zero credit ("dummy unit").
In addition to these requirements, a fair allocation should also satisfy the "marginality principle," encapsulating the idea that credit should be proportional to contribution. Specifically, marginality requires that the share of total joint benefit allocated to any constituent unit should depend only on that unit's own contribution to the joint benefit. As Young (1988) points out, a sharing rule that does not satisfy the marginality principle is subject to serious distortions. If one unit's credit depends on another's contributions, the first unit can be viewed favorably (or unfavorably) on the basis of the performance of the second. This affects how these units are rewarded or punished, and distorts performance. The difficulty is that finding a sharing rule that simultaneously satisfies marginality and efficiency is nontrivial. When the marginality principle is imposed, the sum of the individual units' marginal contributions will not, in general, equal the total overall benefit. If there are increasing returns from joint production, the sum of marginal contributions will be too high; and if there are decreasing returns, it will be too low.[6]

[6] Paraphrasing Young (1988): "One seemingly innocuous remedy is to compute the marginal product of all factor inputs and then adjust them by some common proportion so that total output is fully distributed. This proportional-to-marginal-product principle is the basis of several classical allocation schemes [...] the proportional-to-marginal-product principle does not resolve the "adding up" problem in a satisfactory way. The reason is that the rule does not base the share of a factor solely on that factor's own contribution to output, but on all factors' contributions to output. For example, if one factor's marginal contribution to output increases while another's decreases, the share attributed to the first factor may actually decrease; that is, it may bear some of the decrease in productivity associated with the second factor."
The Shapley value aligns these requirements in an elegant way. Theorem 1 in Young (1988) shows the remarkable property that the Shapley value is the unique sharing rule that is efficient, symmetric and satisfies the marginality principle. This is the reason for the Shapley value’s appeal as a creditallocation scheme.
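A small worked example helps fix ideas before the formal treatment below. The function sketched here computes exact Shapley values for an arbitrary toy cooperative game by direct enumeration of coalitions (feasible only for small player sets); the two-player game in the usage note is invented for illustration and exhibits increasing returns, so each player's Shapley share exceeds his stand-alone contribution while the shares still sum exactly to the joint benefit (efficiency) and the symmetric players receive equal credit (symmetry).

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values for a characteristic function v, which maps
    frozensets of players to real numbers. Enumeration is exponential in
    len(players), so this is only for small illustrative games."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                S = frozenset(S)
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {p}) - v(S))
        phi[p] = total
    return phi
```

For the game v({a}) = v({b}) = 1, v({a, b}) = 3, each player receives 1.5: the extra unit generated only jointly is split evenly, and the two shares add up to the full joint benefit of 3.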
The next three subsections discuss how we compute the Shapley values in the ad-position attribution problem. First, we discuss how we define the expected incremental benefit generated by a set of ad-positions to a brand's order. Then, we discuss how we allocate that benefit to ad-position-day tuples, and aggregate these to generate credit allocations across ad-positions. Finally, we provide an illustrative example.
Defining the Expected Incremental Benefit Generated by a Set of Ad-Positions to an Observed Order
Let $Y_{ibt}$ denote a binary random variable for whether user $i$ purchases brand $b$ on day $t$. An order is a realization $Y_{ibt} = 1$ with associated own-brand ad-exposures at the positions $S_{ibt}$. The expected incremental benefit generated by brand $b$'s advertising on the order is,

$\Delta_{ibt} = \Pr(Y_{ibt} = 1 \mid S_{ibt}) - \Pr(Y_{ibt} = 1 \mid S_{ibt} = \emptyset)$    (2)

The first term in equation (2) represents the probability of the order occurring given $i$'s exposures to brand $b$'s ads at the ad-positions in $S_{ibt}$ over the preceding $T$ days. The second term represents the counterfactual probability of the order occurring if $i$ did not have any exposures to brand $b$'s ads at the ad-positions in $S_{ibt}$ over the preceding $T$ days (denoted as $S_{ibt} = \emptyset$). Holding everything else fixed, this difference represents the expected incremental contribution of the ad-positions in $S_{ibt}$ to the order. We can think of $\Delta_{ibt}$ as a causal effect of brand $b$'s advertising over the past $T$ days on user $i$'s propensity to place the observed order on day $t$.
Allocating Incremental Benefit to a Position-Day Tuple
To allocate $\Delta_{ibt}$ to the ad-positions in $S_{ibt}$, we first allocate $\Delta_{ibt}$ to each ad-position-day tuple $(p, d)$ at which $i$ saw ads of brand $b$ over the last $T$ days. We then sum the allocations across days for the tuples that each ad-position $p$ is associated with, to obtain the overall allocation of $\Delta_{ibt}$ to that $p$.

To do this, let $E_{ibt}$ be the set of ad-position-day combinations at which user $i$ saw ads for brand $b$ during the $T$ days preceding order $Y_{ibt} = 1$. Denote the cardinality of $E_{ibt}$ as $E$.[7] For a given tuple $(p, d) \in E_{ibt}$, let $\mathcal{S}$ denote a generic element from the powerset of $E_{ibt} \setminus \{(p, d)\}$, i.e., a subset of the ad-position-day combinations at which the user saw ads for brand $b$ during the $T$ days, excluding tuple $(p, d)$. Let the cardinality of $\mathcal{S}$ be denoted $s$.

[7] For example, if user $i$ saw ads for brand $b$ at ad-position 1 on days 1 and 2; at position 2 on day 2; and at position 3 on day 3, then $E_{ibt}$ would be $\{(1,1), (1,2), (2,2), (3,3)\}$ and $E = 4$.
Define the function $v(\cdot)$ over subsets $\mathcal{S} \subseteq E_{ibt}$ as,

$v(\mathcal{S}) = \Pr(Y_{ibt} = 1 \mid \mathcal{S}) - \Pr(Y_{ibt} = 1 \mid \emptyset)$    (3)

i.e., $v(\mathcal{S})$ represents the expected incremental benefit from user $i$ seeing ads for brand $b$ at the ad-position-day combinations in $\mathcal{S}$, holding everything else fixed. By construction, $v(E_{ibt}) = \Delta_{ibt}$. So, by allocating $v(E_{ibt})$ across the ad-position-day tuples in $E_{ibt}$, we allocate the same total incremental benefit generated by brand $b$'s advertising as we would by allocating $\Delta_{ibt}$ across the ad-positions in $S_{ibt}$. Also, by construction, $v(\emptyset) = 0$.
For each tuple $(p, d) \in E_{ibt}$, we need the allocations $\phi_{(p,d)}$ to satisfy two conditions. First, that $\sum_{(p,d) \in E_{ibt}} \phi_{(p,d)} = \Delta_{ibt}$, so that the allocations sum to the full incremental benefit of the ads on the order (i.e., satisfy allocative efficiency). Second, that the allocation for a given tuple is a function only of that tuple's marginal effects with respect to $v(\cdot)$ (i.e., satisfies the marginality principle). These conditions are satisfied by the Shapley values for the tuples, defined as,

$\phi_{(p,d)} = \sum_{\mathcal{S} \subseteq E_{ibt} \setminus \{(p,d)\}} \frac{s! \, (E - s - 1)!}{E!} \left[ v(\mathcal{S} \cup \{(p,d)\}) - v(\mathcal{S}) \right]$    (4)
Computing the Shapley values requires a way to estimate the marginal effects in equation (4) from the data, as well as an algorithm that scales to handle the high dimensionality of $E_{ibt}$. This is discussed in the subsequent section.
Once the Shapley values are computed, we sum them across all days $d$ to obtain the allocation of that order to ad-position $p$ as,

$\phi_p = \sum_{d \in D_p} \phi_{(p,d)}$    (5)

where $D_p$ is the set of days in $E_{ibt}$ that are associated with ad-position $p$.
The final step is to do this across all orders observed for brand $b$ on day $t$. To do this, we sum $\phi_p$ across all $p$ and all users $i$ who bought brand $b$ on day $t$. This gives the overall incremental contribution of the ad-positions to the brand's orders. To allocate this to each $p$, we simply compute how much ad-position $p$ contributed to this sum. To see this mathematically, denote $\phi_p$ in equation (5) as $\phi_p^i$ for short, indexing the user $i$. We sum $\phi_{p'}^i$ across all positions $p'$ and all $i$ that made an order (i.e., with $Y_{ibt} = 1$) to compute the term in the denominator in equation (6); and we sum $\phi_p^i$ for only ad-position $p$ across all $i$ that made an order to compute the term in the numerator in equation (6). Dividing the numerator by the denominator, we allocate to ad-position $p$ a proportion computed as,

$\pi_{pbt} = \dfrac{\sum_{i: Y_{ibt} = 1} \phi_p^i}{\sum_{i: Y_{ibt} = 1} \sum_{p'} \phi_{p'}^i}$    (6)

Each element in $\{\pi_{pbt}\}_p$ represents the contribution of ad-position $p$ to the total incremental orders obtained on day $t$ by the brand due to its advertising on the positions. $\{\pi_{pbt}\}_p$ thus represents a set of attributions that can be reported back to the advertiser.
Linking to a Response Model
Let $x_{ibpd}$ be the number of impressions of brand $b$'s ads seen by user $i$ at ad-position $p$ on day $d$. Collect all the impressions of the user for the brand's ads across positions on day $d$ in $x_{ibd}$; collect the impression vectors across all the brands for that user on day $d$ in $x_{id}$; and stack the entire vector of impressions across all days and brands in a vector $x_i$. Let $\rho_{bd}$ be a price-index for brand $b$ on day $d$, representing an average price for products of brand $b$ faced by users on day $d$.[8] Collect the price indices for all brands on day $d$ in a vector $\rho_d$, and stack these in a vector $\rho$. Finally, let $z_i$ represent a vector of user characteristics collected at baseline. The probability of purchase on day $t$ is modeled as a function of the user characteristics, and the ad-impressions and price-indices of brand $b$ and all other brands in the product category over the last $T$ days as,

$\Pr(Y_{ibt} = 1) = f(z_i, x_i, \rho; \hat{\theta})$    (7)

[8] We compute this as a share-weighted average of the list prices of the SKUs associated with the brand on that day.
The probability model is parametrized by vector which will be learned from the data.^{9}^{9}9The “hat” notation on emphasizes that the response parameters are learned in a firststage from the data.
We use equation (7), which provides an expression for , along with the definition of the marginal effects in equation (3), to compute the Shapley values defined in equation (4). To obtain the marginal effects from the response model, we define an operator on that takes a set as defined in 3.2.1 as an input.^{10}^{10}10Recall from 3.2.1 that we use to refer to a subset of adpositionday combinations at which the user saw ads for brand during the days, excluding tuple . Given , the operator sets to zero all the impressions of the brand apart from those in the adpositionday tuples in ; it leaves the impressions of all other brands unchanged.
Mathematically, taking and as input, outputs a transformed vector computed as,
(8) 
With as defined above, we can compute the Shapley value using the response model as,
(9) 
In effect, what we obtain in the square brackets in equation (9) is the change in the predicted probability of purchase of an order of brand on day by user when the tuple is added to the set of adpositionday combinations in , holding everything else (including competitor advertising) fixed at the values observed in the data for that order.
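For concreteness, in standard Shapley notation (the symbols \(\mathcal{A}\) for the set of the brand's adpositionday tuples and \(\hat{v}\) for the model-predicted value of a coalition are our own notational assumptions, not the paper's typesetting), equation (9) has the familiar form:

```latex
\phi_{(p,t)} \;=\; \sum_{S \subseteq \mathcal{A}\setminus\{(p,t)\}}
\frac{|S|!\,\bigl(|\mathcal{A}|-|S|-1\bigr)!}{|\mathcal{A}|!}
\Bigl[\, \hat{v}\bigl(S\cup\{(p,t)\}\bigr) - \hat{v}(S) \,\Bigr],
```

where \(\hat{v}(S)\) denotes the predicted purchase probability evaluated at the impression vector transformed as in equation (8).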
Illustrative Example
Suppose there are only two brands , three adpositions , and three days . Suppose user who made an order for brand on day saw 4 ads for brand at adposition on days ; 7 ads at position on day ; and 10 ads for brand 2 at adposition 3 on day . Then, , , , , , , so that = (0,0,0,0,0,0,4,7,0,0,0,10,4,0,0,0,0,0). Suppose we would like to evaluate the Shapley value of tuple of brand 1. For this order, is the set , i.e., the tuples at which the user saw ads for brand . The cardinality of the set . is the set with corresponding power set . In equation (9) for the Shapley values, is an element from this power set. To evaluate the terms in the square brackets, we need to evaluate the probability at transformed values of corresponding to and , for each . Consider one particular value of . The cardinality of , .

To transform given , we apply . Applying transforms to as follows: as per equation (8), for and tuples and , set and . All other elements are unchanged. Therefore, = (0,0,0,0,0,0,4,0,0,0,0,10,0,0,0,0,0,0). What the transformation has done is to set to 0 the impressions for brand 1 at adpositionday combinations and , which are the tuples that are not in at which saw ads for brand 1 during the days.

To transform given , we apply . Applying transforms to as follows: for and tuple , set . All other elements are unchanged. Therefore, = (0,0,0,0,0,0,4,0,0,0,0,10,4,0,0,0,0,0). What the transformation has done is to set to 0 the impressions for brand 1 at adpositionday combination , which is the only tuple that is not in at which saw ads for brand 1 during the days.
Evaluating at these values now generates the term in the square brackets in equation (9). Repeating for all possible , and summing per (9) gives the Shapley value for brand 1 for tuple .
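The example above can be replayed with a short exact computation. The masking operator below implements the zeroing of equation (8), and the additive toy probability function stands in for the trained response model; all names are illustrative:

```python
from itertools import combinations
from math import factorial

def transform(x, keep, brand_tuples):
    """Equation (8) operator: zero the focal brand's impressions at every
    ad-position/day tuple not in `keep`; leave everything else unchanged.

    `x` maps (brand, position, day) -> impression count; `brand_tuples` is
    the set of tuples at which the user saw the focal brand's ads.
    """
    out = dict(x)
    for key in brand_tuples - set(keep):
        out[key] = 0
    return out

def shapley_value(prob, x, focal, brand_tuples):
    """Exact Shapley value of `focal` = (brand, position, day), per eq. (9)."""
    others = sorted(brand_tuples - {focal})
    n = len(others) + 1
    total = 0.0
    for k in range(len(others) + 1):
        weight = factorial(k) * factorial(n - k - 1) / factorial(n)
        for subset in combinations(others, k):
            with_focal = prob(transform(x, set(subset) | {focal}, brand_tuples))
            without = prob(transform(x, set(subset), brand_tuples))
            total += weight * (with_focal - without)
    return total
```

With the example's impression vector and an additive toy model, the Shapley value of the focal tuple reduces to its own marginal contribution, which is a convenient sanity check.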
An Efficient Algorithm for Fast, LargeScale Computation
Exact computation of Shapley values as described above is computationally intensive. Shapley values have to be calculated separately for each order. The number of orders can number in the millions on a given day on an eCommerce platform like JD.com. Additionally, Shapley values have to be computed for each adpositionday tuple for each order. When is large, this latter step also becomes computationally intensive, requiring Monte Carlo simulation methods to approximate the calculation.
We seek an implementation that scales to accommodate a large number of brands and orders, and generates reports in a matter of hours, which is important for business purposes. Our implementation switches between exact and approximate solutions for the Shapley values depending on the cardinality of , and is implemented in a MapReduce framework so it runs in a parallel, distributed environment on a cluster. Algorithm 1 presents details.
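A minimal single-machine sketch of the switching logic, with a generic set-valued function in place of the response model; the threshold, sample size, and all names are illustrative assumptions (the production version runs in a MapReduce framework across workers):

```python
import random

def shapley_mixed(v, players, focal, exact_limit=12, n_samples=500, seed=0):
    """Shapley value of `focal` under set-value function `v`.

    Uses exact power-set enumeration when the number of players is small,
    and permutation-sampling Monte Carlo otherwise, mirroring the mixed
    strategy described in the text.
    """
    from itertools import combinations
    from math import factorial
    others = sorted(set(players) - {focal})
    n = len(others) + 1
    if n <= exact_limit:                      # exact branch
        total = 0.0
        for k in range(len(others) + 1):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                total += w * (v(set(S) | {focal}) - v(set(S)))
        return total
    rng = random.Random(seed)                 # Monte Carlo branch
    order = list(others) + [focal]
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(order)                    # random ordering of all players
        prefix = set(order[: order.index(focal)])
        total += v(prefix | {focal}) - v(prefix)
    return total / n_samples
```

For an additive value function both branches recover the player's own weight exactly, which makes the switch easy to unit-test.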
3.2.2 Response Model
The purpose of the response model is to provide a datadriven way to estimate the marginal effects in equation (4). The architecture of the RNN is presented in Figure (2). Though the model training is done simultaneously across all brands, the picture is drawn only for one brand. The input vector of adimpressions, , and the input vector of priceindexes, , are fed through an LSTM layer with recurrence. The user characteristics, , are processed through a separate, fullyconnected layer. The outputs from the LSTM cells and the fullyconnected layer jointly impact the predicted outcome . Combining this with the observed outcome, we obtain the loglikelihood, which forms the loss function for the model. The RNN finds a set of parameters or weights that maximizes the loglikelihood.
As noted before, we utilize a bidirectional formulation in which we allow for a hidden layer with backward recurrence, augmented with a hidden layer with forward recurrence. The layer with forward recurrence serves as a semiparametric summary of future activity that is helpful to predict current actions. This is shown in Figure (2) where the superscript “fw” indicates forward recurrence and “bw” indicates backward recurrence. The use of “future” adimpressions for predicting current behavior in the response model requires some elaboration when the model is used to compute the “causal” or marginal effects in the Shapley Values. Note that the causal effects the response model has to deliver for computing the Shapley values for a user, are always differences in the user’s predicted probabilities of purchase in a period , under different retrospective (pre) counterfactual adimpression sequences. This means the use of future adimpressions to predict behavior in the bidirectional model poses no conceptual difficulty in developing the causal effects required for computing Shapley Values.
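The two recurrences can be sketched as follows. For brevity this toy version uses scalar tanh cells rather than the paper's LSTM, and all weight names are our own; it only illustrates how a backward (past-summarizing) layer and a forward (future-summarizing) layer combine into a per-day purchase probability:

```python
import math

def bidirectional_probs(x, w_in, w_h, w_bw, w_fw, b):
    """Per-day purchase probabilities from a backward-recurrent layer
    ("bw", summarizing past days) and a forward-recurrent layer
    ("fw", summarizing future days). `x` is the per-day input signal,
    e.g. ad impressions; all weights are scalars for brevity.
    """
    T = len(x)
    h_bw = [0.0] * T
    h = 0.0
    for t in range(T):                    # left-to-right pass: "bw" layer sees the past
        h = math.tanh(w_in * x[t] + w_h * h)
        h_bw[t] = h
    h_fw = [0.0] * T
    h = 0.0
    for t in reversed(range(T)):          # right-to-left pass: "fw" layer sees the future
        h = math.tanh(w_in * x[t] + w_h * h)
        h_fw[t] = h
    # output layer: combine both directions through a sigmoid
    return [1.0 / (1.0 + math.exp(-(w_bw * h_bw[t] + w_fw * h_fw[t] + b)))
            for t in range(T)]
```

With zero input the prediction sits at the sigmoid baseline, and an exposure early in the window raises the predicted probability on both earlier days (via the forward layer) and later days (via the backward layer).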
The model is implemented in TensorFlow. We use the nonpeephole based implementation of the LSTM (Hochreiter and Schmidhuber, 1997); regularized via dropout with probability
; and optimized via stochastic gradient descent using the
Adam Optimizer (Kingma and Ba, 2014). We initialize the forward and backward LSTM cells to 0; the LSTM weights orthogonally with multiplicative factor = 1.0; and the other parameters using truncated normal draws. The model training is stopped when the error in a validation dataset stabilizes.
4 Experiments and Application to Cellphone Product Category
We present an application of the model using individuallevel data on adexposures and purchases from the cellphone product category on JD.com during a 15day window in 2017. We first present some modelfree evidence documenting the quantitative relevance in our data of some of the considerations outlined in the introduction for a good response model. Then we present model performance metrics and results.
Data and Summary Statistics
To create the training data, we sample users who saw, during the 15day window, at least one adimpression related to a product in the cellphone product category sold on JD.com. Within this overall sample, we define the positive sample as the set of users who purchased a product of any brand in the cellphone category during the time window. We define the negative sample as the set of users in the overall sample who did not purchase any product in the cellphone category during the 15day time window. Table 1 provides summary statistics of the training dataset. There are roughly 75M users, 3.4M orders, and 7B adimpressions. There are 301 adpositions. We aggregate brands to 31. Table (2) shows market shares on the basis of units sold and revenue generated. Huawei is the largest brand, followed by Xiaomi, Apple (2nd largest in terms of revenue), Meizhu, Vivo and others.
Number of users in Overall Sample  75,768,508 
Number of users in Positive Sample  2,100,687 
Number of users in Negative Sample  73,667,821 
Number of adimpressions in product category over 15 days  7,153,997,856 
Number of orders made in product category over 15 days  3,477,621 
Number of orders made on day  175,937 
Number of brands ()  31 
Number of adpositions ()  301 

Notes:Descriptive statistics of training dataset, which comprises individuallevel data on adexposures and purchases from the cellphone category on JD.com during a 15day window in 2017. The positive sample is the set of users who purchased a product of any brand in the cellphone category during the time window. The negative sample is the set of users in the overall sample who did not purchase any product in the cellphone category during the 15day time window.
Huawei  Xiaomi  Apple  Meizhu  Vivo  Others  

By Units Sold  29.5%  25.3%  8.4%  6.2%  3.2%  27.4% 
By RMB Sold  27.2%  18.4%  24.7%  4.2%  5.3%  20.2% 

Notes: Brandlevel market shares in the training data based on units sold and by money spent (RMB) in the cellphone category on JD.com during a 15day window in 2017.
Brand  Positive Sample  Negative Sample  

Mean  SD  Mean  SD  
Huawei  201.6  328.6  34.7  144.9 
Xiaomi  220.3  371.6  26.3  111.7 
Apple  147.2  253.1  16.5  63.3 
Meizhu  173.4  309.9  14.3  65.8 
Vivo  82.4  147.9  15.2  46.6 
Others  97.6  197.7  8.1  43.2 
Motivating Patterns in Data
Table (3) shows summary statistics of adexposures split by brand separately for the positive sample and the negative sample. Adexposures are seen to be much higher in the positive sample. For example, mean exposures in the positive sample (over the 15day window) are 737% higher for Xiaomi (220.3 vs. 26.3) and 792% higher for Apple (147.2 vs. 16.5). While some of this can be driven by brands targeting users more likely to buy their products, the large differences between the positive and negative samples by brand suggest that the sequence of adexposures over the 15day window matters in explaining purchases on day15.
Figure (4) shows plots of the probability of purchase of a brand on day as a function of the number of impressions of ads for Apple (left panel) or Xiaomi (right panel) seen by the user in the past 15 days. There is evidence of a robust positive association, suggesting that the intensity of ownadvertising exposure matters for conversion. Figure (4) shows the probability of purchase of a brand on day15 as a function of the days since the user saw impressions of ads for that brand. Plots are presented separately for Apple (left panel) and Xiaomi (right panel). To represent exposure timing, we represent on the horizontal axis the average days elapsed since a user saw adimpressions of that brand, obtained by weighting the day of the adimpression by the number of impressions over the 15 days. Specifically, for each user , the value for brand is computed as , where is the number of impressions of ads of brand seen by user on day , . The response curve is Ushaped, with more recent exposures associated with higher purchase probability of the brand on day 15. This suggests decay in the adresponse. The plot also shows the effect decays close to zero in 15 days, providing some databased justification for using this cutoff.
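The exposure-timing value plotted on the horizontal axis can be computed as below; the dictionary-based interface and names are our own illustrative assumptions:

```python
def avg_days_elapsed(impressions_by_day, current_day=15):
    """Impression-weighted average of days elapsed since ad exposure.

    `impressions_by_day` maps day t (1..current_day) to the number of the
    brand's ad impressions the user saw on day t. Returns None when the
    user saw no impressions of the brand.
    """
    total = sum(impressions_by_day.values())
    if total == 0:
        return None
    return sum((current_day - t) * n for t, n in impressions_by_day.items()) / total
```

A user who saw 2 impressions on day 14 and 2 on day 10 has an average elapsed time of 3 days, i.e. recent exposure dominates the weighting.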
Figure (4) shows a plot of the probability of purchase of a brand on day15 as a function of the number of impressions of ads for Apple (left panel) seen by the user over the 15 day window. The probability of purchase of Apple is plotted separately by a median split of the number of impressions seen of Xiaomi ads. The blue dots depict the probabilities for those who saw more than the median impressions of Xiaomi ads over the 15day window, and the red dots depict the probabilities for those who saw less than the median impressions of Xiaomi ads over the 15day window. There is evidence for separation: the association of purchase probabilities with Apple adimpressions is steeper for those who saw less than the median impressions of Xiaomi ads over the 15day window. The right panel depicts the analogous plot for Xiaomi. The pattern is similar, suggesting the importance of allowing for competitor ads to matter in affecting purchases.
Finally, in Figure (3), we assess the importance of including user characteristics in the model as a control for user heterogeneity and selection into adexposure. We do this informally by comparing the marginal effects for a search adposition for linear models with and without “fixedeffects.” Compared to the model without, the fixedeffects model allows for flexible user heterogeneity in the intercept. We choose a search adposition because it receives some of the highest adimpressions in the data, and also because the issue of selection is likely to be severe in the case of search ads (i.e., those who like a brand are more likely to search for it and see the search ads, while being more likely to buy the product without the exposure). If we find that the predicted marginal effects are “more reasonable” under the fixedeffects model compared to the base model, that provides some evidence for the value of including user heterogeneity. Past “A/B” testing at JD.com has shown that the search adposition produces positive marginal lift across many historical campaigns. So, we use the extent to which the predicted results are positive as a metric of reasonableness.
To do this, we let be an indicator of whether user bought a product of a given brand on day 15 (to economize on notation, the index for brand is suppressed). We first train two models for : (1) a linear model , where is the number of impressions seen by user of that brand at adposition over the 15day window; and, (2) a linear fixedeffects model , which is the same as the previous model, except the intercept is userspecific. Once the two models are trained, we store them in memory. The predictions from the two models are denoted and respectively. Our goal is to compare predictions between the two models for a focal search adposition, denoted by . The marginal value of seeing ads at position depends on the sequence of ads preceding it, so in order to do the comparison between models, we need to pick a set of sequences on which to base the comparison. We would like to pick the sequences that occur most in the data and for which we have enough data so that we can assess the comparison with a reasonable amount of precision. With these considerations, we filter to the subset of users in the data who saw adimpressions at at least 10 adpositions over the 15day window (10 is the median across users in the data). For each in this subset, we let denote the sequence of adpositions that saw over the 15day window (so, by construction, ). Define , a 9element permutation from the set of adpositions excluding , i.e. a 9element permutation from the set . We then estimate the marginal value of seeing ads at position when it occurs as the 10th in the sequence, given that the sequence of the first 9 adpositions seen is , as,
The first term is obtained by averaging the predicted using the observed adimpressions for all users that have the sequence , and the second term is obtained by averaging the predicted using the observed impressions for all users that have the sequence . For each 9position sequence , this provides an estimate, using the linear model, of the incremental benefit of seeing ads at position next rather than at position next. We do the same thing using the fixedeffects model to compute analogously for each . Then, we plot a histogram of the distribution of and across for adposition . This is shown in Figure (3).
Looking at Figure (3), we see that both models put significant probability mass on the positive support, but fewer of the marginal effects are negative under the fixedeffects model. We take this as supportive evidence that including controls for selection is important to generate reasonable measures of adeffectiveness, apart from allowing effects to be estimated separately by user segment.
[ph]
[Apple] [Xiaomi]

Notes: The Figure shows a plot of the probability of purchase of a brand on day as a function of the number of impressions of ads for Apple (left panel) or Xiaomi (right panel) seen by the user in the past 15 days. To construct the plot, define the negative sample as the set of users in the overall sample who did not purchase any product in the cellphone category during the 15day time window. At each value of the axis (number of ownbrand adimpressions), the probability of purchase of the brand on the axis is computed as the number of users who bought the brand’s products on day15, divided by the total number of users in the negative sample. The axis is capped at 2,000 impressions to account for bots, crawlers, nonindividual buyers etc.
[ph]
[Apple] [Xiaomi]
Notes: The Figure shows a plot of the probability of purchase of a brand on day15 as a function of the days since the user saw impressions of ads for that brand. Plots are presented separately for Apple (left panel) and Xiaomi (right panel). To construct the plot, define the negative sample as the set of users in the overall sample who did not purchase any product in the cellphone category during the 15day time window. At each value of the axis, the probability of purchase of the brand on the axis is computed as the number of users who bought the brand’s products on day15, divided by the total number of users in the negative sample. The axis represents the average days elapsed since the user saw adimpressions of that brand, obtained by weighting the day of the adimpression by the number of impressions over the 15 days. Specifically, for each user , the value for brand is computed as , where is the number of impressions of ads of brand seen by user on day , .
[ph]
[Apple] [Xiaomi]
Notes: The Figure shows a plot of the probability of purchase of a brand on day15 as a function of the number of impressions of ads for Apple (left panel) seen by the user over the 15day window. To construct the plot, define the negative sample as the set of users in the overall sample who did not purchase any product in the cellphone category during the 15day time window. At each value of the axis (number of ownbrand adimpressions), the probability of purchase of the brand on the axis is computed as the number of users who bought the brand’s products on day15, divided by the total number of users in the negative sample. The axis is capped at 2,000 impressions to account for bots, crawlers, nonindividual buyers etc. The probability of purchase of Apple is plotted separately by a median split of the number of impressions seen of Xiaomi ads. The blue dots depict the probabilities for those who saw more than the median impressions of Xiaomi ads over the 15day window, and the red dots depict the probabilities for those who saw less than the median impressions of Xiaomi ads over the 15day window. There is evidence for separation: the association of purchase probabilities with Apple adimpressions is steeper for those who saw less than the median impressions of Xiaomi ads over the 15day window. The right panel depicts the analogous plot for Xiaomi.
Model Performance
Table 4 shows the accuracy, recall and precision of the model. At the brand level, the model has precision ranging from 69% to 76% and recall ranging from 12% to 34% for the top brands, showing it fits the data well.
Brand  Metric  

Accuracy  Recall  Precision  
Huawei  0.997  0.340  0.714 
Xiaomi  0.997  0.308  0.705 
Apple  0.998  0.218  0.696 
Meizhu  0.999  0.186  0.727 
Vivo  0.999  0.119  0.762 
Others  1.000  0.095  0.726 
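For reference, the three reported metrics follow the usual confusion-matrix definitions; this generic sketch is not the authors' evaluation code:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, and precision from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # share of buyers found
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # share of flagged users who bought
    }
```

Near-perfect accuracy alongside modest recall is expected here because purchases are rare (about 2.1M positive vs. 73.7M negative users), which is why recall and precision are reported alongside accuracy.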
Figure 4 compares the accuracy, precision, recall and AUC (area under the curve) statistics of the model against two benchmark specifications.^{11}^{11}11The statistics are computed on a validation dataset that is heldout separately from the training dataset. The first is a unidirectional LSTM RNN, which is exactly the same as the preferred model but without the forward recurrence. The second is a flexible logistic model, which specifies the probability of a user purchasing brand on day as a semiparametric logistic function of the adimpressions and priceindexes on the same day (i.e.,
). Looking at the results, we see the bidirectional RNN has the highest AUC amongst the models; and has accuracy, precision and recall statistics that are comparable or higher. The poor performance of the logistic model in particular emphasizes the importance of accounting for dependence over time to fit the data. The plots also show the speed of convergence of the models as a function of the number of training steps; the bidirectional RNN converges faster in fewer training steps. This is helpful in production, which typically requires frequent model updating.
^{12}^{12}12To get a sense for this, the training times for 30,000 steps for the 3 models on our cluster are 11.21 hrs (bidirectional RNN); 9.68 hrs (unidirectional RNN); and 12.48 hrs (logistic), respectively.
Table 5 also benchmarks the algorithm presented in Algorithm 1 for distributed computation of Shapley values. Recall this algorithm shifts from exact computation of Shapley values to a Monte Carlo simulation approximation when the number of adpositions over which to allocate credit is large. This “mixed” method improves computational speed, which is important for highfrequency reporting of results in deployment. To assess the performance of this algorithm, we pick 6,000 orders and run the algorithm on these data for various configurations. The experiment is repeated for each configuration 5 times, and the average across the 5 reps is reported.^{13}^{13}13The computational environment uses a Spark cluster with Spark 2.3 by pyspark, running TensorFlow v1.6, with an 8 core CPU, 100 workers and 8 GB memory per worker, without a GPU. The first row in the table reports the number of orders we are able to attribute per minute using the three methods: the mixed method is about 2,300% faster than exact computation, and about 14% faster than a simulationonly method. The second row documents that this efficiency gain does not come at the cost of high error: the average error in the mixed method is low relative to exact computation, and an order of magnitude smaller than under full simulation.^{14}^{14}14We compute error as a mean squared difference over the orders in the total attributed value (line in algorithm 1) for the evaluated algorithm relative to that from the exact algorithm, i.e. .
Algorithm  Exact  Approximate  Mixed 

Orders processed per minute  4.2  88.85  101.24 
Error (relative to exact)  0  0.3190  0.0064 
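The error metric of footnote 14 is a mean squared difference, over orders, of the total attributed value under an approximate algorithm relative to exact computation; a sketch with assumed names:

```python
def attribution_error(attributed, attributed_exact):
    """Mean squared difference, over orders, of the total attributed value
    for the evaluated algorithm vs. exact Shapley computation.

    `attributed` and `attributed_exact` are parallel lists, one total
    attributed value per order.
    """
    assert len(attributed) == len(attributed_exact)
    return sum((a - e) ** 2 for a, e in zip(attributed, attributed_exact)) / len(attributed)
```

By this definition the exact algorithm's error against itself is zero, which is why only the approximate and mixed columns carry nonzero entries.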
Model Results
To explore the results from the model, we first discuss Figure 7, which emphasizes the importance of accounting for the incrementality of advertising in the allocation of orders. The figure assesses how much of the probability of an observed order occurring is driven by advertising. For each observed order in the cellphone category on day , we compute the predicted probability of that order occurring with and without advertising exposure, and take the difference as the incremental probability associated with advertising. We then compute the ratio of incremental to the total predicted probability with advertising. The figure presents a histogram of this ratio across all orders. There is a substantial spread from 0 to 1, with some orders that seem driven primarily by advertising, and some that would have occurred anyway irrespective of the brand’s advertising. It is this counterfactual outcome of what the order would have been in the absence of advertising that the incrementalitybased allocation seeks to reflect.
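The ratio underlying the histogram can be sketched as follows (names assumed):

```python
def incrementality_ratio(prob_with_ads, prob_without_ads):
    """Share of an order's predicted purchase probability attributable to
    advertising: (p_with - p_without) / p_with. Lies in [0, 1] whenever
    advertising weakly raises the purchase probability.
    """
    if prob_with_ads <= 0:
        return 0.0
    return (prob_with_ads - prob_without_ads) / prob_with_ads
```

A ratio near 1 marks an order driven primarily by advertising; a ratio near 0 marks an order that would likely have occurred anyway.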
Figure 5 shows the empirical CDF of the Shapley values computed on the basis of the RNN model for all the orders on day . The median Shapley value is about 0.25, suggesting the median adposition to which exposure occurs contributes about 25% of the incremental benefit from advertising. About 70% of the Shapley values are below 0.5, while about 15% are above 0.8, suggesting that the latter adpositions contribute more than 80% of the incremental benefit generated from advertising. This shows the Shapley values have discriminatory power: they help identify the top adpositions that contribute most to observed outcomes.
Finally, Figure 6 compares the credit allocation based on Shapley Values to rulebased “Lastclicked” attribution. To do this, Figure 6 shows the Shapley Value from the RNN model at each adposition indexed on the axis, averaged across all orders on day for which that position was the last clicked. This allows benchmarking the Shapley Value based attribution against “lastclick” attribution, which allocates 100% of the credit for the order to the lastclicked adposition. The Shapley values are all seen to be <1, showing that under the model, the lastclicked adpositions do not obtain full credit. To the extent that the Shapley values are all less than 0.6, the RNN model suggests that lastclicked ads contribute up to a maximum of 60% to the incremental conversion generated by advertising. Further, cart and payment page positions (which correspond to ads shown on these positions), which may get a lot of credit under “lastclick” or “lastvisit” attribution schemes on eCommerce sites, are seen to not be allocated a lot of credit by the model.
As a final assessment, we compare Figure 7, which shows the proportion of adimpressions in the training data across the adpositions, to Figure 7, which shows the average (across orders) of the Shapley Values for the same adpositions. The adpositions are indexed in order of their share of total impressions, so impressions of adposition are higher than impressions of adposition , and so on in both figures. Comparing the two figures, we can see that the distribution of Shapley Values across positions does not follow the same pattern as that of impressions, suggesting that the effect picked up by the model is not purely driven by the intensity of advertising expenditures by advertisers (which drive impressions). Further, we observe that some positions that receive fewer adimpressions have higher Shapley Values than those that receive more impressions. This suggests that advertising expenditure allocations overall may not be optimal from the advertiser’s perspective, and could be improved by better incorporation of attribution using a model such as the one presented here. A more formal assessment of this issue, however, requires a method for advertiser budget allocation across adpositions, which is outside the scope of this paper.
[Impression Share by AdPosition]
[Shapley Values by AdPosition]

Notes: The left panel of the figure shows the proportion of adimpressions in the training data across the adpositions. The adpositions are indexed in order of their share of total impressions. The right panel of the figure shows the average (across orders) of the Shapley Values for the same adpositions. Comparing the two figures, we can see that the Shapley Values do not simply map out the intensity of impressions.
5 Implementation and Extension to Larger Scale
For actual implementation, the model has to be scaled to all brands and all categories on JD.com. From a model training and updating perspective, it is intractable to maintain a separate model for each product category (more than 175). For production, we extend the model presented above to accommodate all product categories in one unified framework. This model has larger scale, so we impose some parameter restrictions to reduce the dimensionality of the problem. First, we allow categories to have separate parameters, but restrict the weights for all brands within a product category to be similar. Second, on the basis of pretraining data, we create a set of features that characterize each brand (e.g., brand rank within JD.com), and include them as covariates that shift the intercept of the output layer. This allows for heterogeneity across brands within each product category. Third, for each user , brand , and day , we include in the input vector the adimpressions of brand at all the adpositions as before; but summarize the competitive adimpressions by including only (a) the adimpressions of all brands other than in the same product category across the positions; and (b) the adimpressions of all brands in other categories across the positions. Thus, in this model is a dimensional vector with the first entries corresponding to impressions of the focal brand; the next corresponding to all other brands in the same product category as the focal brand; and the last corresponding to all brands in other categories. Fourth, to increase the informativeness of users’ adimpressions, in some specifications we also include information on whether the user clicked on the ads in the response model. Finally, to address the issue of selection more directly, we use the pretraining dataset to develop a predicted baseline propensity of each user to buy a particular brand , using as features the user’s characteristics .
We include the predicted baseline propensities into the unified model as controls. Due to business confidentiality reasons, we do not reveal the exact details of this implementation.
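The third restriction, summarizing competitors within and outside the focal brand's category position by position, can be sketched as follows; the dictionary interface and all names are our own illustrative assumptions:

```python
def unified_input_vector(impressions, focal_brand, category_of, n_positions):
    """Build the unified model's daily input vector for one user-day:
    the focal brand's impressions per position, then same-category
    competitors' impressions per position, then other-category brands'
    impressions per position (a 3 * n_positions dimensional vector).

    `impressions` maps (brand, position) -> impression count;
    `category_of` maps brand -> product category.
    """
    own = [0] * n_positions
    same_cat = [0] * n_positions
    other_cat = [0] * n_positions
    focal_cat = category_of[focal_brand]
    for (brand, pos), n in impressions.items():
        if brand == focal_brand:
            own[pos] += n
        elif category_of[brand] == focal_cat:
            same_cat[pos] += n          # competitors in the focal category
        else:
            other_cat[pos] += n         # brands in all other categories
    return own + same_cat + other_cat
```

This collapses the full brand-by-position impression matrix into three position-level blocks, which is what keeps the unified model's input dimension manageable across all categories.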
6 Conclusions
A practical system for datadriven MTA for use by an ad publishing platform is presented. The system combines a flexible response model trained on userlevel data with Shapley values for adtypes for attribution. A bidirectional RNN customized to modeling user purchase behavior and adresponse is developed as the response model; it has the advantage of being semiparametric, reflective of several salient aspects of adresponse, and able to handle high dimensionality and longterm dependence. The use of the Shapley value provides a way to allocate credit at a disaggregate level in a way that respects the sequential nature of advertising response. The Shapley value is based on fairness considerations, taking the advertising policies of the advertisers as given. It is possible that advertisers reoptimize their advertising policies in response to the allocations. The optimal allocation contract that endogenizes the equilibrium response of advertisers remains an open question (see, for instance, Abhishek et al. (2017); Berman (2018)).
References
 Abhishek et al. (2017) Abhishek, V., S. Despotakis, and R. Ravi (2017): “MultiChannel Attribution: The Blind Spot of Online Advertising,” working paper, Tepper School of Business.
 Abhishek et al. (2015) Abhishek, V., P. Fader, and K. Hosanagar (2015): “Media Exposure through the Funnel: A Model of MultiStage Attribution,” working paper, Wharton School of Business.

 Agarwal et al. (2009) Agarwal, N., S. Athey, and D. Yang (2009): “Skewed Bidding in PayperAction Auctions for Online Advertising,” The American Economic Review, 99, 441–447.
 Anderl et al. (2016) Anderl, E., I. Becker, F. von Wangenheim, and J. H. Schumann (2016): “Mapping The Customer Journey: Lessons Learned From GraphBased Online Attribution Modeling,” International Journal of Research in Marketing, 33, 457 – 474.
 Anderson and Simester (2013) Anderson, E. T. and D. Simester (2013): “Advertising in a Competitive Market: The Role of Product Standards, Customer Learning, and Switching Costs,” Journal of Marketing Research, 50, 489–504.
 Archak et al. (2010) Archak, N., V. S. Mirrokni, and S. Muthukrishnan (2010): “Mining Advertiserspecific User Behavior Using Adfactors,” in Proceedings of the 19th International Conference on World Wide Web, New York, NY, USA: ACM, WWW ’10, 31–40.
 Bagwell (2007) Bagwell, K. (2007): “The Economic Analysis of Advertising,” Elsevier, vol. 3 of Handbook of Industrial Organization, 1701 – 1844.
 Barajas et al. (2016) Barajas, J., R. Akella, M. Holtan, and A. Flores (2016): “Experimental Designs and Estimation for Online Display Advertising Attribution in Marketplaces,” Marketing Science, 35, 465–483.
 Bass et al. (2007) Bass, F. M., N. Bruce, S. Majumdar, and B. P. S. Murthi (2007): “Wearout Effects of Different Advertising Themes: A Dynamic Bayesian Model of the AdvertisingSales Relationship,” Marketing Science, 26, 179–195.
 Benes (2018) Benes, R. (2018): “Who Is Using Multi-touch Attribution?” https://www.emarketer.com/content/who-is-using-multi-touch-attribution.
 Berman (2018) Berman, R. (2018): “Beyond the Last Touch: Attribution in Online Advertising,” Marketing Science, forthcoming.
 Buys et al. (2018) Buys, J., Y. Bisk, and Y. Choi (2018): “Bridging HMMs and RNNs through Architectural Transformations,” IRASL Workshop, NIPS, https://irasl.gitlab.io/.
 Dalessandro et al. (2012) Dalessandro, B., C. Perlich, O. Stitelman, and F. Provost (2012): “Causally Motivated Attribution for Online Advertising,” in Proceedings of the Sixth International Workshop on Data Mining for Online Advertising and Internet Economy, New York, NY, USA: ACM, ADKDD ’12, 7:1–7:9.
 de Haan et al. (2016) de Haan, E., T. Wiesel, and K. Pauwels (2016): “The Effectiveness Of Different Forms Of Online Advertising For Purchase Conversion In A Multiple-Channel Attribution Framework,” International Journal of Research in Marketing, 33, 491–507.
 Dubé et al. (2005) Dubé, J.-P., G. Hitsch, and P. Manchanda (2005): “An Empirical Model of Advertising Dynamics,” Quantitative Marketing and Economics, 3, 107–144.
 Franklin and Garzon (1990) Franklin, S. and M. Garzon (1990): Neural Computability, Ablex, Norwood, NJ, vol. 1, 128–144.
 Graves (2012) Graves, A. (2012): Supervised Sequence Labelling with Recurrent Neural Networks, vol. 385 of Studies in Computational Intelligence, Springer-Verlag Berlin Heidelberg, 1 ed.
 Graves et al. (2014) Graves, A., G. Wayne, and I. Danihelka (2014): “Neural Turing Machines,” CoRR, abs/1410.5401.
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and J. Schmidhuber (1997): “Long Short-Term Memory,” Neural Computation, 9, 1735–1780.
 Hu et al. (2016) Hu, Y. J., J. Shin, and Z. Tang (2016): “Incentive Problems in Performance-Based Online Advertising Pricing: Cost per Click vs. Cost per Action,” Management Science, 62, 2022–2038.
 IAB (2018) IAB (2018): “IAB Attribution Hub,” https://www.iab.com/guidelines/iab-attribution-hub/.
 Jordan et al. (2011) Jordan, P., M. Mahdian, S. Vassilvitskii, and E. Vee (2011): “The Multiple Attribution Problem in Pay-Per-Conversion Advertising,” in Algorithmic Game Theory, ed. by G. Persiano, Berlin, Heidelberg: Springer Berlin Heidelberg, 31–43.
 Kingma and Ba (2014) Kingma, D. P. and J. Ba (2014): “Adam: A Method for Stochastic Optimization,” CoRR, abs/1412.6980.
 Kireyev et al. (2016) Kireyev, P., K. Pauwels, and S. Gupta (2016): “Do Display Ads Influence Search? Attribution And Dynamics In Online Advertising,” International Journal of Research in Marketing, 33, 475–490.
 Li and Kannan (2014) Li, H. A. and P. Kannan (2014): “Attributing Conversions in a Multichannel Online Marketing Environment: An Empirical Model and a Field Experiment,” Journal of Marketing Research, 51, 40–56.
 Lipton (2015) Lipton, Z. C. (2015): “A Critical Review of Recurrent Neural Networks for Sequence Learning,” CoRR, abs/1506.00019.
 Naik et al. (1998) Naik, P. A., M. K. Mantrala, and A. G. Sawyer (1998): “Planning Media Schedules in the Presence of Dynamic Advertising Quality,” Marketing Science, 17, 214–235.
 Naik et al. (2005) Naik, P. A., K. Raman, and R. S. Winer (2005): “Planning MarketingMix Strategies in the Presence of Interaction Effects,” Marketing Science, 24, 25–34.
 Nair et al. (2017) Nair, H. S., S. Misra, W. J. Hornbuckle, R. Mishra, and A. Acharya (2017): “Big Data and Marketing Analytics in Gaming: Combining Empirical Models and Field Experimentation,” Marketing Science, 36, 699–725.
 Roth (1988) Roth, A. (1988): The Shapley Value: Essays in Honor of Lloyd S. Shapley, Cambridge University Press.
 Sahni (2015) Sahni, N. (2015): “Effect of Temporal Spacing between Advertising Exposures: Evidence from Online Field Experiments,” Quantitative Marketing and Economics, 13, 203–247.
 Shao and Li (2011) Shao, X. and L. Li (2011): “Data-driven Multi-touch Attribution Models,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM, KDD ’11, 258–264.
 Shapley (1953) Shapley, L. S. (1953): “A Value for N-Person Games,” Annals of Mathematics Studies, Princeton University Press, 307–317.
 Siegelmann and Sontag (1995) Siegelmann, H. and E. Sontag (1995): “On the Computational Power of Neural Nets,” Journal of Computer and System Sciences, 50, 132–150.
 Siegelmann and Sontag (1991) Siegelmann, H. T. and E. D. Sontag (1991): “Turing Computability With Neural Nets,” Applied Mathematics Letters, 4, 77–80.
 Stratonovich (1960) Stratonovich, R. L. (1960): “Conditional Markov Processes,” Theory of Probability and its Applications, 5, 156–178.
 Sun et al. (1991) Sun, G., H. Chen, and Y. Lee (1991): “Turing Equivalence Of Neural Networks With Second Order Connection Weights,” IJCNN-91-Seattle International Joint Conference on Neural Networks.
 Varian (2016) Varian, H. R. (2016): “Causal Inference in Economics and Marketing,” Proceedings of the National Academy of Sciences, 113, 7310–7315.
 Viterbi (1967) Viterbi, A. (1967): “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm,” IEEE Transactions on Information Theory, 13, 260–269.
 Wessels and Omlin (2000) Wessels, T. and C. W. Omlin (2000): “Refining Hidden Markov Models With Recurrent Neural Networks,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000, vol. 2, 271–276.
 Wilbur and Zhu (2009) Wilbur, K. C. and Y. Zhu (2009): “Click Fraud,” Marketing Science, 28, 293–308.
 Xu et al. (2014) Xu, L., J. A. Duan, and A. Whinston (2014): “Path to Purchase: A Mutually Exciting Point Process Model for Online Advertising and Conversion,” Management Science, 60, 1392–1412.
 Yadagiri et al. (2015) Yadagiri, M. M., S. K. Saini, and R. Sinha (2015): “A Non-parametric Approach to the Multi-channel Attribution Problem,” in Web Information Systems Engineering – WISE 2015, Cham: Springer International Publishing, 338–352.
 Young (1988) Young, H. P. (1988): “Individual Contribution And Just Compensation,” in The Shapley Value: Essays in Honor of Lloyd S. Shapley, Cambridge University Press, 267–278.
 Zantedeschi et al. (2017) Zantedeschi, D., E. M. Feit, and E. T. Bradlow (2017): “Measuring Multichannel Advertising Response,” Management Science, 63, 2706–2728.
 Zhang et al. (2014) Zhang, Y., Y. Wei, and J. Ren (2014): “Multi-touch Attribution in Online Advertising with Survival Theory,” in 2014 IEEE International Conference on Data Mining, 687–696.