1 Introduction
In large-scale systems, a common problem is to explain the reasons for a change in the output, especially for unexpected and large changes. Explaining the reasons or attributing the change to input factors can help isolate the cause and debug it if the change is undesirable, or suggest ways to amplify the change if it is desirable. For example, in a distributed system, system failures [29] and performance anomalies [19; 1] are important undesirable outcomes. In online platforms such as e-commerce or search websites, a desirable outcome is an increase in revenue, and it is important to understand why the revenue increased or decreased [25; 5].
Technically, this problem can be framed as an attribution problem [7; 28; 5]. Given a set of candidate factors, which of them can best explain the observed change in output? Methods include statistical analysis based on conditional probabilities [2; 13; 24] or computation of game-theoretic attribution scores like the Shapley value [17; 25; 5]. However, most past work assumes that the output can be written as a function of the inputs, ignoring any structure in the computation of the output.
In this paper, we consider large-scale systems such as search or ad systems where output metrics are aggregated over different kinds of inputs or composed over multiple pipeline stages, leading to a natural computational structure (instead of a single function of the inputs). For example, in an ad system, the number of ads that are matched per query is a composite measure that is composed of an analogous metric over each query category (see Figure 1). While the overall matching density may fluctuate, the matching density per category is expected to be more stably associated with the input queries and ads. As another example, the output metric may be the result of a series of modules in a pipeline, e.g., recommendations that are finally shown to a user may be the result of multiple pipeline stages where each stage filters some items. Our key insight is that utilizing the computational structure of a real-world system can break up the system into smaller sub-parts that stay stable over time and thus can be modelled accurately. In other words, the system’s computation can be modelled as a set of independent, causal mechanisms [22] over a structural causal model (SCM) [20].
Modeling the system’s computation as an SCM also provides a principled way to define an attribution score. Specifically, we show that attribution can be defined in terms of counterfactuals on the SCM. Following recent work on causal Shapley values [10; 14], we posit four axioms that any desirable attribution method for an output metric should satisfy. We then propose a counterfactual variant of the Shapley value that satisfies all these properties. Thus, given the computational structure, our proposed CF-Shapley method has the following steps: 1) utilize machine learning algorithms to fit the SCM and compute counterfactual values of the metric under any input, and 2) use the estimated counterfactuals to construct an attribution score to rank the contribution of different inputs. On simulated data, our results show that the proposed method is significantly more accurate for explaining inputs’ contribution to an observed change in a system metric, compared to the Shapley value [17] or its recent causal variants [10; 14].
We apply the proposed method, CF-Shapley attribution, to a large-scale ad matching system that outputs relevant ads for each search query issued by a user. The key outcome is matching density, the number of ads matched per query. This density is roughly proportional to revenue generated, since only the queries for which ads are selected contribute to revenue. There are two main causes for a change in matching density: a change in query volume or a change in demand from advertisers. Given that queries are typically organized by categories, the attribution problem is to understand which of these two is driving an observed change in matching density, and from which categories.
To do so, we construct a causal graph representing the system’s computation pipeline (Figure 1). Given six months of the system’s log data, we repurpose time-series prediction models to learn the structural equation for category-wise density as a function of query volume and ad demand, its parents in the graph. For this system, we find that category-wise attribution is possible with minimal assumptions, while attribution between query volume and ad demand requires knowledge of the structural equations that generate category-wise density. In both cases, we show how the CF-Shapley method can be used to estimate the system’s counterfactual outputs and the resultant attribution scores. As a sanity check, CF-Shapley attribution scores satisfy the efficiency property for attributing the matching density metric: their sum matches the observed change in density. We then use CF-Shapley scores to explain density changes on five outlier days from November to December 2021, uncovering insights on how changes in query volume or ad demand for different categories affect the density metric. We validate the results through an analysis of external events during the time period.
To summarize, our contributions include:

A method for attributing metrics in a large-scale system utilizing its computational structure as a causal graph, which outperforms recent Shapley value-based methods on simulated data.

A case study on estimating counterfactuals in a real-world ad matching system, providing a principled way for attributing change in its output metric.
2 Related Work
Our work considers a causal interpretation of the attribution problem. Unlike attribution methods on predictions from a (deterministic) machine learning model [17; 11], here we are interested in attributing real-world outcomes where the data-generating process includes noise. Since the attribution problem concerns explaining a single outcome or event, we focus on causality on individual outcomes [9] rather than general causality that deals with the average effect of a cause on the outcome over a (sub)population [20]. In other words, we are interested in estimating the counterfactual, given that we already have an observed event. Counterfactuals are the hardest problem in Pearl’s ladder of causation, compared to observation and intervention [21].
While counterfactuals have been applied in feature attribution for machine learning models [15; 27], less work has been done on attributing real-world outcomes in systems using formal counterfactuals. Recent work uses the do-intervention to propose do-Shapley values [10; 14] that attribute the interventional quantity across different inputs. While do-Shapley values are useful for calculating the average effect of different inputs on the output, they are not applicable for attributing an individual change in the output. For attributing individual changes, [12] analyze root cause identification for outliers in a structural causal model, and find that attribution conditional on the parents of a node is more effective than global attribution. They quantify the attribution using information-theoretic scores, but do not provide any axiomatic characterization of the resulting attribution score. In this work, we propose four axioms that characterize desirable properties for an attribution score for explaining an individual change in output and present the CF-Shapley value that satisfies those axioms.
Attribution in ad systems. Multi-touch attribution is the most common attribution problem studied in online ad systems. Given an ad click, the goal is to assign credit to the different preceding exposures of the same item to the user, e.g., previous ad exposures, emails, or other media. Multiple methods have been proposed to estimate the attribution, such as attributing all credit to the last exposure [2], averaging over all exposures, or using probabilistic models to model the click data as a function of the input exposures [24; 13]. Recent methods utilize a game-theoretic attribution score based on Shapley values that summarizes the attribution over multiple simulations of input variables, with [5] or without a causal interpretation [25]. Multi-touch attribution can be considered as a one-level SCM problem, where there is an output node being affected by all input nodes. It does not cover more complex systems where there is a computational structure.
Performance Anomaly Attribution. Computational structure (e.g., specific system capabilities or logs) has been considered in the systems literature to root-cause performance anomalies [1] or system failures [29]. Some methods use causal reasoning to motivate their attribution algorithm, but they do so informally. Our work provides a formal analysis of the system attribution problem.
3 Defining the attribution problem
For a system’s outcome metric , let be a value that needs to be explained (e.g., an extreme value). Our goal is to explain the value by attributing it to a set of input variables, . Can we rank the variables by their contribution in causing the outcome?
For example, consider a system that crashes whenever its load crosses 0.9 units. The system’s crash metric can be described by the following structural equations, . The corresponding graph for the system has the following edges: . The value of each input is affected by the independent error terms through the Bernoulli distribution. Suppose the initial reference value was and the next observed value is . Given that the system crashed (), how do we attribute it to ? Intuitively, is a sufficient cause of the crash since changing would lead to the crash irrespective of the values of other variables. However, and can equally be a reason for this particular crash since their coefficients sum to . Moreover, if either of or is observed to be zero, then the other one cannot explain the crash. This example indicates that the attribution for any input variable depends on the equations of the data-generating process and also on the values of other variables.
3.1 Attribution score for system metric change
We now define the attribution score for explaining an observed value with respect to a reference value. While system inputs can be continuous, we utilize the fact that system metrics are measured and compared over time. That is, we are often interested in attribution for a metric value compared to a reference timestamp. Reference values are typically chosen from previous values that are expected to be comparable (e.g., the metric value at the last hour or last week). By comparing to a reference timestamp, we simplify the problem by considering only two values of a continuous variable: its observed value, and its value on the reference timestamp.
Formally, we express the problem of attribution of an outcome metric, as explaining change in the metric with respect to a reference, : Why did the outcome value change from to ?
Definition 1.
Attribution Score. Let and be the observed and reference values respectively of a system metric. Let be the set of input variables. Then, an attribution score for provides the contribution of in causing the change from to .
3.2 The need for SCM and counterfactuals
To estimate the causal contribution, we need to model the data-generating process from input variables to the outcome. This is usually done by a structural causal model (SCM) , that consists of a causal graph and structural equations describing the generating functions for each variable.
SCM. Formally, a structural causal model [20] is defined by a tuple where is the set of observed variables, refers to the unobserved variables, is a set of functions, and is a strictly positive probability measure for . For each , determines its data-generating process, where denotes parents of and . We consider a nonlinear, additive noise SCM such that , can be written as an additive combination of some and the unobserved variables (error terms). We assume a Markovian SCM such that unobserved variables (corresponding to error terms) are mutually independent, thus the SCM corresponds to a directed acyclic graph (DAG) over with edges to each node from its parents. Note that a specific realization of the unobserved variables, determines the values of all other variables.
Counterfactual. Given an SCM, values of unobserved variables , a target variable and a subset of inputs , a counterfactual corresponds to the query, “What would have been the value of (under ), had been ”. It is written as .
Using counterfactuals, we can formally express the attribution question in the above example. Suppose the observed values are and for some input , under . At an earlier reference timestamp with a different value of the unobserved variables, , the values are and . Starting from the observed value (), the attribution for is characterized by the change in after changing to its reference value, . That is, given that is with and all other variables at their observed value, how much would change if is set to ? Similarly, we can ask, (), denoting the change in ’s value upon setting when is set to its reference values. Thus, there can be multiple expressions for the counterfactual impact of , depending on the values of other variables.
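To make this dependence concrete, the following is a minimal sketch on a toy crash model. The coefficients, variable names (`x1`, `x2`, `x3`) and noise value are assumptions chosen for illustration and are not the structural equations of the example above. It shows that the counterfactual impact of one input differs depending on whether another input is at its observed or its reference value, with the noise held fixed.

```python
# Hypothetical crash model (illustrative only): the system crashes when
# load = 0.5*x1 + 0.5*x2 + 1.0*x3 + u exceeds 0.9.
def crash(x, u):
    load = 0.5 * x["x1"] + 0.5 * x["x2"] + 1.0 * x["x3"] + u
    return int(load > 0.9)

def counterfactual(observed, reference, changed, u):
    """Metric when the inputs in `changed` take their reference values,
    all other inputs keep their observed values, and the noise u is fixed."""
    x = {k: (reference[k] if k in changed else v) for k, v in observed.items()}
    return crash(x, u)

observed = {"x1": 1, "x2": 1, "x3": 0}
reference = {"x1": 0, "x2": 0, "x3": 0}
u_t = 0.0  # noise at the observed timestamp (assumed known here)

y_obs = crash(observed, u_t)

# Impact of x1 starting from the observed values of the other inputs.
impact_from_observed = y_obs - counterfactual(observed, reference, {"x1"}, u_t)

# Impact of x1 when x2 has already been set to its reference value.
impact_given_x2_reference = (
    counterfactual(observed, reference, {"x2"}, u_t)
    - counterfactual(observed, reference, {"x1", "x2"}, u_t)
)

print(impact_from_observed, impact_given_x2_reference)
```

In this toy setting the first impact is 1 (resetting `x1` alone prevents the crash) while the second is 0 (once `x2` is at its reference value, `x1` makes no difference), which is why a single counterfactual difference is not enough to define an attribution score.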
4 Attribution using CF-Shapley value
To develop an attribution score, we propose a way to average over the different possible counterfactual impacts. First, we posit desirable axioms that an attribution score should satisfy, as in [17; 14].
4.1 Desirable axioms for an attribution score
Axioms.
Given two values of the metric, observed, and reference, , corresponding to unobserved variables, and respectively, the following properties are desirable for an attribution score that measures the causal contribution of inputs .

CF-Efficiency. The sum of attribution scores for all equals the counterfactual change in output from reference to observed value, .

CF-Irrelevance. If a variable has no effect on the counterfactual value of output under all witnesses, , then .

CF-Symmetry. If two variables have the same effect on the counterfactual value of output , then their attribution scores are the same, .

CF-Approximation. For any subset of variables set to their reference values , the sum of attribution scores approximates the counterfactual change from the observed value. I.e., there exists a weight s.t. the vector is the solution to the weighted least squares, .
Similar to the Shapley value axioms, these axioms convey intuitive properties that a counterfactual attribution score should satisfy. CF-Efficiency states that the sum of attribution scores for inputs should equal the difference between the observed metric and the counterfactual metric when all inputs are set to their reference values. CF-Irrelevance states that if changing the value of an input has no effect on the output counterfactual under all values of other variables, then the attribution score of should be zero. CF-Symmetry states that if changing the value of two inputs has the same effect on the counterfactual output under all values of the other variables, then both variables should have an identical attribution score. And finally, CF-Approximation states that the difference between the observed output and the counterfactual output due to a change in any subset of variables is roughly equal to the sum of attribution scores for those variables.
Note that CF-Efficiency does not necessarily imply that the sum of attribution scores is equal to the actual difference between the observed value and the reference value. This is because the actual difference is a combination of the input variables’ contribution and statistical noise (error terms). That is, , where we used the CF-Efficiency property for a desirable attribution score . The second term corresponds to the difference in the metric with the same input variables but different noise corresponding to the observed and reference timestamps. This is the unavoidable noise component since we are explaining the change due to a single observation. Therefore, for any counterfactual attribution score to meaningfully explain the observed difference, it is useful to select a reference timestamp that minimizes the difference over exogenous factors (e.g., using a previous value of the metric on the same day of week or same hour). Given the true structural equations and an attribution score that satisfies the axioms, if the scores do sum to the observed difference in a metric, then it implies that the reference timestamp was well-selected.
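The decomposition described above can be written out as follows. This is a sketch under assumed notation, introduced here only for illustration: $y_t$ and $y_{t'}$ are the observed and reference metric values, $x_t$ and $x_{t'}$ the corresponding inputs, $u_t$ and $u_{t'}$ the corresponding noise realizations, $Y_{x}(u)$ the counterfactual metric under inputs $x$ and noise $u$, and $\phi_i$ an attribution score satisfying CF-Efficiency.

```latex
\begin{align*}
y_t - y_{t'}
  &= Y_{x_t}(u_t) - Y_{x_{t'}}(u_{t'}) \\
  &= \underbrace{\bigl[\,Y_{x_t}(u_t) - Y_{x_{t'}}(u_t)\,\bigr]}_{=\ \sum_i \phi_i \ \text{(CF-Efficiency)}}
   \;+\;
     \underbrace{\bigl[\,Y_{x_{t'}}(u_t) - Y_{x_{t'}}(u_{t'})\,\bigr]}_{\text{same inputs, different noise}}
\end{align*}
```

The first bracket is what the attribution scores explain; the second vanishes only when the noise at the two timestamps has the same effect on the metric, which is the motivation for choosing a comparable reference timestamp.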
4.2 The CF-Shapley value
We now define the CF-Shapley value that satisfies all four axioms.
Definition 2.
Given an observed output metric and a reference value , the CF-Shapley value for contribution by input is given by,
(1) 
where is the number of input variables , is the subset of variables set to their reference values , and is the value of unobserved variables such that .
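As an illustration of the definition, the following minimal sketch computes a Shapley-style score by enumerating subsets, using as value function the counterfactual metric when a subset of inputs is set to its reference values and the noise is held at its observed realization. The weighting `1 / (n * C(n-1, |S|))` is the standard Shapley weight and is assumed to match Eqn. 1; the toy structural equation `y = a * b + c + u` and all names are hypothetical.

```python
from itertools import combinations
from math import comb

def cf_shapley(inputs, cf_value):
    """Exact Shapley-style scores by subset enumeration.

    cf_value(S) must return the counterfactual metric when the inputs in
    frozenset S are set to their reference values, the remaining inputs keep
    their observed values, and the noise stays at its observed realization.
    """
    n = len(inputs)
    scores = {}
    for i in inputs:
        others = [j for j in inputs if j != i]
        total = 0.0
        for size in range(n):
            weight = 1.0 / (n * comb(n - 1, size))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal effect of moving input i to its reference value.
                total += weight * (cf_value(s) - cf_value(s | {i}))
        scores[i] = total
    return scores

# Hypothetical SCM used only for illustration: y = a * b + c + u.
observed = {"a": 2.0, "b": 3.0, "c": 1.0}
reference = {"a": 1.0, "b": 1.0, "c": 1.0}
u_t = 0.5  # noise at the observed timestamp (assumed already inferred)

def cf_value(s):
    x = {k: (reference[k] if k in s else observed[k]) for k in observed}
    return x["a"] * x["b"] + x["c"] + u_t

scores = cf_shapley(list(observed), cf_value)
print(scores)
```

In this toy example the scores sum to the counterfactual change `cf_value(frozenset()) - cf_value(frozenset(observed))`, and the input `c`, whose value did not change, receives a score of zero. Exact enumeration needs a number of counterfactual evaluations that grows exponentially in the number of inputs, so it is only practical for small input sets.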
Proposition 1.
The CF-Shapley value satisfies all four axioms, Efficiency, Irrelevance, Symmetry and Approximation.
Proof.
Efficiency. Following [14; 26], the CF-Shapley value for an input can be written as,
(2) 
where is the set of all permutations over the variables and is the subset of variables that precede in the permutation . The sum is,
(3) 
We can show it analogously under .
CF-Irrelevance.
If , then the numerator in Eqn. 1 for , will be zero and the result follows.
CF-Symmetry.
Assuming the same effect on the counterfactual value, we write the CF-Shapley value for and show it is the same for .
where the third equality uses .
CF-Approximation. Here we use a property [17] on value functions of standard Shapley values. There exist specific weights such that the Shapley value is the solution to where is the value function of any subset . The result follows by selecting .
∎
Comparison to do-Shapley. Unlike CF-Shapley, the do-Shapley value [14] takes the expectation over all values of the unobserved , . Thus, it measures the average causal effect over values of , whereas for attributing a single observed value, we want to know the contributions of inputs under the same .
4.3 Estimating CF-Shapley values
Eqn. 1 requires estimation of the counterfactual output at different (hypothetical) values of the input, and in turn requires both the causal graph and the structural equations of the SCM. Using knowledge of the system’s metric computation, the first step is to construct its computational graph. Then for each node in the graph, we fit its generating function using a predictive model over its parents, which we consider as the data-generating process (fitted SCM).
To fit the SCM equations, for each node , a common way is to use supervised learning to build a model estimating its value using the values of its parent nodes at the same timestamp. However, such a model will have high variance due to natural temporal variation in the node’s value over time. Since including variables predictive of the outcome reduces the variance of an estimate in general [3], we utilize autocorrelation in time-series data to include the previous values of the node as predictive features. Thus, the final model is expressed as,
(4) 
where is the number of autocorrelated features that we include. The model can be trained using a supervised time-series prediction algorithm with auxiliary features, such as DeepAR [23].
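The feature construction can be sketched as follows. This is an illustrative stand-in, not the model used in this paper: it substitutes ordinary least squares for DeepAR, uses synthetic data, and the helper name `lagged_design` and all coefficients are assumptions. It shows how a node's parents at the current timestamp and its own previous `k` values are combined into one feature row, and how residuals of the fitted equation are obtained.

```python
import numpy as np

def lagged_design(parents, target, k):
    """Build rows for t = k..T-1: [parent values at t, target at t-1, ..., t-k]."""
    T = len(target)
    rows = []
    for t in range(k, T):
        lags = target[t - k:t][::-1]  # most recent previous value first
        rows.append(np.concatenate([parents[t], lags]))
    return np.asarray(rows), target[k:]

# Synthetic series (illustrative): the node depends on two parents and on
# its own previous value.
rng = np.random.default_rng(0)
T, k = 200, 2
parents = rng.normal(size=(T, 2))
target = np.zeros(T)
for t in range(1, T):
    target[t] = (0.8 * parents[t, 0] - 0.3 * parents[t, 1]
                 + 0.5 * target[t - 1] + 0.01 * rng.normal())

X, y = lagged_design(parents, target, k)
X1 = np.column_stack([X, np.ones(len(X))])      # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # linear stand-in for the model
residuals = y - X1 @ coef                        # estimated error terms

print(np.round(coef, 2))
```

The residuals computed here play the role of the error terms that are later inferred in the abduction step; with a nonlinear model the same feature rows would be passed to that model instead of `lstsq`.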
We then use the fitted SCM equations to estimate the counterfactual with the 3-step algorithm from Pearl [20], assuming additive error. To compute for any subset , the three steps are,

Abduction. Infer error of structural equations on all observed variables. For each , where is the observed value at timestamp .

Action. Set the value of , ignoring any parents of .

Prediction. Use the inferred error term and new value of to estimate the new outcome, by proceeding stepwise for each level of the graph [20; 6] (i.e., following a topological sort of the graph), starting with ’s children and proceeding downstream until node’s value is obtained. For each ordered by the topological sort of the graph (after ), . And finally, we will obtain, .
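The three steps above can be sketched for an additive-noise SCM as follows. The graph, the fitted functions in `f`, and the numeric values are hypothetical placeholders chosen for illustration; in practice `f` would hold the fitted predictive models for each non-root node.

```python
def counterfactual(order, parents, f, observed, intervention):
    """Abduction-action-prediction for an additive-noise SCM.

    order: nodes in topological order; parents[v]: list of parent names;
    f[v]: fitted function of the parent values (non-root nodes only);
    observed: observed value of every node; intervention: {node: new value}.
    """
    # Abduction: error term = observed value minus the fitted prediction.
    noise = {v: observed[v] - f[v](*(observed[p] for p in parents[v]))
             for v in order if parents[v]}
    # Action: overwrite the intervened nodes, ignoring their parents.
    values = dict(observed)
    values.update(intervention)
    # Prediction: recompute the remaining non-root nodes in topological
    # order, reusing the inferred error terms.
    for v in order:
        if v in intervention or not parents[v]:
            continue
        values[v] = f[v](*(values[p] for p in parents[v])) + noise[v]
    return values

# Hypothetical single-category graph and fitted equations.
order = ["ad_demand", "query_volume", "cat_density", "daily_density"]
parents = {"ad_demand": [], "query_volume": [],
           "cat_density": ["ad_demand", "query_volume"],
           "daily_density": ["cat_density"]}
f = {"cat_density": lambda a, q: a / q,
     "daily_density": lambda d: 2.0 * d}
observed = {"ad_demand": 10.0, "query_volume": 5.0,
            "cat_density": 2.5, "daily_density": 5.2}

cf = counterfactual(order, parents, f, observed, {"ad_demand": 20.0})
print(cf["cat_density"], cf["daily_density"])
```

A useful sanity check of such an implementation is that an empty intervention reproduces the observed values, since every node is recomputed as its fitted prediction plus its own inferred error.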
Thus, the CF-Shapley score for any input is obtained by repeatedly applying the above algorithm and aggregating the required counterfactuals in Eqn. 1; we use a common Monte Carlo approximation to sample a fixed number () of values of [4; 8].
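One common form of such a Monte Carlo approximation samples random orderings of the inputs and averages the marginal counterfactual effect of each input along the ordering. The sketch below uses this permutation-sampling variant as an assumption; the exact sampling scheme used in our experiments may differ, and the toy value function and names are hypothetical.

```python
import random

def cf_shapley_mc(inputs, cf_value, num_samples, seed=0):
    """Permutation-sampling estimate of Shapley-style scores.

    cf_value(S): counterfactual metric with the inputs in frozenset S set to
    their reference values and the noise held at its observed realization.
    """
    rng = random.Random(seed)
    scores = {i: 0.0 for i in inputs}
    for _ in range(num_samples):
        perm = list(inputs)
        rng.shuffle(perm)
        at_reference = frozenset()
        prev = cf_value(at_reference)
        for i in perm:
            # Move input i to its reference value after its predecessors.
            at_reference = at_reference | {i}
            cur = cf_value(at_reference)
            scores[i] += (prev - cur) / num_samples
            prev = cur
    return scores

# Hypothetical SCM used only for illustration: y = a * b + c + u.
observed = {"a": 2.0, "b": 3.0, "c": 1.0}
reference = {"a": 1.0, "b": 1.0, "c": 1.0}
u_t = 0.5

def cf_value(s):
    x = {k: (reference[k] if k in s else observed[k]) for k in observed}
    return x["a"] * x["b"] + x["c"] + u_t

mc_scores = cf_shapley_mc(list(observed), cf_value, num_samples=2000)
print(mc_scores)
```

With this variant, the scores sum to the full counterfactual change for every sampled ordering, so the efficiency property holds regardless of the number of samples; only the split between inputs is approximate.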
5 Evaluation
Our goal is to attribute observed changes in the output metric of an ad matching system. We first describe the system and conduct a simulation study to evaluate CF-Shapley scores.
5.1 Description of the ad matching system
We consider an ad matching system where the goal is to retrieve all the relevant ads for a particular web search query by a user (these ads are ranked later to show only top ones to the user). The outcome variable is the average number of ads matched for each query, called the “matching density” (or simply density). This outcome can be affected by multiple factors, including the availability of ads by advertisers, the distribution and amount of user queries issued on the system, any algorithm changes, or any other system bug or unknown factors. For simplicity, we consider a matching algorithm based on matching exact full text between a query and provided keyword phrases for an ad. This algorithm remains stable over time due to its simplicity. Thus, we can safely assume that there are no algorithm changes or code bugs for the matching algorithm under study. Given an extreme or unexpected value of density, our goal then is to attribute between change in ads and change in queries.
Since there are millions of queries and ads, we categorize the data by nearly 250 semantic query categories. Examples of query categories are "Fashion Apparel", "Health Devices", "Internet", and so on. A naive solution may be to simply compare the magnitude of the observed change in ad demand or query volume across categories. That is, given a change in density on day , choose a reference day (e.g., the same day last week) and compare the values of ad demand and query volume. We may conclude that the factor with the highest percentage change is causing the overall change in density. However, the limitation is that the factor with the highest percentage change may be neither necessary nor sufficient to cause the change, because its effect depends on the values of other factors. E.g., an increase in query volume for a category can have a positive, negative, or no effect on the daily density depending on its ad demand compared to other categories. This is because the density is computed as a query volume-weighted average of category density; an increase in query volume for a low-demand (and hence low-density) category decreases the aggregate density (see Eqn. 5).
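The effect described above can be checked numerically with a small sketch of a query volume-weighted average. The two categories and their numbers are made up for illustration, and the sketch assumes that a category's own density does not change when its query volume changes.

```python
def daily_density(volumes, densities):
    """Query volume-weighted average of category densities."""
    return sum(q * d for q, d in zip(volumes, densities)) / sum(volumes)

densities = [4.0, 1.0]  # category 0 is high-density, category 1 is low-density

base = daily_density([100.0, 100.0], densities)
more_low_density_queries = daily_density([100.0, 200.0], densities)
more_high_density_queries = daily_density([200.0, 100.0], densities)

print(base, more_low_density_queries, more_high_density_queries)
```

Doubling the query volume of the low-density category lowers the aggregate density, while the same relative increase for the high-density category raises it, so the size of an input change alone does not determine the sign or magnitude of its effect on the metric.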
5.2 Constructing an SCM for ad density metric
To apply the CF-Shapley method for attributing a matching density value, we define a causal graph based on how the metric is computed, as shown in Figure 1. The number of queries for a category is measured by the number of search result page views (SRPV). The number of ads is measured by the number of listings posted by advertisers. For simplicity, we call them query volume and ad demand. We assume that given a category, the ad demand and query volume are independent of each other since they are driven by the advertiser and user goals respectively. The combination of ad demand and query volume for a category determines its category-wise density, which is then aggregated to yield the daily density. As we are interested in attribution over days as a time unit, we refer to the aggregate density as daily density, . Thus, the variables are the inputs to the system where is the category, refers to ad demand, refers to query volume, and is the number of categories.
The structural equation from category-wise densities to daily density is known. It is simply a weighted average of the category-wise densities, weighted by the query volume.
(5) 
where is the density of category on day and is the query volume for the category on day . However, the equation from category-wise ad demand and query volume to category density is infeasible to obtain. This would involve a “replay” of a computationally expensive matching algorithm on real-time queries and ad listings, but the ad listings are not available (only a daily snapshot of the ads inventory is stored in the logs). We will show how to estimate the structural equation for category density in Section 6.1.
5.3 Evaluating CF-Shapley on simulated data
Before applying CF-Shapley on the ad matching system, we first evaluate the method on simulated data motivated by the causal graph of the system. This is because it is impossible to know the ground-truth attribution using data from a real-world system, since we do not know how the change in input variables led to the observed metric value and which inputs were the most important.
We construct simulated data based on the causal structure of Figure 1. For each category, we assume ad demand and query volume as independent Gaussian random variables (we simulate real-world variation in query volume using a Beta prior). The category-wise density is constructed as a monotonic function of ad demand and has a biweekly periodicity. The SCM equations are,
(6)  
(7) 
where and are the query volume and ad demand respectively for category at time . They combine to produce the ad matching density based on a function and additive normal noise. The variance of the noise, determines the stochastic variation in the system. For the simulation, we construct based on two observations about the category density: 1) it is roughly a ratio of the relevant ads and the number of queries; and 2) it exhibits autocorrelation with its previous value and periodicity over a longer time duration. We use to denote the fraction of relevant ads and add a second term with parameter to simulate a biweekly pattern, . is the relative importance of the previous value in determining the current category density. Finally, all the category-wise densities are weighted by their query volume and averaged to produce the daily density metric, .
Each dataset generated using these equations has 1000 days and 10 categories; we set for simplicity. We intervene on the ad demand or query volume of the 1000th point to construct an outlier metric that needs to be attributed. Given the biweekly pattern, the reference date is chosen as 14 days before the 1000th point.
Setting ground-truth attribution. Even with simulated data, setting the ground-truth attribution can be tricky. For example, if there is an increase in ad demand for one category and an increase in query volume for another, it is not clear which one would cause the biggest impact on the daily density. That depends on their query volume and ad demand respectively and any changes in other categories. To evaluate attribution methods, therefore, we consider simple interventions where an objective ground-truth can be obtained. Specifically, for ease of interpretation, we intervene on only two categories at a time such that the first has a substantially higher chance of affecting the outcome metric than the second.
We consider two configurations: change in 1) ad demand and 2) query volume. For changing ad demand (Config 1), we choose two categories such that the first has the highest query volume and the second has the lowest query volume. We double the ad demand for both categories with a slight difference (x2 for the first category, x2.1 for the second). Since the category-wise densities are weighted by query volume to obtain the daily density metric, for the same (or similar) change in demand, it is natural that the first category has a higher impact on daily density (even though they may have a similar impact on their category-wise density). For Config 1, thus, the ground-truth attribution is the first category. For changing query volume (Config 2), we choose two categories such that the first has the most extreme density and the second has density equal to the reference daily density. Then, we change query volume as above: x2 for the first category and x2.1 for the second. Following Eqn. 7, a query volume change in a category having the same density as the daily density is expected to have low impact on daily density (keeping other categories constant, if category density is not impacted by query volume, an increase in query volume for a category with density equal to daily density causes zero change in daily density). Thus, the ground-truth attribution (the category with the highest impact on the output metric) is again the first category. Note that query volume has higher variation across categories, so a higher multiplicative factor does not necessarily mean a higher absolute difference.
Baseline attribution methods. We compare CF-Shapley to the standard Shapley value (as implemented in SHAP [17], Shapley) and the do-Shapley value (DoShapley) [14]. The Shapley method ignores the structure and fits a model directly predicting daily density using (category-wise) ad demand and query volume features. It uses the predictions of this model for computing the Shapley score. For the DoShapley method, we notice that our causal graph corresponds to the Direct-Causal graph structure in their paper and use the estimator from Eq. (5) in [14], which depends on the same daily density predictor as the standard Shapley value. We also evaluate three intuitive baselines based on the absolute change in inputs: 1) the category with the biggest change in ad demand (AdDemandDelta); 2) query volume (QVolumeDelta); or 3) density multiplied by query volume (ProductDelta), since this product is used in the daily density equation.
For the CF-Shapley algorithm, we fit the structural equation for category density, using the following features: ad demand, query volume, . For both the CF-Shapley category density prediction and the Shapley daily density prediction model, we use a 3-layer feed-forward network. We use all data up to the 999th day for training and validation for all models.
Results. For each attribution method, we measure accuracy compared to the ground-truth as we increase the noise () in the true data-generating process (SCM) (). As noise in the generating process increases, we expect higher error for fitting structural equations and thus the attribution task becomes harder. Attribution accuracy is defined as the fraction of times a method outputs the highest attribution score for the correct category (first category), over 20 simulations.
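The accuracy measure defined above can be sketched as follows. The function name, the category labels and the example scores are hypothetical; the sketch takes the category with the largest raw score in each run, as the definition states.

```python
def attribution_accuracy(score_runs, correct):
    """Fraction of runs whose top-scored category is the correct one.

    score_runs: list of {category: attribution score}, one dict per simulation.
    """
    hits = sum(1 for scores in score_runs
               if max(scores, key=scores.get) == correct)
    return hits / len(score_runs)

# Four made-up simulation runs; "cat_1" is the intervened first category.
runs = [
    {"cat_1": 0.9, "cat_2": 0.1, "cat_3": 0.0},
    {"cat_1": 0.4, "cat_2": 0.6, "cat_3": 0.0},
    {"cat_1": 0.7, "cat_2": 0.2, "cat_3": 0.1},
    {"cat_1": 0.5, "cat_2": 0.3, "cat_3": 0.2},
]
accuracy = attribution_accuracy(runs, correct="cat_1")
print(accuracy)
```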
Figure 2 shows the results. CF-Shapley obtains the highest attribution accuracy for both Config 1 and 2. In general, attribution for ad demand is easier than for query volume because both the category density and daily density are monotonic functions of the ad demand. That is why we observe near 100% accuracy for CF-Shapley under Config 1, even with high noise. The attribution accuracy for Config 2 is 70-80%, decreasing as more noise is introduced.
In comparison, none of the baselines achieve more than 50% (random-guess) accuracy. Note that the Shapley and DoShapley methods obtain similar attribution accuracies. While their attribution scores are different, the highest ranked category often turns out to be the same since they rely on the same daily density model (but use different formulae). Inspecting the predictive accuracy of the daily density model offers an explanation: error on the daily density prediction is higher than that for category-wise density prediction (and it increases as the noise is increased). This indicates the value of computing an individualized counterfactual using the full graph structure, rather than focusing on the average (causal) effect. Finally, the other intuitive baselines fail on both tasks since they only look at the change in the input variables.
6 Case study on ad matching system
We now apply the CF-Shapley attribution method on data logs of a real-world ad matching system from July 6 to Dec 28, 2021. For each query, we have log data on the number of ads matched by the system. In addition, each query is marked with its category. The category query volume is measured as the number of queries issued by users for each category. This allows us to calculate the ground-truth matching density on each day, category-wise and aggregate. Separately, to compute the category-wise input ad demand for a day, we fetch each ad listing available on the day and assign it to a category if any query from that category contains a word that is present in its keywords. This is the total sum of ad listings that are potentially relevant to the query for the exact matching algorithm (that matches the full query exactly to the full ad keyword phrase).
6.1 Implementing CF-Shapley: Fitting the SCM
We follow the method outlined in Section 4.3. The main task is to estimate the structural equations for category-wise ad density. There are over 250 categories; fitting a separate model for each is not efficient. Besides, it may be beneficial to exploit the common patterns in the different time-series. We therefore consider a deep learning-based model, DeepAR [23], that fits a single recurrent network for multiple time-series (we also tried a transformer-based model, the temporal fusion transformer (TFT) [16], but found it hard to tune to obtain comparable accuracy). As specified in Equation 4, for each category, the DeepAR model is given ad demand, query volume, and the autoregressive values of density for the past 14 days. Note that rather than predicting over a range of days (which can be inaccurate), we fit the time-series model separately for each day t using data up to its (t-1)th day, to utilize the additional information available from the previous day. To implement DeepAR, we used the open-source GluonTS library.
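The per-day refitting scheme (train on everything before a day, predict only that day) can be sketched as follows. This is an illustration only: an ordinary least-squares model stands in for DeepAR so that the example stays self-contained, and the function names and array layout are assumptions rather than the paper's code.

```python
import numpy as np

LAGS = 14  # autoregressive window, as described in the text


def make_features(density, ad_demand, query_volume, t):
    """Feature vector for day t: intercept, inputs on day t, and the 14 prior densities."""
    return np.concatenate(([1.0, ad_demand[t], query_volume[t]], density[t - LAGS:t]))


def rolling_one_step_forecast(density, ad_demand, query_volume, first_day):
    """For each day t >= first_day, refit on days before t and predict day t only."""
    preds = {}
    for t in range(first_day, len(density)):
        train_days = range(LAGS, t)  # targets strictly before day t
        X = np.array([make_features(density, ad_demand, query_volume, s)
                      for s in train_days])
        y = density[LAGS:t]
        # Least squares is a stand-in for training the forecasting model (assumption).
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds[t] = float(make_features(density, ad_demand, query_volume, t) @ coef)
    return preds
```

Refitting for every day is more expensive than a single multi-day forecast, but each prediction then conditions on the most recent observed density.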
We compare the DeepAR model to three baselines. As simple baselines that capture the weekly pattern, we consider 1) the category density on the same day a week before; and 2) the average density over the last four weeks. We also consider a 3-layer feedforward network that uses the same features as DeepAR. Table 1 shows the prediction error. The DeepAR model obtains the lowest error on the validation set according to all three metrics: mean absolute percentage error (MAPE), median APE, and the symmetric MAPE (sMAPE) [18]. For our results, we choose DeepAR as the estimated SCM equation and apply CFShapley on data from Nov 15 to Dec 28. We chose Nov 15 as the start date to allow sufficient days of training data.
Choosing reference timestamp. The CFShapley method requires specifying a reference day that provides the "expected/usual" density value. Common choices are the previous day's value or the value on the same day of the previous week. We choose the latter due to known weekly patterns in the density metric.
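Table 1: Prediction error for category-wise density on the validation set.

| Model | Mean APE (%) | Median APE (%) | sMAPE |
|---|---|---|---|
| LastWeek | 21.2 | 11.5 | 0.20 |
| Avg4Weeks | 25.1 | 10.6 | 0.17 |
| FeedForward | 20.0 | 10.8 | 0.20 |
| DeepAR | 15.6 | 7.8 | 0.13 |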
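For concreteness, the three error metrics and the two naive baselines reported in Table 1 can be written as below. This is a sketch under stated assumptions: sMAPE is taken as an unscaled fraction in [0, 2], which matches the magnitude of the reported values, and the four-week average is taken over the same weekday of the previous four weeks. Neither choice is confirmed by the text, and all function names are illustrative.

```python
import numpy as np


def ape(actual, predicted):
    """Absolute percentage error per observation, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.abs(actual - predicted) / np.abs(actual)


def mean_ape(actual, predicted):
    return float(np.mean(ape(actual, predicted)))


def median_ape(actual, predicted):
    return float(np.median(ape(actual, predicted)))


def smape(actual, predicted):
    """Symmetric MAPE as a fraction in [0, 2] (assumed scaling)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(2.0 * np.abs(actual - predicted)
                         / (np.abs(actual) + np.abs(predicted))))


def last_week_baseline(series, t):
    """Predict day t with the value observed seven days earlier."""
    return series[t - 7]


def avg_4_weeks_baseline(series, t):
    """Predict day t with the mean of the same weekday over the last four weeks (assumption)."""
    return float(np.mean([series[t - 7 * k] for k in range(1, 5)]))
```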
6.2 Validating the CFEfficiency axiom
We first check whether the obtained CFShapley scores sum up to the observed percentage change in the daily density metric (Figure 3). The difference between the sum of CFShapley scores and the actual change is less than 0.10% for all days, indicating that our choice of reference timestamp is appropriate (Sec. 4.1) and that the approximate Shapley value computation captures the relevant signal.
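The efficiency check can be illustrated with a permutation-sampling Shapley estimator on a toy density function. This is a simplified sketch, not the CFShapley procedure itself: the toy treats per-category density as a direct input instead of passing ad demand and query volume through fitted structural equations, it attributes the absolute change and not the percentage change, and every name in it is hypothetical.

```python
import random


def sampled_shapley(f, reference, current, n_permutations=200, seed=0):
    """Monte Carlo Shapley values for the change f(current) - f(reference).

    reference, current: dicts mapping input name -> value on the two days.
    """
    rng = random.Random(seed)
    names = list(reference)
    scores = dict.fromkeys(names, 0.0)
    for _ in range(n_permutations):
        rng.shuffle(names)
        x = dict(reference)          # start from the reference day
        prev = f(x)
        for name in names:           # switch inputs to current values one at a time
            x[name] = current[name]
            value = f(x)
            scores[name] += (value - prev) / n_permutations
            prev = value
    return scores


def daily_density(x, categories=("A", "B")):
    """Toy output: total matched ads / total queries.

    x holds per-category query volume ("q_<cat>") and per-query density ("d_<cat>").
    """
    matched = sum(x[f"q_{c}"] * x[f"d_{c}"] for c in categories)
    return matched / sum(x[f"q_{c}"] for c in categories)
```

Within each sampled permutation the marginal contributions telescope to f(current) - f(reference), so in this toy the scores sum to the total change up to floating-point error. In the real system the counterfactual outputs come from an estimated model, which is one reason a small residual can remain.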
6.3 Choosing dates for evaluation
While we computed attribution scores for all days, typically one is interested in attribution for unexpected values for daily density.
To discover unusual days for attribution, we fit a standard time-series model to the aggregate daily density data. We use four candidate models: 1) daily density on the same day last week; 2) mean density of the last 4 weeks; 3) a feedforward network; and 4) the DeepAR model. As for the category-wise prediction, all neural network models are provided the last 14 days of daily density. Table 2 shows the mean APE, median APE, and sMAPE. The feedforward model obtains the lowest error. While DeepAR is a more expressive model than FeedForward, a potential reason for its lower accuracy is the number of training samples (for daily-density prediction there are only as many data points as days, unlike category-wise prediction). For its simplicity, we use the FeedForward network for detecting outlier days. Its predictions for different days and the detected outliers can be seen in Figure 4. Like DeepAR, the feedforward model is implemented as a Bayesian probabilistic model, so it outputs prediction samples rather than a point prediction.

Table 2: Prediction error for aggregate daily density.

| Model | Mean APE (%) | Median APE (%) | sMAPE |
|---|---|---|---|
| LastWeek | 4.6 | 3.1 | 0.047 |
| Last4Weeks | 4.5 | 3.3 | 0.047 |
| FeedForward | 3.0 | 2.2 | 0.031 |
| DeepAR | 3.4 | 2.4 | 0.035 |
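Table 3: Attribution scores for ad demand and query volume by category on December 4.

| Category | AdDemandAttrib | QueryVolumeAttrib |
|---|---|---|
| Sorted by AdDemandAttrib | | |
| Internet & Telecom | 0.0450 | 0.00850 |
| Apparel | 0.00843 | 0.00151 |
| Arts & Entertainment | 0.00663 | 0.00928 |
| Hobbies & Leisure | 0.00646 | 0.0168 |
| Travel & Tourism | 0.00584 | 0.000287 |
| Sorted by QueryVolumeAttrib | | |
| Hobbies & Leisure | 0.00646 | 0.0168 |
| Arts & Entertainment | 0.00633 | 0.00928 |
| Internet & Telecom | 0.0450 | 0.00850 |
| Law & Government | 0.000203 | 0.00743 |
| Health | 0.000161 | 0.00645 |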
Days where the daily density goes beyond the 95% prediction interval are chosen for attribution. A visual inspection shows two clusters, Thanksgiving/Black Friday and Christmas, which are expected due to their significance in the US. We also find an extreme value on Dec. 4. In all three cases, the daily density increases. Intuitively, one may have expected the opposite for holidays: density would decrease since people are expected to spend less time online.
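Given prediction samples from a probabilistic forecaster, the outlier rule can be sketched as below. The array layout (one row per sample, one column per day) and the use of a central, equal-tailed interval are assumptions made for this illustration.

```python
import numpy as np


def prediction_interval(samples, level=0.95):
    """Per-day central prediction interval from forecast samples.

    samples: array of shape (n_samples, n_days) (assumed layout).
    """
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(samples, [tail, 1.0 - tail], axis=0)
    return lower, upper


def outlier_days(observed, samples, level=0.95):
    """Indices of days whose observed value falls outside the interval."""
    lower, upper = prediction_interval(samples, level)
    observed = np.asarray(observed, float)
    return np.flatnonzero((observed < lower) | (observed > upper)).tolist()
```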
6.4 Qualitative analysis
We now use CFShapley to explain these observed changes.
December 4. Figure 5 shows the attribution by different categories, aggregated up to obtain 22 high-level categories. The Internet & Telecom (IT) category has the biggest positive attribution score while the Hobbies & Leisure (HL) category has the biggest negative attribution score. That is, the HL category pulled daily density down on that day.
To understand why, we look at the attribution scores separately for ad demand and query volume for each category in Table 3. The attribution score reflects the percentage change in daily density compared to last week that is due to the ad demand or query volume of a category. The only categories to have an attribution score greater than 1% are IT and HL, agreeing with the category-wise analysis. Specifically, the change in ad demand for IT leads to a 4.5% increase in daily density. The query volume change in HL, on the other hand, leads to a 1.7% decrease in daily density. Considering all categories together, ad demand change leads to a 6.5% increase in daily density and query volume change leads to a 5.6% decrease. The net result is a 1% improvement over the last week. While an increase of 1% in daily density may look small, note that the value last week was already inflated because it fell in the Black Friday week. This is why we detect outliers using the expected time-series pattern rather than simply the difference from last week. On such days, one may also consider an alternative baseline, e.g., two weeks before.
Are the attributions meaningful? In the absence of ground truth, we dive deeper into the query logs to check for evidence. We do find a significant increase in queries for the HL category. In fact, more than 70% of the increase in query volume for HL is due to cheetah-related queries. On manual inspection, we find that December 4 is International Cheetah Day. Cheetah-related queries also contribute 86% of the ad demand increase for the HL category. Given that the category density of HL is much lower than the daily density, this increase in query volume causes a decrease in daily density, leading to the negative attribution score. Due to the increase in ad demand (perhaps in anticipation of Cheetah Day), the HL category also leads to an increase of 0.6% in daily density (see Table 3). On the other hand, the IT category's main contribution is from an increase in ad demand. Logs show a substantial (14%) increase in ads compared to last week for the category on Dec 4, which explains its high attribution score for ad demand. This increase is sustained across queries, possibly indicating a shift for the first Saturday after the holiday weekend.
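The sign of the query-volume effect follows from treating daily density as a volume-weighted average of category densities. The derivation below is a sketch in notation introduced only for this illustration ($q_c$ for the query volume of category $c$, $d_c$ for its density, $D$ for daily density), and it assumes the category densities are held fixed while one query volume changes.

```latex
D = \frac{\sum_c q_c\, d_c}{\sum_c q_c},
\qquad
\frac{\partial D}{\partial q_k}
  = \frac{d_k \sum_c q_c - \sum_c q_c\, d_c}{\left(\sum_c q_c\right)^2}
  = \frac{d_k - D}{\sum_c q_c}.
```

Under these assumptions, extra queries in a category whose density is below the daily density ($d_k < D$) lower $D$, while fewer such queries raise it.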
Nov 25 and 26 (Thanksgiving). On the Thanksgiving holiday (Nov 25), we may have expected density to drop since many people in the US are expected to spend more time with their family and less time online. At the same time, online shopping on Black Friday (Nov 26) may increase density. Instead, we find that the density increases significantly on both days (see Figure 4). Specifically, compared to last week, daily density on Nov 26 increased by 18.3%, out of which 13.5% is contributed by query volume change and 4.8% by ad demand. How can this result be explained? Using the CFShapley method, for query volume change, we find that Health, Law & Government, and Business & Industrial are the top-ranked categories. Each contributes more than 2% of the density increase, leading to a cumulative 7% increase. From the logs, we see that query volume for these categories decreased as people spent less time on work-related or health-related queries. Since these categories tend to have low density, the daily density increased as a result. On the ad demand side, Online Media & Ecommerce contributed a nearly 3% increase in daily density, perhaps due to increased demand for Black Friday shopping. Nov 25 exhibits similar patterns for query volume.
Dec 24 and Dec 25 (Christmas). On the Christmas days too, there is a significant increase in density. As on the Thanksgiving days, health-related and work-related queries are issued fewer times, leading to an overall increase in daily density (all three categories have attribution scores >1%). However, we find that the top categories by query volume change are Hobbies & Leisure and Arts & Entertainment. Both these categories experience a surge in their query volume and, being high-density categories, cause a 2.1% and 1.8% increase in daily density respectively. To explain this, we look at the query logs and find that the rise in Hobbies & Leisure queries is fueled by the toys & games subcategory, which is aligned with expectations for the holidays. On Dec 25, Hobbies & Leisure is also the category with the highest attribution score by ad demand (2.7%). Overall, the category contributes a 4.8% increase, nearly one-third of the total density increase on Christmas day, signifying the importance of the toys & games subcategory for Christmas.
7 Discussion and Conclusion
We presented a counterfactual-based attribution method to explain changes in a large-scale ad system's output metric. Using the computational structure of the system, the method provides attribution scores that are more accurate than prior methods.
References
 [1] (2012) X-ray: automating root-cause diagnosis of performance anomalies in production software. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). Cited by: §1, §2.
 [2] (2018) Beyond the last touch: attribution in online advertising. Marketing Science 37 (5), pp. 771–792. Cited by: §1, §2.
 [3] (2013) Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research 14 (11). Cited by: §4.3.
 [4] (2009) Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research 36 (5), pp. 1726–1730. Cited by: §4.3.
 [5] (2012) Causally motivated attribution for online advertising. In Proceedings of the Sixth International Workshop on Data Mining for Online Advertising and Internet Economy. Cited by: §1, §1, §2.
 [6] (2022) Evaluating and mitigating bias in image classifiers: a causal perspective using counterfactuals. In Proceedings of the IEEE/CVF WACV Conference, pp. 915–924. Cited by: item 3.
 [7] (2020) Prediction, estimation, and attribution. International Statistical Review 88, pp. S28–S59. Cited by: §1.
 [8] (2008) A linear approximation method for the Shapley value. Artificial Intelligence 172 (14). Cited by: §4.3.
 [9] (2016) Actual causality. MIT Press. Cited by: §2.
 [10] (2020) Causal Shapley values: exploiting causal knowledge to explain individual predictions of complex models. Advances in Neural Information Processing Systems 33, pp. 4778–4789. Cited by: §1, §2.
 [11] (2021) Anomaly attribution with likelihood compensation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 4131–4138. Cited by: §2.
 [12] (2019) Causal structure based root cause analysis of outliers. arXiv preprint arXiv:1912.02724. Cited by: §2.
 [13] (2016) A probabilistic multi-touch attribution model for online advertising. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 1373–1382. Cited by: §1, §2.
 [14] (2022) On measuring causal contributions via do-interventions. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 162, pp. 10476–10501. Cited by: §1, §2, §4.2, §4.2, §4, §5.3.
 [15] (2021) Towards unifying feature attribution and counterfactual explanations: different means to the same end. In Proceedings of the 2021 AAAI/ACM AIES Conference. Cited by: §2.
 [16] (2021) Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37 (4), pp. 1748–1764. Cited by: §6.1.
 [17] (2017) A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30. Cited by: §1, §1, §2, §4.2, §4, §5.3.
 [18] (2020) The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36 (1), pp. 54–74. Cited by: §6.1.
 [19] (2012) Structured comparative analysis of systems logs to diagnose performance problems. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). Cited by: §1.
 [20] (2009) Causality. Cambridge University Press. Cited by: §1, §2, §3.2, item 3, §4.3.
 [21] (2019) The seven tools of causal inference, with reflections on machine learning. Communications of the ACM 62 (3), pp. 54–60. Cited by: §2.
 [22] (2017) Elements of causal inference: foundations and learning algorithms. The MIT Press. Cited by: §1.
 [23] (2020) DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36 (3), pp. 1181–1191. Cited by: §4.3, §6.1.
 [24] (2011) Data-driven multi-touch attribution models. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 258–264. Cited by: §1, §2.
 [25] (2022) Shapley meets uniform: an axiomatic framework for attribution in online advertising. Management Science. Cited by: §1, §1, §2.
 [26] (2014) Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41 (3), pp. 647–665. Cited by: §4.2.
 [27] (2020) Counterfactual explanations for machine learning: a review. arXiv preprint arXiv:2010.10596. Cited by: §2.
 [28] (2012) Understanding the past: statistical analysis of causal attribution. American Journal of Political Science 56 (1), pp. 237–256. Cited by: §1.
 [29] (2019) The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 131–146. Cited by: §1, §2.