In many applications, data-driven decision making reduces to understanding or evaluating different variants of a system, which are determined by a variant variable, through observation of another (observable) variable. For example, such situations occur in A/B testing, the state-of-the-art technique for online evaluation of web services [23, 22, 15, 6, 12], where controlled experiments are conducted to detect the causal effect of system updates on performance, as measured by an evaluation metric.
A previous study introduced the notion of a binary effect variable in order to compare the observed variable w.r.t. two possible states of the variant variable (e.g., a treatment assignment variable or an instrumental variable). Namely, assume that the observed variable does not depend directly on the variant variable, but only indirectly via a latent variable, i.e., the conditional distribution of the observed variable given both the latent variable and the variant coincides with its conditional distribution given the latent variable alone. In this way, there is no causal effect [39, 33, 29, 7] of the variant variable that is not mediated by the latent variable. In our study, we generalize the notion of the effect variable to the multi-state case.
We conduct a thorough theoretical analysis of effect variables with a finite number of states and turn the problem of finding such a variable into a distribution decomposition problem. Since we show that, in the general case, the number of possible effect latent variables is infinite, we introduce the notion of the extreme effect variable (as in the binary case), which simultaneously is (a) an effect variable whose states are minimally affected by the variant variable and (b) an effect variable whose states are extremely different w.r.t. the observable variable. In addition, we demonstrate the application of this technique to several real examples of quality evaluation of a web search engine.
2 Related work
Our work can be compared with other studies in three aspects. The first one relates to causal discovery [39, 33, 29, 10, 30, 7, 40], where the main problem consists in finding causal relations between two or more variables (e.g., a causal graph or chain). In our study, we state that the causal effect exists and assume that it can be decomposed so that the variant variable affects the observed variable only through a latent variable. Then, we try to find such a latent variable in a non-parametric way. For instance, Balke and Pearl studied a model where reception of a treatment depended not only on the treatment assignment, but also on some latent variables (factors). In contrast to that study, we consider a model where the treatment assignment variable affects an observed response via a latent variable. Tishby et al. studied the problem of finding a short code for one signal (variable) that preserves the maximum information about another signal (variable). In contrast to our work, they optimized a so-called distortion measure to solve their problem. Other studies optimized Shannon's entropy (or its conditional variant) over all possible mediating variables. In our study, we find optimal solutions to our problem such that the probabilities of the latent variable's states differ between the variants as little as possible.
The second group of related studies concerns decompositions of a distribution into a mixture of several other distributions (e.g., approximations by a mixture of parametric distributions, or exact methods that decompose a distribution into a basis). To the best of our knowledge, the problem considered in our study, namely a simultaneous decomposition of a tuple of distributions into a mixture of other distributions, is novel.
The studies on online A/B testing (e.g., [41, 2, 22, 8, 14, 6]) form the third aspect of related work, since we apply our approach to online evaluation of web services. There is a line of work devoted to the evaluation of different components of web services: ranking algorithms [38, 13, 32, 15], the user interface [24, 13, 32, 15], etc. Other works on A/B experiments address various aspects of web user experience: absence, abandonment, engagement [13, 14, 11, 17], speed, periodicity [13, 11, 15, 16], etc. Studies focused on improving the sensitivity of A/B test metrics considered the utilization of more data from the period before the experiment [9, 34] and from the experiment period, either by learning a linear combination of metrics or by predicting a future metric value. Budylin et al. proposed a tool (referred to as linearization) that allows all existing sensitivity improvement techniques to be applied efficiently and directly to ratio metrics (such as CTR). Our study extends the technique of Nikolaev et al. to the multidimensional case. More details on A/B testing can be found in surveys, in books on randomized experiments in general like [19, 31], or in tutorials like [4, 5, 12].
In this section, we introduce the core definitions and notations of our study.
3.1 Effect latent variable
Consider a set of random events (e.g., experimental units in A/B testing) together with a probability over them. Let an observable variable be defined on these events (e.g., the number of visits of a user), and let a variant variable be defined on them as well (e.g., it could represent the variant of a service shown to a user).
A variable is called an effect (latent) variable (of relative to ) if
The region of values of the variable is referred to as .
In this study, we consider the case when the regions of values of the variant and effect variables are finite and of equal size. These limitations correspond to the practical usage of this theory considered in Section 6. However, the other cases could be studied in future work.
3.2 Notations and basic definitions
In our work, we widely use the following well-known terms.
A function is a distribution over an enumerable set if :
its values are non-negative: ;
the sum of all its values over is equal to : .
From here on in the paper we use shorter notations for distributions:
3.3 Distribution decomposition
Let for a discrete variable , then the function satisfies Definition 4. From here on in this section, we consider this case of a discrete variable (i.e., is an enumerable set). (Footnote 2: However, all the described theory could be translated to the case of a non-discrete variable , where the density function of the distribution should be used as . This is a good direction for future work.) Then, having an effect variable , we decompose as follows:
4 Existence and properties of an effect variable
In this section, we translate the problem of finding an effect variable into the non-linear problem of distribution decomposition described below.
4.1 Distribution decomposition problem
Since , we assume that without loss of generality. One translates the problem of finding an effect variable to the problem of finding a distribution decomposition, which is formalized as follows.
If a pair is a solution to this problem, then is referred to as the mixture matrix, the scalar values are referred to as the mixture coefficients, and the tuple of distributions is referred to as the decomposition basis.
Note that Problem 1 always has trivial solutions. The mixture matrix of such a solution is a permutation matrix, and the decomposition basis consists of the source distributions permuted according to this matrix. These trivial solutions correspond to the case when the variant variable itself is the effect variable.
Let the source distributions be linearly independent, then the solution mixture matrix is invertible.
Proof. Assume the contrary, namely that the image of the linear operator defined by the mixture matrix has dimension lower than K. Then, the linear span of the source distributions, which are obtained from the decomposition basis by this operator, has no greater dimension. Thus, the source distributions are linearly dependent, which is a contradiction. ∎
Let be a solution of Problem 1 and be an invertible matrix, then the following identity holds for the decomposition basis (see the first condition of Problem 1):
Next, we consider the following problem, which concerns finding of the mixture matrix (coefficients) only.
Given source distributions over , find a matrix such that the following conditions hold:
is an invertible stochastic matrix;
The identity in Eq. (6) holds since . The condition on in Problem 1 holds since is a stochastic matrix as a solution to Problem 2. Finally, for , the identity holds, since is a stochastic matrix. Then,
4.2 Solutions to the decomposition problem
Now, we are ready to find the set of all mixture matrices that are solutions to Problem 2. This set is denoted by and the injection of its coefficients in is denoted by . First of all, we state the properties of these sets in the following lemma.
Let the source distributions be linearly independent, then:
is not empty and contains at least trivial solutions that are all permutation matrices (corresponding decomposition bases are permuted source distributions );
does not contain any degenerate matrix (i.e., );
is closed w.r.t. multiplication by a permutation matrix from the right;
is a closed manifold of the dimension in general case;
Proof. Property 1 follows from Remark 2. Property 2 follows from the definition of a solution of Problem 2. Next, let be any permutation matrix and . Then, is a stochastic matrix and , since it is a permutation (inverse to ) of , which are non-negative.
Finally, the restrictions on the components of a stochastic matrix (Def. 3) imply that belongs to the intersection of hyperplanes and of closed half-spaces For (), the inequalities in Eq. (8) define a closed set in for each . Thus, the region is the intersection of all of them, and, hence, is also closed, i.e., the property 4 holds. ∎
In this section, we translated the problem of finding an effect variable into the non-linear Problem 1 of the simultaneous decomposition of the source distributions. Then, for linearly independent source distributions, we established the equivalence of this problem to the non-linear Problem 2 of finding a proper stochastic matrix with constraints (8). Finally, the key properties of solutions to these problems were proved in Theorem 1.
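The conditions of Problem 2 can be checked numerically for a candidate matrix. The snippet below is a minimal sketch, not the implementation used in this study. It assumes that the source distributions are stacked as the rows of an array `F`, that a candidate mixture matrix `M` is row-stochastic, and that the decomposition reads `F = M @ G` with the decomposition basis in the rows of `G`. The names `F`, `M`, `G`, `is_mixture_matrix`, and `decomposition_basis` are hypothetical and introduced only for illustration.

```python
import numpy as np


def decomposition_basis(M, F):
    """Return G such that F = M @ G (assumed convention: rows are distributions)."""
    return np.linalg.solve(np.asarray(M, dtype=float), np.asarray(F, dtype=float))


def is_mixture_matrix(M, F, tol=1e-9):
    """Check the Problem 2 conditions for a candidate matrix M.

    M must be an invertible row-stochastic K x K matrix, and the
    candidate basis M^{-1} F must have non-negative entries.
    """
    M = np.asarray(M, dtype=float)
    F = np.asarray(F, dtype=float)
    K = F.shape[0]
    if M.shape != (K, K):
        return False
    # Stochastic matrix: non-negative entries, every row sums to one.
    if (M < -tol).any() or not np.allclose(M.sum(axis=1), 1.0, atol=tol):
        return False
    # Invertibility (degenerate matrices are excluded).
    if abs(np.linalg.det(M)) <= tol:
        return False
    # Non-negativity of the candidate decomposition basis.
    G = decomposition_basis(M, F)
    return bool((G >= -tol).all())


# Two illustrative source distributions over three observable states.
F = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.3, 0.3]])

print(is_mixture_matrix(np.eye(2), F))                      # trivial solution
print(is_mixture_matrix([[0.6, 0.4], [0.4, 0.6]], F))       # non-trivial candidate
print(is_mixture_matrix([[0.55, 0.45], [0.45, 0.55]], F))   # basis turns negative
```

The rows of the recovered basis sum to one automatically whenever `M` is stochastic and invertible, so only the non-negativity of `G` needs an explicit check.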
5 Extreme effect variable
Since Problem 1 has a continuum of possible solutions in the general non-degenerate case (see Theorem 1), we introduce the notion of the optimal distribution decomposition and the notion of the extreme effect variable.
5.1 Optimal distribution decomposition
A distribution decomposition for Problem 1 is called the optimal distribution decomposition if its mixture matrix has the lowest absolute value of the determinant among all mixture matrices , i.e.,
The corresponding effect variable , its mixture matrix , its mixture coefficients , and its decomposition basis are referred to as the extreme effect variable, the extreme mixture matrix, the extreme mixture coefficients, and the extreme decomposition basis, respectively.
Note that the minimum in Eq. (11) always exists, since is a closed bounded region in (see Theorem 1) and the function is continuous on . Moreover, this minimum is non-zero in the case of linearly independent source distributions, since each mixture matrix is non-degenerate, i.e., (see Theorem 1).
The resulting variable is extreme in two senses simultaneously. On the one hand, the minimization of the determinant in Eq. (11) means that we look for an extreme variable whose state probabilities differ between the variants as little as possible. On the other hand, for any set of K values of the observable variable, we can consider the square matrices formed by the values of the source distributions and of the decomposition basis on these elements. They are connected through the mixture matrix by a matrix identity (see Eq. (7)), and, thus, the determinant of the former equals the product of the determinant of the mixture matrix and the determinant of the latter. Since the matrix of the source distributions is constant for the given source distributions, this identity implies that the optimal decomposition maximizes the disagreement of the decomposition basis (i.e., its power). In terms of the effect variable, we look for an effect variable that has the most different distributions in the basis. This agrees well with the intuition that the basis distributions represent extremely different (absolute) states of an event.
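The determinant argument above can be illustrated numerically. The following sketch assumes the convention `F = M @ G` (source distributions in the rows of `F`, basis distributions in the rows of `G`) and restricts both to K chosen observable states. The array names and the helper `basis_volume` are hypothetical. It shows that, on a fixed set of columns, the product of the two absolute determinants stays equal to that of the source matrix, so a smaller determinant of the mixture matrix goes together with a larger determinant of the restricted basis.

```python
import numpy as np


def basis_volume(M, F, cols):
    """|det| of the decomposition basis restricted to K chosen observable states."""
    G = np.linalg.solve(M, F)              # basis under the assumed convention F = M @ G
    return abs(np.linalg.det(G[:, cols]))


F = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.3, 0.3]])
cols = [0, 2]                              # any K = 2 observable states

M_wide = np.array([[0.6, 0.4], [0.4, 0.6]])           # |det| = 0.2
M_tight = np.array([[5 / 7, 2 / 7], [4 / 7, 3 / 7]])  # |det| = 1/7

source_volume = abs(np.linalg.det(F[:, cols]))
for M in (M_wide, M_tight):
    product = abs(np.linalg.det(M)) * basis_volume(M, F, cols)
    print(np.isclose(product, source_volume), basis_volume(M, F, cols))
```

Both matrices here are valid mixture matrices for the illustrative `F`; the one with the smaller determinant yields the basis with the larger restricted determinant.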
Let be an optimal distribution decomposition for Problem 1, then the distribution decomposition , obtained by a permutation of the decomposition basis , is an optimal distribution decomposition as well ( is the corresponding permutation matrix).
To better understand the meaning of the extreme effect variable, we consider two special cases of the observed and variant variables.
5.2 The case of
In this case, the observable variable has the same number of states as the variant and the effect variables. Since , we enumerate the elements of this set: and consider the functions and in the following matrix forms:
Hence, in this case, the restriction on the functions () to be distributions (as in Def. 4) is equivalent to the restriction on the matrix () to be a stochastic one (as in Def. 3). Thus, given the stochastic matrix , the distribution decomposition in Eq. (6) could be written in the following form:
(a) If is a stochastic matrix, then . (b) If is a stochastic matrix and , then is a permutation matrix.
In the case of , the matrix is the extreme mixture matrix and the functions defined by are the extreme decomposition basis.
First, note that the decomposition basis defined by Eq. (9) is and its components satisfy the conditions in Eq. (8). Hence, the matrix is a mixture matrix for Problem 2 (i.e., ). Second, we will show that it is the extreme mixture matrix. From Eq. (12) which holds for any mixture matrix , we get the following identity:
Thus, the minimization of the determinant is equivalent to the maximization of the determinant , since is given. On the one hand, Lemma 2 implies , since the matrix , which corresponds to the decomposition basis , is a stochastic one. On the other hand, for , we have , which reaches the maximum (i.e., ). Hence, the theorem is proved. ∎
This theorem implies that the observable variable is the extreme effect variable . Moreover, any extreme effect variable is equal to the observable variable up to a proper matching of their states (see Statement 2), i.e. .
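A small numerical illustration of this special case, under the same assumed convention `F = M @ G` (rows are distributions): when the number of observable states equals K and the source distributions are linearly independent, taking the matrix of source distributions itself as the mixture matrix gives the identity matrix as the basis. The function name `extreme_decomposition_square` and the example values are hypothetical.

```python
import numpy as np


def extreme_decomposition_square(F):
    """Square case: use F itself as the mixture matrix; the basis is the identity."""
    F = np.asarray(F, dtype=float)
    if F.ndim != 2 or F.shape[0] != F.shape[1]:
        raise ValueError("expected a square matrix of source distributions")
    M = F.copy()
    G = np.linalg.solve(M, F)   # numerically the identity when F is invertible
    return M, G


# Three variants, three observable states (rows are conditional distributions).
F = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

M, G = extreme_decomposition_square(F)
print(np.allclose(G, np.eye(3)))
# The trivial solution (identity mixture matrix, basis F) has determinant 1,
# which is not smaller in absolute value than the determinant of F.
print(abs(np.linalg.det(M)) <= 1.0)
```

Each basis distribution is concentrated on a single observable state, which matches the statement that the observable variable plays the role of the extreme effect variable here.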
5.3 The binary variable case ()
We return to the general case of and consider now the binary variant variable (and, thus, any effect variable is also binary). This leads us to the situation considered in previous work on binary effect variables. Let and , then any mixture matrix and the distribution decomposition in Eq. (6) could be written in the following form:
The conditions in Eq. (8) are linear for (), namely, they are
Hence, the set of all solutions of Problem 1 in terms of the set on the plane of the mixture coefficients is the union of two quadrangles that are centrally symmetric to each other w.r.t. the point and are bounded in the unit square by the union of two pairs of intersecting half-planes (see Fig. 1):
In the considered case, the minimization of the absolute value of the determinant of the mixture matrix is equivalent to the minimization of the absolute value of the difference between the mixture coefficients, since the determinant of a 2×2 stochastic matrix equals the difference between the two entries of its first column. In other words, the extreme effect variable has the decomposition basis that yields the minimal difference between the probabilities of the variable being in a given state for the two variants. Therefore, in the binary case, Problem 2 is linear both in terms of the objective and in terms of the constraints (15) (in contrast to the cases with more than two states, where the problem is non-linear). Note that there are only two extreme solutions (see Fig. 1), and they can be obtained from each other by swapping the two basis distributions together with the corresponding mixture coefficients. Thus, from Eq. (16), Eq. (17), and Fig. 1, one infers that the extreme mixture coefficients have the form (for the case of ):
and the difference . Note that the relative difference for the extreme case has a very simple form: .
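The binary case can be sketched in a few lines of code. The closed form used below is derived here from the non-negativity constraints on the basis and is an illustration under stated assumptions, not a transcription of the formulas of this section. It assumes the convention `F = M @ G` with `M = [[a, 1 - a], [b, 1 - b]]` and `a > b`, and the names `f_a`, `f_b`, `a`, `b`, `r1`, `r2` are hypothetical. Requiring both basis distributions to be non-negative gives `(1 - a) <= r1 * (1 - b)` and `b <= r2 * a`, where `r1` is the minimum of `f_a / f_b` and `r2` is the minimum of `f_b / f_a`; minimizing `a - b` makes both constraints tight.

```python
import numpy as np


def extreme_binary_decomposition(f_a, f_b):
    """Extreme mixture matrix and basis for two distinct discrete distributions."""
    f_a = np.asarray(f_a, dtype=float)
    f_b = np.asarray(f_b, dtype=float)
    if np.allclose(f_a, f_b):
        raise ValueError("the source distributions must differ")
    # Smallest likelihood ratios over the states where the denominator is positive.
    r1 = np.min(f_a[f_b > 0] / f_b[f_b > 0])
    r2 = np.min(f_b[f_a > 0] / f_a[f_a > 0])
    # Both non-negativity constraints are tight at the minimum of a - b.
    denom = 1.0 - r1 * r2
    a = (1.0 - r1) / denom
    b = r2 * (1.0 - r1) / denom
    M = np.array([[a, 1.0 - a],
                  [b, 1.0 - b]])
    G = np.linalg.solve(M, np.vstack([f_a, f_b]))
    G = np.clip(G, 0.0, None)   # remove tiny negative values caused by rounding
    return M, G


f_a = np.array([0.5, 0.3, 0.2])   # illustrative distribution under variant A
f_b = np.array([0.4, 0.3, 0.3])   # illustrative distribution under variant B
M, G = extreme_binary_decomposition(f_a, f_b)
print(M)
print(G)
print(np.allclose(M @ G, np.vstack([f_a, f_b])))
```

For these illustrative inputs the mixture coefficients are 5/7 and 4/7, so their difference is 1/7, compared with a difference of 1 for the trivial solution. The second extreme solution is obtained by swapping the two rows of `G` and the two columns of `M`.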
In this section, we introduced the notion of the extreme effect variable and proved its existence. We showed that its extreme property is two-fold: first, the distributions of the variable w.r.t. the variants are as close to each other as possible (in terms of the volume of its mixture matrix); and, second, the states of the variable are extremely different w.r.t. the observable variable (in terms of the conditional distributions). These intuitions are confirmed in the two considered special cases. The first one relates to the case of an equal number of states of the observable variable and the variant one, where we show that the extreme effect variable is the observable one. The second case considers the binary effect variable, where we explain its geometric interpretation and present exact formulas for the distributions and for the extreme effect variable. This binary case of the variants is popular in applications: for instance, in A/B testing, the state-of-the-art technique to evaluate a web service [41, 2, 22, 8, 14], two variants of the service are compared w.r.t. a key metric (see the next section).
Table 1: The extreme mixture matrices for Exp. #1 (the number of clicks) and Exp. #2 (the presence time).
6 Application to online evaluation
Online controlled experiments, such as A/B tests, are the state-of-the-art techniques for improving web services based on data-driven decisions and are widely used by many Internet companies [41, 2, 22, 8, 14]. The aim of a controlled experiment is to detect the causal effect of system updates on performance, relying on a user behavior metric that is assumed to correlate with the quality of the system. Users are randomly exposed to one of the two variants of a service (the control (A) and the treatment (B), e.g., the current production version of the service and its update) in order to compare these variants.
The experiment's objective is quantitatively measured by an evaluation metric (also known as the online service quality metric, the Overall Evaluation Criterion (OEC), etc.), which is usually the average value of a scalar measure (a key metric) over the events (entities), e.g., users. Thus, for each user group, we have the distribution of the measure over the experimental units. Then, the average values for the two groups are used as the OEC, and their difference is calculated. Since the set of variants, the observable variable, and its empirical conditional distributions are given, we are able to apply the technique described in Section 5.3 in order to find the extreme effect variable and the optimal decomposition of the distributions.
In our study, we consider two popular engagement metrics of user activity: the number of clicks made by a user and the presence time of a user, which are defined in the same way as in [20, 38, 28, 13, 14]. For each metric, we consider a large-scale A/B experiment conducted on real users of Yandex (https://www.yandex.com), one of the popular web search engines. Each of these A/B tests was designed to evaluate a noticeable deterioration of the user interface of the service, lasted two weeks, and affected at least hundreds of thousands of users.
The treatment effect of the first A/B experiment is detected by the decrease of the number of clicks per user by (i.e., ), while the treatment effect of the second one is detected by the decrease of the presence time per user by (i.e., ). (Footnote 6: The differences in both experiments are significant, with the p-value of the two-sample t-test below the state-of-the-art threshold [23, 9, 38, 22, 13, 14].) In Fig. 2(a), we present the distributions and for the metrics of these experiments. (Footnote 7: The values of the metrics are logarithmically transformed, multiplied by a fixed random constant and, then, are binarized in order to obtain discrete distributions for the observable variable. The random constant is hidden for confidentiality reasons.) Then, we apply our extreme effect variable approach (see Eq. (15) and Eq. (19)), which results in the extreme mixture matrix and the extreme decomposition basis , which are presented in Table 1 and Fig. 2(b), respectively, for each A/B experiment.
We see that in both experiments the distribution corresponds to the extreme state that has lower values of than the extreme state : in Exp. #1 (in Exp. #2), corresponds to the state, where users have lower number of clicks (presence time) than in . In both experiments, the fraction of the extreme state is increased in the treatment variant B of the service w.r.t. the control one (see in Table 1). Hence, we can conclude that the treatment effect in both experiments is negative, since the fraction of the negative extreme state is increased by percents (by for Exp. #1 and by for Exp. #2). This coincides with the conclusion based on the difference of the mean values of the key metrics ().
In summary, our approach provides additional information that allows us to understand more clearly how exactly user behavior, observed through the variable , differs between the variants in an A/B experiment. First, the approach explains the difference between the observed distributions and through a description of the extreme latent states, such that their mixture is minimally affected by the treatment effect of the A/B experiment (see Def. 5). Second, our approach quantifies the total difference between the distributions in one scalar metric , whose relative difference between the variants is noticeably higher than the one for the mean value (i.e., ), since the extreme effect variable captures the entire treatment effect (see Def. 1 and 5) observed in the whole conditional distribution of the variable , not only in its mean value.
In our study, we generalized the notion of the effect variable to the (multidimensional) finite-state case. We translated the problem of finding an effect variable to the simultaneous decomposition of the conditional distributions of the observable variable under the states of the variant variable. We conducted theoretical analysis of these problems and their solutions. We applied our approach to online evaluation of a web search engine through A/B testing and showed its utility by providing clear additional intuition about the evaluation criterion.
I would like to thank Gleb Gusev, who discussed this study with me and inspired me to publish it.
References
[1] Owe Axelsson. Iterative solution methods. Cambridge University Press, 1996.
[2] Eytan Bakshy and Dean Eckles. Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods. In KDD'2013, pages 1303–1311, 2013.
[3] Alexander Balke and Judea Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176, 1997.
[4] Roman Budylin, Alexey Drutsa, Gleb Gusev, Eugene Kharitonov, Pavel Serdyukov, and Igor Yashkov. Online evaluation for effective web service development: Extended abstract of the tutorial at TheWebConf'2018. 04 2018.
[5] Roman Budylin, Alexey Drutsa, Gleb Gusev, Pavel Serdyukov, and Igor Yashkov. Online evaluation for effective web service development. In arXiv preprint arXiv:1809.00661. Tutorial at KDD'2018, 08 2018.
[6] Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. Consistent transformation of ratio metrics for efficient online controlled experiments. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 55–63, 2018.
[7] Thomas Claassen. Causal discovery and logic. UB Nijmegen, 2013.
[8] Alex Deng, Tianxi Li, and Yu Guo. Statistical inference in two-stage online controlled experiments with treatment selection and validation. In WWW'2014, pages 609–618, 2014.
[9] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In WSDM'2013, 2013.
[10] Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Lectures on algebraic statistics, volume 39. Springer Science & Business Media, 2008.
[11] Alexey Drutsa. Sign-aware periodicity metrics of user engagement for online search quality evaluation. In SIGIR'2015, pages 779–782, 2015.
[12] Alexey Drutsa, Gleb Gusev, Eugene Kharitonov, Denis Kulemyakin, Pavel Serdyukov, and Igor Yashkov. Effective online evaluation for web search. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1399–1400. ACM, 2019.
[13] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. Engagement periodicity in search engine usage: Analysis and its application to search quality evaluation. In WSDM'2015, pages 27–36, 2015.
[14] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. Future user engagement prediction and its application to improve the sensitivity of online experiments. In WWW'2015, pages 256–266, 2015.
[15] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. Periodicity in user engagement with a search engine and its application to online controlled experiments. ACM Transactions on the Web (TWEB), 11, 2017.
[16] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. Using the delay in a treatment effect to improve sensitivity and preserve directionality of engagement metrics in A/B experiments. In WWW'2017, 2017.
[17] Alexey Drutsa, Anna Ufliand, and Gleb Gusev. Practical aspects of sensitivity in online experimentation with user engagement metrics. In CIKM'2015, pages 763–772, 2015.
[18] Georges Dupret and Mounia Lalmas. Absence time and user engagement: evaluating ranking functions. In WSDM'2013, pages 173–182, 2013.
[19] David A. Freedman, David Collier, Jasjeet S. Sekhon, and Philip B. Stark. Statistical models and causal inference: a dialogue with the social sciences. Cambridge University Press, 2010.
[20] Bernard J. Jansen, Amanda Spink, and Vinish Kathuria. How to define searching sessions on web search engines. In Advances in Web Mining and Web Usage Analysis, pages 92–109. Springer, 2007.
[21] Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. Learning sensitive combinations of A/B test metrics. In WSDM'2017, 2017.
[22] R. Kohavi, A. Deng, R. Longbotham, and Y. Xu. Seven rules of thumb for web site experimenters. In KDD'2014, 2014.
[23] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Discov., 18(1):140–181, 2009.
[24] Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. Online experimentation at Microsoft. Data Mining Case Studies, page 11, 2009.
[25] Peter Lancaster and Miron Tismenetsky. The theory of matrices: with applications. Academic Press, 1985.
[26] Steffen L. Lauritzen. Sufficiency, prediction and extreme models. Scandinavian Journal of Statistics, 1:128–134, 1974.
[27] Steffen L. Lauritzen. Extreme point models in statistics. Scandinavian Journal of Statistics, 11:65–91, 1984. With discussion and response.
[28] Janette Lehmann, Mounia Lalmas, Georges Dupret, and Ricardo Baeza-Yates. Online multitasking and user engagement. In CIKM'2013, pages 519–528, 2013.
[29] Jan Lemeire and Erik Dirkx. Causal models as minimal descriptions of multivariate systems, 2006.
[30] Charles F. Manski. Identification for prediction and decision. Harvard University Press, 2009.
[31] Stephen L. Morgan and Christopher Winship. Counterfactuals and causal inference. Cambridge University Press, 2014.
[32] Kirill Nikolaev, Alexey Drutsa, Ekaterina Gladkikh, Alexander Ulianov, Gleb Gusev, and Pavel Serdyukov. Extreme states distribution decomposition method for search engine online evaluation. In KDD'2015, pages 845–854, 2015.
[33] Judea Pearl. Causality: models, reasoning and inference, volume 29. Cambridge Univ Press, 2000.
[34] Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. In KDD'2016, pages 235–244, 2016.
[35] Kerry Rodden, Hilary Hutchinson, and Xin Fu. Measuring the user experience on a large scale: user-centered metrics for web applications. In CHI'2010, pages 2395–2398, 2010.
[36] Wesley C. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburgh Press, Pittsburgh, 1971. With contributions by Richard C. Jeffrey and James G. Greeno.
[37] Cosma Rohilla Shalizi and James P. Crutchfield. Information bottlenecks, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Advances in Complex Systems, 5:91–95, 2002.
[38] Yang Song, Xiaolin Shi, and Xin Fu. Evaluating and predicting user engagement change with degraded search relevance. In WWW'2013, pages 1213–1224, 2013.
[39] Peter Spirtes, Clark N. Glymour, and Richard Scheines. Causation, prediction, and search, volume 81. MIT Press, 2000.
[40] Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M. Mooij, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In NIPS'2010, pages 1687–1695, 2010.
[41] Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation. In KDD'2010, pages 17–26, 2010.
[42] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In B. Hajek and R. S. Sreenivas, editors, Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, Urbana, Illinois, 1999. University of Illinois Press.
[43] Matthias von Davier and Claus H. Carstensen. Multivariate and mixture distribution Rasch models: Extensions and applications. Springer Science & Business Media, 2007.
[44] Minsheng Wang, A. I. Chan, and Charles K. Chui. Wigner-Ville distribution decomposition via wavelet packet transform. In Time-Frequency and Time-Scale Analysis, 1996, Proceedings of the IEEE-SP International Symposium on, pages 413–416. IEEE, 1996.