Nowadays, the development of most leading web services, and of software products in general, is guided by data-driven decisions based on online evaluation, which qualifies and quantifies the steady stream of web service updates. Online evaluation is used continuously and on a large scale in modern Internet companies, such as search engines (Deng et al., 2014; Hohnhold et al., 2015; Drutsa et al., 2015b), social networks (Bakshy and Eckles, 2013; Xu and Chen, 2016), media providers (Bakshy and Eckles, 2013), and online retailers. Yandex runs more than 100 online evaluation experiments per day; Bing reported running more than 200 A/B tests per day (Kohavi et al., 2013); and Google conducted more than 1000 experiments (Hohnhold et al., 2015). The number of smaller companies that use A/B testing in the development cycle of their products is growing as well. The development of such services strongly depends on the quality of the experimentation platforms. In this tutorial, we overview the state-of-the-art methods underlying everyday evaluation pipelines.
At the beginning of this tutorial (which is a shorter version of (Budylin et al., 2018a)), we introduce online evaluation and give the basic background from mathematical statistics (40 min, Section 1). Then, we share approaches to the development of online metrics (50 min, Section 2). This is followed by rich industrial experience in constructing an experimentation pipeline and evaluation metrics, with emphasis on best practices and common pitfalls (55 min, Section 3). A large part of our tutorial is devoted to modern and state-of-the-art techniques (including ones based on machine learning) that make it possible to conduct online experimentation efficiently (65 min, Section 4). Finally, we point out open research questions and current challenges that should be of interest to research scientists.
1. Statistical foundation
We introduce the main probabilistic terms that form the theoretical foundation of A/B testing. Observed values are treated as random variables sampled from an unknown distribution, and evaluation metrics are statistics computed from these observations (mean, median, quantiles, etc.). We give an overview of statistical hypothesis testing, with definitions of the p-value, type I error, and type II error. We discuss several statistical tests (Student’s t-test, the Mann–Whitney U test, and the bootstrap (Efron and Tibshirani, 1994)) and compare their properties and applicability (Drutsa et al., 2015c).
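To make the comparison of tests concrete, the following is a minimal sketch of two of them on user-level samples. It assumes Welch's unequal-variance form of the t statistic with a normal approximation for the p-value (reasonable only for large samples), and a simple bootstrap that centers the resampled differences at the observed one. The function names, parameters, and simulated data are illustrative and do not come from the tutorial.

```python
import math
import random
import statistics


def welch_t_test(control, treatment):
    """Welch's t statistic and a two-sided p-value (normal approximation)."""
    mean_c, mean_t = statistics.fmean(control), statistics.fmean(treatment)
    var_c, var_t = statistics.variance(control), statistics.variance(treatment)
    std_err = math.sqrt(var_c / len(control) + var_t / len(treatment))
    t_stat = (mean_t - mean_c) / std_err
    # Assumption: samples are large enough to replace Student's t by N(0, 1).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_value


def bootstrap_p_value(control, treatment, n_resamples=1000, seed=0):
    """Two-sided bootstrap p-value for the difference in means."""
    rng = random.Random(seed)
    observed = statistics.fmean(treatment) - statistics.fmean(control)
    extreme = 0
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        resampled_c = rng.choices(control, k=len(control))
        resampled_t = rng.choices(treatment, k=len(treatment))
        diff = statistics.fmean(resampled_t) - statistics.fmean(resampled_c)
        # Centering at the observed difference approximates the null distribution.
        if abs(diff - observed) >= abs(observed):
            extreme += 1
    return extreme / n_resamples


# Simulated user-level metric values (hypothetical data).
rng = random.Random(1)
control = [rng.gauss(10.0, 2.0) for _ in range(200)]
treatment = [rng.gauss(11.5, 2.0) for _ in range(200)]

t_stat, p_normal = welch_t_test(control, treatment)
p_boot = bootstrap_p_value(control, treatment)
print(f"t = {t_stat:.2f}, p (normal approx.) = {p_normal:.3g}, p (bootstrap) = {p_boot:.3g}")
```

The bootstrap makes no distributional assumption about the metric but costs many passes over the data, which is one of the trade-offs between the tests.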
2. Development of online metrics
We discuss in depth how to build evaluation metrics, which are the main ingredient of an online experimentation pipeline. First, we introduce the notion of an A/B test (also known as an online controlled experiment): it compares two variants of a service at a time, usually its current version (control) and a new one (treatment), by exposing them to two groups of users (Peterson, 2004; Kohavi et al., 2007; Kohavi et al., 2009b). The main components of a metric are presented: the key metric, the evaluation statistic, the statistical significance test, the Overall Evaluation Criterion (OEC) (Kohavi et al., 2009b), and the Overall Acceptance Criterion (OAC) (Drutsa et al., 2015c). The aim of controlled experiments is to detect the causal effect of service updates on service performance, relying on an Overall Evaluation Criterion (OEC) (Kohavi et al., 2009b), a user behavior metric (e.g., clicks-per-user, sessions-per-user, etc.) that is assumed to correlate with the quality of the service. We show that developing a good new metric is a challenging goal, since an appropriate OAC should possess two crucial qualities: directionality (the sign of the detected treatment effect should align with the positive/negative impact of the treatment on user experience) and sensitivity (the ability to detect a statistically significant difference when the treatment effect exists) (Kohavi et al., 2012; Nikolaev et al., 2015; Poyarkov et al., 2016; Deng and Shi, 2016). The former property makes it possible to draw correct conclusions about changes in system quality (Kohavi et al., 2012; Nikolaev et al., 2015; Deng and Shi, 2016), while improving the latter makes it possible to detect metric changes in more experiments and to use fewer users (Kharitonov et al., 2015c, b; Poyarkov et al., 2016).
Second, we provide evaluation criteria beyond averages: (a) how to evaluate periodicity (Drutsa et al., 2017a; Drutsa, 2015; Drutsa et al., 2017b) and trends (Drutsa et al., 2015a; Drutsa, 2015; Drutsa et al., 2017a) of user behavior over days, e.g., for detection of delayed treatment effects; and (b) how to evaluate frequent/rare behavior and diversity in behavior between users that cannot be detected by mean values (Drutsa et al., 2015c). Third, product-based aspects in metric building are presented. Namely, we discuss vulnerability of metrics such as a click on a button to switch a search engine (Arkhipova et al., 2015) (i.e., how can a metric be gamed or manipulated); ways to measure different aspects of a service (i.e., speed (Kohavi et al., 2010; Kohavi et al., 2014), absence (Chakraborty et al., 2014), abandonment (Kohavi et al., 2014)); difference between metrics of user loyalty and ones of user activity (Rodden et al., 2010; Lalmas et al., 2014; Drutsa et al., 2015a, b; Drutsa, 2015; Drutsa et al., 2015c); dwell time to improve click-based metrics (Kim et al., 2014); how to evaluate the number of user tasks (which can have a complex hierarchy (Boldi et al., 2009)) by means of sessions (Song et al., 2013); and issues in session division (Jones and Klinkner, 2008) as well.
Fourth, math-based approaches to improve metrics are considered. In particular, we discuss the powerful method of linearization (Budylin et al., 2018b) that reduces any ratio metric to the average of a user-level metric preserving directionality and allowing usage of a wide range of sensitivity improvement techniques developed for user-level metrics. We also describe different methods of noise reduction (such as capping (Kohavi et al., 2014), slicing (Song et al., 2013; Deng and Hu, 2015), taking into account user activity, etc.) and of utilization of a user generated content approach.
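The idea behind linearization can be sketched as follows, assuming the user-level value takes the form L(u) = X(u) - kappa * Y(u), where X(u) and Y(u) are a user's numerator and denominator contributions and kappa is set to the ratio measured in the control group. The function names and the toy (clicks, sessions) data are illustrative only.

```python
from statistics import fmean


def ratio_metric(users):
    """Ratio metric: total numerator over total denominator (e.g. clicks per session)."""
    return sum(x for x, _ in users) / sum(y for _, y in users)


def linearize(users, kappa):
    """User-level linearized values L(u) = X(u) - kappa * Y(u)."""
    return [x - kappa * y for x, y in users]


# Toy per-user (clicks, sessions) pairs; purely illustrative data.
control = [(2, 4), (1, 2), (3, 6)]
treatment = [(3, 4), (2, 2), (3, 5)]

kappa = ratio_metric(control)  # assumed choice: the control-group ratio
delta_ratio = ratio_metric(treatment) - kappa
delta_linearized = fmean(linearize(treatment, kappa)) - fmean(linearize(control, kappa))

print(f"ratio difference:      {delta_ratio:+.4f}")
print(f"linearized difference: {delta_linearized:+.4f}")
```

With this choice of kappa the control average of L(u) is zero, and the treatment average equals the mean denominator per user times the ratio difference, so the two differences share a sign whenever the denominator is positive. Because L(u) is an ordinary per-user value, standard user-level tests and variance reduction techniques apply to it directly.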
Finally, some system requirements for metric building are discussed. We explain how to get a set of experiments with verdicts (known positiveness or negativeness), how to construct a pipeline to easily implement and test metrics, and how to measure metrics (Dmitriev and Wu, 2016).
3. Experimentation pipeline and workflow in the light of industrial practice
We share rich industrial experience in constructing an experimentation pipeline in large Internet companies. First, we discuss how experiments can be used to evaluate changes in various components of web services: the user interface (Kohavi et al., 2009a; Drutsa et al., 2015a; Nikolaev et al., 2015; Drutsa et al., 2017a), ranking algorithms (Song et al., 2013; Drutsa et al., 2015a; Nikolaev et al., 2015; Drutsa et al., 2017a), sponsored search (Chawla et al., 2016), and mobile apps (Xu and Chen, 2016). Second, we consider several real cases of experiments, where pitfalls (Crook et al., 2009; Kohavi et al., 2012; Kohavi et al., 2014; Deng and Shi, 2016) are demonstrated and lessons are learned. In particular, we discuss conflicting experiments, network effects (Gui et al., 2015), duration and seasonality (Shokouhi, 2011; Drutsa et al., 2015a), logging, and slices.
Third, we provide our management methodology to conduct experiments efficiently and to avoid the pitfalls. This methodology is based on pre-launch checklists and a team of Experts on Experiments (EE). We also present our system of tournaments, where problems similar to the ones in two-stage A/B testing (Deng et al., 2014) are solved. Finally, we discuss how large-scale experimental infrastructure (Tang et al., 2010; Kohavi et al., 2013; Xu et al., 2015) can be used to collect experiments for metric evaluation (Dmitriev and Wu, 2016).
4. Machine learning driven A/B testing
A large part of our tutorial is devoted to modern and state-of-the-art techniques (including ones based on machine learning) that improve the efficiency of online experiments. We start this section with a comparison of randomized experiments and observational studies. We explain that the difference between averages of the key metric may be misleading when measured in an observational study. We introduce the Neyman–Rubin model and rigorously formulate the implicit assumptions we make each time we evaluate the results of randomized experiments.
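For reference, the core of the Neyman–Rubin potential-outcomes model can be written as below. This is the standard textbook formulation rather than the tutorial's own notation; the symbols $Y_i(0)$, $Y_i(1)$, $W_i$, and $\tau$ are introduced here for illustration.

```latex
% Potential outcomes of user i under control (0) and treatment (1);
% W_i \in \{0, 1\} is the treatment assignment indicator.
Y_i^{\mathrm{obs}} = W_i \, Y_i(1) + (1 - W_i) \, Y_i(0)

% Average treatment effect, the quantity an A/B test aims to estimate:
\tau = \mathbb{E}\left[ Y_i(1) - Y_i(0) \right]

% The naive difference of group averages decomposes as
\mathbb{E}\left[ Y^{\mathrm{obs}} \mid W = 1 \right] - \mathbb{E}\left[ Y^{\mathrm{obs}} \mid W = 0 \right]
  = \underbrace{\mathbb{E}\left[ Y(1) - Y(0) \mid W = 1 \right]}_{\text{effect on the treated}}
  + \underbrace{\mathbb{E}\left[ Y(0) \mid W = 1 \right] - \mathbb{E}\left[ Y(0) \mid W = 0 \right]}_{\text{selection bias}}

% Under randomization, (Y(0), Y(1)) \perp W, so the selection bias is zero
% and the effect on the treated equals \tau.
```

In an observational study the selection-bias term need not vanish, which is why a raw difference of averages can mislead. Writing a user's outcome as $Y_i(w)$ also silently assumes that it does not depend on the assignments of other users.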
Then we overview several studies devoted to variance reduction for evaluation metrics. Regression adjustment techniques such as stratification and linear models (Deng et al., 2013; Xie and Aurisset, 2016) reduce the variance related to the observed features (covariates) of the users. We also consider experiments with user experience where the effect of a service change is heterogeneous (i.e., differs across users of different types). We overview the main approaches to estimating the heterogeneous treatment effect as a function of user features (Powers et al., 2017; Athey and Imbens, 2015).
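As one illustration of a linear-model adjustment, the sketch below uses a single pre-experiment covariate as a control variate, in the spirit of the use of pre-experiment data in (Deng et al., 2013). It is a simplified sketch, not the cited implementation: the function name and the data are hypothetical, and in a real experiment the coefficient would be estimated on pooled data from both groups.

```python
from statistics import fmean, variance


def control_variate_adjust(metric, covariate):
    """Return metric values adjusted by a centered pre-experiment covariate.

    Each value becomes y - theta * (x - mean(x)), where theta = cov(x, y) / var(x)
    is the slope of the least-squares fit of the metric on the covariate.
    """
    mean_x = fmean(covariate)
    mean_y = fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


# Hypothetical per-user values: activity before and during the experiment.
pre = [1.0, 2.0, 3.0, 4.0, 5.0]
during = [2.1, 2.9, 4.2, 4.8, 6.0]

adjusted = control_variate_adjust(during, pre)
print(f"mean:     {fmean(during):.3f} -> {fmean(adjusted):.3f}")
print(f"variance: {variance(during):.3f} -> {variance(adjusted):.3f}")
```

The adjustment leaves the mean unchanged because the covariate is centered, while the variance shrinks in proportion to the squared correlation between the covariate and the metric. The covariate must be measured before the experiment starts so that the treatment cannot affect it.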
We explain the Optimal Distribution Decomposition (ODD) approach, which analyzes the control and treatment distributions of the key metric as a whole and, for this reason, is sensitive to more of the ways in which the two distributions may actually differ (Nikolaev et al., 2015). We also discuss a method of virtually increasing the experiment duration through prediction of future user engagement (Drutsa et al., 2015b). We also provide another way to improve sensitivity, based on learning metric combinations (Kharitonov et al., 2017). This approach showed outstanding sensitivity improvements in a large-scale empirical evaluation (Kharitonov et al., 2017).
Finally, we discuss ways to improve the performance of the experimentation pipeline as a whole. We present optimal scheduling of online evaluation experiments (Kharitonov et al., 2015b) and highlight approaches to stopping them early, discussing both the inflation of the Type I error (Johari et al., 2017) and ways to conduct sequential testing correctly (Kharitonov et al., 2015c; Deng et al., 2016).
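The Type I error inflation caused by naive early stopping can be illustrated with a small simulation of A/A experiments, where no true effect exists. The sketch below assumes unit-variance Gaussian observations and a z-test with known variance; the function name, the number of looks, and all other parameters are illustrative choices rather than settings from the cited papers.

```python
import math
import random


def simulate_aa_false_positives(n_experiments=400, n_per_group=500,
                                n_looks=10, z_crit=1.96, seed=0):
    """Compare false positive rates of peeking and of a single final test.

    Returns (peeking_rate, fixed_rate): the share of A/A experiments rejected
    at any of the interim looks, and the share rejected at the final look only.
    """
    rng = random.Random(seed)
    # Sample sizes at which the experimenter looks; the last one is n_per_group.
    checkpoints = {round(n_per_group * k / n_looks) for k in range(1, n_looks + 1)}
    peeking_rejections = 0
    fixed_rejections = 0
    for _ in range(n_experiments):
        sum_c = sum_t = 0.0
        z = 0.0
        rejected_at_any_look = False
        for n in range(1, n_per_group + 1):
            # Both groups are drawn from the same distribution (A/A test).
            sum_c += rng.gauss(0.0, 1.0)
            sum_t += rng.gauss(0.0, 1.0)
            if n in checkpoints:
                # z statistic for the difference in means with known unit variance.
                z = (sum_t - sum_c) / math.sqrt(2.0 * n)
                rejected_at_any_look = rejected_at_any_look or abs(z) > z_crit
        # The last checkpoint is n_per_group, so z holds the fixed-horizon statistic.
        peeking_rejections += rejected_at_any_look
        fixed_rejections += abs(z) > z_crit
    return peeking_rejections / n_experiments, fixed_rejections / n_experiments


peeking_rate, fixed_rate = simulate_aa_false_positives()
print(f"false positive rate with peeking:    {peeking_rate:.3f}")
print(f"false positive rate, single final test: {fixed_rate:.3f}")
```

Stopping at the first significant look rejects in every experiment that the final test rejects and in additional ones, so its false positive rate ends up noticeably above the nominal 5% level. Sequential testing procedures adjust the decision boundaries to keep this rate under control.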
We also highlight important topics not covered by the tutorial: Bayesian approaches (Deng, 2015; Deng et al., 2016) and non-parametric mSRPT (Abhishek and Mannor, 2017) in sequential testing; network A/B testing (Gui et al., 2015; Saveski et al., 2017); two-stage A/B testing (Deng et al., 2014); Imperfect Treatment Assignment (Coey and Bailey, 2016); and interleaving (Joachims, 2002; Joachims et al., 2003; Radlinski et al., 2008; Hofmann et al., 2011; Chapelle et al., 2012; Radlinski and Craswell, 2013; Schuth et al., 2014; Kharitonov et al., 2015a; Aurisset et al., 2017; Radlinski and Yue, 2011; Grotov and de Rijke, 2016; Radlinski and Hofmann, 2013; Radlinski, 2013).
The tutorial materials (slides) are available at https://research.yandex.com/tutorials/online-evaluation/kdd-2018.
- Abhishek and Mannor (2017) Vineet Abhishek and Shie Mannor. 2017. A nonparametric sequential test for online randomized experiments. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 610–616.
- Arkhipova et al. (2015) Olga Arkhipova, Lidia Grauer, Igor Kuralenok, and Pavel Serdyukov. 2015. Search Engine Evaluation based on Search Engine Switching Prediction. In SIGIR’2015. ACM, 723–726.
- Athey and Imbens (2015) Susan Athey and Guido Imbens. 2015. Machine Learning Methods for Estimating Heterogeneous Causal Effects. arXiv preprint arXiv:1504.01132 (2015).
- Aurisset et al. (2017) Juliette Aurisset, Michael Ramm, and Joshua Parks. 2017. Innovating Faster on Personalization Algorithms at Netflix Using Interleaving. https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55. (2017).
- Bakshy and Eckles (2013) Eytan Bakshy and Dean Eckles. 2013. Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods. In KDD’2013. 1303–1311.
- Boldi et al. (2009) Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. 2009. From Dango to Japanese cakes: Query reformulation models and patterns. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 01. IEEE Computer Society, 183–190.
- Budylin et al. (2018a) Roman Budylin, Alexey Drutsa, Gleb Gusev, Eugene Kharitonov, Pavel Serdyukov, and Igor Yashkov. 2018a. Online Evaluation for Effective Web Service Development: Extended Abstract of the Tutorial at TheWebConf’2018.
- Budylin et al. (2018b) Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018b. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 55–63.
- Chakraborty et al. (2014) Sunandan Chakraborty, Filip Radlinski, Milad Shokouhi, and Paul Baecke. 2014. On correlation of absence time and search effectiveness. In SIGIR’2014. 1163–1166.
- Chapelle et al. (2012) Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS) 30, 1 (2012), 6.
- Chawla et al. (2016) Shuchi Chawla, Jason Hartline, and Denis Nekipelov. 2016. A/B testing of auctions. In EC’2016.
- Coey and Bailey (2016) Dominic Coey and Michael Bailey. 2016. People and cookies: Imperfect treatment assignment in online experiments. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1103–1111.
- Crook et al. (2009) Thomas Crook, Brian Frasca, Ron Kohavi, and Roger Longbotham. 2009. Seven pitfalls to avoid when running controlled experiments on the web. In KDD’2009. 1105–1114.
- Deng (2015) Alex Deng. 2015. Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments. In WWW’2015 Companion. 923–928.
- Deng and Hu (2015) Alex Deng and Victor Hu. 2015. Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments. In WSDM’2015. 349–358.
- Deng et al. (2014) Alex Deng, Tianxi Li, and Yu Guo. 2014. Statistical inference in two-stage online controlled experiments with treatment selection and validation. In WWW’2014. 609–618.
- Deng et al. (2016) Alex Deng, Jiannan Lu, and Shouyuan Chen. 2016. Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing. In DSAA’2016.
- Deng and Shi (2016) Alex Deng and Xiaolin Shi. 2016. Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned. In KDD’2016.
- Deng et al. (2013) Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In WSDM’2013. 123–132.
- Dmitriev and Wu (2016) Pavel Dmitriev and Xian Wu. 2016. Measuring Metrics. In CIKM’2016. 429–437.
- Drutsa (2015) Alexey Drutsa. 2015. Sign-Aware Periodicity Metrics of User Engagement for Online Search Quality Evaluation. In SIGIR’2015. 779–782.
- Drutsa et al. (2015a) Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2015a. Engagement Periodicity in Search Engine Usage: Analysis and Its Application to Search Quality Evaluation. In WSDM’2015. 27–36.
- Drutsa et al. (2015b) Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2015b. Future User Engagement Prediction and its Application to Improve the Sensitivity of Online Experiments. In WWW’2015. 256–266.
- Drutsa et al. (2017a) Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2017a. Periodicity in User Engagement with a Search Engine and its Application to Online Controlled Experiments. ACM Transactions on the Web (TWEB) 11 (2017).
- Drutsa et al. (2017b) Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2017b. Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments. In WWW’2017.
- Drutsa et al. (2015c) Alexey Drutsa, Anna Ufliand, and Gleb Gusev. 2015c. Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics. In CIKM’2015. 763–772.
- Efron and Tibshirani (1994) Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press.
- Grotov and de Rijke (2016) Artem Grotov and Maarten de Rijke. 2016. Online learning to rank for information retrieval: Tutorial. In SIGIR.
- Gui et al. (2015) Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 399–409.
- Hofmann et al. (2011) Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2011. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 249–258.
- Hohnhold et al. (2015) Henning Hohnhold, Deirdre O’Brien, and Diane Tang. 2015. Focusing on the Long-term: It’s Good for Users and Business. In KDD’2015. 1849–1858.
- Joachims (2002) Thorsten Joachims. 2002. Unbiased evaluation of retrieval quality using clickthrough data. (2002).
- Joachims et al. (2003) Thorsten Joachims et al. 2003. Evaluating Retrieval Performance Using Clickthrough Data. (2003).
- Johari et al. (2017) Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B Tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1517–1525.
- Jones and Klinkner (2008) Rosie Jones and Kristina Lisa Klinkner. 2008. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 699–708.
- Kharitonov et al. (2017) Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning Sensitive Combinations of A/B Test Metrics. In WSDM’2017.
- Kharitonov et al. (2015a) Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. 2015a. Generalized Team Draft Interleaving. In CIKM’2015.
- Kharitonov et al. (2015b) Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. 2015b. Optimised Scheduling of Online Experiments. In SIGIR’2015. 453–462.
- Kharitonov et al. (2015c) Eugene Kharitonov, Aleksandr Vorobev, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. 2015c. Sequential Testing for Early Stopping of Online Experiments. In SIGIR’2015. 473–482.
- Kim et al. (2014) Youngho Kim, Ahmed Hassan, Ryen W White, and Imed Zitouni. 2014. Modeling dwell time to predict click-level satisfaction. In Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 193–202.
- Kohavi et al. (2009a) Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. 2009a. Online experimentation at Microsoft. Data Mining Case Studies (2009), 11.
- Kohavi et al. (2012) Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. 2012. Trustworthy online controlled experiments: Five puzzling outcomes explained. In KDD’2012. 786–794.
- Kohavi et al. (2013) Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online controlled experiments at large scale. In KDD’2013. 1168–1176.
- Kohavi et al. (2014) Ron Kohavi, Alex Deng, Roger Longbotham, and Ya Xu. 2014. Seven Rules of Thumb for Web Site Experimenters. In KDD’2014.
- Kohavi et al. (2007) Ron Kohavi, Randal M Henne, and Dan Sommerfield. 2007. Practical guide to controlled experiments on the web: listen to your customers not to the hippo. In KDD’2007. 959–967.
- Kohavi et al. (2009b) Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009b. Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Discov. 18, 1 (2009), 140–181.
- Kohavi et al. (2010) Ron Kohavi, David Messner, Seth Eliot, Juan Lavista Ferres, Randy Henne, Vignesh Kannappan, and Justin Wang. 2010. Tracking Users’ Clicks and Submits: Tradeoffs between User Experience and Data Loss. (2010).
- Lalmas et al. (2014) Mounia Lalmas, Heather O’Brien, and Elad Yom-Tov. 2014. Measuring user engagement. Synthesis Lectures on Information Concepts, Retrieval, and Services 6, 4 (2014), 1–132.
- Nikolaev et al. (2015) Kirill Nikolaev, Alexey Drutsa, Ekaterina Gladkikh, Alexander Ulianov, Gleb Gusev, and Pavel Serdyukov. 2015. Extreme States Distribution Decomposition Method for Search Engine Online Evaluation. In KDD’2015. 845–854.
- Peterson (2004) Eric T Peterson. 2004. Web analytics demystified: a marketer’s guide to understanding how your web site affects your business. Ingram.
- Powers et al. (2017) Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H Shah, Trevor Hastie, and Robert Tibshirani. 2017. Some methods for heterogeneous treatment effect estimation in high-dimensions. arXiv preprint arXiv:1707.00102 (2017).
- Poyarkov et al. (2016) Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In KDD’2016. 235–244.
- Radlinski (2013) Filip Radlinski. 2013. Sensitive Online Search Evaluation. http://irsg.bcs.org/SearchSolutions/2013/presentations/radlinski.pdf. (2013).
- Radlinski and Craswell (2013) Filip Radlinski and Nick Craswell. 2013. Optimized interleaving for online retrieval evaluation. In WSDM.
- Radlinski and Hofmann (2013) Filip Radlinski and Katja Hofmann. 2013. Practical online retrieval evaluation. In ECIR.
- Radlinski et al. (2008) Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How does clickthrough data reflect retrieval quality?. In CIKM’2008. 43–52.
- Radlinski and Yue (2011) Filip Radlinski and Yisong Yue. 2011. Practical Online Retrieval Evaluation. In SIGIR.
- Rodden et al. (2010) Kerry Rodden, Hilary Hutchinson, and Xin Fu. 2010. Measuring the user experience on a large scale: user-centered metrics for web applications. In CHI’2010. 2395–2398.
- Saveski et al. (2017) Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M Airoldi. 2017. Detecting network effects: Randomizing over randomized experiments. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 1027–1035.
- Schuth et al. (2014) Anne Schuth, Floor Sietsma, Shimon Whiteson, Damien Lefortier, and Maarten de Rijke. 2014. Multileaved comparisons for fast online evaluation. In CIKM.
- Shokouhi (2011) Milad Shokouhi. 2011. Detecting seasonal queries by time-series analysis. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 1171–1172.
- Song et al. (2013) Yang Song, Xiaolin Shi, and Xin Fu. 2013. Evaluating and predicting user engagement change with degraded search relevance. In WWW’2013. 1213–1224.
- Tang et al. (2010) Diane Tang, Ashish Agarwal, Deirdre O’Brien, and Mike Meyer. 2010. Overlapping experiment infrastructure: More, better, faster experimentation. In KDD’2010. 17–26.
- Xie and Aurisset (2016) Huizhi Xie and Juliette Aurisset. 2016. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. In KDD’2016.
- Xu and Chen (2016) Ya Xu and Nanyu Chen. 2016. Evaluating Mobile Apps with A/B and Quasi A/B Tests. In KDD’2016.
- Xu et al. (2015) Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In KDD’2015.