Data science projects usually involve a multitude of steps to extract useful insights from data. After the general steps of data validation, cleaning and exploration, a data scientist applies preprocessing steps to the data set. A large number of preprocessing operations are available: features often need to be extracted, scaled, transformed and imputed. The preprocessed data can then be used to train a machine learning model, which in turn can be used to make predictions on new data. Finding optimal models and preprocessing operations is notoriously difficult, as many, often somewhat technical, decisions have to be made and nearly all available operations and algorithms contain additional hyperparameters. The idea of automatically obtaining a machine learning pipeline from data is the central element of the research field Automatic Machine Learning (AutoML). AutoML systems are currently rising in popularity as they can find powerful models without human oversight and knowledge.
The aim of this paper is three-fold: we describe the capabilities of current AutoML systems and how they integrate into the data science workflow, discuss potential shortcomings in current practices, and discuss what an interface for the practitioner could ideally look like.
A model used in production can potentially affect millions of users or highly important technical systems or processes. Hence, such models often have to be judged and optimized with respect to multiple criteria. The importance of these criteria, such as predictive performance, model size, prediction speed and interpretability, will vary between projects, and the criteria have inherent trade-offs, meaning that not all of them can be optimized equally well. Because data scientists are often forced to spend the majority of their time on data cleaning and exploratory analysis, the time available to build and investigate actual models from the data is often comparatively small (20% is an often-quoted number). Automating this part can thus result in better models and reduce mistakes in the process. We consider extending current approaches to incorporate multiple criteria a necessary challenge for significantly advancing data science applications.
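To make the notion of inherent trade-offs between criteria concrete, the following minimal sketch computes the set of Pareto-optimal candidates, i.e., models that no other candidate beats on every criterion at once. The candidate names and all numbers are hypothetical and chosen purely for illustration; they are not results from any system discussed in this paper.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    error: float       # validation error, lower is better
    size_mb: float     # model size in megabytes, lower is better
    latency_ms: float  # prediction time per sample, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every criterion and strictly better on one."""
    a_vals, b_vals = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(a_vals, b_vals))
    strictly_better = any(x < y for x, y in zip(a_vals, b_vals))
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only candidates that are not dominated by any other candidate."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical candidates, as an AutoML run might produce them.
candidates = [
    Candidate("boosted_ensemble", 0.080, 250.0, 12.0),
    Candidate("random_forest", 0.085, 120.0, 5.0),
    Candidate("sparse_linear", 0.090, 0.1, 0.05),
    Candidate("deep_tree", 0.120, 1.0, 0.10),
]

front = pareto_front(candidates)
for c in front:
    print(c)
```

In this toy example only `deep_tree` is dropped, because `sparse_linear` is better on all three criteria. Which of the remaining candidates is preferable depends on the project, which is exactly the decision a purely performance-driven system cannot make.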
2. Status Quo
In this work, we mainly focus on the machine learning part of the data science workflow. Substantial effort has been put into other components, such as exploratory data analysis and data cleaning (cf. the Automatic Statistician project (Ghahramani, 2015)), or providing additional visualizations (Kandel et al., 2011), but these data science stages often remain largely manual processes. Various machine learning toolboxes are available to users in different programming languages. Popular examples are Weka (Hall et al., 2009) (Java), scikit-learn (Pedregosa et al., 2011) (Python) and mlr (Bischl et al., 2016) (R). These toolboxes serve as a first step towards making machine learning accessible to a wide audience of practitioners and form the foundation of most state-of-the-art AutoML systems. While AutoML has received a lot of attention in recent years from companies such as Google (Google Cloud AutoML) and Amazon (Amazon SageMaker), it has long been an active field of research and various implementations already exist. Examples include Auto-WEKA (Thornton et al., 2013), auto-sklearn (Feurer et al., 2015) and TPOT (Olson et al., 2016). In the scientific community, these systems are compared in several AutoML challenges organized at top machine learning conferences (Feurer et al., 2018). These challenges focus solely on the predictive performance of the models built by the AutoML systems, as this makes it easy to compare and rank the systems. The other criteria discussed above are completely ignored: a simpler, sparse model is not preferred over a much more complex one, even if both have nearly identical predictive performance.
A typical (simplified) workflow involving such a system looks as follows (cf. figure 1): After accessing and cleaning the data, the data scientist conducts exploratory data analysis in order to gather first insights from the data. Once the data is of sufficient quality, the user passes it on to an AutoML system, which then optimizes a machine learning pipeline consisting of preprocessing steps, model and hyperparameters in order to achieve high predictive performance. The quality of the resulting model is then assessed by performance metrics as well as by human domain experts. Recently, fair and transparent machine learning (FATML) (cf. Barocas et al., 2018) and interpretable machine learning (IML) (cf. Molnar, 2019) have emerged as important new fields of research. Different methods that increase fairness, transparency and interpretability have been proposed (Barocas and Selbst, 2014; Lipton, 2018). In many cases, respecting these criteria is crucial for model selection, as predictions need to be explained to clients, users or society.
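The optimization step of the workflow described above can be sketched as a search over a joint space of preprocessing choices and model hyperparameters. The following minimal, dependency-free example uses plain random search and a toy k-nearest-neighbour classifier as stand-ins; actual AutoML systems use more sophisticated optimizers and far larger search spaces. The data, the search space and all function names are assumptions made for illustration only.

```python
import random
import statistics


def make_data(rng, n=200):
    """Toy binary task (hypothetical data): one informative feature, one large-scale noise feature."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x1 = rng.gauss(label * 2.0, 1.0)  # informative, small scale
        x2 = rng.gauss(0.0, 1000.0)       # pure noise, large scale
        rows.append(((x1, x2), label))
    return rows


def standardize(train, valid):
    """Scale features to zero mean and unit variance using training statistics only."""
    cols = list(zip(*(x for x, _ in train)))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]

    def scale(rows):
        return [(tuple((v - m) / s for v, m, s in zip(x, means, stds)), y) for x, y in rows]

    return scale(train), scale(valid)


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd, so no ties)."""
    nearest = sorted(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > k else 0


def evaluate(config, train, valid):
    """Build the pipeline described by config and return its validation accuracy."""
    if config["scaling"] == "standardize":
        train, valid = standardize(train, valid)
    hits = sum(knn_predict(train, x, config["k"]) == y for x, y in valid)
    return hits / len(valid)


# Hypothetical pipeline search space: one preprocessing choice, one hyperparameter.
SEARCH_SPACE = {"scaling": ["none", "standardize"], "k": [1, 3, 5, 7, 9, 11, 15]}


def random_search(train, valid, budget, rng):
    """Evaluate `budget` randomly sampled configurations and keep the best one."""
    best_config, best_score = None, -1.0
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = evaluate(config, train, valid)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


rng = random.Random(0)
data = make_data(rng)
train, valid = data[:150], data[150:]
best_config, best_score = random_search(train, valid, budget=10, rng=rng)
print(best_config, best_score)
```

Note that the search optimizes validation accuracy alone, which mirrors the single-criterion focus discussed in this section.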
3. Where is the human in AutoML?
The human's role in current AutoML processes is to choose data sets, validation protocols and performance measures to optimize, and to define the pipeline search space, i.e., which preprocessing and modeling steps to consider. After that, the system does not require human intervention and returns an optimal model after a prespecified amount of time. This often drastically speeds up the process of obtaining well-performing models, as the technical optimization is left to the machine and does not have to be handled in a manual trial-and-error process. Furthermore, this process can nowadays be scaled up to run on massively parallel systems. An important approach to making this complex process more accessible to humans was proposed by Wang et al. (2019). Still, large amounts of time are spent on data cleaning, preprocessing and hand-crafting features, as these steps typically depend on domain knowledge. Their effectiveness can be observed in Kaggle’s machine learning competitions, as well as in research (Domingos, 2012). We want to start a discussion on how humans can be empowered by AutoML systems even further and how those systems need to be extended in order to achieve this. A very basic suggestion is shown in figure 2. We consider the current inability of many AutoML systems to incorporate criteria such as fairness and interpretability a major drawback. Additionally, systems should make intermediate results available to the practitioner, who can then evaluate them and feed the evaluation back to the AutoML system. This can especially help in situations where user preferences are not easily quantifiable, or where relevant criteria are not known a priori. The field of AutoML promises great enhancements to the current data science workflow, but to harness its full potential, it needs to be extended to be more accommodating towards multiple criteria and human intervention.
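The suggested interaction, in which intermediate results are shown to the practitioner and the resulting feedback is returned to the system, could look roughly like the loop sketched below. This is an illustrative sketch under strong assumptions and not the design of any existing AutoML system: `propose` is a hypothetical stand-in for a search step, the candidate scores are synthetic, and the practitioner is simulated by a scripted callback.

```python
import random


def propose(rng, n, max_features):
    """Stand-in for one AutoML search step; returns n hypothetical candidates."""
    batch = []
    for _ in range(n):
        n_features = rng.randint(2, max_features)
        # Toy assumption: using more features lowers the error slightly.
        error = 0.30 - 0.004 * n_features + rng.uniform(0.0, 0.02)
        batch.append({"n_features": n_features, "error": round(error, 3)})
    return batch


def interactive_search(rounds, per_round, feedback, rng):
    """Alternate between search steps and practitioner feedback."""
    constraints = {"max_features": 50}  # initial, unconstrained search space
    history = []
    for round_no in range(rounds):
        batch = propose(rng, per_round, constraints["max_features"])
        history.extend(batch)
        # Expose intermediate results and feed the practitioner's response back in.
        constraints.update(feedback(round_no, batch))
    # The final model must satisfy the constraints that are in force at the end.
    admissible = [c for c in history if c["n_features"] <= constraints["max_features"]]
    best = min(admissible, key=lambda c: c["error"])
    return best, history


def scripted_feedback(round_no, batch):
    """Simulated practitioner: after the first round, models using more than
    10 features are judged too hard to explain."""
    if round_no == 0:
        return {"max_features": 10}
    return {}


rng = random.Random(1)
best, history = interactive_search(rounds=3, per_round=5, feedback=scripted_feedback, rng=rng)
print(best)
```

The interpretability limit here was not known before the first round; it only emerged once the practitioner saw intermediate candidates, which is the situation in which criteria are not known a priori.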
This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.
- Barocas et al. (2018). Fairness and machine learning. fairmlbook.org. http://www.fairmlbook.org
- Barocas and Selbst (2014). Big Data’s Disparate Impact. SSRN eLibrary.
- Bischl et al. (2016). mlr: Machine Learning in R. Journal of Machine Learning Research 17 (170), pp. 1–5.
- Domingos (2012). A few useful things to know about machine learning. Commun. ACM 55 (10), pp. 78–87.
- Feurer et al. (2018). Practical automated machine learning for the AutoML challenge 2018. In ICML 2018 AutoML Workshop.
- Feurer et al. (2015). Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2962–2970.
- Ghahramani (2015). The Automatic Statistician and future directions in probabilistic machine learning. Presentation, Machine Learning Summer School 2015.
- Hall et al. (2009). The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter 11 (1), pp. 10–18.
- Kandel et al. (2011). Wrangler: interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, New York, NY, USA, pp. 3363–3372.
- Lipton (2018). The mythos of model interpretability. Queue 16 (3), pp. 30:31–30:57.
- Molnar (2019). Interpretable machine learning. https://christophm.github.io/interpretable-ml-book/
- Olson et al. (2016). In Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 – April 1, 2016, Proceedings, Part I, G. Squillero and P. Burelli (Eds.), pp. 123–137.
- Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Thornton et al. (2013). Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD-2013, pp. 847–855.
- Wang et al. (2019). ATMSeer: Increasing Transparency and Controllability in Automated Machine Learning. To appear in CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019). arXiv:1902.05009.