Machine learning and data science have entered many domains of human endeavor in modern times. The number of self-reported data scientists has doubled in recent years [harrison1995validity]. They work in various domains including academia, industry and business, among others. There is therefore a demand for machine learning tools that are flexible, powerful and, most importantly, interpretable. Unfortunately, the effective application of machine learning tools requires an expert understanding of the frameworks and algorithms that are present in a machine learning pipeline. It also requires knowledge of the problem domain and an understanding of the assumptions used in the analysis. For such tools to be used adequately by non-experts, new tools must be developed for understanding and interpreting the results of a data analysis pipeline in a specific domain.
Pipelines in machine learning and data science are commonly organized as interdependent components. The components that make up a data analysis pipeline include data preprocessing, feature extraction, feature transformation, model building and model evaluation, among others. Fig. 1 shows a generic representation of data analysis pipelines as a feed-forward network. Each computational step of the pipeline offers several algorithms to choose from, and each algorithm has its own hyperparameters that must be optimized. There are therefore exponentially many combinations of algorithms and hyperparameters in a given data analysis pipeline, which makes optimizing the pipeline a computationally intensive task. Tuning the pipeline can be viewed as the optimization of an objective function that is noisy and expensive to evaluate. The inputs to this problem are a dataset, the pipeline (a network consisting of the steps, the algorithms and their corresponding hyperparameters) and the objective to optimize, such as validation error, accuracy, F1-score or cross-entropy loss. The goal of a data scientist is to find the set of algorithms and hyperparameters in this pipeline that optimizes the objective function. This corresponds to finding an optimal path through the pipeline in Fig. 1.
Simple methods such as grid search and random search [bergstra2012random] have been used to tackle this problem. More sophisticated approaches such as Bayesian optimization [snoek2012practical, zhang2016flash] have been used successfully on more difficult problems. Pipeline optimization as a whole has also been approached using genetic algorithms [olson2016evaluation, olson2016tpot]. We use grid search, random search and Bayesian optimization to optimize the pipeline and each individual path in it. Our present goal is not to improve ways of optimizing the pipeline, but to use any one such method to help a domain scientist quantify the importance of different steps in the pipeline, for example by answering the question "How important is feature extraction?".
Domain experts prefer to understand how predictive decisions are made by the pipeline. Recently, models and techniques for improving the interpretability of machine learning have emerged. [ribeiro2016model] introduces a model-agnostic method for interpreting the results of complex machine learning algorithms. [koh2017understanding] uses influence functions to understand black-box predictions. In this work, we attempt to interpret machine learning pipelines in terms of the importance and sensitivity of their components (steps, algorithms and hyperparameters), as opposed to approaches that are geared toward interpreting algorithms based on the dataset (see [koh2017understanding]). Using our approach, one can understand the importance of different steps, such as feature extraction and feature transformation, and of individual algorithms and hyperparameters. To our knowledge, this type of approach to interpretation has not been taken before. To this end, we propose to understand the contribution of error in data analysis pipelines using a method that we call the agnostic methodology. Essentially, to quantify the contribution of a particular component, we compute the error of the pipeline when that component is selected agnostically. We use the cross-entropy loss as the performance metric for the optimization algorithms and as the basis of error quantification in the image classification pipelines. Understanding the importance of the components in the predictive model helps experts design better data analysis pipelines. Experts can use the error contributions to focus attention on certain parts of the pipeline depending on the source of error. We introduce a methodology to quantify the contribution of error from different components of the data analysis pipeline, namely the computational steps and algorithms in the pipeline.
Pipeline optimization methods and algorithms like grid search, random search [bergstra2012random] and Bayesian optimization [snoek2012practical] are used to optimize the pipeline for performing experiments with our agnostic error contribution methodology. We take two different approaches to optimization. The first is hyper-parameter optimization (HPO) where a computational path in Fig. 1 is optimized. The second type of optimization is denoted as combined algorithm selection and hyperparameter optimization (CASH). This term was introduced in [thornton2013auto]. This is a more difficult problem, because the pipeline is optimized globally, in that the result of the optimization is a single optimized path that produces the best performance over all the paths in the machine learning workflow.
We use four datasets to demonstrate the error contribution methodology. The problem we focus on is image classification. We show the performance of the optimization frameworks (HPO and CASH) in the experiments. We show experimentally that CASH using random search and Bayesian optimization can be used efficiently to quantify errors from the different computational steps of the pipeline. In addition, the HPO frameworks of both Bayesian optimization and random search provide estimates of error contributions from the algorithms and hyperparameters in a particular path of the pipeline. The results demonstrate that the agnostic error contribution methodology may be used by both data science and domain experts to improve and interpret the results of image classification pipelines. In addition, we observe that random search is a more accurate estimator of error contribution than Bayesian optimization. Finally, we demonstrate a visualization on a real pipeline.
In this section we describe the optimization problem and methods that are used in this work.
II-A Algorithm selection and hyper-parameter optimization
We approach the problem of optimization of the pipeline from two frameworks. In one framework, each path in the pipeline in Fig. 1 is individually optimized. This essentially boils down to the problem of hyper-parameter optimization (HPO) because the hyperparameters of each algorithm are optimized for each individual path. In the second framework, the entire pipeline is optimized. This means that the algorithms and hyperparameters are optimized together. This is denoted as combined algorithm selection and hyper-parameter optimization (CASH).
II-A1 Hyper-parameter optimization (HPO)
Let the $n$ hyperparameters in a path be denoted as $\lambda_1, \lambda_2, \dots, \lambda_n$, and let $\Lambda_1, \Lambda_2, \dots, \Lambda_n$ be their respective domains. The hyperparameter space of the path is $\Lambda = \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_n$.
When trained with $\lambda \in \Lambda$ on data $D_{train}$, the validation error is denoted as $\mathcal{L}(\lambda, D_{train}, D_{valid})$. Using $k$-fold cross-validation, the hyperparameter optimization problem for a dataset $D$ is to minimize:
$$ f^{D}(\lambda) = \frac{1}{k} \sum_{l=1}^{k} \mathcal{L}\left(\lambda, D_{train}^{(l)}, D_{valid}^{(l)}\right) \tag{1} $$
Hyperparameters may be numerical, categorical or conditional with a finite domain. Minimizing this objective function yields the optimal configuration of hyperparameters on a particular path of the pipeline in Fig. 1. The optimization of the objective function defined by Eq. 1 is very expensive. Depending on the type of the hyperparameter variables, the derivatives and convexity properties may be unknown, so derivative-free global optimization methods like Bayesian optimization and techniques like random search may be used to tackle this problem. This framework is represented in Fig. 1(a).
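As a minimal sketch of the cross-validated objective in Eq. 1, the following Python code averages a validation loss over k folds. The fold construction, the function names and the toy loss are illustrative assumptions and are not taken from the experiments in this work; any loss with the same call signature could be substituted.

```python
def k_fold_splits(data, k):
    """Return k (train, valid) pairs built from interleaved folds."""
    folds = [list(data[i::k]) for i in range(k)]
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, valid))
    return splits


def cv_objective(config, data, k, loss):
    """Average the validation loss of one configuration over k folds."""
    splits = k_fold_splits(data, k)
    return sum(loss(config, train, valid) for train, valid in splits) / k


def toy_loss(config, train, valid):
    """Hypothetical loss: squared error of a shrunken training mean."""
    prediction = config["shrink"] * sum(train) / len(train)
    return sum((y - prediction) ** 2 for y in valid) / len(valid)


# Two configurations of the single (hypothetical) hyperparameter "shrink".
toy_data = [1.0, 2.0, 3.0, 4.0]
loss_full = cv_objective({"shrink": 1.0}, toy_data, k=2, loss=toy_loss)
loss_zero = cv_objective({"shrink": 0.0}, toy_data, k=2, loss=toy_loss)
```

An optimizer only needs to call `cv_objective` for each candidate configuration, which is why the methods discussed in this work can treat the pipeline as a black box.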
II-A2 Combined algorithm selection and hyper-parameter optimization (CASH)
We can define the CASH formulation using Fig. 1. Let there be $N$ computational steps in the pipeline. Each step $S_i$ in the pipeline consists of algorithms $A_{i1}, A_{i2}, \dots, A_{in_i}$, where $i = 1, \dots, N$, $n_i$ is the number of algorithms in step $S_i$, $A_{ij}$ represents the $j$-th algorithm in step $S_i$, and $\Lambda_{ij}$ represents the set of hyperparameter configurations corresponding to $A_{ij}$. The entire space of algorithms and hyperparameters is therefore given by
$$ \mathcal{A} = \prod_{i=1}^{N} \bigcup_{j=1}^{n_i} \left( \{A_{ij}\} \times \Lambda_{ij} \right). $$
The objective function to be minimized for CASH is given by
$$ f^{D}(a) = \frac{1}{k} \sum_{l=1}^{k} \mathcal{L}\left(a, D_{train}^{(l)}, D_{valid}^{(l)}\right) \tag{2} $$
where $a \in \mathcal{A}$ and the other notations are the same as those introduced in the previous section. Compared with the objective function defined over the hyperparameters in Eq. 1, the optimization in Eq. 2 is even more difficult due to the additional problem of algorithm selection. Again, the derivatives may be impossible to compute and convexity properties may be completely unknown. This framework is represented in Fig. 1(b).
II-B Optimization methods
The critical step in HPO or CASH is to choose the set of trials in the search space, which is the hyperparameter space $\Lambda$ of a path for HPO and the joint space $\mathcal{A}$ of algorithms and hyperparameters for CASH. This section describes the methods used in this paper to optimize Eq. 1 and Eq. 2: grid search, random search and Bayesian optimization.
II-B1 Grid search
Grid search is the simplest method for generating trials in the search space. The set of trials in grid search is formed by assembling every possible combination of values in $\Lambda$ (HPO) or $\mathcal{A}$ (CASH) and computing the validation loss for each. The configuration $\lambda \in \Lambda$ or $a \in \mathcal{A}$ that minimizes the validation loss is chosen as the optimum configuration. Unfortunately, grid search is computationally very expensive. For HPO, the number of trials is $\prod_{i=1}^{n} |\Lambda_i|$, the product of the domain sizes of the $n$ hyperparameters, and for CASH it is $\prod_{i=1}^{N} \sum_{j=1}^{n_i} |\Lambda_{ij}|$. This product makes grid search suffer from the curse of dimensionality, because the number of trials grows exponentially with the number of hyperparameters. However, grid search has certain advantages. Firstly, parallelization and implementation are trivial. In addition, grid search is robust in the sense that results may be replicated easily.
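The exhaustive enumeration behind grid search can be sketched in a few lines of Python. The hyperparameter values below are those listed in Table III for the path Haralick texture features, PCA and random forests; the objective function is a hypothetical stand-in for the cross-validated loss and is not a result from this work.

```python
import itertools

# Hyperparameter values of one path, as listed in Table III.
PATH_SPACE = {
    "haralick_distance": [1, 2, 3, 4],
    "whitening": [True, False],
    "n_estimators": [8, 81, 154, 227, 300],
    "max_features": [0.3, 0.5, 0.7],
}


def toy_objective(config):
    """Hypothetical loss surface with a single minimum (illustration only)."""
    return (
        abs(config["haralick_distance"] - 3)
        + (0.0 if config["whitening"] else 0.5)
        + abs(config["n_estimators"] - 154) / 100.0
        + abs(config["max_features"] - 0.5)
    )


def grid_search(space, objective):
    """Evaluate every combination of values and keep the best one."""
    names = list(space)
    best_config, best_loss, n_trials = None, float("inf"), 0
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        loss = objective(config)
        n_trials += 1
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss, n_trials


best_config, best_loss, n_trials = grid_search(PATH_SPACE, toy_objective)
```

Here `n_trials` equals 4 x 2 x 5 x 3 = 120, the product of the domain sizes, which illustrates how quickly the grid grows as hyperparameters are added.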
II-B2 Random search
Random search is the optimization method in which trial configurations are randomly sampled from the search space $\Lambda$ (HPO) or $\mathcal{A}$ (CASH). [bergstra2012random] shows empirically and theoretically that randomly selecting trials is sufficiently accurate and more efficient than performing optimization using grid search. We also show similar results in this work.
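A minimal sketch of random search over a CASH-style space is given below. The nested dictionary is an excerpt of Table III restricted to two steps, and the sampling scheme (pick an algorithm uniformly in each step, then each of its hyperparameters uniformly) is an assumption made for illustration; it is not uniform over full configurations, and the experiments in this work may sample differently. The objective is a hypothetical placeholder.

```python
import random

# Excerpt of the search space: step -> algorithm -> hyperparameter -> values.
TOY_SPACE = {
    "feature transformation": {
        "PCA": {"whitening": [True, False]},
        "ISOMAP": {"number of neighbors": [3, 4, 5, 6, 7],
                   "number of components": [2, 3, 4]},
    },
    "learning algorithms": {
        "Random forests": {"number of estimators": [8, 81, 154, 227, 300],
                           "maximum features": [0.3, 0.5, 0.7]},
        "SVM": {"C": [0.1, 25.075, 50.05, 75.025, 100.0]},
    },
}


def sample_configuration(space, rng):
    """Pick one algorithm per step, then one value per hyperparameter."""
    config = {}
    for step, algorithms in space.items():
        algorithm = rng.choice(sorted(algorithms))
        params = {name: rng.choice(values)
                  for name, values in algorithms[algorithm].items()}
        config[step] = (algorithm, params)
    return config


def toy_objective(config):
    """Hypothetical stand-in for the cross-validated loss of Eq. 2."""
    transform, _ = config["feature transformation"]
    _, learner_params = config["learning algorithms"]
    loss = 0.4 if transform == "PCA" else 0.6
    # Pretend that more trees help slightly; other hyperparameters are ignored.
    return loss - 0.0005 * learner_params.get("number of estimators", 0)


def random_search(space, objective, n_trials, seed=0):
    """Evaluate n_trials random configurations; return the best and all trials."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_configuration(space, rng)
        trials.append((config, objective(config)))
    best = min(trials, key=lambda trial: trial[1])
    return best, trials


best, trials = random_search(TOY_SPACE, toy_objective, n_trials=25, seed=1)
```

Keeping the full list of trials, rather than only the best one, matters here because the error contributions are later computed from the recorded trials.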
II-B3 Bayesian optimization
Sequential model-based Bayesian optimization (SMBO) [hutter2011sequential] is the method of choice for optimizing complicated black-box functions. In a nutshell, it consists of two components: a probabilistic model and an acquisition function. The probabilistic model can be built using Gaussian processes (Spearmint) [snoek2012practical], random forests (SMAC) [hutter2011sequential] or density estimation with Tree-structured Parzen estimators (TPE) [bergstra2011algorithms]. The acquisition function determines the future candidates or trials for evaluation. It is relatively cheap to evaluate compared to the actual objective function. One of the most prominent acquisition functions is expected improvement (EI) [expected_improvement]. We use sequential model-based algorithm configuration (SMAC), which uses random forests, as the Bayesian optimization framework because it can optimize conditional hyperparameter configurations. The choice is also based on empirical results in [eggensperger2013towards].
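To illustrate why the acquisition function is cheap to evaluate, the sketch below computes the standard closed-form expected improvement for a minimization problem, assuming that the probabilistic model returns a Gaussian predictive mean and standard deviation for a candidate. This is a generic textbook form and an assumption made here; it is not necessarily the exact variant implemented in SMAC.

```python
import math


def expected_improvement(mu, sigma, best_loss):
    """Expected improvement over best_loss for a Gaussian prediction.

    mu, sigma: predictive mean and standard deviation of the loss at a
    candidate configuration (assumed to come from the surrogate model).
    best_loss: lowest loss observed so far.
    """
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(best_loss - mu, 0.0)
    z = (best_loss - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_loss - mu) * cdf + sigma * pdf


# A candidate predicted to be better than the incumbent scores higher.
ei_promising = expected_improvement(mu=0.6, sigma=0.1, best_loss=0.7)
ei_unpromising = expected_improvement(mu=0.8, sigma=0.1, best_loss=0.7)
```

Each evaluation costs a handful of arithmetic operations, whereas one evaluation of the true objective requires training and validating the whole pipeline.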
III Proposed methods
In this section, the proposed methodology for quantifying error contributions is presented. The method is independent of the optimization method that may be used for the HPO and CASH formulations.
III-A Error contribution with the agnostic methodology
We propose an agnostic methodology for quantifying error contributions from different parts of the pipeline. It is defined as the minimum error obtained by being agnostic to a particular component of the pipeline (computational step, algorithm or hyperparameter). We define below what agnostic means for computational steps, algorithms and hyperparameters individually.
III-A1 Quantification of error from computational steps
The agnostic methodology may be used to quantify the contributions from computational steps like feature extraction, data pre-processing and learning algorithms. Being agnostic to a computational step means that the algorithm in that step is selected randomly while the remaining pipeline is optimized. The average of the minimum errors obtained with each algorithm in the step used as the only algorithm in that step provides an estimate of the agnostic error for that step. More formally, the agnostic methodology is implemented for computational steps in the following manner. Using Fig. 1 as a reference, let $N$ be the number of steps in the pipeline. Each step in the pipeline is denoted as $S_i$. $n_i$ is the number of algorithms in step $S_i$. $A_{ij}$ denotes the $j$-th algorithm in the $i$-th step. $E^{*}$ represents the minimum validation error found after optimization of the entire pipeline (using the CASH framework). $E^{*}_{A_{ij}}$ is the minimum validation error found with $A_{ij}$ as the only algorithm in step $S_i$. The error contribution from step $S_i$, $EC_{S_i}$, is given by Eq. 3.
$$ EC_{S_i} = \frac{1}{n_i} \sum_{j=1}^{n_i} E^{*}_{A_{ij}} - E^{*} \tag{3} $$
where $i = 1, \dots, N$. Taking the difference with respect to the global minimum $E^{*}$ in Eq. 3 provides an estimate of the error contribution from step $S_i$ of the pipeline. A large value of $EC_{S_i}$ would mean that step $S_i$ is important for the pipeline.
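The computation described for Eq. 3 can be sketched from a list of recorded trials as follows. The trial format (a mapping from step name to algorithm name, paired with a validation loss), the step names and the loss values are hypothetical and serve only to make the arithmetic concrete.

```python
def step_error_contribution(trials, step, algorithms):
    """Mean of the per-algorithm minimum losses minus the global minimum.

    trials: list of (config, loss) pairs, where config maps each step name
            to the algorithm used in that step.
    step: name of the step treated agnostically.
    algorithms: names of the algorithms available in that step.
    """
    global_min = min(loss for _, loss in trials)
    minima = []
    for algorithm in algorithms:
        losses = [loss for config, loss in trials if config[step] == algorithm]
        if not losses:
            raise ValueError(f"no trial uses {algorithm!r} in step {step!r}")
        minima.append(min(losses))
    return sum(minima) / len(minima) - global_min


# Hypothetical trials over two steps with two algorithms each.
toy_trials = [
    ({"extraction": "haralick", "learner": "rf"}, 0.9),
    ({"extraction": "haralick", "learner": "svm"}, 0.8),
    ({"extraction": "vgg16", "learner": "rf"}, 0.4),
    ({"extraction": "vgg16", "learner": "svm"}, 0.5),
]
ec_extraction = step_error_contribution(toy_trials, "extraction", ["haralick", "vgg16"])
ec_learner = step_error_contribution(toy_trials, "learner", ["rf", "svm"])
```

With these made-up numbers the extraction step contributes (0.8 + 0.4) / 2 - 0.4 = 0.2 and the learner step contributes (0.4 + 0.5) / 2 - 0.4 = 0.05, so the choice of extraction algorithm matters more. The explicit error for an algorithm with no trials reflects that sparse searches may not cover every algorithm.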
III-A2 Quantification of error from algorithms
The agnostic methodology for algorithms is defined analogously to that for steps. In this case, we focus on a single path in the pipeline in Fig. 1. Let us assume we are trying to quantify the error contribution of a particular algorithm $A_{ij}$ (the $j$-th algorithm in step $S_i$) that lies on path $P$. Being agnostic to $A_{ij}$ means that we optimize everything else on the path except this algorithm. That is, we pick the hyperparameters of algorithm $A_{ij}$ randomly while optimizing the rest of the algorithms on the path. This is formally calculated by taking the average of the optimum errors on the path for each configuration of $A_{ij}$. The minimum validation error on the path is then subtracted from this average to give the error contribution from algorithm $A_{ij}$ on path $P$. These errors are computed using the results and the search trials of the HPO framework in Section II-A1.
$$ EC^{P}_{A_{ij}} = \frac{1}{c_{ij}} \sum_{q=1}^{c_{ij}} E^{*}_{A_{ij}^{(q)}} - E^{*}_{P} \tag{4} $$
where $q = 1, \dots, c_{ij}$, $c_{ij}$ represents the number of hyperparameter configurations of $A_{ij}$, $E^{*}_{A_{ij}^{(q)}}$ is the minimum error obtained with the $q$-th configuration of $A_{ij}$ and $E^{*}_{P}$ is the minimum error found over the path $P$ that contains algorithm $A_{ij}$.
III-A3 Quantification of error from hyperparameters
The agnostic methodology for hyperparameters is implemented as follows. As for algorithms, we focus on a single path. Let us assume we are trying to quantify the error contribution of a particular hyperparameter $\lambda_{ijm}$ that lies on path $P$, i.e. the $m$-th hyperparameter of the $j$-th algorithm in the $i$-th step of the pipeline. Being agnostic to $\lambda_{ijm}$ means that we optimize everything else on the path except this hyperparameter. That is, we pick the hyperparameter $\lambda_{ijm}$ of algorithm $A_{ij}$ randomly while optimizing the rest of the hyperparameters on the path. This is formally calculated by taking the average of the optimum errors on the path for each configuration of $\lambda_{ijm}$. The minimum validation error on the path is then subtracted from this average to give the error contribution from hyperparameter $\lambda_{ijm}$ on path $P$. This is again computed using the HPO framework described in Section II-A1.
$$ EC^{P}_{\lambda_{ijm}} = \frac{1}{c_{ijm}} \sum_{q=1}^{c_{ijm}} E^{*}_{\lambda_{ijm}^{(q)}} - E^{*}_{P} \tag{6} $$
where $m = 1, \dots, h_{ij}$ and $h_{ij}$ is the number of hyperparameters of algorithm $A_{ij}$. $c_{ijm}$ represents the number of configurations of $\lambda_{ijm}$, $E^{*}_{\lambda_{ijm}^{(q)}}$ is the minimum error obtained with the $q$-th configuration of $\lambda_{ijm}$ and $E^{*}_{P}$ is the minimum error found over the path $P$ that contains algorithm $A_{ij}$.
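The path-level contributions for algorithms and hyperparameters share one computation: group the trials on a path by the configuration of the component of interest, take the minimum loss within each group, average those minima and subtract the minimum over the path. The sketch below assumes a flat trial format and uses invented loss values; the hyperparameter names follow Table III, but the function and its arguments are hypothetical.

```python
from collections import defaultdict


def agnostic_contribution(trials, names):
    """Error contribution of the component defined by the given hyperparameters.

    trials: list of (config, loss) pairs recorded on a single path, where
            config maps hyperparameter names to values.
    names: one name for a single hyperparameter, or all hyperparameter names
           of an algorithm to treat the whole algorithm agnostically.
    """
    groups = defaultdict(list)
    for config, loss in trials:
        key = tuple(config[name] for name in names)
        groups[key].append(loss)
    path_min = min(loss for _, loss in trials)
    mean_of_minima = sum(min(losses) for losses in groups.values()) / len(groups)
    return mean_of_minima - path_min


# Hypothetical trials on a path with a PCA flag and two forest hyperparameters.
toy_losses = {
    (True, 8, 0.3): 0.90, (True, 8, 0.7): 0.80,
    (True, 300, 0.3): 0.50, (True, 300, 0.7): 0.40,
    (False, 8, 0.3): 0.90, (False, 8, 0.7): 0.85,
    (False, 300, 0.3): 0.55, (False, 300, 0.7): 0.45,
}
path_trials = [
    ({"whitening": w, "n_estimators": n, "max_features": f}, loss)
    for (w, n, f), loss in toy_losses.items()
]

ec_n_estimators = agnostic_contribution(path_trials, ["n_estimators"])
ec_whitening = agnostic_contribution(path_trials, ["whitening"])
ec_forest = agnostic_contribution(path_trials, ["n_estimators", "max_features"])
```

Only configurations that actually occur in the trials form groups, so for sparse searches the average is taken over the observed configurations rather than over the full grid.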
IV Experiments and results
In this section, we describe the experiments performed on the data analysis pipeline to quantify the error contributions from different components of the pipeline. Image classification is the data analysis problem chosen for demonstrating the error quantification experiments.
Fig. 3 shows the pipeline used in this work. There are 3 computational steps in this pipeline, namely feature extraction ($S_1$), feature transformation ($S_2$) and learning algorithms ($S_3$). The steps, algorithms and corresponding hyperparameters are described in Table I.
|Step||Algorithm (hyperparameters)|
|Feature extraction||Haralick texture features (Haralick distance)|
|Feature extraction||Pre-trained CNN trained on ImageNet [deng2009imagenet] database with VGG16 [simonyan2014very] network|
|Feature extraction||Pre-trained CNN trained on ImageNet [deng2009imagenet] database with Inception [szegedy2016rethinking] network|
|Feature transformation||PCA (whitening) [wold1987principal]|
|Feature transformation||ISOMAP (number of neighbors, number of components) [tenenbaum2000global]|
|Learning algorithms||Random forests (number of trees, maximum features) [breiman2001random]|
|Learning algorithms||SVM ($C$, $\gamma$) [cortes1995support]|
The algorithms listed in the table make up the components of the pipeline in Fig. 3. This pipeline is meant to serve as an example for demonstrating the experiments using the error contribution framework described in Section III. The framework can easily be generalized to any data analysis problem that involves pipelines.
IV-A Optimization frameworks
Experiments are performed using two optimization frameworks. These frameworks have been described in detail in Section II-A. The first global optimization framework is the CASH framework described in Section II-A2. Here, the pipeline is optimized as a whole including the algorithms, which are themselves considered as hyperparameters in this framework. Fig. 1(b) is a representation of this. This is used for quantification of the contribution of error with respect to computational steps in the pipeline.
The second framework is the hyperparameter optimization (HPO) framework, where each path in the pipeline is optimized individually. This is described in detail in Section II-A1. This framework is used for quantifying the contribution of error with respect to algorithms and hyperparameters in each path of the pipeline. The framework is depicted in Fig. 1(a).
Specifically, we choose the path Haralick texture features - PCA - random forests to demonstrate the error quantification approach for algorithms and hyperparameters.
|Dataset (notation)||Distribution of classes|
|Breast cancer (breast) [bilgin2007cell]||benign: 151, in-situ: 93, invasive: 202|
|Brain cancer (brain) [gunduz2004cell]||glioma: 16, healthy: 210, inflammation: 107|
|Material science 1 (matsc1) [chowdhury2016image]||dendrites: 441, non-dendrites: 132|
|Material science 2 (matsc2) [chowdhury2016image]||transverse: 393, longitudinal: 48|
Four datasets from the domains of medicine and material science are used in this work. They are image datasets of breast cancer [bilgin2007cell], brain cancer [gunduz2004cell], and two datasets of microstructures in material science [chowdhury2016image]. They are described in Table II.
|Algorithm||Hyperparameter||Description||Values|
|Haralick texture features||Haralick distance||The Haralick distance to consider while computing the co-occurrence matrix||[1, 2, 3, 4]|
|PCA||whitening||Flag variable for whitening the data||[True, False]|
|ISOMAP||number of neighbors||Number of neighbors to consider for each point||[3, 4, 5, 6, 7]|
|ISOMAP||number of components||Number of co-ordinates for the manifold in ISOMAP algorithm||[2, 3, 4]|
|Random forests||number of estimators||Number of trees in the forest||[8, 81, 154, 227, 300]|
|Random forests||maximum features||The fraction of the total number of features to consider when looking for the best split||[0.3, 0.5, 0.7]|
|SVM||$C$||Penalty parameter of the error term||[0.1, 25.075, 50.05, 75.025, 100.0]|
|SVM||$\gamma$||Kernel coefficient for the radial basis function||[0.3, 0.5, 0.7]|
Table II lists datasets from scientific domains. These datasets have been chosen because they are examples of real-world datasets. They are noisy in the sense that the images contain artefacts, they are heavily imbalanced and they are small in terms of number of samples. They differ from very large datasets like ImageNet [deng2009imagenet], where deep learning techniques like convolutional neural networks have been shown to be superior. Even though deep neural networks represent an end-to-end workflow where the input image is fed into the network and the output classification is obtained at the other end, they may also be represented as pipelines if the hyperparameters of the network are considered. [shin2016deep] has shown that machine learning problems involving datasets from medical imaging may be solved using pre-trained and fine-tuned neural networks rather than training them from scratch. We have therefore used pre-trained models such as VGGnet [simonyan2014very] and InceptionNet [szegedy2016rethinking] as feature extraction models that fit naturally in the pipeline framework described here, for the purpose of illustrating the error quantification methodology.
IV-C Error quantification experiments
Experiments based on the quantification of error contributions framework described in section III-A are presented here. The plots are of the error contribution values calculated using Eqs. 3, 4 and 6 on the 4 datasets described in Table II.
IV-C1 Experimental setting
Optimization using the 3 methods in Section II-B is performed on the pipeline in Fig. 3 for the 4 datasets in Table II. The domains and possible values of the hyperparameters are described in Table III. The convergence criterion (a hyper-hyper-parameter) is set at 50 iterations of unchanging function value for each of the optimization methods. The choice of the convergence criterion and of the hyperparameters is independent of the error quantification methods; results may be obtained using any choice of values for these components. The continuous hyperparameters maximum features, $C$ and $\gamma$ have been discretized specifically for comparison of the optimization methods with grid search. In general, discretization of the hyperparameters is not necessary for performing optimization.
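To give a sense of the size of the discretized search space, the sketch below encodes the values of Table III as a nested dictionary and counts the paths and grid configurations. It assumes that the two pre-trained CNN feature extractors have no tunable hyperparameters, since Table III lists none for them, and it uses the key `gamma` for the kernel coefficient of the SVM; both are assumptions of this illustration, and the counts are derived here rather than reported in the experiments.

```python
from math import prod

# step -> algorithm -> hyperparameter -> discretized values (from Table III).
PIPELINE = {
    "feature extraction": {
        "Haralick texture features": {"Haralick distance": [1, 2, 3, 4]},
        "VGG16": {},      # assumed to have no tunable hyperparameters
        "Inception": {},  # assumed to have no tunable hyperparameters
    },
    "feature transformation": {
        "PCA": {"whitening": [True, False]},
        "ISOMAP": {"number of neighbors": [3, 4, 5, 6, 7],
                   "number of components": [2, 3, 4]},
    },
    "learning algorithms": {
        "Random forests": {"number of estimators": [8, 81, 154, 227, 300],
                           "maximum features": [0.3, 0.5, 0.7]},
        "SVM": {"C": [0.1, 25.075, 50.05, 75.025, 100.0],
                "gamma": [0.3, 0.5, 0.7]},
    },
}


def n_configurations(hyperparameters):
    """Number of value combinations of one algorithm (1 if it has none)."""
    return prod(len(values) for values in hyperparameters.values())


def n_paths(pipeline):
    """Number of ways to pick one algorithm per step."""
    return prod(len(algorithms) for algorithms in pipeline.values())


def grid_size(pipeline):
    """Number of grid-search trials over algorithms and hyperparameters."""
    return prod(
        sum(n_configurations(h) for h in algorithms.values())
        for algorithms in pipeline.values()
    )


paths = n_paths(PIPELINE)    # 3 * 2 * 2 = 12
trials = grid_size(PIPELINE)  # (4 + 1 + 1) * (2 + 15) * (15 + 15) = 3060
```

Under these assumptions the pipeline has 12 paths and 3060 grid configurations, each of which requires a full cross-validated evaluation in grid search.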
The error contribution values are obtained from the trials of the optimization methods described in Section II. Grid search is run only once, while the other algorithms are averaged over 5 runs, with the mean and standard deviation shown in the following plots. These results are computed on the validation error (cross-entropy loss) obtained at the end of the pipeline. Random search and Bayesian optimization (using the SMAC algorithm) are implemented on both the frameworks described in Section IV-A. The grid search results may be used as the gold standard against which to compare the performance of the other optimization algorithms.
IV-C2 Error contribution from computational steps
The mean and standard deviation of the error contribution of each step, calculated using Eq. 3, are shown in Fig. 4 for the 4 datasets in Table II. The error is calculated using the formulation in Section III-A1. We observe that most of the contribution comes from the feature extraction algorithms. This means that, among all the steps, feature extraction is the most important one to optimize. This confirms our intuition that feature extraction algorithms are the most important components of a machine learning pipeline.
Let us look at Fig. 4(a) (contribution of the steps on the breast dataset) as an example. Here we see that the contribution from the steps decreases in magnitude as we move towards the end of the pipeline. This trend is captured by all three methods: grid search, random search and Bayesian optimization. However, Bayesian optimization does not capture the contribution from feature transformation as accurately as random search does, taking grid search as the reference. The standard deviation of grid search is 0 because it was run only once for each dataset, owing to the computation time required and to the robustness of the grid search method (the results do not change because every single configuration is tried). We observe from the results that random search mirrors the behavior of grid search, but its results are not robust because of the high standard deviation. This is expected because the search trials may not include all the configurations in the pipeline, as opposed to grid search, where all the configurations of algorithms and hyperparameters are evaluated. The results of SMAC (Bayesian optimization) in the CASH framework do not follow the behavior of grid search as closely as those of random search. This is because the trials for SMAC are even more sparse with respect to the algorithms it selects when computing the error contribution in Eq. 3. SMAC samples only a few configurations based on the updated probabilistic model as it narrows in on the optimum configuration. Therefore, it sometimes gives erroneous results, as can be seen from the contribution of feature transformation in Fig. 4.
IV-C3 Error contribution from algorithms
In Fig. 5, the error contributions from algorithms are quantified using the formulation in Section III-A2. We select Haralick texture features, PCA and random forests as the path to demonstrate the contribution of error from algorithms, because each of these algorithms is associated with one or more hyperparameters. We observe a trend here: the contribution from Haralick texture features and random forests is larger than that from PCA. This means that it is more important to tune Haralick texture features and random forests than it is to tune PCA. This may be attributed to the fact that the search space of hyperparameters (in Table III) for Haralick texture features and random forests is larger than that of PCA. Again, we see that random search performs better and is more robust than Bayesian optimization in terms of following the behavior of grid search.
Let us take the example of the error contribution of algorithms over the selected path in the brain dataset, depicted in Fig. 5(b). We observe that random forests contribute the most to the error, followed by Haralick texture features and PCA, respectively. Again we see that Bayesian optimization is not able to capture the error contribution from random forests adequately.
IV-C4 Error contribution from hyperparameters
Fig. 6 shows the error contributions from hyperparameters quantified using the formulation in Section III-A3. We again select Haralick texture features, PCA and random forests as the path to demonstrate the contribution of error from hyperparameters. The hyperparameters (from Table III) in this path are Haralick distance, whitening, number of estimators and maximum features. Again, we see that random search follows the behavior of grid search more closely than Bayesian optimization does. In addition, we observe that it is in general more important to tune the hyperparameters Haralick distance and number of estimators than it is to tune the other hyperparameters. This again could be due to the number of configurations of the respective hyperparameters in Table III, where a hyperparameter with a larger search space contributes more to the error and is therefore more important to tune.
Let us look at the plot in Fig. 6(b) (contribution of hyperparameters on the brain dataset). In this plot, we again observe that the largest contribution to the error comes from the number of estimators hyperparameter. As with the error contributions from steps and algorithms, we observe that even though Bayesian optimization follows the trend of grid search, it does not capture the contribution accurately, especially for the number of estimators hyperparameter.
IV-C5 Comparison of computation time
Fig. 8 shows the average computation times of each of the algorithms used in the error quantification experiments. We observe that random search and Bayesian optimization are both efficient in terms of computation time compared to grid search. Therefore, both the CASH and HPO frameworks using algorithms like random search and Bayesian optimization may be used for computing the error contributions instead of grid search. However, as we have seen repeatedly from the plots in Figs. 4, 5 and 6, random search captures the behavior of grid search more accurately and is more robust than Bayesian optimization. Hence, random search may be used to quantify the contributions from steps, algorithms and hyperparameters accurately and efficiently.
We describe an ongoing application of this work in a project known as the Cognitive Immersive Systems Laboratory [su2017cognitive] (CISL) at Rensselaer Polytechnic Institute (RPI). The next Cognitive and Immersive Situations Room (CIR) to augment group intelligence is being built here.
A use case of CISL is to diagnose breast cancer with the help of a group of experts consisting of medical professionals (domain experts) and computer scientists (data scientists). This group of experts comes together to solve the problem with the help of the artificial intelligence in the room (CIR).
A demonstration of the CIR for breast cancer diagnosis is shown in Fig. 9. The frontend of the CIR for this specific use-case is shown in Fig. 7. The experts in the room look at various aspects of the problem which are depicted in the figure. This is an ideal application of the error contribution framework where the experts can look at the contributions of the various components in Fig. 7 and decide on which components to optimize and improve the image classification pipeline.
The suggested approach involves understanding the sources of error contribution in data analysis pipelines. Specifically, we propose an agnostic methodology to quantify the error contributions from different parts of an image classification pipeline, namely computational steps, algorithms and hyperparameters. This is described in Section III-A. The results in Section IV-C show that global optimization methods like random search and Bayesian optimization are able to quantify the error contributions as well as grid search. The framework of Bayesian optimization is not as accurate and robust as random search, for the reasons given in Section IV. In general, we expect optimization algorithms whose trials cover a larger region of the search space of configurations to quantify the contributions from components in the pipeline accurately. We intend to explore the results from more hyperparameter optimization algorithms in the future. The agnostic error contribution methodology may be used by machine learning practitioners to understand and interpret the results of a specific machine learning problem on a dataset. Understanding the sources of contribution will help data scientists iterate quickly over pipelines, algorithms and hyperparameters and find the best set of configurations for a particular task by focusing on the important components of the pipeline identified from the results. Domain experts like biologists and scientists from different disciplines can use this method to understand and interpret the error in terms of the pipeline used to solve their specific problem. In terms of future work, this formulation could be expanded to cover more data analysis problems and more algorithms. The error quantification framework may be used by any practitioner who works with pipelines for solving a machine learning problem. The pipeline may even be extended to include processes like data cleaning and data normalization.
This framework may also be used to understand and interpret deep neural networks, which are more end-to-end in nature. It may be used for comparing the performance of candidate networks for solving the problem.