
PipelineProfiler: A Visual Analytics Tool for the Exploration of AutoML Pipelines

by   Jorge Piazentin Ono, et al.
NYU college

In recent years, a wide variety of automated machine learning (AutoML) methods have been proposed to search and generate end-to-end learning pipelines. While these techniques facilitate the creation of models for real-world applications, given their black-box nature, the complexity of the underlying algorithms, and the large number of pipelines they derive, it is difficult for their developers to debug these systems. It is also challenging for machine learning experts to select an AutoML system that is well suited for a given problem or class of problems. In this paper, we present the PipelineProfiler, an interactive visualization tool that allows the exploration and comparison of the solution space of machine learning (ML) pipelines produced by AutoML systems. PipelineProfiler is integrated with Jupyter Notebook and can be used together with common data science tools to enable a rich set of analyses of the ML pipelines and provide insights about the algorithms that generated them. We demonstrate the utility of our tool through several use cases where PipelineProfiler is used to better understand and improve a real-world AutoML system. Furthermore, we validate our approach by presenting a detailed analysis of a think-aloud experiment with six data scientists who develop and evaluate AutoML tools.



1 Related Work

AutoML has emerged as an approach to simplify the use of ML for different applications, and many systems that support AutoML are currently available [hutter_automated_2019, google_cloud_2020, mljar_machine_2020, ibm_watson_2020, drori_alphad3m:_2018, olson_tpot_2019, feurer_auto-sklearn_2019]. While early work on AutoML focused on hyperparameter optimization and on ML primitives, recent approaches aim to efficiently automate the synthesis of end-to-end pipelines, covering data loading, pre-processing, feature extraction, feature selection, model fitting and selection, and hyperparameter tuning [swearingen_ATM_2017, drori_alphad3m:_2018, shang_democratizing_ml_2019, feurer_auto-sklearn_2019, olson_tpot_2019]. AlphaD3M uses deep learning to learn how to incrementally construct ML pipelines, framing the problem of pipeline synthesis for model discovery as a single-player game with a neural network sequence model and Monte Carlo Tree Search (MCTS) [drori_alphad3m:_2018]. Auto-sklearn produces pipelines using Bayesian optimization combined with meta-learning [feurer_auto-sklearn_2019], and TPOT uses genetic programming and tree-based optimization [olson_tpot_2019].

During the search for well-performing pipelines, AutoML systems can generate a large number of pipelines. This makes the analysis of the search results a challenging problem, in particular when pipelines have similar scores. As a consequence, selecting the best-performing end-to-end pipeline becomes an expensive, time-consuming and tedious process. In the following, we discuss research that has attempted to tackle these challenges through visualization.

Explaining AutoML. The black box nature of AutoML systems and the difficulty in understanding the inner-workings of these systems lead to reduced trust in the pipelines they produce [wang_atmseer:_2019]. There have been some attempts to make the AutoML process more transparent through insightful visualizations of the resulting pipelines. These can be roughly grouped into two categories: hyperparameter visualization [wang_visual_2019, park_visualhypertuner_2019, golovin_google_2017] and pipeline visualization [weidele_autoaiviz_2020, cashman_ablate_2019].

ATMSeer [wang_atmseer:_2019] and Google Vizier [golovin_google_2017] are approaches to visualize hyperparameters. ATMSeer, which is integrated with the ATM AutoML framework [swearingen_ATM_2017], displays the predictive model (i.e., the last step of the ML pipeline) together with its hyperparameters and performance metrics. Users can use crossfiltering (e.g., over ML algorithms) to facilitate the exploration of large collections of pipelines and refine the search space of the AutoML system, if needed. Google Vizier makes use of a parallel coordinates view in which hyperparameters and objective functions are displayed. It allows users to examine how the different hyperparameters (dimensions) co-vary with each other and with the objective function. Although these methods help AutoML users analyze the generated pipelines, most of them only support the analysis of the last step of a pipeline, the fitted model, leaving aside important aspects such as data cleaning and feature engineering. In contrast, PipelineProfiler provides a visualization to explore and analyze the end-to-end pipeline, from data ingestion and engineering to model generation.

Systems that support pipeline visualization include AutoAIViz [weidele_autoaiviz_2020] and REMAP [cashman_ablate_2019]. AutoAIViz uses Conditional Parallel Coordinates (PCP) [weidele_conditional_2019] to represent sequential pipelines and their hyperparameters. The system provides a hierarchical visualization which shows the pipeline steps (on the first level of PCP) as well as the hyperparameters of each step (on the second level of PCP). REMAP [cashman_ablate_2019] focuses on pipelines that use deep neural networks. It proposes a new glyph, called Sequential Neural Architecture Chips (SNAC), which shows both the layer type and dimensionality, and allows users to interactively add or remove layers from the networks. Despite their ability to show end-to-end pipelines, both systems can only show linear pipelines, making it difficult to explore pipelines created by different AutoML systems that have more complex structure. Furthermore, REMAP was designed specifically to visualize neural network architectures, and thus it is not suitable to explore general ML pipelines that use different learning techniques. In this work, our goal is to allow users to explore, compare and analyze pipelines generated by multiple AutoML systems which can have nonlinear pipeline structures and use a variety of primitives and learning techniques.

Visual Analytics for Model Selection. Selecting a good model among the potentially large set of models (or pipelines) derived by an AutoML system is a challenging problem that has attracted significant attention in the literature. Visual analytics systems such as Visus [santos_visus_2019], TwoRavens [gil_tworavens_2019], and Snowcat [cashman_snowcat_2018] provide a front-end to AutoML systems and guide subject-matter experts without training in data science through the model selection process. They focus on providing explanations for models, and some provide simple mechanisms to compare models (e.g., based on the scores, or the actual explanations). Other approaches focus exclusively on model selection. RegressionExplorer [dingen_regressionexplorer:_2018] enables the creation, evaluation and comparison of logistic regression models using subgroup analysis. Square introduces a novel encoding to investigate different models by visually comparing multiple histograms based on the statistical performance metrics of a multi-class classifier. It also shows instance-level distribution information. Similarly, ModelTracker [amershi_modeltracker:_2015], an interactive instance-based visualization, enables multi-scale analysis. It visualizes predictions, supporting both aggregate and instance-level performance information while enabling direct inspection of the data. The majority of these methods were designed to evaluate and select predictive models based on performance results. However, they do not take into account additional metrics such as running time (how long pipelines take to run) or primitive usage (whether primitives are used correctly and effectively). PipelineProfiler not only encodes this information in a compact visual representation, but also provides a usable interface that allows users to interact with a pipeline collection at different levels of abstraction, from a high-level overview down to the details of selected pipelines.

Interactive Model Steering. Systems such as TreePOD [muhlbacher_treepod_2018], BEAMES [das_beames_2019] and EnsembleMatrix [talbot_ensemblematrix:_2009] support the analysis and refinement of models through interactive visualizations, and allow users to explore the effects of modifying some parameters. BEAMES [das_beames_2019] lets users steer the training of new regression models from a set of previous models. It presents model performance information to the users and trains new models based on the user feedback. Users can modify feature/sample importance, and combine multiple models to create ensembles. EnsembleMatrix [talbot_ensemblematrix:_2009] enables users to steer the creation of ensemble decision tree models. With EnsembleMatrix, users can combine and choose weights for decision trees and interactively evaluate the performance of the ensemble model. TreePOD [muhlbacher_treepod_2018] supports the creation of decision trees with multiple objectives, including performance and interpretability. Users can look at the optimization procedure and guide it so that simpler solutions are found. These systems frequently support model steering by letting users try different settings during the model construction process. Although they help users understand the impact of these parameters on the models, they do not consider other relevant steps (also called primitives) that are part of end-to-end pipelines, such as data ingestion and feature engineering, which could have a significant impact on the final model performance. The Primitive Contribution view in PipelineProfiler displays the correlations between primitive usage and scores, allowing users to infer which primitives can lead to well-performing pipelines. Users can then drill down and further explore individual pipelines, their primitives and hyperparameters.

2 PipelineProfiler: exploring end-to-end ML pipelines

In this section, we describe PipelineProfiler, a tool that enables the exploration of end-to-end machine learning pipelines produced by AutoML systems. We first present the desiderata we distilled from interviews with AutoML experts and subsequently used to guide our design choices. Then, we describe the components of PipelineProfiler, how they are integrated, and the algorithms we developed to enable the effective analysis of ML pipelines. Finally, we briefly describe the implementation details of our system.

2.1 Domain Requirements

We conducted interviews with six data scientists who actively work with AutoML systems in the context of the D3M project [elliott_data-driven_2020]: the developers of four distinct AutoML systems (D1 - D4) and two data scientists that are tasked with evaluating the D3M AutoML systems (E1 and E2). Since each developer works on a specific system, they have different needs and follow distinct workflows. However, they share some challenges. The AutoML evaluators are part of the D3M management team. They are responsible for selecting what types of ML tasks the developers must focus on and also evaluate system performance.

D3M pipelines are represented as JSON-serialized objects that contain metadata, input and output information, and the pipeline architecture, which is described as a directed acyclic graph (DAG) [milutinovic_2019, d3m_datadrivendiscovery_2020]. The exploration of ML pipeline collections is a task performed by all AutoML developers and evaluators. All interviewees said they explored pipelines by looking at their JSON representations, and complained that reading the text files and inspecting the pipelines one at a time was tedious and time-consuming. Understanding and comparing the pipelines is difficult, in particular because the DAG structure is hard to grasp from the JSON representation.

D1 said she does not have time to inspect pipelines often, and instead focuses on assessing cross-validation scores and looking for correlations between primitives and performance scores. In contrast, D2, D3 and D4 examine the pipeline DAGs and their architecture. D1 and D2 also analyze the prediction and training time. They mentioned that their AutoML systems were evaluated within a given time budget, so training time is an important metric for them.

D2’s system has a blacklisting feature: when a primitive is found to have poor performance, it can be flagged and excluded from the search process. Therefore, he was interested in identifying when a primitive was associated with high and low scores for pipelines.

D3 usually compares pipelines using their cross-validation scores. When he finds a problem for which his system derives sub-optimal pipelines, he inspects the pipelines derived by other systems. His goal is to understand which features in his pipelines lead to the low scores, and conversely, why the pipelines derived by the other systems perform better. By answering these questions, he hopes to gain insights into if and how he can improve his system. He is also interested in exploring the pipelines at the hyperparameter level, but said this is currently not possible due to the large number of primitives (over 300), pipelines, and parameters involved.

D4 is also interested in comparing pipelines, albeit for a different reason. More specifically, he is interested in comparing AutoML pipelines from different sources, including human generated pipelines. His goal is to evaluate if there are differences between machine- and human-generated pipelines. He is also interested in primitive similarity. More specifically, he wants to find which primitives are exchangeable within a pipeline architecture.

The analysis workflow followed by the AutoML evaluators is significantly different from that of the developers. While developers focus on pipeline structure, evaluators are mostly concerned with how well the systems perform and the problem types (e.g., classification, regression, forecasting, object detection, etc.) they currently support and should support in future iterations. More specifically, E1 and E2 said that their workflow consisted mostly of evaluating AutoML systems based on their cross-validation scores. However, they were also interested in checking how the primitives were being used, and whether AutoML systems produced different pipelines for a given problem type. In particular, they stated that if all AutoML systems derived the same (or very similar) pipelines, the task they are solving is no longer challenging and new problem types should be proposed. For formal evaluations, the D3M systems are evaluated using sequestered problems that are not visible to the developers. Thus, to give actionable insight to AutoML developers without disclosing specifics of the sequestered problems, E2 was also interested in identifying why pipelines fail.

We compiled the following desiderata from the interviews:

  R1. Pipeline collection overview and summary: all participants would like to visualize and compare multiple pipelines simultaneously, instead of inspecting them one by one.

  R2. Primitive usage: E1 and E2 are interested in exploring how primitives are used across different AutoML systems. More specifically, they want to check if the systems are generating diverse solutions and if there are underutilized primitives.

  R3. Visualizing primitive hyperparameters: D3 would like to be able to explore the hyperparameter space of the primitives used in his pipelines.

  R4. Visualizing pipeline metadata: D1, D2, E1 and E2 mentioned they were interested in visualizing and comparing different aspects of the trained pipelines, including scores, prediction and training time.

  R5. Finding correlations between primitives and scores: D1 and D2 were interested in identifying primitives that correlate with high scores on different problems and datasets. Furthermore, D2 would like to see primitives that perform poorly in order to blacklist them, and E2 is interested in identifying possible causes for pipeline failure (i.e., low scores).

  R6. Visualizing and comparing pipeline graphs: all developers were interested in visualizing the connections between pipeline primitives using a graph metaphor. Furthermore, D3 and D4 are interested in performing a detailed comparison of the pipeline graphs. In particular, they want to identify how different AutoML systems structure their pipelines to solve a particular problem type.

2.2 Visualization Design

In order to fulfill the requirements identified in the previous section, we developed PipelineProfiler, a tool that enables the interactive exploration of pipelines generated by AutoML systems. The teaser figure shows PipelineProfiler being applied to compare pipelines derived by three distinct AutoML systems for a classification problem that aims to predict the presence of heart disease using the Statlog (Heart) Data Set [dua2017uci]. The main components of PipelineProfiler are the Pipeline Matrix (C) and the Pipeline Comparison View (D). The Pipeline Matrix (C) shows a tabular summary of all the pipelines in the collection. The user can also drill down and explore one or multiple pipelines in more detail: the graph structure of the selected pipelines is displayed in the Pipeline Comparison View (D) upon request. The system Menu (B) enables users to focus on a subset of the pipelines, export pipelines of interest to Python, sort the table rows and columns, and perform automated analyses over groups of primitives. These operations are described later in this section. PipelineProfiler is implemented as a Python library that can be used with Jupyter Notebooks to facilitate the integration with the workflow of the AutoML community (A).

Pipeline Matrix

The Pipeline Matrix provides a summary for a collection of machine learning pipelines [R1] selected by the user. Its visual encoding was inspired by visualizations used for topic modeling systems [chuang2012termite, alexander2014serendip]. However, instead of words and documents, this matrix represents whether a primitive is used [R2] in a machine learning pipeline (teaser figure, C1). Users can interactively reorder rows and columns according to pipeline evaluation score, pipeline source (the AutoML system that generated it), primitive type (e.g., classification, regression, feature extraction, etc.), and estimated primitive contribution (i.e., correlation of primitive usage with pipeline scores). Furthermore, we use shape to encode primitive types. For example, pre-processing primitives are represented by a circle, while a plus sign is used to represent feature extraction primitives (see the legend in the top-right corner of the teaser figure).
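The data behind such a matrix can be sketched in a few lines of Python. This is only an illustration of the encoding, not PipelineProfiler's code: the toy collection, the primitive names and the `build_usage_matrix` helper are all hypothetical, and rows are sorted by score as one of the orderings mentioned above.

```python
# Hypothetical toy collection: pipeline id -> (score, set of primitive names).
pipelines = {
    "p1": (0.81, {"Imputer", "Min Max Scaler", "Random Forest"}),
    "p2": (0.74, {"Imputer", "PCA", "SVM"}),
    "p3": (0.62, {"Min Max Scaler", "SVM"}),
}


def build_usage_matrix(pipelines):
    """Return (row_ids, column_names, matrix), where matrix[i][j] is 1
    when pipeline row_ids[i] uses primitive column_names[j]."""
    # Rows: pipelines ordered by descending score.
    row_ids = sorted(pipelines, key=lambda pid: pipelines[pid][0], reverse=True)
    # Columns: every primitive seen in the collection (alphabetical order
    # is an arbitrary choice made here to keep the output deterministic).
    column_names = sorted(set().union(*(prims for _, prims in pipelines.values())))
    matrix = [
        [1 if name in pipelines[pid][1] else 0 for name in column_names]
        for pid in row_ids
    ]
    return row_ids, column_names, matrix


row_ids, column_names, matrix = build_usage_matrix(pipelines)
for pid, row in zip(row_ids, matrix):
    print(pid, row)
```

Each row of `matrix` corresponds to one pipeline and each column to one primitive, which is the layout the reordering interactions operate on.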

To support the exploration of hyperparameters [R3], PipelineProfiler implements two interactions that show this information on demand: parameter tooltip and one-hot-encoded parameter matrix. When the user hovers over a cell in the matrix, a tooltip shows the primitive metadata (type and Python path) as well as a table with all the hyperparameters set. The teaser figure (C2) shows a tooltip for primitive Denormalize, with four hyperparameter values set. Users can also inspect a summary of the hyperparameter space for a primitive by selecting a column in the Pipeline Matrix. When a primitive (column) is selected, all of its hyperparameters are represented using a one-hot-encoding approach: each hyperparameter value becomes a column in the matrix, and dots indicate when the hyperparameter is set in a pipeline. The teaser figure (C3) shows the hyperparameter space of Xgboost Gbtree.
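A minimal sketch of the one-hot encoding described above, assuming each pipeline's settings for the selected primitive are available as a plain dictionary. The hyperparameter names and values below are made up for illustration, and `one_hot_hyperparameters` is a hypothetical helper rather than part of the tool.

```python
def one_hot_hyperparameters(configs):
    """configs: one dict per pipeline with the hyperparameters set for a
    single primitive. Returns (columns, rows): each column is a
    (name, value) pair and rows[i][j] is 1 if pipeline i sets that pair."""
    # repr() makes values of different types comparable and sortable.
    keyed = [{(name, repr(value)) for name, value in cfg.items()} for cfg in configs]
    columns = sorted(set().union(*keyed))
    rows = [[1 if col in pairs else 0 for col in columns] for pairs in keyed]
    return columns, rows


# Hypothetical settings of one primitive in three pipelines.
configs = [
    {"max_depth": 3, "learning_rate": 0.1},
    {"max_depth": 6, "learning_rate": 0.1},
    {"max_depth": 3},
]
columns, rows = one_hot_hyperparameters(configs)
for row in rows:
    print(row)
```

Every distinct (hyperparameter, value) pair becomes its own column, so two pipelines that differ only in `max_depth` differ in exactly two cells.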

Domain experts were interested in exploring pipeline metadata [R4], including training and testing scores, training time and execution time. PipelineProfiler shows the pipeline metadata in the Metric View (teaser figure, C4). Users can select which metric to display using a drop-down menu, and the numerical values are shown in a bar chart aligned with the matrix rows. In C4, the user can choose to display the metric F1 or the prediction time. Pipeline rows can be re-ordered based on the metric, and to enable a comparison across systems, users can also interactively group pipelines by the system that generated them.

To convey information about the relationships between primitive usage and pipeline scores [R5], we designed the Primitive Contribution view. This view shows an estimate of how much a primitive contributes to the score of the pipeline using a bar chart encoding, aligned with the columns of the matrix (teaser figure, C5). The contribution can be either positive or negative, representing positive or negative primitive correlation with the scores. For example, in C5, Deep Feature Synthesis is the primitive most highly correlated with F1.

We estimate the primitive contribution using the Pearson correlation between the primitive indicator vector $p$ ($p_i = 1$ if pipeline $i$ contains the primitive in question and $p_i = 0$ otherwise) and the pipeline metric vector $m$, where $m_i$ is the metric score for pipeline $i$. Since $p$ is dichotomous and $m$ is quantitative, the Pearson correlation can be computed more efficiently with the Point-Biserial Correlation (PBC) coefficient [sheskin2003handbook]. PBC is equivalent to the Pearson correlation, but can be evaluated with fewer operations. Let $\bar{m}_1$ be the mean of the metric score ($m_i$) when the primitive is used ($p_i = 1$); $\bar{m}_0$, the mean of the scores when the primitive is not used ($p_i = 0$); $s$ be the standard deviation of all the scores ($m$); $n_1$ be the number of pipelines where the primitive is used; $n_0$ be the number of pipelines where the primitive is not used; and $n = n_1 + n_0$. The point-biserial correlation is computed as:

$$ r_{pb} = \frac{\bar{m}_1 - \bar{m}_0}{s} \sqrt{\frac{n_1 n_0}{n^2}} $$
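The following sketch computes the point-biserial coefficient for one primitive and checks it against a direct Pearson computation. It assumes the standard form of the coefficient with the population standard deviation (dividing by the number of pipelines), and the handling of degenerate inputs (a primitive used by all or none of the pipelines, or constant scores) is a choice made here, not something stated in the text. The usage vector and scores are invented.

```python
import math


def point_biserial(p, m):
    """Point-biserial correlation between a 0/1 usage vector p and scores m."""
    used = [score for flag, score in zip(p, m) if flag == 1]
    unused = [score for flag, score in zip(p, m) if flag == 0]
    n1, n0, n = len(used), len(unused), len(m)
    if n1 == 0 or n0 == 0:
        return 0.0  # assumption: no contrast between groups, report 0
    mean_all = sum(m) / n
    # Population standard deviation of all the scores.
    s = math.sqrt(sum((x - mean_all) ** 2 for x in m) / n)
    if s == 0:
        return 0.0  # assumption: constant scores carry no signal
    m1, m0 = sum(used) / n1, sum(unused) / n0
    return (m1 - m0) / s * math.sqrt(n1 * n0 / n ** 2)


def pearson(x, y):
    """Plain Pearson correlation, used only to check the equivalence."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented example: six pipelines, three of which use the primitive.
usage = [1, 1, 0, 0, 1, 0]
scores = [0.90, 0.85, 0.60, 0.55, 0.80, 0.70]
print(point_biserial(usage, scores), pearson(usage, scores))
```

The two printed values agree up to floating-point error, while `point_biserial` needs only the two group means, the group sizes and one standard deviation that can be shared across all primitives.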

Pipeline Comparison View

To provide a concise summary of a collection of pipelines, the Pipeline Matrix models the pipelines as a set of primitives that can be effectively displayed in a matrix. However, while analyzing pipeline collections, AutoML developers also need to examine and compare the graph structure of the pipelines [R6]. The Pipeline Comparison View (teaser figure, D) consists of a node-link diagram that shows either an individual pipeline or a visual-difference summary of multiple pipelines selected in the matrix representation. In the summary graph, each primitive (node) is color-coded to indicate the pipeline where it appears. If a primitive is present in multiple pipelines, all corresponding colors are displayed. If a primitive appears in all selected pipelines, no color is displayed.

The Pipeline Comparison View enables users to quickly identify similarities and differences across pipelines. Fig. 1 shows the best (a) and worst (b) pipelines solving the 20 newsgroups classification problem [lang1995newsweeder], and a merged pipeline (c) that highlights the differences between the two pipelines, clearly showing that the best pipeline (blue) uses a Gradient Boosting classifier and the HDP and Text Reader feature extractors.

(a) Best performing pipeline   (F1 Macro: 0.45)
(b) Worst performing pipeline   (F1 Macro: 0.06)
(c) Merged pipeline
Figure 1: Pipeline Comparison View, showing the best and worst pipelines for a multitask classification problem on the 20 newsgroups dataset. (a) and (b) show individual pipeline structure for the best and worst pipelines respectively. (c) presents the merged view of both pipelines, highlighting the differences between them using color-coded headers. In the merged pipeline, blue headers represent primitives that only appear in (a) and orange headers, primitives that only appear in (b). Primitives without a color-coded header are shared by both pipelines.

To support the comparison of multiple pipeline structures, we adapted the graph summarization method proposed by Koop et al. [koop_visual_2013]. Since ML pipelines are directed acyclic graphs, we modify the method to avoid cycles in the merged graph. The algorithm creates a summary graph by iteratively merging graph pairs. The merge of two graphs $G_1$ and $G_2$ is performed in four steps, as detailed below.

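1) Computing Node Similarity: Let $a$ and $b$ be two primitives (nodes). We say that $a$ and $b$ have the same type if they perform the same data manipulation (e.g., Classification, Regression, Feature Extraction, etc.). The similarity $s(a, b)$ is defined from this comparison: it is highest for a total match (the same primitive), lower for a partial match (the same type), and lowest otherwise.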

As in Koop et al. [koop_visual_2013], we use the Similarity Flooding algorithm [melnik2002similarity] to iteratively adjust the similarity between nodes and take node connectivity into account. We refer the reader to [melnik2002similarity] for details.

2) Graph Edit Matrix Construction: In order to match two graphs, $G_1$ and $G_2$, the algorithm builds a graph edit matrix $C$ that contains all the possible costs to transform $G_1$ into $G_2$. Let $n$ and $m$ be the number of nodes in $G_1$ and $G_2$ respectively. The edit matrix is defined so that the selection of one entry from every row and one entry from every column corresponds to a graph edit that transforms $G_1$ into $G_2$ [riesen2009approximate]. $C$ contains the costs to add, delete and substitute nodes. We choose costs that prioritize node substitutions in case of a total or partial match.

3) Node matching: We use the Hungarian algorithm [kuhn1955hungarian] to select one entry of every row and one entry of every column of the edit matrix $C$, while minimizing the total cost of the graph edit. Two nodes match when one can be substituted by the other, i.e., their substitution entry is selected from the matrix.

4) Graph merging: We merge the two graphs $G_1$ and $G_2$ by creating a compound node for every pair of nodes that were matched in step 3. However, since machine learning pipelines are directed acyclic graphs, we do not want the merged graph to have cycles either. Therefore, we add the constraint that nodes are merged only if the merge does not introduce a cycle in the merged graph. This check is done using a depth-first search after each merge.
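Steps 2 to 4 can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the text: the similarity scores (1.0, 0.5, 0.0), the add/delete cost of 0.4 and the substitution cost of one minus the similarity are illustrative values; the Similarity Flooding refinement is skipped; and a brute-force search over permutations stands in for the Hungarian algorithm, which is only practical for tiny graphs. Graphs are dictionaries mapping each node, a hypothetical `(name, type)` pair, to its set of successors, and every node must appear as a key.

```python
from itertools import permutations

INF = float("inf")
EDIT_COST = 0.4  # assumed cost for adding or deleting a node


def similarity(a, b):
    """a and b are (name, type) pairs; the scores are assumed values."""
    if a == b:
        return 1.0                       # total match: same primitive
    return 0.5 if a[1] == b[1] else 0.0  # partial match: same type


def edit_matrix(nodes1, nodes2):
    """Square (n+m) x (n+m) cost matrix in the style of Riesen and Bunke."""
    n, m = len(nodes1), len(nodes2)
    C = [[0.0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(m):
            C[i][j] = 1.0 - similarity(nodes1[i], nodes2[j])  # substitute
        for k in range(n):
            C[i][m + k] = EDIT_COST if i == k else INF        # delete
    for k in range(m):
        for j in range(m):
            C[n + k][j] = EDIT_COST if k == j else INF        # add
    return C  # the bottom-right block stays at zero cost


def match_nodes(nodes1, nodes2):
    """Minimum-cost assignment by brute force (stand-in for Hungarian)."""
    C = edit_matrix(nodes1, nodes2)
    best = min(permutations(range(len(C))),
               key=lambda perm: sum(C[r][c] for r, c in enumerate(perm)))
    return [(nodes1[i], nodes2[best[i]])
            for i in range(len(nodes1)) if best[i] < len(nodes2)]


def has_cycle(graph):
    """Depth-first search cycle check on {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def contract(graph, keep, drop):
    """Return a copy of graph in which node `drop` is folded into `keep`."""
    merged = {}
    for node, succs in graph.items():
        target = keep if node == drop else node
        merged.setdefault(target, set()).update(
            keep if s == drop else s for s in succs)
    return merged


def merge_graphs(g1, g2, matches):
    # Disjoint union; node ids are tagged with the graph they come from.
    union = {("g1", v): {("g1", s) for s in succ} for v, succ in g1.items()}
    union.update({("g2", v): {("g2", s) for s in succ} for v, succ in g2.items()})
    for a, b in matches:
        candidate = contract(union, ("g1", a), ("g2", b))
        if not has_cycle(candidate):  # keep the merge only if still a DAG
            union = candidate
    return union


IMP = ("Imputer", "Preprocessing")
PCA = ("PCA", "Feature Extraction")
SVM = ("SVM", "Classification")
GB = ("Gradient Boosting", "Classification")
g1 = {IMP: {PCA}, PCA: {SVM}, SVM: set()}
g2 = {IMP: {GB}, GB: set()}

matches = match_nodes(list(g1), list(g2))
merged = merge_graphs(g1, g2, matches)
print(matches)
print(merged)
```

With these assumed costs, the shared Imputer is matched exactly, the two classifiers are matched as a partial match, and PCA is left unmatched because deleting it is cheaper than substituting it for a node of a different type.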

Combined-Primitive Contribution

The primitive contribution presented in the previous section does not take into account primitive interactions. For example, it might be the case that for a given problem, the classification algorithm SVM and the pre-processing primitive PCA together produce good models, but lead to low-scoring pipelines when used independently. Because the contribution is estimated with the Point-Biserial Correlation of the binary primitive usage vector and pipeline score, interactions involving multiple primitives are not considered.

To take all primitive interactions into consideration, it would be necessary to check the correlations of all the primitive groups in the powerset of our primitive space. This strategy has two critical problems: 1) it is not computationally tractable, and 2) it would result in a number of combinations prohibitively large for users to inspect. To tackle this challenge, we propose a new algorithm, Combined-Primitive Contribution (CPC), to identify groups of primitives strongly correlated with pipeline scores. The algorithm works as follows. For every combination of primitives up to a predefined constant size: 1) create a new primitive indicator vector $p_S$, which contains 1 if the whole set of primitives is used in the pipeline, and 0 otherwise; 2) compute the correlation of the primitive group with the pipeline scores using the Point-Biserial Correlation (comparing the pipelines that have the combination of primitives against those that do not); 3) select which combinations of primitives to report to the user. We only report a primitive group if its correlation is greater than the correlation of every other element in its powerset. Algorithm 1 describes CPC in detail.

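Require: $I = \{p_1, \dots, p_N\}$, the primitive indicator vectors; $m$, the evaluation metric score vector
Require: $K$, the maximal cardinality of the primitive group
  // Computing correlations for all groups of primitives up to size K
  // $P_{\le K}(I)$ is the powerset of $I$ up to cardinality $K$
  for $S \in P_{\le K}(I)$ do
     $p_S \leftarrow \bigwedge_{p \in S} p$
     $c_S \leftarrow \mathrm{PBC}(p_S, m)$
  end for
  // Selecting the combination of primitives to report to the user (R)
  $R \leftarrow \emptyset$
  // $P_{2..K}(I)$ is the powerset of $I$ of cardinality $2 \le |S| \le K$
  for $S \in P_{2..K}(I)$ do
     report $\leftarrow$ true
     // Checks if there is a subset of S with greater contribution
     for $S' \in P(S) \setminus \{\emptyset, S\}$ do
        if $c_{S'} \ge c_S$ then
           report $\leftarrow$ false
        end if
     end for
     if report = true then
        $R \leftarrow R \cup \{S\}$
     end if
  end for
  return R
Algorithm 1 Combined-Primitive Contribution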

The idea behind CPC is simple. The algorithm checks the correlation between combinations of primitives and the pipeline scores, and reports surprising combinations to the user (correlations not shown in the Primitive Contribution View). The user defines the maximal group size $K$ (in our tests, we found that a small $K$ is effective). If there are $N$ primitives, the algorithm evaluates $\sum_{k=1}^{K} \binom{N}{k}$ groups of primitives, so its running time grows as $O(N^K)$. In PipelineProfiler, CPC can be run via the “Combinatorial Analysis” menu (teaser figure, A). When the algorithm is run, we show a table containing the selected groups of primitives and the correlation values. Fig. 2 shows an example of a CPC run over pipelines derived to perform a classification task using the Diabetes dataset [dua2017uci].
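A compact Python sketch of the procedure, written from the description above rather than from the tool's source. The function names, the input layout (a dictionary from primitive name to its 0/1 usage vector) and the convention of returning 0 for undefined correlations are assumptions. The correlation is computed with the plain Pearson formula, which gives the same value as the point-biserial coefficient for a binary vector. The data is invented to mimic the situation of Fig. 2, where two primitives score well only when used together.

```python
import math
from itertools import combinations


def correlation(p, m):
    """Pearson correlation of a 0/1 vector p with scores m."""
    n = len(m)
    mean_p, mean_m = sum(p) / n, sum(m) / n
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_m = sum((b - mean_m) ** 2 for b in m)
    if var_p == 0 or var_m == 0:
        return 0.0  # assumption: undefined correlations are treated as 0
    cov = sum((a - mean_p) * (b - mean_m) for a, b in zip(p, m))
    return cov / math.sqrt(var_p * var_m)


def combined_primitive_contribution(indicators, scores, K):
    """indicators: {primitive name: 0/1 list over pipelines}.
    Returns {group: correlation} for the groups worth reporting."""
    names = sorted(indicators)
    corr = {}
    # Correlations for all groups of primitives up to size K.
    for k in range(1, K + 1):
        for group in combinations(names, k):
            # 1 only where every primitive of the group is used.
            joint = [int(all(indicators[name][i] for name in group))
                     for i in range(len(scores))]
            corr[group] = correlation(joint, scores)
    report = {}
    for group, value in corr.items():
        if len(group) < 2:
            continue  # single primitives are already shown elsewhere
        proper_subsets = (sub for k in range(1, len(group))
                          for sub in combinations(group, k))
        # Keep the group only if it beats every proper subset.
        if all(corr[sub] < value for sub in proper_subsets):
            report[group] = value
    return report


# Invented data: the scaler and the sampler help only in combination.
indicators = {
    "Min Max Scaler": [1, 1, 1, 0, 0, 1],
    "RBF Sampler":    [1, 1, 0, 1, 1, 0],
    "SVM":            [1, 0, 1, 1, 0, 1],
}
scores = [0.80, 0.78, 0.60, 0.62, 0.61, 0.59]

report = combined_primitive_contribution(indicators, scores, K=3)
for group, value in report.items():
    print(group, round(value, 3))
```

On this toy input only the pair of Min Max Scaler and RBF Sampler is reported: its correlation exceeds that of each primitive alone, while the triple that adds SVM is discarded because one of its subsets correlates more strongly with the scores.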

Figure 2: Combined-Primitive Contribution applied to pipelines that solve a classification problem on the Diabetes [dua2017uci] dataset: A) The Pipeline Matrix representation of the pipeline collection. B) The Combinatorial Analysis View, showing a group of two primitives, Min Max Scaler and RBF Sampler, that correlate with higher F1 scores. The two primitives are highlighted in (A). Notice that pipelines with higher scores use both primitives (#1, #2) – pipelines that use them separately have lower scores (#3 - #9).

Implementation details

PipelineProfiler is implemented as a Python 3 library. The front-end is implemented in JavaScript with React [fedosejev2015react], D3 [bostock2011d3] and Dagre [cobarrubia2018dagrejs]. The back-end, responsible for data management, graph merging and the Jupyter Notebook hooks, is implemented in Python with NumPy [walt2011numpy] and NetworkX [hagberg2008exploring].

The PipelineProfiler library takes as input a Python array of pipelines in the D3M JSON [D3M2020metalearning] format, and plots the visualization in Jupyter using Jupyter Widgets hooks. We implemented bi-directional communication between Jupyter Notebook and our tool. From Jupyter, the user can create an instance of PipelineProfiler for their dataset of choice. The main menu of PipelineProfiler (teaser figure, B), on the other hand, enables users to subset the data (remove pipelines from the analysis), reorder pipelines according to different metrics, and export the selected pipelines back to Python. The goal of our design is to provide a seamless integration with the existing AutoML ecosystem and Python libraries, and to make it easier for experts to explore, subset, combine and compare results from multiple AutoML systems.

PipelineProfiler is already being used in production by the DARPA D3M project members. An open-source release is available at

Figure 3: Pipelines Matrix. A) Pipelines are sorted by performance, the 10 best pipelines at the top and the 10 worst pipelines at the bottom. Green and red boxes show the presence and absence, respectively, of feature selection primitives in the pipelines with their performances. B) Pipelines are sorted by execution time, only pipelines generated by AlphaD3M are displayed. Green and red boxes show the absence and presence, respectively, of one-hot encoder primitive in the pipelines with their execution times.

3 Evaluation

To demonstrate the usefulness of PipelineProfiler, we present case studies that use a collection containing 10,131 pipelines created as part of the D3M program’s Summer 2019 Evaluation. In this evaluation, 20 AutoML systems were run to solve various ML tasks (classification, regression, forecasting, graph matching, link prediction, object detection, etc.) over 40 datasets, which covered multiple data types (tabular, image, audio, time-series and graph). Each AutoML system was executed for one hour and derived zero or more pipelines for each dataset.

3.1 Case Study 1: Improving an AutoML System

To showcase how PipelineProfiler supports the effective exploration of AutoML-derived pipelines in a real-world scenario, we describe how an AlphaD3M developer used the system, the insights he obtained, and how these insights helped him improve AlphaD3M. AlphaD3M is an AutoML system based on reinforcement learning that uses a grammar (a set of primitive patterns) to reduce the search space of pipelines [drori_alphad3m:_2018, drori_alphad3m:_2019].

The AlphaD3M developer started his exploration with a problem on which AlphaD3M performed poorly: a multi-class classification task using the libras move dataset from the OpenML database [OpenML2013]. For this task, in the ranking of all pipelines produced by D3M AutoML systems, the best pipeline produced by AlphaD3M was ranked 18th with an accuracy score of 0.79.

Comparing pipeline patterns. The developer sought to identify common patterns in the best pipelines that were overlooked by the AlphaD3M search. To this end, he first sorted the primitives by type and the pipelines by performance. This uncovered useful patterns. As Fig. 3A shows, primitives for feature selection were frequently present in the best pipelines, while lower-scoring pipelines did not use these primitives. Although he identified other patterns, the information provided by the primitive contribution bar charts indicated that feature selection primitives had a large impact on the scores of the best pipelines. This information led the developer to hypothesize that feature selection primitives might be necessary for pipelines to perform well on this combination of problem and data.

Exploring execution times. The developer then analyzed the pipelines produced only by AlphaD3M. Fig. 3B clearly shows that pipelines containing one-hot encoding primitives take substantially longer to execute, approximately 10 seconds, whereas pipelines that do not use this primitive take less than 1 second. He also saw in the primitive contribution bar charts that the one-hot encoding primitive has the highest impact on the running time. Using this information, he realized that, for this specific dataset, one-hot encoding primitives were being applied needlessly, since all the features of the dataset were numeric. Since an AutoML system needs to evaluate a potentially large number of pipelines during its search, an order-of-magnitude difference in execution time, such as the one observed here, greatly limits its ability to find good pipelines within a limited time budget (1 hour for the summer evaluation).
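A back-of-the-envelope calculation shows what this difference means for the search. The sketch assumes that pipelines are evaluated one after another and that the whole budget is spent on evaluation, neither of which is stated in the text; the 1-second figure is used as a conservative stand-in for "less than 1 second".

```python
def max_evaluations(budget_seconds, seconds_per_pipeline):
    """Upper bound on sequential pipeline evaluations within a time budget."""
    return int(budget_seconds // seconds_per_pipeline)


ONE_HOUR = 60 * 60  # evaluation budget in seconds

with_one_hot = max_evaluations(ONE_HOUR, 10)    # about 10 s per pipeline
without_one_hot = max_evaluations(ONE_HOUR, 1)  # at most 1 s per pipeline

print(with_one_hot, without_one_hot)
```

Under these assumptions, the search can evaluate roughly ten times as many candidates once the unnecessary encoding step is dropped.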

Reducing the search space. AutoML systems have to deal with large search spaces. To synthesize pipelines, AlphaD3M takes into account over 300 primitives, which often means that the system is slow to derive good pipelines. An effective strategy to reduce the search space is the prioritization of primitives. Fig. 4 shows the results of the Combined-Primitive Contribution view, which indicate that the combination of the primitives Joint Mutual Information, Extra Trees and Imputer produces good results. Using this information, the expert realized that this sequence of primitives could be added to AlphaD3M’s grammar as static components in order to reduce the search space and, consequently, to produce good pipelines faster.

Using insights to improve AlphaD3M. After the analysis, the developer modified AlphaD3M’s handling of feature selection, its use of the one-hot encoding primitive, and its prioritization of primitives. Feature selection and prioritization of primitives were added to the AlphaD3M grammar, and rules were added to the workflow to apply one-hot encoding primitives only to categorical features. The new version of AlphaD3M now leads the ranking for the multi-class classification task on the libras move dataset with an accuracy of 0.88. With respect to execution time, the current average time to evaluate each pipeline for this problem is less than 1 second, while previously it took 10 seconds. As a point of comparison, whereas the best pipeline derived by AlphaD3M after 5 minutes of search previously had a score of 0.74, the best pipeline now has a score of 0.79.

Figure 4: Combined-Primitive Contribution: the combination of the primitives Joint Mutual Information, Extra Trees and Imputer produces good results.

3.2 Case Study 2: Exploring AutoML Approaches

AutoML systems in the D3M program use different approaches to generate pipelines. In this case study, we show the use of PipelineProfiler to analyze and compare systems, and discuss some valuable insights into features that impact a system’s performance for a problem. An AutoML developer set out to compare how six D3M systems, denoted by A, B, C, D, E and F, performed on a regression task using the cps 85 wages dataset (an OpenML dataset). The systems output a total of 114 pipelines after 1 hour. Since System F produced only one pipeline, it was excluded from the comparison. System A obtained the best performance, followed by System B, System C, System D, and System E, with mean squared errors of 20.28, 20.29, 20.68, 21.46 and 21.46, respectively. Using the Pipeline Comparison View, the developer could also easily see noticeable differences in the strategies used by the AutoML systems to construct the pipelines. We discuss this further below.

Template-based approaches. ML templates are manually designed to reduce the number of invalid pipelines during the search process. Although this approach reduces the search space, it also limits the exploration of potential pipelines. Fig. 5 shows the visual difference for the top-5 pipelines produced by System D. Note that they all have exactly the same structure and differ only in the estimator used (Ridge, Lars, Ada Boost, Elastic Net or Lasso). A similar behavior was observed for System E.

Figure 5: A visual comparison of pipelines produced by System D suggests that it fixes the pipeline structure and tries multiple regression algorithms.

Hyperparameter-tuning strategy. Analyses supported by PipelineProfiler can also provide insights into the strategies used by AutoML systems to tune hyperparameters. While exploring the pipelines produced by System C, the developer identified interesting patterns that uncover the strategy this system uses to tune hyperparameters. Using the Pipeline Comparison View, he noticed that some pipelines produced by this AutoML system had the same structure, but used different hyperparameter values. This is illustrated in Fig. 6A, which shows the merged graph for 4 distinct pipelines. He then inspected the hyperparameters of the XGBoost primitive using the one-hot-encoded parameter matrix, and observed that they had different values (see Fig. 6B). This suggests that System C first defines the structure of a pipeline, and then searches for the best-performing hyperparameter values. We note that changes in these values have an important impact on pipeline performance. For instance, the mean squared errors for the best and worst pipelines are 20.68 and 29.48, respectively.
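The same pattern can be looked for programmatically. The sketch below groups pipelines whose graphs are identical and reports the groups whose members differ in hyperparameter values. It is a simplification of the visual comparison: structures are compared by exact equality of the step list and edge list rather than by graph matching, hyperparameter values are assumed to be hashable scalars, and the record layout (`id`, `steps`, `primitive`, `hyperparams`, `edges`) is invented for the example.

```python
from collections import defaultdict


def structure_key(pipeline):
    """Hashable description of the graph: primitive names plus edges."""
    nodes = tuple(step["primitive"] for step in pipeline["steps"])
    edges = tuple(sorted(tuple(edge) for edge in pipeline["edges"]))
    return nodes, edges


def hyperparameter_key(pipeline):
    """Hashable description of every step's hyperparameter settings."""
    return tuple(
        tuple(sorted(step["hyperparams"].items()))
        for step in pipeline["steps"]
    )


def tuned_groups(pipelines):
    """Map each shared structure to ids of pipelines with distinct settings."""
    groups = defaultdict(list)
    for pipeline in pipelines:
        groups[structure_key(pipeline)].append(pipeline)
    result = {}
    for key, members in groups.items():
        settings = {hyperparameter_key(p) for p in members}
        # Keep structures that occur more than once with distinct settings.
        if len(members) > 1 and len(settings) > 1:
            result[key] = [p["id"] for p in members]
    return result
```

A non-empty result for a given system is consistent with a search that fixes the structure first and tunes hyperparameters afterwards, although it does not prove it.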

Figure 6: System C produced four pipelines (#1, #9, #15 and #17) with the same graph structure, as the merged graph (A) shows. Even though these pipelines have identical structures, the hyperparameter values for the Xgboost Gbtree primitive are different (B1), and this results in different scores for the pipelines (B2). This pattern suggests that System C tunes the hyperparameter values after it derives the pipeline structure.

Search over preprocessing primitives. By exploring another set of pipelines generated by System C (see Fig. 7), he observed that all pipelines use the same estimator – the XGBoost primitive, but the preprocessing primitives differ – Robust Scaler, Encoder and Extract Columns are used. This suggests that System C also searches over alternative preprocessing sequences for a given representative ML estimator, likely in an attempt to optimize the steps for data transformation and normalization.

Figure 7: A comparison of pipelines produced by System C indicates that, for a fixed regression algorithm (Xgboost), it searches for alternative sequences of preprocessing primitives.

Full-search approach. The approach applied by System A and System B seems to search over alternative preprocessing primitives as well as estimators. Fig. 8 shows the merged graph for the top-5 pipelines derived by System A. Note that these pipelines differ both in the structure and primitives used. Pipelines derived by System B display a similar behavior.

Figure 8: A comparison of pipelines produced by System A shows that these pipelines vary both in structure and in the primitives used, suggesting that it performs a broad search which considers multiple preprocessing sequences and different regression algorithms.

Although comparing AutoML approaches requires complex analyses, these case studies show that, by presenting an overview that highlights differences and similarities for a set of pipelines, the graph comparison view is quite effective at uncovering interesting patterns that provide insights into the search strategies employed by AutoML systems. Additional questions can be explored by drilling down into the details of the hyperparameter values.

The developer also compared the performance of the different systems. System E and System D produced the lowest-scoring pipelines, probably due to the unsophisticated search strategies they employ – the use of fixed templates may be pruning too much of the search space and ignoring efficient pipelines that do not follow the adopted template. On the other hand, System A and System B, which perform a broader search, created pipelines that had high scores.

3.3 Expert Interviews

To validate our design decisions, we conducted a second round of interviews with the six data scientists from the D3M project who had previously helped us identify the system requirements (Section 2.1): four AutoML Developers (D1-D4) and two AutoML Evaluators (E1, E2). In the experiment, experts were asked to explore a dataset of their choice according to their usual pipeline exploration workflows. Developers were asked to use the system to gain insights about the AutoML strategies and identify possible modifications that could result in system improvements. Evaluators were asked to use the tool to explore the produced pipelines in order to evaluate (and compare) the AutoML systems. They were also asked how this tool could be included in their current workflows to make them more effective.

Each interview took 45 minutes and proceeded as follows. We first presented our system to the participant and answered any questions they had (10 minutes). Then, we let them choose one problem from the D3M repository [D3M2020metalearning] to explore (30 minutes). Finally, we asked if the participant had any comments on the system (5 minutes). The problems used in this study are shown in Table 1. Note that the systems did not always produce good evaluation metrics for the chosen problems, indicating that these problems are challenging. For example, seven AutoML systems produced 115 pipelines for the Word Levels [guzey_classification_2014] classification problem, but no system achieved an F-score above 0.33.

The participants were free to use PipelineProfiler and explore the available pipelines. They were instructed to speak while using the system, following a “think aloud” protocol. While the participants performed the task, an investigator took notes on the actions performed. After completion, the participants filled out a questionnaire to express their impressions of the usability of the system. Participants received a US$20 gift card for their participation. In this section, we describe the insights gathered by the participants.

| Dataset | Dataset Type | Task Type | Metric | Mean Score | Score Range | # Pipelines | # Primitives | Participant |
| Auto MPG [dua2017uci] | Tabular | Regression | Mean Squared Error | | | 103 | 71 | D1 |
| Word Levels [guzey_classification_2014] | Tabular | Classification | F1 Macro | | | 115 | 69 | D2 |
| Sunspots [sidc2019sunspots] | Time Series | Forecasting | Root Mean Squared Error | | | 137 | 71 | D3 |
| Popular Kids [vanschoren_openml_2014] | Tabular | Classification | F1 Macro | | | 120 | 64 | D4 |
| Chlorine Concentration [chen_ucr_2015] | Time Series | Classification | F1 Macro | | | 47 | 48 | E1 |
| GPS Trajectories [dua2017uci] | Tabular | Classification | F1 Macro | | | 163 | 91 | E2 |
Table 1: Datasets used in the expert interviews

Expert Insights

Data preprocessing. Before they started the investigation, two developers removed outliers from their datasets. D1 and D3 selected datasets with a Mean Squared Error evaluation metric, which is unbounded in the positive real numbers. The two selected datasets, Auto MPG [dua2017uci] and Sunspots [sidc2019sunspots], had pipelines with error metrics above , which made the scales difficult to read. In both datasets, the data scientists looked at the Primitive Contribution View and noted that a problem with the SGD primitive was likely responsible for these high errors. They used the PipelineProfiler subset menu to remove these pipelines from the analysis.

Performance investigation. Most participants started the analysis by looking at the performance of the pipelines. All developers were interested in how their systems compared against the others. Evaluators, on the other hand, focused on the distribution of scores across all systems. For example, the first comparison E1 did was using the pipeline scores. She grouped pipelines by source and noticed the difference in scores among the top pipelines from each AutoML system. The top two AutoML systems had pipelines with F1 Scores of 0.78 and 0.70, which she mentioned were very close. The other systems produced pipelines with much lower scores, below 0.25.

Primitive comparison. Participants were very interested in comparing the pipelines produced by different systems. In particular, developers spent a considerable amount of time comparing pipelines from their systems against pipelines from the other tools. For example, D4 inspected a classification dataset and found that while a gradient boosting algorithm was used in the top-scoring pipelines, his system was using decision trees. The Primitive Contribution view confirmed his hypothesis that the use of gradient boosting was indeed correlated with high scores. He said that he could use this insight to drop and replace primitives in his AutoML search space. D1, D2 and D3 had similar findings in their pipeline investigations. Evaluators compared primitive usage for a different reason: they wanted to make sure AutoML systems were exploring the search space and the primitives available to their systems. For example, E1 noticed that the best AutoML system used a single classifier type on its pipelines, as opposed to other systems that had more diverse solutions. E2 did a similar analysis on his dataset.

Hyperparameter search strategy. D1 noticed that the top-five pipelines belonged to the same AutoML system and were nearly identical. She explored the hyperparameters of these pipelines using the one-hot-encoded hyperparameter view, and found that although they had the same graph structure, they were using different hyperparameters for the Gradient Boosting primitive. She compared this strategy with another system which did not tune many hyperparameters, and concluded that tuning parameters was beneficial for this problem.

Primitive correctness. Participants also used PipelineProfiler to check if primitives were being used correctly. A common finding was the unnecessary use of primitives. For example, D2 found that pipelines containing Cast to Type resulted in lower F1 scores. He inspected the hyperparameters of this primitive and noted that string features were being converted to float values (hashes). He concluded that string hashes were bad features for this classification problem, and the Cast to Type primitive should be removed from those pipelines. Similar findings were obtained with One Hot Encoder used in datasets with no categorical features (D3, E1, E2), and Imputer used on datasets with no missing data (D4, E1). E2 also found incorrect hyperparameter settings, such as the use of “warm-start=true” in a Random Forest primitive.
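Checks of this kind lend themselves to automation. The following sketch flags two of the cases reported by the participants (One Hot Encoder on a dataset without categorical features, and Imputer on a dataset without missing values). The function name, the primitive name strings, and the dataset summaries passed in are assumptions made for the illustration; they are not part of PipelineProfiler.

```python
def find_unneeded_primitives(primitives, column_types, missing_counts):
    """Flag preprocessing primitives that have nothing to act on.

    primitives: names of the primitives used by one pipeline.
    column_types: mapping column name -> "numeric", "categorical" or "text".
    missing_counts: mapping column name -> number of missing values.
    """
    has_categorical = any(t == "categorical" for t in column_types.values())
    has_missing = any(count > 0 for count in missing_counts.values())
    warnings = []
    if "One Hot Encoder" in primitives and not has_categorical:
        warnings.append("One Hot Encoder: dataset has no categorical features")
    if "Imputer" in primitives and not has_missing:
        warnings.append("Imputer: dataset has no missing values")
    return warnings


# Made-up dataset summary: all-numeric columns, nothing missing.
warnings = find_unneeded_primitives(
    primitives={"Imputer", "One Hot Encoder", "Random Forest"},
    column_types={"x1": "numeric", "x2": "numeric"},
    missing_counts={"x1": 0, "x2": 0},
)
```

Such a check only detects steps that cannot have an effect; whether a primitive hurts the score, as with Cast to Type above, still requires looking at the pipeline results.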

Execution time investigation. D4 checked the running times of the pipelines. In particular, he was interested in verifying whether the best pipelines took longer to run. First he sorted the pipelines by score. Then, he switched the displayed metric to “Time” and noticed that, contrary to his original hypothesis, the best pipelines were also the fastest. He looked at the Primitive Contribution View in order to find what primitives were responsible for the longer running times, and identified that the General Relational Dataset primitive was most likely the culprit. He concluded that if he removed this primitive, he would get a faster pipeline.

Expert Feedback

We received very positive feedback from the participants. They expressed interest in using PipelineProfiler for their work and suggested new features to improve the system. After the think-aloud experiment, they were asked if they had any additional comments or suggestions. Here are some of their answers:

  • D1 mentioned that PipelineProfiler is better than her current tools: “I think this is very useful, we are always trying to improve our pipelines. The pipeline scores can give you some scope, but this is doing it more comprehensively”.

  • D2 liked the debugging capabilities of PipelineProfiler: “Actually, with this tool we can infer what search strategies the AutoML is using. This tool is really nice to do reverse engineering”.

  • D3 particularly liked the integration with Jupyter Notebook: “I really liked this tool! It is very informative and easy to use. It works as a standalone tool without any coding, but I can make more specific/advanced queries with just a little bit of code.”

  • D4 wants to integrate PipelineProfiler into his development workflow: “The tool is great, and as I mentioned earlier, it would be even better if an API is provided to ingest the data automatically from our AutoML systems”. E1 and E2 were also interested in integrating this tool with their sequestered datasets, which are used for evaluation but not shared with the developers.

3.4 Usability

We evaluated the usability of PipelineProfiler using the System Usability Scale (SUS) [brooke1996sus], a valuable and robust tool for assessing the quality of system interfaces [bangor2008empirical]. In order to compute the SUS, we conducted a survey at the end of the second interview: we asked participants to fill out the standard SUS survey, grading each of the 10 statements on a scale from 1 (strongly disagree) to 5 (strongly agree). The SUS grades systems on a scale between 0 and 100 and our system obtained an average score of . According to Bangor et al. [bangor2008empirical], a mean SUS score above 80 is in the fourth quartile and is acceptable.
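For reference, the standard SUS scoring rule can be written as a short function: odd-numbered statements contribute the answer minus 1, even-numbered statements contribute 5 minus the answer, and the sum is multiplied by 2.5. The function names are chosen for this sketch, and the answers in the example are made up; they are not the responses collected in our study.

```python
def sus_score(responses):
    """Score one participant's 10 SUS answers (each 1..5) on the 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten answers between 1 and 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if item_number % 2 == 1 else (5 - answer)
    return total * 2.5


def mean_sus(all_responses):
    """Average the per-participant SUS scores."""
    return sum(sus_score(r) for r in all_responses) / len(all_responses)


# Made-up answers from two hypothetical participants.
example_mean = mean_sus([
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
])
```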

4 Conclusions and Future Work

We presented PipelineProfiler, a new tool for the exploration of pipeline collections derived by AutoML systems. PipelineProfiler advances the state-of-the-art in visual analytics for AutoML in two significant directions: it enables the analysis of pipelines that have complex structure and use a multitude of primitives, and it supports the comparison of multiple AutoML systems. Users can perform a wide range of analyses which can help them answer common questions that arise when they are debugging or evaluating AutoML systems. Because these analyses are scripted, they can be reproduced and re-used. We validated our system with a set of use cases that show how PipelineProfiler can be used to improve an open-source AutoML tool, and presented a detailed analysis of think-aloud interviews in which experts reported discovering novel and actionable insights into their systems.

There are many avenues for future work. To increase the adoption of our tool beyond the D3M ecosystem, we plan to add support for other pipeline schemata adopted by widely used AutoML systems. On the research front, we would like to explore how to capture the knowledge derived by users of PipelineProfiler and use this knowledge to steer the search performed by the AutoML system, which, in turn, can lead to the generation of more efficient pipelines in a shorter time. For example, if the user finds that a group of primitives works well together, they should be able to indicate this to the AutoML system, so that it can focus the search on pipelines that use these primitives.