A simple, extensible library for developing AutoML systems
As machine learning is applied more and more widely, data scientists often struggle to find or create end-to-end machine learning systems for specific tasks. The proliferation of libraries and frameworks and the complexity of the tasks have led to the emergence of "pipeline jungles" -- brittle, ad hoc ML systems. To address these problems, we introduce the Machine Learning Bazaar, a new approach to developing machine learning and AutoML software systems. First, we introduce ML primitives, a unified API and specification for data processing and ML components from different software libraries. Next, we compose primitives into usable ML programs, abstracting away glue code, data flow, and data storage. We further pair these programs with a hierarchy of search strategies -- Bayesian optimization and bandit learning. Finally, we create and describe a general-purpose, multi-task, end-to-end AutoML system that provides solutions to a variety of ML problem types (classification, regression, anomaly detection, graph matching, etc.) and data modalities (image, text, graph, tabular, relational, etc.). We evaluate our approach on a curated collection of 431 real-world ML tasks, searching millions of pipelines, and demonstrate real-world use cases and case studies.
Many diverse fields have begun to incorporate large-scale data collection into their work. As a result, machine learning (ML), once limited to conventional commercial applications, is now being widely applied in physical and social sciences, in policy and government, and in a variety of industries. This diversification has led to difficulties in actually creating and deploying real-world solutions, as key functionality becomes fragmented across ML-specific or domain-specific software libraries created by independent communities. The pace of ML innovation also means that any one library is unlikely to support the latest techniques. In addition, the complex and difficult process of building problem-specific end-to-end solutions continues to be marked by challenges such as formulating achievable learning problems, managing and cleaning data and metadata [2, 3, 4], scaling tuning procedures [5, 6], and deployment and serving.
In practice, engineers and data scientists often develop ad hoc programs for new problems, writing a significant amount of “glue code” to connect components from different software libraries, and spending significant time processing different forms of raw input and interfacing with external systems. These steps are tedious and error-prone and lead to the emergence of brittle “pipeline jungles”.
These points raise the question, “How can we make building machine learning systems easier in practical settings?” This question applies to a spectrum of user populations, from a nuclear scientist performing a simple study to a data engineer creating an automated machine learning (AutoML) platform within a large enterprise.
A new comprehensive approach is needed to designing and developing software systems that solve specific ML tasks. Such an approach should address a wide variety of ML task types: combinations of input data modalities, such as images, text, audio, signals, tabular data, relational data, time series, and graphs, and learning problem types, such as regression, classification, clustering, anomaly detection, community detection, graph matching, and collaborative filtering; it should cover the numerous intermediate stages involved in creating a solution for an ML task, such as data preprocessing, data munging, featurization, modeling, and evaluation; and it should support various levels of AutoML functionality to fine-tune solutions, such as hyperparameter tuning and algorithm selection. Moreover, it should enable fast iteration on ideas, coherent APIs, and easy integration of new techniques and libraries. In sum, this ambitious goal would allow many or all end-to-end learning problems to be solved or built within a single framework (Figure 2).
To address these challenges, we present the Machine Learning Bazaar, a multi-faceted approach to designing, organizing, and developing ML and AutoML software systems (Figure 1). We organize the ML ecosystem into a hierarchy of components, ranging from basic building blocks like individual classifiers to full-fledged AutoML systems. With our design, a user specifies a task, provides a raw dataset, and requests a curated pipeline for their task or composes an end-to-end pipeline out of pre-existing, annotated, ML primitives (Section III-A). The resulting pipelines can be easily evaluated and deployed across a variety of software and hardware settings (Section III-B), and tuned using a hierarchy of AutoML search approaches (Section IV-B). We also enable the rapid contribution, integration, and exchange of primitives from members of the community: promising components and pipelines can be thoroughly validated and evaluated for general-purpose performance across an extensive evaluation task suite (Section III-C).
“Bazaar-style” software development is exemplified by the Linux community, “a great babbling bazaar of different agendas and approaches”. Much like a bazaar, our approach is characterized by the availability of many compatible alternatives to achieve a single goal, a wide variety of libraries and custom solutions, broad coverage of ML task types, a space for contributors to bring primitives to support ML endeavors, and ready-to-use, pre-fit solutions for users who need to quickly complete a task.
We have been successfully using ML Bazaar for a number of real-world applications, such as anomaly detection for satellite telemetry and failure prediction in wind turbines (Section V-A). In addition, using our own approaches, we have created a full-fledged AutoML system (Section IV-C), which we have entered in the DARPA Data-Driven Discovery of Models (D3M) program (Section V-B); ours is the first end-to-end, modular, publicly released system designed to meet the program’s goal.
To preview the potential of ML Bazaar-style development, we highlight the Orion project within MIT for ML-based anomaly detection in satellite telemetry (Section V-A). The Orion pipeline processes a satellite telemetry signal using several custom preprocessors before predicting values in the time series and using a dynamic thresholding method to identify anomalies. The entire pipeline can be represented in a short JSON file (see the Orion pipeline listing).
Our collaborators’ experience developing Orion demonstrates the strengths of ML Bazaar. The completed pipeline is described by a short sequence of primitives and some additional metadata. Custom processing steps are easily implemented as modular components, before being combined with two separate ML libraries into a complex and powerful ML pipeline without the need to write any glue code. The pipeline is then automatically tuned using built-in AutoML functionality. Using our runtime engine, this pipeline can be easily deployed on our collaborators’ systems. In the remainder of this paper, we will dive deeper into the innovations that make this effective ML system development possible.
Our contributions in this paper include:
A unified organization and API for ML and AutoML tasks: Our system enables users to specify a pipeline for any ML task, ranging from image classification to graph matching through a unified API.
Open source libraries: Components of our system have been released as four modular libraries:
piex (https://github.com/HDI-Project/piex): A library for exploration and meta-analysis of ML task results.
The first general-purpose automated machine learning system: Our system AutoBazaar (https://github.com/HDI-Project/AutoBazaar) is, to the best of our knowledge, the first publicly available system with the ability to reliably compose end-to-end, automatically tuned solutions for 15 data modalities and problem types (Section IV-C).
ML task suite: We compile an extensive suite of 456 ML tasks/datasets covering 15 ML task types for experimentation, diagnostics, and more (Section III-C).
A comprehensive evaluation: We evaluated our AutoML system against our task suite, releasing a dataset of 2.5 million scored pipelines (Section VI).
Researchers have developed numerous algorithmic and software innovations to make it possible to create ML and AutoML systems in the first place.
Researchers today are fortunate to have access to high-quality libraries that have originated over a period of decades in separate academic communities. To support general ML applications, scikit-learn implements many different algorithms using a common API centered on the influential fit/predict paradigm. For specialized analysis, libraries have been developed in separate academic communities, often with different and incompatible APIs [11, 12, 13, 14, 15, 16]. In ML Bazaar, we connect and link components of these libraries, creating only the missing functionality ourselves.
Prior work has provided several approaches for making it easier to develop ML systems. For example, caret standardizes interfaces and provides utilities for the R ecosystem, but without enabling more complex pipelines. Pieces in an ML system can be manipulated using graphical interfaces, such as NeuronBlocks for neural networks or Azure Machine Learning Studio (https://azure.microsoft.com/en-us/services/machine-learning-studio/) for general-purpose workflows.
AutoML research has often been limited to solving sub-problems of an end-to-end ML workflow, such as data cleaning, feature engineering [15, 20], or model selection and hyperparameter tuning [21, 22, 23, 24]. Thus AutoML solutions are often not widely applicable or deployed in practice without human support. In contrast, ML Bazaar integrates many of these approaches and designs one coherent and configurable structure for joint tuning and selection of end-to-end pipelines.
These AutoML libraries, if deployed, are typically one component within a larger system that aims to manage several practical aspects such as parallel and distributed training, tuning, and model storage, and even serving, deployment, and graphical interfaces for model building. These include ATM, Vizier, and Rafiki, as well as commercial platforms like Google AutoML, DataRobot, and Azure Machine Learning Studio. While these systems provide many benefits, they have several limitations. First, they often focus on a subset of ML use cases, such as vision, NLP, forecasting, or hyperparameter tuning, neglecting many of the other common practical uses of ML, which may require more careful data processing and pipeline composition. Second, these systems are designed as standalone applications and do not support community-driven integration of new innovations. ML Bazaar provides a new approach to developing such systems in the first place: it supports a wide variety of ML task types, and builds on top of a community-driven ecosystem of ML innovations. Indeed, it could serve as the backend for such ML services or platforms. DARPA’s Data-Driven Discovery of Models (D3M) program, of which we are participants, aims to spur development of automated systems for model discovery for use by non-experts, and has led to the development of systems such as Alpine Meadow.
The ML Bazaar is a hierarchical organization and unified API of the ecosystem of machine learning software and algorithms. Within the ML Bazaar, we will find structured software components for every aspect of the practical machine learning process, from featurizers for relational datasets to signal processing transformers to neural networks to pre-trained embeddings. From these components, or primitives, data scientists can easily and efficiently construct ML solutions for a variety of ML task types, and ultimately, automate much of the work of tuning these models (Section IV).
A primitive is a reusable, self-contained, software component for machine learning paired with the structured annotation of its metadata. It is the most fundamental unit of ML computation in our system. It has a well-defined interface such that it receives input data in one of several formats or types, performs computations, and returns the data in another format or type, exposing a fit/produce interface.
As a result of this abstraction, widely varying ML functionality can be collected in a single location, and each primitive can be re-used in chained computations (Section III-B) without callers writing any glue code.
Many primitives have no learning component and are trivial to specify, but are very important nonetheless. For example, the Hilbert and Hadamard transforms from a signal processing toolbox would be important primitives to include when building an ML system to solve problems in this application area.
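To make the fit/produce interface concrete, the following is a minimal sketch of two primitives, one with a learning component and one without. The class and function names (`StandardizePrimitive`, `absolute_value`) and the convention of returning a dictionary keyed by ML data type are assumptions made for illustration; they are not taken from the MLPrimitives catalog.

```python
from statistics import mean, pstdev


class StandardizePrimitive:
    """Hypothetical learning primitive: fit() learns statistics, produce() applies them."""

    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X):
        # Learn the location and scale of the training values.
        self.mean_ = mean(X)
        self.std_ = pstdev(X) or 1.0  # guard against a constant input

    def produce(self, X):
        # Return the transformed data under its ML data type name.
        return {"X": [(x - self.mean_) / self.std_ for x in X]}


def absolute_value(X):
    """Hypothetical primitive with no learning component: it only has a produce phase."""
    return {"X": [abs(x) for x in X]}


primitive = StandardizePrimitive()
primitive.fit([1.0, 2.0, 3.0])
centered = primitive.produce([2.0])
rectified = absolute_value([-1, 2])
```

Because both primitives consume and emit data under the same name, a caller can chain them without any adapter code in between.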
For each primitive, we annotate the ML data types of declared inputs and outputs, i.e., recurring objects in ML that have a well-defined semantic meaning, such as a feature matrix $X$, a target vector $y$, or a space of class labels classes. We provide a mapping between ML data types and synonyms used by specific libraries as necessary. This logical structure will help dramatically decrease the amount of glue code users must write (Section III-B1).
The design of ML primitives is motivated by several considerations:
Lightweight wrappers: We aim to enable lightweight wrappers around the functionality of other existing libraries with mutually incompatible APIs to minimize redundancy and avoid the “yet-another-library” problem.
Evolving annotations: We aim to naturally evolve primitive annotations, as primitives change due to hyperparameter settings, metadata tags, or improved implementations.
Ease of contribution: As new ML innovations and software emerge, we aim for contributors — not even necessarily the original researchers — to easily create and annotate new primitives, submit them for validation, and make them available to the community.
Structured metadata: We aim to make detailed metadata about each primitive available in both human- and machine-readable form to support automated tools and meta-learning approaches.
Each primitive is annotated with meta-information about its inputs and outputs, with their ranges and data types, its hyperparameters, and other detailed metadata, such as the author, description, and documentation URL. The full annotation is provided in a self-contained JSON file with the following fields and others:
primitive: The fully-qualified name of the underlying implementation as a Python object.
fit, produce: The entry points in the underlying implementation and the names and ML data types of the primitive’s inputs and outputs for the fit or produce phases.
hyperparameters: Details of all the hyperparameters of the primitive— their names, descriptions, data types, ranges, and whether they are fixed or tunable.
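The annotation fields above can be pictured with a small example. The sketch below builds an annotation as a Python dictionary and serializes it to JSON. Only the top-level field names (`primitive`, `fit`, `produce`, `hyperparameters`) come from the description above; the nested keys (`method`, `args`, `output`, `fixed`, `tunable`, `type`, `default`), the `name` field, and the package `example_package` are assumptions for illustration and may differ from the MLPrimitives JSON Schema.

```python
import json

# Hypothetical annotation for a scaler-like primitive.
annotation = {
    "name": "example_package.Standardizer",        # assumed identifier field
    "primitive": "example_package.Standardizer",   # fully-qualified Python object
    "fit": {
        "method": "fit",
        "args": [{"name": "X", "type": "ndarray"}],
    },
    "produce": {
        "method": "transform",
        "args": [{"name": "X", "type": "ndarray"}],
        "output": [{"name": "X", "type": "ndarray"}],
    },
    "hyperparameters": {
        "fixed": {"copy": {"type": "bool", "default": True}},
        "tunable": {"with_mean": {"type": "bool", "default": True}},
    },
}


def tunable_hyperparameters(annotation):
    """List the names of the hyperparameters that a tuner may modify."""
    return sorted(annotation["hyperparameters"].get("tunable", {}))


serialized = json.dumps(annotation, indent=2)
tunable = tunable_hyperparameters(json.loads(serialized))
```

Keeping the annotation as plain JSON means that tools in any language can read which hyperparameters are tunable without importing the underlying implementation.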
We have developed the open-source MLPrimitives library which contains a formal JSON Schema specification of the primitive JSON annotation format. To support annotation of primitives from libraries that need significant adaptation to the fit/produce interface, MLPrimitives also provides a powerful set of adapter modules that assist in wrapping common patterns. However, MLPrimitives aims to enable lightweight wrappers in which as little new code as possible is written; one can annotate entry points to an underlying primitive in terms of functions, class methods, or attributes.
MLPrimitives enables easy contribution of new primitives in several ways by providing primitive templates and example annotations and detailed tutorials and documentation. We also provide procedures to validate proposed primitives against the formal specification and a unit test suite.
In addition, MLPrimitives maintains a curated catalog of high-quality, useful primitives from 11 libraries (as of this writing, MLPrimitives v0.1.10), as well as custom primitives that we have created (Table I). Because MLPrimitives is distributed as a widely available Python package, end-users can pin versions of the package to access specific primitives, or update the package to gain access to the updated primitives. Each primitive is identified by a fully-qualified name to differentiate primitives across catalogs. The JSON annotations can then be mined for additional insights.
To solve practical learning problems, we must be able to instantiate and compose primitives into usable programs. These programs must be easy to specify with a natural interface, such that users can easily compose primitives without sacrificing flexibility. We aim to support both end-users trying to build an ML solution for their specific problem who may not be savvy about software engineering, as well as system developers wrapping individual ML solutions in AutoML components (Section IV) or otherwise. In addition, we provide an abstracted execution layer, such that learning, data flow, data storage, and deployment are handled automatically by various configurable and pluggable backends.
We introduce ML pipelines, which collect multiple primitives into a single computational graph. Each primitive in the graph is instantiated in a pipeline step, which loads and interprets the underlying primitive and provides a common interface to run a step in a larger ML program.
We define a pipeline $L$ as a directed acyclic multigraph $L = \langle V, E, \lambda \rangle$, where $V$ is a collection of pipeline steps, $E$ are the directed edges between steps representing data flow, where each edge is endowed with one data item, and $\lambda$ is a joint hyperparameter vector for the underlying primitives. A valid pipeline, along with its derivatives (Section IV-A), must also satisfy acceptability constraints that require the inputs to each step to be satisfied by the outputs of another step connected by a directed edge.
The term “pipeline” is used in the literature to refer to an ML-specific sequence of operations, and sometimes abused to refer to a more general computational graph or analysis. In our conception, we bring foundational data processing operations of raw inputs into this scope, like featurization of graphs, multi-table relational data, time series, text, and images, as well as simple data transforms, like encoding integer or string targets. This gives our pipelines a greatly expanded role, providing solutions to any ML task type and spanning the entire ML process beginning with the raw dataset.
Large graph-structured workloads can be difficult to specify for end-users due to the complexity of the data structure. In ML Bazaar, we consider three aspects of pipeline representation: ease of composition, readability, and computational issues. First, we prioritize easily composing complex ML pipelines by providing a pipeline description interface (PDI) in which users specify only the topological ordering of all pipeline steps in the pipeline without requiring any explicit dependency declarations. These steps can be specified using our software libraries or loaded from JSON files. Full training-time (fit) and inference-time (produce) computational graphs can then be recovered (Algorithm 1), without the user being required to write any glue code. This is made possible by the meta-information provided in the primitive annotations, in particular, the ML data types of the primitive inputs and outputs. We leverage the observation that steps that modify the same ML data type can be grouped into the same subpath. Though it may be more difficult to read and understand these pipelines from the PDI alone, as the edges are neither shown nor labeled, it is easy to accompany them with the recovered graph representation (Figure 3).
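The idea of recovering edges from a topological ordering can be illustrated with a short sketch. This is a simplified reading of the description above, not a transcription of Algorithm 1: each input of a step is connected to the most recent earlier step that produced the same ML data type, or to the raw input if no step has produced it yet. The function name `recover_graph`, the `"<input>"` source node, and the example step names are assumptions.

```python
def recover_graph(steps, raw_inputs):
    """Recover labeled edges from steps listed in topological order.

    steps: list of (step_name, input_types, output_types)
    raw_inputs: ML data types supplied by the caller
    """
    last_producer = {dtype: "<input>" for dtype in raw_inputs}
    edges = []
    for name, inputs, outputs in steps:
        for dtype in inputs:
            if dtype not in last_producer:
                raise ValueError(f"step {name!r} needs {dtype!r}, which nothing provides")
            # Connect to the latest step that touched this data type.
            edges.append((last_producer[dtype], name, dtype))
        for dtype in outputs:
            last_producer[dtype] = name
    return edges


# A hypothetical fit-time pipeline given only as an ordered list of steps.
steps = [
    ("impute", ["X"], ["X"]),
    ("scale", ["X"], ["X"]),
    ("encode_target", ["y"], ["y"]),
    ("classify", ["X", "y"], ["y_pred"]),
]
edges = recover_graph(steps, raw_inputs=["X", "y"])
```

In this example the two steps that modify `X` form one subpath and the step that modifies `y` forms another; the subpaths meet at the classifier. The `ValueError` corresponds to a violated acceptability constraint.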
The resulting graphs describe abstract computational workloads, but we must be able to actually execute them for purposes of learning and inference. After recovering the full graphs, we further compile them to an intermediate representation. We could re-purpose many existing systems within the data engineering landscape for scheduling and executing these workloads [30, 31] to serve as backends for this representation. We implement one execution engine, released as the open-source MLBlocks library, in which a collection of objects and a metadata tracker in a key-value store are iteratively transformed through sequential processing of pipeline steps.
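A minimal sketch of this style of execution follows, assuming a plain dictionary as the key-value store and a list of step names as a stand-in for the metadata tracker. The function `run_pipeline` and the two toy steps are hypothetical and do not reproduce the MLBlocks API.

```python
def run_pipeline(steps, context):
    """Run steps in order, reading inputs from and writing outputs to a shared context."""
    context = dict(context)  # leave the caller's data untouched
    trace = []               # simple stand-in for a metadata tracker
    for name, func, inputs in steps:
        kwargs = {key: context[key] for key in inputs}
        outputs = func(**kwargs)
        context.update(outputs)  # later steps see the transformed objects
        trace.append((name, sorted(outputs)))
    return context, trace


def scale(X):
    """Toy step: rescale values by the largest magnitude."""
    peak = max(abs(x) for x in X) or 1.0
    return {"X": [x / peak for x in X]}


def flag_large(X, threshold):
    """Toy step: emit a binary label per value."""
    return {"y": [int(abs(x) > threshold) for x in X]}


steps = [
    ("scale", scale, ["X"]),
    ("flag_large", flag_large, ["X", "threshold"]),
]
context, trace = run_pipeline(steps, {"X": [1.0, -4.0, 2.0], "threshold": 0.6})
```

Since every step communicates only through the context, swapping in a different backend changes how `run_pipeline` schedules work but not how steps are written.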
A primary goal of ML Bazaar is to provide broad coverage of ML task types, that is, to reliably produce high-quality solutions for a wide variety of data modalities and problem types. To that end, we release the comprehensive ML Bazaar Task Suite for evaluation, experimentation, and diagnostics.
Our publicly available task suite (available at https://d3m-data-dai.s3.amazonaws.com/index.html and explorable using our piex Python library) consists of 456 ML tasks spanning 15 task types. Tasks, which encompass raw datasets and annotated task descriptions, are assembled from a variety of sources, including MIT Lincoln Laboratory, Kaggle, OpenML, Quandl, and Crowdflower (Table II). We created train/test splits and organized the folder structure. Other than this, we do not do any preprocessing (sampling, outlier detection, imputation, featurization, scaling, encoding, etc.), presenting data in its raw form as inputs to proposed end-to-end ML pipelines. Our holistic approach contrasts with other benchmarking approaches such as the OpenML 100 and the AutoML Benchmark [32, 33], which each target only one ML task type (single-table classification), and others [26, 34, 35] which target the black-box optimization aspect of AutoML in isolation.
ML experts developing new methods can use our ML task suite and integrate their proposed methods as replacement for a primitive or set of primitives. They can then evaluate the efficacy of the method across a realistic, general-purpose workload. We demonstrate this research approach in two case studies in Sections VI-C and VI-B.
Table II: ML task types in the ML Bazaar Task Suite (columns: Data Modality, Problem Type, Tasks, Template).
In this section, we have described the design and implementation of ML primitives and pipelines and presented the ML Bazaar Task Suite.
Several alternatives exist to our new ML Pipeline abstraction (Section III-B), such as scikit-learn’s Pipeline. Ultimately, while our pipeline is inspired by these alternatives, it provides much more general data engineering and ML functionality. While the scikit-learn pipeline sequentially applies a list of transformers to $X$ and $y$ only before outputting a prediction, our pipeline supports general computational graphs, accepts multiple data modalities as input simultaneously, produces multiple outputs, manages evolving metadata, and can use software from outside the scikit-learn ecosystem/design paradigm. For example, we can use our pipeline to construct entity sets from multi-table relational data on-the-fly for input to other pipeline steps. We can also support pipelines outside the supervised learning paradigm, such as in Orion, where we create $y$ “on-the-fly” in an unsupervised setting (Figure 3).
In creating the ML Bazaar Task Suite (Section III-C), we made every effort to curate a corpus that was evenly balanced across ML task types. Unfortunately, in practice, available datasets are heavily skewed to traditional ML problems of single-table classification and our task suite reflects this deficiency. Indeed, the OpenML 100 benchmark is composed exclusively of single-table classification problems. In our task suite, 49 percent of tasks fall outside of this highly-studied problem, and we continue to release new versions.
While ML Bazaar handles 15 ML task types (Table II), there are many more task types for which we do not currently provide pipelines in our default catalog. To extend our approach to support new data modalities, such as audio or video, and task types, such as object detection or speech transcription, it is generally sufficient to write several new primitive annotations for pre-processing input and post-processing output. For example, for the anomaly detection task type from the Orion project, we implemented several new simple primitives: rolling_window_sequences, regression_errors, and find_anomalies. Importantly, no changes are needed to the core ML Bazaar software libraries such as MLPrimitives and MLBlocks. Indeed, support for a certain task type is predicated on the availability of a pipeline for that task type rather than any characteristics of our software libraries.
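To give a sense of how small such task-specific primitives can be, the sketch below implements simplified stand-ins for the three named primitives. Only the function names come from the text; the signatures, the returned keys, and in particular the fixed mean-plus-k-standard-deviations threshold in `find_anomalies` are assumptions (the Orion pipeline is described as using a dynamic thresholding method, which is not reproduced here).

```python
from statistics import mean, pstdev


def rolling_window_sequences(values, window_size):
    """Turn a series into (window, next value) pairs, creating X and y on the fly."""
    n = len(values) - window_size
    X = [values[i:i + window_size] for i in range(n)]
    y = [values[i + window_size] for i in range(n)]
    return {"X": X, "y": y}


def regression_errors(y, y_hat):
    """Absolute error between observed and predicted values."""
    return {"errors": [abs(a - b) for a, b in zip(y, y_hat)]}


def find_anomalies(errors, k=3.0):
    """Flag indices whose error exceeds a fixed threshold (simplified, not dynamic)."""
    threshold = mean(errors) + k * pstdev(errors)
    return {"anomalies": [i for i, e in enumerate(errors) if e > threshold]}


windows = rolling_window_sequences([1, 2, 3, 4, 5], window_size=2)
errors = regression_errors([1, 2], [1.5, 2])
anomalies = find_anomalies([0.1] * 20 + [5.0])
```

Each function consumes and produces named ML data types, so annotating them as primitives requires no change to the pipeline machinery.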
The default catalog of primitives from the MLPrimitives library is versioned together, and library conflicts are resolved manually by maintainers through carefully specifying minimum and maximum dependencies. This strategy ensures that the default catalog can always be used, even if there are incompatible updates to the underlying libraries. Thus a user can request a specific version of MLPrimitives and get predictable behavior. Users also can augment the default catalog with their own custom primitives; since the required libraries must be installed on their system anyway, versioning issues are no different. Finally, automated tools can be integrated to aid both users and maintainers in understanding potential conflicts and safely bumping library-wide versions.
In this work, we focus on the wealth of ML functionality that exists in the Python ecosystem. Through ML Bazaar’s careful design, we could also support other common languages in data science like R, MATLAB, and Julia and enable multi-language pipelines. Our choice for primitive annotations of JSON, rather than a Python class or data structure, provides the first step towards this goal. Next, a multi-language pipeline execution backend would be built that uses language-specific kernels or containers and relies on an interoperable data format such as Apache Arrow.
We considered multiple alternatives to the primitives API, such as representing them as Python data structures or classes. We opted against these approaches because they lead to excessive wrapper code and reduce the potential for language interoperability and pipeline meta-learning.
From the components of the ML Bazaar, data scientists can easily and effectively build machine learning pipelines with fixed hyperparameters for their specific problems. To improve the performance of these solutions, we first introduce templates and hypertemplates, which generalize pipelines by allowing a tunable hyperparameter configuration space to be specified. Next, we describe a set of AutoML primitives which facilitate hyperparameter tuning and model selection. Finally, we present the design and architecture of AutoBazaar, an AutoML system built on top of these innovations. Our system, which we have used to enter the DARPA D3M competition, automatically selects templates from available options and tunes the hyperparameters of those templates by evaluating millions of pipelines in a distributed setting.
Frequently, pipelines require hyperparameters to be specified at several places. Unless these values are fixed at annotation-time, hyperparameters must be exposed in a machine-friendly interface. This motivates generalizing pipelines through templates and hypertemplates and providing first-class tuning support.
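We define a template $T$ as a directed acyclic multigraph $T = \langle V, E, \Lambda \rangle$, where $V$ is a collection of pipeline steps, $E$ are directed edges between steps, and $\Lambda$ is the joint hyperparameter configuration space for the underlying primitives. By providing values for the unset hyperparameters of a template, a concrete pipeline is created.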
In some cases, certain values of hyperparameters can affect the domains of other hyperparameters. For example, the type of kernel for a support vector machine results in different kernel hyperparameters, and preprocessors used to adjust for class imbalance can affect the training procedure of a downstream classifier. We call these conditional hyperparameters, and accommodate them with hypertemplates. We define a hypertemplate $H$ as a directed acyclic multigraph $H = \langle V, E, \bigcup_j \Lambda_j \rangle$, where $V$ is a collection of pipeline steps, $E$ are directed edges between steps, and $\Lambda_j$ is the hyperparameter configuration space for template $T_j$. A number of templates can be derived from one hypertemplate by fixing the conditional hyperparameters (Figure 4).
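The relationship between hypertemplates, templates, and pipelines can be sketched with plain dictionaries. The example uses the support vector machine kernel mentioned above as the conditional hyperparameter. The data layout, the key names (`steps`, `tunable`, `conditional`, `fixed`, `space`), and the hyperparameter ranges are all assumptions for illustration.

```python
from itertools import product

# Hypothetical hypertemplate: the kernel choice decides which other hyperparameters exist.
hypertemplate = {
    "steps": ["scaler", "svm"],
    "tunable": {"svm.C": (0.01, 100.0)},
    "conditional": {
        "svm.kernel": {
            "rbf": {"svm.gamma": (1e-4, 1.0)},
            "poly": {"svm.degree": (2, 5), "svm.coef0": (0.0, 1.0)},
        }
    },
}


def derive_templates(hyper):
    """Fix every conditional hyperparameter to obtain one template per combination."""
    names = list(hyper["conditional"])
    choices = [list(hyper["conditional"][name]) for name in names]
    templates = []
    for combo in product(*choices):
        fixed = dict(zip(names, combo))
        space = dict(hyper["tunable"])
        for name, value in fixed.items():
            space.update(hyper["conditional"][name][value])
        templates.append({"steps": hyper["steps"], "fixed": fixed, "space": space})
    return templates


def make_pipeline(template, values):
    """Provide a value for every unset hyperparameter to obtain a concrete pipeline."""
    missing = set(template["space"]) - set(values)
    if missing:
        raise ValueError(f"unset hyperparameters: {sorted(missing)}")
    return {"steps": template["steps"], "hyperparameters": {**template["fixed"], **values}}


templates = derive_templates(hypertemplate)
pipeline = make_pipeline(templates[0], {"svm.C": 1.0, "svm.gamma": 0.1})
```

Each derived template has an ordinary, unconditional configuration space, which is what allows a standard tuner to operate on it.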
Just as primitives represent components of machine learning computation, AutoML primitives represent components of an AutoML system. We separate AutoML primitives into tuners and selectors. These underlie our extensible AutoML library, BTB, which facilitates easy integration of methodological developments by AutoML developers.
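Given a template, an AutoML system must find a specific pipeline with fully-specified hyperparameter values to minimize some cost or, equivalently, maximize a performance score. For template $T$ with hyperparameter space $\Lambda$, and a function $f$ that assigns a performance score to pipeline $L_\lambda$ with hyperparameters $\lambda \in \Lambda$, we define the tuning problem as
$$\lambda^* = \arg\max_{\lambda \in \Lambda} f(L_\lambda).$$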
Hyperparameter tuning is widely studied and its effective use is instrumental to maximizing the performance of machine learning solutions [36, 23, 21]. Since the scoring function $f$ is expensive to evaluate, as the model is trained several times to compute a desired metric via cross-validation, the number of evaluations should be minimized. Within ML Bazaar, we focus on Bayesian optimization, a black-box optimization technique in which expensive evaluations of $f$ are minimized by forming and updating a meta-model for $f$. At each iteration, the next hyperparameter configuration to try is chosen according to an acquisition function.
Researchers have argued for different formulations of meta-models (often in terms of the different kernels of Gaussian Processes) and acquisition functions [37, 38, 21]. We structure these meta-models and acquisition functions as separate AutoML primitives that can be combined together to form a tuner. Tuners provide a record/propose interface in which evaluation results are recorded to the tuner and new hyperparameters are proposed. For example, the GCP-EI tuner uses the Gaussian Copula Process meta-model primitive and the Expected Improvement acquisition function primitive.
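The record/propose interface, with the meta-model and acquisition function as separate pluggable parts, can be sketched as follows. This is not the BTB implementation: the class names are hypothetical, the tuner handles a single numeric hyperparameter, and a crude kernel-weighted average stands in for a Gaussian process meta-model. The acquisition function is the standard Expected Improvement formula for maximization.

```python
import math
import random


class KernelMetaModel:
    """Toy meta-model: kernel-weighted mean and spread of recorded scores (not a GP)."""

    def __init__(self, bandwidth=0.15):
        self.bandwidth = bandwidth
        self.xs, self.ys = [], []

    def fit(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)

    def predict(self, x):
        weights = [math.exp(-((x - xi) / self.bandwidth) ** 2) for xi in self.xs]
        total = sum(weights)
        mean = sum(w * y for w, y in zip(weights, self.ys)) / (total + 1e-12)
        var = sum(w * (y - mean) ** 2 for w, y in zip(weights, self.ys)) / (total + 1e-12)
        # Uncertainty grows where few observations are nearby.
        std = math.sqrt(var) + 1.0 / (1.0 + total)
        return mean, std


def expected_improvement(mean, std, best):
    """Expected Improvement over the best score so far (maximization)."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


class Tuner:
    """Combines a meta-model and an acquisition function behind record/propose."""

    def __init__(self, low, high, meta_model, acquisition, n_candidates=100, seed=0):
        self.low, self.high = low, high
        self.meta_model = meta_model
        self.acquisition = acquisition
        self.n_candidates = n_candidates
        self.rng = random.Random(seed)
        self.trials = []

    def record(self, value, score):
        self.trials.append((value, score))

    def propose(self):
        if len(self.trials) < 3:  # too little data to model: sample at random
            return self.rng.uniform(self.low, self.high)
        xs, ys = zip(*self.trials)
        self.meta_model.fit(xs, ys)
        best = max(ys)
        candidates = [self.rng.uniform(self.low, self.high) for _ in range(self.n_candidates)]
        return max(candidates, key=lambda x: self.acquisition(*self.meta_model.predict(x), best))


def score(x):
    """Stand-in for an expensive cross-validated pipeline score."""
    return -(x - 0.3) ** 2


tuner = Tuner(0.0, 1.0, KernelMetaModel(), expected_improvement)
for _ in range(25):
    candidate = tuner.propose()
    tuner.record(candidate, score(candidate))
best_value, best_score = max(tuner.trials, key=lambda trial: trial[1])
```

Replacing `KernelMetaModel` or `expected_improvement` with another implementation yields a different tuner without touching the `Tuner` class, which mirrors the composition described above.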
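For many ML task types, there may be multiple templates or hypertemplates available as possible solutions, each with their own tunable hyperparameters. The aim is to balance the exploration-exploitation tradeoff while selecting promising templates to tune. For a set of templates $\mathcal{T}$, we define the selection problem as
$$T^* = \arg\max_{T \in \mathcal{T}} \max_{\lambda_T \in \Lambda_T} f(L_{\lambda_T}),$$
where $f(L_{\lambda_T})$ is the score of the pipeline derived from template $T$ with hyperparameters $\lambda_T$ drawn from its configuration space $\Lambda_T$.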
The selection problem is treated as a multi-armed bandit problem where, for a selected template, the score achieved as a result of tuning can be assumed to come from an unknown underlying probability distribution. We structure selectors as AutoML primitives providing a compute_rewards/select API, with different decision criteria acting on the history of pipeline scores. For example, the upper confidence bound method is represented by the UCB1 selector, where scores achieved for each template are converted into rewards, given by
$$z_{j,t} = s_{j,t},$$
where $s_{j,t}$ is the score achieved by template $T_j$ at iteration $t$. The choice is then made using:
$$T^* = \arg\max_{T_j} \left( \bar{z}_j + \sqrt{\frac{2 \ln n}{n_j}} \right),$$
where $\bar{z}_j$ is the mean reward of template $T_j$, $n$ is the total number of iterations, and $n_j$ is the number of times template $T_j$ was chosen.
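A selector with a compute_rewards/select interface can be sketched using the standard UCB1 rule: choose the arm with the largest mean reward plus an exploration bonus of sqrt(2 ln n / n_j). The class below is a hypothetical illustration rather than the BTB selector; it assumes that rewards equal raw scores and that every template is tried once before the rule is applied.

```python
import math


class UCB1Selector:
    """Hypothetical selector exposing compute_rewards and select."""

    def __init__(self, choices):
        self.choices = list(choices)

    def compute_rewards(self, scores):
        # Assumption: the reward of a trial is simply its score.
        return list(scores)

    def select(self, scores_by_choice):
        # Try each template once before comparing confidence bounds.
        for choice in self.choices:
            if not scores_by_choice.get(choice):
                return choice
        total = sum(len(scores_by_choice[choice]) for choice in self.choices)

        def upper_bound(choice):
            rewards = self.compute_rewards(scores_by_choice[choice])
            count = len(rewards)
            return sum(rewards) / count + math.sqrt(2.0 * math.log(total) / count)

        return max(self.choices, key=upper_bound)


selector = UCB1Selector(["template_a", "template_b"])
first = selector.select({})
exploit = selector.select({"template_a": [0.9, 0.8], "template_b": [0.1, 0.2]})
explore = selector.select({"template_a": [0.6] * 50, "template_b": [0.5]})
```

With equal trial counts the selector exploits the template with the better mean, while a template that has been tried only once receives a large bonus and is revisited even if its single score was lower.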
Whereas composition of high-quality primitives enables data scientists to build machine learning solutions (Section III), by combining both ML and AutoML primitives in a carefully designed and architected manner, we have built AutoBazaar, an end-to-end, general-purpose, multi-task, automated machine learning system. AutoBazaar consists of several components: user interfaces for administration and configuration, loaders and configuration for ML tasks and primitives and other components, data stores for metadata and pipeline evaluation results, a pipeline execution engine, and an AutoML coordinator.
We focus here on the core pipeline search and evaluation algorithm of AutoBazaar. The input to the search is a computational budget and an ML task, which consists of the raw data and task and dataset metadata — dataset resources, problem type, dataset partition specifications, and an evaluation procedure for scoring. Based on these inputs, it searches through its catalog of primitives and templates for the most suitable pipeline that it can build.
In order to do this, it first loads the train and test dataset partitions following the metadata specifications. Next, it loads from its default catalog and the user's custom catalog a collection of candidate templates suitable for the data modality and problem type at hand. Using the BTB library, it then generates a tuner for each template, as well as a single selector that orchestrates them. It then runs a search loop for as long as the computational budget allows. In each iteration, the selector is queried for the template to evaluate next, the corresponding tuner is queried for the next hyperparameters to try, and a pipeline is generated and evaluated with the provided scoring function using cross-validation over the training partition. This produces a score, which is reported back to the tuner and selector, and the process continues. Once the budget is consumed, the best pipeline found is fitted on the training partition and scored on the test partition. Its specification is returned to the user alongside the score obtained.
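The search procedure described above can be sketched as a short loop. This is a simplified illustration under several assumptions, not the AutoBazaar code: templates are toy functions of a single hyperparameter, the budget is a number of iterations instead of wall-clock time, fitting is omitted, the tuner proposes at random, and all names (`RandomTuner`, `ucb1_select`, `search`) are hypothetical.

```python
import math
import random


class RandomTuner:
    """Toy tuner with a record/propose interface (random proposals only)."""

    def __init__(self, low, high, seed):
        self.low, self.high = low, high
        self.rng = random.Random(seed)
        self.trials = []

    def propose(self):
        return self.rng.uniform(self.low, self.high)

    def record(self, param, score):
        self.trials.append((param, score))


def ucb1_select(history):
    """Pick a template name using the standard UCB1 rule."""
    for name, scores in history.items():
        if not scores:
            return name  # try every template once
    total = sum(len(scores) for scores in history.values())
    return max(
        history,
        key=lambda n: sum(history[n]) / len(history[n])
        + math.sqrt(2.0 * math.log(total) / len(history[n])),
    )


def cross_val_score(template, param, folds):
    # Average the template's score over the training folds.
    return sum(template(param, fold) for fold in folds) / len(folds)


def search(templates, train_folds, test_data, budget):
    tuners = {name: RandomTuner(0.0, 1.0, seed=i)
              for i, name in enumerate(templates)}
    history = {name: [] for name in templates}
    best = None  # (cv_score, template_name, hyperparameter)

    for _ in range(budget):
        name = ucb1_select(history)            # which template next
        param = tuners[name].propose()         # which hyperparameters next
        score = cross_val_score(templates[name], param, train_folds)
        tuners[name].record(param, score)      # report back to the tuner
        history[name].append(score)            # report back to the selector
        if best is None or score > best[0]:
            best = (score, name, param)

    _, name, param = best
    test_score = templates[name](param, test_data)  # final held-out score
    return name, param, test_score


# Toy templates: score as a function of one hyperparameter and a data "shift".
templates = {
    "template_a": lambda p, shift: 1.0 - (p - 0.5 - shift) ** 2,
    "template_b": lambda p, shift: 0.5 - (p - 0.2 - shift) ** 2,
}
name, param, test_score = search(
    templates, train_folds=[-0.01, 0.0, 0.01], test_data=0.0, budget=40
)
```

The selector and the tuners see only scores, so the same loop applies regardless of what the templates actually compute.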
While this is one example, AutoML system developers within an organization can support the efforts of their data scientists by configuring their system with custom backends or cloud-specific infrastructure. This development is aided by the organization we impose on system components.
In this paper, we claim that ML Bazaar makes it easier to develop ML systems. We provide evidence for this claim along two axes. First, we describe four real-world use cases in which ML Bazaar is currently used to create both ML and AutoML systems. While we will evaluate our AutoML system against our publicly available task suite in the next section, through these industrial applications we examine the following questions: Does ML Bazaar support the needs of the developers of these applications? If not, how easy was it to extend? Second, we demonstrate the ability of our AutoML system to compete in the DARPA D3M Challenge.
ML Bazaar is used by a communications satellite operator which provides video and data connectivity globally. This company wanted to monitor more than 10,000 telemetry signals from their satellites and identify anomalies, which might indicate a looming failure severely affecting the satellite's coverage. This time series/anomaly detection task was not initially supported by any of the pipelines in our curated catalog. Our collaborators were able to easily implement a recently developed end-to-end anomaly detection method using pre-existing transformation primitives in ML Bazaar and by adding several new primitives: a primitive for the specific LSTM architecture used in the paper and new time series anomaly detection postprocessing primitives, which take as input a time series and time series forecast, and produce as output a list of anomalies, identified by intervals. This design enabled rapid experimentation through substituting different time series forecasting primitives and comparing the results. In current work, they apply ML pipelines to 82 publicly available satellite telemetry signals from NASA and evaluate the anomaly detections against 105 known anomalies. The work has been released as the open-source Orion project (Section I-B) and is currently under active development (https://github.com/D3-AI/Orion).
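The shape of such a postprocessing primitive can be illustrated with a small sketch: it takes a time series and its forecast and returns anomalies as index intervals. The function name `find_anomalies` and the thresholding rule (mean error plus k standard deviations) are assumptions chosen for simplicity; they are not the method used in Orion or in the cited work.

```python
from statistics import mean, pstdev


def find_anomalies(signal, forecast, k=3.0):
    """Return (start, end) index intervals where forecast error is unusually high.

    Assumed rule: an observation is anomalous when its absolute error exceeds
    mean(error) + k * std(error). Consecutive anomalous points are merged.
    """
    errors = [abs(s - f) for s, f in zip(signal, forecast)]
    threshold = mean(errors) + k * pstdev(errors)

    intervals, start = [], None
    for i, err in enumerate(errors):
        if err > threshold and start is None:
            start = i                          # an interval opens
        elif err <= threshold and start is not None:
            intervals.append((start, i - 1))   # the interval closes
            start = None
    if start is not None:
        intervals.append((start, len(errors) - 1))
    return intervals


# A flat signal with a two-step spike that the forecast does not predict.
signal = [0.0] * 20
signal[7] = signal[8] = 10.0
forecast = [0.0] * 20

anomalies = find_anomalies(signal, forecast, k=2.0)
no_anomalies = find_anomalies(forecast, forecast, k=2.0)
```

Because the primitive depends only on the signal and the forecast, any forecasting primitive can be placed upstream of it, which is what allows different forecasters to be substituted and compared.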
Cardea is an open-source, automated framework for predictive modeling in health care on electronic health records following the FHIR schema. Its developers formulated a number of prediction problems including predicting length of hospital stay, missed appointments, and hospital readmission. All tasks in Cardea are multitable regression or classification. From ML Bazaar, Cardea uses the featuretools.dfs primitive to automatically engineer features for this highly relational data and multiple other primitives for classification and regression. Cardea also integrates hyperopt, another library for Bayesian optimization, to tune its pipelines. The framework also presents examples on a publicly available patient no-show prediction problem. The framework has been released as an open-source project (https://github.com/D3-AI/Cardea).
ML Bazaar is also used by a multinational energy utility to predict critical failures and stoppages in their wind turbines. Most prediction problems here pertain to the time series classification ML task type. ML Bazaar has several time series classification pipelines available in its catalog and they enable usage of time series from 140 turbines to develop multiple pipelines, tune them, and produce prediction results. Multiple outcomes are predicted, ranging from stoppage and pitch failure to less common issues, such as gearbox failure. This library is released as the open-source GreenGuard project (https://github.com/D3-AI/GreenGuard).
A global water technology provider uses ML Bazaar for a variety of machine learning needs, ranging from image classification for detecting leaks from images, to crack detection from time series data, to demand forecasting using water meter data. A system like ML Bazaar provides a unified framework for these disparate needs. The team also builds custom primitives internally and uses them directly with the MLBlocks backend.
Using ML Bazaar, we designed an AutoML system to participate in DARPA’s D3M program (Section II). DARPA’s evaluation procedure is as follows. Submissions of AutoML systems by participants are run on a number of tasks spanning several task types. Each system is run for one hour per task, and at the end of the run, the best pipeline identified by the AutoML system is evaluated on held-out test data.
As part of DARPA's evaluation setup, they also curate a subset of 17 tasks for which experts at MIT Lincoln Laboratory manually designed and tuned pipelines and for which we are able to compare and release our own performance. The results from our latest submission are shown in Figure 5. We find that ML Bazaar substantially outperforms the expert baselines, finding superior pipelines for 15/17 tasks. We have submitted our system 3 times, adding new primitives each time.
In this section, we highlighted several successful, real-world use cases of ML Bazaar for developing ML systems. In the absence of the ability to run a fair user study, we believe this provides strong evidence for our claims about the usability and efficacy of our development approach.
The ease of developing ML solutions for the task at hand freed up time for these teams to think about and design a comprehensive machine learning infrastructure. In the case of Orion and GreenGuard, this led to the development of a database that catalogues the metadata from every machine learning experiment run using ML Bazaar. It also allowed time for the development of a standard data schema and data ingestors in Cardea and GreenGuard. Perhaps one of the most significant achievements of ML Bazaar is that it enables such ML infrastructure to be templatized across use cases.
Additional evidence may take the form of an enthusiastic community of users and widespread adoption of our work by the open-source community. We have made progress on this front and plan to continue growing community support.
In this section, we demonstrate the ability of the AutoBazaar system to automatically solve a wide variety of ML task types on a comprehensive evaluation corpus and assess the system’s performance across a variety of metrics. We use the ML Bazaar Task Suite, a corpus of 456 ML tasks and datasets (Section III-C). We then leverage the results to perform several case studies in which we show how a general-purpose evaluation setting can be used to assess the value of specific ML and AutoML primitives.
We run the search process for all tasks in parallel on a heterogeneous cluster of 400 AWS EC2 nodes, comprising m4.xlarge (4 CPU, 16GB RAM), m4.2xlarge (8 CPU, 32GB RAM), and m4.10xlarge (40 CPU, 160GB RAM) instances. In this distributed architecture, each ML task is solved independently on a node of its own over a 2-hour time limit, at an average rate of 0.13 pipelines scored per second. Metadata and fine-grained details about every pipeline evaluated are stored in a MongoDB document store. Ultimately, the best pipeline for each task at each checkpoint (10, 30, 60, and 120 minutes of search) is selected by considering the cross-validation score on the training set and is then re-scored on the held-out test set. (Exact replication files and detailed instructions for the experiments in this section are included here: https://github.com/micahjsmith/ml-bazaar-2019) The datasets and tasks we used in our experiments can also be accessed using our piex Python package for pipeline exploration and analysis.
One important attribute of AutoBazaar is the ability to improve pipelines for different tasks over time through search and tuning. We measure the improvement in the best pipeline per task in Figure 6. We find that the average task improves its best score by 1.06 standard deviations over the course of tuning, and that 31.7 percent of tasks improve by more than 1 standard deviation.
When new primitives are contributed by the ML community, they become candidates for inclusion in templates and hypertemplates, either to replace similar pipeline steps or to form the basis of new topologies. By running the end-to-end system on our evaluation corpus of datasets and tasks, we can assess the impact of the primitive in general, rather than on a small set of over-fit baselines.
In this first case study, we consider the hypothetical contribution of a new primitive that annotates the gradient boosting machine XGBoost (XGB). This primitive replaces the default random forest (RF) estimator in any templates in which it appeared. To compare these two primitives, we ran two experiments, one in which RF is used in templates and one in which XGB is substituted instead.
We consider 1.86 million relevant pipelines to determine the best scores produced for 367 tasks. We find that the XGB pipelines substantially outperformed the RF pipelines, winning 64.9 percent of the comparisons. This confirms the experience of practitioners, who widely report that XGBoost is one of the most powerful ML methods for classification and regression.
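A per-task comparison of this kind can be computed with a few lines of Python. The sketch below is illustrative only: the function name `win_rate`, the toy scores, and the conventions that higher scores are better and that ties are excluded from the denominator are assumptions, not a description of the exact analysis behind the figures above.

```python
def win_rate(scores_a, scores_b):
    """Fraction of decided comparisons won by variant A.

    Each argument maps a task name to the best score found for that task.
    Only tasks present in both mappings are compared, and ties are ignored.
    """
    shared = scores_a.keys() & scores_b.keys()
    wins = sum(scores_a[task] > scores_b[task] for task in shared)
    losses = sum(scores_a[task] < scores_b[task] for task in shared)
    decided = wins + losses
    return wins / decided if decided else float("nan")


# Toy best-score tables for two template variants (made-up numbers).
xgb_best = {"task_1": 0.90, "task_2": 0.70, "task_3": 0.80, "task_4": 0.50}
rf_best = {"task_1": 0.85, "task_2": 0.75, "task_3": 0.80, "task_4": 0.40}

xgb_win_rate = win_rate(xgb_best, rf_best)  # 2 wins, 1 loss, 1 tie
```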
The design of the ML Bazaar AutoML system and our extensive evaluation corpus allow us to easily swap in new AutoML primitives (Section IV-B) to see to what extent changes in components like tuners and selectors can improve performance in general settings.
In this case study, we revisit a work which was partially responsible for bringing about the widespread use of Bayesian optimization for tuning ML models in practice. Its contributions include: (1) proposing the usage of the Matérn 5/2 kernel, (2) describing an integrated acquisition function that integrates over uncertainty in the GP hyperparameters, (3) incorporating a cost model into an expected improvement per second acquisition function, and (4) explicitly modeling pending parallel trials. How important was each of these contributions to the resulting tuner (or tuners)?
Using ML Bazaar, we show how a more thorough ablation study, not present in the original work, would be conducted to address these questions, by assessing the performance of our general-purpose AutoML system using different combinations of these 4 contributions. Here, we focus on the proposal of the Matérn 5/2 kernel for the tuner meta-model (Section IV-B1), given by
$$k_{\mathrm{M52}}(x, x') = \theta_0 \left(1 + \sqrt{5\, r^2(x, x')} + \tfrac{5}{3}\, r^2(x, x')\right) \exp\left(-\sqrt{5\, r^2(x, x')}\right),$$
where $r^2(x, x') = \sum_{d=1}^{D} (x_d - x'_d)^2 / \theta_d^2$ and $D$ is the dimensionality of the configuration space.
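For readers who want to compare the two kernels numerically, the sketch below evaluates the standard Matérn 5/2 kernel and the squared exponential (SE) kernel on a scaled squared distance between two configurations. The function names and the per-dimension `lengthscales` and `amplitude` parameters are notation chosen for this example and are not taken from BTB.

```python
import math


def scaled_sq_dist(x, y, lengthscales):
    """Squared distance with one length scale per dimension."""
    return sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))


def se_kernel(x, y, lengthscales, amplitude=1.0):
    """Squared exponential kernel: amplitude * exp(-r^2 / 2)."""
    return amplitude * math.exp(-0.5 * scaled_sq_dist(x, y, lengthscales))


def matern52_kernel(x, y, lengthscales, amplitude=1.0):
    """Matern 5/2 kernel: amplitude * (1 + sqrt(5 r^2) + 5 r^2 / 3) * exp(-sqrt(5 r^2))."""
    r2 = scaled_sq_dist(x, y, lengthscales)
    root = math.sqrt(5.0 * r2)
    return amplitude * (1.0 + root + 5.0 * r2 / 3.0) * math.exp(-root)


# Two configurations in a 2-dimensional space at scaled squared distance 1.
x, y = (0.0, 0.0), (1.0, 0.0)
scales = (1.0, 1.0)

k_matern = matern52_kernel(x, y, scales)
k_se = se_kernel(x, y, scales)
```

Both kernels equal the amplitude when the two configurations coincide. The Matérn 5/2 kernel yields sample functions that are only twice differentiable, whereas the SE kernel assumes infinitely smooth functions, which is the usual motivation for preferring the former on hyperparameter response surfaces.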
We run experiments using a baseline tuner with a squared exponential kernel (GP-SE-EI) and compare it with a tuner using the Matérn 5/2 kernel (GP-Matern52-EI). In both cases, the kernel hyperparameters are set by optimizing the marginal likelihood. This experiment allows us to isolate the contributions of the proposed kernel in the context of general-purpose ML workloads.
In total, 431 thousand pipelines were evaluated to find the best pipelines for a subset of 414 tasks. We find that there is no significant improvement from using the Matérn 5/2 kernel over the SE kernel — in fact, the GP-SE-EI tuner outperforms, winning 60.1 percent of the comparisons. One possible explanation for this negative result is that the Matérn kernel is sensitive to hyperparameters which are set more effectively by optimization of the integrated acquisition function. This is supported by the over-performance of the tuner using the integrated acquisition function in the original work; however, the integrated acquisition function is not tested with the baseline SE kernel, and more study is needed.
Throughout this paper, we have built up abstractions, interfaces, and software components for data scientists, data engineers, and other practitioners to effectively develop machine learning systems. Users of ML Bazaar can develop one-off pipelines, tuned templates, or full-fledged distributed AutoML systems. Researchers can contribute ML or AutoML primitives and make them easily accessible to a broad base for inclusion in end-to-end solutions.
We have applied this approach to several real-world ML problems and entered our AutoML system in a modeling challenge. As we collect more and more scored pipelines, we expect opportunities will emerge for meta-learning and debugging on ML tasks and pipelines, as well as the ability to track progress and transfer knowledge within data science organizations. We will focus on several complementary extensions in future work. These include continuing to improve our AutoML system and making it more robust for everyday use by a diverse user base, and studying how to best support users of different backgrounds in using and interacting with ML and AutoML systems.
The authors would like to acknowledge the contributions of the following people: Laura Gustafson, William Xue, Akshay Ravikumar, Ihssan Tinawi, Alexander Geiger, Saman Amarasinghe, Stefanie Jegelka, Zi Wang, Benjamin Schreck, Seth Rothschild, Manual Alvarez Campo, Sebastian Mir Peral, Plamen Valentinov Kolev, Peter Fontana, and Brian Sandberg. The authors are part of the DARPA Data-Driven Discovery of Models (D3M) program, and would like to thank the D3M community for the discussions around the design.
H. Miao, A. Li, L. S. Davis, and A. Deshpande, “Modelhub: Deep learning lifecycle management,” in 2017 IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 2017, pp. 1393–1394.
D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, C. Y. Koo, L. Lew, C. Mewald, A. N. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang, M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich, “Tfx: A tensorflow-based production-scale machine learning platform,” in KDD, 2017.
J. M. Kanter, “Deep Feature Synthesis: Towards Automating Data Science Endeavors,” in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015, pp. 1–10.
NIPS 2016 Workshop on Artificial Intelligence for Data Science, 2016.
H. Wang, B. van Stein, M. Emmerich, and T. Back, “A new acquisition function for bayesian optimization based on the moment-generating function,” in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2017, pp. 507–512.