The explosion of digital data has made the use of machine learning (ML) more ubiquitous than ever before. Machine learning is now applied to almost any aspect of organizational work, and used to generate significant value. The growth in the use of ML, however, was not matched by a growth in the number of people who can effectively apply it, namely data scientists. This shortage in skilled practitioners has spurred efforts to automate various aspects of the data scientist’s work.
Automatic machine learning (AutoML) is a general term used to describe algorithms and frameworks that deal with the automatic selection and optimization of ML algorithms and their hyperparameters. Examples of AutoML include automatic hyperparameter selection for predefined algorithms [hutter2011sequential], automatic feature engineering [katz2016explorekit], and neural architecture search [bello2017neural]. While effective, the above-mentioned studies sought to optimize only specific steps of the overall process undertaken by human data scientists. In recent years, studies exploring the problem of automatic ML pipeline generation have sought to automate the process end-to-end by generating entire ML pipelines.
The creation of entire ML pipelines is challenging because it involves a large and complex search space. Even simple pipelines usually involve multiple steps such as data preprocessing, feature selection, and the use of a classifier. Complex pipelines can contain both additional types of algorithms (e.g., feature engineering) and multiple algorithms of each type. The large number of available algorithms and the fact that the performance of each component of the pipeline is highly dependent on the input it receives from previous component(s) further complicate this task.
Existing approaches for automatic pipeline generation can be roughly divided into two groups: constrained space and unconstrained space. Constrained space approaches generally create a predefined pipeline structure and then search for the best combination of algorithms to populate it. Studies that utilize this approach include Auto-Sklearn [feurer2015efficient] and Auto-Weka [thornton2013auto]. This approach narrows the search space, but it prevents the discovery of novel pipeline architectures. Unconstrained space approaches place few or no restrictions on the structure of the pipeline, but they come at a higher computational cost. Approaches of this kind include TPOT [olson2016tpot] and AlphaD3M [drori2019automatic].
In this study we propose DeepLine, a novel semi-constrained approach for AutoML pipeline generation. While our approach constrains the maximal size of the pipelines, it supports the inclusion of multiple algorithms of the same type (e.g., classification), as well as the creation of parallel sub-pipelines. In addition, any compatible algorithm(s) can serve as the input of another, ensuring that novel and interconnected architectures can be discovered.
Another important advantage of DeepLine over previous work is its ability to learn across multiple datasets. We apply deep reinforcement learning (DRL) techniques that enable our approach to perform all of its learning offline. This fact considerably speeds up performance for new datasets while enabling us to leverage past experience and improve DeepLine’s performance over time.
Our contributions in this study are as follows: (1) We present DeepLine, a novel approach for automatic ML pipeline generation. Our approach uses DRL to learn across multiple datasets, enabling it to efficiently produce pipelines for previously unseen datasets; (2) We propose a novel hierarchical action-modeling approach, which enables us to use a fixed-size representation to model dynamic action spaces. This hierarchical solution not only speeds up the training process of the DRL agent but also enables the use of DRL methods that do not support dynamic action spaces. We implement our solution on the OpenAI Gym platform and publish the code; and (3) We conduct an extensive evaluation on 56 datasets and show that DeepLine outperforms state-of-the-art methods, both constrained and unconstrained.
Applying AutoML for end-to-end pipeline generation has been an active field of research in recent years. Various studies offer a large variety of approaches to address this challenge, including Bayesian optimization [hutter2011sequential] and evolutionary algorithms [olson2016tpot]. One popular example is Auto-Weka [thornton2013auto], which automatically selects an algorithm for each step of a pipeline with a fixed structure and then uses Bayesian optimization (sequential model-based optimization) to search for optimal hyperparameter settings of the pipeline.
Following Auto-Weka, [feurer2015efficient] proposed an AutoML system called Auto-Sklearn. Auto-Sklearn searches through a set of pre-generated, fixed-structure pipelines. These pipelines contain placeholders for data preprocessing, feature selection, and prediction model algorithms. It also uses past knowledge and meta-learning to guide the initial stages of the exploration process. Auto-Sklearn also has the option to use an ensemble of its generated pipelines.
Extending the standard definition of a pipeline, TPOT [olson2016tpot] is a tree-based pipeline optimization tool for AutoML. It enables the formation of dynamic pipeline architectures with multiple prediction and preprocessing algorithms that can be linked either in sequence or in parallel. TPOT uses evolutionary algorithms both for the creation of the pipeline structure and for ML algorithm selection. Hyperparameter optimization is also supported. Similarly to TPOT, Autostacker [chen2018autostacker] also generates pipelines with evolutionary algorithms, but it does so using a layer-based architecture.
Recently, a different approach for pipeline generation was proposed by [drori2018alphad3m]. The system, called AlphaD3M, uses a deep reinforcement learning (DRL) approach inspired by AlphaZero [silver2017mastering] and expert iteration [anthony2017thinking]. AlphaD3M represents the pipeline search and population challenges as a single-player game, where the player iteratively builds a pipeline by selecting from a set of actions such as insertion, deletion and replacement of various pipeline elements. While these approaches enable the generation of more complex pipelines, they are also much more computationally expensive [milutinovic2017end].
While effective, all the above methods perform all their learning ‘from scratch’ for each given dataset. Several approaches do use meta-learning, but only in a limited capacity and for initialization purposes. Our approach, on the other hand, relies heavily on learning from previously analyzed datasets and is therefore capable of producing high-quality pipelines for new datasets in a fraction of the time previous studies require.
We define a learning job $J = (D, T, M)$ consisting of a tabular dataset $D$ of $m$ columns and $n$ instances, a prediction task $T$, and an evaluation metric $M$. Additionally, we define primitives as any type of machine learning algorithm (e.g., preprocessing, feature selection, classification) and denote the set of primitives as $Pr$. We define a directed acyclic graph (DAG) of primitives $G = (V, E)$, where $V$ are the vertices of the graph, and $E$ are the edges of the graph and determine the primitives’ order of activation. We denote $G$ as an ML pipeline, or $P$ in short. Our goal is to generate a pipeline $P^{*}$ so as to achieve
$$P^{*} = \arg\min_{P} \mathcal{L}(P, J)$$
where $\mathcal{L}(P, J)$ is the error of pipeline $P$ over the learning job $J$.
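To make the DAG view of a pipeline concrete, the following minimal sketch derives an activation order from the edges using Python's standard `graphlib` module (Python 3.9+). The primitive names and the dictionary layout are illustrative assumptions, not part of DeepLine.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each vertex maps to the set of vertices it takes input from.
pipeline = {
    "imputer": set(),
    "scaler": {"imputer"},
    "select_k_best": {"scaler"},
    "xgboost": {"select_k_best"},
    "random_forest": {"imputer", "xgboost"},
}

def activation_order(dag):
    """Return the primitives in an order that respects every edge."""
    return list(TopologicalSorter(dag).static_order())

def is_valid_pipeline(dag):
    """A pipeline is only valid if its graph is acyclic."""
    try:
        activation_order(dag)
        return True
    except CycleError:
        return False

print(activation_order(pipeline))
```

Because every primitive in this toy graph depends on its predecessor, only one activation order is possible; in wider graphs several orders can be valid.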
To reduce the size of our search space, we consider the problem of generating the pipeline $P$ as a sequential decision making task, and generate the pipeline step-by-step. We further formalize the problem as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, F, R)$, where $\mathcal{S}$ is the set of all possible states (i.e., pipeline and learning job configurations), $\mathcal{A}$ is the set of all possible actions defining the transitions between states, $F$ is the (deterministic) transition function between states, and $R$ is the reward function, which is directly derived from the pipeline score with the evaluation metric $M$. We also use the rewards discount factor $\gamma$.
Given the expected sum of discounted rewards $\mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r_{t}\right]$ produced by a policy $\pi$, where $r_{t}$ is the reward received at time step $t$, our goal is to obtain $\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r_{t}\right]$.
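As a small illustration of the quantity being maximized, the discounted return of a single episode can be computed as below. The reward sequence and discount factor are made-up values chosen for the example, and the assumption that the reward arrives only once the pipeline is complete is ours.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the time steps of one episode."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Hypothetical episode: no reward until the final step, then the pipeline score.
episode_rewards = [0.0, 0.0, 0.0, 0.8]
print(discounted_return(episode_rewards, gamma=0.9))
```

With a discount factor below 1, the same final score is worth less the later it is obtained, which favors shorter routes to a good pipeline.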
Next we explain how we solve this problem through the application of reinforcement learning.
The Proposed Framework
Our framework consists of three main components: an environment, an agent, and a hierarchical-step plugin serving as a mediator between the two.
The main challenge we faced in defining the environment was the need to restrict the size of our state space while maintaining the ability to produce effective and complex pipelines. Our proposed solution consists of two parts: (1) the partition of the ML primitives into families, and; (2) a grid-world representation of the pipeline.
We group our primitives into the following families: (1) data preprocessing: data cleaning, feature encoding, etc.; (2) feature preprocessing: feature discretization, scaling and normalization; (3) feature selection: uni-variate selection, entropy-based selection, etc.; (4) feature engineering: data enrichment, dimensionality reduction, etc.; (5) classification and regression models: XGBoost, Random Forest, lasso regression, etc.; and (6) combiners: the same algorithms as the previous family, but here they are used as meta-learners, combining the prediction results produced by the primitives of family 5. The input of a combiner is not limited to predictions; it can also receive the output of primitives from other families.
As we later explain in detail, at any time-step our agent can only select from a single family of primitives. By doing so we reduce the sizes of both the state space and the action space of the agent.
Grid-world representation of the pipeline
We design our pipeline representation based on two common practices used by data scientists and academic studies. The first is the specific order in which primitive families are applied in a pipeline (e.g., pre-processing → feature selection → classification) and the second is the union of inputs from two or more sub-pipelines into a single pipeline (e.g., in [he2017neural]).
As depicted in Figure 1, our state space is defined as a grid consisting of six columns (one for each of the six primitive families) and a configurable number of rows. Each cell of the grid is a placeholder for a primitive. The cells of a given column can only contain members of that column’s assigned primitive family. Grid cells can be left empty, meaning that a pipeline doesn’t have to contain all types of primitives.
The output of a grid cell is automatically passed to the subsequent non-empty cell in the same row. For example, in Figure 1 we see that the Mutual Information component (step #2) is connected to the XGBoost classifier (step #3) since the Feature Engineering cell was left empty. Additionally, the output of a cell can be passed to additional cells in other rows, thus creating more complex pipelines. A cell can be connected to any other cell under two conditions: (i) the target cell’s input is valid according to predefined rules (see the following section and Alg. 1), and (ii) the column index of the target cell is equal to or larger than that of the source cell. Any cell may receive multiple inputs, in which case all inputs are concatenated and duplicate columns are removed. An example of this is presented in Figure 1, where the RF classifier (step #6) receives inputs from two cells (steps #1 and #5).
We further reduce the size of the state space by constraining the transition between states. Only a single grid cell – marked by the dot cursor in Figure 1 – can be updated at any time step. Additionally, the cursor performs only a single pass on the grid in each episode, completing one row in the grid before moving to the next, making the process more manageable and efficient.
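The grid mechanics described above can be sketched in a few lines. This is an illustrative sketch only: the family identifiers, the row contents, and the function names are assumptions made for the example, and the validity rules of condition (i) are left out.

```python
# Hypothetical identifiers for the six primitive families (one per grid column).
FAMILIES = [
    "data_preprocessing", "feature_preprocessing", "feature_selection",
    "feature_engineering", "model", "combiner",
]

def row_links(row):
    """Connect each non-empty cell to the next non-empty cell in the same row."""
    filled = [col for col, primitive in enumerate(row) if primitive is not None]
    return list(zip(filled, filled[1:]))

def column_rule_ok(source_col, target_col):
    """Condition (ii): the target column index must be >= the source column index."""
    return target_col >= source_col

def cursor_order(n_rows, n_cols=len(FAMILIES)):
    """Single pass over the grid, completing one row before moving to the next."""
    return [(row, col) for row in range(n_rows) for col in range(n_cols)]

# A row in which the feature preprocessing, feature engineering and combiner
# cells were left empty.
row = ["imputer", None, "mutual_information", None, "xgboost", None]
print(row_links(row))  # [(0, 2), (2, 4)]
```

Skipping an empty cell simply links its non-empty neighbors, which mirrors the Mutual Information to XGBoost connection in the example above.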
Although our generated pipelines may vary in width and depth, we create a fixed-size representation that can be easily used as input for a neural net (NN). The state vector is a concatenation of the following:
– A vector representing the primitives in the grid. We use a one-hot encoding for the primitive of each cell (blank cells have their own encoding value), so the length of this vector is proportional to the number of grid cells. This vector is sparse, and we create an embedding to compress it (see the following section and Figure 3).
– A vector representing the incoming edges (i.e., inputs) of each cell in the grid. Its length is the number of grid cells multiplied by the maximal number of inputs per cell, which is a configurable parameter. For example, in Figure 1, the cell containing the RF classifier (#6) has two inputs (#1 and #5), so its entry in the vector holds these two input indices, with any remaining slots left unused.
– A vector of general meta-data describing the pipeline’s topology: number of nodes and edges, graph centrality, etc.
– A vector of dataset-based features representing the data being processed by the current grid cell. The values of this vector include the number of features, number of instances, percentage of numeric features, etc. It is important to note that we generate this representation for the current form of the dataset, i.e., after it has been processed by the primitives already in the pipeline.
– A concatenation of the following vectors: a task vector, with one entry for each possible task (e.g., regression, classification, etc.) that can be pursued by our model; a metric vector, with one entry for each possible metric (e.g., accuracy, AUC, etc.); and a dataset-based features vector of the raw dataset, similar to the one computed for the current grid cell.
– A vector representing the available actions. Each action is represented by a vector that details both a candidate primitive for the current grid cell and its connection(s) to the cells that provide its input. At each time step, a fixed number of actions is available for choice, as shown in the bottom row of Figure 1. All action vectors are concatenated. We elaborate on this further in the next section.
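The first two components of the state vector can be sketched as follows. The primitive vocabulary, the reserved blank index, and the `-1` padding value for unused input slots are assumptions made for illustration; the text does not specify how these are encoded.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def encode_grid(grid_cells, vocabulary):
    """One-hot encode every cell; index 0 is reserved for a blank cell (assumption)."""
    size = len(vocabulary) + 1
    encoded = []
    for primitive in grid_cells:
        index = 0 if primitive is None else vocabulary.index(primitive) + 1
        encoded.extend(one_hot(index, size))
    return encoded

def encode_inputs(cell_inputs, max_inputs, pad=-1):
    """Fixed-length list of input indices per cell; unused slots hold `pad` (assumption)."""
    encoded = []
    for inputs in cell_inputs:
        if len(inputs) > max_inputs:
            raise ValueError("a cell has more inputs than max_inputs")
        encoded.extend(list(inputs) + [pad] * (max_inputs - len(inputs)))
    return encoded

vocabulary = ["imputer", "random_forest"]            # hypothetical primitives
grid_vector = encode_grid(["imputer", None], vocabulary)
inputs_vector = encode_inputs([[1, 5], []], max_inputs=3)
state_prefix = grid_vector + inputs_vector           # concatenation of components
print(state_prefix)
```

Both vectors have a length that depends only on the grid size and the configured limits, never on how many cells are actually populated, which is what makes the representation fixed-size.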
We first describe how we generate the set of possible actions for each time step, reducing the action space of our environment. We denote this set as the actions open list. We then describe our novel hierarchical step plugin approach for reducing the actions space and accelerating training.
Open List of Actions
The list of possible actions for a given cell (the one with the cursor) is determined by two elements: (1) the primitive family assigned to the cell, and; (2) the set of possible candidate inputs, out of all the grid cells’ outputs.
The process for generating the open actions list is presented in Algorithm 1. We begin by creating the set of all possible input combinations. While a cell always receives as input the output of the most recently populated cell in its own row, it may also receive additional inputs, up to the maximal number of inputs per cell, from any other previously populated cell with an equal or smaller column index in previous rows. For example, step #6 in Figure 1 will always receive input from step #5, but may also receive inputs from steps #1, #2 and #3. Step #4 cannot be an input for #6 because it precedes #5, and #7 also cannot be used because it is populated at a later time step.
Once we have created the set of all possible inputs, we match every item in this set to all the members of the cell’s assigned primitive family (lines 6-10 in Alg. 1). The validity of each combination is then examined, with possible reasons for elimination including the inability to process categorical features, missing entries, or negative values. All valid combinations are retained and make up the final open list.
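The open-list construction can be sketched as below. This is not a transcription of Algorithm 1: the function names, the two example primitives, and the validity rule are hypothetical, and we assume that the mandatory in-row input counts toward the maximal number of inputs.

```python
from itertools import combinations

def candidate_input_sets(mandatory, optional, max_inputs):
    """Input sets holding `mandatory` plus up to max_inputs - 1 optional cells."""
    sets = []
    for extra in range(max_inputs):
        for combo in combinations(sorted(optional), extra):
            sets.append(tuple(sorted((mandatory,) + combo)))
    return sets

def build_open_list(primitives, input_sets, is_valid):
    """Pair every input set with every primitive and keep only the valid pairs."""
    return [(primitive, inputs)
            for inputs in input_sets
            for primitive in primitives
            if is_valid(primitive, inputs)]

# Step #6 always receives #5 and may also receive #1, #2 or #3.
input_sets = candidate_input_sets(mandatory=5, optional=[1, 2, 3], max_inputs=2)

# Hypothetical validity rule: this primitive cannot process negative values,
# and cell #2 is assumed to output negative values.
cells_with_negatives = {2}

def is_valid(primitive, inputs):
    return not (primitive == "multinomial_nb" and cells_with_negatives & set(inputs))

open_list = build_open_list(["random_forest", "multinomial_nb"], input_sets, is_valid)
print(len(input_sets), len(open_list))  # 4 7
```

Even in this tiny example the number of candidate actions depends on the cell and on what was populated before it, which is the source of the dynamic action space discussed next.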
While our representation of the possible actions is concise, the fact that it is dynamic in size is problematic. Because each cell is likely to have a different number of actions (and also a different set of actions), all DRL algorithms that rely on a fixed action space cannot be applied on our representation. Such algorithms include the popular policy gradients, e.g., TRPO [schulman2015trust], and deep Q-networks (DQN) [mnih2015human]. While solutions to this problem exist in the literature, they are not without their shortcomings. One approach is to create a fixed-size set with all possible inputs-primitives combinations. The main disadvantage of this solution is the large size of the actions vector, which will make training the agent slow and difficult. Another option is to calculate the Q-value of every state-action pair in each iteration, but this approach is both computationally expensive and only applicable to some RL methods. For these reasons, we now propose our hierarchical approach for dynamic actions modeling.
Hierarchical Representation of Actions
Our goal is to enable an RL agent to model a varying number of actions using a fixed-size representation. We devise a hierarchical representation of the open actions list, where each level of the hierarchy is split into equal-sized clusters of actions, with the cluster size defined by a configurable parameter. The agent iterates over the clusters of each level, selecting one action per cluster. The chosen actions are passed to the next level of the hierarchy, which is clustered in turn. The process is repeated until it reaches a hierarchy level in which the number of actions is at most the cluster size, where a single action is chosen out of the finalists.
Figure 2 depicts an example of the process. In this example our open actions list consists of 360 possible actions. Given a cluster size of 6, the top level of the hierarchy is split into 60 clusters, from which 60 actions are selected (one per cluster).
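The arithmetic of this example can be followed through all levels with a short helper. Rounding the number of clusters up when the level does not divide evenly is our assumption, consistent with padding incomplete clusters.

```python
import math

def level_sizes(n_actions, cluster_size):
    """Number of candidate actions at each level of the hierarchy."""
    sizes = [n_actions]
    while sizes[-1] > cluster_size:
        # One action survives per cluster; a partial cluster still counts as one.
        sizes.append(math.ceil(sizes[-1] / cluster_size))
    return sizes

print(level_sizes(360, 6))  # [360, 60, 10, 2]
```

Under this assumption, 360 actions shrink to 60, then 10, then 2 finalists, from which the single final action is chosen.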
The proposed algorithm is presented in Algorithm 2. Its inputs are the actions matrix and the number of actions in each cluster. The actions matrix consists of vector representations of the actions in the open list, where each action vector is the concatenation of the candidate primitive vector, the input indices for the primitive, and the dataset-based features of the input. We partition the actions matrix with the MakeClusters method, which returns the action indices of each cluster. This method also pads clusters that do not have exactly the required number of actions with invalid actions. Choosing an invalid action incurs a negative reward (i.e., a penalty). Our agent evaluates all available actions in each cluster, relying on their representations, which are gathered in the actions vector. As far as the agent is concerned, it only sees a single cluster at each time-step, represented by the actions vector that is added to the state vector in line 6 of the algorithm. The selected actions are passed to the next level of the hierarchy.
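The selection loop can be sketched as follows. This is a simplified stand-in rather than Algorithm 2 itself: clusters are formed by position instead of by similarity, padding with invalid actions is omitted (the last cluster may simply be shorter), and `choose` is a hypothetical placeholder for the agent's decision on one cluster.

```python
def hierarchical_select(actions, cluster_size, choose):
    """Reduce a variable-size action list to one action, one cluster at a time.

    Returns the selected action and the number of cluster-level decisions made.
    """
    level = list(actions)
    decisions = 0
    while len(level) > cluster_size:
        clusters = [level[i:i + cluster_size]
                    for i in range(0, len(level), cluster_size)]
        level = [choose(cluster) for cluster in clusters]
        decisions += len(clusters)
    return choose(level), decisions + 1

# Toy "agent" that prefers the largest action id in whatever cluster it is shown.
selected, n_decisions = hierarchical_select(range(360), cluster_size=6, choose=max)
print(selected, n_decisions)  # 359 73
```

Each call to `choose` sees at most `cluster_size` candidates, so the decision maker never needs an output layer larger than the cluster size, at the cost of making several decisions per pipeline step.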
We create the clusters using a variation of the popular K-Means clustering algorithm [hartigan1979algorithm]. The algorithm is applied at each level of the hierarchy, with the number of clusters determined by the number of actions at that level and the cluster size. Our approach has one significant difference from the standard algorithm: we require not only a fixed number of clusters but also a fixed number of samples in each cluster. We achieve this by selecting for each centroid its nearest samples, as many as the cluster size, using the cosine similarity metric.
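One way to obtain clusters of a fixed size is a greedy assignment in which each centroid, in turn, takes its most similar unassigned samples. The sketch below shows only this assignment step under that assumption; the centroid update of K-Means is omitted, the centroids are given by hand, and `None` is our stand-in for a padded (invalid) action.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def fixed_size_clusters(vectors, centroids, cluster_size):
    """Greedy assignment: every cluster ends up with exactly `cluster_size` slots."""
    unassigned = set(range(len(vectors)))
    clusters = []
    for centroid in centroids:
        ranked = sorted(unassigned,
                        key=lambda i: (-cosine(vectors[i], centroid), i))
        chosen = ranked[:cluster_size]
        unassigned -= set(chosen)
        # Pad incomplete clusters so that all clusters have the same size.
        clusters.append(chosen + [None] * (cluster_size - len(chosen)))
    return clusters

vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 1]]   # toy action vectors
centroids = [[1, 0], [0, 1], [1, 1]]                          # hand-picked centroids
print(fixed_size_clusters(vectors, centroids, cluster_size=2))
# [[0, 1], [2, 3], [4, None]]
```

A greedy pass depends on the order in which centroids are visited; other balanced-assignment strategies are possible and may behave differently.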
An important strength of the proposed approach is the fact that the hierarchical process is transparent to the agent. In addition to being compatible with a variety of popular DRL approaches, including actor-critic methods such as A3C [mnih2016asynchronous] and the different variations of DQN, DeepLine places no limitations on the use of exploration-exploitation techniques such as $\epsilon$-greedy or prioritized experience replay [schaul2015prioritized], which we use in our model. Furthermore, we implement the hierarchical step as a part of our environment in compliance with OpenAI Gym’s settings, making it suitable for use with any DRL agent.
As we show in the evaluation section, the hierarchical approach outperforms a model with no hierarchical plugin, where the agent has to learn the environment with all possible unique actions at every time-step.
We implement our agent using the DQN algorithm, which is an off-policy algorithm. While on-policy algorithms such as policy gradients are generally more stable, they are also less sample-efficient and prone to converge to a local optimum. Moreover, while on-policy approaches generally outperform off-policy approaches in large action spaces, our hierarchical representation of actions makes this point irrelevant.
A recent improvement to the DQN algorithm is dueling-DQN (D-DQN) [wang2015dueling]. D-DQN achieves faster convergence by decoupling the Q-function into the value function of the state and the advantage function of the actions, thus enabling the DQN agent to learn the value function $V(s)$ separately from the actions.
The D-DQN architecture consists of two separate sub-architectures – one for the value function and one for the advantage of each action over the average – each with its own output layer. Both sub-architectures are fed to a global output layer which computes the combined loss.
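Our implementation is a variation of the D-DQN, which makes use of the fact that our state representation consists of multiple components. Our dueling architecture is presented in Figure 3. We partition the state vector as follows: the vectors that model the state of the grid form the input to the value-function sub-architecture. The vectors that model the task and the possible actions form the input to the action advantage sub-architecture. We define the architecture’s objective function as follows:
$$Q(s, a) = V(s_{v}) + \left(A(s_{a}, a) - \frac{1}{|\mathcal{A}|}\sum_{a' \in \mathcal{A}} A(s_{a}, a')\right)$$
where the state and action vectors are $s_{v}$, the concatenation of the vectors that model the state of the grid, and $s_{a}$, the concatenation of the vectors that model the task and the possible actions.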
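The way a dueling architecture combines its two streams can be illustrated numerically. The sketch below uses the standard dueling aggregation, in which each action's advantage is measured relative to the average advantage; the state value and advantages are made-up numbers, not outputs of the actual network.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (adv - mean_advantage) for adv in advantages]

# Hypothetical outputs of the value stream and the advantage stream.
q_values = dueling_q_values(state_value=1.0, advantages=[0.5, -0.25, 0.5])
print(q_values)  # [1.25, 0.5, 1.25]
```

Subtracting the mean advantage makes the average of the Q-values equal to the state value, so the two streams cannot trade a constant offset back and forth.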
Since the vector representation of the primitives is sparse, we add an embedding layer. Because both the action advantage sub-architecture and the hierarchical plugin use the actions vector, which also contains a representation of primitives, we use the same embedding layer in all cases. Applying the same embedding for the hierarchical step means that in the early stages of training the action representations are random, but as training progresses the representations become meaningful and the clusters become more cohesive.
Due to the unique characteristics of our problem domain, our D-DQN implementation differs from the one proposed in [wang2015dueling] in several important aspects. Most significantly, we use a long short-term memory (LSTM) architecture in the value-function sub-architecture [hochreiter1997long]. We use LSTM due to the sequential manner in which we construct our ML pipeline, where a single fixed-order sweep of the grid is performed. As a result, the action-advantage sub-architecture, which consists only of fully-connected layers, is completely separate from the value-function sub-architecture. This is unlike the original D-DQN implementation, where the lower layers are shared.
Our algorithm for training the agent is identical to the one presented in [mnih2015human], except for one main difference: our use of the hierarchical-step plugin, which replaces the application of a conventional training step. However, this change, as explained in the previous section, is transparent to the architecture and does not require any modification to the D-DQN algorithm or to the exploration methods.
We evaluate our framework over 56 classification datasets with large variety in size, number of attributes, feature type composition and class imbalance. All datasets are available in the following online repositories: UCI, OpenML, and Kaggle.
We evaluate two groups of baselines: (1) Basic popular pipelines, used to evaluate whether our approach is better than popular algorithms that are often used by non-experts. This group consists of three pipelines, each consisting of two pre-processing primitives (missing values imputation and one-hot encoding for categorical features) and one of the following classifiers: Random Forest [liaw2002classification], XGBoost [chen2016xgboost] and Extra-Trees [geurts2006extremely]; and (2) Pipeline generation frameworks: we chose two popular open source pipeline generation platforms, TPOT and Auto-Sklearn. The former achieves current state-of-the-art results, while the latter is built on one of the most popular machine learning libraries.
For both TPOT and Auto-Sklearn, we used the default parameter settings. In the case of TPOT, this results in the generation and evaluation of 10,000 pipelines for each dataset. We run Auto-Sklearn for 30 minutes on each dataset, resulting in approximately 700 pipelines per dataset. To ensure fair comparison, we limit the list of primitives used by DeepLine to those used by TPOT.
Both TPOT and Auto-Sklearn return a single pipeline by default. This pipeline is chosen based on its average performance on the folds of the training set. For DeepLine, we found that integrating our agent’s Q-function into the selection process improved its performance. We define the score of a pipeline as a combination of two terms, weighted by a tunable parameter: the Q-value of the final state and action of the episode, and the k-fold validation performance on the training set.
We evaluate two versions of DeepLine (see Table 1). The first, the vanilla version, returns the top-ranked pipeline. The second, the ensemble version, generates multiple pipelines and creates a weighted average of their predictions. The weight assigned to each pipeline is determined by its score.
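The scoring and ensembling steps can be sketched as below. The exact form of the score is not reproduced here: the convex combination with a parameter `alpha` is purely an assumption for illustration, as are the function names and all numeric values.

```python
def pipeline_score(q_value, cv_performance, alpha):
    """ASSUMED form: convex combination of the Q-value and the k-fold performance."""
    return alpha * q_value + (1 - alpha) * cv_performance

def weighted_ensemble(predictions, scores):
    """Score-weighted average of the per-instance predictions of several pipelines."""
    total = sum(scores)
    weights = [score / total for score in scores]
    n_instances = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions))
            for i in range(n_instances)]

# Two hypothetical pipelines predicting a class probability for two instances.
predictions = [[0.5, 1.0], [1.0, 0.0]]
scores = [0.75, 0.25]
print(pipeline_score(q_value=0.5, cv_performance=1.0, alpha=0.5))  # 0.75
print(weighted_ensemble(predictions, scores))                      # [0.625, 0.75]
```

Normalizing the scores into weights keeps the ensemble output on the same scale as the individual predictions.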
We used the following settings throughout the evaluation: we set the number of grid rows to 3 and the number of grid columns (one per primitive family) to 6, meaning that we use a 3x6 grid (see Figure 1) to represent our generated pipelines.
The DRL agent’s NN architecture is constructed as follows: the value-function sub-architecture consists of embedding vectors of size 15 and an LSTM of size 80, followed by three fully connected layers with lengths of 256, 128 and 32. The action-advantage sub-architecture consists of four fully connected layers with lengths of 256, 128, 64 and 32. The NN’s learning rate is set to .
Our 56 datasets were randomly partitioned into four folds of 14 datasets each. We used k-fold cross-validation: for each evaluated fold, we trained our model on the remaining three. Each dataset in the test fold is evaluated as follows: the dataset is split into train and test sets in a ratio of 0.8:0.2. The train set is used for the pipeline exploration by the trained agent. The returned top pipeline(s) are then evaluated on the test set with the accuracy metric.
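The dataset-level cross-validation protocol can be sketched as follows. The dataset names and the random seed are placeholders; only the fold structure (four folds of 14, train on three, test on one) follows the description above.

```python
import random

def dataset_folds(dataset_names, n_folds, seed=0):
    """Randomly partition the datasets into `n_folds` folds of (near-)equal size."""
    names = list(dataset_names)
    random.Random(seed).shuffle(names)
    return [names[i::n_folds] for i in range(n_folds)]

def train_test_rounds(folds):
    """Yield (training datasets, test datasets) for every evaluated fold."""
    for i, test_fold in enumerate(folds):
        train = [name for j, fold in enumerate(folds) if j != i for name in fold]
        yield train, test_fold

datasets = [f"dataset_{i}" for i in range(56)]   # placeholder names
folds = dataset_folds(datasets, n_folds=4)
for train, test in train_test_rounds(folds):
    print(len(train), len(test))                 # 42 14 in every round
```

Note that the folds here partition whole datasets, so the agent is always tested on datasets it never saw during training; the 0.8:0.2 split is then applied inside each test dataset.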
Evaluation Results and Analysis
Table 1 shows the results of our evaluation. In addition to calculating the accuracy, we also present the percentage of datasets in which DeepLine’s accuracy was better or equal (BOE) to that of the corresponding baseline. It is clear that both versions of DeepLine outperform all the baselines for pipelines. Moreover, the ensemble version of our approach outperforms all the baselines – both in terms of accuracy and percentage of datasets positively affected – by a considerable margin.
We used a paired t-test to determine whether the differences between DeepLine and the baselines are significant. Table 2 reports the resulting significance levels. The differences between the ensemble version of DeepLine and all baselines are significant; the only exception is the ensemble version of Auto-Sklearn, for which a weaker significance level is reached. The results for the vanilla version of DeepLine are less significant than those of the ensemble version.
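For reference, the statistic behind a paired t-test can be computed from the per-dataset differences as shown below. The input numbers are toy values, not results from our evaluation, and the sketch stops at the t statistic; converting it to a p-value requires the t-distribution with n - 1 degrees of freedom (available, for example, through `scipy.stats.ttest_rel`).

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(n)) for the paired differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Toy paired measurements for two methods on four datasets.
method_a = [3, 5, 7, 6]
method_b = [2, 3, 4, 4]
print(round(paired_t_statistic(method_a, method_b), 3))  # 4.899
```

The test is paired because both methods are evaluated on the same datasets, so dataset difficulty cancels out in the differences.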
Analyzing the contribution of the hierarchical plugin
In order to evaluate the contribution of the hierarchical plugin, we retrain our agent using the same parameters and dataset fold partitions, but without the hierarchical representation. By removing the plugin, we increase the size of the action space available to the agent from 6 to 7,800. Figure 4 plots the average reward obtained by the agent over all 56 datasets for the first 5,000 episodes of the training. It is clear that the hierarchical plugin not only enables faster convergence, but also produces better training results.
Additional analysis of the agent’s behavior during training with and without the plugin shows significant differences in action selection. While the hierarchical plugin enables the agent to explore various primitive combinations, the large action space of the non-hierarchical model made it difficult for the agent to explore effectively. As a result, that agent chose the “blank” option much more frequently. While this action is always legal (i.e., doesn’t incur a penalty), it does not provide much useful information to the agent in the long term. In other words, the hierarchical plugin forces DeepLine to explore multiple actions and get useful feedback, while the non-hierarchical representation delays useful exploration.
|Auto-Sklearn (top 50)||0.794||-||0.589|
Conclusions and Future Work
We presented DeepLine, a framework for the automatic generation of ML pipelines. We use a semi-constrained RL environment integrated with a novel hierarchical action representation. Our framework achieves state-of-the-art results at a much lower computational cost.
For future work, we plan to extend our framework to include automatic hyperparameter search and a less constrained state space.