Making Classical Machine Learning Pipelines Differentiable: A Neural Translation Approach

06/10/2019
by Gyeong-In Yu, et al.
Microsoft
Seoul National University

Classical Machine Learning (ML) pipelines often comprise multiple ML models, where the models within a pipeline are trained in isolation. Conversely, when training neural network models, the layers composing the neural models are trained simultaneously using backpropagation. We argue that the isolated training scheme of ML pipelines is sub-optimal, since it cannot jointly optimize multiple components. To this end, we propose a framework that translates a pre-trained ML pipeline into a neural network and fine-tunes the ML models within the pipeline jointly using backpropagation. Our experiments show that fine-tuning the translated pipelines is a promising technique that can increase final accuracy.


1 Introduction

Deep Neural Networks (DNNs) have been exceptionally successful in pushing the limits of various fields such as computer vision and natural language processing [imagenet; parity]. Nevertheless, classical Machine Learning (ML) techniques such as gradient boosting and linear models are still popular among practitioners [kaggle], especially because of their intrinsic efficacy and interpretability. When using these techniques, we often build a machine learning pipeline by composing multiple data transforms and ML models. This abstraction allows users to capture the data transformation process as a Directed Acyclic Graph (DAG) of operators.

Many of the top-performing ML pipelines in industry and in Kaggle competitions (e.g., [kaggle-criteo; komaki]) often include more than one trainable operator, i.e., ML models or data transforms that determine how to process input by learning from the training dataset. These trainable operators are usually trained sequentially, following the topological order specified by the DAG. In this paper, we claim that this sequential training of a pipeline's operators is sub-optimal, since the operators are trained in isolation and are never jointly optimized. This approach substantially differs from how DNNs are trained: DNN layers, which can also be seen as multiple cascaded operators, are typically trained simultaneously using backpropagation, by which parameters can be estimated globally, end-to-end, to reach better (local) minima. Arguably, this is one of the most fundamental features of deep learning.

Inspired by these observations, we propose an approach whereby (possibly) trained ML pipelines are translated into neural networks and fine-tuned therein. By doing so, we can use backpropagation over ML pipelines to bypass the greedy one-operator-at-a-time training model and eventually boost the accuracy of the entire ML pipeline. During the translation, we retain the information already acquired by training the original ML pipeline, providing a useful parameter initialization for the translated neural network and making further training of the network more accurate and faster.

Nevertheless, noticeable challenges arise when translating pipelines involving data transforms or models, such as word tokenizers or decision trees, that are intrinsically non-differentiable. We propose neural translations for selected non-differentiable operators, including decision trees and one-hot encoding, although only the decision-tree translation is studied in our experiments due to space constraints. We also suggest controlling which parts of the neural network (translated from a decision tree) are further trained, as a natural way of setting the trade-off between fit and bias.

We conduct experiments on two different datasets, each with two different pipelines. Both pipelines contain multiple trainable operators that were not jointly optimized. The experiments show that we can reach better accuracy by jointly fine-tuning these operators. Furthermore, we find that our neural translation provides informative knowledge transfer from pre-trained pipelines, along with an efficient network architecture that performs better than hand-designed networks of similar capacity.

2 Pipeline Translation

A machine learning pipeline is defined as a DAG of data-processing operators, and these operators fall mainly into two categories: (1) arithmetic operators and (2) algorithmic operators. Arithmetic operators are typically described by a single mathematical formula. These operators are, in turn, divided into the two sub-categories of parametric and non-parametric operators. Non-parametric operators define a fixed arithmetic operation on their inputs; for example, the Sigmoid function can be seen as a non-parametric arithmetic operator. In contrast, parametric operators involve numerical parameters on top of their inputs when calculating the operators' outputs. For example, an affine transform is a parametric arithmetic operator whose parameters consist of the affine weights and biases; the parameters of such operators can potentially be tuned via some training procedure. Algorithmic operators, on the other hand, are those whose operation is described not by a single mathematical formula but by an algorithm. For instance, the operator that converts categorical features into one-hot vectors is an algorithmic operator that mainly implements a look-up operation. Both kinds of arithmetic operators map naturally onto neural network modules, as the sketch below illustrates.
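A minimal sketch of the two kinds of arithmetic operators as PyTorch modules (the module names here are ours, chosen for illustration, not from the paper):

```python
import torch
import torch.nn as nn

class SigmoidOp(nn.Module):
    """Non-parametric arithmetic operator: a fixed formula, no parameters."""
    def forward(self, x):
        return torch.sigmoid(x)

class AffineOp(nn.Module):
    """Parametric arithmetic operator: the affine weights/biases are tunable."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # holds weights and biases

    def forward(self, x):
        return self.linear(x)
```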

Given a DAG of arithmetic and algorithmic operators, we propose the following general procedure for translating it into a single neural network:

  1. For an arithmetic operator, translate the mathematical formula into a neural network module (Sec 2.1). In the case of a parametric operator, copy the values of the operator's parameters into the resulting neural module.

  2. For an algorithmic operator, translate the operator by rewriting the algorithm as a differentiable module (Sec 2.2), or keep it as is (Sec 2.3).

  3. Compose all the resulting modules from Steps 1 and 2 into a single neural network by following the dependencies in the original pipeline.

The final output of the above translation process is a neural network that provides the same prediction results as the original pipeline on any input (unless the translation includes approximation). Note that Steps 1 and 2 are where the actual translation happens; they are described in detail next.
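For illustration, here is a minimal sketch of Step 3 in PyTorch, under the assumption that each pipeline operator has already been translated into an nn.Module; the names `topo_order` and `inputs_of` are ours, and `"__input__"` denotes the pipeline's input:

```python
import torch.nn as nn

class TranslatedPipeline(nn.Module):
    def __init__(self, modules_by_name, topo_order, inputs_of):
        super().__init__()
        self.mods = nn.ModuleDict(modules_by_name)  # operator name -> nn.Module
        self.topo_order = topo_order                # names in topological order
        self.inputs_of = inputs_of                  # name -> list of upstream names

    def forward(self, x):
        outputs = {"__input__": x}
        for name in self.topo_order:
            args = [outputs[src] for src in self.inputs_of[name]]
            outputs[name] = self.mods[name](*args)
        return outputs[self.topo_order[-1]]         # output of the final operator
```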

2.1 Translating Arithmetic Operators

It is straightforward to translate a non-parametric arithmetic operator into a neural network module: the mathematical function of the operator can be directly rewritten using the math API provided by a neural network framework. Parametric arithmetic operators, on the other hand, are often implicitly derived from ML models (some parametric operators, e.g., normalizers, are not derived from ML models; they can nonetheless be translated with the same mechanism), and these are not straightforward to translate. ML models typically consist of three key components: (1) the prediction function, (2) the loss function, and (3) the learning algorithm. While the prediction function defines the functional form of the model, the loss function and the learning algorithm define what objective the model is trained toward and how, respectively. Take the popular linear Support Vector Machine (SVM) model as an example: the prediction function is a linear function of a certain input dimensionality, the loss function is the Hinge loss, and the learning algorithm is gradient descent in the dual space.

A crucial observation is that once training is complete, the data-processing operation of any ML model is completely defined by its prediction function, regardless of the loss function and the learning algorithm. Hence, we can translate these parametric operators by applying the translation method for non-parametric operators to their prediction functions. For example, a linear SVM model can be translated into a linear layer with one output unit, with the weights transferred from the parameters of the trained model. It is worth noting that the translation of a trained ML pipeline into a neural network is uniquely determined by the prediction functions, independently of how the different parts of the pipeline have been trained. This is a powerful observation because it enables us to translate different operators of a pipeline using the same formalism, even though they might have been obtained via different learning algorithms or objectives.
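A minimal sketch of the linear-SVM translation follows; the random weight values here are placeholders standing in for parameters exported from an actual trained model:

```python
import numpy as np
import torch
import torch.nn as nn

# Placeholder parameters standing in for a trained linear SVM
# (e.g., exported from the original pipeline).
svm_weights = np.random.randn(39).astype(np.float32)  # one weight per feature
svm_bias = 0.25

layer = nn.Linear(in_features=39, out_features=1)     # translated operator
with torch.no_grad():
    layer.weight.copy_(torch.from_numpy(svm_weights).unsqueeze(0))
    layer.bias.fill_(svm_bias)
```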

2.2 Translating Algorithmic Operators: Tree Models

(a) An example decision tree.
(b) Expressing logical conjunction using arithmetic operations.
(c) A neural network translated from the decision tree of Figure 1(a).
Figure 1: Translating a decision tree into a neural network.

While most ML models produce differentiable arithmetic operators that can be directly translated, some do not. Among such models are the popular tree models, whose prediction functions (i.e., the decision trees) are not simple differentiable functions. Instead, each prediction is made by executing a sequence of if-else statements. In that respect, a tree prediction function is an algorithmic operator rather than an arithmetic one. During translation, we could treat it as such and simply translate a decision tree into nested if-else statements; the main problem with this approach, however, is that we would not be able to parametrize the tree prediction function and further fine-tune it. To do so, we need to rewrite the tree prediction function as an arithmetic operator instead, which is not trivial.

To tackle this challenge, first we show how the branching decision at each internal node of a tree can be written as a differentiable equation. In particular, we note that at a given internal node $n$ of a binary decision tree, the prediction algorithm evaluates the decision function $d_n(\mathbf{x}) = \mathbb{1}[x_{i_n} > \tau_n]$, where $\mathbf{x}$ is a vector representing the input of the tree, $i_n$ is the index of the feature examined at node $n$, $\tau_n$ is the decision threshold at node $n$, and $\mathbb{1}[\cdot]$ is the indicator function. If $d_n(\mathbf{x}) = 1$ then the algorithm traverses to the right child; otherwise, it follows the left child. Now, $d_n(\mathbf{x})$ can be approximated by $\sigma(\mathbf{e}_{i_n}^{\top}\mathbf{x} - \tau_n)$, where $\mathbf{e}_{i_n}$ is the canonical basis vector along the $i_n$-th dimension of the feature space and $\sigma$ is the Sigmoid function. This formulation provides us with a smooth approximation of the decision function at each internal node that converges to the true decision function as the Sigmoid gets sharper.
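A small numeric sketch of this approximation; the explicit `sharpness` factor is our rendering of "the Sigmoid getting sharper" and is not a named parameter in the paper:

```python
import torch

def soft_decision(x, feature_idx, threshold, sharpness=1.0):
    # sigma(s * (x[i_n] - tau_n)) -> 1[x_{i_n} > tau_n] as s grows
    return torch.sigmoid(sharpness * (x[..., feature_idx] - threshold))

x = torch.tensor([[0.3, 2.0]])
for s in (1.0, 10.0, 100.0):
    print(soft_decision(x, feature_idx=1, threshold=1.5, sharpness=s).item())
# 0.62..., 0.99..., ~1.0 -- converging to the true decision 1[2.0 > 1.5] = 1
```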

Next, we note that in a single decision tree, the value $v_\ell$ of a leaf node $\ell$ is outputted as the final value of the tree prediction function iff the path from the root node to that leaf node is traversed, which in turn requires the decision functions at the intermediate nodes to take a specific pattern. For example, in Figure 1(a), the tree returns the value of a given leaf iff each decision function along the root-to-leaf path evaluates to 1 where the path goes right and to 0 where it goes left. If the binary values of the decision functions are interpreted as logical true and false values, then the leaf node gets activated iff the logical conjunction of the corresponding literals evaluates to true. As such, we denote the leaf activation function as $a_\ell(\mathbf{x}) = z_1 \wedge z_2 \wedge \dots \wedge z_k$, where each literal $z_j$ is either $d_{n_j}(\mathbf{x})$ (right branch) or $1 - d_{n_j}(\mathbf{x})$ (left branch). To get a differentiable approximation of the logical conjunction, we can write $a_\ell(\mathbf{x}) \approx \sigma\big(\sum_{j=1}^{k} z_j - (k - 0.5)\big)$, where $k$ is the total number of literals in the conjunction (the path length from the root node to the target leaf). Figure 1(b) visualizes this approximation for 2 inputs. The hyperplane $\sum_{j=1}^{k} z_j = k - 0.5$ is a maximum-margin separator between the true and false evaluations of the conjunction.

Having translated the basic operations of a tree prediction function into differentiable functions as above, any decision tree can be translated into a Multi-Layer Perceptron (MLP) with two hidden layers. The first hidden layer implements one hidden unit ($d_n$) per internal (decision) node of the tree. The second hidden layer allocates one hidden unit ($a_\ell$) per leaf node. Finally, the output layer is defined as a linear layer with a single unit, $y = \sum_{\ell \in \mathcal{L}} v_\ell \, a_\ell(\mathbf{x})$, where $\mathcal{L}$ is the set of all leaf nodes and $v_\ell$ is the value of leaf node $\ell$. Note that, in the case of no approximation, one and only one of the leaf activation functions evaluates to $1$ for any given input $\mathbf{x}$, while the rest are $0$. Figure 1(c) shows an example of this translation procedure, which is similar to the approach proposed in previous works [Banerjee94initializingneural; Ivanova95initializationof]. Also note that in the case of random forests or gradient boosting trees, this technique is applied independently to each tree in the model, with a possible additional linear layer at the end to combine the outputs of the trees.
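To make the construction concrete, here is a minimal sketch of the full translation for a toy tree of our own devising (two internal nodes, three leaves; not the tree of Figure 1(a)). A shared sharpness factor controls the tightness of both Sigmoid approximations:

```python
import torch
import torch.nn as nn

# Toy tree. Internal nodes: (feature index i_n, threshold tau_n).
internal = [(0, 0.5), (1, -1.0)]
# Leaves: (value v_l, path as a list of (internal node index, went_right)).
leaves = [
    (0.1, [(0, False)]),              # x0 <= 0.5
    (0.7, [(0, True), (1, False)]),   # x0 > 0.5, x1 <= -1.0
    (0.9, [(0, True), (1, True)]),    # x0 > 0.5, x1 > -1.0
]
n_features, n_nodes, n_leaves = 2, len(internal), len(leaves)

layer1 = nn.Linear(n_features, n_nodes)   # decision functions d_n
layer2 = nn.Linear(n_nodes, n_leaves)     # leaf activations a_l
out = nn.Linear(n_leaves, 1, bias=False)  # weighted sum of leaf values

with torch.no_grad():
    layer1.weight.zero_()
    layer2.weight.zero_()
    for n, (i, tau) in enumerate(internal):
        layer1.weight[n, i] = 1.0          # canonical basis vector e_{i_n}
        layer1.bias[n] = -tau
    for l, (value, path) in enumerate(leaves):
        k = len(path)
        n_left = sum(not right for _, right in path)
        for n, right in path:
            layer2.weight[l, n] = 1.0 if right else -1.0  # z_j = d or 1 - d
        layer2.bias[l] = n_left - k + 0.5  # realizes sigma(sum z_j - (k - 0.5))
        out.weight[0, l] = value

def tree_mlp(x, sharpness=10.0):
    h = torch.sigmoid(sharpness * layer1(x))  # approximate decisions
    a = torch.sigmoid(sharpness * layer2(h))  # approximate conjunctions
    return out(a)

print(tree_mlp(torch.tensor([[1.0, 0.0]])))   # near 0.9 (right, then right)
```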

Once the translation of a tree is complete, the main question is which of the parameters of the resulting neural network should be declared as trainable. We experimented with four levels of parametrization:

  • Level 1: The leaf node values $v_\ell$, which constitute the weights of the output layer, are declared as trainable.

  • Level 2: In addition to the $v_\ell$'s, the decision threshold values $\tau_n$'s at the internal nodes are declared as trainable. These parameters constitute the bias values of the hidden units in the first hidden layer.

  • Level 3: In addition to Level 2's parameters, the canonical basis vectors $\mathbf{e}_{i_n}$ in the equation of $d_n$ are replaced by vectors of free parameters of the same size. These parameters constitute the weights between the input and the first hidden layer.

  • Level 4: In addition to Level 3's parameters, all the weights between the first and the second hidden layers (including the weights absent from the original tree structure) are declared as trainable.

As the level number increases, we declare more parameters as trainable and thus increase the capacity of the resulting neural network to fit the data better. While Levels 1 and 2 can only change the leaf values and the decision thresholds of the tree, Level 3 can additionally lead to examining a linear combination of features at each internal node rather than a single feature. Up to Level 3, the tree structure is preserved, whereas at Level 4 we let the entire decision structure of the tree change. That is, Level 4 gives us a fully-connected, fully-trainable MLP initialized by a (trained) tree. A minimal sketch of how these levels can be realized is shown below.
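This sketch reuses `layer1`, `layer2`, and `out` from the tree translation above; whether the second layer's biases should also be freed at Level 4 is our guess, since the text only mentions the weights:

```python
def set_level(level):
    # Freeze everything, then re-enable per level.
    for p in (out.weight, layer1.bias, layer1.weight, layer2.weight):
        p.requires_grad_(False)
    out.weight.requires_grad_(True)         # Level 1: leaf values v_l
    if level >= 2:
        layer1.bias.requires_grad_(True)    # Level 2: + thresholds tau_n
    if level >= 3:
        layer1.weight.requires_grad_(True)  # Level 3: + free feature weights
    if level >= 4:
        layer2.weight.requires_grad_(True)  # Level 4: + full connectivity
                                            #   between the two hidden layers

set_level(4)  # e.g., the fully trainable MLP
```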

2.3 Translating Algorithmic Operators: Beyond Tree Models and Limitations

In this section we briefly discuss the translation logic for other algorithmic operators, showing the translation of two widely-used ones: one-hot encoding and data binning.

One-hot encoding is widely used for generating one-hot vectors out of categorical inputs. It can be seen as an embedding-vector lookup with the embedding dimension matching the vocabulary size; therefore, we translate this operator into an embedding lookup module. The same holds for one-hot hash encoding, except that the embedding dimension is typically much smaller than the vocabulary size because of the hashing trick. We can declare the embedding matrices of the translated lookup modules as trainable.
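A sketch of this translation (the vocabulary size is made up). At initialization the lookup reproduces exact one-hot vectors; declaring the embedding matrix trainable lets fine-tuning move it away from 0/1 values:

```python
import torch
import torch.nn as nn

vocab_size = 1000
onehot = nn.Embedding(vocab_size, vocab_size)
with torch.no_grad():
    onehot.weight.copy_(torch.eye(vocab_size))  # row i = one-hot vector for id i

ids = torch.tensor([3, 42, 999])
vectors = onehot(ids)                           # shape (3, 1000), exact one-hot
# For one-hot *hash* encoding, the table would instead have one row per hash
# bucket (e.g., 2^10 rows), far fewer than the vocabulary size.
```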

Data binning is a form of quantization that reduces noise in the data: it replaces an input that belongs to a certain range (a.k.a. bin) with a representative value of that range. We approximate this by a smooth multi-step function. For example, $f(x) = v_1 + (v_2 - v_1)\,\sigma(x - b_1) + (v_3 - v_2)\,\sigma(x - b_2)$ is a smooth approximation of data binning using three bins, $(-\infty, b_1]$, $(b_1, b_2]$, and $(b_2, \infty)$, with representative values $v_1$, $v_2$, and $v_3$, respectively. We can declare the constants such as $v_i$ or $b_i$ of this equation as trainable, making the function behave similarly to the parametric activation function of [prelu].
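A sketch matching the three-bin formula above (the bin boundaries, representative values, and sharpness factor here are made-up illustrations):

```python
import torch

b1 = torch.tensor(0.0, requires_grad=True)   # bin boundaries, trainable
b2 = torch.tensor(1.0, requires_grad=True)
v1, v2, v3 = (torch.tensor(v, requires_grad=True) for v in (-1.0, 0.0, 2.0))

def soft_bin(x, sharpness=50.0):
    # f(x) = v1 + (v2 - v1) * sigma(x - b1) + (v3 - v2) * sigma(x - b2)
    return (v1
            + (v2 - v1) * torch.sigmoid(sharpness * (x - b1))
            + (v3 - v2) * torch.sigmoid(sharpness * (x - b2)))

print(soft_bin(torch.tensor([-2.0, 0.5, 3.0])))  # approx. v1, v2, v3
```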

Unfortunately, there are some algorithmic operators that we cannot yet translate into a differentiable form; word tokenization and missing-data imputation are two such examples. Since our translation approach currently does not handle these operators, we do not translate them and keep them as they are. Nevertheless, in all the cases we studied, these non-translatable operators are placed at the beginning of the pipeline and do not affect backpropagation through the rest of the translated network. Hence, we can still compute gradients and fine-tune the downstream operators, which are the more essential parts of the original ML pipeline.

2.4 Fine-Tuning

After translating the ML pipeline into a neural network, one can further fine-tune the trainable parameters of the resulting network via backpropagation. There are many scenarios in which this fine-tuning step can be useful. First, by fine-tuning the resulting network on the original training data, we can potentially improve the generalization of the model, since we are now jointly optimizing all the operators of the pipeline toward the final loss function. Second, as discussed above, the translation process does not depend on the loss functions that the different operators of the pipeline were originally trained toward. This means that once the translation is complete, the resulting network can be fine-tuned toward a completely different objective that is more suitable for a given application. Third, fine-tuning can be used to adapt the model to new data that were not available before, which is not straightforward without re-training the original ML pipeline on both the old and new data. It is worth noting that other methods for improving a trained model, such as boosting, may increase the model size and complexity, while our translation approach does not. Moreover, the ensemble model obtained by boosting can itself be seen as a pipeline containing multiple operators that were not jointly optimized, so it too can benefit from our translation approach (scenario 1 in Sec. 3).

3 Experiments

In this section, we empirically evaluate the performance of our translation approach. The main goal of the experiments is to show the following: (1) we can improve the performance of ML pipelines by employing backpropagation instead of training each operator individually; (2) the translation of trained ML pipelines provides an informative initialization of neural networks; and (3) the translation provides efficient neural architectures. We carry out our experiments on a binary classification task using two datasets. Each dataset is evaluated under two different scenarios to showcase the capabilities of our translator. We use ML.NET [mldotnet], a machine learning framework for .NET, to train and test classical ML pipelines. Given a pipeline implemented on ML.NET, we translate it into a neural network by composing neural operations provided by PyTorch [pytorch].

Datasets

The Criteo dataset [criteo] includes around 46M records, each with 39 features, for a total size of around 11GB. Among the 39 features, 13 are numeric while the remaining are categorical. Training, validation, and test sets, containing 44M, 1M, and 1M records respectively, are carved from the full dataset after a shuffling step. The FlightDelay dataset [flightDelay] includes around 21M records, for a total size of around 1GB. In this dataset, each record has 8 features, of which 2 are numeric and 6 categorical. For the experiments using FlightDelay, we use years 2006 and 2007 as the training set, while year 2008 is split in two and used as the validation and test sets.

Scenarios

We evaluate the performance improvements unlocked by our neural translation approach through two scenarios. In the first scenario, we use a simple pipeline employing a single Gradient Boosting Decision Tree (GBDT) model (i.e., LightGBM [lgbm]). In this scenario, we measure the performance of our translator by comparing the AUC obtained by the baseline (the original pipeline trained on ML.NET) against the different tree parametrization levels (L1~L4). Furthermore, for each level, we experiment with two initialization regimes for the parameters declared as trainable: (1) in the warm start regime, the parameter values are carried over from the trained ML.NET pipeline (denoted by "Warm" in the result tables); (2) in the cold start regime, the trainable parameters are randomly initialized (denoted by "Cold"), while the parameters not declared as trainable are transferred intact from the original pipeline. Since different levels of parametrization for tree translation introduce different trade-offs between good fit and strong inductive bias, we report results for different training sample sizes. Lastly, we also compare against a baseline MLP with 2 hidden layers, designed such that it approximately matches the number of trainable parameters in the network generated by the translator. The MLP uses ReLU as its activation function and employs dropout with zeroing probability 0.1 on each layer.

(a) Scenario 1.
(b) Scenario 2.
Figure 2: The pipelines used on the Criteo dataset. The gray boxes represent the operators that are jointly optimized during the fine-tuning process. Similar pipelines are used on the FlightDelay dataset.

For the second scenario, we use a pipeline composed of more than one ML model, assembled as follows: (1) apply Principal Component Analysis (PCA) to the input features $X$ to produce $X_{pca}$, which resides in the principal component space of $X$; (2) train a LightGBM model using $X_{pca}$ and the input labels $y$; (3) using the leaf activation function of each tree in the trained LightGBM model, create a one-hot vector that marks the index of the activated leaf with 1 and keeps the others 0; (4) concatenate the output of (3) with the original input $X$; (5) train the final linear classifier on the concatenated features from (4) and the labels $y$. We train the final linear model using Stochastic Dual Coordinate Ascent [sdca]. Again, we use ML.NET and the MLP with 2 hidden layers as baselines. Fig. 2 depicts the pipelines used for the two scenarios described above; a sketch of the translated scenario-2 network follows.
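A minimal sketch of how the translated scenario-2 network could be composed (the class and the `leaf_activations` method are our own names; we assume `pca` is a linear layer initialized with the learned principal components, each element of `trees` is a tree-to-MLP module from Sec 2.2 exposing its second hidden layer, and `final` is the translated linear classifier):

```python
import torch
import torch.nn as nn

class Scenario2Net(nn.Module):
    def __init__(self, pca, trees, final):
        super().__init__()
        self.pca = pca
        self.trees = nn.ModuleList(trees)
        self.final = final

    def forward(self, x):
        x_pca = self.pca(x)                                          # step (1)
        acts = [t.leaf_activations(x_pca) for t in self.trees]       # steps (2)-(3)
        feats = torch.cat([x] + acts, dim=1)                         # step (4)
        return self.final(feats)                                     # step (5)
```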

3.1 Criteo

In this set of experiments, we aim at predicting the click-through rate for online advertisements. We compose an ML.NET pipeline that (1) fills in the missing values in the numerical columns of the dataset; (2) encodes categorical columns into one-hot vectors using a hash function with 10 bits ("Hashing" in Fig. 2); (3) discards feature dimensions that do not have any record with a nonzero value ("CountSelectA" in Fig. 2); and (4) feeds the data into either LightGBM (Fig. 2(a)) or the multi-model pipeline (Fig. 2(b)). For both scenarios, we set the LightGBM model to create 30 leaves per tree, and we construct 100 and 30 trees in scenarios 1 and 2, respectively. The PCA transform used in scenario 2 follows "CountSelectB", which selects frequently occurring slots using a threshold of 150K [countselector]. We use ML.NET default settings for the other hyperparameters (ML.NET provides strong defaults that are known to work quite well in general).

Regarding the translated networks, we fine-tune them using the Adam [kingma2014adam] optimizer with a batch size of 4096. For scenario 1, we use a learning rate (lr) of 1e-5 and a weight decay (wd) of 1e-6, while for scenario 2 we use (lr, wd) = (1e-3, 1e-8). We select these hyperparameters by sweeping lr over [1e-6, 1e-2] and wd over [1e-9, 1e-5], using a fixed batch size of 4096, which utilizes a Titan Xp card well without overflowing its memory. We let the training process run until convergence.
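For concreteness, a minimal sketch of this fine-tuning setup (`net` is the translated network and `loader` is assumed to yield mini-batches of 4096 examples; binary cross-entropy on the logit is our assumption, consistent with the cross-entropy loss mentioned in Sec. 3.2):

```python
import torch

# Only the parameters declared trainable at the chosen level are optimized.
params = [p for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=1e-5, weight_decay=1e-6)  # scenario 1
loss_fn = torch.nn.BCEWithLogitsLoss()

for x, y in loader:                       # mini-batches of 4096
    optimizer.zero_grad()
    loss = loss_fn(net(x).squeeze(1), y)  # y: float labels in {0, 1}
    loss.backward()
    optimizer.step()
```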

Data    ML.NET   Level 1 (3.0K)     Level 2 (5.9K)     Level 3 (47.5M)    Level 4 (47.6M)    Level 4 + Dropout
                 Warm     Cold      Warm     Cold      Warm     Cold      Warm     Cold      Warm     Cold
1%      0.7704   0.7717   0.7687    0.7196   0.7007    0.7678   0.7698    0.7680   0.7697    0.7697   0.7697
10%     0.7748   0.7818   0.7816    0.7410   0.7285    0.7852   0.7845    0.7849   0.7847    0.7890   0.7861
30%     0.7756   0.7832   0.7831    0.7485   0.7401    0.7950   0.7913    0.7947   0.7920    0.7996   0.7942
50%     0.7752   0.7830   0.7830    0.7526   0.7455    0.7991   0.7973    0.7991   0.7980    0.8024   0.7985
70%     0.7753   0.7831   0.7831    0.7539   0.7476    0.8016   0.8003    0.8018   0.8006    0.8037   0.8005
100%    0.7756   0.7833   0.7833    0.7571   0.7514    0.8036   0.8031    0.8036   0.8029    0.8045   0.8023
Table 1: Test AUC for scenario 1 on Criteo: ML.NET versus our translated neural network at different parametrization levels, initialization regimes, and training sample sizes. The numbers in parentheses are the number of parameters declared as trainable at each parametrization level.

Scenario 1: Tree Evaluation

Table 1 reports the test AUC of ML.NET versus the fine-tuned neural network. From these results we can see that: first, the warm start outperforms the cold start (except in a few cases with 1% of the training data), which means that the weights transferred from a trained ML.NET pipeline provide an informative initialization for the neural network. Second, further fine-tuning of the neural network improves the AUC over the original ML.NET pipeline except at Level 2. At Level 2, we keep the original tree structure and the decision features intact while trying to further fine-tune the decision thresholds and the leaf values; these results clearly show that, under a fixed decision structure, the LightGBM training algorithm has already found the optimal decision thresholds, which cannot be further improved, so fine-tuning fails to recover the error induced by the smooth Sigmoid approximation. Third, with 1% of the training data, Level 1 outperforms all other levels, whereas for the other data percentages, Levels 3 and 4 seem to equally beat the other two levels. This trend clearly shows a slight overfitting of Levels 3 and 4 in small-sample scenarios, and how it is avoided by Level 1, which has far fewer trainable parameters. In other words, lower levels provide a natural regularization mechanism in small-sample scenarios. Fourth, we can take regularization one step further and add explicit regularization (e.g., Dropout) to Level 4. We apply dropout at the second hidden layer of the neural network described in Figure 1(c). In fact, doing so gives the best results, as shown in the last column of Table 1. This means that, in this case, the best strategy for further fine-tuning the model is to allow the most flexible fit (i.e., Level 4) while enforcing strong regularization to avoid overfitting.

Table 2 shows the test AUC of the MLP baseline, along with the best-performing translated network (Level 4 + dropout). We designed the MLP to have a similar number of parameters (37.7M) to the translated network (47.6M), to minimize the difference in capacity between the networks. The results show that the translation approach not only transfers meaningful information from the trained ML pipeline, but also provides a network architecture that achieves better results than the MLP even under the cold start regime.

Scenario   ML.NET   MLP      Translation
                             Warm     Cold
1          0.7756   0.7971   0.8045   0.8023
2          0.7644   0.7793   0.7904   0.7903
Table 2: Test AUC for ML.NET, MLP, and the translated network on Criteo.

Scenario   ML.NET   MLP      Translation
                             Warm     Cold
1          0.7447   0.7196   0.7875   0.7629
2          0.6990   0.7223   0.7284   0.7082
Table 3: Test AUC for ML.NET, MLP, and the translated network on FlightDelay.

Scenario 2: Multi-model Evaluation

The second row of Table 2 shows the results of the multi-model scenario. We use Level 4 with dropout for fine-tuning the tree part of the translated network, which was the best strategy in scenario 1. As in the previous set of experiments, fine-tuning the translated network improves the AUC over the baselines. The parameter sizes of the MLP and the translated network are 65.5K and 52.9K, respectively. Even though the difference between warm and cold starts is negligible here, our translation approach still provides an efficient network architecture that outperforms both baselines.

3.2 FlightDelay

Data    ML.NET   Level 1 (3.0K)     Level 2 (5.9K)     Level 3 (2.0M)     Level 4 (2.1M)     Level 4 + Dropout
                 Warm     Cold      Warm     Cold      Warm     Cold      Warm     Cold      Warm     Cold
1%      0.7335   0.7335   0.7276    0.7196   0.6044    0.7295   0.7195    0.7315   0.7198    0.7335   0.7198
10%     0.7429   0.7441   0.7415    0.7277   0.6141    0.7525   0.7302    0.7544   0.7285    0.7623   0.7306
30%     0.7421   0.7420   0.7412    0.7298   0.6148    0.7616   0.7383    0.7676   0.7487    0.7835   0.7323
50%     0.7425   0.7426   0.7416    0.7292   0.6151    0.7695   0.7579    0.7778   0.7686    0.7960   0.7563
70%     0.7420   0.7412   0.7390    0.7298   0.6158    0.7774   0.7667    0.7806   0.7762    0.7901   0.7734
100%    0.7447   0.7468   0.7464    0.7244   0.6122    0.7815   0.7728    0.7698   0.7525    0.7875   0.7629
Table 4: Test AUC for scenario 1 on FlightDelay: ML.NET versus our translated neural network at different parametrization levels, initialization regimes, and training sample sizes. The numbers in parentheses are the number of parameters declared as trainable at each parametrization level.

In this second set of experiments, our aim is to predict whether a scheduled flight will be delayed (by more than 15 minutes) or not based on historical records. We first convert all the categorical columns using one-hot encoding (instead of the one-hot hash encoding "Hashing" in Fig. 2) and omit the count selectors "CountSelectA" and "CountSelectB"; then, as for Criteo, we apply either LightGBM or the multi-model pipeline. The hyperparameters are the same as Criteo's, except that we use (lr, wd) = (1e-4, 1e-8) and (1e-4, 1e-6) for scenarios 1 and 2, respectively, obtained from the parameter sweep.

Scenario 1: Tree Evaluation

The results for scenario 1 are summarized in Tables 3 and 4. The MLP baseline (the first row of Table 3) has 1.76M trainable parameters. We see exactly the same trends observed for the Criteo dataset, so we refer the reader to Sec. 3.1. Beyond those observations, we find that when using Level 4 (and dropout), the result from 100% of the training data is worse than with 50% and 70% of the data, which is somewhat counterintuitive. By closely examining the cross-entropy loss on the validation set, we noticed that even though the AUC is worse with the full data, the cross-entropy loss (which is used for fine-tuning) is actually better than in the 50% and 70% cases. This shows that cross-entropy is not a perfect proxy for optimizing AUC. Compared to the Criteo dataset, the gap between warm and cold starts is larger for this dataset, especially when using dropout. This means that the knowledge transferred from the ML pipeline is more important here, and applying strong regularization without such knowledge (i.e., cold start) may hinder the training process.

Scenario 2: Multi-model Evaluation

For the second scenario, FlightDelay again follows the same trend as Criteo: the translated network (37.7K parameters) improves the AUC over the ML.NET and MLP (48.8K parameters) baselines. Unlike Criteo, however, the warm start is considerably better than the cold start, which emphasizes the importance of knowledge transfer. We also observe that the MLP baseline with fewer parameters (scenario 2) performs better than the MLP with more parameters (scenario 1). This shows that neural networks with larger capacity do not always lead to better results, due to the difficulty of training, while our translation approach can bypass this hurdle through informative initialization.

4 Related Work

To the best of our knowledge, this is the first attempt to jointly optimize multiple operators in an ML pipeline by translating it into a neural network. In a previous work [finetune2018sysml], we sketched the translator's design and software components, as well as some early results on the fine-tuning process.

Nevertheless, there have been early works that initialize an MLP from a single operator, namely a decision tree [Banerjee94initializingneural; Ivanova95initializationof]. We took inspiration from these works for our tree translator. Specifically, as in these works, we translate trees into a network with 2 hidden layers, whereby the first layer corresponds to non-leaf nodes and the second layer corresponds to leaves. Biau et al. [Biau2018] follow a similar approach and extend the technique to a forest of trees. However, all of these works [Banerjee94initializingneural; Ivanova95initializationof; Biau2018] conducted evaluations on relatively small datasets: the largest dataset used there (45K records [Biau2018]) is 1000 times smaller than the Criteo dataset (46M records), so they lack empirical evidence that the technique also works in large-scale learning. Finally, Humbird et al. [8478232] use a decision tree to initialize an MLP, where the depth of the decision tree decides the number of layers. Their weights are randomly initialized, and the information in the tree is retained only for sparsely connecting the neurons; we instead use the parameters of the pre-trained trees to initialize the network weights.

5 Conclusions

Inspired by the existing gap between classical ML pipelines and neural networks, in this paper we propose a framework for translating ML pipelines into neural networks and further jointly fine-tuning them. As part of our translation procedure, we also propose techniques for translating popular non-differentiable operators, including tree models. The experimental results show that translation followed by fine-tuning leads to significant accuracy improvements over the original pipelines and over hand-designed neural networks. Furthermore, our translation mechanism can be seen as an approach for designing neural network architectures for a given task, inspired by the classical ML pipeline designed for that task. We deem this work a first step towards filling the gap between classical ML pipelines and neural networks.

References