1 Introduction
Enterprises increasingly look to Machine Learning (ML) to help solve business challenges that escape imperative programming and analytical querying [flock]—examples include predictive maintenance, customer churn prediction, and supply-chain optimization [firmai]. To do so, they typically turn to technologies now broadly termed “traditional ML”, a term coined to contrast them with Deep Neural Networks (DNNs). A recent analysis by Amazon Web Services found that the large majority of ML applications in an organization are based on traditional ML [sagemakertco]. An analysis of notebooks in public GitHub repositories [dsonds] paints a similar picture: NumPy [numpy], Matplotlib [matplot], Pandas [pandas], and scikit-learn [scikit] are the four most used libraries—all four provide functions for traditional ML. As a point of comparison with DNN frameworks, scikit-learn is used several times more than PyTorch [pytorch] and TensorFlow [tensorflow] combined, and is growing faster than both. Acknowledging this trend, traditional ML capabilities have recently been added to DNN frameworks, such as the ONNX-ML [onnxml] flavor in ONNX [onnx] and TensorFlow's TFX [tfx].

When it comes to owning and operating ML solutions, enterprises differ from early adopters in their focus on long-term cost of ownership and amortized return on investment [enterpriseapplifespan]. As such, enterprises are highly sensitive to: (1) complexity, (2) performance, and (3) overall operational efficiency of their software infrastructure [aicost]. In this work we focus on model scoring (i.e., the process of obtaining a prediction from a trained model presented with new data), as it is a key driving factor in each of these regards. First, each model is trained once but used many times for scoring in a variety of environments; thus, scoring dominates infrastructure complexity for deployment, maintainability, and monitoring. Second, model scoring is often on the critical path of interactive and analytical enterprise applications, hence its performance (in terms of latency and throughput) is an important concern for enterprises. Finally,
model scoring is responsible for 45-65% of the total cost of ownership of data science solutions
[sagemakertco].

Predictive Pipelines. The output of the iterative process of designing and training traditional ML models is not just a model but a predictive pipeline: a Directed Acyclic Graph (DAG) of operators. Such pipelines are typically composed of up to tens of operators, out of a set of hundreds [dsonds], that fall into two main categories: (1) featurizers, which can be either stateless imperative code (e.g., string tokenization) or data transformations fit to the data (e.g., normalization); and (2) models, commonly decision tree ensembles or (generalized) linear models, fit to the data. Note that the whole pipeline is required to perform a prediction.
A Missing Abstraction. Today's featurizers and model implementations are not expressed in a shared logical abstraction, but rather in an ad-hoc fashion using programming languages such as R, Python, Java, C++, or C#. This hints at the core problem with today's approaches to model scoring: the combinatorial explosion of supporting many operators (and frameworks) across multiple target environments. Figure 1 (top) highlights this visually by showing how existing solutions lead to an O(N × M) explosion to support N operators from various ML frameworks against M deployment environments (e.g., how to run a scikit-learn model on an embedded device?). Furthermore, [dsonds] shows that the number of libraries used in data science (a metric correlated to N) has increased substantially over the last 2 years. Our expectation is that M is also destined to grow as ML is applied more widely across a broad range of enterprise applications and hardware (e.g., [graphcore, celebras, fpga, tpu, sambanova]). From the vantage point of implementing runtimes for model scoring, this is a daunting proposition. We argue that any brute-force approach directly tackling all combinations would dilute engineering focus, leading to costly and less optimized solutions. In fact, today, with very few exceptions (e.g., NVIDIA RAPIDS [rapids] for GPU), traditional ML operators are only implemented for CPUs.
This state of affairs is in contrast with the DNN space, where neural networks are authored using tensor transformations (e.g., multiplications, convolutions), providing an algebraic abstraction over computations. Using such abstractions rather than imperative code not only enables advanced optimizations [xla, tvm] but also facilitates support for diverse environments (such as mobile devices [mobile], web browsers [tensorflowjs], and hardware accelerators [graphcore, tpu, fpga]), unlocking new levels of performance and portability.
Our Solution. To bypass this explosion in implementing traditional ML operators, we built Hummingbird (HB). HB leverages compilation and optimization techniques to translate a broad set of traditional ML operators into a small set of core operators, thereby reducing the cost from O(N × M) to O(N) + O(M), as shown in Figure 1 (bottom). This is also the key intuition behind the ONNX model format [onnx] and its various runtimes [onnxsupported]. However, with HB we take one further bold step: we demonstrate that this set of core operators can be reduced to tensor computations and therefore be executed over DNN frameworks. This allows us to piggyback on existing investments in DNN compilers, runtimes, and specialized hardware, and reduce the challenge of “running operators across environments” for traditional ML to just O(N) operator translations. This leads to improved performance and portability, and reduced infrastructure complexity.
Contributions. In this paper we answer three main questions:

Can traditional ML operators (both linear algebrabased such as linear models, and algorithmic ones such as decision trees) be translated to tensor computations?

Can the resulting computations in tensor space be competitive with the imperative alternatives we get as input (e.g., traversing a tree)?

Can HB help in reducing software complexity and improving model portability?
Concretely, we: (1) port thousands of benchmark predictive pipelines to two DNN backends (PyTorch and TVM); (2) show that we can seamlessly leverage hardware accelerators and deliver substantial speedups against hand-crafted GPU kernels, and against state-of-the-art frameworks for end-to-end predictive pipelines; and (3) qualitatively confirm improvements in software complexity and portability by enabling scikit-learn pipelines to run across CPUs and GPUs.
HB is open source under the MIT license (https://github.com/microsoft/hummingbird), and is part of the PyTorch ecosystem [pytorchecosystem]. We are integrating HB with other systems, such as the ONNX converters [blog].
Organization. The remainder of this paper is organized as follows: Section 2 contains background; Section 3 contains an overview of the HB system. Sections 4 and 5 describe the operator implementations and optimizations. Experiments are in Section 6. The paper ends with related work and conclusions in Sections 7 and 8, respectively.
2 Background and Challenges
We first provide background on traditional ML and DNNs. We then explain the challenges of compiling traditional ML operators and predictive pipelines into tensor computations.
2.1 Traditional ML and DNNs
Traditional Predictive Pipelines.
The result of the data science workflow over traditional ML is a predictive pipeline, i.e., a DAG of operators such as trained models, preprocessors, featurizers, and missing-value imputers. The process of presenting a trained predictive pipeline with new data to obtain a prediction is referred to in the literature interchangeably as model scoring/inference/serving, pipeline evaluation, or prediction serving. We favor model scoring in our writing.
Packaging a trained pipeline into a single artifact is common practice [mldotnet]. These artifacts are then embedded inside host applications, or containerized and deployed in the cloud, to perform model scoring [DBLP:journals/sigmod/PolyzotisRWZ18, clipper2]. ML.NET [mldotnet] (.NET-based), scikit-learn [scikit] (Python-based), and H2O [h2o] (Java-based) are popular toolkits to generate pipelines. However, they are primarily optimized for training. Scoring predictive pipelines is challenging, as their operators are implemented in imperative code and do not follow a shared abstraction. Supporting every operator in all target environments requires a huge effort, which is why these frameworks have limited portability.
DNNs.
Deep Neural Networks (DNNs) are a family of ML models based on artificial neurons [dlbook]. They take raw features as input and perform a series of transformation operations. Unlike traditional ML, transformations in DNNs are drawn from a common abstraction based on tensor operators (e.g., generic matrix multiplication, element-wise operations). In recent years, DNNs have been extremely successful in vision and natural language processing tasks [alexnet, bert]. Common frameworks used to author and train DNNs are TensorFlow [tensorflow], PyTorch [pytorch], CNTK [cntk], and MXNet [mxnet]. While these frameworks can also be used to perform model scoring, next we discuss systems specifically designed for that.

Runtimes for DNN Model Scoring. To cater to the demand for DNN model inference, a new class of systems has emerged. ONNX Runtime (ORT) [onnxruntime] and TVM [tvm] are popular examples of such systems. These systems capitalize on the relative simplicity of neural networks: they accept a DAG of tensor operations as input, which they execute by implementing a small set of highly optimized operator kernels on multiple hardware backends. Focusing on just the prediction serving scenario also enables these systems to perform additional inference-specific optimizations, which are not applicable during training. HB is currently compatible with all such systems.
2.2 Challenges
HB combines the strength of traditional ML pipelines on structured data [tradMLtraining] with the computational and operational simplicity of DNN runtimes for model scoring. To do so, it relies on a simple yet key observation: once a model is trained, it can be represented as a prediction function transforming input features into a prediction score (e.g., 0 or 1 for binary classification), regardless of the training algorithm used. The same observation naturally applies to featurizers fit to the data. Therefore, HB only needs to compile the prediction functions (not the training logic) for each operator in a pipeline into tensor computations and stitch them appropriately. Towards this goal, we identify two challenges.


Challenge 1: How can we map traditional predictive pipelines into tensor computations? Pipelines are generally composed of operators (with predictive functions) of two classes: algebraic (e.g., scalers or linear models) and algorithmic (e.g., one-hot encoders and tree-based models). While translating algebraic operators into tensor computations is straightforward, the key challenge for HB is the translation of algorithmic operators, which perform arbitrary data accesses and control flow decisions. For example, in a decision tree ensemble, each tree is potentially different from the others, not only in structure, but also in its decision variables and threshold values. Conversely, tensor operators perform bulk operations over the entire set of input elements.
Challenge 2: How can we achieve efficient execution for tensorcompiled traditional ML operators? The ability to compile predictive pipelines into DAGs of tensor operations does not imply adequate performance of the resulting DAGs. In fact, common wisdom would suggest the opposite: even though tensor runtimes naturally support execution on hardware accelerators, treebased methods and commonly used data transformations are well known to be difficult to accelerate [inferline], even using customdeveloped implementations.
3 System Overview
In this section we explain our approach to overcome the challenges outlined in Section 2.2, and present HB’s architecture and implementation details. We conclude this section by explaining assumptions and limitations.
3.1 High-level Approach
In HB, we cast algorithmic operators into tensor computations. Note that this transformation introduces redundancies, both in terms of computation (we perform more computations than the original traditional ML operators) and storage (we create data structures that store more than what we actually need). Although these redundancies might sound counterintuitive at first, they let us transform the arbitrary data accesses and control flow of the original operators into bulk tensor operations that execute efficiently on state-of-the-art DNN runtimes.
For a given traditional ML operator, there exist different strategies for compiling it to tensor computations, each introducing a different degree of redundancy. We discuss such strategies for representative operators in Section 4. The optimal tensor implementation varies and is informed by model characteristics (e.g., tree structure for tree-based models, or sparsity for linear models) and runtime statistics (e.g., batch size of the inputs). Heuristics at the operator level, runtime-independent optimizations at the pipeline level, and runtime-specific optimizations at the execution level enable HB to further improve the end-to-end performance of predictive pipelines. The dichotomy between runtime-independent and runtime-specific optimizations allows us to both (1) apply optimizations unique to traditional ML and not captured by the DNN runtimes; and (2) exploit DNN runtime optimizations once the traditional ML logic is lowered into tensor computations. Finally, HB is able to run end-to-end pipelines on the hardware platforms supported by the target DNN runtimes.
3.2 System Architecture and Implementation
The highlevel architecture of HB is shown in Figure 2. HB has three main components: (1) Pipeline Parser, (2) Optimizer, and (3) Tensor DAG Compiler.
Pipeline Parser. In this phase, input pipelines are parsed one operator at a time, and each operator is wrapped into a container object. Each operator's container maintains (1) the inputs and outputs of the operator, and (2) the operator signature, which codifies the operator type (e.g., “scikit-learn decision tree”). The HB parser also introduces a set of extractor functions that are used to extract the parameters of each operator (e.g., the weights of a linear regression, the thresholds of a decision tree). Operator signatures dictate which extractor function should be used for each operator. At startup time, extractor functions are registered into a hash table mapping operator signatures to the related extractor function.
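The signature-to-extractor dispatch described above can be sketched as follows. This is a minimal illustration, not HB's actual code: all names (`EXTRACTORS`, `Container`, the signature string, and the extractor's return format) are hypothetical; the real parser builds on skl2onnx containers.

```python
# Hypothetical sketch of the parser's extractor-function registry.
EXTRACTORS = {}  # operator signature -> extractor function

def register(signature):
    """Register an extractor function under an operator signature."""
    def wrap(fn):
        EXTRACTORS[signature] = fn
        return fn
    return wrap

@register("sklearn.LogisticRegression")
def extract_logreg(op):
    # Pull out only the parameters needed to compile the prediction function.
    return {"weights": op.coef_, "intercept": op.intercept_}

class Container:
    """Wraps one pipeline operator with its signature and extracted parameters."""
    def __init__(self, operator, signature):
        self.operator, self.signature = operator, signature
        self.params = EXTRACTORS[signature](operator)  # dispatch via the registry
```

New operator support then amounts to registering one more extractor function, which is what makes the parser extensible.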
The HB parser is extensible, allowing users to easily add new extractor functions. HB currently supports over 40 scikit-learn operators (listed in Table 1), as well as parsers for XGBoost [xgboost], LightGBM [lgbm], and ONNX-ML [onnxml]. At the end of the parsing phase, the input pipeline is “logically” represented in HB as a DAG of containers storing all the information required for the successive phases. The HB parser is based on skl2onnx [skl2onnx].

Optimizer. In this phase, the DAG of containers generated in the parsing phase is traversed in topological order in two passes. During the first pass, the Optimizer extracts the parameters of each operator via the referenced extractor function and stores them in the container. Furthermore, since HB supports different operator implementations based on the extracted parameters, the Optimizer annotates the container with the compilation strategy to be used for that specific operator (Section 5.1). During the second pass, HB tries to apply runtime-independent optimizations (Section 5.2) over the DAG.
Tensor DAG Compiler. In this last phase, the DAG of containers is again traversed in topological order, and a conversion-to-tensors function is triggered based on each operator's signature. Each conversion function receives as input the extracted parameters and generates a PyTorch neural network module composed of a small set of tensor operators (listed in Table 2). The generated module is then exported into the target runtime format. The current version of HB supports PyTorch/TorchScript, ONNX, and TVM output formats. Runtime-specific optimizations are triggered at this level.
Supported Tensor Operators 
matmul, add, mul, div, lt, le, eq, gt, ge, bitwise_xor, gather, index_select, cat, reshape, cast, abs, pow, exp, argmax, max, sum, relu, tanh, sigmoid, logsumexp, isnan, where 
Supported ML Models 
LogisticRegression, SVC, NuSVC, LinearSVC, SGDClassifier, LogisticRegressionCV, DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor, ExtraTreesClassifier/Regressor, GradientBoostingClassifier/Regressor, HistGradientBoostingClassifier/Regressor, IsolationForest, MLPClassifier, BernoulliNB, GaussianNB, MultinomialNB 
Supported Featurizers 
SelectKBest, VarianceThreshold, SelectPercentile, PCA, KernelPCA, TruncatedSVD, FastICA, SimpleImputer, Imputer, MissingIndicator, RobustScaler, MaxAbsScaler, MinMaxScaler, StandardScaler, Binarizer, KBinsDiscretizer, Normalizer, PolynomialFeatures, OneHotEncoder, LabelEncoder, FeatureHasher 
3.3 Assumptions and Limitations
In this paper, we make a few simplifying assumptions. First, we assume that predictive pipelines are “pure”, i.e., they do not contain arbitrary user-defined operators. There has been recent work [froid] on compiling imperative UDFs (user-defined functions) into relational algebra, and we plan to make use of such techniques in HB in the future. Second, we do not support sparse data well. We found that current support for sparse computations on DNN runtimes is primitive and not well optimized. We expect advances in DNN frameworks to improve on this aspect—TACO [taco] is a notable example. Third, although we support string operators, we currently do not support text feature extraction (e.g., TfidfVectorizer). The problem in this case is twofold: (1) compiling regex-based tokenizers into tensor computations is not trivial, and (2) representing arbitrarily long text documents as tensors is still an open challenge. Finally, HB is currently limited to single-GPU memory execution. Given that several DNN runtimes nowadays support distributed processing [horovod, distributedpytorch], we plan to investigate distributed inference as future work.

4 Compilation
HB supports compiling several algorithmic operators into tensor computations. Given their popularity [dsonds], in Section 4.1 we explain our approach for tree-based models. Section 4.2 summarizes other techniques that we use for both algorithmic and algebraic operators.
4.1 Compiling Tree-based Models
HB has three different strategies for compiling tree-based models. The strategies differ in the degree of redundancy they introduce. Table 3 explains the notation used in this section. We summarize the worst-case runtime and memory footprint of each strategy in Table 4. HB currently supports only trees built over numerical values: support for missing and categorical values is under development. For the sake of presentation, we assume all decision nodes perform < comparisons.
Symbol  Description 
N, I, L, F, C  Ordered lists with all nodes, internal nodes, leaf nodes, features, and classes, respectively. 
X ∈ R^(n×|F|)  Input records (n is the number of records). 
Table 4 reports the worst-case memory footprint and runtime of each strategy (GEMM, TreeTraversal, PerfectTreeTraversal).
Strategy 1: GEMM. We cast the evaluation of a tree as a series of three GEneric Matrix Multiplication (GEMM) operations interleaved with two element-wise logical operations. Given a tree, we create five tensors which collectively capture the tree structure: A, B, C, D, and E. A captures the relationship between input features and internal nodes. B is set to the threshold value of each internal node. For any leaf node and internal node pair, C captures whether the internal node is a parent of that leaf node, and if so, whether the leaf is in its left or right subtree. D captures the count of the internal nodes in the path from a leaf node to the tree root for which the internal node is the left child of its parent. Finally, E captures the mapping between leaf nodes and class labels. Given these tensors, Algorithm 1 presents how we perform tree scoring for a batch of input records X. A graphical representation of an execution of the GEMM strategy is depicted in Figure 3.
The first GEMM matches each input feature with the internal node(s) using it. The subsequent logical operation evaluates all the internal decision nodes and produces a tensor of 0s and 1s based on the false/true outcome of each condition. The second GEMM operation generates an encoding of the path composed of the true internal nodes, while the successive logical operation returns the leaf node selected by the encoded path. Finally, the third GEMM operation maps the selected leaf node to the class label.
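To make the strategy concrete, here is a NumPy sketch of the three-GEMM evaluation for a small hand-built tree. The tree and all tensor values are illustrative (HB generates them from the trained model): node n0 tests f0 < 0.5, its left child n1 tests f1 < 0.3, and the three leaves map to classes 0, 1, 1.

```python
import numpy as np

# Five tensors encoding the toy tree for the GEMM strategy.
A = np.array([[1., 0.],          # feature -> internal node map: n0 uses f0,
              [0., 1.]])         # n1 uses f1
B = np.array([0.5, 0.3])         # per-node thresholds
C = np.array([[ 1.,  1., -1.],   # n0: leaves l0,l1 in left subtree, l2 in right
              [ 1., -1.,  0.]])  # n1: l0 left, l1 right, l2 not a descendant
D = np.array([2., 1., 0.])       # per-leaf count of "left" ancestors
E = np.array([[1., 0.],          # leaf -> class map: l0 -> class 0,
              [0., 1.],          # l1 -> class 1,
              [0., 1.]])         # l2 -> class 1

def gemm_tree_score(X):
    T = X @ A                         # GEMM 1: route feature values to nodes
    T = (T < B).astype(X.dtype)       # evaluate all decision nodes at once
    T = T @ C                         # GEMM 2: encode the path of true decisions
    T = (T == D).astype(X.dtype)      # select the leaf whose path matches
    return T @ E                      # GEMM 3: map the selected leaf to classes

X = np.array([[0.2, 0.9],   # left at n0, right at n1 -> l1, class 1
              [0.9, 0.1],   # right at n0            -> l2, class 1
              [0.2, 0.1]])  # left, left             -> l0, class 0
pred = gemm_tree_score(X).argmax(axis=1)  # -> array([1, 1, 0])
```

A leaf matches D exactly only when every ancestor decision agrees with its path, so the equality test selects exactly one leaf per record.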
This strategy can be easily applied to support tree ensembles and regression tasks too. For tree ensembles, we create the above 2-dimensional tensors for each tree and batch them together. As the number of leaf nodes and internal nodes can vary among trees, we pick the maximum number of leaf nodes and internal nodes of any tree as the tensor dimensions and pad the smaller tensor slices with zeros. During scoring, we invoke the batched variants of GEMM and logical operations and perform a final ReduceMean operation over the batched dimension to generate the ensemble output. For regression tasks, we initialize E with the label values.
Strategy 2: TreeTraversal (TT). In the GEMM strategy, we incorporated a high degree of computational redundancy by evaluating all internal nodes and leaf nodes. Here, we reduce the computational redundancy by mimicking the typical tree traversal, but implemented using tensor operations. In this strategy, the tree structure is captured by five tensors: N_L, N_R, N_F, N_T, and N_C. We formally define these tensors in Table 5. The same column index (last dimension) across all tensors corresponds to the same tree node. N_L and N_R capture the indices of the left and right child of a given node; if the node is a leaf node, we set these to the index of the node itself. Similarly, N_F and N_T capture the feature index and threshold value of each node, respectively. For leaf nodes, we set N_F to 1 and N_T to 0. Finally, N_C captures the class label of each leaf node. For internal nodes this can be any value; we set it to 0.
Given these tensors, Algorithm 2 presents how we perform scoring for a batch of input records X. We use Gather and Where operations, which perform index-based slicing and conditional value selection, respectively. We first initialize an index tensor corresponding to all records in X, pointing to the root node. Using it, we Gather the corresponding feature indices and use them to Gather the corresponding feature values from X. Similarly, we also Gather the left node indices, right node indices, and node thresholds. Using these gathered tensors, we then invoke a Where operation which checks the tree node decisions. Based on the evaluation, for each record the Where operator returns either the left child index or the right child index. To perform full tree scoring, the above steps have to be repeated until we reach a leaf node for all records in X. We exploit the fact that (1) TREE_DEPTH is a known property of the input model at compilation time, and (2) all leaf nodes are at a depth ≤ TREE_DEPTH, to iterate for that fixed number of iterations and ensure that all records have found their corresponding leaf node. Tensors are created in such a way that if one of the indices reaches a leaf node before running for TREE_DEPTH iterations, the same class label keeps getting selected. At compile time, we unroll all iterations and remove the for loop to improve efficiency. For ensembles, we create tensors for each tree and batch them together. However, the number of nodes may differ between trees, so we use the maximum node count of any tree as the dimension and pad the remaining elements.
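The traversal loop can be sketched in NumPy for a toy tree (all node arrays below are illustrative; HB derives them from the trained model and unrolls the loop at compile time). Note how leaves point to themselves, so extra iterations keep re-selecting the same node.

```python
import numpy as np

# Toy tree, nodes indexed [n0, n1, l0, l1, l2] = [0..4]:
# n0 tests f0 < 0.5 (left -> n1, right -> l2); n1 tests f1 < 0.3 (left -> l0, right -> l1).
N_L = np.array([1, 2, 2, 3, 4])             # left-child index per node (leaves: self)
N_R = np.array([4, 3, 2, 3, 4])             # right-child index per node (leaves: self)
N_F = np.array([0, 1, 0, 0, 0])             # feature id per node (arbitrary for leaves)
N_T = np.array([0.5, 0.3, 0.0, 0.0, 0.0])   # threshold per node (0 for leaves)
N_C = np.array([0, 0, 0, 1, 1])             # class label per node (meaningful for leaves)
TREE_DEPTH = 2

def tt_tree_score(X):
    n = X.shape[0]
    idx = np.zeros(n, dtype=np.int64)                # every record starts at the root
    for _ in range(TREE_DEPTH):                      # unrolled at compile time in HB
        feat = N_F[idx]                              # Gather feature ids
        val = X[np.arange(n), feat]                  # Gather feature values
        go_left = val < N_T[idx]                     # evaluate node conditions
        idx = np.where(go_left, N_L[idx], N_R[idx])  # Where: pick the child index
    return N_C[idx]

X = np.array([[0.2, 0.9], [0.9, 0.1], [0.2, 0.1]])
preds = tt_tree_score(X)  # -> array([1, 1, 0])
```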
Strategy 3: PerfectTreeTraversal (PTT). Similar to the previous one, this strategy also mimics the tree traversal. However, here we assume the tree is a perfect binary tree, in which all internal nodes have exactly two children and all leaf nodes are at the same depth. Assume we are given a non-perfect binary tree with a TREE_DEPTH of D, and let l be a leaf node at depth d < D. To push l to depth D, we replace l with a perfect subtree of depth D − d and map all the leaf nodes of the subtree to l's label. The decision nodes in the introduced subtree are free to perform arbitrary comparisons, as the outcome is the same along any path. By pushing all leaf nodes at depth < D to a depth of D, we transform the original tree into a perfect tree with the same functionality.
Symbol  Description 
I', L'  Internal and leaf nodes of the perfect tree, ordered by level. 
N'_F, N'_T, N'_C  Feature ids and threshold values of the internal nodes, and class labels of the leaf nodes, of the perfect tree. 
Working on perfect trees enables us to get rid of the N_L and N_R tensors, as we can now calculate child indices analytically, which also reduces memory lookup overheads during scoring. Thus we create only three tensors to capture the tree structure: N'_F, N'_T, and N'_C (Table 6). They capture the same information as N_F, N_T, and N_C, but have different dimensions and a strict condition on the node order. Both N'_F and N'_T have 2^D − 1 elements, and the values correspond to the internal nodes generated by a level-order tree traversal. N'_C has 2^D elements, each corresponding to an actual leaf node in left-to-right order.
Given these tensors, in Algorithm 3 we present how PTT works. From a high-level point of view, it is very similar to the TT strategy, with only a few changes. First, the index tensor is initialized to all ones, as the root node is always the first node. Second, we get rid of finding the left and right child indices of a node and using them in the Where operation. Instead, for a node with index i, the Where operation returns 2i for the true case and 2i + 1 for the false case, which directly yields the index of the child for the next iteration. For ensembles, we use the maximum TREE_DEPTH of any tree as D for transforming trees to perfect trees. We create tensors separately for each tree and batch them together for N'_C. But for N'_F and N'_T, instead of batching, we interleave them together in such an order that the values corresponding to level i of all trees appear before the values corresponding to level i + 1 of any tree.
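The analytic child-index computation can be sketched in NumPy for a toy tree padded to a perfect tree of depth D = 2 (values are illustrative): the root tests f0 < 0.5, its left child tests f1 < 0.3, and its right child is a dummy node introduced by the padding, whose two leaves both copy the original leaf's label.

```python
import numpy as np

# Perfect toy tree of depth D = 2, internal nodes in level order (1-based heap indexing).
D = 2
NF = np.array([0, 1, 0])          # feature id per internal node (level order)
NT = np.array([0.5, 0.3, 0.0])    # threshold per internal node (dummy node: arbitrary)
NC = np.array([0, 1, 1, 1])       # class label per leaf, left to right

def ptt_tree_score(X):
    n = X.shape[0]
    idx = np.ones(n, dtype=np.int64)    # all records start at the root (index 1)
    for _ in range(D):
        feat = NF[idx - 1]
        val = X[np.arange(n), feat]
        # heap indexing: children of node i are 2i (true/left) and 2i+1 (false/right)
        idx = 2 * idx + np.where(val < NT[idx - 1], 0, 1)
    return NC[idx - 2**D]               # leaves occupy indices 2^D .. 2^(D+1)-1

X = np.array([[0.2, 0.9], [0.9, 0.1], [0.2, 0.1]])
ptt_preds = ptt_tree_score(X)  # -> array([1, 1, 0])
```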
4.2 Summary of Other Techniques
Next, we discuss the other techniques used across ML operators to efficiently compile them into tensor computations.
Exploiting Automatic Broadcasting. Broadcasting [broadcasting] is the process of making two tensors shape compatible for element-wise operations. Two tensors are said to be shape compatible if each dimension pair is the same, or one of them is 1. At execution time, tensor operations implicitly repeat the size-1 dimensions to match the size of the other tensor, without allocating memory. In HB, we heavily use this feature to execute some computation over multiple inputs. For example, consider performing a one-hot encoding operation over a column X with a vocabulary V. To implement this using tensor computations, we Reshape X to [n, 1] and V to [1, |V|] and calculate R = Equal(X, V), with R ∈ [n, |V|]. The Reshape operations are for free because they only modify the metadata of the tensor. However, this approach performs redundant comparisons, as it checks the feature values from all records against all vocabulary values.
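This broadcast-based one-hot encoding can be sketched in NumPy (the column is integer-coded for brevity; the values are illustrative):

```python
import numpy as np

X = np.array([2, 0, 3, 0])   # a categorical feature column, n = 4
V = np.array([0, 1, 2, 3])   # vocabulary, |V| = 4

# Reshape X to [n, 1] and V to [1, |V|]; the equality broadcasts to [n, |V|]
# without materializing repeated copies of either input.
R = (X.reshape(-1, 1) == V.reshape(1, -1)).astype(np.float32)
# R -> [[0, 0, 1, 0],
#       [1, 0, 0, 0],
#       [0, 0, 0, 1],
#       [1, 0, 0, 0]]
```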
Minimize Operator Invocations. Given two approaches to implement an ML operator, we found that picking the one which invokes fewer operators often outperforms the other, even if it performs extra computations. Consider a featurizer that generates feature interactions. Given an input X ∈ R^(n×d), it generates a transformed output X' ∈ R^(n×d(d+1)/2) containing all pairwise products of features. One way to implement this operator is to compute each new feature separately by first Gathering the corresponding input feature columns, performing an element-wise Multiplication, and conCatenating all new features. However, this approach requires performing O(d²) operations and hence is highly inefficient due to high operator scheduling overheads. Alternatively, one could implement the same operator as follows. First, Reshape X into [n, d, 1] and [n, 1, d]. Then perform a batched GEMM using these inputs, which creates a [n, d, d] tensor. Finally, Reshape it to [n, d²]. Notice that each row in the result has all the values of the corresponding row in X', but in a different order. It also has some redundant values due to the commutativity of multiplication (i.e., x_i · x_j = x_j · x_i). Hence, we perform a final Gather to extract the features in the required order, and generate X'. Compared to the previous one, this approach increases both the computation and the memory footprint roughly by a factor of two. However, we can implement feature interaction in just two tensor operations.
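The batched-GEMM formulation of feature interactions can be sketched in NumPy (input values are illustrative):

```python
import numpy as np

n, d = 2, 3
X = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# Two tensor ops: a batched GEMM producing all d*d pairwise products, then a
# reshape; a final Gather keeps only the d(d+1)/2 unique interactions.
P = X.reshape(n, d, 1) @ X.reshape(n, 1, d)   # batched GEMM -> [n, d, d]
P = P.reshape(n, d * d)                        # all pairwise products (with duplicates)
order = [i * d + j for i in range(d) for j in range(i, d)]   # upper-triangular order
Xp = P[:, order]                               # gather the unique interactions
# Xp -> [[ 1,  2,  3,  4,  6,  9],
#        [16, 20, 24, 25, 30, 36]]
```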
Avoid Generating Large Intermediate Results. Automatic broadcasting can in certain cases become extremely inefficient due to the materialization of large intermediate tensors. Consider the Euclidean distance matrix calculation, which is popular in many ML operators (e.g., SVMs, KNN). Given two tensors X ∈ R^(n×d) and Y ∈ R^(m×d), the objective is to calculate a tensor D ∈ R^(n×m), where D_ij = ||x_i − y_j||². Implementing this using broadcasting requires first reshaping X to [n, 1, d] and Y to [1, m, d], calculating (X − Y)², and performing a final Sum over the last dimension. This approach causes a size blowup by a factor of d in the intermediate tensors. Alternatively, a popular trick [euclidean_distance_trick] is to use the quadratic expansion ||x_i − y_j||² = ||x_i||² − 2 x_i · y_j + ||y_j||² and calculate the individual terms separately. This avoids generating large intermediate tensors.

Fixed Length Restriction on String Features. Features with strings of arbitrary lengths pose a challenge for HB. Strings are commonly used in categorical features, and operators like one-hot encoding and feature hashing natively support strings. To support string features, HB imposes a fixed-length restriction, with the length determined by the maximum size of any string in the vocabulary. Vocabularies are generated during training and can be accessed at compile time by HB. Fixed-length strings are then encoded as int8 tensors.
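Returning to the Euclidean distance matrix: the quadratic-expansion trick can be sketched in NumPy as follows, where every intermediate is at most [n, m] rather than [n, m, d]:

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distance matrix via the quadratic expansion
    ||x_i - y_j||^2 = ||x_i||^2 - 2 x_i . y_j + ||y_j||^2."""
    x2 = (X ** 2).sum(axis=1, keepdims=True)    # [n, 1]
    y2 = (Y ** 2).sum(axis=1, keepdims=True).T  # [1, m]
    return x2 - 2.0 * (X @ Y.T) + y2            # [n, m]; no [n, m, d] intermediate
```

The cross term is a single GEMM, so the formulation also maps directly onto the tensor operators HB already uses.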
5 Optimization
In this section we discuss the key optimizations performed by HB's Optimizer: heuristics for picking operator strategies (Section 5.1) and runtime-independent optimizations (Section 5.2). Recall that our approach also leverages runtime-specific optimizations at the Tensor DAG Compiler level. We refer to [torchscript, tvm] for runtime-specific optimizations.
5.1 Heuristicsbased Strategy Selection
For a given traditional ML operator, there can be more than one compilation strategy available. In the previous section we explained three such strategies for tree-based models. In practice, no strategy consistently dominates the others; each is preferable in different situations based on the input and model structure. For instance, the GEMM strategy becomes significantly inefficient as decision trees get bigger, because of the large number of redundant computations: it performs O(2^D) computations (where D is the depth of the tree), whereas the original algorithmic operator needs to perform only O(D) comparisons. Nevertheless, with small batch sizes or a large number of smaller trees, this strategy can be performance-wise optimal on modern hardware, where GEMM operations run efficiently. With large batch sizes and taller trees, tree-traversal techniques typically outperform the GEMM strategy, and PTT is slightly faster than vanilla TT due to the reduced number of memory accesses. But if the trees are too deep, we cannot use PTT, because the memory footprint of the associated data structures becomes prohibitive. In such cases, we resort to TT. The exact crossover points at which one strategy outperforms the others are determined by the characteristics of the tree model (e.g., number of trees, maximum depth of the trees), runtime statistics (e.g., batch size), and the underlying hardware (e.g., CPUs, GPUs). For instance, from our experiments (see Figure 7) we found that the GEMM strategy performs better for shallow trees (on both CPU and GPU) or for scoring with smaller batch sizes. For taller trees, PTT gives a reasonable trade-off between memory footprint and runtime up to a moderate depth, which leaves vanilla TT as the only option for very tall trees. These heuristics are currently hard-coded.
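The selection logic can be sketched as a small decision function. The depth and batch-size cutoffs below are illustrative placeholders chosen to match the qualitative description above, not HB's actual hard-coded values:

```python
def pick_tree_strategy(max_depth, batch_size, on_gpu):
    """Illustrative strategy-selection heuristic; all cutoffs are hypothetical."""
    shallow = 10 if on_gpu else 3      # GPUs tolerate deeper trees under GEMM
    if max_depth <= shallow or batch_size <= 16:
        return "GEMM"                  # redundancy pays off on shallow trees / small batches
    if max_depth <= 16:
        return "PTT"                   # perfect-tree padding still fits in memory
    return "TT"                        # very tall trees: lowest memory footprint
```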
5.2 Runtime-independent Optimizations
We discuss two novel optimizations, which are unique to HB. HB's approach of separating the prediction pipeline from the training pipeline, and representing it as a logical DAG before compilation into tensor computations, facilitates the optimization of end-to-end pipelines.
Feature Selection Push-Down. Feature selection is a popular operation that is often used as the final featurization step, as it reduces overfitting and improves the accuracy of the ML model [feature_selection]. During scoring, however, it can be pushed down in the pipeline to avoid redundant computations, such as scaling and one-hot encoding for discarded features, or even reading those features at all. This idea is similar to projection push-down in relational query processing, but through user-defined table functions, which in our case are the ML operators. For operators such as feature scaling, which perform 1-to-1 feature transformations, selection push-down is easy to implement. However, for operators such as one-hot encoding and polynomial featurization, which perform 1-to-m or m-to-1 feature transformations, the operator has to absorb the feature selection and stop generating the discarded features. For example, say one-hot encoding is applied to a categorical feature column with a vocabulary size of 10, but 4 of the resulting features are discarded by the feature selector. In such cases, we can remove those entries from the vocabulary. Note that for some "blocking" operators [pretzel], such as normalizers, it is not possible to push down the feature selection.
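As an illustration of absorbing a selection into a 1-to-m operator, the sketch below prunes a one-hot encoder's vocabulary so that discarded columns are never generated. This is a simplified, hypothetical encoder written for clarity, not HB's actual implementation.

```python
import numpy as np

def one_hot(column, vocab):
    """Encode a categorical column against a vocabulary:
    one output column per vocabulary entry."""
    return (column[:, None] == np.asarray(vocab)[None, :]).astype(np.float32)

def push_selection_into_one_hot(vocab, kept_output_cols):
    """Absorb a downstream feature selection into the encoder by
    shrinking its vocabulary to the kept output columns only."""
    return [vocab[i] for i in kept_output_cols]

vocab = list(range(10))        # vocabulary of size 10
kept = [0, 2, 3, 5, 7, 9]      # the selector keeps 6 of the 10 outputs
pruned_vocab = push_selection_into_one_hot(vocab, kept)

x = np.array([0, 2, 9, 4])
full = one_hot(x, vocab)[:, kept]   # encode all 10 columns, then select
fused = one_hot(x, pruned_vocab)    # encode only the 6 kept columns
assert np.array_equal(full, fused)  # same result, less work
```

The fused version computes 6 output columns per record instead of 10, mirroring the redundant work the push-down eliminates.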
Feature Selection Injection. Even if the original pipeline does not have a feature selection operator, it is possible to inject one and then push it down. Linear models with L1 regularization (Lasso) are a typical example where feature selection is implicitly performed. The same idea can be extended to tree-based models to prune the features that are never used as decision variables. In both examples, the ML model also has to be updated to take the pruned features into account: for linear models we prune the zero weights; for tree models, we update the indices of the decision variables.
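For the linear-model case, the rewrite can be sketched as below (hypothetical standalone code; HB performs the analogous transformation on its logical DAG):

```python
import numpy as np

def inject_feature_selection(weights, bias):
    """Given an L1-regularized linear model, derive the implicit
    feature selection (the nonzero weights) and a pruned model
    defined over the selected features only."""
    selected = np.flatnonzero(weights)  # indices of surviving features
    return selected, weights[selected], bias

# A sparse weight vector, as Lasso training typically produces.
w = np.array([0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.3])
b = 0.1
sel, w_pruned, b_pruned = inject_feature_selection(w, b)

X = np.random.randn(8, 7)
scores_full = X @ w + b                          # original model
scores_pruned = X[:, sel] @ w_pruned + b_pruned  # selection pushed down
assert np.allclose(scores_full, scores_pruned)
```

Once the selection over `sel` exists as an explicit operator, it can be pushed below the featurizers exactly as in the previous optimization, so scaling and encoding are computed for 3 features instead of 7.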
6 Experimental Evaluation
In our experimental evaluation we report two microbenchmark experiments showing how HB performs compared to the current state of the art for inference over (1) tree ensembles (Section 6.1.1) and (2) other featurization operators and ML models (Section 6.1.2). We then evaluate the optimizations by showing: (1) the need for heuristics for picking the best tree-model implementation (Section 6.2.1); and (2) the benefits introduced by the runtime-independent optimizations (Section 6.2.2). Finally, we conduct an evaluation using end-to-end pipelines (Section 6.3). We evaluate on both CPUs and hardware accelerators (GPUs).
Hardware and Software Setup. For all the experiments (except when stated otherwise) we use an Azure NC6 v2 machine equipped with 112 GB of RAM, an Intel Xeon E5-2690 v4 CPU @ 2.60GHz (6 virtual cores), and an NVIDIA P100 GPU. The machine runs Ubuntu 18.04 with PyTorch 1.3.1, TVM 0.6, scikit-learn 0.21.3, XGBoost 0.9, LightGBM 2.3.1, ONNX runtime 1.0, RAPIDS 0.9, and CUDA 10. We run TVM with opt_level 3 when it does not fail, and 0 otherwise.
Experimental Setup. We run all experiments 5 times and report the truncated mean obtained by averaging the middle values. In the following, we use ONNX-ML to indicate running an ONNX-ML model (i.e., the traditional ML part of the ONNX standard) on the ONNX runtime. Additionally, we use bold numbers to highlight the best performance for each setup (CPU or GPU). Note that neither scikit-learn nor ONNX-ML natively supports hardware acceleration.
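Concretely, the reported statistic drops the minimum and maximum of the five measured runs and averages the rest:

```python
def truncated_mean(runs):
    """Mean of the middle values after dropping the single smallest
    and largest measurement, as used to aggregate the 5 runs."""
    middle = sorted(runs)[1:-1]
    return sum(middle) / len(middle)

# e.g., five latency measurements in ms; outliers 90 and 210 are dropped
assert truncated_mean([102, 98, 210, 100, 90]) == 100.0
```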
6.1 Microbenchmarks
6.1.1 Tree Ensembles
Setup.
This experiment is run over a set of popular datasets used for benchmarking gradient boosting frameworks [gbmbench]. We first do an 80%/20% train/test split over each dataset. Subsequently, we train scikit-learn random forest, XGBoost [xgboost], and LightGBM [lgbm] models using the default parameters of the benchmark. Specifically, we set the number of trees to 500 and the maximum depth to 8. For XGBoost and LightGBM we use the scikit-learn API. Note that each algorithm generates trees with different structures, and this experiment helps in understanding how HB behaves with various tree types and dataset scales. For example, XGBoost generates balanced trees, LightGBM mostly generates skinny, tall trees, while random forest trees are a mix of the two. Finally, we score the trained models over the test dataset using different batch sizes. We compare the results against
HB with different runtime backends, and against an ONNX-ML version of the model generated using ONNXMLTools [onnxmltools]. When evaluating over GPU, we also compare against the NVIDIA RAPIDS Forest Inference Library (FIL) [fil]. We do not compare against the GPU implementations of XGBoost or LightGBM because we consider FIL the state of the art [filblog]. For the CPU experiments we use all six cores of the machine, while for the request/response experiments we use one core. We set a timeout of 1 hour for each experiment.

Datasets. We use 6 datasets from NVIDIA's gbm-bench [gbmbench]. The datasets cover a wide spectrum of use cases: from regression to multiclass classification, from 285K rows to 100M, and from a few tens of columns to 2K.
Algorithm  Dataset  Baselines (CPU)  HB CPU  Baselines (GPU)  HB GPU  
Sklearn  ONNXML  PyTorch  TorchScript  TVM  RAPIDS FIL  TorchScript  TVM  
Rand. Forest  Fraud  2.5  7.1  8.0  7.8  3.0  not supported  0.044  0.015 
Epsilon  9.8  18.7  14.7  13.9  6.6  not supported  0.13  0.13  
Year  1.9  6.6  7.8  7.7  1.4  not supported  0.045  0.026  
Covtype  5.9  18.1  17.22  16.5  6.8  not supported  0.11  0.047  
Higgs  102.4  257.6  314.4  314.5  118.0  not supported  1.84  0.55  
Airline  1320.1  timeout  timeout  timeout  1216.7  not supported  18.83  5.23  
LightGBM  Fraud  3.4  5.9  7.9  7.6  1.7  0.014  0.044  0.014 
Epsilon  10.5  18.9  14.9  14.5  4.0  0.15  0.13  0.12  
Year  5.0  7.4  7.7  7.6  1.6  0.023  0.045  0.025  
Covtype  51.06  126.6  79.5  79.5  27.2  not supported  0.62  0.25  
Higgs  198.2  271.2  304.0  292.2  69.3  0.59  1.72  0.52  
Airline  1696.0  timeout  timeout  timeout  702.4  5.55  17.65  4.83  
XGBoost  Fraud  1.9  5.5  7.7  7.6  1.6  0.013  0.44  0.015 
Epsilon  7.6  18.9  14.8  14.8  4.2  0.15  0.13  0.12  
Year  3.1  8.6  7.6  7.6  1.6  0.022  0.045  0.026  
Covtype  42.3  121.7  79.2  79.0  26.4  not supported  0.62  0.25  
Higgs  126.4  309.7  301.0  301.7  66.0  0.59  1.73  0.53  
Airline  1316.0  timeout  timeout  timeout  663.3  5.43  17.16  4.83 
List of Experiments. We run the following set of experiments: (1) batch inference, both on CPU and GPU; (2) request/response, where one single record is scored at a time; (3) scaling experiments varying the batch size, both on CPU and GPU; (4) an evaluation of how HB behaves on different GPU generations; (5) dollar cost per prediction; (6) memory consumption; (7) validation of the produced output with respect to scikit-learn; and finally (8) time spent compiling the models.
Batch Inference. Table 7 reports the inference time for random forest, XGBoost, and LightGBM models run over the 6 datasets. The batch size is set to 10K records. Looking at the CPU numbers in the table, we can see that:

Among the baselines, scikit-learn models outperform the ONNX-ML implementations by 2× to 3×. This is because ONNX-ML v1.0 is not optimized for batch inference.

Looking at HB's backends, there is not a large difference between PyTorch and TorchScript; in general, these backends perform comparably to ONNX-ML.

The TVM backend provides the best performance in 15 experiments out of 18. In the worst case TVM is 20% slower (than scikit-learn); in the best cases it is up to 2× faster than the baseline solutions.
Let us now look at the GPU numbers of Table 7:

Baseline RAPIDS FIL does not support random forest nor multiclass classification tasks. For the remaining experiments, GPU acceleration provides speedups of up to 300× compared to the CPU baselines. (The original FIL blog post [filblog] claims GPU acceleration in the order of 28× for XGBoost, versus close to 300× in our case (Airline). We think the difference lies in the hardware: they use 5 E5-2698 CPUs for a total of 100 physical cores, while we use one E5-2690 CPU with 6 virtual cores. Additionally, they use a V100 GPU versus our P100.)

Looking at HB's backends, TorchScript is about 2× to 3× slower than RAPIDS FIL. TVM is instead the fastest solution in 14 experiments out of 18, with a 10% to 20% improvement over FIL.
The results are somewhat surprising: HB targets the high-level tensor APIs provided by PyTorch and TVM, yet it is able to outperform custom C++ and CUDA implementations.
Request/response.
In this scenario, one single record is scored at a time. For this experiment we run inference over the entire test datasets, but with a batch size of 1. We used the same datasets and setup as in Section 6.1.1, except that (1) we removed the Airline dataset, since no system was able to complete within the 1-hour timeout; and (2) we use one single core. The results are depicted in Table 8:

Unlike the batch scenario, ONNX-ML is much faster than scikit-learn, in some cases even more than 100×. The reason is that ONNX-ML is currently optimized for single-record, single-core inference, whereas scikit-learn's design leans toward batch inference.

PyTorch and TorchScript, again, behave very similarly. For random forest they are faster than scikit-learn, but up to 5× slower than ONNX-ML. For LightGBM and XGBoost they are sometimes on par with scikit-learn, sometimes slower.

TVM provides the best performance in 11 out of 15 cases, with a best-case speedup of 3× over the baselines.
Algorithm  Dataset  Baselines  HB  
Sklearn  ONNXML  PT  TS  TVM  
Rand. Forest  Fraud  1688.22  9.96  84.95  75.5  11.63 
Epsilon  2945.42  32.58  153.32  134.17  20.4  
Year  1152.56  18.99  84.82  74.21  9.13  
Covtype  3388.50  35.49  179.4  157.8  34.1  
Higgs  timeout  335.23  timeout  timeout  450.65  
LightGBM  Fraud  354.27  12.05  96.5  84.56  10.19 
Epsilon  40.7  29.28  167.43  148.87  17.3  
Year  770.11  16.51  84.55  74.05  9.27  
Covtype  135.39  209.16  854.07  822.93  42.86  
Higgs  timeout  374.64  timeout  timeout  391.7  
XGBoost  Fraud  79.99  7.78  96.84  84.61  10.21 
Epsilon  121.21  27.51  169.03  148.76  17.4  
Year  98.67  17.14  85.23  74.62  9.25  
Covtype  135.3  197.09  883.64  818.39  43.65  
Higgs  timeout  585.89  timeout  timeout  425.12 
Framework  Random Forest  LightGBM  XGBoost 
Sklearn  180  182  392 
ONNXML  265  258  432 
TorchScript  375  370  568 
TVM  568  620  811 
These results are again surprising, considering that tensor operations are presumably optimized for bulk workloads rather than request/response scenarios.
Scaling the Batch Size. We study how the performance of the baselines and HB's backends changes with the batch size. Figures 3(a) and 3(b) depict the performance variation over CPU and GPU, respectively. We report only a few dataset/algorithm combinations; all others behave similarly. Starting with the CPU experiment, we can see that ONNX-ML has the best runtime for a batch size of 1, but its performance then remains flat as we increase the batch size. TorchScript and scikit-learn did not complete within the timeout for a batch size of 1, but past 100 they both scale linearly as the batch size increases. TVM is comparable to ONNX-ML for a batch of 1; for batches of 100 records it gets about 5× faster, while it scales like TorchScript for batches greater than 100. This is likely because TVM applies a set of optimizations (e.g., operator fusion) that introduce a constant-factor speedup compared to TorchScript.
Looking at the GPU numbers (Figure 3(b)), TorchScript and TVM again follow a similar trend, with TVM being around 3× faster than TorchScript. Both TVM and TorchScript plateau at a batch size of about 10K records. RAPIDS FIL is slower than TorchScript for small batch sizes, but it scales better than HB thanks to its custom CUDA implementation, which better exploits the hardware at higher utilization. Interestingly, FIL also plateaus at around 10K records. The custom CUDA implementation introduces a 50% gain over HB with the TVM runtime for large batches.
Algorithm  Dataset  ONNXML  HB  
PyTorch  TorchScript  TVM  
Rand.Forest  Fraud  1.28  0.55  0.58  102.37 
Epsilon  7.53  2.63  2.67  108.64  
Year  7.11  2.77  2.86  69.99  
Covtype  9.87  2.16  2.2  106.8  
Higgs  8.25  2.41  2.44  103.77  
Airline  6.82  2.42  2.53  391.07  
LightGBM  Fraud  1.34  0.98  1.06  3.42 
Epsilon  11.71  7.55  7.60  9.95  
Year  9.49  6.11  6.15  8.35  
Covtype  32.46  22.57  23.12  26.36  
Higgs  6.73  25.04  26.3  109  
Airline  11.52  6.38  6.47  8.19  
XGBoost  Fraud  0.55  0.65  0.7  86.59 
Epsilon  6.86  25.89  25.94  113.4  
Year  5.66  23.4  23.54  110.24  
Covtype  9.87  2.16  2.20  106.8  
Higgs  6.73  25.04  26.3  109 
Scaling Hardware. We tested how RAPIDS FIL and HB (TorchScript and TVM) scale as we change the GPU model. For this experiment we tried both a large batch size (100K records, Figure 5(a)), to maximize hardware utilization, and a smaller one (1K records, Figure 5(b)). We ran this on all datasets across random forest, LightGBM, and XGBoost with similar results, and present the Airline dataset (the largest) with LightGBM as a representative sample. We tested three NVIDIA devices: K80 (the oldest, 2014), P100 (2016), and V100 (2017). From the figures, in general we can see that: (1) RAPIDS FIL does not run on the K80 because it is an old generation; (2) with a batch size of 1K we get slower total inference time because we do not fully utilize the hardware; (3) the TorchScript and TVM runtimes for HB scale similarly across hardware, although TVM is consistently 4× to 7× faster; (4) FIL scales similarly to HB, although it is 50% faster on large batches and 3× slower on smaller ones; (5) TorchScript is not optimal in memory management: for batches of 100K records it fails on the K80 with an OOM exception. Finally, we were also able to run HB on the new Graphcore IPU [graphcore] over a single decision tree.
Cost. Figure 6 shows the cost comparison between the Azure VM instance equipped with a GPU and a comparable one without (E8 v3). The plot shows the cost of scoring 100k samples with a batch size of 1K for random forest. The cost per prediction is calculated by amortizing the hourly rate of each VM over the predictions it serves. We executed scikit-learn on the CPU, and TorchScript and TVM on the GPU, for comparison. We found that the CPU cost was significantly higher (between 10× and 120×) across all experiments. (Note: Airline times out for random forest on CPU with batches of 1K.) An interesting result is that the oldest GPU was the most cost-effective: the K80 with TVM had the lowest cost in 13 out of the 18 experiments (including LightGBM and XGBoost, not pictured). This is explained by the fact that the K80 is readily available at significantly lower cost.
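The amortization works out as follows. The rates and latencies below are illustrative placeholders, not the actual Azure prices or our measured times:

```python
def cost_per_prediction(hourly_rate_usd, total_seconds, n_predictions):
    """Amortize a VM's hourly rate over the predictions it served."""
    return hourly_rate_usd * (total_seconds / 3600.0) / n_predictions

# Hypothetical numbers: a CPU VM taking 1200s vs. a GPU VM taking 10s
# to score the same 100k samples.
cpu = cost_per_prediction(0.50, 1200, 100_000)
gpu = cost_per_prediction(2.00, 10, 100_000)
assert cpu > gpu  # faster GPU scoring wins despite the higher hourly rate
```

With these placeholder numbers the GPU run is roughly 30× cheaper per prediction, illustrating why a higher hourly rate can still yield a lower amortized cost.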
Memory Consumption. We measured the peak memory consumption over the Fraud dataset for each algorithm, using the memory_usage function of the memory_profiler library [mem_profiler]. The numbers, reported in Table 9, are the result of execution on 1 core with a batch size of 1. As we can see, scikit-learn is always the most memory efficient. ONNX-ML consumes from 10% to 50% more memory, while HB with the TorchScript runtime consumes from 50% to about 2× more memory than scikit-learn. Conversely, TVM consumes from 2× to 3× more memory than scikit-learn. We think TVM is more memory-hungry because it optimizes compute at the cost of memory requirements. Note that the batch size influences the total memory consumption.
Output Validation. Since we run tree ensemble models as tensor operations, we could introduce rounding errors in floating point operations. Therefore, we need to validate that the produced outputs indeed match the originals. To evaluate this, we used numpy's testing.assert_allclose function with tight relative and absolute error tolerances. We validate both the final scores and the probabilities (when available) for all combinations of datasets and algorithms. Out of the 18 experiments listed in Table 7, 9 returned no mismatches for HB, versus 12 for ONNX-ML. Among the mismatches, the worst case for HB is random forest on Covtype, where 0.8% of records differ from the original scikit-learn output. For the Epsilon dataset, HB with random forest mismatches on 0.1% of records. All remaining mismatches affect less than 0.1% of records. Note that the differences are small: the biggest mismatch is an absolute difference of 0.086, for Higgs using LightGBM; for the same experiment ONNX-ML has an absolute difference of 0.115.

Conversion Time. Table 10 shows the time it takes to convert a trained model into a target framework. The numbers refer to the generation of models running on a single core. This cost occurs only once per model and is not part of the inference cost. As we can see, converting a model to ONNX-ML can take up to a few tens of seconds; HB with the PyTorch backend is consistently about 2× to 3× faster than ONNX-ML at converting random forest models, while results vary for LightGBM and XGBoost models. TorchScript models are generated from PyTorch models, and in general this further compilation step does not introduce major overhead. Finally, conversion to TVM is much slower, and can take more than 3 minutes. This is due to code generation and the optimizations performed by TVM.
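The validation itself amounts to the check sketched below; the tolerance values shown are placeholders, not necessarily those used in our experiments.

```python
import numpy as np

def validate_outputs(reference, compiled, rtol=1e-5, atol=1e-5):
    """Check that the tensor-compiled model reproduces the original
    framework's scores within floating-point tolerance; otherwise,
    return the fraction of mismatching records."""
    try:
        np.testing.assert_allclose(reference, compiled, rtol=rtol, atol=atol)
        return 0.0
    except AssertionError:
        close = np.isclose(reference, compiled, rtol=rtol, atol=atol)
        # A record mismatches if any of its outputs is out of tolerance.
        return 1.0 - close.all(axis=-1).mean()

ref = np.array([[0.1, 0.9], [0.7, 0.3]])
assert validate_outputs(ref, ref + 1e-7) == 0.0  # rounding noise passes
assert validate_outputs(ref, ref + 0.05) > 0.0   # real drift is flagged
```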
As a final note: parallel (i.e., more than 1 core) and GPU execution introduce further conversion-time overheads, especially on TVM. For instance, TVM can take up to 40 minutes to convert a random forest model for execution on GPU.
6.1.2 Operators
Setup. This microbenchmark is a replication of the suite comparing scikit-learn and ONNX-ML operators [onnxbench]. We test all scikit-learn operators of the suite that are supported by both ONNX-ML and HB (minus tree ensemble models). The total number of tested operators is 13; they are a mix of ML models (Logistic Regression, Support Vector Machines, etc.) and featurizers (e.g., Binarizer, PolynomialFeatures). For this microbenchmark we score 1 million records.
Datasets. We use the Iris dataset [iris] with 20 features.
List of Experiments. We run the following experiments: (1) batch inference over 1M records, both on CPU and GPU; (2) request/response over 1 record; (3) memory consumption and conversion time. All output results are correct.
Operator  Baselines (CPU)  HB CPU  HB GPU  
Sklearn  ONNXML  TS  TVM  TS  TVM  
Log. Regres.  970  1540  260  47  13  15 
SGDClass.  180  1540  270  49  11  15 
LinearSVC  110  69  260  51  12  18 
NuSVC  3240  4410  2800  3000  140  72 
SVC  1690  2670  1520  1560  120  41 
BernoulliNB  280  1670  290  65  12  14 
MLPClassifier  930  1860  910  1430  17  31 
Dec.TreeClass.  59  1610  560  35  13  16 
Binarizer  98  75  39  59  38  38 
MinMaxScaler  92  200  78  57  38  38 
Normalizer  94  140  83  97  39  40 
Poly.Features  4030  29160  6380  3130  340  error 
StandardScaler  150  200  77  58  38  38 
Batch Inference. The batch numbers are reported in Table 11. On CPU, scikit-learn is faster than ONNX-ML, up to 6× for the polynomial featurizer, although in most cases the two systems are within a factor of 2. HB with the TorchScript backend is competitive with scikit-learn, whereas with the TVM backend HB is faster on 8 out of 13 operators, in general with a speedup of about 2× over scikit-learn. If we now focus on the GPU numbers, we see that HB with the TorchScript backend compares favorably against TVM on 11 operators out of 13. This is in contrast with the tree ensemble microbenchmark, where the TVM backend was faster than TorchScript. We suspect that TVM's optimizations are less effective on these "simpler" operators; for the same reason, GPU acceleration does not provide the speedups we saw for the tree ensemble models. In general, we see around a 2× improvement over the CPU runtime; only the polynomial featurizer runs notably faster, with almost a 10× improvement. TVM returns a runtime error when generating the polynomial featurizer model on GPU.
Request/response. Table 12 contains the times to score 1 record. The results are similar to the request/response scenario of the tree ensemble microbenchmark: ONNX-ML outperforms both scikit-learn and HB in 9 out of 13 cases. Note, however, that all frameworks are within a factor of 2. The only outlier is the polynomial featurizer, which is about 10× faster on HB with the TVM backend.
Operator  Baselines  HB  
Sklearn  ONNXML  TS  TVM  
LogisticRegression  0.087  0.076  0.1  0.1 
SGDClassifier  0.098  0.1  0.12  0.1 
LinearSVC  0.077  0.05  0.11  0.1 
NuSVC  0.086  0.072  4.1  0.14 
SVC  0.086  0.074  2.3  0.12 
BernoulliNB  0.26  0.1  0.07  0.11 
MLPClassifier  0.15  0.11  0.1  0.12 
DecisionTreeClassifier  0.087  0.074  0.44  0.12 
Binarizer  0.064  0.053  0.063  0.1 
MinMaxScaler  0.066  0.060  0.058  0.1 
Normalizer  0.11  0.063  0.072  0.1 
PolynomialFeatures  1.2  1  0.5  0.1 
StandardScaler  0.069  0.048  0.059  0.1 
Memory Consumption and Conversion Time. We measured the peak memory consumption and the conversion time for each operator on each framework, using batch inference over 1K records. For memory consumption, the results are in line with what we saw in Section 6.1.1. Regarding conversion time, for ONNX-ML and HB with TorchScript it is in the order of a few milliseconds. The TVM backend is slightly slower, but still in the order of a few tens of milliseconds (except for NuSVC and SVC, which take up to 3.2 seconds). Compared with the tree ensemble numbers (Table 10), we confirm that these operators are simpler, even from a compilation perspective.
6.2 Optimizations
6.2.1 Tree Model Implementations
Next we test the different tree-based model implementations to make the case for the heuristics.
Datasets. For this experiment we employ a randomly generated synthetic dataset with 5000 rows and 200 features.
Experiment Setup. We study the behavior of the tree implementations as we change the training algorithm, the batch size, and the tree depth. For each experiment we set the number of trees to 100. We use the TVM runtime backend. Each experiment is run on 1 CPU core.
Results. Figure 7 compares the different tree implementations against the two baselines, scikit-learn and ONNX-ML. In the top part of the figure we run all experiments using a batch size of 1; in the bottom part we instead use a batch size of 1K. In the left-hand column we generate trees with a maximum depth of 3; 7 for the middle column; and 12 for the right-hand column. In general, two things are apparent: (1) HB is as fast as or faster than the baselines; and (2) no tree implementation is always better than the others. The GEMM implementation outperforms the other two for small batch sizes, whereas TT and PTT are better for larger batch sizes. Between TT and PTT, the latter usually performs best (although not by a large margin). PTT, however, creates perfect (i.e., balanced) trees, and fails for very deep trees.
6.2.2 Runtime-independent Optimizations
Next we test the optimizations described in Section 5.2.
Dataset. We use the Nomao dataset [nomao] with 119 features.
Feature Selection Push-Down. In this experiment we measure the benefits of the feature selection push-down. In Figure 8 we compare HB with and without the push-down against the baseline implementation of the pipelines in scikit-learn. We use a pipeline that trains a logistic regression model with L2 regularization. The featurization part contains one-hot encoding for categorical features and missing-value imputation for numerical values, followed by feature scaling and a final feature selection operator (scikit-learn's SelectKBest). We vary the percentile of features picked by the feature selection operator. In general, we can see that HB without the optimization is about 2× faster than scikit-learn at evaluating the pipelines. For small percentiles, the feature selection push-down delivers a further 3× speedup. As we increase the percentile of selected features, the runtime of HB increases both with and without the optimization, although with the optimization HB remains 2× faster.
Feature Selection Injection. In this experiment we evaluate whether we can improve the performance of pipelines with sparse models by injecting (and then pushing down) feature selection operators. The pipeline is the same as in the previous case, but without the feature selection operator; instead, we train the logistic regression model with L1 regularization. In Figure 9 we vary the L1 regularization coefficient and study how much performance we can gain. Again, with very sparse models we see up to a 3× improvement over HB without the optimization. The performance gains dissipate as model sparsity decreases.
6.3 Endtoend Pipelines
Setup. In this experiment we test HB over end-to-end pipelines. We downloaded the 72 tasks composing the OpenML-CC18 suite [openmlcc18]. Among all the tasks, we discarded all the "not pure scikit-learn" ML pipelines (e.g., those also containing arbitrary Python code), and subsequently all pipelines returning a failure during training. 88% of the remaining pipelines are exclusively composed of operators supported by HB, for a total of 2328 ML pipelines. Among these, 11 failed during inference due to runtime errors in HB; we report the summary of executing the remaining 2317 pipelines. These pipelines contain an average of 3.3 operators, which is in line with what was observed elsewhere [dsonds].
Datasets. For this experiment we have 72 datasets in total [openmlcc18], a curated mix specifically designed for ML benchmarking. We did the typical 80%/20% split between training and inference. The smallest dataset has just 100 records, the biggest 19,264, and the median is 462. The minimum number of columns for a dataset is 4, the maximum 3072, with a median of 30.
Results. Figure 10 summarizes the speedups/slowdowns introduced by HB when scoring all 2317 pipelines. As we can see, HB is able to accelerate about 60% of the pipelines on CPU (Figure 10(a)). In general, the slowest pipeline gets about 60× slower than scikit-learn, while the fastest gets a 1200× speedup. The slowdowns are due to a few factors: (a) the datasets used for these experiments are quite small; (b) some pipelines contain largely sparse operations (i.e., SVM on sparse inputs); (c) several pipelines are small and do not require much computation (e.g., a simple imputer followed by a small decision tree). These three factors are also highlighted by the fact that, even when we move computation to the GPU (Figure 10(b)), 27% of the pipelines still show some slowdown. Note, however, that (1) both sparse and small pipelines can be detected at compile time, allowing us to return a warning or an error; (2) DNN frameworks are continuously adding new sparse tensor operations (e.g., [sparsetensorpytorch]); and (3) another option is to add a specific runtime backend for sparse tensor operations (e.g., we have a prototype integration with TACO [taco]). In general, DNN frameworks are relatively young, and HB will exploit any future improvement at no additional cost.
With GPU acceleration (Figure 10(b)), 73% of the pipelines show some speedup. The slowest pipeline gets about 130× slower than scikit-learn, while the fastest gets a speedup of three orders of magnitude. Some pipelines get worse when moving from CPU to GPU execution, due to (1) sparsity, (2) small compute, and (3) data movement between CPU and GPU memory. Indeed, we run all pipelines on GPU, even those for which GPU execution would not make much sense in practice (e.g., a decision tree with 3 nodes). We leave extending our heuristics to pick the right hardware backend as future work.
7 Related Work
PyTorch [pytorch], TensorFlow [tensorflow], MXNet [mxnet], and CNTK [cntk] are DNN frameworks that provide easy-to-use (tensor-based) APIs for authoring DNN models, and heterogeneous hardware support for both training and inference. Beyond these popular frameworks, inference runtimes such as ONNX Runtime [onnxruntime], nGraph [ngraph], TVM [tvm], and TensorRT [tensorrt] provide optimizations and efficient execution targets specifically for inference. To prove the versatility of our approach, we have tested HB with both PyTorch and TVM. HB uses a two-level, logical-physical optimization approach. First, logical optimizations are applied based on the operators composing the pipeline. Afterwards, physical operator implementations are selected based on model statistics, and physical rewrites implemented by the DNN runtime are executed (e.g., algebraic rewrites, operator fusion). Willump [willump] uses a similar two-level optimization strategy, although it targets Weld [weld] as its low-level runtime and therefore cannot natively support inference on hardware accelerators. Conversely, HB casts ML pipelines into tensor computations and takes advantage of DNN serving systems to ease deployment on target environments. Other optimizers for predictive pipelines, such as Pretzel [pretzel], only target logical optimizations. We have integrated HB into Raven [raven] as part of our bigger vision for optimizing ML prediction pipelines.
Several works deal with executing tree (ensemble) models [fil, treefpga, treegpu] on hardware accelerators. These systems provide a custom implementation of a tree inference strategy specific to the target hardware (e.g., NVIDIA GPUs for RAPIDS FIL [fil], FPGAs for [treefpga]), where computation is parallelized along the tree dimension. HB instead provides three tree inference strategies, including two novel ones, and picks the best alternative based on the efficiency/redundancy tradeoff.
8 Conclusions
In this paper, we explore the idea of using DNN frameworks as generic compilers and optimizers for heterogeneous hardware. Our use case is "traditional" ML inference. We ported 40+ data featurizers and traditional ML models into tensor operations and tested their performance over two DNN frameworks (PyTorch and TVM) and different hardware (CPUs and GPUs). The results are compelling: even though we target high-level tensor operations, we are able to outperform custom C++ and CUDA implementations. To our knowledge, Hummingbird is the first system able to run traditional ML inference on heterogeneous hardware.
9 Acknowledgements
We thank the anonymous reviewers and our shepherd, Chen Wenguang, for the suggestions and improvements on the paper. We also would like to thank Nellie Gustafsson, Gopal Vashishtha, Emma Ning, and Faith Xu for their support.