A Flexible HLS Hoeffding Tree Implementation for Runtime Learning on FPGA

Decision trees are often preferred when implementing Machine Learning in embedded systems for their simplicity and scalability. Hoeffding Trees are a type of Decision Tree that take advantage of the Hoeffding Bound to learn patterns in data without having to continuously store the data samples for future reprocessing. This makes them especially suitable for deployment on embedded devices. In this work we highlight the features of an HLS implementation of the Hoeffding Tree. The implementation parameters include the feature size of the samples (D), the number of output classes (K), and the maximum number of nodes to which the tree is allowed to grow (Nd). We target a Xilinx MPSoC ZCU102, and evaluate: the design's resource requirements and clock frequency for different numbers of classes and feature sizes; the execution time on several synthetic datasets with varying sample sizes (N) and numbers of output classes; and the execution time and accuracy for two datasets from UCI. For a problem size of D=100, K=5, and N=40,000, a single decision tree operating at 103 MHz is capable of 8.3x faster inference than the 1.2 GHz ARM Cortex-A53 core. Compared to a reference implementation of the Hoeffding tree, we achieve comparable classification accuracy for the UCI datasets.

I Introduction

With the rise of edge computing, FPGA vendors have been releasing and marketing CPU+FPGA SoCs as the ideal solution for this domain. As edge devices are often specialised for a single task in a constrained environment, it is advantageous to build dedicated hardware to improve performance and energy efficiency. FPGAs offer the advantage of targeted hardware without losing the ability to adapt the platform to changes (e.g., security updates), while being more efficient than a pure software solution.

As High Level Synthesis (HLS) matures [XilinxInc.2020VivadoSynthesis], it becomes a more attractive approach to creating efficient high-performance accelerators for FPGA devices.

Machine Learning (ML) algorithms are a prime candidate for acceleration at the edge, but their computational requirements exceed the capabilities of many embedded devices. Inference at the edge is a problem being addressed by many works, but training at the edge still faces hurdles to adoption despite its clear benefits. In the field of Decision Trees, many algorithms are incompatible with devices of this class due to memory constraints. ID3 [Quinlan1983LearningGames] and derivatives such as C4.5 and C5.0 require the entire training dataset to be present in memory for training. Incremental learning algorithms such as ID5 [UTGOFF1988ID5:ID3], ID5R [Utgoff1989IMPROVEDLEARNING] and ITI [Utgoff1997DecisionRestructuring] do allow for ongoing learning from streaming data but store the dataset samples within the tree.

Hoeffding Trees [Domingos2000MiningStreams] are incremental learning trees that are more suitable for embedded scenarios for two reasons: they asymptotically guarantee the same classification as traditional batch learners, and they store information about the distribution of samples statistically rather than the samples themselves, which drastically reduces memory requirements, especially for large datasets.

In this work, we present a flexible C/C++ HLS implementation of a Hoeffding Tree variant tailored for use in FPGAs, originally proposed by Lin et al. [Lin2019TowardsFPGA]. Their work built on an earlier variant in which the storage of the statistical data of the sampling distribution of the original Hoeffding Tree was replaced by a Gaussian approximation [Pfahringer2008HandlingTrees]. Lin et al. replace this approximation with quantile estimation using asymmetric signum functions [Althoff2017AnTesting]. The result is a larger memory footprint but a reduction in computational requirements, while achieving similar results. Since their implementation is written in Verilog, its applicability is limited to circuit synthesis, e.g. for FPGA. By using HLS, an implementation can be created that is equally suitable for CPU and FPGA.
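To illustrate how a quantile can be tracked from a stream without storing the samples, the following is a minimal sketch of a generic stochastic quantile update driven by an asymmetric signum function. It is not taken from Lin et al. or from the cited quantile estimation work; the type `QuantileTracker`, its fields, and the step size are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical streaming quantile tracker: it keeps one running estimate
// per quantile instead of storing the samples themselves.
struct QuantileTracker {
    double tau;       // target quantile in (0, 1), e.g. 0.5 for the median
    double step;      // step size; an assumed tuning parameter
    double estimate;  // current quantile estimate

    QuantileTracker(double targetQuantile, double stepSize, double initial)
        : tau(targetQuantile), step(stepSize), estimate(initial) {
        assert(tau > 0.0 && tau < 1.0);
        assert(step > 0.0);
    }

    // Asymmetric signum: +tau above the estimate, -(1 - tau) below it.
    double asymmetricSign(double x) const {
        if (x > estimate) return tau;
        if (x < estimate) return -(1.0 - tau);
        return 0.0;
    }

    // One update per incoming sample, with constant time and memory.
    void update(double x) { estimate += step * asymmetricSign(x); }
};
```

With this update the estimate stops drifting, on average, when a fraction `tau` of the incoming values falls below it, which is the defining property of the `tau` quantile. The memory cost is one value per tracked quantile, independent of the number of samples.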

The contributions of this work are as follows:

  • A generic, template-based C/C++ implementation of the Hoeffding Tree classifier as per Lin et al. [Lin2019TowardsFPGA], but suited for HLS.

  • Functional validation of the implementation through software execution, and post-synthesis onto a Xilinx ZCU102 development board.

  • Experimental evaluation of memory requirements of the tree object as a function of template parameters.

  • Experimental evaluation of FPGA resource requirements and execution time of the synthesised training and inference method as a function of template parameters.

II HLS Hoeffding Tree Implementation

A decision tree is a type of machine learning algorithm used either for classification or regression. A decision tree performs sequential binary decisions over an incoming vector of features, and a classification is computed when a leaf node is reached. During training, leaf nodes are added to the tree based on a splitting criterion, which separates the data into two regions at every tree junction. A Hoeffding tree is a type of decision tree where the criterion is the Hoeffding bound, shown in Equation 1. The tree performs learning and inference by relying on a property of the Hoeffding bound that guarantees that the best splitting point is chosen. If a gain function $G$ is to be maximised, then given $G(X)$ and $G(Y)$ (X and Y being the attributes that generate the highest and second highest values of $G$), if $\Delta G = G(X) - G(Y) > \epsilon$ then the Hoeffding bound guarantees that, with probability $1 - \delta$, X is the best attribute to split on. $R$ represents the range of the attributes and $n$ the number of samples on a node.

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \qquad (1)$$
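The split test built on Equation 1 can be sketched in C++ as follows. This is a minimal sketch rather than the authors' code: the function names `hoeffdingBound` and `shouldSplit` are hypothetical, and the bound is written in its usual form, epsilon = sqrt(R^2 ln(1/delta) / (2n)).

```cpp
#include <cassert>
#include <cmath>

// Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
// range: R in the text, delta: allowed error probability,
// n: number of samples seen at the node.
inline double hoeffdingBound(double range, double delta, unsigned long n) {
    assert(range > 0.0 && delta > 0.0 && delta < 1.0 && n > 0);
    return std::sqrt(range * range * std::log(1.0 / delta) /
                     (2.0 * static_cast<double>(n)));
}

// Split only when the best gain beats the second best by more than epsilon.
inline bool shouldSplit(double bestGain, double secondGain,
                        double range, double delta, unsigned long n) {
    return (bestGain - secondGain) > hoeffdingBound(range, delta, n);
}
```

Because epsilon shrinks as n grows, a gain difference that is too small to justify a split after a few samples can become sufficient once the node has seen more data.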

Compared with other criteria, the Hoeffding bound has two advantages: it allows for online incremental learning and growth of the tree, which asymptotically tends towards the results provided by batch learners, and it is independent of the probability distribution of the data sampling. The Hoeffding tree allows for continuous learning and node splitting for a potentially infinite number of samples (e.g. in streaming applications) [Domingos2000MiningStreams].

FPGAs have been intensively studied for decision tree implementations, as a tree structure maps efficiently to specialised hardware. In conjunction with other optimisations, decision trees in FPGAs have been shown to outperform CPU and GPU solutions [Barbareschi2021AdvancingStudy]. Lin et al. [Lin2019TowardsFPGA] demonstrate speedups of up to 1500x for an RTL implementation of the Hoeffding tree versus a 2.6GHz processor. Our aim is to explore a higher abstraction level via HLS, providing broader applicability, while evaluating the attainable performance.

We implemented the tree as a C++ class template. The parameters include the maximum number of nodes in the tree, the feature size, and the floating-point precision. The class contains the training and inference methods which are synthesised to hardware. At runtime, the C++ tree object can be manipulated in software, and passed as an argument to the training/inference method, as summarised in Figure 1.

This allows for instantiation of several tree objects in memory (with different template parameters if desired). Trees with the same template parameters can be processed by the same synthesised circuit. Since the functions can also be invoked in software, training or inference can be dynamically partitioned based on which device performs better for either task, as a function of the tree parameters. This also means that if the FPGA is occupied processing a tree object, other trees can be evaluated via software without the need for a blocking wait.
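As a rough illustration of such a class template, the sketch below shows one possible statically sized layout. All names and fields (`HoeffdingTreeSketch`, `Node`, `attributeStats`, and so on) are assumptions made for illustration and do not reproduce the authors' class. Fixed-size arrays are used because HLS tools generally cannot synthesise dynamic memory allocation.

```cpp
#include <cstddef>

// Hypothetical layout: every size is fixed at compile time by the
// template parameters, so the whole tree is one contiguous object.
template <std::size_t MaxNodes, std::size_t D, std::size_t K, typename Real = float>
struct HoeffdingTreeSketch {
    struct Node {
        bool isLeaf;
        std::size_t splitAttribute;  // index of the feature tested at this node
        Real splitValue;             // threshold for the binary decision
        std::size_t left, right;     // child indices into nodes[]
        unsigned classCounts[K];     // samples seen per class
        Real attributeStats[D][K];   // per-attribute, per-class statistics
    };

    Node nodes[MaxNodes];
    std::size_t usedNodes;

    static constexpr std::size_t maxNodes() { return MaxNodes; }
    static constexpr std::size_t featureSize() { return D; }
    static constexpr std::size_t numClasses() { return K; }
};
```

Under this assumed layout, `sizeof(HoeffdingTreeSketch<Nd, D, K>)` grows linearly with the maximum number of nodes and with the product of D and K, and two trees with identical template parameters have identical memory layouts, so one routine can process either.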

Finally, evaluation of multiple trees is possible through a combination of software and hardware invocations, by deploying multiple instances of the hardware kernel, or by time-multiplexing a single hardware kernel (as explained below). Any of these options allows for arbitrary runtime tree ensembles. This evaluation is left as future work.

Fig. 1: Software and hardware architecture of the Hoeffding Tree implementation; the training and inference kernels are shared by multiple tree objects

The Xilinx Vitis HLS flow enforces an OpenCL model for kernel invocation. The implemented kernel, krnl_Tree, receives four arguments: a HoeffdingTree object (as mentioned above), an array of samples, an array of output classifications, and the size of these arrays.

In this model, a large overhead penalty would occur for invocations with a single sample, due to the data transfer time. A practical application of the kernel design could be, e.g., in the sensor domain, where the tree could continuously sample fused data from multiple sensors (i.e., multiple attributes) without processor intervention, avoiding transfer overheads. Alternatively, streaming samples can be accumulated until a sufficiently large number is held that mitigates this overhead. This does not mean that the tree behaves as a batch learner, as one sample is processed per each infer-then-train step.

Inference on an incremental learning decision tree cannot be easily parallelised as the model changes and evolves with every training sample that arrives. This restricts the pipeline to dealing with one sample at a time, sequentially. The sample structure contains information about whether it should be used for training purposes or only for inference. Thus, as the kernel loops through the sample array, it executes either the train or infer method of the tree object accordingly. The results are placed in the output data structure.
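The sequential sample loop described above might look roughly like the following sketch. The names `Sample`, `doTrain`, `MajorityModel` and `krnlTreeSketch` are hypothetical, and `MajorityModel` is only a toy stand-in for the tree so that the example is self-contained. The sketch also assumes that a training sample is first classified and then used to update the model; the actual kernel may order these steps differently.

```cpp
#include <cstddef>

// Hypothetical sample record: features plus a flag selecting train or infer.
template <std::size_t D>
struct Sample {
    float features[D];
    unsigned label;  // only meaningful when doTrain is true
    bool doTrain;
};

// Toy stand-in for the tree: predicts the most frequent label seen so far.
template <std::size_t K>
struct MajorityModel {
    unsigned counts[K];

    template <std::size_t D>
    unsigned infer(const Sample<D>&) const {
        std::size_t best = 0;
        for (std::size_t k = 1; k < K; ++k)
            if (counts[k] > counts[best]) best = k;
        return static_cast<unsigned>(best);
    }

    template <std::size_t D>
    void train(const Sample<D>& s) {
        if (s.label < K) ++counts[s.label];
    }
};

// One sample at a time: the model may change after every training sample,
// so iteration i+1 cannot start before iteration i has finished.
template <typename Tree, std::size_t D>
void krnlTreeSketch(Tree& tree, const Sample<D>* samples,
                    unsigned* outputs, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        outputs[i] = tree.infer(samples[i]);
        if (samples[i].doTrain) tree.train(samples[i]);
    }
}
```

The dependency between consecutive iterations is what prevents samples from being processed in parallel: the classification of sample i+1 depends on the model state left behind by sample i.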

The OpenCL API allows for fine-grained control of how these arguments are passed to the kernels, each argument being a separate buffer with persistent storage. Thus, trees can be transferred to FPGA memory once, and not retrieved between executions of the kernels. With this mechanism, a tree object can reside in memory while only new samples are transferred in, and the model can be retrieved in a final stage.

Conversely, the samples themselves may remain in memory, and trees freely exchanged. This is one strategy for the construction of tree ensembles mentioned previously. Trees can reuse the same kernel instance via time-multiplexing, or by concurrent instantiation of several copies of krnl_Tree. In either case, the same read-only sample buffer can be assigned to all trees, thus significantly reducing overhead and preventing data duplication. For brevity, the evaluation of ensembles is out of the scope of this paper.

III Experimental Evaluation

We performed the following experiments: we evaluated the resource utilisation of a single synthesised tree for a range of values for the feature size and number of classes; evaluated the training and inference time of a single tree in hardware, versus the ARM CPU, for several synthetic clustering datasets (varying number of points, clusters, and feature size); and evaluated the classification accuracy and execution time of a single tree for UCI’s Bank and Covertype datasets.

III-A Resource Utilisation

Table I presents various configurations of the kernel, tailored for datasets of different dimensions (D), with different number of classes (K), number of samples (N) and max number of nodes (Nd). The purpose is to determine the effect of these parameters on FPGA resource utilisation. As expected, parameter N has no effect on resource utilisation as samples cannot be processed in parallel.

The feature size and the number of classes have only a minor effect on resource usage (Table I). This is due to the highly sequential nature of the generated kernel, which also explains why the performance of this kernel on training tasks is poor compared to the CPU. The CPU's overall advantage is less surprising when considered in the context of its roughly 11-fold advantage in clock speed. Current HLS tools cannot automatically parallelise sequential code, so without hardware design expertise to optimise the design, the implementation will be far from optimal. We believe that further parallelisation can be achieved in our implementation, even within a single tree, through inner loop unrolling or memory partitioning.
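As a hedged illustration of what inner loop unrolling and memory partitioning could look like, the sketch below annotates a small per-class update loop with Vitis HLS pragmas. The function `updateClassSums` is hypothetical and is not part of the published implementation. The pragma spellings follow the commonly documented `ARRAY_PARTITION` and `UNROLL` forms, which may vary between tool versions, and the example has not been synthesised.

```cpp
#include <cstddef>

// Hypothetical per-class statistics update. The K iterations are
// independent of each other, so they are candidates for unrolling.
template <std::size_t K>
void updateClassSums(float sums[K], const float contributions[K]) {
// Split both arrays into individual registers so that all K elements
// can be accessed in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=sums complete dim=1
#pragma HLS ARRAY_PARTITION variable=contributions complete dim=1
    for (std::size_t k = 0; k < K; ++k) {
// Request K parallel adders instead of one adder reused K times.
#pragma HLS UNROLL
        sums[k] += contributions[k];
    }
}
```

A regular C++ compiler ignores the unknown pragmas, so the same source still runs in software. On the FPGA, the trade-off is higher resource usage in exchange for fewer cycles per sample.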

One interesting result is that of the kernel’s operating frequency. It remains unchanged for all configurations. Looking deeper into the cause of this phenomenon, one finds that the bottleneck is the sorting of a sample down from the root node to the appropriate leaf node. This sequential operation also prevents the kernel from being pipelined.
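The root-to-leaf sorting step can be sketched as below. The node layout (`NodeSketch`) and the function `sortToLeaf` are assumptions made for illustration, with nodes stored in a flat array and the root at index 0.

```cpp
#include <cstddef>

// Hypothetical flat node layout; the root is stored at index 0.
struct NodeSketch {
    bool isLeaf;
    std::size_t splitAttribute;  // feature compared at this node
    float splitValue;            // threshold of the binary decision
    std::size_t left;            // child taken when feature <= splitValue
    std::size_t right;           // child taken otherwise
    unsigned label;              // class reported when this node is a leaf
};

// Walks from the root to a leaf and returns the index of that leaf.
inline std::size_t sortToLeaf(const NodeSketch* nodes, const float* features) {
    std::size_t current = 0;
    while (!nodes[current].isLeaf) {
        const NodeSketch& n = nodes[current];
        // The next node index depends on the node just read, so each
        // iteration must finish before the next one can begin.
        current = (features[n.splitAttribute] <= n.splitValue) ? n.left : n.right;
    }
    return current;
}
```

The loop-carried dependency on `current` and the data-dependent trip count are the kind of structure that makes such a traversal hard for an HLS tool to pipeline.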

Nd 100 100 100 1000 100 100 100 1000
K 5 5 10 5 5 5 10 5
D 3 100 3 3 3 100 3 3
N 40k 40k 40k 40k 500k 500k 500k 500k
LUT 23304 (8.6%) 20567 (7.6%) 23776 (8.8%) 24351 (9.0%) 23304 (8.6%) 20567 (7.6%) 23776 (8.8%) 24351 (9.0%)
LUTRAM 1395 (1.0%) 1179 (0.8%) 1399 (1.0%) 1397 (1.0%) 1395 (1.0%) 1179 (0.8%) 1399 (1.0%) 1397 (1.0%)
FF 35682 (6.6%) 29775 (5.5%) 36374 (6.7%) 36336 (6.7%) 35682 (6.6%) 29775 (5.5%) 36374 (6.7%) 36336 (6.7%)
BRAM 12 (1.3%) 9.5 (1.0%) 12 (1.3%) 12 (1.3%) 12 (1.3%) 9.5 (1.0%) 12 (1.3%) 12 (1.3%)
DSP 23 (0.9%) 25 (1.0%) 25 (1.0%) 25 (1.0%) 23 (0.9%) 25 (1.0%) 25 (1.0%) 25 (1.0%)
BUFG 13 (3.2%) 13 (3.2%) 13 (3.2%) 13 (3.2%) 13 (3.2%) 13 (3.2%) 13 (3.2%) 13 (3.2%)
MMCM 1 (25.0%) 1 (25.0%) 1 (25.0%) 1 (25.0%) 1 (25.0%) 1 (25.0%) 1 (25.0%) 1 (25.0%)
Freq. (MHz) 103.6 103.6 103.6 103.6 103.6 103.6 103.6 103.6
TABLE I: N, D, K and Nd effects on FPGA Resource Utilisation

Fig. 2: Size of Tree objects in bytes for Nd, D and K. Each bar in every grouping, depicts a tree with a max number of nodes from to .

III-B Performance

These results were obtained by feeding the tree with datasets of K clusters in a D-dimensional space, each consisting of N points. For these experimental runs, the entire dataset is transferred to the FPGA’s memory in a single operation.

Fig. 3: Illustrative visualisation of tree model derived from UCI Covertype dataset. The tree was only allowed to grow to 5 nodes.

Looking at the first four rows of Table II (D=3), it can be observed that for a 3-dimensional dataset, regardless of the bundle size, the ARM CPU in the ZCU102 SoC significantly outperforms the FPGA implementation in both the training and inference tasks. Also, the performance gap between both implementations grows with the number of samples processed. This indicates that the kernel is slower, per iteration, than the pure software solution. Regarding the last four rows of Table II (D=100), the ARM CPU still outperforms the FPGA kernel in training, although by a slightly smaller margin (0.12× versus 0.10×) that does not appear to grow with the number of samples. On the inference task with these larger datasets, the FPGA outperforms the ARM processor by 8.37× for N=40k and by 4.31× for N=500k.
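For reference, the speedup figures in Table II are the ARM CPU time divided by the FPGA time, so values below 1 mean the FPGA is slower. The helper names below are hypothetical; the sketch only reproduces this arithmetic, plus an average time per sample that includes any transfer overhead contained in the measured totals.

```cpp
// Speedup as reported in Table II: CPU time divided by FPGA time.
inline double speedup(double cpuMs, double fpgaMs) {
    return cpuMs / fpgaMs;
}

// Average time per sample in microseconds for a run over numSamples samples.
inline double perSampleUs(double totalMs, double numSamples) {
    return totalMs * 1000.0 / numSamples;
}
```

For example, `speedup(3924.0, 469.0)` reproduces the 8.37× inference figure for D=100 and N=40k, and `speedup(207.0, 1990.0)` the 0.10× training figure for D=3 and N=40k.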

Table III presents benchmarks of two of the UCI datasets used by Lin et al. [Lin2019TowardsFPGA]. The same tree parameters were used (, , , , , , ), with one being of special relevance: Nd (maximum number of nodes). A significant slowdown occurred: with the increased number of nodes, the sequential tree traversal becomes longer. Our HLS implementation achieves comparable accuracy for Bank, although the accuracy for Covertype is inferior; Lin et al. [Lin2019TowardsFPGA] report 89.30% and 72.51%, respectively. We believe a difference in calculation precision between the CPU and FPGA caused the degradation, despite the use of 32-bit floating point data types for both devices.

K   D    N     Task       ARM CPU     FPGA         Speedup
5   3    40k   Training   207 ms      1,990 ms     0.10×
5   3    40k   Inference  151 ms      462 ms       0.33×
5   3    500k  Training   2,983 ms    30,933 ms    0.10×
5   3    500k  Inference  2,260 ms    11,442 ms    0.20×
5   100  40k   Training   6,028 ms    51,648 ms    0.12×
5   100  40k   Inference  3,924 ms    469 ms       8.37×
5   100  500k  Training   75,763 ms   651,775 ms   0.12×
5   100  500k  Inference  49,495 ms   11,494 ms    4.31×
TABLE II: Training and inference times for four synthetic clustering datasets, for the ARM CPU (1.2 GHz) and the FPGA (103 MHz)
Dataset    ARM CPU Acc.  ARM CPU Time  FPGA Acc.  FPGA Time    Speedup
Bank       88.3%         202 ms        88.3%      8,525 ms     0.02×
Covertype  72.2%         9,712 ms      63.7%      374,600 ms   0.03×
TABLE III: Training time and Accuracy (Acc.) for Covertype and Bank datasets, for the ARM CPU (1.2 GHz) and the FPGA (103 MHz)

IV Related Work

Kulaga et al. [Kuaga2014FPGAHls] present an HLS decision tree ensemble solution for inference tasks. The results achieved are competitive regarding performance when compared to the ARM core present in the tested SoC. However, the design is highly dependent on the number of trees and corresponding depths, as a change in ensemble parameters requires re-tuning multiple pragmas. As we have also seen, an unavoidable sequential portion of the algorithm is the sample sorting through the tree structure. Unlike our approach, the number of trees in an ensemble is hardcoded into the synthesised kernel. In contrast, by having one or more synthesised training/inference methods (for different hyper-parameters), we can deploy N instances of such circuits and process a runtime allocated number of trees.

As previously stated, this paper builds on the work of Lin et al. [Lin2019TowardsFPGA]. However, their implementation is closed-source and written in Verilog, which excludes native execution on CPUs. Also, as the work was developed for a datacenter-class FPGA device, the implementation is very resource intensive and thus not suitable for small devices such as those used in embedded systems.

InAccel (InAccel, 2019, XGBoost Exact Updater IP core, https://github.com/inaccel/xgboost) provides an HLS implementation of the XGBoost learning algorithm, which is also based on decision trees. For a dataset of 65k points, 5 classes, and 128 features, the training time is 2.7 seconds. This is significantly faster than our performance for similarly sized datasets, but InAccel’s implementation targets server-grade FPGA accelerator boards (including multi-board setups), while we target the embedded domain. Nevertheless, it demonstrates the potential for HLS FPGA acceleration of decision tree algorithms, given expert optimisation of the code for HLS.

V Conclusions

We presented a flexible and scalable implementation of a Hoeffding Tree compatible with HLS tools (https://github.com/Sleepy105/Hoeffding-Tree/tree/fpt21). We performed a functional validation of the tree design, against software execution, by on-chip implementation on a Xilinx ZCU102. We provide an evaluation of the design’s resource usage for multiple template parameter values (i.e., maximum tree size, number of sample attributes, number of clusters, and number of dataset samples), as well as execution time versus an ARM Cortex-A53 processor. The resource requirements of the tree do not scale significantly with problem size, although further HLS optimisations such as unrolling remain unexplored. Even so, we outperform the ARM by up to 8.3x on the inference task for the D=100 datasets, while being 8.6x slower during training on the largest dataset. As future work, we envision the use of tree ensembles, and the partitioning of training and inference tasks between software and hardware based on problem size.

Acknowledgments

This work was supported by the PEPCC project (PTDC/EEI-HAC/30848/2017), financed by Fundação para a Ciência e Tecnologia (FCT).

References