Booster: An Accelerator for Gradient Boosting Decision Trees
We propose Booster, a novel accelerator for gradient boosting decision trees based on the unique characteristics of gradient boosting models. We observe that the dominant steps of gradient boosting training (accounting for 90-98% of training time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., accumulating and comparing values in the structures). Unfortunately, existing multicores and GPUs are unable to harness this parallelism because they do not support massively-parallel data structure accesses that are irregular and data-dependent. By employing a scalable sea-of-small-SRAMs approach and an SRAM bandwidth-preserving mapping of data record fields to the SRAMs, Booster achieves significantly more parallelism (e.g., 3200-way parallelism) than multicores and GPUs. In addition, Booster employs a redundant data representation that significantly lowers the memory bandwidth demand. Our simulations reveal that Booster achieves 11.4x and 6.4x speedups over an ideal 32-core multicore and an ideal GPU, respectively. Based on ASIC synthesis of FPGA-validated RTL using 45 nm technology, we estimate a Booster chip to occupy 60 mm^2 of area and dissipate 23 W when operating at a 1 GHz clock.
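For readers unfamiliar with the workload, the dominant steps the abstract describes correspond, in histogram-based gradient boosting trainers, to building per-feature gradient histograms (accumulate) and scanning them for the best split (compare). The following C sketch is illustrative only; the function names, the 64-bin layout, and the simplified gain formula are our assumptions for exposition, not Booster's actual design.

    #include <stddef.h>
    #include <float.h>

    #define NUM_BINS 64  /* illustrative bin count per feature */

    /* Per-bin accumulators: the small-footprint structures on which
       the fine-grained operations (accumulate, compare) act. */
    typedef struct {
        double grad_sum;  /* sum of first-order gradients */
        double hess_sum;  /* sum of second-order gradients */
        size_t count;     /* number of records in this bin */
    } Bin;

    /* Accumulate phase: each record independently adds its gradient
       statistics into one bin.  The accesses are irregular and
       data-dependent (the bin index comes from the record's value),
       which is what limits multicores and GPUs. */
    void build_histogram(Bin hist[NUM_BINS],
                         const unsigned char *bin_idx, /* precomputed bin per record */
                         const double *grad, const double *hess,
                         size_t num_records)
    {
        for (size_t i = 0; i < num_records; i++) {
            Bin *b = &hist[bin_idx[i]];
            b->grad_sum += grad[i];
            b->hess_sum += hess[i];
            b->count++;
        }
    }

    /* Compare phase: sweep the bins with running sums and keep the
       split boundary with the best (simplified) gain score. */
    int find_best_split(const Bin hist[NUM_BINS],
                        double total_grad, double total_hess)
    {
        double left_grad = 0.0, left_hess = 0.0;
        double best_gain = -DBL_MAX;
        int best_bin = -1;
        for (int b = 0; b < NUM_BINS - 1; b++) {
            left_grad += hist[b].grad_sum;
            left_hess += hist[b].hess_sum;
            double right_grad = total_grad - left_grad;
            double right_hess = total_hess - left_hess;
            if (left_hess <= 0.0 || right_hess <= 0.0)
                continue;  /* skip splits with an empty side */
            double gain = left_grad * left_grad / left_hess
                        + right_grad * right_grad / right_hess;
            if (gain > best_gain) { best_gain = gain; best_bin = b; }
        }
        return best_bin;
    }

Note that each such histogram is only a few kilobytes (64 bins of a few words each), consistent with the abstract's observation that the hot loops operate on small-footprint structures, the kind a sea of small SRAMs can hold and update in parallel.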