
Feature Encodings for Gradient Boosting with Automunge

09/25/2022
by   Nicholas J. Teague, et al.

Selecting a default feature encoding strategy for gradient boosted learning may consider metrics of training duration and achieved predictive performance associated with the feature representations. The Automunge library for dataframe preprocessing offers a default of binarization for categoric features and z-score normalization for numeric features. The presented study sought to validate those defaults by benchmarking encoding variations with tuned gradient boosted learning across a series of diverse data sets. We found that on average our chosen defaults were top performers from both a tuning duration and a model performance standpoint. Another key finding was that one hot encoding did not perform in a manner consistent with suitability to serve as a categoric default in comparison to categoric binarization. We present these and further benchmarks here.

1 Introduction

The usefulness of feature engineering for applications of deep learning has long been considered a settled question in the negative, as neural networks are on their own universal function approximators Goodfellow et al. (2016). However, even in the context of deep learning, tabular features are often treated with some form of encoding for preprocessing. Automunge Teague (2022b) is a platform for encoding dataframes developed by the authors. This python library was originally built for a simple use case of basic encoding conventions for numeric and categoric features, like z-score normalization and one hot encodings. Along the iterative development journey we began to flesh out a full library of encoding options, including a series of options for numeric and categoric features that now include scenarios for normalization, binarization, hashing, and missing data infill under automation. Although it was expected that this range of encoding options would be superfluous for deep learning, that does not rule out their utility in other paradigms, which could range from simple regression to support vector machines, decision trees, or, as will be the focus of this paper, gradient boosting.

The purpose of this work is to present the results of a benchmarking study of alternate encoding strategies for numeric and categoric features in gradient boosted tabular learning. We were particularly interested in validating the library’s default encoding strategies, and found that on both primary performance metrics of tuning duration and model performance the current defaults under automation of categoric binarization and numeric z-score normalization demonstrated merit to serve as default encodings for the Automunge library. We also found that, in addition to our default binarization, even a frequency sorted variant of ordinal encoding on average outperformed one hot encoding.

2 Gradient Boosting

Gradient boosting Friedman (2000) refers to a paradigm of decision tree learning Quinlan (1986) similar to random forests Breiman (2001), but in which the optimization is boosted by recursively training an iteration’s model objective to correct the performance of the preceding iteration’s model. It is commonly implemented in practice by way of the XGBoost library Chen and Guestrin (2016), which supports GPU acceleration, although architecture variations are available for different fortes, like LightGBM Ke et al. (2017), which may train faster on CPUs than XGBoost (with a possible performance tradeoff).

Gradient boosting has traditionally been found as a winning solution for tabular modality competitions on the Kaggle platform, and its competitive efficacy has even been demonstrated for more sophisticated applications like time series sequential learning when used for window based regression Elsayed et al. (2021). Recent tabular benchmarking papers have found that gradient boosting may still mostly outperform sophisticated neural architectures like transformers Gorishniy et al. (2021), although even a vanilla multilayer perceptron neural network could have capacity to outperform gradient boosting with comprehensively tuned regularizers Kadra et al. (2021). Gradient boosting can also be expected to have higher inference latency than neural networks Borisov et al. (2021).

Conventional wisdom is that one can expect gradient boosting models to have capacity for better performance than random forests for tabular applications, but with a tradeoff of increased probability of overfitting without hyperparameter tuning Howard and Gugger (2020). With both more sensitivity to tuning parameters and a much higher number of parameters in play than random forests, gradient boosting usually requires more sophistication than a simple grid or random search for tuning. One compromise method available is a sequential grid search through different subsets of parameters Jain (2016), although more automated and even parallelized methods are available by way of black box optimization libraries like Optuna Akiba et al. (2019). There will likely be more improvements to come both in libraries and tuning conventions; this is an active channel of industry research.
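
To make such a tuning loop concrete, the following is a minimal sketch, assuming a scikit-learn style data set (X, y), of searching XGBoost hyperparameters with Optuna and halting the study after a run of trials without improvement. The parameter names, ranges, and the NoImprovementStop helper are illustrative rather than the exact search space or stopping mechanism used in our benchmarks.

    # Illustrative sketch: tune an XGBoost classifier with Optuna, scoring each
    # trial by cross-validated f1 and stopping the study after a run of trials
    # without improvement. Parameter ranges are examples, not the paper's space.
    import optuna
    import xgboost as xgb
    from sklearn.model_selection import cross_val_score

    def objective(trial, X, y):
        params = {
            "max_depth": trial.suggest_int("max_depth", 2, 12),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "min_child_weight": trial.suggest_float("min_child_weight", 1e-2, 10.0, log=True),
        }
        model = xgb.XGBClassifier(**params, tree_method="hist")
        # 5-fold cross-validated f1 serves as the tuning signal
        return cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()

    class NoImprovementStop:
        """Stop the study after `patience` completed trials without a new best."""
        def __init__(self, patience=50):
            self.patience, self.best, self.stale = patience, None, 0
        def __call__(self, study, trial):
            if self.best is None or study.best_value > self.best:
                self.best, self.stale = study.best_value, 0
            else:
                self.stale += 1
                if self.stale >= self.patience:
                    study.stop()

    # usage:
    # study = optuna.create_study(direction="maximize")
    # study.optimize(lambda t: objective(t, X, y), callbacks=[NoImprovementStop(50)])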

3 Feature Encodings

Feature encoding refers to feature set transformations that serve to prepare the data for machine learning. Common forms of feature encoding preparations include normalizations for numeric sets and one hot encodings for categoric sets, although some learning libraries may accept categoric features in string representations for internal encodings. Before the advent of deep learning, it was common to supplement features with alternate representations of extracted information or to combine features in some fashion. Such practices of feature engineering are sometimes still applied in gradient boosted learning, and it was one of the purposes of these benchmarks to evaluate the benefits of the practice in comparison to directly training on the data.

An important distinction of feature encodings can be drawn between those that can be applied independent of an esoteric domain profile versus those that rely on external structure. An example could be the difference between supplementing a feature with bins derived from the distribution of populated numeric values versus extracting bins based on an external database lookup. In the case of Automunge, the internal library of encodings follows almost exclusively the former; that is, most encodings are based on inherent numeric or string properties and do not consider adjacent properties that could be inferred based on relevant application domains. (An exception is made for date-time formatted features, which under automation automatically extract bins for weekdays, business hours, and holidays, and redundantly encode entries based on cyclic periods of different time scales London (2016).) The library includes a simple template for integrating custom univariate transformations Teague (2020) if a user would like to integrate alternate conventions into a pipeline.
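
As an illustration of the cyclic encoding idea referenced above London (2016), the sketch below, which is not the Automunge implementation, maps a periodic feature such as hour-of-day onto a sine and cosine pair so that entries near the end of a period land close to entries near its start; the function name and column suffixes are hypothetical.

    # Hypothetical sketch of a cyclic (sine/cosine) encoding for a periodic
    # numeric feature; not the library's date-time transforms.
    import numpy as np
    import pandas as pd

    def cyclic_encode(series: pd.Series, period: float) -> pd.DataFrame:
        radians = 2 * np.pi * series / period
        return pd.DataFrame({
            f"{series.name}_sin": np.sin(radians),
            f"{series.name}_cos": np.cos(radians),
        })

    # e.g. cyclic_encode(df["hour"], period=24) keeps hour 23 adjacent to hour 0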

3.1 Numeric

Numeric normalizations in practice are most commonly applied similar to our default of z-score normalization ‘nmbr’ (subtract the mean and divide by the standard deviation) or min-max scaling ‘mnmx’ (converting to a range between 0–1). Other variations that may be found in practice include mean scaling ‘mean’ (subtract the mean and divide by the min-max delta) and max scaling ‘mxab’ (divide by the feature set’s absolute max). More sophisticated conventions may convert the distribution shape in addition to the scale, such as the Box-Cox power law transformation ‘bxcx’ Box and Cox (1964) or Scikit-Learn’s Pedregosa et al. (2011) quantile transformer ‘qttf’, both of which may serve the purpose of converting a feature set to more closely resemble a Gaussian distribution. In general, numeric normalizations are more commonly applied for learning paradigms other than those based on decision trees, where for example in neural networks they serve the purpose of normalizing gradient updates across features. We did find that the type of normalization applied to numeric features appeared to impact performance, and we will present these findings below.
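
For reference, a minimal sketch of the two most common scalings described above follows, with statistics fit on a training column and then applied to any partition; the helper names are illustrative, and the sketch is not the library’s ‘nmbr’ or ‘mnmx’ implementation.

    # Illustrative z-score and min-max scalings fit on a train-set basis.
    import pandas as pd

    def zscore_fit(train_col: pd.Series) -> dict:
        return {"mean": train_col.mean(), "std": train_col.std() or 1.0}

    def zscore_apply(col: pd.Series, p: dict) -> pd.Series:
        # subtract the training mean and divide by the training standard deviation
        return (col - p["mean"]) / p["std"]

    def minmax_fit(train_col: pd.Series) -> dict:
        return {"min": train_col.min(), "max": train_col.max()}

    def minmax_apply(col: pd.Series, p: dict) -> pd.Series:
        # shift to the 0-1 range based on the training min and max
        delta = (p["max"] - p["min"]) or 1.0
        return (col - p["min"]) / delta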

3.2 Categoric

Categoric encodings are most commonly derived in practice as a one hot encoding, where each unique entry in a received feature is translated to a boolean integer activation in a dedicated column among a returned set thereof. The practice of one hot encoding has shortcomings in the high cardinality case (where a categoric feature has an excessive number of unique entries), which in the context of gradient boosting may be particularly impactful, as an inflated column count impairs the latency performance of a training operation, or, when the feature is targeted as a classification label, may even cause training to exceed memory overhead constraints. The Automunge library attempts to circumvent this high cardinality edge case in two fashions: first by defaulting to a binarization encoding instead of one hot, and second by distinguishing the highest cardinality sets for a hashed encoding Moody (1988) Weinberger et al. (2009) Teague (2020), which may stochastically consolidate multiple unique entries into a shared ordinal representation for a reduced number of unique entries.
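
The hashed consolidation described above can be pictured with the following sketch, which is not the library’s ‘hsh2’ transform; the bucket count and helper name are hypothetical, and distinct entries that hash to the same bucket share an ordinal representation.

    # Hypothetical hashed ordinal encoding: map each (possibly unseen) category
    # to one of a fixed number of buckets, so cardinality stays bounded.
    import hashlib
    import pandas as pd

    def hash_encode(col: pd.Series, n_buckets: int = 32) -> pd.Series:
        def bucket(value):
            digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
            return int(digest, 16) % n_buckets  # collisions consolidate entries
        return col.map(bucket)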

The library default of categoric binarization ‘1010’ refers to translating each unique entry in a received feature to a distinct set of zero, one, or more boolean integer activations in a returned set of boolean integer columns. Where one hot encoding may return a set of n columns for n unique entries, binarization will instead return a smaller count of log2(n) rounded up to the nearest integer. We have previously seen the practice discussed in the blogging literature, such as Ravi (2019), although without validation as offered herein.
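
A minimal sketch of the binarization arithmetic follows, writing each category’s index in binary across ceil(log2(n)) columns; it is illustrative only, not the library’s ‘1010’ transform, and the handling of unseen entries is a simplifying assumption.

    # Illustrative binarization: n unique training entries map to
    # ceil(log2(n)) boolean columns via each category's binary index.
    import math
    import pandas as pd

    def binarize(train_col: pd.Series, col: pd.Series) -> pd.DataFrame:
        categories = {cat: i for i, cat in enumerate(pd.unique(train_col))}
        width = math.ceil(math.log2(max(2, len(categories))))
        def bits(value):
            index = categories.get(value, 0)  # unseen entries share index 0 here
            return [(index >> b) & 1 for b in range(width)]
        return pd.DataFrame(col.map(bits).tolist(),
                            columns=[f"{col.name}_{b}" for b in range(width)],
                            index=col.index)

    # e.g. 6 unique entries return 3 columns rather than the 6 of one hot encoding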

A third common variation on categoric representations includes ordinal encodings, which simply refers to returning a single column encoding of a feature with a distinct integer representation for each unique entry. Variations on ordinal encodings in the library may sort the integer representations by frequency of the unique entry ‘ord3’ or based on alphabetic sorting ‘ordl’.
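
The difference between the two ordinal variants can be sketched as below; the function names are hypothetical rather than the library’s ‘ord3’ or ‘ordl’ code, and assigning unseen entries a spare integer is a simplifying assumption.

    # Illustrative ordinal encodings: integers assigned by descending category
    # frequency (akin to 'ord3') versus by alphabetic order (akin to 'ordl').
    import pandas as pd

    def ordinal_by_frequency(train_col: pd.Series, col: pd.Series) -> pd.Series:
        order = train_col.value_counts().index  # most frequent category first
        mapping = {cat: i for i, cat in enumerate(order)}
        return col.map(mapping).fillna(len(mapping)).astype(int)

    def ordinal_alphabetic(train_col: pd.Series, col: pd.Series) -> pd.Series:
        order = sorted(train_col.astype(str).unique())
        mapping = {cat: i for i, cat in enumerate(order)}
        return col.astype(str).map(mapping).fillna(len(mapping)).astype(int)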

Another convention for categoric sets unique to the Automunge library we refer to as parsed categoric encodings ‘or19’ Teague (2022a). Parsed encodings search through tiers of string character subsets of unique entries to identify shared grammatical structure for supplementing encodings with structure derived from a training set basis. Parsed encodings are supplemented with extracted numeric portions of unique entries for additional information retention in the form received by training.

4 Benchmarking

The benchmarking sought to evaluate a range of numeric and categoric encoding scenarios by way of two key performance metrics: training time and model performance. Training was performed over the course of 1.5 weeks on a Lambda workstation with an AMD 3970X processor, 128Gb RAM, and two Nvidia 3080 GPUs. Training was performed by way of XGBoost tuned by Optuna with 5-fold fast cross-validation Swersky et al. (2013) and an early stopping criterion of 50 tuning iterations without improvement. Performance was evaluated against a partitioned 25% validation set based on an f1 score performance metric, which we understand is a good default for a balanced evaluation of the bias and variance performance of classification tasks Stevens et al. (2020). This loop was repeated and averaged across 5 iterations and then repeated and averaged across 31 tabular classification data sets sourced from the OpenML benchmarking repository Vanschoren et al. (2014). Rephrasing for clarity, the reported metrics are averages of 5 repetitions on 31 data sets for each encoding type as applied to all numeric or categoric features for training. The distribution bands shown in the figures are across the five repetitions. The data sets were selected for diverse tabular classification applications with in-memory scale training data and tractable label cardinality.
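
In simplified form, the evaluation protocol can be pictured as the loop below; encode, tune_and_fit, and datasets stand in for the study’s actual preprocessing, Optuna-tuned XGBoost training, and OpenML loading components, so the sketch conveys structure rather than the exact benchmark code.

    # Simplified benchmarking loop: for each data set, repeat a 75/25 split,
    # encode on a train-set basis, tune and fit, and score f1 on the held-out
    # 25%; scores average over repetitions and then over data sets.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    def benchmark(encoding, datasets, encode, tune_and_fit, repetitions=5):
        per_dataset = []
        for X, y in datasets:
            scores = []
            for seed in range(repetitions):
                X_tr, X_val, y_tr, y_val = train_test_split(
                    X, y, test_size=0.25, random_state=seed)
                X_tr_enc, X_val_enc = encode(encoding, X_tr, X_val)
                model = tune_and_fit(X_tr_enc, y_tr)
                preds = model.predict(X_val_enc)
                scores.append(f1_score(y_val, preds, average="macro"))
            per_dataset.append(np.mean(scores))
        return float(np.mean(per_dataset))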

We found that these benchmarks gave us comfort in the Automunge library’s defaults of numeric z-score normalization and categoric binarization. An interesting result was the outperformance of categoric binarization in comparison to one hot encoding, as the latter is commonly used as a default in mainstream practice. Further dialogue on the interpretation of the results presented in Figures 1 and 2 is provided in the Appendix.

(a) Numeric tuning time comparison
(b) Numeric model performance comparison
Figure 1: Numeric Results
(a) Categoric tuning time comparison
(b) Categoric model performance comparison
Figure 2: Categoric Results

5 Conclusion

We hope that these benchmarks may have provided some level of user comfort by validating the default encodings applied under automation by the Automunge library, z-score normalization and categoric binarization, from both a training time and a model performance standpoint. If you would like to try out the library, we recommend the tutorials folder found on GitHub Teague (2022b) as a starting point.

References

  • T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: a next-generation hyperparameter optimization framework. arXiv. External Links: Document, Link Cited by: §2.
  • V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci (2021) Deep neural networks and tabular data: a survey. arXiv. External Links: Document, Link Cited by: §2.
  • G. E. P. Box and D. R. Cox (1964) An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological) 26 (2), pp. 211–243. External Links: Document, Link, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.2517-6161.1964.tb00553.x Cited by: §3.1.
  • L. Breiman (2001) Random forests. Mach. Learn. 45 (1), pp. 5–32. External Links: ISSN 0885-6125, Link, Document Cited by: §2.
  • T. Chen and C. Guestrin (2016) XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, External Links: Document, Link Cited by: §2.
  • S. Elsayed, D. Thyssens, A. Rashed, H. S. Jomaa, and L. Schmidt-Thieme (2021) Do we really need deep learning models for time series forecasting?. arXiv. External Links: Document, Link Cited by: §2.
  • J. H. Friedman (2000) Greedy function approximation: a gradient boosting machine. Annals of Statistics 29, pp. 1189–1232. Cited by: §2.
  • I. J. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press, Cambridge, MA, USA. Note: http://www.deeplearningbook.org Cited by: §1.
  • Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko (2021) Revisiting deep learning models for tabular data. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), External Links: Link Cited by: §2.
  • J. Howard and S. Gugger (2020) Deep learning for coders with fastai and pytorch. O’Reilly Media. Cited by: §2.
  • A. Jain (2016) Complete guide to parameter tuning in xgboost with codes in python. Analytics Vidhya. External Links: Link Cited by: §2.
  • A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka (2021) Well-tuned simple nets excel on tabular datasets. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), External Links: Link Cited by: Appendix C, §2.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §2.
  • I. London (2016) Encoding cyclical continuous features — 24-hour time. External Links: Link Cited by: §3.
  • J. Moody (1988) Fast learning in multi-resolution hierarchies. In Advances in Neural Information Processing Systems, D. Touretzky (Ed.), Vol. 1, pp. . External Links: Link Cited by: §3.2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay (2011) Scikit-learn: machine learning in python. Journal of Machine Learning Research 12 (85), pp. 2825–2830. External Links: Link Cited by: §3.1.
  • J. R. Quinlan (1986) Induction of decision trees. Machine Learning 1, pp. 81–106. Cited by: §2.
  • R. Ravi (2019) One-hot encoding is making your tree-based ensembles worse, here’s why?. Towards Data Science. External Links: Link Cited by: §3.2.
  • E. Stevens, L. Antiga, and T. Viehmann (2020) Deep learning with pytorch. External Links: Link Cited by: §4.
  • K. Swersky, J. Snoek, and R. P. Adams (2013) Multi-task bayesian optimization. In Advances in Neural Information Processing Systems, C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.), Vol. 26, pp. . External Links: Link Cited by: §4.
  • N. J. Teague (2022a) Parsed categoric encodings with automunge. arXiv. External Links: Document, Link Cited by: 1st item, §3.2.
  • N. Teague (2020) Hashed categoric encodings with Automunge. External Links: Link Cited by: §3.2, §3.
  • N. Teague (2022b) Automunge. External Links: Link Cited by: §1, §5.
  • J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo (2014) OpenML. ACM SIGKDD Explorations Newsletter 15 (2), pp. 49–60. External Links: Document, Link Cited by: Appendix C, §4.
  • K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg (2009) Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1113–1120. External Links: ISBN 9781605585161, Link, Document Cited by: §3.2.

Appendix A Numeric Results

a.1 default

  • defaults for Automunge under automation as z-score normalization (‘nmbr’ code in the library)

  • The default encoding was validated both from a tuning duration and a model performance standpoint as the top performing scenario on average.

a.2 qttf

  • Scikit-Learn QuantileTransformer with a normal output distribution

  • The quantile distribution conversion did not perform as well on average as simple z-score normalization, although it remained a top performer.

a.3 powertransform

  • the Automunge option to conditionally encode between ‘bxcx’, ‘mnmx’, or ‘MAD3’ based on distribution properties (via the library’s powertransform=True setting)

  • This was the worst performing encoding scenario, which at a minimum demonstrates that the heuristics and statistical measures currently applied by the library to conditionally select types of encodings could use some refinement.

a.4 mnmx

  • min max scaling ‘mnmx’ which shifts a feature distribution into the range 0–1

  • This scenario performed considerably worse than z-score normalization, which we expect was due to cases where outlier values may have caused the predominantly populated region to get “squished together” in the encoding space.

a.5 capped quantiles

  • min max scaling with capped outliers at 0.99 and 0.01 quantiles (‘mnm3’ code in library)

  • This scenario is best compared directly to min-max scaling, and demonstrates that defaulting to capping outliers did not benefit performance on average.

a.6 binstransform

  • z-score normalization supplemented by 5 one hot encoded standard deviation bins (via library’s binstransform=True setting)

  • In addition to a widened range of tuning durations, the supplemental bins did not appear to be beneficial to model performance for gradient boosting.

Appendix B Categoric Results

b.1 default

  • defaults for Automunge under automation for categoric binarization (‘1010’ code in the library)

  • The default encoding was validated as top performing both from a tuning duration and a model performance standpoint.

b.2 onht

  • one hot encoding

  • The model performance impact was surprisingly negative compared to the default, considering this is often used as a default in mainstream practice. Based on this benchmark we recommend discontinuing use of one hot encoding outside of special use cases (e.g. for purposes of feature importance analysis).

b.3 ord3

  • ordinal encoding with integers sorted by category frequency ‘ord3’

  • Sorting ordinal integers by category frequency instead of alphabetically significantly benefited model performance, in most cases lifting ordinal above one hot encoding, although still not into the range of the default binarization.

b.4 ordl

  • ordinal encoding with integers sorted alphabetically by category ‘ordl’

  • Alphabetically sorted ordinal encodings (as is the default for Scikit-Learn’s OrdinalEncoder) did not perform as well; we recommend defaulting to frequency sorted integers when applying ordinal encodings.

b.5 hsh2

  • hashed ordinal encoding (library default for high cardinality categoric ‘hsh2’)

  • This benchmark was primarily included for reference; it was expected that, as some categories may be consolidated, there would be a performance impact for low cardinality sets. The benefit of hashing is for high cardinality features, which may otherwise impact gradient boosting memory overhead.

b.6 or19

  • multi-tier string parsing ‘or19’ Teague (2022a)

  • It appears that our recent invention of multi-tier string parsing succeeded in outperforming one hot encoding and was the second top performer, but did not perform well enough to recommend it as a default in comparison to vanilla binarization. We recommend reserving string parsing for cases where the application may have some extended structure associated with grammatical content, as was validated as outperforming binarization for an example in the citation.

Appendix C Data Sets

The benchmarking included the following tabular data sets, shown here with their OpenML ID numbers. A thank you to Vanschoren et al. (2014) for providing the data sets and Kadra et al. (2021) for inspiring the composition.

  • Click prediction / 233146

  • C.C.FraudD. / 233143

  • sylvine / 233135

  • jasmine / 233134

  • fabert / 233133

  • APSFailure / 233130

  • MiniBooNE / 233126

  • volkert / 233124

  • jannis / 233123

  • numerai28.6 / 233120

  • Jungle-Chess-2pcs / 233119

  • segment / 233117

  • car / 233116

  • Australian / 233115

  • higgs / 233114

  • shuttle / 233113

  • connect-4 / 233112

  • bank-marketing / 233110

  • blood-transfusion / 233109

  • nomao / 233107

  • ldpa / 233106

  • skin-segmentation / 233104

  • phoneme / 233103

  • walking-activity / 233102

  • adult / 233099

  • kc1 / 233096

  • vehicle / 233094

  • credit-g / 233088

  • mfeat-factors / 233093

  • arrhythmia / 233092

  • kr-vs-kp / 233091

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work?

    3. Did you discuss any potential negative societal impacts of your work?

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? (see supplemental material notebooks)

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?