1 Introduction
In contrast to the mainstream in deep learning (DL), in this paper we focus on tabular data, a domain that we feel is understudied in DL. Nevertheless, it is of great relevance for many practical applications, such as climate science, medicine, manufacturing, finance, and recommender systems. During the last decade, traditional machine learning methods, such as Gradient-Boosted Decision Trees (GBDT) (Chen:2016:XST:2939672.2939785), dominated tabular data applications due to their superior performance, and the success story of DL on raw data (e.g., images, speech, and text) stopped short of tabular data.
Even in recent years, the existing literature still gives mixed messages on the state-of-the-art status of deep learning for tabular data. While some recent neural network methods (49698; Popov2020Neural) claim to outperform GBDT, others confirm that GBDT remain the most accurate method on tabular data (10.5555/3326943.3327070; katzir2021netdnf). The extensive experiments on 40 datasets we report indeed confirm that recent neural networks (49698; Popov2020Neural; DBLP:journals/corr/abs200306505) do not outperform GBDT when the hyperparameters of all methods are thoroughly tuned.
We hypothesize that the key to improving the performance of neural networks on tabular data lies in exploiting recent DL advances in regularization techniques (reviewed in Section 3), such as data augmentation, residual blocks, model averaging (e.g., dropout or snapshot ensembles), or learning dynamics (e.g., the lookahead optimizer or stochastic weight averaging). Indeed, we find that even plain Multilayer Perceptrons (MLPs) achieve state-of-the-art results when regularized by multiple modern regularization techniques applied jointly.
Applying multiple regularizers jointly is already common practice, and practitioners routinely mix regularization techniques (e.g., Dropout with early stopping and weight decay). However, the deeper question of "Which subset of regularizers, among dozens of available methods, yields the largest generalization performance on a particular dataset?" remains unanswered, as practitioners currently combine regularizers via inefficient trial-and-error procedures. In this paper, we provide a simple, yet principled answer to that question by posing the selection of the optimal subset of regularization techniques and their inherent hyperparameters as a joint search, for each dataset, over a pool of 13 modern regularization techniques and their subsidiary hyperparameters (Section 4).
From an empirical perspective, this paper is the first to provide compelling evidence that well-regularized neural networks (even simple MLPs!) indeed surpass the current state-of-the-art models on tabular datasets, including recent neural network architectures and GBDT (Section 6). In fact, the performance improvements are quite pronounced and highly significant. We believe this finding to potentially have far-reaching implications and to open up a garden of delights of new applications of DL on tabular datasets.
Our contributions are as follows:

We demonstrate that modern DL regularizers (developed for DL applications on raw data, such as images, speech, or text) also substantially improve the performance of deep multilayer perceptrons on tabular data.

We propose a simple, yet principled, paradigm for selecting the optimal subset of regularization techniques and their subsidiary hyperparameters (so-called regularization cocktails).

We demonstrate that these regularization cocktails enable even simple MLPs to outperform both recent neural network architectures and traditional strong ML methods, such as GBDT, on tabular data. Specifically, we are the first to show neural networks significantly (and substantially) outperforming XGBoost in a fair, large-scale experimental study.
2 Related Work on Deep Learning for Tabular Data
Recently, various neural architectures have been proposed for improving the performance of neural networks on tabular data. TabNet (49698) introduced a sequential attention mechanism for capturing salient features. Neural oblivious decision ensembles (NODE (Popov2020Neural)) blend the concept of hierarchical decisions into neural networks. Regularization learning networks train a regularization strength for every neural weight by posing the problem as a large-scale hyperparameter tuning scheme (10.5555/3326943.3327070). The recent Net-DNF technique introduces a novel inductive bias in the neural structure corresponding to logical Boolean formulas in disjunctive normal form (katzir2021netdnf). An approach that is often mistaken for deep learning on tabular data is AutoGluon Tabular (DBLP:journals/corr/abs200306505): it builds ensembles of basic neural networks together with other traditional ML techniques, with its key contribution being a strong stacking approach. We emphasize that some of these publications claim to outperform Gradient-Boosted Decision Trees (GBDT) (49698; Popov2020Neural), while other papers explicitly stress that their neural networks do not outperform GBDT on tabular datasets (10.5555/3326943.3327070; katzir2021netdnf). In contrast, we do not propose a new kind of neural architecture, but a novel paradigm for learning a combination of regularization methods.
3 An Overview of Regularization Methods for Deep Learning
Weight decay: The most classical approaches to regularization minimize the norms of the parameter values, e.g., the L1 norm (tibshirani96regression), the L2 norm (Tikhonov1943OnTS), or a combination of the two known as the Elastic Net (zou2005regularization). A recent work fixes the malpractice of adding the decay penalty term before momentum-based adaptive learning rate steps (e.g., in common implementations of Adam (kingma:adam)) by decoupling the regularization from the loss and applying it after the learning rate computation (loshchilov2018decoupled).
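The difference between the classical penalty and its decoupled variant can be sketched with plain SGD-style updates; `lr` and `wd` below are illustrative hyperparameter values, not values from the paper:

```python
def step_l2(w, grad, lr, wd):
    # Classical L2 penalty: the decay term wd * w is added to the loss
    # gradient, so an adaptive optimizer would rescale it together with
    # its gradient statistics.
    return w - lr * (grad + wd * w)

def step_decoupled(w, grad, lr, wd):
    # Decoupled weight decay (as in AdamW): the decay shrinks the weights
    # directly, after the gradient-based step has been computed.
    return w - lr * grad - wd * w

w_l2 = step_l2(1.0, 0.5, lr=0.1, wd=0.01)           # 1 - 0.1*(0.5 + 0.01)
w_dec = step_decoupled(1.0, 0.5, lr=0.1, wd=0.001)  # 1 - 0.05 - 0.001
```

For plain SGD the two rules coincide up to a rescaling of the decay factor by the learning rate; under Adam's adaptive steps they genuinely differ, which is the point of the decoupling.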
Data Augmentation: Among the augmentation regularizers, CutOut (Devries2017ImprovedRO) proposes masking a subset of input features (e.g., pixel patches for images) to ensure that predictions remain invariant to distortions in the input space. Along similar lines, MixUp (zhang2018mixup) generates new instances as convex combinations of pairs of training examples, while CutMix (yun2019cutmix) superimposes instance pairs with mutually exclusive pixel masks. A recent technique, AugMix (hendrycks*2020augmix), generates instances by sampling chains of augmentation operations. The direction of reinforcement learning (RL) for augmentation policies was elaborated by AutoAugment (Cubuk_2019_CVPR), followed by a technique that speeds up the training of the RL policy (NIPS2019_8892). Last but not least, adversarial attack strategies (e.g., FGSM (43405)) generate synthetic examples with minimal perturbations, which are employed in training robust models (madry2018towards).
Model Averaging:
Ensembles of machine learning models have been shown to reduce variance and act as regularizers (polikar_ensemble_2012). A popular ensemble of neural networks with shared weights among its base models is Dropout (10.5555/2627435.2670313), which was extended to a variational version with a Gaussian posterior over the model parameters (10.5555/2969442.2969527). As a follow-up, MixOut (Lee2020Mixout:) extends Dropout by statistically fusing the parameters of two base models. Furthermore, so-called "snapshot ensembles" (huang_snapshot_2016) can be created from models at intermediate convergence points of stochastic gradient descent with restarts (loshchilovICLR17SGDR).
Structural and Linearization: In terms of structural regularization, ResNet adds skip connections across layers (7780459), while the Inception model computes latent representations by aggregating diverse convolutional filter sizes (szegedy2017inception). A recent trend adds a dosage of linearization to deep models, where skip connections transfer embeddings from earlier, less non-linear layers (7780459; huang2017densely). Along similar lines, Shake-Shake regularization deploys skip connections in parallel convolutional blocks and aggregates the parallel representations through affine combinations (DBLP:conf/iclr/Gastaldi17), while ShakeDrop extends this mechanism to a larger number of CNN architectures (yamada2018shakedrop).
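The skip connections discussed above reduce to a few lines of arithmetic; the following plain-Python residual block is only a sketch (real implementations operate on tensors), showing how the block's input is carried forward unchanged alongside the learned transformation:

```python
def dense_relu(x, weights, bias):
    # One fully connected layer with a ReLU non-linearity.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    # Skip connection: output = x + F(x). If the layers learn nothing
    # (F(x) = 0), the block reduces to the identity mapping, which eases
    # optimization and adds a dosage of linearization.
    h = dense_relu(x, w1, b1)
    h = dense_relu(h, w2, b2)
    return [xi + hi for xi, hi in zip(x, h)]
```

With all weights and biases at zero, the block passes its input through unchanged, which is exactly the easy-to-learn identity that plain stacked layers lack.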
Implicit: The last family of regularizers broadly encapsulates methods that do not directly propose novel regularization techniques but have an implicit regularization effect as a virtue of their 'modus operandi' (NIPS2019_8960). For instance, Batch Normalization improves generalization by reducing internal covariate shift (pmlrv37ioffe15), while early stopping of the optimization procedure yields a similar generalization effect (Yao2007). Stabilizing the convergence of the training routine is another form of implicit regularization, for instance via learning rate scheduling schemes (loshchilovICLR17SGDR). The recent strategy of stochastic weight averaging averages parameter values from the local optima encountered along the sequence of optimization steps (izmailov2018averaging), while another approach conducts updates in the direction of a few 'lookahead' steps (DBLP:conf/nips/ZhangLBH19).
4 Regularization Cocktails for Multilayer Perceptrons
4.1 Problem Definition
A training set $\mathcal{D}^{\mathrm{train}} = \{(x_i, y_i)\}_{i=1}^{N}$ is composed of features $x_i \in \mathcal{X}$ and targets $y_i \in \mathcal{Y}$, while the test dataset is denoted by $\mathcal{D}^{\mathrm{test}}$. A parametrized function $\hat{y} = f(x; \theta)$, i.e., a neural network, approximates the targets, where the parameters $\theta$ are trained to minimize a differentiable loss function $\mathcal{L}(y, \hat{y})$, as $\min_{\theta} \sum_{(x,y) \in \mathcal{D}^{\mathrm{train}}} \mathcal{L}(y, f(x; \theta))$. To generalize, i.e., to also minimize the loss on $\mathcal{D}^{\mathrm{test}}$, the parameters $\theta$ of $f$ are controlled with a regularization technique that avoids overfitting to the peculiarities of the training data. With a slight abuse of notation, we denote $f(x; \theta, \Omega(\lambda))$ to be the predictions of the model whose parameters $\theta$ are optimized under the regime of the regularization method $\Omega$, where $\lambda$ represents the hyperparameters of $\Omega$. The training data is further divided into two subsets, a training and a validation split, the latter denoted by $\mathcal{D}^{\mathrm{val}}$, such that $\lambda$ can be tuned on the validation loss via the following hyperparameter optimization objective:

$$\lambda^{*} \in \operatorname*{arg\,min}_{\lambda} \sum_{(x,y)\in\mathcal{D}^{\mathrm{val}}} \mathcal{L}\left(y, f\left(x; \theta^{*}, \Omega(\lambda)\right)\right), \quad \text{s.t.}\;\; \theta^{*} \in \operatorname*{arg\,min}_{\theta} \sum_{(x,y)\in\mathcal{D}^{\mathrm{train}}} \mathcal{L}\left(y, f\left(x; \theta, \Omega(\lambda)\right)\right) \tag{1}$$

After finding the optimal (or, in practice, at least a well-performing) configuration $\lambda^{*}$, we refit $f$ on the entire training dataset, i.e., on $\mathcal{D}^{\mathrm{train}} \cup \mathcal{D}^{\mathrm{val}}$.
While the search for optimal hyperparameters is an active field of research in the realm of AutoML (automl_book), the choice of the regularizer mostly remains an ad-hoc practice, where practitioners select a few combinations among popular regularizers (Dropout, L2, Batch Normalization, etc.). In contrast to prior studies, we hypothesize that the optimal regularizer is a cocktail mixture of a large set of regularization methods, all being simultaneously applied with different strengths (i.e., dataset-specific hyperparameters). Given a set of regularizers $\{\Omega_1, \dots, \Omega_K\}$, each with its own hyperparameters $\lambda^{(1)}, \dots, \lambda^{(K)}$, the problem of finding the optimal cocktail of regularizers is:

$$\lambda^{*} \in \operatorname*{arg\,min}_{\lambda^{(1)},\dots,\lambda^{(K)}} \sum_{(x,y)\in\mathcal{D}^{\mathrm{val}}} \mathcal{L}\left(y, f\left(x; \theta^{*}, \Omega_1(\lambda^{(1)}), \dots, \Omega_K(\lambda^{(K)})\right)\right), \quad \text{s.t.}\;\; \theta^{*} \in \operatorname*{arg\,min}_{\theta} \sum_{(x,y)\in\mathcal{D}^{\mathrm{train}}} \mathcal{L}\left(y, f\left(x; \theta, \Omega_1(\lambda^{(1)}), \dots, \Omega_K(\lambda^{(K)})\right)\right) \tag{2}$$

The intuitive interpretation of Equation 2 is searching for the optimal hyperparameters (i.e., strengths) of the cocktail's regularizers using the validation set, given that the optimal prediction-model parameters $\theta^{*}$ are trained under the regime of all the regularizers applied jointly. We stress that, for each regularizer $\Omega_k$, the hyperparameters $\lambda^{(k)}$ include a conditional hyperparameter controlling whether the $k$-th regularizer is applied at all or skipped; the best cocktail might therefore comprise only a subset of the regularizers.
4.2 Cocktail Search Space
To build our regularization cocktails we combine the 13 regularization methods listed in Table 1, which are selected among the categories of regularizers covered in Section 3. The regularization cocktail’s search space with the exact ranges for the selected regularizers’ hyperparameters is given in the same table. In total, the optimal cocktail is searched in a space of 19 hyperparameters.
While we can in principle use any hyperparameter optimization method, we use the multi-fidelity Bayesian optimization method BOHB (falknericml18), since it achieves strong performance across a wide range of computing budgets by combining Hyperband (10.5555/3122009.3242042) with Bayesian Optimization (DBLP:journals/jgo/Mockus94) while retaining the convergence guarantees of Hyperband. Furthermore, BOHB can deal with the categorical hyperparameters we use for enabling or disabling regularization techniques, and with the corresponding conditional structures. Appendix A provides a brief description of how BOHB works. Some of the regularization methods cannot be combined, and we therefore introduce the following constraints in the proposed search space: (i) Shake-Shake and ShakeDrop are not simultaneously active, since the latter builds on the former; (ii) only one data augmentation technique out of MixUp, CutMix, CutOut, and FGSM adversarial learning can be active at once, due to a technical limitation of the base library we use (DBLP:journals/corr/abs200613799).
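To make the conditional structure concrete, the following sketch samples a cocktail configuration uniformly at random; the hyperparameter names and ranges are illustrative placeholders, not the exact entries of Table 1, and a real run would delegate this sampling to BOHB's model-based optimizer:

```python
import random

AUGMENTATIONS = ["mixup", "cutmix", "cutout", "fgsm", None]

def sample_cocktail(rng):
    cfg = {}
    # Conditional structure: a strength hyperparameter exists only when the
    # corresponding Boolean 'active' flag is switched on.
    cfg["weight_decay"] = rng.uniform(1e-5, 1e-1) if rng.random() < 0.5 else None
    cfg["dropout_rate"] = rng.uniform(0.0, 0.8) if rng.random() < 0.5 else None
    # Constraint (ii): at most one data-augmentation technique active at once.
    cfg["augmentation"] = rng.choice(AUGMENTATIONS)
    # Constraint (i): Shake-Shake and ShakeDrop are mutually exclusive.
    cfg["shake"] = rng.choice(["shake-shake", "shake-drop", None])
    return cfg

cocktail = sample_cocktail(random.Random(0))
```

Encoding each constraint directly in the sampling step guarantees that no invalid cocktail is ever proposed, rather than rejecting invalid configurations after the fact.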
Table 1: The regularization cocktail search space.

Group          Regularizer   Hyperparameter     Type         Range   Conditionality
Implicit       BN            BN-active          Boolean              –
               SWA           SWA-active         Boolean              –
               LA            LA-active          Boolean              –
                             Step size          Continuous           LA-active
                             Num. steps         Integer              LA-active
W. Decay       WD            WD-active          Boolean              –
                             Decay factor       Continuous           WD-active
M. Averaging   DO            DO-active          Boolean              –
                             Dropout shape      Nominal              DO-active
                             Drop rate          Continuous           DO-active
               SE            SE-active          Boolean              –
Structural     SC            SC-active          Boolean              –
                             MB choice          Nominal              SC-active
               SD            Max. probability   Continuous
               SS            –                  –                    –
Augmentation   Augment       –                  Nominal              –
               MU            Mix. magnitude     Continuous
               CM            Probability        Continuous
               CO            Probability        Continuous
                             Patch ratio        Continuous
               AT            –                  –                    –
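Among the augmentation entries above, MixUp (MU) has a particularly compact form. The sketch below is an illustrative stand-alone version for tabular feature vectors with one-hot targets, not the paper's implementation:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Draw a mixing coefficient from a Beta(alpha, alpha) distribution and
    # form a convex combination of two training examples and of their
    # (one-hot encoded) targets.
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

Because the targets are mixed with the same coefficient as the features, the model is trained to behave linearly between training examples, which is the regularizing effect.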
5 Experimental Protocol
5.1 Experimental Setup and Datasets
We use a large collection of 40 tabular datasets (listed in Table 7 of Appendix D). This includes 31 datasets from the recent open-source OpenML AutoML Benchmark (amlb2019) (the remaining 8 datasets from that benchmark were too large to run effectively on our cluster). In addition, we added 9 popular datasets from UCI (asuncion2007uci) and Kaggle that contain roughly 100K+ instances. Our resulting benchmark of 40 datasets represents diverse classification problems, containing between 452 and 416,188 data points and between 4 and 2,001 features, varying in terms of the numbers of numerical and categorical features. The datasets are retrieved from the OpenML repository (vanschoren2014openml) and split into training, validation, and testing sets. The data is standardized to zero mean and unit variance, where the statistics for the standardization are calculated on the training split.

We ran all experiments on a CPU cluster, each node of which contains two Intel Xeon E5-2630v4 CPUs at 2.2GHz with 20 CPU cores and a total memory of 128GB. We chose the PyTorch library (paszke2019pytorch) as our deep learning framework and extended the AutoDL framework Auto-PyTorch (mendozaautomlbook18a; DBLP:journals/corr/abs200613799) with our implementations of the regularizers in Table 1.

To optimally utilize resources, we ran BOHB with 10 workers in parallel, where each worker had access to 2 CPU cores and 12GB of memory, executing one configuration at a time. Taking into account the dimensions of the considered configuration spaces, we ran BOHB for at most 4 days or a fixed maximum number of hyperparameter configurations, whichever came first. During the training phase, each configuration was run for 105 epochs, in accordance with the cosine learning rate annealing with restarts (described in the following subsection). For the sake of studying the effect on more datasets, we only evaluated a single train-val-test split. After the training phase is completed, we report the results of the best hyperparameter configuration found, retrained on the joint train and validation set.
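The leakage-free standardization step above (statistics from the training split only, then applied to every split) can be sketched as follows; a real pipeline would use a library scaler, and the column-wise data layout here is an illustrative simplification:

```python
import statistics

def fit_standardizer(train_columns):
    # Per-feature mean and standard deviation, computed on the training
    # split only; a zero deviation is replaced by 1.0 to avoid dividing
    # by zero for constant features.
    return [(statistics.mean(col), statistics.pstdev(col) or 1.0)
            for col in train_columns]

def transform(columns, stats):
    # Apply the *training* statistics to any split, so that no information
    # leaks from validation or test data into the preprocessing.
    return [[(v - m) / s for v in col] for col, (m, s) in zip(columns, stats)]

train = [[1.0, 2.0, 3.0]]             # one feature, three training rows
stats = fit_standardizer(train)
test_std = transform([[4.0]], stats)  # test row scaled with training stats
```

Fitting the scaler on the full dataset instead would leak test-set statistics into training, optimistically biasing the reported accuracies.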
5.2 Fixed Architecture and Optimization Hyperparameters
In order to focus exclusively on investigating the effect of regularization, we fix the neural architecture to a simple multilayer perceptron (MLP) and also fix some hyperparameters of the general training procedure. These fixed hyperparameter values, as specified in Table 3 of Appendix B.1, have been tuned to maximize the performance of an unregularized neural network on our dataset collection (see Table 7 in Appendix D). We use a 9-layer feed-forward neural network with 512 units per layer, a choice motivated by previous work (orhan2017skip). Moreover, we set a low learning rate after performing a grid search for the best value across datasets. We use AdamW (loshchilov2018decoupled), which implements decoupled weight decay, and cosine annealing with restarts (loshchilovICLR17SGDR) as a learning rate scheduler. Using a learning rate scheduler with restarts helps in our case because we keep a fixed initial learning rate. For the restarts, we use an initial budget of 15 epochs with a budget multiplier of 2, following published practices (DBLP:journals/corr/abs200613799). Additionally, since our benchmark includes imbalanced datasets, we use a weighted version of categorical cross-entropy as the loss and balanced accuracy (brodersen2010balanced) as the evaluation metric.
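The restart schedule above (initial budget of 15 epochs, multiplier 2) explains the 105 training epochs used in Section 5.1: the cycles last 15, 30, and 60 epochs. A minimal sketch of the resulting learning-rate sequence, with `base_lr` as an illustrative value:

```python
import math

def cosine_restart_lrs(base_lr, initial_budget=15, multiplier=2, cycles=3):
    # Within each cycle the learning rate decays from base_lr towards 0
    # along a half cosine; at every restart it jumps back to base_lr and
    # the cycle budget is multiplied.
    lrs, budget = [], initial_budget
    for _ in range(cycles):
        for epoch in range(budget):
            lrs.append(0.5 * base_lr * (1 + math.cos(math.pi * epoch / budget)))
        budget *= multiplier
    return lrs

schedule = cosine_restart_lrs(0.001)  # cycles of 15 + 30 + 60 = 105 epochs
```

The periodic jumps back to the base learning rate are what allow snapshot ensembles to collect diverse models from the intermediate convergence points.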
5.3 Research Hypotheses and Associated Experiments
 Hypothesis 1:

Regularization cocktails outperform state-of-the-art deep learning architectures on tabular datasets.
 Experiment 1:

We compare our well-regularized MLPs against the recently proposed deep learning architectures Node (Popov2020Neural) and TabNet (49698). We also compare against AutoGluon Tabular (DBLP:journals/corr/abs200306505), and we add an unregularized version of our MLP for reference, as well as a version of our MLP regularized only with Dropout (where the Dropout hyperparameters are tuned on every dataset).
 Hypothesis 2:

Regularization cocktails outperform Gradient-Boosted Decision Trees, the most commonly used traditional ML method for tabular data.
 Experiment 2:

We compare against state-of-the-art classifiers for tabular data. In particular, we compare against Gradient-Boosted Decision Trees (GBDT), the de-facto state-of-the-art for tabular datasets. We use two different implementations of GBDT: the implementation from scikit-learn (scikitlearn), optimized by Auto-sklearn (autosklearn), and the popular XGBoost (Chen:2016:XST:2939672.2939785).
5.4 Experimental Setup for the Baselines
All baselines use the same train, validation, and test splits, the same seed, and the same HPO resources and constraints as for our automatically constructed regularization cocktails (4 days on 20 CPU cores with 128GB of memory). After finding the best incumbent configuration, the baselines are refitted on the union of the training and validation sets and evaluated on the test set. The baselines consist of two recent neural architectures, AutoGluon Tabular with neural networks, and two implementations of GBDT, as follows:
 TabNet:

This library does not provide an HPO algorithm by default; therefore, we also used BOHB for this search space, with the hyperparameter value ranges recommended by the authors (49698).
 Node:

This library does not offer an HPO algorithm by default. We performed a grid search over the hyperparameter value ranges proposed by the authors (Popov2020Neural); however, we faced multiple memory and runtime issues when running the code. To overcome these issues, we used the default hyperparameters from the authors' public implementation (https://github.com/Qwicen/node/blob/master/).
 AutoGluon Tabular:

This library constructs stacked ensembles with bagging over diverse neural network architectures having various kinds of regularization (DBLP:journals/corr/abs200306505). The training of the stacking ensemble of neural networks and its hyperparameter tuning are integrated into the library. While AutoGluon Tabular by default uses a broad range of traditional ML techniques, here, in order to study it as a "pure" deep learning method, we restrict it to use only neural networks as base learners.
 ASK-GBDT:

The GBDT implementation of scikit-learn offered by Auto-sklearn (autosklearn) uses SMAC for HPO, and we used the default hyperparameter search space given by the library.
 XGBoost:

The original library (Chen:2016:XST:2939672.2939785) does not incorporate an HPO algorithm by default, so we used BOHB for its HPO. We defined a search space for XGBoost's hyperparameters following best practices in the community; we describe it in Appendix B.2.
For in-depth details about the different baseline configurations, including the exact hyperparameter search spaces, please refer to Appendix B.2.
6 Experimental Results
Table 2 presents the comparative results of our MLPs regularized with the proposed regularization cocktails against seven baselines: two state-of-the-art architectures, AutoGluon Tabular with neural networks, two Gradient-Boosted Decision Tree (GBDT) implementations, and two reference MLPs (unregularized and regularized only with Dropout). It is worth re-emphasizing that the hyperparameters of all the presented baselines (except the unregularized MLP, which has no hyperparameters) are carefully tuned on a validation set, as detailed in Section 5 and the appendices referenced therein. The table entries represent the test sets' balanced accuracies achieved over the described large-scale collection of 40 datasets. Figure 1 visualizes these results, showing substantial improvements for our method.
To assess statistical significance, we analyze the ranks of the classification accuracies across the 40 datasets. We use the Critical Difference (CD) diagram of the ranks based on the Wilcoxon significance test, a standard tool for comparing classifiers across multiple datasets (10.5555/1248547.1248548). The overall empirical comparison of the elaborated methods is given in Figure 2. The analysis of the neural network baselines in Subplot 1(a) reveals a clear statistical significance of the regularization cocktails over the other methods. Apart from AutoGluon, the other neural architectures are not competitive even against an MLP regularized only with Dropout and optimized with our standard, fixed training pipeline of Adam with cosine annealing. To be even fairer to the weaker baselines (TabNet and Node), we tried boosting them by adding early stopping (indicated with "+ES"), but their rank did not improve. Overall, the large-scale experimental analysis shows that Hypothesis 1 in Section 5.3 is validated: well-regularized simple deep MLPs outperform specialized neural architectures.
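The rank analysis behind such CD diagrams can be sketched as follows (ranks only; the Wilcoxon significance test itself is omitted). Rank 1 denotes the best method on a dataset, and tied methods share the average rank:

```python
def ranks(scores):
    # Rank the methods on one dataset by accuracy (higher is better).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Group tied scores so they share the average of their positions.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in order[i:j + 1]:
            result[k] = shared
        i = j + 1
    return result

def average_ranks(per_dataset_scores):
    # Average each method's rank over all datasets, as plotted in a
    # critical-difference diagram.
    all_ranks = [ranks(scores) for scores in per_dataset_scores]
    n = len(all_ranks)
    return [sum(col) / n for col in zip(*all_ranks)]
```

Averaging ranks rather than accuracies prevents a few datasets with large accuracy spreads from dominating the comparison.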
Next, we analyze the empirical significance of our well-regularized MLPs against the GBDT implementations in Figure 1(b). The results show that our MLPs outperform both GBDT variants (XGBoost and Auto-sklearn) by a statistically significant margin. We added early stopping ("+ES") to XGBoost, but it did not improve its performance. Among the GBDT implementations, XGBoost has a non-significant margin over Auto-sklearn. We conclude that well-regularized simple deep MLPs outperform GBDT, which validates Hypothesis 2 in Section 5.3.
The final cumulative comparison in Figure 1(c) provides a further result: none of the specialized previous deep learning methods (TabNet, NODE, AutoGluon Tabular) significantly outperforms GBDT. To the best of our knowledge, this paper is therefore the first to demonstrate that neural networks beat GBDT by a statistically significant margin over a large-scale experimental protocol that conducts a thorough hyperparameter optimization for all methods.
Lastly, Figure 3 provides a further analysis of the most prominent regularizers in the MLP cocktails, based on the frequency with which our HPO procedure selected each regularization method for a dataset's cocktail. The left plot shows the most frequent individual regularizers, while in the right plot the frequencies are grouped by type of regularizer. The grouping reveals that the cocktail for a dataset often has at least one ingredient from every regularization family (detailed in Section 3), highlighting the need for jointly applying diverse regularization methods.
7 Conclusion
Summary. Focusing on the important domain of tabular datasets, this paper studied improvements to deep learning (DL) through better regularization. We presented regularization cocktails, per-dataset-optimized combinations of many regularization techniques, and demonstrated that these improve the performance of even simple neural networks enough to substantially and significantly surpass XGBoost, the current state-of-the-art method for tabular datasets. We conducted a large-scale experiment involving 13 regularization methods and 40 datasets and empirically showed that (i) modern DL regularization methods developed in the context of raw data (e.g., vision, speech, text) substantially improve the performance of deep neural networks on tabular data; (ii) regularization cocktails significantly outperform recent neural network architectures; and, most importantly, (iii) regularization cocktails outperform GBDT on tabular datasets.
Limitations.
Compared to traditional machine learning methods, such as XGBoost, fitting deep neural networks is slow, and our regularization cocktails require per-dataset hyperparameter optimization on top. Therefore, in many data science applications, practitioners may currently still prefer the cheaper, albeit less accurate, traditional methods. To comprehensively study basic principles, we have also chosen an empirical evaluation that has many limitations. We only studied classification, not regression. We only used somewhat balanced datasets (the ratio of the minority class to the majority class is above 0.05). We did not study the regime of extremely few data points (our smallest dataset contained 452 data points, our largest 416,188). We also did not study datasets with extreme outliers, missing labels, semi-supervised data, streaming data, and the many other modalities in which tabular data arises. An important point worth noting is that the recent neural network architectures (Section 5.4) could also benefit from our regularization cocktails; however, integrating the regularizers into the baseline libraries requires considerable coding effort.

Future Work. This work opens up the door for a wealth of exciting follow-up research. Firstly, the per-dataset optimization of regularization cocktails may be substantially sped up by using meta-learning across datasets (metalearning_vanschoren). Secondly, as we have used a fixed neural architecture, our method's performance may be further improved by joint architecture and hyperparameter optimization. Thirdly, regularization cocktails should also be tested under all the data modalities listed under "Limitations" above. In addition, it is interesting to validate the gain of integrating our well-regularized MLPs into modern AutoML libraries, combining them with enhanced feature preprocessing and ensembling.
Takeaway. Even simple neural networks can achieve competitive classification accuracies on tabular datasets when they are well regularized, using datasetspecific regularization cocktails found via standard hyperparameter optimization.
Acknowledgements. The authors acknowledge funding by the Robert Bosch GmbH and the Eva Mayr-Stihl foundation. A part of this work was supported by the German Federal Ministry of Education and Research (BMBF, grant RenormalizedFlows 01IS19077C). The authors acknowledge support by the state of Baden-Württemberg through bwHPC and by the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG.
References

(1) S. Arik and T. Pfister. TabNet: Attentive interpretable tabular learning. In AAAI Conference on Artificial Intelligence, 2021.
 (2) S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7413–7424. Curran Associates, Inc., 2019.
 (3) A. Asuncion and D. Newman. UCI machine learning repository, 2007.

(4) K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M. Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE, 2010.
 (5) T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.

(6) E. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. Le. AutoAugment: Learning augmentation strategies from data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 (7) J. Demšar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30, December 2006.
 (8) T. Devries and G. Taylor. Improved regularization of convolutional neural networks with cutout. ArXiv, abs/1708.04552, 2017.
 (9) N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. CoRR, abs/2003.06505, 2020.
 (10) S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 1436–1445, July 2018.
 (11) M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 2, page 2755–2763. MIT Press, 2015.
 (12) X. Gastaldi. Shake-Shake regularization of 3-branch residual networks. In 5th International Conference on Learning Representations, ICLR. OpenReview.net, 2017.
 (13) P. Gijsbers, E. LeDell, S. Poirier, J. Thomas, B. Bischl, and J. Vanschoren. An open source AutoML benchmark. arXiv preprint arXiv:1907.00909 [cs.LG], 2019. Accepted at the AutoML Workshop at ICML 2019.
 (14) I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR, 2015.
 (15) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 (16) D. Hendrycks, N. Mu, E. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan. Augmix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations, 2020.
 (17) G. Huang, Y. Li, G. Pleiss, Z. Liu, J. Hopcroft, and K. Weinberger. Snapshot Ensembles: Train 1, Get M for Free. International Conference on Learning Representations, November 2017.
 (18) G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 (19) F. Hutter, L. Kotthoff, and J. Vanschoren, editors. Automated Machine Learning: Methods, Systems, Challenges. Springer, 2019. In press, available at http://automl.org/book.
 (20) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456. PMLR, 07–09 Jul 2015.
 (21) P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. Wilson. Averaging weights leads to wider optima and better generalization. In Proceedings of the ThirtyFourth Conference on Uncertainty in Artificial Intelligence, UAI, pages 876–885. AUAI Press, 2018.
 (22) L. Katzir, G. Elidan, and R. El-Yaniv. Net-DNF: Effective deep modeling of tabular data. In International Conference on Learning Representations, 2021.
 (23) D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 (24) D. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 2, NIPS’15, page 2575–2583. MIT Press, 2015.

 (25) C. Lee, K. Cho, and W. Kang. Mixout: Effective regularization to finetune large-scale pretrained language models. In International Conference on Learning Representations, 2020.
 (26) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res., 18(1):6765–6816, January 2017.
 (27) S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim. Fast AutoAugment. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 6665–6675. Curran Associates, Inc., 2019.
 (28) I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR) 2017 Conference Track, April 2017.
 (29) I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
 (30) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
 (31) H. Mendoza, A. Klein, M. Feurer, J. Tobias Springenberg, M. Urban, M. Burkart, M. Dippel, M. Lindauer, and F. Hutter. Towards automatically-tuned deep neural networks. In F. Hutter, L. Kotthoff, and J. Vanschoren, editors, AutoML: Methods, Systems, Challenges, chapter 7, pages 141–156. Springer, December 2019.
 (32) J. Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. J. Glob. Optim., 4(4):347–365, 1994.
 (33) A. Emin Orhan and X. Pitkow. Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175, 2017.
 (34) A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
 (35) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 (36) R. Polikar. Ensemble Learning. In C. Zhang and Y. Ma, editors, Ensemble Machine Learning: Methods and Applications, pages 1–34. Springer US, 2012.
 (37) S. Popov, S. Morozov, and A. Babenko. Neural oblivious decision ensembles for deep learning on tabular data. In International Conference on Learning Representations, 2020.
 (38) I. Shavitt and E. Segal. Regularization learning networks: Deep learning for tabular datasets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, page 1386–1396. Curran Associates Inc., 2018.
 (39) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.

 (40) C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 (41) R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.
 (42) A. Tikhonov. On the stability of inverse problems. In Doklady Akademii Nauk SSSR, 1943.
 (43) J. Vanschoren. Metalearning. In F. Hutter, L. Kotthoff, and J. Vanschoren, editors, Automated Machine Learning  Methods, Systems, Challenges, The Springer Series on Challenges in Machine Learning, pages 35–61. Springer, 2019.
 (44) J. Vanschoren, J. Van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.
 (45) Y. Yamada, M. Iwamura, and K. Kise. ShakeDrop regularization, 2018.
 (46) Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, August 2007.
 (47) S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
 (48) H. Zhang, M. Cisse, Y. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
 (49) M. Zhang, J. Lucas, J. Ba, and G. Hinton. Lookahead optimizer: k steps forward, 1 step back. In H. Wallach, H. Larochelle, A. Beygelzimer, F. Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pages 9593–9604, 2019.
 (50) L. Zimmer, M. Lindauer, and F. Hutter. Auto-PyTorch Tabular: Multi-fidelity meta-learning for efficient and robust AutoDL. IEEE TPAMI, 2021. IEEE Early Access.
 (51) H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Appendix A Description of BOHB
BOHB [10] is a hyperparameter optimization algorithm that extends Hyperband [26] by sampling from a model instead of sampling randomly from the hyperparameter search space.
Initially, BOHB performs random search and favors exploration. As it iterates and collects more observations, it builds models over different fidelities and trades off exploration against exploitation to avoid converging to bad regions of the search space. BOHB samples from the model of the highest fidelity with probability ρ and samples at random with probability 1−ρ. A model is built for a fidelity only when enough observations exist for that fidelity; by default, this limit is set to d+1 observations, where d is the dimensionality of the search space.
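The sampling rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not BOHB's actual implementation: the function and variable names are ours, the value of ρ is a placeholder, and the random branch draws from a toy one-dimensional space.

```python
import random

def sample_configuration(models, n_observations, dim, rho=0.5, rng=random):
    """Sketch of BOHB's sampling rule: sample from the model of the
    highest fidelity with probability rho, otherwise sample at random.

    models:         dict mapping fidelity -> callable returning a
                    promising configuration from that fidelity's model
    n_observations: dict mapping fidelity -> number of configurations
                    evaluated at that fidelity so far
    dim:            dimensionality of the search space; a model is only
                    usable once its fidelity has >= dim + 1 observations
    """
    usable = [f for f in models if n_observations.get(f, 0) >= dim + 1]
    if usable and rng.random() < rho:
        highest = max(usable)      # exploit: model of the highest fidelity
        return models[highest]()
    return {"lr": rng.uniform(1e-4, 1e-1)}  # explore: random configuration
```

Early on, no fidelity has the required d+1 observations, so every draw falls through to the random branch; as observations accumulate, the model branch takes over with probability ρ.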
Appendix B Configuration Spaces
B.1 Method implicit search space
Category  Hyperparameter  Type  Range 
Cosine Annealing  Iterations multiplier  Continuous  
Max. iterations  Integer  
Network  Activation  Nominal  
Bias initialization  Nominal  
Blocks in a group  Integer  
Embeddings  Nominal  
Number of groups  Integer  
Resnet shape  Nominal  
Type  Nominal  
Units in a layer  Integer  
Preprocessing  Preprocessor  Nominal  
Training  Batch size  Integer  
Imputation  Nominal  
Initialization method  Nominal  
Learning rate  Continuous  
Loss module  Nominal  
Normalization strategy  Nominal  
Optimizer  Nominal  
Scheduler  Nominal  
Seed  Integer 
Table 3 presents the network architecture and the training pipeline choices used in all our experiments for the individual regularizers and for the regularization cocktails.
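The two Cosine Annealing hyperparameters listed above (iterations multiplier and max. iterations) parameterize cosine annealing with warm restarts (SGDR [28]). A minimal sketch of the resulting schedule, with function and argument names of our own choosing:

```python
import math

def sgdr_lr(step, base_lr, t_max, t_mult=2.0, lr_min=0.0):
    """Cosine annealing with warm restarts (SGDR).

    t_max is the length of the first cycle ("max. iterations") and
    t_mult is the factor by which each cycle grows ("iterations
    multiplier").  Within a cycle, the learning rate decays from
    base_lr to lr_min along a half cosine, then restarts.
    """
    # find the cycle that `step` falls into
    cycle_len = t_max
    while step >= cycle_len:
        step -= cycle_len
        cycle_len = int(cycle_len * t_mult)
    # standard cosine decay within the current cycle
    return lr_min + 0.5 * (base_lr - lr_min) * (1 + math.cos(math.pi * step / cycle_len))
```

For example, with t_max=10 and t_mult=2.0, the learning rate decays over 10 steps, restarts at full strength, then decays again over the next 20 steps.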
B.2 Benchmark search space
For Experiment 3, we set up the search space and the individual configurations of the state-of-the-art competitors used for the comparison as follows:
Auto-Sklearn.
The estimator is restricted to only include GBDT, so that we compare directly against that algorithm as a baseline. We do not activate any preprocessing, since our regularization cocktails do not make use of preprocessing algorithms in the pipeline. The time budget is always set to the time it took BOHB to find the hyperparameter configuration with the best validation accuracy, measured from the start of the hyperparameter optimization phase. The ensemble size is kept minimal, since our method only uses models from one training run, not multiple ones. The seed is set to the same value as in the experiments with the regularization cocktails, to obtain the same data splits. To keep the comparison fair, there is no warm start of the initial configurations with meta-learning, since our method also does not make use of meta-learning. Lastly, the number of parallel workers is set to match the parallel resources given to the experiment with the regularization cocktails. The hyperparameter search space is left at the default offered by Auto-Sklearn, shown in Table 4.
Hyperparameter  Type  Range 

Nominal  
Continuous  
Continuous  
Integer  
Integer  
Integer  
Continuous 
XGBoost.
To obtain a well-performing configuration space for XGBoost, we augmented the default configuration space previously used in Auto-Sklearn (https://github.com/automl/autosklearn/blob/v.0.4.2/autosklearn/pipeline/components/classification/xgradient_boosting.py) with further recommended hyperparameters and ranges from Amazon (https://docs.aws.amazon.com/sagemaker/latest/dg/xgboosttuning.html); we reduced the size of some of those ranges, since the ranges given on that website were too broad and resulted in poor performance. In Table 5, we present a refined version of the configuration space that achieves better performance on the benchmark. We note that we did not apply one-hot encoding to the categorical features for this experiment, since we observed better overall results when the categorical features were label-encoded.
Hyperparameter  Type  Range  Log scale 

Continuous  ✓  
Continuous  ✓  
Continuous  ✓  
Integer    
Continuous  ✓  
Continuous    
Continuous    
Continuous    
Integer    
Integer    
Continuous  ✓  
Continuous   
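The "Log scale" column in Table 5 indicates that a continuous range is sampled uniformly in log space rather than linearly. A small sketch of the difference (the helper name and the example range are ours):

```python
import math
import random

def sample_hp(low, high, log=False, rng=random):
    """Draw one value from a continuous hyperparameter range.

    With log=True the draw is uniform in log space, so a range
    spanning several orders of magnitude (e.g. a learning rate in
    [1e-4, 1e-1]) gives each order of magnitude equal probability,
    instead of concentrating almost all mass near the upper end.
    """
    if log:
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    return rng.uniform(low, high)
```

With linear sampling over [1e-4, 1e-1], roughly 90% of the draws would fall above 1e-2; log-scale sampling spreads them evenly across the three decades.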
TabNet.
For the search space of the TabNet model, we used the default hyperparameter ranges suggested by the authors, which were found to perform best in their experiments.
Hyperparameter  Type  Values 

Integer  
Continuous  
Continuous  
Integer  
Continuous  
Integer  
Integer  
Continuous  
Integer  
Continuous 
For our experiments with the TabNet and XGBoost models, we also used BOHB for hyperparameter tuning, with the same parallel resources and limiting conditions as for our regularization cocktails.
The above search spaces for the XGBoost and TabNet experiments do not include early stopping, although we did run experiments where both models had early stopping activated. For both models, the results were not better than their counterparts without early stopping. Lastly, for both experiments, we imputed missing values with the most-frequent strategy; the reason behind this choice was that the implementation we used did not accept the median strategy for categorical-value imputation.
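The most-frequent imputation and the label encoding mentioned above can be sketched in pure Python (function names are ours, not the API of the implementation we used):

```python
from collections import Counter

def impute_most_frequent(column, missing=None):
    """Replace missing entries with the column's most frequent value.
    Unlike the median strategy, this is also defined for categorical
    columns, which is why we used it for the XGBoost/TabNet runs."""
    observed = [v for v in column if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in column]

def label_encode(column):
    """Map each distinct category to an integer (label encoding),
    the alternative to one-hot encoding discussed above."""
    mapping = {v: i for i, v in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column]
```

For example, `impute_most_frequent(["a", "b", "a", None])` fills the missing entry with "a", and `label_encode` then maps the categories to integer codes.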
AutoGluon.
The library is configured to construct stacked ensembles with bagging among diverse neural network architectures with various kinds of regularization, in order to achieve the best predictive accuracy. Furthermore, we used the same seed as for our MLPs with regularization cocktails, to obtain the same dataset splits. We allowed AutoGluon to use early stopping, and we additionally allowed feature preprocessing, since different feature preprocessing techniques are embedded in the different model types and allow for better overall performance. For all other training and hyperparameter settings, we use the library's defaults (we used version 0.2.0 of the autogluon library, following the explicit recommendation of the authors on the efficacy of their proposed stacking without needing any HPO [9]).
NODE.
For our experiments with NODE, we used the official implementation (https://github.com/Qwicen/node). In our initial experiment iterations, we used the search space proposed by the authors [37]. However, evaluating that search space is infeasible, since the memory and runtime requirements of the experiments are very high and cannot be satisfied within our cluster constraints. These high runtime and memory requirements are noted by the authors in the official implementation.
To alleviate these problems, we used the default configuration suggested by the authors in their examples. Lastly, we use the same seed as for our experiment with the regularization cocktails, to obtain the same data splits.
Appendix C Plots
C.1 Regularization Cocktail Performance
To investigate the performance of our formulation, we compare plain MLPs regularized with only one individual regularization technique at a time against the dataset-specific regularization cocktails. The hyperparameters of all methods are tuned on the validation set, and the best configuration is refitted on the full training set. In Figure 4, we present the results of each pairwise comparison, calculated on the test set after the refit phase on the best hyperparameter configuration is completed. The p-values are generated by performing a Wilcoxon signed-rank test. As the results show, the regularization cocktail is the only method with statistically significant improvements over all other methods (with a significant p-value in all cases). The detailed results for all methods on every dataset are shown in Table 9.
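The Wilcoxon signed-rank statistic behind these pairwise comparisons can be sketched as follows. This is a minimal illustration of the statistic W = min(W+, W−) only; a real analysis should use a library routine such as scipy.stats.wilcoxon, which also provides the p-value.

```python
def wilcoxon_statistic(xs, ys):
    """Wilcoxon signed-rank statistic for paired samples.

    Zero differences are dropped; tied |differences| receive the
    average of the ranks they span.  Returns W = min(W+, W-), the
    smaller of the rank sums of positive and negative differences.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # group indices with equal |diff| and assign the average rank
        j = i
        while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j + 1) / 2.0  # positions i..j-1 hold ranks i+1..j
        for k in order[i:j]:
            ranks[k] = avg
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

A small W means the differences are consistently one-sided, which is what drives the significant p-values in the pairwise comparisons of Figure 4.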
C.2 Dataset-dependent optimal cocktails
To verify the need for dataset-specific regularization cocktails, we first inspect the best-found hyperparameter configurations and count the occurrences of the individual regularization techniques. In Figure 5, we present the occurrences of every regularization method over all datasets, computed by checking, for each dataset's best-found hyperparameter configuration, whether BOHB chose to activate the regularization method. As Figure 5 shows, no single regularization method or combination is chosen for every dataset.
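The occurrence counts in Figure 5 amount to a simple tally over the best-found configurations. As an illustrative sketch (the data layout below is hypothetical; our actual configurations store each regularizer as a categorical hyperparameter):

```python
from collections import Counter

def regularizer_occurrences(best_configs):
    """Count, over all datasets, how often each regularization
    technique was activated by BOHB in the best-found configuration.

    best_configs maps dataset name -> {regularizer name: bool},
    where the bool indicates whether the regularizer was active.
    """
    counts = Counter()
    for config in best_configs.values():
        counts.update(name for name, active in config.items() if active)
    return counts
```

For example, if dropout is active in the best configuration of two datasets and mixup in only one, the tally returns `{"dropout": 2, "mixup": 1}`.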
Additionally, we compare our regularization cocktails against the top-5 most frequently chosen regularization techniques and the top-5 best-performing regularization techniques. For the top-5 baselines, the regularization techniques are activated and their hyperparameters are tuned on the validation set. The results of the comparison, shown in Table 8, demonstrate that the cocktail outperforms both top-5 variants, indicating the need for dataset-specific regularization cocktails.
C.3 Learning rate as a hyperparameter
In the majority of our experiments, we keep a fixed initial learning rate, in order to investigate in detail the effect of the individual regularization techniques and the regularization cocktails. The learning rate is fixed to the value that achieves the best results across the chosen benchmark of datasets. To investigate the role and importance of the learning rate in the regularization cocktails' performance, we perform an additional experiment in which the learning rate is a hyperparameter optimized individually for every dataset. The results, shown in Table 10, indicate that regularization cocktails with a dynamic learning rate outperform the regularization cocktails with a fixed learning rate on 21 out of 40 datasets, tie on 1, and lose on 18. However, the difference is not statistically significant and does not indicate a clear region where the dynamic learning rate helps.
Appendix D Tables
In Table 7, we provide information about the datasets considered in our experiments; concretely, we provide descriptive statistics and the identifier for every dataset. The identifier (the task id) can be used to download the datasets from OpenML (http://www.openml.org).
Table 8 shows the results of the comparison between the regularization cocktail and the top-5 cocktail variants. The results are calculated on the test set for all datasets, after retraining on the best dataset-specific hyperparameter configuration.
Table 9 provides the results of all our experiments for the baseline, the individual regularization methods, and the regularization cocktail. All results are calculated on the test set, after retraining on the best-found hyperparameter configurations. The evaluation metric used is balanced accuracy.
Additionally, in Table 10 we provide the results of the regularization cocktails with a fixed learning rate and with the learning rate being a hyperparameter optimized for every dataset.