Optimization of genomic classifiers for clinical deployment: evaluation of Bayesian optimization for identification of predictive models of acute infection and in-hospital mort

by   Michael B. Mayhew, et al.
Inflammatix, Inc.

Acute infection, if not rapidly and accurately detected, can lead to sepsis, organ failure and even death. Currently, detection of acute infection as well as assessment of a patient's severity of illness are based on imperfect (and often superficial) measures of patient physiology. Characterization of a patient's immune response by quantifying expression levels of key genes from blood represents a potentially more timely and precise means of accomplishing both tasks. Machine learning methods provide a platform for development of deployment-ready classification models robust to the smaller, more heterogeneous datasets typical of healthcare. Identification of promising classifiers is dependent, in part, on hyperparameter optimization (HO), for which a number of approaches including grid search, random sampling and Bayesian optimization have been shown to be effective. In this analysis, we compare HO approaches for the development of diagnostic classifiers of acute infection and in-hospital mortality from gene expression of 29 diagnostic markers. Our comprehensive analysis of a multi-study patient cohort evaluates HO for three different classifier types and over a range of different optimization settings. Consistent with previous research, we find that Bayesian optimization is more efficient than grid search or random sampling-based methods, identifying promising classifiers with fewer evaluated hyperparameter configurations. However, we also find evidence of a lack of correspondence between internal and external validation performance of selected classifiers that complicates model selection for deployment as well as stymies development of clear-cut, practical guidelines for HO application in healthcare. We highlight the need for additional considerations about patient heterogeneity, dataset partitioning and optimization setup when applying HO methods in the healthcare context.



There are no comments yet.


page 1

page 2

page 3

page 4


Analysis of the AOK Lower Saxony hospitalisation records data (years 2008 -- 2015)

Multidrug-resistant Enterobacteriaceae (MDR-E) have become a major publi...

Is One Hyperparameter Optimizer Enough?

Hyperparameter tuning is the black art of automatically finding a good c...

AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning

Clinical prognostic models derived from largescale healthcare data can i...

Clinical prediction system of complications among COVID-19 patients: a development and validation retrospective multicentre study

Existing prognostic tools mainly focus on predicting the risk of mortali...

Detection of Surgical Site Infection Utilizing Automated Feature Generation in Clinical Notes

Postsurgical complications (PSCs) are known as a deviation from the norm...

Learning Tasks for Multitask Learning: Heterogenous Patient Populations in the ICU

Machine learning approaches have been effective in predicting adverse ou...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Patient lives depend on the swiftness and accuracy of 1) assessment of the severity of their illness and 2) detection of acute infection (when present). These two vital clinical tasks offer opportunities for improvement. Currently, clinicians determine severity of illness by computing scores (e.g. SOFA; Jones et al. (2009)) based on patient physiological features associated with the risk of adverse events (e.g. in-hospital mortality, organ failure) Similarly, detection of acute infection generally involves evaluation of clinical/physiological symptoms (e.g. cough, runny nose, fever) as well as laboratory tests for the presence of specific pathogens. However, these methods provide superficial and imprecise measures of patient illness. Lab tests, in particular, can prove time-consuming and misleading, and the absence of a pathogen finding does not necessarily indicate the absence of infection.

Recent work has highlighted the potential of an alternative approach for diagnostic classifier development: using gene expression measurements from patient blood to detect the presence and type of infection to which the patient is responding (Sweeney et al. (2015, 2016); Sweeney and Khatri (2017); Mayhew et al. (2020)) as well as the patient’s severity of illness (Sweeney et al. (2018)). Coupled with these gene expression signatures of host response, advances in machine learning (ML) provide a platform for the development and deployment of robust, diagnostic classifiers. However, realizing the potential of ML in healthcare depends on respecting the realities of the healthcare context. Indeed, a recent study by our group found that diagnostic classifier development efforts must take into account patient heterogeneity (e.g. biological differences, differences in healthcare received) as well as the effects of this heterogeneity on the performance of a candidate pre-deployment classifier in external validation (Mayhew et al. (2020)).

An important step for the development of robust, high-performing classifiers is optimization of the classifier’s hyperparameters: aspects of the model that affect its complexity (e.g. penalty coefficient in a LASSO logistic regression or numbers of hidden layers in a neural network) as well as how it’s trained (e.g. learning rates or mini-batch sizes for stochastic gradient descent). Hyperparameter optimization begins with specification of a search space (i.e. what model hyperparameters to optimize, what ranges of values each hyperparameter takes and whether the hyperparameters take discrete/categorical or continuous values) HO then proceeds by generating a user-specified number of hyperparameter configurations (sets of specific hyperparameter values corresponding to a unique classification model), training the classifier models given by each configuration, and evaluating the performance of the trained classifier in

internal validation. Internal validation performance is typically assessed either on a separate validation/tuning dataset or by cross-validation. Hyperparameter configurations are then ranked by this performance, with the top configuration selected and retained for external validation (application to a held-out test dataset).

For classifiers with relatively small hyperparameter spaces (e.g. support vector machines and regularized logistic regression), optimizing over a pre-defined grid of hyperparameter values (grid search; GS) has proven effective. More recent work has shown that optimization by randomly sampling (RS) hyperparameter configurations can lead to better coverage of hyperparameter spaces (especially those with more dimensions) and potentially better classifier performance (

Bergstra and Bengio (2012)). GS/RS approaches tend to generate all of the desired number of hyperparameter configurations prior to their evaluation, making these approaches amenable to distributed computation of their internal validation performance.

Bayesian optimization (BO) is a sequential global optimization procedure that has found success in applications as wide-ranging as viral capsid assembly (Thomas and Schwartz (2018)), genetic pre-breeding (Tanaka and Iwata (2018), high-throughput RNA sequencing assembly (Mao et al. (2019)) and materials design Zhang et al. (2020). BO has also proven effective for hyperparameter optimization in classicial (Snoek et al. (2012, 2015); Klein et al. (2017); Falkner et al. (2018); Klein et al. (2019)) and biomedical (Ghassemi et al. (2014); Colopy et al. (2018); Nishio et al. (2018); Borgli et al. (2019)) ML applications. In BO, one uses a model (commonly a Gaussian process (GP); Rasmussen and Williams (2006)) to approximate the objective function one wants to optimize. In the case of BO for hyperparameter optimization, the objective function maps from hyperparameter configurations to their internal validation performance of their corresponding classifier models. In contrast to RS, BO proceeds by sequentially evaluating configurations with the next configuration to visit being determined by previously visited configurations and an acquisition function. The acquisition function manages the tradeoff between re-visiting promising configurations (exploitation) or visiting new, potentially promising configurations (exploration). Each newly visited configuration is used to update the model of the objective function.

Previous studies comparing HO approaches have demonstrated that BO can identify promising classifiers more efficiently (with fewer evaluations of hyperparameter configurations) than GS/RS methods Bergstra et al. (2013); Snoek et al. (2012, 2015); Klein et al. (2017); Falkner et al. (2018); Nishio et al. (2018); Borgli et al. (2019); Klein et al. (2019). However, we note two aspects of previous evaluations of HO approaches that hinder their practical usage in our setting and, more broadly, in the healthcare domain. First, studies from the ML community have compared HO approaches with a focus on benchmark datasets whose composition and handling (i.e. partitioning into training-validation-test splits) doesn’t necessarily reflect characteristics of healthcare settings (e.g. smaller, structured, and more heterogeneous datasets; high propensity for models to be applied to out-of-distribution samples at test time). Second, comparison of HO approaches has been based on performance of models in internal validation, under the assumption that high performance in internal validation likely corresponds to high performance in a held-out dataset. However, to the best of our knowledge, no studies have evaluated the external validation performance of selected models, an important pre-requisite for eventual model deployment. In addition, no comparison of HO approaches has yet been attempted for development of diagnostic classifiers using genomic data.

In this work, we compare GS/RS and BO methods for hyperparameter optimization of gene expression-based diagnostic classifiers for two clinical tasks: 1) detection of acute infection and 2) prediction of mortality within 30 days of hospitalization. We optimize and train three different types of classifiers using gene expression features from 29 diagnostic markers in a multi-study cohort of 3413 patient samples for acute infection detection (3288 for 30-day mortality prediction). Patient samples were assayed on a variety of technical platforms and collected from a range of geographical regions, healthcare settings, and disease contexts. Our extensive analysis evaluates the BO approach, in particular, under a range of computational budgets and optimization settings. Crucially, beyond assessing and comparing the performance of top classifiers in internal validation, we further evaluate top models selected by all HO approaches in a multi-cohort external validation set comprising nearly 300 patients profiled by a targeted diagnostic instrument (NanoString). Our analysis provides important insights for diagnostic classifier development using genomic data, and, more generally, about the implementation and practical usage of HO approaches in healthcare.

2 Methods

2.1 Grouped cross-validation

Owing to considerable heterogeneity in our datasets due to both variability in sample processing and clinical variation in the target population, we seek robust classifiers that are able to generalize well to unseen patients. While ML community studies of HO have typically ranked and selected classifiers by performance on validation/tuning sets taken from much larger datasets, we use cross-validation (CV) performance to rank and select classifier models. Previous analyses by our group and others (e.g. Tabe-Bordbar et al. (2018)) suggested that alternative cross-validation strategies were preferable over conventional k-fold CV for identifying classifiers able to generalize across heterogeneous patient populations. One such strategy is leave-one-study-out (LOSO) CV in which each study (rather than a random sub-sampling of the full dataset) is treated as a fold. However, as LOSO CV’s run-time scales with the number of studies, we opt for 5-fold grouped CV, in which full studies are allocated to one and only one of five folds. Grouped 5-fold CV strikes a balance between the scalability of standard k-fold CV and the accounting for structured, heterogeneous observations of LOSO. We use grouped 5-fold CV performance to rank hyperparameter configurations from GS/RS methods and as an objective function in BO.

2.2 Classifier types and performance assessment

We evaluated three types of classification models in our analyses: 1) a support vector machine with a radial basis function (RBF) kernel, 2) XGBoost (XGB; 

Chen and Guestrin (2016)

) and 3) multi-layer perceptrons (MLP). These models were chosen because they support multi-class classification and represent both low-dimensional (RBF: N = 2) and relatively high-dimensional (XGBoost: N = 13, MLP: N = 12) hyperparameter spaces. MLP models were trained with the Adam optimizer (a variant of stochastic gradient descent optimization) and mini-batch size was fixed at 128.

For the BVN task (three-class classification), we ranked and selected models based on multi-class AUC (mAUC) (Hand and Till (2001)

). For the mortality task (binary classification), we used binary AUC as our performance metric. We bypassed metrics able to accommodate the class imbalance in our mortality classification task (e.g. area-under-the-precision-recall-curve, AUPR), in favor of AUC due to its clinical acceptance. The top-performing model for each task and hyperparameter optimization run was the hyperparameter configuration that resulted in the highest grouped 5-fold CV performance. To determine performance of models in grouped 5-fold CV, we pooled the model’s predicted probabilities for each fold and computed the relevant metric (mAUC or AUC) from the pooled probabilities. The top model (hyperparameter configuration) for each task and hyperparameter optimization run was then trained on the full training set and applied to the external validation set. We computed external validation performance (either mAUC or AUC) for these top models using their predicted probabilities for the validation samples. We also compute 95% confidence intervals for both CV and external performance estimates using 5k bootstrap samples of the corresponding predicted probabilities.

2.3 Grid search and random sampling-based optimization details

We use performance of classifiers selected by GS/RS methods as a baseline for comparison with models selected by BO. In our parallelized implementation of GS/RS, we first generate all configurations and then distribute the configurations to multiple compute nodes in a commodity cluster for evaluation. For RBF SVM, we conduct a grid search over 4757 configurations of cost, , and bandwidth parameter, . We consider 67 values in a grid ranging from 1e-03 to 2.15 and 71 values in a grid ranging from 1.12e-04 to 10. For XGB and MLP, we sampled 25k hyperparameter configurations each. We generated these samples uniformly and independently of one another from pre-specified ranges or from grids. Details of the XGBoost and MLP hyperparameters we chose to optimize appear in the Supplement.

2.4 Bayesian optimization details

In recent applications of BO for hyperparameter optimization, the objective function is usually a mapping from a domain of hyperparameter configurations to performance of the corresponding classifier on am internal validation/tuning set. In our setting, the objective function maps from hyperparameter configurations to 5-fold grouped CV performance of the corresponding classifiers.

There are two main components of BO: 1) a model that approximates the objective function, and 2) an acquisition function to propose the next configuration to visit in hyperparameter space. We use a Gaussian process model with noise to approximate the objective function. We also investigate both the expected improvement (EI) and upper confidence bound (UCB) acquisition functions. We use the Matern5/2 covariance function (Equation 1) in the GP model of the objective function:




Here is the dimensionality of the hyperparameter space, and are two configurations being compared,

is the variance in the objective function’s value and

is a lengthscale parameter that modulates the effect of changes in the hyperparameter dimensions on changes in grouped 5-fold CV performance. In this formulation, there is a single lengthscale parameter for all hyperparameter dimensions. We also evaluate the automatic relevance determination (ARD) form of the Matern5/2 kernel in which each hyperparameter dimension has their own lengthscale parameter:



For both XGBoost and MLP, the hyperparameter space was a mixed

domain containing both discrete/categorical hyperparameters and continuous hyperparameters. We performed BO using two different forms of the hyperparameter space. In one form, we forced continuous and discrete hyperparameter dimensions to take values in the continuous range 0 to 1. We then transformed these hyperparameters back to their original scales prior to their evaluation. For this transformation, we multiplied each discrete/continuous hyperparameter by its user-specified original value range and then added the minimum value in the range. In the second form, we searched discrete and continuous hyperparameter dimensions in their original scales. For both forms, categorical hyperparameters were converted to one-hot encoding vectors and discrete hyperparameters were converted to the nearest integer. Further details appear in the Supplement.

Each BO run consists of two stages: 1) initialization and 2) optimization. In the initialization stage, we uniformly and independently sample points from the hyperparameter space to seed construction of the GP approximation of the objective function. The optimization stage involves the sequential (dependent) updating of the GP with newly visited configurations as well as the use of the acquisition function to propose each new configuration to visit. We evaluated BO at a range of different initialization and optimization budgets (numbers of configurations evaluated). We considered two initialization budgets (5 and 25 configurations) and optimization budgets representing 1/50th, 1/100th, and 1/500th of the number of configurations we evaluate in our RS optimization runs for classifier development ( 5k configurations for RBF; 25k configurations for XGBoost and MLP).

2.5 Implementation details

We used the RBF SVM implementation available in Python3’s scikit-learn module (version 0.20.1). For XGBoost analyses, we used the scikit-learn API available from the xgboost package (version 0.90). All MLP models were implemented using Keras (version 2.1.6) with a TensorFlow (version 1.8.0) back-end. We carried out all GS/RS analyses using our proprietary distributed hyperparameter optimization software. We executed each GS/RS run using 100 t3a.large instances (2 vCPUs/instance) from Amazon Web Services (AWS). We implemented BO analyses with the GPyOpt package (GPyOpt authors (2016)) and carried out BO analyses on a personal laptop with an Intel Core i7-8550U at 1.80GHz 8-core CPU, running Ubuntu 18.04.

3 Cohort

To build the datasets used in this study, we combined gene expression data from public sources and in-house clinical studies designed for research in diagnosing acute infections and sepsis. We collected the publicly available studies from the NCBI GEO and EMBL-EBI ArrayExpress databases using a systematic search (described in Sweeney et al. (2015)). The public studies were profiled using a variety of different technical platforms (e.g. mostly from microarrays). In contrast, samples from the in-house clinical studies were profiled on the NanoString nCounter platform using a custom codeset for 29 diagnostic genes of interest. All included studies consisted of samples from our target population, comprising both adult and pediatric patients from diverse geographical regions and clinical settings. In addition, each included study had measurements taken from patient blood for all 29 markers of interest and had at least five healthy samples.

3.1 Cohort Selection

For both classification tasks, we separated studies into a training set and an external validation set. For the BVN task, the training set consisted of 43 studies (profiled outside Inflammatix) and 3413 patients (1087 with bacterial infection, 1244 with viral infection, and 1082 non-infected). The BVN external validation set consisted of six studies (profiled by Inflammatix) and 293 patients (153 with bacterial infection, 106 with viral infection, and 34 non-infected). For the mortality task, the training set consisted of 33 studies (profiled outside Inflammatix) and 3288 patients (175 30-day mortality events) while the mortality external validation set comprised four studies (profiled by Inflammatix) and 348 patients (80 30-day mortality events).

3.2 Data Extraction

We extracted gene expression measurements for all 29 diagnostic markers of interest and normalized these raw expression values for each study. To account for technical and biological heterogeneity across the expression studies, we also performed co-normalization (see Sweeney et al. (2016)), estimating distributional characteristics of each study based on their healthy controls and adjusting expression values of non-healthy samples in each study based on these estimates.

In addition to gene expression values for the 29 diagnostic markers of interest, we collected labels for patient infection and mortality status. Labels for one of three classes of the acute infection detection or BVN (Bacterial infection, Viral infection, or Non-infectious inflammation) task were determined by multiple-physician adjudication after chart review and not necessarily by positive pathogen identification. Non-infected determinations did not include healthy controls. Binary indicator labels of whether a patient died within 30 days of hospitalization were used for the mortality task. We derived these labels from study metadata (when available) and the associated study’s manuscripts.

3.3 Feature Choices

The features we used in our analyses were based on the expression values of 29 genes previously found to accurately discriminate three different aspects of acute infection: 1) viral vs. bacterial infection (7 genes) (Sweeney et al. (2016)), 2) infection vs. non-infectious inflammation (11 genes) (Sweeney et al. (2015)), and 3) high vs. low risk of 30-day mortality (11 genes) (Sweeney et al. (2018)). The 29 genes were grouped into six modules based on positive or negative correlation with infection status or mortality risk: increased expression in viral infections (IFI27, JUP, LAX1), increased expression in bacterial infections (TNIP1, CTSB, HK3, GPAA1), increased (HIF1A, SEPP1, RGS1, C11orf74, CD163, PER1, DEFA4, CIT) and decreased (LY86, TST, KCNJ2) expression in patients who died within 28 days of hospitalization, increased (CEACAM1, ZDHHC19, C9orf95, GNA15, BATF, C3AR1) and decreased (KIAA1370, TGFBI, MTCH1, RPGRIP1, HLA-DPB1) expression in patients with sepsis. Building on our previous work (Mayhew et al. (2020)

), we computed both the geometric means and arithmetic means of these six groups of genes, producing 12 features. We optimized and trained our classifiers on the combination of these 12 features and the expression values of all 29 genes (41 features in total).

4 Results

We compared the ability of BO and GS/RS approaches to optimize hyperparameters of three types of classifiers for two clinical tasks. For the BVN task, we sought classifiers that could achieve high performance in predicting whether a patient had a bacterial or viral infection or whether the patient was showing a non-infectious inflammatory response. For the mortality task, we sought classifiers that could achive high performance in predicting mortality events within 30 days of hospital admission.

4.1 Comparison of internal validation performance between BO-selected and GS/RS-selected classifiers

As shown in both Tables 1 and 2 and consistent with previous findings, BO identified candidate models with comparable or better grouped 5-fold CV (internal validation) performance than those models identified by GS/RS approaches and with fewer evaluations of the objective function needed. More specifically, BO identified candidate XGBoost and MLP classifiers for both tasks and RBF SVM classifiers for BVN that matched or exceeded the internal validation performance of GS/RS-identified classifiers with as little as 100 objective evaluations. We observed these trends when we used a different acquisition function (UCB; Supplementary Tables 1 and 2) and when we modified the BO search space for the discrete and continuous dimensions to a unit hypercube (details in Supplement; Supplementary Tables 3 and 4

Model HO Type No. of Init. No. of Evals. ARD CV Perf. Val. Perf.
RBF GS - 4757 - 0.815 (0.804, 0.826) 0.853 (0.811, 0.892)
BO 5 10 Y 0.809 (0.798, 0.820) 0.859 (0.817, 0.898)
BO 5 10 N 0.811 (0.800, 0.823) 0.851 (0.811, 0.889)
BO 5 50 Y 0.816 (0.804, 0.827) 0.851 (0.810, 0.889)
BO 5 50 N 0.815 (0.804, 0.827) 0.852 (0.807, 0.891)
BO 5 100 Y 0.816 (0.804, 0.827) 0.852 (0.810, 0.891)
BO 5 100 N 0.816 (0.805, 0.827) 0.852 (0.810, 0.891)
BO 25 10 Y 0.811 (0.800, 0.823) 0.788 (0.756, 0.820)
BO 25 10 N 0.815 (0.804, 0.826) 0.851 (0.809, 0.891)
BO 25 50 Y 0.816 (0.805, 0.826) 0.852 (0.810, 0.892)
BO 25 50 N 0.816 (0.804, 0.827) 0.852 (0.810, 0.892)
BO 25 100 Y 0.816 (0.805, 0.826) 0.852 (0.809, 0.889)
BO 25 100 N 0.816 (0.804, 0.826) 0.852 (0.809, 0.892)
XGB RS - 25000 - 0.815 (0.804, 0.826) 0.860 (0.819, 0.899)
BO 5 50 Y 0.809 (0.798, 0.820) 0.822 (0.775, 0.866)
BO 5 50 N 0.809 (0.798, 0.820) 0.816 (0.769, 0.861)
BO 5 100 Y 0.818 (0.807, 0.829) 0.864 (0.823, 0.903)
BO 5 100 N 0.812 (0.801, 0.823) 0.827 (0.782, 0.870)
BO 5 250 Y 0.809 (0.797, 0.820) 0.821 (0.772, 0.865)
BO 5 250 N 0.810 (0.799, 0.821) 0.827 (0.780, 0.873)
BO 25 50 Y 0.818 (0.807, 0.828) 0.865 (0.822, 0.902)
BO 25 50 N 0.812 (0.800, 0.823) 0.828 (0.782, 0.872)
BO 25 100 Y 0.811 (0.799, 0.822) 0.825 (0.775, 0.871)
BO 25 100 N 0.809 (0.798, 0.820) 0.826 (0.778, 0.871)
BO 25 250 Y 0.807 (0.796, 0.818) 0.865 (0.826, 0.903)
BO 25 250 N 0.810 (0.799, 0.821) 0.826 (0.778, 0.871)
MLP RS - 25000 - 0.840 (0.829, 0.851) 0.856 (0.815, 0.896)
BO 5 50 Y 0.812 (0.802, 0.823) 0.867 (0.825, 0.905)
BO 5 50 N 0.814 (0.803, 0.825) 0.869 (0.831, 0.904)
BO 5 100 Y 0.825 (0.815, 0.836) 0.842 (0.801, 0.880)
BO 5 100 N 0.828 (0.817, 0.839) 0.874 (0.833, 0.912)
BO 5 250 Y 0.829 (0.818, 0.841) 0.780 (0.730, 0.829)
BO 5 250 N 0.831 (0.820, 0.841) 0.858 (0.817, 0.896)
BO 25 50 Y 0.831 (0.820, 0.841) 0.833 (0.791, 0.873)
BO 25 50 N 0.825 (0.814, 0.836) 0.851 (0.806, 0.893)
BO 25 100 Y 0.827 (0.817, 0.838) 0.835 (0.791, 0.878)
BO 25 100 N 0.824 (0.813, 0.835) 0.848 (0.806, 0.888)
BO 25 250 Y 0.841 (0.830, 0.851) 0.857 (0.814, 0.896)
BO 25 250 N 0.841 (0.830, 0.852) 0.793 (0.748, 0.838)
Table 1: Grouped 5-fold CV and external validation (Val.) performance of selected classifiers for the BVN task. All shown results for BO runs used a Matern5/2 kernel in the underlying GP model of the objective function and the EI acquisition function. 95% confidence intervals for selected model performance appear in parentheses. Bold numbers indicate the best performance in a column.

However, for each classifier type, not every BO run resulted in a model that performed better in grouped 5-fold CV than models identified by GS/RS. For example, while two BO runs for the BVN task and with XGBoost (Table 1) did result in classifiers with better CV performance, the remaining ten BO runs did not. Furthermore, we did not necessarily observe better CV performance from the BO runs with larger optimization budgets. BO-selected XGBoost and MLP models for both tasks could exceed the CV performance of GS/RS-selected models with as little as 50 or 100 objective evaluations but, for some runs, not with 250 evaluations. Surprisingly, regardless of computational budget or optimization settings, no BO-selected RBF SVM classifiers exceeded the CV performance of GS/RS-selected classifiers on the mortality task. Taken together, these findings suggest that our chosen objective (grouped 5-fold CV) might be more difficult to optimize than internal validation objectives commonly used in ML and biomedical applications (i.e. separate internal validation set or conventional k-fold CV).

4.2 Comparison of external validation performance between BO-selected and GS/RS-selected classifiers

An important pre-requisite for deployment of a diagnostic classifier is that the classifier generalizes well to previously unseen patient data. In addition to comparing the classifiers’ internal validation performance as is commonly done in practice, we further evaluated how the selected classifiers for both tasks performed in a held-out external validation set comprising nearly 300 patients (Tables 1 and 2). As with our analysis of internal validation performance, we found that BO identified candidate models for both classification tasks with comparable or better performance in the external validation set than those identified by GS/RS approaches. We observed these patterns for the three different classifier types and across a range of different computational budgets for BO. We also found that these trends held for BO-selected classifiers when using a different acquisition function (UCB; Supplementary Tables 1 and 2) or when using a modified search space (Supplementary Tables 3 and 4). Importantly, classifiers identified with BO’s previously noted efficiency relative to GS/RS and that show good internal validation performance also generally perform well in external validation performance. BO was able to identify models with the best performance in external validation (RBF SVM row 2, XGBoost rows 8 and 12, and MLP row 5 in Table 1; RBF SBM row 3, XGBoost rows 4, 5 and 6, and MLP row 5 in Table 2) with a fraction of the number of objective function evaluations used by either GS or RS approaches.

Model HO Type No. of Init. No. of Evals. ARD CV Perf. Val. Perf.
RBF GS - 4757 - 0.839 (0.810, 0.867) 0.708 (0.643, 0.772)
BO 5 10 Y 0.800 (0.769, 0.831) 0.746 (0.689, 0.798)
BO 5 10 N 0.799 (0.766, 0.832) 0.755 (0.701, 0.807)
BO 5 50 Y 0.800 (0.767, 0.832) 0.749 (0.695, 0.802)
BO 5 50 N 0.800 (0.769, 0.831) 0.745 (0.688, 0.798)
BO 5 100 Y 0.800 (0.767, 0.832) 0.750 (0.696, 0.802)
BO 5 100 N 0.801 (0.768, 0.832) 0.745 (0.688, 0.797)
BO 25 10 Y 0.800 (0.768, 0.831) 0.747 (0.692, 0.802)
BO 25 10 N 0.800 (0.768, 0.830) 0.746 (0.691, 0.799)
BO 25 50 Y 0.801 (0.769, 0.831) 0.752 (0.695, 0.806)
BO 25 50 N 0.801 (0.769, 0.831) 0.749 (0.693, 0.801)
BO 25 100 Y 0.800 (0.769, 0.831) 0.753 (0.698, 0.806)
BO 25 100 N 0.801 (0.767, 0.832) 0.752 (0.697, 0.804)
XGB RS - 25000 - 0.889 (0.870, 0.908) 0.816 (0.763, 0.864)
BO 5 50 Y 0.881 (0.861, 0.901) 0.822 (0.769, 0.872)
BO 5 50 N 0.877 (0.857, 0.896) 0.801 (0.748, 0.851)
BO 5 100 Y 0.883 (0.864, 0.902) 0.827 (0.774, 0.873)
BO 5 100 N 0.889 (0.870, 0.908) 0.827 (0.776, 0.872)
BO 5 250 Y 0.884 (0.864, 0.903) 0.827 (0.775, 0.875)
BO 5 250 N 0.881 (0.860, 0.901) 0.814 (0.761, 0.862)
BO 25 50 Y 0.887 (0.867, 0.905) 0.814 (0.760, 0.864)
BO 25 50 N 0.881 (0.861, 0.900) 0.817 (0.765, 0.866)
BO 25 100 Y 0.885 (0.866, 0.903) 0.825 (0.773, 0.872)
BO 25 100 N 0.878 (0.858, 0.898) 0.817 (0.765, 0.865)
BO 25 250 Y 0.886 (0.867, 0.904) 0.826 (0.776, 0.872)
BO 25 250 N 0.882 (0.862, 0.900) 0.802 (0.751, 0.851)
MLP RS - 25000 - 0.859 (0.836, 0.879) 0.743 (0.686, 0.796)
BO 5 50 Y 0.885 (0.864, 0.905) 0.823 (0.772, 0.870)
BO 5 50 N 0.885 (0.864, 0.904) 0.808 (0.753, 0.860)
BO 5 100 Y 0.892 (0.873, 0.910) 0.838 (0.788, 0.883)
BO 5 100 N 0.886 (0.865, 0.907) 0.840 (0.789, 0.886)
BO 5 250 Y 0.890 (0.870, 0.910) 0.809 (0.754, 0.855)
BO 5 250 N 0.894 (0.876, 0.912) 0.828 (0.780, 0.876)
BO 25 50 Y 0.888 (0.868, 0.907) 0.833 (0.783, 0.880)
BO 25 50 N 0.889 (0.870, 0.908) 0.816 (0.763, 0.865)
BO 25 100 Y 0.888 (0.868, 0.906) 0.807 (0.747, 0.860)
BO 25 100 N 0.885 (0.864, 0.903) 0.834 (0.784, 0.880)
BO 25 250 Y 0.893 (0.873, 0.914) 0.822 (0.768, 0.871)
BO 25 250 N 0.887 (0.867, 0.907) 0.815 (0.761, 0.864)
Table 2: Grouped 5-fold CV and external validation (Val.) performance of selected classifiers for the 30-day mortality task. Table layout and abbreviations are the same as in Table 1.

However, counter to assumptions that optimizing a classifier’s internal validation performance will lead to optimal external validation performance, we observed relatively poor correspondence between internal and external validation performance across both tasks and for all three classifier types. For example, while no BO-selected RBF SVM classifier achieved the grouped 5-fold CV performance of the GS-selected classifier on the mortality task, all BO-selected classifiers surpassed the GS-selected classifier in external validation performance (Table 2). For another example, the two MLP classifiers selected by BO (250 optimization evaluations; bottom rows of Table 1) for the BVN task achieve the highest internal validation performance of all HO runs but are both wildly different from one another as well as suboptimal across all BVN runs in their external validation performance, making selection of either of these models for deployment problematic. The latter of these two models (CV mAUC: 0.841, Val. mAUC: 0.793) as well as other examples (e.g. GS-selected RBF SVM for mortality highlights that, despite our best efforts through use of grouped 5-fold CV as an objective function, there is still a potential for model overfitting that must be taken into account. Our findings suggest that the assumed correspondence between internal validation and external validation performance does not necessarily hold in our setting and merits further investigation in other healthcare applications of HO.

Our analysis also raises questions about the practical usage of HO for diagnostic classifier development. Regarding whether one HO approach could be replaced with another on the basis of the external validation performance of selected classifiers, we found conflicting evidence. While for RBF SVM classifiers on the BVN task, nearly all models identified by BO matched or exceeded the external validation performance of models found by grid search, not every BO run resulted in a XGBoost or MLP model that generalized well to the external validation set. For the BVN task, only three models from the 12 XGBoost BO runs for BVN and five models from the 12 MLP BO runs attained validation performance that surpassed the performance achieved by the top model selected by RS (Table 1). Thus, simply conducting BO for this task and with a particular model type (e.g. XGBoost) might miss models that generalize well. Moreover, our analyses did not reveal an ideal optimization budget for BO.

4.3 Evaluation of effects of automatic relevance determination in BO on classifier generalization performance

For high-dimensional hyperparameter spaces, some dimensions (or hyperparameters) may have a greater impact on the model’s generalization performance than others. Automatic relevance determination (ARD; Neal (1996)) provides the means to estimate effects of variations in hyperparameter dimensions on the value of the objective function and has been used in multiple implementations of BO (e.g. Snoek et al. (2012) and BoTorch, https://botorch.org/docs/models). Higher values of a given hyperparameter dimension’s lengthscale parameter indicate that variations in that hyperparameter’s value have less of an effect on variations in the objective function value. Conversely, lower values of a hyperparameter dimension’s lengthscale parameter suggest that changes in that hyperparameter lead to large changes in the objective function.

Figure 1: Traceplots showing the effect of ARD on two example BO runs for the activation type

hyperparameter of MLP models. Traceplots correspond to BO runs with ARD enabled (right two panels) or disabled (left two panels) for the BVN task (top two panels) and the mortality task (bottom two panels). The blue shaded segment of the plot indicates the BO initialization stage. Subsequent iterations comprise the optimization stage of BO. A red asterisk denotes the iteration corresponding to the best model selected for that BO run. Activation types: 0 - ELU, 1 - ReLU, 2 - sigmoid, 3 - tanh, 4 - leaky ReLU.

We analyzed both qualitative and quantitative effects of using ARD in our BO runs. We assessed qualitative effects of using ARD in our BO analyses by generating traceplots (Figure 1). With MLP, we consistently found low lengthscale parameter values associated with the type of activation for each hidden layer, suggesting their effect on grouped 5-fold CV (and potentially generalization) performance of the corresponding model. Having leaky ReLU activations appeared to have an impact on grouped 5-fold CV performance of the corresponding MLP model for the BVN task. In the ARD-disabled run (top left panel of Figure 1), leaky ReLU activations are rarely visited whereas, in the ARD-enabled run (top right panel of Figure 1), leaky ReLU activations are visited considerably more often. We observed similar trends with MLPs for the mortality task: as shown in the bottom panels of Figure 1, the ARD-enabled run clearly discourages visits to MLP models with ReLU activations. Beyond observing qualitative effects of ARD on the dynamics of hyperparameter optimization, we wanted to determine whether using ARD improved external validation performance of selected models. After pooling the performance estimates across the three different classifier types for each classification task, we did not find significant differences in external validation performance between BO-selected classifiers with ARD enabled and those without ARD enabled (BVN: Wilcoxon paired signed rank test p-value - 0.910, mortality: p-value - 0.193).

5 Discussion & Conclusions

In this analysis, we carried out a comprehensive comparison of hyperparameter optimization approaches for diagnostic classifier development to determine what approach (if any) led to improvements in: 1) computational efficiency (i.e. the number of hyperparameter configurations needed to identify promising candidate classifiers) or, more importantly, 2) external validation performance. Consistent with previous findings, we found that BO was able to identify candidate classifiers with a fraction of the configurations evaluated using GS/RS. We observed this pattern across a range of settings for BO, three different classifier types, and two classification tasks relevant to critical care. As embarrassingly parallel approaches like GS/RS can necessitate the use of commodity computing clusters, BO’s efficiency makes the approach a potentially cost-effective alternative to hyperparameter optimization. Moreover, we found that the external validation performance of classifiers identified by BO was generally competitive with the performance of classifiers selected by GS/RS. In several cases, the external validation performance of classifiers selected by BO exceeded the performance of classifiers identified by GS/RS, highlighting another potential benefit of BO for diagnostic classifier development.

We investigated the effect of enabling automatic relevance determination in the kernel of BO’s GP model of the objective function. We assumed that some hyperparameter dimensions had more of an effect on the cross-validation (generalization) performance of the corresponding model configuration than others. However, while we did qualitatively observe ARD’s effects on BO visitation of certain hyperparameter values over others, we did not detect noticeable improvements in the external validation performance of classifiers identified by ARD-enabled BO. Thus, the choice of whether to enable ARD for all future BO analyses is not clear-cut and appears to depend on the type of classifier being optimized. Of course, this information can’t be known prior to conducting any hyperparameter optimization for new indications, and our future BO searches will likely be both with and without ARD enabled.

We acknowledge several limitations of our approach. For our random sampling-based hyperparameter optimization runs, we sampled uniformly from pre-defined ranges or grids of values. This scheme involves sampling configurations independently from one another, leading to potential re-sampling of regions of hyperparameter space that may or may not result in models with good generalization performance. Other random sampling approaches have been investigated in which new hyperparameter configurations are generated dependent on the values of previously generated configurations (e.g. Latin hypercube or low-discrepancy Sobol sequences) in order to encourage diversity of the resulting sampled configurations (Bergstra and Bengio (2012)). Such approaches could still be used in an embarassingly parallel implementation of RS whereby the dependent hyperparameter configuration sample would be generated prior to distribution of the samples to different compute nodes. Likewise, these approaches could be used in the initialization stage of BO. Another limitation is that we used a single set of features derived from a previously identified set of 29 gene expression markers. We chose these features based on previous analyses (Mayhew et al. (2020)) and consistent with our goal of developing diagnostic classifiers from these specific markers for clinical deployment. While we acknowledge our conclusions may not hold with other feature sets or in other settings, biomedical or otherwise, we are encouraged by the fact that our results regarding BO’s efficiency in internal validation do echo those observed by previous studies of hyperparameter optimization.

At the outset of this analysis, we hoped we would uncover distinct differences between the hyperparameter optimization approaches in order to develop better guidelines about when (or even if) we would use one approach over another in our ongoing and future classifier development efforts. Unfortunately, we did not identify such clear differences, though we did come to better understand the benefits and drawbacks of either approach. GS/RS methods are well-known, effective and easily implemented for embarrassingly parallel computation. As hyperparameter configurations are generated prior to evaluation, GS/RS allows optimization over more complex hyperparameter spaces. For example, in our MLP analyses, we made the simplifying assumption that each hidden layer had the same dimensionality. This assumption made our optimization of these models with BO more tractable. However, conducting optimization of MLPs with hidden layers that differ in dimensionality from one layer to the next did not appear straightforward with BO. If we represented a general MLP architecture with a vector, with each position in the vector giving the dimensionality of the corresponding hidden layer, any changes in the number of layers would lead to a change in the dimensionality of this vector representation, effectively changing BO’s hyperparameter space during the search. This change in the size of the architecture vector (essentially another hyperparameter) would complicate BO’s choice of dimensionality for the added or remaining layers. In contrast, such architectures could easily be sampled prior to optimization in a GS/RS setting and evaluated in a distributed manner.

On the other hand, previous studies as well as our analyses have demonstrated the benefits of BO. BO can identify classifiers with as good if not better generalization performance than those selected by GS/RS with far fewer evaluations of the hyperparameter objective function. BO’s relative efficiency is clearly beneficial when evaluation of the objective function requires long run-times and/or when GS/RS searches of more complex hyperparameter spaces necessitate use of commodity clusters. We also note that, owing to our relatively small sample sizes (3k patient samples per task), we were able to complete our BO searches within a day or so on personal laptops.

Our analyses also highlight important general considerations for hyperparameter optimization in the smaller, more heterogeneous data regimes typical of biomedical research that don’t seem to have arisen in previous ML studies of HO. These studies have primarily focused on larger (100k) datasets composed mainly of natural images. Moreover, these benchmark datasets are often constructed (e.g. MNIST; http://yann.lecun.com/exdb/mnist/) to meet the assumption that the distribution of training and external validation samples are similar if not the same. In this setting, one would expect that the correspondence between a model’s performance at optimization time (in terms of the value of the objective function) and at external validation (or generalization) time should be strong. In other words, a model with the best performance in optimization should have nearly the best, if not the best, performance in a held-out dataset. However, patient data is known to be heterogeneous due to biological differences as well as differences in geography, healthcare delivery, and assay technologies used, suggesting that the assumption of distributional similarity between training and external validation samples is likely to be violated. Indeed, our recent work indicates that estimates of generalization performance from optimization performance can be optimistically biased, even when accounting for technical/biological heterogeneity (Mayhew et al. (2020)). In this study, we observed this lack of correspondence between optimization and generalization performance, as the best-performing model from optimization (either by GS/RS or BO) did not always produce the best performance in external validation. Continued investigation of these techniques for diagnostic (and genomic) classifier development should take these considerations into account, especially when the data on hand is limited and/or highly heterogeneous. Regardless, both approaches remain promising avenues for hyperparameter optimization and represent key components in the development of more effective diagnostics for emergency and critical care.


  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • Bergstra et al. (2013) James Bergstra, Daniel Yamins, and David Daniel Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. 2013.
  • Borgli et al. (2019) R. J. Borgli, H. Kvale Stensland, M. A. Riegler, and P. Halvorsen.

    Automatic hyperparameter optimization for transfer learning on medical image datasets using bayesian optimization.

    In 2019 13th International Symposium on Medical Information and Communication Technology (ISMICT), pages 1–6, May 2019. doi: 10.1109/ISMICT.2019.8743779.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785.
  • Colopy et al. (2018) G. W. Colopy, S. J. Roberts, and D. A. Clifton. Bayesian optimization of personalized models for patient vital-sign monitoring. IEEE Journal of Biomedical and Health Informatics, 22(2):301–310, March 2018. ISSN 2168-2208. doi: 10.1109/JBHI.2017.2751509.
  • Falkner et al. (2018) Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. In ICML, pages 1436–1445, 2018. URL http://proceedings.mlr.press/v80/falkner18a.html.
  • Ghassemi et al. (2014) M. Ghassemi, L. Lehman, J. Snoek, and S. Nemati. Global optimization approaches for parameter tuning in biomedical signal processing: A focus on multi-scale entropy. In Computing in Cardiology 2014, pages 993–996, Sep. 2014.
  • GPyOpt authors (2016) The GPyOpt authors. GPyOpt: A bayesian optimization framework in python. http://github.com/SheffieldML/GPyOpt, 2016.
  • Hand and Till (2001) David J Hand and Robert J Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine learning, 45(2):171–186, 2001.
  • Jones et al. (2009) Alan E. Jones, Stephen Trzeciak, and Jeffrey A. Kline. The sequential organ failure assessment score for predicting outcome in patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation. Critical care medicine, 37(5):1649–1654, May 2009. ISSN 1530-0293. doi: 10.1097/CCM.0b013e31819def97. URL https://pubmed.ncbi.nlm.nih.gov/19325482. 19325482[pmid].
  • Klein et al. (2017) Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In Aarti Singh and Jerry Zhu, editors,

    Proceedings of the 20th International Conference on Artificial Intelligence and Statistics

    , volume 54 of Proceedings of Machine Learning Research, pages 528–536, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
    URL http://proceedings.mlr.press/v54/klein17a.html.
  • Klein et al. (2019) Aaron Klein, Zhenwen Dai, Frank Hutter, Neil Lawrence, and Javier Gonzalez. Meta-surrogate benchmarking for hyperparameter optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 6270–6280. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8857-meta-surrogate-benchmarking-for-hyperparameter-optimization.pdf.
  • Mao et al. (2019) Shunfu Mao, Yihan Jiang, Edwin Basil Mathew, and Sreeram Kannan. Boassembler: a bayesian optimization framework to improve rna-seq assembly performance, 2019.
  • Mayhew et al. (2020) Michael B. Mayhew, Ljubomir Buturovic, Roland Luethy, Uros Midic, Andrew R. Moore, Jonasel A. Roque, Brian D. Shaller, Tola Asuni, David Rawling, Melissa Remmel, Kirindi Choi, James Wacker, Purvesh Khatri, Angela J. Rogers, and Timothy E. Sweeney. A generalizable 29-mrna neural-network classifier for acute bacterial and viral infections. Nature Communications, 11(1):1177, 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-14975-w. URL https://doi.org/10.1038/s41467-020-14975-w.
  • Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996. ISBN 0387947248.
  • Nishio et al. (2018) Mizuho Nishio, Mitsuo Nishizawa, Osamu Sugiyama, Ryosuke Kojima, Masahiro Yakami, Tomohiro Kuroda, and Kaori Togashi. Computer-aided diagnosis of lung nodule using gradient tree boosting and bayesian optimization. PloS one, 13(4):e0195875–e0195875, Apr 2018. ISSN 1932-6203. doi: 10.1371/journal.pone.0195875. URL https://pubmed.ncbi.nlm.nih.gov/29672639. 29672639[pmid].
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
  • Snoek et al. (2012) J Snoek, H Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180, 2015.
  • Sweeney and Khatri (2017) TE Sweeney and Purvesh Khatri. Benchmarking sepsis gene expression diagnostics using public data. Critical care medicine, 45(1):1, 2017.
  • Sweeney et al. (2015) TE Sweeney, Aaditya Shidham, Hector R Wong, and Purvesh Khatri. A comprehensive time-course–based multicohort analysis of sepsis and sterile inflammation reveals a robust diagnostic gene set. Science Translational Medicine, 7(287), 2015.
  • Sweeney et al. (2016) TE Sweeney, Hector R Wong, and Purvesh Khatri. Robust classification of bacterial and viral infections via integrated host gene expression diagnostics. Science Translational Medicine, 8(346), 2016.
  • Sweeney et al. (2018) TE Sweeney, TM Perumal, and R et al. Henao. A community approach to mortality prediction in sepsis via gene expression analysis. Nat Commun, (1), 2018.
  • Tabe-Bordbar et al. (2018) S. Tabe-Bordbar, A. Emad, S.D. Zhao, and S. Sinha. A closer look at cross-validation for assessing the accuracy of gene regulatory networks and models. Sci Rep, 8(6620), 2018. doi: 10.1038/s41598-018-24937-4.
  • Tanaka and Iwata (2018) R. Tanaka and H. Iwata. Bayesian optimization for genomic selection: a method for discovering the best genotype among a large number of candidates. Theor Appl Genet, 131:93–105, 2018. doi: 10.1007/s00122-017-2988-z.
  • Thomas and Schwartz (2018) Marcus Thomas and Russell Schwartz. A method for efficient bayesian optimization of self-assembly systems from scattering data. BMC Systems Biology, 12(1):65, 2018. ISSN 1752-0509. doi: 10.1186/s12918-018-0592-8. URL https://doi.org/10.1186/s12918-018-0592-8.
  • Zhang et al. (2020) Yichi Zhang, Daniel W. Apley, and Wei Chen. Bayesian optimization for materials design with mixed quantitative and qualitative variables. Scientific Reports, 10(1):4924, 2020. ISSN 2045-2322. doi: 10.1038/s41598-020-60652-9. URL https://doi.org/10.1038/s41598-020-60652-9.