Multi-fidelity classification with Gaussian process
Machine learning techniques typically rely on large datasets to create accurate classifiers. However, there are situations when data is scarce and expensive to acquire. This is the case of studies that rely on state-of-the-art computational models which typically take days to run, thus hindering the potential of machine learning tools. In this work, we present a novel classifier that takes advantage of lower fidelity models and inexpensive approximations to predict the binary output of expensive computer simulations. We postulate an autoregressive model between the different levels of fidelity with Gaussian process priors. We adopt a fully Bayesian treatment for the hyper-parameters and use Markov Chain Mont Carlo samplers. We take advantage of the probabilistic nature of the classifier to implement active learning strategies. We also introduce a sparse approximation to enhance the ability of themulti-fidelity classifier to handle large datasets. We test these multi-fidelity classifiers against their single-fidelity counterpart with synthetic data, showing a median computational cost reduction of 23 target accuracy of 90 multi-fidelity classifier achieves an F1 score, the harmonic mean of precision and recall, of 99.6 both are trained with 50 samples. In general, our results show that the multi-fidelity classifiers outperform their single-fidelity counterpart in terms of accuracy in all cases. We envision that this new tool will enable researchers to study classification problems that would otherwise be prohibitively expensive. Source code is available at https://github.com/fsahli/MFclass.READ FULL TEXT VIEW PDF
Multi-fidelity classification with Gaussian process
Machine learning is a paradigm-shift field that is transforming many areas of engineering and sciences. Some of the most impressive applications of machine learning rely on the availability of large amounts of data that are used to create accurate classifiers that, for example, could discriminate between the absence or presence of a disease Hannun et al. (2019). However, data required to train a classifier may not be available in a sufficient amount, or it can be and expensive and lengthy to acquire, thus hindering the potential of machine learning in many areas of application. This is the case in studies that largely rely on state-of-the-art computational models, which often take several days to run and require a significant amount of resources before one data point is obtained. Despite these restrictions, computational modeling represents an attractive alternative to physical experiments, which can be even more expensive, time consuming, and may pose important restrictions due to ethical considerations. Several areas of the sciences and engineering are already benefiting from computational modeling, including the medical device industry Morris et al. (2016); Lee et al. (2018), car and aircraft manufacturing Forrester et al. (2008), and public policy design Santner et al. (2003), to name a few. Despite their massive adoption, researchers can often only afford to perform a few runs of complex models that can have hundreds of parameters Sahli Costabal et al. (2018a). This introduces a bottleneck for understanding the parameter influence and the uncertainties associated with the model. Nonetheless, there are usually lower fidelity approximations of the model that are less expensive to compute and can provide valuable information. For example, the finite element method is widely used to solve problems in solid and fluid mechanics. Here, a natural low fidelity model would be a model with a coarser mesh. Although this approximation may not accurately represent the target quantities of interest, it will likely capture how parameters changes are reflected in the output. Another alternative for low-fidelity models is to reduce the dimensionality of the problem, replacing a fully three-dimensional problem with a two- or one-dimensional representation Sahli Costabal et al. (2019)
. Finally, experimental data can be considered as an additional information source with variable fidelity, either high or low, depending on the application. In recent years, there has been an increased attention in the machine learning community to develop predictive methods that enable the effective fusion of variable fidelity information sources. Some techniques are specially tailored to estimate the uncertainty in the model predictionsSchiavazzi et al. (2017); Hurtado et al. (2017); Biehler et al. (2015); Peherstorfer et al. (2018); Qian et al. (2018) and parameter estimation Koutsourelakis (2009). To predict the model output, Kennedy and O’Hagan Kennedy and O’Hagan (2000) first proposed to use Gaussian process priors to perform multi-fidelity regression. This approach has been widely used Parussini et al. (2017) and has been extended for efficiency Le Gratiet and Garnier (2014) and to accommodate large data-sets Perdikaris et al. (2016). However, all these techniques have been solely developed in the regression setting, and none of them has yet been extended to classification problems.
There are many problems that would benefit from multi-fidelity tools for classification. Phase diagrams and most types of bifurcations in physical systems can be treated as a classification problem. Here, we will determine which state will be reached under a given set of parameters, initial or boundary conditions. A good example of this type of behavior is present in cardiac tissue dynamics. Due to the nature of excitable cells, under certain conditions, cardiac tissue can achieve arrhythmic states that are self-sustained, such as spiral waves Gray et al. (1998). Another example comes from the mechanical analysis of bi-layered systems, such as the white and grey matter in the brain. Here, it is hypothesized that, during development, different rates of growth in the two layers causes a buckling instability that could lead to the formation of either creases or ridges Budday et al. (2014); Holland et al. (2017).
In the Gaussian process setting, there are some fundamental differences between the classification and regression tasks that make the former harder to implement. Most notably, in the regression setting, a Gaussian likelihood is typically assumed for the data. Combined with a Gaussian process prior over the latent generating function, this allows for deriving an analytic expression for both the marginal likelihood and the predictive posterior of the regression model Rasmussen and Williams (2006). In the classification setting, the Gaussian likelihood assumption is not appropriate for modeling discrete class labels, and it is not possible to formulate an exact inference scheme for training the model parameters. Therefore, we must resort to approximate inference techniques, such as the widely used Laplace approximation and the expectation propagation algorithm Rasmussen and Williams (2006). Markov-chain Monte Carlo (MCMC) methods provide a powerful alternative to approximate posterior inference, but standard sampling schemes such as Gibbs and Metropolis-Hastings suffer from slow convergence rates due to strong correlations in the Gaussian process posterior Robert and Casella (2013). However, recent advances in Hamiltonian Monte Carlo have yielded improved sampling schemes that can perform well in this setting Neal (1999)
. In this work, we leverage recent developments in the machine learning community, such as automatic differentiation and parallelization, to implement a novel efficient multi-fidelity classifier using Gaussian process priors. We will also present a sparse version to enhance the computational expediency of our method for large data-sets. We will demonstrate its superior performance against single-fidelity classifiers with synthetic data, as well as its effectiveness on a realistic large-scale application to cardiac electrophysiology. Since Gaussian process classifiers have an associated variance with their predictions, they naturally enable the design of effective data acquisition policies via active learningKapoor et al. (2007); Gramacy and Polson (2017). In this way, it is possible to optimize the sampling strategy of expensive computer models to maximize accuracy under a constrained computational budget. We will show how this methodology is implemented in our multi-fidelity classifier and how it compares against standard single-fidelity classification models.
This paper is structured as follows: in Section 2 we introduce the multi-fidelity Gaussian process classifier, the sparse approximations and the active learning criteria, in Section 3 we test the classifier with a synthetic example to evaluate its performance, in Section 4 we use the classifier to identify a temporal region where an arrhythmia can be induced in cardiac tissue. We finalize this work discussing these results in Section 5.
We start this section by revisiting classification with Gaussian processes to form the base of the proposed multi-fidelity classifier. We later introduce the sparse approximations, the statistical inference technique we employ to train the multi-fidelity model, as well as the active learning approach for adaptive data acquisition.
We start by assuming that we have a dataset comprised of input/output pairs . The inputs contain features for training examples and the outputs contain the discrete class labels for the corresponding inputs. Here, we restrict the scope of this work to binary classification, thus the labels can only take on two values . We note that it is possible to extend this framework to the multi-class classification setting. The classical formulation of Gaussian process classification defines an intermediate variable which is computed from a latent function Rasmussen and Williams (2006). Throughout this paper, we will assume standardized data-sets and work with zero-mean Gaussian process priors of the form . Here, is a covariance kernel function, which depends on a set of parameters . We adopt a fully Bayesian treatment and prescribe prior distributions over these parameters, which we will specify later Neal (1999)
. To obtain class probability predictions we pass the Gaussian process outputthrough a nonlinear warping function , such that the output is constrained to [0,1], rendering meaningful class probabilities. We define the conditional class probability as . A common choice for
is the logistic sigmoid function, which we will use throughout this work. We assume that the class labels are independent according to a Bernoulli likelihood with probability Nickisch and Rasmussen (2008). Further, we choose an automatic-relevance determination (ARD) squared exponential kernel Rasmussen and Williams (2006), , parametrized by . We assume the following distributions for the parameters and :
The posterior distribution over the model parameters cannot be described analytically, and thus we resort to approximate-inference techniques to calibrate this model on the available data. To this end, we use the NO-U-Turn sampler (NUTS) Hoffman and Gelman (2014), which is a type of Hamiltonian Monte Carlo algorithm, as implemented in PyMC3 Salvatier et al. (2016). We use two chains, and set the target accept probability between 0.95 to 0.99 depending on the convergence. The first 1000 samples are used to adjust the step size of the sampler, and are later discarded. We use the subsequent 1000 samples to estimate the parameters . Once we have completed the inference, we can make predictions at new location
in three steps: First we compute the predictive random variable
, which by construction follows a multivariate normal distribution, with meanand covariance obtained by conditioning on the available data Rasmussen and Williams (2006):
where the covariance matrix results from evaluating the kernel function at the locations of the input training data and . Then we sample drawing parameters from the estimated posterior distributions. This will result in a number of realizations of that we finally pass through the logistic sigmoid function to obtain a distribution of class probabilities . Finally, we take the mean of this distribution to make a single prediction.
We assume that we have information sources, each with a different level of evaluation cost and fidelity, producing a class label , with . Here, level represents the highest, most accurate and expensive information source and 1 represents the most simplified and cheapest model available. We propose to model the cross-correlation structure between different levels using an auto-regressive model for the latent functions Kennedy and O’Hagan (2000):
Here, is a scalar parameter that needs to be inferred and is a function that aims to capture the bias in the predictions of the lower fidelity models. The role of is to capture linear correlations between the different fidelity levels. In general, can also be a function of leading to more complex schemes that can capture non-linear correlations Perdikaris et al. (2017). Following Kennedy and O’Hagan (2000), we complete our model specification by assuming independent Gaussian process priors for and . To simplify our presentation, and without loss of generality, we will assume levels of fidelity from now on: a low level, which we will denote with the subscript and a high level denoted with . Now we have a composed dataset with the low and high fidelity levels: . As a consequence of the auto-regressive model we have chosen, we can write the joint prior distribution of the latent functions as Kennedy and O’Hagan (2000):
Now the entire covariance matrix for the different levels of fidelity has a block structure, where and model the data in each fidelity level and models the cross-correlation between levels of fidelity, according to the priors specified in equations (5) and (6). We also have kernel parameters for the different levels of fidelity. In principle, we could use different kernels for each level of fidelity, but we decide to use the squared exponential kernel, which results in parameters , and . For these parameters and the scaling factor we set the following priors:
To increase the number of samples our can methodology handle efficiently, we introduce a sparse approximation of the Gaussian process prior. In particular, we replace the prior with another function Snelson and Ghahramani (2006); Quiñonero-Candela and Rasmussen (2005). Here, we will have inducing points , which we group in a matrix . The number of inducing points needs to be smaller than the number of data points to gain in computational cost, while their role is to effectively summarize any redundancies in the observed data Snelson and Ghahramani (2006). We can then compute our original latent variable with an independent covariance structure:
Here, we introduce the following matrices that evaluate the kernel: and , and , where . The parameter represents uncorrelated Gaussian noise in the regression setting. In this work, we deal with noise-free computer simulations and we set this parameter to help with the invertibility of the covariance matrix, which may be singular depending on the selected inputs Neal (1999). The main computational advantage comes from the size reduction of the full covariance matrix , which is to , which is . The cost of inverting the full matrix is that we replacing by inverting a smaller matrix with cost and the trivial inversion of the diagonal covariance matrix in equation (12). Now, the dominant operation in time complexity becomes the matrix multiplication needed to compute , which is .
We can make predictions at new locations as described in Section 2.1 with a new conditional mean and covariance:
with , , , , and .
In the multi-fidelity case, we will have a composite vector of inducing pointsand all the matrix evaluations are assembled using the structure in equation (8). Since, in general, we will have few high-fidelity points, we decide to place the inducing points at the same locations as the training data
. For the low-fidelity inducing points, we use the k-means algorithm to selectcluster centroids from the training inputs . Although we can, in principle, optimize the inducing point locations, most current approaches rely on variational inference Titsias (2009); Hensman et al. (2015), which we believe is beyond the scope of the present work.
Leveraging the posterior uncertainty estimates of Bayesian models, active learning strategies enable the judicious and cost-effective acquisition of new informative data with the goal of maximally enhancing the accuracy of the predictive model Cohn et al. (1996); MacKay and Mac Kay (2003)
. Both the single- and multi-fidelity classifiers proposed in this work are well suited for active learning. Here, we attempt to estimate the next point to sample our computational model that will increase the accuracy of the classifier the most. In the machine learning community, there are multiple data acquisition policies postulated for this purpose. In the context of Gaussian process classification, we use a simple heuristic that balances exploration and exploitation trade-off, namely the trade-off of sampling at new locations to reduce uncertainty (exploration) versus sampling at regions that are most informative for refining the target classification boundary (exploitation)Kapoor et al. (2007). This sequential refinement procedure starts by defining a set of candidate locations to sample . A typical approach is to sample the points that are closest to the classification boundary, where the class probability is 0.5. Considering the sigmoid function that we use, this is equivalent to the point at which the latent function is closest to 0. However, this methodology is highly sensitive to the initial set of points used to train the classifier. For example, if there are multiple boundaries and the initial classifier only detected one, this heuristic is very unlikely to detect the additional classification regions. To solve this issue, we take advantage of the predictive posterior uncertainty of . Specifically, in regions with a lot of training data, the posterior variance will be small, but as we move away from the training samples, the variance will increase. If we include this variance in our active learning scheme, we can also explore the parameter space, rather than exploiting the estimated boundary. Combining these two concepts, we solve the following minimization problem to select the next point to sample:
where and are the Monte Carlo estimates of the mean and variance of , since we are using the posterior distribution of the parameters to make predictions rather than point estimates. We solve the optimization problem in equation (15) by computing this metric for all candidates and selecting the one with the lowest score. Next, we evaluate the computer code at the parameters to generate a new class label and add this input/output pair to the data-set. Finally, we re-train the classifier with the new information and repeat the active learning process until a certain number of iterations has been reached. In the multi-fidelity case, we only perform active learning in the highest level of fidelity and assume that the lower levels are inexpensive enough to be sampled as needed.
We test the multi-fidelity classifier with a synthetic example in two dimensions. We define a low and high fidelity boundary with a sine function:
with . We note that the low fidelity boundary is a perturbed version of the high fidelity function, as can be seen in Figure 1. We set the number of kernel length scales to 1 for this case for all classifiers. We generate a synthetic data set using the following procedure: First, we consider low fidelity samples for the multi-fidelity classifier. To reduce the computational cost, we select 30 low fidelity samples that are close to the boundary and obtain the remaining 15 samples using a Latin hypercube sampling strategy Stein (1987) to cover the remaining of the parameter space. For the sparse multi-fidelity classifier, we use the same samples combined with inducing points, which results in a reduction of computational cost. To select the high fidelity samples, we first obtain 500 low fidelity samples and randomly pick where and where . This ensures that the initial samples are balanced between categories. We start the single-fidelity, multi-fidelity, and sparse multi-fidelity classifiers with the same samples to make fair comparisons. We test two cases: (i) We select samples as just described and run the classifier for each number 30 times for all three classifiers and (ii) We start with and use active learning until we reach and repeat this process 30 times for the single- and multi-fidelity classifiers. In both cases we test the accuracy of the classifiers using 1000 locations created with a Latin hypercube design. Finally, we compare whether one classifier is significantly more accurate than the other using non-parametric Wilcoxon signed-rank test Wilcoxon (1945). To compare whether the active learning achieves better accuracy, we use the Mann Whitney U test Mann and Whitney (1947). For both cases, we set the level of significance to 5%.
Figure 2 shows trace plots depicting the evolution of the inferred parameters as the Markov Chain Monte Carlo sampler for all three classifiers trained with . The kernel parameters and exhibit similar behaviors and magnitudes for all classifiers. All parameters show similar patterns for the dense and sparse multi-fidelity classifiers. Remarkably, the high-fidelity length scale parameter is larger than its low-fidelity counterpart , indicating that the details of the classification boundary are captured by the low fidelity Gaussian process and the differences of high and low fidelity are represented in the Gaussian process.
Figure 1 shows representative examples of the single- and multi-fidelity classifiers trained with active learning. At the beginning, the multi-fidelity classifier inherits the structure of the boundary that was learned from the low fidelity data. The single-fidelity classifier does not have this information and shows a poorer fit. The active learning strategy tends to reduce the error in both cases. However, this error reduction is achieved faster for the multi-fidelity classifier. To quantify the differences in accuracy, we first compare the classifiers without active learning (Figure 3, left). For sample sizes 10, 20, 30, 40 and 50, the multi-fidelity and sparse multi-fidelity classifiers are significantly more accurate than the single fidelity classifier (p 0.01). The multi-fidelity and sparse multi-fidelity classifiers behave similarly and there is no significant difference in accuracy for any sample size (p 0.16). The differences in median error range from 5.4% at 10 samples to 1.3% at 50 samples between the single fidelity and multi-fidelity classifiers. As the single-fidelity gains more information, the gap in accuracy is reduced. In the case of active learning, the trend is similar, where we see that the selected strategy reduces the error for both classifiers. Nonetheless, the multi-fidelity classifier has a higher median accuracy than the single-fidelity classifier. We quantify the effect of the the active learning strategy by comparing the accuracy of the case where we use the active learning strategy to achieve to the case where we start the classifier with . The effect for both classifiers is a statistically-significant error reduction (p 0.001). The median error is decreased 7.1% for the single-fidelity and 5% for the multi-fidelity classifier. This difference can be explained because it is easier to reduce the error when it is higher. Finally, we quantify the differences in computational cost between the classifiers. For a target error of 10%, the multi-fidelity classifier requires significantly less samples (p 0.005). The median number of samples is 24 for the single-fidelity classifier and 18.5 for the multi-fidelity classifier, representing a 23% decrease in computational cost for this particular example.
In this section, we use the multi-fidelity classifier to study the formation of arrhythmic conduction in cardiac tissue. Cardiac tissue is susceptible to the formation of spiral waves Gray et al. (1998) when a secondary electrical stimulus is applied within certain time and location after a first planar wave passes. This range of time to successfully initiate a spiral wave is known as the vulnerability window Moreno et al. (2012). It is hypothesized that, the larger this window, the greater the probability of arrhythmia for a given set of conditions including drugs or diseases. In the past, this quantity has been computed by brute-force in one-dimensional models of cardiac electrophysiology Moreno et al. (2012). However, the window computed in one dimension might not be an accurate representation of what happens in two-dimensions or in the whole heart. As we will see in this example, it is possible that the spiral wave extinguishes as it meanders in the domain. Here, we will employ a multi-fidelity classifier to determine the vulnerability window in a two-parameter space, using the one-dimensional approximation as our low fidelity model. We choose a two-dimensional model instead of the full heart Sahli Costabal et al. (2018b) to keep the computational cost tractable and maintain the ability to compute a test set for validating the accuracy of the classifier.
To generate a spiral wave, after a primary wave is elicited, we apply a stimulus that results in a secondary wave that propagates in the opposite direction of the primary wave. This is illustrated in Figure 4, top row, for the one-dimensional model. If we apply the stimulus too early (left), the primary wave will block it. If we apply the stimulus too late (right), a new wave will propagate in both directions. If we apply the stimulus within the vulnerability time-window, we obtain a single new wave that propagates in the opposite direction. In two dimensions, this results in a spiral wave, as can be seen in the bottom row of Figure 4.
Here, we use the monodomain equation with an Aliev-Panfilov cellular model Aliev and Panfilov (1996) to characterize the electrical behavior of cardiac tissue. We use a domain size of 50 mm for the one-dimensional cable and 50x50 mm for the two-dimensional case. We discretize in time and space with finite differences, using a spatial step of 0.25 mm and a temporal step of 0.01 ms. We set the conductivity to 0.1 mm/ms. We solve the monodomain equation with an explicit time integrator implemented in FORTRAN Sahli Costabal et al. (2018c). To test spiral wave initiation, we first apply a stimulus on the left boundary to create a planar wave. Then, for 5 ms, we apply a second stimulus at a given time in the central 2 mm of the cable and in the central 2 mm of the bottom half of the two-dimensional tissue patch (Figure 4, bottom left). Finally, in the one-dimensional model, we classify the spiral creation as successful if we detect an activation (dV/dt 0) at the left quarter but not at the right quarter of the cable after applying the second stimulus. We classify a successful spiral wave creation if we detect an activation (dV/dt 0) 300 ms after applying the second stimulus.
We begin by exploring two-dimensional parameter space, where we vary the time that we apply the second stimulus and the parameter of the Aliev-Panfilov model. We choose this particular parameter because we know that, besides controlling the action potential duration Sahli Costabal et al. (2016); Hurtado and Kuhl (2014), it controls how much the spiral wave meanders in the domain. For some values of , the wave will initially form, but it will instantly move to the boundaries of the domain, thus extinguishing itself. In these particular cases, we expect a disagreement in the labels between the low fidelity one-dimensional cable and the high fidelity two-dimensional simulation, causing the classification boundaries for the low and high fidelity data to be different. We set the range of the parameter to and the time interval to .
In this example, we allow the kernel to have two different length scales with the same priors for both classifiers. To obtain the low fidelity samples for the multi-fidelity classifier, we train a single-fidelity classifier starting with 10 samples until we reach 84 samples with active learning. We then train a single- and a multi-fidelity classifier starting with samples, where we select five samples from the low-fidelity samples with and five samples with . We use active learning until we reach 50 samples and compute the precision, recall, and F1 score using a test set of created with a Latin hypercube design (Figure 5, top right). In this example, it is not appropriate to use accuracy to assess the performance of the classifiers, since only 13% of the samples on the test belong to the class . If we used the accuracy as a metric, a classifier that only predicts would already have an accuracy of 87%. Precision is defined as the number of true positives divided by the sum of true and false positives. In other words, precision quantifies what proportion of the predicted labels were correctly predicted as . Recall is defined as the number of true positives divided by the sum of true positives and false negatives. This metric quantifies proportion of the true labels were correctly predicted as . Finally, the F1 score is the harmonic mean of precision and recall.
|precision [%]||recall [%]||F1 score [%]|
|# of samples||10||50||10||50||10||50|
Our results, summarized in Figures 5 and 6 and Table 1, show that the multi-fidelity classifiers outperform the single-fidelity classifier. Qualitatively, we observe that both the multi-fidelity classifier and sparse multi-fidelity (Figure 5, middle and middle right) have placed most of the samples near the boundary, while the single-fidelity classifier (Figure 5, middle left) spend a significant portion of the samples exploring the entire parameter space. Remarkably, the active learning heuristic detected the difference in classification boundaries between the low fidelity (Figure 5, top middle) and the high fidelity data. Quantitatively, both multi-fidelity classifiers have better precision and recall with than the single-fidelity classifier with , as seen in Table 1 and Figure 5, bottom row. This trend continues as the number of samples acquired by active learning increase. The multi-fidelity classifier achieves 100% precision and 99.6% F1 score with .
Figure 6 shows the posterior distribution for the parameters in all three classifiers after trained with 40 iterations of active learning. Similar to the synthetic data example, the low fidelity length scale parameters tend to be smaller than their high fidelity counterparts , meaning that the detail of the classification boundary region is characterized by the low fidelity function. For the single-fidelity classifier, the length scale parameters tend to be larger than , which results in a less defined boundary as can be seen in Figure 5, bottom left. Both multi-classifiers display similar distributions for all parameters.
We study a three-dimensional parameter space by also varying the parameter of the Aliev-Panfilov, along with the parameter and the impulse time. This problem requires a large number of samples to characterize the low-fidelity boundary, therefore we use the sparse multi-fidelity classifier. We vary the parameter in the interval , in and the time to apply the stimulus between 105 and 160 ms. We first compute 1000 low-fidelity samples with a Latin hypercube design to examine the boundary. Due to the small amount of samples, we construct a dataset with all 109 samples along with 400 samples, resulting in . We then select , where 15 samples came from low-fidelity samples with and 15 samples came from low-fidelity samples with . We use inducing points for the low fidelity data. We train the classifier for 100 active learning iterations and compute the vulnerability window. This quantity is approximated by integrating the predicted class probability over the impulse time parameter. We approach this using a uniform grid spaced every 0.5 ms in 21x21 locations. The results are shown in Figure 7, where the low-fidelity points are displayed on the left, and the resulting high-fidelity points after 100 iterations of active learning are shown on the middle panel. Remarkably, most of the points of the high-fidelity model are located near the boundary, expending a small amount of the computational budget exploring the parameter space. Results on the vulnerability window are presented on the right of Figure 7, showing that for certain parameter combinations, this quantity is zero, where it is possible avoid the formation of spiral waves. Some other combination of parameters may provide a larger opportunity to induce a spiral wave.
In this work, we present a novel multi-fidelity classifier using Gaussian process priors. While the multi-fidelity paradigm has been proposed for regression Kennedy and O’Hagan (2000) and uncertainty quantification Schiavazzi et al. (2017), this work represents one of the first attempts to formulate and implement a fully-Bayesian multi-fidelity classifier. Based on an autoregressive Gaussian process prior that enables the seamless integration of data with different levels of fidelity, our classifier outperforms the single-fidelity classifier both with synthetic data and in an application of cardiac electrophysiology. We adopt a fully Bayesian approach and set priors to the kernel hyper-parameters using efficient implementations of Markov-chain Monte-Carlo samplers Salvatier et al. (2016); Hoffman and Gelman (2014). We also introduce a sparse approximation to extend the amount of data that our methodology can handle.
To strike a balance between classification accuracy, data efficiency, and computational cost we propose an active learning approach that takes advantage of the predictive uncertainty in our predictions Kapoor et al. (2007). This approach significantly reduced the error in both examples shown here, showing a good combination of exploration of the parameter space and exploitation of the boundary. Other authors have proposed sampling points of maximum information entropy Gramacy and Polson (2017)
; however, it is easy to see that the approach used here is similar. The entropy would be maximum for a uniform distribution, which, in our case, would be equivalent to a prediction ofwith zero mean and large variance. Such case would be close to minimize our active learning heuristic in equation (15). The actual minimum is for the case of zero mean and infinite variance, however the variance will be bounded for real applications. At the same time, the entropy will be minimum when all predictions are or . In our implementation, this would correspond to a mean prediction of or , which maximizes our heuristic.
Although the multi-fidelity and sparse multi-fidelity classifier work better in all examples presented here, the differences are greater in the cardiac electrophysiology example. This suggests that multi-fidelity classifiers are advantageous when there is class imbalance: only a small region of the parameter space is labelled with a particular class. The multi-fidelity classifiers already have a candidate boundary that is encoded by the low fidelity data and it only needs to infer a simpler function to account for the differences in the low and high fidelity boundaries. In contrast, a single-fidelity classifier would spend most of its computational budget exploring the remaining parameter space. In the two-dimensional parameter space example, the low recall values for the single-fidelity classifier suggest that it is struggling to detect the region, while the multi-fidelity classifiers have this information encoded in the low-fidelity data. The active learning training seems to be more effective in the multi-fidelity classifier than in the sparse version of it. Although the sparse multi-fidelity classifier starts with better accuracy at , it progresses at a slower pace than the full multi-fidelity classifier, achieving an F1 score 94.3%. Since our active learning metric depends on the variance of the Gaussian process, this could be related to the independent covariance assumption used to reduce the computational cost in the sparse version. Overall, the cardiac electrophysiology application highlights the potential of this methodology. Combined with more complex electrophysiology models, this tool could assess the effect of drugs on the inducibility of arrhythmias, for example.
Although we have successfully used our multi-fidelity classifiers in the examples presented here, our methodology suffers from some limitations that open new research directions. First, as we mention in Section 2.3, the full classifier scales with and the sparse version with , where is the number of data points and is the number of inducing points Snelson and Ghahramani (2006); Titsias (2009). Additionally, using Markov-chain Monte-Carlo methods to perform the statistical inference is expensive. Even though the implementation we use here is highly efficient, training our classifiers took in the order of 10 minutes on a laptop for the two-dimensional example at , and up to two hours for the three-dimensional case that considered a sparse classifier with and . This suggests that the methodology presented here will be useful when the time to compute the high fidelity model is on the order of several hours, which is not unusual. The sparse implementation allow us to drastically increase the number of data points we can handle, but we are still bounded to a regime of at most to obtain reasonable training times, which is typical for Gaussian processes. A future direction to reduce the cost of training the classifier is to use variational inference Hensman et al. (2014, 2015) instead of Markov chain Monte Carlo. This methodology would allow us to significantly increase the number of samples we can handle at the cost of loosing expressiveness and interpretability of the Gaussian processes. Another area to improve our methodology is that currently, we need to retrain the classifier from scratch after each active learning iteration. An alternative could be to use the posterior distributions of the parameters as priors for the next iteration to accelerate the convergence of the Markov-chain Monte-Carlo sampler or to use a sequential Monte-Carlo sampler with online updates Gramacy and Polson (2017).
In summary, we propose a multi-fidelity classifier with active learning to efficiently predict the binary output of computer simulations. We envision that our new tool will help researchers to address problems that were prohibitively expensive without this approach.
This work was supported by the School of Engineering Postdoctoral Fellowship from Pontificia Universidad Católica de Chile, the FONDECYT-Postdoctorado 3190355 awarded to F.S.C. and by the US Department of Energy under the Advanced Scientific Computing Research grant DE-SC0019116 to awarded P.P. This publication has received funding from Millenium Science Initiative of the Ministry of Economy, Development and Tourism of Chile, grant Nucleus for Cardiovascular Magnetic Resonance.
Cohn, D. A., Ghahramani, Z., Jordan, M. I., 1996. Active learning with statistical models. Journal of artificial intelligence research 4, 129–145.
Hannun, A. Y., Rajpurkar, P., Haghpanahi, M., Tison, G. H., Bourn, C., Turakhia, M. P., Ng, A. Y., 2019. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature medicine 25 (1), 65.
Kapoor, A., Grauman, K., Urtasun, R., Darrell, T., 2007. Active learning with gaussian processes for object categorization. In: 2007 IEEE 11th International Conference on Computer Vision. IEEE, pp. 1–8.
Neal, R., 1999. Regression and classification using gaussian process priors. Bayesian statistics 6, 475–501.