Suppose we have a random sample of size n satisfying the following relationship,
where the nonparametric regression function, which maps the predictors to the response, is unknown. This function is determined by the data alone, without taking a prespecified structure. Nonparametric regression aims to identify the relationship between the predictors and the responses and then to make predictions on a new data set based on that relationship. When the predictor is one-dimensional, the goal of locally approximating a target function is referred to as nonparametric function estimation. Moreover, when the responses take discrete values (e.g., class labels), the function
is estimated using classification algorithms. In this article, we focus on multi-dimensional or high-dimensional data, which are very common in real-world applications.
A common way of estimating an unknown mean function is to express it as a sum of the functions
where each function is specified nonparametrically. The most widely used form is a weighted sum of basis functions, in which unknown coefficients weight the elements of a basis set, each with its own parameters. For recovering a regression function, it is important which basis set is selected and then how the coefficients are estimated. Other basis sets, such as decision trees and splines, are also available.
There has been much research on constructing such functions, selecting basis elements, and developing estimation techniques for multivariate data. The first approach comprises kernel-based methods, which are connected to the reproducing kernel Hilbert space (RKHS). By the representer theorem (kimeldorf1971some), a regression function over the RKHS can be expressed as
where the kernel is a positive-definite real-valued function (see wahba1990spline for details). A well-known solution to regularization problems in an RKHS is the Support Vector Machine (SVM) (boser1992training; cortes1995support) with the kernel trick, which leads to computational efficiency. tipping2000relevance developed a probabilistic SVM by placing a Gaussian prior on the coefficients, which obtained a sparser representation than the SVM.
Moreover, another kernel-based approach is to take advantage of overcomplete bases. In the Bayesian framework, an example of methods using an overcomplete system is the Lévy adaptive regression kernels (LARK) model, first proposed by tu2006bayesian. It approximates target functions by adaptive basis expansions of elements in an overcomplete system. The main advantages of the LARK model are that it extracts features and yields sparse representations of functions. ouyang2008bayesian proposed sparse additive models using a multivariate Gaussian kernel with diagonal covariance as an extension of the LARK method to multi-dimensional cases.
The second approach is to use spline functions. The most representative spline-based model is the multivariate adaptive regression splines (MARS) introduced by friedman1991multivariate. The MARS takes the form of a weighted sum of spline functions,
where each term is a tensor product of univariate linear spline functions with its own parameter vector. It has the advantages of capturing nonlinear relationships and interactions between variables and of simplifying high-dimensional problems into low-dimensional settings. denison1998bayesian and francom2018sensitivity
proposed Bayesian approaches to the MARS and improved predictive performance compared to the original model. The neural network (NN) with two layers of hidden units can also be represented as a sum of spline functions as
where the weights and the bias parameterize each hidden unit. The ReLU activation equals the linear spline function in the tensor product bases of the MARS. Recently, park2021labs proposed the Lévy adaptive B-spline regression (LABS), which remedies the disadvantage of the LARK mentioned above by using a variety of B-spline bases as elements of an overcomplete system.
The third approach is ensemble methods, which combine several decision trees; that is, each base learner is a single tree model. These are divided into two main categories: bagging (breiman1996bagging) and boosting (freund1999short; friedman2001greedy)
. Bagging builds many trees from different bootstrapped samples and averages their results. As an improved bagging model, the random forest (breiman2001random) constructs many independent trees based on random subsets of the features and combines them. Boosted trees sequentially estimate regression trees and aggregate them to form a strong tree model. chen2016xgboost
developed a scalable and enhanced version of the gradient boosting algorithm named extreme gradient boosting. In the Bayesian framework, chipman2010bart proposed the Bayesian additive regression trees (BART), which construct the function as
where each term consists of a tree structure and a set of parameters at its terminal nodes (also called leaves). The BART has become quite popular owing to its theoretical results and outstanding empirical performance. linero2018dart and linero2018sbart enhanced the BART model by placing a sparsity-inducing Dirichlet prior for high-dimensional problems.
In this paper, we develop a fully Bayesian nonparametric regression with tensor products of B-spline bases based on Lévy process priors and call it Multivariate Lévy Adaptive B-Spline regression (MLABS). The MLABS model adaptively represents the mean function as a sum of basis functions. There are three main contributions of this work. First, the MLABS can build predictive models for regression and classification, going beyond univariate models such as the LARK and LABS. Since Lévy process priors encourage sparsity in the expansions and the tensor product bases are formed as products of only a few univariate B-spline functions, it is capable of analyzing multi- or high-dimensional datasets. Second, the proposed method can adapt to varying smoothness of functions in multi-dimensional data by changing the set of degrees of the tensor product basis functions. In particular, the local support of the B-spline basis also enables more delicate predictions than other existing methods on non-smooth surface data. Finally, the MLABS model has comparable performance on regression and classification problems. Empirical results demonstrate that the MLABS has more stable and accurate predictive abilities than state-of-the-art regression models.
The outline of the paper is as follows. In Section 2, we introduce two Bayesian nonparametric regression models using Lévy process priors, i.e., the LARK and LABS models. In Section 3, we propose an extension of the LARK and LABS models for multivariate analysis. The posterior computation and details of the tensor product bases used in the proposed model are also presented. Simulation experiments comparing the predictive performance of our method with others are provided in Section 4. In Section 5, the proposed model is applied to both regression and classification problems using several real-world data sets. We conclude the paper with a discussion in Section 6.
We provide a review of the Lévy adaptive regression kernels and Lévy adaptive B-spline regression models as core concepts of our proposed method. In this section, we consider a one-dimensional predictor space.
2.1 Lévy adaptive regression kernels
Let the parameter space be a complete separable metric space, and let a Lévy measure on it satisfy the integrability condition
for each compact set . Then the Lévy random measure can be expressed through a Poisson random measure with mean measure as
We write this to mean that the random measure follows a Lévy distribution which has the characteristic function
Consider a real-valued function defined on the input and parameter spaces. A real-valued random function can be constructed by
Here, we call this function a generating function. The Poisson integral (3) is well defined for all bounded generating functions. If the Lévy measure is finite, the Lévy random measure can be represented as a finite sum of weighted point masses, where
the number of support points follows a Poisson distribution with a finite mean. Hence, equation (3) can be expressed as a random finite sum:
This implies that specifying prior distributions for the Lévy random measure in (3) and for the parameters of the expansion (4) are equivalent. However, if the Lévy measure is infinite, then the number of support points will be infinite almost surely. For practical posterior inference, tu2006bayesian made use of a truncation method to discard infinitely many small jumps and approximate the Lévy measure by a finite one.
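Under a finite Lévy measure, a draw from the prior amounts to a compound Poisson sum as in (4). The sketch below simulates one such random function; the Gaussian kernel, the standard-normal magnitudes, and all names are illustrative choices, not the paper's specification:

```python
import math
import random

def draw_prior_function(nu_plus, domain=(0.0, 1.0), scale=0.25, seed=0):
    """Draw one random function from a compound Poisson (finite Levy) prior:
    J ~ Poisson(nu_plus) support points, each with a location omega_j and a
    magnitude beta_j.  The Gaussian kernel and normal magnitudes are
    illustrative choices, not the paper's specification."""
    rng = random.Random(seed)
    # Poisson(nu_plus) draw by CDF inversion (fine for moderate means).
    j, term = 0, math.exp(-nu_plus)
    cdf, u = term, rng.random()
    while u > cdf:
        j += 1
        term *= nu_plus / j
        cdf += term
    omegas = [rng.uniform(*domain) for _ in range(j)]   # support points
    betas = [rng.gauss(0.0, 1.0) for _ in range(j)]     # jump magnitudes

    def f(x):
        return sum(b * math.exp(-((x - w) / scale) ** 2)
                   for b, w in zip(betas, omegas))
    return f, j

f, num_jumps = draw_prior_function(nu_plus=5.0)
```

With a zero mean measure the sum is empty and the drawn function is identically zero, which mirrors the role of the Poisson count in the expansion.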
The LARK model is summarized as follows.
where the prior distribution of the parameters is also specified, and the conditional distribution for the magnitudes has a hyperparameter. tu2006bayesian focused on infinite Lévy measures of the gamma, symmetric gamma, and symmetric α-stable (SαS) processes. The suggested generating functions, serving as elements of an overcomplete system, include Gaussian kernels, Laplace kernels, and Haar wavelets.
2.2 Lévy adaptive B-spline regression
The LABS model was designed to simultaneously use various B-spline basis functions to capture all parts of functions with locally varying smoothness. Thus, the mean function of the LABS model can be defined as
where a prespecified subset of degree numbers of the B-spline bases is used and each term is a B-spline basis of the corresponding degree with a knot sequence, defined as
The LABS model adopts the B-spline basis functions instead of specific kernel functions as a generating function. Each Lévy random measure has a Lévy measure satisfying the finiteness condition for all degrees.
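For concreteness, a single B-spline basis function of a given degree can be evaluated with the Cox–de Boor recursion. This is a generic sketch; the function name and knot layouts are ours, not from the paper:

```python
def bspline(x, degree, knots, i=0):
    """Evaluate the i-th B-spline basis function of the given degree on the
    knot sequence `knots` via the Cox-de Boor recursion.  A single basis of
    degree q is supported on q + 2 consecutive knots."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + degree] - knots[i]
    right_den = knots[i + degree + 1] - knots[i + 1]
    left = 0.0 if left_den == 0.0 else \
        (x - knots[i]) / left_den * bspline(x, degree - 1, knots, i)
    right = 0.0 if right_den == 0.0 else \
        (knots[i + degree + 1] - x) / right_den * bspline(x, degree - 1, knots, i + 1)
    return left + right

peak = bspline(0.5, 1, [0.0, 0.5, 1.0])   # hat function peaks at its middle knot
```

The local support visible here (zero outside the knot span) is exactly the property the LABS model exploits to capture locally varying smoothness.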
Since the LABS model assumes finite Lévy measures, the mean function (5) can also be expressed as a random finite sum:
where the number of basis functions of each degree is Poisson-distributed. park2021labs chose the following prior distributions for the knot points (locations) and the magnitudes.
Although the Lévy measures satisfying the integrability condition (2) may be infinite, the stochastic integrals and sums above are well defined due to the properties of the B-spline basis.
The LABS model can be represented in a hierarchical structure as follows:
for each degree. The parameters of the LABS model have varying dimensions since the number of basis functions is stochastically determined by the Lévy random measure. In this case, park2021labs applied the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm proposed by green1995reversible for posterior inference.
3 Proposed model
In this section, we propose an extension of the LABS model, which can only cope with data having one variable, to multivariate analyses.
3.1 Model specifications
General tensor product B-spline bases require many computations as the number of variables increases. This problem is the so-called “curse of dimensionality”: the computational burden increases exponentially with dimension. We apply the structure of the basis functions of the (Bayesian) MARS to those of the LABS model to lessen the computational effort. The idea of tensor products of B-spline bases was initially proposed by bakin2000parallel. We consider general basis functions without restricted degrees. The MARS model approximates an unknown function as a weighted sum of basis functions, each a product of univariate spline functions, for handling multi-dimensional or high-dimensional data. That is, a combination of main-effect terms and lower-order interactions is enough to represent an unknown function.
We first define the tensor product of B-spline bases used as a generating function:
where each tensor product has an interaction order, a degree number for each univariate B-spline basis, an index determining which variable is used, and a knot sequence on the domain of that variable. We collect the parameters of each tensor product of B-spline bases into a single parameter vector and assume the parameter space is a complete separable metric space. Then, we can rewrite each basis function in terms of this parameter vector.
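As a minimal illustration of such a tensor product, the sketch below multiplies degree-1 (hat) B-splines over a chosen subset of variables; the names and the restriction to degree 1 are our simplifying assumptions:

```python
def hat(x, t0, t1, t2):
    """Degree-1 B-spline (hat function) on knots t0 < t1 < t2."""
    if t0 <= x <= t1:
        return (x - t0) / (t1 - t0)
    if t1 < x <= t2:
        return (t2 - x) / (t2 - t1)
    return 0.0

def tensor_basis(x, terms):
    """Tensor product basis: the product of univariate B-splines, one per
    selected variable.  `terms` maps a variable index to its knot triple."""
    out = 1.0
    for var_idx, (t0, t1, t2) in terms.items():
        out *= hat(x[var_idx], t0, t1, t2)
    return out

# An order-2 interaction of variables 0 and 2 of a three-dimensional input.
val = tensor_basis([0.5, 0.9, 0.25], {0: (0.0, 0.5, 1.0), 2: (0.0, 0.25, 0.5)})
```

Because each factor involves only one variable, the cost of evaluating a basis function grows with the interaction order rather than with the full dimension, which is the point of the MARS-style construction.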
The mean function of the MLABS model can be formulated by
where the intercept term is fixed, the number of basis functions is a Poisson random variable, and the basis parameters are i.i.d. from a common distribution. The main differences from the LABS model are the structure of the basis functions and the randomness of the degrees of the B-spline bases. The prespecified degree numbers of the basis functions are fixed in the LABS model but random in the MLABS model. The mean function (9) can also be expressed as a stochastic integral
with respect to a Lévy random measure whose Lévy measure satisfies the finiteness condition.
We follow the priors of the LABS model (7) for the common parameters and additionally place priors on the parameters in the basis functions, including the interaction orders, variable indices, and degrees. Following nott2005efficient, these parameters are assumed to follow discrete uniform distributions over predetermined sets, and we assume independent prior distributions for them. In detail, the prior on the interaction order is uniform up to the maximum degree of interaction for the tensor product basis; we set this maximum below 3 in most experiments of Sections 4 and 5. The prior for the variable index is a uniform distribution that puts equal weight on the indices of candidate predictors. The prior for the degree is uniform on the prespecified subset of degree numbers of the B-spline bases. Note that the prior for a knot sequence is uniform over its support since the length and support of a knot sequence depend on the degree number and the variable index, respectively. Below we summarize the MLABS model:
and we set the remaining hyperparameters to default values (e.g., based on the sample variance of the responses).
3.2 Comparisons between basis functions of MLABS and MARS
The main difference between the basis functions of the MLABS model and the (Bayesian) MARS model is the form of the univariate basis functions in each element of the tensor product. Thus, their basis functions have very different parameters, too. The tensor product spline basis of the MARS is given by
where each factor has a sign indicator and a single knot point. The interaction order and variable index of the MARS are the same as those of the MLABS.
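For comparison, a MARS tensor product basis can be sketched as a product of hinge functions, one sign and one knot per factor; the parameter names here are ours:

```python
def mars_basis(x, terms):
    """MARS tensor product basis: the product over factors of the hinge
    [s * (x_v - t)]_+, with sign s in {-1, +1} and a single knot t per
    factor."""
    out = 1.0
    for var_idx, sign, knot in terms:
        out *= max(0.0, sign * (x[var_idx] - knot))
    return out

# (x0 - 0.5)_+ * (0.4 - x1)_+ evaluated at x = (0.7, 0.2)
b = mars_basis([0.7, 0.2], [(0, +1, 0.5), (1, -1, 0.4)])
```

Unlike the compactly supported B-spline factors of the MLABS, each hinge factor is nonzero on an entire half-line, which is the structural contrast discussed next.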
First, the number and the locations of the knot points in the basis functions are quite unlike each other. A B-spline basis of a given degree in the MLABS needs several knot points, whose locations are freely chosen in the domain of the corresponding variable. In contrast, the univariate basis function of the MARS has only one knot point, which is set at a data point. In the Bayesian MARS, the prior distribution for the knot point is uniform over the observed data values. We fit the MLABS model and the MARS model to data generated from a piecewise smooth function with two-dimensional support provided by imaizumi2019deep at equally spaced points on the unit square. Figure 1 reveals a considerable difference between the numbers of knot points used in the two methods and shows that they place knot points either at or independently of the data points.
Second, while the degrees of the basis functions in the MARS model are fixed, those in the MLABS model are random and comprise various combinations of predetermined degree numbers. Furthermore, a degree parameter is added to the basis functions in the modified Bayesian MARS approach of francom2018sensitivity, although it is still fixed in the (Bayesian) MARS. Figure 2 shows that the MLABS model needs more basis functions and uses more diverse types of basis functions than the MARS model to estimate an unknown surface. In particular, some of the tensor product bases in the MLABS model have very small local support, unlike those of the MARS. These parts lead to accurate estimates of spatially varying surfaces.
3.3 Posterior inference
The structure of the MLABS model is similar to that of the LABS model, although we modified the form of the basis functions from the univariate case to the multivariate case. Thus, we follow most of the posterior computation steps of park2021labs but incorporate update steps for the newly added parameters, such as the interaction orders, variable indices, and degrees, into the existing MCMC algorithm. The joint posterior distribution of the MLABS model (10) is given by
where the likelihood function is based on the data-generating mechanism (1).
We sum up the posterior sampling schemes of the MLABS model based on the RJMCMC algorithm. Each element of the parameter set consists of a coefficient and its basis parameters, including a knot sequence, and the number of coefficients (or basis functions) in the current model varies. The RJMCMC algorithm consists of three updating steps to sample from the posterior distribution: the birth step, the death step, and the relocation step. The probabilities of exploring the birth, death, and relocation steps sum to one, and each step is chosen according to these probabilities.
The birth step decides whether or not to add a new component generated from the proposal distributions; i.e., this updating phase allows the sampler to move from a current state to a new state with one more component. On the contrary, the death step decides whether or not to remove one of the existing components. Finally, the relocation step only updates the component parameters without altering the dimensionality of the parameters. The updating scheme of this step is the same as in standard MCMC methods, such as Gibbs sampling or the Metropolis-Hastings algorithm. The acceptance ratio in each move step is given by
where the current model parameters and the number of tensor product basis functions describe the current state, and the corresponding quantities describe the new state. The jump proposal distribution proposes a new state given the current state. We follow the jump proposals of lee2020bayesian for each move step. The posterior samples for the remaining parameters are drawn from each full conditional distribution. See park2021labs for more details on the posterior computation.
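The move-selection and accept/reject logic of such a reversible jump sampler can be sketched as follows; the probabilities, the cap on the number of terms, and the function names are illustrative, not the paper's settings:

```python
import math
import random

def choose_move(k, p_birth=0.4, p_death=0.4, k_max=200):
    """Pick a move type given the current number of basis functions k.
    Death is forbidden when the model is empty and birth at the cap k_max;
    the probabilities and cap are illustrative."""
    u = random.random()
    if u < p_birth and k < k_max:
        return "birth"
    if u < p_birth + p_death and k > 0:
        return "death"
    return "relocate"

def accept(log_post_new, log_post_cur, log_jump_ratio):
    """Metropolis-Hastings acceptance for a proposed jump: accept with
    probability min(1, posterior ratio x proposal ratio)."""
    log_alpha = (log_post_new - log_post_cur) + log_jump_ratio
    return math.log(max(random.random(), 1e-300)) < min(0.0, log_alpha)
```

Working on the log scale keeps the ratio numerically stable when the likelihood terms are small, which matters once the model holds many basis functions.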
In practice, the LABS model suffered from inefficient sampling of knot points because they were sampled uniformly from the domain regardless of the distribution of the data points. This caused proposed knot points to lie far from the data points. As a result, the LABS model generated unnecessary B-spline bases and wasted many MCMC iterations.
To solve this problem, we introduce new knot proposal schemes in the MLABS model. We illustrate the proposal processes for knot points using Figure 3. First, in the lowest-degree case (panel (a) of Figure 3), a data point is uniformly sampled and the remaining knot points are generated from the intervals on either side of it. Here, the domain is expanded beyond its endpoints for boundary data points; in practice, we expand it by a fixed multiple of its length from the endpoints. Second, in the next case (panel (b) of Figure 3), a data point is uniformly sampled and set as an interior knot, and the neighboring knots are generated from the adjacent intervals. The third and fourth cases (panels (c) and (d) of Figure 3) proceed analogously, generating the additional knots from the intervals to the left and right of the sampled data point. These data-dependent knot proposals lead to faster convergence than the LABS model.
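A minimal version of such a data-dependent proposal for a three-knot basis might look like this; the function name, parameter names, and the default expansion multiplier are our assumptions:

```python
import random

def propose_knots_deg1(data_x, lo, hi, c=0.1, rng=random):
    """Data-dependent proposal for a three-knot (degree-1) basis: centre the
    middle knot t1 at a uniformly chosen data point, then draw t0 below and
    t2 above it, with the domain [lo, hi] expanded by c * (hi - lo) on each
    side so boundary data points keep full support."""
    pad = c * (hi - lo)
    lo_e, hi_e = lo - pad, hi + pad
    t1 = rng.choice(data_x)        # knot anchored at an observed point
    t0 = rng.uniform(lo_e, t1)     # left knot in the expanded lower interval
    t2 = rng.uniform(t1, hi_e)     # right knot in the expanded upper interval
    return t0, t1, t2
```

Anchoring the middle knot at a data point concentrates proposed bases where observations actually lie, which is the source of the faster convergence described above.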
3.4 Binomial regressions for MLABS
Generalized linear models can cope with non-Gaussian data. We can further extend the MLABS model (10) to generalized linear models by introducing a distribution and a link function into the model as
In this subsection, we focus on binary regression. Thus, the link function will be either the logit or the probit function for the binomial distribution. For example, the logit model of the MLABS can be defined as
where the success probability is modeled through the logit link applied to the basis expansion. The priors for the remaining parameters are identical to those of the MLABS model (10) for regression. For the logit model, the posterior distribution of the coefficients has no closed form and is approximated using the Metropolis-Hastings sampler.
With the probit link function, model (11) takes the form
where, following albert1993bayesian, we introduce latent variables such that
Then, the normal prior for the coefficients gives a conjugate Gibbs-sampling update, unlike the logit model. The full conditional of each latent variable is given by
where the full conditional is a truncated normal distribution with the stated mean, variance, and support. The posterior samples for the latent variables are drawn from the full conditional after the RJMCMC algorithm, as illustrated in subsection 3.3. The remaining model parameters have the same prior distributions as in the MLABS model (10). We use the probit link in the MCMC algorithm for efficient posterior sampling.
Figure 7 in Appendix A shows that the MLABS model produces visually more reasonable decision boundaries than the state-of-the-art classifiers. In other words, the MLABS model can form different and flexible decision boundaries by changing the degrees or interaction orders in the tensor product basis function (8).
4 Simulation studies
In this section, we measure the performance of the MLABS model (10) and competing methods on simulated regression data sets. We first consider three test functions with bivariate predictors: the radial and complex interaction functions of hwang1994regression and the non-smooth test function of imaizumi2019deep. The two test functions of hwang1994regression are smooth. Second, we take the examples proposed by friedman1991multivariate as benchmark data sets in multivariate nonparametric regression. One of Friedman’s test functions is widely used to assess variable selection performance in high-dimensional data. For all test functions, we generate 100 pairs of held-in data with independent Gaussian noise and held-out data to evaluate the predictive performance based on the root-mean-square error (RMSE)
where the RMSE is computed over a held-out test set.
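The RMSE criterion itself is straightforward to compute over a held-out set:

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error of predictions over a held-out test set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
```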
For comparison, we consider several competitive alternatives, including the multivariate adaptive regression splines of friedman1991multivariate (denoted by MARS), a modified version of the Bayesian MARS of francom2018sensitivity (denoted by BASS), the LARK model using multivariate Gaussian kernels of ouyang2008bayesian (denoted by BARK), support vector machines with radial basis function (RBF) kernels of boser1992training; cortes1995support (denoted by SVM), a fully connected neural network with two hidden layers (15 nodes each) using sigmoid activation (denoted by NN), random forests of breiman2001random (denoted by RF), accelerated gradient-boosted decision trees of chen2016xgboost (denoted by XGB), and Bayesian decision tree ensembles: Bayesian additive regression trees of chipman2010bart (denoted by BART) and BART using soft decision trees of linero2018sbart (denoted by SBART). All competing models were implemented in R packages: earth, BASS (francom2020bass), bark, e1071 (meyer2015support), keras, randomForest, xgboost, BayesTree, and SoftBart, respectively.
The hyperparameters for all methods are chosen using a grid search with five-fold cross-validation. The MLABS model has seven tuning parameters. We set several of them to default values, and the remaining parameters are optimized by a cross-validated grid search over parameter grids. The hyperparameter candidates of all methods used in all experiments of this section are given in Appendix B. We also run the MLABS model for 100,000 iterations, with the first 50,000 iterations discarded as burn-in, and retain every 50th sample.
4.1 Surface test functions
For each surface test function, in-sample data sets are generated from the true function at equally spaced grid points on the unit square. We also add independent normally distributed noise to the true target functions, selecting the noise standard deviation such that the root signal-to-noise ratio (RSNR) is 1 or 5. We use 2,500 additional data points generated independently and uniformly on the unit square as out-of-sample data. The three true surfaces are given by
where the last surface involves an indicator function. They are visualized in Figure 4.
In this example, we add the thin plate spline (TPS) as a benchmark technique since it is commonly used for the smooth interpolation of two-dimensional data. The TPS is also referred to as a generalization of the smoothing spline. Results of this simulation are presented in Table 1, which demonstrates that the MLABS model performs well in most cases, with the lowest, second-lowest, or third-lowest average RMSE values across the 100 in-sample and out-of-sample sets. According to the average rank in Table 1, the MLABS attains a more accurate estimation of the surface test functions than the TPS. The tree-based models, such as the SBART, BART, RF, and XGB, have difficulties estimating smooth surfaces or regions due to their lack of smoothness. The NN does not work very well owing to its fixed model structure relative to the training data size. The BASS can choose diverse degrees of the spline functions and, unlike the MARS, produces the lowest value on the radial and complex test functions with RSNR = 1. One characteristic of the proposed model is smoothness adaptation: Figure 5 supports that the MLABS model has the advantages of canceling the noise and adapting to the non-smooth function.
4.2 Friedman’s test functions
We conduct additional experiments using the Friedman 1, 2, and 3 data sets to assess the practical performance of the proposed method on general multi-dimensional data. The Friedman 1 data set has ten independent uniform random variables on the unit interval. The output is computed using the following formula
The data set uses only the first five variables out of ten variables. The Friedman 2 and 3 data sets have four independent random variables with uniform distribution on the intervals
The corresponding responses are created according to the mean functions
These data sets have non-linear and high-order interaction terms. For each test function, we create in-sample data sets of 250 observations and add independent Gaussian noise with mean zero and a standard deviation such that the root signal-to-noise ratio is 1 or 5. We also generate out-of-sample data sets of 1,000 observations to measure the predictive accuracy of the regression models.
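The standard forms of Friedman's three mean functions, which we believe are the ones used here, can be written as:

```python
import math

def friedman1(x):
    """Friedman #1 mean function (standard form): inputs in [0, 1]^p with
    p >= 5; coordinates beyond the fifth are pure noise variables."""
    return (10.0 * math.sin(math.pi * x[0] * x[1])
            + 20.0 * (x[2] - 0.5) ** 2 + 10.0 * x[3] + 5.0 * x[4])

def friedman2(x):
    """Friedman #2 mean function (standard form)."""
    return math.sqrt(x[0] ** 2 + (x[1] * x[2] - 1.0 / (x[1] * x[3])) ** 2)

def friedman3(x):
    """Friedman #3 mean function (standard form)."""
    return math.atan((x[1] * x[2] - 1.0 / (x[1] * x[3])) / x[0])
```

The products and ratios inside these formulas are what make the direct modeling of interaction terms, as in the tensor product bases, advantageous here.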
Results of the simulation for Friedman’s data sets are given in Table 2. The MLABS model has the best performance in almost all cases, as shown in Table 2. A notable feature of this experiment is that the tensor product basis-based models, including the MLABS, BASS, and MARS, are superior to the others. These results reflect whether the interaction terms can be estimated directly or not. Although the SBART and BART have relatively good prediction abilities, the MLABS outperforms them on all test functions regardless of the RSNR. The average rank in Table 2 shows that the ensemble models of the RF and XGB and the kernel-based models of the BARK and SVM perform poorly on Friedman’s data sets. The NN is not appropriate for handling small data sets, as seen in the previous surface examples.
We evaluate the out-of-sample performance of the methods based on the Friedman 1 data set in high-dimensional settings for a detailed comparison. In other words, we check how well the models work as the number of variables increases. We reproduce the simulation scenarios of linero2018sbart. We create five pairs of 250 training and 1,000 test samples with the number of features increasing from 5 to 1,000 along an evenly spaced grid on the log scale. Independent Gaussian noise with mean zero is also added to the training samples generated from the true mean function. Methods are compared by the average of the RMSEs over the five replications. At each number of variables, most methods are re-tuned using cross-validation.
Results of this simulation are provided in Figure 6. An interesting part of Figure 6 is that the MLABS achieves the best performance up to about 70-dimensional data irrespective of the noise level. After that point, its error increases gradually in both the low and high noise settings. Since the MLABS and BASS show the same performance behavior, unlike the MARS, these results seem to come from slow mixing of the RJMCMC algorithm. In contrast, the SBART and MARS are interestingly invariant to the number of predictors. The SBART is superior to the other methods, including the MLABS, in high-dimensional settings where the number of predictors is large.
5 Real data applications
We now compare the MLABS model (10) with various competing methods in regression and classification problems on several real-world datasets.
5.1 Regression examples
We prepare six real-life data sets from the UCI Machine Learning Repository (UCI) and several R packages: caret, mfp, MASS, and AppliedPredictiveModeling. A summary of these data sets is provided in Table 4. Since the MLABS model can handle only quantitative variables, we do not consider categorical predictors in the data sets. We also remove missing values. Specifically, case 42 of the bodyfat data seems to be an apparent error, and its height value is replaced by 69.5. The tecator meat and residential building data sets have multiple response variables. We choose one of the responses in each data set: the percentage of protein (tecator meat) and the actual sales price (residential building).
We consider the nine competing approaches described in Section 4 and select the best hyperparameters of each method using cross-validation. To gauge the predictive performance of the methods, we make use of 20 replications of five-fold cross-validation. Thus, we compute the average of the 20 estimated CV errors as a measure of accuracy.
Results of the experiment for the regression problems are presented in Table 4. Table 4 illustrates that the MLABS model has stable predictive abilities, obtaining the best performance on three data sets. It also produces the third-lowest average RMSE on the remaining three data sets. By the average rank in Table 4, the MLABS model generally outperforms state-of-the-art methods from the fields of machine learning and Bayesian nonparametrics. Furthermore, for the tecator meat data, the tensor product basis-based models work much better than the tree-based models do.
In contrast, the tree-based methods perform well on the chemical manufacturing process data set and rank high among the methods. In practice, the kernel-based methods show poor performance in the regression examples, and the lowest-ranked approach is the SVM. These results are attributed to a lack of flexibility and adaptability to the data sets from using only one type of kernel function.
5.2 Classification examples
We choose seven competitive methods for the classification problems and exclude the SBART and BASS because these two models cannot yet analyze binary data. We compare the MLABS model using the probit link with the other methods, whose hyperparameters are optimized by a grid search with five-fold cross-validation, using a classification performance measure: the AUC (area under the receiver operating characteristic (ROC) curve). The AUC is the most common metric for classification tasks, and its value lies between 0 and 1, where 1 indicates an excellent classifier. We calculate the average of the performance metrics obtained by repeating five-fold cross-validation 20 times. We collect seven real data sets for classification from the UCI Machine Learning Repository and two R packages: mlbench and datamicroarray
. The Alon data set is a high-dimensional microarray data set for colon cancer. The Pima Indian diabetes data set contains zero values for some variables, and we treat these values as missing. The missing values and categorical variables of every real data set for classification are processed in the same way as in the regression experiments. The real data sets are listed with information such as the sample size, number of features, source, and imbalance ratio (IR), defined as
where the extrema are taken over the set of all classes.
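Assuming the common definition of the imbalance ratio as the largest class frequency over the smallest, it can be computed as:

```python
def imbalance_ratio(labels):
    """Imbalance ratio (IR): largest class frequency divided by the smallest.
    This is the common definition; we assume it matches the paper's formula."""
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return max(counts.values()) / min(counts.values())
```

An IR of 1 indicates a perfectly balanced data set; larger values flag the class imbalance that can distort accuracy-based metrics and motivates reporting the AUC instead.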
Results of this experiment are given in Table 6. The columns of the methods represent their average cross-validated AUC values over the 20 replicates. As shown in Table 6, the MLABS method does not show excellent predictive performance for classification, but it is comparable to the XGB and RF as gold-standard models. Specifically, the MLABS model performs well in most cases except the Ionosphere, Sonar, and Alon data sets. It appears to have difficulty in high-dimensional cases. Here, the XGB model provides the best performance, followed by the RF, MLABS, and BART models. In contrast with the regression problems, the tree-based models generally provide better predictive capabilities than the others.
|Parkinson||0.97 (2)||0.962 (4)||0.922 (6)||0.961 (5)||0.975 (1)||0.899 (7)||0.797 (8)||0.967 (3)|
|Ionosphere||0.971 (4)||0.963 (5)||0.95 (6)||0.978 (1)||0.978 (2)||0.935 (7)||0.917 (8)||0.976 (3)|
|Breast cancer Wisconsin||0.995 (1)||0.992 (3)||0.984 (7)||0.989 (4)||0.975 (8)||0.988 (5)||0.987 (6)||0.994 (2)|
|Sonar||0.907 (5)||0.933 (3)||0.79 (8)||0.941 (1)||0.909 (4)||0.864 (6)||0.853 (7)||0.935 (2)|
|Pima Indian Diabetes||0.849 (2)||0.847 (4)||0.846 (5)||0.848 (3)||0.801 (8)||0.816 (7)||0.831 (6)||0.852 (1)|
|Spambase||0.983 (3)||0.982 (4)||0.975 (6)||0.986 (2)||0.949 (8)||0.977 (5)||0.966 (7)||0.988 (1)|
|Alon||0.885 (4)||0.889 (3)||0.836 (6)||0.876 (5)||0.5 (8)||0.771 (7)||0.905 (2)||0.914 (1)|
|Average rank||3.125 (3)||3.5 (4)||6.25 (7)||3 (2)||5.875 (5)||6.375 (8)||6.125 (6)||1.75 (1)|
6 Discussion
In this article, we have introduced a general Bayesian sum-of-bases model named Multivariate Lévy Adaptive B-Spline Regression, using tensor products of B-spline basis functions whose parameters are automatically determined by the Lévy random measure. The B-spline basis has nice properties such as local support and differentiability. We have illustrated that the proposed model has a powerful predictive ability over the state-of-the-art methods in simulation studies and real data applications for regression problems. We also proposed a comparable classification model using the data augmentation strategy of albert1993bayesian.
However, the proposed model has drawbacks: it can treat only continuous variables, and it is somewhat inefficient since it uses the RJMCMC algorithm. The MCMC algorithm makes it difficult to deal with high-dimensional data. The classifier based on the MLABS framework also does not work as well as the tree-based models. Further studies are needed to address these problems.
Future work will develop a versatile and efficient sampling-based model for the MLABS. One possibility is to give up the Lévy process prior and use regularization priors to handle high-dimensional data with a fixed and large number of basis functions. Using the Bayesian backfitting algorithm of hastie2000bayesian, a core algorithm in the BART, is expected to achieve higher performance and faster convergence than the inefficient RJMCMC. Moreover, scalable algorithms such as Consensus Monte Carlo or variational Bayes can be applied to our model for large and tall data. Another possibility is to allow the tensor product bases to contain indicators for categorical data.
Appendix A Decision boundaries for MLABS
We apply the main classifiers, including the MLABS, BART, RF, SVM, and XGB, to five binary-class data sets with two-dimensional feature spaces to visualize their classification performance. For each data set, the different decision boundaries of all methods are shown in Figure 7.
Appendix B Tuning hyperparameters
To select optimal hyperparameters for all methods, we use a grid search with five-fold cross-validation in all experiments. Table 7 summarizes the hyperparameter search spaces we use.