Multivariate Lévy Adaptive B-Spline Regression

08/26/2021
by Sewon Park, et al.

We develop a fully Bayesian nonparametric regression model based on a Lévy process prior named MLABS (Multivariate Lévy Adaptive B-Spline regression) model, a multivariate version of the LARK (Lévy Adaptive Regression Kernels) models, for estimating unknown functions with either varying degrees of smoothness or high interaction orders. Lévy process priors have advantages of encouraging sparsity in the expansions and providing automatic selection over the number of basis functions. The unknown regression function is expressed as a weighted sum of tensor product of B-spline basis functions as the elements of an overcomplete system, which can deal with multi-dimensional data. The B-spline basis can express systematically functions with varying degrees of smoothness. By changing a set of degrees of the tensor product basis function, MLABS can adapt the smoothness of target functions due to the nice properties of B-spline bases. The local support of the B-spline basis enables the MLABS to make more delicate predictions than other existing methods in the two-dimensional surface data. Experiments on various simulated and real-world datasets illustrate that the MLABS model has comparable performance on regression and classification problems. We also show that the MLABS model has more stable and accurate predictive abilities than state-of-the-art nonparametric regression models in relatively low-dimensional data.


1 Introduction

Suppose we have a random sample of size $n$, $x_1, \ldots, x_n \in \mathcal{X} \subset \mathbb{R}^p$, and response variables $y_1, \ldots, y_n \in \mathbb{R}$ satisfying the following relationship,

$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n, \qquad (1)$

where $f$ is an unknown nonparametric regression function which maps $\mathcal{X}$ to $\mathbb{R}$. Here, we consider $p \geq 2$. The nonparametric regression function is determined only by the data, without assuming a prespecified structure. Nonparametric regression aims to identify the relationships between the predictors and the responses and then to make predictions on new data based on those relationships. When $p$ is one, the goal is to locally approximate a target function, which is referred to as nonparametric function estimation. Moreover, when the responses take discrete values (e.g., $y_i \in \{0, 1\}$), the function $f$ is estimated using classification algorithms. In this article, we focus on multi-dimensional or high-dimensional data, which are very common in real-world applications.

A common way of estimating an unknown mean function $f$ is to express it as a sum of functions that are specified nonparametrically. The most widely used form is $f(x) = \sum_{j=1}^{J} \beta_j \phi(x; \theta_j)$, where $\beta_1, \ldots, \beta_J$ denote unknown coefficients and $\{\phi(\cdot\,; \theta) : \theta \in \Theta\}$ is a basis set on $\mathcal{X}$ with parameters $\theta$. For recovering a regression function $f$, it matters which basis set is selected and how the coefficients $\beta_j$ are estimated. Popular choices of basis elements include decision trees and splines.

There has been much research on constructing the functions, selecting basis elements, and developing estimation techniques for multivariate data. The first approach is kernel-based methods, which are connected to the reproducing kernel Hilbert space (RKHS). By the representer theorem (kimeldorf1971some), a regression function over the RKHS can be expressed as $f(x) = \sum_{i=1}^{n} \beta_i K(x, x_i)$, where $K$ is a positive-definite real-valued kernel on $\mathcal{X} \times \mathcal{X}$ (see wahba1990spline for details). A solution to regularization problems in an RKHS is the well-known support vector machine (SVM) (boser1992training; cortes1995support) with the kernel trick, which leads to computational efficiency. tipping2000relevance developed a probabilistic counterpart of the SVM by putting a Gaussian prior on the coefficients $\beta_i$, which yields a sparser representation than the SVM.

Moreover, another kernel-based approach takes advantage of overcomplete bases. In the Bayesian framework, an example of methods using an overcomplete system is the Lévy adaptive regression kernels (LARK) model, first proposed by tu2006bayesian. It approximates target functions by adaptive basis expansions of elements in an overcomplete system. The main advantages of the LARK model are that it extracts features and yields sparse representations of functions. ouyang2008bayesian proposed sparse additive models using a multivariate Gaussian kernel with a diagonal covariance matrix as an extension of the LARK method to multi-dimensional cases.

The second approach is to use spline functions. The most representative spline-based model is the multivariate adaptive regression splines (MARS) introduced by friedman1991multivariate. The MARS model has the form of a weighted sum of spline basis functions, $f(x) = \beta_0 + \sum_{j=1}^{J} \beta_j B_j(x; \theta_j)$, where $\theta_j$ is the parameter vector of the $j$th tensor product of univariate linear spline functions $B_j$. It has the advantages of capturing nonlinear relationships and interactions between variables and of simplifying high-dimensional problems into low-dimensional settings. denison1998bayesian and francom2018sensitivity proposed Bayesian approaches to MARS and improved predictive performance compared to the original model. The neural network (NN) with two layers of hidden units can also be represented as a sum of spline functions of the form $\sigma(w^\top x + b)$, where $\sigma$ is the ReLU (rectified linear unit) activation function, $w$ is the weight vector, and $b$ is the bias for the hidden layer. The ReLU activation is $\sigma(u) = \max(0, u) = (u)_+$, which equals the linear spline function in the tensor product bases of MARS. Recently, park2021labs proposed the Lévy adaptive B-spline regression (LABS) model, which remedies the disadvantage of the LARK mentioned above by using a variety of B-spline bases as elements of an overcomplete system.

The third approach is ensemble methods, which combine several decision trees; that is, each component function is a single tree model. These are divided into two main categories: bagging (breiman1996bagging) and boosting (freund1999short; friedman2001greedy). Bagging builds many trees on different bootstrapped samples and averages their results. As an improved bagging model, random forests (breiman2001random) construct many independent trees based on random subsets of the features and combine them. Boosted trees sequentially estimate regression trees and aggregate them to form a strong tree model. chen2016xgboost developed a scalable and enhanced version of the gradient boosting algorithm named extreme gradient boosting. In the Bayesian framework, chipman2010bart proposed Bayesian additive regression trees (BART), which construct the function as $f(x) = \sum_{j=1}^{m} g(x; T_j, M_j)$, where $T_j$ is the $j$th tree structure and $M_j$ is the set of parameters at the terminal nodes (also called leaves) of the $j$th tree. BART has become quite popular owing to theoretical results and outstanding empirical performance. linero2018dart and linero2018sbart enhanced the BART model by placing a sparsity-inducing Dirichlet prior on the splitting variables for high-dimensional problems.

In this paper, we develop a fully Bayesian nonparametric regression model with tensor products of B-spline bases based on Lévy process priors, which we call Multivariate Lévy Adaptive B-Spline regression (MLABS). The MLABS model represents $f$ adaptively as a sum of tensor product basis functions. There are three main contributions of this work. First, MLABS can build predictive models for regression and classification, going beyond univariate models such as LARK and LABS. Since Lévy process priors encourage sparsity in the expansions and each tensor product basis is formed as a product of far fewer univariate B-spline functions than the number of predictors, MLABS is capable of analyzing multi- or high-dimensional datasets. Second, the proposed method can adapt to varying smoothness of functions in multi-dimensional data by changing the set of degrees of the tensor product basis functions. In particular, the local support of the B-spline basis makes more delicate predictions possible than other existing methods on non-smooth surface data. Finally, the MLABS model has comparable performance on regression and classification problems. Empirical results demonstrate that MLABS has more stable and accurate predictive abilities than state-of-the-art regression models.

The outline of the paper is as follows. In Section 2, we introduce two Bayesian nonparametric regression models based on Lévy process priors, namely the LARK and LABS models. In Section 3, we propose an extension of the LARK and LABS models for multivariate analysis. The posterior computation and details of the tensor product bases used in the proposed model are also presented. Simulation experiments comparing the predictive performance of our method with others are provided in Section 4. In Section 5, the proposed model is applied to both regression and classification problems using several real-world data sets. We conclude the paper with a discussion in Section 6.

2 Background

We provide a review of Lévy adaptive regression kernels and Lévy adaptive B-spline regression as core concepts of our proposed method. In this section, we consider the case where the predictor space $\mathcal{X}$ is one-dimensional.

2.1 Lévy adaptive regression kernels

Let $\Omega$ be a complete separable metric space, and let $\nu(d\beta, d\omega)$ be a Lévy measure on $\mathbb{R} \times \Omega$ satisfying the integrability condition

$\int_{\mathbb{R} \times A} \min(1, |\beta|)\, \nu(d\beta, d\omega) < \infty \qquad (2)$

for each compact set $A \subset \Omega$. Then the Lévy random measure $L$ can be expressed through a Poisson random measure $N$ with mean measure $\nu$ as $L(d\omega) = \int_{\mathbb{R}} \beta\, N(d\beta, d\omega)$. We write $L \sim \text{Lévy}(\nu)$ to mean that $L$ follows a Lévy distribution whose characteristic function is determined by $\nu$ through the Lévy–Khinchine representation. Let $g(x, \omega)$ be a real-valued function defined on $\mathcal{X} \times \Omega$. A real-valued random function $f$ on $\mathcal{X}$ can be constructed by

$f(x) = \int_{\Omega} g(x, \omega)\, L(d\omega). \qquad (3)$

Here, we call $g$ a generating function of $f$. The Poisson integral (3) is well defined for all bounded functions $g$. If $\nu^+ \equiv \nu(\mathbb{R} \times \Omega)$ is finite, the Lévy random measure can be represented as $L = \sum_{j=1}^{J} \beta_j \delta_{\omega_j}$, where $J$ follows a Poisson distribution with mean $\nu^+$ and $(\beta_j, \omega_j) \overset{iid}{\sim} \nu / \nu^+$. Hence, equation (3) can be expressed as a random finite sum:

$f(x) = \sum_{j=1}^{J} \beta_j\, g(x, \omega_j). \qquad (4)$

This implies that specifying a prior distribution for the Lévy random measure $L$ in (3) and prior distributions for the parameters of the expansion (4) are equivalent. However, if $\nu(\mathbb{R} \times \Omega) = \infty$, then the number of support points of $L$ will be infinite almost surely. For practical posterior inference, tu2006bayesian made use of a truncation method to discard infinitely many small jumps and approximate the Lévy measure by a finite Lévy measure.

The LARK model is summarized hierarchically as follows: the data follow (1) with mean function (4), the Lévy random measure is given a Lévy process prior Lévy$(\nu)$, and the parameters $\omega_j$ of the generating functions are given a prior distribution with its own hyperparameter. tu2006bayesian focused on infinite Lévy measures of the gamma, symmetric gamma, and symmetric $\alpha$-stable (S$\alpha$S) ($0 < \alpha < 2$) processes. The generating functions suggested as elements of an overcomplete system include Gaussian kernels, Laplace kernels, Haar wavelets, etc.

2.2 Lévy adaptive B-spline regression

The LABS model was designed to use various B-spline basis functions simultaneously to capture all parts of functions with locally varying smoothness. Thus, the mean function of the LABS model can be defined as

$f(x) = \sum_{d \in D} \int_{\Omega_d} B_{d}(x; \omega)\, L_d(d\omega), \qquad (5)$

where $D$ denotes the subset of degree numbers of the B-spline bases (e.g., $D = \{0, 1, 2, 3\}$) and $B_d(x; \omega)$ indicates a B-spline basis of degree $d$ with knot sequence $\omega$, defined recursively (Cox–de Boor recursion) as

$B_{k,d}(x) = \frac{x - t_k}{t_{k+d} - t_k} B_{k,d-1}(x) + \frac{t_{k+d+1} - x}{t_{k+d+1} - t_{k+1}} B_{k+1,d-1}(x), \qquad (6)$

where $B_{k,0}(x) = 1$ if $t_k \leq x < t_{k+1}$ and $0$ otherwise. The LABS model adopts B-spline basis functions instead of specific kernel functions as generating functions. Each Lévy random measure $L_d$ has a Lévy measure $\nu_d$ satisfying the integrability condition (2) for all $d \in D$.
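To make the recursion in (6) concrete, the following minimal Python sketch (our own illustration, not the authors' code) evaluates a single B-spline basis function of a given degree from its $d + 2$ knots via the Cox–de Boor recursion.

    import numpy as np

    def bspline_basis(x, knots, degree, i=0):
        """Evaluate the i-th B-spline basis function of the given degree at the
        points x; a degree-d basis is determined by knots[i], ..., knots[i+d+1]."""
        x = np.asarray(x, dtype=float)
        if degree == 0:
            return ((knots[i] <= x) & (x < knots[i + 1])).astype(float)
        left_den = knots[i + degree] - knots[i]
        right_den = knots[i + degree + 1] - knots[i + 1]
        left = np.zeros_like(x)
        right = np.zeros_like(x)
        if left_den > 0:
            left = (x - knots[i]) / left_den * bspline_basis(x, knots, degree - 1, i)
        if right_den > 0:
            right = (knots[i + degree + 1] - x) / right_den * bspline_basis(x, knots, degree - 1, i + 1)
        return left + right

    # Example: a cubic (degree-3) basis function needs 3 + 2 = 5 knots.
    grid = np.linspace(0.0, 1.0, 201)
    values = bspline_basis(grid, knots=[0.1, 0.3, 0.5, 0.7, 0.9], degree=3)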

Since the LABS model assumes finite Lévy measures, the mean function (5) can also be expressed as a random finite sum:

$f(x) = \sum_{d \in D} \sum_{j=1}^{J_d} \beta_{d,j}\, B_d(x; \omega_{d,j}),$

where $J_d$ is Poisson-distributed with mean $\nu_d(\mathbb{R} \times \Omega_d)$ and $(\beta_{d,j}, \omega_{d,j}) \overset{iid}{\sim} \nu_d / \nu_d(\mathbb{R} \times \Omega_d)$ for all $d \in D$. park2021labs chose prior distributions for the knot points (locations) $\omega_{d,j}$ and the magnitudes $\beta_{d,j}$ accordingly.

Although the Lévy measures satisfying integrability condition (2) for each $d$ may be infinite, the stochastic integrals and sums above are well defined due to the properties of the B-spline basis.

The LABS model can be represented in a hierarchical structure as follows:

(7)

for $d \in D$. The parameters $\beta$ and $\omega$ of the LABS model have varying dimensions, since $J_d$ is a random number that is stochastically determined by the Lévy random measure $L_d$. In this case, park2021labs applied the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm proposed by green1995reversible for posterior inference.

3 Proposed model

In this section, we propose an extension of the LABS model, which can cope only with data that have a single predictor, to multivariate analyses.

3.1 Model specifications

General tensor product B-spline bases require many computations as the number of variables increases. This problem is the so-called “curse of dimensionality”, which means that the computational burden increases exponentially with the dimension. We apply the structure of the basis functions of the (Bayesian) MARS model to the LABS model to lessen the computational effort. The idea of tensor products of B-spline bases was initially proposed by bakin2000parallel; we consider general basis functions without restricting the degrees. The MARS model approximates an unknown function as a weighted sum of basis functions that are products of univariate spline functions in order to handle multi-dimensional or high-dimensional data. This means that it is often enough to represent an unknown function by a combination of main-effect terms and lower-order interactions.

We first define the $j$th tensor product of B-spline bases used as a generating function as

$B_j(x; \theta_j) = \prod_{k=1}^{K_j} B_{d_{jk}}\!\left(x_{v_{jk}}; \omega_{jk}\right), \qquad (8)$

where $K_j$ is the interaction order of $B_j$, $d_{jk}$ is the degree of the $k$th univariate B-spline basis, $v_{jk}$ is an index determining which variable is used, and $\omega_{jk}$ is a knot sequence on the space of the $v_{jk}$th variable. For the parameters in the $j$th tensor product of B-spline bases, we collect the interaction order, degrees, variable indices, and knot sequences into a parameter vector $\theta_j$, and we assume $\theta_j \in \Theta$, a complete separable metric space. Then, we can write the $j$th basis function as $B(x; \theta_j)$.
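As an illustration of (8), the sketch below (our own, with hypothetical argument names) evaluates one tensor product basis function as a product of univariate B-spline basis elements, each acting on a single selected coordinate; scipy's BSpline.basis_element builds a single basis function of degree len(t) - 2 from a knot sequence t.

    import numpy as np
    from scipy.interpolate import BSpline

    def tensor_product_basis(X, var_idx, knot_seqs):
        """Evaluate one tensor product basis B(x; theta) at the rows of X.
        var_idx[k] selects the variable entering the k-th factor and
        knot_seqs[k] is its knot sequence (a degree-d factor uses d + 2 knots)."""
        X = np.atleast_2d(X)
        out = np.ones(X.shape[0])
        for v, t in zip(var_idx, knot_seqs):
            b = BSpline.basis_element(t, extrapolate=False)  # one univariate B-spline basis
            vals = b(X[:, v])
            out *= np.nan_to_num(vals)  # outside its local support the factor is zero
        return out

    # Example: an interaction-order-2 basis acting on variables 0 and 3,
    # with a linear (degree-1) factor and a cubic (degree-3) factor.
    X = np.random.default_rng(0).uniform(size=(5, 10))
    B = tensor_product_basis(X, var_idx=[0, 3],
                             knot_seqs=[np.array([0.2, 0.5, 0.8]),
                                        np.array([0.1, 0.3, 0.5, 0.7, 0.9])])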

The mean function of the MLABS model can be formulated as

$f(x) = \beta_0 + \sum_{j=1}^{J} \beta_j\, B(x; \theta_j), \qquad (9)$

where $\beta_0$ is a fixed intercept term, $J$ is a Poisson random variable with mean $\nu(\mathbb{R} \times \Theta)$, and the pairs $(\beta_j, \theta_j)$ are i.i.d. from a distribution $\pi$. The main differences from LABS are the structure of the basis functions and the randomness of the degrees of the B-spline bases: the degree numbers of the basis functions are prespecified and fixed in the LABS model but random in the MLABS model. The mean function (9) can also be expressed as a stochastic integral $f(x) = \beta_0 + \int_{\Theta} B(x; \theta)\, L(d\theta)$ with respect to a Lévy random measure $L$ with a Lévy measure $\nu$ satisfying $\nu(\mathbb{R} \times \Theta) < \infty$.

We follow the priors for the intercept, the coefficients, the error variance, and $J$ from the LABS model (7) and additionally place priors on the parameters in the basis functions, namely the interaction orders $K_j$, variable indices $v_{jk}$, degrees $d_{jk}$, and knot sequences $\omega_{jk}$. The prior distributions for $K_j$, $v_{jk}$, and $d_{jk}$, following nott2005efficient, are assumed to be discrete uniform distributions over predetermined sets, and we assume these priors are independent. In detail, the prior on $K_j$ is uniform on $\{1, \ldots, K_{\max}\}$, where $K_{\max}$ is the maximum degree of interaction for the tensor product basis; we set $K_{\max}$ to at most 3 in most experiments of Sections 4 and 5. The prior for $v_{jk}$ is a uniform distribution that puts equal weight on the indices of the candidate predictors from one to $p$, denoted by $V$ (e.g., if $p = 3$, then $V = \{1, 2, 3\}$). The prior for $d_{jk}$ is uniform on $D$, the prespecified subset of degree numbers of the B-spline bases. Note that the prior for $\omega_{jk}$ is the uniform distribution over its support, since the length and the support of a knot sequence depend on the degree $d_{jk}$ and the index $v_{jk}$, respectively. Below we summarize the MLABS model:

(10)

and we set the remaining hyperparameters, such as the prior variance of the coefficients, to default values (e.g., based on Var$(y)$).

3.2 Comparisons between basis functions of MLABS and MARS

The main difference between the basis functions of the MLABS model and the (Bayesian) MARS model is the form of the univariate basis functions in each element of the tensor product. Thus, their basis functions have very different parameters, too. The tensor product spline basis of MARS is given by

$B_j(x) = \prod_{k=1}^{K_j} \left[ s_{jk} \left( x_{v_{jk}} - t_{jk} \right) \right]_+,$

where $s_{jk} \in \{-1, +1\}$ is a sign indicator, $t_{jk}$ is a knot point, and $[u]_+ = \max(0, u)$. The interaction orders and variable indices of MARS are the same as those of MLABS.
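A small numerical illustration of this difference (our own sketch, not from the paper): a MARS hinge basis $s(x - t)_+$ is nonzero on an entire half-line past its knot, whereas a degree-1 B-spline basis is nonzero only between its outer knots.

    import numpy as np

    x = np.linspace(0.0, 1.0, 101)

    def hinge(x, t, s=1.0):
        # MARS univariate basis: a truncated linear (hinge) function
        return np.maximum(s * (x - t), 0.0)

    def bspline_deg1(x, t0, t1, t2):
        # degree-1 B-spline ("tent"): rises on [t0, t1], falls on [t1, t2], zero elsewhere
        up = (x - t0) / (t1 - t0)
        down = (t2 - x) / (t2 - t1)
        return np.clip(np.minimum(up, down), 0.0, None)

    mars_basis = hinge(x, t=0.4)                   # nonzero on all of [0.4, 1]
    mlabs_basis = bspline_deg1(x, 0.3, 0.4, 0.5)   # nonzero only on (0.3, 0.5)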

First, the number and the locations of the knot points in the basis functions are quite different. A B-spline basis of degree $d$ in MLABS needs $d + 2$ knot points, and the locations of the knots in MLABS are freely chosen in the domain of the corresponding variable. In contrast, each univariate basis function of MARS has only one knot point, which is set at a data point; in the Bayesian MARS, the prior distribution for the knot is uniform over the observed data points. We fit the MLABS model and the MARS model to data generated from a piecewise smooth function with two-dimensional support provided by imaizumi2019deep at equally spaced points on the unit square. Figure 1 reveals that there is a considerable difference between the numbers of knot points used in the two methods and in whether they place the knot points at data points or not.

Figure 1: Plot for knot points of the Bayesian MARS (left) and the MLABS (right). In each plot the solid lines mean the locations of the knots and the small dots indicate the data points.

Second, while the degrees of the basis functions in the MARS model are fixed, those in the MLABS model are random and comprise various combinations of the predetermined degree numbers in $D$. Furthermore, a fixed degree parameter is added to the basis functions in the modified Bayesian MARS approach of francom2018sensitivity; in the (Bayesian) MARS, this degree is common to all basis functions. Figure 2 shows that the MLABS model needs more basis functions and uses more diverse types of basis functions than the MARS model to estimate an unknown surface. In particular, some of the tensor product bases in the MLABS model have very small local support, unlike those of MARS. These parts lead to accurate estimates of spatially varying surfaces.


Figure 2: Plot for tensor product basis functions constructed by the Bayesian MARS (left) and the MLABS (right) to estimate a non-smooth function of imaizumi2019deep.

3.3 Posterior inference

The structure of the MLABS model is similar to that of the LABS model, although we modified the form of the basis functions from the univariate case to the multivariate case. Thus, we follow most of the posterior computation steps of park2021labs but incorporate update steps for the newly added parameters, such as the interaction orders, variable indices, and degrees, into the existing MCMC algorithm. The joint posterior distribution of the MLABS model (10) is proportional to the product of the likelihood function based on the data-generating mechanism (1) and the prior distributions in (10).

We summarize the posterior sampling scheme of the MLABS model based on the RJMCMC algorithm. Let the current state consist of the coefficients and the basis parameters, whose dimensions are determined by $J$, the number of coefficients (or basis functions) in the current model. The RJMCMC algorithm consists of three updating steps for sampling from the posterior distribution. These move types are called the birth step, the death step, and the relocation step, and they are selected with probabilities $p_b$, $p_d$, and $p_r$, respectively, with $p_b + p_d + p_r = 1$.

The birth step decides whether or not to add a new component generated from the proposal distributions, i.e., this updating phase allows the sampler to move from a current state with $J$ components to a new state with $J + 1$ components. On the contrary, the death step decides whether or not to remove one of the existing components. Finally, the relocation step updates the parameters of the existing components without altering the dimensionality; the updating scheme of this step is the same as in standard MCMC methods, including Gibbs sampling or the Metropolis–Hastings algorithm. The acceptance ratio of each move step has the usual reversible-jump form, involving the ratio of likelihoods, priors, and jump proposal distributions between the current model parameters with $J$ tensor product basis functions and the proposed model parameters with $J'$ basis functions, where the jump proposal distribution proposes a new state given the current state. We follow the jump proposals of lee2020bayesian for each move step. The posterior samples for the coefficients and the error variance are drawn from their full conditional distributions. See park2021labs for more details on posterior computation.
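For illustration only, the following sketch shows how one sweep of such a sampler might pick a move type; the probabilities are placeholders, not the values used in the paper.

    import numpy as np

    def choose_move(n_basis, p_birth=0.4, p_death=0.4, rng=None):
        """Pick a birth, death, or relocation move for the current model with
        n_basis tensor product basis functions. An empty model can only grow."""
        rng = np.random.default_rng() if rng is None else rng
        if n_basis == 0:
            return "birth"
        p_reloc = 1.0 - p_birth - p_death
        return rng.choice(["birth", "death", "relocation"],
                          p=[p_birth, p_death, p_reloc])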

In practice, the LABS model had inefficient sampling for the knot points because they were uniformly sampled from the domain regardless of the distribution of the data points. This caused proposed knot points to be located far from the data points. As a result, the LABS model generated unnecessary B-spline bases and wasted many MCMC iterations.

To solve this problem, we introduce new knot proposal schemes in the MLABS model. We illustrate the proposal processes for knot points using Figure 3. First, in the case of degree 0 (panel (a) of Figure 3), a data point is uniformly sampled from the observed values, and then the two knot points are generated from the intervals to its left and right, respectively. Here, the domain is expanded beyond the boundary data points; in practice, we expand it by a multiple $c$ of the spacing near the endpoints, where $c$ is a multiplier. Second, for degree 1 (panel (b) of Figure 3), a data point is uniformly sampled and set as the middle knot, and the two outer knots are generated from the intervals to its left and right, respectively. Third, in the case of degree 2 (panel (c) of Figure 3), after a data point is uniformly sampled, two knots are generated to its left and two to its right. Finally, for degree 3 (panel (d) of Figure 3), we generate a point uniformly distributed over the data, set it as the central knot, and then generate two knots on each side. These data-dependent knot proposals lead to faster convergence than in the LABS model.
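As a concrete (and simplified) reading of the degree-0 case in panel (a), the sketch below anchors the proposal at a randomly chosen data point and draws one knot on each side of it within an expanded domain; the expansion rule used here is our assumption, not the paper's exact formula.

    import numpy as np

    def propose_knots_degree0(x_obs, c=1.0, rng=None):
        """Data-dependent proposal for the two knots of a degree-0 B-spline basis:
        pick a data point, then draw one knot to its left and one to its right
        within the observed range expanded by c times the average spacing."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.sort(np.asarray(x_obs, dtype=float))
        pad = c * np.mean(np.diff(x))      # assumed expansion near the endpoints
        lo, hi = x[0] - pad, x[-1] + pad
        xi = rng.choice(x)                 # anchor at a randomly chosen data point
        return np.array([rng.uniform(lo, xi), rng.uniform(xi, hi)])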

(a)
(b)
(c)
(d)
Figure 3: Proposal schemes for knot points of the B-spline basis function with a degree (a) 0, (b) 1, (c) 2, and (d) 3.

3.4 Binomial regressions for MLABS

Generalized linear models can cope with non-Gaussian data. We can further extend the MLABS model (10) to generalized linear models by introducing a response distribution and a link function into the model, as in

(11)

In this subsection, we focus on binary regression, so the link function will be either the logit or the probit function for the Binomial distribution. For example, the logit model of MLABS can be defined as

$\log \frac{P(y_i = 1 \mid x_i)}{1 - P(y_i = 1 \mid x_i)} = f(x_i),$

where $f$ is the MLABS mean function (9). The priors for the remaining parameters are identical to those of the MLABS model (10) for regression. For the logit model, the posterior distribution has no closed form and is approximated using the Metropolis–Hastings sampler.

With the probit link function, model (11) takes the form

$P(y_i = 1 \mid x_i) = \Phi\!\left( f(x_i) \right),$

where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. For posterior inference in the probit model, we use the data augmentation algorithm proposed by albert1993bayesian. We introduce latent variables $z_1, \ldots, z_n$ such that $z_i = f(x_i) + e_i$ with $e_i \sim N(0, 1)$, and $y_i = 1$ if $z_i > 0$ and $y_i = 0$ otherwise. Then, the normal prior for the coefficients gives a conjugate Gibbs-sampling update, unlike the logit model. The full conditional of $z_i$ is a truncated normal distribution with mean $f(x_i)$, variance 1, and support $(0, \infty)$ if $y_i = 1$ or $(-\infty, 0]$ if $y_i = 0$. The posterior samples for the remaining parameters are drawn from their full conditionals via the RJMCMC algorithm as illustrated in subsection 3.3, and the model parameters have the same prior distributions as in the MLABS model (10). We use the MCMC algorithm with the probit link function for efficient posterior sampling.
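A minimal sketch of this data-augmentation step (assuming the standard Albert–Chib formulation; the variable names are ours):

    import numpy as np
    from scipy.stats import truncnorm

    def sample_latent_probit(f_x, y, rng=None):
        """Draw z_i ~ N(f(x_i), 1) truncated to (0, inf) when y_i = 1 and to
        (-inf, 0] when y_i = 0, as in the Albert-Chib augmentation."""
        rng = np.random.default_rng() if rng is None else rng
        f_x = np.asarray(f_x, dtype=float)
        y = np.asarray(y)
        # truncnorm takes the truncation bounds on the standard-normal scale
        a = np.where(y == 1, -f_x, -np.inf)
        b = np.where(y == 1, np.inf, -f_x)
        return truncnorm.rvs(a, b, loc=f_x, scale=1.0, random_state=rng)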

We examine the decision boundaries of the probit model of MLABS on five benchmark data sets: Linear, Circle, XOR, Two moons, and Two spirals. Figure 7 in Appendix A shows that the MLABS model produces visually more reasonable decision boundaries than the state-of-the-art classifiers. In other words, the MLABS model can produce different and flexible decision boundaries by changing the degrees or interaction orders in the tensor product basis function (8).

4 Simulation studies

In this section, we measure the performance of the MLABS model (10) and competing methods on simulated regression data sets. We first consider three test functions with bivariate predictors: the radial and complex interaction functions of hwang1994regression and the non-smooth test function of imaizumi2019deep. The two test functions of hwang1994regression are smooth. Second, we take the examples proposed by friedman1991multivariate as benchmark datasets in multivariate nonparametric regression; one of Friedman's test functions is widely used to assess variable selection performance in high-dimensional data. For all test functions, we generate 100 pairs of held-in data with independent Gaussian noise and held-out data, and we evaluate the predictive performance by the root-mean-square error (RMSE),

$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(x, y) \in T} \left( y - \hat{f}(x) \right)^2},$

where $T$ is a held-out test set.

For comparison, we consider several competitive alternatives: the multivariate adaptive regression splines of friedman1991multivariate (denoted by MARS), a modified version of Bayesian MARS of francom2018sensitivity (denoted by BASS), the LARK model using multivariate Gaussian kernels of ouyang2008bayesian (denoted by BARK), support vector machines with radial basis function (RBF) kernels of boser1992training and cortes1995support (denoted by SVM), a fully connected neural network with two hidden layers (each with 15 nodes) using sigmoid activations (denoted by NN), random forests of breiman2001random (denoted by RF), accelerated gradient-boosted decision trees of chen2016xgboost (denoted by XGB), and Bayesian decision tree ensembles: Bayesian additive regression trees of chipman2010bart (denoted by BART) and BART using soft decision trees of linero2018sbart (denoted by SBART). All competing models were implemented in R packages: earth, BASS (francom2020bass), bark, e1071 (meyer2015support), keras, randomForest, xgboost, BayesTree, and SoftBart, respectively.

The hyperparameters for all methods are chosen using grid search with five-fold cross-validation. The MLABS model has seven tuning parameters; we fix most of them at default values, and the remaining ones, namely the set of degree numbers $D$, the maximum degree of interaction $K_{\max}$, and the multiplier $c$ for the expanded intervals, are optimized by cross-validated grid search over the parameter grids. The hyperparameter candidates of all methods used in the experiments of this section are given in Appendix B. We also run the MLABS model for 100,000 iterations, with the first 50,000 iterations discarded as burn-in, and retain every 50th sample.

4.1 Surface test functions

For each surface test function, in-sample data sets are generated from the true function at equally spaced grid points on the unit square. We also add independent normally distributed noise to the true target functions, selecting the noise standard deviation such that the root signal-to-noise ratio (RSNR) is 1 or 5. We use 2500 additional data points generated independently and uniformly on the unit square as out-of-sample data. The three true surfaces are the radial and complex interaction functions of hwang1994regression and the non-smooth function of imaizumi2019deep, which involves an indicator function of a region of the unit square; they are visualized in Figure 4.
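Here we assume the usual convention that the RSNR is the ratio of the standard deviation of the true signal to that of the noise, so the noise level can be set as sketched below (our reading, not stated explicitly in the extracted text).

    import numpy as np

    def add_noise_for_rsnr(f_values, rsnr, rng=None):
        """Add Gaussian noise whose standard deviation is sd(f) / RSNR
        (assumed convention for the root signal-to-noise ratio)."""
        rng = np.random.default_rng() if rng is None else rng
        sigma = np.std(f_values) / rsnr
        return f_values + rng.normal(scale=sigma, size=f_values.shape), sigma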

(a)
(b)
(c)
Figure 4: Three true surfaces: (a) radial (b) complex interaction and (c) non-smooth functions

In this example, we add the thin plate spline (TPS) as a benchmark technique, since it is commonly used for the smooth interpolation of two-dimensional data; the TPS is also referred to as a generalization of the smoothing spline. Results of this simulation are presented in Table 1, which demonstrates that the MLABS model performs well in most cases, with the lowest, second-lowest, or third-lowest average RMSE across the 100 in-sample and out-of-sample sets. According to the average ranks in Table 1, MLABS attains more accurate estimates of the surface test functions than the TPS. The tree-based models such as SBART, BART, RF, and XGB have difficulties estimating smooth surfaces or regions due to their lack of smoothness. The NN does not work very well owing to its fixed model structure relative to the training data size. The BASS can choose diverse degrees of the spline functions and produces the lowest RMSE on the radial and complex test functions with RSNR = 1, unlike the MARS. One characteristic of the proposed model is smoothness adaptation; Figure 5 supports that the MLABS model has the advantages of canceling the noise and adapting to the non-smooth function.

(a)
(b)
(c)
(d)
Figure 5: Plot of the (a) true non-smooth function with additive Gaussian noise, and estimated surfaces obtained by fitting the (b) TPS, (c) BART, and (d) MLABS model.

Function Noise MLABS SBART BART BARK BASS RF SVM MARS NN XGB TPS
Radial RSNR = 5 0.035 (2) 0.045 (4) 0.071 (7) 0.03 (1) 0.053 (6) 0.129 (10) 0.04 (3) 0.1 (8) 0.167 (11) 0.1 (9) 0.047 (5)
Radial RSNR = 1 0.124 (2) 0.187 (6) 0.216 (8) 0.154 (5) 0.115 (1) 0.531 (11) 0.152 (4) 0.203 (7) 0.366 (10) 0.247 (9) 0.135 (3)
Complex RSNR = 5 0.055 (1) 0.067 (5) 0.114 (7) 0.074 (6) 0.059 (4) 0.149 (9) 0.057 (3) 0.34 (11) 0.316 (10) 0.13 (8) 0.056 (2)
Complex RSNR = 1 0.209 (3) 0.274 (6) 0.325 (8) 0.272 (5) 0.195 (1) 0.534 (11) 0.24 (4) 0.386 (9) 0.522 (10) 0.3 (7) 0.196 (2)
Non-smooth RSNR = 5 0.029 (1) 0.036 (5) 0.038 (6) 0.047 (10) 0.04 (8) 0.039 (7) 0.036 (4) 0.058 (11) 0.033 (3) 0.043 (9) 0.032 (2)
Non-smooth RSNR = 1 0.059 (1) 0.063 (2) 0.068 (5) 0.069 (7) 0.066 (3) 0.125 (11) 0.067 (4) 0.072 (9) 0.075 (10) 0.068 (6) 0.07 (8)
Average rank 1.67 (1) 4.67 (5) 6.83 (7) 5.67 (6) 3.83 (4) 9.83 (11) 3.67 (2) 9.17 (10) 9 (9) 8 (8) 3.67 (2)

Table 1: Average of predictive RMSEs over 100 pairs of held-in and held-out sets for three surface test functions. The rank of the method among the eleven approaches is shown in parentheses. The top-ranked model for each test function is given in bold.

4.2 Friedman’s test functions

We conduct additional experiments using the Friedman 1, 2, and 3 data sets to assess the practical performance of the proposed method on general multi-dimensional data. The Friedman 1 data set has ten independent uniform random variables on the interval $[0, 1]$, and the response is computed using the following formula:

$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon.$

The data set uses only the first five variables out of the ten. The Friedman 2 and 3 data sets have four independent random variables with uniform distributions on the intervals $0 \leq x_1 \leq 100$, $40\pi \leq x_2 \leq 560\pi$, $0 \leq x_3 \leq 1$, and $1 \leq x_4 \leq 11$. The corresponding responses are created according to the mean functions

$y = \left( x_1^2 + \left( x_2 x_3 - \frac{1}{x_2 x_4} \right)^2 \right)^{1/2} + \epsilon \quad \text{and} \quad y = \tan^{-1}\!\left( \frac{x_2 x_3 - 1/(x_2 x_4)}{x_1} \right) + \epsilon.$

These data sets have nonlinear and high-order interaction terms. For each test function, we create in-sample data sets of 250 observations and add independent Gaussian noise with mean zero and a standard deviation chosen so that the root signal-to-noise ratio is 1 or 5. We also generate out-of-sample data sets of 1000 observations to measure the predictive accuracy of the regression models.
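These benchmarks are available off the shelf; for instance, scikit-learn's make_friedman1/2/3 generators implement the same mean functions, so one replicate of the set-up above can be sketched as follows (the noise level shown is a placeholder, not the RSNR-calibrated value used in the paper).

    from sklearn.datasets import make_friedman1

    # One replicate: 250 noisy training points and 1000 noise-free test points.
    X_train, y_train = make_friedman1(n_samples=250, n_features=10, noise=1.0,
                                      random_state=0)
    X_test, y_test = make_friedman1(n_samples=1000, n_features=10, noise=0.0,
                                    random_state=1)
    # make_friedman2 and make_friedman3 provide the other two benchmarks.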

Results of the simulation for Friedman's data sets are given in Table 2. The MLABS model has the best performance in almost all cases, as shown in Table 2. A notable feature of this experiment is that the tensor product basis-based models, including MLABS, BASS, and MARS, are superior to the others; the results reflect whether the interaction terms can be estimated directly or not. Although the SBART and BART have relatively good prediction abilities, MLABS outperforms them for all test functions regardless of the RSNR. The average rank in Table 2 shows that the ensemble models RF and XGB, and the kernel-based models BARK and SVM, perform poorly on Friedman's data sets. The NN is not well suited to handling small datasets, as seen in the previous surface examples.

Function Noise MLABS SBART BART BASS BARK RF SVM MARS NN XGB
Friedman 1 RSNR = 5 0.383 (1) 0.532 (3) 1.077 (6) 0.394 (2) 1.042 (5) 2.173 (8) 3.977 (9) 0.609 (4) 4.075 (10) 1.435 (7)
Friedman 1 RSNR = 1 1.796 (1) 1.92 (2) 2.156 (4) 2.117 (3) 2.362 (5) 2.56 (8) 4.067 (9) 2.368 (6) 4.176 (10) 2.407 (7)
Friedman 2 RSNR = 5 16.679 (1) 27.731 (5) 44.088 (6) 16.994 (2) 26.553 (4) 50.754 (8) 78.98 (10) 24.106 (3) 53.783 (9) 48.897 (7)
Friedman 2 RSNR = 1 69.585 (2) 90.785 (4) 121.777 (5) 53.52 (1) 80.731 (3) 148.693 (8) 183.48 (10) 123.786 (6) 178.673 (9) 128.736 (7)
Friedman 3 RSNR = 5 0.061 (1) 0.063 (2) 0.079 (5) 0.066 (3) 0.092 (7) 0.105 (8) 0.125 (10) 0.073 (4) 0.088 (6) 0.105 (9)
Friedman 3 RSNR = 1 0.11 (1) 0.116 (2) 0.134 (4) 0.133 (3) 0.146 (7) 0.144 (6) 0.196 (9) 0.146 (8) 0.198 (10) 0.138 (5)
Average rank 1.17 (1) 3 (3) 5 (4) 2.33 (2) 5.17 (6) 7.67 (8) 9.5 (10) 5.17 (6) 9 (9) 7 (7)

Table 2: Average of predictive RMSEs over 100 pairs of held-in and held-out sets for Friedman’s test functions. The rank of the method among the ten approaches is shown in parentheses. The top-ranked model for each test function is given in bold.

We evaluate the out-of-sample performance of the methods on the Friedman 1 data set in high-dimensional settings for a more detailed comparison; in other words, we check how well the models work as the number of variables increases. We reproduce the simulation scenarios of linero2018sbart. We create five pairs of 250 training and 1000 test samples with $p$ features, where $p$ increases from 5 to 1000 along an evenly spaced grid on the log scale. Independent Gaussian noise with mean zero is also added to the training samples generated from the true mean function. Methods are compared by the average RMSE over the five replications, and each time the number of variables increases, most methods are re-tuned using cross-validation.

Results of this simulation are provided in Figure 6. An interesting aspect of Figure 6 is that MLABS achieves the best performance up to roughly 70-dimensional data irrespective of the noise level. Beyond that point, its error increases gradually in both the low- and the high-noise settings. Since MLABS and BASS show the same performance behavior, unlike MARS, these results seem to come from slow mixing of the RJMCMC algorithm. In contrast, the SBART and MARS are interestingly invariant to the number of predictors, and the SBART is superior to the other methods, including MLABS, in high-dimensional settings where $p$ is large.

Figure 6: Average root-mean-square error of various methods with a smoothing line as a function of the dimension on the log scale.

5 Real data applications

We now compare the MLABS model (10) with various competing methods in regression and classification problems on several real-world datasets.

5.1 Regression examples

We prepare six real-life datasets from the UCI Machine Learning Repository (UCI) and several R packages: caret, mfp, MASS, and AppliedPredictiveModeling. A summary of these data sets is provided in Table 3. Since the MLABS model can handle only quantitative variables, we do not consider categorical predictors in the data sets, and we remove observations with missing values. Specifically, case 42 of the bodyfat data appears to be an apparent error, and its height value is replaced by 69.5. The tecator meat and residential building datasets have multiple response variables; we choose one response in each data set: the percentage of protein (tecator meat) and the actual sales price (residential building).


Dataset # Samples # Features Source
Bodyfat 252 13 mfp
Boston housing 506 12 MASS
Concrete compressive strength 1030 8 UCI
Residential building 372 103 UCI
Tecator meat 215 100 caret
Chemical manufacturing process 152 58 AppliedPredictiveModeling

Table 3: Information of the six data sets for regression analysis.

We consider the nine competing approaches illustrated in Section 4 and select the best hyperparameters of each method using cross-validation. To gauge the predictive performance of the methods, we make use of five-fold cross-validation replicated 20 times and compute the average of the 20 estimated CV errors as a measure of accuracy.

Results of the experiment for the regression problems are presented in Table 4. Table 4 illustrates that the MLABS model has stable predictive ability, achieving the best performance on three data sets and the third-lowest average RMSE on the remaining three. By the average rank in Table 4, the MLABS model generally outperforms state-of-the-art methods from the fields of machine learning and Bayesian nonparametrics. Furthermore, for the tecator meat data, the tensor product basis based models work much better than the tree-based models do.

In contrast, the tree-based methods perform well on the chemical manufacturing process dataset and rank high among the methods. The kernel-based methods show poor performance in the regression examples, and the lowest-ranked approach is the SVM. These results are attributed to a lack of flexibility and adaptability to the data sets when only one type of kernel function is used.

Data MLABS SBART BART BARK BASS RF SVM MARS NN XGB
Bodyfat 4.08 (1) 4.1 (2) 4.22 (6) 4.17 (4) 4.13 (3) 4.32 (8) 8.08 (10) 4.36 (9) 4.18 (5) 4.28 (7)
Boston housing 2.95 (1) 3.11 (3) 3.14 (4) 3.31 (6) 4.25 (10) 3.19 (5) 3.75 (8) 3.84 (9) 3.71 (7) 3 (2)
Concrete compressive strength 4.3 (3) 4.82 (4) 4.14 (2) 6.66 (9) 5.44 (6) 4.89 (5) 5.9 (7) 6.31 (8) 8.1 (10) 3.85 (1)
Residential building 108.67 (1) 115.29 (3) 128.93 (5) 140.98 (7) 137.65 (6) 245.83 (9) 909.32 (10) 122.21 (4) 113.7 (2) 196.93 (8)
Tecator meat 1.17 (3) 1.53 (5) 2.18 (9.5) 2.18 (9.5) 1 (2) 2.04 (8) 1.86 (6) 0.98 (1) 1.42 (4) 1.93 (7)
Chemical manufacturing process 1.09 (3) 1.11 (5) 1.09 (4) 1.21 (6) 1.24 (7) 1.08 (2) 1.87 (10) 1.26 (8) 1.42 (9) 1.01 (1)
Average rank 2 (1) 3.67 (2) 5.08 (4) 6.92 (9) 5.67 (5) 6.17 (6.5) 8.5 (10) 6.5 (8) 6.17 (6.5) 4.33 (3)

Table 4: Average root-mean-square error of MLABS and competitive methods on the real data sets for the regression problems, with the rank of the method among the ten approaches in parentheses. The top-ranked model for each real data set is given in bold.

5.2 Classification examples

We choose seven competing methods for the classification problems and exclude the SBART and BASS because those two models cannot yet analyze binary data. We compare the MLABS model using the probit link with the other methods, whose hyperparameters are optimized using grid search with five-fold cross-validation, by a classification performance measure: the AUC (area under the receiver operating characteristic (ROC) curve). The AUC is the most common metric for classification tasks; its value lies between 0 and 1, where 1 indicates an excellent classifier. We calculate the average of the performance metrics obtained by repeating five-fold cross-validation 20 times. We collect seven real data sets for classification from the UCI Machine Learning Repository and two R packages: mlbench and datamicroarray. The Alon dataset is a high-dimensional microarray data set for colon cancer. The Pima Indian diabetes data set contains zero values for some variables, and we treat these values as missing. The missing values and categorical variables of every real data set for classification are processed in the same way as in the regression experiments. The real data sets are listed in Table 5 with information such as the sample size, the number of features, the source, and the imbalance ratio (IR), defined as

$\mathrm{IR} = \frac{\max_{c \in \mathcal{C}} n_c}{\min_{c \in \mathcal{C}} n_c},$

where $\mathcal{C}$ is the set of all classes and $n_c$ is the number of samples in class $c$.

Dataset # Samples # Features IR Source
Parkinson 195 22 3.06 UCI
Ionosphere 351 32 1.79 UCI
Breast cancer Wisconsin (Diagnostic) 569 30 1.7 UCI
Sonar 208 61 1.14 UCI
Spambase 4601 57 1.54 UCI
Pima Indian diabetes 392 9 2.02 mlbench
Alon 62 2000 1.82 datamicroarray

Table 5: Information of seven real data sets for classification tasks

Results of this experiment are given in Table 6. The columns of the methods show their average cross-validated AUC values over the 20 replicates. As shown in Table 6, the MLABS method does not show the best predictive performance for classification, but it is comparable to the XGB and RF as gold-standard models. Specifically, the MLABS model performs well in most cases except on the Ionosphere, Sonar, and Alon data sets; it appears to have difficulties in high-dimensional cases. Here, the XGB model provides the best performance, followed by the RF, MLABS, and BART models. In contrast with the regression problems, the tree-based models generally provide better predictive capabilities than the others.


Data MLABS BART BARK RF SVM MARS NN XGB
Parkinson 0.97 (2) 0.962 (4) 0.922 (6) 0.961 (5) 0.975 (1) 0.899 (7) 0.797 (8) 0.967 (3)
Ionosphere 0.971 (4) 0.963 (5) 0.95 (6) 0.978 (1) 0.978 (2) 0.935 (7) 0.917 (8) 0.976 (3)
Breast cancer Wisconsin 0.995 (1) 0.992 (3) 0.984 (7) 0.989 (4) 0.975 (8) 0.988 (5) 0.987 (6) 0.994 (2)
Sonar 0.907 (5) 0.933 (3) 0.79 (8) 0.941 (1) 0.909 (4) 0.864 (6) 0.853 (7) 0.935 (2)
Pima Indian Diabetes 0.849 (2) 0.847 (4) 0.846 (5) 0.848 (3) 0.801 (8) 0.816 (7) 0.831 (6) 0.852 (1)
Spambase 0.983 (3) 0.982 (4) 0.975 (6) 0.986 (2) 0.949 (8) 0.977 (5) 0.966 (7) 0.988 (1)
Alon 0.885 (4) 0.889 (3) 0.836 (6) 0.876 (5) 0.5 (8) 0.771 (7) 0.905 (2) 0.914 (1)
Average rank 3.125 (3) 3.5 (4) 6.25 (7) 3 (2) 5.875 (5) 6.375 (8) 6.125 (6) 1.75 (1)
Table 6: Average cross-validated AUC of MLABS and competitive methods on the real data sets for the classification problems, with the rank of the method among the eight approaches in parentheses. The top-ranked model for each real data set is given in bold.

6 Discussion

In this article, we have introduced a general Bayesian sum-of-bases model named Multivariate Lévy Adaptive B-Spline regression (MLABS), which uses tensor products of B-spline basis functions whose parameters are automatically determined by the Lévy random measure. The B-spline basis has nice properties such as local support and differentiability. We have illustrated that MLABS has powerful predictive ability compared with state-of-the-art methods in simulation studies and real data applications for regression problems. We also proposed a comparable classification model using the data augmentation strategy of albert1993bayesian.

However, the proposed model has drawbacks: it can treat only continuous variables and is somewhat inefficient because it relies on RJMCMC. The MCMC algorithm makes it difficult to deal with high-dimensional data, and the classifier based on the MLABS framework does not work as well as the tree-based models. Further studies are needed to address these problems.

Future work will develop a more versatile and efficient sampling scheme for the MLABS model. One possibility is to give up the Lévy process prior and use regularization priors to handle high-dimensional data with a fixed and large number of basis functions. Using the Bayesian backfitting algorithm of hastie2000bayesian, which is the core algorithm in BART, is expected to be more effective for achieving high performance and fast convergence than the inefficient RJMCMC. Moreover, scalable algorithms such as consensus Monte Carlo or variational Bayes can be applied to our model for large and tall data. Another possibility is to allow the tensor product bases to contain indicators for categorical data.

References

Appendix A Decision boundaries for MLABS

We apply the main classifiers, including MLABS, BART, RF, SVM, and XGB, to five binary-class datasets with two-dimensional feature spaces to visualize the classification performance. For each dataset, the decision boundaries of all methods are shown in Figure 7.

(a) Linearly separable dataset
(b) Circle dataset
(c) Two moons dataset
(d) XOR dataset
(e) Two spirals dataset
Figure 7: Comparison of decision boundaries of MLABS using the probit link and four classifiers on five data sets

Appendix B Tuning hyperparameters

To select optimal hyperparameters for all methods, we use a grid search approach using 5-fold cross-validation for all experiments. Table 7 summarizes the hyperparameter search spaces we are using.

Method Parameter Values considered
MLABS set of degree numbers: 0, 1, 2, 3, (0,1), (0,2), (0,3), (1,2), (1,3), (2,3), (0,1,2), (0,1,2,3)
MLABS maximum degree of interaction: 1, 2, 3
MLABS multiplier for expanded intervals: 0.1, 1, 2, 3
SBART number of trees: 20, 50, 200
BART sigma prior combinations: (3,0.9), (3,0.99), (10,0.75)
BART number of trees: 50, 20
BART prior: value for 1, 2, 3, 5
BARK type of prior for the scale parameters: “e”, “d”, “se”, “sd”
BASS degree of splines: 1, 2, 3
BASS maximum degree of interaction: 1, 2, 3
MARS maximum number of terms in the pruned model: 2, 12, 23, 34, 45, 56, 67, 78, 89, 100
MARS maximum degree of interaction: 1, 2, 3
RF number of trees: 2, …,
SVM regularization constant: 0.001, 0.01, 0.1, 1, 5, 10, 100
SVM kernel hyperparameter: 0.5, 1, 2, 3, 4
NN learning rate: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5
XGB max number of boosting iterations: 250, 500, 1000
XGB maximum depth of a tree: 4, 8, 12
XGB learning rate: 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40
XGB minimum sum of instance weight needed in a child: 1, 10, 15
XGB subsample ratio of columns: 0.7, 1

Table 7: Hyperparameters and ranges for all experiments