1 Introduction
Regression models are an indispensible tool for answering scientific questions in almost all research areas. Traditionally scientists have been trained to be extremely careful in specifying adequate models and to include not too many explanatory variables. The orthodox statistical approach warns against blindly collecting data of too many variables and relying on automatic procedures to detect the important ones (see for example Burnham and Anderson, 2002). Instead, expert based knowledge of the field should guide the model building process such that only a moderate number of models are compared to answer specific research questions.
In contrast, modern technologies have lead to the entirely different paradigm of machine learning where routinely extremely large sets of input explanatory variables  the so called features  are considered. Recently deep learning procedures have become quite popular and highly successful in a variety of real world applications
(Goodfellow et al., 2016). These algorithms apply iteratively some nonlinear transformations aiming at optimal prediction of response variables from the outer layer features. Each transformation yields another hidden layer of features which are also called neurons. The architecture of a deep neural network then includes the specification of the nonlinear intralayer transformations (
activation functions), the number of layers (depth), the number of features at each layer (width) and the connections between the neurons (weights). The resulting model is trained by means of some optimization procedure (e.g. stochastic gradient search) with respect to its parameters in order to fit a particular objective (like minimization of RMSE, or maximization of the likelihood, etc.).Surprisingly it is often the case that such procedures easily outperform traditional statistical models, even when these were carefully designed and reflect expert knowledge (Refenes et al., 1994; Razi and Athappilly, 2005; Adya and Collopy, 1998; Sargent, 2001; Kanter and Veeramachaneni, 2015). Apparently the main reason for this is that the features from the outer layer of the deep neural networks become highly predictive after being processed through the numerous optimized nonlinear transformations. Specific regularization techniques (dropout, and penalties on the weights, etc.) have been developed for deep learning procedures to avoid overfitting of training data sets, however success of the latter is not obvious. Normally one has to use huge data sets to be able to produce generalizable neural networks.
The universal approximation theorems (Cybenko, 1989; Hornik, 1991) prove that all neural networks with sigmoidal activation functions (generalized to the class of monotonous bounded functions in Hornik, 1991)
with at least one hidden layer can approximate any function of interest defined on a closed domain in the Euclidian space. Successful applications typically involve huge datasets where even nonparametric methods can be efficiently applied. One drawback of deep learning procedures is that, due to their complex nature, such models and their resulting parameters are difficult to interpret. Depending on the context this can be a more or less severe limitation. These models are densely approximating the function of interest and transparency is not a goal in the traditional applications of deep learning. However, in many research areas it might be desirable to obtain interpretable (nonlinear) regression models rather than just some dense representation of them. Another problem is that fitting deep neural networks is very challenging due to the huge number of parameters involved and the nonconcavity of the likelihood function. As a consequence optimization procedures often yield only local optima as parameter estimates.
This paper introduces a novel approach which combines the key ideas of deep neural networks with Bayesian regression resulting in a flexible and broad framework that we call deep Bayesian regression. This framework also includes many other popular statistical learning approaches. Compared to deep neural networks we do not have to prespecify the architecture but our deep regression model can adaptively learn the number of layers, the number of features within each layer and the activation functions. In a Bayesian model based approach potential overfitting is avoided through appropriate priors which strongly penalize engineered features from deeper layers. Furthermore deep Bayesian regression allows to incorporate correlation structures via latent Gaussian variables, which seems to be rather difficult to achieve within traditional deep neural networks.
Fitting of the deep Bayesian regression model is based on a Markov chain Monte Carlo (MCMC) algorithm for Bayesian model selection which is embedded in a genetic algorithm for feature engineering. A similar algorithm was previously introduced in the context of logic regression (Hubin et al., 2018)
. We further develop a reversible version of this genetic algorithm to obtain a proper MetropolisHastings algorithm. We will demonstrate that automatic feature engineering within regression models combined with Bayesian variable selection and model averaging can improve predictive abilities of statistical models whilst keeping them reasonably simple, interpretable and transparent. The predictive ability of deep Bayesian regression is compared with deep neural networks, CARTs, elastic networks, random forests, and other statistical learning techniques under various scenarios. Furthermore we illustrate the potential of our approach to find meaningful nonlinear models and infer on parameters of interest. As an example we will retrieve several ground physical laws from raw data.
The rest of the paper is organized as follows. The class of deep Bayesian regression models (DBRM) is mathematically defined in Section 2. In Section 3 we describe two algorithms for fitting DBRM, namely the genetically modified MJMCMC (GMJMCMC) and its reversible version (RGMJMCMC). In Section 4 these algorithms are applied to several real data sets. The first examples are aiming at prediction where the performance of our approach is compared with various competing statistical learning algorithms. Later examples have the specific goal of retrieving an interpretable model. In the final Section 5 some conclusions and suggestions for further research are given. Additional examples and details about the implementation can be found in the Appendix.
2 DBRM: The deep Bayesian regression model
We model the relationship between features and a response variable based on samples from a training data set. Let denote the response data and the
dimensional vector of input covariates for
. The proposed model is within the framework of a generalized linear model, but extended to include a flexible class of nonlinear transformations (features) of the covariates, to be further described in Section
2.1. This class includes a finite (though huge) number of possible features, which can in principle be enumerated as , . With this notation we define the deep Bayesian regression model (DBRM), including (potentially) up to features:(1a)  
(1b) 
Here
is the density (mass) of a probability distribution from the exponential family with expectation
and dispersion parameter , while is a link function relating the mean to the underlying covariates (McCullagh and Nelder, 1989). The features can enter through an additive structure with coefficients . Equation (1b) includes all possiblecomponents (features), using binary variables
indicating whether the corresponding variables are to be actually included into the model or not. Priors for the different parameters of the model are specified in Section 2.3.2.1 Topology of the feature space
The feature space is constructed through a hierarchical procedure similar to the deep learning approach (LeCun et al., 2015; Goodfellow et al., 2016), but allowing for automatic construction of the architecture. This is in contrast to deep neural networks where the architecture has to be set in advance (LeCun et al., 2015). We can also obtain posterior distributions of different architectures within the suggested approach.
Deep neural networks typically use one or several prespecified activation functions to compute hidden layer values, where currently the rectified linear unit
is the most popular. The configuration of where (at which layers and for which subsets of neurons) these functions are applied is fixed. In contrast, in our approach the activation function can be dynamically selected from a prespecified set of nonlinear functions, which can include, apart from the rectified linear unit, several other transformations, possibly adjustable to the particular application. Examples include , , , and .The construction of possible features will, similarly to deep neural networks, be performed recursively through nonlinear combinations of previously defined features. Let denote a set of nonlinear functions, . Define the depth of a feature as the maximal number of nonlinear functions from applied recursively when generating that feature. For example, a feature has depth equal to 3. Denote the set of features of depth by , which will be of size . Furthermore denote the vector of all features of depth less or equal to by . The inner layer of features of depth zero consists of the original covariates themselves, , where we drop the index for notational convenience. Then and . A new feature is obtained by applying a nonlinear transformation on an affine transformation of :
(2) 
Equation (2) has the functional form most commonly used in deep neural networks models (Goodfellow et al., 2016), though we allow for linear combinations of features from layers of different depths as well as combinations of different nonlinear transformations. The affine transformation in (2) is parameterized by the intercept and the coefficient vector of the linear combination which is typically very sparse but must include at least one nonzero coefficient corresponding to a feature from . Different features of are distinguished only according to the hierarchical pattern of nonzero entries of . This is similar to the common notion in variable selection problems where models including the same variables but different nonzero coefficients are still considered to represent the same model. In that sense our features are characterized by the model topology and not by the exact values of the coefficients .
The number of features with depth of size can be calculated recursively with respect to the number of features from the previous layer, namely
(3) 
where as discussed above and denotes the number of different functions included in . One can clearly see that the number of features grows exponentially with depth. In order to avoid potential overfitting through too complex models the two constraints are defined.
Constraint 1.
The depth of any feature involved is less or equal to .
Constraint 2.
The total number of features in a model is less or equal to .
The first constraint ensures that the feature space is finite, with total size , while the second constraint limits the number of possible models by . The (generalized) universal approximation theorem (Hornik, 1991) is applicable to the defined class of models provided that contains at least one bounded monotonously increasing function. Hence the defined class of models is dense in terms of approximating any function of interest in the closed domain of the Euclidean space.
The feature space we have iteratively constructed through equation (2) is extremely rich and encompasses as particular cases features from numerous other popular statistical and machine learning models. If the set of nonlinear functions only consists of one specific transformation, for example where
is the sigmoid function, then the corresponding feature space includes all possible neural networks with the sigmoid activation function. Another important class of models included in the DBRM framework are decision trees
(Breiman et al., 1984). Simple decision rules correspond to the nonlinear function. Intervals and higher dimensional regions can be defined through multiplications of such terms. Multivariate adaptive regression splines
(Friedman, 1991) are included by allowing a pair of piecewise linear functions and . Fractional polynomials (Royston and Altman, 1997) can also be easily included through . Logic regression, characterized by features being logic combinations of binary covariates (Ruczinski et al., 2003; Hubin et al., 2018) is also fully covered by DBRM models. Combining more than one function in provides quite interesting additional flexibilities in construction of features, e.g. .Interactions between (or multiplication of) variables are important features to consider. Assuming that both and are members of , multiplication of two (positive) features becomes a new feature with depth via
(4) 
However, due to its importance we will include the multiplication operator between two features directly in our algorithmic approach (see Section 3) and treat it simply as an additional transformation with depth 1.
2.2 Feature engineering
In principle one could generate any feature defined sequentially through (2) but in order to construct a computationally feasible algorithm for inference, restrictions on the choices of ’s are made. Our feature engineering procedure computes specific values of for any engineered feature in the topology using the following three operators which take features of depth as input and create new features of depth .
The modification operator is the special case of (2) where has only one nonzero element . The crossover operator can also be seen as a special case in the sense of (4). Only for the general projection operator one has to estimate , which is usually assumed to be very sparse. We have currently implemented four different strategies to compute parameters. Here and in Section 4 we focus on the simplest and computationally most efficient version. The other three strategies as well as further ideas to potentially improve the feature engineering step are discussed in Section 5 and in Appendix A available in the web supplement.
Our default procedure to obtain is to compute maximum likelihood estimates for model (1) including only as covariates, that is for
(5) 
This choice is made not only for computational convenience, but also has some important advantages. The nonlinear transformation is not involved when computing . Therefore the procedure can easily be applied for nonlinear transformations
which are not differentiable, like for example the extremely popular rectified linear unit of neural networks or the characteristic functions for decision trees. Furthermore ML estimation for generalized linear models usually involves convex optimization problems with unique solutions. On the other hand this simple approach means that the parameters
from are not reestimated but kept fixed, a restriction which will be overcome by some of the alternative strategies (including a fully Bayesian one) introduced in Appendix A.2.3 Bayesian model specifications
In order to put model (1) into a Bayesian framework one has to specify priors for all parameters involved. The structure of a specific model is uniquely defined by the vector . We introduce model priors which penalize for number and the complexity of included features in the following way:
(6) 
The measure is a nondecreasing function of the complexity of feature . With , the prior prefers both fewer terms and simpler features over more complex ones.
There are many different ways of defining feature complexity. We will consider a measure taking into account both the number of nonlinear transformations and the number of features used at each transformation step. Define the local width of a feature as the number of nonzero coefficients of (including ) in equation (2). Features obtained with a modification operator or a crossover operator both have local width 1. However, features may inherit different widths from parental layers. Define accordingly the total width of a feature recursively as the sum of all local widths of features contributing in equation (2). In the current implementation of DBRM this total width serves as complexity measure as illustrated in the following example. Consider a feature of the form
which has a depth of and a total width of . Here refers to the ”norm” (the number of nonzero elements in the corresponding vector) and the additional 3 corresponds to the intercept terms.
To complete the Bayesian model one needs to specify priors for , the vector of regression parameters for which and, if necessary, for the dispersion parameter .
(7)  
(8) 
Prior distributions on and
are usually defined in a way to facilitate efficient computation of marginal likelihoods (for example by specifying conjugate priors) and should be carefully chosen for the applications of interest. Specific choices are described in Section
4 when considering different real data sets.2.4 Extensions of DBRM
Due to the modelbased approach of DBRM different kinds of extensions can be considered. One important extension is to include latent variables, both to take into account correlation structures and overdispersion. Simply replace (1b) by
(9) 
In this formulation of the DBRM model, equation (9) now includes possible components where indicates whether the corresponding latent variable is to be included into the model or not. The latent Gaussian variables with covariance matrices
allow to describe different correlation structures between individual observations (e.g. autoregressive models). The matrices typically depend only on a few parameters, so that in practice one has
.The model vector now becomes where . Similar to the restriction on the number of features that can be included in a model, we introduce an upper limit
on the number of latent variables. The total number of models with nonzero prior probability will then be
. The corresponding prior for the model structures is defined by(10) 
Here the function is a measure for the complexity of the latent variable
, which is assumed to be a nondecreasing function of the number of hyperparameters defining the distribution of the latent variable. In the current implementation we simply count the number of hyperparameters. The prior is further extended to include
(11) 
2.5 Bayesian inference
Posterior marginal probabilities for the model structures are, through Bayes formula, given by
(12) 
where denotes the marginal likelihood of given a specific model . Due to the huge size of it is not possible to calculate the sum in the denominator of (12) exactly. In Section 3 we will discuss how to obtain estimates of using MCMC algorithms.
The (estimated) posterior distribution of any statistic of interest (like for example in predictions) becomes
(13) 
The corresponding expectation is obtained via model averaging:
(14) 
An important example is the posterior marginal inclusion probability of a specific feature , which can be estimated by
(15) 
This provides a well defined measure of the importance of an individual component.
3 Fitting of DBRM
In this section we will develop algorithmic approaches for fitting the DBRM model. The main tasks are to (i) calculate the marginal likelihoods for a given model and (ii) to search through the model space . Concerning the first issue one has to solve the integral
(16) 
where are all the parameters involved in model . In general these marginal likelihoods are difficult to calculate. Depending on the model specification we can either use exact calculations when these are available (Clyde et al., 2011) or numerical approximations based on simple Laplace approximations (Tierney and Kadane, 1986), the popular integrated nested Laplace approximation (INLA) (Rue et al., 2009) or MCMC based methods like Chib’s or Chib and Jeliazkov’s method (Chib, 1995; Chib and Jeliazkov, 2001). Some comparison of these methods are presented in Friel and Wyse (2012) and Hubin and Storvik (2016).
The parameters of DBRM for a specified model topology consist of , the regression coefficients for the features, and , the dispersion parameter. If latent Gaussian variables are included into the models, parameters will also be part of . We are not including here the set of coefficients which encompasses all the parameters inside the features of model . These are considered simply as constants, used to iteratively generate features of depth as described in Section 2.2. One can make the model more general and consider as part of the parameter vector . However, solving the integral (16) over the full set of parameters including will become computationally extremely demanding due to the complex nonlinearity and the highdimensional integrals involved. Some possibilities how to tackle this problem in the future are portended in Section 5.
Condsider now task (ii), namely the development of algorithms for searching through the model space. Calculation of requires to iterate through the space including all potential models, which due to the combinatorial explosion (3) becomes computationally infeasible for even a moderate set of input variables and latent Gaussian variables. We therefore aim at approximating by means of finding a subspace which can be used to approximate (12) by
(17) 
Low values of induce both low values of the numerator and small contributions to the denominator in (12), hence models with low mass will have no significant influence on posterior marginal probabilities for other models. On the other hand, models with high values of are important to be addressed. It might be equally important to include regions of the model space where no single model has particularly large mass but there are many models giving a moderate contribution. Such regions of high posterior mass are particularly important for constructing a reasonable subspace and missing them can dramatically influence our posterior estimates.
The mode jumping MCMC algorithm (MJMCMC) was introduced in Hubin and Storvik (2018) for variable selection within standard GLMM models, that is models where all possible features are prespecified. The main ingredient in MJMCMC is the specification of (possibly large) moves in the model space. This algorithm was generalized to the genetically modified MJMCMC algorithm (GMJMCMC) in the context of logic regression by Hubin et al. (2018). The GMJMCMC is not a proper MCMC algorithm in the sense of converging to the posterior distribution although it does provide consistent model estimates by means of the approximation (17). In the following two subsections we are suggesting an adaptation of the GMJMCMC algorithm to DBRM models. Additionally, we derive a fully reversible GMJMCMC algorithm (RGMJMCMC). Since both algorithms rely on the MJMCMC algorithm we start with a short review of this algorithm. Throughout this section without loss of generality we will only consider features and not latent variables. Selection of latent variables is part of the implemented algorithms but only complicates the presentation.
3.1 The mode jumping MCMC algorithm
Consider the case of a fixed predefined set of potential features with no latent variables. Then the general model space is of size and standard MCMC algorithms tend to get stuck in local maxima. The mode jumping MCMC procedure (MJMCMC) was originally proposed by Tjelmeland and Hegstad (1999) for continuous space problems and recently extended to model selection settings by Hubin and Storvik (2018). MJMCMC is a proper MCMC algorithm equipped with the possibility to jump between different modes within the discrete model space. The algorithm is described in Algorithm 1.
(18) 
The basic idea is to make a large jump (changing many model components) combined with local optimization within the discrete model space to obtain a proposal model with high posterior probability. Within a MetropolisHastings setting a valid acceptance probability is constructed using symmetric backward kernels, which guarantees that the resulting Markov chain is ergodic and has the desired limiting distribution
(see Hubin and Storvik, 2018, for details).3.2 Genetically Modified MJMCMC
The MJMCMC algorithm requires that all the covariates (features) defining the model space are known in advance and are all considered at each iteration of the algorithm. For the DBRM models, the features are of a complex structure and a major problem in this setting is that it is quite difficult to fully specify the space in advance (let alone storing all potential features in some data structure). The idea behind GMJMCMC is to apply the MJMCMC algorithm within a smaller set of model components in an iterative setting.
3.2.1 Main algorithm
Throughout our search we generate a sequence of so called populations . Each is a set of features and forms a separate search space for exploration through MJMCMC iterations. Populations dynamically evolve allowing GMJMCMC to explore different parts of the total model space. Algorithm 2 summarizes the procedure, the exact generation of given is described below.
The following result is concerned with consistency of probability estimates of GMJMCMC when the number of iterations increases. The theorem is an adaption of Theorem 1 in Hubin et al. (2018):
Theorem 1.
Let be the set of models visited through the GMJMCMC algorithm. Define to be the set of models visited at iteration within search space . Assume and forms an irreducible Markov chain over the possible states. Then the model estimates based on (17) will converge to the true model probabilities as the number of iterations converges to .
Proof.
Note that the approximation (17) will provide the exact answer if . It is therefore enough to show that the algorithm in the limit will have visited all possible models. Since the state space of the irreducible Markov chain is finite, it is also recurrent, and there exists a stationary distribution with positive probabilities on every model. Thereby, all states, including all possible models of maximum size , will eventually be visited. ∎
Remark
All models visited, also those auxiliary ones which are used by MJMCMC to generate proposals, will be included into . For these models, also computed marginal likelihoods will be stored, making the costly likelihood calculations only necessary for models that have not been visited before.
3.2.2 Initialization
The algorithm is initialized by first applying some procedure (for example based on marginal testing) which selects a subset of input covariates. We denote these preselected components by where for notational convenience we have ordered indices according to preselection which does not impose any loss of generality. Note that depending on the initial preselection procedure, might include a different number of components than all further populations . MJMCMC is then run for a given number of iterations on and the resulting input components with highest frequency (ratio of models including the component) will become the first members of population . The remaining members of will be newly created features generated by applying the transformations described in Section 3.2.3 on members of .
3.2.3 Transition between populations
Members of the new population are generated by applying certain transformations to components of . First some components with low frequency from search space are removed using a operator. The removed components are then replaced, where each replacement is generated randomly by a mutation operator with probability , by a crossover operator with probability , by a modification operator with probability or by a projection operator with probability , where (adaptive versions of these probabilities are considered in section 3.3). The operators to generate potential features of are formally defined below, where the modification, crossover and projection operators have been introduced already in Section 2.1. Combined, these possible transformations fulfill the requirement about irreducibility in Theorem 1.
 Filtration:

Features within with estimated posterior probabilities below a given threshold are deleted with probability . The algorithm offers the option that a subset of is always kept in the population throughout the search.
 Mutation:

A new feature is randomly selected from .
 Modification:

A new feature is created where is randomly selected from . and is randomly selected from .
 Crossover:

A new feature is created by randomly selecting and from .
 Projection:

A new feature is created in three steps. First a (small) subset of is selected by sampling without replacement. Then is specified according to the rules described in Section 2.2 and finally is randomly selected from .
For all features generated with any of these operators it holds that if either the newly generated feature is already present in or it has linear dependence with the currently present features then it is not considered for . In that case a different feature is generated as just described.
3.2.4 Reversible Genetically Modified MJMCMC
The GMJMCMC algorithm described above is not reversible and hence cannot guarantee that the ergodic distribution of its Markov chain corresponds to the target distribution of interest (see Hubin et al., 2018, for more details). An easy modification based on performing both forward and backward swaps between populations can provide a proper MCMC algorithm in the model space of DBRM models. Consider a transition with a given probability kernel. Here is the proposal for a new population, transitions are generated by local MJMCMC within the model space induced by , and the transition is some randomization at the end of the procedure as described in the next paragraph. The following theorem shows the detailed balance equation for the suggested swaps between models.
Theorem 2.
Assume and are generated according to the large jump proposal distribution . Assume further are generated according to . Put
where
(19) 
Then .
Proof.
Define
Then by construction . Define to be a proposal from the distribution . Then the Metropolis Hastings acceptance ratio becomes
which reduces to . ∎
From this theorem it follows that if the Markov chain is irreducible in the model space then it is ergodic and converges to the right posterior distribution. The described procedure marginally generates samples from the target distribution, i.e. according to model posterior probabilities . Note that the populations themselves do not have to be stored, they are only needed for the generation of new models. Instead of using the approximation (17) one can get frequency based estimates of the model posteriors . For a sequence of simulated models from an ergodic MCMC algorithm with as a stationary distribution it holds that
(20) 
and similar results are valid for estimates of the posterior marginal inclusion probabilities (15).
Proposals are obtained as follows. First all members of are included. Then additional features are added similarly as described in Section 3.2.3 but with replaced by the population including all components in . An adaptive version of this can be achieved by dynamically changing to include all features that previously have been considered, the validity of which is explained in Section 3.3.
The randomization is performed by possible swapping of the features within , each with a small probability . Note that this might give a reverse probability being zero if does not include all components in . In that case the proposed model is not accepted. Otherwise the ratio of the proposal probabilities can be written as , where is the Hamming distance (the number of components differing).
3.3 Important computational tricks
To make the algorithms work sufficiently fast our implementation includes several tricks to be described below.
Delayed rejection
In order to make the computations more efficient and avoid unnecessary backward searches we make use of the so called delayed acceptance approach. The most computationally demanding parts of the RGMJMCMC algorithms are the forward and backward MCMC searches (or optimizations). Often the proposals generated by forward search have a very small probability resulting in a low acceptance probability regardless of the way the backwards auxiliary variables are generated. In such cases, one would like to reject directly without performing the backward search. This can be achieved by the delayed acceptance procedure (Christen and Fox, 2005; Banterle et al., 2015) which can be applied in our case due to the following result:
Theorem 3.
Assume and is generated according to the RGMJMCMC algorithm. Accept if both

is preliminarily accepted with a probability

and is finally accepted with a probability .
Then also .
Proof.
In general delayed acceptance results in a decreased total acceptance rate (Banterle et al., 2015, remark 1), but it is still worthwhile due to the computational gain by avoiding the backwards search step in case of preliminary rejection. Delayed acceptance is implemented in the RGMJMCMC algorithm of our Rpackage and is used in the examples of Section 4.
Adaptive proposals
Another important trick consists of using the chain’s history to approximate marginal inclusion probabilities and utilize the latter for the proposal of new populations. Using any measure based on the marginal inclusion probabilities is valid for the reversible algorithm by Theorem 1 from Roberts and Rosenthal (2007) for which two conditions need to be satisfied:
 Simultaneous uniform ergodicity:

For any set of tuning parameters, the Markov chain should be ergodic.
In our case, we have a finite number of models in the model space and hence their enumeration can be performed theoretically within a limited time provided irreducibility of the algorithm. Hence the first condition of the theorem is satisfied.  Diminishing adaptation:

The difference between following transition probabilities converge to zero.
As long as the inclusion probabilities that are used are truncated away from 0 and 1 by a small value , the frequencies will converge to the true marginal inclusion probabilities and the diminishing adaptation condition is satisfied. This is again possible because of the irreducibility of the constructed Markov chains in the finite model space.
Parallelization strategy
Due to our interest in exploring as many unique high quality models as possible and doing it as fast as possible, running multiple parallel chains is likely to be computationally beneficial compared to running one long chain. The process can be embarrassingly parallelized into chains. If one is mainly interested in model probabilities, then equation (17) can be directly applied with now being the set of unique models visited within all runs. A more memory efficient alternative is to utilize the following posterior estimates based on weighted sums over individual runs:
(21) 
Here is a set of arbitrary normalized weights and are the posteriors obtained with either equation (17) or (20) from run of GMJMCMC or RGMJMCMC. Due to the irreducibility of the GMJMCMC procedure it holds that where is the number of iterations. Thus for any set of normalized weights the approximation converges to the true posterior probability and one could use for example
. However, uniform weights have the disadvantage of potentially giving too much weight to posterior estimates from chains that have not quite converged. In the following heuristic improvement
is chosen to be proportional to the posterior mass detected by run ,This choice indirectly penalizes chains that cover smaller portions of the model space. When estimating posterior probabilities using these weights we only need, for each run, to store the following quantities: for all statistics of interest and as a ’sufficient’ statistic of the run. There is no further need of data transfer between processes. A proof that this choice of weights gives consistent estimates of posterior probabilities is given in Hubin et al. (2018).
4 Applications
In this section we will first present three examples addressing prediction in the classification setting, where the performance of DBRM with GMJMCMC and RGMJMCMC is compared with nine competing algorithms. Then we present two examples of model inference after fitting deep regression models with GMJMCMC and RGMJMCMC. Additionally two examples are presented in the Appendix, where the first one considers data simulated using a logic regression model and the second one illustrates the extended DBRM including latent Gaussian variables to analyze epigenetic data.
4.1 Prediction
The first three examples of binary classification use the following publicly available data sets: NEO objects data from NASA Space Challenge 2016 (LLC, 2016), a breast cancer data set (Wolberg et al., 1992) and some data concerned with spam emails (Cranor and LaMacchia, 1998)
. The performance of DBRM is compared with the following competitive algorithms: tree based (TXGBOOST) and linear (LXGBOOST) gradient boosting machines, elastic networks (LASSO and RIDGE), deep dense neural networks with multiple hidden fully connected layers (DEEPNETS), random forest (RFOREST), naive Bayes (NBAYES), and simple
frequentistlogistic regressions (LR). The corresponding R libraries, functions and their parameters settings are given in supplementary scripts.DBRM is fitted using either the GMJMCMC algorithm (DBRM_G) or the reversible version (DBRM_R), where, additionally to the standard algorithms parallel versions using threads were applied (DBRM_G_PAR and DBRM_R_PAR). For all classification examples the set of nonlinear transformations is defined as , with . Additionally a DBRM model with maximum depth (LBRM) is included, which corresponds to a linear Bayesian model using only the original covariates.
Within DBRM, we apply logistic regression with independent observations, namely:
(22a)  
(22b) 
The Bayesian model uses the model structure prior (6) with and . The resulting posterior corresponds to performing model selection with a criterion whose penalty on the complexity is similar to the AIC criterion, which is known (at least for the linear model) to be asymptotically optimal in terms of prediction (Burnham and Anderson, 2002).
The logistic regression model does not have a dispersion parameter and the Bayesian model is completed by using Jeffrey’s prior for the regression parameters
Here is the determinant of the Fisher information matrix.
Predictions based on DBRM are made according to
where we have used the notation for a response variable in the test set. Furthermore
with denoting the set of all explored models and
where is the posterior mode in . The most common threshold for prediction is . Calculation of marginal likelihoods are performed through the Laplace approximation.
To evaluate the predictive performance of algorithms we report the accuracy of predictions (ACC), false positive rate (FPR) and false negative rate (FNR), defined as follows:
Here is the size of the test data sample. For algorithms with a stochastic component, runs were performed in the training data set and the test set was analysed with each of the obtained models, where we kept the split between training and test samples fixed. We then report the median as well as the minimum and maximum of the evaluation measures across those runs. For deterministic algorithms only one run was performed.
Example 1: Neo asteroids classification
The dataset consists of characteristic measures of 20766 asteroids, some of which are classified as potentially hazardous objects (PHO), whilst others are not. Measurements of the following nine explanatory variables are available:
Mean anomaly, Inclination, Argument of perihelion, Longitude of the ascending node, Rms residual, Semi major axis, Eccentricity, Mean motion, Absolute magnitude.Algorithm  ACC  FNR  FPR 

LBRM  0.9999 (0.9999,0.9999)  0.0001 (0.0001,0.0001)  0.0002 (0.0002,0.0002) 
DBRM_G_PAR  0.9998 (0.9986,1.0000)  0.0002 (0.0001,0.0021)  0.0000 (0.0000,0.0000) 
DBRM_R_PAR  0.9998 (0.9964,0.9999)  0.0002 (0.0001,0.0052)  0.0000 (0.0000,0.0000) 
DBRM_R  0.9998 (0.9946,1.0000)  0.0002 (0.0001,0.0076)  0.0002 (0.0000,0.0056) 
DBRM_G  0.9998 (0.9942,1.0000)  0.0002 (0.0001,0.0082)  0.0002 (0.0000,0.0072) 
LASSO  0.9991 (,)  0.0013 (,)  0.0000 (,) 
RIDGE  0.9982 (,)  0.0026 (,)  0.0000 (,) 
LXGBOOST  0.9980 (0.9980,0.9980)  0.0029 (0.0029,0.0029)  0.0000 (0.0000,0.0000) 
LR  0.9963 (,)  0.0054 (,)  0.0000 (,) 
DEEPNETS  0.9728 (0.8979,0.9979)  0.0384 (0.0018,0.1305)  0.0000 (0.0000,0.0153) 
TXGBOOST  0.8283 (0.8283,0.8283)  0.0005 (0.0005,0.0005)  0.3488 (0.3488,0.3488) 
RFOREST  0.8150 (0.6761,0.9991)  0.1972 (0.0003,0.3225)  0.0162 (0.0000,0.3557) 
NBAYES  0.6471 (,)  0.0471 (,)  0.4996 (,) 
The training sample consisted of objects (32 of which are PHO, whilst the other 32 are not) and the test sample of the remaining objects. The results of Table 1
show that even with such a small training set most methods tend to perform very well. The naive Bayes classifier has the smallest accuracy with a huge number of false positives. The tree based methods also have comparably small accuracy, where tree based gradient boosting in addition delivers too many false positives. Random forests tend to have on average too many false negatives, though there is huge variation of performance between different runs ranging from almost perfect accuracy down to accuracy as low as the naive Bayes classifier.
The DBRM methods are among the best methods for this data set and there is practically no difference between DBRM_R and DBRM_G. The best median performance has LBRM which indicates that nonlinear structures do not play a big role in this example and all the other algorithms based on linear features (LASSO, RIDGE, logistic regression, linear gradient boosting) performed similarly well. LBRM gives the same result in all simulation runs, the parallel versions of DBRM give almost the same model as LBRM and only rarely add some nonlinear features, whereas the single threaded versions of DBRM much more often include nonlinear features (Table 4). The slight variation between simulation runs suggests that in spite of the general good performance of DBRM_G and DBRM_R both algorithms have not fully converged in some runs.
Example 2: Breast cancer classification
The second example consists of breast cancer data with observations from 357 benign and 212 malignant tissues. Features are obtained from digitized images of fine needle aspirates (FNA) of breast mass. Ten realvalued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension
. For each feature, the mean, standard error, and ”worst” or largest value (mean of the three largest values) per image were computed, resulting in 30 input variables per image, see
Wolberg et al. (1992) for more details on how the features were obtained. A randomly selected quarter of the images was used as a training data set, the remaining images as a test set.Algorithm  ACC  FNR  FPR 

DBRM_R_PAR  0.9765 (0.9695,0.9812)  0.0479 (0.0479,0.0479)  0.0074 (0.0000,0.0184) 
DBRM_G_PAR  0.9742 (0.9695,0.9812)  0.0479 (0.0479,0.0536)  0.0111 (0.0000,0.0184) 
RIDGE  0.9742 (,)  0.0592 (,)  0.0037 (,) 
LBRM  0.9718 (0.9648,0.9765)  0.0592 (0.0536,0.0702)  0.0074 (0.0000,0.0148) 
DBRM_G  0.9695 (0.9554,0.9789)  0.0536 (0.0479,0.0809)  0.0148 (0.0037,0.0326) 
DEEPNETS  0.9695 (0.9225,0.9789)  0.0674 (0.0305,0.1167)  0.0074 (0.0000,0.0949) 
DBRM_R  0.9671 (0.9577,0.9812)  0.0536 (0.0479,0.0702)  0.0148 (0.0000,0.0361) 
LR  0.9671 (,)  0.0479 (,)  0.0220 (,) 
LASSO  0.9577 (,)  0.0756 (,)  0.0184 (,) 
LXGBOOST  0.9554 (0.9554,0.9554)  0.0809 (0.0809,0.0809)  0.0184 (0.0184,0.0184) 
TXGBOOST  0.9531 (0.9484,0.9601)  0.0647 (0.0536,0.0756)  0.0326 (0.0291,0.0361) 
RFOREST  0.9343 (0.9038,0.9624)  0.0914 (0.0422,0.1675)  0.0361 (0.0000,0.1010) 
NBAYES  0.9272 (,)  0.0305 (,)  0.0887 (,) 
Qualitatively the results presented in Table 2 are quite similar to those from Example 4.1. The naive Bayes classifier and random forests have the worst performance where NBAYES gives too many false positives and RFOREST too many false negatives, though less dramatically than in the previous example. All the algorithms based on linear features are performing much better which once again indicates that in this dataset nonlinearities are not of primary importance. Nevertheless both versions of the DBRM algorithm, and in this example also DEEPNETS, are among the best performing algorithms. DBRM run on 32 parallel threads gives the highest median accuracy and performs substantially better than DBRM run only on one thread.
Example 3: Spam classification
In this example we are using the data from Cranor and LaMacchia (1998) for detecting spam emails. The concept of ”spam” is extremely diverse and includes advertisements for products and web sites, money making schemes, chain letters, the spread of unethical photos and videos, et cetera. In this data set the collection of spam emails consists of messages which have been actively marked as spam by users, whereas nonspam emails consist of messages filed as workrelated or personal. The data set includes 4601 emails, with 1813 labeled as spam. For each email, 58 characteristics are listed which can serve as explanatory input variables. These include 57 continuous and 1 nominal variable, where most of these are concerned with the frequency of particular words or characters. Three variables provide different measurements on the sequence length of consecutive capital letters. The data was randomly divided into a training data set of 1536 emails and a test data set of the remaining 3065 emails.
Table 3 reports the results for the different methods. Once again the naive Bayes classifier performed worst. Apart from that the order of performance of the algorithms is quite different from the first two examples. The tree based algorithms show the highest accuracy whereas the five algorithms based on linear features have less accuracy. This indicates that nonlinear features are important in this dataset to discriminate between spam and nonspam. As a consequence DBRM performs better than LBRM.
Specifically the parallel version of DBRM provides almost the same accuracy as DEEPNETS, with the minimum accuracy over 100 runs being actually larger, the median and maximum accuracy quite comparable. However, tree based gradient boosting and random forests perform substantially better which is mainly due to the fact that they can optimize cutoff points for the continuous variables. One way to potentially improve the performance of DBRM would be to include multiple characteristic functions, like for example , into the set of nonlinear transformations to allow the generation of features with splitting points like in random trees.
Algorithm  ACC  FNR  FPR 

TXGBOOST  0.9465 (0.9442,0.9481)  0.0783 (0.0745,0.0821)  0.0320 (0.0294,0.0350) 
RFOREST  0.9328 (0.9210,0.9413)  0.0814 (0.0573,0.1174)  0.0484 (0.0299,0.0825) 
DEEPNETS  0.9292 (0.9002,0.9357)  0.0846 (0.0573,0.1465)  0.0531 (0.0310,0.0829) 
DBRM_R_PAR  0.9268 (0.9162,0.9390)  0.0897 (0.0780,0.1057)  0.0538 (0.0415,0.0691) 
DBRM_G_PAR  0.9251 (0.9139,0.9377)  0.0897 (0.0766,0.1024)  0.0552 (0.0445,0.0639) 
DBRM_G  0.9243 (0.9113,0.9328)  0.0927 (0.0808,0.1116)  0.0552 (0.0465,0.0658) 
DBRM_R  0.9237 (0.9106,0.9351)  0.0917 (0.0801,0.1116)  0.0557 (0.0474,0.0672) 
LR  0.9194 (,)  0.0681 (,)  0.0788 (,) 
LBRM  0.9178 (0.9168,0.9188)  0.1090 (0.1064,0.1103)  0.0528 (0.0523,0.0538)) 
LASSO  0.9171 (,)  0.1077 (,)  0.0548 (,) 
RIDGE  0.9152 (,)  0.1288 (,)  0.0415 (,) 
LXGBOOST  0.9139 (0.9139,0.9139)  0.1083 (0.1083,0.1083)  0.0591 (0.0591,0.0591) 
NBAYES  0.7811 (,)  0.0801 (,)  0.2342 (,) 
Complexities of the features for the prediction examples
One can conclude from these three examples that DBRM has good predictive performance both when nonlinear patterns are present (Example 4.1 and 4.1) or when they are not (Example 4.1). Additionally DBRM has the advantage that its generated features are highly interpretable. Excel sheets are provided as supplementary material which present all features detected by DBRM with posterior probability larger than 0.1 and Table 4 provides the corresponding frequency distribution of the complexity of these features.
In Example 4.1 concerned with the asteroid data all reported nonlinear features had a complexity of 2. As mentioned previously the parallel version of DBRM detected way less nonlinear features than the simple versions which suggests that DBRM_G and DBRM_R have not completely converged in some simulation runs. Approximately half of the nonlinear features were modifications and the other half interactions. In this example not a single projection was reported in all simulation runs by any of the DBRM implementations.
Example1: Asteroid  

complexity  DBRM_G  DBRM_R  DBRM_G_PAR  DBRM_R_PAR  LBRM 
1  8.9600  8.9700  9.0000  9.0000  9.0000 
2  2.5800  2.6200  0.0500  0.1500  0.0000 
Total  11.540  11.590  9.0500  9.1500  9.0000 
Example2: Breast cancer  
complexity  DBRM_G  DBRM_R  DBRM_G_PAR  DBRM_R_PAR  LBRM 
1  11.300  11.730  14.200  10.790  29.830 
2  3.0900  3.0600  0.0400  0.2100  0.0000 
3  0.3000  0.0000  0.0000  0.0000  0.0000 
6  0.0000  0.0100  0.0000  0.0000  0.0000 
7  0.0000  0.0100  0.0000  0.0000  0.0000 
Total  14.420  14.810  14.240  11.000  29.830 
Example3: Spam mail  
complexity  DBRM_G  DBRM_R  DBRM_G_PAR  DBRM_R_PAR  LBRM 
1  36.340  36.090  39.870  39.170  49.830 
2  14.450  14.830  21.470  22.430  0.0000 
3  2.8300  3.1700  5.2400  5.8100  0.0000 
4  0.6900  0.5700  1.3600  1.3600  0.0000 
5  1.1500  1.0900  1.5600  1.6800  0.0000 
6  0.9200  0.7400  1.2400  1.0700  0.0000 
7  0.3700  0.4000  0.5700  0.4200  0.0000 
8  0.2500  0.2200  0.3300  0.1700  0.0000 
9  0.0400  0.0800  0.1600  0.1100  0.0000 
10  0.1500  0.1100  0.1100  0.1800  0.0000 
Total  57.190  57.300  71.910  72.400  49.830 
Also in Example 4.1 the parallel versions of DBRM reported a substantially smaller number of nonlinearities than the singlethreaded version. Over all simulation runs only DBRM_R detected 2 projections (with complexity 6 and 7, respectively). Otherwise in this example interactions were more often detected than modifications. Interestingly the nonlinear features reported by the parallel versions of DBRM consisted only of the following two interactions: (standard error of the area) (worst texture) reported 3 times by DBRM_G_PAR and 10 times by DBRM_R_PAR and (worst texture) (worst concave points) reported once by DBRM_G_PAR and 11 times by DBRM_R_PAR. While LBRM includes almost always all 30 variables in the model (in 100 simulation runs only 17 out of 3000 possible linear features had posterior probability smaller than 0.1), DBRM delivers more parsimonious models.
In Example 4.1 there is much more evidence for nonlinear structures. The nonlinear features with highest detection frequency over simulation runs in this example were always modifications. For DBRM_R_PAR there were 10 modifications of depth 2 which were detected in more than 25 simulation runs. For example sin() was reported 46 times and gauss() 41 times. The features atan() and tanh() were reported 41 times and 38 times, respectively, which provides strong evidence that a nonlinear transformation of is an important predictor. For DBRM_G_PAR the results are quite similar and the mentioned four modifications are also among the topranking nonlinear features. Although modifications were most important in terms of replicability over simulation runs, in this example DBRM also found many interactions and projections. From the 3204 nonlinear features reported by DBRM_G_PAR there were more than 998 which included one interaction, 116 with two interactions and even 3 features with three interactions. Furthermore there were 353 features including one projection, 12 features with two nested projections and even 3 features where three projections were nested. However, these highly complex features typically occurred only in one or two simulation runs. In spite of the really good performance of the parallel versions of DBMR it seems that even more parallel threads and longer chains might be necessary to get consistent results over simulation runs in this example.
4.2 Model inference
Examples 4.2 and 4.2 are based on data sets describing physical parameters of newly discovered exoplanets. The data was originally collected and continues to be updated by Hanno Rein at the Open Exoplanet Catalogue Github repository (Rein, 2016). The input covariates include planet and host star attributes, discovery methods, and dates of discovery. We use a subset of samples containing all planets with no missing values to rediscover two basic physical laws which involve some nonlinearities. We compare the performance of DBRM_G and DBRM_R when running different numbers of parallel threads. We restrict ourselves to DBRM here because to our best knowledge no other machine learning approaches can be used for the detection of sophisticated nonlinear relationships in closed form.
For both examples we utilize DBRM models with conditionally independent Gaussian observations:
(23)  
(24) 
We consider two different sets of nonlinear transformations, , , , , and , , , where we restrict the depth to and the maximum number of features in a model to . is an adaptation of the set of transformations used in the first three examples. Adding results in a model space which includes a closed form expression of Kepler’s 3rd law in Example 4.2. is a somewhat larger set where the last two functions are specifically motivated to facilitate generation of interesting features linking the mass and luminosity of stars (Kuiper, 1938; Salaris and Cassisi, 2005).
For the prior of the model structure (6) we choose giving a BIC like penalty for the model complexity. The parameter priors are specified as
(25)  
(26) 
where is the determinant of the corresponding Fisher information matrix. Hence (26) is Jeffrey’s prior for the coefficients. In this case, marginal likelihoods can be computed exactly.
The focus in these examples is on correctly identifying important features. Consequently we are using a threshold value of for the feature posteriors to define positive detections which is larger than the threshold used when reporting relevant features for prediction in the first three examples. To evaluate the performance of algorithms we report estimates for the power (Power), the false discovery rate (FDR), and the expected number of false positives (FP) based on simulation runs. These measures are defined as follows.
Here denotes the identification of in run of the algorithm and is the index of a true feature, which means a feature which is in accordance with the well known physical laws. For Kepler’s third law several features can be seen as equivalent true positives and consequently the definition of Power and FDR will be slightly modified.
Example 4: Jupiter mass of the planet
In this example we consider the planetary mass as a function of its radius and density. It is common in astronomy to use the measures of Jupiter as units and a basic physical law gives the nonlinear relation
(27) 
Here is the planetary mass measured in units of Jupiter mass (denoted PlanetaryMassJpt from now on). Similarly the radius of the planet is measured in units of Jupiter radius and the density of the planet is measured in units of Jupiter density. Hence in the data set the variable RadiusJpt refers to , and PlanetaryDensJpt denotes . The approximation sign is used because the planets are not exactly spherical but rather almost spherical ellipsoids.
DBRM according to (23)(24) is used to model PlanetaryMassJpt as a function of the following ten potential input variables: TypeFlag, RadiusJpt, PeriodDays, SemiMajorAxisAU, Eccentricity, HostStarMassSlrMass, HostStarRadiusSlrRad, HostStarMetallicity, HostStarTempK, PlanetaryDensJpt. In order to evaluate the capability of GMJMCMC and RGMJMCMC to detect true signals we run each algorithm times. To illustrate to which extent the performance of DBRM depends on the number of parallel runs we furthermore consider computations with 1, 4 and 16 threads, respectively. In each of the threads the algorithms were first run for 10 000 iterations, with population changes at every 250 iteration, and then for a larger number of iterations based on the last population (until a total number of 10 000 unique models was obtained). Results for GMJMCMC and RGMJMCMC using different numbers of threads are summarized in Table 5 both for and .
DBRM_G_PAR  DBRM_R_PAR  

NL set  Threads  Power  FP  FDR  Power  FP  FDR 
16  1.00  0.00  0.00  0.97  0.06  0.03  
4  0.79  0.40  0.21  0.61  0.73  0.39  
1  0.42  1.21  0.58  0.33  1.63  0.67  
16  0.93  0.36  0.215  0.94  0.29  0.175  
4  0.69  0.49  0.34  0.63  0.64  0.375  
1  0.42  1.25  0.58  0.29  1.54  0.71 
Clearly the more resources become available the better DBRM performs. RGMJMCMC and GMJMCMC both manage to find the correct model with rather large Power (reaching gradually one) and small FDR (reaching gradually zero), when the number of parallel threads is increased. When using only a single thread it often happens that instead of the correct feature some closely related features are selected (see the Excel sheet Mass.xlsx in the supplementary material for more details). Results for the set are slightly better than for which illustrates the importance of having a good set of transformations when interested in model inference. Power is lower and FDR is larger for which is mainly due to the presence of in the set of nonlinearities. The feature is quite similar to the correct law (27) and moreover has lower complexity than the feature . Hence it is not surprising that it is often selected, specifically when DBRM was not run sufficiently long to fully explore features with larger complexity.
Example 5: Kepler’s third law
In this example we want to model the semimajor axis of the orbit SemiMajorAxisAU as a function of the following 10 potential input variables: TypeFlag, RadiusJpt, PeriodDays, PlanetaryMassJpt, Eccentricity, HostStarMassSlrMass, HostStarRadiusSlrRad, HostStarMetallicity, HostStarTempK, PlanetaryDensJpt.
Kepler’s third law says that the square of the orbital period of a planet is directly proportional to the cube of the semimajor axis of its orbit. Mathematically this can be expressed as
(28) 
where is the gravitational constant, is the mass of the planet, is the mass of the corresponding hosting star and . Equation (28) can be reformulated as
Comments
There are no comments yet.