1 Introduction
Machine learning (ML) approaches can achieve outstanding performance in regression problems with respect to certain metrics, such as the mean squared error. However, they often cannot provide valid uncertainties for the distribution of the target variable (Song et al., 2019). To address this in regression problems, Kuleshov et al. (2018) and Song et al. (2019) proposed post-hoc corrections, where a pretrained probabilistic model is used as input to a second model, so that the result is a better calibrated distribution. Alternatively, quantile regression has been considered; however, quantile-level calibrated regressors do not ensure proper calibration for a particular prediction (unlike classification), an issue that can be overcome by distributional regression models
(Rigby & Stasinopoulos, 2005). The general idea here is to extend classical mean regression models, which relate the expectation $\mathbb{E}(Y \mid \boldsymbol{x})$ of a target variable $Y$ to input features $\boldsymbol{x}$, to models for an arbitrary parametric distribution $\mathcal{D}(\theta_1, \dots, \theta_K)$ with density $f_Y$, provided that $f_Y$ exists for all parameters $\theta_k$, $k = 1, \dots, K$. Since all parameters $\theta_k$ can be (non-)linear transformations of the inputs $\boldsymbol{x}$, one can avoid the pretraining and correction step of post-hoc approaches. In addition, this framework can easily account for skewed, count, mixed or multivariate outputs. A simple example is the univariate normal distribution, where both the mean $\mu$ and the variance $\sigma^2$ (or scale, $\sigma$) can be learned. On the downside, distributional regression models assume structured transformations of the inputs, not allowing for high-dimensional interactions of features or non-tabular inputs. To extend this approach to more complex structures of tabular inputs, recent proposals add a single-layer neural network (Umlauf et al., 2018) to a structured predictor in a regression model, or permit a regression predictor in the class of generalized linear models (GLMs) that is based on a deep neural network (Tran et al., 2019).

In this paper, we propose an orthogonalization cell that allows for the joint estimation of the structured model part and the unstructured deep neural network (DNN) predictor, combined as a direct sum (in the vector space sense) in a unifying network. This leads to a new network architecture enabling the estimation of interpretable linear or nonlinear effects of single features, while still allowing for higher-order interactions of those features, as well as the inclusion of non-standard inputs, such as images or text, through the DNN predictor. This approach exploits the advantages of a unified network structure similar to
(Cheng et al., 2016), who focus on the application to recommender systems and GLMs. We consider distributional learning, which includes GLMs as a special case. Additionally, our approach permits uncertainty assessment using Bayesian principles and, in particular, makes efforts to render additive predictors identifiable, which is crucial for the interpretation of the resulting models.

1.1 Main Features
Modularity:
Our holistic concept of estimating the regression model within a network permits the specification of model components such as the deep network structure, the optimizer or the inference procedure independently of the distribution that is assumed or mimicked by a corresponding loss function. Furthermore, by projecting the deep network predictor into the orthogonal complement of the structured predictor and projections therein, we allow for flexible separability assumptions and corresponding constraints.
Semi-structured predictor / multimodality: Our approach encompasses nonlinear relationships of features of any kind, as well as the possibility to account for non-tabular data inputs through the deep network part. As a consequence, we technically constitute a multimodal deep learning approach (Ngiam et al., 2011), while giving researchers the freedom to define a structured model corresponding to their hypotheses.
High-dimensional regression: In contrast to similar approaches, the given framework can easily handle settings with more features than observations.
Calibration of target variables:
The distributional framework considered in this paper ensures calibration of the target distribution through the rich class of potential distributions and the ability to have feature effects on all distribution parameters. Examples include skewed, (zero-inflated) count, mixed or multivariate outputs. In the latter case, fitting regression models as neural networks allows for both a joint distribution accounting for potential dependencies between outputs and any input/output association.
Improved inferential aspects: Incorporating structured additive effects into a neural network allows for more control over the influence of certain features, can improve predictive performance and makes hypothesis testing and other statistical inference procedures available in deep learning use cases. This is also in line with the results of our experiments, indicating that the inclusion of structured effects tends to be less prone to overfitting.
Automatic feature selection:
The framework has an inbuilt model selection mechanism and allows researchers to separate simple linear and nonlinear effects of features from the “black-box” part of the neural network, thereby also yielding an interpretable model in the spirit of interpretable machine learning (IML; Doshi-Velez & Kim, 2017).

Efficient implementation:
Estimating the model within a unifying neural network enables us to take advantage of fast and well-tested machine learning platforms. We use TensorFlow (TF) and the add-on module TensorFlow Probability (TFP), which allows extensions of the framework for our purposes and offers multiple options for statistical inference procedures (Abadi et al., 2016; Dillon et al., 2017).

2 Semi-Structured Deep Distributional Learning
The proposed architecture is a large network consisting of several smaller networks with different input layers, layers with trainable parameters and a common distribution layer, where each parameter of the distribution can be modeled using a combination of smaller networks. We call our approach semi-structured deep distributional learning (SDDL) and describe its components in the remainder of this section.
2.1 The Core Architecture in a Nutshell
An exemplary architecture of an orthogonalization cell (see Section 2.4 for details) is visualized in Figure 1. The cell processes the available feature vector, which is split into linear input features, structured nonlinear input features and features that are passed through a deep neural network. We call these three components structured linear inputs, structured nonlinear inputs and unstructured (nonlinear) inputs. Both structured input types are modeled as a single-unit hidden layer with linear activation functions and different regularization terms. The unstructured deep neural network model part(s) can be arbitrarily specified. Inputs to one or more DNNs are features that are assumed to have a nonlinear effect or to interact with each other in a complex way. A distinction is made between DNNs whose unstructured inputs are also part of the structured inputs and DNNs whose inputs solely appear in the unstructured predictor (illustrated by an XOR node in Figure 1). In the latter case, the DNN outputs are passed to the Deep input gate, directly summed up with the structured inputs and fed into the distributional layer. DNNs that also share inputs with the structured parts are connected to the Deep decomposed input gate, which applies an orthogonalization operation to the penultimate DNN layer before adding its outputs to the remaining predictors.
2.2 Output Model Structure
In contrast to conventional regression models, which only model the conditional mean of the outcome $Y$, distributional regression (DR) relates the features to potentially all parameters of the outcome distribution. By assuming a parametric distribution $\mathcal{D}(\theta_1, \dots, \theta_K)$, the goal of DR is to estimate the relationship between each of the distribution parameters $\theta_k$ and the given features $\boldsymbol{x}$ through monotonic and differentiable response functions $h_k$,
$$\theta_k = h_k(\eta_k(\boldsymbol{x})), \quad k = 1, \dots, K.$$
The predictors $\eta_k$ hence specify the relationship between the features $\boldsymbol{x}$ and the (transformed) parameters $\theta_k$, while $h_k$ ensures that possible parameter space restrictions on $\theta_k$ are fulfilled for each $k$.
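As a minimal numeric sketch of how a response function and the resulting likelihood-based loss interact, consider a univariate normal outcome: the identity maps one predictor to the mean, and an exponential response function keeps the scale positive. The specific choices and all values below are our illustrative assumptions, not a prescription from the paper.

```python
import math

# Hedged sketch: a univariate normal "distributional layer" by hand.
# eta_mu and eta_sigma are the two additive predictors for one observation.
def normal_nll(y, eta_mu, eta_sigma):
    mu = eta_mu                  # h_1 = identity (no restriction on the mean)
    sigma = math.exp(eta_sigma)  # h_2 = exp ensures sigma > 0
    # negative log-likelihood of y under N(mu, sigma^2)
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

loss = normal_nll(y=1.0, eta_mu=0.0, eta_sigma=0.0)  # sigma = exp(0) = 1
```

In a full model, this per-observation loss would be summed over the data and backpropagated through all predictor weights.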
In order to embed DR into a neural network, we use distributional layers, a neural network pendant to DR proposed by several authors, e.g., Dillon et al. (2017). The idea of distributional layers is simple and compelling. As in DR, all or a subset of the features are used to estimate each distribution parameter $\theta_k$ of a chosen parametric distribution $\mathcal{D}$. Instead of outputting the predicted mean or any other statistic derived from $\mathcal{D}$, the outcome $y$ is evaluated on the basis of the (probability) density function $f_Y(y \mid \theta_1, \dots, \theta_K)$ by calculating the negative log-likelihood $-\log f_Y(y \mid \theta_1, \dots, \theta_K)$ and using the result as loss function for backpropagation and updates of the weights $w$ and biases $b$.

2.3 Predictor Structure
DR typically assumes the structured predictors $\eta_k$, $k = 1, \dots, K$, to be an additive decomposition of a linear part and nonlinear functions. Our SDDL approach additionally allows for one or more deep neural network predictors, whose inputs can be (partially) shared by the networks and can be (partially) identical to the structured features:
$$\eta(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{w} + \sum_{j} f_j(x_j) + \sum_{l} d_l(\boldsymbol{z}_l),$$
with smooth functions $f_j$ and DNN predictors $d_l$ with inputs $\boldsymbol{z}_l$. Note that we have suppressed the index $k$ for the distributional parameter here to not overload the notation. The functions $f_j$ represent penalized smooth nonlinear effects of features using a basis representation, i.e., a function approximation by a linear combination of basis functions. A univariate nonlinear effect of feature $x_j$ is, e.g., approximated by $f_j(x_j) \approx \sum_{m=1}^{M_j} B_m(x_j)\, w_m$, where $B_m$ is the $m$th basis function (such as regression splines, polynomial bases or B-splines) evaluated at the observed feature $x_j$. This class is not limited to univariate effects. Tensor product representations allow for two-dimensional nonlinear interactions, but also nonlinear effects of higher-order interactions. Despite being nonlinear, the structured predictors are easily understandable due to the additive structure, but are often limited to no or only lower-order interactions to preserve the interpretability and tractability of the whole model.
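The basis representation above can be sketched in a few lines. We use a simple polynomial basis for illustration only (B-splines with knots would be used in practice); the basis degree and the weight values are our assumptions.

```python
import numpy as np

# Sketch of a structured nonlinear effect f_j(x) = sum_m B_m(x) w_m.
def poly_basis(x, degree=3):
    # columns: x, x^2, ..., x^degree, evaluated at the observed feature values
    return np.column_stack([x**d for d in range(1, degree + 1)])

x = np.linspace(0.0, 1.0, 5)
B = poly_basis(x)                # n x M basis matrix (here 5 x 3)
w = np.array([1.0, -0.5, 0.2])   # basis weights, learned in the structured layer
f_hat = B @ w                    # additive smooth effect evaluated at x

n_rows, n_cols = B.shape
f_at_one = float(f_hat[-1])      # 1 - 0.5 + 0.2 = 0.7 at x = 1
```

In the network, `B` is precomputed per smooth term and fed into a single-unit layer whose kernel plays the role of `w`.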
2.4 Identifiability
Identifiability is crucial when feature inputs overlap in the structured and unstructured model parts. If not constrained, the deep learning part can also capture linear effects of those features, making the attribution of an effect to either one of the model parts unidentifiable. The easiest example is the bias term: if multiple model components contain different biases, the level of the overall predictor stays the same when subtracting a constant from one bias and simultaneously adding it to another. We ensure identifiability by using an orthogonalization operation, also yielding centered nonlinear functions, to make the model’s overall bias term identifiable (alternative constraints are known to lead to higher uncertainty in subsequent statistical inference; Wood, 2017). Let $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ be the feature matrix of the structured linear network part with $n$ observations, and let $\boldsymbol{P} = \boldsymbol{X}(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top$ be the projection matrix for which $\boldsymbol{P}\boldsymbol{a}$ is the linear projection of any $\boldsymbol{a} \in \mathbb{R}^n$ into the space spanned by the linear features. To ensure the identifiability of the structured linear part with respect to the nonlinear parts of the DNN, the projection into the orthogonal complement, $\boldsymbol{P}^\perp = \boldsymbol{I}_n - \boldsymbol{P}$, is applied to the outputs $\tilde{\boldsymbol{U}}$ of the DNN’s second-last layer. This yields $\tilde{\boldsymbol{U}}^\perp = \boldsymbol{P}^\perp \tilde{\boldsymbol{U}}$. The result is then multiplied by the weights of the DNN’s output layer and added to the structured linear network output, yielding the final predictor that is fed into the distributional layer. If both structured linear and structured nonlinear parts are present, we first use this orthogonalization operation to ensure identifiability between the linear and nonlinear structured parts, combine them, and then apply the same operation to separate the whole structured predictor from the unstructured deep learning predictor.
A schematic architecture of such an operation is depicted as the orthogonalization cell in Figure 1, showing one of several possible architectures for combining two structured and one unstructured predictor. While the orthogonalization operation per default attributes as much effect to the structured model part as possible, a tuning parameter $\gamma \in [0, 1]$ can be introduced to control the amount of attribution to the predictors, e.g., by using $\boldsymbol{I}_n - \gamma \boldsymbol{P}$ instead of $\boldsymbol{P}^\perp$, where $\gamma = 1$ is the default and values smaller than 1 attribute more of the linear effect to the nonlinear unstructured part.
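The orthogonalization operation itself is a one-line projection. The sketch below uses random data (dimensions and seed are our illustrative assumptions) and checks that, after projecting into the orthogonal complement, the deep part can no longer absorb linear effects of the structured features.

```python
import numpy as np

# Sketch: project penultimate-layer DNN outputs into the orthogonal
# complement of the column space of the structured design matrix X.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # structured linear part
U_tilde = rng.normal(size=(n, 3))                      # penultimate DNN layer

P = X @ np.linalg.inv(X.T @ X) @ X.T                   # projection onto span(X)
U_perp = (np.eye(n) - P) @ U_tilde                     # orthogonalized outputs

# Every column of U_perp is now orthogonal to every column of X:
max_overlap = float(np.max(np.abs(X.T @ U_perp)))
```

In practice $\boldsymbol{P}$ depends only on the structured inputs, so it can be precomputed once and applied inside the network graph during training.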
3 Inferential Aspects
Having defined our network structure, we now turn to inferential aspects including penalization of structured nonlinear terms and corresponding prior assumptions.
3.1 Penalization and Priors
Neural networks are believed to exhibit an implicit regularization behaviour due to gradient-based optimization (see, e.g., Arora et al., 2019), which we were able to confirm in some of our simulation studies. Despite observing reasonably good nonlinear effect estimation in the structured additive parts of our network without additional penalization in some cases, we also observed rather coarse estimated nonlinear effects or even convergence difficulties when not penalizing structured nonlinear effects. We therefore allow for additional penalization of smooth structured effects using kernel regularization in the corresponding layers, with a tunable smoothing parameter $\lambda_j$ and an appropriate penalty matrix $\boldsymbol{K}_j$ for each smooth term $f_j$.
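A common concrete choice for such a penalty, used here purely for illustration, is a second-difference penalty on the basis weights, adding $\lambda_j \boldsymbol{w}^\top \boldsymbol{K}_j \boldsymbol{w}$ to the loss. The basis size, $\lambda$ and the weight vectors below are our assumptions.

```python
import numpy as np

# Sketch of kernel regularization for one smooth term via a
# second-difference penalty matrix K.
M = 6
D = np.diff(np.eye(M), n=2, axis=0)  # second-difference operator, (M-2) x M
K = D.T @ D                          # penalty matrix, rank M - 2

w_wiggly = np.sin(np.linspace(0, 3 * np.pi, M))  # oscillating weights
w_smooth = np.linspace(0.0, 1.0, M)              # linear in the basis index
lam = 2.0

pen_wiggly = float(lam * w_wiggly @ K @ w_wiggly)  # heavily penalized
pen_smooth = float(lam * w_smooth @ K @ w_smooth)  # ~0: linear trend is free
rank_K = int(np.linalg.matrix_rank(K))
```

The rank deficiency of `K` (here rank $M-2$) is exactly the null-space issue addressed for the Bayesian variant below: linear trends lie in the null space and are left unpenalized.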
Tuning a large number of smooth terms, potentially in combination with deep learning model parts, is a challenging task. We foster easy tuning by defining the smoothness of each effect in terms of its degrees of freedom (df; see, e.g., Buja et al., 1989) and implement an efficient calculation using the Demmler–Reinsch orthogonalization (DRO; cf. Ruppert et al., 2003). The latter can easily be parallelized or sped up using a randomized SVD matrix decomposition (Erichson et al., 2019). We thereby also allow for a meaningful default penalization by setting the df to the same value for each effect, and ensure enough flexibility by choosing the largest value possible among all smooth terms. Note that the maximum df for each term is given by the number of columns of the corresponding basis representation. Alternatively, in a Bayesian network version of our framework, smoothing can be enforced by prior distribution assumptions on the weights of the basis functions. We take advantage of the correspondence between Bayesian model estimation and smoothing splines by using $\lambda_j \boldsymbol{K}_j$ as precision matrix of a zero-mean normal prior. Depending on the specification of $\boldsymbol{K}_j$, it is not uncommon that $\boldsymbol{K}_j$ is rank deficient. To guarantee proper posteriors in such cases, we add proper priors for the null space of $\boldsymbol{K}_j$.

3.2 Tuning and Convergence
While optimization of the model can be done by minimizing the negative log-likelihood of the specified distribution, deep distributional regression models can easily under- or overfit the given data. In general, we recommend splitting the data and monitoring the validation loss to ensure the generalization ability of the model and, if necessary, to use early stopping. To restrict flexibility in the structured predictors, the df for each structured nonlinear term can additionally be reduced, or linear effects can be estimated using $L_1$ or $L_2$ penalties. Moving features from the DNN to one of the structured model parts can further regularize and stabilise the model. As the convergence of weights in the structured layers may be slow, we suggest monitoring both the test error and the weights over all epochs to check for either overfitting or non-convergence.
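The recommended monitoring can be sketched as a plain early-stopping rule on the validation loss; the patience value and the loss sequence are illustrative assumptions, not values from the paper.

```python
# Sketch: stop training once the validation loss has not improved
# for `patience` consecutive epochs, and restore the best epoch.
def early_stopping(val_losses, patience=2):
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch  # epoch whose weights would be restored

stop_at = early_stopping([1.0, 0.8, 0.7, 0.75, 0.9, 1.1])
```

In the TF implementation this corresponds to a standard early-stopping callback on a held-out validation split.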
3.3 Bayesian Posterior Estimation
Uncertainty quantification of the estimated model weights in DR, also known as epistemic uncertainty, can be accounted for through Bayesian inference paradigms. As mentioned in Section 3.1, this can be achieved by placing appropriate prior distributions on the model weights. However, the resulting posterior distributions are complex and not available in closed form. Different approaches, such as Markov chain Monte Carlo (MCMC), the integrated nested Laplace approximation for a restricted class of models (INLA; Rue et al., 2009) or variational inference (VI; Blei et al., 2017), have been popularized in the literature. The latter has the main advantage of scaling well to high dimensions. We briefly introduce the concept of VI as a basis for the estimation of the network structure proposed in Section 2.

The basic idea of VI is to approximate the augmented posterior $p(\boldsymbol{w} \mid \mathcal{D})$ with a variational density $q_{\boldsymbol{\zeta}}(\boldsymbol{w})$. Here, $\boldsymbol{w}$ denotes a collection of all trainable weights and biases, and $\boldsymbol{\zeta}$ is a vector of variational parameters chosen to minimize the Kullback–Leibler divergence between $q_{\boldsymbol{\zeta}}(\boldsymbol{w})$ and $p(\boldsymbol{w} \mid \mathcal{D})$. As shown in Ormerod & Wand (2010), this is equivalent to maximizing the variational lower bound with respect to $\boldsymbol{\zeta}$. VI can be incorporated into our framework using Bayes by Backprop (Blundell et al., 2015), which allows for different combinations of priors for the layer weights and variational posterior families with easy interchangeability.

In order to define a distribution over layer weights in the network, existing weight initialization arguments are replaced by distributions over the weights that may, in turn, carry trainable parameters. These so-called Bayesian layers (BL; Tran et al., 2019) constitute commonly used layers, such as convolutional layers, and replace their deterministic counterparts at freely selectable positions in the network. Per default, our framework changes from a purely frequentist to a Bayesian approach by defining all layers fed into the distributional layer as BL, in other words, by defining priors for the linear predictor of each parameter.
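The core mechanic of Bayes by Backprop can be sketched with a mean-field normal posterior per weight and the reparameterization trick; the softplus parameterization is the standard choice in Blundell et al. (2015), while the particular `mu`/`rho` values and the seed are our illustrative assumptions.

```python
import numpy as np

# Sketch: sample a weight from the variational posterior
# q(w) = N(mu, softplus(rho)^2) via w = mu + sigma * eps.
rng = np.random.default_rng(1)

mu = np.array([0.5, -1.0])            # variational means (trainable)
rho = np.array([-3.0, -3.0])          # unconstrained scale parameters (trainable)
sigma = np.log1p(np.exp(rho))         # softplus keeps sigma positive

eps = rng.standard_normal(2)          # parameter-free noise
w_sample = mu + sigma * eps           # differentiable in (mu, rho)

sigma_max = float(sigma.max())
max_dev = float(np.max(np.abs(w_sample - mu)))
```

Because the noise is separated from the variational parameters, gradients of the (stochastic) ELBO flow through `mu` and `rho` as usual during backpropagation.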
3.4 Model Evaluation and Selection
When the goal is to properly predict a distribution of interest, results can be compared on the basis of proper scoring rules (Klein et al., 2015), such as the log-score. These also allow for an objective comparison and selection among different distributions. An equally important aspect of model selection is the choice of features to include in the structured terms of the model’s parameters. In contrast to other regression frameworks, where in particular for high-dimensional settings with more features than observations no one-size-fits-all approach is available, neural networks are capable of learning more effects than there are data points. Combined with the orthogonalization from Section 2.4, our framework thus seamlessly allows for model selection when including all features in all types of predictor parts (i.e., when the unstructured inputs comprise all structured features). This can be seen as a variance decomposition approach, where the model separates the linear and nonlinear effects (with no interaction) from the remaining deep unstructured effects.
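Comparison via the log-score amounts to computing the mean negative log-likelihood of held-out outcomes under each candidate predictive distribution; the normal predictive distributions and all values below are our illustrative assumptions.

```python
import math

# Sketch: mean negative log-score of test outcomes under a predictive
# normal distribution (lower is better for this proper scoring rule).
def neg_log_score(y, mu, sigma):
    return sum(
        0.5 * math.log(2 * math.pi * sigma**2) + (yi - mu) ** 2 / (2 * sigma**2)
        for yi in y
    ) / len(y)

y_test = [0.1, -0.2, 0.05, 0.3]
score_good = neg_log_score(y_test, mu=0.0, sigma=0.25)  # well-calibrated scale
score_bad = neg_log_score(y_test, mu=0.0, sigma=5.0)    # far too diffuse
```

The better-calibrated predictive distribution attains the lower (negative) log-score, which is how the candidate distributions in Tables 1 and 4 are ranked.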
4 Experiments on Synthetic Data
We conduct three different simulation specifications with synthetic data to assess the performance of our proposed framework with respect to: a) the goodness-of-fit in terms of the estimation of regression coefficients and structured functional relationships, to demonstrate the framework’s capability to properly estimate structured model terms, and b) the prediction performance, measured by the predictive log-scores evaluated on the test data set. The first experiment demonstrates the orthogonalization property of the proposed framework, whereas the second simulation compares the estimation results from semi-structured deep distributional learning to various software packages with a particular focus on additive distributional regression. A third experiment investigates estimation performance and epistemic uncertainty quantification of SDDL using different network predictors and mean-field VI via BL.
4.1 Orthogonalization and Variance Decomposition
Here we mimic a situation where the practitioner’s interest explicitly lies in decomposing certain feature effects into structured linear, structured nonlinear and unstructured nonlinear parts, for instance for reasons of interpretability. We simulate 20 data sets with features drawn from a uniform distribution. For the outcome we consider normally distributed responses with three different variance settings as well as Bernoulli and Poisson distributed responses. The predictor contains each feature as a linear effect, a nonlinear effect and an interaction with the other features. Our framework explicitly models the true linear and nonlinear terms by separating both effects via orthogonalization and uses a fully connected DNN with two hidden layers (32 and 16 hidden units, ReLU activation) to account for the interaction. By projecting the output of the second-last layer into the orthogonal complement of the structured predictors, we ensure identifiability of the linear and nonlinear effects. We do
not tune the model but use the DRO approach described in Section 3.1 and a fixed number of epochs.

Results: Figure 2
visualizes the estimated and true nonlinear relationships between selected features and the outcome. The root mean squared errors (RMSE) for the linear effects, whose effect sizes range from 0.2 to 2, are 0.0145, 0.1725 and 1.2734 for the three normal settings, 0.0394 for the Poisson distribution and 0.6904 for the Bernoulli distribution, which is reasonably small relative to the effect sizes. Overall, the resulting estimates in Figure 2 demonstrate the capability of the framework to recover the partial nonlinear effects; only the simulation using the normal distribution with the largest variance, which amounts to a maximum signal-to-noise ratio (predictor variance divided by noise variance) of 4%, shows an overfitting behaviour.
4.2 Model Comparison
Table 1: Mean (standard deviation in brackets) negative log-scores and RMSE of structured effect estimation; “–” indicates no result due to convergence problems.

| Metric | Distribution | n | p | bamlss | SDDL | gamlss | MBB |
|---|---|---|---|---|---|---|---|
| Neg. log-scores | Normal | 300 | 10 | 1.51 (0.68) | 1.85 (0.55) | 7.08 (8.07) | 4.20 (0.47) |
| | | 300 | 75 | – | 2.84 (0.94) | – | 10.4 (0.95) |
| | | 2500 | 10 | 0.55 (0.06) | 0.96 (0.18) | 0.57 (0.08) | 3.71 (0.24) |
| | | 2500 | 75 | 0.64 (0.06) | 1.11 (0.12) | 0.69 (0.07) | 8.85 (0.51) |
| | Gamma | 300 | 10 | 1.15 (0.10) | 1.32 (0.31) | 1.04 (0.09) | 1.13 (0.11) |
| | | 300 | 75 | 1.50 (0.36) | 2.34 (0.87) | 2.05 (0.99) | 1.56 (0.15) |
| | | 2500 | 10 | 1.01 (0.02) | 0.93 (0.03) | 0.83 (0.02) | 0.96 (0.03) |
| | | 2500 | 75 | 1.01 (0.03) | 1.24 (0.05) | 0.84 (0.03) | 1.01 (0.04) |
| | Logistic | 300 | 10 | 1.44 (0.14) | 1.75 (0.12) | 4.38 (2.84) | 3.28 (0.37) |
| | | 300 | 75 | 2.55 (0.48) | 2.22 (0.18) | 124 (104) | 4.89 (0.26) |
| | | 2500 | 10 | 1.7 (0.04) | 1.15 (0.06) | 1.15 (0.04) | – |
| | | 2500 | 75 | 2.47 (0.36) | 1.16 (0.08) | 1.23 (0.05) | 4.71 (0.1) |
| RMSE | Normal | 300 | 10 | 0.89 (0.61) | 0.37 (0.28) | 0.59 (0.15) | 0.84 (0.22) |
| | | 300 | 75 | 1.05 (1.14) | 0.47 (0.42) | 0.67 (0.26) | 1.29 (0.86) |
| | | 2500 | 10 | 0.50 (0.71) | 0.22 (0.22) | 0.25 (0.34) | 0.66 (0.35) |
| | | 2500 | 75 | 0.48 (0.70) | 0.19 (0.22) | 0.24 (0.32) | 1.14 (0.64) |
| | Gamma | 300 | 10 | 0.13 (0.06) | 0.14 (0.04) | 0.08 (0.02) | 0.11 (0.04) |
| | | 300 | 75 | 0.15 (0.06) | 0.18 (0.07) | 0.12 (0.05) | 0.14 (0.05) |
| | | 2500 | 10 | 0.19 (0.18) | 0.10 (0.04) | 0.04 (0.03) | 0.10 (0.08) |
| | | 2500 | 75 | 0.19 (0.20) | 0.07 (0.03) | 0.04 (0.02) | 0.11 (0.07) |
| | Logistic | 300 | 10 | 1.69 (0.98) | 0.28 (0.14) | 0.61 (0.18) | 0.97 (0.36) |
| | | 300 | 75 | 1.82 (1.21) | 0.26 (0.15) | 0.66 (0.44) | 1.30 (0.84) |
| | | 2500 | 10 | 2.20 (0.82) | 0.18 (0.15) | 0.24 (0.32) | 0.73 (0.41) |
| | | 2500 | 75 | 3.52 (0.97) | 0.13 (0.13) | 0.25 (0.32) | 1.19 (0.66) |
In this simulation we compare the estimation and prediction performance of our framework with a likelihood-based optimization (gamlss), a Bayesian optimization (bamlss) and a model-based boosting (MBB) routine. These frameworks for distributional regression are available in the software packages gamlss (Rigby & Stasinopoulos, 2005), gamboostLSS (Mayr et al., 2012) and bamlss (Umlauf et al., 2018). For three different distributions (normal, gamma, logistic), we investigate combinations of different numbers of observations ($n \in \{300, 2500\}$) and different numbers of linear feature effects in the location parameter ($p \in \{10, 75\}$), plus 2 linear effects in the scale parameter. In addition to the linear feature effects, we add 10 nonlinear effects for the location and 2 nonlinear effects for the scale.
Results: Table 1 summarizes the mean log-scores on test data points and the mean RMSE, measuring the average deviation between true and estimated structured effects. We observe that our approach (SDDL) often yields the best or second-best estimation and prediction performance, while being robust to more extreme scenarios in which the number of observations is small in comparison to the number of feature effects. In this situation, other approaches tend to suffer from convergence problems.
4.3 Epistemic Uncertainty
The final simulation has two main purposes. First, we want to compare the estimation performance of our framework when specified with and without BL using the MSE. Second, we investigate 90% credible intervals for smooth terms obtained with BL to assess the validity of inference statements drawn from the estimated variational posterior distribution. Specifically, we report the pointwise coverage rates and their power, measured by the pointwise coverage at zero. We use a homoscedastic normal distribution and include either 1, 3 or 5 structured nonlinear predictors that are penalized using the df-based approach from Section 3.1. We also investigate the model fits when using an additional one-hidden-layer neural network with 16 hidden units and/or a Bayesian version of the network.

Results: The results suggest that neither the number of nonlinear functions nor the inclusion of a DNN has a notable influence on the performance. By varying the degrees of freedom of the smooths in further experiments, we found that an appropriate penalization is essential both for proper estimation and for the properties of the resulting posterior credible intervals. Table 2 compares the mean squared deviation between estimated and true curves as well as the coverage and power of the corresponding intervals. Simulations with variational layers show slightly better estimation performance for the posterior mean curves in comparison to the non-Bayesian smooth estimation. Intervals are too conservative for the first two curves but lack coverage for the other smooth functions. Power investigations yield diverse results, but as for coverage, values show better performance for curves for which the chosen df induces an appropriate smoothing penalty.
Table 2: MSE of smooth term estimation without and with Bayesian layers (BL), together with coverage and power of the 90% credible intervals (standard deviations in brackets).

| Smooth term | MSE without BL | MSE with BL | Coverage | Power |
|---|---|---|---|---|
| 1 | 0.01 (0.00) | 0.01 (0.00) | 1.00 (0.00) | 0.83 (0.03) |
| 2 | 0.03 (0.00) | 0.03 (0.00) | 0.95 (0.03) | 0.07 (0.19) |
| 3 | 0.09 (0.00) | 0.05 (0.01) | 0.64 (0.06) | 0.12 (0.09) |
| 4 | 0.08 (0.00) | 0.05 (0.01) | 0.70 (0.09) | 0.91 (0.01) |
| 5 | 0.09 (0.00) | 0.09 (0.00) | 0.53 (0.03) | 0.00 (0.00) |
5 Benchmarks and RealWorld Data Sets
In addition to the experiments on synthetic data, we provide a number of benchmarks on several real-world data sets in this section. If not stated otherwise, we report measures as averages and standard deviations (in brackets) over 20 random network initializations. Moreover, the hidden and output layers of the DNNs in the SDDL approach use ReLU and linear activation functions per default, respectively.
5.1 Deep Generalized Mixed Models
Tran et al. (2019) use a panel data set with 595 individuals and 4165 observations from Cornwell & Rupert (1988) as an example for fitting deep mixed models that account for within-subject correlation. Performance is measured in terms of within-subject predictions of the log of wage for future time points. We follow their analysis by training the model on the earlier years and predicting the remaining years. We use a normal distribution with constant variance and model the mean with the same neural network predictor as Tran et al. (2019), but exclude the subject ID from the DNN and instead use it as a structured random effect for each individual, i.e., the predictor combines 12 features also used in Tran et al. (2019), independent and identically distributed random effects, and two different DNNs.

Results: Our approach yields an average MSE that is competitive to the approach of Tran et al. (2019).
5.2 Deep Distributional Models
The first application illustrates the distributional aspect of our framework. Following Rodrigues & Pereira (2018), we consider the motorcycle data set from Silverman (1985). In contrast to Rodrigues & Pereira (2018), who present a framework for jointly predicting quantiles, our approach models the entire distribution, including a prediction for all quantiles, in one single model.

Results: As we model the distribution itself and not the quantiles explicitly, our approach does not suffer from quantile crossings. For the quantiles considered, the approach by Rodrigues & Pereira (2018) yields an RMSE of 0.526 (0.003) with an average of 3.050 (0.999) quantile crossings over all test data points. Our approach, with a DNN for the mean and a linear time effect for the distribution’s scale, in contrast does not suffer from quantile crossings and yields an RMSE of 0.536 (0.016). A visualization of the fitted mean with selected quantiles is depicted in Figure 3.
5.3 HighDimensional Classification
Ong et al. (2018) aim at predicting various forms of cancer in high-dimensional gene expression data sets (Colon, Leukaemia and Breast cancer). The authors conduct a Bayesian approach with horseshoe priors and VI, named VAFC in the following. We instead use the SDDL approach and combine a linear model with a small DNN (one or two hidden layers and up to 16 hidden units), using the proposed orthogonalization operation for the DNN, and apply this model to all three data sets with training sample sizes of 42, 38 and 38 and test set sizes of 20, 34 and 4, respectively. The number of features is $p = 2000$ for the Colon data and $p = 7129$ for the Leukaemia and Breast data sets. As an additional comparison, we fit a standard DNN with the same architecture as for the SDDL approach but without additional structured predictors, using a sigmoid activation function and a binary cross-entropy loss.
Results: Table 3 compares the average (standard deviation) of the area under the receiver operating characteristic curve (AUROC). While all approaches yield an AUROC of one on the first data set, our SDDL approach outperforms the VAFC and the standard DNN approach on the other two data sets.
Table 3: Average (standard deviation) AUROC on the three cancer data sets (one per row).

| SDDL | DNN | VAFC (4) | VAFC (20) |
|---|---|---|---|
| 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) |
| 0.98 (0.02) | 0.82 (0.23) | 0.91 (0.06) | 0.90 (0.07) |
| 1.00 (0.00) | 1.00 (0.00) | 0.95 (0.10) | 0.84 (0.12) |
5.4 Deep Calibrated Regression
Finally, we use four regression data sets (Diabetes, Boston, Airfoil, Forest Fire) analyzed by Song et al. (2019) to benchmark our SDDL against two post-hoc calibration methods, isotonic regression (IR; Kuleshov et al., 2018) and the GP-Beta model (GPB; Song et al., 2019) with 16 inducing points. The uncalibrated model for the latter two is a Gaussian process regression (GPR), which performed better than ordinary least squares and standard neural networks in Song et al. (2019). The SDDL assumes a normal distribution for the output, with semi-structured predictors for both mean and scale. Note that further fine-tuning could be done for SDDL in terms of the output distribution (e.g., house prices are positive and may be skewed rather than symmetric). Here, we only tune the specific structure of the predictors, which consist of structured linear and nonlinear effects as well as a DNN for all features, with tanh activation functions in the hidden layer(s) and different numbers of hidden units. The exact predictor specification differs per data set; for instance, for the Airfoil data only three of the five available numerical features enter the smooth terms, and for the Forest Fire data an index for each month is included. We split the data into 75% for training and 25% for model evaluation, measured by negative log-scores.

Results: Table 4 suggests that, compared to other calibration techniques, our method yields more stable results, while allowing the inclusion of structured predictors for features of interest. Even though we did not fine-tune the output distribution, we perform as well as the benchmarks in terms of average negative log-scores.
Table 4. Average negative log-scores on the four calibration benchmark datasets.

            SDDL         GPR          IR           GPB
Diabetes    5.33 (0.00)  5.35 (5.76)  5.71 (2.97)  5.33 (6.24)
Boston      3.07 (0.11)  2.79 (2.05)  3.36 (5.19)  2.70 (1.91)
Airfoil     3.11 (0.02)  3.17 (6.82)  3.29 (1.86)  3.21 (4.70)
Forest F.   1.75 (0.01)  1.75 (7.09)  1.00 (1.94)  2.07 (9.25)
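The post-hoc baselines work differently from the SDDL: isotonic regression (Kuleshov et al., 2018) learns a monotone map from the base model's predicted CDF values to their empirical frequencies, so that recalibrated probabilities match observed coverage. The following is a hedged sketch of that idea, not the benchmark implementation: it uses scikit-learn's IsotonicRegression on synthetic data from a deliberately overconfident Gaussian model.

```python
import numpy as np
from scipy import stats
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Miscalibrated base model: the true scale is 2, the model assumes 1,
# so its predictive intervals are overconfident (hypothetical setup).
n = 500
y = rng.normal(loc=0.0, scale=2.0, size=n)
mu_hat = np.zeros(n)
sigma_hat = np.ones(n)

# Predicted CDF values (PIT values) of the observed outcomes.
pit = stats.norm.cdf(y, loc=mu_hat, scale=sigma_hat)

# Empirical CDF of the PIT values: the target of the recalibration map.
order = np.argsort(pit)
empirical = np.arange(1, n + 1) / n

# Fit the monotone recalibration map R, so that R(F(y)) is ~uniform.
recal = IsotonicRegression(out_of_bounds="clip")
recal.fit(pit[order], empirical)

# For an overconfident model, the nominal 90% level maps to a level
# well below 0.9, exposing the miscalibration.
level = recal.predict([0.9])[0]
```

In contrast, the SDDL learns the full predictive distribution end-to-end, which is why no separate correction step is needed.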
6 Conclusion
We develop a flexible architecture that combines recent advances from statistics and machine learning. By embedding different structured additive predictors into a neural network architecture while ensuring identifiability, we enable the estimation of common use cases of distributional regression, with the option of a deep learning predictor that can account for unstructured or high-dimensional data. We make use of flexible as well as scalable deep learning platforms by transferring the fitting problem to a holistic deep learning model. Simulation, benchmark, and application studies demonstrate the generality of our proposed approach. It will be of interest to investigate improvements for uncertainty quantification in Bayesian layers for neural networks (Yao et al., 2019) and to elaborate further on inference concepts (Buchholz et al., 2018).

References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pp. 7411–7422, 2019.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. Proceedings of the 32nd International Conference on Machine Learning, 37:1613–1622, 2015.
Buchholz, A., Wenzel, F., and Mandt, S. Quasi-Monte Carlo variational inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 668–677, 2018.
Buja, A., Hastie, T., and Tibshirani, R. Linear smoothers and additive models. The Annals of Statistics, 17(2):453–510, 1989. doi: 10.1214/aos/1176347115.
Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10. ACM, 2016.
Cornwell, C. and Rupert, P. Efficient estimation with panel data: An empirical comparison of instrumental variables estimators. Journal of Applied Econometrics, 3(2):149–155, 1988.
Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.
Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
Erichson, N. B., Voronin, S., Brunton, S. L., and Kutz, J. N. Randomized matrix decompositions using R. Journal of Statistical Software, 89(11), 2019.
Klein, N., Kneib, T., Lang, S., and Sohn, A. Bayesian structured additive distributional regression with an application to regional income inequality in Germany. Annals of Applied Statistics, 9(2):1024–1052, 2015.
Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 2796–2804, 2018.
Mayr, A., Fenske, N., Hofner, B., Kneib, T., and Schmid, M. Generalized additive models for location, scale and shape for high-dimensional data: a flexible approach based on boosting. Journal of the Royal Statistical Society, Series C (Applied Statistics), 61(3):403–427, 2012.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 689–696, 2011.
Ong, V. M.-H., Nott, D. J., and Smith, M. S. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465–478, 2018.
Ormerod, J. T. and Wand, M. P. Explaining variational approximations. The American Statistician, 64(2):140–153, 2010.
Rigby, R. A. and Stasinopoulos, D. M. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507–554, 2005.
Rodrigues, F. and Pereira, F. C. Beyond expectation: Deep joint mean and quantile regression for spatiotemporal problems. arXiv preprint arXiv:1808.08798, 2018.
Rue, H., Martino, S., and Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.
Ruppert, D., Wand, M. P., and Carroll, R. J. Semiparametric Regression. Cambridge University Press, Cambridge and New York, 2003.
Silverman, B. W. Some aspects of the spline smoothing approach to nonparametric regression curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), 47(1):1–21, 1985.
Song, H., Diethe, T., Kull, M., and Flach, P. Distribution calibration for regression. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5897–5906. PMLR, 2019.
Tran, D., Dusenberry, M. W., van der Wilk, M., and Hafner, D. Bayesian layers: A module for neural network uncertainty. In 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019.
Tran, M.-N., Nguyen, N., Nott, D., and Kohn, R. Bayesian deep net GLM and GLMM. Journal of Computational and Graphical Statistics, 2019.
Umlauf, N., Klein, N., and Zeileis, A. bamlss: Bayesian additive models for location, scale, and shape (and beyond). Journal of Computational and Graphical Statistics, 27(3):612–627, 2018. doi: 10.1080/10618600.2017.1407325.
Wood, S. N. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2017.
Yao, J., Pan, W., Ghosh, S., and Doshi-Velez, F. Quality of uncertainty quantification for Bayesian neural network inference. ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning, 2019.