Deep learning (DL) models are renowned for outperforming traditional methods in perceptual tasks like image or sound classification [Lecun2015]. For classification tasks the DL model is usually treated as a probabilistic model, where the output nodes (after softmax) are interpreted as the probability for the different classes. The networks are trained by minimizing the negative log-likelihood (NLL). DL is also successfully used in regression tasks with complex data. However, regression DL models are often not used in a probabilistic setting and yield only a single point estimate for the expected value of the outcome.
Many real-world applications are not deterministic, in the sense that a whole range of outcome values is possible for a given input. In this setting, we need a model that predicts not only a point estimate for the outcome conditioned on a given input, but a whole conditional probability distribution (CPD).
In many regression applications a Gaussian CPD is assumed, often with the additional assumption that the variance of the Gaussian does not depend on the input $x$ (homoscedasticity). In the Gaussian homoscedastic case only one node in the last layer is needed, which is interpreted as the conditional mean $\hat{\mu}_x$ of the CPD $N(\hat{\mu}_x, \sigma^2)$. In this case the NLL is, up to an additive constant, proportional to the mean squared error $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\mu}_{x_i})^2$, which is one of the best-known loss functions in regression. Minimizing this loss function yields the maximum likelihood estimate of the parameter $\mu$ of the Gaussian CPD. The parameter $\sigma^2$ of the CPD is estimated from the resulting residuals $r_i = y_i - \hat{\mu}_{x_i}$, an approach that only works in the homoscedastic case. In the case of non-constant variance, a second output node is needed that controls the variance.
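The relation between the homoscedastic Gaussian NLL and the mean squared error can be checked numerically. The following is a minimal sketch in plain Python; the observations, predicted means, and the value of `sigma` are hypothetical illustration values, not data from this paper.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Summed NLL of observations y under N(mu_i, sigma^2) with constant sigma."""
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + (yi - mi) ** 2 / (2.0 * sigma ** 2) for yi, mi in zip(y, mu))

def mse(y, mu):
    """Mean squared error between observations and predicted conditional means."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)

# Hypothetical observations and predicted conditional means (illustration only).
y = [1.2, 0.7, 2.9, 3.1]
mu = [1.0, 1.0, 3.0, 3.0]
sigma = 0.5
n = len(y)

nll = gaussian_nll(y, mu, sigma)

# The same quantity written as "additive constant + positive factor * MSE".
nll_from_mse = (n * 0.5 * math.log(2.0 * math.pi * sigma ** 2)
                + n * mse(y, mu) / (2.0 * sigma ** 2))

# Maximum likelihood estimate of the constant variance from the residuals.
sigma2_hat = mse(y, mu)
```

Because the additive constant and the factor do not depend on the predicted means, minimizing the NLL over the means is equivalent to minimizing the MSE.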
The likelihood approach generalizes to other types of CPDs, especially if the CPD is known to be a member of a parameterized distribution family with parameters conditioned on the input $x$.
II Related Work
A simple parameterized distribution like the Gaussian is sometimes not flexible enough to model the conditional distribution of the outcome in real-world applications. A straightforward approach to modeling more flexible CPDs is to use a mixture of simple distributions. Mixtures of Gaussians, where the mixing coefficients and the parameters of the individual Gaussians are controlled by a neural network, are known as mixture density networks and have been used for a long time [Bishop1994].
Another way to achieve more complex distributions is to average over many different CPDs, where each CPD is a simple distribution. This approach is taken by ensemble models [NIPS2017_7219] or, alternatively, Bayesian models. A Bayesian deep learning treatment of regression is provided by MC Dropout [Gal2015, Gal2017b].
Transformation models take a different approach to modeling flexible distributions. Here, a parameterized bijective transformation function is learned that transforms between the flexible target distribution of the outcome $y$ and a simple base distribution of a variable $z$, such as a standard Gaussian. The likelihood of the data can be easily determined from the base distribution after transformation using the change of variable formula. This allows the parameters of the transformation to be determined by maximum likelihood estimation.
II-A Normalizing flows
On the deep learning side, normalizing flows (NF) have recently been used successfully to model complex probability distributions of high-dimensional data [Rezende2015]; see [Papamakarios2019] for an extensive review. A typical NF transforms a simple base distribution with density $p_z(z)$ (often a Gaussian) into a more complex target distribution $p_y(y)$. This is implemented by means of a parameterized transformation function $h_\theta$.
An NF usually consists of a chain of simple transformations interleaved with non-linear activation functions. To achieve a bijective overall transformation, it is sufficient that each transformation in the chain is bijective. Bijectivity is usually enforced by choosing a strictly monotonically increasing transformation. A simple but efficient and frequently used transformation consists of a scale term $a$ and a shift term $b$:

$$z_2 = h(z_1) = a \cdot z_1 - b \tag{1}$$

To enforce the required monotonicity of the transformation, $a$ needs to be positive. Note that the density in $z_2$ after transformation is determined by the change of variable formula:

$$p_{z_2}(z_2) = p_{z_1}\big(h^{-1}(z_2)\big) \cdot \left|\frac{\partial h^{-1}(z_2)}{\partial z_2}\right| \tag{2}$$

The Jacobian determinant $\left|\partial h^{-1}(z_2) / \partial z_2\right|$ ensures that $p_{z_2}$ is again normalized. The parameters in the transformation function, here $a$ and $b$, are controlled by the output of a neural network carefully designed so that the Jacobian determinant is easy to calculate [Papamakarios2019]. To ensure that $a$ is positive, the output of a network is transformed, e.g. by the exponential function. By chaining several such transformations with non-linear activation functions in between, a target distribution can be obtained that differs from the base distribution not only in mean and spread, but also in shape.
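A single scale-and-shift flow and the associated change of variable formula can be sketched in a few lines. This is an illustrative example under stated assumptions: the base distribution is a standard normal, the scale is made positive with an exponential, the shift is subtracted, and the "network outputs" are fixed hypothetical numbers rather than the result of a trained network.

```python
import math

def standard_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def flow_forward(z1, a, b):
    """Scale-and-shift flow z2 = a * z1 - b; a > 0 makes it strictly increasing."""
    return a * z1 - b

def flow_inverse(z2, a, b):
    return (z2 + b) / a

def transformed_density(z2, a, b):
    """Density of z2 by change of variables: p(h^{-1}(z2)) * |d h^{-1}(z2) / d z2|."""
    return standard_normal_pdf(flow_inverse(z2, a, b)) * (1.0 / a)

# Hypothetical unconstrained network outputs (illustration only).
raw_scale, raw_shift = 0.3, -1.5
a = math.exp(raw_scale)  # the exponential guarantees a positive scale
b = raw_shift

# Riemann sum on a wide grid: the transformed density should still integrate to about 1.
step = 0.001
n_steps = 30000
grid = [-15.0 + i * step for i in range(n_steps + 1)]
total_mass = sum(transformed_density(z, a, b) for z in grid) * step
```

Dropping the factor `1.0 / a` in `transformed_density` would leave a function that no longer integrates to one, which is exactly what the Jacobian term prevents.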
NFs are used very successfully in modeling high-dimensional unconditional distributions. They can be used for data generation, like creating photorealistic samples of human faces [Kingma2018]. However, little research has been done so far on using NFs in a regression-like setting, where the distribution of a low-dimensional variable $y$ is modeled conditioned on a possibly high-dimensional variable $x$ [trippe2018], [Rothfuss]. In the very recent approach of [Rothfuss], a neural network controls the parameters of a chain of normalizing flows, consisting of scale and shift terms combined with additional simple radial flows.
II-B Most Likely Transformation
On the other side, transformation models have a long tradition in classical statistics, dating back to the 1960s, starting with the seminal work of Box and Cox on transformation models [Box1964]. Recently this approach has been generalized to a very flexible class of transformations called most likely transformation (MLT) [Hothorn2018]. The MLT models a complex distribution $p_y(y|x)$ of a one-dimensional random variable $y$ conditioned on a possibly high-dimensional variable $x$. While multivariate extensions of the framework exist [klein2019multivariate], we restrict our treatment to the practically most important case of a univariate response $y$. As in normalizing flows, the fitting is done by learning a parameterized bijective transformation $h_\vartheta(y|x)$ of $y$ to a simple distribution $p_z(z)$, such as a standard Gaussian $z \sim N(0,1)$. The parameters $\vartheta$ are determined by applying the maximum likelihood principle to the training data. For continuous outcome distributions, the change of variable theorem allows computing the probability density $p_y(y|x)$ of the complicated target distribution from the simple base distribution $p_z(z)$ via:

$$p_y(y|x) = p_z\big(h_\vartheta(y|x)\big) \cdot \left|h'_\vartheta(y|x)\right| \tag{3}$$

Here $h'_\vartheta(y|x)$ denotes the derivative of the transformation with respect to $y$.
As in the NF, the factor $\left|h'_\vartheta(y|x)\right|$ ensures the normalization of $p_y(y|x)$. The key insight of the MLT method is to use a flexible class of polynomials, e.g. the Bernstein polynomials, to approximate the transformation function

$$h^{\mathrm{MLT}}_\vartheta(\tilde{y}|x) = \frac{1}{M+1}\sum_{i=0}^{M} \mathrm{Be}_i(\tilde{y})\,\vartheta_i(x) \tag{4}$$

where $\tilde{y} \in [0,1]$. The core of this transformation is the Bernstein basis of order $M$, generated by the Beta densities $\mathrm{Be}_i(\tilde{y}) = f_{i+1,\,M-i+1}(\tilde{y})$, where $f_{\alpha,\beta}$ denotes the density of a Beta distribution with parameters $\alpha$ and $\beta$. It is known that the Bernstein polynomials are very flexible basis functions and uniformly approximate every continuous function on $[0,1]$; see [Hothorn2018] for a further discussion. The second reason for the choice of the Bernstein basis is that, in order for $h^{\mathrm{MLT}}_\vartheta$ to be bijective, a strictly monotone transformation of $\tilde{y}$ is required. A strict increase of a polynomial in the Bernstein basis can be guaranteed by simply enforcing that the coefficients of the polynomial are increasing, i.e. $\vartheta_0 < \vartheta_1 < \dots < \vartheta_M$. It should be noted that nonparametric methods for the transformation function dominate the literature, and smooth low-dimensional spline approaches have only recently received more attention; see [Tian_Hothorn_Li_2020] for a comparison of the two approaches.
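The monotonicity argument for the Bernstein basis can be illustrated with a short sketch. It evaluates a Bernstein polynomial with increasing coefficients on a grid and checks that the result is strictly increasing. The coefficient values are hypothetical, and the basis is written directly as binomial terms, which equals the Beta density with parameters $i+1$ and $M-i+1$ divided by $M+1$.

```python
import math

def bernstein_basis(i, M, y):
    """Bernstein basis polynomial C(M, i) * y^i * (1 - y)^(M - i).

    This equals the Beta(i + 1, M - i + 1) density divided by (M + 1).
    """
    return math.comb(M, i) * y ** i * (1.0 - y) ** (M - i)

def bernstein_transform(y, theta):
    """h(y) = sum_i theta_i * basis_i(y) for y in [0, 1], with order M = len(theta) - 1."""
    M = len(theta) - 1
    return sum(theta[i] * bernstein_basis(i, M, y) for i in range(M + 1))

# Hypothetical increasing coefficients for order M = 5 (illustration only).
theta = [-2.0, -1.5, -0.2, 0.1, 1.0, 2.5]

grid = [k / 200 for k in range(201)]
values = [bernstein_transform(y, theta) for y in grid]

# Increasing coefficients imply a strictly increasing polynomial on [0, 1].
is_increasing = all(v1 < v2 for v1, v2 in zip(values, values[1:]))
```

The endpoints of the polynomial coincide with the first and last coefficient, so the coefficients also fix the range that the transformation can reach.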
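A useful special case of the transformation function is the so-called linear transformation function, which has the following form:

$$h(\tilde{y}|x) = h_0(\tilde{y}) + \sum_{j=1}^{p} \beta_j x_j \tag{5}$$

Here $h_0$ is a baseline transformation that does not depend on $x$ (for example a Bernstein polynomial with constant coefficients), and the input $x = (x_1, \dots, x_p)$ enters only through a linear shift with coefficients $\beta_j$.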
This linear transformation model (LTM) is much less flexible than the transformation in Equation 4. Its main advantage is the interpretability of the coefficients, as in classical regression models. Linear transformation models have been used for a long time in survival analysis. For example, a Cox model is a linear transformation model: if the base distribution is a minimum extreme value distribution, the coefficients are the log hazard ratios. If the base distribution is chosen to be a standard logistic distribution, then the coefficients are the log odds ratios; see [Hothorn2018] for a detailed discussion.
In the following, we combine the ideas of NFs parameterized by neural networks and MLT. Our model consists of a chain of four transformations constructing the parametric transformation function:

$$z = h(y|x) = f_3 \circ f_2 \circ \sigma \circ f_1(y) \tag{6}$$
Only the sigmoid transformation $\sigma$ has no parameters, whereas $f_1$, $f_2$, and $f_3$ have learnable parameters. In a regression task, we allow all parameters of our flexible model to change with $x$. We use neural networks to estimate the parameters (see Figure 1). We describe the chain of flows going from the untransformed variable $y$ to the standard normally distributed variable $z$ (see Figure 2). The MLT approach based on Bernstein polynomials needs a target variable that is bounded in $[0,1]$. For our model, we do not want to define a priori a fixed upper and lower bound of $y$ to restrict its range. Instead, we transform $y$ in a first flow to the required range of $[0,1]$. This is done by a scaled and shifted transformation $f_1$ followed by a sigmoid function $\sigma$ via (see Figure 2):

$$\tilde{y} = \sigma\big(f_1(y)\big) = \sigma\big(a(x) \cdot y - b(x)\big) \tag{7}$$

The scale and shift parameters $a(x)$ and $b(x)$ are allowed to depend on $x$ and are controlled by the output of two neural networks with input $x$. In order to enforce $f_1$ to be monotonically increasing, we need to restrict $a(x) > 0$. This is enforced by a softplus activation of the last neuron (see Figure 1).
The second transformation $f_2$ is the MLT transformation from Equation (4), consisting of a Bernstein polynomial with parameters $\vartheta_0(x), \dots, \vartheta_M(x)$, which can depend on $x$ and are controlled by an NN with input $x$ and as many output nodes as we have parameters. If the order of the Bernstein polynomial is […], then the […] is sufficient when working with a Gaussian base distribution. But with higher-order Bernstein polynomials we can achieve very flexible transformations that can, e.g., turn a bimodal distribution into a Gaussian (see Figure 2). To ensure a monotonically increasing $f_2$, we enforce the above-discussed condition $\vartheta_0 < \vartheta_1 < \dots < \vartheta_M$ by choosing $\vartheta_k = \vartheta_{k-1} + e^{\gamma_k}$ for $k > 0$, with $\gamma_k$ being the $k$-th output of the network. We set $\vartheta_0 = \gamma_0$ (see Figure 1).
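The construction of increasing Bernstein coefficients from unconstrained network outputs can be sketched as a cumulative sum of positive increments. In this minimal example the increment function is assumed to be the exponential; any strictly positive function (softplus, for instance) would serve the same purpose. The name `gamma` and its values are hypothetical stand-ins for the raw outputs of the network for one input.

```python
import math

def increasing_coefficients(gamma):
    """Map unconstrained outputs gamma_0..gamma_M to theta_0 < theta_1 < ... < theta_M.

    theta_0 = gamma_0 and theta_k = theta_{k-1} + exp(gamma_k) for k > 0.
    The exponential is an assumed choice of positive increment function.
    """
    theta = [gamma[0]]
    for g in gamma[1:]:
        theta.append(theta[-1] + math.exp(g))
    return theta

# Hypothetical raw network outputs for one input x (order M = 4).
gamma = [-1.0, 0.5, -2.0, 0.0, 1.2]
theta = increasing_coefficients(gamma)

# The ordering holds for any real-valued gamma, so no constraint on the
# network outputs themselves is needed during training.
is_ordered = all(t1 < t2 for t1, t2 in zip(theta, theta[1:]))
```

Since the constraint is built into the parameterization, plain gradient-based training can be used without projections or penalty terms.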
The final flow $f_3$ in the chain is again a scale and shift transformation into the range of the standard normal (see Figure 2). The parameters $\alpha(x)$ and $\beta(x)$ are also allowed to depend on $x$ and are again controlled by NNs, where the softplus transformation is used to guarantee a positive $\alpha(x)$ (see Figure 1). First experiments showed that the third flow is not necessary for the performance but accelerates training (data not shown).
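The total transformation is given by chaining the three flows, resulting in (see Figure 2):

$$z = h(y|x) = \alpha(x) \cdot h^{\mathrm{MLT}}_{\vartheta(x)}\Big(\sigma\big(a(x) \cdot y - b(x)\big)\Big) - \beta(x) \tag{8}$$

where $a(x)$ and $b(x)$ are the scale and shift of the first flow, $h^{\mathrm{MLT}}_{\vartheta(x)}$ is the Bernstein polynomial with coefficients $\vartheta(x)$, and $\alpha(x)$ and $\beta(x)$ are the scale and shift of the final flow.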
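To make the chained transformation and its use in the likelihood concrete, the following sketch evaluates the chain and the resulting log-density for a fixed set of parameters. It is an illustration under stated assumptions, not the authors' implementation: the parameters are fixed hypothetical numbers (in the model they would be outputs of neural networks with input $x$), the shifts are subtracted, and the base distribution is a standard normal.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def bernstein(y_tilde, theta):
    """Bernstein polynomial of order M = len(theta) - 1 evaluated on [0, 1]."""
    M = len(theta) - 1
    return sum(t * math.comb(M, i) * y_tilde ** i * (1.0 - y_tilde) ** (M - i)
               for i, t in enumerate(theta))

def bernstein_derivative(y_tilde, theta):
    """Derivative: M * sum_i (theta_{i+1} - theta_i) * basis_{i, M-1}(y_tilde)."""
    M = len(theta) - 1
    return M * sum((theta[i + 1] - theta[i]) * math.comb(M - 1, i)
                   * y_tilde ** i * (1.0 - y_tilde) ** (M - 1 - i)
                   for i in range(M))

def transform(y, a, b, theta, alpha, beta):
    """z = alpha * Bernstein(sigmoid(a * y - b)) - beta."""
    return alpha * bernstein(sigmoid(a * y - b), theta) - beta

def transform_derivative(y, a, b, theta, alpha, beta):
    """dz/dy by the chain rule over the three flows."""
    y_tilde = sigmoid(a * y - b)
    return alpha * bernstein_derivative(y_tilde, theta) * y_tilde * (1.0 - y_tilde) * a

def log_density(y, a, b, theta, alpha, beta):
    """log p(y | x) = log N(z; 0, 1) + log |dz/dy| (change of variables)."""
    z = transform(y, a, b, theta, alpha, beta)
    log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    return log_base + math.log(transform_derivative(y, a, b, theta, alpha, beta))

# Hypothetical parameters for one input x (positive scales, increasing theta).
a, b = 1.3, 0.4
theta = [-3.0, -1.0, -0.5, 0.8, 3.0]
alpha, beta = 1.1, 0.2
params = (a, b, theta, alpha, beta)

# Average negative log-likelihood over a few hypothetical observations.
observations = [0.1, 0.5, 1.4]
nll = -sum(log_density(y, *params) for y in observations) / len(observations)
```

Training would minimize this NLL with respect to the network weights that produce the parameters; here the parameters are constants, so the sketch only shows how a single likelihood contribution is assembled.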
In order to demonstrate the capacity of our proposed approach, we performed three different experiments.
IV-A One-Dimensional Toy Data
We use simulated toy data sets with a single feature and a single target variable to illustrate some properties of the proposed transformation model. In the first example the data generating process is given by imposing exponential noise with increasing amplitude on a linearly increasing sinusoid (see Figure 3). For some $x$-values two fitted CPDs are shown, one resulting from our NN-based transformation model (DL_MLT, Equation 8) and one from a less flexible linear transformation model (LTM, Equation 5). Figure 3 clearly demonstrates that the complex transformation model can, in contrast to the simple linear transformation model, cope with a non-monotone $x$-dependency of the variance. To quantify the goodness of fit, we use the negative log-likelihood (NLL) of the data (smaller is better), which is -2.00 for the complex and -0.85 for the simple linear transformation model.
The second example, shown in the lower panel of Figure 3, is a challenging bimodal distribution in which the spread between the modes strongly depends on $x$ but the conditional mean of the outcome does not depend on $x$. Both models can capture the bimodal shape of the CPD, but the LTM is not flexible enough to adapt its bimodal CPD with changing $x$. The proposed flexible transformation model DL_MLT is also able to take the dependency of the spread between the modes of the distribution into account.
IV-B UCI-Benchmark Data Sets
| Data Set | N | DL_MLT | NGBoost | MC Dropout | Deep Ensembles | Gaussian Process | MDN | NFN |
|---|---|---|---|---|---|---|---|---|
| Boston | 506 | 2.42 ± 0.050 | 2.43 ± 0.15 | 2.46 ± 0.25 | 2.41 ± 0.25 | 2.37 ± 0.24 | 2.49 ± 0.11 | 2.48 ± 0.11 |
| Concrete | 1030 | 3.29 ± 0.02 | 3.04 ± 0.17 | 3.04 ± 0.09 | 3.06 ± 0.18 | 3.03 ± 0.11 | 3.09 ± 0.08 | 3.03 ± 0.13 |
| Energy | 768 | 1.06 ± 0.09 | 0.60 ± 0.45 | 1.99 ± 0.09 | 1.38 ± 0.22 | 0.66 ± 0.17 | 1.04 ± 0.09 | 1.21 ± 0.08 |
| Kin8nm | 8192 | -0.99 ± 0.01 | -0.49 ± 0.02 | -0.95 ± 0.03 | -1.20 ± 0.02 | -1.11 ± 0.03 | NA | NA |
| Naval | 11934 | -6.54 ± 0.03 | -5.34 ± 0.04 | -3.80 ± 0.05 | -5.63 ± 0.05 | -4.98 ± 0.02 | NA | NA |
| Power | 9568 | 2.85 ± 0.005 | 2.79 ± 0.11 | 2.80 ± 0.05 | 2.79 ± 0.04 | 2.81 ± 0.05 | NA | NA |
| Protein | 45730 | 2.63 ± 0.006 | 2.81 ± 0.03 | 2.89 ± 0.01 | 2.83 ± 0.02 | 2.89 ± 0.02 | NA | NA |
| Wine | 1588 | 0.67 ± 0.028 | 0.91 ± 0.06 | 0.93 ± 0.06 | 0.94 ± 0.12 | 0.95 ± 0.06 | NA | NA |
| Yacht | 308 | 0.004 ± 0.046 | 0.20 ± 0.26 | 1.55 ± 0.12 | 1.18 ± 0.21 | 0.10 ± 0.26 | NA | NA |

Table I: Comparison of prediction performance (test NLL, smaller is better) on regression benchmark UCI datasets. The best method for each dataset is bolded, as are those with standard errors that overlap with the standard errors of the best method.
To compare the predictive performance of our NN-based transformation model with other state-of-the-art methods, we use nine well-established benchmark data sets (see Table I) from the UCI Machine Learning Repository. Shown are our approach (DL_MLT), NGBoost [Duan2020], MC Dropout [Gal2015], Deep Ensembles [NIPS2017_7219], Gaussian Process [Duan2020], noise-regularized Gaussian Mixture Density Networks (MDN), and normalizing flow networks (NFN) [Rothfuss]. As hyperparameters for our DL_MLT model, we used a single order $M$ of the Bernstein polynomial for all datasets and a regularization constant of 0.01 for the smaller datasets (Yacht, Energy, Concrete, and Boston); no regularization was used for the other datasets. To quantify the predictive performance, we use the negative log-likelihood (NLL) on test data. The NLL is a strictly proper score that takes its minimum if and only if the predicted CPD matches the true CPD and is therefore often used when comparing the performance of probabilistic models.
In order to compare with other state-of-the-art models, we follow the protocol of Hernández-Lobato and Adams [Hernandez-Lobato2015] that was also employed by Gal [Gal2015]. For validating our method, the benchmark data sets were split into several folds containing training data (90 percent) and test data (10 percent). We downloaded the specific folds as used in [Gal2015] from https://github.com/yaringal/DropoutUncertaintyExps. We determined the hyperparameters of our model using 80 percent of the training data, keeping 20 percent for validation. The only preprocessing step was a rescaling of $x$ and $y$ based on the training data. In contrast to other approaches in Table I, like [Gal2015] and [Duan2020], we chose only one set of hyperparameters for each dataset and did not adapt the hyperparameters for each individual run separately. We can do this since our model has only a few, quite stable hyperparameters. Specifically, we verified that the initial choice of the number of layers in the network is appropriate (no tuning has been done). We further chose the regularization parameter of the networks and the order of the Bernstein polynomials. After choosing the hyperparameters, we trained the same model again on the complete training fold and determined the NLL on the corresponding test fold. This procedure allowed us to compare the mean and the standard error of the test NLL with reported values from the literature.
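The aggregation over folds can be sketched as follows. The per-fold values are hypothetical illustration numbers, not results from this paper, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of folds.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Mean and standard error (sample standard deviation / sqrt(n)) of per-fold scores."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)

# Hypothetical per-fold test NLLs (illustration only, not results from the paper).
fold_nll = [2.40, 2.45, 2.38, 2.47, 2.41]

mean_nll, se_nll = mean_and_standard_error(fold_nll)
summary = f"test NLL: {mean_nll:.2f} +/- {se_nll:.3f}"
```

Reporting the standard error alongside the mean is what allows a comparison of overlapping intervals between methods, as done in the benchmark table.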
The results in Table I show that our DL_MLT model yields, overall, a performance competitive with other state-of-the-art models. In two tasks (Naval and Wine) the transformation model clearly outperforms all existing models. We followed up on the reasons and visually inspected the predicted CPDs in the different tasks. For Naval, we found that the model needs extremely long training periods, approximately 120’000 iterations compared to approximately 20’000 for the other data sets. When trained for such a period, its CPD is a very narrow single spike (see lower left panel in Figure 4). We speculate that the reason for the worse NLL of the other models is that they have not been trained long enough. For the other data sets where we outperformed the existing methods, the striking finding was that for many instances very non-Gaussian CPDs are predicted. For the Wine dataset the predicted CPDs are not always unimodal but quite often show a multimodal shape (see upper right panel in Figure 4). This makes perfect sense, since the target variable in the dataset, the subjective quality of the wine, is an integer value ranging from 0 to 10. (Footnote 1: Here we treat the values as continuous quantities as done in the reference methods, when taking the correct data type into account.) Hence, if the model cannot decide between two levels, it predicts a bimodal CPD, which shows that it can learn, to a certain extent, the discrete structure of the data.
To investigate whether we need more flexibility to model this kind of multimodal outcome distribution, we refitted the Wine data with a more flexible model and obtained a significantly improved test NLL. For the other data sets the underlying cause of the better performance of our DL_MLT model is unclear to us.
IV-C Age Estimation from Faces
In this experiment, we want to demonstrate the usefulness of the proposed NN-based transformation model for a typical deep learning context dealing with complex input data, like images. For this purpose, we chose the UTKFace dataset containing images of cropped faces of humans with known age ranging from 1 to 116 years [Zhang]. The spread of the CPD indicates the uncertainty of the model about the age. We did not do any hyperparameter search for this proof of concept and used 80 percent of the data for training. The color images were downscaled and fed into a small convolutional neural network (CNN) comprising three convolutional layers with 141840 parameters, as indicated in Figure 1.
Estimating the age is a challenging problem. First, age is non-negative. Second, the difficulty of estimating the age depends on the age itself. As humans, we would probably not be more than two years off when estimating the age of a one-year-old infant, but for a thirty- or ninety-year-old person the task is much harder. Therefore, we expect that a probabilistic model will also predict CPDs with different spreads and shapes. Usually this problem is circumvented by transforming the age, e.g. by a log transformation. Here, we did not apply any transformation to the raw data and let the network find the correct CPD in an end-to-end fashion. After training for 15’000 iterations, an NLL of 3.83 on the test set was reached. Since we did not find a suitable probabilistic benchmark for the dataset, we show in Figure 5 some typical results for the CPD for people of different ages.
The results of the model meet our expectations, yielding narrow CPDs for infants, while for older people the spread of the CPD broadens with increasing age.
V Conclusion and outlook
We have demonstrated that deep transformation models are a powerful approach for probabilistic regression tasks, where for each input $x$ a conditional probability distribution on $y$ is predicted. Such probabilistic models are needed in many real-world situations where the available information does not deterministically define the outcome and where an estimate of the uncertainty is needed.
By joining ideas from statistical transformation models and deep normalizing flows, the proposed model is able to outperform existing models from both fields when the shape of the conditional distribution is complex and far from Gaussian.
Compared to statistical transformation models the proposed deep transformation model does not require predefined features. It can be trained in an end-to-end fashion from complex data like images by prepending a CNN.
Though mixture density networks [Bishop1994] are able to handle non-Gaussian CPDs, they often tend to overfit the data and require a careful tuning of regularization to reach a good performance [Rothfuss]. Since deep transformation models estimate a smooth, monotonically increasing transformation function, they always yield smooth conditional distributions. Therefore, we need only mild regularization in cases where the NN for the parameter estimation would overfit the training data. For the UCI datasets, we empirically observed the need for mild regularization only in cases where the size of the dataset is below 1500 points.
The proposed model can, in principle, be extended to higher-dimensional cases of the response variable; see [klein2019multivariate] for a statistical treatment. The fact that our model is very flexible and does not impose any restriction on the modeled CPD allows it to learn the restrictions of the dataset, like age being positive, from the data. For limited training data, this can lead to situations where some probability is also assigned to impossible outcomes, such as negative ages. It is possible to adapt our model to respect these kinds of limitations or to predict, e.g., discrete CPDs. We will tackle this in future research. For discrete data we will use the appropriate likelihood, which is described in terms of the corresponding discrete density $P(Y = y_k | x) = F_Z\big(h(y_k|x)\big) - F_Z\big(h(y_{k-1}|x)\big)$, where $F_Z$ is the cumulative distribution function of the base distribution, $y_k$ is the observed value, and $y_{k-1}$ is the largest value of the discrete support with $y_{k-1} < y_k$. For the Wine dataset, for example, where the outcome is an ordered set of ten answer categories, such an approach would result in a discrete predictive model matching the outcome. At the moment, our model also does not incorporate the ability to handle truncated or censored data, which is possible with statistical transformation models; see [Hothorn2018]. For the Boston dataset, for example, all values with $y = 50$ are in fact right-censored, thus the probability $P(Y > 50 | x)$ would be the correct contribution to the likelihood.
Another limitation of the proposed flexible transformation NN is its black-box character. Picking special base distributions, such as the standard logistic or minimum extreme value distributions, and using linear transformation models should make it possible to disentangle and describe the effect of different input features by means of mean shifts, odds ratios, or hazard ratios. This can be translated into interpretable deep learning transformation models, which we plan to do in the near future.
The authors would like to thank Lucas Kook, Lisa Herzog, Matthias Hermann, and Elvis Murina for fruitful discussions.