Let’s face it. The work of statisticians is considered boring in the public eye. Nobody publishes page turners on the thrilling aspects of data analysis, yet the quest for a good model can be as exciting as detective work. One of my favourite paperback characters is LAPD detective Harry Bosch in the crime novels of Michael Connelly. Like Harry, who follows the traces left by the murderer on the crime scene to form a theory about the culprit, the experienced data analyst follows the traces left by the data-generating process in the residuals of an over-simplistic model. Unlike Harry, who of course always succeeds in arresting the murderer, the statistician can never be sure whether the correct or even an approximately useful model was found. In the quests for a suspect or for a good model, parsimonious explanations are preferred by Occam’s razor. Therefore, in residual-based model diagnostics, the data analyst starts with a very simple model, whose complexity is increased by step-wise refinement until all signs of lack of fit disappear from the residuals. I refer to such a procedure as “bottom-up model choice” because one moves from simple to more complex models. In this tutorial, I consider moving in the opposite direction, i.e. from complex to simple models, for distributional regression. This “top-down approach” to model choice begins with the most complex model that one can come up with that explains both signal and noise without overfitting the data. In a regression setup, such a model would describe as accurately as possible the conditional distribution of the response given the explanatory variables. Once such a model is established as a benchmark for comparison with simpler models, one can start to reduce model complexity step-wise. In the crime novel scenario, the top-down data analyst takes the role of an eyewitness at the scene. What one “sees” in this process is, of course, still a portrayal and not the real thing. 
There is no way to “see” the correct model. In top-down model choice, however, the trajectories through model space will be guided by assessments of vital models. In bottom-up model choice, by contrast, the horizon is limited by the amount of information that one can find in traces in deceased models.
In this tutorial, I focus on top-down model choice in continuous regression problems. Conceptually, a regression model is a family of conditional distributions for some response $Y$ given a specific configuration of explanatory variables $\boldsymbol{x}$. The model describes both signal and noise, i.e. the variability explained by the explanatory variables and the unexplained variability. Unfortunately, this point of view only applies to relatively simple models that assume a certain parametric distribution, whose parameters partially depend on the explanatory variables. So-called “non-parametric regression models” (Fahrmeir et al., 2013) often restrict their attention to the signal $\mathbb{E}(Y \mid \boldsymbol{X} = \boldsymbol{x}) = f(\boldsymbol{x})$, with non-linear conditional mean function $f$, while treating the noise, i.e. the remaining higher moments of the conditional distribution, as a nuisance. Methods such as random forests (Breiman, 2001) are extremely powerful when estimating complex conditional mean functions. However, one cannot infer the entire conditional distribution using random forests or similar methods. This renders top-down model choice impossible because reductions in complexity require switching between different model classes or even crossing the borders between the parametric and non-parametric empires. The comparison of two models from different classes is difficult, and thus it is difficult to decide whether the simpler model is more appropriate than the more complex one.
The implementation of top-down model choice is much simpler when the most complex and the simplest model are members of the same family. Conditional transformation models from the transformation family of distributions (Hothorn et al., 2014, 2017) include many important established off-the-shelf regression models. In addition, tailored models can be created, in vivo with our brains and in silico using open-source software, which allow smooth transitions between models of different complexity. In a nutshell, the class of conditional transformation models
$$\mathbb{P}(Y \le y \mid \boldsymbol{X} = \boldsymbol{x}) = F_Z(h(y \mid \boldsymbol{x}))$$
assumes that the conditional distribution function of $Y$ given $\boldsymbol{X} = \boldsymbol{x}$ can be written as the composition of an a priori specified continuous cumulative distribution function $F_Z$ and some conditional transformation function $h(y \mid \boldsymbol{x})$. The latter function monotonically increases in its first argument $y$ for each $\boldsymbol{x}$. It is important to note that the entire conditional distribution, and not just its mean, is modelled by $h$. In this sense, and unlike common regression models, there is no decomposition into signal (the conditional mean) and noise (the remaining higher moments) in this class of transformation models. Changing model complexity means changing the complexity of the conditional transformation function $h$, and I thus refer to top-down model choice in conditional transformation models as top-down transformation choice.
Model complexity in the class of conditional transformation models is linked to smooth conditional transformation functions $h$ of varying complexity. These conditional transformation functions may vary with the explanatory variables in arbitrary ways, including interactions and non-linearities. In this paper, I consider model choice as an art rather than an exact science. No formal algorithm leading to an “optimal” model will be presented. Instead, I argue that the possibility of modelling a cascade of decreasingly complex conditional distribution functions in the same model class gives us new possibilities to investigate goodness of fit or lack thereof. A fair amount of subjectivity will remain in this process, as is always the case in classical model diagnostics. I shall be less concerned with the technical subtleties of parameter estimation in the models discussed here and refer the reader to more formal results published elsewhere. Instead, I illustrate practical aspects of top-down transformation choice by a tour de force through transformation models describing the impact of lifestyle parameters, such as smoking or physical activity, on the body mass index (BMI) distribution in the Swiss population.
I will proceed by introducing the Swiss Health Survey and the variables dealt with in Section 2. In a very simple setup, I first illustrate a bottom-up route, starting with a normal linear model and ending with a more complex non-normal transformation model, for describing the BMI distribution of females and males at various levels of smoking (Section 3). I then try to reduce the complexity again until an interpretable model that fits the data roughly as well as the most complex model can be found. In addition to a consideration of sex and smoking, I consider age and some lifestyle variables in a more realistic setup of top-down transformation choice in Section 4.
2 Body Mass Index in the Swiss Health Survey
The Swiss Health Survey (SHS) is a population-based cross-sectional survey. It has been conducted every five years since 1992 by the Swiss Federal Statistical Office (Bundesamt für Statistik, 2013). For this tutorial, I restricted the sample to individuals aged between and years from the 2012 survey. Study samples were obtained by stratified random sampling using a database with all private household landline telephone numbers. Data were collected by telephone interviews and self-administered questionnaires. Height and weight were self-reported in telephone interviews. Observations with extreme values of height and weight were excluded (highest and lowest percentile by sex). Smoking status was categorised into never, former, light ( cigarettes per day), moderate () and heavy smokers (
). Never smokers stated that they did not currently smoke and had never regularly smoked for longer than six months; former smokers had quit smoking but had smoked for more than six months during their life course. One cigarillo or pipe counted as two cigarettes, and one cigar counted as four cigarettes. The following lifestyle variables were included and assessed by telephone interview and self-administered questionnaire: fruit and vegetable consumption, physical activity, alcohol intake, level of education, nationality and place of residence. Fruit and vegetable consumption was combined in one binary variable that comprised the information on whether both fruits and vegetables were consumed daily or not. The variable describing physical activity was defined as the number of days per week a subject started to sweat during leisure time physical activity and was categorised as days, days and none. Alcohol intake was included as a continuous variable in grams per day. Education was included as highest degree obtained and was categorised into mandatory (International Standard Classification of Education, ISCED 1-2), secondary (ISCED 3-4), and tertiary (ISCED 5-8) (UNESCO Institute for Statistics, 2012). Nationality had the two categories Swiss and foreign. Language reflected cultural and regional differences within Switzerland, and the three categories German/Romansh, French and Italian were taken into account. Sampling weights of this representative survey were considered for the estimation of all models reported in this tutorial. More detailed information about this study and an analysis using simple transformation models is given in Lohse et al. (2017).
3 Sex- and Smoking-specific BMI Distributions
I start with the very simple situation where the conditional distribution of BMI depends on sex and smoking only. Smoking was assessed on five different levels (never smoked, former smokers, light smokers, medium smokers and heavy smokers). Therefore, I am interested in the conditional distribution of BMI in these groups of participants. Figure 1 presents the empirical cumulative distribution functions, i.e. the non-parametric maximum-likelihood estimators for the underlying continuous distributions, for each of the $2 \times 5 = 10$ combinations of sex and smoking. At the same time, the plot also represents the uncompressed raw data. With a high-enough resolution, one could recover the original BMI values and the corresponding sampling weights from such an image. Consequently, goodness of fit can be assessed by overlaying the empirical cumulative distribution functions with their model-based counterparts in this simple setup. I will try to find a suitable parametric model this way. In addition to this rather informal approach, I will study the increase of the log-likelihoods as model complexity is increased. In the classical bottom-up approach, one would start with a very simple model assuming conditional normal distributions. The next section discusses possible choices in this model class.
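The weighted empirical cumulative distribution functions mentioned above can be sketched in a few lines. This is only an illustration and not the code behind Figure 1: the function name `weighted_ecdf` and the toy BMI values and sampling weights are made up for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_ecdf(values, weights):
    """Return a function y -> weighted proportion of observations <= y."""
    pairs = sorted(zip(values, weights))
    ys = [v for v, _ in pairs]
    cum = list(accumulate(w for _, w in pairs))  # running sum of weights
    total = cum[-1]

    def ecdf(y):
        k = bisect_right(ys, y)  # number of sorted observations <= y
        return cum[k - 1] / total if k > 0 else 0.0

    return ecdf


# Hypothetical toy data: four BMI values with their sampling weights.
bmi = [21.5, 24.0, 19.8, 27.3]
w = [1.0, 2.0, 1.0, 1.0]
F = weighted_ecdf(bmi, w)
print(F(19.0), F(21.5), F(24.0), F(30.0))
```

In the setting of this section, one such step function would be computed separately for each combination of sex and smoking.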
3.1 Normal Models
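The normal cell-means model with constant variance
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi\left(\frac{y - \mu_{\text{sex:smoking}}}{\sigma}\right) \tag{1}$$
assumes normal distributions with a common variance $\sigma^2$ for all conditional BMI distributions. Means are allowed to vary between the groups defined by sex and smoking. The notation $\mu_{\text{sex:smoking}}$ indicates that the conditional mean is specific to each combination of sex and smoking in this cell-means model, i.e. there are a total of 10 mean parameters $\mu_{\text{sex:smoking}}$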
. With a residual standard error of, a log-likelihood of was obtained, and the estimated cell-means, with confidence intervals, are shown in Table 1.
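Table 1: Estimated cell means of BMI, with confidence intervals, for each combination of sex and smoking.

| Smoking | Female | Male |
|---|---|---|
| Never | 23.57 (23.47–23.68) | 25.18 (25.06–25.30) |
| Former | 23.94 (23.76–24.12) | 26.46 (26.30–26.62) |
| Light | 22.85 (22.59–23.11) | 25.00 (24.73–25.27) |
| Medium | 23.45 (23.18–23.72) | 25.00 (24.75–25.26) |
| Heavy | 23.72 (23.36–24.08) | 25.90 (25.65–26.15) |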
How well does this model fit the data? I want to answer this question by graphically comparing the conditional distribution functions obtained from this model to the corresponding empirical conditional distributions, and thus the raw data. The model-based conditional cumulative distribution functions
$$\hat{\mathbb{P}}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi\left(\frac{y - \hat{\mu}_{\text{sex:smoking}}}{\hat{\sigma}}\right)$$
overlay the empirical cumulative distribution functions in Figure 2. While not being completely out of line, the considerable differences between the empirical and model-based distribution functions certainly leave room for improvement. An obvious increase in the complexity allows for group-specific variances in the model
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi\left(\frac{y - \mu_{\text{sex:smoking}}}{\sigma_{\text{sex:smoking}}}\right). \tag{2}$$
The log-likelihood in this -parameter model increased to , and the corresponding conditional distribution functions in Figure 3 were closer to the empirical cumulative distribution functions. For males, the model-based normal distributions were very close to the empirical conditional BMI distributions. For females, however, there still was a considerable discrepancy between model and data, especially in the lower tails. The BMI distributions of females deviated from normality much more than the BMI distributions of males (note that I am not saying that males are normal and females are not!). It is clear that one has to move to a non-normal error model, at least for females, and the transformation models discussed below are a convenient way to do so.
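The normal models are a special case of transformation models and thus the latter class is a very natural extension of the former. To see the connection, consider the conditional distribution function
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi\left(\frac{y - \mu_{\text{sex:smoking}}}{\sigma_{\text{sex:smoking}}}\right) = \Phi(h(y \mid \text{sex}, \text{smoking})),$$
where $h(y \mid \text{sex}, \text{smoking}) = \vartheta_1 + \vartheta_2\, y$ is a linear function of $y$ with parameters $\vartheta_1 = -\mu_{\text{sex:smoking}} / \sigma_{\text{sex:smoking}}$ and $\vartheta_2 = 1 / \sigma_{\text{sex:smoking}}$. In more complex models, I will use the parameter $\vartheta$ (or parameter vector $\boldsymbol{\vartheta}$ for basis-transformed response values $\boldsymbol{a}(y)$) for modelling transformations of the response $y$. Shift parameters describing effects of explanatory variables only, i.e. no response-varying effects, will be denoted by $\beta$ or $\gamma$ later on. The above re-parameterisation shows, quite unsurprisingly, that a normal model features a linear transformation function $h$. Consequently, non-normal conditional distributions can be obtained by allowing a non-linear transformation function for each combination of sex and smoking in the transformation models presented in the next section.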
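The re-parameterisation of a normal model as a linear transformation model can be checked numerically. The following is a minimal sketch, assuming intercept $-\mu/\sigma$ and slope $1/\sigma$ for the linear transformation function; the function names and the values of `mu` and `sigma` are purely illustrative and are not estimates from the survey.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_cdf(y, mu, sigma):
    """Distribution function of N(mu, sigma^2) in the usual parameterisation."""
    return Phi((y - mu) / sigma)


def linear_transformation_cdf(y, theta1, theta2):
    """Phi(h(y)) with h(y) = theta1 + theta2 * y; theta2 > 0 keeps h increasing."""
    return Phi(theta1 + theta2 * y)


mu, sigma = 25.0, 4.0                       # illustrative values only
theta1, theta2 = -mu / sigma, 1.0 / sigma   # intercept and slope of h

for y in (18.5, 25.0, 30.0):
    diff = normal_cdf(y, mu, sigma) - linear_transformation_cdf(y, theta1, theta2)
    print(y, abs(diff))
```

Both parameterisations describe the same distribution; only the linear form generalises directly to non-linear transformation functions.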
3.2 Non-normal Transformation Models
The core concept of a transformation model is a potentially non-linear monotonically increasing transformation function $h$, here for each combination of sex and smoking. For computational convenience, I parameterised the transformation functions in terms of Bernstein polynomials (Farouki, 2012). For each of the 10 groups, I modelled the transformation function by a Bernstein polynomial $\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}$ of order $M$, where $\boldsymbol{a}(y)$ are the corresponding $M + 1$ basis functions of BMI. Monotonically increasing parameters $\vartheta_0 \le \vartheta_1 \le \dots \le \vartheta_M$ guarantee that a Bernstein polynomial of order $M$ is monotonically increasing. Maximum-likelihood estimation was performed (Hothorn et al., 2017) using the mlt (Hothorn, 2017a, b) add-on package to the R system for statistical computing (R Core Team, 2017). With the corresponding $10\,(M + 1)$ total parameters, the maximum log-likelihood of the model
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi(\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex:smoking}}) \tag{3}$$
was ; the notation $\boldsymbol{\vartheta}_{\text{sex:smoking}}$ indicates that the parameters were estimated for each combination of sex and smoking. One can hardly differentiate the resulting model-based conditional distribution functions from the empirical cumulative BMI distribution functions in Figure 4. Because a separate transformation function was estimated for each combination of sex and smoking, this model can be referred to as a transformation model stratified by sex and smoking. Based on this model, one can understand non-normality as deviation of the transformation function from a linear function. Figure 5 shows the sex- and smoking-specific transformation functions of model (3) along with the linear transformation functions obtained from the normal cell-means model (2) with heterogeneous variances. In the centre of the distributions, the two curves overlap, but the tails are not described well by the normal distribution. The differences between the two curves are more pronounced for females, corresponding to the larger deviations from normality observed earlier.
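To make the Bernstein parameterisation concrete, here is a small sketch in plain Python. It is not the mlt implementation: the function names, the BMI range of 15 to 45 and the coefficient values are assumptions chosen only to show that increasing coefficients yield an increasing transformation function.

```python
from math import comb


def bernstein_basis(y, order, lower, upper):
    """Evaluate the order + 1 Bernstein basis functions at y in [lower, upper]."""
    t = (y - lower) / (upper - lower)  # rescale y to the unit interval
    return [comb(order, m) * t**m * (1.0 - t) ** (order - m)
            for m in range(order + 1)]


def transformation(y, theta, lower, upper):
    """Bernstein polynomial a(y)^T theta; its order is len(theta) - 1."""
    basis = bernstein_basis(y, len(theta) - 1, lower, upper)
    return sum(a * th for a, th in zip(basis, theta))


# Hypothetical increasing coefficients for a polynomial of order 5.
theta = [-3.0, -1.5, -0.2, 0.8, 1.9, 3.5]
lower, upper = 15.0, 45.0  # assumed support of BMI for this illustration

grid = [lower + i * (upper - lower) / 100 for i in range(101)]
h = [transformation(y, theta, lower, upper) for y in grid]
print(h[0], h[50], h[-1])
```

The basis functions sum to one at every point, so the polynomial starts at the first coefficient and ends at the last, which makes constraints on the coefficients easy to translate into constraints on the function.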
One nice feature of model (3) is the possibility to easily derive characterisations of the distribution other than the distribution function. Density, quantile, hazard, cumulative hazard or other characterising functions can be derived from (3), and Figure 6 depicts the densities for males and females at the various levels of smoking. The right-skewness of the distribution, and thus deviation from normality, was more pronounced for females. The BMI distributions for females put more weight on smaller BMI values than those for males. Except for heavy smokers, the effects of smoking seemed to be rather small.
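For reference, the characterising functions mentioned above follow from the distribution function by standard calculus. The display below is a sketch in generic notation: $h(y)$ stands for the group-specific transformation function of model (3), $h'(y)$ for its derivative with respect to $y$, and $\phi$ for the standard normal density; these symbols are introduced here for illustration only.

$$
\begin{aligned}
\text{density:} \quad & f(y) = \phi(h(y))\, h'(y), \\
\text{quantile function:} \quad & Q(p) = h^{-1}\left(\Phi^{-1}(p)\right), \quad p \in (0, 1), \\
\text{hazard:} \quad & \lambda(y) = \frac{\phi(h(y))\, h'(y)}{1 - \Phi(h(y))}, \\
\text{cumulative hazard:} \quad & \Lambda(y) = -\log\left(1 - \Phi(h(y))\right).
\end{aligned}
$$

Because a Bernstein polynomial is differentiable in closed form, the densities can be evaluated directly from the estimated parameters.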
The model fit of this stratified transformation model is now satisfactory, as it essentially smoothly interpolates the empirical distribution functions and thus the data in Figure 4. This most complex model describes the data well, but unfortunately, it is difficult to learn anything from this model. That is, one wants to understand the differences between the conditional distributions in terms of simple parameters and not complex non-linear functions. A simpler model is needed. A top-down approach to transformation choice might help to identify a model with simpler and interpretable transformation functions, but any necessary compromises to the model fit should not be too demanding.
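Because the BMI distributions differed most between males and females, I first simplify the model by conditioning on smoking and stratifying by sex, i.e. I introduce sex-specific transformations $\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex}}$ and sex-specific smoking effects $\beta_{\text{sex:smoking}}$, the latter being constant for all arguments $y$ of the conditional distribution function, in the model
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi(\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex}} - \beta_{\text{sex:smoking}}). \tag{4}$$
This model features two transformation functions $\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{female}}$ and $\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{male}}$. For never smokers (the reference category, with $\beta_{\text{sex:never}} = 0$), these two transformation functions describe the conditional BMI distributions, i.e.
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{never smoked}) = \Phi(\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex}}).$$
For the remaining smoking categories, one sex-specific parameter describes how the conditional BMI distribution of a smoker differs from the conditional BMI distribution of a person who never smoked by a simple shift term $\beta_{\text{sex:smoking}}$. Because of the “linear” shift term, this model could be referred to as a stratified linear transformation model. This is, as often in statistics, a misnomer, because the transformation of the response, i.e. of the BMI values $y$, is non-linear.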
The log-likelihood for this model with parameters was found to be , a moderate reduction compared to the log-likelihood of the most complex transformation model (). Figure 7 shows only minor differences between the empirical and model-based conditional distribution functions. Thus, it seems that a more parsimonious model was found without paying too high a price in terms of log-likelihood reduction.
The conceptual problem with this model, however, is lack of interpretability of the shift term $\beta_{\text{sex:smoking}}$. In contrast to the means in the normal models (1) and (2), there is no direct interpretation of $\beta_{\text{sex:smoking}}$ in terms of moments of the conditional distribution described by this model. This issue can be addressed by changing the cumulative distribution function of the standard normal to the cumulative distribution function of the standard logistic, $\text{expit}(z) = (1 + \exp(-z))^{-1}$:
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \text{expit}(\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex}} - \beta_{\text{sex:smoking}}). \tag{5}$$
When the cut-off $y$ is fixed, this is a logistic regression model for the binary outcome $\text{BMI} \le y$ vs. $\text{BMI} > y$. The transformation function $\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex}}$ is now a sex-specific intercept, and $\beta_{\text{sex:smoking}}$ are the sex-specific log-odds ratios for the event $\text{BMI} > y$ compared to the baseline category (never smokers). Because this shift term does not depend on $y$, the model assumes proportionality of the smoking odds with respect to the cut-off $y$. Stratification by sex allows non-proportional smoking odds with respect to sex. In fact, the sex-specific conditional distributions of males and females can still differ in very general ways because two separate Bernstein polynomials describe the conditional distributions for males and females. The model can be seen as a stratified proportional odds model for continuous responses or a continuous form of logistic regression analyses, jointly performed for all possible cut-off points $y$ under the assumption of constant parameters $\beta_{\text{sex:smoking}}$. Similar models, however without stratification, were studied by Manuguerra and Heller (2010) using parametric intercept functions and recently by Liu et al. (2017) treating the intercept function as a nuisance parameter in non-parametric maximum likelihood estimation. Lohse et al. (2017) provide a comparison of parameter estimation in the presence of interval-censored body mass index observations.
The parameterisation with a negative shift term seems unconventional from a logistic regression point of view, but it simplifies interpretation. With model (5), $\mathbb{E}(\boldsymbol{a}(\text{BMI})^\top \boldsymbol{\vartheta}_{\text{sex}} \mid \text{sex}, \text{smoking}) = \beta_{\text{sex:smoking}}$, and thus positive shift parameters indicate a shift of the BMI distribution towards higher BMI values. Corresponding odds ratios larger than one mean that BMI distributions are shifted to the right, compared to the BMI distribution in the reference category.
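The proportional odds property can be illustrated numerically. The sketch below assumes a model of the form $\text{expit}(h(y) - \beta)$ with a negative shift term, as described above; the function names and the value of `beta` are hypothetical, and `h_y` stands for the value of the transformation function at an arbitrary cut-off.

```python
from math import exp


def expit(x):
    """Distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + exp(-x))


def odds_exceeding(h_y, beta):
    """Odds of BMI > y when P(BMI <= y) = expit(h(y) - beta)."""
    p_le = expit(h_y - beta)
    return (1.0 - p_le) / p_le


beta = 0.67  # hypothetical log-odds ratio, not an estimate from the survey

# The odds ratio relative to the reference category (beta = 0) is the same
# for every cut-off, i.e. for every value h(y) of the transformation function.
for h_y in (-2.0, 0.0, 1.5):
    ratio = odds_exceeding(h_y, beta) / odds_exceeding(h_y, 0.0)
    print(h_y, ratio, exp(beta))
```

A positive shift therefore multiplies the odds of exceeding any BMI value by the same factor $\exp(\beta)$, which is what makes a single odds ratio per smoking category meaningful.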
Unfortunately, there was some further reduction in the log-likelihood (), and interpretability doesn’t come for free. However, the model-based and empirical conditional BMI distribution functions look very much the same as presented in Figure 7 (additional plot not shown). The sex-specific BMI-independent odds-ratios of smoking, compared to never smoking, are given in Table 2. Former smokers had, on average, a larger BMI compared to never smokers, and the effect was stronger for males. A similar effect was observed for male heavy smokers. Female light smokers showed a BMI distribution shifted to the left, compared with female never smokers.
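Table 2: Sex-specific odds ratios of smoking, with confidence intervals, compared to never smoking.

| Smoking | Female | Male |
|---|---|---|
| Former | 1.19 (1.08–1.31) | 1.95 (1.77–2.14) |
| Light | 0.75 (0.65–0.85) | 0.95 (0.82–1.09) |
| Medium | 0.98 (0.85–1.12) | 0.93 (0.81–1.06) |
| Heavy | 1.10 (0.92–1.32) | 1.43 (1.25–1.63) |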
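Maintaining interpretability, one could go further and assume equal smoking effects $\beta_{\text{smoking}}$ for males and females in the model
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \text{expit}(\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex}} - \beta_{\text{smoking}}).$$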
The log-likelihood was again reduced () for this model with parameters. In addition, the odds-ratios presented in Table 2 indicate severe differences in the smoking effects between males and females; therefore, I refrain from looking at this or even simpler models and stop the top-down transformation choice here. Of course, this very simple example only worked because it was possible to compare models and raw data directly on the scale of the conditional BMI distribution functions for two categorical explanatory variables, sex and smoking. In the second part, I will consider additional, and also numeric, explanatory variables in a more realistic setup.
4 Conditional BMI Distributions
My aim is to estimate the conditional BMI distribution given sex, smoking, age and the lifestyle variables alcohol intake, education, physical activity, fruit and vegetables consumption, residence and nationality as explanatory variables $\boldsymbol{x}$. In the conditional transformation model
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}, \text{age}, \boldsymbol{x}) = F_Z(h(y \mid \text{sex}, \text{smoking}, \text{age}, \boldsymbol{x}))$$
the conditional transformation function $h$ depends on these variables in a yet unspecified way. Top-down transformation choice ideally allows one to start without too many headaches, i.e. with an algorithm for fitting this model that handles the potentially many explanatory variables of mixed type and allows relatively complex non-linear transformation functions. Such a model can be written as
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}, \text{age}, \boldsymbol{x}) = \Phi(\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}(\text{sex}, \text{smoking}, \text{age}, \boldsymbol{x})),$$
assuming that each conditional distribution is parameterised in terms of a Bernstein polynomial of order $M$. The parameters of this polynomial, however, depend on the explanatory variables in a potentially complex way, featuring interactions and non-linearities. Tree and forest algorithms (Hothorn and Zeileis, 2017) allow such “conditional parameter functions” $\boldsymbol{\vartheta}(\text{sex}, \text{smoking}, \text{age}, \boldsymbol{x})$, and thus the corresponding conditional BMI distributions, to be estimated in a black-box manner without the necessity to a priori specify any structure of $\boldsymbol{\vartheta}(\cdot)$. I will first use trees and forests to understand the complexity of the impact of the explanatory variables on the BMI distribution. Later on, I will apply a top-down approach to transformation choice to obtain simpler transformation models that allow more straightforward interpretation.
4.1 Transformation Trees and Forests
A transformation tree (Hothorn and Zeileis, 2017) starts with an unconditional transformation model
$$\mathbb{P}(\text{BMI} \le y) = \Phi(\boldsymbol{a}(y)^\top \boldsymbol{\vartheta}) \tag{6}$$
and a corresponding maximum-likelihood estimator $\hat{\boldsymbol{\vartheta}}$. The algorithm proceeds by assessing correlations between the score contributions evaluated at $\hat{\boldsymbol{\vartheta}}$ and the explanatory variables sex, smoking, age and $\boldsymbol{x}$. A binary split is implemented in the most discriminating cut-off point of the variable showing the highest correlation to any score. The procedure is repeated until a certain stop criterion applies. The result is a partition of the data. The algorithm is sensitive to distributional changes, i.e. the conditional BMI distributions in the subgroups of this partition may vary with respect to the mean BMI and also with respect to higher BMI moments. In each subgroup, the unconditional model (6) was used to estimate $\boldsymbol{\vartheta}$ for this subgroup. Because each observation in this subgroup is then associated with a dedicated parameter vector $\hat{\boldsymbol{\vartheta}}$, the log-likelihood for the tree model could be evaluated as the sum of the log-likelihoods in the subgroups. The log-likelihood of the tree presented in Figure 8 is . The first split is in sex, so in fact two sex-specific models are given here. Three age groups (, , ) for females and three age groups (, , ) for males are distinguished. Education contributed to understanding the BMI distribution of females and males. Location, scale and shape of the conditional BMI distributions varied considerably. The variance increased with age, and higher-educated people tended to have lower BMI values. These are interesting insights, but the model is of course very rough.
A transformation forest (Hothorn and Zeileis, 2017) allows less rough conditional parameter functions
$$\boldsymbol{\vartheta}(\text{sex}, \text{smoking}, \text{age}, \boldsymbol{x})$$
to be estimated. There are no longer any restrictions regarding the conditional parameter functions. In this sense, a transformation forest is the “most complex model one can think of” as mentioned in the introduction. The random forest class of models is considered to be very accurate, insensitive to hyperparameter tuning and without a tendency to overfit. In the following, I shall use this method to obtain a benchmark for better-interpretable transformation models following the top-down model selection approach.
The generic random forest algorithm essentially relies on multiple transformation trees fitted to subsamples of the data, with a random selection of variables to be considered for splitting in each node. Unlike the original random forest (Breiman, 2001), a transformation forest can be understood as a procedure assigning a parametric model to each observation. For subject $i$, the forest conditional distribution function is
$$\hat{\mathbb{P}}(\text{BMI} \le y \mid \text{sex}_i, \text{smoking}_i, \text{age}_i, \boldsymbol{x}_i) = \Phi(\boldsymbol{a}(y)^\top \hat{\boldsymbol{\vartheta}}(\text{sex}_i, \text{smoking}_i, \text{age}_i, \boldsymbol{x}_i)).$$
In this sense, a transformation forest “predicts” a fully parametric model for each subject, albeit with a very flexible conditional parameter. The conditional parameter was obtained from a locally adaptive maximum-likelihood estimator based on so-called nearest neighbour weights (Hothorn and Zeileis, 2017). A considerable improvement in the transformation forest log-likelihood () was observed. In fact, this is the largest log-likelihood I was able to achieve. Thus, this transformation forest is the best-fitting model for the BMI data.
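The locally adaptive estimator can be written down compactly. The display below is a sketch of the idea as I understand it from Hothorn and Zeileis (2017), in notation introduced here for illustration only: $N$ is the number of observations, $\ell_i(\boldsymbol{\vartheta})$ is the log-likelihood contribution of observation $i$, and $w_i(\boldsymbol{z})$ is the nearest neighbour weight of observation $i$ for a configuration $\boldsymbol{z}$ of all explanatory variables.

$$
\hat{\boldsymbol{\vartheta}}(\boldsymbol{z}) = \arg\max_{\boldsymbol{\vartheta}} \sum_{i = 1}^{N} w_i(\boldsymbol{z})\, \ell_i(\boldsymbol{\vartheta}),
\qquad
w_i(\boldsymbol{z}) = \sum_{t = 1}^{T} \mathbb{1}\left(\text{observation } i \text{ and } \boldsymbol{z} \text{ share a terminal node in tree } t\right).
$$

Observations that frequently fall into the same terminal nodes as $\boldsymbol{z}$ thus dominate the weighted likelihood, so each configuration receives its own parameter vector.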
On the downside, this black-box model makes it very difficult to understand the impact of the explanatory variables on the conditional BMI distribution. The likelihood-based permutation variable importance (Figure 9) indicated that only sex, age, education, physical activity and smoking have an impact on BMI, where again sex seems to be the most important variable. Age was a more important factor than education or physical activity, and thus the only numeric variable one needs to consider. The association between sex, smoking, age and BMI as described by the transformation forest is given in terms of a partial dependency plot of conditional deciles in Figure 10. In general, the median BMI increases with age, as does the BMI variance. For males, there seemed to be a level-effect whose onset depends on smoking category. Females tended to higher BMI values, and the variance was larger compared to males. There seemed to be a bump in BMI values for females, roughly around years. This corresponds to mothers giving birth to their first child around this age. It is important to note that the right-skewness of the conditional BMI distributions in Figure 10 renders conditional normal distributions inappropriate, even under variance heterogeneity.
This complex model would be sufficient if one were only interested in the estimation of conditional BMI distributions for persons with specific configurations of sex, smoking, age and the remaining explanatory variables. The variable importances can be used to rank variables according to their impact on the conditional BMI distributions but cannot replace effect measures, let alone an assessment of their variability. Communication with subject-matter scientists and publication of results in subject-matter journals require simplification of these models. Top-down transformation choice can help to find models of appropriate complexity, as will be seen in the next section.
4.2 Conditional Transformation Model
The analysis using transformation trees and especially transformation forests revealed strong effects of sex and age; the latter variable was not considered in the analysis presented in Section 3. A more structured model roughly as powerful as the transformation forest must therefore allow the conditional distribution of BMI to change with both sex and age in very general ways. The remaining variables were less important, and one can hopefully cut some corners here by assuming simple linear main effects for these variables. I start the top-down search for a simpler model with a conditional transformation model of the form
$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}, \text{age}, \boldsymbol{x}) = \text{expit}(h_{\text{sex}}(y \mid \text{age}) - \beta_{\text{sex:smoking}} - \boldsymbol{x}^\top \boldsymbol{\gamma}). \tag{7}$$
The transformation function $h_{\text{sex}}(y \mid \text{age})$ implements a sex-specific bivariate smooth-surface function of BMI and age, which is of course monotonic in its first argument. The surface function for males explains age-induced changes in the conditional distribution of BMI. In contrast to transformation forests, the assumption was made that the function is smooth in both $y$ and age and not only in $y$. I parameterise this function as a tensor product of two Bernstein polynomials of order 5, one for BMI and one for age, with sex-specific 36-dimensional parameter vector $\boldsymbol{\vartheta}_{\text{sex}}$, in other words as $h_{\text{sex}}(y \mid \text{age}) = (\boldsymbol{a}(y) \otimes \boldsymbol{b}(\text{age}))^\top \boldsymbol{\vartheta}_{\text{sex}}$. Except for smoking, the remaining variables entered only as the linear shift term $\boldsymbol{x}^\top \boldsymbol{\gamma}$ of main effects. In light of its fifth rank in the permutation variable importance (Figure 9), it may seem a bit inconsistent to treat smoking differently than the other variables. However, the stratified analysis in Section 3 suggested the need for sex-specific smoking effects, and I thus include the interaction term $\beta_{\text{sex:smoking}}$ also in this model. The expit function around the transformation function ensures interpretability of all regression coefficients on the log-odds scale.
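The dimension of the tensor-product parameterisation can be verified with a short sketch. It is written in plain Python rather than with the software used for the analysis; the function names are illustrative, and both BMI and age are assumed to have been rescaled to the unit interval beforehand.

```python
from math import comb


def bernstein_basis(t, order):
    """Bernstein basis functions of the given order at t in [0, 1]."""
    return [comb(order, m) * t**m * (1.0 - t) ** (order - m)
            for m in range(order + 1)]


def tensor_basis(t_bmi, t_age, order=5):
    """Kronecker product of the BMI basis and the age basis (one row)."""
    a = bernstein_basis(t_bmi, order)  # 6 functions of BMI for order 5
    b = bernstein_basis(t_age, order)  # 6 functions of age for order 5
    return [a_j * b_k for a_j in a for b_k in b]


# Rescaled BMI and age values, chosen arbitrarily for the illustration.
row = tensor_basis(0.3, 0.6)
print(len(row))
```

Two bases of order 5 contribute six functions each, so every sex-specific surface requires $6 \times 6 = 36$ coefficients, one per product of basis functions.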
With parameters, the log-likelihood of model (7) was only slightly smaller than the log-likelihood of the transformation forest (). In a certain sense, this conditional transformation model can be seen as an approximation of the black-box transformation forest. The effects of sex, smoking and age, with all remaining variables being constant, are again best visualised using the conditional decile functions (Figure 11). The decile functions are now smooth in age due to the parameterisation of the age effect in terms of Bernstein polynomials. For males, the BMI increased with age; the BMI reduction in males older than years was not visible in the decile curves of the transformation forests (Figure 10). The slope was largest for young men up to years, followed by a linear increase until the age of . The male BMI distribution was right skewed, with only a small increase in the variance towards older people. For females, a bump in the BMI distribution was again identified around the age of , corresponding to pregnancies and breast-feeding times. The effect seemed more pronounced in higher deciles. Right skewness and a variance increase towards older women can be inferred from this figure.
The main advantage of this complexity reduction is the interpretability of the regression coefficients $\beta_{\text{sex:smoking}}$ and $\boldsymbol{\gamma}$ in terms of BMI-independent log-odds ratios. The sex-specific smoking effects and the effects of the remaining variables as odds ratios are given in the left column of Table 3. Further simplification can be achieved by replacing the bivariate surface function of $y$ and age by a sex-specific, BMI-varying linear effect of age in the distribution regression model presented in the next section.
4.3 Distribution Regression
The term “distribution regression” (Chernozhukov et al., 2013) is commonly used to describe response-varying coefficients. In survival analysis, the term “time-varying coefficients” is more typical. Here, a BMI-varying coefficient of age is a means of simplifying the conditional transformation model (7). In the simpler model, I assume a smoothly varying but sex-specific coefficient of age . The transformation function is again the simple transformation of BMI given sex introduced in model (4). The model reads
The log-likelihood decreased considerably in this model with parameters. The effects of smoking and the remaining variables (except age) are given in the middle column of Table 3 as odds ratios. When the dependency of BMI deciles on sex, smoking and age was depicted (Figure 12), the linear structure regarding age was obvious. The age-varying slopes and the pregnancy bump could not be identified by this simpler model. Right-skewness and variance heterogeneity for females remained visible. The variance increase in older males now seemed questionable. For my taste, the replacement of two bivariate functions by two univariate functions does not really help model interpretation, as one would have to plot these two functions in any case. The severe reduction of the log-likelihood indicated that the effect of age is better described in a conditional transformation model of the form (7). Nevertheless, I will go one step further and connect the stratified linear transformation model (5) with a model of the same form featuring age and the lifestyle variables in addition to sex and smoking.
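The idea of a response-varying coefficient can be illustrated with a deliberately simple stand-in: one logistic regression of the indicator 1(Y ≤ c) per threshold c, in the spirit of Chernozhukov et al. (2013). This is not the smooth, jointly estimated BMI-varying coefficient of the model above. The simulated data, the parameterisation logit P(Y ≤ c | x) = α(c) + β(c)x and all function names are assumptions made for this sketch.

```python
import math
import random


def expit(z):
    """Logistic distribution function, numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(x, d, iterations=25):
    """Newton-Raphson fit of logit P(d = 1 | x) = alpha + beta * x."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, di in zip(x, d):
            p = expit(alpha + beta * xi)
            w = p * (1.0 - p)
            g0 += di - p
            g1 += (di - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


def simulate(n, seed=1):
    """Made-up data: logistic errors whose spread grows with x."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        xi = rng.random()
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        z = math.log(u / (1.0 - u))  # standard logistic error
        x.append(xi)
        y.append((1.0 + 2.0 * xi) * z)
    return x, y


def distribution_regression(x, y, thresholds):
    """One logistic regression of the indicator 1(y_i <= c) per threshold c."""
    fits = {}
    for c in thresholds:
        d = [1.0 if yi <= c else 0.0 for yi in y]
        fits[c] = fit_logistic(x, d)
    return fits


x, y = simulate(3000)
fits = distribution_regression(x, y, thresholds=(-2.0, 0.0, 2.0))
for c, (alpha, beta) in fits.items():
    print(f"threshold {c:+.1f}: intercept {alpha:+.2f}, slope {beta:+.2f}")
```

Because the spread of the simulated response grows with x, the fitted slope changes sign between a low and a high threshold, which a single threshold-independent coefficient could not capture.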
4.4 Stratified Linear Transformation Model
I extend the stratified linear transformation model (5) with a sex-specific age effect and a linear predictor of the remaining variables
The log-likelihood was further reduced to . In this model, the sex differences in the age effects were completely gone, as the odds ratios for a one-year increase were () for males and () for females. In light of the more complex structure of the age effect identified by the more complex models, one would incorrectly draw the conclusion of equal age effects for males and females based on this oversimplified model. The effects of the remaining parameters are given in the right column of Table 3.
| ||Model (7)||Model (8)||Model (9)|
|Smoking, sex-specific (first group)|
|Former||1.04 (0.95–1.15)||1.06 (0.96–1.16)||1.06 (0.96–1.16)|
|Light||0.81 (0.70–0.92)||0.81 (0.71–0.93)||0.81 (0.71–0.93)|
|Medium||0.92 (0.80–1.06)||0.94 (0.82–1.08)||0.94 (0.82–1.08)|
|Heavy||0.91 (0.76–1.09)||0.93 (0.78–1.12)||0.94 (0.78–1.12)|
|Smoking, sex-specific (second group)|
|Former||1.44 (1.31–1.59)||1.47 (1.33–1.62)||1.47 (1.33–1.61)|
|Light||1.02 (0.88–1.17)||1.01 (0.88–1.16)||1.01 (0.88–1.16)|
|Medium||0.87 (0.76–1.00)||0.90 (0.79–1.03)||0.91 (0.79–1.04)|
|Heavy||1.13 (0.99–1.30)||1.21 (1.06–1.39)||1.22 (1.07–1.40)|
|Alcohol intake (g/d)||1.00 (1.00–1.00)||1.00 (1.00–1.00)||1.00 (1.00–1.00)|
|Fruit and vegetables|
|Low||1.07 (1.01–1.13)||1.08 (1.02–1.14)||1.08 (1.02–1.14)|
|Physical activity|
|Moderate||1.11 (1.04–1.19)||1.15 (1.08–1.23)||1.16 (1.08–1.23)|
|Low||1.25 (1.16–1.34)||1.30 (1.21–1.40)||1.30 (1.21–1.40)|
|Education|
|Secondary (II)||0.72 (0.66–0.79)||0.79 (0.73–0.87)||0.80 (0.73–0.87)|
|Tertiary (III)||0.48 (0.43–0.52)||0.56 (0.51–0.61)||0.56 (0.51–0.62)|
|Nationality|
|Foreign||1.17 (1.09–1.25)||1.23 (1.15–1.31)||1.24 (1.16–1.32)|
|Residence|
|French speaking||0.89 (0.83–0.94)||0.88 (0.83–0.94)||0.88 (0.83–0.94)|
|Italian speaking||0.81 (0.72–0.93)||0.81 (0.71–0.92)||0.81 (0.71–0.92)|
The three columns presented in Table 3 refer to the same parameters, estimated by three models differing only with respect to the complexity of the age effect. The effects of smoking, alcohol intake, education, physical activity, fruit and vegetable consumption, residence and nationality were remarkably constant. Alcohol intake had no impact on the BMI in this study, and right shifts in BMI distributions were associated with low fruit and vegetable consumption, moderate and low physical activity, short education, being a foreigner or living in the German-speaking part of Switzerland. These conclusions can be drawn from all three models in the same way. The effects of smoking were less pronounced than the effects obtained in the initial analysis that ignored age and the lifestyle variables (Table 2). Light smokers had lower BMIs than never smokers; the remaining effects are questionable.
The core of top-down transformation choice is a family of decreasingly complex, yet fully comparable, conditional transformation models. Model parameterisation and interpretation in the family of transformation models are always based on the conditional distribution function
Unlike most classical models featuring explicit parameters for conditional means or conditional variances, transformation models describe conditional distributions explicitly and moments implicitly. What might seem a disadvantage is in fact, as I hope to have convinced the reader, a very attractive feature of transformation models for regression analysis. In this tutorial, I exclusively defined and interpreted models for conditional distributions. The corresponding distribution functions were used to compare transformation models with the empirical cumulative distribution function (Figures 2, 3, 4 and 7). The conditional transformation function was used to assess deviations from normality in Figure 5. Conditional densities
based on all BMI measurements with sampling weights . Conditional quantile functions
Varying model complexity only affects the flexibility of these functions that characterise conditional distributions, but not the corresponding interpretations.
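The distribution function, the density and the quantile function of a transformation model all derive from one transformation function, which a few lines of code can make concrete. The sketch assumes a logistic (expit) distribution function and a made-up linear transformation function. The coefficient values, the function names and the bisection search are illustrative choices, not the implementation used for the analyses in this paper.

```python
import math


def expit(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def h(y, x, theta=(-10.0, 0.4), beta=0.5):
    """Hypothetical transformation function, increasing in y."""
    return theta[0] + theta[1] * y - beta * x


def h_prime(y, x, theta=(-10.0, 0.4), beta=0.5):
    """Derivative of h with respect to y (constant for this linear h)."""
    return theta[1]


def cdf(y, x):
    """Conditional distribution function F(h(y | x)) with F = expit."""
    return expit(h(y, x))


def density(y, x):
    """Conditional density f(h(y | x)) * h'(y | x), where f = F * (1 - F)."""
    p = cdf(y, x)
    return p * (1.0 - p) * h_prime(y, x)


def quantile(prob, x, lower=0.0, upper=100.0):
    """Conditional quantile by bisection; relies only on monotonicity of h in y."""
    for _ in range(200):
        mid = 0.5 * (lower + upper)
        if cdf(mid, x) < prob:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


def weighted_log_likelihood(ys, xs, weights):
    """Sum of weighted log-densities, one term per observation."""
    return sum(w * math.log(density(y, x)) for y, x, w in zip(ys, xs, weights))


# Conditional deciles for x = 1, obtained by inverting the distribution function.
deciles = [quantile(k / 10.0, x=1.0) for k in range(1, 10)]
print([round(q, 2) for q in deciles])
```

Swapping in a more flexible monotone `h` changes the shape of all three functions, while their interpretation stays the same.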
A unique feature of conditional transformation models is the ability to formulate, estimate, compare, evaluate, interpret and understand models seemingly as far apart as a normal linear model with constant variance and a transformation forest in the same theoretical framework. Straightforward answers to some questions that have plagued data analysis for decades, for example “Is it appropriate to assume normal errors?” or “How should the response be transformed prior to analysis?”, are easily obtained from conditional transformation models.
One practical and interesting question relates to the impact of the order of the Bernstein polynomial. An order of one implements a linear function and, combined with the standard normal distribution function, yields conditional normal distributions. As the order tends to infinity, the Bernstein polynomial converges uniformly to the true and unknown transformation function of the model. Because the transformation function is monotonic, too-erratic behaviour cannot occur, even for very large orders, and overfitting is not an issue (see Hothorn, 2017b, for numerical examples). In the model (3), increasing the order from to led to a very small increase in the log-likelihood from to . In the extreme case of a very large order, the conditional distribution function closely interpolates the empirical cumulative distribution function. The latter estimator is consistent, as is the transformation model (Hothorn et al., 2017).
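The role of the order can be explored with a small, self-contained sketch of Bernstein polynomials on the unit interval. It uses the classical Bernstein approximation, with coefficients set to values of a known target function, rather than the maximum-likelihood estimation used in this paper. The target function and all names are made up for illustration.

```python
import math


def bernstein_basis(t, order):
    """All order + 1 Bernstein basis polynomials evaluated at t in [0, 1]."""
    return [math.comb(order, k) * t**k * (1.0 - t) ** (order - k)
            for k in range(order + 1)]


def bernstein_poly(t, coef):
    """Bernstein polynomial of order len(coef) - 1 with coefficients coef."""
    basis = bernstein_basis(t, len(coef) - 1)
    return sum(c * b for c, b in zip(coef, basis))


def target(t):
    """Made-up smooth, increasing 'true' transformation on [0, 1]."""
    return math.log(1.0 + 9.0 * t)


def max_error(order, grid_size=201):
    """Largest gap between target and its classical Bernstein approximation."""
    coef = [target(k / order) for k in range(order + 1)]
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(target(t) - bernstein_poly(t, coef)) for t in grid)


# Order 1: the two coefficients are the endpoint values of a straight line.
line = [bernstein_poly(t, [2.0, 5.0]) for t in (0.0, 0.5, 1.0)]

# Non-decreasing coefficients give a non-decreasing polynomial.
grid = [i / 100 for i in range(101)]
values = [bernstein_poly(t, [0.0, 0.1, 0.5, 0.6, 2.0, 2.1]) for t in grid]
is_monotone = all(a <= b for a, b in zip(values, values[1:]))

# The approximation error shrinks as the order grows.
errors = {order: max_error(order) for order in (1, 5, 20, 80)}
print(line, is_monotone, errors)
```

Monotonicity is controlled entirely through the ordering of the coefficients, which is why a high order adds flexibility without allowing the fitted function to oscillate.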
This tutorial did not address any issue regarding model estimation or model inference. Details about maximum-likelihood estimation in conditional transformation models can be found in Hothorn et al. (2017). Locally adaptive maximum-likelihood estimation for transformation trees and transformation forests has been introduced in Hothorn and Zeileis (2017). More elaborate discussions of model parameterisation in conditional transformation models and of connections to other models can be found in Hothorn et al. (2014) and Hothorn et al. (2017). Applications of conditional transformation models can be found in Hothorn et al. (2013), Möst et al. (2014) and Möst and Hothorn (2015). An introduction to the mlt add-on package (Hothorn, 2017a) for maximum-likelihood estimation in conditional transformation models, including models for ordinal or censored and truncated responses, is available in Hothorn (2017b).
Data from the Swiss Health Survey 2012 can be obtained from the Swiss Federal Statistics Office (Email: email@example.com). Data is available for scientific research projects, and a data protection application form must be submitted. More information can be found here http://www.bfs.admin.ch/bfs/de/home/statistiken/gesundheit/erhebungenSupplementary.
The code used for producing the results presented in this paper can be evaluated on a smaller artificial data set sampled from the transformation forest by running demo("BMI") from the trtf package (Hothorn, 2017c).
I thank the students participating in the course “STA660 Advanced R Programming” that I taught in the spring semester of 2017 for producing the code underlying Figure 8 as part of their homework assignments. Parts of this paper were written during a research sabbatical at Universität Innsbruck financially supported by the Swiss National Science Foundation (grant number IZSEZ0_177091).
- Breiman (2001) Breiman L (2001). “Random Forests.” Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324.
- Bundesamt für Statistik (2013) Bundesamt für Statistik (2013). Die Schweizerische Gesundheitsbefragung 2012 in Kürze – Konzept, Methode, Durchführung. Bern. URL http://www.bfs.admin.ch.
- Chernozhukov et al. (2013) Chernozhukov V, Fernández-Val I, Melly B (2013). “Inference on Counterfactual Distributions.” Econometrica, 81(6), 2205–2268. doi:10.3982/ECTA10582.
- Fahrmeir et al. (2013) Fahrmeir L, Kneib T, Lang S, Marx B (2013). Regression. Models, Methods and Applications. Springer-Verlag, New York, U.S.A.
- Farouki (2012) Farouki RT (2012). “The Bernstein Polynomial Basis: A Centennial Retrospective.” Computer Aided Geometric Design, 29(6), 379–419. doi:10.1016/j.cagd.2012.03.001.
- Hothorn (2017a) Hothorn T (2017a). mlt: Most Likely Transformations. R package version 0.2-1, URL https://CRAN.R-project.org/package=mlt.
- Hothorn (2017b) Hothorn T (2017b). Most Likely Transformations: The mlt Package. R package vignette version 0.2-0, URL https://CRAN.R-project.org/package=mlt.docreg.
- Hothorn (2017c) Hothorn T (2017c). trtf: Transformation Trees and Forests. R package version 0.3-0, URL https://CRAN.R-project.org/package=trtf.
- Hothorn et al. (2013) Hothorn T, Kneib T, Bühlmann P (2013). “Conditional Transformation Models by Example.” In VMR Muggeo, V Capursi, G Boscaino, G Lovison (eds.), “Proceedings of the 28th International Workshop on Statistical Modelling,” pp. 15–26. Universitá Degli Studi Di Palermo. ISBN 978-88-96251-47-8.
- Hothorn et al. (2014) Hothorn T, Kneib T, Bühlmann P (2014). “Conditional Transformation Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 3–27. doi:10.1111/rssb.12017.
- Hothorn et al. (2017) Hothorn T, Möst L, Bühlmann P (2017). “Most Likely Transformations.” Scandinavian Journal of Statistics. Accepted 2017-06-19, URL https://arxiv.org/abs/1508.06749.
- Hothorn and Zeileis (2017) Hothorn T, Zeileis A (2017). “Transformation Forests.” Technical report, arXiv 1701.02110. URL https://arxiv.org/abs/1701.02110.
- Liu et al. (2017) Liu Q, Shepherd BE, Li C, Harrell FE (2017). “Modeling Continuous Response Variables Using Ordinal Regression.” Statistics in Medicine. doi:10.1002/sim.7433.
- Lohse et al. (2017) Lohse T, Rohrmann S, Faeh D, Hothorn T (2017). “Continuous Outcome Logistic Regression for Analyzing Body Mass Index Distributions.” F1000Research, 6, 1933. doi:10.12688/f1000research.12934.1.
- Manuguerra and Heller (2010) Manuguerra M, Heller GZ (2010). “Ordinal Regression Models for Continuous Scales.” The International Journal of Biostatistics, 6(1). doi:10.2202/1557-4679.1230.
- Möst and Hothorn (2015) Möst L, Hothorn T (2015). “Conditional Transformation Models for Survivor Function Estimation.” International Journal of Biostatistics. doi:10.1515/ijb-2014-0006.
- Möst et al. (2014) Möst L, Schmid M, Faschingbauer F, Hothorn T (2014). “Predicting Birth Weight with Conditionally Linear Transformation Models.” Statistical Methods in Medical Research. doi:10.1177/0962280214532745.
- R Core Team (2017) R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
- UNESCO Institute for Statistics (2012) UNESCO Institute for Statistics (2012). International Standard Classification of Education – ISCED 2011. Montreal. URL http://www.uis.unesco.org/Education/Documents/isced-2011-en.pdf.