1 Introduction
Let’s face it. The work of statisticians is considered boring in the public eye. Nobody publishes page turners on the thrilling aspects of data analysis, yet the quest for a good model can be as exciting as detective work. One of my favourite paperback characters is LAPD detective Harry Bosch in the crime novels of Michael Connelly. Like Harry, who follows the traces left by the murderer at the crime scene to form a theory about the culprit, the experienced data analyst follows the traces left by the data-generating process in the residuals of an oversimplistic model. Unlike Harry, who of course always succeeds in arresting the murderer, the statistician can never be sure whether the correct or even an approximately useful model was found. In the quest for a suspect or for a good model, parsimonious explanations are preferred by Occam’s razor. Therefore, in residual-based model diagnostics, the data analyst starts with a very simple model, whose complexity is increased by stepwise refinement until all signs of lack of fit disappear from the residuals. I refer to such a procedure as “bottom-up model choice” because one moves from simple to more complex models.

In this tutorial, I consider moving in the opposite direction, i.e. from complex to simple models, for distributional regression. This “top-down approach” to model choice begins with the most complex model that one can come up with that explains both signal and noise without overfitting the data. In a regression setup, such a model would describe as accurately as possible the conditional distribution of the response given the explanatory variables. Once such a model is established as a benchmark for comparison with simpler models, one can start to reduce model complexity stepwise. In the crime novel scenario, the top-down data analyst takes the role of an eyewitness at the scene. What one “sees” in this process is, of course, still a portrayal and not the real thing. There is no way to “see” the correct model. In top-down model choice, however, the trajectories through model space will be guided by assessments of vital models. In bottom-up model choice, by contrast, the horizon is limited by the amount of information that one can find in traces in deceased models.
In this tutorial, I focus on top-down model choice in continuous regression problems. Conceptually, a regression model is a family of conditional distributions $\mathbb{P}_{Y \mid \mathbf{X} = \mathbf{x}}$ for some response $Y$ given a specific configuration of explanatory variables $\mathbf{X} = \mathbf{x}$. The model describes both signal and noise, i.e. the variability explained by the explanatory variables and the unexplained variability. Unfortunately, this point of view only applies to relatively simple models that assume a certain parametric distribution, whose parameters partially depend on the explanatory variables. So-called “nonparametric regression models” (Fahrmeir et al., 2013) often restrict their attention to the signal $\mathbb{E}(Y \mid \mathbf{X} = \mathbf{x}) = f(\mathbf{x})$, with nonlinear conditional mean function $f$, while treating the noise, i.e. all higher moments of the conditional distribution, as a nuisance or essentially ignoring it. Such procedures, for example random forests (Breiman, 2001), are extremely powerful when estimating complex conditional mean functions. However, one cannot infer the entire conditional distribution using random forests or similar methods. This renders top-down model choice impossible because reductions in complexity require switching between different model classes or even crossing the borders between the parametric and nonparametric empires. The comparison of two models from different classes is difficult, and thus it is difficult to decide whether the simpler model is more appropriate than the more complex one.
The implementation of top-down model choice is much simpler when the most complex and the most simple model are members of the same family. Conditional transformation models from the transformation family of distributions (Hothorn et al., 2014, 2017) include many important established off-the-shelf regression models. In addition, tailored models can be created, in vivo with our brains and in silico using open-source software, which allow smooth transitions between models of different complexity. In a nutshell, the class of conditional transformation models assumes that the conditional distribution function of $Y$ given $\mathbf{X} = \mathbf{x}$ can be written as the composition

$$\mathbb{P}(Y \le y \mid \mathbf{X} = \mathbf{x}) = F(h(y \mid \mathbf{x}))$$

of an a priori specified continuous cumulative distribution function $F$ and some conditional transformation function $h(y \mid \mathbf{x})$. The latter function monotonically increases in its first argument $y$ for each $\mathbf{x}$. It is important to note that the entire conditional distribution, and not just its mean, is modelled by $h$. In this sense, and unlike common regression models, there is no decomposition into signal (the conditional mean) and noise (the remaining higher moments) in this class of transformation models. Changing model complexity means changing the complexity of the conditional transformation function $h$, and I thus refer to top-down model choice in conditional transformation models as top-down transformation choice.

Model complexity in the class of conditional transformation models is linked to smooths of varying complexity with respect to $y$. These conditional transformation functions may vary with the explanatory variables $\mathbf{x}$ in arbitrary ways, including interactions and nonlinearities. In this paper, I consider model choice as an art rather than an exact science. No formal algorithm leading to an “optimal” model will be presented. Instead, I argue that the possibility of modelling a cascade of decreasingly complex conditional distribution functions in the same model class gives us new possibilities to investigate goodness of fit or lack thereof. A fair amount of subjectivity will remain in this process, as is always the case in classical model diagnostics. I shall be less concerned with the technical subtleties of parameter estimation in the models discussed here and refer the reader to more formal results published elsewhere. Instead, I illustrate practical aspects of top-down transformation choice by a tour de force through transformation models describing the impact of lifestyle parameters, such as smoking or physical activity, on the body mass index (BMI) distribution in the Swiss population.
I will proceed by introducing the Swiss Health Survey and the variables dealt with in Section 2. In a very simple setup, I first illustrate a bottom-up route, starting with a normal linear model and ending with a more complex non-normal transformation model, for describing the BMI distribution of females and males at various levels of smoking (Section 3). I then try to reduce the complexity again until an interpretable model that fits the data roughly as well as the most complex model can be found. In addition to a consideration of sex and smoking, I consider age and some lifestyle variables in a more realistic setup of top-down transformation choice in Section 4.
2 Body Mass Index in the Swiss Health Survey
The Swiss Health Survey (SHS) is a population-based cross-sectional survey. It has been conducted every five years since 1992 by the Swiss Federal Statistical Office (Bundesamt für Statistik, 2013). For this tutorial, I restricted the sample to individuals aged between and years from the 2012 survey. Study samples were obtained by stratified random sampling using a database with all private household landline telephone numbers. Data were collected by telephone interviews and self-administered questionnaires. Height and weight were self-reported in telephone interviews. Observations with extreme values of height and weight were excluded (highest and lowest percentile by sex). Smoking status was categorised into never, former, light ( cigarettes per day), moderate () and heavy smokers (). Never smokers stated that they did not currently smoke and had never regularly smoked for longer than six months; former smokers had quit smoking but had smoked for more than six months during their life course. One cigarillo or pipe counted as two cigarettes, and one cigar counted as four cigarettes.

The following lifestyle variables were included and assessed by telephone interview and self-administered questionnaire: fruit and vegetable consumption, physical activity, alcohol intake, level of education, nationality and place of residence. Fruit and vegetable consumption was combined in one binary variable that comprised the information on whether both fruits and vegetables were consumed daily or not. The variable describing physical activity was defined as the number of days per week a subject started to sweat during leisure time physical activity and was categorised as days, days and none. Alcohol intake was included using the continuous variable gram per day. Education was included as highest degree obtained and was categorised into mandatory (International Standard Classification of Education, ISCED 1–2), secondary (ISCED 3–4), and tertiary (ISCED 5–8) (UNESCO Institute for Statistics, 2012). Nationality had the two categories Swiss and foreign. Language reflected cultural and regional differences within Switzerland, and the three categories German/Romansh, French and Italian were taken into account. Sampling weights of this representative survey were considered for the estimation of all models reported in this tutorial. More detailed information about this study and an analysis using simple transformation models is given in Lohse et al. (2017).

3 Sex- and Smoking-specific BMI Distributions
I start with the very simple situation where the conditional distribution of BMI depends on sex and smoking only. Smoking was assessed on five different levels (never smoked, former smokers, light smokers, medium smokers and heavy smokers). Therefore, I am interested in the conditional distribution of BMI in these groups of participants. Figure 1 presents the empirical cumulative distribution functions, i.e. the nonparametric maximum-likelihood estimators for the underlying continuous distributions, for each of the 10 combinations of sex and smoking. At the same time, the plot also represents the uncompressed raw data. With a high-enough resolution, one could recover the original BMI values and the corresponding sampling weights from such an image. Consequently, goodness of fit can be assessed by overlaying the empirical cumulative distribution functions with their model-based counterparts in this simple setup. I will try to find a suitable parametric model this way. In addition to this rather informal approach, I will study the increase of the log-likelihoods as model complexity is increased. In the classical bottom-up approach, one would start with a very simple model assuming conditional normal distributions. The next section discusses possible choices in this model class.
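The weighted empirical cumulative distribution function used as the reference for all later comparisons can be sketched in a few lines. This is only an illustration: the function name `weighted_ecdf`, the BMI values and the sampling weights are made up, and the survey data themselves are not reproduced here.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_ecdf(values, weights):
    """Return a function y -> weighted share of observations with value <= y."""
    pairs = sorted(zip(values, weights))
    xs = [v for v, _ in pairs]
    cum = list(accumulate(w for _, w in pairs))  # running sum of weights
    total = cum[-1]

    def ecdf(y):
        k = bisect_right(xs, y)  # number of observations with value <= y
        return cum[k - 1] / total if k > 0 else 0.0

    return ecdf


# Hypothetical BMI values and sampling weights for one sex-by-smoking group
bmi = [21.5, 24.0, 19.8, 27.3, 24.0]
w = [1.0, 2.0, 1.0, 0.5, 1.5]
F_hat = weighted_ecdf(bmi, w)
```

Evaluating `F_hat` on a fine grid of BMI values gives the step function that a model-based distribution function would be overlaid on.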
3.1 Normal Models
The normal cell-means model with constant variance

$$\text{BMI} \mid \text{sex}, \text{smoking} \sim \mathrm{N}(\mu_{\text{sex:smoking}}, \sigma^2) \tag{1}$$

assumes normal distributions with a common variance for all conditional BMI distributions. Means are allowed to vary between the groups defined by sex and smoking. The notation $\mu_{\text{sex:smoking}}$ indicates that the conditional mean is specific to each combination of sex and smoking in this cell-means model, i.e. there are a total of 10 parameters $\mu_{\text{sex:smoking}}$. With a residual standard error of , a log-likelihood of was obtained, and the estimated cell-means, with confidence intervals, are shown in Table 1.

Smoking | Female | Male
Never | 23.57 (23.47–23.68) | 25.18 (25.06–25.30)
Former | 23.94 (23.76–24.12) | 26.46 (26.30–26.62)
Light | 22.85 (22.59–23.11) | 25.00 (24.73–25.27)
Medium | 23.45 (23.18–23.72) | 25.00 (24.75–25.26)
Heavy | 23.72 (23.36–24.08) | 25.90 (25.65–26.15)
How well does this model fit the data? I want to answer this question by graphically comparing the conditional distribution functions obtained from this model to the corresponding empirical conditional distributions, and thus the raw data. The model-based conditional cumulative distribution functions overlay the empirical cumulative distribution functions in Figure 2. While not being completely out of line, the considerable differences between the empirical and model-based distribution functions certainly leave room for improvement. An obvious increase in complexity is to allow group-specific variances in the model

$$\text{BMI} \mid \text{sex}, \text{smoking} \sim \mathrm{N}(\mu_{\text{sex:smoking}}, \sigma^2_{\text{sex:smoking}}). \tag{2}$$

The log-likelihood in this 20-parameter model increased to , and the corresponding conditional distribution functions in Figure 3 were closer to the empirical cumulative distribution functions. For males, the model-based normal distributions were very close to the empirical conditional BMI distributions. For females, however, there still was a considerable discrepancy between model and data, especially in the lower tails. The BMI distributions of females deviated from normality much more than the BMI distributions of males (note that I am not saying that males are normal and females are not!). It is clear that one has to move to a non-normal error model, at least for females, and the transformation models discussed below are a convenient way to do so.
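The step from a common variance to group-specific variances can only increase the maximised log-likelihood, because the first model is nested in the second. The following sketch illustrates this with made-up BMI values for two groups; it ignores the sampling weights used in the paper, and the function name `normal_loglik` is hypothetical.

```python
import math


def normal_loglik(groups, common_variance):
    """Maximised normal log-likelihood with one mean per group.

    groups: dict mapping a group label to a list of observations.
    common_variance: True for one shared variance, False for one per group.
    """
    n = {g: len(ys) for g, ys in groups.items()}
    mean = {g: sum(ys) / n[g] for g, ys in groups.items()}
    rss = {g: sum((y - mean[g]) ** 2 for y in ys) for g, ys in groups.items()}
    if common_variance:
        pooled = sum(rss.values()) / sum(n.values())  # ML estimate, divides by n
        var = {g: pooled for g in groups}
    else:
        var = {g: rss[g] / n[g] for g in groups}
    return sum(
        -0.5 * n[g] * math.log(2 * math.pi * var[g]) - rss[g] / (2 * var[g])
        for g in groups
    )


# Hypothetical data: the first group is more spread out than the second
groups = {
    ("female", "never"): [19.5, 21.0, 22.4, 23.1, 26.8, 31.0],
    ("male", "never"): [23.0, 24.2, 25.1, 25.9, 26.5, 27.3],
}
loglik_common = normal_loglik(groups, common_variance=True)
loglik_separate = normal_loglik(groups, common_variance=False)
```

How large the gain is relative to the number of additional parameters remains a matter of judgement, in line with the informal comparisons made in the text.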
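The normal models are a special case of transformation models and thus the latter class is a very natural extension of the former. To see the connection, consider the conditional distribution function

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi\left(\frac{y - \mu_{\text{sex:smoking}}}{\sigma_{\text{sex:smoking}}}\right) = \Phi(\vartheta_{1,\text{sex:smoking}} + \vartheta_{2,\text{sex:smoking}}\, y) = \Phi(h(y \mid \text{sex}, \text{smoking})),$$

where $h(y \mid \text{sex}, \text{smoking})$ is a linear function of $y$ with parameters $\vartheta_{1,\text{sex:smoking}} = -\mu_{\text{sex:smoking}} / \sigma_{\text{sex:smoking}}$ and $\vartheta_{2,\text{sex:smoking}} = 1 / \sigma_{\text{sex:smoking}}$. In more complex models, I will use the parameter $\vartheta$ (or parameter vector $\boldsymbol{\vartheta}$ for basis-transformed response values $\mathbf{a}(y)$) for modelling transformations of the response $y$. Shift parameters describing effects of explanatory variables only, i.e. no response-varying effects, will be denoted by $\beta$ or $\boldsymbol{\beta}$ later on. The above reparameterisation shows, quite unsurprisingly, that a normal model features a linear transformation function $h$. Consequently, non-normal conditional distributions can be obtained by allowing a nonlinear transformation function for each combination of sex and smoking in the transformation models presented in the next section.

3.2 Non-normal Transformation Models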
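As a starting point for the non-normal case, the reparameterisation of the normal model as a linear transformation model can be checked numerically. In this minimal sketch the mean is the Table 1 estimate for female never smokers, whereas the standard deviation is a made-up value; the intercept and slope follow from dividing $y - \mu$ by $\sigma$.

```python
from statistics import NormalDist

mu = 23.57    # cell mean for female never smokers (Table 1)
sigma = 4.0   # hypothetical standard deviation, for illustration only

# Linear transformation function h(y) = theta1 + theta2 * y
theta1 = -mu / sigma
theta2 = 1.0 / sigma

std_normal = NormalDist()


def cdf_direct(y):
    """Distribution function of N(mu, sigma^2)."""
    return NormalDist(mu, sigma).cdf(y)


def cdf_transformation(y):
    """The same distribution function written as Phi(h(y))."""
    return std_normal.cdf(theta1 + theta2 * y)
```

Both functions return the same probabilities up to rounding error; replacing the linear `theta1 + theta2 * y` by a nonlinear increasing function is what produces a non-normal distribution.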
The core concept of a transformation model is a potentially nonlinear monotonically increasing transformation function $h$, here for each combination of sex and smoking. For computational convenience, I parameterised the transformation functions in terms of Bernstein polynomials (Farouki, 2012). For each of the 10 groups, I modelled the transformation function $h(y \mid \text{sex}, \text{smoking}) = \mathbf{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex:smoking}}$ by a Bernstein polynomial of order $M$, where $\mathbf{a}(y)$ are the corresponding $M + 1$ basis functions of BMI. A monotonically increasing Bernstein polynomial of order $M$ features monotonically increasing parameters $\vartheta_1 \le \dots \le \vartheta_{M+1}$. Maximum-likelihood estimation was performed (Hothorn et al., 2017) using the mlt (Hothorn, 2017a, b) add-on package to the R system for statistical computing (R Core Team, 2017). With the corresponding $10 (M + 1)$ parameters in total, the maximum log-likelihood of the model

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi(\mathbf{a}(y)^\top \boldsymbol{\vartheta}_{\text{sex:smoking}}) \tag{3}$$

was ; the notation $\boldsymbol{\vartheta}_{\text{sex:smoking}}$ indicates that the parameters were estimated for each combination of sex and smoking. One can hardly differentiate the resulting model-based conditional distribution functions from the empirical cumulative BMI distribution functions in Figure 4. Because a separate transformation function was estimated for each combination of sex and smoking, this model can be referred to as a transformation model stratified by sex and smoking. Based on this model, one can understand non-normality as deviation of the transformation function from a linear function. Figure 5 shows the sex- and smoking-specific transformation functions of model (3) along with the linear transformation functions obtained from the normal model (2) with heterogeneous variances. In the centre of the distributions, the two curves overlap, but the tails are not described well by the normal distribution. The differences between the two curves are more pronounced for females, corresponding to the larger deviations from normality observed earlier.
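To make the Bernstein parameterisation concrete, the following sketch evaluates the basis functions and a transformation function with increasing coefficients. The coefficients, the order (five, as implied by six coefficients) and the BMI support are all made up for illustration; fitted values would come from maximum-likelihood estimation, which is not shown.

```python
from math import comb
from statistics import NormalDist


def bernstein_basis(y, order, lower, upper):
    """Bernstein basis functions of the given order, evaluated at y."""
    t = (y - lower) / (upper - lower)  # map y to the unit interval
    return [comb(order, m) * t**m * (1 - t) ** (order - m)
            for m in range(order + 1)]


def transformation(y, theta, lower, upper):
    """h(y) = a(y)' theta for a coefficient vector theta."""
    basis = bernstein_basis(y, len(theta) - 1, lower, upper)
    return sum(a * th for a, th in zip(basis, theta))


theta = [-3.0, -1.5, 0.2, 1.0, 2.2, 3.5]  # hypothetical, increasing
lower, upper = 15.0, 45.0                 # hypothetical BMI support

grid = [lower + k * (upper - lower) / 50 for k in range(51)]
h_values = [transformation(y, theta, lower, upper) for y in grid]
cdf_values = [NormalDist().cdf(h) for h in h_values]  # Phi(h(y))
```

Increasing coefficients are sufficient for an increasing polynomial, so `cdf_values` forms a valid distribution function on the grid without any further constraint.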
One nice feature of model (3) is the possibility to easily derive characterisations of the distribution other than the distribution function. Density, quantile, hazard, cumulative hazard or other characterising functions can be derived from (3), and Figure 6 depicts the densities for males and females at the various levels of smoking. The right-skewness of the distribution, and thus deviation from normality, was more pronounced for females. The BMI distributions for females put more weight on smaller BMI values than those for males. Except for heavy smokers, the effects of smoking seemed to be rather small.
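The density follows from the distribution function $\Phi(h(y))$ by the chain rule, i.e. it is the standard normal density evaluated at $h(y)$ times the derivative $h'(y)$. A minimal sketch for a Bernstein-polynomial transformation is given below; the coefficients and the BMI support are hypothetical, and the derivative uses the standard differencing rule for Bernstein coefficients.

```python
from math import comb
from statistics import NormalDist


def bernstein(t, coefs):
    """Bernstein polynomial with the given coefficients at t in [0, 1]."""
    M = len(coefs) - 1
    return sum(c * comb(M, m) * t**m * (1 - t) ** (M - m)
               for m, c in enumerate(coefs))


def h(y, theta, lower, upper):
    return bernstein((y - lower) / (upper - lower), theta)


def h_prime(y, theta, lower, upper):
    """Derivative of h: a Bernstein polynomial of order M - 1."""
    M = len(theta) - 1
    diffs = [M * (b - a) for a, b in zip(theta, theta[1:])]
    return bernstein((y - lower) / (upper - lower), diffs) / (upper - lower)


def density(y, theta, lower, upper):
    """Density implied by the distribution function Phi(h(y))."""
    return NormalDist().pdf(h(y, theta, lower, upper)) * h_prime(y, theta, lower, upper)


theta = [-3.0, -1.5, 0.2, 1.0, 2.2, 3.5]  # hypothetical, increasing
lower, upper = 15.0, 45.0                 # hypothetical BMI support
```

Quantiles could be obtained in the same spirit by numerically inverting the increasing function $\Phi(h(y))$.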
The model fit of this stratified transformation model is now satisfactory, as it essentially smoothly interpolates the empirical distribution functions and thus the data in Figure 4. This most complex model describes the data well, but unfortunately, it is difficult to learn anything from this model. That is, one wants to understand the differences between the conditional distributions in terms of simple parameters and not complex nonlinear functions. A simpler model is needed. A top-down approach to transformation choice might help to identify a model with simpler and interpretable transformation functions, but any necessary compromises to the model fit should not be too demanding.

Because the BMI distributions differed most between males and females, I first simplify the model by conditioning on smoking and stratifying by sex, i.e. I introduce sex-specific transformations $h(y \mid \text{sex})$ and sex-specific smoking effects $\beta_{\text{sex:smoking}}$, the latter being constant for all arguments $y$ of the conditional distribution function, in the model

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \Phi(h(y \mid \text{sex}) - \beta_{\text{sex:smoking}}). \tag{4}$$

This model features two transformation functions $h(y \mid \text{female})$ and $h(y \mid \text{male})$. For never smokers (the reference category), these two transformation functions describe the conditional BMI distributions, i.e.

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking} = \text{never}) = \Phi(h(y \mid \text{sex})).$$

For the remaining smoking categories, one sex-specific parameter describes how the conditional BMI distribution of a smoker differs from the conditional BMI distribution of a person who never smoked by a simple shift term $\beta_{\text{sex:smoking}}$. Because of the “linear” shift term, this model could be referred to as a stratified linear transformation model. This is, as often in statistics, a misnomer, because the transformation of the response, i.e. of the BMI values $y$, is nonlinear.
The log-likelihood for this model with parameters was found to be , a moderate reduction compared to the log-likelihood of the most complex transformation model (). Figure 7 shows only minor differences between the empirical and model-based conditional distribution functions. Thus, it seems that a more parsimonious model was found without paying too high a price in terms of log-likelihood reduction.
The conceptual problem with this model, however, is the lack of interpretability of the shift term $\beta_{\text{sex:smoking}}$. In contrast to the means in the normal models (1) and (2), there is no direct interpretation of $\beta_{\text{sex:smoking}}$ in terms of moments of the conditional distribution described by this model. This issue can be addressed by changing the cumulative distribution function of the standard normal to the cumulative distribution function of the standard logistic

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \operatorname{expit}(h(y \mid \text{sex}) - \beta_{\text{sex:smoking}}). \tag{5}$$

When the cut-off $y$ is fixed, this is a logistic regression model for the binary outcome $\text{BMI} \le y$ vs. $\text{BMI} > y$. The transformation function $h(y \mid \text{sex})$ is now a sex-specific intercept, and $\beta_{\text{sex:smoking}}$ are the sex-specific log-odds ratios for the event $\text{BMI} > y$ compared to the baseline category (never smokers). Because this shift term does not depend on $y$, the model assumes proportionality of the smoking odds with respect to the cut-off $y$. Stratification by sex allows non-proportional smoking odds with respect to sex. In fact, the sex-specific conditional distributions of males and females can still differ in very general ways because two separate Bernstein polynomials describe the conditional distributions for males and females. The model can be seen as a stratified proportional odds model for continuous responses or a continuous form of logistic regression analyses, jointly performed for all possible cut-off points $y$ under the assumption of constant parameters $\beta_{\text{sex:smoking}}$. Similar models, however without stratification, were studied by Manuguerra and Heller (2010) using parametric intercept functions and recently by Liu et al. (2017) treating the intercept function as a nuisance parameter in nonparametric maximum likelihood estimation. Lohse et al. (2017) provide a comparison of parameter estimation in the presence of interval-censored body mass index observations.

The parameterisation with a negative shift term seems unconventional from a logistic regression point of view, but it simplifies interpretation. With model (5), $\mathbb{E}(h(\text{BMI} \mid \text{sex}) \mid \text{sex}, \text{smoking}) = \beta_{\text{sex:smoking}}$, and thus positive shift parameters indicate a shift of the BMI distribution towards higher BMI values. Corresponding odds ratios larger than one mean that BMI distributions are shifted to the right, compared to the BMI distribution in the reference category.
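The proportional odds property of a logistic transformation model with a negative shift term can be verified numerically: whatever cut-off is chosen, the odds ratio for exceeding it equals the exponential of the shift parameter. In this sketch the transformation function is a made-up linear function chosen only for brevity (any increasing function would do), and the shift is set to the logarithm of the Table 2 odds ratio for male former smokers.

```python
from math import exp, log


def expit(z):
    return 1.0 / (1.0 + exp(-z))


def h(y):
    """Hypothetical increasing transformation function."""
    return 0.4 * (y - 24.0)


def odds_greater(y, beta):
    """Odds of BMI > y when P(BMI <= y) = expit(h(y) - beta)."""
    p_le = expit(h(y) - beta)
    return (1.0 - p_le) / p_le


beta = log(1.95)  # log-odds ratio, male former vs. never smokers (Table 2)

# Odds ratios relative to the reference category (beta = 0) at three cut-offs
ratios = [odds_greater(y, beta) / odds_greater(y, 0.0) for y in (20.0, 25.0, 30.0)]
```

All entries of `ratios` coincide with `exp(beta)`, which is why a single number per smoking category suffices to summarise the effect.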
Unfortunately, interpretability does not come for free: there was some further reduction in the log-likelihood (). However, the model-based and empirical conditional BMI distribution functions look very much the same as presented in Figure 7 (additional plot not shown). The sex-specific BMI-independent odds ratios of smoking, compared to never smoking, are given in Table 2. Former smokers had, on average, a larger BMI compared to never smokers, and the effect was stronger for males. A similar effect was observed for male heavy smokers. Female light smokers showed a BMI distribution shifted to the left, compared with female never smokers.
Smoking | Female | Male
Never | 1 | 1
Former | 1.19 (1.08–1.31) | 1.95 (1.77–2.14)
Light | 0.75 (0.65–0.85) | 0.95 (0.82–1.09)
Medium | 0.98 (0.85–1.12) | 0.93 (0.81–1.06)
Heavy | 1.10 (0.92–1.32) | 1.43 (1.25–1.63)
Maintaining interpretability, one could go further and assume equal smoking effects for males and females in the model

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}) = \operatorname{expit}(h(y \mid \text{sex}) - \beta_{\text{smoking}}).$$

The log-likelihood was again reduced () for this model with parameters. In addition, the odds ratios presented in Table 2 indicate severe differences in the smoking effects between males and females; therefore, I refrain from looking at this or even simpler models and stop the top-down transformation choice here. Of course, this very simple example only worked because it was possible to compare models and raw data directly on the scale of the conditional BMI distribution functions for two categorical explanatory variables, sex and smoking. In the second part, I will consider additional, and also numeric, explanatory variables in a more realistic setup.
4 Conditional BMI Distributions
My aim is to estimate the conditional BMI distribution given sex, smoking, age and the lifestyle variables alcohol intake, education, physical activity, fruit and vegetable consumption, residence and nationality as explanatory variables $\mathbf{x}$. In the conditional transformation model

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}, \text{age}, \mathbf{x}) = F(h(y \mid \text{sex}, \text{smoking}, \text{age}, \mathbf{x}))$$

the conditional transformation function $h$ depends on these variables in a yet unspecified way. Top-down transformation choice ideally allows one to start without too many headaches, i.e. with an algorithm for fitting this model that handles the potentially many explanatory variables of mixed type and allows relatively complex nonlinear transformation functions. Such a model can be written as

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}, \text{age}, \mathbf{x}) = F(\mathbf{a}(y)^\top \boldsymbol{\vartheta}(\text{sex}, \text{smoking}, \text{age}, \mathbf{x})),$$

assuming that each conditional distribution is parameterised in terms of a Bernstein polynomial of order . The parameters of this polynomial, however, depend on the explanatory variables in a potentially complex way, featuring interactions and nonlinearities. Tree and forest algorithms (Hothorn and Zeileis, 2017) allow such “conditional parameter functions” $\boldsymbol{\vartheta}(\cdot)$, and thus the corresponding conditional BMI distributions, to be estimated in a black-box manner without the necessity to a priori specify any structure of $\boldsymbol{\vartheta}(\cdot)$. I will first use trees and forests to understand the complexity of the impact of the explanatory variables on the BMI distribution. Later on, I will apply a top-down approach to transformation choice to obtain simpler transformation models that allow more straightforward interpretation.
4.1 Transformation Trees and Forests
A transformation tree (Hothorn and Zeileis, 2017) starts with an unconditional transformation model

$$\mathbb{P}(\text{BMI} \le y) = F(\mathbf{a}(y)^\top \boldsymbol{\vartheta}) \tag{6}$$

and a corresponding maximum-likelihood estimator $\hat{\boldsymbol{\vartheta}}$. The algorithm proceeds by assessing correlations between the score contributions evaluated at $\hat{\boldsymbol{\vartheta}}$ and the explanatory variables sex, smoking, age and $\mathbf{x}$. A binary split is implemented in the most discriminating cut-off point of the variable showing the highest correlation to any score. The procedure is repeated until a certain stop criterion applies. The result is a partition of the data. The algorithm is sensitive to distributional changes, i.e. the conditional BMI distributions in the subgroups of this partition may vary with respect to the mean BMI and also with respect to higher BMI moments. In each subgroup, the unconditional model (6) was used to estimate $\boldsymbol{\vartheta}$ for this subgroup. Because each observation in this subgroup is then associated with a dedicated parameter vector $\hat{\boldsymbol{\vartheta}}$, the log-likelihood for the tree model could be evaluated as the sum of the log-likelihoods in the subgroups. The log-likelihood of the tree presented in Figure 8 is . The first split is in sex, so in fact two sex-specific models are given here. Three age groups (, , ) for females and three age groups (, , ) for males are distinguished. Education contributed to understanding the BMI distribution of females and males. Location, scale and shape of the conditional BMI distributions varied considerably. The variance increased with age, and higher-educated people tended to have lower BMI values. These are interesting insights, but the model is of course very rough.
A transformation forest (Hothorn and Zeileis, 2017) allows less rough conditional parameter functions $\boldsymbol{\vartheta}(\text{sex}, \text{smoking}, \text{age}, \mathbf{x})$ to be estimated. There are no longer any restrictions regarding the conditional parameter functions. In this sense, a transformation forest is the “most complex model one can think of” as mentioned in the introduction. The random forest class of models is considered to be very accurate, insensitive to hyperparameter tuning and without a tendency to overfit. In the following, I shall use this method to obtain a benchmark for better interpretable transformation models following the top-down model selection approach.
The generic random forest algorithm essentially relies on multiple transformation trees fitted to subsamples of the data, with a random selection of variables to be considered for splitting in each node. Unlike the original random forest (Breiman, 2001), a transformation forest can be understood as a procedure assigning a parametric model to each observation. For subject $i$, the forest conditional distribution function is

$$\hat{\mathbb{P}}(\text{BMI} \le y \mid \text{sex}_i, \text{smoking}_i, \text{age}_i, \mathbf{x}_i) = F(\mathbf{a}(y)^\top \hat{\boldsymbol{\vartheta}}(\text{sex}_i, \text{smoking}_i, \text{age}_i, \mathbf{x}_i)).$$

In this sense, a transformation forest “predicts” a fully parametric model for each subject, albeit with a very flexible conditional parameter. The conditional parameter was obtained from a locally adaptive maximum-likelihood estimator based on so-called nearest neighbour weights (Hothorn and Zeileis, 2017). A considerable improvement in the transformation forest log-likelihood () was observed. In fact, this is the largest log-likelihood I was able to achieve. Thus, among the models considered here, this transformation forest fits the BMI data best.
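The idea behind a locally adaptive maximum-likelihood estimator with nearest neighbour weights can be illustrated with a toy example. The sketch rests on two loudly stated assumptions: the weight of observation $i$ for a target subject is taken to be the number of trees in which both fall into the same terminal node, and a normal model with closed-form weighted estimates stands in for the Bernstein-polynomial transformation model, whose weighted likelihood would have to be maximised numerically. The terminal node assignments and BMI values are made up.

```python
import math


def nn_weights(leaf_ids, target):
    """Count, per observation, the trees sharing a terminal node with target.

    leaf_ids[t][i] is the terminal node of observation i in tree t.
    """
    n = len(leaf_ids[0])
    return [sum(tree[i] == tree[target] for tree in leaf_ids) for i in range(n)]


def weighted_normal_mle(y, w):
    """Weighted ML estimates of mean and standard deviation (stand-in model)."""
    total = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / total
    var = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)) / total
    return mu, math.sqrt(var)


# Hypothetical forest with three trees and five observations
leaf_ids = [
    [1, 1, 2, 2, 2],
    [1, 2, 2, 2, 3],
    [1, 1, 1, 2, 2],
]
bmi = [21.0, 22.5, 24.0, 27.0, 30.0]

w = nn_weights(leaf_ids, target=0)       # weights for the first subject
mu_hat, sd_hat = weighted_normal_mle(bmi, w)
```

Each subject obtains its own weight vector and therefore its own parameter estimate, which is what makes the forest a model with a subject-specific conditional parameter.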
On the downside, this black-box model makes it very difficult to understand the impact of the explanatory variables on the conditional BMI distribution. The likelihood-based permutation variable importance (Figure 9) indicated that only sex, age, education, physical activity and smoking have an impact on BMI, where again sex seems to be the most important variable. Age was a more important factor than education or physical activity, and thus the only numeric variable one needs to consider. The association between sex, smoking, age and BMI as described by the transformation forest is given in terms of a partial dependency plot of conditional deciles in Figure 10. In general, the median BMI increases with age, as does the BMI variance. For males, there seemed to be a level effect whose onset depends on smoking category. Females tended to higher BMI values, and the variance was larger compared to males. There seemed to be a bump in BMI values for females, roughly around years. This corresponds to mothers giving birth to their first child around this age. It is important to note that the right-skewness of the conditional BMI distributions in Figure 10 renders conditional normal distributions inappropriate, even under variance heterogeneity.

This complex model would be sufficient if one was only interested in the estimation of conditional BMI distributions for persons with specific configurations of sex, smoking, age and the remaining explanatory variables. The variable importances can be used to rank variables according to their impact on the conditional BMI distributions but cannot replace effect measures, let alone an assessment of their variability. Communication with subject-matter scientists and publication of results in subject-matter journals require simplification of these models. Top-down transformation choice can help to find models of appropriate complexity, as will be seen in the next section.
4.2 Conditional Transformation Model
The analysis using transformation trees and especially transformation forests revealed strong effects of sex and age; the latter variable was not considered in the analysis presented in Section 3. A more structured model roughly as powerful as the transformation forest must therefore allow the conditional distribution of BMI to change with both sex and age in very general ways. The remaining variables were less important, and one can hopefully cut some corners here by assuming simple linear main effects for these variables. I start the top-down search for a simpler model with a conditional transformation model of the form

$$\mathbb{P}(\text{BMI} \le y \mid \text{sex}, \text{smoking}, \text{age}, \mathbf{x}) = \operatorname{expit}(h(y, \text{age} \mid \text{sex}) - \beta_{\text{sex:smoking}} - \mathbf{x}^\top \boldsymbol{\beta}). \tag{7}$$

The transformation function $h(y, \text{age} \mid \text{sex})$ implements a sex-specific bivariate smooth surface function of BMI and age, which is of course monotonic in its first argument. The surface function $h(y, \text{age} \mid \text{male})$ for males explains age-induced changes in the conditional distribution of BMI. In contrast to transformation forests, the assumption was made that the function is smooth in both $y$ and age and not only in $y$. I parameterise this function as a tensor product of two Bernstein polynomials of order 5, one for BMI and one for age, with a sex-specific 36-dimensional parameter vector $\boldsymbol{\vartheta}_{\text{sex}}$, in other words as $h(y, \text{age} \mid \text{sex}) = (\mathbf{a}(y)^\top \otimes \mathbf{b}(\text{age})^\top) \boldsymbol{\vartheta}_{\text{sex}}$. Except for smoking, the remaining variables entered only as the linear shift term $\mathbf{x}^\top \boldsymbol{\beta}$ of main effects. In light of its fifth rank in the permutation variable importance (Figure 9), it may seem a bit inconsistent to treat smoking differently than the other variables. However, the stratified analysis in Section 3 suggested the need for sex-specific smoking effects, and I thus include the interaction term $\beta_{\text{sex:smoking}}$ also in this model. The expit function around the transformation function ensures interpretability of all regression coefficients on the log-odds scale.

With parameters, the log-likelihood of model (7) was only slightly smaller than the log-likelihood of the transformation forest (). In a certain sense, this conditional transformation model can be seen as an approximation of the black-box transformation forest. The effects of sex, smoking and age, with all remaining variables being constant, are again best visualised using the conditional decile functions (Figure 11). The decile functions are now smooth in age due to the parameterisation of the age effect in terms of Bernstein polynomials. For males, the BMI increased with age; the BMI reduction in males older than years was not visible in the decile curves of the transformation forests (Figure 10). The slope was largest for young men up to years, followed by a linear increase until the age of . The male BMI distribution was right-skewed, with only a small increase in the variance towards older people. For females, a bump in the BMI distribution was again identified around the age of , corresponding to pregnancies and breastfeeding times. The effect seemed more pronounced in higher deciles. Right-skewness and a variance increase towards older women can be inferred from this figure.
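The tensor-product parameterisation of the bivariate surface function can be sketched as follows. Two Bernstein bases of order 5 have six functions each, so their Kronecker product has 36 elements, matching the dimension of the parameter vector described above. The BMI and age ranges are made up, and the helper names are hypothetical.

```python
from math import comb


def bernstein_basis(t, order):
    """Bernstein basis functions of the given order at t in [0, 1]."""
    return [comb(order, m) * t**m * (1 - t) ** (order - m)
            for m in range(order + 1)]


def tensor_basis(y, age, order=5, y_range=(15.0, 45.0), age_range=(20.0, 70.0)):
    """Kronecker product of the BMI basis and the age basis.

    The ranges are hypothetical and only fix the scaling to the unit interval.
    """
    a = bernstein_basis((y - y_range[0]) / (y_range[1] - y_range[0]), order)
    b = bernstein_basis((age - age_range[0]) / (age_range[1] - age_range[0]), order)
    return [ai * bj for ai in a for bj in b]


def surface(y, age, theta):
    """h(y, age) as the inner product of the tensor basis and theta."""
    return sum(c * th for c, th in zip(tensor_basis(y, age), theta))


basis = tensor_basis(24.0, 40.0)
```

In this layout the coefficient for BMI index `m` and age index `k` sits at position `6 * m + k`; coefficients that increase in `m` for every fixed `k` would keep the surface increasing in BMI at every age.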
The main advantage of this complexity reduction is the interpretability of the regression coefficients $\beta_{\text{sex:smoking}}$ and $\boldsymbol{\beta}$ in terms of BMI-independent log-odds ratios. The sex-specific smoking effects and the effects of the remaining variables as odds ratios are given in the left column of Table 3. Further simplification can be achieved by replacing the bivariate surface function of $y$ and age by a sex-specific, BMI-varying linear effect of age in the distribution regression model presented in the next section.
4.3 Distribution Regression
The term “distribution regression” (Chernozhukov et al., 2013) is commonly used to describe response-varying coefficients. In survival analysis, the term “time-varying coefficients” is more typical. Here, a BMI-varying coefficient of age is a means of simplifying the conditional transformation model (7). In the simpler model, I assume a smoothly varying but sex-specific coefficient of age . The transformation function is again the simple transformation of BMI given sex introduced in model (4). The model reads
(8)  
The log-likelihood decreased considerably in this model with parameters. The effects of smoking and the remaining variables (except age) are given in the middle column of Table 3 as odds ratios. When the dependency of BMI deciles on sex, smoking and age was depicted (Figure 12), the linear structure regarding age was obvious. The age-varying slopes and the pregnancy bump could not be identified by this simpler model. Right skewness and variance heterogeneity for females remained visible. The variance increase in older males now seemed questionable. For my taste, the replacement of two bivariate functions by two univariate functions does not really help model interpretation, as one would have to plot these two functions in any case. The severe reduction of the log-likelihood indicated that the effect of age is better described in a conditional transformation model of the form (7). Nevertheless, I will go one step further and connect the stratified linear transformation model (5) with a model of the same form featuring age and the lifestyle variables in addition to sex and smoking.
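To make the idea of a response-varying coefficient concrete, the following sketch uses a toy model P(Y ≤ y | age) = expit(h(y) − β(y) · age). The functions `h` and `beta_age`, their numerical constants and the sign convention are hypothetical and chosen only so that the distribution function stays monotone in y. They are not the estimated functions of model (8).

```python
import math

def expit(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def h(y):
    # Hypothetical baseline transformation of BMI (assumption).
    return 0.4 * (y - 25.0)

def beta_age(y):
    # Hypothetical BMI-varying coefficient of age, linear in y (assumption).
    return 0.01 + 0.002 * (y - 25.0)

def cdf(y, age):
    # Assumed form: P(Y <= y | age) = expit(h(y) - beta_age(y) * age).
    return expit(h(y) - beta_age(y) * age)

def exceedance_odds(y, age):
    p = cdf(y, age)
    return (1.0 - p) / p

def one_year_odds_ratio(y, age=40.0):
    # Odds ratio for a one-year increase in age at BMI cut-off y.
    return exceedance_odds(y, age + 1.0) / exceedance_odds(y, age)

or_at_20 = one_year_odds_ratio(20.0)  # equals exp(beta_age(20))
or_at_30 = one_year_odds_ratio(30.0)  # equals exp(beta_age(30))

# The model is only valid if the distribution function increases in y.
grid = [15.0 + 0.5 * i for i in range(61)]
cdf_is_monotone = all(cdf(a, 41.0) <= cdf(b, 41.0) for a, b in zip(grid, grid[1:]))
print(or_at_20, or_at_30, cdf_is_monotone)
```

In contrast to a constant coefficient, the odds ratio for age now depends on the cut-off y, so the age effect has to be reported as a function of BMI rather than as a single number.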
4.4 Stratified Linear Transformation Model
I extend the stratified linear transformation model (5) with a sex-specific age effect and a linear predictor of the remaining variables
(9)  
The log-likelihood was further reduced to . In this model, the sex differences in the age effects were completely gone, as the odds ratios for a one-year increase were () for males and () for females. In light of the more complex structure of the age effect identified by the more complex models, one would incorrectly draw the conclusion of equal age effects for males and females based on this oversimplified model. The effects of the remaining parameters are given in the right column of Table 3.
                        Model (7)          Model (8)          Model (9)
Smoking (Females)
  Never
  Former                1.04 (0.95–1.15)   1.06 (0.96–1.16)   1.06 (0.96–1.16)
  Light                 0.81 (0.70–0.92)   0.81 (0.71–0.93)   0.81 (0.71–0.93)
  Medium                0.92 (0.80–1.06)   0.94 (0.82–1.08)   0.94 (0.82–1.08)
  Heavy                 0.91 (0.76–1.09)   0.93 (0.78–1.12)   0.94 (0.78–1.12)
Smoking (Males)
  Never
  Former                1.44 (1.31–1.59)   1.47 (1.33–1.62)   1.47 (1.33–1.61)
  Light                 1.02 (0.88–1.17)   1.01 (0.88–1.16)   1.01 (0.88–1.16)
  Medium                0.87 (0.76–1.00)   0.90 (0.79–1.03)   0.91 (0.79–1.04)
  Heavy                 1.13 (0.99–1.30)   1.21 (1.06–1.39)   1.22 (1.07–1.40)
Alcohol intake (g/d)    1.00 (1.00–1.00)   1.00 (1.00–1.00)   1.00 (1.00–1.00)
Fruit and vegetables
  High
  Low                   1.07 (1.01–1.13)   1.08 (1.02–1.14)   1.08 (1.02–1.14)
Physical activity
  High
  Moderate              1.11 (1.04–1.19)   1.15 (1.08–1.23)   1.16 (1.08–1.23)
  Low                   1.25 (1.16–1.34)   1.30 (1.21–1.40)   1.30 (1.21–1.40)
Education
  Mandatory (I)
  Secondary (II)        0.72 (0.66–0.79)   0.79 (0.73–0.87)   0.80 (0.73–0.87)
  Tertiary (III)        0.48 (0.43–0.52)   0.56 (0.51–0.61)   0.56 (0.51–0.62)
Nationality
  Swiss
  Foreign               1.17 (1.09–1.25)   1.23 (1.15–1.31)   1.24 (1.16–1.32)
Region
  German speaking
  French speaking       0.89 (0.83–0.94)   0.88 (0.83–0.94)   0.88 (0.83–0.94)
  Italian speaking      0.81 (0.72–0.93)   0.81 (0.71–0.92)   0.81 (0.71–0.92)
The three columns presented in Table 3 refer to the same parameters, estimated by three models differing only with respect to the complexity of the age effect. The effects of smoking, alcohol intake, education, physical activity, fruit and vegetable consumption, residence and nationality were remarkably constant. Alcohol intake had no impact on the BMI in this study, and right shifts in BMI distributions were associated with low fruit and vegetable consumption, moderate and low physical activity, short education, being a foreigner or living in the German-speaking part of Switzerland. These conclusions can be drawn from all three models in the same way. The effects of smoking were less pronounced than the effects obtained in the initial analysis that ignored age and the lifestyle variables (Table 2). Light smokers had lower BMIs than never smokers; the remaining effects are questionable.
5 Discussion
The core of top-down transformation choice is a family of decreasingly complex, yet fully comparable, conditional transformation models. Model parameterisation and interpretation in the family of transformation models are always based on the conditional distribution function
Unlike most classical models featuring explicit parameters for conditional means or conditional variances, transformation models describe conditional distributions explicitly and moments implicitly. What might seem a disadvantage is in fact, as I hope to have convinced the reader, a very attractive feature of transformation models for regression analysis. In this tutorial, I exclusively defined and interpreted models for conditional distributions. The corresponding distribution functions were used to compare transformation models with the empirical cumulative distribution function (Figures 2, 3, 4 and 7). The conditional transformation function was used to assess deviations from normality in Figure 5. Conditional densities are depicted in Figure 6 and those for each terminal node of the transformation tree are shown in Figure 8. Densities defined the log-likelihood
based on all BMI measurements with sampling weights . Conditional quantile functions
helped to visualise age effects in Figures 10, 11 and 12. Effect measures for sex, smoking and lifestyle variables in Tables 2 and 3 were obtained as ratios of conditional odds functions
Varying model complexity only affects the flexibility of these functions that characterise conditional distributions, but not the corresponding interpretations.
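For orientation, the functions named in this section can be collected in one place. The display below is a sketch in generic notation chosen here for illustration: the symbols h, F, f, Q and w_i, the index range and the use of the expit link from Section 4 are assumptions, not a transcription of the numbered equations of this paper.

```latex
\begin{align*}
F(y \mid x) &= \mathbb{P}(Y \le y \mid x) = \operatorname{expit}\bigl(h(y \mid x)\bigr)
  && \text{(conditional distribution function)} \\
f(y \mid x) &= \frac{\partial F(y \mid x)}{\partial y}
  && \text{(conditional density)} \\
\ell &= \sum_{i = 1}^{N} w_i \log f(y_i \mid x_i)
  && \text{(weighted log-likelihood)} \\
Q(p \mid x) &= \inf\{y : F(y \mid x) \ge p\}
  && \text{(conditional quantile function)} \\
\frac{F(y \mid x)}{1 - F(y \mid x)} &= \exp\bigl(h(y \mid x)\bigr)
  && \text{(conditional odds function)}
\end{align*}
```

Under these assumptions, every function is derived from the same transformation function h, which is why changing the flexibility of h changes the fit but not the meaning of distribution functions, quantiles or odds ratios.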
A unique feature of conditional transformation models is the ability to formulate, estimate, compare, evaluate, interpret and understand models seemingly as far apart as a normal linear model with constant variance and a transformation forest in the same theoretical framework. Straightforward answers to some questions that have plagued data analysis for decades, for example “Is it appropriate to assume normal errors?” or “How should the response be transformed prior to analysis?”, are easily obtained from conditional transformation models.
One practical and interesting question relates to the impact of the order of the Bernstein polynomial . The choice implements a linear function, and with , conditional normal distributions are obtained. For , converges uniformly to the true and unknown transformation function in a model . Because is a monotonic function, overly erratic behaviour cannot occur, even for very large , and overfitting is not an issue (see Hothorn, 2017b, for numerical examples). In the model (3), increasing the order from to led to a very small increase in the log-likelihood from to . In the extreme case of very large , the conditional distribution function closely interpolates the empirical cumulative distribution function. The latter estimator is consistent, as is the transformation model (Hothorn et al., 2017).
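Two of the properties mentioned above can be checked numerically: a Bernstein polynomial of order one is a linear function, and a Bernstein polynomial with nondecreasing coefficients is monotone. The sketch below is plain Python and independent of the mlt package. It assumes the response has been rescaled to the unit interval, and the coefficient values and the order 6 are arbitrary illustrations, not estimates from this paper.

```python
from math import comb

def bernstein_basis(t, M):
    """Bernstein basis functions of order M evaluated at t in [0, 1]."""
    return [comb(M, k) * t**k * (1 - t) ** (M - k) for k in range(M + 1)]

def bernstein_poly(t, coefs):
    """Bernstein polynomial with coefficients coefs (order = len(coefs) - 1)."""
    M = len(coefs) - 1
    return sum(c * b for c, b in zip(coefs, bernstein_basis(t, M)))

grid = [i / 50 for i in range(51)]  # rescaled response values (assumption)

# Order 1: two coefficients give the straight line -1 + 3 * t.
linear = [bernstein_poly(t, [-1.0, 2.0]) for t in grid]

# Order 6 with nondecreasing (hypothetical) coefficients.
coefs = [-3.0, -2.5, -0.5, -0.4, 0.1, 2.0, 2.2]
values = [bernstein_poly(t, coefs) for t in grid]
is_monotone = all(a <= b for a, b in zip(values, values[1:]))
print(is_monotone)
```

Monotonicity follows because the derivative of a Bernstein polynomial is again a Bernstein polynomial whose coefficients are proportional to the successive differences of the original coefficients. A constraint of this kind on the coefficients is one way to keep the fitted transformation function monotone, however large the order.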
This tutorial did not address any issue regarding model estimation or model inference. Details about maximum-likelihood estimation in conditional transformation models can be found in Hothorn et al. (2017). Locally adaptive maximum-likelihood estimation for transformation trees and transformation forests has been introduced in Hothorn and Zeileis (2017). More elaborate discussions of model parameterisation in conditional transformation models and of connections to other models can be found in Hothorn et al. (2014) and Hothorn et al. (2017). Applications of conditional transformation models can be found in Hothorn et al. (2013), Möst et al. (2014) and Möst and Hothorn (2015). An introduction to the mlt add-on package (Hothorn, 2017a) for maximum-likelihood estimation in conditional transformation models, including models for ordinal or censored and truncated responses, is available in Hothorn (2017b).
Reproducibility
Data from the Swiss Health Survey 2012 can be obtained from the Swiss Federal Statistics Office (Email: sgb12@bfs.admin.ch). Data is available for scientific research projects, and a data protection application form must be submitted. More information can be found here http://www.bfs.admin.ch/bfs/de/home/statistiken/gesundheit/erhebungenSupplementary.
The code used for producing the results presented in this paper can be evaluated on a smaller artificial data set sampled from the transformation forest by running demo("BMI") from the trtf package (Hothorn, 2017c).
Acknowledgements
I thank the students participating in the course “STA660 Advanced R Programming” that I taught in the spring semester of 2017 for producing the code underlying Figure 8 as part of their homework assignments. Parts of this paper were written during a research sabbatical at Universität Innsbruck financially supported by the Swiss National Science Foundation (grant number IZSEZ0_177091).
References
Breiman L (2001). “Random Forests.” Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324.
Bundesamt für Statistik (2013). Die Schweizerische Gesundheitsbefragung 2012 in Kürze – Konzept, Methode, Durchführung. Bern. URL http://www.bfs.admin.ch.
Chernozhukov V, Fernández-Val I, Melly B (2013). “Inference on Counterfactual Distributions.” Econometrica, 81(6), 2205–2268. doi:10.3982/ECTA10582.
Fahrmeir L, Kneib T, Lang S, Marx B (2013). Regression. Models, Methods and Applications. Springer-Verlag, New York, U.S.A.
Farouki RT (2012). “The Bernstein Polynomial Basis: A Centennial Retrospective.” Computer Aided Geometric Design, 29(6), 379–419. doi:10.1016/j.cagd.2012.03.001.
Hothorn T (2017a). mlt: Most Likely Transformations. R package version 0.21, URL https://CRAN.R-project.org/package=mlt.
Hothorn T (2017b). Most Likely Transformations: The mlt Package. R package vignette version 0.20, URL https://CRAN.R-project.org/package=mlt.docreg.
Hothorn T (2017c). trtf: Transformation Trees and Forests. R package version 0.30, URL https://CRAN.R-project.org/package=trtf.
Hothorn T, Kneib T, Bühlmann P (2013). “Conditional Transformation Models by Example.” In VMR Muggeo, V Capursi, G Boscaino, G Lovison (eds.), “Proceedings of the 28th International Workshop on Statistical Modelling,” pp. 15–26. Universitá Degli Studi Di Palermo. ISBN 9788896251478.
Hothorn T, Kneib T, Bühlmann P (2014). “Conditional Transformation Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 3–27. doi:10.1111/rssb.12017.
Hothorn T, Möst L, Bühlmann P (2017). “Most Likely Transformations.” Scandinavian Journal of Statistics. Accepted 2017-06-19, URL https://arxiv.org/abs/1508.06749.
Hothorn T, Zeileis A (2017). “Transformation Forests.” Technical report, arXiv 1701.02110. URL https://arxiv.org/abs/1701.02110.
Liu Q, Shepherd BE, Li C, Harrell FE (2017). “Modeling Continuous Response Variables Using Ordinal Regression.” Statistics in Medicine. doi:10.1002/sim.7433.
Lohse T, Rohrmann S, Faeh D, Hothorn T (2017). “Continuous Outcome Logistic Regression for Analyzing Body Mass Index Distributions.” F1000Research, 6, 1933. doi:10.12688/f1000research.12934.1.
Manuguerra M, Heller GZ (2010). “Ordinal Regression Models for Continuous Scales.” The International Journal of Biostatistics, 6(1). doi:10.2202/1557-4679.1230.
Möst L, Hothorn T (2015). “Conditional Transformation Models for Survivor Function Estimation.” International Journal of Biostatistics. doi:10.1515/ijb-2014-0006.
Möst L, Schmid M, Faschingbauer F, Hothorn T (2014). “Predicting Birth Weight with Conditionally Linear Transformation Models.” Statistical Methods in Medical Research. doi:10.1177/0962280214532745.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
UNESCO Institute for Statistics (2012). International Standard Classification of Education – ISCED 2011. Montreal. URL http://www.uis.unesco.org/Education/Documents/isced-2011-en.pdf.