 # Predicting Regression Probability Distributions with Imperfect Data Through Optimal Transformations

The goal of regression analysis is to predict the value of a numeric outcome variable y given a vector of joint values of other (predictor) variables x. Usually a particular x-vector does not specify a repeatable value for y, but rather a probability distribution of possible y–values, p(y|x). This distribution has a location, scale and shape, all of which can depend on x, and are needed to infer likely values for y given x. Regression methods usually assume that training data y-values are perfect numeric realizations from some well behaived p(y|x). Often actual training data y-values are discrete, truncated and/or arbitrary censored. Regression procedures based on an optimal transformation strategy are presented for estimating location, scale and shape of p(y|x) as general functions of x, in the possible presence of such imperfect training data. In addition, validation diagnostics are presented to ascertain the quality of the solutions.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In regression analysis one has a system under study with associated attributes or variables. The goal is to estimate the unknown numeric (outcome) value of one of the variables , given the known joint values of other (predictor) variables associated with the system. It is seldom the case that a particular set of –values gives rise to a unique value for . There are other variables that influence whose values are neither controlled nor observed. Specifying a particular set of joint values for results in a probability distribution of possible –values, , induced by the varying values of . This distribution has a location , scale and shape, all of which can depend on . Using a training data set of previously solved cases in which both the outcome and predictor variable values are jointly observed, the goal is to produce an estimate of the distribution of given .

Most regression procedures explicitly or implicitly approximate

by a generic (usually normal) distribution that is symmetric with constant scale,

(homoscedasticity). Only its location function is estimated from the training data. That estimate can then be used to estimate the presumed constant scale by

 ^s=∑i∈Th(|yi−^f(xi)|) (1)

on a “test” data set not used to estimate . Here the particular function employed depends upon the assumed distribution . In this setting a –prediction at a future –value is taken to be since this is the most likely value of under a symmetric distribution assumption.  The uncertainty of all predictions is gauged by (1). The value of is often taken to be a measure of the lack-of-quality of the solution, especially in competitions.

The quantity (1) reflects the prediction uncertainty averaged over the marginal distribution of all –values, . The actual uncertainty for predictions at a particular is characterized by , the scale of corresponding to that –value. It is seldom the case that this uncertainty is the same or even similar for different –values. Generally there is substantial variation in the scale over the distribution of (heteroscedasticity). When a predicted value of is being used in an actual decision making application (rather than in a competition) knowing its corresponding lack–of–accuracy can be important information influencing the decision outcome.

It is important to note that the location and scale are characteristics (parameters) of the distribution . In particular, the scale function represents an “irreducible” error for predictions at . Obtaining more training data or using more effective learning algorithms cannot reduce this error. They can only reduce the error in the estimated values of the parameters and . This latter “reducible”  error is usually much smaller than the irreducible error characterized by the intrinsic scale . The only way to decrease irreducible error is to use a more informative set of predictor variables .

Along with location and scale, higher moments can also vary with

giving rise to changing shape of for different values of . For example,

may be asymmetric with varying degrees of skewness for different

–values. This will effect inference on likely values of for a given .

Almost all regression procedures implicitly assume that the outcome is a real valued variable with the potential to realize any value for which the marginal distribution of

 p(y)=∫p(y|x)p(x)dx

has support. For most hypothesized used in regression analysis this implies all real values . In many applications, however, this ideal is not realized. Recorded –values may be restricted to a small distinct set with many ties. The tied values themselves may be unknown. Only an order relation among them is recorded (ordinal regression). In other applications, one may only be able to specify (possibly overlapping) intervals that contain each training data –value, rather than the actual value itself (censoring). The specified intervals may be open or closed.

This paper describes an omnibus regression procedure (OmniReg) that can use imperfect training data such as described above to produce estimates of the location , scale and shape of as general functions of . These can be used to assess the distribution of likely values of for new observations for which only the predictor values are known.

Section 2

presents the basic gradient boosting strategy for jointly estimating the location

and scale of a symmetric  in the presence of general censoring of the outcome variable . Section 3 generalizes the procedure by presenting a method for for constructing the optimal transformation of , , on which to perform the corresponding location–scale estimation. Section 4 outlines diagnostic procedures for checking the consistency of derived solutions. Section 5 further generalizes the procedure by incorporating asymmetry, as a function of , into the transformed solutions. Illustrations using three publicly available data sets, one censored and two uncensored, are presented in Section 6. Application to ordinal regression, where only an order relation among the -values is observed, is discussed in Section 7. Connections of this method to other related techniques are discussed in Section 8. Section 9 outlines the use of the derived distribution estimates for point prediction. Section 10 provides a concluding discussion.

## 2 Estimation

Following Tobin (1958) we assume that the outcome

is a possibly imperfect measurement of a random variable

that follows a well behaved probability distribution with support on the entire real line. In particular, we initially suppose that for a given follows an additive error model

 y∗=f(x)+s(x)⋅ε (2)

with location and scale , where both are general functions of to be estimated from the training data. The random variable follows a standard logistic distribution

 ε∼exp(−ε)(1+exp(−ε))2. (3)

This distribution has a Gaussian shape near its center and continuously evolves to that of a Laplace distribution in the tails. This provides robustness to potential outliers while maintaining power for nearly normally distributed data.

### 2.1 Data

The data imperfections described above can all be treated as being different aspects of censoring. The outcome value for a censored observation is unknown; one can only specify an interval containing its corresponding value . If the observation is uncensored. If the observation is said to be left censored at ; if it is right censored at . Otherwise it is interval censored. An actual data set can consist of a mixture of all of these censored types with or without the inclusion of uncensored observations. In the case of discrete –values with ties, one can consider the observations at each tied value to be interval censored between the midpoint of the previous and next set of possibly tied values.

### 2.2 Loss function

We use maximum likelihood based on the logistic distribution (2) (3) to estimate the location and scale functions from the training data . The –value for each observation is specified by a lower bound and upper bound . If the probability that is given by

 Pr(y∗i∈[ai,bi])=11+exp((f(xi)−bi)/s(xi))−11+exp((f(xi)−ai)/s(xi)).

If , it is given by

 Pr(y∗i=bi)=1s(xi)exp((f(xi)−bi)/s(xi))(1+exp((f(xi)−bi)/s(xi)))2. (4)

The loss function then becomes the negative log–likelihood

 L(a,b,f(x),s(x)) =I(a=b)[log(s(x))+(b−f(x))/s(x)+2log(1+exp((f(x)−b)/s(x))] −I(a

The function estimates are then

 (^f(x),^s(x))=argminf,s∈FN∑i=1L(ai,bi,f(xi),s(xi)) (6)

where is some chosen class of functions.

### 2.3 Implementation

The function class in (6

) employed here consists of linear combinations of decision trees induced by gradient boosting (Friedman 2001)

 ^f(x)=Kf∑k=1T(f)k(x) (7)

and

 ˆlog(s(x))=Ks∑k=1T(s)k(x). (8)

Each and is a different CART decision tree (Breiman et. al. 1984) using as predictors. The trees in each expansion (7) (8) are sequentially induced using the respective generalized residuals

 ~rf(a,b,f|s)=−∂L(a,b,f,s)∂f (9)

and

 ~rs(a,b,s|f)=−∂L(a,b,f,s)∂log(s) (10)

as the outcome with the loss function is given by (5). At each step the evaluation of and is based on the previously induced trees in the respective sequences. The quantity is estimated in (8) because the loss function (5) is not convex in , but is convex in . This also enforces the constraint that the estimated scale is always greater than zero.

After each tree in (7) is built the predicted value in each of its terminal nodes is taken to be where is the solution to the line search

 ∑i∈t~rf(ai,bi,^f(xi)+~ft|^s(xi))=0, (11)

and is a small number (learning rate). For the trees in (8) the terminal node predictions are given by with

 ∑i∈t~rs(ai,bi,^s(xi)⋅~st|^f(xi))=0. (12)

In (11) (12) and are the current location and scale function estimates used to induce the corresponding tree.

The number of trees and in (7) and (8) respectively are taken to be those that minimize the negative log–likelihood (6) as evaluated on an independent “test” set of observations not used to learn and .

The tree sequences (7) (8) are alternatively induced using an iterative boosting algorithm:

Start: constant

Loop {

tree–boost given

tree–boost given

} Until change threshold.

Starting with being a constant function, gradient boosting is applied to estimate a corresponding using (9) (11). Given that boosting is applied to estimate its corresponding using (10) (12). This is then used to estimate a new and so on. The alternating iterative process continues until the change in both functions on successive iterations is below a small threshold. Usually only two or three iterations are required.

## 3 Optimal transformations

The strategy described in Section 2 attempts to capture location and scale as functions of the predictor variables under an additive error assumption (2). However, there is no guarantee that the actual in any application even approximately has this property. Violation can result in highly distorted estimates . While accurate probability estimates are essential for proper inference, they are especially important for estimation in the presence of censoring (Section 2.1). Given a censoring interval , knowledge concerning the corresponding underlying outcome value is derived solely from its estimated distribution.

One way to mitigate this problem is by transforming the outcome variable using a monotonic function. The goal here is to find a monotonic transformation function such that (2) at least approximately holds for the transformed variable. That is

 g(y)≃f(x)+s(x)⋅ε. (13)

To the extent (13) holds, represents the best transformation for estimating the corresponding and . All inference can be performed on the transformed variable using the distribution of (3). Corresponding -quantiles referencing the distribution of the original untransformed variable , , are given by .

For any given transformation one can directly solve (6) for its corresponding and , using the methods described in Section 2.3, by simply taking to be the outcome variable in place of . That is and . Since is monotonic its cumulative distribution for any must be the same as that of its corresponding untransformed outcome , . That is, at any so that their respective data averages are equal

 1NN∑i=1G(g(y)|xi))=1NN∑i=1F(y|xi)). (14)

Given and this (14) defines the optimal transformation .

The right hand side of (14) can be estimated by the empirical marginal cumulative distribution of , . By assumption (13) the cumulative distribution of at each is

 G(g(y)|x)=11+exp((f(x)−g(y))/s(x)). (15)

Substituting the corresponding estimates for the distribution parameters one has from (14)

 1NN∑i=111+exp((^f(x)−^g(y))/^s(x))=^F(y). (16)

Given any value for one can solve (16) for its corresponding estimated transformed value using a line search method. Note that as defined by (16) is only required to be monotonic and is not restricted to any other function class.

If all of the training data -values are uncensored, or the censoring intervals are restricted to non–overlapping bins, then the corresponding is trivially obtained by ranking the training data -values. If not, it can be estimated using Turnbull’s non–parametric self–consistency algorithm (Turnbull 1976).

These considerations suggest an alternating optimization algorithm to jointly solve for the functions , and . Starting with an initial guess for (say ), one solves (5) (6) for the corresponding and as described in Section 2.3 using the current as the outcome variable. Given the resulting solution location and scale functions one evaluates a new using (16). This transformation function replaces the previous one resulting in new estimates for and from (5) (6). These in turn produce a new transformation through (16) and so on. The process is continued until the estimates stop changing, converging to a fixed point. Usually three to five iterations are sufficient.

## 4 Diagnostics

No procedure works equally well in all applications. Simply because a program returns a result does not insure its validity. It is important to assess whether or not the results accurately reflect their corresponding population quantities by subjecting them to diagnostic consistency checks or constraints. To the extent these constraints are satisfied they provide necessary, but not sufficient, evidence for the validity of the model. Those that are violated uncover corresponding model inadequacies. In this section diagnostics are presented for checking the consistency of the probability distribution estimates.

### 4.1 Marginal distribution plots

The fundamental premise of the procedure is that given any , its corresponding transformed outcome is a random variable that follows a symmetric logistic distribution with location and scale . If so, its corresponding cumulative distribution is

 Pr(^g(y)

It is not possible to verify (17) for any single observation . However it can be tested for subsets of data. Let be a subset of observations drawn from a validation data set not used to obtain the estimates , or . The subset must be selected using only predictor variable values , or quantities derived from them, and not involve the outcome . Then the predicted cumulative distribution of for is

 Pr(^g(y)

One can compare this to the actual empirical distribution of for . This  can be computed by sorting the or by using Turnbull’s non–parametric algorithm if there is censoring present. The size of the subset should be large enough to obtain reliable estimates of the empirical distribution.

The empirical and predicted distributions can be compared with quantile–quantile (Q-Q) plots. The vertical axis of a Q-Q plot represents the empirical quantiles of the data subset. The abscissa represents the same quantiles based on the predicted distribution (18). To the extent that the two distributions are the same, points corresponding to the same quantile value will tend to be equal and thus lie close to a 45–degree straight line. If the two distributions have the same shape and scale but differ in location then the corresponding quantiles will lie close to a line parallel to the diagonal but shifted by the location difference. If the shape of the two distributions is the same but they differ in scale the corresponding quantiles will lie on a straight line with a non unit slope reflecting the differing scales. To the extent the shapes of the two distributions differ the respective quantiles will fail to lie on a straight line.

If the OmniReg model is correct the prediction (18) holds for any -defined data subset . It is not feasible to test it over all possible subsets of the data to look for discrepancies. One must judiciously choose revealing subsets, perhaps based on domain knowledge. One generic possibility is to select subsets based on joint values of the estimated location and scale . That is . This approach is illustrated on the data example in Section 6.1.

### 4.2 Standardized residual plots

The marginal distribution diagnostics described in Section 4.1 can be computed for data containing mixtures of uncensored and any kind of censored outcomes. However if all outcomes in the validation data set are uncensored, the predicted standardized residuals

 ^r(y|x)=(^g(y)−^f(x))/^s(x) (19)

can be directly computed and examined. It is these standardized residuals that are central to inference. Under procedure assumptions, and to the extent that the transformation , location and scale function estimates are accurate, the predicted standardized residuals (19) follow a standard logistic distribution (3) for any . As above this can’t be verified for any single observation , but it can be tested for groups of data as described in Section 4.1. These diagnostics are illustrated on the data examples in Sections 6.2 and 6.3.

## 5 Asymmetry

The optimal transformation strategy (Section 3) attempts to find a monotonic function such that (13) approximately holds. To the extent it is successful the transformed distribution must be symmetric or close to it. Although it often succeeds, there is no guarantee that such a transformation exists in a particular application. When that happens the procedure can compromise accuracy of function estimates by attempting to approximate distribution symmetry. This will likely be reflected in the diagnostic plots described in Sections 4.1 and 4.2.

A way to mitigate this problem is to relax the symmetry requirement on in the transformed setting. In particular the logistic distribution can be generalized to incorporate asymmetry by providing different scales on the positive and negative residuals

 p(z|f,sl,su)=2sl+su[I(z≤f)exp((f−z)/sl)(1+exp((f−z)/sl))2+I(z>f)exp((f−z)/su)(1+exp((f−z)/su))2]. (20)

Here represents the mode, a scale parameter for the negative residuals and a corresponding scale for the positive residuals. Figure 1 shows a plot of (20) for , , . Figure 1: Probability density of asymmetric logistic distribution (20) with mode f =0, lower scale sl=2 and upper scale su=1.

Note that this definition of an asymmetric logistic distribution is different from the “skew” logistic distributions proposed by Johnson et. al. (1995). It (20) is faster to compute in that it only requires the evaluation of a single exponential function, whereas the skew logistic requires two exponentials and a logarithm. Also, its parameters have a straightforward interpretation.

The iterative gradient boosting strategy of Section 2.3 is easily modified to incorporate this generalization. The probability density (20) and its cumulative distribution

 CDF(z|f,sl,su)=2slsl+su{I(z≤f)1+exp((f−z)/sl)+I(z>f)[12+susl(11+exp((f−z)/su))−12)]} (21)

simply replace the corresponding symmetric logistic distributions in the formulation of a loss function analogous to (5). Three functions , and are estimated by gradient boosting:

Start: constant

Loop {

tree–boost given

tree–boost given

tree–boost given

} Until change threshold

For the optimal transformation strategy of Section 3, (21) simply replaces (15) in (16).

A limitation of this asymmetric gradient boosting algorithm is that its convergence is quite slow. This is caused by the strong effect that each scale parameter has on the solution for the other scale parameter. However in the special case of (20) with uncensored outcomes there is a trick that dramatically speeds convergence. This approach is based on the symmetric logistic loss function (5) with .

At each there are estimates for the three functions , and . Associated with each training observation is a location  and single scale parameter . If then , otherwise . At each iteration of asymmetric boosting a new location function is estimated based on all of the training data . Next a new lower scale function is estimated using only the training data with currently negative residuals . Then a new upper scale function is estimated using only the training data with currently positive residuals . All estimation is based on the symmetric logistic loss function (5). This leads to an alternative algorithm for uncensored :

Start: constant

Loop {

tree–boost given

tree–boost given

tree–boost given

} Until change threshold.

This algorithm uncouples the direct effect of each scale function  on the estimation of the other. They only interact indirectly through their effect on the location function estimate. Convergence usually requires four to six iterations.

The diagnostic plots of Section 4.1 are easily generalized to asymmetric error model. The asymmetric cumulative distribution (21) simply replaces (17). For the standardized residual plots (Section 4.2) an asymmetrically standardized residual

 ^ra(y|x)=I[^g(y)≤^f(x)](^g(y)−^f(x))/^sl(x)+I[^g(y)>^f(x)](^g(y)−^f(x))/^su(x) (22)

replaces the symmetric one (19).

## 6 Data examples

In this section the procedures described in the previous sections are illustrated using three data sets. The first involves censoring in which none of the actual outcome

-values are known. The other two are well known regression data sets from the Irvine Machine Learning Repository and involve uncensored outcomes. For all data sets two analyses are performed. In the first, location

and scale estimates are derived assuming that the underlying variable follows a symmetric logistic distribution (2) (3) as described in Section 2, without the application of any transformation. In the second analysis the procedure described in Section 3 is applied to jointly estimate the optimal transformation along with its corresponding location and scale functions. In addition, an asymmetric analysis (Section 5) is applied to the third data set. All presented results are based on predictions using a validation data set not involved in model construction or selection.

### 6.1 Questionnaire data

This data set contains demographics on people who filled out questionnaires at shopping malls in the San Francisco Bay Area (Hastie, Tibshirani, and Friedman 2009). The exercise here is to predict a person’s age given the other demographic information listed on their questionnaire. The predictor variables are listed in Table 2.

Table 2

Questionnaire predictor variables

1 Occupation 2 Type of home 3 Gender 4 Martial status 5 Education 6 Annual income 7 Lived in Bay Area 8 Dual Incomes 9 Persons in household 10 Persons in household under 18 11 Householder status 12 Ethnic classification 13 Language

Questionnaire age values are specified as being in one of seven intervals as shown in Table  3.

Table 3

Specified age intervals

1 2 3 4 5 6 7
17 and under 18–24 25–34 35–44 45–54 55–64 65 and older

This outcome can be considered as censored in non overlapping intervals with bounds . There are questionnaires. These are randomly divided into observations for model construction, for selecting the number of trees in each model  (7) (8) , and for validation. Figure 2: Questionnaire data. Q–Q plots of empirical versus predicted quantiles for untransformed Omnireg solution for four data subsets obtained by partitioning the location ^f(x) and ^s(x) estimates at their respective medians.

Figure 2 shows four Q–Q plots of the empirical versus predicted quantiles of the marginal distribution of age (Section 4.1) based on the untransformed solution. Each plot represents data within a joint interval of and values. The interval boundaries for are obtained by partitioning at its corresponding median (rows – bottom to top). Within each such location interval separate intervals for are constructed using its median evaluated within that location interval (columns – left to right). Thus the lower left plot is based on the  25% of the observations with the smallest locations and smallest scales. The top right involves the 25% of the observations with jointly the largest estimated locations and scales. Because of the censoring this marginal distribution is observable only at the six values that separate the censoring intervals. In Fig. 2, points representing extreme empirical  quantiles based on less than observations are not shown due to their instability. One sees In Fig. 2 that the predicted quantiles approximately follow the actual empirical quantiles. However there are a few clear discrepancies in the right plots representing the larger scale estimates.

Figure 3 shows the sequence of transformation solutions for seven iterations of the optimal transformation procedure described in Section 3. The respective transformations are defined only at the six values separating the censoring intervals. To aid interpretation lines connecting the corresponding points representing the same solution are included. Here blue represents the result of the first iteration, red the last, and black the intermediate ones. After about four iterations convergence appears to be achieved. The result at the seventh iteration (red) is chosen as the optimal transformation estimate . Figure 3: Solutions at successive iterations of the optimal transformation algorithm on questionnaire data. First is colored blue, two through six black and seventh red. Figure 4: Questionnaire data. Q–Q plots of empirical versus predicted quantiles for transformed ^g(y) OmniReg solution for four data subsets obtained by partitioning the corresponding location ^f(x) and ^s(x) estimates at their respective medians.

Figure 4 shows Q–Q plots of the empirical distribution of transformed age versus that predicted by its corresponding location and scale functions. The four data subsets are constructed in the same manner as in Fig. 2. Here in the transformed setting one sees a closer correspondence with model predictions especially for the larger scale values. Figure 5: Predicted probability distributions for three questionnaire data test set observations. Left: transformed setting. Right: original untansformed (age) setting. Upper: CDF, ^F(y|x). Lower: PDF, ^p(y|x). Bottom intervals represent those specified for the corresponding three censored observations.

Unlike usual regression procedures that return a single number as a prediction, OmniReg produces a function representing the predicted distribution of given . Figure 5 shows such predicted functions for

-values of three observations in the validation data set. The left plots show the distributions in the transformed setting, whereas the right plots reference the original outcome variable (age). The upper plots show the cumulative distributions and the lower ones show the corresponding probability density functions. The intervals indicated at the bottom of each plot represent the actual censoring intervals for the three chosen observations (Table 3). Perhaps the most useful representation is the predicted CDF of the original untransformed

(upper right), since from it one can directly read probability intervals for (age) predictions.

### 6.2 Online news popularity data

This data set (Fernandes, Vinagre and P. Cortez 2015) is available for download from the Irvine Machine Learning Data Repository. It summarizes a heterogeneous set of features about articles published by Mashable web site over a period of two years. The goal is to predict the number of shares in social networks (popularity). There are observations (articles). Associated with each are attributes to be used as predictor variables . These are described at the download web site https://archive.ics.uci.edu/ml/datasets/online+news+popularity#. The outcome variable for each article is its number of shares. For this analysis these data are randomly divided into observations for model construction, for selecting the number of trees in each model  (7) (8) , and observations for validation.

The number of shares in these data varies from less than 100 to over 100000 and is heavily skewed toward larger values. For data of this type it is common to apply a log–transform to the outcome variable. Figure 6 shows a histogram of for these data. There seems to be a sharp change in the nature of this distribution at shares. In the first analysis we directly model without performing any further transformation (Section 2). In the second we apply the technique of Section 3 to estimate the optimal transformation .

Since these data are uncensored one can directly analyze the predicted residuals as described in Section 4.2. Fig. 7 shows nine standardized residual Q–Q plots based on the untransformed model using nine different subsets of the validation data set. These subsets were constructed as described in Section 6.1 (Fig. 2), but with partitions at the corresponding 33% quantiles of and to create nine regions.

Each frame in Fig. 7  displays four Q-Q plots. The black lines represent a comparison of the estimated standardized residuals to a logistic distribution. For reference, the orange, green and violet lines show comparisons to the normal, Laplace and slash distributions respectively. The slash (Rogers and Tukey 1972) is the distribution of the ratio of normal and uniform random variables and has very heavy tails. The blue line in each plot is the 45–degree diagonal. Here one sees that the standardized residuals predicted by the model do not closely follow a logistic distribution (black line) for any of the location–scale defined data subsets. This suggests that inference based on this model using the log–transformation is highly suspect. Figure 7: Online news popularity data. Diagnostic Q-Q plots of actual vs. predicted distribution of untransformed standardized residuals (y− ^f(x))/^s(x) for nine data subsets delineated by joint intervals of the 1/3 quantiles of estimated location ^f(x) and scale ^s(x). The black, orange, green and violet lines respectively represent comparisons to logistic, normal, Laplace and slash distributions.

Applying the optimal transformation procedure of Section 3 to the log-transformed data one obtains the sequence of transformation estimates for each iteration shown in Fig. 8. The first estimate is colored blue, the seventh red, and the intermediates black. The blue hash marks below the abscissa delineate 1% intervals  of the -distribution. Convergence appears to occur after three iterations. Figure 8: Solutions at successive iterations of the optimal transformation algorithm on online news popularity data. First is colored blue, two through six black and seventh red. The blue hash marks below the abscissa delineate 1% intervals  of the y-distribution.

Using the model based on the seventh (red) transformation produces the corresponding standardized residual plots shown in  Fig. 9. Here the predicted residuals very closely follow a standard logistic distribution (black line) for all data subsets. A slight exception might be the lower right plot (small location, large scale) where the extreme positive residuals appear somewhat too large. This diagnostic provides little evidence of overall lack of fit of the transformed model or violation of its assumptions. Of course, one cannot absolutely exclude the possibility that there are other diagnostics that might demonstrate such evidence. Figure 9: Online news popularity data. Diagnostic Q-Q plots of actual vs. predicted distribution of optimal transformed standardized residuals (^g(y)− ^f(x))/^s(x) for nine data subsets delineated by joint intervals of the 1/3 quantiles of estimated location ^f(x) and scale ^s(x). The black, orange, green and violet lines respectively represent comparisons to logistic, normal, Laplace and slash distributions.

Figure 10 shows curves representing the model predicted probability distributions of for -values of three observations taken from the validation data set. The format is the same as in Fig. 5. The circles at the bottom of each plot represent the actual realized for these three chosen observations. Distribution asymmetry and heteroscedasticity are evident in the untransformed setting (right frames). Prediction probability intervals can be directly determined from the upper right plot. Figure 10: Predicted probability distributions for three test set observations from the online news popularity data. Left: transformed setting. Right: original untansformed (log–shares) setting. Upper: CDF, ^F(y|x). Lower: PDF, ^p(y|x). Bottom points represent the corresponding recorded log10(shares) for the three selected observations.

### 6.3 Million Song Data Set

These data are a subset of the Million Song Data Set and are also available for download from the Irvine Machine Learning Data Repository. It consists of a training data set of 463715 and a test set of 51630 song recordings. We divide a randomly selected subsample of the training data into a learning data subset of 50000 observations and one for model selection consisting of  20000 observations. All diagnostics are performed on the 51630 song test data set. There are 89 predictor variables measuring various acoustic properties of the recordings. The outcome variable is the year the recording was made ranging from 1922 to 2011. Figure 11: Distribution of recording year for the million song data set.

Figure 11 shows a histogram of the outcome variable . There are relatively few songs in the data set recorded before 1955. Modeling the data without transformations (Section 2) produces the standardized residual diagnostic plots on the test data set shown in Fig. 12. The data subsets are constructed in the same manner as for Fig. 7. Again, none of the predicted standardized residual distributions in any of the data subsets are remotely close to being standard logistic (black line). Figure 12: Million song data set. Diagnostic Q-Q plots of actual vs. predicted distribution of untransformed standardized residuals (y− ^f(x))/^s(x) for nine data subsets delineated by joint intervals of the 1/3 quantiles of estimated location ^f(x) and scale ^s(x). The black, orange, green and violet lines respectively represent comparisons to logistic, normal, Laplace and slash distributions. Figure 13: Solutions at successive iterations of the optimal transformation algorithm on million song data set. First is colored blue, two through six black and seventh red. Figure 14: Million song data set. Diagnostic Q-Q plots of actual vs. predicted distribution of transformed standardized residuals (^g(y)− ^f(x))/^s(x) for nine data subsets delineated by joint intervals of the 1/3 quantiles of estimated location ^f(x) and scale ^s(x). The black, orange, green and violet lines respectively represent comparisons to logistic, normal, Laplace and slash distributions.

Applying the optimal transformation strategy of Section 3 produces the sequence of transformation estimates for seven iterations shown in Fig. 13. There is little change after two iterations. Figure 14 shows the corresponding standardized residual Q–Q plots for the seventh transformed solution. Although not perfect, these residuals much more closely follow a logistic distribution (black line) than for the untransformed solution (Fig. 12). The residual distributions for some of the samples are seen to differ mainly for extreme positive values where the data have somewhat narrower tails.

Figure 14 suggests that the optimal transformation strategy of Section 3 based on the additive symmetric error model (13) has not produced totally accurate probability estimates. This in turn suggests trying the asymmetric procedure of Section 5. Figure 15 shows the resulting diagnostic Q–Q plots when the second asymmetric algorithm of Section 5 is applied to the original untransformed outcome data.  The result is seen to be (at best) no better than the corresponding symmetric error results (Fig. 12). Figure 15: Million song data set. Diagnostic Q-Q plots of actual vs. predicted distribution of untransformed asymmetric standardized residuals (22) for nine data subsets delineated by joint intervals of the 1/3 quantiles of estimated location ^f(x)and those of the geometric mean √^sl(x)^su(x) of the scales. The black, orange, green and violet lines respectively represent comparisons to logistic, normal, Laplace and slash distributions.

Figure 16 shows the sequence of transformations produced by the optimal transformation strategy of Section 3 used in conjunction with the asymmetric estimation procedure of Section 5. The estimated optimal transformation (red) has the same general shape as the one produced by the symmetric procedure (Fig. 13) with small to moderate differences mainly at the lower extreme. Figure 16: Solutions at successive iterations of the asymmetric optimal transformation algorithm on million song data set. First is colored blue, two through six black and seventh red.

Figure 17 shows the resulting diagnostic plots produced by the asymmetric error model in its optimally transformed setting. For this diagnostic the respective data subsets represent the validation data partitioned at the 33% quantiles of their location (mode) estimates (bottom to top) and the 33% quantiles of the geometric mean of their two scale estimates (left to right). The (asymmetric) standardized residuals (22) are seen to very closely follow a standard logistic distribution (3) for all data subsets. Figure 17: Million song data set. Diagnostic Q-Q plots of actual vs. predicted distribution of transformed asymmetric standardized residuals (22) for nine data subsets delineated by joint intervals of the 1/3 quantiles of estimated location ^f(x) and geometric mean √^sl(x)^su(x) of the scales estimates. The black, orange, green and violet lines respectively represent comparisons to logistic, normal, Laplace and slash distributions.

Figure 18 shows plots of the lower (blue) and upper (red) scale estimates against corresponding location estimates in the optimal transformed setting. Here the upper scales are seen to be almost constant (homoscedastic) whereas the lower scales vary by roughly a factor of four. Neither seem to have any association with the location estimates. Figure 18: Million song data. Upper scale estimates ^su(x) (red) and lower scale estimates ^sl(x) (blue) vs. mode ^f(x) for asymmetric optimal transformed setting. Figure 19: Predicted probability distributions based on the optimally transformed asymmetric model for three test set observations from the million song data. Left: transformed setting. Right: original untansformed (year) setting. Upper: CDF, ^F(y|x). Lower: PDF, ^p(y|x). Bottom points represent the corresponding dates for the three selected recordings.

Figure 19 shows the predicted recording year probability distributions for three recordings in the test data set based on the transformed asymmetric model. The format is the same as that in Figs. 5 and 10. Unlike the distributions in those figures that are based on symmetric transformed models, here one sees some asymmetry of the distributions in the transformed setting. In the original outcome setting (recording year) there is seen to be massive skewness and heteroscedasticity. The predictive model is pretty certain that the recording indicated by green was made after 2005. For the blue recording it predicts somewhere between 1990 and 2010. It reports having little indication as to when the recording in red was made except that it is likely, but not certainly, before the other two. The circles at the bottom of each plot represent the actual dates for these three recordings.

## 7 Ordinal regression

In the examples of Section 6 there is a known measurement scale for the values of the outcome variable . Values of as well as the interval bounds in (5) (6) reference this scale. In ordinal regression no such measurement scale exists. Only a relative order relation among the -values is specified. Usually ordinal regression is applied in the context of many ties so that there are only a few unique grouped values such as small, medium, large, extra large. For any set of joint predictor variable values the main goal is to predict where labels the groups. In this sense ordinal regression can be viewed as classification where there is an order relation among the class labels.

OmniReg with optimal transformations (Sections 2 , 3 and 5) is fundamentally a (generalized) ordinal regression procedure. Its solutions depend only on the values of the cumulative distribution (ranks) of , (16). As noted in Section 2.1 tied -values can be considered as interval censored with lower bound at the midpoint of its value and the next lower value, and upper bound the midpoint of its value and the next higher value. The lower bound of the first group can be taken to be and the upper bound of the last . This was the strategy used in Section 6.1 (Table 3).

Note that for the problem analyzed in Section 6.1 there was no fundamental requirement that the censoring intervals be on the age measurement scale. Any scale that produces the same order (grouping) could have been used. Changing the measurement scale is equivalent to changing the starting transformation for the iterative procedure of Section 3. Except for possible numerical effects the resulting solutions should be essentially equivalent.

Censoring in ordinal regression is easily incorporated with this approach. If it is unknown which of several adjacent groups contains training observation , its lower bound is set to the minimum of the lower bounds over those groups, and its upper bound is set to the maximum of the corresponding group upper bounds. Also, with this approach to ordinal regression computation does not depend on the number of groups.

The predicted probability of an observation being in the th group is given by where and are respectively the lower and upper bound specified for the th group on the chosen initial measurement scale. is given by (18) for symmetric models, and (21) for asymmetric models, based on the OmniReg function estimates , , and or , .

## 8 Related work

The OmniReg procedure presented here combines approaches from several separate topics each of which has a long and rich history in Statistics, Econometrics and Machine Learning. The procedure presented in Section 2 is a straight forward generalization of Tobit analysis (Tobin 1958) to incorporate logistic errors, general censoring, heteroscedasticity, and general nonparametric function estimation via boosted tree ensembles. There are many previous works that incorporate various subsets of these generalizations.

### 8.1 Censoring

Although various forms of censoring actually occur in many types of applied regression problems, it has been studied mainly in the context of survival analysis where the outcome variable  is time to some event. Right censoring occurs when that time exceeds the end of the study. The most popular tool for survival analysis is the Cox proportional hazards model (Cox 1972) which estimates the hazard function up to a multiplicative function of time

 λ(y|x)=p(y|x)/(1−CDF(y|x))=h(y)⋅exp(f(x)).

Here is an unknown baseline hazard and the function is estimated from the data. Cox originally proposed a linear model , but various authors have since presented nonlinear generalizations. OmniReg (Sections 2 and 3) can be applied to survival data with mixtures of any type of censoring. Because it estimates it can be used to directly estimate the absolute hazard .

### 8.2 Transformations

Transforming an outcome variable in order to increase its compatibility with regression procedure assumptions has a long history in Statistics. Mosteller and Tukey (1977) proposed a ladder of reexpressions for different variable types based upon their realizable values. Box and Cox (1964) proposed the first data driven approach by selecting a power transformation for numeric variables that maximizes the data Gaussian likelihood in the context of homoscedasticity and linear models.

Gifi (1990), and Breiman and Friedman (1985), proposed estimating outcome transformations for additive modeling by minimizing

 N∑i=1(g(yi)−p∑j=1hj(xij))2/N∑i=1g2(yi) (23)

jointly with respect to centered functions and under a smoothness constraint. With Gifi (1990) smoothness was imposed by restricting all functions to be in the class of monotone splines with a given number and placement of knots. Solving (23) then becomes a canonical correlation problem implemented by alternating linear least–squares parameter estimation. Breiman and Friedman (1985) directly used nonparametric data smoothers to estimate all functions also using an alternating approach similar to that used in Section 3.  Both approaches assume an additive homoscedastic model for as a function of .

Solutions to (23) maximize the correlation between and

over the joint distribution

of and . As a result the solution transformations depend on the marginal distribution of the predictor variables (Buja 1990). A consequence is that the expected solutions of (23) applied to data arising from the regression model

 g∗(yi)=p∑j=1h∗j(xij)+εi

with will not generally be and . Distributions of that tend to exhibit clustering are especially problematic.

In order to alleviate this problem Tibshirani (1988) proposed finding transformations and that that minimize

 N∑i=1(g(yi)−p∑j=1hj(xj))2

where the monotone function

is taken to be the variance stabilizing transformation

 Var(g(y)|p∑j=1hj(xj))= \emph{% constant}. (24)

Solutions are obtained by an alternating optimization algorithm.

In the language of this paper (24) will hold if the location function (13) is additive in and the scale function is independent of . In particular, (24) does not necessarily imply homoscedasticity constant in the transformed setting . For all the examples in this paper the association between the location and scale estimates for the optimal happened to be quite weak but there was still considerable variation in scale, as with for example in Fig. 18.

The transformation strategy outlined in Sections 2, 3 and 5 is directly regression based in that it is conditional on . The basic assumption is that the standardized residuals (19) (22) follow a standard logistic distribution (3) at any irrespective of the distribution of . Also note that the estimate obtained from (16) is automatically monotonic.

### 8.3 Quantile regression

For known -values and in the absence of censoring a natural alternative to the procedures described in Sections 2, 3 and 5 is quantile regression based on gradient boosted trees. One can estimate the th quantile function, , of using the loss function

 L(y,qp)=(y−qp)[pI(y≥qp)+(p−1)I(y

Friedman (2001) suggested using (25

) in the context of gradient boosting and it has been incorporated in many implementations (see Ridgeway 2007 and Pedregosa 2011). Recently, Athey, Tibshirani, and Wagner (2017) implemented quantile regression in random forests. With boosted tree quantile regression (

25) simply replaces (5) as the loss function in the procedure outlined in Sections 2.2 and 2.3, and only one function is estimated separately for each quantile . One can use this procedure to independently estimate functions for several quantiles , and then  for any use to approximate the corresponding quantiles of . This approach makes no assumptions concerning a generating model for , such as (13) or (20). It thus may provide useful results in applications where those assumptions are seriously violated. However, when they are not decreased accuracy can result.

### 8.4 Direct maximum likelihood estimation

For -values on a known measurement scale, another natural alternative to the procedure described in Sections 2, 3 and 5 is applying direct maximum likelihood such as in GAMLSS (Rigby and Stasinopoulos 2006). A parameterized probability density for is hypothesized and then its parameters are fitted as functions of the predictor variables by maximum likelihood. Parametric, linear and additive functions are considered. These are fit to data by numerically maximizing the corresponding (non convex) log–likelihood with respect to the parameters defining all of the functions of

. The assumed probability density can have parameters controlling its location, scale, skewness and kurtosis which are all modeled as functions of the predictors in this manner.

A similar paradigm is used in Sections 2 and 5 based on the two and three parameter logistic distributions. A difference is that the parameters are taken to be more flexible functions of as represented by boosted decision tree ensembles. It is straightforward to generalize the gradient tree boosting strategy of Section 2 and 5 to other distributions with more that two or three parameters as functions of the predictor variables . Whether such added complexity translates to improved or reduced accuracy will be application dependent. In any application the diagnostics described in Sections 4.1 and 4.2  may be helpful in assessing model deficiencies.

The central difference between the GAMLSS method and OmniReg lies in the optimal transformation strategy (Section 3). The GAMLSS approach assumes that the hypothesized parametric distribution describes . OmniReg assumes that there exists some transformation for which follows the specified distribution. It then nonparametrically estimates that transformation jointly with its corresponding distribution parameters as functions of . This substantially enlarges the scope of application. The data examples of Sections 6.2 and 6.3 illustrate the gains that  can be made by this transformation strategy. They also illustrate that the optimal transformations for a given problem need not be close to any function in commonly assumed parametric families such as the power family .

Arrays of multiple Q–Q plots each based on different regions of the -space were proposed by van Buuren and Fredriks (2001). The diagnostic plots of Section 4.2 are a straightforward generalization. Those of Section 4.1 can be viewed as a generalized version for arbitrarily censored data.

### 8.5 Ordinal regression

There is a large literature on ordinal regression (see McCCullagh 1980 and Gutiérrez et. al.

2016). A common method is to apply a sequence of binary linear logistic regressions to separately estimate the probability of being below each of the successive group boundaries. The probability of being within each of the groups is then obtained by successive differences between these estimates. In order to avoid negative values the coefficients of all linear models are constrained to be the same, with only the intercepts changing.

The approach most similar in spirit to the one here is Messner et. al. (2014). They propose a statistical model similar to (3) (13). The location and log-scale functions are modeled as linear functions of the predictor variables by numerical maximum likelihood. The transformation is preselected and not systematically fit to the data.

## 9 Point prediction

The focus of this work is on the estimation of the distribution functions . These can be used to infer likely values of for given -vectors. Sometimes the goal of regression is to return a single value that minimizes the prediction risk at

 h(x)=argminh∫L(y,h)p(y|x)dy. (26)

Here represents a loss, relevant to the application at hand, for predicting when the actual value is . One way to attempt to solve (26) is

 ^h(x)=argminh(x)∈HN∑i=1L(yi,h(xi)). (27)

Here is a class of functions usually defined by the fitting procedure employed. An alternative strategy is to use

 ^h(x)=argminh∫L(y,h)^p(y|x)dy (28)

where is an estimate of , for example through (4) (5) (6). For some and , (28) can be solved analytically. Otherwise it can be easily solved by numerical methods. Note that there is no requirement that be convex in (28). Usually (27) is used when there is no censoring, and (28) is applied when censoring is present. Which of the two is best in terms of accuracy in any particular application depends on the specific nature of that application. If there exists a validation data set not used for training with known (uncensored) -values one can use

 CV=∑i∈VL(yi,^h(xi)) (29)

to estimate the accuracy of any . An advantage of (28) is that solutions for a variety of loss functions , as well as other statistics, can be easily obtained without having to recompute .

## 10 Discussion

The goal of this work (OmniReg) is to provide a general unified procedure that can be used for predictive inference in a variety of regression problems. These include heteroscedasticity, asymmetry, ordinal regression, and general censoring.

Censoring has received relatively little attention in machine learning even though it is a common occurrence in regression problems. Often there are restrictions on the measured values of the outcome . Section 6.1 illustrates a case where measurements of a continuous outcome (age) are restricted to six discrete values (intervals). Another common restriction is where the observed -values cannot be negative, such as payout on insurance policy claims. There is usually a large mass of zero–valued outcomes along with some positive ones. In this case it is not straight forward to formulate and estimate a corresponding . One way (Tobin 1958) is to consider the observed -values as censored below zero measurements of a latent variable with an unrestricted . This can then be estimated from the data using the techniques of Sections 2, 3 and 5.

Formal predictive inference has also received relatively little attention. However, informal inference is at the heart of prediction. The presumption is that a predicted value is “somewhere close” to the actual outcome -value. Predictive inference simply quantifies this concept through an estimated probability distribution. As seen in the data examples of Section 6 the nature of “somewhere close” can be very different for different predictions in the same problem. Also this probability distribution can be used to obtain point estimates for any loss function through (28). These can sometimes be more accurate than corresponding direct estimates (27).

A central feature of OmniReg is the optimal transformation strategy of Section 3. This extends the applicability of the method to a much wider class of problems. It is also at the heart of the ordinal regression strategy of Section 7. As seen in the data examples, a particular assumed probability distribution  (here (4) or (20)) is seldom appropriate for the original outcome . However, there is often a corresponding optimal transformation for which it is much more appropriate. This optimal transformation may not be close to any member of a common parametric family as was seen in Figs. 8, 13 and 16.

The basic OmniReg method is indifferent to the technique used to obtain the parameter function estimates , , or , . An iterative strategy based on gradient boosted trees was used in Sections 2.3, 3 and 5. This seems to work well in a variety of applications. However, there may be some for which other methods work as well or better. A crucial but delicate ingredient is regularization, especially for the location estimate. An overfitted location estimate will cause a severe negative bias in the log–scale estimates or