1 Introduction
In machine learning and statistics the observed features may be very nonnormal (nongaussian) and asymmetric (skewed) which often complicates the next steps of the analysis. Therefore it is customary to preprocess the data by transforming such features in order to bring them closer to normality, after which it typically becomes easier to fit a model or to make predictions. To be useful in practice, it must be possible to automate this preprocessing step.
In order to transform a positive variable to give it a more normal distribution one often resorts to a power transformation (see e.g.
[10]). The most often used function is the BoxCox (BC) power transform studied by [3], given by(1) 
Here stands for the observed feature, which is transformed to using a parameter . A limitation of the family of BC transformations is that they are only applicable to positive data. To remedy this, Yeo and Johnson [12] proposed an alternative family of transformations that can deal with both positive and negative data. These YeoJohnson (YJ) transformations are given by
(2) 
and are also characterized by a parameter . Figure 1 shows both of these transformations for a range of values. Note that yields a linear relation. Transformations with compress the right tail of the distribution while expanding the left tail, making them suitable for transforming rightskewed distributions towards symmetry. Similarly, transformations with are designed to make leftskewed distributions more symmetrical.
Estimating the parameter for the BC or YJ transformation is typically done using maximum likelihood, under the assumption that the transformed variable follows a normal distribution. However, it is well known that maximum likelihood estimation is very sensitive to outliers in the data, to the extent that a single outlier can have an arbitrarily large effect on the estimate. In the setting of transformation to normality, outliers can yield transformations for which the bulk of the transformed data follows a very skewed distribution, so no normality is attained at all. In situations with outliers one would prefer to make the nonoutliers approximately normally distributed, while the outliers may stay outlying. So, our goal is to achieve central normality, where the transformed data look roughly normal in the center and a few outliers may deviate from it. Fitting such a transformation is not easy, because a point that is outlying (as determined by, say, a boxplot rule) in the original data may not be outlying in the transformed data, and vice versa. The problem is that we do not know beforehand which points may turn out to be outliers in the optimally transformed data.
Some proposals exist in the literature to make the estimation of the parameter
in BC more robust against outliers, mainly in the context of transforming the response variable in a regression (
[4, 8, 9]), but here we are not in that setting. For the YJ transformation there are very few robust methods available. In [11] a trimmed maximum likelihood approach was explored, in which the objective is a trimmed sum of log likelihoods in which the lowest terms are discarded. We will study this method in more detail later.Note that both the BC and YJ transformations suffer from the complication that their range depends on the parameter . In particular, for the BC transformation we have
(3) 
whereas for the YJ transformation we have:
(4) 
So, for certain values of
the range of the transformation is a half line. This is not without consequences. First, most wellknown symmetric distributions are supported on the entire line, so a perfect match is impossible. More importantly, we argue that this can make outlier detection more difficult. Consider for instance the BC transformation with
which has the range . Suppose we transform a data set to . If we let making it an extremely clear outlier in the original space, then in the transformed space. So a transformed outlier can be much closer to the bulk of the transformed data than the original outlier was in the original data. This is undesirable, since the outlier will be much harder to detect this way. This effect is magnified if is estimated by maximum likelihood, since this estimator will try to accommodate all observations, including the outliers.We illustrate this point using the TopGear dataset ([1]) which contains information on 297 cars, scraped from the website of the British television show Top Gear. We fit a BoxCox transformation to the variable miles per gallon (MPG) which is strictly positive. The left panel of Figure 2 shows the normal QQplot of the MPG
variable before transformation. (That is, the horizontal axis contains as many quantiles from the standard normal distribution as there are sorted data values on the vertical axis.) In this plot the majority of the observations seem to roughly follow a normal distribution, that is, many points in the QQplot lie close to a straight line. There are also three far outliers at the top, which correspond to the Chevrolet Volt and Vauxhall Ampera (both with 235 MPG) and the BMW i3 (with 470 MPG). These cars are unusual because they derive most of their power from a plugin electric battery, whereas the majority of the cars in the data set are gaspowered. The right panel of Figure
2 shows the BoxCox transformed data using the maximum likelihood (ML) estimate , indicating that the BC transformation is fairly close to the log transform. We see that this transformation does not improve the normality of the MPG variable. Instead it tries to bring the three outliers into the fold, at the expense of causing skewness in the central part of the transformed data and creating an artificial outlier at the bottom.The variable Weight shown in Figure 3 illustrates a different effect. The original variable has one extreme and 4 intermediate outliers at the bottom. The extreme outlier is the Peugeot 107, whose weight was erroneously listed as 210 kilograms, and the next outlier is the tiny Renault Twizy (410 kilograms). Unlike the MPG variable the central part of these data is not very normal, as those points in the QQplot do not line up so well. A transform that would make the central part more straight would expose the outliers at the bottom more. But instead the ML estimate is hence close to which would correspond to not transforming the variable at all. Whereas the MPG variable should not be transformed much but is, the Weight variable should be transformed but almost isn’t.
In section 2 we propose a new robust estimator for the parameter , and compare its sensitivity curve to those of other methods. Section 3 presents a simulation to study the performance of several estimators on clean and contaminated data. Section 4 illustrates the proposed method on real data examples, and section 5 concludes.
2 Methodology
2.1 Fitting a transformation by minimizing a robust criterion
The most popular way of estimating the of the BC and YJ transformations is to use maximum likelihood (ML) under the assumption that the transformed variable follows a normal distribution, as will briefly be summarized in subsection 2.3. However, it is well known that ML estimation is very sensitive to outliers in the data and other deviations from the assumed model. We therefore propose a different way of estimating the transformation parameter of a transformation.
Consider an ordered sample of univariate observations and a symmetric target distribution . Suppose we want to estimate the parameter of a nonlinear function such that
come close to quantiles from the standard normal cumulative distribution function
. We propose to estimate as:(5) 
Here is the Huber Mestimate of location of the , and is their Huber Mestimate of scale. Both are standard robust univariate estimators (see [6]). The
are the usual equispaced probabilities that also yield the quantiles in the QQplot (see, e.g., page 225 in
[5]). The function needs to be positive, even and continuously differentiable. In least squares methods , but in our situation there can be large absolute residuals caused by outlying values of . In order to obtain a robust method we need a bounded function. We propose to use the wellknown Tukey bisquare function given byThe constant is a tuning parameter, which we set to 0.5 by default here.
To calculate numerically, we evaluate the
objective in Equation 5 on a grid of
candidate values , and pick the one
minimizing the objective.
2.2 Rectified BoxCox and YeoJohnson transformations
In this section we propose a modification of the classical BC and YJ transformations, called the rectified
BC and YJ transformations. They make a continuously differentiable switch to a linear transformation in the tails of the BC and YJ functions. The purpose of these modified transformations is to remedy two issues. First, the range of the classical BC and YJ transformations depends on
and is often only a half line. And second, as argued in the introduction, the classical transformations often push outliers closer to the majority of the data, which makes the outliers harder to detect. Instead the range of the proposed modified transformations is always the entire real line, and it becomes less likely that outliers are masked by the transformation.For , the BC transformation is designed to make rightskewed distributions more symmetrical, and is bounded from above. In this case we define the rectified BC transformation as follows. Let be positive value. The rectified BC transformation is defined as
(6) 
Similarly, for and a positive we put
(7) 
For the YJ transformation we construct rectified counterparts in a similar fashion. For and a value we define the rectified YJ transformation as in (6) with replaced by :
(8) 
Analogously, for and we define as in (7) with replaced by :
(9) 
Figure 4 shows
such rectified BC and YJ transformations.
What are good choices of and
? Since the original data is often asymmetric, we cannot just use a center (like the median) plus or minus a fixed number of (robust) standard deviations. Instead we set
equal to the first quartile of the original data, and for
we take the third quartile. Other choices can be used, but more extreme quantiles would yield a higher sensitivity to outliers.2.3 Reweighted maximum likelihood
We now describe a reweighting scheme to increase the accuracy of the estimated while preserving its robustness. For a data set the classical maximum likelihood estimator for the YeoJohnson transformation parameter is given by the which maximizes the normal loglikelihood. After removing constant terms this can be written as:
(10) 
where is the maximum likelihood scale of the transformed data given by
(11) 
The last term in (10) comes from the derivative of the YJ transformation. Criterion (10
) is sensitive to outliers since it depends on a classical variance and the unbounded term
. This can be remedied by using weights. Given a set of weights we define a weighted maximum likelihood (WML) estimator by(12) 
where now denotes the weighted variance of the transformed data:
(13) 
If the weights appropriately downweight the outliers in the data, the WML criterion yields a more robust estimate of the transformation parameter.
For the BC transform the reasoning is analogous, the only change being the final term that comes from the derivative of the BC transform. This yields
(14) 
In general, finding robust data weights is not an easy task. The problem is that the observed data can have a (very) skewed distribution and there is no straightforward way to know which points will be outliers in the transformed data when is unknown. But suppose that we have a rough initial estimate of . We can then transform the data with yielding , which should be a lot more symmetric than the original data. We can therefore compute weights on using a classical weight function. Here we will use the ”hard rejection rule” given by
(15) 
with and as in (5). With these weights we can compute a reweighted estimate by the WML estimator in (12). Of course, the robustness of the reweighted estimator will depend strongly on the robustness of the initial estimate .
Note that the reweighting step can be iterated, yielding a multistep weighted ML estimator. In our simulation study we found that more than 2 reweighting steps provided no further improvement in terms of accuracy. We will always use two reweighting steps from here onward.
2.4 The proposed method
Combining the above ideas, our proposed reweighted maximum likelihood (RewML) method consists of the following steps:

Step 1. Compute the initial estimate by maximizing the robust criterion (5). When fitting a BoxCox transformation, replace by the rectified function . When fitting a YeoJohnson transformation, replace by the rectified function .

Step 3. Repeat step 2 once and stop.
2.5 Other estimators of
We will compare our proposal with two existing
robust methods.
The first is the robustified ML estimator proposed by Carroll in 1980 ([4]). The idea was to replace the variance in the ML formula (10) by a robust variance estimate of the transformed data. Carroll’s method was proposed for the BC transformation, but the idea can be extended naturally to the estimation of the parameter of the YJ transformation. The estimator is then given by
(16) 
where denotes the usual Huber Mestimate of scale [6] of the transformed data set .
2.6 Sensitivity curves
In order to assess robustness against an outlier, stylized sensitivity curves were introduced on page 96 of [2]. For a given estimator and a cumulative distribution function they are constructed as follows:

Generate a stylized pseudo data set of size :
where the for are equispaced probabilities that are symmetric about 1/2. We can for instance use .

Add to this stylized data set a variable point to obtain

Calculate the sensitivity curve in by
where ranges over a grid chosen by the user. The purpose of the factor is to put sensitivity curves with different values of on a similar scale.
The top panel of Figure 5 shows the sensitivity curves for several estimators of the parameter of the YJ transformation. We chose so the true transformation parameter is 1, and . The maximum likelihood estimator ML of (10) has an unbounded sensitivity curve, which is undesirable as it means that a single outlier can move arbitrarily far away. The estimator of Carroll (16) has the same property, but is less affected in the sense that for a high the value of is smaller than for the ML estimator. The RewML estimator that we proposed in subsection 2.4 has a sensitivity curve that lies close to that of the ML in the central region of , and becomes exactly zero for more extreme values of . Such a sensitivity curve is called redescending, meaning that it goes back to zero. Therefore a far outlier has little effect on the resulting estimate. We also show MTL95, the trimmed likelihood estimator described in subsection 2.5 with . Its sensitivity curve is also redescending, but in the central region it is more erratic with several jumps.
The lower panel of Figure 5 shows the
sensitivity curves for the BoxCox transformation
when the true parameter is , i.e. the
clean data follows a lognormal distribution .
We now put on the horizontal axis, since
this makes the plot more comparable to that for
YeoJohnson in the top panel.
Also here the ML and Carroll’s estimator have an
unbounded sensitivity curve. Our RewML estimator
has a redescending SC which again behaves similarly to
the classical ML for small , whereas the
sensitivity to an extreme outlier is zero.
The maximal trimmed likelihood estimator MTL95
has large jumps reaching values over 40 in the
central region.
Those peaks are not shown because the other curves
would be hard to distinguish at that scale.
3 Simulation
3.1 Compared Methods
For the BoxCox as well as the YeoJohnson transformations we perform a simulation study to compare the performance of several methods, including our proposal. The estimators under consideration are:
3.2 Data generation
We generate clean data sets as well as data
with a fraction of outliers.
The clean data are produced
by generating a sample of size from the standard
normal distribution, after which the inverse of the BC
or YJ transformation with a given is applied.
For contaminated data we replace a percentage
of the standard normal data by outliers at a fixed
position before the inverse transformation is applied.
For each such combination of and we
generate data sets.
To be more precise, the percentage of contaminated points takes on the values 0, 0.05, 0.1, and 0.15, where corresponds to uncontaminated data. The parameter characterizing the position of the contamination is an integer that can take on values from 0 to 10. For the YJ transformation we take the true transformation parameter equal to 0.5, 1.0, or 1.5 . We chose these values because for between 0 and 2 the range of YJ given by (4) is the entire real line, so the inverse of YJ is defined for all real numbers. For the BC transformation we take for which the range (3) is also the real line. For a given combination of , and the steps of the data generation are:

Generate a sample from the standard normal distribution. Then replace a fraction of the points in by itself when , and by when .

Apply the inverse BC transformation to , yielding the data set given by. For YJ we put .

Estimate from using the methods described in subsection 3.1.
We then estimate the bias and mean squared error (MSE) of each method by
where ranges over the generated data sets.
3.3 Results for the YeoJohnson transformation
We first consider the effect of an increasing
percentage of contamination on the different estimators.
In this setting we fix the position of the contamination
by setting .
Figure 6 shows the bias and MSE
of the estimators for an increasing contamination
percentage on the horizontal axis.
The results in the top row are for data generated with
, whereas the middle row was generated
with and the bottom row with
.
In all rows the classical ML and the Carroll estimator
have the largest bias and MSE, meaning they react
strongly to far outliers, as suggested by their
unbounded sensitivity curves in Figure 5.
In contrast with this both RewML and RewML2 perform much
better as their bias and MSE are closer to zero.
Up to about of outliers their curves are almost
indistinguishable, but beyond that RewML outperforms
RewML2 by a widening margin. This indicates that using
the rectified YJ transform in the first step of the
estimator (see subsection 2.4) is more robust
than using the plain YJ in that step, even though
the goal of the entire 3step procedure RewML is to
estimate the of the plain YJ transform.
In the same Figure 6 we see
the behavior of the maximum trimmed likelihood
estimators MTL95, MTL90 and MTL85.
In the middle row is 1, and we see that
MTL95, which fits of the data, performs well
when there are up to of outliers and
performs poorly when there are over of
outliers.
Analogously MTL90 performs well as long as there
are at most of outliers, and so on.
This is the intended behavior.
But note that for these estimators
also have a substantial bias when the fraction of
outliers is below what they aim for, as can
be seen in the top and bottom panels of
Figure 6.
For instance MTL85 is biased when is under
, even for when there are no
outliers at all.
So overall the MTL estimators only performed well
when the percentage of trimming was equal to
1 minus the percentage of outliers in the data.
Since the true percentage of outliers is almost
never known in advance, it is not recommended to
use the MTL method for variable transformation.
Let us now investigate what happens if we
keep the percentage of outliers fixed, say at
, but vary the position of the
contamination by letting .
Figure 7 shows the resulting bias
and MSE, with again in the top row,
in the middle row,
and in the bottom row.
For and the ML, Carroll, RewML and RewML2
methods give similar results, since the contamination
is close to the center so it cannot be considered
outlying.
But as increases the classical ML and the Carroll
estimator become heavily affected by the outliers.
On the other hand RewML and RewML2 perform much
better, and again RewML outperforms RewML2.
Note that the bias of RewML moves toward zero
when is large enough. We already noted this
redescending behavior in its sensitivity curve
in Figure 5.
The behavior of the maximum trimmed likelihood methods depends on the value of used to generate the data. First focus on the middle row of Figure 7 where so the clean data is generated from the standard normal distribution. In that situation both MTL90 and MTL85 behave well, whereas MTL95 can only withstand of outliers and not the generated here. One might expect the MTL method to work well as long as its excludes at least the number of outliers in the data. But in fact MTL85 does not behave so well when differs from 1, as seen in the top and bottom panels of Figure 7, where the bias remains substantial even though the method expects 15% of outliers and there are only 10% of them. As in Figure 6 this suggests that one needs to know the actual percentage of outliers in the data in order to select the appropriate for the MTL method, but that percentage is typically unknown.
3.4 Results for the BoxCox transformation
When simulating data to apply the BoxCox transformation to, the most natural choice of
is zero since this is the only value for which the range of BC is the entire real line. Therefore we can carry out the inverse BC transformation on any data set generated from a normal distribution, so the clean data follows a lognormal distribution. The top panel of Figure
8 shows the bias and MSE for of outliers with . We see that the classical ML and the estimator of Carroll are sensitive to outliers when grows. Our reweighted method RewML performs much better. The RewML2 method only differs from RewML in that it uses the nonrectified BC transform in the first step, and does not do as well since its bias goes back to zero at a slower rate.The MTL estimators perform poorly here.
The MTL95 version trims only of the data points
so it cannot discard the 10% of outliers, leading
to a behavior resembling that of the classical ML.
But also MTL90 and MTL85, which trim enough data
points, obtain a large bias which goes in the
opposite direction, accompanied by a high MSE.
The MSE of MTL90 and MTL85 lie entirely above the
plot area.
Finally, we consider a scenario with . In that case the range of the BoxCox transformation given by (3) is only so the transformed data cannot be normally distributed (which is already an issue for the justification of the classical maximum likelihood method). But the transformed data can have a truncated normal distribution. In this special setting we generated data from the normal distribution with mean 1 and standard deviation , and then truncated it to (keeping points), so the clean data are strictly positive and have a symmetric distribution around 1. In the bottom panel of Figure 8 we see that the ML and Carroll estimators are not robust in this simulation setting. The trimmed likelihood estimators also fail to deliver reasonable results, with curves that often fall outside the plot area. On the other hand RewML still performs well, and again does better than RewML2.
4 Real data examples
4.1 Car data
Let us revisit the positive variable MPG from the TopGear data shown in the left panel of Figure 2. The majority of these data are already roughly normal, and three far outliers at the top deviate from this pattern. Before applying a BoxCox transformation we first scale the variable so its median becomes 1. This makes the result invariant to the unit of measurement, whether it is miles per gallon or, say, kilometers per liter. The maximum likelihood estimator for BoxCox tries to bring in the outliers and yields , which is close to corresponding to a logarithmic transformation. The resulting transformed data in the left panel of Figure 9 are quite skewed in the central part, so not normal at all, which defeats the purpose of the transformation. This result is in sharp contrast with our reweighted maximum likelihood (RewML) method which estimates the transformation parameter as . The resulting transformed data in the right panel does achieve central normality.
The variable Weight in the left panel of Figure 3 is not very normal in its center and has some outliers at the bottom. The classical ML estimate is , close to which would not transform the data at all, as we can see in the resulting left panel of Figure 10. In contrast, our RewML estimator obtains which substantially transforms the data, yielding the right panel of Figure 10. There the central part of the data is very close to normal, and the outliers at the bottom now stand out more, as they should.
4.2 Glass data
For a multivariate example we turn to the glass data [7] from chemometrics, which has become something of a benchmark. The data consists of archeological glass samples, which were analyzed by spectroscopy. Our variables are the intensities measured at 500 wavelengths. The variables are strictly positive, and many do not look normally distributed.
We first applied a BoxCox transformation to each variable with obtained from the ML method of (10). For each variable we then standardize the transformed data to where and are given by (11). This yields a standardized transformed data set with again 180 rows and 500 columns. In order to detect outliers in this matrix we compare each value to the interval which has a probability of exactly for standard normal data. The top panel of Figure 11 is a heatmap of the standardized transformed data matrix where each value within is shown as yellow, values above are red, and values below are blue. This heatmap is predominantly yellow because the ML method tends to transform the data in a way that masks outliers, so not much structure is visible.
Next, we transformed each variable by BoxCox with obtained by the RewML method. The variables are standardized accordingly to where and are given by (13) using the final weights in (14). The resulting heatmap is in the bottom panel of Figure 11. Here we see much more structure, with red regions corresponding to glass samples with unusually high spectral intensities at certain wavelengths. This is because the RewML method aims to make the central part of each variable as normal as possible, which allows outliers to deviate from that central region. The resulting heatmap has a subjectmatter interpretation since wavelengths correspond to chemical elements. It indicates that some of the glass samples (with row numbers between 22 and 30) have a higher concentration of phosphor, whereas rows 57–63 and 74–76 had an unusually high amount of calcium. The red zones in the bottom part of the heatmap were caused by the fact that the measuring instrument was cleaned before recording the last 38 spectra.
5 Conclusion
In our view, a transformation to normality should fit the central part of the data well, and not be determined by any outliers that may be present. This is why we aim for central normality, where the transformed data is close to normal (gaussian) with the possible exception of some outliers that can remain further out. Fitting such a transformation is not easy, because a point that is outlying (as determined by, say, a boxplot rule) in the original data may not be outlying in the transformed data, and vice versa.
To address this problem we introduce a combination of three ideas: a highly robust objective function (5), the rectified BoxCox and YeoJohnson transforms in subsection 2.2 which we use in our initial estimate, and a reweighted maximum likelihood procedure for transformations. This combination turns out to be a powerful tool for this difficult problem.
Preprocessing real data by this tool paves the
way for applying subsequent methods, such as anomaly
detection and wellestablished model fitting and
predictive techniques.
Software availability.
The R code and a script
reproducing the examples can be downloaded from the
website
http://wis.kuleuven.be/statdatascience/robust/software .
Acknowledgement. This work was supported by grant C16/15/068 of KU Leuven.
References

[1]
Alfons, A.: robustHD: Robust Methods for HighDimensional Data (2019).
URL https://CRAN.Rproject.org/package=robustHD. R package version 0.6.1  [2] Andrews, D., Bickel, P., Hampel, F., Huber, P., Rogers, W., Tukey, J.: Robust Estimates of Location. Princeton University Press, Princeton, NJ (1972)
 [3] Box, G.E.P., Cox, D.R.: An analysis of transformations. Journal of the Royal Statistical Society. Series B 26, 211–252 (1964). URL http://www.jstor.org/stable/2984418
 [4] Carroll, R.J.: A robust method for testing transformations to achieve approximate normality. Journal of the Royal Statistical Society. Series B 42, 71–78 (1980). URL http://www.jstor.org/stable/2984740
 [5] Hoaglin, D., Mosteller, F., Tukey, J.: Understanding Robust and Exploratory Data Analysis. Wiley, New York (1983)
 [6] Huber, P.: Robust Statistics. Wiley, New York (1981)
 [7] Lemberge, P., De Raedt, I., Janssens, K., Wei, F., Van Espen, P.: Quantitative zanalysis of 16th17th century archaeological glass vessels using PLS regression of EPXMA and XRF data. Journal of Chemometrics 14, 751–763 (2000)
 [8] Marazzi, A., Villar, A.J., Yohai, V.J.: Robust response transformations based on optimal prediction. Journal of the American Statistical Association 104, 360–370 (2009)
 [9] Riani, M.: Robust transformations in univariate and multivariate time series. Econometric Reviews 28, 262–278 (2008)
 [10] Tukey, J.W.: On the comparative anatomy of transformations. Annals of Mathematical Statistics 28, 602–632 (1957)
 [11] Van der Veeken, S.: Robust and nonparametric methods for skewed data. Ph.D. thesis, KU Leuven, Belgium (2010)
 [12] Yeo, I.K., Johnson, R.A.: A new family of power transformations to improve normality or symmetry. Biometrika 87, 954–959 (2000). URL http://www.jstor.org/stable/2673623
Comments
There are no comments yet.