Python package for information theory.
We describe an approach to improving model fitting and model generalization that considers the entropy of distributions of modelling residuals. We use simple simulations to demonstrate the observational signatures of overfitting on ordered sequences of modelling residuals, via the autocorrelation and power spectral density. These results motivate the conclusion that, as commonly applied, the least squares method assumes too much when it assumes that residuals are uncorrelated for all possible models or values of the model parameters. We relax these too-stringent assumptions in favour of imposing an entropy prior on the (unknown, model-dependent, but potentially marginalizable) distribution function for residuals. We recommend a simple extension to the Mean Squared Error loss function that approximately incorporates this prior and can be used immediately for modelling applications where meaningfully-ordered sequences of observations or training data can be defined.READ FULL TEXT VIEW PDF
Python package for information theory.
A library of information-theoretic methods for data analysis and machine learning, implemented in Python and NumPy.
The method of least squares is the dominant approach in regression problems. The Mean Squared Error (MSE) is frequently therefore the default
loss function in many practical applications: and is commonly used for regression in fields as diverse as machine learning, statistics, physical sciences, decision theory, econometrics, and finance. It is criticised for over-use and a lack of robustness(e.g. berger85, ), but has yet to be displaced. In this paper we will level another criticism at the MSE and other similarly-constructed loss functions: they leave out important information as a result of their too-strict assumptions about the uncorrelatedness of model residuals.
Statistical analysis of correlated residuals has a long history in econometrics and time series analysis, that began with tests of the presence of autocorrelation at lag in residuals from least squares regression (vonneumann41, ; durbinwatson50, ; durbinwatson51, ; durbinwatson71, ). Tests were later developed for the presence of autocorrelation at any lag (boxpierce70, ; ljungbox78, ). The question became acute as econometrics developed, as autocorrelations in residuals for autoregressive time series models (where some of the regressors are lagged dependent variables) leads in general to non-consistency and bias. Sophisticated tests have therefore been developed to help validate such models (breusch78, ; godfrey78, ; proia18, ) and make them robust to further confounding real-world factors such as conditional heteroskedasticity (e.g. godfreytremayne05, ).
While prior work in this area has to been to devise tests of autocorrelation that can be applied after best-fitting regression models have been determined, e.g. by least squares, we propose that the loss function itself could be extended to penalize models showing autocorrelated residuals during the process of fitting. Such correlations can either be created by the presence of correlated errors in observations of the independent variable, or by the application of the wrong model. In the first Section we illustrate this for the important category of overfitting (or over-specified) models by performing simple simulations of model fitting in one dimension.
In later Sections we make a simple entropy argument for an extension to the MSE loss function that is readily and efficiently calculated, at least in one dimension or wherever residuals can be meaningfully ordered (e.g. time series forecasting). We illustrate from our simulations how this modified loss function will effectively penalize overfitting models, potentially providing natural regularization, better generalization and enhanced robustness to outliers.
Consider a sample of data values for our dependent variable of interest , situated at corresponding locations
in the indpendent variable. We will construct a regression model for estimates ofwhich we will denote
for a vector of model parameters. The residuals between the model and the data are defined as where for each element of the sample.
To explore the effects of overfitting, we will draw as i.i.d. standard normal variables, the distribution of residuals you would expect from a good fit to the data, and then attempt to fit these values. The aim is to simulate what happens when a model is given unnecessary additional freedom.
We simulate a simple case of observations for our randomly-drawn sample , with corresponding evenly-spaced locations for . We define the model of order as a truncated Fourier Series:
Figure 1 shows three examples of least-squares fitting to different randomly-generated samples , and the resulting residuals. We can see that as the overfitting order increases to 41 there is observable structure in the sequence of residuals .
The aim of this paper is to show that there is valuable information about modelling in the way that residual values are distributed with respect to one another in the domain of input variables. The tools of correlation and spectral analysis are the primary aids for assessing these properties. We define the (circular) residual autocorrelation function at lag as
where denotes the complex conjugate of (we will assume real-valued hereafter so that ), and where denotes an element of the sequence extended by periodic summation to an infinitely long, repeating sequence of period so that for , we have .
The term in the denominator of (2), commonly referred to as the Residual Sum of Squares (RSS) or Sum of Squared Errors of prediction (SSE), normalizes the function so that . At lags statistically independent residuals have . For purely real residuals (2) is an even symmetric function of , and is Hermitian in general.
This definition of the residual autocorrelation function is convenient because it can be related simply to the Discrete Fourier Transform (DFT) of the sequence of residuals, where
and the inverse transform is defined as
We will often shorthand the DFT operations on a full sequence using the symbol , e.g. and . Using the circular form of the Wiener-Khinchin theorem for DFTs we may then write , which can be calculated with great computational efficiency thanks to the Fast Fourier Transform (FFT) algorithm (cooleytukey65, ).
Figure 2 shows for the model fits of Figure 1. Significant negative correlation between neighbouring residuals is seen at lag for the and models. The model shows the strongest departures from at lags , with an oscillating character. Figure 3a, shows the average of residual autocorrelation values across independent realizations of from a suite of simulations at each model order . There is structure.
As well as being useful for the efficient calculation of , the sequence is a powerful equivalent representation. It defines the power spectral density of our sequence of residuals, which we denote
For later convenience we will also define a corresponding correlation spectral power density , the DFT of . and are both even symmetric functions of
for purely real residuals. For i.i.d. residuals with a defined variance, these functions are expected to show a flat,white noise spectrum with constant expectation values for all .
for some sequence of residuals with power spectral density . This choice makes the structure of the power spectrum easy to visualize: with increasing , we see that an increasing share of the lower- spectral modes are suppressed, starting with the lowest . The apparent structure in the modelling residuals from overfitting arises because these residuals no longer match the flat power spectrum distribution that would be expected if they could be assumed to be i.i.d. – in a very real sense they have been high-pass filtered.
The Fourier Series model (1) is a special case: it forms an othonormal basis, its coefficients and have direct correspondence to the real and imaginary parts of the DFT itself, and it provides an optimally compact (in terms of ) series representation of arbitrary real-valued sequences .
The suppression of lower- power in by overfitting models in the simulations performed above is therefore, in fact, effectively total; not only on average but also in individual cases. In general models will often not have complete freedom to precisely replicate arbitrary variation in observed data (although see (zhangetal17, )), nor will they so wholly suppress Fourier modes in residuals.
For this reason it is interesting to generalize to a set of model basis functions that do not match the modes of the DFT precisely, and cannot represent arbitrary patterns of residuals in any finite series. Candidates for such a basis are the Chebyshev polynomials of the first kind, (see (arfkenetal13, )). We define the Chebyshev polynomial series model of order as:
The absolute values of these functions are bounded by 1 on the interval
, and they are a convenient basis for polynomial interpolation in bounded spaces. As for the Fourier Series, we simulate a simple case ofobservations in our sample . We place these at corresponding evenly-spaced locations for .
Once more, we performed independent realizations of for each Chebyshev model order (as the Chebyshev polynomial series model does not afford a complete representation of an arbitrary sequence at any level of truncation, unlike the Fourier Series, the choice of maxmimum is somewhat arbritrary). Figure 4 shows the averaged residual autocorrelation and power spectral signature , as a function of order , across this suite of simulations. We see similar behaviour to the Fourier Series case, with increasing modelling order leading to increasing supression of lower- power. Due to the less compact representation of arbitrary sequences provided by Chebyshev polynomials, the suppression of lower- power by overfitting is more gradual.
As discussed in Section 1, the MSE loss function is very commonly used in optimization and regression applications, and is defined using the notation above as simply
. Underpinning the widespread use of the MSE as a loss function is the Gauss-Markov theorem: that for a linear model fit to observations with uncorrelated, equal variance and zero mean errors, the best linear unbiased estimator of any linear combination of the observations is that which minimizes the MSE. Where observational errors can be assumed to also follow a Gaussian distribution, this estimator is also the Maximum Likelihood Estimator (e.g.(pressetal92, )).
But we have seen that residuals are not necessarily uncorrelated. The model residuals passed to the MSE loss function will only be uncorrelated in general if the ‘true’ model has been applied, and the model parameters are optimal. Anything else will, in general, either underfit or overfit, introducing correlations as shown above and as discussed (in a specific application) by (rowe10, ). In empirical settings and machine learning applications the ‘true’ model may be unknown and potentially unknowable. The application of the MSE loss function in regression tasks, often by default, assumes too much in assuming that model residuals are uncorrelated – irrespective of model selection or under variation in model parameter values – however plausible this assumption may be about the underlying observational errors. This causes us to unwittingly discard valuable information.
The information discarded is not captured by the MSE, which is invariant under permutation of any ordered sequence of the residual, because it is embedded within knowledge of the relative locations of residuals with respect to one another. The aim of this paper is to make initial step towards using this information in the optimization process so as to build better regularized, and better generalizing, models. If this can be done at all, computational efficiency and ease of adoption will be valuable characteristics of the solution. With these aims in mind, we will explicitly target as simple a modification of the widely-used MSE loss function as possible.
We begin with the simplest possible applicable probabily distribution for correlated residuals, a multivariate Gaussian. We write the probability density function as
where for convenience we have adopted vector notation to describe the sequence of residuals , with corresponding mean and covariance matrix . We will assume .
For ordinary least squares it would be additionally assumed that, where is the identity matrix. and is the variance of errors. Then identifying (8) with the statistical likelihood of the model leads to the log-likelihood function
Minimizing the MSE therefore maximizes the likelihood, irrespective of any value taken by the (likely unknown) variance .
We will seek a new loss function of the form , where the function is to be determined. To identify a candidate, we will relax assumptions about the covariance matrix of residuals in (8) relative to the least squares case, and instead assume the slightly more general
where is an , symmetric, positive semidefinite correlation matrix with unit values on the diagonals. We will have little a priori knowledge of the contents of , which as we know from Section 2 can vary with model and model parameters, as well as with any correlations that may existing in the underlying observational errors.
We propose to weaken the common assumption that
by instead introducing the following form for the prior probability on:
where is the differential entropy for a multivariate Gaussian with covariance matrix , and is a scaling constant that we introduce to allow the user to balance the contribution to the loss function from this prior. The prior can then be simplified to
We motivate the functional form of (11) by direct analogy to the Boltzmann entropy formula (e.g. (hill12, )), in the intuitive belief that the most likely distribution of residuals is that which has the greatest number of possible configurations per unit of variance. As is maximized when , applying this prior will penalize distributions of residuals showing any correlations, whether from overfitting or underfitting (rowe10, ).
Crucially, we will also assume that can be approximated as a circulant matrix. (Although not true in general, particularly for small samples, the approximation becomes asymptotically more accurate as increases and/or as residual correlations themselves become negligible, since is a circulant matrix.) Any circulant matrix is fully determined by its first column which we label .
As is a correlation matrix, the column consists of the ‘true’ values around which the (circular) residual autocorrelation function of equation (2), defined for a single sample of residual values, will be distributed. It can be seen that the average of over realizations of residuals from multiple independent samples of observations , given the same model, will form an asymtotically consistent estimator for . But for any sufficiently free, circulant matrix model of even a single sample of residuals allows us to calculate a that can itself be identified as the maximum likelihood estimate of ; we will denote this estimate . If is assumed circulant this also defines a corresponding maximum likelihood estimate for the full matrix, .
is a circulant matrix its eigenvalues are the DFT of. The same holds for the eigenvalues of :
where . Since the determinant of a matrix is given by the product of its eigenvalues, can be written as
As discussed above, the MSE loss function can be related to the log-likelihood function of the multivariate Gaussian with uncorrelated residuals. In a full, formally-correct treatment this likelihood should be expanded to include a freely-varying which can then be marginalized over as a nuisance parameter, combining both the prior and the likelihood of observed correlated residuals given in the integrand. But this is reserved for future work. Instead we propose an approximate treatment that assumes that the likelihood of any observed residual correlations given some can be approximated as very sharply peaked, indeed as being a Dirac delta function , around . This makes the marginalization trivial to compute, and adding a penalty term to the multivariate Gaussian log-likelihood of (9) to create new loss function:
where we will now omit terms that have no dependence on the parameter vector . Dividing both sides by we arrive at the equivalent
Defining the Mean Log Power of residual correlations as , and making a final approximation that the MSE is a consistent, sufficient estimator for (valid in the limit ) we arrive at the following proposal for a residual entropy-sensitive extension of the MSE loss function:
Figure 5 shows statistics of realized values of , i.e. the proposed multiplier on the MSE in the case, for every residual sample in the simulations of Section 2. We see an increasing function of overfitting order (although unavoidable aliasing in the experimental setup introduces artifacts at very high in the Chebyshev case), motivating the use of our proposed loss function (17) in the suppression of residual correlations, and thereby providing automatic regularization and a safeguard against overfitting.
Least squares has been the dominant approach towards empirical regression problems for over 200 years (legendre1805, ; pressetal92, ), and the MSE loss function is at its core. Yet we do believe that the MSE discards valuable information about the locations of residuals. We also believe that penalizing models that show discernible structure in residuals, residuals that have hitherto been blindly assumed to be uncorrelated, is both intuitively appealing and can be theoretically justified on entropic principles.
We have made a first attempt towards incorporating these beliefs directly into the loss function. We presented results that suggest the MSE may be improved by a simple (although approximate) multiplicative factor given in equation (17), a factor that can be calculated with relative computational efficiency thanks to the FFT algorithm. But many issues and outstanding questions remain.
We have made mathematical approximations that are asymptotically true, e.g. being approximately circulant (valid only for large or as ) or that the MSE forms a consistent and sufficient estimator for in practice (valid as ). Assumptions were also made to avoid marginalization over and to retain the MSE as part of the proposed loss function, approximating the likelihood function of residual correlations as being sharply distributed around the (observed) maximum likelihood estimate. The impact of these simplifying assumptions on practical optimization needs to be tested, and a fuller treatment may in fact lead to a better approximate form by which to encourage greater residual entropy.
To calculate a meaningful power spectral density of residuals, the residuals must first be ordered into a sequence that has some meaning in the input space. This is trivial in one dimension, less so in higher dimensions. For higher dimensional regression problems we would initially propose using residuals ordered along each of the most significant principal components of the input space, in turn, as determined via PCA. Structure seen in residuals along any of the principal directions in the space of input observations should be penalized, but whether this will be a powerful enough constraint is an open question. For very high dimensional problems, it might be prudent to work in configuration (lag) space, and estmate correlations in annular bins of cosine distance. Where data are not regularly spaced in the independent variables it may also be valuable to average binned residuals before calculating .
Clearly, an important next step will be applying the loss function of equation (17) in practical settings, to explore its behaviour outside of artificial simulations and the dynamic interplay of the MSE and the multiplicative factor . Of utmost importance will be to establish whether the new addition to the loss function is ‘safe’, i.e. that modelling outcomes are no worse than when using MSE alone.
Finally, an important application not explored here is that of outlier rejection – to the extent that contaminating outliers drag a best-fitting model away from the bulk of the other data points, they will necessarily introduce (positive) correlations in modelling residuals. The loss function described above offers hope of systematically enhancing robustness to outliers by penalizing such fits automatically.