Forecasting time series is exceptionally difficult because of the risk of overfitting on the dataset, in particular for overparametrized networks [zhang16], [zhang98]. In other words, when using the past to predict the future one has to be certain to have extracted a signal from the past that will propagate to the future, rather than simply having fitted a complex function to the past. Neural networks are powerful function approximators that are relatively easy to optimise, but precisely this flexibility can lead to poor extrapolation in time series forecasting. Because they can approximate almost any function, it is essential to ensure that the network learns the signal of interest instead of the noise. Understanding the structure of neural networks and the ability of a trained network to perform well on unseen data is therefore of utmost importance, and is the main objective of this paper.
The loss surface of a neural network, defined as the loss as a function of the weights, is typically highly non-convex and can, for a deep network, depend on a large number of parameters (the weights). Even for a simple network, the number of local minima and saddle points in the loss surface may grow exponentially in the number of parameters. The general shape of this loss function, and also the differences between the loss functions of small and large neural networks, is an active topic of research [choromanska15], [li17]. In terms of theory, a recent line of work has related the neural network loss surface to Gaussian random fields [choromanska15], [auffinger13], [bray07], [dauphin14]. Alternatively, random matrix theory has been used to obtain insight into the loss surface [pennington17]. In more empirical work, the authors of [li17] found that adding more layers to a network gives rise to a more non-convex loss surface, so that adding more layers can complicate training by causing the optimisation methods to get stuck in sub-optimal critical points.
The above work gives insight into the structure of the loss surface on the training dataset. For noisy time series, a trained network is able to generalise well, that is, to perform well on unseen data, when it is not overfitting on noise in the training dataset. However, as mentioned by the authors of [zhang16], if the network is big enough (i.e. overparametrized) it can fit even a random noise dataset almost perfectly, but it will almost certainly perform badly out of sample. Understanding the structure of the minima at which a network performs well on unseen data can give insight into how to set up the training methods, for example to avoid convergence on noise.
There are different ways of measuring the learning capability of a neural network (see [bernier00] for an overview). One is the output sensitivity, i.e. the first derivatives (the Jacobian) of the error with respect to the input (see [novak18]) or the weights. Another is the statistical sensitivity, which evaluates the output range variation of a node when its inputs or weights are perturbed. As is shown in [bernier00], this is equivalent to considering second derivatives, the Hessian, of the loss function with respect to the input or weight values. The statistical sensitivity with respect to the weights gives a measure for the smoothness of the error surface, with a small value of the statistical sensitivity implying a small output variation when weights are perturbed. The sensitivity with respect to the inputs measures the input noise immunity. Using the statistical sensitivity as a measure for generalisation is intuitive in the sense that we are interested in the robustness of the network when the input is perturbed. It has also been proposed that the Hessian with respect to the weights can be used as a measure for generalisation: flat minima in the weight space correspond to simpler learned functions [hochreiter97] or can be related to the Bayesian evidence [smith18].
Regularisation in the network can help to obtain a learned function with lower complexity, such that better generalisation may be obtained. Typical explicit regularisation methods, such as L1 or L2 regularisation or multiplicative noise injection (such as dropout [srivastava14]), contribute to the generalisability of the trained function by restricting the function complexity in some way. For the regularisation method to work well, we need to understand how to make a trade-off between the complexity of the function and its ability to fit the data. This trade-off is known as the information bottleneck [tishby15], and we study it in the coming sections to understand its effect on the learned function. As an alternative to explicit regularisation, the noise in stochastic gradient descent (SGD) can act as an implicit regulariser. The gradient is computed over batches and, as opposed to computing the full gradient, SGD thus introduces non-isotropic noise into the optimisation scheme. It can be shown that this drives the parameters away from sharp minima towards broader ones. In particular, the noise variance is proportional to the learning rate over the batch size, so that a large learning rate and a small batch size result in a higher noise component. This has been shown in previous work, e.g. [smith18], [chaudhari18], and will be a focus of this work as well.
The novelty of our contribution consists in a thorough analysis of what generalisability means for time series forecasting with fully-connected neural networks. In particular, time series do not satisfy the typical assumption in statistical learning theory of i.i.d. data. While generalisability for image datasets has been studied extensively, the problem is much more complex for time series: the dataset is typically much smaller, the signal-to-noise ratio might be low, the distribution can be non-stationary and there is little intuitive indication of what the underlying pattern in the data must be. Understanding what it means for a neural network to have good generalisability and how this can be achieved through the learning algorithms will be the main task of this paper.
We assume that the reader is familiar with the general neural network concepts such as optimisation methods like stochastic gradient descent and its parameters and the neural network architectures. For a general introduction to this we refer to [bishop07]. The rest of this paper is structured as follows: in Section 2 the loss surface structure is studied in a simplified setting; in Section 3 the weight Hessian is introduced as a generalisation metric; in Section 4 the input Hessian is defined and the relation between the weight and input Hessian is described; in Section 5 it is discussed how to make the trade-off between complexity and data fit and how one can influence complexity during the training of the network; finally Section 6 presents the numerical results.
2 Loss surface structure
In this section we give some background about neural networks and the properties of their loss surfaces and the implications of this structure for generalisation capabilities.
2.1 Loss surface as a Gaussian random field
The loss surface of a neural network is defined as the loss function over the high-dimensional weight space. This loss surface can be related to a Gaussian process on a high-dimensional space [dauphin14], [choromanska15]. With this insight, one is able to obtain theoretical results on the structure of the loss surface. We briefly repeat this derivation and discuss its implications. Let the inputs to the neural network be given by . Let be the weight matrix in layer with element connecting neuron in layer and in layer . Define as the vectorized total weights in the network, so that with , with thus being the dimension of the weight space. In the rest of this paper the dimension refers to column vectors. The first layer output, for , is given by
is the non-linear activation function, is the activation of the first layer and is the pre-activation output. Each subsequent layer outputs,
The final layer output is then given by,
with a scaling factor. Assume the data is given as a set of inputs and outputs generated from some data-generating distribution . Typical loss functions are the mean absolute error,
or the mean squared error,
where the expected values are taken over the data generating distribution.
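For readability, the network and loss definitions described above can be restated in one self-consistent notation. Every symbol in the block below ($x$, $W^{(l)}$, $\sigma$, $z^{(l)}$, $a^{(l)}$, $q$, the number of layers $L$, $\hat{y}$ and $\mathcal{L}$) is an assumption chosen for illustration and may differ from the notation used elsewhere in this paper. In particular, the sketch assumes a linear final layer.

```latex
\begin{align*}
z^{(1)} &= W^{(1)} x, & a^{(1)} &= \sigma\bigl(z^{(1)}\bigr), \\
a^{(l)} &= \sigma\bigl(W^{(l)} a^{(l-1)}\bigr), & l &= 2, \dots, L-1, \\
\hat{y}(x; w) &= q\, W^{(L)} a^{(L-1)}, \\
\mathcal{L}_{\mathrm{MAE}}(w) &= \mathbb{E}\bigl[\,\lvert y - \hat{y}(x; w) \rvert\,\bigr], &
\mathcal{L}_{\mathrm{MSE}}(w) &= \mathbb{E}\bigl[\bigl(y - \hat{y}(x; w)\bigr)^{2}\bigr].
\end{align*}
```

Here $w$ collects all entries of the $W^{(l)}$ into a single column vector, and the expectations are taken over the data-generating distribution of $(x, y)$.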
We define a critical point and its index as follows,
Definition 1 (A critical point and its index).
A critical point of some differentiable function is a point where all partial derivatives of the function are zero. In this work, we also use the term more loosely for the point to which the optimisation algorithm for the neural network has converged. For a function of variables, the number of negative eigenvalues of the Hessian matrix (the matrix of second-order derivatives of the loss function with respect to the parameters, defined more explicitly in (15)) at a critical point is called the index of the critical point.
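As a small illustration of Definition 1, the index can be computed numerically by counting the negative eigenvalues of the Hessian. The sketch below assumes a symmetric Hessian supplied as a NumPy array; the function name `critical_point_index` and the tolerance `tol` are hypothetical choices made for this example.

```python
import numpy as np

def critical_point_index(hessian, tol=1e-10):
    """Count the negative eigenvalues of a symmetric Hessian matrix."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    return int(np.sum(eigenvalues < -tol))

# f(u, v) = u**2 - v**2 has a critical point at the origin with Hessian diag(2, -2).
saddle_hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
# g(u, v) = u**2 + v**2 has a minimum at the origin with Hessian diag(2, 2).
minimum_hessian = np.array([[2.0, 0.0], [0.0, 2.0]])

saddle_index = critical_point_index(saddle_hessian)    # one negative eigenvalue
minimum_index = critical_point_index(minimum_hessian)  # no negative eigenvalues
```

A local minimum therefore has index zero, and a saddle point has an index between one and the number of variables minus one.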
Following the derivation in [choromanska15], let the non-linear activation function be the rectified linear unit, which maps its argument to the maximum of zero and that argument, and replace the activation function in (3) by the term , which denotes whether a path , where labels any of the paths from the input to the output, is active or not. We obtain,
Here refers to the -th element of the input vector and is the number of paths from a given network input to its output.
with the summation over the inputs and representing the summation over the further possible paths in the network. We remark that this expression is similar to a deep linear model multiplied by the factor .
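The path-based rewriting can be checked numerically on a small network. The sketch below assumes a two-layer ReLU network with a single output and no scaling factor; the names `forward_output` and `path_sum_output`, and the layout of `w1` as a (hidden units) by (inputs) array, are assumptions made for this example only.

```python
import numpy as np

def forward_output(x, w1, w2):
    """Standard forward pass of a two-layer ReLU network with one output."""
    return float(w2 @ np.maximum(w1 @ x, 0.0))

def path_sum_output(x, w1, w2):
    """The same output written as a sum over input-to-output paths."""
    active = (w1 @ x > 0.0).astype(float)  # 1 if the hidden unit on the path is active
    total = 0.0
    for i in range(x.size):        # loop over the inputs
        for j in range(w2.size):   # loop over the paths leaving input i
            total += x[i] * active[j] * w1[j, i] * w2[j]
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)
```

Each term of the double sum is an input element times the product of the weights along one path, switched on or off by the activity indicator, which is why the expression resembles a deep linear model.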
The second key assumption in this section is to let the input elements be sampled independently as (and let for simplicity). Due to the summation being over independent standard Gaussian random variables, is equal to a Gaussian process on the weight space. Letting the loss function be given by the absolute loss as in (4) in which the expected value can be taken over the activations,
due to the being sampled from a distribution, this loss function follows a Gaussian process distribution. For a particular value of it is equal to the well-studied Hamiltonian of spin-glass systems [auffinger13] and previous work on Gaussian random fields can be applied [bray07], [fyodorov07] to gain insight into the structure of the critical points.
2.2 Structure of the critical points
In this section we briefly summarize the results on the loss surface structure of Gaussian random fields in high dimensions. The works of [bray07] and [fyodorov07] show that for Gaussian random fields on high-dimensional spaces the critical points of the surface possess a particular structure. In [bray07] the authors show, by means of a generalised Kac-Rice formula, a linear dependence between the index of a critical point (its index ) and its loss value (the error ). A similar result can be obtained for neural networks as is done in [auffinger13], [choromanska15]. Let be the number of different weights in the network, which is assumed to be the -th root of the total number of paths from input to output in the network,
Under the assumptions made in Section 2.1, the loss surface of a neural network on a high-dimensional parameter space, in other words for deep and wide networks or as increases, has the following properties,
let ; there exists a layered structure of critical points: critical values in a band above the global minimum are more likely to be local minima, the band consists of local minima and saddle points of index 1, the band consists of local minima and saddle points of index 1 and 2, and so on;
local minima dominate over saddle points in a band of values close to the global minimum;
high-index critical points lie at high loss levels; in other words, a high value of corresponds to a high loss level .
To conclude, by making several assumptions on the activation function and the distribution of the inputs, it is possible to relate the neural network loss function to a particular kind of Gaussian random field, as commonly encountered in spin-glass systems. By an application of the Kac-Rice theorem, one is able to obtain a relationship between the index of a critical point of this Gaussian random field and its value. It can be shown that high-index saddle points lie at high loss levels, while local minima are close to the global minimum.
2.3 Loss and the entropy
As was shown in the previous section, under certain, albeit restrictive, assumptions on the deep neural network, the loss surface is given by a Gaussian random field on a high-dimensional space; this space represents the weight space and its dimension is given by the number of weights in the network, as determined by the number of nodes per layer and the number of layers used. Gaussian random fields on high-dimensional spaces possess a particular structure of the locations of the critical points in the asymptotic setting. Similarly, there exists a result on the entropy of these critical points. We recall here a result on the width of the minima, as stated in [becker18]. To measure the width of a minimum , consider the entropy which is defined as,
with being the Hessian matrix. A larger entropy means larger basin volume, or a wider minimum. We then state the following Theorem on the width of the minima, as is given and proved in [becker18].
Theorem 2 (Expected entropy [becker18]).
Let be some loss level. Asymptotically, the Hessian of the loss function at a point where the loss takes value has the following expected entropy
where is the number of layers and is the probability of the Bernoulli distributed weights being one.
The above theorem gives a relation between the number of layers in the network, the loss level at a particular point in the weight space and the width at that point in the weight space. In particular, the lower the train loss the lower the entropy and thus the sharper the minima. In other words, wider minima lie at higher loss values in the loss surface. This seems intuitive: in order to obtain a low train loss, one has to fit a more complex function which passes through all the observations. A good fit to the training data is however not sufficient to obtain good out-of-sample performance; we will discuss the concept of generalisation in the next section.
Since the data generating distribution is typically unknown, in extrapolation problems one assumes to have access to samples drawn (i.i.d.) from this distribution. One assumes , for some unknown function ; so that the are noisy observations of the true function of interest. In our setting we are interested in time series forecasting, i.e. we have where is the time index, so that historical points of are used to predict its future, in this setting one-day-ahead, value. This can be extended to containing the historical observations of multiple time series used to forecast . Note that here the i.i.d. assumption is violated since the observations should clearly be dependent. Nevertheless, using these datapoints as input a neural network can be used to extract a meaningful repeating pattern from the dataset. Note that the ’s, and thus the ’s, are noisy observations. We define the sample loss function as the loss function on that dataset, i.e. for the squared loss we obtain,
where for is the train dataset and the neural network output.
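The forecasting setup above can be made concrete as follows: a series is cut into windows of past values, each paired with the next value as target, and the sample loss is the mean of the squared errors. This is a minimal sketch; the helper names `make_windows` and `sample_mse`, the window length `n_lags` and the absence of any extra factor in the squared loss are assumptions made for this example.

```python
import numpy as np

def make_windows(series, n_lags):
    """Build (input, target) pairs for one-step-ahead forecasting."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lags
    inputs = np.stack([series[i:i + n_lags] for i in range(n_samples)])
    targets = series[n_lags:]  # the value that follows each window
    return inputs, targets

def sample_mse(predictions, targets):
    """Mean squared error over the sampled dataset."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((predictions - targets) ** 2))

inputs, targets = make_windows([1, 2, 3, 4, 5], n_lags=2)
# A naive baseline that repeats the last observed value of each window.
persistence_forecast = inputs[:, -1]
baseline_loss = sample_mse(persistence_forecast, targets)
```

Consecutive windows overlap, which makes the dependence between samples, and hence the violation of the i.i.d. assumption, directly visible.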
Generalisation is the relationship between a trained network’s performance on train data and its performance on test data. This is a highly desirable property for neural networks, where ideally the performance on the train data should be similar to the performance on similar but unseen test data. In general, the generalisation error of a neural network model can be defined as the failure of the hypothesis to explain the dataset sample. It is measured by the discrepancy between the true error and the error on the sample dataset,
In statistical learning theory a bound on this error typically depends on the complexity of the hypothesis class to which the hypothesis belongs, as well as on the number of samples in the dataset. Obtaining bounds on this error is a topic of active research with recent advancements including [dziugaite17], [zhou18], where the authors use PAC-Bayes theory. In the rest of the paper we will use the notation to denote the empirical loss function as computed on the sampled data.
Typically, a trained network is able to generalise well when it is not overfitting on noise in the train dataset. Neural networks are known to be universal approximators and are thus, when the network is large enough, able to approximate any function; when training, one therefore aims to extract a meaningful pattern from the data instead of learning a flexible function that fits all training points. In particular for overparametrized networks it is easy for the network to fit the training points, so avoiding this overfitting is an essential task. A somewhat straightforward way to define the generalisation capability is to study the robustness of the network with respect to input perturbations. For some input perturbation , the change in the loss function should be small,
When the neural network is heavily overfitting on the noise, a small change in the input parameters might result in a large change in the neural network output. In this setting the generalisation is related to a smoothness assumption on the function output. In the coming sections our goal is to understand and be able to control the generalisation of overparametrized, deep neural networks for time series forecasting, a setting in which the signal in the series can be weak and large datasets are not available. We aim to define metrics that can be used to measure when a learning algorithm can be expected to perform well.
3 The weight Hessian
The Hessian with respect to the weights will be used in order to obtain insight into the noise robustness of the weights, giving a metric for measuring the network’s capability to generalise well to unseen data.
The Hessian of the loss function with respect to the weights has elements
which represents the rate of change of the derivative with respect to in the direction of . The Hessian thus represents the curvature of the loss surface of the neural network. Its eigenvectors give directions in weight space and the corresponding eigenvalues give the curvature in those directions. For the large neural networks typically used in image processing, computing and storing the Hessian can be very time-consuming. In the case of time series forecasting the networks used will be smaller, but nevertheless the Hessian can contain thousands of elements. In the rest of this paper we will sometimes drop the dependence of the loss function on the input.
The Hessian gives insight into the flatness of the minimum, and, as we will show in Section 4.2, this can be related to input noise resistance of the output function. In this sense the Hessian can be related to the minimum description length, where a Hessian with small eigenvalues corresponds to a simpler function being learned. Alternatively, the Hessian is used in second-order optimisation methods where the step size in each direction is inversely proportional to the curvature in that direction: in directions with large curvature it takes small steps, while in directions with small curvature it takes larger steps [martens10].
3.2 Learning rate, batch size and the Hessian
It has been mentioned in prior research [kenton17], [mandt17], [smith18], [chaudhari18] that a relationship exists between the test error and the learning rate and batch size used in the SGD updating scheme. In this section we obtain a similar conclusion through a slightly different derivation. Let the gradient in a mini-batch be and the full gradient be , where is the weight space dimension, defined respectively as,
The weight update rule is given by,
is the learning rate. By the central limit theorem, if i.i.d. then,
Note that the approximation of the noise by a Gaussian distribution holds in the limit of the sample size tending to infinity and when the gradients for the batches are not heavy-tailed. Although the sample size is typically finite and the gradient distribution can be heavy-tailed, the approximation is widely used.
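The scaling behind this approximation can be checked empirically: if the per-sample gradients are i.i.d., the variance of their mini-batch average shrinks proportionally to one over the batch size. The sketch below uses synthetic scalar gradients; the data, the function name `minibatch_mean_variance` and the sampling with replacement are assumptions made for this example.

```python
import numpy as np

def minibatch_mean_variance(per_sample_grads, batch_size, n_batches=20000, seed=0):
    """Empirical variance of the mini-batch gradient for a scalar parameter."""
    rng = np.random.default_rng(seed)
    # Sample batches with replacement so that batch members are i.i.d. draws.
    idx = rng.integers(0, len(per_sample_grads), size=(n_batches, batch_size))
    return float(per_sample_grads[idx].mean(axis=1).var())

# Hypothetical per-sample gradients at a fixed weight value (synthetic data).
per_sample_grads = np.random.default_rng(1).normal(loc=0.3, scale=2.0, size=10000)

var_small_batch = minibatch_mean_variance(per_sample_grads, batch_size=4)
var_large_batch = minibatch_mean_variance(per_sample_grads, batch_size=16)
variance_ratio = var_small_batch / var_large_batch  # expected to be close to 16 / 4
```

With heavy-tailed gradients the same experiment would converge more slowly, which is the caveat mentioned above.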
The weight update rule can then be rewritten as,
where with . If convergence has been reached,
Note that for ease of notation we have omitted the dependence of the loss function on the input data . By a Taylor expansion method for the loss function evaluated on the full training data we have,
where denotes the gradient of the total loss and the Hessian of the total loss. Note that we thus approximate the loss surface by a quadratic function. Then, using (20) we obtain,
From this expression we see that, if convergence is obtained, the Hessian at that point in weight space is smaller for larger learning rates and smaller batch sizes, i.e. the Hessian is inversely proportional to the fraction . When full-batch gradient descent is used, i.e. if , the relationship between convergence and the learning rate remains the same: convergence achieved with a large learning rate results in a smaller Hessian, which in turn corresponds to a wider minimum.
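A related, but different, way to see the full-batch statement is the classical stability condition of gradient descent on a quadratic: with learning rate lr and curvature h, the iterates contract only if lr * h < 2. This is not the derivation used above; it is a toy check under the assumption of a one-dimensional quadratic loss, with the hypothetical helper `gradient_descent_final_distance`.

```python
def gradient_descent_final_distance(lr, curvature, w0=1.0, n_steps=200):
    """Run gradient descent on L(w) = 0.5 * curvature * w**2 and return |w| at the end."""
    w = w0
    for _ in range(n_steps):
        w = w - lr * curvature * w  # the gradient of the quadratic is curvature * w
    return abs(w)

lr = 0.5
flat = gradient_descent_final_distance(lr, curvature=1.0)   # lr * curvature = 0.5 < 2
sharp = gradient_descent_final_distance(lr, curvature=5.0)  # lr * curvature = 2.5 > 2
```

For a fixed learning rate the iterates settle in the flat minimum and are expelled from the sharp one, so a larger learning rate restricts convergence to minima with smaller curvature.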
3.3 The weight Hessian and generalisation
Generalisation in our setting refers to a kind of robustness of the trained network. More specifically, particular transformations of the input do not decrease the accuracy of the classification/forecast as computed on the train set. In other words, when a network has been trained on a set of patterns, certain transformations of these patterns should still be interpreted correctly. A higher robustness to input noise, which can be measured by Jacobians/Hessians with respect to input or weights, leads to better generalisation, i.e. the smaller the Hessian the wider the minimum.
Consider the eigenspectrum of the Hessian, i.e. the set of eigenvalues , , determined via,
where is the eigenvector corresponding to the eigenvalue . If all eigenvalues of are positive (resp. negative) at some critical point , that point is a local minimum (resp. maximum), and if the critical point has both positive and negative eigenvalues it is called a saddle point (see also Definition 1 on the index of a critical point). The eigenvector corresponding to the largest eigenvalue of indicates the direction of greatest curvature of the loss function. The size of the positive eigenvalues is thus a measure for how well a minimum will generalise to unseen data. A large positive eigenvalue thus means that a sharp increase in loss will occur in the direction of the corresponding eigenvector in weight space. If a minimum is wide, and thus has small eigenvalues in many directions, the minimum is more resistant to noisy transformations of the weights, while a sharp minimum has a higher sensitivity to noise in the weights. A sharp minimum is thus said to have overfitted on the noise in the training dataset, while a wider minimum may imply that a ‘simpler’ and more robust function has been learned.
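The link between eigenvalues and sharpness can be illustrated on a quadratic loss, for which the Hessian is constant. The sketch below assumes the toy loss 0.5 * w^T H w with a hand-picked 2 x 2 matrix; the names `quadratic_loss`, `hessian` and `step` are hypothetical.

```python
import numpy as np

def quadratic_loss(w, hessian):
    """Toy loss 0.5 * w^T H w with its minimum at the origin."""
    return 0.5 * float(w @ hessian @ w)

hessian = np.array([[2.0, 1.0], [1.0, 2.0]])         # eigenvalues 1 and 3
eigenvalues, eigenvectors = np.linalg.eigh(hessian)  # eigenvalues in ascending order

step = 0.1
# Loss increase after moving a distance `step` from the minimum along each eigenvector.
loss_increase = [quadratic_loss(step * eigenvectors[:, k], hessian) for k in range(2)]
```

Moving the same distance along the eigenvector with the larger eigenvalue raises the loss three times as much here, which is the sense in which large eigenvalues indicate a sharp minimum.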
In the work of [hochreiter97] the authors show that flat minima correspond to a minimization of the expected description length of the neural network function induced by the weights. The authors of [smith18] show a relationship between the Bayesian evidence and the Hessian, showing that maximizing the evidence corresponds to a minimization of the Hessian, by approximating the evidence with a Taylor expansion of the cost function.
3.3.1 Downsides of the weight Hessian
A metric that is commonly used to measure the width of the minimum is the trace. The trace of a square matrix of size is defined as,
where denotes the elements on the diagonal of the Hessian and denotes the eigenvalues. While prior research has noted that a lower Hessian (in terms of trace or some other norm) leads to better generalisation, there has also been contradicting evidence. One critique of using the trace of the weight Hessian for measuring generalisation capabilities is that the weights can be rescaled in such a way that the output function remains the same while the trace of the Hessian becomes large or small. This is the conclusion of the work of [dinh17]. Consider a neural network with two layers, so that the output is given by
Note that if is the rectified linear unit , then we can scale the weights by a constant without changing the output as follows,
The gradient and Hessian of the loss with respect to the weights can be modified by . We have,
where the derivatives are taken with respect to and . Similarly,
For large weights, e.g. when is large, the second derivative with respect to these weights is smaller; however, if the next-layer weights are scaled by , the output function has not changed. Because of the many symmetries in a neural network, there exist transformations of the weights that leave the output function and the generalisation capabilities unchanged, yet scale the Hessian with respect to particular weights to be larger or smaller. Norms like the trace norm or the Frobenius norm are thus not enough to determine the generalisation capabilities if one compares minima that are equivalent through these symmetries.
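The rescaling symmetry itself is easy to verify numerically. The sketch below assumes a two-layer ReLU network without biases, with `w1` acting on the input and `w2` on the hidden layer; these shapes, the random weights and the value of `alpha` are assumptions made for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_output(x, w1, w2):
    """Two-layer ReLU network without biases."""
    return w2 @ relu(w1 @ x)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(1, 4))

alpha = 10.0  # any positive constant works, since relu(alpha * z) = alpha * relu(z)
original = two_layer_output(x, w1, w2)
rescaled = two_layer_output(x, alpha * w1, w2 / alpha)
```

The two weight configurations are far apart in weight space but represent the same function, so any generalisation metric that differs between them cannot be a property of the function alone.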
This analysis thus shows that for each individual minimum found, a large trace of the Hessian is not informative. In [kenton17] the authors claim that while sharp minima with a generalisation similar to that of a wide minimum exist, SGD does not converge to these minima, and for the minima found by SGD the width correlates well with generalisation. We obtain a similar result in the numerical experiments in Section 6. This can be explained by the fact that the Hessian’s sensitivity to scaling becomes an issue for or , i.e. when the scaled and unscaled minima are far away in weight space, because then a minimum with a large Hessian norm can have similar generalisation as a minimum with a small norm. We claim that for an SGD algorithm started from random initialisations, if these initialisations are relatively close in the weight space, the algorithm will not converge to minima that are far away. In other words, when comparing the generalisation capabilities of the minima found for a particular network initialised from a particular distribution such that the initialisations are close, the Hessian should measure the generalisation capability well enough. As an alternative metric to study generalisation we also propose the Hessian with respect to the input, as introduced in the next section. This metric does not suffer from the scaling symmetries in the weight space of the deep neural network. A way to keep the weight Hessian invariant against this kind of transformation is to consider the weight Hessian multiplied by the weight matrix; note that
where , denote the vectorised forms of , , respectively. The Hessian multiplied by the weight matrix should be resistant against the scaling from (26) and result in a Hessian of similar size as the one of the original, unscaled weights. We study this metric as a measure for generalisation in the numerical experiments in Section 6.
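A scalar example makes both effects visible. The sketch below assumes a network with one weight per layer, output w2 * relu(w1 * x), squared loss with a factor 0.5, and the active regime w1 * x > 0. The quantity `weight_scaled`, which multiplies each diagonal Hessian entry by the square of its weight, is one simple reading of "the Hessian multiplied by the weights" and is an assumption, not necessarily the exact expression used in this paper.

```python
def hessian_diagonal(w1, w2, x):
    """Diagonal weight-Hessian entries of L = 0.5 * (w2 * relu(w1 * x) - y)**2.

    Valid in the active regime w1 * x > 0, where relu(w1 * x) = w1 * x.
    """
    return (w2 * x) ** 2, (w1 * x) ** 2

def trace_and_weight_scaled(w1, w2, x):
    """Return the Hessian trace and a weight-scaled version of it."""
    h11, h22 = hessian_diagonal(w1, w2, x)
    trace = h11 + h22
    weight_scaled = w1 ** 2 * h11 + w2 ** 2 * h22
    return trace, weight_scaled

x, alpha = 1.0, 10.0
trace_a, scaled_a = trace_and_weight_scaled(1.0, 1.0, x)
trace_b, scaled_b = trace_and_weight_scaled(alpha * 1.0, 1.0 / alpha, x)
```

Both weight settings give the same output, yet the trace grows from 2 to roughly 100 after rescaling, while the weight-scaled quantity stays the same.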
4 Input Hessian
Besides the weight Hessian, the input Hessian will also be used as a metric for out-of-sample performance. In this section we discuss the relationship between input noise resistance of the network and the input and weight Hessians.
The curvature of the loss surface as a function of the weights is given by the Hessian with respect to the weights, as discussed previously. This measures the flatness of particular critical points, and correlates with generalisation capabilities. Alternatively, as used in e.g. [novak18], the Jacobian with respect to the input can be used as a measure for generalisation. This measures the smoothness of the output function with respect to the input parameters. In [sokolic17] the authors also study this Jacobian as a measure for generalisation and propose to explicitly penalize the Frobenius norm of the Jacobian with respect to the training data in the training objective to find minima that generalise well. For this local sensitivity metric one can define the Jacobian with respect to the input as a vector with elements
where the network is considered to have one output node; it is trivial to extend the definition to multiple outputs. An input Jacobian with small elements would imply that the output function is robust against small changes in the input data, meaning that better generalisation can be achieved.
Besides the Jacobian, the Hessian with respect to the input can also be used. This measures the curvature of the output function or loss function with respect to a varying input. To be consistent with the weight Hessian, we compute the Hessian with respect to the input of the loss function. This measures the curvature of the loss function, and thus the output function, with respect to the inputs, such that a Hessian with small eigenvalues means a smoother output function. Define the elements of the input Hessian of an input as,
The Jacobian and the Hessian are averaged over the data samples , in the train dataset in order to obtain an average sensitivity metric over the input space.
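The averaged input Hessian can be approximated with finite differences when automatic differentiation is not available. The sketch below is one possible implementation under stated assumptions: the helper names, the step size `eps` and the linear model with squared loss used as a check are all hypothetical and chosen so that the exact answer, 2 * w w^T, is known.

```python
import numpy as np

def input_gradient(loss, x, eps=1e-4):
    """Central-difference gradient of a scalar function with respect to the input x."""
    grad = np.zeros(x.size)
    for i in range(x.size):
        step = np.zeros(x.size)
        step[i] = eps
        grad[i] = (loss(x + step) - loss(x - step)) / (2.0 * eps)
    return grad

def input_hessian(loss, x, eps=1e-4):
    """Central-difference Hessian, built from differences of the gradient."""
    hess = np.zeros((x.size, x.size))
    for i in range(x.size):
        step = np.zeros(x.size)
        step[i] = eps
        hess[:, i] = (input_gradient(loss, x + step, eps)
                      - input_gradient(loss, x - step, eps)) / (2.0 * eps)
    return 0.5 * (hess + hess.T)  # symmetrise away finite-difference noise

def averaged_input_hessian(sample_loss, inputs, targets):
    """Average the per-sample input Hessians over the training set."""
    hessians = [input_hessian(lambda x, y=y: sample_loss(x, y), x)
                for x, y in zip(inputs, targets)]
    return np.mean(hessians, axis=0)

# Hypothetical linear model with squared loss; its input Hessian is 2 * w w^T.
w = np.array([1.0, 2.0])

def sample_loss(x, y):
    return float((w @ x - y) ** 2)

inputs = np.array([[0.5, -1.0], [2.0, 0.3], [-1.5, 1.0]])
targets = np.array([1.0, 0.0, -2.0])
avg_hessian = averaged_input_hessian(sample_loss, inputs, targets)
```

The trace or the largest eigenvalue of `avg_hessian` can then serve as a scalar summary of the input sensitivity.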
4.2 Relation between the input and weight Hessian
Neural networks are considered to be robust if they are resistant to noise in the input. As mentioned in the previous section, flat minima in weight space are linked to good generalisation; furthermore this flatness can be controlled through the learning rate. In this section we study the relation between flatness and the input noise robustness. Previous work [seong18] has also studied this relation, and the authors proposed an optimal learning rate to obtain good generalisation. Consider the output of a one-layer neural network,
where (assuming the output ) and . Denote by , the vectorised forms of , . We have,
where and . Let be the vectorized form of . Then,
so that if the output is resistant to additive noise in the first weight matrix then the output is resistant to additive noise in the input. Note that additive noise resistance is just one particular type of input transformation one might be interested in; other types could be e.g. resistance to multiplicative input noise, or more complex transformations of the input space such as translations.
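The equivalence between additive input noise and additive first-layer weight noise can be made explicit by constructing a weight perturbation that reproduces the effect of a given input perturbation. The rank-one choice below is only one of many perturbations with this property and requires a non-zero input; the function name `equivalent_weight_noise` and the random test data are assumptions made for this example.

```python
import numpy as np

def equivalent_weight_noise(w1, x, input_noise):
    """Rank-one perturbation dW with dW @ x == w1 @ input_noise (requires x != 0)."""
    return np.outer(w1 @ input_noise, x) / float(x @ x)

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 3))
x = rng.normal(size=3)
input_noise = 0.01 * rng.normal(size=3)

# Pre-activations of the first layer under the two kinds of noise.
noisy_input_preactivation = w1 @ (x + input_noise)
noisy_weight_preactivation = (w1 + equivalent_weight_noise(w1, x, input_noise)) @ x
```

Since the first-layer pre-activations coincide, everything downstream of the first layer coincides as well, whatever the activation function or the number of further layers.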
Consider now the Taylor expansion around the weights,
In order to have a minimum such that , i.e. the output function should be resistant to additive noise in the input, the elements of the gradient and Hessian of the first-layer weights need to be small. The smoothness of the output function with respect to the input can be controlled during training in several ways. As we showed in Section 3.2, the flatness in weight space can be controlled through the learning rate or batch size in the SGD updates. Biasing the optimisation algorithm into wider minima in weight space results in smoother functions with lower information complexity in the input space. The width is measured by the weight Hessian, so a lower weight Hessian should give better resistance to additive input noise.
In the above derivation we thus showed that the stability of the output function with respect to input noise is equivalent to noise resistance in the first layer weights. In order to see the effect of the stability requirement on further layer weights note that the expression in (35) can be written as,
for some and such that . Let again , be the vectorised forms of , . Then,
From this expression we see that if the Hessian with respect to is not sufficiently small, for a deeper network the remaining noise can be damped by if the loss function is sufficiently flat in weight space in the directions of the second-layer weights. This may explain why deep networks can be more robust. Noise may not be fully damped in the first-layer weight directions when the eigenvectors tangent to these directions have large eigenvalues, i.e. when the loss increases sharply in those directions. Such noise can still be damped in the further layers if the eigenvalues corresponding to the eigenvectors in the directions of the further weights are sufficiently small. On the other hand, adding nodes per layer does not aid in damping the noise; on the contrary, the more nodes per layer, the smaller the eigenvalues of the Hessian in all the directions of these layer weights should be.
4.3 The learning rate, batch size and input Hessian
In Section 3.2 we showed the effects of the learning rate and batch size on the weight Hessian. Here we show that SGD and its hyperparameters also put restrictions on the input Jacobian and Hessian. Consider SGD where the weight update rule is given by (19). By a first-order Taylor expansion in for some noise it holds,
In other words, (19) can be rewritten as,
Therefore, the noise from the stochastic gradient descent can be related to noise in the input and SGD can be interpreted to minimize a jittered cost function. In turn, by a derivation similar to [reed92], taking the loss function to be the MSE, one can find,
where is the Hessian with respect to the input of the neural network output and the output function in the loss term on the r.h.s. is given by . In other words, a relation exists between training with SGD and the minimization of the loss function regularized with the first- and second-order derivatives of the output with respect to the input. Therefore, SGD imposes smoothness assumptions on the output function with respect to the input.
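The link between input noise and a regularised loss can be checked numerically in its simplest instance. The sketch below assumes a linear model f(x) = w . x, for which the input Jacobian is w and the input Hessian vanishes, and Gaussian input noise with variance sigma^2; in that special case the expected MSE under input noise equals the clean MSE plus sigma^2 ||w||^2. The function names are hypothetical, and this is only an illustration of the kind of relation discussed above, not the derivation of [reed92].

```python
import numpy as np

def clean_mse(w, X, y):
    """Mean squared error of the linear model f(x) = w . x on clean inputs."""
    return float(np.mean((X @ w - y) ** 2))

def jittered_mse_closed_form(w, X, y, sigma):
    """Expected MSE when i.i.d. N(0, sigma^2) noise is added to every input.

    For a linear model the input Jacobian is w and the input Hessian is zero,
    so the penalty reduces to sigma^2 * ||w||^2.
    """
    return clean_mse(w, X, y) + sigma ** 2 * float(np.sum(w ** 2))

def jittered_mse_monte_carlo(w, X, y, sigma, n_draws, rng):
    """Estimate the same expectation by averaging over sampled input noise."""
    total = 0.0
    for _ in range(n_draws):
        noise = rng.normal(0.0, sigma, size=X.shape)
        total += np.mean(((X + noise) @ w - y) ** 2)
    return total / n_draws
```

For a non-linear network the penalty would also involve second-order terms, which this linear example cannot show.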
5 How low can we go?
As mentioned before, at least theoretically under particular assumptions, the entropy of a minimum decreases with the loss value. In other words, the lower the loss at a minimum, the larger the trace of the weight Hessian. A similar property can be said to hold for the input Hessian, with functions overfitting on the noise having lower train loss but a higher trace of the input Hessian. The trade-off between optimally representing the data and the smoothness of the function, i.e. the capability to compress the function by dismissing irrelevant input data, is known as the information bottleneck. In the optimal case, a neural network should learn to extract the most informative patterns, with the most compact function possible.
5.1 The information bottleneck
In the work of e.g. [tishby15], [achille18], [tishby17] the information bottleneck is studied for neural networks. The information bottleneck is used to extract the most relevant information that the input variable contains about the output variable. Let denote the mutual information where and are as usual random variables sampled from some data-generating distribution ,
where is the joint probability density of and and and are the marginals of and respectively. The mutual information measures the information that and share; i.e. it quantifies the amount of information obtained about one of the random variables by observing the other. Let be a representation of , such that the distribution of is fully described by the conditional . This representation is sufficient for if , in other words if contains all the relevant information had about . It is minimal if is smallest among the sufficient representations, in other words if the complexity of the representation is the lowest. The trade-off between sufficiency and minimality is formulated as the minimization of the information bottleneck Lagrangian,
where operates as the trade-off parameter between the complexity (first term) and the sufficiency (second term). For an overparametrized network to fit a complex dataset (e.g. memorize random noise [zhang16]) it has to pay a price in terms of the information complexity. Bounding the information complexity can thus prevent the overfitting, but the trade-off between the two may depend on the particular dataset used.
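To make the mutual information concrete, the following sketch computes it for two discrete random variables from a joint probability table. Working with a finite table rather than densities is an assumption made here for illustration, and the function name is hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint probability table.

    joint[i, j] is the probability of X = i and Y = j; entries must sum to 1.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, m)
    mask = joint > 0                        # 0 * log 0 is taken as 0
    ratio = joint[mask] / (px @ py)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))
```

Independent variables give a value of zero, while two perfectly coupled fair binary variables give log 2, the full entropy of either variable.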
In the case of neural networks, the representation is governed by the learned weights , which can be viewed as a random variable depending on the data and the optimisation. The mutual information of the weights and the data can be denoted by . The flat minima, i.e. the ones with small eigenvalues of the weight Hessian, can be interpreted as having low information. In other words, since the minimum is flat, the weights can be stored at lower precision, requiring fewer bits and having a lower information value. This result is derived more precisely in Proposition 4.3 of [achille18]; here we state their main result. Let denote a local minimum of the cross-entropy loss and is the Hessian at that point. For the optimal choice of the posterior , the following bound can be obtained,
where and denotes the nuclear norm. The nuclear norm is given by for square, real matrices. The nuclear norm (also called the trace norm) of the Hessian equals the $\ell_1$-norm of the vector of its eigenvalues; minimizing the nuclear norm therefore encourages a lower rank of the matrix (fewer non-zero eigenvalues). The bound thus states that flat minima have low information. The authors in [achille18] furthermore derive that when decreasing the information in the weights (by some form of regularization), one automatically improves the minimality and thus the invariance of the function. The converse, i.e. low information implies flatness, does not need to hold; in other words, as mentioned in Section 3.3, there exist minima with good generalisation that are not flat.
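The relation between the nuclear norm and the eigenvalues can be verified directly for a symmetric matrix. The sketch below is an illustration only, with a hypothetical function name; it uses NumPy's `eigvalsh` and compares the result with `numpy.linalg.norm(H, "nuc")`.

```python
import numpy as np

def nuclear_norm_from_eigenvalues(H):
    """Sum of the absolute eigenvalues of a symmetric matrix H."""
    eigenvalues = np.linalg.eigvalsh(H)
    return float(np.sum(np.abs(eigenvalues)))

# A positive semi-definite example: the nuclear norm equals the trace.
H_psd = np.array([[2.0, 1.0], [1.0, 2.0]])
# An indefinite example: the nuclear norm exceeds the trace.
H_indef = np.array([[1.0, 0.0], [0.0, -2.0]])
```

When the Hessian is positive semi-definite, as at a local minimum, all eigenvalues are non-negative and the nuclear norm coincides with the trace.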
5.2 Dependence on noise
Consider a time series , with a signal to noise ratio of . Suppose the neural network output should be resistant to noise in the signal. In this case the loss function should satisfy the following objective,
Relating this to the train and test set, we assume that is the test data with a noise component different from that in the train data . In this setting, a small corresponds to a small difference between the error on the train data and the error on the test data, or, in other words, a small generalisation error. By a Taylor expansion in the input one obtains,
Then, taking expected values we have,
where we have used the fact that i.i.d. From this it follows that,
In other words, the amount of noise in the input the neural network has to be resistant to is inversely proportional to the input (and thus weight) Hessian. This is intuitive in classification problems. Consider an image or a time series one would like to classify. If the output function has to be resistant to noise, i.e. the classification output is invariant to noise in the input, we require the Hessian to be small, so that the shifted input still results in the same output. The higher the noise resistance should be, the smaller the Hessian. The downside is that requiring the learned function to possess more resistance against noise can also decrease the performance of the classifier.
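The inverse relation between the tolerated noise and the Hessian can be illustrated on a function whose second-order expansion is exact. The sketch below assumes a quadratic function with constant Hessian A and isotropic Gaussian noise with variance sigma^2, for which the expected change under noise is 0.5 * sigma^2 * tr(A). The quadratic and all names are assumptions chosen for illustration; they are not the loss used in the paper.

```python
import numpy as np

def quadratic(x, A, b):
    """f(x) = 0.5 * x^T A x + b^T x, whose Hessian is the constant matrix A."""
    return 0.5 * np.einsum("...i,ij,...j->...", x, A, x) + x @ b

def expected_noise_gap(A, sigma):
    """Second-order value of E[f(x + eps)] - f(x) for eps ~ N(0, sigma^2 I)."""
    return 0.5 * sigma ** 2 * float(np.trace(A))

def sampled_noise_gap(x, A, b, sigma, n_samples, rng):
    """Monte Carlo estimate of the same gap."""
    eps = rng.normal(0.0, sigma, size=(n_samples, x.size))
    return float(np.mean(quadratic(x + eps, A, b)) - quadratic(x, A, b))
```

For a fixed acceptable gap, a larger trace leaves room for less noise variance, which matches the statement that more noise resistance requires a smaller Hessian.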
5.3 Obtaining better generalisable minima
In this section we summarize which hyperparameters can be used to control the trade-off between generalisation and complexity (typically the train data fit).
In Section 3.2 the relationship between the weight Hessian and the learning rate was discussed. It was shown that using a higher learning rate can result in wider minima if convergence is obtained. In Section 4.3 the same kind of relationship is derived between the learning rate and the input Hessian. In [seong18] the authors also make a link between a high learning rate and a wide minimum. In particular, they claim that using a high learning rate allows the training algorithm to escape from sharp minima in the weight space so that the optimisation algorithm converges to smoother and wider minima that are able to generalise better to unseen data. By starting with a small learning rate the weights do not diverge in the beginning when the gradients tend to be large, but due to the increase in learning rate the weights do not converge to a sharp local minimum either. A similar learning rate schedule was proposed in [smith17sc], where the authors used a cyclic learning rate with one cycle and a large maximum learning rate. Our derivations in the previous sections thus give a theoretical explanation as to why the learning rate can be used as a control parameter for generalisation. In Section 6 we study this numerically. In order to avoid the size of the gradient influencing the minima to which the optimisation converges (as does happen in the previous work of [seong18]), we propose to normalise the gradient by its norm.
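A gradient step normalised by the gradient norm can be sketched as follows. The function name and the small constant `eps`, which guards against division by zero at a stationary point, are assumptions added for illustration and are not specified in the text.

```python
import numpy as np

def normalised_gradient_step(weights, gradient, learning_rate, eps=1e-12):
    """One descent step along gradient / ||gradient||.

    eps is an assumption added here to avoid division by zero at a
    stationary point; it is not part of the text.
    """
    norm = np.linalg.norm(gradient)
    return weights - learning_rate * gradient / (norm + eps)
```

With this update the step length is set by the learning rate alone, so rescaling the gradient leaves the step unchanged.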
The batch size used, similar to the learning rate, determines the size of the noise of the SGD, as discussed in Sections 3.2 and 4.3. A smaller batch size results in a larger variance of the noise, which in turn, according to (42) results in the smoothing terms, i.e. the input Jacobian and Hessian, having a larger weight in the optimisation objective. This makes the trade-off between data fit and function complexity more biased towards obtaining a low function complexity. The batch size can thus determine the amount of smoothness required in the output function learned by the neural network.
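The dependence of the gradient noise on the batch size can be checked empirically. The sketch below assumes a single scalar parameter and batches drawn with replacement from a fixed set of per-sample gradients, in which case the variance of the mini-batch mean is the per-sample variance divided by the batch size. The function name and the sampling scheme are assumptions made for illustration.

```python
import numpy as np

def minibatch_gradient_variance(per_sample_grads, batch_size, n_trials, rng):
    """Empirical variance of the mini-batch mean gradient (scalar parameter).

    Batches are drawn with replacement, an assumption made here so that the
    theoretical variance is var(per_sample_grads) / batch_size.
    """
    idx = rng.integers(0, per_sample_grads.size, size=(n_trials, batch_size))
    batch_means = per_sample_grads[idx].mean(axis=1)
    return float(batch_means.var())
```

Comparing, for instance, batch sizes 4 and 64 on the same per-sample gradients should give variances that differ by roughly a factor of 16.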
Number of iterations
A small learning rate results in smaller steps taken in the network. In other words, for the same number of training iterations a smaller learning rate may give rise to a minimum with a higher loss value. By a similar argument, a smaller number of training iterations gives rise to a minimum with higher loss. By Theorem 2, at least in theory, the higher-loss minima should also have a higher entropy and thus a better generalisation. Therefore, early stopping, or equivalently training with fewer iterations, terminates the algorithm at a point in the loss surface with higher entropy. Training the network for fewer iterations should avoid overfitting on the noise, and by controlling the train error and the Hessian one can stop training when a sufficient trade-off between fit and smoothness has been obtained. In this way, the number of training iterations can be used to control the smoothness of the obtained solution, i.e. the trade-off in terms of data fit and the information complexity of the learned solution.
6 Numerical results
The neural network we consider contains hidden layers with nodes per layer. The weights are initialized as and trained with SGD. The learning rate is set to , with mini-batches of size , and iterations are used to minimize the mean squared error (MSE). The network is trained to predict the value at time given the time series at times , i.e. historical datapoints. We use the hyperbolic tangent as the activation function. The results are presented for 20 trained networks starting from different initializations. The typical measures of generalisation based on the input and weight Hessians are particular norms of the matrices. The trace of the weight Hessian will be used, i.e.
Similarly, the trace of the input Hessian averaged over the data samples will be used as a metric for generalisation,
While in case of the Hessians we work with the trace of the matrices as a measure of generalisation, for the input Jacobian we use the Frobenius norm and use as a metric of generalisation the sensitivity of the output with respect to the input averaged over the data samples,
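The input-space metrics can be computed in closed form for a small network. The sketch below assumes a network with one hidden tanh layer and a scalar output, and it differentiates the network output (rather than the loss) with respect to the input; both choices, as well as all function names, are assumptions made for illustration.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    """Scalar output of a one-hidden-layer tanh network."""
    return float(w2 @ np.tanh(W1 @ x + b1) + b2)

def input_jacobian(x, W1, b1, w2):
    """Gradient of the output with respect to the input x."""
    h = np.tanh(W1 @ x + b1)
    return (w2 * (1.0 - h ** 2)) @ W1

def input_hessian_trace(x, W1, b1, w2):
    """Trace of the Hessian of the output with respect to the input x."""
    h = np.tanh(W1 @ x + b1)
    second = -2.0 * h * (1.0 - h ** 2)          # second derivative of tanh
    return float(np.sum(w2 * second * np.sum(W1 ** 2, axis=1)))

def averaged_metrics(X, W1, b1, w2):
    """Average Jacobian norm and Hessian trace over the rows of X."""
    jac = np.mean([np.linalg.norm(input_jacobian(x, W1, b1, w2)) for x in X])
    tr = np.mean([input_hessian_trace(x, W1, b1, w2) for x in X])
    return float(jac), float(tr)
```

For a scalar output the Frobenius norm of the Jacobian reduces to the Euclidean norm of the gradient, which is what `averaged_metrics` averages over the samples.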
6.1 Artificial data
In this section we use artificial datasets to gain understanding of the neural network and its generalisation capabilities. We show that as expected from the theory, a linear relation exists between the trace of the input and weight Hessians (i.e. the function complexity) and the generalisation error. In our results, a higher function complexity means that the network is learning a function which is overfitting on the noise. This is clearly undesirable and results in a worse test set performance. The main task of this section is to show how to recognize when the network starts overfitting, and how to avoid this using the optimisation hyperparameters as described in Section 5.3.
6.1.1 Random noise
We simulate 100 datapoints from an distribution. An overparametrized neural network with the number of parameters larger than the sample size will be able to perfectly fit this random noise, see Figure 1. However, in order to obtain a low loss, the network will have to significantly increase its function complexity, as can be measured by the norms of the weight and input Jacobians and Hessians. This is shown in Figures 2-3: the traces of the input and weight Hessians significantly increase to obtain a small loss value. After some number of iterations, once the MSE has converged to some value, the Hessian remains approximately constant, with small fluctuations due to the stochasticity of the optimisation algorithm. When the network starts to learn the high-frequency components, here the noise, the norms of the input and weight Hessians increase significantly. Thus, in order to avoid convergence on noise one has to keep these norms small.
6.1.2 Noisy sine function
Consider now the function , with and . The network input consists of and is trained to forecast . Note that this time series is clearly not stationary, but contains seasonality which is a common feature of time series such as weather measurements.
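A dataset of this kind can be generated and turned into supervised input and target pairs as sketched below. The period, noise level and number of lags are placeholders to be supplied by the caller, since the exact values used in the experiments are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def noisy_sine(n_points, period, noise_std, rng):
    """Sine wave with additive Gaussian noise (parameter values are placeholders)."""
    t = np.arange(n_points)
    return np.sin(2.0 * np.pi * t / period) + rng.normal(0.0, noise_std, n_points)

def make_windows(series, n_lags):
    """Build (inputs, targets): each input row holds n_lags consecutive values
    and the target is the value that immediately follows them."""
    n_rows = len(series) - n_lags
    X = np.stack([series[i:i + n_lags] for i in range(n_rows)])
    y = series[n_lags:]
    return X, y
```

When splitting the windows into train and test sets, keeping them in chronological order avoids using future values to predict the past.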
Generalisation and the number of iterations
Figures 4 and 6 show the trace norm of the input and weight Hessians plotted against the train error and the generalisation error, respectively. There exists a linear relation between the trace of the input and weight Hessians and the generalisation error: a smaller trace norm results in lower generalisation error. This effect is slightly less significant for the deeper network. Furthermore, training longer results in a solution of higher complexity, which in this case is undesired since a higher complexity means that the function is overfitting on the noise. This is in accordance with the theoretical result on the entropy and the loss in Theorem 2, where it was claimed that the lower the train loss, the sharper the minimum, i.e. the larger the output function complexity. Training longer gives access to lower points in the loss surface with a lower train error, as seen in Figure 4. These lower points have a lower entropy, which results in a higher generalisation error as seen from Figure 6.
Increasing the noise amplitude
Consider now the sine function with the noise coefficient given by . In Figure 7 we observe that in order to obtain a generalisation error in the high noise case () similar to that in the low noise case (, Figure 6) the Hessian should be much smaller. This corresponds with the theoretical analysis in Section 5.2, where it was observed that with more noise in the signal one requires a lower Hessian in order to obtain a similar generalisation error. Finding a low complexity solution on noisy data can be difficult due to the possibility of overfitting, since no explicit smoothness constraints are imposed. Deep networks are even more prone to overfitting due to the higher number of parameters and the sharper gradient descent directions. Thus, in order to obtain sufficiently smooth solutions for deep networks one needs to adapt the training method or cost function accordingly. For the training method, as seen in Figure 7, taking fewer steps – or equivalently (not shown in the plots but discussed in Section 5.3) using smaller learning rates – results in smaller generalisation error.
Generalisation and the learning rate
Here we study the effects of the learning rate on generalisation. Figure 8 shows the test error plotted against the input and weight Hessians obtained by training the neural network with different learning rates. Using a larger learning rate results in wider minima while a smaller learning rate tends to converge to sharper minima; however, a significant number of outliers is found in both cases. We used batch gradient descent and scaled the gradient by its norm – i.e. the gradient in the SGD updates is given by – in order to avoid the gradient size influencing the minima width. We remark that the relation between a larger learning rate and wider minima seems to be clearer in the deep neural network, where the usage of the higher learning rates results in more minima clustered at lower values of the trace. While the Hessian is correlated with the test error, i.e. a small test error means a smaller Hessian, the test set performance is not significantly better using the higher learning rates. Even though convergence has been obtained, the network weights have converged to a minimum that underfits the data and therefore the output function can have a worse test set performance.
Generalisation and the batch size
In Figure 11 we plot the generalisation error and the input and weight Hessians for different batch sizes. Using a smaller batch size causes the network to converge to minima with lower input and weight Hessians, which in turn correspond to minima with lower generalisation error. As expected from the theory in Section 5.3 batch size has a significant influence on the smoothness of the output function with respect to input and weights, and can be used as a control for the trade-off between train and generalisation error. Out of the three controls considered: number of iterations, learning rate and batch size, the number of iterations and the batch size appear to be the most effective ones for controlling the trade-off.
The scaled Hessian as a metric for generalisation
Here we use the Hessian multiplied by the weights as defined in (30) as a measure for generalisation. While good results were obtained with the original weight Hessian, it is of interest to see if the number of outliers (here minima with different Hessians but similar generalisation errors) decreases when using the scaled weight Hessian. In Figure 10 we see that, similar to the unscaled weight Hessian, a clear linear dependence exists between the size of the Hessian and the generalisation error. This dependence does not seem to be more significant than for the unscaled Hessian, showing that as expected from Section 3.3, the scaling sensitivity of the weight Hessian is not a significant problem for the minima found by SGD, and the weight Hessian is a valid metric for measuring generalisation capability, despite its scaling sensitivity.
6.2 Real-world data
In this section we study generalisability for several real-world time series forecasting tasks. We show that the norm of the input and weight Hessian is a good metric for measuring the capability of a network to generalise, and that similar to the artificial dataset, the hyperparameters defined in Section 5.3 are very effective for controlling the trade-off between smoothness – as measured by the Hessians – and data fit – as measured by the train loss.
6.2.1 Index data
Financial data is highly non-linear, non-stationary and has a very low signal-to-noise ratio [cont01]. Overfitting on the training data and not being able to generalise well to unseen data is therefore a challenge. We will use a network of size , . The input data will consist of historical daily absolute returns of the S&P500 index, historical daily absolute returns of the CBOE 10 year interest rate, and historical daily absolute returns of the volatility index (VIX), so that the total input into the neural network will consist of 15 nodes. The train period consists of data from 2017-01-03 until 2018-02-02, and the test data from 2018-02-03 until 2018-08-13. Given the value of the time series the returns are computed as . The returns are then normalized using the mean and variance. The network output will consist of the prediction for the next day return of the S&P500 index. In Table 1 we present the MSE and hit rate (computed as the number of up or down movements predicted correctly) for different hyperparameters averaged over 20 sampled networks. A larger trace of the input or weight Hessian appears to correspond to a worse performance; similarly, training longer results in overfitting. A smaller batch size corresponds to a smaller weight Hessian in the final layer, but it does not seem to result in better performance due to e.g. underfitting the signal. Financial returns are highly noisy and non-linear and distinguishing the signal in the data from noise remains challenging. Nevertheless we showed that the techniques presented in the paper can be used to bias the algorithm into minima that have more (additive) noise resistance.
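The normalisation step and the hit rate can be sketched as follows. Scaling the test set with statistics estimated on the training set only, and counting a hit whenever predicted and realised returns share a sign, are assumptions made here for illustration; the function names are hypothetical.

```python
import numpy as np

def standardise(train, test):
    """Scale both sets with the mean and standard deviation of the training set."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

def hit_rate(actual, predicted):
    """Fraction of days on which predicted and realised returns share a sign."""
    return float(np.mean(np.sign(actual) == np.sign(predicted)))
```

Estimating the scaling on the training period alone keeps information from the test period out of the inputs.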
In this section we train a network for predicting the daily minimum temperature in Melbourne, Australia. The dataset contains observations over the period of 1988-01-01 until 1990-12-31. We will use a network of size , . The input data will consist of historical daily observations of the temperature. The results for the MSE for different training hyperparameters are presented in Table 2. As expected, training with fewer iterations and using smaller batch sizes results in a smoother output function with respect to the input and causes the training algorithm to converge to wider minima. The temperature data has a clear seasonal pattern but the daily observations vary due to noise; using smaller batch sizes or training for fewer iterations has a regularising effect on the output function, so that the network does not overfit on the noise in the data, but continues to follow the main trend.
In this work we studied generalisation capabilities of neural networks trained for the purpose of time series forecasting. We showed that there is a correspondence between good generalisation capability and small input and weight Hessians of the loss function at the minima found after training. A small input or weight Hessian corresponds to the smoothness of the trained function, or, in other words, the resistance of the output function to noise in the input or weights, respectively. The challenge lies in finding the optimal trade-off between fit of the data and smoothness of the learned function, so as to avoid overfitting on the noise and underfitting on the signal of interest. We showed how to use the learning rate, the batch size and the number of iterations used in the training algorithm to bias the network into minima that possess a certain structure. Another aspect that may influence generalisation capabilities is the kind of activation function used: while not reported, we noticed that the network is more prone to overfitting when using the piecewise linear ReLU than when using the hyperbolic tangent or the sigmoid function. Furthermore, the network size itself also matters: deep networks, due to the larger number of parameters, will more easily overfit on the noise, obtaining a low training error but a bad out-of-sample performance.
While this work provided some insight into obtaining good generalisation for time series forecasting, forecasting remains a challenging task due to the non-linear and non-stationary distribution of the data. The typical assumption in statistical learning theory of having i.i.d. samples from some data-generating distribution does not hold in time series: there is a dependence between the observations through time and the underlying distribution may change due to unobserved variables so that the train and test data might not be identically distributed. A relaxation that has become standard to deal with the independence assumption is to assume that the observations are drawn from a stationary mixing distribution (see e.g. [agarwal13], [mcdonald17]). The authors of [kuznetsov18] provide bounds that also hold for non-stationary time series. A related issue is that of generalisation in neural networks across different noise distributions. As has been mentioned in the work of [geirhos18], neural networks have trouble generalising when the noise distribution in the data they were trained on differs from the noise distribution in the test dataset. Understanding and solving this issue will prove valuable in time series forecasting, where the distribution of the noise in the observations could change over time. Obtaining theoretical results on e.g. the link between the generalisation error and the Hessian, and understanding how to make machine learning algorithms generalise in a non-i.i.d. setting, is still a relevant and active topic of research which we aim to address in future work.