Log In Sign Up

Time Series Prediction under Distribution Shift using Differentiable Forgetting

Time series prediction is often complicated by distribution shift which demands adaptive models to accommodate time-varying distributions. We frame time series prediction under distribution shift as a weighted empirical risk minimisation problem. The weighting of previous observations in the empirical risk is determined by a forgetting mechanism which controls the trade-off between the relevancy and effective sample size that is used for the estimation of the predictive model. In contrast to previous work, we propose a gradient-based learning method for the parameters of the forgetting mechanism. This speeds up optimisation and therefore allows more expressive forgetting mechanisms.


page 1

page 2

page 3

page 4


Bayesian Regression Approach for Building and Stacking Predictive Models in Time Series Analytics

The paper describes the use of Bayesian regression for building time ser...

Empirical Risk Minimization for Time Series: Nonparametric Performance Bounds for Prediction

Empirical risk minimization is a standard principle for choosing algorit...

What went wrong and when? Instance-wise Feature Importance for Time-series Models

Multivariate time series models are poised to be used for decision suppo...

AdaRNN: Adaptive Learning and Forecasting of Time Series

Time series has wide applications in the real world and is known to be d...

Model-Attentive Ensemble Learning for Sequence Modeling

Medical time-series datasets have unique characteristics that make predi...

Adaptive Dynamic Model Averaging with an Application to House Price Forecasting

Dynamic model averaging (DMA) combines the forecasts of a large number o...

Time-Discounting Convolution for Event Sequences with Ambiguous Timestamps

This paper proposes a method for modeling event sequences with ambiguous...

1 Introduction

We introduce a new framework for the problem of predicting time series that exhibit distribution shift, also known as concept drift in the domain of engineering Lu2019 or forecasting under unstable environments in economics Rossi2012. Time series exhibiting distribution shift appear in a wide range of domains Zliobaite16 from industrial management to biomedical applications. Financial time series distribution shift has been studied in the concept drift literature Harries96, as well as more recently in the statistics McCarthy2016 and learning theory Kuznetsov2020

literatures. We develop the forgetting mechanism approach to predicting time series under distribution shift. The key novel contribution of our work is a gradient-based learning framework for directly optimising the validation performance of differentiable forgetting mechanisms. The main advantages of our framework are that it is easy-to-use with any machine learning method based on empirical risk minimisation, provides quick estimation of forgetting mechanism hyperparameters – thereby facilitating expressive forgetting mechanisms – and maintains competitive performance to other common approaches for handling time series distribution shift. While our framework can be applied to any general time series distribution shift problem, we show how it may in particular improve financial risk predictive modelling. Our code is available


2 Problem Definition

Consider a sequential supervised learning problem where for each

, we wish to learn a mapping between features and labels where , for some parameter space . The labels, conditional on the features, follow a time-varying distribution . At each time point , we are interested in minimising the one-step-ahead path-dependent risk , over

, for a given loss function

. We follow an Empirical Risk Minimisation (ERM) Kuznetsov2020 approach in which we estimate the optimal model parameters by minimising the risk estimator .

3 Model Description

3.1 Model Adaptation to Distribution Shift via Weighted Empirical Risk Minimisation

Our approach to accommodate distribution shift in the ERM formulation is through the choice of the one-step-ahead risk estimator

where the weights are the outputs of a forgetting mechanism .

Figure 1: Method visualisation: time series training samples (top) are weighted by the forgetting mechanism (bottom). The forgetting mechanism, which is parameterised by and optimises predictive performance on the validation set data, assigns weight on the most recent, representative training samples.

The purpose of re-weighting empirical loss samples is to find a linear combination of training set sample losses that yields a model with improved generalisation. We can encode inductive biases by choosing the functional form of the forgetting mechanism. For instance, a monotonically decreasing forgetting mechanism encodes the hypothesis that more recent samples are more relevant for training. We introduce examples of specific functional forms in Section 5.3. The learning of forgetting mechanism parameters controls the trade-off between the adaptivity of the forecaster and the statistical efficiency of the learning procedure: forgetting mechanisms that assign high weight to a few points are able to capture more relevant data at the cost of reducing the effective sample size that is used to fit the underlying model.

3.2 Differentiable Bi-Level Optimisation

We jointly learn the predictive model and forgetting mechanism parameters by solving the bi-level optimisation problem

where and represent the upper and lower level objective functions

Here, and denote training and validation sets, where the validation set is selected to be the most recent data samples, for some choice of . The assumption underlying this choice of validation set is that the distribution does not, on average, substantially change in any given small time span Kuznetsov2020.

Applying the implicit function theorem to bi-level optimisation, we are able to perform gradient-based learning of the forgetting mechanism parameters . Specifically, Gould2016 provide the following Lemma for differentiating through the argmin in the lower level problem:

Lemma 3.1.

Let be an element of , the set of twice continuously differentiable functions in each argument, and let . Then the derivative of with respect to is given by


Therefore, by the chain rule, the total derivative of

with respect to is given by:


where we use the gradient expression in equation (1). Equation (2) gives us the required expression to apply gradient descent in to the upper-level objective function:

where is the step size.

3.3 Gradient Optimisation Routine

Since the upper level problem is generally non-convex with multiple local minima, we propose running gradient descent with multiple restarts to explore different local solutions.

Once we have estimated the parameters using bi-level optimisation on the train-validation split, we refit our predictive model on the combined train and validation sets. The weighting scheme for the training data points is given by the forgetting mechanism using the estimated while a uniform weighting scheme is applied to the validation data points: each of these is given the same weight as the last training set point, .

There are further considerations when applying the bi-level optimisation procedure using a neural network as the underlying predictive model. Optimisation techniques such as flatness-aware learning

SWA and Hessian approximation duv_hyp are relevant for numerical stability and large Hessian computations; see Appendix C for further discussion.

4 Related Work

Recursive least squares Haykin86; Rossi2012; Harvey90 provides an early example of the use of a simple forgetting mechanism for model adaptation. Since then, the concept drift literature has studied the use of forgetting mechanisms for time series distribution shift on an algorithm-by-algorithm basis Koychev2000; Klinkenberg2004. The field of non-stationary time series forecasting McCarthy2016; Kuznetsov2020 has also considered the use of forgetting mechanisms for dealing with distribution shift. McCarthy2016 propose an alternative Bayesian formulation to our ERM-based approach. Kuznetsov2020 propose an estimator for the one-step-ahead risk that is based on a weighted empirical risk. They derive a learning guarantee on the one-step-ahead generalisation error that holds in the non-stationary time series setting. They minimise this learning guarantee over the weights in their empirical risk and the parameters of their model. The first key difference between their work and ours is that we model the weights using a parametric forgetting mechanism. Secondly, they optimise an upper bound on the generalisation error to derive estimates whereas we directly minimise the validation-set prediction error to derive .

Our work distinguishes itself from the existing concept drift and non-stationary time series forecasting literature as it provides gradient-based learning for a generic forgetting mechanism and ERM-based learning model. A key novelty of our work is gradient-based minimisation of the validation error over the parameters of the forgetting mechanism. Gradient-based approaches are known to be much faster Hutter18 than grid search, which has typically been used to estimate the forgetting mechanism parameters in the concept drift Koychev2000; Klinkenberg2004 and non-stationary time series literature McCarthy2016. Faster learning allows for more complex, richly-parameterised forgetting mechanisms that would be too computationally burdensome to fit using grid search. Our method connects time series prediction under distribution shift to the growing field of automated machine learning Hutter18, allowing us to leverage powerful modern automatic differentiation libraries for efficient implementation.

Beyond time series prediction, the idea of re-weighting empirical loss to address distribution shift has been considered in the context of importance weighting for transfer learning


and robust deep learning

Shu2019. For instance, in order to address the problem of data noise and class imbalance, Shu2019 propose to train a neural network that outputs the ERM importance weight of a training sample given that sample’s loss. The work of Shu2019 and Jenni2018 illustrates the interest in applying differentiable bi-level optimisation methods to the problem of generalisation in non-temporal data settings.

Model FixedRegime () RandomWalk () RandomRegime () Stat () Factor () Vol ()

4.00* 17.2* 4.20 2.54 2.60* 8.77
Window 2.62 *3.10* 4.63* 2.57* 2.62* 8.96
StateSpace 2.73* *3.30* 5.25* 2.60* 3.81* 9.75*
Arima 2.65* 13.4* 4.41* 2.57* 9.24*
Dbf 4.44* 23.3* 4.55* 2.64* 5.11* 10.7*
GridSearchExp 2.63 3.00 4.31* 2.58* 2.59* 8.94
GradExp 3.96* 17.2* 4.20 2.55 2.59* 8.77
GradMixedDecay 2.60 2.80 4.39* 2.57* 2.49 8.76

Table 1: This table reports the Mean Squared Error (MSE) for each model averaged across Monte Carlo repetitions or across time series (for the real data). For each data set, the MSE of the best performing model is shown in bold. A star is placed by a model’s MSE whenever its MSE is significantly larger, at the 5% level, than the best model’s MSE using a paired Wilcoxon signed-rank test333Note that the independence assumption of the Wilcoxon test does not hold in the real data setting due to correlations between time series. Wilcoxon1945.

5 Experiments

5.1 Synthetic Data

Our synthetic data experiments are based on those from Kuznetsov2020 in order to facilitate comparison with their method. Four distribution shift settings are considered: regularly occurring change points, gradual distribution drift, irregularly occurring change points and no distribution shift. Data are generated by the following equations for and :

  • FixedRegime: , where if and 0.9 otherwise.

  • RandomWalk: , where .

  • RandomRegime: , where and is the stochastic process on such that

  • Stat: .

The task in each setting is to forecast the time series value

using a no-intercept linear model with a feature vector that consists of the last three values of the time series

. The train-validation-test split for the synthetic data experiments is , and . We use 192 Monte Carlo runs for each data setting.

5.2 Real Data

We consider a financial dataset consisting of 19 years of daily log-returns for 50 NYSE equities; see Appendix A for data processing details.

Our first real data experiment investigates distribution shift in the problem of risk factor modelling Fama1993. Specifically, we linearly decompose each equity return in excess of the risk-free rate using a three factor model parameterised by :

The three Fama-French factors correspond to market risk , the risk factor related to size and the risk factor related to book-to-market equity Fama1993.

Our second real data experiment investigates distribution shift in the problem of forecasting the absolute values of financial returns Taylor1986. For this experiment, we restrict our dataset to 15 Exchange Traded Fund (ETF) instruments tracking industry sectors, commodity and bond prices222The ETFs used are listed in the Appendix A.. We forecast the next day absolute value of the return for each ETF, , using the five previous absolute return values for that ETF with a linear model parameterised by :

For each real data time series, we use expanding time series cross validation to evaluate the methods. We start training for each time series using 6 years of data, the validation set is 150 days and we use the next 150 days as out-of-sample data. The train-validation-test split is moved forward by 150 days for the next model update.

5.3 Methods

We consider two simple instances of forgetting mechanisms in our experiments


Forgetting mechanism (3) corresponds to exponential decay while (4) corresponds to a mixture of various functional forms of decay. We fit each of these forgetting mechanisms using differentiable bi-level optimisation and call the resulting methods GradExp for mechanism (3) and GradMixedDecay for mechanism (4

). Our gradient optimisation routines are implemented using 5 restarts from random initialisation, with each run performing 50 epochs of stochastic gradient updates. As a comparison to gradient-based optimisation, we fit forgetting mechanism (

3) using grid search over : the corresponding method is known as GridSearchExp.

We compare against five baselines: Stationary, Window, StateSpace, ARIMA and DBF. The Stationary method assumes no distribution shift and fits the underlying models using unweighted historical samples. The commonly used Window method applies uniform weights to a fixed length interval of the most recent samples. In the StateSpace model, the state is the vector of model parameters which is modelled using random-walk update state transition equations brockwell2009. ARIMA

is an autoregressive model (not applicable to the factor experiment) often used in econometrics

brockwell2009. DBF is an implementation of the two-step, discrepancy-based forecasting algorithm from Kuznetsov2020. Details of model implementation and hyperparameters are reported in the Appendix B.

5.4 Results

We see from Table 3 that the GradMixedDecay performs best in 4 out of 6 datasets. It has near-best performance in the RandomRegime and Stat data settings. These results demonstrate the competitive predictive performance under distribution shift that can be achieved with the use of a forgetting mechanism which is trained using gradient-based bi-level optimisation. The outperformance of GradMixedDecay over GradExp illustrates the advantage of using more expressive differentiable forgetting mechanisms, which would be too computationally burdensome to fit using grid search. We see that GradExp has comparable performance to GridSearchExp on most datasets. This shows that gradient-based optimisation is capable of achieving comparable performance to grid search in the low-dimensional cases when grid search is feasible.

6 Conclusion

In summary, we propose learning forgetting mechanisms for time series prediction under distribution shift using gradient-based optimisation. Our approach is flexible, fast and can achieve competitive performance to baseline methods for time series prediction under distribution shift. In future work, we will investigate more complex forgetting mechanisms and sophisticated predictive models. Another interesting research extension would be to adapt our method to the online learning setting so that the predictive model can be updated in response to streaming data. More broadly, our paper motivates further research between the topics of distribution shift, time series prediction and hyper-parameter learning.


The authors would like to thank Mihai Cucuringu, Gesine Reinert and Jake Topping for their feedback. Both authors are supported by the EPSRC CDT in Modern Statistics and Statistical Machine Learning (EP/S023151/1) and the Alan Turing Institute (EP/N510129/1).


Appendix A Data

We consider the universe of NYSE equities spanning from 04-01-2000 to 31-12-2019 from Wharton’s CRSP database WRDS. The data consists of daily closing prices from which we compute daily log-returns. We also compute the average daily dollar volume that is traded for each equity. We subset to the 50 equities that have the largest average daily dollar volume. Any missing prices are forward-filled prior to the calculation of log-returns.

Factor returns were downloaded from Kenneth French’s data repository

The ETF ticker names used for volatility experiments are ‘SPY’, ‘IWM’, ‘EEM’, ‘TLT’, ‘USO’, ‘GLD’, ‘XLF’, ‘XLB’, ‘XLK’, ‘XLV’, ‘XLI’, ‘XLU’, ‘XLY’, ‘XLP’, ‘XLE’.

Appendix B Model Details

Each model is grid searched over its relevant hyper-parameters to maximise predictive performance on the validation set for each train-validation split of the data. In this Appendix, we provide details of the range of hyperparameter values searched over for each model.

Linear models in the Stationary, Window, DBF, GridSearchExp, GradExp and GradMixedDecay

cases were fit using ridge regression with a squared error loss function, where the range of ridge penalty values searched over is


  • GradExp, GradMixedDecay: For optimising

    on the validation set we use standard mini-batch stochastic gradient descent with momentum for all experiments, keeping the learning rate and momentum constant at 0.1 and 0.9 respectively. We perform 5 restarts

    444However, we have found that similar results can be obtained using a single run of gradient descent for most forgetting mechanisms and datasets. of gradient descent from a random initialisation per run. We used a batch size of 32 examples, and train for 50 epochs per restart. We then select the value of with the lowest overall validation loss for each method.

  • Window: For a time series of length , we grid search the length of the window over 25 values linearly spaced in .

  • GridSearchExp: We grid search in Equation (3) over 25 values. In order to ensure that the grid of values considered is of plausible a priori range, the grid is computed based on the window length grid which was searched over in the Window method. In particular, for each window length value considered, we select a value of that results in the GridSearchExp forgetting mechanism assigning almost all of its weight to the same span length.

  • ARIMA: The hyperparameters of ARIMA are the autoregressive lag , the differencing order and the number of moving average components . Following Kuznetsov2020, we grid search over . The model is fit using maximum likelihood under a Gaussian emission model brockwell2009

  • StateSpace: The state-space model’s observation equation is a linear model with Gaussian noise where the states correspond to the predictive model’s regression coefficients for the given problem. In the state equation, the states are updated with a Gaussian increment. The model is fit using maximum likelihood.

  • DBF: The DBF algorithm is an iterative two stage procedure that iteratively fits the predictive model and the weights in the ERM problem (see Kuznetsov2020, Section 7, Equation 14). Their procedure is run for 5 iterations. The lookback parameter, , corresponding to the window in which the discrepancy is thought to be bounded Kuznetsov2020 is set to the value used in the original paper . The DBF algorithm has two penalty parameters and , which correspond to regularisation on the weighting scheme and the model parameters. We grid search both of these parameters over the same grid as the ridge regression penalty.

Appendix C Bi-Level Optimisation for Deep Learning

In ongoing work not presented in this paper, we found that further considerations are required when the lower level problem has multiple local minima (as is the case when using a neural network as the predictive model). We have empirically found that flatness-aware learning SWA of the lower level problem is important for the stability and generalisation performance of the bi-level optimisation solution. We conjecture that this is because a sharp minima leads to a large Hessian condition number and therefore unstable inversion in Equation (1).

Secondly, we have observed that identity approximation to the Hessian in Equation 1 yields competitive results while reducing the computational cost of running bi-level optimisation using a highly-parameterised predictive model such as a neural network. A number of other Hessian approximation schemes have been proposed in the gradient hyper-parameter optimisation literature, see duv_hyp for a comparison of these approaches in the context of hyper-parameter optimisation.