1 Introduction
We introduce a new framework for the problem of predicting time series that exhibit distribution shift, also known as concept drift in the engineering domain Lu2019 or forecasting under unstable environments in economics Rossi2012. Time series exhibiting distribution shift appear in a wide range of domains Zliobaite16, from industrial management to biomedical applications. Distribution shift in financial time series has been studied in the concept drift literature Harries96, as well as more recently in the statistics McCarthy2016 and learning theory Kuznetsov2020 literatures. We develop the forgetting mechanism approach to predicting time series under distribution shift. The key novel contribution of our work is a gradient-based learning framework for directly optimising the validation performance of differentiable forgetting mechanisms. The main advantages of our framework are that it is easy to use with any machine learning method based on empirical risk minimisation, provides quick estimation of forgetting mechanism hyperparameters – thereby facilitating expressive forgetting mechanisms – and maintains competitive performance relative to other common approaches for handling time series distribution shift. While our framework can be applied to any general time series distribution shift problem, we show how it may in particular improve financial risk predictive modelling. Our code is available at https://github.com/jaseclarkson/pods_2022_icml_ts.

2 Problem Definition
Consider a sequential supervised learning problem where, for each time point $t$, we wish to learn a mapping $f_{\theta}$ between features $x_t \in \mathcal{X}$ and labels $y_t \in \mathcal{Y}$, where $\theta \in \Theta$ for some parameter space $\Theta$. The labels, conditional on the features, follow a time-varying distribution $p_t(y_t \mid x_t)$. At each time point $t$, we are interested in minimising the one-step-ahead path-dependent risk $R_{t+1}(\theta) = \mathbb{E}\left[\ell\left(f_{\theta}(x_{t+1}), y_{t+1}\right) \mid \mathcal{F}_t\right]$ over $\theta \in \Theta$, for a given loss function $\ell$. We follow an Empirical Risk Minimisation (ERM) Kuznetsov2020 approach in which we estimate the optimal model parameters by minimising a risk estimator $\widehat{R}_{t+1}(\theta)$.

3 Model Description
3.1 Model Adaptation to Distribution Shift via Weighted Empirical Risk Minimisation
Our approach to accommodating distribution shift in the ERM formulation is through the choice of the one-step-ahead risk estimator
$$\widehat{R}_{t+1}(\theta) = \sum_{s=1}^{t} w_s\, \ell\left(f_{\theta}(x_s), y_s\right),$$
where the weights $w_s = \omega_{\phi}(s)$ are the outputs of a forgetting mechanism $\omega_{\phi}$ with parameters $\phi$.
The purpose of reweighting empirical loss samples is to find a linear combination of training set sample losses that yields a model with improved generalisation. We can encode inductive biases by choosing the functional form of the forgetting mechanism. For instance, a monotonically decreasing forgetting mechanism encodes the hypothesis that more recent samples are more relevant for training. We introduce examples of specific functional forms in Section 5.3. The learning of forgetting mechanism parameters controls the trade-off between the adaptivity of the forecaster and the statistical efficiency of the learning procedure: forgetting mechanisms that assign high weight to a few points are able to capture the most relevant data at the cost of reducing the effective sample size that is used to fit the underlying model.
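To make the reweighted empirical risk concrete, the following is a minimal sketch (not our released implementation) of weighted ERM for ridge regression with an exponential forgetting mechanism; the closed-form weighted solution $(X^\top W X + \lambda I)^{-1} X^\top W y$, the decay rate and all toy constants are illustrative choices:

```python
import numpy as np

def exp_forgetting_weights(n, phi):
    """Exponential decay weights: the most recent sample (index n-1) gets weight 1."""
    ages = np.arange(n - 1, -1, -1)               # age of each sample: n-1, ..., 0
    return np.exp(-phi * ages)

def weighted_ridge_fit(X, y, weights, ridge=1e-3):
    """Minimise sum_s w_s * (y_s - x_s @ theta)^2 + ridge * ||theta||^2."""
    W = np.diag(weights)
    d = X.shape[1]
    return np.linalg.solve(X.T @ W @ X + ridge * np.eye(d), X.T @ W @ y)

# Toy usage: an AR(1) series whose coefficient flips sign halfway through.
rng = np.random.default_rng(0)
y_series = np.zeros(400)
for t in range(1, 400):
    alpha = -0.5 if t < 200 else 0.5
    y_series[t] = alpha * y_series[t - 1] + rng.normal()

X = y_series[:-1, None]                            # feature: previous value
y = y_series[1:]                                   # target: next value
w = exp_forgetting_weights(len(y), phi=0.02)       # down-weight the old regime
theta = weighted_ridge_fit(X, y, w)
```

With the exponential weights, the estimate of the autoregressive coefficient is driven almost entirely by the recent regime, whereas an unweighted fit would average the two regimes.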
3.2 Differentiable Bi-Level Optimisation
We jointly learn the predictive model and forgetting mechanism parameters by solving the bi-level optimisation problem
$$\min_{\phi}\; \mathcal{L}_{\mathrm{val}}\!\left(\hat{\theta}(\phi)\right) \quad \text{subject to} \quad \hat{\theta}(\phi) = \operatorname*{arg\,min}_{\theta}\; \mathcal{L}_{\mathrm{train}}(\theta, \phi),$$
where $\mathcal{L}_{\mathrm{val}}$ and $\mathcal{L}_{\mathrm{train}}$ represent the upper- and lower-level objective functions
$$\mathcal{L}_{\mathrm{val}}(\theta) = \sum_{s \in \mathcal{V}} \ell\left(f_{\theta}(x_s), y_s\right), \qquad \mathcal{L}_{\mathrm{train}}(\theta, \phi) = \sum_{s \in \mathcal{T}} \omega_{\phi}(s)\, \ell\left(f_{\theta}(x_s), y_s\right).$$
Here, $\mathcal{T}$ and $\mathcal{V}$ denote the training and validation sets, where the validation set is selected to be the most recent $n_{\mathrm{val}}$ data samples, for some choice of $n_{\mathrm{val}}$. The assumption underlying this choice of validation set is that the distribution does not, on average, substantially change in any given small time span Kuznetsov2020.
Applying the implicit function theorem to bi-level optimisation, we are able to perform gradient-based learning of the forgetting mechanism parameters $\phi$. Specifically, Gould2016 provide the following Lemma for differentiating through the arg min in the lower-level problem:

Lemma 3.1.
Let $g(\theta, \phi)$ be an element of $C^2$, the set of functions twice continuously differentiable in each argument, and let $\hat{\theta}(\phi) = \operatorname*{arg\,min}_{\theta} g(\theta, \phi)$. Then the derivative of $\hat{\theta}(\phi)$ with respect to $\phi$ is given by
$$\frac{\mathrm{d}\hat{\theta}(\phi)}{\mathrm{d}\phi} = -\left(\nabla^2_{\theta\theta}\, g\!\left(\hat{\theta}(\phi), \phi\right)\right)^{-1} \nabla^2_{\phi\theta}\, g\!\left(\hat{\theta}(\phi), \phi\right). \quad (1)$$
Therefore, by the chain rule, the total derivative of $\mathcal{L}_{\mathrm{val}}\!\left(\hat{\theta}(\phi)\right)$ with respect to $\phi$ is given by
$$\frac{\mathrm{d}\mathcal{L}_{\mathrm{val}}\!\left(\hat{\theta}(\phi)\right)}{\mathrm{d}\phi} = \left(\frac{\mathrm{d}\hat{\theta}(\phi)}{\mathrm{d}\phi}\right)^{\!\top} \nabla_{\theta}\, \mathcal{L}_{\mathrm{val}}\!\left(\hat{\theta}(\phi)\right), \quad (2)$$
where we use the gradient expression in Equation (1). Equation (2) gives us the required expression to apply gradient descent in $\phi$ on the upper-level objective function:
$$\phi \leftarrow \phi - \eta\, \frac{\mathrm{d}\mathcal{L}_{\mathrm{val}}\!\left(\hat{\theta}(\phi)\right)}{\mathrm{d}\phi},$$
where $\eta$ is the step size.
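When the lower-level problem is weighted ridge regression, both the arg min and the Hessian in Equation (1) are available in closed form, so the hypergradient can be computed exactly. The sketch below is an illustration for a scalar exponential-decay parameter $\phi$ (not our released code); the data shapes and default ridge penalty are assumptions:

```python
import numpy as np

def hypergradient(phi, X_tr, y_tr, X_val, y_val, ridge=1e-3):
    """d L_val / d phi for a scalar exponential-decay forgetting parameter,
    via the implicit-function-theorem expression in Equation (1)."""
    n, d = X_tr.shape
    ages = np.arange(n - 1, -1, -1).astype(float)
    w = np.exp(-phi * ages)                       # forgetting weights
    dw = -ages * w                                # dw/dphi

    # Hessian of the weighted ridge objective wrt theta: 2 (X'WX + ridge I).
    H = 2.0 * (X_tr.T @ (w[:, None] * X_tr) + ridge * np.eye(d))
    # Lower-level argmin from the weighted normal equations.
    theta = np.linalg.solve(H / 2.0, X_tr.T @ (w * y_tr))
    resid = y_tr - X_tr @ theta

    # Mixed second derivative: d/dphi of grad_theta g = -2 X' (dW/dphi) resid.
    cross = -2.0 * X_tr.T @ (dw * resid)
    dtheta_dphi = -np.linalg.solve(H, cross)      # Equation (1)

    # Gradient of the squared-error validation loss wrt theta.
    grad_val = -2.0 * X_val.T @ (y_val - X_val @ theta)
    return dtheta_dphi @ grad_val                 # Equation (2), chain rule
```

A gradient step on $\phi$ is then simply `phi -= eta * hypergradient(phi, ...)`; since the closed forms are exact here, the result agrees with a finite-difference derivative of the validation loss.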
3.3 Gradient Optimisation Routine
Since the upper-level problem is generally non-convex with multiple local minima, we propose running gradient descent with multiple restarts to explore different local solutions.
Once we have estimated the parameters $\hat{\phi}$ using bi-level optimisation on the train-validation split, we refit our predictive model on the combined train and validation sets. The weighting scheme for the training data points is given by the forgetting mechanism using the estimated $\hat{\phi}$, while a uniform weighting scheme is applied to the validation data points: each of these is given the same weight as the last training set point.
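The refit weighting described above can be sketched as follows (exponential decay used purely for illustration):

```python
import numpy as np

def refit_weights(n_train, n_val, phi):
    """Weights for refitting on the combined train+validation data:
    forgetting weights on the training block, and the last training
    weight repeated uniformly over the validation block."""
    ages = np.arange(n_train - 1, -1, -1)
    w_train = np.exp(-phi * ages)                 # most recent training point: weight 1
    w_val = np.full(n_val, w_train[-1])           # same weight as last training point
    return np.concatenate([w_train, w_val])
```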
There are further considerations when applying the bi-level optimisation procedure using a neural network as the underlying predictive model. Optimisation techniques such as flatness-aware learning SWA and Hessian approximation duv_hyp are relevant for numerical stability and large Hessian computations; see Appendix C for further discussion.

4 Related Work
Recursive least squares Haykin86; Rossi2012; Harvey90 provides an early example of the use of a simple forgetting mechanism for model adaptation. Since then, the concept drift literature has studied the use of forgetting mechanisms for time series distribution shift on an algorithm-by-algorithm basis Koychev2000; Klinkenberg2004. The field of non-stationary time series forecasting McCarthy2016; Kuznetsov2020 has also considered the use of forgetting mechanisms for dealing with distribution shift. McCarthy2016 propose an alternative Bayesian formulation to our ERM-based approach. Kuznetsov2020 propose an estimator for the one-step-ahead risk that is based on a weighted empirical risk. They derive a learning guarantee on the one-step-ahead generalisation error that holds in the non-stationary time series setting. They minimise this learning guarantee over the weights in their empirical risk and the parameters of their model. The first key difference between their work and ours is that we model the weights using a parametric forgetting mechanism. Secondly, they optimise an upper bound on the generalisation error to derive their weight estimates, whereas we directly minimise the validation-set prediction error to derive $\hat{\phi}$.
Our work distinguishes itself from the existing concept drift and non-stationary time series forecasting literature as it provides gradient-based learning for a generic forgetting mechanism and ERM-based learning model. A key novelty of our work is gradient-based minimisation of the validation error over the parameters of the forgetting mechanism. Gradient-based approaches are known to be much faster Hutter18 than grid search, which has typically been used to estimate the forgetting mechanism parameters in the concept drift Koychev2000; Klinkenberg2004 and non-stationary time series literature McCarthy2016. Faster learning allows for more complex, richly parameterised forgetting mechanisms that would be too computationally burdensome to fit using grid search. Our method connects time series prediction under distribution shift to the growing field of automated machine learning Hutter18, allowing us to leverage powerful modern automatic differentiation libraries for efficient implementation.
Beyond time series prediction, the idea of reweighting the empirical loss to address distribution shift has been considered in the context of importance weighting for transfer learning Fang2020 and robust deep learning Shu2019. For instance, in order to address the problem of data noise and class imbalance, Shu2019 propose to train a neural network that outputs the ERM importance weight of a training sample given that sample's loss. The work of Shu2019 and Jenni2018 illustrates the interest in applying differentiable bi-level optimisation methods to the problem of generalisation in non-temporal data settings.

| Model | FixedRegime | RandomWalk | RandomRegime | Stat | Factor | Vol |
|---|---|---|---|---|---|---|
| Stationary | 4.00* | 17.2* | 4.20 | 2.54 | 2.60* | 8.77 |
| Window | 2.62 | 3.10* | 4.63* | 2.57* | 2.62* | 8.96 |
| StateSpace | 2.73* | 3.30* | 5.25* | 2.60* | 3.81* | 9.75* |
| Arima | 2.65* | 13.4* | 4.41* | 2.57* | – | 9.24* |
| Dbf | 4.44* | 23.3* | 4.55* | 2.64* | 5.11* | 10.7* |
| GridSearchExp | 2.63 | 3.00 | 4.31* | 2.58* | 2.59* | 8.94 |
| GradExp | 3.96* | 17.2* | 4.20 | 2.55 | 2.59* | 8.77 |
| GradMixedDecay | 2.60 | 2.80 | 4.39* | 2.57* | 2.49 | 8.76 |
5 Experiments
5.1 Synthetic Data
Our synthetic data experiments are based on those from Kuznetsov2020 in order to facilitate comparison with their method. Four distribution shift settings are considered: regularly occurring change points, gradual distribution drift, irregularly occurring change points and no distribution shift. Data are generated as an autoregressive process $Y_{t+1} = \alpha_t\, Y_t + \epsilon_t$ in each of the following settings:

FixedRegime: $\alpha_t$ switches between two fixed values at regularly spaced change points, with $\alpha_t = 0.9$ in one of the two regimes.

RandomWalk: $\alpha_t$ follows a random walk, $\alpha_{t+1} = \alpha_t + u_t$.

RandomRegime: $\alpha_t = \beta_{s_t}$, where $s_t$ is a stochastic process on $\{0, 1\}$ whose regime switches occur at irregular, random times.

Stat: $\alpha_t$ is constant, so that there is no distribution shift.
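For illustration, the four settings can be simulated with an AR(1) generator along the following lines; the change-point spacing, switching probability, drift scale and second regime coefficient here are assumptions rather than the exact experimental constants:

```python
import numpy as np

def generate_series(setting, T=400, seed=0):
    """AR(1) data y[t+1] = alpha[t] * y[t] + eps under four illustrative
    distribution-shift settings (all constants here are assumptions)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T)
    alpha = 0.9
    regime = 0
    for t in range(T - 1):
        if setting == "fixed_regime":              # regularly occurring change points
            alpha = 0.9 if (t // 100) % 2 == 0 else -0.9
        elif setting == "random_walk":             # gradual distribution drift
            alpha = np.clip(alpha + 0.01 * rng.normal(), -0.99, 0.99)
        elif setting == "random_regime":           # irregularly occurring change points
            if rng.random() < 0.01:                # switch with small probability
                regime = 1 - regime
            alpha = 0.9 if regime == 0 else -0.9
        elif setting == "stat":                    # no distribution shift
            alpha = 0.9
        y[t + 1] = alpha * y[t] + rng.normal()
    return y
```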
The task in each setting is to forecast the time series value $Y_{t+1}$ using a no-intercept linear model whose feature vector consists of the last three values of the time series, $(Y_t, Y_{t-1}, Y_{t-2})$. We use an in-time train-validation-test split for the synthetic data experiments, and 192 Monte Carlo runs for each data setting.

5.2 Real Data
We consider a financial dataset consisting of 19 years of daily log-returns for 50 NYSE equities; see Appendix A for data processing details.
Our first real data experiment investigates distribution shift in the problem of risk factor modelling Fama1993. Specifically, we linearly decompose each equity return in excess of the risk-free rate using a three-factor model parameterised by $\theta = (\beta_{\mathrm{MKT}}, \beta_{\mathrm{SMB}}, \beta_{\mathrm{HML}})$:
$$r_t - r_{f,t} = \beta_{\mathrm{MKT}}\, \mathrm{MKT}_t + \beta_{\mathrm{SMB}}\, \mathrm{SMB}_t + \beta_{\mathrm{HML}}\, \mathrm{HML}_t + \epsilon_t.$$
The three Fama-French factors correspond to market risk ($\mathrm{MKT}_t$), the risk factor related to size ($\mathrm{SMB}_t$) and the risk factor related to book-to-market equity ($\mathrm{HML}_t$) Fama1993.
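As a simple illustration, the decomposition can be fit by ordinary least squares; this is an unweighted sketch with a hypothetical intercept term, not our exact estimation procedure:

```python
import numpy as np

def fit_three_factor(excess_returns, mkt, smb, hml):
    """OLS fit of r_excess ~ alpha + b_mkt*MKT + b_smb*SMB + b_hml*HML."""
    F = np.column_stack([np.ones_like(mkt), mkt, smb, hml])  # design matrix
    coefs, *_ = np.linalg.lstsq(F, excess_returns, rcond=None)
    alpha, b_mkt, b_smb, b_hml = coefs
    return alpha, b_mkt, b_smb, b_hml
```

In the paper's setting, this fit is replaced by the weighted ERM estimator, with sample weights supplied by the forgetting mechanism.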
Our second real data experiment investigates distribution shift in the problem of forecasting the absolute values of financial returns Taylor1986. For this experiment, we restrict our dataset to 15 Exchange Traded Fund (ETF) instruments tracking industry sectors, commodity and bond prices (the ETFs used are listed in Appendix A). We forecast the next-day absolute value of the return for each ETF, $|r_{t+1}|$, using the five previous absolute return values for that ETF with a linear model parameterised by $\theta$:
$$\widehat{|r_{t+1}|} = \sum_{j=0}^{4} \theta_j\, |r_{t-j}|.$$
For each real data time series, we use expanding time series cross-validation to evaluate the methods. We start training for each time series using 6 years of data; the validation set is 150 days, and we use the next 150 days as out-of-sample data. The train-validation-test split is moved forward by 150 days for the next model update.
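The expanding cross-validation scheme can be sketched as index generation (a simplified sketch, using the 150-day validation and test blocks described above):

```python
def expanding_splits(n_obs, initial_train, val_len=150, test_len=150, step=150):
    """Yield (train, val, test) index ranges for expanding time series CV:
    the training window grows while the validation and test blocks roll forward."""
    start_test = initial_train + val_len
    while start_test + test_len <= n_obs:
        train = range(0, start_test - val_len)          # everything before validation
        val = range(start_test - val_len, start_test)   # most recent val_len points
        test = range(start_test, start_test + test_len) # out-of-sample block
        yield train, val, test
        start_test += step
```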
5.3 Methods
We consider two simple instances of forgetting mechanisms in our experiments:
$$\omega_{\phi}(s) = \exp\left(-\phi\,(t - s)\right), \quad \phi \ge 0, \quad (3)$$
$$\omega_{\phi}(s) = \sum_{k} \pi_k\, g_k(t - s; \phi_k), \quad (4)$$
where the $g_k$ in (4) are fixed decay forms with mixture weights $\pi_k$ and parameters $\phi_k$. Forgetting mechanism (3) corresponds to exponential decay, while (4) corresponds to a mixture of various functional forms of decay. We fit each of these forgetting mechanisms using differentiable bi-level optimisation and call the resulting methods GradExp for mechanism (3) and GradMixedDecay for mechanism (4). Our gradient optimisation routines are implemented using 5 restarts from random initialisation, with each run performing 50 epochs of stochastic gradient updates. As a comparison to gradient-based optimisation, we fit forgetting mechanism (3) using grid search over $\phi$: the corresponding method is known as GridSearchExp.

We compare against five baselines: Stationary, Window, StateSpace, ARIMA and DBF. The Stationary method assumes no distribution shift and fits the underlying models using unweighted historical samples. The commonly used Window method applies uniform weights to a fixed-length interval of the most recent samples. In the StateSpace model, the state is the vector of model parameters, which is modelled using random-walk state transition equations brockwell2009. ARIMA is an autoregressive model (not applicable to the factor experiment) often used in econometrics brockwell2009. DBF is an implementation of the two-step, discrepancy-based forecasting algorithm from Kuznetsov2020. Details of model implementation and hyperparameters are reported in Appendix B.

5.4 Results
We see from Table 3 that GradMixedDecay performs best on 4 out of 6 datasets, with near-best performance in the RandomRegime and Stat data settings. These results demonstrate the competitive predictive performance under distribution shift that can be achieved with a forgetting mechanism trained using gradient-based bi-level optimisation. The outperformance of GradMixedDecay over GradExp illustrates the advantage of using more expressive differentiable forgetting mechanisms, which would be too computationally burdensome to fit using grid search. GradExp has comparable performance to GridSearchExp on most datasets, showing that gradient-based optimisation matches grid search in the low-dimensional cases where grid search is feasible.
6 Conclusion
In summary, we propose learning forgetting mechanisms for time series prediction under distribution shift using gradient-based optimisation. Our approach is flexible, fast and achieves performance competitive with baseline methods for time series prediction under distribution shift. In future work, we will investigate more complex forgetting mechanisms and more sophisticated predictive models. Another interesting extension would be to adapt our method to the online learning setting, so that the predictive model can be updated in response to streaming data. More broadly, our paper motivates further research at the intersection of distribution shift, time series prediction and hyperparameter learning.
Acknowledgements
The authors would like to thank Mihai Cucuringu, Gesine Reinert and Jake Topping for their feedback. Both authors are supported by the EPSRC CDT in Modern Statistics and Statistical Machine Learning (EP/S023151/1) and the Alan Turing Institute (EP/N510129/1).
References
Appendix A Data
We consider the universe of NYSE equities spanning 04-01-2000 to 31-12-2019 from Wharton's CRSP database WRDS. The data consist of daily closing prices, from which we compute daily log-returns. We also compute the average daily dollar volume traded for each equity and subset to the 50 equities with the largest average daily dollar volume. Any missing prices are forward-filled prior to the calculation of log-returns.
Factor returns were downloaded from Kenneth French’s data repository https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.
The ETF ticker names used for volatility experiments are ‘SPY’, ‘IWM’, ‘EEM’, ‘TLT’, ‘USO’, ‘GLD’, ‘XLF’, ‘XLB’, ‘XLK’, ‘XLV’, ‘XLI’, ‘XLU’, ‘XLY’, ‘XLP’, ‘XLE’.
Appendix B Model Details
Each model is grid searched over its relevant hyperparameters to maximise predictive performance on the validation set for each trainvalidation split of the data. In this Appendix, we provide details of the range of hyperparameter values searched over for each model.
Linear models in the Stationary, Window, DBF, GridSearchExp, GradExp and GradMixedDecay cases were fit using ridge regression with a squared-error loss function; the ridge penalty was selected by grid search.
GradExp, GradMixedDecay: For optimising $\phi$ on the validation set, we use standard mini-batch stochastic gradient descent with momentum for all experiments, keeping the learning rate and momentum constant at 0.1 and 0.9 respectively. We perform 5 restarts of gradient descent from random initialisations per run (we have found that similar results can be obtained using a single run of gradient descent for most forgetting mechanisms and datasets). We use a batch size of 32 examples and train for 50 epochs per restart. We then select the value of $\phi$ with the lowest overall validation loss for each method.
Window: For a time series of length $T$, we grid search the length of the window over 25 linearly spaced values.

GridSearchExp: We grid search $\phi$ in Equation (3) over 25 values. In order to ensure that the grid of values considered covers an a priori plausible range, the grid is computed from the window-length grid searched over in the Window method. In particular, for each window length considered, we select a value of $\phi$ such that the GridSearchExp forgetting mechanism assigns almost all of its weight to the same span length.
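One plausible form of this window-to-decay-rate correspondence (an assumption for illustration, not the paper's exact mapping) chooses $\phi$ so that a fixed fraction of the total geometric weight mass falls on the most recent $W$ samples:

```python
import numpy as np

def phi_from_window(window_len, mass=0.99):
    """Decay rate such that a `mass` fraction of the (geometric) total
    exponential weight falls on the most recent `window_len` samples,
    from solving 1 - exp(-phi * W) = mass."""
    return -np.log(1.0 - mass) / window_len

# Candidate phi values derived from a hypothetical window-length grid.
T = 1000
window_grid = np.linspace(20, T, 25)
phi_grid = phi_from_window(window_grid)
```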

ARIMA: The hyperparameters of ARIMA are the autoregressive lag $p$, the differencing order $d$ and the number of moving average components $q$. Following Kuznetsov2020, we grid search over $(p, d, q)$. The model is fit using maximum likelihood under a Gaussian emission model brockwell2009.

StateSpace: The statespace model’s observation equation is a linear model with Gaussian noise where the states correspond to the predictive model’s regression coefficients for the given problem. In the state equation, the states are updated with a Gaussian increment. The model is fit using maximum likelihood.

DBF: The DBF algorithm is an iterative two-stage procedure that alternately fits the predictive model and the weights in the ERM problem (see Kuznetsov2020, Section 7, Equation 14). Their procedure is run for 5 iterations. The lookback parameter, corresponding to the window in which the discrepancy is thought to be bounded Kuznetsov2020, is set to the value used in the original paper. The DBF algorithm has two penalty parameters, which correspond to regularisation on the weighting scheme and the model parameters. We grid search both of these parameters over the same grid as the ridge regression penalty.
Appendix C BiLevel Optimisation for Deep Learning
In ongoing work not presented in this paper, we found that further considerations are required when the lower-level problem has multiple local minima (as is the case when using a neural network as the predictive model). We have empirically found that flatness-aware learning SWA of the lower-level problem is important for the stability and generalisation performance of the bi-level optimisation solution. We conjecture that this is because a sharp minimum leads to a large Hessian condition number and therefore unstable inversion in Equation (1).
Secondly, we have observed that an identity approximation to the Hessian in Equation (1) yields competitive results while reducing the computational cost of running bi-level optimisation with a highly parameterised predictive model such as a neural network. A number of other Hessian approximation schemes have been proposed in the gradient-based hyperparameter optimisation literature; see duv_hyp for a comparison of these approaches in the context of hyperparameter optimisation.