Continual Learning Augmented Investment Decisions

12/06/2018 ∙ by Daniel Philps, et al. ∙ City, University of London

Investment decisions can benefit from incorporating accumulated knowledge of the past to drive future decision making. We introduce Continual Learning Augmentation (CLA), which is based on an explicit memory structure and a feedforward neural network (FFNN) base model and is used to drive long-term financial investment decisions. We demonstrate that our approach improves accuracy in investment decision making while memory is addressed in an explainable way. Our approach introduces novel remember cues, consisting of empirically learned change points in the absolute error series of the FFNN. Memory recall is also novel, with contextual similarity assessed over time by sampling distances using dynamic time warping (DTW). We demonstrate the benefits of our approach by using it in an expected return forecasting task to drive investment decisions. In an investment simulation in a broad international equity universe between 2003 and 2017, our approach significantly outperforms FFNN base models. We also illustrate how CLA's memory addressing works in practice, using a worked example to demonstrate the explainability of our approach.




1 Introduction

Time-series are ubiquitous in modern human activity, not least in finance, and applying continual learning to financial time-series problems would be highly beneficial. However, existing approaches suffer from problems such as catastrophic forgetting, the opacity of implicit memory or outright complexity, arguably making them poorly suited to the high-magnitude decision making that tends to be required in finance. We introduce Continual Learning Augmentation (CLA), a time-series based memory approach that accumulates models of past states, which are recalled and balanced when states approximately reoccur. We apply CLA to drive investment decisions in international equities markets, and the results show improved investment performance compared to a feedforward neural network (FFNN) base model. (In further research we have found that CLA also outperforms more traditional, linear models.) We also show a worked example of how CLA's memory addressing adds explainability.

The remainder of this paper is organized as follows. The next section reviews related work on memory modelling. We then introduce CLA in Section three, describing how different states are identified and remembered, and how models of past states are recalled and applied, as well as how less relevant models are forgotten. In Section four, we report experiments with feedforward neural networks that show significant improvements with CLA. In Section five, we illustrate how CLA enables interpretation in terms of references to past states. Finally, we present our conclusions and directions for future work.

2 Related work

Approaches developed for continual learning have been applied to many areas and have employed a wide range of techniques, including gated neural networks [17][2], explicit memory structures [37], prototypical addressing [33] and weight adaptation [16][34], to name a few. Having addressed catastrophic forgetting [11], more recent research has turned to solving second order problems, such as the overheads of external memory structures [28], problems with weight saturation [24] and the complications of outright complexity [38]. While a number of memory approaches apply to sequential memory tasks [13][14], far fewer have focused specifically on time-series [21][12][25][10]. It is unclear how effective these approaches would be in dealing with long-term continual learning of the noisy, non-stationary time-series commonly found in finance.

2.1 Financial regimes as a memory concept

Regime switching models and change point detection provide a simplified answer to identifying changing states in time-series. Their major disadvantage is that change points between regimes (or states) are notoriously difficult to identify out of sample [8], and existing econometric approaches are limited by long-term, parametric assumptions in their attempts [6, 4, 35, 29, 7, 40, 32]. There is also no guarantee that a change point in a time-series represents a significant change in the accuracy of an applied model, which is a more useful perspective for modelling different states. Another approach is to focus on the change in the absolute error of a model, aiming to capture as much information as possible regarding changes in the relation between independent and dependent variables [15]. Different forms of residual change have been developed [1, 20, 19, 26, 18]. However, most approaches assume a single or known number of change points in a series and are less applicable to a priori change points or multivariate series [15].

2.2 Memory models

Memory modelling approaches using external memory structures require an appropriate memory addressing mechanism (a way of storing and recalling a memory). Memory addressing is generally based on a similarity measure such as cosine similarity [13, 27], kernel weighting [36], use of linear models [33] or instance-based similarities, many using K-nearest neighbours [22][34]. However, these approaches are not obviously well suited to assessing the similarity of noisy and non-stationary, multivariate time-series. Euclidean distance offers a way to compare time-series but has a high sensitivity to the timing of data-points, something that has been addressed by dynamic time warping (DTW) [30, 5]. However, DTW requires normalized data [23] and is also computationally expensive, although some mitigating measures have been developed (see [39]).

3 Continual Learning Augmentation

Continual learning augmentation (CLA) is a regression approach applied as a sliding window stepping forward through time, over input data of one or more time-series. The approach is initialized with an empty memory structure and an FFNN base model with its own set of parameters. The base model is applied to a multivariate input series of several variables observed over a number of time steps, and produces a forecast value in each period as time steps forward. A remember function appends a new model-memory to the memory structure on a remember cue, defined by the change in the base model's absolute error at a given time point. A recall function balances a mixture of base model and model-memory forecasts.

Figure 1: Continual learning augmentation architecture. (a) A regression base model, defined by its parameters, is run stepping forward through time, training at each time step. (b) This base model is structured as a column containing the model parameters and a contextual reference. Initially the base model is run as normal, completing a forward pass with the input data to produce a forecast. (c) As time steps on, the dependent variable becomes observable and a backward pass is conducted, in which the absolute error of the base model determines whether a change point has occurred. On a change, the remember function copies the base model column to a new memory column in the memory structure. The base model is then trained. (d) Over time more change points will be detected and more memory columns will be added. At each time step, all memory columns are run using the current input data. The forecasts of all the columns (including the base model column) are balanced by the recall function, which uses the similarity of the current input to the contextual reference of each memory to weight the output of each column and produce the final CLA forecast.

Figure 1 shows the functional steps of remembering and recalling model-memories.

3.1 Memory management

Repeating patterns are required in the input data to provide memory cues to remember and recall different past states. Model parameters trained in a given past state can then be applied if that state approximately reoccurs in the future. When CLA forms a memory, it is stored as a column in an explicit memory structure, similar to [3], which changes in size over time as new memories are remembered and old ones forgotten. Each memory column consists of a copy of a past base model parameterization and the training data used to learn those parameters. As the sliding window steps into a new time period, CLA recalls one or more model-memories by comparing the latest input data with the training data stored in each memory column. Memories with training data that are more similar to the current input series will have a higher weight applied to their output and therefore make a greater contribution to the final CLA output.

3.2 Remembering

Remembering is triggered by changes in the absolute error series of the base model as the approach steps forward through time. These changes are assumed to be associated with changes in state, which are indicated by the remember function. The remember function defines a change and stores a pairing of the parameterization of the base model and a contextual reference. Figure 1c) shows how a change is detected by the remember function from a backward pass, which then results in a new memory column being appended to the memory structure.

Immediately after the remember event has occurred, a new base model is trained on the current input, overwriting the previous base model parameters.

Theoretically, for a fair model of a state, the model error would have an approximately zero-valued mean. The current base model would therefore cease to be a fair representation of the current state when its absolute error exceeds a certain confidence interval, in turn implying a change in state.

A critical level for the absolute error indicates that a change point in state has occurred. Memories are only stored when the observed absolute error series spikes above this critical level:

0:  Initialize memory structure (empty)
0:  Initialize critical level
  # Step through time, period by period, starting at the earliest date
  for all time steps in the study term do
      absolute error ← error after forward pass of base model
     if absolute error > critical level then
        append model-memory to memory structure
     end if
     # Dependent variable becomes observable
      base model parameters ← train base model and overwrite
      critical level ← learn and update
  end for
Algorithm 1 Remember function
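
The loop in Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: a running-mean forecaster stands in for the FFNN, the critical level is a fixed argument rather than being relearned at every step, and the names `MemoryColumn`, `remember_loop` and `critical_level` are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class MemoryColumn:
    """One model-memory: a copy of base-model parameters plus its training context."""
    params: float   # stand-in for the FFNN parameters (assumption)
    context: list   # training data the parameters were learned on


def remember_loop(inputs, targets, critical_level):
    """Step through time; store the base model whenever its absolute error spikes."""
    memory = []      # explicit memory structure, initially empty
    params = 0.0     # hypothetical base model: forecasts a single constant
    context = []     # training data behind the current parameters
    for window, target in zip(inputs, targets):
        forecast = params                    # forward pass of the base model
        abs_error = abs(target - forecast)   # known once the target is observable
        if context and abs_error > critical_level:
            # Remember cue: copy the current base model into a new memory column.
            memory.append(MemoryColumn(params, list(context)))
        # Retrain the base model on the current window, overwriting its parameters.
        context = window
        params = fmean(window)
    return memory, params
```

With a stable series followed by a jump, the model trained before the jump is the one that gets stored, which is the behaviour the remember cue is meant to produce.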

The critical level is a hyperparameter, optimized at every time step, to give a level of sensitivity to remembering that forms the external memory resulting in the lowest empirical forecasting error for the CLA approach over the study term up to the current time. In this optimization, the CLA approach is expressed as a function of the input series and the critical level, yielding the absolute error of the base model at each time step. Candidate critical levels are drawn from a 20-point, equidistant set between the minimum and the maximum values of the observed absolute error series.
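
A minimal sketch of this search, assuming the candidate set spans the minimum and maximum of the observed absolute errors and that the CLA forecasting error for a given critical level is available as a callable. The names `equidistant_grid`, `select_critical_level` and `cla_error` are hypothetical.

```python
def equidistant_grid(abs_errors, points=20):
    """Equally spaced candidate levels between the min and max absolute error."""
    lo, hi = min(abs_errors), max(abs_errors)
    step = (hi - lo) / (points - 1)
    return [lo + i * step for i in range(points)]


def select_critical_level(abs_errors, cla_error, points=20):
    """Return the candidate level with the lowest CLA forecasting error.

    `cla_error` is a hypothetical callable that reruns the approach for one
    candidate level and returns its empirical forecasting error to date.
    """
    grid = equidistant_grid(abs_errors, points)
    return min(grid, key=cla_error)
```

Because the search is repeated at each time step on data available up to that step, it does not look ahead, at the cost of rerunning the approach once per candidate.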

3.3 Recall

The recall of memories takes place in the recall function, which calculates a mixture of the predictions from the current base model and from model-memories.

The mixture coefficients are based on comparing the similarity of the current time-varying context with the contextual references stored with each individual memory. Memories that are more similar to the current context have a greater weight in CLA's final modelling outcome. Dynamic time warping (DTW) is used to calculate contextual similarity. However, multivariate DTW is computationally expensive [31]. As well as applying traditional constraints to the warping path, we also use a sampling-based implementation to reduce expense further. DTW is only applied to a subset of randomly sampled instances from the current input and from the stored context of each memory, sampling over rows, each of which represents a different security in the dataset. The expected distance is the mean of the DTW distances over a fixed number of samples, with the sampled rows chosen as random integers between 1 and the number of rows.

The mean sampled distance is used to determine the similarity between the current context and those of each memory.
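
The sampling idea can be illustrated with a short sketch. It makes several simplifying assumptions: each row is treated as a univariate series, the DTW is the classic unconstrained dynamic-programming form with an absolute-difference cost (no warping-path constraint), and the rows of the two matrices are sampled independently. The function names are hypothetical.

```python
import random


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def sampled_dtw(current, remembered, n_samples, rng):
    """Mean DTW distance over randomly sampled row pairs (rows = securities)."""
    total = 0.0
    for _ in range(n_samples):
        row_a = current[rng.randrange(len(current))]
        row_b = remembered[rng.randrange(len(remembered))]
        total += dtw_distance(row_a, row_b)
    return total / n_samples
```

A call such as `sampled_dtw(current, remembered, n_samples=50, rng=random.Random(0))` costs fifty pairwise DTW evaluations regardless of how many securities the matrices hold, which is where the saving comes from. The result varies with the random draw, so repeated runs are needed to assess its effect.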

3.4 Balancing

Two different approaches to memory balancing were used. The first, best, selects the individual model-memory that is most similar to the current context (i.e. the one with the lowest distance) and uses its regression output. The second, similarity weighted, forms an ensemble of all model-memories in the memory structure, weighting each regression output by its similarity to the current context. As a past state is unlikely to repeat perfectly, a continuous function for balancing model-memories is more likely to generalize well [14] than picking the best single model, which is indeed found to be the case.
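
The two balancing schemes can be sketched as follows. The exact weighting formula is not recoverable from the text, so the inverse-distance weights used here are an assumption made purely for illustration, as are the function names.

```python
def balance_best(forecasts, distances):
    """Return the forecast of the column whose context is closest to the current input."""
    best = min(range(len(distances)), key=distances.__getitem__)
    return forecasts[best]


def balance_similarity_weighted(forecasts, distances, eps=1e-9):
    """Blend all column forecasts, weighting by inverse distance.

    Inverse distance is an assumed similarity measure; `eps` avoids division
    by zero when a stored context matches the current input exactly.
    """
    weights = [1.0 / (d + eps) for d in distances]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, forecasts)) / total
```

When all distances are equal the weighted scheme reduces to a plain average, which corresponds to the equal weighted ensemble used as a comparison in the experiments.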

4 Simulating investment decisions

Figure 2: Long/short investment simulations. 50 simulations were run using CLA with best balancing and then with similarity weighted balancing. For each test the min, max, mean and median total return results are shown. TR is the annualized total return, SD the annualized standard deviation, and TR/SD is the Sharpe ratio. Augmentation benefit shows the performance of CLA over the base model, where RR is the annualized relative return of CLA over the base model and RR/SD is the information ratio. P-values of the t-stats of the Sharpe and information ratios show that CLA produces statistically significant Sharpe ratios at the 1% level for the mean and median results, and statistically significant information ratios at the 1% level for the similarity weighted simulations. The sign tests show statistically significant hit rates at the 1% level for both best and similarity weighted tests.
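
The ratios named in the caption can be illustrated with a short calculation. This is a sketch under assumptions: returns are simple periodic returns, the annualized total return is geometric, the standard deviation is the sample standard deviation scaled by the square root of the number of periods per year, and no risk-free rate is deducted. The function name is hypothetical.

```python
from statistics import stdev


def annualized_stats(returns, periods_per_year):
    """Return (annualized total return, annualized SD, their ratio)."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r                     # compound the periodic returns
    years = len(returns) / periods_per_year
    total_return = growth ** (1.0 / years) - 1.0
    sd = stdev(returns) * periods_per_year ** 0.5
    return total_return, sd, total_return / sd
```

Applied to strategy returns, the third value corresponds to TR/SD. Applied to the period-by-period difference between CLA and base model returns, it gives one possible reading of RR/SD.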

CLA is used in a regression task to forecast future expected returns of individual equity securities, and these forecasts drive an equities investment simulation. Stock level characteristics were used as the input dataset to batch train an FFNN over all stocks in each period, forecasting US$ total returns 12 months ahead for each stock. Where a forecast was in the top (bottom) decile it was interpreted as a buy (sell) signal.
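
A minimal sketch of the decile rule, assuming forecasts are held in a dictionary keyed by stock identifier. The function name and the "hold" label for the remaining stocks are hypothetical.

```python
def decile_signals(forecasts):
    """Map each stock to 'buy' (top decile), 'sell' (bottom decile) or 'hold'."""
    ranked = sorted(forecasts, key=forecasts.get)   # lowest forecast first
    k = max(1, len(ranked) // 10)                   # stocks per decile
    signals = {stock: "hold" for stock in ranked}
    for stock in ranked[:k]:
        signals[stock] = "sell"
    for stock in ranked[-k:]:
        signals[stock] = "buy"
    return signals
```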

Although CLA is designed to use non-traditional driver variables, stock level characteristics are commonly expressed using factor loadings [9]. To provide a more traditional context for testing, factor loadings were used as input data. These were estimated in-sample at each time step by regressing each stock's US$ excess return stream on style factor returns. The dependent variable was the excess return of the stock in each period, and the regressors were the excess return of the All Countries World Ex-USA Equities Index and the relative return of the All Countries World Ex-USA Value Equities Index.
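
The factor loading estimation can be sketched as an ordinary least squares regression of a stock's excess returns on the two index return series. The sketch assumes an intercept is included and uses a small hand-written solver so that it needs only the standard library. The function names are hypothetical and the original regression specification may differ.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def factor_loadings(stock, market, value):
    """OLS of stock excess returns on [1, market, value].

    Returns (intercept, market loading, value loading), found by solving
    the normal equations X'X b = X'y.
    """
    rows = [[1.0, mk, v] for mk, v in zip(market, value)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, stock)) for i in range(3)]
    return tuple(solve_linear(xtx, xty))
```

Running this for every stock at each time step, and over several lags, would produce one row of loadings per stock.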

Stock level factor loadings populated a matrix which comprised the input data. Each row represented a stock appearing in the index at a given time (up to 4,500 stocks) and each column related to a coefficient calculated on a specific time lag. The final input matrix resulted from winsorizing the raw input to eliminate outliers. An FFNN was trained in each period by separating the input data into training, cross validation and testing sets in 75/5/25 proportions. Long/short model portfolios were constructed and rebalanced every six months over the study term, using equal weighted long positions (buys) and shorts (sells). The simulation encompassed 4,500 international equities in total, covering over 30 countries across developed and emerging markets, corresponding to the All Countries World Ex-USA Equities Index between 2001 and 2017. (Note that the first 24 months were used as a training period, while testing, which was entirely out of sample and free from known data snooping biases, started in 2003.) To account for the DTW sampling approach used, multiple test runs were carried out, with 50 simulations run per test. Both best and, separately, similarity weighted balancing were tested. Further testing was conducted to investigate whether results exhibited only an ensemble effect. An equal weighted balancing approach was also tested and found to generate weaker positive total returns relative to both the best and the similarity weighted balancing approaches, demonstrating that CLA is exhibiting more than an ensemble effect.
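
Two of the steps described in this section, winsorizing and equal weighted long/short portfolio construction, can be sketched as below. The percentile cut-offs are not stated in the text, so the 5% and 95% defaults are assumptions, as are the function names and the dictionary layout.

```python
from statistics import fmean


def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to assumed lower and upper percentile bounds."""
    ordered = sorted(values)
    lo = ordered[int(lower_pct * (len(ordered) - 1))]
    hi = ordered[int(upper_pct * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]


def long_short_return(signals, realized):
    """Equal weighted return of buys minus equal weighted return of sells."""
    longs = [realized[s] for s, sig in signals.items() if sig == "buy"]
    shorts = [realized[s] for s, sig in signals.items() if sig == "sell"]
    return fmean(longs) - fmean(shorts)
```

Repeating `long_short_return` at each six-monthly rebalance gives the periodic return series from which the simulation statistics are computed.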

5 Simulation results

5.1 Accuracy: Investment decisions

CLA results for long/short simulations showed a significant return benefit over the FFNN base model. (In further testing it was also found that CLA produces positive and statistically significant augmentation benefits over traditional linear models.) CLA is also found to be particularly effective at identifying stocks that produce a poor future return. As well as producing good performance relative to the base model, CLA produced strong positive hit rates, statistically significant at the 5% level or better in both tests. Examining the distribution of simulation results, it is notable that augmentation benefit increases as the base model error increases, and decreases as it falls. This property was exhibited in all tests conducted.

5.2 Explainability: Interpretable Memory

Figure 3: Interpretable memory: how recalled memories contribute to simulation performance. The graph in a) uses a single simulation and shows the growth of a $1 investment in 2003, using strategies driven by CLA or the base model. b) shows a representation of the CLA memory structure, where each row in the expanding triangle represents a potential memory. This external memory structure can grow by one memory at each step forward in the simulation, although in practice only four memories were remembered in this simulation. The top row in the triangular graphic represents the base model. The memory with the highest weight in each period is highlighted.

CLA produces results that can be explained by examining which past models have been applied to which approximately repeating states and, with further investigation, why. Figure 3 shows an example of a simulation run, where 3a) shows how the value of $1 would have changed if invested in an investment strategy driven by CLA and, separately, by the base model. Figure 3b) shows the memory structure of CLA, where a new memory can theoretically be appended at every step forwards in the simulation, although only four memories were remembered in this example. Two memories are examined. Firstly, a memory formed in the period ending July 2006 is used by CLA to outperform the base model in the period Sep 2007 - Oct 2008. Interestingly, this covers the period of the Quant Quake and runs until just after the collapse of Lehman Brothers in the thick of the 2008 financial crisis. Secondly, a memory formed in the period ending December 2009 is used by CLA between 2011 and mid 2014, the period affected by the eurozone crisis and subsequent recovery. It is also used by CLA, albeit to far lesser effect, in late 2016.

6 Conclusion

Continual Learning Augmentation (CLA) is able to accumulate knowledge of changing market states over time and to apply this knowledge in approximately reoccurring future states. Our aim has been to combine different machine learning concepts to create a memory structure that improves the accuracy of time-series based modelling and produces memory modelling outcomes that are explainable. CLA is, to our knowledge, the first approach successfully applied to continual learning in the noisy, non-stationary finance problem space, using an explicit memory approach to drive state dependent decision making. We introduce absolute error change as a memory concept, using a critical level hyperparameter which learns the points that govern remembering and forgetting of model-memories. We also use a sampling approach on multivariate DTW as a similarity measure to access CLA's memory structure, and make the first application of a memory modelling approach to a broad stock selection problem.

6.1 Accuracy of outcomes

We find that CLA produces a positive, statistically significant forecasting benefit using an FFNN base model. Long/short tests show positive and statistically significant total returns and a positive and statistically significant augmentation benefit relative to the base model. The similarity weighting of model-memories produces stronger results than simply picking the best model-memory. If CLA were exploited in practice, this outperformance would give a significant advantage to investment strategy returns.

6.2 Explainability of memory use

CLA’s memory structure can be interpreted in terms of which past state is relevant to forecasting in the current state. This allows objective comparisons to be made between relevant past states and the current state and also allows for a better understanding of the characteristics of the current state in the context of similar past states. This information can provide deep insights to users to guide decision making.

6.3 Future work

Our results indicate that CLA may be effectively applied to other problems on noisy and non-stationary time-series, in and outside of the finance domain. While our approach is directly applicable to quantitative investment, we intend for our research to also be applied to other fields such as computational biology and wearable technology. It is also noted that absolute error change as a memory concept, introduced in this study, could hold more benefits for memory augmentation or model selection, where change in the absolute error distribution could be used to better identify changing states and to learn more appropriate models. We feel that the benefits of our approach in terms of accuracy and explainability justify the additional complexity of CLA and warrant research into a fully differentiable structure for learning the relationships between memory cues, memory models and ultimate actions.