1 Introduction
The bond market, in particular the government sector, plays a fundamental role in the overall functioning of the economy and is of paramount importance for financial markets. This is the case both as an asset class by itself (with an overall size of USD 102.0 trillion as of 31 December 2016 (Bloomberg, 2017), which compares to a global equity market of USD 66.3 trillion), and because the valuation methods of other asset classes often depend on bond yields as input information, especially for equities and corporate bond yields. In addition, its importance derives from the fact that bonds and fixed income securities in general are a significant component of the portfolios of pension funds and insurance companies. Most commonly, the percentage of bonds varies from 25 to 40% of portfolio assets for pension funds and around 40 to 70% in the case of life insurance companies (OECD, 2015).
Moreover, the bond market is at the early stages of both quantitative investing and electronic market-making, clearly lagging behind equities and foreign exchange (forex) markets. This lag becomes evident in the scientific literature. In fact, considerable attention has been devoted to the use of machine learning and development of techniques for equity markets (Booth et al., 2014; Eilers et al., 2014; Ballings et al., 2015; Dunis et al., 2016; Fischer and Krauss, 2018; Kraus and Feuerriegel, 2017; Qin et al., 2017; Sermpinis et al., 2019), and also for forex markets (Gradojevic and Yang, 2006; Huang et al., 2007; Choudhry et al., 2012; Sermpinis et al., 2012; Fletcher and Shawe-Taylor, 2013; Sermpinis et al., 2013), just to mention a few studies (more detail in Section 2). In contrast, there is a significant gap in both the academic literature and the finance industry when it comes to the application of machine learning techniques in fixed income markets (Castellani and Santos, 2006; Dunis and Morrison, 2007; Kanevski et al., 2008; Kanevski and Timonin, 2010; Sambasivan and Das, 2017).
Furthermore, most of the applications of machine learning to financial assets tend to be limited to forecasting and comparison of results versus benchmarks. Only very few publications can be found that try to extract additional information from the model or study how the model works (see Section 2). Indeed, advances in machine learning enable enhanced decision-making by, e.g., using new types of data (Kraus and Feuerriegel, 2017) and reinforcement learning techniques (Eilers et al., 2014). However, for machine learning models to be useful in asset management decision-making they need to be trustworthy. To achieve this, a better understanding of their functioning is crucial.
In this field of machine learning, more precisely in deep learning, one of the most successful models for sequence learning is the long short-term memory (LSTM) network. The architecture of this model includes a feedback loop mechanism that enables the model to “remember” past information. This model has been achieving top results in other scientific fields but has not been used on a broad basis in financial applications. This will be further detailed in Section 2. More specifically, bonds have not previously been studied with LSTMs. This is an additional gap in the literature, despite both the importance of this asset class in financial markets and the potential of LSTMs for financial forecasting. The discussion of this potential will be the focus of Section 3.2.
Given the status quo on machine learning research in bonds, the main high-level objectives of this study are twofold: to assess the potential of LSTM networks for bond yield forecasting, testing their memory advantage versus memory-free models such as standard feedforward neural networks; and to demystify the preconceived notion of black box associated with the LSTM model. Together these objectives go towards bridging the gaps identified in the literature and presented above. Besides, they contribute to improved knowledge and trustworthiness of LSTM networks, providing asset management practitioners with additional tools for better decision-making.
In more detail, our key contributions are as follows. First, we conduct an innovative application of a deep learning model (LSTM) to bonds. The results are compared to memory-free multilayer perceptrons (MLP). Our results validate the potential of LSTM networks for yield forecasting. This enables their use in intelligent systems for the asset management industry, in order to support the decision-making process associated with the activities of bond portfolio management and trading. Additionally, we identify the LSTM’s memory advantage over standard feedforward neural networks, showing that the univariate LSTM model with additional memory is capable of achieving similar results to the multivariate MLP with additional information from markets and the economy.
Second, we go beyond the application of LSTMs, by conducting an in-depth study of the model itself (opening the black box), to understand the representations learned by its internal states. Such explanations of what black-box models learn are a popular topic of interest for certification and litigation purposes. In more detail, we extract and analyse the signals in both states (hidden and cell) and at the gates inside the LSTM memory cell. This is the first contribution to demystify the notion of black box attached to LSTMs using a technique which is fundamentally different from the most relevant ones found in the literature. Other studies are applied to a different type of recurrent neural network (Giles et al., 2001) or perform an external analysis of the model (Fischer and Krauss, 2018).
Third and last, following the extraction of signals at those locations, we proceed to explain the information they contain with exogenous economic and market variables. For that purpose, we develop a new methodology here identified as LSTM-LagLasso, based on both Lasso (Tibshirani, 1996) and LagLasso (Mahler, 2009). This methodology is capable of identifying both relevant features and corresponding lags.
The remainder of this paper is structured as follows. In Section 2 the literature review is presented. In Section 3 we introduce the theory behind the deep learning model used in our research, together with its main advantages, limitations and potential for yield forecasting. Section 4 covers the bond yield forecasting study using LSTMs versus MLPs. Section 5 focuses on the internal analysis of signals inside the LSTM model, while Section 6 details the explanation of those signals using exogenous variables, introducing the LSTM-LagLasso methodology developed for that purpose. Finally, in Section 7 the main conclusions are outlined together with directions for future work.
2 Literature review
The literature review starts by looking at the main applications of recurrent neural networks in finance and other fields. Then, considering the subset of publications on forecasting financial assets, we analyse in detail the whole scope of the application carried out, to compare and contrast with our research.
2.1 RNN-LSTM applications in finance and other fields
The recurrent neural network family of models, especially the LSTM networks, has been shown to significantly outperform other sequence models. In fact, in one of the most recent and comprehensive books on deep learning the authors categorically state that gated RNNs, i.e. LSTMs and gated recurrent unit (GRU) based networks, are the most effective sequence models used in practical applications (Goodfellow et al., 2016).
The applications are almost endless, and comprise a wide variety of activities and scientific fields, such as (see Lipton et al. (2015)): handwriting recognition, text generation, natural language processing (recognition, understanding and generation), time series prediction, video analysis, musical information retrieval, image captioning, music generation, and interactive types of problems like controlling a robot. For natural language processing in particular, LSTMs are among the most widely used deep learning models to date.
Most of these activities have in common the fact that they involve sequential data, and this is what RNNs and LSTMs do best. They can process sequences as input, as output or, in the most general case, on both sides (Karpathy, 2015). Furthermore, LSTMs are used to take advantage of their capability to learn long-term dependencies.
In the financial domain, no publication could be found in the current literature using this type of model in fixed income markets. A discussion of its potential in this area will be the focus of Section 3.2. Indeed, applications of RNN and LSTM models were found first and foremost for equities. See, for example, the works by Xiong et al. (2016), Persio and Honchar (2016b, 2017), Fischer and Krauss (2018), Kraus and Feuerriegel (2017), Munkhdalai et al. (2017) and Qin et al. (2017). In addition, substantial work can also be found on forex markets (Giles et al., 2001; Maknickienė and Maknickas, 2012; Persio and Honchar, 2016a). Other applications can be found for financial crisis prediction (Gilardoni, 2017) and for credit risk evaluation in peer-to-peer (P2P) lending platforms (Zhang et al., 2017).
2.2 Main scope in the financial applications
Considering the applications to other financial assets presented in Section 2.1, what most of them have in common is that they are pure applications of the model. This usually includes the implementation of the model to the selected asset class (equity, forex, or other), the calculation of errors, and the comparison with other models or benchmarks. In some notable exceptions, the research goes beyond the pure application, and analyses the information inside the model. The main studies in this context are presented below.
To begin with, Strobelt et al. (2018) have developed a tool that facilitates the visualisation of signals inside the LSTM memory cell, called “LSTMVis”. This analysis can be used to identify hidden state dynamics in the LSTM model that would otherwise be lost information. There are some indications in the literature that this type of information extraction from the gates may be relevant, as will be seen below. This visualisation tool is especially orientated to the manipulation of sequences of words. They may consist of English words for translation or sentiment detection, or other types of symbolic input such as musical notes or code.
Based on the “LSTMVis” visualisation tool, Persio and Honchar (2017) provide examples of activation time series in RNNs. The authors mention that this type of analysis may be able to detect trends in time series and, in this case, the signals could be used as indicators. However, their paper does not specify at what level in the model the authors captured those signals (states or gates and their identification). Nevertheless, this study represents a first attempt at this subject, although further research is needed to demonstrate the suggested capability to detect trends in time series.
In an earlier study, Giles et al. (2001) refer to the possibility of extracting rules and knowledge from trained recurrent networks modelling noisy time series. To that end, they first convert the input time series into a symbolic representation using self-organizing maps. Then the problem becomes one of grammatical inference and they use RNNs considering the sequence of symbols as inputs. More specifically, they use an Elman type of architecture for the recurrent neural network (Elman, 1990). In addition, the converted inputs facilitate the extraction of symbolic information from the trained RNNs in the form of deterministic finite state automata. The interpretation of that information resulted in the extraction of simple rules, such as trend following and mean reversion.
In contrast to Giles et al. (2001), in our research we use the LSTM model, a more recent type of recurrent neural network. Moreover, it should be emphasized that the conversion into symbols has a filtering effect on the data and this may be undesirable. Hence, our data preprocessing does not include any type of filtering operation, given our view that the data complexity and its volatility do not conform with the concept of noise. This issue will be further detailed in Section 6.1 (LSTM-LagLasso).
Last but not least, Fischer and Krauss (2018) have used a different approach with the same objective of interpreting what happens in the model. We will briefly explain the methodology used, for context and to clarify the differences in relation to our approach. In their research, the authors carry out a notable and comprehensive market prediction study on the component stocks of the S&P 500. Using LSTMs they forecast the next time step, subsequently ranking the individual stocks based on the probability of outperforming the cross-sectional median. Then they group the top and bottom stock performers. Considering the model’s input sequence of returns for each of those groups, they calculate several descriptive statistics and identify characteristics of the stocks belonging to each of those two groups (top and bottom likely performers). Using this methodology, they found that stocks in the top and bottom groups exhibit the following characteristics: high volatility, below-mean momentum, and extreme returns in the last few days with a tendency to revert them in the short term. This is especially the case for groups with a smaller number of stocks. Since these characteristics were found on direct outputs of the model, they are attributed to the functioning of the LSTM networks.
Based on the result obtained, Fischer and Krauss (2018) devised a trading strategy, which consists of selling the (recent past) winners and buying the (recent past) losers. This is a possible but simplified trading strategy. It is known in practice in financial markets as “contrarian”, in the sense that it is counterintuitive, and it is certainly far from consensual. See for example the work of Jegadeesh and Titman (1993) and Wang et al. (2019) supporting the opposite strategy “buy the winners and sell the losers”. In contrast, Khanal and Mishra (2014) found no clear evidence of a profitable “buy the winners and sell the losers” strategy, considering it on a buy and hold basis for 3- to 12-month periods and for the period studied between 1990 and 2012. Finally, the work of Antoniou et al. (2003) supports the strategy “buy the losers and sell the winners”, which is the one considered in the work under analysis. In fact, in the most dangerous situations for a stock and most penalising for a portfolio return, a sequence of negative returns may be just the beginning of a serious bear market for that particular stock, or something even more serious affecting the company that will result in a prolonged correction.
To conclude, the authors (Fischer and Krauss, 2018) conducted an external analysis of the LSTM model, in the sense that they use the output of the model to infer its functioning. No internal analysis is carried out of the LSTM states or gates, and this is the main difference in relation to our research and one of our contributions to the present state-of-the-art.
3 Deep learning model
In this section, we present the deep learning model used for this study, namely the LSTM networks. The other model considered is not described here. Further information on standard feedforward neural networks / MLP can be found elsewhere (see, for example, Bishop (2006), Hastie et al. (2013) and Rumelhart et al. (1986), the latter for the training process using the backpropagation algorithm). Then we discuss the potential of LSTMs for yield forecasting.
3.1 Long short-term memory networks
The LSTM architecture was first introduced by Hochreiter and Schmidhuber (1997), and subsequently adapted by other researchers (Gers et al., 1999; Gers and Schmidhuber, 2000; Graves and Schmidhuber, 2005). The LSTM model is a type of recurrent neural network, having a structure that includes a clever feedback loop mechanism delayed in time, and this structure can be “unrolled” in time. At each time step, the LSTM cell has a structure which is substantially more complex than a standard RNN, incorporating four complete neural networks in each of those cells (also called memory cells). In Figure 1 a simplified diagram of the unrolled chain-type structure is presented, identifying the main components of a memory cell. A more detailed representation of an LSTM cell is presented in Figure 2.
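As a rough illustration of the four networks inside a memory cell, the sketch below runs a single time step of a standard LSTM cell in plain Python. It follows the widely used formulation with forget, input and output gates plus an input node. The function names, the parameter layout (`params`) and the toy weights are assumptions made for illustration only, not the implementation used in this study.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, W_h, b, x, h_prev):
    """Compute W_x x + W_h h_prev + b for one of the four internal networks."""
    out = []
    for row_x, row_h, bias in zip(W_x, W_h, b):
        total = bias
        total += sum(w * v for w, v in zip(row_x, x))
        total += sum(w * v for w, v in zip(row_h, h_prev))
        out.append(total)
    return out


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell (hypothetical parameter layout).

    params maps each of "f", "i", "g", "o" to a tuple (W_x, W_h, b).
    Returns the new hidden state, the new cell state and the gate signals.
    """
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # input node
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    # Cell state: keep part of the old memory, add part of the new candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Hidden state: squashed cell state, filtered by the output gate.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c, {"f": f, "i": i, "g": g, "o": o}


# Toy example: one input feature, two hidden units, all weights set to 0.5.
n_in, n_hid = 1, 2
toy = ([[0.5] * n_in] * n_hid, [[0.5] * n_hid] * n_hid, [0.0] * n_hid)
params = {name: toy for name in ("f", "i", "g", "o")}

h, c = [0.0] * n_hid, [0.0] * n_hid
for x_t in [0.1, 0.2, 0.3]:  # a short univariate input sequence
    h, c, gates = lstm_step([x_t], h, c, params)
```

Returning the gate activations together with the two states is a design choice of this sketch: it exposes the kind of internal signals (hidden state, cell state and gates) that can later be inspected rather than only the final prediction.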
The corresponding equations that govern the modern LSTM model can be expressed in the following form, with Figure 2 showing where these operations occur in the LSTM cell (Hochreiter and Schmidhuber, 1997; Lipton et al., 2015; Goodfellow et al., 2016):
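$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \tag{1}$$
$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \tag{2}$$
$$g_t = \tanh(W_{gx} x_t + W_{gh} h_{t-1} + b_g) \tag{3}$$
$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \tag{4}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
where $f_t$ is the function for the forget gate; $i_t$ and $g_t$ are the functions for the input gate and for the input node, respectively; $o_t$ the function for the output gate; $c_t$ and $c_{t-1}$ the cell state (also called internal state) at time steps $t$ and $t-1$; $h_t$ and $h_{t-1}$ the hidden state at time steps $t$ and $t-1$; $x_t$ is the input vector at time step $t$; $W$ are the weight matrices, with $W_{fx}$ as an example, representing the weight matrix for the connection input-to-forget gate (indices indicating to-from connections); $b$ are the bias vectors; $\sigma$ is the logistic sigmoid activation function; $\tanh$ the hyperbolic tangent activation function; and $\odot$ represents the Hadamard product (i.e. element-wise multiplication).
3.2 LSTM advantages, limitations and potential for yield forecasting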
After describing the architecture of the LSTMs we now identify the advantages, limitations and potential for yield forecasting. First, the main advantage of the LSTM model is related to the reason why it was developed in the first place. The RNNs could not capture the long-term dependencies due to the vanishing or exploding gradients problem first identified by Hochreiter (1991) in a diploma thesis (Hochreiter et al., 2001; Schmidhuber, 2015), and in parallel research work by Bengio et al. (1993, 1994). In fact, this memory capability is one of the characteristics that clearly separates this type of model from the standard neural networks, in particular the MLPs. Given the stateless condition of MLP models they only learn fixed function approximations.
For modelling financial time series, it seems probable that long-term dependencies are important. Even though the last available value of the series is the one collecting all the available information in the market up to the present moment, inversion points tend to follow certain patterns, frequently exploited by technical analysts who look essentially at chart data. A model with memory and capable of learning long-term dependencies may be beneficial for this reason.
Second, in sequence prediction problems, the sequence imposes an order on the data that must be respected when training and forecasting, i.e. the order of the observations is important for the modelling process. This is the case with financial time series. However, in feedforward neural networks / MLPs, the modelling of time series’ temporal structure is only done indirectly through the consideration of multiple time steps as different input features. Although with this method previous values are included in the regression problem, the natural “sequence” or structure of the time series is not really present in the modelling process and the model does not have any knowledge of it. The LSTMs are the most effective models for sequence learning, modelling these time sequences directly. Additionally, they can input and output sequences time step by time step, enabling variable length inputs and/or outputs. With this property, they overcome one of the main limitations of standard feedforward neural networks.
Third and last, a model for financial time series should be able to perform multi-asset forecasting, for the prediction of several targets simultaneously, and it would be desirable to perform multi-step forecasting, to consider several forecasting horizons into the future. The LSTM type of model is capable of dealing with multivariate problems and also with multi-step prediction using sequence-to-sequence architectures, thus fulfilling these requisites naturally.
On the limitation side, some time series forecasting problems are technically simpler, not requiring the characteristics of a recurrent type of model. This is the case in particular when the most relevant data for making the prediction is within a small window of recent historic values. Here, the capability to deal with long-term dependencies and the model “memory” are clearly not necessary. In this type of situation, MLPs and even linear models may outperform the LSTM pure-autoregressive univariate model, with lower complexity (Gers et al., 2002; Brownlee, 2018).
Overall, given the additional complexity of the model, it should only be used when the problem at hand is better modelled by this type of neural network architecture. This is reflected in two main conditions: the data is sequential and long-term dependencies may help the forecasting process. Nonetheless, LSTM networks’ potential for yield forecasting seems evident. Despite being predominantly used for non-financial applications, their characteristics make them potentially suitable for financial time series predictions.
4 LSTM networks for bond yield forecasting
Now that we have discussed the LSTM model and its potential for yield forecasting, we move on to the empirical work carried out. In this section, we describe the choices made on the dataset used, identifying the target, features and pre-modelling operations (including the generation of additional features for the MLP model, the train-test split and normalisation).
4.1 Data
Given the interconnectedness and mutual influence of various asset classes in the markets, we consider a large number of features from financial markets. These are selected from government bond markets and from related classes and indicators: credit (corporate bonds), equities, currencies, commodities and volatility. Additional features are added, which are calculated from the previously mentioned features, specifically, bond spreads, slope of the yield curve and simple technical analysis indicators. Furthermore, economic variables are also very important, as clearly exemplified by the well-established yields-macro models such as the Dynamic Nelson–Siegel model (Diebold and Li, 2006). Hence, a vast range of economic indicators are also included, from different geographic locations. The complete list includes 159 features and, because it is so extensive, it is stored and made publicly available (Nunes, 2020). The target chosen for this study is the generic rate of the 10-year Euro government bond yield.
The dataset was obtained from the Bloomberg database (Bloomberg, 2017) and covers the period from January 1999 to April 2017, giving 4779 time points of data. January 1999 is the starting date for most time series of the Euro benchmarks and the total period covers several bull and bear markets. Regarding data frequency, daily closing values were selected, which are easily available for financial assets in general.
As mentioned previously in Section 3.2, in the case of feedforward neural networks / MLPs, the modelling of time series’ temporal structure is done indirectly through the consideration of multiple time steps as different input features. Contrary to the MLP work, there is no need to generate additional features for the LSTM network, since previous time steps are given directly to the model in the form of an input sequence. Hence, for the MLPs, new features are generated from the original ones, corresponding to lagged values of the respective time series. In our research, six time steps are considered, based on previous studies (Mahler, 2009). The selection of the most relevant features is carried out using Lasso (standing for Least Absolute Shrinkage and Selection Operator) regression proposed by Tibshirani (1996).
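The generation of lagged features for the MLP can be sketched as follows. This is a minimal illustration in plain Python: the function name `add_lags` and the toy values are hypothetical, and the convention that six time steps means the current value plus five lags is an assumption, chosen because it is consistent with the 159 original and 795 generated features listed in Table 1. The subsequent Lasso selection step is not shown.

```python
def add_lags(series, n_steps=6):
    """Turn one time series into rows of [x_t, x_{t-1}, ..., x_{t-n_steps+1}].

    The first n_steps - 1 observations are dropped because their lags
    are not available.
    """
    rows = []
    for t in range(n_steps - 1, len(series)):
        rows.append([series[t - lag] for lag in range(n_steps)])
    return rows


# Toy yield series (hypothetical values, in percent).
yields = [4.0, 4.1, 4.3, 4.2, 4.4, 4.5, 4.7, 4.6]
rows = add_lags(yields, n_steps=6)

# Each row holds the current value followed by its five lagged values.
print(rows[0])  # [4.5, 4.4, 4.2, 4.3, 4.1, 4.0]

# Feature count under this convention: each original feature gains
# n_steps - 1 lagged copies.
n_original = 159
n_generated = n_original * (6 - 1)
print(n_generated)  # 795
```

For the LSTM no such expansion is needed, since the same six observations would be passed as one ordered input sequence rather than as six unrelated columns.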
As is common, we divide the data into two groups, for training and testing the models. In this case, a 70% / 30% split is considered. This refers to the overall static training and testing datasets. The static training dataset is considered for tuning hyperparameters, by subdividing it into new training and validation datasets, ensuring that results are quoted on totally unseen (test) portions of data. When forecasting time series points from the static testing dataset, the actual training data is a dynamic moving window spanning up to the present moment for the time step being considered.
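The static split and the dynamic moving window can be illustrated with a few lines of index arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, the rounding of the 70% boundary is a choice made here, and the window is assumed to hold at most 3000 observations ending just before the time step being forecast (the window size reported in Table 1).

```python
def static_split(n_points, train_fraction=0.7):
    """Return index ranges for the static training and testing datasets."""
    n_train = int(n_points * train_fraction)
    return range(0, n_train), range(n_train, n_points)


def moving_window(t, window_size=3000):
    """Indices of the training window used to forecast time step t.

    The window ends at t - 1 (the last observation known at forecast time)
    and holds at most window_size observations.
    """
    start = max(0, t - window_size)
    return range(start, t)


train_idx, test_idx = static_split(4779)
print(len(train_idx), len(test_idx))  # 3345 1434

# Training window for the first test time step; for later test steps
# the window slides forward by one observation at a time.
first_test_step = test_idx[0]
window = moving_window(first_test_step)
print(window[0], window[-1])  # 345 3344
```

Because the window never extends past the step being forecast, no information from the test period leaks into the data the model is fitted on.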
Finally, all data is normalised by subtracting the mean and dividing by the standard deviation of the training dataset (the dynamic moving window). This is also essential, given the wide range of features we are considering, which have very different scales in some cases. For example, the 10-year yield is quoted in percentage, usually staying below 6%, while several equity indices reach values well above 10,000 (Dow Jones, Nikkei 225 and Hang Seng).
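A minimal sketch of this normalisation is given below, using only the Python standard library. The helper names and toy values are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption. The key point is that the statistics are estimated on the training window only and then reused for new observations.

```python
import statistics


def fit_scaler(train_values):
    """Mean and standard deviation estimated on the training window only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)  # assumption: population std
    return mean, std


def normalise(values, mean, std):
    """Apply z-score normalisation with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Two toy features on very different scales (hypothetical values).
yield_window = [2.0, 4.0, 6.0]             # yield, in percent
index_window = [9000.0, 10000.0, 11000.0]  # equity index level

mean_y, std_y = fit_scaler(yield_window)
mean_i, std_i = fit_scaler(index_window)

z_yield = normalise(yield_window, mean_y, std_y)
z_index = normalise(index_window, mean_i, std_i)

# A new observation is scaled with the training-window statistics.
z_new = normalise([5.0], mean_y, std_y)
```

After scaling, both toy features take the same normalised values, which is the effect sought when combining yields with much larger equity index levels.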
4.2 Methodology
In this section, the empirical work carried out is identified and explained: first, we compare directly the univariate MLP and LSTM models; then, we further assess the LSTM potential for yield forecasting using different input sequences. Moreover, the models considered are specified and additional aspects of the methodology adopted are described, in particular in what concerns moving windows, retraining of models and cross-validation. A summary of the models used and additional model information is presented in Table 1.
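Table 1: Summary of the models used and additional model information.

| Model information | Value |
| --- | --- |
| Original features | 159 |
| Generated features | 795 |
| Target | 10-year yield |
| Forecasting horizons | 0 (next day), 5, 10, 15, 20 |
| Moving window size | 3000 days |
| Hidden units MLP | 10 |
| Hidden units LSTM | 100 |
| Time steps MLP | 6 days |
| Time steps LSTM | 6, 21, 61 days |

| Study | Model | Short name | Description |
| --- | --- | --- | --- |
| Direct comparison MLP vs. LSTM | MLP | NN TgtOnly | MLP with target data only |
| Direct comparison MLP vs. LSTM | LSTM | LSTM06 | LSTM using input sequence of 6 time steps |
| LSTMs with different input sequences | MLP | NN RelFeat | MLP with relevant features |
| LSTMs with different input sequences | MLP | NN TgtOnly | MLP with target data only |
| LSTMs with different input sequences | LSTM | LSTM06 | LSTM using input sequence of 6 time steps |
| LSTMs with different input sequences | LSTM | LSTM21 | LSTM using input sequence of 21 time steps |
| LSTMs with different input sequences | LSTM | LSTM61 | LSTM using input sequence of 61 time steps |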
For a streamlined approach to the model, we apply LSTM networks to a univariate type of problem, i.e. the model has only one feature, corresponding to past values of the target we want to predict. This is justified by the fact that we want to be able to assess the LSTM model potential, so we prefer to make the comparison with MLPs in its most pure condition. In other words, we prefer to perform the comparison without introducing additional features in the LSTM model, which would introduce an extra level of complexity.
When forecasting for longer forecasting horizons beyond one step ahead, there are two methods that can be used: direct or iterative forecasting. On the one hand, in the direct forecast only current and past data is used to forecast directly the time step required, using a horizon-specific model. On the other hand, in the iterative forecast a one step ahead model is iterated forward until the target forecasting horizon is reached. All the models use direct forecasting of targets given that, with financial time series, the prediction errors tend to propagate fast if we were to make iterative predictions. As a result, new neural networks are trained for each forecasting horizon.
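To make the direct approach concrete, the sketch below builds the training pairs that a horizon-specific model would be fitted on. It is an illustration only: the function name `direct_pairs` is hypothetical, and the indexing assumes that a horizon of 0 denotes the next day and a horizon of h denotes h days after that, following the "(next day plus) h days" wording used in this paper.

```python
def direct_pairs(series, n_steps, horizon):
    """Build (input sequence, target) pairs for a horizon-specific model.

    horizon = 0 means the next day; horizon = h means h days after that.
    """
    pairs = []
    offset = 1 + horizon  # distance from the last input value to the target
    for end in range(n_steps - 1, len(series) - offset):
        inputs = series[end - n_steps + 1 : end + 1]
        target = series[end + offset]
        pairs.append((inputs, target))
    return pairs


series = list(range(20))  # toy series 0, 1, ..., 19
for h in (0, 5):
    inputs, target = direct_pairs(series, n_steps=6, horizon=h)[0]
    print(h, inputs, target)
# 0 [0, 1, 2, 3, 4, 5] 6
# 5 [0, 1, 2, 3, 4, 5] 11
```

Each horizon yields its own set of pairs and therefore its own trained network, whereas an iterative scheme would reuse the horizon 0 model and feed its predictions back as inputs.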
For the direct comparison between the univariate MLP and the univariate LSTM, an LSTM is selected that most closely simulates the conditions applied to the MLP neural network, i.e. both considering 6 time steps, based on previous studies (Mahler, 2009). The corresponding models are (Table 1): Model “NN TgtOnly”, meaning neural network using only the target variable as feature; and Model “LSTM06”. To further assess the potential of LSTMs for yield forecasting, we extend the number of models on both sides. On the LSTM side, we consider three different input sequences in total: 6, 21 and 61 time steps, with the last two corresponding to approximately 1 and 3 calendar months. The selected numbers of time steps also follow the structure “next day” plus desired period, i.e. 1+20, 1+60, which will be useful later in this work when using sequence-to-sequence LSTM architectures. On the MLP side, we add to the univariate model the MLP using the most relevant features selected for the 10-year bond yield target and individually for each forecasting horizon (Model “NN RelFeat”). Besides, the comparisons are carried out for all forecasting horizons considered, i.e. next day and (next day plus) 5, 10, 15 and 20 days ahead.
Both MLP and LSTM models are trained using moving windows and retraining of models at every time step. This technique is feasible in real time and is used to take full advantage of the models.
Throughout this study, the main metric used is the mean squared error (MSE), which is commonly used for this purpose. The results in Section 4.3 are presented in the normalised version, i.e. calculated directly from the real and predicted normalised yields (Section 4.1), since the non-normalised equivalents are scale dependent. Hence, the normalised metric is used to facilitate the comparison of models.
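For reference, the metric and the relative comparison between two models can be written as below. This is a sketch with hypothetical function names and toy values; in particular, expressing a model's improvement as a percentage reduction relative to a baseline MSE is an assumption about how such comparisons are computed, not a description of the exact procedure used in this study.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted (normalised) values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mse_reduction(mse_model, mse_baseline):
    """Percentage MSE reduction of a model relative to a baseline."""
    return 100.0 * (mse_baseline - mse_model) / mse_baseline


# Toy normalised yields and two sets of forecasts (hypothetical values).
actual = [0.0, 0.0, 0.0, 0.0]
baseline_forecast = [1.0, 1.0, 1.0, 1.0]
model_forecast = [1.0, 1.0, 0.0, 0.0]

mse_baseline = mse(actual, baseline_forecast)
mse_model = mse(actual, model_forecast)
print(mse_baseline, mse_model)                 # 1.0 0.5
print(mse_reduction(mse_model, mse_baseline))  # 50.0
```

A negative value returned by `mse_reduction` would indicate that the model performs worse than the baseline.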
In terms of the number of hidden units for the LSTM, this is set to 100 units (Table 1). A higher number would require additional computational time and this is a good compromise between speed and accuracy. This number is also compatible with the maximum number of relevant variables considered in the MLP study for modelling the longest forecasting horizons. Regarding the MLP number of hidden units, the main conclusion from the hyperparameter tuning is that 10 hidden units is a good compromise, with significant overfitting observed for neural networks with more than 100 units.
4.3 Results and discussion
In this section, the main results are presented for both studies: the direct comparison MLP vs. LSTM and the assessment of LSTM potential using different input sequences. This is followed by a discussion of its implications for the present research.
Starting with the first study, the direct comparison of models, the results obtained are summarised in Figure 3. As can be seen, the results obtained with the LSTM model are better for forecasting horizons of 5 and 10 days, achieving MSE reductions of 25% and 14%, respectively (median values), compared to the univariate MLP. When the horizon is too large (20 days) the advantage of the LSTM is lost. In all other cases, the results are not significantly different.
Another aspect that can be observed from the results is that the standard deviation obtained with LSTMs is lower for all horizons analysed. This is also an important outcome, giving indications of higher stability of this type of model when compared with the more traditional MLP model.
Regarding the second study, LSTMs using different input sequences, the results are shown in Figure 4. When forecasting the next day (Figure 3a), the results obtained from all models are similar. This situation is equivalent to that reported in Nunes et al. (2019) for next day forecasting with models including multivariate linear regression and a variety of MLP-type models. One striking advantage revealed by the LSTM is again the lower standard deviation in all LSTM models when compared to both MLPs.
However, it is when we forecast more distant time steps that the benefits of LSTMs become more evident. Considering the forecasting horizon of (next day plus) 5 days, as shown in Figure 3b, the LSTMs with input sequence lengths of 6 and 21 time steps (Models “LSTM06” and “LSTM21”, respectively) produce results that significantly outperform both MLP models, with lower errors and much lower standard deviations. When compared to the univariate MLP, these models achieve MSE reductions of 25% and 34%, respectively (median values). On the other hand, the LSTM with an input sequence of 61 days (Model “LSTM61”) does not produce a reduction in error, generating similar median results, although with significantly lower standard deviation. Therefore, it appears that forecasting 5 days ahead does not require such a long input sequence of 60 days.
When we consider forecasting horizons of 10 and 15 days (Figures 3c and 3d), all LSTMs tend to perform better than the univariate MLP model (Model “NN TgtOnly”), with lower standard deviations. The MSE reductions achieved range from 2% to 47%. Another important observation is that the LSTMs with longer input sequences are able to reach similar levels of forecasting error to the MLP with the most relevant features (Model “NN RelFeat”), again with lower standard deviation. In fact, the LSTM appears capable of compensating, at least partially, for the lack of additional market information by using additional memory via longer input sequences. This is a promising result for future work.
Finally, the differences are not so clear when we consider forecasting horizon equal to (next day plus) 20 days (Figure 3e). In this case, the LSTMs with longer input sequences (Models “LSTM21” and “LSTM61”) perform better than the univariate MLP (MSE reductions of 17% and 19%, respectively), with slightly lower standard deviation. However, the shorter sequence LSTM (Model “LSTM06”) produces slightly worse median value (MSE increase of 11%), although with lower standard deviation. A possible explanation for this may be that the input sequence length of 6 time steps is already insufficient for forecasting 20 days ahead. Thus, either additional features or longer input sequences are required.
These results suggest that the LSTM architecture needs to take into account the specific problem we are trying to solve and the forecasting horizon we aim to predict. Additionally, the results of these comparisons, carried out over a wide range of input sequences and forecasting horizons, suggest that, under some conditions, the structure in the data is better captured by having models with time delays in them (i.e. LSTMs), which can strike a balance between the use of immediate and distant past data. However, the advantage of one class of models over others is not universal, as indicated by the no free lunch theorem (Wolpert and Macready, 1997). In the next section, we probe the LSTM further, opening the black box, to see if the representations learned in its internal states are interpretable in any way. Such explanations of what black-box models learn are a popular topic of interest for certification and litigation purposes.
5 Opening the LSTM black box: signals analysis
Neural network methods in general and the LSTM in particular are considered black box type of models. They establish functional relationships between inputs and outputs, but one cannot extract interpretable information from the model itself. In this section we present our contribution to demystify this complex deep learning model, by analysing the signals inside the LSTM memory cell and extracting relevant information. In turn, this is subsequently used to identify relevant explanatory features for this problem (Section 6). This section also includes a brief explanation of the LSTM memory cell, its states and gates, since they are indispensable to understand the methodology used.
5.1 Memory cell: states and gates
We will now explain how an LSTM works, the main components of the cell (Figure 2) and the complete algorithm (Equations 1 to 6). In each LSTM cell the flow of information is controlled by three gates, namely the forget, input and output gates. The operations performed at cell level are schematically presented in Figure 5. All calculations at each gate depend on the current inputs (at time step t) and the previous hidden state (at time step t−1). The output from the cell depends on those two variables plus the cell state at time step t.
Forget gate
The forget gate defines which information to remove or “forget” from the cell state. For this purpose, the forget gate has a neural network with a logistic sigmoid activation function ranging from 0 to 1 (Figure 4a). The extremes of that interval correspond to keeping the information (1) or completely removing it (0). The mathematical operations performed at this gate are represented by Equation 1.
Input gate and input node
The input gate specifies which information to add to the previous cell state. This part of the cell comprises two elements, as shown in Figure 4b. The first component is the input gate, where the inputs (hidden state at time step t−1 and inputs at time step t) go through a neural network with a logistic sigmoid activation function (corresponds to node i and function σ). The second component is the input node (so named to differentiate it from a “gate”) and represents the new “candidates” that could be added to the cell state. These are generated through a neural network with a hyperbolic tangent activation function (corresponds to node g and function tanh). The corresponding operations carried out in this section of the cell are represented by Equations 2 and 3.
Cell state update
Output gate and hidden state update
The output gate defines the information from the cell state that will be used as output of the memory cell for the present time step. This gate has the fourth neural network of the LSTM cell, with a logistic sigmoid activation function (Figure 4d). The operations applied in this gate are represented by Equation 4. Finally, the actual output from the cell results from the hidden state update, which is computed using Equation 6, taking into account the results of the output gate and the present cell state, pushed through a hyperbolic tangent function (other activation functions may be used).
Cell and hidden states
The information flows through the gates described above, and they are important to justify what happens in the states. However, all the information is transmitted to the following time step through the states. In this sense, the signals from the states summarise all information. The main concept behind the cell state is that it represents the long-term memory of the model, while the hidden state corresponds to the short-term memory.
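The operations described in this subsection can be sketched as a single forward step of a standard LSTM cell in NumPy. This is a minimal illustration under the usual textbook formulation, assumed here to correspond to Equations 1 to 6; the function `lstm_step`, the weight dictionaries `W`, `U`, `b` and the random initialisation are assumptions made for the example and do not represent the trained model.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell.

    W, U and b are dicts keyed by 'f' (forget gate), 'i' (input gate),
    'g' (input node) and 'o' (output gate). Names are hypothetical.
    """
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # input node (candidates)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * g   # cell state update (long-term memory)
    h = o * np.tanh(c)       # hidden state update (short-term memory)
    return h, c, {"f": f, "i": i, "g": g, "o": o}


# Illustrative setup: one input feature, three hidden units, random weights.
rng = np.random.default_rng(0)
n_in, n_hidden = 1, 3
W = {k: rng.normal(scale=0.5, size=(n_hidden, n_in)) for k in "figo"}
U = {k: rng.normal(scale=0.5, size=(n_hidden, n_hidden)) for k in "figo"}
b = {k: np.zeros(n_hidden) for k in "figo"}

h, c, gates = lstm_step(np.array([0.04]), np.zeros(n_hidden), np.zeros(n_hidden), W, U, b)
print("hidden state:", h)
print("cell state:", c)
```

Because the gates are outputs of a sigmoid they lie strictly between 0 and 1, whereas the cell state is an accumulated sum and is not bounded in the same way.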
5.2 Methodology
The analysis of signals in the LSTM states and gates is conducted at different levels of the memory cell, specifically: forget gate (Equation 1); product of the outputs from the input gate and input node (Equations 2 and 3); output gate (Equation 4); cell state (Equation 5); and hidden state (Equation 6). The product of the outputs from the input gate and input node is chosen instead of the individual outputs. The main reason is that the product is what is added to update the previous cell state, and therefore carries the most relevant and interpretable information. Using the equations referred to above, the signals are calculated in each of those locations, at every time step, and for each individual unit of the LSTM memory cell.
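As an illustration of this procedure, the sketch below runs a standard LSTM over an input sequence and records the signal at each of the five locations, per time step and per unit. It assumes the usual textbook cell equations and uses random weights; `trace_lstm_signals`, the stacked weight layout and the dictionary keys are hypothetical names introduced for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def trace_lstm_signals(X, W, U, b):
    """Run an LSTM over X (T, n_in) and record signals at five cell locations.

    W: (4*n_hidden, n_in), U: (4*n_hidden, n_hidden), b: (4*n_hidden,),
    stacked in the order forget gate, input gate, input node, output gate.
    Returns a dict of arrays, each of shape (T, n_hidden).
    """
    n_hidden = U.shape[1]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    names = ("forget_gate", "input_product", "output_gate", "cell_state", "hidden_state")
    signals = {name: [] for name in names}
    for x_t in X:
        z = W @ x_t + U @ h + b
        f, i, g, o = np.split(z, 4)
        f, i, g, o = sigmoid(f), sigmoid(i), np.tanh(g), sigmoid(o)
        c = f * c + i * g      # the product i * g is what is added to the cell state
        h = o * np.tanh(c)
        for name, value in zip(names, (f, i * g, o, c, h)):
            signals[name].append(value)
    return {name: np.array(values) for name, values in signals.items()}


# Illustrative run: 6 time steps, 1 feature, 3 hidden units, random weights.
rng = np.random.default_rng(1)
n_in, n_hidden, T = 1, 3, 6
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_in))
U = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
X = rng.normal(size=(T, n_in))

signals = trace_lstm_signals(X, W, U, b)
for name, values in signals.items():
    print(name, values.shape)
```

Each returned array has one column per hidden unit, so a column can be plotted directly against the yield series for visual inspection.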
Regarding features, three different cases are analysed to examine whether the behaviour found is consistent under different conditions. We use a univariate LSTM model (feature set 1) and multivariate LSTM models (feature sets 2 and 3). The 10-year bond yield is a feature in all sets considered. In the second set we add a technical momentum indicator developed by Merrill Lynch, now Bank of America Merrill Lynch (Garman, 2001). This indicator is based on the concept of “reversion to the mean” and assumes a normal distribution of the deviations from a short-term average of 30 working days. The third feature set includes the 10-year yield and the two closest benchmarks in the yield curve, the 5-year and 30-year yields. Yields with maturities adjacent to the one we want to forecast were found to be relevant features in previous work (Nunes et al., 2019). A summary of the LSTM model used in this study is presented in Table 2.

Table 2: LSTM architecture

Features             set 1   10-year yield
                     set 2   10-year yield, momentum indicator
                     set 3   10-year, 5-year, 30-year yield
Target               10-year yield
Forecasting horizon  next day + 5 days
Moving window size   3000 days
Hidden units         3
Sequence in          6 days
Sequence out         6 days
Additional options for the signals analysis are justified below. First, for this study we select a forecasting horizon of next day + 5 days. The results presented in Section 4.3 show that, for this horizon, there is already a significant differentiation between MLP and LSTM models. Second, the number of hidden units in the LSTM is reduced to three for interpretability purposes. Third and last, sequence-to-sequence architectures are used, with both input and output sequences equal to 6 days. On the one hand, the input sequence of 6 days is in line with the option adopted in Section 4.2. On the other hand, the output sequence of 6 days ensures that the last value of the output sequence corresponds to the forecasting horizon of next day + 5 days that we aim to predict.
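The alignment between the 6-day input and output sequences can be illustrated with a short windowing sketch. The helper `make_seq2seq_windows` and the integer stand-in series are hypothetical; the sketch only shows how the last element of each output sequence lands on next day + 5 days, and it does not reproduce the moving training window.

```python
import numpy as np


def make_seq2seq_windows(series, seq_in=6, seq_out=6):
    """Build paired input/output windows from a 1-D series.

    Row k of X holds series[k : k + seq_in]; row k of Y holds the
    seq_out values that immediately follow it.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - seq_in - seq_out + 1
    X = np.stack([series[k : k + seq_in] for k in range(n)])
    Y = np.stack([series[k + seq_in : k + seq_in + seq_out] for k in range(n)])
    return X, Y


# Stand-in series whose values equal their time index, to make the alignment visible.
yields = np.arange(20, dtype=float)
X, Y = make_seq2seq_windows(yields)

t = int(X[0, -1])                      # last observed day in the first window
print("last input day:", t)
print("output days:", Y[0].astype(int))
print("final output day:", int(Y[0, -1]), "= (t + 1) + 5 =", t + 1 + 5)
```

With the last observation at day t, the output sequence covers days t+1 to t+6, so its final element is the next day plus 5 days.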
5.3 Results and discussion
The main results for the different feature sets are presented in this section, starting with the univariate model (feature set 1). The signals for both hidden and cell states are presented in Figure 6. These signals are plotted against the 10-year yield, for reference and to facilitate interpretation.
From the results, we can observe some similarity between the signals of the hidden and cell states. It is worth emphasising that the hidden state at time step t is calculated using the cell state at the same time step and the result of the output gate, using Equation 6. Consequently, some similarity between those signals may be expected.
A more remarkable property shown in the signals is that unit 1 becomes almost inactive, both in terms of volatility and weight, during two different periods, identified as periods 1 and 2 in Figure 6. During those periods, while unit 1 tends to a zero weight, units 2 and 3 take over, becoming more active and following more closely the volatility of the 10-year bond yield. The periods mentioned are naturally different both in terms of duration and occurrence in time. However, they both correspond to periods in which the 10-year yield assumes downward extreme values, more specifically yields of 3.6–3.7% and below. Taking this into account, there is some evidence that some form of specialisation of the units occurs: in this case, upward values are covered by unit 1 and downward extreme values are controlled by units 2 and 3. That is, the units specialise in covering different yield ranges.
As mentioned before, the hidden and cell states are the most important since they summarise all the information going through all the other gates. The signals at the cell gates are helpful to understand and confirm what happens at the level of the states. In this vein, and as an example of this type of confirmation in relation to the cell state, we present the signals at the following two locations in the cell: the product input gate × input node, and the forget gate (Figure 7).
We observed that unit 1 becomes almost inactive, tending to a zero weight during periods 1 and 2 (Figure 6). This behaviour is better understood and justified at gate level. Indeed, at the input gate × input node location, the resulting signal adds approximately zero for unit 1 to the previous cell state (Figure 6a). At the same time, the forget gate increases the amount to forget for unit 1 (via lower weights) during the same periods (Figure 6b), while decreasing it for the other two units (via higher weights).
As to the hidden state, the above conclusions are combined with the result from the output gate, which shows a marked decline in the weights for unit 1, with opposite movements for units 2 and 3 (Figure 8).
Moving to feature sets 2 and 3, although the behaviour is distinct in each case, the same type of activation / deactivation of units can be identified. For conciseness, an exemplification of the results obtained for the hidden state for both feature sets is provided in Figure 9. For set 2 (Figure 8a), despite the high volatility of the second feature (momentum indicator, Table 2), the same two periods can clearly be observed as described for feature set 1 (Figure 5a). To note, in this case it is unit 3 that assumes the role of previous unit 1. This switch has no relevance since they are all equal units at the start of the learning process. Another interesting observation (not presented here for succinctness), is the fact that given the high volatility of the momentum indicator, two of the units tend to “specialise” in this feature and only one unit in the 10year yield.
For feature set 3 (Figure 8b), the pattern of the signals in relation to the 10year yield data is much looser. One of the reasons that may justify this behaviour is the higher correlation among all yields considered as features (5, 10 and 30year yield). As a result, there is a lower level of dependency on only one of them. Similarly to what was found in the previous feature sets, we can identify some form of yield range specialisation of the units. In this case, unit 3 seems to cover a yield range above 4.5% approximately, becoming much less active subsequently (Figure 8b).
Overall, the most remarkable property found consistently in the LSTM signals, for all feature sets, is the activation / deactivation of units during learning. It can be characterised by an alternation of periods of weight close to zero or low variability, with no significant change in the states, with periods where the units become highly active giving higher contributions to the forecasting process. Furthermore, we found evidence that the LSTM units may specialise in different yield ranges or features considered.
6 New LSTMLagLasso method
Our intention here is to interpret the representations learned by the LSTM model, whose estimation is driven by purely statistical considerations of the error being minimized, in terms of external effects that might have an influence on the bond yields. Should the model extract meaningful features during the learning process, the representations would correlate with exogenous information that was not available to the learning algorithm. In other words, it is precisely because the representations may match such exogenous information that the predictive ability of the blackbox models is good.
6.1 Methodology
In this section, we briefly present the Lasso, the KalmanLagLasso and then evolve to introduce our methodology, the LSTMLagLasso.
Lasso
The Lasso regression we formulate to explain the signals within the LSTM (features extracted by the LSTM) in terms of exogenous variables has the form (Tibshirani, 1996):

$$\hat{\beta} = \arg\min_{\beta} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\} \tag{7}$$

where $X$ is the matrix of features and respective lags; $\beta$ the vector of unknown parameters; $y$ are the LSTM cell and hidden state signals (target vectors); $\lambda$ is the regularisation parameter; $\lVert\cdot\rVert_2$ denotes the $L_2$ norm; and $\lVert\cdot\rVert_1$ the $L_1$ norm.
The Lasso regression determines the parameters of the model by minimising the sum of squared residuals, using an $L_1$-norm penalty for the weights. Due to the type of constraint, it tends to lead to sparse solutions, i.e. some coefficients are exactly zero and, as a result, the corresponding features are discarded. This is particularly important since it enables a continuous type of feature selection through the tuning of the regularisation parameter $\lambda$, and the identification of the most relevant features for the model.
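The sparsity effect of the penalty can be illustrated with a small NumPy sketch. It solves a Lasso problem of the form "sum of squared residuals plus λ times the L1 norm of the coefficients" by cyclic coordinate descent with soft-thresholding, a standard solver swapped in here for illustration (it is not the LARS-based procedure mentioned later). The function names, the synthetic data and the λ values are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(a, k):
    """Shrink a towards zero by k, returning exactly zero when |a| <= k."""
    return np.sign(a) * np.maximum(np.abs(a) - k, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]        # remove feature j's contribution
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]        # add the updated contribution back
    return beta


# Synthetic data: only features 0 and 3 carry information about the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_beta = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=200)

support = {lam: np.flatnonzero(lasso_cd(X, y, lam)).tolist() for lam in (0.1, 50.0, 2000.0)}
for lam, idx in support.items():
    print(f"lambda = {lam}: non-zero coefficients at {idx}")
```

As λ grows, more coefficients are set exactly to zero, which is the continuous feature-selection behaviour described above; a sufficiently large λ discards every feature.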
KalmanLagLasso
Mahler (2009) introduced the KalmanLagLasso method in a study to predict the monthly changes of the S&P500 index, using macroeconomic and financial variables. The overall procedure included two phases. In the first, Kalman filters (Kalman, 1960; Niranjan, 1996) were used to denoise the explanatory variables and predict the residuals (the part not explained by the model). Then the LagLasso phase is implemented, to determine the most relevant features and respective lag, using the filtered/denoised variables to explain the prediction residuals. The LagLasso method is based on the Lasso (Tibshirani, 1996), implemented via the modified Least Angle Regression (LARS) algorithm (Efron et al., 2004). The algorithm was modified by Mahler (2009) so that only one lag per feature is selected. Once selected, the other lags of the same feature are eliminated from the active set of features so that they cannot be selected again. The KalmanLagLasso is an elegant method of combining a forecasting method with error analysis, seeking to explain the part not explained by the model, i.e. the residuals, with external financial variables.
More recently (Montesdeoca and Niranjan, 2020), and using a similar approach, the KalmanLagLasso method has been used to compare the type of information influencing US stock indices (S&P 500 and Dow Jones Industrial Average) in contrast with that affecting cryptocurrencies (Bitcoin and Ethereum).
LSTMLagLasso
The LSTMLagLasso method is the new methodology we developed to explain the signals extracted from the LSTM states. It is inspired by the KalmanLagLasso method, but with significant modifications in terms of the model used, the target variable to which the LagLasso method is applied, and the methodology to determine the relevant features and respective lags.
When compared to the KalmanLagLasso method, our objective is different and we aim to analyse what information is contained in the signals, in particular, if they can be explained by external variables. As a result of this objective, instead of using a Kalman filter implementation of a linear autoregressive model, we use as the main model nonlinear LSTMs.
Our methodology also diverges in the target used for the LagLasso procedure. While in the KalmanLagLasso the target time series is the Kalman residuals, in our methodology we use the signals extracted from the LSTM states. We model the cell and hidden states independently of each other.
On the LagLasso technique, we also differ from the original paper (Mahler, 2009) in several aspects. First, we do not denoise the features and use Lasso directly with the variables and respective lags. To explain this option, we need to mention that Kalman filters are often applied to sensor data (Park et al., 2019), and it is known that this type of data does not provide exact readings/measurements. This is because sensors introduce their own distortions and their readings are always corrupted by noise (Maybeck, 1979). In this context, the use of a Kalman filter to remove the noise from sensor data is a natural application.
However, in financial markets the concept of noise cannot realistically be applied, in our opinion. Instead, it is the complexity of market dynamics that is responsible for price formation. Indeed, it is the multiplicity of factors influencing the markets that makes the problem extremely complex. Part of those factors are already incorporated in the historic values of the time series, but another important part comes from new information arriving in the markets in real time, together with the new expectations and reactions of market participants to that new information. Ultimately, the interaction of all these factors results in a new equilibrium and a new real-time asset valuation. This concept, applied more directly to market variables, can also be extended to macroeconomic indicators. Our approach therefore seems more appropriate when dealing with financial time series.
The second main difference in our methodology is that we consider all lags selected as relevant in the LSTMLagLasso algorithm and not only one for each feature, as proposed in the KalmanLagLasso procedure (Mahler, 2009). We found no reason to limit the number of lags that may participate in the forecasting process to only one. The LSTMLagLasso methodology is outlined in Algorithm 1.
The LSTMLagLasso method is applied individually to both hidden and cell states, and for each of the three LSTM units. Additional clarification of the options considered is presented below. First, the number of lags is equal to the length of the sequences (input and output) used in the LSTM model in Section 5.2, i.e., six lags per feature (Algorithm 1, Line 1). Second, for the external variables to explain the hidden and cell state signals, we use the same large set of features considered for the MLP model (Section 4.1), a list of 159 macroeconomic and market variables not available to the model during the learning process (Algorithm 1, Line 1). Third, for selection of the regularisation parameter $\lambda$ we considered both the trend in the number of features against $\lambda$ and the forecasting error (Algorithm 1, Line 1). After an initial rapid drop in the number of features, the trend changes significantly, but a clear period of stabilisation cannot be observed. Additionally, for $\lambda$ above 1.0, the error starts increasing more rapidly and the quality of fit deteriorates. For this reason we selected $\lambda$ for the final Lasso run as a good compromise between stabilisation and quality of fit. Figure 10 shows an example of LagLasso prediction of LSTM signals with this option for $\lambda$. Fourth and last, we exclude the 10-year bond yield from the original features, since this information, both the last available value and the respective lags, is already known to the model as the only feature considered in the univariate LSTM model used in Section 4.
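The lag construction and the retention of all selected lags can be sketched as follows. This is a simplified illustration and not Algorithm 1: the features are synthetic, the target is a made-up stand-in for an LSTM state signal, the Lasso is solved by a compact coordinate descent rather than LARS, and the names `build_lag_matrix`, `lasso_cd` and `selected_lags` as well as the value of `lam` are hypothetical.

```python
import numpy as np


def build_lag_matrix(features, n_lags=6):
    """Stack lags 0..n_lags-1 of every column of features (T, n_features).

    Row r of the result corresponds to time t = r + n_lags - 1.
    labels[k] = (feature_index, lag) describes column k.
    """
    T, n_feat = features.shape
    cols, labels = [], []
    for j in range(n_feat):
        for lag in range(n_lags):
            start = n_lags - 1 - lag
            cols.append(features[start : start + T - n_lags + 1, j])
            labels.append((j, lag))
    return np.column_stack(cols), labels


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            resid += X[:, j] * beta[j]
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta


def selected_lags(beta, labels, tol=1e-8):
    """Group every lag with a non-zero coefficient by feature (all lags are kept)."""
    chosen = {}
    for coef, (feat, lag) in zip(beta, labels):
        if abs(coef) > tol:
            chosen.setdefault(feat, []).append(lag)
    return chosen


rng = np.random.default_rng(0)
features = rng.normal(size=(300, 3))            # synthetic exogenous variables
X_lag, labels = build_lag_matrix(features)      # six lags per feature
cols = {lab: X_lag[:, k] for k, lab in enumerate(labels)}

# Made-up target driven by lags 0 and 5 of feature 0 and lag 1 of feature 2.
signal = 1.0 * cols[(0, 0)] + 0.8 * cols[(0, 5)] - 1.2 * cols[(2, 1)]
signal = signal + 0.1 * rng.normal(size=len(signal))

beta = lasso_cd(X_lag, signal, lam=50.0)
print(selected_lags(beta, labels))
```

In this toy setting feature 0 contributes through two different lags; a one-lag-per-feature rule would keep only one of them, whereas retaining all selected lags recovers both.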
6.2 Results and discussion
The application of LSTMLagLasso to the hidden state is presented in Figures 11–13, for hidden units 1 to 3 respectively. The figures show the most relevant variables identified by this methodology. Note that we plot the absolute value of the weights and not the weights directly, for easier visualisation. Since we are using market variables to explain the LSTM signals and not a financial asset target directly, whether the weights are positive or negative is not as relevant as their magnitude, i.e. the relevance of the feature as an explanatory variable.
The corresponding results regarding the application of LSTMLagLasso to the cell state are not presented here, for brevity, but also because the main points that can be extracted from them only reinforce the conclusions for the hidden state. This can also be observed in the Venn diagram shown in Figure 14, where almost 80% of the relevant features identified are common to both hidden and cell state. An additional point worth noticing is that the cell state needs slightly more explanatory variables (three more).
A similar type of comparison is conducted among the different hidden units, to illustrate the common relevant features identified. The diagram is presented in Figure 15. Here we can see that not all top relevant features are common to all hidden units, although 24.7% of them are common to all three units in the hidden state and 21.4% in the cell state.
An additional summary of results for the most influential relevant variables is presented in Table 3. For this purpose, only those with an absolute weight greater than or equal to 0.15 are selected.
Feature name  Weight  

Unit 1  Unit 2  Unit 3  
Hidden state  
ECB Refi Rate  0.150  
3M Euribor Fut 4th  0.177  0.248  
GBR 30Y  0.161  0.053  
DEU Bond Fut 30Y  0.674  0.287  0.410 
EUR Swaps 5Y  0.398  
EUR Swaps 30Y  0.192  0.199  0.315 
FTSE 100  0.056  0.160  0.107 
Gold Futures  0.193  0.012  0.004 
US 2Y10Y Spread  0.068  0.087  0.198 
EUR 10Y30Y Spread  0.264  0.214  
EUR 10Y MA 5 days  0.247  0.082  
EUR 30Y MA 200 days  0.298  
EUR 5Y  0.221  0.202  
EUR 30Y  0.450  0.168  
Cell state  
3M Euribor Fut 4th  0.212  
90day Euro$ Fut 5th  0.203  
GBR 30Y  0.151  0.038  
DEU Bond Fut 30Y  0.723  0.921  0.533 
EUR Swaps 5Y  0.258  
EUR Swaps 30Y  0.220  0.186  
Gold Futures  0.168  0.109  0.051 
EUR 10Y30Y Spread  0.271  0.270  
EUR 10Y MA 5 days  0.525  
EUR 30Y MA 200 days  0.273  
EUR 5Y  0.003  0.172  
EUR 30Y  0.240  
From the results a number of conclusions can be drawn (Figures 11–13 and Table 3). First, the results confirm that the signals can be explained by external sources of information, not available to the models during either training or forecasting. Second, the LSTM signals are complex and require a significant number of explanatory variables. Third, we observe that for many features there are several lags that are relevant for the prediction process. The most important lags are the last two values (t, t−1) and the lag one week before (t−5). The importance of the latter lag is interesting, pointing to some type of weekly seasonality or influence. Given this conclusion that lags are important, selecting only one lag per feature, as proposed in the KalmanLagLasso method (Mahler, 2009), would eliminate this additional information, limiting the forecasting ability.
Fourth and last, using the LSTMLagLasso, some of the most relevant features selected are conventional market/macro variables, but others are less common, non-conventional ones. We use “conventional” here in the sense that they have been used more frequently in modelling financial assets in the past (Nelson and Siegel, 1987; Dunis and Morrison, 2007; ArrietaIbarra and Lobato, 2015), or are more common-sense variables for that purpose.
In the conventional group of relevant features, we can mention those related to central bank reference rates (ECB refinancing rate), macroeconomic indicators of inflation (US Consumer Price Index less food and energy, Eurozone Core Monetary Union Index of Consumer Prices), economic growth / growth expectations (Institute for Supply Management Manufacturing, US Industrial Production, US Capacity Utilisation, ZEW Eurozone Expectation of Economic Growth), and labour market (Eurozone Unemployment).
But the explanatory variables go well beyond that group, with a wide range of relevant features in the non-conventional group. Some of them are specific to the bond market, all of them adding significant information to the previous group. The top relevant feature by weight is the long German government bond future (DEU Bond Fut 30Y), reaching the value of 0.921 for unit 2 of the cell state (Table 3). Note that, contrary to what happens at the gates (output of a sigmoid function between 0 and 1), the weights in the LSTM states are not limited to 1. Also within the futures asset class, we find the 3M Euribor and 90d Euro$ Futures (4th and 5th contracts). These contracts have a horizon of approximately one year ahead, thus incorporating investors’ expectations on the evolution of short-term rates.
In addition, financial instruments with maturities adjacent to the one we want to predict are also included in this group of relevant features, namely: 5 and 30year Euro government bond yield (note that we have excluded the 10year yield from the LSTMLagLasso set of features as mentioned in Section 6.1); 2, 10 and 30year UK government bond yield; and 5, 10 and 30year EUR swap rates. Directly related with yields, we can identify as relevant features intracurve spreads such as US 2–10year spread and EUR 10–30year spread; as well as intercurve spreads, specifically, EURGBR 10year spread and EURJPN 10year spread.
Furthermore, the most relevant features determined via LSTMLagLasso also include the following asset classes and macroeconomic variables: commodities (Gold Futures and Brent Crude Futures); equity indices (Euro Stoxx 50, FTSE100, and S&P500); foreign exchange rates (EURUSD XRate and EURJPY XRate); the ECB Balance Sheet LongTerm Refinancing Operations; OECD Leading Indicators of US, European Area, and Japan; and finally technical analysis indicators (5year, 10year, 30year moving averages of 5, 50 and 200 days).
It is important to emphasise some aspects regarding the latter group of non-conventional relevant features identified using the LSTMLagLasso. First, the 5-year and 30-year maturities are adjacent to the 10-year yield we are studying and tend to lead flattening and steepening movements of the yield curve around the 10-year maturity. In particular, the long German government bond future is a leveraged instrument with very long maturity and duration. Consequently, it is highly price-sensitive and reacts very quickly to market movements. This justifies its position as a top relevant feature. The second aspect worth highlighting is that, contrary to what could be expected, most of the non-conventional features have higher weights than the conventional ones. Besides, the 5-year and 30-year maturities appear more important than the 10-year maturity itself (Table 3). This may be explained by the fact that the 10-year yield is already known to the model. Third, also of note is the inclusion in those relevant features of indicators related to the ECB balance sheet (ECB Balance Sheet LongTerm Refinancing Operations), at a time when central banks have been involved in large-scale asset purchases or quantitative easing, which clearly has an impact on the overall yield levels in the market. Fourth and last, another example is the OECD Leading Indicators in different geographic areas. These indicators are designed with the objective of providing early signals of turning points in economic cycles, so it is interesting to see them identified as relevant features using this methodology.
In summary, the LSTM model captures important data and incorporates the information into its long- and short-term memories. Ultimately, the LSTMLagLasso methodology can also be used for feature selection, given the richness of information contained in the hidden and cell states.
6.3 Strength of results
In this section, we evaluate the strength of results by assessing whether they could be obtained by chance. The hypothesis we want to test is whether the results obtained with real features and with Gaussian random variables could be part of the same distribution. For that purpose, we apply the LSTMLagLasso method replacing the macroeconomic and market features by the same number of Gaussian random variables. The corresponding mean squared errors are then calculated for each experiment.
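The logic of this test can be sketched with synthetic data. In the sketch below an ordinary least-squares fit is used as a simple stand-in for the LSTMLagLasso fit, and the "real" features, the target signal, the sample sizes and the empirical p-value formula are all assumptions made for illustration; only the idea of repeating the fit with Gaussian random features and comparing the resulting error distribution follows the description above.

```python
import numpy as np


def fit_mse(X, y):
    """In-sample MSE of a least-squares fit with intercept (stand-in for the Lasso fit)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((y - A @ coef) ** 2))


rng = np.random.default_rng(42)
n_obs, n_feat, n_runs = 300, 5, 100

# Synthetic "real" features and a target signal that genuinely depends on them.
X_real = rng.normal(size=(n_obs, n_feat))
weights = np.array([1.0, -0.5, 0.8, 0.3, -1.2])
signal = X_real @ weights + 0.1 * rng.normal(size=n_obs)

mse_real = fit_mse(X_real, signal)

# Repeat the fit with the same number of Gaussian random features.
mse_random = np.array(
    [fit_mse(rng.normal(size=(n_obs, n_feat)), signal) for _ in range(n_runs)]
)

# Empirical p-value: share of random runs that do at least as well as the real features.
p_value = (1 + np.sum(mse_random <= mse_real)) / (n_runs + 1)
print(f"MSE with real features: {mse_real:.4f}")
print(f"MSE with random features: min {mse_random.min():.4f}, mean {mse_random.mean():.4f}")
print(f"empirical p-value: {p_value:.4f}")
```

If the error obtained with the real features lies well below the whole distribution obtained with random features, the hypothesis that both come from the same distribution can be rejected.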
The simulation is run one hundred times in order to determine the corresponding probability density function and the results are presented in Figure 16. From these we can safely conclude, with statistical significance, that the results with real features are not obtained by chance.
7 Conclusions and future work
This work has three main components. First, we conduct an application of LSTM networks to the bond market, specifically for forecasting the 10year Euro government bond yield, and compare the results to memoryfree standard feedforward neural networks, in particular MLPs. This is the first study of its kind as can be confirmed by the lack of published literature in this area. To this end, we model the 10year bond yield using univariate LSTMs with different input sequences (6, 21 and 61 time steps), considering five forecasting horizons, the next day as well as further into the future, up to next day plus 20 days. Our objective is to compare those LSTM models with univariate MLPs, as well as MLPs using the most relevant features. These are determined using Lasso regression, for each forecasting horizon. We closely follow the same data and methodology for this comparison. In addition, the use of training moving windows incorporating the most recent information as it becomes available has the advantage of increased flexibility to changing market conditions.
The direct comparison of models in identical conditions shows that, with the LSTM, we can obtain results that are similar or better and with lower standard deviations. In the comparison with the LSTMs using different input sequences, especially for forecasting horizons equal to 10 and 15 days, we observe that the LSTMs with longer input sequences achieve similar levels of forecasting accuracy to the MLP with the most relevant features, with lower standard deviation. In other words, the univariate LSTM model with additional memory is capable of achieving similar results to the multivariate MLP with additional information from markets and the economy. This is a remarkable achievement and a promising result for future work. Furthermore, the results for the univariate LSTM show that shorter forecasting horizons require shorter input sequences and vice versa. Therefore, there is a need to adjust the LSTM architecture to the forecasting horizon and, in general terms, to the conditions of the problem.
In summary, the results obtained in the empirical work validate the potential of LSTMs for yield forecasting and identify their memory advantage when compared to memoryfree models. This enables the incorporation of LSTMs in autonomous systems for the asset management industry, with special relevance to pension funds, insurance companies and investment funds.
Second, in order to analyse the internal functioning of the LSTM model and mitigate the preconceived notion of a black box normally associated with this type of model, we conduct an in-depth internal analysis of the information in the memory cell through time. This is the first contribution with that objective: alternative works either apply to a different type of model or conduct an external analysis of the LSTMs (Section 2.2). To achieve this goal, we select several locations within the memory cell at which to directly calculate and extract the signals (weights) at each time step and hidden unit. Specifically, the locations are: the forget gate, the product of the outputs from the input gate and input node, the output gate, the cell state, and the hidden state. This analysis is carried out using sequence-to-sequence (6 days) LSTM architectures, with uni- and multivariate feature sets (10-year yield; 10-year yield plus momentum indicator; and 10-year yield plus 5- and 30-year yields), with a reduced number of hidden units (3 units) for interpretability purposes, and for a forecasting horizon of next day plus 5 days.
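To make the five extraction locations concrete, the following sketch runs a standard LSTM cell forward over a 6-step univariate sequence with 3 hidden units and records the signal at each location for every time step. It assumes the usual LSTM formulation (sigmoid gates, tanh input node, no peephole connections) and uses random placeholder weights; the names `init_params`, `run_lstm` and the toy inputs are hypothetical, and a trained model would supply its own parameters.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_in, n_hidden, seed=0):
    """Random placeholder weights (assumption); W acts on the input,
    U on the previous hidden state, b is the bias."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {name: {"W": mat(n_hidden, n_in),
                   "U": mat(n_hidden, n_hidden),
                   "b": [0.0] * n_hidden}
            for name in ("forget", "input", "node", "output")}


def pre_activation(p, x, h):
    """Compute W x + U h + b for one gate, one value per hidden unit."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hj for u, hj in zip(u_row, h)) + b
            for w_row, u_row, b in zip(p["W"], p["U"], p["b"])]


def run_lstm(params, sequence, n_hidden):
    """Forward pass that stores the signals at the five locations per step."""
    h = [0.0] * n_hidden
    c = [0.0] * n_hidden
    trace = []
    for x in sequence:
        f = [sigmoid(z) for z in pre_activation(params["forget"], x, h)]
        i = [sigmoid(z) for z in pre_activation(params["input"], x, h)]
        g = [math.tanh(z) for z in pre_activation(params["node"], x, h)]
        o = [sigmoid(z) for z in pre_activation(params["output"], x, h)]
        ig = [a * b for a, b in zip(i, g)]  # input gate times input node
        c = [fk * ck + igk for fk, ck, igk in zip(f, c, ig)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        trace.append({"forget_gate": f, "input_times_node": ig,
                      "output_gate": o, "cell_state": c, "hidden_state": h})
    return trace


params = init_params(n_in=1, n_hidden=3, seed=0)
sequence = [[0.10], [0.12], [0.11], [0.15], [0.14], [0.13]]  # hypothetical scaled yields
trace = run_lstm(params, sequence, n_hidden=3)
```

Each entry of `trace` holds one value per hidden unit, so plotting `trace[t]["hidden_state"][k]` against `t` gives the kind of per-unit time series that the internal analysis inspects.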
Overall, considering all feature sets, the most remarkable property found consistently in the LSTM signals is the activation/deactivation of units through time, which determines whether or not they contribute to the forecasting process. Moreover, we found evidence that the LSTM units tend to specialise in different yield ranges or in different features considered in the model.
In the third contribution, we investigate the information contained in the signals extracted from the LSTM hidden and cell states, to examine whether the corresponding time series can be explained by external sources of information. To this effect, we introduce a new methodology, here identified as LSTM-LagLasso, based on both Lasso and Kalman-LagLasso. Like the Kalman-LagLasso, this methodology is capable of identifying both relevant features and their corresponding lags, but with significant modifications (Section 6.1, LSTM-LagLasso).
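The general idea of selecting features together with their lags can be sketched with a plain Lasso fitted on lagged copies of candidate features. This is not the LSTM-LagLasso itself, whose modifications are described in Section 6.1; it is a simplified stand-in that assumes centred data, no intercept, and a coordinate-descent solver written out by hand. The names `lagged_design` and `lasso_cd`, the synthetic features `a` and `b`, and the synthetic target standing in for an LSTM state signal are all hypothetical.

```python
import random


def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha (the Lasso proximal step)."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lagged_design(features, max_lag):
    """Build one column per (feature, lag) pair, for lags 1..max_lag.
    Row t holds each feature's value at time t - lag, for t >= max_lag."""
    names = [(name, lag) for name in features for lag in range(1, max_lag + 1)]
    n_obs = len(next(iter(features.values())))
    rows = [[features[name][t - lag] for name, lag in names]
            for t in range(max_lag, n_obs)]
    return names, rows


def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + alpha * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, alpha) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Synthetic example: the target depends only on feature "a" at lag 2.
rng = random.Random(1)
n_obs, max_lag = 200, 3
features = {"a": [rng.uniform(-1, 1) for _ in range(n_obs)],
            "b": [rng.uniform(-1, 1) for _ in range(n_obs)]}
names, X = lagged_design(features, max_lag)
state_signal = [2.0 * features["a"][t - 2] for t in range(max_lag, n_obs)]
beta = lasso_cd(X, state_signal, alpha=0.05)
selected = [(name, coef) for name, coef in zip(names, beta) if coef != 0.0]
```

In an application along the lines described here, `state_signal` would be a time series extracted from an LSTM hidden or cell state and `features` would hold exogenous market and macroeconomic series; the non-zero entries of `beta` then point to candidate (feature, lag) pairs.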
The findings show that the information contained in the LSTM states is complex, but may be explained by exogenous macroeconomic and market variables that are not known to the model during the learning process. Thus, it is worth exploring this information using the developed LSTM-LagLasso methodology, which may also be used as an alternative feature selection method. The relevant features selected with the LSTM-LagLasso method include conventional as well as non-conventional market/macro indicators (Section 6.2), which contribute to the prediction process but are not commonly used in forecasting models. In addition, the LSTM-LagLasso identifies particular lags as important. Above all, LSTM networks can capture this information and maintain it in the long- and short-term memories, i.e. the cell and hidden states.
With respect to future work, our present research focuses on financial asset forecasting, the development of methodologies and the internal analysis of the LSTM model. However, the ultimate purpose in the industry is portfolio management and trading, which is a different type of problem from asset forecasting: obtaining correct predictions does not necessarily translate into profitable strategies. Thus, the next step is to implement this type of model in autonomous systems, to assess its potential for trading and portfolio management in fixed income markets. Finally, we want to emphasise that the work described in this paper is a fundamental component necessary for the implementation of those intelligent systems.
Acknowledgements
This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC Award No. 1921702). All the information required to download the full dataset used in this research (in particular the identification of features), from a Bloomberg Professional terminal, is made publicly available (Nunes, 2020). The authors would like to thank Luis Montesdeoca for helpful discussions during the course of this study.
References
 Antoniou et al. (2003) Antoniou, A., Galariotis, E. C., and Spyrou, S. I. (2003). Profits from buying losers and selling winners in the London Stock Exchange. Journal of Business & Economics Research (JBER), 1(11). https://doi.org/10.19030/jber.v1i11.3069.
 Arrieta-Ibarra and Lobato (2015) Arrieta-Ibarra, I. and Lobato, I. N. (2015). Testing for predictability in financial returns using statistical learning procedures. Journal of Time Series Analysis, 36(5):672–686.
 Ballings et al. (2015) Ballings, M., Van den Poel, D., Hespeels, N., and Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. Expert Systems with Applications, 42(20):7046–7056. https://doi.org/10.1016/j.eswa.2015.05.013.
 Bengio et al. (1993) Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In Proceedings of the International Conference on Neural Networks, pages 1183–1188. IEEE.
 Bengio et al. (1994) Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
 Bishop (2006) Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics). Springer-Verlag New York.
 Bloomberg (2017) Bloomberg (2017). Bloomberg professional database. Subscription service.
 Booth et al. (2014) Booth, A., Gerding, E., and McGroarty, F. (2014). Automated trading with performance weighted random forests and seasonality. Expert Systems with Applications, 41(8):3651–3661. https://doi.org/10.1016/j.eswa.2013.12.009.
 Brownlee (2018) Brownlee, J. (2018). Long short-term memory networks with Python. Develop sequence prediction models with deep learning. Machine Learning Mastery, Jason Brownlee.
 Castellani and Santos (2006) Castellani, M. and Santos, E. A. d. (2006). Forecasting long-term government bond yields: An application of statistical and AI models. ISEG, Departamento de Economia, pages 1–34.
 Choudhry et al. (2012) Choudhry, T., McGroarty, F., Peng, K., and Wang, S. (2012). High-frequency exchange-rate prediction with an artificial neural network. Intelligent Systems in Accounting, Finance and Management, 19(3):170–178.
 Diebold and Li (2006) Diebold, F. X. and Li, C. (2006). Forecasting the term structure of government bond yields. Journal of Econometrics, 130(2):337–364. https://doi.org/10.1016/j.jeconom.2005.03.005.
 Dunis et al. (2016) Dunis, C. L., Middleton, P. W., Karathanasopolous, A., and Theofilatos, K. (2016). Artificial intelligence in financial markets: Cutting edge applications for risk management, portfolio optimization and economics. New Developments in Quantitative Trading and Investment. Palgrave Macmillan UK.
 Dunis and Morrison (2007) Dunis, C. L. and Morrison, V. (2007). The economic value of advanced time series methods for modelling and trading 10-year government bonds. European Journal of Finance, 13(4):333–352.
 Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.
 Eilers et al. (2014) Eilers, D., Dunis, C. L., Mettenheim, H.-J. v., and Breitner, M. H. (2014). Intelligent trading of seasonal effects: A decision support algorithm based on reinforcement learning. Decision Support Systems, 64:100–108. https://doi.org/10.1016/j.dss.2014.04.011.
 Elman (1990) Elman, J. L. (1990). Finding structure in time. Cognitive science, 14(2):179–211.
 Fischer and Krauss (2018) Fischer, T. and Krauss, C. (2018). Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2):654–669. https://doi.org/10.1016/j.ejor.2017.11.054.
 Fletcher and Shawe-Taylor (2013) Fletcher, T. and Shawe-Taylor, J. (2013). Multiple kernel learning with fisher kernels for high frequency currency prediction. Computational Economics, 42(2):217–240.
 Garman (2001) Garman, M. C. (2001). High yield allocations in the short run: Market trends in the allocation decision. Merrill Lynch, Global Securities Research & Economics Group, pages 1–8.
 Gers et al. (2002) Gers, F. A., Eck, D., and Schmidhuber, J. (2002). Applying LSTM to time series predictable through time-window approaches. In Tagliaferri, R. and Marinaro, M., editors, Proceedings of the Italian Workshop on Neural Nets, WIRN Vietri-01, Perspectives in Neural Computing, pages 193–200. Springer.
 Gers and Schmidhuber (2000) Gers, F. A. and Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN, volume 3, pages 189–194. IEEE.
 Gers et al. (1999) Gers, F. A., Schmidhuber, J., and Cummins, F. (1999). Learning to forget: Continual prediction with LSTM. In Proceedings of the International Conference on Artificial Neural Networks, ICANN, volume 2, pages 850–855. Institution of Engineering and Technology.
 Gilardoni (2017) Gilardoni, G. (2017). Recurrent neural network models for financial distress prediction. Master’s thesis, Politecnico di Milano.
 Giles et al. (2001) Giles, C. L., Lawrence, S., and Tsoi, A. C. (2001). Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(1–2):161–183. https://doi.org/10.1023/A:1010884214864.
 Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
 Gradojevic and Yang (2006) Gradojevic, N. and Yang, J. (2006). Non-linear, non-parametric, non-fundamental exchange rate forecasting. Journal of Forecasting, 25(4):227–245.
 Graves and Schmidhuber (2005) Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610. https://doi.org/10.1016/j.neunet.2005.06.042.
 Hastie et al. (2013) Hastie, T., Tibshirani, R., and Friedman, J. (2013). The elements of statistical learning: Data mining, inference, and prediction. Springer Series in Statistics, second edition.
 Hochreiter (1991) Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen netzen. Diploma thesis, Technische Universität München.
 Hochreiter et al. (2001) Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In Kremer, S. C. and Kolen, J. F., editors, A Field Guide to Dynamical Recurrent Networks, pages 1–15. Wiley-IEEE Press.
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
 Huang et al. (2007) Huang, W., Lai, K. K., Nakamori, Y., Wang, S., and Yu, L. (2007). Neural networks in finance and economics forecasting. International Journal of Information Technology & Decision Making, 6(01):113–140.
 Jegadeesh and Titman (1993) Jegadeesh, N. and Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1):65–91. https://doi.org/10.1111/j.1540-6261.1993.tb04702.x.
 Kalman (1960) Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45. https://doi.org/10.1115/1.3662552.
 Kanevski et al. (2008) Kanevski, M., Maignan, M., Pozdnoukhov, A., and Timonin, V. (2008). Interest rates mapping. Physica A: Statistical Mechanics and its Applications, 387(15):3897–3903. https://doi.org/10.1016/j.physa.2008.02.069.
 Kanevski and Timonin (2010) Kanevski, M. and Timonin, V. (2010). Machine learning analysis and modeling of interest rate curves. Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN, pages 47–52.
 Karpathy (2015) Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/ (Accessed on 19-Jan-2018).
 Khanal and Mishra (2014) Khanal, A. R. and Mishra, A. K. (2014). Is the ‘buying winners and selling losers’ trading strategy profitable in the new economy? Applied Economics Letters, 21(15):1090–1093. https://doi.org/10.1080/13504851.2014.909569.
 Kingma and Ba (2015) Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, ICLR, pages 1–15. arXiv preprint arXiv:1412.6980v9.
 Kraus and Feuerriegel (2017) Kraus, M. and Feuerriegel, S. (2017). Decision support from financial disclosures with deep neural networks and transfer learning. Decision Support Systems, 104:38–48. https://doi.org/10.1016/j.dss.2017.10.001.
 Lipton et al. (2015) Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019v4, pages 1–38.
 Mahler (2009) Mahler, N. (2009). Modeling the S&P 500 index using the Kalman filter and the LagLasso. In Proceedings of the International Workshop on Machine Learning for Signal Processing, MLSP, pages 1–6. IEEE. https://doi.org/10.1109/MLSP.2009.5306195.
 Maknickienė and Maknickas (2012) Maknickienė, N. and Maknickas, A. (2012). Application of neural network for forecasting of exchange rates and forex trading. In Proceedings of the International Scientific Conference Business and Management, pages 122–127.
 Maybeck (1979) Maybeck, P. S. (1979). Stochastic models, estimation, and control. Mathematics in Science and Engineering. Academic Press.
 Montesdeoca and Niranjan (2020) Montesdeoca, L. and Niranjan, M. (2020). On comparing the influences of exogenous information on bitcoin prices and stock index values. In Pardalos, P., Kotsireas, I., Guo, Y., and Knottenbelt, W., editors, Mathematical Research for Blockchain Economy, MARBLE, pages 93–100. Springer. https://doi.org/10.1007/978-3-030-37110-4_7.
 Munkhdalai et al. (2017) Munkhdalai, L., Namsrai, O.-E., and Ryu, K. H. (2017). A hybrid approach based on long short-term memory networks and vector autoregression for stock market price prediction. Proceedings of the International Conference on Frontiers of Information Technology, Applications and Tools, FITAT, pages 1–4.
 Nelson and Siegel (1987) Nelson, C. R. and Siegel, A. F. (1987). Parsimonious modeling of yield curves. Journal of Business, pages 473–489.
 Niranjan (1996) Niranjan, M. (1996). Sequential tracking in pricing financial options using model based and neural network approaches. In Mozer, M. C., Jordan, M. I., and Petsche, T., editors, Advances in Neural Information Processing Systems, NIPS, pages 960–966. MIT Press.
 Nunes (2020) Nunes, M. (2020). Dataset information for article “Long short-term memory networks and laglasso for bond yield forecasting: Peeping inside the black box”. University of Southampton, doi allocated and to be registered by UoS [Dataset].
 Nunes et al. (2019) Nunes, M., Gerding, E., McGroarty, F., and Niranjan, M. (2019). A comparison of multitask and single task learning with artificial neural networks for yield curve forecasting. Expert Systems with Applications, 119:362–375. https://doi.org/10.1016/j.eswa.2018.11.012.
 OECD (2015) OECD (2015). Business and finance outlook. OECD Publishing, Paris.
 Olah (2015) Olah, C. (2015). Understanding LSTM networks. URL https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (Accessed on 19-Jan-2018).
 Park et al. (2019) Park, S., Gil, M.-S., Im, H., and Moon, Y.-S. (2019). Measurement noise recommendation for efficient Kalman filtering over a large amount of sensor data. Sensors, 19(5):1–19. https://doi.org/10.3390/s19051168.
 Persio and Honchar (2016a) Persio, L. D. and Honchar, O. (2016a). Artificial neural networks approach to the forecast of stock market price movements. International Journal of Economics and Management Systems, 1:158–162.
 Persio and Honchar (2016b) Persio, L. D. and Honchar, O. (2016b). Artificial neural networks architectures for stock price prediction: Comparisons and applications. International Journal of Circuits, Systems and Signal Processing, 10:403–413.
 Persio and Honchar (2017) Persio, L. D. and Honchar, O. (2017). Recurrent neural networks approach to the financial forecast of google assets. International Journal of Mathematics and Computers in Simulation, 11:7–13.
 Prügel-Bennett (2017) Prügel-Bennett, A. (2017). Advanced machine learning. University of Southampton, School of Electronics and Computer Science.
 Qin et al. (2017) Qin, Y., Song, D., Chen, H., Cheng, W., Jiang, G., and Cottrell, G. W. (2017). A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, pages 2627–2633. IJCAI Organization.
 Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel distributed processing: Explorations in the microstructure of cognition, volume 1, pages 318–362. MIT Press.
 Sambasivan and Das (2017) Sambasivan, R. and Das, S. (2017). A statistical machine learning approach to yield curve forecasting. In Proceedings of the International Conference on Computational Intelligence in Data Science, ICCIDS, pages 1–6. IEEE.
 Schmidhuber (2015) Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117. https://doi.org/10.1016/j.neunet.2014.09.003.
 Sermpinis et al. (2019) Sermpinis, G., Karathanasopoulos, A., Rosillo, R., and de la Fuente, D. (2019). Neural networks in financial trading. Annals of Operations Research. https://doi.org/10.1007/s10479-019-03144-y.
 Sermpinis et al. (2012) Sermpinis, G., Laws, J., Karathanasopoulos, A., and Dunis, C. L. (2012). Forecasting and trading the EUR/USD exchange rate with Gene Expression and Psi Sigma Neural Networks. Expert Systems with Applications, 39(10):8865–8877. https://doi.org/10.1016/j.eswa.2012.02.022.
 Sermpinis et al. (2013) Sermpinis, G., Theofilatos, K., Karathanasopoulos, A., Georgopoulos, E. F., and Dunis, C. (2013). Forecasting foreign exchange rates with adaptive neural networks using radial-basis functions and particle swarm optimization. European Journal of Operational Research, 225(3):528–540. https://doi.org/10.1016/j.ejor.2012.10.020.
 Strobelt et al. (2018) Strobelt, H., Gehrmann, S., Pfister, H., and Rush, A. M. (2018). LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
 Wang et al. (2019) Wang, J., Zhang, Y., Tang, K., Wu, J., and Xiong, Z. (2019). Alphastock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proceedings of the International Conference on Knowledge Discovery & Data Mining, SIGKDD, pages 1900–1908. ACM. https://doi.org/10.1145/3292500.3330647.
 Wolpert and Macready (1997) Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82. https://doi.org/10.1109/4235.585893.
 Xiong et al. (2016) Xiong, R., Nichols, E. P., and Shen, Y. (2016). Deep learning stock volatility with google domestic trends. arXiv preprint arXiv:1512.04916v3, pages 1–6.
 Zhang et al. (2017) Zhang, Y., Wang, D., Chen, Y., Shang, H., and Tian, Q. (2017). Credit risk assessment based on long short-term memory model. In Proceedings of the International Conference on Intelligent Computing, ICIC, pages 700–712. Springer.