A Hybrid Framework for Sequential Data Prediction with End-to-End Optimization

by Mustafa E. Aydın, et al.
Bilkent University

We investigate nonlinear prediction in an online setting and introduce a hybrid model that, through an end-to-end architecture, mitigates the need for hand-designed features and the manual model selection issues of conventional nonlinear prediction/regression methods. In particular, we use recursive structures to extract features from sequential signals while preserving the state information, i.e., the history, and boosted decision trees to produce the final output. The two components are connected in an end-to-end fashion, and we jointly optimize the whole architecture using stochastic gradient descent, for which we also provide the backward pass update equations. Specifically, we employ a recurrent neural network (LSTM) for adaptive feature extraction from sequential data and a gradient boosting machinery (soft GBDT) for effective supervised regression. Our framework is generic, so one can use other deep learning architectures for feature extraction (such as RNNs and GRUs) and other machine learning algorithms for decision making as long as they are differentiable. We demonstrate the learning behavior of our algorithm on synthetic data and its significant performance improvements over conventional methods on various real life datasets. Furthermore, we openly share the source code of the proposed method to facilitate further research.




1 Introduction

1.1 Background

We study nonlinear prediction in a temporal setting where we receive a data sequence related to a target signal and estimate the signal’s next samples. This problem is extensively studied in the signal processing and machine learning literatures since it has many applications, such as in electricity demand fore_elec, medical records fore_medic, weather conditions fore_weather and retail sales fore_retail. Commonly, this problem is studied as two disjoint sub-problems where features are extracted, usually by a domain expert, and then a decision algorithm is trained over these selected features using possibly different feature selection methods. As shown in our simulations, this disjoint optimization can provide less than adequate results for different applications since the selected features may not be the best features to be used by the selected algorithm. To remedy this important problem, we introduce an algorithm which performs these two tasks jointly and optimizes both the features and the algorithm in an end-to-end manner in order to minimize the final loss. In particular, we use recursive structures to extract features from sequential data, while preserving the state information, i.e., the history, and boosted decision trees to produce the final output, which are shown to be very effective in several different real life applications boosting_1 ; boosting_2. This combination is then jointly optimized to minimize the final loss in an end-to-end manner.

The existing well-known statistical models (such as autoregressive integrated moving average and exponential smoothing) for regression are robust to overfitting, generally have fewer parameters to estimate, are easier to interpret due to the intuitive nature of the underlying model and are amenable to automated procedures for hyperparameter selection the_forecasting_paper. However, these models have very strong assumptions about the data such as stationarity and linearity, which prevent them from capturing nonlinear patterns that real life data tend to possess nonlinear_life. Machine learning-based models, on the other hand, are data-driven and free of the strong statistical assumptions. Recurrent neural networks (RNN), and in particular long short-term memory neural networks (LSTM), are among such models that excel at representing the data, especially the sequential data, in a hierarchical manner via nonlinear modules stacked on top of each other ant_yt. Thanks to the feedback connections that preserve the history in memory rnn_itself, they are widely used in sequential learning tasks. Nevertheless, a fully connected layer usually employed as the final layer on top of these hidden layers often hinders their regression ability (e.g., in rnn_nondense, authors use an attention layer for more accurate judgments). An alternative to these deep learning models is the tree-based models that learn hierarchical relations in the data by splitting the input space and then fitting different models to each split ant_yt. Such data-driven splitting and model fitting are shown to be very effective in real life applications trees_fine_1 ; trees_fine_2. Among such models, gradient-boosted decision trees (GBDT) gbdt_friedman are promising models that work via sequentially combining weak learners, i.e., decision trees. Specialized GBDTs, such as XGBoost xgb_paper and LightGBM lgb_paper, demonstrated excellent performance at various time series prediction problems m5_comp. Although they were not designed to process sequential data, GBDTs can be used by incorporating temporal side information, such as lagged values of the desired signal, as components of the feature vectors. However, this need for domain-specific feature engineering and the independent design of the algorithm from the selected features hinder their full potential in sequential learning tasks since both processes are time consuming in most applications and require domain expertise finans_paper ; cyber_paper.
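To make the hand-designed feature step concrete, the following minimal sketch builds lagged-value feature vectors of the kind a conventional GBDT would need for a sequential task. The function name `make_lagged_features` and the toy series are hypothetical illustrations and are not taken from the paper or its code.

```python
def make_lagged_features(series, n_lags):
    """Build (features, targets) from a 1-D sequence.

    Each feature row holds the n_lags most recent past values (oldest first);
    the matching target is the value that immediately follows them.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must satisfy 1 <= n_lags < len(series)")
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(list(series[t - n_lags:t]))
        targets.append(series[t])
    return features, targets


# Toy example (made-up numbers): predict the next value from the last two.
demand = [10.0, 12.0, 11.0, 13.0, 15.0]
X, y = make_lagged_features(demand, n_lags=2)
```

The number of lags is itself a design choice that has to be tuned by hand in the disjoint setting, which is the kind of manual step the joint architecture aims to avoid.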

Here, we effectively combine LSTM and GBDT architectures for nonlinear sequential data prediction. Our motivation is to use an LSTM network as a feature extractor from the sequential data in the front end, mitigating the need for hand-designed features for the GBDT, and use the GBDT to perform supervised regression on these learned features while enjoying joint optimization. In particular, we connect the two models in an end-to-end fashion and jointly optimize the whole architecture using stochastic gradient descent in an online manner. To be able to learn not only the feature mapping but also the optimal partition of the feature space, we use a soft gradient boosting decision tree (sGBDT) sgbm_paper that employs soft decision trees irsoy_sdt ; hinton_sdt as the weak learners. We emphasize that our framework is generic so that one can use other deep learning architectures for feature extraction (such as RNNs and GRUs) and machine learning algorithms for decision making as long as they are differentiable.

1.2 Related Work

Combination of decision trees with neural networks has attracted wide interest in the machine learning literature. In ndf_paper, authors propose a neural decision forest where they connect soft decision trees to the final layer of a CNN architecture. This combination provided state-of-the-art performance on the ImageNet dataset image_net_dataset. Our work differs from this method since we use boosting with fixed-depth trees instead of bagging and we jointly optimize the whole architecture whereas they employ an alternating optimization procedure. In hinton_sdt, Frosst and Hinton first employ a neural network and then, using its predictions along with the true targets, train a soft decision tree. Their main goal is to get a better understanding of the decisions a neural network makes by distilling its knowledge into a decision tree. They, however, focus on classification, use two separate training phases by design, and favor explainability over model performance. The adaptive neural trees proposed in ant_yt use a soft decision tree with a “split data, deepen transform and keep” methodology where neural networks are the transformers in the edges of the tree. The tree is thereby grown step-wise and the overall architecture necessitates both a “growth” phase and a “refinement” phase (where parameters are actually updated). Our architecture, however, does not embed neural networks inside the trees but places the two in sequence, does not grow trees but uses fixed-depth weak learners, and performs the optimization via a single backpropagation pass. References nrf_paper ; tel_paper also take similar approaches to the aforementioned models but none of the proposed models are designed to process sequential data, which renders them unsuitable for time series forecasting tasks. Perhaps the closest architecture to ours is disj_paper, where the authors target a forecasting task with a hybrid model of an LSTM network and XGBoost. Nevertheless, they employ a disjoint architecture with three-stage training: twice for XGBoost and once for LSTM, which not only increases the computational time but also prevents the model from enjoying the benefits of end-to-end optimization end_to_end_paper_1 ; end_to_end_paper_2.

1.3 Contributions

We list our main contributions as follows.

  1. We introduce a hybrid architecture composed of an LSTM and a soft GBDT for sequential data prediction that can be trained in an end-to-end fashion with stochastic gradient descent. We also provide the backward pass equations that can be used in backpropagation.

  2. To the best of our knowledge, this model is the first in the literature that is armed with joint optimization for a sequential feature extractor in the front end and a supervised regressor in the back end.

  3. The soft GBDT regressor in the back end not only learns a feature mapping but also learns the optimal partitioning of the feature space.

  4. The proposed architecture is generic enough that the feature extractor part can readily be replaced with, for example, an RNN variant (e.g., gated recurrent unit (GRU) gru_paper) or a temporal convolutional network (TCN) tcn_paper. Similarly, the supervised regressor part can also be replaced with a machine learning algorithm as long as it is differentiable.

  5. Through an extensive set of experiments, we show the efficacy of the proposed model over real life datasets. We also empirically verify the integrity of our model with synthetic datasets.

  6. We publicly share our code for both model design and experiment reproducibility at https://github.com/mustafaaydn/lstm-sgbdt.

1.4 Organization

The rest of the paper is organized as follows. In Section 2, we introduce the nonlinear prediction problem, describe the LSTM network in the front end and point to the unsuitableness of a hard decision tree in joint optimization. We then describe the soft GBDT regressor in the back end in Section 2.2.1 and the proposed end-to-end architecture along with the backward pass equations in Section 2.2.2. We demonstrate the performance of the introduced architecture through a set of experiments in Section 3. Finally, Section 4 concludes the paper.

2 Material and methods

2.1 Problem Statement

In this paper, all vectors are real column vectors and presented by boldface lowercase letters. Matrices are denoted by boldface uppercase letters. $x_k$ and $x_{t,k}$ denote the $k^{th}$ element of the vectors $\mathbf{x}$ and $\mathbf{x}_t$, respectively. $\mathbf{x}^T$ represents the ordinary transpose of $\mathbf{x}$. $X_{i,j}$ represents the entry at the $i^{th}$ row and the $j^{th}$ column of the matrix $\mathbf{X}$.

We study the nonlinear prediction of sequential data. We observe a sequence $\{y_t\}_{t \geq 1}$ possibly along with a side information sequence $\{\mathbf{s}_t\}_{t \geq 1}$. At each time $t$, given the past information $\{y_k\}$, $\{\mathbf{s}_k\}$ for $k \leq t$, we produce the predictions $\hat{\mathbf{y}}_t = [\hat{y}_{t+1}, \dots, \hat{y}_{t+d}]^T$, where $d$ is the number of steps ahead to predict. Hence, in this purely online setting, our goal is to find the relationship

$$\hat{\mathbf{y}}_t = f\left(\{\dots, y_{t-1}, y_t\}, \{\dots, \mathbf{s}_{t-1}, \mathbf{s}_t\}\right),$$

where $f$ is a nonlinear function, which models the future samples $y_{t+1}, \dots, y_{t+d}$. We note that $f$ can either produce $d$ estimates at once or do it in a recursive fashion (roll out) by using its own predictions. In either case, we suffer the loss over the prediction horizon given by

$$L = \sum_{k=1}^{d} \ell\left(y_{t+k}, \hat{y}_{t+k}\right),$$

where $\ell$, for example, can be the squared error loss. As an example, in weather nowcasting, the temperature ($y_t$) for the next six hours ($d = 6$) is predicted using the hourly measurements of the past. Side information could include the wind and humidity levels ($\mathbf{s}_t \in \mathbb{R}^2$). The model updates itself whenever new measurements of the next hour become available, i.e., works in an online manner.
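The two ingredients of this problem statement, the recursive roll-out and the loss accumulated over the prediction horizon, can be sketched as follows. This is only an illustration under stated assumptions: the squared error is used as the per-step loss, and `roll_out`, `horizon_loss` and the toy averaging model are hypothetical names that do not come from the paper.

```python
def horizon_loss(targets, predictions):
    """Sum of squared errors over the prediction horizon."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))


def roll_out(one_step_model, history, d):
    """Produce d predictions recursively, feeding each prediction back in."""
    window = list(history)
    predictions = []
    for _ in range(d):
        y_hat = one_step_model(window)
        predictions.append(y_hat)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [y_hat]
    return predictions


def window_average(window):
    """Toy one-step model (an assumption for illustration only)."""
    return sum(window) / len(window)


preds = roll_out(window_average, history=[1.0, 3.0], d=2)
loss = horizon_loss([2.0, 3.0], preds)
```

Producing all estimates at once would replace `roll_out` with a single call to a model that returns a list of length `d`; the loss computation stays the same.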

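We use RNNs for processing the sequential data. An RNN is described by the recursive equation rnn_itself

$$\mathbf{h}_t = u\left(\mathbf{W}_{ih}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1}\right),$$

where $\mathbf{x}_t$ is the input vector and $\mathbf{h}_t$ is the state vector at time instant $t$. $\mathbf{W}_{ih}$ and $\mathbf{W}_{hh}$ are the input-to-hidden and hidden-to-hidden weight matrices (augmented to include biases), respectively; $u$ is a nonlinear function that applies element wise. As an example, for the prediction task, one can use $\mathbf{s}_t$ directly as the input vector, i.e., $\mathbf{x}_t = \mathbf{s}_t$, or some combination of the past information $\{y_t, \dots, y_{t-r+1}\}$ with a window size $r$ along with $\mathbf{s}_t$ to construct the input vector $\mathbf{x}_t$. The final output of the RNN is produced by passing $\mathbf{h}_t$ through a sigmoid or a linear model:

$$\hat{\mathbf{y}}_t = \mathbf{W}_{ho}\mathbf{h}_t,$$

where $\mathbf{W}_{ho}$ is $d \times m$ and $\mathbf{h}_t \in \mathbb{R}^m$. Here, we consider $\mathbf{h}_t$ as the “extracted feature vector” from the sequential data by the RNN.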
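A minimal pure-Python sketch of this recursion is given below. It assumes `tanh` as the element-wise nonlinearity and omits the bias augmentation for brevity; the helper names (`matvec`, `rnn_step`, `rnn_features`) and the toy weights are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(W_ih, W_hh, x_t, h_prev):
    """One recursion step: h_t = tanh(W_ih x_t + W_hh h_prev)."""
    a = matvec(W_ih, x_t)
    b = matvec(W_hh, h_prev)
    return [math.tanh(p + q) for p, q in zip(a, b)]


def rnn_features(W_ih, W_hh, inputs):
    """Run the recursion over a sequence, starting from a zero state."""
    h = [0.0] * len(W_hh)
    for x_t in inputs:
        h = rnn_step(W_ih, W_hh, x_t, h)
    return h  # the final state plays the role of the extracted feature vector


# Toy weights (made up): 1-D input, 2-D state, no recurrence.
W_ih = [[1.0], [0.0]]
W_hh = [[0.0, 0.0], [0.0, 0.0]]
h_final = rnn_features(W_ih, W_hh, inputs=[[0.5]])
```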

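To effectively deal with the temporal dependency in sequential data, we employ a specific kind of RNN, namely the LSTM neural networks lstm_itself in the front end as a feature extractor. Here we use the variant with the forget gates lstm_forget. The forward pass equations in one cell are described as

$$
\begin{aligned}
\mathbf{f}_t &= \sigma\left(\mathbf{W}_f\,[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f\right) \\
\mathbf{i}_t &= \sigma\left(\mathbf{W}_i\,[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i\right) \\
\tilde{\mathbf{c}}_t &= \tanh\left(\mathbf{W}_c\,[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c\right) \\
\mathbf{o}_t &= \sigma\left(\mathbf{W}_o\,[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o\right) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh\left(\mathbf{c}_t\right),
\end{aligned} \tag{1}
$$

where $\mathbf{c}_t$ is the state vector, $\mathbf{x}_t$ is the input to the cell, $\mathbf{h}_{t-1}$ is the hidden state from the previous cell and $[\mathbf{h}_{t-1}; \mathbf{x}_t]$ denotes the vertical concatenation of $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$. $\mathbf{f}_t, \mathbf{i}_t, \tilde{\mathbf{c}}_t, \mathbf{o}_t$ are the forget gate, input gate, block input and output gate vectors, respectively, whose stacked weight matrices are $\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_c$ and $\mathbf{W}_o$. The corresponding biases are represented by $\mathbf{b}_f, \mathbf{b}_i, \mathbf{b}_c$ and $\mathbf{b}_o$. $\sigma$ denotes the logistic sigmoid function, i.e., $\sigma(x) = 1 / (1 + e^{-x})$, and $\tanh$ is the hyperbolic tangent function, i.e., $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$; both are nonlinear activation functions that apply point-wise. $\odot$ is the Hadamard (element-wise) product operator. The constant error flow through the cell state in backpropagation allows LSTMs to prevent vanishing gradients, and nonlinear gated interactions described by (1) control the information flow effectively to capture long term dependencies lstm_odssey.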
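For readers who prefer code, one cell of the standard forget-gate LSTM can be sketched in pure Python as below. The dictionary keys (`W_f`, `b_f`, and so on) and the function names are assumptions made for this illustration; they are not claimed to match the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_cell(params, x_t, h_prev, c_prev):
    """One forward step of an LSTM cell with forget gates."""
    z = h_prev + x_t  # list concatenation: the stacked vector [h_{t-1}; x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    g = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # block input
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Toy parameters (all zeros) for hidden size 1 and input size 1.
zero_params = {}
for gate in ("f", "i", "c", "o"):
    zero_params["W_" + gate] = [[0.0, 0.0]]
    zero_params["b_" + gate] = [0.0]
h_t, c_t = lstm_cell(zero_params, x_t=[1.0], h_prev=[0.3], c_prev=[2.0])
```

With all-zero parameters every gate equals 0.5 and the block input is 0, so the cell state is simply halved, which makes the additive path through the cell state easy to see.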

Instead of the linear model, one can use decision trees to process the extracted features, i.e., $\mathbf{h}_t$ produced by the RNN or LSTM. A (hard) binary decision tree forms an axis-aligned partition of the input space hard_tree. The tree consists of two types of nodes: internal nodes and leaf nodes, as depicted in Fig. 1. The input vector enters from the root node and is subjected to a decision rule at each internal node, routing it either to the left child or the right child. This path ends when the input ends up in a leaf node where an output value is produced, through which the input space is partitioned. Thereby, a regression tree with $J$ leaf nodes is recursively constructed to split the input space into $J$ regions, $R_1, \dots, R_J$, which are disjoint and together cover the input space. This construction is usually done with a greedy approach hard_tree. The forward pass of an input vector $\mathbf{x}$ is given by

$$DT(\mathbf{x}) = \sum_{j=1}^{J} \phi_j \, \mathbb{1}\left(\mathbf{x} \in R_j\right), \tag{2}$$

where $\phi_j$ is the scalar quantity that leaf node $j$ associates with the input, $\mathbb{1}(\cdot)$ is the indicator function and $DT$ represents the tree. By (2), the summation has only one nonzero term due to the hard decisions that route the input vector in a binary fashion, i.e., only one of the leaf nodes contributes to the final output.

Figure 1: A hard binary decision tree. Pink nodes are the internal nodes and green nodes are the leaf nodes. $D_i$’s represent the decision rules applied at each internal node to form a binary decision. For an example input $\mathbf{x}$, the set of decisions $\{D_1, D_2\}$ leads to the second leaf node, where the path is colored blue. Therefore, $\mathbf{x}$ is associated with the value $\phi_2$ that is assigned to the second leaf node.
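The hard routing described above can be sketched with a nested-dictionary tree. The node layout (keys `feature`, `threshold`, `left`, `right`, `value`) and the thresholds are hypothetical choices for this illustration; the example mirrors the depth-2 tree of Fig. 1, where the input ends up in the second leaf.

```python
def hard_tree_predict(node, x):
    """Route x from the root to exactly one leaf and return that leaf's value."""
    while "value" not in node:
        # Axis-aligned decision rule: compare a single feature to a threshold.
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]


# Depth-2 tree with four leaves (made-up thresholds and leaf values).
tree = {
    "feature": 0, "threshold": 0.0,
    "left": {
        "feature": 1, "threshold": 1.0,
        "left": {"value": 1.0},
        "right": {"value": 2.0},
    },
    "right": {
        "feature": 1, "threshold": -1.0,
        "left": {"value": 3.0},
        "right": {"value": 4.0},
    },
}
prediction = hard_tree_predict(tree, [-1.0, 2.0])
```

Because the output is piecewise constant in `x`, its gradient with respect to the input is zero almost everywhere, which is what prevents gradient-based joint training through a hard tree.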

A standalone tree, however, usually suffers from overfitting hard_tree and a way to overcome this is to use many trees in a sequential manner, i.e., gradient boosting decision trees (GBDT) gbdt_friedman. A GBDT works by employing individual trees as “weak” learners and fitting one after another to the residuals of the previous tree with the same greedy approach for constructing individual trees. The forward pass is the weighted sum of the predictions of each individual tree. For $M$ trees, the overall output for the input $\mathbf{x}$ is given by

$$\hat{y} = DT^{(1)}(\mathbf{x}) + \nu \sum_{m=2}^{M} DT^{(m)}(\mathbf{x}), \tag{3}$$

where $DT^{(m)}$ for $m \in \{1, \dots, M\}$ represents the forward pass of the $m^{th}$ tree and $\nu$ is the shrinkage parameter that weights the predictions of individual trees (except for the first one in the chain) to provide regularization gbdt_friedman. The very first tree, i.e., $DT^{(1)}$, predicts a constant value regardless of the input, e.g., the mean value of the targets when the loss criterion is the mean squared error.
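The weighted sum of tree predictions can be sketched as below, assuming, as the text states, that the first tree is a constant predictor and that the shrinkage parameter scales every other tree. The function name `gbdt_predict` and the toy "trees" (plain Python callables) are hypothetical.

```python
def gbdt_predict(trees, x, shrinkage):
    """Boosted forward pass: first tree unweighted, the rest scaled by shrinkage."""
    first, rest = trees[0], trees[1:]
    return first(x) + shrinkage * sum(tree(x) for tree in rest)


# Toy chain (made up): a constant first tree followed by two weak learners.
trees = [
    lambda x: 5.0,    # constant start of the chain, e.g., the target mean
    lambda x: x[0],   # stand-in for a fitted tree
    lambda x: -1.0,   # stand-in for another fitted tree
]
y_hat = gbdt_predict(trees, [2.0], shrinkage=0.1)
```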

Connecting a recurrent neural network and a decision tree based architecture in an end-to-end fashion and training with gradient based methods is not possible when conventional decision trees are used as they inherently lack differentiability due to hard decisions at each node. Therefore, for the joint optimization, we introduce a model composed of an RNN in the front end and a soft GBDT in the back end, which are jointly optimized in an end-to-end manner. We next introduce this architecture for nonlinear prediction.

2.2 The proposed model

We next introduce a model which jointly optimizes the feature selection and model building in a fully online manner. To this end, we use an LSTM network in the front end as a feature extractor from the sequential data and a soft GBDT (sGBDT) in the back end as a supervised regressor. Given $\{y_k\}$ and $\{\mathbf{s}_k\}$ for $k \leq t$ in the forward pass, a multi-layer LSTM produces hidden state vectors by applying (1) in each of its cells. Any differentiable pooling method can be used on the hidden state vectors of the last layer, e.g., last pooling where only the hidden state of the rightmost LSTM cell (when unrolled) is taken. This pooled quantity is the filtered sequential raw data that represents the feature extraction or learning part via LSTM. This extracted feature vector is then fed into the sGBDT as the input to produce the prediction $\hat{\mathbf{y}}_t$. We now describe the soft decision tree based GBDT used in the model.

2.2.1 sGBDT

An sGBDT is a gradient boosting decision tree gbdt_friedman where weak learners are soft decision trees irsoy_sdt ; hinton_sdt . In this section, we first study the soft decision tree variant we use in our model and then present the gradient boosting machinery that combines them.

Soft Decision Tree (sDT)

The widely used hard binary decision trees recursively route a sample from the root node to a single leaf node and result in (usually) axis-aligned partitions of the input space. The decision rules applied at each node redirect the sample either to the left or the right child. Soft decision trees differ in that they redirect a sample to all leaf nodes of the tree with certain probabilities attached. These soft decision rules yield a differentiable architecture.

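Formally, we have a binary tree with an ordered set of internal nodes $\mathcal{N}$ and leaf nodes $\mathcal{L}$. At the $i^{th}$ internal node $n_i \in \mathcal{N}$, the sample $\mathbf{x}$ is subjected to a probabilistic routing controlled by a Bernoulli random variable with parameter $p_i(\mathbf{x})$, where “success” corresponds to routing to the left child. A convenient way to obtain a probability is to use the sigmoid function $\sigma$. Thereby we attach $\mathbf{w}_i$ and $b_i$ to each $n_i$ as learnable parameters such that $p_i(\mathbf{x}) = \sigma(\mathbf{w}_i^T\mathbf{x} + b_i)$ is the probability of sample $\mathbf{x}$ going to the left child and $1 - p_i(\mathbf{x})$ is the probability of it going to the right child. With this scheme, we define a “path probability” for each leaf node as the multiplication of the probabilities of the internal nodes that led to it. As shown in Fig. 2, every leaf node has a probability attached to it for a given sample. These path probabilities correspond to each leaf node’s contribution in the overall prediction given by the tree, which is

$$\hat{\mathbf{y}} = sDT(\mathbf{x}) = \sum_{\ell \in \mathcal{L}} pp_{\ell}(\mathbf{x})\, \boldsymbol{\phi}_{\ell}, \qquad pp_{\ell}(\mathbf{x}) = \prod_{i \in \mathcal{A}(\ell)} p_i(\mathbf{x})^{\mathbb{1}[i \swarrow \ell]} \left(1 - p_i(\mathbf{x})\right)^{\mathbb{1}[i \searrow \ell]}, \tag{4}$$

where $pp_{\ell}(\mathbf{x})$ is the path probability of leaf node $\ell$, $\boldsymbol{\phi}_{\ell}$ is the predicted value produced at $\ell$, which is a learnable parameter (and vector valued in general), and $\mathcal{A}(\ell)$ denotes the ascendant nodes of the leaf node $\ell$, including the root node. The indicators $\mathbb{1}[i \swarrow \ell]$ and $\mathbb{1}[i \searrow \ell]$ equal one when $\ell$ is reached through the left or the right branch of node $i$, respectively.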

The soft decisions formed through $\sigma$’s not only allow for a fully differentiable tree but also result in smooth decision boundaries that effectively reduce the number of nodes necessary to construct the tree in comparison to hard trees irsoy_sdt. Furthermore, $\boldsymbol{\phi}_{\ell}$ being a vector valued quantity in general makes multi output regression possible.
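The path-probability computation can be sketched as follows for a complete binary tree whose internal nodes are stored in breadth-first order. This is a sketch under assumptions: leaf values are scalars here for simplicity, and the array layout and the name `soft_tree_predict` are hypothetical rather than taken from the authors' code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_tree_predict(weights, biases, leaf_values, x):
    """Forward pass of a soft decision tree.

    Internal node i has children 2i+1 and 2i+2; the leaves occupy the last
    positions. Returns the prediction and the list of path probabilities.
    """
    n_internal = len(weights)
    n_leaves = len(leaf_values)  # equals n_internal + 1 for a complete tree
    reach = [0.0] * (n_internal + n_leaves)
    reach[0] = 1.0  # every sample reaches the root with probability one
    for i in range(n_internal):
        p_left = sigmoid(sum(w * v for w, v in zip(weights[i], x)) + biases[i])
        reach[2 * i + 1] += reach[i] * p_left
        reach[2 * i + 2] += reach[i] * (1.0 - p_left)
    path_probs = reach[n_internal:]
    prediction = sum(p * phi for p, phi in zip(path_probs, leaf_values))
    return prediction, path_probs


# Depth-2 tree as in Fig. 2: three internal nodes, four leaves (toy values).
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0]
leaf_values = [1.0, 2.0, 3.0, 4.0]
y_hat, probs = soft_tree_predict(weights, biases, leaf_values, [0.7, -0.2])
```

With zero weights and biases each routing probability is 0.5, so every leaf receives a quarter of the mass; pushing the biases to large positive values makes the tree behave like a hard tree that always goes left.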

Figure 2: A soft binary decision tree. Pink nodes are the internal nodes and green nodes are the leaf nodes, as in Fig. 1. Unlike the hard decision tree, however, the decision rule at each internal node is not hard and the input is routed to both the left and the right child, as indicated with blue edges. The direction is accompanied by a Bernoulli random variable parameterized by $\mathbf{w}_i$ and $b_i$. Therefore, all of the leaf nodes have a contribution in the final prediction of the tree, as expressed in (4). We also note that the learnable parameters in each leaf node are vector valued in general in contrast to hard decision trees.

Gradient boosting of sDTs

We adapt the gradient boosting machinery as originally proposed in gbdt_friedman. The weak learners are fixed depth soft decision trees. Hence, all of the trees have a predetermined depth, which saves computational time. In this regard, our sGBDT resembles AdaBoost where decision stumps are employed as weak learners adaboost_paper. Note that our implementation can be extended to growing trees in a straightforward manner. As depicted in Fig. 3, each tree (except for the very first one) is fitted to the residuals of the previous tree in the boosting chain. The first tree, $sDT^{(1)}$, does not have any predecessor and is tasked to produce a constant output regardless of the input, which can be, for example, the mean of the target values of the training set. The forward pass equation of the sGBDT follows (3). The first tree does not possess any learnable parameter and rather acts as a starting point in the boosting chain. The parameters of other trees, which include $\mathbf{w}_i$ and $b_i$ for internal nodes and $\boldsymbol{\phi}_{\ell}$ for leaf nodes, are learned through backpropagation with gradient descent. The backward pass equations are given in the next section.
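The residual chain can be illustrated with a short sketch. It assumes the textbook squared-error setting, in which the signal for each tree is the target minus the shrunken running prediction of the trees before it; the paper's exact residual definition is not reproduced here, and `residual_targets` is a hypothetical helper name.

```python
def residual_targets(y, tree_outputs, shrinkage):
    """Residuals that trees 2, 3, ... are fitted to, for a single sample.

    tree_outputs[0] is the constant output of the first tree; the remaining
    entries are the outputs of the subsequent trees in the chain.
    """
    residuals = []
    running = tree_outputs[0]  # the first tree enters the sum unweighted
    for out in tree_outputs[1:]:
        residuals.append(y - running)  # what is still unexplained so far
        running += shrinkage * out     # add this tree's shrunken contribution
    return residuals


# Toy numbers (made up): target 10, constant start 4, two further trees.
res = residual_targets(10.0, [4.0, 5.0, 2.0], shrinkage=0.5)
```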

2.2.2 The end-to-end model

Here, we describe how a given sequential data is fed forward and the corresponding backward pass equations of the joint optimization. Given a multi-layer LSTM network, the input is first windowed and then fed into the first layer of LSTM, denoted by LSTM(1). The window size $T$ determines how many time steps the model should look back for the prediction, i.e., the number of steps an LSTM cell should be unrolled in each layer. Each cell applies (1) sequentially where $\mathbf{h}_0 = \mathbf{c}_0 = \mathbf{0}$, i.e., the initial cell gets the zero vector for both the hidden state and cell state. Hidden states at all time steps ($\mathbf{h}_t^{(k)}$ for the $k^{th}$ layer) are recorded and fed into the next layer of LSTM as the inputs, as shown in the left side of Fig. 3. This process repeats until it reaches the last layer of LSTM whereby a pooling method, $P$, is applied to the hidden states of each cell, $\mathbf{h}_1, \dots, \mathbf{h}_T$. $P$ can be any differentiable operation; here we present three options of last, mean and max pooling as

$$P_{last}(\mathbf{h}_1, \dots, \mathbf{h}_T) = \mathbf{h}_T, \qquad P_{mean}(\mathbf{h}_1, \dots, \mathbf{h}_T) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{h}_t, \qquad P_{max}(\mathbf{h}_1, \dots, \mathbf{h}_T) = \max_{1 \leq t \leq T} \mathbf{h}_t \ \text{(element-wise)}. \tag{5}$$
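The three pooling options named above (last, mean and max) can be sketched as below, treating the hidden states of the last layer as a list of equally sized lists. The function names are hypothetical, and the max is taken element-wise, which is an assumption about how max pooling is applied.

```python
def last_pool(hidden_states):
    """Keep only the hidden state of the rightmost (most recent) cell."""
    return list(hidden_states[-1])


def mean_pool(hidden_states):
    """Average each hidden dimension over all time steps."""
    n_steps = len(hidden_states)
    return [sum(column) / n_steps for column in zip(*hidden_states)]


def max_pool(hidden_states):
    """Take the maximum of each hidden dimension over all time steps."""
    return [max(column) for column in zip(*hidden_states)]


# Two time steps, two hidden units (toy values).
states = [[1.0, 4.0], [3.0, 2.0]]
pooled = {
    "last": last_pool(states),
    "mean": mean_pool(states),
    "max": max_pool(states),
}
```

Last pooling sends the gradient only through the final cell's hidden state, whereas mean pooling spreads it evenly over all time steps.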


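The pooled LSTM output, denoted here by $\mathbf{h}$, carries the features extracted from the raw sequential data and becomes the input vector for the sGBDT, as shown in Fig. 3. Then, as per the gradient boosting machinery, $\mathbf{h}$ is fed into $M$ soft trees, the last $M-1$ of which give an output as described in (4), whereas the first one outputs a constant value irrespective of the input, as explained in Section 2.2.1. Nevertheless, we still call this first component in the chain a tree. The output of the sGBDT is a weighted sum of individual tree outputs. Combining (3) and (4), the overall output of the sGBDT is given by

$$\hat{\mathbf{y}} = sDT^{(1)} + \nu \sum_{m=2}^{M} sDT^{(m)}(\mathbf{h}) = sDT^{(1)} + \nu \sum_{m=2}^{M} \sum_{\ell \in \mathcal{L}^{(m)}} \boldsymbol{\phi}_{\ell}^{(m)} \prod_{i \in \mathcal{A}^{(m)}(\ell)} p_i^{(m)}(\mathbf{h})^{\mathbb{1}[i \swarrow \ell]} \left(1 - p_i^{(m)}(\mathbf{h})\right)^{\mathbb{1}[i \searrow \ell]},$$

where $sDT^{(m)}$ represents the forward pass of the $m^{th}$ tree, $\mathcal{L}^{(m)}$ is the set of leaf nodes in the $m^{th}$ tree and $\mathcal{A}^{(m)}(\ell)$ is the set of ascendant nodes of leaf node $\ell$ of the $m^{th}$ tree. Here $p_i^{(m)}(\mathbf{h})$ is the probability of routing to the left child at internal node $i$, and the indicators $\mathbb{1}[i \swarrow \ell]$ and $\mathbb{1}[i \searrow \ell]$ equal one when $\ell$ is reached through the left or the right branch of node $i$, respectively. We note that the superscript $(m)$ indicates that the quantity belongs to the $m^{th}$ tree. This completes the forward pass of the data.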

Figure 3: The proposed LSTM-SGBDT architecture. The recurrent network in the front end is tasked to extract a feature vector from the raw sequential data. This extracted feature vector is a summary (i.e., pooling) of the last layer’s hidden states of all time steps. It is then fed to the soft GBDT in the back end where the weak learners are soft decision trees. Boosted trees act as a supervised regressor; however, the need for hand-designed features is mitigated. The total loss is backpropagated through the whole architecture, achieving an end-to-end optimization.

We fit this architecture in an end-to-end manner using backpropagation with gradient descent. For the backward pass equations, without loss of generality, we focus on the case where values produced at leaf nodes of the trees,

, are scalar quantities; and we use the mean squared error for the loss function. Further, for brevity, we assume a 1-layer LSTM and let

be the augmented version of the pooled LSTM output (i.e., it has a prepended to it so that and described in Section 2.2.1 are collapsed into one vector, ). Let with be the input vector (possibly augmented with side information ; e.g., fed into LSTM cell when unrolled and be the ground truth, which is a real number. The total loss obtained from the soft trees, as shown in Fig. 3, can be written as

where is the output of the tree as in (4), is the shrinkage rate parameter that helps for regularization gbdt_friedman and is the residual that becomes the supervised signal the tree is fitted to, i.e.,

We now present the gradient of the total loss with respect to the learnable parameters. For the sGBDT, we have for each leaf node and for each internal node in the tree. The gradients are given as


where is the number of weak learners (soft decision trees), is the set of descendants of the internal node , is the set of left descendants of the internal node , i.e., it consists of the nodes in the sub-tree rooted at node which are reached from by first selecting the left branch of . (7) and (8) give the necessary components for the update equations of the learnable parameters of the sGBDT, namely, and for each tree and node.

For the joint optimization, we also derive the gradient of the loss with respect to , which is given as


where is the set of ascendants of the leaf node , is the set of right ascendants of the leaf node , i.e., it consists of the nodes in the tree from which by first selecting the right branch, there is a path that leads to . With (9), the backpropagation path until the pooled output of LSTM is completed.

We assume last pooling, i.e., of (5) for the derivation such that the error propagates through the hidden state of the last cell. The gradients of the total loss with respect to LSTM parameters are given by


where denotes the set ,

starting from the hidden state of the rightmost cell (i.e, ) with given by (9) and denotes the square matrix inside the stacked that multiplies in (1). results in a square, diagonal matrix where the diagonal entries are given by its vector operand. Finally, are given by

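Equations (7) to (11) give the required backward pass equations of the whole architecture. The corresponding stochastic gradient update equations are

$$\Delta\boldsymbol{\theta} = -\eta\, \frac{\partial L}{\partial \boldsymbol{\theta}}$$

for each learnable parameter $\boldsymbol{\theta}$ of the sGBDT and the LSTM, where $\Delta\boldsymbol{\theta}$ represents the change in the variable between two successive iterations and $\eta$ is the learning rate hyperparameter.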
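A stochastic gradient step of this kind can be sketched generically as follows. The dictionary layout, the parameter name `phi` and the function `sgd_step` are hypothetical; in practice the gradients would come from the backward pass equations rather than being typed in by hand.

```python
def sgd_step(params, grads, learning_rate):
    """Return updated parameters: each value moves against its gradient."""
    updated = {}
    for name, values in params.items():
        updated[name] = [
            value - learning_rate * grad
            for value, grad in zip(values, grads[name])
        ]
    return updated


# Toy leaf values and made-up gradients, for illustration only.
params = {"phi": [1.0, 2.0]}
grads = {"phi": [0.5, -1.0]}
new_params = sgd_step(params, grads, learning_rate=0.1)
```

Since both the tree parameters and the LSTM weights receive gradients from the same loss, a single such step updates the whole architecture at once.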

3 Experiments

In this section, we illustrate the performance of the proposed architecture under different scenarios. In the first part, we demonstrate the learning advantages of the proposed model with respect to disjoint structures through synthetic datasets. We then consider the regression performance of the model over well-known real life datasets such as bank bank_dataset , elevators elev_puma_dataset , kinematics kine_dataset and pumadyn elev_puma_dataset datasets. We then perform simulations over the hourly, daily and yearly subsets of the M4 competition m4_comp dataset. We lastly consider two financial datasets, Alcoa stock price alcoa_dataset and Hong Kong exchange rate data against USD hk_dataset .

3.1 Architecture Verification

In this section, we show the advantages of our end-to-end optimization framework with respect to the classical disjoint framework, i.e., where one first extracts features and then optimizes the regressor. For controlled experiments, we first work with artificial data generated through an LSTM network. The inputs are drawn from an i.i.d. standard Gaussian distribution and the targets, $y_t$, are formed as the output of a fixed LSTM network with a given seed. The “disjoint” model consists of an LSTM for feature extraction and a hard GBDT for regressing over those features. We note that the mechanism, as stated in disj_paper, requires fitting the models independently. On the other hand, our “joint” model has an LSTM in the front end and a soft GBDT in the back end, while the optimization is now end-to-end. We generate 1,000 artificial samples with a fixed seed and set aside 20% of the data as the test set. Both models are trained for 100 epochs. At the end of each epoch, we evaluate both models on the test data and record their prediction performance in RMSE. The results are shown in Fig. 4. We observe that the disjoint model struggles to reach an optimal solution and its performance stabilizes after around 40 epochs, while the joint architecture has steadily decreasing error on the test set. This shows the advantage of employing end-to-end optimization and verifies the working characteristics of our algorithm.

Figure 4: Learning curve comparison with a disjoint architecture. The joint architecture not only mitigates training separate models but also achieves closer to optimal performance compared to the disjoint architecture.
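The evaluation protocol described above (a 20% held-out test set and a per-epoch test RMSE) can be sketched as follows. Holding out the last 20% of the samples is an assumption made here for illustration, since the text does not state how the split is drawn; `holdout_split` and `rmse` are hypothetical helper names.

```python
import math


def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences."""
    n = len(targets)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n)


def holdout_split(samples, test_fraction=0.2):
    """Split into (train, test), holding out the final test_fraction of samples."""
    n_test = int(round(len(samples) * test_fraction))
    cut = len(samples) - n_test
    return samples[:cut], samples[cut:]


samples = list(range(1000))
train, test = holdout_split(samples, test_fraction=0.2)
error = rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0])
```

Calling `rmse` on the test portion after every epoch yields a learning curve of the kind compared in Fig. 4.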

We further test the integrity of our end-to-end model using datasets that are produced using both LSTM and GBDT. Through this, we check the learning mechanism of the model in various controlled scenarios. First, we generate a random sequential input from an i.i.d. standard Gaussian distribution and feed it into an LSTM network whose weights are fixed with a random seed. Its output is then fed to a fixed soft GBDT to get the ground truths. The architecture is then tasked with replicating these ground truth values as closely as possible given the same random inputs. This tests whether the model is able to tune its parameters in an end-to-end manner to fit the given synthetic dataset. Secondly, we feed the above randomly generated data directly into a soft GBDT and get the ground truths as its output. We then train the model with this input-output pair, expecting that the LSTM part in the front end learns the identity mapping. Lastly, we again use a soft GBDT to generate the same input-output pair, but train the overall model with inputs obtained by multiplying the randomly generated design matrix with a random constant matrix of suitable shape. This configuration aims for the LSTM to learn an inverse mapping. We generate 1,000 samples in all three configurations.

In Fig. 5, we observe that in all configurations, the network reaches a near-zero root mean squared error (RMSE) between its predictions and the ground truths. Convergence is very fast in the replication and identity tasks, suggesting that the model is capable of learning from the individual parts and that the joint optimization readily tunes the parameters of the LSTM network. In the inverse task, convergence is slower since the input was subjected to a random projection. Still, the LSTM in the front end learns an inverse mapping, verifying the integrity of the end-to-end model.

Figure 5: Integrity verification results of the hybrid model for different configurations.
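The data construction for the inverse-mapping configuration can be sketched as follows. This is only an illustration under stated assumptions: `toy_teacher` is a hypothetical stand-in for the fixed soft GBDT (any fixed function of a sample would do), the random constant matrix is assumed to be square and applied by right-multiplication, and each sample is treated as a flat feature vector rather than a sequence.

```python
import random


def gaussian_matrix(rows, cols, rng):
    """Matrix of i.i.d. standard Gaussian entries, stored as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def make_inverse_task(n_samples, n_features, teacher, seed=0):
    """Targets come from the original design matrix; inputs are its projection."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    design = gaussian_matrix(n_samples, n_features, rng)
    projection = gaussian_matrix(n_features, n_features, rng)  # assumed square
    targets = [teacher(row) for row in design]
    inputs = matmul(design, projection)
    return inputs, targets


def toy_teacher(row):
    """Hypothetical placeholder for the fixed soft GBDT generator."""
    return sum(row)


inputs, targets = make_inverse_task(1000, 4, toy_teacher, seed=42)
```

A model trained on `(inputs, targets)` has to undo the projection before it can reproduce the teacher's outputs, which is why this configuration is expected to converge more slowly than the replication and identity tasks.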

3.2 Real Life Datasets

In this section, we evaluate the performance of the proposed model in an online learning setting. To this end, we consider four real life datasets.

  • Bank dataset [bank_dataset] simulates queues in a series of banks, with customers who vary in task complexity and patience. We aim to predict the fraction of customers that leave the queue without getting their task done. There are 8,192 samples and each sample has 32 features.

  • Elevators dataset [elev_puma_dataset] is obtained from experiments on controlling an F16 aircraft; the aim is to predict a numeric variable related to an action on the elevator of the plane, given 18 features. The dataset has 16,599 samples.

  • Kinematics dataset [kine_dataset] consists of 8,192 samples that result from a realistic simulation of the dynamics of an 8-link all-revolute robot arm. We aim to predict the distance of the end-effector from a given target using 8 features.

  • Pumadyn dataset [elev_puma_dataset], similar to the Kinematics dataset, contains a realistic simulation of a Puma 560 robot arm. The goal is to predict the angular acceleration of one of the robot arm’s links. There are 8,192 samples and 32 features.

We compare our model with the conventional LSTM architecture on these four datasets. To assess the performance of online regression, we measure the time accumulated mean squared error of each model, which is computed as the cumulative sum of the MSEs normalized by the data length at a given time. In Table 1, we report the cumulative error at the last time step for both models. We observe that our joint model performs better on all datasets compared to the conventional LSTM architecture. In particular, in the Elevators dataset, where the sample size is the largest, the difference in normalized cumulative error is the greatest. This suggests that the pure LSTM architecture is unable to learn from new data in the long term as well as the joint architecture, highlighting the role of the embedded soft GBDT as the final regressor. Similarly, in the Pumadyn dataset, where the number of features is the largest, there is an order of magnitude difference in performance. This underlines the importance of avoiding hand-designed features and letting the LSTM in the front end extract the necessary information from the raw features to feed to the regressor at the end.

             Bank     Elevators   Kinematics   Pumadyn
LSTM         0.0190   0.0119      0.0884       0.0094
LSTM-SGBDT   0.0151   [missing]   0.0787       0.0009
Table 1: Cumulative errors of the models on Bank, Elevators, Kinematics and Pumadyn datasets.
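The time accumulated error used for the online comparison can be sketched as a running average of the per-step squared errors. This is a minimal sketch assuming scalar targets and one prediction per time step; the function name is hypothetical.

```python
def time_accumulated_mse(y_true, y_pred):
    """Return the curve whose t-th entry is the sum of squared errors
    up to step t, divided by t (steps counted from 1)."""
    running_sum = 0.0
    curve = []
    for t, (target, prediction) in enumerate(zip(y_true, y_pred), start=1):
        running_sum += (target - prediction) ** 2
        curve.append(running_sum / t)
    return curve


# Squared errors are 0, 4, 0, so the curve is 0/1, 4/2, 4/3.
curve = time_accumulated_mse([1.0, 2.0, 3.0], [1.0, 0.0, 3.0])
```

The last entry of the curve corresponds to the kind of single number reported per model and dataset in Table 1, while the full curve is what a plot of error against data length would show.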

3.3 M4 Competition Datasets

The M4 competition [m4_comp] provided 100,000 time series of varying length with yearly, quarterly, monthly, weekly, daily and hourly frequencies. In our experiments, we chose the yearly, daily and hourly datasets. The yearly dataset consists of 23,000 very short series with an average length of 31.32 years. Including this dataset therefore assesses the performance of the models under sparse data conditions. Next, the daily dataset has 4,227 very long series with an average length of 2357.38 days. This dataset helps assess how effective a model is in capturing long-term trends and time shifts. Lastly, the hourly dataset consists of 414 series. We include this subset because of its dominant seasonal component. We experiment on all of the hourly series and on 500 randomly selected series from each of the daily and yearly datasets. The prediction horizons for the hourly, daily and yearly datasets are 48, 14 and 6 time steps, respectively.

We compare our model (LSTM-SGBDT) with the classic LSTM model presented in Section 2.1. For each dataset, we separate a hold-out subseries from the end of each time series that is as long as the desired prediction horizon. For example, a series in the hourly dataset has its last 48 samples held out for validation of the hyperparameters. For the LSTM part, we tune the number of layers, the number of hidden units in each layer and the learning rate. For our architecture, we also search over the pooling method, the number of soft trees, the depth of each tree and the shrinkage rate in boosting. We use MAPE (mean absolute percentage error) to evaluate the models’ performance across series as it is scale-independent. It is defined as

$$\mathrm{MAPE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{H} \sum_{i=1}^{H} \frac{|y_i - \hat{y}_i|}{|y_i|},$$

where $H$ is the prediction horizon (e.g., 14 days for the daily dataset), and $\mathbf{y}$ and $\hat{\mathbf{y}}$ denote the vectors of true and predicted values over the horizon, respectively. We only apply standardization to the values in the preprocessing stage.
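As a minimal sketch of the metric, the function below computes MAPE over one forecast horizon. It assumes the standard fractional form (no factor of 100, in line with the magnitude of the reported scores) and rejects zero true values, since the ratio is undefined there; the function name is hypothetical.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error over a forecast horizon, as a fraction."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Both forecasts are off by 10% of the true value, so the score is 0.1.
score = mape([100.0, 200.0], [110.0, 180.0])
```

Because each error is divided by the magnitude of the true value, series with very different scales contribute comparably to the mean across a dataset.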

The mean MAPE scores of the models across the datasets are shown in Table 2. The LSTM-SGBDT model achieves better performance in all three M4 datasets. In the yearly dataset, where the performance difference is the smallest, the LSTM-SGBDT model struggles less than the vanilla LSTM against the sparse data. This indicates that the soft GBDT part helps reduce overfitting of the data-hungry LSTM network and acts as a regularizer. LSTM-SGBDT also outperforms the vanilla LSTM in the daily dataset by a wider margin. Given that the daily dataset consists of very long series, this suggests that LSTM-SGBDT is able to model long-term dependencies better than LSTM. Lastly, our model achieves better performance than LSTM across the whole M4 hourly dataset. This suggests that our model is better at detecting seasonality patterns, which are dominant in this particular dataset. We note that no preprocessing specific to seasonality extraction has been applied. The test part of an example series from the hourly dataset, along with the predictions of LSTM, LSTM-SGBDT and the naive model that always predicts the last observed value, is given in Fig. 6. We observe that the vanilla LSTM model fails to capture the amplitude of the seasonal pattern. It also tends to predict the last observed value (i.e., the same as the naive predictions) for many time steps, a problem that LSTM-based models are known to suffer from [lstm_naive]. Our model, on the other hand, successfully captures the seasonal pattern and closely follows the actual values.

             Hourly    Daily     Yearly
LSTM         0.20282   0.06448   0.18032
LSTM-SGBDT   0.12290   0.01658   0.15852
Table 2: Mean MAPE performances of the models across M4 datasets.

Figure 6: Naive, LSTM and LSTM-SGBDT predictions over the forecast horizon of an M4 hourly series.
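The hold-out split from the end of a series and the naive baseline that repeats the last observed value can be sketched as follows. This is an illustrative sketch with hypothetical function names; the toy series is made up and is not M4 data.

```python
def split_holdout(series, horizon):
    """Split a series into its history and the final `horizon` samples."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be positive and shorter than the series")
    return series[:-horizon], series[-horizon:]


def naive_forecast(history, horizon):
    """Predict the last observed value for every step of the horizon."""
    if len(history) == 0:
        raise ValueError("history must not be empty")
    return [history[-1]] * horizon


# Toy series of 100 points with the hourly horizon of 48 steps.
series = [float(i) for i in range(100)]
history, holdout = split_holdout(series, horizon=48)
forecast = naive_forecast(history, horizon=48)
```

A flat forecast of this kind cannot follow a seasonal pattern, which is why a model that collapses to it misses the amplitude of the hourly series.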

3.4 Financial Datasets

In this section, we evaluate the performance of the models under two financial scenarios. We first consider the Alcoa stock price dataset [alcoa_dataset], which contains the daily stock price values between the years 2000 and 2020. We aim to predict future price values from past prices. We only apply standardization to the price values, in a cumulative manner: the scaler updates its stored mean and standard deviation as more data are revealed. We hold out an initial portion of the data and use it for hyperparameter validation. We choose a window size of 5, so that the stock prices of the last five trading days are used by the models to predict today’s price. The tuned hyperparameters are the same as those in the M4 datasets for both models. Fig. 7 illustrates the performance of the models as the data length varies. We observe that our model consistently achieves a lower accumulated error than the vanilla LSTM model. Furthermore, the LSTM model is slower to react to the abrupt change in the stock prices around a data length of 2,000, which corresponds to the global financial crisis in late 2008.

Figure 7: Time accumulated errors of LSTM and LSTM-SGBDT models for the Alcoa stock price dataset for the period 2000-2020.
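The cumulative standardization and the five-day windowing can be sketched as follows. This is a minimal sketch rather than the authors' implementation: it uses Welford's online update for the running mean and variance, assumes the population standard deviation, and maps values to 0.0 while the standard deviation is still zero; the class and function names are hypothetical.

```python
import math


class CumulativeScaler:
    """Standardizer whose mean and standard deviation grow with the data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one newly revealed value into the statistics (Welford update)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        """Population standard deviation of the values seen so far."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / self.count)

    def transform(self, value):
        """Standardize a value with the current statistics."""
        std = self.std
        return (value - self.mean) / std if std > 0 else 0.0


def sliding_windows(values, window=5):
    """Pair each run of `window` consecutive values with the value that follows."""
    return [(values[i:i + window], values[i + window])
            for i in range(len(values) - window)]


scaler = CumulativeScaler()
for price in [1.0, 2.0, 3.0, 4.0]:
    scaler.update(price)
pairs = sliding_windows(list(range(8)), window=5)
```

Updating the scaler only with values that have already been revealed keeps future prices out of the statistics, which matters in an online setting.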

Apart from the Alcoa stock price dataset, we also experiment with the Hong Kong exchange rate dataset [hk_dataset], which has the daily rate of Hong Kong dollars against US dollars between 2005/04/01 and 2007/03/30. Our goal is to predict future exchange rates using the data of the previous five days. We use the same setup for the hyperparameter configuration as in the Alcoa dataset. The time accumulated errors for both models are presented in Fig. 8. We observe that while both models follow a similar error pattern, our model achieves a lower cumulative error for almost all time steps. We also notice that the gap between the errors becomes wider as more data are revealed.

Figure 8: Time accumulated errors of LSTM and LSTM-SGBDT models for the Hong Kong exchange rate against U.S. dollars.

4 Conclusions

We studied nonlinear prediction/regression in an online setting and introduced a hybrid architecture composed of an LSTM and a soft GBDT. The recurrent network in the front end acts as a feature extractor for the raw sequential data, and the boosted trees in the back end serve as the supervised regressor. We thereby remove the need for hand-designed features for the boosted trees while enjoying joint optimization of the end-to-end model, thanks to the soft decision trees used as the weak learners in the boosting chain. We derived the gradient updates to be used in backpropagation for all the parameters and also empirically verified the integrity of the architecture. We note that our framework is generic, so that one can use other deep learning architectures for feature extraction (such as RNNs and GRUs) and other machine learning algorithms for decision making as long as they are differentiable. We achieve consistent and significant performance gains over conventional methods in our experiments on various well-known real life datasets. We also provide the source code for replicability.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References

  • (1) J. Nowicka-Zagrajek, R. Weron, Modeling electricity loads in california: Arma models with hyperbolic noise, Signal Processing 82 (12) (2002) 1903–1915. doi:https://doi.org/10.1016/S0165-1684(02)00318-3.
  • (2) X. Zheng, C. Zhang, C. Wan, Mirna-disease association prediction via non-negative matrix factorization based matrix completion, Signal Processing 190 (2022) 108312. doi:https://doi.org/10.1016/j.sigpro.2021.108312.
  • (3) M. J. M. Spelta, W. A. Martins, Normalized lms algorithm and data-selective strategies for adaptive graph signal estimation, Signal Processing 167 (2020) 107326. doi:https://doi.org/10.1016/j.sigpro.2019.107326.
  • (4) C. Soguero-Ruiz, F. J. Gimeno-Blanes, I. Mora-Jiménez, M. del Pilar Martínez-Ruiz, J. L. Rojo-Álvarez, Statistical nonlinear analysis for reliable promotion decision-making, Digital Signal Processing 33 (2014) 156–168. doi:https://doi.org/10.1016/j.dsp.2014.06.014.
  • (5) L. Liu, L. Shao, P. Rockett, Human action recognition based on boosted feature selection and naive bayes nearest-neighbor classification, Signal Processing 93 (6) (2013) 1521–1530, special issue on Machine Learning in Intelligent Image Processing.
  • (6) X. An, C. Hu, G. Liu, H. Lin, Distributed online gradient boosting on data stream over multi-agent networks, Signal Processing 189 (2021) 108253. doi:https://doi.org/10.1016/j.sigpro.2021.108253.
  • (7) F. Petropoulos et al., Forecasting: theory and practice, International Journal of Forecasting (Jan 2022). doi:10.1016/j.ijforecast.2021.11.001.
  • (8) A. Singer, G. Wornell, A. Oppenheim, Nonlinear autoregressive modeling and estimation in the presence of noise, Digital Signal Processing: A Review Journal 4 (4) (1994) 207–221.
  • (9) R. Tanno, K. Arulkumaran, D. Alexander, A. Criminisi, A. Nori, Adaptive neural trees, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 6166–6175.
  • (10) A. C. Tsoi, Gradient based learning methods, Springer Berlin Heidelberg, Berlin, Heidelberg, 1998, pp. 27–62. doi:10.1007/BFb0053994.
  • (11) H. Xu, R. Ma, L. Yan, Z. Ma, Two-stage prediction of machinery fault trend based on deep learning for time series analysis, Digital Signal Processing 117 (2021) 103150. doi:https://doi.org/10.1016/j.dsp.2021.103150.
  • (12) I. Crignon, B. Dubuisson, G. Le Guillou, The ternary decision tree: A multistage classifier with reject, Signal Processing 5 (5) (1983) 433–443.
  • (13) B. C. Civek, I. Delibalta, S. S. Kozat, Highly efficient hierarchical online nonlinear regression using second order methods, Signal Processing 137 (2017) 22–32. doi:https://doi.org/10.1016/j.sigpro.2017.01.029.
  • (14) J. H. Friedman, Greedy function approximation: A gradient boosting machine., The Annals of Statistics 29 (5) (2001) 1189 – 1232. doi:10.1214/aos/1013203451.
  • (15) T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, ACM, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.
  • (16) G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, Lightgbm: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., 2017.
  • (17) S. Makridakis, E. Spiliotis, V. Assimakopoulos, The m5 competition: Background, organization, and implementation, International Journal of Forecasting (2021). doi:https://doi.org/10.1016/j.ijforecast.2021.07.007.
  • (18) H. Wang, J. Wu, S. Yuan, J. Chen, On characterizing scale effect of chinese mutual funds via text mining, Signal Processing 124 (2016) 266–278, big Data Meets Multimedia Analytics. doi:https://doi.org/10.1016/j.sigpro.2015.05.018.
  • (19) J. Peng, W. Li, Q. Ling, Byzantine-robust decentralized stochastic optimization over static and time-varying networks, Signal Processing 183 (2021) 108020. doi:https://doi.org/10.1016/j.sigpro.2021.108020.
  • (20) J. Feng, Y. Xu, Y. Jiang, Z. Zhou, Soft gradient boosting machine, CoRR abs/2006.04059 (2020). arXiv:2006.04059.
  • (21) O. Irsoy, O. T. Yildiz, E. Alpaydin, Soft decision trees, in: International Conference on Pattern Recognition, 2012.
  • (22) N. Frosst, G. E. Hinton, Distilling a neural network into a soft decision tree, CoRR abs/1711.09784 (2017). arXiv:1711.09784.
  • (23) P. Kontschieder, M. Fiterau, A. Criminisi, S. R. Bulò, Deep neural decision forests, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1467–1475.
  • (24) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • (25) G. Biau, E. Scornet, J. Welbl, Neural random forests, CoRR abs/1604.07143 (2016). arXiv:1604.07143.
  • (26) H. Hazimeh, N. Ponomareva, P. Mol, Z. Tan, R. Mazumder, The tree ensemble layer: Differentiability meets conditional computation, in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 4138–4148.
  • (27) H. Chen, S. M. Lundberg, S. Lee, Hybrid gradient boosting trees and neural networks for forecasting operating room data, CoRR abs/1801.07384 (2018). arXiv:1801.07384.
  • (28) Y. Huang, Y. He, R. Lu, X. Li, X. Yang, Thermal infrared object tracking via unsupervised deep correlation filters, Digital Signal Processing 123 (2022) 103432. doi:https://doi.org/10.1016/j.dsp.2022.103432.
  • (29) S. Sadrizadeh, H. Otroshi-Shahreza, F. Marvasti, Impulsive noise removal via a blind cnn enhanced by an iterative post-processing, Signal Processing 192 (2022) 108378. doi:https://doi.org/10.1016/j.sigpro.2021.108378.
  • (30) K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1724–1734.
  • (31) S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, CoRR abs/1803.01271 (2018). arXiv:1803.01271.
  • (32) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–80. doi:10.1162/neco.1997.9.8.1735.
  • (33) F. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with lstm, Neural computation 12 (2000) 2451–71. doi:10.1162/089976600300015015.
  • (34) K. Greff, R. Srivastava, J. Koutník, B. Steunebrink, J. Schmidhuber, Lstm: A search space odyssey, IEEE transactions on neural networks and learning systems 28 (03 2015). doi:10.1109/TNNLS.2016.2582924.
  • (35) L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, Wadsworth and Brooks, Monterey, CA, 1984.
  • (36) Y. Freund, R. E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning, ICML’96, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996, pp. 148–156.
  • (37) L. Torgo, Regression data sets.
    URL https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
  • (38) J. Alcala-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sanchez, F. Herrera, Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing 17 (2010) 255–287.
  • (39) C. E. Rasmussen, R. M. Neal, G. Hinton, D. Camp, M. Revow, Z. Ghahramani, R. Kustra, R. Tibshirani, Delve data sets.
  • (40) S. Makridakis, E. Spiliotis, V. Assimakopoulos, The m4 competition: 100,000 time series and 61 forecasting methods, International Journal of Forecasting 36 (1) (2020) 54–74, m4 Competition. doi:https://doi.org/10.1016/j.ijforecast.2019.04.014.
  • (41) Alcoa Inc., Common stock.
    URL https://finance.yahoo.com/quote/AA
  • (42) E. W. Frees, Regression Modeling with Actuarial and Financial Applications, International Series on Actuarial Science, Cambridge University Press, 2009. doi:10.1017/CBO9780511814372.
  • (43) M. Du, Improving LSTM neural networks for better short-term wind power predictions, CoRR abs/1907.00489 (2019). arXiv:1907.00489.