Bilinear Input Normalization for Neural Networks in Financial Forecasting

09/01/2021
by   Dat Thanh Tran, et al.
Tampere Universities

Data normalization is one of the most important preprocessing steps when building a machine learning model, especially when the model of interest is a deep neural network. This is because deep neural networks optimized with stochastic gradient descent are sensitive to the input variable range and prone to numerical issues. Different from other types of signals, financial time-series often exhibit unique characteristics such as high volatility, non-stationarity and multi-modality that make them challenging to work with, often requiring expert domain knowledge to devise a suitable processing pipeline. In this paper, we propose a novel data-driven normalization method for deep neural networks that handle high-frequency financial time-series. The proposed normalization scheme, which takes into account the bimodal characteristic of financial multivariate time-series, requires no expert knowledge to preprocess a financial time-series since this step is formulated as part of the end-to-end optimization process. Our experiments, conducted with state-of-the-art neural networks and high-frequency data from two large-scale limit order books coming from the Nordic and US markets, show significant improvements over other normalization techniques in forecasting future stock price dynamics.


I Introduction

Nowadays, the world's economic and social development and well-being are heavily influenced by financial markets. People participate in financial activities, which promote the circulation of assets and the development of the world economy, with the ultimate goal of gaining economic benefits. Under this light, the success of the participants depends largely on the quality and quantity of information that they possess, as well as their ability to interpret this information for decision-making. Because of this, computational intelligence in finance, which utilizes modern computing methodologies to analyze financial markets for decision-making, has attracted many researchers and practitioners from both academia and industry. Representative topics under this discipline include stock market forecasting [33, 39], algorithmic trading [25, 13], risk assessment [16, 10], asset pricing [5, 19], and portfolio allocation and optimization [7, 2]. Among these objectives, a substantial amount of research effort has been dedicated to prediction and forecasting since financial decision-making, for the most part, depends on reliable projections about the future.

There are two common approaches, namely fundamental analysis [32] and technical analysis [21], which are currently adopted in predicting future market behaviors. In fundamental analysis, valuation techniques take into account different economic indicators that reflect and affect the market movements to establish long-term views on the development of a financial entity. On the other hand, in technical analysis, it is generally believed that the prices themselves already encompass all factors that affect the market dynamics. For this reason, technical analysts construct forecasting models based on series of historical transactions with the assumption that history tends to repeat itself [21], and the underlying processes, which generate the observed series, can be captured by mathematical or computational models.

Although financial time-series forecasting has been extensively studied over the past decades, with a large body of literature dedicated to tackling specific problems, there are still many challenges in processing and analyzing data derived from financial markets, especially those coming from high-frequency intra-day activities. Over time, the development of internet technologies, database systems and electronic trading platforms has enabled us to collect a vast amount of digital footprints of the financial market. Enormous volumes of data, while ensuring the statistical significance of any analysis, also create a great computational challenge when building financial prediction models. The computational aspect is especially critical for trading applications that take advantage of statistical arbitrage, which usually exists only for a very short time before the market corrects itself [1]. Another challenge posed by financial time-series comes from the fact that they are usually complex, noisy, nonlinear and nonstationary in nature, which leads to difficulties not only in modeling but also in preprocessing.

Techniques for financial time-series prediction fall into two categories: traditional statistical models and machine learning models. In the stochastic-model-based approach, a linear relationship is often assumed between the independent variables. Representative tools in this category include the autoregressive integrated moving average (ARIMA) model and its variants, or generalized autoregressive conditional heteroskedasticity (GARCH) [9], to name a few. While stochastic models often possess nice theoretical properties, their underlying assumptions are often too strong, leading to poor generalization performance on real-world data. On the other hand, machine learning models, which make no prior statistical or structural assumption, are often capable of modeling complex nonlinear relationships among the independent factors and the prediction targets. For this reason, machine learning models often generalize better than stochastic models in many forecasting scenarios [15, 27].

Among different types of machine learning models, neural networks are the leading solutions for many financial forecasting problems nowadays [17, 33, 39, 35, 8]. The majority of these solutions were adopted from computer vision (CV) and natural language processing (NLP) applications, where neural networks have demonstrated unprecedented successes in the last decade. Although future market prediction based on historical time-series can be cast as a pattern recognition problem similar to those encountered in CV and NLP, and can thus be treated with some degree of success using tools from CV and NLP, the unique characteristics of financial data make market prediction tasks fundamentally different and require special treatment. The majority of problems targeted in CV and NLP concern solving cognitive tasks in which the data is intuitive and well understood by ordinary people, such as recognizing objects or understanding natural language. Historical financial phenomena, on the other hand, are difficult even for human experts to recognize or interpret, not to mention using them to speculate about the future. In addition, images, videos or speech, for example, are well-behaved signals in the sense that their value range and variances are known, and they can be easily processed without losing the essential information within them, while financial time-series are highly volatile and often exhibit concept drift phenomena [4, 12], i.e., dynamic changes in the relationship between independent and target variables over time. Because of this, data preprocessing is an important procedure when working with financial time-series.

Among the many preprocessing steps, data normalization, which is one of the most essential steps before building a machine learning model, aims at transforming input variables into a common range to avoid the potential bias induced by large numbers. For deep neural networks, improperly normalized data can easily lead to numerical issues with the gradient updates. In the literature, there are many normalization methods, such as z-score normalization, min-max normalization, Pareto scaling and power transformation, to name a few [31]. These normalization methods utilize global data statistics, such as the mean, standard deviation or maximum value, to transform the data. For financial time-series, especially those covering long periods, replacing global statistics with local statistics computed over the recent history is a common practice to avoid the problem of potential regime shifts, in which recent observations have a significantly different value range than past observations. To deal with this phenomenon, several more sophisticated methods have also been proposed, for example [28, 22].
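To make the idea of local statistics concrete, the following is a minimal NumPy sketch of rolling-window z-score normalization; the function name, the trailing-window design and the default window length are illustrative choices on our side, not part of the methods cited above.

```python
import numpy as np

def rolling_zscore(series: np.ndarray, window: int = 100) -> np.ndarray:
    """Normalize each value with the mean/std of a trailing window of recent history."""
    out = np.zeros_like(series, dtype=float)
    for t in range(len(series)):
        lo = max(0, t - window)
        hist = series[lo:t + 1]                      # local history ending at time t
        mu, sigma = hist.mean(), hist.std()
        out[t] = (series[t] - mu) / (sigma + 1e-8)   # epsilon guards against zero std
    return out
```

Compared to a single global z-score transform, such a rolling scheme adapts to regime shifts at the cost of choosing the window length by hand.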

While many static normalization schemes have been developed, as described above, we are only aware of one prior work [26] that proposed an adaptive normalization method for input time-series. Different from static approaches, an adaptive data-driven method transforms the raw input data using statistics that are identified and learned via optimization. That is, the normalization step is implemented as the first layer in a computation graph, with all parameters jointly estimated using stochastic gradient descent. In fact, one of the reasons that neural networks work so well is that they are estimated in an end-to-end manner, being able to learn data-dependent transformations. Thus, we argue that the normalization step for input time-series should also be learned in the same end-to-end manner when employing neural networks in financial forecasting.

In this paper, we propose Bilinear Input Normalization (BiN), a neural network layer that takes into account the bimodal nature of multivariate time-series and performs the input data transformation using parameters that are jointly estimated with the other parameters in the network. Preliminary results of this work were presented in [34], which included a limited analysis and empirical evaluation of BiN for Temporal Attention Augmented Bilinear Layer (TABL) networks. In this paper, we provide a more detailed, in-depth presentation and discussion of the proposed method, as well as extensive experiments with another state-of-the-art (SoTA) architecture in financial forecasting, using stock market data from two different markets (US and Nordic).

The remainder of the paper is organized as follows. In Section II, we review related work on data normalization methods, with a focus on normalization schemes for neural networks. Section III describes in detail the motivation and operations of the Bilinear Input Normalization layer. In Section IV, we provide basic information regarding limit order books and describe the problem of predicting stock mid-price dynamics using limit order book data, which is followed by the experimental setup, dataset description, the results and our analysis. Section V concludes our work.

II Related Work

Normalization is a scaling or transformation operation, usually linear, that ensures a uniform value range between different data dimensions, reducing the effects of dominant values and outliers [11]. Perhaps the most common normalization method is z-score normalization, which centers the data around the origin with unit standard deviation. There are also works that only center the data, without the scaling step of z-score normalization. The steps in Pareto scaling [23] are similar to z-score normalization, except that the data is divided by the square root of the standard deviation rather than the standard deviation itself. A generalization of z-score normalization is the variable stability scaling method [38], which multiplies the z-score standardized data by the ratio between the mean and the standard deviation of the data. Power transformation is another normalization method employing the mean statistic to reduce the effects of heteroscedasticity [18]. Besides the data's mean and variance, the minimum, maximum and median values are also utilized in normalization, as in min-max normalization and median and median-absolute-deviation normalization. For interested readers, we refer to the analysis of different static data normalization techniques in machine learning models in [31].

The term data normalization is often understood as the operation that preprocesses the raw data, i.e., the input data. However, in neural networks, normalization operations are also popular in hidden layers. This is due to the fact that different layers in a deep network can encounter significant input distribution shift during stochastic gradient updates, and normalization can be used to help stabilize and improve the training process. Batch Normalization (BN) was proposed for Convolutional Neural Networks for such a purpose [14]. Since stochastic gradient descent only operates in a mini-batch manner, the mini-batch mean and variance are accumulated in a moving-average fashion to estimate the global mean and variance in BN. After subtracting the mean and dividing by the standard deviation, BN also learns to scale and shift the hidden representations. Instead of the mini-batch statistics, Instance Normalization (IN) [37] uses sample-level statistics, and learns how to normalize each image so that its contrast matches that of a predefined style image in visual style transfer problems. Both BN and IN were originally proposed for visual data, although BN has also been widely used in NLP.
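The difference between batch-level and sample-level statistics can be seen directly in a small PyTorch sketch; the tensor shape (batch, features, time) and the layer sizes below are illustrative, and this is standard library usage rather than anything specific to the methods above.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 40, 10)                  # (batch, feature channels, time steps)

bn = nn.BatchNorm1d(40)                      # one mean/var per channel, over batch and time
inorm = nn.InstanceNorm1d(40, affine=True)   # one mean/var per channel, per sample, over time

y_bn = bn(x)     # every sample is normalized with the same mini-batch statistics
y_in = inorm(x)  # each sample is normalized with its own statistics
```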

Both BN and IN are adaptive data-driven normalization schemes. However, they were proposed to normalize hidden representations, and they are not commonly used for input normalization. Regarding adaptive input normalization methods for time-series, we are only aware of the work in [26], which formulated a three-stage normalization procedure called Deep Adaptive Input Normalization (DAIN). Since DAIN is directly related to our proposed method, we describe DAIN in more detail here.

In this paper, let us denote a multivariate input series as $\mathbf{X} \in \mathbb{R}^{D \times T}$, where $D$ denotes the number of univariate series and $T$ denotes the temporal length of each series. Here $D$ and $T$ are also referred to as the feature and temporal dimensions, respectively. In addition, we denote the $t$-th column of $\mathbf{X}$ as $\mathbf{x}_t \in \mathbb{R}^{D}$, which is the representation of the series at time index $t$. We also refer to $\mathbf{x}_t$ as the $t$-th temporal slice. The first step of DAIN is to shift every temporal slice in $\mathbf{X}$ as follows:

$\tilde{\mathbf{x}}_t = \mathbf{x}_t - \mathbf{W}_a \bar{\mathbf{x}}, \qquad \bar{\mathbf{x}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t$   (1)

where $\mathbf{W}_a \in \mathbb{R}^{D \times D}$ is a learnable weight matrix that estimates the amount of shifting from the mean temporal slice ($\bar{\mathbf{x}}$) calculated from each series.

Fig. 1: Illustration of the effect of normalization along the temporal mode. Here we consider two samples $\mathbf{X}_1$ and $\mathbf{X}_2$ on the left and right sides, respectively, each of which contains the opening prices of two stocks for 10 consecutive days; thus each multivariate series has dimensions $2 \times 10$. The continuous line represents the function governing the relationship between the two stocks, and the scatter plots represent the prices that we observe (our samples). We can see that, compared to the prices in $\mathbf{X}_1$, the price range in $\mathbf{X}_2$ has shifted for both stocks, but their relationship is similar (the relative arrangement of points in the 2-dimensional space is similar, with a different amount of spread). After the normalization step (here we simply demonstrate with a scaling factor of one and no shifting), the normalized points of both samples are positioned at the same place in this 2-dimensional space, with similar spreads.

After shifting, the intermediate representation $\tilde{\mathbf{x}}_t$ is then scaled as follows:

$\hat{\mathbf{x}}_t = \tilde{\mathbf{x}}_t \oslash (\mathbf{W}_b \boldsymbol{\sigma}), \qquad \boldsymbol{\sigma} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \tilde{\mathbf{x}}_t \odot \tilde{\mathbf{x}}_t}$   (2)

where $\mathbf{W}_b \in \mathbb{R}^{D \times D}$ is another weight matrix that estimates the amount of scaling from the standard deviation ($\boldsymbol{\sigma}$), which is computed from the temporal slices. In Eq. (2), the square-root operator is applied element-wise; $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively.

The final step in DAIN is gating, which is used as a type of attention mechanism to suppress irrelevant features:

$\bar{\mathbf{x}}_t' = \hat{\mathbf{x}}_t \odot \mathrm{sigm}(\mathbf{W}_c \mathbf{c} + \mathbf{d}), \qquad \mathbf{c} = \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{x}}_t$   (3)

where $\mathbf{W}_c \in \mathbb{R}^{D \times D}$ and $\mathbf{d} \in \mathbb{R}^{D}$ are two learnable weights of the gating function and $\mathrm{sigm}(\cdot)$ denotes the logistic sigmoid.

The output of DAIN is, thus, the series $[\bar{\mathbf{x}}_1', \dots, \bar{\mathbf{x}}_T']$, which is the normalized series having the same size as the input series $\mathbf{X}$. Since the normalization scheme of DAIN contains several processing steps with nonlinear operations, stochastic updates in DAIN are sensitive to the learning rate. For this reason, the authors in [26] used three different learning rates for the parameters associated with the three computational steps in DAIN. As we will see in the next section, our normalization scheme is more intuitive for time-series while requiring fewer computations and parameters. In addition, since our normalization scheme only relies on linear operations, it is robust with respect to the learning rates that are normally adopted to train the network under consideration.
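A compact PyTorch sketch of the three-stage procedure described by Eq. (1)-(3) is given below; the class and attribute names, weight initialization and the small epsilon are our own illustrative choices, and the reference implementation of [26] may differ in such details.

```python
import torch
import torch.nn as nn

class DAINSketch(nn.Module):
    """Sketch of the three-stage adaptive input normalization (shift, scale, gate)."""
    def __init__(self, d: int):
        super().__init__()
        self.W_shift = nn.Linear(d, d, bias=False)   # W_a in Eq. (1)
        self.W_scale = nn.Linear(d, d, bias=False)   # W_b in Eq. (2)
        self.gate = nn.Linear(d, d)                  # W_c and d in Eq. (3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, D, T)
        mean = x.mean(dim=2)                                 # mean temporal slice, (batch, D)
        x = x - self.W_shift(mean).unsqueeze(2)              # adaptive shifting, Eq. (1)
        std = torch.sqrt(x.pow(2).mean(dim=2) + 1e-8)        # per-feature deviation, (batch, D)
        x = x / self.W_scale(std).unsqueeze(2)               # adaptive scaling, Eq. (2)
        summary = x.mean(dim=2)                              # summary of the scaled series
        return x * torch.sigmoid(self.gate(summary)).unsqueeze(2)   # adaptive gating, Eq. (3)
```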

III Adaptive Input Normalization with Bilinear Normalization Layer

The proposed BiN layer formulation shares some similarities with DAIN and IN in the sense that we also propose to take advantage of sample-level statistics when learning to transform the input series. More specifically, the basic statistics used to normalize each input sample are calculated independently for each sample, while global parameters are shared between samples in BiN. In this way, our formulation (as well as DAIN and IN) is different from BN, which utilizes global statistics estimated from the whole dataset to normalize every sample. Neither BN nor IN was proposed to work as an input normalization scheme for time-series; instead, they work with higher-order tensors in hidden layers of convolutional neural networks, which have a different semantic structure than multivariate time-series. We are also not aware of any work that utilizes BN or IN for input data normalization, especially for time-series. The main difference between the proposed method and DAIN is that BiN is formulated to jointly learn to transform the input samples along both the temporal and the feature dimension, taking into account the bimodal nature of multivariate time-series, while DAIN only works along the temporal dimension.

In order to better understand our motivation for taking into consideration the bimodal nature of multivariate time-series, let us take the example of predicting the opening value of the NASDAQ-100 index on a given day based on the historical opening prices of its 100 constituent companies over the last 10 days. In this case, each input sample $\mathbf{X}$ has dimensions of $100 \times 10$. On one hand, we can consider that $\mathbf{X}$ is represented by a set of 10 feature vectors (the columns of $\mathbf{X}$), each of which has 100 dimensions, representing a snapshot of the opening prices of the constituent companies of the NASDAQ-100. Thus, the mean value and variance of this set, also of 100 dimensions, would represent the average opening prices of the companies and their volatility over the last 10 days. On the other hand, we can also consider that $\mathbf{X}$ is represented by a set of 100 univariate series (the rows of $\mathbf{X}$), each of which contains the opening prices of one company over 10 consecutive days. Therefore, the mean value and variance of this set, also of 10 dimensions, would represent the mean and variance of the NASDAQ-100 equal-weighted index (i.e., each constituent company contributes 1%, without taking market capitalization into account; QQQE, for example, is an ETF that tracks the NASDAQ-100 with equal weights) during the last 10 days. In our example, both ways of viewing $\mathbf{X}$ and the corresponding statistics are valid and meaningful. Each gives a different interpretation of the data contained in $\mathbf{X}$, as well as a different underlying assumption about which elements form the (approximately normally distributed) set representing $\mathbf{X}$. Because of this, the proposed normalization layer utilizes and combines statistics from both views in order to transform the multivariate series.
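The two views and their associated statistics can be written down in a few lines of NumPy; the random matrix below is a hypothetical stand-in for the $100 \times 10$ price sample in the example above.

```python
import numpy as np

# Hypothetical sample: opening prices of 100 constituents over 10 days, shape (D=100, T=10).
X = 50.0 + 200.0 * np.random.rand(100, 10)

# View 1: X as 10 temporal slices (columns), each a 100-dimensional price snapshot.
mean_per_company = X.mean(axis=1)   # (100,) average opening price of each company
std_per_company = X.std(axis=1)     # (100,) volatility of each company over the 10 days

# View 2: X as 100 univariate series (rows), each a 10-day price path of one company.
mean_per_day = X.mean(axis=0)       # (10,) equal-weighted average price level on each day
std_per_day = X.std(axis=0)         # (10,) cross-sectional dispersion on each day
```

BiN combines normalization statistics computed from both of these views, as formalized next.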

The proposed layer normalizes $\mathbf{X}$ along the temporal dimension as follows:

$\bar{\mathbf{x}}_2 = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t$   (4a)

$\boldsymbol{\sigma}_2 = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (\mathbf{x}_t - \bar{\mathbf{x}}_2) \odot (\mathbf{x}_t - \bar{\mathbf{x}}_2)}$   (4b)

$\hat{\mathbf{X}} = (\mathbf{X} - \bar{\mathbf{x}}_2 \mathbf{1}_T^{\top}) \oslash (\boldsymbol{\sigma}_2 \mathbf{1}_T^{\top})$   (4c)

$\tilde{\mathbf{X}}_2 = (\boldsymbol{\gamma}_2 \mathbf{1}_T^{\top}) \odot \hat{\mathbf{X}} + \boldsymbol{\beta}_2 \mathbf{1}_T^{\top}$   (4d)

where $\boldsymbol{\gamma}_2 \in \mathbb{R}^{D}$ and $\boldsymbol{\beta}_2 \in \mathbb{R}^{D}$ are two parameters of BiN that are optimized during stochastic gradient descent, and $\mathbf{1}_T$ denotes a $T$-dimensional vector of ones.

After the computation steps in Eq. (4), we obtain an intermediate series $\tilde{\mathbf{X}}_2$ that has been normalized in the temporal dimension. Basically, in Eq. (4), given an input series $\mathbf{X}$, BiN first computes the mean temporal slice (column) $\bar{\mathbf{x}}_2$ and its standard deviation $\boldsymbol{\sigma}_2$ as in Eq. (4a), (4b), which are then used to standardize each temporal slice of the input (Eq. (4c)) before applying element-wise scaling (using $\boldsymbol{\gamma}_2$) and shifting (using $\boldsymbol{\beta}_2$) as in Eq. (4d). While the standardizing step is independent for each sample in the training set, the final shifting and scaling parameters are shared between all samples. Here we use the subscript 2 in $\bar{\mathbf{x}}_2$, $\boldsymbol{\sigma}_2$, $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ to indicate that they are associated with the second dimension, i.e., the temporal dimension, of the multivariate series.

In order to interpret the effects of Eq. (4a), (4b) and (4c), we can take the same approach as in the NASDAQ-100 example given previously. That is, the input series $\mathbf{X}$ can be viewed as a set of $T$ temporal slices, i.e., a set of $T$ points in a $D$-dimensional space. The standardization in Eq. (4c) moves this set of points around the origin and controls their spread while keeping their arrangement pattern similar. If we have two input series $\mathbf{X}_1$ and $\mathbf{X}_2$ whose corresponding point sets spread and lie in two completely different areas of this $D$-dimensional space but share the same arrangement pattern, then without the alignment performed by Eq. (4c) we cannot effectively capture the linear or nonlinear arrangement patterns that are similar between the two series when using, for example, a 1D convolution filter that strides along the temporal dimension, as is often encountered in CNN architectures for time-series. (Nonlinear patterns can be approximated by several piece-wise linear patterns, e.g., using more than one linear projection such as multiple convolution filters.) We illustrate this example in Figure 1. Here we should note that although BiN applies additional scaling and shifting in Eq. (4d) after the alignment, the values of $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are the same for every input series; thus the normalized point sets of $\mathbf{X}_1$ and $\mathbf{X}_2$ are still centered at the same point and have approximately similar spreads. Since $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are optimized together with the other parameters of the network, they enable BiN to manipulate the aligned distributions of $\tilde{\mathbf{X}}_2$ to match the statistics of the subsequent layers.

While the effects of non-stationarity in the temporal mode are often visible and have been heavily studied, its effects when considered from the feature-dimension perspective are less obvious. To see this, let us now view the series $\mathbf{X}$ as a set of $D$ points (its rows) in a $T$-dimensional space. Let us also take the previous scenario in which the two series $\mathbf{X}_1$ and $\mathbf{X}_2$ have their temporal slices scattered in different regions of the $D$-dimensional coordinate system (viewed under the temporal perspective) before the normalization step in Eq. (4). When $\mathbf{X}_1$ and $\mathbf{X}_2$ are very far apart when viewed from the feature perspective, these two series are also likely to possess row sets that are distributed in two different regions of the $T$-dimensional space, although having a very similar arrangement. This scenario also prevents a convolution filter that strides along the feature dimension from effectively capturing the prominent linear/nonlinear patterns existing in the feature dimension of all input series. For this reason, our proposed normalization scheme also normalizes the input series along the feature dimension as follows:

$\bar{\mathbf{x}}_1 = \frac{1}{D} \sum_{d=1}^{D} \mathbf{x}^{(d)}$   (5a)

$\boldsymbol{\sigma}_1 = \sqrt{\frac{1}{D} \sum_{d=1}^{D} (\mathbf{x}^{(d)} - \bar{\mathbf{x}}_1) \odot (\mathbf{x}^{(d)} - \bar{\mathbf{x}}_1)}$   (5b)

$\hat{\mathbf{X}}' = (\mathbf{X} - \mathbf{1}_D \bar{\mathbf{x}}_1^{\top}) \oslash (\mathbf{1}_D \boldsymbol{\sigma}_1^{\top})$   (5c)

$\tilde{\mathbf{X}}_1 = (\mathbf{1}_D \boldsymbol{\gamma}_1^{\top}) \odot \hat{\mathbf{X}}' + \mathbf{1}_D \boldsymbol{\beta}_1^{\top}$   (5d)

where $\mathbf{x}^{(d)} \in \mathbb{R}^{T}$ denotes the $d$-th row of $\mathbf{X}$ and $\mathbf{1}_D$ denotes a $D$-dimensional vector of ones. In addition, $\boldsymbol{\gamma}_1 \in \mathbb{R}^{T}$ and $\boldsymbol{\beta}_1 \in \mathbb{R}^{T}$ are two learnable weights.

After computing the steps in Eq. (5), we obtain another intermediate series $\tilde{\mathbf{X}}_1$ that has been normalized in the feature dimension.

Finally, BiN linearly combines the intermediate normalized series obtained from Eq. (4) and (5) to generate the output $\tilde{\mathbf{X}}$:

$\tilde{\mathbf{X}} = \lambda_1 \tilde{\mathbf{X}}_1 + \lambda_2 \tilde{\mathbf{X}}_2$   (6)

where $\lambda_1$ and $\lambda_2$ are two learnable scalars, which enable BiN to weigh the importance of temporal and feature normalization. Here we should note that $\lambda_1$ and $\lambda_2$ are constrained to be non-negative. This constraint is enforced during stochastic optimization by setting the value (of $\lambda_1$ or $\lambda_2$) to 0 whenever the updated value becomes negative.
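Putting Eq. (4)-(6) together, a minimal PyTorch sketch of the layer could look as follows; parameter initialization, the epsilon, and the way the non-negativity of the combination weights is enforced (clamping at use time rather than after each update) are illustrative simplifications rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class BiNSketch(nn.Module):
    """Sketch of bilinear input normalization for inputs of shape (batch, D, T)."""
    def __init__(self, d: int, t: int, eps: float = 1e-8):
        super().__init__()
        self.gamma2 = nn.Parameter(torch.ones(d, 1))    # temporal-mode scale, Eq. (4d)
        self.beta2 = nn.Parameter(torch.zeros(d, 1))    # temporal-mode shift, Eq. (4d)
        self.gamma1 = nn.Parameter(torch.ones(1, t))    # feature-mode scale, Eq. (5d)
        self.beta1 = nn.Parameter(torch.zeros(1, t))    # feature-mode shift, Eq. (5d)
        self.lambda1 = nn.Parameter(torch.tensor(0.5))  # combination weights, Eq. (6)
        self.lambda2 = nn.Parameter(torch.tensor(0.5))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Temporal-mode normalization, Eq. (4): per-sample statistics over the T columns.
        mu_t = x.mean(dim=2, keepdim=True)
        sd_t = x.std(dim=2, keepdim=True) + self.eps
        x2 = self.gamma2 * (x - mu_t) / sd_t + self.beta2

        # Feature-mode normalization, Eq. (5): per-sample statistics over the D rows.
        mu_f = x.mean(dim=1, keepdim=True)
        sd_f = x.std(dim=1, keepdim=True) + self.eps
        x1 = self.gamma1 * (x - mu_f) / sd_f + self.beta1

        # Non-negative combination of the two normalized views, Eq. (6).
        return self.lambda1.clamp(min=0.0) * x1 + self.lambda2.clamp(min=0.0) * x2

# Example: BiN placed in front of a downstream forecasting network.
bin_layer = BiNSketch(d=40, t=10)
y = bin_layer(torch.randn(32, 40, 10))   # y has the same shape as the input
```

In such a setup the layer sits directly in front of, e.g., a TABL or DeepLOB network and is trained jointly with it by the same optimizer.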

IV Experiments

IV-A Limit Order Book

In finance, a limit order is a type of trade order to buy or sell a fixed number of shares at a specified price. In a buy (bid) limit order, the trader specifies the number of shares and the maximum price per share that he or she is willing to pay. Conversely, for a sell (ask) limit order, the trader specifies the number of shares and the minimum price per share at which he or she is willing to sell. The two types of limit order form the two sides of the limit order book (LOB): the bid and the ask sides. The limit orders are sorted such that the ones with the highest bid price are on top of the bid side and the ones with the lowest ask price are on top of the ask side. Whenever the best ask price is equal to or lower than the best bid price, those orders are executed and removed from the LOB.

Since the LOB contains all the outstanding orders related to a stock, it reflects the current supply and demand of the stock at different price levels. In the literature, numerous studies take advantage of LOB data to address different research questions, such as order flow distribution, price jumps, the random-walk nature of prices, and stochastic models of limit orders, to name a few [30, 29, 3, 6, 20]. One of the problems related to the LOB that is heavily studied using machine learning methods is the problem of forecasting future mid-price movements. The mid-price, at any point in time, is the average of the best bid and best ask prices. This quantity is a virtual price since no trade can happen at the current mid-price. Since the movements of the mid-price reflect changes in market dynamics, they are considered important events to forecast. In order to benchmark the performance of BiN, we conducted experiments using two LOB datasets coming from two different markets: the Nordic and US markets.

IV-B Experiments using Nordic data

IV-B1 Dataset and Experimental Setup

FI-2010 [24] is a large-scale, publicly available Limit Order Book (LOB) dataset, which contains buy and sell limit order information (prices and volumes) over ten business days from five Finnish stocks traded on the Helsinki Stock Exchange (operated by NASDAQ Nordic). At each order event (a point in time), the dataset contains the prices and volumes of the top 10 best-bid and best-ask levels of the book, leading to a 40-dimensional vector representation. The authors of this dataset provided the labels (up, down, stationary) for the mid-price movements in the next H order events. Since the majority of existing research results were reported for prediction horizons H ∈ {10, 20, 50}, we also conducted experiments with these values. Interested readers can find more details about the FI-2010 dataset in [24].
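The provided labels follow a smoothing-and-threshold rule of the kind described in [24] (see also Section IV-C1): the mean of the next H mid-prices is compared with the current mid-price and the relative change is thresholded. The sketch below is an illustrative rendering of such a rule; the function name and the exact smoothing choice are ours, and the labels used in our experiments are the ones shipped with the dataset.

```python
import numpy as np

def label_movements(mid: np.ndarray, horizon: int, alpha: float) -> np.ndarray:
    """Label mid-price dynamics as up (1), stationary (0), or down (-1)."""
    labels = np.zeros(len(mid) - horizon, dtype=int)
    for t in range(len(mid) - horizon):
        future_mean = mid[t + 1:t + 1 + horizon].mean()    # smoothed future mid-price
        rel_change = (future_mean - mid[t]) / mid[t]       # relative change over the horizon
        if rel_change > alpha:
            labels[t] = 1         # upward movement
        elif rel_change < -alpha:
            labels[t] = -1        # downward movement
    return labels
```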

For the FI-2010 dataset, we followed the same experimental setup proposed in [33], which is widely used to benchmark the performance of deep neural networks on this task. Under this setting, the data of the first 7 days is used to train the models, and the last 3 days are used for evaluation purposes. In this first set of experiments, we evaluated BiN in combination with the Temporal Attention augmented Bilinear Layer (TABL) network, which is one of the SoTA neural networks on the FI-2010 dataset [33]. Since TABL architectures also take advantage of the bimodal nature of the time-series, BiN is expected to complement TABL networks particularly well. To enable comparisons with prior works, the best-performing architecture C(TABL) reported in [33] was adopted in our experiments. For this architecture, each input time-series was constructed from the 10 most recent order events; since at each order event the LOB is represented by a 40-dimensional vector, each input series fed to C(TABL) has dimensions of 40 x 10. All C(TABL) networks were trained with the ADAM optimizer, with an initial learning rate that was reduced by a fixed factor twice during training; weight decay and a max-norm constraint were used for regularization.

Accuracy, average precision, average recall and average F1 are reported as the performance metrics. Since FI-2010 is an imbalanced dataset, the average F1 measure is considered the main performance metric, following prior conventions [33]. Here we should note that we used no separate validation set for FI-2010, and simply used the F1 score measured on the training set for validation purposes. Each experiment was repeated several times and the median value measured on the test set is reported.
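Reading "average" as macro-averaging over the three movement classes (an assumption on our side), the reported metrics correspond to the following scikit-learn calls; this is a convenience sketch rather than part of the evaluation code.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def report(y_true, y_pred):
    """Accuracy plus class-averaged precision, recall and F1 for {up, stationary, down}."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),   # main metric on FI-2010
    }
```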

IV-B2 Experiment Results

Models Accuracy % Precision % Recall % F1 %
Prediction Horizon
CNN[35] -
LSTM[36] -
C(BL) [33]
DeepLOB [39]
DAIN-MLP [26] -
DAIN-RNN [26] -
C(TABL) [33]
BN-C(TABL)
BiN-C(TABL)
Prediction Horizon
CNN[35] -
LSTM[36] -
C(BL) [33]
DeepLOB [39]
DAIN-MLP [26] -
DAIN-RNN [26] -
C(TABL) [33]
BN-C(TABL)
BiN-C(TABL)
Prediction Horizon
CNN[35] -
LSTM[36] -
C(BL) [33]
DeepLOB [39]
C(TABL) [33]
BN-C(TABL)
BiN-C(TABL)
TABLE I: Experiment results. Methods without any indication of a normalization method use z-score normalization. Bold-face numbers denote the best F1 measure between the same model using different normalization methods.

Table I shows the experiment results for the three prediction horizons of C(TABL) networks using Batch Normalization and BiN, in comparison with existing results. Here we should note that the data provided in FI-2010 has been anonymized, i.e., the prices and volumes of the orders were normalized. For results reported in Table I without any indication of the normalization method, z-score normalization was applied. In addition, we attempted to evaluate DAIN with the C(TABL) architecture on the FI-2010 dataset; however, we could not achieve reasonable performance since this normalization strategy requires extensive tuning of three different learning rates for its different computation steps. Besides, in the original paper [26], DAIN was only applied to MLP and RNN networks. For this reason, we report the original results of DAIN using MLP and RNN in Table I. In the experiments using US data, we did obtain reasonable results with DAIN, and comparisons with DAIN are made in Section IV-C.

Models Accuracy % Precision % Recall % F1 %
Prediction Horizon
B(TABL) [33]
C(TABL) [33]
BiN-B(TABL)
BiN-C(TABL)
Prediction Horizon
B(TABL) [33]
C(TABL) [33]
BiN-B(TABL)
BiN-C(TABL)
Prediction Horizon
B(TABL) [33]
C(TABL) [33]
BiN-B(TABL)
BiN-C(TABL)
TABLE II: Improvement comparison between BiN-C(TABL) and BiN-B(TABL)

It is clear that our proposed BiN layer (BiN-C(TABL)), when used to normalize the input data, yielded significant improvements over BN and z-score normalization applied to the same network. The improvements are evident for all prediction horizons; in particular, for the longest horizon, BiN enhanced the C(TABL) network with a substantial improvement in average F1 measure. Compared to DAIN, the performances achieved by our normalization strategy coupled with C(TABL), or by DeepLOB, are superior to those of DAIN coupled with MLP or RNN. Regarding BN used as an input normalization scheme, it is obvious that BN deteriorated the performance of C(TABL) networks; for example, for one of the horizons, adding BN to the C(TABL) network led to a clear drop in average F1. This phenomenon is expected since BN was originally designed to reduce covariate shift between hidden layers of convolutional neural networks, rather than as a mechanism to normalize input time-series.

Comparing BiN-C(TABL) with DeepLOB [39], a SoTA CNN-LSTM architecture with 11 hidden layers, it is clear that our proposed normalization layer helped a TABL network with only 2 hidden layers to significantly close the performance gap for two of the prediction horizons, while outperforming DeepLOB by a large margin on the remaining one.

In order to investigate how much improvement BiN can contribute to neural networks of different complexities, we evaluated BiN with a smaller TABL architecture, namely B(TABL), as proposed in [33]. B(TABL) has only one hidden layer and roughly half the number of parameters of C(TABL), which has two hidden layers. The results are shown in Table II. It is clear that BiN significantly boosted both the B(TABL) and C(TABL) architectures for all prediction horizons, with BiN-B(TABL) networks performing as well as BiN-C(TABL) networks in all prediction horizons, making the additional hidden layer in BiN-C(TABL) redundant. Here we should note that adding our proposed normalization layer to B(TABL) networks only leads to a very small increase in the number of parameters, while achieving the same performance as BiN-C(TABL) networks, which have approximately twice the number of parameters.

Since BN was proposed to normalize hidden representations, we also experimented with using BiN to normalize hidden representations in TABL networks. The results are shown in Table III, where BiN-C(TABL) and BN-C(TABL) denote the results when BiN and BN were only applied to the input, while BiN-C(TABL)-BiN and BN-C(TABL)-BN denote the results when BiN and BN were applied to both the input and the hidden representations. As we can see from Table III, there are very small differences between the two arrangements, except for a noticeable improvement for BN at one of the prediction horizons. For BiN, these results imply that adding normalization to the hidden layers brings no additional benefit for C(TABL) networks when the input data has been properly normalized.

Models Accuracy % Precision % Recall % F1 %
Prediction Horizon
BN-C(TABL)
BiN-C(TABL)
BN-C(TABL)-BN
BiN-C(TABL)-BiN

Prediction Horizon
BN-C(TABL)
BiN-C(TABL)
BN-C(TABL)-BN
BiN-C(TABL)-BiN
Prediction Horizon
BN-C(TABL)
BiN-C(TABL)
BN-C(TABL)-BN
BiN-C(TABL)-BiN
TABLE III: Comparisons between Bilinear Normalization and Batch Normalization when applied to only the input layer (BiN-C(TABL) and BN-C(TABL)) or to all layers (BiN-C(TABL)-BiN and BN-C(TABL)-BN)

IV-C Experiments using US data

IV-C1 Dataset and Experiment Setup

While the Nordic dataset provides a reasonable testbed for our evaluation purposes, the Nordic market is less liquid than the US market, which is the biggest stock market worldwide. The number of intra-day orders in large-cap US stocks is significantly higher than that in Nordic stocks, making it harder to predict future market conditions. For the US market, we procured orders from the TotalView-ITCH feed and obtained the LOB data of Amazon and Google from the 22nd of September 2015 to the 5th of October 2015. The trading hours on NASDAQ US span from 09:30 to 16:00 (EST), and only orders submitted during this period were considered in our analysis. After the filtering process, we obtained approximately 13 million order events over the working days in this period. Similar to the Nordic data, the earlier days were used for training the prediction models and the remaining days for testing purposes.

In addition to forecasting the type of mid-price dynamics (up, down, stationary) at a fixed future horizon (Setting 1), we also evaluated the models in a more active setting (Setting 2), in which the models were trained to predict the next movement (up or down) of the mid-price and when it occurs. That is, in Setting 2 we have both a classification objective (movement type) and a regression objective (horizon value), with a loss function consisting of the cross entropy and the mean squared error. The movement labels were derived following the same procedure used in [24], which includes price smoothing and movement classification based on a threshold.
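A minimal sketch of the joint objective in Setting 2 is given below; the relative weighting of the two terms is an assumption on our side, as the text above only states that both terms are used.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()   # movement type: up vs down
mse = nn.MSELoss()           # horizon: number of events until the movement occurs

def setting2_loss(class_logits, horizon_pred, class_target, horizon_target, w: float = 1.0):
    """Joint Setting-2 objective: cross entropy for the movement type plus
    (weighted) mean squared error for the predicted horizon."""
    return ce(class_logits, class_target) + w * mse(horizon_pred, horizon_target)
```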

For the experiments with the US data, in addition to the C(TABL) architecture, we also evaluated the DeepLOB architecture [39] as a predictor. Different from the Nordic dataset, which was pre-normalized, the US data contains raw values for the prices and volumes. For this reason, we experimented with two static normalization methods, namely z-score normalization and min-max normalization, with the results denoted as z-C(TABL) and mm-C(TABL) for C(TABL) networks, and z-DeepLOB and mm-DeepLOB for DeepLOB networks.

Models Accuracy (%) Precision (%) Recall (%) F1 (%)
Prediction Horizon
C(TABL)
z-C(TABL)
mm-C(TABL)
BN-C(TABL)
DAIN-C(TABL)
BiN-C(TABL)
Prediction Horizon
C(TABL)
z-C(TABL)
mm-C(TABL)
BN-C(TABL)
DAIN-C(TABL)
BiN-C(TABL)
Prediction Horizon
C(TABL)
z-C(TABL)
mm-C(TABL)
BN-C(TABL)
DAIN-C(TABL)
BiN-C(TABL)
TABLE IV: Results for C(TABL) architecture in experiment Setting 1 of US data

Iv-C2 Experiment Results

Table IV shows the experiment results in Setting 1 of the US data for the C(TABL) architecture. First of all, it is clear that we obtained the worst performance when using raw data to train the predictors (results associated with C(TABL)). Between the two static normalization methods, z-score normalization exhibited a better ability to preprocess the data compared to min-max normalization, and both significantly improved the quality of the training data. Among the adaptive normalization methods, the performances obtained with BN are inferior to those of DAIN and BiN. Overall, the proposed normalization layer, when combined with the C(TABL) architecture, yielded the best performances across all prediction horizons.

Table V shows the experiment results in Setting 1 of the US data for DeepLOB networks. Similar to the results obtained for C(TABL) networks, we also obtained the worst performance when using raw data to train the DeepLOB architecture. Between z-score normalization and min-max normalization, using the former led to slightly better results compared to the latter. While BN showed no superiority over z-score normalization, both DAIN and BiN outperformed static normalization methods. Among all normalization methods, BiN was the most suitable normalization technique to combine with the DeepLOB architecture.

Models Accuracy (%) Precision (%) Recall (%) F1 (%)
Prediction Horizon
DeepLOB
z-DeepLOB
mm-DeepLOB
BN-DeepLOB
DAIN-DeepLOB
BiN-DeepLOB
Prediction Horizon
DeepLOB
z-DeepLOB
mm-DeepLOB
BN-DeepLOB
DAIN-DeepLOB
BiN-DeepLOB
Prediction Horizon
DeepLOB
z-DeepLOB
mm-DeepLOB
BN-DeepLOB
DAIN-DeepLOB
BiN-DeepLOB
TABLE V: Results for DeepLOB network architecture in experiment Setting 1 of US data
Models F1 (%) RMSE
C(TABL)
z-C(TABL)
mm-C(TABL)
BN-C(TABL)
DAIN-C(TABL)
BiN-C(TABL)
DeepLOB
z-DeepLOB
mm-DeepLOB
BN-DeepLOB
DAIN-DeepLOB
BiN-DeepLOB
TABLE VI: Results for C(TABL) and DeepLOB architectures in experiment Setting 2 of US data

In experiment Setting 2, the models were trained to predict the type of the next movement of mid-price, which is measured by F1 score, as well as the horizon when it happens, which is measured by Root Mean Squared Error (RMSE). The performances of C(TABL) and DeepLOB networks using different input normalization methods are shown in Table VI. For both network architectures, the best F1 scores were obtained using the proposed normalization method. Z-score standardization and BN performed similarly, being the second best in terms of F1 score. Min-max normalization, again, showed inferior performances compared to z-score normalization. Surprisingly, DAIN performed poorly in terms of F1 score when compared to z-score normalization in this experiment setting. Regarding the prediction of the horizon value, BiN achieved the best RMSE among all normalization methods used for the C(TABL) architecture. For the DeepLOB architecture, a peculiar phenomenon can be observed: for all normalization methods, we obtained the same RMSE, even between different runs, with DAIN as the only exception. For these models, the gradient updates toward the end of the training process seemed to only affect the classification objective and not the regression one. Even though DAIN achieved the best RMSE compared to others when applied to the DeepLOB architecture, the combination of DAIN and DeepLOB performed poorly in terms of F1 score.

From the results obtained for both Setting 1 and Setting 2, we can see that the proposed normalization method performs consistently, being the best normalization method for SoTA neural networks in most cases.

V Conclusions

In this paper, we proposed the Bilinear Input Normalization (BiN) layer, a completely data-driven time-series normalization strategy, which is designed to take into consideration the bimodal nature of financial time-series and aligns multivariate time-series in both the feature and temporal dimensions. The parameters of the proposed normalization method are optimized in an end-to-end manner together with the other parameters of a neural network. Using large-scale limit order books coming from the Nordic and US markets, we evaluated the performance of BiN in comparison with other normalization techniques on different forecasting problems related to future mid-price dynamics. The experimental results showed that BiN performed consistently when combined with different state-of-the-art neural networks, being the most suitable normalization method in the majority of scenarios.

VI Acknowledgement

The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources.

References

  • [1] M. Avellaneda and J. Lee (2010) Statistical arbitrage in the us equities market. Quantitative Finance 10 (7), pp. 761–782. Cited by: §I.
  • [2] G. Ban, N. El Karoui, and A. E. Lim (2018) Machine learning and portfolio optimization. Management Science 64 (3), pp. 1136–1154. Cited by: §I.
  • [3] J. Bouchaud, Y. Gefen, M. Potters, and M. Wyart (2004) Fluctuations and response in financial markets: the subtle nature of ‘random’price changes. Quantitative finance 4 (2), pp. 176–190. Cited by: §IV-A.
  • [4] M. P. Clements, P. H. Franses, and N. R. Swanson (2004) Forecasting economic and financial time-series with non-linear models. International Journal of Forecasting 20 (2), pp. 169–183. Cited by: §I.
  • [5] J. H. Cochrane (1996) A cross-sectional test of an investment-based asset pricing model. Journal of Political Economy 104 (3), pp. 572–621. Cited by: §I.
  • [6] R. Cont and A. De Larrard (2013) Price dynamics in a markovian limit order market. SIAM Journal on Financial Mathematics 4 (1), pp. 1–25. Cited by: §IV-A.
  • [7] V. DeMiguel, L. Garlappi, F. J. Nogales, and R. Uppal (2009) A generalized approach to portfolio optimization: improving performance by constraining portfolio norms. Management science 55 (5), pp. 798–812. Cited by: §I.
  • [8] A. Dingli and K. S. Fournier (2017) Financial time series forecasting–a deep learning approach. International Journal of Machine Learning and Computing 7 (5), pp. 118–122. Cited by: §I.
  • [9] R. F. Engle (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of united kingdom inflation. Econometrica: Journal of the econometric society, pp. 987–1007. Cited by: §I.
  • [10] J. Galindo and P. Tamayo (2000) Credit risk assessment using statistical and machine learning: basic methodology and risk modeling applications. Computational Economics 15 (1), pp. 107–143. Cited by: §I.
  • [11] S. García, J. Luengo, and F. Herrera (2015) Data preprocessing in data mining. Vol. 72, Springer. Cited by: §II.
  • [12] A. Hatemi-j (2008) Tests for cointegration with two unknown regime shifts with an application to financial market integration. Empirical Economics 35 (3), pp. 497–505. Cited by: §I.
  • [13] Y. Hu, K. Liu, X. Zhang, L. Su, E. Ngai, and M. Liu (2015) Application of evolutionary computation for rule discovery in stock algorithmic trading: a literature review. Applied Soft Computing 36, pp. 534–551. Cited by: §I.
  • [14] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §II.
  • [15] M. J. Kane, N. Price, M. Scotch, and P. Rabinowitz (2014) Comparison of arima and random forest time series models for prediction of avian influenza h5n1 outbreaks. BMC bioinformatics 15 (1), pp. 1–9. Cited by: §I.
  • [16] A. E. Khandani, A. J. Kim, and A. W. Lo (2010) Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance 34 (11), pp. 2767–2787. Cited by: §I.
  • [17] J. Korczak and M. Hemes (2017) Deep learning for financial time series forecasting in a-trader system. In 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 905–912. Cited by: §I.
  • [18] O. M. Kvalheim, F. Brakstad, and Y. Liang (1994) Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise. Analytical Chemistry 66 (1), pp. 43–51. Cited by: §II.
  • [19] M. Lettau and M. Pelger (2020) Estimating latent asset-pricing factors. Journal of Econometrics 218 (1), pp. 1–31. Cited by: §I.
  • [20] Y. Mäkinen, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2019) Forecasting jump arrivals in stock prices: new attention-based network architecture using limit order book data. Quantitative Finance 19 (12), pp. 2033–2050. Cited by: §IV-A.
  • [21] J. J. Murphy (1999) Technical analysis of the financial markets: a comprehensive guide to trading methods and applications. Penguin. Cited by: §I.
  • [22] S. Nayak, B. B. Misra, and H. S. Behera (2014) Impact of data normalization on stock index forecasting. Int. J. Comp. Inf. Syst. Ind. Manag. Appl 6, pp. 357–369. Cited by: §I.
  • [23] I. Noda (2008) Scaling techniques to enhance two-dimensional correlation spectra. Journal of Molecular Structure 883, pp. 216–227. Cited by: §II.
  • [24] A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2018) Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods. Journal of Forecasting 37 (8), pp. 852–866. Cited by: §IV-B1, §IV-C1.
  • [25] G. Nuti, M. Mirghaemi, P. Treleaven, and C. Yingsaeree (2011) Algorithmic trading. Computer 44 (11), pp. 61–69. Cited by: §I.
  • [26] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2019) Deep adaptive input normalization for price forecasting using limit order book data. arXiv preprint arXiv:1902.07892. Cited by: §I, §II, §II, §IV-B2, TABLE I.
  • [27] X. Qian and S. Gao (2017) Financial series prediction: comparison between precision of time series models and machine learning methods. arXiv preprint arXiv:1706.00948, pp. 1–9. Cited by: §I.
  • [28] X. Shao (2015) Self-normalization for time series: a review of recent developments. Journal of the American Statistical Association 110 (512), pp. 1797–1817. Cited by: §I.
  • [29] M. Siikanen, J. Kanniainen, and A. Luoma (2017) What drives the sensitivity of limit order books to company announcement arrivals?. Economics Letters 159, pp. 65–68. Cited by: §IV-A.
  • [30] M. Siikanen, J. Kanniainen, and J. Valli (2017) Limit order books and liquidity around scheduled and non-scheduled announcements: empirical evidence from nasdaq nordic. Finance Research Letters 21, pp. 264–271. Cited by: §IV-A.
  • [31] D. Singh and B. Singh (2020) Investigating the impact of data normalization on classification performance. Applied Soft Computing 97, pp. 105524. Cited by: §I, §II.
  • [32] M. C. Thomsett (2006) Getting started in fundamental analysis. John Wiley & Sons. Cited by: §I.
  • [33] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj (2018) Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE transactions on neural networks and learning systems 30 (5), pp. 1407–1418. Cited by: §I, §I, §IV-B1, §IV-B1, §IV-B2, TABLE I, TABLE II.
  • [34] D. T. Tran, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2020) Data normalization for bilinear structures in high-frequency financial time-series. In International Conference on Pattern Recognition (ICPR), Cited by: §I.
  • [35] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI), Vol. 1, pp. 7–12. Cited by: §I, TABLE I.
  • [36] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Using deep learning to detect price change indications in financial markets. In Signal Processing Conference (EUSIPCO), 2017 25th European, pp. 2511–2515. Cited by: TABLE I.
  • [37] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §II.
  • [38] R. A. van den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, and M. J. van der Werf (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics 7 (1), pp. 1–15. Cited by: §II.
  • [39] Z. Zhang, S. Zohren, and S. Roberts (2019) DeepLOB: deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing 67 (11), pp. 3001–3012. Cited by: §I, §I, §IV-B2, §IV-C1, TABLE I.