Data Normalization for Bilinear Structures in High-Frequency Financial Time-series

Financial time-series analysis and forecasting have been extensively studied over the past decades, yet remain a very challenging research topic. Since financial markets are inherently noisy and stochastic, a majority of financial time-series of interest are non-stationary and often obtained from different modalities. This property presents great challenges and can significantly affect the performance of subsequent analysis/forecasting steps. Recently, the Temporal Attention augmented Bilinear Layer (TABL) has shown strong performance in tackling financial forecasting problems. In this paper, by taking into account the nature of bilinear projections in TABL networks, we propose Bilinear Normalization (BiN), a simple yet efficient normalization layer to be incorporated into TABL networks to tackle the potential problems posed by non-stationarity and multimodality in the input series. Our experiments using a large-scale Limit Order Book (LOB) dataset consisting of more than 4 million order events show that BiN-TABL outperforms TABL networks using other state-of-the-art normalization schemes by a large margin.








I Introduction

Although we have observed great successes in time-series and sequence analysis [19, 25, 10, 16, 20, 12], and the topic in general has been extensively studied, we still face great challenges when working with multivariate time-series obtained from financial markets, especially high-frequency data. In High-Frequency Trading (HFT), traders focus on short-term investment horizons and profit from small margins on price changes at large volumes. Thus, HFT traders rely on market volatility to make profit. This, however, also poses great challenges when dealing with data obtained in the HFT market.

Due to the unique characteristics of financial markets, a great amount of effort is still needed in order to achieve the same successes as in Computer Vision (CV) [14, 13, 6, 18, 21, 17] and Natural Language Processing (NLP) [1, 2]. On one hand, the problems targeted in CV and NLP mainly involve cognitive tasks, whose inputs, such as images or language, are intuitive and innate for human beings to visualize and interpret, while interpreting financial time-series is not a natural human ability. On the other hand, images or audio are well-behaved signals in the sense that their ranges or variances are known or can be easily manipulated without losing the characteristics of the signal, while financial observations, such as stock prices, can exhibit drastic changes over time, and even at the same time instance, signals from different modalities can be very different. Thus, data preprocessing plays an important role when working with financial data.

Perhaps the most popular normalization schemes for time-series are z-score normalization, i.e., transforming the data to have zero mean and unit standard deviation, and min-max normalization, i.e., scaling the values of each dimension into the range [0, 1]. The limitation of z-score and min-max normalization lies in the fact that statistics of past observations (from the training phase) are used to normalize future observations, which might possess completely different magnitudes due to non-stationarity or concept drift. In order to tackle this problem, several sophisticated methods have been proposed [15, 7]. In addition, hand-crafted stationary features and econometric or quantitative indicators with mathematical assumptions about the underlying processes are also widely used. These financial indicators can sometimes perform relatively well after a long process of experimentation and validation, which, however, prevents their practical implementation in HFT [9].
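The limitation described above can be sketched in a few lines. In the following illustrative snippet (not part of any cited method), statistics are fit on the training split and reused on future data, so a drifted test series ends up far from the zero-mean, unit-variance regime the model was trained on:

```python
import numpy as np

def zscore_fit_transform(train, test, eps=1e-8):
    """Standardize each feature using statistics estimated on the training split.

    train, test: arrays of shape (num_samples, num_features).
    """
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    z_train = (train - mu) / (sigma + eps)
    z_test = (test - mu) / (sigma + eps)  # future data reuses past statistics
    return z_train, z_test

def minmax_fit_transform(train, test, eps=1e-8):
    """Scale each feature into [0, 1] using the training split's range."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    return (train - lo) / (hi - lo + eps), (test - lo) / (hi - lo + eps)
```

If the test distribution shifts (e.g., prices move to a new level), the normalized test data is no longer centered, which is exactly the failure mode that motivates data-driven normalization.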

Different from the aforementioned model-based approaches, data-driven normalization methods aim to directly estimate relevant statistics which are specific to the given analysis task in an end-to-end manner. That is, the normalization step is implemented as a neural network layer whose parameters are jointly optimized with other layers via stochastic gradient descent. Perhaps the most widely used formulation is Batch Normalization (BN) [5], which was originally proposed for visual data. BN, however, is mostly used between hidden layers to reduce internal covariate shift. Proposed for the task of visual style transfer, Instance Normalization (IN) [24] has been very successful in normalizing the contrast level of generated images. For time-series, an input normalization layer that learns to adaptively estimate the normalization statistics of a given time-series, and which outperforms existing schemes, was proposed in [11].

Existing data-driven approaches, however, neglect the tensor structure inherent in multivariate time-series, performing normalization only along the temporal mode of the time-series. In order to take advantage of the tensor representation, the authors in [19] proposed TABL networks, which separately capture linear dependencies along the temporal and feature dimensions in each layer. Since a TABL network performs a sequence of weighted sums alternating between the temporal and feature dimensions, we propose a data-driven normalization strategy that takes into account statistics from both the temporal and feature dimensions, dubbed Bilinear Normalization (BiN). Combining BiN with TABL, we show that BiN-TABL networks significantly outperform TABL networks using other normalization strategies on the mid-price movement prediction problem using a large-scale Limit Order Book dataset.

The remainder of the paper is organized as follows. Section 2 reviews related literature in data normalization. In Section 3, we describe the motivation and processing steps of our Bilinear Normalization layer. In Section 4, we provide information about experiment setup, present and discuss our empirical results. Section 5 concludes our work.

II Related Work

Deep neural networks have seen significant improvements over the past decades thanks to advancements in both hardware and algorithms. On the algorithmic side, training deep networks comprising multiple layers can be challenging since the distribution of each layer's inputs can change significantly during the iterative optimization process, which harms the error feedback signals. Thus, by manipulating the statistics between layers, we have seen great improvements in optimizing deep neural networks. An early example is the class of initialization methods [3, 4], which initialize the network's parameters based on each layer's statistics. However, most of these initialization methods are data-independent. A more active approach is the direct manipulation of the statistics by learning them jointly with the network's parameters, with the early work being Batch Normalization (BN) [5]. BN estimates the global mean and variance of the input data by gradually accumulating mini-batch statistics. After standardizing the data to have zero mean and unit variance, BN also learns to scale and shift the distribution. Instead of mini-batch statistics, Instance Normalization (IN) [24] uses sample-level statistics and learns how to normalize each image so that its contrast matches that of a predefined style image in visual style transfer problems. Both BN and IN were originally proposed for visual data, although BN has also been widely used in NLP.

We are not aware of any data-driven normalization scheme for time-series, except the recently proposed Deep Adaptive Input Normalization (DAIN) formulation [11], which applies normalization to the input time-series via a 3-stage procedure. Specifically, let $\mathbf{X} = [\mathbf{x}_{2,1}, \dots, \mathbf{x}_{2,T}] \in \mathbb{R}^{D \times T}$ be a collection of time-series, where $T$ denotes the temporal dimension and $D$ denotes the spatial/feature dimension. In addition, we denote by $\mathbf{x}_{2,t} \in \mathbb{R}^{D}$ the representation (temporal slice) at time instance $t$ of series $\mathbf{X}$. Here the subscript denotes the tensor mode (1 for feature slices and 2 for temporal slices). DAIN first shifts the input time-series by:

$$\tilde{\mathbf{x}}_{2,t} = \mathbf{x}_{2,t} - \mathbf{W}_a \bar{\mathbf{a}}, \quad \bar{\mathbf{a}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_{2,t} \tag{1}$$

where $\mathbf{W}_a \in \mathbb{R}^{D \times D}$ is a learnable weight matrix that estimates the amount of shifting from the mean temporal slice ($\bar{\mathbf{a}}$) calculated from each series.

After shifting, the intermediate representation is then scaled as follows:

$$\breve{\mathbf{x}}_{2,t} = \tilde{\mathbf{x}}_{2,t} \oslash \mathbf{W}_b \bar{\mathbf{b}}, \quad \bar{\mathbf{b}} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \tilde{\mathbf{x}}_{2,t} \odot \tilde{\mathbf{x}}_{2,t}} \tag{2}$$

where $\mathbf{W}_b \in \mathbb{R}^{D \times D}$ is a learnable weight matrix that estimates the amount of scaling from the standard deviation ($\bar{\mathbf{b}}$) along the temporal dimension. In Eq. (2), the square-root operator is applied element-wise; $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively.

The final step in DAIN is gating, which aims to suppress irrelevant features:

$$\hat{\mathbf{x}}_{2,t} = \breve{\mathbf{x}}_{2,t} \odot \sigma(\mathbf{W}_c \bar{\mathbf{c}} + \mathbf{d}), \quad \bar{\mathbf{c}} = \frac{1}{T} \sum_{t=1}^{T} \breve{\mathbf{x}}_{2,t} \tag{3}$$

where $\mathbf{W}_c \in \mathbb{R}^{D \times D}$ and $\mathbf{d} \in \mathbb{R}^{D}$ are learnable weights, and $\sigma(\cdot)$ denotes the sigmoid function.

Overall, DAIN takes the input time-series $\mathbf{X}$ and outputs its normalized version by manipulating its temporal slices. As we will see in the next section, our BiN formulation is much simpler (requiring fewer calculations) and more intuitive than DAIN when used with TABL networks.
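The 3-stage procedure above can be sketched as follows. This is an illustrative numpy rendering of Eqs. (1)-(3) for a single series, not the reference implementation of [11]; in a real network the weight matrices are trained jointly with the model.

```python
import numpy as np

def dain_forward(X, Wa, Wb, Wc, d):
    """Sketch of DAIN's 3-stage input normalization for one series X of shape (D, T).

    Wa, Wb, Wc are (D, D) learnable matrices and d a (D,) learnable bias.
    """
    # Stage 1: adaptive shifting by a learned function of the mean temporal slice.
    a = X.mean(axis=1, keepdims=True)                    # mean temporal slice, (D, 1)
    X1 = X - Wa @ a
    # Stage 2: adaptive scaling by a learned function of the per-feature deviation.
    b = np.sqrt((X1 ** 2).mean(axis=1, keepdims=True))   # (D, 1)
    X2 = X1 / (Wb @ b + 1e-8)
    # Stage 3: gating that can suppress irrelevant features.
    c = X2.mean(axis=1, keepdims=True)
    gate = 1.0 / (1.0 + np.exp(-(Wc @ c + d[:, None])))  # sigmoid, (D, 1)
    return X2 * gate
```

With identity shift/scale matrices and an open gate, the procedure reduces to per-feature standardization of the series by its own statistics.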

III Bilinear Normalization (BiN)

Our proposed BiN layer bears some resemblance to DAIN and IN in that BiN also uses sample-level statistics to manipulate the input distribution. That is, each input sample is normalized based on its own statistics only. This is different from BN, which uses global statistics calculated and aggregated from mini-batches. BiN differs from DAIN and IN in that we propose to jointly normalize the input time-series along both the temporal and feature dimensions, taking into account the property of bilinear projection in TABL networks.

The core idea in TABL networks is the separate modeling of linear dependencies along the temporal and feature dimensions. That is, the interactions between temporal slices and feature slices are captured by a bilinear projection:

$$\bar{\mathbf{X}} = \mathbf{W}_1 \mathbf{X} \mathbf{W}_2 \tag{4}$$

where $\mathbf{W}_1 \in \mathbb{R}^{D' \times D}$ and $\mathbf{W}_2 \in \mathbb{R}^{T \times T'}$ are the projection parameters, and $\bar{\mathbf{X}} \in \mathbb{R}^{D' \times T'}$ is the transformed series.
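In code, the bilinear mapping of Eq. (4) is simply a pair of matrix products. The sizes below are illustrative, not taken from the experiments:

```python
import numpy as np

def bilinear_project(X, W1, W2):
    """Bilinear mapping of Eq. (4): W1 mixes feature slices (rows of X),
    W2 mixes temporal slices (columns of X)."""
    return W1 @ X @ W2

# Illustrative sizes: 40 features over 10 time steps mapped to (60, 5).
D, T, Dp, Tp = 40, 10, 60, 5
rng = np.random.RandomState(0)
X = rng.randn(D, T)
W1 = rng.randn(Dp, D)   # (D', D): feature-mode projection
W2 = rng.randn(T, Tp)   # (T, T'): temporal-mode projection
X_bar = bilinear_project(X, W1, W2)   # shape (D', T')
```

Note that the bilinear mapping uses $D'D + TT'$ parameters, far fewer than a dense layer over the flattened series would need, which is one reason these layers stay compact.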

In Eq. (4), $\mathbf{W}_2$ linearly combines the temporal slices ($\mathbf{x}_{2,t}$) in $\mathbf{X}$. That is, the function of $\mathbf{W}_2$ is to capture linear patterns in local temporal movements. On the other hand, $\mathbf{W}_1$ linearly combines the set of feature slices ($\mathbf{x}_{1,i}$), i.e., the row vectors of $\mathbf{X}$, to model local interactions among the different univariate series.

Due to the above property, it is intuitive to shift and scale not only the distribution of the temporal slices $\mathbf{x}_{2,t}$ but also that of the feature slices $\mathbf{x}_{1,i}$. To this end, we propose BiN, which can learn how to jointly manipulate the input data distribution along the temporal and feature dimensions.

The normalization along the temporal dimension in BiN is described by the following equations:

$$\bar{\mathbf{x}}_2 = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_{2,t} \tag{5a}$$
$$\boldsymbol{\sigma}_2 = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (\mathbf{x}_{2,t} - \bar{\mathbf{x}}_2) \odot (\mathbf{x}_{2,t} - \bar{\mathbf{x}}_2)} \tag{5b}$$
$$\tilde{\mathbf{X}} = (\mathbf{X} - \bar{\mathbf{x}}_2 \mathbf{1}_T^{\top}) \oslash (\boldsymbol{\sigma}_2 \mathbf{1}_T^{\top}) \tag{5c}$$
$$\mathbf{X}_2 = (\boldsymbol{\gamma}_2 \mathbf{1}_T^{\top}) \odot \tilde{\mathbf{X}} + \boldsymbol{\beta}_2 \mathbf{1}_T^{\top} \tag{5d}$$

where $\boldsymbol{\gamma}_2 \in \mathbb{R}^{D}$ and $\boldsymbol{\beta}_2 \in \mathbb{R}^{D}$ are two learnable weight vectors of BiN. In addition, $\mathbf{1}_T \in \mathbb{R}^{T}$ is a constant vector having all elements equal to one and $\mathbf{1}_T^{\top}$ is its transpose.

In short, given an input series, we first calculate the mean temporal slice $\bar{\mathbf{x}}_2$ and its standard deviation $\boldsymbol{\sigma}_2$ as in Eq. (5a, 5b), which are then used to standardize each temporal slice of the input as in Eq. (5c) before applying element-wise scaling (using $\boldsymbol{\gamma}_2$) and shifting (using $\boldsymbol{\beta}_2$) as in Eq. (5d).

In order to interpret the effects of Eq. (5), we can view the input series $\mathbf{X}$ as a set of $T$ temporal slices, i.e., a set of $T$ points in a $D$-dimensional space. The process in Eq. (5c) moves this set of points around the origin and controls their spread while keeping their arrangement pattern similar. Suppose we have two input series $\mathbf{X}$ and $\mathbf{Y}$ whose temporal slices spread and lie in two completely different areas of this $D$-dimensional space but have the same arrangement pattern. Without the alignment performed by Eq. (5c), we cannot effectively capture the linear or nonlinear arrangement patterns of these points using $\mathbf{W}_2$ in Eq. (4) (nonlinear patterns can be approximated by several piece-wise linear patterns, obtained by setting the second dimension of $\mathbf{W}_2$ larger than one, i.e., $T' > 1$). Here we should note that although BiN applies additional scaling and shifting as in Eq. (5d) after the alignment, the values of $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are the same for every input series, thus still preserving their alignment. Since $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are optimized together with the TABL network's parameters, they enable BiN to manipulate the aligned distributions to match the statistics of other layers.
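The alignment property of Eq. (5) can be sketched directly: two series with the same arrangement pattern but different location and scale map to (nearly) the same normalized representation. A minimal numpy sketch of Eq. (5):

```python
import numpy as np

def bin_temporal(X, gamma2, beta2, eps=1e-8):
    """Temporal-mode normalization of BiN, Eq. (5): standardize each temporal
    slice with the series' own statistics, then scale/shift with learned vectors.

    X: (D, T) series; gamma2, beta2: learnable (D,) vectors.
    """
    mean2 = X.mean(axis=1, keepdims=True)              # Eq. (5a): mean temporal slice
    std2 = X.std(axis=1, keepdims=True) + eps          # Eq. (5b)
    X_tilde = (X - mean2) / std2                       # Eq. (5c): alignment
    return gamma2[:, None] * X_tilde + beta2[:, None]  # Eq. (5d): learned scale/shift
```

Because gamma2 and beta2 are shared across all input series, the alignment produced by Eq. (5c) is preserved after the learned scaling and shifting.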

While the effects of non-stationarity in the temporal mode are often visible and have been heavily studied, its effects when considered from the feature-dimension perspective are less obvious. To see this, let us now view the series $\mathbf{X}$ as a set of $D$ points (its feature slices) in a $T$-dimensional space. Let us also take the previous scenario where two series, $\mathbf{X}$ and $\mathbf{Y}$, have temporal slices scattered in different regions of a $D$-dimensional coordinate system (viewed under the temporal perspective) before the normalization step in Eq. (5). When the temporal slices of $\mathbf{X}$ and $\mathbf{Y}$ are very far apart, then viewed from the feature perspective, these two series are also likely to possess feature slices distributed in two different regions of a $T$-dimensional coordinate system, despite having a very similar arrangement. This scenario also prevents $\mathbf{W}_1$ in TABL networks from effectively capturing the prominent linear/nonlinear patterns existing in the feature dimension of all input series. Thus, BiN also normalizes the input series along the feature dimension as follows:


$$\bar{\mathbf{x}}_1 = \frac{1}{D} \sum_{i=1}^{D} \mathbf{x}_{1,i} \tag{6a}$$
$$\boldsymbol{\sigma}_1 = \sqrt{\frac{1}{D} \sum_{i=1}^{D} (\mathbf{x}_{1,i} - \bar{\mathbf{x}}_1) \odot (\mathbf{x}_{1,i} - \bar{\mathbf{x}}_1)} \tag{6b}$$
$$\tilde{\mathbf{X}}' = (\mathbf{X} - \mathbf{1}_D \bar{\mathbf{x}}_1^{\top}) \oslash (\mathbf{1}_D \boldsymbol{\sigma}_1^{\top}) \tag{6c}$$
$$\mathbf{X}_1 = (\mathbf{1}_D \boldsymbol{\gamma}_1^{\top}) \odot \tilde{\mathbf{X}}' + \mathbf{1}_D \boldsymbol{\beta}_1^{\top} \tag{6d}$$

where $\boldsymbol{\gamma}_1 \in \mathbb{R}^{T}$ and $\boldsymbol{\beta}_1 \in \mathbb{R}^{T}$ are two learnable weights, and $\mathbf{1}_D \in \mathbb{R}^{D}$ is the all-ones vector.

Overall, BiN takes as input the series $\mathbf{X}$ and outputs $\mathbf{Y}$, which is a linear combination of $\mathbf{X}_1$ and $\mathbf{X}_2$ from Eq. (6d) and (5d), respectively:

$$\mathbf{Y} = \lambda_1 \mathbf{X}_1 + \lambda_2 \mathbf{X}_2 \tag{7}$$

where $\lambda_1$ and $\lambda_2$ are two learnable scalars.
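Putting Eqs. (5)-(7) together, the full layer can be sketched as follows. This is an illustrative numpy rendering of the forward pass for one series; in practice the parameters are trained jointly with the TABL network:

```python
import numpy as np

def bin_forward(X, gamma1, beta1, gamma2, beta2, lam1, lam2, eps=1e-8):
    """Sketch of the full BiN layer for one series X of shape (D, T), Eqs. (5)-(7).

    gamma2, beta2: (D,) vectors for the temporal mode; gamma1, beta1: (T,)
    vectors for the feature mode; lam1, lam2: learnable mixing scalars.
    """
    # Temporal mode, Eq. (5): statistics over time for each feature.
    m2 = X.mean(axis=1, keepdims=True)
    s2 = X.std(axis=1, keepdims=True) + eps
    X2 = gamma2[:, None] * (X - m2) / s2 + beta2[:, None]
    # Feature mode, Eq. (6): statistics over features at each time instance.
    m1 = X.mean(axis=0, keepdims=True)
    s1 = X.std(axis=0, keepdims=True) + eps
    X1 = gamma1[None, :] * (X - m1) / s1 + beta1[None, :]
    # Eq. (7): learnable combination of the two normalized views.
    return lam1 * X1 + lam2 * X2
```

Setting one of the mixing scalars to zero recovers a purely temporal (or purely feature-wise) normalization, so the layer can learn how much each mode contributes.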

IV Experiments

In order to evaluate the proposed BiN layer, we conducted an empirical analysis on FI-2010 [8], a large-scale, publicly available Limit Order Book (LOB) dataset, which contains buy and sell limit order information (prices and volumes) over ten business days for five Finnish stocks traded on the Helsinki Exchange (operated by NASDAQ Nordic). At each time instance, the dataset contains the prices and volumes of the top 10 levels on both the buy and sell sides, leading to a 40-dimensional vector representation. Along with this original information, hand-crafted features are also provided by the database.

Using this dataset, we investigated the problem of mid-price movement prediction over the next $H \in \{10, 20, 50\}$ order events. The mid-price at a given time instance is the mean of the best bid (buy) and best ask (sell) prices, which is a virtual quantity since no trade can take place at exactly this price at the given time. Its movements (stationary, increasing, decreasing), however, reflect the dynamics of the LOB and the market. Therefore, being able to predict its future movements is of great importance. For more information on FI-2010 and LOB, we refer the reader to [8].
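As a concrete illustration of this prediction target, the sketch below computes mid-prices and direction labels. Note this is a simplified variant: the actual FI-2010 labels are derived from a smoothed (averaged) future mid-price, and the dead-band threshold here is hypothetical.

```python
import numpy as np

def midprice_labels(best_bid, best_ask, horizon, threshold=2e-5):
    """Label each event by the direction of the mid-price `horizon` events ahead.

    best_bid, best_ask: (N,) price arrays.
    threshold is an illustrative dead-band separating 'stationary' from moves.
    Returns labels in {0: down, 1: stationary, 2: up} for the first N - horizon events.
    """
    mid = (best_bid + best_ask) / 2.0            # the (virtual) mid-price
    rel = (mid[horizon:] - mid[:-horizon]) / mid[:-horizon]
    labels = np.ones_like(rel, dtype=int)        # stationary by default
    labels[rel > threshold] = 2                  # up
    labels[rel < -threshold] = 0                 # down
    return labels
```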

We followed the same experimental setup proposed in [19], which uses the first 7 days for training and the last 3 days for evaluation. We also used the TABL architecture that produced the best performance in [19], denoted C(TABL) as in [19]. The results for TABL networks applying our BiN layer or a BN layer as an input normalization layer are denoted BiN-C(TABL) and BN-C(TABL), respectively.

Accuracy and averaged Precision, Recall, and F1 are reported as the performance metrics. Since FI-2010 is an imbalanced dataset, we focus our analysis on the F1 measure. In addition, each experiment was run five times and the median value is reported.
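The class-averaged metrics above can be computed as in this short sketch (the class encoding 0 = down, 1 = stationary, 2 = up is illustrative; libraries such as scikit-learn provide the same macro averaging):

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes=3):
    """Class-averaged (macro) precision, recall and F1 for movement classes."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)
```

Averaging over classes, rather than over samples, is what makes the F1 measure informative on an imbalanced dataset such as FI-2010, where the stationary class dominates.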

TABLE I: Experiment results (Accuracy %, Precision %, Recall %, F1 %) for the prediction horizons $H = 10$, $H = 20$, and $H = 50$, comparing CNN [22], LSTM [23], C(BL) [19], DeepLOB [25], DAIN-MLP [11], DAIN-RNN [11], C(TABL) [19], BN-C(TABL), and BiN-C(TABL). Bold-face numbers denote the best F1 measure among the normalization strategies.

Table I shows the experimental results for the three prediction horizons of the proposed BiN-C(TABL) networks in comparison with the original TABL architecture C(TABL), other input normalization strategies BN-C(TABL), DAIN-MLP, and DAIN-RNN (the lower section of each horizon), as well as recent state-of-the-art deep architectures (the upper section).

It is clear that our proposed BiN layer, when used to normalize the input data, yields a significant improvement over the original TABL networks in the average F1 measure across all horizons. Compared with DAIN, the performance achieved by our normalization strategy coupled with TABL networks far exceeds that of DAIN coupled with MLP and RNN. In addition, BN shows inferior results when used to normalize the input data, which is expected since BN was originally designed to reduce covariate shift between the hidden layers of Convolutional Neural Networks.

Comparing BiN-C(TABL) with the state-of-the-art CNN-LSTM architecture with 11 hidden layers, DeepLOB [25], our proposed normalization layer helps TABL networks with only 2 hidden layers to significantly close the gap for horizons $H = 10$ and $H = 20$, while outperforming DeepLOB by a large margin for $H = 50$.

Since BN has been widely used for hidden layers, we also compare the performance of BiN and BN when applied to all layers in Table II. The upper section of each horizon shows the performance of BiN and BN when applied only to the input layer, while the lower section shows their performance when applied to all layers. As we can see from Table II, there is virtually no difference between the two arrangements. This result shows that adding normalization to the hidden layers brings no improvement for either strategy, and that the improvements obtained for TABL networks are indeed attributable to the input data normalization performed by BiN.

TABLE II: Comparison (Accuracy %, Precision %, Recall %, F1 %) between Bilinear Normalization and Batch Normalization for the prediction horizons $H = 10$, $H = 20$, and $H = 50$, when applied to only the input layer (BiN-C(TABL) and BN-C(TABL)) or to all layers (BiN-C(TABL)-BiN and BN-C(TABL)-BN).

V Conclusions

In this paper, we proposed BiN, an efficient time-series normalization strategy designed to tackle the potential difficulties posed by noisy, non-stationary financial time-series. By taking into account the properties of bilinear projection, we demonstrated that BiN can greatly improve the performance of TABL networks in predicting mid-price movements.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §I.
  • [2] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
  • [3] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §II.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §II.
  • [5] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §I, §II.
  • [6] A. Iosifidis, A. Tefas, and I. Pitas (2012) View-invariant action recognition based on artificial neural networks. IEEE transactions on neural networks and learning systems 23 (3), pp. 412–424. Cited by: §I.
  • [7] S. Nayak, B. B. Misra, and H. S. Behera (2014) Impact of data normalization on stock index forecasting. Int. J. Comp. Inf. Syst. Ind. Manag. Appl 6, pp. 357–369. Cited by: §I.
  • [8] A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Benchmark dataset for mid-price prediction of limit order book data. arXiv preprint arXiv:1705.03233. Cited by: §IV, §IV.
  • [9] A. Ntakaris, G. Mirone, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2019) Feature engineering for mid-price prediction with deep learning. IEEE Access 7, pp. 82390–82412. Cited by: §I.
  • [10] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2018) Temporal bag-of-features learning for predicting mid price movements using high frequency limit order book data. IEEE Transactions on Emerging Topics in Computational Intelligence. Cited by: §I.
  • [11] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2019) Deep adaptive input normalization for price forecasting using limit order book data. arXiv preprint arXiv:1902.07892. Cited by: §I, §II, TABLE I.
  • [12] N. Passalis, A. Tsantekidis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Time-series classification using neural bag-of-features. In Signal Processing Conference (EUSIPCO), 2017 25th European, pp. 301–305. Cited by: §I.
  • [13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §I.
  • [14] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §I.
  • [15] X. Shao (2015) Self-normalization for time series: a review of recent developments. Journal of the American Statistical Association 110 (512), pp. 1797–1817. Cited by: §I.
  • [16] D. T. Tran, M. Gabbouj, and A. Iosifidis (2017) Multilinear class-specific discriminant analysis. Pattern Recognition Letters. Cited by: §I.
  • [17] D. T. Tran, M. Gabbouj, and A. Iosifidis (2020) Multilinear compressive learning with prior knowledge. arXiv preprint arXiv:2002.07203. Cited by: §I.
  • [18] D. T. Tran, A. Iosifidis, and M. Gabbouj (2018) Improving efficiency in convolutional neural networks with multilinear filters. Neural Networks 105, pp. 328–339. Cited by: §I.
  • [19] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj (2018) Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE transactions on neural networks and learning systems 30 (5), pp. 1407–1418. Cited by: §I, §I, TABLE I, §IV.
  • [20] D. T. Tran, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Tensor representation in high-frequency financial data for price change prediction. IEEE Symposium Series on Computational Intelligence (SSCI). Cited by: §I.
  • [21] D. T. Tran, M. Yamac, A. Degerli, M. Gabbouj, and A. Iosifidis (2019) Multilinear compressive learning. arXiv preprint arXiv:1905.07481. Cited by: §I.
  • [22] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Forecasting stock prices from the limit order book using convolutional neural networks. In Business Informatics (CBI), 2017 IEEE 19th Conference on, Vol. 1, pp. 7–12. Cited by: TABLE I.
  • [23] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Using deep learning to detect price change indications in financial markets. In Signal Processing Conference (EUSIPCO), 2017 25th European, pp. 2511–2515. Cited by: TABLE I.
  • [24] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §I, §II.
  • [25] Z. Zhang, S. Zohren, and S. Roberts (2019) DeepLOB: deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing 67 (11), pp. 3001–3012. Cited by: §I, TABLE I, §IV.