I Introduction
Although we have observed great successes in time-series and sequence analysis [19, 25, 10, 16, 20, 12], and the topic in general has been extensively studied, we still face great challenges when working with multivariate time-series obtained from financial markets, especially high-frequency data. In High-Frequency Trading (HFT), traders focus on short-term investment horizons and profit from small margins of price changes at large volumes. Thus, HFT traders rely on market volatility to make a profit. This, however, also poses great challenges when dealing with data obtained from the HFT market.
Due to the unique characteristics of the financial market, a great amount of effort is still needed in order to achieve the same successes as in Computer Vision (CV) [14, 13, 6, 18, 21, 17] and Natural Language Processing (NLP) [1, 2]. On one hand, the problems targeted in CV and NLP mainly involve cognitive tasks, whose inputs, such as images or language, are intuitive and innate for human beings to visualize and interpret, while interpreting financial time-series is not a natural human ability. On the other hand, images or audio are well-behaved signals in the sense that their range or variance is known or can be easily manipulated without losing the characteristics of the signal, whereas financial observations, such as stock prices, can exhibit drastic changes over time, and even at the same time instance, signals from different modalities can differ greatly. Thus, data preprocessing plays an important role when working with financial data.
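For instance, two common schemes discussed next, z-score and min-max scaling, can fail when statistics estimated from past observations are applied to future ones. A minimal numpy sketch (the price values are purely illustrative and not from the paper):

```python
import numpy as np

def zscore(x, mean, std):
    # standardize using statistics fitted on past (training) data
    return (x - mean) / std

def minmax(x, lo, hi):
    # scale into [0, 1] using the training-time minimum and maximum
    return (x - lo) / (hi - lo)

train = np.array([10.0, 11.0, 12.0, 13.0, 14.0])  # past prices
future = np.array([25.0, 26.0])                   # after a regime change

mu, sigma = train.mean(), train.std()
lo, hi = train.min(), train.max()

# Training data maps into the expected ranges...
assert abs(zscore(train, mu, sigma).mean()) < 1e-9
assert minmax(train, lo, hi).min() == 0.0 and minmax(train, lo, hi).max() == 1.0

# ...but future observations fall far outside [0, 1], illustrating the
# non-stationarity / concept-drift problem.
print(minmax(future, lo, hi))
```
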
Perhaps the most popular normalization schemes for time-series are z-score normalization, i.e., transforming the data to have zero mean and unit standard deviation, and min-max normalization, i.e., scaling the values of each dimension into the range [0, 1]. The limitation of z-score or min-max normalization lies in the fact that the statistics of past observations (from the training phase) are used to normalize future observations, which might possess completely different magnitudes due to non-stationarity or concept drift. In order to tackle this problem, several sophisticated methods have been proposed [15, 7]. In addition, hand-crafted stationary features and econometric or quantitative indicators built on mathematical assumptions about the underlying processes are also widely used. These financial indicators can sometimes perform relatively well, but only after a long process of experimentation and validation, which prevents their practical implementation in HFT [9]. Different from the aforementioned model-based approaches, data-driven normalization methods aim to directly estimate relevant statistics specific to the given analysis task in an end-to-end manner. That is, the normalization step is implemented as a neural network layer whose parameters are jointly optimized with the other layers via stochastic gradient descent. Perhaps the most widely used formulation is Batch Normalization (BN)
[5], which was originally proposed for visual data. BN, however, is mostly used between hidden layers to reduce internal covariate shift. Proposed for the task of visual style transfer, Instance Normalization (IN) [24] was very successful in normalizing the contrast level of generated images. For time-series, an input normalization layer that learns to adaptively estimate the normalization statistics of a given time-series, outperforming existing schemes, was proposed in [11]. Existing data-driven approaches, however, neglect the tensor structure inherent in multivariate time-series, performing normalization only along the temporal mode. In order to take advantage of the tensor representation, the authors in
[19] proposed TABL networks, which separately capture linear dependencies along the temporal and feature dimensions in each layer. Since a TABL network performs a sequence of weighted sums alternating between the temporal and feature dimensions, we propose a data-driven normalization strategy that takes into account statistics from both the temporal and feature dimensions, dubbed Bilinear Normalization (BiN). Combining BiN with TABL, we show that BiN-TABL networks significantly outperform TABL networks using other normalization strategies on the mid-price movement prediction problem using a large-scale Limit Order Book dataset. The remainder of the paper is organized as follows. Section II reviews related literature on data normalization. In Section III, we describe the motivation and processing steps of our Bilinear Normalization layer. In Section IV, we provide information about the experimental setup and present and discuss our empirical results. Section V concludes our work.
II Related Work
Deep neural networks have seen significant improvement over the past decades thanks to advancements in both hardware and algorithms. On the algorithmic side, training deep networks comprising multiple layers can be challenging since the distribution of each layer's inputs can change significantly during the iterative optimization process, which harms the error feedback signals. Thus, by manipulating the statistics between layers, we have seen great improvements in optimizing deep neural networks. An early example is the class of initialization methods [3, 4], which initialize the network's parameters based on each layer's statistics. However, most of the initialization methods are data-independent. A more active approach is the direct manipulation of the statistics by learning them jointly with the network's parameters, with the early work called Batch Normalization (BN) [5]. BN estimates the global mean and variance of the input data by gradually accumulating the mini-batch statistics. After standardizing the data to have zero mean and unit variance, BN also learns to scale and shift the distribution. Instead of mini-batch statistics, Instance Normalization (IN) [24] uses sample-level statistics and learns how to normalize each image so that its contrast matches that of a predefined style image in the visual style transfer problem. Both BN and IN were originally proposed for visual data, although BN has also been widely used in NLP.
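The contrast between batch-level statistics (BN) and sample-level statistics (IN) can be sketched as follows. This is a simplified illustration only: running averages and the learned scale/shift parameters of both methods are omitted.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # BN: one mean/variance per channel, shared across the whole mini-batch
    mu = x.mean(axis=(0, 2), keepdims=True)
    var = x.var(axis=(0, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # IN: one mean/variance per channel *per sample*
    mu = x.mean(axis=2, keepdims=True)
    var = x.var(axis=2, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# a mini-batch of signals shaped (batch, channels, length)
x = np.random.randn(8, 3, 32)

# IN forces every (sample, channel) slice to zero mean; BN only does so
# on average over the batch.
assert np.allclose(instance_norm(x).mean(axis=2), 0, atol=1e-6)
assert np.allclose(batch_norm(x).mean(axis=(0, 2)), 0, atol=1e-6)
```
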
We are not aware of any data-driven normalization scheme for time-series, except the recently proposed Deep Adaptive Input Normalization (DAIN) formulation [11], which applies normalization to the input time-series via a 3-stage procedure. Specifically, let $\mathbf{X} \in \mathbb{R}^{D \times T}$ be a collection of $D$ univariate time-series, where $T$ denotes the temporal dimension and $D$ denotes the spatial/feature dimension. In addition, we denote by $\mathbf{x}_{2,t} \in \mathbb{R}^{D}$ the representation (temporal slice) at time instance $t$ of series $\mathbf{X}$. Here the first subscript denotes the tensor mode (1 for feature slices and 2 for temporal slices). DAIN first shifts the input time-series by:

$$\tilde{\mathbf{X}} = \mathbf{X} - (\mathbf{W}_a \bar{\mathbf{x}}_2)\mathbf{1}^{\mathsf{T}} \quad (1)$$

where $\mathbf{W}_a \in \mathbb{R}^{D \times D}$ is a learnable weight matrix that estimates the amount of shifting from the mean temporal slice $\bar{\mathbf{x}}_2 = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_{2,t}$ calculated from each series.
After shifting, the intermediate representation $\tilde{\mathbf{X}}$ is then scaled as follows:

$$\hat{\mathbf{X}} = \tilde{\mathbf{X}} \oslash \big((\mathbf{W}_b \boldsymbol{\sigma})\mathbf{1}^{\mathsf{T}}\big) \quad (2)$$

where $\mathbf{W}_b \in \mathbb{R}^{D \times D}$ is a learnable weight matrix that estimates the amount of scaling from the standard deviation $\boldsymbol{\sigma} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} \tilde{\mathbf{x}}_{2,t} \odot \tilde{\mathbf{x}}_{2,t}}$ along the temporal dimension. In Eq. (2), the square-root operator is applied elementwise; $\odot$ and $\oslash$ denote elementwise multiplication and division, respectively.
The final step in DAIN is gating, which aims to suppress irrelevant features:

$$\mathbf{X}' = \hat{\mathbf{X}} \odot (\boldsymbol{\gamma}\mathbf{1}^{\mathsf{T}}), \quad \boldsymbol{\gamma} = \mathrm{sigmoid}\Big(\mathbf{W}_c \frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{x}}_{2,t} + \mathbf{d}\Big) \quad (3)$$

where $\mathbf{W}_c \in \mathbb{R}^{D \times D}$ and $\mathbf{d} \in \mathbb{R}^{D}$ are learnable weights.
Overall, DAIN takes the input time-series $\mathbf{X}$ and outputs its normalized version $\mathbf{X}'$ by manipulating its temporal slices. As we will see in the next section, our BiN formulation is much simpler (requiring fewer calculations) and more intuitive compared to DAIN when used with TABL networks.
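The 3-stage DAIN procedure can be sketched in numpy as below. Randomly initialized matrices stand in for the learned weights (in practice they are trained jointly with the rest of the network), and the near-identity initialization is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 4, 10                      # features x time steps
X = rng.normal(5.0, 2.0, (D, T))  # one multivariate series

# Stand-ins for DAIN's learnable weights W_a, W_b, W_c, d.
Wa = np.eye(D) + 0.01 * rng.normal(size=(D, D))
Wb = np.eye(D) + 0.01 * rng.normal(size=(D, D))
Wc = 0.01 * rng.normal(size=(D, D))
d = np.zeros(D)

# Stage 1: adaptive shifting, driven by the mean temporal slice.
a = Wa @ X.mean(axis=1)
X1 = X - a[:, None]

# Stage 2: adaptive scaling, driven by the std of the shifted series.
b = Wb @ np.sqrt((X1 ** 2).mean(axis=1))
X2 = X1 / b[:, None]

# Stage 3: gating, suppressing irrelevant features.
gamma = 1.0 / (1.0 + np.exp(-(Wc @ X2.mean(axis=1) + d)))
X3 = X2 * gamma[:, None]

assert X3.shape == (D, T)
```
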
III Bilinear Normalization (BiN)
Our proposed BiN layer formulation bears some resemblance to DAIN and IN in that BiN also uses sample-level statistics to manipulate the input distribution. That is, each input sample is normalized based on its own statistics only. This is different from BN, which uses global statistics calculated and aggregated from mini-batches. BiN differs from DAIN and IN in that we propose to jointly normalize the input time-series along both the temporal and feature dimensions, taking into account the properties of the bilinear projection in TABL networks.
The core idea in TABL networks is the separate modeling of linear dependencies along the temporal and feature dimensions. That is, the interactions between temporal slices and feature slices are captured by a bilinear projection:

$$\bar{\mathbf{X}} = \mathbf{W}_1 \mathbf{X} \mathbf{W}_2 \quad (4)$$

where $\mathbf{W}_1 \in \mathbb{R}^{D' \times D}$ and $\mathbf{W}_2 \in \mathbb{R}^{T \times T'}$ are the projection parameters, and $\bar{\mathbf{X}} \in \mathbb{R}^{D' \times T'}$ is the transformed series.
In Eq. (4), $\mathbf{W}_2$ linearly combines temporal slices ($\mathbf{x}_{2,t}$) in $\mathbf{X}$. That is, the function of $\mathbf{W}_2$ is to capture linear patterns in local temporal movements. On the other hand, $\mathbf{W}_1$ linearly combines the set of feature slices ($\mathbf{x}_{1,d}$), i.e., the row vectors of $\mathbf{X}$, to model local interactions among the different univariate series. Due to this property, it is intuitive to shift and scale not only the distribution of the temporal slices but also that of the feature slices. To this end, we propose BiN, which can learn how to jointly manipulate the input data distribution along the temporal and feature dimensions.
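The bilinear projection of Eq. (4) is a pair of matrix products. A short numpy sketch with illustrative shapes (the specific dimensions below are assumptions, not values from the paper):

```python
import numpy as np

# X has D feature slices (rows) and T temporal slices (columns).
D, T, D_out, T_out = 40, 10, 60, 5
rng = np.random.default_rng(1)

X = rng.normal(size=(D, T))
W1 = rng.normal(size=(D_out, D))   # mixes feature slices (rows)
W2 = rng.normal(size=(T, T_out))   # mixes temporal slices (columns)

# Eq. (4): each output row combines rows of X, each output column
# combines columns of X.
X_bar = W1 @ X @ W2
assert X_bar.shape == (D_out, T_out)
```
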
The normalization along the temporal dimension in BiN is described by the following equations:

$$\bar{\mathbf{x}}_2 = \frac{1}{T}\sum_{t=1}^{T} \mathbf{x}_{2,t} \quad (5a)$$
$$\boldsymbol{\sigma}_2 = \sqrt{\frac{1}{T}\sum_{t=1}^{T} (\mathbf{x}_{2,t} - \bar{\mathbf{x}}_2) \odot (\mathbf{x}_{2,t} - \bar{\mathbf{x}}_2)} \quad (5b)$$
$$\tilde{\mathbf{X}}_2 = (\mathbf{X} - \bar{\mathbf{x}}_2\mathbf{1}^{\mathsf{T}}) \oslash (\boldsymbol{\sigma}_2\mathbf{1}^{\mathsf{T}}) \quad (5c)$$
$$\mathbf{X}_2 = (\boldsymbol{\gamma}_2\mathbf{1}^{\mathsf{T}}) \odot \tilde{\mathbf{X}}_2 + \boldsymbol{\beta}_2\mathbf{1}^{\mathsf{T}} \quad (5d)$$

where $\boldsymbol{\gamma}_2 \in \mathbb{R}^{D}$ and $\boldsymbol{\beta}_2 \in \mathbb{R}^{D}$ are two learnable weight vectors of BiN. In addition, $\mathbf{1} \in \mathbb{R}^{T}$ is a constant vector having all elements equal to one and $\mathbf{1}^{\mathsf{T}}$ is its transpose.
In short, given an input series, we first calculate the mean temporal slice $\bar{\mathbf{x}}_2$ and its standard deviation $\boldsymbol{\sigma}_2$ as in Eq. (5a, 5b), which are then used to standardize each temporal slice of the input as in Eq. (5c), before applying elementwise scaling (using $\boldsymbol{\gamma}_2$) and shifting (using $\boldsymbol{\beta}_2$) as in Eq. (5d).
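The temporal normalization of Eq. (5) can be sketched as follows. The learnable scale and shift are set to their usual initial values of one and zero, and the small epsilon guarding against division by zero is an implementation assumption.

```python
import numpy as np

def bin_temporal(X, gamma2, beta2, eps=1e-8):
    mean2 = X.mean(axis=1, keepdims=True)       # Eq. (5a): mean temporal slice
    std2 = X.std(axis=1, keepdims=True) + eps   # Eq. (5b): per-feature std
    X_tilde = (X - mean2) / std2                # Eq. (5c): standardization
    # Eq. (5d): elementwise scale (gamma2) and shift (beta2)
    return gamma2[:, None] * X_tilde + beta2[:, None]

D, T = 4, 10
X = np.random.default_rng(2).normal(3.0, 5.0, (D, T))
Y = bin_temporal(X, np.ones(D), np.zeros(D))

# With identity scale/shift, every feature row becomes standardized.
assert np.allclose(Y.mean(axis=1), 0, atol=1e-6)
assert np.allclose(Y.std(axis=1), 1, atol=1e-3)
```
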
In order to interpret the effects of Eq. (5), we can view the input series $\mathbf{X}$ as the set of its $T$ temporal slices, i.e., a set of $T$ points in a $D$-dimensional space. The process in Eq. (5c) moves this set of points around the origin and controls their spread while keeping their arrangement pattern similar. If we have two input series $\mathbf{X}$ and $\mathbf{Y}$ whose temporal slices spread and lie in two completely different areas of this $D$-dimensional space but have the same arrangement pattern, then without the alignment performed by Eq. (5c) we cannot effectively capture the linear or nonlinear^{1} arrangement patterns of these points using $\mathbf{W}_2$ in Eq. (4). Here we should note that although BiN applies additional scaling and shifting as in Eq. (5d) after the alignment, the values of $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are the same for every input series, thus still preserving their alignment. Since $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are optimized together with the TABL network's parameters, they enable BiN to manipulate the aligned distributions to match the statistics of the other layers.

^{1} Nonlinear patterns can be estimated by several piecewise linear patterns (by setting the second dimension of $\mathbf{W}_2$ larger than 1, i.e., $T' > 1$).
While the effects of non-stationarity in the temporal mode are often visible and have been heavily studied, its effects when considered from the feature-dimension perspective are less obvious. To see this, let us now view the series $\mathbf{X}$ as the set of its $D$ feature slices, i.e., a set of $D$ points in a $T$-dimensional space. Let us also take the previous scenario where two series, $\mathbf{X}$ and $\mathbf{Y}$, have their temporal slices scattered in different regions of a $D$-dimensional coordinate system (viewed under the temporal perspective) before the normalization step in Eq. (5). When these two sets of temporal slices are very far apart, then viewed from the feature perspective the two series are also likely to possess feature slices distributed in two different regions of a $T$-dimensional coordinate system, although having very similar arrangements. This scenario also prevents $\mathbf{W}_1$ in TABL networks from effectively capturing the prominent linear/nonlinear patterns existing in the feature dimension of all input series. Thus, BiN also normalizes the input series along the feature dimension as follows:
$$\bar{\mathbf{x}}_1 = \frac{1}{D}\sum_{d=1}^{D} \mathbf{x}_{1,d} \quad (6a)$$
$$\boldsymbol{\sigma}_1 = \sqrt{\frac{1}{D}\sum_{d=1}^{D} (\mathbf{x}_{1,d} - \bar{\mathbf{x}}_1) \odot (\mathbf{x}_{1,d} - \bar{\mathbf{x}}_1)} \quad (6b)$$
$$\tilde{\mathbf{X}}_1 = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}_1^{\mathsf{T}}) \oslash (\mathbf{1}\boldsymbol{\sigma}_1^{\mathsf{T}}) \quad (6c)$$
$$\mathbf{X}_1 = (\mathbf{1}\boldsymbol{\gamma}_1^{\mathsf{T}}) \odot \tilde{\mathbf{X}}_1 + \mathbf{1}\boldsymbol{\beta}_1^{\mathsf{T}} \quad (6d)$$

where $\boldsymbol{\gamma}_1 \in \mathbb{R}^{T}$ and $\boldsymbol{\beta}_1 \in \mathbb{R}^{T}$ are two learnable weights. Here $\mathbf{1} \in \mathbb{R}^{D}$, and the feature slices $\mathbf{x}_{1,d} \in \mathbb{R}^{T}$ are treated as column vectors.
Overall, BiN takes as input the series $\mathbf{X}$ and outputs its normalized version $\mathbf{Y}$, which is a linear combination of $\mathbf{X}_1$ and $\mathbf{X}_2$ from Eq. (6d) and Eq. (5d), respectively:

$$\mathbf{Y} = \lambda_1 \mathbf{X}_1 + \lambda_2 \mathbf{X}_2 \quad (7)$$

where $\lambda_1$ and $\lambda_2$ are two learnable scalars.
IV Experiments
In order to evaluate the proposed BiN layer, we conducted an empirical analysis on FI-2010 [8], a large-scale, publicly available Limit Order Book (LOB) dataset, which contains buy and sell limit order information (prices and volumes) over 10 business days from 5 Finnish stocks traded on the Helsinki Exchange (operated by NASDAQ Nordic). At each time instance, the dataset contains the prices and volumes from the top 10 levels of both the buy and sell sides, leading to a 40-dimensional vector representation. Along with this original information, hand-crafted features are also provided by the database.
Using this dataset, we investigated the problem of mid-price movement prediction over the next events. The mid-price at a given time instance is the mean of the best bid (buy) and best ask (sell) prices; it is a virtual quantity since no trade can take place at this price at that time. Its movements (stationary, increasing, decreasing), however, reflect the dynamics of the LOB and the market. Therefore, being able to predict its future movements is of great importance. For more information on FI-2010 and LOBs, we refer the reader to [8].
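The mid-price and a 3-class movement label can be sketched as below. This is an illustrative labeling scheme (comparing the mean of the next H mid-prices against the current one, with a small noise threshold), not necessarily the exact procedure used to label FI-2010; the prices and the threshold are made-up values.

```python
import numpy as np

def midprice(best_bid, best_ask):
    # mid-price: mean of the best bid and best ask
    return (best_bid + best_ask) / 2.0

def movement_labels(mid, H, threshold=1e-4):
    # compare the mean of the next H mid-prices with the current mid-price;
    # +1 = increasing, -1 = decreasing, 0 = stationary (within the threshold)
    future = np.array([mid[t + 1 : t + 1 + H].mean() for t in range(len(mid) - H)])
    rel = (future - mid[: len(future)]) / mid[: len(future)]
    return np.where(rel > threshold, 1, np.where(rel < -threshold, -1, 0))

bid = np.array([10.00, 10.01, 10.02, 10.05, 10.06, 10.08])
ask = np.array([10.02, 10.03, 10.04, 10.07, 10.08, 10.10])
labels = movement_labels(midprice(bid, ask), H=2)
```
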
We followed the same experimental setup proposed in [19], which used the first 7 days for training and the last 3 days for evaluation. We also used the TABL architecture that produced the best performance in [19], denoted as C(TABL). The results for TABL networks applying our BiN layer or a BN layer as an input normalization layer are denoted as BiN-C(TABL) and BN-C(TABL), respectively.
Accuracy and averaged Precision, Recall and F1 are reported as the performance metrics. Since FI-2010 is an imbalanced dataset, we focus our analysis on the F1 measure. In addition, each experiment was repeated several times and the median value is reported.
TABLE I: Accuracy, Precision, Recall and F1 (all in %) for the three prediction horizons. Models compared per horizon: CNN [22], LSTM [23], C(BL) [19], DeepLOB [25], DAIN-MLP [11], DAIN-RNN [11], C(TABL) [19], BN-C(TABL), BiN-C(TABL).
Table I shows the experimental results for three prediction horizons of the proposed BiN-C(TABL) networks in comparison with the original TABL architecture C(TABL), other input normalization strategies, namely BN-C(TABL), DAIN-MLP and DAIN-RNN (the lower section of each horizon), as well as recent state-of-the-art deep architectures (the upper section).
It is clear that our proposed BiN layer, when used to normalize the input data, yields significant improvements in average F1 measure over the original TABL networks. Compared with DAIN, the performance achieved by our normalization strategy coupled with TABL networks far exceeds that of DAIN coupled with an MLP or an RNN. In addition, BN shows inferior results when used to normalize the input data, which is expected since BN was originally designed to reduce covariate shift between the hidden layers of Convolutional Neural Networks.
Comparing BiN-C(TABL) with DeepLOB [25], a state-of-the-art CNN-LSTM architecture with 11 hidden layers, our proposed normalization layer helps TABL networks with only 2 hidden layers to significantly close the performance gap in the shorter horizons, while outperforming DeepLOB by a large margin in the longest one.
Since BN has been widely used for hidden layers, we also compare the performance of BiN and BN when applied to all layers in Table II. The upper section of each horizon shows the performance of BiN and BN when applied only to the input layer, while the lower section shows their performance when applied to all layers. As we can see from Table II, there is virtually no difference between the two arrangements. This result shows that adding normalization to the hidden layers brings no improvement for either strategy, and that the improvements obtained for TABL networks are indeed attributable to the input data normalization performed by BiN.
TABLE II: Accuracy, Precision, Recall and F1 (all in %) for the three prediction horizons. For each horizon, BN-C(TABL) and BiN-C(TABL) (normalization at the input layer only) are compared with BN-C(TABL)-BN and BiN-C(TABL)-BiN (normalization at all layers).
V Conclusions
In this paper, we proposed BiN, an efficient time-series normalization strategy designed to tackle the potential difficulties posed by noisy, non-stationary financial time-series. By taking into account the properties of bilinear projection, we demonstrated that BiN can greatly improve the performance of TABL networks in predicting mid-price movements.
References
[1] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[2] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[3] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
[4] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
[5] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[6] (2012) View-invariant action recognition based on artificial neural networks. IEEE Transactions on Neural Networks and Learning Systems 23 (3), pp. 412–424.
[7] (2014) Impact of data normalization on stock index forecasting. Int. J. Comp. Inf. Syst. Ind. Manag. Appl. 6, pp. 357–369.
[8] (2017) Benchmark dataset for mid-price prediction of limit order book data. arXiv preprint arXiv:1705.03233.
[9] (2019) Feature engineering for mid-price prediction with deep learning. IEEE Access 7, pp. 82390–82412.
[10] (2018) Temporal bag-of-features learning for predicting mid price movements using high frequency limit order book data. IEEE Transactions on Emerging Topics in Computational Intelligence.
[11] (2019) Deep adaptive input normalization for price forecasting using limit order book data. arXiv preprint arXiv:1902.07892.
[12] (2017) Time-series classification using neural bag-of-features. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 301–305.
[13] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[14] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[15] (2015) Self-normalization for time series: a review of recent developments. Journal of the American Statistical Association 110 (512), pp. 1797–1817.
[16] (2017) Multilinear class-specific discriminant analysis. Pattern Recognition Letters.
[17] (2020) Multilinear compressive learning with prior knowledge. arXiv preprint arXiv:2002.07203.
[18] (2018) Improving efficiency in convolutional neural networks with multilinear filters. Neural Networks 105, pp. 328–339.
[19] (2018) Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE Transactions on Neural Networks and Learning Systems 30 (5), pp. 1407–1418.
[20] (2017) Tensor representation in high-frequency financial data for price change prediction. In IEEE Symposium Series on Computational Intelligence (SSCI).
[21] (2019) Multilinear compressive learning. arXiv preprint arXiv:1905.07481.
[22] (2017) Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI), Vol. 1, pp. 7–12.
[23] (2017) Using deep learning to detect price change indications in financial markets. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 2511–2515.
[24] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
[25] (2019) DeepLOB: deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing 67 (11), pp. 3001–3012.