I. Introduction
Nowadays, the world's economic and social development and wellbeing are heavily influenced by financial markets. People participate in financial activities, which promote the circulation of assets and the development of the world economy, with the ultimate goal of gaining economic benefits. In this light, the success of the participants depends largely on the quality and quantity of information that they possess, as well as their ability to interpret this information for decision-making. Because of this, computational intelligence in finance, which utilizes modern computing methodologies to analyze financial markets for decision-making, has attracted many researchers and practitioners from both academia and industry. Representative topics under this discipline include stock market forecasting [33, 39], algorithmic trading [25, 13], risk assessment [16, 10], asset pricing [5, 19], and portfolio allocation and optimization [7, 2]. Among these objectives, a substantial amount of research effort has been dedicated to prediction and forecasting, since financial decision-making, for the most part, depends on reliable projections about the future.
There are two common approaches, namely fundamental analysis [32] and technical analysis [21], which are currently adopted in predicting future market behaviors. In fundamental analysis, valuation techniques take into account different economic indicators that reflect and affect market movements to establish long-term views on the development of a financial entity. In technical analysis, on the other hand, it is generally believed that the prices themselves already encompass all factors that affect the market dynamics. For this reason, technical analysts construct forecasting models based on series of historical transactions, with the assumption that history tends to repeat itself [21] and that the underlying processes that generate the observed series can be captured by mathematical or computational models.
Although financial time-series forecasting has been extensively studied over the past decades, with a large body of literature dedicated to tackling specific problems, there are still many challenges in processing and analyzing data derived from financial markets, especially data coming from high-frequency intraday activities. Over time, the development of internet technologies, database systems and electronic trading platforms has enabled us to collect a vast amount of digital footprints of the financial market. Enormous volumes of data, while ensuring the statistical significance of any analysis, also create a great computational challenge when building financial prediction models. The computational aspect is especially critical for trading applications that take advantage of statistical arbitrage, which usually exists only for a very short time before market correction [1]. Another challenge posed by financial time-series comes from the fact that they are usually complex, noisy, nonlinear and nonstationary in nature, which leads to difficulties not only in modeling but also in preprocessing.
Techniques for financial time-series prediction fall into two categories: traditional statistical models and machine learning models. In the stochastic-model-based approach, a linear relationship is often assumed among the independent variables. Representative tools in this category include the autoregressive integrated moving average (ARIMA) model and its variants, and generalized autoregressive conditional heteroskedasticity (GARCH) [9], to name a few. While stochastic models often possess nice theoretical properties, their underlying assumptions are often too strong, leading to poor generalization performance on real-world data. On the other hand, machine learning models, which make no prior statistical or structural assumptions, are often capable of modeling complex nonlinear relationships between the independent factors and the prediction targets. For this reason, machine learning models often generalize better than stochastic models in many forecasting scenarios [15, 27].
Among different types of machine learning models, neural networks are the leading solutions for many financial forecasting problems nowadays [17, 33, 39, 35, 8]. The majority of these solutions were adopted from computer vision (CV) and natural language processing (NLP) applications, where neural networks have demonstrated unprecedented successes in the last decade. Although future market prediction based on historical time-series can be cast as a pattern recognition problem similar to those encountered in CV and NLP, and can thus be treated with some degree of success using tools from these fields, the unique characteristics of financial data make market prediction tasks fundamentally different and require special treatment. The majority of problems targeted in CV and NLP concern cognitive tasks in which the data is intuitive and well understood by ordinary people, such as recognizing objects or understanding natural language. Historical financial phenomena, on the other hand, are difficult even for human experts to recognize or interpret, let alone use to speculate about the future. In addition, images, videos and speech, for example, are well-behaved signals in the sense that their value ranges and variances are known and they can be easily processed without losing the essential information within them, while financial time-series are highly volatile and often exhibit concept drift phenomena
[4, 12], i.e., dynamic changes in the relationship between the independent and target variables over time. Because of this, data preprocessing is an important procedure when working with financial time-series. Among the many preprocessing steps, data normalization, one of the most essential steps before building a machine learning model, aims at transforming input variables into a common range to avoid the potential bias induced by large values. For deep neural networks, improperly normalized data can easily lead to numerical issues with the gradient updates. In the literature, there are many normalization methods, such as z-score normalization, min-max normalization, Pareto scaling and power transformation, to name a few [31]. These normalization methods utilize global data statistics, such as the mean, standard deviation or maximum value, to transform the data. For financial time-series, especially those covering long periods, replacing global statistics with local statistics computed over the recent history is a common practice to avoid potential regime shifts, in which recent observations have a significantly different value range than past observations. To deal with this phenomenon, several sophisticated methods have been proposed, for example
[28, 22].

While many static normalization schemes have been developed as described above, we are only aware of one prior work [26] that proposed an adaptive method for input time-series. Different from static approaches, an adaptive data-driven method transforms raw input data using statistics that are identified and learned via optimization. That is, the normalization step is implemented as the first layer in a computation graph, with all of its parameters jointly estimated using stochastic gradient descent. In fact, one of the reasons neural networks work so well is that they are estimated in an end-to-end manner and are thus able to learn data-dependent transformations. We therefore argue that the normalization step for input time-series should also be learned in the same end-to-end manner when employing neural networks in financial forecasting.
In this paper, we propose Bilinear Input Normalization (BiN), a neural network layer that takes into account the bimodal nature of multivariate time-series and performs input data transformation using parameters that are jointly estimated with the other parameters in the network. The preliminary results of this work were presented in [34], which includes a limited analysis and empirical evaluation of BiN for Temporal Attention augmented Bilinear Layer (TABL) networks. In this paper, we provide a more detailed, in-depth presentation and discussion of the proposed method, as well as extensive experiments with another state-of-the-art (SoTA) architecture in financial forecasting, using stock market data from two different markets (US and Nordic).
The remainder of the paper is organized as follows. In Section II, we review related works on data normalization methods, with a focus on normalization schemes for neural networks. Section III describes in detail the motivation and operations of the Bilinear Input Normalization layer. In Section IV, we provide basic information regarding limit order books and describe the problem of predicting stock mid-price dynamics using limit order book data, followed by the experimental setup, dataset descriptions, the results and our analysis. Section V concludes our work.
II. Related Work
Normalization is a scaling or transformation operation, usually linear, that ensures a uniform value range between different data dimensions, reducing the effects of dominant values and outliers [11]. Perhaps the most common normalization method is z-score normalization, which centers the data around the origin with unit standard deviation. There are also works that only center the data, without the scaling step of z-score normalization. The steps in Pareto scaling [23] are similar to z-score normalization, except that the centered data is divided by the square root of the standard deviation instead of the standard deviation itself. A generalization of z-score normalization is the variance stability scaling method [38], which multiplies the z-score standardized data by the ratio between the mean and the standard deviation of the data. Power transformation is another normalization method employing the mean statistic to reduce the effects of heteroscedasticity [18]. Besides the data's mean and variance, the minimum, maximum and median values are also utilized in normalization, as in min-max normalization and median and median-absolute-deviation normalization. We refer interested readers to the analysis of different static data normalization techniques in machine learning models in [31].

The term data normalization is often understood as the operation that preprocesses raw, i.e., input, data. However, in neural networks, normalization is also popular in the hidden layers. This is due to the fact that different layers of a deep network can encounter significant input distribution shift during stochastic gradient updates, and a normalization operation can help stabilize and improve the training process. Batch Normalization (BN) was proposed for Convolutional Neural Networks for exactly such a purpose
[14]. Since stochastic gradient descent operates only in a mini-batch manner, the mini-batch mean and variance are accumulated in a moving-average style to estimate the global mean and variance in BN. After subtracting the mean and dividing by the standard deviation, BN also learns to scale and shift the hidden representations. Instead of mini-batch statistics, Instance Normalization
[37] (IN) uses sample-level statistics and learns how to normalize each image so that its contrast matches that of a predefined style image in visual style-transfer problems. Both BN and IN were originally proposed for visual data, although BN has also been widely used in NLP.

Both BN and IN are adaptive, data-driven normalization schemes. However, they were proposed to normalize hidden representations and are not commonly used for input normalization. Regarding adaptive input normalization methods for time-series, we are only aware of the work in [26], which formulated a three-stage normalization procedure called Deep Adaptive Input Normalization (DAIN). Since DAIN is directly related to our proposed method, we describe it in more detail here.
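Before turning to the details of DAIN, the static schemes reviewed above (z-score, min-max, and Pareto scaling) can be sketched in a few lines of numpy. This is a minimal illustration; the `eps` guard against division by zero is our addition and is not necessarily part of the cited formulations:

```python
import numpy as np

def zscore(x, axis=0, eps=1e-8):
    """Center to zero mean and scale to unit standard deviation."""
    return (x - x.mean(axis=axis, keepdims=True)) / (x.std(axis=axis, keepdims=True) + eps)

def minmax(x, axis=0, eps=1e-8):
    """Rescale each dimension to the [0, 1] range."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / (hi - lo + eps)

def pareto(x, axis=0, eps=1e-8):
    """Center, then divide by the square root of the standard deviation."""
    return (x - x.mean(axis=axis, keepdims=True)) / (np.sqrt(x.std(axis=axis, keepdims=True)) + eps)
```

All three are static in the sense discussed above: their statistics are fixed functions of the data and are not adjusted by the learning objective.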
In this paper, let us denote a multivariate series as $\mathbf{X} \in \mathbb{R}^{D \times T}$, where $D$ denotes the number of univariate series and $T$ denotes the temporal length of each series. Here $D$ and $T$ are also referred to as the feature and temporal dimensions, respectively. In addition, we denote the $t$-th column of $\mathbf{X}$ as $\mathbf{x}_t \in \mathbb{R}^{D}$, which is the representation of the series at time index $t$. We also refer to $\mathbf{x}_t$ as the $t$-th temporal slice. The first step of DAIN is to shift every temporal slice in $\mathbf{X}$ as follows:
$$\tilde{\mathbf{X}}^{(1)} = \mathbf{X} - \left(\mathbf{W}_a \bar{\mathbf{x}}\right)\mathbf{1}^{\top}, \qquad \bar{\mathbf{x}} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t \qquad (1)$$

where $\mathbf{W}_a \in \mathbb{R}^{D \times D}$ is a learnable weight matrix that estimates the amount of shifting from the mean temporal slice $\bar{\mathbf{x}}$ calculated from each series, and $\mathbf{1} \in \mathbb{R}^{T}$ denotes a vector of ones.
After shifting, the intermediate representation $\tilde{\mathbf{X}}^{(1)}$ is then scaled as follows:
$$\tilde{\mathbf{X}}^{(2)} = \tilde{\mathbf{X}}^{(1)} \oslash \left(\left(\mathbf{W}_b \tilde{\boldsymbol{\sigma}}\right)\mathbf{1}^{\top}\right), \qquad \tilde{\boldsymbol{\sigma}} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\tilde{\mathbf{x}}^{(1)}_t \odot \tilde{\mathbf{x}}^{(1)}_t} \qquad (2)$$

where $\mathbf{W}_b \in \mathbb{R}^{D \times D}$ is another weight matrix that estimates the amount of scaling from the standard deviation $\tilde{\boldsymbol{\sigma}}$, which is computed from the shifted temporal slices $\tilde{\mathbf{x}}^{(1)}_t$. In Eq. (2), the square-root operator is applied elementwise; $\odot$ and $\oslash$ denote elementwise multiplication and division, respectively.
The final step in DAIN is gating, which is used as a type of attention mechanism to suppress irrelevant features:
$$\tilde{\mathbf{X}}^{(3)} = \tilde{\mathbf{X}}^{(2)} \odot \left(\boldsymbol{\gamma}\,\mathbf{1}^{\top}\right), \qquad \boldsymbol{\gamma} = \operatorname{sigmoid}\!\left(\mathbf{W}_c \bar{\mathbf{x}}^{(2)} + \mathbf{d}\right) \qquad (3)$$

where $\mathbf{W}_c \in \mathbb{R}^{D \times D}$ and $\mathbf{d} \in \mathbb{R}^{D}$ are the weights used to learn the gating function, and $\bar{\mathbf{x}}^{(2)}$ denotes the mean temporal slice of $\tilde{\mathbf{X}}^{(2)}$.
The output of DAIN is, thus, $\tilde{\mathbf{X}}^{(3)}$, the normalized series having the same size as the input series $\mathbf{X}$. Since the normalization scheme of DAIN contains several processing steps with nonlinear operations, stochastic updates in DAIN are sensitive to the learning rate. For this reason, the authors in [26] used three different learning rates for the parameters associated with the three computational steps in DAIN. As we will see in the next section, our normalization scheme is more intuitive for time-series while requiring fewer computations and parameters. In addition, since our normalization scheme relies only on linear operations, it is robust with respect to the learning rates that are normally adopted to train the network under consideration.
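To make the three stages concrete, below is a simplified numpy sketch of a DAIN forward pass for a single sample, as we read the procedure from [26]; the variable names are ours, and the exact summary statistic fed to the gating stage may differ from the original implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dain_forward(X, Wa, Wb, Wc, d, eps=1e-8):
    """Simplified forward pass of the three DAIN stages for one sample.

    X : (D, T) multivariate series; Wa, Wb, Wc : (D, D) weight matrices;
    d : (D,) bias. In DAIN all of these are learned by gradient descent."""
    x_bar = X.mean(axis=1)                      # mean temporal slice, (D,)
    X1 = X - (Wa @ x_bar)[:, None]              # stage 1: adaptive shifting
    sigma = np.sqrt(((X1 - X1.mean(axis=1, keepdims=True)) ** 2).mean(axis=1))
    X2 = X1 / ((Wb @ sigma)[:, None] + eps)     # stage 2: adaptive scaling
    gate = sigmoid(Wc @ X2.mean(axis=1) + d)    # stage 3: feature-wise gating
    return X2 * gate[:, None]
```

With identity weights and zero gating parameters, the sketch reduces to a per-sample z-score normalization followed by a uniform 0.5 gate, which illustrates why the three stages need separate, carefully tuned learning rates to move away from such trivial behavior.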
III. Adaptive Input Normalization with the Bilinear Normalization Layer
The proposed BiN layer formulation shares some similarities with DAIN and IN in the sense that we also take advantage of sample-level statistics when learning to transform the input series. More specifically, the basic statistics used to normalize each input sample are calculated independently for each sample, while BiN also has global parameters that are shared between samples. In this way, our formulation (as well as DAIN and IN) differs from BN, which utilizes global statistics estimated from the whole dataset to normalize every sample. Neither BN nor IN was proposed to work as an input normalization scheme for time-series; rather, they operate on higher-order tensors in the hidden layers of convolutional neural networks, which have a different semantic structure than multivariate time-series. We are also not aware of any work that utilizes BN or IN for input data normalization, especially for time-series. The main difference between the proposed method and DAIN is that BiN is formulated to jointly learn to transform the input samples along both the temporal and feature dimensions, taking into account the bimodal nature of multivariate time-series, while DAIN only works along the temporal dimension.
In order to better understand our motivation in taking into account the bimodal nature of multivariate time-series, let us take an example of predicting the opening value of the NASDAQ-100 index on a given day based on the historical opening prices of its 100 constituent companies over the last 10 days. In this case, each input sample $\mathbf{X}$ has dimensions $100 \times 10$. On one hand, we can consider $\mathbf{X}$ as a set of 10 temporal slices (the columns of $\mathbf{X}$), each of which is 100-dimensional and represents a snapshot of the opening prices of the constituent companies. The mean and variance of this set, also 100-dimensional, would then represent the average opening prices of the companies and their volatility over the last 10 days. On the other hand, we can also consider $\mathbf{X}$ as a set of 100 univariate series (the rows of $\mathbf{X}$), each of which contains the opening prices of one company over 10 consecutive days. The mean and variance of this set, also 10-dimensional, would represent the mean and variance of the NASDAQ-100 equal-weighted index¹ during the last 10 days. In our example, both ways of viewing $\mathbf{X}$ and the corresponding statistics are valid and meaningful. Each gives a different interpretation of the data contained in $\mathbf{X}$, as well as a different underlying assumption about which elements of $\mathbf{X}$ form a normally distributed set. Because of this, the proposed normalization layer utilizes and combines statistics from both views in order to transform the multivariate series.

¹This means that each constituent company contributes 1%, without taking into account market capitalization. For example, QQQE is an ETF that tracks the NASDAQ-100 with equal weights.

The proposed layer normalizes $\mathbf{X}$ along the temporal dimension as follows:
$$\bar{\mathbf{x}}_2 = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t \qquad (4a)$$
$$\boldsymbol{\sigma}_2 = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\mathbf{x}_t - \bar{\mathbf{x}}_2\right) \odot \left(\mathbf{x}_t - \bar{\mathbf{x}}_2\right)} \qquad (4b)$$
$$\tilde{\mathbf{X}} = \left(\mathbf{X} - \bar{\mathbf{x}}_2\mathbf{1}^{\top}\right) \oslash \left(\boldsymbol{\sigma}_2\mathbf{1}^{\top}\right) \qquad (4c)$$
$$\mathbf{X}_2 = \left(\boldsymbol{\gamma}_2\mathbf{1}^{\top}\right) \odot \tilde{\mathbf{X}} + \boldsymbol{\beta}_2\mathbf{1}^{\top} \qquad (4d)$$
where $\boldsymbol{\gamma}_2 \in \mathbb{R}^{D}$ and $\boldsymbol{\beta}_2 \in \mathbb{R}^{D}$ are two parameters of BiN that are optimized during stochastic gradient descent, and $\mathbf{1} \in \mathbb{R}^{T}$ denotes a vector of ones.
After the computation steps in Eq. (4), we obtain an intermediate series $\mathbf{X}_2$ that has been normalized in the temporal dimension. Basically, given an input series $\mathbf{X}$, BiN first computes the mean temporal slice (column) $\bar{\mathbf{x}}_2$ and its standard deviation $\boldsymbol{\sigma}_2$ as in Eq. (4a) and (4b), which are then used to standardize each temporal slice of the input before applying elementwise scaling (using $\boldsymbol{\gamma}_2$) and shifting (using $\boldsymbol{\beta}_2$) as in Eq. (4c) and (4d). While the standardization step is independent for each sample in the training set, the final scaling and shifting parameters are shared between all samples. Here we use the subscript 2 in $\bar{\mathbf{x}}_2$, $\boldsymbol{\sigma}_2$, $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ to indicate that they are associated with the second dimension, i.e., the temporal dimension, of the multivariate series.
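In numpy terms, the temporal-mode normalization in Eq. (4) for a single sample can be sketched as follows (a minimal illustration; the small constant `eps` for numerical stability is our addition):

```python
import numpy as np

def bin_temporal(X, gamma2, beta2, eps=1e-8):
    """Temporal-mode normalization of a sample X with shape (D, T).

    gamma2, beta2 : (D,) learnable scale and shift, shared across samples."""
    x_bar2 = X.mean(axis=1, keepdims=True)       # Eq. (4a): mean temporal slice
    sigma2 = X.std(axis=1, keepdims=True) + eps  # Eq. (4b): its standard deviation
    X_tilde = (X - x_bar2) / sigma2              # Eq. (4c): standardize each slice
    return gamma2[:, None] * X_tilde + beta2[:, None]  # Eq. (4d): scale and shift
```

With `gamma2` all ones and `beta2` all zeros, every row of the output has zero mean and approximately unit standard deviation; training then moves these shared parameters away from the identity where this is useful.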
In order to interpret the effects of Eq. (4a), (4b), and (4c), we can take the same approach as in the NASDAQ-100 example given previously. That is, the input series $\mathbf{X}$ can be viewed as a set of $T$ temporal slices, i.e., a set of $T$ points in a $D$-dimensional space. The standardization in Eq. (4c) moves this set of points to be centered around the origin and controls their spread, while keeping their arrangement pattern similar. If we have two input series whose corresponding sets of points spread and lie in two completely different regions of this $D$-dimensional space but have the same arrangement pattern, then without the alignment performed by Eq. (4c) we cannot effectively capture the linear or nonlinear² arrangement patterns that are similar between the two series when using, for example, a 1D convolution filter that strides along the temporal dimension, as often encountered in CNN architectures for time-series. We illustrate this example in Figure 1.

²Nonlinear patterns can be approximated by several piecewise linear patterns (using more than one linear projection, such as more than one convolution filter).

Here we should note that although BiN applies additional scaling and shifting in Eq. (4d) after the alignment, the values of $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are the same for every input series; thus, the points of the two aligned sets are still centered at the same location and have approximately similar spreads. Since $\boldsymbol{\gamma}_2$ and $\boldsymbol{\beta}_2$ are optimized together with the other parameters of the network, they enable BiN to manipulate the aligned distributions to match the statistics of other layers.

While the effects of nonstationarity in the temporal mode are often visible and have been heavily studied, its effects when considered from the feature-dimension perspective are less obvious. To see this, let us now view the series $\mathbf{X}$ as a set of $D$ points (its rows) in a $T$-dimensional space. Let us also take the previous scenario, where two series have their sets of temporal slices scattered in different regions of the $D$-dimensional coordinate system (viewed under the temporal perspective) before the normalization step in Eq. (4). When the two series are very far apart viewed from the feature perspective, they are also likely to possess sets of rows that are distributed in two different regions of the $T$-dimensional space, despite having very similar arrangements. This scenario likewise prevents a convolution filter that strides along the feature dimension from effectively capturing the prominent linear or nonlinear patterns existing in the feature dimension of all input series. For this reason, our proposed normalization scheme also normalizes the input series along the feature dimension as follows:
$$\bar{\mathbf{x}}_1 = \frac{1}{D}\sum_{d=1}^{D}\mathbf{x}^{(d)} \qquad (5a)$$
$$\boldsymbol{\sigma}_1 = \sqrt{\frac{1}{D}\sum_{d=1}^{D}\left(\mathbf{x}^{(d)} - \bar{\mathbf{x}}_1\right) \odot \left(\mathbf{x}^{(d)} - \bar{\mathbf{x}}_1\right)} \qquad (5b)$$
$$\tilde{\mathbf{X}}' = \left(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}_1^{\top}\right) \oslash \left(\mathbf{1}\boldsymbol{\sigma}_1^{\top}\right) \qquad (5c)$$
$$\mathbf{X}_1 = \left(\mathbf{1}\boldsymbol{\gamma}_1^{\top}\right) \odot \tilde{\mathbf{X}}' + \mathbf{1}\boldsymbol{\beta}_1^{\top} \qquad (5d)$$
where $\mathbf{x}^{(d)} \in \mathbb{R}^{T}$ denotes the $d$-th row of $\mathbf{X}$ and $\mathbf{1} \in \mathbb{R}^{D}$ is a vector of ones. In addition, $\boldsymbol{\gamma}_1 \in \mathbb{R}^{T}$ and $\boldsymbol{\beta}_1 \in \mathbb{R}^{T}$ are two learnable weights.
After the computation steps in Eq. (5), we obtain another intermediate series $\mathbf{X}_1$ that has been normalized in the feature dimension.
Finally, BiN linearly combines the intermediate normalized series obtained from Eq. (4) and (5) to generate the output $\mathbf{Y}$:

$$\mathbf{Y} = \lambda_1 \mathbf{X}_1 + \lambda_2 \mathbf{X}_2 \qquad (6)$$

where $\lambda_1$ and $\lambda_2$ are two learnable scalars, which enable BiN to weigh the importance of the feature-mode and temporal-mode normalizations. Here we should note that $\lambda_1$ and $\lambda_2$ are constrained to be nonnegative. This constraint is enforced during stochastic optimization by resetting the value (of $\lambda_1$ or $\lambda_2$) to $0$ whenever the updated value becomes negative.
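Putting Eq. (4)-(6) together, a minimal numpy sketch of the whole BiN forward pass could look as follows. The class and attribute names are ours; in a real implementation all parameters would be updated by stochastic gradient descent together with the rest of the network, with the non-negativity reset applied to `lam1` and `lam2` after each update:

```python
import numpy as np

class BiN:
    """Sketch of the Bilinear Input Normalization layer for samples of shape (D, T)."""

    def __init__(self, D, T):
        # Temporal-mode scale/shift (Eq. (4)) and feature-mode scale/shift (Eq. (5)).
        self.gamma2, self.beta2 = np.ones(D), np.zeros(D)
        self.gamma1, self.beta1 = np.ones(T), np.zeros(T)
        # Combination weights of Eq. (6), constrained to be non-negative.
        self.lam1, self.lam2 = 0.5, 0.5

    def clip_lambdas(self):
        # Reset a combination weight to 0 whenever a gradient update turns it negative.
        self.lam1 = max(self.lam1, 0.0)
        self.lam2 = max(self.lam2, 0.0)

    def __call__(self, X, eps=1e-8):
        # Temporal mode: statistics over the columns (time), Eq. (4).
        Xt = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)
        X2 = self.gamma2[:, None] * Xt + self.beta2[:, None]
        # Feature mode: statistics over the rows (features), Eq. (5).
        Xf = (X - X.mean(axis=0, keepdims=True)) / (X.std(axis=0, keepdims=True) + eps)
        X1 = self.gamma1[None, :] * Xf + self.beta1[None, :]
        # Eq. (6): non-negative combination of the two normalized views.
        return self.lam1 * X1 + self.lam2 * X2
```

Because every operation above is linear given the per-sample statistics, a single learning rate suffices for all BiN parameters, in contrast to the three stage-specific learning rates needed by DAIN.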
IV. Experiments
IV-A. Limit Order Book
In finance, a limit order is a type of trade order to buy or sell a fixed number of shares at a specified price. In a buy (bid) limit order, the trader specifies the number of shares and the maximum price per share that he or she is willing to pay. Conversely, in a sell (ask) limit order, the trader specifies the number of shares and the minimum price per share at which he or she is willing to sell. These two types of limit orders form the two sides of the limit order book (LOB): the bid side and the ask side. The limit orders are sorted so that those with the highest bid price are on top of the bid side and those with the lowest ask price are on top of the ask side. Whenever the best ask price is equal to or lower than the best bid price, the crossing orders are executed and removed from the LOB.
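As a toy illustration of these mechanics, consider a book with three price levels on each side (all numbers hypothetical):

```python
# Each level is a (price, size) pair. Bids are sorted descending by price,
# asks ascending, so the best quotes sit at index 0 of each side.
bids = [(100.0, 300), (99.5, 150), (99.0, 500)]
asks = [(100.5, 200), (101.0, 400), (101.5, 100)]

best_bid, best_ask = bids[0][0], asks[0][0]
# No execution happens while the book is not crossed (best ask above best bid).
assert best_ask > best_bid
spread = best_ask - best_bid               # 0.5
mid_price = (best_bid + best_ask) / 2.0    # 100.25: a virtual price; no trade occurs at it
```

A new buy limit order at 100.5 or above would cross the book, execute against the best ask, and the matched quantities would be removed from both sides.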
Since the LOB contains all active buy and sell limit orders for a stock, it reflects the current supply and demand for the stock at different price levels. In the literature, numerous studies take advantage of LOB data and address different research questions, such as order flow distribution, price jumps, the random walk nature of prices, and stochastic models of limit orders, to name a few [30, 29, 3, 6, 20]. One of the LOB-related problems heavily studied using machine learning methods is forecasting future mid-price movements. The mid-price, at any point in time, is the average of the best-bid and best-ask prices. This quantity is a virtual price, since no trade can happen at the current mid-price. Since the movements of the mid-price reflect changes in the market dynamics, they are considered important events to forecast. In order to benchmark the performance of BiN, we conducted experiments using two LOB datasets coming from two different markets: the Nordic and the US markets.
IV-B. Experiments using Nordic data
IV-B1. Dataset and Experimental Setup
FI-2010 [24] is a large-scale, publicly available limit order book (LOB) dataset, which contains buy and sell limit order information (prices and volumes) from Finnish stocks traded on the Helsinki Stock Exchange (operated by NASDAQ Nordic) over several business days. At each order event (a point in time), the dataset contains the prices and volumes from the top best-bid and best-ask levels of both sides, leading to a fixed-dimensional vector representation. The authors of this dataset provided labels (up, down, stationary) for the mid-price movements over the next order events at several horizons. Since the majority of existing research results were reported for a standard set of prediction horizons, we also conducted experiments with these values. Interested readers can find more details about the FI-2010 dataset in [24].

For the FI-2010 dataset, we followed the experimental setup proposed in [33], which is widely used to benchmark the performance of deep neural networks on this task. Under this setting, the data from the earlier days was used to train the models, and the last days were used for evaluation purposes. In this first set of experiments, we evaluated BiN in combination with the Temporal Attention augmented Bilinear Layer (TABL) network, one of the SoTA neural networks on the FI-2010 dataset [33]. Since TABL architectures also take advantage of the bimodal nature of time-series, BiN is expected to complement TABL networks particularly well. To enable comparisons with prior works, the best-performing architecture C(TABL) reported in [33] was adopted in our experiments. For this architecture, the input time-series were constructed from the most recent order events, so each input series fed to C(TABL) has a fixed number of features and time steps. All C(TABL) networks were trained with the ADAM optimizer, with an initial learning rate that was reduced by a fixed factor at two predefined epochs. Weight decay and a max-norm constraint were used for regularization.
Accuracy and the average Precision, Recall and F1 are reported as the performance metrics. Since FI-2010 is an imbalanced dataset, the average F1 measure is considered the main performance metric, following prior conventions [33]. Here we should note that we used no validation set for FI-2010 and simply used the F1 score measured on the training set for validation purposes. Each experiment was run several times, and the median value measured on the test set is reported.
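These metrics can be computed in a few lines; the following sketch (ours, not the benchmark's reference implementation) computes accuracy and the macro-averaged precision, recall and F1 over the three movement classes:

```python
import numpy as np

def macro_scores(y_true, y_pred, n_classes=3, eps=1e-12):
    """Accuracy and macro-averaged precision, recall, F1 for class labels 0..n_classes-1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = (y_true == y_pred).mean()
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))      # true positives for class c
        p = tp / (np.sum(y_pred == c) + eps)            # precision for class c
        r = tp / (np.sum(y_true == c) + eps)            # recall for class c
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r + eps))
    return acc, np.mean(precisions), np.mean(recalls), np.mean(f1s)
```

Macro averaging weighs every class equally regardless of its frequency, which is why average F1 is the preferred headline metric on an imbalanced dataset such as FI-2010.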
IV-B2. Experiment Results
[Table I: Accuracy, Precision, Recall and F1 (%) of CNN [35], LSTM [36], C(BL) [33], DeepLOB [39], DAIN-MLP [26], DAIN-RNN [26], C(TABL) [33], BN-C(TABL), and BiN-C(TABL) on FI-2010 for three prediction horizons.]
Table I shows the experiment results for the three prediction horizons of C(TABL) networks using Batch Normalization and BiN, in comparison with existing results. Here we should note that the data provided in FI-2010 has been anonymized, i.e., the prices and volumes of the orders were already normalized. For those results reported in Table I without any indication of the normalization method, z-score normalization was applied. In addition, we attempted to evaluate DAIN with the C(TABL) architecture on the FI-2010 dataset; however, we could not achieve reasonable performance, since this normalization strategy requires extensive tuning of three different learning rates for the different computation steps. Besides, in the original paper [26], DAIN was only applied to MLP and RNN networks. For this reason, we report the original results of DAIN using MLP and RNN in Table I. In the experiments using US data, we did obtain reasonable results with DAIN, and comparisons with DAIN are made in Section IV-C.
[Table II: Accuracy, Precision, Recall and F1 (%) of B(TABL) [33], C(TABL) [33], BiN-B(TABL), and BiN-C(TABL) on FI-2010 for three prediction horizons.]
It is clear that our proposed BiN layer (BiN-C(TABL)), when used to normalize the input data, yielded significant improvements over BN and z-score normalization applied to the same network. The improvements are evident for all prediction horizons; for the longest horizon in particular, BiN enhanced the C(TABL) network with a substantial gain in average F1. Compared to DAIN, the performance achieved by our normalization strategy coupled with the C(TABL) network is superior to that of DAIN coupled with MLP or RNN. Regarding BN as an input normalization scheme, it is obvious that BN deteriorated the performance of the C(TABL) networks, producing a clear drop in average F1. This phenomenon is expected, since BN was originally designed to reduce covariate shift between the hidden layers of a convolutional neural network, rather than as a mechanism to normalize input time-series.
Comparing BiN-C(TABL) with DeepLOB [39], a SoTA CNN-LSTM architecture having 11 hidden layers, it is clear that our proposed normalization layer helped a TABL network with only 2 hidden layers to significantly close the performance gap on the two shorter horizons, while outperforming DeepLOB by a large margin on the longest horizon.
In order to investigate how much improvement BiN can contribute to neural networks of different complexities, we evaluated BiN with a smaller TABL architecture, namely B(TABL), as proposed in [33]. B(TABL) has only one hidden layer and far fewer parameters than C(TABL), which has two hidden layers. The results are shown in Table II. It is clear that BiN significantly boosted both the B(TABL) and C(TABL) architectures across the prediction horizons, with the BiN-B(TABL) networks performing as well as the BiN-C(TABL) networks in all prediction horizons, making the additional hidden layer in BiN-C(TABL) redundant. Here we should note that adding our proposed normalization layer to B(TABL) networks leads to only a marginal increase in the number of parameters, while achieving the same performance as the BiN-C(TABL) networks, which have approximately twice as many parameters.
Since BN was proposed to normalize hidden representations, we also experimented with using BiN to normalize the hidden representations in TABL networks. The results are shown in Table III, where BiN-C(TABL) and BN-C(TABL) denote the results when BiN and BN were applied only to the input, while BiN-C(TABL)-BiN and BN-C(TABL)-BN denote the results when they were applied to both the input and the hidden representations. As we can see from Table III, there are very small differences between the two arrangements, except for a noticeable improvement for BN at one of the prediction horizons. For BiN, these results imply that adding normalization to the hidden layers brings no additional benefit for C(TABL) networks when the input data has been properly normalized.
[Table III: Accuracy, Precision, Recall and F1 (%) of BN-C(TABL), BiN-C(TABL), BN-C(TABL)-BN, and BiN-C(TABL)-BiN on FI-2010 for three prediction horizons.]
IV-C. Experiments using US data
IV-C1. Dataset and Experiment Setup
While the Nordic dataset provides a reasonable testbed for our evaluation purposes, the Nordic market is less liquid than the US market, which is the biggest stock market worldwide. The number of intraday orders in large-cap US stocks is significantly higher than in Nordic stocks, making it harder to predict future market conditions. For the US market, we procured orders from the TotalView-ITCH feed and obtained the LOB data of Amazon and Google from the 22nd of September 2015 to the 5th of October 2015. The trading hours on NASDAQ US span from 09:30 to 16:00 (EST), and only orders submitted during this period were considered in our analysis. After the filtering process, we obtained approximately 13 million order events over the working days in this period. Similar to the Nordic data, we used the earlier days for training the prediction models and the remaining days for testing purposes.
In addition to forecasting the type of mid-price dynamics (up, down, stationary) at a fixed future horizon (Setting 1), we also evaluated the models in a more active setting (Setting 2), in which the models were trained to predict the next movement (up or down) of the mid-price and when it occurs. That is, Setting 2 has both a classification (movement type) and a regression (horizon value) objective, with a loss function consisting of the cross-entropy and the mean squared error. The movement labels were derived following the same procedure used in [24], which includes price smoothing and movement classification based on a threshold.

For the experiments with the US data, in addition to the C(TABL) architecture, we also evaluated the DeepLOB architecture [39] as a predictor. Different from the Nordic dataset, which was pre-normalized, the US data contains raw values for the prices and volumes. For this reason, we experimented with two static normalization methods, namely z-score normalization and min-max normalization, with the results denoted as z-C(TABL) and mm-C(TABL) for C(TABL) networks, and z-DeepLOB and mm-DeepLOB for DeepLOB networks.
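As an illustration of that labeling procedure, the sketch below compares the current mid-price with the mean of the next k mid-prices and thresholds the relative change. The function name and the threshold value in the example are hypothetical; the actual smoothing and threshold follow [24]:

```python
import numpy as np

def movement_labels(mid, k, alpha):
    """Label mid-price movements in the spirit of [24]: compare the mean
    of the next k mid-prices with the current one and threshold the
    relative change by alpha. Returns 0 = down, 1 = stationary, 2 = up."""
    labels = []
    for t in range(len(mid) - k):
        m_future = np.mean(mid[t + 1 : t + 1 + k])  # smoothed future mid-price
        change = (m_future - mid[t]) / mid[t]       # relative change
        if change > alpha:
            labels.append(2)
        elif change < -alpha:
            labels.append(0)
        else:
            labels.append(1)
    return np.array(labels)
```

The smoothing over k future events suppresses labels that would otherwise flip on every bid-ask bounce, while alpha controls how large a relative move must be to count as up or down rather than stationary.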
[TABLE IV: Accuracy, Precision, Recall, and F1 (%) in Setting 1 on the US data for C(TABL), zC(TABL), mmC(TABL), BNC(TABL), DAINC(TABL), and BiNC(TABL) at three prediction horizons; numeric entries not recoverable from the source.]
IV-C2 Experiment Results
Table IV shows the experiment results in Setting 1 of the US data for the C(TABL) architecture. First of all, it is clear that we obtained the worst performance when using raw data to train the predictors (the results associated with C(TABL)). Between the two static normalization methods, z-score normalization exhibited a better ability to preprocess the data than min-max normalization, and both significantly improved the quality of the training data. Among the adaptive normalization methods, the performance obtained from BN was inferior to that of DAIN and BiN. Overall, the proposed normalization layer, when combined with the C(TABL) architecture, yielded the best performance at all prediction horizons.
Table V shows the experiment results in Setting 1 of the US data for DeepLOB networks. Similar to the results obtained for C(TABL) networks, we obtained the worst performance when using raw data to train the DeepLOB architecture. Between z-score normalization and min-max normalization, the former led to slightly better results. While BN showed no superiority over z-score normalization, both DAIN and BiN outperformed the static normalization methods. Among all normalization methods, BiN was the most suitable technique to combine with the DeepLOB architecture.
[TABLE V: Accuracy, Precision, Recall, and F1 (%) in Setting 1 on the US data for DeepLOB, zDeepLOB, mmDeepLOB, BNDeepLOB, DAINDeepLOB, and BiNDeepLOB at three prediction horizons; numeric entries not recoverable from the source.]
[TABLE VI: F1 (%) and RMSE in Setting 2 on the US data for C(TABL) and DeepLOB networks under each normalization method; numeric entries not recoverable from the source.]
In experiment Setting 2, the models were trained to predict the type of the next mid-price movement, measured by the F1 score, as well as the horizon at which it happens, measured by the Root Mean Squared Error (RMSE). The performances of C(TABL) and DeepLOB networks using different input normalization methods are shown in Table VI. For both network architectures, the best F1 scores were obtained using the proposed normalization method. Z-score standardization and BN performed similarly, being the second best in terms of F1 score. Min-max normalization again performed worse than z-score normalization. Surprisingly, DAIN performed poorly in terms of F1 score compared to z-score normalization in this setting. Regarding the prediction of the horizon value, BiN achieved the best RMSE among all normalization methods used with the C(TABL) architecture. For the DeepLOB architecture, a peculiar phenomenon can be observed: all normalization methods except DAIN yielded exactly the same RMSE, even across different runs. For these models, the gradient updates toward the end of the training process seemed to affect only the classification objective and not the regression one. Even though DAIN achieved the best RMSE among the methods applied to the DeepLOB architecture, the combination of DAIN and DeepLOB performed poorly in terms of F1 score.
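For concreteness, the horizon-regression metric can be computed as below; this is a generic RMSE sketch with made-up horizon values, not numbers from our experiments.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and realized horizons."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Hypothetical horizons (in order-book events) for three predicted movements:
error = rmse([10.0, 22.0, 5.0], [12.0, 20.0, 5.0])  # sqrt((4 + 4 + 0) / 3)
```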
From the results obtained for both Setting 1 and Setting 2, we can see that the proposed normalization method performs consistently, being the best normalization method for state-of-the-art neural networks in most cases.
V Conclusions
In this paper, we propose the Bilinear Input Normalization (BiN) layer, a completely data-driven time-series normalization strategy that is designed to take into account the bimodal nature of financial time-series and aligns multivariate time-series in both the feature and temporal dimensions. The parameters of the proposed normalization method are optimized in an end-to-end manner together with the other parameters of a neural network. Using large-scale limit order book data from the Nordic and US markets, we evaluated the performance of BiN in comparison with other normalization techniques on different forecasting problems related to future mid-price dynamics. The experimental results showed that BiN performed consistently when combined with different state-of-the-art neural networks, being the most suitable normalization method in the majority of scenarios.
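To make the two-axis idea concrete, the following minimal sketch (ours, not the actual BiN implementation) standardizes a small series once along the time axis and once along the feature axis, then mixes the two views. The mixing weights are fixed constants here, whereas in BiN they, together with per-axis scale and shift terms, are learned end-to-end with the network.

```python
# Illustrative sketch of bilinear (two-axis) normalization of a
# multivariate series X stored as a list of feature rows over time.
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each row of a list-of-lists."""
    out = []
    for r in rows:
        mu, sd = mean(r), pstdev(r)
        out.append([(v - mu) / sd if sd else 0.0 for v in r])
    return out

def transpose(rows):
    return [list(c) for c in zip(*rows)]

def bin_sketch(X, w_time=0.5, w_feat=0.5):
    """Combine time-axis and feature-axis normalizations of X.
    In BiN the combination weights are trainable; here they are fixed."""
    time_norm = standardize(X)                        # each feature over time
    feat_norm = transpose(standardize(transpose(X)))  # each time step over features
    return [[w_time * t + w_feat * f for t, f in zip(tr, fr)]
            for tr, fr in zip(time_norm, feat_norm)]

X = [[1.0, 2.0, 3.0],   # e.g. a price feature over three time steps
     [4.0, 6.0, 8.0]]   # e.g. a volume feature over three time steps
Y = bin_sketch(X)
```

The design point is that neither axis alone suffices: prices and volumes live on very different scales (feature axis), while the level of each feature drifts over time (temporal axis); the learned combination lets the network weight the two views per problem.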
VI Acknowledgement
The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources.
References
[1] (2010) Statistical arbitrage in the US equities market. Quantitative Finance 10 (7), pp. 761–782.
[2] (2018) Machine learning and portfolio optimization. Management Science 64 (3), pp. 1136–1154.
[3] (2004) Fluctuations and response in financial markets: the subtle nature of ‘random’ price changes. Quantitative Finance 4 (2), pp. 176–190.
[4] (2004) Forecasting economic and financial time-series with nonlinear models. International Journal of Forecasting 20 (2), pp. 169–183.
[5] (1996) A cross-sectional test of an investment-based asset pricing model. Journal of Political Economy 104 (3), pp. 572–621.
[6] (2013) Price dynamics in a Markovian limit order market. SIAM Journal on Financial Mathematics 4 (1), pp. 1–25.
[7] (2009) A generalized approach to portfolio optimization: improving performance by constraining portfolio norms. Management Science 55 (5), pp. 798–812.

[8] (2017) Financial time series forecasting – a deep learning approach. International Journal of Machine Learning and Computing 7 (5), pp. 118–122.
[9] (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica: Journal of the Econometric Society, pp. 987–1007.
[10] (2000) Credit risk assessment using statistical and machine learning: basic methodology and risk modeling applications. Computational Economics 15 (1), pp. 107–143.
[11] (2015) Data preprocessing in data mining. Vol. 72, Springer.
[12] (2008) Tests for cointegration with two unknown regime shifts with an application to financial market integration. Empirical Economics 35 (3), pp. 497–505.

[13] (2015) Application of evolutionary computation for rule discovery in stock algorithmic trading: a literature review. Applied Soft Computing 36, pp. 534–551.
[14] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

[15] (2014) Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics 15 (1), pp. 1–9.
[16] (2010) Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance 34 (11), pp. 2767–2787.
[17] (2017) Deep learning for financial time series forecasting in A-Trader system. In 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 905–912.
[18] (1994) Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise. Analytical Chemistry 66 (1), pp. 43–51.
[19] (2020) Estimating latent asset-pricing factors. Journal of Econometrics 218 (1), pp. 1–31.
[20] (2019) Forecasting jump arrivals in stock prices: new attention-based network architecture using limit order book data. Quantitative Finance 19 (12), pp. 2033–2050.
[21] (1999) Technical analysis of the financial markets: a comprehensive guide to trading methods and applications. Penguin.
[22] (2014) Impact of data normalization on stock index forecasting. International Journal of Computer Information Systems and Industrial Management Applications 6, pp. 357–369.
[23] (2008) Scaling techniques to enhance two-dimensional correlation spectra. Journal of Molecular Structure 883, pp. 216–227.
[24] (2018) Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods. Journal of Forecasting 37 (8), pp. 852–866.
[25] (2011) Algorithmic trading. Computer 44 (11), pp. 61–69.
[26] (2019) Deep adaptive input normalization for price forecasting using limit order book data. arXiv preprint arXiv:1902.07892.
[27] (2017) Financial series prediction: comparison between precision of time series models and machine learning methods. arXiv preprint arXiv:1706.00948, pp. 1–9.
[28] (2015) Self-normalization for time series: a review of recent developments. Journal of the American Statistical Association 110 (512), pp. 1797–1817.
[29] (2017) What drives the sensitivity of limit order books to company announcement arrivals? Economics Letters 159, pp. 65–68.
[30] (2017) Limit order books and liquidity around scheduled and non-scheduled announcements: empirical evidence from NASDAQ Nordic. Finance Research Letters 21, pp. 264–271.
[31] (2020) Investigating the impact of data normalization on classification performance. Applied Soft Computing 97, pp. 105524.
[32] (2006) Getting started in fundamental analysis. John Wiley & Sons.
[33] (2018) Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE Transactions on Neural Networks and Learning Systems 30 (5), pp. 1407–1418.
[34] (2020) Data normalization for bilinear structures in high-frequency financial time-series. In International Conference on Pattern Recognition (ICPR).
[35] (2017) Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI), Vol. 1, pp. 7–12.
[36] (2017) Using deep learning to detect price change indications in financial markets. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 2511–2515.
[37] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
[38] (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7 (1), pp. 1–15.
[39] (2019) DeepLOB: deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing 67 (11), pp. 3001–3012.