Technological developments in recent years have substantially increased the amount of data gathered from real life systems[1, 2, 3, 4]. There exists a significant data flow through the recently arising applications such as large-scale sensor networks, information sensing mobile devices and web based social networks [5, 6, 7]
. The size as well as the dimensionality of this data strain the limits of current architectures. Since processing and storing such massive amount of data result in an excessive computational cost, efficient machine learning and data processing algorithms are needed[1, 8].
In this paper, we investigate the widely studied sequential prediction problem for high dimensional data streams. Efficient prediction algorithms specific to big data sequences have great importance for several real life applications such as high frequency trading, forecasting, trend analysis, financial market  and locational tracking. Unfortunately, conventional methods in machine learning and data processing literatures are inadequate to efficiently and effectively process high dimensional data sequences [14, 15, 16]. Even though today’s computers have powerful processing units, traditional algorithms create a bottleneck even for that processing power when the data is acquired at high speeds and too large in size [14, 15].
In order to mitigate the problem of excessive computational cost, we introduce sequential, i.e., online, processing, where the data is neither stored nor reused, and avoid ”batch” processing. [17, 16]. One family of the well known online learning algorithms in the data processing literature is the family of first order methods, e.g., Online Gradient Descent [18, 19]. These methods only use the gradient information to minimize the overall prediction cost. They achieve logarithmic regret bounds that are theoretically guaranteed to hold under certain assumptions . Gradient based methods are computationally more efficient compared to other families of online learning algorithms, i.e., for a sequence of -dimensional feature vectors , where , the computational complexity is only in the order of . However, their convergence rates remain significantly slow when achieving an optimal solution, since no statistics other than the gradient is used [3, 16, 19]. Even though gradient based methods suffer from this convergence issue, they are extensively used in big data applications due to such low computational demand .
Different from the gradient based algorithms, the well known second order Newton-Raphson methods, e.g, Online Newton Step, use the second order statistics, i.e., Hessian of the cost function . Hence, they asymptotically achieve the performance of the ”best” possible predictor much faster. Existence of logarithmic regret bounds is theoretically guaranteed for this family of algorithms as well . Additionally, the second order methods are robust and prone to highly varying data statistics, compared to the first order methods, since they keep track of the second order information [17, 21]. Therefore, in the sense of convergence rate and steady state error performances, Newton-Raphson methods considerably outperform the first order methods [16, 17, 19]. However, the second order methods offer a quadratic computational complexity, i.e., , while the gradient based algorithms provide a linear relation, i.e., . As a consequence, it is not usually feasible for real-life big data applications to utilize the merits of the second order algorithms .
In this paper, we study sequential data prediction, where the consecutive feature vectors are the shifted versions of each other, i.e., for a feature vector of , the upcoming vector is in the form of . To this end, we introduce second order methods for this important problem with computational complexity only linear in the data dimension, i.e., . We achieve such an enormous reduction in computational complexity since there are only two entries changing from to , where we avoid unnecessary calculations in each update. We do not use any statistical assumption on the data sequence other than the shifted nature of the feature vectors. Therefore, we present an approach that is highly appealing for big data applications since it provides the merits of the Newton-Raphson methods with a much lower computational cost.
Overall, in this paper, we introduce an online sequential data prediction algorithm that processes only the currently available data without any storage, efficiently implements the Newton-Raphson methods, i.e., the second order methods outperforms the gradient based methods in terms of performance, has computational complexity same as the first order methods and requires no statistical assumptions on the data sequence. We illustrate the outstanding gains of our algorithm in terms of computational efficiency by using two sequential real life big datasets and compare the resulting error performances with the regular Newton-Raphson methods.
2 Problem Description
In this paper, all vectors are real valued and column-vectors. We use lower case (upper case) boldface letters to represent vectors (matrices). The ordinary transpose is denoted as for the vector
. The identity matrix is represented by, where the subscript is used to indicate that the dimension is . We denote the -dimensional zero vector as .
We study sequential data prediction, where we sequentially observe a real valued data sequence , . At each time , after observing
, we generate an estimate of the desired data,, using a linear model as
where represents the feature vector of previous samples, i.e., . Here, and are the corresponding weight vector and the offset variable respectively at time . With an abuse of notation, we combine the weight vector with the offset variable , and denote it by , yielding , where . As the performance criterion, we use the widely studied instantaneous absolute loss as our cost function, i.e., , where the prediction error at each time instant is given by
We adaptively learn the weight vector coefficients to asymptotically achieve the best possible fixed weight vector , which minimizes the total prediction error after iteration, i.e.,
for any . The definition of is given for the absolute loss case. To this end, we use the second order Online Newton Step (ONS) algorithm to train the weight vectors. The ONS algorithm significantly outperforms the first order Online Gradient Descent (OGD) algorithm in terms of convergence rate and steady state error performance since it keeps track of the second order statistics of the data sequence [16, 18, 19]. The weight vector at each time is updated as
where is the step size and corresponds to the gradient of the cost function at time w.r.t. , i.e., . Here, the dimensional matrix is given by
where is chosen to guarantee that is positive definite, i.e., , and hence, invertible. Selection of the parameters and is crucial for good performance . Note that for the first order OGD algorithm, we have for all , i.e., we do not use the second order statistics but only the gradient information.
Although the second order update algorithms provide faster convergence rates and better steady state performances, computational complexity issue prohibits their usage in most real life applications [19, 22]. Since each update in (4) requires the multiplication of an dimensional matrix with an dimensional vector for , the computational complexity is in the order of , while the first order algorithms just need multiplication and addition. As an example, in protein structure prediction, we have deeming the second order methods 1000 times slower than the first order OGD algorithm .
In the next section, we introduce a sequential prediction algorithm, which achieves the performance of the Newton-Raphson methods, while offering computational complexity same as the first order methods.
3 Efficient Implementation for Complexity Reduction
In this section, we construct an efficient implementation that is based on the low rank property of the update matrices. Instead of directly implementing the second order methods as in (4) and (5), we use unitary and hyperbolic transformations to update the weight vector and the inverse of the Hessian-related matrix .
We work on time series data sequences, which directly implies that the feature vectors and are highly related. More precisely, we have the following relation between these two consecutive vectors as
This relation shows that consecutive data vectors carry quite the same information, which is the basis of our algorithm. We use the instantaneous absolute loss, which is defined as
Although the absolute loss is widely used in the data prediction applications, it is not differentiable when . However, we resolve this issue by setting a threshold close to zero and not updating the weight vector when the absolute error is below this threshold, . From (4) and (5), the absolute loss results in the following update rules for and ,
since depending on the sign of the error.
It is clear that the complexity of the second order algorithms essentially results from the matrix-vector multiplication, as in (8). Rather than getting matrix from and then calculating the multiplication individually at each iteration, we develop a direct and compact update rule, which calculates from without any explicit knowledge of the dimensional matrix .
Then, the difference between the consecutive terms and is given by
We define the dimensional extended vector and get the following two equalities using the relation given in (6),
Therefore, (11) becomes
where the update term is defined as
This equation implies that we do not need the exact values of and individually and it is sufficient to know the value of the defined difference for the calculation of . Moreover, we observe that the update term can be expressed in terms of rank 2 matrices, which is the key point for the reduction of complexity.
Initially, we assume that for , which directly implies using (3). Therefore, is found as
At this point, we define the dimensional matrix and the dimensional matrix as
to achieve the equality given by
We show, at the end of the discussion, that once the rank 2 property is achieved, it holds for all . By using the reformulation of the difference term, we restate the term given in (14) as
For the further discussion, we prefer matrix notation and represent (19) as
where is defined as
We first employ a unitary Givens transformation in order to zero out the second element of the vector and then use a -unitary Hyperbolic rotation , i.e., , to eliminate the last term . Consequently, we achieve the following update rule
where represents the overall transformation process. Existence of these transformation matrices is guaranteed . This update gives the next normalization term , however, for the update, we also need the updated value of , i.e., , explicitly. Moreover, even calculating the term is not sufficient, since we also need the individual value of the vector to update the weight vector coefficients.
Here, by subtracting these two equations, we get
We emphasize that the same transformation , which we used to get , also transforms to and to , if we extend the transformed vector as follows
where we show that and . We denote (26) as , where represents the input matrix and states the output matrix of the transformation. Then, the following equality is achieved
since is unitary, i.e., . Equating the elements of matrices in both sides of (27) yields
Hence, we identify the value of matrix using the second term in (28) as
where we expand the term using its definition given in (15). We know that the term equals to the difference using the update relation (9). Therefore, substituting this equality and inserting the value of yields
This equality implies that is time invariant, i.e., and is given as
Hence, we show that when the low rank property of the difference term is achieved for , it is preserved for the iteration , for . Therefore, the transformation in (26) gives all the necessary information and provides a complete update rule. As a result, the weight vector is updated as
where the individual value of is found by multiplying (29) by , which is the left upper most entry of the transformed matrix , and taking the first elements. The complete algorithm is provided in Algorithm 1 with all initializations and required updates.
The processed matrix has the dimensions , which results in the computational complexity of . Since there is no statistical assumptions, we obtain the same error rates compared to the regular implementation.
In this section, we illustrate the efficiency of our algorithm on real life sequential big datasets. We implement both the regular and the fast ONS algorithms on two different datasets, one of which is the CMU ARCTIC speech dataset where a professional US English male speaker reads 1132 distinct utterances. The recording is performed at 16 KHz and there exist more than 50 million samples. The second dataset is a weather forecasting trace provided by Davis Weather station in Amherst, Massachusetts. We also implement the first order OGD algorithm on this dataset to demonstrate the performance comparison with the second order methods in terms of computational efficiency and convergence rates. Temperature data was collected every 5 minutes during a period of 7 years from 2006 to 2013. There exist more than 600 thousand samples. Hence, both datasets are suitable for simulating big data scenarios. The data sequences are scaled to the range .
4.1 Computational Complexity Analysis
As the first experiment, we examine the computation time of both the proposed efficient ONS algorithm and the regular ONS algorithm. We first work on the two partitions of CMU ARCTIC speech dataset with lengths and , and measure the corresponding total processing time. Sequences of different lengths are chosen to illustrate the effect of increasing data length. For both sequences, we choose feature vectors, , with several dimensions ranging from to . In Fig. 1.a, we demonstrate the computation time comparisons of the regular and the fast implementations of ONS algorithm. As expected from our results, complexity of the regularly implemented ONS algorithm shows a quadratic relation with the dimension of the feature vectors, . However, the proposed efficient implementation provides a linear relation. A substantial observation from Fig. 1.a is that, with an increasing dimensionality of the space of feature vectors, the reduction in the complexity becomes outstanding. We also observe that the growth in the dataset length causes the same linear effect on both algorithms, i.e., doubling the total length results in the doubled computation time.
We then consider the weather forecasting temperature dataset, where in this case total length of the data sequence is not as much as the previous dataset. Therefore, we specifically concentrate on much larger values for the dimension of the feature vectors. Here, we choose the dimension of the space of feature vectors ranging from to and total length of the data sequence is . In Fig. 1.b, we illustrate the relative gain of the introduced fast ONS algorithm with respect to the regular implementation of ONS and the OGD algorithm in terms of total computation time. We observe that the relative computation time gain of the presented algorithm shows a significant improvement in comparison with the regular ONS, as the data dimension increases. However, we also observe that the relative gain falls into the negative region when compared with the first order OGD algorithm. This is an expected result, since the OGD uses only an dimensional vector in each iteration, whereas the fast ONS uses an dimensional matrix and performs additional transformation operations to update the weight vectors. Besides, the negative gain remains constant, since both algorithms eventually have the complexity in the same order of .
4.2 Numerical Stability Analysis
We theoretically show that the introduced algorithm efficiently implements the ONS algorithm without any statistical assumptions or any information loss. Hence, both the regular and the fast ONS algorithms offer the same error performances. However, there might occur negligible numerical differences as a consequence of the finite precision of real life computing systems. In the second part of the experiments, we examine the effects of numerical calculations on the mean square error curves of both the regular and the fast ONS algorithms. We first consider the CMU ARCTIC speech dataset with samples since we observe that the algorithms reach the steady state for this . The dimension of the feature vectors is chosen as , and the learning rates are determined as 0.003 for both algorithms. We demonstrate the comparison of mean square error curves in Fig. 2.a. A direct and significant observation is that the efficient implementation is numerically stable. There is no observable difference between the mean square error curves in terms of both convergence and steady state.
We work on the temperature tracking dataset as well for the numerical stability analysis. In this case, we include the OGD algorithm into the comparison and increase the dimension of feature vectors to for all algorithms. Here, we examine the effect of large dimensionality on the numerical performance of the proposed algorithm and also compare the error performances with the first order OGD algorithm. Similar to the first case, we only represent the first 500 samples since the second order algorithms reach the steady state for this point. The learning rates are set to 0.001 for the regular and the fast ONS and 0.1 for the OGD algorithm. In Fig. 2.b, we illustrate the corresponding mean square error curves. Same as the previous analysis, the fast ONS algorithm shows numerically no difference compared to the regular ONS algorithm. Hence, even for such high dimensional feature vectors, the proposed algorithm remains numerically stable. Additionally, we illustrate that the first order OGD algorithm shows less than adequate performance in terms of convergence rate. Therefore, negative gain on the computation time observed in the previous experiment becomes insignificant when we consider the mean square error analysis in Fig. 2.b.
In this paper, we investigate online sequential data prediction problem for high dimensional data sequences. Even though the second order Newton-Raphson methods achieve superior performance, compared to the gradient based algorithms, the problem of extremely high computational cost prohibits their usage in real life big data applications. For an dimensional feature vector, the computational complexity of these methods increases in the order of . To this end, we introduce a highly efficient implementation that reduces the computational complexity of the Newton-Raphson methods from to . The presented algorithm does not require any statistical assumption on the data sequence. We only use the similarity between the consecutive feature vectors without any information loss. Hence, our algorithm offers the outstanding performance of the second order methods with the low computational cost of the first order methods. We illustrate that the efficient implementation of Newton-Raphson methods attains significant computational gains, as the data dimension grows. We also show that our algorithm is numerically stable.
-  X. Wu, X. Zhu, G. Q. Wu, and W. Ding, “Data mining with big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, Jan 2014.
-  C. Xu, Y. Zhang, R. Li, and X. Wu, “On the feasibility of distributed kernel regression for big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 11, pp. 3041–3052, Nov 2016.
-  R. D’Ambrosio, W. Belhajali, and M. Barlaud, “Boosting stochastic newton descent for bigdata large scale classification,” in 2014 IEEE International Conference on Big Data, Oct 2014, pp. 36–41.
-  R. Couillet and M. Debbah, “Signal processing in large systems,” IEEE Signal Processing Magazine, vol. 24, pp. 211–317, 2013.
-  R. Wolff, K. Bhaduri, and H. Kargupta, “A generic local algorithm for mining data streams in large distributed systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 4, pp. 465–478, April 2009.
-  T. Wu, S. H. Yu, W. Liao, and C. S. Chang, “Temporal bipartite projection and link prediction for online social networks,” in 2014 IEEE International Conference on Big Data, Oct 2014, pp. 52–59.
-  Y. Yilmaz and X. Wang, “Sequential distributed detection in energy-constrained wireless sensor networks,” IEEE Transactions on Signal Processing, vol. 17, no. 4, pp. 335–339, 2014.
-  T. Moon and T. Weissman, “Universal FIR MMSE filtering,” IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 1068–1083, 2009.
-  R. Savani, “High-frequency trading: The faster, the better?” IEEE Intelligent Systems, vol. 27, no. 4, pp. 70–73, July 2012.
-  P. Ghosh and V. L. R. Chinthalapati, “Financial time series forecasting using agent based models in equity and fx markets,” in 2014 6th Computer Science and Electronic Engineering Conference (CEEC), Sept 2014, pp. 97–102.
-  L. Deng, “Long-term trend in non-stationary time series with nonlinear analysis techniques,” in 2013 6th International Congress on Image and Signal Processing (CISP), vol. 2, Dec 2013, pp. 1160–1163.
W. Cao, L. Cao, and Y. Song, “Coupled market behavior based financial crisis
The 2013 International Joint Conference on Neural Networks (IJCNN), Aug 2013, pp. 1–8.
-  Y. Ding, H. Tan, W. Luo, and L. M. Ni, “Exploring the use of diverse replicas for big location tracking data,” in 2014 IEEE 34th International Conference on Distributed Computing Systems (ICDCS), June 2014, pp. 83–92.
-  L. Bottou and Y. Le Cun, “On-line learning for very large data sets,” Applied Stochastic Models in Business and Industry, vol. 21, no. 2, pp. 137–151, 2005. [Online]. Available: http://dx.doi.org/10.1002/asmb.538
-  L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in Advances in Neural Information Processing Systems, 2008, pp. 161–168. [Online]. Available: http://leon.bottou.org/publications/pdf/nips-2007.pdf
-  N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge: Cambridge University Press, 2006.
-  A. C. Singer, S. S. Kozat, and M. Feder, “Universal linear least squares prediction: upper and lower bounds,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2354–2362, Aug 2002.
-  E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Machine Learning, vol. 69, no. 2-3, pp. 169–192, 2007.
-  D. Bertsimas and J. N. Tsitsiklis, Introduction to linear optimization, ser. Athena scientific series in optimization and neural computation. Belmont (Mass.): Athena Scientific, 1997. [Online]. Available: http://opac.inria.fr/record=b1094316
-  E. K. P. Chong and S. H. Zak, An introduction to optimization, ser. Wiley-Interscience series in discrete mathematics and optimization. New York: Wiley, 2008. [Online]. Available: http://opac.inria.fr/record=b1128546
-  S. S. Kozat, A. T. Erdogan, A. C. Singer, and A. H. Sayed, “Steady-state mse performance analysis of mixture approaches to adaptive filtering,” IEEE Transactions on Signal Processing, vol. 58, no. 8, pp. 4050–4063, Aug 2010.
-  A. H. Sayed, Fundamentals of Adaptive Filtering. NJ: John Wiley & Sons, 2003.
-  J. Cheng, A. N. Tegge, and P. Baldi, “Machine learning methods for protein structure prediction,” IEEE Reviews in Biomedical Engineering, vol. 1, pp. 41–49, 2008.
-  A. H. Sayed, Adaptive Filters. NJ: John Wiley & Sons, 2008.
-  J. Kominek and A. W. Black, “Cmu arctic databases.” [Online]. Available: http://www.festvox.org/cmu_arctic/index.html
-  M. Liberatore, “Umass trace repository.” [Online]. Available: http://traces.cs.umass.edu/index.php/Sensors/Sensors