 # Exploiting statistical dependencies of time series with hierarchical correlation reconstruction

While we are usually focused on predicting future values of time series, it is often valuable to additionally predict their entire probability distributions, for example to evaluate risk or Monte Carlo simulations. On example of time series of ≈ 30000 Dow Jones Industrial Averages, there will be shown application of hierarchical correlation reconstruction for this purpose: mean-square fitting polynomial as joint density for (current value, context), where context is for example a few previous values. Then substituting the currently observed context and normalizing density to 1, we get predicted probability distribution for the current value. In contrast to standard machine learning approaches like neural networks, optimal coefficients here can be inexpensively directly calculated, are unique and independent, each has a specific cumulant-like interpretation, and such approximation can approach complete description of any joint distribution - providing a perfect tool to quantitatively describe and exploit statistical dependencies in time series.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction Figure 1: Top: degree m=5 polynomials (integrating to 1) on [0,1] range predicting probability density based on length 5 context (previous 5 values) in 100 random positions of analysed sequence (normalized Dow Jones Industrial Averages): joint density for d=1+5=6 variables (current value and context) was MSE fitted as polynomial, then substituting the current context and normalizing to integrate to 1, we get predicted density for the current value. We can see that some predicted polynomials go below 0, what can be calibrated to be interpreted as low positive (Fig. 4, 5). Predicted densities are usually close to marked ρ=1 uniform density (obtained if not using context), but often localize improving prediction. Bottom: evaluation plot by sorting predicted densities for the actual current values in all 29349 situations: in ≈20% of cases it gives worse prediction than ρ=1 (without using context), but in the remaining cases it is essentially better. The number of coefficients in the used basis is |B|=(m+1)d. We can see that prediction generally improves (higher density) with growing number of coefficients, however, beside growing computational cost, it comes with overfitting (leading e.g. to negative densities) - polynomial approaches sum of Dirac deltas in points of the sample .

Modeling spatial or temporal statistical dependencies between observed values is a difficult task required in many applications. Standard approaches like correlation matrix, PCA (principal component analysis) approximate this behavior with multivariate gaussian distribution. Further corrections can be extracted by approaches like GMM (gaussian mixture model), KDE (kernel density estimation



or ICA (independent component analysis

, but they have many weaknesses like lack of error control, large freedom of parameters and varying their number, or focusing on a specific types of distributions.

Approximating observed data sample with a polynomial is universal approach used in many fields of science, can provide as close description as needed. It turns out also very advantageous for density estimation, including multivariate joint distribution ([3, 4]

), especially if variables are normalized to approximately uniform distribution on

with CDF of approximated distribution like in copula theory , to properly model tails, optimize weights and standardize coefficients.

Using orthonormal basis , it turns out that mean-square (MSE, ) optimization leads to estimated coefficients being just averages over the observed sample: . For multiple variables we can use basis of products of 1D orthornormal polynomials. On example of DJIA time series 111Source of DJIA time series: http://www.idvbook.com/teaching-aid/data-sets/the-dow-jones-industrial-average-data-set/, with results summarized in Fig. 1, it will be used for prediction of current probability distribution based on a few previous values.

Financial time series are usually modelled with ARMA-ARCH-like simple models 

for this purpose, exploiting pairwise dependencies between first two moments. Presented approach allows to systematically expand it in MSE-optimal way for including any moments among any number of variables, also for non-stationary time series. Additionally, standard models are usually based on Gaussian distribution, which turns out improper e.g. for daily log returns as we can see in a few figures here (especially Fig.

11) - presented approach can be based on any distribution (e.g. Laplace), using it for normalization of variables.

Finally, asymptotically (with basis and sample size ) we can approach complete description of any statistical dependencies - any real joint distribution of observed variables. Coefficients can be cheaply calculated as just averages, are unique and independently calculated, for stationary time series we can control their accuracy. Each has also a specific interpretation: resembling cumulants, but being much more convenient for reconstructing probability distribution - instead of the difficult problem of moments , here they are just coefficients of polynomial as density. Inconvenience of parameterizing density with polynomial are occurring negative predictions, which can be interpreted as small positive - we can optimize this calibration based on the data sample as discussed later (Fig. 4, 5).

In the discussed example: analysis of DJIA time series, we will first normalize the variables to nearly uniform probability distribution on

: by considering differences of logarithms (so called log returns), and transforming them by CDF (cumulative distribution function) of approximated distribution (usually Laplace) as shown in Fig.

2 (and 10). In copula theory 

such normalization is made with empirical distribution, then there is used usually single parameter distribution to describe dependencies. Here we normalize with parametric distribution instead, repairing inaccuracy by fitting polynomial, described by as large number of parameters as needed - it allows for flexibility, including modelling evolution of polynomial coefficients for non-stationary time series. Additionally, thanks to normalization, coordinates correspond to quantiles e.g. 1/2 to median, length to percentage of population - providing proper weights for optimized

distance.

After normalization , looking at successive positions of , if uncorrelated they would come from distribution on . Its corrections as linear combination of orthonormal basis of polynomials can be inexpensively and independently calculated, providing unique and asymptotically complete description of statistical dependencies between these neighboring values. Treating

of them as earlier context, substituting their values and normalizing to 1, we get predictions of (conditional) probability distribution for the current value as summarized in Fig.

1. Figure 2: Normalization of the original variable to nearly uniform on [0,1] (marked green) used for further correlation modelling. The original sequence {vt} of 29355 Dow Jones daily averages (over 100 years) is first logarithmized (top plot), then we take differences yt=ln(vt+1)−ln(vt). Sorting {yt} we get its approximated CDF, which, in contrast to standard Gaussian assumption, turns out in good agreement with Laplace distribution (μ≈0.00044, b≈0.0072) - estimated and drawn (red) in the second plot. The marked green next plot is the final xt=CDFLaplace(μ,b)(yt) sequence used for further correlation modeling. The bottom plot shows sorted {xt} values to verify that they come from nearly uniform distribution (line) - its inaccuracy will be repaired later with fitting polynomial (Fig. 6). It corresponds to standard unconditional coverage test  checking if modelled quantiles agree with population percentages - it is usually done only for a few extreme quantiles (e.g. 1%, 5%), getting line for such sorted CDF(y) would mean perfect agreement for all quantiles.

Handling of non-stationary time series is also proposed: by replacing

global average with local averages over past values with exponentially decaying weights, or using interpolation with time as additional dimension.

Presented approach can be naturally extended to multivariate time series, e.g. stock prices of separate companies to model their statistical dependencies, what is presented in  on example of yield curve parameters, here there will be presented example of modelling various statistical dependencies of 29 of Dow Jones companies.

More relaxed treatment of values of such polynomials as cumulant-like features and contributions, also combining with discrete variables, can be found in  for credibility evaluation of income data.

## Ii Normalization to nearly uniform density

We will first work on example of Dow Jones Industrial Averages time series for . As financial data usually evolve in multiplicative not additive manner, we will work with to make it additive.

Time series are usually normalized to allow assumption of stationary process: such that joint probability distribution does not change when position is shifted. The standard approach, especially for gaussian distribution, is to subtract mean value, then divide by the standard deviation.

However, above normalization does not exploit local dependencies between values, what we are interested in. Using experience from data compression (especially lossless image e.g. JPEG LS ), we can use a predictor for the next value based on its local context: for example a few previous values (2D neighbors for image compression), or some more complex features (e.g. using averages over time windows, or dimensionality reduction methods like PCA), then model probability distribution of difference from the predicted value (residue).

Considering simple linear predictors: as in ARMA-like models, we can use optimize parameters to minimize mean square error. For 2D image such optimization leads to approximate parameters . For Dow Jones sequence such optimization leads to nearly negligible weights for all but the previous value. Hence, for simplicity we will just operate on logarithmic returns:

 yt=ln(vt+1)−ln(vt) (1)

time series, where the number of possible indexes has been reduced by 1 due to shift: .

Such sequences of differences from predictions (residues) are well known in data compression to have nearly Laplace distribution - density:

 g(y)=12bexp(−|y−μ|b) (2)

where maximum likelihood estimation of parameters is just: median of , mean of . We can see in Fig. 2 that CDF from sorted

values has decent agreement with CDF of Laplace distribution. Otherwise, there can be used e.g. generalized normal distribution

, also called exponential power distribution (EPD) or generalized error distributions, which includes both gaussian and Laplace distribution - its optimization is presented in Fig. 11. Student’s t or stable distributions (Levy)  might be also worth considering as they include heavy tail distributions. These distributions have also known asymmetric variants - which can be considered if two directions have essentially different tails.

We could also use a varying distribution for normalization , using for example CDF of Gaussian of width depending on previous value using coefficients from ARCH model. This way we could directly enhance some successful standard model - by using it for normalization, than fitting polynomial to exploit complex further statistical dependencies: missed by the low parameter model.

For simplicity there is mainly used Laplace distribution here to normalize our variables to nearly uniform in , what allows to compactify the tails, improve performance (weight proportional to population) and normalize further coefficients:

 xt=G(yt)whereG(y)=∫y−∞g(y′)dy′ (3)

is CDF of used distribution (Laplace here). We can see in Fig. 2 (and later 10) that this final sequence has nearly uniform probability distribution. Its corrections will be included in further estimation of polynomial as (joint) probability distribution, like presented later in Fig. 6.

We will search for density. To remove transformation (3) to get final density, observe that . Differentiating over , we get

 ρY(y)=ρX(G(y))⋅g(y). (4)

## Iii Hierarchical correlation reconstruction

After normalization we have sequence with nearly uniform density, marked green in Fig. 2 here. Taking its succeeding values, if uncorrelated they would come from nearly uniform distribution on - difference from uniform distribution describes statistical dependencies in our time series. We will use polynomial to describe them: estimate joint density for succeeding values of .

Define for and , . They form vectors containing value with its context - we will model (joint) probability density of these vectors as polynomial. Generally we can also use more sophisticated contexts, for example average of a few earlier values (e.g. ) as a single context variable to include longer range correlations, or e.g. some macroeconomical variables to exploit also dependence from additional information. Dimensionality reduction techniques like PCA can be used to include dependence from a large number of variables. Normalization to nearly uniform density is recommended for the predicted values (), for context values it might be better to omit it, especially when absolute values are important like for image compression.

Finally assume we have vector sequence of value with its context, we would like to model density of such vectors as polynomial. It turns out  that using orthonormal basis, which for multidimensional case can be products of 1D orthonormal polynomials, mean square () optimization leads to extremely simple formula for estimated coefficients:

 ρ(x)=∑j∈{0…m}dajf%j(x)=m∑j1…jd=0ajfj1(x1)⋅…⋅fjd(xd)
 with estimated coefficients:aj=1nn∑t=1fj(xt) (5)

The basis used this way has functions, generally it seems worth to consider different for separate coordinates , or restrict to include correlations only from e.g. pairs by using j with at most two nonzero indexes. Beside inexpensive calculation, this simple approach has also very convenient property of coefficients being independently calculated, giving each j unique value and interpretation, and controllable error. Independence also allows for flexibility of considered basis - instead of using all j, we can focus on more promising ones: with larger absolute value of coefficient, replacing negligible . Instead of mean square optimization, we can use often preferred: likelihood maximization , but it requires additional iterative optimization and introduces dependencies between coefficients.

Above 1D polynomials are orthonormal in : , getting (rescaled Legendre): and for correspondingly:

 √3(2x−1),√5(6x2−6x+1),√7(20x3−30x2+12x−1),
 3(70x4−140x3+90x2−20x+1),
 √11(252x5−630x4+560x3−210x2+30x−1).

Their plots are in the top of Fig. 3. corresponds to normalization. The coefficient decides about reducing or increasing the mean - has similar interpretation as expected value. Analogously

coefficient decides about focusing or spreading given variable, similarly as variance. And so on: further

has similar interpretation as -th cumulant, however, while reconstructing density from moments is a difficult problem, presented description is directly coefficients of polynomial estimating the density.

For multiple variables, describes only correlations between coordinates, does not affect coordinates, as we can see in the center of Fig. 3. Each such mixed coefficient has also a specific interpretations here, for example decides between increase and decrease of second variable with increase of the first, analogously decides focus or spread of the second variable.

Assuming stationary time series (fixed joint distribution of ), errors of such estimated coefficients come from approximately gaussian distribution:

 ~a−a∼N(0,1√n√∫(fj−aj)2ρdx) (6)

For the integral has value 1, getting in our case. As we can see in bottom of Fig. 3, a few percents of coefficients here are more that tenfold larger: can be considered as essential, not a result of noise.

Here is a list of the largest coefficients for Dow Jones normalized series (beside ) in , case. It neglects shifted sequences, for example . Figure 3: Top: the first 6 of used 1D orthonormal basis of polynomials (⟨f,g⟩=∫10fgdx): j=0 coefficient guards normalization, the remaining functions integrate to 0, and their coefficients describe perturbation from uniform distribution. These coefficients have similar interpretation as cumulants, but are more convenient for density reconstruction. Center: 2D product basis for j∈{0,1,2}. The j=0 coordinates do not change with corresponding perturbation. Bottom: sorted calculated coefficients (without a000000=1) for DJIA sequence, m=5 and length 5 context (d=6) modelling. Assuming stationarity, for uniform distribution their standard deviation would be σ≈1/√n≈0.006, exceeded here more than tenfold by many coefficients - allowing to conclude that they are essential: not just a noise.

Positive:

Negative: Figure 4: Finding calibration function φ: proper interpretation of predicted densities (ρ→φ(ρ)), including negative as small positive. Fig. 5 shows example of its use. Top right: sorted predicted densities for actually observed values: evaluation plot from Fig. 1 for m=5,d=6 case. Top left: density of its values, calculated numerically as 1/derivative from top right plot, it integrates to ≈1. Bottom left: density of all predicted values from all n predicted polynomials - it can be found analytically, but it is simpler to do it numerically: [0,1] range was approximated with size 1000 regular lattice, getting 1000n values from all polynomials. They were sorted getting empirical distribution function, which 1/derivative is the presented density, integrating to ≈1. Bottom right: finding calibration function from dividing both densities - such that we should interpret predicted polynomial ρ as density φ(ρ). Orange line shows using φ(z)=z in this place, and obviously has a problem for predicted densities below zero. It agrees with blue plot trend on [0,2] range, confirming that predictions in this range can be interpreted as practically unchanged. For application we should smoothen and parameterize this function, like using the green plot: φ(z)=max(0.15,min(z,0.15z+1.7)) including small positive density for values below 0 and slower positive trend above 2. Alternatively, calibration function can be chosen by assuming some parametrization like φ(ρ)=max(ρ,a) for a≈0.2 and optimizing its parameters to maximize log-likelihood of prediction like in Fig. 11. Figure 5: Examples of predicted polynomials for the discussed m=5,d=6 DJIA case, their calibration to nonnegative densities, and translating into final densities (removing normalization with Laplace) - for randomly chosen 10 points (moments of time, the same for all rows). However, each row applies only first m′=1..5 of predicted cumulant-like coefficients: presented predicted polynomials are of these degrees. For simplicity each use calibration: φ(z)=max(0.15,min(z,0.15z+1.7)) green plot from Fig. 4, calculated for the last row (the remaining have slightly different but similar). Applying ρ→φ(ρ) usually damage probabilistic normalization, hence calibration step also includes division by ∫10φ(ρ(x))dx. Finally to remove normalization with Laplace distribution here, there is applied Laplace CDF−1 on horizontal axis, and multiplication by Laplace distribution density on vertical axis (4).

Each such unique coefficient describes a specific correction from uniform density: by . For example we can see large positive coefficients for all pairs of , what means upward directed parabola for both variables: quantitatively describes how large change in a given day increases probability of large changes in neighboring days. Further coefficients have more complex interpretations, for example large positive means that 6 large increases in a row are preferred, but 6 large decreases are less likely. In contrast, large negative means that larger change 5 days earlier reduces probability of 5 large increases in a row. Figure 6: Modelling probability distribution as independent variable (d=1) using degree m polynomials: ρ(x)=∑mj=0ajfj(x). After normalization with CDF of Laplace distribution, we should have ρ≈1. Here we repair its inaccuracy with estimated polynomial (left column), corresponding to the plot in the bottom of Fig. 2 - the right column contains differences between empirical CDF and such fitted polynomial. Obviously this difference reduces with degree m, however, we can see that it contains a growing number (≈m) of oscillations (Runge’s phenomenon).

Having such density we can use it to predict probability distribution of the current symbol basing on the context (Fig. 1): by substituting context to the polynomial and normalizing the remaining 1D polynomial to integrate to 1. This predicted density as polynomial can sometimes go below zero, what needs a separate interpretation as low positive - we can obtain such proper calibration from the data sample, as discussed in Fig. 4, 5.

## Iv Adaptivity for non-stationary time series Figure 7: Modelling non-stationary probability distribution of values (d=1) - like in Fig. 6, but adapted to inhomogeneous behavior in time. The top two plots used adaptive averaging at+1f=λatf+(1−λ)f(xt) for m=4 with two different learning rates λ=0.9997 or 0.999. The bottom plot has estimated m=9 degree polynomial for density of (t,xt) variables - in contrast to adaptive averaging, it requires already knowing the future. Figure 8: Top: evaluation of results of models presented in Fig. 7 - sorted predicted densities of actual values. They are compared with stationary model: green line, using fixed coefficients being averages over entire time period. Bottom: time dependence of first four coefficients over the time: ρt(x)=∑jatjfj(x). They are constant for the stationary model (green lines), degree 9 polynomials for interpolation (red), and noisy curves for adaptive averaging - especially the orange one for relatively low λ=0.999. The blue curve for λ=0.9997 is smoother, however, it is at cost of delay (shifted right) - needs more time to adapt to new behavior.

We have previously assumed stationary time series: that joint probability distribution within length moving time windows is fixed, what is often only approximation for real time series. As coefficients here are just averages over values: , for coefficients describing local behavior we can use (known in data compression) averaging with exponentially decaying weights :

 at+1f=λatf+(1−λ)f(xt)ρt(x)=∑fatff(x) (7)

for some learning rate : close but smaller than 1 (e.g. ), starting for example with . Its proper choice is a difficult question: larger gives smoother behavior, but needs more time to adapt (delay).

For modeling of time trends or a posteriori analysis of historical data (with known future), we can alternatively estimate polynomial for multi-dimensional variable with time as one of coordinates, rescaled to range, e.g. . This way we estimate behavior of each coefficient as polynomial, allowing e.g. to interpolate to real time, or try to forecast that future trend (as low degree polynomial) will be similar as in the earlier period.

It might be tempting to use this approach also for extrapolation to predict future trends, e.g. rescale time to range instead, and look at behavior in time 1. However, such polynomial often has some uncontrollable behavior at the boundaries, suggesting caution while such extrapolation. Other orthonormal families (e.g. sines and cosines) have better boundary behavior - might be more appropriate for such extrapolation, however, discussed earlier modelling of joint distribution with context representing the past is generally a safer approach.

The following three figures present such analysis for discussed DJIA sequence. Figure 6 contains estimation of density as polynomial using stationarity assumption - models inaccuracy of Laplace used in normalization. Figure 7 contains its time evolution for non-stationary models: adaptive or interpolation. Figure 8 evaluates these approaches and shows time evolution for first 4 cumulant-like coefficients.

## V Modelling multivariate dependencies Figure 9: Top: 10-year evolution 2008-2018 of logarithm of stock price of 29 out of 30 Dow Jones companies (without DWDP). We will work on their daily differences: yi. Center: eigenvectors corresponding to 5 largest eigenvalues of their covariance matrix (Σij=E[(Yi−E[Yi])(Yj−E[Yj])]) is a popular tool (PCA) showing directions of largest variance, grouping dependent variables - e.g. financial companies in vector presented as the yellow plot. Bottom: correlation matrix (covariance matrix normalized to unit variance for each variable) - real non-negative, describing strength of dependency between variables.

Example of predicting probability distribution for multivariate time series is presented in . To continue the Dow Jones topic, there will be presented application of the discussed methodology to better understand complex statistical dependencies between stock prizes of 30 companies DJIA is calculated from, for example for more accurate wallet analysis. The selection of these companies has changed throughout the history - there will be used the one from September 2018. Daily prices for the last 10 years can be downloaded from NASDAQ webpage (www.nasdaq.com) for all but DowDuPont (DWDP) - there will be used daily close values for 2008-08-14 to 2018-08-14 period ( values) for the remaining 29 companies: 3M (MMM), American Express (AXP), Apple (AAPL), Boeing (BA), Caterpillar (CAT), Chevron (CVX), Cisco Systems (CSCO), Coca-Cola (KO), ExxonMobil (XOM), Goldman Sachs (GS), The Home Depot (HD), IBM (IBM), Intel (INTC), Johnson&Johnson (JNJ), JPMorgan Chase (JPM), McDonald’s (MCD), Merck&Company (MRK), Microsoft (MSFT), Nike (NKE), Pfizer (PFE), Procter&Gampble (PG), Travelers (TRV), UnitedHealth Group (UNH), United Technologies (UTX), Verizon (VZ), Visa (V), Walmart (WMT), Walgreens Boots Alliance (WBA) and Walt Disney (DIS). We will again work on logarithmic returns.

The standard approach to quantitatively describe statistical dependencies between variables is by correlation matrix - presented in Fig. 9. We can use discussed here cumulant-like coefficients for better undersetting of more complex statistical dependencies and their time trends, each of them having unique specific interpretation.

The simplest application is modelling pairwise statistical dependencies - this time spatial instead of temporal as before. After normalization of all variables to nearly uniform distribution (with Laplace CDF like previously), pairwise joint distribution would be nearly uniform on if they are uncorrelated. To this initial density we add independently calculated correction as products of two orthonormal polynomials - coefficients of the first three for all pairs are presented in Fig. 12 and 13. There are also presented linear time trends for two of them: using first coefficient with time rescaled to as additional variable like in the previous section. The diagonal terms have no meaning in this methodology - are filled with a constant value.

These additional coefficients allow for deeper understanding of statistical dependencies, like between growth of one variable and variance of the second (12) or between their variances (22) like in ARCH. Time trends (calculated for previous 10 years) may suggest direction of further evolution of these dependencies. These coefficients are similar to multivariate mixed cumulants, but having a direct translation to probability density.

Figure 9 also contains first 5 eigenvectors of covariance matrix as in standard PCA technique. They correspond to largest variance directions and often turn out to group dependent variables. Here we usually estimate density as higher degree polynomials (formally ), for which there are generalizations of eigenvector decomposition , but they are more difficult to calculate and interpret. Presented matrices use only subset of coefficients, allowing to use standard eigenvector decomposition, e.g. :

 ∑αβcαβf2(xα)f2(xβ)=n∑k=1λk(∑αvkαf2(xα))2

Presented matrices describe only pair-wise statistical dependencies, we can analogously model dependencies in triples e.g. or in larger groups, and their time evolutions.

Log-likelihood evaluation and examples of modelling pairwise joint distributions are presented in Fig. 11. We can see how problematic is being fixed on using only Gaussian distribution, what leaves a burden of far inferior predictions - in terms of both unconditional coverage (Fig. 10) and log-likelihood. If choosing base distribution accordingly to data (like Laplace), exploiting further statistical dependencies (3 diagrams) leaves not much place for further improvements (from to times larger average density) for this type of data. Generally possible improvement is bounded by conditional entropy: average over possible contexts of entropy of the new value, which asymptotically is minus average log-likelihood. For example if we would like to encode these values with accuracy, for large we would need on average minus log-likelihood bits per value.

The discussed here approach starts with normalizing all variables to nearly uniform distribution. Alternative approach is to multiply idealized distribution (e.g. multivariate Gaussian from PCA, or Laplace) by polynomial, estimating the coefficients .

## Vi Extensions

The used examples presented basic methodologies for educative reasons, which in real models can be optimized, for example:

• Selective choice of basis: we have used complete basis of polynomials, what makes its size impractically large especially for high dimensions. However, usually only a small percentage of coefficients is above noise - we can selectively choose and use a sparse basis of significant values instead - describing real statistical dependencies, mixed moments between only a small number of variables. Another option is to selectively reduce polynomial degree for some of variables, or for example restrict the real degree of the entire polynomial: instead of each coordinate.

• Long-range value prediction: combining with state-of-art prediction models exploiting long-range dependencies, for example using a more sophisticated (than just the previous value) predictor of the current value.

• Improving informational content of context used for prediction: instead of using a few previous values as the context, we can use some features e.g. describing long-range behavior like average over a time window, or for example obtained from dimensionality reduction methods like PCA (principal component analysis).

• Maybe directly combining with a successful e.g. ARMA/ARCH-like model: use it for normalization using its modeled distributions like ARCH Gaussian of varying width, then model e.g. with polynomial to extract and exploit further statistical dependencies from data sample.

• Multivariate time series usually allow for much better prediction, as presented in . Adding for example macroeconomic variables should improve prediction.

This appendix contains Wolfram Mathematica source for basic discussed procedure and stationary process, optimized to use built-in vector operations:

im = Import["c:/djia-100.xls"];
v = Log[Transpose[im[]][[2, 2 ;; -1]]];
Print[ListPlot[v]];
n0 = Length[v];
yt = Table[v[[i + 1]] - v[[i]], {i, n1 = n0 - 1}];
syt = Sort[yt];                (* for approximated CDF *)
mu = Median[yt];                 (* Laplace estimation *)
b = Mean[Abs[yt - mu]];
cdfL = If[y < mu, Exp[(y-mu)/b]/2, 1-Exp[-(y-mu)/b]/2];
Print["Laplace distribution: mu= ", mu, "  b= ", b];
Print[Show[
ListPlot[Table[{syt[[i]], (i - 0.5)/n1}, {i, n1}]],
Plot[cdfL, {y, -0.1,0.1},PlotStyle -> {Thin, Red}]]];
xt = Table[cdfL /. y -> yt[[i]],{i,n1}]; (* normalized *)
Print[ListPlot[Sort[xt]]]; Print[ListPlot[xt]];
cl = 3; d = 1 + cl;  (* dimension = 1 + context length *)
m = 4;                 (* maximal degree of polynomial *)
coefn = Power[m + 1, d]; Print[coefn, " coefficients"];
p = Table[Power[x, k], {k, 0, m}];
p = Simplify[Orthogonalize[p,Integrate[#1 #2,{x,0,1}]&]];
Print["used orthonormal polynomials: ", p];
n = n1 - cl;            (* final number of data points *)
(* table of contexts and their polynomials: *)
ct = Transpose[Table[xt[[i + cl ;; i ;; -1]], {i, n}]];
ctp = Table[
If[j==1, Power[ct,0], p[[j]] /. x -> ct], {j, m+1}];
(* calculate coefficients: *)
coef = Table[jt = IntegerDigits[jn, m + 1, d] + 1;
Mean[Product[ctp[[jt[[c]], c]], {c, d}]],
{jn, 0, coefn - 1}];

(* find 1D polynomials for various times: *)
pt = Table[0, {i, m + 1}, {i, n}];
Do[jt = IntegerDigits[jn, m + 1, d] + 1;
pt[[jt[]]] +=
coef[[jn+1]] * Product[ctp[[jt[[c]], c]],
{c, 2, cl + 1}], {jn, 0, coefn - 1}];
(* probability normalization to 1: *)
Do[pt[[i]] /= pt[], {i, m + 1, 1, -1}];
(* predicted densities for observed values: *)
rho = Sum[ctp[[i, 1]] * pt[[i]], {i, m + 1}];
Print[ListPlot[Sort[rho]]];
(* densities in 10 random times: *)
plst = RandomInteger[{1, n}, 10];
pl = Table[i = plst[[k]];
Sum[pt[[j, i]]*p[[j]], {j, m + 1}], {k, Length[plst]}];
Plot[pl, {x, 0, 1}, PlotRange -> {{0, 1}, {0, 5}}]