I Introduction
In standard parametric estimation we choose some density family ρ_θ and assume that all datapoints come from this distribution with the same parameters θ. In maximum likelihood (ML) estimation we find θ maximizing Σ_t ln ρ_θ(x_t), with equal contribution of all datapoints. This estimation is appropriate for an i.i.d. sequence from a stationary time series. For distinction, in analogy to the static–adaptive separation of models in data compression [1], let us refer to it as static estimation.
For nonstationary time series these parameters may evolve in time, like the estimated density at the bottom of Fig. 1. To estimate such parameter evolution, we discuss adaptive estimation using a moving estimator [2], which for time t finds θ_t maximizing a moving log-likelihood based only on the previously seen datapoints, for example using exponentially weakening weights:
l_t = Σ_{τ&lt;t} η^{t−τ} ln ρ_θ(x_τ),    θ_t = argmax_θ l_t    (1)
where η ∈ (0, 1) defines the rate of weakening of the contribution of the old points in such an exponential moving average. For η → 1 it becomes ML estimation based on all previous points. In practice usually η is slightly below 1.
While standard static estimation is performed once, finding a compromise for all datapoints, the discussed adaptive estimation is generally much more computationally expensive: it needs to be performed separately for each time t. However, in some situations it can be optimized, at least for some parameters, by making it an evolving estimation exploiting previously found values. For example, when the standard ML estimator is given by an average of some function of datapoints, we can transform it to adaptive estimation by just replacing this average with an exponential moving average.
Specifically, we will focus on the exponential power distribution (EPD) [3] family: ρ(x) ∝ exp(−|x−μ|^κ/(κσ^κ)), which covers a wide range of tail behaviors, like the Gaussian (κ = 2) or Laplace (κ = 1) distribution. It is also convenient for such adaptive estimation of the scale parameter σ, as in standard ML estimation σ^κ is the average of |x−μ|^κ. We can transform it to adaptive estimation by just replacing this average with an exponential moving average: (σ_{t+1})^κ = η (σ_t)^κ + (1−η) |x_t−μ|^κ.
On the example of 100 years of Dow Jones Industrial Average (DJIA) daily log-returns, and 10 years for 29 of its recent companies, we have tested that such adaptive estimation of σ leads to essentially better log-likelihoods than standard static estimation, as we can see in Fig. 1 and 2. Surprisingly, the parameter κ defining tail behavior, usually just chosen as κ = 2 by assuming Gaussian distribution, turns out less universal: various companies have different optimal κ, much closer to the heavier tail of the Laplace distribution (κ = 1).
The discussed general philosophy of adaptive estimation directly focuses on nonstationarity of time series, trying to model the evolution of parameters. Its tools like adaptive EPD can be used as building blocks for more complete methods like the ARIMA-GARCH family. Surprisingly, such adaptive EPD estimation (just adapting σ) for this data already turns out comparable with much more sophisticated standard methods like GARCH(1,1) [4], represented as red lines in Fig. 2.
II Exponential power distribution (EPD)
For κ shape parameter, σ scale parameter and μ location, the probability density function (PDF, ρ) and cumulative distribution function (CDF, F) of EPD are correspondingly:

ρ_{μ,σ,κ}(x) = exp(−|x−μ|^κ/(κσ^κ)) / (2 σ κ^{1/κ} Γ(1+1/κ))
F_{μ,σ,κ}(x) = 1/2 + sgn(x−μ) P(1/κ, |x−μ|^κ/(κσ^κ)) / 2    (2)

where Γ is the Euler gamma function and P is the regularized incomplete gamma function. This PDF and CDF are visualized in Fig. 3.
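As a sanity check of these formulas, here is a minimal Python sketch of the PDF and CDF; the function names and defaults are our own illustrative choices, not part of any EPD library:

```python
import math
import numpy as np
from scipy.special import gammainc  # regularized lower incomplete gamma P(s, z)

def epd_pdf(x, mu=0.0, sigma=1.0, kappa=2.0):
    """EPD density: exp(-|x-mu|^kappa/(kappa sigma^kappa)) / (2 sigma kappa^(1/kappa) Gamma(1+1/kappa))."""
    c = 1.0 / (2.0 * sigma * kappa**(1.0/kappa) * math.gamma(1.0 + 1.0/kappa))
    z = np.abs(np.asarray(x, dtype=float) - mu) / sigma
    return c * np.exp(-z**kappa / kappa)

def epd_cdf(x, mu=0.0, sigma=1.0, kappa=2.0):
    """EPD CDF: 1/2 + sign(x-mu) P(1/kappa, |x-mu|^kappa/(kappa sigma^kappa)) / 2."""
    x = np.asarray(x, dtype=float)
    z = np.abs(x - mu) / sigma
    return 0.5 + 0.5 * np.sign(x - mu) * gammainc(1.0/kappa, z**kappa / kappa)
```

For κ = 2 this reduces to the standard Gaussian density, for κ = 1 to the Laplace density e^{−|x−μ|/σ}/(2σ).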
II-A Static parameter estimation
Let us start with standard static ML estimation, assuming an i.i.d. sequence (x_t)_{t=1..n}. For generality let us use weights w_t ≥ 0 of points, Σ_t w_t = 1, to be imagined as the contribution of each point. In standard static estimation we assume equal contributions w_t = 1/n.
Such general weighted log-likelihood is:

l = Σ_t w_t ln ρ_{μ,σ,κ}(x_t)    (3)
From the necessary condition ∂l/∂σ = 0 we get the maximum likelihood estimator for the scale parameter (assuming fixed μ, κ):

σ^κ = Σ_t w_t |x_t − μ|^κ    (4)
There is no general analytic formula for the remaining parameters, but they can be estimated numerically. Estimation of the location μ can be expressed as:

μ = argmin_μ Σ_t w_t |x_t − μ|^κ    (5)
For the Gaussian distribution (κ = 2) it is just the mean of values x_t; for the Laplace distribution (κ = 1) it is their median. A practical approximation, e.g. as an initial value of a more sophisticated estimation, might be just using the mean for all κ. The absolute central moments of EPD, useful later for method-of-moments estimation, are:

E|x − μ|^p = σ^p κ^{p/κ} Γ((p+1)/κ) / Γ(1/κ)    (6)
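This static estimation can be sketched in Python: the scale comes from the closed form (4), the location from a numerical search as in (5); the helper name and the use of SciPy's bounded scalar minimizer are our illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def static_epd_fit(x, kappa, w=None):
    """Static weighted ML estimation of EPD location mu (numerically, eq. (5))
    and scale sigma (closed form, eq. (4)) for a fixed shape kappa."""
    x = np.asarray(x, dtype=float)
    w = np.full(len(x), 1.0/len(x)) if w is None else np.asarray(w, dtype=float)
    # eq. (5): mu = argmin_mu sum_t w_t |x_t - mu|^kappa
    res = minimize_scalar(lambda m: np.sum(w * np.abs(x - m)**kappa),
                          bounds=(x.min(), x.max()), method="bounded")
    mu = res.x
    # eq. (4): sigma^kappa = sum_t w_t |x_t - mu|^kappa
    sigma = np.sum(w * np.abs(x - mu)**kappa)**(1.0/kappa)
    return mu, sigma
```

For κ = 2 the found μ is the mean of the datapoints, for κ = 1 their median, as stated above.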
II-B Adaptive estimation of scale parameter
Let us define a moving log-likelihood for time t, using only its previous points, exponentially weakening the weights of the old points to try to estimate parameters describing local behavior:

l_t = (1−η) Σ_{τ&lt;t} η^{t−1−τ} ln ρ(x_τ)    (7)

where the (1−η) factor makes the (exponential moving) weights sum to 1 in the large t limit.
Now fixing μ and κ, and optimizing σ, from (4) we get

(σ_t)^κ = (1−η) Σ_{τ&lt;t} η^{t−1−τ} |x_τ − μ|^κ    (8)
which is an exponential moving average (EMA) and can be evolved iteratively:

(σ_{t+1})^κ = η (σ_t)^κ + (1−η) |x_t − μ|^κ    (9)
The initial σ_1 has to be chosen arbitrarily.
We have transformed a static estimator given by an average into an adaptive estimator by just replacing the average with an exponential moving average. Observe that it can be analogously done for any estimator of the form θ = Σ_t w_t f(x_t).
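The iterative update (9) can be sketched as follows; the convention that σ_t is based only on points before time t matches the moving-estimator setup, and the names are illustrative:

```python
import numpy as np

def adaptive_sigma(x, kappa, eta, sigma_init, mu=0.0):
    """Adaptive scale estimation, eq. (9): evolve sigma^kappa as an EMA of |x_t - mu|^kappa.
    Returns the sigma_t used to model x_t, i.e. based only on points before time t."""
    s = sigma_init**kappa
    out = np.empty(len(x))
    for t, xt in enumerate(x):
        out[t] = s**(1.0/kappa)                        # parameter for time t: history only
        s = eta*s + (1.0 - eta)*abs(xt - mu)**kappa    # eq. (9) update
    return out
```

The iteration is equivalent to the direct weighted sum (8) up to the arbitrary initial value: (σ_t)^κ = η^{t−1} (σ_1)^κ + (1−η) Σ_{τ&lt;t} η^{t−1−τ} |x_τ − μ|^κ.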
II-C Generalization, interpretation and choice of η
To generalize the above, assume that estimation of a parameter θ is analogously given by an average (w_t = 1/n):

θ = (1/n) Σ_t f(x_t)    (10)
For the above σ^κ parameter of EPD with μ, κ fixed, we would have θ = σ^κ, f(x) = |x − μ|^κ.
Generally we analogously have adaptation for θ:

θ_{t+1} = η θ_t + (1−η) f(x_t)    (11)
Denoting ε = 1 − η, we can write it as:

θ_{t+1} = θ_t + ε (f(x_t) − θ_t)    (12)
allowing us to imagine the evolution of θ as a random walk with step from the f(x_t) − θ_t random variable, which can evolve in time here. Generally ε = 1 − η is proportional to the speed of this random walk.
This interpretation could be used to optimize the choice of η, also its potential evolution: for example by calculating the (e.g. exponential moving averaged) square root of variance of the θ_{t+1} − θ_t sequence, and the square root of variance of the f(x_t) − θ_t random variable; dividing them we get an estimation of the ε = 1 − η parameter.
We leave its details for future work, as the improvement from optimizing η over a fixed η was practically negligible for the analyzed daily log-return data. Fig. 2 presents such difference: the green plot for fixed η, and blue for individually optimized η.
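To illustrate this interpretation, here is a small sketch on synthetic data (our naming and parameter choices): evolving θ = σ^κ with (11) and dividing the standard deviation of its steps by that of f(x_t) − θ_t recovers 1 − η — here exactly, since by (12) each step equals (1−η)(f(x_t) − θ_t) by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 10000)
kappa, eta = 2.0, 0.95
f = np.abs(x)**kappa

# evolve theta = sigma^kappa with eq. (11)
theta = np.empty(len(x) + 1)
theta[0] = 1.0
for t in range(len(x)):
    theta[t+1] = eta*theta[t] + (1.0 - eta)*f[t]

# eq. (12) view: steps are (1-eta)*(f_t - theta_t), so the ratio of
# standard deviations recovers the 1-eta "speed" of the random walk
ratio = np.std(np.diff(theta)) / np.std(f - theta[:-1])
```

On real data, where the θ sequence is not generated by (11) itself, such a ratio would only give an approximate estimate of a reasonable 1 − η.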
II-D Approximated adaptive estimation of μ, κ
While in practice the most important is adaptation of the scale parameter σ, it might also be worth considering adaptation of the remaining parameters. Their estimation rather does not have analytical formulas already in the static case; hence, for the adaptive case we should look for practical approximations, preferably also based on EMA for inexpensive updating.
The location μ is the mean of such parametric distribution; just using the mean of datapoints is optimal for the Gaussian (κ = 2) case, and can be easily transformed to adaptive estimation. Hence a natural approximation of such estimation is the analogous EMA:

μ_{t+1} = μ_t + ν (x_t − μ_t)    (13)

for e.g. arbitrarily chosen initial μ_1 and some chosen rate ν, not necessarily equal to 1 − η.
Adaptive estimation of κ seems more difficult. One example of approximation is using the method of moments, e.g. with formula (6), especially as we can naturally get adaptive estimation of moments with EMA, for example using an additional analogous EMA for the estimated recent mean to center these moments.
Another general approach is gradient methods, adapting the chosen parameter(s) e.g. to increase the contribution to log-likelihood. For example, for some tiny ε: κ_{t+1} = κ_t + ε ∂ ln ρ_{μ,σ,κ}(x_t)/∂κ.
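A hedged sketch of such a gradient approach on synthetic Laplace data; the step size eps, the clamping range, and the use of a finite-difference derivative instead of an analytic one are our illustrative choices, not prescribed by the text:

```python
import math
import numpy as np

def log_rho(x, mu, sigma, kappa):
    """EPD log-density, from the normalization in eq. (2)."""
    return (-math.log(2.0) - math.log(sigma) - math.log(kappa)/kappa
            - math.lgamma(1.0 + 1.0/kappa)
            - (abs(x - mu)/sigma)**kappa / kappa)

def adapt_kappa(x, mu, sigma, kappa0=2.0, eps=0.01, h=1e-4):
    """Gradient-style adaptation: nudge kappa to increase each point's
    log-likelihood contribution, using a central finite difference."""
    kappa, path = kappa0, []
    for xt in x:
        grad = (log_rho(xt, mu, sigma, kappa + h)
                - log_rho(xt, mu, sigma, kappa - h)) / (2.0*h)
        kappa = min(max(kappa + eps*grad, 0.1), 10.0)  # keep kappa in a sane range
        path.append(kappa)
    return np.array(path)
```

Started at the Gaussian κ = 2 on Laplace-distributed data with σ = 1, such adaptation drifts toward κ ≈ 1, the heavier-tailed optimum.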
III DJIA log-returns tests
We will now look at evaluation of these methodologies from the perspective of 100 years of daily Dow Jones index values v_t (source of DJIA time series: http://www.idvbook.com/teachingaid/datasets/thedowjonesindustrialaveragedataset/), working on the log-returns sequence x_t = ln(v_{t+1}/v_t), summarized in Fig. 1.
As evaluation there is used the mean log-likelihood: (1/n) Σ_t ln ρ_t(x_t), where in the static setting ρ has constant parameters chosen by MLE, while in the adaptive setting these parameters evolve in time: they are estimated based only on the previous values.
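This static-versus-adaptive comparison can be reproduced on synthetic data; the sketch below (a sinusoidally varying scale and all parameter choices are our illustrative assumptions) evaluates the mean log-likelihood for the static MLE scale (4) versus the adaptive EMA scale (9):

```python
import math
import numpy as np

def mean_loglik(x, mu, sigma, kappa):
    """Mean EPD log-likelihood; sigma may be a scalar (static) or per-point array (adaptive)."""
    x, sigma = np.asarray(x, float), np.asarray(sigma, float)
    logc = -math.log(2.0) - math.log(kappa)/kappa - math.lgamma(1.0 + 1.0/kappa)
    return np.mean(logc - np.log(sigma) - (np.abs(x - mu)/sigma)**kappa / kappa)

rng = np.random.default_rng(0)
n, kappa, eta = 20000, 2.0, 0.97
true_sigma = np.exp(np.sin(np.linspace(0.0, 40.0*np.pi, n)))  # scale oscillates in [1/e, e]
x = rng.normal(0.0, true_sigma)

# static: sigma^kappa = average of |x|^kappa  (eq. (4), mu = 0)
sigma_static = np.mean(np.abs(x)**kappa)**(1.0/kappa)

# adaptive: EMA of |x|^kappa, using only past points  (eq. (9))
s, sig_adapt = 1.0, np.empty(n)
for t in range(n):
    sig_adapt[t] = s**(1.0/kappa)
    s = eta*s + (1.0 - eta)*abs(x[t])**kappa

print(mean_loglik(x, 0.0, sigma_static, kappa), mean_loglik(x, 0.0, sig_adapt, kappa))
```

On such nonstationary data the adaptive scale gives a clearly higher mean log-likelihood, mirroring the DJIA results.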
In the adaptive settings the initial values were chosen arbitrarily, and the remaining parameters from numerical search. Here are the obtained parameters and mean log-likelihoods for various settings:

- static Gaussian (κ = 2) distribution with MLE mean μ and scale σ,
- static Laplace (κ = 1) distribution with MLE median μ and scale σ,
- static EPD with MLE μ, σ and κ,
- Gaussian (κ = 2) with static μ and adaptive σ,
- Laplace (κ = 1) with static μ and adaptive σ,
- optimized κ with static μ and adaptive σ,
- Gaussian (κ = 2) with adaptive μ and adaptive σ,
- Laplace (κ = 1) with adaptive μ and adaptive σ,
- EPD with optimized κ, adaptive μ and adaptive σ.
We can see that the standard assumption of static Gaussian can be essentially improved both by going to κ closer to the Laplace distribution, and by switching to adaptive estimation of the scale parameter σ. Additional adaptive estimation of the location μ provides some tiny further improvement here. The final used evolution of μ and σ is presented in Fig. 4.
There were also trials of adapting κ, but they were not able to provide a noticeable improvement here.
Figure 2 additionally contains such evaluation of log-returns for 29 out of 30 companies used for this index in September 2018. Daily prices for the last 10 years were downloaded from the NASDAQ webpage (www.nasdaq.com) for all but DowDuPont (DWDP): there were used daily close values for the 2008-08-14 to 2018-08-14 period for the remaining 29 companies: 3M (MMM), American Express (AXP), Apple (AAPL), Boeing (BA), Caterpillar (CAT), Chevron (CVX), Cisco Systems (CSCO), Coca-Cola (KO), ExxonMobil (XOM), Goldman Sachs (GS), The Home Depot (HD), IBM (IBM), Intel (INTC), Johnson & Johnson (JNJ), JPMorgan Chase (JPM), McDonald's (MCD), Merck & Company (MRK), Microsoft (MSFT), Nike (NKE), Pfizer (PFE), Procter & Gamble (PG), Travelers (TRV), UnitedHealth Group (UNH), United Technologies (UTX), Verizon (VZ), Visa (V), Walmart (WMT), Walgreens Boots Alliance (WBA) and Walt Disney (DIS).
III-A Further improvements with Hierarchical Correlation Reconstruction
The estimated parametric distributions of variables in separate times often leave statistical dependencies which can be further exploited.
For this purpose, we can use the best found model, here EPD with adaptive μ and σ for the DJIA sequence, and use it for normalization of variables to nearly uniform distributions by going through the cumulative distribution functions (CDF) of the estimated parametric distributions: transform x_t to the u_t = F_{μ_t,σ_t,κ}(x_t) sequence. We can then take e.g. neighboring values of this sequence, which should be approximately from the uniform distribution on [0, 1]. In Hierarchical Correlation Reconstruction [6] we estimate the distortion from this uniform distribution as a polynomial with static or adaptively modelled coefficients.
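Such CDF normalization can be sketched as follows; the synthetic Gaussian data and all parameter choices are illustrative, while real use would apply the per-time adaptive parameters found above:

```python
import numpy as np
from scipy.special import gammainc  # regularized lower incomplete gamma

def epd_cdf(x, mu, sigma, kappa):
    """EPD CDF used to normalize x_t to u_t, which should be near-uniform on [0,1]."""
    z = (np.abs(x - mu) / sigma)**kappa / kappa
    return 0.5 + 0.5 * np.sign(x - mu) * gammainc(1.0/kappa, z)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 5000)   # stand-in for the log-return sequence
kappa, eta, s = 2.0, 0.97, 1.0   # s evolves as sigma^kappa
u = np.empty(len(x))
for t, xt in enumerate(x):
    u[t] = epd_cdf(xt, 0.0, s**(1.0/kappa), kappa)  # parameters from history only
    s = eta*s + (1.0 - eta)*abs(xt)**kappa          # eq. (9) update of sigma^kappa
```

Remaining non-uniformity and dependencies between neighboring u_t are exactly what the polynomial model of Hierarchical Correlation Reconstruction then exploits.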
The obtained mean log-likelihood improvement in 10-fold cross-validation, using such a polynomial model to improve the original EPD model, was evaluated for single variables with both static and adaptive coefficients.
For modelling the joint distribution of two neighboring variables (u_{t−1}, u_t), i.e. using the previous value to predict the conditional distribution, the improvement was evaluated for static and adaptive models; analogously for three neighboring variables.

IV Conclusions and further work
While applied time series analysis often uses a static Gaussian distribution, it was shown that a simple inexpensive generalization, to an adaptive distribution and the more general exponential power distribution, can essentially improve the standard evaluation: log-likelihood.
This article is focused only on a basic technique; in practice it can be further improved by combining with complementing methods, like the additional modelling, mentioned in Section III-A, of the joint distribution for variables normalized with CDF of the distributions discussed here.
While the discussed adaptive estimation of the scale parameter σ is MLE-optimal, the remaining parameters rather require some approximations; better approaches are worth further exploration.
Another open question is finding better ways of choosing the η rate of the exponential moving average, also varying in time to include a changing rate of forgetting, e.g. due to varying time differences between datapoints.
In contrast to e.g. Levy/stable distributions, the discussed EPD does not cover heavy tails (1/polynomial density) nor asymmetry; it is worth searching for practical transformations of other types of parametric distributions into adaptive estimation as well.
Appendix

This appendix contains the Wolfram Mathematica source for the used evaluation of adaptive exponential power distribution (vectorized for performance). The Prepend inserts initial values at the beginning, then [[1 ;; -2]] removes the last value, so the used density parameters are modeled based only on the history:
(* xt: sequence of values, kap: fixed kappa *)
(* eta, nu: EMA coefficients *)
(* mu1, sigma1: initial mu, sigma *)
cons = kap^(-1/kap)/2/Gamma[1 + 1/kap];
mu = ExponentialMovingAverage[Prepend[xt, mu1], nu][[1 ;; -2]];
sigma = ExponentialMovingAverage[Prepend[Abs[xt - mu]^kap, sigma1^kap], eta][[1 ;; -2]]^(1/kap);
rho = cons*Exp[-((Abs[xt - mu]/sigma)^kap)/kap]/sigma;
Mean[Log[rho]] (* mean log-likelihood *)
References
 [1] R. N. Williams, Adaptive data compression. Springer Science & Business Media, 2012, vol. 110.
 [2] J. Duda, “Parametric context adaptive Laplace distribution for multimedia compression,” arXiv preprint arXiv:1906.03238, 2019.
 [3] D. Zhu and V. Zinde-Walsh, “Properties and estimation of asymmetric exponential power distribution,” Journal of Econometrics, vol. 148, no. 1, pp. 86–99, 2009.
 [4] T. Bollerslev, “Generalized autoregressive conditional heteroskedasticity,” Journal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986.
 [6] J. Duda, “Exploiting statistical dependencies of time series with hierarchical correlation reconstruction,” arXiv preprint arXiv:1807.04119, 2018.