Adaptive exponential power distribution with moving estimator for nonstationary time series

03/04/2020 · by Jarek Duda, et al.

While standard estimation assumes that all datapoints come from a probability distribution with the same fixed parameters θ, we focus on maximum likelihood (ML) adaptive estimation for nonstationary time series: separately estimating parameters θ_T for each time T based on the earlier values (x_t)_{t<T}, using the (exponential) moving ML estimator θ_T = argmax_θ l_T for l_T = ∑_{t<T} η^{T-t} ln(ρ_θ(x_t)) and some η ∈ (0,1]. The computational cost of such a moving estimator is generally much higher, as we need to optimize the log-likelihood multiple times; however, in many cases it can be made inexpensive thanks to dependencies between successive estimates. We focus on such an example: the exponential power distribution (EPD) family ρ(x) ∝ exp(-|(x-μ)/σ|^κ/κ), which covers a wide range of tail behaviors like the Gaussian (κ=2) or Laplace (κ=1) distribution. It is also convenient for such adaptive estimation of the scale parameter σ, as its standard ML estimator makes σ^κ the average of |x-μ|^κ. By just replacing this average with an exponential moving average, (σ_{T+1})^κ = η(σ_T)^κ + (1-η)|x_T-μ|^κ, we can inexpensively make it adaptive. It is tested on daily log-return series for DJIA companies, leading to essentially better log-likelihoods than standard (static) estimation, surprisingly with the optimal κ tail type varying between companies. The presented general alternative estimation philosophy provides tools which might be useful for building better models for the analysis of nonstationary time series.




I Introduction

In standard parametric estimation we choose some density family ρ_θ and assume that all datapoints come from this distribution with the same parameters θ. For maximum likelihood (ML) estimation we find θ maximizing l = ∑_t ln(ρ_θ(x_t)), with equal contribution of every datapoint x_t. This estimation is perfect for an i.i.d. sequence from a stationary time series. For distinction, in analogy to the static-adaptive separation of models in data compression [1], let us refer to it as static estimation.

Figure 1: Simple evaluation on the example of mean log-likelihood for 100 years of daily log-returns of the DJIA sequence. The horizontal axis shows the shape parameter κ of the exponential power distribution density ρ(x) ∝ exp(-|(x-μ)/σ|^κ/κ). The orange line shows evaluation of the standard "static" model: MLE choosing fixed parameters, separately for each κ. The blue line shows evaluation of the simplest adaptive model: separately for each κ, fixing μ, and evolving the scale parameter σ as an exponential moving average of |x-μ|^κ up to the previous position: (σ_{T+1})^κ = η(σ_T)^κ + (1-η)|x_T-μ|^κ. We can see that 1) adaptivity brings a large log-likelihood improvement, 2) the optimal κ (marked with dots) is far from the Gaussian κ=2, much closer to the Laplace distribution κ=1, 3) the optimal κ for static and adaptive models are different.

For nonstationary time series these parameters might evolve in time, like the estimated density at the bottom of Fig. 1. To model such parameter evolution, we discuss adaptive estimation using a moving estimator [2], which for time T finds θ_T maximizing a moving log-likelihood based only on the previously seen datapoints (x_t)_{t<T}, for example using exponentially weakening weights:

θ_T = argmax_θ l_T    for    l_T = ∑_{t<T} η^{T-t} ln(ρ_θ(x_t))
where η ∈ (0,1] defines the rate of weakening of the contribution of old points in such an exponential moving average. For η = 1 it becomes ML estimation based on all previous points. In practice η is usually chosen close to 1.
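To make the moving-estimator definition concrete, here is a brute-force sketch in Python (the function name, the Gaussian-scale example with μ=0, and the candidate grid are our illustrative assumptions, not from the paper): for each T it maximizes l_T over a grid of candidates. The closed forms derived later avoid this per-step optimization cost.

```python
import math

def moving_ml_sigma(xs, eta, sigma_grid):
    """For each time T, brute-force maximize the moving log-likelihood
    l_T = sum_{t<T} eta^(T-t) * ln rho_sigma(x_t)
    over a grid of candidate Gaussian (mu=0) scale parameters sigma."""
    estimates = []
    for T in range(1, len(xs) + 1):
        def l(sigma):
            # exponentially weighted Gaussian log-likelihood of all points before T
            return sum(eta ** (T - t) *
                       (-math.log(sigma * math.sqrt(2 * math.pi))
                        - xs[t] ** 2 / (2 * sigma ** 2))
                       for t in range(T))
        estimates.append(max(sigma_grid, key=l))
    return estimates
```

On data whose scale jumps, the estimates track the recent regime rather than a global compromise, which is the point of the moving estimator.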

Figure 2: Analogously as in Fig. 1, log-likelihood (vertical) dependence on the shape parameter κ (horizontal) for modeling of daily log-returns for 100 years of DJIA, and 10 years for 29 of its recent companies. The lowest orange plots are for standard static MLE estimation of EPD. The higher green plots are for the discussed EPD adaptive estimation using a fixed η rate. The blue line uses an individually optimized η rate instead - it is usually nearly the same, and the log-likelihood improvement from its optimization is nearly negligible. For comparison, the red line shows evaluation using the standard GARCH(1,1) model - it uses the Gaussian distribution corresponding to κ=2, and we can see it is comparable to adaptive EPD. While the κ=2 type of tail behavior is usually assumed, the data suggest a much lower optimal κ, closer to the κ=1 of the Laplace distribution. Moreover, while intuition suggests some universality of tail behavior, the data show that it varies between companies. Additionally, we can see that the optimal κ is larger for adaptive estimation - intuitively, adaptivity has allowed the use of thinner tails.

While standard static estimation is performed once - finding a compromise for all datapoints - the discussed adaptive estimation is generally much more computationally expensive, as it needs to be performed separately for each time T. However, in some situations it can be optimized, at least for some parameters, by making it an evolving estimation exploiting previously found values. For example, when the standard ML estimator is given by an average of some function of the datapoints, we can transform it into adaptive estimation by just replacing this average with an exponential moving average.

Specifically, we will focus on the exponential power distribution (EPD) [3] family: ρ(x) ∝ exp(-|(x-μ)/σ|^κ/κ), which covers a wide range of tail behaviors like the Gaussian (κ=2) or Laplace (κ=1) distribution. It is also convenient for such adaptive estimation of the scale parameter σ, as in standard ML estimation σ^κ is the average of |x-μ|^κ. We can transform it to adaptive estimation by just replacing the average with an exponential moving average: (σ_{T+1})^κ = η(σ_T)^κ + (1-η)|x_T-μ|^κ.

On the example of 100 years of Dow Jones Industrial Average (DJIA) daily log-returns, and 10 years for 29 of its recent companies, we have tested that such adaptive estimation of σ leads to essentially better log-likelihoods than standard static estimation, as we can see in Fig. 1, 2. Surprisingly, the parameter κ defining tail behavior, usually just chosen as κ=2 by assuming a Gaussian distribution, turns out less universal - various companies have different optimal κ, much closer to the heavier tail of the Laplace distribution.

The discussed general philosophy of adaptive estimation directly focuses on the non-stationarity of time series - trying to model the evolution of parameters. Its tools like adaptive EPD can be used as building blocks for more elaborate methods like the ARIMA-GARCH family. Surprisingly, such adaptive EPD estimation (adapting just σ) for this data already turns out comparable with much more sophisticated standard methods like GARCH(1,1) [4], represented as red lines in Fig. 2.

II Exponential power distribution (EPD)

For κ shape parameter, σ scale parameter and μ location, the probability distribution function (PDF, ρ) and cumulative distribution function (CDF, F) of EPD are correspondingly:

ρ(x) = exp(-|x-μ|^κ/(κσ^κ)) / (2 κ^{1/κ} Γ(1+1/κ) σ)

F(x) = 1/2 + sign(x-μ) γ(1/κ, |x-μ|^κ/(κσ^κ)) / 2

where Γ is the Euler gamma function and γ is the regularized incomplete gamma function. This PDF and CDF are visualized in Fig. 3.
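These formulas might be sketched in Python as follows (stdlib only; the series implementation of the regularized incomplete gamma is our own illustrative helper, adequate for moderate arguments, not code from the paper):

```python
import math

def _reg_lower_gamma(s, x, terms=200):
    """Regularized lower incomplete gamma P(s, x), via its power series
    (illustrative helper; adequate for the moderate arguments used here)."""
    if x <= 0:
        return 0.0
    total, term = 0.0, 1.0 / s
    for k in range(1, terms):
        total += term
        term *= x / (s + k)
    return total * math.exp(-x + s * math.log(x)) / math.gamma(s)

def epd_pdf(x, kappa, mu=0.0, sigma=1.0):
    """EPD density: exp(-|x-mu|^kappa/(kappa*sigma^kappa)) / normalization."""
    c = kappa ** (-1.0 / kappa) / (2.0 * math.gamma(1.0 + 1.0 / kappa))
    z = abs((x - mu) / sigma)
    return c / sigma * math.exp(-z ** kappa / kappa)

def epd_cdf(x, kappa, mu=0.0, sigma=1.0):
    """EPD CDF via the regularized incomplete gamma function."""
    z = abs((x - mu) / sigma)
    p = 0.5 * _reg_lower_gamma(1.0 / kappa, z ** kappa / kappa)
    return 0.5 + p if x >= mu else 0.5 - p
```

For κ=2 this reproduces the standard normal density 1/√(2π) at x=μ, and for κ=1 the Laplace density 1/2.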

II-A Static parameter estimation

Let us start with standard static ML estimation, assuming an i.i.d. sequence. For generality let us use weights (w_t) of points, assuming ∑_t w_t = 1, to imagine them as the contribution of each point. In standard static estimation we assume equal contributions w_t = 1/n.

Such general weighted log-likelihood is:

l = ∑_t w_t ln(ρ_θ(x_t))
From the necessary condition ∂l/∂σ = 0 we get the maximum likelihood estimator for the scale parameter σ (assuming fixed μ, κ):

σ^κ = ∑_t w_t |x_t - μ|^κ
There is no general analytic formula for the remaining parameters, but they can be estimated numerically. Estimation of the location μ can be expressed as:

μ = argmin_μ ∑_t w_t |x_t - μ|^κ
For the Gaussian (κ=2) distribution it is just the (weighted) mean of the values x_t; for the Laplace (κ=1) distribution it is their median. Some practical approximation, e.g. as an initial value for a more sophisticated estimation, might be just using the mean for all κ.
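These static weighted estimators can be sketched as below (the grid search for μ is our illustrative choice of numerical minimization; for κ=2 it recovers the mean, for κ=1 the median):

```python
def ml_scale(xs, mu, kappa, ws=None):
    """Weighted ML scale: sigma^kappa = sum_t w_t * |x_t - mu|^kappa."""
    ws = ws or [1.0 / len(xs)] * len(xs)
    return sum(w * abs(x - mu) ** kappa for w, x in zip(ws, xs)) ** (1.0 / kappa)

def ml_location(xs, kappa, ws=None, grid=200):
    """Weighted ML location: argmin_mu sum_t w_t * |x_t - mu|^kappa,
    found here by a simple grid search over the data range."""
    ws = ws or [1.0 / len(xs)] * len(xs)
    lo, hi = min(xs), max(xs)
    cands = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    return min(cands, key=lambda m: sum(w * abs(x - m) ** kappa
                                        for w, x in zip(ws, xs)))
```

For the sample [1, 2, 3, 4, 10] the κ=2 location lands near the mean 4, while the κ=1 location lands near the median 3, illustrating the remark above.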

To approximately estimate the shape parameter κ, we can for example use the method of moments, especially that the variance of EPD has a simple form:

Var(x) = σ² κ^{2/κ} Γ(3/κ) / Γ(1/κ)

whose σ² coefficient is strongly decreasing with κ, e.g. Var = 2σ² for the Laplace distribution, Var = σ² for the Gaussian distribution.
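The variance-to-σ² ratio and its method-of-moments inversion might be sketched as follows (the bisection search range is our assumption):

```python
import math

def epd_var_ratio(kappa):
    """Var/sigma^2 for EPD: kappa^(2/kappa) * Gamma(3/kappa) / Gamma(1/kappa).
    Equals 2 for Laplace (kappa=1) and 1 for Gaussian (kappa=2)."""
    return kappa ** (2.0 / kappa) * math.gamma(3.0 / kappa) / math.gamma(1.0 / kappa)

def kappa_from_var_ratio(ratio, lo=0.3, hi=8.0, iters=80):
    """Invert the decreasing ratio by bisection - a method-of-moments
    estimate of kappa given an observed Var/sigma^2."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if epd_var_ratio(mid) > ratio:
            lo = mid        # ratio decreases with kappa, so root is above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Given σ from the ML formula above and a sample variance, this recovers κ = 1 from ratio 2 and κ = 2 from ratio 1, matching the two special cases in the text.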

Figure 3: Probability distribution function (PDF, ρ) and cumulative distribution function (CDF, F) for the exponential power distribution (EPD) with fixed center μ and scale parameter σ, for various shape parameters κ. We get the Gaussian distribution for κ=2, the Laplace distribution for κ=1, and can also cover different types of tails and bodies of distribution, choosing κ agnostically: from evaluation on data. We could also use asymmetric EPD [5], e.g. by using a separate σ for each direction.

II-B Adaptive estimation of scale parameter σ

Let us define the moving log-likelihood for time T using only its previous points, exponentially weakening the weights of old points to try to estimate parameters describing local behavior:

l_T = (1-η) ∑_{t<T} η^{T-1-t} ln(ρ_θ(x_t))

with the (1-η) factor chosen to get (exponential moving) weights summing to 1.

Now fixing μ and κ, and optimizing σ, from (4) we get

(σ_T)^κ = (1-η) ∑_{t<T} η^{T-1-t} |x_t - μ|^κ

which is an exponential moving average (EMA) and can be evolved iteratively:

(σ_{T+1})^κ = η(σ_T)^κ + (1-η)|x_T - μ|^κ

The initial σ_1 has to be chosen arbitrarily.

We have transformed a static estimator given by an average into an adaptive estimator by just replacing the average with an exponential moving average. Observe that this can be analogously done for any estimator of the form θ = ∑_t w_t f(x_t).
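The resulting adaptive scale estimator is only a few lines; a hedged Python sketch (sigma_init corresponds to the arbitrarily chosen initial σ_1, and σ_T is reported before seeing x_T, so it depends only on earlier values):

```python
def adaptive_sigma(xs, kappa, eta, mu=0.0, sigma_init=1.0):
    """Adaptive EPD scale via EMA:
    sigma_{T+1}^kappa = eta * sigma_T^kappa + (1-eta) * |x_T - mu|^kappa.
    Returns the sigma_T used to model x_T (based only on earlier values)."""
    s_pow = sigma_init ** kappa
    sigmas = []
    for x in xs:
        sigmas.append(s_pow ** (1.0 / kappa))   # parameter predicted for this step
        s_pow = eta * s_pow + (1.0 - eta) * abs(x - mu) ** kappa
    return sigmas
```

On a constant-magnitude input the estimate converges geometrically (at rate η) from σ_1 to the true scale, as the EMA interpretation suggests.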

II-C Generalization, interpretation and choice of η

To generalize the above, assume that the estimation of parameter θ is analogously given by an average (∑_t w_t = 1):

θ = ∑_t w_t f(x_t)

For the above σ^κ parameter of EPD with μ, κ fixed, we would have θ = σ^κ, f(x) = |x - μ|^κ.

Generally we analogously have adaptation for θ:

θ_{T+1} = η θ_T + (1-η) f(x_T)
Denoting λ = 1-η, we can write it as:

θ_{T+1} = θ_T + λ (f(x_T) - θ_T)

allowing us to imagine the evolution of θ_T as a random walk with step from the f(x_T) - θ_T random variable, which can evolve in time here. Generally λ is proportional to the speed of this random walk.

This interpretation could be used to optimize the choice of η, also its potential evolution: for example by calculating the (e.g. exponentially moving averaged) square root of variance of the θ_T sequence, and evaluating the square root of variance of the f(x_T) - θ_T random variable - dividing them we get an estimation of the λ parameter.

We leave its details for future work, as the improvement from optimizing η over a fixed value was practically negligible for the analyzed daily log-return data. Fig. 2 presents this difference: the green plot for fixed η, and the blue for individually optimized η.

II-D Approximated adaptive estimation of μ, κ

While in practice the most important is the adaptation of the scale parameter σ, it might also be worth considering adaptation of the remaining parameters. Their estimation rather does not have analytical formulas already in the static case, hence for the adaptive case we should look for practical approximations, preferably also based on EMA for inexpensive updating.

Location μ is the mean of such a parametric distribution; just using the mean of datapoints is optimal in the Gaussian (κ=2) case and can be easily transformed to adaptive estimation. Hence a natural approximation of such estimation is the analogous EMA:

μ_{T+1} = ν μ_T + (1-ν) x_T

for some chosen rate ν, not necessarily equal to η.
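Combining both EMAs gives a joint adaptive (μ, σ) estimator; a sketch mirroring the appendix code (function and variable names are ours; each step reports the parameters predicted before seeing x_T):

```python
def adaptive_mu_sigma(xs, kappa, eta, nu, mu_init=0.0, sigma_init=1.0):
    """Joint EMA adaptation: mu_{T+1} = nu*mu_T + (1-nu)*x_T, and sigma
    updated as an EMA of |x_T - mu_T|^kappa (rates nu, eta may differ)."""
    mu, s_pow = mu_init, sigma_init ** kappa
    out = []
    for x in xs:
        out.append((mu, s_pow ** (1.0 / kappa)))   # parameters used to model x_T
        s_pow = eta * s_pow + (1.0 - eta) * abs(x - mu) ** kappa
        mu = nu * mu + (1.0 - nu) * x
    return out
```

On a constant input both parameters behave as expected: μ converges to the input value and σ decays toward zero.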

Adaptive estimation of κ seems more difficult. One approximation is using the method of moments, e.g. with the variance formula (6), especially that we can naturally get adaptive estimation of moments with EMA - for example using an additional analogous EMA for the estimated recent mean.

Another general approach is gradient methods, adapting the chosen parameter(s) to increase their contribution to the log-likelihood, for example for some tiny ε: κ_{T+1} = κ_T + ε ∂ln(ρ_{θ_T}(x_T))/∂κ.
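One possible reading of such a gradient step, with the derivative taken numerically for simplicity (our hedged sketch, not the paper's exact procedure):

```python
import math

def log_rho(x, kappa, mu, sigma):
    """Log-density of EPD (Section II)."""
    c = kappa ** (-1.0 / kappa) / (2.0 * math.gamma(1.0 + 1.0 / kappa))
    return math.log(c / sigma) - abs((x - mu) / sigma) ** kappa / kappa

def gradient_step_kappa(kappa, x, mu, sigma, eps=1e-4, h=1e-5):
    """One stochastic-gradient step kappa += eps * d(ln rho)/dkappa,
    with the derivative approximated by a central finite difference."""
    grad = (log_rho(x, kappa + h, mu, sigma)
            - log_rho(x, kappa - h, mu, sigma)) / (2.0 * h)
    return kappa + eps * grad
```

For tiny ε each step weakly increases the current point's contribution to the log-likelihood, which is the stated goal of the gradient approach.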

III DJIA log-returns tests

We will now look at evaluation of these methodologies from the perspective of 100 years of daily Dow Jones index values (v_t), working on the log-returns sequence x_t = ln(v_{t+1}/v_t), summarized in Fig. 1.

As evaluation there is used the mean log-likelihood (1/n) ∑_t ln(ρ_{θ_t}(x_t)), where in the static setting ρ has constant parameters chosen by MLE, while in the adaptive setting these parameters evolve in time: θ_t are estimated based on the previous values.
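Putting the pieces together, here is a toy comparison of static vs adaptive mean log-likelihood (the synthetic regime-switching data and the η value are our assumptions, chosen only to illustrate why adaptivity improves this evaluation):

```python
import math, random

def epd_log_density(x, kappa, mu, sigma):
    """ln rho for EPD (Section II)."""
    c = kappa ** (-1.0 / kappa) / (2.0 * math.gamma(1.0 + 1.0 / kappa))
    return math.log(c / sigma) - abs((x - mu) / sigma) ** kappa / kappa

def mean_loglik_static(xs, kappa, mu=0.0):
    """One global MLE sigma for the whole sequence."""
    sigma = (sum(abs(x - mu) ** kappa for x in xs) / len(xs)) ** (1.0 / kappa)
    return sum(epd_log_density(x, kappa, mu, sigma) for x in xs) / len(xs)

def mean_loglik_adaptive(xs, kappa, eta, mu=0.0, sigma_init=1.0):
    """Sigma evolved by EMA; each x_T is scored with the pre-update sigma_T."""
    s_pow, total = sigma_init ** kappa, 0.0
    for x in xs:
        total += epd_log_density(x, kappa, mu, s_pow ** (1.0 / kappa))
        s_pow = eta * s_pow + (1.0 - eta) * abs(x - mu) ** kappa
    return total / len(xs)

# Synthetic nonstationary data: Gaussian noise whose scale jumps 0.5 -> 2.
random.seed(0)
xs = [random.gauss(0, 0.5) for _ in range(2000)] + \
     [random.gauss(0, 2.0) for _ in range(2000)]
```

On such data the static model must compromise between the two regimes, while the adaptive model tracks each, yielding a clearly higher mean log-likelihood - the same qualitative effect Fig. 1 shows for DJIA.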

Figure 4: Top: log-values of DJIA and the corresponding dates. Bottom: obtained evolution of the σ and μ parameters for DJIA log-returns adaptive models (EMA). Maxima of the former correspond to locally increased variance; maxima/minima of the latter correspond to periods of ascent/descent.

In the adaptive settings the EMA rates were chosen arbitrarily, and the remaining parameters from numerical search. Here are the obtained parameters and mean log-likelihoods for the various settings:

  • static Gaussian (κ=2) distribution has MLE mean μ and scale σ, giving mean log-likelihood,

  • static Laplace (κ=1) distribution has MLE median μ and scale σ, giving mean log-likelihood,

  • static EPD has MLE κ, μ, σ, giving mean log-likelihood,

  • Gaussian (κ=2) with fixed μ and adaptive σ gives mean log-likelihood,

  • Laplace (κ=1) with fixed μ and adaptive σ gives mean log-likelihood,

  • optimal κ with fixed μ and adaptive σ gives mean log-likelihood,

  • Gaussian (κ=2) with adaptive μ and σ gives mean log-likelihood,

  • Laplace (κ=1) with adaptive μ and σ gives mean log-likelihood,

  • EPD with adaptive μ and σ gives mean log-likelihood.

We can see that the standard assumption of a static Gaussian can be essentially improved both by going to κ closer to the Laplace distribution, and by switching to adaptive estimation of the scale parameter σ. Additional adaptive estimation of the location μ provides some tiny further improvement here. The finally used evolution of σ and μ is presented in Fig. 4.

There were also trials of adapting κ, but they were not able to provide a noticeable improvement here.

Figure 2 additionally contains such evaluation of log-returns for 29 out of 30 companies used for this index in September 2018. Daily prices for the last 10 years were downloaded from the NASDAQ webpage for all but DowDuPont (DWDP) - there were used daily close values for the 2008-08-14 to 2018-08-14 period for the remaining 29 companies: 3M (MMM), American Express (AXP), Apple (AAPL), Boeing (BA), Caterpillar (CAT), Chevron (CVX), Cisco Systems (CSCO), Coca-Cola (KO), ExxonMobil (XOM), Goldman Sachs (GS), The Home Depot (HD), IBM (IBM), Intel (INTC), Johnson & Johnson (JNJ), JPMorgan Chase (JPM), McDonald's (MCD), Merck & Company (MRK), Microsoft (MSFT), Nike (NKE), Pfizer (PFE), Procter & Gamble (PG), Travelers (TRV), UnitedHealth Group (UNH), United Technologies (UTX), Verizon (VZ), Visa (V), Walmart (WMT), Walgreens Boots Alliance (WBA) and Walt Disney (DIS).

III-A Further improvements with Hierarchical Correlation Reconstruction

The estimated parametric distributions of variables in separate times often leave statistical dependencies which can be further exploited.

For this purpose, we can use the best found model - here EPD with adaptive μ and σ for the DJIA sequence - and use it for normalization of variables to nearly uniform distributions by going through the cumulative distribution functions (CDF) of the estimated parametric distributions: transform x_t to

u_t = CDF_{θ_t}(x_t)
We can then take e.g. neighboring values of the u_t sequence, which should be approximately from the uniform distribution on [0,1]. In Hierarchical Correlation Reconstruction [6] we estimate the distortion from this uniform distribution as a polynomial, with coefficients modelled statically or adaptively.
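The normalization step might be sketched as follows for the κ=2 (Gaussian) special case, where the CDF is available via erf (our simplification; the paper applies the full adaptive EPD CDF):

```python
import math, random

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normalize_with_adaptive_cdf(xs, eta, sigma_init=1.0):
    """Transform x_t -> u_t = CDF_{theta_t}(x_t), using the adaptively
    estimated scale (kappa=2 case, mu fixed at 0)."""
    s2, us = sigma_init ** 2, []
    for x in xs:
        us.append(normal_cdf(x / math.sqrt(s2)))   # score with pre-update sigma
        s2 = eta * s2 + (1.0 - eta) * x * x
    return us

random.seed(1)
xs = [random.gauss(0, 1.0) for _ in range(5000)]
us = normalize_with_adaptive_cdf(xs, 0.99)
```

If the adaptive model fits, the u_t should look approximately uniform on [0,1]; any remaining structure (e.g. between neighboring u_t) is what HCR then models.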

The obtained mean log-likelihood improvement in 10-fold cross-validation for single variables was for the static model, for the adaptive - using a polynomial model to improve the original EPD model.

For modelling the joint distribution of two neighboring variables - using the previous value to predict the conditional distribution - the improvement was for the static model, for the adaptive. Analogously, for three neighboring variables the improvement was for the static model, for the adaptive.

IV Conclusions and further work

While applied time series analysis often uses the static Gaussian distribution, it was shown that a simple inexpensive generalization - to an adaptively estimated distribution, and to the more general exponential power distribution - can essentially improve the standard evaluation: log-likelihood.

This article is focused only on a basic technique; in practice it can be further improved by combining with complementing methods, like the additional modelling of the joint distribution for variables normalized with CDFs of the distributions discussed here, mentioned in Section III-A.

While the discussed adaptive estimation of the scale parameter σ is MLE-optimal, the remaining parameters rather require some approximations - better approaches are worth further exploration.

Another open question is finding better ways of choosing the η rate of the exponential moving average, possibly varying in time to include a changing rate of forgetting, e.g. due to varying time differences between datapoints.

In contrast to e.g. Levy/stable distributions, the discussed EPD does not cover heavy tails (1/polynomial density) or asymmetry - it is worth searching for practical transformations of other types of parametric distributions into adaptive estimation as well.

This appendix contains the Wolfram Mathematica source for the used evaluation of the adaptive exponential power distribution (vectorized for performance). The Prepend inserts initial values at the beginning, then [[1;;-2]] removes the last value, so the used density parameters are modeled based only on history:

(* xt: sequence of values, kap: fixed kappa *)
(* eta, nu: EMA update weights for new points - note that Mathematica's *)
(* ExponentialMovingAverage weights the NEW point by its second argument, *)
(* so these correspond to 1-η, 1-ν in the text's notation *)
(* mu1, sigma1: initial mu, sigma *)
cons = kap^(-1/kap)/2/Gamma[1 + 1/kap];
mu = ExponentialMovingAverage[
  Prepend[xt, mu1], nu][[1 ;; -2]];
sigma = ExponentialMovingAverage[
  Prepend[Abs[xt - mu]^kap, sigma1^kap],
  eta][[1 ;; -2]]^(1/kap);
rho = cons/sigma Exp[-Abs[(xt - mu)/sigma]^kap/kap];  (* EPD density at each point *)
Mean[Log[rho]]     (* mean log-likelihood *)