volatility-change-points
Dynamic time series clustering via volatility change-points
view repo
This note outlines a method for clustering time series based on a statistical model in which volatility shifts at unobserved change-points. The model accommodates some classical stylized features of returns and its relation to GARCH is discussed. Clustering is performed using a probability metric evaluated between posterior distributions of the most recent change-point associated with each series. This implies series are grouped together at a given time if there is evidence the most recent shifts in their respective volatilities were coincident or closely timed. The clustering method is dynamic, in that groupings may be updated in an online manner as data arrive. Numerical results are given analyzing daily returns of constituents of the S&P 500.
READ FULL TEXT VIEW PDF
We develop a new method to find the number of volatility regimes in a
no...
read it
Change point detection in time series has attracted substantial interest...
read it
This paper presents the construction of a particle filter, which incorpo...
read it
Contagion arising from clustering of multiple time series like those in ...
read it
Given a heterogeneous time-series sample, the objective is to find point...
read it
In this study, we analyze behaviours of the well-known CMA-ES by extract...
read it
The dynamic mode decomposition (DMD) extracted dynamic modes are the
non...
read it
Dynamic time series clustering via volatility change-points
The purpose of this note is to outline, contextualize and demonstrate a method for clustering time series using a change-point model. The emphasis is on conveying the ideas of the method rather than depth or generality, although some possible extensions and research directions are given at the end in section 4. A Python implementation in the form of a Jupyter notebook is available: https://github.com/nckwhiteley/volatility-change-points.
Time series clustering is typically a two-step procedure. The first step is to specify a pairwise measure of dissimilarity between series. An overview of several popular approaches is given in (Montero et al., 2014, Sec. 2). To mention just a few examples, this dissimilarity could be derived from fairly simple statistics, such as cross-correlation; could involve solving an optimization problem to find a ‘best’ match between each pair of series, for instance using the Fréchet distance or Dynamic Time Warping (Berndt and Clifford, 1994); or could involve fitting a some form of model to each of the series, then computing a distance between the fitted parameter values (Corduas and Piccolo, 2008; Otranto, 2008) or forecast distributions (Alonso et al., 2006; Vilar et al., 2010).
The second step is to pass the dissimilarity measure to an algorithm which determines associations between the series. Again to mention just a few popular techniques, hierarchical methods such as agglomerative clustering, see for example (Murphy, 2012, Sec. 25.5) for an overview, form clusters sequentially. Each datum starts in its own cluster and pairs of clusters are merged step-by-step in accordance with some linkage criterion which quantifies how between-cluster dissimilarity is derived from between-series dissimilarity. Centroid-based techniques such as -means (MacQueen, 1967) or its generalizations beyond Euclidean distance to, e.g., Bregman divergences (Banerjee et al., 2005) or Wasserstein distances (Ye et al., 2017) choose a collection of cluster centers to minimize the sum of within-cluster divergences/distances. The computational cost of global minimization is usually prohibitive and so for implementation one settles for a local minimum obtained using an iterative refinement method, such as Lloyd’s algorithm (Lloyd, 1982) in the case of Euclidean distance.
A further level of sophistication is to approach clustering as a statistical inference problem, with associations between data points and clusters treated as latent variables to be inferred under a probabilistic model. This allows uncertainty over clusterings, model parameters and model structure to be quantified and reported in a principled manner. The price to pay is usually an increased computational cost, for example incurred through the EM algorithm, variational methods or Monte Carlo sampling. The question of how to scale-up these methods to tackle large data sets is an active topic of research. For a recent overview and ideas involving parallelization and multi-step procedures, see (Ni et al., 2018).
We propose a method which may be regarded as a half-way point between a full-blown statistical treatment of time series clustering and the simple two-step recipe described above. We do not perform probabilistic modeling of associations, but we do perform probabilistic modeling on a per-series basis and use it to define a notion of dissimilarity.
Clustering of time series can serve a variety of purposes. In exploratory data analysis one may simply want to discover groupings or unexpected phenomena, and then summarize or report them for purposes of dimension reduction or interpretation. Clustering may be one ingredient within a broader statistical workflow, in which actions or decisions are taken on the basis of discovered clusters.
Stemming from an influential paper of Mantegna (1999)
, clustering of financial time series using dissimilarity measures derived from correlation has been applied to assist fundamental understanding of markets, risk management, portfolio optimization and trading. A comprehensive overview of research on this topic across machine learning, econophysics, statistical physics, econometrics and behavioral finance is maintained on arXiv by
Marti et al. (2017). The current version includes a bibliography of over 400 references which we shall not attempt to summarize.An alternative approach to time series clustering, which does not feature in (Marti et al., 2017)
, is to define dissimilarity by some distance between parameter vectors obtained by fitting a model to each of the series individually.
Otranto (2008) gives a detailed account of dissimilarity measures in this vein, and uses Wald tests and autoregressive metrics to measure the distance between GARCH processes and thus cluster based on the heteroskedastic characteristics. Otranto (2010)extends this technique to clustering based on distance between fitted Dynamic Conditional Correlation models, and deploys the resulting covariance matrix estimates within portfolio optimization.
As discussed by Marti et al. (2016) and Marti et al. (2017, Sec 4), a research topic still in its infancy but of considerable interest is how to track changes in market structure, by recognizing clusters which may change over time. Indeed Marti et al. (2017, Sec 4) report that many empirical studies do not achieve this but just deliver a static clustering based all data available for a given time period. An obvious step towards dynamic clustering is to apply a static clustering method on a sliding window. If, for example, the clustering techniques of Otranto (2008, 2010) were applied in this manner, the length of the window would achieve a trade-off between temporal locality and noisy parameter estimates, hence noisy estimates of dissimilarity. The question of how long the window should be in order to best deal with time-varying clusters is often not an easy one to answer rigorously.
The method introduced below defines dissimilarity between time series not in terms of correlation or parameter estimates, but rather in terms of evidence about times of volatility change-points. As a consequence, time series which evidence shifts in volatility around the same points in times tend to be clustered together. This is of interest because synchrony in volatility change-points across series may arise from common underlying market factors or similar responses to changing market conditions which the method may help to uncover. One appealing feature of the method is that it naturally accommodates dynamic clustering, in the sense that clusters can be re-evaluated at each point in time as new data arrive, but it circumvents the need to work on a sliding window: the underlying change-point model effectively adapts to the time-scale of volatility changes of each series.
Consider a sequence of unobserved, integer valued and strictly increasing change-points . is equal to zero with probability one, and the increments are i.i.d. with c.d.f. denoted by .
For , define , and observe that
is a Markov chain, with transition probabilities:
(1) |
corresponding to whether a new change-point has occurred or not.
Let
be observed returns which are assumed to be jointly distributed with
such that for each ,(2) |
with the convention , the Kronecker delta at , to respect the fact that is zero with probability .
Consider the sequence of probability mass functions ,
(3) |
Again due to the fact that is zero with probability , we have . Combining the conditional independence structure of (2) with (1), elementary marginalization and Bayes’ rule validate the following recursion, for ,
(4) |
This change-point model and the recursion (4) are directly inspired by those of Chopin (2007), Fearnhead and Liu (2007) and Adams and MacKay (2007). Our model is slightly more general than those of Fearnhead and Liu (2007) and Adams and MacKay (2007), who assumed that conditional on a change-point time, observations after that time are independent of observations before. Note (2) does not imply such independence.
In section 2.4 we describe an instance of the above change-point model in which the terms
arise by analytically integrating out parameters associated with the change-points under conjugate prior distributions. This makes our setting more restrictive than that of
Chopin (2007), who did not assume such analytic integration possible, but instead used numerical integration in the form of a sequential Monte Carlo algorithm.Suppose now that one is presented with series of returns . Let be as in (3) with there replaced by . We propose to cluster the series at any given time with dissimilarity taken to be a probability metric evaluated between the distributions . Thus series will be clustered at time if, through , they exhibit similar evidence about the times of their respective most recent change-points.
Which probability metric to choose? It seems sensible to consider: i) interpretation and ii) computational overhead. To address the first of these two criteria, consider the total variation distance:
The total variation distance is maximal and equal to as soon as and have disjoint support, which is a rather restrictive notion of dissimilarity. For instance, for two Kronecker delta’s and ,
(5) |
For describing distributions over change-points this insensitivity to translation seems undesirable - a more appealing property might be that the distance is strictly increasing in . Alternatives such as the Hellinger and distances involve similarly summing of point-wise differences between probability mass functions, or functions thereof, and hence have the same drawback. Divergences which involve ratios of probability mass functions such as or Kullback-Liebler similarly fail to express dissimilarity if the support of one mass function is not contained within that of the other.
The total variation distance can be regarded as one instance of a Wasserstein distance: given a distance on and , the ’th Wasserstein distance associated with is:
(6) |
where is the set of all probability mass functions on whose marginals are and . The total variation distance arises if one takes to be the discrete distance and .
If instead is the usual distance on , , we have
(7) |
and slightly more generally it can be shown by a direct computation of the infimum in (6) that:
(8) |
see (Bobkov and Ledoux, 2016, Ex 2.3). Whilst (7) and (8) are of course rather specific examples, they illustrate the manner in which the Wasserstein distance associated with is more expressive than total variation distance regarding translation, and therefore arguably more suited to our purposes of comparing distributions over change-point times.
Turning to the criterion of computational overhead, in the case of on , the Wasserstein distance is conveniently available in closed form:
where is the generalized inverse c.d.f. of . Even more conveniently from a computational point of view, is the fact that:
(9) |
see (Bobkov and Ledoux, 2016) for background.
With these considerations we shall settle on (9) applied to each pair as our dissimilarity measure at time . Note that the support of any is always contained in . Moreover, if the approximation technique suggested in section 2.3 is applied, each approximating distribution (more details later) will have a number of support points uniformly upper bounded in t, and hence the cost of evaluating is uniformly upper bounded in
Choosing the dissimilarity measure completes the first of the two steps described in section 1
. What options are available for the second step? Hierarchical clustering can be performed immediately after evaluating the pairwise distances and we shall illustrate this approach through numerical experiments. For a centroid-based approach in the style of
-means, one needs to introduce the notion of Wasserstein barycentre, which is the Fréchet mean in the space of probability distributions equipped with the Wasserstein distance. Computing these barycentres is a non-trivial task in general, see
(Peyré and Cuturi, 2019) for numerical methods.The number of terms in the summation in (4) clearly increases linearly with . Hence the cost of computing this recursion from time zero up to some time is quadratic in . A simple route towards an online algorithm, i.e. one whose computational cost per time step does not increase with time, is to introduce an approximation to each with a number of support points uniformly upper bounded in .
For instance, consider a simple pruning strategy: fix a number of support points . For , computer exactly using the recursion (4). For assume one already has an approximation to , call it which has support points in . Then one can substitute for in the right hand side of (4), and retain the of the resulting support points associated with highest probabilities to give an approximation to .
A further consideration for online implementation of (4) is the cost of evaluating , does this also increase with ? In the instance of the change-point model described in section 2.4, we shall show that depends on through statistics which can be updated online as data arrive, and hence it is possible to evaluate the terms sequentially in at a fixed cost per time step.
Let be distributed as in section 2.1. We now introduce a specific model for the returns which, upon analytically marginalizing out certain parameters, will satisfy (2.1) with a closed-form expression for . In turn, this can be plugged into the recursion (4) or its approximation discussed in section 2.3, in order to evaluate the dissimilarity measure.
Consider a sequence of triples of parameters and assume that:
(10) |
where are i.i.d. . Thus parameterize the conditional joint distribution of the data between the ’th and ’th change-points, i.e. , given .
We assume the following prior independence properties: the sequence and the sequence are independent, and the triples are independent across . It can be shown that as a consequence of these independences and (10),
(11) | ||||
(12) |
The intuitive interpretation of these identities is that conditional on the time of the most-recent change-point being , data strictly prior to are irrelevant to: predicting the next data point, as per (11), and inference for , i.e., the parameters associated with the most recent change-point, as per (12).
To arrive at a closed-form expression for we set a zero-mean Normal-Inverse-Gamma prior distribution on each parameter triple:
(13) |
where , and are hyper-parameters which are common across .
The following proposition gives the expression for as desired, and marginal posterior densities for the parameters and conditional on the time of the most recent change-point.
(14) | |||
(15) | |||
(16) |
where
(17) | |||
(18) | |||
(19) | |||
(20) |
, , and .
Note from (10),
The expressions in (14)-(16) can therefore be obtained by conditioning on or
, and then applying standard results for Bayesian linear regression under a Normal-Inverse-Gamma prior, see for example
(Murphy, 2012, Sec 7.6.3). ∎Further to the considerations in section 2.3 it is important to notice that , , , can be calculated in a recursive manner, so that the cost of evaluating each term of the form , , etc. does not increase with . The following lemma gives the details.
For fixed ,
and for ,
The expression for follows from
and the Sherman-Morrison formula. The other expressions are quite straight forward. ∎
It is widely recognized that returns data at daily or higher frequencies often exhibit certain stylized features:
long-run mean or median close to zero, and heavy tails;
long-run auto-correlation of returns which is small or decays quickly with lag-length, but auto-correlation of absolute or squared returns which decays slowly;
time-dependent volatility.
To explain the interpretation of the change-point model in this context, consider GARCH:
(21) | ||||
(22) |
where
is a white noise process. This is perhaps the most widely used time series model which accommodates the stylized features described above:
in the original presentation of Bollerslev (1986), were taken as i.i.d. standard Gaussian, so that the marginal distribution of under (21) is a scale-mixture of zero-mean Gaussians. To further account for heavy-tails, Bollerslev (1987) suggested instead a -distribution centered at zero for , with unit scale parameter;
due to the independence of the and the centering of their common distribution at zero, it is easily seen that the autocorrelation of (assuming it exists) is zero. The sequence of squared returns from GARCH is an ARMA process (Andersen et al., 2009, Thm 7, p.61) and hence may exhibit non-trivial autocorrelation;
These properties manifest themselves in the predictive distributions associated with GARCH; if indeed are unit scale and zero-centered student’s- variables with degrees of freedom, then:
(23) |
where by writing out (22),
(24) |
Let us now explain the connection to (14). For purposes of exposition, suppose that the parameters are omitted from the change-point model, in the sense that (10) is simplified to:
(25) |
and suppose the prior on each parameter pair ) is just the marginal prior of these two parameters under (13).
Before giving the proof let us compare the predictive densities (26) and (23).
Consider the number of parameters in (24) and in (27)-(28). The former involves and . The latter involves , but these parameters can effectively be removed by considering the uninformative prior limits , , under which remains well-defined as a probability density assuming for some . By contrast, there appears not to be a prior distribution under which one can analytically integrate out in GARCH, so one must estimate these parameters or integrate them out numerically, which would complicate the fitting of a change-point model.
Concerning the stylized features of returns described above, the median of in (23) is clearly zero. If exhibits little lag-one auto-correlation, in the sense that , then the median of (26) is approximately zero also. However, if this auto-correlation is non-zero, this will be captured in (26), both in terms of the centering at and through . Thus the change-point model accommodates but does no insist upon stylized feature 1) and zero autocorrelations of returns in stylized feature 2); the model is flexible enough to explain away variations in data which cannot be well modelled in terms of dynamic volatility, such as short–lived trends and brief periods of correlated returns. The squared scale parameter in (24) is an exponentially-weighted average of the previous squared returns . This is what allows GARCH(1,1) to capture the auto-correlation of squared returns as per stylized feature 2). The predictive distribution in (26) achieves this in a slightly different manner: involves a uniformly-weighted average of the squared returns, , where is the time of the most recent change-point appearing in the conditioning in (26). Thus the change-point model can represent memory in the process of squared returns whilst avoiding the need for the parameter in GARCH(1,1). Finally, Regarding stylized feature 3), obviously the change-point model accommodates changing volatility from one change-point to the next.
The degrees of freedom in (23) is constant at ; in (26) the degrees of freedom is , hence increasing as the time since the most recent change-point, , grows. As per (25), the change-point model assumes volatility is constant between change-points and this increase in the degrees of freedom reflects accumulation of data since the most recent change-point, assuming it is known or we are conditioning upon it. Integrating out the time of the most recent change-point results in the following identities:
Thus for the change-point model, the predictive density is a mixture of densities of the form (26), i.e. of student’s- distributions with varying degrees of freedom, centering and scale parameters, where the mixing distribution is derived from the posterior change-point distributions . The parameter posteriors and , i.e. also with the time of the most recent change-point integrated out, have similar mixture representations, the details are left to the reader.
Re-introducing the parameter in (10) allows non-zero median returns to be modelled, which may be desirable over short periods or to accommodate short-lived market trends, but is not accommodated in (21)-(21). Thus again the change-point model is flexible: if the data indicate the median/mean is zero, as per stylized feature 1), or not, then this will be reflected in the predictive distribution (23).
In summary, the model described in section 2.4 has the convenient property that the parameters can be integrated out analytically, thus allowing it to interface with the generic change-point model and inference recursion in section 2.1. Its predictive distributions are closely related to those of GARCH(1,1) and it accommodates the standard stylized features of returns, but is flexible enough to also model short-lived auto-correlations and trends.
All numerical experiments were based on a data set of daily prices for stocks which were constituents of the S&P500 index continuously from 1998 to mid 2013. The data set was taken from https://quantquote.com/historical-stock-data. According to source these data are split/dividend adjusted. All returns referred to below are daily closing log returns, i.e. .
When applying the change-point model from section 2.1, the prior on each of the inter-change-point times, e.g.,
, was taken to be a geometric distribution shifted so its support is
rather than . The parameter of the geometric distribution was set to . The hyper-parameters in the prior distribution (13) were taken to be , corresponding to a fairly uninformative prior over ’s; and and , corresponding respectively to an uninformative prior over the ’s and a prior over the ’s which places substantial mass on . The approximation method described in section 2.3 was implemented with the number of support points taken to be 100.The objective of this section is to illustrate the output from the change-point model applied to a single time series.
The top plot in figure 1 shows the returns for AMZN. The second plot shows the number of trading days since the maximum-a-posterior (MAP) most recent change-point. To be precise, let be time since the start of the data set on 1/1/1998 in units of trading days and let . Then the plot shows against the calendar date corresponding to .
The third and fourth plots show means and 95% credible regions for and , i.e. the two marginals of (15) with plugged in. The interpretation of these distributions are that they are the posterior distributions of the parameters associated with the MAP most recent change-point. The bottom plot in figure 1 is constructed by finding the mode and 95% credible region of , i.e. (16) with plugged in, and then mapping through , to give the corresponding point estimate and credible region for .
To illustrate inference about change-point times beyond the simple point estimate , figure 2 shows a snapshot of the returns from April 2007 until July 2009 and the change-point distributions for corresponding to 28/09/2008, 23/02/2009, 05/05/2009, 16/07/2009. On 28/09/2008. i.e. just before the market crash, the change-point distribution (second plot from top) shows a small amount of evidence for a recent change, but most probability mass is associated with 24/07/2007 when the stock price surged after better-than-expected Q2 results were announced. The third plot down, showing the change-point distribution for corresponding to 23/02/2009 puts most of its mass around the September 2008 market crash. In the fourth plot, corresponding to 05/05/2009, the multiple modes in the distribution can be interpreted as competing hypotheses about the most recent change point: the September 2008 market crash is amongst them, followed by crises in December 2008 and January-March 2009. The bottom plot picks up the change to a period of lower volatility around the end of March 2009.
The top plot in figure 3
shows the returns along with the 95% credible region for each of the one-step-ahead posterior predictive distributions
, i.e. (3) with plugged in. The bottom plot shows these predictive credible regions pushed forward to the price.Figure 4 shows the dissimilarity matrix of Wassertein distances as in (9) across the first 80 S&P 500 constituents by alphabetical order for corresponding to 16/07/2009. The reason for considering only 80 constituents is to keep the following visual results simple and easy to read. The date 16/07/2009 was chosen for purposes of illustration as it post-dates the global financial crisis and the onset of the subsequent recovery.
Whilst the dissimilarity matrix seems to show rich structure, it is not easy to directly interpret. This is where hierarchical clustering comes in: figure 3.3
shows the result of agglomerative clustering with the average linkage method, implemented in Python using the Seaborn statistical data visualization library, see
https://seaborn.pydata.org and https://SciPy.org for details of the underling linkage method.This clustering method proceeds by initializing each stock in a separate cluster, then sequentially combining nearby clusters and re-calculating between-cluster distances. The output is a dendrogram, shown on the right of figure 3.3, and a re-ordering of the rows/columns of the dissimilarity matrix to respect the structure of the dendrogram.
Once clusters of stocks are identified from this dendrogram, one may then interrogate their respective change-point distributions. To illustrate the idea, three clusters are highlighted in figure 3.3. The first cluster consists of:
AIV, Apartment Investment and Management, a real estate investment trust;
AFL, AFLAC Incorporated, an insurance company;
AVB, AvalonBay Communities Real estate, an investment trust;
AON, Aon, an insurance broker, risk, retirement and health services consulting company;
BK, Bank of New York;
BXP, Boston Properties, a real estate investment trust ;
ALL, Allstate, an insurance company;
BAC, Bank of America.
There is a clear theme of financial, real-estate and insurance sectors to this cluster. Let us examine their change-point distributions for corresponding to 16/07/2009: they are shown in figure 5 and the common feature is probability mass around May 2009. It was at this time that some stocks badly effected by the crisis in 2008 and early 2009 showed signs of recovery. Indeed inspecting the estimates of for each of these stocks (not shown) reveals there was a discrete in each of their volatilities around May 2009.
The second cluster is less sector-specific, consisting of:
APA, Apache Corporation, a hydrocarbon exploration company;
AZO, AutoZone, an automotive parts retailer;
CHK, Chesapeake Energy, a hydrocarbon exploration company;
AAPL, Apple;
AGN, Allergan, a pharmaceutical company;
CL, Colgate-Palmolive, a consumer products company.
Inspecting figure 3.3, it is clear that this cluster is one component of a larger cluster of 30+ stocks which have similar change-point distributions. Figure 6 indicates the feature they have in common is evidence of a volatility change, or several volatility changes, in December 2008, but little evidence of a change-point between then and June 2009. Broadly speaking, these stocks were hit by the crisis in around October 2008, but their volatility subsequently decreased sooner than the stocks in the first cluster, around the end of 2008.
The third cluster is smaller, consisting of the three stocks:
ABT, Abbott Laboratories, a healthcare company;
AXP, American Express;
CAT, Caterpillar, a construction equipment manufacturer.
Figure 7 reveals that the feature these three stocks have in common is evidence of a change-point around September 2008, about the time of the market crash, but little evidence of a change-point between then and June 2009. In fact inspecting the estimated volatility parameters for these stocks (not shown) shows that all three of these stocks remained in a state of relatively high volatility from September 2008 until around July 2009.
[H]
There are a number of avenues open for further investigation.
In terms of the modelling of individual time series, there are number of ways the model from section 2.4 could be extended. As it stands, it doesn’t explicitly model leverage effects - that increases in volatility tend to be larger when recent returns have been negative. A number of variants of the basic GARCH model, such as Threshold-GARCH and Exponential-GARCH do model leverage effects, but involve parameters for which conjugate priors are available. It might be useful to find a half way point between such models and that of section 2.4, to achieve more accurate modelling, whilst retaining the analytic tractability which allows parameters to be integrated out.
It could be desirable to develop a more principled approach to calibrating the hyper-parameters ,
, and the parameters of the prior on the inter-change-point times. This could be approached, for example, as a maximum likelihood or Bayesian inference problem. Particle Markov Chain Monte Carlo methods for the latter are given in
(Whiteley et al., 2009).It could also be interesting to explore alternative probability metrics and alternative clustering methods, for instance -means using Wasserstein Barycenters (Ye et al., 2017).
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence
, pages 2583–2589. AAAI Press, 2016.
Comments
There are no comments yet.