Modeling joint probability distribution of yield curve parameters

07/31/2018 ∙ by Jarek Duda, et al. ∙ Cracow University of Economics

The US yield curve has recently flattened to its lowest level since the subprime crisis and is close to inversion. This fact has gathered the attention of investors around the world and revived the discussion of proper modeling and forecasting of the yield curve, since changes in the interest rate structure are believed to represent investors' expectations about the future state of the economy and have foreshadowed recessions in the United States. While changes in the term structure of interest rates are relatively easy to interpret, they are very difficult to model and forecast, as no proper economic theory underlies such events. Yield curves are usually represented by multivariate sparse time series: at any point in time the infinite-dimensional curve is portrayed via relatively few points in a multivariate data space, and as a consequence the multimodal statistical dependencies behind these curves are relatively hard to extract and forecast via typical multivariate statistical methods. We propose to model yield curves via reconstruction of the joint probability distribution of parameters in a functional space as a high-degree polynomial. Thanks to the adoption of an orthonormal basis, the MSE estimation of the coefficients of a given function is an average over the data sample in the space of functions. Since such polynomial coefficients are independent and have a cumulant-like interpretation, i.e. describe the corresponding perturbation from a uniform joint distribution, our approach can also be extended to any d-dimensional space of yield curve parameters (also in neighboring times) with controllable accuracy. We believe that this approach to modeling the local behavior of a sparse multivariate curved time series can complement predictions from standard models like ARIMA, which use long-range dependencies but provide only an inaccurate prediction of the probability distribution, often just a Gaussian of constant width.




I Introduction

As successful forecasting of financial time series can often be turned into profit, the market makes it difficult to obtain a better prediction for the next value than just the previous value. However, this self-regulatory mechanism of the market does not restrict prediction of the probability distribution of values, which is crucial for example for risk evaluation or Monte Carlo simulations.

Figure 1: Each variable is normalized to have nearly uniform density (PDF) on its range. Top: sorted predicted densities for the actually observed values, which are usually much higher than the base level thanks to exploiting the joint distribution. The four graphs correspond to the joint distribution of parameters reconstructed with a degree 9 polynomial for all three variables, or for all their pairs. Surprisingly, one of the pairs gives even better predictions than all 3 variables here. Bottom: a region of predicted density for all 3 variables; from the top plot we can read the fraction of cases in which the observed values fell inside it. In contrast to the usually assumed Gaussian, the distribution obtained from the real data turns out multimodal here. Density focused near the diagonal for a pair of variables means they are anti-correlated.

Standard approaches to predicting the probability distribution of values, like ARIMA, usually model this distribution as a Gaussian, often of constant width: they predict some value and its inaccuracy (standard deviation). In contrast, we will model this probability distribution using a large number of independent coefficients describing the joint distribution as a polynomial - which turns out to lead to a very different and more complex distribution than the standardly assumed Gaussian - for example multimodal in the discussed example.

Specifically, as it is difficult to obtain a better prediction than just the previous value, we will focus on the sequence of differences between two succeeding values. In the discussed example it will be the 3-dimensional space of parameters of yield curves in the Diebold-Li model [diebold] (for fixed λ), whose dimensionality can be further increased for improved prediction by operating on a time window: using a few previous values as context for the prediction.

For convenience of fitting the polynomial, we will first normalize each variable to a nearly uniform distribution on [0, 1]. This can be done by transforming the variables with the CDF (cumulative distribution function) of an approximated distribution of each variable, for which we will use the Laplace distribution, as it agrees well with the empirical CDF (Fig. 2).

Taking such normalized variables, e.g. for different parameters at a given or neighboring times, if uncorrelated they would come from a nearly uniform distribution on [0, 1]^d. We will model the perturbation from this uniform density as a linear combination of orthonormal polynomials f_j. This makes MSE-optimal estimation very inexpensive [2]: the coefficient a_j is just the average of f_j over the data sample. Coefficients for different j are independent and have a multivariate cumulant-like interpretation; they can be used for describing statistical dependencies between the tested variables.

Figure 2: Left: time series of 6470 (from 1993 to 2018) daily yield curve parameters (Diebold-Li model [diebold]) fitted using the standard assumption. We will work on the time series of differences, normalized to a nearly uniform distribution on [0, 1]. Right: comparison of the empirical CDF obtained from the sorted values with the CDFs of Laplace and Gaussian distributions with estimated parameters - we will use Laplace as it has better agreement.

This article extends the methodology from [1] for a 1D variable (on the example of the Dow Jones Industrial Average time series) to the case of multidimensional random variables.

II Normalization to nearly uniform density

We will discuss the approach on the example of a time series of 6470 (from 1993 to 2018) daily yield curve parameters.

Time series are usually normalized, for example to allow the assumption of a stationary process: such that the joint probability distribution does not change while shifting position. The standard approach, especially for a Gaussian distribution, is to subtract the mean value, then divide by the standard deviation. However, such normalization does not exploit local dependencies between values, which are what we are interested in.

Hence we will work on the sequence of differences (errors, residues) between the current value and its prediction based on previous values, which can be taken for example from ARIMA-like models. For simplicity we will use here the previous value as the predictor: operate on the sequence of differences y_t = x_t - x_{t-1}. In practical applications this can be replaced with a more sophisticated predictor, for example one exploiting long-range dependencies.

As shown in Fig. 2, such sequences of differences from the predictor turn out to have nearly Laplace distribution:

g_{μ,b}(x) = (1/(2b)) exp(-|x - μ| / b)

where the maximum likelihood estimation of the parameters is just: μ = the median of the sample, b = the mean of |x - μ|.

For simplicity we use Laplace distributions here to normalize the variables to nearly uniform on [0, 1], with separate (μ, b) parameters for different variables:

y_t = CDF_{μ,b}(x_t)     (1)

where CDF_{μ,b} is the cumulative distribution function of the used distribution (Laplace here).
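A minimal sketch of this normalization in Python (the function name and the synthetic example data are ours, not from the article): maximum-likelihood Laplace parameters are estimated as described above, then each value is pushed through the Laplace CDF.

```python
import numpy as np

def laplace_normalize(x):
    """Normalize a 1D sample to nearly uniform [0, 1] via the Laplace CDF.

    Maximum-likelihood Laplace parameters: mu is the sample median,
    b is the mean absolute deviation from mu.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    b = np.mean(np.abs(x - mu))
    # Laplace CDF: F(x) = 1/2 * exp((x-mu)/b)      for x <  mu,
    #              F(x) = 1 - 1/2 * exp(-(x-mu)/b) for x >= mu.
    y = np.where(x < mu,
                 0.5 * np.exp((x - mu) / b),
                 1.0 - 0.5 * np.exp(-(x - mu) / b))
    return y, mu, b

# Synthetic stand-in for the sequence of differences y_t = x_t - x_{t-1}
rng = np.random.default_rng(0)
diffs = rng.laplace(loc=0.0, scale=0.1, size=10000)
y, mu, b = laplace_normalize(diffs)
```

After the transform, `y` should be close to uniform on [0, 1], which is what the polynomial fitting in the next section assumes.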

We will search for the density of the normalized variables. To undo transformation (1) and retrieve the final density of the original variable x, observe that Pr(X ≤ x) = Pr(Y ≤ CDF_{μ,b}(x)). Differentiating over x, we get ρ_X(x) = ρ_Y(CDF_{μ,b}(x)) · g_{μ,b}(x).

III Hierarchical correlation reconstruction

After normalization we have a time series with nearly uniform density of the separate variables. Taking its values as different coordinates, or values neighboring in time, if uncorrelated they would come from a nearly uniform distribution on [0, 1]^d - the difference from the uniform distribution describes the statistical dependencies in our time series. We will use a polynomial to describe this difference: estimate the joint density for neighboring values.

Assuming we have a vector sequence of neighboring values x_t ∈ [0, 1]^d (we will discuss various possibilities later), we would like to model the density of such vectors as a polynomial. It turns out [2] that using an orthonormal basis, which for the multidimensional case can be products of 1D orthonormal polynomials, mean square (MSE, L²) optimization leads to an extremely simple formula for the estimated coefficients:

a_j = (1/n) Σ_{t=1..n} f_j(x_t),   where f_j(x) = f_{j_1}(x_1) · … · f_{j_d}(x_d)

The basis used this way has (m+1)^d functions for degree m. Beside inexpensive calculation, this simple approach also has the very convenient property that the coefficients are independent, giving each j a unique value and interpretation. Independence also allows for flexibility of the considered basis - instead of considering all j, we can focus on the more promising ones: for example those with a larger absolute value of the coefficient, discarding the negligible ones. Instead of MSE optimization, we can use the often preferred likelihood maximization [3], but it requires additional iterative optimization and introduces dependencies between the coefficients.
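A brute-force sketch of this estimation (helper names are ours; the sample is assumed already normalized to [0, 1]^d): each coefficient is just the sample average of the corresponding product of 1D orthonormal (rescaled Legendre) polynomials.

```python
import numpy as np
from itertools import product

def legendre01(j, x):
    """Orthonormal rescaled Legendre polynomial f_j on [0, 1]."""
    # Evaluate Legendre P_j at 2x - 1 and rescale by sqrt(2j + 1).
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * np.polynomial.legendre.legval(2 * x - 1, c)

def estimate_coefficients(sample, m):
    """MSE-optimal a_j = mean over the sample of prod_i f_{j_i}(x_i),
    for all multi-indices j in {0..m}^d."""
    sample = np.asarray(sample, dtype=float)   # shape (n, d), values in [0, 1]
    n, d = sample.shape
    # Precompute f_k(x_i) for every 1D degree k and coordinate i.
    vals = np.stack([[legendre01(k, sample[:, i]) for k in range(m + 1)]
                     for i in range(d)])        # shape (d, m+1, n)
    coeffs = {}
    for j in product(range(m + 1), repeat=d):
        prod_vals = np.ones(n)
        for i, k in enumerate(j):
            prod_vals *= vals[i][k]
        coeffs[j] = prod_vals.mean()            # a_j = average of f_j over sample
    return coeffs
```

For a perfectly uniform sample all coefficients except the normalization one a_{0...0} = 1 vanish up to the 1/√n noise level discussed below; nonzero values indicate statistical dependencies.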

Figure 3: Top: the first few of the used 1D orthonormal basis of polynomials f_j: the coefficient of f_0 = 1 guards normalization; the remaining functions integrate to 0, and their coefficients describe the perturbation from the uniform distribution. These coefficients have a similar interpretation as cumulants, but are much more convenient for reconstruction of the density. Bottom: the 2D product basis f_{j_1}(x_1) f_{j_2}(x_2). Zero coordinates do not modify the corresponding variable - generally, a given coefficient describes statistical dependencies between the coordinates having a nonzero index.

The 1D polynomials above are orthonormal on [0, 1]: ∫₀¹ f_j f_k dx = δ_{jk}, giving (rescaled Legendre) f_0 = 1 and, for j = 1, 2, 3, 4 correspondingly:

f_1(x) = sqrt(3) (2x - 1)
f_2(x) = sqrt(5) (6x^2 - 6x + 1)
f_3(x) = sqrt(7) (20x^3 - 30x^2 + 12x - 1)
f_4(x) = 3 (70x^4 - 140x^3 + 90x^2 - 20x + 1)

They are plotted in the top of Fig. 3. The f_0 = 1 term corresponds to normalization. The a_1 coefficient decides about reducing or increasing the mean - it has a similar interpretation as the expected value. Analogously the a_2 coefficient decides about focusing or spreading of a given variable, similarly as the variance. And so on: further coefficients have a similar interpretation as cumulants; however, while reconstructing a density from moments is a difficult moment problem, the presented description directly gives the coefficients of a polynomial estimating the density.
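These formulas can be checked numerically; the sketch below (our own, using exact Gauss-Legendre quadrature) verifies that the rescaled Legendre polynomials listed above are indeed orthonormal on [0, 1].

```python
import numpy as np

# First orthonormal (rescaled Legendre) polynomials on [0, 1]
f = [
    lambda x: np.ones_like(x),
    lambda x: np.sqrt(3) * (2 * x - 1),
    lambda x: np.sqrt(5) * (6 * x**2 - 6 * x + 1),
    lambda x: np.sqrt(7) * (20 * x**3 - 30 * x**2 + 12 * x - 1),
]

# Gauss-Legendre quadrature with 10 nodes is exact for polynomials of
# degree <= 19, so every product f_i * f_j below is integrated exactly.
nodes, weights = np.polynomial.legendre.leggauss(10)
x = 0.5 * (nodes + 1.0)   # map nodes from [-1, 1] to [0, 1]
w = 0.5 * weights

gram = np.array([[np.sum(w * fi(x) * fj(x)) for fj in f] for fi in f])
# gram should be the 4x4 identity matrix: the basis is orthonormal
```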

For multiple variables, a coefficient a_j describes only correlations between the coordinates with nonzero j_i and does not affect the remaining coordinates, as we can see in the bottom of Fig. 3. Each coefficient also has a specific interpretation here: for example a_{11} decides between increase and decrease of the second variable with increase of the first, and analogously a_{12} decides about focus or spread of the second variable with increase of the first.

Errors of such estimated coefficients come from an approximately Gaussian distribution with standard deviation

σ_j = sqrt( (∫ f_j² ρ dx - a_j²) / n )

For ρ uniform and j ≠ 0 the integral has value 1, giving σ ≈ 1/√n ≈ 0.0124 in our case (n = 6469). As we can see in Fig. 4, many coefficients are more than tenfold larger here: they can be considered essential, not a result of noise.
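The resulting significance threshold is simple arithmetic; a short sketch (the 3σ cut-off is our illustrative choice, not prescribed by the article):

```python
import numpy as np

n = 6469                      # sample size used in the article
sigma = 1.0 / np.sqrt(n)      # std of coefficient noise under the uniform null
threshold = 3.0 * sigma       # |a_j| beyond ~3 sigma is unlikely to be pure noise
```

A coefficient with |a_j| more than tenfold above σ, as observed in Fig. 4, is far beyond any reasonable noise threshold.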

IV Context-free modeling

Figure 4: Modeling the joint probability distribution of variables, each normalized to a nearly uniform distribution on [0, 1]. Top left: the largest positive and negative obtained coefficients of the polynomial used as density estimation, corresponding to corrections from the uniform distribution. The remaining 3 region plots show the actual values (6469 tiny black points) and region plots of the obtained density polynomial for all 3 pairs of variables, presenting the non-uniformity of their joint distribution, especially for the first pair (top right). If these variables were uncorrelated, the probability of a region would be proportional to its area. In contrast, the blue region here, corresponding to the lowest estimated density, has more than 1/2 of the area but only a small fraction of the probability, which is mostly concentrated on the diagonal, near its edges. This can be seen in the largest coefficients: negative 110 gives anti-correlation, positive 220 increases the probability of extreme values. The third variable appears much further down the coefficient list, which means weaker statistical dependency.

We will work analogously to Markov modeling: model the probability distribution of a new value based on one or a few previous values - referred to as the context for the prediction. In contrast to the standard Markov situation, the values we are using are continuous and multidimensional here. The number of previous values considered for prediction is referred to as the order of the model. Order 0, or context-free, models values as independent random variables: all from the same probability density. Order 1 is the standard Markov process: it uses one previous value as the context to model the probability distribution of the current value. Analogously for higher order models: using a few previous values as the context for prediction.

Let us start with the basic context-free approach: just model the joint distribution of the observed values, not looking for statistical dependencies with neighboring values. In the discussed example the dimension is d = 3.

We could also use only a subset of such variables; e.g. for d = 3 we have three possible pairs here. Reducing to a single variable should lead to nearly uniform density due to the used normalization. Imperfections of e.g. the assumed Laplace distribution used for this purpose will be corrected while fitting the polynomial - in the multidimensional case by the j coefficients with zero index for all but a given coordinate.

The top of Fig. 1 contains an evaluation of fitting a degree 9 polynomial (for each variable) in 4 cases: all three variables and all 3 pairs. It shows the sorted predicted densities for the actually observed values - the higher, the better the prediction. Beside the used model, the efficiency of such prediction strongly relies on objective statistical dependencies between these variables - for example, it will fail if they are uncorrelated. Surprisingly, we can see from this plot that using just the first two variables gives better prediction than all three here - the third variable is weakly correlated, more strongly with one of them (red plot) than with the other (green).

Such plots evaluating prediction also allow us to calibrate the density plots, including the interpretation of negative densities, which are an artifact of the presented method: we can read off the fraction of cases that got negative density, hence the positive-density region is expected to indicate the proper value in the complementary fraction of cases.

Analogously we can interpret further lines of these graphs; for example, the blue and orange lines in Fig. 1 exceed a given density threshold in a certain percentage of cases. The lower 3D density plot shows the corresponding region: while it contains only a small fraction of the volume, which would be its probability for uncorrelated variables, it contains here a much larger fraction of the points from the sample.

2D regions for multiple isolines of constant density for all 3 pairs are presented in Fig. 4; alternatively, they can be obtained by marginalization of the 3D density. This gives a better visualization of the strong statistical dependence between the first two variables, and the much weaker one with the third. Percentages indicate the probability of cases observed in a given region, for example for the light blue regions. The missing probability is localized in further regions. The tiny black points are the actual 6469 data points - the presented density region plots are an MSE fit of a polynomial to a sum of Dirac deltas at all points of the sample.

Figure 4 also contains the largest positive coefficients (left, always starting with 1 for normalization) and the largest negative ones (right). They provide a unique, independent, cumulant-like description of the statistical dependencies in the modeled sample. For example, the largest positive one (220) corresponds to a parabola in the first and second variable: statistical avoidance of both being near the center. The largest negative one (110) says that with growth of the first variable there comes a reduction of the second.

V Context-dependent modeling

The next step is to try to exploit statistical dependencies between values neighboring in time: base the prediction on a context representing the history, for example a few previous values, or crucial information extracted about the past, for example by a dimensionality reduction method like PCA: components corresponding to the largest eigenvalues of the covariance matrix.

For simplicity and to reduce dimension we will work on pairs of the first two variables, as the third has much weaker correlation. We have considered one previous pair as the context (order 1, d = 4), or two previous pairs (order 2, d = 6).

The most significant 100 coefficients of the largest considered model (order 2, degree 9, 10^6 coefficients) are presented in Fig. 5. Each is independent and has a specific meaning: a correction to the initially uniform density - providing a unique description of the statistical dependencies in the observed data sample. For certainty that they are not just a result of random noise, the noise level σ ≈ 1/√n here is exceeded a few dozen times in this sample.

Figure 5: The most significant statistical dependencies: the 100 largest absolute value coefficients for the million coefficient model (degree 9, d = 6), modeling three neighboring pairs. The list obviously starts with the normalization coefficient (the remaining functions integrate to 0). Then we have "11" pairs as already seen in Fig. 4, this time in all 3 positions with nearly identical coefficients (tiny differences come from occurrences at the beginning and the end). Then we see a large positive coefficient describing the dependency between neighboring pairs, saying e.g. that with growth of the first 3 variables the fourth is also likely to grow. While using up to 9th order, we see that the "9" index appears only in the last, 100th position here - the dominant statistical dependencies are described by relatively low order polynomials. Assuming a uniform density, these coefficients should come from a Gaussian distribution centered at zero with σ = 1/√n, hence the above coefficients can be seen as statistically significant: they should not be interpreted as a result of noise.

The results of this order 2 model (right, 10^6 coefficients) and analogously the order 1 model (left, 10^4 coefficients) are presented in Fig. 6. Especially the order 2 model gives nearly perfect agreement: in most cases the actually observed value is in the smallest predicted region (red boundary). However, this is fitting a million coefficient model to just 6467 data points - the polynomial approaches spikes at the data points.

Proper prediction evaluation should test generalization capabilities, which is presented in Fig. 7. These tests of 27 models first randomly split the data sample into two disjoint subsets, use the first one to calculate the coefficients, and test on the second subset. We see that the million coefficient model (order 2, degree 9) in a large fraction of cases gives negative density - it has strong overfitting. However, focusing on the predicted high density regions, it most frequently gives the proper prediction.
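Such an out-of-sample evaluation can be sketched as follows (our own helper; the `fit` and `density` callables are hypothetical stand-ins for the polynomial coefficient estimation and density evaluation):

```python
import numpy as np

def evaluate_generalization(sample, fit, density, split=0.5, seed=0):
    """Randomly split `sample`, fit the density model on one half and
    return the sorted predicted densities at the held-out points
    (the higher these values, the better the model generalizes)."""
    sample = np.asarray(sample, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(sample))
    cut = int(split * len(sample))
    train, test = sample[idx[:cut]], sample[idx[cut:]]
    model = fit(train)                   # e.g. estimate polynomial coefficients
    dens = np.asarray(density(model, test))
    return np.sort(dens)[::-1]           # sorted descending, as in Fig. 7
```

Running this for every combination of model order and polynomial degree reproduces the kind of comparison shown in Fig. 7, with disjoint training and test sets preventing the overfitting seen in Fig. 6.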

Finally, we see that the choice of the most appropriate model is a difficult question; it might be worth considering a few models and somehow mixing their predictions.

Figure 6: Top: sorted predicted densities for the actually observed values for two degree 9 models: using one (left, 10^4 coefficients) or two (right, 10^6 coefficients) previous pairs as the context. It contains the percentages of cases when the density was above the thresholds drawn below in the region plots. Bottom: region plots of the predicted densities at four random points in time, the same for both models. We can see overfitting especially in the right column, with large white regions denoting negative predicted density. This model fits a million coefficients to a size 6467 sample - approaching a density polynomial with spikes at the used points. Proper model evaluation should instead test generalization capabilities: estimate coefficients on a subset of the sample and test on the remaining points - such results are presented in Fig. 7.
Figure 7: The proper evaluation of 27 models: sorted predicted densities for the actually observed values (the higher, the better the prediction) on a randomly chosen subset of data points, using the remaining points to train the model (estimate coefficients). Context-free (d = 2), order 1 (d = 4) and order 2 (d = 6) models were used, for all considered degrees. The highest (blue) plots are analogous to those in Fig. 6, but this time with disjoint training and test sets to prevent overfitting.

VI Conclusion and further perspectives

While a Gaussian distribution is usually assumed for financial data, in reality the distribution is often much more complicated, including multimodal distributions. We have presented the basics of a systematic approach for modeling such joint distributions with a polynomial - which allows us to effectively find and work with a parametrization using thousands of unique and independent cumulant-like coefficients, each one having a specific interpretation and being inexpensive to calculate.

The used example applied the basic methodology for educational purposes; we plan to investigate its extensions in the future, for example:

  • Selective choice of basis: we have used a complete basis of polynomials, which makes its size impractically large, especially for high dimensions. However, usually only a small percentage of coefficients is above the noise - we can selectively choose and use a sparse basis of significant values instead, describing real statistical dependencies. Alternatively, we can selectively reduce the polynomial degree for some of the variables.

  • Adaptive choice of coefficients: we have assumed that coefficients are constant in time, which corresponds to stationarity of the time series. However, in practice it is often non-stationary, which can be modeled using coefficients that are not averages over all values of a given function as here, but some local averages instead, for example with exponentially decaying weights [3].

  • Long-range value prediction: combination with state-of-the-art prediction models exploiting long-range dependencies, for example using a more sophisticated (than just the previous value) predictor of the current value.

  • Improving the information content of the context used for prediction: instead of using a few previous values as the context, we can use some features describing e.g. long-range behavior, like an average over a time window, or features obtained from dimensionality reduction methods like PCA (principal component analysis).

While the approach used here was analogous to Markov modeling, an alternative approach to consider in the future is using time as one of the coordinates, e.g. fitting the polynomial to triples in a moving time window. It would require a much lower dimension, allowing longer correlations to be modeled directly. It would also allow working with continuous time.