I Introduction
While it is more convenient to work on exact values, real life predictions usually have some uncertainty, controlling of which could allow e.g. to distinguish nearly certain predictions from the practically worthless ones. Generally, wanting to predict variable from variables, if there is no a strict relation, they often come from some complicated joint probability distribution  knowing , we can only predict conditional probability distribution. This article discusses such prediction of conditional probability distributions on example of bidask spreads, which is often publicly unavailable, from a few variables which are available. There is used hierarchical correlation reconstruction (HCR) [1] methodology as briefly presented in Fig. 1: first normalize all variables to nearly uniform distribution like in copula theory [2], then model densities as polynomials using basis of orthonormal polynomials  for which coefficients are analogous to (mixed) moments. Then predict such coefficients of density of from coefficients of , for example using leastsquares linear regression here, separately for each modelled moment of . The evaluation is performed using 10fold crossvalidation for loglikelihood of normalized variables: average natural logarithm of predicted conditional density in the actual value, what can be interpreted as estimated minus conditional entropy . Figure 2 contains examples of such predicted densities, Figure 3 contains main evaluation of the used variables and models.
Having predicted conditional probability distributions, a basic application can be just taking expected values  getting conservative predictions of values, e.g. avoiding predicting extremes, as presented in Figure 7. We can additionally calculate variance of such predicted density to estimate uncertainty of value predicted as expected value. We can also handle more sophisticated situations like bimodal distribution with two (or more) separate maxima: when pointing expected value might not be a good choice (can have much lower density), a better prediction might be e.g. one of the maxima, or maybe both: providing prediction as alternative of two (or more) possibilities. Finally, we can also work on complete predicted densities, e.g. to generate its random values for MonteCarlo methods, or processing such entire probability distributions through some further functions for their more accurate modelling, as e.g. generally for nonlinear functions: expected value of function is not equal function of expected value.
To model such conditional distributions we will use HCR methodology, which combines advantages of classical statistics and machine learning. While the former allows for well controlled and interpretable but relatively small (rough) models/descriptions, machine learning allows for very accurate descriptions using huge models, but usually lacks uniqueness of solution, control and interpretability of coefficients, and often is computationally costly. HCR allows to inexpensively work on huge models obtained from (unique) leastsquares optimization, using well interpretable coefficients: as mixed moments of variables, starting e.g. with moments of single variables and correlations of coefficients.
More specifically, HCR conveniently starts with normalization of all marginal distributions to nearly uniform distributions like in copula theory  so they can be interpreted as quantiles, compactifying tails problematic for linear regression. Now we can model distortion from uniform distribution on this
hypercube with a linear combination, e.g. of orthonormal polynomials, for which coefficients can be interpreted analogously to (mixed) moments. E.g. for 3 variables, ’000’ coefficient is always 1  corresponding to normalization, ’100’ is analogous to expected value of 1st variable, ’020’ to variance of 2nd variable, ’011’ to correlation coefficient between 2nd and 3rd variable, ’202’ is large if with large variance of 1st variable there comes large variance of 3rd variable (like heteroskedasticity in ARCH model), and so on also for higher moments and dependencies between 3 or more variables, getting hierarchical decomposition of statistical dependencies (joint distribution) into mixed moments.
While we could directly extract and exploit joint distribution with HCR, experimental tests have shown that alternative approach from [3] gives slightly better evaluation by extracting additional dependencies, hence we will focus only on it: use separate bases of (mixed) moments for and predict each considered coefficient for with least squares linear regression using coefficients of . While [3] has used only moments of separate variables for , here we expand this methodology by using also their mixed moments  starting with ’11’ correlationlike coefficient.
Its advantage over modelling joint distribution is being able to notice and exploit e.g. that difference of two variables has some useful relation with the predicted variable. Looking e.g. at RWEG in Fig. 3, ’2’nd variable is practically noise, its (blue) dot is nearly zero. However, the difference between ’13’ and ’123’ model (red and orange dot) is much larger: relations with other variables allowed to extract more information from this noise.
In the discussed example we would like to predict conditional probability distributions for (nearly inaccessible) bidask spreads (relative quoted) from more available information. The basic considered ’123’ model began with 5 classically used variables: closing price (), high and low value (), volume () and logreturn (). Surprisingly, it has turned out that using and alone did not help improving evaluation (loglikelihood in 10fold crossvalidation), hence finally there were used 3 variables . The second considered ’123+’ model complements this information with other relatively available variables, searching through which has lead to final use of 3 additional variables: (market) depth, midpoint changes intraday, and midpoint volatility.
The choice of basis of moments is a difficult question: too small leads to underfitting by not being able to express dependencies of data behavior, too large leads to overfitting by representing features of training set which do not generalize to test set. For ’123’ and ’123+’ models there was chosen a compromise to optimize for all companies: for ’123’ predicting 8 first moments of using 53 mixed moments of 3 variables of , hence using coefficient models. For ’123+’ predicting 6 moments from 205 mixed moments of 6 variables of , getting coefficient models.
The used data is for 22 DAX companies for which large enough dataset was available, arbitrarily chosen as containing at least 2000 daily datapoints. As there were observed large difference between models for different companies, corresponding to different behaviors of traders of its stock, there were mainly considered individual models for each company. There was also performed hierarchical search for combinations of companies for which using common model leads to the smallest loss of evaluation, however, such loss often turns out significant.
Ii Dataset and basic concepts
This section discusses dataset and reminds standard concepts, to be used for building the used methodology in the next Section.
Iia Dataset and variables
There was used daily data for DAX companies from 19992013 period (source in Acknowledgment), selected as having available at least 2000 datapoints: Deutsche Telekom (DTEG), Daimler (DAIG), SAP (SAPG), Siemens (SIEG), Deutsche Post (DPWG), Allianz (ALVG), Bay Motoren (BMW), Infineon (IFXG), Volkswagen (VOWG), Fresenius (FREG), Henkel (HNKG), Continental (CONG), Merck (MRCG), Muench. Rueckvers (MUVG), Deutsche Boerse (DB1G), Lufthansa (LHAG), Fresen Med Care (FME), Deutsche Bank (DBKG), Heidelbergcement (HEIG), RWE (RWEG), Beiersdorf (BEIG), Theyssenkrupp (TKA).
The basic set of variables is  closing price,  volume,  return,  high/low price. However, it has turned out that trying to exploit dependence on and alone did improve evaluation, hence finally the basic considered model: ’123’ uses only as ’1’st variable, as ’2’ndvariable and normalized as ’3’rd variable.
There were also performed trials to improve the prediction by using information from some additional relatively available variables  3 were found helpful in predictions: (market) depth, midpoint changes intraday and midpoint volatility.
IiB Bidask spread and some its standard predictors
Bidask spread is the difference between the lowest asking price (, offered by a seller) and the highest bid price (, offered by a buyer). While this value is important because it is a main measure of market quality [4, 5], this information is usually publicly unavailable. Therefore, there is an interest in being able to predict this value based on other, more accessible data.
More specifically, we work on relative quoted spread, which is normalized by dividing by midpoint :
(1) 
Simple examples of its predictor based on the 5 basic variables are [6, 7], [8]:
(2) 
(3) 
They are intended for a simpler task than discussed: to predict values, while here we want to predict entire conditional probability distributions. We can reduce predicted probability distributions into predicted values e.g. as expected value, median, or positions of maxima (especially for multimodal distributions). Figure 7 presents comparisons using such predictions reduced with expected value.
However, in practice such prediction is often further processed through some functions, generally for nonlinear, hence it is more accurate to process the probability distribution (e.g. on a lattice) through the functions before e.g. taking expected value.
IiC Normalization to nearly uniform marginal distributions
Like in copula theory, in HCR methodology it is convenient to initially normalize all variables to nearly uniform marginal distributions in , hence we further only work on such normalized variables, what beside usually better prediction, also allows for better presentation of evaluation: e.g. density without prediction is 1, loglikelihood is 0.
This standard normalization requires estimations of cumulative distribution function (CDF), individually for each variable, and applying this CDF function to the original values. Finally, having a prediction we can go back to the original variable using CDF
, for example as in the original [3] article, however, for simplicity we omit this step here  working only on normalized variables. Also predictions underwent such normalization for the purpose of Fig. 7 visual performance comparison  making that an ideal predictor would give diagonal.There was used empirical distribution function (EDF) for such normalization here: for each variable its observed values are sorted, then th value in such order obtains normalized value. Hence values become their estimated quantiles this way, difference of two normalized values describes percentage of population between these two values.
Having predicted density for normalized variable, we can transform it to the original variable e.g. by discretizing this density to probability distribution on lattice, and assigning probability of its th position to th ordered original value. For simplicity it is omitted in this article.
IiD Evaluation: loglikelihood with 10fold crossvalidation
The most standard evaluation of probability distributions is loglikelihood like in ML estimation: average (natural) logarithm of (predicted) density in the actually observed value. Hence we will use this evaluation here.
Working on variables normalized to marginal distributions, without prediction they would have practically zero loglikelihood. It allows to imagine gains from predictions as averaged improvement over this , as in Fig. 2. For example the best observed loglikelihood corresponds to times better density than without prediction, the same as if we could squeeze range 3.3 times to a wide range. Sorting predicted densities in the actually observed values, we can get additional information about distribution of prediction, as presented in this Figure.
We predict here conditional density  denoted as for density of predicted based on known value of . Hence the used evaluation can be seen as estimation of , which is minus conditional entropy
. While being unknown, random variables have some concrete value of conditional entropy  we can hopefully try to approach it with better and better models.
We are focusing here on large models using hundreds of coefficients, hence we need to be careful not to overfit: represent only behavior which indeed generalizes, is not just a statistical artifact of the training set. Machine learning also builds large models, usually evaluating using crossvalidation: randomly split dataset into training and test set, training set is used to build the model, then test (or validation) set is used to evaluate the built model.
However, such evaluation depends on the random splitting into training and test set. There is used standard 10fold cross validation to weaken this random effect: dataset is randomly split into 10 nearly equal size subsets, evaluation is average from 10 cross validations: using successive subsets as the test set and the remaining as the training set. However, there is still observed scale randomness of such evaluation, hence only two digits after coma are being presented.
Iii Used HCRbased methodology
This section discusses the used methodology, being expansion of the one used in [3]. We decompose and variables into mixed moments and model separately each moments of using leastsquares linear regression of moments of , then combine them into predicted conditional probability distribution of .
Iiia Decomposing joint distribution into mixed moments
After normalization of marginal distributions of all variables to nearly uniform on , for variables their joint distribution on would be also nearly uniform if they were statistically independent.
Distortion from uniform joint distribution corresponds to statistical dependencies between these variables  we would like to model and exploit it. In HCR we model it as just a linear combination using an orthornormal basis e.g. of polynomials, which gives the coefficients similar interpretation as moments and mixed moments: dependencies between moments for multiple variables.
The first orthonormal polynomials (rescaled Legendre) for are and correspondingly (plotted in Fig. 1):
We could alternatively use e.g. , , for orthonormal basis, however, experimentally it usually leads to inferior evaluation.
Decomposing density , we need normalization to integrate to 1. Due to orthogonality, for , hence the following coefficients do not affect normalization. As we can see in their plots in Fig. 1, positive shifts density toward right  acting analogously as expected value. Positive
increases probability of extreme values at cost of central values  analogously as variance. Skewnesslike higher order asymmetry is brought by
and so on  we can intuitively interpret these coefficients as moments (cumulants). This is only an approximation, but useful for interpretation of discussed models.In multiple dimensions we can use product basis:
(4) 
leading to model of joint distribution:
(5) 
where is the basis of mixed moments we are using for our modelling. It is required to contain for normalization. Beside, there is a freedom of choosing this basis, what allows to hierarchically decompose statistical dependencies of multiple variables into mixed moments.
Figure 1 contains some first 5 functions of such product basis for variables: corresponds to normalization and requires . Coefficients of , describe expected value and variance of the first variable, and analogously of the second. Then we can start including moment dependencies, starting with which determines decrease/increase of expected value of one variable with growth of expected value of the second variables  analogously to correlation coefficient. We have also dependencies between higher moments, like asymmetric relating expected value of the first variable and variance of the second.
And analogously for more variables, e.g. describes correlation between 2nd and 5th out of 6 variables. Finally we can hierarchical decompose statistical dependencies between multiple variables into their mixed moments. However, to fully describe general joint distribution, we would need infinite number of mixed moments this way  for practical applications we need to choose some finite basis of moments to focus on.
IiiB Estimation with least squares linear regression
Having a data sample
, we would like to estimate such mixed moments as coefficients for linear combination of some orthonomal basis of functions e.g. polynomials. Smoothing the sample with kernel density estimation, finding linear combination minimizing square distance to such smoothened sample, and performing limit to zero width of the used kernel, we get convenient and inexpensive MSE estimation
[1]: independently for each coefficient as just average over dataset of value of the corresponding function:(6) 
We could use such obtained model for predicting conditional distribution: substitute the known variables to the modeled joint distribution, after normalization getting (conditional) density of the unknown variables.
However, for the bidask spread prediction problem, slightly better evaluation was obtained by generalizing alternative approach from [3], which allows to additionally exploit subtle variable dependencies, hence we will focus on it.
Specifically, to model , let us use separate bases of (mixed) moments: for , for
, and model relations between them. While there could be considered more sophisticated models for such relations including neural networks, for simplicity and interpretability we focus on linear models here, treating
as interpretable features:(7) 
hence the model is defined by the matrix, which examples are visualized in Fig. 4 for , .
It allows for good interpretability: coefficient is linear contribution of th mixed moment of to th (mixed) moment of . We focus on onedimensional , but the formalism allows to analogously predict multidimensional.
To find the model we use leastsquares optimization here  it is very inexpensive, can be made independently for each
thanks to using orthonormal basis, and intuitively is a proper heuristic: leastsquares optimization estimates the mean  exactly as we would like for coefficient estimation (
6). However, this is not necessarily the optimal choice  it might be worth to explore also more sophisticated ways.Such leastsquares optimization has to be performed separately for each . Denoting
as coefficient vector for
th moment and as (e.g. training) dataset of pairs:matrix and vector for . Such leastsquares optimization has unique solution:
(8) 
Separately calculated for each , leading to the entire model as matrix with rows, like in Fig. 4.
IiiC Applying the model, enforcing nonnegativity
Having such model we can apply it to (e.g. test) datapoints as in 7, getting predicted conditional density for on as a polynomial. However, sometimes it can get below 0, so let us refer to it as and then enforce nonnegativity required for densities:
(9) 
Such obtained polynomial always integrates to 1. However, it occasionally can get below zero, what should be interpreted as corresponding to some low positive density. Such interpretation to nonnegative density is referred as calibration, and can be optimized based on dataset as discussed in [9]. For simplicity there was just used:
(10) 
where normalization factor is chosen to integrate to 1: . The threshold was experimentally chosen as a compromise for the used dataset, its tuning can slightly improve evaluation.
Figure 2 contains examples of such predicted densities on the test set with actual values marked as vertical lines. Flat near zero regions come from thresholding. While they are relatively frequent in such predicted densities, in plots of sorted below we can see that these close to zero densities are very rare among the actual values: prediction properly excludes these regions as unlikely.
Integration is relatively costly to compute, especially in higher dimensions, hence for efficient calculation the predicted polynomial was discretized here into 100 values on lattice, what corresponds to approximating density with piecewise constant function on length 1/100 subranges. Then was applied, and division by sum for normalization. Finally density in discretized position was used as in loglikelihood evaluation.
IiiD Basic basis selection
The optimal choice of basis is a difficult open question. As the basic choice there was used combinatorial family:
(11) 
where chooses how many first moments to use for th variable, bounds the sum of used moments (and formally degree of corresponding polynomial), bounds the number of nonzero : to include dependencies of up to variables.
For example the ’123’ model infers 8 moments from 3 variables using a compromise: of size basis, directly written e.g. in Fig. 4. The ’123+’ model infers 6 moments from 6 variables: of size .
Iv Bidask spread modelling
This section discusses application of the presented methodology to model conditional distribution of (relative quoted) bidask spread.
Iva ’123’ model using basic variables
The initial plan for this article was to improve prediction from standard models: , trying to predict conditional distribution of spread from their values using the discussed methodology. However, the results were disappointing, especially for , as we can see in Fig. 7.
Therefore, we have decided to use the original variables () instead, what has turned out to lead to essentially better predictions. There was manually performed search for parameters using basic basis selection (IIID) to maximize averaged loglikelihood in 10fold cross validation. This search has finally lead to basis for only 3 variables: to predict up to 8th moment of . Surprisingly, adding dependence on and alone was worsening evaluation  their dependence did not generalize from training to test sets.
While the optimal choice of basis seems a difficult open problem, exhaustive search over all subsets is rather impractically costly, Fig. 5 presents some heuristic approaches. The family seems generally a good start, e.g. successively modifying some parameter by one as long as observing improvement. In this Figure we can see large improvement while rising the number of predicted moments up to , what suggests that complexity of conditional distributions for the considered problem requires this degree of polynomial for proper description. This Figure also contains trials of using some first of such mixed moments accordingly to different orders. A heuristic optimization of a reasonable cost is the presented selective removal: for each mixed moment from calculate evaluation when it is removed, finally remove the one leading to the best evaluation, and so on as long as evaluation improves.
IvB Individual vs common models, universality
A natural question is how helpful for prediction is a given variable  Figure 3 presents some answers by calculating loglikelihood also for models using only some of the variables. We can see different companies can have very different behavior here, e.g. for some is helpful, for some it is not, what we can also see in the presented points from dataset. Figure 4 shows that they can even have opposite behavior: e.g. dependence on price.
This is a general lesson that while we would like predictors as nice simple formulas, the reality might be much more complicated  models found here are results of cultures of traders of stocks of individual companies, which can essentially vary between companies.
Therefore, to get the most accurate predictions we should build individual models for each company. Even more, a specific behavior of a given company can additionally evolve in time  what could be exploited e.g. by building separate models for shorter time periods, or using adaptive leastsquares linear regression [10], and is planned for future investigation.
However, building such models requires training data, which in case of variables like bidask spread might be difficult to access. Hence it is also important to search for universality  e.g. try to guess a model for a company for which we lack such data, based on data available for other companies. This generally seems a very difficult problem, Fig. 6 shows that even having all the data, using common model for multiple companies we should expect large evaluation drop. For example we can see that behavior of DTEG completely disagrees with common model for all.
As we can see in this tree Figure, the one common model situation improves if we can cluster companies into groups of similar behavior (orange dots)  there are also presented results for splitting companies into just two groups with separate models (comL, comR in Fig. 4), also visually leading to slightly better prediction as we can see comparing 3rd and 4th column in Fig. 7.
IvC ’123+’ model with additional variables
The information from basic variables can often be complemented with some additional  a natural approach is checking if we can improve loglikelihood in the discussed methodology if adding information from some new variables.
The size of basis can even grow exponentially with the number of variables here: there are mixed moments if using all up to th moment for all variables. The parameter: maximal number of interacting variables in allows to bound it by . The sum limitation in seems also very useful, bounding degree of used polynomials.
Due to quickly growing basis size for increasing number of variables, we could easily exceed the size of dataset  experimentally seen as overfitting: decreasing performance. Manual search for using additional variables has started with basis for the standard variables, and carefully increasing in basis with separate single additional variable to consider. The most promising variables were later considered together, by modifying parameters by 1 as long as improvement was observed (of averaged loglikelihood over individual models for all companies).
It has finally lead to ’123+’ model: and size using 3 additional variables: depth, midpoint changes intraday and midpoint volatility. Such model has coefficients.
Due to rapid growth of the number of coefficients, for adding further variables it is worth to consider e.g. building some features from multiple variables to be directly used here, or use some alternative way for choosing basis for , e.g. directly optimized on the dataset like PCA or other dimensionality reduction.
V Conclusions and further work
There was presented a general methodology for extracting and exploiting complex statistical dependencies between multiple variables in inexpensive and interpretable way for predicting conditional probability distributions, on example of difficult problem of predicting bidask spread from more available information. It expands approach form [3] by inferring from mixed moments, and searching for the basis in large spaces of possibilities.
Figure 7 presents its comparison with standard methods when using only expected value from such predicted conditional density  perfect predictor would lead to diagonal, standard methods give rather a noise instead, while the predictions from the discussed approaches indeed often resemble diagonal, especially when using individual models. Predicted conditional probability density provides much more information: e.g. allows to additionally estimate uncertainty of such prediction, or provide oror prediction for multimodal densities, or allows for generating its random values e.g. for MonteCarlo simulations, or just provide the entire density for accurate considerations if transforming such random variables through some further functions.
There are many directions for further development of this relatively new general methodology, for example:

Optimal choice of basis is a difficult problem, necessary to be automatized especially for a larger number of variables  selecting from discussed basis of orthonormal polynomials, or maybe automatically optimizing a completely different basis based on dataset.

There are observed large differences between behaviors of individual companies  bringing difficult questions of trying to optimize for common behavior, optimize models based on incomplete information, etc. Additionally, such behavior has probably also time inhomogeneity  the models should evolve in time, requiring adaptive models to improve performance, where the problem of data availability becomes even more crucial.

The discussed models rapidly grow with the number of variables, what requires some modifications for exploiting high dimensional information  like extracting features from these variables, dimensionality reduction like PCA, etc.

We have predicted conditional distribution for onedimensional variable, but the methodology was introduced to be more general: predicting for multidimensional should be just a matter of using proper , what is planned to be tested in future.

The predicted densities as polynomials have often rapid growths at the ends of  their removal might improve performance.

There was assumed linear relation between moments with leastsquares optimization, what is inexpensive and has good interpretability, but is not necessarily optimal  there could be considered e.g. using neural networks instead, and optimizing criteria closer to loglikelihood of final predictions.
Acknowledgement
Henryk Gurgul thanks Professor Roland Mestel for providing the bidask data from data bank ”Finance Research Graz Data Services” and Professor Erik Theissen and Stefan Scharnowski from Mannheim for providing data from ”Market Microstructure Database”.
References
 [1] J. Duda, “Hierarchical correlation reconstruction with missing data, for example for biologyinspired neuron,” arXiv preprint arXiv:1804.06218, 2018.
 [2] F. Durante and C. Sempi, Principles of copula theory. Chapman and Hall/CRC, 2015.
 [3] J. Duda and A. Szulc, “Credibility evaluation of income data with hierarchical correlation reconstruction,” arXiv preprint arXiv:1812.08040, 2018.
 [4] R. Mestel, M. Murg, and E. Theissen, “Algorithmic trading and liquidity: Long term evidence from austria,” Finance Research Letters, vol. 26, pp. 198–203, 2018.
 [5] H. Gurgul and A. Machno, “The impact of asynchronous trading on epps effect on warsaw stock exchange,” Central European Journal of Operations Research, vol. 25, no. 2, pp. 287–301, 2017.
 [6] Y. Amihud, “Illiquidity and stock returns: crosssection and timeseries effects,” Journal of financial markets, vol. 5, no. 1, pp. 31–56, 2002.
 [7] K. Y. Fong, C. W. Holden, and C. A. Trzcinka, “What are the best liquidity proxies for global research?” Review of Finance, vol. 21, no. 4, pp. 1355–1401, 2017.
 [8] B. BedowskaSójka and K. Echaust, “Commonality in liquidity indices: The emerging european stock markets,” Systems, vol. 7, no. 2, p. 24, 2019.
 [9] J. Duda, “Exploiting statistical dependencies of time series with hierarchical correlation reconstruction,” arXiv preprint arXiv:1807.04119, 2018.
 [10] ——, “Parametric context adaptive laplace distribution for multimedia compression,” arXiv preprint arXiv:1906.03238, 2019.