    # Credibility evaluation of income data with hierarchical correlation reconstruction

In situations like tax declarations or analyzes of household budgets we would like to evaluate credibility of exogenous variable (declared income) based on some available (endogenous) variables - we want to build a model and train it on provided data sample to predict (conditional) probability distribution of exogenous variable based on values of endogenous variables. Using Polish household budget survey data there will be discussed simple and systematic adaptation of hierarchical correlation reconstruction (HCR) technique for this purpose, which allows to combine interpretability of statistics with modelling of complex densities like in machine learning. For credibility evaluation we normalize marginal distribution of predicted variable to ρ≈ 1 uniform distribution on [0,1] using empirical distribution function (x=EDF(y)∈[0,1]), then model density of its conditional distribution (Pr(x_0|x_1 x_2...)) as a linear combination of orthonormal polynomials using coefficients modelled as linear combinations of properties of the remaining variables. These coefficients can be calculated independently, have similar interpretation as cumulants, additionally allowing to directly reconstruct probability distribution. Values corresponding to high predicted density can be considered as credible, while low density suggests disagreement with statistics of data sample, for example to mark for manual verification a chosen percentage of data points evaluated as the least credible.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

While in standard regression we want to estimate the conditional expected value, in some situations we need to predict the entire probability distribution. In ARMA/ARCH modelling

, or data compression like JPEG-LS , it is resolved by predicting the most likely value, usually using a linear combination of neighboring values, then assuming some parametric distribution of error from this prediction - for example as Gaussian in ARMA-like models or Laplace distribution in data compression. Widths of such distributions are often chosen based on the context (e.g. in ARCH, JPEG-LS).

However, in situation like credibility evaluation of tax declarations, the conditional distribution of the main variable to verify (declared income) is quite complex, as we will see in the presented analysis, and we need to model dependencies with multiple categorical variables - their properties might affect not only the expected value, but also higher moments. Such dependence between expected values of two variables is described by correlation coefficient, between variances e.g. in ARCH model, HCR used here allows to systematically exploit such dependencies of any moments between two or more variables. Figure 1: Top: modelled conditional densities of Pr(x0|x1…xd) using degree m=1 (left) or m=2 (right) polynomials (Fig. 3 contains further up to m=10) for predicted exogenous variable (equivalent income normalized to uniform marginal distribution on [0,1]) based on the remaining (endogenous) variables for ten randomly chosen data points from the sample. The actual values are marked with vertical lines, the higher predicted density for them, the better - we see that in most cases the prediction is above the base ρ=1. Inconvenience of this method is sometimes obtaining negative densities, but we can interpret such predictions, e.g. as just having low credibility here. Bottom: evaluation/calibration plots for predictions with m=1…10 degree polynomials, obtained by sorting predicted densities for actual values. We see that in ≈1/6 of cases this prediction is worse than ρ=1, but it can be essentially better in the remaining cases, especially if using high degree polynomial. Such low predicted density suggests disagreement of given data point with statistics of data sample - suggests its low credibility, for example to perform manual verification if this value is below a threshold chosen to mark a given percentage of the least credible.

To evaluate credibility of a given (exogenous) variable , we would like to predict its conditional probability density of based on the remaining (endogenous) variables - intuitively, the higher such density is, the higher credibility of this value. However, this intuition requires some normalization, as for example tails of distributions have lower density what should not be interpreted as lower credibility - these values are just spread over wider range.

A natural normalization, used for example in copula theory , is to nearly uniform marginal distribution on . We can use empirical distribution function (EDF, by sorting obtained values) to transform the original variable to from nearly uniform distribution. Using this normalization, modelled density of can be seen as evaluation of credibility, its examples are presented in Fig. 1.

To model we will use hierarchical correlation reconstruction (HCR) approach ([4, 5, 6, 7]

): model density of joint distribution of all variables

as a linear combination of orthonormal polynomials:

 ρ(x)=∑j=(j0…jd)ajfj0(x0)⋅…⋅fjd(xd) (1)

where satisfy .

Such coefficients have cumulant-like interpretation. For example has similar behavior as expected value of the first variable, as variance of the second variable. We can also use mixed coefficients, for example describes dependence between expected values of the first two variables - has similar interpretation as correlation coefficient. We can model complex joint density with such polynomial approximation, especially that mean-square estimation of these coefficient turns out very inexpensive and can be calculated independently : using orthonoromal basis, coefficient of a given function turns out just average of this function over the sample: for data points.

For credibility evaluation, after normalizing all marginal distributions to nearly uniform on , we would like to estimate their joint density as such linear combination - using a chosen basis, for example exploiting only pairwise dependencies (up to two nonzero indexes). Then substituting coordinates to such joint distribution, and normalizing to integrate to 1, we get probability density , describing credibility of this value.

However, above HCR approach is appropriate for continuous variables, while here we have many discrete - there will be suggested adaptation for this situation. Finally, for simplicity and interpretability there will be just used least-squares regression to optimize linear dependencies between properties of endogenous variables and cumulant-like coefficients of the predicted variable:

 ρ0(x0)=1+∑jajfj(x0)foraj=βj0+∑kβjkvk

with coefficients chosen by least-square optimization of .

For evaluation/calibration there will be used plot of sorted for all points from the dataset, presented in Fig. 1

. Its horizontal axis can be seen as quantile regarding modelled credibility. For example choosing that we want to manually verify 1% of least credible data points, we can read the density threshold (

) from this plot and verify only those below this threshold.

## Ii Dataset

The dataset is composed of observations on 37215 households collected annually by the Central Statistical Office of Poland (GUS). There will be used the following variables (all but the first are endogenous - they are used to predict the first variable which is exogenous; variable names are provided in brackets):

1. Continuous: equivalent monthly income111It is calculated as the household disposable income divided by a respective equivalence scale i. e. an indicator supposed to measure impact of demographic variables on cost of living. In the present study simple OECD 70/50 formula is employed: scale = 1 + (number of adults minus one) 0.7 + (number of children) 0.5. (inceq) in PLN - the exogenous variable, remaining equivalent cash in PLN at the end of month (casheq), shares of expenditures on luxury goods and food (luxury and food, respectively),

2. Discrete ordinal (the number of distinct values is provided in brackets): age of the household head in years (age, 87 values), his/her completed education level (edu, 11), number of persons (pers, 14), number of younger (child1, 8) and older (child2, 7) children in the household, urbanisation level (urb, 3), type of residence (loc, 6), number of cars (cars, 7), month of the query (month, 12).

3. Discrete categorical: main income source (source1, 12) and additional source (source2, 13) , voivodship (voi, 16), building type (build, 4), its ownership type (own, 6), age of the newest car (carage, 5), subjective evaluations of: change in the material position (chg, 5), income sufficiency (sf, 5), level of satisfaction of needs for food (sff, 5), clothing (sfc, 5), health care (sfh, 6), housing fees (sfs, 6), housing equipment (sfq, 6), culture (sfl, 7), education (sfe, 6), tourism and recreation (sft, 6).

4. Binary: sex of the household head, subjective evaluation whether the dwelling is too small or too large, presence in the household of persons: with tertiary and secondary education, unemployed, handicapped.

The model will use linear combinations of their properties. Age will be treated as continuous variable, their pairwise dependencies can be seen in Fig. 2, their cumulant-like proprieties are used in linear combination for prediction. To avoid arbitrary choice of weights, all the remaining variables are treated as binary: split into 0/1 variables, as many as the number of distinct values, being 1 if the category agrees, 0 otherwise.

## Iii Normalization and orthognonal polynomial basis

For many reasons it is convenient to normalize variables to have nearly uniform marginal distributions: to see them as quantiles - interpreting ranges as population percentages, to directly interpret predicted density as credibility here, finally to use polynomials for density estimation with normalized coefficients. In previous applications of HCR ([6, 7]

), there was used CDF (cumulant distribution function) of approximated parametric distribution (Laplace) for this normalization - approximating general behavior, then modelling with polynomial corrections from this idealization, which can evolve in time for non-stationary time series. Here probability distribution is stationary and too complex for parametric distributions, hence, like in copula theory

, we will directly use EDF for this normalization.

### Iii-a Normalization

Having sample, we can normalize it with EDF by sorting the values - finding order (bijection) such that:

 yo(1)≤yo(2)≤…≤yo(n) (2)

Hence, is in -th position of this order - wanting them to have nearly uniform distribution on , a natural choice is .

However, especially for discrete variables, (2) has many equalities, what needs special treatment - there is no base to choose an order among equal values, all of them should be transformed into the same value , naturally chosen as the center of such range - we can see one such horizontal line for casheq = 0 in Fig. 2, and age sample consisting only of horizontal lines due to rounding to complete years.

Finally, the used generalized formula (working for both continuous and discrete variables) is:

 xi=min{k:yo(k)=yi}+max{k:yo(k)=yi}−12 (3)

We can use this formula to normalize each variable of : separately for each lower index, getting , which for continuous variables have nearly uniform marginal distribution on . However, for discrete variables it is a step function, especially for binary - needing a completely different treatment.

There are ways to transform discrete variables into a smaller number of nearly independent continuous variables - for example by choosing a set of orthonormal projections. This choice is sometimes done randomly, but it often can be optimized by dimensionality reduction techniques like PCA (principal component analysis) or its discrete analogues like MCA (multiple correspondence analysis)



. Such projections often have nearly Gaussian distribution, but we would loose interpretability this way - this article presents analysis directly using discrete variables to get better interpretation of obtained coefficients (Fig.

4), however, it might be worth to explore also methods like MCA to directly work on continuous variables. Figure 2: Pairwise dependencies between 5 variables treated as continuous: exogenous (income) on horizontal axis to be predicted from endogenous variables on vertical axis. Each is normalized to nearly uniform marginal distribution - position can be seen as quantile, 0.5 as median, length e.g. 0.2 as 20% of population. Some of them have discreteness - corresponding to horizontal lines. Each of four [0,1]2 shown regions contains 37215 black dots from the dataset, and isolines for their density (would be ρ≈1 for independent) - estimated with HCR as polynomial ∑9ij=0aijfi(x0)fj(x1) using 100 coefficients (mixed moments up to 9th). For example for age we can see that younger people have higher expected inceq, middle-age lower, older closer to median - we need at least second order polynomial (f2) of age to model such behavior.

### Iii-B Orthogonal polynomial basis and HCR

Assuming (normalized) variable is from nearly uniform distributions on , it is very convenient to represent its density with polynomial: . Using orthonormal basis , mean-square optimization leads to  inexpensive estimation by just averaging: .

The first five of these polynomials (rescaled Legendre) are , and correspondingly:

 √3(2x−1),√5(6x2−6x+1),√7(20x3−30x2+12x−1),
 3(70x4−140x3+90x2−20x+1).

For normalization we need . The term shifts the expected value toward left or right. Positive term increases probability of extreme values, reducing it for central values - has analogous behavior as variance. And so on: coefficient has similar interpretation as -th cumulant. Using degree polynomial: corresponds to modelling distribution using the first moments, additionally directly getting density estimation from them.

We can also exploit statistical dependencies between two or more variables this way - by analogously modelling joint distribution on using product basis: . This way represents mixed cumulants - their dependencies between multiple variables. For a large number of variables, most of coordinates of used j should be 0 - coefficients with single nonzero coordinate describe probability distribution of corresponding variable, with two nonzero describe pairwise dependencies and so on - getting hierarchical correlation reconstruction (HCR) of a given distribution.

We could directly use such modelled joint density for credibility evaluation by just substituting and normalizing obtained polynomial of to integrate to 1. This way is expressed as nearly a linear combination of various products of

. However, this approach has difficulties with discrete values - hence, there is finally used linear regression to directly optimize coefficients of such linear combinations.

Orthogonal polynomial basis allows also for a different perspective: for a given value, we can take its coordinates in this basis: . As estimated is just average over such corresponding coordinates, we can imagine e.g. as contribution of this point to expected value, to variance etc. Hence, we can use as a list of properties of a given data point to infer its other property, exploiting nonlinear dependencies this way (for ) - we will use such properties in linear regression for prediction here.

There is generally a large freedom of choice for properties used in inference. For example for age variable here, a standard approach is dividing into age ranges, what can be seen as using with family of functions being 1 on a given age range and 0 otherwise. It leaves a question of sizes of these age ranges: long ranges reduce data precision, short ranges mean lower statistics and not exploiting local behavior (of neighboring ranges). Normalizing variable and using with orthonormal family of polynomials is an attempt to automatically exploit local dependencies, making each coefficient being affected by all data points. Numerical tests for age gave slightly more accurate predictions using orthonormal polynomials, than using the same number of fixed length age ranges.

### Iii-C Handling discrete values

Analogously to continuous variables with scalar product, we could directly work on discrete variables with product using some weight

- using basis of orthonormal vectors this time, we can we get analogous mean-square estimation as

.

However, while should be chosen as marginal distribution in analogy to continuous case, choosing the weights is a nontrivial problem, planned to be explored in the future. For simplicity and to present intuition of treating as useful properties for inference, which can be imagined as -th contribution to -th moment, in the presented analysis there will be used linear regression to directly optimize coefficients.

## Iv Used algorithm

The currently used algorithm optimizes coefficients with least-square regression:

1. All variables treated as continuous - including casheq having a large percentage of exactly 0 value, and age obtaining 87 distinct discrete values - are normalized to nearly uniform marginal distribution using formula (3).

2. All the remaining variables are treated as categorical and transformed into binary - thanks of it, weights of individual categories are optimized in later regression. For example edu(cation) obtains 11 different values, hence it is transformed into 11 binary variables: each being 1 if category agrees, 0 otherwise.

3. Denote as vector built of all properties of endogenous variables (normalized or not) directly used for prediction as a linear combination - here it has 223 coordinates visualized in Fig. 4. Its zeroth coordinate is fixed as 1 to get constant term in later regression. Then for variables treated as continuous, it contains for up to a chosen degree, which in 9 here - for all 4 variables treated this way (casheq, luxury, food, age) - getting coordinates of . It further contains all the remaining variables - categorical transformed into binary, and the original binary variables - both use only 0 or 1 values.

4. Build e.g. matrix with rows being applied function to all data points.

5. For each property we would like to use least square linear regression to infer for to predict density as degree polynomial. For this purpose, build vectors , then find vectors minimizing . It can be realized with pseudoinverse, and is implemented in many numerical libraries, e.g. as ”LeastSquares[M,b]” in Wolfram Mathematica. Values of these final used coefficients are visualized in Fig. 4.

6. Now predicted density is

 ρ0(x0)=1+m∑j=1ajfj(x0)foraj=v(y1…yd)⋅βj

This predicted density is for exogenous variable normalized to nearly uniform marginal distribution on - what allows to use it directly for credibility evaluation.

Sorting such predicted densities of actual values, we get evaluation/calibration plot presented in Fig. 1 for various used degree . Inconvenience of parameterizing density as polynomial is sometimes obtaining negative value, what need proper interpretation. It happens frequently in predictions in Fig. 3, but only in of cases in this plot of sorted predictions - confirming prediction and allowing to exclude such negative density regions with near certainty. Predicted negative density suggests disagreement of this point with sample statistics - low credibility of this data point.

To apply this evaluation/calibration plot, we can rescale its horizontal axis to range, allowing to interpret it as quantile in credibility evaluation. Then for example choosing a percentage of least credible data points for manual verification, we can read density threshold for this percentage, then mark for further verification all data points below this threshold (e.g. for ).

## V Conclusions and further work

There was presented a simple general approach for applying HCR technique with discrete variables - on example of credibility evaluation for income data, allowing to model conditional probability distribution for (normalized) predicted variable.

It tests agreement with statistics of provided data sample - will not detect systematic improper behavior existing in this sample. Handling such situations would rather require some supervised learning, like manually marking some of suspicious data points and then evaluate probability of being in this marked set.

The main purpose of this article was methodological - presenting simple interpretable analysis, but leaving many possibilities for improvements.

For example it has exploited only pairwise dependencies - between the exogenous variable and (separately) each of endogenous variable. We can analogously include higher order dependencies by using products of considered variables in vector of above algorithm, e.g. include properties in the least square regression for 3-point correlation, and analogously for higher order dependencies. As their number grows exponentially with dependency order, this choice requires some selectivity - for example can be done in hierarchical way: e.g. search for non-negligible 3-point correlations by expanding essential 2-point correlations. In presented analysis the number of parameters is negligibly small comparing do sample size - there is no danger of overfitting. Otherwise, the sample should be split at least into training and evaluation set .

Also, the main purpose of using least-square regression was simplicity and interpretability - it probably can be improved, what is planned to be explored in the future.

As mentioned, another direction worth exploring is using dimensionality reduction techniques like PCA or its discrete analogue: MCA, to reduce the number of variables and make them continuous. Figure 3: Predicted densities of Pr(x0|x1…xd) like in Fig. 1, but for all m=1,…,10 polynomial degrees, what can be imagined as modelling up to m-th moment. All use the same randomly chosen 10 data points - focusing on a chosen color, we can follow evolution of its prediction with growing degree. For example the extreme ones (violet and yellow) have essentially improved while using parabola, but the remaining stay nearly unchanged. We can see that localization of predicted range usually improves with degree. Figure 4: Top: final 4×223 coefficients obtained by least-square regression (optimizing ∑i∥fj(xi0)−aj∥2) for predicting probability density of exogenous variable (equivalent income) based on properties of endogenous variables - independently for coefficients corresponding to expected value (f1), variance (f2), skewness (f3)and kurtosis (f4) of predicted variable. For endogenous variables treated as continuous (casheq, luxury, food, age), the used properties are fj(xl) for j=1…9 describing j-th cumulant-like behavior. The remaining variables are treated as binary variables (0 or 1) for each appearing possibility. For example in the first row, negative first coefficient for food connects their expected values - describes anticorrelation, analogously for age it is positive - they are correlated. Coefficients for voivodeships (voi) describe individual corrections for each geographical region. Low statistics for some coefficients can lead to surprising behavior, for example we can see reduction of equivalent income with the number of persons in the household (pers), with a surprising spike at the end - it corresponds to a single data point of 12 person household. Bottom: average |vk| describing average influence of βk coefficient - it is fixed 1 for v0, nearly constant for cumulant-like coefficients for continuous variables, for each binarized discrete variable its contributions sum to 1: they describe proportions in population, for binary variables it is in [0,1].