VAT tax gap prediction: a 2-steps Gradient Boosting approach

12/08/2019 ∙ by Giovanna Tagliaferri, et al. ∙ Sapienza University of Rome

Tax evasion is the illegal non-payment of taxes by individuals, corporations, and trusts. It results in a loss of state revenue that can undermine the effectiveness of government policies. One measure of tax evasion is the so-called tax gap: the difference between the income that should be reported to the tax authorities and the amount actually reported. However, economists lack a robust method for estimating the tax gap through a bottom-up approach based on fiscal audits. The difficulty is that the declared tax base is available for the whole population, whereas the income that should have been reported is generally available only for a small, non-random sample of audited units. This induces a selection bias which invalidates standard statistical methods. Here, we use a machine learning approach based on a 2-steps Gradient Boosting model to correct for the selection bias without requiring any strong assumption on the distribution. We use our method to estimate the Italian VAT gap related to individual firms, based on information gathered from administrative sources. Our algorithm estimates the potential VAT turnover of Italian individual firms for the fiscal year 2011 and suggests that the tax gap is about 30% of the potential tax base. Comparisons with other methods show that our technique offers a significant improvement in predictive performance.




1 Introduction

Tax evasion is the illegal non-payment of taxes by individuals, corporations, and trusts. It results in a loss of state revenue that may be substantial and may undermine the effectiveness of state policies. It is one of the main problems in modern economies, where the government budget is under constant scrutiny (santoro2010evasione). One measure of the extent of tax evasion is the so-called tax gap, that is, the amount of unreported income, evaluated as the difference between the income that should be reported to the tax authorities and the amount actually reported.

The aim of this paper is to introduce a robust methodology for estimating the tax gap through a bottom-up approach based on fiscal audits (braiotta2015tax; kumar2015minimising). In particular, the analysis focuses on the estimation of the Italian VAT gap related to individual firms, for the fiscal year 2011, based on information gathered from administrative sources. Two sources of data were employed: the Tax Administration Database ("Anagrafe Tributaria") and the Operational Audits Database. The first one contains information on the whole population coming from PIT, VAT and IRAP declarations, but does not include the amount of income that should actually be reported to the tax authorities (the potential tax base). The second one provides information only on a non-random sample of units (the audited units), and it contains the potential tax base. More details on the Italian tax system and the operational audit activity are presented in Section 2.

The final target of the analysis is to obtain reliable estimates of the potential tax base, namely the potential VAT turnover. The model estimated on the audited taxpayers should be able to infer the potential tax base of the non-audited ones. This makes it possible to estimate the total potential tax base of the Italian system (BIT), compare it with the declared tax base (BID), and estimate the total undeclared tax base (BIND) as the difference between the estimated BIT and the known BID. The main issue is that audited taxpayers are usually a very small portion of the total population and are selected non-randomly, as in the case of the controls performed by the Italian Revenue Agency (d2016general). This induces a selection bias on the observed response variable which invalidates the application of standard statistical methods (sarndal2003model; sarndal2005estimation). A machine learning approach built on 2-steps Gradient Boosting models, following the work by zadrozny2004learning, is proposed in place of the standard methodology based on the Heckman model (heckman1976common; heckman1979sample) in order to improve the estimation accuracy. The first step estimates the inclusion probabilities of the audited taxpayers, which are then used to correct for the selection bias in the second step. The adoption of machine learning algorithms is driven by the need to move beyond linearity and by their ability to extract information from very large sets of data (both in terms of units and variables) without any strong assumption on the distribution. In the domain of this paper, the Gradient Boosting algorithm, introduced by friedman2001greedy, is chosen over its alternatives. It is a very robust ensemble learner, able to deal with data of any size (if adequate computational power is available) and nature. More details on the methodology are presented in Section 3. Afterwards, in Section 4, the proposed methodology is applied to estimate the potential VAT turnover for a sub-sample of the Italian individual firms, for the fiscal year 2011. Results on this sub-sample are then compared with the ones obtained using the Heckman model, which turns out to be clearly outperformed in terms of predictive performance. Finally, in Section 4.2, the VAT tax gap and the VAT tax evasion propensities are predicted for the not-audited taxpayers in the considered sub-sample of the population.

2 Italian Taxation System

"Every person shall contribute to public expenditure in accordance with his/her tax-payer capacity. The taxation system shall be based on criteria of progression."

Const. Art. 53, Section I, Political rights and duties

The Italian tax system is based on three fundamental principles, which are discussed in the 53rd article of the Italian Constitution in the “Political Rights And Duties” section.

  • Universality of taxation: all citizens must contribute to public expenses through the payment of taxes. These finance the operation of the state machine and are returned to citizens in the form of benefits and services. Those who are below a minimum income are exempt from the tax obligation and can still take advantage of all services, by virtue of the principle of economic and social solidarity.

  • Ability to contribute: the payment of taxes is based on the economic means of citizens. This principle guarantees a fair distribution of the tax burden, excluding those who do not have the capacity to pay taxes.

  • Criteria of progression: the taxes paid by citizens vary with the potential tax base. This means that everyone pays taxes according to their economic means, with a contribution that grows as income increases.

The most important taxes in terms of revenue are the Personal Income Tax (PIT) and the Value Added Tax (VAT), which together account for a large share of the total revenue.

The Personal Income Tax is the main tax paid in Italy. It applies to all persons who are resident in Italy for tax purposes, on all types of income they receive from whatever source. Each person is taxed progressively according to their total income, whether it comes from employment, pension or self-employment. It constitutes one of the main sources of revenue for the state coffers. The study of PIT evasion and its gap is addressed in braiotta2015tax.

The Value Added Tax is second in importance. As stated by the European Commission, "it serves to tax the consumption of goods and services. It is applied to all commercial activities involving the production and distribution of goods and the provision of services". This kind of tax is not progressive and weighs entirely on the final consumer, while for the taxable person (the entrepreneur and the self-employed) it remains neutral.

Checking that taxpayers correctly fulfill their obligations falls within the institutional tasks of the Italian Revenue Agency. These checks, carried out with different methods and different timing, evaluate the correspondence between what is declared and what is actually due (d2016general). The notice of assessment for income taxes and VAT must be notified within four years following the declaration date. For example, the assessments for the declarations submitted in 2012, which concern 2011 incomes, were concluded on December 31, 2017.

Generally, the declarations submitted by taxpayers can be subject to various types of audits. For instance, the substantial control is aimed at adjusting the total income or turnover of the taxpayer. In general, the audited taxpayers are not selected randomly but are identified on the basis of predetermined selection criteria. The rationale for this selection is that evaders are assumed to behave differently from those who do not evade. These criteria are therefore believed to identify such evasive behaviour and to optimize the selection of tax evaders. Controls of this kind are very costly both in terms of money and time and, for this reason, the assessment is viable only on a small fraction of the total population, while the amount evaded by the remaining taxpayers stays unknown. That is why accurate methods for estimating and predicting the evasion level and propensity of not-audited individuals would be very valuable, and may provide policy-makers with better tools and information to counter the phenomenon of tax evasion.

3 Model and methodology

The planning of public investments and initiatives is strictly linked to the potential tax base BIT predicted for the whole economic system for the current and following years. However, the potential tax base always overestimates the actual revenue coming from taxes. As a matter of fact, the declared tax base BID, which reflects what is actually paid by all the economic subjects, is always lower than the BIT because of the well-known problem of tax evasion. The difference between these two quantities is the undeclared tax base BIND, commonly known as the tax gap:

$$\mathrm{BIND} = \mathrm{BIT} - \mathrm{BID}$$

Quantifying the undeclared tax base is important in order to support state economic policy decisions. Moreover, by exploiting information from previous years, it may be possible to extrapolate its amount and take it into account for the upcoming years. However, while the BID is available at the individual level on the basis of the tax returns submitted by the individual economic subjects, the BIT is generally unknown at both the aggregate and the individual level.

Nevertheless, thanks to the bottom-up approach based on fiscal audits, the BIT is detected every year on a sample $s$ of audited taxpayers from the whole population $U$: $\mathrm{BIT}_i$ is known for every $i \in s$. However, these units are not randomly selected: the selection is based on arbitrary (and unknown) criteria established by the Italian Revenue Agency (d2016general). The logic behind this choice is the need to maximize the selection of taxpayers who have an increased risk of hiding part of their income. This means that the selection mechanism is not ignorable with respect to the response variable and, therefore, any estimation procedure based on such a sample would be affected by a potentially strong selection bias. This kind of bias invalidates any inference from the sample to the whole population of economic subjects (sarndal2005estimation). For instance, let $Y$ denote our variable of interest, and let it be observed only on a non-random sample $s$ of units selected according to a sample design $p(s)$. The expected value for each observed unit would be:

$$\mathbb{E}[y_i \mid i \in s]$$

If the sample design is not independent of the outcome variable, then:

$$\mathbb{E}[y_i \mid i \in s] \neq \mathbb{E}[y_i]$$

and it is not possible to directly get any estimate of $\mathbb{E}[y_i]$ using solely the observed outcomes. In other words, it is not possible to directly estimate the potential tax base of not-audited units by using information on the audited ones. Furthermore, because the selection criteria are confidential, the selection probabilities of the units in the sample are not available a priori and therefore cannot be used to correct for the non-representativeness of the selected sample.
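The effect of a non-ignorable selection can be illustrated with a small simulation. The sketch below is not taken from the paper: the log-normal outcome and the audit rule (selection probability proportional to the outcome) are hypothetical choices made only to show how the mean computed on the selected units drifts away from the population mean.

```python
import random

def simulate_selection_bias(n_units=20000, seed=0):
    """Compare the population mean of y with the mean on a non-random sample."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_units):
        y = rng.lognormvariate(0.0, 1.0)      # hypothetical outcome (e.g. a tax base)
        # Hypothetical audit rule: the selection probability grows with y.
        p_select = min(1.0, 0.02 * y)
        selected = rng.random() < p_select
        population.append((y, selected))
    pop_mean = sum(y for y, _ in population) / n_units
    sample = [y for y, sel in population if sel]
    sample_mean = sum(sample) / len(sample)
    return pop_mean, sample_mean, len(sample)

pop_mean, sample_mean, n_sample = simulate_selection_bias()
print(f"population mean:     {pop_mean:.2f}")
print(f"audited-sample mean: {sample_mean:.2f} (n = {n_sample})")
```

Because units with a large outcome are more likely to be selected, the sample mean overstates the population mean, which is the situation the rest of the section sets out to correct.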

These observations are of primary importance when considering the appropriateness of the assumptions of the chosen model. The model needs to take the selection bias into account and provide some correction for it. The standard solution, currently adopted by the Italian Revenue Agency, is based on the well-known Heckman model (heckman1979sample), which combines logistic and linear regression. The one considered in this paper is a two-step procedure based on the work by zadrozny2004learning that exploits the Gradient Boosting algorithm as a strong learner in both steps. The adoption of such an algorithm makes it possible to move beyond linearity, exploit all the available information, and relax some strong assumptions that are hardly met by the available data.

Furthermore, the availability of data at the individual level makes it possible to go beyond the mere estimation of the overall gap. Indeed, predictions of the individual VAT gap can be used to compute the propensity to evade for units grouped according to some variables of interest (braiotta2015tax). These propensities can be used to identify possible high-risk taxpayers and drive future selections for tax audits.

3.1 The 2-steps approach

The approach introduced here is viable when the complete population list is available, together with a common set of covariates for each unit. The following hypotheses need to hold.

  • The probability of being included in the sample $s$ for unit $i$ depends only on its covariates $x_i$.

  • The response variable of unit $i$, $y_i$, is conditionally independent of the sampling design given the covariates $x_i$: $P(y_i \mid x_i, i \in s) = P(y_i \mid x_i)$.

The second assumption allows us to transfer all the dependence of the outcome variable on the sample design to the set of covariates:

$$\mathbb{E}[y_i \mid x_i, i \in s] = \mathbb{E}[y_i \mid x_i] \tag{3}$$

Theoretically, Equation 3, which derives from the last assumption, should allow the model to be fitted using only the selected units. However, there is still an issue of non-representativeness of the sample with respect to the whole population. Since units in the sample differ from units out of the sample in terms of both response and explanatory variables, an estimate based only on the selected sample would rely on a certain amount of extrapolation. The resulting estimates would favor the fit on over-represented units and neglect the fit on under-represented ones. The 2-steps approach is adopted to correct this kind of bias using a Horvitz-Thompson style estimation, directly derived from basic survey sampling theory (sarndal2003model), exploiting all the auxiliary information available on the population. It consists of the sequential application of two predictive models to the available data.

  1. The first model is a classification model. It is trained on all the units, selected and not selected, and targets the binary variable that equals 1 if unit $i$ is selected in the sample ($i \in s$) and 0 if it is not. It tries to find a specific pattern or some regularity in the selection mechanism, so that the probability of being selected can be estimated from the auxiliary information included in the covariates $x_i$. The resulting model produces predictions on the whole population which approximate the first-order inclusion probability of each unit:

    $$\pi_i = P(i \in s \mid x_i) \tag{4}$$

  2. The second model targets the response variable $y_i$ and is therefore trained only on the units included in the sample (whose response variable is available). However, it is now possible to incorporate the inclusion probabilities resulting from Equation 4 in order to correct for the non-representativeness of the selected units. These can be used to produce the inverse weights defined as:

    $$w_i = \frac{P(i \in s)}{P(i \in s \mid x_i)} \tag{5}$$

    where $P(i \in s)$ is the probability of being selected regardless of the set of covariates. In practice, by weighting each input observation proportionally to the inverse of its selection probability, we reduce the importance of units already over-represented in the sample while increasing the importance of under-represented ones.
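The two steps described above can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the paper's implementation: the single binary covariate, the selection rates and the outcome distribution are all hypothetical, and the first-step classifier is replaced by simple group frequencies instead of Gradient Boosting. The second step is reduced to a weighted mean so that the effect of the weights is easy to read.

```python
import random

rng = random.Random(1)
N = 50000
units = []
for _ in range(N):
    x = 1 if rng.random() < 0.2 else 0               # hypothetical covariate: 1 = large firm
    y = rng.gauss(100.0 if x else 20.0, 5.0)         # outcome depends on x
    selected = rng.random() < (0.10 if x else 0.01)  # selection depends only on x
    units.append((x, y, selected))

# Step 1 (stand-in for the classifier): estimate P(selected | x) by group frequencies.
p_sel_given_x = {}
for value in (0, 1):
    group = [u for u in units if u[0] == value]
    p_sel_given_x[value] = sum(u[2] for u in group) / len(group)
p_sel = sum(u[2] for u in units) / N                 # marginal P(selected)

# Step 2: weight each selected unit by P(selected) / P(selected | x).
sample = [(x, y) for x, y, sel in units if sel]
weights = [p_sel / p_sel_given_x[x] for x, _ in sample]

naive_mean = sum(y for _, y in sample) / len(sample)
weighted_mean = sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)
population_mean = sum(y for _, y, _ in units) / N

print(f"population mean: {population_mean:.1f}")
print(f"naive sample mean: {naive_mean:.1f}")
print(f"weighted sample mean: {weighted_mean:.1f}")
```

In this toy setting the unweighted mean is dominated by the over-sampled large firms, while the weighted mean moves back towards the population value, because selection depends on the covariate only.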

Solutions of this kind are very common in the correction of bias arising from non-ignorable sampling designs (sarndal2003model), for instance when incorporating the response probability to correct for non-response bias (bethlehem1988reduction; alho1990adjusting), or in inverse probability of treatment weighting (IPTW) to correct for treatment selection bias in causal inference (hirano2003efficient; austin2015moving). Naturally, all these methods rely on the prediction accuracy of both models: the classifier in the first step and the predictive model (either a classifier or a regressor) in the second step. Which model to choose in the two steps is very problem dependent and is up to the researcher. This work proposes the adoption of the Classification and Regression Trees (CART) based Gradient Boosting algorithm for both steps.

3.2 The Gradient Boosting algorithm

Gradient Boosting is a very powerful algorithm for building predictive models for both classification and regression tasks. It is an ensemble algorithm that relies on the concept of boosting, a technique for reducing bias and variance in supervised learning first introduced in the seminal paper of schapire1990strength. The term Gradient in front of Boosting refers to a very flexible formulation of boosting, first proposed by friedman2001greedy. This particular version exploits Gradient Descent in order to speed up the optimization of the loss function of interest. It has been chosen among a set of Machine Learning algorithms for both steps because of its desirable combination of reduced computational burden and good performance in both tasks.

Let us consider the usual set of covariates $x$ and the response variable $y$. The final aim of any supervised learning algorithm is to train itself on a set of data whose covariates and response variables are known and then produce an approximation $\hat{F}(x)$ to the function $F^*(x)$ that relates $x$ to the expected value of $y$. The approximation is obtained in such a way that the expected value of a pre-specified loss function $L(y, F(x))$ is minimized with respect to the joint distribution of all the pairs $(x, y)$ in the set of data. In practice, the algorithm learns from the examples provided to it in the form of a training set $\{(x_i, y_i)\}_{i=1}^{n}$ and looks for the approximation $\hat{F}$ such that:

$$\hat{F} = \arg\min_{F} \, \mathbb{E}_{x,y}\left[ L(y, F(x)) \right]$$

The choice of the loss function depends on the nature of the problem and of the outcome variable. For instance, in the case of a regression task, the loss function usually adopted is the Mean Squared Error. The peculiarity of the boosting procedure is that it approximates $F^*$ using a function of the form:

$$F(x) = \sum_{m=1}^{M} \beta_m h_m(x)$$

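where the $h_m(x)$ are functions known as base learners and the $\beta_m$ are real coefficients. The base learners are functions of $x$ derived from another, simpler learning algorithm, and the $\beta_m$'s are expansion coefficients used to combine the base learners' outcomes. Both the base learners and the expansion coefficients are estimated from the training set using a forward-stagewise procedure. Like any iterative algorithm, it starts from an initial guess $F_0(x)$, and then, for $m = 1, \dots, M$, the new coefficient and learner are derived as:

$$(\beta_m, h_m) = \arg\min_{\beta, h} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + \beta h(x_i)\right), \qquad F_m(x) = F_{m-1}(x) + \beta_m h_m(x) \tag{6}$$

  initialize $F_0(x)$ with a constant guess
  for $m = 1$ up to $M$ do
    compute the pseudo-residuals of the current fit $F_{m-1}$
    fit the base learner $h_m$ to the pseudo-residuals
    compute the expansion coefficient $\beta_m$
    update $F_m(x) = F_{m-1}(x) + \nu \beta_m h_m(x)$, with $\nu$ the shrinkage parameter
  end for
Algorithm 1 Gradient Boosting pseudo-code example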

Unfortunately, choosing the best pair $(\beta_m, h_m)$ at each step for an arbitrary loss function is in general a computationally infeasible optimization problem. This is where gradient descent plays a key role, leading to the Gradient Boosting algorithm. It solves the optimization problem in Equation 6 through an approximation that is valid whenever the loss function is differentiable. At each step $m$, the base learner $h_m$ is chosen as the best fit to the pseudo-residuals deriving from the previous step:

$$r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \qquad i = 1, \dots, n$$

The pseudo-residuals play the role of the negative gradient, driving the optimization procedure in the right direction step after step. In this simplified framework, given the base learner $h_m$, the best value for $\beta_m$ can be obtained as:

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + \beta h_m(x_i)\right)$$

A very common modification to the standard gradient boosting algorithm is the addition of a shrinkage parameter $\nu \in (0, 1]$, which modifies the update rule in the following way:

$$F_m(x) = F_{m-1}(x) + \nu \, \beta_m h_m(x)$$

This parameter controls the learning rate of the algorithm and allows for the regularization of the procedure (efron2016computer). The whole algorithm is summarized in the pseudo-code of Algorithm 1.

The most common version of Gradient Boosting uses fixed-size CART (usually small, with a low number of branches and/or splits) as base learners, whose predictive ability is strongly enhanced by their boosting combination (efron2016computer). Neither the shrinkage parameter nor the parameters that define each single tree (number of splits, number of branches, etc.) are estimated during the procedure. In the Machine Learning context they are known as tuning parameters: they need to be chosen in advance and stay fixed. Typically, they are selected via a search procedure based on cross-validation in order to avoid over-fitting (friedman2001elements).
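The forward-stagewise procedure with shrinkage can be sketched in a few lines for the squared-error loss. The code is an illustrative toy, not the gbm implementation used in the paper: the base learner is a one-split regression stump on a single feature, the data are synthetic, and the names `fit_stump`, `gradient_boost`, `n_iter` and `shrinkage` are hypothetical.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump on a single feature (the base learner)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_iter=100, shrinkage=0.1):
    """Gradient boosting for squared-error loss with stumps as base learners."""
    f0 = sum(y) / len(y)                     # initial guess: the mean minimises the MSE
    learners = []
    pred = [f0] * len(y)
    for _ in range(n_iter):
        # For L = (y - F)^2 / 2 the pseudo-residuals are the ordinary residuals.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, residuals)
        # With leaf means as outputs, the optimal expansion coefficient is 1,
        # so only the shrinkage factor scales the update.
        learners.append(h)
        pred = [pi + shrinkage * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + shrinkage * sum(h(xi) for h in learners)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [i / 10 for i in range(60)]
y = [xi ** 2 for xi in x]                    # a non-linear toy target
weak = gradient_boost(x, y, n_iter=5)
strong = gradient_boost(x, y, n_iter=200)
mse_weak, mse_strong = mse(weak, x, y), mse(strong, x, y)
print(mse_weak, mse_strong)
```

The number of iterations and the shrinkage are fixed in advance here, which mirrors their role as tuning parameters; the training error shrinks as more small learners are added.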

3.3 Estimation of uncertainty

The main issue with machine learning algorithms is that they provide point estimates but cannot rely on any modeling assumption in order to derive interval estimates. However, it is possible to address this weakness by resorting to a bootstrap approach (efron2003second), as proposed in heskes1997practical.

This method consists of drawing $B$ different bootstrap samples from the original training set, so as to obtain different samples approximately distributed according to the joint distribution of $x$ and $y$ in that training set. The idea is to fit the same model on each of the $B$ samples, each of which provides a different approximation $\hat{F}_b$, $b = 1, \dots, B$, of the function relating the covariates to the expected value of the response variable $y$. Using each of these functions, it is possible to produce $B$ sets of predictions for all the $x_i$'s in the training set, that is, $B$ different predictions $\hat{y}_i^{(b)} = \hat{F}_b(x_i)$ for each unit $i$. The procedure is outlined in Figure 1.

Figure 1: Bootstrap procedure scheme.

In this way, a sort of empirical posterior distribution for the prediction of each $y_i$ is obtained, from which interval estimates can be derived in whatever way is preferred. For instance, one may pick the $\alpha/2$ and $1 - \alpha/2$ empirical quantiles in order to obtain a $(1 - \alpha)$ equal-tail posterior interval for the prediction.
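The bootstrap scheme can be sketched as follows. This is only an illustration under assumptions: a least-squares line stands in for the learner (any model could be plugged in), the data are synthetic, and `n_boot` and `alpha` are hypothetical names for the number of bootstrap samples and the interval level.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (a stand-in for any learner)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def bootstrap_intervals(xs, ys, n_boot=200, alpha=0.05, seed=0):
    """Equal-tail percentile intervals for the prediction at each training x."""
    rng = random.Random(seed)
    n = len(xs)
    preds = [[] for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]     # resample with replacement
        model = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        for j, x in enumerate(xs):
            preds[j].append(model(x))
    lo_k = int((alpha / 2) * (n_boot - 1))
    hi_k = int((1 - alpha / 2) * (n_boot - 1))
    intervals = []
    for p in preds:
        p.sort()                                       # empirical distribution of predictions
        intervals.append((p[lo_k], p[hi_k]))
    return intervals

rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]    # synthetic data
intervals = bootstrap_intervals(xs, ys)
print(intervals[50])
```

Each interval summarizes how much the fitted prediction at a given covariate value changes across resampled training sets, so its width reflects the variability of the fit rather than that of a single observation.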

4 Application to real data

The available set of data is composed of all the individual firms included in the administrative register for the taxation period 2007-2014.

Due to the unavoidable delay in the availability of the information declared by taxpayers, there is a lag between the fiscal (audited) year and the year in which the control is performed. Moreover, since a fiscal year can no longer be audited after five years, data covering five or six years of activity are necessary in order to have complete information on a specific audited tax year. For these reasons, the analysis is carried out on data from the year 2011. The population of interest, referred to the fiscal year 2011, is composed of millions of individual firms, of which only a small fraction has been audited. The explanatory variables, taken from the Tax Administration database, cover the following areas of information: demographics (gender, age, province, region); economic sector of operation; taxable income type; revenues, costs, reported income, tax base, gross tax and net tax; presumptive turnover provided by Business Sector Studies.

The software considered for the analysis of the dataset was the open-source R and the data mining oriented SAS Enterprise Miner. In the end, R was chosen for its greater flexibility. Unfortunately, due to the great sensitivity of the data, the whole analysis had to be performed on a single computer (Processor: Intel Pentium dual-core E1040; RAM: 4 GB) without any possibility of moving the data or relying on external virtual machines. These hardware limitations not only slowed down the implementation of the algorithm, but also made it impossible to consider the whole set of data because of the limited RAM. This issue has been overcome by considering only a sub-sample of the units, which limits the reach of the results of this paper but still allows the proposed procedure to be evaluated and compared with its competitor.

Given the large imbalance between audited and not-audited taxpayers in the population, sub-sampling has been applied only to the not-audited ones. As shown in Table 1, the analysis has been carried out on a sample of the not-audited taxpayers, stratified with respect to the demographic variables, together with all the audited ones.

Table 1: Total and sampled population of individual firms. Columns: fiscal audits; frequency and percentage in the total population; frequency and percentage in the sub-sample.

The declared tax base and all the covariates have been retrieved from the tax administration database, based on the PIT, VAT and IRAP declaration models, and are available for the whole population. On the other hand, the documents resulting from the fiscal audits provide the effective potential tax base for the non-randomly selected units subject to controls. The tax evasion amount for each firm $i$ is denoted as $\mathrm{BIND}_i$ (undeclared tax base) and is obtained as the difference between the BIT (potential tax base) and the BID (declared tax base):

$$\mathrm{BIND}_i = \mathrm{BIT}_i - \mathrm{BID}_i$$

The final aim of this application is to provide an estimate of the VAT gap for the individual firms in the considered sub-sample according to the procedures introduced in Section 3, and to compare the performance with the standard Heckman model. Moreover, predictions on the individual firms will be used to examine the evasion propensity at different levels of disaggregation, identifying categories of taxpayers that are more prone to evade (at higher risk of evasion) than others.
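How individual predictions could be aggregated into group-level propensities can be sketched as follows. The records, the grouping variable and the definition of the propensity as undeclared over potential tax base are assumptions made for illustration; this section does not state the exact formula used by the authors.

```python
from collections import defaultdict

# Hypothetical records: (group, declared base BID, predicted potential base BIT).
firms = [
    ("retail", 80.0, 100.0),
    ("retail", 45.0, 50.0),
    ("construction", 30.0, 60.0),
    ("construction", 90.0, 90.0),
]

totals = defaultdict(lambda: [0.0, 0.0])     # group -> [sum of BIND, sum of BIT]
for group, bid, bit in firms:
    bind = bit - bid                         # undeclared tax base of the firm
    totals[group][0] += bind
    totals[group][1] += bit

# Assumed definition: propensity = total undeclared base / total potential base.
propensity = {g: bind_sum / bit_sum for g, (bind_sum, bit_sum) in totals.items()}
overall = sum(t[0] for t in totals.values()) / sum(t[1] for t in totals.values())

for group, value in propensity.items():
    print(f"{group}: {value:.3f}")
print(f"overall: {overall:.3f}")
```

Ratios of this kind can be computed for any grouping variable available on the whole population, which is what makes individual-level predictions useful beyond the aggregate gap.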

It is worth pointing out that the second step of the 2-steps methodology concerns only the audited taxpayers, hence the sub-sampling of the not-audited ones directly affects only the first step of the procedure (Section 3.1). The estimation of the inclusion probabilities, and therefore the correction of the selection bias, may be less accurate. However, given the great imbalance between audited and not-audited units in the population, it is well known in the literature that a sub-sampling strategy aimed at re-balancing the two groups does not appreciably impair the final estimates (more2016survey).

The variables coming from the different sources have been cleaned (elimination of constant variables and of repeated variables) and joined in one single dataset. Further data pre-processing has been considered but did not lead to any performance improvement. Indeed, the machine learning algorithms adopted here are not affected by the usual data issues of standard linear modeling, such as multicollinearity, skewness, deviations from Normality assumptions and so on.


After that, the group of audited and not audited units have been compared; in particular, it may be interesting to notice the great difference in terms of declared tax base between the two groups. Table 2 shows the main summary statistics for the BID variable.

Table 2: Comparison of the BID summary statistics between audited and not-audited units. Columns: not-audited, audited. Recoverable rows: standard deviation, 25th percentile.


The BID value for the audited units is far greater than the one for the not-audited ones, with the mean of the former significantly greater than the mean of the latter (t-test). It seems that the selection criteria of the fiscal audits tend to favor taxpayers with a higher BID, supporting the point that the sampling design does depend on some individual characteristics of the units. As a matter of fact, the selection may also depend on other variables, and the more general picture is further investigated in Section 4.1.

Finally, both the 2-steps Gradient Boosting and the Heckman model (standard model currently used by the Italian Revenue Agency) have been fitted to the same set of data. Results are shown and compared in Section 4.1.

4.1 Model fitting and uncertainty assessment

As explained in Section 3.1, two different Gradient Boosting models have been fitted in sequence. The first one is a classification model aimed at estimating the inclusion probabilities, while the second one is a weighted regression model aimed at estimating the potential tax base. Both models have been fitted to the data and summarized in R using functions from the package gbm (ridgeway2007generalized). The fit and cross-validation of both the steps required about hour for a single run.

First step

The classification model of the first step has been fitted on the whole sub-sampled population of units, with the audited/not-audited indicator as target. The final outcomes are approximations to the inclusion probabilities of each unit in the fiscal audit sample. The Gradient Boosting fit depends on some major tuning parameters, such as the number of iterations $M$, the depth of each single tree, the learning rate $\nu$, etc. They are not directly estimated by the model, but have to be chosen before the fitting. That is why the standard solution is to fix a grid of values for each of them, fit the model for every possible combination, and pick the one which returns the best performance in terms of some chosen metric. The resolution and extent of the grid must be chosen taking into account the computational time needed to fit the model for every combination. The following ones have been chosen for the tuning parameters in this application:

Generally, a sufficiently large number of iterations $M$ allows the Gradient Boosting algorithm to reach convergence for a given learning rate $\nu$. It cannot be excluded that even better results could have been achieved with a greater number of iterations, which usually pays off for low values of $\nu$. However, given the low computational power available, and since the computational cost of Gradient Boosting increases with $M$, its value has been capped. It may be a good idea to perform a sensitivity analysis in order to verify the robustness of the algorithm with respect to other choices of the tuning parameters, but this has been left out of this work for the sake of brevity and time.

The performance cannot be evaluated directly in terms of fit on the training set, because machine learning techniques are so flexible that they risk over-fitting (friedman2001elements). For this reason, the tuning parameters have been chosen by splitting the sample into a training set and a testing set. The model is trained to estimate the probability of an audit using only information from the units in the training set, and it is then used to project what it has learned onto the units in the testing set. According to this approach, the best set of parameters is the one that achieves the best score on the testing set.
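The selection of tuning parameters on a held-out testing set can be sketched as a plain grid search. Everything in the example is an assumption made for illustration: the learner is a toy nearest-neighbour average rather than Gradient Boosting, the 70/30 split and the grid values are hypothetical, and the metric is the mean squared error.

```python
import itertools
import random

def knn_predict(train, x, k, power):
    """Toy learner: k-nearest-neighbour average, weighted by 1 / distance**power."""
    nearest = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    weights = [1.0 / (abs(xi - x) + 1e-9) ** power for xi, _ in nearest]
    return sum(w * yi for w, (_, yi) in zip(weights, nearest)) / sum(weights)

def holdout_mse(train, test, **params):
    """Mean squared error on the testing set for one parameter combination."""
    return sum((y - knn_predict(train, x, **params)) ** 2 for x, y in test) / len(test)

def grid_search(train, test, grid):
    """Evaluate every combination in the grid and return the best (score, params)."""
    names = list(grid)
    results = []
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        results.append((holdout_mse(train, test, **params), params))
    return min(results, key=lambda r: r[0])

rng = random.Random(3)
data = [(x, (x - 5) ** 2 + rng.gauss(0, 2))
        for x in (rng.uniform(0, 10) for _ in range(300))]
rng.shuffle(data)
train, test = data[:210], data[210:]             # hypothetical 70/30 split
grid = {"k": [1, 5, 15, 50], "power": [0, 1]}    # hypothetical grid of tuning parameters
best_score, best_params = grid_search(train, test, grid)
print(best_score, best_params)
```

The cost of the search grows with the product of the grid sizes, which is why the extent of the grid has to be balanced against the available computing time.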

In particular, of the units have been allocated to the training while the remaining to the testing (see Table 3). The metric chosen to evaluate the model performance in this step is the AUC score (fawcett2006introduction).

Table 3: Training and testing set composition. Columns: train set, test set.
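For reference, the AUC score can be computed directly from the predicted probabilities and the true labels. The sketch below uses the rank-sum (Mann-Whitney) formulation with average ranks for ties; the function name `auc_score` and the toy labels and scores are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Uses the rank-sum (Mann-Whitney) formulation, with average ranks for ties.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1           # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = [0, 0, 1, 1]                        # 1 = audited, 0 = not audited
scores = [0.1, 0.4, 0.35, 0.8]               # predicted inclusion probabilities
print(auc_score(labels, scores))             # 0.75
```

A value of 0.5 corresponds to a classifier that ranks audited and not-audited units at random, while 1 corresponds to a perfect ranking, which makes the metric suitable for strongly imbalanced targets such as the audit indicator.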

The optimal choice returned an AUC value of and was associated to the following set of parameters:

A greater number of iterations $M$, combined with lower values of the learning rate $\nu$, may provide even better results. This may be the object of further investigation in future applications.

The Gradient Boosting procedure returns the variables which had the greatest importance in the fitting process (those that played a role in most of the splits). In this case, the most discriminant variables were the declared tax base (BID), the activity branch, the dimension and the income of the firm. This is consistent with the results reported in Table 2. The probability that a taxpayer is selected is related to its declared tax base, income and dimension: this makes sense, since the richer the firm the more it can potentially evade.

Second step

In fitting the second model, each unit has been weighted by the inverse of its predicted inclusion probability from the previous step. In practice, each unit received a weight, computed according to equation 5, equal to:

Given that, the same procedure as in the first step is adopted for the validation of the second Gradient Boosting, which involved only the audited units (units in ). The sample is split into a training set ( of the units, ) and a test set ( of the units, ), as shown in Table 3. Since this step consists of a regression problem, the optimal parameters have been chosen according to the index. The best value obtained for the on the testing set is , with tuning parameters:
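The effect of weighting the audited units by the inverse of their inclusion probability can be illustrated with a small example. This is a sketch under stated assumptions: the weight is taken to be simply the reciprocal of the predicted probability (the exact form of equation 5 is not reproduced here), and the toy population, the function names and the use of a weighted mean in place of a weighted Gradient Boosting fit are all hypothetical.

```python
def inverse_probability_weights(probabilities):
    """Weight each audited unit by 1 / (predicted inclusion probability)."""
    weights = []
    for p in probabilities:
        if not 0.0 < p <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        weights.append(1.0 / p)
    return weights


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy population (assumption): four units worth 10, audited with probability
# 0.5, and four units worth 20, audited with probability 0.25.
# The population mean is therefore 15.
audited_values = [10.0, 10.0, 20.0]
audited_probs = [0.5, 0.5, 0.25]

weights = inverse_probability_weights(audited_probs)
naive = sum(audited_values) / len(audited_values)   # ignores the selection
corrected = weighted_mean(audited_values, weights)  # re-weights the sample
print(naive, corrected)
```

In this example the unweighted mean of the audited units is about 13.3, while the weighted mean recovers the population value of 15, because units that were unlikely to be audited count for more.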

Figure 2 compares predicted and observed values on the test set. In the case of perfect prediction, all the points would lie on the bisector. Most of the errors, and the largest ones, correspond to an underestimation of the true potential tax base.

Figure 2: Comparison of observed and predicted values on the test set for the BIT.

Interval estimates for both the training set and the testing set have been produced using the technique in Section 3.3 only on the second step of the procedure. The number of bootstrap samples from the training set has been fixed to , each with size equal to the original training set size . Therefore, at the end of the procedure, set of predictions are obtained for each observation . The resulting intervals, obtained by computing the and quantiles of the empirical distributions of the bootstrapped predictions , contained the true values only in the of the cases. The coverage is not satisfactory, but it is important to highlight that the outcome predictions approximate the conditional expected value . Therefore, the algorithm is bootstrapping the distribution of the prediction to , which is by definition far less variable than the point observation corresponding to the set of covariates . Currently, there is no unified framework for producing prediction intervals in Machine Learning. Other recently proposed techniques which focus on the uncertainty of the point observation are not discussed in this paper. They are based on bootstrapping the prediction error (coulston2016approximating) or on building predictive models for the resulting prediction errors (shrestha2006machine). A positive aspect of the obtained intervals is that the coverage is for both the training and the test set, which is reassuring about the risk of over-fitting on the training data. Furthermore, a greater value for is recommended to improve the approximation. However, with only bootstrap samples the algorithm took approximately hours to provide the final sets of predictions on the considered hardware. This is not a major issue, since the computational time is linear in and can be drastically reduced by using a faster processor and parallelizing the procedure over a reasonable number of cores.
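The reason why such intervals cover few individual observations can be seen in a small simulation. The sketch below is only an illustration under explicit assumptions: a least-squares line stands in for the second-step Gradient Boosting, the data are synthetic, and the number of bootstrap samples (`n_boot=100`) and the interval level (`level=0.95`) are hypothetical values rather than those used in the paper.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def percentile(sorted_values, q):
    """Empirical quantile with linear interpolation, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def bootstrap_intervals(x, y, x_new, n_boot, level, rng):
    """Percentile intervals of the bootstrapped predictions at each x_new."""
    n = len(x)
    preds = [[] for _ in x_new]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        a, b = fit_line([x[i] for i in idx], [y[i] for i in idx])
        for j, xj in enumerate(x_new):
            preds[j].append(a + b * xj)
    alpha = (1.0 - level) / 2.0
    intervals = []
    for p in preds:
        p.sort()
        intervals.append((percentile(p, alpha), percentile(p, 1.0 - alpha)))
    return intervals


# Synthetic data (assumption): linear signal with unit-variance noise.
rng = random.Random(1)
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

intervals = bootstrap_intervals(x, y, x, n_boot=100, level=0.95, rng=rng)
coverage = sum(lo <= yi <= hi for (lo, hi), yi in zip(intervals, y)) / len(y)
print(coverage)  # share of observations falling inside their interval
```

Because the intervals describe the variability of the fitted conditional mean and not that of a single observation, the share of observed values they contain stays well below the nominal level, which mirrors the behaviour discussed above.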

Finally, the predictions on the test set are compared to the ones produced by the standard Heckman model on the same split. While the aggregate estimates of the total BIT are very close to each other, differences emerge in terms of individual estimates. In particular, the obtained by the estimates from the Heckman model is equal to , which is considerably lower than the one achieved by our new approach. Results are summarized in Table 4.

Gradient Boosting   Heckman Observed
Table 4: Performances on the test set of the 2-steps Gradient Boosting and the Heckman model.

It seems that the Heckman model is able to capture the general behaviour, but lacks the flexibility needed to reproduce individual values. Linearity is too strong a constraint for such a complex problem, and the Gradient Boosting is not limited by it.

4.2 Forecasting the Potential VAT Gap

Finally, the 2-steps Gradient Boosting has been used to produce predictions for all the units whose actual BIT is unknown (not-audited taxpayers). On the sub-sample actually analyzed (see Table 1), a VAT gap turnover of about of euro (€) has been predicted. This result is very close to the value returned by the Heckman model on the same sub-sample, which is about of euro (€). Using the approach introduced in Section 3.3, a credibility interval for the VAT gap (total BIND) has been built. The resulting interval contains the value estimated by the Heckman model, highlighting again the coherence between the two techniques.
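One way to turn bootstrapped predictions into an interval for the aggregate gap is to sum each bootstrap replicate over the units and take quantiles of the resulting totals. The sketch below assumes that the undeclared tax base of a unit is the predicted potential tax base minus the declared one; the function names, the five hypothetical replicates and the 50% level are illustrative only and are not taken from the paper.

```python
def percentile(sorted_values, q):
    """Empirical quantile with linear interpolation, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def total_gap_interval(boot_potential, declared, level):
    """Percentile interval for the total undeclared base.

    boot_potential holds one list of predicted potential tax bases per
    bootstrap replicate; declared holds the declared base of each unit.
    """
    totals = sorted(
        sum(p - d for p, d in zip(replicate, declared))
        for replicate in boot_potential
    )
    alpha = (1.0 - level) / 2.0
    return percentile(totals, alpha), percentile(totals, 1.0 - alpha)


# Hypothetical figures for three not-audited units and five replicates.
declared = [100.0, 50.0, 80.0]
boot_potential = [
    [130.0, 70.0, 100.0],
    [135.0, 72.0, 103.0],
    [125.0, 68.0, 97.0],
    [140.0, 75.0, 105.0],
    [120.0, 65.0, 95.0],
]

lower, upper = total_gap_interval(boot_potential, declared, level=0.5)
print(lower, upper)  # the replicate totals are 50, 60, 70, 80, 90
```

With these numbers the central 50% interval of the total gap is (60, 80); in a real application the number of replicates would be much larger and the level closer to one.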

Furthermore, these predictions allowed the computation of a synthetic measure of VAT evasion propensity, defined along the lines of the evasion intensity used in braiotta2015tax. This is defined for each individual as the ratio of the undeclared tax base to the potential tax base, and estimated as:

Consequently, the general propensity to evade is estimated as:

A low value of this ratio indicates compliant behaviour, and vice versa. The VAT evasion propensity for the entire sample of taxpayers has been estimated to be of the for the 2-steps Gradient Boosting, just slightly larger than the obtained with the Heckman model.

Afterwards, propensities have been computed with both models for different classes of individuals according to some of the observed covariates. These may be used to identify classes of individuals with high VAT evasion propensity and may help in the selection procedure of future fiscal audits. The propensity related to a specific class of individuals is straightforwardly estimated as:

Gradient Boosting Heckman
Sex size
Gradient Boosting Heckman
Age size
Table 5: Propensity behaviour by gender and age classes.
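The propensity measures discussed in this section can be sketched as follows. The individual propensity follows the definition given in the text (undeclared base over potential base); taking the undeclared base as the predicted potential base minus the declared base, and computing the general and class-level propensities as ratios of totals, are assumptions made for this illustration. The function names, the class labels and the figures are hypothetical.

```python
from collections import defaultdict


def individual_propensity(potential, declared):
    """Share of the potential tax base that is not declared by one unit."""
    return (potential - declared) / potential


def aggregate_propensity(potentials, declareds):
    """Ratio of total undeclared base to total potential base (assumed form)."""
    return (sum(potentials) - sum(declareds)) / sum(potentials)


def propensity_by_class(potentials, declareds, classes):
    """Aggregate propensity computed separately within each class."""
    total_potential = defaultdict(float)
    total_declared = defaultdict(float)
    for p, d, c in zip(potentials, declareds, classes):
        total_potential[c] += p
        total_declared[c] += d
    return {
        c: (total_potential[c] - total_declared[c]) / total_potential[c]
        for c in total_potential
    }


# Hypothetical predicted potential bases, declared bases and class labels.
potential = [100.0, 200.0, 50.0, 150.0]
declared = [80.0, 100.0, 50.0, 120.0]
groups = ["A", "B", "A", "B"]

per_unit = [individual_propensity(p, d) for p, d in zip(potential, declared)]
overall = aggregate_propensity(potential, declared)
by_group = propensity_by_class(potential, declared, groups)
print(per_unit, overall, by_group)
```

In this example the overall propensity is 0.3, while the two classes differ markedly (about 0.13 for A and 0.37 for B), which is the kind of contrast that could guide the selection of future audits.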

Also in this context, the two models returned coherent results. For instance, both the higher propensity to evade VAT for female taxpayers and the decreasing propensity with age emerge from both approaches. On the other hand, the 2-steps Gradient Boosting model seems to emphasize these differences: while the difference in VAT propensity between males and females is just 2 percentage points in the Heckman model, this gap is of more than points for the 2-steps Gradient Boosting approach (see Table 5). The situation is analogous for age, where the gap passes from a points gap for Heckman to a points gap for the Gradient Boosting. Again, the results are coherent in terms of general behaviour, but the Gradient Boosting appears better able to capture specific variability. Indeed, given the better of the Gradient Boosting on the test set, it is plausible that this greater variability is actually present in the population but the Heckman model is not able to capture it.

5 Concluding remarks

This work presents a non-parametric approach for the estimation of the VAT gap based on the Gradient Boosting algorithm, a machine learning technique for regression and classification problems. This new approach appears preferable and better suited to this kind of data because it is distribution free, it can handle any kind of variable and it is not sensitive to multicollinearity issues. Furthermore, machine learning based models usually perform well also in high dimensional settings, and make it possible to exploit all the information contained in large sets of data.

In practice, the estimates of the VAT gap obtained through the two approaches are very similar; however, the Gradient Boosting based model produced considerably more accurate estimates of the individual undeclared tax bases, capturing the variability associated with the observed variables, as is desirable. The Heckman model, on the contrary, seems to flatten out individual differences.

The possible further developments of this kind of approach are various and promising. For instance, the analysis presented in this work has been performed only on a small subset of all the available observations because of hardware limitations. We are confident that considerably better results may be achieved by analyzing the whole set of data. Moreover, further investigation of methods for building more reliable interval estimates is of primary interest, since this is one of the main drawbacks of the methodology. Last but not least, greater computational power would allow the application of a complete k-fold cross-validation for the tuning of the parameters, even on finer grids, and the use of more expensive but efficient techniques such as Extreme Gradient Boosting and Neural Networks.