1 Introduction
Tax evasion is the illegal non-payment or underpayment of taxes by individuals, corporations, and trusts. It results in a loss of State revenue that may be substantial and may undermine the effectiveness of state policies. It represents one of the main problems in modern economies, where the government budget is constantly under scrutiny (santoro2010evasione). One measure of the extent of tax evasion is the so-called tax gap, that is, the amount of unreported income, evaluated as the difference between the amount of income that should be reported to the tax authorities and the amount actually reported.
The aim of this paper is to introduce a robust methodology to provide estimates of the tax gap through a bottom-up approach based on fiscal audits (braiotta2015tax; kumar2015minimising). In particular, the analysis focuses on the estimation of the Italian VAT gap related to individual firms, for the fiscal year 2011, based on information gathered from administrative sources. Two sources of data were employed: the Tax Administration Database ("Anagrafe Tributaria") and the Operational Audits Database. The first one contains information on the whole population coming from PIT, VAT and IRAP declarations, but does not include the amount of income that should actually be reported to the tax authorities (potential tax base). The second one provides information only on a non-random sample of units (audited units) and contains the so-called potential tax base. More details on the Italian tax system and the operational audit activity are presented in Section 2.
The final target of the analysis is to obtain reliable estimates of the potential tax base, that is, the potential VAT turnover. The model estimated on the audited taxpayers should be able to infer the potential tax base of the non-audited ones. In this way, it is possible to estimate the total potential tax base of the Italian system (BIT), compare it to the declared tax base (BID), and estimate the total undeclared tax base (BIND) as the difference between the estimated BIT and the known BID. The main issue of this analysis is that audited taxpayers are usually a very small portion of the total population and are selected non-randomly, as in the case of the controls performed by the Italian Revenue Agency (d2016general). This induces a selection bias on the observed response variable which invalidates the application of standard statistical methods (sarndal2003model; sarndal2005estimation).
A machine learning approach based on 2-step Gradient Boosting models, building on the work by zadrozny2004learning, is proposed in place of the standard methodology based on the Heckman model (heckman1976common; heckman1979sample) in order to improve the estimation accuracy. The first step estimates the inclusion probabilities of the audited taxpayers, so that they can be used to correct for the selection bias in the second step. The adoption of machine learning algorithms is driven by the necessity to move beyond linearity and by their ability to extract information from very large sets of data (both in terms of units and variables) without any strong distributional assumption. In the domain of this paper, the Gradient Boosting algorithm, introduced by friedman2001greedy, is chosen over its alternatives. It is a very robust ensemble learner, able to deal with data of any size (if adequate computational power is available) and nature. More details on the methodology are presented in Section 3. Afterwards, in Section 4, the proposed methodology is applied to estimate the potential VAT turnover for a subsample of the Italian individual firms for the fiscal year 2011. Results on this subsample are then compared to the ones obtained using the Heckman model, which turns out to be largely outperformed in terms of predictive performance. Finally, in Section 4.2, the VAT tax gap and the VAT tax evasion propensities are predicted for the not-audited taxpayers in the considered subsample.
2 Italian Taxation System
"Every person shall contribute to public expenditure in accordance with his/her taxpayer capacity. The taxation system shall be based on criteria of progression."
Const. Art. 53, Section I, Political rights and duties
The Italian tax system is based on three fundamental principles, which are discussed in the 53rd article of the Italian Constitution in the “Political Rights And Duties” section.

Universality of taxation: all citizens must contribute to public expenses through the payment of taxes. These are aimed at financing the operation of the state machinery and are returned in the form of benefits and services for citizens. Those who are below a minimum income are exempt from the tax obligation and can still take advantage of all services, by virtue of the principle of economic and social solidarity.

Ability to contribute: the payment of taxes is based on the economic means of citizens. This principle guarantees a fair distribution of the tax burden, excluding those who do not have the capacity to pay taxes.

Criteria of progression: the payment of taxes by citizens varies with the potential tax base. This means that everyone pays taxes according to their economic means, with a contribution that grows as income increases.
The most valuable taxes in terms of importance of the revenue are the Personal Income Tax (PIT) and the Value Added Tax (VAT), which contribute the of the total revenue.
The Personal Income Tax is the main tax paid in Italy. It is levied on all persons fiscally resident in Italy and on all types of income they receive from whatever source. Each person is taxed progressively according to their total income, whether it comes from employment, pension or self-employment. It constitutes one of the main sources of revenue for the state coffers. The study of PIT evasion and its gap is addressed in braiotta2015tax.
The Value Added Tax is the second most important. As stated by the European Commission, "it serves to tax the consumption of goods and assets services. It is applied to all commercial activities involving the production and distribution of goods and the provision of services". This kind of tax is not progressive and weighs entirely on the final consumer, while for the taxable person (the entrepreneur and the self-employed) it remains neutral.
Checking that taxpayers correctly fulfil their obligations falls within the institutional tasks of the Italian Revenue Agency. These checks, carried out with different methods and at different times, evaluate the correspondence between what is declared and what is actually due (d2016general). The notice of assessment for income and VAT taxes must be served within four years following the declaration date. For example, the assessments for the declarations presented in 2012, which concern 2011 incomes, were concluded on December 31, 2017.
Generally, the declarations presented by taxpayers can be subject to various types of audits. For instance, the substantial control is aimed at adjusting the total income or turnover of the taxpayer. In general, the audited taxpayers are not selected randomly but are identified on the basis of predetermined selection criteria. The rationale for this selection is that evaders are assumed to behave differently from those who do not evade. These criteria are therefore believed to identify such evasive behaviour and to optimize the selection of tax evaders. Controls of this kind are very costly in terms of both money and time; for this reason, the assessment is viable only for a very small portion of the total population, and the amount evaded by the remainder stays unknown. That is why accurate methods for estimating and predicting the level of evasion and the propensity to evade for not-audited individuals as well would be really valuable, and may provide policymakers with better tools and information to counter the phenomenon of tax evasion.
3 Model and methodology
The planning of public investments and initiatives is strictly linked to the predicted potential tax base BIT of the whole economic system expected for the current and following years. However, the estimate of the potential tax base is always biased upward with respect to the actual income coming from taxes. As a matter of fact, the declared tax base BID, which is the actual amount of tax paid by all the economic subjects, is always lower than the BIT due to the well-known problem of tax evasion. The difference between these two quantities is the undeclared tax base BIND, commonly known as the tax gap:
$$\mathrm{BIND} = \mathrm{BIT} - \mathrm{BID}$$
The quantification of the tax gap is important in order to support decisions on state economic policies. Moreover, by exploiting information from previous years, it may be possible to extrapolate its amount and take it into account for the upcoming years. However, while the BID is available at the individual level on the basis of the income tax returns submitted by the individual economic subjects, the potential tax base BIT is generally unknown at both the aggregate and the individual level.
Nevertheless, thanks to the bottom-up approach based on fiscal audits, the BIT is detected every year on a sample $s$ of audited taxpayers from the whole population $U$: for every unit $i \in s$, $\mathrm{BIT}_i$ is known. However, these units are not randomly selected: the selection is based on arbitrary (and unknown) criteria established by the Italian Revenue Agency (d2016general). The logic behind this choice is driven by the need to maximize the selection of taxpayers who have an increased risk of hiding part of their income. This means that the selection mechanism is not ignorable with respect to the response variable and, therefore, any estimation procedure based on such a sample would be affected by a potentially strong selection bias. This kind of bias invalidates any inference from the sample to the whole population of economic subjects (sarndal2005estimation). For instance, let us denote by $y$ our variable of interest and let it be observed only on a non-random sample $s$ of units selected according to a sample design $p(s)$. The expected value for each unit $i$ would be:
(1) $$E[y_i] = E[y_i \mid i \in s]\,P(i \in s) + E[y_i \mid i \notin s]\,P(i \notin s)$$
If the sample design is not independent of the outcome variable, then:
(2) $$E[y_i \mid i \in s] \neq E[y_i \mid i \notin s]$$
and it is not possible to directly get any estimate for $E[y_i \mid i \notin s]$ using solely the observed outcomes. This means that it is not possible to directly get any estimate for the potential tax base of not-audited units by using information on the audited ones. Furthermore, due to the confidentiality of the selection criteria, the selection probabilities of the units in the sample are not available a priori and therefore cannot be used to correct for the non-representativeness of the selected sample.
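The effect of a non-ignorable selection can be illustrated with a small simulation. The sketch below is only an illustration under made-up assumptions: the population, the log-normal outcome and the audit rule `inclusion_prob` are hypothetical and are not taken from the data analysed in this paper.

```python
import random

random.seed(0)

# Hypothetical population: y is a skewed, income-like outcome (not real data).
population = [random.lognormvariate(10, 1) for _ in range(20000)]
median_y = sorted(population)[len(population) // 2]

def inclusion_prob(y):
    # Assumed audit rule: units with a larger outcome are audited more often.
    return 0.10 if y > median_y else 0.01

# Draw the "audited" sample unit by unit according to the assumed rule.
sample = [y for y in population if random.random() < inclusion_prob(y)]

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean:     {pop_mean:.0f}")
print(f"audited-sample mean: {sample_mean:.0f}")
```

Because selection depends on the outcome itself, the mean of the selected units overstates the population mean, which is the situation described by Equation 2.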
These observations are of primary importance when considering the appropriateness of the modeling assumptions for the chosen model. It needs to take the selection bias into account and provide some correction for it. The standard solution, currently adopted by the Italian Revenue Agency, is based on the well-known Heckman model (heckman1979sample), which is based on the combination of logistic and linear regression. The one considered in this paper is a two-step procedure based on the work by zadrozny2004learning that exploits the Gradient Boosting algorithm as a strong learner in both steps. The adoption of such an algorithm makes it possible to move beyond linearity, exploiting all the available information, and to relax some strong assumptions that are hardly met by the available data.
Furthermore, it may be interesting to point out that the availability of data at the individual level makes it possible to go beyond the mere scope of the analysis. Indeed, predictions of the individual VAT gap can be used to compute the propensity to evade for units grouped according to some variables of interest (braiotta2015tax). These propensities can be used to identify possible high-risk taxpayers and drive future selections for tax audits.
3.1 The 2-step approach
The approach introduced here is viable when the complete population list is available together with a common set of covariates observed on each unit. The following hypotheses need to hold.

The probability of being included in the sample for unit $i$ depends only on its covariates $x_i$.

The response variable of unit $i$, $y_i$, is conditionally independent of the sampling design given the covariates $x_i$: $y_i \perp s_i \mid x_i$, where $s_i$ indicates whether unit $i$ is selected in the sample.
The second assumption allows us to transfer all the dependence of the outcome variable on the sample design to the set of covariates:
(3) $$P(y_i \mid x_i, s_i = 1) = P(y_i \mid x_i)$$
Theoretically, Equation 3, which derives from the last assumption, should allow the model to be fitted considering only the selected units. However, there is still an issue of non-representativeness of the sample with respect to the whole population. Since units in the sample differ from units out of the sample in terms of both response and explanatory variables, an estimation based only on the selected sample would rely on a certain amount of extrapolation. The resulting estimates would favor the fit on over-represented units and disregard the fit on under-represented ones. The 2-step approach is adopted to correct this kind of bias using a Horvitz-Thompson style estimation, directly derived from basic survey sampling theory (sarndal2003model), exploiting all the auxiliary information available on the population. It consists of the subsequent application of two predictive models to the available data.

The first model is a classification model. It is trained on the whole population and targets the binary variable $s_i$, equal to 1 if unit $i$ is selected in the sample and to 0 if it is not. It tries to find a specific pattern or some regularity in the mechanism of selection of the units, so that the probability of being selected can be estimated from the auxiliary information included in the covariates $x_i$. The resulting model is able to produce predictions on the whole population which approximate the first-order inclusion probability of each unit:
(4) $$\hat{\pi}_i = \hat{P}(s_i = 1 \mid x_i)$$
The second model targets the response variable $y_i$ and, therefore, is trained only on the units included in the sample (whose response variable is available). However, it is now possible to incorporate the inclusion probabilities resulting from Equation 4 in order to correct for the non-representativeness of the selected units. These can be used to produce the inverse weights defined as:
(5) $$w_i = \frac{\hat{P}(s_i = 1)}{\hat{\pi}_i}$$
where $\hat{P}(s_i = 1)$ is the probability of being selected regardless of the set of covariates. In practice, by weighting each input observation proportionally to the inverse of its selection probability, we reduce the importance of units already over-represented in the sample while increasing the importance of under-represented ones.
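The weighting idea can be made concrete with a toy computation. The following sketch assumes made-up outcomes and made-up inclusion probabilities standing in for first-step predictions; the names `audited`, `population_size` and `marginal_rate` are hypothetical, and the weight is taken as the marginal selection rate divided by the estimated inclusion probability.

```python
# Toy audited sample: (observed outcome y_i, estimated inclusion probability pi_i).
# The probabilities stand in for first-step predictions and are made up.
audited = [
    (100.0, 0.50), (120.0, 0.50), (110.0, 0.50), (130.0, 0.50),  # over-represented
    (20.0, 0.05),                                                # under-represented
]
population_size = 28  # hypothetical N, chosen so that sum(1 / pi_i) equals N
marginal_rate = len(audited) / population_size  # selection rate ignoring covariates

# Inverse-probability weights: marginal rate over estimated inclusion probability.
weights = [marginal_rate / pi for _, pi in audited]

unweighted_mean = sum(y for y, _ in audited) / len(audited)
weighted_mean = sum(w * y for w, (y, _) in zip(weights, audited)) / sum(weights)

# Horvitz-Thompson style estimate of the population total.
ht_total = sum(y / pi for y, pi in audited)

print(unweighted_mean, weighted_mean, ht_total)
```

In this toy case the single under-represented unit receives a weight ten times larger than the others, so the weighted mean moves well below the unweighted one. The same weights could be passed to any learner that accepts observation weights.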
Solutions of this kind are very common in the correction of bias deriving from a non-ignorable sampling design (sarndal2003model), for instance when incorporating the response probability to correct for non-response bias (bethlehem1988reduction; alho1990adjusting), and in the use of the inverse probability of treatment weighting (IPTW) to correct for treatment selection bias in causal inference (hirano2003efficient; austin2015moving). Naturally, all these methods rely on the prediction accuracy of both models: the classifier in the first step and the predictive model (either a classifier or a regression model) in the second step. Which model to choose in the two steps is very problem dependent and is up to the researcher. This work proposes the adoption of the Classification and Regression Trees (CART) based Gradient Boosting algorithm for both steps.
3.2 The Gradient Boosting algorithm
Gradient Boosting is a very powerful algorithm that allows predictive models to be built for both classification and regression tasks. It is an ensemble algorithm that relies on the concept of boosting, a technique for reducing bias and variance in supervised learning first introduced in the seminal paper of schapire1990strength. The term Gradient in front of Boosting refers to a very flexible formulation of boosting, first proposed by friedman2001greedy. This particular version exploits gradient descent in order to speed up the optimization of the loss function of interest. It has been chosen among a set of machine learning algorithms for both steps because of its desirable combination of reduced computational burden and good performance in both tasks.
Let us consider the usual set of covariates $x$ and the response variable $y$. The final aim of any supervised learning algorithm is to train itself on a set of data whose covariates and response variables are known and then produce an approximation $\hat{F}(x)$ to the function $F^*(x)$ that relates $x$ and the expected value of $y$. The approximation is obtained in such a way that the expected value of a prespecified loss function $L(y, F(x))$ is minimized with respect to the joint distribution of all the pairs $(x, y)$ in the set of data. In practice, the algorithm learns from the examples provided to it in the form of a training set $\{(x_i, y_i)\}_{i=1}^{n}$ and looks for the approximation $\hat{F}$ such that:
$$\hat{F} = \arg\min_{F} \sum_{i=1}^{n} L(y_i, F(x_i))$$
The choice of the loss function depends on the nature of the problem and of the outcome variable. For instance, in the case of the regression task, the loss function usually adopted is the Mean Squared Error. The peculiarity of the boosting procedure is that it approximates $F^*$ using a function of the form:
$$F(x) = \sum_{m=1}^{M} \beta_m h(x; a_m)$$
where the $h(x; a_m)$ are functions known as base learners and the $\beta_m$ are real coefficients. The base learners are functions of $x$ derived from another, simple, learning algorithm and the $\beta_m$'s are expansion coefficients used to combine the base learners' outcomes. Both the base learners and the expansion coefficients are estimated using the data from the training set through a forward stagewise procedure. Like any recursive algorithm, it starts from an initial guess $F_0(x)$ and then the new set of coefficients and learner are derived as:
(6) $$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \beta h(x_i; a)\big)$$
and
$$F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m)$$
Unfortunately, choosing the best pair $(\beta_m, a_m)$ at each step for an arbitrary loss function is, in general, a computationally infeasible optimization problem. This is where gradient descent plays a key role, leading to the Gradient Boosting algorithm. It solves the optimization problem in Equation 6 through an approximation that is legitimate whenever the loss function is differentiable. At each step $m$, the base learner is chosen according to the best fit on the pseudo-residuals deriving from the previous step:
$$a_m = \arg\min_{a, \rho} \sum_{i=1}^{n} \big[\tilde{y}_{im} - \rho\, h(x_i; a)\big]^2$$
where:
$$\tilde{y}_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$
The pseudo-residual values play the role of the (negative) gradient, driving the optimization procedure in the right direction step after step. In this simplified framework, given the base learner $h(x; a_m)$, the best value for $\beta_m$ can be obtained as:
$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \beta h(x_i; a_m)\big)$$
A very common modification to the standard gradient boosting algorithm is the addition of a shrinkage parameter $\nu \in (0, 1]$, which modifies the update rule in the following way:
$$F_m(x) = F_{m-1}(x) + \nu\, \beta_m h(x; a_m)$$
This parameter controls the learning rate of the algorithm and allows for the regularization of the procedure (efron2016computer). The whole algorithm is summarized in the pseudocode of Algorithm 1.
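The update rule with shrinkage can be sketched in a few lines for the special case of squared-error loss, where the pseudo-residuals reduce to ordinary residuals. This is a minimal illustration and not the implementation used in the analysis: the base learner is a one-split stump on a single feature, the data are synthetic, and all function names are hypothetical.

```python
import math

def fit_stump(x, r):
    """Least-squares stump on one feature: returns (threshold, left_mean, right_mean)."""
    best = None
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]

def stump_value(stump, xi):
    t, ml, mr = stump
    return ml if xi <= t else mr

def gradient_boost(x, y, n_iter=100, shrinkage=0.1):
    f0 = sum(y) / len(y)  # initial guess: the mean minimises squared loss
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_iter):
        # For L = (y - F)^2 / 2 the negative gradient is the ordinary residual.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        # With squared loss and a least-squares stump the line search gives
        # beta = 1, so only the shrinkage multiplies the new learner.
        pred = [pi + shrinkage * stump_value(stump, xi) for pi, xi in zip(pred, x)]
    return f0, shrinkage, stumps

def predict(model, xi):
    f0, shrinkage, stumps = model
    return f0 + shrinkage * sum(stump_value(s, xi) for s in stumps)

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

# Synthetic, noise-free data used only to show the training error decreasing.
x = [i / 10 for i in range(40)]
y = [math.sin(xi) for xi in x]
model = gradient_boost(x, y)
mse_const = mse(y, [sum(y) / len(y)] * len(y))
mse_boost = mse(y, [predict(model, xi) for xi in x])
print(mse_const, mse_boost)
```

A smaller shrinkage value would typically require more iterations to reach the same training error, which is the trade-off behind the tuning of these two parameters.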
The most common version of Gradient Boosting uses fixed-size CART (usually small, with a low number of branches and/or splits) as base learners, whose predictive ability is strongly enhanced by their boosting combination (efron2016computer). Both the shrinkage parameter and the parameters that define each single tree (number of splits, number of branches, etc.) are not estimated during the procedure. In the machine learning context they are known as tuning parameters, and they need to be chosen in advance and stay fixed. Typically, they are selected via a search procedure based on cross-validation in order to avoid overfitting (friedman2001elements).
3.3 Estimation of uncertainty
The main issue with machine learning algorithms is that they provide point estimates but cannot rely on any modeling assumption in order to derive interval estimates. However, it is possible to address this weakness by resorting to a bootstrap approach (efron2003second), as proposed in heskes1997practical.
This method consists of drawing $B$ different bootstrap samples from the original training set, in order to get $B$ samples approximately distributed according to the joint distribution of $x$ and $y$ in the considered training set. The idea is to fit the same model on each of the $B$ samples, each of which provides a different approximation of the function relating the covariates $x$ and the expected value of the response variable $y$. Using each of these functions, it is possible to produce $B$ sets of predictions for all the $x_i$'s in the training set, that is, $B$ different predictions for each unit $i$. The procedure is outlined in Figure 1.
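The resampling scheme can be sketched as follows. To keep the example short and self-contained, a closed-form straight-line fit is used as a stand-in for the Gradient Boosting model, and the bootstrap predictions are summarised with empirical quantiles as one possible choice; the data, the number of resamples `B` and the level `alpha` are all hypothetical.

```python
import random

def fit_line(pairs):
    """Ordinary least squares for y = a + b * x (stand-in for the real learner)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope

def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

random.seed(1)
# Synthetic training set, used only for illustration.
train = [(x, 2.0 + 0.5 * x + random.gauss(0, 1)) for x in range(30)]

B = 200  # hypothetical number of bootstrap samples
boot_preds = {x: [] for x, _ in train}
for _ in range(B):
    # Resample the training set with replacement, keeping its original size.
    resample = [random.choice(train) for _ in train]
    a, b = fit_line(resample)
    for x in boot_preds:
        boot_preds[x].append(a + b * x)

alpha = 0.10  # hypothetical level: 90% equal-tailed intervals
intervals = {
    x: (quantile(p, alpha / 2), quantile(p, 1 - alpha / 2))
    for x, p in boot_preds.items()
}
print(intervals[0], intervals[29])
```

Note that intervals built this way describe the variability of the fitted function, not of a single new observation, so they are expected to be narrower than prediction intervals.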
In this way, a sort of empirical posterior distribution for the prediction of each $y_i$ is obtained and, from it, interval estimates can be derived in whatever way is preferable. For instance, one may pick the $\alpha/2$ and $1-\alpha/2$ empirical quantiles in order to obtain a $(1-\alpha)$ equal-tailed posterior interval for the prediction.
4 Application to real data
The available set of data is composed of all the individual firms included in the administrative register for the taxation period 2007-2014.
Due to the unavoidable delay in the availability of the information declared by taxpayers, there is a lag between the fiscal (audited) year and the year in which the control is performed. Moreover, since a fiscal year can no longer be audited after five years, data covering five or six years of activity are necessary in order to have complete information on a specific audited tax year. For these reasons, the analysis is carried out on data from the fiscal year 2011. The population of interest is composed of millions of individual firms, of which only a small fraction have been audited. The explanatory variables coming from the Tax Administration database concern the following areas of information: demographics (gender, age, province, region); economic sector of operation; taxable income type; revenues, costs, reported income, tax base, gross tax and net tax; presumptive turnover provided by the Business Sector Studies.
The software packages considered for the analysis of the dataset were the open-source R and the data-mining-oriented SAS Enterprise Miner. In the end, R was chosen for its greater flexibility. Unfortunately, due to the great sensitivity of the data, the whole analysis had to be performed on a single computer (processor: Intel Pentium dual-core E1040; RAM: 4 GB) without any possibility of moving the data or relying on external virtual machines. These hardware limitations not only slowed down the implementation of the algorithm, but also made it impossible to consider the whole set of data because of the limited RAM. This issue has been overcome by considering only a subsample of the units, which limits the reach of the results of this paper but still allows the proposed procedure to be evaluated and compared with its competitor.
Given the large imbalance between audited and not-audited taxpayers in the population, the subsampling has been applied only to the not-audited ones. As shown in Table 1, the analysis has been carried out on a sample of the not-audited taxpayers, stratified with respect to the demographic variables, and on all the audited ones.
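Table 1: Audited and not-audited taxpayers in the total population and in the subsample.
Fiscal audits  Total population: Frequency  Total population: Percentage  Subsample: Frequency  Subsample: Percentage
Not-audited
Audited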
The declared tax base and all the covariates have been retrieved from the tax administration database, based on the PIT, VAT and IRAP declaration models, and are available on the whole population. On the other hand, through the reports resulting from the fiscal audits, it is possible to get the actual potential tax base of the non-randomly selected units subject to controls. The tax evasion amount for each firm $i$ is denoted as $\mathrm{BIND}_i$ (undeclared tax base) and can be obtained as the difference between the BIT and the BID (declared tax base):
$$\mathrm{BIND}_i = \mathrm{BIT}_i - \mathrm{BID}_i$$
The final aim of this application is to provide an estimate of the VAT gap for the individual firms in the considered subsample according to the procedures introduced in Section 3 and to compare the performance with that of the standard Heckman model. Moreover, predictions on the individual firms will be used to examine the evasion propensity at different levels of disaggregation, identifying categories of taxpayers more prone to evade (at higher risk of evasion) than others.
It may be relevant to point out that the second step of the 2-step methodology concerns only the audited taxpayers, hence the subsampling of the not-audited ones directly affects only the first step of the procedure (Section 3.1). The estimation of the inclusion probabilities, and hence the correction of the selection bias, may be less accurate. However, given the great imbalance between audited and not-audited units in the population, it is well known in the literature that a subsampling strategy aimed at rebalancing the classes does not appreciably impair the final estimates (more2016survey).
The variables coming from the different sources have been cleaned (elimination of unary and repeated variables) and joined in a single dataset. Further data preprocessing has been considered but did not lead to any performance improvement. Indeed, machine learning algorithms such as the one adopted here are far less affected by the usual data issues of standard linear modeling, such as multicollinearity, skewness, deviations from Normality assumptions and so on (efron2016computer).
After that, the groups of audited and not-audited units have been compared; in particular, it is interesting to notice the great difference in terms of declared tax base between the two groups. Table 2 shows the main summary statistics for the BID variable.
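Table 2: Summary statistics of the BID variable for not-audited and audited units.
Statistic  Not-audited  Audited
Mean
Median
Standard deviation
25th percentile
50th percentile
75th percentile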
The BID value for the audited units is much greater than the one for the not-audited ones, with the mean of the former significantly greater than the mean of the latter according to a t-test. It seems that the selection criteria of the fiscal audits tend to favor taxpayers with a higher BID, supporting the point that the sampling design does depend on some individual characteristics of the units. As a matter of fact, the selection may also depend on other variables, and the more general picture is further investigated in Section 4.1.
Finally, both the 2-step Gradient Boosting and the Heckman model (the standard model currently used by the Italian Revenue Agency) have been fitted to the same set of data. Results are shown and compared in Section 4.1.
4.1 Model fitting and uncertainty assessment
According to what has been explained in Section 3.1, two different gradient boosting have been subsequently fitted. The first one is a classification model aimed at estimating the inclusion probabilities while the second one is a weighted regressive model aimed at estimating the potential tax base. Both the models in the two steps have been fitted to the data and summarized in the software R using functions from the package gbm (ridgeway2007generalized). The fit and crossvalidation of both the steps required about hour for a single run.
First step
The classification model of the first step has been fitted on the whole subsampled population of units, with the audited/not-audited indicator as target. The final outcomes are approximations of the inclusion probabilities of each unit in the fiscal audit sample. The Gradient Boosting fit depends on some major tuning parameters, such as the number of iterations, the depth of each single tree, the learning rate, etc. They are not directly estimated by the model but have to be chosen before the fitting. The standard solution is therefore to fix a grid of values for each of them, fit the model for every possible combination and pick the one which returns the best performance in terms of some chosen metric. The resolution and extent of the grid must be chosen taking into account the computational time needed to fit the model for every combination. The following ones have been chosen for the tuning parameters in this application:
Generally, a number of iterations of allow to reach convergence for the Gradient Boosting algorithm with a . In general, it is not possible to exclude that even better results could have been achieved for a greater number of iterations, which usually provide better results for low values of . However, given the low computational power available and the computational cost of the Gradient Boosting increasing with , its value has been limited to . It may be a good idea to perform a sensitivity analysis in order to verify the robustness of the algorithm with respect to other choices of the tuning parameters but it has been excluded from this work for the sake of brevity and time.
The performance cannot be directly evaluated in terms of fit on the training set because machine learning techniques are so flexible that they risk being affected by overfitting (friedman2001elements). For this reason, the tuning parameters have been chosen by splitting the sample into a training set and a testing set. The model is trained to estimate the probability of an audit using only information from the units in the training set, and it is then used to project what it has learned onto the units in the testing set. According to this approach, the best set of parameters is the one that achieves the best score on the testing set.
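The selection of tuning parameters on a held-out set can be sketched as a plain grid search. Everything in the example below is an assumption made for illustration: the toy model (bin means shrunk towards the global mean), the grid values, the synthetic data and the 70/30 split do not reproduce the settings used in this application.

```python
import itertools
import random

def fit(train, n_bins, shrink):
    """Toy model: per-bin mean of y for x in [0, 1), shrunk towards the global mean."""
    g = sum(y for _, y in train) / len(train)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in train:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    means = [g + shrink * (s / c - g) if c else g for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def mse(model, data):
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

random.seed(2)
xs = [random.random() for _ in range(400)]
data = [(x, (4 * x - 2) ** 2 + random.gauss(0, 0.3)) for x in xs]
cut = int(0.7 * len(data))  # assumed 70/30 split, not the paper's proportions
train, test = data[:cut], data[cut:]

# Hypothetical grid: every combination is fitted on train and scored on test.
grid = {"n_bins": [1, 2, 5, 10, 50], "shrink": [0.5, 1.0]}
results = []
for n_bins, shrink in itertools.product(grid["n_bins"], grid["shrink"]):
    model = fit(train, n_bins, shrink)
    results.append((mse(model, test), n_bins, shrink))

best_mse, best_bins, best_shrink = min(results)
print(best_bins, best_shrink, round(best_mse, 3))
```

The cost of the search grows with the product of the grid sizes, which is why the extent of the grid has to be balanced against the available computing time.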
In particular, of the units have been allocated to the training while the remaining to the testing (see Table 3). The metric chosen to evaluate the model performance in this step is the AUC score (fawcett2006introduction).
Table 3: Allocation of not-audited and audited units to the training and test sets.
Fiscal audits  Train set  Test set
Not-audited
Audited
The optimal choice returned an AUC value of and was associated to the following set of parameters:
A greater number of iterations, combined with lower values of the learning rate, may provide even better results. This may be the object of further investigation in future applications.
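For reference, the AUC score used to compare the first-step classifiers can be computed directly from its rank interpretation: the probability that a randomly chosen audited unit receives a higher score than a randomly chosen not-audited one. The sketch below uses made-up labels and scores; the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = audited, 0 = not audited; scores are predicted probabilities.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

print(auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and a value of 1 to perfect separation, regardless of how imbalanced the two classes are. The pairwise loop is quadratic in the number of units, so a rank-based formula would be preferable on large samples.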
The gradient boosting procedure returns the variables which had the greatest importance in the fitting process (those that played a role in most of the splits). In this case, the most discriminant variables were the declared tax base (BID), the activity branch, and the size and income of the firm. This is coherent with the results reported in Table 2. The probability of a taxpayer being selected is related to its declared tax base, income and size: this makes sense, since the richer the firm, the more it can potentially evade.
Second step
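In the fitting of the second model, each unit has been weighted with the inverse of its predicted inclusion probability coming from the previous step. In practice, each unit $i$ got a weight computed according to Equation 5, equal to:
$$w_i = \frac{\hat{P}(s_i = 1)}{\hat{\pi}_i}$$
where $\hat{\pi}_i$ is the inclusion probability predicted for unit $i$ in the first step and $\hat{P}(s_i = 1)$ is the estimated marginal probability of being audited.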
Given that, the same procedure of the first step is adopted also for the validation of the second Gradient Boosting, which involved only the audited units (units in ). The sample is split in a trainset ( of the units, ) and a testset ( of the units, ) as it is shown in Table 3. Since this step consisted of a regression problem, the optimal parameters have been chosen according to the index. The best value obtained for the on the testing set is , with tuning parameters:
Figure 2 compares predicted and observed values on the test set. In the case of perfect prediction, all the points would lie on the bisector. Most of the errors, including the largest ones, are related to an underestimation of the true potential tax base.
Interval estimates for both the training set and the testing set have been produced using the technique in Section 3.3, applied only to the second step of the procedure. A fixed number $B$ of bootstrap samples has been drawn from the training set, each with the same size as the original training set. Therefore, at the end of the procedure, $B$ predictions are obtained for each observation. The resulting intervals, obtained by computing lower and upper quantiles of the empirical distributions of the bootstrapped predictions, contained the true values in an unsatisfactory share of the cases. However, it is important to highlight that the outcome predictions approximate the conditional expected value $E[y \mid x]$. Therefore, the algorithm is bootstrapping the distribution of the prediction of $E[y \mid x]$, which is by definition far less variable than the point observation $y$ corresponding to the set of covariates $x$. Currently, there is no unified framework for the production of prediction intervals in machine learning. Other recently proposed techniques, which focus on the uncertainty of the point observation, are not discussed in this paper. They are based on bootstrapping the prediction error (coulston2016approximating) or on building predictive models for the resulting prediction errors (shrestha2006machine). A positive aspect of the obtained intervals is that the coverage is comparable on the training and the test set, which is reassuring with regard to the risk of overfitting the training data. Furthermore, a greater value of $B$ is recommended to improve the approximation. However, even with the limited number of bootstrap samples used here, the algorithm took hours to provide the final sets of predictions on the considered hardware. This is not a major issue, since the computational time is linear in $B$ and can be drastically reduced by using a better-performing processor and parallelizing the procedure over a reasonable number of cores.
Finally, the predictions on the test set are compared with those produced by the standard Heckman model on the same split. While the aggregate estimates of the total BIT are very close to each other, differences emerge in terms of individual estimates. In particular, the obtained from the Heckman model estimates is equal to , which is noticeably lower than the one achieved by our new approach. Results are summarized in Table 4.
Gradient Boosting  Heckman  Observed  

The Heckman model appears able to capture the general behaviour, but it lacks the flexibility needed to recover individual values. Linearity is too strong a constraint for such a complex problem, whereas Gradient Boosting is not limited by it.
4.2 Forecasting the Potential VAT Gap
Finally, the two-step Gradient Boosting has been used to produce predictions for all the units whose actual BIT is unknown (non-audited taxpayers). On the subsample actually analyzed (see Table 1), the predicted VAT gap turnover is about of euro (€). This result is very close to the value returned by the Heckman model on the same subsample, which is about of euro (€). Using the approach introduced in Section 3.3, a credibility interval for the VAT gap (total BIND) has been built. The resulting interval contains the value estimated by the Heckman model, highlighting again an internal coherence between the two techniques.
Furthermore, these predictions allowed the computation of a synthetic measure of VAT evasion propensity, defined along the lines of the evasion intensity used in braiotta2015tax. It is defined for each individual as the ratio of the undeclared tax base to the potential tax base, and estimated as:
Consequently, the general propensity to evade is estimated as:
A low value of this ratio indicates compliant behaviour, and vice versa. The VAT evasion propensity for the entire sample of taxpayers has been estimated to be for the two-step Gradient Boosting, just slightly larger than the obtained with the Heckman model.
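In symbols, the two estimators described above can plausibly be written as follows. The notation is an assumption introduced here for illustration: $\hat{p}_i$ and $\hat{p}$ denote the individual and the general propensity, $\widehat{BIT}_i$ is the predicted potential tax base of taxpayer $i$, $BID_i$ is the declared tax base, and $\widehat{BIND}_i = \widehat{BIT}_i - BID_i$ is the estimated undeclared tax base. The general propensity is written here as a ratio of totals, which is one natural reading of the definition; an average of the individual ratios would be a different estimator.

```latex
\[
\hat{p}_i = \frac{\widehat{BIND}_i}{\widehat{BIT}_i}
          = \frac{\widehat{BIT}_i - BID_i}{\widehat{BIT}_i},
\qquad
\hat{p} = \frac{\sum_{i} \widehat{BIND}_i}{\sum_{i} \widehat{BIT}_i}.
\]
```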
Afterwards, propensities have been computed with both models for different classes of individuals, defined according to some of the observed covariates. These may be used to identify classes of individuals with high VAT evasion propensity, and may help in the selection procedure for future fiscal audits. The propensity related to a specific class of individuals is straightforwardly estimated as:
Sex  Size  Gradient Boosting  Heckman
Female
Male
Total
Age  Size  Gradient Boosting  Heckman
over
Total
Also in this context, the two models returned coherent results. For instance, both the higher propensity to evade VAT for female taxpayers and the decreasing propensity with age emerge from both approaches. On the other hand, the two-step Gradient Boosting model seems to emphasize these differences: while the difference in VAT evasion propensity between male and female taxpayers is just 2 percentage points for the Heckman model, this gap is more than points for the two-step Gradient Boosting approach (see Table 5). The situation is analogous for age, where the gap goes from points for Heckman to points for the Gradient Boosting. Again, the results are coherent in terms of general behaviour, but the Gradient Boosting appears better able to capture specific variability. Indeed, given its better performance on the test set, it is plausible that this greater variability is actually present in the population and that the Heckman model is unable to capture it.
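The class-level comparison discussed above can be reproduced with a short aggregation, sketched below. This is only an illustration under assumptions: the field names `bit_hat` (predicted potential tax base) and `bid` (declared tax base) are hypothetical, the class propensity is taken to be the ratio of the class totals of the estimated undeclared and potential tax bases, and the four toy records are invented for the example and are not data from the paper.

```python
from collections import defaultdict


def propensity_by_class(records, key):
    """Estimated undeclared base over estimated potential base, per class.

    records: iterable of dicts with the hypothetical fields 'bit_hat' and 'bid'.
    key:     name of the covariate that defines the classes (e.g. 'sex').
    """
    bind_total = defaultdict(float)
    bit_total = defaultdict(float)
    for r in records:
        # Estimated undeclared base = predicted potential base - declared base.
        bind_total[r[key]] += r["bit_hat"] - r["bid"]
        bit_total[r[key]] += r["bit_hat"]
    return {c: bind_total[c] / bit_total[c] for c in bit_total}


def overall_propensity(records):
    """Same ratio of totals, computed on all records at once."""
    bind = sum(r["bit_hat"] - r["bid"] for r in records)
    bit = sum(r["bit_hat"] for r in records)
    return bind / bit


# Invented toy records, for illustration only.
toy = [
    {"sex": "F", "bit_hat": 100.0, "bid": 60.0},
    {"sex": "F", "bit_hat": 50.0, "bid": 45.0},
    {"sex": "M", "bit_hat": 200.0, "bid": 150.0},
    {"sex": "M", "bit_hat": 100.0, "bid": 90.0},
]
by_sex = propensity_by_class(toy, "sex")
total = overall_propensity(toy)
print(by_sex, total)
```

Running the same function once with predictions from each model gives the two columns to compare; because it works on totals, large taxpayers weigh more than small ones within each class.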
5 Concluding remarks
This work presents a nonparametric approach for the estimation of the VAT gap based on the Gradient Boosting algorithm, a machine learning technique for regression and classification problems. This approach is preferable and better suited to this kind of data because it is distribution-free, can handle any kind of variable, and is not sensitive to multicollinearity issues. Furthermore, models based on machine learning usually perform well also in high-dimensional settings and make it possible to exploit all the information contained in large data sets.
In practice, the estimates of the VAT gap obtained through the two approaches are very similar; however, the model based on Gradient Boosting produced noticeably more accurate estimates of the individual undeclared tax bases, capturing the variability associated with the observed variables, as is desirable. The Heckman model, on the contrary, seems to flatten out individual differences.
Possible further developments of this kind of approach are various and promising. For instance, the analysis presented in this work has been performed only on a small subset of the available observations because of hardware limitations. We are confident that considerably better results may be achieved by analyzing the whole data set. Moreover, further investigation of methods for building more reliable interval estimates is of primary interest, since this is one of the main drawbacks of the methodology. Last but not least, greater computational power would allow the application of a complete k-fold cross-validation for the tuning of the parameters, even on finer grids, and the use of more expensive but efficient techniques such as Extreme Gradient Boosting and Neural Networks.