I The Regress Argument or Errors on Errors
While there is a tradition of philosophical approaches to probability, including most notably Laplace, Ramsey, Keynes, de Finetti, von Mises and Jeffreys (see  and  for a review), and more recently Levi's magisterial Gambling with Truth, such approaches and questions, while influencing some branches of Bayesian inference, have not fully entered the field of quantitative risk management (see de Finetti [6, 7, 8]). Yet epistemology, as a field, is a central element of statistical inference and risk management [35, 36, 30]. Fundamentally, philosophical inquiry is about handling such central questions about any inference as: How do you know what you know? How certain are you about it? This paper considers higher orders of such questions.
Now let’s consider estimation. All estimates, whether statistical or obtained by some other method, are by definition imprecise (being "estimates"). An estimate must tautologically have an error rate –otherwise it would not be an estimate, but a certainty or something linked to perfection. If we follow the logic, the error rate itself is also an estimate –the methods used to estimate such error are themselves imprecise. There is no flawless way to estimate an error rate. And so forth.
By a regress argument, this thinking about errors on errors can be re-applied indefinitely; the failure to account for the resulting chain fosters additional model uncertainty and unreliability not taken into account in the standard literature. (See  for model risk.)
In practical applications, there is no problem with stopping the recursion in situations determined heuristically from past track records, where all errors have been assessed through the test of time and survival and one has a satisfactory understanding of the structure and properties. However, refraining from caring about errors on errors should be explicitly declared as a subjective a priori decision that escapes quantitative and statistical methods. In other words, we have to state and accept the subjectivity of the choice and its necessary effects on the overall model uncertainty. In the words of Isaac Israeli, as reported by Thomas Aquinas: "Veritas est adequatio intellectus et rei" (truth is the adequation of intellect and thing).
In what follows, we show how, taking the errors-on-errors argument to the limit, fat tails emerge from the layering of uncertainty. Starting from a completely thin-tailed, low-risk world, represented by the Normal distribution, we increase tail risk and generate fat tails by perturbing its standard deviation, introducing errors and doubts about its "true" value. We show analytically how uncertainty induces fat tails, arguing that real life is actually even more extreme. (Following the usage of mathematical finance practitioners, "fat tails" in this context means any distribution with thicker tails than the Gaussian, not necessarily a power law. Hence this designation encompasses the subexponential and power-law classes, as well as any mixture of Gaussians with kurtosis higher than 3.)
One of the contributions in this paper is the streamlining of counterfactuals.
Counterfactual analysis, or its business-school version, "scenario analysis", branches uncontrollably in an explosive manner, hampering projections many steps into the future –typically at a minimal rate , where  is the number of steps. We show that counterfactuals can be structured analytically in a way that produces a single distribution of outcomes and allows variability. We manage to do so thanks to a rate of "error on error", which can be parametrized, hence allowing for perturbations and sensitivity analyses.
The mechanism by which uncertainty about probability thickens the tails (by increasing the odds of tail events) is as follows.
Assume someone tells you the probability of an event is exactly zero.
You ask: "How do you know?"
Answer: "I estimated it."
Visibly, if the person estimated it and there is, say, a 1% error (symmetric), then the probability can no longer be exactly zero: the error cannot push it below zero, so uncertainty can only raise it, to some positive number on the order of 1%.
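The point can be made concrete with a two-line simulation: perturbing a "zero" probability estimate with a symmetric error that cannot go below zero necessarily raises it. (The uniform error model and its size are illustrative assumptions, not taken from the text.)

```python
import numpy as np

rng = np.random.default_rng(42)

p_hat, err = 0.0, 0.01  # stated probability and symmetric error size

# Perturb the estimate symmetrically; a probability cannot be
# negative, so the downside of the error is truncated at zero.
samples = np.clip(p_hat + rng.uniform(-err, err, size=100_000), 0.0, 1.0)

# The average admissible probability is strictly positive.
print(samples.mean())
```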
II Layering Uncertainties
Take a rather standard probability distribution, say the Normal. Assume that its dispersion parameter, the standard deviation $\sigma$, is to be estimated following some statistical procedure, yielding $\hat\sigma$. Such an estimate will nevertheless have a certain error, a rate of epistemic uncertainty, which can be expressed with another measure of dispersion: a dispersion on dispersion, paraphrasing the "volatility on volatility" of option operators [11, 13, 37]. This makes particular sense in the real world, where the asymptotic assumptions usually made in mathematical statistics do not hold, and where every model and estimation approach is subsumed under a subjective choice.
Let $X \sim \mathcal{N}(\mu, \sigma^2)$, with known mean $\mu$ and unknown standard deviation $\sigma$. To account for the error in estimating $\sigma$, we can introduce a density $f_1(\sigma; \bar\sigma_1)$ over $\sigma$, where $\sigma$ represents the scale parameter of $X$ and $\bar\sigma_1$ its expected value under $f_1$. We are thus assuming that $\hat\sigma = \bar\sigma_1$ is an unbiased estimator of $\sigma$, but our treatment could also be adapted to the weaker case of consistency. In other words, the estimated volatility is the realization of a random quantity, representing the true value of $\sigma$ with an error term.

The unconditional law of $X$ is thus no longer that of a simple Normal distribution, but corresponds to the integral of the Normal density $\phi(x; \mu, \sigma)$, with $\sigma$ treated as random, across all possible values of $\sigma$ according to $f_1$. This is known as a scale mixture of normals, and in symbols one has
$$f(x) = \int_0^\infty \phi(x; \mu, \sigma)\, f_1(\sigma; \bar\sigma_1)\, d\sigma. \qquad (1)$$
Depending on the choice of $f_1$, which in Bayesian terms would define a prior, $f(x)$ can take different functional forms.
Now, what if $\bar\sigma_1$ is itself subject to errors? As observed before, there is no obligation to stop at Equation (1): one can keep nesting uncertainties into higher orders, with the dispersion of the dispersion of the dispersion, and so forth. There is no reason to have certainty anywhere in the process.
For $i = 2, \dots, n$, define for each layer of uncertainty a density $f_i(\bar\sigma_{i-1}; \bar\sigma_i)$, with $\bar\sigma_i$ the expected value at layer $i$. Generalizing to $n$ uncertainty layers, one then gets that the unconditional law of $X$ is now
$$f(x) = \int_0^\infty\!\!\cdots\!\int_0^\infty \phi(x;\mu,\sigma)\, f_1(\sigma;\bar\sigma_1)\, f_2(\bar\sigma_1;\bar\sigma_2)\cdots f_n(\bar\sigma_{n-1};\bar\sigma_n)\, d\sigma\, d\bar\sigma_1\cdots d\bar\sigma_{n-1}.$$
This approach is clearly parameter-heavy and also computationally intensive, as it requires the specification of all the subordinated densities for the different uncertainty layers and the resolution of a possibly very complicated integral.
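Even without specifying all the subordinated densities, the nested integral can be explored by simulation. A minimal sketch, assuming (purely for illustration) lognormal perturbations of the scale at each layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x(n_layers, sigma=1.0, disp=0.2, size=200_000):
    # Draw from the unconditional law of x: push the scale through
    # n_layers of uncertainty, each multiplying it by a lognormal
    # perturbation (an illustrative choice of the layer densities).
    s = np.full(size, sigma)
    for _ in range(n_layers):
        s *= rng.lognormal(mean=0.0, sigma=disp, size=size)
    return rng.normal(0.0, s)

def kurtosis(x):
    x = x - x.mean()
    return (x ** 4).mean() / (x ** 2).mean() ** 2

k1 = kurtosis(sample_x(1))
k5 = kurtosis(sample_x(5))
print(k1, k5)  # kurtosis grows with the number of layers
```

With each added layer the scale mixture spreads out, and the kurtosis of the unconditional law rises above the Gaussian value of 3.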
Let us consider a simpler version of the problem, by playing with a basic multiplicative process à la Gibrat, in which the estimated $\sigma$ is perturbed at each level of uncertainty by dichotomic alternatives: overestimation or underestimation. We take the probability of overestimation to be $p$, while that of underestimation is $1-p$.
Let us start from the true parameter $\sigma$, and let us assume that its estimate is equal to $\hat\sigma = \sigma(1 \pm \epsilon_1)$, where $\epsilon_1 \geq 0$ is an error rate (for example, it could represent the proportional mean absolute deviation of the estimator).
Equation (1) thus becomes
$$f(x) = p\,\phi\big(x;\mu,\sigma(1+\epsilon_1)\big) + (1-p)\,\phi\big(x;\mu,\sigma(1-\epsilon_1)\big).$$
Now, just to simplify notation–but without any loss of generality–hypothesize that overestimation and underestimation are equally likely, i.e. $p = \tfrac12$. Clearly one then has
$$f(x) = \tfrac12\,\phi\big(x;\mu,\sigma(1+\epsilon_1)\big) + \tfrac12\,\phi\big(x;\mu,\sigma(1-\epsilon_1)\big).$$
Assume now that the same type of uncertainty affects the error rate $\epsilon_1$ itself, so that we can introduce a second rate $\epsilon_2$ and define the elements $\sigma(1 \pm \epsilon_1)(1 \pm \epsilon_2)$. Figure 1 gives a tree representation of the uncertainty over two (and possibly more) layers.
With two layers of uncertainty the law of $X$ thus becomes
$$f(x) = \tfrac14\Big[\phi\big(x;\mu,\sigma(1+\epsilon_1)(1+\epsilon_2)\big) + \phi\big(x;\mu,\sigma(1+\epsilon_1)(1-\epsilon_2)\big) + \phi\big(x;\mu,\sigma(1-\epsilon_1)(1+\epsilon_2)\big) + \phi\big(x;\mu,\sigma(1-\epsilon_1)(1-\epsilon_2)\big)\Big].$$
While at the $n$-th layer, we recursively get
$$f(x) = \frac{1}{2^n}\sum_{j=1}^{2^n}\phi\big(x;\mu,\sigma_j\big),$$
where $\sigma_j$ is the $j$-th entry of the vector
$$\boldsymbol{\sigma} = \left(\sigma \prod_{i=1}^{n}\big(1 + \epsilon_i\, T_{j,i}\big)\right)_{j=1,\dots,2^n},$$
with $T_{j,i}$ being the $(j,i)$-th entry of the matrix $T$ of all the exhaustive combinations of $n$-tuples of the set $\{-1, 1\}$, i.e. the sequences of length $n$ representing all the combinations of $+1$ and $-1$. For example, for $n = 3$, we have
$$T = \begin{pmatrix} 1&1&1\\ 1&1&-1\\ 1&-1&1\\ 1&-1&-1\\ -1&1&1\\ -1&1&-1\\ -1&-1&1\\ -1&-1&-1 \end{pmatrix}.$$
Once again, it is important to stress that the various error rates are not sampling errors, but rather projections of error rates into the future. They are, to repeat, of epistemic nature.
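The enumeration of the $2^n$ dichotomic scenarios can be sketched in a few lines. A toy illustration with a single constant error rate at every layer (anticipating Hypothesis 1; the function name `sigma_scenarios` is ours):

```python
from itertools import product

def sigma_scenarios(sigma, eps, n):
    # All 2**n values of the perturbed scale after n layers:
    # sigma times a product of (1 + eps) / (1 - eps) factors,
    # one sign per layer.
    out = []
    for signs in product((1, -1), repeat=n):
        s = sigma
        for sgn in signs:
            s *= 1 + sgn * eps
        out.append(s)
    return out

# For n = 2, the four branches of the tree in Figure 1:
# (1-0.1)**2, (1-0.1)(1+0.1), (1+0.1)(1-0.1), (1+0.1)**2
print(sorted(sigma_scenarios(1.0, 0.1, 2)))
```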
III Hypothesis 1: Constant error rate
Assume that $\epsilon_i = \epsilon$ for every $i$, i.e. we have a constant error rate at each layer of uncertainty. What we can immediately observe is that the matrix $T$ collapses into a standard binomial tree for the dispersion at level $n$, so that
$$f(x) = \frac{1}{2^n}\sum_{j=0}^{n}\binom{n}{j}\,\phi\big(x;\mu,\sigma(1+\epsilon)^{j}(1-\epsilon)^{n-j}\big).$$
Because of the linearity of the sum, the moments of $X$ can be computed in closed form when taking $n$ layers of epistemic uncertainty. One can easily check that the first four raw moments read
$$M_1(n)=\mu, \qquad M_2(n)=\mu^2+\sigma^2(1+\epsilon^2)^n,$$
$$M_3(n)=\mu^3+3\mu\sigma^2(1+\epsilon^2)^n, \qquad M_4(n)=\mu^4+6\mu^2\sigma^2(1+\epsilon^2)^n+3\sigma^4(1+6\epsilon^2+\epsilon^4)^n.$$
From these, one can then obtain the following notable moments: the mean $\mu$; the variance $\sigma^2(1+\epsilon^2)^n$; the skewness $0$; and the kurtosis $3\,\dfrac{(1+6\epsilon^2+\epsilon^4)^n}{(1+\epsilon^2)^{2n}}$.
First notice that the mean of $X$ is independent of both $\epsilon$ and $n$: this is a clear consequence of the construction, in which $\mu$ is assumed to be known. For what concerns the variance, conversely, the higher the uncertainty ($n$ growing), the more dispersed $X$ is; for $n \to \infty$, the variance explodes. Skewness, conversely, is not affected by uncertainty, and the distribution of $X$ stays symmetric. Finally, kurtosis is a clear function of uncertainty: the distribution of $X$ becomes more and more leptokurtic as the layers of uncertainty increase, indicating a substantial thickening of the tails, hence a strong increase in risk.
Please observe that the explosion of the moments of order larger than one takes place even for very small values of $\epsilon$, as $n$ grows to infinity. An error rate as small as 1% will still eventually invalidate the use of the initial distribution to study $X$. Once again, whatever the error rate, the growth of uncertainty inflates tail risk.
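The moment explosion can be checked numerically. A sketch, using the layer-by-layer averages $E[\sigma^2]=\sigma^2(1+\epsilon^2)^n$ and $E[\sigma^4]=\sigma^4(1+6\epsilon^2+\epsilon^4)^n$ implied by the binomial tree (the formula is derived here from the construction, so treat it as an illustration):

```python
def kurtosis_n_layers(eps, n):
    # Kurtosis of x after n layers of a constant error rate eps:
    # 3 * E[sigma^4] / E[sigma^2]^2 with the binomial-tree averages
    # E[sigma^2] = (1 + eps^2)^n and
    # E[sigma^4] = (1 + 6 eps^2 + eps^4)^n.
    m2 = (1 + eps ** 2) ** n
    m4 = (1 + 6 * eps ** 2 + eps ** 4) ** n
    return 3 * m4 / m2 ** 2

# Even a 1% error rate eventually thickens the tails without bound.
print(kurtosis_n_layers(0.01, 1))       # barely above the Gaussian 3
print(kurtosis_n_layers(0.01, 10_000))  # grows without bound in n
```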
Figure 2 shows how the tails of $X$ get thicker as the error rate increases, consistent with the explosion of moments: the larger the error rate, the higher the kurtosis of $X$, so that its peak grows and so do the tails.
As observed before, in applications there is no need to take $n$ large; it is perfectly understandable to put a cut-off somewhere in the layers of uncertainty, but such a decision should be taken a priori and motivated, in the philosophical sense.
Figure 3 shows the logplot of the density of $X$ for different numbers of layers $n$. As expected, as $n$ grows, the tails of $X$ open up, tending towards power-law behavior, similarly to a risky lognormal with growing scaling parameter. Recall, however, that the first moment of $X$ will always be finite, suggesting that a pure power-law behavior leading to an infinite-mean phenomenon will never take place. The result is also confirmed using other graphical tools (Zipf plot, mean excess plot, etc.) like those discussed in .

It is then interesting to measure the effect of $n$ on the thickness of the tails of $X$. The obvious effect, as per Figure 3, is the rise of tail risk.
Fix $\epsilon$ and consider the exceedance probability of $X$ over a given threshold $K$, i.e. the tail $P(X > K)$, when the error rate is constant. One clearly has
$$P(X>K) = \frac{1}{2^{n+1}}\sum_{j=0}^{n}\binom{n}{j}\,\mathrm{erfc}\!\left(\frac{K-\mu}{\sqrt{2}\,\sigma(1+\epsilon)^{j}(1-\epsilon)^{n-j}}\right),$$
where $\mathrm{erfc}$ is the complementary error function.
Tables I and II show the ratio of the exceedance probability of $X$, for different values of $n$ and $K$, over the benchmark represented by a simple Normal with mean $\mu$ and variance $\sigma^2$, i.e. our starting point in the case of no uncertainty on $\sigma$ (in other words, for $n = 0$). The two tables differ in the value of $\epsilon$, equal to 0.1 in the former and 0.01 in the latter. It is pretty clear how the layering of uncertainty, as $n$ grows, makes the same tail probabilities grow dramatically. For example, the probability is  times larger than the corresponding probability for a , when  and  is just .
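Ratios of this kind can be reproduced in a few lines by averaging Gaussian tails over the binomially weighted scale scenarios. The parameter values below are illustrative and do not reproduce the exact table entries:

```python
from math import comb, erfc, sqrt

def tail_prob(sigma, eps, n, K, mu=0.0):
    # P(X > K) after n layers of a constant error rate eps:
    # binomially weighted average of Gaussian tails over the scales
    # sigma * (1 + eps)**j * (1 - eps)**(n - j).
    total = 0.0
    for j in range(n + 1):
        s = sigma * (1 + eps) ** j * (1 - eps) ** (n - j)
        total += comb(n, j) * 0.5 * erfc((K - mu) / (s * sqrt(2)))
    return total / 2 ** n

# Ratio of the 5-sigma exceedance probability to the no-uncertainty
# benchmark N(0, 1), with eps = 0.1 and n = 10 layers.
base = 0.5 * erfc(5 / sqrt(2))
ratio = tail_prob(1.0, 0.1, 10, 5.0) / base
print(ratio)  # orders of magnitude above 1
```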
IV Hypothesis 2: Decaying error rates
As observed before, one may have (actually one needs to have) a priori reasons to stop the regress argument and take $n$ to be finite. For example, one could assume that the error rates vanish as the number of layers increases, so that $\epsilon_{i+1} < \epsilon_i$ for every $i$, with $\epsilon_i$ tending to 0 as $i$ approaches a given $n$. In this case, one can show that the higher moments tend to be capped, and the tail of $X$ less extreme, yet riskier than what one could naively think.
Take a value $\kappa \in (0,1)$ and fix $\epsilon \geq 0$. Then, for $i \geq 1$, hypothesize that $\epsilon_i = \kappa\,\epsilon_{i-1}$, so that $\epsilon_i = \kappa^i \epsilon_0$. For what concerns $\epsilon_0$, without loss of generality, set $\epsilon_0 = \epsilon$. With $n$ layers of uncertainty, the variance of $X$ becomes
$$\mathrm{Var}(X) = \sigma^2 \prod_{i=0}^{n}\left(1 + \kappa^{2i}\epsilon^2\right).$$
For $n \to \infty$ we get
$$\mathrm{Var}(X) \longrightarrow \sigma^2\,(-\epsilon^2;\,\kappa^2)_\infty.$$
For a generic $n$ the variance is
$$\mathrm{Var}(X) = \sigma^2\,(-\epsilon^2;\,\kappa^2)_{n+1},$$
where $(a;q)_n$ is the $q$-Pochhammer function.
Going on computing moments, for the fourth central moment of $X$ one gets, for example,
$$\mu_4(X) = 3\sigma^4 \prod_{i=0}^{n}\left(1 + 6\kappa^{2i}\epsilon^2 + \kappa^{4i}\epsilon^4\right).$$
For and , we get a variance of , with a significant yet relatively benign convexity bias. And the limiting fourth central moment is , more than 3 times that of a simple Normal, which is . Such a number, even if finite–hence the corresponding scenario is not as extreme as before–definitely suggests a tail risk not to be ignored.
For values of in the vicinity of 1 and , the fourth moment of converges towards that of a Normal, closing the tails, as expected.
V A central limit theorem argument
We now discuss a central limit theorem argument for epistemic uncertainty as a generator of thicker tails and risk. For doing so, we introduce a more convenient representation of the normal distribution, which will also prove useful in Section VI.
Consider again the real-valued normal random variable $X$, with mean $\mu$ and standard deviation $\sigma$. Its density function is thus
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \qquad (6)$$
Without any loss of generality, let us set $\mu = 0$. Moreover, let us re-parametrize Equation (6) in terms of a new parameter $\tau = 1/\sigma^2$, commonly called "precision" in Bayesian statistics. The precision of a random variable is nothing more than the reciprocal of its variance and, as such, it is just another way of looking at variability (Gauss originally defined the Normal distribution in terms of precision). From now on, we will therefore assume that $X$ has density
$$f(x) = \sqrt{\frac{\tau}{2\pi}}\, e^{-\frac{\tau x^2}{2}}.$$
Imagine now that we are provided with an estimate $\hat\tau$ of $\tau$, and take $\hat\tau$ to be close enough to the true value of the precision parameter. Assuming that $\hat\tau$ and $\tau$ are actually close is not necessary for our derivation, but we want to be optimistic, considering a situation in which whoever estimates $\tau$ knows what she is doing: using an appropriate method, checking statistical significance, and so on.
We can thus write
$$\tau = \hat\tau\,(1 + e_1), \qquad (8)$$
where $e_1$ is now a first-order random error term such that $E[e_1] = 0$ and $V[e_1] = v_1 < \infty$. Apart from these assumptions on the first two moments, no other requirement is put on the probabilistic law of $e_1$.
Now, imagine that a second-order error term $e_2$ affects $e_1$, and again assume that it has zero mean and finite variance. The term $e_2$ may, as before, represent uncertainty about the way in which the quantity $e_1$ was obtained. Equation (8) can thus be re-written as
$$\tau = \hat\tau\,(1 + e_1)(1 + e_2).$$
Iterating the error-on-error reasoning, we can introduce a sequence of error terms $e_1, e_2, \dots, e_n$ such that $E[e_i] = 0$ and $V[e_i] = v_i$, with $v_i \geq v > 0$, so that we can write
$$\tau = \hat\tau \prod_{i=1}^{n}(1 + e_i). \qquad (10)$$
For $n$ growing, Equation (10) represents our knowledge about the parameter $\tau$, once we start from the estimate $\hat\tau$ and allow for epistemic uncertainty in the form of multiplicative errors on errors. The lower bound $v$ on the variances of the error terms is meant to guarantee a minimum level of epistemic uncertainty at every layer, and to simplify the application of the central limit argument below.
Now take logs on both sides of Equation (10) to obtain
$$\log\tau = \log\hat\tau + \sum_{i=1}^{n}\log(1 + e_i).$$
If we assume that, for every $i$, $e_i$ is small with respect to 1, we can introduce the approximation $\log(1+e_i) \approx e_i$, and Equation (10) becomes
$$\log\tau \approx \log\hat\tau + \sum_{i=1}^{n} e_i.$$
To simplify the treatment, let us assume that the error terms are independent of each other. (In case of dependence, we can refer to one of the generalizations of the CLT [14, 15].) For $n$ large, a straightforward application of the Central Limit Theorem (CLT) of Laplace-Liapounoff tells us that $\sum_{i=1}^{n} e_i$ is approximately distributed as a Normal with zero mean and variance $w_n^2 = \sum_{i=1}^{n} v_i$. This clearly implies that $\tau$ is approximately Lognormal, $\tau \sim \mathcal{LN}(\log\hat\tau, w_n^2)$. Notice that, for $n$ large enough, we could also assume $n$ to be a random variable (with finite mean and variance), but still the limiting distribution of $\tau$ would be a Lognormal. For the reader interested in industrial dynamics, the above derivation should recall the so-called Gibrat law of proportionate effects for the modeling of firms' size [17, 19].
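The CLT step can be illustrated by simulation: with many small multiplicative errors, the log of the product is a sum, hence approximately Normal. (Uniform errors are an arbitrary illustrative choice; any zero-mean, finite-variance law would do.)

```python
import numpy as np

rng = np.random.default_rng(7)

# Product of many small multiplicative errors (1 + e_i), with the
# e_i of zero mean and small variance: the log of the product is a
# sum of the log(1 + e_i), so the CLT makes it approximately Normal
# and the product itself approximately Lognormal.
n_err, size = 400, 50_000
e = rng.uniform(-0.05, 0.05, size=(n_err, size))
tau = 1.0 * np.prod(1.0 + e, axis=0)

logs = np.log(tau)
z = (logs - logs.mean()) / logs.std()
# The standardized log-precision is close to N(0, 1):
print((z ** 3).mean(), (z ** 4).mean())  # near 0 skewness, near 3 kurtosis
```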
From now on we drop the index $n$ from $w_n$, using $w$ and assuming that $n$ is large enough for the CLT to hold.
Epistemic doubt thus has a very relevant consequence from a statistical point of view. Using Bayesian terminology, the different layers of uncertainty represented by the sequence of random errors correspond to eliciting a Lognormal prior distribution on the precision parameter of the initial Normal distribution. This means that, in case of epistemic uncertainty, the actual marginal distribution of the random variable $X$ is no longer a simple Normal, but a Compound Normal-Lognormal distribution, which we can represent as
$$f(x) = \int_0^\infty \sqrt{\frac{\tau}{2\pi}}\, e^{-\frac{\tau x^2}{2}}\, g(\tau)\, d\tau, \qquad (13)$$
where $g(\tau)$ is the density of a $\mathcal{LN}(\log\hat\tau, w^2)$ for the now random precision parameter $\tau$.
Notice that, by the properties of the Lognormal distribution, the distribution of $\sigma^2 = 1/\tau$ is also Lognormal. However, the parametrization based on the precision of $X$ is convenient in view of the next section.
In fact, despite its apparent simplicity, the integral in Equation (13) cannot be solved analytically. This means that we are not able to obtain a closed form for the Compound Normal-Lognormal (CNL) distribution represented by Equation (13), even if its first moments can be obtained explicitly. For example, the mean is equal to zero ($\mu$ in general), while the kurtosis is $3e^{w^2}$.
VI The analytical approximation
The impossibility of solving Equation (13) can be bypassed by introducing an approximation to the Lognormal distribution on the positive semi-axis. The idea is to use a Gamma distribution to mimic the behavior of the Lognormal density in Equation (13), also looking at the tail behavior of both distributions.
Both the Lognormal and the Gamma distribution are in fact skewed distributions, defined on the positive semi-axis, and characterized by a peculiar property: their coefficient of variation CV (the ratio of the standard deviation to the mean) is constant, in the sense that it does not depend on the mean. In a $\mathcal{LN}(\log\hat\tau, w^2)$ the CV is equal to $\sqrt{e^{w^2}-1}$, while for a positive random variable following a $Gamma(\alpha, \beta)$ distribution with density
$$g(\tau) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \tau^{\alpha-1} e^{-\beta\tau},$$
the CV is simply $1/\sqrt{\alpha}$.
From the point of view of extreme value statistics, both the Gamma and the Lognormal are heavy-tailed distributions, meaning that their right tail goes to zero more slowly than an exponential function, but they are not "true" fat-tailed distributions, i.e. their tails decrease faster than a power law. In terms of extreme value theory, both distributions are in the maximum domain of attraction of the Gumbel case of the Generalized Extreme Value distribution [9, 14], and not of the Fréchet one, i.e. the proper fat-tailed case. As a consequence, the moments of these distributions are always finite.
As applied statisticians know, from a qualitative point of view it is rather difficult to distinguish between a Lognormal and a Gamma sharing the same coefficient of variation when fitting data. In generalized linear models, it is essentially a personal choice to use a Lognormal rather than a Gamma regression, more or less like choosing between a Logit and a Probit. In their bulk, a Lognormal and a Gamma with the same mean and standard deviation (hence the same CV) approximate one another quite well, as also shown in Figure 4. The Gamma appears to give a little more mass to the smaller values, but the approximation is definitely good.
Interestingly, the Lognormal and the Gamma are also linked through the operation of exponentiation .
The main difference between a Lognormal and a Gamma sharing the same coefficient of variation lies in the right tail. The Lognormal, in fact, shows a slightly heavier tail, whose decrease is slower, as is also evident from Figure 4. To verify analytically that the Lognormal tail dominates that of the Gamma, we can look at their asymptotic failure rates.
For a distribution with density $f$ and survival function $S = 1 - F$, the failure rate is defined as
$$h(x) = \frac{f(x)}{S(x)}.$$
The quantity $h^* = \lim_{x\to\infty} h(x)$ is called the asymptotic failure rate. A distribution with a lower $h^*$ will have a thicker tail than one with a larger asymptotic failure rate. When two distributions share the same $h^*$, conversely, more sophisticated analyses are needed to study their tail behavior.
For a $Gamma(\alpha, \beta)$ it is easy to verify that
$$h(x) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)\, S(x)} \longrightarrow \beta \quad (x \to \infty),$$
while for a generic $\mathcal{LN}(\mu, w^2)$ we have
$$h(x) = \frac{\frac{1}{xw}\sqrt{\frac{2}{\pi}}\, e^{-\frac{(\log x - \mu)^2}{2w^2}}}{\mathrm{erfc}\!\left(\frac{\log x - \mu}{w\sqrt{2}}\right)},$$
where $\mathrm{erfc}$ is once again the complementary error function.
By taking the limits for $x \to \infty$, we see that $h^*_{Gamma} = \beta$, while $h^*_{LN} = 0$. Therefore, a Lognormal has a right tail heavier than that of a Gamma (actually of all Gammas, given that $h^*_{LN} = 0$ does not depend on any parameter). A relevant consequence of this different tail behavior is that the Lognormal is bound to generate more extreme scenarios than the Gamma. This, combined with the fact that in the bulk the two distributions are rather similar–even if the Gamma slightly inflates the small values–allows us to use the Gamma as a lower bound for the Lognormal, when we center both distributions on the same coefficient of variation.
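The comparison of failure rates can be checked numerically for illustrative parameter choices (a Gamma with $\alpha = 2$, $\beta = 1$, whose survival function has a simple closed form, and a Lognormal with $\mu = 0$, $w = 1$):

```python
import math

def hazard_gamma(x):
    # Gamma(alpha=2, rate=1): f(x) = x e^{-x}, S(x) = (1 + x) e^{-x};
    # the exponentials cancel analytically, avoiding underflow.
    return x / (1.0 + x)

def hazard_lognormal(x, mu=0.0, w=1.0):
    # Lognormal hazard: density over survival, with the survival
    # expressed through the complementary error function.
    z = (math.log(x) - mu) / w
    pdf = math.exp(-z * z / 2) / (x * w * math.sqrt(2 * math.pi))
    surv = 0.5 * math.erfc(z / math.sqrt(2))
    return pdf / surv

# Far in the tail the Gamma hazard approaches beta = 1, while the
# Lognormal hazard keeps falling toward 0: the Lognormal tail wins.
for x in (10.0, 100.0, 1000.0):
    print(x, hazard_gamma(x), hazard_lognormal(x))
```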
Coming back to the CLT result of Section V, we can say that, in the limit, the precision parameter $\tau$ can be taken to be approximately $Gamma(\alpha, \beta)$, where $\alpha$ and $\beta$ are chosen to match the mean and the coefficient of variation of the $\mathcal{LN}(\log\hat\tau, w^2)$, that is $1/\sqrt{\alpha} = \sqrt{e^{w^2}-1}$ and $\alpha/\beta = \hat\tau\, e^{w^2/2}$.
In dealing with the precision parameter , moving from the Lognormal to the Gamma has a great advantage. A Normal distribution with known mean (for us ) and Gamma-distributed precision parameter has in fact an interesting closed form.
Let us come back to Equation (13), and let us re-write it by substituting the Lognormal density with the approximating $Gamma(\alpha, \beta)$ density, which we indicate with $g_{\alpha,\beta}$, obtaining
$$f(x) = \int_0^\infty \sqrt{\frac{\tau}{2\pi}}\, e^{-\frac{\tau x^2}{2}}\, g_{\alpha,\beta}(\tau)\, d\tau.$$
The integral above can now be solved explicitly, so that
$$f(x) = \frac{\Gamma\!\left(\alpha+\tfrac12\right)}{\Gamma(\alpha)\sqrt{2\pi\beta}}\left(1 + \frac{x^2}{2\beta}\right)^{-\alpha-\frac12}.$$
In the equation above we can recognize the density function of a non-standardized Student t distribution with $2\alpha$ degrees of freedom, zero location and scale parameter $\sqrt{\beta/\alpha}$. As observed above, to guarantee the Gamma approximation to the Lognormal, we set $\alpha = \frac{1}{e^{w^2}-1}$ and $\beta = \frac{\alpha}{\hat\tau\, e^{w^2/2}}$, where $w^2$ is the sum of the variances of the epistemic random errors.
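The closed form can be verified by simulation: sampling the precision from a Gamma and then $x$ from the conditional Normal should match a non-standardized Student t with $2\alpha$ degrees of freedom and scale $\sqrt{\beta/\alpha}$. (The values of $\alpha$ and $\beta$ below are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(3)

# Compound step: tau ~ Gamma(alpha, rate beta); x | tau ~ N(0, 1/sqrt(tau)).
alpha, beta, size = 2.0, 2.0, 200_000
tau = rng.gamma(shape=alpha, scale=1.0 / beta, size=size)
x = rng.normal(0.0, 1.0 / np.sqrt(tau))

# Direct draw from the claimed marginal: Student t with nu = 2*alpha
# degrees of freedom, rescaled by sqrt(beta/alpha).
nu, scale = 2 * alpha, np.sqrt(beta / alpha)
t = scale * rng.standard_t(df=nu, size=size)

# Tail frequencies beyond 3 scale units should agree.
p_x = np.mean(np.abs(x) > 3.0)
p_t = np.mean(np.abs(t) > 3.0)
print(p_x, p_t)
```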
Interestingly, this t-Student distribution is fat-tailed on both sides [9, 14], especially for small values of the degrees of freedom $2\alpha$. Given that $\alpha$ decreases in $w^2$, which is the sum of the variances of the epistemic errors, hence a measure of the overall uncertainty, the more doubts we have about the precision parameter $\tau$, the more fat-tailed the resulting t-Student distribution, thus increasing tail risk. The actual value of $\alpha$ is indeed bound to be rather small. This result is in line with the findings of Section III.
Therefore, starting from a simple Normal distribution and considering layers of epistemic uncertainty, we have obtained a fat-tailed distribution with the same mean (zero here, $\mu$ in general), but capable of generating more extreme scenarios, and its tail behavior is a direct consequence of imprecision and ignorance. Since we have used the Gamma distribution as a lower bound for the Lognormal, we can expect that, with a Lognormal, the tails of $X$ will be even heavier and very far from normality.
In Section II we started from a normally distributed random variable $X$ and derived the effects of layering uncertainty on the standard deviation of $X$. We analyzed different scenarios, all generating tails for $X$ thicker than those of the Normal distribution we started from.

Epistemic uncertainty was represented in terms of multiplicative errors, which can also be analyzed with a CLT argument leading to a Lognormal distribution for the precision parameter $\tau$. Given the impossibility of obtaining closed-form results for the Lognormal case, we used a Gamma approximation to obtain fat tails analytically, after noticing that the Lognormal will possibly generate even more extreme results, given that its tail dominates that of the Gamma.
Now, the question is: how much do our results depend on the Normal-Lognormal-Gamma construction?
In fact, the choice of the Normal distribution as starting point is not crucial. What really counts is the Lognormal emerging from the different layers of uncertainty on the parameter of choice, and the fact that such a Lognormal is riskier than a Gamma with the same coefficient of variation. If we start from an Exponential distribution with some intensity parameter, and on that we apply the same reasoning we developed for the Normal, we will again generate fat tails: the compounding of an Exponential and a Gamma–which we use as "lower bound" for the Lognormal–generates a Lomax, or Pareto II, distribution, a well-known example of a fat-tailed distribution. If, conversely, we start from a Gamma, we obtain a compound Gamma or a beta prime distribution (depending on parametrizations), two other cases of (possibly) fat-tailed distributions. Finally, even when dealing with discrete distributions, our approach may easily generate extremes (it is not strictly correct to speak about fat tails with discrete distributions, given that some convergence results used in the continuous case do not hold). For example, if we start from a Poisson with a given intensity and apply our layers of uncertainty, what we get is a Negative Binomial (or something resembling a Negative Binomial without the Gamma approximation): a skewed distribution, possibly with high kurtosis, used in risk management to model credit risk losses.
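The discrete case can be illustrated directly: mixing the Poisson intensity over a Gamma (our "lower bound" for the Lognormal) yields exactly a Negative Binomial. (Parameter values are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(11)

# Poisson intensity lambda ~ Gamma(r, rate b); the compound law is
# Negative Binomial with r successes and p = b / (1 + b), with
# mean r / b and variance strictly larger than the mean.
r, b, size = 3, 1.0, 200_000
lam = rng.gamma(shape=r, scale=1.0 / b, size=size)
mixed = rng.poisson(lam)
nb = rng.negative_binomial(n=r, p=b / (1.0 + b), size=size)

print(mixed.mean(), nb.mean())     # both close to r / b = 3
print(mixed.var() > mixed.mean())  # overdispersion
```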
VIII Applications and Consequences
The consequences in terms of risk management are clear: ignoring errors on errors induces a significant underestimation of tail risk. Those in terms of forecasting are less obvious: even if past data shows thin-tailed properties, future data must be treated as fatter-tailed.

We can also see why out-of-sample performance can show degradation compared to in-sample properties: the future is necessarily to be treated as more fat-tailed than the past.
More philosophically, our approach can help explain the central point in The Black Swan: one must deal with the future as if it were to deliver more frequent (or more impactful) tail events than what is gathered from our knowledge of the past. (Taking all estimates and risks as objective truth, i.e., attaching certitude to the estimate, is the greatest mistake a decision maker can make.) Uncertainty does not only emerge from the limits of our knowledge and the natural unpredictability (at least to a large extent) of complex systems and the future [27, 18, 34]; it also permeates all our quantitative models of the real world, through our unavoidable errors. Understanding this limit is an essential step for effective risk management, as it helps in shaping that sense of humility and precaution that makes us prefer doubts and imprecisions to false certainties.
-  T. Aquinas (1256). De Veritate. English translation in Truth (1994), Hackett Publishing Company.
-  J.M. Bernardo, A.F.M. Smith (2000). Bayesian Theory. Wiley.
-  F.H. Bradley (1914). Essays on Truth and Reality. Clarendon Press.
-  P. Cirillo (2013). Are your data really Pareto distributed? Physica A: Statistical Mechanics and its Applications 392, 5947-5962.
-  P.C. Consul, G.C. Jain (1971). On the log-gamma distribution and its properties. Statistical Papers 12, 100-106.
-  B. de Finetti (1974). Theory of Probability, Volume 1. Wiley.
-  B. de Finetti (1975). Theory of Probability, Volume 2. Wiley.
-  B. de Finetti (2006). L’invenzione della verità. R. Cortina.
-  L. de Haan, A. F. Ferreira (2006). Extreme Value Theory. An Introduction. Springer.
-  T. Childers (2013). Philosophy and Probability. Oxford University Press.
-  E. Derman (1996). Model Risk. Risk 9, 139-145.
-  D. Draper (1995). Assessment and Propagation of Model Uncertainty. Journal of the Royal Statistical Society. Series B (Methodological) 57, 45-97.
-  B. Dupire (1994). Pricing with a Smile. Risk 7, 18-20.
-  P. Embrechts, C. Klüppelberg, T. Mikosch (2003). Modelling Extremal Events. Springer.
-  W. Feller (1968). An Introduction to Probability Theory and Its Applications, Volume 1, 3rd edition. Wiley.
-  C.F. Gauss (1809). Theoria Motus Corporum Coelestium. Biodiversity Heritage Library, https://doi.org/10.5962/bhl.title.19023
-  R. Gibrat (1931). Les Inégalités économiques. Recueil Sirey.
-  G. Gigerenzer (2003). Reckoning with Risk: Learning to Live with Uncertainty. Penguin.
-  C. Kleiber, S. Kotz (2003). Statistical Size Distributions in Economics and Actuarial Sciences. Wiley.
-  S.A. Klugman, H.H. Panjer, G.E. Willmot (1998). Loss Models: From Data to Decisions. Wiley.
-  I. Levi (1967). Gambling with Truth: An Essay on Induction and the Aims of Science.
-  D. Lewis (1973). Counterfactuals. John Wiley & Sons.
-  N. Johnson, S. Kotz, N. Balakrishnan (1994). Continuous Univariate Distributions, Volume 1. Wiley.
-  P.S. de Laplace (1814). Essai Philosophique sur les Probabilités. Translated as A Philosophical Essay on Probabilities by E.T. Bell (1902). Reprint, Dover Publications.
-  P. McCullagh, J. A. Nelder (1989). Generalized Linear Models. Chapman and Hall/CRC.
-  A.J. McNeil, R. Frey, P. Embrechts (2015). Quantitative Risk Management. Princeton University Press.
-  F. Knight (1921). Risk, Uncertainty, and Profit. Houghton Mifflin Company.
-  F.W. Nietzsche (1873). Über Wahrheit und Lüge im Aussermoralischen Sinne. English translation in The Portable Nietzsche (1976). Viking Press.
-  A. Papoulis (1984). Probability, Random Variables, and Stochastic Processes. McGraw-Hill.
-  N. Rescher (1983). Risk - A Philosophical Introduction to the Theory of Risk Evaluation and Management. University Press of America.
-  T. Rolski, H. Schmidli, V. Schmidt, J.L. Teugels (2009). Stochastic Processes for Insurance and Finance. Wiley.
-  B. Russell (1958). The Will to Doubt. New York Philosophical Library.
-  J. Shao (1998). Mathematical Statistics. Springer.
-  G.L.S. Shackle (1968). Expectations, Investment and Income. Clarendon Press.
-  N.N. Taleb, A. Pilpel (2004). I problemi epistemologici del risk management [The epistemological problems of risk management]. In D. Pace (ed.), Economia del rischio. Antologia di scritti su rischio e decisione economica. Giuffrè, Milano.
-  N.N. Taleb, A. Pilpel (2007). Epistemology and Risk Management. Risk and Regulation 13, 6-7.
-  N.N. Taleb (2001-2018). Incerto. Random House and Penguin.
-  N.N. Taleb (2019). The Technical Incerto. STEM Press.
-  Von Plato, J. (1994). Creating modern probability: Its mathematics, physics and philosophy in historical perspective. Cambridge University Press.
-  M. West (1987). On scale mixtures of normal distributions. Biometrika 74, 646-648.