It is well known that the income distribution, as well as many other size distributions of economic interest, exhibits Pareto (power law) tails,[1] meaning that the tail probability $P(Y > y)$ decays like a power function $y^{-\alpha}$ for large $y$, where $\alpha > 0$ is called the Pareto exponent. Knowing the Pareto exponent is often of considerable practical interest because it determines the shape of the income distribution for the rich and hence income inequality.
[1] Pareto (1896, 1897) discovered that the rank size distribution of income shows a straight line pattern on a log-log plot, which implies a power law. The power law in size distributions of economic variables has been documented for city size (Auerbach, 1913; Zipf, 1949; Gabaix, 1999; Giesen et al., 2010; Rozenfeld et al., 2011), firm size (Axtell, 2001), wealth (Klass et al., 2006; Vermeulen, 2018), and consumption (Toda and Walsh, 2015; Toda, 2017), among others.
When individual data on income are available, it is relatively straightforward to estimate and conduct inference on the Pareto exponent, either by maximum likelihood (Hill, 1975), log rank regressions (Gabaix and Ibragimov, 2011), fixed-$k$ asymptotics (Müller and Wang, 2017), or other methods. Even if individual data are not available, if we have binned data we can still estimate the Pareto exponent by eyeballing (Pareto, 1897) or maximum likelihood (Virkar and Clauset, 2014). However, in practice it is often the case (especially for administrative data) that only some top income shares are reported and individual data are not available. A typical example is Table 1 below, which summarizes the U.S. household income distribution.[2] Such income data in the form of tabulations are quite common, including in the World Inequality Database.[3]
[2] These numbers are taken from Table A.3 (top income shares including capital gains) of the updated spreadsheet for Piketty and Saez (2003), which can be downloaded at https://eml.berkeley.edu/~saez/TabFig2017prel.xls.
[3] https://wid.world/
Table 1: U.S. household top income shares, by year and top income percentile.
In this paper, we propose an efficient estimation method for the Pareto exponent when only certain top income shares are available. Our method is based on the following observations. By definition, top income shares are the ratio between the sum of order statistics for some top percentile and total income. Assuming that the upper tail of the income distribution is Pareto, we derive the asymptotic distribution of normalized top income shares using the results on the weighted sums of order statistics due to Stigler (1974). From this result, we define the classical minimum distance (CMD) estimator (Chiang, 1956; Ferguson, 1958) and derive its asymptotic properties.
In particular, we typically cannot identify the shape of the underlying distribution without observing individual data. But if we assume the sample size is large enough (not necessarily known) and the underlying distribution is Pareto, we can show that the normalized top shares are jointly asymptotically Gaussian with the mean vector and the variance-covariance matrix being characterized by the Pareto exponent and the scale parameter. Since the scale parameter is not identified given only the shares, we eliminate it by imposing scale invariance and considering a self-normalized statistic whose distribution is still jointly normal but now fully characterized by the Pareto exponent only. Thus, the problem is asymptotically equivalent to estimating a single parameter in a joint normal distribution using a random draw from it. The efficient solution is then to consider the continuously updated minimum distance estimator (CUMDE). As we show in simulations, this estimator has excellent finite sample properties when the model is correctly specified.
When the data generating process is not exactly Pareto (such as Student-$t$ or double Pareto-lognormal distributions), our estimator still performs well when we only use small enough top percentiles such as the top 1% and the sample size is large enough, which is typically the case for income share data based on tax returns (where the number of households is on the order of a million). Such robustness to misspecification is valid as long as the tail of the underlying distribution can be well approximated by a Pareto. This condition is technically referred to as the Domain of Attraction assumption, which is satisfied by almost all commonly used distributions. See, for example, de Haan and Ferreira (2006, Chapter 1) for more discussion.
2 Weighted sums of order statistics
In this section we derive the asymptotic distribution of the weighted sums of order statistics of a Pareto distribution, which we subsequently use to construct the estimator of the Pareto exponent.
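Let $Y_1, \dots, Y_n$ be independent and identically distributed (i.i.d.) copies of a positive random variable $Y$ with cumulative distribution function (CDF) $F$ and density $f$. Let $Y_{(1)} \ge \dots \ge Y_{(n)}$ denote the order statistics. Following Stigler (1974), consider the weighted sum
$$S_n = \frac{1}{n} \sum_{i=1}^n J\left(\frac{i}{n+1}\right) Y_{(i)}, \tag{2.1}$$
where $J : [0,1] \to \mathbb{R}$ is a function that is bounded and continuous almost everywhere with respect to the Lebesgue measure. When $J(x) = 1\{p < x \le q\}$ for some $0 < p < q \le 1$, $S_n$ can be interpreted as the sum of $Y_{(i)}$'s between the top $100p$ and $100q$ percentiles, divided by the sample size $n$.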
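As a rough numerical illustration of the weighted sum, the following sketch computes it for a simulated Pareto sample. It assumes descending order statistics and an indicator weight that selects ranks between the top $100p$ and $100q$ percentiles; the function name `top_group_mean`, the parameter values, and the closed-form limit (obtained here by integrating the Pareto quantile function $x^{-1/\alpha}$ with minimum size 1) are assumptions made for illustration, not taken from the paper.

```python
import random


def top_group_mean(sample, p, q):
    """Sum of descending order statistics with p < i/(n+1) <= q, divided by n."""
    n = len(sample)
    ordered = sorted(sample, reverse=True)  # ordered[0] is the largest observation
    total = 0.0
    for i, y in enumerate(ordered, start=1):
        if p < i / (n + 1) <= q:
            total += y
    return total / n


random.seed(0)
alpha = 3.0                      # illustrative Pareto exponent (finite variance)
n = 200_000
sample = [random.paretovariate(alpha) for _ in range(n)]  # minimum size 1

p, q = 0.01, 0.10
s_n = top_group_mean(sample, p, q)

# Large-sample limit: integral of x**(-xi) over (p, q], with xi = 1/alpha.
xi = 1.0 / alpha
limit = (q ** (1 - xi) - p ** (1 - xi)) / (1 - xi)
print(s_n, limit)
```

With a sample this large the two printed numbers should be close, which is the kind of agreement the asymptotic normality result below quantifies.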
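The following lemma shows that $S_n$ is asymptotically normal.
Let $S_n$ be as in (2.1). Then
$$\sqrt{n}(S_n - \mu) \xrightarrow{d} N(0, \sigma^2),$$
where
$$\mu = \int_0^1 J(x) F^{-1}(1-x)\,dx, \qquad \sigma^2 = \int_0^1\!\!\int_0^1 J(x) J(y)\,\frac{\min\{x,y\} - xy}{f(F^{-1}(1-x))\,f(F^{-1}(1-y))}\,dx\,dy.$$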
The statement follows from Stigler (1974, Theorem 5) and the change of variable . Note that implies for . ∎
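In the remainder of the paper, we assume that $Y$ is Pareto distributed with Pareto exponent $\alpha$ and minimum size $c$, so $F(y) = 1 - (y/c)^{-\alpha}$ for $y \ge c$. The Pareto exponent $\alpha$ captures the shape and the minimum size $c$ characterizes the scale. Then by simple algebra, we obtain
$$F^{-1}(1-x) = c x^{-1/\alpha}, \qquad f(F^{-1}(1-x)) = \frac{\alpha}{c}\, x^{1+1/\alpha}.$$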
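Since $Y$ is Pareto distributed, we can explicitly compute the moments in Lemma 2.1.
Let $S_n$ be as in (2.1) with $J(x) = 1\{p < x \le q\}$ and $F$ be the Pareto CDF with exponent $\alpha > 1$ and minimum size $c$. Letting $\xi = 1/\alpha$, we have
$$\mu = c\,\frac{q^{1-\xi} - p^{1-\xi}}{1-\xi}, \tag{2.4a}$$
$$\sigma^2 = \frac{2c^2\xi^2}{1-\xi}\left[\frac{q^{1-2\xi} - p^{1-2\xi}}{1-2\xi} - \frac{q^{2-2\xi} - p^{2-2\xi}}{2-2\xi} - p^{1-\xi}\left(\frac{p^{-\xi} - q^{-\xi}}{\xi} - \frac{q^{1-\xi} - p^{1-\xi}}{1-\xi}\right)\right], \tag{2.4b}$$
where $\frac{q^{1-2\xi} - p^{1-2\xi}}{1-2\xi}$ is interpreted as $\log(q/p)$ if $\xi = 1/2$.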
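Next, we consider the joint distribution of the sums of $Y_{(i)}$'s over some top percentile groups. Suppose that there are $K$ groups indexed by $k = 1, \dots, K$, and the $k$-th group corresponds to the top $100p_k$ to $100p_{k+1}$ percentile, where $0 < p_1 < \dots < p_{K+1} \le 1$. Define
$$\bar{Y}_k = \frac{1}{n} \sum_{i=\lfloor np_k \rfloor + 1}^{\lfloor np_{k+1} \rfloor} Y_{(i)},$$
where $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$.[4] By Lemmas 2.1 and 2.2, we have
$$\sqrt{n}(\bar{Y} - \mu) \xrightarrow{d} N(0, \Sigma), \tag{2.6}$$
where $\bar{Y} = (\bar{Y}_1, \dots, \bar{Y}_K)'$, $\mu = (\mu_1, \dots, \mu_K)'$, and $\Sigma$ is some variance matrix with $\Sigma_{kk} = \sigma_k^2$. Here $\mu_k$ and $\sigma_k^2$ denote the asymptotic mean and variance for the $k$-th group. The following lemma gives an explicit formula for $\Sigma$.
[4] We exclude the largest $\lfloor np_1 \rfloor$ order statistics since the average of them may not satisfy a central limit theorem due to the potentially heavy tail ($\alpha < 2$).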
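The variance matrix $\Sigma$ in (2.6) is symmetric and
$$\Sigma_{kl} = c^2\xi^2\,\frac{p_{k+1}^{1-\xi} - p_k^{1-\xi}}{1-\xi}\left(\frac{p_l^{-\xi} - p_{l+1}^{-\xi}}{\xi} - \frac{p_{l+1}^{1-\xi} - p_l^{1-\xi}}{1-\xi}\right) \quad \text{for } k < l.$$
Furthermore, $\Sigma$ is positive definite.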
3 Minimum distance estimator
In practice, the income distribution is often presented as a tabulation of top income shares as in Table 1, and micro data are not available. If $Y$ is distributed as Pareto with exponent $\alpha > 1$ and minimum size $c$, using $P(Y > y) = (y/c)^{-\alpha}$, the top $p$ percentile is
$$y = c\,p^{-1/\alpha}.$$
Using (2.3a), the total income (per capita) held by the top $p$ percentile is
$$\int_0^p c\,x^{-1/\alpha}\,dx = \frac{c\alpha}{\alpha - 1}\,p^{1-1/\alpha}.$$
Therefore the top $p$ income share is
$$S(p) = p^{1-1/\alpha},$$
which depends only on $\alpha$. If $Y$ is Pareto only for the upper tail, a similar calculation yields
$$\frac{S(p)}{S(q)} = \left(\frac{p}{q}\right)^{1-1/\alpha} \tag{3.1}$$
for $p < q$. Aoki and Nirei (2017, Figure 3) calibrate the U.S. income Pareto exponent from (3.1) using a particular pair of percentiles $p$ and $q$. A natural question is whether such calibration can be statistically justified for tabulation data as in Table 1. In this section, we derive such an estimator and discuss its asymptotic properties.
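The calibration idea can be sketched in a few lines. For an exact Pareto distribution with exponent $\alpha > 1$, the share of the top fraction $p$ is $p^{1-1/\alpha}$, so the ratio of two top shares can be inverted for $\alpha$. The function names and the percentiles used below are illustrative assumptions, not the values used by Aoki and Nirei (2017).

```python
import math


def top_share(p, alpha):
    """Income share of the top fraction p under an exact Pareto law (alpha > 1)."""
    return p ** (1.0 - 1.0 / alpha)


def alpha_from_share_ratio(share_p, share_q, p, q):
    """Invert share_p / share_q = (p / q) ** (1 - 1/alpha) for alpha."""
    one_minus_xi = math.log(share_p / share_q) / math.log(p / q)
    return 1.0 / (1.0 - one_minus_xi)


# Round trip with an illustrative exponent and illustrative percentiles.
alpha_true = 1.5
s1 = top_share(0.01, alpha_true)   # top 1% share
s10 = top_share(0.10, alpha_true)  # top 10% share
alpha_calibrated = alpha_from_share_ratio(s1, s10, 0.01, 0.10)
print(alpha_calibrated)
```

This point calibration uses only two shares and comes with no standard error, which is the gap the estimator developed in this section is meant to fill.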
3.1 Asymptotic theory
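Let $\{Y_i\}_{i=1}^n$ be the (unobserved) income data and $Y_{(1)} \ge \dots \ge Y_{(n)}$ the order statistics. Let $0 < p_1 < \dots < p_{K+1} \le 1$ and suppose that some top percentiles $p_1, \dots, p_{K+1}$ and the corresponding top income shares
$$S_k = \frac{\sum_{i=1}^{\lfloor np_k \rfloor} Y_{(i)}}{\sum_{i=1}^{n} Y_{(i)}}, \quad k = 1, \dots, K+1,$$
are given. Suppose that $p_{K+1}$ is small enough such that for $i \le \lfloor np_{K+1} \rfloor$, we may assume that $Y_{(i)}$ are realizations from a Pareto distribution with exponent $\alpha$ and minimum size $c$. To construct an estimator of $\alpha$ based only on $S_1, \dots, S_{K+1}$, we consider the vector $\bar{s} = (\bar{s}_1, \dots, \bar{s}_{K-1})'$ of self-normalized non-overlapping top income shares defined by
$$\bar{s}_k = \frac{S_{k+1} - S_k}{S_{K+1} - S_K}, \quad k = 1, \dots, K-1. \tag{3.2}$$
The following proposition shows that $\bar{s}$ is asymptotically normal.
Let $r_k = \mu_k / \mu_K$, where $\mu_k$ is given by (2.4a). Define the $(K-1)$-vector $r = (r_1, \dots, r_{K-1})'$ and the $(K-1) \times K$ matrix $H = [I_{K-1}, -r]$. Then
$$\sqrt{n}(\bar{s} - r) \xrightarrow{d} N(0, \Omega), \qquad \Omega = \frac{1}{\mu_K^2} H \Sigma H'.$$
The variance matrix $\Omega$ depends only on $\alpha$ and is positive definite.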
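We define the classical minimum distance (CMD) estimator of $\alpha$ by
$$\hat{\alpha} = \arg\min_{\alpha \in A}\,(\bar{s} - r(\alpha))'\,\hat{W}\,(\bar{s} - r(\alpha)), \tag{3.3}$$
where $\hat{W}$ is some symmetric and positive definite weighting matrix and $A \subset (1, \infty)$ is some compact parameter space.
Let $G_n(\alpha)$ be the objective function in (3.3). Suppose that $\hat{W} \xrightarrow{p} W$ as $n \to \infty$, where $W$ is also positive definite. Letting $\alpha_0$ be the true Pareto exponent, we have
$$G_n(\alpha) \xrightarrow{p} G(\alpha) = (r(\alpha_0) - r(\alpha))'\,W\,(r(\alpha_0) - r(\alpha)).$$
Since $W$ is positive definite, we have $G(\alpha) \ge 0$, with equality if and only if $r(\alpha) = r(\alpha_0)$. The following proposition shows that the parameter is point-identified by this condition.
[Identification] $r(\alpha) = r(\alpha_0)$ implies $\alpha = \alpha_0$.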
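Using standard arguments, consistency and asymptotic normality follow from the above identification result.
[Consistency] Let $A \subset (1, \infty)$ be compact, $\alpha_0 \in A$, and suppose $\hat{W} \xrightarrow{p} W$ as $n \to \infty$, where $W$ is positive definite. Let $\hat{\alpha}$ be the minimum distance estimator in (3.3). Then $\hat{\alpha} \xrightarrow{p} \alpha_0$.
[Asymptotic normality] Let everything be as in Theorem 3.1 and suppose that $\alpha_0$ is an interior point of $A$. Then
$$\sqrt{n}(\hat{\alpha} - \alpha_0) \xrightarrow{d} N(0, V)$$
as $n \to \infty$, where
$$V = (R'WR)^{-1} R'W\Omega WR\,(R'WR)^{-1}$$
for $R = \nabla r(\alpha_0)$ and $\Omega = \Omega(\alpha_0)$.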
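By standard results in classical minimum distance estimation (Chiang, 1956; Ferguson, 1958), we achieve efficiency by choosing the weighting matrix such that $W = \Omega^{-1}$. Therefore the most natural estimator is the following continuously updated minimum distance estimator (CUMDE).
[Efficient CMD] Let everything be as in Theorem 3.1 and define the continuously updated minimum distance estimator (CUMDE) by
$$\hat{\alpha} = \arg\min_{\alpha \in A}\,(\bar{s} - r(\alpha))'\,\Omega(\alpha)^{-1}\,(\bar{s} - r(\alpha)), \tag{3.4}$$
where $\Omega(\alpha)$ is given as in Proposition 3.1. Then
$$\sqrt{n}(\hat{\alpha} - \alpha_0) \xrightarrow{d} N\left(0, (R'\Omega^{-1}R)^{-1}\right),$$
where $R = \nabla r(\alpha_0)$ and $\Omega = \Omega(\alpha_0)$. $\hat{\alpha}$ has the minimum asymptotic variance among all CMD estimators. We can use Corollary 3.1 to construct confidence intervals of $\alpha$.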
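We now consider testing the null hypothesis $H_0$: $\alpha = \alpha_0$ against the alternative $H_1$: $\alpha \ne \alpha_0$. The following propositions show that we can implement likelihood ratio and specification tests, which avoid computing the derivative of $r$. We omit the proofs since they are analogous to standard GMM results (Newey and McFadden, 1994, Section 9). The likelihood ratio test can also be inverted to construct the confidence interval. Below, $\hat{G}(\alpha) = (\bar{s} - r(\alpha))'\,\Omega(\alpha)^{-1}\,(\bar{s} - r(\alpha))$ denotes the objective function in (3.4) and $\hat{\alpha}$ its minimizer.
[Likelihood ratio test] Under the null $H_0$: $\alpha = \alpha_0$, we have
$$n\left(\hat{G}(\alpha_0) - \hat{G}(\hat{\alpha})\right) \xrightarrow{d} \chi^2(1).$$
Under the alternative $H_1$: $\alpha \ne \alpha_0$, we have
$$n\left(\hat{G}(\alpha_0) - \hat{G}(\hat{\alpha})\right) \xrightarrow{p} \infty.$$
[Specification test] Suppose that $K \ge 3$. If $F$ is the Pareto CDF with some exponent $\alpha > 1$, then
$$n\,\hat{G}(\hat{\alpha}) \xrightarrow{d} \chi^2(K-2).$$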
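Inverting the likelihood ratio test can be sketched generically: collect all parameter values at which $n$ times the increase of the objective over its minimum stays below the 95% critical value of $\chi^2(1)$. In the sketch below, the function name `lr_confidence_interval`, the grid, and the quadratic toy objective are hypothetical stand-ins; in an application the objective would be the continuously updated minimum distance criterion and the minimum over the grid approximates its value at the estimate.

```python
CHI2_1_95 = 3.841458820694124  # 0.95 quantile of the chi-square(1) distribution


def lr_confidence_interval(objective, n, grid, critical=CHI2_1_95):
    """Return (lower, upper) bounds of the grid points not rejected by the LR test."""
    values = [objective(a) for a in grid]
    g_min = min(values)  # grid approximation of the minimized objective
    accepted = [a for a, g in zip(grid, values) if n * (g - g_min) <= critical]
    return min(accepted), max(accepted)


def toy_objective(a):
    """Toy quadratic criterion minimized at a = 1.5 (illustration only)."""
    return (a - 1.5) ** 2


grid = [1.0 + 0.001 * i for i in range(1001)]  # candidate exponents in [1, 2]
lower, upper = lr_confidence_interval(toy_objective, n=10_000, grid=grid)
print(lower, upper)
```

The accepted set need not be an interval for a general objective, so reporting its smallest and largest elements is a simplification.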
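By Corollary 3.1, we can compute $\hat{\alpha}$ by numerically solving the minimization problem (3.4). However, it is clear from Lemmas 2.2 and 2.3 that $\xi = 1/\alpha$ shows up everywhere in $r$ and $\Omega$, and hence it is more convenient to optimize over $\xi$ instead of $\alpha$. With a slight abuse of notation, let $r(\xi)$ and $\Omega(\xi)$ be the values of $r$ and $\Omega$ corresponding to $\alpha = 1/\xi$. We can thus estimate $\xi$ (and $\alpha$) using the following algorithm.
Given the top income share data $S_1, \dots, S_{K+1}$ for the top $p_1, \dots, p_{K+1}$ percentiles, define the normalized shares $\bar{s}$ by (3.2).
For $\xi \in (0, 1)$, define $r(\xi)$ and $\Omega(\xi)$.
Define the objective function
$$G(\xi) = (\bar{s} - r(\xi))'\,\Omega(\xi)^{-1}\,(\bar{s} - r(\xi))$$
and compute the minimizer $\hat{\xi}$ of $G$ over $\xi \in (0, 1)$. The point estimate of the Pareto exponent is $\hat{\alpha} = 1/\hat{\xi}$.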
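The steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the group means and covariances are closed forms derived here for an exact Pareto tail with descending order statistics and minimum size normalized to 1 (the scale cancels after self-normalization), and they may be arranged differently from the paper's lemmas. All function names are hypothetical, and the minimizer is found by repeated grid refinement rather than any particular optimizer.

```python
import numpy as np


def power_integral(a, lo, hi):
    """Integral of y**(a - 1) over [lo, hi]; log(hi / lo) in the limit a -> 0."""
    if abs(a) < 1e-9:
        return np.log(hi / lo)
    return (hi ** a - lo ** a) / a


def pareto_moments(xi, p):
    """Asymptotic means and covariance matrix of the K group sums (scale c = 1)."""
    K = len(p) - 1
    A = np.array([power_integral(1 - xi, p[k], p[k + 1]) for k in range(K)])
    B = np.array([power_integral(-xi, p[k], p[k + 1])
                  - power_integral(1 - xi, p[k], p[k + 1]) for k in range(K)])
    Sigma = np.empty((K, K))
    for k in range(K):
        lo, hi = p[k], p[k + 1]
        inner = (power_integral(1 - 2 * xi, lo, hi)
                 - power_integral(2 - 2 * xi, lo, hi)
                 - lo ** (1 - xi) * B[k])
        Sigma[k, k] = 2 * xi ** 2 / (1 - xi) * inner
        for l in range(k + 1, K):
            Sigma[k, l] = Sigma[l, k] = xi ** 2 * A[k] * B[l]
    return A, Sigma


def objective(xi, s_bar, p):
    """Continuously updated minimum distance criterion as a function of xi."""
    mu, Sigma = pareto_moments(xi, p)
    K = len(mu)
    r = mu[:-1] / mu[-1]
    H = np.hstack([np.eye(K - 1), -r[:, None]])
    Omega = H @ Sigma @ H.T / mu[-1] ** 2
    d = s_bar - r
    return float(d @ np.linalg.solve(Omega, d))


def estimate_alpha(shares, p, lo=0.05, hi=0.95, rounds=6, points=101):
    """Minimize the criterion over xi by repeated grid refinement; return 1 / xi."""
    shares = np.asarray(shares, dtype=float)
    p = np.asarray(p, dtype=float)
    gaps = np.diff(shares)            # non-overlapping group shares
    s_bar = gaps[:-1] / gaps[-1]      # self-normalized by the last group
    for _ in range(rounds):
        grid = np.linspace(lo, hi, points)
        values = [objective(x, s_bar, p) for x in grid]
        i = int(np.argmin(values))
        lo, hi = grid[max(i - 1, 0)], grid[min(i + 1, points - 1)]
    return 1.0 / grid[i]


# Usage with exact population shares of a Pareto law (illustrative exponent 1.5).
p = [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]
shares = [q ** (1 - 1 / 1.5) for q in p]
alpha_hat = estimate_alpha(shares, p)
print(alpha_hat)
```

Because the usage example feeds in exact population shares, the criterion is zero at the true value and the estimate should recover the exponent up to grid resolution. With real tabulations the minimized value would be positive.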
We evaluate the finite sample properties of the continuously updated minimum distance estimator (3.4) through simulations. We consider three data generating processes (DGPs): (i) Pareto distribution, (ii) absolute value of the Student-$t$ distribution, and (iii) double Pareto-lognormal distribution (dPlN). For the Pareto distribution, we fix the Pareto exponent and (without loss of generality) normalize the minimum size. For the Student-$t$ distribution, we set the degrees of freedom to 2 so that the Pareto exponent is 2. The double Pareto-lognormal distribution is the product of independent double Pareto (Reed, 2001) and lognormal variables. dPlN has been documented to fit well to size distributions of economic variables including income (Reed, 2003), city size (Giesen et al., 2010), and consumption (Toda, 2017). Reed and Jorgensen (2004) show that a dPlN variable can be generated as
$$Y = \exp\left(\mu + \sigma Z + E_1/\alpha - E_2/\beta\right),$$
where $Z, E_1, E_2$ are independent and $Z \sim N(0, 1)$ and $E_1, E_2 \sim \mathrm{Exp}(1)$. For the parameters $\mu$, $\sigma$, $\alpha$, and $\beta$, we use values that are typical for income data (Toda, 2012).
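A draw from the double Pareto-lognormal distribution can be sketched with the standard library. The sketch assumes the representation in which the log of the variable is a normal term plus one standard exponential divided by the upper-tail exponent minus another divided by the lower-tail exponent. The function name `draw_dpln` and the parameter values are illustrative assumptions, not the values used in the paper.

```python
import math
import random


def draw_dpln(mu, sigma, alpha, beta, rng):
    """One dPlN draw: exp(mu + sigma * Z + E1 / alpha - E2 / beta)."""
    z = rng.gauss(0.0, 1.0)      # standard normal component
    e1 = rng.expovariate(1.0)    # standard exponential, drives the upper tail
    e2 = rng.expovariate(1.0)    # standard exponential, drives the lower tail
    return math.exp(mu + sigma * z + e1 / alpha - e2 / beta)


rng = random.Random(0)
# Hypothetical parameter values chosen only for illustration.
params = dict(mu=0.0, sigma=0.5, alpha=2.0, beta=1.0)
draws = [draw_dpln(rng=rng, **params) for _ in range(100_000)]

# Sanity check: E[log Y] = mu + 1/alpha - 1/beta under this representation.
mean_log = sum(math.log(y) for y in draws) / len(draws)
expected_mean_log = params["mu"] + 1 / params["alpha"] - 1 / params["beta"]
print(mean_log, expected_mean_log)
```

Under this representation the upper tail of the draws is approximately Pareto with exponent `alpha`, which is why the distribution is a natural misspecification check for the estimator.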
The simulation design is as follows. For each DGP, we generate i.i.d. samples for each of three sample sizes $n$. We set the top percentiles to
$$(p_1, \dots, p_6) = (0.01, 0.1, 0.5, 1, 5, 10)\%,$$
which are the percentiles for income considered in Piketty and Saez (2003). Because the distribution is not exactly Pareto for DGPs 2 and 3, we expect the estimation to suffer from model misspecification when we use a top income percentile as large as 10% ($p_6$). Therefore, to evaluate the robustness against model misspecification, we also consider using only the top 5% group ($p_1$ to $p_5$) and the top 1% group ($p_1$ to $p_4$). Thus, in total there are 27 specifications (three DGPs, three sample sizes, and three choices of top income percentiles). For each specification, we estimate $\alpha$, construct the confidence interval based on inverting the likelihood ratio test in Proposition 3.1, and implement the specification test in Proposition 3.1 using the algorithm in Section 3.2. The numbers are based on repeated Monte Carlo simulations. Table 2 shows the simulation results.
We can make a few observations from Table 2. First, when the model is correctly specified (Pareto), the finite sample properties are excellent. In particular, the coverage rate is close to the nominal value 0.95. In this case, using more top percentiles (including the top 10%) is more efficient (has smaller bias and RMSE) because it exploits more information. Second, when the model is misspecified (Student-$t$ or dPlN distributions), including large top percentiles (10%) leads to large bias and incorrect coverage. Thus, it is preferable to use only percentiles within the top 1% or 5% for robustness against potential model misspecification. This is seen from the rejection probability of the specification test. Third, when the sample size is large (as is typical for administrative data) and we use the top 1% group, the finite sample properties are good for all distributions considered here.
4 Pareto exponents in the U.S. and France
As an application, we estimate the Pareto exponent of the income distribution in the U.S. for the period 1917–2017 and France for 1900–2014. For the U.S., we use the updated top income share data (including capital gains) from Piketty and Saez (2003) (see Footnote 2 for details). For France, we obtain the top income shares from the World Inequality Database (Footnote 3).
Figure 1(a) plots the top 1% and 10% income shares (including capital gains) for the U.S. As is well known, the series are roughly parallel and exhibit a U-shaped pattern over the century. Figure 1(b) plots the Pareto exponent estimated as in Section 3.2. “Top 1%” uses the top 0.01%, 0.1%, 0.5%, and 1% groups, whereas “Top 10%” also includes the top 5% and 10% groups. We do not present the confidence interval because the sample size is unknown but very large, which suggests that the standard errors are tiny based on the simulation findings in Table 2.
We can make a few observations from Figure 1(b). First, the Pareto exponent estimates are significantly different when using the top 1% and 10% groups. Based on the simulation results in Table 2, this suggests that the income distribution is not exactly Pareto and that the 10% result is biased. Therefore we should focus on the top 1% result. The Pareto exponent ranges from 1.34 to 2.29. Second, Figures 1(a) and 1(b) tell different stories about income inequality. While the top 1% income share in Figure 1(a) has been rising roughly linearly since about 1975, the Pareto exponent in Figure 1(b) declines sharply (implying increased inequality) between about 1975 and 1985 but has remained flat since then. This observation suggests that the rise in inequality since 1985 as seen in Figure 1(a) is mainly driven by the redistribution between the rich (top 1%) and the poor (bottom 99%), and there is no evidence of increased inequality among the rich.
Figure 2 repeats the analysis for France. Again, the point estimates of the Pareto exponent when using the top 1% and 10% groups differ significantly, and therefore we should focus on the 1% result. Unlike in the U.S., where 1960–1980 appears to be an unusual period of low inequality (high Pareto exponent), in France the Pareto exponent is relatively stable at around 1.5 prewar and 2 postwar. Therefore there seems to be a regime change at around World War II, corroborating the analysis of Piketty (2003).
This paper develops an efficient minimum distance estimator of the Pareto exponent using only top income share data. This is especially relevant in studying income inequality since individual-level data for the very rich are usually unavailable for confidentiality reasons. Our estimator is consistent and asymptotically normal, and performs well in finite samples as shown by Monte Carlo simulations. In particular, we recommend using only shares within the top 1% rather than the top 10% to study the tail of the income distribution. We estimate the Pareto exponent to be around 1.5 and stable since 1985 in the U.S., and around 1.5 before and 2 after World War II in France.
Appendix A Proofs
Proof of Lemma 2.2.
Proof of Lemma 2.3.
The formula for follows from Lemma 2. Suppose and let be the asymptotic variance of . On the one hand, we have
On the other hand, noting that is asymptotically equivalent as in Lemma 2 with
it follows from the proof of Lemma 2 that
Clearly we have and . By Fubini’s theorem, . Therefore . Since and hence , we obtain
which is (2.4b).
To show that is positive definite, noting that and (A.1) holds, we have
where . Take any vector . Then as in the proof of Lemma 2, we obtain
where . Since is piece-wise continuous, we can take an absolutely continuous primitive function such that . By the fundamental theorem of calculus, we obtain
Let be the integral ignoring the factor 2. Using integration by parts, we obtain
so is positive semidefinite. Since is continuous, equality holds if and only if . Therefore is positive definite. ∎
Proof of Proposition 3.1.
Let . Since and by (2.6), using the definition of , , and , we obtain
Expressing this in matrix form, we obtain
Since by Lemma 2.2 each $\mu_k$ is proportional to $c$ and each element of $\Sigma$ is proportional to $c^2$, the vector $r$ and matrix $\Omega$ depend only on $\alpha$. Since $\Sigma$ is positive definite by Lemma 2.3 and $H$ has full row rank, $\Omega$ is also positive definite. ∎
Proof of Proposition 3.1.
Let , , and . Then