 # Efficient Minimum Distance Estimation of Pareto Exponent from Top Income Shares

We propose an efficient estimation method for the income Pareto exponent when only certain top income shares are observable. Our estimator is based on the asymptotic theory of weighted sums of order statistics and the efficient minimum distance estimator. Simulations show that our estimator has excellent finite sample properties. We apply our estimation method to the U.S. top income share data and find that the Pareto exponent has been stable at around 1.5 since 1985, suggesting that the rise in inequality during the last three decades is mainly driven by redistribution between the rich and poor, not among the rich.


## 1 Introduction

It is well known that the income distribution and many other size distributions of economic interest exhibit Pareto (power law) tails (Footnote 1), meaning that the tail probability $P(Y > y)$ decays like a power function $y^{-\alpha}$ for large $y$, where $\alpha > 0$ is called the Pareto exponent. Oftentimes, knowing the Pareto exponent is of considerable practical interest because it determines the shape of the income distribution for the rich and hence income inequality.

Footnote 1: Pareto (1896, 1897) discovered that the rank size distribution of income shows a straight line pattern on a log-log plot, which implies a power law. The power law in size distributions of economic variables has been documented for city size (Auerbach, 1913; Zipf, 1949; Gabaix, 1999; Giesen et al., 2010; Rozenfeld et al., 2011), firm size (Axtell, 2001), wealth (Klass et al., 2006; Vermeulen, 2018), and consumption (Toda and Walsh, 2015; Toda, 2017), among others.

When individual data on income are available, it is relatively straightforward to estimate and conduct inference on the Pareto exponent, either by maximum likelihood (Hill, 1975), log rank regressions (Gabaix and Ibragimov, 2011), fixed-$k$ asymptotics (Müller and Wang, 2017), or other methods. Even if individual data are not available, with binned data we can still estimate the Pareto exponent by eyeballing (Pareto, 1897) or maximum likelihood (Virkar and Clauset, 2014). However, in practice it is often the case (especially for administrative data) that only some top income shares are reported and individual data are not available. A typical example is Table 1 below, which summarizes the U.S. household income distribution (Footnote 2). Such income data in the form of tabulations are quite common, including in the World Inequality Database.

Footnote 2: These numbers are taken from Table A.3 (top income shares including capital gains) of the updated spreadsheet for Piketty and Saez (2003), which can be downloaded at https://eml.berkeley.edu/~saez/TabFig2017prel.xls.

In this paper, we propose an efficient estimation method for the Pareto exponent when only certain top income shares are available. Our method is based on the following observations. By definition, top income shares are the ratio between the sum of order statistics for some top percentile and total income. Assuming that the upper tail of the income distribution is Pareto, we derive the asymptotic distribution of normalized top income shares using the results on the weighted sums of order statistics due to Stigler (1974). From this result, we define the classical minimum distance (CMD) estimator (Chiang, 1956; Ferguson, 1958) and derive its asymptotic properties.

In particular, we typically cannot identify the shape of the underlying distribution without observing individual data. But if we assume the sample size is large enough (not necessarily known) and the underlying distribution is Pareto, we can show that the normalized top shares are jointly asymptotically Gaussian with the mean vector and the variance-covariance matrix being characterized by the Pareto exponent and the scale parameter. Since the scale parameter is not identified given only the shares, we eliminate it by imposing scale invariance and considering a self-normalized statistic whose distribution is still jointly normal but now fully characterized by the Pareto exponent only. Thus, the problem is asymptotically equivalent to estimating a single parameter in a joint normal distribution using a random draw from it. The efficient solution is then to consider the continuously updated minimum distance estimator (CUMDE). As we show in simulations, this estimator has excellent finite sample properties when the model is correctly specified.

When the data generating process is not exactly Pareto (such as Student-$t$ or double Pareto-lognormal distributions), our estimator still performs well provided that we use only small enough top percentiles, such as the top 1%, and the sample size is large enough. The latter is typically the case for income share data based on tax returns, where the number of households is on the order of a million. Such robustness to misspecification is valid as long as the tail of the underlying distribution can be well approximated by a Pareto distribution. This condition is technically referred to as the Domain of Attraction assumption, which is satisfied by almost all commonly used distributions. See, for example, de Haan and Ferreira (2006, Chapter 1) for more discussion.

The rest of this paper is organized as follows. Section 2 sets up the order statistic framework, on which we build the asymptotics and construct our estimator in Section 3. Section 4 applies the new estimator to actual income data in the U.S. and France. Longer proofs are deferred to Appendix A.

## 2 Weighted sums of order statistics

In this section we derive the asymptotic distribution of the weighted sums of order statistics of a Pareto distribution, which we subsequently use to construct the estimator of the Pareto exponent.

Let $Y_1, \dots, Y_n$ be independent and identically distributed (i.i.d.) copies of a positive random variable $Y$ with cumulative distribution function (CDF) $F$ and density $f$. Let

$$Y_{(1)} \ge \dots \ge Y_{(n)}$$

denote the order statistics. Following Stigler (1974), consider the weighted sum

$$L_n = \frac{1}{n} \sum_{i=1}^n J\left(\frac{i}{n+1}\right) Y_{(n-i+1)},$$

where $J \colon [0,1] \to \mathbb{R}$ is a function that is bounded and continuous almost everywhere with respect to the Lebesgue measure. When

$$J(x) = 1[1-q < x \le 1-p] \tag{2.1}$$

for some $0 < p < q \le 1$, $L_n$ can be interpreted as the sum of the $Y_i$'s between the top $100p$ and $100q$ percentiles, divided by the sample size $n$.

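The following lemma shows that $L_n$ is asymptotically normal.

###### Lemma 2.1.

Let $J$ be as in (2.1). Then

$$\sqrt{n}\,(L_n - \mu(J,F)) \xrightarrow{d} N(0, \sigma^2(J,F)),$$

where

$$\mu(J,F) = \int_0^1 J(x) F^{-1}(x) \, dx, \tag{2.2a}$$

$$\sigma^2(J,F) = \int_0^1 \int_0^1 \frac{J(x_1) J(x_2)}{f(F^{-1}(x_1)) f(F^{-1}(x_2))} \left(\min\{x_1, x_2\} - x_1 x_2\right) dx_1 \, dx_2. \tag{2.2b}$$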
###### Proof.

The statement follows from Stigler (1974, Theorem 5) and the change of variable $x = F(y)$. Note that $p > 0$ implies $J(x) = 0$ for $x > 1 - p$. ∎

In the remainder of the paper, we assume that $Y$ is Pareto distributed with Pareto exponent $\alpha$ and minimum size $c$, so $F(y) = 1 - (y/c)^{-\alpha}$ for $y \ge c$. The Pareto exponent captures the shape and the minimum size characterizes the scale. Then by simple algebra, we obtain

$$f(y) = F'(y) = \alpha c^\alpha y^{-\alpha-1}, \tag{2.3a}$$

$$F^{-1}(x) = c (1-x)^{-1/\alpha}, \tag{2.3b}$$

$$f(F^{-1}(x)) = \frac{\alpha}{c} (1-x)^{1+1/\alpha}. \tag{2.3c}$$

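When $Y$ is Pareto distributed, we can explicitly compute the moments in Lemma 2.1.

###### Lemma 2.2.

Let $J$ be as in (2.1) and $F$ be the Pareto CDF with exponent $\alpha$ and minimum size $c$. Letting $\xi = 1/\alpha$, we have

$$\mu(J,F) = \mu(p,q) := c \, \frac{q^{1-\xi} - p^{1-\xi}}{1-\xi}, \tag{2.4a}$$

$$\sigma^2(J,F) = \sigma^2(p,q) := \frac{2c^2\xi^2}{1-\xi} \left( \frac{q^{1-2\xi} - p^{1-2\xi}}{1-2\xi} + p^{1-\xi} \frac{q^{-\xi} - p^{-\xi}}{\xi} + \frac{2p^{1-\xi}q^{1-\xi} - p^{2-2\xi} - q^{2-2\xi}}{2-2\xi} \right), \tag{2.4b}$$

where $\frac{q^{a} - p^{a}}{a}$ is interpreted as $\log(q/p)$ if $a = 0$.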
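As a numerical sanity check of (2.4a) and (2.4b), the following minimal Python sketch compares the closed forms with midpoint-rule approximations of the integrals (2.2a) and (2.2b). The function names, the grid size `m`, and the example values $p = 0.01$, $q = 0.1$, $\xi = 0.5$ are illustrative assumptions and are not taken from the paper.

```python
import math

def power_diff(q, p, a):
    """(q**a - p**a) / a, with the limit log(q / p) when a == 0."""
    if abs(a) < 1e-12:
        return math.log(q / p)
    return (q**a - p**a) / a

def mu(p, q, xi, c=1.0):
    """Closed form (2.4a); requires 0 < p < q <= 1 and 0 < xi < 1."""
    return c * power_diff(q, p, 1.0 - xi)

def sigma2(p, q, xi, c=1.0):
    """Closed form (2.4b); requires 0 < p < q <= 1 and 0 < xi < 1."""
    a = 1.0 - xi
    t1 = power_diff(q, p, 1.0 - 2.0 * xi)
    t2 = p**a * (q**(-xi) - p**(-xi)) / xi
    t3 = (2.0 * p**a * q**a - p**(2.0 * a) - q**(2.0 * a)) / (2.0 * a)
    return 2.0 * c**2 * xi**2 / a * (t1 + t2 + t3)

def _midpoints(p, q, m):
    """Midpoints and step of a uniform grid on [1 - q, 1 - p], the support of J."""
    lo, hi = 1.0 - q, 1.0 - p
    h = (hi - lo) / m
    return [lo + (i + 0.5) * h for i in range(m)], h

def mu_quadrature(p, q, xi, c=1.0, m=300):
    """Midpoint-rule approximation of (2.2a) with F^{-1} from (2.3b)."""
    xs, h = _midpoints(p, q, m)
    return h * sum(c * (1.0 - x) ** (-xi) for x in xs)

def sigma2_quadrature(p, q, xi, c=1.0, m=300):
    """Midpoint-rule approximation of the double integral (2.2b)."""
    xs, h = _midpoints(p, q, m)
    # By (2.3c), 1 / f(F^{-1}(x)) = c * xi * (1 - x)^{-(1 + xi)}.
    w = [c * xi * (1.0 - x) ** (-(1.0 + xi)) for x in xs]
    total = 0.0
    for x1, w1 in zip(xs, w):
        for x2, w2 in zip(xs, w):
            total += w1 * w2 * (min(x1, x2) - x1 * x2)
    return total * h * h

if __name__ == "__main__":
    p, q, xi = 0.01, 0.1, 0.5
    print(mu(p, q, xi), mu_quadrature(p, q, xi))
    print(sigma2(p, q, xi), sigma2_quadrature(p, q, xi))
```

With $\xi = 1/2$ the exponent $1 - 2\xi$ vanishes, so this example also exercises the logarithmic limit mentioned in the lemma.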

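Next, we consider the joint distribution of the sums of the $Y_i$'s over some top percentile groups. Suppose that there are $K$ groups indexed by $k = 1, \dots, K$, and the $k$-th group corresponds to the top $100p_k$ to $100p_{k+1}$ percentile, where $0 < p_1 < \dots < p_{K+1} \le 1$. Define

$$\bar{Y}_k = \frac{1}{n} \sum_{i=\lfloor np_k \rfloor + 1}^{\lfloor np_{k+1} \rfloor} Y_{(i)}, \tag{2.5}$$

where $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$ (Footnote 4).

Footnote 4: We exclude the largest $\lfloor np_1 \rfloor$ order statistics since the average of them may not satisfy a central limit theorem due to the potentially heavy tail ($\alpha < 2$).

By Lemmas 2.1 and 2.2, we have

$$\sqrt{n}\,(\bar{Y}_k - \mu_k) \xrightarrow{d} N(0, \sigma_k^2),$$

where $\mu_k = \mu(p_k, p_{k+1})$ and $\sigma_k^2 = \sigma^2(p_k, p_{k+1})$ are given by (2.4a) and (2.4b), respectively. Let $\bar{Y} = (\bar{Y}_1, \dots, \bar{Y}_K)^\top$ and $\mu = (\mu_1, \dots, \mu_K)^\top$. Then by the Cramér-Wold device, it follows that

$$\sqrt{n}\,(\bar{Y} - \mu) \xrightarrow{d} N(0, \Sigma), \tag{2.6}$$

where $\Sigma$ is some variance matrix with $\Sigma_{kk} = \sigma_k^2$. The following lemma gives an explicit formula for $\Sigma$.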

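###### Lemma 2.3.

The variance matrix $\Sigma$ in (2.6) is symmetric and

$$\Sigma_{jk} = \begin{cases} \sigma_k^2 = \sigma^2(p_k, p_{k+1}), & (j = k) \\ -c^2\xi^2 \, \dfrac{p_{j+1}^{1-\xi} - p_j^{1-\xi}}{1-\xi} \left( \dfrac{p_{k+1}^{-\xi} - p_k^{-\xi}}{\xi} + \dfrac{p_{k+1}^{1-\xi} - p_k^{1-\xi}}{1-\xi} \right). & (j < k) \end{cases} \tag{2.7}$$

Furthermore, $\Sigma$ is positive definite.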

## 3 Minimum distance estimator

In practice, the income distribution is often presented as a tabulation of top income shares as in Table 1, and micro data are not available. If $Y$ is Pareto distributed with exponent $\alpha > 1$ and minimum size $c$, using $F(y) = 1 - (y/c)^{-\alpha}$, the top $100p$ percentile is

$$1 - (y/c)^{-\alpha} = 1 - p \iff y = c p^{-1/\alpha}.$$

Using (2.3a), the total income held by the top $100p$ percentile is

$$Y(p) := \int_{cp^{-1/\alpha}}^\infty y \, \alpha c^\alpha y^{-\alpha-1} \, dy = \frac{c\alpha}{\alpha-1} p^{1-1/\alpha}.$$

Therefore the top income share is

$$S(p) := Y(p)/Y(1) = p^{1-1/\alpha},$$

which depends only on $\alpha$. If $F$ is Pareto only for the upper tail, a similar calculation yields

$$S(p)/S(q) = (p/q)^{1-1/\alpha} \iff \alpha = \frac{1}{1 - \dfrac{\log(S(q)/S(p))}{\log(q/p)}} \tag{3.1}$$

for $0 < p < q$ within the Pareto tail. Aoki and Nirei (2017, Figure 3) calibrate the U.S. income Pareto exponent from (3.1) using a particular pair of top percentiles. A natural question is whether such calibration can be statistically justified for tabulated data as in Table 1. In this section, we derive such an estimator and discuss its asymptotic properties.
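The two-point calibration (3.1) is simple enough to state in a few lines of Python. The sketch below is only an illustration: the function name and the example inputs (an exact Pareto tail with $\alpha = 1.5$ evaluated at the top 0.1% and 1%) are assumptions made here, not values used by Aoki and Nirei (2017) or by the paper.

```python
import math

def pareto_exponent_from_two_shares(p, s_p, q, s_q):
    """Invert (3.1): alpha = 1 / (1 - log(S(q)/S(p)) / log(q/p)), with 0 < p < q."""
    slope = math.log(s_q / s_p) / math.log(q / p)  # equals 1 - 1/alpha under a Pareto tail
    return 1.0 / (1.0 - slope)

if __name__ == "__main__":
    alpha = 1.5                       # hypothetical true exponent
    share = lambda p: p ** (1.0 - 1.0 / alpha)   # S(p) = p^(1 - 1/alpha)
    print(pareto_exponent_from_two_shares(0.001, share(0.001), 0.01, share(0.01)))
```

Because only the ratio $S(q)/S(p)$ enters, the formula is unaffected by whether the distribution is Pareto below the top $100q$ percentile.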

### 3.1 Asymptotic theory

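Let $\{Y_i\}_{i=1}^n$ be the (unobserved) income data and $Y_{(1)} \ge \dots \ge Y_{(n)}$ the order statistics. Let $K \ge 2$ and suppose that some top percentiles $0 < p_1 < \dots < p_{K+1} \le 1$ and the corresponding top income shares

$$S_k = \frac{\sum_{i=1}^{\lfloor np_k \rfloor} Y_{(i)}}{\sum_{i=1}^n Y_{(i)}}, \quad k = 1, \dots, K+1,$$

are given. Suppose that $p_{K+1}$ is small enough such that for $i \le \lfloor np_{K+1} \rfloor$, we may assume that the $Y_{(i)}$ are realizations from a Pareto distribution with exponent $\alpha$ and minimum size $c$. To construct an estimator of $\alpha$ based only on $\{S_k\}_{k=1}^{K+1}$, we consider the vector of self-normalized non-overlapping top income shares defined by

$$\bar{s} = (\bar{s}_1, \dots, \bar{s}_{K-1})^\top := \left( \frac{S_2 - S_1}{S_{K+1} - S_K}, \dots, \frac{S_K - S_{K-1}}{S_{K+1} - S_K} \right)^\top. \tag{3.2}$$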
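To make the definitions of $S_k$ and $\bar{s}$ concrete, the following Python sketch computes top income shares from a vector of micro data and then forms the self-normalized shares (3.2). It assumes NumPy is available; the function names and the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def top_shares(y, p):
    """Top income shares S_k: income of the floor(n * p_k) largest observations over total income."""
    y_desc = np.sort(np.asarray(y, dtype=float))[::-1]    # Y_(1) >= ... >= Y_(n)
    cum = np.cumsum(y_desc)
    n = y_desc.size
    # A small epsilon guards against floating-point error in n * p_k.
    counts = np.floor(n * np.asarray(p, dtype=float) + 1e-9).astype(int)
    if counts.min() < 1:
        raise ValueError("each percentile must contain at least one observation")
    return cum[counts - 1] / cum[-1]

def normalized_shares(S):
    """Self-normalized non-overlapping shares (3.2) from S_1, ..., S_{K+1}."""
    d = np.diff(np.asarray(S, dtype=float))   # S_{k+1} - S_k for k = 1, ..., K
    return d[:-1] / d[-1]

if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0, 10.0]            # toy income data
    S = top_shares(y, [0.2, 0.4, 1.0])
    print(S, normalized_shares(S))
```

Since both numerator and denominator of each component of $\bar{s}$ are differences of shares, total income cancels, which is why the scale parameter $c$ drops out.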

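The following proposition shows that $\bar{s}$ is asymptotically normal.

###### Proposition 3.1.

Let $r_k = \mu_k/\mu_K$, where $\mu_k = \mu(p_k, p_{k+1})$ is given by (2.4a). Define the $(K-1)$-vector $r = (r_1, \dots, r_{K-1})^\top$ and the $(K-1) \times K$ matrix $H = \frac{1}{\mu_K} \begin{bmatrix} I_{K-1} & -r \end{bmatrix}$. Then

$$\sqrt{n}\,(\bar{s} - r) \xrightarrow{d} N(0, H \Sigma H^\top).$$

The variance matrix $\Omega := H \Sigma H^\top$ depends only on $\alpha$ and is positive definite.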

Based on Proposition 3.1, it is natural to consider the classical minimum distance (CMD) estimator (Chiang, 1956; Ferguson, 1958)

$$\hat{\alpha} = \operatorname*{arg\,min}_{\alpha \in A} \, (r(\alpha) - \bar{s})^\top \hat{W} (r(\alpha) - \bar{s}), \tag{3.3}$$

where $\hat{W}$ is some symmetric and positive definite weighting matrix and $A$ is some compact parameter space.

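Let $G_n(\alpha)$ be the objective function in (3.3). Suppose that $\hat{W} \xrightarrow{p} W$ as $n \to \infty$, where $W$ is also positive definite. Letting $\alpha_0$ be the true Pareto exponent, we have

$$G_n(\alpha) \xrightarrow{p} G(\alpha) := (r(\alpha) - r(\alpha_0))^\top W (r(\alpha) - r(\alpha_0)).$$

Since $W$ is positive definite, we have $G(\alpha) \ge 0$, with equality if and only if $r(\alpha) = r(\alpha_0)$. The following proposition shows that the parameter $\alpha$ is point-identified by this condition.

###### Proposition (Identification).

$r(\alpha) = r(\alpha_0)$ implies $\alpha = \alpha_0$.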

Using standard arguments, consistency and asymptotic normality follow from the above identification result.

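###### Theorem (Consistency).

Let $A$ be compact, $\alpha_0 \in A$, and suppose $\hat{W} \xrightarrow{p} W$ as $n \to \infty$, where $W$ is positive definite. Let $\hat{\alpha}$ be the minimum distance estimator in (3.3). Then $\hat{\alpha} \xrightarrow{p} \alpha_0$.

###### Proof.

Clearly the objective function is continuous in $\alpha$. The statement follows from the identification proposition above, the uniform law of large numbers, and Newey and McFadden (1994, Theorem 2.1). ∎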

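###### Theorem (Asymptotic normality).

Let everything be as in the consistency theorem above and suppose that $\alpha_0$ is an interior point of $A$. Then

$$\sqrt{n}\,(\hat{\alpha} - \alpha_0) \xrightarrow{d} N(0, V)$$

as $n \to \infty$, where

$$V = (R^\top W R)^{-1} R^\top W \Omega W R \, (R^\top W R)^{-1}$$

for $R = r'(\alpha_0)$ and $\Omega = \Omega(\alpha_0)$.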

###### Proof.

Immediate from Theorem 3.1 and Newey and McFadden (1994, Theorem 3.2). ∎

By standard results in classical minimum distance estimation (Chiang, 1956; Ferguson, 1958), we achieve efficiency by choosing the weighting matrix such that $W = \Omega^{-1}$. Therefore the most natural estimator is the following continuously updated minimum distance estimator (CUMDE).

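###### Corollary (Efficient CMD).

Let everything be as in the consistency theorem above and define the continuously updated minimum distance estimator (CUMDE) by

$$\hat{\alpha} = \operatorname*{arg\,min}_{\alpha \in A} \, (r(\alpha) - \bar{s})^\top \Omega(\alpha)^{-1} (r(\alpha) - \bar{s}), \tag{3.4}$$

where $\Omega(\alpha)$ is given as in Proposition 3.1. Then

$$\sqrt{n}\,(\hat{\alpha} - \alpha_0) \xrightarrow{d} N\left(0, (R^\top \Omega^{-1} R)^{-1}\right),$$

where $R = r'(\alpha_0)$ and $\Omega = \Omega(\alpha_0)$. $\hat{\alpha}$ has the minimum asymptotic variance among all CMD estimators.

We can use Corollary 3.1 to construct confidence intervals for $\alpha$.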

We now consider testing the null hypothesis $H_0$ that the true Pareto exponent equals a hypothesized value $\alpha$ against the alternative $H_1$ that it does not. The following propositions show that we can implement likelihood ratio and specification tests, which avoid computing the derivative of $r$. We omit the proofs since they are analogous to standard GMM results (Newey and McFadden, 1994, Section 9). The likelihood ratio test can also be inverted to construct the confidence interval.

###### Proposition (Likelihood ratio test).

Under the null $H_0$, we have

$$n\,(G_n(\alpha) - G_n(\hat{\alpha})) \xrightarrow{d} \chi^2(1).$$

Under the alternative $H_1$, we have

$$n\,(G_n(\alpha) - G_n(\hat{\alpha})) \xrightarrow{p} \infty.$$

###### Proposition (Specification test).

Suppose that $K \ge 3$. If $F$ is the Pareto CDF with some exponent $\alpha$, then

$$n\,G_n(\hat{\alpha}) \xrightarrow{d} \chi^2(K-2).$$
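Inverting the likelihood ratio test amounts to collecting all hypothesized values that are not rejected. The Python sketch below does this over a user-supplied grid for a generic objective function `G`; the function name, the grid, and the quadratic toy objective in the usage example are assumptions for illustration only. The constant is the 0.95 quantile of the $\chi^2(1)$ distribution.

```python
CHI2_1_95 = 3.841458820694124  # 0.95 quantile of the chi-square distribution with 1 degree of freedom

def lr_confidence_interval(G, alpha_hat, n, grid, critical_value=CHI2_1_95):
    """Smallest and largest grid points alpha with n * (G(alpha) - G(alpha_hat)) <= critical value.

    If the accepted set is not an interval, this returns its convex hull on the grid.
    """
    g_hat = G(alpha_hat)
    accepted = [a for a in grid if n * (G(a) - g_hat) <= critical_value]
    if not accepted:                      # grid too coarse to contain any accepted point
        return alpha_hat, alpha_hat
    return min(accepted), max(accepted)

if __name__ == "__main__":
    G = lambda a: (a - 1.5) ** 2          # toy objective, minimized at 1.5
    grid = [1.0 + i / 1000 for i in range(1001)]
    print(lr_confidence_interval(G, 1.5, 100, grid))
```

In an application, `G` would be the continuously updated objective in (3.4), and the sample size `n` must be known.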

### 3.2 Implementation

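By Corollary 3.1, we can compute $\hat{\alpha}$ by numerically solving the minimization problem (3.4). However, it is clear from Lemmas 2.2 and 2.3 that $\xi = 1/\alpha$ shows up everywhere in $r$ and $\Omega$, and hence it is more convenient to optimize over $\xi$ instead of $\alpha$. With a slight abuse of notation, let $r(\xi)$ and $\Omega(\xi)$ be the values of $r$ and $\Omega$ corresponding to $\alpha = 1/\xi$. We can thus estimate $\xi$ (and $\alpha$) using the following algorithm.

1. Given the top income share data $\{S_k\}_{k=1}^{K+1}$ for the top $\{p_k\}_{k=1}^{K+1}$ percentiles, define the normalized shares $\bar{s}$ by (3.2).

2. For $k = 1, \dots, K-1$, define $r_k(\xi) = \dfrac{p_{k+1}^{1-\xi} - p_k^{1-\xi}}{p_{K+1}^{1-\xi} - p_K^{1-\xi}}$ and $r(\xi) = (r_1(\xi), \dots, r_{K-1}(\xi))^\top$.

3. Define $\Omega(\xi)$ using (2.4a), (2.4b), (2.7), and Proposition 3.1, where we can set $c = 1$ without loss of generality.

4. Define the objective function

$$G(\xi) = (r(\xi) - \bar{s})^\top \Omega(\xi)^{-1} (r(\xi) - \bar{s})$$

and compute the minimizer $\hat{\xi}$ of $G$ over $\xi \in (0,1)$. The point estimate of the Pareto exponent is $\hat{\alpha} = 1/\hat{\xi}$.

5. If the sample size $n$ is known, use Corollary 3.1 or the likelihood ratio test (Proposition 3.1) to construct the confidence interval. It is simpler to construct the confidence interval of $\xi$ using

$$\frac{r_k'(\xi)}{r_k(\xi)} = \frac{p_{K+1}^{1-\xi} \log p_{K+1} - p_K^{1-\xi} \log p_K}{p_{K+1}^{1-\xi} - p_K^{1-\xi}} - \frac{p_{k+1}^{1-\xi} \log p_{k+1} - p_k^{1-\xi} \log p_k}{p_{k+1}^{1-\xi} - p_k^{1-\xi}}$$

and taking the reciprocal to convert to $\alpha$.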
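The point-estimation part of the algorithm (steps 1 to 4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it requires NumPy, sets $c = 1$, replaces the numerical minimization by a plain grid search over $\xi \in [0.05, 0.95]$, and all function names are hypothetical. The usage example feeds in exact Pareto shares $S_k = p_k^{1-\xi_0}$, so the objective is zero at the true $\xi_0$.

```python
import numpy as np

def pdiff(q, p, a):
    """(q**a - p**a) / a, using the limit log(q / p) when a is numerically zero."""
    return np.log(q / p) if abs(a) < 1e-12 else (q**a - p**a) / a

def r_and_omega(p, xi):
    """r(xi) and Omega(xi) = H Sigma H' for percentiles p_1 < ... < p_{K+1}, with c = 1."""
    p = np.asarray(p, dtype=float)
    lo, hi = p[:-1], p[1:]                     # group k covers the top 100*p_k to 100*p_{k+1} percentile
    K, a = lo.size, 1.0 - xi
    mu = (hi**a - lo**a) / a                   # (2.4a)
    tail = (hi**(-xi) - lo**(-xi)) / xi
    Sigma = np.empty((K, K))
    for k in range(K):
        t1 = pdiff(hi[k], lo[k], 1.0 - 2.0 * xi)
        t3 = -(hi[k]**a - lo[k]**a) ** 2 / (2.0 * a)
        Sigma[k, k] = 2.0 * xi**2 / a * (t1 + lo[k]**a * tail[k] + t3)   # (2.4b)
        for j in range(k):                                              # (2.7), j < k
            Sigma[j, k] = Sigma[k, j] = -xi**2 * mu[j] * (tail[k] + mu[k])
    r = mu[:-1] / mu[-1]
    H = np.hstack([np.eye(K - 1), -r[:, None]]) / mu[-1]
    return r, H @ Sigma @ H.T

def objective(xi, p, s_bar):
    """Continuously updated objective G(xi) from step 4."""
    r, omega = r_and_omega(p, xi)
    d = r - s_bar
    return float(d @ np.linalg.solve(omega, d))

def cumde(p, S, grid=None):
    """Grid-search CUMDE: returns (alpha_hat, xi_hat) from top shares S_1, ..., S_{K+1}."""
    d = np.diff(np.asarray(S, dtype=float))
    s_bar = d[:-1] / d[-1]                     # normalized shares (3.2)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 1801)   # assumed search range for xi
    values = [objective(xi, p, s_bar) for xi in grid]
    xi_hat = float(grid[int(np.argmin(values))])
    return 1.0 / xi_hat, xi_hat

if __name__ == "__main__":
    p = np.array([0.0001, 0.001, 0.005, 0.01])   # top 0.01%, 0.1%, 0.5%, 1%
    S = p ** (1.0 - 0.5)                         # exact Pareto shares with alpha = 2
    print(cumde(p, S))
```

A grid search is used here only because it is transparent and cannot get stuck; a one-dimensional numerical optimizer started from the grid minimizer would refine the estimate.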

### 3.3 Simulation

We evaluate the finite sample properties of the continuously updated minimum distance estimator (3.4) through simulations. We consider three data generating processes (DGPs): (i) the Pareto distribution, (ii) the absolute value of the Student-$t$ distribution, and (iii) the double Pareto-lognormal distribution (dPlN). For the Pareto distribution, we fix the Pareto exponent $\alpha$ and (without loss of generality) the minimum size $c$. For the Student-$t$ distribution, we set the degrees of freedom to $\nu = 2$ so that the Pareto exponent is 2. The double Pareto-lognormal distribution is the product of independent double Pareto (Reed, 2001) and lognormal variables. dPlN has been documented to fit well to size distributions of economic variables including income (Reed, 2003), city size (Giesen et al., 2010), and consumption (Toda, 2017). Reed and Jorgensen (2004) show that a dPlN variable can be generated as

$$X = \exp(\mu + \sigma X_1 + X_2/\alpha - X_3/\beta),$$

where $X_1, X_2, X_3$ are independent, $X_1 \sim N(0,1)$, and $X_2, X_3 \sim \mathrm{Exp}(1)$. For the parameters $\mu$, $\sigma$, $\alpha$, and $\beta$, we use values that are typical for income data (Toda, 2012).
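The representation of Reed and Jorgensen (2004) translates directly into a sampler. The following Python sketch assumes NumPy; the function name and the parameter values in the usage example are placeholders chosen here for illustration and are not the values used in the simulations.

```python
import numpy as np

def sample_dpln(n, mu, sigma, alpha, beta, rng):
    """Draw n double Pareto-lognormal variates via X = exp(mu + sigma*X1 + X2/alpha - X3/beta)."""
    x1 = rng.standard_normal(n)        # X1 ~ N(0, 1)
    x2 = rng.exponential(1.0, n)       # X2 ~ Exp(1), governs the upper tail (exponent alpha)
    x3 = rng.exponential(1.0, n)       # X3 ~ Exp(1), governs the lower tail (exponent beta)
    return np.exp(mu + sigma * x1 + x2 / alpha - x3 / beta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder parameter values, for illustration only.
    x = sample_dpln(100_000, mu=0.0, sigma=0.5, alpha=2.0, beta=1.0, rng=rng)
    print(x.mean(), np.log(x).mean())
```

Since $E[\log X] = \mu + 1/\alpha - 1/\beta$, the sample mean of $\log X$ offers a quick check that the draws are generated as intended.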

The simulation design is as follows. For each DGP, we generate i.i.d. samples at three different sample sizes $n$. We set the top percentiles to

$$p = (p_1, p_2, p_3, p_4, p_5, p_6) = \frac{1}{100}(0.01, 0.1, 0.5, 1, 5, 10),$$

which are the percentiles for income considered in Piketty and Saez (2003). Because the distribution is not exactly Pareto for DGPs 2 and 3, we expect the estimation to suffer from model misspecification when we use a top income percentile as large as 10% ($K = 5$). Therefore, to evaluate the robustness against model misspecification, we also consider using only the percentiles up to the top 5% ($K = 4$) and up to the top 1% ($K = 3$). Thus, in total there are $3 \times 3 \times 3 = 27$ specifications (three DGPs, three sample sizes, and three choices of top income percentiles). For each specification, we estimate $\alpha$, construct the confidence interval based on inverting the likelihood ratio test in Proposition 3.1, and implement the specification test in Proposition 3.1 using the algorithm in Section 3.2. The numbers are based on repeated Monte Carlo simulations. Table 2 shows the simulation results.

We can make a few observations from Table 2. First, when the model is correctly specified (Pareto), the finite sample properties are excellent. In particular, the coverage rate is close to the nominal value 0.95. In this case, using more top percentiles (including the top 10%) is more efficient (has smaller bias and RMSE) because it exploits more information. Second, when the model is misspecified (Student-$t$ or dPlN distributions), including large top percentiles (10%) leads to large bias and incorrect coverage, as is also seen from the rejection probability of the specification test. Thus, it is preferable to use only percentiles within the top 1% or 5% for robustness against potential model misspecification. Third, when the sample size is large (as is typical for administrative data) and we use the top 1% group, the finite sample properties are good for all distributions considered here.

## 4 Pareto exponents in the U.S. and France

As an application, we estimate the Pareto exponent of the income distribution in the U.S. for the period 1917–2017 and France for 1900–2014. For the U.S., we use the updated top income share data (including capital gains) from Piketty and Saez (2003) (see Footnote 2 for details). For France, we obtain the top income shares from the World Inequality Database (Footnote 3).

Figure 1(a) plots the top 1% and 10% income shares (including capital gains) for the U.S. As is well known, the series are roughly parallel and exhibit a U-shaped pattern over the century. Figure 1(b) plots the Pareto exponent estimated as in Section 3.2. “Top 1%” uses the top 0.01%, 0.1%, 0.5%, and 1% groups ($K = 3$), whereas “Top 10%” also includes the top 5% and 10% groups ($K = 5$). We do not present the confidence interval because the sample size is unknown but very large, which suggests that the standard errors are tiny based on the simulation findings in Table 2.

We can make a few observations from Figure 1(b). First, the Pareto exponent estimates are significantly different when using the top 1% and 10% groups. Based on the simulation results in Table 2, this suggests that the income distribution is not exactly Pareto and that the 10% result is biased. Therefore we should focus on the top 1% result, for which the Pareto exponent ranges from 1.34 to 2.29. Second, Figures 1(a) and 1(b) tell different stories about income inequality. While the top 1% income share in Figure 1(a) has been rising roughly linearly since about 1975, the Pareto exponent in Figure 1(b) declines sharply (implying increased inequality) between about 1975 and 1985 but has remained flat since then. This observation suggests that the rise in inequality since 1985 seen in Figure 1(a) is mainly driven by redistribution between the rich (top 1%) and the poor (bottom 99%), and there is no evidence of increased inequality among the rich.

Figure 2 repeats the analysis for France. Again, the point estimates of the Pareto exponent when using the top 1% and 10% groups differ significantly, and therefore we should focus on the 1% result. Unlike in the U.S., where 1960–1980 appears to be an unusual period of low inequality (high Pareto exponent), in France the Pareto exponent is relatively stable at around 1.5 prewar and 2 postwar. Therefore there seems to be a regime change around World War II, corroborating the analysis of Piketty (2003).

## 5 Conclusion

This paper develops an efficient minimum distance estimator of the Pareto exponent using only top shares data. This is especially relevant in studying income inequality since individual-level data for the very rich are usually unavailable for confidentiality reasons. Our estimator is consistent and asymptotically normal, and performs excellently in finite samples as shown by Monte Carlo simulations. In particular, we recommend using only shares within the top 1% rather than the top 10% to study the tail of the income distribution. We estimate the Pareto exponent to be around 1.5 and stable since 1985 in the U.S., and around 1.5 before and 2 after WWII in France.

## Appendix A Proofs

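###### Proof of Lemma 2.2.

Using (2.2a), (2.3b), and the change of variable $v = 1 - x$, we obtain

$$\mu(J,F) = \int_0^1 J(x) F^{-1}(x) \, dx = \int_{1-q}^{1-p} c (1-x)^{-1/\alpha} \, dx = \int_p^q c v^{-\xi} \, dv = c \, \frac{q^{1-\xi} - p^{1-\xi}}{1-\xi},$$

which is (2.4a). To prove (2.4b), using symmetry, (2.2b), (2.3c), and the change of variable $v_i = 1 - x_i$, we obtain

$$\begin{aligned}
\sigma^2(J,F) &= 2 \int_{0 \le x_1 \le x_2 \le 1} \frac{J(x_1) J(x_2)}{f(F^{-1}(x_1)) f(F^{-1}(x_2))} \left(\min\{x_1, x_2\} - x_1 x_2\right) dx_1 \, dx_2 \\
&= 2 \int_{0 \le x_1 \le x_2 \le 1} \frac{J(x_1) J(x_2)}{f(F^{-1}(x_1)) f(F^{-1}(x_2))} \, x_1 (1 - x_2) \, dx_1 \, dx_2 \\
&= \frac{2c^2}{\alpha^2} \int_{p \le v_2 \le v_1 \le q} \frac{1}{v_1^{1+1/\alpha}} \frac{1}{v_2^{1+1/\alpha}} (1 - v_1) v_2 \, dv_1 \, dv_2 \\
&= 2c^2\xi^2 \int_p^q \int_p^{v_1} \frac{1 - v_1}{v_1^{1+\xi}} v_2^{-\xi} \, dv_2 \, dv_1 \\
&= \frac{2c^2\xi^2}{1-\xi} \int_p^q \frac{1 - v_1}{v_1^{1+\xi}} \left(v_1^{1-\xi} - p^{1-\xi}\right) dv_1 \\
&= \frac{2c^2\xi^2}{1-\xi} \int_p^q \left(v^{-2\xi} - v^{1-2\xi} - p^{1-\xi} v^{-1-\xi} + p^{1-\xi} v^{-\xi}\right) dv \\
&= \frac{2c^2\xi^2}{1-\xi} \left( \frac{q^{1-2\xi} - p^{1-2\xi}}{1-2\xi} - \frac{q^{2-2\xi} - p^{2-2\xi}}{2-2\xi} + p^{1-\xi} \frac{q^{-\xi} - p^{-\xi}}{\xi} + p^{1-\xi} \frac{q^{1-\xi} - p^{1-\xi}}{1-\xi} \right) \\
&= \frac{2c^2\xi^2}{1-\xi} \left( \frac{q^{1-2\xi} - p^{1-2\xi}}{1-2\xi} + p^{1-\xi} \frac{q^{-\xi} - p^{-\xi}}{\xi} + \frac{2p^{1-\xi}q^{1-\xi} - p^{2-2\xi} - q^{2-2\xi}}{2-2\xi} \right),
\end{aligned}$$

where $\frac{q^{a} - p^{a}}{a}$ is interpreted as $\log(q/p)$ if $a = 0$. ∎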

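###### Proof of Lemma 2.3.

The formula for $\Sigma_{kk}$ follows from Lemma 2.2. Suppose $j < k$ and let $v$ be the asymptotic variance of $\sqrt{n}\,(\bar{Y}_j + \bar{Y}_k)$. On the one hand, we have

$$v = \sigma_j^2 + \sigma_k^2 + 2\Sigma_{jk}.$$

On the other hand, noting that $\bar{Y}_j + \bar{Y}_k$ is asymptotically equivalent to $L_n$ in Lemma 2.1 with

$$J(x) = 1[1 - p_{j+1} < x \le 1 - p_j] + 1[1 - p_{k+1} < x \le 1 - p_k],$$

it follows from the proof of Lemma 2.2 that

$$v = I_{jj} + I_{jk} + I_{kj} + I_{kk},$$

where

$$I_{jk} = c^2\xi^2 \int_{p_j}^{p_{j+1}} \int_{p_k}^{p_{k+1}} \frac{1}{v_1^{1+\xi}} \frac{1}{v_2^{1+\xi}} \left(\min\{1 - v_1, 1 - v_2\} - (1 - v_1)(1 - v_2)\right) dv_1 \, dv_2. \tag{A.1}$$

Clearly we have $I_{jj} = \sigma_j^2$ and $I_{kk} = \sigma_k^2$. By Fubini’s theorem, $I_{jk} = I_{kj}$. Therefore $\Sigma_{jk} = I_{jk}$. Since $j < k$ and hence $v_2 \le v_1$, we obtain

$$\begin{aligned}
\Sigma_{jk} &= c^2\xi^2 \int_{p_j}^{p_{j+1}} \int_{p_k}^{p_{k+1}} \frac{1}{v_1^{1+\xi}} \frac{1}{v_2^{1+\xi}} (1 - v_1) v_2 \, dv_1 \, dv_2 \\
&= c^2\xi^2 \left( \int_{p_j}^{p_{j+1}} v_2^{-\xi} \, dv_2 \right) \left( \int_{p_k}^{p_{k+1}} \left(v_1^{-1-\xi} - v_1^{-\xi}\right) dv_1 \right) \\
&= c^2\xi^2 \, \frac{p_{j+1}^{1-\xi} - p_j^{1-\xi}}{1-\xi} \left( -\frac{p_{k+1}^{-\xi} - p_k^{-\xi}}{\xi} - \frac{p_{k+1}^{1-\xi} - p_k^{1-\xi}}{1-\xi} \right) \\
&= -c^2\xi^2 \, \frac{p_{j+1}^{1-\xi} - p_j^{1-\xi}}{1-\xi} \left( \frac{p_{k+1}^{-\xi} - p_k^{-\xi}}{\xi} + \frac{p_{k+1}^{1-\xi} - p_k^{1-\xi}}{1-\xi} \right),
\end{aligned}$$

which is (2.7).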

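To show that $\Sigma$ is positive definite, noting that $0 < p_1 < \dots < p_{K+1} \le 1$ and (A.1) holds, we have

$$\Sigma_{jk} = \int_{[p_1, 1]^2} \phi_j(v_1) \phi_k(v_2) \, \frac{\min\{1 - v_1, 1 - v_2\} - (1 - v_1)(1 - v_2)}{v_1 v_2} \, dv_1 \, dv_2,$$

where $\phi_k(v) = c\xi v^{-\xi} \, 1[p_k < v \le p_{k+1}]$. Take any vector $z = (z_1, \dots, z_K)^\top \in \mathbb{R}^K$. Then as in the proof of Lemma 2.2, we obtain

$$\begin{aligned}
z^\top \Sigma z &= 2 \sum_{j,k} z_j z_k \int_{p_1 \le v_2 \le v_1 \le 1} \phi_j(v_1) \phi_k(v_2) \, \frac{1 - v_1}{v_1} \, dv_1 \, dv_2 \\
&= 2 \sum_{j,k} z_j z_k \int_{p_1}^1 \int_{p_1}^{v_1} \phi_j(v_1) \phi_k(v_2) \, \frac{1 - v_1}{v_1} \, dv_2 \, dv_1 \\
&= 2 \int_{p_1}^1 \int_{p_1}^{v_1} \phi(v_1) \phi(v_2) \, \frac{1 - v_1}{v_1} \, dv_2 \, dv_1,
\end{aligned}$$

where $\phi = \sum_{k=1}^K z_k \phi_k$. Since $\phi$ is piecewise continuous, we can take an absolutely continuous primitive function $\Phi(v) = \int_{p_1}^v \phi(x) \, dx$ such that $\Phi' = \phi$ almost everywhere. By the fundamental theorem of calculus, we obtain

$$z^\top \Sigma z = 2 \int_{p_1}^1 \phi(v) \Phi(v) \, \frac{1 - v}{v} \, dv.$$

Let $I$ be the integral ignoring the factor 2. Using integration by parts, we obtain

$$\begin{aligned}
I &= \int_{p_1}^1 \phi(v) \Phi(v) \, \frac{1 - v}{v} \, dv = \int_{p_1}^1 \Phi'(v) \Phi(v) \, \frac{1 - v}{v} \, dv \\
&= \left[ \Phi(v)^2 \, \frac{1 - v}{v} \right]_{p_1}^1 - \int_{p_1}^1 \Phi(v) \left( \Phi'(v) \, \frac{1 - v}{v} - \frac{\Phi(v)}{v^2} \right) dv \\
&= -I + \int_{p_1}^1 \frac{\Phi(v)^2}{v^2} \, dv \\
\iff z^\top \Sigma z &= 2I = \int_{p_1}^1 \frac{\Phi(v)^2}{v^2} \, dv \ge 0,
\end{aligned}$$

so $\Sigma$ is positive semidefinite. Since $\Phi$ is continuous, equality holds if and only if $\Phi \equiv 0$, which is equivalent to $z = 0$. Therefore $\Sigma$ is positive definite. ∎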

###### Proof of Proposition 3.1.

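Let $Z = (Z_1, \dots, Z_K)^\top \sim N(0, \Sigma)$. Since $\bar{Y}_K \xrightarrow{p} \mu_K$ and $\sqrt{n}\,(\bar{Y} - \mu) \xrightarrow{d} Z$ by (2.6), using the definition of $\bar{s}_k$, $S_k$, and $\bar{Y}_k$, we obtain

$$\begin{aligned}
\sqrt{n}\,(\bar{s}_k - r_k) &= \sqrt{n} \left( \frac{S_{k+1} - S_k}{S_{K+1} - S_K} - \frac{\mu_k}{\mu_K} \right) = \sqrt{n} \left( \frac{n\bar{Y}_k / \sum Y_i}{n\bar{Y}_K / \sum Y_i} - \frac{\mu_k}{\mu_K} \right) \\
&= \sqrt{n} \left( \frac{\bar{Y}_k}{\bar{Y}_K} - \frac{\mu_k}{\mu_K} \right) = \frac{1}{\bar{Y}_K} \sqrt{n}\,(\bar{Y}_k - \mu_k) - \frac{\mu_k}{\mu_K \bar{Y}_K} \sqrt{n}\,(\bar{Y}_K - \mu_K) \\
&\xrightarrow{d} \frac{1}{\mu_K} Z_k - \frac{\mu_k}{\mu_K^2} Z_K.
\end{aligned}$$

Expressing this in matrix form, we obtain

$$\sqrt{n}\,(\bar{s} - r) \xrightarrow{d} HZ \sim N(0, H \Sigma H^\top).$$

Since by Lemma 2.2 each $\mu_k$ is proportional to $c$ and each element of $\Sigma$ is proportional to $c^2$, the vector $r$ and matrix $H \Sigma H^\top$ depend only on $\alpha$. Since $\Sigma$ is positive definite by Lemma 2.3 and $H$ has full row rank, $H \Sigma H^\top$ is also positive definite. ∎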

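###### Proof of Proposition (Identification).

Let $\xi = 1/\alpha$ and $\xi_0 = 1/\alpha_0$. If $r(\alpha) = r(\alpha_0)$, using $r_k = \mu_k/\mu_K$ and (2.4a), in particular

$$\begin{aligned}
r_{K-1}(\alpha) = r_{K-1}(\alpha_0) &\iff \frac{p_K^{1-\xi} - p_{K-1}^{1-\xi}}{p_{K+1}^{1-\xi} - p_K^{1-\xi}} = \frac{p_K^{1-\xi_0} - p_{K-1}^{1-\xi_0}}{p_{K+1}^{1-\xi_0} - p_K^{1-\xi_0}} \\
&\iff \frac{1 - a^{1-\xi}}{b^{1-\xi} - 1} = \frac{1 - a^{1-\xi_0}}{b^{1-\xi_0} - 1},
\end{aligned}$$

where $a = p_{K-1}/p_K \in (0,1)$ and $b = p_{K+1}/p_K > 1$. By Lemma A below, the left-hand side is monotone in $\xi$. Therefore $\xi = \xi_0$ and hence $\alpha = \alpha_0$. ∎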

Let , , and . Then