Bayesian approach to Lorenz curve using time series grouped data

This study is concerned with estimating the inequality measures associated with the underlying hypothetical income distribution from time series grouped data on the Lorenz curve. We adopt the Dirichlet pseudo likelihood approach, in which the parameters of the Dirichlet likelihood are set to the differences between the values of the Lorenz curve of the hypothetical income distribution at consecutive income classes, and propose a state space model that links the transformed parameters of the Lorenz curve through a time series structure. Furthermore, information on the sample size of each survey is introduced into the Dirichlet precision parameter, originally a nuisance parameter, to take the sampling variability into account. Using simulated data and real data from the Japanese monthly income survey, we confirm that the proposed model produces more efficient estimates of the inequality measures than the existing models without time series structures.


1 Introduction

The Lorenz curve plays a crucial role in measuring income inequality. Given individual household incomes sorted in the ascending order, the Lorenz curve relates the cumulative share of income to the cumulative share of population. When individual household income data are available, the Lorenz curve and its associated inequality measures, such as the Gini coefficient, can be accurately estimated by assuming a parametric hypothetical income distribution or using a nonparametric method for the income distribution (see e.g.  Hasegawa and Kozumi, 2002). However, since individual income data may contain sensitive information that may lead to identification of individuals, most national governments and local governments provide the household income data only in the form of grouped data, which contain the summary of income and numbers or proportions of households for predefined income classes. Estimating the Lorenz curve based on the grouped data has drawn substantial attention from both theoretical and empirical perspectives. See Chotikapanich (2008) for an overview.

There are several approaches to Lorenz curve estimation based on grouped data in the parametric framework. When grouped level income data are available, where the numbers of households for the income classes are reported, the popular approach is to assume a hypothetical income distribution and estimate its parameters from the data. McDonald and Xu (1995) and Kleiber and Kotz (2003) list the wide range of statistical distributions that have appeared in the context of the income distribution. Given a parameter estimate, the Lorenz curve and inequality measures can be calculated analytically or numerically. In likelihood-based estimation, one can use the multinomial likelihood where the cell probabilities of the multinomial distribution are derived using the distribution function of the assumed income distribution (McDonald, 1984). Alternatively, by regarding the thresholds for the income classes as selected order statistics, the likelihood function can be constructed as the joint density of the order statistics following David and Nagaraja (2003) (see Nishino and Kakamu, 2011; Kakamu and Nishino, 2018). Using the theoretical moments of the income distribution, it is also possible to devise a generalised method of moments (GMM) estimator. Hajargasht et al. (2012) and Griffiths and Hajargasht (2015) proposed the optimal GMM estimators for the parameters of the income distribution from grouped data.

When grouped data on the Lorenz curve or on the proportions of income are used, where the cumulative proportions or proportions of income for the income classes are reported, a functional form of the Lorenz curve is fitted to the data. In addition to the Lorenz curves derived from statistical distributions, such as the lognormal distribution, researchers have designed many functional forms which are sufficiently flexible and provide analytical forms of the inequality measures. See Sarabia (2008) for a list of parametric Lorenz curves. It is also possible to model the Lorenz curve semiparametrically as in Ryu and Slottje (1996), where either the Lorenz curve or the quantile function is expanded using basis functions.

Chotikapanich and Griffith (2002) assumed that the expectation of the income share is equal to the difference between the heights of the Lorenz curve evaluated at the two consecutive cumulative proportions of population corresponding to the thresholds of the income classes, and employed the pseudo Dirichlet likelihood function for the grouped data on the Lorenz curve. The parameters of the Lorenz curve are estimated by maximising the Dirichlet likelihood. Chotikapanich and Griffith (2005) considered Bayesian estimation based on the Dirichlet likelihood, where the posterior estimation is carried out using the Markov chain Monte Carlo (MCMC) method. More recently, Kobayashi and Kakamu (2019) proposed a likelihood-free approach to the Lorenz curve based on approximate Bayesian computation (ABC) that does not rely on the Dirichlet likelihood and can be implemented even when an analytical form of the Lorenz curve is not available.

The existing literature has predominantly focused on estimating the income distribution and its Lorenz curve using a single set of grouped data, typically from a national survey in a chosen year. The information contained in the grouped data from a single year is limited and an estimation result based on such data can be subject to large uncertainty. By using an appropriate model that uses the grouped data simultaneously over a certain period of time and borrows information across time, we can stabilise the estimates and improve the performance. Exceptions to the single-period focus are the lognormal state space models considered by Nishino et al. (2012) and Nishino and Kakamu (2015). In their models, the observation equation follows the lognormal distribution, which is derived from the asymptotic distribution for the linear model based on the selected order statistics, and the state equation consists of either the AR(1) or random walk process. Since their model is specifically designed for grouped level income data, it cannot be used for grouped data on the Lorenz curve or proportions of income. Furthermore, the shape of the lognormal Lorenz curve is known to be very restrictive, and hence other more flexible parametric income distributions and Lorenz curves are worth considering.

The present paper extends the Dirichlet approach proposed by Chotikapanich and Griffith (2002, 2005) in such a way that the model can analyse the time series grouped data and the parameters of the Lorenz curve are time varying based on the state space model. Specifically, the parameters of the Lorenz curve that appear in the Dirichlet likelihood follow the latent autoregressive processes or random walk processes after an appropriate transformation. The precision parameter introduced in the Dirichlet likelihood can depend on the sample size of the survey of each period. In the present setting, this precision parameter is a more meaningful parameter that expresses how much we believe overall in the observed data and can be estimated more stably, because it is now estimated using the data from all periods rather than from only one sample from the Dirichlet distribution as in the single period setting of Chotikapanich and Griffith (2002). The numerical examples demonstrate the proposed model manages to reduce a large amount of uncertainty in estimating the Lorenz curve and Gini coefficient.

The rest of the paper is organised as follows. Section 2 first introduces the Dirichlet approach and develops the model for time series grouped data. The MCMC method for posterior computation is also described. Section 3 illustrates the proposed method using simulated data and the monthly grouped data from the Family Income and Expenditure Survey of Japan. Finally, we conclude in Section 4.

2 Method

2.1 Lorenz curve estimation based on Dirichlet distribution

Suppose that the population is divided into $K$ predefined income classes. Let $p_k$, $k = 0, 1, \dots, K$, denote the observed cumulative share of households up to the $k$th income class; the corresponding cumulative shares of income are observed as well. The cumulative shares of households and income are usually constructed from an income survey on individual households. Let us denote the cumulative distribution function and probability density function of the hypothetical income distribution by $H(\cdot|\boldsymbol{\theta})$ and $h(\cdot|\boldsymbol{\theta})$, respectively, where $\boldsymbol{\theta}$ is the parameter vector of the income distribution. Then, the Lorenz curve denoted by $L(y|\boldsymbol{\theta})$ is defined by

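$$L(y|\boldsymbol{\theta}) = \frac{1}{\mu}\int_0^y H^{-1}(z|\boldsymbol{\theta})\,dz, \quad y \in [0,1],$$

where $\mu$ is the mean of the distribution and $H^{-1}(z|\boldsymbol{\theta})$ is the quantile function. Once the parameter estimate for $\boldsymbol{\theta}$ is obtained, it is possible to calculate the Gini coefficient from

$$G = 1 - 2\int_0^1 L(z|\boldsymbol{\theta})\,dz. \tag{1}$$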
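As a quick numerical check of (1), the following minimal sketch integrates a Lorenz curve with the midpoint rule. It uses the lognormal Lorenz curve in its standard closed form, $L(p) = \Phi(\Phi^{-1}(p) - \sigma)$, with Gini coefficient $2\Phi(\sigma/\sqrt{2}) - 1$; these expressions are well-known results supplied here as assumptions, since they are not legible in this text. The function names and the value of `sigma` are hypothetical.

```python
from statistics import NormalDist
import math

_STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def lognormal_lorenz(p, sigma):
    """Lorenz curve of the lognormal distribution (standard closed form)."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(p) - sigma)


def gini_from_lorenz(lorenz, n_grid=20000):
    """Approximate G = 1 - 2 * int_0^1 L(z) dz with the midpoint rule.

    Midpoints avoid evaluating the curve at exactly 0 or 1.
    """
    h = 1.0 / n_grid
    area = sum(lorenz((i + 0.5) * h) for i in range(n_grid)) * h
    return 1.0 - 2.0 * area


sigma = 0.6  # hypothetical value, for illustration only
gini_numeric = gini_from_lorenz(lambda p: lognormal_lorenz(p, sigma))
gini_closed = 2.0 * _STD_NORMAL.cdf(sigma / math.sqrt(2.0)) - 1.0
print(gini_numeric, gini_closed)
```

The same `gini_from_lorenz` helper can be applied to any Lorenz curve without a closed-form Gini coefficient, which is how (1) is typically used in practice.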

Chotikapanich and Griffith (2002, 2005) assumed that the expectation of the share for an income class is equal to the difference in the values of the Lorenz curve for the two consecutive groups:

$$E[q_k] = L(p_k|\boldsymbol{\theta}) - L(p_{k-1}|\boldsymbol{\theta}), \quad k = 1, \dots, K,$$

where $q_k$ is the income share for the $k$th income class. Then, $q = (q_1, \dots, q_K)$ is assumed to follow the Dirichlet distribution

$$f(q|\boldsymbol{\theta},\lambda) = \Gamma(\lambda)\prod_{k=1}^{K}\frac{q_k^{\lambda(L(p_k|\boldsymbol{\theta}) - L(p_{k-1}|\boldsymbol{\theta})) - 1}}{\Gamma\big(\lambda(L(p_k|\boldsymbol{\theta}) - L(p_{k-1}|\boldsymbol{\theta}))\big)}, \tag{2}$$

where $\Gamma(\cdot)$ is the gamma function and $\lambda > 0$ is the precision parameter. The variance and covariance of the income shares implied by the Dirichlet distribution are given by

$$\mathrm{Var}(q_k|\boldsymbol{\theta},\lambda) = \frac{E[q_k](1 - E[q_k])}{\lambda + 1}, \quad \mathrm{Cov}(q_k, q_l|\boldsymbol{\theta},\lambda) = -\frac{E[q_k]E[q_l]}{\lambda + 1}, \quad k \neq l. \tag{3}$$

A larger value of $\lambda$ suggests that the variation of the income shares around the shares implied by the hypothetical Lorenz curve is small. Using (2) as the likelihood function, Chotikapanich and Griffith (2002, 2005) respectively considered maximum likelihood estimation and Bayesian estimation assuming a prior distribution for the parameters.
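The Dirichlet pseudo likelihood (2) is straightforward to evaluate on the log scale. The sketch below is an illustration under stated assumptions rather than the authors' implementation (which used Ox): the function name, the argument layout and the toy Lorenz curve $L(p) = p^2$ are all hypothetical.

```python
import math


def dirichlet_loglik(q, p, lorenz, lam):
    """Log of the Dirichlet pseudo likelihood (2).

    q      : income shares q_1..q_K (positive, summing to one)
    p      : cumulative household shares p_0 = 0 < p_1 < ... < p_K = 1
    lorenz : function mapping p to L(p | theta)
    lam    : precision parameter lambda > 0
    """
    loglik = math.lgamma(lam)
    for k in range(1, len(p)):
        # Dirichlet parameter of class k: lambda times the expected share
        alpha_k = lam * (lorenz(p[k]) - lorenz(p[k - 1]))
        loglik += (alpha_k - 1.0) * math.log(q[k - 1]) - math.lgamma(alpha_k)
    return loglik


# Toy example: quintile groups and a hypothetical Lorenz curve L(p) = p^2.
p_grid = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
q_obs = [0.04, 0.12, 0.20, 0.28, 0.36]  # equals the expected shares here
print(dirichlet_loglik(q_obs, p_grid, lambda p: p * p, lam=200.0))
```

Maximising this function over the Lorenz curve parameters gives the maximum likelihood estimate, while adding log prior densities gives the log posterior used in the Bayesian approach.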

In the setting of Chotikapanich and Griffith (2002, 2005), the data are from, for example, a nationwide income survey for a single year. The parameters of the income distribution are estimated based on the Dirichlet likelihood function constructed from a single data point. Therefore, the parameter estimates exhibit large variation due to the small sample size. Furthermore, the data do not contain information on the Dirichlet precision parameter $\lambda$, which is therefore seen as a nuisance parameter or tuning parameter. Although the value of $\lambda$ can potentially have an impact on the estimates of the parameters and Gini coefficient (Kobayashi and Kakamu, 2019), there exists no clear guideline on the choice of its value when it is fixed, nor on the choice of its prior distribution when the model is estimated within the Bayesian framework.

2.2 Model using time series grouped data

Suppose that the income survey is conducted monthly, quarterly or yearly, as is the case in the income survey in Japan (see Section 3.2). From a series of data from the survey we can capture how the income distribution and associated inequality measures change over time. Instead of fitting the Dirichlet model (2) independently for each period, we propose to use data from all the available periods and estimate a joint model in order to borrow strength across time and improve the estimation accuracy.

The notation used in the previous section is indexed by $t$ to denote the period. The index is added to (2) as

$$f(q_t|\boldsymbol{\theta}_t,\lambda_t) = \Gamma(\lambda_t)\prod_{k=1}^{K}\frac{q_{tk}^{\lambda_t(L(p_{tk}|\boldsymbol{\theta}_t) - L(p_{t,k-1}|\boldsymbol{\theta}_t)) - 1}}{\Gamma\big(\lambda_t(L(p_{tk}|\boldsymbol{\theta}_t) - L(p_{t,k-1}|\boldsymbol{\theta}_t))\big)}, \quad t = 1, \dots, T, \tag{4}$$

where $q_t = (q_{t1}, \dots, q_{tK})$ and $p_t = (p_{t0}, \dots, p_{tK})$ are the income shares and the cumulative shares of households, $\boldsymbol{\theta}_t$ is the parameter vector of the income distribution and $\lambda_t$ is the precision parameter of the Dirichlet distribution in the $t$th period. In this paper, the number of income classes and the family of the hypothetical income distribution are assumed to be known and the same for all periods. The model (4) assumes the conditional independence of $q_t$ across periods given $\boldsymbol{\theta}_t$ and $\lambda_t$.

The change in the income distribution is captured through the change in the parameters of the income distribution over time. This is achieved by modelling a latent process for the appropriately transformed $\boldsymbol{\theta}_t$ that is to be combined with (4) to form a state space model. Let us define the transformed parameters $u_t = (u_{t1}, \dots, u_{td})$, where each $u_{tj}$ is obtained from $\boldsymbol{\theta}_t$ through an appropriate link function. In the present paper, the parameters on the positive real line are transformed using the log link and those on the unit interval are transformed using the logit link. The Appendix describes the transformations for constructing the latent processes for the Lorenz curves considered in this paper.

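In this paper, $u_{tj}$ is assumed to independently follow either the AR(1) process

$$u_{1j} \sim N\!\left(\mu_j, \frac{\tau_j^2}{1 - \rho_j^2}\right), \quad u_{tj} = \mu_j + \rho_j(u_{t-1,j} - \mu_j) + e_{tj}, \quad |\rho_j| < 1, \quad e_{tj} \sim N(0, \tau_j^2), \quad t = 2, \dots, T, \tag{5}$$

or the random walk (RW) process

$$u_{1j} \sim N(0, c\tau_j^2), \quad u_{tj} = u_{t-1,j} + e_{tj}, \quad e_{tj} \sim N(0, \tau_j^2), \quad t = 2, \dots, T, \tag{6}$$

for $j = 1, \dots, d$.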
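To make the two latent specifications concrete, the sketch below simulates one path from each of (5) and (6), reading the random walk recursion as $u_{tj} = u_{t-1,j} + e_{tj}$, and maps the AR(1) path back to a positive parameter through the log link. All numerical values and function names are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_ar1(T, mu, rho, tau, rng):
    """Draw u_1..u_T from the stationary AR(1) process (5)."""
    # Initial state from the stationary distribution N(mu, tau^2 / (1 - rho^2))
    u = [rng.gauss(mu, tau / math.sqrt(1.0 - rho ** 2))]
    for _ in range(1, T):
        u.append(mu + rho * (u[-1] - mu) + rng.gauss(0.0, tau))
    return u


def simulate_rw(T, c, tau, rng):
    """Draw u_1..u_T from the random walk process (6)."""
    # Initial state from N(0, c * tau^2)
    u = [rng.gauss(0.0, math.sqrt(c) * tau)]
    for _ in range(1, T):
        u.append(u[-1] + rng.gauss(0.0, tau))
    return u


rng = random.Random(1)
u_ar = simulate_ar1(T=100, mu=0.5, rho=0.9, tau=0.1, rng=rng)
u_rw = simulate_rw(T=100, c=10.0, tau=0.1, rng=rng)
# Inverse of the log link: a positive, time-varying parameter path
param_path = [math.exp(v) for v in u_ar]
```

The AR(1) path fluctuates around `mu`, whereas the random walk has no long-run level, which is why the RW specification needs the diffuse initial variance controlled by `c`.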

The precision parameter $\lambda_t$ is modelled proportionally to the sample size $n_t$ of the survey in the $t$th period:

$$\lambda_t = n_t\exp(\psi), \quad t = 1, \dots, T. \tag{7}$$

From (3), the variation of the income shares decreases as the sample size increases. While a similar model assumption is found in the context of small area estimation for aggregate data, where the quantity corresponding to $\exp(\psi)$ is assumed to be known, the present paper treats $\psi$ as an unknown parameter. In the present model, the parameter $\psi$ represents the overall sampling precision across time and the variation of the income shares varies with the sample size multiplicatively. Therefore, as seen in the application to the Japanese data, $\psi$ is a more meaningful parameter than the precision parameter in the separate approach of Chotikapanich and Griffith (2002, 2005) and can be estimated more stably, as we use the entire time series data to estimate the model.
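The effect of (7) on the sampling variability can be seen by plugging $\lambda_t = n_t\exp(\psi)$ into the variance in (3). The helper below is a small illustrative sketch; its name and the numbers used are hypothetical.

```python
import math


def share_variance(mean_share, n_t, psi):
    """Variance of an income share implied by (3) with lambda_t from (7)."""
    lam_t = n_t * math.exp(psi)  # precision grows linearly in the sample size
    return mean_share * (1.0 - mean_share) / (lam_t + 1.0)


# Hypothetical expected share of 0.2 and psi = 0 for two sample sizes
for n_t in (100, 1000):
    print(n_t, share_variance(0.2, n_t, psi=0.0))
```

For large $n_t$ the variance is roughly inversely proportional to the sample size, so a tenfold increase in $n_t$ shrinks the standard deviation of a share by about a factor of $\sqrt{10}$.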

The parameters of the model (4) with (5) or (6) and (7) are $\mu_j$, $\rho_j$, $\tau_j^2$ and $\psi$, and their prior distributions are as follows. We assume $\mu_j \sim N(m_j, v_j^2)$ and $\tau_j^2 \sim IG(r_j, s_j)$ for the ease of posterior computation for $j = 1, \dots, d$. The values of the hyperparameters are chosen such that the prior distributions well cover the range of the parameters of the income distribution before transformation found in the existing empirical results (see Section 3). As we have no prior information of our own on the persistence of the transformed parameters of the income distribution, a prior distribution reflecting this is placed on $\rho_j$ for $j = 1, \dots, d$. Finally, as we have little prior information on $\psi$, its prior distribution is specified with hyperparameters such that it covers a wide range of values of $\psi$.

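Finally, the joint distribution of the data $q = (q_1, \dots, q_T)$, the latent variables $u = (u_1, \dots, u_T)$ with $u_t = (u_{t1}, \dots, u_{td})$ and parameters $\boldsymbol{\mu}$, $\boldsymbol{\rho}$, $\boldsymbol{\tau}^2$ and $\psi$ under the AR(1) specification (5) is given by

$$\pi(q, u, \boldsymbol{\mu}, \boldsymbol{\rho}, \boldsymbol{\tau}^2, \psi) \propto \left[\prod_{t=1}^{T} f(q_t|u_t,\psi)\left\{\prod_{j=1}^{d}\pi(u_{tj}|u_{t-1,j},\boldsymbol{\eta}_j)\right\}\right]\pi(\psi)\prod_{j=1}^{d}\left\{\pi(\mu_j)\pi(\rho_j)\pi(\tau_j^2)\right\},$$

where $\pi(u_{tj}|u_{t-1,j},\boldsymbol{\eta}_j)$ is the density function associated with the model (5) with $\boldsymbol{\eta}_j = (\mu_j, \rho_j, \tau_j^2)$, the factor for $t = 1$ is the density of the initial distribution in (5), and $\pi(\mu_j)$, $\pi(\rho_j)$, $\pi(\tau_j^2)$ and $\pi(\psi)$ are the prior densities. The joint distribution for the RW specification (6) is given by

$$\pi(q, u, \boldsymbol{\tau}^2, \psi) \propto \left[\prod_{t=1}^{T} f(q_t|u_t,\psi)\left\{\prod_{j=1}^{d}\pi(u_{tj}|u_{t-1,j},\tau_j^2)\right\}\right]\pi(\psi)\prod_{j=1}^{d}\left\{\pi(\tau_j^2)\right\},$$

where the factor for $t = 1$ is the density of the initial distribution in (6).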

2.3 Posterior computation

The proposed model is estimated using the MCMC method based on a simple Gibbs sampler. Here the sampling algorithm for the AR(1) specification is described; the algorithm for the RW specification is obtained with a few modifications. The parameters $\mu_j$, $\rho_j$, $\tau_j^2$ and $\psi$ and the latent variables $u_t$ are alternately sampled from their respective full conditional distributions as follows.

1. We sample $u_t$, $t = 1, \dots, T$, from the full conditional distribution proportional to

$$\pi(u_t|\text{Rest}) \propto \begin{cases} f(q_1|u_1,\psi)\prod_{j=1}^{d}\left\{\pi(u_{1j}|\boldsymbol{\eta}_j)\,\pi(u_{2j}|u_{1j},\boldsymbol{\eta}_j)\right\}, & t = 1, \\ f(q_T|u_T,\psi)\prod_{j=1}^{d}\left\{\pi(u_{Tj}|u_{T-1,j},\boldsymbol{\eta}_j)\right\}, & t = T, \\ f(q_t|u_t,\psi)\prod_{j=1}^{d}\left\{\pi(u_{tj}|u_{t-1,j},\boldsymbol{\eta}_j)\,\pi(u_{t+1,j}|u_{tj},\boldsymbol{\eta}_j)\right\}, & \text{otherwise}. \end{cases}$$

It is not in the form of the density function of a standard distribution due to the complex form of the Dirichlet likelihood $f(q_t|u_t,\psi)$. The accept-reject Metropolis-Hastings (ARMH) algorithm, which uses a normal approximation of the full conditional density around its mode, is adopted. As necessary, the random walk Metropolis-Hastings (MH) algorithm may be mixed in with probability 0.05 during the burn-in period in order to escape from badly chosen initial values.

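2. From the prior $\mu_j \sim N(m_j, v_j^2)$, the full conditional distribution of $\mu_j$ is $N(\hat{m}_{\mu_j}, \hat{v}^2_{\mu_j})$ where

$$\hat{m}_{\mu_j} = \hat{v}^2_{\mu_j}\left\{\frac{1 - \rho_j^2}{\tau_j^2}u_{1j} + \frac{1 - \rho_j}{\tau_j^2}\sum_{t=2}^{T}(u_{tj} - \rho_j u_{t-1,j}) + \frac{m_j}{v_j^2}\right\}, \quad \hat{v}^2_{\mu_j} = \left\{\frac{(T-1)(1 - \rho_j)^2}{\tau_j^2} + \frac{1 - \rho_j^2}{\tau_j^2} + \frac{1}{v_j^2}\right\}^{-1}.$$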
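3. The full conditional distribution of $\rho_j$ is proportional to

$$\pi(\rho_j|\text{Rest}) \propto \sqrt{1 - \rho_j^2}\,\exp\left\{-\frac{(1 - \rho_j^2)(u_{1j} - \mu_j)^2}{2\tau_j^2}\right\}\exp\left\{-\frac{1}{2\tau_j^2}\sum_{t=2}^{T}\big(u_{tj} - \mu_j - \rho_j(u_{t-1,j} - \mu_j)\big)^2\right\}\pi(\rho_j) \propto \sqrt{1 - \rho_j^2}\,\exp\left\{-\frac{(\rho_j - \hat{m}_{\rho_j})^2}{2\hat{v}^2_{\rho_j}}\right\}\pi(\rho_j).$$

The independence MH algorithm is used to sample $\rho_j$ with the proposal distribution $N(\hat{m}_{\rho_j}, \hat{v}^2_{\rho_j})$ where

$$\hat{m}_{\rho_j} = \hat{v}^2_{\rho_j}\left\{\frac{1}{\tau_j^2}\sum_{t=2}^{T}(u_{tj} - \mu_j)(u_{t-1,j} - \mu_j)\right\}, \quad \hat{v}^2_{\rho_j} = \frac{\tau_j^2}{\sum_{t=2}^{T-1}(u_{tj} - \mu_j)^2}.$$

With the current state $\rho_j$ and proposal $\rho_j^*$, the acceptance probability is given by

$$\min\left\{1, \frac{\sqrt{1 - \rho_j^{*2}}\,\pi(\rho_j^*)}{\sqrt{1 - \rho_j^2}\,\pi(\rho_j)}\right\}.$$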
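4. For the prior $\tau_j^2 \sim IG(r_j, s_j)$, the full conditional distribution of $\tau_j^2$ is given by $IG(\hat{r}_j, \hat{s}_j)$ where

$$\hat{r}_j = r_j + \frac{T}{2}, \quad \hat{s}_j = s_j + \frac{1}{2}\left[(1 - \rho_j^2)(u_{1j} - \mu_j)^2 + \sum_{t=2}^{T}\big(u_{tj} - \mu_j - \rho_j(u_{t-1,j} - \mu_j)\big)^2\right].$$
5. The full conditional distribution of $\psi$ is proportional to

$$\pi(\psi|\text{Rest}) \propto \left\{\prod_{t=1}^{T} f(q_t|u_t,\psi)\right\}\pi(\psi).$$

Again, since this is not in a standard form, we use the random walk MH algorithm with a normal proposal distribution to sample $\psi$.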
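Step 3 of the sampler can be sketched in a few lines. The code below is an illustration and not the authors' Ox implementation: it assumes a uniform prior for $\rho_j$ on $(-1, 1)$ (the prior actually used is not legible in this text), so that the prior ratio cancels and proposals outside the interval are rejected. The function name and the synthetic data are hypothetical.

```python
import math
import random


def sample_rho(u, mu, tau2, rho_cur, rng):
    """One independence MH update of rho_j given u_{1j},...,u_{Tj}, mu_j, tau_j^2.

    Assumes a uniform prior on (-1, 1), so pi(rho) cancels in the ratio.
    """
    a = [v - mu for v in u]  # centred latent states; a[0] corresponds to t = 1
    T = len(a)
    denom = sum(a[t] ** 2 for t in range(1, T - 1))    # sum over t = 2..T-1
    cross = sum(a[t] * a[t - 1] for t in range(1, T))  # sum over t = 2..T
    v_hat = tau2 / denom
    m_hat = v_hat * cross / tau2
    rho_new = rng.gauss(m_hat, math.sqrt(v_hat))
    if abs(rho_new) >= 1.0:
        return rho_cur  # zero prior density outside (-1, 1): reject
    ratio = math.sqrt(1.0 - rho_new ** 2) / math.sqrt(1.0 - rho_cur ** 2)
    return rho_new if rng.random() < min(1.0, ratio) else rho_cur


# Illustration on a synthetic AR(1) path (all values hypothetical).
rng = random.Random(0)
true_rho, mu, tau = 0.8, 0.0, 0.1
u = [rng.gauss(mu, tau / math.sqrt(1.0 - true_rho ** 2))]
for _ in range(499):
    u.append(mu + true_rho * (u[-1] - mu) + rng.gauss(0.0, tau))

rho, draws = 0.0, []
for _ in range(300):
    rho = sample_rho(u, mu, tau ** 2, rho, rng)
    draws.append(rho)
```

Because the normal proposal already matches the Gaussian part of the full conditional, only the factor $\sqrt{1 - \rho_j^2}$ and the prior enter the acceptance ratio, so acceptance rates are typically high.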

Combining different parametric income families with different models for the latent process leads us to fitting multiple models to the data. The models are compared based on the posterior predictive loss (PPL) criterion proposed by Gelfand and Ghosh (1998). The criterion favours the model which minimises

$$\mathrm{PPL}_r(M) = \sum_{t=1}^{T}\sum_{k=1}^{K}V^{M}_{tk} + \frac{r}{r+1}\sum_{t=1}^{T}\sum_{k=1}^{K}(q_{tk} - E^{M}_{tk})^2,$$

where $E^{M}_{tk}$ and $V^{M}_{tk}$ are the mean and variance of the posterior predictive distribution of $q_{tk}$ under the model $M$. We estimate $E^{M}_{tk}$ and $V^{M}_{tk}$ based on the 10000 MCMC draws obtained after the 2000 draws of the burn-in period. The first term penalises the model for complexity and the second term measures the goodness of fit with the weight $r/(r+1)$.
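Given posterior predictive draws of the income shares, the criterion is a direct sum over periods and classes. The following is a minimal sketch with a hypothetical function name and data layout; the predictive mean and variance of each cell are estimated by the sample mean and the (population) sample variance of the draws.

```python
from statistics import fmean, pvariance


def ppl(q, pred_draws, r):
    """Posterior predictive loss for observed shares q and predictive draws.

    q[t][k]          : observed income share of class k in period t
    pred_draws[s][t][k] : s-th posterior predictive draw of that share
    r                : weight parameter of the criterion
    """
    var_term = 0.0
    fit_term = 0.0
    for t in range(len(q)):
        for k in range(len(q[t])):
            cell = [draw[t][k] for draw in pred_draws]
            var_term += pvariance(cell)                # penalty term
            fit_term += (q[t][k] - fmean(cell)) ** 2   # goodness of fit
    return var_term + r / (r + 1.0) * fit_term


# Tiny hypothetical example: one period, two classes, three predictive draws
q_obs = [[0.3, 0.7]]
draws = [[[0.28, 0.72]], [[0.30, 0.70]], [[0.32, 0.68]]]
print(ppl(q_obs, draws, r=1.0))
```

As $r$ grows, the weight $r/(r+1)$ approaches one and the goodness-of-fit term counts as much as the variance penalty.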

3 Numerical Examples

3.1 Simulation study

The proposed method is first illustrated using simulated datasets. This study considers the Singh-Maddala distribution, denoted by $SM(\alpha, \beta, \gamma)$, as the hypothetical income distribution. This is a widely used three-parameter family and is known to fit income data well. Since the Lorenz curve of the Singh-Maddala distribution does not depend on the scale parameter $\beta$, only the latent processes for the shape parameters $\alpha$ and $\gamma$ are modelled. There are thus two latent processes, one for each transformed shape parameter. The Appendix provides the probability density functions and associated inequality measures of the income distributions used in this paper.

The simulated data are generated as follows.

1. The latent variables , are generated based on the AR(1) process given by (5).

2. The sample size is selected randomly from .

3. Given and , the household incomes, denoted by , are generated from .

4. The household incomes are sorted in the ascending order and divided into equally sized income classes, then the income shares are computed from for where , and .
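Steps 3 and 4 of the data generating process can be sketched as follows. Incomes are drawn by inverting the Singh-Maddala distribution function $F(x) = 1 - (1 + (x/\beta)^\alpha)^{-\gamma}$, which is the distribution function implied by the density given in the Appendix. The parameter values, sample size and function names below are hypothetical, since the values used in the study are not legible in this text.

```python
import random


def sample_singh_maddala(n, alpha, beta, gamma, rng):
    """Inverse-CDF draws from the Singh-Maddala distribution."""
    # Solving u = F(x) gives x = beta * ((1 - u)^(-1/gamma) - 1)^(1/alpha)
    return [beta * ((1.0 - rng.random()) ** (-1.0 / gamma) - 1.0) ** (1.0 / alpha)
            for _ in range(n)]


def income_shares(incomes, K):
    """Sort incomes, split into K equally sized classes, return income shares."""
    x = sorted(incomes)
    n = len(x)
    total = sum(x)
    bounds = [round(k * n / K) for k in range(K + 1)]  # class boundaries
    return [sum(x[bounds[k]:bounds[k + 1]]) / total for k in range(K)]


rng = random.Random(42)
incomes = sample_singh_maddala(1000, alpha=2.5, beta=1.0, gamma=2.0, rng=rng)
shares = income_shares(incomes, K=5)
print(shares)  # income shares of the five classes, lowest to highest
```

Repeating this for each period with period-specific shape parameters and sample sizes yields a simulated time series of grouped data of the kind analysed in this section.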

In this simulation study, we set so that as in the income survey data in Japan (Section 3.2) and . We set and . The prior distributions are given by , and such that the prior distributions well cover the range of the parameter values often found in the literature. The MCMC algorithm is run for 10000 iterations after a burn-in period of 2000 iterations.

Table 1 presents the posterior means, 95% credible intervals and inefficiency factors of the parameters. The table shows that the posterior distributions are concentrated around the true values. The inefficiency factor is defined as the ratio of the numerical variance of the sample mean of the Markov chain to the variance of the independent draws (Chib, 2001). The inefficiency factors shown in the table are reasonably small, indicating the efficiency of the sampling algorithm.

Using the proposed method, the Gini coefficient and values of the Lorenz curve at selected points are estimated based on the posterior sample. The performance of the proposed method is compared with two alternative approaches. The first alternative is a crude descriptive statistic which estimates the Gini coefficient for each period from the observed cumulative shares using the trapezoidal rule.

The second alternative is the method of Chotikapanich and Griffith (2005), which estimates the parameters separately for each period based on the Dirichlet likelihood, assuming independent priors for the two shape parameters and a prior for the precision parameter. In this approach, no information on the sample size is reflected in the precision parameter. For each period, the random walk MH algorithm which jointly samples the parameters is run for 10000 iterations after a burn-in period of 2000 iterations. Given the posterior samples of the parameters, the Gini coefficients and Lorenz curves are estimated. The performance of the three approaches is compared based on the relative bias, that is, the deviation of the posterior mean under a method from the true value of the Gini coefficient relative to the true value, and on the lengths of the credible intervals over time. We also considered more diffuse prior distributions, which have the same prior means as the default priors but inflated prior variances.
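The crude trapezoid-based estimate of the Gini coefficient can be computed directly from the grouped data. The sketch below is one standard way to do this and is offered as an assumption, since the exact expression used in the study is not legible in this text; the function name is hypothetical.

```python
def gini_trapezoid(p, y):
    """Gini coefficient from grouped data by the trapezoidal rule.

    p, y : cumulative shares of households and of income, both starting
           at 0 and ending at 1, with one entry per class boundary.
    """
    # Area under the piecewise-linear Lorenz curve through the points (p_k, y_k)
    area = sum((p[k] - p[k - 1]) * (y[k] + y[k - 1]) / 2.0
               for k in range(1, len(p)))
    return 1.0 - 2.0 * area


p = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
print(gini_trapezoid(p, p))                               # perfect equality
print(gini_trapezoid(p, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]))  # top class holds all income
```

Because the piecewise-linear curve lies on or above any convex Lorenz curve through the same points, this estimate tends to understate the Gini coefficient when only a few classes are available.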

Figure 1 presents the boxplots of the relative bias and lengths of the 95% credible intervals for the two shape parameters and the Gini coefficients under the proposed approach, the separate Dirichlet approach of Chotikapanich and Griffiths (2005) and the crude descriptive statistic over the periods. The relative biases for one of the shape parameters under the proposed and separate approaches appear to be comparable and the estimates do not seem to depend on the prior specification. This is also the case for the Gini coefficient. However, the figure shows that estimating the parameters separately can occasionally incur large bias in the estimates of the other shape parameter, and the amount of bias appears to depend on the prior specification. The figure also shows that the 95% credible intervals for the parameters and Gini coefficients under the proposed approach are considerably narrower than those under the separate Dirichlet approach, indicating a large reduction in the uncertainty about those quantities from borrowing strength across time through the state space model. The wide credible intervals under the separate Dirichlet approach result from the large uncertainty in the posterior distributions, because the data from a single period contain little information about the parameters. It is also seen that the credible intervals under the separate approach tend to become longer as the prior variances increase. On the other hand, the proposed approach appears to be robust with respect to the prior setting.

The same observation holds for the estimates of the Lorenz curve. Figure 2 presents the relative bias and lengths of the 95% credible intervals for the Lorenz curves at under the proposed and separate Dirichlet approaches. The left panels of the figure show that the relative bias under the proposed and separate approaches are comparable. However, the right panels show that the credible intervals under the proposed approach are much narrower than those under the separate approach and the prior specification under the separate approach has an impact on the lengths of the intervals.

3.2 Application to Japanese income survey data

Now the proposed model is demonstrated using the monthly income share data from the Family Income and Expenditure Survey (FIES) prepared by the Ministry of Internal Affairs and Communications of Japan. Our dataset contains the income shares of the five equally sized income classes of the working households, adjusted to the population size, between January 2000 and December 2018 ($T = 228$ months). The dataset is available from https://www.e-stat.go.jp. Figure 3 presents the time series plot of the income shares of the five income classes of our dataset and the descriptive statistics of the Gini coefficient. The figure shows that the share of the fifth (highest income) class exhibits some large variation around 0.35. The shares of the other four classes appear to be rather stable over time with smaller variation. Therefore, the figure suggests that fitting an income model that captures the behaviour of the upper tail of the distribution well is important for obtaining accurate insight into the dynamics of the income distribution and inequality structure.

We consider three Lorenz curves derived from hypothetical income distributions, namely the lognormal (LN), Singh-Maddala (SM) and Dagum (DA) distributions, and three parametric functional Lorenz curves: those proposed by Ortega et al. (1991) (OR) and Rasche et al. (1980) (RA) with two parameters and that of Kakwani (1980) (KA) with three parameters. Each Lorenz curve is incorporated into the proposed framework, where the parameters follow the latent AR(1) or RW process. The transformations for constructing the latent processes for those Lorenz curves are described in the Appendix. The separate Dirichlet approach (DIR) is fitted as well. In total, eighteen models are fitted to the data. Hereafter, the models are denoted as, for example, LNAR for the LN model with AR(1) and KADIR for the KA model using the separate approach, as necessary. The same default prior distributions as in the simulation study are used. For the RW models, we set . The models are estimated based on the 10000 MCMC draws after the 2000 draws of the burn-in period.

Table 2 presents the log PPL for the eighteen models. For both settings considered, KARW resulted in the smallest PPL, followed by KAAR with only a very marginal difference. In Chotikapanich and Griffith (2002) the KA Lorenz curve was also found to be the best fitting model. The SM models follow the KA models. In the proposed approach, the DA models are the least supported by the data based on PPL. The Dagum distribution has more control on the shape in the left tail than in the right tail (Kakamu, 2016). This result suggests that the right tail of the income distribution for this particular dataset is relatively more important than the left tail, although the Dagum distribution is known to fit income data of many other countries well (Kl08). For all the other Lorenz curves, the RW specification resulted in smaller PPL than the AR(1) specification. In the following, we focus only on the results under the RW models. The PPL values under DIR are larger than those under the proposed approach due to the large uncertainty from the single period data. Among the models with the separate approach, LN resulted in the smallest PPL as it has only one parameter to estimate in the Lorenz curve per period.

Figure 4 presents the posterior means and 95% credible intervals of the parameters and Gini coefficients under KARW and KADIR. As illustrated in the simulation study, the posterior distributions under the separate approach are widely spread across the parameter spaces. In contrast, the proposed approach produced more concentrated posterior distributions with smoother traces of the posterior means, and hence provides more precise insight into the dynamics of the quantities of interest.

Figure 5 presents the posterior distributions of the precision parameter obtained from the proposed approach with RW and from LNDIR, which is the best supported model under the separate approach, under two prior settings. In the left panel of the figure, it is seen that the locations of the posterior distributions correspond to the order of the models supported by the data shown in Table 2. Therefore, under the proposed approach, the precision parameter is indicative of how much we believe overall in the observed data. The posterior distributions also seem to be robust with respect to the prior specification. In contrast, the posterior distributions under the separate approach are widely spread and exhibit prior sensitivity, as also demonstrated by Kobayashi and Kakamu (2019).

Figure 6 presents the posterior means of the income shares for the five income classes and Gini coefficients under the proposed approach. It is seen that KA, which is the most supported by the data based on PPL, traces the observed income shares most closely. In the figure, KA, SM, DA, OR and RA seem to agree for and , but some disagreement between the first and latter models is observed for and . For and , LN resulted in somewhat peculiar patterns, especially in the case of where the income share exhibits little variation over time. This could be due to the restrictive shape of the lognormal Lorenz curve, which is controlled by a single parameter. Regardless of some disagreement in the estimates among the six models, the posterior means of the Gini coefficients resulted in similar patterns, though the estimates under the lognormal model appear to be consistently slightly smaller than the rest. The deviations in the estimates of the income share, for example, under SM, DA, OR, RA from the observed share for and compensate each other such that they still agree in the Gini coefficient, which is a summary measure of the Lorenz curve. The figure also shows that the dynamics of the top 20% share are closely linked with those of the Gini coefficient. It appears that the top 20% share and Gini coefficient decline from the middle of the sample period towards the end, with a relative increase in the shares of the bottom two classes.

4 Conclusion

This paper has developed a model for the Lorenz curve based on time series grouped data, where the likelihood part is based on the Dirichlet distribution and the parameters of the Lorenz curve follow a latent process, constituting a state space model. The numerical examples using the simulated data and real data of Japan have demonstrated that the parameters can be estimated more stably and that the posterior distributions obtained from the proposed approach have much smaller uncertainty than those obtained by separately estimating the parameters based on the Dirichlet distribution using the data from a single period only. The application to the income survey data of Japan found that the three-parameter Lorenz curve of Kakwani (1980) provided the best fit to the data, followed by the Singh-Maddala Lorenz curve. It was also seen that, in the sample period between 2000 and 2018, the Gini coefficient exhibits a downward trend after 2008 with some occasional spikes.

Some limitations of the present paper are as follows. Although the parametric functional forms of the Lorenz curve fit the data very well, these Lorenz curves cannot provide very detailed insights about the income distribution such as its shape. An advantage of a Lorenz curve derived from a statistical distribution is that the shape of the income distribution may be grasped. However, since the Lorenz curve is location-free, we cannot obtain the complete picture of the income distribution of interest. In order to address this issue, we could consider extending the GMM approach of Hajargasht et al. (2012) and Griffiths and Hajargasht (2015) by utilising additional information on the location of the income distribution. Furthermore, although we could obtain some insights about the dynamics of the income shares and Gini coefficient from the application, one may wish to smooth out the estimates even more for better interpretability using, e.g., regime- or shrinkage-based modelling. These are left for future studies.

Acknowledgments.

This work is partially supported by JSPS KAKENHI (#17J04715, #18K12754, #18K12757, #19K13667). The computational results were obtained by using Ox version 6.21 (Doornik, 2007).

Appendix: Parametric Lorenz curves

Lognormal distribution

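The probability density function of the lognormal distribution, denoted by $\mathcal{LN}(\mu,\sigma^2)$, is given by

$$h_{LN}(x \mid \mu,\sigma^2) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(\log x-\mu)^2}{2\sigma^2}\right\}, \quad x>0,$$

where $\mu$ is the location parameter and $\sigma^2$ is the scale parameter. The Lorenz curve and Gini coefficient associated with $\mathcal{LN}(\mu,\sigma^2)$ are given by $L_{LN}(p \mid \sigma) = \Phi(\Phi^{-1}(p)-\sigma)$ and $G_{LN} = 2\Phi(\sigma/\sqrt{2})-1$, where $\Phi$ and $\Phi^{-1}$ are the cumulative distribution function and quantile function of the standard normal distribution. It is well known that the Lorenz curve and Gini coefficient depend only on $\sigma$. For the latent process, it is simply set .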
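As a numerical illustration of the lognormal case, the following minimal sketch evaluates the standard closed forms $L(p)=\Phi(\Phi^{-1}(p)-\sigma)$ and $G=2\Phi(\sigma/\sqrt{2})-1$ and compares the Gini coefficient with the usual definition $G = 1-2\int_0^1 L(p)\,dp$. The function names and the midpoint rule are assumptions made for this illustration only; they are not taken from the paper, whose computations were carried out in Ox.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def lognormal_lorenz(p: float, sigma: float) -> float:
    """Lorenz curve L(p) = Phi(Phi^{-1}(p) - sigma), valid for 0 < p < 1."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(p) - sigma)


def lognormal_gini(sigma: float) -> float:
    """Closed-form Gini coefficient G = 2 * Phi(sigma / sqrt(2)) - 1."""
    return 2.0 * _STD_NORMAL.cdf(sigma / 2 ** 0.5) - 1.0


def gini_by_midpoint(sigma: float, n: int = 20000) -> float:
    """Numerical check: G = 1 - 2 * integral_0^1 L(p) dp (midpoint rule)."""
    area = sum(lognormal_lorenz((i + 0.5) / n, sigma) for i in range(n)) / n
    return 1.0 - 2.0 * area


if __name__ == "__main__":
    sigma = 0.5  # illustrative value only
    print(lognormal_gini(sigma), gini_by_midpoint(sigma))
```

Neither function takes $\mu$ as an argument, which reflects the fact that the Lorenz curve and Gini coefficient of the lognormal distribution are free of the location parameter.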

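Singh-Maddala distribution

The probability density function of the Singh-Maddala distribution (Singh and Maddala, 1976), denoted by $\mathcal{SM}(\alpha,\beta,\gamma)$, is given by

$$h_{SM}(x \mid \alpha,\beta,\gamma) = \frac{\alpha\gamma x^{\alpha-1}}{\beta^{\alpha}\left(1+(x/\beta)^{\alpha}\right)^{\gamma+1}}, \quad x>0,$$

where $\alpha,\gamma$ are the shape parameters and $\beta$ is the scale parameter. The Lorenz curve and Gini coefficient are given by $L_{SM}(p \mid \alpha,\gamma) = B_z(1+1/\alpha,\gamma-1/\alpha)/B(1+1/\alpha,\gamma-1/\alpha)$ and $G_{SM} = 1-\Gamma(\gamma)\Gamma(2\gamma-1/\alpha)/\{\Gamma(\gamma-1/\alpha)\Gamma(2\gamma)\}$, where $z = 1-(1-p)^{1/\gamma}$ and $B_z(a,b)$ is the incomplete beta function, $B(a,b)$ is the beta function and $\Gamma(\cdot)$ is the gamma function. The Lorenz curve and Gini coefficient do not depend on $\beta$. For the latent process, we set .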
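The Singh-Maddala quantities can be sketched as follows, assuming the standard closed forms $G = 1-\Gamma(\gamma)\Gamma(2\gamma-1/\alpha)/\{\Gamma(\gamma-1/\alpha)\Gamma(2\gamma)\}$ and $L(p) = B_z(1+1/\alpha,\gamma-1/\alpha)/B(1+1/\alpha,\gamma-1/\alpha)$ with $z=1-(1-p)^{1/\gamma}$, which require $\gamma > 1/\alpha$ for a finite mean. The function names and the midpoint approximation of the incomplete beta function are assumptions for this illustration and are not the paper's implementation.

```python
from math import exp, lgamma


def sm_gini(alpha: float, gamma: float) -> float:
    """Closed-form Gini coefficient of the Singh-Maddala distribution."""
    if gamma <= 1.0 / alpha:
        raise ValueError("a finite mean requires gamma > 1/alpha")
    log_ratio = (lgamma(gamma) + lgamma(2.0 * gamma - 1.0 / alpha)
                 - lgamma(gamma - 1.0 / alpha) - lgamma(2.0 * gamma))
    return 1.0 - exp(log_ratio)


def incomplete_beta(z: float, a: float, b: float, n: int = 4000) -> float:
    """B_z(a, b) = integral_0^z t^(a-1) (1-t)^(b-1) dt by the midpoint rule.

    A rough approximation intended for a >= 1 and z < 1 only.
    """
    h = z / n
    return h * sum(((i + 0.5) * h) ** (a - 1.0)
                   * (1.0 - (i + 0.5) * h) ** (b - 1.0) for i in range(n))


def sm_lorenz(p: float, alpha: float, gamma: float) -> float:
    """Lorenz curve as a regularised incomplete beta function, 0 <= p < 1."""
    a, b = 1.0 + 1.0 / alpha, gamma - 1.0 / alpha
    z = 1.0 - (1.0 - p) ** (1.0 / gamma)
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return incomplete_beta(z, a, b) / exp(log_beta)


if __name__ == "__main__":
    print(sm_gini(2.0, 1.5), sm_lorenz(0.5, 2.0, 1.5))  # illustrative values
```

The scale parameter $\beta$ does not appear in either function. In the special case $\alpha=1$, $\gamma=2$ the formulas reduce to $L(p) = \{1-(1-p)^{1/2}\}^2$ and $G = 2/3$, which gives a simple sanity check.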

Dagum distribution

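The probability density function of the Dagum distribution (Dagum, 1977), denoted by $\mathcal{DA}(\alpha,\beta,\kappa)$, is given by

$$h_{DA}(x \mid \alpha,\beta,\kappa) = \frac{\alpha\kappa x^{\alpha\kappa-1}}{\beta^{\alpha\kappa}\left(1+(x/\beta)^{\alpha}\right)^{\kappa+1}}, \quad x>0,$$

where $\alpha,\kappa$ are the shape parameters and $\beta$ is the scale parameter. The Dagum distribution is also known to fit income data well. The Dagum and Singh-Maddala distributions are special cases of the more flexible family called the generalised beta distribution of the second kind (GB2), denoted by $\mathcal{GB}2(a,b,p,q)$, with $\mathcal{SM}(\alpha,\beta,\gamma) = \mathcal{GB}2(\alpha,\beta,1,\gamma)$ and $\mathcal{DA}(\alpha,\beta,\kappa) = \mathcal{GB}2(\alpha,\beta,\kappa,1)$. As for the Singh-Maddala distribution, the Lorenz curve and Gini coefficient of the Dagum distribution do not depend on $\beta$ and are given by $L_{DA}(p \mid \alpha,\kappa) = B_z(\kappa+1/\alpha,1-1/\alpha)/B(\kappa+1/\alpha,1-1/\alpha)$ with $z = p^{1/\kappa}$ and $G_{DA} = \Gamma(\kappa)\Gamma(2\kappa+1/\alpha)/\{\Gamma(2\kappa)\Gamma(\kappa+1/\alpha)\}-1$. For the latent process, we set .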
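The nesting of the Singh-Maddala and Dagum distributions within the GB2 family can be checked numerically. The sketch below assumes the usual GB2 parameterisation of McDonald (1984), $f(x \mid a,b,p,q) = a x^{ap-1}/\{b^{ap}B(p,q)(1+(x/b)^a)^{p+q}\}$, under which the Singh-Maddala case corresponds to $p=1$ and the Dagum case to $q=1$. It also assumes the standard Dagum Gini formula $G = \Gamma(\kappa)\Gamma(2\kappa+1/\alpha)/\{\Gamma(2\kappa)\Gamma(\kappa+1/\alpha)\}-1$. All function names are hypothetical.

```python
from math import exp, lgamma


def gb2_pdf(x: float, a: float, b: float, p: float, q: float) -> float:
    """GB2 density in the (a, b, p, q) parameterisation (assumed notation)."""
    log_beta = lgamma(p) + lgamma(q) - lgamma(p + q)
    return (a * x ** (a * p - 1.0)
            / (b ** (a * p) * exp(log_beta) * (1.0 + (x / b) ** a) ** (p + q)))


def sm_pdf(x: float, alpha: float, beta: float, gamma: float) -> float:
    """Singh-Maddala density h_SM(x | alpha, beta, gamma)."""
    return (alpha * gamma * x ** (alpha - 1.0)
            / (beta ** alpha * (1.0 + (x / beta) ** alpha) ** (gamma + 1.0)))


def dagum_pdf(x: float, alpha: float, beta: float, kappa: float) -> float:
    """Dagum density h_DA(x | alpha, beta, kappa)."""
    return (alpha * kappa * x ** (alpha * kappa - 1.0)
            / (beta ** (alpha * kappa)
               * (1.0 + (x / beta) ** alpha) ** (kappa + 1.0)))


def dagum_gini(alpha: float, kappa: float) -> float:
    """Closed-form Gini coefficient of the Dagum distribution (alpha > 1)."""
    if alpha <= 1.0:
        raise ValueError("a finite mean requires alpha > 1")
    log_ratio = (lgamma(kappa) + lgamma(2.0 * kappa + 1.0 / alpha)
                 - lgamma(2.0 * kappa) - lgamma(kappa + 1.0 / alpha))
    return exp(log_ratio) - 1.0


if __name__ == "__main__":
    x = 1.3  # illustrative evaluation point
    print(gb2_pdf(x, 2.0, 1.5, 1.0, 3.0), sm_pdf(x, 2.0, 1.5, 3.0))
    print(gb2_pdf(x, 2.0, 1.5, 0.7, 1.0), dagum_pdf(x, 2.0, 1.5, 0.7))
```

With $\kappa=1$ the Dagum distribution reduces to the log-logistic distribution, for which the Gini coefficient is $1/\alpha$; this provides a quick check of `dagum_gini`.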

Kakwani Lorenz curve

The three-parameter Lorenz curve proposed by Kakwani (1980) is given by

$$L_{KA}(p \mid \nu,\phi,\xi) = p - \nu p^{\xi}(1-p)^{\delta},$$

where and . This is also called the beta-type Lorenz curve and is known to be one of the most flexible and best-fitting Lorenz curves. The Gini coefficient is computed by numerical integration based on (1). For the latent process, it is set .
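The numerical integration of the Gini coefficient can be sketched as follows, assuming that (1) is the usual definition $G = 1-2\int_0^1 L(p)\,dp$. The beta-type curve is written here as $L(p) = p-\nu p^{a}(1-p)^{b}$, where the exponent names `a` and `b`, the function names and the use of composite Simpson's rule are choices made for this illustration only. For this functional form the integral is also available in closed form, $G = 2\nu B(a+1,b+1)$, which is used as a check.

```python
from math import exp, lgamma
from typing import Callable


def kakwani_lorenz(p: float, nu: float, a: float, b: float) -> float:
    """Beta-type Lorenz curve L(p) = p - nu * p**a * (1 - p)**b."""
    return p - nu * p ** a * (1.0 - p) ** b


def gini_numeric(lorenz: Callable[[float], float], n: int = 2000) -> float:
    """G = 1 - 2 * integral_0^1 L(p) dp by composite Simpson's rule."""
    if n % 2 != 0:
        raise ValueError("n must be even for Simpson's rule")
    h = 1.0 / n
    total = lorenz(0.0) + lorenz(1.0)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * lorenz(i * h)
    return 1.0 - 2.0 * (h / 3.0) * total


def kakwani_gini_closed(nu: float, a: float, b: float) -> float:
    """Closed form G = 2 * nu * B(a + 1, b + 1) for the beta-type curve."""
    log_beta = lgamma(a + 1.0) + lgamma(b + 1.0) - lgamma(a + b + 2.0)
    return 2.0 * nu * exp(log_beta)


if __name__ == "__main__":
    nu, a, b = 0.7, 1.0, 0.5  # illustrative values only
    g_num = gini_numeric(lambda p: kakwani_lorenz(p, nu, a, b))
    print(g_num, kakwani_gini_closed(nu, a, b))
```

Because `gini_numeric` accepts any callable, the same routine can be applied to other parametric Lorenz curves for which no closed-form Gini coefficient is used.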

Ortega Lorenz curve

The Lorenz curve of Ortega et al. (1991) is given by

$$L_{OR}(p \mid \alpha,\delta) = p^{\alpha}\left(1-(1-p)^{\delta}\right),$$

where and . When , this is the same as the Kakwani Lorenz curve with . The Gini coefficient is computed using the numerical integration. For the latent process, it is set .

Rasche Lorenz curve

The Lorenz curve of Rasche et al. (1980) is given by

$$L_{RA}(p \mid \delta,\gamma) = \left(1-(1-p)^{\delta}\right)^{\gamma},$$

where and . The Gini coefficient is computed using the numerical integration. For the latent process, it is set .

References

• Chib (2001) Chib, S. (2001). Markov chain Monte Carlo methods: computation and inference. In Heckman, J. J. and Leamer, E. (eds.), Handbook of Econometrics Volume 5, 3569–3649. Amsterdam: North Holland.
• Chotikapanich (2008) Chotikapanich, D. (ed.) (2008). Modeling Income Distributions and Lorenz Curves, Springer: New York.
• Chotikapanich and Griffiths (2002) Chotikapanich, D. and Griffiths, W.E. (2002). Estimating Lorenz curves using a Dirichlet distribution. Journal of Business & Economic Statistics, 20, 290–295.
• Chotikapanich and Griffiths (2005) Chotikapanich, D. and Griffiths, W.E. (2005). Averaging Lorenz curves. Journal of Economic Inequality, 3, 1–19.
• Dagum (1977) Dagum, C. (1977). A new model of personal income distribution: Specification and estimation. Economie Appliquée, 30, 413–437.
• David and Nagaraja (2003) David, H.A. and Nagaraja, H.N. (2003). Order Statistics, 3rd ed., Wiley: New York.
• Doornik (2007) Doornik, J. (2007). Ox: object oriented matrix programming, Timberlake Consultants Press, London.
• Gelfand and Ghosh (1998) Gelfand, A. E. and Ghosh, S. K. (1998). Model choice: a minimum posterior predictive loss approach. Biometrika, 85, 1–11.
• Griffiths and Hajargasht (2015) Griffiths, W. E. and Hajargasht, G. (2015). On GMM estimation of distributions from grouped data. Economics Letters, 126, 122–126.
• Hajargasht et al. (2012) Hajargasht, G., Griffiths, W. E., Brice, J., Rao, D.S. P. and Chotikapanich, D. (2012). Inference for Income Distributions Using Grouped Data. Journal of Business & Economic Statistics, 30, 563–575.
• Hasegawa and Kozumi (2002) Hasegawa, H. and Kozumi, H. (2002). Estimation of Lorenz curves: A Bayesian nonparametric approach. Journal of Econometrics, 115, 277–291.
• Kakamu (2016) Kakamu, K. (2016). Simulation studies comparing Dagum and Singh–Maddala income distributions. Computational Economics, 48, 593–605.
• Kakamu and Nishino (2018) Kakamu, K. and Nishino, H. (2018). Bayesian estimation of beta-type distribution parameters based on grouped data. Computational Economics, DOI:10.1007/s10614-018-9843-4.
• Kakwani (1980) Kakwani, N. C. (1980). On a Class of Poverty Measures. Econometrica, 48, 437–446.
• Kakwani and Podder (1976) Kakwani, N.C. and Podder, N. (1976). Efficient estimation of the Lorenz curve and associated inequality measures from grouped observations. Econometrica, 44, 137–148.
• Kleiber (2008) Kleiber, C. (2008). A guide to the Dagum distributions. In Chotikapanich, D. (ed.), Modeling Income Distributions and Lorenz Curves, Springer: New York.
• Kleiber and Kotz (2003) Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in Economics and Actuarial Science. Wiley: New York.
• Kobayashi and Kakamu (2019) Kobayashi, G. and Kakamu, K. (2019). Approximate Bayesian computation for Lorenz curves from grouped data. Computational Statistics, 34, 253–279.
• McDonald (1984) McDonald, J.B. (1984). Some generalized functions for the size distribution of income. Econometrica, 52, 647–663.
• McDonald and Xu (1995) McDonald, J.B. and Xu, Y.J. (1995). A generalization of the beta distribution with applications. Journal of Econometrics, 66, 133–152.
• Nishino and Kakamu (2011) Nishino, H. and Kakamu, K. (2011). Grouped data estimation and testing of Gini coefficients using lognormal distributions. Sankhya Series B, 73, 193–210.
• Nishino et al. (2012) Nishino, H., Kakamu, K. and Oga, T. (2012). Bayesian estimation of persistent income inequality using the lognormal stochastic volatility model. Journal of Income Distribution, 21, 88–101.
• Nishino and Kakamu (2015) Nishino, H. and Kakamu, K. (2015). A random walk stochastic volatility model for income inequality. Japan and the World Economy, 36, 21–28.
• Ortega et al. (1991) Ortega, P., Fernandez, M. A., Lodoux, M., and Garcia, A. (1991). A new functional form for estimating Lorenz curves. Review of Income and Wealth, 37, 447–452.
• Sarabia (2008) Sarabia, J.M. (2008). Parametric Lorenz curves: Models and applications. In Chotikapanich, D. (ed.), Modeling Income Distributions and Lorenz Curves, Springer: New York, 167–190.
• Rasche et al. (1980) Rasche, R. H., Gaffney, J., Koo, A., and Obst, N. (1980). Functional Forms for Estimating the Lorenz Curve. Econometrica, 48, 1061–1062.
• Ryu and Slottje (1996) Ryu, H.K. and Slottje, D.J. (1996). Two flexible functional form approaches for approximating the Lorenz curve. Journal of Econometrics, 72, 251–274.
• Singh and Maddala (1976) Singh, S.K. and Maddala, G.S. (1976). A function for size distribution of incomes. Econometrica, 44, 963–970.