Bayesian Spatial Homogeneity Pursuit of Functional Data: an Application to the U.S. Income Distribution

02/16/2020
by   Guanyu Hu, et al.
Boehringer Ingelheim
0

An income distribution describes how an entity's total wealth is distributed amongst its population. In economics, the Lorenz curve is a well-known functional representation of income distribution. Clustering of Lorenz curves based on both their similarities and spatial adjacencies is motivated by examining the household incomes in each state from the American Community Survey Public Use Microdata Sample (PUMS) data. We propose a mixture of finite mixtures (MFM) model as well as a Markov random field constrained mixture of finite mixtures (MRFC-MFM) model in the context of spatial functional data analysis to capture spatial homogeneity of Lorenz curves. We design efficient Markov chain Monte Carlo (MCMC) algorithms to simultaneously infer the posterior distributions of the number of clusters and the clustering configuration of spatial functional data. Extensive simulation studies are carried out to demonstrate the effectiveness of the proposed methods. Finally, we apply our proposed algorithms to the state level income distributions from the PUMS data. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

READ FULL TEXT VIEW PDF

Authors

page 8

page 22

page 23

page 29

02/16/2020

Bayesian Spatial Homogeneity Pursuit Regression for Count Value Data

Spatial regression models are ubiquitous in many different areas such as...
08/20/2020

Spatial homogeneity learning for spatially correlated functional data with application to COVID-19 Growth rate curves

We study the spatial heterogeneity effect on regional COVID-19 pandemic ...
10/16/2020

Analysis of professional basketball field goal attempts via a Bayesian matrix clustering approach

We propose a Bayesian nonparametric matrix clustering approach to analyz...
07/16/2020

Heterogeneity Learning for SIRS model: an Application to the COVID-19

We propose a Bayesian Heterogeneity Learning approach for Susceptible-In...
12/09/2021

Bayesian Functional Data Analysis over Dependent Regions and Its Application for Identification of Differentially Methylated Regions

We consider a Bayesian functional data analysis for observations measure...
03/31/2020

Distributed Bayesian clustering

In many modern applications, there is interest in analyzing enormous dat...
01/31/2022

Spectral image clustering on dual-energy CT scans using functional regression mixtures

Dual-energy computed tomography (DECT) is an advanced CT scanning techni...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Our study is motivated by an American Community Survey Public Use Microdata Sample (PUMS) data that contains incomes of United States (US) households in year 2017, which can be accessed via the PUMS data registry (https://www.census.gov/programs-surveys/acs/data/pums.html). Incomes of households as well as the states they live in are recorded. Our primary goal is to cluster the state level income distribution (O’sullivan and Sheffrin, 2007), i.e., how a state’s total wealth is distributed amongst its population. The income distribution has been a central concern of economic theory since the time of classical economists such as Adam Smith and David Ricardo. While economists have been conventionally concerned with the relationship between factors of production, land, labor, and capita for income distribution, modern economists now focus more on income inequality. Particularly, a balance between income inequality and economic growth is a desired goal for policy makers. Capturing homogeneity pattern of state level income distributions is of great research interest in economic studies, as it will enhance the understanding of income inequality among different regions within a country, and provide policy makers with reference as to issue different policies for the identified regions. An income distribution is often characterized with the Lorenz curve (Lorenz, 1905), which is a function showing the proportion of total income assumed by the bottom () of the population. Derived from the Lorenz curve, the Gini coefficient is a commonly used measure for income inequality (Gini, 1997), and it has been widely adopted by many international organizations, such as the United Nations and World Bank, to study income inequalities among regions. The Gini coefficient, however, is only a summary measurement of statistical dispersion of income distribution, and it is non-unique as two Lorenz curves can assume different shapes but still yield the same Gini value.

Thus far, many methods have been introduced to either directly model Lorenz curves or indirectly through the modeling of distribution functions. Popular parametric methods for income distributions in general use heavy tail distributions, including Pareto (Pareto, 1964), log-normal (Gibrat, 1931), Weibull (Bartels and Van Metelen, 1975), gamma (Bartels and Van Metelen, 1975)

, and generalized beta distributions

(McDonald, 1984; McDonald and Xu, 1995)

. Nonparametric methods include the commonly used empirical Lorenz curve estimation method and several other extensions that introduce various smoothing techniques

(Ryu and Slottje, 1996; Cowell and Victoria-Feser, 2008). Most of these existing methods only focus on modeling a univariate personal income distribution. There is a need for the development of spatial functional data analysis techniques to jointly model Lorenz curves across counties or states in economic studies. Without spatial homogeneity pattern detection and treating each state individually, more than 50 policies needs to be made, which could be a waste of public resource, while with a few clusters of states, only (the number of clusters) policies need to be made accordingly.

There are several major challenges in developing clustering algorithms for spatial functional data. First, spatial functional data such as state-level Lorenz curves often exhibit strong location-related patterns. It is necessary to incorporate such spatial structure into spatial functional data clustering algorithms. Nevertheless, most existing functional clustering algorithms are designed under the assumption that the observed functions are i.i.d curves (e.g., see a review paper by Jacques and Preda, 2014)

. These methods can be broadly classified into three paths: two-stage methods that reduce the dimension by basis representations before applying clustering approaches, nonparametric methods that define specific dissimilarities among functions followed by heuristics or geometric procedures-based clustering algorithms such as

-means, and model-based methods that specify clustering models such as mixture of Gaussian for basis coefficients. Recently, a number of works have been proposed to extend these functional clustering algorithms to the spatial context. Romano et al. (2011) and Giraldo et al. (2012) followed the second path to define dissimilarities among spatial functions based on spatial variograms and cross-variograms. Jiang and Serban (2012) followed the third path to model cluster memberships using an auto-regressive Markov random field, and introduce spatially dependent random errors in the conditional model for functions.

Second, it is desired to impose certain spatial contiguous constraints on the clustering configuration to facilitate interpretations in the spatial context. In other words, a local cluster is expected to contain spatially connected components with flexible shapes and sizes. In addition, in many economics applications, this spatial contiguous constraint may not dominate the clustering configuration globally, in the sense that two clusters that are spatially disconnected may still belong to the same cluster. For example, New England area could share certain similar demographic information with California despite the distance in between. Although a large body of model based spatial clustering approaches have been proposed in various spatial contexts, to the best of our knowledge, there is still a lack of clustering methods that allow for both locally spatially contiguous clusters and globally discontiguous clusters. For example, existing Bayesian spatial clustering methods based on mixture models, such as the finite mixture model used in the aforementioned spatial functional clustering algorithm (Jiang and Serban, 2012), can introduce spatial dependence in cluster memberships but may not fully guarantee spatial contiguity. Among the methods that guarantee spatial contiguity, they may either impose certain constraints on cluster shapes (Knorr-Held and Raßer, 2000; Kim et al., 2005; Lee et al., 2017), or fail to allow globally discontiguous clusters (Li and Sang, 2019).

Finally, an important consideration in clustering is how to determine the number of clusters. In Bayesian statistics, Dirichlet Process mixture models (DPM) have gained large popularity because of their flexibility in allowing an unknown number of clusters. Recently,

Miller and Harrison (2018) proved that DPM can produce an inconsistent estimate of the number of clusters, and propose a mixture of finite mixtures model to resolve the issue while inheriting many attractive mathematical and computational properties of DPM. However, their method may not be efficient for spatial clustering as it does not take into account any spatial information.

To address these challenges when facing the analysis of spatial income Lorenz curves, in this article, we develop a new Bayesian nonparametric method that combines the ideas of Markov random field models and mixture of finite mixtures models to leverage geographical information. A distinction of the method is its ability to capture both locally spatially contiguous clusters and globally discontiguous clusters. Moreover, it utilizes an efficient Markov chain Monte Carlo (MCMC) algorithm to estimate the number of clusters and clustering configuration simultaneously while avoiding complicated reversible jump MCMC or allocation samplers. We introduce this new Bayesian nonparametric clustering model to the analysis of the US state level household income Lorenz curves. In particular, we use a similarity measure among functional curves based on the inner product matrix under elastic shape analysis (Srivastava and Klassen, 2016), which has a nice invariance property to shape-preserving transformations. The results of real data reveal interesting clustering patterns of income distributions among different states, which provide important information to study regional income inequalities.

The rest of this paper is organized as follows. The motivating PUMS data is introduced in detail in Section 2. We briefly review elastic shape analysis of functions in Section 3.1, followed by a review of nonparametric Bayesian clustering methods in Section 3.2. We describe the proposed Markov random field constrained mixture of finite mixture prior model and introduce our functional data clustering model in Section 3.3. In Section 4

, the Bayesian inference including the MCMC sampling algorithm, the model selection criterion for tuning parameter, post-MCMC inference, and convergence diagnostic criteria are introduced. Simulation and case study using the PUMS data are presented respectively in Sections 

5 and 6. Section 7 closes the paper with some conclusions and discussions.

2 Motivating Data

Our motivating data comes from the 2018 submission in the PUMS data registry. US households’ incomes and the states they live in are recorded for the 50 states plus Wahington, DC. For simplicity, we refer to them as “51 states” in the rest of this paper. The Lorenz curve (Lorenz, 1905) is a commmonly used functional representation of the distribution of income or wealth for representing inequality of the wealth distribution. In general, functional data is multivariate data with an ordering on the dimensions (Müller, 2008), and can be generally written as

(1)

where is usually in a continuum such as time, and is a function often assumed to be smooth. The Lorenz curve, specifically, assumes that the household income

follows a cumulative distribution function (CDF) 

with respective probability density function

. Let be the inverse CDF defined as . The Lorenz curve is defined as

where . By definition, when plotted in a graph, the Lorenz curve always starts at and ends at (1,1), and measures on the bottom for of households, what percentage of total income they have.

In practice, the empirical Lorenz curve can be constructed from data in a similar fashion as the empirical distribution function. Define

where , for , and is the -th order statistic observations in group . Under mild regularity conditions, converges uniformly in and almost surely to (Gastwirth, 1972). The Gini index, as a derived measure, is defined as two times the area between the Lorenz curve and the 45 degree line of equality from to .

For the 2017 US household income data, examples of Lorenz curves are presented in Figure 1. The Lorenz curve computed on the national level using all observations is marked in black solid line, with a corresponding Gini coefficient of 0.4804. A closer look at the state-level Lorenz curves, however, reveals that the income distributions do vary across states. The Lorenz curves for two selected states, Utah and New York, are also illustrated in Figure 1(a) as an example. It is rather apprarent that while Utah’s curve lies above the national curve, indicating more equality, New York’s curve lies below, suggesting a larger gap between rich and poor. Lorenz curves for all US states are plotted upon the national curve in Figure 1(b), and they form a “cloud” instead of being similar to each other. The ability of Lorenz curves to describe income inequalities is clearly demonstrated here.

Figure 1: (a) Lorenz curves calculated based on the PUMS 2017 Household Income data on the national level and for two selected states; (b) Lorenz curves for all US states.
Figure 2: Descriptive statistics of PUMS data on the US map: (a) Gini coefficient; (b) state median income.

In addition to the Lorenz curves, descriptive statistics, which include the Gini coefficient and state median income, are presented in Figure 2. With a Gini index value of 0.423, Utah becomes the state that has the least income inequality, and Washington, DC has the worst income inequality with a Gini 0.512. It also has the highest median income of $90,000, while Mississippi has the lowest median income of $43,500.

3 Methodology

3.1 Functional Representation of Income Distribution

We begin the section by reviewing the functional data shape analysis technique. In order to cluster functional data, we need to define appropriate metrics to quantify similarities among functional curves. There are four important features of functional data including quantity, frequency, similarity, and smoothness. Commonly used distance metrics such as the Euclidean distance are no longer appropriate candidates for functional data analysis. In this article, we consider the inner product matrix calculated using a specific representation of curves called the square-root velocity function (SRVF; Srivastava et al., 2010)

. This inner product matrix is a summary statistic that encodes the similarity information among curves for subsequent clustering analysis.

The SRVF of a function is defined as:

(2)

where is the first order derivative of function on . There are several advantages of using SRVF for functional data analysis. First, the scaling, rotation and re-parameterization variabilities still remain based on SRVF. In addition, the elastic metric is invariant to the reparameterization of functions. The SRVF represents unit-length curves as a unit hypersphere in the Hilbert manifold. For given functions and which belong to  and their corresponding SRVFs, and , the inner product is defined based on the definition in Zhang et al. (2015) as follows:

(3)

where is the rotation action, and is an element of , the set of all orientation-preserving diffeomorphisms over the domain (Zhang et al., 2015). The inner product of two functions is easily obtained by the algorithm in Tucker et al. (2013). Given , we can calculate the inner product matrix using the definition in (3) and R package fdasrvf (Tucker, 2019).

3.2 Mixture of Finite Mixtures (MFM) for Functional data

Next, we introduce nonparametric Bayesian methods to capture spatial homogeneity of functional data. We start with a logit transformation of the inner product matrix

to make each entry

of the matrix within the range of a Gaussian distribution. The transformed inner product matrix is denoted as

, with each entry , where and denote the maximum and minimum values in , respectively. The larger  is, the closer and will be. We further assume that

(4)

where is the number of true underlying clusters,

denotes the normal distribution,

denotes the cluster membership of the -th curve; and are symmetric matrices, with indicating the mean closeness of any function in cluster and any function in cluster , and  indicating the precision of closeness between any function in cluster and any function in cluster .

Let denote all possible partitions of nodes into clusters. Given , let denote the sub matrix of consisting of entries with and . The joint likelihood of under model (4) can be expressed as

(5)

A common Bayesian specification when is given can be completed by assigning independent priors to , and , and it can be easily incorporated into a framework as finite mixture models. A popular solution for unknown is to introduce the Dirichlet process mixture prior models (Antoniak, 1974) as following:

(6)

where , and .

Dirichlet process is parameterized by a base measure and a concentration parameter . If a set of values of are drawn from , a conditional prior can be obtained by integration (Blackwell et al., 1973):

(7)

Here, is the distribution concentrated at a single point . Equivalent models can also be obtained by introducing cluster membership ’s and letting the unknown number of clusters go to infinity (Neal, 2000).

(8)

where . For each cluster , the parameters determine the cluster specific distribution .

By integrating out mixing proportions , we can obtain the prior distribution of that allows automatic inference on the number of clusters , which is also well known as the Chinese restaurant process (CRP; Aldous, 1985; Pitman, 1995; Neal, 2000). Through the popular Chinese restaurant metaphor, , are defined through the following conditional distribution (Pólya urn scheme, Blackwell et al., 1973):

(9)

where is the size of cluster .

While the CRP has a very attractive feature of simultaneous estimation on the number of clusters and the cluster configuration, a striking limitation of this model has been recently discovered. Miller and Harrison (2018) proved that the CRP produces extraneous clusters in the posterior leading to inconsistent estimation of the number of clusters even when the sample size grows to infinity. A modification of the CRP called Mixture of finite mixtures (MFM) model is proposed to circumvent this issue (Miller and Harrison, 2018):

(10)

where  is a proper probability mass function (p.m.f.) on  and  is a point-mass at . Compared to the CRP, the introduction of new tables is slowed down by the factor , which allows a model-based pruning of the tiny extraneous clusters. The coefficient  needs to be precomputed as:

where , and . By convention,  and .

The conditional prior of under MFM can be stated as below:

(11)

where are the distinct values taken by and is the number of existing clusters. The cluster membership , for , in (10) can be defined in a Pólya urn scheme similar to CRP:

(12)

where is the number of existing clusters.

Adapting MFM to our model setting for functional clustering, the model and prior can be expressed hierarchically as:

(13)

We assume is a distribution truncated to be positive through the rest of the paper, which has been proved by Miller and Harrison (2018) and Geng et al. (2019) to guarantee consistency for the mixing distribution and the number of clusters. We refer to the hierarchical model above as MFM-fCluster.

3.3 Markov Random Field Constrained MFM in Functional data

A possible weakness of MFM for spatial functional data is due to its lack of ability to account for spatial structure or dependence, i.e., MFM neglects the spatial smoothness of a map, and hence the resulting clustering does not comply any spatial constraints and might be sensitive to noises in the data. This drawback can be addressed by introducing spatial coupling between adjacent features. Applying a Markov random field prior to spatial statistical modeling is a classical Bayesian approach widely used in image segmentation problems (Geman and Geman, 1984). In this section, we combine the similar idea of Markov random fields with MFM to introduce spatial constraints for clustering.

The Markov random field (MRF; Orbanz and Buhmann, 2008)

provides a convenient approach to address the difficult problem of modeling a collection of dependent random variables

(Winkler, 2012). Interactions among variables are constrained to a small group that are usually assumed to be closer spatially, in order to reduce the complexity of the problem. The neighborhood dependence structure of MRF is encoded by a weighted graph in space, with vertices representing random variables at spatial locations, denoting a set of edges representing statistical dependence among vertices, and denoting the edge weights representing the magnitudes of dependence.

The MRF for a collection of random variables on a graph

has a valid joint distribution

, with being the cost function with the following form

(14)

where denotes the set of all cliques in , each term is a non-negative function over the variables in clique , and is a normalization term. By Hammersley-Clifford theorem, the corresponding conditional distributions enjoy the Markov property, i.e., , where  denotes the set of neighbors of observation . Considering only pairwise interactions, we model the conditional cost functions as

(15)

where is a parameter controlling the magnitude of spatial smoothness.

Markov random field constrained MFM (MRFC-MFM) is composed of an interaction term modeled by a MRF cost function to capture spatial interactions among vertices and a vertex-wise term modeled by a MFM. The resulting model defines a valid MRF distribution , which can be written as

(16)

with defined by conditional distributions in (11) and from conditional cost functions . This constrained model exhibits a key property that the MRF constraints only change the finite component of the MFM model as shown in Theorem 1 below. The proof is deferred to Appendix A.

Theorem 1

Let denote the size of the -th cluster excluding , denote the number of clusters excluding the -th observation, and assume is a valid MRF cost function. The conditional distribution of a MRFC-MFM takes the form

An immediate corollary of Theorem 1 can be defined in a Pólya urn scheme after introducing the cluster assignments parameters , .

Corollary 1

Suppose the conclusion of Theorem 1 holds. Then,

where , i.e., all elements of except for .

The above scheme offers an intuitive interpretation of MRFC-MFM again using the Chinese restaurant metaphor: the probability of a customer sitting at a table depends not only on the number of other customers already sitting at that table, but also the number of other customers that have spatial ties to the -th customer. The parameter  controls the strength of spatial ties, and ultimately controls estimation on the number of clusters. The larger the value for , the stronger the spatial smoothing effect and the smaller the number of clusters. This can be clearly observed in the simulation results presented in Section 5 (Figure 6). In particular, the MFM model developed in Miller and Harrison (2018) can be viewed as a special case of MRFC-MFM when . We use the notation to represent the MRFC-MFM prior with a smoothness parameter and a base distribution . The Markov random field constraint-mixture of finite mixture-functional clustering method (MRFC-MFM-fCluster) can be hierarchically written as

(17)

where 

is a normal-gamma distribution whose hyperparameters are the same with (

3.2). It is noted that while the model in (17) introduces spatial dependence to promote locally contiguous clustering, it still allows any customer a chance to sit with any other customer so that globally discontiguous clustering can be captured.

4 Bayesian Inference

MCMC is used to draw samples from the posterior distributions of the model parameters. In this section we present the sampling scheme, the posterior inference of cluster configurations, and meatrics to evaluate the estimation performance and clustering accuracy.

4.1 The MCMC Sampling Schemes

Our goal is to sample from the posterior distribution of the unknown parameters , , and . While inference in MFMs can be done with methods such as reversible jump Markov chain Monte Carlo or even allocation samplers, they often suffer from poor mixing and slow convergence. We adapt the algorithm in Miller and Harrison (2018) to exploit the Pólya urn scheme for the MRFC-MFM. An efficient collapsed Gibbs sampler is used for Bayesian inference by marginalizing out analytically. The sampler for MFM is presented in Algorithm 1 and the sampler for MRFC-MFM is presented in Algorithm 2

. These two algorithms only differ by the posterior probability of an observation assigned to an existing cluster. Both algorithms efficiently cycle through the full conditional distributions of

given , , and  for .

For the hyperparameters in both simulation studies and real data analysis, we use , , and . For , is assigned to diagonal terms and is assigned to off-diagonal terms in order to make it more informative. The choices of  ensure that the functions within a cluster will be closer to each other than between clusters. We arbitrarily initialized the algorithms with nine clusters and randomly allocated the cluster configurations. Various other choices were tested and we did not find any evidence of sensitivity to the initialization.

1:procedure c-MFM-fCluster
2:Initialize , and .
3:     for each iter to M  do
4:Update conditional on in a closed form as
5:Update conditional on in a closed form as
Where , and . Here is the number of clusters formed by current .
6:Update conditional on and , for each in , we can get a closed form expression for :
where denotes the partition obtained by removing and
Where , , , and .
7:     end for
8:end procedure
Algorithm 1 Collapsed sampler for MFM-fCluster
1:procedure c-MRFC-MFM-fCluster
2:Initialize , and .
3:     for each iter to M  do
4:Update conditional on in a closed form as
5:Update conditional on in a closed form as
Where , and . Here is the number of clusters formed by current .
6:Update conditional on and , for each in , we can get a closed form expression for :
where denotes the partition obtained by removing and
Where , , , and .
7:     end for
8:end procedure
Algorithm 2 Collapsed sampler for MRFC-MFM-fCluster

4.2 Post MCMC Inference

Dahl’s method (Dahl, 2006) is a popular post-MCMC inference algorithm for the clustering configurations and the estimated parameters. The inference of Dahl’s method is based on the membership matrices, , from the posterior samples. The membership matrix for the -th post-burn-in MCMC iteration is defined as:

(18)

where denotes the indicator function, i.e., indicates observations and  are in the same cluster in the -th posterior sample after burn-in iterations. Based on the membership matrices for the posterior samples, a Euclidean mean for membership matrices is calculated by:

The iteration with the least squares distance to is obtained by

(19)

The estimated parameters, together with the cluster assignments , are then extracted from the -th post burn-in iteration. An advantage of the Dahl’s method is to utilize the information of the empirical pairwise probability matrix .

Convergence diagnostic of the clustering algorithm is evaluated using the Rand index (RI; Rand, 1971), which is defined as

where and are two partitions of , and and  respectively denote the number of pairs of elements of that are (a) in a same set in and a same set in , (b) in different sets in and different sets in , (c) in a same set in but in different sets in , and (d) in different sets in and a same set in . Larger RI indicates better cluster performance. In particular, indicates that and are identical in terms of modulo labeling of the nodes.

5 Simulation

In this section, we detail the simulation settings, the evaluation metrics, and the comparison performance results.

5.1 Simulation Setting and Evaluation Metrics

We simulate the data using the spatial structure of the 51 states. We fix the true number of clusters at 3 and consider two different partition settings. The first partition setting shown in Figure 3 consists of two disjoint parts in the east and west coast. It is designed to mimic a rather common economic pattern that geographically distant regions share similar income distribution pattern, and geographical proximity is not the sole determining factor for homogeneity in income distribution. Another setting is the three-cluster partition in Figure 4, where geographical proximity becomes the dominating factor that determines the true cluster configuration.

Following Salem and Mount (1974), we generate simulated observations for each state from a Gamma distribution to mimic the long-tailed pattern that is often observed in econometrics data. In addition, an additional noise term following a Gamma distribution is added with probability 0.1 so that the income distribution within each state does not comprise a perfect fit to one certain distribution. We assume each cluster has its own set of distribution parameters shared by all states within it. The true values of the parameters are set so that the Lorenz curves computed on the simulated data are highly similar to those computed from real data (see Table 1). We consider two different parameter settings with small and large differences in income distributions among clusters, corresponding to weak and strong signal designs, respectively. For a total of 100 replicates, we show the Gini indices for different clusters of both weak and strong signal designs in Figure 5, which clearly exhibits major and minor overlapping among clusters, respectively.

The final clustering performance is evaluated using the estimated number of clusters and RI. The RI is calculated using the final clustering result selected by Dahl’s method for each replicate, and we calculate an average RI over all replicates in each setting.

Figure 3: Illustration of the first partition setting with three true clusters, where the first cluster consists of two disjoint components.
Figure 4: Illustration of the second partition setting with three clusters.
Signal Cluster Design
1
Weak 2
3
1
Strong 2
3
Table 1: Simulation designs with weak and strong signals. The symbol 

denotes Gamma distribution, and “Bin” denotes binomial distribution.

Figure 5: Histograms of Gini indices calculated from the simulated state-wise income data (5,100 in each panel from 100 replicates) for weak and strong signals under the two true partition settings.

5.2 Simulation Results

We first examine the inference results of the number of clusters, as well as the accuracy of clustering results from both MFM-fCluster and MRFC-MFM-fCluster. Each parameter setting listed in Table 1 is run with 100 replicates. For MRFC-MFM-fCluster, four choices of the spatial smoothing parameter, , are considered. The graph distance (GD; Bhattacharyya and Bickel, 2014) is used as the distance measure to construct the neighborhood graph used in the Markov random field model. Different upper limits of distance for two states to be considered as “neighbor” are tested in exploratory simulation studies, and it was observed that the choice of such upper limit is less impactful on the final clustering results when compared to the smoothing parameter . Therefore, in all presented simulation studies, the upper limit for graph distance is set to 3.

Figure 6 shows the histogram of the posterior samples of the number of clusters under each model. For ease of exposition, only frequencies for values between 1 and 5 are plotted. An immediate observation is the severe over-clustering by MFM-fCluster, which produces four final clusters for more than 60 replicates in partition design 1, and for more than 35 replicates in partition design 2. This is mainly due to noises in the simulated data. In contrast, under design 1, MRFC-MFM-fCluster with leads to over 50 correctly inferred number of clusters for the weak signal case, and with leads to over 60 for the strong signal case. This indicates that even under a design where not all members of one cluster are spatially contiguous, MRFC-MFM-fCluster is able to effectively avoid over-clustering and robust to noises. Under design 2, as the three true clusters do not have any disjoint parts, exerting stronger spatial smoothing with leads to over 50 correctly inferred number of clusters for weak signals, and a notable number of 87 for strong signals.

Figure 6: Histogram of the number of clusters inferred by the proposed method using Lorenz curves for each state under the two partition designs with two signal strengths.

5.3 Performance Comparison

Performance comparisons of our proposed method with two competing methods are made. In the first competing method, we treat the SRVFs derived from Lorenz curves as vectors, and use

-means to cluster them. The second competing method is the model-based clustering for sparsely sampled functional data proposed by James and Sugar (2003), which is available in R package funcy, and can be performed with function funcit() with option methods=”fitfclust”. Clustering recovery performances of all three methods are measured using the RI. For our proposed method, as can vary, we present all four RIs obtained for the four values in each setting. As neither competing methods can estimate the number of clusters but instead require it to be provided, to make a fair comparison, we provide the clustering configuration inferred corresponding to the that leads to the most correct number of clusters to competitors. Average RIs over 100 replicates are presented in Table 2, where we report the results for MRFC-MFM-fCluster for the four choices of ordered the same way as in Figure 6, as well as the case where , which corresponds to MFM-fCluster.

Despite the fact that this comparison favors the competing methods as in practice one needs to specify the number of clusters for them, in three out of the four total cases considered, our proposed method has the highest average RI over 100 replicates almost regardless of the choice of , with the only other exception being slightly worse than the model-based clustering when signal is strong under design 2. It can be observed that the -means method does not perform as well as our proposed method, and under design 1 this disparity becomes even larger for strong signals than for weak signals. The model-based clustering in general slightly outperforms the -means method, yet still does not achieve comparably high RI values as those of ours.

Design Signal MRFC-MFM-fCluster MFM-fCluster -means model-based
Design 1 Weak 0.806   0.808   0.808   0.809 0.786 0.785 0.790
Strong 0.952   0.962   0.962   0.953 0.939 0.925 0.943
Design2 Weak 0.859   0.835   0.835   0.760 0.831 0.817 0.810
Strong 0.981   0.961   0.961   0.942 0.985 0.978 0.989
Table 2: Comparison of final clustering performance using average RI over 100 simulation replicates

In addition, computation times for all methods are benchmarked using R package microbenchmark (Mersmann, 2019) on a desktop computer running Windows 10 Enterprise, with i7-8700K CPU@3.70GHz using single-core mode. A total of 20 replicates are performed to compute the average running time for each method. As expected, -means takes the least time of 1.62 seconds due to its simple iterative algorithm. Unlike the -means which can only provide clusters without making statistical inference of cluster memberships and sizes, our proposed method utilizes conjugate forms for efficient Bayesian inference that provides not only estimates of clusters but also their uncertainty measures at only a slightly higher computation cost. Indeed, it takes on average 20.79 seconds for one simulated dataset with 500 MCMC iterations, as in our empirical studies 500 iterations are sufficient for the chain to converge and stabilize. The model-based approach, however, takes more than three minutes to finish. Due to the time-consuming nature of the model-based approach, the actual simulation studies are conducted on a 16-core desktop computer using parallel computation. The code is submitted for review and will be made publicly available at GitHub after the acceptance of the manuscript.

6 Analysis of PUMS Data

In this section, we apply the proposed MRFC-MFM-fCluster to the analysis of US households’ income in 2017. Similar as in the simulation studies, the Lorenz curves for all states are obtained for the functional clustering analysis. Based on (2) and (3), we get the inner product matrix . The smoothing tuning parameter is selected to be 3. All hyperparameters are set to be consistent with those in simulation studies described in Section 4.1.

From the result of MRFC-MFM-fCluster, the 51 states are classified into four clusters as illustrated in Figure 7. Cluster 1 includes 17 states, which have the strongest income equality; Cluster 2 has 27 states, and Cluster 3 has 6 states. Washington, DC forms a cluster by itself with its exceptionally high Gini index of 0.512, while the average Gini indices in the other three clusters are 0.438, 0.470, and 0.492, respectively. The average Lorenz curves of the four identified clusters are plotted in Figure 8. The difference is rather obvious. From the clustering results, we see that (i) Washington, DC has the worst equality of household income distribution, while it has the highest median household income in US; (ii) most states in Cluster 1 are spatially contingent, and belong to a subregion often referred to as west or mid-west, such as Wyoming and Nebraska; (iii) most states with good income distribution equality indicated by their relatively low Gini coefficients have lower median household income; (iv) most states in Cluster 1 and Cluster 2 have similar income distributions with nearby states, which indicates that leveraging geographical information is helpful in Lorenz curves clustering.

One particularly important merit of our proposed method is that it allows globally discontinuous clusters. As shown in Figure 7, New Mexico and Tennessee belong to the same cluster. Their 2017 Gini coefficients are 0.4851 and 0.4858, respectively, which indicates these two states have very similar income distributions in terms of Gini coefficients. Based on 2010 American Community Survey from U.S. Census Bureau https://factfinder.census.gov/, the Gini coefficients of these two states have been historically very close. In addition, there are several government policies that could be applied for different clusters. For the states in Clusters 3 and 4, increasing the minimum wage and expanding the earned income tax are two strategies for improving the equality of income distribution. Most states in Cluster 1 have much lower median household income. Decreasing the income tax will help increase their overall household income, which is at the cost of minor sacrifice in income distribution equality. Furthermore, an increase in government expenditures will help increase the household income directly for the states in Cluster 1.

The posterior estimate of in (17) is

(20)

It is noticeable that the diagonal entries of is much larger than the off-diagonal entries, which suggests the within-cluster similarity is much higher than between-clusters similarities. Cluster 1 has least similarity with Cluster 4 based on (20), which is consistent with the results presented in Figure 8.

Figure 7: Illustration of the four clusters identified by the proposed method for the 51 states.
Figure 8: Average Lorenz curves for states in the four identified clusters.

Finally, to make sure the cluster configuration presented here is not a random occurrence but reflects the true pattern demonstrated by the data, we run 100 separate MCMC chains with different random seeds and initial values, and obtained 100 final clustering schemes. The RI between each scheme and the present clustering scheme in Figure 7 is calculated, and they average to 0.948, indicating high concordance of conclusion regardless of random seeds.

7 Discussion

In this paper, we proposed both mixture of finite mixtures (MFM) and Markov random field constrained mixture of finite mixtures (MRFC-MFM) to capture spatial homogeneity of income distribution based on functional inner product of Lorenz curves. Two efficient algorithms are proposed for the inference of the proposed methods. Comprehensive simulation studies are carried out to show MRFC-MFM achieves better performance than the traditional MFM model in terms of spatial homogeneity pursuit. It also outperforms the -means and model-based methods under various designs, and the comparison of performance is relatively robust under different choices of the spatial smoothing parameters. A case study using the PUMS data reveals a number of important findings of income distributions across 51 states in US.

A few topics beyond the scope of this paper are worth further investigation. In this paper, the logit transformation of inner product matrix is proposed. Modeling the original inner product matrix is an interesting alternative in future work. In addition, the smoothness parameter in the Markov random field needs to be selected. Treating it as an unknown parameter and proposing a prior in a hierarchical model for it may improve the efficiency. Besides the geographical information, other auxiliary covariates, such as demographic information, could also be taken into account for clustering in our future work.

Appendix A Proof of Theorem 1

Full conditionals from (16) can be obtained up to a constant as the product of the full conditionals of each part

(21)

As a direct characteristic from the defined cost function in (15), the support of is the set of existing cluster parameters