Individual Heterogeneity Learning in Distributional Data Response Additive Models

by   Zixuan Han, et al.

In many complex applications, data heterogeneity and homogeneity exist simultaneously. Ignoring either one will result in incorrect statistical inference. In addition, coping with complex data that are non-Euclidean becomes more common. To address these issues we consider a distributional data response additive model in which the response is a distributional density function and the individual effect curves are homogeneous within a group but heterogeneous across groups, the covariates capturing the variation share common additive bivariate functions. A transformation approach is first utilized to map density functions into a linear space. We then apply the B-spline series approximating method to estimate the unknown subject-specific and additive bivariate functions, and identify the latent group structures by hierarchical agglomerative clustering (HAC) algorithm. Our method is demonstrated to identify the true latent group structures with probability approaching one. To improve the efficiency, we further construct the backfitted local linear estimators for grouped structures and additive bivariate functions in post-grouping model. We establish the asymptotic properties of the resultant estimators including the convergence rates, asymptotic distributions and the post-grouping oracle efficiency. The performance of the proposed method is illustrated by simulation studies and empirical analysis with some interesting results.



There are no comments yet.


page 19

page 21


A robust spline approach in partially linear additive models

Partially linear additive models generalize the linear models since they...

Local Inference in Additive Models with Decorrelated Local Linear Estimator

Additive models, as a natural generalization of linear regression, have ...

A fusion learning method to subgroup analysis for longitudinal trajectories

Uncovering the heterogeneity in the disease progression of Alzheimer's i...

Boosting Functional Response Models for Location, Scale and Shape with an Application to Bacterial Competition

We extend Generalized Additive Models for Location, Scale, and Shape (GA...

Local Regression Distribution Estimators

This paper investigates the large sample properties of local regression ...

Bayesian Lifetime Regression with Multi-type Group-shared Latent Heterogeneity

Products manufactured from the same batch or utilized in the same region...

Simultaneous Parameter Learning and Bi-Clustering for Multi-Response Models

We consider multi-response and multitask regression models, where the pa...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

With the development of modern technology, data are increasingly being measured and recorded at several discrete times or a continuous time interval, which are called functional data, and have become a prevailing type of data (Ramsay, 1982). Functional data analysis provides statistical methodology to deal this kind of data, among which functional regression is widely used to model the relationship of responses and predictors. A great quantity of researches focus on this field, which can be divided into three classes according to whether the responses or predictors are functional or scalar data as follows. The first case is that both responses and predictors are functional data, see (Malfait and Ramsay, 2003; Yao et al., 2005; He et al., 2010; Jiang and Wang, 2011; Cheng et al., 2016); the second one is functional responses with scalar predictors, see (Ferre and Yao, 2003, 2005; Hall and Horowitz, 2007; Hilgert et al., 2013); while the other one is scalar responses with functional predictors, see (Zhu et al., 2012; Luo et al., 2016; Li et al., 2017).

As a specific case of functional data, data consist of samples of distributions or densities appear in various research domains increasingly often. Examples giving rise to such data are distributions of cross-sectional or intraday stock return (Sen and Ma, 2015; Kokoszka et al., 2019), mortality densities (Petersen and Muller, 2019), distribution of intra-hub connectivity in neuroimaging (Saha et al., 2016; Petersen et al., 2019). Compared with conventional functional data, distribution or density function takes data as a whole to present its internal structure without the limitation of sample order and the information dimension. In recent years, along with the application of such data, the attention has turn to the complex regression models in which the random distributions or probability densities serve as responses or predictors. In this article, we focus on the model with density functions as responses coupled with scalar predictors, viz., distribution-on-scalar regression model.

Considering the density functions as elements of a Hilbert space, they do not constitute a linear subspace due to the inherent constraints of being nonnegative and integrated to one. To deal with this constrain, one way is to adopt the geometric approach by choosing a suitable metric. With the the infinite-dimension version of Aitchison geometry, Talská et al. (2018)

defined a density-on-scalar linear regression model in Bayes Hilbert spaces.

Chen et al. (2020) proposed distribution-on-distribution regression by adopting Wasserstein metric and tangent bundle of the Wasserstein space. Except this, some other models are within the broad framework of non-Euclidean data. For instance, Petersen and Muller (2019) proposed the Fréchet linear regression in a general metric space equipped with the Wasserstein metric. Jeon and Park (2020) developed a unified nonparametric additive regression models with Hilbertian responses where density-valued variables constitute Bayes-Hilbert spaces equipped with a suitable inner product. Another way of dealing with the nonlinear constrain of density functions is to map them into Hilbert space by transformation method. Petersen and Muller (2016)

proposed a continuous and invertible transformation, e.g., the log quantile density transformation (LQD) to map probability densities to an unrestricted space of square integrable functions for further modeling and statistical inferences.

In conventional statistical analysis, data is generally assumed to be homogeneous. However, this assumption might be inappropriate in many practical applications when the data are collected from objects with different characteristics or in different situations. Ample empirical studies show that inter-class individual homogeneity and intra-class heterogeneity may exist simultaneously, while ignoring the individual heterogeneity during the analysis may lead to incorrect statistical results, and ignoring the homogeneity will result in inefficient statistical inference. Therefore, the density functions within a heterogeneous population should be clustered into several homogeneous groups by some classification measures.

A mature literature proposes various methods to identify the latent group structures. Vogt and Linton (2017) applied a distance-based clustering algorithm to the kernel estimation of nonparametric regression function. Vogt and Linton (2018) extended it to a multiscale statistic to avoid the selection of specific bandwidth. Su et al. (2016)

developed a so-called classifier-Lasso (C-Lasso) shrinkage method to the linear panel structure model. As a extension,

Su et al. (2019) developed a penalized-sieve-estimation-based C-Lasso procedure to heterogeneous time-varying panel data. Chen (2019) proposed a kernel based hierarchical agglomerative clustering (HAC) algorithm with less restrictive assumptions to the same kind data compared with former method. Relevant approaches discussed above are contributed in the context of functional or panel data.

The focus of this article is to identify and estimate the latent group structures in additive regression model with density responses. The heterogeneity in the density responses are indeed found when we explore the COVID-19 data. On February 11, 2020, the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was officially named as ‘COVID-19’ by the World Health Organization (WHO). With the increase in the infection range and transmission speed, the WHO declared the COVID-19 a pandemic on March 11, 2020. At that time, there were about 37371 confirmed cases and 1130 deaths reported in about 114 countries. By December 15, 2020, according to the official repository of WHO (, there have been 73,315,291 confirmed cases reported in about 271 countries and regions, leading to about 1,627,480 deaths. The tremendous increase on these numbers indicate that the COVID-19 is impacting the whole world drastically. It should be noted that due to the medical conditions and adopted prevention policies, along with some other factors both subjective and objective, the epidemic situation in different countries may vary to some extent.

Figure 1: Densities of relative daily mortality rate of COVID-19 in 149 countries over a period of 240 days.

To explore the epidemic in each country relative to the global situation with influential factors, we take the density of relative daily mortality rate over a period 240 days in each country as response, where the daily mortality rate is defined as the ratio of deaths per day to the total population of each country, and the relative one is defined as the ratio of daily mortality rate in each country to the total mortality rate of all countries in the world. The predictors considered for explanation are ‘aging’ (the percentage of population ages 65 and above), ‘beds’ (the number of hospital beds per 1000 people), ‘physicians’ (the number of physicians per 1000 people), ‘nurses’ (the number of nurses per 1000 people) , ‘GDP’ (the GDP per capita in the US dollar) , and ‘diabetes’ (the percentage of population with diabetes). Since the outbreak time varies in different countries, we set the date on which each country first reported at least 30 deaths as the origin of time scale. Totally 149 countries are taken into account and the densities of daily mortality rate for these 149 countries are presented by Figure 1. The existence of different shapes of the densities among countries intrigues us to consider the subject-specific functions and impose latent grouped structures representing heterogeneity in the functional additive model proposed by Han et al. (2020), i.e.

where is the density of the relative daily mortality rate over a period 240 days in country , is the LQD transformation,

is the covariate vector of country

. Specifically, if country is sharing the kind pattern, and is one of group-specific functions.

Motivated by Chen (2019), we apply HAC method to the estimation of individual functions , based on which the density functions of relative daily mortality rate can be clustered into four groups with different patterns of epidemic situations, i.e., , which is shown in Figure 2. The figure is also served as sufficient and necessary evidence that the functional additive model implemented for the analysis of COVID-19 data should consider the individual heterogeneity among countries.

Figure 2: Densities of relative daily mortality rate of COVID-19 over a period of 240 days in four different groups corresponding to the clustering results.

In order to incorporate both inter-class homogeneity and intra-class heterogeneity in the data, we extend the additive functional regression for densities as responses proposed by Han et al. (2020) to the case of density functions with heterogeneity:




where is a partition of the index set , which means that , and , for any . We assume that the number of groups and the membership in each individual group are unknown.

In the above model, are the random densities, each coupled with the dimensional covariates , with common support . Without loss of generality, we take . Denote as the log quantile density transformation, and . is the subject-specific function and is the bivariate additive components. For the purpose of the identification, we assume that are: , for , . are independent processes with and .

Obviously, our proposed model (1) reveals that it is a natural extension of functional additive model. Specifically, if the subject-specific functions are homogeneous, viz., , , , then the model (1) becomes the additive function regression for densities as responses proposed by Han et al. (2020)

Actually, Han et al. (2020) has conducted great work and gained excellent academic achievements. It proposed a novel additive functional regression model to accommodate random densities as responses and multivariate predictors with the increasing popularity of analyzing data in the form of distributional functions, and established relevant theoretical properties and statistical inference. Motivated by this model, we are interested to identify the subgroups when analyzing data consisting of densities with heterogeneity.

In real applications, since only a random sample generated by the density is available, we first estimate each density by the modified kernel density approach proposed by Petersen and Muller (2016) along with the LQD transformation. The identification and estimation methodology of the latent group structures are accomplished in three steps. First, we utilize B-spline series approximating approach to estimate subject-specific function and bivariate additive components . Then, we apply HAC method to identify the membership of each latent group. Finally, backfitted local linear regression is applied to improve the efficiency of group-specific function and . Theoretical results concerning the resultant estimations including the uniform convergence rate of the initial estimation, the consistency of the estimation of group number and partitions, both the uniform convergence rate and the asymptotic normality of the group-specific functions and the post-clustering estimation of additive component are also established.

The rest of this article is organized as follows. In Section 2, the modified kernel estimation and the LQD transformation for density functions are introduced as the preliminary works. Section 3 presents the procedure for identification and estimation of latent group structures and additive components in the model. Theoretical results are included in Section 4. The Monte Carlo simulations are conducted to illustrate the efficiency of proposed procedure in Section 5. We also demonstrate the application of proposed method by analyzing the COVID-19 and GDP data in Section 6. A discussion follows in Section 7. Detailed proofs of theoretical results and additional numeric results are in the supplementary materials.

2 Preliminaries

In this section, some preliminaries required for the model will be presented.

2.1 Log Quantile Density Transformation


be the class of univariate continuous probability density function

, with a common support and satisfying . Without loss of generality, we take . For each , let , be the corresponding quantile function, i.e, , and be the quantile density function defined as the derivative of quantile function, i.e. for .

To map each into the linear space , we utilize the log quantile density transformation (Petersen and Muller, 2016) that is defined as

2.2 Modified Density Estimation

A challenge for fitting the regression model with density response is that in real applications, , thus , is unobservable, and can only be estimated from the random sample generated by , viz., , with corresponding covariate . Without loss of generality, we assume

. Due to the boundary effects of conventional kernel density estimators,

Petersen and Muller (2016) proposed a modified kernel density estimator as follows, i.e.

with the weight function be as , as and 1 otherwise. Bandwidth , is of bounded variation and symmetric of 0, satisfying the conditions that , , and are finite. Different from the conventional kernel density estimator, modified kernel density estimator possesses the consistency property, viz.,

Based on the estimation of density functions, to fit model (1), we take as the substitution of and denote , then the model (1) is written as


where is the random error resulting from the estimation of .

3 Identification and estimation methodology

In this section, we provide the methodology for identification of the latent group structures and estimation of additive components in the proposed model through a three-step procedure. The identification and estimation procedure is provided in Algorithm 1.


1:  Data: , where ,
2:  Estimated Density: Modified kernel density estimation .
3:  Transformed Density:
4:  Initial Estimation: B-spline approximation
5:  HAC Algorithm: Estimated group sets index
  • 0.8

  • Begin with groups, where each subject is a group.

  • Merge the two groups with smallest distance into one.

  • Recalculate the distance between new groups after each grouping.

  • Repeat the previous two steps until the number of groups achieves .

6:  Refined Estimation: Backfitted local linear estimation , , .
Algorithm 1 Identifying Heterogeneity in Additive Model with Density Responses

3.1 Initial estimation

Spline series approximating method is commonly used to estimate unknown nonparametric functions, with detailed practical guidance in De Boor (1978) and Stone (1994).

Let be the set of B-spline basis functions of order with interior knots, where . Let be the set of B-spline basis functions of order for () with interior knots, where . Then, defined the normalized spline basis of () as , and denote as the scaled version of .

Based on the basis functions, define the tensor product of B-spline basis as

The spline approximation of subject-specific and bivariate components are given by

Denote , therefore, the model (1) can be written as , and we can approximate by

Then, the corresponding estimations are


where is a -dimensional vector satisfying


3.2 Identifying latent groups structures via HAC algorithm

Based on the initial estimations of subject-specific functions (4), the classic hierarchical agglomerative clustering (HAC) algorithm is applied for identifying the latent groups given the group number . To tackle this problem, we need a metric to measure the distance between pair of functions and . For instance, Chen et al. (2019) used -distance to measure the similarity of the coefficient functions in nonlinear models. Vogt and Linton (2017) combined -distance with -means procedure to obtain the group structure. Vogt and Linton (2018) developed the multiscale techniques based on for clustering.

Similarly as Cardot et al. (1999), in this article we work with the general -distance

which can be estimated as

Assume that the number of groups is given. The HAC algorithm for finding out the latent group structures among the individual subject-specific functions can be summarized as follows. First, begin with groups with each subject as a group. Second, calculate the distance matrix , find the smallest element except for the main diagonal elements and merge the corresponding groups into one. Next, recalculate the distance between new groups and the distance matrix with reduced size after each grouping. The distance between two groups is defined as the furthest distance between any two functions with one in a group and the other one in another group. Finally, repeat the previous two steps until the number of groups achieves .

3.3 Estimation of Latent Group Structures

Given the true group membership, the model (1) can be written in group-structure form


To estimate the group-specific functions and bivariate additive component functions , we apply the backfitted local linear algorithm proposed by Fan (1993). Define

By plugging in the initial estimation , we obtain the estimation of as

For each given , can be approximated by the first-order Talyor expansion

Define the weighted squared function as

where is a nonnegative kernel function with bandwidth and denotes the estimated membership of group . Then by minimizing , the estimation of as is derived as


where and for .

With the estimation of group-specific functions, we can have the estimation of

where is the indicator function.

For each given and , we approximate by

Then the pointwise estimator of as can be obtained by minimizing

Since the pointwise estimation of for each may not be smooth due to the dependence on the estimation of density responses , then an additional local smoothing step is implemented in the direction of to rectify the smoothness. To do so, for each , we define the following function

where is a nonnegative kernel function with bandwidth . By minimizing , the refined estimator of can be obtained as


where and for .

3.4 Selection of Number of Groups

The identification and estimation of the latent group structures discussed above is based on the prior that the number of groups is known, but it is not true for practical problem. A major challenge in clustering is to estimate the group when the number is unknown. Following Chen (2019), the information criterion is applied for the selection of group number.



and is a tuning parameter whose value may rely on .

The number of latent groups is estimated by minimizing the criterion , i.e.,

where is pre-specified as the maximal number of groups.

In the simulation studies, we take two choices of into account. They are

where and denotes the cardinality of the group . They correspond to two information criterions, the generalized Bayesian information criterion (GBIC) with and the generalized Akaike information criterion(GAIC) with . Other information criterions can be found in Chen et al. (2019) and Chen (2019).

3.5 Selection of Bandwidth

In this article, the leave-one-out cross-validation method is implemented for the bandwidth selection. Let , we select the bandwidth given the number of latent group by minimizing the following mean squared error

where is the estimated group membership. For each , and are the estimations with bandwidth of and obtained by using observations except the -th observation respectively.

4 Theoretical Results

Throughout this paper, for any fixed interval , we denote the space of -th order smooth function as , and the class of Lipschitz continuous functions for some fixed constant as Meanwhile, let and denote the support of and , respectively. Obviously, . The necessary assumptions for the asymptotic results are listed as follows.

(A1) For any , is differentiable, and there exists a constant , such that are all bounded by .

(A2) (a) The kernel density is Lipschitz-continuous, bounded and symmetric about 0. Furthermore, for some constant . (b) The kernel density satisfies the conditions that , , and are finite. The kernel density also satisfies the above assumptions.

(A3) (a) For each , the process is stationary and -mixing dependent for , with the mixing coefficient decaying to zero at a geometric rate, i.e., there exists constants and , such that for all . (b) The covariate variables , , and the errors

satisfy the following moment conditions that for some


For each , the covariance function

has finite non-decreasing eigenvalues

, satisfying .

(A4) The latent group functions , , have continuous second order derivatives on the support interval, namely, , and for some constant . Meanwhile, the additive component functions are continuous functions on and twice continuously partial differentiable with respective to and , where is a compact subset of .

(A5) (a) The density function of covariate , , is continuous and bounded, with continuous derivatives of marginal densities at each . (b) For and , the joint density of and , , as well as the joint density of and , , are continuous and partially differentiable with respect to and , with continuous second order partial derivatives.

(A6) (a) Let , where is the element of distance matrix discussed before, then . (b) There exists a positive constant , with , such that.

(A7) , , , , , as .

Remark 1.

Assumption (A1) is basic and essential to derive the consistency of densities after transformation. The conditions in (A2) on the kernel function are mild and can be satisfied by commonly used kernel functions such as uniform and Epanechnikov kernel. Assumption (A3) (a) relaxes the dependence of error process and covariates spaces to the -mixing dependence which is one of the weakest dependence conditions. The moment conditions in (A3) (b) is crucial to derive the uniform convergence and other asymptotic properties based on the kernel function. The smoothness conditions of component functions in (A4) and (A5) are greatly relaxed. Assumption (A6) (a) indicates that can converge to zero at an appropriate rate, and (b) are useful in proving the consistency of estimated group number via information criterion proposed before. (A7) are common conditions applied in kernel smoothing to satisfy the optimal convergence rates.

We first derive the uniform consistency of initial estimations of and which is presented by Theorem 1.

Theorem 1.

Assume that (A1)(A4) and (A7) hold, and are the initial estimations of and respectively, which are defined by (4), ,. Then as and , it holds that