With massive data readily available in the digital and information era, advanced statistical learning methodologies for the analysis of big data are in high demand. Traditional statistical methods are often used to discover the general rule of the population. However, in many applications we are also interested in an individual entity for personalized solutions or products. For instance, in precision medicine, each patient has his or her own traits. Therefore, it is crucial and beneficial to make individualized treatments and prescribe personalized medicine (Liu and Meng, 2016; Qian and Murphy, 2011; Zhao et al., 2012; Yang et al., 2012; Collins and Varmus, 2015; Wang et al., 2007). In business, the so-called 'Market of One' strategy, which makes a customer feel that he or she is exclusive or preferred by the firm, has become popular among companies designing personalized products. Indeed, individualized learning and inference matter in many applications.
Since no two patients or two customers are exactly the same, heterogeneity often exists in a population. It poses a challenge to combine the data from different individuals, especially for making improved inferences in individualized learning. A class of conventional methods is to cluster/group individual entities into subgroups and, assuming homogeneity within each subgroup, then use the data in the same subgroup for statistical analysis (Jain et al., 1999; Xu and Wunsch, 2005; Agrawal et al., 1998; Binder, 1978; Ng and Han, 1994; Gan et al., 2007; Liao, 2005; Jain, 2010). The clustering and grouping in the conventional methods are typically performed a priori. Such approaches have several disadvantages. Firstly, the constitution of subgroups often depends on a predetermined total number of subgroups, a parameter that is difficult to choose reliably in practice. Secondly, since analytic outcomes and inference (e.g. estimated parameters and testing) are the same for all individuals in the same subgroup, such a procedure potentially obscures hidden local structures. More importantly, in many cases, there may not be clear-cut and well-separated subgroups in the population. In these situations, the conventional subgroup analysis may impose an artificial grouping structure on the population, which can potentially lead to large biases and invalid inference for many individuals.
Another class of conventional methods is to assume mixture models, including classical hierarchical models and Bayesian nonparametric models (Duda and Hart, 1973; Lindsay, 1995; Figueiredo and Jain, 2000; Ferguson, 1973; Antoniak, 1974; Lo, 1984; Teh et al., 2005). Similar to the clustering method, the mixture models assume that the population contains several homogeneous subpopulations, but unlike clustering, there is no clear boundary between the subpopulations. However, inference on each individual is not the focus of such a procedure. It is often done as an afterthought, by estimating the mixture likelihood. Furthermore, a mixture model may not be able to explain the population heterogeneity when the assumed latent structure is invalid. In addition, given an observation, it is usually difficult to tell which subpopulation it belongs to.
In this article, we propose a new method called individualized group learning, abbreviated as iGroup. Instead of grouping at the population level, the iGroup approach focuses on each individual and forms an individualized group for the target individual, by locating individuals that share similar characteristics with the target. It sidesteps the aforementioned difficulties by forming an iGroup specifically for the target individual while ignoring other entities that have little in common with the target. Figure 1 demonstrates the difference between group identifications in a two-dimensional feature space. The left panel shows the result from a k-means clustering method with three groups. Each point is assigned one cluster label. Data points having the same label are assumed to follow an identical statistical model, even though a large amount of heterogeneity may still exist among the individuals in the same group. The right panel demonstrates the individualized groups for two selected points (bold). Instead of assuming disjoint cluster regions, the individualized group, whose boundary is shown as a solid line, is specific and unique for each individual. Therefore, the laws for two individuals are generally different as their identified individualized groups are different. iGroup corresponds essentially to a local nonparametric approach.
In this paper, two sets of information are utilized in our proposed framework to define similarity and to form groups. One is the individual-level estimator $\hat\theta_k$, which is a direct estimate of $\theta_k$, the parameter of interest, for each individual $k$ in a parametric model with observation $x_k$, without any grouping. The other is exogenous information $z_k$, which is observed outside of the parametric model but can reveal similarity between the parameters. Both $\hat\theta_k$ and $z_k$ can provide useful information in identifying groups, so that closeness in the space of $(\hat\theta, z)$ implies closeness in the space of $\theta$. Depending on the feasibility and availability of the two information sets, iGroup can be constructed based on three different information sets: $\{z\}$, $\{\hat\theta\}$ and $\{\hat\theta, z\}$. They will be discussed in detail in later sections.
To ease our notation, from now on, let us say our goal is to estimate $\theta_0$ for individual 0. The estimator is constructed with a specified loss function, the observations $(x_0, z_0)$ on individual 0 and all other available observations $\{x_k\}$ and $\{z_k\}$. By focusing on individualized local structures, the proposed iGroup learning is robust and effective for handling heterogeneity arising from diverse sources in big data, and it is ideally suited for specific objective-oriented applications in individualized inference. Additionally, in terms of computation, by ignoring a large number of irrelevant entities and zooming directly to the relevant individuals, the iGroup learning is parallel in nature and can scale up better for big data. In this paper, we investigate the validity and theoretical properties of iGroup learning and provide simulation studies and applications to demonstrate the grouping effectiveness of the proposed methodology.
The rest of the paper is organized as follows. Section 2 presents the general framework. Section 3 focuses on three different information sets with asymptotic analysis and theoretical results. Section 4 provides three simulation studies and Section 5 provides two real-data applications. Section 6 concludes.
2 General Framework
2.1 Problem setup
Assume for each individual $k$, we observe $(x_k, z_k)$, where observations $x_k$ and $z_k$ differ in their utilities. Specifically, $x_k$ is the observed data that is directly related to the parameter of interest $\theta_k$ at the individual level, with a known distribution $p(x_k \mid \theta_k)$. The exogenous variable $z_k$ serves as a proxy that reveals the similarity among the $\theta_k$'s at the population level. Specifically, we assume that $z_k$ is related to an unknown parameter $\eta_k$ through an unknown distribution $p_z(z_k \mid \eta_k)$, and the parameter $\theta_k$ is an unknown continuous function of $\eta_k$, i.e. $\theta_k = g(\eta_k)$, where the function $g$ is not necessarily a one-to-one mapping. The continuity of $g$ guarantees that closeness in $\eta$ implies closeness in $\theta$. The hierarchical structure and the relationship among the variables are demonstrated in Figure 2,
where is an unknown (prior) population distribution of , which may be heterogeneous in nature. Although is unknown and unspecified, it appears in theoretical calculations throughout the theoretical analysis in this paper. Without further clarification, all unconditioned expectations
are assumed to take over all random variables including, which follows the unknown prior . Posterior expectations on conditioned on certain observed information are explicitly noted with in the subscript such as . The distribution is known except the parameter , but both the function and the distribution are unknown. The role of the exogenous variable will be discussed further in later sections. In some cases may not be available.
One example of the above setup is that $x_k$ is the daily stock price returns of company $k$, which follows a distribution $p(x_k \mid \theta_k)$, and $z_k$ is the company's characteristics (e.g. sectors, capital sizes, financial exposure, etc.), which are related to the stock volatility $\theta_k$. Another example is that $x_k$ is a binary indicator of whether individual $k$ has a certain disease and $z_k$ is the individual's health indices such as weight, height, blood pressure, etc., where the underlying $\theta_k$ is the probability of infection.
Denote by an -neighborhood (or a clique) of individual 0, where is a distance/similarity measure and is the threshold value. Thus, the clique is a set of indexes of individuals that are similar to individual 0. In our model development, we impose two regularity assumptions as below.
Assumption 1 (Dense Assumption).
There exists a constant such that for all , in probability when .
Assumption 2 (Smooth Parameter Assumption).
There exists a positive constant , such that for all
where is a metric on .
The dense assumption suggests that individual 0 of interest is not isolated from other individuals, i.e. for arbitrarily small , there are a sufficiently large number of other individuals in its neighborhood as . The smooth parameter assumption guarantees that whenever and are close, the distributions of and induced from and , respectively, are close to each other. Under these two assumptions, it is beneficial to aggregate information from the neighborhood to estimate since one can always find a sufficient number of similar individuals in the neighborhood of individual 0. A key consideration in this aggregation is the familiar bias-variance trade-off: aggregation over a larger group increases the sample size and thus reduces estimation variance, but it also introduces bias.
2.2 Aggregated estimation in iGroup
There are two common methods to aggregate information by creating ‘pooled’ estimators for . The first approach constructs a weighted estimator for the target individual 0, directly using the point estimators of other individuals based on . The second approach aggregates objective functions of other individuals, where the point estimator is obtained by optimizing an aggregated objective function. Specifically, these two methods can be formulated as
where is the weight assigned to individual when constructing iGroup estimator for individual 0.
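The two aggregation schemes can be illustrated with a minimal sketch. It assumes scalar parameters, weights that have already been computed, and a convex one-dimensional objective. The function names `aggregate_estimators` and `aggregate_objectives`, the ternary search, and the toy numbers are illustrative choices and are not taken from the paper.

```python
def aggregate_estimators(theta_hats, weights):
    """First approach: weighted average of the individual point estimators."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * t for w, t in zip(weights, theta_hats)) / total


def aggregate_objectives(objectives, weights, lo, hi, tol=1e-8):
    """Second approach: minimize the weighted sum of individual objectives.

    Assumes the weighted sum is convex on [lo, hi], so a ternary search is enough.
    """
    def total(theta):
        return sum(w * m(theta) for w, m in zip(weights, objectives))

    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if total(m1) < total(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


# Hypothetical individual estimates and weights relative to the target individual.
theta_hats = [0.9, 1.4, 1.1, 3.0]
weights = [1.0, 0.6, 0.8, 0.05]

# Squared-error objectives centered at each individual estimate (an assumption
# made only so that the two approaches can be compared directly).
objectives = [lambda th, c=c: (th - c) ** 2 for c in theta_hats]

pooled_mean = aggregate_estimators(theta_hats, weights)
pooled_argmin = aggregate_objectives(objectives, weights, lo=-10.0, hi=10.0)
```

With squared-error objectives the two approaches coincide, since the weighted sum of squares is minimized by the weighted mean. They differ once the individual objectives are not quadratic, for example with a likelihood-based objective.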
The weight is crucial for the aggregated estimators as it controls how much information is borrowed from other individuals. We propose to incorporate both individual level estimator and exogenous observation into the weight function as both can provide useful information of . Specifically, let
The weight is decomposed into two parts. The first part measures the similarity between and , and can be a kernel function
When has a finite support, the weight function has a hard grouping structure — individuals lying far enough from individual 0 are not considered at all. Otherwise, it has a soft grouping structure such that dissimilar individuals are assigned with non-zero but tiny weights.
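The distinction between hard and soft grouping can be seen in a small sketch. It assumes the weight is a kernel applied to the Euclidean distance between exogenous vectors divided by a bandwidth, which is one common form and not necessarily the exact expression in (4). The kernels, the helper `z_weights`, and the data are illustrative.

```python
import math


def epanechnikov(u):
    """Compact-support kernel: exactly zero when |u| >= 1 (hard grouping)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def gaussian(u):
    """Unbounded-support kernel: always positive (soft grouping)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def z_weights(z, z0, bandwidth, kernel):
    """Kernel weight of each individual relative to the target exogenous vector z0."""
    out = []
    for zk in z:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(zk, z0)))
        out.append(kernel(dist / bandwidth))
    return out


# Hypothetical two-dimensional exogenous variables; the last point is far from the target.
z0 = (0.0, 0.0)
z = [(0.1, 0.0), (0.3, 0.4), (2.0, 2.0)]

hard = z_weights(z, z0, bandwidth=1.0, kernel=epanechnikov)
soft = z_weights(z, z0, bandwidth=1.0, kernel=gaussian)
```

Under the compact-support kernel the distant individual receives weight exactly zero and drops out of the group, while under the Gaussian kernel it keeps a small positive weight.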
The second part measures the similarity between ’s. But unlike , using a distance measure such as is not a good practice, since it ignores the error in and and may be biased. Note that when and , the kernel concentrates on a smaller and smaller area adjacent to . In this area, aggregating individuals will not improve the estimation of . A one-dimensional example is shown in Figure 3. Vertical bars mark the locations of . When is away from its target value , a small bandwidth tends to give large weights to individuals in a local region around . Aggregating these individuals in such a local region will not correct the bias .
We propose the following weight function that considers the distribution instead of the point estimator . Specifically, let
Notice that, the posterior distribution of , given , is
If (hence provides useful information about ), then the predictive distribution of , given , is
Thus, the weight function in (5) is the Radon-Nikodym derivative between the predictive distribution and the sampling distribution . As a result, for any measurable function , we have
That is, the weighted expectation of under the sampling distribution equals its expectation under the predictive distribution if . This property brings invariance under different sampling distributions. More importantly, it shows that the weighted averages, such as (1) and (2), estimate the expectations under the predictive distribution. This gives the iGroup estimators promising asymptotic properties, as we will discuss later in Section 3.
The shape (thin or flat) of the weight as a function of does not change with the number of individuals . However, the shape is influenced by the variation (accuracy) of . The larger the variance of is, the flatter the weight function tends to be. If is estimated without any measurement error, the weight is proportional to the indicator function . It reduces to the case in which the individual estimator or the individual objective function is used without grouping.
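The reweighting property can be checked numerically in a toy setting where everything is known in closed form. The sketch below assumes a normal population distribution for the parameter and a normal sampling distribution for the individual estimator, with no exogenous variable. These distributional choices and all variable names are assumptions made for illustration. In the paper the population distribution is unknown and the weight has to be estimated.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


random.seed(0)
tau2, s2 = 1.0, 0.5      # assumed prior variance of theta and sampling variance of theta_hat
K = 100_000              # number of individuals in the simulated population
theta_hat_0 = 1.0        # estimate observed for the target individual

# theta_k ~ N(0, tau2), then theta_hat_k | theta_k ~ N(theta_k, s2)
theta_hats = [
    random.gauss(random.gauss(0.0, math.sqrt(tau2)), math.sqrt(s2))
    for _ in range(K)
]

shrink = tau2 / (tau2 + s2)
post_mean = shrink * theta_hat_0      # posterior mean of theta_0 given theta_hat_0
pred_var = shrink * s2 + s2           # posterior variance plus sampling variance
marg_var = tau2 + s2                  # marginal (sampling) variance of theta_hat_k

# Weight = predictive density given theta_hat_0 divided by the marginal density.
weights = [
    normal_pdf(t, post_mean, pred_var) / normal_pdf(t, 0.0, marg_var)
    for t in theta_hats
]
weighted_avg = sum(w * t for w, t in zip(weights, theta_hats)) / sum(weights)
```

In this setting the weighted average of the individual estimators should land close to `post_mean`, the posterior mean for the target individual, which is the behaviour the density-ratio weight is designed to produce.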
2.3 Evaluating the weight functions
The weight function in (4) can be directly evaluated. Similar to a bandwidth selection problem for kernel smoothing, one can choose the bandwidth for in (4) either by using the plug-in method (Chiu, 1991) or through a cross-validation procedure. The plug-in bandwidth is proportional to (see Section 3). Also, the leave-one-out cross-validation process gives an empirically optimal bandwidth, as discussed in Section 3.6.
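A leave-one-out search over a bandwidth grid might look like the following sketch. It assumes a scalar exogenous variable, a Gaussian kernel, and a squared-error criterion in which each individual's own estimate serves as the held-out response. The paper's procedure in Section 3.6 may differ in these details, and the simulated data and names are hypothetical.

```python
import math
import random


def gaussian(u):
    """Unnormalized Gaussian kernel; the constant cancels in the weighted average."""
    return math.exp(-0.5 * u * u)


def loo_cv_score(z, theta_hats, bandwidth):
    """Mean squared leave-one-out error of the kernel-weighted average at each z_i."""
    n = len(z)
    fallback = sum(theta_hats) / n   # used only if every weight underflows to zero
    total = 0.0
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j != i:
                w = gaussian((z[j] - z[i]) / bandwidth)
                num += w * theta_hats[j]
                den += w
        pred = num / den if den > 0.0 else fallback
        total += (theta_hats[i] - pred) ** 2
    return total / n


# Simulated population: the parameter varies smoothly with z and is estimated with noise.
rng = random.Random(0)
z = [rng.random() for _ in range(200)]
theta_hats = [math.sin(2.0 * math.pi * zk) + rng.gauss(0.0, 0.3) for zk in z]

grid = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
scores = {b: loo_cv_score(z, theta_hats, b) for b in grid}
best_b = min(scores, key=scores.get)
```

Very small bandwidths give noisy predictions and very large ones oversmooth, so the selected value normally falls in the interior of the grid.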
The evaluation of the weight function in (5) is more complicated, since the conditional probability and the integral are unknown as the relationship between and is not explicit. We propose an approximation method to evaluate below.
Denote the estimator of $\theta_k$ and the observed exogenous variable as the tuple $(\hat\theta_k, z_k)$. To calculate the weight in (5), we treat them as samples from the joint distribution of $(\hat\theta, z)$. We use the kernel method to estimate the conditional probability nonparametrically by
where are two kernel functions with , as the corresponding bandwidths. To estimate the integral in (5), we use the interpretation discussed above that it is the conditional distribution given . Hence we need samples from the joint distribution of observed from the same individual with parameter . However, this is infeasible because in our problem setting, no two individuals share the same true parameter , and for each individual only one is observed. To generate samples from such a distribution, we consider a bootstrap method. Denote and as the two bootstrap estimators for , obtained by re-sampling with replacement (not applicable when has few observations). Then is an approximate sample of , guaranteeing that are generated from the same individual . Therefore the integral can be estimated by
where are three kernel functions with as the corresponding bandwidths. The bandwidths can be selected by either minimizing asymptotic mean integrated squared error (AMISE) or a rule-of-thumb bandwidth estimator. This estimation of the integral is an approximation that requires to be sufficiently large.
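The bootstrap idea can be sketched in a simplified setting with no exogenous variable, so that only two kernels are needed instead of three. The sketch assumes each individual contributes several scalar observations, the individual estimator is the sample mean, and a single Gaussian bandwidth is used throughout. The function `igroup_weights` and the two-cluster data are hypothetical and serve only to show the mechanics.

```python
import math
import random


def gauss_kernel(u, b):
    return math.exp(-0.5 * (u / b) ** 2) / (b * math.sqrt(2.0 * math.pi))


def bootstrap_mean(xs, rng):
    """Sample mean of one bootstrap resample (drawn with replacement)."""
    return sum(rng.choice(xs) for _ in xs) / len(xs)


def igroup_weights(data, target, b, rng):
    """Approximate density-ratio weights of all individuals relative to `target`."""
    theta_hat = [sum(xs) / len(xs) for xs in data]
    # Two bootstrap estimates per individual stand in for two estimates that
    # share the same underlying parameter.
    pairs = [(bootstrap_mean(xs, rng), bootstrap_mean(xs, rng)) for xs in data]
    t0 = theta_hat[target]
    weights = []
    for tk in theta_hat:
        # Kernel estimate of the joint density of a same-individual pair at (tk, t0).
        joint = sum(
            gauss_kernel(p1 - tk, b) * gauss_kernel(p2 - t0, b) for p1, p2 in pairs
        ) / len(pairs)
        # Kernel estimate of the marginal density of the estimator at tk.
        marginal = sum(gauss_kernel(t - tk, b) for t in theta_hat) / len(theta_hat)
        weights.append(joint / marginal)
    return theta_hat, weights


# Two well-separated groups of true parameters; individual 0 belongs to the first.
rng = random.Random(1)
true_theta = [0.0] * 100 + [5.0] * 100
data = [[rng.gauss(t, 1.0) for _ in range(10)] for t in true_theta]

theta_hat, weights = igroup_weights(data, target=0, b=0.3, rng=rng)
pooled = sum(w * t for w, t in zip(weights, theta_hat)) / sum(weights)
```

Individuals whose estimates are far from the target's receive weights that are essentially zero, so the pooled estimate is driven by the first group. The marginal density at the target is a common factor across individuals and is omitted because it cancels after normalization.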
3 Theoretical Results
In this section, we consider several model settings for which we apply the proposed iGroup method and discuss their corresponding theoretical properties, especially in terms of their asymptotic performance. In particular, we first define a target estimator that minimizes the Bayes risk, and then investigate the asymptotic performance of iGroup estimators in (1) and (2) in approximating the target estimator . We also quantify the bias and variance of iGroup estimators as well as the target estimator in terms of estimating . Throughout this paper, we consider the asymptotic framework that the number of individuals goes to infinity, while the number of observations for each individual is fixed and finite.
3.1 Risk decomposition and the target estimator
We are interested in making inference about individual 0, with given data information that may include the observations and plus information from other relevant individuals. Let be a point estimator for , which is constructed with information sets and . The iGroup estimator in (1) is such an estimator. Similarly, and are point estimators constructed solely based on either or . Under squared loss, the overall risk of in estimating can be decomposed into two nonnegative parts: the expected squared error of in estimating the corresponding posterior mean and the overall risk of the posterior mean itself, as shown in Proposition 1.
Suppose has a prior distribution . Under squared loss, we have the following overall risk decomposition.
where , and are the posterior means under prior and observations , and correspondingly.
The proof is given in Appendix.
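The decomposition rests on a standard argument, sketched here under assumed notation that is not taken from the paper's displayed equations: $\delta$ denotes any estimator, $D$ the data it is built from, and $m = E[\theta_0 \mid D]$ the corresponding posterior mean. The sketch assumes $\delta$ is a function of $D$ only.

```latex
\begin{align*}
E\big[(\delta - \theta_0)^2\big]
  &= E\big[(\delta - m)^2\big] + E\big[(m - \theta_0)^2\big]
     + 2\,E\big[(\delta - m)(m - \theta_0)\big], \\
E\big[(\delta - m)(m - \theta_0)\big]
  &= E\Big[(\delta - m)\,E\big[\,m - \theta_0 \mid D\,\big]\Big]
   = E\big[(\delta - m)\cdot 0\big] = 0 .
\end{align*}
```

The cross term vanishes by the tower property, leaving two nonnegative terms of which only the first depends on $\delta$. The risk is therefore smallest when $\delta$ equals the posterior mean $m$.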
Proposition 1 reveals that the overall risk is minimized by setting to the corresponding posterior mean under the prior , which is the population-level (unknown) distribution for . Throughout this paper, we call the estimator that minimizes the overall risk the target estimator. More specifically, under squared loss and different information sets, we denote the target estimators with
Here, refers to the squared loss. For the ease of presentation, we also use a simple notation to represent one of the Bayes estimators in (6) when its meaning is apparent.
Similarly, for a general loss function , we define the target estimator as the Bayes estimator that minimizes the expected loss, given the available observation on individual 0 and the prior such that
Suppose has a prior distribution and is a loss function, which is second-order partially differentiable with respect to such that and . Then for estimator constructed based on information set , or , we have
where is the corresponding Bayes estimator based on the same information set as .
The proof is given in Appendix.
The target estimator as a function of and is not directly available, because neither the population distribution nor the likelihood function is explicitly known or assumed. The iGroup estimator in (1) constructed based on observed finite sample is desired to approach the target estimator when more and more similar individuals contribute to the estimator . See Diaconis and Freedman (1986) for discussions of target point estimators and target parameters in Bayesian literature.
3.2 Case 1: With exogenous variable only
In the cases when the individual level estimator is not reliable to construct the individual groups, iGroup may be constructed with the exogenous variable only. In this case, the corresponding target estimator is defined as:
Recall that the relationship between and is given by a deterministic relationship
where is an unknown continuous function. Furthermore, is a noisy observation of . Since is a conceptual parameter, we may simply assume that
where the error satisfies , with .
is an unbiased estimator of. Then, the combined estimator
has all the properties of a conventional kernel smoothing estimator if is a standard kernel function. The boundary and asymptotic conditions/assumptions on the weight function and the bandwidth are summarized in Assumption 3.
Assumption 3 (Boundary and asymptotic conditions).
The kernel function satisfies
And, in addition, when , satisfies
Theorem 1 follows immediately from consistency theorem on a standard multivariate kernel smoothing estimator (Wasserman, 2010). When the number of individuals goes to infinity, the bias of with bandwidth is of order and the variance is of order , where is the dimension of as defined in Assumption 1. In such case, the asymptotic optimal choice of bandwidth that minimizes the mean squared error, , is of order , same as a -dimensional kernel smoothing problem.
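To see where a rate of this kind comes from, here is the usual bias-variance calculation for a second-order kernel, written with assumed symbols: $K$ individuals, bandwidth $b$, dimension $d$, and unspecified positive constants $C_1$ and $C_2$. It is the textbook kernel-smoothing argument and not a derivation taken from the paper.

```latex
\begin{align*}
\mathrm{MSE}(b) &\approx \underbrace{C_1\, b^{4}}_{\text{squared bias}}
                  + \underbrace{\frac{C_2}{K\, b^{d}}}_{\text{variance}}, \\
\frac{\partial\,\mathrm{MSE}}{\partial b}
  &= 4\, C_1\, b^{3} - \frac{d\, C_2}{K\, b^{d+1}} = 0
  \;\Longrightarrow\;
  b^{*} = \left(\frac{d\, C_2}{4\, C_1\, K}\right)^{1/(d+4)} \propto K^{-1/(d+4)}, \\
\mathrm{MSE}(b^{*}) &\propto K^{-4/(d+4)} .
\end{align*}
```

The rate deteriorates as $d$ grows, which is the familiar curse of dimensionality for kernel smoothing.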
Another way of combining individuals is aggregating the objective functions as shown in (2). A combined estimator with respect to kernel is defined by
The estimator is consistent and has a similar asymptotic performance to a -dimensional kernel smoothing estimator as stated in Theorem 2. This approach is useful especially when is not available, such as in the cases that the number of observations for each individual is less than the number of parameters.
Suppose the conditions in Assumption 3 hold and in addition,
is convex and second order partial differentiable with respect to ,
for any given , as a function of is continuous,
has a unique minimum at
The optimal choice of bandwidth is and the optimized mean squared error is
The proof is given in Appendix.
The above theorems suggest that the individualized combined estimator by aggregating either individual estimators or objective functions would result in an improvement in mean squared error and it shares a similar asymptotic performance as a -dimensional kernel smoothing estimator.
When , . Hence, estimating becomes estimating the unknown function evaluated at . When , and are in general different. Let and be the bias and variance of the target estimator in estimating such that
The above bias and variance are defined with respect to a fixed with random .
The asymptotic bias and variance of in estimating a fixed are given by
where the intrinsic bias and the intrinsic variance are defined in (11).
The proof is given in Appendix. In the conditional probabilities, , as a function of , is considered random under a given .
The bias and variance of in terms of estimating a fixed can therefore be decomposed into two parts. The first part (the intrinsic part) comes from the bias and variance of estimating itself to and the second part comes from estimating nonparametrically. Since is observed with error, this is similar to the errors-in-variables problem, where certain intrinsic bias cannot be avoided (Fuller, 2009; Carroll et al., 1995; Wansbeek and Meijer, 2000; Bound et al., 2001). Such intrinsic bias and variance are asymptotically linear in , which is the noise level of , as shown in Theorem 4. In particular, when is exactly zero, all intrinsic terms vanish, and it reduces to the exact case when .
Suppose is second-order differentiable and the distribution of has finite higher moments. Then, for a fixed , when ,
The proof is given in Appendix.
Research in nonparametric regression with errors in variables shows a slower convergence rate to recover the function at any given (Stefanski and Carroll, 1990; Fan and Truong, 1993). Our problem is different. We focus on providing a point estimator of without knowing , but only its noisy version . Even if we knew the function precisely, is not known as we do not observe . When considering an individual with fixed but unobserved , it is difficult to choose an optimal bandwidth by bias-variance optimization with the non-zero intrinsic terms in Theorem 3, because in this case the asymptotic mean squared error may not have a local minimum. However, if we assume the target individual 0 is randomly chosen from the population, the target estimator is the estimator that minimizes the overall risk under squared loss, i.e. a Bayes estimator, because it minimizes the squared loss pointwise for any . Furthermore, immediately from Theorem 1, is a consistent estimator for . The overall performance of for all individuals of the population could be optimized by choosing a proper bandwidth as stated in the following Theorem 5. It provides a way to optimize the bandwidth globally.
is the risk of the Bayes estimator , and all above expectations are taken over all random variables assuming an empirical population distribution for . The optimal choice of the bandwidth is with the corresponding overall risk .
The proof is given in Appendix.
The magnitude of the measurement error of , measured by , compared to that of the individual estimation error is crucial for the performance of the iGroup method. The bias and variance of iGroup estimator increase when increases (see Theorem 4). And the asymptotic Bayes risk also depends on . When iGroup is based on unreliable , it could result in a worse estimator compared to the one without any grouping. This phenomenon will be demonstrated in Section 4.
Remark: Results in Theorems 3, 4 and 5 can be generalized to the iGroup estimator , which combines the objective functions, except that the target estimator is replaced by . As shown in (19) in the Appendix, is asymptotically a kernel smoothing estimator with the same bias and variance rates.
3.3 Case 2: Without exogenous variables
In this case, we assume the exogenous variable is not available. Our target estimator is under squared loss and is under a general loss function . The iGroup estimation depends solely on . The weight function (5) used in (1) and (2) now reduces to
where corresponds to the unknown distribution of in the whole population. As discussed in Section 2.3, an estimate of this weight function can be obtained by kernel density estimation on the bootstrapped samples.
The weight function (12) is used to aggregate individual unbiased estimators to the posterior mean, and to aggregate objective functions to the corresponding Bayes estimator under a certain loss function, as shown in Theorems 6 and 7.
Suppose is defined as in Equation (12) and is a sufficient and unbiased estimator of for all , then as :
Furthermore, if for any fixed and , then
The proof is given in Appendix.
For the aggregated estimator (2), suppose the objective function used satisfies
where is non-negative and for all , and is constant with respect to . Then is the loss function corresponding to , under which the target estimator is
For example, if the objective function is the negative log-likelihood function