Spatial point processes intensity estimation with a diverging number of covariates

Achmad Choiruddin et al. (Université Grenoble Alpes, UQAM, Aalborg University), 12/27/2017

Feature selection procedures for parametric intensity estimation of spatial point processes have recently been developed, as more and more applications involve a large number of covariates. In this paper, we investigate the setting where the number of covariates diverges as the observation domain increases. In particular, we consider estimating equations based on Campbell theorems, derived from Poisson and logistic regression likelihoods and regularized by a general penalty function. We prove that, under some conditions, consistency, sparsity, and asymptotic normality remain valid in this setting. We support the theoretical results with numerical results obtained from simulation experiments and an application to forestry datasets.


1 Introduction

1.1 Background

Spatial point pattern data arise in many contexts, e.g. ecology (see e.g. Møller and Waagepetersen, 2004; Renner et al., 2015), epidemiology (e.g. Diggle, 1990, 2013), criminology (e.g. Baddeley et al., 2015; Shirota et al., 2017), biology (e.g. Illian et al., 2008) and astronomy (e.g. Baddeley et al., 2015), where interest lies in describing the distribution of events in space. Stochastic models generating spatial point patterns are called spatial point processes (see e.g. Møller and Waagepetersen, 2004; Illian et al., 2008; Diggle, 2013; Baddeley et al., 2015).

Usually, the first step in analyzing spatial point pattern data is to investigate the intensity. The intensity serves as the first-order characteristic of a spatial point process and is often the main interest in many studies, especially when it is suspected to depend on spatial covariates. Examples include the study of the spatial variation of disease risk related to pollution sources (e.g. Diggle, 1990, 2013), crime rate analysis in a city related to demographic information (e.g. Shirota et al., 2017), and modeling of the spatial distribution of tree species in a forest related to environmental factors (e.g. Waagepetersen, 2007; Thurman et al., 2015; Renner et al., 2015).

We focus in this study on the log-linear model for the intensity function of an inhomogeneous spatial point process, defined by

$$\rho(u;\beta) = \exp\{\beta^\top \mathbf{z}(u)\}, \quad u \in S, \tag{1.1}$$

where $\mathbf{z}(u) = \{z_1(u), \dots, z_p(u)\}^\top$ are the spatial covariates measured at location $u$, $S$ represents the state space of the spatial point process (usually $S = \mathbb{R}^2$) and $\beta$ is a real $p$-dimensional parameter. Hence, our main concern is to assess the magnitudes of the vector $\beta$. For parametric estimation, while maximum likelihood estimation (e.g. Berman and Turner, 1992; Rathbun and Cressie, 1994) has been widely implemented for Poisson point process models, estimating equation-based methods (e.g. Waagepetersen, 2007, 2008; Guan and Shen, 2010; Baddeley et al., 2014) are simpler to implement for more general spatial point process models, and overcome the possible drawback of MCMC methods, which are usually computationally expensive (Møller and Waagepetersen, 2004). However, when the number of covariates is relatively large, maximum likelihood estimation and estimating equation-based methods become undesirable: all covariates are selected, yielding increasing standard errors for the parameter estimates.

1.2 Feature selection techniques

To select significant covariates, one may consider a traditional procedure such as a stepwise method. This technique starts with an initial set of covariates, then considers adding or deleting a covariate from the current set at each iteration using a criterion such as an F-statistic or the AIC. However, such a procedure has a number of limitations: it can be numerically unstable and exhibits high variance due to its discrete nature (e.g. Breiman, 1996; Fan and Li, 2001; Friedman et al., 2008), and it becomes computationally infeasible when the number of covariates is large (e.g. Breiman, 1996; Zou, 2006).

To overcome these drawbacks, regularization techniques have recently been developed for spatial point process intensity estimation. Such methods are able to perform variable selection while keeping interesting properties in terms of prediction. For Poisson point process models, the idea is to penalize the Poisson likelihood by a penalty function such as the $\ell_1$ penalty (see Renner and Warton, 2013; Thurman and Zhu, 2014). For more general point process models, instead of employing the likelihood of the process, which often requires computationally intensive MCMC methods (Møller and Waagepetersen, 2004), penalized versions of estimating equations based on the Campbell theorem, derived both from Poisson and logistic regression likelihoods, have been developed (see Thurman et al., 2015; Choiruddin et al., 2017). Furthermore, Thurman et al. (2015) and Choiruddin et al. (2017) show that, under some conditions, the estimates obtained from such procedures are consistent, sparse, and asymptotically normal.

1.3 Issues in high-dimensional data

The motivation for our paper comes from a study of biodiversity in a 50-hectare region (1,000 m × 500 m) of the tropical moist forest of Barro Colorado Island (BCI) in central Panama, where censuses have been carried out such that all free-standing woody stems at least 10 mm diameter at breast height were identified, tagged, and mapped, resulting in maps of over 350,000 individual trees of around 300 species (see Condit, 1998; Hubbell et al., 1999, 2005). In the same region, many environmental covariates such as topographic attributes and soil properties have also been collected. In particular, we are interested in studying the spatial distribution of the 3,604 locations of Beilschmiedia pendula Lauraceae (BPL) trees and in modeling their intensity as a parametric function of 93 covariates consisting of 2 topographic attributes, 13 soil properties and 78 interactions between two soil nutrients.

Although the number of covariates does not seem very large with respect to the number of data points, two hours are required to estimate the parameters and select covariates using a standard stepwise procedure. To do this, we use the step function in R, which intrinsically assumes that the process is a Poisson point process since we use the AIC in the stepwise procedure. For a general point process, other criteria could be investigated, such as ones based on test statistics, but they require estimating the asymptotic covariance matrix of the estimates at each step: even if we know the right covariates, this is known to be a difficult task (see e.g. Coeurjolly and Guan (2014)), especially when the number of parameters is large, and would easily triple the time of the estimation/selection procedure. To evaluate the performance of such a selection/estimation procedure, a simulation study would be required, which is unrealistic (1000 replications of a single model would take 250 days). This motivates us to consider regularization methods.
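To fix ideas, the stepwise baseline just described can be sketched in spatstat as follows; here X (the point pattern, a ppp object) and covlist (a named list of covariate pixel images) are hypothetical placeholders, not objects defined in the paper:

```r
library(spatstat)

# Hypothetical inputs: X is a ppp object, covlist a named list of im objects.
# Fit the full Poisson model with ppm (Berman-Turner device), then run
# AIC-based stepwise selection with step(), as described in the text.
fml <- as.formula(paste("X ~", paste(names(covlist), collapse = " + ")))
fit.full <- ppm(fml, data = covlist)
fit.step <- step(fit.full, trace = 0)
summary(fit.step)
```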

Thurman et al. (2015) and Choiruddin et al. (2017) are the first two theoretical works on this topic. Both have the important limitation that the number of covariates is fixed. We extend this in the present paper. Asymptotic properties in settings with a diverging number of parameters have a long history for M-estimators (e.g. Huber, 1973; Portnoy, 1984) and have recently been investigated for penalized regression estimators by Fan and Peng (2004) and Zou and Zhang (2009). In particular, as argued by Fan and Peng (2004), even though the asymptotic properties (i.e., consistency, sparsity, and asymptotic normality) established by Fan and Li (2001) for penalized generalized linear models under the assumption that the number of parameters is fixed are encouraging, there are many naive and simple model selection procedures which possess those properties. Establishing the validity of these asymptotic properties in a setting with a diverging number of parameters is therefore of major importance. We study this type of asymptotic properties in the spatial point process framework. Hence, our work can be regarded as an extension of the study conducted by Choiruddin et al. (2017).

A standard asymptotic framework for spatial point processes is the increasing-domain one. Therefore, we investigate the problem where the number of covariates $p$ grows with the volume $|D|$ of the observation domain. In our setting, $|D|$ plays the same role as $n$, the number of observations, in standard problems such as linear models or generalized linear models. We obtain consistency, sparsity, and asymptotic normality for our estimator. One of our main assumptions is that $p^4/|D| \to 0$ as $|D| \to \infty$, which is similar to the one required by Fan and Peng (2004) when $|D|$ is simply replaced by $n$ (the sample size in their context).

Our results are general: (1) a large choice of penalty functions (either convex or non-convex) and methods (e.g. ridge, lasso, elastic net, SCAD, and MC+) is available; (2) we include a large class of mixing spatial point processes. The implementation combines the spatstat R package (Baddeley et al., 2015) with the two R packages implementing penalized methods for generalized linear models: glmnet (Friedman et al., 2010) and ncvreg (Breheny and Huang, 2011).

1.4 Outline of the paper

In Section 2, we introduce brief background on spatial point processes as well as regularization methods for spatial point processes intensity estimation. Section 3 presents our asymptotic results. We investigate in Section 4 the finite sample performance of our estimates in a simulation study and in an application to tropical forestry datasets. Conclusion and discussion are presented in Section 5. Proofs of the main results are postponed to Appendices A-C.

2 Regularization methods for spatial point processes

This section gives a brief introduction to spatial point processes and reviews the regularization methods for spatial point process intensity estimation previously studied by Choiruddin et al. (2017) when the number of parameters is fixed.

Let $X$ be a spatial point process on $\mathbb{R}^d$. We view $X$ as a locally finite random subset of $\mathbb{R}^d$. Let $D \subset \mathbb{R}^d$ be a compact set of Lebesgue measure $|D|$ which will play the role of the observation domain. A realization of $X$ in $D$ is thus a set $\mathbf{x} = \{x_1, \dots, x_m\}$, where $x_i \in D$ and $m$ is the observed number of points in $D$. Suppose $X$ has intensity function $\rho$ and second-order product density $\rho^{(2)}$. The Campbell theorem (see e.g. Møller and Waagepetersen, 2004) states that, for any non-negative functions $h_1: \mathbb{R}^d \to [0,\infty)$ and $h_2: \mathbb{R}^d \times \mathbb{R}^d \to [0,\infty)$,

$$\mathrm{E} \sum_{u \in X} h_1(u) = \int_{\mathbb{R}^d} h_1(u)\, \rho(u)\, \mathrm{d}u, \tag{2.2}$$
$$\mathrm{E} \sum_{u,v \in X}^{\neq} h_2(u,v) = \iint h_2(u,v)\, \rho^{(2)}(u,v)\, \mathrm{d}u\, \mathrm{d}v. \tag{2.3}$$

We may interpret $\rho(u)\,\mathrm{d}u$ as the probability of occurrence of a point of $X$ in an infinitesimally small ball with centre $u$ and volume $\mathrm{d}u$. Intuitively, $\rho^{(2)}(u,v)\,\mathrm{d}u\,\mathrm{d}v$ is the probability of observing a pair of distinct points from $X$ occurring jointly in each of two infinitesimally small balls with centres $u$, $v$ and volumes $\mathrm{d}u$, $\mathrm{d}v$. For further background material on spatial point processes, see for example Møller and Waagepetersen (2004); Illian et al. (2008).

In our study, we assume that the intensity function depends on a parameter $\beta$, i.e. $\rho(u) = \rho(u;\beta)$. The standard parametric methods for estimating $\beta$ maximize the weighted Poisson likelihood (e.g. Guan and Shen, 2010) or the weighted logistic regression likelihood (e.g. Baddeley et al., 2014; Choiruddin et al., 2017), given respectively by

$$\ell_{\mathrm{PL}}(w;\beta) = \sum_{u \in X \cap D} w(u) \log \rho(u;\beta) - \int_{D} w(u)\, \rho(u;\beta)\, \mathrm{d}u, \tag{2.4}$$
$$\ell_{\mathrm{LRL}}(w;\beta) = \sum_{u \in X \cap D} w(u) \log \frac{\rho(u;\beta)}{\delta(u) + \rho(u;\beta)} - \int_{D} w(u)\, \delta(u) \log \frac{\delta(u) + \rho(u;\beta)}{\delta(u)}\, \mathrm{d}u, \tag{2.5}$$

where $w$ is a non-negative weight function depending on the first- and second-order characteristics of $X$ and $\delta$ is a non-negative real-valued function. The maximizer of (2.4) (resp. (2.5)) is called the Poisson estimator (resp. the logistic regression estimator). We refer readers to Guan and Shen (2010) for further details on the weight function $w$ and to Baddeley et al. (2014) for the role of the function $\delta$.

These standard methods cannot perform variable selection. To do so, Thurman et al. (2015) and Choiruddin et al. (2017) suggest maximizing a penalized version of (2.4)–(2.5),

$$Q(w;\beta) = \ell(w;\beta) - |D| \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \tag{2.6}$$

where $\ell(w;\beta)$ is either the Poisson likelihood (2.4) or the logistic regression likelihood (2.5). We refer to the second term of (2.6) as the penalization term. It has mainly two parts: (1) a penalty function $p_{\lambda_j}$ parameterized by $\lambda_j \ge 0$, and (2) the volume $|D|$ of the observation domain, which plays the same role as the sample size in the spatial point process framework.

For any non-negative $\lambda$, we say that $p_\lambda$ is a penalty function if $p_\lambda$ is a non-negative function with $p_\lambda(0) = 0$. Some examples, described in Table 1, include the $\ell_2$ penalty (Hoerl and Kennard, 1988), the $\ell_1$ penalty (Tibshirani, 1996), the elastic net (Zou and Hastie, 2005), SCAD (Fan and Li, 2001), and MC+ (Zhang, 2010). Note that, as indicated by (2.6), we allow each direction to have a different tuning parameter $\lambda_j$. Such a method is called an adaptive method (e.g. adaptive lasso (Zou, 2006) and adaptive elastic net (Zou and Zhang, 2009)). For further background on penalty functions and regularization methods, see, for example, Friedman et al. (2008).

Penalty $p_\lambda(\theta)$, for $\theta \ge 0$:

$\ell_2$ penalty: $p_\lambda(\theta) = \tfrac12 \lambda \theta^2$
$\ell_1$ penalty: $p_\lambda(\theta) = \lambda \theta$
Enet: $p_\lambda(\theta) = \lambda \{ \tfrac12 (1-\gamma)\theta^2 + \gamma\theta \}$, for any $0 \le \gamma \le 1$
SCAD: $p_\lambda(\theta) = \lambda\theta$ if $\theta \le \lambda$; $\dfrac{2\gamma\lambda\theta - \theta^2 - \lambda^2}{2(\gamma - 1)}$ if $\lambda < \theta \le \gamma\lambda$; $\dfrac{\lambda^2(\gamma+1)}{2}$ if $\theta > \gamma\lambda$, for any $\gamma > 2$
MC+: $p_\lambda(\theta) = \big(\lambda\theta - \tfrac{\theta^2}{2\gamma}\big)\,\mathbf{1}\{\theta \le \gamma\lambda\} + \tfrac12 \gamma\lambda^2\,\mathbf{1}\{\theta > \gamma\lambda\}$, for any $\gamma > 1$
Table 1: Examples of penalty functions.
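For concreteness, the penalties of Table 1 can be transcribed directly in R (a minimal sketch under the standard parameterizations above; gamma denotes the extra tuning parameter of Enet, SCAD and MC+):

```r
# Penalty functions of Table 1, evaluated at theta >= 0.
ridge <- function(theta, lambda) 0.5 * lambda * theta^2
lasso <- function(theta, lambda) lambda * theta
enet  <- function(theta, lambda, gamma) {  # 0 <= gamma <= 1
  lambda * (0.5 * (1 - gamma) * theta^2 + gamma * theta)
}
scad  <- function(theta, lambda, gamma) {  # gamma > 2
  ifelse(theta <= lambda, lambda * theta,
  ifelse(theta <= gamma * lambda,
         (2 * gamma * lambda * theta - theta^2 - lambda^2) / (2 * (gamma - 1)),
         lambda^2 * (gamma + 1) / 2))
}
mcplus <- function(theta, lambda, gamma) { # gamma > 1
  ifelse(theta <= gamma * lambda,
         lambda * theta - theta^2 / (2 * gamma),
         0.5 * gamma * lambda^2)
}
```

One can check, for instance, that scad() is continuous at the knots theta = lambda and theta = gamma * lambda, which is what makes the SCAD penalty continuously differentiable.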

3 Asymptotic properties

In this section, we present asymptotic properties of the regularized Poisson estimator when both $|D_n| \to \infty$ and $p_n \to \infty$ as $n \to \infty$. In particular, we consider $X$ as a $d$-dimensional point process observed over a sequence of observation domains $D_n$ which expands to $\mathbb{R}^d$ as $n \to \infty$. We assume that $X$ has a log-linear intensity of the form (1.1), for which the dimension of the parameter $\beta$, now denoted by $p_n$, diverges to $\infty$ as $n \to \infty$. In Section 3.1, we provide notation and conditions, and discuss the differences from the setting where $p$ is fixed. Our main results are presented in Section 3.2. For the sake of conciseness, we do not present the asymptotic results for the regularized logistic regression estimator; the results are very similar, the main difference lying in conditions (6) and (7), for which the matrices $A_n$ and $B_n$ have a different expression (see Remark 2).

3.1 Notation and conditions

Throughout this section and Appendices A-C, let

$$\ell(w;\beta) = \sum_{u \in X \cap D_n} w(u) \log \rho(u;\beta) - \int_{D_n} w(u)\, \rho(u;\beta)\, \mathrm{d}u \tag{3.7}$$

and

$$Q(w;\beta) = \ell(w;\beta) - |D_n| \sum_{j=1}^{p_n} p_{\lambda_{n,j}}(|\beta_j|) \tag{3.8}$$

be respectively the weighted Poisson likelihood and its penalized version.

Let $\beta_0 = (\beta_{01}^\top, \beta_{02}^\top)^\top$ denote the $p_n$-dimensional vector to estimate, where $\beta_{01}$ is the $s$-dimensional vector of non-zero coefficients and $\beta_{02}$ is the $(p_n - s)$-dimensional vector of zero coefficients. We assume that the number of non-zero coefficients, $s$, does not depend on $n$. Let $\mathbf{z}^{(1)}(u)$ and $\mathbf{z}^{(2)}(u)$ denote the corresponding $s$-dimensional and $(p_n - s)$-dimensional vectors of spatial covariates. We denote the regularized Poisson estimator by $\hat\beta = (\hat\beta_1^\top, \hat\beta_2^\top)^\top$.

We recall the classical definition of strong mixing coefficients adapted to spatial point processes (e.g. Politis et al., 1998): for $k, l \in \mathbb{N} \cup \{\infty\}$ and $m \ge 1$, define

$$\alpha_{k,l}(m) = \sup\big\{ |P(A_1 \cap A_2) - P(A_1)P(A_2)| : A_i \in \mathcal{F}(\Lambda_i),\ \Lambda_i \in \mathcal{B}(\mathbb{R}^d),\ |\Lambda_1| \le k,\ |\Lambda_2| \le l,\ d(\Lambda_1, \Lambda_2) \ge m \big\}, \tag{3.9}$$

where $\mathcal{F}(\Lambda)$ is the $\sigma$-algebra generated by $X \cap \Lambda$, $d(\Lambda_1, \Lambda_2)$ is the minimal distance between the sets $\Lambda_1$ and $\Lambda_2$, and $\mathcal{B}(\mathbb{R}^d)$ denotes the class of Borel sets in $\mathbb{R}^d$.

We define the matrices $A_n(\beta)$ and $B_n(\beta)$ by

$$A_n(\beta) = \int_{D_n} w(u)\, \mathbf{z}(u)\mathbf{z}(u)^\top \rho(u;\beta)\, \mathrm{d}u,$$
$$B_n(\beta) = A_n(\beta) + \int_{D_n}\!\int_{D_n} w(u) w(v)\, \mathbf{z}(u)\mathbf{z}(v)^\top \rho(u;\beta)\rho(v;\beta)\,\{g(u,v) - 1\}\, \mathrm{d}u\, \mathrm{d}v,$$

where $g$ is the classical pair correlation function (Møller and Waagepetersen, 2004), given by

$$g(u,v) = \frac{\rho^{(2)}(u,v)}{\rho(u)\rho(v)}$$

when both $\rho(u)$ and $\rho(v)$ exist, with the convention $0/0 = 0$. For a Poisson point process, $g(u,v) = 1$ since $\rho^{(2)}(u,v) = \rho(u)\rho(v)$. If, for example, $g(u,v) > 1$ (resp. $g(u,v) < 1$), this indicates that pairs of points are more likely (resp. less likely) to occur at locations $u, v$ than for a Poisson point process.
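In practice, $g$ can be estimated nonparametrically; as a quick sketch with spatstat's kernel estimator, applied to the bei data used later in the simulations:

```r
library(spatstat)

# Kernel estimate of the pair correlation function for the bei pattern;
# values above 1 at small distances indicate clustering.
g.hat <- pcf(bei, correction = "translate")
plot(g.hat)
```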

We denote the top-left $s \times s$ corner of $A_n(\beta_0)$ (resp. $B_n(\beta_0)$) by $A_{n,11}(\beta_0)$ (resp. $B_{n,11}(\beta_0)$). It is worth noticing that $\rho(\cdot;\beta_0)$, $A_{n,11}(\beta_0)$ and $B_{n,11}(\beta_0)$ depend on $\beta_0$ only through $\beta_{01}$ and not through $\beta_{02}$. In what follows, for a square symmetric matrix $M$, $\nu_{\min}(M)$ and $\nu_{\max}(M)$ denote respectively the smallest and largest eigenvalues of $M$.

Under conditions (8)-(9), we define the sequences $a_n$, $b_n$ and $c_n$ by

$$a_n = \max_{1 \le j \le s} \big| p'_{\lambda_{n,j}}(|\beta_{0j}|) \big|, \tag{3.10}$$
$$b_n = \inf_{s < j \le p_n}\ \inf_{0 < \theta \le K\sqrt{p_n/|D_n|}} p'_{\lambda_{n,j}}(\theta), \tag{3.11}$$
$$c_n = \max_{1 \le j \le s} \big| p''_{\lambda_{n,j}}(|\beta_{0j}|) \big|, \tag{3.12}$$

where $K$ is any positive constant.

Consider the following conditions (1)-(9), which are required to derive our asymptotic results:

  1. For every $n \ge 1$, $D_n = nD$, where $D \subset \mathbb{R}^d$ is convex, compact, and contains the origin in its interior.

  2. The intensity function has the log-linear specification given by (1.1), where $\beta_0 \in \Theta$ and $\Theta$ is an open convex bounded set of $\mathbb{R}^{p_n}$. Furthermore, we assume that there exists a neighborhood $\mathcal{V}$ of $\beta_0$ such that $\sup_n \sup_{u \in D_n} \sup_{\beta \in \mathcal{V}} \rho(u;\beta) < \infty$.

  3. The covariates $\mathbf{z}$ and the weight function $w$ satisfy $\sup_n \sup_{u \in D_n} \|\mathbf{z}(u)\| < \infty$ and $\sup_n \sup_{u \in D_n} |w(u)| < \infty$.

  4. There exists an integer $t \ge 2$ such that, for $k = 2, \dots, t$, the product density $\rho^{(k)}$ exists and satisfies $\sup_{u_1, \dots, u_k} \rho^{(k)}(u_1, \dots, u_k) < \infty$.

  5. For the strong mixing coefficients (3.9), we assume that there exists some $\varepsilon > 0$ such that $\alpha_{2,\infty}(m) = O(m^{-d-\varepsilon})$.

  6. $\liminf_{n \to \infty} \nu_{\min}\{ A_n(\beta_0)/|D_n| \} > 0$.

  7. $\liminf_{n \to \infty} \nu_{\min}\{ B_n(\beta_0)/|D_n| \} > 0$.

  8. The penalty functions $p_{\lambda_{n,j}}$ are non-negative on $[0,\infty)$ and continuously differentiable on $(0,\infty)$, with derivatives assumed to be Lipschitz functions on $(0,\infty)$. Furthermore, we assume that there exists a sequence $r_n > 0$, where $r_n \sqrt{|D_n|/p_n} \to \infty$ as $n \to \infty$, such that, for $n$ sufficiently large, $p_{\lambda_{n,j}}$, $j = 1, \dots, s$, is thrice continuously differentiable in the ball centered at $|\beta_{0j}|$ with radius $r_n$, and we assume that the third derivatives are uniformly bounded.

  9. $p_n^4/|D_n| \to 0$ as $n \to \infty$.

Conditions (1)-(8) are quite similar to the ones required by Choiruddin et al. (2017) in the setting where the number of parameters to estimate is fixed. Condition (2) is slightly stronger, since we have to ensure that the intensity is finite for $\beta$ in a neighborhood of $\beta_0$; note that its finiteness at $\beta_0$ itself follows directly from condition (3). We derive asymptotic properties when both $|D_n|$ and $p_n$ tend to infinity with $n$. However, to obtain an estimator which is consistent and possesses the two other properties, sparsity and asymptotic normality, the number of covariates must not grow too fast with respect to the volume of the observation domain. This is stated by condition (9), which is similar to the one required by Fan and Peng (2004) when $|D_n|$ is simply replaced by $n$ (the sample size in their context).

3.2 Main results

We state our main results here. Proofs are relegated to Appendices A-C.

We first show in Theorem 1 that the regularized Poisson estimator converges in probability, and we exhibit its rate of convergence.

Theorem 1.

Assume that conditions (1)-(5) and (7)-(9) hold, and let $a_n$ and $c_n$ be given respectively by (3.10) and (3.12). If $a_n = O(|D_n|^{-1/2})$ and $c_n \to 0$, then there exists a local maximizer $\hat\beta$ of $Q(w;\beta)$ such that $\|\hat\beta - \beta_0\| = O_P\big\{ \sqrt{p_n}\, (|D_n|^{-1/2} + a_n) \big\}$.

This implies that the regularized Poisson estimator is root-$(|D_n|/p_n)$ consistent. Note that, as expected, the convergence rate is $\sqrt{p_n}$ times the convergence rate of the estimator obtained when $p$ is fixed (see Theorem 1 of Choiruddin et al., 2017). In addition, when we compare our results with the ones obtained by Fan and Peng (2004), who also considered a diverging number of parameters, our estimator has the same rate of convergence when we replace $n$ by $|D_n|$ in their context. This rate of convergence also appears in other contexts with a diverging number of parameters (see e.g. Lam and Fan, 2008; Zou and Zhang, 2009; Li et al., 2011; Cho and Qu, 2013; Wang and Zhu, 2017).

Now, we demonstrate in Theorem 2 that such a root-$(|D_n|/p_n)$ consistent estimator ensures the sparsity of $\hat\beta$, that is, $\hat\beta_2$ is correctly set to zero with probability tending to 1 as $n \to \infty$, and that $\hat\beta_1$ is asymptotically normal.

Theorem 2.

Assume that conditions (1)-(9) are satisfied. If $a_n \sqrt{|D_n|} \to 0$, $b_n \sqrt{|D_n|/p_n} \to \infty$ and $c_n \to 0$ as $n \to \infty$, the root-$(|D_n|/p_n)$ consistent local maximizer $\hat\beta = (\hat\beta_1^\top, \hat\beta_2^\top)^\top$ in Theorem 1 satisfies:

  1. Sparsity: $P(\hat\beta_2 = 0) \to 1$ as $n \to \infty$;

  2. Asymptotic Normality: $|D_n|^{1/2}\, \Sigma_n^{-1/2}\, (\hat\beta_1 - \beta_{01}) \xrightarrow{d} \mathcal{N}(0, I_s)$,

where

$$\Sigma_n = |D_n|\, \{A_{n,11}(\beta_0) + |D_n|\Pi_n\}^{-1}\, B_{n,11}(\beta_0)\, \{A_{n,11}(\beta_0) + |D_n|\Pi_n\}^{-1}, \tag{3.13}$$
$$\Pi_n = \mathrm{diag}\big\{ p''_{\lambda_{n,1}}(|\beta_{01}|), \dots, p''_{\lambda_{n,s}}(|\beta_{0s}|) \big\}. \tag{3.14}$$

As a consequence, $\Sigma_n/|D_n|$ is the asymptotic covariance matrix of $\hat\beta_1$. Here, $M^{-1/2}$ is the inverse of $M^{1/2}$, where $M^{1/2}$ is any square matrix such that $M^{1/2}(M^{1/2})^\top = M$.

Remark 1.

For lasso and adaptive lasso, $\Pi_n = 0$. For the other penalties, since $c_n \to 0$, we have $\Pi_n \to 0$ as $n \to \infty$. Since $A_{n,11}(\beta_0) = O(|D_n|)$ from conditions (1)-(3), $|D_n|\Pi_n$ is asymptotically negligible with respect to $A_{n,11}(\beta_0)$.

Remark 2.

Theorems 1 and 2 remain true for the regularized logistic regression estimator if we replace, in the expressions of the matrices $A_n$ and $B_n$, the factor $w(u)\rho(u;\beta)$ by $w(u)\rho(u;\beta)\delta(u)/\{\delta(u) + \rho(u;\beta)\}$, and extend condition (3) by adding a similar uniform bound on the function $\delta$.

The proofs of Theorems 1 and 2 for this estimator are slightly different, mainly because, unlike the Poisson likelihood for which the second derivative $\ell^{(2)}(w;\beta)$ is deterministic, the second derivative of the logistic regression likelihood is now stochastic and we only have $\mathrm{E}\,\ell^{(2)}(w;\beta) = -A_n(\beta)$. Despite this additional difficulty, no additional assumption is required.

We show in Theorem 2 that sparsity and asymptotic normality remain valid when the number of parameters diverges. By Remark 1, when $n$ is large enough, $\Sigma_n$ in (3.13) becomes approximately

$$\Sigma_n \approx |D_n|\, A_{n,11}(\beta_0)^{-1}\, B_{n,11}(\beta_0)\, A_{n,11}(\beta_0)^{-1},$$

which is precisely the asymptotic covariance matrix of the estimator of $\beta_{01}$ obtained by maximizing the likelihood function, or solving the estimating equations, based on the submodel knowing that $\beta_{02} = 0$. This shows that, when $n$ is sufficiently large, our estimator is as efficient as the oracle one.

To satisfy Theorem 2, we require that $a_n\sqrt{|D_n|} \to 0$, $b_n\sqrt{|D_n|/p_n} \to \infty$ and $c_n \to 0$ as $n \to \infty$ simultaneously. In particular, the conditions on $a_n$ and $c_n$ ensure the asymptotic normality of $\hat\beta_1$, while the condition on $b_n$ is used to prove the sparsity. The conditions regarding $a_n$ and $c_n$ are similar to the ones imposed by Fan and Peng (2004) when $n$ is replaced by $|D_n|$ in their context. However, we require a slightly stronger condition on $b_n$ than the one required by Fan and Peng (2004). As compensation, we do not need to impose an additional condition of theirs on the penalty derivative close to zero; such a condition is not straightforwardly satisfied in our setting, since the other conditions only imply a weaker requirement.

Further details regarding $a_n$, $b_n$ and $c_n$ for each method are presented in Table 2. For the ridge regularization method, $b_n = 0$, preventing us from applying Theorem 2 to this penalty. For lasso and elastic net, $a_n$ and $b_n$ are both proportional to the tuning parameter, with the same constant of proportionality ($= 1$ for lasso); the two conditions $a_n\sqrt{|D_n|} \to 0$ and $b_n\sqrt{|D_n|/p_n} \to \infty$ as $n \to \infty$ then cannot be satisfied simultaneously. This is different for the adaptive versions, where a compromise can be found by adjusting the $\lambda_{n,j}$'s, as well as for the two non-convex penalties SCAD and MC+, for which $\gamma$ can be adjusted. For the regularization methods we consider in this study, the condition $c_n \to 0$ is implied by the condition $a_n\sqrt{|D_n|} \to 0$ as $n \to \infty$ and condition (9).

Method: $a_n$; $b_n$; $c_n$
Ridge: $a_n = \max_{j \le s} \lambda_{n,j}|\beta_{0j}|$; $b_n = 0$; $c_n = \max_{j \le s} \lambda_{n,j}$
Lasso: $a_n = \max_{j \le s} \lambda_{n,j}$; $b_n = \min_{j > s} \lambda_{n,j}$; $c_n = 0$
Enet: $a_n = \max_{j \le s} \lambda_{n,j}\{\gamma + (1-\gamma)|\beta_{0j}|\}$; $b_n = \gamma \min_{j > s} \lambda_{n,j}$; $c_n = (1-\gamma)\max_{j \le s} \lambda_{n,j}$
AL: $a_n = \max_{j \le s} \lambda_{n,j}$; $b_n = \min_{j > s} \lambda_{n,j}$; $c_n = 0$
Aenet: $a_n = \max_{j \le s} \lambda_{n,j}\{\gamma + (1-\gamma)|\beta_{0j}|\}$; $b_n = \gamma \min_{j > s} \lambda_{n,j}$; $c_n = (1-\gamma)\max_{j \le s} \lambda_{n,j}$
SCAD: $a_n = 0$ (†); $b_n = \min_{j > s} \lambda_{n,j}$ (‡); $c_n = 0$ (†)
MC+: $a_n = 0$ (†); $b_n = \min_{j > s} \lambda_{n,j} - K\sqrt{p_n/|D_n|}/\gamma$ (‡); $c_n = 0$ (†)

  • (†) if $\max_{j \le s} \lambda_{n,j} \to 0$, for $n$ sufficiently large.

  • (‡) if $K\sqrt{p_n/|D_n|} \le \min_{j > s} \lambda_{n,j}$ (resp. $\le \gamma \min_{j > s} \lambda_{n,j}$ for MC+), for $n$ sufficiently large.

Table 2: Details of the sequences $a_n$, $b_n$ and $c_n$ for a given regularization method.

4 Numerical results

This section presents our numerical results. More precisely, we conduct simulation experiments in Section 4.1 to assess the finite sample performance of our estimates, and we apply our method to an application in ecology in Section 4.2. We apply the regularized Poisson likelihood (PL) and the regularized weighted Poisson likelihood (WPL) to select covariates and estimate their coefficients. A similar approach can be straightforwardly used for the regularized versions based on the logistic regression likelihood.

To numerically evaluate the parameter estimates, we apply the Berman-Turner method (Berman and Turner, 1992) combined with the coordinate descent algorithm (Friedman et al., 2007) to perform variable selection and parameter estimation. The Berman-Turner device shows that maximizing (2.4) is equivalent to fitting a weighted Poisson generalized linear model, so that standard software for generalized linear models (GLMs) can be used; this has been exploited by the spatstat R package (Baddeley et al., 2015). As we link spatial point process intensity estimation to GLMs, we only have to deal with feature selection procedures for GLMs. Hence, we clearly have many advantages: the various computational strategies are carefully studied and, in particular, efficiently implemented in R. In this study, to compute the regularization path solutions, we employ the coordinate descent algorithm (Friedman et al., 2007), implemented in glmnet (Friedman et al., 2010) for regularization of GLMs with convex penalties (i.e., ridge, lasso, elastic net, adaptive lasso and adaptive elastic net) and in ncvreg (Breheny and Huang, 2011) for regularization of GLMs with non-convex penalties (i.e., SCAD and MC+). Computational strategies are discussed in detail by Choiruddin et al. (2017).
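A minimal sketch of this scheme (not the authors' exact code) is given below; X (a ppp object) and covlist (a named list of im objects) are hypothetical inputs, and the unweighted PL with a lasso penalty is shown:

```r
library(spatstat)
library(glmnet)

# Berman-Turner device: recast (2.4) as a weighted Poisson GLM over a
# quadrature scheme, which glmnet can then penalize (alpha = 1: lasso).
Q  <- quadscheme(X)                        # data points + dummy points
U  <- union.quad(Q)                        # all quadrature points
wq <- w.quad(Q)                            # quadrature weights
y  <- is.data(Q) / wq                      # response: 1/w at data, 0 at dummy
Z  <- sapply(covlist, function(im) im[U])  # covariates at quadrature points

fit <- glmnet(Z, y, family = "poisson", weights = wq, alpha = 1)
```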

Our methods rely on the tuning parameters $\lambda_j$. Some previous studies (see e.g. Zou et al., 2007; Wang et al., 2007, 2009) suggest using a modified BIC criterion to select the tuning parameter. We follow this literature and choose $\lambda$ by minimizing $\mathrm{BIC}(\lambda)$, a modified version of the BIC criterion, defined by

$$\mathrm{BIC}(\lambda) = -2\,\ell\{w;\hat\beta(\lambda)\} + s(\lambda)\, \log |D|,$$

where $s(\lambda)$ is the number of selected covariates with non-zero regression coefficients and $|D|$ is the volume of the observation domain. To implement the adaptive methods (i.e., adaptive lasso and adaptive elastic net), we follow Zou (2006) and define $\lambda_j = \lambda/|\tilde\beta_j|$, where $\tilde\beta_j$ is the estimate obtained from ridge regression and $\lambda$ is a tuning parameter chosen by the $\mathrm{BIC}(\lambda)$ criterion as described above. Following Choiruddin et al. (2017), we fix the value of $\gamma$ for the elastic net and its adaptive version, for SCAD, and for MC+ (standard choices being $\gamma = 3.7$ for SCAD, as recommended by Fan and Li (2001), and $\gamma = 3$ for MC+). For further discussion regarding the selection of $\gamma$ for SCAD and MC+, see e.g. Fan and Li (2001) and Breheny and Huang (2011).
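Continuing the previous sketch (all object names hypothetical), the tuning-parameter selection could look as follows, with the adaptive lasso weights built from a ridge fit:

```r
# Select lambda along the glmnet path by minimizing BIC(lambda) above.
absD   <- area(Window(X))                  # |D|, the volume of the domain
loglik <- function(b0, b) {                # weighted Poisson log-likelihood
  eta <- b0 + as.vector(Z %*% b)
  sum(wq * (y * eta - exp(eta)))
}
coefs <- rbind(fit$a0, as.matrix(fit$beta))
bic   <- apply(coefs, 2, function(b)
  -2 * loglik(b[1], b[-1]) + sum(b[-1] != 0) * log(absD))
beta.lasso <- coef(fit, s = fit$lambda[which.min(bic)])

# Adaptive weights 1/|ridge estimate|, then refit via penalty.factor.
ridge      <- glmnet(Z, y, family = "poisson", weights = wq, alpha = 0)
beta.ridge <- as.vector(coef(ridge, s = min(ridge$lambda)))[-1]  # in practice,
fit.al     <- glmnet(Z, y, family = "poisson", weights = wq,     # choose the
                     alpha = 1, penalty.factor = 1 / abs(beta.ridge))  # ridge
                                           # lambda by the same BIC criterion
```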

4.1 Simulation study

In this section, we investigate the behavior of our estimators in simulation experiments covering different situations where a large number of covariates is involved in the intensity estimation. We extend the setting considered by Choiruddin et al. (2017). We start with a relatively complex situation where strong multicollinearity is present (Scenarios 1a and 2a) and then consider a more complex setting using real datasets (Scenarios 1b and 2b). We consider two different scenarios (Scenarios 1 and 2), for which the number of true covariates as well as their coefficients differ.

The spatial domain we consider is $D = [0, 1000] \times [0, 500]$ (in metres). The true intensity function has the log-linear form (1.1) with $p = 50$ covariates, and we set the constant term so that the mean number of points over $D$ is fixed. We consider two different scenarios, described as follows.

  1. We define the true coefficient vector so that only the two covariates below have non-zero coefficients. To define the covariates, we center and scale the pixel images of elevation ($z_1$) and gradient of elevation ($z_2$) contained in the bei dataset of the spatstat library in R and use them as the two true covariates. In addition, we create two settings to define the extra covariates:

    1. First, we generate 48 pixel images of covariates as standard Gaussian white noise and denote them by $z_3, \dots, z_{50}$. Second, we transform them, together with $z_1$ and $z_2$, to introduce multicollinearity: we apply a linear transformation that correlates every pair of covariates, except the pair $z_1, z_2$, whose correlation is preserved. In this setting, we have 50 covariates in total.

    2. We center and scale the 13 pixel images of soil nutrients obtained from the study in the tropical forest of Barro Colorado Island (BCI) in central Panama (see Condit, 1998; Hubbell et al., 1999, 2005) and convert them to pixel images on the same grid as $z_1$ and $z_2$. In addition, we consider interactions between pairs of soil nutrients, so that we have 50 covariates in total: the 48 extra covariates are the 13 soil nutrients and 35 interactions between them. Together with $z_1$ and $z_2$, we keep the structure of the covariance matrix to preserve the complexity of the situation.

  2. In this scenario, we consider five true covariates out of 50. In addition to elevation ($z_1$) and gradient of elevation ($z_2$), we convert the pixel images of the concentrations of Aluminium ($z_3$), Boron ($z_4$) and Calcium ($z_5$) in the soil to pixel images on the same grid as $z_1$ and $z_2$ and set them to be the other three true covariates. All five covariates are centered and scaled. We define the true coefficient vector accordingly, with five non-zero components. As in Scenario 1, we create two settings to define the 45 extra covariates:

    1. This setting is similar to that of Scenario 1a. We generate 45 pixel images of covariates as standard Gaussian white noise, denote them by $z_6, \dots, z_{50}$, and apply a linear transformation that correlates every pair of covariates, except among $z_1$-$z_5$, whose correlations are preserved. We still have 50 covariates in total.

    2. We use the real dataset as in Scenario 1b and consider a similar setting; the difference is that there are now five true covariates, with the non-zero regression coefficients defined above.

Method Regularized PL Regularized WPL Regularized PL Regularized WPL
TPR FPR PPV TPR FPR PPV TPR FPR PPV TPR FPR PPV
Scenario 1a
Lasso 100† 13 28 96 4 62 97 23 20 64 1 76
Enet 100† 34 12 93 8 48 97 48 10 59 2 58
AL 100† 1 92 97 0† 96 95 3 68 70 0† 98
Aenet 100† 2 76 97 1 85 95 6 52 67 0† 95
SCAD 100† 7 41 97 1 87 96 4 61 56 0† 79
MC+ 100† 8 37 96 1 85 96 5 58 52 1 74
Scenario 1b
Lasso 100† 45 10 91 11 52 100† 96 4 20 6 22
Enet 100† 63 7 87 18 31 100† 98 4 15 6 14
AL 100† 26 19 95 5 81 99 85 5 26 5 35
Aenet 100† 30 15 95 6 74 100† 87 5 24 5 30
SCAD 100† 26 18 93 5 76 100† 76 5 23 4 28
MC+ 100† 26 17 93 5 76 99 76 5 22 5 27
Scenario 2a
Lasso 98 93 10 84 73 14 98 96 10 47 35 16
Enet 99 98 10 85 80 11 99 98 10 46 38 12
AL 95 49 18 83 35 27 95 64 15 50 23 28
Aenet 96 52 17 84 40 21 96 68 14 48 26 20
SCAD 86 74 13 65 45 36 75 60 21 39 26 30
MC+ 87 78 13 65 47 35 73 60 22 39 26 30
Scenario 2b
Lasso 80 64 13 75 60 12 78 69 11 64 57 9
Enet 85 73 12 82 69 11 84 79 11 68 64 8
AL 56 26 19 54 25 20 59 35 17 48 30 13
Aenet 59 30 18 57 29 18 64 43 15 52 36 11
SCAD 43 21 20 42 20 23 46 24 27 41 25 16
MC+ 44 21 20 43 20 23 46 24 26 41 26 16
  • † Approximate value.

Table 3: Empirical selection properties (TPR, FPR, and PPV, in %) based on 2000 replications of Thomas processes on the domain $D$ for two different values of $\kappa$ and for the two different scenarios. Different penalty functions are considered, as well as two estimating equations, the regularized Poisson likelihood (PL) and the regularized weighted Poisson likelihood (WPL).

With these scenarios, we simulate 2000 spatial point patterns from a Thomas point process using the rThomas function in the spatstat package. We fix the scale (interaction) parameter of the clusters and consider two values of the parent intensity $\kappa$. Briefly, a smaller scale corresponds to tighter clusters, and smaller values of $\kappa$ correspond to fewer parents (see e.g. Møller and Waagepetersen, 2004, for further details regarding the Thomas point process). For each scenario and each value of $\kappa$, we fit the intensity to the simulated point pattern realizations.
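A single replicate can be sketched as follows; the coefficients and Thomas parameters below are placeholders, not the values used in the paper:

```r
library(spatstat)

# One simulation replicate (placeholder parameter values). The covariates
# are centred and scaled versions of the bei elevation and gradient images.
std   <- function(im) (im - mean(im)) / sqrt(mean((im - mean(im))^2))
elev  <- std(bei.extra$elev)
grad  <- std(bei.extra$grad)
kappa <- 5e-4                                  # parent intensity (placeholder)
rho   <- exp(-7 + 0.75 * elev + 0.75 * grad)   # true log-linear intensity (1.1)
X     <- rThomas(kappa = kappa, scale = 20, mu = rho / kappa, win = Window(rho))
```

Since the intensity of the Thomas process is kappa times the offspring reference intensity mu, taking mu = rho / kappa yields a process with intensity rho.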

We report the performance of our estimates in terms of two characteristics: selection and prediction properties. We present the selection properties in Table 3 and the prediction properties in Table 4.

Method Regularized PL Regularized WPL Regularized PL Regularized WPL
Bias SD RMSE Bias SD RMSE Bias SD RMSE Bias SD RMSE
Scenario 1a
Lasso 0.19 0.19 0.27 0.43 0.29 0.52 0.29 0.60 0.67 0.94 0.53 1.08
Enet 0.27 0.22 0.35 0.72 0.32 0.79 0.34 0.66 0.74 1.21 0.40 1.27
AL 0.05 0.18 0.19 0.14 0.24 0.28 0.19 0.60 0.63 0.57 0.57 0.81
Aenet 0.07 0.19 0.20 0.20 0.27 0.33 0.22 0.60 0.64 0.69 0.55 0.88
SCAD 0.19 0.19 0.27 0.29 0.32 0.43 0.14 0.55 0.57 1.10 0.71 1.31
MC+ 0.20 0.19 0.28 0.32 0.37 0.49 0.15 0.55 0.57 1.15 0.72 1.35
Scenario 1b
Lasso 0.18 1.03 1.05 0.57 0.58 0.81 1.97 8.00 8.23 1.85 2.11 2.81
Enet 0.27 1.32 1.34 0.81 0.73 1.09 1.87 7.73 7.96 1.94 2.02 2.80
AL 0.18 0.73 0.76 0.28 0.43 0.51 1.26 6.23 6.36 1.68 1.70 2.39
Aenet 0.21 0.72 0.75 0.36 0.44 0.57 1.05 5.45 5.55 1.76 1.49 2.31
SCAD 0.26 0.99 1.02 0.39 0.63 0.74 1.20 5.55 5.68 1.71 1.59 2.34
MC+ 0.26 0.99 1.03 0.40 0.64 0.76 1.21 5.53 5.66 1.71 1.59 2.33
Scenario 2a
Lasso 1.45 1.89 2.38 2.24 2.47 3.34 0.94 8.86 8.91 4.53 5.79 7.35
Enet 1.54 1.89 2.44 2.38 2.62 3.54 1.27 6.54 6.66 4.95 4.85 6.93
AL 1.57 1.80 2.39 2.20 2.16 3.09 1.33 6.38 6.52 4.31 4.50 6.23
Aenet 2.05 1.60 2.59 2.64 2.11 3.38 1.95 4.75 5.13 4.89 3.73 6.14
SCAD 2.26 1.75 2.86 3.84 2.43 4.54 3.74 3.45 5.09 5.79 2.73 6.40
MC+ 2.45 1.77 3.02 3.95 2.39 4.61 3.81 3.41 5.12 5.82 2.71 6.42
Scenario 2b
Lasso 3.28 2.87 4.36 3.36 3.20 4.64 3.85 13.41 13.95 4.61 11.20 12.11
Enet 3.39 2.45 4.18 3.48 2.75 4.44 3.76 7.86 8.71 4.66 6.96 8.37
AL 3.64 1.59 3.97 3.69 1.78 4.10 3.89 8.99 9.80 4.70 6.95 8.39
Aenet 3.71 1.34 3.95 3.79 1.58 4.10 4.03 4.89 6.34 4.88 4.38 6.55
SCAD 4.56 2.22 5.07 4.67 2.27 5.19 5.22 3.27 6.16 5.65 3.18 6.48
MC+ 4.53 2.24 5.05 4.64 2.29 5.18 5.23 3.25 6.15 5.66 3.21 6.51
Table 4: Empirical prediction properties (Bias, SD, and RMSE) based on 2000 replications of Thomas processes on the domain $D$ for two different values of $\kappa$ and for the two different scenarios. Different penalty functions are considered, as well as two estimating equations, the regularized Poisson likelihood (PL) and the regularized weighted Poisson likelihood (WPL).

To evaluate the selection properties of the estimates, we consider the true positive rate (TPR), the false positive rate (FPR), and the positive predictive value (PPV). We want to find the methods which have a TPR close to 100%, meaning that they correctly select all the true covariates; an FPR close to 0%, showing that they remove all the extra covariates from the model; and a PPV close to 100%, indicating that, for Scenario 1 (resp. Scenario 2), they keep exactly the two (resp. five) true covariates and remove all the 48 (resp. 45) extra covariates. In general, for both the regularized PL and the regularized WPL, the best selection properties are obtained for the larger $\kappa$, which indicates weaker spatial dependence. Comparing the regularization methods, the main difference between the methods which satisfy our theorems (adaptive lasso, adaptive elastic net, SCAD, and MC+) and those which cannot (lasso, elastic net) is that the latter tend to over-select covariates, leading to larger FPR and smaller PPV in general. Among all the regularization methods considered in this study, the adaptive lasso and adaptive elastic net seem to outperform the other methods in most cases. Although the adaptive lasso and adaptive elastic net perform quite similarly, the adaptive lasso is slightly better.
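For reference, these three metrics can be computed as follows (a sketch; beta.hat and beta.true are hypothetical estimated and true coefficient vectors, intercept excluded):

```r
# Empirical selection metrics used in Table 3, in percent.
selection.metrics <- function(beta.hat, beta.true) {
  sel  <- beta.hat  != 0
  true <- beta.true != 0
  c(TPR = 100 * sum(sel & true)  / sum(true),        # true covariates kept
    FPR = 100 * sum(sel & !true) / sum(!true),       # noise covariates kept
    PPV = 100 * sum(sel & true)  / max(sum(sel), 1)) # precision of selection
}
```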

In this simulation study, we are still able to show that even when strong multicollinearity is present, as in Scenario 1a, our proposed methods work well for the penalization methods satisfying our theorems. However, as expected, our methods find it increasingly difficult to distinguish between the important and the noisy covariates as the setting becomes more complex. In the experiments we conduct, we find that the regularized PL and WPL (with adaptive lasso) perform quite similarly for the easiest (Scenario 1a) and the toughest (Scenario 2b) settings. For Scenarios 1b and 2a, the regularized WPL with adaptive lasso seems more favorable. From Table 3, we would recommend in general combining the regularized WPL with the adaptive lasso to perform variable selection.

Table 4 gives the prediction properties of the estimates (except for the intercept, which is excluded) in terms of bias, standard deviation (SD), and square root of the mean squared error (RMSE), criteria that we define globally by

$$\mathrm{Bias} = \Big\{\sum_{j} (m_j - \beta_{0j})^2\Big\}^{1/2}, \quad \mathrm{SD} = \Big(\sum_{j} v_j\Big)^{1/2}, \quad \mathrm{RMSE} = \big(\mathrm{Bias}^2 + \mathrm{SD}^2\big)^{1/2},$$

where $m_j$ and $v_j$ are respectively the empirical mean and variance of the estimates $\hat\beta_j$, for $j = 1, \dots, p$.
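These criteria translate directly into R (a sketch consistent with the definitions above; B is a hypothetical R x p matrix whose rows are the estimates from R replications):

```r
# Empirical prediction metrics used in Table 4.
prediction.metrics <- function(B, beta.true) {
  m <- colMeans(B)              # empirical means of the estimates
  v <- apply(B, 2, var)         # empirical variances
  c(Bias = sqrt(sum((m - beta.true)^2)),
    SD   = sqrt(sum(v)),
    RMSE = sqrt(sum((m - beta.true)^2 + v)))
}
```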

In general, the properties improve with larger $\kappa$ due to weaker spatial dependence. Regarding the regularization methods considered in this study, the adaptive lasso and adaptive elastic net perform best. The adaptive elastic net becomes preferable to the adaptive lasso for a more clustered process (smaller $\kappa$) and for structured spatial data (Scenarios 1b and 2b). The adaptive elastic net is more efficient than the adaptive lasso especially in complex situations (a large number of covariates, strong multicollinearity, clustered processes, and a complex spatial structure) due to the advantage of combining the $\ell_1$ and $\ell_2$ penalties.

By employing the regularized WPL, we obtain potentially more efficient estimates than those of the regularized PL, especially for the more clustered processes. However, this does not mean that the regularized WPL always improves the RMSE, since it usually introduces extra bias. The regularized WPL seems more appropriate when the covariates have a complex spatial structure (Scenarios 1b and 2b); otherwise, the regularized PL is more favorable. From Table 4, when the focus is on prediction, we would recommend applying the adaptive elastic net as general advice, combined with the regularized WPL if the covariates have a complex spatial structure (e.g. Scenarios 1b and 2b), or with the regularized PL if there is no evidence of such structure (e.g. Scenarios 1a and 2a).

Note that, from Table 3, the adaptive lasso is preferable if the focus is on variable selection while, from Table 4, the adaptive elastic net is more favorable if the focus is on prediction. As a more general recommendation, we suggest applying the adaptive elastic net when faced with a complex situation: a large number of covariates, strong multicollinearity, clustered processes, and a complex spatial structure. By combining the $\ell_1$ and $\ell_2$ penalties, the adaptive elastic net provides a nice balance between selection and prediction properties. This is why, in the most complex cases (Scenario 2 with the smaller $\kappa$), the adaptive elastic net selects more covariates than the adaptive lasso (both true and noisy ones), suffering slightly in selection performance but improving the prediction properties significantly.

4.2 Application to forestry datasets

We now consider the ecological study in the tropical rainforest of Barro Colorado Island (BCI), Panama, described previously in Section 1. In particular, we are interested in studying the spatial distribution of the 3,604 locations of Beilschmiedia pendula Lauraceae (BPL) trees by modeling their intensity as a log-linear function of 93 covariates consisting of 2 topographic attributes, 13 soil properties, and 78 interactions between two soil nutrients. Given the relatively large number of covariates, we apply our proposed methods to select a few covariates among them and estimate their coefficients. In particular, we use the regularized PL and WPL methods with the lasso, adaptive lasso, and adaptive elastic net penalties. Note that we center and scale all the covariates in order to identify which covariates have a relatively large effect on the intensity.
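As an illustration of the size of this design, the 93 covariates can be assembled as follows (a sketch; elev, grad and the named list soil of 13 centred and scaled im objects are hypothetical inputs):

```r
# Build the 78 pairwise soil interactions and collect the full design:
# 2 topographic attributes + 13 soil properties + choose(13, 2) = 78 products.
soil.names <- names(soil)
inter <- list()
for (i in 1:12) for (j in (i + 1):13) {
  nm <- paste(soil.names[i], soil.names[j], sep = ":")
  inter[[nm]] <- soil[[i]] * soil[[j]]   # pixel-wise product of two images
}
covlist <- c(list(Elev = elev, Slope = grad), soil, inter)
length(covlist)                          # 2 + 13 + 78 = 93
```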

Method Regularized PL Regularized WPL
#Selected #Non-selected #Selected #Non-selected
LASSO 77 16 20 73
AL 50 43 9 84
AENET 69 24 9 84
Table 5: Number of selected and non-selected covariates among 93 covariates by regularized Poisson likelihood with lasso, adaptive lasso and adaptive elastic net regularization.
Covariates Regularized PL Regularized WPL
LASSO AL AENET LASSO AL AENET
Elev 0.33 0.37 0.34 0.23 0.14 0.14
Slope 0.37 0.37 0.37 0.45 0.44 0.46
Cu 0.45 0.30 0.30 0.16 0.22 0.19
Mn 0.11 0.10 0.11 0.18 0.14 0.14
P -0.49 -0.45 -0.48 -0.50 -0.43 -0.39
Zn -0.69 -0.54 -0.70 -0.21 -0.31 -0.25
Al:P -0.28 -0.24 -0.28 -0.13 -0.14 -0.13
Mg:P 0.49 0.26 0.30 0.38 0.38 0.34
N.Min:pH 0.42 0.39 0.39 0.22 0.17 0.17
Table 6: Estimated coefficients of the nine covariates commonly selected by all six methods.

We present in Table 5 the number of selected and non-selected covariates for each method. Out of the 93 covariates, more than half are selected by the regularized PL, while far fewer covariates are selected by the regularized WPL. The regularized PL seems to overfit the model.

Regarding the lasso, 77 covariates are selected by the regularized PL while 20 covariates are selected by the regularized WPL. Compared to the two adaptive methods (i.e., adaptive lasso and adaptive elastic net), the lasso tends to keep less important covariates; this is in line with the fact that the lasso cannot satisfy the conditions of Theorem 2. In terms of selection properties, the adaptive lasso and adaptive elastic net perform similarly when the regularized WPL is applied.

Table 6 gives information on the nine covariates commonly selected by the six methods. Although the magnitudes of the estimates can differ slightly, the signs all agree with each other.

These results suggest that BPL trees favor areas of higher elevation and steeper slope with high concentrations of copper and manganese in the soil. Furthermore, BPL trees prefer areas with lower concentration levels of phosphorus and zinc in the soil. The interaction between aluminium and phosphorus has a negative association with the appearance of BPL trees, while the interaction between magnesium and phosphorus and the interaction between nitrogen mineralization and pH show a positive association with the occurrence of BPL trees. Maps of the 3,604 locations of BPL trees, as well as of the nine commonly selected covariates, are depicted in Figure 1.