1 Introduction
1.1 Background
Spatial point pattern data arise in many contexts, e.g. ecology (see e.g. Møller and Waagepetersen, 2004; Renner et al., 2015), epidemiology (e.g. Diggle, 1990, 2013), criminology (e.g. Baddeley et al., 2015; Shirota et al., 2017), biology (e.g. Illian et al., 2008) and astronomy (e.g. Baddeley et al., 2015), where interest lies in describing the distribution of an event in space. Stochastic models generating spatial point patterns are called spatial point processes (see e.g. Møller and Waagepetersen, 2004; Illian et al., 2008; Diggle, 2013; Baddeley et al., 2015).
Usually, the first step to analyze spatial point pattern data is to investigate the intensity. The intensity serves as the firstorder characteristics of a spatial point process and often becomes the main interest in many studies, especially when the intensity is suspected to depend on spatial covariates. Examples include the study of spatial variation of specific disease risk related to pollution sources (e.g. Diggle, 1990, 2013), crime rate analysis in a city related to some demographical information (e.g. Shirota et al., 2017), and modeling of the spatial distribution of trees species in a forest related to some environmental factors (e.g. Waagepetersen, 2007; Thurman et al., 2015; Renner et al., 2015).
We focus in this study on the loglinear model for the intensity function of an inhomogeneous spatial point process defined by
(1.1) 
where are the spatial covariates measured at location , represents the state space of the spatial point processes (usually ) and is a real
dimensional parameter. Hence, our main concern is to assess the magnitudes of the vector
. For parametric estimation, while maximum likelihood estimation (e.g. Berman and Turner, 1992; Rathbun and Cressie, 1994) has been widely implemented for Poisson point processes models, estimating equationbased methods (e.g. Waagepetersen, 2007, 2008; Guan and Shen, 2010; Baddeley et al., 2014) are simpler to implement for more general spatial point processes models, overcoming the possible drawback of MCMC methods which are usually computational expensive (Møller and Waagepetersen, 2004). However, when the number of covariates is relatively large, maximum likelihood estimation and estimating equationbased methods become undesirable: all covariates are selected yielding an increasing standard error for parameter estimates.
1.2 Feature selection techniques
To select significant covariates, one may consider a traditional procedure such as a stepwise method. This technique starts with an initial set of covariates, then considers adding or deleting a covariate from the current set at each iteration using a criterion such as an Fstatistic or AIC. However, such procedure has a number of limitations: it can be numerically unstable and exhibits high variance due to its discrete procedure
(e.g. Breiman, 1996; Fan and Li, 2001; Friedman et al., 2008). It is even computationally unfeasible especially when the number of covariates is too large (e.g. Breiman, 1996; Zou, 2006).To overcome this drawback, regularization techniques have recently been developed for spatial point processes intensity estimation. Such methods are able to perform variable selection while keeping interesting properties in terms of prediction. For Poisson point process models, the idea is to penalize the Poisson likelihood by a penalty function such as penalty (see Renner and Warton, 2013; Thurman and Zhu, 2014). For more general point process models, instead of employing the likelihood of the processes which often requires computational intensive MCMC methods (Møller and Waagepetersen, 2004), penalized versions of estimating equations based on Campbell theorem derived both from Poisson and logistic regression likelihoods have been developed (see Thurman et al., 2015; Choiruddin et al., 2017). Furthermore, Thurman et al. (2015) and Choiruddin et al. (2017) show that, under some conditions, the estimates obtained from such procedures are consistent, sparse, and asymptotically normal.
1.3 Issues in high dimensional data
The motivation of our paper comes from a study of biodiversity in a 50hectare region () of the tropical moist forest of Barro Colorado Island (BCI) in central Panama, where censuses have been carried out such that all freestanding woody stems at least 10 mm diameter at breast height were identified, tagged, and mapped, resulting in maps of over 350,000 individual trees with around 300 species (see Condit, 1998; Hubbell et al., 1999, 2005). In the same region, many environmental covariates such as topographical attributes and soil properties have been also collected. In particular, we are interested to study the spatial distribution of 3,604 locations of Beilschmiedia pendula Lauraceae (BPL) trees and to model its intensity as a parametric function of 93 covariates consisting of 2 topological attributes, 13 soil properties and 78 interactions between two soil nutrients.
Although it seems that the number of covariates is not very large with respect to the number of data points, two hours are required to estimate the parameters and select covariates using a standard stepwise procedure. To do this, we use the step function in R which intrinsically assumes that is a Poisson point process since we use the AIC in the stepwise procedure. For a general point process, other criteria could be investigated such as the one based on the statistic, but they require to estimate the asymptotic covariance matrix of the estimates at each step: even if we know the right covariates, it is known as a difficult task (see e.g. Coeurjolly and Guan (2014)) especially when the number of parameters is large. That would easily triple the time of this estimation/selection procedure. To evaluate the performance of such a selection/estimation procedure, a simulation would be required which is unrealistic (1000 replications of a single model would take 250 days). This motivates us to consider regularization methods.
Thurman et al. (2015) and Choiruddin et al. (2017) are the first two theoretical works. Both these works have the important limitation that the number of covariates is finite. We extend this in the present paper. Asymptotic properties which consider a diverging number of parameters for estimators have a long story (e.g Huber, 1973; Portnoy, 1984) but have recently been investigated for penalized regression estimators by Fan and Peng (2004); Zou and Zhang (2009). In particular, as argued by Fan and Peng (2004), even though the asymptotic properties (i.e., consistency, sparsity, and asymptotic normality) proposed by Fan and Li (2001) for penalized generalized linear models under the assumption that is finite, are encouraging, there are many naive and simple model selection procedures which possess those properties. Establishing the validity of these asymptotic properties in a diverging number of parameters setting is, therefore, a major importance. We study this type of asymptotic properties in the spatial point processes framework. Hence, our work can be regarded as an extension of the study conducted by Choiruddin et al. (2017).
A standard way of measuring asymptotic for spatial point process is the increasing domain asymptotic. Therefore, we investigate the problem where grows with the volume of the observation domain. In our setting, plays the same role as , the number of observations, in standard problems such as in linear models or generalized linear models. We obtain consistency, sparsity, and asymptotic normality for our estimator. One of our main assumptions is that as , which is similar to the one required by Fan and Peng (2004) when is simply replaced by (the sample size in their context).
Our results are general: (1) a large choice of penalty functions (either convex or nonconvex function) and methods (e.g. ridge, lasso, elastic net, SCAD, and MC+) are available; (2) we include a large class of mixing spatial point processes. The implementation is done by combining the spatstat (Baddeley et al., 2015) R package with the two R packages implementing penalized methods for generalized linear models: glmnet (Friedman et al., 2010) and ncvreg (Breheny and Huang, 2011).
1.4 Outline of the paper
In Section 2, we introduce brief background on spatial point processes as well as regularization methods for spatial point processes intensity estimation. Section 3 presents our asymptotic results. We investigate in Section 4 the finite sample performance of our estimates in a simulation study and in an application to tropical forestry datasets. Conclusion and discussion are presented in Section 5. Proofs of the main results are postponed to Appendices AC.
2 Regularization methods for spatial point processes
This section gives brief introduction on spatial point processes and reviews regularization methods for spatial point processes intensity estimation previously studied by Choiruddin et al. (2017) when the number of parameters is finite.
Let be a spatial point process on . We view as a locally finite random subset of . Let be a compact set of Lebesgue measure which will play the role of the observation domain. A realization of in is thus a set , where and is the observed number of points in . Suppose has intensity function and secondorder product density . Campbell theorem (see e.g. Møller and Waagepetersen, 2004) states that, for any function or
(2.2)  
(2.3) 
We may interpret as the probability of occurence of a point in an infinitesimally small ball with centre and volume . Intuitively, is the probability for observing a pair of distinct points from occuring jointly in each of two infinitesimally small balls with centres and volume . For further background materials on spatial point processes, see for example Møller and Waagepetersen (2004); Illian et al. (2008).
In our study, we assume that the intensity function depends on parameter , . The standard parametric methods for estimating are by maximizing the weighted Poisson likelihood (e.g. Guan and Shen, 2010) or the weighted logistic regression likelihood (e.g. Baddeley et al., 2014; Choiruddin et al., 2017) given respectively by
(2.4)  
(2.5) 
where is a weight nonnegative function depending on the first and the secondorder characterictics of and is a nonnegative realvalued function. The solution of maximizing (2.4) (resp. (2.5)) is called Poisson estimator (resp. the logistic regression estimator). We refer readers to Guan and Shen (2010) for further details on the weight function and to Baddeley et al. (2014) for the role of function .
These standard methods cannot perform variable selection. To do so, Thurman et al. (2015) and Choiruddin et al. (2017) suggest to maximize a penalized version of (2.4)(2.5)
(2.6) 
where is either the Poisson likelihood (2.4) or the logistic regression likelihood (2.5). We refer the second term of (2.6) to a penalization term. In this term, we have mainly two parts: (1) a penalty function parameterized by and (2) the volume of the observation domain which plays the same role as the sample size in the spatial point process framework.
For any nonnegative , we say that is a penalty function if is a nonnegative function with . Some examples, described in Table 1, include penalty (Hoerl and Kennard, 1988), penalty (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), SCAD (Fan and Li, 2001), and MC+ (Zhang, 2010). Note that, as indicated by (2.6), we allow each direction to have different tuning parameters . Such a method is called an adaptive method (e.g. adaptive lasso (Zou, 2006) and adaptive elastic net (Zou and Zhang, 2009)). For further backgrounds about penalty function and regularization methods, see, for example, Friedman et al. (2008).
Penalty  

penalty  
penalty  
Enet  , for any 
SCAD  , 
for any  
MC+  , for any 
3 Asymptotic properties
In this section, we present asymptotic properties of the regularized Poisson estimator when both and as . In particular, we consider as a dimensional point process observed over a sequence of observation domain which expands to as . We assume that has a loglinear form given by (1.1) for which the dimension of parameter , denoted now by , diverges to as . In Section 3.1, we provide notation and conditions, and discuss the differences from the setting where is fixed. Our main results are presented in Section 3.2. For sake of conciseness, we do not present the asymptotic results for the regularized logistic regression estimator. The results are very similar. The main difference is lying in the conditions (.6) and (.7) for which the matrices and have a different expression (see Remark 2).
3.1 Notation and conditions
Throughout this section and Appendices AC, let
(3.7)  
(3.8) 
be respectively the weighted Poisson likelihood and its penalized version.
Let denote the dimensional vector to estimate, where is the dimensional vector of nonzero coefficients and is the ()dimensional vector of zero coefficients. We assume that the number of nonzero coefficients, , does not depend on . Let and denote the corresponding dimensional and ()dimensional vectors of spatial covariates. We denote the regularized Poisson estimator by .
We recall the classical definition of strong mixing coefficients adapted to spatial point processes (e.g. Politis et al., 1998): for and , define
(3.9) 
where is the algebra generated by is the minimal distance between sets and , and denotes the class of Borel sets in .
We define the matrices and by
where is the classical pair correlation function Møller and Waagepetersen (2004) given by
when both and exist with the convention . For a Poisson point process, we have since . If, for example, (resp. ), this indicates that pair of points are more likely (resp. less likely) to occur at locations than for a Poisson point process.
We denote the topleft corner of , by . It is worth noticing that , and depend on only through and not through . In what follows, for a squared symmetric matrix , and
denote respectively the smallest and largest eigenvalue of
.Under the conditions (.8)(.9), we define the sequences , and by
(3.10)  
(3.11)  
(3.12) 
where is any positive constant.

For every , where is convex, compact, and contains the origin of in its interior.

The intensity function has the loglinear specification given by (1.1) where and is an open convex bounded set of . Furthermore, we assume that there exists a neighborhood of such that

The covariates and the weight function satisfy

There exists an integer such that for , the product density exists and satisfies .

For the strong mixing coefficients (3.9), we assume that there exists some such that .

.

.

The penalty function is nonnegative on , continuously differentiable on with derivative assumed to be a Lipschitz function on . Furthermore, given we assume that there exists , where as , such that, for sufficiently large, is thrice continuously differentiable in the ball centered at with radius and we assume that the third derivative is uniformly bounded.

as .
Conditions (.1)(.8) are quite similar to the ones required by Choiruddin et al. (2017) in the setting when the number of parameters to estimate is fixed. Condition (.2) is slightly stronger since we have to ensure that is finite for in the neighborhood of . Note that follows directly from condition (.3). We derive asymptotic properties when both and tend to infinity with . However, to obtain an estimator which is consistent and has two other properties: sparsity and asymptotic normality, we need that the number of covariates does not grow too fast with respect to the volume of the observation domain. This condition is stated by condition (.9) which is similar to the one required by Fan and Peng (2004) when is simply replaced by (the sample size in their context).
3.2 Main results
We first show in Theorem 1 that the regularized Poisson estimator converges in probability and exhibits its rate of convergence.
Theorem 1.
This implies that, the regularized Poisson estimator is root consistent. Note that, as expected, the convergence rate is times the convergence rate of the estimator obtained when is fixed (see Theorem 1 Choiruddin et al., 2017). In addition, when we compare our results with the ones obtained by Fan and Peng (2004), who also considered a diverging number of parameters setting, our estimator has the same rate of convergence when we replace by to their context. This rate of convergence also appears in other contexts considering diverging number of parameters setting (see e.g. Lam and Fan, 2008; Zou and Zhang, 2009; Li et al., 2011; Cho and Qu, 2013; Wang and Zhu, 2017).
Now, we demonstrate in Theorem 2 that such a root consistent estimator ensures the sparsity of ; that is, the estimate will correctly set to zero with probability tending to 1 as , and is asymptotically normal.
Theorem 2.
Remark 1.
Remark 2.
We show in Theorem 2 that the sparsity and asymptotic normality are still valid when the number of parameters diverges. By Remark 1, when is large enough, in (3.13) becomes approximately
which is precisely the asymptotic covariance matrix of the estimator of obtained by maximizing the likelihood function or solving estimating equations based on the submodel knowing that . This shows that when is sufficiently large, our estimator is as efficient as the oracle one.
To satisfy Theorem 2, we require that , and as simultaneously. In particular, conditions on and ensure the asymptotic normality of while condition on is used to prove the sparsity. Conditions regarding and are similar to the ones imposed by Fan and Peng (2004) when is replaced by in their context. However, we require a slightly stronger condition on than the one required by Fan and Peng (2004) which in the present setting could be written as . As compensation, we do not need to impose, as Fan and Peng (2004) did, for any , . Such a condition is not straightforwardly satisfied in our setting since the other conditions only imply that .
Further details regarding , and for each method are presented in Table 2. For the ridge regularization method, , preventing from applying Theorem 2 for this penalty. For lasso and elastic net, for some constant (=1 for lasso). The two conditions and as cannot be satisfied simultaneously. This is different for the adaptive versions where a compromise can be found by adjusting the ’s, as well as the two nonconvex penalties SCAD and MC+, for which can be adjusted. For the regularization methods we consider in this study, the condition is implied by the condition as and condition (.9).
Method  

Ridge  
Lasso  0  
Enet  
AL  0  
Aenet  
SCAD  
MC+ 

if for sufficient large

if for sufficient large
4 Numerical results
This section is devoted to present numerical results. More precisely, we conduct simulation experiments in Section 4.1 to assess the finite sample peformance of our estimates and apply our method to an application in ecology in Section 4.2. We apply the regularized Poisson likelihood (PL) and the regularized weighted Poisson likelihood (WPL) to select covariates and estimate their coefficients. Similar approach can be straightforwardly used for the regularized versions using logistic regression likelihood.
To numerically evaluate the parameters estimates, we apply BermanTurner method (Berman and Turner, 1992) combined with coordinate descent algorithm (Friedman et al., 2007) to perform variable selection and parameter estimation. BermanTurner device allows to show that maximizing (2.4) is equivalent to fitting a weighted Poisson generalized linear model, so the standard software for generalized linear models (GLMs) can be used. This has been exploited by the spatstat R package (Baddeley et al., 2015). As we make links between spatial point processes intensity estimation and GLMs, we only have to deal with feature selection procedures for GLMs. Hence, we clearly have many advantages: the various computational strategies are carefully studied, and, in particular, efficiently implemented in R. In this study, to compute the regularization path solutions, we employ coordinate descent algorithm (Friedman et al., 2007). This is implemented in the glmnet (Friedman et al., 2010) for regularization methods for GLMs using some convex penalties (i.e., ridge, lasso, elastic net, adaptive lasso and adaptive elastic net) and in the ncvreg (Breheny and Huang, 2011) for regularization methods for GLMs using some nonconvex penalties (i.e., SCAD and MC+). More details for computational strategies are discussed in detail by Choiruddin et al. (2017).
Our methods rely on the tuning parameter . Some previous studies (see e.g. Zou et al., 2007; Wang et al., 2007, 2009) suggest to use a modified BIC criterion to select the tuning parameter. We follow the literature and choose by minimizing , a modified version of the BIC criterion, defined by
where is the number of selected covariates with nonzero regression coefficients and is the volume of observation domain. To implement the adaptive methods (i.e., adaptive lasso and adaptive elastic net), we follow Zou (2006) and define , where
is the estimates obtained from ridge regression and
is a tuning parameter chosen by criterion as described above. Following Choiruddin et al. (2017), we fix for elastic net and its adaptive version, for SCAD, and for MC+. For further discussion regarding the selection of for SCAD and MC+, see e.g. Fan and Li (2001) and Breheny and Huang (2011).4.1 Simulation study
In this section, we investigate the behavior of our estimators in a simulation experiment in different situations when a large number of covariates for fitting spatial point process intensity estimation is involved. We intend to extend the setting considered by Choiruddin et al. (2017). We start with relatively complex situation where strong multicollinearity is present (Scenarios 1a and 2a) and we then consider a more complex setting using real datasets (Scenarios 1b and 2b). We have two different scenarios (Scenarios 1 and 2) for which the number of true covariates as well as their coefficients are different.
The spatial domain we consider is . The true intensity function has the form , where and . We set such that the mean number of points over is equal to . We consider two different scenarios described as follows.

We define the true vector . To define the covariates, we center and scale the pixel images of elevation () and gradient of elevation () contained in the bei datasets of spatstat library in R and use them as two true covariates. In addition, we create two settings to define extra covariates:

First, we generate 48
pixel images of covariates as a standard Gaussian white noise and denote them by
. Second, we transform them, together with and , to have multicollinearity. In particular, we define , where . More precisely, is such that and for , except , to preserve the correlation between and . In this setting, . 
We center and scale the 13 pixel images of soil nutrients obtained from the study in tropical forest of Barro Colorado Island (BCI) in central Panama (see Condit, 1998; Hubbell et al., 1999, 2005) and convert them to be pixel images as and . In addition, we consider the interaction between two soil nutrients such that we have 50 covariates in total. We use 48 covariates (13 soil nutrients and 35 interactions between them) as the extra covariates. Together with and , we keep the structure of the covariance matrix to preserve the complexity of the situation. In this setting, we have .


In this setting, we consider five true covariates out of 50 covariates. In addition of elevation () and gradient of elevation (), we convert pixel images of concentration of Aluminium (), Boron () and Calcium () in the soil to be pixel images as and and set them to be other three true covariates. All five covariates are centered and scaled. We define the true coefficient vector . As in Scenario 1, we make two settings to define 45 extra covariates:

This setting is similar to that of Scenario 1a. We generate 45 pixel images of covariates as standard Gaussian white noise, denote them by , and define , where is such that and for , except for , to preserve the correlation among  . We still define .

Method  Regularized PL  Regularized WPL  Regularized PL  Regularized WPL  

TPR  FPR  PPV  TPR  FPR  PPV  TPR  FPR  PPV  TPR  FPR  PPV  
Scenario 1a  
Lasso  100^{1}  13  28  96  4  62  97  23  20  64  1  76 
Enet  100^{1}  34  12  93  8  48  97  48  10  59  2  58 
AL  100^{1}  1  92  97  0^{1}  96  95  3  68  70  0^{1}  98 
Aenet  100^{1}  2  76  97  1  85  95  6  52  67  0^{1}  95 
SCAD  100^{1}  7  41  97  1  87  96  4  61  56  0^{1}  79 
MC+  100^{1}  8  37  96  1  85  96  5  58  52  1  74 
Scenario 1b  
Lasso  100^{1}  45  10  91  11  52  100^{1}  96  4  20  6  22 
Enet  100^{1}  63  7  87  18  31  100^{1}  98  4  15  6  14 
AL  100^{1}  26  19  95  5  81  99  85  5  26  5  35 
Aenet  100^{1}  30  15  95  6  74  100^{1}  87  5  24  5  30 
SCAD  100^{1}  26  18  93  5  76  100^{1}  76  5  23  4  28 
MC+  100^{1}  26  17  93  5  76  99  76  5  22  5  27 
Scenario 2a  
Lasso  98  93  10  84  73  14  98  96  10  47  35  16 
Enet  99  98  10  85  80  11  99  98  10  46  38  12 
AL  95  49  18  83  35  27  95  64  15  50  23  28 
Aenet  96  52  17  84  40  21  96  68  14  48  26  20 
SCAD  86  74  13  65  45  36  75  60  21  39  26  30 
MC+  87  78  13  65  47  35  73  60  22  39  26  30 
Scenario 2b  
Lasso  80  64  13  75  60  12  78  69  11  64  57  9 
Enet  85  73  12  82  69  11  84  79  11  68  64  8 
AL  56  26  19  54  25  20  59  35  17  48  30  13 
Aenet  59  30  18  57  29  18  64  43  15  52  36  11 
SCAD  43  21  20  42  20  23  46  24  27  41  25  16 
MC+  44  21  20  43  20  23  46  24  26  41  26  16 

Approximate value
With these scenarios, we simulate 2000 spatial point patterns from a Thomas point process using the function in the package. We set the interaction parameter to be and let . Briefly, smaller values of correspond to tighter clusters, and smaller values of correspond to a fewer number of parents (see e.g. Møller and Waagepetersen, 2004, for further details regarding the Thomas point process). For each scenario with different , we fit the intensity to the simulated point pattern realizations.
We report the performances of our estimates in terms of two characteristics: selection and prediction properties. We present the selection properties in Table 3 and the prediction properties in Table 4
Method  Regularized PL  Regularized WPL  Regularized PL  Regularized WPL  

Bias  SD  RMSE  Bias  SD  RMSE  Bias  SD  RMSE  Bias  SD  RMSE  
Scenario 1a  
Lasso  0.19  0.19  0.27  0.43  0.29  0.52  0.29  0.60  0.67  0.94  0.53  1.08 
Enet  0.27  0.22  0.35  0.72  0.32  0.79  0.34  0.66  0.74  1.21  0.40  1.27 
AL  0.05  0.18  0.19  0.14  0.24  0.28  0.19  0.60  0.63  0.57  0.57  0.81 
Aenet  0.07  0.19  0.20  0.20  0.27  0.33  0.22  0.60  0.64  0.69  0.55  0.88 
SCAD  0.19  0.19  0.27  0.29  0.32  0.43  0.14  0.55  0.57  1.10  0.71  1.31 
MC+  0.20  0.19  0.28  0.32  0.37  0.49  0.15  0.55  0.57  1.15  0.72  1.35 
Scenario 1b  
Lasso  0.18  1.03  1.05  0.57  0.58  0.81  1.97  8.00  8.23  1.85  2.11  2.81 
Enet  0.27  1.32  1.34  0.81  0.73  1.09  1.87  7.73  7.96  1.94  2.02  2.80 
AL  0.18  0.73  0.76  0.28  0.43  0.51  1.26  6.23  6.36  1.68  1.70  2.39 
Aenet  0.21  0.72  0.75  0.36  0.44  0.57  1.05  5.45  5.55  1.76  1.49  2.31 
SCAD  0.26  0.99  1.02  0.39  0.63  0.74  1.20  5.55  5.68  1.71  1.59  2.34 
MC+  0.26  0.99  1.03  0.40  0.64  0.76  1.21  5.53  5.66  1.71  1.59  2.33 
Scenario 2a  
Lasso  1.45  1.89  2.38  2.24  2.47  3.34  0.94  8.86  8.91  4.53  5.79  7.35 
Enet  1.54  1.89  2.44  2.38  2.62  3.54  1.27  6.54  6.66  4.95  4.85  6.93 
AL  1.57  1.80  2.39  2.20  2.16  3.09  1.33  6.38  6.52  4.31  4.50  6.23 
Aenet  2.05  1.60  2.59  2.64  2.11  3.38  1.95  4.75  5.13  4.89  3.73  6.14 
SCAD  2.26  1.75  2.86  3.84  2.43  4.54  3.74  3.45  5.09  5.79  2.73  6.40 
MC+  2.45  1.77  3.02  3.95  2.39  4.61  3.81  3.41  5.12  5.82  2.71  6.42 
Scenario 2b  
Lasso  3.28  2.87  4.36  3.36  3.20  4.64  3.85  13.41  13.95  4.61  11.20  12.11 
Enet  3.39  2.45  4.18  3.48  2.75  4.44  3.76  7.86  8.71  4.66  6.96  8.37 
AL  3.64  1.59  3.97  3.69  1.78  4.10  3.89  8.99  9.80  4.70  6.95  8.39 
Aenet  3.71  1.34  3.95  3.79  1.58  4.10  4.03  4.89  6.34  4.88  4.38  6.55 
SCAD  4.56  2.22  5.07  4.67  2.27  5.19  5.22  3.27  6.16  5.65  3.18  6.48 
MC+  4.53  2.24  5.05  4.64  2.29  5.18  5.23  3.25  6.15  5.66  3.21  6.51 
To evaluate the selection properties of the estimates, we consider the true positive rate (TPR), the false positive rate (FPR), and the positive predictive value (PPV). We want to find the methods which have a TPR close to 100 meaning that it can select correctly all the true covariates, a FPR close to 0 showing that it can remove all the extra covariates from the model, and a PPV close to 100 indicating that, for Scenario 1 (resp. Scenario 2), it can keep exactly the two (resp. five) true covariates and remove all the 48 (resp. 45) extra covariates. In general, for both regularized PL and regularized WPL, the best selection properties are obtained from larger which indicates weaker spatial dependence. To compare the regularization methods, we emphasize here that the main difference between regularization methods which satisfy (adaptive lasso, adaptive elastic net, SCAD, and MC+) and which cannot satisfy (lasso, elastic net) our theorems is that the methods which cannot satisfy our theorems tend to overselect covariates, leading to suffering from larger FPR and smaller PPV in general. Among all regularization methods considered in this study, adaptive lasso and adaptive elastic net seem to outperform the other methods in most cases. Although adaptive lasso and adaptive elastic net perform quite similarly, the adaptive lasso is slightly better.
In this simulation study, we are still able to show that even when the strong multicollinearity exists such as in Scenario 1a, our proposed methods work well for the penalization methods satisfying our theorems. However, as probably expected, our methods are getting difficult to distinguish between the important and the noisy covariates as the setting becomes more and more complex. In the experiments we conduct, we find that the regularized PL and WPL (with adaptive lasso) perform quite similar for the easiest (Scenario 1a) and the toughest (Scenario 2b) setting. For Scenarios 1b and 2a, the regularized WPL with adaptive lasso seems to be more favorable. From Table 3, we would recommend in general to combine the regularized WPL with the adaptive lasso to perform variable selection.
Table 4 gives the prediction properties of the estimates (except for
which is excluded) in terms of biases, standard deviations (SD), and square root of mean squared errors (RMSE), some criteria we define by
where and are respectively the empirical mean and variance of the estimates , for .
In general, the properties improve with larger due to weaker spatial dependence. Regarding the regularization methods considered in this study, adaptive lasso and adaptive elastic net perform best. Adaptive elastic net becomes more preferable than adaptive lasso for a clustered process () and for a structured spatial data (Scenarios 1b and 2b). The adaptive elastic net is more efficient than the adaptive lasso especially in the complex situation: large number of covariates, strong multicollinearity, clustered processes, and complex spatial structure due to the advantage of combining and penalties.
By employing regularized WPL, we have potentially more efficient estimates than that of the regularized PL, especially for the more clustered process. However, this does not mean that the regularized WPL is able to improve the RMSE since it usually introduces extra biases. Regularized WPL seems more appropriate for the case having covariates with complex spatial structure (Scenarios 1b and 2b). Otherwise, regularized PL is more favorable. From Table 4, when the focus is on prediction, we would recommend to apply adaptive elastic net as a general advice, and we would combine with regularized WPL if the covariates have complex spatial structure (e.g. Scenarios 1b and 2b) or combine with regularized PL if there is no evidence of complex spatial structure in the covariates (e.g. Scenarios 1a and 2a).
Note that, from Table 3, the adaptive lasso is more preferable if the focus is on variable selection while, from Table 4, the adaptive elastic net is more favorable if the focus is for prediction. To have a more general recommendation, we would recommend applying adaptive elastic net when we are faced with a complex situation: a large number of covariates, strong multicollinearity, clustered processes and complex spatial structure. By combining and penalties, the adaptive elastic net provides a nice balance between selection and prediction properties. This is why in most complex cases (Scenario 2 with ), adaptive elastic net decides to choose more covariates than adaptive lasso (which includes true and noisy covariates) to suffer from slightly less appropriate properties for the selection performance but to be able to improve significantly the prediction properties.
4.2 Application to forestry datasets
We now consider the study of ecology in a tropical rainforest in Barro Corrolado Island (BCI), Panama, described previously in Section 1. In particular, we are interested in studying the spatial distribution of 3,604 locations of Beilschmiedia pendula Lauraceae (BPL) trees by modeling its intensity as a loglinear function of 93 covariates consisting of 2 topological attributes, 13 soil properties, and 78 interactions between two soil nutrients.Regarding the relatively large number of covariates, we apply our proposed methods to select few covariates among them and estimate their coefficients. In particular, we use the regularized Poisson methods with the lasso, adaptive lasso, and adaptive elastic net. Note that we center and scale all the covariates to observe which covariates owing relatively large effect on the intensity.
Method  Regularized PL  Regularized WPL  

#Selected  #Nonselected  #Selected  #Nonselected  
LASSO  77  16  20  73 
AL  50  43  9  84 
AENET  69  24  9  84 
Covariates  Regularized PL  Regularized WPL  

LASSO  AL  AENET  LASSO  AL  AENET  
Elev  0.33  0.37  0.34  0.23  0.14  0.14 
Slope  0.37  0.37  0.37  0.45  0.44  0.46 
Cu  0.45  0.30  0.30  0.16  0.22  0.19 
Mn  0.11  0.10  0.11  0.18  0.14  0.14 
P  0.49  0.45  0.48  0.50  0.43  0.39 
Zn  0.69  0.54  0.70  0.21  0.31  0.25 
Al:P  0.28  0.24  0.28  0.13  0.14  0.13 
Mg:P  0.49  0.26  0.30  0.38  0.38  0.34 
N.Min:pH  0.42  0.39  0.39  0.22  0.17  0.17 
We present in Table 5 the number of selected and nonselected covariates by each method. Out of 93 covariates, more than from the total number of covariates are selected by regularized PL while much fewer covariates are selected by regularized WPL. The regularized PL seems to overfit the model.
Regarding lasso method, 77 covariates are selected by regularized PL method while 20 covariates are selected by regularized WPL. Compared to the two adaptive methods (i.e., adaptive lasso and adaptive elastic net), lasso tends to keep less important covariates. This may explain why lasso cannot satisfy our Theorem 2. In terms of selection properties, adaptive lasso and adaptive elastic net perform similarly when regularized WPL is applied.
Table 6 gives the information regarding nine covariates commonly selected among the six methods. Although the magnitudes of the estimates can be slightly different, the signs all agree with each other.
These results suggest that BPL trees favor to live in the areas of higher elevation and slope with a high concentration of Copper and Manganese in the soil. Furthermore, BPL trees prefer to live in the areas with lower concentration levels of Phosphorus and Zinc in the soil. The interaction between Aluminum and Phosphorus gives a negative association with the appearance of BPL trees while the interaction between Magnesium and Phosphorus and the interaction between Nitrogen mineralization and pH show a positive association with the occurrence of BPL trees. The maps of 3,604 locations of BPL trees, as well as the nine commonly selected covariates, are depicted in Figure 1.
