# Regularized Nonlinear Regression for Simultaneously Selecting and Estimating Key Model Parameters

In system identification, estimating the parameters of a model from limited observations results in poor identifiability. To cope with this issue, we propose a new method to simultaneously select and estimate sensitive parameters as key model parameters and fix the remaining parameters to a set of typical values. Our method is formulated as a nonlinear least squares estimator with L1-regularization on the deviation of parameters from a set of typical values. First, we provide consistency and oracle properties of the proposed estimator as a theoretical foundation. Second, we provide a novel approach based on Levenberg-Marquardt optimization to numerically find the solution to the formulated problem. Third, to show the effectiveness, we present an application identifying a biomechanical parametric model of a head position tracking task for 10 human subjects from limited data. In a simulation study, the variances of estimated parameters are decreased by 96.1% compared to those of the parameters estimated without L1-regularization. In an experimental study, our method improves model interpretation by reducing the number of parameters to be estimated while maintaining the variance accounted for (VAF) above 82.5%. Moreover, the variances of estimated parameters are reduced by 71.1% compared to those of the parameters estimated without L1-regularization. Our method is 54 times faster than the standard simplex-based optimization in solving the regularized nonlinear regression.


## 1 Introduction

In parameter estimation, a model is considered identifiable when a unique set of parameters is specified by the given measurement data. However, when the data are limited, estimating the unknown parameters of a model results in poor identifiability Gutenkunst et al. (2007). In such a case, small changes in the data can result in very different estimated parameters, for example, a rather randomly chosen local minimum out of multiple local minima Lund et al. (2019); Ramadan et al. (2018). The resulting overfitting impairs the model's parsimony and generalizability Pitt and Myung (2002). An overfitted model may yield good results on the training data set used to estimate its parameters, but poor estimates on a new test data set. Moreover, this issue becomes worse when parameters are estimated from data corrupted by random noise Geman et al. (1992).

A biomechanical model often has unknown parameters that must be estimated from limited data due to unavailable internal states and the non-invasive nature of human data collection Little et al. (2010). For a limited observation data set, one way to improve identifiability is to build a parsimonious model by lumping a large number of parameters into a small number of lumped parameters Do et al. (2018a). Such a parsimonious model has better interpretability and provides higher estimation accuracy for arbitrary data Babyak (2004).

As a way to build a parsimonious model, the least absolute shrinkage and selection operator (Lasso) was first introduced in Tibshirani (1996) and then further developed in Zou and Hastie (2005); Tibshirani et al. (2005); Yuan and Lin (2006); Zou (2006). The Lasso is typically used to select sensitive parameters among the parameters of a linear model. A sensitive parameter subset is considered to have a relatively large impact on the output of the model. That is, small changes in the sensitive parameters result in large changes in the model response Tibshirani and Wasserman (1988).

To formally introduce the Lasso, suppose that we have $y_i = f(x_i; \theta_0) + \epsilon_i$ for $i = 1, \dots, n$, where $x_i$ is the input and $y_i$ is the observation. Here $f$ is a function which depends on the true parameter vector $\theta_0 = (\theta_{01}, \dots, \theta_{0p})^T$, and $\epsilon_i$ is independent and identically distributed with $E[\epsilon_i] = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$. Without loss of generality, we assume that the true parameter vector has the first $s$ entries non-zero. That is, $\theta_{0k} \neq 0$ for $k \leq s$ and $\theta_{0k} = 0$ for $k > s$. Finally, when $f$ is linear, we consider the following linear least squares problem with $L_1$-regularization:

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \left[ \| y - X\theta \|_2^2 + n\lambda \sum_{k=1}^{p} |\theta_k| \right], \tag{1}$$

where $X$ is the input matrix and $y$ is the observation vector. In (1), the hyperparameter $\lambda \geq 0$ determines the amount of regularization. In general, the Lasso shrinks more parameters toward 0 as $\lambda$ increases. Moreover, insensitive parameters are shrunk to 0 if $\lambda$ is sufficiently large Zou (2006). The remaining parameters, which are not shrunk to 0, are considered sensitive parameters Ramadan et al. (2018). In this paper, we consider the sensitive parameters as key model parameters. $L_1$-regularization methods and the weighted least squares method were used to select nonlinear autoregressive with exogenous variables (NARX) models Qin et al. (2012). The Lasso was used to remove insensitive parameters when all unknown parameters are required to be non-negative Kump et al. (2012). A modified Lasso was proposed for the nonlinear induction motor identification problem, which deals with a problem similar to our study Rasouli et al. (2012).

However, they do not provide consistency and asymptotic normality results. A modified principal component analysis (PCA) based on the Lasso was proposed for dimension reduction Jolliffe et al. (2003). To select and estimate sensitive parameters of a model, sensitive parameters were selected based on parameter estimation variances predicted by the Fisher information matrix Ramadan et al. (2018, 2019).

In this paper, our objective is to develop a regularized nonlinear parameter estimation method for a model with unknown parameters using a limited data set. Thus, we formulate the model parameter estimation problem as a nonlinear least squares problem with $L_1$-regularization as follows:

$$\hat{\tilde{\theta}} = \operatorname*{argmin}_{\tilde{\theta}} \left[ \| y - f(X; \tilde{\theta}) \|_2^2 + n\lambda \sum_{k=1}^{p} |\tilde{\theta}_k| \right], \tag{2}$$

where $\tilde{\theta} = \theta - \theta_{\mathrm{typ}}$ and $\theta_{\mathrm{typ}}$ is the set of typical values. Note that these values may be obtained as mean values based on preliminary information or estimation. The details of (2) are introduced in Section 3. In our preliminary work, we developed a parameter selection method for system identification with application to head-neck position tracking and reported the model parameter estimates Kyubaek and Jongeun (in press).
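The change of variables in (2) can be made concrete with a small sketch. The model, the typical values, and all names below are invented for illustration (they are not the head-neck model of Section 3); the point is only that the penalty acts on the deviation from the typical values rather than on the parameters themselves:

```python
import numpy as np

# Hypothetical typical values (e.g., means from preliminary estimation).
theta_typical = np.array([1.0, 0.5, 2.0])

def objective(theta, X, y, f, lam):
    """Objective of (2): residual sum of squares plus an L1 penalty on the
    deviation theta_tilde = theta - theta_typical from the typical values."""
    n = len(y)
    theta_tilde = theta - theta_typical
    rss = np.sum((y - f(X, theta)) ** 2)
    return rss + n * lam * np.sum(np.abs(theta_tilde))

# Toy exponential model; only the first parameter deviates from typical.
f = lambda X, th: th[0] * np.exp(-th[1] * X) + th[2]
X = np.linspace(0.0, 5.0, 50)
y = f(X, np.array([1.2, 0.5, 2.0]))
```

At $\theta = \theta_{\mathrm{typ}}$ the penalty vanishes, so an insensitive parameter can sit at its typical value at no cost; a parameter is moved away from its typical value only when doing so reduces the residual by more than the penalty it incurs.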

The contributions of the paper are as follows. First, we consider nonlinear regression with a generalized penalty function that includes the $L_1$-penalty function and provide its consistency and oracle properties (i.e., convergence to the correct sparsity and asymptotic normality) in Section 2. To the best of the authors' knowledge, our work is the first to provide such analysis for nonlinear regression with a generalized penalty function. For example, convergence properties for various penalized linear regressions have been discussed in Johnson et al. (2008); Fan and Li (2001). Note that we do not assume a distribution for the errors, which differs from the assumption of Fan and Li (2001). In Section 3, we then reformulate the regularized nonlinear regression for simultaneously selecting and estimating key model parameters by defining $\tilde{\theta}_k$ in (2) as the deviation of the $k$-th parameter from its nominal value. Next, we improve the optimization algorithm of Rasouli et al. (2012) to numerically solve the nonlinear least squares problem. Finally, to show the effectiveness, we present an application identifying a biomechanical parametric model of a head-neck position tracking task from limited data in simulation and experimental studies. In a simulation study, our algorithm reduces the variance of most parameter estimates as well as the bias. In an experimental study, the variance of the selected sensitive parameters is reduced by 71.1% on average while maintaining the goodness of fit (VAF) above 82.5%. In addition, our method is 54 times faster than the Lasso with brute force optimization, e.g., the standard simplex-based optimization we presented recently Ramadan et al. (2018).

## 2 Consistency and Oracle Properties

We propose a penalized nonlinear regression approach for parameter selection and estimation. First of all, we adopt the following equivalent nonlinear least squares estimator with a generalized penalty function from (2):

$$\hat{\theta}_n = \operatorname*{argmin}_{\theta \in D} \left[ Q_n(\theta) := S_n(\theta) + n \sum_{k=1}^{p} p_{\lambda_n}(|\theta_k|) \right], \tag{3}$$

where $S_n(\theta) = \| y - f(X; \theta) \|_2^2$. The first term in $Q_n(\theta)$ corresponds to nonlinear least squares estimation, and the second term, $n \sum_{k=1}^{p} p_{\lambda_n}(|\theta_k|)$, is the penalty function used for parameter selection. $\lambda_n$ in the penalty function is a nonnegative regularization parameter. Note that we consider a generalized penalty function in (3). The appropriate choice of the penalty function, including $L_1$-regularization, is further investigated in Assumption 2.

Without loss of generality, we assume only a few parameters are non-zero, such that the true parameter vector has the first $s$ entries non-zero. That is, $\theta_{0k} \neq 0$ for $k \leq s$ and $\theta_{0k} = 0$ for $k > s$. For the nonlinear function $f$ in (3), we further consider the following assumptions.

###### Assumption 1

(Nonlinear function)

1. The true parameter $\theta_0$ is in the interior of the bounded parameter set $D$, and $f(x_i; \theta)$ is twice differentiable with respect to $\theta$ near $\theta_0$ for all $i$.

2. Let $F(\theta) = (f(x_1; \theta), \dots, f(x_n; \theta))^T$ and $\dot{F}(\theta) = \partial F(\theta) / \partial \theta^T$. Then, there exists a positive definite matrix $\Gamma$ such that $\frac{1}{n} \dot{F}(\theta_0)^T \dot{F}(\theta_0) \to \Gamma$ as $n \to \infty$.

3. As $n \to \infty$ and $\theta_1 \to \theta_0$,

$$\dot{F}(\theta_1)^T \dot{F}(\theta_1) \left( \dot{F}(\theta_0)^T \dot{F}(\theta_0) \right)^{-1} \to I$$

uniformly, where $I$ is the $p \times p$ identity matrix.

4. There exists a $\delta > 0$ such that, for all $k, s \in \{1, \dots, p\}$,

$$\limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sup_{\|\theta - \theta_0\| \leq \delta} \left( \frac{\partial^2 f(x_i; \theta)}{\partial \theta_k \partial \theta_s} \right)^2 < \infty.$$

For the penalty function in (3), we further consider the following assumptions.

###### Assumption 2

(Penalty function) The first and second derivatives of the penalty function, denoted by $q_{\lambda_n}(\cdot) = p'_{\lambda_n}(\cdot)$ and $q'_{\lambda_n}(\cdot)$, have the following properties:

1. For a nonzero fixed $\theta$,

$$\lim_{n \to \infty} n^{1/2} q_{\lambda_n}(|\theta|) = 0, \qquad \lim_{n \to \infty} q'_{\lambda_n}(|\theta|) = 0.$$

2. For any $M > 0$,

$$\lim_{n \to \infty} n^{1/2} \inf_{|\theta| \leq M n^{-1/2}} q_{\lambda_n}(|\theta|) = \infty.$$
###### Remark 1

Assumption 2 is satisfied by several well-known penalty functions, e.g., SCAD, the adaptive Lasso, and the hard-thresholding penalty, with proper choices of $\lambda_n$. Assumption 2-(1) is satisfied by $L_1$-regularization with a proper choice of $\lambda_n$. The details are discussed in Johnson et al. (2008).
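As a concrete instance, the SCAD penalty of Fan and Li (2001) has the first derivative (for $\theta > 0$ and a tuning constant $a > 2$, commonly $a = 3.7$):

```latex
q_{\lambda}(\theta) = p'_{\lambda}(\theta)
  = \lambda \left\{ I(\theta \le \lambda)
  + \frac{(a\lambda - \theta)_{+}}{(a-1)\lambda}\, I(\theta > \lambda) \right\},
  \qquad a > 2.
```

A sketch of why Assumption 2 holds here: for a fixed $\theta \neq 0$ and $\lambda_n \to 0$, $q_{\lambda_n}(|\theta|) = 0$ for all large $n$, giving 2-(1); for $|\theta| \leq M n^{-1/2}$, $q_{\lambda_n}(|\theta|) = \lambda_n$ eventually, so choosing $\lambda_n$ with $n^{1/2} \lambda_n \to \infty$ gives 2-(2).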

The following lemma shows the existence of a local minimizer of $Q_n(\theta)$ within a distance of order $n^{-1/2}$ from $\theta_0$.

###### Lemma 1

Under Assumptions 1 and 2-(1), for any $\eta > 0$, there exists a positive constant $C$ such that, for large enough $n$,

$$P\left( \inf_{\|v\| = C} Q_n(\theta_0 + n^{-1/2} v) - Q_n(\theta_0) > 0 \right) > 1 - \eta,$$

where $v \in \mathbb{R}^p$.

###### Theorem 1

Under the assumptions in Lemma 1, there exists, with probability tending to 1, a root-$n$-consistent local minimizer $\hat{\theta}$ of $Q_n(\theta)$; that is, $\|\hat{\theta} - \theta_0\| = O_p(n^{-1/2})$.

The next theorem shows the oracle properties (i.e., convergence to the correct sparsity and asymptotic normality) of the estimator on the true support set.

###### Theorem 2

Assume that $\hat{\theta}$ is the root-$n$-consistent local minimizer of $Q_n(\theta)$. If Assumptions 1 and 2 hold, then

• for the set $M_k = \{\omega : \hat{\theta}_k \neq 0\}$ with $k > s$,

$$P(M_k) \to 0;$$

• for $k \leq s$,

$$n^{1/2} \, (2\Gamma_{11}) \left( \hat{\theta}_{11} - \theta_{01} + (2\Gamma_{11})^{-1} b_n \right) \xrightarrow{d} N(0, 4\Gamma_{11}\sigma^2),$$

where $b_n = \left( q_{\lambda_n}(|\theta_{01}|)\,\mathrm{sgn}(\theta_{01}), \dots, q_{\lambda_n}(|\theta_{0s}|)\,\mathrm{sgn}(\theta_{0s}) \right)^T$, $\hat{\theta}_{11}$ and $\theta_{01}$ denote the first $s$ entries of $\hat{\theta}$ and $\theta_0$, and $\Gamma_{11}$ is the first $s \times s$ submatrix of $\Gamma$.

The proofs of Lemma 1 and Theorem 1 are given in Appendices A and B, respectively, and the proof of Theorem 2 is given in Appendix C.

## 3 Application to Head Position Tracking

In this section, we evaluate our approach for solving the nonlinear least squares problem with $L_1$-regularization in simulation and experimental studies of a biomechanical parametric model of a head-neck position tracking task from Ramadan et al. (2018). The reliability of the head-neck position tracking task for quantifying head-neck motor control is demonstrated in Popovich Jr et al. (2015). In contrast to the Levenberg optimization algorithm in Rasouli et al. (2012), our method implements the Levenberg-Marquardt optimization algorithm to numerically solve our nonlinear least squares problem with $L_1$-regularization. The Levenberg-Marquardt optimization uses the diagonal elements of the Hessian matrix approximation to overcome the slow convergence problem when the value of the damping factor is large Moré (1978). The details of our algorithm are shown in Appendix D and Table 3.
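The damped update described above can be sketched as follows. This is a minimal illustration of Marquardt's diagonal scaling, where each step solves $(J^T J + \mu\,\mathrm{diag}(J^T J))\,\delta = -J^T r$; it is not the authors' Appendix D implementation, and the one-parameter toy model is invented for the example:

```python
import numpy as np

def lm_step(residual, jacobian, theta, mu):
    """One Levenberg-Marquardt update. Scaling the damping by diag(J^T J)
    (Marquardt's variant) preserves progress along low-curvature directions
    even when the damping factor mu is large."""
    r = residual(theta)
    J = jacobian(theta)
    A = J.T @ J
    delta = np.linalg.solve(A + mu * np.diag(np.diag(A)), -J.T @ r)
    return theta + delta

# Toy curve fit y = exp(-b x) with unknown b (hypothetical model).
x = np.linspace(0.0, 1.0, 20)
b_true = 1.5
y = np.exp(-b_true * x)
residual = lambda th: np.exp(-th[0] * x) - y
jacobian = lambda th: (-x * np.exp(-th[0] * x)).reshape(-1, 1)

theta = np.array([0.5])
for _ in range(20):
    theta = lm_step(residual, jacobian, theta, mu=1e-3)
```

Starting from `b = 0.5`, the iteration converges to the true value; in practice `mu` is adapted between steps (increased when a step fails to reduce the cost, decreased otherwise).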

Additionally, in order to simultaneously select and estimate key model parameters of the head-neck system, we reformulate the Lasso penalty function as $L_1$-regularization on the deviation of parameters from a set of typical values. In this paper, we adopt the means of the nominal parameter values obtained from preliminary estimation as the set of typical values. Therefore, our method simultaneously selects and estimates only sensitive parameters while fixing insensitive parameters to the means of the nominal parameter values. In order to compare with the method of Ramadan et al. (2018), we set the number of sensitive parameters to 5. In this case, we increase the regularization hyperparameter value until we obtain 5 sensitive parameters. The goodness of fit is quantitatively evaluated by the variance accounted for (VAF), which represents how much of the experimental data can be explained by a fitted regression model. VAF is formally defined as follows.

$$\mathrm{VAF}(\theta)\,(\%) = \left[ 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i(\theta) \right)^2}{\sum_{i=1}^{n} y_i^2} \right] \times 100$$
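The VAF definition above translates directly into code (a straightforward sketch of the formula, with invented function names):

```python
import numpy as np

def vaf(y, y_hat):
    """Variance accounted for (%), per the definition above:
    VAF = [1 - sum((y - y_hat)^2) / sum(y^2)] * 100."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return (1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)) * 100.0
```

A perfect fit gives 100%; larger residuals lower the score, and a model predicting zero everywhere scores 0%.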

### 3.1 Subjects

Ten healthy subjects participated in the experimental study. None had any history of neck pain lasting more than three days or any neurological motor control impairment. Michigan State University's Biomedical and Health Institutional Review Board approved the test protocol. The subjects signed an informed consent form before participating in the experiment Ramadan et al. (2018, 2019).

### 3.2 Parametric model

Fig. 1 shows the block diagram of the head-neck system for position tracking. This is a representative physiological feedback control model Peng et al. (1996); Chen et al. (2002). The model consists of 14 parameters. As shown in Table 1, 2 of the 14 parameters are set to fixed values from Peng et al. (1996). The remaining 12 parameters to be estimated are

$$\theta = \left[ K_{\mathrm{vis}} \;\; K_{\mathrm{vcr}} \;\; K_{\mathrm{ccr}} \;\; \tau \;\; \tau_1 \;\; A \;\; \tau_{\mathrm{CNS1}} \;\; \tau_{\mathrm{CNS2}} \;\; \tau_{\mathrm{MS1}} \;\; \tau_{\mathrm{MS2}} \;\; B \;\; K \right].$$

These 12 parameters have the lower and upper bounds from Ramadan et al. (2018) and are normalized using min-max normalization in order to remove the scale differences between parameters.
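The min-max normalization mentioned above can be sketched as follows (an illustrative implementation with invented function names, using toy bounds rather than those of Ramadan et al. (2018)):

```python
import numpy as np

def minmax_normalize(theta, lower, upper):
    """Map each parameter onto [0, 1] using its lower/upper bound, so that
    scale differences between parameters do not dominate the L1 penalty."""
    return (np.asarray(theta, dtype=float) - lower) / (np.asarray(upper) - np.asarray(lower))

def minmax_denormalize(theta_norm, lower, upper):
    """Inverse map from the normalized space back to physical units."""
    return np.asarray(lower) + np.asarray(theta_norm, dtype=float) * (np.asarray(upper) - np.asarray(lower))
```

The estimator then works entirely in the normalized space, and the estimates are mapped back to physical units for reporting.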

### 3.3 The Experiment

As shown in Fig. 2, each subject wore a helmet fitted with two string potentiometers measuring the axial rotation of the head. Subjects rotated their heads about the vertical axis to track the command signal on the display. The command signal is a pseudorandom sequence of steps with random step durations and amplitudes, with the commanded angle bounded within a fixed range. The output signal is the head rotation angle. Each subject performed three 30-second trials, and the sampling rate was 60 Hz Ramadan et al. (2018).

### 3.4 Simulation study

In this section, we analyze the bias and variance of the estimated parameters in a simulation study, where the true parameters are known for comparison. First, we generate the simulated input and observation vectors $X$ and $y$ from the model with the known true parameters. Additionally, we obtain 20 sets of nominal parameter values from preliminary estimation performed 20 times over all three trials for each subject. Finally, we evaluate our method by setting the mean of the 20 sets of nominal parameter values as the set of typical values. In Fig. 3, as compared to the nonlinear least squares estimator without $L_1$-regularization, the biases of all but one parameter were decreased on average. In addition, the variances of all estimated parameters were decreased by 96.1% on average.

### 3.5 In vivo experimental study

In this section, the variances of the estimated parameters from our method are compared with the results of the standard simplex-based optimization in Ramadan et al. (2018). All parameters are pushed further toward the means of the nominal parameter values obtained from preliminary estimation as the regularization hyperparameter increases. The regularization hyperparameter is increased until 5 sensitive parameters are selected. After selecting a sensitive parameter subset for each subject, we select the most frequent subset among all subjects for a fair comparison with Ramadan et al. (2018). The Lasso in Ramadan et al. (2018) and our method each selected a most frequent subset of 5 sensitive parameters. As a result, 4 out of the 5 sensitive parameters were selected by both methods.

Second, we evaluate our method based on the goodness of fit measured by the variance accounted for (VAF). All values are given as mean ± standard deviation across subjects. The goodness of fit of our method over the 10 subjects (about 82.5% on average) is almost equal to that of the Lasso in Ramadan et al. (2018). Without $L_1$-regularization, the VAF was about 84.9% over all subjects. In Fig. 4, for all subjects, the estimated responses are almost the same as the measured responses, and the estimated responses are smoother than the measured responses.

Third, as shown in Table 2, the variance of the estimated parameters is reduced by 71.4% on average across parameters and subjects as compared to that of the parameters estimated without regularization.

Finally, we compute the average computation time using the "timeit" function in MATLAB (The MathWorks Inc., Natick, MA, U.S.A.). The average computation time of our method for a subject is 54 times shorter than that of the Lasso with the standard simplex-based optimization in Ramadan et al. (2018). In particular, the average computation time of our method across subjects is 5.6 seconds per trial, whereas that of the Lasso in Ramadan et al. (2018) is 302.0 seconds per trial.

## 4 Discussion

We provided consistency and oracle properties (i.e., convergence to the correct sparsity and asymptotic normality) for a nonlinear regression approach with a generalized penalty function. As a result, we proved the existence of a local minimizer and convergence to the sparse unknown parameters for the penalized nonlinear least squares estimator. It is important to note that, for the first time, we have proved convergence properties of the penalized nonlinear least squares estimator, extending previous studies Johnson et al. (2008); Wu (1981); Fan and Li (2001).

In the simulation study, we showed that the bias and variance of the parameters estimated by our method were decreased as compared to those of the parameters estimated without $L_1$-regularization. When we set the means of the nominal parameter values as the typical values for the non-selected estimates, the variance was significantly decreased. In addition, although $L_1$-regularization is known to induce biased estimates, the bias with our method was also slightly decreased, except for one parameter whose bias increased. The reason might be that our method pushes all parameters toward the means of the nominal parameter values obtained from the 20 preliminary estimations. Therefore, if we set appropriate typical values, we can achieve lower bias and variance errors.

In the experimental study, our proposed method was compared with the Lasso in Ramadan et al. (2018) using three performance criteria. First, we confirmed that the proposed method simultaneously selected and estimated sensitive parameters in the nonlinear model. As a result, five parameters were selected and estimated by our method, and the remaining parameters were fixed to the means of the nominal parameters obtained from preliminary estimation. With our method, 4 out of 5 sensitive parameters were the same as those selected in Ramadan et al. (2018). This result showed that our method behaved similarly to the sensitive parameter selection in Ramadan et al. (2018) using the Fisher information matrix. The selected sensitive parameters may vary slightly depending on the performance of the optimizer and the choice of initial points. However, they have not changed much over repeated randomized realizations. In addition, we presented VAF to quantitatively evaluate the goodness of fit of the estimated model. For the standard nonlinear least squares problem without $L_1$-regularization, the VAF was about 84.9%, and our method achieved about 82.5%. Hence, the goodness of fit of the model estimated by our method was similar to that of the model estimated from the standard nonlinear least squares problem without $L_1$-regularization, although only the 5 selected parameters were estimated by our method. Moreover, as shown in Fig. 4, most curve-fitting errors occur around the peak points, which shows a limitation of the presented dynamics models: they do not perfectly reflect the real physiological head-neck control processes. This could be due to the switching of human control strategies with sudden changes in head-neck orientation Chen et al. (2002).

Next, the model identifiability was improved by the sensitive parameter selection of our method. The variance of the 12 estimated parameters is reduced by 71.4% on average. In general, when the number of parameters to be estimated is large relative to the limited data, the model has poor (or a lack of) identifiability. Therefore, through our method with key parameter selection, i.e., the nonlinear least squares problem with $L_1$-regularization on the deviation of parameters from the means of the nominal parameter values, the uniqueness of the estimated solution can be ensured even for an original problem that lacks identifiability due to limited data.

Finally, the average computation time of our method was 54 times shorter than that of the Lasso with the brute force optimization algorithm in Ramadan et al. (2018). Our method reduced the computation time by reformulating the Jacobian-based update as a minimization problem, thereby eliminating large matrix inversions.

## 5 Conclusions

In this paper, we tackled a parameter estimation problem with limited data by formulating it as a nonlinear least squares estimator with $L_1$-regularization.

We effectively improved the model identifiability by applying the Lasso to the nonlinear least squares problem. As asymptotic results, we provided consistency and oracle properties for a nonlinear regression approach with a generalized penalty function. Based on these results, we proposed a novel solution to our problem by solving the nonlinear least squares problem with $L_1$-regularization on the deviation of parameters from the nominal values in order to simultaneously select and estimate model parameters.

From the simulation and experimental studies, we successfully demonstrated that the proposed method simultaneously selected and estimated sensitive parameters, improved the model identifiability by reducing the variance of estimated parameters, and took a much shorter computation time than the Lasso in Ramadan et al. (2018).

Future work would be to apply our method to other clinical patient-specific calibration of the disease models Do et al. (2018b); Zhang et al. (2019).

###### Acknowledgements.
This work was supported by the Mid-career Research Programs through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2018R1A2B6008063, 2019R1A2C1002213). This publication was made possible by grant number U19AT006057 from the National Center for Complementary and Integrative Health (NCCIH) at the National Institutes of Health. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NCCIH. The research of Wei-Ying Wu was supported by Ministry of Science and Technology of Taiwan under grants (MOST 107-2118-M-259-001-).

## Conflict of interest

The authors declare that they have no conflict of interest.

## Appendix A Proof of Lemma 1

We first find a lower bound of $Q_n(\theta_0 + n^{-1/2} v) - Q_n(\theta_0)$.

$$\begin{aligned}
Q_n(\theta_0 + n^{-1/2} v) - Q_n(\theta_0) &= S_n(\theta_0 + n^{-1/2} v) - S_n(\theta_0) + n \left( \sum_{k=1}^{p} p_{\lambda_n}(|\theta_{0k} + n^{-1/2} v_k|) - \sum_{k=1}^{p} p_{\lambda_n}(|\theta_{0k}|) \right) \\
&= n^{-1/2} (\nabla S_n(\theta_0))^T v + \frac{1}{2} n^{-1} v^T \nabla^2 S_n(\theta^*) v \\
&\quad + n \left( \sum_{k=1}^{s} p_{\lambda_n}(|\theta_{0k} + n^{-1/2} v_k|) - \sum_{k=1}^{s} p_{\lambda_n}(|\theta_{0k}|) \right) + n \sum_{k=s+1}^{p} p_{\lambda_n}(|\theta_{0k} + n^{-1/2} v_k|),
\end{aligned}$$

where $\theta^* = (\theta^*_1, \theta^*_2, \dots, \theta^*_p)$ lies between $\theta_0 + n^{-1/2} v$ and $\theta_0$,

$$\begin{aligned}
&= n^{-1/2} (\nabla S_n(\theta_0))^T v + \frac{1}{2} n^{-1} v^T \nabla^2 S_n(\theta^*) v + n \sum_{k=1}^{s} q_{\lambda_n}(|\theta^*_{0k}|)\, \mathrm{sgn}(\theta^*_{0k})\, n^{-1/2} v_k + n \sum_{k=s+1}^{p} p_{\lambda_n}(|\theta_{0k} + n^{-1/2} v_k|) \\
&\geq n^{-1/2} (\nabla S_n(\theta_0))^T v + \frac{1}{2} n^{-1} v^T \nabla^2 S_n(\theta^*) v + n \sum_{k=1}^{s} q_{\lambda_n}(|\theta^*_{0k}|)\, \mathrm{sgn}(\theta^*_{0k})\, n^{-1/2} v_k =: A + B + C.
\end{aligned}$$

For the sake of simplicity, we write $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$ and $F_0 = \frac{1}{n} \dot{F}(\theta_0)^T \dot{F}(\theta_0)$. For the term $A$,

$$n^{-1/2} (\nabla S_n(\theta_0))^T v = -2 n^{-1/2} v^T \dot{F}(\theta_0)^T \epsilon = -2 v^T \left[ F_0^{1/2} \left( \dot{F}(\theta_0)^T \dot{F}(\theta_0) \right)^{-1/2} \dot{F}(\theta_0)^T \epsilon \right].$$

By Assumption 1, $F_0 \to \Gamma$, and we claim that

$$\left( \dot{F}(\theta_0)^T \dot{F}(\theta_0) \right)^{-1/2} \dot{F}(\theta_0)^T \epsilon \xrightarrow{d} N(0, \sigma^2 I).$$

The claim follows from Lemma 2.1 in Huber (1973). The condition for that lemma in our setting is

$$\left\| \dot{F}(\theta_0) \left( \dot{F}(\theta_0)^T \dot{F}(\theta_0) \right)^{-1/2} \right\|_{\infty} \to 0,$$

where $\|M\|_{\infty}$ denotes the maximum absolute value of all elements of a matrix $M$. Since $\dot{F}(\theta_0)$ is an $n \times p$ matrix and $(\dot{F}(\theta_0)^T \dot{F}(\theta_0))^{-1/2}$ is a $p \times p$ matrix,

$$\left\| \dot{F}(\theta_0) \left( \dot{F}(\theta_0)^T \dot{F}(\theta_0) \right)^{-1/2} \right\|_{\infty} \leq p \left\| \left( \dot{F}(\theta_0)^T \dot{F}(\theta_0) \right)^{-1/2} \right\|_{\infty} \left\| \dot{F}(\theta_0) \right\|_{\infty} = p\, n^{-1/2} \left\| F_0^{-1/2} \right\|_{\infty} \left\| \dot{F}(\theta_0) \right\|_{\infty} = O\!\left( \frac{p}{\sqrt{n}} \right).$$

The first and second conditions of Assumption 1 imply the last equality. Therefore, we have

$$-2 v^T \left[ F_0^{1/2} \left( \dot{F}(\theta_0)^T \dot{F}(\theta_0) \right)^{-1/2} \dot{F}(\theta_0)^T \epsilon \right] \xrightarrow{d} N(0, 4\sigma^2 v^T \Gamma v). \tag{1}$$

For the term $B$,

$$\frac{1}{2} n^{-1} v^T \nabla^2 S_n(\theta^*) v = n^{-1} v^T \left\{ \dot{F}(\theta^*)^T \dot{F}(\theta^*) + \ddot{F}(\theta^*)^T \left( I \otimes (d - \epsilon) \right) \right\} v = v^T \left[ \frac{\dot{F}(\theta_0)^T \dot{F}(\theta_0)}{n}\, Z_n^{-1} \right] v,$$

where $Z_n = \left\{ \dot{F}(\theta^*)^T \dot{F}(\theta^*) + \ddot{F}(\theta^*)^T \left( I \otimes (d - \epsilon) \right) \right\}^{-1} n F_0$.

If we show $Z_n^{-1} \xrightarrow{p} I$, then with Assumption 1 we obtain

$$A + B \xrightarrow{d} N(v^T \Gamma v, 4\sigma^2 v^T \Gamma v). \tag{2}$$

$Z_n^{-1}$ can be rewritten as

$$Z_n^{-1} = (n F_0)^{-1} \left\{ \dot{F}(\theta^*)^T \dot{F}(\theta^*) + \ddot{F}(\theta^*)^T (I \otimes (d - \epsilon)) \right\} = (n F_0)^{-1} \dot{F}(\theta^*)^T \dot{F}(\theta^*) + (n F_0)^{-1} \ddot{F}(\theta^*)^T (I \otimes d) - (n F_0)^{-1} \ddot{F}(\theta^*)^T (I \otimes \epsilon).$$

By Assumption 1, the first term converges to $I$ almost surely. By conditions 1, 2, and 4 of Assumption 1 and the Cauchy-Schwarz inequality, the second term converges to zero almost surely. For the last term, it is enough to show that

$$\frac{1}{n} f_{ks}(\theta)^T \epsilon \xrightarrow{p} 0 \tag{3}$$

uniformly on $\{\theta : \|\theta - \theta_0\| \leq \delta\}$ with probability 1, because of conditions 1 and 2 of Assumption 1, where $f_{ks}(\theta)$ denotes the vector of second partial derivatives $\partial^2 f(x_i; \theta) / \partial \theta_k \partial \theta_s$, $i = 1, \dots, n$. Then, by condition 4 of Assumption 1, (3) can be shown in a manner similar to that of Wu (1981). For the term $C$, by Assumption 2, for a fixed $\theta_{0k} \neq 0$,

$$n^{1/2} \sum_{k=1}^{s} q_{\lambda_n}(|\theta^*_{0k}|)\, \mathrm{sgn}(\theta^*_{0k})\, v_k \longrightarrow 0. \tag{4}$$

Thus, combining (2) with (4), we have

$$A + B + C \xrightarrow{d} N(v^T \Gamma v, 4\sigma^2 v^T \Gamma v),$$

which, since $v^T \Gamma v > 0$ by the positive definiteness of $\Gamma$, leads to the desired result for large enough $n$.

## Appendix B Proof of Theorem 1

By Lemma 1 and the continuity of , we obtain Theorem 1.

## Appendix C Proof of Theorem 2

Proof of (i).
First, we split $M_k$ into two sets:

$$M_k = \left\{ \omega : \hat{\theta}_k \neq 0,\; |\hat{\theta}_k| \geq C n^{-1/2} \right\} \cup \left\{ \omega : \hat{\theta}_k \neq 0,\; |\hat{\theta}_k| < C n^{-1/2} \right\} =: M_{k1} \cup M_{k2}.$$

Then, it is enough to show that, for any $\epsilon > 0$, $P(M_{k1}) < \epsilon$ and $P(M_{k2}) < \epsilon$. For any $\epsilon > 0$, we can show $P(M_{k1}) < \epsilon$ for large enough $n$ because of the root-$n$-consistency.

To verify $P(M_{k2}) < \epsilon$ for large enough $n$, we first show $\| n^{-1/2} \nabla S_n(\theta) \| = O_p(1)$ on the set $\{\theta : \|\theta - \theta_0\| \leq C n^{-1/2}\}$. Note that

$$n^{-1/2} \nabla S_n(\theta) - n^{-1/2} \nabla S_n(\theta_0) = n^{-1/2} \nabla^2 S_n(\theta^{**}) (\theta - \theta_0) = n^{1/2}\, \frac{\dot{F}(\theta^{**})^T \dot{F}(\theta^{**})}{n}\, Z_n^{-1} (\theta - \theta_0) = O_p(1),$$

where $\theta^{**}$ lies between $\theta$ and $\theta_0$. Since $\|\theta - \theta_0\| \leq C n^{-1/2}$, we have $n^{1/2} (\theta - \theta_0) = O(1)$, and $Z_n^{-1} \xrightarrow{p} I$ as in Appendix A. Thus, we have

$$\sup_{\|\theta - \theta_0\| \leq C n^{-1/2}} \left\| n^{-1/2} \nabla S_n(\theta) - n^{-1/2} \nabla S_n(\theta_0) \right\| = O_p(1). \tag{1}$$

Combining (1) with $n^{-1/2} \nabla S_n(\theta_0) = O_p(1)$, we have

$$\left\| n^{-1/2} \nabla S_n(\theta) \right\| = O_p(1) \tag{2}$$

for $\theta$ satisfying $\|\theta - \theta_0\| \leq C n^{-1/2}$. Since $\hat{\theta}$ is the root-$n$-consistent local minimizer of $Q_n(\theta)$, we attain

$$n^{1/2} q_{\lambda_n}(|\hat{\theta}_k|) = O_p(1) \tag{3}$$

from

$$n^{-1/2} \left. \frac{\partial Q_n(\theta)}{\partial \theta_k} \right|_{\theta = \hat{\theta}} = n^{-1/2} \left. \frac{\partial S_n(\theta)}{\partial \theta_k} \right|_{\theta = \hat{\theta}} + n^{1/2} q_{\lambda_n}(|\hat{\theta}_k|)\, \mathrm{sgn}(\hat{\theta}_k).$$

Therefore, by (3), there exists an $M'$ such that, for large enough $n$,

$$P\left\{ \omega : \hat{\theta}_k \neq 0,\; n^{1/2} q_{\lambda_n}(|\hat{\theta}_k|) > M' \right\} < \epsilon / 2.$$

In addition, by the second part of Assumption 2,

$$\left\{ \omega : \hat{\theta}_k \neq 0,\; |\hat{\theta}_k| < C n^{-1/2} \right\} \subset \left\{ \omega : \hat{\theta}_k \neq 0,\; n^{1/2} q_{\lambda_n}(|\hat{\theta}_k|) > M' \right\}$$

for large enough $n$, which leads to $P(M_{k2}) < \epsilon / 2$ for large enough $n$.

Proof of (ii).
By the Taylor expansion,

$$n^{-1/2} \nabla Q_n(\hat{\theta}) = n^{-1/2} \nabla S_n(\hat{\theta}) + n^{1/2} q_{\lambda_n}(\hat{\theta}) \cdot \mathrm{sgn}(\hat{\theta}) = n^{-1/2} \left( \nabla S_n(\theta_0) + \nabla^2 S_n(\theta^{**}) (\hat{\theta} - \theta_0) \right) + n^{1/2} q_{\lambda_n}(\hat{\theta}) \cdot \mathrm{sgn}(\hat{\theta}),$$

where $\theta^{**}$ lies between $\hat{\theta}$ and $\theta_0$. Since $\hat{\theta}$ is the local minimizer of $Q_n(\theta)$, $\nabla Q_n(\hat{\theta}) = 0$, so that

$$n^{-1/2} \left( -\nabla S_n(\theta_0) \right) = n^{-1} \nabla^2 S_n(\theta^{**}) \left( n^{1/2} (\hat{\theta} - \theta_0) \right) + n^{1/2} q_{\lambda_n}(\hat{\theta}) \cdot \mathrm{sgn}(\hat{\theta}).$$

Finally,

$$n^{1/2}\, (2\Gamma_{11}) \left( \hat{\theta}_{11} - \theta_{01} + (2\Gamma_{11})^{-1} b_n \right) \xrightarrow{d} N(0, 4\Gamma_{11} \sigma^2),$$

because $n^{-1/2} \nabla S_n(\theta_0) \xrightarrow{d} N(0, 4\Gamma \sigma^2)$, $n^{-1} \nabla^2 S_n(\theta^{**}) \xrightarrow{p} 2\Gamma$, and by the consistency of $\hat{\theta}$.

## Appendix D Nonlinear least squares estimator with L1-regularization.

In order to apply the Lasso to the nonlinear least squares problem, we reformulate each iteration of the Levenberg-Marquardt (LM) optimization algorithm as a linear least squares problem as follows.

 ^~θj+1 =min~θ∥Λ~θj+1−Λ~θj+JT~r(~θ