Derivative-Based Optimization with a Non-Smooth Simulated Criterion

08/08/2017 ∙ by David T. Frazier, et al. ∙ Monash University 0

Indirect inference requires simulating realizations of endogenous variables from the model under study. When the endogenous variables are discontinuous functions of the model parameters, the resulting indirect inference criterion function is discontinuous and does not permit the use of derivative-based optimization routines. Using a specific class of measure changes, we propose a novel simulation algorithm that alleviates the underlying discontinuities inherent in the indirect inference criterion function, permitting the application of derivative-based optimization routines to estimate the unknown model parameters. Unlike competing approaches, this approach does not rely on kernel smoothing or bandwidth parameters. Several Monte Carlo examples that have featured in the literature on indirect inference with discontinuous outcomes illustrate the approach. These examples demonstrate that this new method gives superior performance over existing alternatives in terms of bias, variance and coverage.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Simulation-based estimation methods, such as the method of simulated moments (

McFadden, 1989, Duffie and Singleton, 1993) and indirect inference (Smith, 1993, Gourieroux et al., 1993, Gallant and Tauchen, 1996), are widely used inference procedures that are applicable to any model where simulation of data is possible. These methods are particularly useful in settings where the underlying structural model is too difficult for maximum likelihood estimation, but where simulation from the model is straightforward.

Given a fixed value of the unknown model parameters, simulation-based methods require the user to simulate synthetic realisations of endogenous variables, often referred to as simulated outcomes, from the underlying structural model. Once these simulated realisations have been generated, statistics based on the simulated data are calculated and then compared against statistics based on the observed data. Estimators for the unknown model parameters are then obtained by minimising a well-defined distance between the simulated summary statistics and their observed counterparts.

However, in many interesting cases, the simulated outcomes from the structural model of interest are discontinuous transformations of the underlying model parameters, i.e., small changes in the parameter values can lead to substantial changes in the simulated data. Hence, the resulting sample criterion functions used in estimation will be discontinuous functions of the model parameters, even though the corresponding limit of the sample criterion function will often be differentiable in the model parameters. Indeed, the simulation of discontinuous outcomes, and the resulting discontinuity of the sample criterion function, is a relatively common occurrence in the indirect inference (hereafter, II) literature. Notable examples where II has been applied in the context of discontinuous simulated outcomes include the following: dynamic labor market models that are subject to the so-called “initial-conditions” problem (An and Liu, 2000

); switching-type models, such as autoregressive models with exponential marginal distributions (

Di Iorio and Calzolari, 2006); structural models of human capital accumulation with learning (Nagypal, 2007); first-price auction models with heterogeneous bidders (Li and Zhang, 2015); certain dynamic sample selection models (Altonji et al., 2013); and the application of II to dynamic discrete choice models (see, e.g., Bruins et al., 2018 for a discussion).

Two potential solutions that can partially circumvent the difficulties encountered in II estimation with a discontinuous criterion function are the use of numerical derivatives within an optimisation scheme, and the use of derivative-free optimisation approaches. A simple solution to this issue would be to apply finite-differencing derivative estimates within a Newton-Raphson or quasi-Newton algorithm, even though the criterion function may be discontinuous in finite samples. The logic behind such an approach is often based on the following notion: if we could construct a criterion function using an infinite number of simulated samples, the resulting criterion function would be smooth enough to permit the use of numerical derivatives. While producing an infinite number of simulation is infeasible, if one takes the number of simulations used in II to be very large, which significantly increases the required computational effort, this would effectively smooth the discontinuous criterion function and give a basis for the use of such numerical derivatives within parameter estimation. Gottard and Calzolari (2017) provide simulation results supporting this approach, when the number of simulations is larger than the sample size. However, computational issues aside, even with a large simulation size, estimating these derivatives using finite-differencing requires specifying a tuning parameter that imparts bias on the resulting estimates. In practice, the tuning parameter induces a trade-off between bias and variance and may have a substantial effect on the estimators in finite samples, especially when the underlying model is discontinuous (see, e.g., Glynn, 1989; Andrieu et al., 2011; Detemple et al., 2005).

An alternative approach to finding II estimators in this setting is to use derivative-free methods, such as simplex-based algorithms or genetic algorithms. These methods can be quite useful when the dimension of the model parameters is relatively small, however, such methods often encounter difficulties when the dimension of the parameters is large (see

Bruins et al., 2018, for a discussion of this issue).

Recently, building on the initial work of Keane and Smith (2003) and Di Iorio and Calzolari (2006), Bruins et al. (2018) have proposed a generalized II (GII) approach to alleviate the issue of discontinuous, as functions of the model parameters, simulated outcomes. The GII approach replaces the discontinuous simulated outcomes by a kernel-smoothed version that depends on a bandwidth parameter. For a positive value of the bandwidth parameter, GII allows the use of derivative-based optimisation routines to estimate the unknown model parameters. Furthermore, under certain regularity conditions, including that the bandwidth parameter shrinks to zero fast enough, the GII approach produces consistent and asymptotically normal estimates of the model parameters. However, the application of GII can encounter certain difficulties in practice: the GII approach relies on a somewhat arbitrary choice of kernel function and bandwidth parameter; for any fixed sample size, the use of artificially smoothed simulated outcomes imparts a non-negligible bias on the resulting parameter estimates; as is generally true with kernel smoothing methods, the choice of the bandwidth parameter is crucial for obtaining reliable performance.

The goal of this paper is to propose a novel simulation algorithm that yields a differentiable sample II criterion function in situations where data simulated under the structural model is discontinuous. Unlike the aforementioned GII approach, this new approach does not rely on any smoothing approaches, nor does it require the user to select a bandwidth parameter before the procedure can be implemented.

The key to this new II approach is a local change of variables (COV) technique that draws inspiration from the Hessian optimal partial proxy (HOPP) method of Joshi and Zhu (2016)

. The HOPP method is a COV technique that allows the construction of unbiased estimators, up to a third-order term, for the derivatives of certain expectations that are of keen interest in financial mathematics, such as the so-called “Greeks” that are associated with option pricing. Additional COV strategies for calculating derivatives of similar expectations can be found in

Fu (1994), Lyuu and Teng (2011), Chan and Joshi (2011) and Peng et al. (2018), with these ideas first introduced by Glynn (1987) in the study of discrete-event systems.444We refer the interested reader to Fu (2006) for an overview of these methods.

Unlike the problems to which the HOPP method is applied, which focuses on estimating derivatives of an expectation at a point, in II we are interested in obtaining uniformly, over the parameter space, consistent estimates for the derivatives of a simulated sample criterion function. We propose a modification of the HOPP approach that can be applied to II estimation and demonstrate that this new approach results in II criterion functions that are continuously differentiable in the model parameters. As a result, this new procedure permits the use of derivative-based optimisation routines to estimate the unknown model parameters, even though the original simulated outcomes are discontinuous. Critically, derivatives calculated from II criterion functions that use this technique are uniformly, over the parameter space, consistent estimators of their corresponding limit counterparts.

The approach considered herein amounts to a direct approximation of the II criterion function in a neighborhood of the point where the original criterion function is discontinuous. As such, our approach is a form of “generalized indirect inference”. However, unlike the GII approach of Bruins et al. (2018), which relies on a global kernel smoothing approximation, the GII approach proposed herein relies on a local approximation. To differentiate these two GII approaches, hereafter we refer to our approach as change of variable generalized indirect inference (GII-COV), while the kernel-based approach of Bruins et al. (2018) is referred to as kernel generalized indirect inference (GII-K).

We demonstrate that our GII-COV approach yields consistent estimators for the derivatives of the simulated moments used within II estimation. As a result, GII-COV allows the consistent application of derivative-based optimisation routines to produce computationally efficient II parameter estimates, even though the model under study produces discontinuous simulated outcomes. A direct result of the GII-COV approach is a criterion function that is twice-continuously differentiable, uniformly in the parameters, which ensures that estimators obtained from this approach will have standard asymptotic properties, under fairly weak regularity conditions.

While numerical differentiation is the most common tool for calculating derivatives in econometrics, the computational tool we use for derivative calculation in this paper is automatic differentiation. This technique is common in computer science and financial mathematics (Glasserman, 2003) and generally leads to faster derivative calculations than finite-differencing techniques, especially in a high-dimensional settings. Automatic differentiation is a numerical procedure for estimating derivatives and can be viewed as a type of optimal finite-differencing derivative estimator; in particular, numerical derivatives calculated via automatic differentiation do not exhibit the bias associated with finite-differencing derivative estimates. Instead, derivatives calculated via automatic differentiation produce exact numerical derivatives of the function under consideration, up to floating point errors. Therefore, the use of automatic differentiation will not only speed up the execution of a derivative-based optimisation algorithm but also produces results that are free from the bias inherent in finite-differencing derivative estimates.

The remainder of the paper is organised as follows. Section 2 supplies the general set-up and notation for the structural model and briefly reviews II estimation procedures. Section 3 proposes a change of variables technique that we use within our GII approach, and demonstrates that this approach to II permits the consistent application of derivative-based optimisation routines to estimate the unknown model parameters. Illustrative examples showcase the precise implementation details regarding this approach to II. Section 4 discusses the asymptotic properties of this approach. In Section 5, we apply our GII approach to several dynamic discrete choice models that have featured in the literature on II with discontinuous outcomes, and compare the resulting parameter estimates against the GII approach of Bruins et al. (2018) and two popular derivative-free methods. The results demonstrate that our approach compares favourably to existing approaches. Section 6 concludes. All proofs and tables are relegated to the appendix.

2 Model, Examples and Standard Indirect Inference

In this section we first present the model setup and describe the standard indirect inference (hereafter, II) approach. In addition, we briefly examine the application of II in several economic examples.

In the remainder of the paper, we use the following notation. Consider a vector and a matrix . We use to denote the Euclidean norm, use to denote the operator norm (i.e., and define for a matrix . Let be a vector function consisting of differentiable scalar functions. For , we denote by the derivative of with respect to the -th component of for , and the gradient of with respect to is denoted by . The gradient of the vector function is given by the matrix . For , define the -neighborhood of the point as . Also, let denote the indicator function on the set .

2.1 Models and Examples

Assume the researcher wishes to conduct inference on unknown parameters , where denotes the parameter space of , with its size, that govern the behavior of an endogenous variable , whose support is . Conditional on an exogenous variable , with support , and an unobservable state variable , with support , the endogenous variable evolves according to the following (causal) structural model:



is an error term that is independent and identically distributed (iid) according to the known cumulative distribution functions

, with corresponding density function , and whose support is . The exogenous variables are independent of , and the functions and are known up to the unknown parameters .

We are interested in cases where the function is discontinuous in the state variable , and where, due to the complexity of the structural model, likelihood-based inference for is infeasible or prohibitively difficult. This model setup in (1)-(2) is fairly common in econometrics, and covers a wide variety of models. For illustration purposes, we provide four classes of examples considered under our framework.

Example 1 (Binary Choice Models with Serially Dependent Errors).

Suppose that we observe a panel of realisations with cross-sectional unit and time period , generated from a binary choice model with autoregressive (AR) errors:

where the variable is a vector of exogenous variables with support , the unobserved variable is generated by an AR(1) model with unobserved iid innovation , which follows the known distribution . Here, the state variable is and the structural parameters are .

Example 2 (Ordered Probit Model with Individual Effects).


be a categorical variable taking values in

for individual at time . Given an observed, non-constant, vector , we assume that according to the model

Here, are unknown threshold parameters and is a state variable, which can be interpreted as the individuals latent utility, given by


is an iid, time-invariant and individual-specific unobservable random variable following

, while is an unobservable iid innovation following , and we define . The structural parameters are .

Example 3 (Switching-type Models).


be iid exponentially distributed with unity intensity parameter and let

be iid uniform , with and also independent. The exponential autoregressive process evolves according to

where , , ; see, e.g., Iorio and Calzolari (2006) for II estimation of this model. The state variable is iid and the structural parameters are .

Example 4 (G/G/1 Queue).

Let denote the inter-departure time for the -th customer. Let be the corresponding inter-arrival time and the service time, with independent of . Let , and assume that we know and , with both and known up to the unknown parameters structural parameters . The inter-departure time process evolves according to, for ,

In this example, the state variable is given by and specific parametric assumptions on and can be considered for the purposes of II estimation; see, e.g., Heggland and Frigessi (2004) for a discussion of II estimation in queuing models.

In what follows, to simplify discussion and notations, when discussing general quantities we will only consider a “cross-sectional” sample from the structural model in equations (1)-(2). However, we note that in cases where the observed data has a panel structure, with fixed time dimension , as in Examples 1 and 2, this cross-sectional sample can always be obtained by redefining the observed variables. However, to avoid notational clutter, we will avoid such a scheme and instead focus only on cross-sectional setting.

2.2 Standard Indirect Inference

The focus of II is to conduct estimation and inference on the true parameters of the structural model, denoted throughout by , when the model has a complex parametric structure. Even when the structural model is complex, it is often easy to simulate data from the structural model given parameter values.

The first step of II is to estimate an intermediate or auxiliary model, using the observed data and simulated data, separately. II estimates of the structural parameters are then obtained by minimising a well-chosen distance between the two sets of estimated auxiliary parameters.

To formalise the above, we denote the simulated unobservables by , for , where is the number of simulations for each observation, and where each is generated iid from the known distribution . We can then construct simulated outcomes for any according to


To form an auxiliary model for the dependent variable , let be a vector of covariates that consists of observed variables with support

. The auxiliary model is a tractable parametric model that attempts to capture the relationship between

and . Let be the parameter space for the auxiliary parameters with . We consider that the auxiliary model implies moment conditions characterized by some known moment function , which satisfies, for some ,


We denote by and estimators of the auxiliary parameter based on the observed dataset and the -th dataset simulated from the structural model, respectively. The auxiliary parameter estimates and are respectively given as the solution to the sample and simulated counterpart of equation (4): for and ,

The most common II methods to estimate correspond to the so-called “trinity” of classical hypothesis tests: Wald, Lagrange multiplier (LM) and likelihood ratio (LR). The LM and Wald approaches to II estimation allow computationally simple and efficient estimation of , in the class of II estimators. In contrast, the LR approach to II does not in general deliver efficient estimators (Gourieroux et al., 1993). Thus, we focus on the LM and Wald approaches for II, respectively denoted by LM-II and W-II in what follows.

The LM-II approach is based on the simulated auxiliary moments , given by

For LM-II, the criterion function is a quadratic form in the simulated auxiliary moments, evaluated at :

where is a sequence of positive-definite weighting matrices. The W-II estimator is calculated using the (normed) difference between and some version of . A common version of the W-II estimator is to define the average simulated auxiliary estimator and then minimize the criterion function , where

The LM-II and W-II estimators of are then defined as the minimiser of their corresponding criterion functions.

When is discontinuous in , as is the case in the structural model (3), the resulting II criterion function is discontinuous in and derivative-based optimisation procedures cannot necessarily be trusted to deliver accurate estimates of . We further examine this discontinuity in the confines of Example 1.

Example 1 (cont.).

Let be simulated uniform random variables that are used to generate simulated outcomes, for . For fixed , the simulated unobservable term is constructed recursively as , with , and the simulated state variable is given by . Simulated outcomes are then generated via


For this model, Li (2010) and Bruins et al. (2018)

suggest the linear probability model as an auxiliary model for II:

where and is an error term. We set as the moment function, and the auxiliary parameter estimates are given by

for . Given a parameter value and simulated samples , one can then construct the criterion function for the LM-II or the W-II approach. However, because the map is discontinuous, derivatives of and need not exist.

In each of the examples treated in the previous subsection, II estimation is based on the discontinuous mapping . As a result, derivative-based optimisation procedures may not deliver accurate estimates for the unknown model parameters. To solve this issue, we propose to alter the standard II simulation approach by introducing a (sequence of) change of variables that will alleviate the discontinuity in the map . This alternative simulation approach will deliver II criterion functions that allow the application of derivative-based optimisation routines that can consistently estimate the unknown model parameters, even though the original model is non-smooth in the parameters.

3 A New Generalized Indirect Inference Approach

Let denote a point at which we wish to evaluate the LM-II or W-II criterion function. According to equation (3), is discontinuous and the derivatives and need not exist. The logic behind our approach is the observation that II is not required to simulate outcomes that perfectly represent the actual data generating process, so long as the difference between the two vanishes as the sample size increases.

Define the population criterion functions for the LM and Wald approaches as

where , is the solution to and is some positive definite matrix, and where , , is assumed to be twice continuously differentiable. In a nutshell, our approach is to construct a “generalized” II criterion, say or for , satisfying: for ,

In the following section, we demonstrate how to construct such a criterion using a “change of variables” (hereafter, COV) technique. This COV will ensure that a small change in the parameter, say from to , for small , will not drastically alter the simulated data, thus alleviating the discontinuity.

3.1 Assumptions

To present the COV technique used in this paper, we first introduce assumptions on the structural model in equations (1)-(2) and the auxiliary model. The assumptions presented below are not restrictive and, with minor modifications, cover all of the examples referenced in this paper.

Assumption 1.
  • The observed random variables are iid across cross-sectional unit , and is independent of the innovation .

  • The innovation is iid and with known continuously differentiable density, .

  • The function is twice continuously differentiable in and for any .

  • The parameter spaces and are compact.

Assumption 2.
  • Assumption 1 holds for the simulated process for all .

  • is continuous at each , for any .

  • For all , there exists a random variable such that with for all and for all .

Assumption 3.
  • For any and , there exist a finite integer and a collection of random functions with such that the function in (1) can be written as

    Here are known constants and the random variable follows the standard uniform. Also, the random function are twice-continuously differentiable and satisfy that , , and .

  • For each , there exists a random variable so that and for any .

The above assumptions are fairly weak, and in the supplemental appendix, we demonstrate that, up to minor modifications, Assumptions 1-3 are satisfied for each of the examples presented in Section 2.1. The specific interpretation of the assumptions is as follows. Assumption 1 requires that the discontinuities in arise only from and ensures that we can present the COV in the most general context. The Assumption of iid data in 1(a) can be extended to independent non-identically distributed (inid) data, or weakly dependent data, at the cost of further notation and more involved technical arguments. In particular, under these more general assumption, only the structure of the random functions that partition the support of will change; we refer the interested reader to the supplementary appendix for an example with weakly dependent data. Throughout the remainder, we will refer to as critical point functions. Assumption 2 imposes regularity conditions on the simulated moment function. The regularity conditions in Assumption 2 are standard in the literature on II with discontinuous outcomes. Assumption 3 restricts the analysis to univariate , however, the extension to multivariate is almost automatic, and can be accomplished by further decomposing the simulation algorithm into corresponding scalar innovations. Assumption 3 formalizes the structure of the model in a way that ensures we can sequester the discontinuities into various regions of the support for . As such, Assumption 3 requires an upper bound on the number of possible discontinuities in that arise from the structural function . The term can be interpreted as the maximum number of discontinuities allowed by the model. For example, in the dynamic binary panel models, for all and . Under Assumption 3(a), the discontinuous sizes, , are assumed to be known constants, which holds for the first two examples examined in Section 2.1. A minor modification of Assumption 3(a) also covers the second two examples covered in Section 2.1. Assumption 3(a) is employed for simplicity and can be extended, at the cost of additional notation, to the case where the discontinuous sizes, , are twice-differentiable functions of the structural parameters. Assumptions 3

(a)-(b) are needed to establish uniform convergence over the parameter space and are used for our asymptotic analysis given later.

Assumption 3(a) restricts the class of distributions for the errors to be continuous. While it may be feasible to extend this approach to cases where is discrete, given that these examples are much less frequent in econometrics than their continuous counterpart, we do not consider such situations herein. Before concluding, we note that even though

is restricted to be a continuous random variable, this assumption is satisfied in a wide variety of examples, such as those given in Examples 1-4, as well as censored-type models, such as Tobit-type models.

3.2 Approximate Derivatives

Under Assumptions 1-3, we carry out a change of variables on the original uniform random variables to construct new simulated outcomes that will be differentiable. Let denote a point at which we wish to evaluate the derivative of the function , and consider the following transformation of :555There are many such transformations that will accomplish our goal, see Chan and Joshi (2011) and Lyuu and Teng (2011). The above transformation is chosen as there is theoretical evidence to suggest that in (6) is optimal, in terms of minimizing mean squared error, for indicator functions (Joshi and Zhu, 2016). Since many of the commonly encountered discontinuities in II arise from simulating indicator function, this optimality should transfer to our settings.


for with , and the corresponding Jacobian term of this COV is defined as , i.e., for ,

Under Assumption 3(a), the difference is strictly positive almost surely for all , and all . Thus, there exists a constant such that for all . This boundedness of the Jacobian term , for all

, is critical to derive the uniform Laws of Large Numbers that will be needed to obtain uniform convergence of our proposed II criterion function and their derivatives. Moreover, under Assumption

3(b), thus there exists a such that for all and .

Given the transformed series , for , we then construct new simulated outcomes according to

II can now proceed by replacing the moment function with the following moment function:

and the moment conditions for II estimation can be defined through the function , given by

The result below shows that the derivatives of the moment function with respect to are unbiased and uniformly consistent estimators of their corresponding limit counterparts.

Proposition 1.

Suppose that Assumptions 1-3 hold. Then, the first and second derivatives of with respect to , calculated at , exist and are unbiased estimators for the first and second derivatives of with respect to , calculated at . Furthermore, as ,

uniformly in and for every .

The above result demonstrates that simulated moments produced by this procedure have the derivatives with respect to , evaluated at , which are consistent estimates of their limit counterparts. Therefore, this COV approach allows us to construct “generalized” LM-II and Wald-II criterion functions as

where with the estimator satisfying the following moment condition: for each and given . We can then define generalized II estimators based on this COV approach, and denoted by and , via the minimisation problem


Hereafter, we refer to such II estimators as GII change of variable (GII-COV) estimators.

Example 1 (cont.).

To implement our GII-COV approach, recall that the standard simulated outcome was Now, consider the critical point functions:666Recall that, while the critical point functions can depend on the simulated data set , we alleviate this dependence to simplify notations.

Let be a point at which we wish to evaluate the function . As in (6), we construct as follows:

The corresponding Jacobian term depends on only though the critical point function and is differentiable in . The new simulated outcomes are generated exactly as the original outcomes, , except that the new outcomes are simulated by replacing with the uniforms . That is, . Using and , we can then construct the approximation to the moment function

and to the II binding function,

By the definition of above, we have that if and only inf , which yields an alternative representation: Thus, both and are functions of only through the Jacobian term , which is differentiable in . Hence, both approximation functions are differentiable in ; for example,

Moreover, by Proposition 1, the approximation derivative , evaluated at , is an unbiased and uniformly consistent estimator for the population counterpart .

While the GII-COV estimators can be defined as in equation (7), it is nonetheless interesting to note that the criterion functions and have very regular behaviour. In particular, from the results of Proposition 1 we can deduce that these II criterion functions are consistent estimators of their corresponding limit counterparts. To obtain this consistency result, the following assumption is additionally employed.

Assumption 4.

Recall .

  • for all .

  • The parameter lies in the interior of and is the unique solution to .

  • for some positive-definite matrix .

Assumption 4(a) ensures that the structural model is correctly specified. Assumption 4(b) requires that there exists an unique auxiliary parameter vector satisfying the population moment condition evaluated at the true structural parameter . From Assumption 4(c), the possibly random weighting matrix converges to a positive-definite matrix .

Theorem 1.

Suppose that Assumptions 1-4 hold and let . Then,


uniformly in as ,


additionally, if (i) the map is twice continuously differentiable given any , (ii) there exists a random variable so that and for