Efficient Estimation in NPIV Models: A Comparison of Various Neural Networks-Based Estimators

by   Jiafeng Chen, et al.

We investigate the computational performance of Artificial Neural Networks (ANNs) in semi-nonparametric instrumental variables (NPIV) models with high-dimensional covariates that are relevant to empirical work in economics. We focus on efficient estimation of and inference on expectation functionals (such as weighted average derivatives) and use optimal criterion-based procedures (sieve minimum distance, or SMD) and novel efficient score-based procedures (ES). Both procedures use ANNs to approximate the unknown function. We then provide a detailed practitioner's recipe for implementing these two classes of estimators. This involves choosing tuning parameters not only for the unknown functions (which include conditional expectations) but also for the estimation of the optimal weights in SMD and of the Riesz representers used with the ES estimators. Finally, we conduct a large set of Monte Carlo experiments that compare finite-sample performance in complicated designs that involve a large set of regressors (up to 13 continuous), and various underlying nonlinearities and covariate correlations. Some of the takeaways from our results include: 1) tuning and optimization are delicate, especially as the problem is nonconvex; 2) various architectures of the ANNs do not seem to matter for the designs we consider and, given proper tuning, ANN methods perform well; 3) stable inferences are more difficult to achieve with ANN estimators; 4) optimal SMD based estimators perform adequately; 5) there seems to be a gap between implementation and approximation theory. Lastly, we apply ANN NPIV to estimate average price elasticity and average derivatives in two demand examples.




1 Introduction

Deep layer Artificial Neural Networks (ANNs) are increasingly popular in machine learning (ML), statistics, business, finance, and other fields. The universal approximation property of multi-layer (including single-hidden layer) ANNs (with various nonlinear activation functions) has been established by Hornik, Stinchcombe and White (1989) and others. Early on, computational difficulties hindered the wide applicability of ANNs. Recently, fast algorithms have led to successful applications of deep layer ANNs in image detection, speech recognition, natural language processing and other areas with complex nonlinearity and large data sets of high quality. (Footnote 1: By high quality we mean a data set with a very high signal-to-noise ratio. Unfortunately, many economic and social science data sets have low signal-to-noise ratios.)

Many problems where neural networks are extremely effective involve prediction problems (i.e. estimating conditional means)—or problems in which nuisance parameters are themselves predictions—on large datasets. It remains to be seen whether deep ANNs are similarly effective for structural estimation problems with nonparametric endogeneity, where relations to prediction are more tenuous or more complex.

To that end, we consider semiparametric efficient estimation of the average partial derivative of a nonparametric instrumental variables regression (NPIV) via ANN sieves. Average partial derivatives of structural relationships are linked to (cross) elasticities of endogenous demand systems in economics, finance, and business. We make three contributions. First, we consider efficient estimation for average derivatives in NPIV models with ANN sieves and derive the theoretical properties of two classes of estimation procedures: optimal criterion-based procedures (sieve minimum distance) and efficient score-based procedures. (Footnote 2: We also exposit the theoretical properties of the associated inefficient estimators.) Second, we detail a practitioner’s recipe for implementing these two classes of estimators. Third, and perhaps most importantly, we present extensive Monte Carlo evidence comparing the finite-sample performance of the estimation paradigms that we consider. These are implemented using large scale designs, some with up to 13 continuous regressors, various nonlinearities and correlations among the covariates.

We now briefly introduce the two classes of estimation procedures that we consider. The first set of estimators belongs to the class of sieve minimum distance (SMD), under both identity and optimal weighting (which we refer to as P-ISMD, for plug-in identity-weighted SMD, and OP-OSMD, for orthogonal plug-in optimal SMD, respectively). The criterion-based SMD paradigm is numerically equivalent to a semiparametric two-step procedure, where the unknown NPIV function is estimated via an ANN SMD in the first step, and its average partial derivative is estimated using an unconditional moment in the second step. The results of Ai and Chen (2003, 2007, 2012) (henceforth, AC03, AC07, and AC12) show that the optimally-weighted SMD procedure (OP-OSMD) automatically yields a semiparametrically efficient estimator of average derivatives of the NPIV function. The second type of estimator is based on influence functions (equivalently, on semiparametric scores). (Footnote 3: These estimators have a long history in semiparametrics. See Bickel et al. (1993) and Section 25.8 of Van der Vaart (2000), as well as references therein, for an introduction.) By considering the influence functions of the efficient SMD estimator (OP-OSMD, derived in AC12), we obtain the efficient score (ES) estimation procedure. Similarly, we may derive the identity score procedure (IS) from P-ISMD. In particular, we establish the asymptotic properties of the efficient score estimator when combined with a sieve first-step, which is novel to our knowledge.

These two classes of estimators, criterion-based (SMD) and score-based estimation procedures, represent two different perspectives on semiparametric estimation. In sieve minimum distance, semiparametric efficiency is achieved through clever choices of weighting of the minimum distance criterion, similar to optimally weighted GMM. Influence function estimators, on the other hand, treat the estimation as a two-step GMM problem. Compared to simpler settings, e.g. estimating the average treatment effect under unconfoundedness, the influence functions here are not in closed form and involve a Riesz representer of a Hilbert space constructed by a norm connected to the SMD objective. Components of the influence functions may nonetheless be consistently estimated via sieve approximations. (Footnote 4: In the case of linear sieves, these components may be estimated in closed form without use of nonlinear optimization.)

We compare the finite sample performance of these estimation procedures (OP-OSMD and ES) in three Monte Carlo designs with two moderate sample sizes. (Footnote 5: For reference and comparison, we also report results for the inefficient counterparts, P-ISMD and IS; moreover, we also compare with versions of the score-based estimators that utilize cross-fitting, popular in the double machine learning literature (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey and Robins, 2018, 2021).) In our first Monte Carlo design we estimate the average partial derivative of an NPIV function with respect to an exogenous variable using various ANN sieves. In the second and third Monte Carlo designs, we estimate the average partial derivative of an NPIV function with respect to an endogenous variable using various ANN sieves and spline sieves. Our Monte Carlo experiments allow for comparisons along several dimensions:

  • Within a type of estimation procedure, do ANN estimators exhibit superior finite-sample performance compared to linear sieve estimators (e.g. splines), when the dimension of the exogenous variables is moderately high? (Footnote 6: To be clear, we are not speaking of “high dimension” in the sense.)

  • Across types of estimators, how do ANN SMD estimators compare to ANN score estimators, along with alternative procedures like adversarial GMM (Dikkala, Lewis, Mackey and Syrgkanis, 2020)?

  • For ANN estimators, how much does ANN architecture matter? How much do other tuning parameters matter?

The results from the sets of numerical experiments we conduct provide us with some takeaways.

  • Hyperparameter tuning (choice of instrument basis, learning rate, stopping criterion) is delicate and can affect the performance of ANN-based estimators. In particular, the generally nonconvex optimization can lead to unstable performance. However, certain values of the hyperparameters do result in good performance.

  • We do not empirically observe systematic differences in performance as a function of neural architecture, within the feedforward neural network family. In our experience, the importance of neural architecture in our setting is not as high as tuning the optimization procedure.

  • Stable inferences are currently more difficult to achieve for ANN based estimators for models with nonparametric endogeneity.

  • OP-OSMD and IS have smaller bias than P-ISMD estimators, when combined with ANN for the average derivative parameter.

  • OP-OSMD and P-ISMD with splines for the average derivative parameter are less biased, stable and accurate, and can outperform their ANN counterparts, even when the dimension is high (as high as thirteen, which exceeds theory predictions).

  • Generally, there seem to be gaps between the intuitions suggested by approximation theory and current implementation.

Lastly, as an application to real data, we apply the single hidden layer ANN sieve NPIV to estimate the average price elasticity of gasoline demand using the data set of Blundell, Horowitz and Parey (2012) and to estimate average derivatives of the price-quantity relationship for strawberries; both applications contain multi-dimensional covariates.

Literature on ANNs in econometrics. ANNs can be viewed as an example of nonlinear sieves, which, compared to linear sieves (or series), can have faster approximation error rates for large classes of nonlinear functions of high dimensional regressors. Once the approximation error rate of a specific ANN sieve is established for a class of unknown functions, the asymptotic properties of estimation and inference based on the ANN sieve can be established by applying the general theory of sieve-based methods. Prior to the current wave of theoretical papers on multi-layer ANNs in econometrics and statistics, including Yarotsky (2017), Farrell et al. (2018), Schmidt-Hieber (2019), Athey et al. (2019) and the references therein, there were already many theoretical results on nonparametric M-estimation and inference based on single hidden layer (now called shallow) ANNs. For example, Wooldridge and White (1988) obtained consistency of ANN least squares estimation of the conditional mean function for time series heterogeneous near epoch dependent data. Chen and White (1999) established faster than n^{-1/4} convergence rates (in root mean-squared error metric) for ANN sieve M-estimation of various nonparametric functions, such as conditional means, conditional quantiles, and conditional densities, of time series models, and also obtained the root-n consistency and asymptotic normality of plug-in ANN estimators of regular functionals. Stinchcombe and White (1998) provided a consistent specification test via ANN sieves. Chen, Hong and Shum (2007) studied a nonparametric likelihood ratio Vuong-style model selection test via ANN sieves, which relies on the root-n consistency and asymptotic normality of plug-in ANN estimation of nonparametric entropy. Perhaps due to computational difficulties, ANN nonlinear sieves were not popular in economics until the recent computational advances from the machine learning community.

To the best of our knowledge, Hartford, Lewis, Leyton-Brown and Taddy (2017) is the first paper to apply multi-layer ANNs to estimate an NPIV function. The nonparametric convergence rates in AC03 and AC07 explicitly allow for nonlinear sieves such as ANNs to approximate and estimate the unknown structural functions of endogenous variables. They establish the root-n asymptotic normality of regular functionals of nonparametric conditional moment restrictions with smooth residual functions. However, there is no published work on efficient estimation of expectation functionals of unknown functions of endogenous variables via ANNs. We provide these results in this paper.

Literature on estimation of linear functionals of NPIVs. The identification and consistent nonparametric estimation of a pure NPIV function were first considered in Newey and Powell (2003). The semiparametric efficiency bound for expectation functionals (EFs), including weighted average derivatives (WADs) of NPIV or nonparametric quantile instrumental variables (NPQIV) functions as examples, was previously characterized by AC12. (Footnote 7: Ai and Chen (2012) derived the efficiency bound via the “orthogonalized residual” approach, which extends the earlier work of Chamberlain (1992a) and Brown and Newey (2002) to allow for unknown functions entering a system of sequential moment restrictions.) Although AC12 suggested an estimator that could achieve the semiparametric efficiency bound for regular EFs of nonparametric conditional moment restrictions, they did not provide sufficient conditions to formally establish that their estimator is indeed semiparametrically efficient. AC07 established the asymptotic normality of the plug-in estimator of the WAD of a possibly misspecified NPIV model. Severini and Tripathi (2013) presented an efficiency bound calculation for weighted average derivatives of an NPIV model without assuming point identification of the NPIV function, but pointed out that the asymptotically normal estimator of linear functionals of NPIV in Santos (2012) fails to achieve the efficiency bound. Chen et al. (2019) proposed efficient estimation of weighted average derivatives of nonparametric quantile IV regression via a penalized sieve GEL procedure, extending that of Donald et al. (2003) to allow for an unknown quantile function of endogenous variables.

The rest of the paper is organized as follows. Section 2 introduces the model as a special case of the sequential moment restrictions containing unknown functions. It also presents the semiparametric efficiency score (or efficient influence function). Section 3 provides implementation details for all the estimators considered in the Monte Carlo studies. Section 4 contains three simulation studies and detailed Monte Carlo comparisons of various ANN and spline based estimators. Section 5 presents two empirical illustrations and Section 6 concludes the paper.

2 Two Efficient Estimation Procedures for Average Derivatives in NPIV Models

Using the framework in AC12, we first present the NPIV model and then present two procedures that lead to semiparametric efficient estimation of average derivatives in NPIV models. The first is based on the efficient score (or efficient influence function) equation, and the second is based on an optimally weighted criterion.

We are interested in semiparametric efficient estimation of the average partial derivative:

where is a known positive weight function, takes the partial derivative w.r.t. the first argument and the unknown function is identified via a real-valued conditional moment restriction


Previously, while allowing for global misspecification, Ai and Chen (2007) (AC07) presented a root-n consistent, asymptotically normally distributed identity-weighted SMD estimator of the average derivative; their sufficient conditions allow for nonlinear sieves such as single hidden layer ANN sieves. Ai and Chen (2012) (AC12) presented the semiparametric efficiency bound for the average derivative (see their Example 3.3) and an efficient estimator based on orthogonalized optimally weighted SMD (see their Section 4.2).

In this paper we present several efficient estimators of the average derivative via more general ANN nonlinear sieve approximations to the unknown function when its argument is a vector of continuous endogenous and exogenous covariates of moderately high dimension (say, up to 13 dimensions). For example:

  • In Monte Carlo 1, the DGP corresponds to

    where is endogenous and can be of high dimension. Note here that is exogenous. The parameter of interest is .

  • In Monte Carlo 2, the DGP corresponds to

    where is endogenous but enters linearly. The parameter of interest is .

  • In Monte Carlo 3, the DGP corresponds to

    where is endogenous but now enters nonlinearly. Monte Carlo 3(a) specifies and hence the parameter of interest . Monte Carlo 3(b) lets where is a nonlinear function, and then the parameter of interest .

2.1 Efficient score and efficient variance for

In this section, we specialize the general efficiency bound result of AC12 to our setting. We rewrite our model using their notation. Denote . The model can be written as


We define the orthogonalized residual as

which is the residual from a projection of on conditional on , where is the orthogonal projection coefficient:

Orthogonalizing the two moment conditions makes an efficiency analysis tractable—the same technique is used in, e.g., Chamberlain (1992b).

We now apply and specialize the results in AC12 to the plug-in model (3)


where is a scalar and is a real-valued function of , and .

Let , and

We note that this is also a special case of Example 3.3 of AC12, which already characterized the efficiency bound for . We recall their result for the sake of easy reference, and compute


where . Let be one solution (not necessarily unique) to the optimization problem (4). We note that such a solution always exists since the problem is convex, and we have:

Remark 2.1.

Characterization of Efficient Score. Applying Theorem 2.3 of AC12, we have: the semiparametric efficient score for in Model (3) is given by

where is one solution to (4). And the semiparametric information bound for is .

(1) If , then cannot be estimated at the root-n rate.

(2) If , then the semiparametric efficient variance for is: .

In the rest of the paper we shall assume that and hence is a root-n estimable regular parameter. We note that by definition, the efficient score (indeed any moment condition proportional to an influence function) automatically satisfies the orthogonal moment condition.

2.2 Efficient influence function equation based procedure

From the remark above, the semiparametric efficient influence function for takes the form



It is clear that is the unique solution to the efficient IF equation , that is

One efficient estimator, , for is simply based on the sample version of the efficient IF equation with plug-in consistent estimates of all the nuisance functions:

In this paper can be various ANN sieve minimum distance estimators (see below), but, for simplicity, the nuisance functions and are estimated by plug-in linear sieve estimators.

2.3 Optimally weighted SMD procedure

Another efficient estimator for can be found by optimally-weighted sieve minimum distance, where the population criterion is (see AC12):


The discrepancy measure is the expectation of the two moment conditions, conditional on their respective σ-fields:

and the optimal weight matrix is diagonal and proportional to the inverse variance of each moment condition:

A sieve minimum distance estimator for may be constructed by (i) replacing expectations with sample means, (ii) replacing conditional expectations with projection onto linear sieve bases, (iii) replacing the optimal weight matrix with a consistent estimator, and (iv) replacing the infinite dimensional optimization with finite dimensional optimization over a sieve space for . This paper focuses on approximating by ANN sieves. In particular, a sample analogue of the above objective function is

where and are estimators of and respectively; see Sections 3 and 4 below for examples of different estimators. Let be a sieve for (and in this paper we focus on various ANN sieves). We define the optimally weighted SMD estimator as an approximate solution to

This is an estimator proposed in AC12.

Two remarks are in order. First, note that the optimal weight matrix is diagonal because and are uncorrelated by design. Second, since the optimal weight matrix is diagonal and is a free parameter, we can view the minimization as sequential:

This is important because solving the model sequentially while maintaining efficiency plays a role in computing the estimators.

We may analyze the asymptotic properties of this estimator. Since we may view the optimally weighted SMD problem as either a minimum distance program or a sequential GMM estimator, we may carry out two separate analyses of the asymptotic properties. The analysis of the estimator as a minimum distance problem is a specialization of Ai and Chen (2003, 2007, 2012) and Chen and Pouzo (2015), while the analysis as a sequential moment restriction, given in the appendix, specializes Chen and Liao (2015).

2.4 Analysis as optimally weighted SMD

We are interested in a functional of the parameter . A simple linearization shows that

(Footnote 8: Let be a real-valued function whose domain is some topological vector space . The notation denotes the directional derivative of with respect to in the direction of , which is

Note that when , the above definition is a usual directional derivative.)

A key insight is that we may define an inner product over the space of , , as

where and .

The linear operator, then admits a Riesz representation where is the Riesz representer.

Next, we analyze the Riesz representer and the inner product. First, observe that by picking we can simplify some terms in the Riesz representer:

which implies that

We can now consider the inner product, which turns out to have a representation as a sample mean, from a local expansion of the criterion function (for the precise argument, see Ai and Chen, 2003, 2007):


We then have a heuristic derivation of the influence function


Riesz Representer. Lastly, we need to characterize the Riesz representer . The argument in AC03 parametrizes as a “scale times direction” coordinate. For a fixed scale , the minimum norm property of Riesz representers implicitly defines


Solving the condition

by plugging in then yields

as the solutions for the representers where is defined in (10) above. If we assume completeness condition then as the unique solution to (4) or (5) and

Consistency, root-n asymptotic normality, and consistent variance estimation can all be obtained by directly applying AC03 and AC07 for single hidden layer ANN sieves. The results of Chen, Liao and Wang (2021) can be applied for multi-layer ANN sieves.

2.4.1 Identity weighted SMD

This is a special case of AC (2007, section 4.2). We include the asymptotic linear expansion for the sake of comparison. In particular, the influence function, associated with the identity-weighted SMD estimator, is of the form



Consistency, root-n asymptotic normality, and consistent variance estimation can all be obtained by directly applying AC07 for single hidden layer ANN sieves.

3 Implementation of the estimators

We now explain the implementation of various estimators for the average derivative as these tend to be complex, especially when we would like to estimate functionals of NPIV models efficiently. In this section, we describe in broad strokes the construction of the eventual estimators for the average derivative, which often involve estimation of nuisance parameters and functions. These nuisance parameters—which often take the form of known transformations of conditional means and variances—require further choice of estimation routines and tuning parameters, details of which are relegated to Section 4.2.

Quick map of estimation procedures

We provide a simple map that connects the above models and approaches to estimators we use. For implementation details see Section 3 below.

  1. For SMD estimators [P-ISMD, OP-OSMD]: Solve sample and sieve version of (7) (Section 3.1)

  2. Standard error for SMD estimators: Estimate the components of the influence functions in (9), and take the sample variance. (Section 3.3)

  3. Score estimators [IS, ES]: Estimate the components of the influence functions as in (LABEL:eq:ES). Set the influence functions to zero and solve for . (Section 3.2)

Additionally, we describe the estimator when the analyst is willing to assume more semiparametric structure (e.g. partial linearity) on the structural function. We also conclude the section with a brief discussion of software implementation issues.

A note on notation

Recall that we use to denote the outcome, to denote variables (endogenous or exogenous) that are included in the structural function, and to denote exogenous variables that are excluded from the structural function. Certain entries of and may be shared. Again, the NPIV moment condition


and we are interested in , where is the partial derivative of with respect to its first argument, evaluated at . Let

collect the data, viewed as random variables in the population.

We also set up notation for objects related to the sample. Let there be a sample of observations. We denote as vectors and matrices respectively of realized values of the random vector . We will slightly abuse notation and write , for a function , to be the -matrix of outputs obtained by applying row-wise, and similarly for expressions of the type . (Footnote 9: This notation conforms with how vector operations are broadcast in popular numerical software packages, such as Matlab and the Python scientific computing ecosystem (NumPy, SciPy, PyTorch, etc.).)

For a vector valued function , we let be the projection matrix onto the column space of .
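As a concrete illustration of the projection-matrix notation, the following NumPy sketch builds the orthogonal projector onto the column space of a basis matrix. The polynomial basis, the sample size, and the names `projection_matrix`, `w`, and `basis` are illustrative assumptions rather than the paper's choices; the pseudo-inverse is used so that a rank-deficient basis does not cause a failure.

```python
import numpy as np

def projection_matrix(basis):
    """Orthogonal projector onto the column space of an (n, k) basis matrix."""
    # basis @ pinv(basis) equals B (B'B)^+ B' and tolerates rank deficiency.
    return basis @ np.linalg.pinv(basis)

rng = np.random.default_rng(0)
w = rng.normal(size=(50, 1))                    # hypothetical excluded instrument
basis = np.hstack([np.ones_like(w), w, w**2])   # hypothetical polynomial sieve basis
P = projection_matrix(basis)

y = rng.normal(size=50)
fitted = P @ y   # fitted values from a least-squares regression of y on the basis
```

Applying `P` to a vector returns the least-squares fitted values, which is the sense in which the projection approximates a conditional mean given the instrument.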

3.1 Sieve minimum distance (SMD) estimators

Consider a linear sieve basis for , where . For a sample of realizations of , is the sample best mean square linear predictor (that approximates the conditional mean) of , since it returns the fitted values of a regression of on flexible functions of :

Under the NPIV restriction (12), taking and , we should expect

This motivates the analogue of the SMD criterion (7) in the sample, where we choose so as to minimize the size of the projected residual :


When the norm chosen is the usual Euclidean norm , we obtain the identity-weighted SMD estimator for , .

Given a preliminary estimator for , we may form an estimator of the residual conditional variance by forming the estimated residuals and then projecting onto , e.g. via the linear sieve basis or via other nonparametric regression techniques such as nearest neighbors. With such an estimator of the heteroskedasticity, we can form a weight matrix . Using the norm in (13) yields the optimally-weighted SMD estimator for , .

With an estimate of the structural function , we can form two plug-in estimators of . The first is the simple plug-in estimator:

The simple plug-in estimator does not take into account the covariance between the two moment conditions, and . The second estimator, the orthogonalized plug-in estimator, orthogonalizes the second moment against the first:

where is an estimator of the population projection coefficient of the second moment onto the first moment condition :


One choice of is to plug in sample counterparts—plugging in for , plugging in a preliminary (which could be the ) for , and plugging in an estimator for —and finally approximate via a linear sieve regression, say with the basis .
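To make the simple plug-in estimator concrete, here is a minimal sketch for a single hidden layer network with tanh activation, where the partial derivative with respect to the first argument is available analytically. The network parameters below are random placeholders standing in for fitted values, the weight function is taken to be identically one, and all names (`ann`, `ann_grad_first_arg`, `simple_plug_in`) are hypothetical; the paper's own architectures and activations may differ.

```python
import numpy as np

def ann(x, W1, b1, a, b2):
    """Single hidden layer network with tanh activation; x has shape (n, d)."""
    return np.tanh(x @ W1 + b1) @ a + b2

def ann_grad_first_arg(x, W1, b1, a):
    """Analytic partial derivative of the network with respect to x[:, 0]."""
    z = x @ W1 + b1
    # d/dx0 sum_j a_j tanh(z_j) = sum_j a_j (1 - tanh(z_j)^2) W1[0, j]
    return (1.0 - np.tanh(z) ** 2) @ (a * W1[0])

def simple_plug_in(x, weight_fn, W1, b1, a):
    """Sample mean of weight(x_i) times the derivative in the first argument."""
    return np.mean(weight_fn(x) * ann_grad_first_arg(x, W1, b1, a))

rng = np.random.default_rng(3)
n, d, width = 200, 3, 5
x = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, width))   # placeholder for fitted hidden-layer weights
b1 = rng.normal(size=width)
a = rng.normal(size=width)
b2 = 0.1

theta_hat = simple_plug_in(x, lambda pts: np.ones(len(pts)), W1, b1, a)
```

In practice an automatic differentiation library would typically supply the derivative; the closed form is used here only to keep the sketch dependency-free.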

To summarize, the SMD estimator can be implemented as follows.

Identity Weighted SMD Estimator of

  1. Sieve for Conditional Expectation: Choose a Sieve basis for : (more details on this later)

  2. Construct Objective function

    1. Obtain the sample least squares projection of onto

    2. Optimizing : define
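The two steps above can be sketched in NumPy. To keep the example short and closed-form, the ANN sieve is swapped for a linear sieve for the structural function, in which case the identity-weighted SMD minimizer coincides with two-stage least squares; with an ANN sieve the same objective would instead be minimized numerically. The data-generating process, the bases, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
w = rng.normal(size=n)                    # excluded instrument
v = rng.normal(size=n)                    # source of endogeneity
x = w + v                                 # endogenous regressor
u = 0.5 * v + 0.5 * rng.normal(size=n)    # correlated with x, mean-independent of w
y = 2.0 * x + u                           # hypothetical structural function h(x) = 2x

inst = np.column_stack([np.ones(n), w, w**2])   # sieve basis in the instrument
psi = np.column_stack([np.ones(n), x])          # linear sieve basis for h

def project(arr):
    """Least-squares projection of arr onto the instrument basis."""
    return inst @ np.linalg.lstsq(inst, arr, rcond=None)[0]

def smd_objective(beta):
    """Identity-weighted SMD criterion: mean squared projected residual."""
    return np.mean(project(y - psi @ beta) ** 2)

# With a linear sieve for h, the minimizer has a closed form (2SLS).
beta_smd = np.linalg.lstsq(project(psi), project(y), rcond=None)[0]
beta_ols = np.linalg.lstsq(psi, y, rcond=None)[0]   # ignores endogeneity
```

Comparing `beta_smd[1]` with `beta_ols[1]` shows the role of the projection: the least-squares slope absorbs the correlation between the regressor and the error, while the SMD slope does not.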

Optimal SMD Estimator of

  1. Same as Step (1) above

  2. Estimate Weight Function : with a preliminary estimator of (for instance, the identity-weighted one), form an estimator by projecting on , the sieve basis for to obtain . Form .

  3. Optimizing : define .
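The three steps above can be sketched as follows, again substituting a linear sieve for the ANN sieve so that each minimization is a weighted least-squares problem. The heteroskedastic design, the variance floor `1e-3`, and the function names are assumptions made for illustration; the floor is a practical guard, since a linear projection of squared residuals is not guaranteed to be positive.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
w = rng.normal(size=n)
v = rng.normal(size=n)
x = w + v
# Heteroskedastic error: its conditional variance depends on the instrument w.
u = 0.5 * v + np.sqrt(0.25 + w**2) * rng.normal(size=n)
y = 2.0 * x + u

inst = np.column_stack([np.ones(n), w, w**2])   # sieve basis in the instrument
psi = np.column_stack([np.ones(n), x])          # linear sieve basis for h

def project(arr):
    """Least-squares projection of arr onto the instrument basis."""
    return inst @ np.linalg.lstsq(inst, arr, rcond=None)[0]

def smd_fit(weights):
    """Minimize sum_i weights_i * (projected residual_i)^2 over sieve coefficients."""
    root = np.sqrt(weights)
    return np.linalg.lstsq(root[:, None] * project(psi), root * project(y), rcond=None)[0]

def weighted_objective(beta, weights):
    return np.mean(weights * project(y - psi @ beta) ** 2)

# Step 1: preliminary identity-weighted fit.
beta_id = smd_fit(np.ones(n))
# Step 2: project squared residuals on the instrument basis to estimate the
# conditional variance, flooring the result to keep the weights positive.
sigma2_hat = np.maximum(project((y - psi @ beta_id) ** 2), 1e-3)
# Step 3: re-fit with inverse-variance weights.
beta_opt = smd_fit(1.0 / sigma2_hat)
```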

Estimators for

  1. Simple Plug In Estimator. Given an estimator of , use

  2. Orthogonalized Plug In Estimator

    1. Obtain an estimator of . One can use with being for example the simple plug in estimator and the above estimator of the variance of the first moment.

    2. Orthogonal Plug In Estimator. Obtain
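A rough sketch of the orthogonalized plug-in step follows. It assumes, as our reading of the description above rather than a formula recovered from the text, that the projection coefficient is the ratio of the conditional covariance of the two moment residuals to the conditional variance of the first, each approximated by a linear sieve regression on the instrument basis. The input arrays are synthetic placeholders, and the names `sieve_regress`, `orthogonalized_plug_in`, and `floor` are hypothetical.

```python
import numpy as np

def sieve_regress(basis, target):
    """Fitted values from a least-squares regression of target on basis."""
    return basis @ np.linalg.lstsq(basis, target, rcond=None)[0]

def orthogonalized_plug_in(y, h_fit, wderiv, basis, floor=1e-3):
    """Plug-in estimate after partialling out the first-moment residual.

    y      : outcomes, shape (n,)
    h_fit  : fitted structural function at the sample points, shape (n,)
    wderiv : weight times fitted partial derivative, shape (n,)
    basis  : instrument sieve basis, shape (n, k)
    """
    rho1 = y - h_fit                     # first-moment residual
    theta_prelim = wderiv.mean()         # simple plug-in as preliminary value
    rho2 = wderiv - theta_prelim         # second-moment residual
    num = sieve_regress(basis, rho2 * rho1)                   # assumed numerator
    den = np.maximum(sieve_regress(basis, rho1 ** 2), floor)  # assumed denominator
    gamma = num / den
    return np.mean(wderiv - gamma * rho1), gamma

rng = np.random.default_rng(4)
n = 500
w = rng.normal(size=n)
basis = np.column_stack([np.ones(n), w, w**2])
h_fit = np.sin(w)                        # placeholder fitted values
y = h_fit + rng.normal(size=n)           # placeholder outcomes
wderiv = np.cos(w)                       # placeholder weighted derivatives

theta_orth, gamma_hat = orthogonalized_plug_in(y, h_fit, wderiv, basis)
```

When the first-moment residuals are identically zero the correction term vanishes and the function returns the simple plug-in value, which gives a quick sanity check on an implementation.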

Combining simple plug-in with identity-weighted SMD yields the estimation procedure that we term P-ISMD, and combining orthogonal plug-in with optimally weighted SMD yields the estimation procedure that we call OP-OSMD.

3.2 Influence function-based estimators

We also implement influence function based estimators. As we highlighted in the previous section, one influence function estimator for takes the following form


with defined below. Moreover, given an estimator for and for , we can form the influence function estimator:

Identity score estimator (IS)

One influence function, which corresponds to the influence function of the P-ISMD estimator has taking the following form. We refer to the resulting influence function estimator as IS, for identity score.

Efficient score estimator (ES)

On the other hand, the efficient influence function (ES) uses a different :

where is as in (14), and