1 Introduction
Deep layer Artificial Neural Networks (ANNs) are increasingly popular in machine learning (ML), statistics, business, finance, and other fields. The universal approximation property of multilayer (including single-hidden-layer) ANNs (with various nonlinear activation functions) was established by Hornik, Stinchcombe and White (1989) and others. Early on, computational difficulties hindered the wide applicability of ANNs. Recently, fast algorithms have led to successful applications of deep layer ANNs in image detection, speech recognition, natural language processing, and other areas with complex nonlinearity and large data sets of high quality.
^1 By high quality we mean a data set with a signal-to-noise ratio that is very high. Unfortunately, many economic and social science data sets have low signal-to-noise ratios.
Many problems where neural networks are extremely effective involve prediction (i.e., estimating conditional means)—or problems in which nuisance parameters are themselves predictions—on large datasets. It remains to be seen whether deep ANNs are similarly effective for structural estimation problems with nonparametric endogeneity, where the relation to prediction is more tenuous or more complex. To that end, we consider semiparametric efficient estimation of the average partial derivative of a nonparametric instrumental variables regression (NPIV) via ANN sieves. Average partial derivatives of structural relationships are linked to (cross-)elasticities of endogenous demand systems in economics, finance, and business. We make three contributions. First, we consider efficient estimation of average derivatives in NPIV models with ANN sieves and derive the theoretical properties of two classes of estimation procedures—optimal criterion-based procedures (sieve minimum distance) and efficient score-based procedures.^2 We also exposit the theoretical properties of the associated inefficient estimators. Second, we detail a practitioner's recipe for implementing these two classes of estimators. Third, and perhaps most importantly, we present extensive Monte Carlo evidence comparing the finite-sample performance of the estimation paradigms that we consider. These are implemented using large-scale designs, some with up to 13 continuous regressors and various nonlinearities and correlations among the covariates.
We now briefly introduce the two classes of estimation procedures that we consider. The first set of estimators belongs to the class of sieve minimum distance (SMD), under both identity and optimal weighting (which we refer to as PISMD, for plug-in SMD, and OPOSMD, for orthogonal plug-in optimal SMD, respectively). The criterion-based SMD paradigm is numerically equivalent to a semiparametric two-step procedure, in which the unknown NPIV function is estimated via an ANN SMD in the first step, and its average partial derivative is estimated using an unconditional moment in the second step. The results of Ai and Chen (2003, 2007, 2012) (henceforth, AC03, AC07, and AC12) show that the optimally weighted SMD procedure (OPOSMD) automatically yields a semiparametrically efficient estimator of average derivatives. The second type of estimators is based on influence functions (equivalently, on semiparametric scores).^3 These estimators have a long history in semiparametrics; see Bickel et al. (1993) and Section 25.8 of Van der Vaart (2000), as well as references therein, for an introduction. By considering the influence function of the efficient SMD estimator (OPOSMD, derived in AC12), we obtain the efficient score (ES) estimation procedure. Similarly, we may derive the identity score (IS) procedure from PISMD. In particular, we establish the asymptotic properties of the efficient score estimator when combined with a sieve first step, which is novel to our knowledge. These two classes of estimators—criterion-based (SMD) and score-based procedures—represent two different perspectives on semiparametric estimation. In sieve minimum distance, semiparametric efficiency is achieved through clever choices of weighting in the minimum distance criterion, similar to optimally weighted GMM. Influence function estimators, on the other hand, treat the estimation as a two-step GMM problem. Compared to simpler settings, e.g., estimating an average treatment effect under unconfoundedness, the influence functions here are not in closed form and involve a Riesz representer of a Hilbert space whose norm is connected to the SMD objective. Components of the influence functions may nonetheless be consistently estimated via sieve approximations.^4 In the case of linear sieves, these components may be estimated in closed form without use of nonlinear optimization.
We compare the finite-sample performance of these estimation procedures (OPOSMD and ES) in three Monte Carlo designs with moderate sample sizes.^5 For reference and comparison, we also include the inefficient counterparts, PISMD and IS, as well as versions of the score-based estimators that utilize cross-fitting, popular in the double machine learning literature (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey and Robins, 2018, 2021). In our first Monte Carlo design we estimate the average partial derivative of a NPIV function with respect to an exogenous variable using various ANN sieves. In the second and third Monte Carlo designs, we estimate the average partial derivative of a NPIV function with respect to an endogenous variable using various ANN sieves and spline sieves. Our Monte Carlo experiments allow for comparisons along several dimensions:

Within a type of estimation procedure, do ANN estimators exhibit superior finite-sample performance compared to linear sieve estimators (e.g., splines) when the dimension of the exogenous variables is moderately high?^6 To be clear, we are not speaking of "high dimension" in the sense of high-dimensional statistics.

Across types of estimators, how do ANN SMD estimators compare to ANN score estimators, along with alternative procedures like adversarial GMM (Dikkala, Lewis, Mackey and Syrgkanis, 2020)?

For ANN estimators, how much does ANN architecture matter? How much do other tuning parameters matter?
The results from our numerical experiments provide several takeaways.

Hyperparameter tuning—the choice of instrument basis, learning rate, and stopping criterion—is delicate and can affect the performance of ANN-based estimators. In particular, the generally nonconvex optimization can lead to unstable performance. However, certain values of the hyperparameters do result in good performance.

We do not empirically observe systematic differences in performance as a function of neural architecture within the feedforward neural network family. In our experience, the choice of neural architecture in our setting matters less than tuning the optimization procedure.

Stable inference is currently more difficult to achieve for ANN-based estimators in models with nonparametric endogeneity.

OPOSMD and IS have smaller bias than PISMD when combined with ANN sieves for the average derivative parameter.

OPOSMD and PISMD with splines for the average derivative parameter are less biased, more stable, and more accurate, and can outperform their ANN counterparts even when the dimension is high (as high as thirteen), which runs counter to predictions from approximation theory.

Generally, there seem to be gaps between the intuitions suggested by approximation theory and current implementations.
Lastly, as applications to real data, we apply the single-hidden-layer ANN sieve NPIV estimator to estimate the average price elasticity of gasoline demand using the data set of Blundell, Horowitz and Parey (2012), and to estimate average derivatives of the price–quantity relationship for strawberries; both applications contain multidimensional covariates.
Literature on ANNs in econometrics. ANNs can be viewed as an example of nonlinear sieves which, compared to linear sieves (or series), can have faster approximation error rates for large classes of nonlinear functions of high-dimensional regressors. Once the approximation error rate of a specific ANN sieve is established for a class of unknown functions, the asymptotic properties of estimation and inference based on that ANN sieve can be established by applying the general theory of sieve-based methods. Prior to the current wave of theoretical papers on multilayer ANNs in econometrics and statistics, including Yarotsky (2017), Farrell et al. (2018), Schmidt-Hieber (2019), Athey et al. (2019) and the references therein, there were already many theoretical results on nonparametric M-estimation and inference based on single-hidden-layer (now called shallow) ANNs. For example, Wooldridge and White (1988) obtained consistency of ANN least squares estimation of a conditional mean function for time series heterogeneous near-epoch-dependent data. Chen and White (1999) established fast convergence rates (in the root mean-squared error metric) for ANN sieve M-estimation of various nonparametric functions, such as conditional means, conditional quantiles, and conditional densities, in time series models, and also obtained root-n consistency and asymptotic normality of plug-in ANN estimators of regular functionals. Stinchcombe and White (1998) provided consistent specification tests via ANN sieves. Chen, Hong and Shum (2007) studied nonparametric likelihood ratio Vuong-style model selection tests via ANN sieves, which rely on root-n consistent, asymptotically normal plug-in ANN estimation of nonparametric entropy. Perhaps due to computational difficulties, ANN nonlinear sieves were not popular in economics until the recent computational advances from the machine learning community. To the best of our knowledge, Hartford, Lewis, Leyton-Brown and Taddy (2017) is the first paper to apply multilayer ANNs to estimate a NPIV function. The nonparametric convergence rates in AC03 and AC07 explicitly allow for nonlinear sieves such as ANNs to approximate and estimate the unknown structural functions of endogenous variables. They establish root-n asymptotic normality of regular functionals of nonparametric conditional moment restrictions with smooth residual functions. However, there is no published work on efficient estimation of expectation functionals of unknown functions of endogenous variables via ANNs. We provide these results in this paper.
Literature on estimation of linear functionals of NPIVs. The identification and consistent nonparametric estimation of a pure NPIV function were first considered in Newey and Powell (2003). The semiparametric efficiency bound for expectation functionals (EFs), including weighted average derivatives (WADs) of NPIV or nonparametric quantile instrumental variables (NPQIV) functions as examples, was previously characterized by AC12.^7 Ai and Chen (2012) derived the efficiency bound via the "orthogonalized residual" approach, which extends the earlier work of Chamberlain (1992a) and Brown and Newey (2002) to allow for unknown functions entering a system of sequential moment restrictions. Although AC12 suggested an estimator that could achieve the semiparametric efficiency bound for regular EFs of nonparametric conditional moment restrictions, they did not provide sufficient conditions to formally establish that their estimator is indeed semiparametrically efficient. AC07 established the asymptotic normality of the plug-in estimator of the WAD of a possibly misspecified NPIV model. Severini and Tripathi (2013) presented an efficiency bound calculation for weighted average derivatives of a NPIV model without assuming point identification of the NPIV function, but pointed out that the asymptotically normal estimator of linear functionals of NPIV in Santos (2012) fails to achieve the efficiency bound. Chen et al. (2019) proposed efficient estimation of weighted average derivatives of a nonparametric quantile IV regression via a penalized sieve GEL procedure, extending that of Donald et al. (2003) to allow for an unknown quantile function of endogenous variables.
The rest of the paper is organized as follows. Section 2 introduces the model as a special case of the sequential moment restrictions containing unknown functions. It also presents the semiparametric efficiency score (or efficient influence function). Section 3 provides implementation details for all the estimators considered in the Monte Carlo studies. Section 4 contains three simulation studies and detailed Monte Carlo comparisons of various ANN and spline based estimators. Section 5 presents two empirical illustrations and Section 6 concludes the paper.
2 Two Efficient Estimation Procedures for Average Derivatives in NPIV Models
Using the framework in AC12, we first present the NPIV model and then present two procedures that lead to semiparametric efficient estimation of average derivatives in NPIV models. The first is based on the efficient score (or efficient influence function) equation, and the second is based on an optimally weighted criterion.
We are interested in semiparametric efficient estimation of the average partial derivative:
where is a known positive weight function, takes the partial derivative with respect to the first argument, and the unknown function is identified via a real-valued conditional moment restriction
(1) 
Previously, while allowing for global misspecification, Ai and Chen (2007) (AC07) presented a root-n consistent, asymptotically normally distributed identity-weighted SMD estimator, whose sufficient conditions allow for nonlinear sieves such as the single-hidden-layer ANN sieve. Ai and Chen (2012) (AC12) presented the semiparametric efficiency bound (see their Example 3.3) and an efficient estimator based on the orthogonalized optimally weighted SMD (see their Section 4.2). In this paper we present several efficient estimators via more general ANN nonlinear sieve approximations when the regressor is a vector of continuous endogenous and exogenous covariates of moderately high dimension (say, up to 13). For example:

In Monte Carlo design 1, the DGP corresponds to
where is endogenous and can be of high dimension. Note that is exogenous. The parameter of interest is .

In Monte Carlo design 2, the DGP corresponds to
where is endogenous but enters linearly. The parameter of interest is .

In Monte Carlo design 3, the DGP corresponds to
where is endogenous but now enters nonlinearly. Monte Carlo 3(a) specifies , and hence the parameter of interest is . Monte Carlo 3(b) lets , where is a nonlinear function, and then the parameter of interest is .
2.1 Efficient score and efficient variance for
In this section, we specialize the general efficiency bound result of AC12 to our setting. We rewrite our model using their notation. Denote . The model can be written as
(2)  
We define the orthogonalized residual as
which is the residual from a projection of on conditional on , where is the orthogonal projection coefficient:
Orthogonalizing the two moment conditions makes an efficiency analysis tractable—the same technique is used in, e.g., Chamberlain (1992b).
We now apply and specialize the results in AC12 to the plug-in model (3)
(3) 
where is a scalar and is a realvalued function of , and .
Let , and
We note that this is also a special case of Example 3.3 of AC12, which already characterized the efficiency bound for . We recall their result for ease of reference, and compute
(4) 
where . Let be a solution (not necessarily unique) to the optimization problem (4). We note that such a solution always exists since the problem is convex, and we have:
(5) 
Remark 2.1.
Characterization of Efficient Score. Applying Theorem 2.3 of AC12, we have that the semiparametric efficient score for in Model (3) is given by
where is a solution to (4), and the semiparametric information bound for is .
(1) If , then cannot be estimated at the root-n rate.
In the rest of the paper we shall assume that , and hence is a root-n estimable regular parameter. We note that, by definition, the efficient score (indeed any moment condition proportional to an influence function) automatically satisfies the orthogonal moment condition.
2.2 Efficient influence function equation based procedure
From the remark above, the semiparametric efficient influence function for takes the form
(6) 
Denote
It is clear that is the unique solution to the efficient IF equation , that is
One efficient estimator, , for is simply based on the sample version of the efficient IF equation with plug-in consistent estimates of all the nuisance functions:
In this paper can be any of various ANN sieve minimum distance estimators (see below), but, for simplicity, the nuisance functions and are estimated by plug-in linear sieve estimators.
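A minimal sketch of such a plug-in linear sieve (series) estimator of a conditional mean follows; the function name and the polynomial basis in the example are ours, chosen for illustration under the assumption that nuisance conditional means are fit by ordinary least squares on a sieve basis.

```python
import numpy as np

def sieve_regression(basis, W, Y):
    """Linear sieve (series) estimator of w -> E[Y | W = w].

    basis : callable mapping an n x d array to an n x J design matrix.
    Returns a callable evaluating the fitted conditional mean.
    """
    B = basis(W)
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)   # least-squares sieve coefficients
    return lambda w: basis(np.atleast_2d(w)) @ coef
```

In practice the sieve dimension J would grow slowly with the sample size; here it is fixed by the chosen basis.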
2.3 Optimally weighted SMD procedure
Another efficient estimator for can be found by optimallyweighted sieve minimum distance, where the population criterion is (see AC12):
(7) 
The discrepancy measure is the expectation of the two moment conditions, conditional on their respective σ-fields:
and the optimal weight matrix is diagonal and proportional to the inverse variance of each moment condition:
A sieve minimum distance estimator for may be constructed by (i) replacing expectations with sample means, (ii) replacing conditional expectations with projections onto linear sieve bases, (iii) replacing the optimal weight matrix with a consistent estimator, and (iv) replacing the infinite-dimensional optimization with a finite-dimensional optimization over a sieve space for . This paper focuses on approximating by ANN sieves. In particular, a sample analogue of the above objective function is
where and are estimators of and respectively; see Sections 3 and 4 below for examples of different estimators. Let be a sieve for (and in this paper we focus on various ANN sieves). We define the optimally weighted SMD estimator as an approximate solution to
This is an estimator proposed in AC12.
Two remarks are in order. First, note that the optimal weight matrix is diagonal because and are uncorrelated by design. Second, since the optimal weight matrix is diagonal and is a free parameter, we can view the minimization as sequential:
This is important because the ability to solve the model sequentially while maintaining efficiency plays a role in computing the estimators.
We may analyze the asymptotic properties of this estimator. Since the optimally weighted SMD problem can be viewed either as a minimum distance program or as a sequential GMM estimator, we may carry out two separate analyses of its asymptotic properties. The analysis as a minimum distance problem specializes Ai and Chen (2003, 2007, 2012) and Chen and Pouzo (2015), while the analysis as a sequential moment restriction specializes Chen and Liao (2015); see the Appendix.
2.4 Analysis as optimally weighted SMD
We are interested in a functional of the parameter . A simple linearization shows that^8 Let be a real-valued function whose domain is some topological vector space . The notation denotes the directional derivative of with respect to in the direction of , which is
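The footnote's displayed formula is missing from this version; under the standard (Gateaux) definition—writing $f$ for the functional, $h$ for the point of evaluation, and $v$ for the direction, names assumed here—it reads:

\[
\frac{df(h)}{dh}[v] \;=\; \lim_{t \to 0} \frac{f(h + t v) - f(h)}{t}.
\]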
A key insight is that we may define an inner product over the space of , , as
where and .
The linear operator then admits a Riesz representation, where is the Riesz representer.
Next, we analyze the Riesz representer and the inner product. First, observe that by picking we can simplify some terms in the Riesz representer:
which implies that
We can now consider the inner product, which turns out to have a representation as a sample mean, from a local expansion of the criterion function (for the precise argument, see Ai and Chen, 2003, 2007):
(8) 
We then have a heuristic derivation of the influence function
(9) 
Riesz Representer. Lastly, we need to characterize the Riesz representer . The argument in AC03 parametrizes as a “scale times direction” coordinate. For a fixed scale , the minimum norm property of Riesz representers implicitly defines
(10) 
Solving the condition
by plugging in then yields
the solutions for the representers, where is defined in (10) above. If we assume the completeness condition, then is the unique solution to (4) or (5), and
Consistency, root-n asymptotic normality, and consistent variance estimation can all be obtained by directly applying AC03 and AC07 for single-hidden-layer ANN sieves. The results of Chen, Liao and Wang (2021) can be applied for multilayer ANN sieves.
2.4.1 Identity weighted SMD
This is a special case of Ai and Chen (2007, Section 4.2). We include the asymptotic linear expansion for the sake of comparison. In particular, the influence function associated with the identity-weighted SMD estimator is of the form
where
(11) 
Consistency, root-n asymptotic normality, and consistent variance estimation can all be obtained by directly applying Ai and Chen (2007) for single-hidden-layer ANN sieves.
3 Implementation of the estimators
We now explain the implementation of the various estimators for the average derivative, as these tend to be complex, especially when we would like to estimate functionals of NPIV models efficiently. In this section, we describe in broad strokes the construction of the estimators for the average derivative, which often involve estimation of nuisance parameters and functions. These nuisance parameters—which often take the form of known transformations of conditional means and variances—require further choices of estimation routines and tuning parameters, the details of which are relegated to Section 4.2.
Quick map of estimation procedures
We provide a simple map that connects the above models and approaches to the estimators we use. Implementation details follow in the remainder of this section.

For SMD estimators [PISMD, OPOSMD]: Solve sample and sieve version of (7) (Section 3.1)

Standard error for SMD estimators: Estimate the components of the influence functions in (9), and take the sample variance. (Section 3.3)

Score estimators [IS, ES]: Estimate the components of the influence functions, set the sample influence function to zero, and solve for . (Section 3.2)
Additionally, we describe the estimator when the analyst is willing to assume more semiparametric structure (e.g., partial linearity) on the structural function. We conclude the section with a brief discussion of software implementation issues.
A note on notation
Recall that we use to denote the outcome, to denote variables (endogenous or exogenous) that are included in the structural function, and to denote exogenous variables that are excluded from the structural function. Certain entries of and may be shared. Again, the NPIV moment condition is
(12) 
and we are interested in , where is the partial derivative of with respect to its first argument, evaluated at . Let
collect the data, viewed as random variables in the population.
We also set up notation for objects related to the sample. Let there be a sample of observations. We denote by and the vector and matrix, respectively, of realized values of the random vector . We slightly abuse notation and write , for a function , to mean the matrix of outputs obtained by applying row-wise, and similarly for expressions of the type .^9
This notation conforms with how vector operations are broadcast in popular numerical software packages, such as Matlab and the Python scientific computing ecosystem (NumPy, SciPy, PyTorch, etc.).
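For concreteness, the row-wise broadcast convention and the projection matrix onto the column space of a basis matrix can be illustrated in a few lines of NumPy (all names here are illustrative):

```python
import numpy as np

# Row-wise broadcast: f(X) stacks f applied to each row of the data matrix X.
n = 5
X = np.linspace(0.0, 1.0, n).reshape(-1, 1)
f = lambda x: np.hstack([np.ones_like(x), x, x ** 2])   # small polynomial basis
B = f(X)                                                # n x 3 design matrix

# Projection matrix onto the column space of B: P = B (B'B)^{-1} B'.
P = B @ np.linalg.solve(B.T @ B, B.T)
```

As a sanity check, a projection matrix is symmetric and idempotent, with trace equal to the rank of the basis.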
For a vector-valued function , we let be the projection matrix onto the column space of .

3.1 Sieve minimum distance (SMD) estimators
Consider a linear sieve basis for , where . For a sample of realizations of , is the sample best mean square linear predictor (that approximates the conditional mean) of , since it returns the fitted values of a regression of on flexible functions of :
Under the NPIV restriction (12), taking and , we should expect
This motivates the analogue of the SMD criterion (7) in the sample, where we choose so as to minimize the size of the projected residual :
(13) 
When the norm chosen is the usual Euclidean norm , we obtain the identity-weighted SMD estimator for , .
Given a preliminary estimator for , we may estimate the residual conditional variance by forming the estimated residuals and projecting them onto , e.g., via the linear sieve basis or via other nonparametric regression techniques such as nearest neighbors. With such an estimator of the heteroskedasticity, we can form a weight matrix . Using the corresponding weighted norm in (13) yields the optimally weighted SMD estimator for , .
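When the structural function is itself approximated by a linear sieve, both SMD estimators have closed forms, which makes the construction transparent. The sketch below is our own illustrative code (not the ANN implementation, for which no closed form exists): it computes the identity-weighted estimator, and the optimally weighted one when estimated conditional variances are supplied.

```python
import numpy as np

def smd_linear_sieve(Y, Psi, B, sigma2=None):
    """Sieve minimum distance with a linear sieve for the structural function.

    Psi    : n x k sieve design matrix for h (so h_hat(x) = psi(x) @ coef).
    B      : n x J instrument sieve design matrix b(W).
    sigma2 : optional n-vector of estimated conditional variances; if given,
             the criterion is optimally weighted, otherwise identity weighted.
    """
    n = len(Y)
    P = B @ np.linalg.solve(B.T @ B, B.T)          # projection onto the instrument sieve
    D = np.eye(n) if sigma2 is None else np.diag(1.0 / np.asarray(sigma2))
    A = P.T @ D @ P                                 # weighted, projected criterion matrix
    coef = np.linalg.solve(Psi.T @ A @ Psi, Psi.T @ A @ Y)
    return coef
```

With identity weighting this reduces to the familiar sieve two-stage least squares formula.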
With an estimate of the structural function , we can form two plug-in estimators of . The first is the simple plug-in estimator:
The simple plug-in estimator does not take into account the covariance between the two moment conditions, and . The second estimator, the orthogonalized plug-in estimator, orthogonalizes the second moment against the first:
where is an estimator of the population projection coefficient of the second moment condition onto the first moment condition :
(14) 
One choice of is to plug in sample counterparts—plugging in for , plugging in a preliminary estimator (which could be ) for , and plugging in an estimator for —and finally approximating via a linear sieve regression, say with the basis .
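As a sketch, the orthogonalized plug-in estimator can be assembled from these ingredients as follows. The function names, the choice of a unit weight function, and the exact form of the projection coefficient (a ratio of two sieve regressions on the instrument basis) are our reading of the construction and should be treated as illustrative rather than definitive.

```python
import numpy as np

def orthogonalized_plugin(Y, X, h_hat, dh_hat, B):
    """Orthogonalized plug-in estimator of the average derivative (a sketch).

    h_hat, dh_hat : callables giving the fitted structural function and its
                    partial derivative w.r.t. the first argument (precomputed,
                    e.g. by an ANN or spline SMD fit).
    B             : n x J instrument sieve design matrix b(W), used to
                    approximate the projection coefficient by sieve regression.
    """
    rho1 = Y - h_hat(X)                 # first (NPIV) residual
    rho2 = dh_hat(X)                    # second-moment component (unit weight)
    P = B @ np.linalg.solve(B.T @ B, B.T)
    # Sieve estimate of the projection coefficient of rho2 onto rho1.
    gamma = (P @ (rho1 * rho2)) / (P @ (rho1 ** 2))
    # Orthogonalize the second moment against the first, then average.
    return np.mean(rho2 - gamma * rho1)
```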
To summarize, the SMD estimator can be implemented as follows.
Identity Weighted SMD Estimator of

Sieve for Conditional Expectation: choose a sieve basis for (more details on this later).

Construct Objective function

Obtain the sample least squares projection of onto

Optimizing : define

Optimal SMD Estimator of

Same as Step (1) above

Estimate Weight Function : with a preliminary estimator of (for instance, the identity-weighted one), form an estimator by projecting on , the sieve basis for , to obtain . Form .

Optimizing : define .
Estimators for

Simple Plug-In Estimator. Given an estimator of , use

Orthogonalized Plug-In Estimator

Obtain an estimator of . One can use with being, for example, the simple plug-in estimator, and the above estimator of the variance of the first moment.

Orthogonal Plug-In Estimator. Obtain

Combining the simple plug-in estimator with identity-weighted SMD yields the estimation procedure that we term PISMD, and combining the orthogonal plug-in estimator with optimally weighted SMD yields the estimation procedure that we call OPOSMD.
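The simple plug-in step in PISMD—averaging the estimated partial derivative over the sample—can be sketched as below. With an ANN fit one would typically use automatic differentiation; we substitute central finite differences here to keep the illustration self-contained.

```python
import numpy as np

def simple_plugin_beta(h_hat, X, eps=1e-5):
    """Simple plug-in estimator of the average partial derivative:
    average a numerical derivative of the fitted structural function
    with respect to its first argument over the sample points."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Xp, Xm = X.copy(), X.copy()
    Xp[:, 0] += eps                     # perturb only the first coordinate
    Xm[:, 0] -= eps
    return np.mean((h_hat(Xp) - h_hat(Xm)) / (2.0 * eps))
```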
3.2 Influence functionbased estimators
We also implement influence-function-based estimators. As highlighted in the previous section, one influence function estimator for takes the following form
(15) 
with defined below. Moreover, given an estimator for and for , we can form the influence function estimator:
Identity score estimator (IS)
One influence function, which corresponds to that of the PISMD estimator, has taking the following form. We refer to the resulting influence function estimator as IS, for identity score.
(16)  
(17)  
(18) 
Efficient score estimator (ES)
On the other hand, the efficient influence function (ES) uses a different :
where is as in (14), and
(19)  