Why Machine Learning Cannot Ignore Maximum Likelihood Estimation

The growth of machine learning as a field has been accelerating with increasing interest and publications across fields, including statistics, but predominantly in computer science. How can we parse this vast literature for developments that exemplify the necessary rigor? How many of these manuscripts incorporate foundational theory to allow for statistical inference? Which advances have the greatest potential for impact in practice? One could posit many answers to these queries. Here, we assert that one essential idea is for machine learning to integrate maximum likelihood for estimation of functional parameters, such as prediction functions and conditional densities.



page 12

page 18


Maximum Likelihood Estimation from a Tropical and a Bernstein–Sato Perspective

In this article, we investigate Maximum Likelihood Estimation with tools...

Maximum Likelihood Estimation in Gaussian Process Regression is Ill-Posed

Gaussian process regression underpins countless academic and industrial ...

Boltzmann Machine Learning with the Latent Maximum Entropy Principle

We present a new statistical learning paradigm for Boltzmann machines ba...

Bethe Learning of Conditional Random Fields via MAP Decoding

Many machine learning tasks can be formulated in terms of predicting str...

Solving Schrödinger Bridges via Maximum Likelihood

The Schrödinger bridge problem (SBP) finds the most likely stochastic ev...

Generative Visual Dialogue System via Adaptive Reasoning and Weighted Likelihood Estimation

The key challenge of generative Visual Dialogue (VD) systems is to respo...

A Unifying Framework for Some Directed Distances in Statistics

Density-based directed distances – particularly known as divergences – b...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The Avalanche of Machine Learning The growth of machine learning as a field has been accelerating with increasing interest and publications across fields, including statistics, but predominantly in computer science. How can we parse this vast literature for developments that exemplify the necessary rigor? How many of these manuscripts incorporate foundational theory to allow for statistical inference? Which advances have the greatest potential for impact in practice? One could posit many answers to these queries. Here, we assert that one essential idea is for machine learning to integrate maximum likelihood for estimation of functional parameters, such as prediction functions and conditional densities.

The statistics literature proposed the familiar maximum likelihood estimators (MLEs) for parametric models and established that these estimators are asymptotically linear such that their

-scaled and mean-zero-centered version is asymptotically normal [22]

. This allowed the construction of confidence intervals and formal hypothesis testing. However, due to the restrictive form of these models, these parametric model-based MLEs target a projection of the true density on the parametric model that is hard to interpret. In response to this concern with misspecified parametric models, a rich statistical literature has studied so-called sieve-based or sieve MLEs involving specifying a sequence of parametric models that grow towards the desired true statistical model

[11, 20]. Such a sieve will rely on a tuning parameter that will then need to be selected with some method such as cross-validation.

Simultaneously, and to a large degree independently from the statistics literature, the computer science literature developed a rich set of tools for constructing data-adaptive estimators of functional parameters, such as a density of the data, although this literature mostly focused on learning prediction functions. This has resulted in a wealth of machine learning algorithms using a variety of strategies to learn the target function from the data. In addition, a more recent literature in statistics established super learning based on cross-validation as a general approach to optimally combine a library of such candidate machine learning algorithms into an ensemble that performs asymptotically as well as an oracle ensemble under specified conditions [14, 29]. Although the latter framework is optimal for fast learning of the target function, it lacks formal statistical inference for smooth features of the target function and for the target function itself.

In this chapter, we will argue that, in order to preserve statistical inference, we should preference sieve MLEs as much as possible. If the target function is not a data density, one can use, more generally, minimum loss estimation with an appropriate loss function. For example, if the goal is to learn the conditional mean of an outcome, one could use the squared-error loss and use minimum least squares estimation. Therefore, in this chapter, the abbreviation MLE also represents the more general minimum loss estimator.

We will demonstrate the power of sieve MLEs with a particular sieve MLE that also relies on the least absolute shrinkage and selection operator (LASSO) [23] and is termed the highly adaptive LASSO minimum loss estimator (HAL-MLE) [1, 24]. We will argue that the HAL-MLE is a particularly powerful sieve MLE, generally theoretically superior to other types of sieve MLEs. Moreover, for obtaining statistical inference for estimands that aim to equal or approximate a causal quantity defined in an underlying causal model, we discuss the combination of HAL-MLE with targeted maximum likelihood estimators (TMLEs) [30, 31, 32] in HAL-TMLEs as well as using undersmoothed HAL-MLE by itself as a powerful statistical approach [8, 24, 25, 31]. HALs can be used for the initial estimator in TMLEs as well as for nuisance functions representing the treatment and missingness/censoring mechanism. We will also clarify why many machine learning algorithms are not suited for statistical inference by having deviated from sieve MLEs. This chapter is a compact summary of recent and ongoing research; further background and references can be found in the literature cited. Sections 2-4 and 7 are more technical in nature, introducing core definitions and properties. Some readers may be interested in jumping to the material in Sections 5, 6, and 8 for technical but narrative discussions on contrasting HAL-MLE with other estimators as well as implementation of HAL-MLE and HAL-TMLE.

2 Nonparametric MLE and Sieve MLE

Before introducing the HAL-MLE, we describe the technical foundations of nonparametric and sieve MLEs.

2.1 Nonparametric MLE

We have observational unit on which we observe a random vector of measurements

with probability distribution

(e.g., with covariates , intervention or treatment , and outcome ). Consider the case that we observe i.i.d. copies

from the common probability measure

and suppose that we know that is an element of a set of possible probability distributions, which is called the statistical model. Here, the underscore indicates the true unknown probability measure. In addition, suppose that our target functional parameter is defined by a mapping from the probability measures into the space of -variate real-valued cadlag functions [10].

Let be the parameter space for this functional parameter. Moreover, we assume that the target functional estimand can be characterized as the minimizer: , for some loss function also denoted with when viewed as a function of . A first natural attempt for estimation of would be to define the MLE over the entire parameter space:

where we use empirical process notation .

However, for realistic statistical models , the parameter space is generally infinite dimensional and too flexible such that the minimizer of the empirical risk will overfit to the data and results in an ill-defined or a statistically poor estimator.

2.2 Sieve MLE

A sieve MLE recognizes this need to regularize the nonparametric MLE, and proposes a sequence of subspaces indexed by some Euclidean-valued tuning parameter that measures the complexity of the space [11, 20]. In addition, the sieve is able to approximate the target function such that for a sequence of values , approximates the entire parameter space as approximates infinity. These spaces are restricted in a manner so that an MLE: , is well defined for all , or, at least, for a range of values that approximates the complete index set as sample size grows. We can then define a cross-validation selector involving a -fold sample splitting of the sample of observations where the -th sample split yields an empirical distribution and of the validation and training sample, respectively. Uniform bounds on this loss function implies that this cross-validation selector performs as well as the oracle selector: [27, 28, 34].

2.3 Score Equations Solved by Sieve MLE

A key property of a sieve MLE is that it solves score equations. That is, one can construct a family of one dimensional paths through at indexed by a direction , and we have:

We can define a path-specific score: , for all . This mapping will generally be a linear operator in and can therefore be extended to a linear operator on a Hilbert space generated or spanned by these directions . One will then also have that for any in the linear span of , thereby obtaining that solves a space of scores. For scores that can be approximated by scores for certain , this might provide the basis for showing that is solved up to the desired approximation. In particular, one can apply this class of score equations at to obtain the score equations solved by the cross-validated sieve MLE .

3 Special Sieve MLE: HAL-MLE

We now introduce a special sieve MLE, the HAL-MLE. The HAL-MLE is generally theoretically superior to other types of sieve MLEs, with its statistical properties presented in the subsequent section.

3.1 Definition of HAL-MLE

A particular sieve MLE is given by the HAL-MLE. It starts with a representation , where , and is a generalized difference for an infinitesimal small cube in the -dimensional Euclidean space with respect to the measure generated by the -specific section that sets the coordinates outside equal to zero [1, 5, 24]. This representation shows that any cadlag function can be represented as a linear combination of indicator basis functions indexed by knot point , across all subsets of with coefficients given by . In particular, if

is such that all its sections are like discrete cumulative distribution functions, then this representation of

is literally a finite linear combination of these indicator basis functions. We note that each basis function

is a tensor product

of univariate indictor basis functions , functions that jump from to at knot point . Moreover, the -norm of these coefficients in this representation is the so-called sectional variation norm:

This sectional variation norm of represents a measure of the complexity of . Functions that can be represented as would correspond with for all subsets of size larger than or equal to . Such functions would have a dramatically smaller sectional variation norm than a function that requires high-level interactions. If the dimension , then this sectional variation norm is just the typical variation norm of a univariate function obtained by summing the absolute values of its changes across small intervals. For dimensional , this is defined the same way, but now one sums so-called generalized differences , representing the measure assigns to the small 2-dimensional cube . Similarly, one defines generalized differences for general -dimensional function as the measure it would give to a -dimensional cube , treating the function as a -variate cumulative distribution function.

This measure of complexity now defines the sieve as all cadlag functions with a sectional variation norm smaller than . The sieve-based MLE is then defined as: . Using a discrete support for each -specific section , we can represent with , where represent the knot points. Thus, we obtain a standard LASSO estimator:

and resulting . As above, is then selected with the cross-validation selector . This can generally be implemented with LASSO software implementations such as glmnet in R [4]. Additionally, HAL9001 provides implementations for linear, logistic, Cox, and Poisson regression, which also provides HAL estimators of the conditional hazards and intensities [7].

3.2 Score Equations Solved by HAL-MLE

In order to define a set of score equations solved by this HAL-MLE we can use as paths for a vector . In order to keep the -norm constant along this path, we then enforce the constraint , where

It can be verified that now for in a neighborhood of . Therefore, we know that the HAL-MLE solves the score equation for each of these paths:

This shows that the HAL-MLE solves a class of score equations. Moreover, this result can be used to prove that

for all for which . That is, the -norm constrained HAL-MLE also solves the unconstrained scores solved by the MLE over the finite linear model if implied by the fit . By selecting the -norm to be larger, this set of score equations approximates any score, thereby establishing that the HAL-MLE, if slightly undersmoothed, will solve score equations up to uniformly over all scores. For further details we refer prior work [25].

This capability of the HAL-MLE to solve all score equations, even uniformly over a class that will contain any desired efficient influence curve of a target parameter, provides the fundamental basis for establishing its remarkably strong asymptotic statistical performance in estimation of smooth features of as well as of itself.

4 Statistical Properties of the HAL-MLE

Having introduced the special sieve MLE, the HAL-MLE , we now further enumerate its statistical properties.

4.1 Rate of Convergence

Firstly, , where is the loss-based dissimilarity. This generally establishes that some -norm of will be [2, 24].

4.2 Asymptotic Efficiency for Smooth Target Features of Target Function

Let be a pathwise differentiable target feature of and let be its canonical gradient at , possibly also depending on a nuisance parameter . Let be its exact second-order remainder. By selecting to be large enough so that , we obtain that is asymptotically linear with influence curve . That is, is an asymptotically efficient estimator of . The only conditions are that , where one can use that . Generally speaking, the latter will imply that the exact remainder has the same rate of convergence. In fact, in double robust estimation problems, this exact remainder equals zero (because it is at ).

4.3 Global Asymptotic Efficiency

Moreover, now consider a very large class of pathwise differentiable target features indexed by some , where is the canonical gradient and is the exact remainder. Then, by undersmoothing enough so that , that is, we use global undersmoothing, we obtain that is an asymptotically linear estimator of having influence curve , with a remainder that satisfies , so that converges weakly to a Gaussian process as a random function in function space. That is, is an asymptotically efficient estimator of in a sup-norm sense. In particular, it allows construction of simultaneous confidence bands. We note that this also implies that , a kernel smoother of at is an asymptotically efficient estimator of , a kernel smoother of the true target function . In fact, this holds uniformly in all , and for a fixed we have weak convergence of to a Gaussian process. That is, the undersmoothed HAL-MLE is an efficient estimator of the kernel smoothed functional of , for any .

This may make one wonder if

itself is not asymptotically normally distributed as well? While not a currently solved problem, we conjecture that, indeed,

converges weakly to a normal distribution, where the rate of convergence may be as good as . If this results holds, then the HAL-MLE also allows formal statistical inference for the function itself!

4.4 Nonparametric Bootstrap for Inference

It has been established that the nonparametric bootstrap consistently estimates the multivariate normal limit distribution of the plug-in HAL-MLE for a vector of target features [3]

. In fact, it will consistently estimate the limit of a Gaussian process of the plug-in HAL-MLE for an infinite class of target features as discussed above. The approach is computationally feasible and still correct by only bootstrapping the LASSO-based fit of the model that was selected by the HAL-MLE; the bootstrap is only refitting a high-dimensional linear regression model with at most

basis functions (and generally much smaller than ). This allows one to obtain inference that also attains second-order behavior. In particular, obtaining the bootstrap distribution for each -norm larger than or equal to the cross-validation selector of the -norm and selecting the -norm at which the confidence intervals plateau provides a robust finite sample inference procedure [3].

It will also be of interest to understand the behavior of the nonparametric bootstrap in estimating the distribution of the HAL estimator of the target function itself (say, at a point). Before formally addressing this, we need to first establish that the HAL-MLE is asymptotically normal as conjectured in a previous subsection.

4.5 Higher-Order Optimal for Smooth Target Features

In recent work, higher-order TMLEs were proposed that target extra score equations beyond the first-order canonical gradient of the target feature, which are selected so that the exact second-order remainder for the plug-in TMLE will become a higher-order difference (say, third-order for the second order TMLE) [33]. Because an undersmoothed HAL-MLE also solves these extra score equations, it follows that the plug-in HAL-MLE, when using some undersmoothing, is not just asymptotically efficient but also has a reduced exact remainder analogue to the higher-order TMLE. (More on TMLEs in later sections.)

The key lesson is that the exact remainder in an expansion of a plug-in estimator can be represented as an expectation of a score: , for some direction . For example, in the nonparametric estimation of , where and , we have , which can be written as , which is an expectation of a score of the form . The HAL-MLE approximately solves for some approximation of , thereby reducing the remainder to . The latter is a term that is —only relying on the consistency of —not even a rate is required. Higher-order TMLE directly targets these scores to map the remainder in higher and higher order differences, but the undersmoothed HAL-MLE is implicitly doing the same (although not solving the score equations exactly).

This demonstrates that solving score equations has enormous implications for first- and higher-order behavior of a plug-in estimator. Typical machine learning algorithms are generally not tailored to solve score equations, and, thereby, will not be able to achieve such statistical performance for their plug-in estimator. In fact, most machine learning algorithms are not grounded in any asymptotic limit distribution theory.

5 Contrasting HAL-MLE with Other Estimators

One might wonder: what is particularly special about this sieve in the HAL-MLE relative to other sieve MLEs, such as those using Fourier basis or polynomial basis, wavelets, or other sequences of parametric models. Also, why might we prefer it over other general machine learning algorithms?

5.1 HAL-MLE vs. Other Sieve MLE

The simple answer is that these sieve MLEs generally do not have the (essentially) dimension-free/smoothness-free rate of convergence , but instead their rates of convergence heavily rely on assumed smoothness. HAL-MLE uses a global complexity property, the sectional variation norm, rather than relying on local smoothness. The global bound on the sectional variation norm provides a class of cadlag functions that has a remarkably nice covering number , hardly depending on the dimension . Due to this covering number, the HAL-MLE has this powerful rate of convergence—only assuming that the true target function is a cadlag function with finite sectional variation norm.

A related advantage of the special HAL sieve is that the union of all indicator basis functions is a small Donsker class, even though it is able to span any cadlag function. Most sets of basis functions include “high frequency” type basis functions that have a variation norm approximating infinity. As a consequence, these basis functions do not form a nice Donsker class. In particular, this implies that the HAL-MLE does not overfit, as long as the sectional variation norm is controlled. The fact that HAL-MLE itself is situated in this Donsker class also means that the efficient influence curves at HAL-MLE fits will fall in a similar size Donsker class. As a consequence, the Donsker class condition for asymptotic efficiency of plug-in MLE and TMLE is naturally satisfied when using the HAL-MLE, while other sieve-based estimators easily cause a violation of the Donsker class condition.

This same powerful property of the Donsker class spanned by these indicator basis functions also allows one to prove that nonparametric bootstrap works for the HAL-MLE, while, generally speaking, the nonparametric bootstrap generally fails to be consistent for machine learning algorithms.

Another appealing feature of the HAL sieve is that it is only indexed by the -norm, while many sieve MLEs rely on a precise specification of the sequence of parametric models that grow in dimension. It should be expected that the choice of this sequence can have a real impact in practice. HAL-MLE does not rely on an ordering of basis functions, but rather it just relies on a complexity measure. For each value of the complexity measure, it includes all basis functions and represents an infinite dimensional class of functions rich enough to approximate any function with complexity satisfying this bound. A typical sieve MLE can only approximate the true target function, while the HAL-MLE includes the true target function when the sectional variation norm bound exceeds the sectional variation norm of the true target function.

As mentioned, the HAL-MLE solves the regular score equations from the data-adaptively selected HAL-model at rate . As a consequence, the HAL-MLE is able to uniformly solve the class of all score equations—only restricted by some sectional variation norm bound, where this bound can go to infinity as sample size increases. This strong capability for solving score equations appears to be unique for HAL-MLE relative to other sieve-based estimators. It may be mostly due to actually being an MLE over an infinite function class (for a particular variation norm bound). We also note that a parametric model-based sieve MLE would be forced to select a small dimension to avoid overfitting. However, the HAL-MLE adaptively selects such a model among all possible basis functions, and the dimension of this data-adaptively selected model will generally be larger. The latter is due to the HAL-MLE only being an -regularized MLE for that adaptively selected parametric model, and thereby does not overfit the score equations, while the typical sieve MLE would represent a full MLE for the selected model.

By solving many more score equations approximately HAL-MLE can span a much larger space of scores than a sieve MLE that solves many fewer score equations perfectly.

5.2 HAL-MLE vs. General Machine Learning Algorithms

Sieve MLEs solve score equations and, thereby, are able to approximately solve a class of score equations—possibly enough to approximate efficient influence curves of the target features of interest, and thus be asymptotically efficient for these target features. We discussed how these different sieve MLE can still differ in their performance in solving score equations, and that the HAL-MLE appears to have a unique strategy by choice of basis functions and measure of complexity. Thus, this gives it a distinct capability to solve essentially all score equations at a desired approximation error.

Many machine learning algorithms fail to be an MLE over any subspace of the parameter space. Such algorithms will have poor performance in solving score equations. As a consequence, they will not result in asymptotically linear plug-in estimators and will generally be overly biased and nonrobust.

6 Considerations for Implementing HAL-MLE

Having presented core definitions, properties, and comparisons, we now turn to some additional considerations for implementing HAL-MLEs.

6.1 HAL-MLE Provides Interpretable Machine Learning

The role of interpretable algorithms is a major consideration for many applications [16]. Is HAL-MLE interpretable? The HAL-MLE is a finite linear combination of indicator basis functions, so-called tensor products of zero-order splines, and the number of basis functions is generally significantly smaller than . HAL-MLE has also been generalized to higher-order splines so that it is forced to have a certain smoothness. These higher-order spline HAL-MLE have the same statistical properties as reviewed above, as long as the true function satisfies the enforced level of smoothness. Such smooth HAL-MLE can be expected to result in even sparser fits, e.g., a smooth monotone function can be fitted with a few first-order splines (piecewise linear) while it needs relatively many knot points when fitting with a piecewise constant function. Either way, the HAL-MLE is a closed form object that can be evaluated and is thus interpretable. Therefore, the HAL-MLE has the potential to play a key role in interpretable machine learning.

6.2 Estimating the Target Function Itself

HAL-MLE requires selecting the set of knot points and, thereby, the collection of spline basis functions. The largest set of knot points we have recommended (and suffices for obtaining the nonparametric rate of convergence is given by: , which corresponds (for continuous covariates) to basis functions. The LASSO-based fit will then select a relatively small subset () of this user-supplied collection.

Rather than selecting this full set of basis functions, one can incorporate model assumptions. This could include only selecting up to two-way tensor products, ranking the basis functions by their sparsity (i.e., proportion of 1’s among the observations) and selecting the top , and specifying a specific additive model using standard glm-formula notation, such as , and selecting the knot points accordingly. In particular, one could use some glmnet fit using main terms and standard interactions to decide on this type of additive model. In addition, one can specify that the coefficients of a certain set of basis functions should be nonnegative and others should be non-positive, thereby enforcing monotonicity of functions in the additive model. Finally, one can select among zero order and more generally -th order spline basis functions, thereby specifying a desired smoothness of the HAL-MLE.

To reduce the computational burden of the implementation of HAL-MLE, one can randomly subset from a large set of basis functions or subset in a deterministic manner.

For example, for a continuous covariate , rather than using as knot points ,

, one selects only 5 knot points—the 5 quantiles of the

observations . In this manner, the number of basis functions will not grow linearly in , it is now growing by a fixed factor . Similarly, the above restriction on the degree of the tensor products to only reduces the -factor to . Therefore, such choices reduces the total number of basis functions from to around basis functions.

In certain cases, one might view some of these restrictions as a specification of the actual statistical model. For example, the statistical model might assume that is an additive model including any one-way, two-way, and three-way function. However, in general, many of these model choices for the HAL-MLE will be hard to defend based on prior knowledge, although they might result in a statistically improved estimator relative to using the most nonparametric HAL-MLE. Therefore, we recommend building a super learner whose library contains a variety of such HAL-MLE specifications. In addition, by ranking these HAL-MLE fits by their complexity, one could implement a cross-validation scheme that implements and evaluates estimators from least complex to increasingly complex and stops when the cross-validated risk of the next HAL-MLE drops below the performance of the less complex previous HAL-MLE. In this manner, the resulting super learner avoids having to implement the highly computer intensive HAL-MLEs, except when they are really needed. Because the discrete super learner is asymptotically equivalent with the oracle selector, the resulting discrete super learner will perform as well as the best possible HAL-MLE, thereby also inheriting the rate of convergence of the most nonparametric HAL-MLE.

6.3 Using the HAL-MLE for Smooth Feature Inference

By specifying the class of target features for which we want the HAL-MLE to be efficient, we could undersmooth the HAL-MLE until all the efficient score equations (one for each target feature) are solved at the desired level. In this case, the above discrete super learner is not making the right trade-off. One might still be able to solve all the score equations under a restricted model choice, such as only including up to three-way tensor product basis functions. Thus, one might still rank various HAL-MLE specifications from least complex to maximal complex, but would now keep increasing complexity of the HAL-MLE according to this ranking until the desired class of score equations are solved at the desired level (i.e., uniformly in all target features).

7 HAL and TMLE

Given the bevy of statistical properties for the HAL-MLE, one might wonder why one would need TMLE [30, 32]? However, the finite sample statistical properties of the HAL-MLE rely on solving the desired score equations up to a small enough error, even though the HAL-MLE has the statistical capacity of solving essentially all score equations up to an asymptotic rate . On the other hand, if the HAL-MLE is too undersmoothed, then its performance in estimating the target function deteriorates, and that can also have a detrimental effect on statistical performance. Targeted undersmoothing can deal with this to some degree by precisely finding the level of undersmoothing that solves the desired target equations at the preferred level (which could be , where

is a standard error of the efficient influence curve of the target feature). Nonetheless, in finite samples it might simply not be possible to achieve this goal.

7.1 Targeting a Class of Target Features

Therefore, to achieve improved finite sample performance of the HAL-MLE for a desired target feature or class of target features, it follows to use TMLE with an initial HAL-MLE estimator [24]. In this manner, the user has precise control over a set of target score equations while the HAL-MLE will provide important benefits by solving a large class of score equations and having a good rate of convergence.

Let be a HAL-MLE, possibly a discrete super learner based on a library of different HAL-MLE specifications. Let be our class of target features and let be the corresponding class of canonical gradients. We can then construct a universal least favorable path through at , possibly using an initial estimator , and resulting MLE: , so that . Such a TMLE update step satisfies the identity:

Because , and assuming a HAL-MLE also has , under a strong positivity assumption, we will have:

In addition, because HAL-MLEs fall in the well understood Donsker class of cadlag functions with a uniform bound on their sectional variation norm [2], it generally also follows that is a Donsker class with the same covering number rate as this cadlag function class. Therefore, we will also have:

where the rate of the remainder would follow from using the empirical process finite sample asymptotic equicontinuity results [2]. As a consequence, we have:

In particular, it follows that is an asymptotically efficient estimator of , possibly viewed as elements in function space endowed with a supremum norm [24].

Therefore, in great generality, we have that TMLEs that use HAL as an initial estimator are asymptotically efficient for the target features targeted by the TMLE assuming only that the nuisance functions are cadlag, have finite sectional variation norm, and a strong positivity assumption.

It is not required that the HAL-MLE is undersmoothed for TMLE. The solving of the score equations is carried out by the TMLE so that the HAL-MLE can be optimized for estimating and . In particular, one can now use the discrete super learner discussed above with a library of HAL-MLEs.

7.2 Score Equation Preserving TMLE Update

Suppose that the HAL-MLE was undersmoothed and thereby was itself already a great estimator for these target features and also for other target features that were not targeted by the TMLE. In that case, could the TMLE destroy the score equations solved by the HAL-MLE and deteriorate some of the good performance of the initial HAL-MLE, such as its properties in estimation of other features or its higher-order properties in estimation of the actual target features targeted by the TMLE? That is, given an initial estimator has already succeeded in solving a large class of important score equations, making it potentially globally efficient and higher-order efficient for a large class of features, we should be worried about the TMLE not preserving these score equations during its TMLE update step solving the desired target score equations up to user supplied error.

This motivates us to generalize the TMLE update step to a TMLE update that is not only solving the target score equations but also preserves the score equations already solved by the initial estimator.

In particular, this is a motivation for using universal least favorable paths in the definition of the TMLE update, because such paths are only maximizing the likelihood in the direction of the target score equations, thereby not affecting any score equation orthogonal to these target score equations. However, in addition, one might do the following. We already specified the score equations exactly solved by the HAL-MLE above, one score for each coefficient that has a nonzero coefficient corresponding with a path that keeps the -norm constant. Given this specified set of score equations and its linear span of scores, we could compute the projection of the first-order canonical gradient onto the space spanned by these scores, and subtract it from , resulting in an orthogonalized .

One now defines the TMLE using the universal least favorable path based on this orthogonalized set of scores rather than . In this case, the TMLE update will still solve the score equations in that were solved by the initial HAL-MLE , and, in addition, it will solve the score equations . As a consequence, it will also solve . So in this way, we have used the general TMLE to obtain a TMLE that not only targets a new set of score equations but also preserves the score equations already solved by the initial estimator .

In future work, we will study, implement, and evaluate this score equation preserving TMLE, and other variations of such score equation preserving TMLEs. The key message is that we will have further robustified the TMLE by not only solving the target score equations and preserving the rate of convergence of the initial estimator, but also preserving the score equations solved by the initial estimator with all its important additional statistical benefits.

8 Considerations for Implementing HAL-TMLE

Causal inference relies on a general roadmap involving defining a causal model, causal quantity of interest, establishing an identification result, and thereby defining a target estimand of the data distribution [12]. These steps then dictate the statistical estimation problem in terms of data, statistical model, and target estimand [13, 30]. For estimation and statistical inference, the underlying causal model and its assumptions play no role, such that one now just proceeds with the roadmap of targeted learning given the statistical model and target estimand [19, 30]. That is, we can: compute the canonical gradient of the target estimand, define an initial estimator, describe a TMLE update of this initial estimator, and then provide confidence intervals. In particular, the above plug-in HAL-MLE and score equation preserving TMLE using the HAL-MLE as initial estimator can now be applied. In addition, one can develop the higher-order TMLE using the HAL-MLE as initial estimator [33]. We now focus on additional considerations for implementing HAL-TMLEs.

8.1 Double Robustness

Many causal inference estimation problems have a double robust structure in the sense that the exact remainder has a cross-product structure that can be naturally bounded by a product of and for certain -norms (typically -norms), or, equivalently, in terms of and . This structure implies that in randomized trials or observational studies where there is substantial knowledge of the censoring and treatment mechanism , one can obtain excellent finite sample performance for TMLEs utilizing the resulting fast converging estimator of . In such a particular appeal due to its ability to remove all bias in the initial estimator when .

8.2 HAL for Treatment and Censoring Mechanisms

Using HAL for (respecting the known model for ) has important benefits. In particular, even if is inconsistent, the TMLE will remain asymptotically linear if is somewhat undersmoothed. In fact, if one uses a nonparametric HAL-MLE ignoring the model for , and the full data model is nonparametric, then the TMLE will even be efficient, despite the inconsistency of . Similarly, the inverse probability of treatment and censoring weighted estimator using an undersmoothed HAL-MLE is asymptotically linear in these causal inference problems, and its efficiency is maximized by using an undersmoothed nonparametric HAL-MLE. On the other hand, in such a setting, if one would use other machine learning algorithms, including a super learner, these same estimators would lose their asymptotic linearity and not even converge at the desired -rate—failing to provide valid inference.

Therefore, we learn that it is not only beneficial to use HAL-MLE as initial estimator of due to above mentioned reasons, but it is also highly beneficial to use a HAL-MLE for .

Let’s consider the case where it is known that the treatment mechanism only depends on confounders. By estimating with a model-based HAL-MLE, perhaps undersmoothed, the above arguments show important gains. However, the above arguments also state that ignoring the model for will even make it more efficient. Therefore, there is an important selection problem among candidate HAL-MLEs of

that have varied complexity, ranging from the actual model to a fully nonparametric HAL-MLE. Cross-validation would then select the model-based HAL-MLE with high probability and thus be ignorant of the subtle bias-variance trade-offs at stake.

More generally, selecting among candidate estimators of based on the log-likelihood loss can be problematic when the positivity assumption is practically violated. For example, this could result in an estimator that approaches zero, and, as a consequence, results in erratic TMLE updates. This is typically resolved by truncation, but one still needs to select the truncation level. Similarly, the adjustment set used by might need to be tailored toward the MSE of the TMLE of the target estimand using , rather than toward the estimation of . For example, some baseline covariates might be highly predictive of treatment while not being real confounders (like instrumental variables), and it is well established that adjustment for such variables can easily increase both variance and bias [6, 9, 15, 17, 18]. Therefore, an important feature of causal inference problems is the targeted selection among candidate estimators of .

8.3 Collaborative TMLE

TMLE provides a natural approach for such a selection by evaluating the performance of a candidate estimator of with the increase of the log-likelihood during the TMLE update step. The MLE of in the universal least favorable path through the initial estimator corresponds with fitting the target estimand itself. Specifically, it is the path along which the ratio of the squared change in target estimand by increase in log-likelihood is maximized, i.e., the slope in change of estimand per unit increase in log-likelihood is maximal. Therefore, this increase in log-likelihood during the TMLE update step indeed provides a targeted criterion for evaluating the performance of a candidate estimator of . After having selected an optimal choice with respect to this criterion, one can iterate the process by making the TMLE update the initial estimator and repeating this greedy selection, but now only selecting among candidate estimators that are more complex (say, higher log-likelihood) than the one selected at the previous step. In this manner, we obtain a sequence of TMLE updates of with corresponding increasingly complex estimators, . And can now be selected with cross-validation based on the loss function for , or a AIC/BIC type selector to avoid the double cross-validation. This approach can now be used, for example, to select the -norm in a HAL-MLE, the truncation level, but also to select the maximal degree of the tensor products, or maximal sparsity of the basis functions, and so on [3, 8, 25, 33, 26]. Simulations have demonstrated that this collaborative (C-TMLE) procedure can dramatically outperform a TMLE that uses pure likelihood based estimation of when there are positivity issues [8]. One can also let a HAL-MLE fit of impact the basis functions included in the HAL-MLE of , and use this C-TMLE procedure to select the -norm in the HAL-MLE of and the resulting HAL-MLE of (as well as the truncation level). The latter type of estimator was termed the outcome adaptive HAL-TMLE [8], which built on prior work in outcome-adaptive LASSOs [21], providing a powerful tool for variable selection in , and was shown to have strong practical performance.

9 Closing

There is a tendency for the machine learning literature to focus on piecemeal and small extensions of the ‘flashy’ estimator of the moment. This was recently random forests, but is currently deep learning. However, the statistical theory and empirical process literature has offered strong guidance on the development of data-adaptive estimators for features of the data distribution that provide inference, while fully utilizing the knowledge of a statistical model.

In particular, for optimal robust (higher-order) estimation of target estimands, we need to solve specific first- and higher-order canonical gradients of the target estimand. Also, having the functional parameter of the data distribution needed for estimation of the target estimand as a member of a function class with a well behaving entropy integral (as implied by the covering number of the function class) provides good rates of convergence and allows for bootstrap based inference. HAL-MLE and HAL-TMLE appear to satisfy these key fundamental properties. It would be exciting for the general machine learning literature to build on these areas.


  • [1] David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In

    2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

    , pages 689–696. IEEE, 2016.
  • [2] Aurélien Bibaut and Mark van der Laan. Fast rates for empirical risk minimization over cadlag functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244, 2019.
  • [3] Weixin Cai and Mark van der Laan. Nonparametric bootstrap inference for the targeted highly adaptive least absolute shrinkage and selection operator (LASSO) estimator. The International Journal of Biostatistics, 16(2), 2020.
  • [4] Jerome Friedman, Trevor Hastie, Rob Tibshirani, Balasubramanian Narasimhan, Kenneth Tay, Noah Simon, and Junyang Qian. Package glmnet . CRAN R Repository, 2021.
  • [5] Richard Gill, Mark van der Laan, and Jon Wellner. Inefficient estimators of the bivariate survival function for three models. Annales de l’IHP Probabilités et statistiques, 31(3):545–597, 1995.
  • [6] Sander Greenland. Invited commentary: variable selection versus shrinkage in the control of multiple confounders. American Journal of Epidemiology, 167(5):523–529, 2008.
  • [7] Nima Hejazi, Jeremy Coyle, and Mark van der Laan.

    hal9001: Scalable highly adaptive lasso regression in R.

    Journal of Open Source Software

    , 5(53):2526, 2020.
  • [8] Cheng Ju, David Benkeser, and Mark van der Laan. Robust inference on the average treatment effect using the outcome highly adaptive LASSO. Biometrics, 76(1):109–118, 2020.
  • [9] Jessica Myers, Jeremy Rassen, Joshua Gagne, Krista Huybrechts, Sebastian Schneeweiss, Kenneth Rothman, Marshall Joffe, and Robert Glynn. Effects of adjusting for instrumental variables on bias and precision of effect estimates. American Journal of Epidemiology, 174(11):1213–1222, 2011.
  • [10] Georg Neuhaus. On weak convergence of stochastic processes with multidimensional time parameter. The Annals of Mathematical Statistics, 42(4):1285–1295, 1971.
  • [11] Whitney Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168, 1997.
  • [12] Judea Pearl. Causality. Cambridge University Press, 2009.
  • [13] Maya Petersen and Mark van der Laan. Causal models and learning from data: integrating causal modeling and statistical estimation. Epidemiology, 25(3):418, 2014.
  • [14] Eric Polley, Sherri Rose, and Mark van der Laan. Super learning. In Targeted Learning: Causal Inference for Observational and Experimental data, pages 43–66. Springer, 2011.
  • [15] Andrea Rotnitzky, Lingling Li, and Xiaochun Li. A note on overadjustment in inverse probability weighted estimation. Biometrika, 97(4):997–1001, 2010.
  • [16] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
  • [17] Enrique Schisterman, Stephen Cole, and Robert Platt. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology, 20(4):488, 2009.
  • [18] Sebastian Schneeweiss, Jeremy Rassen, Robert Glynn, Jerry Avorn, Helen Mogun, and M. Alan Brookhart. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology, 20(4):512, 2009.
  • [19] Megan Schuler and Sherri Rose. Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology, 185(1):65–73, 2017.
  • [20] Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591, 1997.
  • [21] Susan M Shortreed and Ashkan Ertefaie. Outcome-adaptive lasso: variable selection for causal inference. Biometrics, 73(4):1111–1122, 2017.
  • [22] Stephen Stigler. The epic story of maximum likelihood. Statistical Science, pages 598–620, 2007.
  • [23] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [24] Mark van der Laan. A generally efficient targeted minimum loss based estimator based on the highly adaptive LASSO. The International Journal of Biostatistics, 13(2), 2017.
  • [25] Mark van der Laan, David Benkeser, and Weixin Cai. Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive LASSO. arXiv preprint arXiv:1908.05607, 2019.
  • [26] Mark van der Laan, Antoine Chambaz, and Cheng Ju. C-tmle for continuous tuning. In Targeted Learning in Data Science, pages 143–161. Springer, 2018.
  • [27] Mark van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Technical Report 130, Division of Biostatistics, University of California, Berkeley, 2003.
  • [28] Mark van der Laan, Sandrine Dudoit, and Aad van der Vaart. The cross-validated adaptive epsilon-net estimator. Statistics & Decisions, 24(3):373–395, 2006.
  • [29] Mark van der Laan, Eric Polley, and Alan Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
  • [30] Mark van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media, 2011.
  • [31] Mark van der Laan and Sherri Rose.

    Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies

    Springer, 2018.
  • [32] Mark van der Laan and Daniel Rubin. Targeted maximum likelihood learning. The international Journal of Biostatistics, 2(1), 2006.
  • [33] Mark van der Laan, Zeyi Wang, and Lars van der Laan. Higher order targeted maximum likelihood estimation. arXiv preprint arXiv:2101.06290, 2021.
  • [34] Aad van der Vaart, Sandrine Dudoit, and Mark van der Laan. Oracle inequalities for multi-fold cross validation. Statistics & Decisions, 24(3):351–371, 2006.