# On Statistical Efficiency in Learning

A central issue of many statistical learning problems is to select an appropriate model from a set of candidate models. Large models tend to inflate the variance (or overfitting), while small models tend to cause biases (or underfitting) for a given fixed dataset. In this work, we address the critical challenge of model selection to strike a balance between model fitting and model complexity, thus gaining reliable predictive power. We consider the task of approaching the theoretical limit of statistical learning, meaning that the selected model has the predictive performance that is as good as the best possible model given a class of potentially misspecified candidate models. We propose a generalized notion of Takeuchi's information criterion and prove that the proposed method can asymptotically achieve the optimal out-sample prediction loss under reasonable assumptions. It is the first proof of the asymptotic property of Takeuchi's information criterion to our best knowledge. Our proof applies to a wide variety of nonlinear models, loss functions, and high dimensionality (in the sense that the models' complexity can grow with sample size). The proposed method can be used as a computationally efficient surrogate for leave-one-out cross-validation. Moreover, for modeling streaming data, we propose an online algorithm that sequentially expands the model complexity to enhance selection stability and reduce computation cost. Experimental studies show that the proposed method has desirable predictive power and significantly less computational cost than some popular methods.


## I Introduction

How much knowledge can we learn from a given set of data? Statistical modeling provides a simplification of real-world complexity. It can be used to learn key representations from available data and to predict future data. To model the data, typically the first step for data analysts is to narrow the scope by specifying a set of candidate parametric models (referred to as the model class). The model class can be determined by exploratory studies or scientific reasoning. For data with specific types and sizes, each postulated model may have its advantages. In the second step, data analysts estimate the parameters and fitting performance of each candidate model. An illustration of a typical learning procedure is plotted in Fig. 1, where the underlying data generating process may or may not be included in the model class. Selecting the model with the best fitting performance usually leads to suboptimal results. For example, the largest model always fits the best in a nested model class. But an overly complex model can lead to inflated variance in parameter estimation and thus overfitting. Therefore, the third step is to apply a suitable model selection procedure, which will be elaborated in the next section.

How can we quantify the theoretical limits of learning procedures? We first introduce the following definition that quantifies the predictive power of each candidate model.

###### Definition 1 (Out-sample prediction loss)

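The loss function for each sample size $n$ and $\alpha \in \mathcal{A}_n$ (model class) is a map $l : \mathcal{Z} \times \mathcal{H}_n[\alpha] \times \mathcal{A}_n \to \mathbb{R}$, usually written as $l(z, \theta; \alpha)$, where $\mathcal{Z}$ is the data domain, $\mathcal{H}_n[\alpha]$ is the parameter space associated with model $\alpha$, and $\alpha$ is included to emphasize the model under consideration. As Fig. 1 shows, for a loss function and a given dataset $z_1, \ldots, z_n$ which are independent and identically distributed (i.i.d.), each candidate model $\alpha$ produces an estimator $\hat{\theta}_n[\alpha]$ (referred to as the minimum loss estimator) defined by

$$\hat{\theta}_n[\alpha] \triangleq \arg\min_{\theta \in \mathcal{H}_n[\alpha]} \frac{1}{n} \sum_{i=1}^{n} l(z_i, \theta; \alpha). \tag{1}$$

Moreover, given $\hat{\theta}_n[\alpha]$ by candidate model $\alpha$, the out-sample prediction loss, denoted by $\mathcal{L}_n[\alpha]$ and also referred to as the generalization error in machine learning, is defined by

$$\mathcal{L}_n[\alpha] \triangleq E_* l(\cdot, \hat{\theta}_n[\alpha]; \alpha) = \int_{\mathcal{Z}} p(z)\, l(z, \hat{\theta}_n[\alpha]; \alpha)\, dz. \tag{2}$$

Here, $E_*$ denotes the expectation with respect to the distribution of a future unseen random variable $z$ (conditional on the observed data). We also define the risk by

$$R_n[\alpha] = E_{*,o} \mathcal{L}_n[\alpha],$$

where the expectation in $R_n[\alpha]$ is taken with respect to the observed data.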

The above notation applies to both supervised and unsupervised learning. In supervised learning, $z$ often consists of a label $y$ and feature $x$, and only the entries of $x$ associated with $\alpha$ are involved in the evaluation of $l(z, \theta; \alpha)$. In Statistical Learning Theory, the loss may be written as $l(y, f_\alpha(x))$, where $f_\alpha$ is the estimated function under $\alpha$. We note that the loss function here is not tied to a particular parametrization of $f_\alpha$. In comparison, our earlier notion of $l(z, \theta; \alpha)$ involves model parameters to develop technical results in this paper. The parameterization may represent regression coefficients from basis expansions (e.g., polynomials, splines, and wavelets) or distributional parameters in a finite mixture model.

Throughout the paper, we consider loss functions such that $\mathcal{L}_n[\alpha]$ is always nonnegative. A common choice is to use the negative log-likelihood of model $\alpha$ minus that of the true data-generating model (or its closest candidate model). Table I lists some other loss functions widely used in machine learning. We also provide two motivating examples below.

###### Example 1 (Generalized linear models)

In a generalized linear model (GLM), each response variable $y$ is assumed to be generated from a particular distribution (e.g., Gaussian, Binomial, Poisson, Gamma), with its mean $\mu$ linked with potential covariates $x_1, x_2, \ldots$ through $g(\mu) = \beta_1 x_1 + \beta_2 x_2 + \cdots$ where $g(\cdot)$ is a link function. In this example, data $z = [y, x_1, x_2, \ldots]^T$, unknown parameters are $\theta = [\beta_1, \beta_2, \ldots]^T$, and models are subsets of $\{\beta_1, \beta_2, \ldots\}$. We may be interested in the most appropriate distribution family as well as the most significant variables $x_j$'s (relationships).

###### Example 2 (Neural networks)

In establishing a neural network (NN) model, we need to choose the number of neurons and hidden layers, activation function, and connectivity configuration. In this example, data are similar to those of the above example, and unknown parameters are the weights on connected edges. Clearly, with a larger number of neurons and connections, more complex functional relationships can be modeled. However, selecting large models may result in overfitting and higher computational complexity.

Based on Definition 1, a natural way to define the limit of statistical learning is by using the optimal prediction loss.

###### Definition 2 (Oracle performance)

For given data (of size $n$) and model class $\mathcal{A}_n$, the oracle performance is defined as $\min_{\alpha \in \mathcal{A}_n} \mathcal{L}_n[\alpha]$, the optimal out-sample prediction loss offered by candidate models.

The oracle performance is associated with three key elements: data, loss function, and model class. Motivated by the original derivation of Akaike information criterion (AIC) [1, 2] and Takeuchi's information criterion (TIC) [3], we propose a penalized selection procedure and prove adaptivity to the oracle under some regularity assumptions. Those assumptions allow a wide variety of loss functions, model classes (i.e., nested, non-overlapping or partially-overlapping), and high dimensions (i.e., the models' complexity can grow with sample size). It is worth noting that asymptotic analysis for a fixed number of candidate models with fixed dimensions is generally straightforward. Under some classical regularity conditions (e.g., [4, Theorem 19.28]), the likelihood-based principle usually selects the model that attains the smallest Kullback-Leibler divergence from the data generating model. However, our high dimensional setting considers models whose dimensions and parameter spaces may depend on sample size. Thus we cannot directly use the technical tools that have been used in the classical asymptotic analysis for misspecified models. We will develop some new technical tools in our proof. Our theoretical results extend the classical statistical theory on AIC for linear (fixed-design) regression models to a broader range of generalized linear or nonlinear models. Moreover, we also review the conceptual and technical connections between cross-validation and information criteria. In particular, we show that the proposed procedure can be much more computationally efficient than cross-validation (with comparable predictive power).

Why is it necessary to consider a high dimensional model class, in the sense that the number of candidate models or each model's complexity is allowed to grow with sample size? In the context of regression analysis, technical discussions that address the question have been elaborated in [5, 6]. Here, we give an intuitive explanation for a general setting. We let $\theta_n^*[\alpha]$ denote the minimum loss parameter defined by

$$\theta_n^*[\alpha] \triangleq \arg\min_{\theta \in \mathcal{H}_n[\alpha]} E_* l(\cdot, \theta; \alpha). \tag{3}$$

We note that $\theta_n^*[\alpha]$ for some models such as neural networks may not be unique. Using Taylor expansion under some regularity conditions, $\mathcal{L}_n[\alpha]$ may be expressed as

$$\mathcal{L}_n[\alpha] = E_* l(\cdot, \theta_n^*[\alpha]; \alpha) + \frac{1}{2} (\hat{\theta}_n[\alpha] - \theta_n^*[\alpha])^T V_n(\theta_n^*[\alpha]; \alpha) (\hat{\theta}_n[\alpha] - \theta_n^*[\alpha]) \{1 + o_p(1)\}, \tag{4}$$

where $V_n(\theta; \alpha) = E_* \nabla_\theta^2 l(\cdot, \theta; \alpha)$, and $o_p(1)$ is a sequence of random variables that converges to zero in probability. The main idea of (4) is to expand $\mathcal{L}_n[\alpha]$ at a projection point $\theta_n^*[\alpha]$ under some uniform convergence condition in its vicinity. Theoretical justifications of (4) or its variants for a model whose dimension depends on $n$ have been studied in several earlier works, e.g., in [7, 8, 9]. The out-sample prediction loss consists of two additive terms: the bias and the variance. Large models tend to reduce the bias but inflate the variance (overfitting), while small models tend to reduce the variance but increase the bias (underfitting) for a given fixed dataset. Suppose that "all models are wrong", meaning that the data generating model is not included in the model class. Usually, the bias is non-vanishing (with $n$) for a fixed model complexity (say $d$), and it is approximately a decreasing function of $d$; on the other hand, the variance vanishes at rate $n^{-1}$ for a fixed $d$, and it is an increasing function of $d$. Suppose for example that the bias and variance terms are approximately and , respectively, for some positive constants . Then the optimal is at the order of .

In view of the above arguments, as more data become available, the model complexity needs to be enlarged to strike a balance between bias and variance (or approach the oracle). To illustrate, we generated data from a logistic regression model whose covariates are independent standard Gaussian. We consider a nested model class, and the loss function is chosen to be the negative log-likelihood. We summarize the results in Fig. 4. As model complexity increases, the model fitting as measured by in-sample loss improves (Fig. 4(a)). In contrast, the predictive power, as measured by the out-sample prediction loss, first improves and then deteriorates after some "optimal dimension" (Fig. 4(b)). Also, the optimal dimension becomes larger as the sample size increases.

As data sequentially arrive, the selected model from our proposed method (and many other existing methods such as cross-validation) suffers from fluctuations due to randomness. A conceptually appealing and computationally efficient way is to move from small models to larger models sequentially. For that purpose, based on the proposed method, we further propose a sequential model expansion strategy that aims to facilitate interpretability of learning.

The outline of the paper is given as follows. In Section III, we propose a computationally efficient method that determines the most appropriate learning model as more data become available. We prove that the oracle performance can be asymptotically approached under some regularity assumptions. In Section IV, we propose a model expansion technique building upon a new online learning algorithm, which we refer to as "graph-based" learning. The online learning algorithm may be interesting on its own as it exploits graphical structure when updating the expert systems and computing the regrets. In the supplementary material available at [10], we experimentally demonstrate the applications of the proposed methodology to generalized linear models and neural networks in selecting the variables/neurons. The related open-source code is also provided.

## II Related Work

A wide variety of model selection techniques have been proposed in the past fifty years, motivated by different viewpoints and justified under various circumstances. We refer to [11, 12, 13, 14, 15] for more surveys. This section briefly reviews some closely related work in information criterion and cross-validation, and includes a derivation of TIC.

### II-A Information Criteria

Examples of penalized selection include final prediction error criterion [16], AIC [1, 2], TIC [3], BIC [17] and its Bayesian counterpart Bayes factor [18], minimum description length criterion [19], Hannan and Quinn criterion [20], predictive minimum description length criterion [21, 22], Mallows' $C_p$ method [23], generalized information criterion (GIC) [24, 25, 5], generalized cross-validation method (GCV) [26], the Goldenshluger-Lepski method [27, 28, 29], and the bridge criterion (BC) [30]. Recently, a regularization approach named information criterion estimation (ICE) [31] was proposed that extends TIC to handle non-MLE estimates in over-parameterized models. An extension of AIC and Mallows' $C_p$ method is the 'slope heuristics' approach proposed in [32, 33] for Gaussian model selection and later developed to more general settings [34, 35, 36]. The main idea of slope heuristics is to recognize the existence of a minimal penalty such that the out-sample prediction loss of the selected model with lighter penalties explodes, and to show that a penalty equal to twice the minimal penalty often enables model selection that meets the inequality $\mathcal{L}_n[\hat{\alpha}_n] \le C_n \min_{\alpha \in \mathcal{A}_n} \mathcal{L}_n[\alpha] + \varepsilon_n$, also called oracle inequality, for $C_n$ close to 1 and $\varepsilon_n$ negligible with respect to the value of $\min_{\alpha \in \mathcal{A}_n} \mathcal{L}_n[\alpha]$. In theory, the asymptotic efficiency is a limiting requirement of the oracle inequality with $C_n \to 1$ and $\varepsilon_n$ remaining negligible as $n \to \infty$. There have been fruitful results in non-asymptotic quantifications of $C_n$ and $\varepsilon_n$ using concentration inequalities (see e.g. [37, 38, 39, 40, 35, 36]). Non-asymptotic analysis is often based on concentration inequalities or Stein's method [12, 41]. In this work, we are not looking for oracle inequalities with non-asymptotic analysis. On the other hand, the recent development of slope heuristics has motivated the data-driven construction of penalty terms instead of using pre-determined penalty functions. An example in this direction is the dimension jump method [33, 42], which, for a given penalty shape, identifies the suitable multiplicative constant by searching for a significant jump of the selected dimension against different constants.

### II-B Cross-validation (CV)

The basic idea of cross-validation [43, 44] is to split the data into two parts: one for training and one for testing. The model with the best testing performance is selected, hoping that it will perform well for future data. It is a common practice to apply a 10-fold CV, 5-fold CV, 2-fold CV, or 30%-for-testing. In general, the advantages of the CV method are its stability and easy implementation. However, is cross-validation really the best choice?

In fact, it has been shown that only the delete-$d$ CV method with $d/n \to 1$ [45, 46, 47, 48], or the delete-1 CV method (or leave-one-out, LOO) [49] can exhibit asymptotic optimality. Specifically, the former CV exhibits the same asymptotic behavior as BIC, which is typically consistent in a well-specified model class (i.e., it contains the true data generating model), but is suboptimal in a misspecified model class. The latter CV is shown to be asymptotically equivalent to AIC/TIC and GCV under suitable conditions [49, 5], which is asymptotically efficient in a misspecified model class but usually overfits in a well-specified model class. An appropriate choice of the splitting ratio often depends on specific learning tasks, such as the prediction of unobserved data, selection of model, selection of other criteria [50], and goodness-of-fit test [51]. We refer to [5, 52, 14, 30, 15] for more detailed discussions on the discrepancy and reconciliation of different CVs.

In particular, for the prediction purpose, the common folklore practices of using $k$-fold or 30%-for-testing CV are asymptotically suboptimal (in the sense of Definition 3), even in linear regression models [5]. Since the only optimal CV is LOO-type in misspecified settings, it is more appealing to apply AIC or TIC, which gives the same asymptotic performance and reduces the computational complexity by $n$ times. For general misspecified nonlinear model class, we shall prove that the GTIC procedure asymptotically approaches the oracle. While the asymptotic performance of LOO is not clear in that case, it is typically more complex to implement. To demonstrate that, we shall provide some experimental studies in the supplementary material. As a result, the GTIC procedure can be a promising competitor of various standard CVs adopted in practice.

### II-C Background of TIC

TIC [3] was heuristically derived as an alternative of AIC, also from an information-theoretic view rooted in Kullback-Leibler (KL) divergence. Recall that AIC selects a model that minimizes the negative maximum log-likelihood value plus the model dimension. In the seminal work of [49], TIC is shown to be asymptotically equivalent to cross-validation when the purpose is to minimize the KL divergence, and AIC is a special case of TIC when the models under consideration are well-specified. It appears neither widely appreciated nor used [53] compared with other information criteria such as AIC or Bayesian information criterion (BIC) [17]. In terms of provable asymptotic performance, only AIC is known to be asymptotically efficient for variable selection in regression models [54] and autoregressive order selection in time series models [55, 56] when models are misspecified. Conceptually, TIC was proposed as a surrogate for AIC in general misspecified settings, but the optimality of AIC and TIC in the general context remains unknown. As the original paper of TIC [3] was not written in English, we review it for the completeness of the paper. Similar derivations can be found in, e.g., [31].

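Suppose that our goal is to select the model $\alpha$ that minimizes the logarithmic loss $E_*\{-\log p_{\hat{\theta}_n[\alpha]}(z)\}$ (or equivalently, minimizes the KL divergence from the true data-generating distribution), where $\hat{\theta}_n[\alpha]$ is the MLE under model $\alpha$. For notational convenience, we drop the model index $\alpha$ and focus on one model. The motivation of TIC was to approximate $E_*\{-\log p_{\hat{\theta}_n}(z)\}$ by $n^{-1} \sum_{i=1}^{n} \{-\log p_{\hat{\theta}_n}(z_i)\} + \lambda_n$, where the first term is computable from data and the second term is to be asymptotically approximated. Under some regularity conditions, the classical sandwich formula of MLE [57, Theorem 3.2] gives $\sqrt{n}(\hat{\theta}_n - \theta^*) \to_d \mathcal{N}(0, V^{-1} J V^{-1})$ for some $\theta^*$ in the parameter space, with

$$V = -E_* \frac{\partial^2 \log p_{\theta^*}(z)}{\partial \theta^2}, \qquad J = E_* \left\{ \frac{\partial \log p_{\theta^*}(z)}{\partial \theta} \frac{\partial \log p_{\theta^*}(z)}{\partial \theta^T} \right\}.$$

Applying Taylor expansion at $\theta^*$, we have

$$\begin{aligned}
E_*\{-\log p_{\hat{\theta}_n}(z)\} &\approx E_*\{-\log p_{\theta^*}(z)\} + \frac{1}{2} (\hat{\theta}_n - \theta^*)^T V (\hat{\theta}_n - \theta^*), \\
n^{-1} \sum_{i=1}^{n} \{-\log p_{\hat{\theta}_n}(z_i)\} &\approx n^{-1} \sum_{i=1}^{n} \{-\log p_{\theta^*}(z_i)\} - (\hat{\theta}_n - \theta^*)^T n^{-1} \sum_{i=1}^{n} \frac{\partial \log p_{\theta^*}(z_i)}{\partial \theta} + \frac{1}{2} (\hat{\theta}_n - \theta^*)^T V (\hat{\theta}_n - \theta^*),
\end{aligned}$$

and thus

$$\lambda_n = E_*\{-\log p_{\hat{\theta}_n}(z)\} - n^{-1} \sum_{i=1}^{n} \{-\log p_{\hat{\theta}_n}(z_i)\} \approx (\hat{\theta}_n - \theta^*)^T n^{-1} \sum_{i=1}^{n} \frac{\partial \log p_{\theta^*}(z_i)}{\partial \theta} \tag{5}$$

for large $n$. Using

$$n^{-1} \sum_{i=1}^{n} \frac{\partial \log p_{\theta^*}(z_i)}{\partial \theta} = n^{-1} \sum_{i=1}^{n} \frac{\partial \log p_{\theta^*}(z_i)}{\partial \theta} - n^{-1} \sum_{i=1}^{n} \frac{\partial \log p_{\hat{\theta}_n}(z_i)}{\partial \theta} \approx n^{-1} \sum_{i=1}^{n} \frac{\partial^2 \log p_{\theta^*}(z_i)}{\partial \theta^2} (\theta^* - \hat{\theta}_n)$$

and the asymptotic normality of $\hat{\theta}_n$, we may further approximate $\lambda_n$ by $n^{-1} \mathrm{tr}(V^{-1} J)$, where $V$ and $J$ can be estimated by their sample analogs. For a well-specified model, we have $V = J$ and $\mathrm{tr}(V^{-1} J) = d$ with $d$ denoting the model dimension, and thus TIC becomes AIC.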
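As a small numerical illustration of the last point, the sketch below estimates the trace term $\mathrm{tr}(V^{-1}J)$ for a one-parameter working model $\mathcal{N}(\mu, 1)$. The working model, the simulated data, and the function name are assumptions made for this example only. When the data really have unit variance the estimate is close to the dimension (one); when the true variance is four, the estimate is close to four, so the penalty differs from the plain dimension count.

```python
import numpy as np

def tic_penalty_gaussian_mean(z):
    """Estimate tr(V^{-1} J) for the working model N(mu, 1) fitted by MLE.

    Loss: l(z, mu) = 0.5 * (z - mu)**2, the negative log-likelihood up to a constant.
    """
    mu_hat = z.mean()              # MLE of mu under the working model
    grad = -(z - mu_hat)           # per-sample gradient of the loss in mu
    J_hat = np.mean(grad ** 2)     # sample second moment of the gradient
    V_hat = 1.0                    # second derivative of the loss in mu
    return J_hat / V_hat

rng = np.random.default_rng(0)
n = 20_000
well_specified = rng.normal(loc=1.0, scale=1.0, size=n)  # variance matches the model
misspecified = rng.normal(loc=1.0, scale=2.0, size=n)    # true variance 4, model assumes 1

p_well = tic_penalty_gaussian_mean(well_specified)
p_mis = tic_penalty_gaussian_mean(misspecified)
print(p_well, p_mis)
```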

Why should TIC be preferred over AIC in nonlinear models in general? Intuitively speaking, TIC has the potential of exploiting the nonlinearity while AIC does not. Recall our Example 2 in the introduction, with the loss being the negative log-likelihood. It is well known from machine learning practice that neural network structures play a key role in effective prediction. However, information criteria such as AIC impose the same amount of penalty as long as the number of neurons remains the same, regardless of how neurons are configured.

In this paper, we extend the scope of allowable loss functions and theoretically justify the use of GTIC (and thus TIC). Under some regularity conditions (elaborated in the Appendix), we shall prove that the model $\hat{\alpha}_n$ selected by the GTIC procedure is asymptotically efficient (in the sense of Definition 3). This is formally stated as a theorem in Subsection III-C. Our theoretical results extend some existing statistical theories on AIC for linear models. We note that the technical analysis of high dimensional (non)linear model classes is highly nontrivial. We will develop some new technical tools in the Appendix, which may be of independent interest.

## III Adaptivity to the Oracle

### III-A Notation

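Let $\mathcal{A}_n$, $\alpha$, $d_n[\alpha]$, $\mathcal{H}_n[\alpha]$ denote respectively a set of finitely many candidate models (also called the model class), a candidate parametric model, its dimension, and its associated parameter space. Let $d_n$ denote the dimension of the largest candidate model. We will frequently use a subscript $n$ to emphasize the dependency on the sample size and include an $\alpha$ in the arguments of many variables or functions to emphasize their dependency on the model (and parameter space) under consideration. For a measurable function $f$, we define $E_n f = n^{-1} \sum_{i=1}^{n} f(z_i)$. For example, $E_n l(\cdot, \theta; \alpha) = n^{-1} \sum_{i=1}^{n} l(z_i, \theta; \alpha)$. We let $\psi_n(z, \theta; \alpha) = \nabla_\theta l(z, \theta; \alpha)$ and $\nabla_\theta \psi_n(z, \theta; \alpha) = \nabla_\theta^2 l(z, \theta; \alpha)$, which are respectively measurable vector-valued and matrix-valued functions of $\theta$. We define the matrices

$$V_n(\theta; \alpha) \triangleq E_* \nabla_\theta \psi_n(\cdot, \theta; \alpha), \qquad J_n(\theta; \alpha) \triangleq E_* \{\psi_n(\cdot, \theta; \alpha) \times \psi_n(\cdot, \theta; \alpha)^T\}.$$

Recall the definition of $\mathcal{L}_n[\alpha]$. Its sample analog (also referred to as the in-sample loss) is defined by $\hat{\mathcal{L}}_n[\alpha] = E_n l(\cdot, \hat{\theta}_n[\alpha]; \alpha)$. Similarly, we define

$$\hat{V}_n(\theta; \alpha) \triangleq E_n \nabla_\theta \psi_n(\cdot, \theta; \alpha), \qquad \hat{J}_n(\theta; \alpha) \triangleq E_n \{\psi_n(\cdot, \theta; \alpha) \times \psi_n(\cdot, \theta; \alpha)^T\}.$$

When $l$ is the negative log-likelihood, the above $\psi_n$ is the score function, and $\hat{V}_n$ and $\hat{J}_n$ are candidates for estimating a Fisher information matrix.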

Throughout the paper, the vectors are arranged in a column and marked in bold. Let $\|\cdot\|$ denote the Euclidean norm of a vector or spectral norm of a matrix. Let $\mathrm{int}(S)$ denote the interior of a set $S$. For any vector $c$ and scalar $\delta > 0$, let $B(c, \delta) = \{x : \|x - c\| \le \delta\}$. For a positive semidefinite matrix $V$ and a vector $x$ of the same dimension, we shall abbreviate $x^T V x$ as $\|x\|_V^2$. For a given probability measure $P$ and a measurable function $m$, let $\|m\|_P$ denote the $L_2(P)$-norm. Unless otherwise stated, $E_*$ denotes the expectation with respect to the true data generating process. Let $\mathrm{eig}_{\min}(V)$ (resp. $\mathrm{eig}_{\max}(V)$) denote the smallest (resp. largest) eigenvalue of a symmetric matrix $V$. For a sequence of scalar random variables $f_n$, we write $f_n = o_p(1)$ if $f_n \to 0$ in probability, and $f_n = O_p(1)$ if it is stochastically bounded. For a fixed measurable vector-valued function $f$, we define

$$G_n f \triangleq \sqrt{n} (E_n - E_*) f,$$

the empirical process evaluated at $f$. For $a, b \ge 0$, we write $a \lesssim b$ if $a \le c b$ for a universal constant $c$. For a vector $a$ or a vector-valued function $f$, we let $a_i$ or $f_i$ denote the $i$th component.

We use $\to$ and $\to_p$ to respectively denote the deterministic and in probability convergences. Unless stated explicitly, all the limits throughout the paper are with respect to $n \to \infty$ where $n$ is the sample size.

### III-B Approaching the Oracle – Selection Procedure

An appropriate model selection procedure is necessary to strike a balance between the model fitting and model complexity based on the observed data to obtain the optimal predictive power. The basic idea of penalized selection is to impose an additive penalty term on the in-sample loss so that larger models are more penalized. In this paper, we follow the aphorism that “all models are wrong”, and assume that the model class under consideration is misspecified.

###### Definition 3 (Efficient learning)

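Our goal is to select $\hat{\alpha}_n \in \mathcal{A}_n$ that is asymptotically efficient, in the sense that

$$\frac{\mathcal{L}_n[\hat{\alpha}_n]}{\min_{\alpha \in \mathcal{A}_n} \mathcal{L}_n[\alpha]} \to_p 1 \tag{6}$$

as $n \to \infty$.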

Note that this requirement is weaker than selecting the exact optimal model $\arg\min_{\alpha \in \mathcal{A}_n} \mathcal{L}_n[\alpha]$. Also, the concept of asymptotic efficiency in model selection is reminiscent of its counterpart in parameter estimation theory. A similar definition has been adopted in the study of the optimality of AIC in autoregressive order selection [55] and variable selection in linear regression models [54].

It is worth noting that the above definition is in the scope of the available data and a specified class of models. Because we are in a data-driven setting where it is unrealistic to compete with the best performance attainable with full knowledge of the underlying distribution, we choose the above rationale of efficient learning instead of using

$$\frac{\mathcal{L}_n[\hat{\alpha}_n]}{\min_{\alpha \in \mathcal{A}_n} E_* l(\cdot, \theta_n^*[\alpha]; \alpha)} \to_p 1,$$

whose denominator does not reveal the influence of finite-sample data. In other words, Definition 3 calls for a model whose predictive power can practically approach the best offered by the candidate models (i.e., the oracle in Definition 2).

A related but different school of thought is structural risk minimization in the statistical learning literature. In that context, the out-sample prediction loss is usually bounded using in-sample loss plus a positive term (e.g., a function of the Vapnik-Chervonenkis (VC) dimension [58] for a classification model). Definitive treatment of this line of work can be found in, e.g., [59, 60, 61, 40] and the references therein. The major difference of our setting compared with that in learning theory is our requirement that the positive term plus the in-sample loss should asymptotically approach the true out-sample loss (as sample size goes to infinity).

Another related notion often used to describe model selection performance is minimax-rate optimality [11, 52]. In nonparametric estimation of the regression function under the squared loss, tight minimax risk bounds have been obtained since the pioneering work of [62, 63] (see [64, 65] for more discussions). A model selection method is said to be minimax-rate optimal over a class of regression functions if the worst-case risk of its resulting estimator converges at the same rate as the aforementioned minimax risk, where the estimator is the least squares estimate of the regression function under the variables selected by the method. In contrast to the notion of asymptotic efficiency, which we focus on here, minimax-rate optimality allows the true data-generating model to vary and thus is a stronger requirement. The asymptotic efficiency is in a pointwise sense, meaning that a fixed but unknown data-generating process already generates the data. It has been proved that AIC is minimax-rate optimal for a range of variable selection tasks, and there exists no model selection method that achieves such optimality as well as selection consistency [52]. Meanwhile, it is possible to simultaneously combine asymptotic efficiency and selection consistency, and that motivated recent research in reconciling AIC-type and BIC-type model selection methods [66, 67, 50, 30].

We propose to use the following penalized model selection procedure, which extends TIC from negative log-likelihood to general loss functions.

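Generalized TIC (GTIC) procedure: Given data $z_1, \ldots, z_n$ and a specified model class $\mathcal{A}_n$, we select a model $\hat{\alpha}_n \in \mathcal{A}_n$ in the following way: 1) for each $\alpha \in \mathcal{A}_n$, find the minimal loss estimator $\hat{\theta}_n[\alpha]$ defined in (1), and record the minimum as $\hat{\mathcal{L}}_n[\alpha]$; 2) select $\hat{\alpha}_n = \arg\min_{\alpha \in \mathcal{A}_n} \mathcal{L}_n^c[\alpha]$, where

$$\mathcal{L}_n^c[\alpha] \triangleq \hat{\mathcal{L}}_n[\alpha] + n^{-1} \mathrm{tr}\{\hat{V}_n(\hat{\theta}_n[\alpha]; \alpha)^{-1} \hat{J}_n(\hat{\theta}_n[\alpha]; \alpha)\}. \tag{7}$$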

We note that the two additive terms on the right-hand side of (7) represent the fitting performance and the model complexity, respectively.
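The two steps of the procedure can be sketched for a nested class of logistic regression models with the negative log-likelihood loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`fit_logistic`, `gtic`, `select_nested`), the Newton solver, the tiny ridge term added for numerical stability, and the simulated coefficients are all hypothetical choices made for this example.

```python
import numpy as np

def fit_logistic(X, y, iters=50, ridge=1e-8):
    """Minimize the average negative log-likelihood by Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (p - y) / n
        hess = (X * (p * (1 - p))[:, None]).T @ X / n + ridge * np.eye(d)
        step = np.linalg.solve(hess, grad)
        theta -= step
        if np.linalg.norm(step) < 1e-10:
            break
    return theta

def gtic(X, y):
    """Return (corrected loss, in-sample loss) following the form of (7)."""
    n, _ = X.shape
    theta = fit_logistic(X, y)
    eta = X @ theta
    p = 1.0 / (1.0 + np.exp(-eta))
    in_sample = np.mean(np.logaddexp(0.0, eta) - y * eta)  # average negative log-likelihood
    psi = X * (p - y)[:, None]                              # per-sample gradients, shape (n, d)
    J_hat = psi.T @ psi / n                                 # sample analog of J_n
    V_hat = (X * (p * (1 - p))[:, None]).T @ X / n          # sample analog of V_n (Hessian)
    penalty = np.trace(np.linalg.solve(V_hat, J_hat)) / n
    return in_sample + penalty, in_sample

def select_nested(X, y, max_dim):
    """Evaluate nested models that use the first k columns, k = 1..max_dim."""
    scores = [gtic(X[:, :k], y) for k in range(1, max_dim + 1)]
    corrected = np.array([s[0] for s in scores])
    in_sample = np.array([s[1] for s in scores])
    return int(np.argmin(corrected)) + 1, corrected, in_sample

rng = np.random.default_rng(1)
n, d = 500, 20
X = rng.standard_normal((n, d))
beta = 2.0 / np.arange(1, d + 1) ** 2   # hypothetical decaying coefficients
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta))).astype(float)

k_hat, corrected, in_sample = select_nested(X, y, max_dim=d)
print("selected dimension:", k_hat)
```

In this sketch the in-sample loss never increases as columns are added, while the corrected loss adds a nonnegative trace penalty that grows with model size, so the selected dimension is typically smaller than the largest candidate.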

The quantity $\mathcal{L}_n^c[\alpha]$, also referred to as the corrected prediction loss, can be calculated from data. It serves as a surrogate for the out-sample prediction loss $\mathcal{L}_n[\alpha]$, which is usually not computable. The in-sample loss $\hat{\mathcal{L}}_n[\alpha]$ cannot be directly used as an approximation for $\mathcal{L}_n[\alpha]$, because it uses the sample approximation twice: once in the estimation of $\theta_n^*[\alpha]$, and then in the approximation of $E_* l(\cdot, \theta; \alpha)$ using $E_n l(\cdot, \theta; \alpha)$ (the law of large numbers). For example, in a nested model class, the largest model always has the least $\hat{\mathcal{L}}_n[\alpha]$ (i.e., fits data the best). But as we discussed in the introduction, $\mathcal{L}_n[\alpha]$ is typically decreasing first and then increasing as the dimension increases.

### III-C Asymptotic Analysis of the GTIC Procedure

We need the following assumptions for asymptotic analysis.

###### Assumption 1

Data are independent and identically distributed (i.i.d.).

Assumption 1 is standard for theoretical analysis and some practical applications. In the context of regression analysis, it corresponds to the random design. In our technical proofs, it is possible to extend the assumption of i.i.d. to strong mixing [68], which is more commonly assumed for time series data.

###### Assumption 2

For each model , (as was defined in (3)) is in the interior of the compact parameter space , and for all we have

for some constant that depends only on . Moreover, we have

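$$\sup_{\alpha \in \mathcal{A}_n} \sup_{\theta \in \mathcal{H}_n[\alpha]} \left| E_n l(\cdot, \theta; \alpha) - E_* l(\cdot, \theta; \alpha) \right| \to_p 0,$$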

as , and is twice differentiable in for all , .

Assumption 2 is the counterpart of the separated mode and uniform law of large numbers conditions that have been commonly required in proving the consistency of maximum likelihood estimator for classical statistical models (see, e.g., [4, Theorem 5.7]). The $\theta_n^*[\alpha]$ can be interpreted as the oracle optimum under model $\alpha$, or a "projection" point of the true data generating distribution onto the model $\alpha$.

###### Assumption 3

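There exist constants $\tau > 0$ and $\delta > 0$ such that

$$\sup_{\alpha \in \mathcal{A}_n} \sup_{\theta \in \mathcal{H}_n[\alpha] \cap B(\theta_n^*[\alpha], \delta)} n^{\tau} \left\| E_n \psi_n(\cdot, \theta; \alpha) - E_* \psi_n(\cdot, \theta; \alpha) \right\| = O_p(1).$$

Additionally, the map $\theta \mapsto E_* \psi_n(\cdot, \theta; \alpha)$ is differentiable at $\theta \in \mathrm{int}(\mathcal{H}_n[\alpha])$ for all $n$ and $\alpha \in \mathcal{A}_n$.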

Assumption 3 is a weaker statement compared with the central limit theorem and its extension to Donsker classes in a classical (non-high dimensional) setting. In our high dimensional setting, the assumption ensures that each projected model behaves regularly. It implicitly builds a relation between $d_n$, the dimension of the largest candidate models, and sample size $n$. As was pointed out by an anonymous reviewer, it is technically possible to replace this condition with a weaker requirement.

###### Assumption 4

There exist constants $c_1, c_2 > 0$ such that

$$\liminf_{n \to \infty} \min_{\alpha \in \mathcal{A}_n} \mathrm{eig}_{\min}(V_n(\theta_n^*[\alpha]; \alpha)) \ge c_1, \qquad \limsup_{n \to \infty} \max_{\alpha \in \mathcal{A}_n} \mathrm{eig}_{\max}(V_n(\theta_n^*[\alpha]; \alpha)) \le c_2.$$

Assumption 4 assumes that the second derivative of the out-sample prediction loss has bounded eigenvalues at the optimum $\theta_n^*[\alpha]$. The lower bound indicates that the loss function is strongly convex for all models, and the upper bound requires the loss functions to be reasonably smooth. This assumption is used in our asymptotic analysis to ensure reasonable Taylor expansions up to the second order.

###### Assumption 5

There exist fixed constants , , and measurable functions , for each , such that for all and ,

$$\|\psi_n(z, \theta_1; \alpha) - \psi_n(z, \theta_2; \alpha)\| \le m_n[\alpha](z) \|\theta_1 - \theta_2\|, \tag{8}$$

$$E_* m_n[\alpha] < \infty. \tag{9}$$

Moreover, we have

$$\max\left\{ d_n^{2\gamma}\, \mathrm{card}(\mathcal{A}_n)^{\gamma},\ d_n \sqrt{\log\{d_n\, \mathrm{card}(\mathcal{A}_n)\}} \right\} \times n^{-\tau} \left\| \sup_{\alpha \in \mathcal{A}_n} m_n[\alpha] \right\|_{P_*} \to 0. \tag{10}$$

Assumption 5 is a Lipschitz-type condition. Similar but simpler forms of this have been used in classical analysis of asymptotic normality [4, Theorem 5.21]. We note that the condition (10) explicitly requires that the largest dimension $d_n$ and the candidate size $\mathrm{card}(\mathcal{A}_n)$ do not grow too fast as $n$ goes to infinity. The condition (10) is used to bound the rate of convergence of the empirical process in the vicinity of $\theta_n^*[\alpha]$. Similar conditions were often used to establish asymptotic results such as the Cramér-Rao bound [69, Theorem 18].

###### Assumption 6

There exists a constant such that

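$$\sup_{\alpha\in A_n}\ \sup_{\theta\in H_n[\alpha]\cap B(\theta^*_n[\alpha],\delta)}\bigl\|\hat J_n(\theta;\alpha)-J_n(\theta;\alpha)\bigr\|\to_p 0, \tag{11}$$

$$\sup_{\alpha\in A_n}\ \sup_{\theta\in H_n[\alpha]\cap B(\theta^*_n[\alpha],\delta)}\bigl\|\hat V_n(\theta;\alpha)-V_n(\theta;\alpha)\bigr\|\to_p 0, \tag{12}$$

$$\lim_{\varepsilon\to 0}\ \sup_{\alpha\in A_n}\ \sup_{\theta\in H_n[\alpha]\cap B(\theta^*_n[\alpha],\varepsilon)}\bigl\|V_n(\theta;\alpha)-V_n(\theta^*_n;\alpha)\bigr\|=0. \tag{13}$$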

Assumption 6 requires that the sample analogs of the matrices and are asymptotically close to the truth (in spectral norm) in a neighborhood of . In the classical setting, it is guaranteed by the law of large numbers (applied to each matrix element). The above uniform convergence conditions may be further simplified using finite sample properties of random covariance-type matrices, e.g., a recent result in [70]. Assumption 6 also requires the continuity of in a neighborhood of .

We define

$$w_n[\alpha]=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_n(z_i,\theta^*_n[\alpha];\alpha).$$

Clearly, has zero mean and variance matrix , and thus

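$$E_*\,\|w_n[\alpha]\|^2_{V_n(\theta^*_n[\alpha];\alpha)^{-1}}=\operatorname{tr}\bigl\{V_n(\theta^*_n[\alpha];\alpha)^{-1}J_n(\theta^*_n[\alpha];\alpha)\bigr\}.$$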
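The trace identity above follows from the cyclic property of the trace, and it can be checked numerically. The following is a minimal sketch, not part of the original analysis: the arrays `W` and `V` are hypothetical stand-ins for draws of the score average and for the Hessian-type matrix, and the second-moment matrix `J` is computed from the same draws so that the identity holds exactly for the empirical distribution.

```python
import numpy as np

def weighted_sq_norm_mean(W, V):
    """Average of w^T V^{-1} w over the rows w of W."""
    V_inv = np.linalg.inv(V)
    return float(np.mean(np.einsum("ij,jk,ik->i", W, V_inv, W)))

def trace_form(W, V):
    """tr(V^{-1} J), with J the empirical second-moment matrix of the rows of W."""
    J = W.T @ W / W.shape[0]
    return float(np.trace(np.linalg.solve(V, J)))

rng = np.random.default_rng(0)
d = 4
# Hypothetical stand-in draws of the zero-mean vector w_n[alpha]
W = rng.normal(size=(1000, d)) @ rng.normal(size=(d, d))
# Hypothetical symmetric positive definite stand-in for V_n
A = rng.normal(size=(d, d))
V = A @ A.T + d * np.eye(d)

lhs = weighted_sq_norm_mean(W, V)  # expected squared norm weighted by V^{-1}
rhs = trace_form(W, V)             # trace expression on the right-hand side
print(lhs, rhs)
```

The two printed values agree up to floating-point error, since each summand satisfies w^T V^{-1} w = tr(V^{-1} w w^T).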
###### Assumption 7

Suppose that the following regularity conditions are satisfied.

$$\inf_{\alpha\in A_n} n^{2\tau}R_n[\alpha]\to\infty, \tag{14}$$

$$\sup_{\alpha\in A_n}\frac{d_n[\alpha]}{nR_n[\alpha]}\to 0. \tag{15}$$

Moreover, there exists a fixed constant such that

$$\sum_{\alpha\in A_n}(nR_n[\alpha])^{-2m_1}\,E_*\bigl\{l(\cdot,\theta^*_n[\alpha];\alpha)-E_*\,l(\cdot,\theta^*_n[\alpha];\alpha)\bigr\}^{2m_1}\to 0, \tag{16}$$

there exists a fixed constant such that

 (17)

and there exists a fixed constant such that

$$\limsup_{n\to\infty}\sum_{\alpha\in A_n}(nR_n[\alpha])^{-m_3}\bigl\{E_*\|w_n[\alpha]\|^{m_3}+E_*\|w_n[\alpha]\|^{2m_3}\bigr\}<\infty. \tag{18}$$

In Assumption 7, the conditions (14), (15) and (18) indicate that the risks for all are not small so that the model class is virtually mis-specified. The assumptions of (16) and (17) are central moment constraints that control the regularity of loss functions. Similar conditions were often used to establish the asymptotic performance of model selection, for example [5, Condition (2.6)] and [71, Condition (A.3)].

Overall, Assumptions 1-7 ensure that the conditions for asymptotic normality in regular parametric models are supplemented with conditions ensuring a sufficient level of uniformity among models.

###### Theorem 1

Suppose that Assumptions 1-7 hold. Then the selected by GTIC procedure is asymptotically efficient (in the sense of Definition 3).

Classical asymptotic analysis for general parametric models with i.i.d. observations typically relies on a type of uniform convergence of the empirical process around within a fixed parameter space. Because our functions are vector-valued with dimension depending on the sample size , we cannot directly use state-of-the-art technical tools such as those in [4, Theorem 19.28]. The classical proof by White [57] (for asymptotic normality in a misspecified class) cannot be directly adapted either, because our parameter spaces depend on . On the other hand, although asymptotic analyses of criteria such as AIC, , CV, GIC often consider models that depend on (see, e.g., [54, 71, 5, 52]), they are usually carried out in the context of fixed-design regression models, so the technical tools there cannot be directly applied for our purpose.

Some new technical tools are needed in our proof. Here we sketch some technical ideas in the proof. We first prove that is -consistent (instead of the classical -consistency). We then prove the first key result, namely Lemma 6, that states a type of local uniform convergence. Note that its proof is nontrivial as both the empirical process and depend on the same observed data. Our technical tools resemble those for proving a Donsker class, but the major difference is that our model dimensions depend on . We then prove the second key lemma, Lemma 7. It directly leads to the asymptotic normality of maximum likelihood estimators in the classical setting. It is somewhat interesting to see that the proof of Lemma 7 does not require the -consistency of , which usually does not hold in high dimensional settings.

### III-D Example

Theorem 1 applies to general parametric model classes, where assumptions can often be simplified. We shall use regression models as an example of applying Theorem 1. Suppose that the response variable is written as , where is random noise with mean zero and variance , and is a possibly nonlinear function of predictors . In linear models, data analysts assume that is a linear function of in the form of , where may or may not depend on the sample size . We sometimes write as for brevity. For simplicity, we assume that is known, and is a random vector independent of . Also assume that and (). The observed data are independent realizations of . The unknown parameters are . The model class, denoted by , consists of candidate models represented by , i.e., .

In regression, it is common to use the quadratic loss function

$$l(z,\theta;\alpha)=\Bigl(y-\sum_{j\in\alpha}\beta_j x_j\Bigr)^2-\sigma^2$$

for . The subtraction of allows for better comparison of competing models. Note that the population loss is

 (19)

Suppose that is defined as in (3). We define to be the covariance matrix whose -th element is , to be the column vector whose -th element is , and . We similarly define , , which are the covariance matrix/vectors restricted to model . Simple calculations show that for , and (19) may be rewritten as

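$$\begin{aligned} E_*\,l(z,\theta;\alpha) &= E_*\,l(z,\theta^*_n[\alpha];\alpha)+\|\theta-\theta^*_n[\alpha]\|^2_{\Sigma_{xx}[\alpha]} \\ &= \bigl(\Sigma_{\mu\mu}-\Sigma_{\mu x}[\alpha]\,\Sigma_{xx}[\alpha]^{-1}\,\Sigma_{x\mu}[\alpha]\bigr)+\|\theta-\theta^*_n[\alpha]\|^2_{\Sigma_{xx}[\alpha]}. \end{aligned} \tag{20}$$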

The decomposition in (20) has a nice interpretation in terms of bias-variance tradeoff. The first term is the -norm of the orthogonal complement of projected to the linear span of covariates, or the minimal possible loss offered by the specified model . Clearly, it is zero if is well-specified, and nonzero otherwise. The second term represents the variance of estimation. Evaluating and in this specific case, we obtain

Note that when is close to the independent noise term, then and the GTIC penalty in (7) is around which approximates the AIC and Mallows’ method. Theorem 1 implies the following corollary. In verifying the previous assumptions such as Assumption 2 for this corollary, we used the fact that , and the least squares estimates fall into with high probability (due to the concentration inequalities for bounded and ). It is possible to relax the conditions by a more sophisticated verification of assumptions.
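The decomposition in (20) can be verified numerically. The sketch below is illustrative only and rests on stated assumptions: the mean function `mu`, the candidate index set `alpha`, and the sample sizes are hypothetical, the moment matrices are uncentered second moments, and expectations are taken under the empirical distribution of the simulated draws so that the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.uniform(-1.0, 1.0, size=(n, p))           # bounded predictors
mu = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]    # hypothetical nonlinear mean function
alpha = [0, 1, 3]                                 # hypothetical candidate model (0-based indices)

Xa = X[:, alpha]
S_xx = Xa.T @ Xa / n                              # second-moment matrix restricted to alpha
S_xmu = Xa.T @ mu / n                             # cross moments between predictors and mu
S_mumu = float(mu @ mu / n)

theta_star = np.linalg.solve(S_xx, S_xmu)         # minimizer of the population loss
bias_term = float(S_mumu - S_xmu @ theta_star)    # first term of (20)

def population_loss(theta):
    """Average of (mu - x_alpha^T theta)^2, which equals E(y - x_alpha^T theta)^2 - sigma^2
    when the noise has mean zero and is independent of the predictors."""
    return float(np.mean((mu - Xa @ theta) ** 2))

theta = theta_star + rng.normal(scale=0.3, size=len(alpha))  # an arbitrary nearby parameter
diff = theta - theta_star
variance_term = float(diff @ S_xx @ diff)         # second term of (20)

lhs = population_loss(theta)
rhs = bias_term + variance_term
print(lhs, rhs)
```

Because the chosen `mu` is not a linear function of the included predictors, `bias_term` is strictly positive here, which corresponds to the misspecified case discussed above.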

###### Corollary 1

Assume that and () are bounded by a constant that does not depend on . If the following conditions hold, then the selected by GTIC procedure is asymptotically efficient.
1) are independent with zero mean and unit variance for all ;
2) , where ;
3) , where ;
4) , where .

## IV Sequential Model Expansion

As explained in the introduction, in terms of predictive power, a model in a misspecified model class could be determined to be unnecessarily large, suitable, or inadequately small, depending on a specific sample size (see Fig. 4). A realistic learning procedure thus requires models of different complexity levels as more data become available.

Throughout this section, we shall use (instead of the previously used ) to denote sample size, and subscript as the data index, in order to emphasize the sequential setting.

### IV-A Discussion

We have addressed the selection of an efficient model for a given number of observations. In many practical situations, data are sequentially observed. A straightforward approach is to reapply the GTIC procedure each time new data arrive. However, in a sequential setting, the following issue naturally arises:

Suppose that we successively select a model and use it to predict at each time step. The path of the historically selected models may fluctuate a lot. Instead, it is more appealing (either statistically or computationally) to force the selected models to evolve gradually.

To address the above challenge, we first propose a concept referred to as the graph-based expert tracking, which extends some classical online learning techniques (Algorithm 1). Motivated by the particular path graph , where index the candidate models, we further propose a model expansion strategy (Algorithm 2), where each candidate model and its corrected prediction loss can be regarded respectively as an expert and loss.

The proposed algorithm can be used for online prediction, which ensures not only statistically reliable results but also simple computation. Specifically, we propose a predictor that has cumulative out-sample prediction loss (over time) close to the following optimum benchmark:

$$\min_{\substack{\operatorname{size}(i_1,\dots,i_T)\le k,\\ i_1,\dots,i_T\in\{1,\dots,N\}}}\ \sum_{t=1}^{T}L_n[\alpha_{i_t}], \tag{21}$$

where the size of a sequence is defined as the number of ’s such that . In other words, the minimization is taken over all tuples that have at most switches and that are restricted to the chain . For example, . In the above formulation, and respectively mean the index of the model chosen to predict at time step , and the number of switches within time steps.
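A benchmark of the form (21) can be computed by dynamic programming over the time step, the current model index, and the number of switches used so far. The following is a minimal sketch under stated assumptions: the function name `best_switching_loss` and the loss matrix are hypothetical, indices are 0-based, and the chain restriction is read as allowing a switch only from model i to model i + 1, which is an assumed interpretation rather than a statement from the text.

```python
import numpy as np

def best_switching_loss(losses, k, chain=True):
    """Minimum cumulative loss over index sequences with at most k switches.

    losses: array of shape (T, N); losses[t, i] is the loss of model i at step t.
    chain:  if True, a switch may only move from model i to model i + 1
            (assumed reading of the chain restriction); otherwise any switch is allowed.
    """
    losses = np.asarray(losses, dtype=float)
    T, N = losses.shape
    # best[s, i]: minimal loss of sequences ending at model i with exactly s switches
    best = np.full((k + 1, N), np.inf)
    best[0] = losses[0]
    for t in range(1, T):
        new = np.full((k + 1, N), np.inf)
        for s in range(k + 1):
            stay = best[s]
            if s == 0:
                move = np.full(N, np.inf)          # no switch budget used yet
            elif chain:
                # model i can only be entered from model i - 1
                move = np.concatenate(([np.inf], best[s - 1, :-1]))
            else:
                prev = best[s - 1]
                # cheapest predecessor among all models other than i
                move = np.array([np.min(np.delete(prev, i)) if N > 1 else np.inf
                                 for i in range(N)])
            new[s] = np.minimum(stay, move) + losses[t]
        best = new
    return float(best.min())

# Toy losses: model 0 is best early, model 1 in the middle, model 2 at the end.
toy = np.array([[0, 1, 1], [0, 1, 1], [1, 0, 1], [1, 0, 1], [1, 1, 0], [1, 1, 0]])
for k in range(3):
    print(k, best_switching_loss(toy, k))
```

The cost is of order T N k for the chain case, which reflects why restricting switches to a sparse graph keeps the benchmark and the associated tracking procedures cheap.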

### IV-B Tracking the Best Expert with Graphical Constraints

In this subsection, we propose a novel graph-based expert tracking technique that motivates our algorithm in the following subsection. The discussion may be interesting in its own right, as it includes the state-of-the-art expert tracking framework as a special case (when the underlying graph is fully-connected/complete).

Suppose there are experts. At each discrete time step , each expert gives its prediction, after which the environment reveals the truth . In this subsection, with a slight abuse of notation, we shall also use to denote loss functions in the context of online learning. The performance of each prediction is measured by a loss function . A smaller loss indicates a better prediction. In light of the model expansion we shall introduce in the next subsection, each represents a model, and is the prediction loss of model which is successively re-estimated using at time step .

In order to aggregate all the predictions that the experts make, we maintain a weight value for each expert and update them upon the arrival of each new data point based on the qualities of the predictions. We denote the weight for expert at time as , and the normalized version as . The goal is to optimally update the weights for a better prediction, which is measured by the cumulative loss minus the best achievable (benchmark) loss. This measure is often called “regret” in the online learning literature [72, 73, 74]. The regret is a relevant criterion of evaluating the predictive performance in sequential settings since the model and model parameters have to be adjusted on a rolling basis as new data arrives, and a selected model at a time step may not be suitable at another time step . If the benchmark in the regret is defined as the minimum cumulative loss achieved by a single expert in hindsight, namely , then it is standard to apply the exponential re-weighting procedure which produces some desirable regret bound [73, Chapter 2]. In many cases the best performing expert can be different from one time segment to another, motivating the benchmark

$$\min_{\substack{\operatorname{size}(i_1,\dots,i_T)\le k,\\ i_1,\dots,i_T\in\{1,\dots,N\}}}\ \sum_{t=1}^{T}l(i_t,z_t)$$

where denotes the maximum number of switches of the best experts in hindsight. In this scenario, the fixed share algorithm [73, Chapter 5] can be a good solution with guaranteed regret bound. We consider the following problem setting that aims to significantly reduce computational costs.
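To make the two baseline procedures concrete, the sketch below implements one common variant of the fixed share update, in which a fraction of the total weight is redistributed uniformly after each exponential re-weighting step. It is an illustration under assumptions, not the algorithm proposed in this paper: the function name `fixed_share`, the learning rate `eta`, and the switching rate `share` are hypothetical choices. Setting `share=0` recovers plain exponential re-weighting.

```python
import numpy as np

def fixed_share(losses, eta=1.0, share=0.05):
    """Run a fixed-share weight update over a (T, N) loss matrix.

    Returns the (T, N) array of normalized weights available for prediction
    at each step. eta and share are illustrative tuning values.
    """
    losses = np.asarray(losses, dtype=float)
    T, N = losses.shape
    w = np.full(N, 1.0 / N)                    # start from uniform weights
    history = np.empty((T, N))
    for t in range(T):
        history[t] = w
        v = w * np.exp(-eta * losses[t])       # exponential re-weighting by the revealed loss
        v /= v.sum()
        w = (1.0 - share) * v + share / N      # redistribute a fraction of the mass uniformly
    return history

# Toy example: expert 0 is best for the first 50 steps, expert 1 afterwards.
toy = np.ones((100, 3))
toy[:50, 0] = 0.0
toy[50:, 1] = 0.0
print(fixed_share(toy, share=0.05)[-1].argmax())  # tracks the switch to expert 1
print(fixed_share(toy, share=0.0)[-1].argmax())   # plain re-weighting still favors expert 0
```

The uniform redistribution keeps every weight bounded away from zero, which is what lets the procedure recover quickly after the best expert changes. The update touches all N experts at every step, which is the cost that a graph-restricted variant aims to reduce.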

The best performing expert is restricted to switch according to a directed graph, (without self-loops), with denoting the set of nodes (representing experts) and denoting the set of directed edges. At each time point, the best performing expert can either stay the same or jump to another node which is directly connected from the current node. Let

$$\beta_{ij}=\mathbb{1}_{\exists\,(i,j)\in E}, \tag{22}$$

which is if there is a directed edge on the graph, and otherwise. Let