I Introduction
One of the main aims in data modeling is good generalization, i.e. the model’s capability to approximate accurately the system output for unseen data. Sparse models can be constructed using the penalized cost function, e.g. the basis pursuit or least absolute shrinkage and selection operator (LASSO) [1, 2, 3]. Based on a fixed single penalized regularization parameter, the LASSO can be configured as a standard quadratic programming optimization problem. By exploiting piecewise linearity of the problem, the least angle regression (LAR) procedure [3] is developed for solving the problem efficiently. Note that the computational efficiency in LASSO is facilitated by a single regularization parameter setting. For more complicated constraints, e.g. multiple regularizers, the cross validation by actually splitting data sets as the means of evaluating model generalization comes with considerably large overall computational overheads.
Alternatively the forward orthogonal least squares (OLS) algorithm efficiently constructs parsimonious models [4, 5]. Fundamental to the evaluation of model generalization capability is the concept of crossvalidation [6], and one commonly used version of crossvalidation is the leaveoneout (LOO) cross validation. For the linearintheparameters models, the LOO mean square error (LOOMSE) can be calculated without actually splitting the training data set and estimating the associated models, by making use of the ShermanMorrisonWoodbury theorem. Using the LOOMSE as the model term selective criterion to seek the model generalization, an efficient orthogonal forward regression (OFR) procedure have been introduced [7]. Furthermore, the norm based regularization techniques [8, 9, 10]
have been incorporated into the OLS algorithm to produce a regularized OLS (ROLS) algorithm that carries out model term selection while reduces the variance of parameter estimate simultaneously
[11]. The optimization of norm regularizer with respect to model generalization analytically is however less studied.In this contribution, we propose a norm penalized OFR (POFR) algorithm to carry out the regularizer optimization as well as model term selection and parameter estimation simultaneously in a forward regression manner. The algorithm is based on a new norm penalized cost function with multiple
regularizers, each of which is associated with an orthogonal basis vector by orthogonal decomposition of the regression matrix of the selected model terms. We derive a closed form of the optimal regularization parameter in terms of minimal LOOMSE. To save computational costs an inactive set is used along the OFR process by predicting whether any model terms will be unselectable in future regression steps.
Ii Preliminaries
Consider the general nonlinear system represented by the nonlinear model [13, 12]:
(1) 
where denotes the dimensional input vector at sample time index and is the system output variable, respectively, while
denotes the system white noise and
is the unknown system mapping.The unknown system (1) is to be identified based on an observation data set using some suitable functional which can approximate with arbitrary accuracy. Without loss of generality, we use
to construct a radial basis function (RBF) network model of the form
(2) 
where is the model prediction output for based on the term RBF model, and is the total number of regressors or model terms, while are the model weights. The regressor is given by
(3) 
in which is known as the center vector of the th RBF unit and is an RBF width parameter. We assume that each RBF unit is placed on a training data, namely, all the RBF center vectors are selected from the training data , and the RBF width has been predetermined, for example, using cross validation.
Let us denote as the term modeling error for the input data . Over the training data set , further denote , , and with , . We have the term model in the matrix form of
(4) 
where . Let an orthogonal decomposition of the regression matrix be
(5) 
where
(6) 
and
(7) 
with columns satisfying , if . The regression model (4) can alternatively be expressed as
(8) 
where the ‘orthogonal’ model’s weight vector satisfies the triangular system , which can be used to determine the original model parameter vector , given and .
Further consider the following weighted norm penalized OLS criterion for the model (8)
(9) 
where , which contains the local regularization parameters , for , and is a predetermined lower bound for the regularization parameters. For a given , the solution for can be obtained by setting the subderivative vector of to zero, i.e. , yielding
(10) 
for , with the usual least squares solution given by , and the operator
(11) 
Unlike the LASSO [1, 2], our objective is constructed on the orthogonal space and the norm parameter constraints are associated with the orthogonal bases , . Since the cost function (9) contains sparsity inducing norm, some parameters will be returned as zeros, producing a sparse model in the orthogonal space spanned by the columns of , which corresponds to a sparse model in the original space spanned by the columns of .
Iii Regularization parameter optimization and model construction with LOOMSE
Each OFR stage involves the joint regularization parameter optimization, model term selection and parameter estimation. The regularization parameters with respect to their associated candidate regressors are optimized using the approximate LOOMSE formula that is derived in Section IIIB, and the regressor with the smallest LOOMSE is selected.
Iiia Model representation and LOOMSE in th stage OFR
Consider the OFR modeling process that has produced the term model. Let us denote the constructed columns of regressors as , with . The model output vector of this term model is given by
(12) 
and the corresponding modeling error vector by . Clearly, the th OFR stage can be represented by
(13) 
The model form (13) illustrates the fact that the th OFR stage is simply to fit a onevariable model using the current model residual produced after the th stage as the desired system output. Since , it is easy to verify that .
The selection of one regressor from the candidate regressors involves initially generating candidate by making each candidate regressor to be orthogonal to the orthogonal basis vectors, for obtained in the previous OFR stages, followed by evaluating their contributions. Consider the case of . Applying (10) to (13), we note that clearly as decreases away from towards , increases its magnitude at a linear rate to , from zero to an upper bound with
(14) 
For any candidate regressor, it is vital that we evaluate its potential model generalization performance using the most suitable value of . The optimization of the LOOMSE with respect to is detailed in Section IIIB, based on the idea of the LOO cross validation outlined below.
Suppose that we sequentially set aside each data point in the estimation set in turn and estimate a model using the remaining data points. The prediction error is calculated on the data point that has not been used in estimation. That is, for , the models are estimated based on , respectively, and the outputs are denoted as . Then, the LOO prediction error based on the th data sample is calculated as
(15) 
The LOOMSE is defined as the average of all these prediction errors, given by . Thus the optimal regularization parameter for the th stage is given by
(16) 
Evaluation of by directly splitting the data set requires extensive computational efforts. Instead, we show in Section IIIB that can be approximately calculated without actually sequentially splitting the estimation data set. Furthermore, we also show that the optimal value can be obtained in a closedform expression.
IiiB Optimal regularization parameter estimate
We notice from (10) that if , and thus a sufficient condition that a given may be excluded from the candidate pool without explicitly determining is , which is the regularizer’s lower bound, a preset value indicating the correlation of the candidate regressor. Hence, in the following we assume that , and we have
(17) 
where , , and . Note that (17) is consistent to (10) for all terms with nonzero . In the OFR procedure, any candidate terms producing zero will not be selected since they will not contribute to any reduction in the LOOMSE.
The model residual is defined by
(18) 
where denotes the transpose of the th row of . If the data sample indexed at is removed from the estimation data set, the LOO parameter estimator obtained by using only the remaining data points is given by
(19) 
in which , and denote the resultant regression matrix and desired output vector, respectively. It follows that we have
(20) 
(21) 
The LOO error evaluated at is given by
(22) 
Applying the matrix inversion lemma to (20) yields
(23) 
and
(24) 
Substituting (21) and (24) into (IIIB) yields
(25) 
Assuming that holds for most data samples and then applying (IIIB) to (IIIB), we have
(26) 
where , and is the th element of . The LOOMSE can then be calculated as
(27) 
We point out that in order for and to be different, each element in needs to be very close to zero, which is unlikely since only the model terms satisfying are considered. Hence we can treat given in (27) as the exact LOOMSE for any preset that is not too small.
We further represent (IIIB) as
(28) 
where is the model residual obtained based on the least square estimate at the th step stage. By setting , we obtain in the form of the weighted least square estimate
(29) 
where and . Finally we calculate
(30) 
in order to satisfy the constraint that . For obtained using (IIIB), we consider the following two cases:

If , then , and this candidate regressor will not be selected.

If , then calculate based on as the LOOMSE for this candidate regressor.
IiiC Moving unselectable regressors to the inactive set
From Section IIIB we noted that a candidate regressor satisfying does not need to be considered at the th stage of selection. To save computational cost, we define the inactive set as the index set of the unselectable regressors removed from the pool of candidates.
In the th OFR stage, all the candidate regressors in the candidate pool are made orthogonal to the previously selected regressors, and the candidate with the smallest LOOMSE value is selected as the th model term . Denote any other candidate regressor as .
Main Results: If , then this candidate regressor will never be selected in further regression stages, and hence it can be moved to .
Proof: At the th OFR stage, consider making the regressor orthogonal to , and define
(31) 
Clearly,
(32) 
The model residual vector after the selection of is
(33) 
where can be written as
(34) 
Thus we have
(35) 
(36) 
and
(37) 
Substituting (IIIC) and (IIIC) into (IIIC) yields
(38) 
due to the fact that . From (IIIC) and (IIIC), it can be concluded that
(39) 
Since is the upper bound of , this means that this regressor will not be selected at the th stage. By induction, it will never be selected in further regression stages, and hence it can be moved to .
Iv The proposed POFR algorithm
The proposed POFR algorithm integrates (i) the model regressor selection based on minimizing the LOOMSE; (ii) regularization parameter optimization also based on minimizing the LOOMSE; and (iii) the mechanism of removing unproductive candidate regressors during the OFR procedure. Define
(40) 
with . If some of the columns in have been interchanged, this will still be referred as for notational simplicity.
For , denote the th element of
as and compute , and .
Step 1): If , ;
Else if , set as a very large
positive number so that it will not be selected in Step 4). Otherwise goto step 2).
Step 2): Calculate

The initial conditions are as follows. Preset as a very small value. Set , for , and as the empty set . The th stage of the selection procedure is listed in Table I. The OFR procedure is automatically terminated at the th stage when the condition
(51) 
is detected, yielding a subset model with significant regressors. It is worth emphasizing that there always exists a model size , and for , the LOOMSE decreases as increases, while the condition (51) holds [7, 14].
Note that the LOOMSE is used not only for deriving the closed form of the optimal regularization parameter estimate but also for selecting the most significant model regressor. Specifically, a regressor is selected as the one that produces the smallest LOOMSE value as well as offering the reduction in the LOOMSE. After the stage when there is no reduction in the LOOMSE criterion for a few consecutive OFR stages, the model construction procedure can be terminated. Thus, the POFR algorithm automatically constructs a sparse term model, where typically .
Also note that it is assumed that should not be too small such that the LOOMSE estimation formula can be considered to be accurate. This means that if is set too low, many insignificant candidate regressors will have inaccurate LOOMSE values for competition. However, we emphasize that these terms with inaccurate LOOMSE values will not be selected as the winner to enter the model. Hence in practice we only need to make sure that is not too large, which would introduce unnecessary bias to the model parameter estimates. Clearly, a relatively larger will save computational costs by 1) resulting in a sparser model, and 2) producing a larger sized inactive set during the OFR process.
Finally, regarding the computational complexity of the POFR algorithm, if the unproductive regressors are not removed to the inactive set during the OFR procedure, it is well known that the computational cost is in the order of for evaluating each candidate regressor [14]. The total computational cost then needs to be scaled by the number of evaluations in forward regression, which is . By removing unproductive regressors to during the OFR procedure, the computational cost can obviously be reduced significantly. It is not possible to exactly assess the computational cost saving due to removing the unproductive regressors, as this is problem dependent.
Algorithm  MSE  MSE  Model  Cost 
training set  test set  size  saving  
LROLSLOO [14]  NA  
SVM ()  NA  
SVM ()  NA  
SVM ()  NA  
SVM ()  NA  
SVM ()  NA  
LASSO () 