    # l1-norm Penalized Orthogonal Forward Regression

A l1-norm penalized orthogonal forward regression (l1-POFR) algorithm is proposed based on the concept of leave-one-out mean square error (LOOMSE). Firstly, a new l1-norm penalized cost function is defined in the constructed orthogonal space, and each orthogonal basis is associated with an individually tunable regularization parameter. Secondly, due to orthogonal computation, the LOOMSE can be analytically computed without actually splitting the data set, and moreover a closed form of the optimal regularization parameter in terms of minimal LOOMSE is derived. Thirdly, a lower bound for the regularization parameters is proposed, which can be used for robust LOOMSE estimation by adaptively detecting and removing regressors to an inactive set so that the computational cost of the algorithm is significantly reduced. Illustrative examples are included to demonstrate the effectiveness of this new l1-POFR approach.


## I Introduction

One of the main aims in data modeling is good generalization, i.e. the model's capability to approximate accurately the system output for unseen data. Sparse models can be constructed using the $l_1$-penalized cost function, e.g. the basis pursuit or least absolute shrinkage and selection operator (LASSO) [1, 2, 3]. Based on a fixed single $l_1$-norm regularization parameter, the LASSO can be configured as a standard quadratic programming optimization problem. By exploiting the piecewise linearity of the problem, the least angle regression (LAR) procedure was developed to solve this problem efficiently. Note that the computational efficiency of the LASSO is facilitated by its single regularization parameter setting. For more complicated constraints, e.g. multiple regularizers, cross validation by actually splitting the data set as the means of evaluating model generalization incurs considerably large overall computational overheads.

Alternatively, the forward orthogonal least squares (OLS) algorithm efficiently constructs parsimonious models [4, 5]. Fundamental to the evaluation of model generalization capability is the concept of cross validation, and one commonly used version is the leave-one-out (LOO) cross validation. For linear-in-the-parameters models, the LOO mean square error (LOOMSE) can be calculated without actually splitting the training data set and estimating the associated models, by making use of the Sherman-Morrison-Woodbury theorem. Using the LOOMSE as the model term selection criterion to seek model generalization, an efficient orthogonal forward regression (OFR) procedure has been introduced. Furthermore, $l_2$-norm based regularization techniques [8, 9, 10] have been incorporated into the OLS algorithm to produce a regularized OLS (ROLS) algorithm that carries out model term selection while simultaneously reducing the variance of the parameter estimates. The analytic optimization of the $l_2$-norm regularizers with respect to model generalization is, however, less studied.

In this contribution, we propose an $l_1$-norm penalized OFR ($l_1$-POFR) algorithm that carries out the regularizer optimization as well as model term selection and parameter estimation simultaneously in a forward regression manner. The algorithm is based on a new $l_1$-norm penalized cost function with multiple regularizers, each of which is associated with an orthogonal basis vector obtained by orthogonal decomposition of the regression matrix of the selected model terms. We derive a closed form of the optimal regularization parameter in terms of minimal LOOMSE. To save computational costs, an inactive set is maintained along the OFR process by predicting whether any model terms will be unselectable in future regression steps.

## II Preliminaries

Consider the general nonlinear system represented by the nonlinear model [13, 12]:

$$y(k) = f(\mathbf{x}(k)) + v(k), \qquad (1)$$

where $\mathbf{x}(k)$ denotes the $m$-dimensional input vector at sample time index $k$ and $y(k)$ is the system output variable, respectively, while $v(k)$ denotes the system white noise and $f(\cdot)$ is the unknown system mapping.

The unknown system (1) is to be identified based on an observation data set $D_N = \{\mathbf{x}(k), y(k)\}_{k=1}^{N}$ using some suitable functional that can approximate $f(\cdot)$ with arbitrary accuracy. Without loss of generality, we use $D_N$ to construct a radial basis function (RBF) network model of the form

$$\hat{y}^{(M)}(k) = f^{(M)}(\mathbf{x}(k)) = \sum_{i=1}^{M} \theta_i \phi_i(\mathbf{x}(k)), \qquad (2)$$

where $\hat{y}^{(M)}(k)$ is the model prediction output for $y(k)$ based on the $M$-term RBF model, $M$ is the total number of regressors or model terms, and $\theta_i$ are the model weights. The regressor $\phi_i(\mathbf{x})$ is given by

$$\phi_i(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\tau^2}\right), \qquad (3)$$

in which $\mathbf{c}_i$ is known as the center vector of the $i$th RBF unit and $\tau$ is an RBF width parameter. We assume that each RBF unit is placed on a training data point, namely, all the RBF center vectors are selected from the training data $\{\mathbf{x}(k)\}_{k=1}^{N}$, and the RBF width $\tau$ has been predetermined, for example, using cross validation.
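As a concrete illustration, the Gaussian regressor matrix of eq. (3) can be assembled in a few lines. This is only a sketch; the function name and the NumPy broadcasting idiom are this note's own choices, not the paper's:

```python
import numpy as np

def rbf_design_matrix(X, centers, tau):
    """Build the N-by-M matrix of Gaussian RBF regressors, eq. (3)."""
    # Pairwise squared Euclidean distances between inputs and centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * tau ** 2))
```

With the centers chosen among the training inputs, as the paper assumes, `rbf_design_matrix(X, X, tau)` yields the full $N \times N$ candidate pool.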

Let us denote $e^{(M)}(k) = y(k) - \hat{y}^{(M)}(k)$ as the $M$-term modeling error for the input data $\mathbf{x}(k)$. Over the training data set $D_N$, further denote $\mathbf{y} = [y(1) \cdots y(N)]^T$, $\mathbf{e}^{(M)} = [e^{(M)}(1) \cdots e^{(M)}(N)]^T$, and $\mathbf{\Phi}_M = [\boldsymbol{\phi}_1 \cdots \boldsymbol{\phi}_M]$ with $\boldsymbol{\phi}_i = [\phi_i(\mathbf{x}(1)) \cdots \phi_i(\mathbf{x}(N))]^T$, $1 \le i \le M$. We have the $M$-term model in the matrix form of

$$\mathbf{y} = \mathbf{\Phi}_M \boldsymbol{\theta}_M + \mathbf{e}^{(M)}, \qquad (4)$$

where $\boldsymbol{\theta}_M = [\theta_1 \cdots \theta_M]^T$. Let an orthogonal decomposition of the regression matrix $\mathbf{\Phi}_M$ be

$$\mathbf{\Phi}_M = \mathbf{W}_M \mathbf{A}_M, \qquad (5)$$

where

$$\mathbf{A}_M = \begin{bmatrix} 1 & a_{1,2} & \cdots & a_{1,M} \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & a_{M-1,M} \\ 0 & \cdots & 0 & 1 \end{bmatrix} \qquad (6)$$

and

$$\mathbf{W}_M = [\mathbf{w}_1 \ \mathbf{w}_2 \cdots \mathbf{w}_M] \qquad (7)$$

with columns satisfying $\mathbf{w}_i^T \mathbf{w}_j = 0$, if $i \ne j$. The regression model (4) can alternatively be expressed as

$$\mathbf{y} = \mathbf{W}_M \mathbf{g}_M + \mathbf{e}^{(M)}, \qquad (8)$$

where the 'orthogonal' model's weight vector $\mathbf{g}_M = [g_1 \cdots g_M]^T$ satisfies the triangular system $\mathbf{A}_M \boldsymbol{\theta}_M = \mathbf{g}_M$, which can be used to determine the original model parameter vector $\boldsymbol{\theta}_M$, given $\mathbf{A}_M$ and $\mathbf{g}_M$.

Further consider the following weighted $l_1$-norm penalized OLS criterion for the model (8):

$$L_e(\mathbf{\Lambda}_M, \mathbf{g}_M) = \left\|\mathbf{y} - \mathbf{W}_M \mathbf{g}_M\right\|^2 + \sum_{i=1}^{M} \lambda_i |g_i|, \qquad (9)$$

where $\mathbf{\Lambda}_M = \mathrm{diag}\{\lambda_1, \ldots, \lambda_M\}$ contains the local regularization parameters $\lambda_i \ge \varepsilon$, for $1 \le i \le M$, and $\varepsilon > 0$ is a predetermined lower bound for the regularization parameters. For a given $\mathbf{\Lambda}_M$, the solution for $\mathbf{g}_M$ can be obtained by setting the subderivative vector of $L_e$ to zero, i.e. $\mathbf{0} \in \partial L_e/\partial \mathbf{g}_M$, yielding

$$g_i^{(olasso)} = \left(|g_i^{(LS)}| - \frac{\lambda_i}{2\mathbf{w}_i^T\mathbf{w}_i}\right)_+ \mathrm{sign}(g_i^{(LS)}) \qquad (10)$$

for $1 \le i \le M$, with the usual least squares solution given by $g_i^{(LS)} = \mathbf{w}_i^T\mathbf{y}/(\mathbf{w}_i^T\mathbf{w}_i)$, and the operator

$$z_+ = \begin{cases} z, & \text{if } z > 0, \\ 0, & \text{if } z \le 0. \end{cases} \qquad (11)$$

Unlike the LASSO [1, 2], our objective is constructed in the orthogonal space and the $l_1$-norm parameter constraints are associated with the orthogonal bases $\mathbf{w}_i$, $1 \le i \le M$. Since the cost function (9) contains a sparsity inducing norm, some parameters will be returned as zeros, producing a sparse model in the orthogonal space spanned by the columns of $\mathbf{W}_M$, which corresponds to a sparse model in the original space spanned by the columns of $\mathbf{\Phi}_M$.
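The decomposition (5)-(7) and the shrinkage solution (10)-(11) can be sketched as follows. This is a plain modified Gram-Schmidt sketch under the paper's definitions, not the paper's exact implementation; the function names are illustrative:

```python
import numpy as np

def orthogonal_decomposition(Phi):
    """Modified Gram-Schmidt factorization Phi = W A of eqs. (5)-(7):
    W has mutually orthogonal columns and A is unit upper triangular."""
    N, M = Phi.shape
    W = Phi.astype(float).copy()
    A = np.eye(M)
    for i in range(M):
        for j in range(i + 1, M):
            A[i, j] = (W[:, i] @ W[:, j]) / (W[:, i] @ W[:, i])
            W[:, j] = W[:, j] - A[i, j] * W[:, i]
    return W, A

def olasso_weights(W, y, lam):
    """Soft-thresholded orthogonal-space solution of eqs. (10)-(11)."""
    wtw = (W ** 2).sum(axis=0)                  # w_i^T w_i
    g_ls = (W.T @ y) / wtw                      # least-squares g_i^{(LS)}
    shrink = np.maximum(np.abs(g_ls) - lam / (2.0 * wtw), 0.0)  # (z)_+
    return shrink * np.sign(g_ls)
```

With `lam` set to a vector of zeros, `olasso_weights` reduces to the ordinary least squares solution in the orthogonal space; large entries of `lam` drive the corresponding weights exactly to zero, which is the sparsity mechanism the section describes.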

## III Regularization parameter optimization and model construction with LOOMSE

Each OFR stage involves the joint regularization parameter optimization, model term selection and parameter estimation. The regularization parameters with respect to their associated candidate regressors are optimized using the approximate LOOMSE formula that is derived in Section III-B, and the regressor with the smallest LOOMSE is selected.

### III-A Model representation and LOOMSE in the n-th stage OFR

Consider the OFR modeling process that has produced the $n$-term model. Let us denote the constructed columns of regressors as $\mathbf{W}_n = [\mathbf{w}_1 \cdots \mathbf{w}_n]$, with $n \le M$. The model output vector of this $n$-term model is given by

$$\hat{\mathbf{y}}^{(n)} = \sum_{i=1}^{n} g_i^{(olasso)} \mathbf{w}_i, \qquad (12)$$

and the corresponding modeling error vector by $\mathbf{e}^{(n)} = \mathbf{y} - \hat{\mathbf{y}}^{(n)}$. Clearly, the $n$th OFR stage can be represented by

$$\mathbf{e}^{(n-1)} = g_n \mathbf{w}_n + \mathbf{e}^{(n)}. \qquad (13)$$

The model form (13) illustrates the fact that the $n$th OFR stage simply fits a one-variable model using the current model residual $\mathbf{e}^{(n-1)}$ produced after the $(n-1)$th stage as the desired system output. Since $\mathbf{w}_n^T \mathbf{w}_i = 0$ for $i < n$, it is easy to verify that $g_n^{(LS)} = \mathbf{w}_n^T \mathbf{e}^{(n-1)}/(\mathbf{w}_n^T \mathbf{w}_n) = \mathbf{w}_n^T \mathbf{y}/(\mathbf{w}_n^T \mathbf{w}_n)$.

The selection of one regressor from the remaining candidate regressors involves initially generating the candidate $\mathbf{w}_n$ by making each candidate regressor orthogonal to the orthogonal basis vectors $\mathbf{w}_i$, $1 \le i \le n-1$, obtained in the previous OFR stages, followed by evaluating their contributions. Consider the case of $\lambda_n \ge \varepsilon$. Applying (10) to (13), we note that clearly as $\lambda_n$ decreases away from $2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$ towards $\varepsilon$, $g_n^{(olasso)}$ increases in magnitude at the linear rate $1/(2\mathbf{w}_n^T\mathbf{w}_n)$, from zero to an upper bound $g_n^{(B)}$ with

$$g_n^{(B)} = \left(|g_n^{(LS)}| - \frac{\varepsilon}{2\mathbf{w}_n^T\mathbf{w}_n}\right)_+ \mathrm{sign}(g_n^{(LS)}). \qquad (14)$$

For any candidate regressor, it is vital that we evaluate its potential model generalization performance using the most suitable value of $\lambda_n$. The optimization of the LOOMSE with respect to $\lambda_n$ is detailed in Section III-B, based on the idea of the LOO cross validation outlined below.

Suppose that we sequentially set aside each data point in the estimation set in turn and estimate a model using the remaining $N-1$ data points. The prediction error is calculated on the data point that has not been used in estimation. That is, for $k = 1, \ldots, N$, the models are estimated based on $D_N \setminus \{\mathbf{x}(k), y(k)\}$, respectively, and the resulting model outputs at $\mathbf{x}(k)$ are denoted as $\hat{y}^{(n,-k)}(k, \lambda_n)$. Then, the LOO prediction error based on the $k$th data sample is calculated as

$$e^{(n,-k)}(k, \lambda_n) = y(k) - \hat{y}^{(n,-k)}(k, \lambda_n). \qquad (15)$$

The LOOMSE is defined as the average of all these prediction errors, given by $J(\lambda_n) = \frac{1}{N}\sum_{k=1}^{N}\left(e^{(n,-k)}(k, \lambda_n)\right)^2$. Thus the optimal regularization parameter for the $n$th stage is given by

$$\lambda_n^{opt} = \arg\min_{\lambda_n \ge \varepsilon} J(\lambda_n). \qquad (16)$$

Evaluation of $J(\lambda_n)$ by directly splitting the data set requires extensive computational efforts. Instead, we show in Section III-B that $J(\lambda_n)$ can be approximately calculated without actually sequentially splitting the estimation data set. Furthermore, we also show that the optimal value $\lambda_n^{opt}$ can be obtained in a closed-form expression.

### III-B Optimal regularization parameter estimate

We notice from (10) that $g_n^{(olasso)} = 0$ if $\lambda_n \ge 2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$, and thus a sufficient condition that a given candidate may be excluded from the candidate pool without explicitly determining $\lambda_n^{opt}$ is $2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}| \le \varepsilon$, where $\varepsilon$, the regularizers' lower bound, is a preset value reflecting the minimal correlation required of a candidate regressor. Hence, in the following we assume that $\varepsilon \le \lambda_n < 2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$, and we have

$$\mathbf{g}_n^{(olasso)} = \mathbf{H}_n^{-1}\left(\mathbf{W}_n^T\mathbf{y} - \mathbf{\Lambda}_n\,\mathrm{sign}(\mathbf{g}_n^{(LS)})/2\right), \qquad (17)$$

where $\mathbf{H}_n = \mathbf{W}_n^T\mathbf{W}_n$, $\mathbf{\Lambda}_n = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}$, and $\mathrm{sign}(\mathbf{g}_n^{(LS)}) = [\mathrm{sign}(g_1^{(LS)}) \cdots \mathrm{sign}(g_n^{(LS)})]^T$. Note that (17) is consistent with (10) for all terms with nonzero $g_i^{(olasso)}$. In the OFR procedure, any candidate terms producing zero $g_n^{(olasso)}$ will not be selected since they will not contribute to any reduction in the LOOMSE.

The model residual $e^{(n)}(k, \lambda_n)$ is defined by

$$e^{(n)}(k, \lambda_n) = y(k) - \left(\mathbf{y}^T\mathbf{W}_n - (\mathrm{sign}(\mathbf{g}_n^{(LS)}))^T\mathbf{\Lambda}_n/2\right)\mathbf{H}_n^{-1}\mathbf{w}(k), \qquad (18)$$

where $\mathbf{w}(k)$ denotes the transpose of the $k$th row of $\mathbf{W}_n$. If the data sample indexed at $k$ is removed from the estimation data set, the LOO parameter estimator obtained by using only the remaining $N-1$ data points is given by

$$\mathbf{g}_n^{(olasso,-k)} = \left(\mathbf{H}_n^{(-k)}\right)^{-1}\left(\left(\mathbf{W}_n^{(-k)}\right)^T\mathbf{y}^{(-k)} - \mathbf{\Lambda}_n\,\mathrm{sign}(\mathbf{g}_n^{(LS,-k)})/2\right), \qquad (19)$$

in which $\mathbf{H}_n^{(-k)} = \left(\mathbf{W}_n^{(-k)}\right)^T\mathbf{W}_n^{(-k)}$, while $\mathbf{W}_n^{(-k)}$ and $\mathbf{y}^{(-k)}$ denote the resultant regression matrix and desired output vector, respectively. It follows that we have

$$\mathbf{H}_n^{(-k)} = \mathbf{H}_n - \mathbf{w}(k)\mathbf{w}^T(k), \qquad (20)$$
$$\left(\mathbf{y}^{(-k)}\right)^T\mathbf{W}_n^{(-k)} = \mathbf{y}^T\mathbf{W}_n - y(k)\mathbf{w}^T(k). \qquad (21)$$

The LOO error evaluated at $\mathbf{x}(k)$ is given by

$$e^{(n,-k)}(k, \lambda_n) = y(k) - \left(\mathbf{g}_n^{(olasso,-k)}\right)^T\mathbf{w}(k) = y(k) - \left(\left(\mathbf{y}^{(-k)}\right)^T\mathbf{W}_n^{(-k)} - (\mathrm{sign}(\mathbf{g}_n^{(LS,-k)}))^T\mathbf{\Lambda}_n/2\right)\left(\mathbf{H}_n^{(-k)}\right)^{-1}\mathbf{w}(k). \qquad (22)$$

Applying the matrix inversion lemma to (20) yields

$$\left(\mathbf{H}_n^{(-k)}\right)^{-1} = \left(\mathbf{H}_n - \mathbf{w}(k)\mathbf{w}^T(k)\right)^{-1} = \mathbf{H}_n^{-1} + \frac{\mathbf{H}_n^{-1}\mathbf{w}(k)\mathbf{w}^T(k)\mathbf{H}_n^{-1}}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)} \qquad (23)$$

and

$$\left(\mathbf{H}_n^{(-k)}\right)^{-1}\mathbf{w}(k) = \frac{\mathbf{H}_n^{-1}\mathbf{w}(k)}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)}. \qquad (24)$$

Substituting (21) and (24) into (22) yields

$$e^{(n,-k)}(k, \lambda_n) = y(k) - \frac{\left(\mathbf{y}^T\mathbf{W}_n - y(k)\mathbf{w}^T(k) - (\mathrm{sign}(\mathbf{g}_n^{(LS,-k)}))^T\mathbf{\Lambda}_n/2\right)\mathbf{H}_n^{-1}\mathbf{w}(k)}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)} = \frac{y(k) - \left(\mathbf{y}^T\mathbf{W}_n - (\mathrm{sign}(\mathbf{g}_n^{(LS,-k)}))^T\mathbf{\Lambda}_n/2\right)\mathbf{H}_n^{-1}\mathbf{w}(k)}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)}. \qquad (25)$$

Assuming that $\mathrm{sign}(\mathbf{g}_n^{(LS,-k)}) = \mathrm{sign}(\mathbf{g}_n^{(LS)})$ holds for most data samples and then applying (18) to (25), we have

$$e^{(n,-k)}(k, \lambda_n) = \gamma_n(k)\, e^{(n)}(k, \lambda_n), \qquad (26)$$

where $\gamma_n(k) = \left(1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)\right)^{-1}$, and $e^{(n)}(k, \lambda_n)$ is the $k$th element of $\mathbf{e}^{(n)}$. The LOOMSE can then be calculated as

$$J(\lambda_n) = \frac{1}{N}\sum_{k=1}^{N}\gamma_n^2(k)\left(e^{(n)}(k, \lambda_n)\right)^2. \qquad (27)$$

We point out that in order for $\mathrm{sign}(\mathbf{g}_n^{(LS,-k)})$ and $\mathrm{sign}(\mathbf{g}_n^{(LS)})$ to be different, some element of $\mathbf{g}_n^{(LS)}$ needs to be very close to zero, which is unlikely since only the model terms satisfying $2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}| > \varepsilon$ are considered. Hence we can treat $J(\lambda_n)$ given in (27) as the exact LOOMSE for any preset $\varepsilon$ that is not too small.
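The analytic LOOMSE of (26)-(27) is straightforward to compute once the selected orthogonal regressors are at hand. The sketch below follows the formulas directly; the function and argument names are this note's own:

```python
import numpy as np

def loomse(W_n, y, g):
    """Analytic LOOMSE of eqs. (26)-(27), computed without splitting data.

    W_n : (N, n) matrix of orthogonal regressors selected so far
    y   : (N,) desired output vector
    g   : (n,) shrunken orthogonal-space weights
    """
    H_inv = np.linalg.inv(W_n.T @ W_n)              # H_n^{-1}
    h = np.einsum('ki,ij,kj->k', W_n, H_inv, W_n)   # w^T(k) H_n^{-1} w(k)
    gamma = 1.0 / (1.0 - h)                         # gamma_n(k) of eq. (26)
    e = y - W_n @ g                                 # residual e^{(n)}(k)
    return np.mean((gamma * e) ** 2)                # eq. (27)
```

Since the $\gamma_n(k) \ge 1$, this estimate never falls below the ordinary mean squared residual, which matches the intuition that LOO errors are pessimistic relative to training errors.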

We further represent (18) as

$$e^{(n)}(k, \lambda_n) = \eta(k) + \frac{\lambda_n}{2\mathbf{w}_n^T\mathbf{w}_n}\, w_n(k)\,\mathrm{sign}(g_n^{(LS)}), \qquad (28)$$

where $\eta(k)$ is the model residual obtained based on the least squares estimate at the $n$th stage. By setting $\partial J(\lambda_n)/\partial \lambda_n = 0$, we obtain $\lambda_n$ in the form of the weighted least squares estimate

$$\lambda_n = -2\,\mathrm{sign}(g_n^{(LS)})\,\mathbf{w}_n^T\mathbf{w}_n\,\frac{\mathbf{w}_n^T\mathbf{\Gamma}^{(n)}\boldsymbol{\eta}}{\mathbf{w}_n^T\mathbf{\Gamma}^{(n)}\mathbf{w}_n}, \qquad (29)$$

where $\mathbf{\Gamma}^{(n)} = \mathrm{diag}\{\gamma_n^2(1), \ldots, \gamma_n^2(N)\}$ and $\boldsymbol{\eta} = [\eta(1) \cdots \eta(N)]^T$. Finally we calculate

$$\lambda_n^{opt} = \max\left\{\min\left\{2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|,\; -2\,\mathrm{sign}(g_n^{(LS)})\,\mathbf{w}_n^T\mathbf{w}_n\,\frac{\mathbf{w}_n^T\mathbf{\Gamma}^{(n)}\boldsymbol{\eta}}{\mathbf{w}_n^T\mathbf{\Gamma}^{(n)}\mathbf{w}_n}\right\},\; \varepsilon\right\}, \qquad (30)$$

in order to satisfy the constraint that $\varepsilon \le \lambda_n \le 2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$. For $\lambda_n^{opt}$ obtained using (30), we consider the following two cases:

1. If $\lambda_n^{opt} = 2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$, then $g_n^{(olasso)} = 0$, and this candidate regressor will not be selected.

2. If $\lambda_n^{opt} < 2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$, then calculate $J(\lambda_n^{opt})$ based on (27) as the LOOMSE for this candidate regressor.
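The closed form (29) clipped to the interval (30) can be sketched as a one-liner per candidate. The names below (`optimal_lambda`, `e_prev`, `eta`) are this sketch's own choices:

```python
import numpy as np

def optimal_lambda(w, e_prev, gamma, eta, eps):
    """Closed-form optimal regularizer of eqs. (29)-(30).

    w      : candidate orthogonal regressor (length N)
    e_prev : residual e^{(n-1)} from the previous stage
    gamma  : LOO factors gamma_n(k) of eq. (26)
    eta    : least-squares residual of this stage
    eps    : preset lower bound epsilon for the regularizers
    """
    wtw = w @ w
    g_ls = (w @ e_prev) / wtw                  # g_n^{(LS)} for this stage
    G = gamma ** 2                             # diagonal of Gamma^{(n)}
    lam = -2.0 * np.sign(g_ls) * wtw * (w @ (G * eta)) / (w @ (G * w))  # (29)
    upper = 2.0 * abs(w @ e_prev)              # lambda at which g_n vanishes
    return max(min(upper, lam), eps)           # (30)
```

When the unconstrained stationary point (29) is negative or exceeds the shrink-to-zero threshold, the clipping in (30) returns $\varepsilon$ or $2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$ respectively, reproducing the two cases listed above.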

### III-C Moving unselectable regressors to the inactive set

From Section III-B we noted that a candidate regressor satisfying $2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}| \le \varepsilon$ does not need to be considered at the $n$th stage of selection. To save computational cost, we define the inactive set $S$ as the index set of the unselectable regressors removed from the pool of candidates.

In the $n$th OFR stage, all the candidate regressors in the candidate pool are made orthogonal to the previously selected regressors, and the candidate with the smallest LOOMSE value is selected as the $n$th model term $\mathbf{w}_n$. Denote any other candidate regressor as $\mathbf{w}^{(-)}$.

Main Results: If $\|\mathbf{w}^{(-)}\| \cdot \|\mathbf{e}^{(n-1)}\| < \varepsilon/2$, then this candidate regressor will never be selected in further regression stages, and hence it can be moved to $S$.

Proof: At the $(n+1)$th OFR stage, consider making the regressor $\mathbf{w}^{(-)}$ orthogonal to $\mathbf{w}_n$, and define

$$\mathbf{w}^{(+)} = \mathbf{w}^{(-)} - \frac{\mathbf{w}_n^T\mathbf{w}^{(-)}}{\mathbf{w}_n^T\mathbf{w}_n}\mathbf{w}_n. \qquad (31)$$

Clearly,

$$\|\mathbf{w}^{(+)}\|^2 = \left(\mathbf{w}^{(-)} - \frac{\mathbf{w}_n^T\mathbf{w}^{(-)}}{\mathbf{w}_n^T\mathbf{w}_n}\mathbf{w}_n\right)^T\left(\mathbf{w}^{(-)} - \frac{\mathbf{w}_n^T\mathbf{w}^{(-)}}{\mathbf{w}_n^T\mathbf{w}_n}\mathbf{w}_n\right) = \|\mathbf{w}^{(-)}\|^2 - \frac{(\mathbf{w}_n^T\mathbf{w}^{(-)})^2}{\mathbf{w}_n^T\mathbf{w}_n} \le \|\mathbf{w}^{(-)}\|^2. \qquad (32)$$

The model residual vector after the selection of $\mathbf{w}_n$ is

$$\mathbf{e}^{(n)} = \mathbf{e}^{(n-1)} - g_n^{(olasso)}\mathbf{w}_n, \qquad (33)$$

where $g_n^{(olasso)}$ can be written as

$$g_n^{(olasso)} = \left(\mathbf{w}_n^T\mathbf{e}^{(n-1)} - \frac{\lambda_n}{2}\mathrm{sign}(g_n^{(LS)})\right)\Big/\mathbf{w}_n^T\mathbf{w}_n. \qquad (34)$$

Thus we have

$$\|\mathbf{e}^{(n)}\|^2 = \|\mathbf{e}^{(n-1)}\|^2 - 2g_n^{(olasso)}\mathbf{w}_n^T\mathbf{e}^{(n-1)} + \left(g_n^{(olasso)}\right)^2\mathbf{w}_n^T\mathbf{w}_n, \qquad (35)$$

$$\left(g_n^{(olasso)}\right)^2\mathbf{w}_n^T\mathbf{w}_n = \left((\mathbf{w}_n^T\mathbf{e}^{(n-1)})^2 - \lambda_n\,\mathrm{sign}(g_n^{(LS)})\,\mathbf{w}_n^T\mathbf{e}^{(n-1)} + \frac{\lambda_n^2}{4}\right)\Big/\mathbf{w}_n^T\mathbf{w}_n, \qquad (36)$$

and

$$2g_n^{(olasso)}\mathbf{w}_n^T\mathbf{e}^{(n-1)} = \left(2(\mathbf{w}_n^T\mathbf{e}^{(n-1)})^2 - \lambda_n\,\mathrm{sign}(g_n^{(LS)})\,\mathbf{w}_n^T\mathbf{e}^{(n-1)}\right)\Big/\mathbf{w}_n^T\mathbf{w}_n. \qquad (37)$$

Substituting (36) and (37) into (35) yields

$$\|\mathbf{e}^{(n)}\|^2 = \|\mathbf{e}^{(n-1)}\|^2 - \left((\mathbf{w}_n^T\mathbf{e}^{(n-1)})^2 - \frac{\lambda_n^2}{4}\right)\Big/\mathbf{w}_n^T\mathbf{w}_n < \|\mathbf{e}^{(n-1)}\|^2, \qquad (38)$$

due to the fact that $\lambda_n < 2|\mathbf{w}_n^T\mathbf{e}^{(n-1)}|$. From (32) and (38), it can be concluded that

$$\|\mathbf{w}^{(+)}\| \cdot \|\mathbf{e}^{(n)}\| < \|\mathbf{w}^{(-)}\| \cdot \|\mathbf{e}^{(n-1)}\| < \varepsilon/2. \qquad (39)$$

Since $\|\mathbf{w}^{(+)}\| \cdot \|\mathbf{e}^{(n)}\|$ is an upper bound of $|(\mathbf{w}^{(+)})^T\mathbf{e}^{(n)}|$, this means that this regressor will not be selected at the $(n+1)$th stage. By induction, it will never be selected in further regression stages, and hence it can be moved to $S$.
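The pruning rule just proved is cheap to apply at the end of each stage: compare each surviving candidate's norm, scaled by the current residual norm, against $\varepsilon/2$. A minimal sketch, with the dictionary-based candidate pool being this note's own representation:

```python
import numpy as np

def prune_candidates(cand, e_n, eps, inactive):
    """Move unselectable regressors to the inactive set S (Section III-C):
    once ||w|| * ||e^{(n)}|| < eps/2 a candidate can never be selected."""
    e_norm = np.linalg.norm(e_n)
    for j in list(cand):                       # cand: index -> orthogonalized regressor
        if np.linalg.norm(cand[j]) * e_norm < eps / 2.0:
            inactive.add(j)                    # index j joins S for good
            del cand[j]                        # no longer evaluated later
    return cand, inactive
```

Because both $\|\mathbf{w}^{(+)}\|$ and $\|\mathbf{e}^{(n)}\|$ are non-increasing over the stages, a candidate pruned once stays prunable, which is exactly why the induction in the proof goes through.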

## IV The proposed l1-POFR algorithm

The proposed $l_1$-POFR algorithm integrates (i) the model regressor selection based on minimizing the LOOMSE; (ii) the regularization parameter optimization, also based on minimizing the LOOMSE; and (iii) the mechanism of removing unproductive candidate regressors during the OFR procedure. Define

$$\mathbf{\Phi}^{(n-1)} = \left[\mathbf{w}_1 \cdots \mathbf{w}_{n-1}\ \boldsymbol{\phi}_n^{(n-1)} \cdots \boldsymbol{\phi}_M^{(n-1)}\right] \in \mathbb{R}^{N \times M}, \qquad (40)$$

with $\mathbf{\Phi}^{(0)} = \mathbf{\Phi}_M$. If some of the columns in $\mathbf{\Phi}^{(n-1)}$ have been interchanged, this will still be referred to as $\mathbf{\Phi}^{(n-1)}$ for notational simplicity.

Fig. 1: Engine Data: (a) the system input u(k), (b) the system output y(k), and (c) the evolution of the size of S with respect to the chosen ε.

The initial conditions are as follows. Preset $\varepsilon$ as a very small value, set $\mathbf{e}^{(0)} = \mathbf{y}$, and initialize the inactive set $S$ as the empty set $\emptyset$. The $n$th stage of the selection procedure is listed in Table I. The OFR procedure is automatically terminated at the $(n_s+1)$th stage when the condition

$$J_{n_s+1} \ge J_{n_s} \qquad (51)$$

is detected, yielding a subset model with $n_s$ significant regressors. It is worth emphasizing that there always exists a model size $n_s$ such that, for $n \le n_s$, the LOOMSE decreases as $n$ increases, after which the condition (51) holds [7, 14].

Note that the LOOMSE is used not only for deriving the closed form of the optimal regularization parameter estimate but also for selecting the most significant model regressor. Specifically, a regressor is selected as the one that produces the smallest LOOMSE value while also offering a reduction in the LOOMSE. After the stage at which there is no reduction in the LOOMSE criterion for a few consecutive OFR stages, the model construction procedure can be terminated. Thus, the $l_1$-POFR algorithm automatically constructs a sparse $n_s$-term model, where typically $n_s \ll M$.

Also note that it is assumed that $\varepsilon$ should not be too small, so that the LOOMSE estimation formula (27) can be considered accurate. This means that if $\varepsilon$ is set too low, many insignificant candidate regressors will have inaccurate LOOMSE values in the competition. However, we emphasize that these terms with inaccurate LOOMSE values will not be selected as the winner to enter the model. Hence in practice we only need to make sure that $\varepsilon$ is not too large, which would introduce unnecessary bias to the model parameter estimates. Clearly, a relatively larger $\varepsilon$ will save computational costs by 1) resulting in a sparser model, and 2) producing a larger sized inactive set during the OFR process.

Finally, regarding the computational complexity of the $l_1$-POFR algorithm: if the unproductive regressors are not removed to the inactive set during the OFR procedure, the computational cost of evaluating each candidate regressor is in the order of $O(N)$. The total computational cost then needs to be scaled by the number of candidate evaluations in the forward regression. By removing unproductive regressors to $S$ during the OFR procedure, the computational cost can obviously be reduced significantly. It is not possible to exactly assess the computational cost saving due to removing the unproductive regressors, as this is problem dependent.
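Putting the pieces together, the forward loop described in this section can be sketched end-to-end as below. This is a simplified illustration under the paper's recipe, not a reproduction of Table I: the candidate pool is a plain dictionary, exact orthogonality of the selected basis is assumed (so the leverage terms decompose column-wise), and all names are this sketch's own:

```python
import numpy as np

def l1_pofr(Phi, y, eps=1e-6):
    """Sketch of the l1-POFR forward loop: each stage orthogonalizes the
    surviving candidates, gives every one its closed-form regularizer (30),
    scores it by the analytic LOOMSE (27), selects the smallest-LOOMSE
    winner, and prunes candidates with ||w||*||e|| < eps/2."""
    N, M = Phi.shape
    cand = {j: Phi[:, j].astype(float).copy() for j in range(M)}
    W_sel, wtw_sel = [], []          # selected orthogonal basis and norms
    selected = []                    # (original column index, weight)
    e = y.astype(float).copy()       # residual e^{(n-1)}
    J_prev = np.inf
    while cand:
        best = None
        for j, w in cand.items():
            wtw = w @ w
            if wtw < 1e-12 or 2.0 * abs(w @ e) <= eps:
                continue                         # unselectable this stage
            g_ls = (w @ e) / wtw
            # leverage h(k) = w^T(k) H_n^{-1} w(k), diagonal H_n assumed
            h = w ** 2 / wtw
            for wi, wiwi in zip(W_sel, wtw_sel):
                h = h + wi ** 2 / wiwi
            gamma = 1.0 / (1.0 - h)              # gamma_n(k), eq. (26)
            eta = e - g_ls * w                   # LS residual of this stage
            G = gamma ** 2
            lam = -2.0 * np.sign(g_ls) * wtw * (w @ (G * eta)) / (w @ (G * w))
            lam = max(min(2.0 * abs(w @ e), lam), eps)       # eq. (30)
            if lam >= 2.0 * abs(w @ e):
                continue                         # weight shrinks to zero
            g = np.sign(g_ls) * (abs(g_ls) - lam / (2.0 * wtw))  # eq. (10)
            J = np.mean((gamma * (e - g * w)) ** 2)              # eq. (27)
            if best is None or J < best[0]:
                best = (J, j, g)
        if best is None or best[0] >= J_prev:
            break                                # stopping rule (51)
        J_prev, j_star, g_star = best
        w_star = cand.pop(j_star)
        selected.append((j_star, g_star))
        W_sel.append(w_star)
        wtw_sel.append(w_star @ w_star)
        e = e - g_star * w_star
        e_norm = np.linalg.norm(e)
        for j in list(cand):                     # orthogonalize and prune
            cand[j] = cand[j] - (w_star @ cand[j]) / (w_star @ w_star) * w_star
            if np.linalg.norm(cand[j]) * e_norm < eps / 2.0:
                del cand[j]                      # moved to the inactive set
    return selected
```

The per-stage cost is dominated by the $O(N)$ inner products per surviving candidate, so shrinking the candidate pool via the inactive set directly shortens the inner loop, which is the saving the complexity discussion above describes.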