1 Introduction
Since the early work of Legendre and Gauss in the late eighteenth century, linear and nonlinear regression has employed the space defined by the input data to project the target (desired) response and find, in a training set, the optimal set of model parameters through mean square error minimization. This approach has been fully embraced by the adaptive signal processing [1], control theory, pattern recognition and machine learning communities [2], and has become the de facto standard for function approximation.
The pursuit of the alternative proposed here is based on theoretical reasons, i.e. to expand the horizon of function approximation theory, but its impact on current machine learning applications is perhaps even higher. In the conventional modeling approach, when the system that created the input-desired data pairs is nonlinear, the linear model must be replaced by a nonlinear model (e.g. an artificial neural network), which means that the optimization becomes nonlinear in the parameters. This implies that local minima exist in the performance surface; gradient search techniques become slow and cumbersome, and there is no guarantee of finding the optimal solution. This is one of the current bottlenecks of nonlinear modeling and machine learning. Meanwhile, all these methods discard the error after training the parameters, but this available error information can be better utilized to provide a novel approach to function approximation, as we demonstrate here.
Our vision is to create universal learning systems that are easy to train and guaranteed to converge to the optimum, which we call convex universal learning machines (CULMs) [3]. CULMs are universal mappers with architectures that either do not have a hidden layer or do not need to train the hidden layer weights. One distinctive class of CULMs are Kernel Adaptive Filters (KAFs), which project the input data into a Reproducing Kernel Hilbert Space (RKHS) using a strictly positive definite kernel function and employ linear methods to train the parameters [4]. The difficulty is that, when the representer theorem is applied in the RKHS, the filter output is computed from all past data, so the filter computation grows linearly or superlinearly with the number of samples, which is unrealistic for real-world applications without sparsification procedures [5]. The other class, including reservoir computing, uses stochastic approaches based on random hidden parameters, exemplified by the Extreme Learning Machine (ELM); it suffers from incomplete theoretical understanding and requires many “tricks” to achieve useful and reproducible results [6].
In spite of these shortcomings, ELMs are remarkably popular, which shows that the need for fast universal processing with generalization capacity in function approximation is still unmet.
Here we propose a new way to design CULMs based on the conventional Finite Impulse Response (FIR) linear model extended with a table lookup. Instead of using only the input to span the projection space, we use the full joint space as the projection space; hence the approach is named the Augmented Space Linear Model (ASLM). Augmented with the desired signal, the ASLM framework expands the data input space, assumed of dimension L, to an (L+1)-dimensional space. The L+1 independent bases can then span the joint space, which means the training set error can be made as small as the adaptation method can achieve with a linear approach.
There are two difficulties that need to be addressed in the ASLM. The first is that, after training, all the weights go to zero except the one connected to the desired signal, which approaches one; this means regularization is needed during adaptation. The second difficulty is that we do not have the desired signal in the test phase! To address these issues, we use the difference between the output in the input space and the desired response in the joint space during the training phase (the training error) to augment the input space, instead of augmenting the input with the desired signal itself. We then store all the training errors in a table indexed by the input data. This novel solution exploits the extra information contained in the training errors, which is wasted in conventional least squares, to approximate nonlinear relationships with a linear model and a table. Since ASLM is an adaptive linear architecture with convex optimization, and the training error is orthogonal to the bases, the adaptation process no longer needs to be regularized. Meanwhile, the computational complexity of training and testing is much lower than that of nonlinear methods, which makes ASLM well suited for online learning. In fact, ASLM is an intermediate solution in the complexity-accuracy design space between the linear model (easy but not very accurate) and fully nonlinear models (complex but potentially much more accurate). Unlike traditional linear and nonlinear models, the augmented space framework makes full use of the training errors, and may therefore improve other models (linear and nonlinear) as well.
2 Augmented Space Linear Model
The simplest implementation of the ASLM is presented below. The left part of Fig. 1 shows the conceptual least squares solution: the best approximation y of the desired response d is found by projecting d onto the space spanned by the multidimensional input x. The minimum error is achieved when y is the orthogonal projection of d onto the input subspace. It is then sufficient to add the error e to the output y to obtain exactly the desired response on the training set, because e is by definition perpendicular to the input space. When computing the output of the ASLM in this way, we are obviously using the joint space to evaluate the desired response.
Consider a set of N pairs of training input vectors x(i) with desired outputs d(i), i = 1, …, N, where i denotes the discrete time instant. We first compute the weights w of the linear model in the input space from all the training data, given by the Least Squares (LS) solution in (1):
$$\mathbf{w} = \left(\mathbf{X}\mathbf{X}^{T} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}\mathbf{d} \tag{1}$$
where $\mathbf{X} = [\mathbf{x}(1), \ldots, \mathbf{x}(N)]$, $\mathbf{d} = [d(1), \ldots, d(N)]^{T}$, and $\lambda$ is a small value to prevent rank deficiency. Then we create a table addressed by the input, which relates each input to its training error, and store this table. The size of this table equals the training set size if no quantization is introduced. In the test phase, we use the current input to find the closest entry in the table, and then read back the corresponding training error to approximate the desired response, i.e. equation (2):
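As a sanity check, the regularized LS solution of equation (1) is a few lines of linear algebra. The sketch below uses synthetic data of our own (not the paper's Lorenz series): the sizes, seed and `w_true` are illustrative choices, and the solve recovers the weights of a known linear system.

```python
import numpy as np

# Regularized least squares of eq. (1): w = (X X^T + lam*I)^{-1} X d.
# Toy linear system d = w_true^T x observed over N samples (our own data).
rng = np.random.default_rng(0)
L, N = 3, 200
X = rng.standard_normal((L, N))      # columns are the input vectors x(i)
w_true = np.array([0.5, -1.0, 2.0])  # assumed ground-truth weights
d = w_true @ X                       # desired responses d(i)

lam = 1e-6                           # small value preventing rank deficiency
w = np.linalg.solve(X @ X.T + lam * np.eye(L), X @ d)

print(np.allclose(w, w_true, atol=1e-3))  # True: the LS weights are recovered
```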
$$\hat{d}(t) = y(t) + e(j^{*}), \qquad j^{*} = \arg\min_{j}\,\bigl\|\mathbf{w}\circ\bigl(\mathbf{x}(t)-\mathbf{x}(j)\bigr)\bigr\| \tag{2}$$
where y(t) is the current output of the LS solution and e(j*) is the training error corresponding to the closest index j*, as shown in the right part of Fig. 1. The norm measures the distance between samples transformed by the Hadamard product w ∘ x; considering the inputs of the whole training set, this weighted metric yields more reasonable matches when searching for the closest sample. The error added in equation (2) is a good approximation of the desired response under two conditions: (1) e(j*) is a good approximation of the error of the current test sample; (2) the error for a given input remains stationary between training and testing. The second assumption must also be imposed in conventional function approximation, although in ASLM the requirement applies to instantaneous errors, which is more demanding under practical noisy conditions. In realistic applications, we can use a quantization approach to reduce the noise in the training data while also decreasing the table size.
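The whole ASLM pipeline, equations (1) and (2) plus the error table, can be sketched in a few lines. The one-dimensional toy example below is our own illustration (the target function, sample grid and λ are assumed, not from the paper): the scalar LS fit leaves a structured residual that the table lookup recovers at test time.

```python
import math

# 1-D ASLM sketch: the target d(x) = 2x + sin(3x) is nonlinear, so the
# scalar LS fit alone cannot match it on the training grid.
train_x = [i / 50.0 for i in range(-100, 101)]
train_d = [2 * x + math.sin(3 * x) for x in train_x]

# Eq. (1) in one dimension: w = sum(x*d) / (sum(x^2) + lam).
lam = 1e-6
w = sum(x * d for x, d in zip(train_x, train_d)) / (sum(x * x for x in train_x) + lam)

# Error table: training input -> training error e = d - w*x.
table = [(x, d - w * x) for x, d in zip(train_x, train_d)]

def aslm_predict(x):
    # Eq. (2): linear output plus the stored error of the nearest training
    # input (in 1-D the metric |w*(x - x_j)| reduces to the nearest |x - x_j|).
    xj, ej = min(table, key=lambda entry: abs(x - entry[0]))
    return w * x + ej

x_test = 0.513
true_d = 2 * x_test + math.sin(3 * x_test)
lin_err = abs(w * x_test - true_d)
aslm_err = abs(aslm_predict(x_test) - true_d)
print(aslm_err < lin_err)  # the table correction sharply reduces the error
```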
ASLM is the simplest model in the augmented space; in fact, the same idea can also be used to augment a KAF nonlinear model. Although KAFs are universal nonlinear models, it is difficult to achieve a good approximation of the desired response by a linear combination of Gaussian kernels, because of rattling and insufficient training data. To compute the training error in the augmented space, we first train the nonlinear model as usual, fix its weights, and compute the training errors all at once. We then create and store a table addressed by the input, as before. In the test phase, we use the current input to find the closest entry in the table and read back the corresponding training error to approximate the desired response, i.e. equation (2). Since there is no weight vector w in the nonlinear model, we measure the distance between input samples by the norm directly.
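The augmentation step is agnostic to the underlying model, which the following sketch makes explicit: `augment` is a hypothetical helper of our own (not from the paper) that wraps any trained scalar predictor with an error table, in the same spirit as KLMS-AM. The stand-in predictor and data are assumptions for illustration.

```python
# Augmented-space wrapper sketch: given any trained predictor f, store its
# training errors in a table and correct test outputs with the error of
# the nearest training input, as in eq. (2).
def augment(f, train_xs, train_ds):
    table = [(x, d - f(x)) for x, d in zip(train_xs, train_ds)]
    def g(x):
        _, e = min(table, key=lambda entry: abs(x - entry[0]))
        return f(x) + e
    return g

f = lambda x: 0.9 * x * x                 # deliberately imperfect model of d(x) = x^2
xs = [i / 20.0 for i in range(-40, 41)]
ds = [x * x for x in xs]
g = augment(f, xs, ds)

x = 1.234
print(abs(g(x) - x * x) < abs(f(x) - x * x))  # True: the stored error helps
```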
To improve the efficiency of finding the nearest neighbor, we store the table in a tree with O(log N) search complexity [7]. The testing computational complexity of an augmented space model therefore consists of two parts: the complexity of the underlying algorithm (linear or nonlinear) that computes the system output, and the complexity of searching the table for the stored error. For ASLM, the test complexity is that of the linear output plus the O(log N) search. As for training, ASLM is very fast, since it only needs to create the table after the least squares solution, which is much faster than training a nonlinear model. We compare the performance and computational complexity of the proposed ASLM with several linear and nonlinear models in the next section.
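For scalar table keys, the O(log N) search can be illustrated with a sorted array and binary search; a spatial tree [7] plays the same role for multidimensional inputs. The keys and stored errors below are made up for illustration.

```python
import bisect

# Sorted error table keyed by a scalar input: bisect locates the insertion
# point in O(log N), and comparing the two neighboring keys picks the
# nearest one -- a 1-D stand-in for the tree search of [7].
keys = [0.0, 0.5, 1.0, 1.5, 2.0]
errs = [0.01, -0.02, 0.03, 0.00, -0.01]   # hypothetical stored training errors

def nearest_error(x):
    i = bisect.bisect_left(keys, x)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(keys)]
    j = min(candidates, key=lambda j: abs(keys[j] - x))
    return errs[j]

print(nearest_error(1.2))   # 0.03 (nearest key is 1.0)
print(nearest_error(1.3))   # 0.0  (nearest key is 1.5)
```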
3 Simulation Results
3.1 Prediction Without Noise
In order to evaluate the role of ASLM within current methodologies for function approximation and system identification, we select three competing models: Least Squares (LS) as an example of the optimal linear projection, the K Nearest Neighbor (KNN) algorithm [8] as a memory-based approach, and the Kernel Least Mean Square (KLMS) with a Gaussian kernel, as a CULM that rivals the best nonlinear networks for prediction. We also include KLMS with the augmented space model (KLMS-AM) as an extra comparison, to show the general capabilities of the augmented space model. All the hyperparameters were validated to obtain the best possible results, including the kernel size, the step size and the regularization factor. For simplicity, a fixed K is chosen for the KNN algorithm, and all the parameters are shown in the last column of Table 1. The inclusion of a memory-based method is important because ASLM also uses a table lookup that is similar in spirit to memory-based modeling. The problem we selected is the prediction of one component of the Lorenz system [4], which has been well studied in the literature (order L=7 according to Takens' embedding theorem) [9]. The Lorenz data set is generated from the differential equations with a first-order approximation and a small step size. Segments of the series are used as the training set and the immediately following samples as the testing set. We normalize the time series to zero mean and unit variance. Performance is measured as the power of the error. Results are averaged over
independent training-test runs obtained by sliding the window over the generated data.

[Table 1. Testing MSE, training MSE and validated parameters for KLMS-AM, KLMS, ASLM, KNN and LS.]
We show the testing MSE and training MSE in Table 1. Since KLMS is an online algorithm while the other algorithms are batch based, the testing MSE of KLMS is calculated from the last 100 points of the converged learning curve. In terms of performance, LS is the worst performer. Even KNN does better, and we notice that ASLM always improves upon KNN for the same storage. KLMS is slightly better than ASLM, and can itself be further improved by the augmented space model. However, when accuracy and computation time are considered together, ASLM emerges as a very good compromise between the performance of the nonlinear and the linear models.
Fig. 2 shows the computation time and storage in the test phase for the compared algorithms. In terms of resources, the LS solution is unbeatable in both storage and computation time. Compared with KLMS and KLMS-AM, ASLM is much faster with comparable performance, which is far better than that of the LS algorithm. The bottleneck of ASLM is the search for the best candidate in the table lookup, which is very similar to KNN. The general locations of the linear and nonlinear models with respect to storage and computation time in this simulation are plotted as ellipsoid clouds around the corresponding points. ASLM is thus a linear model with nonlinear regression capacity, and its location in Fig. 2 deviates from the diagonal linking the linear and nonlinear models, which shows its efficiency. In this simulation, the augmented space model also shows surprising potential to improve different models, by making full use of the training error or desired response in the augmented space. Meanwhile, the computational complexity of searching the table is much smaller than that of KLMS, so the performance improvement does not bring a large computational burden, which explains Fig. 2.
3.2 Prediction With Noise
In this section, the data are the same as in the previous experiment, but the desired signal of the training data is corrupted by zero-mean Gaussian noise at 20 dB SNR. Performance is again measured as the power of the error, and results are averaged over independent training-test runs obtained by sliding the window over the generated data. The purpose of the experiment is to compare performance when the training set is not clean. Obviously, the performance of ASLM suffers when noise enters the table. We therefore use a simple sequential Vector Quantization (VQ) method to reduce the noise, which builds a small codebook controlled by a distance threshold instead of storing the original input values in the table. Based on the Euclidean distance, VQ is computationally simple (linear complexity in the codebook size) [10, 5]. The error of a center in the codebook is computed by averaging the training errors whose indexes are quantized into that center. This method was first used in KLMS to constrain the network size. In ASLM, VQ not only decreases the table size and computational complexity, but also improves performance by averaging the local training errors. Three extra comparisons are added to show the improvements brought by VQ, namely QKLMS, KLMS-QAM and QASLM. To be fair, the final codebook sizes are set to 500, and all the hyperparameters are validated to obtain the best possible results (kernel size, step size, quantization radius and regularization factor).
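The sequential VQ step described above can be sketched as follows; `build_codebook`, the toy scalar data and the radius value are our own illustrative choices, with each center's error computed by averaging the training errors quantized into it.

```python
# Sequential VQ sketch: a new input starts a new codebook center only if
# it is farther than `radius` from every existing center; otherwise its
# training error is averaged into the nearest center, which also reduces
# noise in the stored errors.
def build_codebook(xs, es, radius):
    centers, sums, counts = [], [], []
    for x, e in zip(xs, es):
        if centers:
            j = min(range(len(centers)), key=lambda j: abs(centers[j] - x))
            if abs(centers[j] - x) <= radius:
                sums[j] += e
                counts[j] += 1
                continue
        centers.append(x)
        sums.append(e)
        counts.append(1)
    return centers, [s / c for s, c in zip(sums, counts)]

xs = [0.00, 0.02, 0.04, 1.00, 1.03, 2.00]
es = [0.10, 0.14, 0.12, -0.20, -0.24, 0.30]
centers, errs = build_codebook(xs, es, radius=0.1)
print(centers)  # [0.0, 1.0, 2.0]: six samples quantized to three centers
print(errs)     # approximately [0.12, -0.22, 0.3]: locally averaged errors
```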
[Table 2. Testing MSE and validated parameters for KLMS, QKLMS, KLMS-AM, KLMS-QAM, ASLM, QASLM, KNN and LS.]
We show the testing MSE, training MSE and the corresponding parameters in Table 2. It is easy to notice that the linear model is remarkably robust in this comparison, since the performance of every algorithm except LS degrades relative to the previous result. ASLM still beats KNN, and both obtain better results than the LS algorithm. Without quantization, KLMS is the best predictor in this experiment and is more robust than KLMS-AM and ASLM, because the sum of weighted Gaussian kernels removes the noise to some extent. VQ reduces the storage of the KLMS, KLMS-AM and ASLM algorithms, while improving the performance of KLMS-AM and ASLM by reducing the noise in the table. However, the noise removal from VQ is not significant for the KLMS algorithm. Hence, KLMS-QAM shows the best performance with the help of VQ. QKLMS shows results similar to KLMS, which are still better than QASLM.
4 Conclusion And Discussion
We presented a new solution to the function approximation problem that takes advantage of the linear solution and corrects its estimate with the training error of the training input closest to the current test input. In essence, we combine the computational efficiency of the linear solution with a memory block encoding the training errors originating from the nonlinearity of the data generation process, which together produce a nonlinear response. In conventional nonlinear function approximation, one needs to find appropriate parameters of nonlinear mappers, which is difficult and computationally expensive. This is why ASLM occupies an interesting compromise, in the space of accuracy versus computational complexity, between the conventional linear and nonlinear solutions. Moreover, since ASLM only models the error of the linear solution, we have also shown that the same approach can improve the modeling of a nonlinear model's error, with KLMS-AM.
ASLM is a member of the CULM family that has not been investigated in the past. Conceptually, we propose to augment the input projection space with the desired response, which opens the door to many implementations beyond the simple ASLM discussed in this paper. Future research should focus on improving the table lookup, which is currently very rudimentary. In noisy situations, PCA could provide a better definition of the input space, and the training errors could be filtered by local modeling. In fact, it is very interesting to interpret the training errors as a sensitivity to the unknown desired response that can be exploited for Bayesian modeling. Since we have an implicit model of the input, we can also speed up the search for the closest neighbor of the current input. These simple modifications will improve ASLM performance and lead to new applications beyond function approximation. We therefore believe that this will be a vibrant line of research for years to come.
References
 [1] Simon S. Haykin, Adaptive Filter Theory, Pearson Education India, 2008.
 [2] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, John Wiley & Sons, 2012.
 [3] Jose C. Principe and Badong Chen, “Universal approximation with convex optimization: Gimmick or reality? [discussion forum],” IEEE Computational Intelligence Magazine, vol. 10, no. 2, pp. 68–77, 2015.
 [4] Weifeng Liu, Jose C. Principe, and Simon Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, vol. 57, John Wiley & Sons, 2011.
 [5] Weifeng Liu, Puskal P. Pokharel, and Jose C. Principe, “The kernel least-mean-square algorithm,” IEEE Transactions on Signal Processing, vol. 56, no. 2, pp. 543–554, 2008.
 [6] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
 [7] Hanan Samet, The Design and Analysis of Spatial Data Structures, vol. 199, Addison-Wesley, Reading, MA, 1990.
 [8] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
 [9] Edward N. Lorenz, “Deterministic nonperiodic flow,” Journal of the Atmospheric Sciences, vol. 20, no. 2, pp. 130–141, 1963.
 [10] Badong Chen, Songlin Zhao, Pingping Zhu, and José C. Príncipe, “Quantized kernel least mean square algorithm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.