## I Introduction

The basic framework in learning theory considers learning from examples by optimizing (minimizing or maximizing) a loss function so that the learned model can discover the structures (or dependencies) in the data-generating system under the uncertainty caused by noise or incomplete knowledge of the system [1]. Second order statistical measures such as mean square error (MSE), variance and correlation have been commonly used as loss functions in machine learning and adaptive system training because of their simplicity and mathematical tractability. For example, the goal of *least squares* (LS) regression is to learn an unknown mapping (linear or nonlinear) such that the MSE between the model output and the desired response is minimized. Also, the orthogonal linear transformation in *principal component analysis* (PCA) is determined such that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components [2]. *Canonical-correlation analysis* (CCA) is another example, where the goal is to find the linear combinations of the components in two random vectors that have maximum correlation with each other [3]. Loss functions based on second order statistical measures, however, are sensitive to outliers in the data and are in general not a good solution for learning with non-Gaussian data [1].

To handle non-Gaussian data (or noises), various non-second order (or non-quadratic) loss functions are frequently applied to learning systems. Typical examples include Huber’s min-max loss [4, 5], Lorentzian error loss [5], risk-sensitive loss [6] and mean p-power error (MPE) loss [7, 8]. The MPE is the $p$-th absolute moment of the error, which with a proper $p$ value can deal with non-Gaussian data well. In general, MPE is robust to large outliers when $p < 2$ [7]. Information theoretic measures, such as entropy, KL divergence and mutual information, can also be used as loss functions in machine learning and non-Gaussian signal processing since they can capture higher order statistics (i.e. moments or correlations beyond second order) of the data [1]. Many numerical examples have shown the superior performance of *information theoretic learning* (ITL) [1, 9]. Particularly in recent years, a novel ITL similarity measure, called correntropy, has been successfully applied to robust learning and signal processing [10, 11, 12, 13, 14, 15, 16, 17, 18]. Correntropy is a generalized correlation in a high dimensional kernel space (usually induced by a Gaussian kernel), which is directly related to the probability of how similar two random variables are in a neighborhood (controlled by the kernel bandwidth) of the joint space [10]. Since correntropy is a local similarity measure, it can increase the robustness with respect to outliers by assigning small weights to data beyond the neighborhood.

Essentially, correntropy is a second order statistical measure (i.e. correlation) in kernel space, which corresponds to a non-second order measure in the original space. Similarly, one can define other second order statistical measures, such as MSE, in kernel space. The MSE in kernel space is also called the *correntropic loss* (C-Loss) [19, 20]. It can be shown that minimizing the C-Loss is equivalent to maximizing the correntropy. In this paper, we define a non-second order measure in kernel space, called *kernel mean p-power error* (KMPE), which is the MPE in kernel space and, of course, is also a non-second order measure in the original space. The KMPE reduces to the C-Loss when $p = 2$, but with a proper $p$ value it can outperform the C-Loss when used as a loss function in robust learning. In the present work, we focus mainly on two application examples, *extreme learning machine* (ELM) [21, 22] and PCA. The ELM is a single-hidden-layer feedforward neural network (SLFN) with randomly generated hidden nodes, which can be used for regression, classification and many other learning tasks [21, 22]. The proposed KMPE will be used to develop robust ELM and PCA algorithms.

The rest of the paper is structured as follows. In section II, we define the KMPE and give some basic properties. In section III, we apply the KMPE to ELM and PCA, and develop the ELM-KMPE and PCA-KMPE algorithms. In section IV, we present experimental results to demonstrate the desirable performance of the new algorithms. Finally, in section V, we give the conclusion.

## II Kernel Mean p-Power Error

### II-A Definition

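Non-second order statistical measures can be defined elegantly as second order measures in kernel space. For example, the correntropy between two random variables $X$ and $Y$ is a correlation measure in kernel space, given by [10]

$$
V(X,Y) = \mathbf{E}\left[\langle \Phi(X), \Phi(Y) \rangle_{\mathcal{H}}\right] = \int \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}} \, dF_{XY}(x,y) \tag{1}
$$

where $\mathbf{E}[\cdot]$ denotes the expectation operator, $F_{XY}(x,y)$ stands for the joint distribution function, and $\Phi(x) = \kappa(x,\cdot)$ is a nonlinear mapping induced by a Mercer kernel $\kappa(\cdot,\cdot)$, which transforms $x$ from the original space to a functional Hilbert space (or kernel space) $\mathcal{H}$ equipped with an inner product $\langle \cdot,\cdot \rangle_{\mathcal{H}}$ satisfying $\langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}} = \kappa(x,y)$. Obviously, we have $V(X,Y) = \mathbf{E}[\kappa(X,Y)]$. In this paper, unless mentioned otherwise, the kernel function is a Gaussian kernel, given by

$$
\kappa_\sigma(x,y) = \kappa_\sigma(x-y) = \exp\left(-\frac{(x-y)^2}{2\sigma^2}\right) \tag{2}
$$

with $\sigma > 0$ being the kernel bandwidth. Similarly, the C-Loss, as the MSE in kernel space, can be defined by [14]

$$
\begin{aligned}
C(X,Y) &= \frac{1}{2}\mathbf{E}\left[\|\Phi(X) - \Phi(Y)\|_{\mathcal{H}}^2\right] \\
&= \frac{1}{2}\mathbf{E}\left[\langle \Phi(X), \Phi(X) \rangle_{\mathcal{H}} + \langle \Phi(Y), \Phi(Y) \rangle_{\mathcal{H}} - 2\langle \Phi(X), \Phi(Y) \rangle_{\mathcal{H}}\right] \\
&= \mathbf{E}\left[1 - \kappa_\sigma(X - Y)\right]
\end{aligned} \tag{3}
$$

where $1/2$ is inserted to make the expression more convenient. It holds that $C(X,Y) = 1 - V(X,Y)$, hence minimizing the C-Loss is equivalent to maximizing the correntropy. The *maximum correntropy criterion* (MCC) has drawn more and more attention recently due to its robustness to large outliers [10, 11, 12, 13, 14, 15, 16, 17, 18].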

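In this work, we define a new statistical measure in kernel space in a non-second order manner. Specifically, we generalize the C-Loss to the case of arbitrary power and define the mean p-power error (MPE) in kernel space, and call the new measure the kernel MPE (KMPE). Given two random variables $X$ and $Y$, the KMPE is defined by

$$
\begin{aligned}
C_p(X,Y) &= 2^{-p/2}\,\mathbf{E}\left[\|\Phi(X) - \Phi(Y)\|_{\mathcal{H}}^p\right] \\
&= 2^{-p/2}\,\mathbf{E}\left[\left(2 - 2\kappa_\sigma(X - Y)\right)^{p/2}\right] \\
&= \mathbf{E}\left[\left(1 - \kappa_\sigma(X - Y)\right)^{p/2}\right]
\end{aligned} \tag{4}
$$

where $p > 0$ is the power parameter. Clearly, the KMPE includes the C-Loss as a special case (when $p = 2$). In addition, given $N$ samples $\{(x_i, y_i)\}_{i=1}^{N}$, the *empirical KMPE* can be easily obtained as

$$
\hat{C}_p(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \kappa_\sigma(x_i - y_i)\right)^{p/2} \tag{5}
$$

Since $\hat{C}_p(X,Y)$ is a function of the sample vectors $\mathbf{X} = [x_1, \dots, x_N]^T$ and $\mathbf{Y} = [y_1, \dots, y_N]^T$, one can also denote $\hat{C}_p(X,Y)$ by $\hat{C}_p(\mathbf{X},\mathbf{Y})$ if no confusion arises.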
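As a minimal sketch of the empirical KMPE, assume it takes the form $\frac{1}{N}\sum_i (1 - \kappa_\sigma(x_i - y_i))^{p/2}$ with the unnormalized Gaussian kernel $\kappa_\sigma(e) = \exp(-e^2 / 2\sigma^2)$, which is the form implied by the boundedness and $p = 2$ statements of this section. The function names `gaussian_kernel` and `empirical_kmpe` are illustrative and do not come from the paper.

```python
import numpy as np


def gaussian_kernel(e, sigma):
    """Unnormalized Gaussian kernel evaluated on the difference e = x - y."""
    e = np.asarray(e, dtype=float)
    return np.exp(-e**2 / (2.0 * sigma**2))


def empirical_kmpe(x, y, sigma=1.0, p=2.0):
    """Sample mean of (1 - kappa_sigma(x_i - y_i))^(p/2) (assumed KMPE form)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    if sigma <= 0 or p <= 0:
        raise ValueError("sigma and p must be positive")
    return float(np.mean((1.0 - gaussian_kernel(x - y, sigma)) ** (p / 2.0)))


# One gross outlier (25.0) contributes at most 1/N to the loss,
# whereas it would dominate a mean square error.
x = np.array([0.1, -0.4, 2.0, 25.0])
y = np.zeros(4)
print(empirical_kmpe(x, y, sigma=1.0, p=2.0))
print(np.mean((x - y) ** 2))
```

With $p = 2$ the value coincides with one minus the sample correntropy, which is the C-Loss special case mentioned above.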

### II-B Properties

Some basic properties of the proposed KMPE are presented below.

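*Property 1*: $\hat{C}_p(\mathbf{X},\mathbf{Y})$ is symmetric, that is $\hat{C}_p(\mathbf{X},\mathbf{Y}) = \hat{C}_p(\mathbf{Y},\mathbf{X})$.

*Proof*: Straightforward since $\kappa_\sigma(x - y) = \kappa_\sigma(y - x)$.

*Property 2*: $\hat{C}_p(\mathbf{X},\mathbf{Y})$ is positive and bounded: $0 \le \hat{C}_p(\mathbf{X},\mathbf{Y}) < 1$, and it reaches its minimum if and only if $\mathbf{X} = \mathbf{Y}$.

*Proof*: Straightforward since $0 < \kappa_\sigma(x - y) \le 1$, with $\kappa_\sigma(x - y) = 1$ if and only if $x = y$.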

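*Property 3*: As $p$ is small enough, it holds that $\hat{C}_p(\mathbf{X},\mathbf{Y}) \approx 1 + \frac{p}{2N}\sum_{i=1}^{N}\log\left(1 - \kappa_\sigma(x_i - y_i)\right)$.

*Proof*: The property holds since $\left(1 - \kappa_\sigma(x - y)\right)^{p/2} \approx 1 + \frac{p}{2}\log\left(1 - \kappa_\sigma(x - y)\right)$ for $p$ small enough.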

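*Property 4*: As $\sigma$ is large enough, it holds that $\hat{C}_p(\mathbf{X},\mathbf{Y}) \approx \frac{1}{N(\sqrt{2}\sigma)^p}\sum_{i=1}^{N}|x_i - y_i|^p$.

*Proof*: Since $\exp(x) \approx 1 + x$ for $x$ small enough, as $\sigma \to \infty$, we have

$$
\begin{aligned}
\hat{C}_p(\mathbf{X},\mathbf{Y}) &= \frac{1}{N}\sum_{i=1}^{N}\left(1 - \exp\left(-\frac{(x_i - y_i)^2}{2\sigma^2}\right)\right)^{p/2} \\
&\approx \frac{1}{N}\sum_{i=1}^{N}\left(\frac{(x_i - y_i)^2}{2\sigma^2}\right)^{p/2} \\
&= \frac{1}{N(\sqrt{2}\sigma)^p}\sum_{i=1}^{N}|x_i - y_i|^p
\end{aligned} \tag{6}
$$

*Remark*: By Property 4, one can conclude that the KMPE is approximately equivalent to the MPE (up to a constant scale factor) when the kernel bandwidth $\sigma$ is large enough.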

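*Property 5*: Let $\mathbf{e} = \mathbf{X} - \mathbf{Y} = [e_1, \dots, e_N]^T$, where $e_i = x_i - y_i$. If $p \ge 2$, the empirical KMPE $\hat{C}_p(\mathbf{X},\mathbf{Y})$ as a function of $\mathbf{e}$ is convex at any point satisfying $\|\mathbf{e}\|_\infty = \max_i |e_i| \le \sigma$.

*Proof*: Since $\hat{C}_p(\mathbf{X},\mathbf{Y}) = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \kappa_\sigma(e_i)\right)^{p/2}$ with $\kappa_\sigma(e_i) = \exp(-e_i^2 / 2\sigma^2)$, the Hessian matrix of $\hat{C}_p(\mathbf{X},\mathbf{Y})$ with respect to $\mathbf{e}$ is

$$
\mathbf{H}_{\hat{C}_p}(\mathbf{e}) = \mathrm{diag}\left[\xi_1, \xi_2, \dots, \xi_N\right] \tag{7}
$$

where

$$
\xi_i = \frac{p}{4N\sigma^4}\left(1 - \kappa_\sigma(e_i)\right)^{(p-4)/2}\kappa_\sigma(e_i)\left[(p-2)\,e_i^2\,\kappa_\sigma(e_i) + 2\left(\sigma^2 - e_i^2\right)\left(1 - \kappa_\sigma(e_i)\right)\right] \tag{8}
$$

When $p \ge 2$, we have $\xi_i \ge 0$ if $|e_i| \le \sigma$. Thus, for any point $\mathbf{e}$ with $\|\mathbf{e}\|_\infty \le \sigma$, we have $\mathbf{H}_{\hat{C}_p}(\mathbf{e}) \ge 0$.

*Property 6*: Given any point $\mathbf{e}$ with $e_i \ne 0$ ($i = 1, \dots, N$), the empirical KMPE will be convex at $\mathbf{e}$ if $p$ is larger than a certain value.

*Proof*: From (8), if $|e_i| \le \sigma$ and $p \ge 2$, or if $|e_i| > \sigma$ and $p \ge \frac{2\left(e_i^2 - \sigma^2\right)\left(1 - \kappa_\sigma(e_i)\right)}{e_i^2\,\kappa_\sigma(e_i)} + 2$, we have $\xi_i \ge 0$. So, it holds that $\mathbf{H}_{\hat{C}_p}(\mathbf{e}) \ge 0$ if

$$
p \ge \max_{i:\,|e_i| > \sigma}\left\{\frac{2\left(e_i^2 - \sigma^2\right)\left(1 - \kappa_\sigma(e_i)\right)}{e_i^2\,\kappa_\sigma(e_i)} + 2\right\} \tag{9}
$$

This completes the proof.

*Remark*: According to Properties 5 and 6, the empirical KMPE as a function of $\mathbf{e}$ is convex at any point with $\|\mathbf{e}\|_\infty \le \sigma$, and it can also be convex at a point with $\|\mathbf{e}\|_\infty > \sigma$ if the power parameter $p$ is larger than a certain value.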

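*Property 7*: Let $\mathbf{0}$ be an $N$-dimensional zero vector. Then as $\sigma \to \infty$ (or $x_i \to 0$, $i = 1, \dots, N$), it holds that

$$
\hat{C}_p(\mathbf{X},\mathbf{0}) \approx \frac{1}{N(\sqrt{2}\sigma)^p}\|\mathbf{X}\|_p^p \tag{10}
$$

where $\|\mathbf{X}\|_p^p = \sum_{i=1}^{N}|x_i|^p$.

*Proof*: As $\sigma$ is large enough, we have

$$
\begin{aligned}
\hat{C}_p(\mathbf{X},\mathbf{0}) &= \frac{1}{N}\sum_{i=1}^{N}\left(1 - \exp\left(-\frac{x_i^2}{2\sigma^2}\right)\right)^{p/2} \\
&\approx \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i^2}{2\sigma^2}\right)^{p/2} = \frac{1}{N(\sqrt{2}\sigma)^p}\sum_{i=1}^{N}|x_i|^p
\end{aligned} \tag{11}
$$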

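*Property 8*: Assume that $|x_i| > \delta$, $\forall i: x_i \ne 0$, where $\delta$ is a small positive number. As $\sigma \to 0+$, minimizing the empirical KMPE $\hat{C}_p(\mathbf{X},\mathbf{0})$ will be, approximately, equivalent to minimizing the $l_0$-norm of $\mathbf{X}$, that is

$$
\min_{\mathbf{X} \in \Omega}\hat{C}_p(\mathbf{X},\mathbf{0}) \sim \min_{\mathbf{X} \in \Omega}\|\mathbf{X}\|_0 \tag{12}
$$

where $\Omega$ denotes a feasible set of $\mathbf{X}$.

*Proof*: Let $\mathbf{X}_0$ be the solution obtained by minimizing $\|\mathbf{X}\|_0$ over $\Omega$ and $\mathbf{X}_C$ the solution achieved by minimizing $\hat{C}_p(\mathbf{X},\mathbf{0})$. Then $\hat{C}_p(\mathbf{X}_C,\mathbf{0}) \le \hat{C}_p(\mathbf{X}_0,\mathbf{0})$, and

$$
\sum_{i=1}^{N}\left(1 - \kappa_\sigma\left((\mathbf{X}_C)_i\right)\right)^{p/2} \le \sum_{i=1}^{N}\left(1 - \kappa_\sigma\left((\mathbf{X}_0)_i\right)\right)^{p/2} \tag{13}
$$

where $(\mathbf{X}_C)_i$ denotes the $i$th component of $\mathbf{X}_C$. Zero components contribute nothing to either sum, so it follows that

$$
\|\mathbf{X}_C\|_0 - \sum_{i:\,(\mathbf{X}_C)_i \ne 0}\left[1 - \left(1 - \kappa_\sigma\left((\mathbf{X}_C)_i\right)\right)^{p/2}\right] \le \|\mathbf{X}_0\|_0 - \sum_{i:\,(\mathbf{X}_0)_i \ne 0}\left[1 - \left(1 - \kappa_\sigma\left((\mathbf{X}_0)_i\right)\right)^{p/2}\right] \tag{14}
$$

Hence

$$
\|\mathbf{X}_C\|_0 - \|\mathbf{X}_0\|_0 \le \sum_{i:\,(\mathbf{X}_C)_i \ne 0}\left[1 - \left(1 - \kappa_\sigma\left((\mathbf{X}_C)_i\right)\right)^{p/2}\right] \tag{15}
$$

Since $|(\mathbf{X}_C)_i| > \delta$, $\forall i: (\mathbf{X}_C)_i \ne 0$, as $\sigma \to 0+$ the right hand side of (15) will approach zero. Thus, if $\sigma$ is small enough, it holds that

$$
\|\mathbf{X}_0\|_0 \le \|\mathbf{X}_C\|_0 \le \|\mathbf{X}_0\|_0 + \varepsilon \tag{16}
$$

where $\varepsilon$ is a small positive number arbitrarily close to zero. This completes the proof.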

*Remark*: From Properties 7 and 8, one can see that the empirical KMPE behaves like an $l_p$ norm of $\mathbf{X}$ when the kernel bandwidth $\sigma$ is very large, and like an $l_0$ norm of $\mathbf{X}$ when $\sigma$ is very small.
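The two limiting regimes in this remark can be checked numerically. The sketch below assumes the empirical KMPE has the form $\frac{1}{N}\sum_i (1 - \exp(-x_i^2 / 2\sigma^2))^{p/2}$ when compared against the zero vector. The reference values $\|\mathbf{X}\|_p^p / (N(\sqrt{2}\sigma)^p)$ and $\|\mathbf{X}\|_0 / N$ follow from that assumed form, and the helper name `empirical_kmpe_to_zero` is hypothetical.

```python
import numpy as np


def empirical_kmpe_to_zero(x, sigma, p):
    """Assumed empirical KMPE between x and the zero vector."""
    x = np.asarray(x, dtype=float)
    # -expm1(-u) equals 1 - exp(-u) but stays accurate when u is tiny (large sigma).
    one_minus_kappa = -np.expm1(-x**2 / (2.0 * sigma**2))
    return float(np.mean(one_minus_kappa ** (p / 2.0)))


x = np.array([0.0, 0.0, 3.0, -1.0, 0.0, 2.0])
p = 3.0
n = x.size

# Large bandwidth: compare with ||x||_p^p / (N * (sqrt(2) * sigma)^p).
sigma_large = 100.0
lp_value = np.sum(np.abs(x) ** p) / (n * (np.sqrt(2.0) * sigma_large) ** p)
ratio_large = empirical_kmpe_to_zero(x, sigma_large, p) / lp_value

# Small bandwidth: compare with ||x||_0 / N, the fraction of nonzero entries.
sigma_small = 0.01
l0_value = np.count_nonzero(x) / n
kmpe_small = empirical_kmpe_to_zero(x, sigma_small, p)

print(ratio_large, kmpe_small, l0_value)
```

For this toy vector the ratio in the large-bandwidth case should be close to one, and the small-bandwidth value should match the fraction of nonzero entries.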

## III Application Examples

Many problems in machine learning and signal processing can be solved more robustly by using the KMPE as the loss function. In this section, we present two examples that illustrate the benefits of the KMPE.

### III-A Extreme Learning Machine

The first example is the Extreme Learning Machine (ELM), a single-hidden-layer feedforward neural network (SLFN) with random hidden nodes [21, 22]. With a quadratic loss function, the ELM usually requires no iterative tuning and the global optimum can be obtained in batch mode. In the following, we use the KMPE as the loss function for ELM and develop a robust algorithm to train the model. Since there is no closed-form solution under the KMPE loss, the new algorithm is a fixed-point iterative algorithm.

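Given $N$ distinct training samples $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$, with $\mathbf{x}_i \in \mathbb{R}^d$ being the input vector and $t_i \in \mathbb{R}$ the target response, the output of a standard SLFN with $L$ hidden nodes will be

$$
y_i = \sum_{j=1}^{L}\beta_j f(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) \tag{17}
$$

where $f(\cdot)$ is an activation function, $\mathbf{w}_j \in \mathbb{R}^d$ and $b_j \in \mathbb{R}$ ($j = 1, \dots, L$) are the learning parameters of the $j$th hidden node, $\mathbf{w}_j \cdot \mathbf{x}_i$ denotes the inner product of $\mathbf{w}_j$ and $\mathbf{x}_i$, and $\beta_j$ represents the weight parameter of the link connecting the $j$th hidden node to the output node. The above equation can be written in a vector form as

$$
\mathbf{Y} = \mathbf{H}\boldsymbol{\beta} \tag{18}
$$

where $\mathbf{Y} = [y_1, \dots, y_N]^T$, $\boldsymbol{\beta} = [\beta_1, \dots, \beta_L]^T$, and

$$
\mathbf{H} = \begin{bmatrix}
f(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & f(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\
\vdots & \ddots & \vdots \\
f(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & f(\mathbf{w}_L \cdot \mathbf{x}_N + b_L)
\end{bmatrix} \tag{19}
$$

represents the output matrix of the hidden layer. In general, the output weight vector $\boldsymbol{\beta}$ can be solved by minimizing the regularized MSE (or least squares) loss:

$$
J_{MSE}(\boldsymbol{\beta}) = \sum_{i=1}^{N} e_i^2 + \lambda\|\boldsymbol{\beta}\|_2^2 = \|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 \tag{20}
$$

where $e_i = t_i - y_i$ is the error between the $i$th target response and the $i$th actual output, $\lambda \ge 0$ stands for the regularization parameter to prevent overfitting, and $\mathbf{T} = [t_1, \dots, t_N]^T$ is the target response vector. With a pseudo inversion operation, one can easily obtain a unique solution under the loss (20), that is

$$
\boldsymbol{\beta} = \left[\mathbf{H}^T\mathbf{H} + \lambda\mathbf{I}\right]^{-1}\mathbf{H}^T\mathbf{T} \tag{21}
$$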

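In order to obtain a solution that is robust with respect to large outliers, we now consider the following KMPE based loss function:

$$
J_{KMPE}(\boldsymbol{\beta}) = \hat{C}_p(\mathbf{T}, \mathbf{H}\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_2^2 = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \kappa_\sigma(e_i)\right)^{p/2} + \lambda\|\boldsymbol{\beta}\|_2^2 \tag{22}
$$

Note that, different from the loss function in (20), the new loss function is little influenced by large errors since the term $\left(1 - \kappa_\sigma(e_i)\right)^{p/2}$ is upper bounded by 1.0.

Let $\partial J_{KMPE}(\boldsymbol{\beta}) / \partial\boldsymbol{\beta} = 0$. Then we derive

$$
\begin{aligned}
& -\frac{p}{2N\sigma^2}\sum_{i=1}^{N}\left(1 - \kappa_\sigma(e_i)\right)^{(p-2)/2}\kappa_\sigma(e_i)\,e_i\,\mathbf{h}_i^T + 2\lambda\boldsymbol{\beta} = 0 \\
\Rightarrow\; & \sum_{i=1}^{N}\Lambda_{ii}\left(t_i - \mathbf{h}_i\boldsymbol{\beta}\right)\mathbf{h}_i^T = \lambda'\boldsymbol{\beta} \\
\Rightarrow\; & \boldsymbol{\beta} = \left[\mathbf{H}^T\boldsymbol{\Lambda}\mathbf{H} + \lambda'\mathbf{I}\right]^{-1}\mathbf{H}^T\boldsymbol{\Lambda}\mathbf{T}
\end{aligned} \tag{23}
$$

where $\mathbf{h}_i$ is the $i$th row of $\mathbf{H}$, $\lambda' = 4N\sigma^2\lambda / p$, and $\boldsymbol{\Lambda}$ is a diagonal matrix with diagonal elements $\Lambda_{ii} = \left(1 - \kappa_\sigma(e_i)\right)^{(p-2)/2}\kappa_\sigma(e_i)$.

The derived optimal solution is not a closed-form solution since the matrix $\boldsymbol{\Lambda}$ on the right-hand side depends on the weight vector $\boldsymbol{\beta}$ through $e_i = t_i - \mathbf{h}_i\boldsymbol{\beta}$. So it is actually a fixed-point equation. The true optimal solution can thus be solved by a fixed-point iterative algorithm, as summarized in Algorithm 1. This algorithm is referred to as the ELM-KMPE in this work.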
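The fixed-point iteration described here can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the objective is taken to be $\frac{1}{N}\sum_i (1 - \kappa_\sigma(e_i))^{p/2} + \lambda\|\boldsymbol{\beta}\|_2^2$ with an unnormalized Gaussian kernel. The diagonal weights and the effective regularization `lam_eff` are obtained by setting the gradient of that assumed objective to zero. The function name `kmpe_output_weights`, the stopping rule, and the toy two-column `H` are all hypothetical.

```python
import numpy as np


def kmpe_output_weights(H, T, sigma=1.0, p=2.0, lam=1e-6, max_iter=50, tol=1e-8):
    """Fixed-point iteration for the output weights under the assumed KMPE loss."""
    H = np.asarray(H, dtype=float)
    T = np.asarray(T, dtype=float)
    n, n_hidden = H.shape
    eye = np.eye(n_hidden)
    # Start from the regularized least squares solution.
    beta = np.linalg.solve(H.T @ H + lam * eye, H.T @ T)
    # Effective regularization implied by the assumed objective (derived, not quoted).
    lam_eff = 4.0 * n * sigma**2 * lam / p
    for _ in range(max_iter):
        e = T - H @ beta
        kappa = np.exp(-e**2 / (2.0 * sigma**2))
        # Clip 1 - kappa away from zero so that p < 2 does not divide by zero.
        one_minus = np.maximum(1.0 - kappa, 1e-12)
        w = one_minus ** ((p - 2.0) / 2.0) * kappa   # diagonal of the weight matrix
        HtW = H.T * w                                # H^T Lambda without forming Lambda
        beta_new = np.linalg.solve(HtW @ H + lam_eff * eye, HtW @ T)
        converged = np.linalg.norm(beta_new - beta) <= tol * (1.0 + np.linalg.norm(beta))
        beta = beta_new
        if converged:
            break
    return beta


# Toy check with a hypothetical two-column H (a linear model), not a random hidden layer.
x = np.linspace(-1.0, 1.0, 200)
H = np.column_stack([x, np.ones_like(x)])
T = 2.0 * x + 1.0 + 0.05 * np.sin(37.0 * np.arange(200))
T[::10] += 10.0                                      # every 10th target is a large outlier
beta_ls = np.linalg.solve(H.T @ H + 1e-6 * np.eye(2), H.T @ T)
beta_kmpe = kmpe_output_weights(H, T, sigma=1.0, p=2.0)
print(beta_ls, beta_kmpe)
```

In an ELM setting, `H` would instead hold the activations of the randomly generated hidden nodes. The weights `w` shrink toward zero for samples with large errors, which is what limits the influence of the outliers in this toy example.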

### III-B Principal Component Analysis

The second example is the Principal Component Analysis (PCA), one of the most popular dimensionality reduction methods [2]. Below we use the proposed KMPE as the loss function to derive a robust PCA algorithm.

Consider a set of samples , with being the dimension number and the sample number. The PCA methods try to find a projection matrix to define a new orthogonal coordinate system that can optimally describe the variability in the data set. In L2-PCA, the projection matrix is solved by minimizing the following loss function [2]:

(24) | ||||

where denotes the column-wise-zero-mean version of X, with , is the sample mean of column vectors, and contains the principal components that are projected under the projection matrix W.

In order to prevent the outliers in the edge data from corrupting the results of dimensionality reduction, we minimize the following robust cost function for PCA:

(25) | ||||

where . Indeed, the cost function belongs to the M-estimation robust cost functions [23, 24], and minimizing the cost (25) is an M-estimation problem. It is instructive and useful to transform the minimization of (25) into a weighted least squares problem, which can be solved by iteratively reweighted least squares (IRLS). This method was originally proposed in [25] and has been successfully used in robust statistics [26, 27, 28, 29, 30] and PCA [31]. Here, the weighting matrix is a diagonal matrix with elements , where . In this way, the cost function (25) will be equivalent to the following weighted least squares cost:

(26) | ||||

where

(27) |

Setting , we derive

(28) |

In addition, we can easily obtain the following solution

(29) |

The optimization problem (29) is a weighted PCA that can be computed by solving the corresponding eigenvalue problem. The solution of (25) can thus be obtained by iterating (27), (28) and (29). This algorithm is called the PCA-KMPE in this work; when $p = 2$ it performs the HQ-PCA [12]. To learn an -dimensional subspace, one can use a trick as in [12] to learn a small dimensional subspace to further eliminate the influence of outliers. The proposed PCA-KMPE is summarized in Algorithm 2.

The kernel width is an important parameter in PCA-KMPE. In general, one can employ Silverman's rule [32] to adjust the kernel width:

(30) |

where is the standard deviation of and is the interquartile range.

## IV Experimental Results

This section presents some experimental results to verify the advantages of the ELM-KMPE and PCA-KMPE developed in the previous section.

### IV-A Function estimation with synthetic data

In this example, sinc function estimation, a popular illustrative example for nonlinear regression in the literature, is used to evaluate the performance of the proposed ELM-KMPE and other ELM algorithms, such as ELM [21], RELM [33] and ELM-RCC [34]. The synthetic data are generated by , where ,

(31) |

and is a noise modeled as , where is a binary iid process with probability mass , ( ), denotes the background noise and is another noise process to represent outliers. The noise processes and are mutually independent and both independent of . In this subsection, is set at 0.1 and is assumed to be a zero-mean Gaussian noise with variance 9.0. Two background noises are considered: a) Uniform distribution over and b) Sine wave noise , with uniformly distributed over . In addition, the input data are drawn uniformly from . In the simulation, 200 samples are used for training and another 200 noise-free samples are used for testing. The RMSE is employed to measure the performance, calculated by

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{32}
$$

where $y_i$ and $\hat{y}_i$ denote the target values and corresponding estimated values respectively, and $N$ is the number of samples. The parameter settings of the four algorithms under the two distributions of are summarized in Table 1, where $L$, $\lambda$ (or $\lambda'$), $\sigma$ and $p$ denote the number of hidden layer nodes, regularization parameter, kernel width and the power parameter in ELM-KMPE. The estimation results and testing RMSEs are shown in Fig. 1 and Table 2. It is evident that the ELM-KMPE achieves the best performance among the four algorithms.

ELM | RELM | ELM-RCC | ELM-KMPE | |||||||
---|---|---|---|---|---|---|---|---|---|---|

Uniform | 20 | 90 | 90 | 1.5 | 90 | 0.8 | 4 | |||

Sine wave | 10 | 40 | 25 | 2 | 25 | 1.2 | 3.4 |

Table 2: Testing RMSEs of the four algorithms under the two background noises.

Background noise | ELM | RELM | ELM-RCC | ELM-KMPE
---|---|---|---|---
Uniform | 0.5117 | 0.2234 | 0.1671 | 0.1079
Sine wave | 0.3340 | 0.2498 | 0.2335 | 0.1156
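The noise model and the RMSE measure used in this subsection can be sketched as follows. The mixture form $(1 - B)A + BC$, with $B$ a Bernoulli switch, is an assumption about the stripped noise equation. The uniform background on $[-1, 1]$ is a hypothetical placeholder, since the interval used in the experiment is not recoverable from the text. The outlier probability 0.1 and outlier variance 9.0 are the values stated above, and the function names are illustrative.

```python
import numpy as np


def mixture_noise(n, outlier_prob, background, outlier_std, rng):
    """Draw n samples of (1 - B) * A + B * C with P(B = 1) = outlier_prob (assumed form)."""
    is_outlier = rng.random(n) < outlier_prob          # binary i.i.d. switch B
    a = background(n, rng)                             # background noise A
    c = rng.normal(0.0, outlier_std, n)                # zero-mean Gaussian outlier noise C
    return np.where(is_outlier, c, a), is_outlier


def rmse(target, estimate):
    """Root mean square error between target and estimated values."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return float(np.sqrt(np.mean((target - estimate) ** 2)))


def uniform_background(n, rng):
    # Hypothetical support [-1, 1]; replace with the interval of the actual experiment.
    return rng.uniform(-1.0, 1.0, n)


rng = np.random.default_rng(0)
noise, is_outlier = mixture_noise(
    10_000, outlier_prob=0.1, background=uniform_background, outlier_std=3.0, rng=rng
)
print(is_outlier.mean(), noise.std())
```

Testing on noise-free targets, as described above, means that `rmse` measures the distance to the underlying function rather than to the contaminated observations.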

### IV-B Regression and classification on benchmark datasets

In this subsection, we compare the aforementioned four algorithms on regression and classification problems with benchmark datasets from the UCI machine learning repository [35]. The details of the datasets are shown in Tables 3 and 4. For each dataset, the training and testing samples are randomly selected from the set. In particular, the data for regression are normalized to the range . The parameter settings of the four algorithms for regression and classification experiments are presented in Tables 5 and 6. For each algorithm, the parameters are experimentally chosen by fivefold cross-validation. The RMSE is used as the performance measure for regression. For classification, the performance is measured by the accuracy (ACC). Let $y_i$ and $t_i$ be the predicted and target labels of the $i$th sample. The ACC is defined by

$$
\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N}\delta\left(t_i, \mathrm{map}(y_i)\right) \tag{33}
$$

where $\delta(\cdot)$ is an indicator function, $\delta(x, y) = 1$ if $x = y$, otherwise $\delta(x, y) = 0$, $\mathrm{map}(\cdot)$ maps each predicted label to the equivalent target label, and $N$ is the number of samples. The Kuhn-Munkres algorithm [36] is employed to realize such a mapping. The “mean ± standard deviation” results of the RMSE and ACC during training and testing are shown in Tables 7 and 8, where the best testing results are shown in bold for each dataset. As one can see, in all cases the proposed ELM-KMPE outperforms the other algorithms.
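To make the role of the label mapping concrete, the sketch below computes the ACC with a brute-force search over one-to-one label maps. This is a deliberate stand-in for the Kuhn-Munkres algorithm: it finds the same optimal mapping but is only practical for a handful of classes. The function name `mapped_accuracy` is hypothetical.

```python
from itertools import permutations


def mapped_accuracy(predicted, target):
    """ACC under the best one-to-one map from predicted labels to target labels."""
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("predicted and target must be non-empty and equally long")
    pred_labels = sorted(set(predicted))
    target_labels = sorted(set(target))
    if len(pred_labels) > len(target_labels):
        raise ValueError("more predicted labels than target labels")
    best_hits = 0
    # Brute force over injective maps; an assignment solver replaces this loop at scale.
    for perm in permutations(target_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(1 for y, t in zip(predicted, target) if mapping[y] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(target)


target = [0, 0, 1, 1, 2, 2]
print(mapped_accuracy([1, 1, 2, 2, 0, 0], target))   # labels permuted, same partition
print(mapped_accuracy([1, 1, 2, 0, 0, 0], target))   # one sample in the wrong group
```

For datasets with many classes, the loop over permutations would be replaced by an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the label co-occurrence counts.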

Table 3: Regression benchmark datasets.

Datasets | Features | Training observations | Testing observations
---|---|---|---
Servo | 5 | 83 | 83
Concrete | 9 | 515 | 515
Wine red | 12 | 799 | 799
Housing | 14 | 253 | 253
Airfoil | 5 | 751 | 751
Slump | 10 | 52 | 51
Yacht | 6 | 154 | 154

Table 4: Classification benchmark datasets.

Datasets | Classes | Features | Training observations | Testing observations
---|---|---|---|---
Glass | 7 | 11 | 114 | 100
Wine | 3 | 13 | 89 | 89
Ecoli | 8 | 7 | 180 | 156
User-Modeling | 2 | 5 | 138 | 120
Wdbc | 2 | 30 | 100 | 496
Leaf | 36 | 14 | 180 | 160
Vehicle | 4 | 18 | 500 | 346
Seed | 3 | 7 | 110 | 100

Datasets | ELM | RELM | ELM-RCC | ELM-KMPE | ||||||
---|---|---|---|---|---|---|---|---|---|---|

L | L | L | L | p | ||||||

Servo | 25 | 90 | 0.00001 | 65 | 0.8 | 0.0001 |
