## I Introduction

As a well-known learning criterion in information theoretic learning (ITL) [1, 2, 3], the minimum error entropy (MEE) criterion has been applied successfully to various learning tasks, including regression, classification, clustering, feature selection and many others [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. The basic principle of MEE is to learn a model that discovers structure in data by minimizing the entropy of the error between the model and the data-generating system [1]. Entropy takes all higher order moments into account and hence is a global descriptor of the underlying distribution. MEE can perform much better than the traditional mean square error (MSE) criterion, which considers only the second order moment of the error, especially in nonlinear and non-Gaussian (multi-peak, heavy-tailed, etc.) signal processing and machine learning.

In practical applications, an MEE cost can be estimated based on a PDF estimator. The most widely used MEE cost in ITL is the information potential (IP), which is the argument of the logarithm in Renyi’s entropy [1]. The IP can be estimated directly from data and computed by a double summation over all samples. This is quite different from traditional learning costs, which involve only a single summation. Although the IP is simpler than many other entropic costs, it is still computationally very expensive due to the pairwise computation (i.e. the double summation), which may become a bottleneck for large-scale datasets. To address this issue, we propose in this paper an efficient approach that decreases the computational complexity of the IP from $O(N^2)$ to $O(MN)$ with $M \ll N$, where $N$ is the number of samples and $M$ is the number of code words. The basic idea is to simplify the inner summation by quantizing the error samples with a simple quantization method. The simplified learning criterion is called the quantized MEE (QMEE). Some properties of the QMEE are presented, and the desirable performance of QMEE is confirmed by several illustrative results.

The remainder of the paper is organized as follows. The MEE criterion is briefly reviewed in section II. The QMEE is proposed in section III. The illustrative examples are provided in section IV and finally, the conclusion is given in section V.

## II Brief Review of MEE Criterion

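Consider learning from $N$ examples $\{(x_i, y_i)\}_{i=1}^{N}$, which are drawn independently from an unknown probability distribution $\rho$ on $\mathcal{X} \times \mathcal{Y}$. Here we assume $\mathcal{X} \subseteq \mathbb{R}^{d}$ and $\mathcal{Y} \subseteq \mathbb{R}$. Usually, a loss function $l(f(x), y)$ is used to measure the performance of the hypothesis $f$. For regression, one can choose the squared error loss $l(f(x), y) = e^2$, where $e = y - f(x)$ is the prediction error. Then the goal of learning is to find a solution in the hypothesis space $\mathcal{H}$ that minimizes the expected cost function $\mathbf{E}[l(f(x), y)]$, where the expectation is taken over $\rho$. As the distribution $\rho$ is unknown, in general we use the empirical cost function:

$$J(f) = \frac{1}{N}\sum_{i=1}^{N} l(f(x_i), y_i), \tag{1}$$

which involves a summation over all samples. Sometimes, a regularization term is added to the above sum to prevent overfitting. Under the MSE criterion, the empirical cost function becomes

$$J_{\mathrm{MSE}}(f) = \frac{1}{N}\sum_{i=1}^{N} e_i^2, \tag{2}$$

where $e_i = y_i - f(x_i)$ is the prediction error for sample $(x_i, y_i)$. The computational complexity for evaluating the above cost and its gradient with respect to the errors $e_i$ ($i = 1, \ldots, N$) is $O(N)$.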

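In the context of information theoretic learning (ITL), one can adopt Renyi’s entropy of order $\alpha$ ($\alpha > 0$, $\alpha \neq 1$) as the cost function [1]:

$$H_\alpha(e) = \frac{1}{1-\alpha}\log \int p^{\alpha}(e)\,de, \tag{3}$$

where $p(e)$ denotes the error’s PDF. Under the MEE criterion, the optimal hypothesis can thus be found by minimizing the error entropy $H_\alpha(e)$. The argument of the logarithm in $H_\alpha(e)$, called the information potential (IP), is

$$V_\alpha(e) = \int p^{\alpha}(e)\,de = \mathbf{E}\left[p^{\alpha-1}(e)\right]. \tag{4}$$

Since the logarithm is a monotonically increasing function, minimizing Renyi’s entropy $H_\alpha(e)$ is equivalent to minimizing (for $\alpha < 1$) or maximizing (for $\alpha > 1$) the IP $V_\alpha(e)$. In ITL, for simplicity the parameter $\alpha$ is usually set to $\alpha = 2$. In the rest of the paper, without loss of generality we only consider the case of $\alpha = 2$. In this case, we have

$$V_2(e) = \int p^{2}(e)\,de = \mathbf{E}\left[p(e)\right]. \tag{5}$$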

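According to ITL [1], an empirical version of the quadratic IP can be expressed as

$$\hat{V}_2(e) = \frac{1}{N}\sum_{i=1}^{N}\hat{p}(e_i) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_\sigma(e_i - e_j), \tag{6}$$

where $\hat{p}(\cdot)$ is Parzen’s PDF estimator [18]:

$$\hat{p}(e) = \frac{1}{N}\sum_{j=1}^{N} G_\sigma(e - e_j), \tag{7}$$

with $G_\sigma(\cdot)$ being the Gaussian kernel with bandwidth $\sigma$:

$$G_\sigma(e) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{e^2}{2\sigma^2}\right). \tag{8}$$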

The PDF estimator $\hat{p}(\cdot)$ can be viewed as an adaptive loss function that varies with the error samples $\{e_i\}_{i=1}^{N}$. This is quite different from conventional loss functions, which are typically left unchanged after being set. For example, the loss function of MSE is always $e^2$. The adaptation of the loss function is potentially beneficial because the risk is matched to the error distribution. The superior performance of MEE has been shown theoretically as well as confirmed numerically [1]. However, the price we have to pay is a double summation over all samples, which is time consuming, especially for large-scale datasets. The computational complexity for evaluating the cost function (6) is $O(N^2)$. The goal of this work is to find an efficient way to simplify the computation of the empirical IP.
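To make the cost of the double summation concrete, the following is a minimal sketch of the empirical quadratic IP, assuming the standard Gaussian-kernel Parzen estimator described above. The function names (`gaussian_kernel`, `parzen_pdf`, `empirical_ip`) and the sample values are illustrative choices, not taken from the paper.

```python
import math


def gaussian_kernel(x, sigma):
    """Gaussian kernel with bandwidth sigma, evaluated at x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def parzen_pdf(point, errors, sigma):
    """Parzen estimate of the error PDF at `point` (one pass over all samples)."""
    return sum(gaussian_kernel(point - ej, sigma) for ej in errors) / len(errors)


def empirical_ip(errors, sigma):
    """Empirical quadratic IP: average of the Parzen estimate over the samples.

    Each of the N outer terms needs an inner sum over N samples,
    so the evaluation costs O(N^2) kernel evaluations.
    """
    return sum(parzen_pdf(ei, errors, sigma) for ei in errors) / len(errors)


# Hypothetical error samples, for illustration only.
errors = [0.1, -0.2, 0.05, 1.5, -0.3]
ip = empirical_ip(errors, sigma=1.0)
print(ip)
```

Because every pair of samples contributes one kernel evaluation, doubling the number of samples roughly quadruples the work, which is the bottleneck the quantization approach targets.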

## III Quantized MEE

Compared with conventional cost functions for machine learning, the MEE cost (or equivalently, the IP) involves an additional summation operation, namely the computation of the PDF estimator. The basic idea of our approach is thus to reduce the computational burden of the PDF estimation (i.e. the inner summation). We aim to estimate the error’s PDF from fewer samples. A natural way is to represent the error samples with a smaller data set by using a simple quantization method. Of course, the quantization will decrease the accuracy of the PDF estimation. However, the PDF estimator for an entropic cost function plays a very different role from the ones used in traditional density estimation. Indeed, for a machine learning cost function, what ultimately matters is the location of the extrema (maxima or minima) of the cost function, not the exact value of the cost. Our experimental results show that with quantization, MEE learning can achieve almost the same (or even better) performance as the original MEE learning.

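Let $Q[\cdot]$ denote a quantization operator (or quantizer) with a codebook $C$ containing $M$ (in general $M \ll N$) real valued code words, i.e. $C = \{c_1, c_2, \ldots, c_M\}$. Then $Q[\cdot]$ is a function that maps the error sample $e_j$ into one of the code words in $C$, i.e. $Q[e_j] \in C$. In this work, we assume that each error sample is quantized to the nearest code word. With the quantizer $Q[\cdot]$, the empirical IP in (6) can be simplified to

$$\begin{aligned}
\hat{V}_2^{Q}(e) &= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_\sigma\left(e_i - Q[e_j]\right) \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{m=1}^{M} M_m\, G_\sigma(e_i - c_m) \\
&= \frac{1}{N}\sum_{i=1}^{N}\hat{p}_Q(e_i),
\end{aligned} \tag{9}$$

where $M_m$ is the number of error samples that are quantized to the code word $c_m$, and $\hat{p}_Q(e) = \frac{1}{N}\sum_{m=1}^{M} M_m\, G_\sigma(e - c_m)$ is the PDF estimator based on the quantized error samples. Clearly, we have $\sum_{m=1}^{M} M_m = N$ and $\int \hat{p}_Q(e)\,de = 1$.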

Remark: The computational complexity of the quantized MEE (QMEE) cost is $O(MN)$, which is much lower than that of the original cost (6), especially for large-scale datasets ($M \ll N$).
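The reduction in cost can be sketched as follows, assuming a codebook is already available and each error is mapped to its nearest code word. The helper names (`quantize_nearest`, `qmee_cost`) and the example codebook are hypothetical and serve only to illustrate the structure of the computation.

```python
import math
from collections import Counter


def gaussian_kernel(x, sigma):
    """Gaussian kernel with bandwidth sigma, evaluated at x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def quantize_nearest(errors, codebook):
    """Return, for each error sample, the index of its nearest code word."""
    return [min(range(len(codebook)), key=lambda m: abs(e - codebook[m])) for e in errors]


def qmee_cost(errors, codebook, sigma):
    """Quantized IP: the inner sum runs over code words instead of samples.

    `counts[m]` is the number of samples quantized to code word m, so the
    double loop below needs N * M kernel evaluations instead of N * N.
    """
    n = len(errors)
    counts = Counter(quantize_nearest(errors, codebook))
    total = 0.0
    for e in errors:
        for m, count in counts.items():
            total += count * gaussian_kernel(e - codebook[m], sigma)
    return total / (n * n)


# Hypothetical error samples and a two-word codebook, for illustration only.
errors = [0.1, -0.2, 0.05, 1.5, -0.3, 1.4]
cost = qmee_cost(errors, codebook=[0.0, 1.5], sigma=1.0)
print(cost)
```

When the codebook is the full set of (distinct) error samples, every count equals one and the function reduces to the original double summation.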

Before designing the quantizer $Q[\cdot]$, we present below some basic properties of the QMEE cost.

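Property 1: When the codebook $C = \{e_1, e_2, \ldots, e_N\}$, we have $\hat{V}_2^{Q}(e) = \hat{V}_2(e)$.

Proof: Straightforward since in this case we have $Q[e_j] = e_j$, $j = 1, \ldots, N$.

Property 2: The QMEE cost is bounded, i.e. $\hat{V}_2^{Q}(e) \le \frac{1}{\sqrt{2\pi}\,\sigma}$, with equality if and only if $e_1 = e_2 = \cdots = e_N = c$, where $c$ is an element of $C$.

Proof: Since $G_\sigma(x) \le \frac{1}{\sqrt{2\pi}\,\sigma}$ with equality if and only if $x = 0$, we have

$$\hat{V}_2^{Q}(e) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_\sigma\left(e_i - Q[e_j]\right) \le \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} = \frac{1}{\sqrt{2\pi}\,\sigma}, \tag{10}$$

with equality if and only if $e_i = Q[e_j]$, $\forall i, j$, which means $e_1 = e_2 = \cdots = e_N = c$.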

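Property 3: It holds that $\hat{V}_2^{Q}(e) = \sum_{m=1}^{M} a_m\, \hat{p}(c_m)$, where $a_m = M_m / N$, satisfying $\sum_{m=1}^{M} a_m = 1$.

Proof: One can easily derive

$$\hat{V}_2^{Q}(e) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{m=1}^{M} M_m\, G_\sigma(e_i - c_m) = \sum_{m=1}^{M}\frac{M_m}{N}\left(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(c_m - e_i)\right) = \sum_{m=1}^{M} a_m\, \hat{p}(c_m). \tag{11}$$

Remark: By Property 3, the QMEE cost is equal to a weighted average of Parzen’s PDF estimator evaluated at the code words. Moreover, when there is only one code word in $C$, i.e. $C = \{c\}$, we have $\hat{V}_2^{Q}(e) = \hat{p}(c)$. In particular, when $c = 0$, we have $\hat{V}_2^{Q}(e) = \hat{p}(0) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(e_i) = \hat{V}(e)$, where $\hat{V}(e)$ denotes the empirical correntropy [19, 20, 21, 22, 23], which is a well-known local similarity measure in ITL. In this sense, the correntropy can be viewed as a special case of the QMEE cost. Specifically, the correntropy measures the local similarity about zero, while the QMEE cost measures the average similarity about every code word in $C$.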
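The weighted-average form and the correntropy special case can be checked numerically with a short sketch. It assumes nearest-code-word quantization and a Gaussian kernel; the names `qmee_weighted` and `correntropy` are illustrative only.

```python
import math


def gaussian_kernel(x, sigma):
    """Gaussian kernel with bandwidth sigma, evaluated at x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def parzen_pdf(point, errors, sigma):
    """Parzen estimate of the error PDF at `point`."""
    return sum(gaussian_kernel(point - e, sigma) for e in errors) / len(errors)


def qmee_weighted(errors, codebook, sigma):
    """QMEE cost written as a weighted average of the Parzen estimate at the code words."""
    n = len(errors)
    counts = [0] * len(codebook)
    for e in errors:
        nearest = min(range(len(codebook)), key=lambda m: abs(e - codebook[m]))
        counts[nearest] += 1
    # The weight of each code word is the fraction of samples quantized to it.
    return sum((counts[m] / n) * parzen_pdf(codebook[m], errors, sigma)
               for m in range(len(codebook)))


def correntropy(errors, sigma):
    """Empirical correntropy: average kernel value of the errors about zero."""
    return sum(gaussian_kernel(e, sigma) for e in errors) / len(errors)


# Hypothetical error samples, for illustration only.
errors = [0.2, -0.5, 1.0, 0.1]
single_word_cost = qmee_weighted(errors, [0.0], sigma=0.8)
print(single_word_cost, correntropy(errors, sigma=0.8))
```

With the single code word zero, all samples share one weight equal to one, so the two printed values coincide up to floating-point rounding.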

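Property 4: When $\sigma$ is large enough, we have $\hat{V}_2^{Q}(e) \approx \frac{1}{\sqrt{2\pi}\,\sigma}\left(1 - \frac{1}{2\sigma^2}\sum_{m=1}^{M}\frac{M_m}{N}\,\xi_m\right)$, where $\xi_m = \frac{1}{N}\sum_{i=1}^{N}(e_i - c_m)^2$ is the second order moment of error about the code word $c_m$.

Proof: As $\sigma \to \infty$, we have $\exp\left(-\frac{x^2}{2\sigma^2}\right) \approx 1 - \frac{x^2}{2\sigma^2}$. It follows easily that

$$\begin{aligned}
\hat{V}_2^{Q}(e) &= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{m=1}^{M} M_m\, G_\sigma(e_i - c_m) \\
&\approx \frac{1}{\sqrt{2\pi}\,\sigma}\,\frac{1}{N^2}\sum_{i=1}^{N}\sum_{m=1}^{M} M_m\left(1 - \frac{(e_i - c_m)^2}{2\sigma^2}\right) \\
&= \frac{1}{\sqrt{2\pi}\,\sigma}\left(1 - \frac{1}{2\sigma^2}\sum_{m=1}^{M}\frac{M_m}{N}\,\xi_m\right).
\end{aligned} \tag{12}$$

Remark: By Property 4, as $\sigma \to \infty$, the second order moments tend to dominate the QMEE cost $\hat{V}_2^{Q}(e)$. In this case, maximizing the QMEE cost is equivalent to minimizing a weighted average of the second order moments about the code words.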

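Property 5: If $|e_j - Q[e_j]| \le \varepsilon$ for all $j$, with $\varepsilon$ being a positive number, then $\left|\hat{V}_2(e) - \hat{V}_2^{Q}(e)\right| \le \frac{\varepsilon \exp(-1/2)}{\sqrt{2\pi}\,\sigma^2}$.

Proof: Because the Gaussian function is continuously differentiable over $\mathbb{R}$, according to the Mean Value Theorem, $\forall i, j$, there exists a point $\zeta_{ij}$ between $e_i - e_j$ and $e_i - Q[e_j]$ such that

$$G_\sigma(e_i - e_j) - G_\sigma\left(e_i - Q[e_j]\right) = G'_\sigma(\zeta_{ij})\left(Q[e_j] - e_j\right), \tag{13}$$

where $G'_\sigma(\cdot)$ denotes the derivative of $G_\sigma(\cdot)$ with respect to the argument. Then we have

$$\left|G_\sigma(e_i - e_j) - G_\sigma\left(e_i - Q[e_j]\right)\right| = \left|G'_\sigma(\zeta_{ij})\right|\left|e_j - Q[e_j]\right| \overset{(a)}{\le} \frac{\varepsilon \exp(-1/2)}{\sqrt{2\pi}\,\sigma^2}, \tag{14}$$

where (a) comes from $|e_j - Q[e_j]| \le \varepsilon$ and $\left|G'_\sigma(x)\right| \le \frac{\exp(-1/2)}{\sqrt{2\pi}\,\sigma^2}$ for any $x$. It follows that

$$\left|\hat{V}_2(e) - \hat{V}_2^{Q}(e)\right| \le \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left|G_\sigma(e_i - e_j) - G_\sigma\left(e_i - Q[e_j]\right)\right| \le \frac{\varepsilon \exp(-1/2)}{\sqrt{2\pi}\,\sigma^2}. \tag{15}$$

Remark: From Property 5, when $\varepsilon$ is very small or $\sigma$ is very large, the difference between the values of $\hat{V}_2(e)$ and $\hat{V}_2^{Q}(e)$ will be very small.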

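Property 6: For a linear regression model $f(x) = w^{T}x$, with $w \in \mathbb{R}^{d}$ being the weight vector to be estimated, the optimal solution under the QMEE criterion satisfies

$$w = R^{-1}p, \tag{16}$$

where $R = \sum_{i=1}^{N}\sum_{m=1}^{M} M_m\, G_\sigma(e_i - c_m)\, x_i x_i^{T}$ and $p = \sum_{i=1}^{N}\sum_{m=1}^{M} M_m\, G_\sigma(e_i - c_m)\,(y_i - c_m)\, x_i$.

Proof: The derivative of the QMEE cost with respect to $w$ is

$$\begin{aligned}
\frac{\partial \hat{V}_2^{Q}(e)}{\partial w} &= \frac{1}{N^2\sigma^2}\sum_{i=1}^{N}\sum_{m=1}^{M} M_m\, G_\sigma(e_i - c_m)\,(e_i - c_m)\, x_i \\
&= \frac{1}{N^2\sigma^2}\sum_{i=1}^{N}\sum_{m=1}^{M} M_m\, G_\sigma(e_i - c_m)\left(y_i - c_m - w^{T}x_i\right) x_i \\
&= \frac{1}{N^2\sigma^2}\left(p - Rw\right).
\end{aligned} \tag{17}$$

Setting $\frac{\partial \hat{V}_2^{Q}(e)}{\partial w} = 0$, we get $w = R^{-1}p$. This completes the proof.

Remark: It is worth noting that the solution $w = R^{-1}p$ is not a closed-form solution, as the matrix $R$ and the vector $p$ on the right side of the equation depend on the weight vector $w$ through the error samples (i.e. $e_i = y_i - w^{T}x_i$). In fact, the equation $w = R^{-1}p$ is a fixed-point equation.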

A key problem in QMEE is how to design a simple and efficient quantizer $Q[\cdot]$, including how to build the codebook and how to assign the code words to the data. In this work, we use a method proposed in our recent papers to quantize the error samples. In [24, 25], we proposed a simple online vector quantization (VQ) method to curb the network growth in kernel adaptive filters, such as kernel least mean square (KLMS) and kernel recursive least squares (KRLS). The main advantages of this quantization method are its simplicity and its online nature. The pseudocode of this online VQ algorithm is presented in Algorithm 1.

Remark: The online VQ method in Algorithm 1 creates the codebook sequentially from the samples, which is computationally very simple, with computational complexity that is linear in the number of samples.
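Since the pseudocode of Algorithm 1 is not reproduced here, the following is a minimal sketch of a sequential quantizer in the same spirit. It assumes the usual rule of this kind of online VQ: a sample joins its nearest code word if it lies within a threshold, and otherwise becomes a new code word. The function name `online_vq` and the threshold name `epsilon` are assumptions made for illustration.

```python
def online_vq(errors, epsilon):
    """Build a codebook sequentially from scalar error samples.

    Returns (codebook, assignments), where assignments[i] is the index of
    the code word that sample i was quantized to when it was processed.
    """
    codebook = []
    assignments = []
    for e in errors:
        if codebook:
            # Find the closest existing code word.
            nearest = min(range(len(codebook)), key=lambda m: abs(e - codebook[m]))
            if abs(e - codebook[nearest]) <= epsilon:
                # Close enough: reuse the code word, codebook unchanged.
                assignments.append(nearest)
                continue
        # Too far from every code word (or codebook empty): add a new one.
        codebook.append(e)
        assignments.append(len(codebook) - 1)
    return codebook, assignments


# Hypothetical error samples, for illustration only.
codebook, assignments = online_vq([0.0, 0.1, 1.0, 0.95, -0.05, 2.0], epsilon=0.2)
print(codebook)      # [0.0, 1.0, 2.0]
print(assignments)   # [0, 0, 1, 1, 0, 2]
```

Each sample is compared only against the current code words, so for a codebook that stays small the total work grows linearly with the number of samples. A larger threshold yields fewer code words and a coarser approximation.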

## IV Illustrative Examples

In the following, we present some illustrative examples to demonstrate the desirable performance of the proposed QMEE criterion.

### IV-A Linear Regression

In the first example, we use the QMEE criterion to perform the linear regression. According to Property 6, the optimal solution of the linear regression model can easily be solved by the following fixed-point iteration:

(18) |

in which the matrix and vector are

(19) |

where , , and is a diagonal matrix with diagonal elements , with . The detailed procedure of the linear regression under QMEE is summarized in Algorithm 2.
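A fixed-point procedure of this kind can be sketched for a two-dimensional model as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 2: the update is obtained by setting the gradient of the quantized cost to zero with the quantizer held fixed within each iteration, the errors are re-quantized at every iteration with a simple sequential quantizer, and the names (`qmee_fixed_point`, `online_vq`), the data and all parameter values are hypothetical.

```python
import math
import random


def gaussian_kernel(x, sigma):
    """Gaussian kernel with bandwidth sigma, evaluated at x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def online_vq(errors, epsilon):
    """Sequential quantizer: returns the code words and their sample counts."""
    codebook, counts = [], []
    for e in errors:
        if codebook:
            m = min(range(len(codebook)), key=lambda k: abs(e - codebook[k]))
            if abs(e - codebook[m]) <= epsilon:
                counts[m] += 1
                continue
        codebook.append(e)
        counts.append(1)
    return codebook, counts


def qmee_fixed_point(xs, ys, sigma, epsilon, iterations, w_init=(0.0, 0.0)):
    """Fixed-point iteration for a 2-D linear model under the quantized cost."""
    w = list(w_init)
    for _ in range(iterations):
        errors = [y - (w[0] * x[0] + w[1] * x[1]) for x, y in zip(xs, ys)]
        codebook, counts = online_vq(errors, epsilon)
        r00 = r01 = r11 = p0 = p1 = 0.0
        for x, y, e in zip(xs, ys, errors):
            lam = psi = 0.0
            for c, count in zip(codebook, counts):
                g = count * gaussian_kernel(e - c, sigma)
                lam += g          # per-sample weight
                psi += g * c      # weighted code-word offset
            r00 += lam * x[0] * x[0]
            r01 += lam * x[0] * x[1]
            r11 += lam * x[1] * x[1]
            t = lam * y - psi
            p0 += t * x[0]
            p1 += t * x[1]
        # Solve the symmetric 2x2 system R w = p by Cramer's rule.
        det = r00 * r11 - r01 * r01
        w = [(r11 * p0 - r01 * p1) / det, (r00 * p1 - r01 * p0) / det]
    return w


# Hypothetical data: a linear target with small noise and occasional large outliers.
rng = random.Random(0)
w_true = (2.0, 1.0)
xs = [(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)) for _ in range(200)]
ys = []
for x in xs:
    noise = rng.gauss(0.0, 100.0) if rng.random() < 0.1 else rng.gauss(0.0, 0.1)
    ys.append(w_true[0] * x[0] + w_true[1] * x[1] + noise)
w_hat = qmee_fixed_point(xs, ys, sigma=1.0, epsilon=0.3, iterations=20)
print(w_hat)
```

Each iteration costs on the order of the number of samples times the number of code words, and the true weights are a fixed point of the update when the data are noise free. Convergence from an arbitrary starting point is not guaranteed by this sketch and depends on the kernel bandwidth.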

We now consider a simple scenario where the data samples are generated by a two-dimensional linear system $y_i = w^{*T}x_i + v_i$, where $w^{*}$ is the target weight vector and $v_i$ is an additive noise. The input vectors $x_i$ are assumed to be uniformly distributed over . In addition, the noise $v_i$ is assumed to be generated by $v_i = (1 - b_i)A_i + b_i B_i$, where $b_i$ is a binary process with probability mass $\Pr\{b_i = 1\} = c$, $\Pr\{b_i = 0\} = 1 - c$, with $c$ being an occurrence probability. The processes $A_i$ and $B_i$ represent the background noises and the outliers respectively, which are mutually independent and both independent of $b_i$. In the simulations below, $c$ is set at 0.1 and $B_i$ is assumed to be a white Gaussian process with zero mean and variance 10000. For the distribution of $A_i$, we consider four cases: 1) symmetric Gaussian mixture density: , where $\mathcal{N}(\mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$; 2) asymmetric Gaussian mixture density: ; 3) binary distribution with probability mass ; 4) Gaussian distribution with zero mean and unit variance. The root mean squared error (RMSE) is employed to measure the performance, computed by

(20) |

where and denote the estimated and the target weight vectors respectively.
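The impulsive noise model described above can be simulated with a short sketch. It assumes the mixture form in which each sample is an outlier with the occurrence probability and a background sample otherwise, uses the occurrence probability 0.1 and the outlier variance 10000 stated in the text, and takes the Gaussian background of case 4. The function name `sample_noise` and its parameters are illustrative.

```python
import random


def sample_noise(n, c=0.1, outlier_std=100.0, seed=0):
    """Draw n noise samples from a background/outlier mixture.

    With probability c a sample is an outlier (zero-mean Gaussian with
    standard deviation outlier_std, i.e. variance 10000 by default);
    otherwise it is background noise (zero-mean, unit-variance Gaussian).
    """
    rng = random.Random(seed)
    noise = []
    for _ in range(n):
        is_outlier = rng.random() < c
        background = rng.gauss(0.0, 1.0)
        outlier = rng.gauss(0.0, outlier_std)
        noise.append(outlier if is_outlier else background)
    return noise


samples = sample_noise(1000)
large = sum(1 for v in samples if abs(v) > 10.0)
print(large, "of", len(samples), "samples exceed 10 in magnitude")
```

Drawing both candidates before selecting one keeps the background and outlier processes independent of the binary switch, as the model requires. Other background distributions can be substituted by replacing the single `background` line.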

We compare the performance of four learning criteria, namely MSE, MCC [19, 20, 21, 22, 23], MEE and QMEE. For the MSE criterion, there is a closed-form solution, so no iteration is needed. For the other three criteria, a fixed-point iteration is used to solve the model (see [22, 26] for the details of the fixed-point algorithms under MCC and MEE). The parameter settings of MCC, MEE and QMEE are given in Table I. The simulations are carried out with MATLAB 2014a running on an i5-4590 3.30 GHz CPU. The “mean ± deviation” results of the RMSE and the training time over 100 Monte Carlo runs are presented in Table II. In the simulations, the sample number is and the iteration number is . From Table II, we observe: i) MCC, MEE and QMEE significantly outperform the traditional MSE criterion, although they have no closed-form solution; ii) MEE and QMEE achieve much better performance than the MCC criterion, except in the case of Gaussian background noise, in which all three achieve almost the same performance; iii) QMEE achieves almost the same (or even better) performance as the original MEE criterion, but at a much lower computational cost. Fig. 1 shows the average training time of QMEE and MEE with an increasing number of samples.

Further, we show in Fig. 2 the contour plots of the performance surfaces (i.e. the cost surfaces over the parameter space), where the background noise distribution is assumed to be the symmetric Gaussian mixture. In Fig. 2, the target weight vector and the optimal solutions of the performance surfaces are denoted by the red crosses and blue circles, respectively. As one can see, the optimal solutions under MEE and QMEE are almost identical to the target value, while the solutions under MSE and MCC (especially the MSE solution) lie away from the target.

Table I: Parameter settings of MCC, MEE and QMEE

| | MCC | MEE | QMEE | |
|---|---|---|---|---|
| Case 1) | 10 | 1.1 | 1.5 | 0.3 |
| Case 2) | 15 | 1.1 | 1.5 | 0.3 |
| Case 3) | 8 | 0.7 | 1.0 | 0.3 |
| Case 4) | 2.8 | 0.6 | 4.0 | 0.1 |

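Table II: RMSE and training time (mean ± deviation) over 100 Monte Carlo runs

| | | MSE | MCC | MEE | QMEE |
|---|---|---|---|---|---|
| Case 1) | RMSE | | | | |
| | Training Time (sec) | | | | |
| Case 2) | RMSE | | | | |
| | Training Time (sec) | | | | |
| Case 3) | RMSE | | | | |
| | Training Time (sec) | | | | |
| Case 4) | RMSE | | | | |
| | Training Time (sec) | | | | |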

### IV-B Extreme Learning Machines

The second example concerns the training of the Extreme Learning Machine (ELM) [27, 28, 29, 30, 31], a single-hidden-layer feedforward neural network (SLFN) with random hidden nodes.

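Given $N$ distinct training samples $\{(x_i, t_i)\}_{i=1}^{N}$, with $x_i \in \mathbb{R}^{d}$ being the input vector and $t_i \in \mathbb{R}$ the target response, the output of a standard SLFN with $L$ hidden nodes is

$$y_i = \sum_{j=1}^{L} \beta_j\, f(w_j \cdot x_i + b_j), \tag{21}$$

where $f(\cdot)$ is an activation function, $w_j \in \mathbb{R}^{d}$ and $b_j \in \mathbb{R}$ ($j = 1, \ldots, L$) are the randomly generated parameters of the hidden nodes, and $\beta = [\beta_1, \ldots, \beta_L]^{T}$ represents the output weight vector. Since the hidden parameters are determined randomly, we only need to solve for the output weight vector $\beta$. To this end, we express (21) in vector form as

$$Y = H\beta, \tag{22}$$

where $Y = [y_1, \ldots, y_N]^{T}$, and

$$H = \begin{bmatrix} f(w_1 \cdot x_1 + b_1) & \cdots & f(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ f(w_1 \cdot x_N + b_1) & \cdots & f(w_L \cdot x_N + b_L) \end{bmatrix}. \tag{23}$$

Usually, the output weight vector $\beta$ can be solved by minimizing the following squared (MSE based) and regularized loss function:

$$J_{\mathrm{MSE}}(\beta) = \sum_{i=1}^{N} e_i^2 + \lambda\|\beta\|_2^2 = \|H\beta - T\|_2^2 + \lambda\|\beta\|_2^2, \tag{24}$$

where $e_i = t_i - y_i$ is the $i$th error between the target response and actual output, $\lambda \ge 0$ represents the regularization factor, and $T = [t_1, \ldots, t_N]^{T}$. Applying the pseudo inversion operation, one can obtain a unique solution under the loss function (24), that is

$$\beta = \left[H^{T}H + \lambda I\right]^{-1}H^{T}T. \tag{25}$$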
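The standard regularized ELM training step can be sketched in a few lines, assuming sigmoid hidden nodes, a ridge-regularized least-squares solve for the output weights, and a hidden-layer matrix with one row per sample. All names (`hidden_matrix`, `solve_linear`, `elm_ridge`), the toy target function and the parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def hidden_matrix(xs, weights, biases):
    """Hidden-layer output matrix: one row per sample, one column per hidden node."""
    return [[sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
             for w, b in zip(weights, biases)] for x in xs]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def elm_ridge(h, targets, lam):
    """Output weights from the regularized normal equations."""
    n_hidden = len(h[0])
    gram = [[sum(row[j] * row[k] for row in h) + (lam if j == k else 0.0)
             for k in range(n_hidden)] for j in range(n_hidden)]
    rhs = [sum(row[j] * t for row, t in zip(h, targets)) for j in range(n_hidden)]
    return solve_linear(gram, rhs)


# Hypothetical toy problem: fit sin(3x) on [-1, 1] with 10 random hidden nodes.
rng = random.Random(0)
xs = [[rng.uniform(-1.0, 1.0)] for _ in range(50)]
targets = [math.sin(3.0 * x[0]) for x in xs]
weights = [[rng.uniform(-1.0, 1.0)] for _ in range(10)]
biases = [rng.uniform(-1.0, 1.0) for _ in range(10)]
h = hidden_matrix(xs, weights, biases)
beta = elm_ridge(h, targets, lam=1e-3)
print(beta)
```

Only the output weights are trained; the hidden parameters stay at their random values. The regularization term keeps the linear system well conditioned even when the hidden outputs are strongly correlated.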

Here, we propose the following QMEE based loss function:

(26) | ||||

Setting , one can obtain

(27) |

where , , , , and is a diagonal matrix with diagonal elements .

Similar to the linear regression case, the equation (27) is a fixed-point equation since the matrix depends on the weight vector through . Thus, one can solve by using the following fixed-point iteration:

(28) |

where and denote, respectively, the matrix and vector evaluated at . The learning procedure of the ELM under QMEE is described in Algorithm 3. This algorithm is called the ELM-QMEE in this paper.
