One of the most important problems in machine learning is how to approximate a target random variable $Y$ knowing another random variable $X$. This is a central problem in supervised learning, where we design a model $f$ that receives the random variable $X$ and outputs $\hat{Y} = f(X)$, which should approximate $Y$ in some sense. This requires the definition of a loss function (or a similarity measure) to compare $\hat{Y}$ with $Y$. The minimum mean square error (MMSE) criterion is widely used, where the loss function is $E[e^2]$, with $e = Y - \hat{Y}$ being the error variable and $E[\cdot]$ the expectation operator. The MMSE is generally computationally simple and mathematically tractable, but its learning performance may degrade seriously when non-Gaussian noise is present in the variables.
To improve the learning performance in non-Gaussian noise, a variety of non-MMSE criteria have been proposed in the literature [1, 2, 3, 4, 5, 6, 7, 8]. Particularly in recent years, the maximum correntropy criterion (MCC) has found many successful applications in signal processing and machine learning, being very useful when the signals are contaminated by heavy-tailed impulsive noise [9, 10, 11, 12, 13, 14, 15]. Under the MCC, an optimal model can be obtained by maximizing the correntropy between the target variable $Y$ and the output $\hat{Y}$:

$$f^* = \arg\max_{f \in \mathcal{F}} V(Y, \hat{Y}), \tag{1}$$

where $f^*$ is the optimal model, $\mathcal{F}$ stands for the model's hypothesis space, and $V(Y, \hat{Y}) = E\left[G_\sigma(Y - \hat{Y})\right]$ denotes the correntropy between $Y$ and $\hat{Y}$, with $G_\sigma(\cdot)$ being the Gaussian kernel function:

$$G_\sigma(e) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{e^2}{2\sigma^2}\right), \tag{2}$$
where $\sigma > 0$ is the kernel bandwidth. Since the Gaussian kernel function is a local function of the error variable $e = Y - \hat{Y}$, the correntropy can be used as an outlier-robust error measure in signal processing and machine learning. However, the center of the Gaussian kernel in correntropy is always located at zero, which may not be a good choice in many practical situations. In particular, when the error distribution is non-zero-mean, the original correntropy may perform poorly, because the zero-mean Gaussian function then usually cannot match the error distribution well. The goal of the present paper is thus to extend the correntropy to the case where the center can be located anywhere, which can potentially improve the learning performance significantly but is still not fully appreciated in the community.
The rest of the paper is organized as follows. In Section II, we define the correntropy with variable center and propose the maximum correntropy criterion with variable center (MCC-VC). In Section III, we propose an efficient approach to optimize the kernel width and center location in MCC-VC. Simulation results of regression with linear-in-parameters (LIP) models are then presented in Section IV. Finally, the conclusion is given in Section V.
II Maximum Correntropy Criterion with Variable Center
In this work, we define the correntropy with variable center between $Y$ and $\hat{Y}$ as follows:

$$V_{\sigma,c}(Y, \hat{Y}) = E\left[G_\sigma(e - c)\right] = E\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(e - c)^2}{2\sigma^2}\right)\right], \tag{3}$$

where $c$ is the center location. The above definition reduces to the original correntropy when $c = 0$.
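As a minimal numerical sketch (our own illustration, not from the paper; the function names and sample values are assumptions), the correntropy with variable center can be estimated from samples as the average kernel value of the centered errors:

```python
import numpy as np

def gaussian_kernel(x, sigma):
    """Normalized Gaussian kernel G_sigma(x)."""
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def correntropy_vc(y, y_hat, sigma, c):
    """Sample estimate of the correntropy with variable center:
    V_{sigma,c}(Y, Y_hat) ~ mean of G_sigma(e - c), with e = y - y_hat."""
    e = np.asarray(y) - np.asarray(y_hat)
    return np.mean(gaussian_kernel(e - c, sigma))

# With c = 0 this reduces to the ordinary correntropy estimate.
rng = np.random.default_rng(0)
e = rng.normal(loc=3.0, scale=1.0, size=10_000)  # non-zero-mean error samples
v0 = correntropy_vc(e, np.zeros_like(e), sigma=1.0, c=0.0)  # zero center
v3 = correntropy_vc(e, np.zeros_like(e), sigma=1.0, c=3.0)  # matched center
```

Centering the kernel at the error mean ($c = 3$) yields a much larger correntropy value than the zero center, illustrating why a variable center helps with non-zero-mean errors.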
Similar to the original correntropy $V(Y, \hat{Y})$, the correntropy with center $c$ also involves all the even moments of the error $e$ about the center $c$, that is,

$$V_{\sigma,c}(Y, \hat{Y}) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{n=0}^{\infty} \frac{(-1)^n}{2^n\, n!\, \sigma^{2n}}\, E\left[(e - c)^{2n}\right]. \tag{4}$$

As $\sigma$ increases, the higher-order moments about the center decay faster, hence the second-order moment tends to dominate the value. In particular, when $\sigma \to \infty$ and $c = E[e]$, maximizing the correntropy with center $c$ becomes equivalent to minimizing the error's variance.
In addition, when the Gaussian kernel shrinks to a Dirac delta function ($\sigma \to 0^+$), the correntropy with center $c$ approaches the value of $\int p_{Y\hat{Y}}(u, u - c)\,du$, where $p_{Y\hat{Y}}(\cdot, \cdot)$ is the joint probability density function (PDF) of $(Y, \hat{Y})$. This can easily be proved as follows:

$$\lim_{\sigma \to 0^+} V_{\sigma,c}(Y, \hat{Y}) = \iint \delta(y - \hat{y} - c)\, p_{Y\hat{Y}}(y, \hat{y})\,dy\,d\hat{y} = \int p_{Y\hat{Y}}(u, u - c)\,du, \tag{5}$$

where $\delta(\cdot)$ denotes the Dirac delta function. In this case, we also have

$$\int p_{Y\hat{Y}}(u, u - c)\,du = p_e(c). \tag{6}$$

Therefore, when $\sigma \to 0^+$, the correntropy with center $c$ will also approach the value of the error's PDF $p_e(\cdot)$ evaluated at $c$.
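This limiting behavior can be checked numerically. The following sketch (our own illustration, assuming a standard Gaussian error distribution) compares the sample estimate of $V_{\sigma,c}$ with the true error PDF value $p_e(c)$ as $\sigma$ shrinks:

```python
import numpy as np

def corr_vc_estimate(e, sigma, c):
    """Sample estimate of V_{sigma,c} with the normalized Gaussian kernel."""
    g = np.exp(-(e - c)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return g.mean()

rng = np.random.default_rng(1)
e = rng.normal(0.0, 1.0, size=200_000)                # error ~ N(0, 1)
c = 0.5
p_e_at_c = np.exp(-c**2 / 2) / np.sqrt(2 * np.pi)     # true error PDF at c

est_wide = corr_vc_estimate(e, sigma=1.0, c=c)        # far from the limit
est_narrow = corr_vc_estimate(e, sigma=0.05, c=c)     # close to the limit
# est_narrow approaches p_e(c) as sigma -> 0+
```

The estimate with the small bandwidth lands close to $p_e(c)$, while the wide-bandwidth estimate (a smoothed version of the PDF) is noticeably further away.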
The optimal model under the maximum correntropy criterion with variable center (MCC-VC) is defined by

$$f^* = \arg\max_{f \in \mathcal{F}} V_{\sigma,c}(Y, \hat{Y}). \tag{7}$$
To demonstrate how to obtain the optimal solution with finite training samples (by optimizing an empirical risk function), we consider the following linear-in-parameters (LIP) model:

$$\hat{y}_i = \mathbf{h}_i \mathbf{w} = \sum_{j=1}^{m} h_j(\mathbf{x}_i)\, w_j, \tag{8}$$

where $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$ are the input-output samples, $\mathbf{h}_i = [h_1(\mathbf{x}_i), \ldots, h_m(\mathbf{x}_i)]$ is the $i$-th nonlinearly mapped input vector (a row vector), with $h_j(\cdot)$ being the $j$-th nonlinear mapping function, and $\mathbf{w} = [w_1, \ldots, w_m]^T$ is the output weight vector that needs to be learned. Given target samples $\{y_i\}_{i=1}^{N}$, the output weight vector can be trained by minimizing the following regularized MMSE cost:
$$J_{\mathrm{MMSE}}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} e_i^2 + \lambda \|\mathbf{w}\|_2^2, \tag{9}$$

where $e_i = y_i - \mathbf{h}_i \mathbf{w}$, $\mathbf{y} = [y_1, \ldots, y_N]^T$, and $\lambda \geq 0$ is the regularization parameter. In this case, the optimal solution can easily be obtained as

$$\mathbf{w}^* = \left(\mathbf{H}^T \mathbf{H} + \lambda N \mathbf{I}\right)^{-1} \mathbf{H}^T \mathbf{y}, \tag{10}$$

where $\mathbf{H}$ is an $N \times m$ dimensional matrix with $(i,j)$-th element $h_j(\mathbf{x}_i)$. Similarly, one can solve for $\mathbf{w}$ by maximizing the following regularized MCC-VC cost:
$$J_{\mathrm{MCC\text{-}VC}}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(e_i - c) - \lambda \|\mathbf{w}\|_2^2, \tag{11}$$

where $e_i = y_i - \mathbf{h}_i \mathbf{w}$ is the $i$-th error sample. Setting $\frac{\partial}{\partial \mathbf{w}} J_{\mathrm{MCC\text{-}VC}}(\mathbf{w}) = \mathbf{0}$, one can derive

$$\mathbf{w} = \left(\mathbf{H}^T \boldsymbol{\Lambda} \mathbf{H} + \gamma \mathbf{I}\right)^{-1} \mathbf{H}^T \boldsymbol{\Lambda} \left(\mathbf{y} - c\,\mathbf{1}_N\right), \tag{12}$$

where $\gamma = 2\lambda N \sigma^2$, $\mathbf{1}_N = [1, \ldots, 1]^T$, and $\boldsymbol{\Lambda}$ is a diagonal matrix with diagonal elements $\Lambda_{ii} = G_\sigma(e_i - c)$.

The solution (12) is a fixed-point equation, since the diagonal matrix $\boldsymbol{\Lambda}$ on the right-hand side depends on the weight vector $\mathbf{w}$ via $e_i = y_i - \mathbf{h}_i \mathbf{w}$. Therefore, the optimal solution under MCC-VC can be obtained by the following fixed-point iteration:

$$\mathbf{w}^{(t+1)} = \left(\mathbf{H}^T \boldsymbol{\Lambda}^{(t)} \mathbf{H} + \gamma \mathbf{I}\right)^{-1} \mathbf{H}^T \boldsymbol{\Lambda}^{(t)} \left(\mathbf{y} - c\,\mathbf{1}_N\right), \tag{13}$$

where $\mathbf{w}^{(t)}$ is the estimated weight vector at the $t$-th iteration and $\boldsymbol{\Lambda}^{(t)}$ is computed with the errors $e_i^{(t)} = y_i - \mathbf{h}_i \mathbf{w}^{(t)}$.
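The fixed-point iteration can be sketched as follows (a minimal illustration with our own toy data and parameter choices; $\gamma$ plays the role of the lumped regularization constant in (12), and the kernel's normalization constant is dropped since it only rescales $\gamma$):

```python
import numpy as np

def lip_mcc_vc(H, y, sigma, c, gamma=1e-3, n_iter=100, tol=1e-8):
    """Fixed-point iteration for the LIP model under MCC-VC:
    w <- (H^T Lam H + gamma I)^{-1} H^T Lam (y - c 1),
    with Lam_ii proportional to G_sigma(e_i - c), e = y - H w."""
    N, m = H.shape
    # initialize with the regularized MMSE (ridge) solution
    w = np.linalg.solve(H.T @ H + gamma * np.eye(m), H.T @ y)
    for _ in range(n_iter):
        e = y - H @ w
        lam = np.exp(-(e - c)**2 / (2 * sigma**2))  # diagonal of Lam
        HtL = H.T * lam                              # H^T Lam, no N x N matrix formed
        w_new = np.linalg.solve(HtL @ H + gamma * np.eye(m), HtL @ (y - c))
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# toy usage: linear data contaminated by a few large outliers
rng = np.random.default_rng(2)
H = rng.uniform(-2, 2, size=(400, 2))
w_true = np.array([1.0, -0.5])
y = H @ w_true + 0.1 * rng.normal(size=400)
y[:20] += 50.0                                       # impulsive outliers
w_mcc = lip_mcc_vc(H, y, sigma=1.0, c=0.0)
```

Because the outlier residuals receive near-zero weights $G_\sigma(e_i - c)$, the iteration recovers the true weights despite the contamination, whereas the ridge initialization alone is visibly biased.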
III Optimization of the Free Parameters in MCC-VC
There are two free parameters in MCC-VC, namely the kernel width $\sigma$ and the center location $c$, whose values have a significant influence on the learning performance. In this section, we propose an efficient approach to optimize the two parameters. First, we divide the correntropy with center $c$ into three terms:

$$V_{\sigma,c}(Y, \hat{Y}) = \int G_\sigma(\varepsilon - c)\, p_e(\varepsilon)\,d\varepsilon = \frac{1}{2}\int G_\sigma^2(\varepsilon - c)\,d\varepsilon + \frac{1}{2}\int p_e^2(\varepsilon)\,d\varepsilon - \frac{1}{2}\int \left(G_\sigma(\varepsilon - c) - p_e(\varepsilon)\right)^2 d\varepsilon. \tag{14}$$

Since the first term $\frac{1}{2}\int G_\sigma^2(\varepsilon - c)\,d\varepsilon = \frac{1}{4\sqrt{\pi}\,\sigma}$ is independent of the model $f$, we have

$$\arg\max_{f \in \mathcal{F}} V_{\sigma,c}(Y, \hat{Y}) = \arg\max_{f \in \mathcal{F}} \psi(f, \sigma, c), \tag{15}$$

where $\psi(f, \sigma, c) = \frac{1}{2}\int p_e^2(\varepsilon)\,d\varepsilon - \frac{1}{2}\int \left(G_\sigma(\varepsilon - c) - p_e(\varepsilon)\right)^2 d\varepsilon$. Then we propose the following optimization:

$$\left(f^*, \sigma^*, c^*\right) = \arg\max_{f \in \mathcal{F},\, \sigma \in \Omega_\sigma,\, c \in \Omega_c} \psi(f, \sigma, c), \tag{16}$$

where $\Omega_\sigma$ and $\Omega_c$ denote the admissible sets of the parameters $\sigma$ and $c$. Thus, the model $f$, the kernel width $\sigma$ and the center location $c$ are jointly optimized to maximize the function $\psi$. To simplify the optimization, we adopt an alternating optimization approach:
i) When the model $f$ is fixed (hence the error's distribution is fixed), the term $\frac{1}{2}\int p_e^2(\varepsilon)\,d\varepsilon$ is independent of $\sigma$ and $c$; in this case the two free parameters can simply be optimized by

$$\left(\sigma^*, c^*\right) = \arg\min_{\sigma \in \Omega_\sigma,\, c \in \Omega_c} \int \left(G_\sigma(\varepsilon - c) - p_e(\varepsilon)\right)^2 d\varepsilon. \tag{17}$$
ii) After the parameters $(\sigma^*, c^*)$ have been determined, the model $f$ can then be optimized by maximizing the objective in (16) (equivalently, in (14)) with $\sigma = \sigma^*$ and $c = c^*$.
The above procedure can be repeated until convergence.
From (17), one can see that the parameters $\sigma$ and $c$ are optimized such that the Gaussian kernel function matches the error's PDF as closely as possible, which is in principle consistent with our intuition. The idea of PDF matching has been explored with great success in the literature of information theoretic learning (ITL) [1, 16, 17, 18]. Given error samples $\{e_i\}_{i=1}^{N}$, we have $\int G_\sigma(\varepsilon - c)\, p_e(\varepsilon)\,d\varepsilon \approx \frac{1}{N}\sum_{i=1}^{N} G_\sigma(e_i - c)$ and $\int G_\sigma^2(\varepsilon - c)\,d\varepsilon = \frac{1}{2\sqrt{\pi}\,\sigma}$. It follows that

$$\left(\sigma^*, c^*\right) \approx \arg\min_{\sigma \in \Omega_\sigma,\, c \in \Omega_c} \left[\frac{1}{2\sqrt{\pi}\,\sigma} - \frac{2}{N}\sum_{i=1}^{N} G_\sigma(e_i - c)\right]. \tag{18}$$
Remark: There are several approaches to solve the optimization problem in (18). For example, one can use a gradient-based method to search for the solution. In many practical situations, one can simply search for the optimal solution over a given finite set. To further simplify the computation, one can just set the parameter $c$ to the mean or median value of the error samples, and only optimize the kernel width $\sigma$.
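The finite-set search mentioned in the remark can be sketched as follows (our own illustration of solving (18); the candidate sets and the error distribution are hypothetical choices):

```python
import numpy as np

def select_sigma_c(e, sigmas, centers):
    """Pick (sigma, c) minimizing the empirical objective of (18):
    1/(2 sqrt(pi) sigma) - (2/N) sum_i G_sigma(e_i - c),
    i.e. matching the Gaussian kernel to the error PDF."""
    e = np.asarray(e)
    best, best_val = None, np.inf
    for sigma in sigmas:
        for c in centers:
            g = np.exp(-(e - c)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
            val = 1.0 / (2 * np.sqrt(np.pi) * sigma) - 2.0 * g.mean()
            if val < best_val:
                best, best_val = (sigma, c), val
    return best

rng = np.random.default_rng(3)
e = rng.normal(3.0, 1.0, size=5_000)       # non-zero-mean error samples
sigmas = np.arange(0.2, 4.01, 0.2)         # hypothetical candidate set for sigma
centers = np.arange(-1.0, 5.01, 0.1)       # hypothetical candidate set for c
sigma_star, c_star = select_sigma_c(e, sigmas, centers)
# expect c_star near the error mean (3) and sigma_star near 1
```

For this Gaussian error the search recovers a center near the error mean and a width near the error standard deviation, exactly the PDF-matching behavior (17) prescribes.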
Based on the above parameter optimization strategy, a robust regression algorithm with LIP models under MCC-VC can be obtained, which is referred to as LIP-MCC-VC and is described in Algorithm 1.
IV Simulation Results
In this section, we present simulation results of regression with LIP models to demonstrate the performance of the proposed method. We consider two LIP models: one is the linear regression model, and the other is the extreme learning machine (ELM) [19, 20, 21, 22], a kind of single-hidden-layer feedforward neural network (SLFN) in which the input weights and biases of the hidden layer are randomly generated, and only the weights of the output layer need to be trained.
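The random feature map of an ELM can be sketched as follows (a generic illustration; the activation function, weight ranges and hidden-layer size are our own assumptions, not the paper's exact configuration):

```python
import numpy as np

def elm_hidden_layer(X, n_hidden, rng):
    """Random single-hidden-layer feature map of an ELM:
    input weights A and biases b are drawn at random and then frozen;
    only the output weights on top of H are trained."""
    n_features = X.shape[1]
    A = rng.uniform(-1, 1, size=(n_features, n_hidden))  # random input weights
    b = rng.uniform(-1, 1, size=n_hidden)                # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))               # sigmoid activations
    return H, (A, b)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
H, params = elm_hidden_layer(X, n_hidden=50, rng=rng)
# H (100 x 50) now plays the role of the LIP regressor matrix; the output
# weights can be trained by the ridge solution (10) or by the MCC-VC iteration.
```

Since the hidden layer is fixed after the random draw, the ELM is exactly a LIP model with $h_j(\mathbf{x})$ given by the $j$-th random sigmoid unit.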
IV-A Linear Regression
Consider a simple example in which the data are generated by a two-dimensional linear system $y_i = \mathbf{x}_i \mathbf{w}^* + v_i$, where $\mathbf{w}^*$ is the target weight vector and $v$ is an additive noise. The input samples $\{\mathbf{x}_i\}$ are uniformly distributed over a bounded region. The noise $v$ comprises two mutually independent components, namely the inner noise $A$ and the outlier noise $B$. Specifically, $v$ is given by $v = (1 - b)A + bB$, where $b$ is a binary variable with probability mass $P(b = 1) = p$, $P(b = 0) = 1 - p$, which is assumed to be independent of both $A$ and $B$. In this example, $p$ is set to a small value, and the outlier $B$ is drawn from a zero-mean Gaussian distribution with a large variance. For the inner noise $A$, we consider four zero-mean or non-zero-mean distributions: 1) $\mathcal{N}(0, 2)$, where $\mathcal{N}(\mu, \sigma_v^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma_v^2$; 2) $\mathcal{N}(3, 1)$; 3) the Laplace distribution with zero mean and unit variance; 4) the Chi-square distribution with three degrees of freedom. The root mean squared error (RMSE), computed by $\mathrm{RMSE} = \sqrt{\tfrac{1}{2}\|\hat{\mathbf{w}} - \mathbf{w}^*\|_2^2}$, is employed to measure the performance, where $\hat{\mathbf{w}}$ and $\mathbf{w}^*$ denote the estimated and the target weight vectors, respectively.
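The data-generation scheme of this example can be sketched as follows (all numerical constants here, such as the target weights, input range, outlier probability and outlier variance, are hypothetical stand-ins, since the original values are not reproduced in this text; the inner noise follows case 2)):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400
w_star = np.array([0.5, 1.0])             # hypothetical target weights
X = rng.uniform(-2, 2, size=(N, 2))       # hypothetical input region

# mixture noise v = (1 - b) * A + b * B, with b binary
p_outlier = 0.1                           # hypothetical P(b = 1)
b = rng.random(N) < p_outlier
A = rng.normal(3.0, 1.0, size=N)          # inner noise, case 2): N(3, 1)
B = rng.normal(0.0, 10.0, size=N)         # hypothetical large-variance outlier noise
v = np.where(b, B, A)

y = X @ w_star + v                        # observed targets
```

With this construction, most samples carry the non-zero-mean inner noise while a small fraction are hit by large outliers, which is exactly the regime where a non-zero kernel center pays off.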
We compare the performance of three optimization criteria, namely MMSE, MCC and MCC-VC. For MMSE, there is a closed-form solution, so no iteration is needed. For MCC and MCC-VC, a fixed-point iteration is used to solve the model (see [23] for the fixed-point algorithm under MCC). The mean and standard deviation of the RMSE and the training time averaged over 100 Monte Carlo runs are presented in Table I. In the simulation, the sample number is $N = 400$, the iteration number is 100, and the initial weight vector is set to the zero vector. For each criterion, the parameters are selected by trial and error to achieve the best results, except that the kernel bandwidth and center location of MCC-VC are chosen by solving the optimization (18). The finite kernel bandwidth set $\Omega_\sigma$ is equally spaced with step size 0.2, and the center set $\Omega_c$ is equally spaced with step size 0.1. From Table I, we observe: i) MCC and MCC-VC can significantly outperform MMSE, although neither has a closed-form solution; ii) MCC-VC achieves better performance than MCC, especially for non-zero-mean noises, because the center of the cost function can be set adaptively to a proper value according to the error PDF; iii) MCC-VC saves much time by solving (18) to find the best values of the parameters $\sigma$ and $c$, without performing trial and error to optimize the two parameters. Under the noise of case 2), the error distribution and the corresponding Gaussian kernel function optimized by (18) at the first and second fixed-point iterations of MCC-VC are shown in Fig. 1. As expected, the Gaussian kernel function matches the error distribution very well.
IV-B ELM Based Regression for Benchmark Datasets
In the second example, we use seven benchmark data sets from the UCI machine learning repository [24] to confirm the superior regression performance of the MCC-VC based ELM (ELM-MCC-VC) compared with the MCC based ELM (ELM-RCC) [22] and the regularized ELM (RELM). The descriptions of the data sets are given in Table II. In the simulation, the training and testing samples of each data set are randomly chosen and the data values are normalized into [0, 1]. The parameters of each algorithm are selected through fivefold cross-validation, except that the kernel bandwidth and center location of MCC-VC are chosen by solving (18). We set the kernel center $c$ of MCC-VC to the median value of the error samples, and only optimize the kernel width $\sigma$ by solving (18). The finite kernel bandwidth set $\Omega_\sigma$ is equally spaced with step size 0.1. The training and testing RMSEs over 100 runs are presented in Table III. Evidently, ELM-MCC-VC outperforms ELM-RCC and RELM on all the data sets. Especially on the Yacht data set, MCC-VC significantly outperforms the other methods.
V Conclusion

The kernel function in correntropy is in general a Gaussian function whose center is always located at zero. In this paper, we extended the correntropy to the case where the center can be located at any position. On this basis, the maximum correntropy criterion with variable center (MCC-VC) was proposed. In addition, we proposed an efficient method to optimize the kernel width and center location in MCC-VC. Regression results with linear-in-parameters (LIP) models have shown the desirable performance of the new method.
-  J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.
-  S.-C. Pei and C.-C. Tseng, “Least mean p-power error criterion for adaptive fir filter,” IEEE Journal on Selected Areas in Communications, vol. 12, no. 9, pp. 1540–1547, 1994.
-  D. Erdogmus and J. C. Principe, “Generalized information potential criterion for adaptive system training,” IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1035–1044, 2002.
-  W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: Properties and applications in non-gaussian signal processing,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5286–5298, 2007.
-  B. Chen, P. Zhu, and J. C. Principe, “Survival information potential: a new criterion for adaptive system training,” IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1184–1194, 2012.
-  M. O. Sayin, N. D. Vanli, and S. S. Kozat, “A novel family of adaptive filtering algorithms based on the logarithmic cost.” IEEE Trans. Signal Processing, vol. 62, no. 17, pp. 4411–4424, 2014.
-  B. Chen, L. Xing, H. Zhao, N. Zheng, and J. C. Príncipe, “Generalized correntropy for robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 64, no. 13, pp. 3376–3387, 2016.
-  B. Chen, L. Xing, B. Xu, H. Zhao, N. Zheng, and J. C. Principe, “Kernel risk-sensitive loss: Definition, properties and application to robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2888–2901, 2017.
-  R. He, B.-G. Hu, W.-S. Zheng, and X.-W. Kong, “Robust principal component analysis based on maximum correntropy criterion,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1485–1494, 2011.
-  A. Singh, R. Pokharel, and J. Principe, “The c-loss function for pattern classification,” Pattern Recognition, vol. 47, no. 1, pp. 441–453, 2014.
-  E. Hasanbelliu, L. S. Giraldo, and J. C. Principe, “Information theoretic shape matching,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 12, pp. 2436–2451, 2014.
-  W. Ma, H. Qu, G. Gui, L. Xu, J. Zhao, and B. Chen, “Maximum correntropy criterion based sparse adaptive filtering algorithms for robust channel estimation under non-gaussian environments,” Journal of the Franklin Institute, vol. 352, no. 7, pp. 2708–2727, 2015.
-  Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens, “Learning with the maximum correntropy criterion induced losses for regression.” Journal of Machine Learning Research, vol. 16, pp. 993–1034, 2015.
-  B. Chen, X. Liu, H. Zhao, and J. C. Principe, “Maximum correntropy kalman filter,” Automatica, vol. 76, pp. 70–77, 2017.
-  B. Chen, X. Wang, N. Lu, S. Wang, J. Cao, and J. Qin, “Mixture correntropy for robust learning,” Pattern Recognition, vol. 79, pp. 318–327, 2018.
-  D. Erdogmus and J. C. Principe, “An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems,” IEEE Transactions on Signal Processing, vol. 50, no. 7, pp. 1780–1786, 2002.
-  I. Santamaría, C. Pantaleón, L. Vielva, and J. C. Principe, “Adaptive blind equalization through quadratic pdf matching,” in Proceedings of the European Signal Processing Conference, vol. 2, 2002, pp. 289–292.
-  A. R. Heravi and G. A. Hodtani, “A new information theoretic relation between minimum error entropy and maximum correntropy,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 921–925, 2018.
-  G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489–501, 2006.
-  Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “Op-elm: optimally pruned extreme learning machine,” IEEE transactions on neural networks, vol. 21, no. 1, pp. 158–162, 2010.
-  G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513–529, 2012.
-  H.-J. Xing and X.-M. Wang, “Training extreme learning machine via regularized correntropy criterion,” Neural Computing and Applications, vol. 23, no. 7-8, pp. 1977–1986, 2013.
-  B. Chen, J. Wang, H. Zhao, N. Zheng, and J. C. Principe, “Convergence of a fixed-point algorithm under maximum correntropy criterion,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1723–1727, 2015.
-  A. Frank, “UCI machine learning repository,” http://archive.ics.uci.edu/ml, 2010.