I Introduction
One of the most important problems in machine learning is how to approximate a target random variable $Y$ knowing another random variable $X$. This is a central problem in supervised learning, where we design a model $M$ that receives $X$ and outputs $\hat{Y} = M(X)$, which should approximate $Y$ in some sense. This naturally requires the definition of a loss function (or a similarity measure) to compare $\hat{Y}$ with $Y$. The minimum mean square error (MMSE) criterion is widely used, where the loss function is $E[e^2]$, with $e = Y - \hat{Y}$ being the error variable and $E[\cdot]$ the expectation operator. The MMSE is generally computationally simple and mathematically tractable, but its learning performance may degrade seriously when non-Gaussian noises are present in the variables [1].

To improve the learning performance under non-Gaussian noises, a variety of non-MMSE criteria have been proposed in the literature [1, 2, 3, 4, 5, 6, 7, 8]. In particular, in recent years the maximum correntropy criterion (MCC) has found many successful applications in signal processing and machine learning, and is very useful when the signals are contaminated by heavy-tailed impulsive noises [9, 10, 11, 12, 13, 14, 15]. Under the MCC, an optimal model can be obtained by maximizing the correntropy between the target variable $Y$ and the output $\hat{Y}$ [4]:
M^* = \arg\max_{M \in \mathcal{M}} V(Y, \hat{Y})    (1)
where $M^*$ is the optimal model, $\mathcal{M}$ stands for the model's hypothesis space, and $V(Y, \hat{Y}) = E[G_\sigma(e)]$ denotes the correntropy between $Y$ and $\hat{Y}$, with $G_\sigma(\cdot)$ being the Gaussian kernel function:
G_\sigma(e) = \exp\left(-\frac{e^2}{2\sigma^2}\right)    (2)
where $\sigma > 0$ is the kernel bandwidth. Since the Gaussian kernel function is a local function of the error variable $e$, the correntropy can be used as an outlier-robust error measure in signal processing and machine learning [1]. However, the center of the Gaussian kernel in correntropy is always located at zero, which may not be a good choice in many practical situations. In particular, when the error distribution is non-zero-mean, the original correntropy may perform poorly, because in this case the zero-mean Gaussian function usually cannot match the error distribution well. The goal of the present paper is thus to extend the correntropy to the case where the center can be located anywhere, which can potentially improve the learning performance significantly but is still not fully appreciated in the community.

The rest of the paper is organized as follows. In Section II, we define the correntropy with variable center and propose the maximum correntropy criterion with variable center (MCC-VC). In Section III, we propose an efficient approach to optimize the kernel width and center location in MCC-VC. Simulation results of regression with linear-in-parameters (LIP) models are then presented in Section IV. Finally, the conclusion is given in Section V.
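As a concrete illustration of (1)-(2), the sample correntropy between a target and a model output can be estimated by averaging the Gaussian kernel over the error samples. The following Python sketch (function names and test data are illustrative, not from the paper) shows the estimator and its bounded response to outliers:

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel G_sigma(e) = exp(-e^2 / (2 sigma^2)), cf. (2)."""
    return np.exp(-np.square(e) / (2.0 * sigma**2))

def correntropy(y, y_hat, sigma):
    """Sample estimate of V(Y, Y_hat) = E[G_sigma(Y - Y_hat)], cf. (1)."""
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.mean(gaussian_kernel(e, sigma)))
```

Because each kernel value lies in $(0, 1]$, a single gross outlier can pull the estimate down by at most $1/N$, which is the source of the robustness discussed above.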
II Maximum Correntropy Criterion with Variable Center
In this work, we define the correntropy with variable center between $Y$ and $\hat{Y}$ as follows:
V_c(Y, \hat{Y}) = E\left[G_\sigma(e - c)\right] = E\left[\exp\left(-\frac{(e - c)^2}{2\sigma^2}\right)\right]    (3)
where $c$ is the center location. The above definition reduces to the original correntropy when $c = 0$.
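Following (3), a sample estimator of the variable-center correntropy only shifts the errors by $c$ before applying the kernel; with $c = 0$ it coincides with the original correntropy. A minimal sketch (illustrative names):

```python
import numpy as np

def correntropy_vc(e, sigma, c):
    """Sample estimate of V_c = E[G_sigma(e - c)], cf. (3)."""
    e = np.asarray(e, dtype=float)
    return float(np.mean(np.exp(-np.square(e - c) / (2.0 * sigma**2))))
```

Shifting all errors by a constant and moving the center by the same amount leaves the value unchanged, which is what allows MCC-VC to cope with non-zero-mean error distributions.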
Similar to the original correntropy [4], the correntropy with center $c$ also involves all the even moments of the error $e$ about the center $c$, that is

V_c(Y, \hat{Y}) = \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n n! \, \sigma^{2n}} E\left[(e - c)^{2n}\right]    (4)
As $\sigma$ increases, the higher-order moments about the center decay faster, hence the second-order moment tends to dominate the value. In particular, when $\sigma$ is large enough and $c = E[e]$, maximizing the correntropy with center $c$ will be equivalent to minimizing the error's variance.
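The dominance of the second-order moment for large $\sigma$ can be checked numerically: truncating (4) after the $n = 1$ term gives $V_c \approx 1 - E[(e - c)^2]/(2\sigma^2)$, so $2\sigma^2(1 - V_c)$ should approach the error variance when $c = E[e]$. A quick check under an assumed non-zero-mean Gaussian error (the distribution and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.normal(loc=1.5, scale=1.0, size=200_000)  # non-zero-mean error samples
c = e.mean()                                      # center the kernel at c = E[e]
sigma = 50.0                                      # large kernel width

v_c = np.mean(np.exp(-np.square(e - c) / (2.0 * sigma**2)))
second_moment_approx = 2.0 * sigma**2 * (1.0 - v_c)   # from the n = 1 term of (4)
variance = np.mean(np.square(e - c))
```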
In addition, when the Gaussian kernel shrinks to zero ($\sigma \to 0^+$), the correntropy with center $c$ (normalized by $\sqrt{2\pi}\sigma$) approaches the value of $\int p_{Y\hat{Y}}(y, y - c)\, dy$, where $p_{Y\hat{Y}}$ is the joint probability density function (PDF) of $(Y, \hat{Y})$. This can easily be proved as follows:

\lim_{\sigma \to 0^+} \frac{1}{\sqrt{2\pi}\sigma} V_c(Y, \hat{Y}) = \lim_{\sigma \to 0^+} \iint \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y - \hat{y} - c)^2}{2\sigma^2}\right) p_{Y\hat{Y}}(y, \hat{y})\, dy\, d\hat{y} = \iint \delta(y - \hat{y} - c)\, p_{Y\hat{Y}}(y, \hat{y})\, dy\, d\hat{y} = \int p_{Y\hat{Y}}(y, y - c)\, dy    (5)
where $\delta(\cdot)$ denotes the Dirac delta function. In this case, we also have

\int p_{Y\hat{Y}}(y, y - c)\, dy = p_e(c)    (6)
Therefore, when $\sigma \to 0^+$, the normalized correntropy with center $c$ will also approach the value of $p_e(c)$, where $p_e$ denotes the error's PDF.
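The small-bandwidth limit (5)-(6) says that the correntropy with center $c$, normalized by $\sqrt{2\pi}\sigma$, is exactly a Parzen (kernel density) estimate of $p_e$ at $c$. This can be checked against an error with a known density (a standard Gaussian here; the constants are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
e = rng.normal(size=200_000)   # error samples with known PDF p_e = N(0, 1)
c, sigma = 0.5, 0.05           # small bandwidth approximates sigma -> 0+

v_c = np.mean(np.exp(-np.square(e - c) / (2.0 * sigma**2)))
pdf_estimate = v_c / (math.sqrt(2.0 * math.pi) * sigma)      # cf. (5)-(6)
pdf_true = math.exp(-c**2 / 2.0) / math.sqrt(2.0 * math.pi)  # p_e(c) for N(0, 1)
```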
The optimal model under the maximum correntropy criterion with variable center (MCC-VC) is defined by

M^* = \arg\max_{M \in \mathcal{M}} V_c(Y, \hat{Y}) = \arg\max_{M \in \mathcal{M}} E\left[G_\sigma(e - c)\right]    (7)
To demonstrate how to obtain the optimal solution with finite training samples (by optimizing an empirical risk function), we consider the following linear-in-parameters (LIP) model:

\hat{y}_i = \sum_{j=1}^{m} \omega_j \varphi_j(\mathbf{x}_i) = \boldsymbol{\varphi}(\mathbf{x}_i)\, \mathbf{W}    (8)

where $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ are the input-output samples, $\boldsymbol{\varphi}(\mathbf{x}_i) = [\varphi_1(\mathbf{x}_i), \ldots, \varphi_m(\mathbf{x}_i)]$ is the $i$-th nonlinearly mapped input vector (a row vector), with $\varphi_j(\cdot)$ being the $j$-th nonlinear mapping function, and $\mathbf{W} = [\omega_1, \ldots, \omega_m]^T$ is the output weight vector that needs to be learned. Given target samples $\{y_i\}_{i=1}^{N}$, the output weight vector can be trained by minimizing the following regularized MMSE cost:

J_{\mathrm{MMSE}}(\mathbf{W}) = \sum_{i=1}^{N} \left(y_i - \boldsymbol{\varphi}(\mathbf{x}_i)\mathbf{W}\right)^2 + \lambda \|\mathbf{W}\|^2    (9)
where $\mathbf{Y} = [y_1, \ldots, y_N]^T$ is the target vector, $\|\cdot\|$ denotes the Euclidean norm, and $\lambda \ge 0$ is the regularization parameter. In this case, the optimal solution can easily be obtained as

\mathbf{W}^* = \left(\mathbf{\Phi}^T \mathbf{\Phi} + \lambda \mathbf{I}\right)^{-1} \mathbf{\Phi}^T \mathbf{Y}    (10)

where $\mathbf{\Phi}$ is an $N \times m$ dimensional matrix with $\mathbf{\Phi}_{ij} = \varphi_j(\mathbf{x}_i)$. Similarly, one can solve $\mathbf{W}$ by minimizing the following regularized MCC-VC cost:
J_{\mathrm{MCC\text{-}VC}}(\mathbf{W}) = -\sum_{i=1}^{N} G_\sigma(e_i - c) + \lambda \|\mathbf{W}\|^2    (11)

where $e_i = y_i - \boldsymbol{\varphi}(\mathbf{x}_i)\mathbf{W}$ is the $i$-th error sample. Setting $\partial J_{\mathrm{MCC\text{-}VC}}(\mathbf{W}) / \partial \mathbf{W} = 0$, one can derive
\mathbf{W} = \left(\mathbf{\Phi}^T \mathbf{\Lambda} \mathbf{\Phi} + 2\lambda\sigma^2 \mathbf{I}\right)^{-1} \mathbf{\Phi}^T \mathbf{\Lambda} \mathbf{Y}_c    (12)

where $\mathbf{Y}_c = \mathbf{Y} - c\mathbf{1}$, $\mathbf{1}$ denotes the $N$-dimensional all-ones vector, and $\mathbf{\Lambda}$ is a diagonal matrix with diagonal elements $\mathbf{\Lambda}_{ii} = G_\sigma(e_i - c)$.
The solution (12) is a fixed-point equation, since the diagonal matrix $\mathbf{\Lambda}$ on the right-hand side depends on the weight vector $\mathbf{W}$ via the errors $e_i$. Therefore, the optimal solution under MCC-VC can be solved by using the following fixed-point iteration:

\mathbf{W}^{(t+1)} = \left(\mathbf{\Phi}^T \mathbf{\Lambda}^{(t)} \mathbf{\Phi} + 2\lambda\sigma^2 \mathbf{I}\right)^{-1} \mathbf{\Phi}^T \mathbf{\Lambda}^{(t)} \mathbf{Y}_c    (13)
where $\mathbf{W}^{(t)}$ is the estimated weight vector at the $t$-th iteration and $\mathbf{\Lambda}^{(t)}$ is computed with the errors $e_i^{(t)} = y_i - \boldsymbol{\varphi}(\mathbf{x}_i)\mathbf{W}^{(t)}$.

III Optimization of the Free Parameters in MCC-VC
There are two free parameters in MCC-VC, namely the kernel width $\sigma$ and the center location $c$, whose values have a significant influence on the learning performance. In this section, we propose an efficient approach to optimize the two parameters. First, we divide the correntropy with center $c$ into three terms:

V_c(Y, \hat{Y}) = \int G_\sigma(x - c)\, p_e(x)\, dx = \frac{1}{2} \int G_\sigma^2(x - c)\, dx + \frac{1}{2} \int p_e^2(x)\, dx - \frac{1}{2} \int \left(G_\sigma(x - c) - p_e(x)\right)^2 dx    (14)
Since the first term $\frac{1}{2} \int G_\sigma^2(x - c)\, dx = \frac{\sqrt{\pi}\sigma}{2}$ is independent of the model $M$, we have

\arg\max_{M \in \mathcal{M}} V_c(Y, \hat{Y}) = \arg\max_{M \in \mathcal{M}} J(M, \sigma, c)    (15)

where $J(M, \sigma, c) = \frac{1}{2} \int p_e^2(x)\, dx - \frac{1}{2} \int \left(G_\sigma(x - c) - p_e(x)\right)^2 dx$. Then we propose the following optimization:
(M^*, \sigma^*, c^*) = \arg\max_{M \in \mathcal{M},\, \sigma \in \Omega_\sigma,\, c \in \Omega_c} J(M, \sigma, c)    (16)

where $\Omega_\sigma$ and $\Omega_c$ denote the admissible sets of the parameters $\sigma$ and $c$. Thus, the model $M$, the kernel width $\sigma$ and the center location $c$ are jointly optimized to maximize the function $J$. To simplify the optimization, we adopt an alternating optimization approach:
i) When the model $M$ is fixed (hence the error's distribution is fixed), the term $\frac{1}{2} \int p_e^2(x)\, dx$ is independent of $\sigma$ and $c$; in this case the two free parameters can simply be optimized by

(\sigma^*, c^*) = \arg\min_{\sigma \in \Omega_\sigma,\, c \in \Omega_c} \int \left(G_\sigma(x - c) - p_e(x)\right)^2 dx    (17)
ii) After the parameters $(\sigma^*, c^*)$ have been determined, the model $M$ can then be optimized by maximizing the function in (16) or (14) with $\sigma = \sigma^*$ and $c = c^*$.
The above procedure can be repeated until convergence.
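The decomposition (14) is just the pointwise identity $ab = \tfrac{1}{2}a^2 + \tfrac{1}{2}b^2 - \tfrac{1}{2}(a - b)^2$ applied to $a = G_\sigma(x - c)$ and $b = p_e(x)$, integrated over $x$. It can be verified numerically for an assumed error PDF (a standard Gaussian here), together with the fact that $\int G_\sigma^2(x - c)\, dx = \sqrt{\pi}\sigma$:

```python
import numpy as np

sigma, c = 1.0, 0.3
x = np.linspace(-10.0, 10.0, 200_001)                      # fine grid; tails negligible
dx = x[1] - x[0]
p_e = np.exp(-np.square(x) / 2.0) / np.sqrt(2.0 * np.pi)   # assumed p_e = N(0, 1)
G = np.exp(-np.square(x - c) / (2.0 * sigma**2))

def integrate(f):
    return float(np.sum(f) * dx)   # rectangle rule; very accurate for decaying integrands

lhs = integrate(G * p_e)           # V_c = int G_sigma(x - c) p_e(x) dx
rhs = (0.5 * integrate(G**2) + 0.5 * integrate(p_e**2)
       - 0.5 * integrate((G - p_e)**2))
```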
From (17), one can see that the parameters $\sigma$ and $c$ are optimized such that the Gaussian kernel function matches the error's PDF as closely as possible, which is in principle consistent with our intuition. The idea of PDF matching has been explored with great success in the literature of information theoretic learning (ITL) [1, 16, 17, 18]. Given error samples $\{e_i\}_{i=1}^{N}$, we have $\int G_\sigma(x - c)\, p_e(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} G_\sigma(e_i - c)$. Since $\int G_\sigma^2(x - c)\, dx = \sqrt{\pi}\sigma$ and $\int p_e^2(x)\, dx$ does not depend on $\sigma$ or $c$, it follows from (17) that

(\sigma^*, c^*) = \arg\min_{\sigma \in \Omega_\sigma,\, c \in \Omega_c} \left\{ \sqrt{\pi}\sigma - \frac{2}{N} \sum_{i=1}^{N} G_\sigma(e_i - c) \right\}    (18)
Remark: There are several approaches to solve the optimization problem in (18). For example, one can use a gradient-based method to search for the solution. In many practical situations, one can simply find the optimal solution over a given finite set. To further simplify the computation, one can just set the center $c$ to the mean or median value of the error samples, and only optimize the kernel width $\sigma$.
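The finite-set search mentioned in the remark can be implemented directly. The sketch below (grids and test data are illustrative) evaluates the objective of (18) over the Cartesian product of the candidate sets and returns the minimizer:

```python
import itertools
import numpy as np

def optimize_sigma_c(errors, sigmas, centers):
    """Grid search for (sigma, c) minimizing
    sqrt(pi) * sigma - (2/N) * sum_i G_sigma(e_i - c), cf. (18)."""
    e = np.asarray(errors, dtype=float)
    best = None
    for sigma, c in itertools.product(sigmas, centers):
        cost = (np.sqrt(np.pi) * sigma
                - 2.0 * np.mean(np.exp(-np.square(e - c) / (2.0 * sigma**2))))
        if best is None or cost < best[0]:
            best = (cost, sigma, c)
    return best[1], best[2]
```

For errors drawn from a Gaussian with mean 3, the selected center lands near 3, consistent with the PDF-matching interpretation of (17).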
Based on the above parameter optimization strategy, a robust regression algorithm with LIP models under MCC-VC can be obtained, which is referred to as LIP-MCCVC and is described in Algorithm 1.
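Algorithm 1 is not reproduced here, but based on the description above it alternates the parameter search (18) with the fixed-point update (13). The following Python sketch is one possible reading of that procedure for a linear model; the regularization constant, grids, iteration count and the synthetic test data are all illustrative assumptions, not the paper's settings:

```python
import numpy as np

def lip_mccvc(Phi, y, lam=1e-3, sigmas=np.arange(0.2, 4.01, 0.2),
              centers=np.arange(-5.0, 5.01, 0.1), n_iter=20):
    """Sketch of LIP-MCCVC: alternate the grid search (18) with the
    fixed-point update (13), starting from the regularized MMSE solution (10)."""
    N, m = Phi.shape
    W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)  # init, cf. (10)
    for _ in range(n_iter):
        e = y - Phi @ W
        # grid search for (sigma, c) minimizing the objective of (18)
        costs = [(np.sqrt(np.pi) * s
                  - 2.0 * np.mean(np.exp(-np.square(e - c) / (2.0 * s**2))), s, c)
                 for s in sigmas for c in centers]
        _, sigma, c = min(costs)
        # fixed-point update (13) with Lambda_ii = G_sigma(e_i - c)
        L = np.exp(-np.square(e - c) / (2.0 * sigma**2))
        A = Phi.T @ (L[:, None] * Phi) + 2.0 * lam * sigma**2 * np.eye(m)
        W = np.linalg.solve(A, Phi.T @ (L * (y - c)))
    return W
```

With data corrupted by non-zero-mean impulsive noise, the learned weights stay close to the true ones, since the noise mean is absorbed by the center $c$ and the outliers are downweighted by the kernel.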
IV Simulation Results
In this section, we present simulation results of regression with LIP models to demonstrate the performance of the proposed method. We consider two LIP models: one is the linear regression model, and the other is the extreme learning machine (ELM) [19, 20, 21, 22], a kind of single-hidden-layer feedforward neural network (SLFN) in which the input weights and biases of the hidden layer are randomly generated and only the weights of the output layer need to be trained.
IV-A Linear Regression
Consider a simple example in which the data are generated by a two-dimensional linear system $y_i = \mathbf{x}_i \mathbf{W}^\circ + v_i$, where $\mathbf{W}^\circ$ is the target weight vector and $v_i$ is an additive noise. The input samples $\mathbf{x}_i$ are uniformly distributed over a bounded interval. The noise $v$ comprises two mutually independent components, namely the inner noise $A$ and the outlier noise $B$. Specifically, $v$ is given by $v = (1 - b)A + bB$, where $b$ is a binary variable with probability mass $\Pr\{b = 1\} = p_b$ and $\Pr\{b = 0\} = 1 - p_b$, which is assumed to be independent of both $A$ and $B$. In this example, $p_b$ is set to a fixed small value, and the outlier $B$ is drawn from a zero-mean Gaussian distribution with a large variance. For the inner noise $A$, we consider four zero-mean or non-zero-mean distributions: 1) $\mathcal{N}(0, 2)$, where $\mathcal{N}(\mu, v)$ denotes the Gaussian PDF with mean $\mu$ and variance $v$; 2) $\mathcal{N}(3, 1)$; 3) the Laplace distribution with zero mean and variance 1; 4) the Chi-square distribution with three degrees of freedom.
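The mixture noise described above can be generated as follows; since the exact occurrence probability and outlier variance are not restated here, the values below ($p_b = 0.1$, outlier standard deviation 10) are illustrative placeholders:

```python
import numpy as np

def mixture_noise(n, inner_sampler, p_b=0.1, outlier_std=10.0, rng=None):
    """v = (1 - b) * A + b * B with b ~ Bernoulli(p_b), b independent of A and B.
    p_b and outlier_std are illustrative, not the paper's values."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.random(n) < p_b                  # binary occurrence variable
    A = inner_sampler(n, rng)                # inner noise, e.g. one of cases 1)-4)
    B = rng.normal(0.0, outlier_std, n)      # zero-mean, large-variance outliers
    return np.where(b, B, A)
```

For example, the case 2) inner noise corresponds to `inner_sampler = lambda n, rng: rng.normal(3.0, 1.0, n)`.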
The root mean squared error (RMSE) between the estimated weight vector $\hat{\mathbf{W}}$ and the target weight vector $\mathbf{W}^\circ$ is employed to measure the performance.

We compare the performance of three optimization criteria, namely MMSE, MCC and MCC-VC. For MMSE there is a closed-form solution, so no iteration is needed. For MCC and MCC-VC, a fixed-point iteration is used to solve the model (see [23] for the fixed-point algorithm under MCC). The mean and deviation of the RMSE and the training time averaged over 100 Monte Carlo runs are presented in Table I. In the simulation, the sample number is $N = 400$, the iteration number is 100, and the same initial weight vector is used for all algorithms. For each criterion, the parameters are selected by trial and error to achieve the best results, except that the kernel bandwidth and center location of MCC-VC are chosen by solving the optimization (18); the finite kernel bandwidth set $\Omega_\sigma$ is equally spaced with step size 0.2, and the center set $\Omega_c$ is equally spaced with step size 0.1.

From Table I, we observe: i) MCC and MCC-VC can significantly outperform MMSE, although neither has a closed-form solution; ii) MCC-VC can achieve better performance than MCC, especially for non-zero-mean noises, because the center of the cost function can be set adaptively to a proper value according to the error PDF; iii) MCC-VC can save much time by solving (18) to find the best values of the parameters $\sigma$ and $c$, without performing trial and error to optimize the two parameters. Under the noise of case 2), the error distribution and the corresponding Gaussian kernel function optimized by (18) at the first and second fixed-point iterations of MCC-VC are shown in Fig. 1. As expected, the Gaussian kernel function matches the error distribution very well.
TABLE I: Mean and deviation of the RMSE and training time (sec) of MMSE, MCC and MCC-VC under noise cases 1)-4).
TABLE III: Training and testing RMSEs of RELM, ELM-RCC and ELM-MCCVC on the benchmark datasets.
TABLE II: Specifications of the benchmark datasets.

Datasets   Features   Training   Testing
Servo      5          83         83
Airfoil    5          751        751
Concrete   9          515        515
Housing    14         253        253
Yacht      6          154        154
Winered    12         799        799
Slump      10         52         51
IV-B ELM Based Regression for Benchmark Datasets
In the second example, we utilize seven benchmark datasets from the UCI machine learning repository [24] to confirm the superior regression performance of the MCC-VC based ELM (ELM-MCCVC) compared with the MCC based ELM (ELM-RCC) [22] and the regularized ELM (RELM) [21]. The descriptions of the datasets are given in Table II. In the simulation, the training and testing samples of each dataset are randomly chosen and the data values are normalized into [0, 1]. The parameters of each algorithm are selected through five-fold cross-validation, except that the kernel bandwidth and center location of MCC-VC are chosen by solving (18). We set the kernel center $c$ of MCC-VC to the median value of the error samples and only optimize the kernel width $\sigma$ by solving (18); the finite kernel bandwidth set $\Omega_\sigma$ is equally spaced with step size 0.1. The training and testing RMSEs over 100 runs are presented in Table III. Evidently, ELM-MCCVC outperforms ELM-RCC and RELM on all the datasets. Especially on the Yacht dataset, MCC-VC significantly outperforms the other methods.
V Conclusion
The kernel function in correntropy is in general a Gaussian function whose center is always located at zero. In this paper, we extended the correntropy to the case where the center can be located at any position. On this basis, the maximum correntropy criterion with variable center (MCC-VC) was proposed. In addition, we proposed an efficient method to optimize the kernel width and center location in MCC-VC. Regression results with linear-in-parameters (LIP) models have shown the desirable performance of the new method.
References
 [1] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.
 [2] S.-C. Pei and C.-C. Tseng, “Least mean p-power error criterion for adaptive FIR filter,” IEEE Journal on Selected Areas in Communications, vol. 12, no. 9, pp. 1540–1547, 1994.
 [3] D. Erdogmus and J. C. Principe, “Generalized information potential criterion for adaptive system training,” IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1035–1044, 2002.
 [4] W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: Properties and applications in non-Gaussian signal processing,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5286–5298, 2007.
 [5] B. Chen, P. Zhu, and J. C. Principe, “Survival information potential: a new criterion for adaptive system training,” IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1184–1194, 2012.
 [6] M. O. Sayin, N. D. Vanli, and S. S. Kozat, “A novel family of adaptive filtering algorithms based on the logarithmic cost.” IEEE Trans. Signal Processing, vol. 62, no. 17, pp. 4411–4424, 2014.
 [7] B. Chen, L. Xing, H. Zhao, N. Zheng, and J. C. Principe, “Generalized correntropy for robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 64, no. 13, pp. 3376–3387, 2016.
 [8] B. Chen, L. Xing, B. Xu, H. Zhao, N. Zheng, and J. C. Principe, “Kernel risksensitive loss: Definition, properties and application to robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2888–2901, 2017.

 [9] R. He, B.-G. Hu, W.-S. Zheng, and X.-W. Kong, “Robust principal component analysis based on maximum correntropy criterion,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1485–1494, 2011.
 [10] A. Singh, R. Pokharel, and J. Principe, “The C-loss function for pattern classification,” Pattern Recognition, vol. 47, no. 1, pp. 441–453, 2014.
 [11] E. Hasanbelliu, L. S. Giraldo, and J. C. Principe, “Information theoretic shape matching,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 12, pp. 2436–2451, 2014.
 [12] W. Ma, H. Qu, G. Gui, L. Xu, J. Zhao, and B. Chen, “Maximum correntropy criterion based sparse adaptive filtering algorithms for robust channel estimation under non-Gaussian environments,” Journal of the Franklin Institute, vol. 352, no. 7, pp. 2708–2727, 2015.
 [13] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens, “Learning with the maximum correntropy criterion induced losses for regression,” Journal of Machine Learning Research, vol. 16, pp. 993–1034, 2015.

 [14] B. Chen, X. Liu, H. Zhao, and J. C. Principe, “Maximum correntropy Kalman filter,” Automatica, vol. 76, pp. 70–77, 2017.
 [15] B. Chen, X. Wang, N. Lu, S. Wang, J. Cao, and J. Qin, “Mixture correntropy for robust learning,” Pattern Recognition, vol. 79, pp. 318–327, 2018.
 [16] D. Erdogmus and J. C. Principe, “An errorentropy minimization algorithm for supervised training of nonlinear adaptive systems,” IEEE Transactions on Signal Processing, vol. 50, no. 7, pp. 1780–1786, 2002.
 [17] I. Santamaría, C. Pantaleón, L. Vielva, and J. C. Principe, “Adaptive blind equalization through quadratic PDF matching,” in Proceedings of the European Signal Processing Conference, vol. 2, 2002, pp. 289–292.
 [18] A. R. Heravi and G. A. Hodtani, “A new information theoretic relation between minimum error entropy and maximum correntropy,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 921–925, 2018.
 [19] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1–3, pp. 489–501, 2006.
 [20] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “OP-ELM: optimally pruned extreme learning machine,” IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 158–162, 2010.
 [21] G.B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513–529, 2012.
 [22] H.-J. Xing and X.-M. Wang, “Training extreme learning machine via regularized correntropy criterion,” Neural Computing and Applications, vol. 23, no. 7–8, pp. 1977–1986, 2013.
 [23] B. Chen, J. Wang, H. Zhao, N. Zheng, and J. C. Principe, “Convergence of a fixed-point algorithm under maximum correntropy criterion,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1723–1727, 2015.
 [24] A. Frank, “UCI machine learning repository,” http://archive.ics.uci.edu/ml, 2010.