# Kernel Risk-Sensitive Loss: Definition, Properties and Application to Robust Adaptive Filtering

Nonlinear similarity measures defined in kernel space, such as correntropy, can extract higher-order statistics of data and offer potentially significant performance improvement over their linear counterparts, especially in non-Gaussian signal processing and machine learning. In this work, we propose a new similarity measure in kernel space, called the kernel risk-sensitive loss (KRSL), and provide some of its important properties. We apply the KRSL to adaptive filtering, investigate its robustness, and then develop the minimum kernel risk-sensitive loss (MKRSL) algorithm and analyze its mean square convergence performance. Compared with correntropy, the KRSL can offer a more efficient performance surface, thereby enabling a gradient-based method to achieve faster convergence and higher accuracy while still maintaining robustness to outliers. The theoretical analysis results and the superior performance of the new algorithm are confirmed by simulation.


## I Introduction

Statistical similarity measures play significant roles in the signal processing and machine learning communities. In particular, the cost function for a learning machine is usually a certain similarity measure between the learned model and the data generating system. Due to their simplicity, mathematical tractability, and optimality under Gaussian and linear assumptions, second-order similarity measures defined in terms of inner products, such as correlation and mean square error (MSE), have been widely used to quantify how similar two random variables (or random processes) are. However, second-order statistics cannot fully quantify similarity if the random variables are non-Gaussian distributed [1, 2]. To address the problem of modeling with non-Gaussian data (which are very common in many real-world applications), similarity measures must go beyond second-order statistics. Higher-order statistics, such as kurtosis, skewness, and higher-order moments or cumulants, are applicable for dealing with non-Gaussian data. Besides, as an alternative to the MSE, the risk-sensitive loss [3, 4, 5, 6, 7], which quantifies similarity by emphasizing the larger errors in an exponential form ("risk-sensitive" means "average-of-exponential"), has been proven to be robust to the case where the actual probabilistic model deviates from the assumed Gaussian model [6]. The problem of existence and uniqueness of the risk-sensitive estimate has been studied in [7]. Nevertheless, the risk-sensitive loss is not robust to impulsive noises (or outliers) when a gradient-based learning algorithm is utilized, because its performance surface (as a function over the parameter space) can be super-convex and the gradient may grow exponentially fast as the error increases across iterations.

Recent advances in information theoretic learning (ITL) suggest that similarity measures can also be defined in a reproducing kernel Hilbert space (RKHS) [1]. The ITL costs (i.e., entropy and divergence) for adaptive system training can be estimated directly from data via a Parzen kernel estimator; they can capture higher-order statistics of the data and achieve better solutions than MSE, particularly in non-Gaussian and nonlinear signal processing [8, 9, 10, 11, 12, 13, 14]. As a local similarity measure in ITL, the correntropy is defined as a generalized correlation in kernel space, directly related to the probability of how similar two random variables are in a neighborhood of the joint space [15, 16]. The kernel function in correntropy is usually a Gaussian kernel, but it can be extended to generalized Gaussian functions [17]. Since correntropy measures the similarity within an observation window (controlled by the kernel bandwidth), it provides an effective way to eliminate the adverse effects of outliers [1]. So far, correntropy has been successfully applied to develop various robust learning or adaptive algorithms [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]. However, the performance surface of the correntropic loss (C-Loss) is highly non-convex: it can be very sharp around the optimal solution while extremely flat in regions far from it, which may result in poor convergence performance in adaptation [29].

In this paper, we define a new similarity measure in kernel space, called the kernel risk-sensitive loss (KRSL), which inherits the original form of the risk-sensitive loss but is defined in RKHS by means of the kernel trick. The performance surface of the KRSL is bounded but can be more "convex" than that of the C-Loss, leading to faster convergence and higher solution accuracy while maintaining robustness to outliers. Besides the kernel bandwidth, an extra free parameter, namely the risk-sensitive parameter, is introduced to control the shape of the performance surface. Further, we apply the KRSL to develop a new robust adaptive filtering algorithm, referred to in this work as the minimum kernel risk-sensitive loss (MKRSL) algorithm, which can outperform existing methods, including the recently proposed generalized maximum correntropy criterion (GMCC) algorithm [17]. A brief version of this work was presented at the 2015 IEEE International Conference on Digital Signal Processing [30].

The rest of the paper is organized as follows. In Section II, after briefly reviewing some background on similarity measures in kernel space, we define the KRSL and present some of its important properties. In Section III, we apply the proposed KRSL to adaptive filtering, analyze its robustness, develop the MKRSL algorithm, and present its mean square convergence performance. In Section IV, we carry out Monte Carlo simulations to confirm the theoretical results and demonstrate the superior performance of the new algorithm. Finally, we give the conclusion in Section V.

## II Kernel Risk-Sensitive Loss

### II-A Background on Similarity Measures in Kernel Space

Kernel methods have been widely applied in machine learning and signal processing [31, 32, 33]. Let \kappa(\cdot,\cdot) be a continuous, symmetric, and positive definite Mercer kernel defined over the input domain. Then the nonlinear mapping \Phi(x)=\kappa(x,\cdot) transforms the data from the input space into a functional Hilbert space, namely a reproducing kernel Hilbert space (RKHS) H, where \langle\cdot,\cdot\rangle_H denotes the inner product in H. In particular, we have

 \kappa(x,y)=\langle\Phi(x),\Phi(y)\rangle_H (1)

There is a close relationship between kernel methods and information theoretic learning (ITL) [1]. Most ITL cost functions, when estimated using the Parzen kernel estimator, can be expressed in terms of inner products in the kernel space H. For example, the Parzen estimator of the quadratic information potential (QIP) from samples \{x(i)\}_{i=1}^{N} can be obtained as [1]

 \widehat{QIP}(X)=\frac{1}{2N^2\sqrt{\pi}\,\sigma}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa_{\sqrt{2}\sigma}\big(x(i)-x(j)\big) (2)

where \kappa_\sigma(\cdot) denotes the translation-invariant Gaussian kernel with bandwidth \sigma, given by

 \kappa_\sigma(x-y)=\exp\left(-\frac{(x-y)^2}{2\sigma^2}\right) (3)

Then we have

 \widehat{QIP}(X)=\frac{1}{2\sqrt{\pi}\sigma}\left\|\frac{1}{N}\sum_{i=1}^{N}\Phi(x(i))\right\|_H^2=\frac{1}{2\sqrt{\pi}\sigma}\left\langle\frac{1}{N}\sum_{i=1}^{N}\Phi(x(i)),\frac{1}{N}\sum_{j=1}^{N}\Phi(x(j))\right\rangle_H (4)

where \|\cdot\|_H stands for the norm in H. Thus, the estimated QIP is proportional to the squared norm of the mean of the transformed data in kernel space. In this work, we only consider the Gaussian kernel of (3), but most of the results can readily be extended to other Mercer kernels.
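As a small numerical sketch (ours, not part of the paper, assuming NumPy), the double sum in (2) can be cross-checked against the defining property of the QIP, namely that it equals the integral of the squared Parzen density estimate:

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2))

def qip_hat(x, sigma):
    """Parzen estimate of the QIP, Eq. (2); the double sum is exactly the
    squared RKHS norm of the mean of the mapped samples, Eq. (4)."""
    n = len(x)
    diff = x[:, None] - x[None, :]
    return gaussian_kernel(diff, np.sqrt(2.0) * sigma).sum() / (2 * n**2 * np.sqrt(np.pi) * sigma)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
sigma = 0.5

# Cross-check: the QIP is also the integral of the squared Parzen density.
grid = np.linspace(-8.0, 8.0, 4001)
p_hat = gaussian_kernel(grid[:, None] - x[None, :], sigma).sum(axis=1) \
        / (len(x) * np.sqrt(2 * np.pi) * sigma)
qip_numeric = np.sum(p_hat**2) * (grid[1] - grid[0])
```

Both quantities agree up to the accuracy of the numerical integration, since the Gaussian convolution identity turns the integral into the double sum with bandwidth \sqrt{2}\sigma.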

The intrinsic link between ITL and kernel methods has inspired researchers to define new similarity measures in kernel space. As a local similarity measure in ITL, the correntropy between two random variables X and Y is defined by [16, 34]

 V(X,Y)=E[\kappa_\sigma(X-Y)]=\int\kappa_\sigma(x-y)\,dF_{XY}(x,y) (5)

where E denotes the expectation operator and F_{XY}(x,y) is the joint distribution function of (X,Y). Of course, the correntropy can also be expressed in terms of the inner product as

 V(X,Y)=E\big[\langle\Phi(X),\Phi(Y)\rangle_H\big] (6)

which is a correlation measure in kernel space. It has been shown that correntropy is directly related to the probability of how similar two random variables are in an "observation window" controlled by the kernel bandwidth [16]. In a similar way, one can define other similarity measures in terms of inner products in kernel space, such as the centered correntropy, the correntropy coefficient, and the correntropic loss (C-Loss) [1]. Three similarity measures in kernel space and their linear counterparts in input space are presented in Table 1. Similarity measures in kernel space are able to extract higher-order statistics of the data and offer potentially significant performance improvement over their linear counterparts, especially in non-Gaussian signal processing and machine learning [1].

### II-B Kernel Risk-Sensitive Loss

Correntropy is a local similarity measure that is little influenced by large outliers. This desirable feature makes it possible to develop robust learning algorithms using correntropy as the cost function. For example, a supervised learning problem can be solved by maximizing the correntropy (or, equivalently, minimizing the C-Loss) between the model output and the desired response. This learning principle is referred to in the literature as the maximum correntropy criterion (MCC) [1, 2]. However, the C-Loss performance surface can be highly non-convex, with steep slopes around the optimal solution and extremely flat areas away from it, leading to slow convergence as well as poor accuracy. This situation can be improved by choosing a larger kernel bandwidth, but as the kernel bandwidth increases, the robustness decreases significantly when outliers occur. To achieve a better performance surface, we define in this work a new similarity measure in kernel space, called the kernel risk-sensitive loss (KRSL). The superiority of the performance surface of the KRSL will be demonstrated in the next section.

Given two random variables X and Y, the KRSL is defined by

 L_\lambda(X,Y)=\frac{1}{\lambda}E\big[\exp\big(\lambda(1-\kappa_\sigma(X-Y))\big)\big]=\frac{1}{\lambda}\int\exp\big(\lambda(1-\kappa_\sigma(x-y))\big)\,dF_{XY}(x,y) (7)

with \lambda>0 being the risk-sensitive parameter. The above KRSL can also be expressed as

 L_\lambda(X,Y)=\frac{1}{\lambda}E\left[\exp\left(\frac{\lambda}{2}\|\Phi(X)-\Phi(Y)\|_H^2\right)\right] (8)

which takes the same form as the traditional risk-sensitive loss [6, 7], but is defined in a different space.

In most practical situations, the joint distribution of X and Y is unknown, and only a finite number of samples \{(x(i),y(i))\}_{i=1}^{N} are available. In these cases, one can compute an approximation, called the empirical KRSL, by replacing the expectation with an average over the samples:

 \hat{L}_\lambda(X,Y)=\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\big(\lambda(1-\kappa_\sigma(x(i)-y(i)))\big) (9)

The empirical KRSL also defines a "distance" between the vectors X=[x(1),\ldots,x(N)]^T and Y=[y(1),\ldots,y(N)]^T. In this work, we also write the empirical KRSL as \hat{L}_\lambda(e), with e=X-Y, when no confusion arises.
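The empirical KRSL (9) is a one-liner in practice. The sketch below (our own, assuming NumPy; all variable names are illustrative) computes it and exercises the symmetry and boundedness properties stated next:

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2))

def empirical_krsl(x, y, lam=2.0, sigma=1.0):
    """Empirical KRSL of Eq. (9) between sample vectors x and y."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(lam * (1.0 - gaussian_kernel(e, sigma)))) / lam

# Toy samples (hypothetical values).
x = np.array([0.5, -1.0, 2.0, 0.0])
y = np.array([0.4, -1.2, 0.0, 3.0])
lam = 2.0
loss_xy = empirical_krsl(x, y, lam)   # symmetric in (x, y)
loss_yx = empirical_krsl(y, x, lam)
loss_min = empirical_krsl(x, x, lam)  # minimum value 1/lam, attained at x = y
```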

### Ii-C Properties

In the following, we present some important properties of the proposed KRSL.

Property 1: L_\lambda(X,Y) is symmetric, that is, L_\lambda(X,Y)=L_\lambda(Y,X).

Proof: Straightforward, since \kappa_\sigma(X-Y)=\kappa_\sigma(Y-X).

Property 2: L_\lambda(X,Y) is positive and bounded: \frac{1}{\lambda}\le L_\lambda(X,Y)\le\frac{1}{\lambda}\exp(\lambda), and it reaches its minimum if and only if X=Y.

Proof: Straightforward, since 0\le 1-\kappa_\sigma(X-Y)\le 1, and \kappa_\sigma(X-Y)=1 if and only if X=Y.

Property 3: When \lambda is small enough, it holds that L_\lambda(X,Y)\approx\frac{1}{\lambda}+L_C(X,Y), where L_C(X,Y)=E[1-\kappa_\sigma(X-Y)] denotes the C-Loss.

Proof: For \lambda small enough, we have \exp(\lambda t)\approx 1+\lambda t, and it follows that

 L_\lambda(X,Y)=\frac{1}{\lambda}E\big[\exp\big(\lambda(1-\kappa_\sigma(X-Y))\big)\big]\approx\frac{1}{\lambda}E\big[1+\lambda(1-\kappa_\sigma(X-Y))\big]=\frac{1}{\lambda}+E\big[1-\kappa_\sigma(X-Y)\big]=\frac{1}{\lambda}+L_C(X,Y) (10)

Property 4: When \sigma is large enough, it holds that L_\lambda(X,Y)\approx\frac{1}{\lambda}+\frac{1}{2\sigma^2}E[(X-Y)^2].

Proof: Since 1-\exp(-t)\approx t for t small enough, when \sigma is large enough we have

 L_\lambda(X,Y)=\frac{1}{\lambda}E\left[\exp\left(\lambda\left(1-\exp\left(-\frac{(X-Y)^2}{2\sigma^2}\right)\right)\right)\right]\approx\frac{1}{\lambda}E\left[\exp\left(\frac{\lambda(X-Y)^2}{2\sigma^2}\right)\right]\approx\frac{1}{\lambda}E\left[1+\frac{\lambda(X-Y)^2}{2\sigma^2}\right]=\frac{1}{\lambda}+\frac{1}{2\sigma^2}E[(X-Y)^2] (11)

Remark 1: According to Properties 3 and 4, the KRSL is approximately equivalent to the C-Loss (up to a constant) when \lambda is small enough, and equivalent to the MSE (up to a scale and a constant) when \sigma is large enough. Thus, the C-Loss and the MSE can be viewed as two extreme cases of the KRSL.
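Both limiting cases can be verified numerically. The check below (ours, assuming NumPy; sample sizes and parameter values are arbitrary) compares KRSL minus its 1/\lambda offset against the C-Loss for tiny \lambda, and against the scaled MSE for a large bandwidth:

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2))

rng = np.random.default_rng(1)
e = rng.normal(size=1000)   # error samples e = x - y

# lam -> 0+: KRSL - 1/lam approaches the C-Loss E[1 - kernel(e)] (Property 3).
lam, sigma = 1e-6, 1.0
k = gaussian_kernel(e, sigma)
krsl_shift = np.mean(np.expm1(lam * (1.0 - k))) / lam  # KRSL - 1/lam, computed stably
closs = np.mean(1.0 - k)

# sigma -> infinity: KRSL - 1/lam approaches E[e^2]/(2*sigma^2) (Property 4).
lam2, sigma2 = 2.0, 100.0
krsl_shift2 = np.mean(np.expm1(lam2 * (1.0 - gaussian_kernel(e, sigma2)))) / lam2
mse_scaled = np.mean(e**2) / (2 * sigma2**2)
```

Using `expm1` avoids the catastrophic cancellation of subtracting the huge constant 1/\lambda directly.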

Property 5: Let e=X-Y=[e(1),\ldots,e(N)]^T, with e(i)=x(i)-y(i). Then the empirical KRSL, as a function of e, is convex at any point satisfying \|e\|_\infty\le\sigma.

Proof: Since the mixed second-order partial derivatives vanish for i\ne j, the Hessian matrix of \hat{L}_\lambda(X,Y) with respect to e is diagonal:

 H_{\hat{L}_\lambda}(e)=\left[\frac{\partial^2\hat{L}_\lambda(X,Y)}{\partial e(i)\partial e(j)}\right]=\mathrm{diag}[\gamma_1,\gamma_2,\cdots,\gamma_N] (12)

where

 \gamma_i=\xi_i\left(\frac{\lambda}{\sigma^2}\exp\left(-\frac{e^2(i)}{2\sigma^2}\right)e^2(i)+1-\frac{e^2(i)}{\sigma^2}\right) (13)

with \xi_i=\frac{1}{N\sigma^2}\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\kappa_\sigma(e(i))>0. Thus we have \gamma_i\ge 0 if e^2(i)\le\sigma^2. This completes the proof.

Property 6: Given any point e with \|e\|_\infty>\sigma, the empirical KRSL will still be convex at e if the risk-sensitive parameter \lambda is larger than a certain value.

Proof: From (13), we have \gamma_i\ge 0 if one of the following conditions is satisfied: i) e^2(i)\le\sigma^2; ii) e^2(i)>\sigma^2 and \lambda\ge\frac{e^2(i)-\sigma^2}{e^2(i)}\exp\left(\frac{e^2(i)}{2\sigma^2}\right). Therefore, we have \gamma_i\ge 0 for all i if

 \lambda\ge\max_{i:\,e^2(i)>\sigma^2}\frac{e^2(i)-\sigma^2}{e^2(i)}\exp\left(\frac{e^2(i)}{2\sigma^2}\right) (14)

This completes the proof.

Remark 2: According to Properties 5 and 6, the empirical KRSL as a function of e is convex at any point satisfying \|e\|_\infty\le\sigma. For the case \|e\|_\infty>\sigma, the empirical KRSL can still be convex at e if the risk-sensitive parameter \lambda is larger than a certain value. In fact, the parameter \lambda controls the convex range, and a larger \lambda results in a larger convex range in general.
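The closed-form Hessian diagonal (12)-(13) and the convexity claims can be checked against finite differences. The sketch below (ours, assuming NumPy; the error values and \lambda settings are arbitrary test points):

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2))

def krsl_of_e(e, lam, sigma):
    """Empirical KRSL as a function of the error vector e."""
    return np.mean(np.exp(lam * (1.0 - gaussian_kernel(e, sigma)))) / lam

def hessian_diag(e, lam, sigma):
    """Closed-form diagonal of the Hessian, Eqs. (12)-(13)."""
    n = len(e)
    k = gaussian_kernel(e, sigma)
    xi = np.exp(lam * (1.0 - k)) * k / (n * sigma**2)
    return xi * (lam / sigma**2 * k * e**2 + 1.0 - e**2 / sigma**2)

lam, sigma = 3.0, 1.0
e = np.array([0.3, -0.8, 1.5, 2.2])

# Second-order central finite differences of the loss along each coordinate.
h = 1e-4
fd = np.empty_like(e)
for i in range(len(e)):
    ep, em = e.copy(), e.copy()
    ep[i] += h
    em[i] -= h
    fd[i] = (krsl_of_e(ep, lam, sigma) - 2 * krsl_of_e(e, lam, sigma)
             + krsl_of_e(em, lam, sigma)) / h**2
```

The entries with |e(i)| <= sigma are positive (Property 5); the entry at e(i) = 2.2 is negative for \lambda = 3 but turns positive once \lambda exceeds the bound (14) (Property 6).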

Property 7: When \sigma is large enough, it holds that

 \hat{L}_\lambda(X,\mathbf{0})\approx\frac{1}{2N\sigma^2}\|X\|_2^2+\frac{1}{\lambda} (15)

where \mathbf{0} denotes an N-dimensional zero vector.

Proof: Since 1-\kappa_\sigma(x(i))\approx\frac{x^2(i)}{2\sigma^2} and \exp(\lambda t)\approx 1+\lambda t when \sigma is large enough, we have

 \hat{L}_\lambda(X,\mathbf{0})=\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\big(\lambda(1-\kappa_\sigma(x(i)))\big)\approx\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\frac{\lambda x^2(i)}{2\sigma^2}\right)\approx\frac{1}{N\lambda}\sum_{i=1}^{N}\left[1+\frac{\lambda x^2(i)}{2\sigma^2}\right]=\frac{1}{\lambda}+\frac{1}{2\sigma^2}\cdot\frac{1}{N}\sum_{i=1}^{N}x^2(i)=\frac{1}{2N\sigma^2}\|X\|_2^2+\frac{1}{\lambda} (16)

Property 8: Assume that, for every i, either x(i)=0 or |x(i)|\ge\delta, where \delta is a small positive number. Then, as \sigma\to 0+, minimizing the empirical KRSL will be approximately equivalent to minimizing the \ell_0-norm of X, that is,

 \min_{X\in\Omega}\hat{L}_\lambda(X,\mathbf{0})\sim\min_{X\in\Omega}\|X\|_0,\quad\text{as }\sigma\to 0+ (17)

where \Omega denotes a feasible set of X.

Proof: Let X_L be the solution obtained by minimizing \hat{L}_\lambda(X,\mathbf{0}) over \Omega, and X_0 the solution obtained by minimizing \|X\|_0 over \Omega. Then \hat{L}_\lambda(X_L,\mathbf{0})\le\hat{L}_\lambda(X_0,\mathbf{0}), and hence

 \sum_{i=1}^{N}\Big[\exp\big(\lambda(1-\kappa_\sigma((X_L)_i))\big)-\exp(\lambda)\Big]\le\sum_{i=1}^{N}\Big[\exp\big(\lambda(1-\kappa_\sigma((X_0)_i))\big)-\exp(\lambda)\Big] (18)

where (X)_i denotes the ith component of X. Since each zero component contributes 1-\exp(\lambda), it follows that

 (1-\exp(\lambda))(N-\|X_L\|_0)+\sum_{i:\,(X_L)_i\ne 0}\Big[\exp\big(\lambda(1-\kappa_\sigma((X_L)_i))\big)-\exp(\lambda)\Big]\le(1-\exp(\lambda))(N-\|X_0\|_0)+\sum_{i:\,(X_0)_i\ne 0}\Big[\exp\big(\lambda(1-\kappa_\sigma((X_0)_i))\big)-\exp(\lambda)\Big] (19)

Hence

 \|X_L\|_0-\|X_0\|_0\le\frac{\sum_{i:\,(X_0)_i\ne 0}\Big[\exp\big(\lambda(1-\kappa_\sigma((X_0)_i))\big)-\exp(\lambda)\Big]}{\exp(\lambda)-1}-\frac{\sum_{i:\,(X_L)_i\ne 0}\Big[\exp\big(\lambda(1-\kappa_\sigma((X_L)_i))\big)-\exp(\lambda)\Big]}{\exp(\lambda)-1} (20)

Since |x(i)|\ge\delta for every nonzero component, we have \kappa_\sigma((X)_i)\to 0 as \sigma\to 0+, so each bracketed term approaches zero and the right-hand side of (20) approaches zero. Thus, if \sigma is small enough, it holds that

 \|X_0\|_0\le\|X_L\|_0\le\|X_0\|_0+\varepsilon (21)

where \varepsilon is a small positive number arbitrarily close to zero. This completes the proof.

Remark 3: According to Properties 7 and 8, the empirical KRSL behaves like a (scaled) squared \ell_2-norm of X when the kernel bandwidth \sigma is very large, and like an \ell_0-norm of X when \sigma is very small. Similar properties also hold for the empirical C-Loss (or the correntropy induced metric, CIM) [16].
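Both norm-like regimes are easy to observe numerically. In the sketch below (ours, assuming NumPy; the two test vectors are hypothetical), a sparse vector with a large entry beats a dense vector of tiny entries when \sigma is small, while for large \sigma the loss matches the scaled squared \ell_2-norm of (15):

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2))

def empirical_krsl0(x, lam, sigma):
    """Empirical KRSL between x and the zero vector."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.exp(lam * (1.0 - gaussian_kernel(x, sigma)))) / lam

lam = 2.0
x_sparse = np.array([0.0, 0.0, 0.0, 5.0])    # ell_0 norm 1, large ell_2 norm
x_dense = np.array([0.1, -0.2, 0.15, 0.1])   # ell_0 norm 4, small ell_2 norm

# Large bandwidth: the loss behaves like ||x||_2^2/(2*N*sigma^2) + 1/lam.
s_big = 100.0
l2_approx = np.mean(x_dense**2) / (2 * s_big**2) + 1.0 / lam
```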

## III Application to Adaptive Filtering

### III-A Performance Surface

Consider the identification of an FIR system:

 d(i)=W_0^TX(i)+v(i) (22)

where d(i) denotes the observed response at time i, W_0\in\mathbb{R}^m is the unknown weight vector to be estimated, X(i)\in\mathbb{R}^m is the (known) input vector, and v(i) stands for an additive noise, usually independent of the input. Let W be the estimated value of the weight vector. Then the KRSL cost (as a function of W, also referred to as the performance surface) is

 J_{KRSL}(W)=\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)=\frac{1}{N\lambda}\sum_{i=1}^{N}\exp\big(\lambda(1-\kappa_\sigma(d(i)-W^TX(i)))\big) (23)

with e(i)=d(i)-W^TX(i) being the error at time i and N the number of samples. The optimal solution can be obtained by minimizing the cost function J_{KRSL}(W). This optimization principle is called in this paper the minimum kernel risk-sensitive loss (MKRSL) criterion. The following theorem holds.

Theorem 1 (Optimal Solution): The optimal solution under the MKRSL criterion satisfies

 W_{MKRSL}=\left[\sum_{i=1}^{N}h(e(i))X(i)X(i)^T\right]^{-1}\left[\sum_{i=1}^{N}h(e(i))d(i)X(i)\right] (24)

where h(e(i))=\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\kappa_\sigma(e(i)), provided that the matrix \sum_{i=1}^{N}h(e(i))X(i)X(i)^T is invertible.

Proof: It is easy to derive

 \frac{\partial}{\partial W}J_{KRSL}(W)=0\;\Rightarrow\;\sum_{i=1}^{N}\exp\big(\lambda(1-\kappa_\sigma(d(i)-W^TX(i)))\big)\kappa_\sigma\big(d(i)-W^TX(i)\big)\big(d(i)-W^TX(i)\big)X(i)=0\;\Rightarrow\;\left[\sum_{i=1}^{N}h(e(i))X(i)X(i)^T\right]W=\left[\sum_{i=1}^{N}h(e(i))d(i)X(i)\right]\;\Rightarrow\;W_{MKRSL}=\left[\sum_{i=1}^{N}h(e(i))X(i)X(i)^T\right]^{-1}\left[\sum_{i=1}^{N}h(e(i))d(i)X(i)\right] (25)

Remark 4: We have h(e(i))\to 1 as \sigma\to\infty. In this case, the optimal solution reduces to the well-known Wiener solution. In addition, it is worth noting that equation (24) does not provide a closed-form solution, because the right-hand side of (24) depends on W through the error e(i)=d(i)-W^TX(i).
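Although (24) is not closed-form, it naturally suggests a fixed-point (iteratively reweighted least-squares) scheme: solve the weighted normal equations, recompute the weights h(e), and repeat. A minimal sketch (ours, not from the paper; the data, parameters, and iteration count are illustrative):

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2))

def mkrsl_fixed_point(X, d, lam=2.0, sigma=1.0, n_iter=50):
    """Iterate Eq. (24): a weighted least-squares step whose weights
    h(e) = exp(lam*(1-k(e))) * k(e) are recomputed from the current errors."""
    W = np.zeros(X.shape[1])
    for _ in range(n_iter):
        e = d - X @ W
        k = gaussian_kernel(e, sigma)
        h = np.exp(lam * (1.0 - k)) * k
        A = (X * h[:, None]).T @ X
        b = (X * h[:, None]).T @ d
        W = np.linalg.solve(A, b)
    return W

# Synthetic identification problem with 5% gross outliers (hypothetical setup).
rng = np.random.default_rng(2)
W0 = np.array([1.0, -0.5, 0.25])
X = rng.normal(size=(500, 3))
v = 0.01 * rng.normal(size=500)
v[:25] += 50.0 * rng.normal(size=25)   # impulsive outliers
d = X @ W0 + v
W_hat = mkrsl_fixed_point(X, d)
```

Samples with huge errors receive weights h(e) that vanish (the Gaussian kernel underflows), so the outliers are effectively excluded from the weighted solve.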

Now we compare the performance surfaces of the proposed KRSL and the C-Loss. For the two-dimensional case m=2 (for visualization purposes), the contours and gradients (with respect to W) of the performance surfaces are plotted in Fig. 1, where the input and the noise are both zero-mean white Gaussian processes with unit variance. From Fig. 1, one can see that the performance surface of the C-Loss is very flat (the gradients are very small) when the estimate is far away from the optimal solution, whereas it becomes very sharp near the optimal solution. For a gradient-based search algorithm, such a performance surface may lead to slow convergence, especially when the initial estimate is far from the optimal solution, and possibly to low accuracy at the final stage due to misadjustments caused by large gradients near the optimal solution. By contrast, the performance surface of the KRSL has three regions: i) when the estimate is close to the optimum, the gradients become small to reduce the misadjustments; ii) when the estimate is away from the optimum, the gradients become large to speed up convergence; iii) when the estimate is even further away from the optimum, the gradients decrease gradually to zero to avoid big fluctuations possibly caused by large outliers. Therefore, compared with the C-Loss, the KRSL offers a potentially more efficient performance surface, enabling simultaneously faster convergence and higher accuracy while maintaining robustness to outliers.

### III-B Robustness Analysis

Similar to the MCC criterion, the MKRSL criterion is robust to impulsive noises (or large outliers). In the following, we present some theoretical results on the robustness of the MKRSL criterion. For mathematical tractability, we consider only the scalar FIR identification case (m=1). In this case, the weight and the input are both scalars.

First, we give some notation. Let \varepsilon be a positive number, I=\{1,2,\ldots,N\} be the sample index set, and I(\varepsilon)=\{i\in I:|v(i)|\le\varepsilon\} be the subset of I containing the samples with small noise. In addition, the following two assumptions are made:

Assumption 1: M=|I(\varepsilon)|>N/2, where |\cdot| denotes the cardinality of a set;

Assumption 2: There exists a constant c>0 such that |x(i)|\ge c for all i\in I(\varepsilon).

Remark 5: Assumption 1 means that there are M (more than N/2) samples in which the amplitudes of the additive noises satisfy |v(i)|\le\varepsilon, and N-M (at least one) samples that may contain large outliers with |v(i)|>\varepsilon (possibly |v(i)|\to\infty). Assumption 2 is reasonable, since for a finite number of samples the minimum input amplitude is non-zero in general.

With the above notations and assumptions, the following theorem holds:

Theorem 2: If the kernel bandwidth \sigma is larger than a certain value, then the optimal solution W_{MKRSL} under the MKRSL criterion satisfies |W_{MKRSL}-W_0|\le\xi, where the expression of \xi is shown at the bottom of the page.

Proof: See Appendix.

The following corollary is a direct consequence of Theorem 2:

Corollary 1: If \sigma is larger than a certain value, then the optimal solution under MKRSL satisfies |W_{MKRSL}-W_0|\le\xi_0\varepsilon, where the expression of the constant \xi_0 is shown at the bottom of the page.

Remark 6: According to Corollary 1, if the kernel bandwidth is larger than a certain value, the absolute value of the estimation error will be upper bounded by a constant multiple of \varepsilon. If \varepsilon is very small, the upper bound will also be very small, which implies that the MKRSL solution can be very close to the true value W_0 even in the presence of outliers (whose values can be arbitrarily large), provided that there are M (M>N/2) samples disturbed only by small noises (bounded by \varepsilon).

For the vector case (m>1), it is very difficult to derive an upper bound on the norm of the estimation error. However, we believe that the above results for the scalar case explain clearly why and how the MKRSL criterion is robust to large outliers.

### III-C Stochastic Gradient Adaptive Algorithm

Stochastic gradient based adaptive algorithms have been widely used in practical applications, especially those involving online adaptation. Under the MKRSL criterion, the instantaneous cost function at time i is

 \hat{J}_{KRSL}(i)=\frac{1}{\lambda}\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big) (28)

Then a stochastic gradient based adaptive filtering algorithm can easily be derived as

 W(i+1)=W(i)-\mu\frac{\partial}{\partial W(i)}\hat{J}_{KRSL}(i)=W(i)+\frac{\mu}{\sigma^2}\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\kappa_\sigma(e(i))e(i)X(i)=W(i)+\eta\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\kappa_\sigma(e(i))e(i)X(i) (29)

where W(i) denotes the estimated weight vector at time i, and \eta=\mu/\sigma^2 is the step-size parameter. We call the above algorithm the MKRSL algorithm. In this work, we use the same abbreviation for an optimization criterion and the corresponding algorithm when no confusion can arise from the context.

The MKRSL algorithm (29) can also be expressed as

 W(i+1)=W(i)+\eta(i)e(i)X(i) (30)

which is a least mean square (LMS) algorithm with a variable step-size (VSS) \eta(i)=\eta\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\kappa_\sigma(e(i)).
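The update (29)/(30) is only a few lines of code. The sketch below (ours, assuming NumPy; the system, noise model, and parameter values are hypothetical) runs it on data contaminated by impulsive outliers and contrasts it with plain LMS:

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2))

def mkrsl_filter(X, d, eta=0.05, lam=2.0, sigma=1.0):
    """MKRSL update, Eq. (29)/(30): LMS with the variable step-size
    eta(i) = eta * exp(lam*(1 - k(e))) * k(e)."""
    W = np.zeros(X.shape[1])
    for x, di in zip(X, d):
        e = di - W @ x
        k = gaussian_kernel(e, sigma)
        W = W + eta * np.exp(lam * (1.0 - k)) * k * e * x
    return W

def lms_filter(X, d, eta=0.05):
    """Plain LMS baseline for comparison."""
    W = np.zeros(X.shape[1])
    for x, di in zip(X, d):
        W = W + eta * (di - W @ x) * x
    return W

# Toy identification problem with 5% impulsive outliers (synthetic data).
rng = np.random.default_rng(3)
W0 = np.array([0.8, -0.4, 0.3, 0.1])
X = rng.normal(size=(5000, 4))
v = 0.05 * rng.normal(size=5000)
idx = rng.choice(5000, size=250, replace=False)
v[idx] += 100.0 * rng.normal(size=250)
d = X @ W0 + v

err_mkrsl = np.linalg.norm(mkrsl_filter(X, d) - W0)
err_lms = np.linalg.norm(lms_filter(X, d) - W0)
```

For a huge error, the kernel factor underflows to zero and the update step vanishes, so the outlier samples are simply ignored; the LMS weights, by contrast, are kicked far off by every impulse.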

We have the following observations:

1) As \lambda\to 0+, we have \exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\to 1. In this case, the MKRSL algorithm becomes the MCC algorithm [21, 22]:

 W(i+1)=W(i)+\eta\kappa_\sigma(e(i))e(i)X(i) (31)

2) As \sigma\to\infty, we have \exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\kappa_\sigma(e(i))\to 1. In this case, the MKRSL algorithm reduces to the original LMS algorithm (with a fixed step-size):

 W(i+1)=W(i)+\eta e(i)X(i) (32)

Fig. 2 shows the curves of \eta(i) as a function of e(i) for different values of \lambda. As one can see, when \lambda is small (with \lambda\to 0+ corresponding to the MCC algorithm), the step-size reaches its maximum at the origin (e(i)=0). When \lambda is large enough, however, the step-size may reach its maximum at a location away from the origin, potentially leading to faster convergence and better accuracy. For any \lambda, the step-size approaches zero as |e(i)|\to\infty, which implies that the MKRSL algorithm is insensitive (i.e., robust) to large errors.

Note that the computational complexity of the MKRSL algorithm is almost the same as that of the MCC algorithm. The only extra computational demand is to calculate the term \exp\big(\lambda(1-\kappa_\sigma(e(i)))\big).
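The shape of the variable step-size can be probed directly. Setting the derivative of \exp(\lambda(1-k))k to zero in k gives a maximizer at \kappa_\sigma(e)=1/\lambda for \lambda>1, i.e. |e|=\sigma\sqrt{2\ln\lambda}, while for \lambda\le 1 the maximum sits at the origin. A quick check (ours, assuming NumPy; the grid and \lambda values are arbitrary):

```python
import numpy as np

def vss_factor(e, lam, sigma=1.0):
    """Shape of the variable step-size: eta(i)/eta = exp(lam*(1-k(e))) * k(e)."""
    k = np.exp(-e**2 / (2 * sigma**2))
    return np.exp(lam * (1.0 - k)) * k

e = np.linspace(-5.0, 5.0, 2001)
peak_small_lam = abs(e[np.argmax(vss_factor(e, lam=0.5))])  # lam <= 1: peak at origin
peak_large_lam = abs(e[np.argmax(vss_factor(e, lam=4.0))])  # lam > 1: peak away from origin
predicted = np.sqrt(2 * np.log(4.0))                        # sigma * sqrt(2*ln(lam))
tail = vss_factor(np.array([50.0]), lam=4.0)[0]             # step-size for a huge error
```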

### III-D Mean Square Convergence Performance

The mean square convergence behavior is very important for an adaptive filtering algorithm, and there have been extensive studies on the mean square convergence of various adaptive filtering algorithms in the literature [35]. The proposed MKRSL algorithm belongs to a general class of adaptive filtering algorithms [36, 37]:

 W(i+1)=W(i)+\eta f(e(i))X(i) (33)

where f(\cdot) is a nonlinear function of the error, which for the MKRSL algorithm is

 f(e(i))=\exp\big(\lambda(1-\kappa_\sigma(e(i)))\big)\kappa_\sigma(e(i))e(i) (34)

For the algorithm (33), the following relation holds [36]:

 E\big[\|\tilde{W}(i+1)\|^2\big]=E\big[\|\tilde{W}(i)\|^2\big]-2\eta E\big[e_a(i)f(e(i))\big]+\eta^2E\big[\|X(i)\|^2f^2(e(i))\big] (35)

where \tilde{W}(i)=W_0-W(i) is the weight error vector at iteration i, and e_a(i)=\tilde{W}(i)^TX(i) is the a priori error. The relation (35) is a direct consequence of the energy conservation relation [36, 37].
1) Transient Behavior

Based on (35) and under some assumptions, one can derive a dynamic equation that characterizes the transient behavior of the weight error power E[\|\tilde{W}(i)\|^2]. Specifically, the following theorem holds [37].

Theorem 3: Consider the adaptive filtering algorithm (33), where e(i)=e_a(i)+v(i). Assume that the noise process \{v(i)\} is i.i.d. and independent of the zero-mean input \{X(i)\}, and that the filter is long enough so that e_a(i) is zero-mean Gaussian and \|X(i)\|^2 and f^2(e(i)) are uncorrelated. Then it holds that

 E\big[\|\tilde{W}(i+1)\|^2\big]=E\big[\|\tilde{W}(i)\|^2\big]-2\eta E\big[\|\tilde{W}(i)\|^2_{X(i)X^T(i)}\big]h_G\Big(E\big[\|\tilde{W}(i)\|^2_{X(i)X^T(i)}\big]\Big)+\eta^2E\big[\|X(i)\|^2\big]h_U\Big(E\big[\|\tilde{W}(i)\|^2_{X(i)X^T(i)}\big]\Big) (36)

where \|\tilde{W}\|^2_A=\tilde{W}^TA\tilde{W} denotes a weighted squared norm, with E\big[\|\tilde{W}(i)\|^2_{X(i)X^T(i)}\big]=E[e_a^2(i)], and the functions h_G and h_U are defined by

 h_G\big(E[e_a^2(i)]\big)=\frac{E[e_a(i)f(e(i))]}{E[e_a^2(i)]},\qquad h_U\big(E[e_a^2(i)]\big)=E[f^2(e(i))] (37)

Proof: A detailed derivation can be found in [37].

For the MKRSL algorithm, with f(e)=\exp\big(\lambda(1-\kappa_\sigma(e))\big)\kappa_\sigma(e)e, the functions h_G and h_U can be expressed as

 h_G(x)=\frac{1}{\sqrt{2\pi x^3}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}y\exp\big(\lambda(1-\kappa_\sigma(y+v))\big)\kappa_\sigma(y+v)(y+v)\exp\left(-\frac{y^2}{2x}\right)p_v(v)\,dy\,dv
 h_U(x)=\frac{1}{\sqrt{2\pi x}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\big(2\lambda(1-\kappa_\sigma(y+v))\big)\kappa_\sigma^2(y+v)(y+v)^2\exp\left(-\frac{y^2}{2x}\right)p_v(v)\,dy\,dv (38)

where p_v(\cdot) denotes the PDF of the noise v(i). In general, there are no closed-form expressions for h_G and h_U, but both functions can be evaluated by numerical integration.
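To make (38) concrete, the two shape functions can be evaluated numerically. The sketch below (ours, not from the paper) uses Gauss-Hermite quadrature and, purely for illustration, assumes a zero-mean Gaussian noise PDF p_v:

```python
import numpy as np

def f_mkrsl(e, lam, sigma):
    """MKRSL error nonlinearity f(e) of Eq. (34)."""
    k = np.exp(-e**2 / (2 * sigma**2))
    return np.exp(lam * (1.0 - k)) * k * e

def h_funcs(x, lam=2.0, sigma=1.0, noise_std=0.1, n=80):
    """Evaluate h_G(x) = E[e_a f(e)]/x and h_U(x) = E[f^2(e)] of Eq. (38)
    by Gauss-Hermite quadrature, with e = e_a + v and e_a ~ N(0, x).
    The noise v is assumed zero-mean Gaussian here (an assumption of ours)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n)
    w = weights / np.sqrt(2.0 * np.pi)   # weights for expectations under N(0,1)
    ea = np.sqrt(x) * nodes[:, None]     # a priori error grid
    v = noise_std * nodes[None, :]       # noise grid
    w2 = w[:, None] * w[None, :]
    fe = f_mkrsl(ea + v, lam, sigma)
    return np.sum(w2 * ea * fe) / x, np.sum(w2 * fe**2)

# Sanity check: for a very large bandwidth f(e) ~ e, so h_G -> 1
# and h_U -> x + noise variance (the LMS limit of Section III-C).
hG_wide, hU_wide = h_funcs(0.5, lam=2.0, sigma=100.0, noise_std=0.1)
hG_def, hU_def = h_funcs(0.5)
```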

Remark 7: Using (36), one can construct the convergence curves of the weight error power. For example, if in addition the input sequence \{X(i)\} is i.i.d. with covariance matrix \sigma_x^2\mathbf{I}, where \mathbf{I} denotes the identity matrix, we have

 E\big[\|\tilde{W}(i+1)\|^2\big]=E\big[\|\tilde{W}(i)\|^2\big]-2\eta\sigma_x^2h_G\big(\sigma_x^2E[\|\tilde{W}(i)\|^2]\big)E\big[\|\tilde{W}(i)\|^2\big]+\eta^2\sigma_x^2m\,h_U\big(\sigma_x^2E[\|\tilde{W}(i)\|^2]\big) (39)