# Broad Learning System Based on Maximum Correntropy Criterion

As an effective and efficient discriminative learning method, the Broad Learning System (BLS) has received increasing attention due to its outstanding performance on various regression and classification problems. However, the standard BLS is derived under the minimum mean square error (MMSE) criterion, which is not always a good choice due to its sensitivity to outliers. To enhance the robustness of BLS, we propose in this work to adopt the maximum correntropy criterion (MCC) to train the output weights, obtaining a correntropy based broad learning system (C-BLS). Thanks to the inherent superiorities of MCC, the proposed C-BLS is expected to achieve excellent robustness to outliers while maintaining the original performance of the standard BLS in Gaussian or noise-free environments. In addition, three alternative incremental learning algorithms for C-BLS, derived from a weighted regularized least-squares solution rather than the pseudoinverse formula, are developed. With the incremental learning algorithms, the system can be updated quickly when some new samples arrive or the network needs to be expanded, without the entire retraining process from the beginning. Experiments on various regression and classification datasets are reported to demonstrate the desirable performance of the new methods.


## I Introduction

Broad Learning System (BLS) [10] is an emerging discriminative learning method that has been shown to have the potential to outperform several deep neural network based learning methods, such as multilayer perceptron (MLP) based methods [4], deep belief networks (DBN) [21], and stacked auto-encoders (SAE) [49]. Designing a BLS involves several necessary steps: 1) the input data are transformed into general mapped features by some feature mappings; 2) the generated mapped features are connected through nonlinear activation functions to form the so-called "enhancement nodes"; 3) the mapped features and the "enhancement nodes" are fed together into the output layer, and the corresponding output weights are obtained by means of the pseudoinverse. Since all weights and biases of the hidden-layer units in BLS can be randomly generated and remain unchanged thereafter, only the weights between the hidden layer and the output layer need to be trained, which brings great convenience to the training process. In addition, if some new samples arrive or the network needs to be expanded, several practical incremental learning algorithms were developed to guarantee that the system can be remodeled quickly without the entire retraining process from the beginning [10]. Thanks to these attractive features, BLS has received increasing attention [9, 18, 25, 33, 32, 54, 35, 56, 57, 59, 22, 24, 55] and has been successfully applied in image recognition, face recognition, time series prediction, etc.

The standard BLS, however, takes the minimum mean square error (MMSE) criterion as the default optimization criterion for training the network output weights. Although the MMSE criterion is computationally efficient and can provide good performance in Gaussian or noise-free environments, it degrades the performance of BLS in complicated noise environments, especially when the data are contaminated by outliers. To address this issue, several alternative optimization criteria that combine the ℓ1-norm with different regularization terms were proposed to train the output weights of BLS, generating a class of robust BLS (RBLS) variants [32]. Since the ℓ1-norm is less sensitive to outliers, the robustness of BLS has been significantly improved. Along the same line, Chu et al. [14] put forward the weighted BLS (WBLS). With a well-designed weighted penalty factor, WBLS has shown good robustness in nonlinear industrial processes. Another representative work to improve the robustness of BLS is the robust manifold BLS (RM-BLS) [19]. By introducing manifold embedding and random perturbation approximation, robust mapping features can be expected in some special application scenarios, such as noisy chaotic time series prediction. Therefore, RM-BLS also has the ability to improve the robustness of BLS.

Although the aforementioned robust BLS variants can be good candidates when some training data are disturbed by outliers, they suffer from some drawbacks. For example, due to computational complexity, incremental learning algorithms have not been provided under the ℓ1-norm based optimization criteria, even though they are one of the most important features of the standard BLS. For WBLS, the performance depends on the weighted penalty factor, which needs to be specified in advance. In addition, the abandonment of the connections between the input layer and the feature layer may lose some interesting properties [10, 9, 18, 25, 54], and may even make WBLS fall into some common pitfalls discussed in [39]. As for RM-BLS, the random perturbation matrix is of great importance for promoting the robustness of the algorithm, but how to design such a random perturbation matrix currently lacks guidance. Therefore, developing a more general BLS that retains the advantages of the standard BLS as much as possible while having the ability to suppress the adverse effects of outliers still requires more effort.

During the past few years, an efficient Information Theoretic Learning (ITL) [45] criterion called the maximum correntropy criterion (MCC) has been successfully applied to adaptive filters [52, 50, 7], randomized learning machines [38, 23, 42, 12, 53, 27], auto-encoders [46, 13], common spatial patterns (CSP) [16], and many others. These successful applications demonstrate that MCC performs very well in the presence of outliers. In addition, according to Property 3 provided in [40], correntropy has the potential to capture both the second-order and higher-order statistical characteristics of the errors when the Gaussian kernel is used. With an appropriate setting of the kernel size, the second-order statistical characteristics of the errors can be dominant, which also makes the correntropy based optimization criterion a suitable choice for Gaussian noise or noise-free environments. Inspired by the successful applications and attractive features of correntropy, we adopt it to train the output weights of BLS. Our main contributions are summarized as follows:

• By using an MCC based fixed-point iteration algorithm to train the output weights of BLS, we propose a correntropy based BLS (C-BLS). The new method is robust to outliers and has the potential to achieve performance comparable to the standard BLS in Gaussian noise or noise-free environments.

• Three alternative incremental learning algorithms that are derived from a weighted regularized least-squares solution rather than the pseudoinverse formula are provided. These algorithms ensure that the system can be remodeled quickly, without the entire retraining process from the beginning, when some new samples arrive or the network needs to be expanded.

• To test the effectiveness of the proposed methods comprehensively, various regression and classification applications are provided for performance evaluation.

The remainder of the paper is organized as follows. Section II gives a brief review of BLS. Section III introduces correntropy and, based on it, proposes the C-BLS and its incremental learning algorithms. Section IV presents experimental results on various regression and classification applications to demonstrate the performance of the proposed methods. Finally, conclusions are drawn in Section V.

## II Broad Learning System

The basic idea of BLS comes from the random vector functional-link neural network (RVFLNN) [44, 11], but the direct connections between the input layer and the output layer of RVFLNN are replaced by a set of general mapped features, and the system can be flattened in the wide sense by the enhancement nodes. Such a deformation leads to some interesting properties and even makes BLS outperform several deep structure based learning methods [10, 9].

### II-A Basic Structure and Training Algorithm

Fig. 1 shows the basic architecture of BLS [10]. Herein, X ∈ R^{N×d} and Y ∈ R^{N×c} are respectively the input and the output matrices, where N denotes the number of samples, (·)ᵀ represents the transpose operator, d is the dimension of each input vector, and c denotes the dimension of each output.

Based on X, k groups of mapped features, denoted as Zi, are first obtained by

 Zi = ϕi(XWei + βei) ∈ R^{N×q},  i = 1, 2, ⋯, k (1)

where ϕi is usually a linear transformation; q corresponds to the number of feature nodes in each group; Wei and βei are randomly generated weights and biases, respectively. In order to obtain sparse representations of the input data, the random weights can be slightly fine-tuned by a sparse auto-encoder [10]. Concatenating all mapped features together, we have

 Zk = [Z1, Z2, ⋯, Zk] ∈ R^{N×kq}. (2)

Based on Zk, m groups of "enhancement nodes", denoted as Hj, are further obtained, that is,

 Hj = ξj(ZkWhj + βhj) ∈ R^{N×r},  j = 1, 2, ⋯, m (3)

where ξj is a nonlinear activation function; r corresponds to the number of enhancement nodes in each group; Whj and βhj are also randomly generated weights and biases, respectively. These "enhancement nodes" can also be cascaded into one matrix in the form of

 Hm = [H1, H2, ⋯, Hm] ∈ R^{N×mr}. (4)

By concatenating Zk and Hm, we obtain

 U = [Zk, Hm] ∈ R^{N×L}, (5)

where L = kq + mr. Clearly, U is a new representation of the original input matrix X, termed the state matrix in [54]. Since all Wei, βei, Whj, and βhj are randomly generated and remain unchanged thereafter, the learning task reduces to estimating the output weights W. This optimization problem can be modeled as finding the regularized least-squares solution of UW = Y, that is,

 arg min_W ( ∥UW − Y∥₂² + λ∥W∥₂² ). (6)

Therefore, we have

 W=(UTU+λI)−1UTY, (7)

in which I denotes an identity matrix with proper dimensions, and λ is a nonnegative regularization constant. One should note that when λ → 0, the solution in (7) is equivalent to

 W=U†Y, (8)

where U† denotes the pseudoinverse of U. Equation (8) has been chosen as the main strategy in [10] for finding the output weights W.

### II-B Incremental Learning Algorithms for BLS

We now give a brief introduction to the incremental learning algorithms of BLS. For simplicity, the subscripts of the feature mapping ϕ and the activation function ξ will be omitted in the following, but one should note that ϕi can be selected differently in practice, as can ξj. In addition, we denote X(t) and Y(t) as the current input matrix and the current output matrix, respectively. According to (8), the current output weights can be obtained by

 W(t) = U(t)†Y(t), (9)

where U(t) is the state matrix calculated according to (1)-(5). Obviously, to derive the incremental learning algorithms of BLS, we need to determine the new forms of U(t+1) and U(t+1)†.

#### II-B1 Increment of New Samples

When new samples arrive, the increased input matrix and output matrix can be expressed by Xα and Yα, respectively. The new state matrix and output matrix are therefore obtained by

 U(t+1) = [U(t); Uα],  Y(t+1) = [Y(t); Yα], (10)

where Uα = [Zkα, Hmα], and

 Zkα = [ϕ(XαWe1 + βe1), ⋯, ϕ(XαWek + βek)], (11)

 Hmα = [ξ(ZkαWh1 + βh1), ⋯, ξ(ZkαWhm + βhm)]. (12)

According to [10, 11], the pseudoinverse of U(t+1) in (10) can be calculated by

 U(t+1)† = [U(t)† − BD, B], (13)

with

 D = UαU(t)†,  C = Uα − DU(t),  B = C† if C ≠ 0;  B = U(t)†Dᵀ(I + DDᵀ)⁻¹ if C = 0. (14)

Correspondingly, the update equation for the output weights has the following form

 W(t+1) = U(t+1)†Y(t+1) = W(t) + B(Yα − UαW(t)). (15)
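The update (13)-(15) can be sanity-checked numerically. In the sketch below, U(t) has full column rank, so C = 0 up to round-off and the second branch of (14) applies; the incrementally updated weights are compared against retraining with the pseudoinverse of the stacked matrix. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, N0, c = 50, 8, 5, 3
U_t = rng.standard_normal((N, L))     # current state matrix U(t)
Y_t = rng.standard_normal((N, c))
U_a = rng.standard_normal((N0, L))    # state-matrix rows of the new samples
Y_a = rng.standard_normal((N0, c))

U_pinv = np.linalg.pinv(U_t)
W_t = U_pinv @ Y_t                    # eq. (9)

D = U_a @ U_pinv                      # eq. (14)
C = U_a - D @ U_t
if np.linalg.norm(C) > 1e-8:
    B = np.linalg.pinv(C)             # branch C != 0
else:                                 # branch C = 0: new rows lie in the row space of U(t)
    B = U_pinv @ D.T @ np.linalg.inv(np.eye(N0) + D @ D.T)

W_new = W_t + B @ (Y_a - U_a @ W_t)   # eq. (15)

# reference: retrain from scratch on the stacked data
W_ref = np.linalg.pinv(np.vstack([U_t, U_a])) @ np.vstack([Y_t, Y_a])
print(np.allclose(W_new, W_ref))      # True
```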

#### II-B2 Increment of Enhancement Mapping Nodes

When new enhancement nodes are inserted, the state matrix changes to

 U(t+1)=[U(t),ξ(ZkWhm+1+βhm+1)], (16)

where Whm+1 and βhm+1 are randomly generated weights and biases, respectively. The pseudoinverse of U(t+1) in (16) can be calculated in the following way [10]:

 U(t+1)† = [U(t)† − DBᵀ; Bᵀ], (17)

with

 D = U(t)†ξ(ZkWhm+1 + βhm+1),  C = ξ(ZkWhm+1 + βhm+1) − U(t)D,  Bᵀ = C† if C ≠ 0;  Bᵀ = (I + DᵀD)⁻¹DᵀU(t)† if C = 0. (18)

Since Y(t+1) = Y(t), the output weights in this case are therefore updated by

 W(t+1) = U(t+1)†Y(t+1) = [W(t) − DBᵀY(t); BᵀY(t)]. (19)
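A similar numerical check applies to (16)-(19). In the sketch below, a random column block plays the role of the new enhancement nodes ξ(ZkWhm+1 + βhm+1); since it is generically not in the column space of U(t), the branch Bᵀ = C† of (18) applies. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, r, c = 60, 10, 4, 2
U_t = rng.standard_normal((N, L))          # current state matrix U(t)
Y_t = rng.standard_normal((N, c))
xi = np.tanh(rng.standard_normal((N, r)))  # stand-in for the new enhancement-node block

U_pinv = np.linalg.pinv(U_t)
W_t = U_pinv @ Y_t

D = U_pinv @ xi                            # eq. (18)
C = xi - U_t @ D
Bt = np.linalg.pinv(C)                     # branch C != 0 (generic case)

# eq. (19): the weight rows for the new nodes are appended at the bottom
W_new = np.vstack([W_t - D @ Bt @ Y_t, Bt @ Y_t])

W_ref = np.linalg.pinv(np.hstack([U_t, xi])) @ Y_t
print(np.allclose(W_new, W_ref))           # True
```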

#### II-B3 Increment of Feature Mapping Nodes

When the (k+1)th group of feature nodes is inserted, we have

 U(t+1)=[U(t),Zk+1,Hexm], (20)

with

 Zk+1=ϕ(XWek+1+βek+1), (21)
 Hexm = [ξ(Zk+1Wex1 + βex1), ⋯, ξ(Zk+1Wexm + βexm)], (22)

where Wexj and βexj are also randomly generated weights and biases that connect the new feature nodes to the enhancement nodes. With a procedure similar to that used from (17) to (19), the new output weights can be calculated by

 W(t+1) = [W(t) − DBᵀY(t); BᵀY(t)], (23)

with

 D = U(t)†[Zk+1, Hexm],  C = [Zk+1, Hexm] − U(t)D,  Bᵀ = C† if C ≠ 0;  Bᵀ = (I + DᵀD)⁻¹DᵀU(t)† if C = 0. (24)

Remark 1: When some new samples arrive or some new nodes are involved, the above three incremental learning algorithms can update the output weights of BLS without running a complete training cycle, which ensures that the system can be remodeled quickly. However, these incremental learning algorithms require the regularization factor λ to tend to zero, so that the regularized least-squares solution well approximates the pseudoinverse. This is, of course, not always a good choice, since the regularization factor plays an important role in improving the model's generalization ability in many practical applications. In the next section, several more general incremental learning algorithms under the BLS architecture will be provided.

## III Correntropy Based Broad Learning System

Although BLS has many attractive features, its dependence on the second-order statistical characteristics of the errors makes it an unsuitable choice in complicated noise environments, especially when the data are disturbed by outliers [32]. To offer a robust version of BLS, we introduce in this section the concept of correntropy, based on which the C-BLS and its incremental learning algorithms are developed.

### III-A Correntropy

Correntropy [40] is a local similarity measure between two arbitrary random variables X and Y, defined by

 Vσ(X, Y) = E[κσ(X, Y)], (25)

where E[·] denotes the expectation operator, and κσ(·, ·) is a Mercer kernel [2] controlled by the kernel size σ. Without loss of generality, the Gaussian kernel, defined as κσ(x, y) = (1/(√(2π)σ)) exp(−(x − y)²/(2σ²)), will be the default choice in this paper. Applying the Taylor series expansion to (25), we have

 Vσ(X, Y) = (1/(√(2π)σ)) Σ_{n=0}^{∞} ((−1)ⁿ/(2ⁿn!)) E[(X − Y)^{2n}/σ^{2n}]. (26)

Clearly, correntropy can be viewed as a weighted sum of all even moments of X − Y, and the weights of the second- and higher-order moments are controlled by the kernel size σ. As σ increases, the higher-order moments decay faster, so the second-order moment tends to be dominant for large σ. In practice, the data distribution is usually unknown and only a finite number of samples {(xi, yi)}, i = 1, ⋯, N, are available, resulting in the sample estimator of correntropy

 V̂σ(X, Y) = (1/N) Σ_{i=1}^{N} κσ(xi, yi). (27)
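As a small illustration (an assumption-laden sketch, not the paper's code), the estimator (27) with a Gaussian kernel barely reacts to a single gross outlier, whereas the mean squared error is dominated by it; the kernel's normalization constant is omitted here since it does not affect the maximizer:

```python
import numpy as np

def correntropy(x, y, sigma):
    """Sample estimator (27) with the Gaussian kernel (normalizer omitted)."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-e**2 / (2.0 * sigma**2)))

y = np.zeros(100)
y_hat = np.zeros(100)
y_hat[0] = 1000.0                      # one gross outlier

mse = np.mean((y - y_hat) ** 2)        # 10000.0: dominated by the single outlier
v = correntropy(y, y_hat, sigma=1.0)   # ~0.99: the outlier contributes ~0 to the mean
print(mse, v)
```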

In signal processing and machine learning, one usually estimates an unknown parameter ω (such as the weight vector of an adaptive filter) by maximizing the correntropy between the desired signal Y and its estimate Ŷ, i.e.,

 arg max_ω V̂σ(Y, Ŷ). (28)

This optimization criterion is called the MCC. Unlike the well-known MMSE criterion, which is sensitive to outliers, MCC has been proven to be very robust for parameter estimation in complicated noise environments [6, 3, 58, 28, 52, 50, 20].

### III-B Basic Training Algorithm for C-BLS

Similar to the standard BLS, the state matrix U in the proposed method can be constructed through a series of feature mappings and enhancement transformations as described in (1)-(5). However, more powerful feature mapping strategies, such as the convolution-pooling operation [9], neuro-fuzzy models [18], and structured manifold learning techniques [25], are also feasible. Thus, the optimization model that combines BLS and MCC can be formulated as

 arg max_W ( Σ_{i=1}^{N} exp(−∥uiW − yi∥₂²/(2σ²)) − (λ/2)∥W∥₂² ), (29)

where ui denotes the ith row of U. For simplicity, we denote the objective function in (29) by J(W), and then (29) can be rewritten as

 arg max_W J(W). (30)

Taking the gradient of J(W) with respect to W, we have

 ∂J(W)/∂W = −(1/σ²) Σ_{i=1}^{N} uiᵀ exp(−∥uiW − yi∥₂²/(2σ²))(uiW − yi) − λW. (31)

The matrix form of (31) can be expressed by

 ∂J(W)/∂W = −(1/σ²)UᵀΛw(UW − Y) − λW, (32)

with

 Λw = diag( exp(−∥u1W − y1∥₂²/(2σ²)), ⋯, exp(−∥uNW − yN∥₂²/(2σ²)) ). (33)

By setting ∂J(W)/∂W to zero, the solution for W can be written in the following form:

 W = (UᵀΛwU + γI)⁻¹UᵀΛwY, (34)

where γ = λσ². Obviously, Λw is a function of W. Hence, (34) is actually a fixed-point equation, which can be described by

 W=f(W), (35)

with

 f(W) =(UTΛwU+γI)−1UTΛwY. (36)

Referring to the widely used fixed-point iteration method [1, 8, 29, 5], we can solve for W in the following iterative way:

 W(t+1)=f(W(t)), (37)

where W(t) denotes the solution at iteration t. Letting ε denote the termination tolerance, the stopping criterion can be set as ∥W(t+1) − W(t)∥ ≤ ε. According to the work in [8], the convergence of the fixed-point iteration method under the MCC can be guaranteed if the kernel size is appropriately chosen. In the following experiments, the grid search method will be adopted to determine σ and the other parameters of C-BLS, so as to ensure its convergence and also let it approach its optimal performance as closely as possible.

Finally, the proposed C-BLS is summarized in Algorithm 1.
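The fixed-point loop (33)-(37) can be sketched as follows. This is a minimal NumPy illustration under assumed settings (arbitrary σ, γ, and tolerance, with the state matrix U taken as given), not the paper's Algorithm 1 verbatim. The toy check at the end compares it with the MMSE solution (7) on data containing a few grossly corrupted targets.

```python
import numpy as np

def cbls_output_weights(U, Y, sigma=1.0, gamma=1e-2, eps=1e-6, max_iter=100):
    """Fixed-point iteration (37) for W = (UᵀΛwU + γI)⁻¹UᵀΛwY, eq. (34)."""
    L = U.shape[1]
    I = np.eye(L)
    W = np.linalg.solve(U.T @ U + gamma * I, U.T @ Y)          # MMSE initialization
    for _ in range(max_iter):
        E = U @ W - Y
        lam = np.exp(-np.sum(E**2, axis=1) / (2 * sigma**2))   # diag of Λw, eq. (33)
        UtL = U.T * lam                                        # UᵀΛw via broadcasting
        W_next = np.linalg.solve(UtL @ U + gamma * I, UtL @ Y)
        if np.linalg.norm(W_next - W) <= eps:                  # stopping criterion
            return W_next
        W = W_next
    return W

# toy robustness check: linear data with 5% grossly corrupted targets
rng = np.random.default_rng(0)
U = rng.standard_normal((300, 6))
W_true = rng.standard_normal((6, 1))
Y = U @ W_true
Y[:15] += 50.0                                                 # outliers

W_ls = np.linalg.solve(U.T @ U + 1e-2 * np.eye(6), U.T @ Y)    # MMSE, eq. (7)
W_mcc = cbls_output_weights(U, Y, sigma=1.0)
print(np.linalg.norm(W_ls - W_true) > np.linalg.norm(W_mcc - W_true))  # True
```

The outlier rows receive weights exp(−∥ui W − yi∥²/(2σ²)) ≈ 0 after the first iteration, so the refit is driven almost entirely by the clean samples.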

Remark 2: Compared with (7) for the original BLS, (34) has an additional weighting diagonal matrix Λw, whose ith diagonal element is controlled by the kernel size σ as well as the difference between yi and its estimate uiW. It can be verified that when σ → ∞, (34) reduces to the solution of the standard BLS. Hence, C-BLS can, at least, achieve performance comparable to the standard BLS. In addition, by appropriately setting the value of σ, C-BLS has the potential to weaken the negative effects of outliers. For example, when the ith sample is polluted by an outlier, there will in general be a large difference between yi and uiW. With an appropriate (relatively small) setting of σ, the corresponding diagonal element exp(−∥uiW − yi∥₂²/(2σ²)) ≈ 0, so the outlier does not have a big impact on the training process.

Remark 3: The update equations of the proposed C-BLS are somewhat similar to those of the W-BLS proposed in [14]. However, there are several dissimilarities between them: 1) C-BLS is proposed from the Information Theoretic Learning (ITL) [45] perspective, while W-BLS is proposed from industrial process applications; 2) the weighting operator in C-BLS follows directly from MCC, while the weighting operator in W-BLS is an additional hyperparameter that needs to be specified in advance; 3) C-BLS retains the connections between the input layer and the feature layer and hence can be easily combined with some existing feature mapping technologies [10, 9, 18, 25], while W-BLS abandons such connections.

### III-C Incremental Learning Algorithms for C-BLS

To derive the incremental learning algorithms of C-BLS, we again use X(t) and Y(t) to denote the current input matrix and the current output matrix, respectively. According to (34), we then get

 W(t) = [U(t)ᵀΛw(t)U(t) + γI]⁻¹U(t)ᵀΛw(t)Y(t) = (Uw(t)ᵀUw(t) + γI)⁻¹Uw(t)ᵀYw(t), (38)

where Uw(t) = Λw(t)^{1/2}U(t) is the weighted state matrix, Yw(t) = Λw(t)^{1/2}Y(t) corresponds to the weighted output matrix, and Λw(t) is calculated by

 Λw(t) = diag( exp(−∥u1W(t) − y1∥₂²/(2σ²)), ⋯, exp(−∥uNW(t) − yN∥₂²/(2σ²)) ). (39)

For ease of representation, we define Rw(t) = Uw(t)ᵀUw(t) + γI and Pw(t) = Uw(t)ᵀYw(t). Hence, (38) can be written as

 W(t) = Rw(t)⁻¹Pw(t). (40)

#### III-C1 Increment of New Samples

Assume that N₀ new samples are available. We first denote their state-matrix block and output block by Uα and Yα, respectively. Then, the weighted state matrix and output matrix can be obtained by

 Uw(t+1) ≈ [Uw(t); Uαw(t)],  Yw(t+1) ≈ [Yw(t); Yαw(t)], (41)

where, with

 Λαw(t) = diag( exp(−∥u_{N+1}W(t) − y_{N+1}∥₂²/(2σ²)), ⋯, exp(−∥u_{N+N₀}W(t) − y_{N+N₀}∥₂²/(2σ²)) ), (42)

the weighted blocks of the new samples are

 Uαw(t) = Λαw(t)^{1/2}Uα, (43)

 Yαw(t) = Λαw(t)^{1/2}Yα. (44)

Substituting Uw(t+1) and Yw(t+1) into (40), we have

 W(t+1) = Rw(t+1)⁻¹Pw(t+1), (45)

where

 Rw(t+1) = Uw(t+1)ᵀUw(t+1) + γI ≈ [Uw(t); Uαw(t)]ᵀ[Uw(t); Uαw(t)] + γI = Rw(t) + Uαw(t)ᵀUαw(t), (46)

and

 Pw(t+1) = Uw(t+1)ᵀYw(t+1) ≈ [Uw(t); Uαw(t)]ᵀ[Yw(t); Yαw(t)] = Pw(t) + Uαw(t)ᵀYαw(t). (47)

By using the matrix inversion lemma

 (A + BCD)⁻¹ = A⁻¹ − A⁻¹B(C⁻¹ + DA⁻¹B)⁻¹DA⁻¹ (48)

with the definitions A = Rw(t), B = Uαw(t)ᵀ, C = I, and D = Uαw(t), we get

 Rw(t+1)⁻¹ = Rw(t)⁻¹ − Rw(t)⁻¹Uαw(t)ᵀSαw(t)Uαw(t)Rw(t)⁻¹, (49)

where Sαw(t) = [I + Uαw(t)Rw(t)⁻¹Uαw(t)ᵀ]⁻¹. Substituting (47) and (49) into (45) yields

 W(t+1) = Rw(t+1)⁻¹Pw(t+1) = W(t) + Rw(t)⁻¹Uαw(t)ᵀSαw(t)(Yαw(t) − Uαw(t)W(t)). (50)

Let Cw(t) = Rw(t)⁻¹. We obtain the final equations for updating W as follows:

 Sαw(t) = [I + Uαw(t)Cw(t)Uαw(t)ᵀ]⁻¹,
 W(t+1) = W(t) + Cw(t)Uαw(t)ᵀSαw(t)(Yαw(t) − Uαw(t)W(t)),
 Cw(t+1) = Cw(t) − Cw(t)Uαw(t)ᵀSαw(t)Uαw(t)Cw(t). (51)

The main computational effort in (51) lies in the calculation of Sαw(t). Since only the latest N₀ samples as well as the previous Cw(t) are involved in computing Sαw(t), the corresponding computational cost is in general not burdensome.
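The recursion (51) can be verified against batch retraining. The sketch below treats the weighted matrices as given random data (in C-BLS they would come from (39)-(44), and (41) is itself an approximation that freezes the old weights Λw(t)); under that assumption the update reproduces the batch solution (40) exactly, by the matrix inversion lemma. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
L, N, N0, c, gamma = 6, 40, 8, 2, 0.1
Uw = rng.standard_normal((N, L))      # weighted state matrix Uw(t)
Yw = rng.standard_normal((N, c))
Uaw = rng.standard_normal((N0, L))    # weighted block of the N0 new samples
Yaw = rng.standard_normal((N0, c))

# batch quantities (40): Cw(t) = Rw(t)^-1
Cw = np.linalg.inv(Uw.T @ Uw + gamma * np.eye(L))
W = Cw @ (Uw.T @ Yw)

# recursive update (51)
S = np.linalg.inv(np.eye(N0) + Uaw @ Cw @ Uaw.T)
W_new = W + Cw @ Uaw.T @ S @ (Yaw - Uaw @ W)
Cw_new = Cw - Cw @ Uaw.T @ S @ Uaw @ Cw

# reference: retrain from scratch on all N + N0 weighted samples
U_all, Y_all = np.vstack([Uw, Uaw]), np.vstack([Yw, Yaw])
C_ref = np.linalg.inv(U_all.T @ U_all + gamma * np.eye(L))
W_ref = C_ref @ (U_all.T @ Y_all)
print(np.allclose(W_new, W_ref), np.allclose(Cw_new, C_ref))  # True True
```

Note that only the N₀ × N₀ matrix S is inverted, rather than the L × L matrix Rw(t+1), which is what makes the update cheap when N₀ is small.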

#### III-C2 Increment of Enhancement Nodes

Assume that a group of new enhancement nodes is inserted. The weighted state matrix can be expressed by

 Uw(t+1)≈[Uw(t),ξw(t)], (52)

where ξw(t) = Λw(t)^{1/2}ξ(ZkWhm+1 + βhm+1); Whm+1 and βhm+1 are randomly generated weights and biases, respectively. With the approximation Yw(t+1) ≈ Yw(t), the output weights in this case are obtained by

 W(t+1) = Rw(t+1)⁻¹Pw(t+1), (53)

with

 Rw(t+1) = Uw(t+1)ᵀUw(t+1) + γI ≈ [Uw(t)ᵀ; ξw(t)ᵀ][Uw(t), ξw(t)] + γI = [Rw(t), Uw(t)ᵀξw(t); ξw(t)ᵀUw(t), γI + ξw(t)ᵀξw(t)], (54)

and

 Pw(t+1) = Uw(t+1)ᵀYw(t+1) ≈ [Uw(t)ᵀ; ξw(t)ᵀ]Yw(t) = [Pw(t); ξw(t)ᵀYw(t)]. (55)

The inverse matrix of Rw(t+1) in (53) can be calculated by using the block matrix inversion lemma [41], which has the form

 [A, B; C, D]⁻¹ = [(A − BD⁻¹C)⁻¹, −A⁻¹B(D − CA⁻¹B)⁻¹; −(D − CA⁻¹B)⁻¹CA⁻¹, (D − CA⁻¹B)⁻¹], (56)

where A and D are arbitrary invertible matrix blocks. Let A = Rw(t), B = Uw(t)ᵀξw(t), C = ξw(t)ᵀUw(t), and D = γI + ξw(t)ᵀξw(t); then we get

 Rw(t+1)⁻¹ = [Rw(t)⁻¹ + Zw(t)Qw(t)Zw(t)ᵀ, −Zw(t)Qw(t); −Qw(t)Zw(t)ᵀ, Qw(t)], (57)

where

 Zw(t) = Rw(t)⁻¹Uw(t)ᵀξw(t),  Qw(t) = (γI + ξw(t)ᵀξw(t) − ξw(t)ᵀUw(t)Zw(t))⁻¹. (58)

Substituting (56) and (57) into (53), we have

 W(t+1) = Rw(t+1)⁻¹Pw(t+1) = [W(t) − Zw(t)Qw(t)ξw(t)ᵀ(Yw(t) − Uw(t)W(t)); Qw(t)ξw(t)ᵀ(Yw(t) − Uw(t)W(t))]. (59)