I Introduction
Broad Learning System (BLS) [10] is an emerging discriminative learning method that has shown the potential to outperform some deep neural network based learning methods, such as multilayer perceptron (MLP) based methods [4], deep belief networks (DBN) [21], and stacked autoencoders (SAE) [49]. Designing a BLS involves several necessary steps: 1) the input data are transformed into general mapped features by some feature mappings; 2) the generated mapped features are connected by nonlinear activation functions to form the so-called "enhancement nodes"; 3) the mapped features and the "enhancement nodes" are fed together into the output layer, and the corresponding output weights are obtained by means of the pseudoinverse. Since all weights and biases of the hidden-layer units in BLS can be randomly generated and remain unchanged thereafter, only the weights between the hidden layer and the output layer need to be trained, which brings great convenience to the training process. In addition, if new samples arrive or the network needs to be expanded, several practical incremental learning algorithms have been developed to guarantee that the system can be remodeled quickly without the entire retraining process from the beginning
[10]. Thanks to these attractive features, BLS has received increasing attention [9, 18, 25, 33, 32, 54, 35, 56, 57, 59, 22, 24, 55] and has been successfully applied in image recognition, face recognition, time series prediction, etc.
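The three design steps above can be sketched in a few lines. The following is a minimal illustration only, assuming a plain NumPy implementation; the function name, node counts, and the tanh activation are our own choices, not fixed by BLS:

```python
import numpy as np

rng = np.random.default_rng(0)

def bls_train(X, Y, n_groups=10, feat_per_group=10, n_enhance=100, lam=1e-3):
    """Minimal BLS sketch: random linear feature groups, tanh enhancement
    nodes, and a ridge-regularized least-squares readout for the output
    weights (the lam -> 0 limit corresponds to the pseudoinverse solution)."""
    N, d = X.shape
    # Step 1: map the input into groups of random linear features.
    Z_list = []
    for _ in range(n_groups):
        We = rng.standard_normal((d, feat_per_group))
        be = rng.standard_normal(feat_per_group)
        Z_list.append(X @ We + be)
    Z = np.hstack(Z_list)
    # Step 2: form "enhancement nodes" through a nonlinear activation.
    Wh = rng.standard_normal((Z.shape[1], n_enhance))
    bh = rng.standard_normal(n_enhance)
    H = np.tanh(Z @ Wh + bh)
    # Step 3: concatenate into the state matrix and solve for the output weights.
    U = np.hstack([Z, H])
    W = np.linalg.solve(U.T @ U + lam * np.eye(U.shape[1]), U.T @ Y)
    return W, U
```

Only the final linear solve involves learning; everything before it is a fixed random projection, which is why training is cheap.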
The standard BLS, however, takes the minimum mean square error (MMSE) criterion as the default choice for training the network output weights. Although the MMSE criterion is computationally efficient and can provide good performance in Gaussian or noise-free environments, it degrades the performance of BLS in complicated noise environments, especially when the data are contaminated by outliers. To address this issue, several alternative optimization criteria that combine the $\ell_1$-norm with different regularization terms were proposed to train the output weights of BLS, generating a class of robust BLS (RBLS) variants [32]. Since the $\ell_1$-norm is less sensitive to outliers, the robustness of BLS has been significantly improved. Along the same line, Chu et al. [14] put forward the weighted BLS (WBLS). With a well-designed weighted penalty factor, WBLS has shown good robustness in nonlinear industrial processes. Another representative work to improve the robustness of BLS is the robust manifold BLS (RMBLS) [19]. By introducing manifold embedding and random perturbation approximation, robust mapped features can be expected in some special application scenarios, such as noisy chaotic time series prediction. Therefore, RMBLS also has the ability to improve the robustness of BLS.
Although the aforementioned robust BLS variants can be good candidates when some training data are disturbed by outliers, they suffer from some drawbacks. For example, due to computational complexity, incremental learning algorithms have not been provided under the $\ell_1$-norm based optimization criteria, even though such algorithms are one of the most important features of the standard BLS. For WBLS, the performance depends on the weighted penalty factor, which needs to be specified in advance. In addition, the abandonment of the connections between the input layer and the feature layer may lose some interesting properties [10, 9, 18, 25, 54], and even makes WBLS fall into some common pitfalls discussed in [39]. As for RMBLS, the random perturbation matrix is of great importance for promoting the robustness of the algorithm, but how to design such a random perturbation matrix currently lacks guidance. Therefore, more effort is still needed to develop a more general BLS that retains the advantages of the standard BLS as much as possible while having the ability to suppress the adverse effects of outliers.
During the past few years, an efficient Information Theoretic Learning (ITL) [45] criterion called the maximum correntropy criterion (MCC) has been successfully applied to adaptive filters [52, 50, 7], randomized learning machines [38, 23, 42, 12, 53], principal component analysis (PCA) [27], autoencoders [46, 13], common spatial patterns (CSP) [16], and many others. These successful applications demonstrate that MCC performs very well in the presence of outliers. In addition, according to Property 3 provided in [40], correntropy has the potential to capture both the second-order and higher-order statistical characteristics of errors when the Gaussian kernel is used. With an appropriate setting of the kernel size, the second-order statistical characteristics of errors can be dominant, which also makes the correntropy based optimization criterion a suitable choice for Gaussian noise or noise-free environments. Inspired by the successful applications and attractive features of correntropy, we adopt it to train the output weights of BLS. Our main contributions are summarized as follows:
By using an MCC based fixed-point iteration algorithm to train the output weights of BLS, we propose a correntropy based BLS (CBLS). The new method is robust to outliers and has the potential to achieve performance comparable to the standard BLS in Gaussian noise or noise-free environments.

Three alternative incremental learning algorithms, derived from a weighted regularized least-squares solution rather than the pseudoinverse formula, are provided. These algorithms ensure that the system can be remodeled quickly, without the entire retraining process from the beginning, when new samples arrive or the network needs to be expanded.

To comprehensively test the effectiveness of the proposed methods, various regression and classification applications are provided for performance evaluation.
The remainder of the paper is organized as follows. In Section II, we give a brief review of BLS. In Section III, correntropy is introduced, and based on it, we propose the CBLS and its incremental learning algorithms. Section IV presents experimental results on various regression and classification applications to demonstrate the performance of the proposed methods. Finally, the conclusion is drawn in Section V.
II Broad Learning System
The basic idea of BLS comes from random vector functional-link neural networks (RVFLNN) [44, 11], but the direct connections between the input layer and the output layer of RVFLNN are replaced by a set of general mapped features, and the system can be flattened in the wide sense by the enhancement nodes. Such a deformation leads to some interesting properties and even makes BLS outperform several deep structure based learning methods [10, 9].
II-A Basic Structure and Training Algorithm
Fig. 1 shows the basic architecture of BLS [10]. Herein, $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N \times M}$ and $Y = [y_1, \ldots, y_N]^T \in \mathbb{R}^{N \times C}$ are respectively the input and output matrices, where $N$ denotes the number of samples, $(\cdot)^T$ represents the transpose operator, $M$ is the dimension of each input vector, and $C$ denotes the dimension of each output.
Based on X, $n$ groups of mapped features, denoted as $Z_1, \ldots, Z_n$, are firstly obtained by

$Z_i = \phi_i\left(X W_{e_i} + \beta_{e_i}\right), \quad i = 1, 2, \ldots, n$ (1)

where $\phi_i(\cdot)$ is usually a linear transformation; $k$ corresponds to the number of feature nodes in each group; $W_{e_i}$ and $\beta_{e_i}$ are randomly generated weights and biases, respectively. In order to obtain sparse representations of the input data, they can be slightly fine-tuned by a sparse autoencoder [10]. Concatenating all mapped features together, we have

$Z^n \triangleq [Z_1, \ldots, Z_n]$ (2)
Based on $Z^n$, $m$ groups of "enhancement nodes", denoted as $H_1, \ldots, H_m$, are further obtained, that is

$H_j = \xi_j\left(Z^n W_{h_j} + \beta_{h_j}\right), \quad j = 1, 2, \ldots, m$ (3)

where $\xi_j(\cdot)$ is an activation function, such as $\tanh(\cdot)$; $q$ corresponds to the number of enhancement nodes in each group; $W_{h_j}$ and $\beta_{h_j}$ are also randomly generated weights and biases, respectively. These "enhancement nodes" can also be cascaded into one matrix in the form of

$H^m \triangleq [H_1, \ldots, H_m]$ (4)
By concatenating $Z^n$ and $H^m$, we obtain

$U = [Z^n \,|\, H^m]$ (5)

where $U \in \mathbb{R}^{N \times (nk + mq)}$. Clearly, U is a new representation of the original input matrix X, termed the state matrix in [54]. Since all $W_{e_i}$, $\beta_{e_i}$, $W_{h_j}$, and $\beta_{h_j}$ are randomly generated and remain unchanged thereafter, the learning task reduces to estimating the output weights W. This optimization problem can be modeled as finding the regularized least-squares solution of $UW = Y$, that is

$\min_{W} \; \|UW - Y\|_2^2 + \lambda \|W\|_2^2$ (6)
Therefore, we have

$W = \left(U^T U + \lambda I\right)^{-1} U^T Y$ (7)

in which I denotes an identity matrix with proper dimensions, and $\lambda$ is a nonnegative constant for regularization. One should note that when $\lambda \to 0$, the solution in (7) is equivalent to

$W = U^{+} Y$ (8)

where $U^{+}$ denotes the pseudoinverse of U. Equation (8) has been chosen as the main strategy in [10] for finding the output weights W.
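As a quick numerical sanity check of the $\lambda \to 0$ equivalence between the regularized solution (7) and the pseudoinverse solution (8), with random stand-in matrices (not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((40, 8))   # stand-in for the state matrix
Y = rng.standard_normal((40, 2))   # stand-in for the targets

lam = 1e-10                        # regularization factor tending to zero
W_ridge = np.linalg.solve(U.T @ U + lam * np.eye(8), U.T @ Y)  # cf. (7)
W_pinv = np.linalg.pinv(U) @ Y                                 # cf. (8)

assert np.allclose(W_ridge, W_pinv, atol=1e-6)
```

For a nonzero $\lambda$ the two solutions differ, which is exactly the gap Remark 1 later points out for the incremental algorithms.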
II-B Incremental Learning Algorithms for BLS
We now give a brief introduction to the incremental learning algorithms of BLS. For simplicity, the subscripts of the feature mapping and the activation function will be omitted in the following, but one should note that they can be selected differently in practice. In addition, we denote $X_c$ and $Y_c$ as the current input matrix and the current output matrix, respectively. According to (8), the current output weights can be obtained by

$W_c = U_c^{+} Y_c$ (9)

where $U_c$ is the state matrix calculated according to (1)-(5). Obviously, to derive the incremental learning algorithms of BLS, we need to determine the new forms of the state matrix and its pseudoinverse.
II-B1 Increment of New Samples
When new samples arrive, the increased input matrix and output matrix can be expressed by $X_a$ and $Y_a$, respectively. The new state matrix and the new output matrix are therefore obtained by
(10) 
where
(11) 
(12) 
According to [10, 11], the pseudoinverse of the new state matrix in (10) can be calculated by
(13) 
with
(14) 
Correspondingly, the update equation for the output weights has the following form
(15) 
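Although the multi-sample update formulas are not reproduced above, the underlying idea can be illustrated with a per-sample Greville-style sketch. The helper below is our own hypothetical implementation, not the paper's exact block formulation:

```python
import numpy as np

def bls_add_sample(U, U_pinv, W, u_new, y_new, tol=1e-10):
    """Incrementally absorb one new sample: update the pseudoinverse of
    the state matrix U and the output weights W without retraining.
    u_new is the new sample's state (row) vector, y_new its target row."""
    d = U_pinv.T @ u_new                 # d = (U^+)^T u
    c = u_new - U.T @ d                  # residual of u against U's row space
    if np.linalg.norm(c) > tol:          # new row enlarges the row space
        b = c / (c @ c)
    else:                                # new row already in the row space
        b = U_pinv @ d / (1.0 + d @ d)
    U_pinv_new = np.hstack([U_pinv - np.outer(b, d), b[:, None]])
    W_new = W + np.outer(b, y_new - u_new @ W)
    return np.vstack([U, u_new]), U_pinv_new, W_new
```

Updating this way reproduces, up to numerical precision, what a full recomputation of $U^{+}Y$ on the enlarged data would give, at a fraction of the cost.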
II-B2 Increment of Enhancement Mapping Nodes
II-B3 Increment of Feature Mapping Nodes
When the $(n+1)$th group of feature nodes is inserted, we have
(20) 
with
(21) 
(22) 
where the newly added weights and biases are randomly generated to connect the new feature nodes to the enhancement nodes. With a procedure similar to that used from (17) to (19), the new output weights can be calculated by
(23) 
with
(24) 
Remark 1: When new samples arrive or new nodes are involved, the above three incremental learning algorithms can update the output weights of BLS without running a complete training cycle, which ensures that the system can be remodeled quickly. However, these incremental learning algorithms require the regularization factor to tend to zero, so that the regularized least-squares solution can well approximate the pseudoinverse. This is, of course, not always a good choice, since the regularization factor plays an important role in improving the model's generalization ability in many practical applications. In the next section, several more general incremental learning algorithms under the BLS architecture will be provided.
III Correntropy Based Broad Learning System
Although BLS has many attractive features, its dependence on the second-order statistical characteristics of errors makes it an unsuitable choice in complicated noise environments, especially when the data are disturbed by outliers [32]. To offer a robust version of BLS, we introduce in this section the concept of correntropy, based on which the CBLS and its incremental learning algorithms are developed.
III-A Correntropy
Correntropy [40] is a local similarity measure between two arbitrary random variables $X$ and $Y$, defined by

$V_\sigma(X, Y) = E\left[\kappa_\sigma(X - Y)\right]$ (25)

where $E[\cdot]$ denotes the expectation operator, and $\kappa_\sigma(\cdot)$ is a Mercer kernel [2] controlled by the kernel size $\sigma$. Without loss of generality, the Gaussian kernel defined as $\kappa_\sigma(e) = \exp\left(-e^2/(2\sigma^2)\right)$ will be the default choice in this paper. By applying the Taylor series expansion to (25), we have

$V_\sigma(X, Y) = \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n n!\, \sigma^{2n}}\, E\left[(X - Y)^{2n}\right]$ (26)
Clearly, correntropy can be viewed as a weighted sum of all even moments of $X - Y$, and the weights of the second- and higher-order moments are controlled by the kernel size $\sigma$. As $\sigma$ increases, the high-order moments decay faster. Hence, the second-order moment has the chance to be dominant for a large $\sigma$. In practice, the data distribution is usually unknown and only a finite number of samples $\{(x_i, y_i)\}_{i=1}^{N}$ are available, resulting in the sample estimator of correntropy

$\hat{V}_\sigma(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_i - y_i)$ (27)
In signal processing and machine learning fields, it is usual to estimate an unknown parameter $\theta$ (such as the weight vector of an adaptive filter) by maximizing the correntropy between the desired signal and its estimated value, i.e.,

$\max_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma\left(y_i - \hat{y}_i\right)$ (28)

This optimization criterion is called MCC. Unlike the well-known MMSE criterion, which is sensitive to outliers, MCC has been proven to be very robust for parameter estimation in complicated noise environments [6, 3, 58, 28, 52, 50, 20].
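The sample estimator (27) with a Gaussian kernel is straightforward to compute. The sketch below (our own helper, with the kernel normalization constant omitted as is common in the MCC literature) also hints at why MCC suppresses outliers:

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample estimator of correntropy with a Gaussian kernel:
    (1/N) * sum_i exp(-(x_i - y_i)^2 / (2 * sigma^2))."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(np.exp(-e ** 2 / (2.0 * sigma ** 2))))

# Identical signals give the maximum value 1; a large error (outlier)
# contributes almost nothing, which is the source of MCC's robustness.
print(correntropy([0, 0, 0], [0, 0, 0]))        # 1.0
print(correntropy([0, 0, 0], [0.1, 0.1, 5.0]))  # ~0.66: the outlier term is ~0
```

Maximizing this quantity therefore rewards making most errors small while effectively ignoring a few samples with huge errors, in contrast to MMSE, where a single large error dominates the sum of squares.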
III-B Basic Training Algorithm for CBLS
Similar to the standard BLS, the state matrix U in the proposed method can be constructed through a series of feature mappings and enhancement transformations as described in (1)-(5). However, more powerful feature mapping strategies, such as the convolution-pooling operation [9], the neuro-fuzzy model [18], and structured manifold learning technology [25], are also feasible. Thus, the optimization model that combines BLS and MCC can be formulated as
$\max_{W} \; \sum_{i=1}^{N} \exp\left(-\frac{\|y_i - u_i W\|^2}{2\sigma^2}\right) - \frac{\lambda}{2\sigma^2}\|W\|_F^2$ (29)

where $u_i$ denotes the $i$th row of U and $y_i$ the $i$th row of Y. For simplicity, we denote $e_i = y_i - u_i W$, and then (29) can be rewritten as

$\max_{W} \; J(W) = \sum_{i=1}^{N} \exp\left(-\frac{\|e_i\|^2}{2\sigma^2}\right) - \frac{\lambda}{2\sigma^2}\|W\|_F^2$ (30)
Taking the gradient of $J(W)$ with respect to W, we have

$\frac{\partial J(W)}{\partial W} = \frac{1}{\sigma^2}\left(\sum_{i=1}^{N} \exp\left(-\frac{\|e_i\|^2}{2\sigma^2}\right) u_i^T e_i - \lambda W\right)$ (31)

The matrix form of (31) can be expressed as

$\frac{\partial J(W)}{\partial W} = \frac{1}{\sigma^2}\left(U^T \Lambda\, (Y - UW) - \lambda W\right)$ (32)

with

$\Lambda = \operatorname{diag}\left(\exp\left(-\frac{\|e_1\|^2}{2\sigma^2}\right), \ldots, \exp\left(-\frac{\|e_N\|^2}{2\sigma^2}\right)\right)$ (33)

By setting (32) to zero, the solution of W can be written in the following form

$W = \left(U^T \Lambda\, U + \lambda I\right)^{-1} U^T \Lambda\, Y$ (34)

where $\Lambda$ is given by (33). Obviously, $\Lambda$ is a function of W. Hence, (34) is actually a fixed-point equation, which can be described by

$W = f(W)$ (35)

with

$f(W) = \left(U^T \Lambda(W)\, U + \lambda I\right)^{-1} U^T \Lambda(W)\, Y$ (36)
Referring to the widely used fixed-point iteration method [1, 8, 29, 5], we can solve W in the following iterative way

$W^{\tau+1} = f\left(W^{\tau}\right)$ (37)

where $W^{\tau}$ denotes the solution at iteration $\tau$. Let $\varepsilon$ denote the termination tolerance; the stopping criterion can then be set as $\|W^{\tau+1} - W^{\tau}\| / \|W^{\tau}\| \le \varepsilon$. According to the work done in [8], the convergence of the fixed-point iteration method under MCC can be guaranteed if the kernel size is appropriately chosen. In the following experiments, the grid search method will be adopted to determine $\sigma$ and the other parameters of CBLS, so as to ensure convergence and also bring the method as close as possible to its optimal performance.
Finally, the proposed CBLS is summarized in Algorithm 1.
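Since Algorithm 1 is not reproduced here, the following sketch shows one plausible implementation of the fixed-point update (34)-(37); the MMSE initialization and the tolerance handling are our own assumptions:

```python
import numpy as np

def cbls_weights(U, Y, sigma=1.0, lam=1e-3, tol=1e-6, max_iter=100):
    """Fixed-point iteration for the CBLS output weights:
    W <- (U^T Lam U + lam I)^(-1) U^T Lam Y, where Lam is diagonal with
    entries exp(-||e_i||^2 / (2 sigma^2)) and e_i is the i-th row error."""
    k = U.shape[1]
    W = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ Y)  # MMSE start
    for _ in range(max_iter):
        E = Y - U @ W
        w_diag = np.exp(-np.sum(E ** 2, axis=1) / (2.0 * sigma ** 2))
        UL = U * w_diag[:, None]         # row-scaled U, i.e. Lam @ U
        W_new = np.linalg.solve(UL.T @ U + lam * np.eye(k), UL.T @ Y)
        if np.linalg.norm(W_new - W) < tol * max(np.linalg.norm(W), 1.0):
            return W_new
        W = W_new
    return W
```

On data where a handful of targets are grossly corrupted, this iterate typically stays close to the clean least-squares solution, whereas the plain MMSE solution is pulled toward the outliers.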
Remark 2: Compared with (7) for the original BLS, (34) has an additional weighted diagonal matrix $\Lambda$, whose $i$th diagonal element is controlled by the kernel size $\sigma$ as well as the difference between $y_i$ and its estimate $u_i W$. It can be verified that when $\sigma \to \infty$, (34) reduces to the solution of the standard BLS. This ensures that CBLS can, at least, achieve performance comparable to the standard BLS. In addition, by appropriately setting the value of $\sigma$, CBLS has the potential to weaken the negative effects of outliers. For example, when the $i$th sample is polluted by an outlier, there will in general be a large difference between $y_i$ and $u_i W$, denoted as $e_i$. With an appropriate setting of $\sigma$, such as $\sigma \ll \|e_i\|$, we have $\exp\left(-\|e_i\|^2/(2\sigma^2)\right) \approx 0$, so the outlier does not have a big impact on the training process.
Remark 3: The update equations of the proposed CBLS are somewhat similar to those of the WBLS proposed in [14]. However, there are several dissimilarities between them: 1) CBLS is proposed from the Information Theoretic Learning (ITL) [45] perspective, while WBLS is proposed from the perspective of industrial process applications; 2) the weighting operator in CBLS follows directly from MCC, while the weighting operator in WBLS is an additional hyperparameter that needs to be specified in advance; 3) CBLS retains the connections between the input layer and the feature layer and hence can be easily combined with some existing feature mapping technologies [10, 9, 18, 25], while WBLS abandons such connections.
III-C Incremental Learning Algorithms for CBLS
To derive the incremental learning algorithms of CBLS, we again use $X_c$ and $Y_c$ to denote the current input matrix and the current output matrix, respectively. According to (34), we therefore get that

$W_c = \left(U_c^T \Lambda_c\, U_c + \lambda I\right)^{-1} U_c^T \Lambda_c\, Y_c$ (38)

where $\Lambda_c U_c$ is the weighted state matrix, $\Lambda_c Y_c$ corresponds to the weighted output matrix, and $\Lambda_c$ is calculated by

$\Lambda_c = \operatorname{diag}\left(\exp\left(-\frac{\|e_1\|^2}{2\sigma^2}\right), \ldots, \exp\left(-\frac{\|e_N\|^2}{2\sigma^2}\right)\right)$ (39)

For ease of representation, we define $Q_c = U_c^T \Lambda_c U_c + \lambda I$ and $P_c = U_c^T \Lambda_c Y_c$. Hence, (38) can be written as

$W_c = Q_c^{-1} P_c$ (40)
III-C1 Increment of New Samples
Assume that some new samples, with their corresponding input and output matrices, are available. Then, the weighted state matrix and the weighted output matrix can be obtained by
(41) 
where and with
(42) 
(43) 
(44) 
Substituting the updated matrices into (40), we have
(45) 
where
(46) 
and
(47) 
By using the matrix inversion lemma
(48) 
with the definitions of the corresponding quantities, we get
(49) 
Substituting the above results and (49) into (45) yields
(50) 
We then obtain the final equations for updating the output weights as follows
(51) 
The main computational effort of the above update lies in a single matrix computation. Since only the latest samples as well as the previously stored quantities are involved, the corresponding computational cost is in general not burdensome.
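The role of the matrix inversion lemma in this update can be sketched in isolation: given the inverse of $Q = U^T \Lambda U + \lambda I$, appending $p$ weighted samples only requires inverting a small $p \times p$ matrix. The names and shapes below are our own illustration, not the paper's exact notation:

```python
import numpy as np

def woodbury_add_rows(Q_inv, V, w_new):
    """Update (U^T Lam U + lam I)^{-1} when p new rows V (p x k) with
    positive diagonal weights w_new are appended, via the Woodbury identity:
    (Q + V^T D V)^{-1} = Q^{-1} - Q^{-1} V^T (D^{-1} + V Q^{-1} V^T)^{-1} V Q^{-1}."""
    S = np.linalg.inv(np.diag(1.0 / w_new) + V @ Q_inv @ V.T)  # small p x p inverse
    return Q_inv - Q_inv @ V.T @ S @ V @ Q_inv
```

For $k$ hidden nodes and $p \ll k$ new samples this costs roughly O(k^2 p + p^3) instead of the O(k^3) of recomputing the inverse from scratch, which is where the speedup of the incremental algorithm comes from.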
III-C2 Increment of Enhancement Nodes
Assume that some new enhancement nodes are inserted. The weighted state matrix can be expressed by
(52) 
where the newly added weights and biases are randomly generated. With a suitable approximation, the output weights in this case are obtained by
(53) 
with
(54) 
and
(55) 
The inverse of the matrix in (53) can be calculated by using the block matrix inversion lemma [41], which has the form of
(56) 
where A and D are arbitrary invertible matrix blocks. With the corresponding block substitutions, we then get
(57) 
where
(58) 
Substituting (56) and (57) into (53), we have