Broad Learning System (BLS) [4] provides an alternative to deep structure neural networks such as deep belief networks (DBN) and stacked auto-encoders (SAE). To design a BLS, several steps are necessary: 1) the input data are transformed into general mapped features by some feature mappings; 2) the generated mapped features are connected through nonlinear activation functions to form the so-called “enhancement nodes”; 3) the mapped features and the “enhancement nodes” are fed together into the output layer, and the corresponding output weights are obtained by means of the pseudo-inverse. Since all weights and biases of the hidden-layer units in BLS can be randomly generated and remain unchanged thereafter, only the weights between the hidden layer and the output layer need to be trained, which brings great convenience to the training process. In addition, if new samples arrive or the network needs to be expanded, several practical incremental learning algorithms have been developed to guarantee that the system can be remodeled quickly without retraining from scratch. Thanks to these attractive features, BLS has received increasing attention [9, 18, 25, 33, 32, 54, 35, 56, 57, 59, 22, 24, 55]
and been successfully applied in image recognition, face recognition, time series prediction, etc.
The standard BLS, however, takes the minimum mean square error (MMSE) criterion as the default choice for training the network output weights. Although the MMSE criterion is computationally efficient and provides good performance in Gaussian or noise-free environments, it degrades the performance of BLS in complicated noise environments, especially when the data are contaminated by outliers. To address this issue, several alternative optimization criteria that combine the $\ell_1$-norm with different regularization terms were proposed to train the output weights of BLS, generating a class of robust BLS (RBLS) variants. Since the $\ell_1$-norm is less sensitive to outliers, the robustness of BLS has been significantly improved. Along the same line, Chu et al. put forward the weighted BLS (WBLS). With a well-designed weighted penalty factor, WBLS has shown good robustness in nonlinear industrial processes. Another representative work to improve the robustness of BLS is the robust manifold BLS (RM-BLS). By introducing manifold embedding and random perturbation approximation, robust mapped features can be obtained in some special application scenarios, such as noisy chaotic time series prediction. Therefore, RM-BLS also has the ability to improve the robustness of BLS.
Although the aforementioned robust BLS variants can be good candidates when some training data are disturbed by outliers, they suffer from some drawbacks. For example, due to computational complexity, incremental learning algorithms have not been provided under the $\ell_1$-norm based optimization criteria, even though such algorithms are one of the most important features of the standard BLS. For WBLS, the performance depends on the weighted penalty factor, which needs to be specified in advance. In addition, the abandonment of the connections between the input layer and the feature layer may lose some interesting properties [10, 9, 18, 25, 54], and may even make WBLS fall into some common pitfalls discussed in . As for RM-BLS, the random perturbation matrix is of great importance for promoting the robustness of the algorithm, but the design of such a random perturbation matrix currently lacks guidance. Therefore, developing a more general BLS that retains the advantages of the standard BLS as much as possible while suppressing the adverse effects of outliers still requires further effort.
During the past few years, an efficient Information Theoretic Learning (ITL) criterion called the maximum correntropy criterion (MCC) has been successfully applied to adaptive filters [52, 50, 7], randomized learning machines [38, 23, 42, 12, 53]
, principal component analysis (PCA), auto-encoders [46, 13], common spatial patterns (CSP), and many others. These successful applications demonstrate that MCC performs very well in the presence of outliers. In addition, according to Property 3 provided in , correntropy has the potential to capture both the second-order and higher-order statistical characteristics of errors when the Gaussian kernel is used. With an appropriate setting of the kernel size, the second-order statistical characteristics of errors can be dominant, which makes the correntropy-based optimization criterion a suitable choice for Gaussian-noise or noise-free environments as well. Inspired by the successful applications and attractive features of correntropy, we adopt it to train the output weights of BLS. Our main contributions are summarized as follows:
By using an MCC-based fixed-point iteration algorithm to train the output weights of BLS, we propose a correntropy-based BLS (C-BLS). The new method is robust to outliers and has the potential to achieve performance comparable to the standard BLS in Gaussian-noise or noise-free environments.
Three alternative incremental learning algorithms, derived from a weighted regularized least-squares solution rather than the pseudo-inverse formula, are provided. These algorithms ensure that the system can be remodeled quickly, without retraining from the beginning, when new samples arrive or the network needs to be expanded.
To test the effectiveness of the proposed methods comprehensively, various regression and classification applications are provided for performance evaluation.
The remainder of the paper is organized as follows. In Section II, we give a brief review of BLS. In Section III, the correntropy is introduced, and based on correntropy, we propose the C-BLS and its incremental learning algorithms. Section IV presents experimental results on various regression and classification applications to demonstrate the performance of the proposed methods. Finally, conclusions are drawn in Section V.
II Broad Learning System
The basic idea of BLS comes from random vector functional-link neural networks (RVFLNN) [44, 11], but the direct connections between the input layer and the output layer of RVFLNN are replaced by a set of general mapped features, and the system can be flattened in the wide sense by the enhancement nodes. Such deformation leads to some interesting properties and even makes BLS outperform several deep structure based learning methods [10, 9].
II-A Basic Structure and Training Algorithm
Fig. 1 shows the basic architecture of BLS. Herein, $X = [x_1, x_2, \ldots, x_N]^{T} \in \mathbb{R}^{N \times M}$ and $Y = [y_1, y_2, \ldots, y_N]^{T} \in \mathbb{R}^{N \times C}$ are respectively the input and the output matrices, where $N$ denotes the number of samples, $(\cdot)^{T}$ represents the transpose operator, $M$ is the dimension of each input vector, and $C$ denotes the dimension of each output.
Based on $X$, $n$ groups of mapped features denoted as $Z_i$ are first obtained by

$$Z_i = \phi_i\big(X W_{e_i} + \beta_{e_i}\big), \quad i = 1, 2, \ldots, n \qquad (1)$$
where $\phi_i$ is usually a linear transformation; $k_i$ corresponds to the number of feature nodes in each group; and $W_{e_i}$ and $\beta_{e_i}$ are randomly generated weights and biases, respectively. In order to obtain sparse representations of the input data, they can be slightly fine-tuned by a sparse auto-encoder . Concatenating all mapped features together, we have

$$Z^{n} \triangleq [Z_1, Z_2, \ldots, Z_n] \qquad (2)$$
Based on $Z^{n}$, $m$ groups of “enhancement nodes” denoted as $H_j$ are further obtained; that is,

$$H_j = \xi_j\big(Z^{n} W_{h_j} + \beta_{h_j}\big), \quad j = 1, 2, \ldots, m \qquad (3)$$
where $\xi_j$ is an activation function, such as $\tanh(\cdot)$; $q_j$ corresponds to the number of enhancement nodes in each group; and $W_{h_j}$ and $\beta_{h_j}$ are also randomly generated weights and biases, respectively. These “enhancement nodes” can also be cascaded into one in the form of

$$H^{m} \triangleq [H_1, H_2, \ldots, H_m] \qquad (4)$$
By concatenating $Z^{n}$ and $H^{m}$, we obtain

$$U = \big[\,Z^{n} \mid H^{m}\,\big] \qquad (5)$$
where $U \in \mathbb{R}^{N \times D}$, with $D$ the total number of feature and enhancement nodes. Clearly, $U$ is a new representation of the original input matrix $X$, and is termed the state matrix in . Since all $\{W_{e_i}, \beta_{e_i}\}$ and $\{W_{h_j}, \beta_{h_j}\}$ are randomly generated and remain unchanged thereafter, the learning task reduces to estimating the output weights $W$. This optimization problem can be modeled as finding the regularized least-squares solution of $U W = Y$, that is,

$$\min_{W}\; \|U W - Y\|_{2}^{2} + \lambda \|W\|_{2}^{2} \qquad (6)$$
Therefore, we have

$$W = \big(\lambda I + U^{T} U\big)^{-1} U^{T} Y \qquad (7)$$
in which $I$ denotes an identity matrix with proper dimensions, and $\lambda \geq 0$ is a nonnegative constant for regularization. One should note that when $\lambda \to 0$, the solution in (7) is equivalent to

$$W = U^{+} Y \qquad (8)$$

where $U^{+}$ denotes the pseudo-inverse of $U$.
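As a sketch, the training pipeline in (1)-(5) and (7) can be implemented in a few lines of NumPy. The network sizes, the use of a single feature group and a single enhancement group, and the toy regression data below are illustrative assumptions, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_state_matrix(X, We, be, Wh, bh):
    """Map inputs to feature nodes Z, then enhancement nodes H, and
    concatenate them into the state matrix U = [Z | H]."""
    Z = X @ We + be               # linear feature mapping (one group)
    H = np.tanh(Z @ Wh + bh)      # nonlinear enhancement nodes
    return np.hstack([Z, H])

def train_bls(X, Y, n_feat=20, n_enh=40, lam=1e-2):
    """Randomly generate the hidden weights, then solve the ridge
    problem W = (lam*I + U^T U)^{-1} U^T Y for the output weights."""
    M = X.shape[1]
    We = rng.standard_normal((M, n_feat)); be = rng.standard_normal(n_feat)
    Wh = rng.standard_normal((n_feat, n_enh)); bh = rng.standard_normal(n_enh)
    U = build_state_matrix(X, We, be, Wh, bh)
    W = np.linalg.solve(lam * np.eye(U.shape[1]) + U.T @ U, U.T @ Y)
    return (We, be, Wh, bh), W

# toy regression target: y = sin(x1) + x2
X = rng.uniform(-1, 1, size=(200, 2))
Y = (np.sin(X[:, 0]) + X[:, 1]).reshape(-1, 1)
params, W = train_bls(X, Y)
U = build_state_matrix(X, *params)
print("train RMSE:", np.sqrt(np.mean((U @ W - Y) ** 2)))
```

Note that only the output weights `W` are learned; all hidden weights stay at their random initial values, exactly as in step 3) of the BLS design.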
II-B Incremental Learning Algorithms for BLS
We now give a brief introduction to the incremental learning algorithms of BLS. For simplicity, the subscripts of the feature mapping $\phi_i$ and the activation function $\xi_j$ will be omitted in the following, but one should note that they can be selected differently in practice. In addition, we denote $X_0$ and $Y_0$ as the current input matrix and the current output matrix, respectively, with $U_0$ the corresponding state matrix. According to (8), the current output weights can be obtained by

$$W_0 = U_0^{+} Y_0$$
II-B1 Increment of New Samples
When $N_a$ new samples arrive, the increased input matrix and output matrix can be expressed by $X_1 = [X_0^{T}, X_a^{T}]^{T}$ and $Y_1 = [Y_0^{T}, Y_a^{T}]^{T}$, respectively. The new state matrix and the output matrix are therefore obtained by

$$U_1 = \begin{bmatrix} U_0 \\ U_a \end{bmatrix}, \qquad Y_1 = \begin{bmatrix} Y_0 \\ Y_a \end{bmatrix}$$
where $X_a$ and $Y_a$ denote the newly arrived input and output matrices, and $U_a$ is the state matrix generated from $X_a$ through the same feature mappings and enhancement transformations.
Correspondingly, the update equation for the output weights has the following form
II-B2 Increment of Enhancement Mapping Nodes
II-B3 Increment of Feature Mapping Nodes
When the $(n+1)$th group of feature nodes is inserted, we have
where the corresponding weights and biases, which connect the new feature nodes to the enhancement nodes, are also randomly generated. With a similar procedure to that used from (17) to (19), the new output weights here can be calculated by
Remark 1: When new samples arrive or new nodes are involved, the above three incremental learning algorithms can update the output weights of BLS without running a complete training cycle, which ensures that the system can be remodeled quickly. However, these incremental learning algorithms require the regularization factor $\lambda$ to tend to zero, so that the regularized least-squares solution can well approximate the pseudo-inverse. This is, of course, not always a good choice, since the regularization factor plays an important role in improving the model’s generalization ability in many practical applications. In the next section, several more general incremental learning algorithms under the BLS architecture will be provided.
III Correntropy-Based Broad Learning System
Although BLS has many attractive features, its dependence on the second-order statistical characteristics of errors makes it an unsuitable choice in complicated noise environments, especially when the data are disturbed by outliers. To offer a robust version of BLS, we introduce in this section the concept of correntropy, based on which the C-BLS and its incremental learning algorithms are developed.
III-A Correntropy

Correntropy is a local similarity measure between two arbitrary random variables $X$ and $Y$, defined by

$$V(X, Y) = \operatorname{E}\big[\kappa_\sigma(X - Y)\big] \qquad (25)$$
where $\operatorname{E}[\cdot]$ denotes the expectation operator, and $\kappa_\sigma(\cdot)$ is a Mercer kernel  controlled by the kernel size $\sigma$. Without loss of generality, the Gaussian kernel defined as $\kappa_\sigma(x) = \exp\big(-x^{2}/(2\sigma^{2})\big)$ will be the default choice in this paper. Applying the Taylor series expansion to (25), we have

$$V(X, Y) = \sum_{k=0}^{\infty}\frac{(-1)^{k}}{2^{k}\sigma^{2k}k!}\,\operatorname{E}\big[(X - Y)^{2k}\big]$$
Clearly, correntropy can be viewed as a weighted sum of all even moments of the error variable $X - Y$, and the weights of the second- and higher-order moments are controlled by the kernel size $\sigma$. As $\sigma$ increases, the higher-order moments decay faster. Hence, the second-order moment has the chance to be dominant for a large $\sigma$. In practice, the data distribution is usually unknown and only a finite number of samples are available, so the sample estimator of correntropy becomes

$$\hat{V}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma\big(x_i - y_i\big)$$
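The robustness that motivates correntropy is easy to see numerically. The following sketch implements the sample estimator with a Gaussian kernel (the function name and the toy data are ours, for illustration only): a single gross outlier barely moves the correntropy estimate, while it completely dominates the mean squared error.

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample estimator: mean Gaussian kernel of the errors e_i = x_i - y_i."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-e**2 / (2 * sigma**2)))

clean = np.zeros(100)
noisy = clean.copy()
noisy[0] = 1000.0                                # a single gross outlier

print(correntropy(clean, np.zeros(100)))         # 1.0: identical signals
print(correntropy(noisy, np.zeros(100)))         # 0.99: outlier contributes ~0
print(np.mean((noisy - np.zeros(100)) ** 2))     # MSE blown up to 10000.0
```

Because the Gaussian kernel saturates to zero for large errors, each sample can contribute at most $1/N$ to the estimate, which is exactly the "local" property exploited by MCC.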
In signal processing and machine learning fields, it is common to estimate an unknown parameter $W$ (such as the weight vector of an adaptive filter) by maximizing the correntropy between the desired signal $y_i$ and its estimate $\hat{y}_i$, i.e.,

$$\max_{W}\; \frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma\big(y_i - \hat{y}_i\big)$$
This optimization criterion is called the MCC. Unlike the well-known MMSE criterion, which is sensitive to outliers, MCC has been proven to be very robust for parameter estimation in complicated noise environments [6, 3, 58, 28, 52, 50, 20].
III-B Basic Training Algorithm for C-BLS
Similar to the standard BLS, the state matrix $U$ in the proposed method can be constructed through a series of feature mappings and enhancement transformations as described in (1)-(5). However, more powerful feature mapping strategies, such as the convolution-pooling operation , the neuro-fuzzy model , and structured manifold learning technology , are also feasible. Thus, the optimization model that combines BLS and MCC can be formulated as

$$\max_{W}\; J(W) = \sum_{i=1}^{N}\kappa_\sigma\big(\|y_i - u_i W\|\big) - \lambda\|W\|_{2}^{2} \qquad (29)$$
where $u_i$ denotes the $i$th row of $U$. For simplicity, we denote $e_i = y_i - u_i W$, and then (29) can be rewritten as

$$\max_{W}\; J(W) = \sum_{i=1}^{N}\kappa_\sigma\big(\|e_i\|\big) - \lambda\|W\|_{2}^{2}$$
Taking the gradient of $J(W)$ with respect to $W$, we have

$$\frac{\partial J(W)}{\partial W} = \frac{1}{\sigma^{2}}\sum_{i=1}^{N}\kappa_\sigma\big(\|e_i\|\big)\,u_i^{T}e_i - 2\lambda W \qquad (31)$$
The matrix form of (31) can be expressed as

$$\frac{\partial J(W)}{\partial W} = \frac{1}{\sigma^{2}}\,U^{T}\Lambda\,(Y - U W) - 2\lambda W$$
By setting the gradient to zero and absorbing the constant $2\sigma^{2}$ into $\lambda$, the solution of $W$ can be written in the following form

$$W = \big(\lambda I + U^{T}\Lambda U\big)^{-1}U^{T}\Lambda Y \qquad (34)$$
where $\Lambda = \operatorname{diag}\big(\kappa_\sigma(\|e_1\|), \ldots, \kappa_\sigma(\|e_N\|)\big)$. Obviously, $\Lambda$ is a function of $W$. Hence, (34) is actually a fixed-point equation, which can be described by

$$W^{t+1} = \big(\lambda I + U^{T}\Lambda(W^{t})\,U\big)^{-1}U^{T}\Lambda(W^{t})\,Y$$
where $W^{t}$ denotes the solution at iteration $t$. Let $\varepsilon$ denote the termination tolerance; the stopping criterion can then be set as $\|W^{t+1} - W^{t}\| / \|W^{t}\| \leq \varepsilon$. According to the work done in , the convergence of the fixed-point iteration method under the MCC can be guaranteed if the kernel size is appropriately chosen. In the following experiments, the grid search method will be adopted to determine $\sigma$ and the other parameters of C-BLS, so as to ensure its convergence and make it approach its optimal performance as closely as possible.
Finally, the proposed C-BLS is summarized in Algorithm 1.
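A minimal sketch of this fixed-point training loop is given below (Algorithm 1 itself is in the paper; the function name, toy data, and hyperparameters here are illustrative assumptions). The key point is that the diagonal weights $\kappa_\sigma(\|e_i\|)$ are recomputed from the current residuals at every iteration:

```python
import numpy as np

def train_cbls(U, Y, sigma=1.0, lam=1e-2, tol=1e-6, max_iter=50):
    """Fixed-point iteration for the MCC-based output weights:
    W = (lam*I + U^T Lam U)^{-1} U^T Lam Y, with the diagonal Lam
    recomputed from the current residuals each iteration."""
    N, D = U.shape
    W = np.linalg.solve(lam * np.eye(D) + U.T @ U, U.T @ Y)   # MMSE init
    for _ in range(max_iter):
        e = np.linalg.norm(Y - U @ W, axis=1)                 # per-sample errors
        lam_diag = np.exp(-e**2 / (2 * sigma**2))             # Gaussian weights
        UL = U * lam_diag[:, None]                            # Lam @ U
        W_new = np.linalg.solve(lam * np.eye(D) + U.T @ UL, UL.T @ Y)
        if np.linalg.norm(W_new - W) / (np.linalg.norm(W) + 1e-12) < tol:
            return W_new
        W = W_new
    return W

# noise-free linear data plus a few gross outliers
rng = np.random.default_rng(0)
U = rng.standard_normal((300, 5))
w_true = np.ones((5, 1))
Y = U @ w_true
Y[:10] += 50.0                        # 10 contaminated targets
W = train_cbls(U, Y)
print(np.linalg.norm(W - w_true))     # small: outliers are down-weighted
```

After the first reweighting, the contaminated samples receive weights of essentially zero, so the remaining iterations fit only the clean data.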
Remark 2: Compared with (7) for the original BLS, (34) has an additional weighted diagonal matrix $\Lambda$, whose $i$th diagonal element is controlled by the kernel size $\sigma$ as well as the difference between $y_i$ and its estimate $u_i W$. It can be verified that when $\sigma \to \infty$, (34) reduces to the solution of the standard BLS. This ensures that C-BLS can, at least, achieve performance comparable to the standard BLS. In addition, by appropriately setting the value of $\sigma$, C-BLS has the potential to weaken the negative effects of outliers. For example, when the $i$th sample is polluted by an outlier, there will in general be a large difference between $y_i$ and $u_i W$, denoted as $e_i$. With an appropriate setting of $\sigma$, such as $\sigma \ll \|e_i\|$, we have $\kappa_\sigma(\|e_i\|) \approx 0$, so that the outlier does not have a big impact on the training process.
Remark 3: The update equations of the proposed C-BLS are somewhat similar to those of the WBLS proposed in . However, there are several dissimilarities between them: 1) C-BLS is derived from the Information Theoretic Learning (ITL) perspective, while WBLS is motivated by industrial-process applications; 2) the weighting operator in C-BLS arises naturally from MCC, while the weighting operator in WBLS is an additional hyperparameter that needs to be specified in advance; 3) C-BLS retains the connections between the input layer and the feature layer and hence can be easily combined with some existing feature mapping technologies [10, 9, 18, 25], while WBLS abandons such connections.
III-C Incremental Learning Algorithms for C-BLS
To derive the incremental learning algorithms of C-BLS, we again use $X_0$ and $Y_0$ to denote the current input matrix and the current output matrix, respectively, and let $U_0$ be the corresponding state matrix. According to (34), we therefore get

$$W_0 = \big(\lambda I + \tilde{U}_0^{T}\tilde{U}_0\big)^{-1}\tilde{U}_0^{T}\tilde{Y}_0$$
where $\tilde{U}_0 = \Lambda_0^{1/2} U_0$ is the weighted state matrix, $\tilde{Y}_0 = \Lambda_0^{1/2} Y_0$ corresponds to the weighted output matrix, and $\Lambda_0$ is calculated by

$$\Lambda_0 = \operatorname{diag}\big(\kappa_\sigma(\|e_1\|), \ldots, \kappa_\sigma(\|e_N\|)\big)$$
For ease of representation, we define $Q_0 = \big(\lambda I + \tilde{U}_0^{T}\tilde{U}_0\big)^{-1}$ and $P_0 = \tilde{U}_0^{T}\tilde{Y}_0$. Hence, the above solution can be written as

$$W_0 = Q_0 P_0$$
III-C1 Increment of New Samples
Assume that $N_a$ new samples, with input matrix $X_a$ and output matrix $Y_a$, are available. We first denote $U_a$ as the state matrix generated from $X_a$ and $\Lambda_a$ as the corresponding weighting matrix. Then, the weighted state matrix and output matrix can be obtained by

$$\tilde{U}_1 = \begin{bmatrix}\tilde{U}_0 \\ \tilde{U}_a\end{bmatrix}, \qquad \tilde{Y}_1 = \begin{bmatrix}\tilde{Y}_0 \\ \tilde{Y}_a\end{bmatrix}$$
where $\tilde{U}_a = \Lambda_a^{1/2}U_a$ and $\tilde{Y}_a = \Lambda_a^{1/2}Y_a$, with $\Lambda_a$ the diagonal matrix of Gaussian-kernel weights evaluated on the residuals of the new samples.
Substituting $\tilde{U}_1$ and $\tilde{Y}_1$ into the weighted least-squares solution, we have

$$W_1 = \big(\lambda I + \tilde{U}_0^{T}\tilde{U}_0 + \tilde{U}_a^{T}\tilde{U}_a\big)^{-1}\big(\tilde{U}_0^{T}\tilde{Y}_0 + \tilde{U}_a^{T}\tilde{Y}_a\big)$$
By using the matrix inversion lemma

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B\big(C^{-1} + DA^{-1}B\big)^{-1}DA^{-1}$$
with the definitions $A = Q_0^{-1}$, $B = \tilde{U}_a^{T}$, $C = I$, and $D = \tilde{U}_a$, we get

$$Q_1 = Q_0 - Q_0\tilde{U}_a^{T}\big(I + \tilde{U}_a Q_0\tilde{U}_a^{T}\big)^{-1}\tilde{U}_a Q_0$$
Let $K = Q_0\tilde{U}_a^{T}\big(I + \tilde{U}_a Q_0\tilde{U}_a^{T}\big)^{-1}$. We then obtain the final equations for updating the output weights as follows:

$$Q_1 = (I - K\tilde{U}_a)Q_0, \qquad P_1 = P_0 + \tilde{U}_a^{T}\tilde{Y}_a, \qquad W_1 = Q_1 P_1$$
The main computational effort in the above update focuses on the calculation of $\big(I + \tilde{U}_a Q_0\tilde{U}_a^{T}\big)^{-1}$. Since only the latest $N_a$ samples as well as the previous $Q_0$ are involved in computing this term, the corresponding computational cost is in general not burdensome.
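The rank-update idea behind this incremental step can be checked numerically. The sketch below treats the weighted state and output matrices as given random data (all names and sizes are illustrative) and verifies that the Woodbury-style update of $Q$ and $P$ reproduces batch retraining on all samples:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.1
D, N0, Na = 6, 50, 10

# "current" weighted state/output matrices (weights assumed folded in)
U0 = rng.standard_normal((N0, D)); Y0 = rng.standard_normal((N0, 1))
Ua = rng.standard_normal((Na, D)); Ya = rng.standard_normal((Na, 1))

Q0 = np.linalg.inv(lam * np.eye(D) + U0.T @ U0)
P0 = U0.T @ Y0

# incremental update via the matrix inversion (Woodbury) lemma
K = Q0 @ Ua.T @ np.linalg.inv(np.eye(Na) + Ua @ Q0 @ Ua.T)
Q1 = (np.eye(D) - K @ Ua) @ Q0
P1 = P0 + Ua.T @ Ya
W1 = Q1 @ P1

# batch retraining on all samples gives the same weights
U_all = np.vstack([U0, Ua]); Y_all = np.vstack([Y0, Ya])
W_batch = np.linalg.solve(lam * np.eye(D) + U_all.T @ U_all, U_all.T @ Y_all)
print(np.allclose(W1, W_batch))  # True
```

The inverse computed in the update is only $N_a \times N_a$, which is why the incremental step is cheap when the batch of new samples is small compared to the node count.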
III-C2 Increment of Enhancement Nodes
Assume that $q_a$ enhancement nodes are inserted. The weighted state matrix can then be expressed as

$$\tilde{U}_1 = \big[\,\tilde{U}_0 \mid \tilde{H}_a\,\big]$$
where $\tilde{H}_a = \Lambda_0^{1/2}H_a$ with $H_a = \xi\big(Z^{n}W_{h_a} + \beta_{h_a}\big)$; $W_{h_a}$ and $\beta_{h_a}$ are randomly generated weights and biases, respectively. With the approximation $\Lambda_1 \approx \Lambda_0$, the output weights in this case are obtained by

$$W_1 = \big(\lambda I + \tilde{U}_1^{T}\tilde{U}_1\big)^{-1}\tilde{U}_1^{T}\tilde{Y}_0$$
To compute $\big(\lambda I + \tilde{U}_1^{T}\tilde{U}_1\big)^{-1}$ efficiently, we use the partitioned matrix inversion formula

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1}B S^{-1} C A^{-1} & -A^{-1}B S^{-1} \\ -S^{-1} C A^{-1} & S^{-1} \end{bmatrix}, \qquad S = D - C A^{-1} B$$

where $A$ and $D$ are arbitrary reversible matrix blocks. Let $A = \lambda I + \tilde{U}_0^{T}\tilde{U}_0$, $B = \tilde{U}_0^{T}\tilde{H}_a$, $C = \tilde{H}_a^{T}\tilde{U}_0$, and $D = \lambda I + \tilde{H}_a^{T}\tilde{H}_a$; then we get
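The partitioned (block) matrix inversion identity used here, built on the Schur complement $S = D - CA^{-1}B$, can likewise be verified numerically. The random blocks below are illustrative; the added diagonal shifts merely keep $A$ and $S$ well conditioned:

```python
import numpy as np

rng = np.random.default_rng(2)

# random partitioned matrix M = [[A, B], [C, D]]
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # keep blocks well conditioned
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 4))
D = rng.standard_normal((2, 2)) + 4 * np.eye(2)

Ai = np.linalg.inv(A)
S = D - C @ Ai @ B                                 # Schur complement of A
Si = np.linalg.inv(S)

# block-wise inverse assembled from A^{-1} and S^{-1} only
M_inv = np.block([
    [Ai + Ai @ B @ Si @ C @ Ai, -Ai @ B @ Si],
    [-Si @ C @ Ai,               Si         ],
])

M = np.block([[A, B], [C, D]])
print(np.allclose(M_inv, np.linalg.inv(M)))  # True
```

In the enhancement-node increment, $A^{-1}$ is already available from the previous training round, so only the small Schur complement (of the size of the newly added nodes) needs to be inverted.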