I Introduction
Along with the rising wave of deep learning, neural networks, by means of their universal approximation capability and well-developed learning techniques, have achieved great success in data analytics [1]. Usually, the input layer of a fully connected neural network (FCNN) is fed with vector inputs, rather than two-dimensional matrices such as images or higher-dimensional tensors like videos or light fields [2, 3, 4]. Technically, the vectorization operation makes the dot product (between the inputs and hidden weights) computationally feasible but inevitably induces two drawbacks: (i) the curse of dimensionality when the number of training samples is limited; (ii) the loss of spatial information of the original multidimensional input. Although convolutional neural networks (CNNs) have brought about some breakthroughs in image data modelling, by means of their good potential in abstract feature extraction, local connectivity, parameter sharing, etc. [5], the development of FCNNs with matrix inputs (or multidimensional inputs in general) is of great importance from both theoretical and algorithmic viewpoints. In [6], Gao et al. first coined the term ‘Matrix Neural Networks (MatNet)’ and extended the conventional backpropagation (BP) algorithm [7] to a general version that is capable of dealing with 2D inputs. Empirical results in [6] and their parallel work [8] demonstrate some advantages of MatNet for image data modelling. However, MatNet may still suffer from some intrinsic drawbacks of gradient descent-based approaches, such as local minima and a low convergence rate. This indicates an urgent demand for fast learning techniques to build FCNNs with 2D inputs, which is the immediate motivation of this work.
Randomized learning techniques have demonstrated great potential for fast building neural network models and algorithms with less computational cost [9]. In particular, Random Vector Functional-Link (RVFL) networks developed in the early 1990s [10, 11] and Stochastic Configuration Networks (SCNs) proposed recently [12] are two representatives of the randomized learner models. Technically, RVFL networks randomly assign the input weights and biases from a fixed distribution (range) that is totally data-independent, and optimize merely the output weights by solving a linear least squares problem. This simple idea is computationally efficient; however, the obtained learner models may not have the universal approximation capability when an inappropriate distribution (range) is used for the random assignment. This drawback makes RVFL networks less practical in data modelling problems because more human intervention and/or empirical knowledge is required for problem-solving. Fortunately, SCNs, as state-of-the-art randomized learner models, were developed with rigorous theoretical foundations and advanced algorithm implementations [12]. The success of SCNs and their extensions [13, 14] in fast building universal approximators with randomness has been extensively demonstrated in data analytics. Generally, the very heart of the SCN framework lies in the supervisory (data-dependent) mechanism used to stochastically (and incrementally) configure the input weights and biases from an appropriate ‘support range’. In the presence of multidimensional, especially 2D, inputs (e.g., images), both RVFL networks and SCNs require a vectorization operation before feeding the given input signal into the neural network model. The authors of [15] made a first attempt at two-dimensional randomized learner models by developing RVFL networks with matrix inputs (termed 2DRVFL), with applications in image data modelling. Although some advantages of the 2D model are experimentally demonstrated, the methodology still suffers from the drawbacks of RVFL networks highlighted above.
This paper develops two-dimensional SCNs on the basis of our previous SCN framework [12], aiming to fast build 2D randomized learner models that are capable of resolving data analytics tasks with matrix inputs. We first provide a detailed algorithmic implementation for 2DSCN, followed by a convergence analysis with special reference to the universal approximation theorem of SCNs. Then, some technical differences between 2DSCN and SCN are presented, with highlights on various aspects such as the support range for random parameters, the complexity of the parameter space, and data structure preservation. Interestingly, our work is the first to consider a potential relationship between 2DSCN and CNN in problem-solving, that is, the computations involved in 2DSCN can in some sense be viewed as equivalent to the ‘convolution’ and ‘pooling’ tricks performed in the CNN structure. Later, we investigate in depth why randomized learner models produced by the 2DSCN algorithm are more prone to have a better generalization ability. In particular, some solid results from statistical learning theory are revisited with our special interpretation, for the purpose of a qualitative analysis of learner models’ generalization power and useful insights into certain very influential factors. Besides, we provide an intuitive sense that 2DSCN may exhibit a similar philosophy to that of DropConnect [16] for effectively alleviating overfitting. Importantly, to make a reasonable and practical judgement on the generalization ability of an obtained randomized learner model, we develop a nearly sharp estimate of the model’s test error upper bound, whereby one can effectively predict the generalization performance. An extensive experimental study on both regression and classification problems (with a matrix inputs setup) has demonstrated remarkable advantages of 2DSCN in image data modelling, compared to some existing randomized learning techniques. Also, our theoretical analysis has been successfully verified by the statistical simulation results. Overall, our main contributions are threefold:

From the algorithmic perspective, we extend our original SCN framework to a 2D version, and the proposed 2DSCN algorithm can effectively deal with data modelling tasks with matrix inputs, compared with some existing randomized learning techniques;

Theoretically, the universal approximation property of 2DSCN-based learner models is verified and some technical differences between 1D and 2D randomized learner models are investigated from various perspectives. Importantly, we provide an upper bound for the test error of a given randomized learner model, and demonstrate in theory how the hidden layer output matrix (computationally associated with the training inputs, the hidden input weights and biases, the number of hidden nodes, etc.) and the output weights can affect the randomized learner model’s generalization power;

For practical applications, the merits of the developed 2DSCN algorithm in image data analytics have been illustrated on various benchmark tasks, such as rotation angle prediction for handwritten digits, handwritten digit classification, and human face recognition. The extensive experimental study conducted in this paper can lend some empirical support to end-users who would like to employ FCNNs rather than CNNs in image data modelling.
The remainder of this paper is organized as follows. Section II reviews related work, including the 2D random vector functional-link (2DRVFL) networks and our original SCN framework. Section III details the proposed 2D stochastic configuration networks (2DSCNs) with an algorithmic description, technical highlights, and some theoretical explanation, aiming to distinguish 2DSCN from other randomized learning techniques. Section IV presents an experimental study on both image-based regression and classification problems, and Section V concludes this paper with further remarks and expectations.
II Related Work
This section reviews two types of randomized learner models and highlights their technical discrepancies. First, 2DRVFL networks, as an extension of RVFL networks, can deal with matrix inputs but cannot guarantee universal approximation in data modelling, which causes some infeasibility as well as uncertainty in problem-solving. Second, SCNs, as advanced universal approximators, have demonstrated their effectiveness and efficiency in data analytics with the usual vector inputs. A brief examination of these two methodologies motivates us to think about the formulation of 2DSCN and its potential advantages in image data modelling problems, as detailed in the following section.
II-A 2DRVFL Networks
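2DRVFL networks with matrix inputs have been empirically studied in [15]. Technically, they can be viewed as a trivial extension of the original RVFL networks, employing two sets of input weights that act as matrix transformations on the left and right sides of the inputs. Here we start directly with the problem formulation for 2DRVFL, rather than revisiting the basics of RVFL networks. Given $N$ training instances sampled from an unknown function, with inputs $x_i \in \mathbb{R}^{d_1 \times d_2}$ and outputs $t_i \in \mathbb{R}^{m}$ ($i = 1, \ldots, N$), training a 2DRVFL learner model with $L$ hidden nodes is equivalent to solving a linear least squares (LS) problem (with respect to $\beta = [\beta_1, \ldots, \beta_L]^T$, the output weights), i.e.,
$$\min_{\beta_1, \ldots, \beta_L} \sum_{i=1}^{N} \Big\| \sum_{j=1}^{L} \beta_j\, g(u_j^T x_i v_j + b_j) - t_i \Big\|^2,$$
where $u_j$, $v_j$, $b_j$ are randomly assigned from $[-\lambda, \lambda]^{d_1}$, $[-\lambda, \lambda]^{d_2}$, $[-\lambda, \lambda]$, respectively, and remain fixed, and $g(\cdot)$ is the activation function.
The above LS problem can be represented in matrix form, i.e.,
$$\min_{\beta} \|H\beta - T\|_F^2, \qquad (1)$$
where $H = [g(u_j^T x_i v_j + b_j)]_{i=1,\ldots,N;\ j=1,\ldots,L} \in \mathbb{R}^{N \times L}$ is the hidden layer output matrix, $\beta = [\beta_1, \ldots, \beta_L]^T$, $T = [t_1, \ldots, t_N]^T$. A closed-form solution can be obtained by using the pseudo-inverse method, i.e., $\beta = H^{\dagger} T$.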
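The 2DRVFL training procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation of [15]: a sigmoid activation and a uniform distribution on $[-\lambda, \lambda]$ are assumed, and the names `hidden_matrix_2d`, `fit_2drvfl`, `lam` and `n_hidden` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_matrix_2d(X, U, V, b):
    """Hidden layer output matrix H of shape (N, L) for matrix inputs.

    X: (N, d1, d2) inputs, U: (L, d1) left weights,
    V: (L, d2) right weights, b: (L,) biases.
    Entry (i, j) equals g(u_j^T x_i v_j + b_j).
    """
    Z = np.einsum("jp,ipq,jq->ij", U, X, V) + b
    return sigmoid(Z)

def fit_2drvfl(X, T, n_hidden, lam=1.0, seed=0):
    """Randomly assign (u_j, v_j, b_j), then solve for the output weights."""
    rng = np.random.default_rng(seed)
    _, d1, d2 = X.shape
    # Data-independent random assignment (assumed uniform on [-lam, lam]).
    U = rng.uniform(-lam, lam, size=(n_hidden, d1))
    V = rng.uniform(-lam, lam, size=(n_hidden, d2))
    b = rng.uniform(-lam, lam, size=n_hidden)
    H = hidden_matrix_2d(X, U, V, b)
    # Closed-form least squares solution via the pseudo-inverse.
    beta = np.linalg.pinv(H) @ T
    return U, V, b, beta
```

Only `beta` is learned from the data; the hidden parameters stay fixed after sampling, which is why the choice of `lam` matters so much in practice.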
Remark 1. Although RVFL networks (with either vector or matrix inputs) allow fast building of a model by randomly assigning input weights and biases, some key technical issues are still unresolved. Theoretically, the approximation error for this kind of randomized learner model is bounded only in the statistical sense, which means that preferable approximation performance is not guaranteed for every random assignment of the hidden parameters [11]. Besides, it has been proved that in the absence of such additional conditions, one may observe exponential growth of the number of terms needed to approximate a nonlinear map, and/or the resulting learner model will be extremely sensitive to the parameters [17]. From the algorithmic perspective, these theoretical predictions do not address the learning algorithm or implementation issues for the randomized learner. Practical usage of this kind of randomized model encounters one key technical difficulty, that is, how to find an appropriate range for randomly assigning hidden parameters with considerable confidence to ensure the universal approximation property. So far, the most reliable (and trivial) way to implement RVFL networks is to employ trial-and-error/rule-of-thumb for parameter setting, that is to say, one needs to try various settings of the range parameter $\lambda$ before getting an acceptable learner model. This trick sounds practical but still has potential drawbacks due to the uncertainty caused by the randomness, as theoretically and empirically studied in [18]. We also note that one can try different random selection ranges for the left and right input weights in 2DRVFL, such as $[-\lambda_1, \lambda_1]^{d_1}$ and $[-\lambda_2, \lambda_2]^{d_2}$, but more grid-searching may then be needed in the algorithm implementation to find the ‘best’ collection $\{\lambda_1, \lambda_2\}$.
II-B SCN Framework
Our recent work [12] is the first to touch the foundation of building universal approximators with random basis functions. More precisely, a new type of randomized learner model, termed stochastic configuration networks (SCNs), is developed by implanting a ‘data-dependent’ supervisory mechanism into the random assignment of input weights and biases. Readers who are interested in a complete roadmap of this work can refer to [12]. Here we just briefly revisit the essence and highlight some technical points.
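Let $\Gamma := \{g_1, g_2, g_3, \ldots\}$ represent a set of real-valued functions, and let span($\Gamma$) stand for the associated function space spanned by $\Gamma$. $L_2(D)$ denotes the space of all Lebesgue measurable functions $f = [f_1, f_2, \ldots, f_m]: \mathbb{R}^d \to \mathbb{R}^m$ defined on $D \subset \mathbb{R}^d$, with the $L_2$ norm defined as
$$\|f\| := \Big( \sum_{q=1}^{m} \int_D |f_q(x)|^2 \, dx \Big)^{1/2} < \infty. \qquad (2)$$
Given another vector-valued function $\phi = [\phi_1, \phi_2, \ldots, \phi_m]: \mathbb{R}^d \to \mathbb{R}^m$, the inner product of $\phi$ and $f$ is defined as
$$\langle f, \phi \rangle := \sum_{q=1}^{m} \langle f_q, \phi_q \rangle = \sum_{q=1}^{m} \int_D f_q(x) \phi_q(x) \, dx. \qquad (3)$$
Note that this definition becomes the trivial case when $m = 1$, corresponding to a real-valued function defined on a compact set.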
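Before revisiting the universal approximation theory behind SCNs, we recall the problem formulation as follows. For a target function $f: \mathbb{R}^d \to \mathbb{R}^m$, suppose that we have already built a neural network learner model with only one hidden layer and $L-1$ hidden nodes, i.e., $f_{L-1}(x) = \sum_{j=1}^{L-1} \beta_j g_j(w_j^T x + b_j)$ ($L = 1, 2, \ldots$, $f_0 = 0$), with $\beta_j = [\beta_{j,1}, \ldots, \beta_{j,m}]^T$, and that the residual error $e_{L-1} = f - f_{L-1} = [e_{L-1,1}, \ldots, e_{L-1,m}]$ is far away from an acceptable accuracy level. Our SCN framework then offers a fast solution to incrementally add $\beta_L$, $g_L$ ($w_L$ and $b_L$), leading to $f_L = f_{L-1} + \beta_L g_L$, until the residual error falls within an expected tolerance $\epsilon$. The following Theorem 1 restates the universal approximation property of SCNs, corresponding to Theorem 7 in [12].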
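Theorem 1. Suppose that span($\Gamma$) is dense in $L_2$ space and $\forall g \in \Gamma$, $0 < \|g\| < b_g$ for some $b_g \in \mathbb{R}^{+}$. Given $0 < r < 1$ and a nonnegative real number sequence $\{\mu_L\}$ with $\lim_{L \to +\infty} \mu_L = 0$, $\mu_L \le (1 - r)$, for $L = 1, 2, \ldots$, denoted by
$$\delta_L = \sum_{q=1}^{m} \delta_{L,q}, \quad \delta_{L,q} = (1 - r - \mu_L)\|e_{L-1,q}\|^2, \quad q = 1, 2, \ldots, m, \qquad (4)$$
if the random basis function $g_L$ is generated to satisfy the following inequalities:
$$\langle e_{L-1,q}, g_L \rangle^2 \ge b_g^2 \, \delta_{L,q}, \quad q = 1, 2, \ldots, m, \qquad (5)$$
and the output weights are evaluated by
$$[\beta_1, \beta_2, \ldots, \beta_L] = \arg\min_{\beta} \Big\| f - \sum_{j=1}^{L} \beta_j g_j \Big\|, \qquad (6)$$
it holds that $\lim_{L \to +\infty} \|f - f_L\| = 0$, where $f_L = \sum_{j=1}^{L} \beta_j g_j$, $\beta_j = [\beta_{j,1}, \ldots, \beta_{j,m}]^T$.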
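Basically, the algorithmic procedure for building SCNs can be summarized as repeating the following two steps for each newly added hidden node until the given training error tolerance is reached: (i) stochastically configure a new random basis function that satisfies the inequality constraint (5); (ii) evaluate the output weights of the current model by the least squares solution (6).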
Remark 2. We would like to highlight that SCNs outperform some existing randomized learning techniques (e.g., RVFL networks) that employ a totally data-independent randomization in the training process, and demonstrate considerable advantages in building fast learner models with sound learning and generalization ability. This implies a good potential for dealing with online stream and/or big data analytics. Recently, some extensions of SCNs have been proposed from various viewpoints. In [19], an ensemble version of SCNs with heterogeneous features was developed with applications in large-scale data analytics. In [14], we generalized our SCNs to a deep version, termed DeepSCNs, with both theoretical analysis and algorithm implementation. It has been empirically illustrated that DeepSCNs can be constructed efficiently (much faster than other deep neural networks) and share many great features, such as representation learning and the consistency property between learning and generalization. Besides, in [13], we built robust SCNs for the purpose of uncertain data modelling. This series of work to some extent exhibits the effectiveness of the SCN framework and presents an advisable and useful way of studying/implanting randomness in neural networks.
III 2D Stochastic Configuration Networks
This section details our proposal for two-dimensional stochastic configuration networks (2DSCN). First, based on our original SCN framework, we present the algorithm description for 2DSCN, followed by a theoretical verification of its convergence property. Then, some technical points on which the two methods differ are compared and discussed. Afterwards, we provide a theoretical analysis of why randomized learner models with 2D inputs have a good potential for better generalization.
III-A Algorithm Implementation
On the basis of the SCN framework, the problem of building 2DSCN can be formulated as follows. Given a target function $f: \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^m$, suppose that a 2DSCN with $L-1$ hidden nodes has already been constructed, that is, $f_{L-1}(x) = \sum_{j=1}^{L-1} \beta_j g(u_j^T x v_j + b_j)$ ($L = 1, 2, \ldots$, $f_0 = 0$), where $g(\cdot)$ represents the activation function, $u_j \in \mathbb{R}^{d_1}$ and $v_j \in \mathbb{R}^{d_2}$ stand for the collection of input weights (to be stochastically configured under certain constraints), and $\beta_j$ are the output weights. With the current residual error denoted by $e_{L-1} = f - f_{L-1}$, which by assumption does not reach a predefined tolerance level, our objective is to quickly generate a new hidden node $g_L$ (in terms of $u_L$, $v_L$, and $b_L$) so that the resulting model $f_L$ has an improved residual error after all the output weights are evaluated by solving a linear least squares problem.
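Suppose we have a training dataset with inputs $X = \{x_1, x_2, \ldots, x_N\}$ and corresponding outputs $T = \{t_1, t_2, \ldots, t_N\}$, where $x_i \in \mathbb{R}^{d_1 \times d_2}$, $t_i \in \mathbb{R}^m$, sampled based on a target function $f$. Denote by $e_{L-1} := e_{L-1}(X) = [e_{L-1,1}(X), \ldots, e_{L-1,m}(X)] \in \mathbb{R}^{N \times m}$ the corresponding residual error matrix before adding the $L$-th new hidden node, where $e_{L-1,q} := [e_{L-1,q}(x_1), \ldots, e_{L-1,q}(x_N)]^T \in \mathbb{R}^N$ with $q = 1, 2, \ldots, m$. With two-dimensional inputs $x_1, x_2, \ldots, x_N$, the $L$-th hidden node activation can be expressed as
$$h_L := h_L(X) = [g(u_L^T x_1 v_L + b_L), \ldots, g(u_L^T x_N v_L + b_L)]^T, \qquad (7)$$
where $u_L \in \mathbb{R}^{d_1}$ and $v_L \in \mathbb{R}^{d_2}$ are input weights, and $b_L$ is the bias.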
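Denote a set of temporary variables $\xi_{L,q}$, $q = 1, 2, \ldots, m$, as follows:
$$\xi_{L,q} = \frac{(e_{L-1,q}^T h_L)^2}{h_L^T h_L} - (1 - r - \mu_L)\, e_{L-1,q}^T e_{L-1,q}. \qquad (8)$$
Based on Theorem 1, it is natural to formulate the inequality constraint for building 2DSCN by letting $\min\{\xi_{L,1}, \xi_{L,2}, \ldots, \xi_{L,m}\} \ge 0$.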
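After successfully adding the $L$-th hidden node ($h_L$), the current hidden layer output matrix can be expressed as $H_L = [h_1, h_2, \ldots, h_L]$. Then, the output weights are evaluated by solving a least squares problem, i.e.,
$$\beta^{*} = \arg\min_{\beta} \|H_L \beta - T\|_F^2 = H_L^{\dagger} T, \qquad (9)$$
where $H_L^{\dagger}$ is the Moore-Penrose generalized inverse [20] and $\|\cdot\|_F$ represents the Frobenius norm.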
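The incremental construction described in this subsection can be sketched as follows. This is a minimal illustrative sketch rather than the paper's Algorithm 1: it assumes a sigmoid activation, uniform sampling on $[-\lambda, \lambda]$, the per-output acceptance test $(e_q^T h)^2 / (h^T h) \ge (1 - r - \mu_L)\, e_q^T e_q$, and the choice $\mu_L = (1-r)/(L+1)$. The function names and the default values of `lams`, `rs` and `n_candidates` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_output(X, u, v, b):
    # h_i = g(u^T x_i v + b) for every training input x_i; X has shape (N, d1, d2).
    return sigmoid(np.einsum("p,ipq,q->i", u, X, v) + b)

def build_2dscn(X, T, max_nodes=20, tol=1e-2, n_candidates=50,
                lams=(0.5, 1.0, 5.0), rs=(0.9, 0.99, 0.999), seed=0):
    rng = np.random.default_rng(seed)
    N, d1, d2 = X.shape
    H = np.empty((N, 0))
    beta = np.zeros((0, T.shape[1]))
    params = []
    E = T.copy()                      # residual before any node is added (f_0 = 0)
    errors = [np.linalg.norm(E)]
    while len(params) < max_nodes and errors[-1] > tol:
        L = len(params) + 1           # index of the node being configured
        best = None
        for lam in lams:              # widen the sampling range only if needed
            for r in rs:              # relax the constraint only if needed
                mu = (1.0 - r) / (L + 1)          # assumed choice of mu_L
                for _ in range(n_candidates):
                    u = rng.uniform(-lam, lam, d1)
                    v = rng.uniform(-lam, lam, d2)
                    b = rng.uniform(-lam, lam)
                    h = node_output(X, u, v, b)
                    # One value per output dimension q.
                    xi = (E.T @ h) ** 2 / (h @ h) - (1.0 - r - mu) * np.sum(E ** 2, axis=0)
                    if xi.min() >= 0 and (best is None or xi.sum() > best[0]):
                        best = (xi.sum(), u, v, b, h)
                if best is not None:
                    break
            if best is not None:
                break
        if best is None:
            break                     # no admissible candidate was found
        _, u, v, b, h = best
        params.append((u, v, b))
        H = np.column_stack([H, h])
        beta = np.linalg.pinv(H) @ T  # re-evaluate all output weights
        E = T - H @ beta
        errors.append(np.linalg.norm(E))
    return params, beta, H, errors
```

Because all output weights are re-evaluated by least squares after each new node, the training residual recorded in `errors` cannot increase from one step to the next (up to numerical precision).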
III-B Convergence Analysis
The key to verifying the convergence of Algorithm 1 is to analyze the universal approximation property of 2DSCN. Recalling the proof of Theorem 1 (Theorem 7 in [12]), one can observe that it is the inequality constraints that dominate the whole deduction, rather than the form of the input weights (either vector or matrix). In fact, it still holds that $\|e_L\|^2$ is monotonically decreasing and convergent, for a given $r \in (0, 1)$. Therefore, $\lim_{L \to +\infty} \|e_L\| = 0$.
We remark that the value of $r$ may vary during the whole incremental process, and the same approach intuitively applies to verify the convergence. Also, it is logical to set $r$ as a sequence with monotonically increasing values, because it becomes more difficult to meet the inequality condition after a considerable number of hidden nodes have been successfully configured. To some extent, this user-determined (and problem-dependent) parameter affects the convergence speed of the algorithm. In particular, one can set the sequence of $r$ (monotonically increasing) with an initial value quite close to one, which eases the configuration phase when adding one hidden node, as the inequality condition can be easily satisfied. Alternatively, the user can start with a relatively small value (but not too small), which however requires more configuration trials at a single step to find suitable input weights and biases that fit the inequality condition. This can lead to a huge computational burden or even more unnecessary failures during the configuration phase. Since the convergence property is guaranteed theoretically, one can consider some practical guidelines for setting the sequence of $r$ with reference to the task at hand. Based on our experience, the first trick, i.e., initializing $r$ with a value close to one and then monotonically increasing it (progressively approaching one), offers more feasibility in algorithm implementation. Later in Section IV, we will recall this note in our experimental setup.
III-C Comparison with SCNs
III-C1 Support Range for Random Parameters
Technically, 2DSCN still inherits the essence of our original SCN framework, that is, stochastically configuring basis functions in light of a supervisory mechanism (see Theorem 1). This kind of data-dependent randomization can effectively and efficiently locate the ‘support range’, where one can randomly generate hidden nodes with assurance of building universal approximators. Despite this common character, differences between the support ranges induced by these two methods should be highlighted. Computationally, it holds that
$$u^T x v = \mathrm{Tr}(u^T x v) = \mathrm{Tr}(x v u^T) = \mathrm{Tr}\big((u v^T)^T x\big) = \mathrm{vec}(u v^T)^T \mathrm{vec}(x),$$
where Tr means the matrix trace, $u \in \mathbb{R}^{d_1}$, $v \in \mathbb{R}^{d_2}$, $x \in \mathbb{R}^{d_1 \times d_2}$, and $\mathrm{vec}(\cdot)$ stands for the vectorization of a given 2D array.
We observe that although $u^T x v$ can be viewed as a regular dot product (between the hidden weight vector and the input) as performed in SCN, the resulting ($d_1 d_2$)-dimensional vector $\mathrm{vec}(u v^T)$ may exhibit a different distribution, in contrast to a random ($d_1 d_2$)-dimensional vector from the SCN-induced support range.
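The identity between the bilinear form, the trace form and the ordinary dot product on vectorized arrays can be checked numerically. The sketch below uses arbitrary small dimensions and random values chosen only for illustration. It also shows one concrete way in which the induced weight vector differs from a generic one: reshaped to $d_1 \times d_2$, it is the outer product $u v^T$ and therefore has rank one.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 6, 4
x = rng.standard_normal((d1, d2))       # a 2D input
u = rng.uniform(-1, 1, d1)              # left input weights
v = rng.uniform(-1, 1, d2)              # right input weights

bilinear = u @ x @ v                            # u^T x v
trace_form = np.trace(np.outer(v, u) @ x)       # Tr((u v^T)^T x)
w = np.outer(u, v).ravel()                      # vec(u v^T), row-major
dot_form = w @ x.ravel()                        # vec(u v^T)^T vec(x)

# The equivalent SCN-style weight vector has d1*d2 entries but is
# generated by only d1 + d2 free parameters (a rank-one matrix).
rank_w = np.linalg.matrix_rank(w.reshape(d1, d2))
```

Both `ravel()` calls use the same (row-major) ordering, which is all the identity requires.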
We should also note that there are no special requirements on the initial distributions of $u$ and $v$ used in the algorithm implementation. For instance, one can set two different range parameter sets $\Upsilon_u$ and $\Upsilon_v$, respectively, in the experimental setup. If so, in the algorithm design, one more loop is needed for searching an appropriate range parameter from $\Upsilon_v$ when the one from $\Upsilon_u$ is chosen and fixed, or vice versa. Since the universal approximation capability is always guaranteed, this more complex manipulation is not computationally efficient in practical implementation. For simplicity, we just use the same random range setting for $u$ and $v$, i.e., merely one set $\Upsilon$, as noted in step 3 of Algorithm 1.
In practice, $u$ or $v$, which can be viewed as the row/column-direction hidden weight, has its own support range, which relies on the initially employed distribution (over the range chosen for $u$ or for $v$) and the inequality constraint for hidden node configuration. Regarding discrepancies between 2DSCN and SCN in random parameter distribution, we elaborate in more detail at the end of this section.
III-C2 Parameter Space
Although neural networks can universally approximate complex multivariate functions, they still suffer from difficulties on high-dimensional problems where the number of input features is much larger than the number of observations. Generally, a huge number of training observations is required for training/building an acceptable approximator, as is normally done in the deep learning community. Empirically, problems with a very limited number of training samples but of very high dimension usually need further technical treatment in algorithm development, such as feature selection or representation learning with sparsity (e.g., Lasso). Avoiding high-dimensional inputs and seeking useful input features to alleviate overfitting are important and essential for the majority of machine learning techniques.
It is clear that a 2DSCN model with $L$ hidden nodes has $L \times d_1$-dimensional left input weights and $L \times d_2$-dimensional right input weights, $L$ biases (scalars), and $L \times m$-dimensional output weights, that is, $L \times (d_1 + d_2 + 1 + m)$ parameters in total; whilst an SCN model with the same structure has $L \times (d_1 d_2)$-dimensional input weights and the same number of biases and output weights, i.e., $L \times (d_1 d_2 + 1 + m)$ parameters altogether. Technically, SCN can thus impose a high-dimensional parameter space that may cause potential difficulties in meeting the stochastic configuration inequality (5), especially when the number of training samples is far lower than the dimensionality of the input weights. Besides, for a relatively large $L$, huge memory is needed for saving the parameters in computation. On the contrary, 2DSCN can effectively ease the high-dimensional issue and to some extent economize physical memory in practice.
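To give a feeling for the size of this gap, the two parameter counts can be evaluated for a concrete configuration. The numbers below ($28 \times 28$ inputs, 100 hidden nodes, 10 outputs) are a hypothetical illustration and are not taken from the experiments of this paper; the function names are likewise illustrative.

```python
def n_params_2dscn(L, d1, d2, m):
    # L left weight vectors (d1), L right weight vectors (d2),
    # L scalar biases and L output weight vectors (m).
    return L * (d1 + d2 + 1 + m)

def n_params_scn(L, d1, d2, m):
    # L input weight vectors over the vectorized input (d1 * d2),
    # L scalar biases and L output weight vectors (m).
    return L * (d1 * d2 + 1 + m)

print(n_params_2dscn(100, 28, 28, 10))  # 6700
print(n_params_scn(100, 28, 28, 10))    # 79500
```

In this configuration the 1D model stores more than ten times as many parameters as the 2D model, and the gap widens as the input size grows.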
III-C3 Data Structure Preservation
It is logical to expect that 2DSCN has some advantages in preserving the spatial details of the given input images, because it takes into account the 2D-neighborhood information (the order in which pixels appear) of the input rather than applying the simple vectorization operation performed in SCN. This argument has been raised and commonly accepted in the literature; however, there is no sufficient scientific evidence verifying why and how the vectorization trick affects the structural information of the 2D inputs. In this part, we aim to examine the resemblance between the 2DSCN methodology and convolutional neural networks (CNNs) from the computational perspective. A schematic diagram is plotted in Fig. 2 and the corresponding explanations are as follows.
Recall the identity given in Section III-C1. The left low-dimensional vector $u \in \mathbb{R}^{d_1}$ acts as a ‘filter’ used in extracting some random features from the 2D input $x \in \mathbb{R}^{d_1 \times d_2}$. In other words, each column of $x$ is now considered as a block, i.e., the image $x$ is supposed to be represented by $d_2$ block-pixels. Then $u^T x$ can be viewed as a ‘convolution’ operation between the ‘filter’ $u$ (of size $d_1 \times 1$) and the input $x$ along the vertical direction, leading to a feature map $u^T x \in \mathbb{R}^{1 \times d_2}$. Then, a ‘pooling’ operation, conducted by calculating a weighted sum of the obtained feature map (with weights given by $v$), is used to aggregate the feature information.
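The ‘convolution’ and ‘pooling’ reading of the bilinear form can be made explicit with a short sketch. It follows the interpretation given above and is not an actual CNN layer: the filter spans a whole column block, and the pooling is a weighted sum rather than a max or average. The dimensions and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 5, 7
x = rng.standard_normal((d1, d2))   # 2D input with d2 column blocks
u = rng.uniform(-1, 1, d1)          # 'filter' of size d1 x 1
v = rng.uniform(-1, 1, d2)          # weights of the 'pooling' step

# 'Convolution': apply the same filter u to each column block of x.
feature_map = np.array([u @ x[:, k] for k in range(d2)])

# 'Pooling': aggregate the feature map by a weighted sum.
pooled = feature_map @ v
```

The value `pooled` coincides with the pre-activation $u^T x v$ of a 2DSCN hidden node (before the bias is added), and the same filter `u` is shared by all column blocks, which is the analogue of parameter sharing.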
As a conjecture, 2DSCN might have some technical merits in common with CNNs for image data analytics. More theoretical and/or empirical research on this conjecture is left for our future study.
III-D Superiority in Generalization
In this part, we investigate in depth why 2DSCN (and 2DRVFL) potentially leads to a better generalization performance than SCN (and RVFL). Four supportive theories (ST1 to ST4) are presented to explain our intuitive prediction, that is, the stochastically configured input weights and biases of 2DSCN are to a great extent more prone to result in a lower generalization error. Later, some statistical verifications are presented to further justify our theoretical interpretation.
ST1: Learning Less-Overlapping Representations. Typically, the elements of a weight vector have a one-to-one correspondence with the observed features, and a weight vector is oftentimes interpreted by examining the top observed features that correspond to the largest weights in this vector. In [21], the authors proposed a non-overlapness-promoting regularization learning framework to improve interpretability and help alleviate overfitting. It imposes a structural constraint over the weight vectors, and thus can effectively shrink the complexity of the function class induced by the neural network models and improve the generalization performance on unseen data. Assuming that a model is parameterized by $K$ vectors $\mathcal{W} = \{w_i\}_{i=1}^{K}$, [21] proposed a hybrid regularizer consisting of an orthogonality-promoting term and a sparsity-promoting term, denoted by
$$\Omega(\mathcal{W}) = \mathrm{tr}(M) - \log\det(M) + \gamma \sum_{i=1}^{K} \|w_i\|_1,$$
where $M$ is the Gram matrix associated with $\mathcal{W}$, i.e., $M_{ij} = w_i^T w_j$, and $\gamma$ is a trade-off parameter between these two regularizers.
Theoretically, the first term $\mathrm{tr}(M) - \log\det(M)$ controls the level of near-orthogonality over the weight vectors from $\mathcal{W}$, while the second term encourages each $w_i$ to have more elements close to zero. It is empirically verified that this hybrid form of regularizer can contribute to learner models with better generalization performance [21].
Back to our thesis, we can roughly explain why 2D models (2DSCN and 2DRVFL) can outperform 1D models (SCN and RVFL), and simultaneously, why SCN-based models are better than RVFL-based ones: (i) Generally, there is no big difference between SCN and 2DSCN in the near-orthogonality level of $\mathcal{W}$; however, the random weights in 2DSCN can have a higher level of sparsity than those in SCN, hence leading to a smaller $\Omega(\mathcal{W})$. This deduction can also be used to distinguish 2DRVFL from RVFL; (ii) Given a similar level of sparsity in $\mathcal{W}$, SCN-based models are more prone to have a lower near-orthogonality level than RVFL-based ones, therefore indicating a smaller $\Omega(\mathcal{W})$. For further justification of these intuitive arguments, we present analogous theories regarding the near-orthogonality of weight vectors in the following part (see ST2 below) and demonstrate some statistical results at the end of this section.
ST2: Weight Vector Angular Constraints for Diversity Promoting. The authors of [22, 23] have shown empirical effectiveness and explained in theory when and why a low generalization error can be achieved by adjusting the diversity of hidden nodes in neural networks. Theoretically, increasing the diversity of hidden nodes in a neural network model would reduce the estimation error but increase the approximation error, which implies that a low generalization error can be achieved when the diversity level is set appropriately. Specifically, the near-orthogonality of the weight vectors (e.g., input weights and biases) can be used to characterize the diversity of hidden nodes in a neural network model, and a regularizer with weight vector angular constraints can be used to alleviate overfitting. To highlight the impact of near-orthogonality (of the weight vectors) on the generalization error, we reformulate two main theoretical results addressed in [23]. Before that, some notations and preliminaries on statistical learning theory are revisited.
Consider the hypothesis set
where stands for the output weight is the sigmoid activation function. Given training samples generated independently from an unknown distribution . The generalization error of is defined as . As is not available, one can only consider minimizing the empirical risk in lieu of . Let be the true risk minimizer and be the empirical risk minimizer. Then, the generalization error (of the empirical risk minimizer ) can be estimated by bounding the estimation error and the approximation error , respectively. The following Theorem 2 and Theorem 3 show these two estimations in relation to the factor .
Theorem 2 [23] (Estimation Error).
With probability at least
, the estimation upper bound of estimation error decreases as becomes smaller, i.e.,where .
Suppose the target function satisfy certain smoothness condition given by , where represents the Fourier transformation of . Then, the approximation error, which reflects the power of the hypothesis set for approximating , is expressed as follows.
Theorem 3 [23] (Approximation Error). A smaller contributes to a larger upper bound of approximation error, that is, let and , where , then there exists such that
Based on Theorem 2 and Theorem 3, we can come to a conclusion, that is, a larger upper bound of the generalization error can be caused by the case when the weight vectors are highly near-orthogonal to each other ( is extremely small) or by the situation that is close or equal to 1 (e.g., there exist two weight vectors that are linearly dependent). Therefore, given two obtained (randomized) learner models with roughly the same training performance, the one equipped with hidden weight vectors of high near-orthogonality is likely to result in worse generalization. On the other hand, our previous work [18] reveals a key pitfall of RVFL networks: all high-dimensional data-independent random features are nearly orthogonal to each other with probability one. Fortunately, the supervisory mechanism used in the SCN framework imposes an implicit relationship between the weight vectors and can effectively reduce the probability of near-orthogonality. With all these clues, we can roughly explain why the learner models produced by SCN and 2DSCN are more prone to result in a better generalization performance than those produced by RVFL and 2DRVFL. It would be interesting to carry out a rigorous theoretical analysis and an extensive empirical study on distinguishing the SCN framework from RVFL networks from this point of view. Also, it is meaningful to consider weight vector angular constraints in the development of SCNs, for the purpose of enhancing generalization. To keep the focus of this work, we leave these explorations to our future research.
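The near-orthogonality of high-dimensional data-independent random vectors is easy to observe numerically. The sketch below is only an illustration of this phenomenon and not the analysis of [18]: it draws weight vectors uniformly from $[-\lambda, \lambda]^d$ and reports the mean absolute cosine between all pairs. The function names, the number of vectors and the dimensions are arbitrary choices.

```python
import numpy as np

def mean_abs_cosine(W):
    """Mean |cos| of the angle between all distinct pairs of rows of W."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = Wn @ Wn.T
    rows, cols = np.triu_indices(W.shape[0], k=1)
    return float(np.abs(C[rows, cols]).mean())

def cosine_by_dimension(dims=(10, 100, 1000, 10000), n_vectors=50, lam=1.0, seed=0):
    # Data-independent random weight vectors, as in RVFL-style assignment.
    rng = np.random.default_rng(seed)
    return {d: mean_abs_cosine(rng.uniform(-lam, lam, size=(n_vectors, d)))
            for d in dims}

if __name__ == "__main__":
    for d, c in cosine_by_dimension().items():
        print(f"d = {d:5d}: mean |cos| = {c:.4f}")
```

The mean absolute cosine shrinks as the dimension grows, so for vectorized images (where the dimension is the number of pixels) independently drawn weight vectors are almost mutually orthogonal.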
ST3: Vague Relationship between 2DSCN and the DropConnect Framework. To effectively alleviate overfitting and improve the generalization performance, Dropout has been proposed for regularizing fully connected layers within neural networks by randomly setting a subset of activations to zero during training [24, 25]. DropConnect, proposed by Wan et al. [16], is an extension of Dropout in which each connection, instead of each output unit, can be dropped with a certain probability. Technically, DropConnect is similar to Dropout in that both perform dynamic sparsity within the learner model during the training phase; however, it differs in that the sparsity is imposed on the hidden input weights, rather than on the output vectors of a layer. That means the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage. Importantly, as noted in [16], the mechanism employed in DropConnect is not equivalent to randomly assigning a sparse hidden input weight matrix that remains fixed during the training process, which indirectly invalidates the effectiveness of the RVFL and 2DRVFL methods even when they use sparse weights in the hidden layer.
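The difference between the two masking schemes can be sketched for a single fully connected layer. This is a simplified illustration under stated assumptions, not the training procedure of [16] or [24, 25]: a tanh activation is assumed, `p` denotes the drop probability, a fresh mask is drawn at every call, and the rescaling used at inference time is omitted. The function names are hypothetical.

```python
import numpy as np

def dropout_layer(x, W, p, rng):
    # Dropout: the mask acts on the output units (activations) of the layer.
    a = np.tanh(W @ x)
    mask = rng.random(a.shape) >= p
    return a * mask, mask

def dropconnect_layer(x, W, p, rng):
    # DropConnect: the mask acts on individual connections (weights).
    mask = rng.random(W.shape) >= p
    return np.tanh((W * mask) @ x), mask

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((5, 8))
out_do, m_do = dropout_layer(x, W, p=0.5, rng=rng)
out_dc, m_dc = dropconnect_layer(x, W, p=0.5, rng=rng)
```

The Dropout mask has one entry per hidden unit, whereas the DropConnect mask has one entry per weight, so the latter yields a sparsely connected layer whose connectivity changes from one training step to the next.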
Intuitively, our proposed 2DSCN could be thought as related to DropConnect, in terms of the following points:

The supervisory mechanism used in 2DSCN aims at incrementally configuring the weight vectors until convergence to a universal approximator, which is equivalent to the training objective of DropConnect;

Once the random weight vectors in 2DSCN have many small elements close to zero, their functionality is similar to the sparsity mechanism imposed in DropConnect on the hidden weights;

On the basis of the above two clues, the incremental process performed in 2DSCN can be viewed as similar to performing dynamic sparsity within the learner model during the training phase, as used in DropConnect.
We would like to highlight that the original SCN does not have this kind of vague relationship with DropConnect unless a weight sparsity regularizer is used in the training process. In contrast, 2DSCN involves more weight vectors with small values, which can be viewed as a considerable degree of sparsity, and thus has good potential to inherit some merits of DropConnect and its parallel methodology. Fig. 3 highlights the characteristics of DropConnect and illustrates why 2DSCN differs from SCN in exhibiting sparsity among the hidden input weights.
ST4: Novel Estimation of Test Error. Various statistical convergence rates for neural networks have been established under constraints on the weights [26, 27, 28, 29]. Empirically, small weights together with small training error can lead to significant improvements in generalization. These investigations lend scientific support to heuristic techniques such as weight decay and early stopping. The underlying reason is that producing overfitted mappings requires high curvature and hence large weights, whereas keeping the weights small during training contributes to smooth mappings. Technically, the regularization learning framework, which introduces various types of weight penalty such as L2 weight decay [30, 31], Lasso [32], and the Kullback-Leibler (KL) divergence penalty [33], shares a similar philosophy to help prevent overfitting. A comprehensive overview of existing theories and techniques concerning learner models' generalization capability is beyond the scope of this paper. Instead, we revisit the theoretical result presented in [34] and illustrate mathematically how the magnitudes of the output weights affect the generalization power of randomized learner models. For a better understanding and a consistent problem formulation, we restate their main result using the notation of ST2, that is,
Theorem 4 [34]. Consider the hypothesis set with certain distribution and function satisfying , and given a training data set with input-output pairs drawn i.i.d. from some distribution , a randomized learner model can be obtained by randomly assigning from the distribution and solving the empirical risk minimization problem*
*In [34], a general form of cost function is considered. Here we specify a quadratic loss function, and its associated Lipschitz constant has no impact on the final estimation.
subject to . Then, with probability at least , the upper bound for the generalization error of can be estimated by
Theoretically, the upper bound in Theorem 4 implies that randomized learner models with a good training result and small output weights are likely to generalize well, in a probabilistic sense. However, this bound cannot be used directly to bound the practical test error when evaluating the generalization performance of randomized learner models. A more numerical estimate of the test error (resulting from the algorithm's realization in practice) is required to better characterize the generalization capability and the factors that affect it.
As one of our main contributions in this work, a novel upper bound estimation for the test error is presented from a computational perspective. To facilitate our theoretical investigation, we view the hidden layer matrix as a matrix-valued function of a matrix variable, i.e., , denoted by (see [35] for the fundamentals of matrix calculus)
(10) 
with the argument represented by
(11) 
Suppose that is differentiable and has a continuous first-order gradient , defined by a quartix belonging to , i.e.,
where for ,
Then, the first directional derivative in a given direction can be represented by
It is reasonable to assume that the test sample matrix can be represented by adding sufficiently small random noise to the training sample matrix , i.e., , where is a random matrix and is sufficiently small. Then, we can take the first-order Taylor series expansion about ([35]), i.e.,
Therefore, the test error can be estimated by
(12)  
where stands for the Frobenius norm and represents the training error.
Basically, this rough estimation implies two points that should be highlighted:
(i) The upper bound for the test error can be viewed as an increasing function of , which means that learner models with smaller output weight values are more likely to generalize well on unseen data. This is consistent with the philosophy behind the regularization learning framework, that is, imposing a penalty term to control the magnitudes of the output weights during the training process.
(ii) We can further investigate how the input weights and biases affect the value of . In particular, we use the sigmoid function in the following deduction, i.e., and . Mathematically, the th element (, ) inside can be expressed as
Then, a rough upper bound for can be obtained, that is,
where we use the abbreviation for and the Cauchy-Schwarz inequality in the first inequality. stands for the th row vector of the matrix . is defined in (10), is a matrix of ones (every element is equal to one), is formed by stacking copies of the row vector , and ‘’ stands for the Hadamard (entrywise) product of matrices.
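Since the deduction in point (ii) relies on the sigmoid activation, it may help to recall numerically how its derivative behaves. The sketch below uses only the standard identity g'(z) = g(z)(1 - g(z)) for the logistic function; the helper names `sigmoid` and `sigmoid_slope` and the sample points are illustrative assumptions, and the code does not evaluate the bound itself.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_slope(z):
    """Derivative of the sigmoid, g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)


# The slope peaks at z = 0 and decays quickly once the unit saturates.
for z in (0.0, 1.0, 3.0, 6.0, 10.0):
    print(f"z = {z:5.1f}   g(z) = {sigmoid(z):.5f}   g'(z) = {sigmoid_slope(z):.5f}")
```

The factor g(1 - g) never exceeds 1/4 and becomes negligible when the pre-activation is far from zero, so the sensitivity of a hidden output to an input perturbation depends on where the unit operates and not only on the size of its weights.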
We summarize the above theoretical result in the following Theorem 5, using the notation introduced earlier.
Theorem 5. Given training input and output , suppose a randomized neural network model with hidden nodes is built, corresponding to the hidden layer output matrix (on the training data) , the output weight matrix , and the training error . Let be the test (unseen) input data matrix, where is a random matrix and is sufficiently small, and let stand for the associated hidden layer output matrix. Then the test error can be bounded by
(13)  
Remark 3. We would like to highlight a trick used in the preceding deduction of Theorem 5. We have deliberately preserved the bundle of computational units ‘’ rather than roughly estimating the whole term by ‘’, which would result in a very blunt bound for . Unfortunately, such an upper bound is not meaningful because it does not consider the saturation property of the sigmoid function, and it may give the misleading impression that ‘larger input weights can destroy the generalization capability’. In contrast, our proposed upper bound (IIID) is nearly sharp and can provide valuable information for identifying the role of the input weights (and biases) and the training samples in the learner model's generalization power. It is the bundle of computational units rather than merely the that acts as a suitable indicator for predicting the generalization performance. Besides, it should be noted that input weights (and biases) with small values but enforcing the or (corresponding to the saturation range of the sigmoid function) are more likely to result in a small value of and consequently bring a small generalization error bound.
On the other hand, the right-hand side of Eq. (13) strongly resembles the regularized learning objective if is viewed as the regularization factor , that is, , considered as a whole to effectively alleviate overfitting.
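The resemblance to regularized learning noted above can be illustrated with a small experiment. The following is a minimal sketch under stated assumptions, not the authors' algorithm: a toy 1D regression task, ten sigmoid hidden nodes with fixed random parameters, and L2-regularized least squares for the output weights. All names (`ridge_output_weights`, `solve`, `lam`) and numerical settings are hypothetical choices made for illustration.

```python
import math
import random


def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ridge_output_weights(H, t, lam):
    """Minimise ||H beta - t||^2 + lam * ||beta||^2 via (H^T H + lam I) beta = H^T t."""
    n_hidden = len(H[0])
    gram = [[sum(row[i] * row[j] for row in H) + (lam if i == j else 0.0)
             for j in range(n_hidden)] for i in range(n_hidden)]
    rhs = [sum(row[i] * ti for row, ti in zip(H, t)) for i in range(n_hidden)]
    return solve(gram, rhs)


def norm(v):
    return math.sqrt(sum(x * x for x in v))


rng = random.Random(0)
n_samples, n_hidden = 40, 10
xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
targets = [math.sin(3.0 * x) + 0.1 * rng.gauss(0.0, 1.0) for x in xs]

# Hidden layer with fixed random input weights and biases (assumed range [-5, 5]).
w = [rng.uniform(-5.0, 5.0) for _ in range(n_hidden)]
b = [rng.uniform(-5.0, 5.0) for _ in range(n_hidden)]
H = [[1.0 / (1.0 + math.exp(-(wj * x + bj))) for wj, bj in zip(w, b)] for x in xs]

beta_norms, residuals = [], []
for lam in (1e-3, 1e-1, 10.0):
    beta = ridge_output_weights(H, targets, lam)
    fit = [sum(h * bj for h, bj in zip(row, beta)) for row in H]
    beta_norms.append(norm(beta))
    residuals.append(norm([f - t for f, t in zip(fit, targets)]))
    print(f"lam = {lam:7.3f}   ||beta|| = {beta_norms[-1]:9.4f}   "
          f"training residual = {residuals[-1]:.4f}")
```

Increasing the regularization factor shrinks the output weights at the cost of a larger training residual, which is the trade-off between the two terms on the right-hand side of a bound of this type.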
Why are 2D randomized models equipped with more small weights? Since small weights can, to some extent, have a positive influence on a learner model's generalization ability, one major issue left unclarified is whether 2D randomized learner models possess this advantage. For that purpose, and before ending this section, we provide a statistical verification of how frequently sufficiently small weights occur in 1D and 2D randomized models, aiming to further support the superiority of 2D randomized models.
Given a distribution (either uniform or Gaussian), we investigate the statistical differences among the following three strategies for randomly assigning parameters:

M1: Randomly assign from ;

M2: Randomly assign , from , then calculate their Hadamard (entrywise) product ;

M3: Randomly assign , , with , then calculate and let .
A simple illustration of the distribution of the random weights induced by M1 and M3 is provided in Fig. 4, in which it can be clearly seen that have more small values (near zero) than . In our experience, the corresponding plots for M2 and M3 are visually indistinguishable. More statistical results are helpful for making a reasonable distinction among M1-M3.
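The frequency of small weights under the three strategies can be estimated with a short simulation. The sketch below is illustrative only and rests on several assumptions, since the symbols in M1-M3 are not reproduced here: the distribution is taken to be uniform on [-1, 1], M3 is read as forming the outer product of two random vectors and flattening it, and the threshold 0.1 and the sizes are arbitrary. For the product of two independent uniform variables on [-1, 1], the probability that its magnitude falls below a threshold t in (0, 1] is t(1 - ln t), which gives a reference value for M2.

```python
import math
import random


def small_fraction(weights, threshold=0.1):
    """Fraction of entries whose magnitude is below the threshold."""
    return sum(abs(w) < threshold for w in weights) / len(weights)


def sample_m1(d, rng):
    """M1 (assumed reading): d i.i.d. weights drawn uniformly from [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(d)]


def sample_m2(d, rng):
    """M2 (assumed reading): entrywise product of two independent M1 vectors."""
    a, b = sample_m1(d, rng), sample_m1(d, rng)
    return [x * y for x, y in zip(a, b)]


def sample_m3(d1, d2, rng):
    """M3 (assumed reading): flattened outer product u v^T with d = d1 * d2."""
    u, v = sample_m1(d1, rng), sample_m1(d2, rng)
    return [ui * vj for ui in u for vj in v]


rng = random.Random(0)
d1 = d2 = 200
d = d1 * d2

frac_m1 = small_fraction(sample_m1(d, rng))
frac_m2 = small_fraction(sample_m2(d, rng))
frac_m3 = small_fraction(sample_m3(d1, d2, rng))
theory_product = 0.1 * (1.0 - math.log(0.1))  # P(|xy| < 0.1) for independent uniforms

print(f"M1: fraction of |w| < 0.1 = {frac_m1:.3f}")
print(f"M2: fraction of |w| < 0.1 = {frac_m2:.3f} (reference {theory_product:.3f})")
print(f"M3: fraction of |w| < 0.1 = {frac_m3:.3f}")
```

Under these assumptions the product-based strategies place a noticeably larger share of weights near zero than direct sampling. The entries produced by M3 are not independent of each other, because only d1 + d2 random numbers are drawn, so its fraction fluctuates more from run to run than that of M2.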