Abstract
We present a numerical scheme for the computation of Artificial Neural Network weights, without a laborious iterative procedure. The proposed algorithm adheres to the underlying theory, is fast, and results in remarkably low errors when applied to regression and classification of complex datasets, such as the Griewank function of multiple variables with added random noise, and the MNIST database of handwritten-digit images. Interestingly, the same mathematical formulation is found capable of approximating highly nonlinear functions in multiple dimensions, with low test-set errors for the unknown functions and their higher-order partial derivatives, as well as of numerically solving Partial Differential Equations. The method is based on the calculation of the weights of each neuron in small neighborhoods of data, such that the corresponding local approximation matrix is invertible. Accordingly, hyperparameter optimization is not necessary, as the number of neurons stems directly from the dimensions of the data, further improving the algorithmic speed. Overfitting is inherently eliminated, and the results are interpretable and reproducible. The complexity of the proposed algorithm is of class P, with computing time linear in the number of observations and cubic in the number of features, in contrast with the NP-Complete class of standard algorithms for training ANNs. The performance of the method is high for small as well as big datasets, and the test-set errors are similar to or smaller than the train errors, indicating generalization efficiency. The supplementary computer code, in the Julia language, may reproduce the validation examples and run for other datasets.
1 Introduction
Artificial Intelligence (AI) has been broadening its numerical methods as well as its fields of application; however, empirical rigor is not following such advancements [sculley2018winner's], and researchers have been questioning the accuracy of iterative algorithms [Hutson2018]. Furthermore, the results obtained for a certain problem are not reproducible if we rerun the algorithm for the same problem [Hutson725, belthangady2019applications]. In theory, Artificial Neural Networks (ANNs) are capable of approximating any continuous function [Hassoun2005]; however, apart from existence, the theory does not offer a rigorous procedure to calculate the approximation parameters. Hence, iterative algorithms [ruder2016overview] are applied for the calculation of the weights that attain the minimum possible approximation errors. However, because this optimization problem might have more than one local minimum with respect to the weights, and the required iterations might be computationally costly, enhanced methods such as stochastic gradient descent [bottou2010large, johnson2013accelerating] have been developed. Overfitting, that is, approximating the given data with high accuracy but failing to generalize the predictions, is a major flaw, and methods such as dropout [srivastava2014dropout] try to confront it. Over and above, the number of computational neurons is often selected arbitrarily, while the so-called hyperparameter optimization [bergstra2011algorithms, bergstra2012random, Feurer2019], which actually regards the solution of an optimization problem depending on another optimization problem (the laborious weights' calculation), results in extended computational time.
The purpose of this work was to develop a numerical scheme for the calculation of the optimal weights, number of neurons, and other parameters of approximation with ANNs, adhering to the underlying theory while at the same time being fast and precise. This was attained by utilizing a novel numerical scheme, using the classical ANN representation, division of the studied dataset into small neighborhoods, and matrix manipulations for the calculation of the sought weights. The results are remarkably accurate, achieving state-of-the-art errors on the test set of known datasets such as MNIST for computer vision, and of complex nonlinear functions for regression, while the computational time is kept low. Interestingly, the same algorithmic scheme may be applied for the solution of Partial Differential Equations (PDEs) appearing in Physics, Engineering, Financial Sciences, etc. The paper is organized as follows. In section
2, we present the formulation of the method. The basic formulation is concisely presented in sections 2.1.1 and 2.1.2. In section 2.2 we extend the implementation utilizing radial basis functions, in section 2.3 we present the application of the method to the approximation of derivatives, and in section 2.4 to the solution of PDEs. In section 2.5 we transform the basic scheme to Deep Networks, and in section 2.6 to Ensembles. In section 3, we present the results obtained from the numerical experiments, followed by a Discussion in section 4. The implementation of the method is available as open-source computer code on GitHub (Appendix I), written in the Julia [bezanson2017julia] programming language.
2 The ANNBN Method
Let X be some given data of m input variables in n observations, with corresponding responses y. The Universal Approximation Theorem [Tadeusiewicz1995, Cybenko1989] ensures the existence of an integer N, such that

ŷ(x) = Σ_{i=1..N} v_i σ(w_i · x + b_i) + b_0,

with approximation errors between the given responses y and the corresponding simulated ŷ arbitrarily low; where N is the number of neurons, w_i and b_i the approximation weights and bias term for neuron i, v_i and b_0 the approximation weights and bias for the linear summation over the neurons, and x the given input. The presented method is based on the segmentation of the given dataset into small clusters of data, which approximate in each neighborhood the local weights w_j, later used for the calculation of the total approximation. Thus, we call the proposed method Artificial Neural Networks By Neighborhoods (ANNBN). In order to create such neighborhoods, we utilize a variety of clustering algorithms, such as k-means++, a fast variation for big datasets, and another for equally sized clusters, so as to obtain invertible matrices A_j, as per the following Equation 1. We should underline that by providing the clustering representation, or just the initial seeds (for the k-means), the results are precisely reproducible for the clustering, as well as for the entire ANNBN representation, as we will show in the following.
2.1 Shallow Networks
In Figure 1, the calculation process for the ANNBN weights is presented. The data are usually considered as a whole in ANNs, while in ANNBN we firstly consider the observations within each cluster j, and later all the observations together, as demonstrated in Figure 1 and the following mathematical formulation.
2.1.1 Calculation of the weights w_j in cluster j
Let X_j be the m + 1 observations found in cluster j, with y_j the corresponding responses, σ(t) = 1/(1 + e^(−t)) the sigmoid function, which may be selected among the variety of sigmoids, and σ^(−1)(y) = log(y/(1 − y)) the inverted sigmoid. Within cluster j, we may write σ(A_j w_j) = y_j, and by utilizing the inverse sigmoid function σ^(−1), and writing the left part of the equation in matrix form, we deduce that

A_j w_j = σ^(−1)(y_j),    (1)

where A_j = [X_j 1] is the matrix of the observations found in cluster j, augmented with a unit column for the bias term. For distinct observations, the matrix A_j is of full row rank, and if we construct clusters of m + 1 observations, A_j is a square (m + 1)×(m + 1) matrix and is invertible, so the system in Equation 1 has a unique solution. Hence, because the dimensions of A_j are small, we may rapidly calculate the approximation weights (Figure 1, left) in cluster j (corresponding to neuron j) by

w_j = A_j^(−1) σ^(−1)(y_j).    (2)

If the clusters are not equally sized, we may solve Equation 1 numerically, by utilizing Gaussian elimination, a matrix decomposition, or the pseudoinverse solution w_j = A_j^+ σ^(−1)(y_j), with w_j the vector of free parameters.
2.1.2 Calculation of the output weights v exploiting all the given observations
Following the computation of the weights w_j for each neuron j in the hidden layer, we may write for all the neurons connected with the external layer that

H v = y,    (3)

where v is the vector of output weights with length c (the number of neurons), and H is the n×c matrix with entries H_kj = σ(x̃_k · w_j), with x̃_k denoting observation k augmented with a unit entry for the bias; H contains the entire sample, in contrast with the previous step, which utilized A_j containing only the observations in cluster j. By solving the system of Equations (3) in the least-squares sense, we may compute the weights by

v = (H^T H)^(−1) H^T y,    (4)

and obtain the entire representation of the ANNBN.
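The two steps above (per-cluster solves, then the output layer) can be sketched in a few lines. The supplementary code is in Julia; the following is a minimal NumPy sketch with hypothetical names (`fit_annbn`, `predict`), and with contiguous index blocks standing in for the clustering step.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_inv(y):
    # inverse of the logistic sigmoid; y must lie strictly in (0, 1)
    return np.log(y / (1.0 - y))

def fit_annbn(X, y):
    """Minimal ANNBN sketch with equally sized clusters of m + 1 points.

    X: (n, m) inputs; y: (n,) responses normalised into (0, 1).
    Returns hidden-layer weights W (one column per neuron) and
    output-layer weights v.
    """
    n, m = X.shape
    size = m + 1                    # cluster size, so each A_j is square
    c = n // size                   # number of clusters = number of neurons
    z = sigmoid_inv(y)
    W = np.zeros((m + 1, c))
    for j in range(c):
        idx = slice(j * size, (j + 1) * size)
        A = np.hstack([X[idx], np.ones((size, 1))])   # local matrix A_j
        W[:, j] = np.linalg.solve(A, z[idx])          # Equation (2)
    H = sigmoid(np.hstack([X, np.ones((n, 1))]) @ W)  # hidden-layer outputs
    v = np.linalg.lstsq(H, y, rcond=None)[0]          # Equation (4)
    return W, v

def predict(X, W, v):
    n = X.shape[0]
    return sigmoid(np.hstack([X, np.ones((n, 1))]) @ W) @ v
```

Note that each neuron reproduces its own cluster's responses exactly through the sigmoid, while the output layer fits all observations at once by least squares.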
2.2 ANNBN with Radial Basis Functions as Kernels
The method was further expanded by using Radial Basis Functions (RBFs) for the approximation, φ(r), depending on the distances r among the observations instead of their raw values (Figure 2), again in the clusters of data instead of the entire sample. A variety of studies exist on the approximation proficiency of RBFs [Yiotis2015, Babouskos2015, wikiRadial]; however, they regard noiseless data and the entire sample, instead of neighborhoods. We should also distinguish this approach of RBFs implemented as ANNBN from Radial Basis Function Networks [schwenker2001three, park1991universal, wikiRadialN], where the centers are the clusters' means instead of collocation points, and the weights are calculated by training, instead of matrix handling. In the proposed formulation, the representation regards the distances (Figure 2) among all the observations x_i in cluster j, with dimension (features) m, and any other observation x_k in the same cluster. Accordingly, we may approximate the responses y_j in cluster j by

Φ_j λ_j = y_j,    (5)

and compute the local weights λ_j by λ_j = Φ_j^(−1) y_j.
The elements Φ_j[i,k] = φ(‖x_i − x_k‖) of matrix Φ_j represent the application of the function φ to the Euclidean distances (or norms) of the observations in cluster j. Note that the vector λ_j has length n_j for each cluster j, instead of m + 1 as for the sigmoid approach. Afterwards, similarly with the sigmoid functions, we obtain the entire representation for all clusters, as in Equations 3 and 4, for the weights of the output layer v, by solving

G v = y,    (6)

similarly with Equation 3, while the rows of the matrix G regard the observations of the entire sample and the columns are built from the collocation points found in each cluster; the contribution of cluster j at observation x_k is Σ_i λ_ji φ(‖x_k − x_i^(j)‖). After the computation of the λ_j and v, in order to interpolate for any new x (out-of-sample), we may use the same expansion, where the x_i^(j) are the RBF collocation points for the approximation, the same as in Equation 5. Hence, we may predict for out-of-sample observations by using

ŷ(x) = Σ_j v_j Σ_i λ_ji φ(‖x − x_i^(j)‖).    (7)
It is important to note that the kernel φ is applied to each element of the matrices Φ_j (Equation 5), instead of to a total row, as in Equation 1. Hence we do not need an inverted kernel (corresponding to σ^(−1) in Equation 1), while the weights λ_j apply directly by multiplication. This results in a convenient formulation for the approximation of the derivatives, as well as for the solution of PDEs.
Due to the Mairhuber–Curtis theorem [10.2307/2033359], the matrix Φ_j might be singular, and we should select an appropriate kernel for the studied data. Some examples of radial basis kernels are the Gaussian φ(r) = e^(−(cr)²), the Multiquadric φ(r) = √(1 + (cr)²), etc., where r is the distance among two points, and the shape parameter c controls the width of the function. c may take a specific value or be optimized, to attain higher accuracy for the particular dataset studied. As with the sigmoid functions, after the computation of the λ_j, we utilize Equation 6 to compute v, and obtain the entire representation.
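The per-cluster RBF solve of Equation 5 amounts to building a distance matrix, applying the kernel element-wise, and solving a square linear system. A minimal NumPy sketch follows (function names are illustrative, and the Gaussian kernel is assumed in the form e^(−(cr)²)):

```python
import numpy as np

def gaussian_rbf(r, c=1.0):
    # Gaussian kernel on distances r; c is the shape parameter
    # (assumed here in the form exp(-(c*r)**2))
    return np.exp(-(c * r) ** 2)

def pairwise_dist(A, B):
    # Euclidean distances between the rows of A and the rows of B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def fit_rbf_cluster(Xc, yc, c=1.0):
    """Equation (5) in one cluster: solve Phi @ lam = y_c, where
    Phi[i, k] = phi(||x_i - x_k||) over the cluster's collocation points."""
    Phi = gaussian_rbf(pairwise_dist(Xc, Xc), c)
    return np.linalg.solve(Phi, yc)

def rbf_predict(Xnew, Xc, lam, c=1.0):
    # interpolate at new points from the cluster's collocation points
    return gaussian_rbf(pairwise_dist(Xnew, Xc), c) @ lam
```

Since the Gaussian kernel matrix over distinct points is positive definite, the local fit interpolates the cluster's responses exactly.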
2.3 ANNBN for the Derivatives Approximation
Equation 7 offers an approximation of the sought solution, using only multiplications and additions at a particular out-of-sample x, where

ŷ(x) = Σ_j v_j Σ_i λ_ji φ(‖x − x_i^(j)‖),    (8)

which is a differentiable function with respect to any out-of-sample x, considering the m-dimensional collocation points x_i^(j) as constants.
Accordingly, we may compute any higher-order derivative of the approximated function, by utilizing Equation 7, and simply differentiating the kernel φ and multiplying by the computed weights λ_ji, for all clusters j. In particular, we may approximate the derivative with respect to dimension d, at an observation x, by

∂ŷ/∂x_d (x) = Σ_j v_j Σ_i λ_ji ∂φ/∂x_d (‖x − x_i^(j)‖),    (9)

where the derivative of the kernel is evaluated analytically by the chain rule,

∂φ/∂x_d = φ'(r) ∂r/∂x_d, with r = ‖x − x_i^(j)‖,    (10)

with x_i^(j) the collocation points of cluster j, and x the points where we compute the derivative. Because the vector λ_j applies by multiplication and summation for all the clusters (Equation 7), by differentiating φ, applying all λ_j, and finally the vector v at the particular x, we obtain the entire approximation of each partial derivative. The weights remain the same for the function and its derivatives. We should underline that the differentiation in Equation 10 holds for any dimension of x; hence, due to Equation 10, with the same formulation we derive the partial derivatives with respect to any variable, in a concise manner.
For example, if we want to approximate a function y(x) of one variable, and later compute its derivatives with respect to x, by utilizing the collocation points x_i, we may write ŷ(x) = Σ_i λ_i φ(x − x_i), and if we use as a kernel the Gaussian, φ = e^(−c²(x−x_i)²), we obtain

∂φ/∂x = −2c²(x − x_i) e^(−c²(x−x_i)²),

and hence ∂ŷ/∂x = Σ_i λ_i ∂φ/∂x. The variable x may take values from the collocation points or any other intermediate point, after the weights' calculation, in order to predict for out-of-sample observations. In empirical practice, we may select among the RBFs available in the literature, try some new ones, or optimize the shape parameter c. In Appendix I, we also provide a simple computer code for the symbolic differentiation of any selected RBF, using the SymPy [meurer2017sympy] package.
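Such kernel derivatives are conveniently obtained symbolically. A minimal SymPy sketch for the Gaussian kernel follows (along the lines of the code mentioned for Appendix I, though the supplementary implementation may differ):

```python
import sympy as sp

x, x_i, c = sp.symbols('x x_i c', real=True)
phi = sp.exp(-(c * (x - x_i)) ** 2)   # Gaussian RBF kernel (assumed form)

dphi = sp.diff(phi, x)                # first derivative w.r.t. x
d2phi = sp.diff(phi, x, 2)            # second derivative w.r.t. x

print(sp.simplify(dphi))
print(sp.simplify(d2phi))
```

The resulting expressions can be lambdified and applied to the collocation-point distances, with the weights λ kept unchanged, exactly as in Equations 9 and 10.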
Of particular interest are the Integrated RBFs (IRBFs) [Bakas2019, Babouskos2015, Yiotis2015], which are formulated from the indefinite integration of the kernel, such that the derivative of the integrated kernel is the original RBF φ. Accordingly, we may integrate the kernel more than once, to approximate the higher-order derivatives. For example, by utilizing the twice-integrated Gaussian RBF as kernel at collocation points x_i, its second derivative is the Gaussian RBF itself; hence the same weights that approximate the function through the integrated kernel approximate its second derivative through the Gaussian, instead of requiring a further differentiation of the kernel.
2.4 ANNBN for the solution of Partial Differential Equations
Correspondingly with the numerical differentiation, we may easily apply the proposed scheme to approximate numerically the solution of Partial Differential Equations (PDEs). We consider a generic differential operator L, depending on the partial derivatives of the sought solution u, for some coefficient functions, which satisfies the equality L u = f, where f might be any function of the coordinates x. We may approximate u by

u(x) ≈ Σ_j v_j Σ_i λ_ji φ(‖x − x_i^(j)‖).    (11)

By utilizing Equation 9 for the derivatives appearing in L, we constitute a system of linear equations. Hence, the weights may be calculated by solving the resulting system, as per Equation 5.
For example, we consider Laplace's equation in the generic form ∇²u = ∂²u/∂x² + ∂²u/∂y² = 0. The weights λ in Equation 11 are constant, hence the differentiation regards only the kernel φ. Thus, by writing the equation for all interior collocation points x found in cluster j, we obtain

Σ_i λ_ji ∇²φ(‖x − x_i^(j)‖) = 0.    (12)

Because the weights are the same for the approximated function and its derivatives, we may apply the boundary conditions for the function u, or its derivatives, at some boundary points x_b with prescribed values g(x_b), by using

Σ_i λ_ji φ(‖x_b − x_i^(j)‖) = g(x_b);    (13)

hence, we may compute the weights λ, by solving the resulting system of equations

[∇²Φ; Φ_b] λ = [0; g],    (14)

where Φ_b contains the kernel values at the boundary points.
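To make the assembly of Equations 12–14 concrete, the following sketch solves the 1D analogue, a Poisson problem u'' = f with homogeneous boundary conditions, by global RBF collocation (a single cluster; the names, shape parameter, and grid size are illustrative choices, not the supplementary implementation):

```python
import numpy as np

def phi(r, c):
    # Gaussian kernel (assumed form exp(-(c*r)**2))
    return np.exp(-(c * r) ** 2)

def phi_xx(r, c):
    # second derivative of the Gaussian kernel w.r.t. x, with r = x - x_k
    # (obtainable symbolically, e.g. with SymPy)
    return (4 * c**4 * r**2 - 2 * c**2) * phi(r, c)

c = 10.0                                  # illustrative shape parameter
xk = np.linspace(0.0, 1.0, 20)            # collocation points (one cluster)
f = -(np.pi ** 2) * np.sin(np.pi * xk)    # source, so the exact u = sin(pi*x)

R = xk[:, None] - xk[None, :]             # signed distances x_i - x_k
A = phi_xx(R, c)                          # interior rows: u''(x_i) = f(x_i)
A[0, :] = phi(R[0, :], c)                 # boundary row: u(0) = 0
A[-1, :] = phi(R[-1, :], c)               # boundary row: u(1) = 0
b = f.copy()
b[0] = 0.0
b[-1] = 0.0

lam = np.linalg.solve(A, b)               # weights, as in Equation 14
u = phi(R, c) @ lam                       # recovered solution at the nodes
```

The matrix stacks differentiated-kernel rows for the interior points and plain-kernel rows for the boundary, mirroring the block structure of Equation 14.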
2.5 Deep Networks
A method for the transformation of shallow ANNBNs to Deep Networks is also presented. Although shallow networks exhibited vastly high accuracy even for unstructured and complex datasets, Deep ANNBNs may be utilized for research purposes, for example at the intersection of neuroscience and artificial intelligence. After the calculation of the weights for the first layer, we use them to create a second layer (Figure 1), where each node corresponds to the given responses y. We then use the same procedure for each neuron of the new layer, by solving

A^(l) w^(l) = σ^(−1)(y),    (15)

with respect to w^(l), where σ^(−1) implies the element-wise application of the inverse sigmoid to y. This procedure is iterated for all neurons of layer l. The matrix A^(l) corresponds to the outputs of the preceding layer. Finally, we calculate the output layer's linear weights v, as per Equation 3. This procedure results in a good initialization of the weights, close to the optimal solution, and if we normalize y in a range close to the linear part of the sigmoid function, we rapidly obtain a deep network with approximately equal errors to the shallow one. Afterwards, any optimization method may be applied supplementarily to compute the final weights; however, the accuracy is already high.
Alternatively, we may utilize the obtained layer of the shallow implementation of ANNBN (Equation 3) as an input for another layer, then for a third, and sequentially up to any desired number of layers.
2.6 Ensembles
By randomly subsampling a percentage of the observations, running the ANNBN algorithm multiple times, and averaging the results with respect to the errors of each iteration (fold), we may constitute an Ensemble of ANNBNs (Figure (b)). Ensembles of ANNBNs exhibited increased accuracy and generalization properties for noisy data, as per the following numerical experiments.
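The ensemble procedure can be sketched generically: subsample, fit a base learner, and weight each fold's prediction by its inverse training error. The helper below is illustrative (the `fit`/`predict` callables and the weighting scheme are assumptions standing in for any ANNBN variant):

```python
import numpy as np

def ensemble_predict(fit, predict, X, y, Xnew, n_folds=10, frac=0.8, seed=0):
    """Sketch of an ANNBN-style ensemble: subsample the observations,
    fit the base learner on each fold, and average the fold predictions
    weighted by each fold's inverse training error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds, weights = [], []
    for _ in range(n_folds):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        model = fit(X[idx], y[idx])
        # mean absolute error on the fold's own training subsample
        err = np.mean(np.abs(predict(X[idx], model) - y[idx])) + 1e-12
        preds.append(predict(Xnew, model))
        weights.append(1.0 / err)
    return np.average(np.array(preds), axis=0, weights=np.array(weights))
```

Fixing the seed keeps the subsampling, and hence the whole ensemble, exactly reproducible, in line with the reproducibility claims of the method.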
2.7 Time Complexity of the ANNBN algorithm
The training of an ANN with two layers and three nodes only is proved to be NP-Complete in [blum1989training], if the nodes compute linear threshold functions of their inputs. Even simple cases and approximating hypotheses result in NP-complete problems [engel2001complexity]. Apart from the theoretical point of view, the slow speed of learning algorithms is a major flaw of ANNs. To the contrary, ANNBNs are fast, because the main part of the approximation regards operations with small square matrices of size (m + 1)×(m + 1), with m the number of features. We provide here a theoretical investigation of ANNBNs' time complexity, which may be empirically validated by running the supplementary code. More specifically, the computational effort of ANNBNs regards the following steps.
Definition 1. The ANNBN algorithm comprises three steps: (a) clustering of the n observations into c clusters, (b) inversion of the local matrices A_j for the calculation of the hidden-layer weights (Equation 2), and (c) solution of the output-layer system (Equation 3).
Definition 2. Let n be the number of observations, m the number of features, and suppose we use equally sized clusters. The number of clusters c, which is equal to the number of neurons, will be

c = n / (m + 1),    (16)

where the addition of 1 corresponds to the unit (bias) column in Equation 1. Note that the number of clusters c is equal to the number of neurons as well (Equation 1 and Figure 1). This is the maximum number of clusters; otherwise the matrices A_j are not invertible, and Equation 1 has more than one solution. Hence we investigate the worst case in terms of computational time, while in practice c may be smaller. We denote by i the number of iterations needed until convergence of the clustering, which in practical applications is small, making the clustering fast.
Lemma 1.
Time complexity of step (a) is O(n c m i) = O(n² i).
Proof.
The running time of Lloyd's algorithm (and most variants) is O(n c m i) [hartigan1979algorithm, Manning:2008:IIR:1394399]. We should note that Lloyd's algorithm is, in the worst case, superpolynomial [blomer2016theoretical, Arthur:2006:SKM:1137856.1137880], and hence so is the ANNBN algorithm; however, in practice the algorithm converges for a small number of iterations i.
In particular, the running time of the k-means++ variant is of the same order [arthur2007k]. From Equation 16, with c = n/(m + 1), we derive that the running time of step (a) is of O(n² i). ∎
Lemma 2.
Time complexity of step (b) is O(c (m + 1)³) = O(n (m + 1)²).
Proof.
Time complexity of step (b) regards the inversion of matrices with size (m + 1)×(m + 1) (Equation 1). This is repeated c times; hence, using Equation 16, the complexity is O(c (m + 1)³) = O(n (m + 1)²). ∎
Lemma 3.
Time complexity of step (c) is O(c² n + c³).
Proof.
Step (c) regards the solution of an n×c system of equations (Eq. 3). We may solve with respect to v by v = (H^T H)^(−1) H^T y. Hence the complexity regards the multiplication of H^T with H, of O(c² n); its inversion, with complexity O(c³); as well as the multiplication of H^T with y, with complexity O(c n); and of (H^T H)^(−1) with H^T y, with complexity O(c²). Thus, the total complexity is O(c² n + c³). ∎
Theorem 1.
(ANNBN Complexity) The running time of the ANNBN algorithm is the sum of steps (a), (b), and (c), that is, O(n c m i + c (m + 1)³ + c² n + c³).
3 Validation Results
3.1 1D Function approximation & geometric point of view
We consider a simple one-dimensional function to present the basic functionality of ANNBNs. Because the inverse sigmoid is unstable for responses near 0 and 1, we normalize the responses within this domain. In Figure 5, the approximation of the function is depicted, demonstrating the effect of varying the number of neurons utilized in the ANNBN. We may see that by increasing the number of neurons from 2 to 4 and 8, the approximating ANNBN exhibits more curvature alterations. This complies with the Universal Approximation Theorem, and offers a geometric point of view. Interestingly, the results are not affected by adding some random noise, as the Mean Absolute Error (MAE) in this noisy dataset was small for the train set and even smaller for the test set, further indicating the capability of ANNBN to approximate the hidden signal and not the noise. We should note that for noiseless data of 100 observations and 50 neurons, the M.A.E. was low in both the train and test sets. The approximation of the same function with a Gaussian RBF and a tuned shape parameter c results in similarly low M.A.E. for the train and test sets.
3.2 Regression in Multiple Dimensions
We consider the function of five variables,
We create a train set of the five variables, compute the responses y, add some random noise, and normalize y. Then we create a test set with an equal number of observations as the train set, and compute y without adding random noise. Thus, we may check the capability of ANNBN to approximate the signal and not the noise. The results are presented in Table 1, indicating the great accuracy achieved with ANNBNs. The comparison with other methods regards Random Forests as implemented in [DecisionTree], XGBoost [XGBoost], and AdaBoost from Scikit-Learn [ScikitLearn]. Table 1 also presents the similar results, in terms of approximation errors, obtained for the highly nonlinear Griewank function [griewank1981generalized], with some random noise added. With RBF ANNBNs, we may use a higher number of clusters, and hence neurons, as the matrices of Equation 5 are always square. Accordingly, we may approximate this nonlinear, noisy function with few observations with respect to the number of features, with vastly low errors, as demonstrated in Figure 5 and Table 1.
Mean Absolute Errors  Random Forests  AdaBoost  XGBoost  ANNBN 

Griewank. 
3.3 Classification for Computer Vision
As highlighted in the introduction, the reproducibility of AI research is a major issue. We utilize ANNBN for the MNIST database [lecun1998gradient, mnistweb], obtained from [MLDatasets], consisting of handwritten digits 0–9, with 60,000 images for training and 10,000 for testing. The investigation regards a variety of ANNBN formulations, and the comparison with other methods. In particular, a variety of sigmoid activation functions, with their corresponding inverses, were utilized for Equation 1. We constructed ANNBNs with one and multiple layers, varying the number of neurons and the normalization range of y. The results regard separate training for each digit. All results in Table 2 are obtained without any clustering. We consider as accuracy metric the percentage of Correctly Classified (CC) digits, divided by the number of observations.
The aim of this investigation was to compare ANNBN with standard ANN algorithms such as Flux [Flux], as well as with Random Forests as implemented in [DecisionTree], and XGBoost [XGBoost]. Table 2 presents the results in terms of accuracy and computational time. The models are trained on the raw dataset, without any exploitation of spatial information. The results in Table 2 are exactly reproducible in terms of accuracy, as no clustering was utilized and the indexes are taken into account in ascending order. For example, the running time to train 5000 neurons is 29.5 seconds on average for each digit, which is fast, considering that the training regards the weights for 60,000 instances and 784 features. Also, the Deep ANNBNs with 10 layers of 1000 neurons each are trained in the vastly short time of 91 seconds per digit on average (Table 2). Correspondingly, in Table 2, we compare the accuracy and running time with Random Forests and XGBoost (200 rounds). Future steps might include data preprocessing and augmentation, exploitation of spatial information as in CNNs, higher accuracy by utilizing clustering for the neighborhoods' training, Ensembles and other combinations of ANNBNs, as well as further training of the initial ANNBN with an optimizer such as stochastic gradient descent. No GPU or parallel programming was utilized, which might also be a topic for future research. For example, the RBF implementation of ANNBN with clustering exhibits a test-set accuracy of 99.7% for one of the digits. The accuracy results regard the out-of-sample test set of 10,000 digits. The running time was measured on an Intel i7-6700 CPU @3.40GHz with 32GB memory and an SSD hard disk. A computer code to feed the calculated weights into Flux [Flux] is provided.
Correct Classified (%)  Digit Label  

0  1  2  3  4  5  6  7  8  9  
Random Forests [DecisionTree]  
XGBoost [XGBoost]  
Flux ANN [Flux]  
ANNBN  
ANNBN  
ANNBN  
ANNBN  
Deep ANN  
Running Time (sec)  Digit Label  
0  1  2  3  4  5  6  7  8  9  
Random Forests [DecisionTree]  
XGBoost [XGBoost]  
Flux ANN [Flux]  
ANNBN  
ANNBN  
ANNBN  
ANNBN  
Deep ANN 

1 hidden layer with Neurons, Activation Function (AF) , and .

1 hidden layer with Neurons, AF , and .

1 hidden layer with Neurons, AF , and .

1 hidden layer with Neurons, AF , and .

10 hidden layers with Neurons each, AF , and .

fastest design; highest accuracy.

The ANNBN accuracy results are exactly reproducible with the supplementary computer code. All training examples utilize the raw MNIST database, without any preprocessing or data augmentation. The accuracy (%) regards the out-of-sample test set of MNIST with 10,000 handwritten digits.
3.4 Solution of Partial Differential Equations
We consider Laplace's equation [Brady2005], ∇²u = 0, in a rectangle with dimensions (a, b), and prescribed boundary conditions on the four edges of the rectangle. In Figure (a), the numerical solution as well as the exact closed-form solution are presented. The MAE between the closed-form solution and the numerical solution with ANNBN was found to be small. Interestingly, if we add some random noise ε to the zero source, that is, solve for

∇²u = ε,    (17)

the MAE remains small, for a = b = 1 and a rectangular grid of points. It is important to underline that numerical methods for the solution of partial differential equations are highly sensitive to noise [mai2003approximation, Bakas2019], as it spoils the derivatives. However, by utilizing the ANNBN solution, the results change only slightly, as described by the above errors. This is further highlighted if we utilize the calculated weights of the ANNBN approximation and compute the second-order partial derivatives of the solution of Equation 17; the corresponding MAE for the second-order partial derivatives (Figure (b)) is about two orders of magnitude less than the added noise, implying that ANNBN approximates the signal and not the noise, even in PDEs, and even with a stochastic source.
4 Discussion
As described in the formulation of the proposed method, we may use a variety of ANNBN schemes, such as the sigmoid or Radial Basis Function scheme, Ensembles of ANNBNs, Deep ANNBNs, etc. The method adheres to the theory of function approximation with ANNs, as per visual representations of ANNs' capability to approximate continuous functions [Nielsen2015, rojas2013neural]. We explained the implementation of the method in the presented illustrative examples, which may be reproduced with the provided computer code. In general, sigmoid functions are faster, RBFs are more accurate, and Ensembles of either sigmoids or RBFs handle noisy datasets better. RBFs may use clusters with fewer than m + 1 observations, and hence approximate datasets with limited observations and many features. The overall results are stimulating in terms of speed and accuracy, compared with the literature and state-of-the-art methods.
The approximation of the partial derivatives and the solution of PDEs, with or without a noisy source, in a fast and accurate manner, offer a solid step towards the unification of Artificial Intelligence algorithms with Numerical Methods and Scientific Computing. Future research may consider the implementation of ANNBNs in specific AI applications such as Face Recognition, Reinforcement Learning, Text Mining, etc., as well as Regression Analyses, Predictions, and the solution of other types of PDEs. Further topics include the investigation of sigmoid functions other than the logistic, as well as of other RBFs, such as multiquadrics, integrated RBFs, etc., and the selection of an optimal shape parameter c for even higher accuracy. Finally, although the weights' computation is remarkably fast, the algorithm may easily be converted to a parallel one, as the weights' computation for each neuron regards c independent inversions of small matrices A_j.
Interpretable AI is a modern demand in science, and ANNBNs are inherently suitable for this scope: by checking the approximation errors of the neurons in each cluster, we may retrieve information on the local accuracy, as well as on local and global nonlinearities in the data. Furthermore, as demonstrated in the examples, the method is proficient for small datasets, without overfitting, by approximating the signal and not the noise, which is a common problem of ANNs.
5 Conclusions
From Newton's laws of motion, Leibniz's vis viva, and Quantum Mechanics, to modern Artificial Intelligence algorithms which mimic human intelligence, a core problem is confronted: to construct a mathematical model which approximates the measured data of a physical, engineering, social, etc., system. The evolution of scientific research enhances such models, aiming to attain the best possible performance in terms of prediction accuracy and computational time. Scientists and engineers often utilize a variety of numerical models to attain the sought scope, depending on the characteristics of the given problem and data. In this work, we presented a unified approach for problems involving the mathematical modeling of an unknown system, from function approximation and the solution of partial differential equations, to computer vision, regression, and classification tasks. The method adheres precisely to the theory of approximating functions with bounded activation functions. The validation results demonstrate the proficiency of the proposed method regarding accuracy and speed.
6 Appendix I: Computer Code
The presented method is implemented in the Julia [bezanson2017julia] language. The corresponding computer code is available at https://github.com/nbakas/ANNBN.jl
Nomenclature:
w_j: weights of the first layer for neuron j, in cluster j
n: number of observations
m: number of variables (features)
X: given data
c: number of clusters, equal to the number of neurons
v, b_0: approximation weights and bias for the external layer
X_j: local observations in cluster j
n_j: number of observations found in cluster j
A_j: local matrix of observations (in cluster j)
x_i: observations in the cluster
i, k, j: iterators for observations, features, and clusters
x: out-of-sample point
CC: Correctly Classified observations
number of folds for Ensembles
b_j: bias for neuron j
L: number of layers