Training-Free Artificial Neural Networks

09/30/2019
by Nikolaos P. Bakas, et al.
Neapolis University Pafos
Abstract

We present a numerical scheme for the computation of Artificial Neural Networks' weights, without a laborious iterative procedure. The proposed algorithm adheres to the underlying theory, is highly fast, and results in remarkably low errors when applied to regression and classification of complex data-sets, such as the Griewank function of multiple variables x∈R^100 with random noise added, and the MNIST database for handwritten-digit recognition, with 7×10^4 images. Interestingly, the same mathematical formulation is found capable of approximating highly nonlinear functions in multiple dimensions, with low errors (e.g. 10^-10) on the test set, for the unknown functions and their higher-order partial derivatives, as well as of numerically solving Partial Differential Equations. The method is based on the calculation of the weights of each neuron in small neighborhoods of data, such that the corresponding local approximation matrix is invertible. Accordingly, hyperparameter optimization is not necessary, as the number of neurons stems directly from the dimensions of the data, further improving the algorithmic speed. Overfitting is inherently eliminated, and the results are interpretable and reproducible. The complexity of the proposed algorithm is of class P, with O(mn^3) computing time, that is, linear in the number of observations and cubic in the number of features, in contrast with the NP-Complete class of standard algorithms for training ANNs. The performance of the method is high for small as well as big datasets, and the test-set errors are similar to or smaller than the training errors, indicating generalization efficiency. The supplementary computer code, in the Julia language, may reproduce the validation examples and run on other data-sets.

1 Introduction

Artificial Intelligence (AI) has been broadening its numerical methods as well as its fields of application; however, empirical rigor is not keeping pace with such advancements [sculley2018winner's], and researchers have been questioning the accuracy of iterative algorithms [Hutson2018]. Furthermore, the results obtained for a certain problem are not reproducible if we re-run the algorithm on the same problem [Hutson725, belthangady2019applications]. In theory, Artificial Neural Networks (ANNs) are capable of approximating any continuous function [Hassoun2005]; however, apart from existence, the theory does not offer a rigorous procedure to calculate the approximation parameters. Hence, iterative algorithms [ruder2016overview] are applied for the calculation of the weights, aiming to attain the minimum possible approximation errors. However, because this optimization problem might have more than one local minimum with respect to the weights, and because the required iterations might be computationally costly, enhanced methods such as stochastic gradient descent [bottou2010large, johnson2013accelerating] have been developed. Overfitting, that is, approximating the given data with high accuracy but failing to generalize the predictions, is a major flaw, and methods such as dropout [srivastava2014dropout] try to confront it. Over and above, the number of computational neurons is often selected arbitrarily, while the so-called hyper-parameter optimization [bergstra2011algorithms, bergstra2012random, Feurer2019], which actually regards the solution of an optimization problem that depends on another optimization problem (the laborious calculation of the weights), results in extended computational time.

The purpose of this work was to develop a numerical scheme for the calculation of the optimal weights, the number of neurons, and the other parameters of approximation with ANNs, by adhering to the underlying theory while remaining fast and precise. This was attained by utilizing a novel numerical scheme, using the classical ANN representation, division of the studied data-set into small neighborhoods, and matrix manipulations for the calculation of the sought weights. The results are remarkably accurate, achieving state-of-the-art errors on the test set of known data-sets such as MNIST for computer vision and of complex nonlinear functions for regression, while the computational time is kept low. Interestingly, the same algorithmic scheme may be applied to the solution of Partial Differential Equations (PDEs), appearing in Physics, Engineering, Financial Sciences, etc. The paper is organized as follows. In Section 2, we present the formulation of the method. The basic formulation is concisely presented in Sections 2.1.1 and 2.1.2. In Section 2.2 we extend the implementation utilizing radial basis functions, in Section 2.3 we present the application of the method to the approximation of derivatives, and in Section 2.4 to the solution of PDEs. In Section 2.5 we transform the basic scheme into Deep Networks, and in Section 2.6 into Ensembles. In Section 3, we present the results obtained from the numerical experiments, followed by a Discussion in Section 4. The implementation of the method is available as open-source computer code on GitHub (Appendix I), written in the Julia [bezanson2017julia] programming language.

2 The ANNBN Method

Let $\mathbf{X} \in \mathbb{R}^{m \times n}$ be some given data of $n$ input variables in $m$ observations, with responses $\mathbf{y} \in \mathbb{R}^{m}$. The Universal Approximation Theorem [Tadeusiewicz1995, Cybenko1989] ensures the existence of an integer $k$, such that

$$\hat{y}(\mathbf{x}) = v_0 + \sum_{i=1}^{k} v_i\,\sigma\!\left(\mathbf{w}_i^{T}\mathbf{x} + b_i\right),$$

with approximation errors between the given responses $y$ and the corresponding simulated $\hat{y}$ arbitrarily low; where $k$ is the number of neurons, $\mathbf{w}_i$ and $b_i$ the approximation weights and bias term for neuron $i$, $v_i$ and $v_0$ the approximation weights and bias for the linear summation over the neurons, and $\mathbf{x}$ the given input. The presented method is based on the segmentation of the given data-set into small clusters of data, in each of which the local weights $\mathbf{w}_i$, $b_i$ are computed and later used for the calculation of the total approximation. Thus, we call the proposed method Artificial Neural Networks By Neighborhoods (ANNBN). In order to create such neighborhoods, we utilize a variety of clustering algorithms, such as $k$-means++, a fast variation for big datasets, and another for equally sized clusters, so as to obtain invertible matrices $\mathbf{A}_i$, as per Equation 1 below. We should underline that, by providing the clustering representation or just the initial seeds (for the $k$-means), the results are precisely reproducible for the clustering, as well as for the entire ANNBN representation, as we will show in the following.
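A minimal sketch of the clustering step is given below, assuming the Clustering.jl package and plain k-means++ seeding; the paper's own code may use a different k-means variant or an equal-size clustering routine, and the function name cluster_indices is illustrative rather than the paper's.

using Clustering

# X: m×n matrix of observations (rows) by features (columns);
# k: number of clusters = number of neurons, e.g. k = m ÷ (n + 1)
function cluster_indices(X::AbstractMatrix, k::Integer)
    R = kmeans(permutedims(X), k; init = :kmpp)   # Clustering.jl expects features × observations
    a = assignments(R)                            # cluster label of each observation
    return [findall(==(c), a) for c in 1:k]       # vector of index sets, one per cluster
end

Fixing the seeding (or storing the assignments) makes the clustering, and hence the whole ANNBN fit, reproducible, as noted above.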

2.1 Shallow Networks

In Figure 1, the calculation process for the ANNBN weights is presented. In a standard ANN the data are usually considered as a whole, while in ANNBN we first consider the observations within each cluster $i$ and afterwards the entire sample, as demonstrated in Figure 1 and in the following mathematical formulation.

Figure 1: Numerical scheme for the calculation steps of the ANNBN weights; initially the weights $\mathbf{w}_i$, $b_i$ for each neuron, computed from the observations of cluster $i$, and afterwards the output weights $\mathbf{v}$, $v_0$ for the entire network.

2.1.1 Calculation of $\mathbf{w}_i$ and $b_i$ in the $i^{th}$ cluster

Let $\mathbf{X}_i \in \mathbb{R}^{m_i \times n}$ be the $m_i$ observations found in the $i^{th}$ cluster, with responses $\mathbf{y}_i$, and let $\sigma$ be the activation function, which may be selected among the variety of sigmoids, such as the logistic function

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

with inverse $\sigma^{-1}(y) = \ln\!\left(\frac{y}{1-y}\right)$. Within the $i^{th}$ cluster, we may write

$$\sigma\!\left(\mathbf{w}_i^{T}\mathbf{x}_j + b_i\right) = y_j, \quad j = 1, \dots, m_i,$$

and by utilizing the inverse sigmoid function $\sigma^{-1}$, and writing the left part of the Equation in matrix form, we deduce that

$$\mathbf{A}_i \begin{bmatrix} b_i \\ \mathbf{w}_i \end{bmatrix} = \sigma^{-1}(\mathbf{y}_i), \qquad (1)$$

where $\mathbf{A}_i = [\mathbf{1} \;\; \mathbf{X}_i]$ is the matrix of the observations found in the $i^{th}$ cluster, augmented with a unit column accounting for the bias term. For distinct observations, the matrix $\mathbf{A}_i$ is of full row rank, and if we construct clusters with $m_i = n+1$ observations each, the matrix $\mathbf{A}_i$ is square and invertible, and the system in Equation 1 has a unique solution. Hence, because the dimensions of $\mathbf{A}_i$ are small, $(n+1)\times(n+1)$, we may rapidly calculate the approximation weights (Figure 1, left) in the $i^{th}$ cluster (corresponding to the $i^{th}$ neuron) by

$$\begin{bmatrix} b_i \\ \mathbf{w}_i \end{bmatrix} = \mathbf{A}_i^{-1}\, \sigma^{-1}(\mathbf{y}_i). \qquad (2)$$

If the clusters are not equally sized, we may solve Equation 1 numerically, by utilizing Gaussian elimination, QR decomposition, or the pseudo-inverse $\mathbf{A}_i^{+}$, for the vector of free parameters $[b_i \;\; \mathbf{w}_i^{T}]^{T}$.
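The per-cluster computation of Equations 1 and 2 may be sketched as follows, assuming the logistic sigmoid and responses already normalized inside (0, 1); the names σinv and local_weights are illustrative, not the paper's own.

using LinearAlgebra

σ(x)    = 1 / (1 + exp(-x))
σinv(y) = log(y / (1 - y))              # inverse of the logistic sigmoid

# Xi: mᵢ×n observations of one cluster; yi: responses normalized inside (0, 1)
function local_weights(Xi::AbstractMatrix, yi::AbstractVector)
    Ai = hcat(ones(size(Xi, 1)), Xi)    # unit column accounts for the bias term
    θ  = Ai \ σinv.(yi)                 # exact solve if mᵢ = n+1, least squares otherwise
    return θ[1], θ[2:end]               # bias bᵢ and weight vector wᵢ
end

The backslash solve covers both the equally sized case (direct inversion) and the unequally sized case (QR-based least squares), matching the remark above.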

2.1.2 Calculation of $\mathbf{v}$ and $v_0$, exploiting all the given observations

Following the computation of the weights $\mathbf{w}_i$ and $b_i$ for each neuron $i$ of the hidden layer, we may write, for all the neurons connected with the output layer, that

$$\mathbf{H}\begin{bmatrix} v_0 \\ \mathbf{v} \end{bmatrix} = \mathbf{y}, \quad \text{with} \quad \mathbf{H} = \left[\mathbf{1} \;\; \sigma\!\left(\mathbf{X}\mathbf{w}_1 + b_1\right) \;\; \cdots \;\; \sigma\!\left(\mathbf{X}\mathbf{w}_k + b_k\right)\right], \qquad (3)$$

where $\mathbf{v}$ is the vector of output weights, with length $k$, and $\mathbf{X}$ is the matrix containing the entire sample, in contrast with the previous step, which utilized only the observations of cluster $i$. By solving the system of Equations 3 in the least-squares sense, we may compute the output weights by

$$\begin{bmatrix} v_0 \\ \mathbf{v} \end{bmatrix} = \left(\mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{y}, \qquad (4)$$

and obtain the entire representation of the ANNBN.
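Putting the two steps together, a hedged sketch of the full shallow ANNBN fit (Equations 3-4) could look as follows; it reuses cluster_indices, local_weights and σ from the sketches above, assumes m ≥ n + 1 and responses normalized inside (0, 1), and the function names are illustrative.

function fit_annbn(X::AbstractMatrix, y::AbstractVector)
    m, n = size(X)
    k    = m ÷ (n + 1)                                  # number of neurons = number of clusters
    idx  = cluster_indices(X, k)
    W    = [local_weights(X[I, :], y[I]) for I in idx]  # (bᵢ, wᵢ) for each neuron
    H    = hcat(ones(m), [σ.(X * w .+ b) for (b, w) in W]...)
    v    = H \ y                                        # output-layer weights v₀, v₁, …, v_k
    return W, v
end

predict_annbn(W, v, Xnew) =
    hcat(ones(size(Xnew, 1)), [σ.(Xnew * w .+ b) for (b, w) in W]...) * v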

2.2 ANNBN with Radial Basis Functions as Kernels

The method was further expanded by using Radial Basis Functions (RBFs) for the approximation, $\varphi(r)$, depending on the distances $r$ among the observations, instead of their raw values (Figure 2), again within the clusters of data, instead of the entire sample. A variety of studies exist on the approximation proficiency of RBFs [Yiotis2015, Babouskos2015, wikiRadial]; however, they regard noiseless data and the entire sample, instead of neighborhoods. We should also distinguish this approach of RBFs implemented as ANNBN from Radial Basis Function Networks [schwenker2001three, park1991universal, wikiRadialN], where the centers are the clusters' means instead of collocation points, and the weights are calculated by training, instead of matrix handling. In the proposed formulation, the representation regards the distances (Figure 2) among all the observations $\mathbf{x}_j$ in cluster $i$, with dimension (features) $n$, and any other observation $\mathbf{x}_l$ in the same cluster. Accordingly, we may approximate the responses in the $i^{th}$ cluster by

$$\mathbf{\Phi}_i \mathbf{w}_i = \mathbf{y}_i, \qquad (5)$$

and compute $\mathbf{w}_i$ by $\mathbf{w}_i = \mathbf{\Phi}_i^{-1}\mathbf{y}_i$. The elements $\varphi(\lVert\mathbf{x}_j-\mathbf{x}_l\rVert)$ of the matrix $\mathbf{\Phi}_i$ represent the application of the kernel $\varphi$ to the Euclidean distances (or norms) of the observations in the $i^{th}$ cluster. Note that the vector $\mathbf{w}_i$ has length $m_i$ for each cluster $i$, instead of $n+1$ for the sigmoid approach. Afterwards, similarly with the sigmoid functions, we obtain the entire representation for all clusters, similarly with Equations 3 and 4, for the weights of the output layer $\mathbf{v}$, by solving

$$\mathbf{H}\mathbf{v} = \mathbf{y}, \qquad (6)$$

similarly with Equation 3, while the rows of the matrices $\tilde{\mathbf{\Phi}}_i$ contain the observations of the entire sample and the columns the collocation points found in each cluster, so that the $i^{th}$ column of $\mathbf{H}$ is $\tilde{\mathbf{\Phi}}_i\mathbf{w}_i$. For the calculation of the output weights we use $\mathbf{v} = (\mathbf{H}^{T}\mathbf{H})^{-1}\mathbf{H}^{T}\mathbf{y}$, and after the computation of $\mathbf{w}_i$ and $\mathbf{v}$, in order to interpolate for any new $\mathbf{x}$ (out-of-sample), we may use the kernel values $\varphi(\lVert\mathbf{x}-\mathbf{c}_{il}\rVert)$, where $\mathbf{c}_{il}$ are the RBF collocation points for the approximation, the same as in Equation 5. Hence, we may predict for out-of-sample observations by using

$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{k} v_i \sum_{l=1}^{m_i} w_{il}\,\varphi\!\left(\lVert\mathbf{x}-\mathbf{c}_{il}\rVert\right). \qquad (7)$$

It is important to note that the kernel is applied to each element of the matrices $\mathbf{\Phi}_i$ (Equation 5), instead of the total row, as per Equation 1. Hence, we do not need the inverted activation function (corresponding to $\sigma^{-1}$ in Equation 1), while the weights $\mathbf{w}_i$ apply directly by multiplication. This results in a convenient formulation for the approximation of the derivatives, as well as the solution of PDEs.

Due to the Mairhuber–Curtis theorem [10.2307/2033359], the matrix $\mathbf{\Phi}_i$ might be singular, and we should select an appropriate kernel for the studied data. Some examples of radial basis kernels are the Gaussian $\varphi(r)=e^{-(\varepsilon r)^{2}}$, the Multiquadric $\varphi(r)=\sqrt{1+(\varepsilon r)^{2}}$, etc., where $r$ is the distance between two points, and the shape parameter $\varepsilon$ controls the width of the function. $\varepsilon$ may take a specific value or be optimized, to attain higher accuracy for the particular data-set studied. As with the sigmoid functions, after the computation of $\mathbf{w}_i$, we utilize Equation 6 to compute $\mathbf{v}$, and obtain the entire representation.

Figure 2: The $i^{th}$ cluster of the Radial ANNBN.
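A minimal sketch of the local RBF system of Equation 5 with a Gaussian kernel follows; the kernel choice, the shape parameter ε and all names are illustrative assumptions.

using LinearAlgebra

gaussian_rbf(r, ε) = exp(-(ε * r)^2)

# Xi: mᵢ×n observations of one cluster (also used as collocation points),
# yi: the corresponding responses
function local_rbf_weights(Xi::AbstractMatrix, yi::AbstractVector; ε = 1.0)
    mᵢ = size(Xi, 1)
    Φ  = [gaussian_rbf(norm(Xi[j, :] - Xi[l, :]), ε) for j in 1:mᵢ, l in 1:mᵢ]
    return Φ \ yi                       # local weights wᵢ, one per collocation point
end

# Kernel responses of cluster i at arbitrary evaluation points X (used in Equations 6-7)
rbf_features(X, Ci, wᵢ; ε = 1.0) =
    [gaussian_rbf(norm(X[j, :] - Ci[l, :]), ε) for j in 1:size(X, 1), l in 1:size(Ci, 1)] * wᵢ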

2.3 ANNBN for the Derivatives Approximation

Equation 7 offers an approximation of the sought solution, using multiplications and additions on the particular distances $r_{jl}$, where

$$r_{jl} = \lVert \mathbf{x}_j - \mathbf{c}_l \rVert, \qquad (8)$$

which is a differentiable function with respect to any out-of-sample $\mathbf{x}_j$, considering the $n$-dimensional collocation points $\mathbf{c}_l$ as constants.

Accordingly, we may compute any higher-order derivative of the approximated function by utilizing Equation 7, simply differentiating the kernel $\varphi$, and multiplying by the computed weights $w_{il}$, for all clusters $i$. In particular, we may approximate the derivative with respect to the $d^{th}$ dimension, at the observation $\mathbf{x}_j$, by

$$\frac{\partial \hat{y}}{\partial x_d}(\mathbf{x}_j) = \sum_{i=1}^{k} v_i \sum_{l=1}^{m_i} w_{il}\, \frac{\partial \varphi(r_{jl})}{\partial x_d}, \qquad (9)$$

where

$$\frac{\partial \varphi(r_{jl})}{\partial x_d} = \frac{\mathrm{d}\varphi}{\mathrm{d}r}(r_{jl})\, \frac{\partial r_{jl}}{\partial x_d} = \frac{\mathrm{d}\varphi}{\mathrm{d}r}(r_{jl})\, \frac{x_{jd} - c_{ld}}{r_{jl}}, \qquad (10)$$

with $\mathbf{c}_l$ the collocation points of cluster $i$ and $\mathbf{x}_j$ the points where we compute the derivative. Because the vector $\mathbf{w}_i$ applies by multiplication and summation for all the clusters (Equation 7), by differentiating $\varphi$, applying all $\mathbf{w}_i$, and finally the vector $\mathbf{v}$, to the particular $\mathbf{x}_j$, we obtain the entire approximation of each partial derivative. The weights remain the same for the function and its derivatives. We should underline that the differentiation in Equation 10 holds for any dimension $d$ of $\mathbf{x}$; hence, due to Equation 10, with the same formulation we derive the partial derivatives with respect to any variable, in a concise manner.

For example, if we want to approximate a function of two variables $f(x_1, x_2)$, and later compute its partial derivatives with respect to $x_1$, by utilizing the collocation points $\mathbf{c}_l = (c_{l1}, c_{l2})$, we may write

$$\hat{f}(x_1, x_2) = \sum_{l} w_l\, \varphi\!\left(\lVert (x_1, x_2) - \mathbf{c}_l \rVert\right),$$

and if we use as a kernel, e.g., the Gaussian RBF

$$\varphi\!\left(\lVert (x_1, x_2) - \mathbf{c}_l \rVert\right) = e^{-\varepsilon^{2}\left[(x_1 - c_{l1})^{2} + (x_2 - c_{l2})^{2}\right]},$$

we obtain

$$\frac{\partial \varphi}{\partial x_1} = -2\varepsilon^{2}(x_1 - c_{l1})\, e^{-\varepsilon^{2}\left[(x_1 - c_{l1})^{2} + (x_2 - c_{l2})^{2}\right]},$$

and hence

$$\frac{\partial \hat{f}}{\partial x_1}(x_1, x_2) = -2\varepsilon^{2} \sum_{l} w_l (x_1 - c_{l1})\, e^{-\varepsilon^{2}\left[(x_1 - c_{l1})^{2} + (x_2 - c_{l2})^{2}\right]}.$$

The variables $(x_1, x_2)$ may take values from the collocation points or any other intermediate point, after the weights' calculation, in order to predict for out-of-sample observations. In empirical practice, we may select among the RBFs available in the literature, try new ones, or optimize the shape parameter $\varepsilon$. In Appendix I, we also provide a simple computer code for the symbolic differentiation of any selected RBF, using the SymPy [meurer2017sympy] package.
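A hedged sketch of Equations 9-10 for the Gaussian kernel follows; the kernel, the shape parameter and the names are assumptions, and the output-layer assembly over all clusters is the same as in the earlier sketches.

using LinearAlgebra

# ∂φ/∂x_d of the Gaussian RBF φ = exp(-(ε‖x-c‖)²), evaluated at point x for center c
gaussian_rbf_dxd(x, c, d, ε) = -2ε^2 * (x[d] - c[d]) * exp(-(ε * norm(x - c))^2)

# Ci: collocation points of cluster i (its observations), wᵢ: local RBF weights
function rbf_derivative_features(X, Ci, wᵢ, d; ε = 1.0)
    [gaussian_rbf_dxd(X[j, :], Ci[l, :], d, ε) for j in 1:size(X, 1), l in 1:size(Ci, 1)] * wᵢ
end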

Of particular interest are the Integrated RBFs (IRBFs) [Bakas2019, Babouskos2015, Yiotis2015], which are formulated from the indefinite integration of the kernel, such that the derivative of the integrated kernel is the original RBF. Accordingly, we may integrate the kernel more than once, to approximate the higher-order derivatives. For example, by utilizing the Gaussian RBF and its twice-integrated kernel at the collocation points, differentiating the integrated kernel two times recovers the Gaussian RBF itself; hence the Gaussian RBF now approximates the second derivative of the sought function, instead of the function itself.
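As an illustration of the twice-integrated Gaussian kernel in one dimension (with shape parameter $\varepsilon$, and up to the constants of integration; these expressions are standard calculus, not copied from the paper's own IRBF formulas):

$$\varphi(x) = e^{-(\varepsilon x)^{2}}, \qquad \tilde{\varphi}(x) = \iint e^{-(\varepsilon x)^{2}}\,\mathrm{d}x\,\mathrm{d}x = \frac{\sqrt{\pi}\,x\,\operatorname{erf}(\varepsilon x)}{2\,\varepsilon} + \frac{e^{-(\varepsilon x)^{2}}}{2\,\varepsilon^{2}}, \qquad \frac{\mathrm{d}^{2}\tilde{\varphi}}{\mathrm{d}x^{2}}(x) = e^{-(\varepsilon x)^{2}} = \varphi(x).$$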

2.4 ANNBN for the solution of Partial Differential Equations

Correspondingly with the numerical differentiation, we may easily apply the proposed scheme to numerically approximate the solution of Partial Differential Equations (PDEs). We consider a generic differential operator $\mathcal{L}$, depending on the partial derivatives of the sought solution $u$ and on some coefficient functions, which satisfies the equality

$$\mathcal{L}u(\mathbf{x}) = f(\mathbf{x}),$$

where $f$ might be any function of $\mathbf{x}$. We may approximate $u$ by

$$u(\mathbf{x}) \approx \sum_{l} w_l\, \varphi\!\left(\lVert \mathbf{x} - \mathbf{c}_l \rVert\right). \qquad (11)$$

By utilizing Equation 9, applying $\mathcal{L}$ to the kernels of Equation 11 at the collocation points constitutes a system of linear equations for the weights. Hence, the weights may be calculated by solving the resulting system, as per Equation 5.

For example, consider Laplace's equation in the generic form

$$\nabla^{2} u(\mathbf{x}) = \frac{\partial^{2} u}{\partial x_1^{2}} + \frac{\partial^{2} u}{\partial x_2^{2}} = f(\mathbf{x}).$$

The weights $w_l$ in Equation 11 are constant; hence, the differentiation regards only the kernel $\varphi$. Thus, by writing Equation 11 for all the points $\mathbf{x}_j$ found in cluster $i$, we obtain

$$\sum_{l} w_{il}\, \nabla^{2}\varphi\!\left(\lVert \mathbf{x}_j - \mathbf{c}_{il} \rVert\right) = f(\mathbf{x}_j). \qquad (12)$$

Because the weights are the same for the approximated function and its derivatives, we may apply boundary conditions for the function $u$ or its derivatives at some boundary points $\mathbf{x}_b$, by using

$$\sum_{l} w_{il}\, \varphi\!\left(\lVert \mathbf{x}_b - \mathbf{c}_{il} \rVert\right) = g(\mathbf{x}_b); \qquad (13)$$

hence, we may compute the weights $\mathbf{w}_i$ by solving the resulting system of Equations

$$\begin{bmatrix} \nabla^{2}\mathbf{\Phi}_i \\ \mathbf{\Phi}_{i,b} \end{bmatrix} \mathbf{w}_i = \begin{bmatrix} \mathbf{f}_i \\ \mathbf{g}_b \end{bmatrix}, \qquad (14)$$

similarly with Equation 5 for cluster $i$. Afterwards, we may obtain the entire representation for all clusters, by using Equation 6 for the computation of the output weights $\mathbf{v}$. Hence, we obtain the sought solution by applying the computed weights in Equation 7, for any new $\mathbf{x}$.
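For concreteness, a minimal single-neighborhood sketch of the collocation idea in Equations 12-14 is given below, for the 2-D Laplace/Poisson problem with Dirichlet boundary conditions; the Gaussian kernel, the shape parameter and all names are illustrative assumptions, and the per-cluster assembly of Equation 6 is omitted.

using LinearAlgebra

ε = 3.0
φ(p, c)  = exp(-ε^2 * sum(abs2, p .- c))                      # Gaussian RBF
Δφ(p, c) = 4ε^2 * (ε^2 * sum(abs2, p .- c) - 1) * φ(p, c)     # its 2-D Laplacian

# interior, boundary, centers: vectors of 2-D points; f: source term; g: boundary values
function solve_laplace(interior, boundary, f, g, centers)
    # one collocation row per interior point (PDE) and per boundary point (BC)
    A = vcat([Δφ(p, c) for p in interior, c in centers],
             [ φ(p, c) for p in boundary, c in centers])
    b = vcat(f.(interior), g.(boundary))
    w = A \ b                                                  # RBF weights
    u(p) = sum(w[l] * φ(p, centers[l]) for l in eachindex(centers))
    return u
end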

2.5 Deep Networks

A method for the transformation of shallow ANNBNs into Deep Networks is also presented. Although Shallow Networks exhibited vastly high accuracy even for unstructured and complex data-sets, Deep ANNBNs may be utilized for research purposes, for example at the intersection of neuroscience and artificial intelligence. After the calculation of the weights for the first layer, we use them to create a second layer (Figure 1), where each node corresponds to the outputs of the first layer over the given observations. We then use the same procedure for each neuron $j$ of Layer 2, by solving

$$\mathbf{H}_{1,j}\,\mathbf{w}^{(2)}_{j} = \sigma^{-1}\!\left(\mathbf{y}_{j}\right), \qquad (15)$$

with respect to $\mathbf{w}^{(2)}_{j}$, where $\mathbf{H}_{1,j}$ denotes the rows of the first-layer output matrix (the elementwise application of $\sigma$ to the first-layer weighted sums) that correspond to cluster $j$. This procedure is iterated for all neurons of Layer 2, and the matrix $\mathbf{W}^{(2)}$ collects the weights of the layer. Finally, we calculate the output layer's linear weights, as per Equation 3. This procedure results in a good initialization of the weights, close to the optimal solution, and if we normalize the responses in a range close to the linear part of the sigmoid function, we rapidly obtain a deep network with approximately equal errors to the shallow one. Afterwards, any optimization method may be supplementarily applied to compute the final weights; however, the accuracy is already vastly high.

Alternatively, we may utilize the hidden layer obtained from the shallow implementation of ANNBN (Equation 3) as the input to another layer, then to a third, and sequentially up to any desired number of layers.
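A hedged sketch of this layer-stacking alternative follows, reusing fit_annbn and σ from the earlier sketches; the number of layers and all names are illustrative, and the renormalization of the activations mentioned above is omitted, so this is an illustration rather than the paper's exact deep procedure.

hidden_activations(W, X) = hcat([σ.(X * w .+ b) for (b, w) in W]...)

function fit_deep_annbn(X, y; nlayers = 3)
    layers, Z = Any[], X
    for _ in 1:nlayers
        W, v = fit_annbn(Z, y)
        push!(layers, (W, v))
        Z = hidden_activations(W, Z)      # this layer's activations feed the next layer
    end
    return layers
end

function predict_deep(layers, Xnew)
    Z = Xnew
    for (i, (W, v)) in enumerate(layers)
        i == length(layers) && return hcat(ones(size(Z, 1)), hidden_activations(W, Z)) * v
        Z = hidden_activations(W, Z)
    end
end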

(a) Deep ANNBNs with multiple layers of neurons
(b) Ensembles of ANNBNs
Figure 3: Transformation of the basic Numerical Scheme

2.6 Ensembles

By randomly sub-sampling a percentage of the observations, running the ANNBN algorithm multiple times, and averaging the results with respect to the errors of each iteration, over all folds, we may constitute an Ensemble of ANNBNs (Figure 3(b)). Ensembles of ANNBNs exhibited increased accuracy and generalization properties for noisy data, as per the following numerical experiments.
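A minimal sketch of such an ensemble follows, reusing fit_annbn and predict_annbn from the earlier sketches; the sub-sampling fraction, the number of folds and the inverse-MAE weighting are assumptions rather than the paper's exact scheme.

using Statistics, Random

function fit_ensemble(X, y; nfolds = 10, frac = 0.8, rng = MersenneTwister(1))
    m = size(X, 1)
    members = map(1:nfolds) do _
        I = randperm(rng, m)[1:round(Int, frac * m)]         # random sub-sample
        W, v = fit_annbn(X[I, :], y[I])
        err  = mean(abs.(predict_annbn(W, v, X[I, :]) .- y[I]))
        (W = W, v = v, weight = 1 / (err + eps()))           # weight members by their errors
    end
    return members
end

predict_ensemble(members, Xnew) =
    sum(M.weight .* predict_annbn(M.W, M.v, Xnew) for M in members) /
    sum(M.weight for M in members)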

2.7 Time Complexity of the ANNBN algorithm

The training of an ANN with two layers and three nodes only is proved to be NP-Complete in [blum1989training], if the nodes compute linear threshold functions of their inputs. Even simple cases and approximating hypotheses result in NP-complete problems [engel2001complexity]. Apart from the theoretical point of view, the slow speed of learning algorithms is a major flaw of ANNs. On the contrary, ANNBNs are fast, because the main part of the approximation regards operations with small square matrices of dimension $(n+1)\times(n+1)$, with $n$ the number of features. We provide here a theoretical investigation of ANNBNs' time complexity, which may be empirically validated by running the supplementary code. More specifically, the computational effort of ANNBNs regards the following steps.

Definition 1.

ANNBN training is obtained in three distinct steps: a) clustering, b) inversion of the small matrices of Equation 1 for the calculation of the hidden-layer weights, and c) calculation of the output weights (Equation 4).

Definition 2.

Let $m$ be the number of observations, $n$ the number of features, and suppose we use equally sized clusters. The number of clusters $k$, which is equal to the number of neurons, will be

$$k = \frac{m}{n+1}, \qquad (16)$$

where the addition of $1$ corresponds to the unit column in Equation 1. Note that the number of clusters $k$ is equal to the number of neurons as well (Equation 1 and Figure 1). This is the maximum number of clusters; otherwise, the matrices $\mathbf{A}_i$ are not invertible and Equation 1 has more than one solution. Hence, we investigate the worst case in terms of computational time, while in practice $k$ may be smaller. We denote by $I$ the number of iterations needed until convergence of the clustering, which in practical applications is small, making the clustering fast.

Lemma 1.

Time complexity of step (a) is $\mathcal{O}(I\,k\,m\,n)$, that is, $\mathcal{O}(I\,m^{2})$ for $k$ as in Equation 16.

Proof.

The running time of Lloyd's algorithm (and most variants) is $\mathcal{O}(I\,k\,m\,n)$ [hartigan1979algorithm, Manning:2008:IIR:1394399]. We should note that Lloyd's algorithm is, in the worst case, super-polynomial [blomer2016theoretical, Arthur:2006:SKM:1137856.1137880], and hence so is the ANNBN algorithm in the worst case. However, in practice, the algorithm converges for a small number of iterations $I$.

In particular, the seeding of the $k$-means++ algorithm requires $\mathcal{O}(k\,m\,n)$ additional time [arthur2007k]. From Equation 16, we derive that the running time of step (a) is $\mathcal{O}(I\,m^{2})$. ∎

Lemma 2.

Time complexity of step (b) is $\mathcal{O}(k\,n^{3})$.

Proof.

Time complexity of step (b) regards the inversion of matrices of size $(n+1)\times(n+1)$ (Equation 1). This is repeated $k$ times; hence, the complexity is $k\,\mathcal{O}\!\left((n+1)^{3}\right) = \mathcal{O}(k\,n^{3})$. ∎

Lemma 3.

Time complexity of step (c) is $\mathcal{O}(m\,k^{2} + k^{3})$.

Proof.

Step (c) regards the solution of an $m \times (k+1)$ system of Equations (Eq. 3). We may solve it with respect to the output weights by $(\mathbf{H}^{T}\mathbf{H})^{-1}\mathbf{H}^{T}\mathbf{y}$. Hence, the complexity regards the multiplication of $\mathbf{H}^{T}$ with $\mathbf{H}$, with complexity $\mathcal{O}(m\,k^{2})$, its inversion, with complexity $\mathcal{O}(k^{3})$, as well as the multiplication of $\mathbf{H}^{T}$ with $\mathbf{y}$, with complexity $\mathcal{O}(m\,k)$, and of $(\mathbf{H}^{T}\mathbf{H})^{-1}$ with $\mathbf{H}^{T}\mathbf{y}$, with complexity $\mathcal{O}(k^{2})$. Thus, the total complexity is $\mathcal{O}(m\,k^{2} + k^{3})$. ∎

Theorem 1.

(ANNBN Complexity) The running time of the ANNBN algorithm is $\mathcal{O}\!\left(I\,k\,m\,n + k\,n^{3} + m\,k^{2} + k^{3}\right)$, which is polynomial in the number of observations $m$ and the number of features $n$, i.e., of class P.

Proof.

By considering the time complexity of each step (Lemmas 1, 2, and 3), we deduce that the total complexity is $\mathcal{O}\!\left(I\,k\,m\,n + k\,n^{3} + m\,k^{2} + k^{3}\right)$, which is polynomial and hence of class P. ∎

3 Validation Results

3.1 1D Function approximation & geometric point of view

We consider a simple one-dimensional function to present the basic functionality of ANNBNs. Because the inverse sigmoid is unstable for responses near the endpoints of the sigmoid's range, we normalize the responses into an interior sub-domain of (0, 1). In Figure 4, the approximation of the function is depicted, for a varying number of neurons utilized in the ANNBN. We may see that by increasing the number of neurons from 2 to 4 and 8, the approximating ANNBN exhibits more curvature alterations. This complies with the Universal Approximation Theorem, and offers a geometric point of view. Interestingly, the results are not affected by adding some random noise, as the Mean Absolute Error (MAE) on this noisy data-set was low for the train set and even smaller for the test set, further indicating the capability of ANNBN to approximate the hidden signal and not the noise. We should note that for noiseless data of 100 observations and 50 neurons, the M.A.E. was remarkably low in both the train and the test set. The approximation of the same function with a Gaussian RBF and an appropriate shape parameter results in similarly low M.A.E. for the train and test sets.
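A hedged end-to-end usage example of the sketches above is given below; the target function, noise level, normalization range and train/test split are illustrative and not the paper's exact 1-D experiment.

using Random
Random.seed!(1)

f(x)   = sin(4x) + 0.5x                # an arbitrary smooth 1-D signal
xtrain = rand(100, 1)                  # 100 observations, 1 feature
ytrain = f.(vec(xtrain)) .+ 0.05 .* randn(100)
ymin, ymax = extrema(ytrain)
ynorm  = 0.1 .+ 0.8 .* (ytrain .- ymin) ./ (ymax - ymin)     # keep responses inside (0, 1)

W, v   = fit_annbn(xtrain, ynorm)                            # 50 neurons, since k = 100 ÷ 2
xtest  = rand(100, 1)
ŷ      = predict_annbn(W, v, xtest)                          # normalized predictions
ytrue  = 0.1 .+ 0.8 .* (f.(vec(xtest)) .- ymin) ./ (ymax - ymin)
mae    = sum(abs, ŷ .- ytrue) / 100                          # test-set mean absolute error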

Figure 4: ANNBN with a varying number of neurons, for the approximation of the one-dimensional function.
Figure 5: Regression errors for the Griewank function.

3.2 Regression in Multiple Dimensions

We first consider a function of five variables. We create a train set of the five variables, compute the responses, add some random noise, and normalize. Then we create a test set with an equal number of observations as the train set, and compute the responses without adding random noise. Thus, we may check the capability of ANNBN to approximate the signal and not the noise. The results are presented in Table 1, indicating the great accuracy achieved with ANNBNs. The comparison with other methods regards Random Forests as implemented in [DecisionTree], XGBoost [XGBoost], and AdaBoost from ScikitLearn [ScikitLearn].

Table 1 also presents the similar results, in terms of approximation errors, obtained for input $\mathbf{x} \in \mathbb{R}^{100}$ with the addition of some random noise, for the highly nonlinear Griewank function [griewank1981generalized],

$$f(\mathbf{x}) = 1 + \sum_{i=1}^{n} \frac{x_i^{2}}{4000} - \prod_{i=1}^{n} \cos\!\left(\frac{x_i}{\sqrt{i}}\right).$$

With RBF ANNBNs, we may use a higher number of clusters, and hence neurons, as the matrices of Equation 5 are always square. Accordingly, we may approximate this nonlinear, noisy function with few observations with respect to the number of features, with vastly low errors, as demonstrated in Figure 5 and Table 1.

Table 1: Regression results: Mean Absolute Errors of Random Forests, AdaBoost, XGBoost, and ANNBN, for the five-variable function and the Griewank function.

3.3 Classification for Computer Vision

As highlighted in the introduction, the reproducibility of AI research is a major issue. We utilize ANNBN for the MNIST database [lecun1998gradient, mnistweb], obtained from [MLDatasets], consisting of handwritten integers 0-9, with 6×10^4 images for training and 10^4 for testing. The investigation regards a variety of ANNBN formulations, and the comparison with other methods. In particular, a variety of sigmoid activation functions, together with their corresponding inverses, were utilized for Equation 1. We constructed ANNBNs with one and multiple layers, varying the number of neurons and the normalization range of the responses. The results regard separate training for each digit. All results in Table 2 are obtained without any clustering. We consider as accuracy metric the percentage of Correctly Classified (CC) digits, divided by the number of observations.

The aim of this investigation was to compare ANNBN with standard ANN algorithms such as Flux [Flux], as well as Random Forests as implemented in [DecisionTree], and XGBoost [XGBoost]. Table 2 presents the results in terms of accuracy and computational time. The models are trained on the raw dataset, without exploiting any spatial information. The results in Table 2 are exactly reproducible in terms of accuracy, as no clustering was utilized and the indexes are taken into account in ascending order. For example, the running time to train 5000 neurons is 29.5 seconds on average for each digit, which is fast, considering that the training regards approximately 3.9×10^6 weights, for 6×10^4 instances and 784 features. Also, the Deep ANNBNs with 10 layers of 1000 neurons each are trained in the vastly short time of 91 seconds per digit on average (Table 2). Correspondingly, in Table 2, we compare the accuracy and running time with Random Forests and XGBoost (200 rounds). Future steps might include data preprocessing and augmentation, as well as exploitation of spatial information as in CNNs. Furthermore, we may achieve higher accuracy by utilizing clustering for the training by neighborhoods, Ensembles, and other combinations of ANNBNs, as well as by exploiting data preprocessing and augmentation, spatial information, and further training of the initial ANNBN with an optimizer such as stochastic gradient descent. No GPU or parallel programming was utilized, which might also be a topic for future research. For example, the RBF implementation of ANNBN with clustering exhibits a test-set accuracy of 99.7% for one of the digits. The accuracy results regard the out-of-sample test set with 10^4 digits. The running time was measured on an Intel i7-6700 CPU @ 3.40GHz with 32GB memory and an SSD hard disk. A computer code to feed the calculated weights into Flux [Flux] is provided.

Table 2: Computer Vision (MNIST). Correctly Classified digits (%) and running time (sec) per digit label (0-9), for Random Forests [DecisionTree], XGBoost [XGBoost], a Flux ANN [Flux], four shallow ANNBN designs (one hidden layer with varying numbers of neurons, activation functions, and normalization ranges), and a Deep ANNBN with 10 hidden layers of neurons. The ANNBN accuracy results are exactly reproducible with the supplementary computer code. All training examples utilize the raw MNIST database, without any preprocessing or data augmentation. The accuracy (%) regards the out-of-sample test set of MNIST with 10^4 handwritten digits.

3.4 Solution of Partial Differential Equations

We consider Laplace's equation [Brady2005],

$$\frac{\partial^{2} u}{\partial x^{2}} + \frac{\partial^{2} u}{\partial y^{2}} = 0,$$

in a rectangle with dimensions $(a, b)$ and prescribed boundary conditions on its four edges. In Figure 6(a), the numerical solution as well as the exact, closed-form one are presented. The MAE between the closed-form solution and the numerical one with ANNBN was found to be remarkably small. Interestingly, if we add some random noise $\varepsilon$ to the zero source, that is, if we solve

$$\frac{\partial^{2} u}{\partial x^{2}} + \frac{\partial^{2} u}{\partial y^{2}} = \varepsilon, \qquad (17)$$

the MAE remains small, for $a = b = 1$ and a rectangular grid of collocation points. It is important to underline that numerical methods for the solution of partial differential equations are highly sensitive to noise [mai2003approximation, Bakas2019], as it deteriorates the derivatives. However, by utilizing the ANNBN solution, the results change only slightly, as described by the above errors. This is further highlighted if we utilize the calculated weights of the ANNBN approximation and compute the second-order partial derivatives of the solution of Equation 17, $\partial^{2}u/\partial x^{2}$ and $\partial^{2}u/\partial y^{2}$: the corresponding MAE for the second-order partial derivatives (Figure 6(b)) is about two orders of magnitude less than the added noise, implying that ANNBN approximates the signal and not the noise, even in PDEs, and even with a stochastic source.

(a) Solution
(b) Partial Derivatives
Figure 6: ANNBN solution of Laplace’s Equation with stochastic source.

4 Discussion

As described in the formulation of the proposed method, we may use a variety of ANNBNs, such as the sigmoid or the Radial Basis Function scheme, Ensembles of ANNBNs, Deep ANNBNs, etc. The method adheres to the theory of function approximation with ANNs, as per the visual representations of ANNs' capability to approximate continuous functions [Nielsen2015, rojas2013neural]. We explained the implementation of the method in the presented illustrative examples, which may be reproduced with the provided computer code. In general, sigmoid functions are faster, RBFs are more accurate, and Ensembles of either sigmoids or RBFs handle noisy datasets better. RBFs may use matrices smaller than $(n+1)\times(n+1)$, and hence approximate datasets with limited observations and many features. The overall results are stimulating in terms of speed and accuracy, compared with the literature and state-of-the-art methods.

The approximation of the partial derivatives and the solution of PDEs, with or without a noisy source, in a fast and accurate manner, offer a solid step towards the unification of Artificial Intelligence algorithms with Numerical Methods and Scientific Computing. Future research may consider the implementation of ANNBNs in specific AI applications such as Face Recognition, Reinforcement Learning, Text Mining, etc., as well as Regression Analyses, Predictions, and the solution of other types of PDEs. Further directions include the investigation of sigmoid functions other than the logistic, such as the hyperbolic tangent, as well as other RBFs, such as multiquadric, integrated, etc., and the selection of an optimal shape parameter for even higher accuracy. Finally, although the weights' computation is remarkably fast, the algorithm may easily be parallelized, as the weights' computation for each neuron regards $k$ independent inversions of the small matrices $\mathbf{A}_i$.

Interpretable AI is a modern demand in science, and ANNBNs are inherently suitable for this scope, as by checking the approximation errors of the neurons in each cluster, we may retrieve information on the local accuracy, as well as on local and global non-linearities in the data. Furthermore, as demonstrated in the examples, the method is proficient for small datasets, without over-fitting, by approximating the signal and not the noise, which is a common problem of ANNs.

5 Conclusions

From Newton's law of motion, Leibniz's vis viva, and Quantum Mechanics, to modern Artificial Intelligence algorithms which mimic human intelligence, a core problem is confronted: to construct a mathematical model which approximates the measured data of a physical, engineering, social, etc., system. The evolution of scientific research enhances such models, aiming to attain the best possible performance in terms of prediction accuracy and computational time. Scientists and engineers often utilize a variety of numerical models to attain the sought scope, depending on the characteristics of the given problem and data. In this work, we presented a unified approach for problems involving the mathematical modeling of an unknown system, from function approximation and the solution of partial differential equations to computer vision, regression, and classification tasks. The method adheres precisely to the theory of approximating functions with bounded activation functions. The validation results demonstrate the proficiency of the proposed method regarding accuracy and speed.

6 Appendix I: Computer Code

The presented method is implemented in the Julia [bezanson2017julia] language. The corresponding computer code is available at https://github.com/nbakas/ANNBN.jl

Nomenclature:

  • $\mathbf{w}_i$, $b_i$: weights and bias of the first layer for neuron $i$, computed in cluster $i$

  • $m$: number of observations

  • $n$: number of variables (features)

  • $\mathbf{X}$, $\mathbf{y}$: given data

  • $k$: number of clusters, equal to the number of neurons

  • $\mathbf{v}$, $v_0$: approximation weights and bias for the external (output) layer

  • $\mathbf{X}_i$, $\mathbf{y}_i$: local observations in cluster $i$

  • $m_i$: number of observations found in cluster $i$

  • $\mathbf{A}_i$: local matrix of observations (in cluster $i$)

  • $j$, $d$, $i$: iterators for observations, features, and clusters

  • $\mathbf{x}$: out-of-sample point in $\mathbb{R}^{n}$

  • CC: Correctly Classified observations

  • number of folds for the Ensembles

  • number of layers (Deep ANNBN)


References