1 Introduction
The radial basis function network (RBFN) was introduced by Broomhead and Lowe in 1988 Broomhead_init_rbf
. It is a simple yet flexible regression model that can be interpreted as a feedforward neural network with a single hidden layer. While it often cannot compete in predictive accuracy against stateoftheart black box models such as deep neural networks or ensemble methods such as boosting, it is nevertheless an universal approximator
univ_approx , which means that its predictive accuracy can be improved to match any other predictor by increasing the amount of training data. Especially with large data sets, computational efficiency becomes an essential concern.In this article, we describe a modern gradientbased RBFN implementation based on the same computational machinery that is used in modern deep learning. While gradientbased methods for training RBFN’s have been criticized
OLS because of local optima, tailored training methods have been used in the past with good results sel_bp. We show that suitable optimizers, regularization techniques, and learning rate schedules enable us to train large RBF networks without overfitting and achieve predictive performance on par with gradientboosted decision trees.
Classical algorithms for training RBFN’s usually follow a twostep approach. In the first step, the parameter values describing centroid positions are determined, for instance, by taking a random subset of input data points or applying a suitable clustering algorithm. In the second step, the rest of the model parameters are computed using closedform analytic expressions intro_rbf ; survey_old ; survey_new . Another approach for finding RBFN parameters, the Orthogonal Least Squares (OLS) algorithm OLS
, constructs the network sequentially by adding new centroids that maximize the reduced variance of the error vector.
While these methods from late 1990’s indeed produce reasonably wellperforming RBFN’s, it is clear that the solutions thus found are suboptimal because of the biases introduced by the multistep algorithms. More recent research new_1 ; new_2 has mainly concentrated on developing algorithms that aim to reach good predictive performance while keeping the number of parameters in the network as low as possible survey_new . With our method, it becomes straightforward to train an RBF network with a large number of prototypes.
The parameters of an RBFN have easytounderstand interpretations. For a network with tens of centroids, this makes the RBFN as a whole quite interpretable; however, a network with hundreds or thousands of centroids is in practice no more interpretable than any large machine learning model. Therefore, we also propose a gradientbased RBFN pruning method that produces a smaller RBFN that globally approximates the larger one. Our approach simultaneously optimizes all parameters of the smaller RBFN and, crucially, minimizes the expected discrepancy over the input data distribution, not the input data points themselves. Hence a pruned RBF network thus obtained is optimized for an objective function that maintains an explicit connection to the larger RBF network and is therefore different from what one would use for directly training a small RBFN of the same size.
Interpretability is a great asset especially in those machine learning applications where the learned patterns in the input data can give new valuable insight into the mechanisms underlying the correlations between input and output. Data sets where such patterns, sometimes too nuanced for human intuition to discover, can be revealed by interpretable machine learning models are encountered e.g. in natural sciences. In this paper, we apply RBFN’s to a materials physics data set that describes a subset of parameters for a kinetic Monte Carlo (KMC) model for surface diffusion in copper jyri2 ; jyri3 . In the KMC model, diffusion is interpreted as a series of atomic migration events that have rates defined by their energy barriers :
(1) 
where is the Boltzmann constant and
is temperature. The barriers in turn are defined by the configuration of atoms around the migrating atom. There are methods for computing the barriers that correspond to different local atomic environments of the event, but problems arise from the vast number of the environments. Computing the barriers accurately is computationally expensive, so either the accuracy or the number of different barriers has to be compromised for parametrizing the KMC model. Machine learning offers a way to interpolate and extrapolate barrier values based on the incomplete data set of calculated barriers.
The input data assumes a perfect crystalline structure, where variation only occurs in the occupation of fixed lattice sites—either an atom sits at the site indexed , or not. Hence, the input data can be losslessly converted to a binary sequence of occupation numbers at lattice positions. The local atomic environment that is assumed to affect the migration energy barrier is extended up to the second nearest neighbors of the migrating atom. In the facecentered cubic copper lattice, this comprises 26 lattice positions.
The data set does not span the entire 26dimensional input space, that in principle has million possible values; even setting aside to computational cost of calculating such an immense amount of migration energy barriers, there are other challenges related to finding all of these barriers in crystalline surface systems jyri1
. In any case, it is necessary to interpolate or otherwise estimate the barriers for the missing input values, and machine learning is a promising approach to accomplish this. Were there a need to introduce another element into the parameterization in addition to Cu, the input space would suddenly grow to have
possible values; likewise, expanding the local atomic environment to include third nearest neighbors would grow the dimensionality itself from 26 to 58. For input spaces like this, the only option is to use a method that is capable of generalizing from a small subset of all possible inputs.The large RBFN’s trained to the migration barrier data are then pruned, and the inputassociated weights are revealed to contain patterns that correspond to physically meaningful threedimensional symmetries, even though the networks only ever saw the “flat” binary representations of the atomic environments.
We emphasize that our motivation for pruning is to achieve interpretability, not to minimize the number of centroids used for making predictions. Indeed, we will demonstrate with a materials physics data set that one can train RBF networks with thousands of centroids without overfitting, so there is no inherent need to use pruning as a form of regularization—and that these large networks can be made interpretable with our proposed pruning method. This approach is also fundamentally different from simply training a small RBF network: limiting the number of centroids for the sake of interpretability would compromise predictive accuracy, and in general the pruned networks we obtain are different from those one would get by directly training networks of the same size because the objective functions are different.
Somewhat similar ideas on the the input data distribution and our pruning criterion’s functional form appear in the growingandpruning (GAP) algorithm GAP ; GGAP for constructing RBF networks. However, GAP is designed for constructing RBFN’s in sequential access cases and does not consider pruning as a separate task. Also related are two existing RBFN algorithms that incorporate pruning by initially assigning each training data point its own centroid: The early twostage algorithm by Musavi et al. pruning_clust prunes the initial network by combining similar centroids using an unsupervised clustering approach, then keeps the centroid locations fixed for the rest of the training process; as discussed above, this is unlikely to converge to an optimal solution. The more sophisticated algorithm of Leonardis et al. pruning_mdl
alternates between gradientbased optimization steps (performed on the full data set) and centroid removal steps based on the Minimum Description Length (MDL) principle; unfortunately their approach does not seem to scale well to large data sets because of the need to repeatedly solve a combinatorial optimization problem with a search space exponential in the number of centroids.
The rest of this article is structured as follows. In Section 2, we describe our gradientbased approach for training and pruning RBF networks. In Section 3, we report experimental results both on toy data and on a materials physics dataset. Finally, we summarize our results and discuss future directions in Section 4.
2 The RBFN Framework
2.1 Basic problem setting
Suppose that we are given a data set , , where the are i.i.d. samples from some distribution . Our task is to reconstruct an unknown target function with , where the are i.i.d. noise with some fixed variance.
We model with an RBF network , defined as follows:
Definition 1
A radial basis function network (RBFN) is a function of the form
(2) 
where , , and are parameters, is a kernel function, and denotes the Euclidean norm.
We restrict ourselves to the Gaussian kernel with the parameter . The rows of in Definition 1 are called centroids or prototypes and can be interpreted as pseudo–data points. An RBFN computes an input point’s Euclidean distance to each centroid, feeds these distances into the kernel, and uses a weighted linear combination of the resulting values to provide a prediction. Taken together, the weights and centroids define areas of the input spaces with smaller or larger output values and hence have natural interpretations.
We propose to fit an RBF network to the observations by solving the optimization problem
(3) 
i.e., by minimizing the mean squared error (MSE) between the and the . This is equivalent to finding a maximum likelihood solution for the probabilistic setting described above.
2.2 Training
To solve (3), we use the same machinery that is used in modern deep learning. More specifically, we provide an open source implementation^{1}^{1}1Our implementation is available at XXX under an open source license. based on the PyTorch framework paszke2017automatic and the gradientbased Adam optimizer adam . We use regularization for all realvalued parameters (logtransformed in the case of
) and train the RBFN over multiple epochs, starting with random initialization and using minibatches. We use early stopping with a validation set and use the validation set loss to guide our learning rate schedule. Our approach is implemented as a standard PyTorch module and is vectorized for efficiency.
2.3 Pruning
Assume that we have trained a large RBF network with centroids. Suppose then that we would like to find another RBFN with centroids that gives a good approximation to original RBFN. Our principal motivation here is that a smaller network can be easier to interpret. There are many ways in which one might specify what is the best approximation to
, but we propose to find an RBFN that minimizes expected squared difference between the two RBFN’s predictions in the probability space of the datagenerating process.
Denote the parameters of the smaller RBFN by and the (fixed) parameters of the original RBFN by . Then the pruning task is to solve
(4) 
where is the data distribution.
By expanding the squares and using the linearity of expectation, we can rewrite the expectation in (4) as
(5) 
From the decomposition (5) we see that the optimization problem (4) is reduced to the computation of expectations of the form . The following theorem gives closedform expressions for these expectations under three practically relevant scenarios.
Theorem 2
Let and .

If follows the distribution given by the mixture density
where is the Gaussian density with mean and variance , , and for all , then

If follows the distribution given by
where is the uniform density on the interval , then

If follows the distribution given by
where is the Bernoulli point probability function for the outcomes , then
The proof is postponed to A
. Gaussian mixtures and the uniform distribution provide good approximations to the majority of practical situations where the input data is free of outliers. The binary case is relevant for more specific situations; it arises, for instance, in modeling atomic environments in physical simulations
jyri1 ; jyri2 .Given the closedform expressions for the expectations from Theorem 2, we can directly compute the pruning objective (5) for a given set of RBFN parameters. Each individual expectation has a computational complexity of , which implies that (5) can be computed with operations, or if we omit constant terms, . This is particularly remarkable in the Bernoulli case, where a naïve computation of the expectation would involve summation over terms.
In our PyTorchbased implementation, we provide efficient vectorized implementations of the pruning objectives corresponding to the settings where each input dimension has the distribution , , or . As in RBFN training, we use the Adam optimizer adam to minimize the pruning objective. We initialize the smaller RBFN’s centroids by sampling without replacement from the larger RBFN, and as the objective is not convex, we use multiple restarts to improve the quality of the final solution.
3 Results and discussion
In all the experiments described in this section, we use the same fixed hyperparameters when training RBF networks. Namely, we use a batch size of 64 and weight decay (
regularization parameter) . We use validation set–based early stopping with the following learning rate schedule: We start with the learning rate , and multiply it by when ten successive epochs have produced no improvement for the validation set MSE. After each learning rate reduction, we allow ten epochs before resuming the monitoring, and we stop training when the learning rate goes below .When pruning RBF networks, we use a similar learning rate schedule, but we start from , and we stop when the objective has not improved for ten iterations with the learning rate .
3.1 Toy example with normal and uniform pruning
First, we illustrate training and pruning with a simple toy data set. We sample , , uniformly at random from the interval and compute . We then train an RBFN with 100 centroids, using 20% of the data points for early stopping. Afterwards, we prune the resulting RBFN down to three centroids. For pruning, we use both the and distributions, and for each choice, we select the best pruned RBFN out of ten random initializations.
The results are shown in Figure 1. The 100centroid RBFN has learned a very good approximation of the target function within the data set and degenerates to a constant value elsewhere as implied by Definition 1. Roughly 95% of the probability mass of lies within the interval , and within this interval the corresponding pruned RBFN provides a good fit, but outside that interval the RBFN degenerates to a constant with a suboptimal value; this is to be expected, as pruning with this distribution should give little weight to values outside the interval. The uniformpruned RBFN also matches the central peak well and converges to a more reasonable constant in the tails; this focus on the tail area appears to be the cause for the slightly worse approximation at around .
3.2 Migration energy barrier data
3.2.1 Predictive performance
We evaluate the performance of gradientbased RBFN training on a data set of migration energy barriers for copper surfaces jyri3 . The data set is further divided into three subsets corresponding to the 100, 110, and 111 surfaces with , , and data points, respectively. Each data point consists of 26 binary features and a realvalued response. We always use data points for training, data points for early stopping, and leave the rest for the test set.
Splitting of the data in three subsets arises from the method in which the barriers were calculated. See Figure 2
for an illustration of the surface orientations in Cu crystal. For the calculation of the migration energy barriers, each 26dimensional input vector was embedded in one of the three lowest index surfaces, according to a selected criterion of stability. This results in different relaxation and forces being present in the system, depending on the selected surface orientation. Furthermore, in an earlier study using multilayer perceptrons for the same data set, accuracy was gained by taking advantage of this physical split in the data
jyri1 .We use the bracketed threenumber Miller indexing for noting crystal orientations kittel1996introduction . Briefly, the three numbers , and are vector coordinates in the basis of cubic lattice vectors : [hkl] is shorthand for . Overline signifies a negative number. Square brackets are used for vectors, angular brackets for sets of equivalent vectors, round brackets for surfaces, and curly brackets for sets of equivalent surfaces. The surface (hkl) can be defined as perpendicular to the vector [hkl].
We compare the performance of our gradientbased RBF networks to two baselines: gradientboosted decision trees gb
, as implemented in the popular XGBoost
xgboost package, and deep neural networks implemented with the PyTorch framework. For both baselines, we use early stopping after ten successive iterations of no improvement, and for XGBoost we impose an upper limit of trees. For each surface and both baseline algorithms, we first do 100 rounds of hyperparameter tuning with random search randomhp .For XGBoost, we sample the learning rate uniformly from , the instance and column subsampling ratios from , the maximum tree depth from , the minimum number of instances in a node from , and the and regularization weights and the minimum loss reduction required for a new partition from .
For deep neural networks, we sample the learning rate from , the batch size from , the regularization weight from , the number of hidden layers from , and the number of nodes per hidden layer from
. For each network, we randomly choose the activation function to be either the widely used ReLU or the more recently proposed ELU
elu .For each sampled set of hyperparameters, we perform ten random train–validation–test splits and use the average of the test set MSE to rank the hyperparameter values.
We then evaluate the performance of the hyperparametertuned XGBoost and DNN models and RBF networks with 128, 256, 512, 1024, 2048, and 4096 centroids, each with ten random train–validation–test splits for each three surfaces. The resulting root mean squared errors (RMSE) are shown in Table 1 and also shown in Figure 3. (We also tried 8192centroid RBF networks, but the improvements were insignificant.) Overall, deep neural networks outperform the other models; for all three surfaces, the bestperforming DNN’s use 256 nodes per layer, but the other optimized hyperparameters found vary by surface (with, e.g., the number of layers ranging from 6 to 9). For the 100 and 110 surfaces, the 4096centroid RBFN is better than XGBoost, and the predictive performance is acceptable also for the 111 surface. The best XGBoost predictors used 7324, 3419, and 1993 decision trees for the 100, 110, and 111 surfaces, respectively. The best hyperparameter values found for XGBoost lie either in the interiors of the predefined ranges or at zero, which suggests that the hyperparameter search had sufficient coverage.
Surface  Model  RMSE (± one s.d.) 

100  RBFN (128)  0.0626 ± 0.0006 
RBFN (256)  0.0541 ± 0.0004  
RBFN (512)  0.0481 ± 0.0004  
RBFN (1024)  0.0444 ± 0.0003  
RBFN (2048)  0.0425 ± 0.0002  
RBFN (4096)  0.0417 ± 0.0001  
XGBoost  0.0437 ± 0.0002  
DNN  0.0366 ± 0.0009  
110  RBFN (128)  0.0610 ± 0.0006 
RBFN (256)  0.0534 ± 0.0004  
RBFN (512)  0.0488 ± 0.0004  
RBFN (1024)  0.0454 ± 0.0003  
RBFN (2048)  0.0436 ± 0.0002  
RBFN (4096)  0.0430 ± 0.0001  
XGBoost  0.0447 ± 0.0002  
DNN  0.0392 ± 0.0007  
111  RBFN (128)  0.1017 ± 0.0002 
RBFN (256)  0.0974 ± 0.0003  
RBFN (512)  0.0950 ± 0.0004  
RBFN (1024)  0.0941 ± 0.0003  
RBFN (2048)  0.0937 ± 0.0002  
RBFN (4096)  0.0938 ± 0.0002  
XGBoost  0.0924 ± 0.0002  
DNN  0.0700 ± 0.0010 
Predictive performance of RBF networks (as a function of the number of centroids), XGBoost (shown by the solid horizontal line), and DNN (shown by the dashed line) on the three surfaces of the migration energy barrier data set. Error bars mostly not visible for RBFN’s; dotted lines show one standard deviation for XGBoost and DNN. Note the different vertical axes.
3.2.2 Visualization and interpretability by pruning
To demonstrate the usefulness of our pruning approach, we first take the bestperforming RBF networks from Section 3.2.1 for each surface and prune them down to sixteen centroids. Note that this is different from directly training a sixteencentroid RBFN: our goal is to produce an interpretable approximation of the large model, and this approximation is not meant to be used for producing predictions. The predictive accuracy of a directly trained sixteencentroid model would be better than that of the pruned model but much worse than that of a large RBFN with thousands of centroids; its centroids would be optimized for prediction, not for approximating a larger RBF network and aiding in its interpretation.
In this case, we use the Bernoulli distribution with
for each of the 26 features. Hence, denoting the large RBFN by and the pruned one by , we are essentially solving the optimization problemwhich seemingly involves terms but can be optimized efficiently because of Theorem 2. As before, we use ten random initializations for the pruning and select the RBFN that gives the best result. The square roots of the resulting pruning loss values (5) are 0.0650, 0.0653, and 0.0908 for the surfaces 100, 110, and 111, respectively. While these values naturally show that the heavy pruning incurs a reduction in accuracy, the values are nevertheless reasonably good when one considers that the response values in the data sets have the ranges , , and ( shows one standard deviation).
Interestingly, we can visualize the centroids of the resulting RBF networks and gain insight into the structure of the prediction task and the workings of the RBF network. All visualizations for the centroids of the pruned RBFN’s are shown in Figures 6–11 in B. The 26 elements of corresponding to each centroid in the pruned RBFN are shown by the color and the opacity of the atomic positions: red encodes a positive value, blue a negative value, and the opacity is proportional to the absolute value of the element (fully opaque means a value ). The same prototype is shown from four different viewpoints: [110], [001], [110], and [031].
Several patterns related to the surfaces’ orientations can be observed from the distributions of opacities. Namely, in the 100 prototypes, more than in the other groups, a pattern of high opacity aligning with the 100 surface can be seen. Likewise, many of the 111 set prototypes have structures aligning with the 111 surface. Illustrative examples of prototypes from the 100 and the 111 networks are depicted in Figure 4.
Relevant physical features contained in the RBFN’s can be also inspected from the mean absolute opacities of each lattice site. This information is plotted in Figure 5. Color coding is the same as in Figure 2: by the atomic layer and the distance to the migrating atom.
The first clear observation is that the first nearest neighbor (1nn) sites on average have higher opacities as the pruned RBFN’s are small and incorporate only the most important patterns. Another notable feature is that 110 prototypes have lower opacities on average on the last four lattice positions. This, again, corresponds to the lattice structure, as on the 110 surface, these four positions are located in the same layer as the migrating atom, but quite far away, at second nearest neighbor (2nn) distance, on the neighboring ridges. See the secondrightmost panel of Figure 2 for an illustration: the last four sites are the light yellow positions at the top and the bottom of that figure. Sites 19–22 on the 100 surface have a similar role, and these sites indeed have the lowest mean opacity. It should be expected that 2nn sites in the same layer with the migrating atom have the least contribution to the migration barrier, since they are far from the migrating atom, and their absence or presence does not even impose much stress on the system, unlike that of the sites in the lower layers. On the 111 surface, there are no 2nn sites in the same layer with the migrating atom—this is reflected in the more even opacity of all lattice site groups.
The distributions of lattice position opacities on different surfaces suggest that the RBFN’s were able to infer physically meaningful information only from the functional dependency between binary input and the migration energy barriers, without ever having access to threedimensional representations of the lattice sites.
The input patterns that the RBFN algorithm found to be meaningful for the regression model may prove useful in developing more sophisticated input descriptors for future machine learning solutions of similar problems. The currently used integer vector descriptor, developed by Djurabekova et al. djurabekova2007stability ; djurabekova2007artificial , has certain limitations: while it is invariant in threedimensional translations of the lattice (arising from the total omission of the threedimensional coordinates), it is not invariant in reflection about the (001) or the (110) planes (see the second panel of Figure 2 for the vectors normal to these planes) and 180 ° rotation around the [110] axis. The migration energy barrier response used in KMC simulations has to be invariant in these reflections and rotations to give similar diffusion properties in physically equivalent local atomic environments on differently oriented surfaces.
The lack of important invariances in the descriptor can be circumvented e.g. by systematically choosing a representative input from each family of physically equivalent cases, as was done in ref. jyri2 , or by averaging over all the symmetric cases when producing the response for an input. The former method will waste some of the available training data, and prevent the regressor from learning the symmetries itself, while the latter method will spend nearly four times as much resources at each function call. A properly invariant descriptor would render these workarounds unnecessary and could potentially even reduce the dimensionality of the input space.
We are aware of some descriptors popularly used in representing atomistic input data, such as the smooth overlap of atomic positions (SOAP) descriptor bartok2013representing , that are invariant in translations, rotations, and reflections. These descriptors have been developed for a somewhat different task than ours—for mapping atomistic input to total energy of the system, and often also the forces present in it, as opposed to just a single migration energy barrier of a transition process. The different problem setting motivates different properties for the descriptor. The SOAP descriptor in particular was developed to give a continuous similarity kernel for comparing systems where atomic positions are not restricted to certain lattice positions. The migration energy problem can certainly be formulated in terms of total energies: is the difference between the energy of the system at its saddle position (roughly, halfway through the jump) and its initial position. For the purpose of KMC simulations only, the total energies can be expensive extra information, as the only parameters of interest are the barriers. Nevertheless, at least Messina et al. have taken this approach using the bispectrum descriptor messina2018smart .
While the SOAP descriptor specifically may be unnecessarily complicated for a rigidlattice system, where its continuity properties are not needed, the applications of these kinds of descriptors in the direct migration barrier regression could be explored in future work. At the same time, developing new descriptors precisely suited for this task might prove fruitful. Relevant input patterns, such as those discovered in this work, can guide in the design of descriptors that have the desired invariance properties. One could imagine a set of templates, created either manually or as a part of the training process, that are convoluted over the threedimensional representation of the local atomic environment, to produce the input values fed to a machine learning regressor.
4 Conclusions
The radial basis function network (RBFN) is a classic model for supervised machine learning and has good interpretability properties when the number of centroids is small. For training RBFN’s, previous research has mainly concentrated on heuristic twostep methods, apparently because gradient descent–based optimization for small RBFN’s has faced local minima problems.
In this article, we introduced a new PyTorchbased RBFN implementation that can be combined with modern optimization techniques that are currently used in deep learning. With a large number of centroids and full gradientbased optimization, we showed that RBFN’s can achieve predictive performance on par with hyperparameteroptimized gradientboosted decision trees on a materials physics data set that describes migration energy barriers for atom configurations on three different copper surfaces.
To make the trained RBF networks interpretable, we derived a novel pruning method based on finding a smaller RBFN that globally approximates the larger one given a suitable assumption on the distribution of the data manifold. We provided closedform pruning objective functions for the cases where the input features are assumed to follow either a Bernoulli, a continuous uniform, or a Gaussian mixture distribution. Using the Bernoulli objective, we pruned RBFN’s with thousands of centroids trained on the aforementioned materials physics data set down to sixteen centroids. Visualizations of the obtained centroids show that the large RBFN’s have learned multiple patterns that match the properties of the physical lattice without the models having had access to any a priori knowledge on the nature of the problem.
While our methods produce good results on a realworld data set, many possibilities remain for further refinement. We have only considered the common but quite restrictive exponential kernel, and one can also generalize the functional form of RBF networks in various ways. For instance, having a separate
parameter for each centroid would increase the flexibility of the model, though at the expense of making at least the pruning objective derivations more complicated. For pruning, one could consider automatically approximating the distribution of the training data set with Gaussian mixture–based kernel density estimators, though this would entail the risk of overfitting and hence producing pruned RBFN’s that are unable to generalize from the training data.
Acknowledgments
This work was supported by the Academy of Finland (projects 313857 and 313867). Computational resources were provided by the Finnish Grid and Cloud Infrastructure (urn:nbn:fi:researchinfras2016072533).
References
 (1) D. S. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (3) (1988) 321–355.
 (2) J. Park, I. W. Sandberg, Universal approximation using radialbasisfunction networks, Neural Computation 3 (2) (1991) 246–257. doi:10.1162/neco.1991.3.2.246.
 (3) S. Chen, C. F. N. Cowan, P. M. Grant, Orthogonal least squares learning algorithm for radial basis function networks, IEEE Transactions on Neural Networks 2 (2) (1991) 302–309. doi:10.1109/72.80341.

(4)
M.T. VakilBaghmisheh, N. Pavešić, Training RBF networks with selective backpropagation, Neurocomputing 62 (2004) 39–64.
doi:10.1016/j.neucom.2003.11.011.  (5) M. J. L. Orr, Introduction to radial basis function networks, Tech. rep., Centre for Cognitive Science, University of Edinburgh (Apr. 1996).
 (6) Y. Wu, H. Wang, B. Zhang, K.L. Du, Using radial basis function networks for function approximation and classification, ISRN Applied Mathematics 2012, article ID 324194. doi:10.5402/2012/324194.
 (7) C. S. K. Dash, A. K. Behera, S. Dehuri, S.B. Cho, Radial basis function neural networks: a topical stateoftheart survey, Open Computer Science 6 (1). doi:10.1515/comp20160005.

(8)
A. Alexandridis, E. Chondrodima, H. Sarimveis, Cooperative learning for radial basis function networks using particle swarm optimization, Applied Soft Computing 49 (2016) 485–497.
doi:10.1016/j.asoc.2016.08.032.  (9) J. Lu, H. Hu, Y. Bai, Generalized radial basis function neural network based on an improved dynamic particle swarm optimization and AdaBoost algorithm, Neurocomputing 152 (2015) 305–315. doi:10.1016/j.neucom.2014.10.065.
 (10) J. Kimari, V. Jansson, S. Vigonski, E. Baibuz, R. Domingos, V. Zadin, F. Djurabekova, Application of artificial neural networks for rigid lattice kinetic monte carlo studies of cu surface diffusion (Submitted for publication).
 (11) J. Kimari, V. Jansson, S. Vigonski, E. Baibuz, R. Domingos, V. Zadin, F. Djurabekova, Data sets and trained neural networks for cu migration barriers, (Submitted for publication).
 (12) E. Baibuz, S. Vigonski, J. Lahtinen, J. Zhao, V. Jansson, V. Zadin, F. Djurabekova, Migration barriers for surface diffusion on a rigid lattice: Challenges and solutions, Computational Materials Science 146 (2018) 287–302. doi:10.1016/j.commatsci.2017.12.054.
 (13) G.B. Huang, P. Saratchandran, N. Sundararajan, An efficient sequential learning algorithm for growing and pruning RBF (GAPRBF) networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34 (6) (2004) 2284–2292. doi:10.1109/TSMCB.2004.834428.
 (14) G.B. Huang, P. Saratchandran, N. Sundararajan, A generalized growing and pruning RBF (GGAPRBF) neural network for function approximation, IEEE Transactions on Neural Networks 16 (1) (2005) 57–67. doi:10.1109/TNN.2004.836241.

(15)
M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, D. M. Hummels, On the training of radial basis function classifiers, Neural Networks 5 (4) (1992) 595–603.
doi:10.1016/S08936080(05)800383.  (16) A. Leonardis, H. Bischof, An efficient MDLbased construction of RBF networks, Neural Networks 11 (5) (1998) 963–973. doi:10.1016/S08936080(98)000513.
 (17) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, in: NIPS Autodiff Workshop, 2017.
 (18) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, 2015.
 (19) C. Kittel, P. McEuen, P. McEuen, Introduction to solid state physics, Vol. 8, Wiley New York, 1996.
 (20) J. H. Friedman, Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29 (5) (2001) 1189–1232.
 (21) T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 2016, pp. 785–794. doi:10.1145/2939672.2939785.
 (22) J. Bergstra, Y. Bengio, Random search for hyperparameter optimization, Journal of Machine Learning Research 13 (Feb) (2012) 281–305.
 (23) D. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 24, 2016, Conference Track Proceedings, 2016.
 (24) F. Djurabekova, L. Malerba, C. Domain, C. Becquart, Stability and mobility of small vacancy and coppervacancy clusters in bccfe: An atomistic kinetic monte carlo study, Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 255 (1) (2007) 47–51.

(25)
F. Djurabekova, R. Domingos, G. Cerchiara, N. Castin, E. Vincent, L. Malerba, Artificial intelligence applied to atomistic kinetic monte carlo simulations in fe–cu alloys, Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 255 (1) (2007) 8–12.
 (26) A. P. Bartók, R. Kondor, G. Csányi, On representing chemical environments, Physical Review B 87 (18) (2013) 184115.
 (27) L. Messina, A. Quaglino, A. Goryaeva, M.C. Marinica, C. Domain, N. Castin, G. Bonny, R. Krause, Smart energy models for atomistic simulations using a dftdriven multifidelity approach, arXiv preprint arXiv:1808.06935.
Appendix A Proof of Theorem 2
Let and . Assume first that has the mixture density
Then
and by straightforward algebraic manipulation, we find that the righthandside expectation equals
By recognizing that the final integral equals one, we obtain the desired form.
Consider then the uniform density
As previously, we decompose
and by algebraic manipulation we arrive at a form involving the Gaussian CDF which we express in terms of the error function:
This concludes the uniform case.
Finally, consider the binary case
where is the Bernoulli point probability function for the outcomes . We denote the first elements of by (and similarly for other vectors) and give a proof by induction on the number of dimensions. The case is straightforward:
Assume then that the claim holds for some and consider the case with dimensions. We have
which completes the proof.