Introduction
The quality of a model is typically measured by its ability to generalize from a training set to previously unseen data from the same distribution. In regression tasks, generalization essentially boils down to interpolation if the training data is sufficiently dense. As long as models are selected correctly, i.e., in a way that does not overfit the data, the regression problem is well understood and can – at least conceptually – be considered solved. However, when working with data from real-world devices, e.g., when controlling a robotic arm, interpolation might not be sufficient. It could happen that future data lies outside of the training domain, e.g., when the arm is temporarily operated outside of its specifications. For the sake of robustness and safety it is desirable in such a case to have a regression model that continues to make good predictions, or at least does not fail catastrophically. This setting, which we call extrapolation generalization, is the topic of the present paper.

We are particularly interested in regression tasks for systems that can be described by real-valued analytic expressions, e.g., mechanical systems such as a pendulum or a robotic arm. These are typically governed by highly nonlinear functions, but it is nevertheless possible, in principle, to infer their behavior on an extrapolation domain from their behavior elsewhere. We make two main contributions: 1) a new type of network that can learn analytical expressions and is able to extrapolate to unseen domains, and 2) a model selection strategy tailored to the extrapolation setting.
The following section describes the setting of regression and extrapolation. Afterwards we introduce our method and discuss the architecture, its training, and its relation to prior art. We present our results in the section on experimental evaluation and close with conclusions.
Regression and extrapolation
We consider a multivariate regression problem with a training set $\{(x_1, y_1), \dots, (x_N, y_N)\}$ with $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$. Because our main interest lies on extrapolation in the context of learning the dynamics of physical systems, we assume the data originates from an unknown analytical function (or system of functions) $\phi: \mathbb{R}^n \to \mathbb{R}^m$ with additive zero-mean noise $\xi$, i.e., $y = \phi(x) + \xi$ and $\mathbb{E}\xi = 0$. The function $\phi$ may, for instance, reflect a system of ordinary differential equations that govern the movements of a robot arm or the like. The general task is to learn a function $\psi: \mathbb{R}^n \to \mathbb{R}^m$ that approximates the true functional relation as well as possible in the squared loss sense, i.e., achieves minimal expected error $\mathbb{E}\,\lVert \psi(x) - \phi(x) \rVert^2$. In practice, we only have particular examples of the function values available and measure the quality of predicting in terms of the empirical error on training or test data $D$,

$$E(D) = \frac{1}{N} \sum_{i=1}^{N} \lVert \psi(x_i) - y_i \rVert^2 . \qquad (1)$$
If training and test data are sampled from the same distribution, we speak of an interpolation problem. In the extrapolation setting the training data is assumed to cover only a limited range of the data domain. In the example of the robot arm, for instance, the training may be restricted to a certain joint angle range or maximal velocity. For testing we want to make predictions about the unseen domains, e.g., for higher velocities. As usual, we split the data that is available at training time into a part for model training and a part for validation or model selection.
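As a minimal illustration of the setting (the target function, domain bounds, and noise level below are placeholder choices, not the paper's experimental protocol), training data can be drawn from a restricted input range while the extrapolation test set leaves that range:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # hypothetical ground-truth function standing in for the unknown phi
    return np.sin(x)

# training data restricted to [-1, 1]; the extrapolation set covers [-2, 2]
x_train = rng.uniform(-1.0, 1.0, size=1000)
y_train = target(x_train) + rng.normal(0.0, 0.01, size=x_train.shape)
x_extra = rng.uniform(-2.0, 2.0, size=1000)
y_extra = target(x_extra)

def rms_error(y_pred, y_true):
    # root mean squares error over a dataset, cf. the empirical error in Eq. (1)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```

A model fit on `x_train` is then evaluated with `rms_error` on both the interpolation and the extrapolation sets.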
Learning a network for function extrapolation
The main model we propose is a multi-layered feed-forward network with computational units specifically designed for the extrapolation regression task. For an $L$-layer network, there are $L-1$ hidden layers, each consisting of a linear mapping followed by non-linear transformations. For simplicity of notation, we explain the network as if each hidden layer had the same structure ($n'$ inputs, $m'$ outputs). In practice, each layer can be designed independently of the others, of course, as long as input/output dimensions match. The linear mapping at level $l$ maps the $n'$-dimensional input $y^{(l-1)}$ to the $d$-dimensional intermediate representation $z^{(l)}$ given by
$$z^{(l)} = W^{(l)} y^{(l-1)} + b^{(l)} , \qquad (2)$$
where $y^{(l-1)}$ is the output of the previous layer, with the convention $y^{(0)} = x$. The weight matrix $W^{(l)}$ and the bias vector $b^{(l)}$ are free parameters that are learned during training. The non-linear transformation contains $u$ unary units, $f_i: \mathbb{R} \to \mathbb{R}$, for $i = 1, \dots, u$, and $v$ binary units, $g_j: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, for $j = 1, \dots, v$. Their outputs are concatenated to form the layer output

$$y^{(l)} := \big( f_1(z_1), \dots, f_u(z_u),\; g_1(z_{u+1}, z_{u+2}), \dots, g_v(z_{u+2v-1}, z_{u+2v}) \big) . \qquad (3)$$
In total, the non-linear stage has $m' = u + v$ outputs and $d = u + 2v$ inputs. The unary units $f_1, \dots, f_u$ receive the respective components $z_1, \dots, z_u$ as inputs, and each unit may be one of the following base functions, as specified in a fixed type parameter $I_i$:

$$f_i(z_i) := \begin{cases} z_i & \text{if } I_i = 0, \\ \sin(z_i) & \text{if } I_i = 1, \\ \cos(z_i) & \text{if } I_i = 2, \\ \operatorname{sigm}(z_i) & \text{if } I_i = 3, \end{cases} \qquad (4)$$
where $\operatorname{sigm}(z) = \frac{1}{1 + e^{-z}}$ is the standard sigmoid function. The binary units $g_1, \dots, g_v$ receive the remaining components, $z_{u+1}, \dots, z_{u+2v}$, as input in pairs of two. They are multiplication units that compute the product of their two input values:

$$g_j(a, b) := a \cdot b . \qquad (5)$$
Finally, the $L$-th and last layer computes the regression values by a linear read-out,

$$y^{(L)} := W^{(L)} y^{(L-1)} + b^{(L)} . \qquad (6)$$
The architecture is depicted in Fig. 1. We call the new architecture Equation Learner (EQL) and denote the function it defines by $\psi$.
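To make the layer computation concrete, here is a minimal numpy sketch of one EQL hidden layer (the unit counts, type assignment, and weight initialization are illustrative choices, not the configuration used in the experiments):

```python
import numpy as np

def eql_layer(y_prev, W, b, unary_types, n_binary):
    """One EQL hidden layer: linear map, then unary and binary (product) units.

    unary_types: list of element-wise functions applied to the first
                 len(unary_types) components of z; the remaining
                 2 * n_binary components feed the multiplication units.
    """
    z = W @ y_prev + b                               # Eq. (2)
    u = len(unary_types)
    unary_out = np.array([f(z[i]) for i, f in enumerate(unary_types)])
    pairs = z[u:u + 2 * n_binary].reshape(n_binary, 2)
    binary_out = pairs[:, 0] * pairs[:, 1]           # Eq. (5): pairwise products
    return np.concatenate([unary_out, binary_out])   # Eq. (3)

# example: 2 inputs, unary units [identity, sin, cos, sigmoid], 1 product unit
sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
types = [lambda z: z, np.sin, np.cos, sigm]
rng = np.random.default_rng(0)
d = len(types) + 2                 # intermediate dimension u + 2v
W = rng.normal(0.0, 1.0, (d, 2))
y1 = eql_layer(np.array([0.3, -0.7]), W, np.zeros(d), types, n_binary=1)
# y1 has u + v = 5 components
```

Stacking such layers and ending with a linear read-out (Eq. (6)) gives the full network.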
Discussion of the architecture
The proposed network architecture differs in two main aspects from typical feedforward networks: the existence of multiplication units and the possibility of sine and cosine as nonlinearities for the unary units. Both design choices are motivated by our objective of learning a system of equations that govern a physical system and can extrapolate to new parts of the input space.
Sigmoid non-linearities are the canonical choice of activation function for artificial neural networks (ANNs) and have proved successful. In fact, we include sigmoids in our architecture, making it a superclass of ANNs. However, in our setting they are typically disabled by the training procedure, corresponding to their absence in the considered physical equations. Other, predominantly local non-linearities, in particular radial basis functions [Broomhead and Lowe, 1988], we do not include, since one cannot expect them to extrapolate at all. Further non-linearities, such as (square) roots and logarithms, could in principle be useful for learning physical equations, but they pose problems because their domains of definition are restricted to positive inputs. We leave the task of incorporating them in a principled way to future work.

The ability to multiply two values is a second crucial component of our network architecture. Again, it is inspired by the typical form of physical equations, where multiplication of components is arguably the second most common basic operation after addition (which the linear layers can perform). Multiplication was introduced into neural networks long ago as product units [Durbin and Rumelhart, 1989] and Pi-Sigma units [Shin and Ghosh, 1991]. Product units have a large fan-in and compute products over all their inputs, potentiated by the respective weights. The result typically behaves like a high-order polynomial; polynomials are powerful function approximators, but rarely occur in physical equations. They are also known to require careful fine-tuning in order not to overfit, which makes them a risky choice for the purpose of extrapolation. The Pi-Sigma units are multiplication units with a fixed number of factors, and our multiplication units are the special case of 2 factors. We find that multiplying just two values at a time is well adjusted to the task we aim at, as it allows controlling the maximal degree of the learned polynomial by the depth of the network.
Finally, each layer of the network contains unary units that act as identity maps, which in particular gives the network the option to learn functions with a smaller number of non-linearities than the total network depth.
Network training
The EQL is fully differentiable in its free parameters $\theta = \{W^{(1)}, \dots, W^{(L)}, b^{(1)}, \dots, b^{(L)}\}$, which allows us to train it in an end-to-end fashion using back-propagation. We adopt a Lasso-like objective [Tibshirani, 1996],

$$\mathcal{L}(D) = \frac{1}{N} \sum_{i=1}^{N} \lVert \psi(x_i) - y_i \rVert^2 + \lambda \sum_{l=1}^{L} \big\lVert W^{(l)} \big\rVert_1 , \qquad (7)$$

that is, a linear combination of $L_2$ loss and $L_1$ regularization, and apply a stochastic gradient descent algorithm with mini-batches and Adam [Kingma and Ba, 2015] for calculating the updates:

$$\theta_{t+1} = \theta_t + \operatorname{Adam}\!\left( \frac{\partial \mathcal{L}(D_{(t)})}{\partial \theta}, \alpha \right) , \qquad (8)$$

where $D_{(t)}$ denotes the current mini-batch and $\alpha$ is the step-size parameter. The choice of Adam is not critical and standard stochastic gradient descent also works. In all numerical experiments we use a fixed step-size parameter $\alpha$ and a mini-batch size of 20.
The role of the $L_1$ regularization is to encourage networks with sparse connections, matching the intuition that a typical formula describing a physical system contains only a small number of terms, each operating on only a few variables. However, in a non-convex setting where local minima are likely to occur, this type of regularization can have an undesirable side-effect: during the course of the optimization the weights hardly ever change their sign. The reason is that the regularization leads to a constant rate of weight decay, whereas the counteracting derivative with respect to the squared loss is proportional to the back-propagated error signal and the input to the unit. The latter contributions are often smaller along paths with small weights, such that many weights go to zero and stay there. Additionally, any non-zero regularization term causes the learned weights to reflect a trade-off between minimizing the loss and the regularizer. Although this can lead to improved generalization, it also results in a systematic underestimation of the function values.
Therefore, we follow a hybrid regularization strategy: at the beginning of the training procedure ($t < t_1$) we use no regularization ($\lambda = 0$), such that parameters can vary freely and reach reasonable starting points. Afterwards, we switch on the regularization by setting $\lambda$ to a non-zero value, which has the effect that a sparse network structure emerges. Finally, for the last steps of the training ($t > t_2$) we disable the $L_1$ regularization ($\lambda = 0$) but enforce the same $L_0$ norm of the weights. This is achieved by keeping all weights that are close to 0 at 0, i.e., if the magnitude of a weight $w$ falls below a small threshold, then $w = 0$ is enforced during the remaining epochs. This ensures that the learned model finds not only a function of the right parametric form, but also fits the observed values as closely as possible. We observed that the exact choice of the breakpoints $t_1$ and $t_2$ is not critical. In practice, we use fixed fractions of $T$, the total number of update steps, which was selected large enough to ensure convergence. Note that convergence to a sparse structure is important here, so early stopping would be disadvantageous.

Model selection for extrapolation
EQL networks have a number of hyper-parameters, e.g., the number of layers, the number of units, and the regularization constant. Unfortunately, standard techniques for model selection, such as evaluation on a hold-out set or cross-validation, will not be optimal for our purpose, since they rely on interpolation quality. In order to extrapolate, the network has to find the "right" formula. But how can we tell? Using the Occam's razor principle: the simplest formula is most likely the right one. Intuitively, if we have the choice between a compact closed-form expression and its truncated power series approximation, the first one is preferred. We use the number of active hidden units in the network as a proxy for the complexity of the formula; see Appendix A1 for details. One could also think of differentiating between the unit types. In any case, this argumentation is only correct if the model explains the data well, i.e., it has a low validation error. So we have a dual objective to minimize, which we solve by ranking the instances w.r.t. validation error and sparsity and selecting the one with the smallest $L_2$ norm in rank space; see Eq. (15).

Furthermore, the optimization process may only find a local optimum of the training objective, which depends on the initialization of the parameters. We use independent runs to quantify expected performance deviations.
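The dual objective can be implemented by ranking the candidates separately by validation error and by sparsity and picking the one with the smallest Euclidean norm in rank space (a sketch of the idea; the exact tie-breaking in the paper's Eq. (15) may differ):

```python
import numpy as np

def select_model(val_errors, n_active_units):
    """Rank candidates by validation error and by number of active units,
    then pick the candidate minimizing the L2 norm of its two ranks."""
    err_rank = np.argsort(np.argsort(val_errors))            # 0 = lowest error
    sparsity_rank = np.argsort(np.argsort(n_active_units))   # 0 = sparsest
    score = err_rank.astype(float) ** 2 + sparsity_rank.astype(float) ** 2
    return int(np.argmin(score))
```

With equal validation errors the sparser candidate wins, and with equal sparsity the more accurate one does.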
Related work
In the field of machine learning, regression is often treated as a black-box process of identifying a suitable real-valued function from a hypothesis set, e.g., a reproducing kernel Hilbert space for Gaussian Process Regression (GPR) [Williams and Rasmussen, 2006] or Support Vector Regression (SVR) [Smola and Schölkopf, 2004], or a multi-layer network of suitable expressive power [Specht, 1991]. The goal is to find a prediction function that leads to a small expected error on future data, not necessarily to gain insight into the mechanism of how the output values derive from the inputs. The goal of finding an interpretable function is rather common in the natural sciences, such as biology, where high noise levels and strong inter-system variability often make it important to rely on external prior knowledge, and finding a "biologically plausible" model is often preferable to finding one with the highest prediction accuracy. As a consequence, model classes are often highly constrained, e.g., allowing only for sparse linear models.
The task of learning a true, non-linear, functional dependence from observing a physical system has received little attention in the machine learning literature so far, but forms the basis of the field of system identification. There, typically the functional form of the system is known and only the parameters have to be identified. Another approach is to model the time evolution with autoregressive models or higher-order convolution integrals (Volterra series), but learning analytic formulas is not common.
Causal learning is an area of recent research that aims at identifying a causal relation between multiple observables, which are typically the result of a physical process. Classically, this task reduces to finding a minimal graphical model based only on tests of conditional independence [Pearl, 2000]. Although very successful in some fields, this classical approach only provides a factorization of the problem, separating causes and effects, but it leaves the exact functional dependency unexplained. Recent extensions of causal learning can take a functional view, but typically do not constrain the regression functions to physically plausible ones; instead they constrain the noise distributions [Peters et al., 2014]. The topic of learning a regression function with emphasis on extrapolation performance has not been studied much in the literature so far. Existing work on time series prediction deals with extrapolation in the temporal domain, i.e., predicting the next value(s) [Wiener, 1949]. By our nomenclature, this is typically rather an interpolation task, since the prediction is based on the behaviour of the series at earlier time steps but with a similar value distribution [Müller et al., 1997; Györfi et al., 2013]. Extrapolating in the data domain implies that the data distribution at prediction time will differ from the data distribution at training time. This is traditionally called the domain adaptation setting. In particular, since we assume a common labeling function, our setting would fall under the covariate shift setting [Quiñonero-Candela et al., 2009]. Unfortunately, this connection is not particularly useful for our problem. As domain adaptation typically does not make additional assumptions about how the data distribution may change, existing methods need access to some unlabeled data from the test distribution already at training time [Ben-David et al., 2010]. In our setting, such data is not available.
On the technical level, EQL networks are an instance of general feed-forward networks for function approximation [Bishop, 1995]. In contrast to recent trends towards deep learning [Bengio, 2009; Bengio et al., 2013], our goal is not to learn any data representation, but to learn a function which compactly represents the input-output relation and generalizes between different regions of the data space, like a physical formula. Structurally, EQL networks resemble sum-product networks (SPNs) [Poon and Domingos, 2012] and Pi-Sigma networks (PSNs) [Shin and Ghosh, 1991], in the sense that both are based on directed acyclic graphs with computational units that allow for summation and multiplication. Otherwise, SPNs are different, as they act as an efficient alternative to probabilistic graphical models for representing probability distributions, whereas EQL networks are meant for the classical task of function approximation. In PSNs each output needs to be passed through multiplicative units, whereas in EQL multiplication is optional.
Finding equations for observations is also known as symbolic regression, where a search is performed in a certain function space, typically using evolutionary computation. With these techniques it is possible to discover physical laws such as invariants and conserved quantities [Schmidt and Lipson, 2009]. Unfortunately, the computational complexity/search time explodes for larger expressions and high-dimensional problems. We attempt to circumvent this by modeling the task as a gradient-based optimization problem. Related to symbolic regression is finding mathematical identities, for instance to find computationally more efficient expressions. In [Zaremba et al., 2014] this was done using machine learning to overcome the potentially exponential search space.

Experimental evaluation
We demonstrate the ability of EQL to learn physically inspired models with good extrapolation quality by experiments on synthetic and real data. For this, we implemented the network training and evaluation procedure in python based on the theano framework [Theano Development Team, 2016]. We will make the code for training and evaluation public after acceptance of the manuscript.
Pendulum.
We first present the results of learning the equations of motion for a very simple physical system: a pendulum. The state space of a pendulum is $\mathbb{R} \times \mathbb{R}$, where the first value is the angle of the pole in radians and the second value is the angular velocity. In the physics literature these are usually denoted as $(\theta, \dot\theta)$, but for our purposes we call them $(x_1, x_2)$ in order to keep the notation consistent between experiments. The pendulum's dynamic behavior is governed by the following two ordinary differential equations:

$$\dot x_1 = x_2 \qquad \text{and} \qquad \dot x_2 = -g \sin(x_1) , \qquad (9)$$

where $g = 9.81$ is the gravitation constant.

We divide each equation by $g$ in order to balance the output scales and form a regression problem with two output values, $y_1 = \frac{1}{g} x_2$ and $y_2 = -\sin(x_1)$.
As training data, we sample 1000 points uniformly in a hypercube around the origin of the state space. Note that this domain contains more than half of a sine period, so it should be sufficient to identify the analytic expression. The target values are disturbed by Gaussian noise with a small standard deviation. We also define three test sets, each with 1000 points. The interpolation test set is sampled from the same data distribution as the training set. The extrapolation (near) test set contains data sampled uniformly from a moderately enlarged domain, which is relatively near the training region, and the extrapolation (far) test set extends the region even further outside. We train a 2-layer EQL and perform model selection among the hyper-parameters: the regularization strength $\lambda$ and the number of nodes per layer. All weights are randomly initialized from a zero-mean normal distribution. The unit selection is set such that all unit types occur equally often. To ensure convergence, we chose a sufficiently large number of epochs. We compare our algorithm to a standard multi-layer perceptron (MLP) with the hyper-parameters: $\lambda$ as for EQL, the number of layers, and the number of neurons. A second baseline is given by epsilon support vector regression (SVR) [Basak et al., 2007] with two hyper-parameters $C$ and $\epsilon$, using a radial basis function kernel with width $\gamma$.
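The pendulum data generation can be sketched as follows (the domain radii and noise level are placeholder values, since the exact ones are not stated here):

```python
import numpy as np

rng = np.random.default_rng(1)

def pendulum_targets(x1, x2, g=9.81):
    # regression targets after dividing both equations by g, cf. Eq. (9)
    return np.stack([x2 / g, -np.sin(x1)], axis=-1)

def sample_set(n, radius, noise_std):
    x = rng.uniform(-radius, radius, size=(n, 2))
    y = pendulum_targets(x[:, 0], x[:, 1])
    return x, y + rng.normal(0.0, noise_std, size=y.shape)

x_train, y_train = sample_set(1000, radius=2.0, noise_std=0.01)  # training domain
x_far, y_far = sample_set(1000, radius=6.0, noise_std=0.01)      # far extrapolation
```

The interpolation and near-extrapolation test sets are generated the same way with the corresponding radii.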

Table 1: Interpolation and extrapolation performance of EQL, MLP, and SVR on the pendulum dataset. Reported are the mean and standard deviation of the root mean squares error (RMS, cf. Eq. (1)) on the interpolation, extrapolation (near), and extrapolation (far) test sets for 10 random initializations.
Numeric results are reported in Tab. 1. As expected, all models are able to interpolate well, with a test error on the order of the noise level. For extrapolation, however, the performance differs between the approaches. For MLP the prediction quality decreases quickly when leaving the training domain. SVR remains a bit better in the near extrapolation but also fails catastrophically on the far extrapolation data. EQL, on the other hand, extrapolates well, both near and far away from the training domain. The reasons can be seen in Figure 2: while the MLP and SVR simply learn a function that interpolates the training values, EQL finds the correct functional expression and therefore predicts the correct values for any input data.
Double pendulum kinematics.
The second system we consider is a real double pendulum, for which the forward kinematics should be learned. For that we use recorded trajectories of a real double pendulum [Schmidt and Lipson, 2009]. The task here is to learn the positions of the tips of the double pendulum segments from the given joint angles $(x_1, x_2)$. These positions were not measured, so we computed them from the angles via the forward kinematics formula, where $(y_1, y_2)$ and $(y_3, y_4)$ correspond to the x-y coordinates of the first and second end-point, respectively. The dataset contains two short trajectories. The first covers only part of the domain (input as well as output) and consists of 819 samples, of which 10% were used as validation set (randomly sampled); see Fig. 3(a). The second trajectory corresponds to a behavior with several spins of both pendulum segments, such that a much larger domain is covered. Nevertheless the angle values remain confined to their usual range. We use this trajectory as extrapolation test set. The trajectory and the outputs of our method are shown in Fig. 3(b). The prediction for unseen domains is perfect, which is also illustrated in a systematic sweep; see Fig. 3(c). The performance of the MLP is off already near the training domain. SVR is a bit better, but still does not give usable predictions for the test data; see also the root mean squares error in Fig. 3(d).
Figure 3: (a) First (training) trajectory; (b) extrapolation test trajectory with the outputs of our method; (c) systematic sweep over the input domain; (d) extrapolation error of EQL, MLP, and SVR.
Model selection is performed as above to determine $\lambda$, the number of nodes (for the MLP: the number of neurons), and the number of layers.
Robotic arms.
A more complicated task is to learn the forward kinematics of multi-segment robotic arms. We consider planar arms with 3, 4, and 5 joints, where each segment is 0.5 units long. For training, the arm is controlled by sinusoidal joint target angles of bounded amplitude, each joint with a different frequency. The numbers of data points are 3000, 6000, and 18000 for the 3-, 4-, and 5-segment arms, respectively, with added noise as above. For testing extrapolation performance, a larger amplitude was used. Note that the extrapolation space is much larger than the training space. The task is to predict the coordinates of the end-effector of the arms (kin-3-end, kin-4-end) and the coordinates of all segment positions (kin-5-all). The numerical results, see Tab. 2, show that our method is able to extrapolate in these cases. Model selection is performed as above over $\lambda$, the number of nodes (MLP: number of neurons), and the number of layers. To illustrate the dependence on the amount of noise and the number of available training points, we provide a quantification in Appendix A2. In short, increasing noise can be compensated by an increasing amount of data to maintain the performance.
Table 2: Extrapolation performance (RMS, cf. Eq. (1)) of EQL, MLP, and SVR on the kin-3-end, kin-4-end, and kin-5-all datasets.
Learning complex formulas.
In order to find out whether EQL can also learn more complicated formulas, we consider three examples with four-dimensional input and one-dimensional output:
F1  (10)  
F2  (11)  
F3  (12) 
The first equation requires only one hidden layer to be represented. The second and third equations require two hidden layers. In particular, F2 contains a product of two terms and F3 contains a product of three terms, and we use the latter to test whether our restriction to only pairwise product units causes problems for more complex target functions. We follow the same procedure as in the pendulum case for building training and test sets, though with a larger input data range. We use 10000 points for the training and validation sets (90%/10% split) and 5000 points for each of the test sets. Model selection for EQL is performed as above, additionally over the number of layers; the number of units is fixed. For the MLP, we select the number of layers and neurons from the same sets as above, as well as $\lambda$.
Table 3 shows the numerical results. Again, all methods are able to interpolate, but only EQL achieves good extrapolation results, except for equation F3. There it settles in 9 out of 10 cases into a local minimum and finds only an approximating equation that deviates outside the training domain. Interestingly, if we restrict the base functions to not contain cosine, the algorithm finds the right formula. Note that the sparsity of the correct formula is lower than that of the approximation, so it would be selected if found. Fig. 4 illustrates the performance and the learned networks visually, showing one of the model-selected instances for each case. For F1 the correct formula was identified, so correct predictions can be made even far outside the training region (much further than illustrated). For F2 the network provided us with a surprise, because it yields good extrapolation performance with only one hidden layer! How can it implement the required product term? Apparently it uses an expression that closely approximates the product on the relevant domain. This solution is sparser than the true one, which explains its selection. For F3 the suboptimal local minimum approximates the three-factor product in a way that deviates quickly outside the training domain; the true solution would be sparser but was not found. Only if we remove cosine from the base functions do we always obtain the correct formula; see Fig. 4(c).
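Note that removing cosine from the base function set does not reduce expressiveness: the bias term of the preceding linear layer (Eq. (2)) can realize the phase shift $\cos(z) = \sin(z + \pi/2)$, so a sine unit with an appropriate bias can stand in for a cosine unit. A quick numerical check:

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 1001)
# a sine unit whose input carries a bias of pi/2 reproduces cosine
max_diff = np.max(np.abs(np.cos(z) - np.sin(z + np.pi / 2)))
```

Removing cosine thus mainly changes the optimization landscape, not the set of representable functions.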
Figure 4: Model-selected instances and learned formulas for (a) F1, (b) F2, and (c) F3, the latter comparing EQL with EQL without cosine in the base functions.
Table 3: Interpolation and extrapolation performance (RMS, cf. Eq. (1)) of EQL, MLP, and SVR on the F1, F2, and F3 datasets; for F3 additionally EQL with cosine removed from the base functions.
X-ray transition energies.
As a further example we consider data measured in atomic physics. When shooting electron beams onto atoms one can excite them, and they consequently emit x-ray radiation with characteristic peak energies. For each element/isotope these energies are different, as they correspond to the potential difference between the electron shells, so one can identify elements in a probe this way. The data is taken from [Deslattes et al., 2003], where we consider one specific transition line because it was measured for all elements. The true relationship between atomic number $Z$ and transition energies is complicated, as it involves many-body interactions and no closed-form solution exists. Nevertheless we can find out which relationships our system proposes. It is known that the main relationship is approximately quadratic in $Z$, according to Moseley's law. Further correction terms for elements with larger $Z$ are potentially of higher order. The data is split into training/validation sets covering the lower atomic numbers (70/10 data points) and an extrapolation test set covering the higher ones (14 data points, because of isotopes). Since we have so little data, we evaluate the performance for 10 independent training/validation splits. The data is scaled to lie in the unit interval. Model selection is here based on validation error only; selecting for sparsity and validation error yields only the dominant quadratic relationship. The mini-batch size is 2 here. Figure 5 presents the data, the predictions, the learned formulas, and the numerical results. EQL and SVR achieve similar performance and MLP is significantly worse. However, EQL also yields interpretable formulas, see Fig. 5(e), that can be used to gain insights into the potential relationship.
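Since Moseley's law predicts an approximately quadratic dependence of the transition energy on the atomic number, a quadratic least-squares fit makes a natural hypothetical baseline for this task (this is an illustration, not part of the paper's method; the coefficients below are synthetic):

```python
import numpy as np

def quadratic_fit(z, energies):
    """Least-squares fit of a quadratic in the atomic number z,
    the functional form suggested by Moseley's law, E ~ (z - s)^2."""
    coeffs = np.polyfit(z, energies, deg=2)
    return np.poly1d(coeffs)

# synthetic check: data generated exactly from a quadratic is recovered
z = np.arange(10, 50, dtype=float)
e = 0.01 * (z - 1.0) ** 2
model = quadratic_fit(z, e)
```

Such a baseline interpolates and extrapolates well only insofar as the quadratic term dominates; the learned EQL formulas can additionally capture higher-order corrections.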
Figure 5: X-ray transition energies: data and predictions of EQL, MLP, and SVR, the numerical results, and the learned formulas in panel (e).
Poor extrapolation out of model class — cartpendulum system
Let us now go beyond our assumptions and consider cases where the true target function is not an element of the hypothesis set.
Consider a pendulum attached to a cart that can move horizontally along a rail but that is attached to a spring-damper system; see Fig. 6(a). The system is parametrized by 4 unknowns: the position of the cart, the velocity of the cart, the angle of the pendulum, and the angular velocity of the pendulum. We combine these into a four-dimensional vector $x$.

We set up a regression problem with four outputs from the corresponding system of ordinary differential equations (13).

The formulas contain divisions, which are not included in our architecture due to their singularities. Incorporating them in a principled manner is left for future work. Thus, the cart-pendulum dynamics is outside the hypothesis class. In this case we cannot expect great extrapolation performance, and this is confirmed by the experiments. In Fig. 6(b,c) the extrapolation performance is illustrated by slicing through the input space. The near-extrapolation performance is still acceptable for both EQL and MLP, but as soon as the training region is left further behind, even the best instances differ considerably from the true values; see also the numeric results in Tab. 4. The SVR performs poorly even in the near extrapolation range. Inspecting the learned expressions, we find that the sigmoid functions are rarely used.
Figure 6: (a) Sketch of the cart-pendulum system; (b, c) slices through the input space illustrating the extrapolation behavior of EQL, MLP, and SVR.

Table 4: Interpolation and extrapolation performance (RMS, cf. Eq. (1)) of EQL, MLP, and SVR on the cart-pendulum dataset.
Conclusions
We presented a new network architecture called EQL that can learn analytic expressions that typically occur in equations governing physical, in particular mechanical, systems. The network is fully differentiable, which allows end-to-end training using back-propagation. By sequencing $L_1$ regularization and fixing the $L_0$ norm we achieve sparse representations with unbiased estimation of factors within the learned equations. We also introduce a model selection procedure specifically designed to select for good extrapolation quality by a multi-objective criterion based on validation error and sparsity. The proposed method is able to learn functional relations and extrapolate them to unseen parts of the data space, as we demonstrate by experiments on synthetic as well as real data. The approach learns concise functional forms that may provide insights into the relationships within the data, as we show on physical measurements of x-ray transition energies.
The optimization problem is nontrivial and has many local minima. We have shown cases where the algorithm is not reliably finding the right equation but instead finds an approximation only, in which case extrapolation may be poor.
If the origin of the data is not in the hypothesis class, i.e., the underlying expression cannot be represented by the network, good extrapolation performance cannot be achieved. Thus it is important to increase the model class by incorporating more base functions, which we will address in future work, alongside the application to larger examples. We expect good scaling to larger systems due to the gradient-based optimization. Apart from extrapolation, we also expect improved interpolation results in high-dimensional spaces, where data is less dense.
Acknowledgments
GM received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 291734.
References
 Basak et al. [2007] D. Basak, S. Pal, and D. C. Patranabis. Support vector regression. Neural Information Processing - Letters and Reviews, 11(10):203–224, 2007.
 Ben-David et al. [2010] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
 Bengio [2009] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 Bengio et al. [2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1798–1828, 2013.

 Bishop [1995] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
 Broomhead and Lowe [1988] D. S. Broomhead and D. Lowe. Radial basis functions, multivariable functional interpolation and adaptive networks. Technical report, DTIC Document, 1988.
 Deslattes et al. [2003] R. D. Deslattes, E. G. Kessler Jr., P. Indelicato, L. De Billy, E. Lindroth, and J. Anton. X-ray transition energies: new approach to a comprehensive evaluation. Reviews of Modern Physics, 75(1):35, 2003.
 Durbin and Rumelhart [1989] R. Durbin and D. E. Rumelhart. Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1(1):133–142, Mar. 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.1.133. URL http://dx.doi.org/10.1162/neco.1989.1.1.133.
 Györfi et al. [2013] L. Györfi, W. Härdle, P. Sarda, and P. Vieu. Nonparametric curve estimation from time series, volume 60. Springer, 2013.
 Kingma and Ba [2015] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015.
 Müller et al. [1997] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Artificial Neural Networks (ICANN), pages 999–1004. Springer, 1997.
 Pearl [2000] J. Pearl. Causality. Cambridge University Press, 2000.
 Peters et al. [2014] J. Peters, J. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research (JMLR), 15:2009–2053, 2014.
 Poon and Domingos [2012] H. Poon and P. M. Domingos. Sum-product networks: A new deep architecture, 2012.
 Quiñonero-Candela et al. [2009] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
 Schmidt and Lipson [2009] M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009. ISSN 0036-8075. doi: 10.1126/science.1165893. URL http://science.sciencemag.org/content/324/5923/81.
 Shin and Ghosh [1991] Y. Shin and J. Ghosh. The pi-sigma network: An efficient higher-order neural network for pattern classification and function approximation. In Proceedings of the International Joint Conference on Neural Networks, pages 13–18, 1991.
 Smola and Schölkopf [2004] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and computing, 14(3):199–222, 2004.
 Specht [1991] D. F. Specht. A general regression neural network. IEEE Transactions on Neural Networks (TNN), 2(6):568–576, 1991.
 Theano Development Team [2016] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
 Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Wiener [1949] N. Wiener. Extrapolation, interpolation, and smoothing of stationary time series, volume 2. The MIT Press, 1949.
 Williams and Rasmussen [2006] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for machine learning. The MIT Press, 2006.
 Zaremba et al. [2014] W. Zaremba, K. Kurach, and R. Fergus. Learning to discover efficient mathematical identities. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1278–1286. Curran Associates, Inc., 2014.
Appendix
A1: Model selection details
Quantifying sparsity
We actually want a measure of the complexity of the formula; however, since it is not clear what the right choice of measure is, we use sparsity instead, counting the number of active/used hidden units, denoted by s. For a given network we get
s = \sum_{l=1}^{L} \sum_{i=1}^{k} \Theta\!\left( \big| W^{(l)}_{i,\cdot} \big| \cdot \big| W^{(l+1)}_{\cdot,i} \big| - 0.01 \right) (14)
where \Theta is the Heaviside step function and 0.01 is an arbitrary threshold. For the multiplication units, the norms of the incoming weights for both inputs are added (omitted from the formula to avoid clutter).
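The count of active units can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weight-matrix layout (rows index a layer's units, columns its inputs), the use of the L1 norm for the row/column magnitudes, and the function name are assumptions, and the special treatment of multiplication units is omitted as in the formula above.

```python
import numpy as np

def count_active_units(weights, threshold=0.01):
    """Count hidden units whose incoming-weight norm times outgoing-weight
    norm exceeds `threshold`. `weights` is the list [W1, ..., W_{L+1}] of
    layer weight matrices; W[l] maps layer l-1 outputs to layer l units."""
    active = 0
    for l in range(len(weights) - 1):  # iterate over hidden layers only
        W_in, W_out = weights[l], weights[l + 1]
        for i in range(W_in.shape[0]):
            in_norm = np.abs(W_in[i, :]).sum()   # weights into unit i
            out_norm = np.abs(W_out[:, i]).sum()  # weights out of unit i
            if in_norm * out_norm > threshold:
                active += 1
    return active

# Toy network: one hidden layer with 3 units; unit 1 has near-zero
# weights and is therefore counted as pruned.
W1 = np.array([[1.0, 0.0], [0.001, 0.0], [2.0, 1.0]])
W2 = np.array([[1.0, 0.001, 0.5]])
print(count_active_units([W1, W2]))  # -> 2
```

Units with tiny incoming or outgoing weights contribute nothing to the output, so thresholding the product of both norms discards them from the complexity count.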
Selection criteria
As stated in the main text, we strive to choose a model that is both simple and performs well on the validation set. Since the two quantities have different scales, we proposed to combine them via their rankings. Let r^{v}(n) and r^{s}(n) be the ranks of network n w. r. t. the validation error and the sparsity respectively; then the network with minimal squared rank norm is selected:
\hat{n} = \arg\min_{n} \left[ \big(r^{v}(n)\big)^2 + \big(r^{s}(n)\big)^2 \right] (15)
In Fig. 7 the extrapolation performance of all considered networks on the kin2D3 dataset is visualized as a function of validation error and sparsity. It becomes evident that the best-performing networks are both sparse and have a low validation error.
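The rank-based selection of Eq. (15) can be sketched in a few lines; the function name and the tie-breaking behavior (NumPy's argsort order) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def select_by_rank(val_errors, sparsities):
    """Return the index of the model minimizing r_v^2 + r_s^2, where
    r_v and r_s are the (0-based) ranks w.r.t. validation error and
    sparsity; rank 0 is the best (lowest) value in each list."""
    val_errors = np.asarray(val_errors)
    sparsities = np.asarray(sparsities)
    # argsort of argsort converts values into their ranks
    r_v = np.argsort(np.argsort(val_errors))
    r_s = np.argsort(np.argsort(sparsities))
    return int(np.argmin(r_v**2 + r_s**2))

# Three candidate networks: the third is neither the most accurate nor
# the sparsest, but ranks well on both criteria and is selected.
print(select_by_rank([0.5, 0.1, 0.3], [2, 10, 3]))  # -> 2
```

Working with ranks rather than raw values sidesteps the incomparable scales of validation error and unit counts, at the cost of discarding the margins between candidates.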
A2: Dependence on noise and number of data points
In order to understand how the method depends on the amount of noise and the number of data points, we scan over the two parameters and present the empirical results in Fig. 8. In general the method is robust to noise and, as expected, more noise can be compensated for by more data.