Inferring mathematical laws from experimental data is a central problem in natural science: having observed a variable y at several points x, it amounts to finding a function f such that y ≈ f(x) at every observation. Two types of approaches exist to solve this problem. In parametric statistics (PS), the function is defined by a small number of parameters that can be estimated directly from the data. On the other hand, machine learning (ML) techniques such as decision trees and neural networks select f from large families of non-linear functions by minimizing a loss over the data. The latter relax the assumptions about the underlying law, but their solutions are more difficult to interpret, and they tend to overfit small experimental data sets, yielding poor extrapolation performance.
Symbolic regression (SR) stands as a middle ground between PS and ML approaches: the function is selected from a large family, but is required to be defined by an interpretable analytical expression. It has already proved extremely useful in a variety of tasks such as inferring physical laws [udrescu2020ai, Cranmer2020DiscoveringSM].
SR is usually performed in two steps. First, one predicts a “skeleton”, a parametric function built from a pre-defined list of operators – typically, the basic operations (+, −, ×, ÷) and elementary functions (cos, exp, etc.). The skeleton determines the general shape of the law up to a choice of constants. Then, the constants in the skeleton are estimated using non-linear optimization techniques, typically the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS).
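As a concrete illustration of the second step, the constants of a skeleton can be fitted by minimizing the squared error over the observations with SciPy's BFGS implementation. This is a minimal sketch: the skeleton a·cos(b·x + c), the restart strategy and the initialization distribution are illustrative choices, not the exact settings used in this work.

```python
import numpy as np
from scipy.optimize import minimize

def fit_constants(skeleton, x, y, n_consts, n_restarts=10, seed=0):
    """Fit the constants of a skeleton by least squares with BFGS.
    Several random restarts are used because the loss is non-convex."""
    rng = np.random.default_rng(seed)
    loss = lambda theta: np.mean((skeleton(x, theta) - y) ** 2)
    best = None
    for _ in range(n_restarts):
        res = minimize(loss, rng.normal(size=n_consts), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun

# hypothetical skeleton a*cos(b*x + c), with ground-truth constants (2, 3, 0.5)
skeleton = lambda x, t: t[0] * np.cos(t[1] * x + t[2])
x = np.linspace(-1.0, 1.0, 100)
y = 2.0 * np.cos(3.0 * x + 0.5)
theta, err = fit_constants(skeleton, x, y, n_consts=3)
```

Even on this toy problem, restarts matter: a single BFGS run from a random initialization frequently stalls in a local minimum of the oscillatory loss landscape.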
The leading algorithms for SR rely on genetic programming (GP). At each generation, a population of candidates is predicted, and the fittest ones are selected based on the data, and mutated to build the next generation. The algorithm iterates this procedure until a satisfactory level of accuracy is achieved.
While GP algorithms achieve good prediction accuracy, they are notably slow (see the Pareto plot of Fig. 1). Indeed, the manually predefined function space to search is generally vast, and each generation involves a costly call to the BFGS routine. Also, GP does not leverage past experience: every new problem is learned from scratch. This makes GP techniques inapplicable to situations where fast computation is needed, for instance in reinforcement learning and physics environments [garnelo2016towards, landajuela2021discovering].
Pre-training neural networks built for language modelling on large datasets of synthetic examples has recently been proposed for SR [valipour2021symbolicgpt, biggio2021neural]. These references follow the two-step procedure (predicting the skeleton then fitting the constants) inherited from GP. Once the model is pre-trained, the skeleton is predicted via a simple forward pass, and a single call to BFGS is needed, resulting in a significant speed-up compared to GP. However, these methods are not as accurate as state-of-the-art GP, and have so far been limited to low-dimensional functions. We argue that two reasons underlie their shortcomings.
First, skeleton prediction is an ill-posed problem that does not provide sufficient supervision: different instances of the same skeleton can have very different shapes, and instances of very different skeletons can be very close. Second, the loss function minimized by BFGS can be highly non-convex: even when the skeleton is perfectly predicted, the correct constants are not guaranteed to be found. For these reasons, we believe, and will show, that doing away with skeleton estimation as an intermediary step can greatly facilitate the task of SR for language models.
In this paper, we train Transformers over synthetic datasets to perform end-to-end (E2E) symbolic regression: solutions are predicted directly, without resorting to skeletons. To this effect, we leverage a hybrid symbolic-numeric vocabulary that uses both symbolic tokens for the operators and variables and numeric tokens for the constants. One can then refine the predicted constants by feeding them as an informed guess to BFGS, mitigating non-linear optimization issues. Finally, we introduce generation and inference techniques that allow our models to scale to larger problems, with substantially more input features than concurrent works.
Evaluated over the SRBench benchmark [la2021contemporary], our model significantly narrows the accuracy gap with state-of-the-art GP techniques, while providing several orders of magnitude of inference time speedup (see Fig. 1). We also demonstrate strong robustness to noise and extrapolation capabilities.
Finally, we will provide an online demonstration of our model at https://bit.ly/3niE5FS and will open-source our implementation as a scikit-learn compatible regressor at the following address: https://github.com/facebookresearch/symbolicregression.
SR is a challenging task that dates back several decades, with a large number of open-source and commercial software packages, and it has already been used to accelerate scientific discoveries [Archiga2021AcceleratingUO, Udrescu2021SymbolicPD, Butter2021BackTT]. Most popular frameworks for symbolic regression use GP [schmidt2011age, schmidt2009distilling, la2018learning, mcconaghy2011ffx, virgolin2021improving, de2021interaction, arnaldo2014multiple, virgolin2019linear, kommenda2020genetic] (see [la2021contemporary] for a recent review), but SR has also seen growing interest from the deep learning (DL) community, motivated by the fact that neural networks are good at identifying qualitative patterns.
Neural networks have been combined with GP algorithms, e.g. to simplify the original dataset [udrescu2020ai], or to propose a good starting distribution over mathematical expressions [petersen2019deep]. [martius2016extrapolation, sahoo2018learning] propose modifications to feed-forward networks to include interpretable components, i.e. replacing the usual activation functions by mathematical operators; however, these are hard to optimize and prone to numerical issues.
Language models, and especially Transformers [vaswani2017attention], have been trained over synthetic datasets to solve various mathematical problems: integration [lample2019deep], dynamical systems [charton2020learning], linear algebra [charton2021linear], formal logic [hahn2020teaching] and theorem proving [polu2020generative]. A few papers apply these techniques to symbolic regression: the aforementioned references [biggio2021neural, valipour2021symbolicgpt] train Transformers to predict function skeletons, while [d2022deep] infers one-dimensional recurrence relations in sequences of numbers.
The recently introduced SRBench [la2021contemporary] provides a benchmark for rigorous evaluation of SR methods, together with the SR and ML baselines that we compare to in this work.
1 Data generation
Our approach consists in pre-training language models on vast synthetic datasets. Each training example is a pair: a set of points (x, y) as the input, and a function f such that y = f(x) as the target. (We only consider functions from R^D into R; the general case can be handled as independent subproblems.) Examples are generated by first sampling a random function f, then a set of input values x, and computing y = f(x).
1.1 Generating functions
To sample functions , we follow the seminal approach of Lample and Charton [lample2019deep], and generate random trees with mathematical operators as internal nodes and variables or constants as leaves. The procedure is detailed below (see Table 3 in the Appendix for the values of parameters):
Sample the desired input dimension D of the function.
Sample the number of binary operators, then sample the operators from {add, sub, mul}. (Although division is technically a binary operator, it appears much less frequently than additions and multiplications in typical expressions [guimera2020bayesian], hence we replace it by the unary operator inv.)
Build a binary tree with those nodes, using the sampling procedure of [lample2019deep].
For each leaf in the tree, sample one of the variables x_1, …, x_D.
Sample the number of unary operators, then sample the operators from the list in Table 3, and insert them at random positions in the tree.
For each variable x_j and unary operator u, apply a random affine transformation, i.e. replace x_j by a x_j + b, and u by a u + b, with the coefficients (a, b) randomly sampled.
Note that since we require independent control on the number of unary operators (which is independent of D) and binary operators (which depends on D), we cannot directly sample a unary-binary tree as in [lample2019deep]. Note also that the first variables are sampled in ascending order to obtain the desired input dimension, which means functions with missing variables are never encountered; this is not an issue, as our model can always set the prefactor of a variable to zero. As discussed quantitatively in App. C, the number of possible skeletons as well as the random sampling of numerical constants guarantee that our model almost never sees the same function twice, and cannot simply perform memorization.
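The generation steps above can be sketched as follows. This is an illustrative toy version, not the paper's generator: the operator lists, sampling distributions, and the choice to insert unary operators at the root are simplifications, and expressions are represented as nested tuples.

```python
import numpy as np

rng = np.random.default_rng(0)

BINARY = ["add", "sub", "mul"]          # division replaced by the unary inv
UNARY = ["sin", "exp", "inv"]           # illustrative subset of unary operators

def random_tree(n_binary, dim):
    """Random binary tree with n_binary internal nodes and variable leaves."""
    if n_binary == 0:
        return ("var", rng.integers(dim))           # leaf: a variable x_j
    left = rng.integers(n_binary)                   # split remaining operators
    return (rng.choice(BINARY),
            random_tree(left, dim),
            random_tree(n_binary - 1 - left, dim))

def insert_unary(tree, n_unary):
    """Insert unary operators (at the root here, for simplicity)."""
    for _ in range(n_unary):
        tree = (rng.choice(UNARY), tree)
    return tree

def affine(tree):
    """Wrap every variable and unary operator in a random affine map a*(.)+b."""
    a, b = rng.normal(size=2)
    if tree[0] == "var":
        return ("affine", a, b, tree)
    if tree[0] in UNARY:
        return ("affine", a, b, (tree[0], affine(tree[1])))
    return (tree[0], affine(tree[1]), affine(tree[2]))

expr = affine(insert_unary(random_tree(3, dim=2), 1))
```

Because the affine coefficients are drawn from a continuous distribution, two samples of the same skeleton almost surely define different functions, which is the source of diversity discussed above.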
1.2 Generating inputs
For each function f, we sample input values x from the distribution described below, and compute the corresponding output values y = f(x). If any x is outside the domain of definition of f, or if any y is larger than a fixed threshold, the process is aborted, and we start again by generating a new function. Note that rejecting and resampling out-of-domain values of x, the obvious and cheaper alternative, would provide the model with additional information about f, by allowing it to learn its domain of definition.
To maximize the diversity of input distributions seen at training time, we sample our inputs from a mixture of distributions (uniform or gaussian), centered around random centroids. (For a sufficiently large number of centroids, such a mixture could in principle approximate any input distribution.) See App. A for some illustrations. Input samples are generated as follows:
Sample a number of clusters k and k weights, which are then normalized so that they sum to 1.
For each cluster, sample input points around a random centroid, then apply a random rotation sampled from the Haar distribution.
Finally, concatenate all the points obtained and whiten them by subtracting the mean and dividing by the standard deviation along each dimension.
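The sampling procedure above can be sketched as follows. The cluster shapes, scale ranges and distributions are illustrative assumptions, not the paper's exact settings; the Haar-distributed rotation uses SciPy's `special_ortho_group`.

```python
import numpy as np
from scipy.stats import special_ortho_group

def sample_inputs(n_points, dim, max_clusters=10, seed=0):
    """Sample n_points inputs from a random multimodal mixture, then whiten."""
    rng = np.random.default_rng(seed)
    k = rng.integers(1, max_clusters + 1)
    w = rng.uniform(size=k)
    w /= w.sum()                                        # normalized cluster weights
    counts = rng.multinomial(n_points, w)               # points per cluster
    clusters = []
    for n_i in counts:
        center = rng.normal(size=dim)
        scale = rng.uniform(0.1, 2.0, size=dim)
        x = center + scale * rng.normal(size=(n_i, dim))     # gaussian cluster
        if dim > 1:
            x = x @ special_ortho_group.rvs(dim)             # Haar-distributed rotation
        clusters.append(x)
    x = np.concatenate(clusters)
    return (x - x.mean(0)) / x.std(0)                   # whiten each dimension

x = sample_inputs(200, dim=3)
```

The final whitening guarantees that, whatever mixture was drawn, the model always sees inputs with zero mean and unit variance per dimension.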
Following [charton2021linear], we represent numbers in base-10 floating-point notation, round them to four significant digits, and encode them as sequences of three tokens: their sign, mantissa (between 0 and 9999), and exponent (from E-100 to E100).
To represent mathematical functions as sequences, we enumerate the trees in prefix order, i.e. direct Polish notation, as in [lample2019deep]: operators, variables and integers are represented as single autonomous tokens, and constants are encoded as explained above. For example, the expression cos(2.424 x) is encoded as [cos,mul,+,2424,E-3,x]. Note that the vocabulary of the decoder contains a mix of symbolic tokens (operators and variables) and numeric tokens, whereas that of the encoder contains only numeric tokens. (The embeddings of numeric tokens are not shared between the encoder and decoder.)
Below we describe our approach for end-to-end symbolic regression; please refer to Fig. 2 for an illustration.
Our model is provided N input points (x, y), each of which is represented as a short sequence of tokens. As N and the input dimension become large, this results in long input sequences, which challenge the quadratic complexity of Transformers. To mitigate this, we introduce an embedder to map each input point to a single embedding.
The embedder pads the empty input dimensions to zero, then feeds the resulting vector into a 2-layer fully-connected feedforward network (FFN) with ReLU activations, which projects down to the embedding dimension. (We explored various architectures for the embedder, but did not obtain any improvement; this does not appear to be a critical part of the model.) The resulting embeddings are then fed to the Transformer.
We use a sequence to sequence Transformer architecture [vaswani2017attention] with 16 attention heads and an embedding dimension of 512, containing a total of 86M parameters. Like [charton2021linear], we observe that the best architecture for this problem is asymmetric, with a deeper decoder: we use 4 layers in the encoder and 16 in the decoder. A notable property of this task is the permutation invariance of the input points. To account for this invariance, we remove positional embeddings from the encoder.
We optimize a cross-entropy loss with the Adam optimizer, warming up the learning rate over the first 10,000 steps, then decaying it as the inverse square root of the number of steps, following [vaswani2017attention]. We hold out a validation set of examples from the same generator, and train our models until the accuracy on the validation set saturates (around 50 epochs of 3M examples).
Input sequence lengths vary significantly with the number of points; to avoid wasteful padding, we batch together examples of similar lengths, ensuring that a full batch contains a minimum of 10,000 tokens. On 32 GPUs with 32GB memory each, one epoch is processed in about half an hour.
2.2 Inference tricks
In this section, we describe three tricks to improve the performance of our model at inference.
Previous language models for SR, such as [biggio2021neural], follow a skeleton approach: they first predict equation skeletons, then fit the constants with a non-linear optimization solver such as BFGS. In this paper, we follow an end-to-end (E2E) approach: predicting simultaneously the function and the values of the constants. However, we improve our results by adding a refinement step: fine-tuning the constants a posteriori with BFGS, initialized with our model predictions. (To avoid BFGS having to approximate gradients via finite differences, we provide the analytical expression of the gradient using sympytorch [sympytorch] and functorch [functorch2021].)
This results in a large improvement over the skeleton approach, as we show by training a Transformer to predict skeletons in the same experimental setting. The improvement has two causes: first, predicting the full formula provides better supervision, and helps the model predict the skeleton; second, the BFGS routine strongly benefits from the informed initial guess when recovering the constants. This is illustrated qualitatively in Table 1, and quantitatively in Table 2.
As described in Section 1.2, all input points presented to the model during training are whitened: their distribution is centered around the origin and has unit variance. To allow accurate prediction for input points with a different mean and variance, we introduce a scaling procedure at inference time. Let f be the function to be inferred, x the input points, and μ, σ their mean and standard deviation. As illustrated in Fig. 2, we pre-process the input data by replacing x by (x − μ)/σ. The model then predicts a function of the whitened variables, and we can recover an approximation of f by unscaling the variables in the predicted expression.
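The scaling trick can be sketched as follows, with a toy least-squares fit standing in for the Transformer: the "model" only ever sees whitened inputs, and the scaling is undone by substituting the whitened variables in the returned function.

```python
import numpy as np

def fit_on_whitened(x, y):
    """Whiten x, fit a model on the whitened inputs, return the unscaled f."""
    mu, sigma = x.mean(0), x.std(0)
    z = (x - mu) / sigma                              # unit-scale inputs for the model
    A = np.hstack([z, np.ones((len(z), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # toy "model": linear fit
    g = lambda zz: np.hstack([zz, np.ones((len(zz), 1))]) @ coef
    return lambda x_new: g((x_new - mu) / sigma)      # unscale the variables

x = np.linspace(500.0, 600.0, 100).reshape(-1, 1)     # far from unit scale
y = 3.0 * x[:, 0] + 100.0
f_hat = fit_on_whitened(x, y)
```

The returned function is accurate even though the raw inputs lie far outside the unit-scale range the inner model can handle, including at points beyond the training interval.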
This gives our model the desirable property of being insensitive to the scale of the input points: DL-based approaches to SR are known to fail when the inputs are outside the range of values seen during training [d2022deep, charton2021linear]. Note that here, the scale of the inputs translates to the scale of the constants in the function; although these coefficients are sampled in a fixed range during training, coefficients outside this range can be expressed as products of in-range constants.
Bagging and decoding
Since our model was trained on a bounded number of input points, it does not perform satisfactorily at inference when presented with many more. To take advantage of large datasets while accommodating memory constraints, we perform bagging: whenever the dataset exceeds this bound at inference, we randomly split it into bags of input points. (Smarter splits, e.g. diversity-preserving, could be envisioned, but were not considered here.)
For each bag, we apply a forward pass and generate function candidates via random sampling or beam search over the next-token distribution. As shown in App. E (Fig. 16), the more commonly used beam search strategy [https://doi.org/10.48550/arxiv.1606.02960] leads to much worse results than sampling, due to the lack of diversity induced by constant prediction (typical beams differ only marginally in the predicted constants). This provides us with a set of candidate solutions.
Our model's inference time has two sources: the forward passes described above on one hand (which can be parallelized up to the memory limits of the GPU), and the refinements of candidate functions on the other (which are CPU-based and could also be parallelized, although we did not consider this option here).
Since the number of candidates can become large, we rank them according to their error on all input points, discard redundant skeletons, and keep the best candidates for the refinement step. (Though these candidates are the best functions before refinement, there is no guarantee that they would remain the best after refinement, especially as the optimization is prone to spurious local optima.) To speed up the refinement, we use a subset of the input points for the optimization. The number of bags, of retained candidates, and of refinement points can be used as cursors in the speed-accuracy tradeoff; the experiments presented in Fig. 1 use one fixed setting of these parameters.
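The ranking and deduplication step can be sketched as follows. Here `candidates` pairs a hashable skeleton (a string with constants stripped) with a callable; the names and the toy candidate set are illustrative.

```python
import numpy as np

def select_candidates(candidates, x, y, k_best=10):
    """Rank candidates by error on the input points, drop duplicate
    skeletons, and keep the best k for the refinement step."""
    scored = []
    for skel, f in candidates:
        with np.errstate(all="ignore"):
            err = np.mean((f(x) - y) ** 2)
        if np.isfinite(err):                  # drop NaN/inf predictions
            scored.append((err, skel, f))
    scored.sort(key=lambda t: t[0])           # best first
    seen, best = set(), []
    for err, skel, f in scored:
        if skel not in seen:
            seen.add(skel)
            best.append((skel, f))
        if len(best) == k_best:
            break
    return best

x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x
candidates = [
    ("a*x",      lambda x: 1.9 * x),
    ("a*x",      lambda x: 2.0 * x),       # same skeleton, better constants
    ("a*x+b",    lambda x: 2.0 * x + 1.0),
    ("log(a*x)", lambda x: np.log(x)),     # NaN on negative inputs: filtered out
]
best = select_candidates(candidates, x, y, k_best=2)
```

Only the best instance of each skeleton survives, so the expensive BFGS refinement is never spent twice on the same functional form.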
In this section, we present the results of our model. We begin by studying in-domain accuracy, then present results on out-of-domain datasets.
3.1 In-domain performance
We report the in-domain performance of our models by evaluating them on a fixed validation set of 100,000 examples, generated as per Section 1. Validation functions are uniformly spread out over three difficulty factors: number of unary operators, number of binary operators, and input dimension. For each function, we evaluate the performance of the model when presented with a set of input points; prediction accuracy is evaluated on points sampled from a fresh instance of the multimodal distribution described in Section 1.2.
We assess the performance of our model using two popular metrics: the R² score [la2021contemporary] and the accuracy to tolerance τ [biggio2021neural, d2022deep]. The R² score is one minus the ratio of the residual sum of squares to the total sum of squares; the accuracy to tolerance is the indicator that the largest relative error over the test points is below τ.
R² is classically used in statistics, but it is unbounded below, hence a single bad prediction can cause the average over a set of examples to be extremely bad. To circumvent this, we set R² to 0 on pathological examples as in [la2021contemporary] (such examples occur in less than 1% of cases; note that predicting the constant function equal to the mean naturally yields an R² score of 0). The accuracy metric provides a better idea of the precision of the predicted expression, as it depends on a desired tolerance threshold. However, due to the presence of the max operator, it is sensitive to outliers, and hence to the number of points considered at test time (more points entails a higher risk of outliers). To circumvent this, we discard the 5% worst predictions, following [biggio2021neural].
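A sketch of the two metrics with the conventions described above (clipping R² at 0 for pathological predictions, discarding a fraction of worst points in the accuracy):

```python
import numpy as np

def r2_score(y, y_pred):
    """R2 score, set to 0 for pathological (very poor) predictions."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return max(0.0, 1.0 - ss_res / ss_tot)

def accuracy_to_tolerance(y, y_pred, tau=0.1, discard=0.05):
    """Indicator that the largest relative error is below tau, after
    discarding the `discard` fraction of worst points."""
    rel = np.sort(np.abs(y_pred - y) / np.abs(y))
    kept = rel[: max(1, int(np.ceil(len(rel) * (1 - discard))))]
    return float(kept[-1] <= tau)
```

The tolerance τ acts as the precision cursor: an expression can have a high R² yet fail the accuracy test at small τ if its constants are slightly off.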
End-to-end outperforms skeleton
| Model | R² | Acc (τ=0.1) | Acc (τ=0.01) | Acc (τ=0.001) |
|---|---|---|---|---|
| Skeleton + BFGS | 0.43 | 0.40 | 0.27 | 0.17 |
| E2E no BFGS | 0.62 | 0.51 | 0.27 | 0.09 |
| E2E + BFGS random init | 0.44 | 0.44 | 0.30 | 0.19 |
| E2E + BFGS model init | 0.68 | 0.61 | 0.44 | 0.29 |
In Table 2, we report the average in-domain results of our models. Without refinement, our E2E model outperforms the skeleton model trained under the same protocol in terms of low-precision prediction (the R² and large-tolerance accuracy metrics), but small errors in the prediction of the constants lead to lower performance at high precision (the small-tolerance accuracy metric). The refinement procedure alleviates this issue significantly, inducing a three-fold increase in high-precision accuracy while also boosting the other metrics.
Initializing BFGS with the constants estimated in the E2E phase plays a crucial role: with random initialization, the BFGS step actually degrades E2E performance. However, refinement with random initialization still achieves better results than the skeleton model: this suggests that the E2E model predicts skeletons better than the skeleton model does.
Fig. 4A,B,C presents an ablation over three indicators of formula difficulty (from left to right): number of unary operators, number of binary operators and input dimension. In all cases, increasing the factor of difficulty degrades performance, as one could expect. This may give the impression that our model does not scale well with the input dimension, but our model in fact scales very well on out-of-domain datasets compared to concurrent methods (see Fig. 15 of the Appendix).
Fig. 4D shows how performance depends on the number of input points fed to the model. In all cases, performance increases with more points, but much more significantly for the E2E models than for the skeleton model, demonstrating the importance of having many points to accurately predict the constants in the expression.
Extrapolation and robustness
In Fig. 4E, we examine the ability of our models to interpolate/extrapolate by varying the scale of the test points: instead of normalizing the test points to unit variance, we normalize them to a larger scale. As expected, performance degrades as the scale increases; however, the extrapolation performance remains decent even very far away from the inputs.
Finally, in Fig. 4F, we examine the effect of corrupting the targets with a multiplicative noise of variance : . The results reveal something interesting: without refinement, the E2E model is not robust to noise, and actually performs worse than the skeleton model at high noise. This shows how sensitive the Transformer is to the inputs when predicting constants. Refinement improves robustness significantly, but the initialization of constants to estimated values has less impact, since the prediction of constants is corrupted by the noise.
3.2 Out-of-domain generalization
We evaluate our method on the recently released benchmark SRBench [la2021contemporary]. Its repository contains a set of regression datasets from the Penn Machine Learning Benchmark (PMLB) [friedman2001greedy], in addition to open-source SR and ML baselines. The datasets consist of "ground-truth" problems, where the true underlying function is known, as well as "black-box" problems, which are more general regression datasets without an underlying ground truth.
We filter SRBench to keep only regression problems with continuous features; this results in a collection of regression datasets, split into black-box problems (a combination of real-world and noisy synthetic datasets), SR datasets from the Feynman database [udrescu2020ai], and SR datasets from the ODE-Strogatz database [strogatz:2000]. Each dataset is split into training data and test data, on which performance is evaluated.
The overall performance of our models is illustrated in the Pareto plot of Fig. 1, where we see that on both types of problems, our model achieves performance close to state-of-the-art GP models such as Operon with a fraction of the inference time. (Inference uses a single GPU for the forward pass of the Transformer.)
We provide more detailed results on Feynman problems in Fig. 5, where we additionally plot the formula complexity, i.e. the number of nodes in the mathematical tree (see App. E for similar results on black-box and Strogatz problems). Varying the noise applied to the targets, we see that our model displays similar robustness to state-of-the-art GP models.
While the average accuracy of our model is only ranked fourth, it outputs formulas with lower complexity than the top two models (Operon and SBP-GP), which is an important criterion for SR problems: see Fig. 11 for complexity-accuracy Pareto plots. To the best of our knowledge, our model is the first non-GP approach to achieve such competitive results for SR.
In this work, we introduced a competitive deep learning model for SR by using a novel numeric-symbolic approach. Through rigorous ablations, we showed that predicting the constants in an expression not only improves performance compared to predicting a skeleton, but can also serve as an informed initial condition for a solver to refine the value of the constants.
Our model outperforms previous deep learning approaches by a wide margin on SR benchmarks, and scales to larger dimensions. Yet, the dimensions considered here remain moderate: adapting to the truly high-dimensional setup is an interesting future direction, and will likely require qualitative changes in the data generation protocol. While our model narrows the gap between GP-based and DL-based SR, closing the gap also remains a challenge for future work.
This work opens up a whole new range of applications for SR in fields which require real-time inference. We hope that the methods presented here may also serve as a toolbox for many future applications of Transformers for symbolic tasks.
Appendix A Details on the training data
In Tab. 3, we provide the detailed set of parameters used in our data generator. The probabilities of the unary operators were selected to match the natural frequencies appearing in the Feynman dataset.
In Fig. 6, we show the statistics of the data generation. The number of expressions diminishes with the input dimension and the number of unary operators, because of the higher likelihood of generating out-of-domain inputs. One could easily make the distribution uniform by retrying until a valid example is found; however, we find empirically that having more easy examples than hard ones eases learning and provides better out-of-domain generalization, which is our ultimate goal.
In Fig. 7, we show some examples of the input distributions generated by our multimodal approach. Notice the diversity of shapes obtained by this procedure.
| Parameter | Value |
|---|---|
| Max input dimension | |
| Distribution of ( , ) | |
| Max binary operators | |
| Binary operators | add:1, sub:1, mul:1 |
| Max unary operators | |
| Min number of points | |
| Max number of points | |
| Max number of clusters | 10 |
Appendix B Attention maps
A natural question is whether self-attention based architectures are optimally suited for symbolic regression tasks. In Fig. 8, we show the attention maps produced by the encoder of our Transformer model, which contains 4 layers with 16 attention heads (we only keep the first 8 for the sake of space). To make the maps readable, we consider one-dimensional inputs and sort them in ascending order.
The attention plots demonstrate the complementarity of the attention heads. Some focus on specific regions of the input, whereas others are more spread out. Some are concentrated along the diagonal (focusing on neighboring points), whereas others are concentrated along the anti-diagonal (focusing on far-away points).
Most strikingly, the particular features of the functions studied clearly stand out in the attention plots. Focus, for example, on the 7th head of layer 2. For the exponential function, it focuses on the extreme points (near -1 and 1); for the inverse function, it focuses on the singularity around the origin; for the sine function, it reflects the periodicity, with evenly spaced vertical lines. The same phenomenology can be observed across several other heads.
Appendix C Does memorization occur?
It is natural to ask the following question: given the large amount of data seen during training, is our model simply memorizing the training set? Answering this question involves computing the number of possible functions which can be generated. To estimate this number, calculating the number of possible skeletons is insufficient, since a given skeleton can give rise to very different functions according to the sampling of the constants, and even for a given choice of constants, the input points can be sampled in many different ways.
Nonetheless, we provide the lower bound as a function of the number of nodes in Fig. 9, using the equations provided in [lample2019deep]. For small expressions (up to four operators), the number of possible expressions is lower than or comparable to the number of expressions encountered during training, hence one cannot exclude the possibility that some expressions were seen several times during training, albeit with different constants and input points. However, for larger expressions, the number of possibilities is much larger, and one can safely assume that the expressions encountered at test time have not been seen during training.
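For intuition, an even cruder lower bound than the one plotted can be computed by hand from the Catalan numbers, which count binary tree shapes. This sketch (not the exact formula of [lample2019deep]) ignores unary operators, constants and symmetries, so it heavily undercounts the true diversity:

```python
from math import comb

def catalan(n):
    """Number of binary tree shapes with n internal nodes."""
    return comb(2 * n, n) // (n + 1)

def count_skeletons(n_binary, n_ops=3, dim=2):
    """Crude lower bound on skeletons with n_binary binary operators:
    tree shapes * operator choices * variable choices at the leaves."""
    return catalan(n_binary) * n_ops ** n_binary * dim ** (n_binary + 1)
```

Already at 10 binary operators with 3 operator types and 2 variables, this bound exceeds 10¹², before even accounting for unary operators or the continuous sampling of constants.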
Appendix D Additional in-domain results
Appendix E Additional out-of-domain results
In Fig. 11, we display a Pareto plot comparing accuracy and formula complexity on SRBench datasets.
In Fig. 12, we show the predictions of our model on the functions provided in [jin2020bayesian]. Our model gets all of them correct except for one.
In Fig. 13, we display the results of our model on the black-box problems of SRBench.
Each of the 14 datasets from the ODE-Strogatz benchmark is the trajectory of a 2-state system following a first-order ordinary differential equation (ODE). Therefore, the input data has a very particular, time-ordered distribution, which differs significantly from that seen at train time. Unsurprisingly, Fig. 14 shows that our model performs somewhat less well on this kind of data in comparison with GP-based methods.
Ablation on input dimension
In Fig. 15, we show how the performance of our model depends on the dimensionality of the inputs on Feynman and black-box datasets.
Ablation on decoding strategy
In Fig. 16, we display the difference in performance using two decoding strategies.