End-to-end symbolic regression with transformers

by   Pierre-Alexandre Kamienny, et al.

Symbolic regression, the task of predicting the mathematical expression of a function from the observation of its values, is a difficult task which usually involves a two-step procedure: predicting the "skeleton" of the expression up to the choice of numerical constants, then fitting the constants by optimizing a non-convex loss function. The dominant approach is genetic programming, which evolves candidates by iterating this subroutine a large number of times. Neural networks have recently been tasked to predict the correct skeleton in a single try, but remain much less powerful. In this paper, we challenge this two-step procedure, and task a Transformer to directly predict the full mathematical expression, constants included. One can subsequently refine the predicted constants by feeding them to the non-convex optimizer as an informed initialization. We present ablations to show that this end-to-end approach yields better results, sometimes even without the refinement step. We evaluate our model on problems from the SRBench benchmark and show that our model approaches the performance of state-of-the-art genetic programming with several orders of magnitude faster inference.



There are no comments yet.


page 5

page 16

page 18

page 20


Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression

We introduce in this paper a runtime-efficient tree hashing algorithm fo...

SymbolicGPT: A Generative Transformer Model for Symbolic Regression

Symbolic regression is the task of identifying a mathematical expression...

Predicting Friction System Performance with Symbolic Regression and Genetic Programming with Factor Variables

Friction systems are mechanical systems wherein friction is used for for...

Solving the Exponential Growth of Symbolic Regression Trees in Geometric Semantic Genetic Programming

Advances in Geometric Semantic Genetic Programming (GSGP) have shown tha...

Transformation-Interaction-Rational Representation for Symbolic Regression

Symbolic Regression searches for a function form that approximates a dat...

Deep Symbolic Regression for Recurrent Sequences

Symbolic regression, i.e. predicting a function from the observation of ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.



Inferring mathematical laws from experimental data is a central problem in natural science; having observed a variable at points , it implies finding a function such that for all . Two types of approaches exist to solve this problem. In parametric statistics (PS), the function

is defined by a small number of parameters that can directly be estimated from the data. On the other hand,

machine learning

(ML) techniques such as decision trees and neural networks select

from large families of non-linear functions by minimizing a loss over the data. The latter relax the assumptions about the underlying law, but their solutions are more difficult to interpret, and tend to overfit small experimental data sets, yielding poor extrapolation performance.

Symbolic regression (SR) stands as a middle ground between PS and ML approaches: is selected from a large family of functions, but is required to be defined by an interpretable analytical expression. It has already proved extremely useful in a variety of tasks such as inferring physical laws [udrescu2020ai, Cranmer2020DiscoveringSM].

SR is usually performed in two steps. First, predicting a “skeleton”, a parametric function using a pre-defined list of operators – typically, the basic operations () and functions (, etc.). It determines the general shape of the law up to a choice of constants, e.g. . Then, the constants in the skeleton () are estimated using optimization techniques, typically the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS).

The leading algorithms for SR rely on genetic programming (GP). At each generation, a population of candidates is predicted, and the fittest ones are selected based on the data, and mutated to build the next generation. The algorithm iterates this procedure until a satisfactory level of accuracy is achieved.

While GP algorithms achieve good prediction accuracy, they are notably slow (see the Pareto plot of Fig. 1

). Indeed, the manually predefined function space to search is generally vast, and each generation involves a costly call to the BFGS routine. Also, GP does not leverage past experience: every new problem is learned from scratch. This makes GP techniques inapplicable to situations where fast computation is needed, for instance in reinforcement learning and physics environments 

[garnelo2016towards, landajuela2021discovering].

Pre-training neural networks built for language modelling on large datasets of synthetic examples has recently been proposed for SR [valipour2021symbolicgpt, biggio2021neural]. These references follow the two-step procedure (predicting the skeleton then fitting the constants) inherited from GP. Once the model is pre-trained, the skeleton is predicted via a simple forward pass, and a single call to BFGS is needed, thus resulting in a significant speed-up compared to GP. However, these methods are not as accurate as state-of-the-art GP, and have so far been limited to low-dimensional functions (). We argue that two reasons underlie their shortcomings.

First, skeleton prediction is an ill-posed problem that does not provide sufficient supervision: different instances of the same skeleton can have very different shapes, and instances of very different skeletons can be very close. Second, the loss function minimized by BFGS can be highly non-nonconvex: even when the skeleton is perfectly predicted, the correct constants are not guaranteed to be found. For these reasons, we believe, and will show, that doing away with skeleton estimation as a intermediary step can greatly facilitate the task of SR for language models.

Figure 1: Our model outperforms previous DL-based methods and offers at least an order of magnitude inference speedup compared to SOTA GP-based methods. Pareto plot comparing the average test performance and inference time of our models with baselines provided by the SRbench benchmark [la2021contemporary], both on Feynman SR problems [udrescu2020ai] and black-box regression problems. We use colors to distinguish three families of models: deep-learning based SR, genetic programming-based SR and classic machine learning methods (which do not provide an interpretable solutions). A similar Pareto plot against formula complexity is provided in Fig. 11.


In this paper, we train Transformers over synthetic datasets to perform end-to-end (E2E) symbolic regression: solutions are predicted directly, without resorting to skeletons. To this effect, we leverage a hybrid symbolic-numeric vocabulary, that uses both symbolic tokens for the operators and variables and numeric tokens for the constants. One can then perform a refinement of the predicted constants by feeding them as informed guess to BFGS, mitigating non-linear optimization issues. Finally, we introduce generation and inference techniques that allow our models to scale to larger problems: up to input features against in concurrent works.

Evaluated over the SRBench benchmark [la2021contemporary], our model significantly narrows the accuracy gap with state-of-the-art GP techniques, while providing several orders of magnitude of inference time speedup (see Fig. 1). We also demonstrate strong robustness to noise and extrapolation capabilities.

Finally, we will provide an online demonstration of our model at https://bit.ly/3niE5FS

and will open-source our implementation as a Scikit-learn compatible regressor at the following address:


Related work

SR is a challenging task that traces back from a few decades ago, with a large number of open-source and commercial softwares, and has already been used to accelerate scientific discoveries [Archiga2021AcceleratingUO, Udrescu2021SymbolicPD, Butter2021BackTT]. Most popular frameworks for symbolic regression use GP [schmidt2011age, schmidt2009distilling, la2018learning, mcconaghy2011ffx, virgolin2021improving, de2021interaction, arnaldo2014multiple, virgolin2019linear, kommenda2020genetic] (see [la2021contemporary] for a recent review), but SR has also seen growing interest from the Deep Learning (DL) community, motivated by the fact that neural networks are good at identifying qualitative patterns.

Neural networks have been combined with GP algorithms, e.g. to simplify the original dataset [udrescu2020ai], or to propose a good starting distribution over mathematical expressions[petersen2019deep]. [martius2016extrapolation, sahoo2018learning]

propose modifications to feed-forward networks to include interpretable components, i.e. replacing usual activation functions by operators such as

, however these are hard to optimize and prone to numerical issues.

Language models, and especially Transformers [vaswani2017attention], have been trained over synthetic datasets to solve various mathematical problems: integration [lample2019deep], dynamical systems [charton2020learning], linear algebra [charton2021linear], formal logic [hahn2020teaching] and theorem proving [polu2020generative]. A few papers apply these techniques to symbolic regression: the aforementioned references [biggio2021neural, valipour2021symbolicgpt] train Transformers to predict function skeletons, while [d2022deep] infers one-dimensional recurrence relations in sequences of numbers.

The recently introduced SRBench [la2021contemporary] provides a benchmark for rigorous evaluation of SR methods, in addition to SR methods and ML baselines which we will compare to in this work.

1 Data generation

Our approach consists in pre-training language models on vast synthetic datasets. Each training example is a pair: a set of points as the input, and a function such that as the target111We only consider functions from into ; the general case can be handled as independent subproblems. Examples are generated by first sampling a random function , then a set of input values in , and computing .

1.1 Generating functions

To sample functions , we follow the seminal approach of Lample and Charton [lample2019deep], and generate random trees with mathematical operators as internal nodes and variables or constants as leaves. The procedure is detailed below (see Table 3 in the Appendix for the values of parameters):

  1. [leftmargin=1cm,noitemsep]

  2. Sample the desired input dimension of the function from .

  3. Sample the number of binary operators from then sample operators from 222Note that although the division operation is technically a binary operator, it appears much less frequently than additions and multiplications in typical expressions [guimera2020bayesian], hence we replace it by the unary operator ..

  4. Build a binary tree with those nodes, using the sampling procedure of [lample2019deep].

  5. For each leaf in the tree, sample one of the variables , .

  6. Sample the number of unary operators from then sample operators from the list in Table 3, and insert them at random positions in the tree.

  7. For each variable and unary operator , apply a random affine transformation, i.e. replace by , and by , with sampled from .

Note that since we require independent control on the number of unary operators (which is independent of ) and binary operators (which depends on ), we cannot directly sample a unary-binary tree as in [lample2019deep]. Note also that the first variables are sampled in ascending order to obtain the desired input dimension, which means functions with missing variables such as are never encountered; this is not an issue as our model can always set the prefactor of to zero. As discussed quantitatively in App. C, the number of possible skeletons as well as the random sampling of numerical constants guarantees that our model almost never sees the same function twice, and cannot simply perform memorization.

Figure 2: Sketch of our model. During training, the inputs are all whitened. At inference, we whiten them as a pre-processing step; the predicted function must then be unscaled to account for the whitening.

1.2 Generating inputs

For each function , we sample input values from the distribution described below, and compute the corresponding output values . If any is outside the domain of definition of or if any is larger , the process is aborted, and we start again by generating a new function. Note that rejecting and resampling out-of-domain values of , the obvious and cheaper alternative, would provide the model with additional information about , by allowing it to learn its domain of definition.

To maximize the diversity of input distributions seen at training time, we sample our inputs from a mixture of distributions (uniform or gaussian), centered around random centroids333For , such a mixture could in principe approximate any input distribution., see App. A for some illustrations at . Input samples are generated as follows:

  1. [leftmargin=1cm, noitemsep]

  2. Sample a number of clusters and weights , which are then normalized so that .

  3. For each cluster , sample a centroid

    , a vector of

    variances and a distribution shape (gaussian or uniform) .

  4. For each cluster , sample input points from then apply a random rotation sampled from the Haar distribution.

  5. Finally, concatenate all the points obtained and whiten

    them by substracting the mean and dividing by the standard deviation along each dimension.

1.3 Tokenization

Following [charton2021linear], we represent numbers in base floating-point notation, round them to four significant digits, and encode them as sequences of tokens: their sign, mantissa (between 0 and 9999), and exponent (from E-100 to E100).

To represent mathematical functions as sequences, we enumerate the trees in prefix order, i.e. direct Polish notation, as in [lample2019deep]: operators and variables and integers are represented as single autonomous tokens, and constants are encoded as explained above.

For example, the expression is encoded as [cos,mul,+,2424,E-3,x]. Note that the vocabulary of the decoder contains a mix of symbolic tokens (operators and variables) and numeric tokens, whereas that of the encoder contains only numeric tokens444The embeddings of numeric tokens are not shared between the encoder and decoder..

2 Methods

Below we describe our approach for end-to-end symbolic regression; please refer to Fig. 2 for an illustration.

2.1 Model


Our model is provided input points , each of which is represented as tokens of dimension . As and become large, this results in long input sequences (e.g. tokens for and ), which challenge the quadratic complexity of Transformers. To mitigate this, we introduce an embedder to map each input point to a single embedding.

The embedder pads the empty input dimensions to

, then feeds the

-dimensional vector into a 2-layer fully-connected feedforward network (FFN) with ReLU activations, which projects down to dimension

555We explored various architectures for the embedder, but did not obtain any improvement; this does not appear to be a critical part of the model. The resulting embeddings of dimension are then fed to the Transformer.


We use a sequence to sequence Transformer architecture [vaswani2017attention] with 16 attention heads and an embedding dimension of 512, containing a total of 86M parameters. Like [charton2021linear], we observe that the best architecture for this problem is asymmetric, with a deeper decoder: we use 4 layers in the encoder and 16 in the decoder. A notable property of this task is the permutation invariance of the input points. To account for this invariance, we remove positional embeddings from the encoder.

As shown in Fig. 3 and detailed in App. B, the encoder captures the most distinctive features of the functions considered, such as critical points and periodicity, and blends a mix of short-ranged heads focusing on local details with long-ranged heads which capture the global shape of the function.

Figure 3: Attention heads reveal intricate mathematical analysis. We considered the expression , with input points sampled between and (red dots; the y-axis is arbitrary). We plotted the attention maps of a few heads of the encoder, which are matrices where the element represents the attention between point and point . Notice that heads 2, 3 and 4 of the second layer analyze the periodicity of the function in a Fourier-like manner.


We optimize a cross-entropy loss with the Adam optimizer, warming up the learning rate from to over the first 10,000 steps, then decaying it as the inverse square root of the number of steps, following [vaswani2017attention]. We hold out a validation set of

examples from the same generator, and train our models until the accuracy on the validation set saturates (around 50 epochs of 3M examples).

Input sequence lengths vary significantly with the number of points ; to avoid wasteful padding, we batch together examples of similar lengths, ensuring that a full batch contains a minimum of 10,000 tokens. On 32 GPU with 32GB memory each, one epoch is processed in about half an hour.

2.2 Inference tricks

In this section, we describe three tricks to improve the performance of our model at inference.


Model Function
Skeleton + BFGS
E2E + BFGS random init
E2E + BFGS model init
Table 1: The importance of an end-to-end model with refinement. The skeleton approach recovers an incorrect skeleton. The E2E approach predicts the right skeleton. Refinement worsens original prediction when randomly initialized, and yields the correct result when initialized with predicted constants.

Previous language models for SR, such as [biggio2021neural], follow a skeleton approach: they first predict equation skeletons, then fit the constants with a non-linear optimisation solver such as BFGS. In this paper, we follow an end-to-end (E2E) approach: predicting simultaneously the function and the values of the constants. However, we improve our results by adding a refinement step: fine-tuning the constants a posteriori with BFGS, initialized with our model predictions666To avoid BFGS having to approximate gradients via finite differences, we provide the analytical expression of the gradient using sympytorch [sympytorch] and functorch [functorch2021]..

This results in a large improvement over the skeleton approach, as we show by training a Transformer to predict skeletons in the same experimental setting. The improvement comes from two reasons: first, prediction of the full formula provides better supervision, and helps the model predict the skeleton; second, the BFGS routine strongly benefits from the informed initial guess, which helps the model predict the constants. This is illustrated qualitatively in Table 1, and quantitatively in Table 2.


As described in Section 1.2, all input points presented to the model during training are whitened: their distribution is centered around the origin and has unit variance. To allow accurate prediction for input points with a different mean and variance, we introduce a scaling procedure at inference time. Let the function to be inferred, be the input points, and . As illustrated in Fig. 2 we pre-process the input data by replacing by . The model then predicts , and we can recover an approximation of by unscaling the variables in .

This gives our model the desirable property to be insensitive to the scale of the input points: DL-based approaches to SR are known to fail when the inputs are outside the range of values seen during training [d2022deep, charton2021linear]. Note that here, the scale of the inputs translates to the scale of the constants in the function ; although these coefficients are sampled in during training, coefficients outside can be expressed by multiplication of constants in .

Bagging and decoding

Since our model was trained on input points, it does not perform satisfactorily at inference when presented with more than input points. To take advantage of large datasets while accommodating memory constraints, we perform bagging: whenever is larger than at inference, we randomly split the dataset into bags of input points777Smarter splits, e.g. diversity-preserving, could be envisioned, but were not considered here..

For each bag, we apply a forward pass and generate function candidates via random sampling or beam search using the next token distribution. As shown in App. E (Fig. 16), the more commonly used beam search [https://doi.org/10.48550/arxiv.1606.02960] strategy leads to much less good results than sampling due to the lack of diversity induced by constant prediction (typical beams will look like ). This provides us with a set of candidate solutions.

Inference time

Our model inference speed has two sources: the forward passes described above on one hand (which can be parallelized up to memory limits of the GPU), and the refinements of candidate functions on the other (which are CPU-based and could also be parallelized, although we did not consider this option here).

Since can become large, we rank candidate functions (according to their error on all input points), get rid of redundant skeleton functions and keep the best candidates for the refinement step888Though these candidates are the best functions without refinement, there are no guarantees that these would be the best after refinement, especially as optimization is particularly prone to spurious local optimas.. To speed up the refinement, we use a subset of at most input points for the optimization. The parameters , and can be used as cursors in the speed-accuracy tradeoff: in the experiments presented in Fig. 1, we selected , , .

3 Results

In this section, we present the results of our model. We begin by studying in-domain accuracy, then present results on out-of-domain datasets.

3.1 In-domain performance

We report the in-domain performance of our models by evaluating them on a fixed validation set of 100,000 examples, generated as per Section 1. Validation functions are uniformly spread out over three difficulty factors: number of unary operators, binary operators, and input dimension. For each function, we evaluate the performance of the model when presented input points , and prediction accuracy is evaluated on points sampled from a fresh instance of the multimodal distribution described in Section 1.2.

We assess the performance of our model using two popular metrics: -score [la2021contemporary] and accuracy to tolerance  [biggio2021neural, d2022deep]:


where is the indicator function.

is classically used in statistics, but it is unbounded, hence a single bad prediction can cause the average over a set of examples to be extremely bad. To circumvent this, we set upon pathological examples as in [la2021contemporary](such examples occur in less that 1% of cases)999Note that predicting the constant function naturally yields an score of 0.. The accuracy metric provides a better idea of the precision of the predicted expression as it depends on a desired tolerance threshold. However, due to the presence of the

operator, it is sensitive to outliers, and hence to the number of points considered at test time (more points entails a higher risk of outlier). To circumvent this, we discard the 5% worst predictions, following 


End-to-end outperforms skeleton

Skeleton + BFGS 0.43 0.40 0.27 0.17
E2E no BFGS 0.62 0.51 0.27 0.09
E2E + BFGS random init 0.44 0.44 0.30 0.19
E2E + BFGS model init 0.68 0.61 0.44 0.29
Table 2: Our approach outperforms the skeleton approach. Metrics are computed over the examples of the evaluation set.

In Table 2, we report the average in-domain results of our models. Without refinement, our E2E model outperforms the skeleton model trained under the same protocol in terms of low precision prediction ( and metrics), but small errors in the prediction of the constants lead to lower performance at high precision ( metric). The refinement procedure alleviates this issue significantly, inducing a three-fold increase in while also boosting other metrics.

Initializing BFGS with the constants estimated in the E2E phase plays a crucial role: with random initialization, the BFGS step actually degrades E2E performance. However, refinement with random initialization still achieves better results than the skeleton model: this suggests that the E2E model predicts skeletons better that the skeleton model.


Fig. 4A,B,C presents an ablation over three indicators of formula difficulty (from left to right): number of unary operators, number of binary operators and input dimension. In all cases, increasing the factor of difficulty degrades performance, as one could expect. This may give the impression that our model does not scale well with the input dimension, but we show that our model scales in fact very well on out-of-domain datasets compared to concurrent methods (see Fig. 15 of the Appendix).

Fig. 4D shows how performance depends on the number of input points fed to the model, . In all cases, performance increases, but much more signicantly for the E2E models than for the skeleton model, demonstrating the importance of having a lot of data to accurately predict the constants in the expression.

Figure 4: Ablation over the function difficulty (top row) and input difficulty (bottom row). We plot the accuracy at (Eq. 1), see App. D for the score. We distinguish four models: skeleton, E2E without refinement, E2E with refinement from random guess and E2E with refinement. A: number of unary operators. B: number of binary operators. C: input dimension. D: Low-resource performance, evaluated by varying the number of input points. E: Extrapolation performance, evaluated by varying the variance of the inputs. F: Robustness to noise, evaluated by varying the multiplicative noise added to the labels.

Extrapolation and robustness

In Fig. 4

E, we examine the ability of our models to interpolate/extrapolate by varying the scale of the test points: instead of normalizing the test points to unit variance, we normalize them to a scale

. As expected, performance degrades as we increase , however the extrapolation performance remains decent even very far away from the inputs ().

Finally, in Fig. 4F, we examine the effect of corrupting the targets with a multiplicative noise of variance : . The results reveal something interesting: without refinement, the E2E model is not robust to noise, and actually performs worse than the skeleton model at high noise. This shows how sensitive the Transformer is to the inputs when predicting constants. Refinement improves robustness significantly, but the initialization of constants to estimated values has less impact, since the prediction of constants is corrupted by the noise.

3.2 Out-of-domain generalization

We evaluate our method on the recently released benchmark SRBench[la2021contemporary]. Its repository contains a set of regression datasets from the Penn Machine Learning Benchmark (PMLB)[friedman2001greedy] in addition to open-source SR and ML baselines. The datasets consist in "ground-truth" problems where the true underlying function is known, as well as "black-box" problems which are more general regression datasets without an underlying ground truth.

We filter out problems from SRBench to only keep regression problems with with continuous features; this results in regression datasets, splitted into black-box problems (combination of real-world and noisy, synthetic datasets), SR datasets from the Feynman [udrescu2020ai] and SR datasets from the ODE-Strogatz [strogatz:2000] databases. Each dataset is split into training data and test data, on which performance is evaluated.

The overall performance of our models is illustrated in the Pareto plot of Fig. 1, where we see that on both types of problems, our model achieves performance close to state-of-the-art GP models such as Operon with a fraction of the inference time101010Inference uses a single GPU for the forward pass of the Transformer.

. Impressively, our model outperforms all classic ML methods (e.g. XGBoost and Random Forests) on real-world problems with a lower inference time, and while outputting an interpretable formula.

We provide more detailed results on Feynman problems in Fig. 5, where we additionally plot the formula complexity, i.e. the number of nodes in the mathematical tree (see App. E for similar results on black-box and Strogatz problems). Varying the noise applied to the targets noise, we see that our model displays similar robustness to state-of-the-art GP models.

While the average accuracy or our model is only ranked fourth, it outputs formulas with lower complexity than the top 2 models (Operon and SBP-GP), which is an important criteria for SR problems: see App. 11 for complexity-accuracy Pareto plots. To the best of our knowledge, our model is the first non-GP approach to achieve such competitive results for SR.

Figure 5: Our model presents strong accuracy-speed-complexity tradeoffs, even in presence of noise. Results are averaged over all 119 Feynman problems, for random seeds and three target noises each as shown in the legend. The accuracy is computed as the fraction of problems for which the score on test examples is above 0.99. Models are ranked according to the accuracy averaged over all target noise.


In this work, we introduced a competitive deep learning model for SR by using a novel numeric-symbolic approach. Through rigorous ablations, we showed that predicting the constants in an expression not only improves performance compared to predicting a skeleton, but can also serve as an informed initial condition for a solver to refine the value of the constants.

Our model outperforms previous deep learning approaches by a margin on SR benchmarks, and scales to larger dimensions. Yet, the dimensions considered here remain moderate (): adapting to the truly high-dimensional setup is an interesting future direction, and will likely require qualitative changes in the data generation protocol. While our model narrows the gap between GP and DL based SR, closing the gap also remains a challenge for future work.

This work opens up a whole new range of applications for SR in fields which require real-time inference. We hope that the methods presented here may also serve as a toolbox for many future applications of Transformers for symbolic tasks.

Appendix A Details on the training data

In Tab. 3

we provide the detailed set of parameters used in our data generator. The probabilities of the unary operators were selected to match the natural frequencies appearing in the Feynman dataset.

In Fig. 6, we show the statistics of the data generation.The number of expressions diminishes with the input dimension and number of unary operators because of the higher likelihood of generating out-of-domain inputs. One could easily make the distribution uniform by enforcing to retry as long as a valid example is not found, however we find empirically that having more easy examples than hard ones eases learning and provides better out-of-domain generalization, which is our ultimate goal.

In Fig. 7, we show some examples of the input distributions generated by our multimodal approach. Notice the diversity of shapes obtained by this procedure.

Parameter Description Value
Max input dim
Distrib of (,)
sign ,
mantissa ,
Max binary ops
Binary operators add:1, sub:1, mul:1
Max unary ops
Unary operators
inv:5, abs:1, sqr:3, sqrt:3,
sin:1, cos:1, tan:0.2, atan:0.2,
log:0.2, exp:1
Min number of points
Max number of points
Max num clusters 10
Table 3: Parameters of our generator.
Figure 6: Statistics of the synthetic data. We calculated the latter on generated examples.
Figure 7: Diversity of the input distributions generated by the multimodal approach. Here we show distributions obtained for .

Appendix B Attention maps

A natural question is whether self-attention based architectures are optimally suited for symbolic regression tasks. In Fig. 8, we show the attention maps produced by the encoder of our Transformer model, which contains 4 layers avec 16 attention heads (we only keep the first 8 for the sake of space). In order to make the maps readable, we consider one-dimensional inputs and sort them in ascending order.

The attention plots demonstrate the complementarity of the attention heads. Some focus on specific regions of the input, whereas others are more spread out. Some are concentrated along the diagonal (focusing on neighboring points), whereas others are concentrated along the anti-diagonal (focusing on far-away points.

Most strikingly, the particular features of the functions studied clearly stand out in the attention plots. Focus, for example, on the 7th head of layer 2. For the exponential function, it focuses on the extreme points (near -1 and 1); for the inverse function, it focuses on the singularity around the origin; for the sine function, it reflects the periodicity, with evenly spaces vertical lines. The same phenomenology can be acrossed is several other heads.

Figure 8: Attention maps reveal distinctive features of the functions considered. We presented the model 1-dimensional functions with 100 input points sorted in ascending order, in order to better visualize the attention. We plotted the self-attention maps of the first 8 (out of 16) heads of the Transformer encoder, across all four layers. We see very distinctive patterns appears: exploding areas for the exponential, the singularity at zero for the inverse function, and the periodicity of the sine function.

Appendix C Does memorization occur?

It is natural to ask the following question: due to the large amount of data seen during training, is our model simply memorizing the training set ? Answering this question involves computing the number of possible functions which can be generated. To estimate this number, calculating the number of possible skeleton is insufficient, since a given skeleton can give rise to very different functions according to the sampling of the constants, and even for a given choice of the constants, the input points can be sampled in many different ways.

Nonetheless, we provide the lower bound as a function of the number of nodes in Fig. 9, using the equations provided in [lample2019deep]. For small expressions (up to four operators), the number of possible expressions is lower or similar to than the number of expressions encountered during training, hence one cannot exclude the possibility that some expressions were seen several times during training, but with different realizations due to the initial conditions. However, for larger expressions, the number of possibilities is much larger, and one can safely assume that the expressions encountered at test time have not been seen during training.

Figure 9: Our models only see a small fraction of the possible expressions during training. We report the number of possible skeletons for each number of operators. Even after a hundred epochs, our models have only seen a fraction of the possible expressions with more than 4 operators.

Appendix D Additional in-domain results

Fig. 10, we present a similar ablation as Fig. 4 of the main text but using the score as metric rather than accuracy.

Figure 10: Ablation over the function difficulty (top row) and input difficulty (bottom row). We plot the score (Eq. 1). A: number of unary operators. B: number of binary operators. C: input dimension. D: Low-resource performance, evaluated by varying the number of input points. E: Extrapolation performance, evaluated by varying the variance of the inputs. F: Robustness to noise, evaluated by varying the multiplicative noise added to the labels.

Appendix E Additional out-of-domain results


In Fig. 11, we display a Pareto plot comparing accuracy and formula complexity on SRBench datasets.

Figure 11: Complexity-accuracy pareto plot. Pareto plot comparing the average test performance and formula complexity of our models with baselines provided by the SRbench benchmark [la2021contemporary], both on Feynman SR problems [udrescu2020ai] and black-box regression problems. We use colors to distinguish three families of models: deep-learning based SR, genetic programming-based SR and classic machine learning methods (which do not provide an interpretable solution).

Jin benchmark

In Fig. 12, we show the predictions of our model on the functions provided in [jin2020bayesian]. Our model gets all of them correct except for one.

(a) Jin-1
(b) Jin-2
(c) Jin-3
(d) Jin-4
(e) Jin-5
(f) Jin-6
Figure 12: Illustration of our model on a few benchmark datasets from the litterature. We show the prediction of our model on six 2-dimensional datasets presented in [jin2020bayesian] and used as a comparison point in a few recent works [mundhenk2021symbolic]. The input points are marked as black crosses. Our model retrieves the correct expression in all but one of the cases: in Jin5, the prediction matches the input points correctly, but extrapolates badly.

Black-box datasets

In Fig. 13, we display the results of our model on the black-box problems of SRBench.

Figure 13: Performance metrics on black-box datasets.

Strogatz datasets

Each of the 14 datasets from the ODE-Strogatz benchmark is the trajectory of a 2-state system following a first-order ordinary differential equation (ODE). Therefore, the input data has a very particular, time-ordered distribution, which differs significantly from that seen at train time. Unsurprisingly, Fig. 

14 shows that our model performs somewhat less well to this kind of data in comparison with GP-based methods.

Figure 14: Performance metrics on Strogatz datasets.

Ablation on input dimension

In Fig. 15, we show how the performance of our model depends on the dimensionality of the inputs on Feynamn and black-box datasets.

(a) Black-box
(b) Feynman
Figure 15: Performance metrics on SRBench, separated by input dimension.

Ablation on decoding strategy

In Fig. 16, we display the difference in performance using two decoding strategies.

Figure 16: Median of our method without refinement on black-box datasets when , varying the number of decoded function samples. The beam search [https://doi.org/10.48550/arxiv.1606.02960] used in [biggio2021neural] leads to low-diversity candidates in our setup due to expressions differing only by small modifications of the coefficients.