1 Introduction
Mathematicians solve problems by relying on rules, correct derivations and proven methods of computation that are guaranteed to lead to a correct solution. Over time, they have developed a rich set of computational techniques that can be applied to many problems, and were said to be “unreasonably effective” (Wigner (1960)). Most of those computational techniques are not intuitive, they have to be derived from theory and applied by trained scientists or built into software libraries. They are the building blocks of every advanced computation.
Many recent studies showed that deep learning models can learn complex rules from large datasets, only from examples. In natural language processing, models learn to output grammatically correct sentences without prior knowledge of grammar and syntax
Radford et al. (2019), or to automatically map one language into another Bahdanau et al. (2014); Sutskever et al. (2014). In mathematics, deep learning models have been trained to perform logical inference Evans et al. (2018), SAT solving Selsam et al. (2018), basic arithmetic Kaiser and Sutskever (2015) and symbolic integration Lample and Charton (2020).In this paper, we investigate whether deep learning models can be trained to perform complex computations and to deduce the qualitative behavior of mathematical objects, without builtin mathematical knowledge. We consider three questions of higher mathematics: the local stability and controllability of differential systems, and the existence and behavior at infinity of solutions of partial differential equations. All three problems have been widely researched and have many applications outside of pure mathematics. They have known solutions that rely on advanced symbolic and computational techniques, from formal differentiation, Fourier transform, geometrical fullrank conditions, to function evaluation, matrix inversion, and computation of complex eigenvalues. Surprisingly, we find that neural networks can solve these problems with a very high accuracy, by simply looking at instances of problems and their solutions, while being totally unaware of the underlying theory. These results are unintuitive given the advanced numerical techniques required by the theory and the difficulty of neural networks to perform simple arithmetic tasks
Saxton et al. (2019); Trask et al. (2018); Zaremba and Sutskever (2014), suggesting that the model might be using a different approach than the known theory to correctly predict the output.After reviewing prior applications of deep learning to differential equations and symbolic computation, we introduce the three problems we consider, describe how we generate datasets, and detail how we train our models. Finally, we present our experiments and discuss their results.
2 Related work
Research on the application of neural networks to differential equations has mainly focused on two aspects. A first line of research investigates the relation between residual neural networks He et al. (2016)
and the different discretization schemes of Ordinary Differential Equations (ODEs). In particular, an ODE of the form
can be approximated by the Euler method with
which can be seen as a particular instantiation of a residual network: the chain of layers in a residual neural network represents an approximation of the ODE solution with the Euler method. Chen et al. (2018) used this observation to build a memory efficient architecture where the network depth can be adapted by simply modifying the discretization scheme (with a potentially infinite number of layers). Subsequent work by Lu et al. (2017) analyzed the connection between neural architectures and numerical differential equations. In particular, they showed that other categories of neural networks such as PolyNets Zhang et al. (2017) or FractalNets Larsson et al. (2016) could also be interpreted as different discretization schemes of ODEs, although these networks were not designed with differential equations in mind.
Although the majority of Partial Differential Equations (PDEs) do not have any explicit solution, their solutions can be approximated using numerical methods. These methods are particularly efficient for lowdimension problems, but they are based on the discretization of the input domain, which does not scale with the dimensionality of the problem. A second line of research studies the ability of neural networks to approximate the solution of PDEs. The idea relies on the universal approximation theorem, that states that any continuous function can be approximated by a neural network with one hidden layer Cybenko (1989); Hornik et al. (1990); Hornik (1991); Petersen and Voigtlaender (2018); Pinkus (1999). In particular, if a PDE does not have an explicit solution, then its solution can be approximated with a neural network Lagaris et al. (1998, 2000); Lee and Kang (1990); Rudd (2013); Sirignano and Spiliopoulos (2018). Unlike numerical methods, this approach is not as sensitive to the problem dimension.
Lample and Charton (2020) proposed several approaches to generate arbitrarily large datasets of functions with their integrals, and differential equations with their solutions. They found that a transformer model Vaswani et al. (2017) trained on millions of examples could outperform stateoftheart symbolic frameworks such as Mathematica or MATLAB WolframResearch (2019); MathWorks (2019) on a particular subset of equations. Their model was used to guess solutions, while verification (arguably a simpler task) was left to a symbolic framework Meurer et al. (2017). Arabshahi et al. (2018a, b) proposed to use neural networks to verify the solution of differential equations, and found that TreeLSTMs Tai et al. (2015) were better than sequential LSTMs Hochreiter and Schmidhuber (1997) at generalizing beyond the training distribution. Other approaches investigated the capacity of neural networks to perform arithmetic operations Kaiser and Sutskever (2015); Saxton et al. (2019); Trask et al. (2018) or to run short computer programs Zaremba and Sutskever (2014). More recently, Saxton et al. (2019) found that neural networks were good at solving arithmetic problems or at performing operations such as differentiation or polynomial expansion, but struggled on tasks like prime number decomposition or on primality tests that can typically not be guessed and require a significant number of steps to compute. Unlike problems like prime number factorization, the problems we address in this paper cannot be solved by simple algorithmic computations.
3 Differential systems and their stability
A differential system of degree is a system of equations of variables ,
Many problems can be set as differential systems. Special cases include nth order ordinary differential equations (letting , , … ), systems of coupled differential equations, and some particular partial differential equations (separable equations or equations with characteristics). Differential systems are one of the most studied areas of mathematical sciences. They are found in physics, mechanics, chemistry, biology, and economics as well as in pure mathematics. Most differential systems have no explicit solution. Therefore, mathematicians have studied the properties of their solutions, and first and foremost their stability, a notion of paramount importance in many engineering applications. Studies on stability began in the century with Euler, Lagrange, and then Dirichlet, with significant contributions by Poincaré and Lyapunov at the turn of the century. This will be the first problem we investigate.
3.1 Local stability
Let be an equilibrium point, that is, . If all solutions converge to when their initial positions at are close enough, the equilibrium is said to be locally stable (see Appendix B for a proper mathematical definition). This problem is well known, if is differentiable in , an answer is provided by the Spectral Mapping Theorem (SMT) (Coron, 2007, Theorem 10.10):
Theorem 3.1.
Let be the Jacobian matrix of in (the matrix of its partial derivatives relative to its variables). Let be the largest real part of its (complex) eigenvalues. If is positive, is an unstable equilibrium. If is negative, then is a locally stable equilibrium.
If is negative, close to , the solutions will converge to this equilibrium exponentially fast with decay rate . Predicting at a point for a given differential system is our first problem. Therefore, to measure the stability of a differential system at a point , we need to:

differentiate each function with respect to each variable, obtain the formal Jacobian

evaluate , the Jacobian in (a real or complex matrix)

calculate the eigenvalues of

return the speed of convergence of the system
3.2 Control theory
One of the lessons of the spectral mapping theorem is that instability is very common. In fact, unstable systems are plenty in nature (Lagrange points, epidemics, satellite orbits, etc.), and the idea of trying to control them through external variables comes naturally. This is the controllability problem. Control theory has a lot of practical applications, including space launch and the landing on the moon, the US Navy automated pilot, or recently autonomous vehicles Bernhard et al. (2017); Minorsky (1930); Funke et al. (2016). Formally, we are given a system
(1) 
where is the state of the system. We want to find a function , the control action, such that, beginning from a position at , we can reach a position at (see Appendix B). The first rigorous mathematical analysis of this problem was given by Maxwell (1868), but a turning point was reached in 1963, when Kalman gave a precise condition for a linear system Kalman et al. (1963), later adapted to nonlinear system Coron (2007) as follows:
Theorem 3.2 (Kalman condition).
Let and , if
(2) 
then the system is locally controllable around .
When this condition holds, a solution to the control problem that makes the system locally stable in is (c.f. Coron (2007); Kleinman (1970); Lukes (1968) and appendix C for key steps of the proof), where is the control feedback matrix:
(3) 
In the nonautonomous case, where (and and ) depends on , (2) becomes:
(4) 
where and . All these theorems make use of advanced mathematical results, such as the CayleyHamilton theorem, or LaSalle invariance principle. Learning them by predicting controllability and computing the control feedback matrix is our second problem. To measure whether the system is controllable at a point , we need to:

differentiate the system with respect to its internal variables, obtain the Jacobian

differentiate the system with respect to its control variables, obtain the matrix

evaluate and in

calculate the rank of , if , the system is controllable

(optionally) if , compute the control feedback matrix with (3)
A step by step derivation of this example is given in Section A of the appendix.
3.3 Stability of partial differential equations using Fourier Transform
Partial Differential Equations (PDEs) naturally appear when studying continuous phenomena (e.g. sound, electromagnetism, gravitation). Over such problems, ordinary differential systems are not sufficient. Like differential systems, PDEs seldom have explicit solutions, and studying their stability has many practical applications. It is also a much more difficult subject, where few general theorems exist. We consider linear PDEs of the form
(5) 
where , , and are time, position, and state. is a multiindex and are constants. Famous examples of such problems include the heat equation, transport equations or Schrodinger equation Evans (2010). We want to determine whether a solution of (5) exists for a given an initial condition , and if it tends to zero as . This is mathematically answered (see appendix C and Evans (2010); Bahouri et al. (2011) for similar arguments) by:
Proposition 3.1.
Given , the space of tempered distribution, there exists a solution if there exists a constant such that
(6) 
where is the Fourier transform of and is the Fourier polynomial associated with the differential operator . In addition, if , this solution goes to zero when .
Learning this proposition and predicting, given an input and , whether a solution exists, if so, whether it vanishes at infinite time, will be our third and last problem.
To predict whether our PDE has a solution under given initial conditions, and determine its behavior at infinity, we need to:

find the Fourier polynomial associated to

find the Fourier transform of

find the set of frequency for which

minimize on

output (0,0) if this minimum is infinite, (1,0) is finite and negative, (1,1) if finite and positive. (optionally) output
4 Datasets and models
To train our models, we generate datasets of problems and solutions. Since all problems have known solutions, we randomly sample problems and compute their solutions with mathematical software Virtanen et al. (2020); Meurer et al. (2017), using the techniques described in Section 3. For both the stability and controllability datasets, the problems are differential systems with equations and variables ( for stability, for autonomous controllability, and for nonautonomous controllability). Generating a system amounts to sampling random functions.
Following Lample and Charton (2020), we generate random functions by sampling unarybinary trees, and randomly selecting operators, variables and integers for their internal nodes and leaves. We use as operators, and variables or integers between and for our leaves. When generating functions with variables, we build trees with up to operators.
Generated trees are enumerated in prefix order (also known as normal Polish notation) and converted into a sequence of tokens compatible with seq2seq models. Integers are written in positional notation ( as the four token sequence [INT+, , , ]) and reals in decimal floating point representation ( as the sequence [FLOAT+, , DOT, , , e, INT, ]). Real numbers are rounded to a fixed number of significant digits. A derivation of the size of the problem space is provided in appendix D.
Local stability
Datasets for local stability include systems with to equations (in equal proportion). Functions that are not differentiable in (i.e. the point of equilibrium) and systems with a constant equation or lacking one of the variables (leading to a degenerate Jacobian) are eliminated during generation. The convergence speed at is expressed as a floating point decimal rounded to significant digits. We generate a dataset with over million systems.
Since many operators or their derivatives are undefined at zero, choosing the origin as would result in a lot of undefined Jacobians, slowing generation and biasing the dataset by reducing the frequency of operators like division, square root, or logarithms. Instead, we select with all coordinates equal to (denoted as ).
Control theory
Datasets for autonomous control include systems with to equations, and to variables (among which to control variables). As before, we eliminate systems with indefinite or degenerate Jacobians, or where one of the control variables is absent. We also skip functions with complex derivatives in , since these problems are less meaningful from a physical perspective (for this problem, or ). Controllability and the feedback matrix are calculated, and expressed either as a Boolean integer or as a matrix of floating point decimals.
Under these conditions, more than of systems are controllable. This makes learning controllability difficult, because an obvious but wrong solution (predicting that all models are controllable) yields high accuracy (). To mitigate this, we build a balanced dataset by oversampling noncontrollable solutions to achieve balance between controllable and noncontrollable systems. For feedback matrix prediction, we restrict generation to controllable systems. For the nonautonomous case, we generate systems with or equations. Overall, we generate datasets with more than million examples each: a balanced dataset for predicting autonomous controllability, a dataset of controllable systems with feedback matrices, and a dataset for nonautonomous controllability.
Stability of partial differential equations using Fourier Transform
We generate a differential operator and an initial condition . The operator is a polynomial in . is the product of functions with known Fourier transforms, and operators , with and . We calculate 1) the existence of solutions, 2) their behavior when , and 3) the set of frequencies, and express these three values as a sequence of two Booleans followed by floating point decimals. We create a dataset of over million examples.
Models and evaluation
In all experiments, we use a transformer architecture Vaswani et al. (2017) with attention heads. We train our models with the Adam optimizer Kingma and Ba (2014), a learning rate of , and follow the learning rate scheduler of Vaswani et al. (2017). We vary the dimension from to , and the number of layers from to . We use minibatches composed of 1024 systems, and train our models on 8 V100 GPUs with float16 operations to speed up training and reduce memory usage.
At the end of each epoch, evaluation is carried out on a heldout validation set of
examples. We ensure that test examples are never seen during training (given the size of the problem space, this never happens in practice). To evaluate a model output, we either compare it with the reference solution, or use a problemspecific metric.5 Experiments
5.1 Local stability
In these experiments, the model is given functions , where . It is trained to predict , the largest real part of the eigenvalues of their Jacobians at , corresponding to the convergence speed to the equilibrium. We consider that predictions are correct when they fall within of the ground truth.
An layer transformer model of dimensions correctly predicts in of cases. Performances range from for systems of degree , to for systems of degree . Overall, increasing the dimension and adding layers improves the performance. Table 1 summarizes results for different sets of parameters.
Prediction of to better accuracy can be achieved by training models on data rounded to , or significant digits, and measuring the number of exact predictions on the test sample. Overall, we predict with two significant digits in of test cases. Table 2 summarizes the results for different precisions (for transformers with layers and a dimensionality of ).
Degree 2  Degree 3  Degree 4  Degree 5  Degree 6  Overall  
layers, dim  
layers, dim  
layers, dim  
layers, dim  
layers, dim  
layers, dim  96.3  90.4  86.2  82.7  77.3  86.6 
Degree 2  Degree 3  Degree 4  Degree 5  Degree 6  Overall  

digits  
digits  
digits 
5.2 Control theory
In these experiments, the model is given functions ( in the nonautonomous case) and is trained on a qualitative task, predicting controllability in , and a numerical task, calculating feedback matrices for the control variables.
When predicting controllability in the autonomous case, transformers with dimensions and layers achieves accuracy. Performance decreases with the size of the system, from for equations to for equations. Surprisingly, even models with a dimensionality of 64 and only 1 or 2 layers achieve a reasonable performance, above .
Dimension 64  Dimension 128  Dimension 256  Dimension 512  

1 layers  
2 layers  
4 layers  
6 layers  97.4 
In the non autonomous case, transformers with dimensions and layers achieve accuracy, but using smaller models makes little difference in accuracy. Overall, this task is solved with a very high accuracy, even by very small models.
Dimension 64  Dimension 128  Dimension 256  Dimension 512  

1 layer  
2 layers  
4 layers  
6 layers  99.7 
On the feedback matrix prediction task, we train transformers with layers and a dimensionality of . We use two metrics to evaluate the performance: 1) prediction within of all coefficients in the target matrix, and 2) verifying that the model outputs a correct feedback matrix , i.e. that all eigenvalues in have negative real parts. The second metric makes more mathematical sense, as it checks that the model actually solves the control problem (like a differential equation, a feedback control problem can have many different solutions).
Using the first metric, of feedback matrices are predicted with less than error. Accuracy is for systems with equations, but drops fast as systems becomes larger. These results, although low, are well above chance level (less than ). However, with the second metric (i.e. the metric that actually matters mathematically), we achieve accuracy, a much better result. Accuracy decreases with system size, but even degree systems, with to feedback matrices, are correctly predicted of the time. Therefore, while the model fails to approximate to a satisfactory level, it does learn to predict correct solutions to the control problem to a high accuracy. This result is very surprising, as it suggests that a mathematical property characterizing feedback matrices might have been learned.
Degree 3  Degree 4  Degree 5  Degree 6  Overall  

Prediction within  
Correct feedback matrix  87.5  77.4  58.0  41.5  66.5 
5.3 Partial differential equations and Fourier Transform
In these experiments, the model is given a differential operator and an initial condition , and is trained to predict if a solution to exists and, if so, whether it converges to when . The space dimension (i.e. dimension of ) is between and .
In a first series of experiments the model is only trained to predict the existence and behavior of solutions, encoded as a sequence of two Booleans. Overall accuracy is 98.4%. In a second series, we introduce an auxiliary task by adding to the output the frequency bounds of . We observe that this auxiliary task significantly contributes to the stability of the model with respect to hyperparameters. In particular, without the auxiliary task, the model is very sensitive to the learning rate scheduling and often fails to converge to something better than random guessing. However, in case of convergence, the model reaches the same overall accuracy, with and without auxiliary task. Table 6 details the results for different space dimensions.
Space dimension for  Dim 2  Dim 3  Dim 4  Dim 5  Dim 6  Overall 

Accuracy 
5.4 Discussion
We studied five problems of advanced mathematics from widely researched areas of mathematical analysis. In three of them, we predict qualitative and theoretical features of differential systems. In two, we perform numerical computations. According to mathematical theory, solving these problems requires a combination of advanced techniques, symbolic and numerical, that seem unlikely to be learnable from examples. Yet, our model achieves more than accuracy on all qualitative tasks, and between and on numerical computations. Such high performances over difficult mathematical tasks may come as a surprise.
One way to generate datasets of problems with their solutions consists in sampling the solution first, and deriving an associated problem. For instance, pairs of functions with their integrals can be generated by sampling then differentiating random functions. Such backward approach might result in a biased dataset Yehuda et al. (2020). In this paper, datasets for all considered tasks are generated using a forward approach, by directly sampling from the input space. As a result, potential biases caused by backward generative methods do not apply here. Besides, we studied problems from three different fields, that use different mathematical techniques, and different generators, suggesting that the high results we obtain do not come from some specific aspect of a particular problem.
An objection traditionally raised is that the model might memorize a very large number of cases, and interpolate between them. This is unlikely. First, because the size of our problem space is too large to be memorized (for all considered problems, we did not get a single duplicate over 50 million generated examples). Second, because in some of our problems such as nonautonomous control, even a model with one layer and 64 dimensions obtains a high accuracy, and such a small model would never be able to memorize that many examples. Third, because for some of our problems (e.g. local stability), we know from mathematical theory that solutions (e.g. the real values of eigenvalues) cannot be obtained by simple interpolation.
6 Conclusion
In this paper, we show that by training transformers over generated datasets of mathematical problems, advanced and complex computations can be learned, and qualitative and numerical tasks performed with high accuracy. Our models have no builtin mathematical knowledge, and learn from examples only. It seems that our models have learned to solve these problems, but this does not mean they learned the techniques we use to compute their solutions. Problems such as nonautonomous control involve long and complex chains of computations. Yet, even very small models (one layer transformers with dimensions) achieve high accuracy.
Most probably, our models learn shortcuts that allow them to solve specific problems, without having to learn or understand their theoretical background. Such a situation is common in everyday life. Most of us learn and use language without understanding its rules. On many practical subjects, we have tacit knowledge and know more than we can tell (
Polanyi and Sen (2009)). This may be the way neural networks learn advanced mathematics. Understanding what these shortcuts are and how neural networks discover them is a subject for future research.References
 Arabshahi et al. [2018a] Forough Arabshahi, Sameer Singh, and Animashree Anandkumar. Combining symbolic expressions and blackbox function evaluations for training neural programs. In International Conference on Learning Representations, 2018a.
 Arabshahi et al. [2018b] Forough Arabshahi, Sameer Singh, and Animashree Anandkumar. Towards solving differential equations through neural programming. 2018b.
 Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Bahouri et al. [2011] Hajer Bahouri, JeanYves Chemin, and Raphaël Danchin. Fourier analysis and nonlinear partial differential equations, volume 343. Springer Science & Business Media, 2011.
 Bernhard et al. [2017] Pierre Bernhard, Marc Deschamps, et al. Kalman on dynamics and contro, linear system theory, optimal control, and filter. Technical report, 2017.
 Chen et al. [2018] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.
 Coron [2007] JeanMichel Coron. Control and nonlinearity, volume 136 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2007. ISBN 9780821836682; 0821836684.

Cybenko [1989]
George Cybenko.
Approximation by superpositions of a sigmoidal function.
Mathematics of control, signals and systems, 2(4):303–314, 1989.  Evans [2010] Lawrence C Evans. Partial differential equations, volume 19. American Mathematical Soc., 2010.
 Evans et al. [2018] Richard Evans, David Saxton, David Amos, Pushmeet Kohli, and Edward Grefenstette. Can neural networks understand logical entailment? arXiv preprint arXiv:1802.08535, 2018.
 Funke et al. [2016] Joseph Funke, Matthew Brown, Stephen M Erlien, and J Christian Gerdes. Collision avoidance and stabilization for autonomous vehicles in emergency scenarios. IEEE Transactions on Control Systems Technology, 25(4):1204–1216, 2016.

He et al. [2016]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pages 770–778, 2016.  Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Hornik [1991] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
 Hornik et al. [1990] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks, 3(5):551–560, 1990.
 Kaiser and Sutskever [2015] Lukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. CoRR, abs/1511.08228, 2015.
 Kalman et al. [1963] Rudolf E. Kalman, YuChi Ho, and Kumpati S. Narendra. Controllability of linear dynamical systems. Contributions to Differential Equations, 1:189–213, 1963. ISSN 05895839.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kleinman [1970] David Kleinman. An easy way to stabilize a linear constant system. IEEE Transactions on Automatic Control, 15(6):692–692, 1970.
 Lagaris et al. [1998] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
 Lagaris et al. [2000] Isaac E Lagaris, Aristidis C Likas, and Dimitris G Papageorgiou. Neuralnetwork methods for boundary value problems with irregular boundaries. IEEE Transactions on Neural Networks, 11(5):1041–1049, 2000.
 Lample and Charton [2020] Guillaume Lample and François Charton. Deep learning for symbolic mathematics. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1eZYeHFDS.
 Larsson et al. [2016] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultradeep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
 Lee and Kang [1990] Hyuk Lee and In Seok Kang. Neural algorithm for solving differential equations. Journal of Computational Physics, 91(1):110–131, 1990.
 Lu et al. [2017] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.
 Lukes [1968] Dahlard L Lukes. Stabilizability and optimal control. Funkcial. Ekvac, 11:39–50, 1968.
 MathWorks [2019] MathWorks. Matlab optimization toolbox (r2019a), 2019. The MathWorks, Natick, MA, USA.
 Maxwell [1868] James Clerk Maxwell. I. on governors. Proceedings of the Royal Society of London, pages 270–283, 1868.
 Meurer et al. [2017] Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103, January 2017. ISSN 23765992. doi: 10.7717/peerjcs.103. URL https://doi.org/10.7717/peerjcs.103.
 Minorsky [1930] Nicolas Minorsky. Automatic steering tests. Journal of the American Society for Naval Engineers, 42(2):285–310, 1930.

Petersen and Voigtlaender [2018]
Philipp Petersen and Felix Voigtlaender.
Optimal approximation of piecewise smooth functions using deep relu neural networks.
Neural Networks, 108:296–330, 2018.  Pinkus [1999] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195, 1999.
 Polanyi and Sen [2009] Michael Polanyi and Amartya Sen. The Tacit Dimension. University of Chicago Press, 2009. ISBN 9780226672984.
 Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
 Rudd [2013] Keith Rudd. Solving partial differential equations using artificial neural networks. PhD thesis, Duke University Durham, NC, 2013.
 Saxton et al. [2019] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019.
 Selsam et al. [2018] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from singlebit supervision. arXiv preprint arXiv:1802.03685, 2018.
 Sirignano and Spiliopoulos [2018] Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
 Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 Tai et al. [2015] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from treestructured long shortterm memory networks. arXiv preprint arXiv:1503.00075, 2015.
 Trask et al. [2018] Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, pages 8035–8044, 2018.
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
 Virtanen et al. [2020] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake Vand erPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1. 0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: https://doi.org/10.1038/s4159201906862.
 Wigner [1960] Eugene P Wigner. The unreasonable effectiveness of mathematics in the natural sciences. communications on pure and applied mathematics, 12:1–14, 1960.
 WolframResearch [2019] WolframResearch. Mathematica, version 12.0, 2019. Champaign, IL, 2019.
 Yehuda et al. [2020] Gal Yehuda, Moshe Gabel, and Assaf Schuster. It’s not what machines can learn, it’s what we cannot teach. arXiv preprint arXiv:2002.09398, 2020.
 Zaremba and Sutskever [2014] Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
 Zhang et al. [2017] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. Polynet: A pursuit of structural diversity in very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 718–726, 2017.
Appendix A Examples of computations
a.1 Step by step example : autonomous control
To measure whether the system
is controllable at a point , with asymptotic control , we need to

differentiate the system with respect to its internal variables, obtain the Jacobian

differentiate the system with respect to its control variables, obtain a matrix

evaluate and in ,

calculate the controllability matrix given by (2).

output , with the rank of the controllability matrix, the system is controllable if

(optionally) if , compute the control feedback matrix as in (3)
a.2 Examples of inputs and outputs
a.2.1 Local stability
System 



(locally stable)  
(locally stable) 
a.2.2 Controllability: autonomous systems
Autonomous system 



0 (controllable)  
1  
2  
0 (controllable)  
1 
a.2.3 Controllability: nonautonomous systems
Nonautonomous system 



False  
False  
False  
True  
True 
a.2.4 Stability of partial differential equations using Fourier transform
PDE and initial condition 



False , False  
True , False  
True , False  
True , True 
Appendix B Mathematical definitions
b.1 Notions of stability
Let us consider a system
(7) 
is an attractor, if there exists such that
(8) 
But, counter intuitive as it may seem, this is not enough for asymptotic stability to take place.
Definition B.1.
We say that is a locally (asymptotically) stable equilibrium if the two following conditions are satisfied:

is a stable point, i.e. for every , there exists such that
(9) 
is an attractor, i.e. there exists such that
(10)
In fact, the SMT of Subsection 3.1 deals with an even stronger notion of stability, namely the exponential stability defined as follows:
Definition B.2.
We say that is an exponentially stable equilibrium if is locally stable equilibrium and, in addition, there exist , , and such that
In this definition, is called the exponential convergence rate, which is the quantity predicted in our first task. Of course, if is locally exponentially stable it is in addition locally asymptotically stable.
b.2 Controllability
We give here a proper mathematical definition of controllability. Let us consider a nonautonomous system
(11) 
such that .
Definition B.3.
Let , we say that the nonlinear system (11) is locally controllable at the equilibrium in time with asymptotic control if, for every , there exists such that, for every with and there exists a trajectory such that
(12) 
An interesting remark is that if the system is autonomous, the local controllability does not depend on the time considered, which explains that it is not precised in Theorem 3.2.
b.3 Tempered distribution
We start by recalling the multiindex notation: let , , and , we denote
(13) 
is said to be a multiindex and . Then we give the definition of the Schwartz functions:
Definition B.4.
A function belongs to the Schwartz space if, for any multiindex and ,
(14) 
Finally, we define the space of tempered distributions:
Definition B.5.
A tempered distribution is a linear form on such that there exists and such that
(15) 
Appendix C Proofs of theorems
c.1 Analysis of Problem 2
The proofs of Theorem 3.2, of validity of the feedback matrix given by the expression (3), and of the extension of Theorem 3.2 to the nonautonomous system given by condition (4) can be found in Coron [2007]. We give here the key steps of the proof for showing that the matrix given by (3) is a valid feedback matrix to illustrate the underlying mechanisms:

Setting , where is solution to , and
(16) 
Showing, using the form of , that

Showing that, if for any , , then for any ,

Concluding from the previous and LaSalle invariance principle that the system is locally exponentially stable.
c.2 Analysis of Problem 3
In this section we prove Proposition 3.1. We study the problem