Globally Optimal Symbolic Regression

10/29/2017 ∙ by Vernon Austel, et al. ∙ ibm 0

In this study we introduce a new technique for symbolic regression that guarantees global optimality. This is achieved by formulating a mixed integer non-linear program (MINLP) whose solution is a symbolic mathematical expression of minimum complexity that explains the observations. We demonstrate our approach by rediscovering Kepler's law on planetary motion using exoplanet data and Galileo's pendulum periodicity equation using experimental data.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Discovering mathematical models that explain the behavior of a system has a broad range of applications. Symbolic regression is an important field in machine learning whose goal is to find a symbolic mathematical expression that explains a dependent variable in terms of a number of independent variables for a given data set, most commonly as an explicit function of the independent variables. Unlike traditional (numerical) regression schemes, the functional form of the expression is not assumed to be known

a priori kroll2017workflow . The utility of the approach has been established for a broad range of applications connor1977scaling ; willis1997genetic ; davidson1999method ; davidson2003symbolic , including the discovery of equations todorovski1997declarative ; langley1981data ; schmidt2009distil ; schmidt2010symbolic .

Starting with koza

, symbolic regression problems have been typically solved with genetic programming

banzhaf1998genetic ; augusto2000symbolic , an evolutionary metaheuristic

related to genetic algorithms; another such technique is grammatical evolution

ge . Even before koza

, heuristics to find explicit functional relationships were developed as part of the BACON system, see

langley1987 . There has been much research into improving symbolic regression techniques. In Luo2017 , the separability of the desired functional form is exploited to speed up symbolic regression, whereas a set of candidate basis functions is used in ffx for this purpose. Other recent work has focused on finding accurate constants in the derived symbolic mathematical expression dgp . To discover meaningful functions in the context of physical systems, the authors of schmidt2009distil search for functions that not only match the data, but also have the property that partial derivatives also match the empirically computed partial derivatives. The same authors search for implicit functional relationships in schmidt2010symbolic . Another approach populates a large hypothesis space of functional forms, on which sparse selection is applied brunton2016discovering .

In this study we present a Mixed-Integer Non-Linear Programming (MINLP) formulation that produces the simplest symbolic mathematical expression that satisfies a prescribed upper bound on the prediction error. Our MINLP formulation can be solved to optimality using existing state-of-the-art global optimization solvers. The key advantage of our approach is that it produces a globally optimal mathematical expression while avoiding exhaustive search of the solution space. Another advantage is that it produces correct real-valued constants (within a tolerance) directly; most other methods use specialized algorithms to refine constants dgp ; topchy and cannot guarantee global optimality. In addition, our formulation can, in principle, seamlessly incorporate additional constraints and objectives to capture domain knowledge and user preferences.

2 A MINLP Formulation for Free-Form Model Discovery

A symbolic regression scheme consists of a space of valid mathematical expressions together with a mechanism for its exploration. A mathematical expression can be represented by a (rooted) expression tree where each node is labeled by one of the following entities: operators (such as , , ), variables (i.e., the independent variables), and constants. Edges of the tree link these entities in a way that is consistent with a prescribed grammar. For example, an expression tree for is:





Here , and the power function are the operators, are the variables, and numbers , and are the constants.

In our MINLP based approach we model the grammar of valid expression trees by a set of constraints. The discrete variables of the formulation are used to define the structure of the expression tree, and the continuous variables to evaluate the resulting symbolic expression for specific numerical values associated with the data. To choose among the many possible expressions that fit the data (within an error bound), we set the objective to minimize the description complexity of the expression. Consequently, the MINLP formulation has the form

s.t. (Grammar)

where and are decision variables, is the expression tree defined by these variables, is the set of values of the decision variables that define valid expression trees of bounded size, measures the description complexity of the expression tree, measures error of the predicted values , and are the observed data. In practice, MINLP solvers (e.g., BARON ts05 , COUENNEBeLeLiMaWa08 , SCIPMaherFischerGallyetal2017 ) employ various convex relaxation schemes to obtain lower bounds on the objective function value and use these bounds in divide-and-conquer strategies to obtain globally optimal solutions ts05 . We note that our MINLP approach bears similarities to the one presented in horesh2016tech , however unlike our model, the one presented in horesh2016tech is computationally impractical.

We next give the details of our formulation. The input to the formulation consists of a rooted binary tree, a set of candidate operators and an observed dataset. For each observation , we denote the value of the independent variables by and the dependent variable values by . Let denote the indices of the independent variables. Let be the input binary tree, let denote root node and denote the leaf nodes. Each node in the tree has exactly one predecessor and each node has exactly two successors. Let denote the set of candidate operators where contains the unary operators (such as squareroot) and

contains the binary operators (such as addition). Our formulation has two parts: The first part consists of constraints and decision variables that construct the expression tree and the second part is used to evaluate the difference between the estimated and actual dependent variable value of each observation using the constructed expression tree.

Grammar. The first part of the formulation chooses a subtree of and assigns either an operator, a constant, or an input variable to each node of the chosen subtree in a consistent way. For each , we have a decision variable to indicate if the node is used (active) in the expression tree. Decision variables and denote whether the operator or the input variable are assigned to node respectively. Finally we have a decision variable if a constant value is assigned the node. The constraint


enforces that each active node is assigned an operator, a constant, or one of the variables. Let be a non-leaf node with successors where has a smaller index than . Note that if node is assigned a binary operator then both nodes have to be active in the tree; and if is assigned a unary operator then exactly one of the nodes (say ) has to be active. Consequently, we have constraints and . Notice that these constraints also enforce that if a node is inactive, or is assigned a constant or is a variable, then its successor nodes cannot be active. As the leaf nodes cannot be assigned operators, we also set for all . Finally, we define a continuous variable for to denote the value of the constant when . It is possible to show that any

together with binary vectors

, and (of appropriate dimensions) that satisfy the constraints above define an expression tree and vice versa.

In addition to the basic model described above, we impose some additional constraints to improve computational performance (without compromising optimality). For example we make sure that if a node is assigned a binary operator, then both its successor nodes cannot be assigned constants at the same time. Similarly the left successor of a unary operator cannot be a constant. We also have several constraints to deal with symmetry as expression trees are inherently symmetric objects in the sense that the same function can be represented with different trees, for example by flipping two branches succeeding a commutative binary operator. To avoid numerical problems, we also set bounds on the absolute value of the continuous variables.

Prediction. Using the first set of variables, the second part of the formulation computes the value of the function (defined by the expression tree) for each observation. More precisely, we define a variable to denote the value of node for observation . Value of a node clearly depends on the operator assigned to it and the values of its successor nodes. For example, if the addition operator is assigned to node that has successor nodes , then would be the sum of and . Therefore we have a constraint:


where denotes the function on the argument(s) of the operator . Also note that only if . Leaf nodes cannot be assigned operators, therefore the last two summations in (2) are zero for these nodes.

Error. Recall that denotes the root node of the expression tree and therefore corresponds to the estimated output variable for observation . We define the error for observation as and relative error as . Using these expressions, we can define a discrepancy function such as . In our computational experiments, we control the relative error in our model by adding the constraint for a given .

Complexity. We can define a function to measure description complexity of the model in various ways, for example where is a complexity weight assigned to operator by the user. In our computational experiments, we use the expression to minimize the total number of nodes in the expression tree.

3 Computational Experiments

We report our computational experience on the discovery of physical laws using real-world datasets. We solve the optimization problem described in Section 2 with the MINLP solver BARON ts05 v17.8.9, using IBM ILOG CPLEX and IPOPT ipopt as subsolvers. We can only use operators that are handled by BARON. Furthermore, when multiple globally optimal solution exists, we do not have control over which one is returned. Experiments are run on the cloud, so CPU speed cannot be precisely quantified.

Exoplanet Data. The first set of experiments concerns the discovery of Kepler’s law on planetary motion on a dataset taken from NASA NASAExoplanetArchive reporting information on planetary systems. The dataset consists of tuples of the form (planet, star, ) where is the orbital period of the planet, the mass of the star, the mass of the planet, the major semi-axis of the orbit. Planet features are normalized to Earth’s, star features to Sol’s. is further normalized since its range is very large. We build three test problems from this dataset, including the systems listed in brackets: EP1 (Sol and Trappist), EP2 (HD 219134 and Kepler), EP3 (GJ 667C, HD 147018, HD 154857, HD 159243 and HD 159868).

Maximum relative error
Dataset 2% 5% 10% 20% 30% 50%
Table 1: Results for the exoplanet dataset. Time limit is 6 hours. Grey cells exceed the time limit. represents constants (we do not report its value for brevity).

We consider the set of operators , and full binary expression trees with depth at most 3 (the root has depth 0) in which each operator can appear at most 3 times. We aim to predict in terms of the other input variables. Results are reported in Table 1, for different upper bounds applied to the relative model error. For small model errors, we always find a refined expression of Kepler’s third law: . Given the data, the more comprehensive formula would be , but the dependency on is not picked up because is negligible compared to . As model error upper bound increases, we find simpler formulas that do not accurately reflect Kepler’s third law.

Most of the formulas returned are certified globally optimal in a relatively short period of time: on average, 50 minutes for the instances solved to global optimality (two instances hit the time limit and thus are not certified globally optimal within the time limit), standard deviation 75. The number of nodes explored by the Branch-and-Bound algorithm varies greatly: from 147 for the simplest formula

, to over 600k for one instance that is not solved to optimality within the allotted runtime. The geometric mean is 19917. By contrast, the number of binary expression trees of depth

with candidate operations, input variables, and numerical constants is without accounting for symmetry. We remark that the number of nodes explored by Branch-and-Bound does not correspond to the number of candidate expression trees that have been examined: it is merely a measure of difficulty of the search.

Pendulum Data. The second set of experiments concerns ten pendulums with different link lengths, and timestamps for the time at which each pendulum crosses the midpoint of its arc. The data is obtained from a high-resolution video recording of the pendulums, yielding hundreds of timestamps. From the raw data, we randomly extract ten tuples (one for each link length) of the form (), where is the link length, and are integers, and are the timestamps for the -th and -th crossing of the midpoint, and . We aim to predict .

Maximum relative error
Dataset 0.5% 1% 2% 5%
Table 2: Results for the pendulum dataset. Time limit is 6 hours. All formulas are certified optimal.

The period of the pendulum is given by . Hence, we aim to obtain the equation . The experimental setup is similar to the exoplanet dataset, but we add “” to the list of operators. Table 2 shows that many experiments return the correct formula except the multiplicative constant, since and its discrepancy from is too small to be significant, up to experimental error. With error bound , we observe overfitting on DS1 and DS4: the simpler formula does not satisfy the error bound due to experimental error. Because the link lengths of the ten pendulums are close to each other (ranging from to ), for larger error bounds a simpler expression of the form suffices. The average runtime is 2 minutes, standard deviation 2.3. The number of nodes ranges from 57 to 8134, geometric mean 299. Despite the presence of redundant variables, the algorithm quickly identifies the simplest formula explaining the data.

In this extended abstract we focused on small datasets and did not address scalability, which is a known limitation of Branch-and-Bound-based methods for MINLP. This is left for future research.


  • (1) Douglas Adriano Augusto and Helio JC Barbosa. Symbolic regression via genetic programming. In

    Proceedings of the Sixth Brazilian Symposium on Neural Networks

    , pages 173–178. IEEE, 2000.
  • (2) Wolfgang Banzhaf, Peter Nordin, Robert E Keller, and Frank D Francone. Genetic programming: an introduction, volume 1. Morgan Kaufmann, San Francisco, 1998.
  • (3) Pietro Belotti, Jon Lee, Leo Liberti, François Margot, and Andreas Wächter. Branching and bounds tightening techniques for non-convex MINLP. Optimization Methods and Software, 24(4-5):597–634, 2008.
  • (4) Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.
  • (5) Zonglin Jiang Changtong Luo, Chen Chen. A divide and conquer method for symbolic regression. arXiv:1705.08061v2, 2017.
  • (6) Jack W Connor and John B Taylor. Scaling laws for plasma confinement. Nuclear Fusion, 17(5):1047, 1977.
  • (7) James W Davidson, Dragan A Savic, and Godfrey A Walters. Method for the identification of explicit polynomial formulae for the friction in turbulent pipe flow. Journal of Hydroinformatics, 1(2):115–126, 1999.
  • (8) James W Davidson, Dragan A Savic, and Godfrey A Walters. Symbolic and numerical regression: experiments and applications. Information Sciences, 150(1):95–117, 2003.
  • (9) Lior Horesh, Leo Liberti, and Haim Avron. Globally optimal mixed integer non-linear programming (MINLP) formulation for symbolic regression. Technical Report 219095, IBM, 2016.
  • (10) Dario Izzo, Francesco Biscani, and Alessio Mereta. Differentiable genetic programming. In J. McDermott et al., editor, EuroGP 2017, LNCS 10196, pages 35–51, 2017.
  • (11) John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, 1992.
  • (12) Paul Kroll, Alexandra Hofer, Ines V Stelzer, and Christoph Herwig. Workflow to set up substantial target-oriented mechanistic process models in bioprocess engineering. Process Biochemistry, 2017.
  • (13) Pat Langley. Data-driven discovery of physical laws. Cognitive Science, 5(1):31–54, 1981.
  • (14) Pat Langley, Gary Lee Bradshaw, and Herbert A Simon. Heuristics for empirical discovery. In L. Bolc, editor, Computational models of learning. Springer, Berlin, 1987.
  • (15) Stephen J. Maher, Tobias Fischer, Tristan Gally, Gerald Gamrath, Ambros Gleixner, Robert Lion Gottwald, Gregor Hendel, Thorsten Koch, Marco E. Lübbecke, Matthias Miltenberger, Benjamin Müller, Marc E. Pfetsch, Christian Puchert, Daniel Rehfeldt, Sebastian Schenker, Robert Schwarz, Felipe Serrano, Yuji Shinano, Dieter Weninger, Jonas T. Witt, and Jakob Witzig. The SCIP optimization suite 4.0. Technical Report 17-12, ZIB, Takustr.7, 14195 Berlin, 2017.
  • (16) T. McConaghy. Ffx: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice X, pages 235–260. Springer, New York, 2011.
  • (17) NASA. NASA exoplanet archive, 2017. URL:
  • (18) M. O’Neill and C. Ryan. Grammatical evolution. IEEE Trans. Evol. Comput., 5(4):349–358, 2001.
  • (19) Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324(3):81–85, 2009.
  • (20) Michael Schmidt and Hod Lipson. Symbolic regression of implicit equations. In Genetic Programming Theory and Practice VII, pages 73–85. Springer, Berlin, 2010.
  • (21) Mohit Tawarmalani and Nikolaos V Sahinidis. A polyhedral branch-and-cut approach to global optimization. Mathematical Programming, 103(2):225–249, 2005.
  • (22) Ljupco Todorovski and Saso Dzeroski. Declarative bias in equation discovery. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 376–384, 1997.
  • (23) A. Topchy and W. F. Punch. Faster genetic programming based on local gradient search of numeric leaf values. In

    Proceedings of the Genetic and Evolutionary Computation Conference

    , pages 155–162, 2011.
  • (24) Andreas Wächter and Larry T Biegler. On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.
  • (25) Mark J Willis, Hugo G Hiden, Peter Marenbach, Ben McKay, and Gary A Montague. Genetic programming: An introduction and survey of applications. In Genetic Algorithms in Engineering Systems: Innovations and Applications, 1997. GALESIA 97. Second International Conference On (Conf. Publ. No. 446), pages 314–319. IET, 1997.