Dimensionally Constrained Symbolic Regression

06/20/2011 ∙ by Suyong Choi, et al. ∙ Korea University 0

We describe dimensionally constrained symbolic regression which has been developed for mass measurement in certain classes of events in high-energy physics (HEP). With symbolic regression, we can derive equations that are well known in HEP. However, in problems with large number of variables, we find that by constraining the terms allowed in the symbolic regression, convergence behavior is improved. Dimensionally constrained symbolic regression (DCSR) finds solutions with much better fitness than is normally possible with symbolic regression. In some cases, novel solutions are found.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Extraction of a physical parameter from data is an area where application of symbolic regression can be useful, especially if the relationship between the parameter of interest and measured variables are contrived and non-linear [2]. In the problem of mass measurement of a particle in high-energy physics, the relationship is determined exactly if all the decayed particles are detected and measured by the detector. The problem becomes non-trivial if some of the particles escape detection.

In boson mass measurement with at hadron colliders, the neutrino () escapes detection, but its transverse components of the momentum are measured indirectly. In this case, transverse mass () is known to be the most sensitive variable to the boson mass [1].

In searches of interest at LHC and the Tevatron, such as the Higgs particle and supersymmetry phenomena, there are two or more particles which escape the detector without leaving any signal. It makes mass measurement of these particles challenging. In some of cases, there many solutions are known, but it is not clear in what sense they are optimal for the mass measurement [4, 5, 3]. By applying symbolic regression to these problems, we may be able to derive optimal equations sensitive to mass and gain new insight.

2 Dimensionally Constrained Symbolic Regression and Its Implementation

Symbolic regression(SR) is a method which can be used to derive relationship between input variables and the desired output values. It is an application of genetic programming technique. In SR, an individual of a population is symbolic form of an equation. By applying genetic operations, the population evolves and gradual improvement of the overall population as well its most fit individual occur. Selection is done at the individual level at each generation based on its fitness.

We need to define the representation, functions/operators to be used, terminals that will be used to form an individual in SR. These, in addition to genetic operations, fitness function, and numerous parameters, define an SR. Later, we will describe the dimensionally constrained symbolic regression (DCSR) used for mass reconstruction problems in HEP. To have the flexibility DCSR as an option, we implemented a general SR in C++ language based on ROOT C++ libraries [7].

2.1 Representation and Genetic Operations in SR

Internally, mathematical expression is represented as a binary tree (Fig. 1). Operators or functions appear at tree nodes and arguments to them appear as branches. Binary tree representation allows for efficient manipulation and evaluation of mathematical expressions.

The functions and operators allowed are as follows:

  • Basic arithmetic operators: , , ,

  • Negation

  • Power

  • Transcendental functions: , , , ,

  • Unit step function

For a function or operator that accepts only one argument, the second leaf may be populated, but is not evaluated. Although the second argument has no meaning with respect to the function or operator, it may nonetheless affect evolution. With cross-over operations and mutations, this second leaf may be picked up for genetic operations. Our implementation is flexible enough such that we can add more functions or operators, if needed.

Figure 1: A binary tree of expression .

When creating an arbitrary expression, the functions or operators are picked randomly according to certain probabilities. The probabilities should be provided as an input to SR. To a certain degree, they will depend on the problem to be solved. We can control the relative rates these functions or operators are chosen through the input parameters of SR.

2.2 Terminals

The input variables and constant become terminals of expression trees. In our implementation of SR, the relative rate for variables and constants can be controlled. However, we set the probability to choose among the variables the same. For the DCSR, which we will describe latter section, we only allow dimensionless arbitrary constants.

2.3 Building Expressions from Functions/Operators and Terminals

A random expression can be build by first choosing a function or an operator randomly with the predefined probability. This becomes the head node of a tree. Then the leaf nodes are built recursively by creating a new expression. An expression can be any of the following:

  • Terminal - constant or variable

  • Unary function - , where is another expression

  • Binary function - , where and are expressions

Since the expression can go on forever, probability to choose a terminal must not be 0, and/or when the maximum depth of a tree is reached, one of the terminals is chosen.

2.4 Parameters of SR

Below, we list the complete start up parameters

  • Number of variables () to use in terminals.

  • Number of generations () in evolution of population.

  • Population size () - At each generation, the population size is kept fixed.

  • Tournament size () - A parent is chosen through a tournament from a randomly chosen subset of size from the population.

  • Flag whether to try double tournament - If , each parent is chosen in separate tournaments. Otherwise, the best two in a single tournament are chosen as parents.

  • Maximum depth of tree ().

  • Copy reproduction () - best individuals of a population are simply copied to the next generation. These individuals may still participate in sexual reproduction.

  • Fraction of new children produced through sexual reproduction in population () - Number of children in each generation is .

  • Mutation probability () - Mutation probability per node

  • Drop probability () - Probability per node to drop it in a mutation

  • Probability to pick a constant ()

  • Probability to pick th function or operator ()

For the examples in the next section, some of the parameters of interest are: , , , . The basic arithmetic operators have probabilities between 5% to 25% each. The other functions and operators take up the remaining probability equally. These probabilities vary for each run.

2.5 Initialization, training, fitness evaluation and evolution of SR

In the initialization stage, random expressions are created to populate the first generation. The user can choose to create an initial population with the expression filled to the maximum tree depth or not. This depends on the problem being considered. For the purpose of mass reconstruction, we find that having the initial population filled to the maximum depth of the tree is not necessarily beneficial, since most of the initial population has poor fitness and get discarded after a few generations.

No distinction is made between the training and testing sample. One half of the input sample is randomly chosen for the training sample and the other half is the testing sample. This choice is made for every generation. Fitness of each individual is evaluated on the testing sample according to the criteria for a problem.

Evolution of population is done through two main avenues. First is the sexual reproduction, where parents are not chosen randomly, but through a tournament. In a tournament, random subpool of individuals are chosen randomly and the best two individuals are chosen as parents in a single-tournament method. In a double tournament method, the parents are chosen from two tournaments by selecting the best fit individual in each tournament.

Once the parents are selected, then the subnodes are selected from two parents and swapped in place. We accept only one child from a set of parents, this choice is random. It is well-documented in literature that selecting a random subexpression for swapping leads to trivial or minimal modifications as nodes that have terminals are chosen more often. Therefore, we choose internal nodes 90% of the times for swapping.

Asexual reproduction is done by having mutations occurring in random nodes in the expression tree. Selected node is either replaced with another expression or it is dropped. If it is to be replaced, an expression tree randomly generated using the parameters of the symbolic regression algorithm. An individual which successfully had a child can still participate in an asexual reproduction. We finally, let the “elite” () to go to the next generation without modifications.

Figure 2: Creation of a child through cross-over operation. The cross over of genes occuring between and from the two parents yields a new child

2.6 Evaluation of Fitness

Fitness function is problem specific. Evaluation of fitness is done on a set of test sample by traversing the expression tree. This is computationally the most time consuming part of SR. Expressions which do not yield machine-real numbers are assigned a number corresponding to a very poor fitness, marking them likely for removal. An individual with a more compact expression is favorable and a small penalty proportional to the number of nodes is imposed on each individual.

2.7 Dimensionally Constrained Symbolic Regression

Symbolic regression must yield a dimensionally sensible result if it is to be interpretable. In this application, we would like to construct equations related to mass of a particle. In natural units ( and ), the mass has the same physical dimensions as momentum and energy. We will assume that the target dimension is , i.e., momentum raised to integer power, and that variables have dimension 1 or 0.

We see that the arguments to operators or functions must satisfy certain constraints. If stands for the physical dimension of expression :

  • : .

  • : .

  • : .

  • :

  • : and

  • Transcendental function : and .

Note that the argument to a transcendtal function must be dimensionless.

Using these rules, we can impose that an expression built by symbolic regression should have the correct physical dimensions. We would need to reformulate the transformation rules for evolution compared to the regular SR.

An (an integer) dimensional term can be built recursively by applying one of the following rules:

  • -dim term -dim term

  • -dim term -dim term

  • -dim term -dim term

  • square root of an -dim term

A 0-dimensional term can be created by dividing two 1-dimensional terms or applying a transcendental function to argument of 0-dimension. And -1-dimensional term can be created by taking an invervse of 1-dimensional term. One can see that any integer dimensional term can be created using these rules.

In DCSR, rules for cross-over and mutation need to be modified accordingly. Exchange or replacement of expressions can occur only among those with the same physical dimensions, otherwise it is not permitted.

3 Application of Symbolic Regression to Mass Measurement Problem

In this section, we will apply symbolic regression to a few problems of mass determination in high-energy physics. In the first two cases, we apply it to the problem where the answer is well-known. And in the third case, we try to find a symbolic expression for the Higgs mass that can be applied to a case where Higgs boson decays to and each boson decays leptonically.

3.1 Invariant mass

As a first example, we choose a trivial example, one for which we already know the answer. The variables given are the components of 4-vector,

, and the target value we would like to obtain is . A 1000 toy data sample was generated satisfying the relationship.

Figure 3: Evolution of minimum absolute error of symbolic regression trained with invariant mass data. Out of 100 runs, 26% yield the correct formula . Evolution of fitness of 36th run is shown by itself.

Symbolic regressions with different values of parameters were tried on this data. Fitness function is , where is the size of data. Figure 3 shows evolution of fitnesses of the best fit individuals in each run. In 26% of the runs, the correct formula for invariant mass is found. The right plot of Fig. 3 shows the evolution of the best fit individual of run 36. At the start of the run, the best fit expression is , which is meaningless physically. By generation 86, the equation is found. When DCSR is used on the same sample, the chances for success increases to 68%.

3.2 Events with Missing Information - Transverse Mass

A more realistic and interesting application of symbolic regression is to a problem where information is lost. A classic example would be where a massive particle decays into a charged lepton and a neutrino in a hadronic collision. Transverse mass is typically used to measure the mass of the boson. To test whether we can extract this mass relationship, a phase-space generator is used to create the momenta of lepton and neutrino from a hypothetical massive particle with an exponentially falling initial transverse momentum distribution for particle masses between 0 GeV and 100 GeV.

Figure 4: Evolution of minimum of fitness in 100 runs without DCSR (left) and with DCSR (right) during training using lepton and neutrino momenta variable set as inputs . With DCSR, the fraction of runs that find the equation for transverse mass increases dramatically.

The fitness function used was , where is the value returned by individual. Some trees reach a global minimum as shown in Fig 4. For those that reach a global minimum, the equation is , which is the same as the definition of except for the overall constant factor.

Depending on the parameters of the genetic algorithm, it may take longer to reach this minimum. Some may fail to reach this minimum For example, if the minimization criteria were

, almost all trials of symbolic regression fail to reach the global minimum. The populations get quickly trapped into a local minima. This is the result of using a minimizing criteria that heavily penalizes the outliers. It is a known feature of genetic algorithm that having individuals with poor fitness in the gene pool is important in population to reach the global minimum. Without DCSR, 7% of the trials find the solutions, while with DCSR the success rate increases to 40%. We find that constraining the terms allowed increases the speed of convergence and the chances for a successful convergence.

3.3 Mass Sensitive Variable in

Higgs mass determination in in hadron colliders is an inportant problem. In this channel, two lepton momenta and the vector sum of the two neutrino transverse momenta are measured in experiments. Since there are only two equations related to neutrino momenta, the system is under-constrained. If we knew both bosons were real, we would still need two extra equations to constrain the system. Therefore one cannot solve for the neutrino momenta exactly even in principle.

Existing studies relied on analysis of kinematics to find expressions that behave linearly to the Higgs boson mass [3, 5]. In this study, we apply symbolic regression to the problem and find an expression that not only shows linear behavior, but also whose widths of the mass distribution are narrow.

Symbolic regression is applied to a data generated with PYTHIA at TeV with varying from 120 GeV to 200 GeV [8]. Detector simulation is not performed on this data in order to find the ideal expression. Momentum components and energy of the two charged leptons () and missing information () are used as input variables for the symbolic regression. The fitness function used is the average of fractional absolute difference: .

Without DCSR, the symbolic regression is not able to yield meaningful results. This seems to be due to the greater number of variables used. The number of terms of dimension 2 with only multiplication allowed is 78, which makes the possible function space to explore very large.

If fractional root mean-squared (RMS) () were used as the fitness function, the symbolic regression would get trapped into local minima even with DCSR since outliers pay a heavy penalty.

Figure 5

shows evolution of fitness of best-fit individuals in 100 runs as a function of the number of generations. DCSR is able to converge on meaningful results and yields the best estimate for the


Symmetry of the two leptons in the system is recognized by the symbolic regression automatically, even though symmetry condition was not imposed. More detail on the Higgs mass measurement using this result is discussed in Ref. [9].

Figure 5: Left: Evolution of best fit individuals function in 100 runs. Right: Distribution of ratio of predicted mass to true mass, , versus the true Higgs mass . The sample was generated using PYTHIA.

3.4 Computational performance

With our implementation, on a 24 thread X5650 Xeon machine, it takes approximately 1 day to complete the 100 runs for the Higgs mass example. Most of the time is spent in evaluation of the fitness function. We noticed memory fragmentation occurring and evaluation tended to become slower as time went on. A Mathematica implementation, which was simple to write, did not suffer from memory problems. However, it was inherently slow. Using the parallel processing capabability of Mathematica 7, even when we used all 24 threads, one run took about a day to complete.

With the advent of higher multiplicity core CPUs in the future, it may become computationally favorable to use this tool. One avenue that needs to be explored is to consider the backgrounds when optimizing. At the moment, it would be very expensive to evaluate fitness including backgrounds since a simple fitness function would no longer suffice.

4 Conclusion

We have defined and implemented dimensionally constrained symbolic regression for use in high-energy physics problems. We find that it reproduces some of the well-known results, such as invariant mass and transverse mass. We used it to construct a function that is linear and sensitive to the Higgs boson mass . that has not been previously known. We expect that this method would be useful in cases with more than one undetected particles, where it is not trivial what variable could give the best performance.