1 Introduction
Polylogarithmic functions, such as $\ln(x)$ and $\mathrm{Li}_2(x)$, naturally appear in many calculations relevant to physical systems. For example, in high energy physics, polylogarithms are ubiquitous in the Feynman integrals that describe the scattering of elementary particles. One way to understand the connection between polylogarithms and Feynman integrals is through the method of differential equations [KOTIKOV1991158, Gehrmann:2000zt], which recasts a Feynman integral in terms of an auxiliary ordinary differential equation that can be integrated multiple times. These nested integrals naturally produce a class of functions which includes the classical polylogarithms. In recent years, progress has been made in understanding the broader class of functions which can appear in Feynman integrals, such as multiple polylogarithms [Goncharov:1998kja, Borwein:1999js, goncharov1mpl] or elliptic polylogarithms [levin1994elliptic, multipleLevin, multiplebrown], but the complete characterization of the possible functional forms of Feynman integrals remains an unsolved problem. Nevertheless, even when an integral can be evaluated and expressed only in classical polylogarithms, $\mathrm{Li}_n(x)$, simplifying the initial expression can be challenging.
To be concrete, consider the following integral, which arises in the computation of the Compton scattering total cross section at next-to-leading order (NLO) in quantum electrodynamics [comptoncrosssection]:
(1) 
One can often evaluate integrals like this by rationalizing the square root and expanding in partial fractions. In this case, defining the integral can be evaluated as
(2)  
(3)  
(4)  
(5) 
where and are cube roots of unity. Although this expression is correct, it can be simplified further to
(6) 
There is no existing software package (to our knowledge) which can integrate and simplify such expressions automatically. The goal of this paper is to explore how machine learning may assist in the simplification of expressions like which arise from Feynman integrals. For this initial study, we restrict our considerations to classical polylogarithms and functions of a single variable where the problem is already challenging.
Simplifying expressions involving polylogarithms involves applying various identities. For example, the logarithm satisfies
(7) $\ln(xy) = \ln x + \ln y$
Identities involving the dilogarithm include reflection
(8) $\mathrm{Li}_2(x) = -\mathrm{Li}_2(1-x) + \frac{\pi^2}{6} - \ln(x)\ln(1-x)$
duplication
(9) $\mathrm{Li}_2(x) = -\mathrm{Li}_2(-x) + \frac{1}{2}\mathrm{Li}_2(x^2)$
and others. Similar identities are known for $\mathrm{Li}_3(x)$ and higher weight classical polylogarithms. Although the complete set of identities is not known for general $n$, known identities are often enough to simplify expressions that arise in physical problems.
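As a quick numerical sanity check of the reflection and duplication identities, the following minimal sketch evaluates the dilogarithm from its power series, which is valid only for $|x| < 1$. The helper name `li2` and the test point `x = 0.3` are illustrative choices, not taken from the paper.

```python
import math

def li2(x, terms=2000):
    """Dilogarithm from the power series sum_{k>=1} x^k / k^2 (needs |x| < 1)."""
    if abs(x) >= 1:
        raise ValueError("the series only converges for |x| < 1")
    return sum(x**k / k**2 for k in range(1, terms + 1))

x = 0.3  # arbitrary test point with 0 < x < 1

# Reflection: Li2(x) + Li2(1 - x) = pi^2/6 - ln(x) ln(1 - x)
reflection_gap = li2(x) + li2(1 - x) - (math.pi**2 / 6 - math.log(x) * math.log(1 - x))

# Duplication: Li2(x) + Li2(-x) = (1/2) Li2(x^2)
duplication_gap = li2(x) + li2(-x) - 0.5 * li2(x**2)

print(abs(reflection_gap) < 1e-10, abs(duplication_gap) < 1e-10)
```

A numerical check of this kind only tests the identities at real points inside the unit interval; the inversion identity involves arguments outside the unit disk and would need a different evaluation method.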
Even with a given set of identities, it is not always easy to simplify polylogarithmic expressions. There may be many terms, and it may take the application of multiple identities in a specific order for the expression to simplify. For example, applying the duplication identity in Eq. (9) to a single term will increase the number of dilogarithms, and it may be that terms cancel only once reflection is applied after duplication. An analogy is a sacrifice in chess: initially it seems ill-advised, but after a subsequent move or two the sacrifice may prove beneficial. The comparison with moves in a game suggests that reinforcement learning could be well-suited to the simplification of polylogarithms. We explore the reinforcement-learning approach in Section 4.
The simplification of expressions resulting from Feynman integrals is not of purely aesthetic interest. For example, constructing the function space [Yan:2022cye] and understanding the singularities of scattering amplitudes [Hannesdottir:2021kpd, Bourjaily:2020wvq] are essential to the $S$-matrix bootstrap program [Eden:1966dnq]. However, singularities of individual terms in a Feynman integral may be absent in the full expression. Thus methods for simplifying iterated integrals have been of interest in physics for some time. A particularly effective tool is the symbol [polylogsmhv6l]. We review the symbol in Section 2.2. Briefly, for an iterated integral of the form
(10) 
where the iterated integrals are defined such that
(11) 
we have an associated symbol as
(12) 
In particular,
(13) 
and
(14) 
The symbol is part of a broader algebraic operation called the coproduct [duhrhopf]. It serves to extract the part of an iterated integral with maximal transcendental weight (e.g. ). The symbol has a product rule similar to the logarithm:
(15) 
This product rule can in fact be used to generate all of the known dilogarithm identities (see Section 2.2).
The symbol helps with the simplification of polylogarithmic functions. For example, the symbol of Eq. (5) is
(16) 
which, using Eqs. (13) and (14), immediately gives the leading transcendentality part of Eq. (6). A more involved example is the simplification of the two-loop 6-point function in $\mathcal{N}=4$ super Yang-Mills theory from a 17-page expression to a few lines [polylogsmhv6l].
Unfortunately, the symbol does not trivialize the simplification of polylogarithmic functions. One can use the product rule to pull out factors of rational polynomial arguments, but getting the resulting symbol into sums of terms of the forms in Eqs. (13) and (14) still requires intuition and often educated guesswork. While there are algorithms to compute the symbol from a given polylogarithm expression, there is no known inverse algorithm to compute the polylogarithm from the symbol. Machine learning, however, is good at inverse problems: one can view the integration of the symbol as analogous to the translation of natural language. Given a mapping from English (polylogarithms) to French (symbol), can we find the English translation (polylogarithmic form) of a given French phrase (known symbol)? Viewing the simplification problem as a sequence-to-sequence translation task allows us to explore a host of other machine learning techniques. In this paper, we will focus on using one of the most powerful known methods for natural language processing (NLP), transformer networks, to tackle the symbol-polylogarithm translation task. The way that reinforcement learning or transformer networks can be used in the simplification task is illustrated in Fig. 1.
Machine learning has already proven useful for symbolic problems in mathematics. For instance, tree LSTMs were deployed to verify and complete equations [arabshahi2018combining], while transformers were shown to be able to solve ODEs, integrate functions [fbtransformer] or solve simple mathematical problems [DBLP:journals/corr/abs190401557]. Finding new identities or checking the equivalence between two mathematical expressions falls into this category of symbolic tasks and has also been tackled using machine learning [AllamanisCKS17, Zaremba]. In particular, reinforcement learning has provided interesting applications in symbolic domains, with contributions in theorem proving [rltheorem, zombori2021towards, Wu2021TacticZeroLT, Lederman2020Learning], symbolic regression [petersen2021deep] and the untangling of knots [unknot, davies2021advancing]. Physics-inspired applications of machine learning have emerged, predicting physical laws from data [cranmer2020, tegmarkfeynmanai, guimera2020] or discovering their symmetries [thaler2022, tegmark2022], but are for the most part focused on problems with numerical input data. Using machine learning in purely symbolic mathematical environments is often quite challenging: when solving a given problem the solution has to be exact and the path to reach it is typically narrow. For example, when simplifying equations we have to apply a sequence of identities in a precise order, where any deviation leads to a completely different expression. Those challenges are discussed in [contrastiveRLPoesia], where the use of contrastive learning [aaron2019, Haotian2020] was key in solving various symbolic domains.
We begin in Section 2 with a review of the relevant mathematics of polylogarithms and symbols that we use later in the paper. Section 3 describes our “classical” approach to simplifying polylogarithm expressions, while Section 4 presents our reinforcement-learning-based technique. Section 5 explores the transformer network’s perspective in the translation task between a symbol and a polylogarithmic expression. A complete nontrivial example application of the transformer network is given in Section 6. Conclusions and an outlook are discussed in Section 7.
2 Mathematical preliminaries
In this section we briefly review the classical polylogarithms and the symbol map. The reader familiar with the mathematics of polylogarithms can safely skip this section. More details can be found in [duhrhopf, dilogkirilov].
2.1 Classical polylogarithms
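The classical polylogarithm $\mathrm{Li}_n(x)$ is defined as an iterated integral, most conveniently expressed recursively as
(17) $\mathrm{Li}_n(x) = \int_0^x \frac{\mathrm{Li}_{n-1}(t)}{t}\,dt$
with the base case
(18) $\mathrm{Li}_1(x) = -\ln(1-x)$
so that
(19) $\mathrm{Li}_2(x) = -\int_0^x \frac{\ln(1-t)}{t}\,dt$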
In general, one has to be careful about the branch cuts of polylogarithmic functions. $\ln(x)$ has a branch point at $x=0$ and is conventionally defined to have a branch cut on the real axis for $x<0$. $\mathrm{Li}_n(x)$ for $n \geq 2$ has a branch point at $x=1$ with a branch cut along $1 < x < \infty$ on the real axis and a branch point at $x=0$ on a higher Riemann sheet. For numerical checks, it is important to keep track of the branch of the polylogarithm, but for symbolic manipulation we can largely ignore the subtleties of the analytic structure.
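The recursive definition can be checked numerically on the principal branch, away from the cuts. The sketch below assumes the standard series $\mathrm{Li}_n(x) = \sum_{k\geq 1} x^k/k^n$ for $|x|<1$ and compares it with a Simpson-rule evaluation of $\int_0^x \mathrm{Li}_2(t)/t\,dt$. The function names and the test point are illustrative only.

```python
import math

def li_series(n, x, terms=400):
    """Li_n(x) from its power series, valid for |x| < 1."""
    return sum(x**k / k**n for k in range(1, terms + 1))

def simpson(f, a, b, steps=200):
    """Composite Simpson rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def integrand(t):
    # Li_2(t)/t tends to 1 as t -> 0, so the integrand is regular at the origin.
    return 1.0 if t == 0 else li_series(2, t) / t

x = 0.5
li3_from_recursion = simpson(integrand, 0.0, x)   # integral of Li_2(t)/t from 0 to x
li3_direct = li_series(3, x)                      # series for Li_3(x)
li1_gap = li_series(1, x) + math.log(1 - x)       # base case Li_1(x) = -ln(1 - x)

print(abs(li3_from_recursion - li3_direct) < 1e-9, abs(li1_gap) < 1e-12)
```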
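The logarithm satisfies the product rule
(20) $\ln(xy) = \ln x + \ln y$
which is the only identity needed to simplify logarithms.
The dilogarithm satisfies a number of identities, such as
(21) Inversion: $\mathrm{Li}_2(x) = -\mathrm{Li}_2\left(\frac{1}{x}\right) - \frac{\pi^2}{6} - \frac{1}{2}\ln^2(-x)$
(22) Reflection: $\mathrm{Li}_2(x) = -\mathrm{Li}_2(1-x) + \frac{\pi^2}{6} - \ln(x)\ln(1-x)$
(23) Duplication: $\mathrm{Li}_2(x) = -\mathrm{Li}_2(-x) + \frac{1}{2}\mathrm{Li}_2(x^2)$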
These identities can be derived by taking derivatives, using the logarithm product rule and then integrating. Inversion and reflection form a closed group (footnote 1: this is often called the Coxeter group, with the presentation $\langle R, I \mid R^2 = I^2 = (IR)^3 = 1 \rangle$, where $R$ and $I$ stand for reflection and inversion), in that applying them iteratively generates only 6 different arguments for the dilogarithm
(24) $x, \quad 1-x, \quad \frac{1}{x}, \quad \frac{1}{1-x}, \quad 1-\frac{1}{x}, \quad \frac{x}{x-1}$
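The closure under inversion and reflection can be verified mechanically. In this minimal sketch each argument is a Möbius map $(ax+b)/(cx+d)$ stored as an integer 4-tuple `(a, b, c, d)`, normalized up to an overall factor. The representation and the function names are assumptions made for illustration, not the authors' implementation.

```python
from math import gcd

def normalize(m):
    """Canonical form of (a x + b)/(c x + d): remove common factor, fix overall sign."""
    a, b, c, d = m
    g = gcd(gcd(a, b), gcd(c, d))
    a, b, c, d = a // g, b // g, c // g, d // g
    lead = next(v for v in (a, b, c, d) if v != 0)
    return (a, b, c, d) if lead > 0 else (-a, -b, -c, -d)

def reflection(m):
    """f -> 1 - f."""
    a, b, c, d = m
    return normalize((c - a, d - b, c, d))

def inversion(m):
    """f -> 1 / f."""
    a, b, c, d = m
    return normalize((c, d, a, b))

def orbit(start=(1, 0, 0, 1)):
    """All arguments reachable from `start` (default: f(x) = x) by the two maps."""
    seen = {normalize(start)}
    stack = [normalize(start)]
    while stack:
        m = stack.pop()
        for move in (reflection, inversion):
            image = move(m)
            if image not in seen:
                seen.add(image)
                stack.append(image)
    return seen

arguments = orbit()
print(len(arguments))  # six distinct arguments
```

The six tuples correspond to $x$, $1-x$, $1/x$, $1/(1-x)$, $(x-1)/x$ and $x/(x-1)$.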
Thus without something more complicated like duplication, the simplification of an expression involving dilogarithms would be a finite problem. The duplication identity is part of a broader set of identities
(25) 
For the dilogarithm, the five term identity:
(26) 
is a kind of master identity [wojtkowiak1996functional, zagierdilog] from which other identities can be derived by judicious choices of . For instance taking gives
(27) 
which generates the duplication identity after using the reflection and inversion on the second and third dilogarithms. In fact all dilogarithm identities involving a single variable are expected to be recovered from the 5 term identity in this way [dilogkirilov].
Higherweight polylogarithms in satisfy similar identities, such as
(28) 
and so on. It should be noted however that fewer functional identities are known for polylogarithms of higher weight.
2.2 Multiple polylogarithms and the symbol map
The multiple, or Goncharov, polylogarithms [goncharov1mpl] generalize the classical polylogarithms and are defined iteratively as
(29) 
starting from
(30) 
and with
(31) 
We can write these functions in an alternative notation as
(32) 
which follows the definition in Eq. (11). The multiple polylogarithms form a Hopf algebra, which is an algebra with a product and a coproduct [duhrhopf]; this algebra also has a shuffle product [ree1958lie]. The details of the algebraic structure are not of interest to us at the moment. All we need to know is that the maximum iteration of the coproduct, called the symbol [polylogsmhv6l] and denoted $\mathcal{S}$, extracts the logarithmic differentials from an iterated integral
(33)
so that
(34) 
with the special cases
(35) 
and
(36) 
The symbol acting on a complex number vanishes, e.g. . So a function cannot always be exactly reconstructed from its symbol. One can, however, reconstruct the leading-transcendentality part of a function (footnote 2: Transcendentality is defined so that and have transcendentality . The transcendentality of a product of two functions is the sum of the functions’ transcendentality. E.g. has transcendentality .). So and have the same symbol.
The symbol is powerful because it is additive
(37) 
and satisfies the product rule
(38) 
Similar to how , at the level of the symbol, any multiplicative constant can also be discarded:
(39) 
This behavior is consistent with the symbol only capturing the highest transcendental weight portion of a given function.
As an example of the use of a symbol, consider the reflection dilogarithm identity. Taking the symbol we find
(40) 
where the product rule was used in the last step.
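The additivity and product rule make such checks easy to automate. The sketch below represents a weight-two symbol as a dictionary from ordered letter pairs to rational coefficients, and each tensor entry as a product of letters with integer exponents, with multiplicative constants dropped. It assumes the standard conventions $\mathcal{S}[\mathrm{Li}_2(x)] = -(1-x)\otimes x$ and $\mathcal{S}[\ln a \ln b] = a\otimes b + b\otimes a$. The helper names `tensor` and `add` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def tensor(coeff, left, right):
    """coeff * (left ⊗ right), expanded with the product rule.

    `left` and `right` map letters to integer exponents, e.g. {"1-x": 1, "1+x": 1}
    stands for (1 - x)(1 + x). Constant factors such as -1 are simply omitted.
    """
    out = defaultdict(Fraction)
    for l, p in left.items():
        for r, q in right.items():
            out[(l, r)] += Fraction(coeff) * p * q
    return out

def add(*symbols):
    """Sum of symbols, with vanishing terms removed."""
    total = defaultdict(Fraction)
    for symbol in symbols:
        for key, value in symbol.items():
            total[key] += value
    return {k: v for k, v in total.items() if v != 0}

X, ONE_MINUS_X, ONE_PLUS_X = "x", "1-x", "1+x"

sym_li2_x = tensor(-1, {ONE_MINUS_X: 1}, {X: 1})        # S[Li2(x)]
sym_li2_1mx = tensor(-1, {X: 1}, {ONE_MINUS_X: 1})      # S[Li2(1 - x)]
sym_li2_mx = tensor(-1, {ONE_PLUS_X: 1}, {X: 1})        # S[Li2(-x)], sign of -x dropped
sym_logs = add(tensor(1, {X: 1}, {ONE_MINUS_X: 1}),     # S[ln(x) ln(1 - x)]
               tensor(1, {ONE_MINUS_X: 1}, {X: 1}))

# Reflection at symbol level: S[Li2(x)] + S[Li2(1-x)] + S[ln(x) ln(1-x)] = 0
reflection_check = add(sym_li2_x, sym_li2_1mx, sym_logs)

# Duplication at symbol level: S[Li2(x)] + S[Li2(-x)] = (1/2) S[Li2(x^2)]
duplication_lhs = add(sym_li2_x, sym_li2_mx)
duplication_rhs = add(tensor(Fraction(-1, 2), {ONE_MINUS_X: 1, ONE_PLUS_X: 1}, {X: 2}))

print(reflection_check == {}, duplication_lhs == duplication_rhs)
```

The constant $\pi^2/6$ in the reflection identity does not appear here, consistent with the symbol only capturing the highest-weight part.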
For transcendentalweighttwo functions (dilogarithms), the symbol has two entries. It can be helpful to decompose such symbols into symmetric and antisymmetric parts. The symmetric part can always be integrated into products of logarithms
(41) 
The antisymmetric part is harder to integrate. It can be convenient to express dilogarithms in terms of the Rogers function
(42) 
whose symbol is antisymmetric
(43) 
So the antisymmetric part of a symbol can be integrated if one can massage it into a difference of terms of the form . For higherweight functions, one can antisymmetrize the first two entries of their symbols, but there is not a canonical way to write or integrate these symbols.
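The decomposition of a weight-two symbol into symmetric and antisymmetric parts is a small bookkeeping exercise. The following sketch uses the same dictionary representation of a symbol (ordered letter pairs mapped to coefficients) and assumes the standard convention $\mathcal{S}[\mathrm{Li}_2(x)] = -(1-x)\otimes x$. The function name `split_symbol` is illustrative.

```python
from fractions import Fraction

def split_symbol(symbol):
    """Return (symmetric, antisymmetric) parts of a weight-two symbol.

    `symbol` maps ordered pairs (a, b), meaning a ⊗ b, to coefficients.
    """
    keys = set(symbol) | {(b, a) for (a, b) in symbol}
    symmetric, antisymmetric = {}, {}
    for a, b in keys:
        s_ab = symbol.get((a, b), 0)
        s_ba = symbol.get((b, a), 0)
        sym_part = Fraction(s_ab + s_ba, 2)
        anti_part = Fraction(s_ab - s_ba, 2)
        if sym_part:
            symmetric[(a, b)] = sym_part
        if anti_part:
            antisymmetric[(a, b)] = anti_part
    return symmetric, antisymmetric

# S[Li2(x)] = -(1 - x) ⊗ x
sym_li2 = {("1-x", "x"): -1}
symmetric, antisymmetric = split_symbol(sym_li2)
print(symmetric)
print(antisymmetric)
```

In this example the symmetric part equals $-\tfrac{1}{2}\,\mathcal{S}[\ln(x)\ln(1-x)]$, so only the antisymmetric part carries genuine dilogarithm content.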
For a neat application, define so that and hence
(44) 
Now consider the combination [dilogkirilov].
(45) 
The symbol of is the antisymmetrized version of
(46)  
(47)  
(48) 
Using Eq. (44), this simplifies to . We conclude that the combination in Eq. (45) is free of dilogs. Indeed, as computed by Coxeter in 1935 using a geometric argument [coxeter1935], and in [dilogkirilov] using the 5-term dilogarithm identity. Using the symbol vastly simplifies the derivation of Coxeter’s formula.
A powerful application of the symbol is to simplify polylogarithmic expressions through algebraic manipulations. However, after simplifying the symbol, it is not always easy to integrate it back into a function. As an example, a Feynman integral might lead us to consider
(49) 
Doing the integral in Mathematica with and gives
(50) 
The symbol of this function is easiest to read directly from the integral form in Eq. (49)
(51)  
(52)  
(53) 
The second line required massaging the symbol so that the symbol of was manifest and then recognizing the remainder as the symbol of a simpler function. The result is that
(54) 
up to lower transcendentalweight functions (there are none in this case).
As this example indicates, it is often possible to simplify a function using the symbol, but such a simplification requires some intuition. Simplicity and intuition are not easy to encode in classical algorithms; however, they are traits that machine learning has been known to harness. In Section 5 we will see a concrete application by deploying transformer networks to integrate the symbol.
3 Classical algorithms
Before introducing Machine Learning (ML) algorithms, we briefly summarize how one could simplify polylogarithms using standard algorithmic approaches. This will serve as a good benchmark for any ML implementation. We start by discussing algorithms for reducing polylogarithm expressions using identities. If we restrict the identities needed for a specific problem to a finite set (such as inversion, reflection and duplication), then it is not difficult to write a decent algorithm. We will see however that the main disadvantage of such methods lies in their computational cost scaling.
3.1 Exploring the simplification tree
Breadth-first search. Starting from a given expression, such as , one approach to find a simplified form is a breadth-first search. We construct a simplification tree by iteratively applying all possible identities to all distinct dilogarithm terms. Each new expression obtained in this manner is a node of the simplification tree. In this way we can exhaustively explore the set of all possible expressions reached by our identities up to a given depth. The depth of a node in this tree corresponds to the number of moves separating it from our starting point, the root of the tree. If the number of terms in a given expression always remained 2, the scaling for the number of nodes would go as at depth . The scaling is made worse by the presence of the duplication identity, which can increase the number of distinct dilogarithm terms. Clearly this kind of exhaustive search is extremely inefficient. Nonetheless, provided we have enough computational resources we can fully explore the tree, and we are assured that any solution we find will be the one requiring the fewest identities to reach. In practical applications where we can have up to dilogarithm terms [Bonciani2011] this is not reasonable.
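A minimal sketch of such an exhaustive breadth-first search is given below, under simplifying assumptions: only reflection and inversion are available, logarithms and constants are dropped so that each identity sends $c\,\mathrm{Li}_2(f)$ to $-c\,\mathrm{Li}_2(f')$, and arguments are Möbius maps $(ax+b)/(cx+d)$ stored as integer 4-tuples. All function names are illustrative and not taken from the paper's code.

```python
from collections import deque
from math import gcd

def normalize(m):
    """Canonical form of (a x + b)/(c x + d): remove common factor, fix overall sign."""
    a, b, c, d = m
    g = gcd(gcd(a, b), gcd(c, d))
    a, b, c, d = a // g, b // g, c // g, d // g
    lead = next(v for v in (a, b, c, d) if v != 0)
    return (a, b, c, d) if lead > 0 else (-a, -b, -c, -d)

def reflection(m):          # f -> 1 - f
    a, b, c, d = m
    return normalize((c - a, d - b, c, d))

def inversion(m):           # f -> 1 / f
    a, b, c, d = m
    return normalize((c, d, a, b))

def apply_identity(state, arg, move):
    """Rewrite the term c*Li2(arg) as -c*Li2(move(arg)) and merge like terms."""
    terms = dict(state)
    coeff = terms.pop(arg)
    new_arg = move(arg)
    terms[new_arg] = terms.get(new_arg, 0) - coeff
    if terms[new_arg] == 0:
        del terms[new_arg]
    return tuple(sorted(terms.items()))

def shortest_simplification(expression, max_depth=6):
    """Smallest number of identities reducing `expression` to 0, or None."""
    start = tuple(sorted(expression.items()))
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        state, depth = queue.popleft()
        if not state:               # no dilogarithm terms left
            return depth
        if depth == max_depth:
            continue
        for arg, _ in state:        # every term ...
            for move in (reflection, inversion):   # ... and every identity
                nxt = apply_identity(state, arg, move)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return None

X = (1, 0, 0, 1)              # x
X_OVER_XM1 = (1, 0, 1, -1)    # x / (x - 1)
depth = shortest_simplification({X: 1, X_OVER_XM1: 1})
print(depth)
```

For $\mathrm{Li}_2(x) + \mathrm{Li}_2(x/(x-1))$ the search terminates after three identities, for instance inversion, reflection and inversion applied to the second term. Adding duplication would require a richer representation of the arguments and makes the tree grow much faster.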
Modified best-first search. Another way of searching the simplification tree is by following a best-first search. At each given iteration we once again apply all identities to all distinct dilogarithm terms. This time, however, we choose only one node as our next starting point, the one corresponding to the simplest expression. We define this “best” node as the one with the fewest dilogarithms. In the event of a tie (for instance when no obvious simplifications are achieved by using a single identity) we retain the node that corresponds to applying a reflection identity on the first dilogarithm term. After doing so, we apply a cyclic permutation to the dilogarithm terms of the expression, preventing the algorithm from getting stuck in a loop. This algorithm is expected to perform well if only reflection and inversion are considered. Indeed, these identities satisfy
(55) 
generating a closed group of order 6 as in Eq. (24). Allowing the algorithm to take a reflection identity on each term sequentially guarantees that any two matching terms will be identified by our best first search procedure, using one inversion at most. For instance we could have the following simplification chain
(56) 
The number of evaluations needed at each depth of the search tree is given by . This scaling can also be prohibitive for computations of physical interest. Guiding the search tree would provide a crucial improvement and we will discuss in Section 4 how Reinforcement Learning techniques could be of some use.
3.2 Symbol Approach
We next consider a classical algorithm to construct a polylogarithmic expression given its symbol, i.e. to “integrate the symbol”, to benchmark our ML implementation. As mentioned in Section 2.2, the symbol assists in the simplification of polylogarithmic expressions by reducing the set of identities to the product rule, so simplification can be done with standard polynomial-factorization algorithms. For example, the Mathematica package PolyLogTools [polylogtools] can automate the computation of the symbol of a given polylogarithmic expression and its simplification. The trade-off is that it is not so easy to integrate the simplified symbol back into polylogarithmic form.
In general, symbols can be extraordinarily complicated. To have a well-defined problem and to guarantee physical relevance, we consider as an example the symbol of the various transcendentality-two terms in the Compton scattering total cross section at NLO [comptoncrosssection]. The original function space is composed of 12 transcendental weight-two functions , which each contain complex numbers and irrational numbers like and . The corresponding symbols after mild simplification are:
(57) 
Note that these symbols are purely real. Also note that the polynomials appearing in the rational function arguments have degree at most 4 and integer coefficients no larger than 5. Thus it will not be unreasonable to consider the simpler problem of simplifying expressions with restrictions on the set of rational functions appearing as arguments of the polylogarithms.
The general strategy for integration is as follows:

1. Integrate symmetric terms that give rise to logarithms:
(58)

2. Integrate terms with uniform powers into polylogarithms. The idea is to find two constants and such that
(59)
If , solving this equation gives , , which guarantees the integrability of the uniform power terms. For terms like , we can apply the product rule: and integrate it similarly.

3. Search for terms that can be combined to . For example, we search for terms like
(60)
which can be integrated following step 2. For the remaining symbol, we can try feeding terms like . Explicitly,
(61)
and both terms can be integrated directly.
Under this algorithm, we integrate all basis symbols successfully except , and all expressions are free of square roots and complex numbers. For example we find
(62) 
The remaining basis requires introducing another variable , and our algorithm gives
(63) 
In conclusion, while it is often possible to integrate the symbol with a classical algorithm, one must proceed on a case-by-case basis. That is, there often seem to be nearly as many special cases as expressions, even at weight two. As we will see in Section 5, transformer networks seem to be able to resolve many challenging weight-two expressions in a more generic way.
4 Reinforcement Learning
As detailed in previous sections our goal is to simplify expressions involving polylogarithms. Sometimes it is possible to simplify polylogarithmic expressions by hand, by choosing both an identity and the term to apply it on. The difficulty in this procedure lies in figuring out which choice of identity will ultimately lead us to a simpler expression, maybe many manipulations down the line. In practice this often relies on educated guesswork, requiring experience and intuition. Such a problem seems particularly well adapted for Reinforcement Learning (RL). RL is a branch of machine learning where an agent interacts with an environment. Based on a given state of the environment the agent will decide to take an action in order to maximize its internal reward function. Our goal is to explore whether RL can be useful in reducing linear combinations of polylogarithms.
We will consider a simplified version of this general problem by restricting ourselves to dilogarithmic expressions which can be reduced using the actions of inversion, reflection and duplication shown in Eqs. (21), (22) and (23). Since simplifying single logarithms and constants is not difficult, we drop these terms from our calculus (since we can keep track of each action taken we are able to add back the discarded constants and logarithms at the end). So all of our simplifications will be made at the level of dilogarithms only.
It may seem that we have oversimplified the problem to the point where it is no longer of interest in physical applications, like computing Feynman integrals. However, if one cannot solve this simple problem, then there is no hope of solving the more complex one of physical interest. We view our toy dilogarithm problem analogously to the problem of integrating relatively simple functions of a single variable in [fbtransformer] or the factorization of polynomials like tackled in [DBLP:journals/corr/abs190401557].
4.1 Description of the RL agent
We have motivated the following problem:
Problem statement. What is the simplest form of a linear combination of dilogarithms with rational functions as arguments? We consider only forms that can be simplified using the inversion, reflection and duplication identities. By using these identities to generate more complicated forms from simple ones, we can guarantee that all the examples in our data set simplify.
State space. The state space corresponds to the set of linear combinations of dilogarithms whose arguments are rational functions over the integers. This state space is preserved under inversion, reflection and duplication. We encode the state space so that the agent is confronted with numerical data as opposed to symbolic data. For this purpose we first translate any given equation into prefix notation, taking advantage of the fact that any mathematical equation can be represented by a tree-like structure. For instance the mathematical expression $2\,\mathrm{Li}_2(x) + \mathrm{Li}_2(1-x)$ is parsed as [‘add’, ‘mul’, ‘+’, ‘2’, ‘polylog’, ‘+’, ‘2’, ‘x’, ‘polylog’, ‘+’, ‘2’, ‘add’, ‘+’, ‘1’, ‘mul’, ‘-’, ‘1’, ‘x’] which is represented by the tree of Fig. 2.
For the constants we use a numeral decomposition scheme, where integers are explicitly written as . This implies adding the extra word ‘10’ to our vocabulary compared to [fbtransformer], but typically leads to a better performance, following the observations made in [graphmr]. In total we have words.
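The prefix encoding can be illustrated with a short sketch that walks an expression tree depth-first. The nested-tuple tree format and the function names are assumptions made for this example. Integers are restricted to a single digit, encoded as a sign token followed by a digit token, so the numeral decomposition of larger integers is not reproduced here.

```python
def int_tokens(n):
    """Encode a single-digit integer as [sign, digit]; larger integers are out of scope."""
    if abs(n) > 9:
        raise ValueError("this sketch only handles single-digit integers")
    return ["+" if n >= 0 else "-", str(abs(n))]

def to_prefix(node):
    """Flatten an expression tree into a list of prefix-notation tokens.

    A node is an int, a variable name (str), or a tuple (operator, child, child, ...).
    """
    if isinstance(node, int):
        return int_tokens(node)
    if isinstance(node, str):
        return [node]
    operator, *children = node
    tokens = [operator]
    for child in children:
        tokens += to_prefix(child)
    return tokens

# 2 * Li2(x) + Li2(1 - x), with Li_n(z) written as ('polylog', n, z)
expression = (
    "add",
    ("mul", 2, ("polylog", 2, "x")),
    ("polylog", 2, ("add", 1, ("mul", -1, "x"))),
)
tokens = to_prefix(expression)
print(tokens)
```

The resulting token list matches the prefix expression quoted in the text for the tree of Fig. 2.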
Word embedding. Our prefix expression is passed through a word embedding layer for which different choices are considered. Since we have a small number of words (‘2’, ‘polylog’, ‘+’, etc.) that are required to parse our mathematical expression, we experimented with either label encoding, one-hot encoding [Pargent2022] or a dedicated embedding layer at the word level. We found that one-hot encoding provided the most stable results and it will be our encoding of choice. The mapping between the words of the dictionary and the target vector space is given by . Each word will correspond to a distinct unit vector in this space.
Sentence embedding. We also need to specify an encoding scheme for the sentence as a whole. The simplest choice is to consider a finite length for our input, and to pad the observation vector. The padding value is a vector of 0’s following our one-hot encoding implementation. The resulting one-hot encoded equation has dimensions . Any equation with a prefix notation expression longer than will be deemed invalid and will be discarded.
We also consider Graph Neural Networks (GNN) for encoding the prefix expression (sentence). The graph is given by the tree structure of the mathematical expression, where the nodes take as value the one-hot encoded word. Our architecture is implemented using PyTorch Geometric [torchgeometric] and consists of multiple message passing layers using a mean aggregation scheme along with a final global pooling layer across the nodes. In that way the entire mathematical expression can be represented as a vector of size , where is the embedding dimension used by the message passing layers.
Action space. Each identity (reflection, inversion or duplication) corresponds to one of the actions that the agent can take. We also need to specify the dilogarithm term that we are acting on. For this purpose we take the convention that an identity is always applied on the first term of the linear combination. To make sure that the agent has access to all terms we implement a fourth action: the possibility of doing a cyclic permutation through the dilogarithm terms in the expression. The set of actions is illustrated in the following example:
(64) 
After taking an action we process the expressions to combine relevant terms whenever possible. Following this convention, in our example the result of the reflection identity is directly . Simplifying the expressions after each action allows us to keep a small action space. Since the first dilogarithm term has now more importance, being the term that we are acting on, for the GNN embedding we extend the state vector by adding to it the embedding corresponding to the first dilogarithm term only.
Reward function Our goal is to simplify linear combinations of dilogarithms. For that purpose a natural reward function is one that penalizes long sequences. In particular at each time step we could consider imposing a penalty based on the expression’s length:
(65) 
where is the length of the sequence in prefix notation, the number of distinct dilogarithm terms and a set of parameters. We observed that such a reward function is very sensitive to the choices of and . It also leads to the duplication action being mostly ruled out by the agent, as any “wrong” application leads to an immediate increase in the length of the expression.
We found a better reward function to be
(66) 
Choosing this reward function allows our agent to explore a larger portion of the state space since no penalty is imposed on the length of the sequence.
Adopting the second reward scheme, one could be concerned with the possibility of having cyclic trajectories not being penalized by the agent. For instance, successive applications of the reflection action lead to recurrences
(67) 
To dissuade the agent from following such behavior we can add an extra penalty term
(68) 
where is the action taken at the time step and the corresponding number of dilogarithm terms. This penalty term checks whether we are repeating actions, without shortening the expression length. Adding this constraint should guide us away from cyclic trajectories.
Since RL is used to solve Markov Decision Processes, we must ensure that our environment satisfies the Markov property, namely that any future state reached depends only on the current state and on the action taken. To respect this property and get accurate expected rewards we extend our state vector (after it has passed through the sentence embedding layer) by adding to it . This additional information can now be used to estimate the full reward .
RL Agent. For the RL agent we tried Proximal Policy Optimization (PPO) [ppo] and Trust Region Policy Optimization (TRPO) [trpo], both implemented within stable-baselines3 [stablebaselines3]. These agents are on-policy algorithms, where the exploration phase is governed by using actions sampled from the latest available version of the policy. The trajectories collected (actions taken, states, rewards) during the evolution of the agent are used to update the policy by maximizing a surrogate advantage (footnote 3: We note that stable-baselines3 relies on Generalized Advantage Estimation (GAE) [gae] when computing the policy gradient. In practice this is used to provide a trade-off between using real rewards (small bias/high variance) and estimations from the value network (high bias/low variance).). We refer the reader to the aforementioned references for an in-depth explanation.
In our experiments we observed that TRPO gave the more consistent and stable training. The particularity of the TRPO algorithm lies in that the policy updates are constrained, making the learning process more stable. By estimating the KL-divergence between the old and the new policy one takes the largest step in parameter space that ensures both policies’ action distributions remain close. The TRPO algorithm can also guarantee a monotonic improvement of the policy during updates, with increasing expected returns, although at the cost of longer training times.
Network Architecture. The TRPO agent has a policy and a value function , both parameterized by neural networks, for which we design a custom architecture. The input for both networks will be an observation , which is the output of the sentence embedding layer, where we add an extra flattening layer if required. The final output is a scalar for the value net and a vector of dimensions for the policy net. We opt for one shared layer of size 256 between the policy and value networks, followed by three independent layers of size 128, 128 and 64, where all the activation functions are taken to be ReLU.
For the graph neural network used in the sentence embedding layer we refer to the GraphSAGE message passing architecture from [graphsage] along with a mean aggregation scheme. Contrary to GraphSAGE, we do not use a neighbor sampler over the nodes here and have no LSTM aggregator. Our model is composed of 2 message passing layers and an embedding dimension of 64. Other architectures were considered, such as Graph Convolution Networks from [gcn] or Graph Attention Networks from [gat], but the simple linear message passing ended up giving the best performance. The complete architecture, including a GNN for the sentence embedding layer, is illustrated in Fig. 3 with two message passing layers.
Episode An episode is defined by letting the agent attempt to simplify a given expression. At each episode reset we sample a new expression to simplify, drawing from a previously generated testing set. During an episode the agent runs for steps at most, where each step corresponds to taking one action. Since we may not know how simple the final equation can be, we terminate the episode early only if the mathematical expression is reduced to 0. We will use for training, although we also experimented with lengthier episodes. For simple examples increasing does not necessarily lead to a better performance as taking more actions in the wrong direction will typically increase the length of the expression drastically.
4.2 Training the RL agent
Our implementation is done in Python using standard libraries such as SymPy [sympy], Gym [gym] and PyTorch [pytorch]. If no graph neural networks are used, the architecture is simple enough that the RL agents can be trained on a standard CPU. The environment corresponding to our problem is made to match the requirements of a Gym environment, allowing us to use the Stable-Baselines3 [stablebaselines3] RL library.
4.2.1 Solving for different starting points
The objective of our RL algorithm is to reduce various linear combinations of dilogarithms. To make the training easier to monitor, we consider the simpler subproblem of simplifying expressions that are reducible to 0 after a suitable combination of identities. We create a generator for such expressions, making sure that our RL agent encounters a variety of different environments during training.
Generating starting points We generate our starting points using a random function generator and a “scrambling” procedure. By a scramble we refer here to the action of applying an identity to a given dilogarithm term in order to make it appear more complicated. The steps for creating a training expression are as follows:

We sample the number of distinct dilogarithm terms that we want to consider.

For each term we sample a single-variable rational function . We limit the degree of each to be at most two, since the complexity of the expressions can increase drastically with the usage of the duplication identity. The coefficients of are also constrained between , leading to around 7,000 unique rational functions.

For each dilogarithm term we sample an integer constant between . At this point we can create a combination of dilogarithms that is explicitly 0 as
(69) 
We sample the total number of scrambles to be applied , with .

We choose how to distribute the scrambles/identities amongst each term. We ask that every zero added, indexed by , is acted upon by at least one identity.

We apply the identities on , making sure that no identity is applied twice in a row on the same term (or we would get back to the starting point).

We discard any expression which, when converted to prefix notation, is larger than . This ensures that every example can fit inside the network. (Footnote 4: In practice we observe that above or some expressions start to reach 500 words in length. For we limit to prevent this and keep simpler expressions.) In our architecture we take .
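The scrambling of a single added zero can be sketched as follows. The sketch assumes the inversion, reflection and duplication identities in their standard form, taken modulo logarithms and constants: Li2(x) -> -Li2(1/x), Li2(x) -> -Li2(1-x) and Li2(x) -> -Li2(-x) + (1/2) Li2(x^2). Terms are stored as (coefficient, argument string) pairs instead of SymPy objects, and the names `scramble_zero` and `IDENTITIES` are hypothetical. Sampling the rational functions, distributing the scrambles among several terms and the length-based rejection step are left out for brevity.

```python
import random
from fractions import Fraction

# Each identity maps one term c*Li2(arg) to a list of terms,
# modulo logarithms and constants.
def reflection(c, arg):
    return [(-c, f"1-({arg})")]

def inversion(c, arg):
    return [(-c, f"1/({arg})")]

def duplication(c, arg):
    return [(-c, f"-({arg})"), (c / 2, f"({arg})^2")]

IDENTITIES = {"reflection": reflection,
              "inversion": inversion,
              "duplication": duplication}

def scramble_zero(arg, coeff, n_scrambles, rng):
    """Scramble the second term of coeff*[Li2(arg) - Li2(arg)]."""
    kept = [(Fraction(coeff), arg)]
    scrambled = [(-Fraction(coeff), arg)]
    last = None
    for _ in range(n_scrambles):
        # Never repeat the previous identity: applying reflection or
        # inversion twice in a row would return to the starting point.
        name = rng.choice([n for n in IDENTITIES if n != last])
        c, a = scrambled.pop(0)  # act on the leading scrambled term
        scrambled = IDENTITIES[name](c, a) + scrambled
        last = name
    return kept + scrambled

rng = random.Random(0)
terms = scramble_zero("x", 2, 3, rng)
for c, a in terms:
    print(f"{c} * Li2({a})")
```

Keeping the first copy of each term untouched and scrambling only the second guarantees that the full expression still reduces to 0 once the scrambles are undone.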
There are around possible distinct expressions that can be produced this way.
Training set Equipped with our generator we can create sets of starting points for the environment. We tested two different approaches: at each reset (each new episode) we can either create a brand new example or draw one from a previously generated set. We observed that both procedures gave similar results and find it easier to present the second case only. We generate an initial training set of 13,500 expressions, where each equation is annotated with the number of scrambles used to generate it. At each episode reset a new expression will be sampled from that set and used as a new starting point. To give a few concrete examples, some of the equations considered are:
which can all be simplified to 0 (up to logarithms). In addition to the training set, we also generate a test set of 1,500 expressions, independent of the training set, which the agent will not see at all during training.
4.2.2 Details of the training runs
Hyperparameter tuning We performed a non-exhaustive search for the tuning of our RL agent hyperparameters, starting with the defaults of [stablebaselines3]. During the tuning the agents were set to run for time steps and the hyperparameters were picked in order to both maximize the expected rewards and minimize the mean episode length. The parameters that we ended up using are shown in Table 1. A finer optimization of the various parameters could be implemented but is not the main focus of this work. For the Graph Neural Network we used the default parameters of [torchgeometric].

| Hyperparameter | TRPO Agent | PPO Agent |
|---|---|---|
| | 2048/4096 (footnote 5) | 4096 |
| Batch size | 128 | 128 |
| Learning rate | 10 | 10 |
| GAE | 0.9 | 0.9 |
| Discount factor | 0.9 | 0.9 |
| Target KL | 0.01 | N/A |
| Max steps in CG algorithm | 15 | N/A |
| Number of critic updates | 10 | N/A |

Footnote 5: We use 2048 steps when the sentence embedding layer is a GNN and 4096 otherwise.

Table 1: The Kullback–Leibler divergence [kl] constraint and the Conjugate Gradient algorithm [Hestenes1952MethodsOC] are both used within TRPO [trpo].

Experiments We run two sets of experiments, for the two choices of sentence embedding, Graph Neural Network or one-hot encoding, as described in Section 4.1. In both cases we will compare the performance of the agent with and without the cyclic penalty of Eq. (68) added to the reward function, for which we take . For all experiments we let the agents run for 3M steps each, keeping a maximum episode length of .
Monitoring the training To get a simple measure of whether the agent is training correctly we focus on two different metrics: the average rewards and the mean episode lengths. Since our equations can be simplified down to 0, at which point an episode is terminated early, we do expect to see the mean episode length decreasing with the training steps. To present readable plots we use a moving average with a window size of 20 policy iterations to smooth the learning curves.
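A minimal sketch of the smoothing step is given below. It assumes a trailing window, with the first few entries averaged over however many points are available; whether a trailing or a centered window was used for the plots is not specified, and the function name `moving_average` is illustrative.

```python
def moving_average(values, window=20):
    """Trailing moving average over the last `window` entries."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the entry leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

# Example: smooth a short, noisy list of mean episode lengths.
episode_lengths = [50, 48, 50, 41, 44, 35, 38, 30]
print(moving_average(episode_lengths, window=4))
```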
The results of the training runs are shown in Fig. 4. We observe that both sentence embedding schemes follow a similar learning curve; in both cases the episode length does decrease and reach comparable values. We also observe that the training involving the GNN is less stable, with the mean rewards fluctuating much more than in the one-hot encoding scheme. As expected, the addition of the cyclic penalty to the reward function is reflected in Fig. 4 by a smaller total reward, especially for the early episodes. The fact that the cyclic penalty reward curves do not quite catch up to the base reward curves is an indication that our final agent can still get stuck in cyclic trajectories. Additional reward shaping was considered to minimize this effect but did not end up being beneficial for the overall performance.
Evaluating performance To accurately reflect how well our agents learned to simplify the starting expressions we perform a dedicated test run. For this we use our independent test set of 1,500 expressions. To probe an agent’s performance we measure the number of equations that it is able to resolve, along with the number of identities that are required. We do not count the cyclic permutation action as using an identity here, since it does not correspond to an actual mathematical operation on the dilogarithms. Our goal is to quantify how much of the “simplification tree” we have to explore compared to classical algorithms. Since all of the examples in the test set are tagged with the number of scrambles used to generate them, we can study the performance as a function of the complexity of the starting point.
In order to reduce the expressions with our trained RL agents we try out two different approaches. Our first approach is a greedy, single-rollout one, where at each time step we perform the top action proposed by the policy network, taking
(70) 
Our second approach is a beam search, where at every step we consider the top two actions proposed by the policy network. In order to limit the number of trajectories to explore, we only ever keep the top trajectories in memory. To determine which trajectories are better, we retain the ones that are expected to give us the best return. This is measured by looking at
(71) 
summing the rewards already accumulated, the reward associated with the action and the reward to go, which we estimate with the value network. Here, is the discount factor mentioned in Table 1.
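The two evaluation strategies can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used here: the policy is assumed to return a dictionary of action probabilities, the trajectory score is taken to be the discounted rewards accumulated so far, plus the discounted reward of the candidate action, plus the discounted value estimate of the resulting state (the exact placement of the discount factors is an assumption), and the toy environment, policy and value function at the bottom are hypothetical.

```python
def greedy_rollout(state, policy, step, is_solved, max_steps):
    """Follow the single most probable action at every step."""
    for n in range(max_steps):
        if is_solved(state):
            return state, n
        probs = policy(state)  # dict: action -> probability
        action = max(probs, key=probs.get)
        state, _ = step(state, action)
    return state, max_steps

def beam_search(state, policy, value, step, is_solved, max_steps,
                beam_size=3, top_k=2, gamma=0.9):
    """Expand the top_k actions of each kept trajectory and keep the
    beam_size candidates with the highest estimated return."""
    beam = [(0.0, state)]  # (discounted reward so far, state)
    evaluations = 0
    for t in range(max_steps):
        candidates = []
        for acc, s in beam:
            probs = policy(s)
            for a in sorted(probs, key=probs.get, reverse=True)[:top_k]:
                s_new, r = step(s, a)
                evaluations += 1
                if is_solved(s_new):
                    return s_new, evaluations
                acc_new = acc + gamma ** t * r
                score = acc_new + gamma ** (t + 1) * value(s_new)
                candidates.append((score, acc_new, s_new))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [(acc, s) for _, acc, s in candidates[:beam_size]]
    return beam[0][1], evaluations

# Toy problem: the state is an integer "length", solved when it reaches 0.
def toy_step(state, action):
    new = state + 1 if action == "inc" else state - 1
    return new, float(state - new)  # assumed reward: decrease in length

def toy_policy(state):
    return {"inc": 0.6, "dec": 0.4}  # deliberately prefers the wrong action

def toy_value(state):
    return -float(state)  # shorter states are valued more

def solved(state):
    return state == 0

greedy_result = greedy_rollout(3, toy_policy, toy_step, solved, max_steps=10)
beam_result = beam_search(3, toy_policy, toy_value, toy_step, solved, max_steps=10)
```

In this toy setting the greedy rollout follows the policy's mistaken first choice and never reaches 0, whereas the beam search recovers because the value estimate ranks the second-best action higher.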
RL results We report in Table 2 the overall performance of each RL agent on the test set. We include as a benchmark the performance of a random agent and of a classical algorithm. For the classical algorithm, we use the modified best-first search algorithm, described in Section 3.1, which is left to explore the simplification tree up to a depth of 10. There is no major difference to be observed between our two embedding schemes, with the GNN slightly outperforming the simple one-hot encoding approach. Overall we do notice that the inclusion of a cyclic penalty does help the agent, prompting it to explore a more relevant part of the state space during the exploration phase.
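| Reward | Agent | Greedy: Solved (%) | Greedy: Unscrambling steps | Beam size 3: Solved (%) | Beam size 3: Unscrambling steps |
|---|---|---|---|---|---|
| No penalty | One-hot | 50 % | 5.3 | 78 % | 14.7 |
| No penalty | GNN | 56 % | 6.4 | 80 % | 15.9 |
| Cyclic penalty | One-hot | 59 % | 7.4 | 85 % | 19.2 |
| Cyclic penalty | GNN | 53 % | 8.7 | 89 % | 20.3 |
| Random | | 13 % | 8.7 | | |
| Classical | | 91 % | 39.3 | | |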
Although the agents do manage to learn in this challenging environment, with the greedy single-rollout evaluation they are only able to resolve between 50% and 59% of the test equations. On the other hand, the beam search analysis gives more accurate results (89%), indicating that the policy network is often not confident in its top prediction. It is important to correlate the performance with the number of evaluations (unscrambling steps) explored when searching for the simplified form, which gives a measure of how efficient the agent or algorithm is. In that regard the classical algorithm requires a larger exploration of the simplification tree before reaching the simplified form. That is, the classical algorithm essentially searches blindly while the RL algorithm searches in a directed manner.
We can also ask about the scaling of performance with the complexity of the input equation. To make this connection more precise we plot in Fig. 5 the performance as a function of the number of identities used (scrambling steps) to generate our expressions. Also shown is the number of identities used (unscrambling steps) when looking for the simplified form. The greedy RL agents only consider a single action per step, which is inexpensive, and so the scaling observed is highly correlated with the minimal number of steps required to solve an expression. When using a beam search of size we can take up to evaluations per step, since we are probing the two best actions for the trajectories in memory. However, even for the beam search the number of unscrambling steps explored does not scale with the expression’s length but only with the number of steps required to find the solution. This is to be contrasted with the classical algorithm, where the number of evaluations required per step scales with the number of actions and the number of distinct dilogarithm terms. For expressions that have been scrambled a lot, which tend to be lengthier due to the duplication identity, the number of evaluations required by the classical algorithm quickly becomes unmanageable. We can see a realization of this behaviour in Fig. 5, where the classical algorithm needs to perform a much broader search before finding the solution. The performance for all of the agents can be found in Appendix A.
Challenges of RL Our trained agents have learned to simplify linear combinations of dilogarithms, reaching a beam search performance of 89% compared to the 13% of the random agent. However, the environment considered was fairly straightforward, since the expressions considered were the ones reducing to 0. Training in the general case, where the goal state is not known, is expected to be much harder. Nonetheless, RL seems to offer some advantages over a simple classical algorithm, for instance being able to reduce the number of evaluations required to find the simplified expression.
One can look to analogous problems for possible improvements. For example, a classic RL problem is solving the Rubik’s cube. In [deepcube] classical algorithms were used in conjunction with policy and value networks to guide the search for the shortest path to the solution. Even though solving the Rubik’s cube and simplifying dilogarithms share similarities as search problems, the nature of the two environments is somewhat different. In our symbolic setting the representation of the environment itself must use various embedding layers, the action space does not have all actions on the same footing (for instance, the duplication identity tends to increase the length of expressions if used incorrectly), and the trajectories taken to reach the correct solution are generally heavily constrained (these trajectories are generically unique, up to a cyclic usage of actions). Training the embedding layers jointly with the policy and value networks also proved complicated, as is apparent from the fact that using a GNN did not yield significant gains over the naive encoding approach. As was done in [unknot], one could try out larger networks like the Reformer [reformer], known to provide a robust semantic embedding. However, we find that in practice it is hard to train these large architectures in conjunction with the policy and value networks. In fact, it has been observed that in typical RL settings bigger networks are not necessarily correlated with better performance [deeprl, largenet]. One strategy adopted by [largenet] to tackle this issue is to decouple the representation learning from the actual RL by pretraining the embedding layer for the state. However, such an approach is not easily applicable in our context. Indeed, whereas powerful sentence embedding schemes exist for natural language [sentembed], they do not seem adapted for representing mathematical data.
One could also consider pretraining a graph embedding model [graph2vec] on a subset of well chosen equation trees. The problem of simplifying polylogarithms with RL is a rich one and worthy of continued exploration.
5 Transformer Networks
In Section 4 we explored how reinforcement learning can be used to train agents to simplify linear combinations of dilogarithms by using various functional identities. An advantage of the RL approach is its reproducibility: the list of actions taken to arrive at the final answer is known and can be studied and checked. However, demanding reproducibility also imposes a limitation on the way the data is learned and understood. RL algorithms such as ours can suffer from sample inefficiency, as commonly seen in other applications of policy gradient methods [GuLilGhaTurLev17]: since they only train on the current version of the policy, they can be slow to train. This slowness limited our training sample to only 13,500 expressions. It is actually somewhat remarkable that the RL network can perform so well with so little diversity in the training. Partly this is due to the RL algorithm devoting 50 time steps to each expression during training.
If we are only interested in the final simplified form, not the explicit set of identities used to get there, seq2seq models are well adapted, with the prime example being the Transformer Network [transformer]. A transformer network will generate a guess for the answer, which we can verify, but not a human-interpretable sequence of steps to get there. Our approach is based on the work of [fbtransformer], where transformer networks were used to perform a symbolic integration task, showing competitive performance compared to classical algorithms used by SymPy or Mathematica.
To use the transformer we have to convert our task of simplifying equations into a translation one. We will explore two different approaches. In Section 5.1 we will inquire whether it is possible to train a transformer to simplify linear combinations of dilogarithms directly, as in the problem RL was applied to in Section 4. Then in Section 5.2 we will explore using the symbol to represent a polylogarithmic expression and then applying the transformer network to look for a simplest expression that is associated with a given symbol.
From a high level point of view the transformer takes as input a sentence, in our case a mathematical expression, encodes it, and proceeds to output the translated sentence. As with the RL approach described in Section 4, the input data in our case are mathematical expressions which we express as trees and parse into prefix notation. The particularity of the transformer lies in that both the encoding and the decoding process rely on the attention mechanism, ensuring that any given word knows about the other constituents of the sentence. In the decoding layer the attention mechanism is used both on the input sentence and on the words of the translated sentence that have already been generated. We refer to [transformer] for a detailed overview of the architecture.
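As an illustration of the tree-to-prefix conversion mentioned above, the sketch below flattens a small expression tree into a sequence of words. The tuple-based tree format and the token names (`add`, `mul`, `sub`, `polylog`) are assumptions made for this example and need not match the vocabulary used in the actual implementation.

```python
def to_prefix(tree):
    """Flatten an expression tree into a list of prefix-notation tokens.

    A tree is either a leaf (a string) or a tuple (operator, child, child, ...).
    """
    if isinstance(tree, str):
        return [tree]
    op, *children = tree
    tokens = [op]
    for child in children:
        tokens.extend(to_prefix(child))
    return tokens

# 2*Li2(x) - Li2(1 - x), written as add(mul(2, Li2(x)), mul(-1, Li2(sub(1, x))))
expr = ("add",
        ("mul", "2", ("polylog", "2", "x")),
        ("mul", "-1", ("polylog", "2", ("sub", "1", "x"))))
tokens = to_prefix(expr)
print(" ".join(tokens))
```

Since every operator has a fixed number of arguments, the prefix sequence needs no parentheses and its length in words is the quantity compared against the maximum input size.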
The particular attention mechanism we use is called “Scaled Dot-Product Attention”. We start by passing each word through an embedding layer, representing it as a vector in , with the embedding dimension. For a given word w the resulting embedding is associated with a query vector , a key vector and a value vector . Those are obtained by multiplying the word embedding with suitable weight matrices, learned during the training. For a given attention head we have , where , with the dimension of the attention head. When using multi-headed attention with heads we will take , following the implementation of [fbtransformer, transformer]. To calculate the attention for a given word we compute the dot product with going over all the words in the sentence. In practice this quantifies how similar is to the other words of the sentence. After dividing by we apply the softmax function to obtain the attention weight for the word associated with the word :
(72) 
The final output of the attention layer for the word is given by multiplying the attention weights with the value vectors
(73) 
and the new embedding learned is dependent on all of the other words in the sentence. To remain sensitive to the ordering of the words we have to add a positional encoding layer at the start of the model, following [positional_embed].
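The single-head computation described above can be sketched in a few lines of plain Python. The sketch assumes the standard form of scaled dot-product attention from [transformer]: the weight of word j for word i is the softmax over j of the dot product of query i with key j divided by the square root of the head dimension, and the output for word i is the weighted sum of the value vectors. Variable names are illustrative, and the learned projection matrices and the positional encoding are omitted.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries, keys : lists of d_k-dimensional vectors, one per word
    values        : list of value vectors, one per word
    Returns (outputs, weights), where weights[i][j] is the attention
    weight of word j for word i.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / math.sqrt(d_k) for k in keys])
        weights.append(w)
        outputs.append([sum(w[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return outputs, weights

# Two-word toy sentence with hand-picked query, key and value vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, so every output vector is a convex combination of the value vectors of all words in the sentence.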
5.1 Simplifying dilogarithms with the transformer
Having introduced the transformer network, we first deploy it to tackle a problem similar to the one we have motivated in Section 4.
Data generation Since we are using the transformer to guess the answer, we will have to consider dilogarithmic expressions which do not necessarily simplify to 0 (otherwise the prediction will be trivial and no learning will take place). In order to create relevant training samples we slightly modify the generation script of Section 4.2.1. To create a scrambled dilogarithmic expression we proceed in the following way:

We sample the number of distinct dilogarithms for the simplified expression.

We sample the number of times that we want to add zero in the form .

We create random arguments and for each dilogarithm. Each function is a rational function of degree 2 at most over the integers between and . This gives us a skeleton
(74) 
We sample the total number of scrambles (inversion, reflection or duplication) to do. Here we take up to 10 scrambles as the scrambles may be applied on either the or the terms of Eq. (74). We are able to allow for more scrambling with the transformer network than the RL network in Section 4.2.1 (which allowed for only 6-7 scrambles) because the transformer network is easier to train.

We randomly choose how to distribute the scrambles amongst each term. We ask that every zero added, indexed by , is scrambled at least once.

We apply the scrambles and discard the resulting expression if it has more than words.
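The step in which the scrambles are distributed among the terms can be sketched as follows. The function name `distribute_scrambles` and its arguments are hypothetical, and the uniform random assignment of the remaining scrambles is an assumption; the only constraint taken from the procedure above is that every added zero is scrambled at least once. The sketch assumes there is at least one term in total.

```python
import random

def distribute_scrambles(n_simple, n_zeros, n_scrambles, rng):
    """Assign scrambles to terms so that every added zero gets at least one.

    Returns a list of counts: first for the n_simple terms of the simplified
    expression, then for the n_zeros added zeros.
    """
    if n_scrambles < n_zeros:
        raise ValueError("need at least one scramble per added zero")
    counts = [0] * n_simple + [1] * n_zeros
    for _ in range(n_scrambles - n_zeros):
        # Remaining scrambles may land on any term, simple or zero.
        counts[rng.randrange(len(counts))] += 1
    return counts

rng = random.Random(1)
counts = distribute_scrambles(n_simple=2, n_zeros=3, n_scrambles=10, rng=rng)
print(counts)
```

Terms of the simplified expression may receive no scramble at all, in which case they appear unchanged in the input expression.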
Training the transformer is done in a supervised way: for each input expression we have an associated output expression which we try to recover. In our case the input expressions are the scrambled expressions, while the desired outputs will be the corresponding simple expressions of the form . Following the outlined data generation procedure we create about 2.5M distinct pairs