A complicated function can be constructed as a hierarchy of simpler functions. For instance, when a microprocessor calculates the value of a function for a given set of inputs, it computes the function by composing simpler implemented functions, e.g. logic gates. Another example is an addition function for any number of inputs, which can be obtained by composing simpler addition functions of two inputs. Even when the set of simple functions is small, as in the case of working only with logic gates, the set of functions that can be built may be exponentially large. We know from computability theory that all computable functions can be constructed in this manner [Sip06]. Therefore, one approach to understanding the set of computable functions is to investigate their potential representations as hierarchical compositions of simpler functions.
Here we study the set of functions of multiple variables that can be computed by a hierarchy of functions each accepting two inputs. Such compositions can be characterized by binary rooted trees (in the following we will refer to them as binary trees) that determine the hierarchical order in which the functions of two variables are applied. Associated with any binary tree is a (continuous or discrete) tree function space (TFS) consisting of all functions that can be obtained as a composition (superposition) based on the hierarchy that the tree provides. In Theorem 2.2 we exhibit a set of necessary and sufficient conditions for an analytic function of several variables to have a representation via a given tree. We show that this amounts to describing the corresponding TFS as the solution set of a group of non-linear partial differential equations (PDEs). We also study the same representability problem in the context of discrete functions.
Related mathematical background. Representing multivariate continuous functions in terms of functions of fewer variables has a rich background that goes back to the thirteenth problem on David Hilbert's famous list of mathematical problems for the twentieth century [H02]. Hilbert's original conjecture was about describing solutions of degree-seven equations in terms of functions of two variables. The problem has many variants based on the category of functions – e.g. algebraic, analytic, smooth or continuous – in which the question is posed. See [VH67, chap. 1] or the survey article [Vit04] for a historical account. Later, in the 1950s, the Soviet mathematicians Andrey Kolmogorov and Vladimir Arnold did a thorough study of this problem in the context of continuous functions that culminated in the seminal Kolmogorov–Arnold Representation Theorem ([Kol57]), which asserts that every continuous multivariate function can be described as a superposition of continuous single-variable functions and additions
for suitable continuous single-variable functions. (There are more refined versions of this theorem with more restrictions on the single-variable functions that appear in the representation [Lor66, chap. 11].) So in a sense, addition is the only truly multivariate function. The idea of applying Kolmogorov-like results to studying networks is not new. Based on the mathematical works of Anatoli Vituškin (see below), the article [GP89] suggests that the number of variables of a function is not a suitable indication of its complexity, due to the fact that there are highly regular functions that cannot be represented by continuously differentiable functions of a smaller number of variables [Vit64]. Nevertheless, the article [Ků91] argues that there is an approximation result of this type.
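For reference, a standard form of the Kolmogorov–Arnold representation (written here in our own indexing; conventions vary across sources) is:

```latex
f(x_1,\dots,x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \varphi_{q,p}(x_p)\right),
```

for suitable continuous single-variable functions $\Phi_q$ and $\varphi_{q,p}$.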
Although pertinent to our discussion, the reader should be aware that the representations of multivariate functions studied in this article differ in the following ways:
Motivated by both the structure of computation graphs and the model of neurons as binary trees, we consider multivariate functions that can be obtained via composition of functions of two variables instead of single-variable ones.
Unlike the summation above, we work with a single superposition of functions. In the presence of differentiability, this enables us to use the full power of the chain rule. In the case of ternary functions, for instance, a typical question (to be addressed in §5.1) would be whether a function $f(x,y,z)$ can be written as $g\big(h(x,y),z\big)$. In fact, if one allows sums of superpositions, a result of Arnold (which can be found in his collected works [Arn09b]) states that every continuous function of three variables can be written as a sum of nine superpositions of this form. But we look for a single superposition, not a sum of them.
We mostly work in the analytic context; see §5.3 for difficulties that may arise if one works with smooth functions. It must be mentioned that assuming that the continuous function has certain regularity (e.g. smooth or analytic) does not guarantee that in a representation such as (1.1) the single-variable functions can be arranged to be of the same smoothness class. In fact, it is known that there are always continuously differentiable functions of three variables which cannot be represented as sums of superpositions of continuously differentiable functions of two variables [Vit54]. (A function is called of class $C^r$ if it is differentiable of order $r$ and its order-$r$ partial derivatives are continuous; a $C^r$ function is said to be $r$ times continuously differentiable. A function which is infinitely many times differentiable is called smooth or of class $C^\infty$. The smaller class of (real) analytic functions, which are locally given by convergent power series, is denoted by $C^\omega$. We refer the reader to [Pug02] for the standard material from elementary mathematical analysis.) Because of the constraints that we put on $f$ in our main result, Theorem 2.2, it turns out that the functions of two variables that appear in tree representations are analytic as well, or even polynomial if $f$ is polynomial; see Proposition 5.9.
Applying the chain rule to superpositions of analytic (or merely smooth) bivariate functions results in the PDE constraints (2.4). It must be mentioned that the fact that the partial derivatives of functions appearing in any superposition of differentiable functions must be related to each other is by no means new. Hilbert himself employed this point of view to construct analytic functions of three variables that are not (single or sums of) superpositions of analytic functions of two variables [Arn09a, p. 28]. Ostrowski, for instance, used this idea to exhibit an analytic bivariate function that cannot be represented as a superposition of single-variable smooth functions and multivariate algebraic functions, due to the fact that it is not a solution of any non-trivial algebraic PDE [Vit04, p. 14]. But, to the best of our knowledge, a systematic characterization of superpositions of analytic bivariate functions as outlined in Theorem 2.2 (or its discrete version in Theorem 6.6), and its utilization for studying tree functions and neural networks, has not appeared in the literature before.
Neuroscience motivation. Over a century ago, the father of modern neuroscience, Santiago Ramón y Cajal, drew the distinctive shapes of neurons [yC95]. Neurons receive their inputs on their dendrites, which both exhibit non-linear behavior and have a tree structure. These trees, called morphologies, are central to neuron simulations [HC97, Rei99]. Neuronal morphologies are not just the distinctive shapes of neurons but also pertain to their functions. One approximate way of thinking about neural function is that neurons receive inputs and, by passing them from the dendritic periphery towards the root, the soma, implement the computation which gives a neuron its input-output function. In that view, the question of what a neuron with a given dendritic tree and inputs may compute boils down to the question of characterizing its TFS.
2. Outline and overview of main results
The order of appearance of functions in a superposition can be represented by a tree whose leaves, nodes (branch points) and root represent inputs, functions occurring in the superposition and the output respectively. Here we assume that the tree and the set of functions that could be applied at each node are given and each leaf is labeled by a variable. We can now define the space of functions generated through superposition, i.e. the corresponding tree function space (TFS) (see Definition 4.1). The most tangible case of a TFS is when all of the inputs are real numbers and the functions assigned to the nodes are bivariate real-valued functions. Nonetheless, our definition in §4 covers other cases: an arbitrary tree and sets of functions associated with its nodes result in the set of functions represented by superpositions. One example is when the functions at the nodes are bit-valued functions. Another example is when the inputs are time-dependent and the functions at nodes are operators. The latter case is important since it contains the function that a neural morphology would implement when we only allow soma-directed influences and ignore back-propagating action potentials [SSSH97].
The smallest non-trivial tree representing a superposition is the one with three leaves illustrated in Figure 1. Denoting the inputs by $x$, $y$ and $z$, an element of the corresponding TFS is the superposition below of two functions $g$ and $h$ of two variables:
$$f(x,y,z) = g\big(h(x,y),\,z\big). \tag{2.1}$$
For $g$ and $h$ both equal to multiplication or to addition, we end up with the two basic examples $xyz$ and $x+y+z$. By changing $g$ and $h$ one can construct other examples, and hence the question arises of which functions can be obtained in this manner.
To find a necessary condition for a function $f$ of three variables to have a representation such as (2.1), we assume differentiability and differentiate. A straightforward application of the chain rule to (2.1) shows that $f$ must satisfy
$$f_x\,f_{yz} = f_y\,f_{xz}. \tag{2.2}$$
A detailed treatment may be found in the discussion at the beginning of §5.1. This partial differential equation for $f$ puts a constraint on functions in the TFS and hence rules out certain ternary functions, such as $xy + yz + zx$.
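As a quick sanity check (a minimal sketch with hand-computed partial derivatives; the function names are ours), one can verify that $xy+z$ satisfies the constraint $f_y f_{xz} - f_x f_{yz} = 0$ at every point, while the symmetric polynomial $xy+yz+zx$ violates it at generic points:

```python
# Constraint (2.2): E(f) := f_y * f_xz - f_x * f_yz must vanish identically
# for any f of the form g(h(x, y), z).

def residual_product(x, y, z):
    # f = x*y + z = g(h(x, y), z) with h = multiplication, g = addition
    f_x, f_y = y, x          # first-order partials
    f_xz, f_yz = 0.0, 0.0    # mixed second-order partials
    return f_y * f_xz - f_x * f_yz

def residual_symmetric(x, y, z):
    # f = x*y + y*z + z*x, a candidate ternary function
    f_x, f_y = y + z, x + z
    f_xz, f_yz = 1.0, 1.0
    return f_y * f_xz - f_x * f_yz   # simplifies to x - y

points = [(1.0, 2.0, 0.5), (0.3, -1.2, 2.0), (2.0, 2.0, 7.0)]
print([residual_product(*p) for p in points])    # vanishes everywhere
print([residual_symmetric(*p) for p in points])  # equals x - y at each point
```

The residual of $xy+yz+zx$ vanishes only on the plane $x=y$, so it cannot lie in the TFS of this tree.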
While (2.2) is only a necessary condition, we prove that it is also sufficient in the case of analytic ($C^\omega$) functions:
Let $f(x,y,z)$ be an analytic function defined on an open neighborhood of the origin that satisfies the identity in (2.2). Then there exist analytic functions $g$ and $h$ of two variables for which $f = g\big(h(x,y),z\big)$ over some neighborhood of the origin.
To prove Theorem 2.1, we look at the Taylor expansion of $f$ with respect to $z$ and argue that each partial derivative $\partial_z^m f$ admits a representation like (2.1). We then explicitly construct the desired $g$ and $h$ in (2.1) with the help of the Taylor series. Consequently, we arrive at a description of the TFS containing analytic functions of three variables as the set of solutions of a single PDE.
Generalizing this setup to a higher number of variables, the following question arises: when can an analytic multivariate function be obtained as a composition of functions of two variables? Allowing more than three leaves results in graph-theoretically distinct binary trees. For example, in the case of functions of four variables, there exist two non-isomorphic binary trees; see Figure 1. The corresponding representations are $f(x,y,z,w) = g\big(h(x,y),\,k(z,w)\big)$ for the first tree and $f(x,y,z,w) = g\big(h(k(x,y),z),\,w\big)$ for the second one. Thus each (labeled) binary tree comes with its corresponding space of analytic tree functions that can be obtained from analytic functions of a smaller number of variables via composition according to the hierarchy that the tree provides; see Definition 4.1.
Condition (2.2) from the ternary case is the prototype of the constraints that general smooth functions from a TFS must satisfy. By fixing all but three of the variables of a function of $n$ variables in the TFS under consideration, the resulting function of three variables belongs to the TFS of the tree formed by those three leaves and is hence a solution of a PDE of the form (2.2). Since this is true for any suitable triple of variables, numerous necessary conditions must be imposed. In Theorem 2.2 we prove that for analytic functions these conditions are again sufficient.
Let $T$ be a binary tree with $n$ terminals and $f$ an analytic function implemented on $T$. Suppose the terminals of $T$ are labeled by the coordinate functions $x_1, \dots, x_n$ on $\mathbb{R}^n$. Then for any three leaves of $T$ corresponding to variables $x_i, x_j, x_k$ of $f$ with the property that there is a sub-tree of $T$ containing the leaves $x_i, x_j$ while missing the leaf $x_k$ (Figure 2), $f$ must satisfy
$$f_{x_i}\,f_{x_j x_k} = f_{x_j}\,f_{x_i x_k}. \tag{2.4}$$
Conversely, an analytic function $f$ defined in a neighborhood of a point can be implemented on the tree $T$ provided that for any triple of its variables with the above property the constraint (2.4) holds and, moreover, for any two sibling leaves $x_i, x_j$, either $f_{x_i}$ or $f_{x_j}$ is non-zero.
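The triples appearing in Theorem 2.2 are easy to enumerate mechanically. The sketch below (our own encoding: a tree is a nested tuple of leaf labels) lists, for each sub-tree, the distinct triples $(x_i, x_j; x_k)$ where $x_i, x_j$ lie inside the sub-tree and $x_k$ outside it, and compares the two four-leaf trees of Figure 1:

```python
def leaves(t):
    """Set of leaf labels of a nested-tuple tree."""
    return {t} if not isinstance(t, tuple) else set().union(*map(leaves, t))

def subtrees(t):
    """All sub-trees (including the whole tree and single leaves)."""
    yield t
    if isinstance(t, tuple):
        for child in t:
            yield from subtrees(child)

def constraint_triples(tree):
    """Triples (i, j, k): some sub-tree contains leaves i, j but not k."""
    all_leaves, triples = leaves(tree), set()
    for s in subtrees(tree):
        inside, outside = leaves(s), all_leaves - leaves(s)
        for i in inside:
            for j in inside:
                if i < j:
                    for k in outside:
                        triples.add((i, j, k))
    return triples

# The two non-isomorphic binary trees on four leaves:
balanced = ((1, 2), (3, 4))
caterpillar = (((1, 2), 3), 4)
print(sorted(constraint_triples(balanced)))
print(sorted(constraint_triples(caterpillar)))
```

The two trees produce different sets of triples (e.g. $(x_3,x_4;x_1)$ only for the balanced tree, $(x_1,x_3;x_4)$ only for the caterpillar), illustrating how the PDE system depends on the tree's shape.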
The general argument for trees with a larger number of leaves builds on the proof in the case of ternary functions demonstrated above. The proof occupies §5.2 and heavily uses the analyticity assumption. We digress to the setting of smooth ($C^\infty$) functions in §5.3 to show that this assumption cannot be dropped.
The constraints in Theorem 2.2 are algebraically dependent. The number of constraints imposed by Theorem 2.2 is cubic in the number of leaves. In §5.4 we show that "generically" the number of constraints can be reduced to one that is quadratic. This leads to a heuristic result on the co-dimension of the TFS in Proposition 5.6, which states that the number of independent functional equations describing a TFS grows only quadratically with the number of leaves.
The space of analytic functions is infinite-dimensional and this makes it difficult to rigorously measure how "small" a TFS is relative to the ambient space of all analytic functions. However, under certain restrictions, the dimension of the tree function space, or even the space itself, is finite. Two examples are worthy of investigation: bit-valued functions $\{0,1\}^n \to \{0,1\}$ and polynomials of bounded degree. In the bit-valued setting of §6.1, each node is characterized by a function $\{0,1\}^2 \to \{0,1\}$. We prove that Theorem 2.2 still holds in the sense of formal differentiation; see Theorem 6.6. Moreover, we enumerate the functions in the discrete TFS (Corollary 6.2), a number which is much smaller than the number $2^{2^n}$ of all possible bit-valued functions of $n$ variables. We use this to bound the total number of tree functions of $n$ variables obtained from all labeled binary trees with $n$ leaves; see Corollary 6.3. In the polynomial setting, each node is a bivariate polynomial. We establish that Theorem 2.2 applies and, furthermore, holds globally; cf. Proposition 5.9. In this case, if we consider polynomials of $n$ variables and of bounded degree, the polynomial TFS would be a subset of an algebraic variety whose dimension admits an explicit bound; see Proposition 5.11. Again, observe that this is much smaller than the dimension of the ambient polynomial space.
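For the smallest tree this gap can be made completely explicit by brute force (a self-contained sketch; we encode a bit-valued function of three bits by its 8-entry truth table): out of the $2^{2^3} = 256$ functions $\{0,1\}^3 \to \{0,1\}$, only a fraction arise in the form $g\big(h(x,y),z\big)$.

```python
from itertools import product

# Every map {0,1}^2 -> {0,1} is encoded by its value table on the
# four inputs (0,0), (0,1), (1,0), (1,1).
tables2 = list(product((0, 1), repeat=4))

def apply2(table, a, b):
    return table[2 * a + b]

representable = set()
for h in tables2:            # function at the lower node
    for g in tables2:        # function at the root
        truth = tuple(apply2(g, apply2(h, x, y), z)
                      for x, y, z in product((0, 1), repeat=3))
        representable.add(truth)

print(len(representable))    # number of bit-valued tree functions on ((x,y),z)
```

For instance, the majority function of three bits is not among the representable truth tables, in line with the continuous counterexamples above.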
The set of binary functions that can be implemented on a given tree is limited and this set allows the reconstruction of the underlying tree from its corresponding TFS; see Proposition 6.9. For two labeled trees we define a metric: the proportion of functions that can only be represented by one of the trees (§6.2). This can be useful: in the case of two neurons with different morphologies, this simple metric quantifies how similar the sets of functions are that the two neurons can implement.
More generally, the functions defined by neural networks are interesting examples of superpositions. In §7.1 we discuss a procedure of "expanding" a neural network to a tree by forming the corresponding Tree Expansion of the Neural Network, or TENN for short. The idea is to convert the neural network of interest into a tree by duplicating the nodes that are connected to more than one node of the upper layer; see Figure 4. The procedure then allows us to revert back to the familiar setting of trees. A crucial point to notice is that, unlike previous sections, trees associated with neural networks are not necessarily binary and, furthermore, a variable could appear as the label of more than one leaf. In other words, the functions constituting the superposition may have variables in common, e.g. a superposition of the form $g\big(h(x,y),\,k(y,z)\big)$ in which $y$ is repeated. Seeking similar constraints for describing the TFS of a tree with repeated labels, in §7.2 we take a closer look at the preceding superposition and prove that:
Assuming that the constituent functions are four times differentiable, the superposition satisfies the fourth-order PDE below:
The proposition suggests that tree functions are again solutions to (perhaps more tedious) PDEs. It is intriguing to ask if, in the presence of repeated labels, there is a characterization, similar to Theorem 2.2, of a TFS as the solution set of a system of PDEs; cf. Question 7.2. We finish with one final application of this idea of transforming a neural network to a tree: in Theorem 7.9 of §7.3, we give an upper bound on the number of bit-valued functions computable by a neural network in terms of parameters depending on the architecture of the network.
Here we study the functions that are obtained from hierarchical superpositions; that is, functions that can be computed on trees. The hierarchy is represented by a rooted binary tree where the leaves take different inputs and at each node a bivariate function is applied to outputs from the previous layer. In the setting of analytic functions, in Theorem 2.2 we characterize the space of functions that can be generated in this way as the solution set of a system of PDEs. This characterization enables us to construct examples (e.g. (2.3)) of functions that cannot be implemented on a prescribed tree. This is reminiscent of Minsky's famous XOR theorem [MP17], which exhibits a simple function that a single-layer perceptron cannot compute. The space of analytic functions is infinite-dimensional, and this motivates us to investigate two settings in which the TFS is finite-dimensional (polynomials) or even finite (bit-valued). We show that the dimension or size of the TFS is considerably smaller than that of the ambient function space. The number of bit-valued functions can be estimated even for non-binary trees following the same ideas. Finally, we build a bridge between trees and neural networks by associating with each feed-forward neural network its corresponding TENN; cf. Figure 4. This procedure allows us to apply our analysis of trees, yielding an upper bound for the number of bit-valued functions that can be implemented by a neural network.
While analytic functions constitute a large class, there are important cases where one must deal with continuous non-analytic functions too. For example, a typical deep learning network is built through composition from analytic functions, such as linear functions, the sigmoid or the hyperbolic tangent, and also from non-analytic functions such as ReLU or the pooling function. Continuous functions can be approximated locally by analytic ones with any desired precision (although not over the entirety of the real axis). Therefore, while our main result (local by formulation) is an exact classification, one future direction is to study how well arbitrary continuous functions can be approximated by analytic tree functions.
A typical dendritic tree receives synaptic inputs from surrounding neurons. When activated, the synapses induce a current which changes the voltage on the dendrite. This is followed by a flow of the resulting current towards (and away from) the root (soma) of the neuron. In typical models of neural physiology, a neuron is segmented into compartments whose connections and biological parameters define the dynamics of the voltage for each compartment [Seg98]. The dynamics of the electrical activity is often given by a well-known system of ODEs [HC97], in which the voltage potential of each compartment evolves according to its membrane capacitance, the resting potentials and non-linear conductance terms of the various ions, and the resistances within and between compartments; we only consider currents towards the soma.
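A representative compartmental equation of this type (a sketch in our own notation; the precise terms depend on the model) is:

```latex
C_j \,\frac{\mathrm{d}V_j}{\mathrm{d}t}
  \;=\; \sum_{i} \sigma_i(V_j)\,\frac{E_{i,j} - V_j}{R_{i,j}}
  \;+\; \frac{V_{j+1} - V_j}{R_{j+1,j}},
```

where $V_j$ is the voltage potential of the $j$-th compartment, $C_j$ its membrane capacitance, $E_{i,j}$ and $R_{i,j}$ the resting potential and resistance associated with ion $i$ in that compartment, $\sigma_i$ the non-linear gating term for ion $i$, and $R_{j+1,j}$ the resistance between consecutive compartments; only the soma-directed current from compartment $j+1$ is kept.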
Consequently, in the case of time-varying inputs, TFSs could be of neuroscientific interest. In this situation, the functions at the nodes are operators that receive time-dependent functions as their inputs. Constraints such as (2.4) in the main theorem may be formulated in this case as well: an operator admitting a tree representation must satisfy analogous equations, in which the derivatives of the operator must be understood in the variational sense.
Utilizing discrete TFSs, in §6.2 we introduce a metric on the set of labeled binary trees that may be potentially used to quantify how similar two neurons are. A careful adaptation of our results to the time-varying situation could be the object of future enquiries.
Certain assumptions must be made before any application of our treatment of tree functions to the study of neural morphologies. First, from a biological standpoint not all functions are admissible as functions applied at nodes of neurons. Secondly, the acyclic nature of trees assumes that a neuron functions only due to feed-forward propagation whereas in reality back-propagating action potentials also occur. Thirdly, it is well-known that there are biological mechanisms, such as ephaptic connectivity or neuromodulations, that could affect the computations in a neuron’s morphology, and they are not taken into account in typical compartmental models. Our approach only applies to an abstraction of models; however, this abstraction appears meaningful.
In this paper, the “complexity” of a TFS in bit-valued and polynomial settings is measured by its cardinality or dimension. However, there are other notions of complexity in the literature that try to capture the capacity of the space of computable functions. For example, the VC-dimension measures the expressive power of a space of functions by quantifying the set of separable stimuli [VLC94]. When the functions at the nodes are piece-wise linear, one can count the number of linear regions of the output function [MPCB14, HR19]. The choice of the complexity measurement method is important when it comes to quantifying the difference between two architectures.
When a model is trained on data, we search for the best fit in the function space. Characterizing the landscape of the function space can suggest new methods for training [BRK19, SL18]. For some models, such as regression, we have explicit formulae that show how to find the parameters from training data. One future line of research is to investigate whether our PDE description of TFSs can point toward new methods of training.
Since a TFS is much smaller than the ambient space of functions, it is natural to consider approximation by tree functions. In this regard, we fix a target function and consider the tree functions that approximate it. Searching for the best approximation of a target function in the function space is realized by the training process. Hence one important question for the approximation of a function is the stability of this process [HR17]. Another approach is to develop mean-field equations that approximate the function space with fewer equations that are easier to handle [MMN18]. Poggio et al. have found a bound on the complexity of a neural network with smooth (e.g. sigmoid) or non-smooth (e.g. ReLU) non-linearities that provides a prescribed accuracy for functions of a certain regularity [PMR17]. Now that we have a description of analytic tree functions as solutions of a system of PDEs, one further direction is to study approximations of arbitrary continuous functions by these solutions.
When the tree function is fed the same input more than once through different leaves, the constraints put on superpositions in Theorem 2.2 must be refined and become more tedious. In §7.2, we study the simplest possible case, namely the superposition (7.1). Computing higher derivatives via the chain rule, along with a linear algebra argument, yields a complicated fourth-order PDE as a constraint. One future line of research is to formulate similar PDE constraints in the case of general (not necessarily binary) trees with repeated labels; cf. Question 7.2. Finding necessary or sufficient constraints in the repeated regime would have immediate applications to the study of continuous functions computed by neural networks, keeping in mind that for the TENN associated with a neural network even the functions assigned to nodes are typically repeated. Moreover, in the more specific context of polynomial functions, it is promising to try to formulate results such as Proposition 5.11 about the space of polynomial tree functions; and in the bit-valued setting, any strengthening of the bound on the number of bit-valued functions implemented on a general tree that Corollary 7.7 provides would be desirable.
One major goal of theoretical deep learning is to understand the role of various architectures of neural networks. Previous studies have shown that, compared to shallow networks, deep networks can represent more complex functions [BS14, LTR17]. Theorem 7.9 from the last section of this paper provides further intuition in this direction once, instead of the more traditional fully connected multi-layer perceptrons, we work with the currently more popular sparse neural networks (e.g. convolutional neural networks). This is due to the fact that in the tree expansion of a sparse network the number of children of any node is relatively small. Theorem 7.9 indicates that the number of bit-valued functions computable by the network can be large only if the associated tree has numerous leaves. Since the tree is sparse, this can happen only if the depth of the tree (or equivalently, that of the network) is relatively large. The discussion in that section suggests that studying tree functions could serve as a foundation for interesting theoretical approaches to the study of neural networks.
4. Tree functions
In this section we define the function space associated with a tree in the most general setting. Suppose we have $n$ inputs at the leaves of a binary tree $T$. We recursively compute the output by applying a function at each node and passing the result to the next level. These calculations continue until we reach the root.
Let $T$ be a tree, and fix the set of all possible inputs that a leaf could receive. For each node of $T$, suppose a collection of functions that may be applied at that node is given. The tree function space of $T$ is defined recursively: it consists of the identity function of the single input if $T$ has only one node. For larger trees, assuming that the successors of the root of $T$ are the roots of smaller sub-trees $T_1, \dots, T_k$, define the TFS of $T$ to be the set of all superpositions $g(f_1, \dots, f_k)$ in which $g$ is a function that may be applied at the root and each $f_i$ lies in the TFS of $T_i$.
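The recursion in this definition is easy to make concrete (a minimal sketch in our own encoding: a leaf is a pair `('x', i)` reading the $i$-th input, and a node is a tuple carrying a bivariate function of our choosing together with its two children):

```python
import operator

def evaluate(tree, inputs):
    """Recursively evaluate a tree function: a leaf ('x', i) returns
    inputs[i]; a node (f, left, right) applies the bivariate f to the
    outputs of its two children."""
    if tree[0] == 'x':
        return inputs[tree[1]]
    f, left, right = tree
    return f(evaluate(left, inputs), evaluate(right, inputs))

# f(x, y, z) = (x * y) + z on the three-leaf tree of Figure 1
tree = (operator.add, (operator.mul, ('x', 0), ('x', 1)), ('x', 2))
print(evaluate(tree, [2.0, 3.0, 4.0]))  # 2*3 + 4 = 10.0
```

Varying the functions attached to the nodes while keeping the tree fixed sweeps out exactly the corresponding TFS.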
Tree function spaces could be investigated in two different regimes:
A tree is called binary if every non-terminal vertex (every node) of it has precisely two successors; cf. Figure 6.
The terminology of binary trees.
Root: the unique vertex with no predecessor/parent.
Leaf/Terminal: a vertex with no successor/child. (The reader is cautioned that in our usage of terms such as "children", "parent", "successor" and "predecessor" in reference to vertices we have a rooted tree as illustrated in Figure 6 in mind, where the root precedes every other vertex, whereas to implement a function the computations are done in the "upward" direction, starting from the leaves in the lowest level and culminating at the root.)
Node/Branch point: a vertex which is not a leaf, i.e. has (two) successors.
Sub-tree: all descendants of a vertex along with the vertex itself.
Sibling leaves: two leaves with the same parent.
Convention. All trees are assumed to be rooted. The number of leaves (terminals) of a tree is always denoted by $n$, and each leaf represents a variable. Unless stated otherwise, the tree is binary and these variables are assumed to be distinct, and hence the corresponding functions are of $n$ variables. Repeated labels come up only in §7.
5. Analytic function setting
5.1. The case of ternary functions
In this section, we focus on the first interesting case, namely a binary tree with three inputs. It turns out that the treatment of this basic case and the ideas therein are essential to the proof of Theorem 2.2. In order to produce one output, two of the inputs must first be combined at one node, and the result of that combination is then combined with the third input at the root. Such functions can be written as
$$f(x,y,z) = g\big(h(x,y),\,z\big), \tag{5.1}$$
where $g$ and $h$ are two smooth functions of two variables.
So which functions $f$ of three inputs can be written as in (5.1)? Taking the derivative w.r.t. $x$ and $y$, we have:
$$f_x = \partial_1 g\big(h(x,y),z\big)\,h_x, \qquad f_y = \partial_1 g\big(h(x,y),z\big)\,h_y.$$
Taking the derivative w.r.t. $z$ repeatedly yields:
$$\partial_z^m f_x = \partial_z^m\big(\partial_1 g\big)\,h_x, \qquad \partial_z^m f_y = \partial_z^m\big(\partial_1 g\big)\,h_y.$$
Hence for every $m \geq 1$ we should have:
$$f_x\,\partial_z^m f_y = f_y\,\partial_z^m f_x, \tag{5.2}$$
as both sides coincide with $h_x\,h_y\,(\partial_1 g)\,\partial_z^m(\partial_1 g)$. In particular, for $m = 1$:
$$f_x\,f_{yz} = f_y\,f_{xz},$$
which is the constraint (2.2) from the introduction. Notice that the identity is solely based on the function $f$ and serves as a necessary condition for the existence of a presentation such as (5.1) for $f$.
It is essential to observe that constraint (2.2) implies the rest of the constraints imposed on $f$ in (5.2). This is trivial at the points where $f_x$ and $f_y$ both vanish. Otherwise, either $f_x$ or $f_y$ is non-zero at the point under consideration, and hence throughout a small enough neighborhood of it. We proceed by induction on $m$. Differentiating (5.2) w.r.t. $z$ yields:
$$f_x\,\partial_z^{m+1} f_y + f_{xz}\,\partial_z^{m} f_y = f_y\,\partial_z^{m+1} f_x + f_{yz}\,\partial_z^{m} f_x,$$
where the latter terms of the two sides coincide since, from the induction hypothesis, the vector $\big(\partial_z^m f_x, \partial_z^m f_y\big)$ is a multiple of the non-zero vector $(f_x, f_y)$, while the base case $m = 1$ indicates the same for $(f_{xz}, f_{yz})$. Canceling these equal terms gives $f_x\,\partial_z^{m+1} f_y = f_y\,\partial_z^{m+1} f_x$, which is (5.2) for $m+1$. This finishes the inductive step.
In the same vein, identity (2.4) implies more general identities in which the single differentiation w.r.t. $x_k$ is replaced by higher-order derivatives with respect to the outsider variables. This is true even for a greater number of outsider variables in the following sense:
Proof of Proposition 2.1.
Let us first impose a mild non-singularity condition at the origin: either $f_x$ or $f_y$ is non-zero there. Without any loss of generality, we may assume $f_x \neq 0$ at the origin. The idea is to come up with a new coordinate system centered at the origin in which the function $f$ depends only on the first and the last coordinates. Define the new coordinates as $\big(f(x,y,0),\,y,\,z\big)$.
The Jacobian of the new coordinates w.r.t. $(x,y,z)$ is given by
$$\begin{pmatrix} f_x(x,y,0) & f_y(x,y,0) & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \tag{5.6}$$
whose determinant at the origin is $f_x(0,0,0)$, which we have assumed to be non-zero. Thus this is indeed a coordinate system centered at the origin. Next, we consider the Taylor expansion of $f$ w.r.t. $z$:
$$f(x,y,z) = \sum_{m=0}^{\infty} \frac{\partial_z^m f(x,y,0)}{m!}\,z^m, \tag{5.7}$$
the equality holding near the origin due to the analyticity assumption. We claim that in the new coordinate system the partial derivatives $\partial_z^m f(x,y,0)$ that appeared above are independent of the second and the third coordinates. The latter is clear, and for the former we apply the chain rule to differentiate with respect to the second coordinate:
To calculate the partial derivatives of the old coordinates w.r.t. the new ones, one has to invert the Jacobian matrix (5.6); that yields $\partial x/\partial y = -f_y(x,y,0)/f_x(x,y,0)$ in the new coordinates. Plugging this into the previous expression, we get:
$$-\frac{f_y(x,y,0)}{f_x(x,y,0)}\,\partial_z^m f_x(x,y,0) + \partial_z^m f_y(x,y,0),$$
which is zero due to (5.2); keep in mind that in a neighborhood of the origin the aforementioned identities are implied by (2.2); cf. Lemma 5.1. We conclude that in (5.7) each term is a function of the first and the third coordinates only, e.g. $\partial_z^m f(x,y,0)$ is a function of $f(x,y,0)$.
Now defining $h(x,y)$ to be $f(x,y,0)$ and $g$ to be the function of the pair $\big(h(x,y), z\big)$ that the series (5.7) determines, the identity (5.7) implies that $f(x,y,z) = g\big(h(x,y),z\big)$ throughout a small enough neighborhood of the origin.
Next, we omit the condition that one of the partial derivatives of $f$ is non-zero in Proposition 2.1. If either of $\partial_x \partial_z^m f$ or $\partial_y \partial_z^m f$ is non-zero for some integer $m$, the argument above applies to $\partial_z^m f$ to get a representation of it in the form $g\big(h(x,y),z\big)$. Then integrating $m$ times w.r.t. $z$ provides us with a similar expression for $f$. ∎
The idea from the last part of the proof seems to work only for this particular presentation since, in general, integration w.r.t. one of the variables does not preserve tree forms such as those appearing in Theorem 2.2. Therefore, we are going to need the non-singularity condition of Theorem 2.2 in the following section.
An elegant reformulation (from the field of integrable systems) of constraint (2.2) imposed on a smooth tree function $f$ is to say that the differential form $\omega := f_y\,\mathrm{d}x - f_x\,\mathrm{d}y$ must satisfy $\omega \wedge \mathrm{d}\omega = 0$. (The reader may find a very readable account of the theory of differential forms on Euclidean spaces in [Pug02, chap. 9, sec. 5].)
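With $\omega = f_y\,\mathrm{d}x - f_x\,\mathrm{d}y$ (our sign convention; the opposite sign gives the same condition), the equivalence is a one-line computation: the only terms of $\mathrm{d}\omega$ that survive wedging with $\omega$ are those containing $\mathrm{d}z$, so

```latex
\omega \wedge \mathrm{d}\omega
  = \big(f_y\,\mathrm{d}x - f_x\,\mathrm{d}y\big)
    \wedge \big(f_{yz}\,\mathrm{d}z\wedge\mathrm{d}x - f_{xz}\,\mathrm{d}z\wedge\mathrm{d}y\big)
  = \big(f_y f_{xz} - f_x f_{yz}\big)\,\mathrm{d}x\wedge\mathrm{d}y\wedge\mathrm{d}z,
```

and hence $\omega \wedge \mathrm{d}\omega = 0$ holds precisely when (2.2) does.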
Similar identities also hold in the general case of a (smooth) tree function as has appeared in Theorem 2.2. To any two sibling leaves $x_i, x_j$ assign the differential $1$-form $\omega_{ij} := f_{x_j}\,\mathrm{d}x_i - f_{x_i}\,\mathrm{d}x_j$. A straightforward calculation exhibits $\omega_{ij} \wedge \mathrm{d}\omega_{ij}$ as a combination of $3$-forms whose coefficients are of the form $f_{x_i} f_{x_j x_k} - f_{x_j} f_{x_i x_k}$ for leaves $x_k$ other than $x_i, x_j$; this turns out to be zero since any other leaf is an outsider with respect to the neighboring $x_i, x_j$, and hence the terms inside parentheses vanish due to (2.4). The non-vanishing condition of Theorem 2.2 implies that these $1$-forms are linearly independent throughout some small enough open subset. They define a differential system on the aforementioned open subset whose rank equals the number of sibling pairs, and the identities $\omega_{ij} \wedge \mathrm{d}\omega_{ij} = 0$ can be reinterpreted as the integrability of this system according to a classical theorem of Frobenius [Nar68, Theorem 2.11.11].
The discussion in this subsection settles Theorem 2.2 for the most basic case of a binary tree with three leaves.
5.2. Proof of the main theorem
Let $T$ be a binary tree with $n$ leaves as in Theorem 2.2 and $f$ a differentiable function of $n$ variables on an open neighborhood $U$ of a point $p \in \mathbb{R}^n$.
The proof of necessity
Let $f$ be a function implemented on $T$. Consider a triple of variables as in Theorem 2.2; for ease of notation, suppose they are the first three coordinates $x_1, x_2, x_3$, with $x_1, x_2$ lying in a sub-tree that misses $x_3$. Given a point $p = (p_1, \dots, p_n) \in U$, we need to verify (2.4) at $p$. Setting the last $n - 3$ coordinates to be the constants $p_4, \dots, p_n$, we end up with the function
$$(x_1, x_2, x_3) \mapsto f(x_1, x_2, x_3, p_4, \dots, p_n)$$
of three variables defined on the open neighborhood of $(p_1, p_2, p_3)$ which is the image of $U$ under the projection onto the first three coordinates. This new function is implemented on a tree with three inputs (corresponding to the leaves $x_1, x_2, x_3$ in the original statement of Theorem 2.2) and with $x_1, x_2$ adjacent to the same node, as $x_1$ and $x_2$ were separated from $x_3$ in the original tree; see Figure 3. Hence (2.2) holds for this function, which at $(p_1, p_2, p_3)$ yields the desired constraint
$$f_{x_1}\,f_{x_2 x_3} = f_{x_2}\,f_{x_1 x_3} \quad \text{at } p.$$
Next, we argue that under the assumptions outlined in the second part of Theorem 2.2, identities such as (2.4) are enough to implement $f$ on $T$ locally around the point in question. The proof of sufficiency is based on recursively constructing the desired presentation of $f$ as a composition of bivariate functions by reducing the size of $T$. The base of the induction, the case of a tree with three terminals, has already been settled in Proposition 2.1.
We claim that, up to relabeling the variables, $f$ can be written as either
$$f = g\big(h(x_1, \dots, x_{n-1}),\, x_n\big) \tag{5.9}$$
or
$$f = g\big(h(x_1, \dots, x_m),\, x_{m+1}, \dots, x_n\big), \tag{5.10}$$
where the function $h$ in (5.9), respectively $g$ in (5.10), satisfies the hypothesis of the existence part of Theorem 2.2 for $n-1$, respectively $n-m+1$, variables. In terms of the tree $T$, the first normal form occurs when a leaf is connected directly to the root; the removal of that leaf and the root then results in a smaller tree with $n-1$ leaves; cf. part (a) of Figure 7. The induction hypothesis then establishes the claim and finishes the proof. On the other hand, (5.10) comes up when neither of the two rooted sub-trees obtained by excluding the root is a singleton. The number $m$ here denotes the number of leaves of one of these sub-trees, e.g. the "left" one. By symmetry, let us assume that the variables are labeled such that $x_1, \dots, x_m$ are the leaves of the sub-tree to the left of the root while $x_{m+1}, \dots, x_n$ appear in the sub-tree to the right. Graph-theoretically, gathering the variables $x_1, \dots, x_m$ together in (5.10) amounts to collapsing the left sub-tree to a single leaf. This results in a new binary tree with the same root but with $n-m+1$ leaves which is of the form discussed before: it has a "top" leaf directly connected to the root; see part (b) of Figure 7. The final step is to invoke the induction hypothesis to argue that $h$ in (5.10) belongs to the TFS of the left sub-tree.
Part I of the proof of sufficiency: Suppose there is a leaf adjacent to the root.
Without loss of generality, we consider everything in a neighborhood of the origin and assume that the point in question is the origin itself. Theorem 2.2 also requires at least one of the partial derivatives of $f$ w.r.t. a pair of sibling leaves to be non-zero; by symmetry, let us assume $f_{x_1} \neq 0$ at the origin.
This is a new coordinate system centered at the origin, as the Jacobian is of non-zero determinant at the origin. The goal is to write down $f$ in a form similar to (5.9) for a suitable bivariate function, and then apply the induction hypothesis to the inner function, which of course satisfies (2.4) with the original tree replaced by the smaller one. To this end, we consider the Taylor expansion of $f$ w.r.t. $x_n$:
We claim that the functions $\frac{1}{m!}\,\partial_{x_n}^m f$ that appeared as coefficients are dependent only on the first component of the new coordinate system (5.11); in other words, their derivatives with respect to the other components vanish. This is immediate for the last component, as we are basically differentiating w.r.t. $x_n$. For the remaining components, we need to apply the chain rule to get the derivatives in question, where the partial derivatives involved are entries of the inverse of (5.12) given by