Deep learning has become a highly successful machine learning method employed in a wide range of applications such as optical character recognition, image classification, and speech recognition. In a typical deep learning scenario one aims to fit a parametric model, realized by a deep neural network, to match a set of training data points.
Definition 1 (Neural network).
We call an ordered sequence
a neural network, where
is a positive integer, referred to as the depth of ,
is an -tuple of positive integers, called the layout,
, , are matrices whose entries are referred to as the network’s weights, and
, are vectors of the so-called biases.
Furthermore, we stipulate that none of the , , have an identically zero row or an identically zero column.
Given a neural network and a nonlinear function , referred to as the nonlinearity, we define the map realized by under as the function given by
where acts on real vectors in a componentwise fashion.
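As an illustration of the map realized by a network, the following Python sketch evaluates a layered network stored as a list of (weight matrix, bias vector) pairs. The displayed formula is not reproduced in this text, so whether the nonlinearity is also applied after the last affine map is an assumption, exposed through the hypothetical flag `rho_on_output`. The names `affine` and `realize` are likewise illustrative and not taken from the paper.

```python
import math

def affine(W, theta, x):
    """Return W x + theta for a matrix W (list of rows) and vectors theta, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, theta)]

def realize(layers, rho, x, rho_on_output=False):
    """Evaluate a layered network at the input x.

    layers: list of (W, theta) pairs, one per layer.
    rho: scalar nonlinearity, applied componentwise.
    rho_on_output: assumption flag; the source formula is not shown here,
    so both conventions are supported.
    """
    for index, (W, theta) in enumerate(layers):
        x = affine(W, theta, x)
        is_last = index == len(layers) - 1
        if not is_last or rho_on_output:
            x = [rho(t) for t in x]
    return x

# Depth-2 example with layout (2, 2, 1); all numbers are arbitrary.
example = [
    ([[1.0, -1.0], [0.5, 2.0]], [0.1, -0.2]),
    ([[1.5, -0.5]], [0.3]),
]
y = realize(example, math.tanh, [0.4, -0.7])
y_squashed = realize(example, math.tanh, [0.4, -0.7], rho_on_output=True)
```

Neither weight matrix in the example has an identically zero row or column, in line with the stipulation in Definition 1.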
The requirement that the matrices in Definition 1 have non-zero rows corresponds to the absence of nodes whose contributions depend on the biases only, and are therefore constant as functions of the input. Similarly, columns that are identically zero correspond to nodes whose contributions do not enter the computation at the next layer. The map of a neural network failing this requirement can be realized by a network obtained by simply removing such spurious nodes. In practical applications, the numbers
are typically determined through heuristic considerations, whereas the coefficients of the affine maps are learned based on training data. For an overview of practical techniques for deep learning, see . Neural networks are often studied as mathematical objects in their own right, for instance in approximation theory [8, 9, 10, 11] and in control theory [12, 13]. In this context, a natural question is that of identification: Can a neural network be uniquely identified from the map it is to realize? Specifically, we will be interested in identifiability according to the following definition.
Definition 3 (Identifiability).
Given positive integers and , define to be the set of all neural networks whose layouts satisfy and , but are otherwise arbitrary. Let be a subset of , a nonlinearity, and an equivalence relation on .
We say that is compatible with if, for all ,
We say that is identifiable up to if, for all ,
Thus, by informally saying that a neural network in a certain class is identifiable, we mean that any neural network in the same class giving rise to the same output map, i.e., , is necessarily equivalent to . The role of the equivalence relation in the previous definition is thus to “measure the degree of non-uniqueness”, and in particular, to accommodate symmetries within the network that may arise either from symmetries induced by the network weights and biases (such as the presence of clone pairs, to be introduced in Definition 5), symmetries of the nonlinearity (e.g.,
is odd), or both simultaneously. These abstract concepts will be incarnated momentarily when discussing the seminal work by Fefferman, and in Section II through Definitions 4 and 5, as well as in the examples leading up to the formulation of the paper’s main results. In , Fefferman showed that neural networks satisfying the following genericity conditions:
Assumptions 1 (Fefferman’s genericity conditions).
, for all and , and , for all and with .
, for all , , and , and
for all , and with ,
are, indeed, uniquely determined by the map they realize under the nonlinearity , up to certain obvious isomorphisms of networks. More precisely, for fixed positive integers and , Fefferman showed that is identifiable up to , where is defined as the set of all neural networks in satisfying Assumptions 1, and is defined by stipulating that if and only if
and , and
there exists a collection of signs , , and permutations such that
is the identity permutation and , , whenever or , and
for all , , and ,
It can be verified that is an equivalence relation on . Networks , such that are said to be isomorphic up to sign changes. The permutations
reflect the fact that the ordering of the neurons in the hidden layers is not unique, whereas the freedom in choosing the signs reflects that is an odd function. It can be verified that any two networks isomorphic up to sign changes give rise to the same map under the nonlinearity, so is compatible with . The crux of Fefferman’s result therefore lies in proving the converse statement, namely that two networks giving rise to the same map with respect to are necessarily isomorphic up to sign changes. This is effected by the insight that the depth, the layout, and the weights and biases of a network are encoded in the geometry of the singularities of the analytic continuation of .
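The compatibility claim can be checked numerically. The sketch below, written under stated assumptions, reorders and sign-flips the hidden neurons of a depth-2 network and compares the resulting maps under the hyperbolic tangent, which is odd. The helper names `realize` and `permute_and_flip` are hypothetical, and the convention of leaving the last layer without a nonlinearity is an assumption; the invariance holds under either convention because the output pre-activations coincide.

```python
import math

def realize(layers, rho, x):
    """Layered network map; rho is applied after every layer except the last
    (an assumed convention, see the lead-in)."""
    for index, (W, theta) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, theta)]
        if index < len(layers) - 1:
            x = [rho(t) for t in x]
    return x

def permute_and_flip(layers, perm, signs):
    """Reorder and sign-flip the hidden neurons of a depth-2 network.

    New hidden neuron j is old neuron perm[j] multiplied by signs[j].
    Inputs and outputs are left untouched.
    """
    (W1, t1), (W2, t2) = layers
    n = len(perm)
    new_W1 = [[signs[j] * w for w in W1[perm[j]]] for j in range(n)]
    new_t1 = [signs[j] * t1[perm[j]] for j in range(n)]
    new_W2 = [[signs[j] * row[perm[j]] for j in range(n)] for row in W2]
    return [(new_W1, new_t1), (new_W2, t2)]

# Arbitrary example with layout (2, 3, 2).
net = [
    ([[0.7, -1.2], [0.3, 0.9], [-0.5, 0.4]], [0.5, -0.4, 0.2]),
    ([[2.0, 1.0, -1.5], [0.6, -0.8, 1.1]], [0.1, -0.3]),
]
twin = permute_and_flip(net, perm=[2, 0, 1], signs=[-1, 1, -1])

points = [[0.0, 0.0], [0.4, -0.7], [-1.3, 2.1]]
max_gap = max(
    abs(a - b)
    for x in points
    for a, b in zip(realize(net, math.tanh, x), realize(twin, math.tanh, x))
)
```

Here `max_gap` stays at the level of floating-point rounding, even though the two weight lists differ entry by entry.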
We note that Fefferman distilled the precise conditions of Assumptions 1 from his proof technique, in order to define a class of neural networks that is, on the one hand, sufficiently small to guarantee identifiability, and on the other hand, sufficiently large to encompass “generic” networks. Indeed, if we consider the network weights and biases as elements of the space , then Assumptions 1 rule out only a set of measure zero, and hence with the nonlinearity satisfies a universal approximation property (e.g., in the sense of Cybenko  and Hornik ). In the contemporary practical machine learning literature, however, a network satisfying Assumptions 1 would hardly be considered generic, as part (i) of Assumptions 1 implies that all biases are non-zero, and part (ii) imposes full connectivity throughout the network.
Indeed, Fefferman remarks explicitly that it would be interesting to replace Assumptions 1 with minimal hypotheses, and to study nonlinearities other than . The present paper aims to address these two issues. Characterizing the fundamental nature of conditions necessary for identifiability with respect to a fixed nonlinearity, even a simple one such as , is likely a rather formidable task. In fact, the minimal identifiability conditions may generally depend on “fine” properties of the nonlinearity under consideration, and it is hence unclear how much insight can be obtained by having conditions that are specific to a given nonlinearity. We will thus be interested in an identification result with very mild conditions on the weights and biases of the neural networks to be identified, while still accommodating a broad class of nonlinearities.
We begin with two motivating examples. These lead up to the statements of our main contributions, whose corresponding proofs are developed in the remainder of the paper. We consider nonlinearities which are not necessarily odd (as ), and thus need an equivalence relation which dispenses with sign changes.
Definition 4 (Neural network isomorphism).
We say that the neural networks and are isomorphic, and write , if
and , and
there exist permutations such that
is the identity permutation for and , and
for all , , and ,
In the remainder of the paper we will work exclusively with isomorphisms in the sense of Definition 4. Note that any two isomorphic networks give rise to the same map with respect to any nonlinearity , and thus is an equivalence relation compatible with any pair . The requirement that be the identity for in the previous definition again corresponds to the fact that the inputs and the outputs of a neural network are not generally interchangeable. Indeed, suppose that , is the map of a neural network with respect to some nonlinearity . Let , , and be the networks obtained from by interchanging the inputs of , the outputs of , and both inputs and outputs, respectively. Then , , and are, indeed, distinct functions. We now give an example that Fefferman uses to motivate the necessity of restricting the class of all neural networks to a smaller class to be identifiable up to an equivalence relation. In Fefferman’s case, the equivalence relation is , but the example is equally pertinent to the relation . Suppose that is a neural network with , and with and are such that and for all . Then, if is obtained from by replacing and with an arbitrary pair of numbers and such that , then , for any . This example motivates the following definition:
Definition 5 (No-clones condition).
Let be a neural network as in Definition 1. We say that has a clone pair if there exist and with such that
If does not have a clone pair, we say that satisfies the no-clones condition.
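The example that motivates Definition 5 can be reproduced with a short numerical sketch. It assumes a depth-2 network with a single output whose first two hidden nodes share the same incoming weights and bias; the outgoing weights of these two nodes are then replaced by another pair with the same sum. The function `two_layer` and all numbers are hypothetical, and the softsign function is used only as one possible choice of nonlinearity.

```python
import math

def two_layer(W1, t1, w2, t2, rho, x):
    """Single-output depth-2 network: w2 . rho(W1 x + t1) + t2."""
    hidden = [rho(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, t1)]
    return sum(a * h for a, h in zip(w2, hidden)) + t2

def softsign(t):
    return t / (1.0 + abs(t))

# Rows 0 and 1 of W1 and the matching biases coincide: a clone pair.
W1 = [[0.7, -1.2], [0.7, -1.2], [0.3, 0.9]]
t1 = [0.5, 0.5, -0.4]

w2_original = [2.0, 1.0, -1.5]
w2_modified = [-4.0, 7.0, -1.5]  # 2.0 + 1.0 == -4.0 + 7.0

points = [[0.0, 0.0], [0.4, -0.7], [-1.3, 2.1], [5.0, 3.0]]
gaps = {
    name: max(
        abs(two_layer(W1, t1, w2_original, 0.2, rho, x)
            - two_layer(W1, t1, w2_modified, 0.2, rho, x))
        for x in points
    )
    for name, rho in [("tanh", math.tanh), ("softsign", softsign)]
}
```

Both entries of `gaps` are at rounding level, which reflects that the two clones always produce the same hidden value, so only the sum of their outgoing weights enters the output.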
As the nonlinearity in the example above is completely arbitrary, the no-clones condition is necessary to have any hope of obtaining identifiability up to . Hence, with our program in mind, given positive integers and , we define
and seek nonlinearities such that is identifiable up to . As any class strictly containing , paired with any nonlinearity, fails identifiability up to , the no-clones condition furnishes a canonical minimal assumption for identifiability up to . Similarly to , the class , paired with any measurable nonlinearity such that the limits and exist and are not equal, satisfies the universal approximation property in the sense of Hornik . The following example demonstrates that insisting on the no-clones condition as the only assumption on the weights, biases, and layout will necessarily come at the cost of restricting the class of nonlinearities that allow for identifiability. Let
Now, given an arbitrary neural network with satisfying the no-clones condition, the network
also satisfies the no-clones condition, and yields the identically-zero output, i.e., . We have thus constructed an infinite collection of distinct networks satisfying the no-clones condition and all yielding the identically-zero map. The class of identically-zero output maps therefore contains networks of different depths and layouts, and thus identifiability up to fails. This leads to the conclusion that a uniqueness result for neural networks with the clipped ReLU nonlinearity would need to encompass genericity conditions more stringent than the no-clones condition. Nonetheless, we are able to construct a class of real meromorphic nonlinearities yielding identifiability without any assumptions on the neural networks beyond the no-clones condition, and which is large enough to uniformly approximate any piecewise nonlinearity with , where
is the space of functions of bounded variation on .
Concretely, we have the following main result of this paper.
Theorem 1 (Uniqueness Theorem).
Let and be arbitrary positive integers. Furthermore, let be a piecewise function with and let . Then, there exists a meromorphic function , , such that and is identifiable up to .
We note that, having fixed the input and output dimensions and , the depths and the layouts of the networks in are completely arbitrary. Examples of nonlinearities covered by Theorem 1
include many sigmoidal functions such as the aforementioned clipped ReLU, the logistic function, the hyperbolic tangent , the inverse tangent , the softsign function , the inverse square root unit , the clipped identity , and the soft clipping function , where is fixed in the last two cases. Unbounded nonlinearities such as the ReLU are not covered. The nonlinearities for which we have identifiability, unfortunately, need to be constructed, and, at the present time, we do not have an identification result for arbitrary given . Furthermore, we remark that the statement of Theorem 1 is “not continuous” in the approximation error . Indeed, while the clipped ReLU function satisfies the conditions of Theorem 1, as shown in the example above, there exist non-isomorphic networks and satisfying the no-clones condition and , for all , where is the clipped ReLU function. We will see that Theorem 1 is, in fact, a consequence of the following result, which states that the maps realized by pairwise non-isomorphic networks with , under a nonlinearity according to Theorem 1, are linearly independent functions .
Theorem 2 (Linear Independence Theorem).
Let be an arbitrary positive integer, let be a piecewise function with , and let . Then, there exists a meromorphic function , , such that with the following property: Suppose that , , are pairwise non-isomorphic (in the sense of ) neural networks in .
Then, is a linearly independent set of functions , where denotes the constant function taking on the value 1.
The function is included in the linearly independent set both for the sake of greater generality of the statement, and to facilitate the proof of Theorem 2.
Unfortunately, Theorem 2 does not generalize to multiple outputs , as shown by the following example: Fix an arbitrary network according to Definition 1 such that , , , and satisfies the no-clones condition. Define , , as the submatrices of consisting of the rows and , and , and , and and , respectively. Furthermore, define the networks
for . As satisfies the no-clones condition, the networks , , also satisfy the no-clones condition, and are pairwise non-isomorphic.
Now, let be an arbitrary nonlinearity, and write , where , . Then
The set is hence linearly dependent, showing that Theorem 2 cannot be generalized to multiple outputs by replacing with . We now provide a panorama of the proofs of Theorems 1 and 2. The proof of Theorem 1 is by way of contradiction with Theorem 2. Specifically, assume that , , , and are as in the statement of Theorem 1, and let be a nonlinearity satisfying the conclusion of Theorem 2 with these , , and . For a network , we write the map in terms of the coordinate functions , . Now, let be networks such that , for all , and suppose for contradiction that they are non-isomorphic. We construct a network containing both and as subnetworks (a precise definition of “subnetwork” is given in Section III, Definition 9). It follows that contains subnetworks with maps satisfying , for and . We then show that, as a consequence of and being non-isomorphic, there exists a such that and are non-isomorphic. But then
which stands in contradiction to Theorem 2. This completes the proof of Theorem 1. The proof of Theorem 2 is significantly more involved, as it requires extensive “fine tuning” of the function . Thus, let be as in the statement of Theorem 2. In addition to the properties stated in Theorem 2, the function we construct exhibits the following convenient structural properties:
The domain of is the complement of an (infinite) discrete set of poles,
is -periodic, i.e., , for all , and
for any network , the natural domain of , viewed as a holomorphic function, is the complement of a closed countable subset of , and therefore a connected open set.
These three properties are all satisfied by the function , and are essentially the key insight leading to Fefferman’s identifiability result in , which establishes that, under the genericity conditions stated in Assumptions 1, a neural network can be read off from the asymptotic periodicity (as the imaginary part of the argument tends to infinity) of the singularities of the map it realizes under the nonlinearity. The properties 1) – 3) will be key to our results as well, but instead of studying the set of singularities of the map in its own right, our proof of Theorem 2 will proceed by contradiction. The proof consists of three steps that we call amalgamation, input splitting, and input anchoring, and involves the use of analytic continuation, graph-theoretic constructions, and Ratner’s orbit closure theorem from the theory of Lie groups, the latter two of which are novel tools in this context and signify a radical departure from Fefferman’s proof technique in .
We now briefly describe the proof of Theorem 2 according to the aforementioned program. Suppose that are pairwise non-isomorphic neural networks satisfying the no-clones condition. For the sake of simplicity of this informal discussion, we assume that , , and . By way of contradiction, we suppose that there exists a nontrivial linear combination such that , for all .
Amalgamation: In Section III we construct a neural network , called the amalgam of , containing each as a subnetwork. In particular, we have , for all . The linear dependence of thus translates to
for all . By our construction of , the natural domains are complements of closed countable sets, and hence, by analytic continuation, (1) is valid for all . Now define to be the set of all neural networks with a linear dependency as in (1) between the output functions and the constant function. Note that is nonempty, simply as . We then fix a network of minimum size (the precise definition of size will be given in the proof of Theorem 4). Write for the layout of , and let be the weights of the first layer of (i.e., the entries of according to Definition 1). At this point the proof splits into two cases, depending on whether there exist , , such that is irrational.
Input splitting, the easy case. Provided there do exist such and , we use Ratner’s orbit closure theorem and the properties (i) – (iii) of to construct a network with layout , for some , and first-layer weights such that the first rows of form a identity matrix.
Input anchoring. We then construct a third network , obtained by fixing of the inputs of to specific real numbers, and “cutting out” all the parts of the network whose contributions to the output map have become constant in the process. The resulting network will be a network in of size smaller than , which contradicts the minimality of , and thereby completes the proof.
Input splitting, the hard case. If, however, all the ratios , are rational, the input splitting construction described above cannot be carried out. This problem will be remedied by further refining our initial construction of . Specifically, we will ensure that the real parts of the poles of form a subset of satisfying what we call the self-avoiding property, to be introduced in Section V. This will enable an alternative construction of a network with at least two inputs. The resulting will, however, not be a neural network in the sense of Definition 1, but rather a generalized network in the sense of Definition 8, to be introduced in Section III.
Input anchoring. Finally, we apply an input anchoring procedure similar to the one described above. This will result in a network of smaller size than , again completing the proof by contradiction.
We conclude this section by laying out the organization of the remainder of the paper. In Section III we develop a graph-theoretic framework needed to define amalgams of neural networks and several other technical concepts. In Section IV we state results from complex analysis and the theory of Lie groups needed in arguments involving analytic continuation and input splitting, respectively. The proofs of these results are relegated to the Appendix. In Section V we discuss the fine structural properties of the function constructed in the proof of Theorem 2. Finally, Section VI contains the proofs of our two main results.
III Directed acyclic graphs, general neural networks, and neural network amalgams
As already mentioned, in the proof of Theorem 2 we will work with a form of neural networks that does not fit in with Definitions 1 and 2. In order to accommodate this notion of neural networks, and to lighten the manipulations needed to formalize the aforementioned techniques of amalgamation and input anchoring, we introduce a graph-theoretic framework.
We start by introducing the concept of a directed acyclic graph (DAG), commonly encountered in the graph theory literature .
Definition 6 (Directed acyclic graph).
A directed graph is an ordered pair where is a finite set of nodes, and is a set of directed edges.
A directed cycle of a directed graph is a set such that, for every , , where we set .
A directed graph is said to be a directed acyclic graph (DAG) if it has no directed cycles.
We interpret an edge as an arrow connecting the nodes and and pointing at .
Definition 7 (Parent set, input nodes, and node level).
Let be a DAG.
We define the parent set of a node by .
We say that is an input node if , and we write for the set of input nodes.
We define the level of a node recursively as follows. If , we set . If and are defined, we set .
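The notions of Definition 7 translate directly into code. The following Python sketch computes parent sets, input nodes, and node levels for a DAG given as a node list and a list of edges. The formulas of the definition are not reproduced in this text, so two conventions are assumptions here: an edge `(u, v)` points at `v`, and the level of a non-input node is one plus the largest level among its parents. The function names are hypothetical.

```python
def parent_sets(nodes, edges):
    """Map each node to the set of its parents; an edge (u, v) points at v."""
    parents = {v: set() for v in nodes}
    for u, v in edges:
        parents[v].add(u)
    return parents

def node_levels(nodes, edges):
    """Compute levels: input nodes get 0, other nodes 1 + max level of parents.

    The max-plus-one rule is an assumption standing in for the omitted formula.
    Raises ValueError if the graph contains a directed cycle.
    """
    parents = parent_sets(nodes, edges)
    levels = {}
    in_progress = set()

    def level(v):
        if v in levels:
            return levels[v]
        if v in in_progress:
            raise ValueError("directed cycle detected")
        in_progress.add(v)
        levels[v] = 0 if not parents[v] else 1 + max(level(u) for u in parents[v])
        in_progress.discard(v)
        return levels[v]

    for v in nodes:
        level(v)
    return levels

# Small DAG with a skip connection from "a" to "d".
nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "d")]
parents = parent_sets(nodes, edges)
input_nodes = {v for v in nodes if not parents[v]}
levels = node_levels(nodes, edges)  # a and b are at level 0, c at 1, d at 2
```

The recursion terminates precisely because the graph is acyclic; on a graph with a directed cycle the sketch raises an error instead of looping.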
Since the graph in Definition 7 is assumed to be acyclic, the level is well-defined for all nodes of . We are now ready to introduce our generalized definition of a neural network.
Definition 8 (General feed-forward neural network).
A general feed-forward neural network (GFNN) is an ordered sextuple , where
is a DAG, called the architecture of ,
is the set of inputs of ,
is the set of outputs of ,
is the set of weights of , and
is the set of biases of .
The depth of a GFNN is defined as .
When translating from Definition 1 to Definition 8, we will interpret a zero weight simply as the absence of a directed edge between the nodes concerned, hence we do not allow the edges of a GFNN to have zero weight. If and are the sets of nodes of GFNNs and , respectively, and , we will say that and share the node . When dealing with several networks sharing a node , we will write for the parent set of in the architecture of , to avoid ambiguity. Note that the set of outputs of a GFNN can be an arbitrary subset of the non-input nodes. In particular, can include nodes with . Related to the concept of the parent set of a node is the concept of a subnetwork introduced next.
Definition 9 (Subnetwork and ancestor subnetwork).
Let be a GFNN. A subnetwork of is a GFNN such that there exists a set so that
, where for a set we define and , for .
If additionally , then is uniquely specified by . In this case we say that is the ancestor subnetwork of in , and write for this network.
Definition 10 (Layered feed-forward neural network).
A layered feed-forward neural network (LFNN) is a GFNN satisfying for all .
For an example of a GFNN that is not layered, see Figure 1. We notice that LFNNs correspond to neural networks as specified by Definition 1, with the nodes of level corresponding to the -th network layer. Specifically, if is an LFNN, we can label the nodes by , , and let , when and else. Incidentally, this correspondence is the reason for the indices of the weight associated with the edge of a GFNN appearing in “reverse order”. The following definition generalizes Definition 2 to GFNNs.
Definition 11 (Output maps of nodes and networks).
Let be a GFNN, and let be a nonlinearity. The map realized by a node under is the function defined recursively as follows:
If , set , for all .
Otherwise set , for all .
The map realized by under is the function given by . When dealing with several networks we will write for the map realized by in , to avoid ambiguity.
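The recursion of Definition 11 can be sketched as a memoized evaluation over the DAG. The displayed formulas are missing from this text, so the sketch assumes that an input node returns its input coordinate and that every other node applies the nonlinearity to the weighted sum of its parents' values plus its bias. Weights are keyed as `(v, u)` for the edge from `u` into `v`, mirroring the "reverse order" indexing mentioned above. The function `node_maps` and the example network are hypothetical.

```python
import math

def node_maps(parents, weights, biases, rho, x):
    """Evaluate the map realized by every node at the input point x.

    parents: dict node -> list of parent nodes (empty for input nodes).
    weights: dict (v, u) -> weight of the edge from u into v.
    biases: dict v -> bias of the non-input node v.
    x: dict input node -> real value.
    Assumes each non-input node outputs rho(weighted sum of parents + bias).
    """
    values = {}

    def value(v):
        if v not in values:
            if not parents[v]:
                values[v] = x[v]
            else:
                total = sum(weights[(v, u)] * value(u) for u in parents[v])
                values[v] = rho(total + biases[v])
        return values[v]

    for v in parents:
        value(v)
    return values

# Non-layered example: node "d" reads both the input "a" and the hidden node "c".
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "c"]}
weights = {("c", "a"): 1.0, ("c", "b"): -2.0, ("d", "a"): 0.5, ("d", "c"): 1.5}
biases = {"c": 0.5, "d": -0.1}
x = {"a": 0.3, "b": -0.2}

values = node_maps(parents, weights, biases, math.tanh, x)
output_nodes = ["d"]
network_output = [values[v] for v in output_nodes]
```

Memoizing per node means a shared node is evaluated once even when several paths lead to it, which matches the treatment of nodes as handles rather than as functions.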
We will treat nodes only as “handles”, and never as variables or functions. This is relevant when dealing with several networks with shared nodes, such as depicted in Figure 2. On the other hand, the output map realized by is a function.
In the special case when the nonlinearity is holomorphic on a neighborhood of , the output maps realized by the nodes of a network will extend to holomorphic functions on their natural domains, as given by the following definition.
Definition 12 (Natural domain).
Let be a GFNN, and let be a function holomorphic on an open domain and such that . For a node , we define the natural domain and extend the definition of the function recursively as follows:
For , let , and set , for all .
Otherwise, set , and let , for all .
It follows that the natural domain of a node is open, as it is the preimage of an open set with respect to a continuous map. Moreover, the output map realized by is holomorphic on , as it is given explicitly by a concatenation of affine maps and the nonlinearity , which are themselves holomorphic functions.
The following definition is a straightforward generalization of Definition 5.
Definition 13 (Clone pairs and the no-clones condition).
Let be a GFNN. We say that the nodes , , are clones if , , and ,