A well-known and compelling property of feedforward neural network (FFNN) models is that they are capable of approximating any continuous function over a compact set. Indeed, classical results in, e.g., [cybenko1989approximation, hornik1989multilayer], show that, given any non-constant, bounded, and continuous activation function, there exists an FFNN with a single hidden layer that can approximate any continuous function over a compact set. This universal approximation capability, together with the development of efficient algorithms for tuning the network weights, paved the way to the effective application of artificial neural networks in several frameworks, such as circuit design [zaabab1995neural], control and identification of nonlinear systems, optimization over graphs [peterson1989new], and many others.
However, when the goal of a neural network model is to construct a surrogate model for describing, and then optimizing, a complex input-output relation, it is of crucial importance that the structure of the model be well tailored for the subsequent numerical optimization phase. This is not usually the case for generic FFNNs. Indeed, if the input-output model does not satisfy certain properties (such as, for instance, convexity [boyd2004convex]), then designing the input so that the output is minimized, possibly under additional design constraints, can be an extremely difficult task.
In [CaGaPo:18], we showed that LSE-networks, that is, FFNNs with exponential activation functions in the inner layer and a logarithmic activation function in the output neuron, parametrized by a positive "temperature" parameter, provide a smooth convex model capable of approximating any convex function over a convex, compact set. Maps in the LSE class are precisely log-Laplace transforms of nonnegative measures with finite support, a remarkable class of maps enjoying smoothness and strict convexity properties. We showed in particular that if the data to be approximated satisfy some convexity assumptions, such network structures can be readily exploited to perform data-driven design by using convex optimization tools. Nevertheless, most real-life input-output maps are of a nonconvex nature; hence, while they might still be approximated via a convex model, such an approximation may not yield the desired accuracy.
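As a concrete illustration of such a model (a minimal Python sketch, not the paper's Matlab toolbox; it assumes the standard log-sum-exp form $f(x) = T \log \sum_k \exp((\alpha^{(k)} \cdot x + \beta_k)/T)$, with illustrative parameter values, and numerically checks midpoint convexity):

```python
import math
import random

def lse_t(x, alpha, beta, T=1.0):
    """Evaluate f(x) = T * log(sum_k exp((<alpha_k, x> + beta_k) / T))."""
    z = [(sum(a * v for a, v in zip(ak, x)) + bk) / T for ak, bk in zip(alpha, beta)]
    zmax = max(z)  # subtract the max before exponentiating, for numerical stability
    return T * (zmax + math.log(sum(math.exp(s - zmax) for s in z)))

# Midpoint convexity check on random parameters: f((a+b)/2) <= (f(a)+f(b))/2.
random.seed(0)
alpha = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(5)]
beta = [random.uniform(-1, 1) for _ in range(5)]
a = [random.uniform(-2, 2) for _ in range(2)]
b = [random.uniform(-2, 2) for _ in range(2)]
mid = [(u + v) / 2 for u, v in zip(a, b)]
f = lambda p: lse_t(p, alpha, beta, T=0.5)
assert f(mid) <= (f(a) + f(b)) / 2 + 1e-12
```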
The purpose of this paper is to propose a new type of neural network model, here named Difference-LSE network (DLSE), which is constructed by taking the difference of the outputs of two LSE networks. First, we prove that DLSE networks enjoy universal approximation capabilities (thus overcoming the limitations of plain convex LSE networks), see Theorem 2. By using a logarithmic transformation, DLSE networks map to a family of ratios of generalized posynomial functions, which we show to be subtraction-free universal approximators of positive functions over compact subsets of the positive orthant. Subtraction-free expressions are fundamental objects in algebraic complexity, studied in particular in [fomin]. It is a result of independent interest that subtraction-free expressions provide universal approximators.
Moreover, we show that DLSE networks are of practical interest, as they have a structure which is amenable to effective optimization over the inputs by using "DC-programming" methods, as discussed in Section VI.
Training and subsequent optimization of DLSE networks have been implemented in a numerical Matlab toolbox, named DLSE_Neural_Network, that we made publicly available (see https://github.com/Corrado-possieri/DLSE_neural_networks). The theoretical results in the paper are illustrated by an example dealing with the data-driven design of a diet for a patient with type 2 diabetes.
I-C Related work
Similar to our previous work [CaGaPo:18], on which it builds, the present paper is inspired by ideas from tropical geometry and max-plus algebra. The class of LSE functions that we study here plays a key role in Viro's patchworking methods [viro, itenberg]
for real curves. We note that the application of tropical geometry to neural networks is an emerging topic: at least two recent works have used tropical methods to provide combinatorial estimates, in terms of Newton polytopes, of the "classifying power" of neural networks with piecewise affine functions, see [charisopoulos2017morphological], [lim]. Other related results concern the "zero-temperature" ($T \to 0$) limit of the approximation problem that we consider, i.e., the representation of piecewise linear functions by elementary expressions involving min, max, and affine terms [ovchinnikov, wang], and the approximation of functions by piecewise linear functions. In particular, Th. 4.3 of [Goodfellow] shows that any continuous function can be approximated arbitrarily well on a compact domain by a difference of piecewise linear convex functions. A related approximation result, still concerning differences of structured piecewise linear convex functions, has appeared in [1903.08072]. In contrast, our results provide approximation by differences of a family of smooth structured convex functions. This smoothing is essential when deriving universal approximation results by subtraction-free rational expressions.
II Notation and technical preliminaries
Let $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}$, $\mathbb{R}_{+}$, and $\mathbb{R}_{++}$ denote the set of natural, integer, real, nonnegative real, and positive real numbers, respectively. Given a function $f$, we define its domain as $\mathrm{dom}\, f \doteq \{x : f(x) < +\infty\}$. If the function is differentiable at a point $x$, we denote by $\nabla f(x)$ its gradient at $x$.
II-A Log-Sum-Exp functions
Following [CaGaPo:18], we define LSE (Log-Sum-Exp) as the class of functions that can be written as
\[ f(x) = \log \sum_{k=1}^{K} \exp\left( \alpha^{(k)} \cdot x + \beta_k \right) \]
for some $K \in \mathbb{N}$, $\alpha^{(k)} \in \mathbb{R}^n$, $\beta_k \in \mathbb{R}$, $k = 1, \ldots, K$, where
$x \in \mathbb{R}^n$ is a vector of variables. Further, given $T > 0$, we define $\mathrm{LSE}_T$ as the class of functions that can be written as
\[ f(x) = T \log \sum_{k=1}^{K} \exp\left( \frac{\alpha^{(k)} \cdot x + \beta_k}{T} \right) \tag{3} \]
for some $K \in \mathbb{N}$, $\alpha^{(k)} \in \mathbb{R}^n$, and $\beta_k \in \mathbb{R}$, $k = 1, \ldots, K$. By letting $b_k \doteq \beta_k / T$, $k = 1, \ldots, K$, we have that functions in the family $\mathrm{LSE}_T$ can be equivalently parameterized as
\[ f(x) = T \log \sum_{k=1}^{K} \exp\left( \frac{\alpha^{(k)} \cdot x}{T} + b_k \right), \]
where the $b_k$s have no sign restrictions. It may sometimes be convenient to highlight the full parameterization of $f$, in which case we shall write $f_T(x; \alpha, \beta)$, where $\alpha \doteq (\alpha^{(1)}, \ldots, \alpha^{(K)})$ and $\beta \doteq (\beta_1, \ldots, \beta_K)$. It can be readily observed that, for any $T > 0$, the following property holds:
\[ f_T(x; \alpha, \beta) = T\, f_1(x/T;\, \alpha, \beta/T). \]
The maps in $\mathrm{LSE}_T$ are special instances of the log-Laplace transforms of nonnegative measures, studied in [klartag2012centroid]. In particular, the maps in $\mathrm{LSE}_T$ are smooth, they are convex (this is an easy consequence of the Cauchy-Schwarz inequality), and they are even strictly convex if the vectors $\alpha^{(1)}, \ldots, \alpha^{(K)}$ constitute an affine generating family of $\mathbb{R}^n$, see [CaGaPo:18, Prop. 1]. Maps of this kind play a key role in tropical geometry, in the setting of Viro's patchworking method [viro], dealing with the degeneration of real algebraic curves to a piecewise linear limit. We note in this respect that the family of functions given by (3) converges uniformly on $\mathbb{R}^n$, as $T \to 0^+$, to the function
\[ f_0(x) \doteq \max_{k = 1, \ldots, K} \left( \alpha^{(k)} \cdot x + \beta_k \right). \tag{4} \]
Actually, the following inequality holds for all $x \in \mathbb{R}^n$:
\[ f_0(x) \le f_T(x; \alpha, \beta) \le f_0(x) + T \log K, \tag{5} \]
see [CaGaPo:18] for details and background.
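These sandwich bounds are easy to check numerically. A minimal sketch, assuming the standard bounds $\max_k z_k \le T \log \sum_k \exp(z_k/T) \le \max_k z_k + T \log K$ over the affine terms $z_k = \alpha^{(k)} \cdot x + \beta_k$ (helper names are illustrative):

```python
import math
import random

def lse_t(z, T):
    """T * log(sum_k exp(z_k / T)) for a vector of affine terms z_k."""
    s = [v / T for v in z]
    m = max(s)
    return T * (m + math.log(sum(math.exp(v - m) for v in s)))

# Check: max_k z_k <= f_T <= max_k z_k + T*log(K), on random affine terms.
random.seed(1)
T, K = 0.3, 7
for _ in range(100):
    z = [random.uniform(-5, 5) for _ in range(K)]
    val = lse_t(z, T)
    assert max(z) - 1e-9 <= val <= max(z) + T * math.log(K) + 1e-9
```

As $T \to 0^+$, the gap $T \log K$ vanishes, consistently with the uniform convergence to the max-affine limit.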
II-B Posynomials and GPOS functions
Given $c \in \mathbb{R}_{++}$ and $a \in \mathbb{R}^n$, a positive monomial is a product of the form $c\, x_1^{a_1} \cdots x_n^{a_n}$. A posynomial is a finite sum of positive monomials,
\[ \psi(x) = \sum_{k=1}^{K} c_k\, x_1^{a_1^{(k)}} \cdots x_n^{a_n^{(k)}}. \tag{6} \]
We denote by POS the class of functions of the form (6). Posynomials are log-log-convex functions, meaning that the log of a posynomial is convex in the log of its argument, see, e.g., Section II.B of [CaGaPo:18]. We denote by $\mathrm{GPOS}_T$ the class of functions that can be expressed as
\[ F(x) = \psi(x)^{T} \tag{7} \]
for some $\psi \in \mathrm{POS}$ and $T > 0$. These functions are log-log-convex, and they form a subset of the so-called generalized posynomial functions, see, e.g., Section II.B of [CaGaPo:18]. It is observed in Proposition 3 of [CaGaPo:18] that $\mathrm{LSE}_T$ and $\mathrm{GPOS}_T$ functions are related by a one-to-one correspondence. That is, for any $f \in \mathrm{LSE}_T$ and $x \in \mathbb{R}_{++}^n$ it holds that
\[ F(x) = \exp\big( f(\log x) \big) \in \mathrm{GPOS}_T, \]
where the $\log$ is taken entry-wise.
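The log-log correspondence between posynomials and Log-Sum-Exp maps can be verified numerically; a minimal sketch in the $T = 1$ case, where the posynomial coefficients are $c_k = \exp(\beta_k)$ (helper names are illustrative):

```python
import math
import random

def lse(y, alpha, beta):
    """f(y) = log(sum_k exp(<alpha_k, y> + beta_k))  (the T = 1 case)."""
    z = [sum(a * v for a, v in zip(ak, y)) + bk for ak, bk in zip(alpha, beta)]
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def posynomial(x, alpha, coeff):
    """F(x) = sum_k c_k * prod_i x_i^{a_i^{(k)}}, with c_k > 0."""
    return sum(c * math.prod(xi ** a for xi, a in zip(x, ak))
               for ak, c in zip(alpha, coeff))

random.seed(2)
alpha = [[random.uniform(-1, 2) for _ in range(3)] for _ in range(4)]
beta = [random.uniform(-1, 1) for _ in range(4)]
coeff = [math.exp(b) for b in beta]      # c_k = exp(beta_k)
x = [random.uniform(0.5, 2.0) for _ in range(3)]
y = [math.log(v) for v in x]
# F(x) = exp(f(log x)): the log-log transform maps posynomials to LSE functions.
assert abs(posynomial(x, alpha, coeff) - math.exp(lse(y, alpha, beta))) < 1e-9
```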
III A universal approximation theorem
III-A Preliminary: approximation of convex functions
We start by recalling a key result of [CaGaPo:18], stating that functions in $\mathrm{LSE}_T$ are universal smooth approximators of convex functions, see Theorem 2 in [CaGaPo:18].
Theorem 1 (Universal approximators of convex functions, [CaGaPo:18]).
Let $f$ be a real-valued continuous convex function defined on a compact convex subset $\mathcal{X} \subset \mathbb{R}^n$. Then, for all $\varepsilon > 0$ there exist $T > 0$ and a function $\hat{f} \in \mathrm{LSE}_T$ such that
\[ |f(x) - \hat{f}(x)| \le \varepsilon, \quad \forall x \in \mathcal{X}. \]
We now extend the above result by showing that it actually holds for the restricted class of $\mathrm{LSE}_T$ functions with rational parameters. This extension will allow us to apply our results to the approximation by subtraction-free expressions.
The following corollary holds.
Corollary 1 (LSE approximation with rational parameters).
Let $f$, $\mathcal{X}$, and $\varepsilon$ be as in Theorem 1. Then, the approximating function $\hat{f} \in \mathrm{LSE}_T$ can be taken with rational parameters, for some $T = 1/N$ where $N$ is a positive integer.

Proof: First, inspecting the proof of Theorem 2 in [CaGaPo:18], one obtains that $f$ can be approximated uniformly by a map $\hat{f} \in \mathrm{LSE}_T$ for all $T$ small enough; hence we can always assume that $T$ is of the form $1/N$ for some positive integer $N$. It then remains to be proved that the approximation result still holds if also $\alpha$ and $\beta$ are rational. To this end, let us study the effect of a perturbation of these parameters on the map $\hat{f}$. Observe that the map $g: \mathbb{R}^K \to \mathbb{R}$, $g(z) \doteq T \log \sum_{k=1}^{K} \exp(z_k/T)$, satisfies
\[ |g(z) - g(z')| \le \|z - z'\|_\infty \tag{9} \]
for all $z, z' \in \mathbb{R}^K$, where $\|\cdot\|_\infty$ is the sup-norm. This follows from the fact that $g$ is order preserving and commutes with the addition of a constant, see e.g. [CT80] and Section 2 of [1605.04518]. It follows from (9) that if $\hat{f}$ is as in (3), and if
\[ \tilde{f}(x) \doteq T \log \sum_{k=1}^{K} \exp\left( \frac{\tilde{\alpha}^{(k)} \cdot x + \tilde{\beta}_k}{T} \right) \]
denotes its perturbed version, then, letting $z(x) \doteq (\alpha^{(1)} \cdot x + \beta_1, \ldots, \alpha^{(K)} \cdot x + \beta_K)$, and $\tilde{z}(x)$ its analogue for the parameters $(\tilde{\alpha}, \tilde{\beta})$, we get from (9)
\[ |\hat{f}(x) - \tilde{f}(x)| \le \|z(x) - \tilde{z}(x)\|_\infty \]
for all $x \in \mathcal{X}$. Hence, choosing $(\tilde{\alpha}, \tilde{\beta})$ to be a rational approximation of $(\alpha, \beta)$ such that $\sup_{x \in \mathcal{X}} \|z(x) - \tilde{z}(x)\|_\infty \le \varepsilon/2$, and supposing that $\hat{f}$ is an $(\varepsilon/2)$-approximation of $f$ on $\mathcal{X}$, we deduce that $\tilde{f}$ is an $\varepsilon$-approximation of $f$ on $\mathcal{X}$, from which the statement of the corollary follows. ∎
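The sup-norm nonexpansiveness used in this proof admits a quick numerical check. A minimal sketch, assuming the map $g(z) = T \log \sum_k \exp(z_k/T)$ (which is order preserving and commutes with the addition of a constant, hence nonexpansive in the sup-norm):

```python
import math
import random

def g(z, T):
    """The order-preserving map z -> T * log(sum_k exp(z_k / T))."""
    m = max(z)
    return m + T * math.log(sum(math.exp((v - m) / T) for v in z))

# Nonexpansiveness: |g(z) - g(w)| <= ||z - w||_inf, on random vectors.
random.seed(3)
T = 0.25
for _ in range(100):
    z = [random.uniform(-3, 3) for _ in range(6)]
    w = [random.uniform(-3, 3) for _ in range(6)]
    sup = max(abs(a - b) for a, b in zip(z, w))
    assert abs(g(z, T) - g(w, T)) <= sup + 1e-12
```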
III-B Approximation of general continuous functions
This section contains our main result on universal approximation of continuous functions. To this end, we first define the class of functions that can be expressed as the difference of two functions in $\mathrm{LSE}_T$.
Definition 2 (DLSE functions).
We say that a function $f$ belongs to the class $\mathrm{DLSE}_T$ if $f = f_1 - f_2$ for some $f_1, f_2 \in \mathrm{LSE}_T$. Further, we say that $f \in \mathrm{DLSE}_T$ has rational parameters if $f_1$ and $f_2$ have rational parameters.
The following result shows that any continuous function can be approximated uniformly by a function in a $\mathrm{DLSE}_T$ class.
Theorem 2 (Universal approximation property of DLSE).
Let $f$ be a real-valued continuous function defined on a compact, convex subset $\mathcal{X} \subset \mathbb{R}^n$. Then, for any $\varepsilon > 0$ there exists a function $\hat{f} \in \mathrm{DLSE}_T$ with rational parameters, for some $T = 1/N$ where $N$ is a positive integer, such that $|f(x) - \hat{f}(x)| \le \varepsilon$, $\forall x \in \mathcal{X}$.
Proof: A classical result of convex analysis states that any continuous function $f$ defined on a compact convex subset $\mathcal{X}$ of $\mathbb{R}^n$ can be written as the difference $f = g - h$, where $g, h$ are continuous, convex functions defined on $\mathcal{X}$, see, e.g., Proposition 2.2 of [borwein]. Then, by Corollary 1, for all $\varepsilon > 0$, we can find a rational $T_1 > 0$ and a function $\hat{g} \in \mathrm{LSE}_{T_1}$ with rational parameters such that $|g(x) - \hat{g}(x)| \le \varepsilon/2$ holds for all $x \in \mathcal{X}$. Similarly, we can find a rational $T_2 > 0$ and a function $\hat{h} \in \mathrm{LSE}_{T_2}$ with rational parameters such that $|h(x) - \hat{h}(x)| \le \varepsilon/2$ holds for all $x \in \mathcal{X}$. Hence, by taking any rational $T > 0$ such that $T_1$ and $T_2$ are integer multiples of $T$, it follows from the nesting property in Lemma 1 of [CaGaPo:18] that $\hat{g}$ and $\hat{h}$ both belong to $\mathrm{LSE}_T$. Thus, there exist a rational $T > 0$ and $\hat{g}, \hat{h} \in \mathrm{LSE}_T$ such that, for all $x \in \mathcal{X}$,
\[ |g(x) - \hat{g}(x)| \le \varepsilon/2, \qquad |h(x) - \hat{h}(x)| \le \varepsilon/2. \]
Summing these conditions, we obtain that $|g(x) - h(x) - (\hat{g}(x) - \hat{h}(x))| \le \varepsilon$ for all $x \in \mathcal{X}$. The claim then immediately follows by recalling that $f(x) = g(x) - h(x)$ for all $x \in \mathcal{X}$, and letting $\hat{f} \doteq \hat{g} - \hat{h}$, whence $\hat{f} \in \mathrm{DLSE}_T$. ∎
The following explicit example illustrates the approximation of a nonconvex and nondifferentiable function by a function in $\mathrm{DLSE}_T$.
Observe that , which is indeed a difference of two nonsmooth convex functions. By using (5), we can approximate each term of this difference by a function in $\mathrm{LSE}_T$ as
It follows that the map
is in $\mathrm{DLSE}_T$ and satisfies the following uniform approximation property:
The previous explicit approximation carries over to a continuous piecewise-affine function $f$ of a single real variable, as follows. By piecewise affine, we mean that $\mathbb{R}$ can be covered by finitely many intervals in such a way that $f$ is affine over each of these intervals. Then, $f$ can be written in a unique way as
\[ f(x) = ax + b + \frac{1}{2} \sum_{i=1}^{m} c_i |x - x_i|, \tag{10} \]
where $a, b$ are real parameters, $x_1 < \cdots < x_m$ are the nondifferentiability points of $f$, and $c_i$ is the jump of the derivative of $f$ at point $x_i$. Another way to get insight into (10) is to make the following observation: the function $f$ has a second derivative in the distribution sense, $f'' = \sum_{i=1}^{m} c_i \delta_{x_i}$, where $\delta_{x_i}$ is the Dirac measure at $x_i$; then (10) is obtained by integrating twice the latter expression of $f''$. Possibly after subtracting an affine function from $f$, we will always assume that $a = b = 0$.
and note that
Then, setting where
and using (5), we get
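The piecewise-affine representation above can be checked on a small example. A minimal sketch, assuming the form $f(x) = ax + b + \frac{1}{2}\sum_i c_i |x - x_i|$ with $c_i$ the derivative jumps; here $f(x) = \max(0, x, 2x - 1)$ has kinks at $0$ and $1$, each with jump $1$, and affine part $a = 1$, $b = -1/2$:

```python
def f(x):
    """A continuous piecewise-affine function with kinks at x = 0 and x = 1."""
    return max(0.0, x, 2.0 * x - 1.0)

def rep(x):
    """Representation a*x + b + (1/2) * sum_i c_i * |x - x_i|:
    derivative jumps c_1 = c_2 = 1 at the kinks x_1 = 0, x_2 = 1,
    and affine part a = 1, b = -1/2."""
    return x - 0.5 + 0.5 * (abs(x) + abs(x - 1.0))

# The two expressions agree on a grid covering all three affine pieces.
for k in range(-40, 41):
    x = k / 10.0
    assert abs(f(x) - rep(x)) < 1e-12
```

Each absolute-value term can then be smoothed by a two-term Log-Sum-Exp map, which is how the explicit approximation is built.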
III-C Data approximation
Consider a collection of data points,
\[ \mathcal{D} \doteq \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{m}, \tag{11} \]
where $x^{(i)} \in \mathbb{R}^n$, $y^{(i)} = \phi(x^{(i)})$, $i = 1, \ldots, m$, and $\phi$ is an unknown function. The following universal data approximation result holds.
Corollary 2 (Universal data approximation).
Given a collection of data as in (11), for any $\varepsilon > 0$ there exist $T > 0$ and a function $\hat{f} \in \mathrm{DLSE}_T$ with rational coefficients such that
\[ |y^{(i)} - \hat{f}(x^{(i)})| \le \varepsilon, \quad i = 1, \ldots, m. \]
Proof: Let $\mathcal{X}$ be the convex hull of the input data points $x^{(1)}, \ldots, x^{(m)}$. Consider a triangulation of the input points: recall that such a triangulation consists of a finite collection of simplices satisfying the following properties: (i) the vertices of these simplices are taken among the points $x^{(1)}, \ldots, x^{(m)}$; (ii) each point $x^{(i)}$ is the vertex of at least one simplex; (iii) the interiors of these simplices have pairwise empty intersections; and (iv) the union of these simplices is precisely $\mathcal{X}$. Then, there is a unique continuous function $f$, affine on each simplex, and such that $f(x^{(i)}) = y^{(i)}$ for $i = 1, \ldots, m$. Observe that $\mathcal{X}$ is convex and compact by construction. Now, a direct application of Theorem 2 shows that for any $\varepsilon > 0$ there exist $T > 0$ and a function $\hat{f} \in \mathrm{DLSE}_T$ with rational coefficients such that $|f(x) - \hat{f}(x)| \le \varepsilon$ for all $x \in \mathcal{X}$,
which concludes the proof. ∎
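The interpolation step in this proof is easy to visualize in one dimension, where the simplices are just intervals. A minimal illustrative sketch (the 1-D special case of the triangulation argument; helper names are ours):

```python
import bisect

def pwa_interpolant(xs, ys):
    """Return the unique continuous function, affine on each interval
    [xs[j-1], xs[j]] (the 1-D analogue of a simplex), interpolating (xs, ys).
    Assumes xs is sorted and queries stay inside [xs[0], xs[-1]]."""
    def f(x):
        j = max(1, min(len(xs) - 1, bisect.bisect_right(xs, x)))
        t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
        return (1 - t) * ys[j - 1] + t * ys[j]
    return f

xs = [0.0, 1.0, 2.5, 4.0]
ys = [1.0, -2.0, 0.5, 3.0]
f = pwa_interpolant(xs, ys)
assert all(abs(f(x) - y) < 1e-12 for x, y in zip(xs, ys))  # reproduces the data
assert abs(f(0.5) - (-0.5)) < 1e-12                         # affine in between
```

Theorem 2 is then applied to this continuous piecewise-affine interpolant over the convex hull of the inputs.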
IV Positive functions on the positive orthant
In this section we discuss approximation results for functions taking positive values on the open positive orthant. A particular case of this class of functions is given by log-log-convex functions, whose uniform approximation by means of $\mathrm{GPOS}_T$ functions was discussed in Corollary 1 of [CaGaPo:18]. We shall first extend this result to $\mathrm{GPOS}_T$ functions with rational parameters, and then provide a universal approximation result for continuous positive functions over the open positive orthant.
IV-A Uniform approximation results
The following preliminary definitions are instrumental for our purposes: a subset $\mathcal{X} \subseteq \mathbb{R}_{++}^n$ will be said to be log-convex if its image by the map that takes the logarithm entry-wise is convex. We shall say that a function $F \in \mathrm{GPOS}_T$ has rational parameters if it can be written as in (7), with $\psi$ given by (6), in such a way that $T$, the entries of the vectors $a^{(1)}, \ldots, a^{(K)}$, and the scalars $c_1, \ldots, c_K$ are rational numbers. The following corollary extends Corollary 1 of [CaGaPo:18].
Corollary 3 (Universal approximators of log-log-convex functions).
Let $F$ be a log-log-convex function defined on a compact, log-convex subset $\mathcal{X}$ of $\mathbb{R}_{++}^n$. Then, for all $\varepsilon > 0$ there exists a function $\hat{F} \in \mathrm{GPOS}_T$ with rational parameters, for some $T = 1/N$ where $N$ is a positive integer, such that, for all $x \in \mathcal{X}$,
\[ (1 + \varepsilon)^{-1} F(x) \le \hat{F}(x) \le (1 + \varepsilon) F(x). \]
Proof: By using the log-log transformation, define $f(y) \doteq \log F(\exp(y))$, where the $\exp$ is taken entry-wise. Since $F$ is log-log-convex in $x$, $f$ is convex in $y$. Furthermore, the set $\mathcal{Y} \doteq \log \mathcal{X}$ is convex and compact, since the set $\mathcal{X}$ is log-convex and compact. Thus, by Corollary 1, for all $\varepsilon' > 0$ there exist $T = 1/N$, where $N$ is a positive integer, and a function $\hat{f} \in \mathrm{LSE}_T$ with rational coefficients such that $|f(y) - \hat{f}(y)| \le \varepsilon'$ for all $y \in \mathcal{Y}$. From this point on, the proof follows the very same lines as the proof of Corollary 1 of [CaGaPo:18]. ∎
We next state an approximation result for functions on the positive orthant. The derivation of Theorem 3 from Corollary 3 is similar to the derivation of Theorem 2 from Corollary 1, and thus we omit its proof.
Theorem 3 (Universal approximators of functions on the open orthant).
Let $F$ be a continuous positive function defined on a compact log-convex subset $\mathcal{X} \subset \mathbb{R}_{++}^n$. Then, for all $\varepsilon > 0$ there exist two functions $\hat{F}_1, \hat{F}_2 \in \mathrm{GPOS}_T$ with rational parameters, for some $T = 1/N$ where $N$ is a positive integer, such that, for all $x \in \mathcal{X}$,
\[ (1 + \varepsilon)^{-1} F(x) \le \frac{\hat{F}_1(x)}{\hat{F}_2(x)} \le (1 + \varepsilon) F(x). \]
IV-B Universal approximation by subtraction-free expressions
We next derive from Theorem 3 an approximation result by subtraction-free expressions. The latter are an important subclass of rational expressions, studied in [fomin]. Subtraction-free expressions are well-formed expressions in several commutative variables $x_1, \ldots, x_n$, defined using the operations $+$, $\times$, $/$ and using positive constants, but not using subtraction. Formally, a subtraction-free expression in the variables $x_1, \ldots, x_n$ is a term produced by the context-free grammar rule
\[ E \;::=\; E + E \;\mid\; E \times E \;\mid\; E / E \;\mid\; x_1 \;\mid\; \cdots \;\mid\; x_n \;\mid\; c, \]
where $c$ can take the value of any positive constant. For instance, $(x^3 + y^3)/(x + y)$ is a subtraction-free expression, whereas $x^2 - xy + y^2$ is not a subtraction-free expression, owing to the presence of the $-$ sign. Note that $x^2 - xy + y^2$, thought of as a formal rational fraction, coincides with $(x^3 + y^3)/(x + y)$, which is subtraction-free; i.e., an expression which is not subtraction-free may well have an equivalent subtraction-free expression. However, there are rational fractions, and even polynomials, like $(x - y)^2$, without equivalent subtraction-free expressions, because any subtraction-free expression must take positive values on the interior of the positive cone, whereas $(x - y)^2$ vanishes on the line $x = y$. Important examples of subtraction-free expressions arise from series-parallel composition rules for resistances. More advanced examples, coming from algebraic combinatorics, are discussed in [fomin].
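The classic identity behind this discussion, $x^2 - xy + y^2 = (x^3 + y^3)/(x + y)$, is easy to confirm numerically on the positive orthant (a minimal sketch; variable names are illustrative):

```python
import random

# E(x, y) = (x^3 + y^3) / (x + y) uses only +, *, / : it is subtraction-free.
sf = lambda x, y: (x**3 + y**3) / (x + y)
# p(x, y) = x^2 - x*y + y^2 contains a minus sign, yet defines the same fraction.
p = lambda x, y: x**2 - x*y + y**2

random.seed(4)
for _ in range(100):
    x, y = random.uniform(0.1, 10), random.uniform(0.1, 10)
    assert abs(sf(x, y) - p(x, y)) < 1e-9 * max(1.0, p(x, y))

# By contrast, (x - y)^2 vanishes on the line x = y inside the positive orthant,
# so no subtraction-free expression can be equivalent to it.
assert (3.0 - 3.0) ** 2 == 0.0
```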
Corollary 4 (Approximation by subtraction-free expressions).
Let $F$ be a continuous positive function defined on a compact log-convex subset $\mathcal{X} \subset \mathbb{R}_{++}^n$. Then, for all $\varepsilon > 0$ there exist a positive integer $N$ and a subtraction-free expression $E$ in $n$ variables such that the function
$\hat{F}(x) \doteq E(x_1^{1/N}, \ldots, x_n^{1/N})$, in which $x_i^{1/N}$ is substituted to the $i$-th variable of $E$, satisfies, for all $x \in \mathcal{X}$,
\[ (1 + \varepsilon)^{-1} F(x) \le \hat{F}(x) \le (1 + \varepsilon) F(x). \]
Proof: By Theorem 3, there exist $\hat{F}_1, \hat{F}_2 \in \mathrm{GPOS}_T$ with rational parameters such that $\hat{F} \doteq \hat{F}_1 / \hat{F}_2$ satisfies the required bounds, where the exponent vectors and $T$ have rational entries. Denoting by $N$ the least common multiple of the denominators of these rational entries, we see that $\hat{F}$ is precisely of the form $E(x_1^{1/N}, \ldots, x_n^{1/N})$, where $E$ is a subtraction-free rational expression. ∎
IV-C Approximation of positive data
Consider a collection of data pairs,
\[ \mathcal{D} \doteq \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{m}, \]
where $x^{(i)} \in \mathbb{R}_{++}^n$ and $y^{(i)} = \phi(x^{(i)}) \in \mathbb{R}_{++}$, $i = 1, \ldots, m$, where $\phi$ is an unknown function. The data in $\mathcal{D}$ are referred to as positive data. The following proposition is now an immediate consequence of Theorem 3, where $\mathcal{X}$ can be taken as the log-convex hull of the input data points (for given $x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}_{++}^n$, we define their log-convex hull as the set of vectors $\exp\big( \sum_{i=1}^{m} \lambda_i \log x^{(i)} \big)$, where $\lambda_i \ge 0$, $\sum_{i=1}^{m} \lambda_i = 1$, and the $\exp$ and $\log$ are taken entry-wise).
Given positive data $\mathcal{D}$, for any $\varepsilon > 0$ there exist a rational $T > 0$ and two functions $\hat{F}_1, \hat{F}_2 \in \mathrm{GPOS}_T$ with rational parameters such that
\[ (1 + \varepsilon)^{-1} y^{(i)} \le \frac{\hat{F}_1(x^{(i)})}{\hat{F}_2(x^{(i)})} \le (1 + \varepsilon)\, y^{(i)}, \quad i = 1, \ldots, m. \]
V DLSE neural networks
Functions in $\mathrm{DLSE}_T$ can be modeled through a feedforward neural network (FFNN) architecture, composed of two LSE networks in parallel, whose outputs are fused via an output difference node, see Fig. 1.
It may sometimes be convenient to highlight the full parameterization of the input-output function synthesized by the network, in which case we shall write
\[ \hat{f}(x) = f_1(x; \alpha, \beta) - f_2(x; \gamma, \delta), \]
where $\alpha$, $\beta$, $\gamma$, and $\delta$ are the parameter vectors of the two LSE components.
Each LSE component has $n$ input nodes, one hidden layer with $K$ nodes, and one output node. The activation function of the hidden nodes is $\exp(\cdot)$, and the activation of the output node of each component is $T \log(\cdot)$. Each node $k$ in the hidden layer of the first component network computes a term of the form $(\alpha^{(k)} \cdot x + \beta_k)/T$, where the $i$-th entry of $\alpha^{(k)}$ represents the weight between node $k$ and input $i$, and $\beta_k$ is the bias term of node $k$. Each node of the first network thus generates the activations
\[ \exp\left( \frac{\alpha^{(k)} \cdot x + \beta_k}{T} \right), \quad k = 1, \ldots, K. \]
We consider the weights from the inner nodes to the output node to be unitary, whence the output node of the first network computes $\sum_{k=1}^{K} \exp\big( (\alpha^{(k)} \cdot x + \beta_k)/T \big)$ and then, according to the output activation function, the output layer returns the value
\[ f_1(x; \alpha, \beta) = T \log \sum_{k=1}^{K} \exp\left( \frac{\alpha^{(k)} \cdot x + \beta_k}{T} \right). \]
An identical reasoning applies to the output $f_2(x; \gamma, \delta)$ of the second component network. The overall output realizes a $\mathrm{DLSE}_T$ function which, by Theorem 2, allows us to approximate any continuous function over a compact convex domain. Similarly, by Corollary 2, we can approximate data via a DLSE network, to any given precision.
Given a collection of data $\mathcal{D}$ as in (11), for each $\varepsilon > 0$ there exists a DLSE neural network such that
\[ |y^{(i)} - \hat{f}(x^{(i)})| \le \varepsilon, \quad i = 1, \ldots, m, \]
where $\hat{f}$ is the input-output function of the network.
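The forward pass of such a two-component architecture can be sketched in a few lines. A minimal illustration (not the paper's toolbox; the hand-picked parameters below make the network a smooth approximation of the concave, hence nonconvex, hat function $\min(x, 2 - x) = 2 - \max(x, 2 - x)$):

```python
import math

def lse_t(x, alpha, beta, T):
    """One LSE component (scalar input): T * log(sum_k exp((alpha_k*x + beta_k)/T))."""
    z = [(a * x + b) / T for a, b in zip(alpha, beta)]
    m = max(z)
    return T * (m + math.log(sum(math.exp(v - m) for v in z)))

def dlse(x, params, T):
    """DLSE output: difference of the two parallel LSE components."""
    (alpha, beta), (gamma, delta) = params
    return lse_t(x, alpha, beta, T) - lse_t(x, gamma, delta, T)

T = 0.05
params = (([0.0], [2.0]),             # first component: the constant term 2
          ([1.0, -1.0], [0.0, 2.0]))  # second component: the terms x and 2 - x
for x in [-1.0, 0.0, 0.5, 1.0, 2.0, 3.0]:
    hat = min(x, 2.0 - x)
    # by (5), the smooth output stays within T*log(2) of the hat function
    assert hat - T * math.log(2) - 1e-9 <= dlse(x, params, T) <= hat + 1e-9
```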
V-A Training DLSE networks
As a matter of fact, given the parameter vectors $\alpha$, $\beta$, $\gamma$, $\delta$, it can be easily derived that
\[ \hat{f}(x) = f_1(x; \alpha, \beta) - f_2(x; \gamma, \delta) = T \big( f_1(x/T;\, \alpha, \beta/T) - f_2(x/T;\, \gamma, \delta/T) \big), \]
and hence dealing with a $\mathrm{DLSE}_T$ neural network corresponds to dealing with a $\mathrm{DLSE}_1$ neural network whose input and output have been suitably rescaled. Each of the two LSE components of the network realizes an input-output map of the form
\[ f(x) = T \log \sum_{k=1}^{K} \exp\left( \frac{\alpha^{(k)} \cdot x + \beta_k}{T} \right). \]
The gradients of this function with respect to its parameters $\alpha^{(k)}$, $\beta_k$ are, for $k = 1, \ldots, K$,
\[ \nabla_{\alpha^{(k)}} f(x) = w_k(x)\, x, \qquad \frac{\partial f(x)}{\partial \beta_k} = w_k(x), \]
where
\[ w_k(x) \doteq \frac{\exp\big( (\alpha^{(k)} \cdot x + \beta_k)/T \big)}{\sum_{j=1}^{K} \exp\big( (\alpha^{(j)} \cdot x + \beta_j)/T \big)}. \]
Thus, by using the chain rule, the gradient of the overall network output with respect to the full parameter vector is readily obtained from the gradients of its two LSE components.
Given a dataset as in (11), these gradients can be used to train a DLSE network by using classical algorithms such as the Levenberg-Marquardt algorithm [marquardt1963algorithm], the Fletcher-Powell conjugate gradient method [davidon1991variable], or the stochastic gradient descent algorithm [bertsekas1999nonlinear]. In numerical practice, one may fix the parameters $T$ and $K$ and train the network with respect to the parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ by using one of the methods mentioned above, until a satisfactory cross-validated fit is found. A suitable initial value for the parameter $T$ may be set, for instance, to the inverse of the mid output range. Alternatively, $T$ can be considered as a trainable variable as well, and computed by the training algorithm alongside the other parameters.
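The gradient expressions above (softmax-weighted terms) can be validated against finite differences. A minimal sketch in the $T = 1$ case, assuming $\partial f / \partial \beta_k = w_k$ and $\partial f / \partial \alpha^{(k)}_i = w_k x_i$ with $w_k$ the softmax weights (helper names are illustrative):

```python
import math
import random

def lse1(x, alpha, beta):
    """One LSE component with T = 1: log(sum_k exp(<alpha_k, x> + beta_k))."""
    z = [sum(a * v for a, v in zip(ak, x)) + bk for ak, bk in zip(alpha, beta)]
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def lse1_grads(x, alpha, beta):
    """Analytic gradients: with softmax weights w_k = exp(z_k)/sum_j exp(z_j),
    df/dbeta_k = w_k and df/dalpha_{k,i} = w_k * x_i."""
    z = [sum(a * v for a, v in zip(ak, x)) + bk for ak, bk in zip(alpha, beta)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    w = [v / s for v in e]
    return [[wk * xi for xi in x] for wk in w], w

# Central finite-difference check of the beta-gradient.
random.seed(5)
alpha = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
beta = [random.uniform(-1, 1) for _ in range(3)]
x = [0.7, -0.3]
_, dbeta = lse1_grads(x, alpha, beta)
h = 1e-6
for k in range(3):
    bp, bm = beta[:], beta[:]
    bp[k] += h
    bm[k] -= h
    fd = (lse1(x, alpha, bp) - lse1(x, alpha, bm)) / (2 * h)
    assert abs(fd - dbeta[k]) < 1e-6
```

The same weights $w_k$ also feed the chain rule for the overall difference network, since the two components are differentiated independently.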
The following example illustrates the application of the Levenberg-Marquardt algorithm to a simple case.
Consider the function ,
Such a function, which is clearly nonconvex, has been used to generate the dataset , where each has been taken uniformly at random in and , . The Levenberg-Marquardt algorithm has been used to train a DLSE network fitting such data with and . Fig. 3 depicts the output of the DLSE network and of its two LSE components and .
VI Non-convex optimization via DLSE networks
In view of the results established in Sections III and V, DLSE networks can be effectively used to compute an approximate difference-of-convex (DC) decomposition of any continuous function over a compact set. Indeed, by using the tools described in Section V, given any continuous function defined on a convex compact set (or, more generally, a dataset generated through any function) and $\varepsilon > 0$, we can determine $\hat{f} = f_1 - f_2$, with $f_1, f_2 \in \mathrm{LSE}_T$, such that