A Universal Approximation Result for Difference of log-sum-exp Neural Networks

05/21/2019 · Giuseppe C. Calafiore, et al.

We show that a neural network whose output is obtained as the difference of the outputs of two feedforward networks with exponential activation function in the hidden layer and logarithmic activation function in the output node (LSE networks) is a smooth universal approximator of continuous functions over convex, compact sets. By using a logarithmic transform, this class of networks maps to a family of subtraction-free ratios of generalized posynomials, which we also show to be universal approximators of positive functions over log-convex, compact subsets of the positive orthant. The main advantage of Difference-LSE networks with respect to classical feedforward neural networks is that, after a standard training phase, they provide surrogate models for design that possess a specific difference-of-convex-functions form, which makes them optimizable via relatively efficient numerical methods. In particular, by adapting an existing difference-of-convex algorithm to these models, we obtain an algorithm for performing effective optimization-based design. We illustrate the proposed approach by applying it to data-driven design of a diet for a patient with type-2 diabetes.


I Introduction

I-A Motivation

A well-known and compelling property of feedforward neural network (NN) models is that they are capable of approximating any continuous function over a compact set. Indeed, classical results in, e.g., [cybenko1989approximation, hornik1989multilayer], show that a NN with a single hidden layer and a non-constant, bounded, and continuous activation function can approximate any continuous function over a compact set. This universal approximation capability, together with the development of efficient algorithms to tune the network weights, paved the way for the application of artificial neural networks in several frameworks, such as circuit design [zaabab1995neural], control and identification of nonlinear systems [221350], optimization over graphs [peterson1989new], and many others.

However, when the goal of a neural network model is to construct a surrogate model for describing, and then optimizing, a complex input-output relation, it is of crucial importance that the structure of the model be well tailored for the subsequent numerical optimization phase. This is not usually the case for generic NN models. Indeed, if the input-output model does not satisfy certain properties (such as, for instance, convexity [boyd2004convex]), then designing the input so that the output is minimized, possibly under additional design constraints, can be an extremely difficult task.

In [CaGaPo:18], we showed that $\mathrm{LSE}_T$ networks, that is, feedforward networks with exponential activation functions in the inner layer and a logarithmic activation function in the output neuron, parametrized by a positive "temperature" parameter $T$, provide a smooth convex model capable of approximating any convex function over a convex, compact set. Maps in the $\mathrm{LSE}_T$ class are precisely log-Laplace transforms of nonnegative measures with finite support, a remarkable class of maps enjoying smoothness and strict convexity properties. We showed in particular that if the data to be approximated satisfy suitable convexity assumptions, such network structures can be readily exploited to perform data-driven design by using convex optimization tools. Nevertheless, most real-life input-output maps are nonconvex in nature; hence, while they might still be approximated via a convex model, such an approximation may not yield the desired accuracy.

I-B Contribution

The purpose of this paper is to propose a new type of neural network model, here named Difference-LSE network (DLSE), which is constructed by taking the difference of the outputs of two $\mathrm{LSE}_T$ networks. First, we prove that DLSE networks guarantee universal approximation capabilities (thus overcoming the limitations of plain convex $\mathrm{LSE}_T$ networks), see Theorem 2. By using a logarithmic transformation, DLSE networks map to a family of ratios of generalized posynomial functions, which we show to be subtraction-free universal approximators of positive functions over compact subsets of the positive orthant. Subtraction-free expressions are fundamental objects in algebraic complexity, studied in particular in [fomin]. It is a result of independent interest that subtraction-free expressions provide universal approximators.

Moreover, we show that Difference-LSE networks are of practical interest, as they have a structure which is amenable to effective optimization over the inputs via "DC-programming" methods, as discussed in Section VI.

Training and subsequent optimization of DLSE networks have been implemented in a numerical Matlab toolbox named DLSE_Neural_Network, which we made publicly available (see https://github.com/Corrado-possieri/DLSE_neural_networks). The theoretical results in the paper are illustrated by an example dealing with data-driven design of a diet for a patient with type-2 diabetes.

I-C Related work

Similarly to our previous work [CaGaPo:18], on which it builds, the present paper is inspired by ideas from tropical geometry and max-plus algebra. The class of $\mathrm{LSE}_T$ functions that we study here plays a key role in Viro's patchworking methods for real curves [viro, itenberg]. We note that the application of tropical geometry to neural networks is an emerging topic: at least two recent works have used tropical methods to provide combinatorial estimates, in terms of Newton polytopes, of the "classifying power" of neural networks with piecewise affine functions, see [charisopoulos2017morphological, lim]. Other related results concern the "zero-temperature" ($T \to 0$) limit of the approximation problem that we consider, i.e., the representation of piecewise linear functions by elementary expressions involving min, max, and affine terms [ovchinnikov, wang], and the approximation of functions by piecewise linear functions. In particular, Th. 4.3 of [Goodfellow] shows that any continuous function can be approximated arbitrarily well on a compact domain by a difference of piecewise linear convex functions. A related approximation result, still concerning differences of structured piecewise linear convex functions, appeared in [1903.08072]. In contrast, our results provide approximation by differences of a family of smooth structured convex functions. This smoothing is essential when deriving universal approximation results by subtraction-free rational expressions.

II Notation and technical preliminaries

Let $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}$, $\mathbb{R}_+$, and $\mathbb{R}_{++}$ denote the sets of natural, integer, real, nonnegative real, and positive real numbers, respectively. Given a function $f$, we denote its domain by $\mathrm{dom}\, f$. If the function $f$ is differentiable at a point $x$, we denote by $\nabla f(x)$ its gradient at $x$.

II-A Log-Sum-Exp functions

Following [CaGaPo:18], we define $\mathrm{LSE}$ (Log-Sum-Exp) as the class of functions that can be written as

$$f(x) \;=\; \log\sum_{k=1}^{K}\exp\big(\beta^{(k)} + \alpha^{(k)\top}x\big) \qquad (1)$$

for some $K \in \mathbb{N}$, $\beta^{(k)} \in \mathbb{R}$, $\alpha^{(k)} \in \mathbb{R}^n$, $k = 1,\dots,K$, where $x \in \mathbb{R}^n$ is a vector of variables. Further, given $T > 0$, we define $\mathrm{LSE}_T$ as the class of functions that can be written as

$$f(x) \;=\; T\log\sum_{k=1}^{K}\exp\!\left(\frac{\beta^{(k)} + \alpha^{(k)\top}x}{T}\right) \qquad (2)$$

for some $K \in \mathbb{N}$, $\beta^{(k)} \in \mathbb{R}$, and $\alpha^{(k)} \in \mathbb{R}^n$, $k = 1,\dots,K$. By letting $b^{(k)} \doteq \beta^{(k)}/T$, $k = 1,\dots,K$, we have that functions in the $\mathrm{LSE}_T$ family can be equivalently parameterized as

$$f(x) \;=\; T\log\sum_{k=1}^{K}\exp\!\left(b^{(k)} + \frac{\alpha^{(k)\top}x}{T}\right), \qquad (3)$$

where the $b^{(k)}$s have no sign restrictions. It may sometimes be convenient to highlight the full parameterization of $f \in \mathrm{LSE}_T$, in which case we shall write $f(x) = \hat f(x\,|\,T,\alpha,b)$, where $\alpha \doteq (\alpha^{(1)},\dots,\alpha^{(K)})$ and $b \doteq (b^{(1)},\dots,b^{(K)})$. It can be readily observed that, for any $T > 0$, the following property holds:

$$\hat f(x\,|\,T,\alpha,b) \;=\; T\,\hat f(x/T\,|\,1,\alpha,b). \qquad (4)$$

The maps in $\mathrm{LSE}_T$ are special instances of the log-Laplace transforms of nonnegative measures, studied in [klartag2012centroid]. In particular, the maps in $\mathrm{LSE}_T$ are smooth, they are convex (an easy consequence of the Cauchy-Schwarz inequality), and they are even strictly convex if the vectors $\alpha^{(1)},\dots,\alpha^{(K)}$ constitute an affine generating family of $\mathbb{R}^n$, see [CaGaPo:18, Prop. 1]. Maps of this kind play a key role in tropical geometry, in the setting of Viro's patchworking method [viro], dealing with the degeneration of real algebraic curves to a piecewise linear limit. We note in this respect that the family of functions given by (2), with $\alpha^{(k)}, \beta^{(k)}$ held fixed, converges uniformly on $\mathbb{R}^n$, as $T \to 0^+$, to the piecewise affine convex function

$$f_0(x) \;=\; \max_{1\le k\le K}\big(\beta^{(k)} + \alpha^{(k)\top}x\big).$$

Actually, the following inequality holds for all $x \in \mathbb{R}^n$ and all $T > 0$:

$$\max_{1\le k\le K}\big(\beta^{(k)} + \alpha^{(k)\top}x\big) \;\le\; f(x) \;\le\; \max_{1\le k\le K}\big(\beta^{(k)} + \alpha^{(k)\top}x\big) + T\log K, \qquad (5)$$

see [CaGaPo:18] for details and background.
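As a quick numerical illustration of the bound (5), the following sketch evaluates an $\mathrm{LSE}_T$ function against its max-affine limit; the parameters, temperatures, and test points are arbitrary values chosen only for this check.

```python
import numpy as np

def lse_T(x, alpha, beta, T):
    """LSE_T function of (2): T * log(sum_k exp((beta_k + alpha_k^T x) / T))."""
    u = (beta + x @ alpha.T) / T          # shape (N, K)
    m = u.max(axis=1, keepdims=True)      # max-shift for numerical stability
    return T * (m[:, 0] + np.log(np.exp(u - m).sum(axis=1)))

rng = np.random.default_rng(0)
K, n = 5, 2
alpha = rng.normal(size=(K, n))           # arbitrary affine terms
beta = rng.normal(size=K)
x = rng.uniform(-3, 3, size=(1000, n))    # arbitrary test points

maxaff = (beta + x @ alpha.T).max(axis=1)  # max_k (beta_k + alpha_k^T x)
for T in (1.0, 0.1, 0.01):
    f = lse_T(x, alpha, beta, T)
    # bound (5): maxaff <= f <= maxaff + T * log(K)
    assert np.all(f >= maxaff - 1e-9) and np.all(f <= maxaff + T * np.log(K) + 1e-9)
    print(f"T={T:5.2f}  max gap = {np.max(f - maxaff):.4f}  (T*log K = {T*np.log(K):.4f})")
```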

II-B Posynomials and GPOS functions

Given $c > 0$ and $a \in \mathbb{R}^n$, a positive monomial is a product of the form $c\, z_1^{a_1} z_2^{a_2}\cdots z_n^{a_n}$, $z \in \mathbb{R}^n_{++}$. A posynomial is a finite sum of positive monomials,

$$\phi(z) \;=\; \sum_{k=1}^{K} c_k\, z_1^{a_1^{(k)}} z_2^{a_2^{(k)}}\cdots z_n^{a_n^{(k)}}. \qquad (6)$$

We denote by $\mathrm{POS}$ the class of functions of the form (6). Posynomials are log-log-convex functions, meaning that the log of a posynomial is convex in the log of its argument, see, e.g., Section II.B of [CaGaPo:18]. We denote by $\mathrm{GPOS}_T$ the class of functions that can be expressed as

$$\psi(z) \;=\; \big(\phi(z)\big)^{T} \qquad (7)$$

for some $T > 0$ and $\phi \in \mathrm{POS}$. These functions are log-log-convex, and they form a subset of the so-called generalized posynomial functions, see, e.g., Section II.B of [CaGaPo:18]. It is observed in Proposition 3 of [CaGaPo:18] that $\mathrm{LSE}_T$ and $\mathrm{GPOS}_T$ functions are related by a one-to-one correspondence. That is, for any $T > 0$ and $f: \mathbb{R}^n \to \mathbb{R}$, it holds that $f \in \mathrm{LSE}_T$ if and only if $\exp \circ f \circ \log \in \mathrm{GPOS}_T$, where $\log$ acts entry-wise on $\mathbb{R}^n_{++}$.
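This correspondence can be verified numerically: with the parameterization (3), $\exp(f(\log z))$ coincides with a $T$-th power of a posynomial whose coefficients are $e^{b_k}$ and whose exponents are $\alpha^{(k)}/T$. The sketch below uses arbitrary illustrative parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, T = 4, 3, 0.5
alpha = rng.normal(size=(K, n))
b = rng.normal(size=K)

def lse_T(x):
    # f(x) = T * log sum_k exp(b_k + alpha_k^T x / T), as in (3)
    return T * np.log(np.sum(np.exp(b + x @ alpha.T / T), axis=-1))

def gpos_T(z):
    # psi(z) = (sum_k c_k * prod_j z_j^(a_kj))^T  with c_k = exp(b_k), a_kj = alpha_kj / T
    c = np.exp(b)
    mono = np.prod(z[..., None, :] ** (alpha / T), axis=-1)   # shape (..., K)
    return np.sum(c * mono, axis=-1) ** T

z = rng.uniform(0.5, 2.0, size=(200, n))       # points in the positive orthant
# psi(z) should equal exp(f(log z)) up to rounding error
assert np.allclose(gpos_T(z), np.exp(lse_T(np.log(z))))
```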

III A universal approximation theorem

III-A Preliminary: approximation of convex functions

We start by recalling a key result of [CaGaPo:18], stating that functions in $\mathrm{LSE}_T$ are universal smooth approximators of convex functions, see Theorem 2 in [CaGaPo:18].

Theorem 1 (Universal approximators of convex functions, [CaGaPo:18]).

Let $f$ be a real-valued continuous convex function defined on a compact convex subset $X \subset \mathbb{R}^n$. Then, for all $\epsilon > 0$ there exist $T > 0$ and a function $\hat f \in \mathrm{LSE}_T$ such that

$$|f(x) - \hat f(x)| \;\le\; \epsilon, \quad \forall x \in X. \qquad (8)$$

We now extend the above result by showing that it actually holds for the restricted class of $\mathrm{LSE}_T$ functions with rational parameters. This extension will allow us to apply our results to the approximation by subtraction-free expressions.

Definition 1.

A function $\hat f \in \mathrm{LSE}_T$ has rational parameters if $T$ is a rational number and $\hat f$ is of the form (3), where the vectors $\alpha^{(k)}$ have rational entries and the coefficients $b^{(k)}$ are rational numbers. We shall also say that $\hat f$ is an $\epsilon$-approximation of $f$ on $X$ when (8) holds.

The following corollary holds.

Corollary 1 (LSE approximation with rational parameters).

Under the hypotheses of Theorem 1, for all $\epsilon > 0$ there exist a rational $T > 0$ and a function $\hat f \in \mathrm{LSE}_T$ with rational parameters such that (8) holds. Moreover, $T$ may be chosen of the form $1/p$, where $p$ is a positive integer.

Proof.

First, inspecting the proof of Theorem 2 in [CaGaPo:18], one obtains that $f$ can be approximated uniformly by a map $\hat f \in \mathrm{LSE}_T$ for all $T > 0$ small enough; hence we can always assume that $T$ is of the form $1/p$ for some positive integer $p$. It then remains to be proved that the approximation result still holds if also the parameters $\alpha^{(k)}$ and $b^{(k)}$ are rational. To this end, let us study the effect of a perturbation of these parameters on the map $\hat f$. Observe that the map $\mathrm{lse}_T: \mathbb{R}^K \to \mathbb{R}$, $\mathrm{lse}_T(u) \doteq T\log\sum_{k=1}^{K}\exp(u_k/T)$, satisfies

$$|\mathrm{lse}_T(u) - \mathrm{lse}_T(v)| \;\le\; \|u - v\|_\infty \qquad (9)$$

for all $u, v \in \mathbb{R}^K$, where $\|\cdot\|_\infty$ is the sup-norm. This follows from the fact that $\mathrm{lse}_T$ is order preserving and commutes with the addition of a constant, see e.g. [CT80] and Section 2 of [1605.04518]. It follows from (9) that if $\hat f$ is as in (3), and if $\tilde f$ is obtained from $\hat f$ by replacing the parameters $(\alpha^{(k)}, b^{(k)})$ with perturbed values $(\tilde\alpha^{(k)}, \tilde b^{(k)})$, then, letting $R \doteq \max_{x \in X}\|x\|_\infty$ and $\delta \doteq \max_{k}\big(R\,\|\tilde\alpha^{(k)} - \alpha^{(k)}\|_1 + T\,|\tilde b^{(k)} - b^{(k)}|\big)$, we get

$$|\hat f(x) - \tilde f(x)| \;\le\; \delta$$

for all $x \in X$. Hence, choosing the $(\tilde\alpha^{(k)}, \tilde b^{(k)})$ to be rational approximations of the $(\alpha^{(k)}, b^{(k)})$ such that $\delta \le \epsilon$, and supposing that $\hat f$ is an $\epsilon$-approximation of $f$ on $X$, we deduce that $\tilde f$ is a $2\epsilon$-approximation of $f$ on $X$, from which the statement of the corollary follows. ∎
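The nonexpansiveness property (9) underlying this argument is easy to verify numerically; the short check below uses arbitrary data and temperature.

```python
import numpy as np

def lse_T_vec(u, T):
    """lse_T(u) = T * log sum_k exp(u_k / T), applied row-wise."""
    m = u.max(axis=-1, keepdims=True)
    return (m + T * np.log(np.exp((u - m) / T).sum(axis=-1, keepdims=True)))[..., 0]

rng = np.random.default_rng(2)
T = 0.25
u = rng.normal(size=(1000, 6))
v = u + rng.normal(scale=0.1, size=u.shape)    # perturbed arguments

lhs = np.abs(lse_T_vec(u, T) - lse_T_vec(v, T))
rhs = np.abs(u - v).max(axis=-1)               # sup-norm of the perturbation
assert np.all(lhs <= rhs + 1e-12)              # nonexpansiveness (9)
```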

III-B Approximation of general continuous functions

This section contains our main result on universal approximation of continuous functions. To this end, we first define the class of functions that can be expressed as the difference of two functions in $\mathrm{LSE}_T$.

Definition 2 ($\mathrm{DLSE}_T$ functions).

We say that a function $f$ belongs to the $\mathrm{DLSE}_T$ class if $f = f_1 - f_2$ for some $f_1, f_2 \in \mathrm{LSE}_T$. Further, we say that $f \in \mathrm{DLSE}_T$ has rational parameters if $f_1$ and $f_2$ have rational parameters.

The following result shows that any continuous function can be approximated uniformly by a function in a $\mathrm{DLSE}_T$ class.

Theorem 2 (Universal approximation property of $\mathrm{DLSE}_T$).

Let $f$ be a real-valued continuous function defined on a compact, convex subset $X \subset \mathbb{R}^n$. Then, for any $\epsilon > 0$ there exists a function $\hat f \in \mathrm{DLSE}_T$ with rational parameters, for some $T = 1/p$ where $p$ is a positive integer, such that $|f(x) - \hat f(x)| \le \epsilon$, $\forall x \in X$.

Proof.

A classical result of convex analysis states that any continuous function $f$ defined on a compact convex subset $X$ of $\mathbb{R}^n$ can be written as the difference $f = g - h$, where $g$ and $h$ are continuous, convex functions defined on $X$, see, e.g., Proposition 2.2 of [borwein]. Then, by Corollary 1, for all $\epsilon > 0$ we can find a rational $T_1 > 0$ and a function $\hat g \in \mathrm{LSE}_{T_1}$ with rational parameters such that $|g(x) - \hat g(x)| \le \epsilon/2$ holds for all $x \in X$. Similarly, we can find a rational $T_2 > 0$ and a function $\hat h \in \mathrm{LSE}_{T_2}$ with rational parameters such that $|h(x) - \hat h(x)| \le \epsilon/2$ holds for all $x \in X$. Hence, by taking any rational $T = 1/p$ such that $T_1$ and $T_2$ are integer multiples of $T$, it follows from the nesting property in Lemma 1 of [CaGaPo:18] that $\hat g$ and $\hat h$ both belong to $\mathrm{LSE}_T$. Thus, there exist a rational $T$ and $\hat g, \hat h \in \mathrm{LSE}_T$ such that, for all $x \in X$,

$$|g(x) - \hat g(x)| \;\le\; \epsilon/2, \qquad |h(x) - \hat h(x)| \;\le\; \epsilon/2.$$

Summing these conditions we obtain that $|g(x) - h(x) - (\hat g(x) - \hat h(x))| \le \epsilon$, for all $x \in X$. The claim then immediately follows by recalling that $f(x) = g(x) - h(x)$ for all $x \in X$, and letting $\hat f \doteq \hat g - \hat h$, whence $\hat f \in \mathrm{DLSE}_T$. ∎

The following explicit example illustrates the approximation of a nonconvex and nondifferentiable function by a function in $\mathrm{DLSE}_T$.

Example 1.

Consider

Observe that this function can be written as a difference of two nonsmooth convex functions. By using (5), we can approximate each term of this difference by a function in $\mathrm{LSE}_T$, as follows.

It follows that the map

is in $\mathrm{DLSE}_T$ and satisfies the following uniform approximation property:

Example 2.

The previous explicit approximation carries over to a continuous piecewise affine function $f$ of a single real variable, as follows. By piecewise affine, we mean that $\mathbb{R}$ can be covered by finitely many intervals in such a way that $f$ is affine over each of these intervals. Then, $f$ can be written in a unique way as

$$f(x) \;=\; \lambda x + \mu + \frac{1}{2}\sum_{i=1}^{m} c_i\,|x - a_i|, \qquad (10)$$

where $\lambda, \mu$ are real parameters, $a_1 < \dots < a_m$ are the nondifferentiability points of $f$, and $c_i$ is the jump of the derivative of $f$ at point $a_i$. Another way to get insight into (10) is to make the following observation: the function $f$ has a second derivative in the distribution sense, $f'' = \sum_{i=1}^{m} c_i\,\delta_{a_i}$; then (10) is obtained by integrating twice the latter expression of $f''$. Possibly after subtracting from $f$ an affine function, we will always assume that $\lambda = \mu = 0$.

Then, setting

$$g(x) \;\doteq\; \tfrac{1}{2}\sum_{i:\,c_i>0} c_i\,|x - a_i|$$

and

$$h(x) \;\doteq\; \tfrac{1}{2}\sum_{i:\,c_i<0} (-c_i)\,|x - a_i|,$$

we write $f = g - h$, where $g$ and $h$ are convex, piecewise affine functions. Let $T > 0$, and note that each term $|x - a_i| = \max(x - a_i,\, a_i - x)$ is approximated, by (5), by the $\mathrm{LSE}_T$ function $T\log\big(e^{(x-a_i)/T} + e^{(a_i-x)/T}\big)$ within an error of at most $T\log 2$. Then, setting $\hat f \doteq \hat g - \hat h$, where $\hat g$ and $\hat h$ are obtained from $g$ and $h$ by replacing each absolute value with its $\mathrm{LSE}_T$ approximation, and using (5), we get a uniform approximation of $f$ by the difference $\hat f$, with an error that vanishes as $T \to 0^+$.
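A minimal numerical sketch of this construction follows; the breakpoints $a_i$, jumps $c_i$, and temperatures are hypothetical values chosen only for illustration (the smoothed parts $g$ and $h$ are sums of positive multiples of LSE terms, which can be recast as single $\mathrm{LSE}_T$ functions for rational coefficients).

```python
import numpy as np

# hypothetical piecewise affine example: breakpoints and derivative jumps chosen for illustration
a = np.array([-1.0, 0.0, 1.5])        # nondifferentiability points a_i
c = np.array([2.0, -3.0, 1.0])        # jumps c_i of the derivative at a_i

def f_pwa(x):
    # f(x) = (1/2) * sum_i c_i |x - a_i|   (affine part assumed zero, as in the text)
    return 0.5 * np.sum(c * np.abs(x[:, None] - a), axis=1)

def smooth_abs(x, a_i, T):
    # LSE_T smoothing of |x - a_i| = max(x - a_i, a_i - x), within T*log(2) of the max
    u = np.stack([x - a_i, a_i - x], axis=1) / T
    m = u.max(axis=1)
    return T * (m + np.log(np.exp(u - m[:, None]).sum(axis=1)))

def g_h_smooth(x, T):
    # convex part g collects positive jumps, convex part h negative jumps: f = g - h
    g = sum(0.5 * c[i] * smooth_abs(x, a[i], T) for i in range(len(a)) if c[i] > 0)
    h = sum(0.5 * (-c[i]) * smooth_abs(x, a[i], T) for i in range(len(a)) if c[i] < 0)
    return g, h

x = np.linspace(-3, 3, 2001)
for T in (0.5, 0.1, 0.02):
    g, h = g_h_smooth(x, T)
    err = np.max(np.abs(f_pwa(x) - (g - h)))
    print(f"T={T:5.2f}   sup error of the smoothed difference = {err:.4f}")
```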

III-C Data approximation

Consider a collection of $N$ data points,

$$\mathcal{D} \;\doteq\; \big\{(x^{(i)}, y^{(i)})\big\}_{i=1,\dots,N}, \qquad (11)$$

where $x^{(i)} \in \mathbb{R}^n$, $y^{(i)} = \phi(x^{(i)}) \in \mathbb{R}$, and $\phi$ is an unknown function. The following universal data approximation result holds.

Corollary 2 (Universal data approximation).

Given a collection of data $\mathcal{D}$ as in (11), for any $\epsilon > 0$ there exist $T > 0$ and a function $\hat f \in \mathrm{DLSE}_T$ with rational coefficients such that

$$|y^{(i)} - \hat f(x^{(i)})| \;\le\; \epsilon, \quad i = 1,\dots,N.$$

Proof.

Let $X$ be the convex hull of the input data points $x^{(1)},\dots,x^{(N)}$. Consider a triangulation of the input points: recall that such a triangulation consists of a finite collection of simplices $\Sigma_1,\dots,\Sigma_q$ satisfying the following properties: (i) the vertices of these simplices are taken among the points $x^{(1)},\dots,x^{(N)}$; (ii) each point $x^{(i)}$ is the vertex of at least one simplex; (iii) the interiors of these simplices have pairwise empty intersections; and (iv) the union of these simplices is precisely $X$. Then, there is a unique continuous function $f$, affine on each simplex $\Sigma_j$, and such that $f(x^{(i)}) = y^{(i)}$ for $i = 1,\dots,N$. Observe that $X$ is convex and compact by construction. Now, a direct application of Theorem 2 shows that for any $\epsilon > 0$ there exist $T > 0$ and a function $\hat f \in \mathrm{DLSE}_T$ with rational coefficients such that $|f(x) - \hat f(x)| \le \epsilon$ for all $x \in X$; in particular, $|y^{(i)} - \hat f(x^{(i)})| \le \epsilon$ for $i = 1,\dots,N$,

which concludes the proof. ∎
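In practice, the piecewise affine interpolant used in this proof can be obtained from a Delaunay triangulation of the inputs; the sketch below uses SciPy on synthetic 2-D data with a hypothetical generating function (all values are illustrative).

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator
from scipy.spatial import Delaunay

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(40, 2))            # input data points x^(i) in R^2
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2          # hypothetical unknown function phi

tri = Delaunay(X)                               # triangulation of the inputs, properties (i)-(iv)
f_pwa = LinearNDInterpolator(tri, y)            # function affine on each simplex, f(x^(i)) = y^(i)

# the interpolant reproduces the data exactly and is defined on the convex hull of the inputs
assert np.allclose(f_pwa(X), y)
```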

IV Positive functions on the positive orthant

In this section we discuss approximation results for functions taking positive values on the open positive orthant. A particular case of this class of functions is given by log-log-convex functions, whose uniform approximation by means of $\mathrm{GPOS}_T$ functions was discussed in Corollary 1 of [CaGaPo:18]. We shall first extend this result to functions with rational parameters, and then provide a universal approximation result for continuous positive functions over the open positive orthant.

IV-A Uniform approximation results

The following preliminary definitions are instrumental for our purposes: a subset $Y \subseteq \mathbb{R}^n_{++}$ will be said to be log-convex if its image under the map that takes the logarithm entry-wise is convex. We shall say that a function $\psi \in \mathrm{GPOS}_T$ has rational parameters if it can be written as in (7), with $\phi$ given by (6), in such a way that $T$, the entries of the vectors $a^{(1)}, \dots, a^{(K)}$, and the scalars $c_1,\dots,c_K$ are rational numbers. The following corollary extends Corollary 1 of [CaGaPo:18].

Corollary 3 (Universal approximators of log-log-convex functions).

Let $\psi$ be a log-log-convex function defined on a compact, log-convex subset $Y$ of $\mathbb{R}^n_{++}$. Then, for all $\epsilon > 0$ there exists a function $\hat\psi \in \mathrm{GPOS}_T$ with rational parameters, for some $T = 1/p$ where $p$ is a positive integer, such that, for all $z \in Y$,

$$e^{-\epsilon} \;\le\; \frac{\psi(z)}{\hat\psi(z)} \;\le\; e^{\epsilon}. \qquad (12)$$
Proof.

By using the log-log transformation, define $f(x) \doteq \log\psi(e^x)$, where the exponential is taken entry-wise. Since $\psi$ is log-log-convex in $z$, $f$ is convex in $x$. Furthermore, the set $X \doteq \log(Y)$ is convex and compact, since the set $Y$ is log-convex and compact. Thus, by Corollary 1, for all $\epsilon > 0$ there exist $T = 1/p$, where $p$ is a positive integer, and a function $\hat f \in \mathrm{LSE}_T$ with rational coefficients such that $|f(x) - \hat f(x)| \le \epsilon$ for all $x \in X$. From this point on, the proof follows the very same lines as the proof of Corollary 1 of [CaGaPo:18]. ∎

We next state an approximation result for functions on the positive orthant. The derivation of Theorem 3 from Corollary 3 is similar to the derivation of Theorem 2 from Corollary 1, and thus we omit its proof.

Theorem 3 (Universal approximators of functions on the open orthant).

Let $\psi$ be a continuous positive function defined on a compact log-convex subset $Y \subset \mathbb{R}^n_{++}$. Then, for all $\epsilon > 0$ there exist two functions $\hat\psi_1, \hat\psi_2 \in \mathrm{GPOS}_T$ with rational parameters, for some $T = 1/p$ where $p$ is a positive integer, such that, for all $z \in Y$,

$$e^{-\epsilon} \;\le\; \frac{\psi(z)\,\hat\psi_2(z)}{\hat\psi_1(z)} \;\le\; e^{\epsilon}. \qquad (13)$$

IV-B Universal approximation by subtraction-free expressions

We next derive from Theorem 3 an approximation result by subtraction-free expressions. The latter are an important subclass of rational expressions, studied in [fomin]. Subtraction-free expressions are well-formed expressions in several commutative variables $z_1, \dots, z_n$, defined using the operations $+$, $\times$, $/$ and using positive constants, but not using subtraction. Formally, a subtraction-free expression in the variables $z_1,\dots,z_n$ is a term $E$ produced by the context-free grammar rule

$$E \;\to\; E + E \;\mid\; E \times E \;\mid\; E / E \;\mid\; z_1 \;\mid\; \cdots \;\mid\; z_n \;\mid\; c,$$

where $c$ can take the value of any positive constant. For instance, $(z_1 + z_2)/z_3$ is a subtraction-free expression, whereas $z_1 - z_2$ is not a subtraction-free expression, owing to the presence of the $-$ sign. Note that $z_1^2 - z_1 z_2 + z_2^2$, thought of as a formal rational fraction, coincides with $(z_1^3 + z_2^3)/(z_1 + z_2)$, which is subtraction-free, i.e., an expression which is not subtraction-free may well have an equivalent subtraction-free expression. However, there are rational fractions, and even polynomials, like $(z_1 - z_2)^2$, without equivalent subtraction-free expressions, because any subtraction-free expression must take positive values on the interior of the positive cone, whereas $(z_1 - z_2)^2$ vanishes on the line $z_1 = z_2$. Important examples of subtraction-free expressions arise from series-parallel composition rules for resistances. More advanced examples, coming from algebraic combinatorics, are discussed in [fomin].
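As a quick numerical check of the two identities just discussed (the grid of positive points is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x, y = rng.uniform(0.1, 5.0, size=(2, 10000))

# x^2 - x*y + y^2 coincides with the subtraction-free expression (x^3 + y^3)/(x + y):
assert np.allclose(x**2 - x * y + y**2, (x**3 + y**3) / (x + y))
# (x - y)^2, in contrast, vanishes at x = y, so no subtraction-free expression can represent it.
```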

Corollary 4 (Approximation by subtraction-free expressions).

Let $\psi$ be a continuous positive function defined on a compact log-convex subset $Y \subset \mathbb{R}^n_{++}$. Then, for all $\epsilon > 0$ there exist positive integers $p$, $q$ and a subtraction-free expression $E$ in the variables $w_1,\dots,w_n$ such that the function

$$\hat\psi(z) \;\doteq\; \big(E(w_1,\dots,w_n)\big)^{1/p},$$

in which $z_i^{1/q}$ is substituted to the variable $w_i$, satisfies, for all $z \in Y$,

$$e^{-\epsilon} \;\le\; \frac{\psi(z)}{\hat\psi(z)} \;\le\; e^{\epsilon}. \qquad (14)$$
Proof.

Theorem 3 shows that (14) holds with $\hat\psi = \hat\psi_1/\hat\psi_2$, where, for some positive integer $p$, $T = 1/p$ and $\hat\psi_1, \hat\psi_2$ are functions in $\mathrm{GPOS}_T$ with rational parameters, i.e.,

$$\hat\psi_i(z) \;=\; \Big(\sum_{k=1}^{K_i} c_{i,k}\, z_1^{a_1^{(i,k)}}\cdots z_n^{a_n^{(i,k)}}\Big)^{1/p}, \quad i = 1,2,$$

where the vectors $a^{(1,k)}$ and $a^{(2,k)}$ have rational entries. Denoting by $q$ the least common multiple of the denominators of the entries of these vectors, we see that $\hat\psi$ is precisely of the form $\hat\psi(z) = \big(E(z_1^{1/q},\dots,z_n^{1/q})\big)^{1/p}$, where $E$ is a subtraction-free rational expression. ∎

IV-C Approximation of positive data

Consider a collection of data pairs,

$$\mathcal{D}_+ \;\doteq\; \big\{(z^{(i)}, w^{(i)})\big\}_{i=1,\dots,N},$$

where $z^{(i)} \in \mathbb{R}^n_{++}$ and $w^{(i)} = \psi(z^{(i)})$, with $w^{(i)} > 0$, $i = 1,\dots,N$, where $\psi$ is an unknown function. The data in $\mathcal{D}_+$ are referred to as positive data. The following proposition is now an immediate consequence of Theorem 3, where $Y$ can be taken as the log-convex hull of the input data points (for given $z^{(1)},\dots,z^{(N)} \in \mathbb{R}^n_{++}$, we define their log-convex hull as the set of vectors $\exp(v)$, where $v$ belongs to the convex hull of $\log z^{(1)},\dots,\log z^{(N)}$, and $\exp$ and $\log$ are taken entry-wise).

Proposition 1.

Given positive data $\mathcal{D}_+$, for any $\epsilon > 0$ there exist a rational $T > 0$ and two functions $\hat\psi_1, \hat\psi_2 \in \mathrm{GPOS}_T$ with rational parameters such that

$$e^{-\epsilon} \;\le\; \frac{w^{(i)}\,\hat\psi_2(z^{(i)})}{\hat\psi_1(z^{(i)})} \;\le\; e^{\epsilon}, \quad i = 1,\dots,N. \qquad (15)$$

V DLSE networks

Functions in $\mathrm{DLSE}_T$ can be modeled through a feedforward neural network (NN) architecture, composed of two $\mathrm{LSE}_T$ networks in parallel, whose outputs are fused via an output difference node, see Fig. 1.

Fig. 1: A DLSE network is composed of two LSE_T networks in parallel, with a difference output node.

It may sometimes be convenient to highlight the full parameterization of the input-output function synthesized by the network, in which case we shall write

$$\hat f(x) \;=\; \hat f_1(x\,|\,T,\alpha_1,b_1) - \hat f_2(x\,|\,T,\alpha_2,b_2),$$

where $\alpha_1$, $b_1$, $\alpha_2$, and $b_2$ are the parameter vectors of the two $\mathrm{LSE}_T$ components.

Each $\mathrm{LSE}_T$ component has $n$ input nodes, one hidden layer with $K$ nodes, and one output node. The activation function of the hidden nodes is $\exp(\cdot)$, and the activation of the output node of each component is $T\log(\cdot)$. Each node $k$ in the hidden layer of the first component network computes a term of the form $b_1^{(k)} + \alpha_1^{(k)\top}x/T$, where the $j$-th entry of $\alpha_1^{(k)}$ represents the weight between node $k$ and input $j$, and $b_1^{(k)}$ is the bias term of node $k$. Each node of the first network thus generates the activation

$$\exp\!\big(b_1^{(k)} + \alpha_1^{(k)\top}x/T\big), \quad k = 1,\dots,K.$$

We consider the weights from the inner nodes to the output node to be unitary, whence the output node of the first network computes $\sum_{k=1}^{K}\exp\!\big(b_1^{(k)} + \alpha_1^{(k)\top}x/T\big)$ and then, according to the output activation function, the output layer returns the value

$$\hat f_1(x) \;=\; T\log\sum_{k=1}^{K}\exp\!\big(b_1^{(k)} + \alpha_1^{(k)\top}x/T\big).$$

An identical reasoning applies to the output $\hat f_2$ of the second component network. The overall output $\hat f = \hat f_1 - \hat f_2$ realizes a $\mathrm{DLSE}_T$ function which, by Theorem 2, allows us to approximate any continuous function over a compact convex domain. Similarly, by Corollary 2, we can approximate data via a DLSE network, to any given precision.
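A minimal sketch of the forward pass just described is given below, assuming the parameterization (3); array names and shapes are illustrative choices, not the API of the Matlab toolbox mentioned in Section I, and no overflow safeguards are included so as to mirror the network description literally.

```python
import numpy as np

def lse_T_component(X, alpha, b, T):
    """One LSE_T component: exp hidden activations, unit output weights, T*log output activation."""
    hidden = np.exp(b + X @ alpha.T / T)       # K hidden activations per sample
    return T * np.log(hidden.sum(axis=1))      # output node: T * log(sum of hidden activations)

def dlse_forward(X, alpha1, b1, alpha2, b2, T):
    """DLSE network output: difference of the two LSE_T component outputs."""
    return lse_T_component(X, alpha1, b1, T) - lse_T_component(X, alpha2, b2, T)

# arbitrary parameters, only to show the shapes involved
rng = np.random.default_rng(5)
n, K, T = 2, 8, 0.5
alpha1, alpha2 = rng.normal(size=(2, K, n))
b1, b2 = rng.normal(size=(2, K))
X = rng.uniform(-1, 1, size=(100, n))
print(dlse_forward(X, alpha1, b1, alpha2, b2, T).shape)   # (100,)
```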

Theorem 4.

Given a collection of data $\mathcal{D}$ as in (11), for each $\epsilon > 0$ there exists a DLSE neural network such that

$$|y^{(i)} - \hat f(x^{(i)})| \;\le\; \epsilon, \quad i = 1,\dots,N,$$

where $\hat f$ is the input-output function of the network.

V-A Training DLSE networks

By using the scaling property (4) of $\mathrm{LSE}_T$ functions, it can be noticed that a simpler network structure can be used to implement a $\mathrm{DLSE}_T$ neural network, as shown in Fig. 2.

Fig. 2: The same network as in Fig. 1 can be obtained by suitably pre-scaling the inputs and outputs of the two component networks.

As a matter of fact, given the parameter vectors $\alpha_1$, $b_1$, $\alpha_2$, $b_2$, it can be easily derived from (4) that

$$\hat f_i(x\,|\,T,\alpha_i,b_i) \;=\; T\,\hat f_i(x/T\,|\,1,\alpha_i,b_i), \quad i = 1,2,$$

and hence dealing with an $\mathrm{LSE}_T$ neural network corresponds to dealing with an LSE neural network whose input and output have been rescaled by $T$. Each of the two components of the network realizes an input-output map of the form

$$g(x) \;=\; \log\sum_{k=1}^{K}\exp\!\big(b^{(k)} + \alpha^{(k)\top}x\big).$$

The gradients of this function with respect to its parameters $\alpha^{(k)}$, $b^{(k)}$ are, for $k = 1,\dots,K$,

$$\frac{\partial g}{\partial b^{(k)}}(x) \;=\; \frac{\exp\!\big(b^{(k)} + \alpha^{(k)\top}x\big)}{\sum_{j=1}^{K}\exp\!\big(b^{(j)} + \alpha^{(j)\top}x\big)}, \qquad \frac{\partial g}{\partial \alpha^{(k)}}(x) \;=\; \frac{\partial g}{\partial b^{(k)}}(x)\; x.$$

Thus, by using the chain rule, we have that the gradients of the overall network output with respect to the parameters of its two components follow directly from the above expressions, combined with the input and output scaling by $T$.

Given a dataset $\mathcal{D}$ as in (11), these gradients can be used to train a $\mathrm{DLSE}_T$ network by using classical algorithms such as the Levenberg-Marquardt algorithm [marquardt1963algorithm], the Fletcher-Powell conjugate gradient method [davidon1991variable], or the stochastic gradient descent algorithm [bertsekas1999nonlinear]. In numerical practice, one may fix the parameters $T$ and $K$ and train the network with respect to the parameters $\alpha_1$, $b_1$, $\alpha_2$, and $b_2$ by using one of the methods mentioned above, until a satisfactory cross-validated fit is found. A suitable initial value for the parameter $T$ may be set, for instance, to the inverse of the mid output range of the data. Alternatively, $T$ can be considered as a trainable variable as well, and computed by the training algorithm alongside the other parameters.
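For reference, the following sketch evaluates one LSE component $g(x) = \log\sum_k \exp(b_k + \alpha_k^\top x)$ together with its parameter gradients (which are simply softmax weights) and checks them against finite differences; all names and data are illustrative assumptions, not the toolbox's implementation.

```python
import numpy as np

def lse_value_and_grads(X, alpha, b):
    """Value and parameter gradients of g(x) = log sum_k exp(b_k + alpha_k^T x)."""
    u = b + X @ alpha.T                               # (N, K)
    m = u.max(axis=1, keepdims=True)
    s = np.exp(u - m)
    w = s / s.sum(axis=1, keepdims=True)              # softmax weights = dg/db_k
    g = (m + np.log(s.sum(axis=1, keepdims=True)))[:, 0]
    dg_db = w                                         # (N, K)
    dg_dalpha = w[:, :, None] * X[:, None, :]         # (N, K, n): dg/dalpha_k = w_k * x
    return g, dg_db, dg_dalpha

# quick finite-difference check of the gradient with respect to b (arbitrary data)
rng = np.random.default_rng(6)
X = rng.normal(size=(5, 3)); alpha = rng.normal(size=(4, 3)); b = rng.normal(size=4)
g, dg_db, _ = lse_value_and_grads(X, alpha, b)
eps = 1e-6
b_pert = b.copy(); b_pert[0] += eps
g_pert, _, _ = lse_value_and_grads(X, alpha, b_pert)
assert np.allclose((g_pert - g) / eps, dg_db[:, 0], atol=1e-4)
```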

The following example illustrates the application of the Levenberg-Marquardt algorithm to a simple case.

Example 3.

Consider the function ,

(16)

Such a function, which is clearly nonconvex, has been used to generate the dataset $\mathcal{D}$, where the inputs have been taken uniformly at random over a compact interval. The Levenberg-Marquardt algorithm has been used to train a $\mathrm{DLSE}_T$ network fitting such data. Fig. 3 depicts the output of the network and of its two $\mathrm{LSE}_T$ components.

Fig. 3: Output of the trained network.

As shown in Fig. 3, although the function is nonconvex, it is well approximated by a $\mathrm{DLSE}_T$ network. Indeed, the data represented in Fig. 3 are approximated by the trained network with low mean square error.
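As a rough illustration of such a training loop (not the paper's Matlab toolbox), the sketch below fits a small DLSE model to data from a hypothetical one-dimensional target with SciPy's Levenberg-Marquardt solver; the target function, $K$, $T$, and sample size are arbitrary choices, and no numerical safeguards against overflow in the exponentials are included.

```python
import numpy as np
from scipy.optimize import least_squares

# hypothetical 1-D target (the function used in the paper's Example 3 is not reproduced here)
def target(x):
    return np.sin(2 * x) + 0.3 * x**2

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=200)
y = target(x)

n, K, T = 1, 6, 0.5
X = x[:, None]

def unpack(theta):
    a1, b1, a2, b2 = np.split(theta, [K * n, K * n + K, 2 * K * n + K])
    return a1.reshape(K, n), b1, a2.reshape(K, n), b2

def lse(Xm, alpha, b):
    return T * np.log(np.sum(np.exp(b + Xm @ alpha.T / T), axis=1))

def residuals(theta):
    a1, b1, a2, b2 = unpack(theta)
    return lse(X, a1, b1) - lse(X, a2, b2) - y

theta0 = 0.1 * rng.normal(size=2 * K * n + 2 * K)
sol = least_squares(residuals, theta0, method="lm")     # Levenberg-Marquardt
print("training MSE:", np.mean(sol.fun ** 2))
```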

VI Non-convex optimization via DLSE networks

In view of the results established in Sections III and V, DLSE networks can be efficiently used to compute a difference-of-convex (DC) approximate decomposition of any continuous function over a compact set. Indeed, by using the tools described in Section V, given any continuous function defined on a convex compact set (or, more generally, a dataset generated through any function) and $\epsilon > 0$, we can determine $\hat f_1, \hat f_2 \in \mathrm{LSE}_T$ such that