I-A Motivation and context
A key challenge in real-world engineering analysis and design problems is to find a model for a process or apparatus that correctly interprets the observed data. The advantages of having a mathematical model at one’s disposal include enabling the analysis of extreme situations, the verification of decisions, the avoidance of time-consuming and expensive experimental tests or intensive numerical simulations, and the possibility of optimizing over model parameters for the purpose of design. In this context, a tradeoff must typically be made between the accuracy of the model (here broadly intended as the capacity of the model to reproduce the experimental or simulation data) and its complexity, insofar as the former usually increases with the latter. The use of “simple” models of complex phenomena is gaining increasing interest in engineering design; examples are the so-called surrogate models constructed from complex simulation data arising, for instance, in aerodynamics modeling, see, e.g., [1, 2, 3].
In particular, if the purpose of the model is optimization-based design, then it becomes of paramount importance to have a model that is suitably tailored for optimization. To this purpose, it is well known that an extremely advantageous property for a model to possess is convexity, see, e.g., [4, 5]. In fact, if the objective and constraints in an optimization-based design problem are convex, then efficient tools (such as interior-point methods) can be used to solve the problem to global optimality, with guarantees. Conversely, finding the solution to a generic nonlinear programming problem may be extremely difficult, involving compromises such as long computation time or suboptimality of the solution, see [5, 7]. Clearly, not all real-world models are convex, but several relevant ones are indeed convex, or can anyway be approximated by convex ones. In all such cases it is of critical importance to be able to construct convex models from the available data.
The focus of this work is on the construction of functional models from data, possessing the desirable property of convexity. Several tools have been proposed in the literature to fit data via convex or log-log-convex functions (see Section II-B for a definition of log-log convexity). Remarkable examples include an efficient least-squares partition algorithm to fit data through max-affine functions; a similar method to fit max-monomial functions; a technique based on fitting the data through implicit softmax-affine functions; and methods to fit data through posynomial models [11, 12].
Since the pioneering works [13, 14, 15], artificial feedforward neural networks have been widely used to find models apt at describing the data, see, e.g., [16, 17, 18]. However, the input-output map represented by a neural network need not possess properties such as convexity, and hence the ensuing model is in general unsuitable for optimization-based design.
The main objective of this paper is to show that, if the activation functions of the hidden layer and of the output layer are properly chosen, then it is possible to design a feedforward neural network with one hidden layer that fits the data and that represents a convex function of the inputs. Such a goal is pursued by studying the properties of the log-sum-exp (or softmax-affine) class of functions, by showing that they can be represented through a feedforward neural network, and by proving that they possess universal approximator properties with respect to convex functions; this constitutes our main result, stated in Theorem 2 and specialized in Proposition 4 and Theorem 3. Furthermore, we show that an exponential transformation maps the class of $\mathrm{LSE}_T$ functions into the generalized posynomial family $\mathrm{GPOS}_T$, which can be used for fitting log-log convex data, as stated in Corollary 1, Proposition 5, and Theorem 4. Our approximation proofs rely in part on tropical techniques. The application of tropical geometry to neural networks is an emerging topic: two recent works have used tropical methods to provide combinatorial estimates, in terms of Newton polytopes, of the “classifying power” of neural networks with piecewise affine functions, see [19, 20]. Although there is no direct relation with the present results, a comparison of these three works does suggest that tropical methods may be of further interest in the learning-theoretic context.
We flank the theoretical results in this paper with a numerical Matlab toolbox, named Convex_Neural_Network, which we developed and made freely available on the web (see https://github.com/Corrado-possieri/convex-neural-network/). This toolbox implements the proposed class of feedforward neural networks, and it has been used for the numerical experiments reported in the examples section.
Convex neural networks are important in engineering applications in the context of construction of surrogate models for describing and optimizing complex input-output relations. We provide examples of application to two complex physical processes: the amount of vibration transmitted by a vehicle suspension system as a function of its mechanical parameters, and the peak power generated by the combustion reaction of propane as a function of the initial concentrations of the involved chemical species.
I-C Organization of the paper
The remainder of this paper is organized as follows. In Section II, we introduce the notation and give some preliminary results about the classes of functions under consideration. In Section III, we illustrate the approximation capabilities of the considered classes of functions, by establishing that generalized log-sum-exp functions and generalized posynomials are universal smooth approximators of convex and log-log-convex data, respectively. In Section IV, we show the correspondence between these functions and feedforward neural networks with properly chosen activation functions. The effectiveness of the proposed approximation technique in realistic applications is highlighted in Section V, where the proposed class of functions is used to perform data-driven optimization of two physical phenomena. Conclusions are given in Section VI.
II Notation and technical preliminaries
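Let $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, and $\mathbb{R}_{>0}$ denote the set of natural, integer, real, nonnegative real, and positive real numbers, respectively. Given $\xi\in\mathbb{R}^n$, $\delta_\xi$ denotes the Dirac measure on the set $\{\xi\}$. The vectors $\xi_0,\dots,\xi_k\in\mathbb{R}^n$ are linearly independent if $\sum_{i=0}^{k}c_i\xi_i\neq 0$ for all $c_0,\dots,c_k\in\mathbb{R}$ not identically zero, whereas they are affinely independent if $\xi_1-\xi_0,\dots,\xi_k-\xi_0$ are linearly independent. Given $f:\mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$, let
$$\operatorname{dom} f := \{x\in\mathbb{R}^n : f(x)<+\infty\}.$$
Supposing that $\operatorname{dom} f\neq\emptyset$, define the Fenchel transform $f^\star$ of $f$ as
$$f^\star(p) := \sup_{x\in\mathbb{R}^n}\big(\langle p,x\rangle-f(x)\big),$$
where $\langle\cdot,\cdot\rangle$ denotes an inner product; in particular, the standard inner product will be assumed throughout this paper. By the Fenchel-Moreau theorem, it results that $f^{\star\star}=f$ if and only if $f$ is convex and lower semicontinuous, whereas, in general, it holds that $f^{\star\star}\leq f$. We shall assume henceforth that all the considered convex functions are proper, meaning that their domain is nonempty.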
II-A The Log-Sum-Exp class of functions
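Let $\mathrm{LSE}$ (Log-Sum-Exp) be the class of functions $f:\mathbb{R}^n\to\mathbb{R}$ that can be written as
$$f(x)=\log\Big(\sum_{k=1}^{K}b_k\exp\big(\langle\alpha^{(k)},x\rangle\big)\Big) \tag{1}$$
for some $K\in\mathbb{N}$, $b_k\in\mathbb{R}_{>0}$, $\alpha^{(k)}\in\mathbb{R}^n$, $k=1,\dots,K$, where $x=[x_1\,\cdots\,x_n]^\top$ is a vector of variables. Further, given $T\in\mathbb{R}_{>0}$ (usually referred to as the temperature), define the class $\mathrm{LSE}_T$ of functions $f_T$ that can be written as
$$f_T(x)=T\log\Big(\sum_{k=1}^{K}b_k^{1/T}\exp\big(\langle\alpha^{(k)},x/T\rangle\big)\Big) \tag{2}$$
for some $K\in\mathbb{N}$, $b_k\in\mathbb{R}_{>0}$, and $\alpha^{(k)}\in\mathbb{R}^n$, $k=1,\dots,K$. By letting $\beta_k:=\log b_k$, $k=1,\dots,K$, we have that functions in the family $\mathrm{LSE}_T$ can be equivalently parameterized as
$$f_T(x)=T\log\Big(\sum_{k=1}^{K}\exp\big(\langle\alpha^{(k)},x/T\rangle+\beta_k/T\big)\Big), \tag{3}$$
where the $\beta_k$s have no sign restrictions. It may sometimes be convenient to highlight the full parameterization of $f_T$, in which case we shall write $f_T^{(\alpha,\beta)}$, where $\alpha=(\alpha^{(1)},\dots,\alpha^{(K)})$ and $\beta=(\beta_1,\dots,\beta_K)$. It can then be observed that, for any $T>0$, the following property holds:
$$f_T^{(\alpha,\beta)}(x)=T\,f_1^{(\alpha,\beta/T)}(x/T).$$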
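As a quick numerical illustration, the following sketch evaluates a function of the form $f_T(x)=T\log\sum_k\exp\big((\langle\alpha^{(k)},x\rangle+\beta_k)/T\big)$ and checks the scaling identity $f_T^{(\alpha,\beta)}(x)=T\,f_1^{(\alpha,\beta/T)}(x/T)$ on random parameters. The function name `lse_t`, the random test data, and the use of NumPy are assumptions made here for illustration; they are not taken from the toolbox mentioned in the introduction.

```python
import numpy as np

def lse_t(x, alpha, beta, T=1.0):
    """Evaluate T * log(sum_k exp((<alpha_k, x> + beta_k) / T)).

    alpha has shape (K, n), beta has shape (K,), x has shape (n,).
    The maximum exponent is subtracted before exponentiating to avoid overflow.
    """
    z = (alpha @ x + beta) / T
    m = np.max(z)
    return T * (m + np.log(np.sum(np.exp(z - m))))

# Hypothetical random parameters, used only to exercise the identity.
rng = np.random.default_rng(0)
K, n, T = 5, 3, 0.1
alpha = rng.normal(size=(K, n))
beta = rng.normal(size=K)
x = rng.normal(size=n)

lhs = lse_t(x, alpha, beta, T)                 # f_T^{(alpha, beta)}(x)
rhs = T * lse_t(x / T, alpha, beta / T, 1.0)   # T * f_1^{(alpha, beta/T)}(x/T)
print(abs(lhs - rhs))
```

The identity holds because both sides reduce to the same exponents $(\langle\alpha^{(k)},x\rangle+\beta_k)/T$; the two values therefore agree up to floating-point rounding.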
A key fact is that each $f_T\in\mathrm{LSE}_T$ is smooth and convex. Indeed, letting $\mu$ be a positive Borel measure on $\mathbb{R}^n$, the log-Laplace transform of $\mu$ is
$$L_\mu(x):=\log\int_{\mathbb{R}^n}\exp\big(\langle x,\xi\rangle\big)\,d\mu(\xi).$$
The convexity of this function is well known, being a direct consequence of Hölder’s inequality. Hence, letting $\mu=\sum_{k=1}^{K}b_k\delta_{\alpha^{(k)}}$ be a sum of Dirac measures, we obtain that each $f\in\mathrm{LSE}$ is convex. The convexity of all $f_T\in\mathrm{LSE}_T$ follows immediately from the fact that convexity is preserved under positive scaling. On the other hand, the smoothness of each $f_T$ follows from the smoothness of the functions $\exp(\cdot)$ and $\log(\cdot)$ in their domain. The interest in this class of functions arises from the fact that, as established in the subsequent Theorem 2, functions in $\mathrm{LSE}_T$ are universal smooth approximators of convex functions.
In the following proposition, we show that if the points with coordinates $\alpha^{(1)},\dots,\alpha^{(K)}$ constitute an affine generating family of $\mathbb{R}^n$, or, equivalently, if one can extract $n+1$ affinely independent vectors from $\alpha^{(1)},\dots,\alpha^{(K)}$, then the function $f_T$ given in (2) is strictly convex. In dimension $2$, this condition means that the family of points of coordinates $\alpha^{(k)}$ contains the vertices of a triangle; in dimension $3$, the same family must contain the vertices of a tetrahedron, and so on.
Proposition 1. The function $f_T$ given in (2) is strictly convex whenever the vectors $\alpha^{(1)},\dots,\alpha^{(K)}$ constitute an affine generating family of $\mathbb{R}^n$.
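Proof. Let $\mu$ be a positive Borel measure on $\mathbb{R}^n$, with log-Laplace transform $L_\mu(x)=\log\int\exp(\langle x,\xi\rangle)\,d\mu(\xi)$. For every $x\in\mathbb{R}^n$ such that $L_\mu(x)<\infty$, consider the random variable $X_x$, whose distribution $\nu_x$, absolutely continuous with respect to $\mu$, has the Radon-Nikodym derivative equal to $\frac{d\nu_x}{d\mu}(\xi)=\exp\big(\langle x,\xi\rangle-L_\mu(x)\big)$. It can be checked that the Hessian of the log-Laplace transform of $\mu$ is $\nabla^2L_\mu(x)=\operatorname{Cov}(X_x)$, where $\operatorname{Cov}(\cdot)$ denotes the covariance matrix of the random variable at argument, see the proof of [23, Prop 7.2.1]. Hence, as soon as the support of the distribution of $X_x$ contains $n+1$ affinely independent points, this covariance matrix is positive definite, which entails the strict convexity of $L_\mu$. The proposition follows by considering the log-Laplace transform of $\mu=\sum_{k=1}^{K}b_k^{1/T}\delta_{\alpha^{(k)}/T}$, for which $f_T=T\,L_\mu$ and in which the support of $\mu$ is $\{\alpha^{(1)}/T,\dots,\alpha^{(K)}/T\}$. ∎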
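If the points with coordinates $\alpha^{(1)},\dots,\alpha^{(K)}$ do not constitute an affine generating family of $\mathbb{R}^n$, we can find a vector $v\neq 0$ such that $\langle\alpha^{(k)}-\alpha^{(1)},v\rangle=0$ for $k=1,\dots,K$. It follows that
$$f_T(x+sv)=f_T(x)+s\,\langle\alpha^{(1)},v\rangle,\qquad s\in\mathbb{R},$$
showing that $f_T$ is affine in the direction $v$.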
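The dichotomy between strict convexity and affine directions can be checked numerically. The sketch below, written for $T=1$ and $n=2$, computes the Hessian of $x\mapsto\log\sum_k\exp(\langle\alpha^{(k)},x\rangle+\beta_k)$ as the covariance matrix of the vectors $\alpha^{(k)}$ under the softmax weights, and compares a family containing the vertices of a triangle with a collinear family. The helper name `lse_hessian` and the two point sets are illustrative assumptions.

```python
import numpy as np

def lse_hessian(x, alpha, beta):
    """Hessian of log(sum_k exp(<alpha_k, x> + beta_k)) at x.

    It equals the covariance matrix of the rows of alpha under the
    softmax probabilities w_k proportional to exp(<alpha_k, x> + beta_k).
    """
    z = alpha @ x + beta
    w = np.exp(z - z.max())
    w /= w.sum()
    centered = alpha - w @ alpha           # subtract the weighted mean
    return (centered * w[:, None]).T @ centered

x = np.zeros(2)
beta = np.zeros(3)
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # affinely generating in R^2
collinear = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # contained in a line

eig_triangle = np.linalg.eigvalsh(lse_hessian(x, triangle, beta))
eig_collinear = np.linalg.eigvalsh(lse_hessian(x, collinear, beta))
print(eig_triangle, eig_collinear)
```

In the first case both eigenvalues are strictly positive, in agreement with strict convexity; in the second case the smallest eigenvalue vanishes (up to rounding), and the corresponding eigenvector spans the direction along which the function is affine.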
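We next observe that the function class $\mathrm{LSE}_T$ enlarges as $T$ decreases, as stated more precisely in the following lemma.
Lemma 1. For all $T>0$ and each $p\in\mathbb{N}$, $p\geq 1$, one has
$$\mathrm{LSE}_T\subseteq\mathrm{LSE}_{T/p}.$$
Proof. By definition, for a function $f\in\mathrm{LSE}_T$ there exist $K\in\mathbb{N}$, $\alpha^{(k)}\in\mathbb{R}^n$, and $\beta_k\in\mathbb{R}$, $k=1,\dots,K$, such that
$$f(x)=T\log\Big(\sum_{k=1}^{K}e^{\langle\alpha^{(k)},x/T\rangle+\beta_k/T}\Big)=\frac{T}{p}\log\Big(\sum_{k=1}^{K}e^{\langle\alpha^{(k)},x/T\rangle+\beta_k/T}\Big)^{p}=\frac{T}{p}\log\Big(\sum_{j=1}^{K^p}e^{\langle\tilde\alpha^{(j)},x/(T/p)\rangle+\tilde\beta_j/(T/p)}\Big),$$
where the last equality follows from the observation that, by expanding the (integer) power $p$, we obtain a summation over $K^p$ terms, each of which has the form of a product of $p$ terms taken from the larger parentheses. These terms retain the format of the original terms in the parentheses, only with suitably modified parameters $\tilde\alpha^{(j)}$ and $\tilde\beta_j$. The claim then follows by observing that the last expression represents a function in $\mathrm{LSE}_{T/p}$. ∎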
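To make the expansion step concrete, here is the smallest nontrivial instance, worked out under the assumption $K=2$ and $p=2$ (the shorthand $a_k$ is introduced only for this example). Writing $a_k:=(\langle\alpha^{(k)},x\rangle+\beta_k)/T$,

$$\big(e^{a_1}+e^{a_2}\big)^2=e^{a_1+a_1}+e^{a_1+a_2}+e^{a_2+a_1}+e^{a_2+a_2},$$

and each of the $K^p=4$ products can be rewritten at temperature $T/2$ as

$$e^{a_i+a_j}=\exp\!\Big(\frac{\big\langle\tfrac12(\alpha^{(i)}+\alpha^{(j)}),x\big\rangle+\tfrac12(\beta_i+\beta_j)}{T/2}\Big).$$

Hence $T\log(e^{a_1}+e^{a_2})=\tfrac{T}{2}\log\sum_{i,j\in\{1,2\}}e^{a_i+a_j}$ is a function of temperature $T/2$ with four terms, whose parameters are the pairwise averages $\tfrac12(\alpha^{(i)}+\alpha^{(j)})$ and $\tfrac12(\beta_i+\beta_j)$.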
Consider now the class $\mathrm{MA}_K$ of max-affine functions with $K$ terms, i.e., the class of all the functions that can be written as
$$\bar f(x)=\max_{k=1,\dots,K}\big(\beta_k+\langle\alpha^{(k)},x\rangle\big).$$
When the entries of $\alpha^{(k)}$ are nonnegative integers, the function $\bar f$ is called a tropical polynomial [24, 25]. Allowing these entries to be relative integers yields the class of Laurent tropical polynomials. When these entries are real, by analogy with classical posynomials (see Section II-B), the function $\bar f$ is sometimes referred to as a tropical posynomial. Note that the class of max-affine functions has been recently used in learning problems and in data fitting. Such functions are convex, since the function obtained by taking the point-wise maximum of convex functions is convex. It follows from the parameterization in (3) that, for all $x\in\mathbb{R}^n$, $\lim_{T\searrow 0}f_T(x)=\bar f(x)$, i.e., the function $f_T$ given in (3) approximates $\bar f$ as $T$ tends to zero. This deformation is familiar in tropical geometry under the name of “Maslov dequantization,” and it is a key ingredient of Viro’s patchworking method. The following uniform bounds are rather standard, but their formal proof is given here for completeness.
Lemma 2. For any $T>0$, $f_T$ in (3), and for all $x\in\mathbb{R}^n$, it holds that
$$\bar f(x)\leq f_T(x)\leq T\log K+\bar f(x). \tag{8}$$
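The standard argument behind these bounds can be sketched as follows; this is a reconstruction of the usual reasoning, not necessarily the authors' proof, and the shorthand $z_k$ is introduced here for convenience. Let $f_T$ be parameterized as in (3), let $\bar f(x)=\max_k\big(\beta_k+\langle\alpha^{(k)},x\rangle\big)$ be the max-affine function with the same parameters, and set $z_k:=(\langle\alpha^{(k)},x\rangle+\beta_k)/T$. Since a sum of positive terms is at least its largest term and at most $K$ times its largest term,

$$\exp\big(\max_k z_k\big)\;\leq\;\sum_{k=1}^{K}\exp(z_k)\;\leq\;K\exp\big(\max_k z_k\big).$$

Taking logarithms and multiplying by $T>0$, and using $T\max_k z_k=\bar f(x)$, gives

$$\bar f(x)\;\leq\;f_T(x)\;\leq\;T\log K+\bar f(x).$$

In particular, $\sup_x|f_T(x)-\bar f(x)|\leq T\log K$, which tends to zero as $T\searrow 0$ for fixed $K$.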
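II-B Posynomials
Given $c\in\mathbb{R}_{>0}$ and $\alpha\in\mathbb{R}^n$, a positive monomial is a product of the form $c\,z^{\alpha}=c\,z_1^{\alpha_1}z_2^{\alpha_2}\cdots z_n^{\alpha_n}$. A posynomial is a finite sum of positive monomials,
$$\psi(z)=\sum_{k=1}^{K}c_k\,z^{\alpha^{(k)}}.$$
Posynomials are thus functions $\psi:\mathbb{R}^n_{>0}\to\mathbb{R}_{>0}$; we let $\mathrm{POS}$ denote the class of all posynomial functions.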
Definition 1 (Log-log-convex function).
A function $\psi:\mathbb{R}^n_{>0}\to\mathbb{R}_{>0}$ is log-log-convex if $\log\psi(\exp(x))$ is convex in $x$.
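A positive monomial function $c\,z^{\alpha}$ is clearly log-log-convex, since $\log\big(c\exp(\langle\alpha,x\rangle)\big)=\log c+\langle\alpha,x\rangle$ is affine (hence convex) in $x$. Log-log convexity of functions in the $\mathrm{POS}$ family can be derived from the following proposition, which goes back to Kingman.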
Proposition 2 (Kingman's lemma, p. 283).
If $\psi_1$ and $\psi_2$ are log-log-convex functions, then their sum $\psi_1+\psi_2$ and their product $\psi_1\psi_2$ are log-log-convex.
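Since $c\,z^{\alpha}$ is log-log-convex, then by Proposition 2 each function in the class $\mathrm{POS}$ is log-log-convex. Posynomials are of great interest in practical applications since, under a log-log transform, they become convex functions [10, 28]. More precisely, by letting $z=\exp(x)$, one has that
$$\log\psi(\exp(x))=\log\Big(\sum_{k=1}^{K}c_k\exp\big(\langle\alpha^{(k)},x\rangle\big)\Big),$$
which is a function in the $\mathrm{LSE}$ family. Furthermore, given $T>0$, since positive scaling preserves convexity, letting $\psi$ be a posynomial, we have that functions of the form
$$\psi_T(z)=\big(\psi(z^{1/T})\big)^{T} \tag{10}$$
are log-log-convex. Functions that can be rewritten in the form (10), with $\psi\in\mathrm{POS}$, are here denoted by $\mathrm{GPOS}_T$ and they form a subset of the family of the so-called generalized posynomials. It is a direct consequence of the above discussion that $\mathrm{LSE}_T$ and $\mathrm{GPOS}_T$ functions are related by a one-to-one correspondence, as stated in the following proposition.
Proposition 3. Let $f_T\in\mathrm{LSE}_T$ and $\psi_T\in\mathrm{GPOS}_T$. Then,
$$z\mapsto\exp\big(f_T(\log z)\big)\in\mathrm{GPOS}_T,\qquad x\mapsto\log\psi_T(\exp(x))\in\mathrm{LSE}_T.$$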
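The correspondence between the two classes can be illustrated numerically. The sketch below assumes the form $\psi_T(z)=\big(\sum_k c_k\prod_i z_i^{\alpha^{(k)}_i/T}\big)^T$ for a generalized posynomial and $f_T(x)=T\log\sum_k c_k\exp(\langle\alpha^{(k)},x\rangle/T)$ for the associated log-sum-exp function (that is, $c_k=b_k^{1/T}$); the function names and the random parameters are hypothetical and serve only to exercise the two maps $z=\exp(x)$ and $x=\log z$.

```python
import numpy as np

def gpos_t(z, c, alpha, T):
    """psi_T(z) = (sum_k c_k * prod_i z_i ** (alpha[k, i] / T)) ** T, for z > 0."""
    monomials = c * np.prod(z[None, :] ** (alpha / T), axis=1)
    return np.sum(monomials) ** T

def lse_t(x, c, alpha, T):
    """f_T(x) = T * log(sum_k c_k * exp(<alpha_k, x> / T)), computed stably."""
    e = alpha @ x / T + np.log(c)
    m = e.max()
    return T * (m + np.log(np.sum(np.exp(e - m))))

# Hypothetical parameters: K monomials in n positive variables.
rng = np.random.default_rng(1)
K, n, T = 4, 3, 0.5
c = rng.uniform(0.5, 2.0, size=K)
alpha = rng.normal(size=(K, n))
x = rng.normal(size=n)
z = np.exp(x)

log_psi = np.log(gpos_t(z, c, alpha, T))             # log-log transform of psi_T
f_val = lse_t(x, c, alpha, T)                        # log-sum-exp function at x
psi_from_f = np.exp(lse_t(np.log(z), c, alpha, T))   # exponential transform of f_T
print(log_psi - f_val, psi_from_f - gpos_t(z, c, alpha, T))
```

Both printed differences are zero up to rounding, which reflects the fact that the two transformations are inverse to each other.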
III Data approximation via $\mathrm{LSE}_T$ and $\mathrm{GPOS}_T$
The main objective of this section is to show that the classes $\mathrm{LSE}_T$ and $\mathrm{GPOS}_T$ can be used to approximate convex and log-log-convex data, respectively. In particular, in Section III-A, we establish that functions in $\mathrm{LSE}_T$ are universal smooth approximators of convex data. Similarly, in Section III-B, we show that functions in $\mathrm{GPOS}_T$ are universal smooth approximators of log-log-convex data.
III-A Approximation of convex data via $\mathrm{LSE}_T$
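Consider a collection $\mathcal{D}$ of $m$ data pairs,
$$\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{m},$$
where $x_i\in\mathbb{R}^n$, $y_i\in\mathbb{R}$, $i=1,\dots,m$, with
$$y_i=f(x_i),\qquad i=1,\dots,m,$$
and where $f$ is an unknown convex function. The data in $\mathcal{D}$ are referred to as convex data. The main goal of this section is to show that there exists a function $f_T\in\mathrm{LSE}_T$ that fits such convex data with arbitrarily small absolute approximation error.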
The question of the uniform approximation of a convex function by $\mathrm{LSE}_T$ functions can be considered either on $\mathbb{R}^n$, or on compact subsets of $\mathbb{R}^n$. The latter situation is the most relevant to the approximation of finite data sets. It turns out that there is a general characterization of the class of functions uniformly approximable over $\mathbb{R}^n$, which we state as Theorem 1. We then derive a uniform approximation result over compact sets (Theorem 2). Nevertheless, the approximation issue over the whole $\mathbb{R}^n$ has an intrinsic interest.
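Theorem 1. The following statements are equivalent.
(a) The function $f$ is convex and $\operatorname{dom} f^\star$ is a polytope.
(b) For all $\varepsilon>0$, there is $\bar T>0$ such that, for all $T$ with $0<T\leq\bar T$, there is $f_T\in\mathrm{LSE}_T$ such that $\|f-f_T\|_\infty\leq\varepsilon$.
(c) For all $\varepsilon>0$, there exists a convex polyhedral function $g$ such that $\|f-g\|_\infty\leq\varepsilon$.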
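Proof. (b)$\Rightarrow$(a): If $\|f-g\|_\infty<\infty$ for some function $g$, we have $\operatorname{dom} f^\star=\operatorname{dom} g^\star$. Therefore, item (b) together with the metric estimate (8), which gives $\|f_T-\bar f\|_\infty\leq T\log K$, implies that $\operatorname{dom} f^\star$ coincides with $\operatorname{dom}\bar f^\star$, which is a polytope (the convex hull of the vectors $\alpha^{(k)}$). Item (b) implies that $f$ is the pointwise limit of a sequence of convex functions, and so $f$ is convex.
(a)$\Rightarrow$(c): Suppose now that (a) holds. Let us triangulate the polytope $\operatorname{dom} f^\star$ into finitely many simplices of diameter at most $\eta$. Let $V$ denote the collection of vertices of these simplices, and define the function $g:\mathbb{R}^n\to\mathbb{R}$,
$$g(x):=\max_{v\in V}\big(\langle v,x\rangle-f^\star(v)\big).$$
Observe that $g$ is convex and polyhedral. Since $f$ is convex and finite (hence $f$ is continuous by [29, Thm. 10.1]), we have
$$f(x)=f^{\star\star}(x)=\sup_{p\in\operatorname{dom} f^\star}\big(\langle p,x\rangle-f^\star(p)\big)\geq g(x).$$
Moreover, for all $x\in\mathbb{R}^n$, the latter supremum is attained by a point $p\in\operatorname{dom} f^\star$, which belongs to some simplex of the triangulation. Let $v_0,\dots,v_n$ denote the vertices of this simplex, so that $p=\sum_{i=0}^{n}\lambda_iv_i$ where $\lambda_i\geq 0$, $i=0,\dots,n$, and $\sum_{i=0}^{n}\lambda_i=1$. Since $\operatorname{dom} f^\star$ is polyhedral, we know that $f^\star$, which is a convex function taking finite values on a polyhedron, is continuous on this polyhedron. So, $f^\star$ is uniformly continuous on $\operatorname{dom} f^\star$. It follows that we can choose $\eta$ such that $|f^\star(p)-f^\star(v_i)|\leq\varepsilon$, for all $p$ included in a simplex with vertices $v_0,\dots,v_n$ of the triangulation. Therefore, we have that
$$f(x)=\langle p,x\rangle-f^\star(p)\leq\sum_{i=0}^{n}\lambda_i\big(\langle v_i,x\rangle-f^\star(v_i)\big)+\varepsilon\leq g(x)+\varepsilon,$$
which shows that (c) holds.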
The condition that the domain of $f^\star$ is a polytope in Theorem 1 is rather restrictive. This entails that the map $f$ is Lipschitz, with constant $\sup_{p\in\operatorname{dom} f^\star}\|p\|$, where $\|\cdot\|$ is the Euclidean norm. In contrast, not every Lipschitz function has a Fenchel transform with polyhedral domain. For instance, if $f(x)=\|x\|$, $\operatorname{dom} f^\star$ is the unit Euclidean ball. However, the condition on the domain of $f^\star$ only involves the behavior of $f$ “at infinity”. Theorem 2 below shows that when considering the approximation problems over compact sets, the restriction to a polyhedral domain can be dispensed with.
Theorem 2 (Universal approximators of convex functions).
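Let $f$ be a real-valued continuous convex function defined on a compact convex set $\mathcal{K}\subset\mathbb{R}^n$. Then, for all $\varepsilon>0$ there exist $T>0$ and a function $f_T\in\mathrm{LSE}_T$ such that
$$|f_T(x)-f(x)|\leq\varepsilon,\qquad\text{for all }x\in\mathcal{K}. \tag{11}$$
If (11) holds, then $f_T$ is an $\varepsilon$-approximation of $f$ on $\mathcal{K}$.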
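Proof. We first show that the statement of the theorem holds under the additional assumptions that $f$ is $L$-Lipschitz continuous on $\mathcal{K}$ for some constant $L>0$ and that $\mathcal{K}$ has non-empty interior. Observe that there is a sequence $(x_k)_{k\geq 1}$ of elements in the interior of $\mathcal{K}$ that is dense in $\mathcal{K}$ (for instance, we may consider the set of vectors in the interior of $\mathcal{K}$ that have rational coordinates; this set is denumerable, and so, by indexing its elements in an arbitrary way, we get a sequence that is dense in $\mathcal{K}$). In what follows, we shall identify $f$ with the convex function $\mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$ that coincides with $f$ on $\mathcal{K}$ and takes the value $+\infty$ elsewhere. Recall in particular that the subdifferential of $f$ at a point $y\in\operatorname{dom} f$ is the set
$$\partial f(y):=\{v\in\mathbb{R}^n : f(x)-f(y)\geq\langle v,x-y\rangle,\ \forall x\in\mathbb{R}^n\},$$
and that, by Theorem 23.4 of [29], $\partial f(y)$ is non-empty for all $y$ in the relative interior of the domain of $f$, i.e., here, in the interior of $\mathcal{K}$. It is also known that $\|v\|\leq L$ for all $v\in\partial f(y)$ with $y$ in the interior of $\mathcal{K}$ (Corollary 13.3.3 of [29]). Let us now choose in an arbitrary way an element $v_k\in\partial f(x_k)$, for each $k\geq 1$, and consider the map $f_k:\mathbb{R}^n\to\mathbb{R}$,
$$f_k(x):=\max_{1\leq i\leq k}\big(f(x_i)+\langle v_i,x-x_i\rangle\big).$$
By definition of the subdifferential, we have $f_k\leq f$ for all $k$, and by construction of $f_k$, $f_k(x_i)=f(x_i)$ for all $i\leq k$, so the sequence $(f_k)_{k\geq 1}$ converges pointwise to $f$ on the set $X:=\{x_i : i\geq 1\}$. Since $\|v_i\|\leq L$, every map $x\mapsto f(x_i)+\langle v_i,x-x_i\rangle$ is Lipschitz of constant $L$, and so, $f_k$ is also Lipschitz of constant $L$. Hence, the sequence of maps $(f_k)_{k\geq 1}$ is equi-Lipschitz. A fortiori, it is equicontinuous. Then, by the second theorem of Ascoli (Théorème T.2, XX, 3; 1), the pointwise convergence of the sequence $(f_k)_{k\geq 1}$ to $f$ on the set $X$ implies that the same sequence converges uniformly to $f$ on the closure of $X$, that is, on $\mathcal{K}$. In particular, for all $\varepsilon>0$, we can find an integer $k$ such that
$$|f_k(x)-f(x)|\leq\varepsilon/2,\qquad\text{for all }x\in\mathcal{K}.$$
Since $f_k$ is a max-affine function with $k$ terms, the bounds (8) provide a function $f_T\in\mathrm{LSE}_T$ with $|f_T(x)-f_k(x)|\leq T\log k$ for all $x$; choosing $T$ such that $T\log k\leq\varepsilon/2$ yields (11).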
We now relax the assumption that $f$ is Lipschitz continuous. Consider, for all $\eta>0$, the Moreau-Yosida regularization of $f$, which is the map $g_\eta:\mathbb{R}^n\to\mathbb{R}$ defined by
$$g_\eta(x):=\inf_{y\in\mathbb{R}^n}\Big(\frac{1}{2\eta}\|x-y\|^2+f(y)\Big).$$
Observe that $\eta\mapsto g_\eta$ is nonincreasing, and that $g_\eta\leq f$. It is known that the function $g_\eta$ is convex, being the inf-convolution of two convex functions (Theorem 5.4 of [29]); it is also known that $g_\eta$ is Lipschitz on $\mathcal{K}$ (Th. 4.1.4) and that the family of functions $(g_\eta)_{\eta>0}$ converges pointwise to $f$ as $\eta\to 0$ (Prop. 4.1.6, ibid.). Moreover, we supposed that $f$ is continuous. We now use a theorem of Dini, showing that if a monotone family of continuous real-valued maps defined on a compact set converges pointwise to a continuous function, then this family converges uniformly. It follows that $g_\eta$ converges uniformly to $f$ on the compact set $\mathcal{K}$ as $\eta\to 0$. In particular, we can find $\eta>0$ such that $|g_\eta(x)-f(x)|\leq\varepsilon/2$ holds for all $x\in\mathcal{K}$. Applying the statement of the theorem, which is already proved in the case of Lipschitz convex maps, to the map $g_\eta$, we get that there exists a map $f_T\in\mathrm{LSE}_T$ for some $T>0$ such that $|f_T(x)-g_\eta(x)|\leq\varepsilon/2$ holds for all $x\in\mathcal{K}$, and so $|f_T(x)-f(x)|\leq\varepsilon$, for all $x\in\mathcal{K}$, showing that the statement of the theorem again holds for $f$.
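Finally, it is easy to relax the assumption that $\mathcal{K}$ has non-empty interior: denoting by $E$ the affine space generated by $\mathcal{K}$, we can decompose a vector $x\in\mathbb{R}^n$ in a unique way as $x=y+z$ with $y\in E$ and $z\in G$, where $G$ is the vector space orthogonal to the direction of $E$. Setting $\tilde f(x):=f(y)$ allows us to extend $f$ to a convex continuous function $\tilde f$, constant on any direction orthogonal to $E$, and whose domain contains $\tilde{\mathcal{K}}:=\{y+z : y\in\mathcal{K},\ z\in G,\ \|z\|\leq 1\}$, which is a compact convex set of non-empty interior. By applying the statement of the theorem to $\tilde f$, we get an $\varepsilon$-approximation of $\tilde f$ on $\tilde{\mathcal{K}}$ by a map $f_T$ in $\mathrm{LSE}_T$. A fortiori, $f_T$ is an $\varepsilon$-approximation of $f$ on $\mathcal{K}$. ∎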
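The construction used in the proof (tangent cuts at finitely many points, followed by a log-sum-exp smoothing at small temperature) can be tried out numerically. The following sketch is only an illustration under stated assumptions: the convex function $f(x)=x^2$ on $[-1,1]$, the equispaced tangent points (which include the two endpoints, unlike the interior sequence used in the proof), the values of $k$ and $T$, and all function names are choices made here, not part of the paper or its toolbox.

```python
import numpy as np

def tangent_cuts(f, df, points):
    """Slopes and intercepts of the tangent lines f(p) + f'(p) * (x - p)."""
    slopes = df(points)
    intercepts = f(points) - slopes * points
    return slopes, intercepts

def max_affine(xs, slopes, intercepts):
    """Pointwise maximum of the affine cuts on the grid xs."""
    return (np.outer(xs, slopes) + intercepts).max(axis=1)

def lse_t(xs, slopes, intercepts, T):
    """T * log(sum_k exp((slope_k * x + intercept_k) / T)), computed stably."""
    z = (np.outer(xs, slopes) + intercepts) / T
    m = z.max(axis=1, keepdims=True)
    return T * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

f = lambda x: x ** 2       # illustrative convex function (an assumption)
df = lambda x: 2.0 * x     # its derivative, used as the subgradient

xs = np.linspace(-1.0, 1.0, 2001)     # evaluation grid on the compact set [-1, 1]
k, T = 41, 1e-3                       # number of cuts and temperature
points = np.linspace(-1.0, 1.0, k)    # finite stand-in for the dense sequence

slopes, intercepts = tangent_cuts(f, df, points)
f_bar = max_affine(xs, slopes, intercepts)
f_T = lse_t(xs, slopes, intercepts, T)

err_cuts = np.max(f(xs) - f_bar)        # gap between f and its max-affine minorant
err_lse = np.max(np.abs(f(xs) - f_T))   # uniform error of the smooth model on the grid
print(err_cuts, err_lse, T * np.log(k))
```

Because $\bar f\leq f$ and $\bar f\leq f_T\leq\bar f+T\log k$, the uniform error of the smooth model is at most the larger of the max-affine gap and $T\log k$; increasing $k$ while decreasing $T$ so that $T\log k\to 0$ drives both quantities to zero on this example.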
The following proposition is now an immediate consequence of Theorem 2, where $\mathcal{K}$ can be taken as the convex hull of the input data.
Proposition 4 (Universal approximators of convex data).
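Given a collection $\mathcal{D}$ of convex data generated by an unknown convex function, for each $\varepsilon>0$ there exist $T>0$ and $f_T\in\mathrm{LSE}_T$ such that
$$|f_T(x_i)-y_i|\leq\varepsilon,\qquad i=1,\dots,m.$$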
The following counterexample shows that, in general, we cannot find a function matching exactly the data points, i.e., some approximation is sometimes unavoidable.
Suppose first that $n=1$, and consider data $(x_i,y_i)$, with $y_i=f(x_i)$, generated by a convex function $f$ and chosen so that the data points are not all aligned, but three of them are aligned. Suppose now that this dataset is matched exactly by a function $f_T\in\mathrm{LSE}_T$ with $T>0$, parametrized as in (3). Since the points $(x_i,y_i)$ are not all aligned, we know, by the remark following Proposition 1, that the family $\{\alpha^{(k)}\}$ contains an affinely generating family of $\mathbb{R}$ (in dimension $1$, this simply means that the $\alpha^{(k)}$ take at least two values). It follows from Proposition 1 that $f_T$ is strictly convex. However, a strictly convex function cannot match exactly the subset of data consisting of three aligned points.
This entails that in any dimension $n$, there are also data sets that cannot be matched exactly. Indeed, if $f_T\in\mathrm{LSE}_T$ is a function of $n$ variables, then, for any vectors $u,v\in\mathbb{R}^n$, the function of one variable $s\mapsto f_T(u+sv)$ is also in $\mathrm{LSE}_T$. Hence, if a data set $(x_i,y_i)$, $i=1,\dots,m$, is such that a subset of the points $x_i$ is included in an affine line $\{u+sv : s\in\mathbb{R}\}$, and if a function $f_T\in\mathrm{LSE}_T$ matches exactly the set of data, then the function $s\mapsto f_T(u+sv)$ is the solution of an exact matching problem by a univariate function in $\mathrm{LSE}_T$, and the previous counterexample in dimension $1$ shows that this problem need not be solvable.
III-B Approximation of log-log-convex data via $\mathrm{GPOS}_T$
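Consider a collection $\mathcal{L}$ of $m$ data pairs,
$$\mathcal{L}=\{(z_i,w_i)\}_{i=1}^{m},$$
where $z_i\in\mathbb{R}^n_{>0}$, $w_i\in\mathbb{R}_{>0}$, $i=1,\dots,m$, with
$$w_i=\psi(z_i),\qquad i=1,\dots,m,$$
where $\psi$ is an unknown log-log-convex function. The data in $\mathcal{L}$ are referred to as log-log-convex. The following corollary states that there exists $\psi_T\in\mathrm{GPOS}_T$ that fits the data with arbitrarily small relative approximation error. A subset $\mathcal{K}\subset\mathbb{R}^n_{>0}$ will be said to be log-convex if its image by the map which performs the entry-wise $\log$ is convex.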
Corollary 1 (Universal approximators of log-log-convex functions).
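Let $\psi$ be a log-log-convex function defined on a compact log-convex subset $\mathcal{K}\subset\mathbb{R}^n_{>0}$. Then, for any $\varepsilon>0$ there exist $T>0$ and a function $\psi_T\in\mathrm{GPOS}_T$ such that, for all $z\in\mathcal{K}$,
$$\frac{|\psi(z)-\psi_T(z)|}{\psi(z)}\leq\varepsilon. \tag{14}$$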
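Proof. By using the log-log transformation, define $f(x):=\log\psi(\exp(x))$. Since $\psi$ is log-log-convex in $z$, $f$ is convex in $x=\log z$. Furthermore, the set $\log\mathcal{K}$ is convex and compact since the set $\mathcal{K}$ is log-convex and compact. Thus, by Theorem 2, for all $\tilde\varepsilon>0$, there exist $T>0$ and a function $f_T\in\mathrm{LSE}_T$ such that $|f_T(x)-f(x)|\leq\tilde\varepsilon$ for all $x\in\log\mathcal{K}$. Note that, by construction,
$$|f_T(x)-f(x)|=\Big|\log\frac{\psi_T(z)}{\psi(z)}\Big|,$$
where $z=\exp(x)$ and $\psi_T(z):=\exp\big(f_T(\log z)\big)$. Thus, since, by the reasoning given above, we have $\psi_T\in\mathrm{GPOS}_T$ and $|\log(\psi_T(z)/\psi(z))|\leq\tilde\varepsilon$, it results that
$$\exp(-\tilde\varepsilon)\leq\frac{\psi_T(z)}{\psi(z)}\leq\exp(\tilde\varepsilon).$$
Thus, it results that, for all $z\in\mathcal{K}$,
$$\frac{|\psi(z)-\psi_T(z)|}{\psi(z)}\leq\varepsilon,$$
where $\varepsilon:=\exp(\tilde\varepsilon)-1$. Hence, (14) holds since $\varepsilon$ can be made arbitrarily small by letting $\tilde\varepsilon$ be sufficiently small. ∎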
The following proposition is now an immediate consequence of Corollary 1, where $\mathcal{K}$ can be taken as the log-convex hull of the input data points. (For given points $z_1,\dots,z_m\in\mathbb{R}^n_{>0}$, we define their log-convex hull as the set of vectors $z=\prod_{i=1}^{m}z_i^{\theta_i}$, where $\theta_i\in[0,1]$ for all $i$ and $\sum_{i=1}^{m}\theta_i=1$; all operations are here intended entry-wise.)
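Proposition 5. Given a collection of log-log-convex data $\mathcal{L}$, for each $\varepsilon>0$ there exist $T>0$ and a $\psi_T\in\mathrm{GPOS}_T$ such that
$$\frac{|\psi_T(z_i)-w_i|}{w_i}\leq\varepsilon,\qquad i=1,\dots,m.$$
A reasoning analogous to the one used in Remark 1 can be employed to show that, given a collection $\mathcal{L}$ of log-log-convex data pairs, there need not exist $\psi_T\in\mathrm{GPOS}_T$ that matches exactly the data in $\mathcal{L}$, for any $T>0$.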
Propositions 4 and 5 establish that functions in $\mathrm{LSE}_T$ and $\mathrm{GPOS}_T$ can be used as universal smooth approximators of convex and log-log-convex data, respectively. However, there is a difference between the type of approximation of these two classes of functions. As a matter of fact, given a collection of convex data