I Introduction
Supervised learning entails finding a function $f$ using a set of training data points $\{\mathbf{x}_n\}_{n=1}^{N}$ and their target values $\{y_n\}_{n=1}^{N}$. The function should approximate the target values at the data points and generalize well to new inputs [hastie2009overview]. In a variational framework, the learning problem is formalized as the optimization task
(1) $f^\ast = \arg\min_{f \in \mathcal{F}} \sum_{n=1}^{N} E\big(y_n, f(\mathbf{x}_n)\big) + \lambda \mathcal{R}(f),$
where $\mathcal{F}$ is the function search space,
$E: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a convex loss function controlling the data-fitting error, and
$\mathcal{R}: \mathcal{F} \to \mathbb{R}$ is the regularizer. The hyperparameter $\lambda \geq 0$ adjusts the contribution of the regularizer. Regularization is used to promote functions with desirable structures such as sparsity and to reduce overfitting [tian2022comprehensive, scholkopf2002learning]. In order to make (1) tractable, the search space
$\mathcal{F}$ is usually expressed as a parametric space. For instance, linear regression—the simplest model—reduces the learning problem to the search for $\mathbf{w} \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ [gross2003linear]. This model is very well understood, but it severely lacks expressivity. Another approach to the learning problem is founded on the theory of reproducing-kernel Hilbert spaces (RKHS) [kadri2010nonlinear], [hofmann2005tutorial]. If the search space in (1) is the reproducing-kernel Hilbert space $\mathcal{H}$ with kernel $k(\cdot, \cdot)$, and if the regularizer is $\mathcal{R}(f) = \|f\|_{\mathcal{H}}^2$, where $\|\cdot\|_{\mathcal{H}}$ is the Hilbert-space norm, then the RKHS representer theorem states that the solution of (1) admits the closed form
(2) $f^\ast(\mathbf{x}) = \sum_{n=1}^{N} a_n k(\mathbf{x}, \mathbf{x}_n)$
for some coefficients $a_1, \ldots, a_N \in \mathbb{R}$ [scholkopf2001generalized]. There, the problem is recast into a solvable linear model. Moreover, by using radial-basis functions as kernels, one can approach continuous functions as closely as desired [schaback2007practical], [miller1992review], [micchelli2006universal]. For high-dimensional data, however, kernel methods are outperformed by deep neural networks, which are the state of the art in many applications [abiodun2018state]. A neural network of depth $L$ forms a mapping
(3) $f = \sigma_L \circ f_L \circ \cdots \circ \sigma_1 \circ f_1,$
where $f_\ell$ is a learnable affine mapping parameterized by $\boldsymbol{\theta}_\ell$ and
$\sigma_\ell$ is the nonlinearity (a.k.a. “activation function”) performed in the
$\ell$th layer. Among all possible activation functions, the most common choice and usually the best in terms of performance is the rectified linear unit $\mathrm{ReLU}(x) = \max(0, x)$ [schmidt2020nonparametric]. It is known that a ReLU network produces continuous and piecewise-linear (CPWL) functions [montufar2014number]. Conversely, any CPWL function can be described by a ReLU network [arora2016understanding]. The CPWL model is universal, in the sense that it can approximate any continuous mapping [DBLP:conf/iclr/ParkYLS21]. However, a major drawback of letting deep networks learn a CPWL function is their lack of interpretability, in the sense that the effect of each parameter on the generated mapping is not understood explicitly [fan2021interpretability]. There also exists a variational interpretation of shallow ReLU networks [parhi2021kinds, unser2022ridges, ongie2019function]. However, shallow networks cannot represent all CPWL functions, especially in high dimensions. In fact, no variational interpretation is known for deep networks, to the best of our knowledge. Interestingly, there exist other variational problems that admit global CPWL minimizers [debarre2019b], [debarre2021sparsest]. If we investigate (1) in the one-dimensional case $d = 1$, with the second-order total variation $\mathrm{TV}^{(2)}$ as the regularizer, and if we restrict the search space to functions with a bounded second-order total variation, then the extreme points of the solution set are necessarily of the form
(4) $f(x) = b_1 + b_2 x + \sum_{k=1}^{K} a_k (x - \tau_k)_+,$
where $b_1, b_2 \in \mathbb{R}$, $\mathbf{a} = (a_1, \ldots, a_K) \in \mathbb{R}^K$, and $\tau_1 < \cdots < \tau_K$ are the knots [unser2021unifying]. For such solutions, the regularization has the simple closed form $\mathrm{TV}^{(2)}(f) = \|\mathbf{a}\|_1$. As the $\ell_1$ norm promotes sparsity, a regularization by the second-order total variation will promote CPWL functions with few knots. A multidimensional generalization of $\mathrm{TV}^{(2)}$ is the Hessian total-variation (HTV) seminorm [aziznejad2021measuring]. The HTV is equivalent to $\mathrm{TV}^{(2)}$ in the one-dimensional case, and a similar regularizer was introduced before for image-reconstruction purposes [lefkimmiatis2011hessian]. Since the null space of the HTV is composed of affine mappings, this regularization favors solutions that are predominantly affine; typically, CPWL functions with few pieces. The authors in [campos2021learning] use HTV regularization to learn two-dimensional functions. They parameterize the function search space by a span of linear box splines, which are themselves CPWL functions [condat2006three]. Although their parameterization has good approximation power, it has some drawbacks. First, the number of box-spline basis functions grows exponentially with the dimension. Second, there are many domain regions without corresponding data points. These regions add complexity to the model without benefiting its performance.
In this paper, we investigate an alternative approach that is based on the Delaunay triangulation in an irregular setting. Here are our primary contributions.

Flexible parameterization of CPWL functions. We partition the domain of CPWL functions by the Delaunay triangulation on a set of adaptive grid points; the function values at the grid points are our learnable parameters and identify the CPWL functions uniquely.

Explicit computation of the HTV in any dimension. For the proposed CPWL model, we show that the HTV is the
$\ell_1$ norm applied to a linear transformation of the grid-point values.

Experimental validation of the scheme in two and more dimensions, with real data.
Our learning approach uses the expressivity of CPWL functions in a stable and interpretable manner [goujon2022stable]. These properties stem from the nature of the proposed Delaunay-based parameterization. In addition, the use of HTV regularization enables us to control the complexity of the final mapping through a single hyperparameter.
Our paper is organized as follows: In Section II, we introduce the mathematical tools that we need to develop our method. We define the Delaunay triangulation, CPWL functions, and the Hessian total variation. In Section III, we describe our CPWL parameterization, explain the procedure to calculate its HTV, and derive the generalized least-absolute-shrinkage-and-selection-operator (LASSO) formulation of our learning problem. Finally, we present our experimental results in Section IV.
II Preliminaries
II-A Delaunay Triangulation
In the $d$-dimensional space $\mathbb{R}^d$, the convex hull of $(d+1)$ affinely independent points forms a polytope known as a $d$-simplex, or simplex for short. These simplices are triangles and tetrahedra in 2- and 3-dimensional spaces, respectively. A triangulation of a set of points is the partition of their convex hull into simplices such that any two simplices intersect either at a joint face or not at all. Also, the triangulation vertices, referred to as grid points, are exactly the points themselves. We consider two simplices as neighbors if their intersection is a facet, which is a $(d-1)$-simplex. In general, a triangulation of a set of points is not unique. A celebrated triangulation method is the Delaunay triangulation.
Definition (Delaunay Triangulation). For a set of points $X$ in $\mathbb{R}^d$, a Delaunay triangulation is a triangulation such that no point in $X$ is inside the circum-hypersphere of any of its simplices.
Simply put, the Delaunay triangulation partitions the convex hull of the points into well-formed simplices. Specifically, for $d = 2$, it is known that the Delaunay triangulation maximizes the minimal angle of the triangles of the triangulation and avoids skinny triangles. Similar optimality properties exist in higher dimensions [rajan1994optimality]. In addition, there exist computational methods that produce Delaunay triangulations in any dimension [lee1980two], [cignoni1998dewall].
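For illustration (our sketch, not part of the paper's pipeline), a Delaunay triangulation of a small planar point set can be computed with SciPy's Qhull-based `Delaunay` class, which also supports locating the simplex that contains a query point:

```python
import numpy as np
from scipy.spatial import Delaunay

# Grid points: the four corners and the center of the unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

tri = Delaunay(points)

# Each row of `tri.simplices` lists the (d+1) vertex indices of one simplex.
print(tri.simplices.shape[1])   # 3 vertices per triangle in 2-D
print(len(tri.simplices))       # 4 triangles around the center point

# `find_simplex` locates the simplex containing a query point (-1 if outside).
print(tri.find_simplex(np.array([[0.25, 0.5]])))
print(tri.find_simplex(np.array([[2.0, 2.0]])))
```

Here the cocircular square corners would make the triangulation ambiguous on their own; adding the center point yields the four well-formed triangles that the Delaunay criterion selects.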
II-B Continuous and Piecewise-Linear Functions
Definition (CPWL function). A function $f: \mathbb{R}^d \to \mathbb{R}$ is continuous and piecewise-linear if

it is continuous;

its domain can be partitioned into a set of non-overlapping polytopes $\{P_m\}_{m=1}^{M}$ over which it is affine, with $M < \infty$.
The gradient of the function over each polytope $P_m$ or, equivalently, each linear region, is $\nabla u_m$. We denote the intersection facet of two neighboring polytopes $P_m$ and $P_n$ as $U_{m,n} = P_m \cap P_n$. The $(d-1)$-dimensional volume of the intersection is denoted by $\mathrm{Vol}_{d-1}(U_{m,n})$. For $d = 2$ and $d = 3$, this volume corresponds to length and area, respectively. Finally, we define
$\mathbf{n}_{m,n}$ as the unit vector that is normal to
$U_{m,n}$. We follow this notation throughout the paper. Alternatively, any CPWL function can be defined by a triangulation and the values of the function at its vertices, which we refer to as the simplicial parameterization of CPWL functions. The authors in [liu2020delaunay] use a similar parameterization. This parameterization yields a Riesz basis, which guarantees a unique and stable link between the model parameters and the CPWL function [goujon2022stable]. In Figure 1, we show an example of the domain of an arbitrary CPWL function and a possible triangulation of it.
II-C Hessian Total Variation
II-C1 Generalized Hessian Matrix
The Hessian matrix of a twice-differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is defined as
(5) $\mathbf{H}f(\mathbf{x}) = \left[\frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}\right]_{1 \leq i, j \leq d}.$
One can extend the Hessian definition to generalized functions (distributions) by using the notion of weak partial derivative. This enables us to define the Hessian of CPWL functions even though they are not twice-differentiable everywhere. The directional derivative at a point $\mathbf{x}$ and along the unit vector $\mathbf{u}$ is defined as
(6) $\mathrm{D}_{\mathbf{u}} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h}.$
Likewise, the second-order directional derivative along the direction $\mathbf{u}$ is $\mathrm{D}^2_{\mathbf{u}} f(\mathbf{x}) = \mathrm{D}_{\mathbf{u}}\big(\mathrm{D}_{\mathbf{u}} f\big)(\mathbf{x})$. Using the Hessian matrix, we write that
$\mathrm{D}^2_{\mathbf{u}} f(\mathbf{x}) = \mathbf{u}^T \mathbf{H}f(\mathbf{x}) \mathbf{u}$. A symmetric Hessian matrix has a complete set of eigenvalues $\mu_k$ and eigenvectors $\mathbf{e}_k$ which form an orthogonal basis for
$\mathbb{R}^d$. Consequently, the second-order directional derivative along the eigenvector $\mathbf{e}_k$ of the Hessian is its associated eigenvalue $\mu_k$, with $\mathrm{D}^2_{\mathbf{e}_k} f(\mathbf{x}) = \mu_k$ for $k = 1, \ldots, d$. If we use the eigenvectors of the Hessian to represent the direction $\mathbf{u} = \sum_{k=1}^{d} u_k \mathbf{e}_k$, then we have that $\mathrm{D}^2_{\mathbf{u}} f(\mathbf{x}) = \sum_{k=1}^{d} u_k^2 \mu_k$. This means that, at each point of the domain, the second-order directional derivatives along the eigenvectors of the Hessian fully characterize the second-order derivatives of $f$ along any direction.

II-C2 Schatten p-Norm
The Schatten $p$-norm of a matrix $\mathbf{M} \in \mathbb{R}^{d \times d}$ for $1 \leq p \leq \infty$ is defined as the
$\ell_p$ norm of its singular values, given by
(7) $\|\mathbf{M}\|_{S_p} = \Big(\sum_{k=1}^{d} s_k^p(\mathbf{M})\Big)^{1/p},$
where $s_k(\mathbf{M})$ are the singular values of $\mathbf{M}$. In this paper, we focus on the $S_1$ norm (nuclear norm) and its dual $S_\infty$ (spectral norm).
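For concreteness, the Schatten p-norm is readily computed from a singular-value decomposition; the snippet below is a minimal sketch (our illustration, not code from the paper):

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten p-norm: the l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)  # singular values, descending order
    if np.isinf(p):
        return s[0]        # S_inf: largest singular value (spectral norm)
    return np.sum(s ** p) ** (1.0 / p)

M = np.array([[3.0, 0.0], [0.0, -4.0]])    # singular values: 4 and 3
print(schatten_norm(M, 1))        # S_1 (nuclear) norm: 4 + 3 = 7
print(schatten_norm(M, np.inf))   # S_inf (spectral) norm: 4
print(schatten_norm(M, 2))        # S_2 (Frobenius) norm: 5
```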
II-C3 Hessian Total Variation
If $f$ is twice differentiable with a symmetric Hessian at $\mathbf{x}$, then $\|\mathbf{H}f(\mathbf{x})\|_{S_1}$ is given by the $\ell_1$ norm of the second-order directional derivatives along the eigenvectors of the Hessian matrix. This measure provides a local characterization of the second-order variation of $f$. Hence, the total variation of the mapping $\mathbf{x} \mapsto \|\mathbf{H}f(\mathbf{x})\|_{S_1}$ is a reasonable generalization of the second-order total variation for multidimensional functionals. This is referred to as the Hessian total variation (HTV). If $\mathbf{H}f \in L_1(\mathbb{R}^d; \mathbb{R}^{d \times d})$, then the HTV is simply defined as
(8) $\mathrm{HTV}(f) = \int_{\mathbb{R}^d} \|\mathbf{H}f(\mathbf{x})\|_{S_1} \, \mathrm{d}\mathbf{x}.$
In particular, by using the HTV as a regularizer, we promote configurations where most of the singular values of the Hessian are zero throughout the function domain, due to the sparsity effects of the $\ell_1$ and $S_1$ norms. In that manner, CPWL functions are of special interest as their second-order directional derivatives vanish almost everywhere. Specifically, the gradient of a CPWL function admits the explicit form
$\nabla f(\mathbf{x}) = \sum_{m=1}^{M} \mathbb{1}_{P_m}(\mathbf{x}) \nabla u_m$ (almost everywhere), where the indicator function $\mathbb{1}_{P_m}$ is equal to one inside the polytope $P_m$ and zero elsewhere. Hence, the second-order directional derivatives vanish everywhere except on the boundaries of the domain polytopes. There, the generalized Hessian matrix of the CPWL functions exhibits delta-like distributions. This means that (8) does not hold for CPWL functions, as $\mathbf{H}f \notin L_1$. Yet, we can extend the definition of the Hessian total variation to accommodate CPWL functions by invoking duality. Specifically, we define
(9) $\mathrm{HTV}(f) = \|\mathbf{H}f\|_{S_1, \mathcal{M}},$
where the mixed norm for any matrix-valued distribution $\mathbf{F}$ is
(10) $\|\mathbf{F}\|_{S_1, \mathcal{M}} = \sup\big\{\langle \mathbf{F}, \mathbf{G} \rangle : \mathbf{G} \in \mathcal{S}(\mathbb{R}^d; \mathbb{R}^{d \times d}),\ \sup_{\mathbf{x}} \|\mathbf{G}(\mathbf{x})\|_{S_\infty} \leq 1\big\}.$
In (10), $\mathcal{S}(\mathbb{R}^d; \mathbb{R}^{d \times d})$ is the matrix-valued Schwartz space and the duality product is defined as
(11) $\langle \mathbf{F}, \mathbf{G} \rangle = \int_{\mathbb{R}^d} \mathrm{Tr}\big(\mathbf{F}(\mathbf{x})^T \mathbf{G}(\mathbf{x})\big)\,\mathrm{d}\mathbf{x}.$
This formulation enables us to derive a closed form for the HTV of a CPWL function as
(12) $\mathrm{HTV}(f) = \sum_{(m,n) \in \mathcal{P}} \big|(\nabla u_m - \nabla u_n)^T \mathbf{n}_{m,n}\big| \, \mathrm{Vol}_{d-1}(U_{m,n})$
(13) $\phantom{\mathrm{HTV}(f)} = \sum_{(m,n) \in \mathcal{P}} \|\nabla u_m - \nabla u_n\|_2 \, \mathrm{Vol}_{d-1}(U_{m,n}),$
where the set $\mathcal{P}$ contains all unique pairs of neighboring polytope indices, and where the second equality holds because the continuity of $f$ forces the gradient jump to be normal to the common facet (see [aziznejad2021measuring] for more details). Equation (13) tells us that, for CPWL functions, the HTV penalizes the change of slope between neighboring domain polytopes, weighted by the volume of their intersection. For a purely affine mapping, the HTV is zero. Otherwise, it increases with the number of affine pieces.
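To make this closed form concrete, one can check it numerically on the simplest nontrivial example: two triangles sharing an edge. The sketch below (our illustration in plain NumPy, not the authors' code) recovers the gradient of each affine piece from its vertex values and weights the slope change by the length of the shared facet:

```python
import numpy as np

def simplex_gradient(vertices, values):
    """Gradient of the affine function that interpolates `values` at `vertices`.

    Solves a.x + b = value at each vertex; returns a (the gradient)."""
    A = np.hstack([vertices, np.ones((len(vertices), 1))])
    coeffs = np.linalg.solve(A, values)
    return coeffs[:-1]

# Two triangles sharing the diagonal from (0,0) to (1,1).
T1 = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
T2 = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
# Vertex values of a "valley" function: f(0,0)=0, f(1,0)=1, f(1,1)=0, f(0,1)=1.
f1 = np.array([0.0, 1.0, 0.0])
f2 = np.array([0.0, 0.0, 1.0])

g1 = simplex_gradient(T1, f1)   # (1, -1)
g2 = simplex_gradient(T2, f2)   # (-1, 1)

# HTV contribution of this pair: slope change times shared-facet length.
facet_length = np.linalg.norm([1.0, 1.0])       # sqrt(2)
htv = np.linalg.norm(g1 - g2) * facet_length    # 2*sqrt(2) * sqrt(2) = 4
print(htv)
```

A single affine function over both triangles would give identical gradients and a zero contribution, matching the statement that the HTV vanishes for purely affine mappings.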
The HTV admits the following properties:
(14) $\mathrm{HTV}\big(f(\cdot - \mathbf{x}_0)\big) = \mathrm{HTV}(f), \quad \mathrm{HTV}\big(f(\mathbf{R}\cdot)\big) = \mathrm{HTV}(f), \quad \mathrm{HTV}\big(f(\alpha\,\cdot)\big) = \alpha^{2-d}\,\mathrm{HTV}(f)$
for any shift $\mathbf{x}_0 \in \mathbb{R}^d$, rotation matrix $\mathbf{R}$, and scale $\alpha > 0$. In other words, the HTV is invariant to translation and rotation while it is covariant with scaling.
III Methods
Let us perform a Delaunay triangulation on the set $X = \{\mathbf{v}_n\}_{n=1}^{N}$ of grid points. We denote the resulting simplices by $\mathcal{T} = \{T_j\}_{j=1}^{J}$. We define the CPWL functions inside the convex hull $\mathrm{CH}(X)$ by designating the members of $\mathcal{T}$ as their linear regions. Then, the functions are parameterized as
(15) $f(\mathbf{x}) = \sum_{n=1}^{N} c_n \beta_n(\mathbf{x}),$
with expansion coefficients $c_n \in \mathbb{R}$ and basis functions $\beta_n$. The basis $\beta_n$ is the hat function attached to the $n$th vertex. It is given by
(16) $\beta_n(\mathbf{x}) = \begin{cases} \lambda_n^{j}(\mathbf{x}), & \mathbf{x} \in T_j \text{ and } \mathbf{v}_n \text{ is a vertex of } T_j, \\ 0, & \text{otherwise}, \end{cases}$
where $T_j$ is a simplex that contains the vertex $\mathbf{v}_n$ and $\lambda_n^{j}(\mathbf{x})$ is the barycentric coordinate of the point $\mathbf{x}$ inside the simplex $T_j$ with respect to the vertex $\mathbf{v}_n$. The basis functions are continuous and form a Riesz basis for the space. From the definition of barycentric coordinates, we know that $\beta_n(\mathbf{v}_n) = 1$ and that $\beta_n(\mathbf{v}_{n'}) = 0$ for $n' \neq n$. We illustrate an example of this basis function in Figure 2 for $d = 2$. Consequently, the coefficients in (15) are given by $c_n = f(\mathbf{v}_n)$. The item of relevance is that the function is uniquely identified by the vector $\mathbf{c} = (c_1, \ldots, c_N)$. An example of such a function is illustrated in Figure 3 (a). In our work, the grid points are immutable and $\mathbf{c}$ is our vector of learnable parameters. These parameters are the sampled values of the function at the vertices of the triangulation. Hence, their effect on the generated mapping is known, which makes our model directly interpretable. We can also write the function as
(17) $f(\mathbf{x}) = \mathbf{h}(\mathbf{x})^T \mathbf{c},$
where $\mathbf{h}(\mathbf{x}) = \big(\beta_1(\mathbf{x}), \ldots, \beta_N(\mathbf{x})\big)$. The importance of this representation is that it expresses the function value as a linear combination of the grid-point values $\mathbf{c}$. In each simplex, the only nonzero basis functions are the ones defined over the vertices of that simplex. This implies that there are at most $(d+1)$ nonzero values in the vector $\mathbf{h}(\mathbf{x})$.
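The evaluation of such a simplicial parameterization can be sketched with SciPy, whose `Delaunay` object exposes the affine maps needed for barycentric coordinates. The helper `cpwl_evaluate` below is our illustration, not the authors' implementation:

```python
import numpy as np
from scipy.spatial import Delaunay

def cpwl_evaluate(tri, c, x):
    """Evaluate the CPWL function with grid values `c` at query points `x`."""
    x = np.atleast_2d(x)
    j = tri.find_simplex(x)                 # containing simplex for each point
    if np.any(j < 0):
        raise ValueError("query point outside the convex hull")
    d = tri.points.shape[1]
    # SciPy stores, per simplex, the affine map giving barycentric coordinates.
    T = tri.transform[j, :d]                # (n_pts, d, d) inverse matrices
    r = x - tri.transform[j, d]             # offsets from the reference vertex
    lam = np.einsum('nij,nj->ni', T, r)     # first d barycentric coordinates
    lam_full = np.hstack([lam, 1.0 - lam.sum(axis=1, keepdims=True)])
    # f(x): barycentric coordinates times the values at the simplex vertices.
    return np.einsum('ni,ni->n', lam_full, c[tri.simplices[j]])

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tri = Delaunay(pts)
c = 2.0 * pts[:, 0] + 3.0 * pts[:, 1] + 1.0   # samples of an affine function
x = np.array([[0.3, 0.4], [0.9, 0.2]])
print(cpwl_evaluate(tri, c, x))               # reproduces 2x + 3y + 1 exactly
```

Sampling an affine function at the grid points and interpolating reproduces it exactly, which is a quick sanity check for a barycentric evaluation routine.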
III-A Forward Operator
Given the set of training data $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$, we choose grid points such that all data points are inside $\mathrm{CH}(X)$. Then, the function is evaluated at the data points as
(18) $f(\mathbf{x}_m) = \mathbf{h}(\mathbf{x}_m)^T \mathbf{c} = [\mathbf{H}\mathbf{c}]_m,$
where $\mathbf{H} \in \mathbb{R}^{M \times N}$ and $[\cdot]_m$ denotes the $m$th entry. The matrix $\mathbf{H}$ is referred to as the forward operator and is a mapping between the grid-point values and the values of the function at the data points. If we represent the forward matrix as
(19) $\mathbf{H} = \big[\mathbf{h}(\mathbf{x}_1) \ \cdots \ \mathbf{h}(\mathbf{x}_M)\big]^T,$
then, by (18), each row of $\mathbf{H}$ can be written as $\mathbf{h}(\mathbf{x}_m)^T$. This implies that $\mathbf{H}$ is a sparse matrix with at most $(d+1)$ nonzero entries per row.
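This sparsity pattern suggests a direct sparse assembly of the forward operator. The sketch below (our own; `forward_operator` is a hypothetical helper name) builds the matrix row by row from the barycentric coordinates of each data point:

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import csr_matrix

def forward_operator(tri, x):
    """Sparse H with H[m, n] = beta_n(x_m): at most d+1 nonzeros per row."""
    x = np.atleast_2d(x)
    M, d = x.shape[0], tri.points.shape[1]
    j = tri.find_simplex(x)
    T = tri.transform[j, :d]
    r = x - tri.transform[j, d]
    lam = np.einsum('mij,mj->mi', T, r)
    lam_full = np.hstack([lam, 1.0 - lam.sum(axis=1, keepdims=True)])
    rows = np.repeat(np.arange(M), d + 1)       # each point fills one row
    cols = tri.simplices[j].ravel()             # vertex indices of its simplex
    return csr_matrix((lam_full.ravel(), (rows, cols)),
                      shape=(M, tri.points.shape[0]))

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tri = Delaunay(pts)
x = np.array([[0.3, 0.4], [0.9, 0.2], [0.5, 0.5]])
H = forward_operator(tri, x)
print(H.shape)                                  # (3, 4)
print(np.allclose(np.asarray(H.sum(axis=1)).ravel(), 1.0))  # rows sum to one
```

The row sums equal one because barycentric coordinates form a partition of unity inside each simplex.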
III-B Regularization Operator
To determine the Hessian total variation of the function through (13), we need to calculate three quantities for each pair of neighboring simplices in $\mathcal{T}$: the gradient difference of the affine pieces over those pairs; the unit normal of their intersection; and, finally, their intersection volume.
Setup: Assume that $T_1, T_2 \in \mathcal{T}$ are two neighboring simplices and that the sets of indices of their vertices are $I_1$ and $I_2$, with $|I_1| = |I_2| = d + 1$. We denote the index set of the vertices of their intersection by $I_c = I_1 \cap I_2$. There are exactly $d$ common vertices between two neighboring simplices, so that $|I_c| = d$. We assume that $(\mathbf{X}_1, \mathbf{c}_1)$ and $(\mathbf{X}_2, \mathbf{c}_2)$ are the coordinates and values at the vertices indexed by the members of $I_1$ and $I_2$. Also, we denote by $\mathbf{x}_{c,k}$ and $c_{c,k}$ the coordinate and value of the vertex indexed by the $k$th smallest member of $I_c$ for $k = 1, \ldots, d$. We show an example of this setup in Figure 3 (b).
Theorem 1 (Gradient difference). Let $\nabla u_1$ and $\nabla u_2$ be the gradients of $f$ over $T_1$ and $T_2$, respectively. Their difference can be expressed as the linear combination of grid-point values given by
(20) 
where
(21) 
There, the symbol represents the vector and is a sparse binary matrix such that if and only if .
Proof.
Since $f$ is affine over the simplices $T_1$ and $T_2$, we have that
(22) 
Putting all equations together, we obtain that
(23) 
By analogy, we have that
(24) 
Next, we write the difference of $\nabla u_1$ and $\nabla u_2$ as
(25) 
(26) 
where .
Hence,
(27) 
and, from the definition of , we know that
(28) 
From (28) and (29), we finally obtain (21). ∎
Theorem 2 (Determination of a normal vector). The unit normal $\mathbf{n}_{1,2}$ of the intersection of $T_1$ and $T_2$ is given by
(29) 
where
(30) 
and where the slicing operator returns the first elements of the first column of .
Proof.
The intersection of and is the facet opposite to the vertex with coordinate in the simplex . We show that is perpendicular to that facet and, hence, to the intersection. From the definition of the inverse, we know that
(31) 
Hence, if we write the matrix as
(32) 
for some and , then we have that
(33) 
In particular, this implies that and for , so that
(34) 
for . Equation (35) implies that is perpendicular to the simplex formed by , which is exactly the facet opposite to and is the intersection of and . ∎
Theorem 3 (Cayley–Menger determinant [blumenthal1943distribution]). The $(d-1)$-dimensional volume of the simplex formed by $\mathbf{x}_{c,1}, \ldots, \mathbf{x}_{c,d}$ is given by
(35) $\mathrm{Vol}_{d-1}^2 = \frac{(-1)^{d}}{2^{\,d-1}\big((d-1)!\big)^2}\det(\mathbf{B}),$
where $\mathbf{B} \in \mathbb{R}^{(d+1) \times (d+1)}$ is the bordered squared-distance matrix
(36) $\mathbf{B} = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \mathbf{D} \end{pmatrix}, \qquad [\mathbf{D}]_{k\ell} = \|\mathbf{x}_{c,k} - \mathbf{x}_{c,\ell}\|_2^2.$
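The Cayley–Menger formula translates directly into code. The sketch below (our illustration; `simplex_volume` is a hypothetical helper) computes the volume of an m-simplex from its m+1 vertex coordinates, independently of the ambient dimension:

```python
import math
import numpy as np

def simplex_volume(vertices):
    """m-dimensional volume of a simplex given its m+1 vertices (as rows),
    computed via the Cayley-Menger determinant."""
    V = np.asarray(vertices, dtype=float)
    m = V.shape[0] - 1
    # Bordered matrix of squared pairwise distances, size (m+2) x (m+2).
    D = np.square(np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1))
    B = np.ones((m + 2, m + 2))
    B[0, 0] = 0.0
    B[1:, 1:] = D
    vol_sq = (-1.0) ** (m + 1) / (2.0 ** m * math.factorial(m) ** 2) \
        * np.linalg.det(B)
    return math.sqrt(max(vol_sq, 0.0))  # clamp tiny negative round-off

# Length of a segment (1-simplex) embedded in 3-D, area of a right triangle.
print(simplex_volume([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]))   # 5.0
print(simplex_volume([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])) # 0.5
```

The same routine handles the $(d-1)$-dimensional facets needed here, since it only uses pairwise distances and never the ambient coordinates directly.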
The intersection of $T_1$ and $T_2$ is the simplex formed by their $d$ common vertices. Hence, by using Theorem 3, we can obtain the intersection volume.
We define the matrix as
(37) 
From Theorems 13, we have that
(38) 
From (13), we then calculate the HTV of $f$ as
(39) $\mathrm{HTV}(f) = \|\mathbf{L}\mathbf{c}\|_1,$
where the rows of $\mathbf{L}$ are indexed by the set of all unique pairs of neighboring simplices in $\mathcal{T}$. In (39), $\mathbf{L}$ is a sparse matrix that is referred to as the regularization operator. It has as many rows as there are pairs of neighboring simplices and $N$ columns, and each of its rows corresponds to the HTV term associated with two neighboring simplices. Hence, there are at most $(d + 2)$ nonzero elements in each row of $\mathbf{L}$.
III-C Learning Problem
To learn the regressor for some given data $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$, we propose to solve the optimization problem
(40) $\min_{f \in \mathcal{F}_{\mathrm{CPWL}}} \sum_{m=1}^{M} E\big(y_m, f(\mathbf{x}_m)\big) + \lambda \, \mathrm{HTV}(f),$
where $\mathcal{F}_{\mathrm{CPWL}}$ represents the CPWL function search space. If we restrict the search space to CPWL functions parameterized by a triangulation over a set of grid points $X$, then we have that
(41) $\min_{\mathbf{c} \in \mathbb{R}^{N}} \sum_{m=1}^{M} E\big(y_m, \mathbf{h}(\mathbf{x}_m)^T \mathbf{c}\big) + \lambda \|\mathbf{L}\mathbf{c}\|_1.$
Using the forward and regularization operators introduced in Sections III-A and III-B, and with the quadratic loss $E(y, z) = \frac{1}{2}(y - z)^2$, we rewrite (41) as
(42) $\mathbf{c}^\ast = \arg\min_{\mathbf{c} \in \mathbb{R}^{N}} \frac{1}{2}\|\mathbf{y} - \mathbf{H}\mathbf{c}\|_2^2 + \lambda \|\mathbf{L}\mathbf{c}\|_1.$
Formulation (42) recasts the problem of finding the function $f$ into the discrete problem of finding the grid values $\mathbf{c}$. It is generically referred to as the generalized LASSO in the literature. The problem is convex and its solutions can be found with methods such as the alternating-direction method of multipliers (ADMM) [tibshirani2011solution], [boyd2011distributed]. In the special case when the forward operator is the identity $\mathbf{H} = \mathbf{I}$, the dual of the problem is
(43) $\mathbf{u}^\ast = \arg\min_{\mathbf{u}\,:\,\|\mathbf{u}\|_\infty \leq \lambda} \frac{1}{2}\big\|\mathbf{y} - \mathbf{L}^T\mathbf{u}\big\|_2^2,$
where the relation $\mathbf{c}^\ast = \mathbf{y} - \mathbf{L}^T\mathbf{u}^\ast$ holds. Although the dual problem (43) is high-dimensional, it is proximable. This means that we can use optimization algorithms such as the fast iterative shrinkage-thresholding algorithm (FISTA) to solve it, whereas they are not applicable to the primal problem [beck2009fast]. FISTA has a better convergence rate than ADMM and helps us accelerate our computations. Algorithm 1 describes the iterations of FISTA when solving (43). The initialization
$\mathbf{u}_0 = \mathbf{0}$ is equivalent to the interpolation state ($\lambda = 0$) as it corresponds to $\mathbf{c} = \mathbf{y}$. The step size is chosen to be less than the inverse of the largest eigenvalue of $\mathbf{L}\mathbf{L}^T$, and the componentwise projection onto the feasible set $\{\mathbf{u} : \|\mathbf{u}\|_\infty \leq \lambda\}$ is defined as
(44) $\mathrm{clip}_\lambda(u) = \max\big(\min(u, \lambda), -\lambda\big).$
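The dual iteration can be sketched as follows. This is a simplified stand-in for Algorithm 1, not a verbatim reproduction: a generic FISTA loop with projection onto the infinity-norm ball, demonstrated on a one-dimensional second-difference operator rather than the actual HTV regularization operator:

```python
import numpy as np

def fista_dual(y, L, lam, n_iter=10000):
    """FISTA on the dual problem min_u 0.5*||y - L^T u||^2 s.t. ||u||_inf <= lam.

    The solution of the primal min_c 0.5*||c - y||^2 + lam*||L c||_1
    (the H = I case) is then recovered as c = y - L^T u."""
    step = 1.0 / np.linalg.eigvalsh(L @ L.T).max()  # 1 / largest eigenvalue
    u = np.zeros(L.shape[0])                        # u = 0: interpolation state
    v, t = u.copy(), 1.0
    for _ in range(n_iter):
        grad = -L @ (y - L.T @ v)                    # gradient of the dual cost
        u_new = np.clip(v - step * grad, -lam, lam)  # project onto the inf-ball
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        v = u_new + ((t - 1.0) / t_new) * (u_new - u)
        u, t = u_new, t_new
    return y - L.T @ u

# One-dimensional analogue: with L the second-difference operator, the penalty
# ||L c||_1 is the discrete second-order total variation of the vector c.
n = 6
L = np.zeros((n - 2, n))
for i in range(n - 2):
    L[i, i:i + 3] = [1.0, -2.0, 1.0]

rng = np.random.default_rng(0)
y = rng.standard_normal(n)
c = fista_dual(y, L, lam=1e6)   # huge lam: constraint inactive, c is affine
print(np.abs(L @ c).max())      # close to zero: second differences vanish
```

Two limiting cases are easy to verify: with lam = 0 the feasible set collapses to u = 0 and the output is the interpolant c = y, while for very large lam the output is the least-squares affine fit (all second differences vanish).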

TABLE I: Train MSE, test MSE, number of parameters, HTV, and sparsity of the compared models (LR, DHTV, NN2, NN6, NN36, and RBF), averaged over 30 random splits.
III-D Final Regressor
The solution of (42) gives us the grid values $\mathbf{c}^\ast$ that uniquely define the CPWL function $f$. However, the domain of definition of this function is restricted to the convex hull of the grid points. The most intuitive way to extend $f$ is to use the notion of nearest neighbors and to assign the value of the closest grid point to the points outside the convex hull. However, the mapping generated by using nearest neighbors is piecewise-constant outside the convex hull, which does not yield a global CPWL relation. To overcome this issue, we define our final CPWL function over $\mathbb{R}^d$ as
(45) $\tilde{f}(\mathbf{x}) = f\big(\mathrm{Proj}_{\mathrm{CH}(X)}(\mathbf{x})\big),$
where $\mathrm{Proj}_{\mathrm{CH}(X)}(\mathbf{x})$ is the orthogonal projection of the point $\mathbf{x}$ onto the convex hull of the grid points. The projection onto the convex hull can be formulated as a quadratic optimization problem [gabidullina2018problem]. It can be shown that the solution of this problem corresponds to a CPWL function, so that our final regressor is guaranteed to be globally CPWL [spjotvold2007continuous].
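The projection onto the convex hull can be sketched as a small quadratic program over convex-combination weights. The helper below (our illustration, using SciPy's SLSQP solver; `project_to_hull` is a hypothetical name) is one possible implementation:

```python
import numpy as np
from scipy.optimize import minimize

def project_to_hull(points, x):
    """Orthogonal projection of x onto the convex hull of `points`,
    solved as a QP over convex-combination weights w >= 0, sum(w) = 1."""
    P = np.asarray(points, dtype=float)
    n = len(P)
    objective = lambda w: 0.5 * np.sum((P.T @ w - x) ** 2)
    grad = lambda w: P @ (P.T @ w - x)
    res = minimize(objective, np.full(n, 1.0 / n), jac=grad, method='SLSQP',
                   bounds=[(0.0, 1.0)] * n,
                   constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1.0})
    return P.T @ res.x

square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(project_to_hull(square, np.array([2.0, 0.5])))   # about (1.0, 0.5)
print(project_to_hull(square, np.array([0.3, 0.4])))   # interior point: itself
```

Points inside the hull are fixed points of the projection, so the extension agrees with the learned function on its original domain.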
IV Experiments
In this section, we choose the grid points to be the data points. In case of duplicate data points, we keep only one of them and set its target value to the average of the target values of the duplicates. This results in the forward operator being the identity and enables us to solve the learning problem with Algorithm 1. To have comparable ranges in each dimension of the input space, we standardize each feature. For the Delaunay triangulation, we use SciPy, which can safely handle data up to the dimensions considered here [2020SciPyNMeth]. Our code is available in the GitHub repository https://github.com/mehrsapo/DHTV.
IV-A Matterhorn
The purpose of our first experiment is to illustrate the effect of HTV regularization. To that end, we consider the elevation map of the iconic Swiss mountain Matterhorn. We randomly sample 4800 points from the domain of the function (ASTER Global Digital Elevation Model V003: https://lpdaac.usgs.gov/products/astgtmv003/). Then, the mountain is reconstructed from the sampled data using our framework, which we refer to as Delaunay Hessian total variation (DHTV). The result of the DHTV learning is reported in Figure 4 for several values of the regularization hyperparameter $\lambda$. When $\lambda$ increases, we see that the number of affine pieces of the final mapping decreases. Interestingly, the regularizer tends to omit local fluctuations while preserving the main ridges.
IV-B Combined-Cycle Power-Plant Data Set
For the experiment of this section, we use a data set that forms a 4-dimensional mapping. The data contain 9568 samples. They record the net hourly electrical-energy output of a power plant in terms of temperature, ambient pressure, relative humidity, and exhaust vacuum [tufekci2014prediction]. We compare four learning schemes: linear regression (LR), our framework (DHTV), neural networks (NN), and a kernel method with Gaussian radial-basis functions (RBF). For all learning schemes, we perform 30 independent runs. At each run, we split the data randomly into train, validation, and test points. In our framework, at each run we perform a grid search on the values of the regularization hyperparameter $\lambda$ and retain the best value in terms of validation error. For the neural networks, we try three different fully connected architectures: the first is a two-layer neural network (NN2) with 500 units in the hidden layer; the second is a six-layer network (NN6) with 200 units per hidden layer; and the third is a thirty-six-layer network (NN36) with 200 hidden units per layer, whose purpose is to demonstrate overfitting. To train the networks, we use the ADAM optimizer [kingma2014adam], a batch size of 32, and 10000 epochs. We set the learning rate to a small initial value and reduce it by a factor of 10 at the 400th, 600th, and 800th epochs. For the first two architectures (NN2 and NN6), we perform at each run a grid search on the weight-decay hyperparameter. For the last network (NN36), the weight decay is set to zero. We also perform a grid search on the hyperparameters of the RBF method and choose the model with the best validation error. To achieve a fair comparison of the estimations of each model, we construct a random triangulation with 1000 grid points. The coordinates of these grid points follow a standard normal distribution in each dimension. We sample the final mapping of each model on the random grid points and calculate the HTV of the mapping using the sampled function values, the regularization operator of the random triangulation, and Formula (39). In addition, we define a metric of sparsity that corresponds to the percentage of almost-coplanar linear pieces over neighboring pairs of simplices; in practice, we declare a pair almost-coplanar when its contribution to the HTV falls below a small threshold. We report in Table I the metrics averaged over 30 random splittings. The reported HTV is normalized by the mean HTV of the interpolation result or, equivalently, of the mapping generated when $\lambda = 0$ in the DHTV framework. We observe that DHTV performs well in terms of prediction error in comparison with the other methods. Our parameterization results in a mapping with low fitting error and good prediction power while having fewer learnable parameters than the neural networks. An important feature of our method is the control of the complexity of the model by the value of $\lambda$. This is illustrated in Figure 5. The DHTV framework yields sparser mappings while maintaining or even increasing the generalization performance. In addition, as we can see in Table I, the HTV and sparsity metrics are in correspondence with the prediction errors of the various models, which confirms that the HTV is a good metric for the model complexity.

V Conclusion
We have proposed a novel regression method referred to as Delaunay Hessian total variation (DHTV). Our approach provides continuous and piecewise-linear (CPWL) mappings—the same class of functions generated by ReLU networks. We employ a parameterization based on the Delaunay triangulation of the input space. This parameterization represents any CPWL function by its samples on a grid. Unlike with deep networks, it is straightforward to understand the effect of the model parameters on the mapping, which makes the proposed model directly interpretable. We have formulated the learning process as a convex minimization task. The Hessian total-variation (HTV) regularization was used to control the complexity of the generated mapping. The HTV has an intuitive formula for CPWL functions, which involves a summation over neighboring affine pieces. We have shown that the HTV of the proposed model is the $\ell_1$ norm applied to a linear transformation of the grid-point values. This result has enabled us to recast the learning problem as a generalized least-absolute-shrinkage-and-selection-operator (LASSO) problem. By a clever choice of the grid points, we can use the fast iterative shrinkage-thresholding algorithm (FISTA) to solve the optimization problem. Our experiments show that HTV regularization leads to simple models while preserving the generalization power. In future work, we plan to investigate the removal of unnecessary grid points based on their contribution to the HTV, which could be used for mesh-simplification purposes.
VI Acknowledgement
The authors would like to thank Shayan Aziznejad and Joaquim Campos for fruitful discussions.