
Delaunay-Triangulation-Based Learning with Hessian Total-Variation Regularization

08/16/2022
by   Mehrsa Pourya, et al.
EPFL

Regression is one of the core problems tackled in supervised learning. Rectified linear unit (ReLU) neural networks generate continuous and piecewise-linear (CPWL) mappings and are the state-of-the-art approach for solving regression problems. In this paper, we propose an alternative method that leverages the expressivity of CPWL functions. In contrast to deep neural networks, our CPWL parameterization guarantees stability and is interpretable. Our approach relies on the partitioning of the domain of the CPWL function by a Delaunay triangulation. The function values at the vertices of the triangulation are our learnable parameters and identify the CPWL function uniquely. Formulating the learning scheme as a variational problem, we use the Hessian total variation (HTV) as regularizer to favor CPWL functions with few affine pieces. In this way, we control the complexity of our model through a single hyperparameter. By developing a computational framework to compute the HTV of any CPWL function parameterized by a triangulation, we discretize the learning problem as the generalized least absolute shrinkage and selection operator (LASSO). Our experiments validate the usage of our method in low-dimensional scenarios.


I Introduction

Supervised learning entails finding a function $f$ from a set of training data points $\{\mathbf{x}_m\}_{m=1}^{M} \subset \mathbb{R}^d$ and their target values $\{y_m\}_{m=1}^{M} \subset \mathbb{R}$. The function should approximate the target values at the data points and generalize well to new inputs [hastie2009overview]. In a variational framework, the learning problem is formalized as the optimization task

$\min_{f \in \mathcal{F}} \sum_{m=1}^{M} E\bigl(y_m, f(\mathbf{x}_m)\bigr) + \lambda \, \mathrm{R}(f)$,   (1)

where $\mathcal{F}$ is the function search space, $E$ is a convex loss function controlling the data-fitting error, and $\mathrm{R}$ is the regularizer. The hyperparameter $\lambda > 0$ adjusts the contribution of the regularizer. Regularization is used to promote functions with desirable structures such as sparsity and to reduce overfitting [tian2022comprehensive, scholkopf2002learning].

In order to make (1) tractable, the search space $\mathcal{F}$ is usually expressed as a parametric space. For instance, linear regression—the simplest model—reduces the learning problem to the search for $\mathbf{a} \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that $f(\mathbf{x}) = \mathbf{a}^{\mathsf{T}}\mathbf{x} + b$ [gross2003linear]. This model is very well understood, but it severely lacks expressivity. Another approach to the learning problem is founded on the theory of reproducing-kernel Hilbert spaces (RKHS) [kadri2010nonlinear], [hofmann2005tutorial]. If the search space in (1) is the reproducing-kernel Hilbert space $\mathcal{H}$ with kernel $k(\cdot,\cdot)$ and $\mathrm{R}(f) = \|f\|_{\mathcal{H}}^2$, where $\|\cdot\|_{\mathcal{H}}$ is the Hilbert-space norm, then the RKHS representer theorem states that the solution of (1) admits the closed form

$f(\mathbf{x}) = \sum_{m=1}^{M} a_m \, k(\mathbf{x}, \mathbf{x}_m)$   (2)

for some coefficients $a_1, \ldots, a_M \in \mathbb{R}$ [scholkopf2001generalized].

There, the problem is recast into a solvable linear model. Moreover, by using radial-basis functions as kernels, one can approach continuous functions as closely as desired [schaback2007practical], [miller1992review], [micchelli2006universal]. For high-dimensional data, however, kernel methods are outperformed by deep neural networks, which are the state of the art in many applications [abiodun2018state]. A neural network of depth $L$ forms a mapping

$\mathbf{f} = \boldsymbol{\sigma}_L \circ \mathbf{A}_L \circ \cdots \circ \boldsymbol{\sigma}_1 \circ \mathbf{A}_1$,   (3)

where $\mathbf{A}_\ell$ is a learnable affine mapping parameterized by $\boldsymbol{\theta}_\ell$ and $\boldsymbol{\sigma}_\ell$ is the nonlinearity (a.k.a. “activation function”) performed in the $\ell$th layer. Among all possible activation functions, the most common choice, and usually the best in terms of performance, is the rectified linear unit $\mathrm{ReLU}(x) = \max(0, x)$ [schmidt2020nonparametric]. It is known that a ReLU network produces continuous and piecewise-linear (CPWL) functions [montufar2014number]. Conversely, any CPWL function can be described by a ReLU network [arora2016understanding]. The CPWL model is universal, in the sense that it can approximate any continuous mapping [DBLP:conf/iclr/ParkYLS21]. However, a major drawback of letting deep networks learn a CPWL function is their lack of interpretability, in the sense that the effect of each parameter on the generated mapping is not understood explicitly [fan2021interpretability].

There also exists a variational interpretation of shallow ReLU networks [parhi2021kinds, unser2022ridges, ongie2019function]. However, shallow networks cannot represent all CPWL functions, especially in high dimensions. In fact, no variational interpretation is known for deep networks, to the best of our knowledge. Interestingly, there exist other variational problems that admit global CPWL minimizers [debarre2019b], [debarre2021sparsest]. If we investigate (1) in the one-dimensional case ($d = 1$), with the second-order total variation $\mathrm{TV}^{(2)}$ as the regularizer, and if we restrict the search space to functions with a bounded second-order total variation, then the extreme points of the solution set are necessarily of the form

$f(x) = b_0 + b_1 x + \sum_{k=1}^{K} a_k (x - \tau_k)_+$,   (4)

where $a_k \in \mathbb{R}$ are the knot coefficients, $\tau_k \in \mathbb{R}$ are the knot locations, and $K$ is finite [unser2021unifying]. For such solutions, the regularization has the simple closed form $\mathrm{TV}^{(2)}(f) = \|\mathbf{a}\|_1$. As the $\ell_1$-norm promotes sparsity, a regularization by the second-order total variation promotes CPWL functions with few knots. A multidimensional generalization of $\mathrm{TV}^{(2)}$ is the Hessian total-variation (HTV) seminorm [aziznejad2021measuring]. The HTV is equivalent to $\mathrm{TV}^{(2)}$ in the one-dimensional case, and a similar regularizer was introduced before for image-reconstruction purposes [lefkimmiatis2011hessian]. Since the null space of the HTV is composed of affine mappings, this regularization favors solutions that are predominantly affine; typically, CPWL functions with few pieces. The authors in [campos2021learning] use HTV regularization to learn two-dimensional functions. They parameterize the function search space by a span of linear box splines, which are themselves CPWL functions [condat2006three]. Although their parameterization has a good approximation power, it has some drawbacks. First, the number of box-spline basis functions grows exponentially with the dimension. Second, there are many domain regions without corresponding data points. These regions add complexity to the model without benefiting its performance.

In this paper, we investigate an alternative approach that is based on the Delaunay triangulation in an irregular setting. Here are our primary contributions.

  1. Flexible parameterization of CPWL functions. We partition the domain of CPWL functions by the Delaunay triangulation on a set of adaptive grid points; the function values at the grid points are our learnable parameters and identify the CPWL functions uniquely.

  2. Explicit computation of the HTV in any dimension. For the proposed CPWL model, we show that the HTV is the $\ell_1$-norm applied to a linear transformation of the grid-point values.

  3. Experimental validation of the scheme in dimensions at least two, with real data.

Our learning approach uses the expressivity of CPWL functions in a stable and interpretable manner [goujon2022stable]. These properties stem from the nature of the proposed Delaunay-based parameterization. In addition, the use of HTV regularization enables us to control the complexity of the final mapping by a single hyperparameter.

Our paper is organized as follows: In Section II, we introduce the mathematical tools that we need to develop our method. We define the Delaunay triangulation, CPWL functions, and the Hessian total variation. In Section III, we describe our CPWL parameterization, explain the procedure to calculate its HTV, and derive the generalized least-absolute-shrinkage-and-selection-operator (LASSO) formulation of our learning problem. Finally, we present our experimental results in Section IV.

II Preliminaries

II-A Delaunay Triangulation

In the $d$-dimensional space $\mathbb{R}^d$, the convex hull of $(d+1)$ affinely independent points forms a polytope known as a $d$-simplex, or simplex for short. These simplices are triangles and tetrahedra in 2- and 3-dimensional spaces. A triangulation of a set of points is the partition of their convex hull into simplices such that any two simplices intersect either at a common face or not at all. Also, the triangulation vertices, referred to as grid points, are exactly the points themselves. We consider two simplices as neighbors if their intersection is a facet, which is a $(d-1)$-simplex. In general, a triangulation of a set of points is not unique. A celebrated triangulation method is the Delaunay triangulation. [Delaunay triangulation] For a set of points $X$ in $\mathbb{R}^d$, a Delaunay triangulation is a triangulation such that no point in $X$ is inside the circum-hypersphere of any simplex of the triangulation.

Simply put, the Delaunay triangulation partitions the convex hull of the points into well-formed simplices. Specifically for $d = 2$, it is known that the Delaunay triangulation maximizes the minimal angle of the triangles of the triangulation and avoids skinny triangles. Similar optimality properties exist in higher dimensions [rajan1994optimality]. In addition, there exist computational methods that produce Delaunay triangulations in any dimension [lee1980two], [cignoni1998dewall].
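To make this concrete, here is a small self-contained sketch (our own illustration, using the scipy.spatial.Delaunay routine that is also employed in Section IV) that triangulates a random two-dimensional point set and enumerates the pairs of neighboring simplices; these pairs are exactly the ones that enter the HTV computation of Section III.

import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.standard_normal((20, 2))      # 20 grid points in R^2

tri = Delaunay(points)                     # Delaunay triangulation of the grid points
print("number of simplices:", tri.simplices.shape[0])
print("vertex indices of simplex 0:", tri.simplices[0])

# tri.neighbors[j, i] is the simplex opposite vertex i of simplex j (-1 on the hull),
# so scanning it yields every pair of simplices that share a facet.
pairs = [(j, int(k)) for j, neigh in enumerate(tri.neighbors) for k in neigh if k > j]
print("number of neighboring simplex pairs:", len(pairs))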

II-B Continuous and Piecewise-Linear Functions

[CPWL function] A function $f: \mathbb{R}^d \to \mathbb{R}$ is continuous and piecewise-linear if

  • it is continuous;

  • its domain $\mathbb{R}^d$ can be partitioned into a set of non-overlapping polytopes $\{P_k\}_{k=1}^{K}$, with $K$ finite, over which it is affine.

The gradient of the function over each polytope or, equivalently, each linear region $P_k$, is denoted by $\nabla f_k \in \mathbb{R}^d$. We denote the intersection facet of two neighboring polytopes $P_j$ and $P_k$ as $U_{j,k} = P_j \cap P_k$. The $(d-1)$-dimensional volume of this intersection is denoted by $\mathrm{Vol}_{d-1}(U_{j,k})$. For $d = 2$ and $d = 3$, this volume corresponds to a length and an area, respectively. Finally, we define $\mathbf{u}_{j,k}$ as the unit vector that is normal to $U_{j,k}$. We follow this notation throughout the paper.

Alternatively, any CPWL function can be defined by a triangulation and the values of the function at its vertices, which we refer to as the simplicial parameterization of CPWL functions. The authors in [liu2020delaunay] use a similar parameterization. This parameterization yields a Riesz basis, which guarantees a unique and stable link between the model parameters and the CPWL function [goujon2022stable]. In Figure 1, we show an example of the domain of an arbitrary CPWL function and a possible triangulation of it.

Fig. 1: Domain of a two-dimensional CPWL function. Each patch corresponds to one linear region (left) and its partition into simplices through a Delaunay triangulation (right).

II-C Hessian Total Variation

II-C1 Generalized Hessian Matrix

The Hessian matrix of a twice-differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is defined as

$\mathbf{H}_f(\mathbf{x}) = \left[\dfrac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}\right]_{1 \le i, j \le d}$.   (5)

One can extend the Hessian definition to generalized functions (distributions) by using the notion of weak partial derivatives. This enables us to define the Hessian of CPWL functions even though they are not twice-differentiable everywhere. The directional derivative of $f$ at a point $\mathbf{x}$ and along the unit vector $\mathbf{u}$ is defined as

$D_{\mathbf{u}} f(\mathbf{x}) = \lim_{h \to 0} \dfrac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h}$.   (6)

Likewise, the second-order directional derivative along the direction $\mathbf{u}$ is $D^2_{\mathbf{u}} f = D_{\mathbf{u}}(D_{\mathbf{u}} f)$. Using the Hessian matrix, we write that $D^2_{\mathbf{u}} f(\mathbf{x}) = \mathbf{u}^{\mathsf{T}} \mathbf{H}_f(\mathbf{x}) \mathbf{u}$. A symmetric Hessian matrix has a complete set of eigenvalues and eigenvectors which form an orthogonal basis for $\mathbb{R}^d$. Consequently, the second-order directional derivative along the $i$th eigenvector $\mathbf{e}_i$ of the Hessian is its associated eigenvalue $\lambda_i$, with $D^2_{\mathbf{e}_i} f = \lambda_i$ for $i = 1, \ldots, d$. If we use the eigenvectors of the Hessian to represent the direction $\mathbf{u} = \sum_{i=1}^{d} u_i \mathbf{e}_i$, then we have that $D^2_{\mathbf{u}} f = \sum_{i=1}^{d} u_i^2 \lambda_i$. This means that, at each point of the domain, the second-order directional derivatives along the eigenvectors of the Hessian fully characterize the second-order derivatives of $f$ along any direction.
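As a quick numerical sanity check of these identities (a toy example of ours, not part of the paper), one can take a quadratic function, whose Hessian is a constant symmetric matrix, and compare finite-difference second directional derivatives with $\mathbf{u}^{\mathsf{T}} \mathbf{H}_f \mathbf{u}$ and with the eigenvalues:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # constant symmetric Hessian of f
f = lambda x: 0.5 * x @ A @ x              # f(x) = 0.5 x^T A x

def second_dir_derivative(f, x, u, h=1e-4):
    # central finite difference along the unit direction u
    return (f(x + h * u) - 2.0 * f(x) + f(x - h * u)) / h**2

x = np.array([0.7, -1.3])
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
print(second_dir_derivative(f, x, u), u @ A @ u)     # both are approximately 3.5

eigvals, eigvecs = np.linalg.eigh(A)
for lam, e in zip(eigvals, eigvecs.T):
    print(second_dir_derivative(f, x, e), lam)       # matches each eigenvalue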

II-C2 Schatten p-Norm

The Schatten $p$-norm of a matrix $\mathbf{M} \in \mathbb{R}^{d \times d}$ for $p \ge 1$ is defined as the $\ell_p$-norm of its singular values and is given by

$\|\mathbf{M}\|_{S_p} = \left(\sum_{i=1}^{d} \sigma_i^p(\mathbf{M})\right)^{1/p}$,   (7)

where $\sigma_i(\mathbf{M})$ are the singular values of $\mathbf{M}$. In this paper, we focus on the Schatten-1 norm $\|\cdot\|_{S_1}$ (the nuclear norm) and its dual, the Schatten-$\infty$ norm $\|\cdot\|_{S_\infty}$ (the spectral norm).
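The following helper (a small illustration of (7), not code from the authors' repository) evaluates the Schatten p-norm directly from the singular values; p = 1 gives the nuclear norm used by the HTV and p = ∞ gives its dual, the spectral norm.

import numpy as np

def schatten_norm(M, p):
    s = np.linalg.svd(M, compute_uv=False)       # singular values of M
    if np.isinf(p):
        return s.max()
    return (s**p).sum() ** (1.0 / p)

M = np.array([[2.0, 0.0],
              [1.0, -1.0]])
print(schatten_norm(M, 1))          # nuclear norm: sum of the singular values
print(schatten_norm(M, np.inf))     # spectral norm: largest singular value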

II-C3 Hessian Total Variation

If $f$ is twice differentiable with a symmetric Hessian at $\mathbf{x}$, then $\|\mathbf{H}_f(\mathbf{x})\|_{S_1}$ is given by the $\ell_1$-norm of the second-order directional derivatives along the eigenvectors of the Hessian matrix. This measure provides a local characterization of the second-order variation of $f$. Hence, the total variation of the mapping $\mathbf{x} \mapsto \|\mathbf{H}_f(\mathbf{x})\|_{S_1}$ is a reasonable generalization of the second-order total variation to multidimensional functions. This is referred to as the Hessian total variation (HTV). If $\|\mathbf{H}_f\|_{S_1}$ is integrable, then the HTV is simply defined as

$\mathrm{HTV}(f) = \displaystyle\int_{\mathbb{R}^d} \|\mathbf{H}_f(\mathbf{x})\|_{S_1} \, \mathrm{d}\mathbf{x}$.   (8)

In particular, by using the HTV as a regularizer, we promote configurations where most of the singular values of the Hessian are zero throughout the function domain, due to the sparsity effects of the $\ell_1$- and $S_1$-norms. In that manner, CPWL functions are of special interest as their second-order directional derivatives vanish almost everywhere. Specifically, the gradient of a CPWL function admits the explicit form

$\nabla f(\mathbf{x}) = \sum_{k=1}^{K} \nabla f_k \, \mathbb{1}_{P_k}(\mathbf{x})$,

where the indicator function $\mathbb{1}_{P_k}$ is equal to one inside the linear region $P_k$ and zero elsewhere. Hence, the second-order directional derivatives vanish everywhere except on the boundaries of the domain polytopes. There, the generalized Hessian matrix of the CPWL function exhibits delta-like distributions. This means that (8) does not hold for CPWL functions, as $\|\mathbf{H}_f\|_{S_1}$ is not integrable. Yet, we can extend the definition of the Hessian total variation to accommodate CPWL functions by invoking duality. Specifically, we define

$\mathrm{HTV}(f) = \|\mathbf{H}_f\|_{\mathcal{M}, S_1}$,   (9)

where the mixed norm for any matrix-valued distribution $\mathbf{F}$ is

$\|\mathbf{F}\|_{\mathcal{M}, S_1} = \sup\left\{\langle \mathbf{F}, \mathbf{G} \rangle : \mathbf{G} \in \mathcal{S}(\mathbb{R}^d; \mathbb{R}^{d \times d}),\ \sup_{\mathbf{x} \in \mathbb{R}^d} \|\mathbf{G}(\mathbf{x})\|_{S_\infty} \le 1\right\}$.   (10)

In (10), $\mathcal{S}(\mathbb{R}^d; \mathbb{R}^{d \times d})$ is the matrix-valued Schwartz space and the duality product is defined as

$\langle \mathbf{F}, \mathbf{G} \rangle = \sum_{i=1}^{d}\sum_{j=1}^{d} \langle [\mathbf{F}]_{i,j}, [\mathbf{G}]_{i,j} \rangle$.   (11)

This formulation enables us to derive a closed form for the HTV of a CPWL function as

$\mathrm{HTV}(f) = \sum_{(j,k) \in \mathcal{N}} \bigl\| (\nabla f_j - \nabla f_k)\, \mathbf{u}_{j,k}^{\mathsf{T}} \bigr\|_{S_1} \, \mathrm{Vol}_{d-1}(U_{j,k})$   (12)
$\phantom{\mathrm{HTV}(f)} = \sum_{(j,k) \in \mathcal{N}} \|\nabla f_j - \nabla f_k\|_2 \, \mathrm{Vol}_{d-1}(U_{j,k})$,   (13)

where the set $\mathcal{N}$ contains all unique pairs of neighboring polytope indices (see [aziznejad2021measuring] for more details). Equation (13) tells us that, for CPWL functions, the HTV penalizes the change of slope between neighboring domain polytopes, weighted by the volume of their intersection. For a purely affine mapping, the HTV is zero. Otherwise, it increases with the number of affine pieces.
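For $d = 2$, Equation (13) can be evaluated directly from a triangulation and the vertex values: the gradient of each affine piece is obtained by solving a small linear system, and each neighboring pair contributes the norm of the gradient jump times the length of the shared edge. The sketch below is our own illustration of this formula (it assumes SciPy and is not the authors' implementation).

import numpy as np
from scipy.spatial import Delaunay

def cpwl_htv_2d(points, values, tri):
    # gradient of the affine piece on each triangle: solve a 2x2 linear system
    grads = []
    for simplex in tri.simplices:
        v = points[simplex]
        c = values[simplex]
        A = v[1:] - v[0]                      # rows: v1 - v0, v2 - v0
        b = c[1:] - c[0]
        grads.append(np.linalg.solve(A, b))
    grads = np.array(grads)

    htv = 0.0
    for j, neigh in enumerate(tri.neighbors):
        for i, k in enumerate(neigh):
            if k <= j:                        # skip hull facets (-1) and duplicates
                continue
            shared = np.delete(tri.simplices[j], i)   # the two shared vertex indices
            edge_len = np.linalg.norm(points[shared[0]] - points[shared[1]])
            htv += np.linalg.norm(grads[j] - grads[k]) * edge_len
    return htv

rng = np.random.default_rng(1)
pts = rng.uniform(size=(30, 2))
tri = Delaunay(pts)
vals_affine = pts @ np.array([1.0, -2.0]) + 0.5     # affine data: HTV should be ~0
print(cpwl_htv_2d(pts, vals_affine, tri))
print(cpwl_htv_2d(pts, rng.uniform(size=30), tri))  # non-affine data: positive HTV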

The HTV admits the following properties:

$\mathrm{HTV}\bigl(f(\cdot - \mathbf{x}_0)\bigr) = \mathrm{HTV}(f), \quad \mathrm{HTV}\bigl(f(\mathbf{R}\,\cdot)\bigr) = \mathrm{HTV}(f), \quad \mathrm{HTV}\bigl(f(\alpha\,\cdot)\bigr) = \alpha^{2-d}\,\mathrm{HTV}(f)$,   (14)

for any $\mathbf{x}_0 \in \mathbb{R}^d$, rotation matrix $\mathbf{R}$, and $\alpha > 0$. In other words, the HTV is invariant to translation and rotation while it is covariant with scaling.

III Methods

Let us perform a Delaunay triangulation on the set $X = \{\mathbf{v}_n\}_{n=1}^{N}$ of grid points. We denote the resulting simplices by $\{\Delta_j\}_{j=1}^{J}$. We define the CPWL functions inside the convex hull $\mathrm{CH}(X)$ by designating the members of $\{\Delta_j\}_{j=1}^{J}$ as their linear regions. Then, the functions are parameterized as

$f(\mathbf{x}) = \sum_{n=1}^{N} c_n \varphi_n(\mathbf{x})$,   (15)

with expansion coefficients $c_n \in \mathbb{R}$ and basis functions $\varphi_n$. The basis function $\varphi_n$ is the hat function attached to the $n$th vertex. It is given by

$\varphi_n(\mathbf{x}) = \begin{cases} b_n^{(j)}(\mathbf{x}), & \mathbf{x} \in \Delta_j^{(n)} \\ 0, & \text{otherwise}, \end{cases}$   (16)

where $\Delta_j^{(n)}$ is the $j$th simplex that contains the vertex $\mathbf{v}_n$ and $b_n^{(j)}(\mathbf{x})$ is the barycentric coordinate of the point $\mathbf{x}$ inside the simplex $\Delta_j^{(n)}$ with respect to the vertex $\mathbf{v}_n$. The basis functions are continuous and form a Riesz basis for the space they span. From the definition of barycentric coordinates, we know that $\varphi_n(\mathbf{v}_k) = \delta[n-k]$ and that $0 \le \varphi_n(\mathbf{x}) \le 1$. We illustrate an example of this basis function in Figure 2 for $d = 2$. Consequently, the coefficients $c_n$ in (15) are given by $c_n = f(\mathbf{v}_n)$. The item of relevance is that the function is uniquely identified by the vector $\mathbf{c} = (c_1, \ldots, c_N)$. An example of such a function is illustrated in Figure 3 (a). In our work, the grid points are immutable and $\mathbf{c}$ is our vector of learnable parameters. These parameters are the sampled values of the function at the vertices of the triangulation. Hence, their effect on the generated mapping is known, which makes our model directly interpretable. We can also write the function as

$f(\mathbf{x}) = \boldsymbol{\varphi}(\mathbf{x})^{\mathsf{T}} \mathbf{c}$,   (17)

where $\boldsymbol{\varphi}(\mathbf{x}) = \bigl(\varphi_1(\mathbf{x}), \ldots, \varphi_N(\mathbf{x})\bigr)$. The importance of this representation is that it expresses the function value as a linear combination of the grid-point values $\mathbf{c}$. In each simplex, the only nonzero basis functions are the ones defined over the vertices of that simplex. This implies that there are at most $(d+1)$ nonzero values in the vector $\boldsymbol{\varphi}(\mathbf{x})$.
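The following sketch (an illustration of (15)–(17) under the above assumptions, written with SciPy helpers rather than the authors' code) evaluates the CPWL model at a query point by locating its simplex and forming the barycentric combination of the corresponding grid-point values.

import numpy as np
from scipy.spatial import Delaunay

def cpwl_evaluate(tri, values, x):
    s = int(tri.find_simplex(x))              # index of the simplex containing x
    if s == -1:
        raise ValueError("x lies outside the convex hull of the grid points")
    d = tri.points.shape[1]
    T = tri.transform[s]                      # affine map to barycentric coordinates
    b = T[:d].dot(x - T[d])
    bary = np.append(b, 1.0 - b.sum())        # at most d+1 nonzero basis values
    return bary.dot(values[tri.simplices[s]])

rng = np.random.default_rng(2)
grid = rng.uniform(size=(40, 2))
c = np.sin(grid[:, 0]) + grid[:, 1]           # values attached to the grid points
tri = Delaunay(grid)
print(cpwl_evaluate(tri, c, np.array([0.5, 0.5])))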

Fig. 2: The $n$th vertex of the triangulation, with coordinate $\mathbf{v}_n$, is included in 4 domain simplices, over which the associated simplicial basis (hat) function $\varphi_n$ is illustrated.
(a)
(b)
Fig. 3: (a) Example of a CPWL function defined through a triangulation. Over each simplex , one can identify the function by the plane that passes through the points . The vector corresponds to the normal vector of that plane. (b) Top view of the domain simplices and .

III-A Forward Operator

Given the set $\{\mathbf{x}_m\}_{m=1}^{M}$ of training data, we choose the grid points such that all data points are inside $\mathrm{CH}(X)$. Then, the function is evaluated at the data points as

$\mathbf{f} = \bigl(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_M)\bigr) = \mathbf{H}\mathbf{c}$,   (18)

where $\mathbf{H} \in \mathbb{R}^{M \times N}$ and $\mathbf{c} = (c_1, \ldots, c_N)$. The matrix $\mathbf{H}$ is referred to as the forward operator and is a mapping between the grid-point values and the values of the function at the data points. If we represent the forward matrix as

$\mathbf{H} = \begin{bmatrix} \mathbf{h}_1^{\mathsf{T}} \\ \vdots \\ \mathbf{h}_M^{\mathsf{T}} \end{bmatrix}$,   (19)

then, by (17), each row of $\mathbf{H}$ can be written as $\mathbf{h}_m = \boldsymbol{\varphi}(\mathbf{x}_m)$. This implies that $\mathbf{H}$ is a sparse matrix with at most $(d+1)$ nonzero entries per row.
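A possible assembly of the sparse forward operator (a sketch consistent with (18)–(19); the function name and structure are ours) places, in row m, the barycentric weights of the data point x_m with respect to the vertices of its enclosing simplex:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import Delaunay

def forward_operator(tri, X):
    M, d = X.shape
    N = tri.points.shape[0]
    rows, cols, data = [], [], []
    simplex_ids = tri.find_simplex(X)
    for m, s in enumerate(simplex_ids):
        if s == -1:
            raise ValueError("all data points must lie inside the convex hull")
        T = tri.transform[s]
        b = T[:d].dot(X[m] - T[d])
        bary = np.append(b, 1.0 - b.sum())    # barycentric weights of x_m
        rows.extend([m] * (d + 1))
        cols.extend(tri.simplices[s])
        data.extend(bary)
    return csr_matrix((data, (rows, cols)), shape=(M, N))

rng = np.random.default_rng(3)
grid = rng.uniform(size=(50, 2))
tri = Delaunay(grid)
X = 0.25 + 0.5 * rng.uniform(size=(200, 2))   # data points well inside the hull
H = forward_operator(tri, X)
print(H.shape, H.nnz)                         # at most 3 nonzeros per row for d = 2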

III-B Regularization Operator

To determine the Hessian total variation of the function through (13), we need to calculate three quantities for each pair of neighboring simplices in the triangulation: the gradient difference of the affine pieces over those pairs, the unit normal of their intersection, and, finally, their intersection volume.

Setup: Assume that $\Delta_j$ and $\Delta_k$ are two neighboring simplices and that the sets of indices of their vertices are $S_j$ and $S_k$, with $|S_j| = |S_k| = d + 1$. We denote the set of indices of the vertices of their intersection by $S_{j,k} = S_j \cap S_k$. There are exactly $d$ common vertices between two neighboring simplices, so that $|S_{j,k}| = d$. We assume that $\mathbf{X}_j$ and $\mathbf{c}_j$ (respectively, $\mathbf{X}_k$ and $\mathbf{c}_k$) are the coordinates and values at the vertices indexed by the members of $S_j$ (respectively, $S_k$). Also, we denote by $\mathbf{r}_i$ and $z_i$ the coordinate and value of the vertex indexed by the $i$-th smallest member of $S_{j,k}$, for $i = 1, \ldots, d$. We show an example of this setup in Figure 3 (b).

[Gradient difference] Let $\nabla f_j$ and $\nabla f_k$ be the gradients of $f$ over the neighboring simplices $\Delta_j$ and $\Delta_k$. Their difference can be expressed as a linear combination of the grid-point values, given by

(20)

where

(21)

There, the symbol represents the vector and is a sparse binary matrix such that if and only if .

Proof.

Since is affine over the simplices and , we have that

(22)

Putting all equations together, we obtain that

(23)

By analogy, we have that

(24)

Next, we write the difference of and as

(25)
(26)

where .

Fig. 4: Effect of the regularization hyperparameter $\lambda$ on the learned function for the elevation map of the Matterhorn with our framework. Each column includes the results for the specified $\lambda$. The images in the first and second rows give the top and side views of the learned mappings.

Hence,

(27)

and, from the definition of , we know that

(28)

From (27) and (28), we finally obtain (20). ∎

[Determination of a normal vector] The unit normal of the intersection of $\Delta_j$ and $\Delta_k$ is given by

(29)

where

(30)

and where the slicing operator returns the first elements of the first column of .

Proof.

The intersection of and is the facet opposite to the vertex with coordinate in the simplex . We show that is perpendicular to that facet and, hence, to the intersection. From the definition of the inverse, we know that

(31)

Hence, if we write the matrix as

(32)

for some and , then we have that

(33)

In particular, this implies that and for , so that

(34)

for . Equation (34) implies that the vector is perpendicular to the simplex formed by $\mathbf{r}_1, \ldots, \mathbf{r}_d$, which is exactly the facet opposite to the non-shared vertex of $\Delta_j$ and is thus the intersection of $\Delta_j$ and $\Delta_k$. ∎

[Cayley–Menger determinant [blumenthal1943distribution]] The $(d-1)$-dimensional volume of the simplex formed by $\mathbf{r}_1, \ldots, \mathbf{r}_d$ is given by

(35)

where and

(36)

The intersection of and is the simplex formed by . Hence, by using Theorem 3, we can obtain the intersection volume .
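For illustration, the Cayley–Menger formula can be evaluated numerically as below (our own helper; it computes the volume of an m-simplex from pairwise distances only, which is exactly what is needed for the (d−1)-dimensional facet volumes):

import numpy as np
from math import factorial

def simplex_volume(vertices):
    # vertices: (m+1, n) array, an m-simplex embedded in R^n (here m = d-1, n = d)
    m = vertices.shape[0] - 1
    D = np.square(np.linalg.norm(vertices[:, None] - vertices[None, :], axis=-1))
    B = np.ones((m + 2, m + 2))
    B[0, 0] = 0.0
    B[1:, 1:] = D                                   # bordered squared-distance matrix
    vol2 = (-1) ** (m + 1) / (2**m * factorial(m) ** 2) * np.linalg.det(B)
    return np.sqrt(max(vol2, 0.0))

# length of a segment in R^2 (1-simplex) and area of a triangle in R^3 (2-simplex)
print(simplex_volume(np.array([[0.0, 0.0], [3.0, 4.0]])))          # 5.0
print(simplex_volume(np.array([[0.0, 0.0, 0.0],
                               [1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0]])))                 # 0.5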

We define the matrix as

(37)

From Theorems 1-3, we have that

(38)

From (13), we then calculate the HTV of $f$ as

$\mathrm{HTV}(f) = \|\mathbf{L}\mathbf{c}\|_1$,   (39)

where $\mathbf{L}$ is a sparse matrix that is referred to as the regularization operator and $\mathcal{N}$ is the set of all unique pairs of neighboring simplices in the triangulation. The matrix $\mathbf{L}$ is of size $|\mathcal{N}| \times N$, and each of its rows corresponds to the HTV term associated with one pair of neighboring simplices. Hence, there are at most $(d+2)$ nonzero elements in each row of $\mathbf{L}$.

III-C Learning Problem

To learn the regressor $f$ for some given data $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$, we propose to solve the optimization problem

$\min_{f \in \mathcal{F}_{\mathrm{CPWL}}} \sum_{m=1}^{M} E\bigl(y_m, f(\mathbf{x}_m)\bigr) + \lambda \, \mathrm{HTV}(f)$,   (40)

where $\mathcal{F}_{\mathrm{CPWL}}$ represents the CPWL function search space. If we restrict the search space to CPWL functions parameterized by a triangulation over a set of grid points $X$, then we have that

(41)

Using the forward and regularization operators introduced in Sections III-A and III-B and with the quadratic loss $E(y, z) = \tfrac{1}{2}(y - z)^2$, we rewrite (41) as

$\mathbf{c}^{\ast} = \arg\min_{\mathbf{c} \in \mathbb{R}^N} \tfrac{1}{2}\|\mathbf{y} - \mathbf{H}\mathbf{c}\|_2^2 + \lambda \|\mathbf{L}\mathbf{c}\|_1$,   (42)

where $\mathbf{y} = (y_1, \ldots, y_M)$.

Formulation (42) recasts the problem of finding $f$ into the discrete problem of finding the grid values $\mathbf{c}$. It is generically referred to as the generalized LASSO in the literature. The problem is convex and its solutions can be found using methods such as the alternating-direction method of multipliers (ADMM) [tibshirani2011solution], [boyd2011distributed]. In the special case where the forward operator is the identity $\mathbf{H} = \mathbf{I}$, the dual of the problem is

$\mathbf{u}^{\ast} = \arg\min_{\mathbf{u}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{L}^{\mathsf{T}}\mathbf{u}\|_2^2 \ \text{ subject to } \ \|\mathbf{u}\|_{\infty} \le \lambda$,   (43)

where the relation $\mathbf{c}^{\ast} = \mathbf{y} - \mathbf{L}^{\mathsf{T}}\mathbf{u}^{\ast}$ holds. Although the dual problem (43) is high-dimensional, it is proximable. This means that we can use optimization algorithms such as the fast iterative shrinkage-thresholding algorithm (FISTA) to solve it, whereas they are not applicable to the primal problem [beck2009fast]. FISTA has a better convergence rate than ADMM and helps us accelerate our computations. Algorithm 1 describes the iterations of FISTA when solving (43). The initialization $\mathbf{u}_0 = \mathbf{0}$ is equivalent to the interpolation state ($\lambda = 0$), as it corresponds to $\mathbf{c} = \mathbf{y}$. The step size is chosen smaller than the inverse of the largest eigenvalue of $\mathbf{L}\mathbf{L}^{\mathsf{T}}$, and the function used in the update step is defined as

(44)

       | Train MSE | Test MSE | Number of parameters | HTV | Sparsity
LR     |           |          |                      |     |
DHTV   |           |          |                      |     |
NN2    |           |          |                      |     |
NN6    |           |          |                      |     |
NN36   |           |          |                      |     |
RBF    |           |          | 6697                 |     |
TABLE I: Average training and testing mean-square error, normalized HTV, number of parameters, and sparsity metric for several learning approaches applied to the power-plant data set.
Algorithm 1: FISTA iterations to solve (43).
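Since the exact update rules of Algorithm 1 are not reproduced above, the following is a hedged sketch of a FISTA solver for the dual problem (43) with H = I: an accelerated projected-gradient iteration whose projection is a simple clipping onto the ∞-ball, with the primal grid values recovered as c = y − Lᵀu. The one-dimensional finite-difference operator used for the demonstration is our own stand-in for the regularization operator L.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def fista_dual(L, y, lam, n_iter=500):
    step = 1.0 / svds(L, k=1, return_singular_vectors=False)[0] ** 2  # 1 / ||L||^2
    u = np.zeros(L.shape[0])
    v = u.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = L @ (L.T @ v - y)                      # gradient of the dual objective
        u_next = np.clip(v - step * grad, -lam, lam)  # projection onto the inf-ball
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        v = u_next + (t - 1.0) / t_next * (u_next - u)
        u, t = u_next, t_next
    return y - L.T @ u                                # primal grid values c

# toy 1-D analogue: L = second-order finite differences, so ||Lc||_1 mimics TV^(2)
n = 100
y = np.sin(np.linspace(0, 3 * np.pi, n)) + 0.1 * np.random.default_rng(4).standard_normal(n)
e = np.ones(n)
L = sp.diags([e[:-2], -2 * e[:-2], e[:-2]], offsets=[0, 1, 2], shape=(n - 2, n))
c = fista_dual(sp.csr_matrix(L), y, lam=0.5)
print(float(np.abs(L @ c).sum()))                     # value of the sparsity-promoting penalty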

III-D Final Regressor

The solution of (42) gives us the grid values $\mathbf{c}^{\ast}$ that uniquely define the CPWL function $f$. However, the domain of definition of this function is restricted to the convex hull of the grid points. The most intuitive way to extend $f$ is to use the notion of nearest neighbors and assign the value of the closest grid point to the points outside the convex hull. However, the mapping generated by using nearest neighbors is piecewise-constant outside the convex hull, which does not yield a global CPWL relation. To overcome this issue, we define our final CPWL function over $\mathbb{R}^d$ as

$\bar{f}(\mathbf{x}) = f\bigl(\mathrm{P}_{\mathrm{CH}(X)}(\mathbf{x})\bigr)$,   (45)

where $\mathrm{P}_{\mathrm{CH}(X)}(\mathbf{x})$ is the orthogonal projection of the point $\mathbf{x}$ onto the convex hull of the grid points. The projection onto the convex hull can be formulated as a quadratic optimization problem [gabidullina2018problem]. It can be shown that the solution map of this problem is itself CPWL, so that our final regressor is guaranteed to be globally CPWL [spjotvold2007continuous].
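As an example of the projection step (a generic quadratic-programming sketch of ours; the authors' exact solver is not specified in this excerpt), one can write the projection onto the convex hull of the grid points as a constrained least-squares problem over convex-combination weights:

import numpy as np
from scipy.optimize import minimize

def project_to_hull(V, x):
    # V: (N, d) grid points, x: (d,) query point; returns the closest hull point
    N = V.shape[0]
    objective = lambda w: np.sum((V.T @ w - x) ** 2)
    constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * N
    w0 = np.full(N, 1.0 / N)
    res = minimize(objective, w0, bounds=bounds, constraints=constraints,
                   method="SLSQP")
    return V.T @ res.x

V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # triangle as the convex hull
print(project_to_hull(V, np.array([1.0, 1.0])))      # projects onto the hypotenuse, ~[0.5, 0.5]

The extended regressor of (45) then simply evaluates the learned CPWL function at the projected point.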

IV Experiments

In this section, we choose the grid points to be the data points. In case of duplicate data points, we keep only one of them and set its target value to the average of the target values of the duplicates. This results in the forward operator being the identity and enables us to solve the learning problem with Algorithm 1. To have comparable ranges in each dimension of the input space, we standardize each feature. For the Delaunay triangulation, we use SciPy, which can safely handle data in moderate dimensions [2020SciPy-NMeth]. Our code is available on GitHub at https://github.com/mehrsapo/DHTV.
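The preprocessing described above can be sketched as follows (assumed array shapes and helper names are ours): duplicates are merged with averaged targets, features are standardized, and the triangulation is built on the resulting grid points.

import numpy as np
from scipy.spatial import Delaunay

def preprocess(X, y):
    # merge duplicate data points and average their target values
    X_unique, inverse = np.unique(X, axis=0, return_inverse=True)
    inverse = np.asarray(inverse).ravel()
    y_unique = np.bincount(inverse, weights=y) / np.bincount(inverse)
    # standardize every feature to zero mean and unit variance
    X_std = (X_unique - X_unique.mean(axis=0)) / X_unique.std(axis=0)
    return X_std, y_unique

rng = np.random.default_rng(5)
X = rng.uniform(size=(500, 2)).round(1)        # rounding creates duplicates
y = X[:, 0] - X[:, 1] + 0.05 * rng.standard_normal(500)
X_grid, y_grid = preprocess(X, y)
tri = Delaunay(X_grid)                         # grid points are the (unique) data points
print(X_grid.shape, tri.simplices.shape)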

IV-A Matterhorn

The purpose of our first experiment is to illustrate the effect of the regularization. To that end, we consider the elevation map of the iconic Swiss mountain Matterhorn. We randomly sample 4800 points from the domain of the function (ASTER Global Digital Elevation Model V003: https://lpdaac.usgs.gov/products/astgtmv003/). Then, the mountain is reconstructed from the sampled data using our framework, which we refer to as Delaunay Hessian total variation (DHTV). The result of the DHTV learning is reported in Figure 4 for several values of the regularization hyperparameter $\lambda$. When $\lambda$ increases, we see that the number of affine pieces of the final mapping decreases. Interestingly, the regularizer tends to omit local fluctuations while preserving the main ridges.

Fig. 5: Effect of the regularization hyperparameter $\lambda$ on the validation mean-square error and the sparsity metric throughout the DHTV learning process for the 4-dimensional data. The highlighted area corresponds to the confidence interval.

IV-B Combined-Cycle Power-Plant Data Set

For the experiment of this section, we use a data set that forms a 4-dimensional mapping $f: \mathbb{R}^4 \to \mathbb{R}$. The data contain 9568 samples. They record the net hourly electrical-energy output of a power plant in terms of temperature, ambient pressure, relative humidity, and exhaust vacuum [tufekci2014prediction]. We compare four learning schemes: linear regression (LR), our framework (DHTV), neural networks (NN), and a kernel method with Gaussian radial-basis functions (RBF). For all learning schemes, we perform 30 independent runs. At each run, we split the data randomly into train, validation, and test points. In our framework, at each run we perform a grid search on the values of the regularization hyperparameter $\lambda$ and retain the best value in terms of validation error. For the neural networks, we try three different fully connected architectures: the first is a two-layer neural network (NN2) with 500 units in the hidden layer; the second is a six-layer network (NN6) with 200 units per hidden layer; and the third is a thirty-six-layer network (NN36) with 200 hidden units per layer, whose purpose is to demonstrate overfitting. To train the networks, we use the ADAM optimizer [kingma2014adam], a batch size of 32, and 10000 epochs. We set the initial learning rate and reduce it by a factor of 10 at the 400th, 600th, and 800th epochs. For the first two architectures (NN2 and NN6), we perform at each run a grid search on the weight-decay hyperparameter. For the last network (NN36), the weight decay is set to zero. We also perform a grid search on the hyperparameters of the RBF method and choose the model with the best validation error. To achieve a fair comparison of the estimations of each model, we construct a random triangulation with 1000 grid points. The coordinates of these grid points follow a standard normal distribution in each dimension. We sample the final mapping of each model at the random grid points and calculate the HTV of the mapping using the sampled function values, the regularization operator of the random triangulation, and Formula (39). In addition, we define a metric of sparsity that corresponds to the percentage of almost-coplanar linear pieces over neighboring pairs of simplices; in practice, we use a small threshold to decide near-coplanarity. We report in Table I the metrics averaged over 30 random splittings. The reported HTV is normalized by the mean HTV of the interpolation result or, equivalently, of the mapping generated when $\lambda = 0$ in the DHTV framework. We observe that DHTV performs well in terms of prediction error in comparison with the other methods. Our parameterization results in a mapping with low fitting error and good prediction power while having fewer learnable parameters than the neural networks. An important feature of our method is the control of the complexity of the model by the value of $\lambda$. This is illustrated in Figure 5. The DHTV framework yields sparser mappings while maintaining or even increasing the generalization performance. In addition, as we can see in Table I, the HTV and sparsity metrics are in correspondence with the prediction error of the various models, which confirms that the HTV is a good metric for the model complexity.

V Conclusion

We have proposed a novel regression method referred to as Delaunay Hessian total variation (DHTV). Our approach provides continuous and piecewise-linear (CPWL) mappings—the same class of functions generated by ReLU networks. We employ a parameterization based on the Delaunay triangulation of the input space. This parameterization represents any CPWL function by its samples on a grid. In contrast to deep networks, it is straightforward to understand the effect of the model parameters on the mapping, which makes the proposed model directly interpretable. We have formulated the learning process as a convex minimization task. The Hessian total-variation (HTV) regularization is used to control the complexity of the generated mapping. The HTV has an intuitive formula for CPWL functions which involves a summation over the individual affine pieces. We have shown that the HTV of the proposed model is the $\ell_1$-norm applied to a linear transformation of the grid-point values. This result has enabled us to recast the learning problem as the generalized least absolute shrinkage and selection operator (LASSO). By a clever choice of the grid points, we can use the fast iterative shrinkage-thresholding algorithm to solve the optimization problem. Our experiments show that the HTV regularization leads to simple models while preserving the generalization power. In future works, we plan to investigate the removal of unnecessary grid points based on their contribution to the HTV, which could be used for mesh-simplification purposes.

VI Acknowledgement

The authors would like to thank Shayan Aziznejad and Joaquim Campos for fruitful discussions.

References