# Lasso for hierarchical polynomial models

In a polynomial regression model, the divisibility conditions implicit in polynomial hierarchy give way to a natural construction of constraints for the model parameters. We use this principle to derive versions of strong and weak hierarchy and to extend existing work in the literature, which at the moment is only concerned with models of degree two. We discuss how to estimate parameters in lasso using standard quadratic programming techniques and apply our proposal to both simulated data and examples from the literature. The proposed methodology compares favorably with existing techniques in terms of low validation error and model size.

## Authors

• 1 publication
• 9 publications
07/21/2018

### T-optimal design for multivariate polynomial regression using semidefinite programming

We consider T-optimal experiment design problems for discriminating mult...
05/29/2019

### Topological Techniques in Model Selection

The LASSO is an attractive regularisation method for linear regression t...
05/22/2012

### A lasso for hierarchical interactions

We add a set of convex constraints to the lasso to produce sparse intera...
10/03/2017

### Bayesian Fused Lasso regression for dynamic binary networks

We propose a multinomial logistic regression model for link prediction i...
09/22/2017

### High Degree Sum of Squares Proofs, Bienstock-Zuckerberg hierarchy and Chvatal-Gomory cuts

Chvatal-Gomory (CG) cuts and the Bienstock-Zuckerberg hierarchy capture ...
11/01/2020

### Lasso hyperinterpolation over general regions

This paper develops a fully discrete soft thresholding polynomial approx...
10/18/2021

### Provable Hierarchy-Based Meta-Reinforcement Learning

Hierarchical reinforcement learning (HRL) has seen widespread interest a...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This paper is concerned with polynomial regression models and within this class of models, hierarchical polynomial models. Our primary goal is to develop parameter constraints that enforce hierarchy for such linear models. In this paper we develop constraints for both strong and weak hierarchy, using to our advantage the divisibility conditions of model terms.

A polynomial model is hierarchical when the presence of an interaction term such as implies that both terms and are also in the model, in the sense that the coefficients of the terms involved are not zero. This type of hierarchy is known in the literature as strong hierarchy [4] and implies that hierarchical polynomial models must contain an intercept term. Weak hierarchy is a less restrictive form of hierarchy that has also been studied in the literature. In weak hierarchy, the interaction term would only vanish from the model when both terms and have vanished. In this paper the terms “hierarchy” and “strong hierarchy” are interchangeable, while “weak hierarchy” refers only to this type of hierarchy.

There are several arguments for the relevance of hierarchy in modelling. For instance, often the analysis is performed in linearly translated coordinates. If the model under consideration in the transformed scale is not hierarchical, when translating back to the original scale, model terms that were not present in the transformed scale appear [19]. Another case for hierarchy is that, e.g. were the intercept term be removed from the model, this would force it to pass through the origin. This type of model constraint should not be allowed to happen unless there is a strong reason for it [14].

Practical sparsity, the number of variables measured, is another argument used for hierarchy. Small models in the sense of low value of practical sparsity are achieved through hierarchy. Indeed hierarchy “reuses” variables through interactions and higher order terms and should be preferred to modeling without hierarchy considerations [4].

There are several challenges for using and implementing hierarchy in models. A paramount challenge is to have a simple, consistent and intuitive way to formulate hierarchical models. As part of modeling and data analysis, a challenge is to estimate model parameters. Estimation methodologies should be fast and efficient and the computational burden implied must be kept to a minimum.

Another challenge is model size. Despite having relatively low practical sparsity, a hierarchical model may still have many parameters. For example, if is the number of explanatory variables, a model with linear terms and squarefree interactions of order two has parameters excluding intercept, and if pure quadratic terms are added. Once triple interactions are considered, the size of a full square free model is . In short, model size can increase considerably depending on the number of variables and the degree of terms used. Hence when using hierarchy there is a need to balance between the benefits of relatively big models and keeping the models to a manageable size.

### 1.1 Hierarchy in the literature

Our work is developed for a standard linear polynomial regression model

 Y=Xθ+ϵ,

were is the design-model matrix, assumed to have full rank. It has size with , and the columns of correspond to polynomial terms in explanatory variables, and

is the number of observations. The response vector is

and is the vector of model parameters, while

is a vector of independent error terms with zero mean and variance

.

Polynomial models that satisfy strong hierarchy are also known in the literature as “well-formulated polynomial regression models” [19, 20]. Hierarchical models are also known in statistical literature as models that have the the property of marginality, see [14, 15, 16, 17]. Marginality, that is hierarchy, is routinely used for modeling in experimental designs see [1].

There are several versions of polynomial hierarchy available in the statistical literature and that have been implemented in R packages. The authors in [4] considered a second order polynomial model. For this model, constraints were developed to achieve versions of strong and weak hierarchy. This proposal was developed to create the package hierNet which is Lasso for hierarchical second order models. These ideas were later explored and developed further for hypothesis testing, see [3]. Another development, termed VANISH, also considers a second order model as well as a functional extension of it, and they constructed a penalty that imposes strong hierarchy while keeping the criterion convex [22]. In the development known as FAMILY, convex penalty functions are created using the rows and columns of the matrix of quadratic and second order interaction coefficients. This development has also been implemented in R package and considers different penalties that allow both cases of strong and weak hierarchy [11].

A two step hierarchical approach for the quadratic model is available in the package glinternet which first screens candidate main effects and interactions and implements group lasso to select variables while enforcing strong hierarchy [13]. A recent contribution is the sequential search for hierarchical models while simultaneously keeping low a notion of false rejections. This search has been implemented in the package rai and is potentially able to explore models with higher order polynomial interactions [12].

A different approach for hierarchical polynomial model selection is that by [2], who sequentially search and discard model terms. A model is then selected with a compound criterion based on model curvature and validation error. This approach is not limited to polynomial models of second degree, however the search can be prohibitively expensive.

Finally, the authors in [6] explored a general model parametrization that guarantees hierarchy, but in the face of nonlinearity of this approach, they developed hierarchy in a Bayesian context. Another proposal within the context of Bayesian analysis is [18].

### 1.2 Contributions

Our first contribution is the development of general, non Bayesian methodology for the analysis of data with hierarchical models. Our methodology is in practice as close as possible to standard lasso, while still enforcing hierarchy. Consider the plots in Figure 3, where the plot (a) is standard unconstrained lasso. The results for constrained (b) are already hierarchical but still quite restrictive and we are able to ease the constraints to (c) and then (d) which is closer to (a) but keeping hierarchy.

Our contributed methodology has also low validation error and compares favorably with lasso and other existing methods from the statistical literature. For example, in Figure 7, validation errors obtained for our hierarchical models (blue boxplots) improve over lasso and perform well when compared against other methods from literature.

Another contribution is the development of theory for our proposal. The methodology we develop is based on Hasse diagrams. For a given candidate polynomial model, the divisibility conditions between terms can be encoded as a Hasse diagram. Therefore, such diagram serves to represent the constraints for the parameter vector that guarantee hierarchy. The use of Hasse diagrams was motivated by the experimental design literature, where they are routinely used to analyze hierarchical models (see [1]) but the constraints are novel and to the best of our knowledge, such diagrams have not been used previously to construct constraints.

In terms of parameter estimation, we use lasso [23], which we adapt for hierarchy. To guarantee hierarchy, the minimization of lasso criterion is constrained to the conditions on read from the Hasse diagram and in practice we use an implementation of the quadratic programming methodology by [10].

### 1.3 Order of the paper

The order of the rest of the paper is as follows. In Section 2 we define and then use the divisibility conditions implicit in a hierarchical polynomial model. We build a Hasse diagram from which we read parametric constraints that guarantee model hierarchy. We then discuss the generation of strong and weak hierarchies as well as the relation between such parametric constraints. In Section 3 we apply the constraints from Section 2 as part of estimation in lasso. We develop this constrained estimation within standard lasso and also develop a relaxation of it. In Section 4 we apply our methods to examples from the literature. We add a discussion in Section 5 in which we comment on potential extensions to the methodology.

## 2 Hierarchical polynomial modelling

We first define polynomial notation, and elaborate on hierarchical polynomial models which satisfy divisibility conditions. Then using Hasse diagrams, we develop the constraints on parameters implied by hierarchy. The basic reference for polynomial notation is [7], and for the use of this notation in statistics, see [21].

Consider indeterminates . A monomial term is defined as the power product , where is the exponent vector whose entries are non negative integers. The degree of the term is the sum of its exponents . Let

be a finite set of exponent vectors so that the expectation of a linear regression with terms in

is

 E(Y(x))=∑α∈Mθαxα, (1)

where is the coefficient associated with the term . Each coefficient is a fixed real quantity and thus the right hand side of Equation (1) is a polynomial. We refer to as the model, as it is the set of candidate terms which will be used to model the response.

A model satisfies strong hierarchy when for every term , all the divisors of have exponents in as well. This is the strong hierarchy described in the introduction of this paper, and we imply that the corresponding coefficients are non zero. The list of elements of a hierarchical can be retrieved from the list of directing monomials of , where a directing monomial is a monomial that cannot be divided by other monomial terms from , see [2].

A polynomial model satisfies weak hierarchy, when, for every term with , at least one of the divisors of have exponents in as well. In a model, strong hierarchy implies weak hierarchy.

### 2.1 Hierarchy and partial ordering

There is a natural ordering of monomials implied by monomial division and denoted with the symbol . Consider two distinct monomial terms . We say when divides or conversely, when is a monomial multiple of . Note that is attained when componentwise, i.e. when .

###### Example 1

Consider the set of monomials in variables. The divisibility condition is equivalent to checking componentwise that . The following are all the divisibility relations between monomials in the list above: , , , , , and .

For a set of monomials with exponents from , the ordering generated by divisibility is a transitive relation. An instance of this in Example 1 is that , so that holds. The ordering has guaranteed a unique minimum in only if the set includes , i.e. the polynomial contains the intercept term . This is always the case when the model is hierarchical. However note that defines only a partial order and not a total order in . This is because in general, divisibility cannot uniquely sort a list of monomials. A simple instance of this, also taken from Example 1, is that per se cannot order from . Despite the divisibility relation not being a total order, in this paper we do not require divisibility to satisfy this property and to define hierarchy constraints in the model, it is sufficient to have a partial order.

### 2.2 Hasse diagrams and model constraints

The collection of partial orderings among monomials which appear by divisibility conditions translates naturally into domination constraints for model parameters. We next list those relations in the model and later use the list to establish natural linear constraints between the model parameters. Let be the collection of ordering relations between monomials in :

 R:={xα≺xβ:α,β∈M such that β−α≥0 componentwise}. (2)

We restrict to only contain those relations when the degrees of monomials and differ only by one. The aim of this restriction is to keep to a minimum size by only listing essential relations, and as is transitive, there is no lack of generality by doing this. The construction of the ordering relations for a model is given in Algorithm 1.

All relations of the type are excluded from when the exponent is removed from in the algorithm. Removing the intercept can be done because the intercept term is often of little practical interest. Indeed after standardisation of data, the intercept term is absent from the modelling process. Unless stated otherwise, in the rest of this paper any relations or constraints involving the intercept have been removed from both and the analysis.

The relations listed in can be depicted with a Hasse diagram [5]. To build this diagram, model terms are nodes and edges are drawn when terms are related by divisibility . Ascendant terms are divisors of other model terms and they are located at the top of the Hasse diagram. Conversely, descendant terms can be written as polynomial multiples of other model terms and they are located in the lower part of the diagram. We put ascendant terms such as at the top of the Hasse diagram to reflect the importance of such terms in the parameter constraints to come later. The hierarchy of terms in the diagram we propose coincides with the hierarchy in Hasse diagrams when used in the analysis of experiments [1].

###### Example 2

For the model of Example 1, the set of divisibility conditions is . These relations were used to build the Hasse diagram shown in Figure 1.

A Hasse diagram can be built even when the model does not fully satisfy hierarchy. In such case the diagram would reflect the hierarchical part of between terms that whose degree differ by one.

In summary, what we describe in this Section is model preprocessing, to be done before data analysis. The model preprocessing does not involve response data, but only applying Algorithm 1 to model and then building Hasse diagrams and developing model constraints.

We now explore simple forms in which we can create and combine parameter constraints to obtain different forms of hierarchy. These parameter constraints are read from the Hasse diagram built for the model using the divisibility conditions encoded in the set .

#### 2.2.1 Using the edges of the Hasse diagram

The simplest way is to read parameter constraints directly from the edges in the Hasse diagram. For every relation in the list , we associate the constraint to the model. This constraint ensures that if in the model the coefficient for is zero, i.e. , then the term will be absent from the model as its coefficient will be forced to satisfy . Let be the list of all such constraints:

 H:={|θα|≥|θβ| for every pair α,β∈M such that xα≺xβ∈R}. (3)

The constraints imposed by correspond to strong hierarchy and there are as many constraints of this type as edges in the Hasse diagram.

If there is a model term for which there are no divisors (ascendants) nor multiples of it (descendants), then for such term there will be no divisibility conditions in and hence no parameter constraints appearing on the set . The simplest example of this is when fitting a model with only linear terms so that, after discarding the intercept, the expectation is . The sets and are empty and estimation of model parameters involves no constraints.

###### Example 3

(Continuation of Example 2) From the Hasse diagram of Figure 1, the following parameter constraints are directly read: , , and . These parameter constraints are associated to the model to ensure that it remains hierarchical.

#### 2.2.2 Adding over descendant nodes in the Hasse diagram

We can constrain by adding over multiples of model terms. For every term with exponent , define as the collection of exponents of multiples of , taken from the list , that is To build the constraint, let be a positive weight and add over coefficients of multiples of the monomial :

 wα|θα|≥∑β∈B(α)|θβ|. (4)

Recall that is the list of descendants of in the Hasse diagram, restricted to the immediate descendants. Hence if no monomials are below in the diagram, then will be empty and no constraint is built for . The constraint (4) implies strong hierarchy, as the vanishing of implies vanishing of all its descendant terms. Define to be the set of all such constraints, built over all terms that have multiples (descendant terms)

 S:=⎧⎨⎩wα|θα|≥∑β∈B(α)|θβ|:α∈M such that B(α)≠∅⎫⎬⎭.

The selection of the weight can be arbitrarily made. There are three obvious choices. One is to let the weight of in (4) to be the number of descendants of , that is , where is the cardinality of the set . This selection of is equivalent to constrain by letting the absolute value of each parameter exceeding the mean of the absolute values of the parameters of its immediate descendants. A second possibility is to let all the weights in be equal to one, i.e. exceeds the sum of absolute values of descendants of . The second instance is a more restrictive, penalizing heavily parameters of higher order terms. A third possibility is to let the weight exceed . This is a less restrictive form of strong hierarchy. The relation between these choices is discussed in Section 2.3.

###### Remark 1

When the model corresponds to a polynomial of degree two, our formulation of strong hierarchy with weights coincides with the strong hierarchy as developed by [4].

The symmetry constraint for strong hierarchy by [4] is precisely equal to our constraint with unit weights, which is a collection of inequalities of the form

 |θj|≥p∑i=1|θij|=|θ1j|+|θ2j|+…+|θjj|+…+|θpj|,

where runs over all variables . If the model only includes double interactions but not quadratic terms in the variables then the parameter is absent from the above inequality. In both versions of it, the correspondence between constraints of Remark 1 holds.

#### 2.2.3 Adding over parent nodes in the Hasse diagram

We can constrain parameters by adding over divisors of model terms. In a similar manner as above, for every term with exponent , define as the collection of exponents of ascendant terms taken from the list , formally Let be a positive weight and constrain the parameter of by adding over its ascendants

 ∑α∈A(β)|θα|≥wβ|θβ|. (5)

The set lists all terms that are higher up than in the Hasse diagram, restricted to those immediate ascendants. Akin to the earlier development, if no monomials are above in the Hasse diagram, then is empty and no constraints would be created for . The constraint (5) implies weak hierarchy, as the coefficient of term would only vanish when all the coefficients of its ascendant terms have vanished. Mirroring what was done earlier, define to be the set of all constraints (5), built over terms that have ascendants

 W:=⎧⎨⎩∑α∈A(β)|θα|≥|θβ|:β∈M such that A(β)≠∅⎫⎬⎭.

The specification of each weight is arbitrary and we also consider three cases. The first is to let this weight to be the number of ascendants, which we write and this means that the absolute value of the coefficient for is smaller than the mean of absolute values of its ascendants nodes. The second case is to let all weights to be . This is less restrictive than the case above, being easier to attain. A third case is to let weights be smaller than one, making the constraints much less restrictive than the first two cases. In the next example we give the explicit constraints for the different cases of hierarchy.

###### Example 4

We build different parameter constraints using the set of Examples 1 and 2. The model constraints appearing from the edges of the diagram are

 H={|θ1|≥|θ12|,|θ1|≥|θ13|,|θ2|≥|θ12| and |θ3|≥|θ13|}.

Using weights , we add over multiples of every node in the diagram (adding over descendants) to obtain strong hierarchy

 {|θ1|≥|θ12|+|θ13|,|θ2|≥|θ12| and |θ3|≥|θ13|}

while with weights and adding over divisors of every node (ascendants) we have weak hierarchy

 {|θ1|+|θ2|≥|θ12| and |θ1|+|θ3|≥|θ13|}.

The version of strong hierarchy using weights is

 {2|θ1|≥|θ12|+|θ13|,|θ2|≥|θ12| and |θ3|≥|θ13|},

while weak hierarchy with weights is

 {|θ1|+|θ2|≥2|θ12| and |θ1|+|θ3|≥2|θ13|}.

### 2.3 Relations between constraints

Consider the constraint taken from set of strong hierarchy constraints of Example 4. It is clear that, if this constraint is satisfied, it follows that both and are also satisfied. This is because and simultaneously. Thus, stemming from a constraint in , we have recovered some constraints from the set . Lemma 2 gives the conditions under which the set of constraints , built over the edges if the Hasse diagram, can be deduced from the addition over descendants .

###### Lemma 2

If every weight in the set of constraints satisfies , then the constraints in imply the set of hierarchical constraints .

Theorem 3 establishes the relation between the sets of constraints for strong hierarchy and weak hierarchy . The implications of the theorem depend on the weights specified in each case. The proof of both results is in the Appendix.

###### Theorem 3

Let the sets of constraints and be as defined in Section 2.2. Then, depending on the specification of weights and for sets of constraints and , the implications shown in the diagram in Figure 2 hold.

In the top row of the diagram in Figure 2, we have the constraints associated with strong hierarchy , while the bottom row has constraints associated with weak hierarchy . In the same diagram, the constraints are also ordered from left to right from the most restrictive instances to the least restrictive. The choice of weights provides the modeler with a wide range of models, all of which would satisfy the required type of hierarchy.

Finally, we must distinguish between the hierarchy as built with parameter constraints that we have related in this section and hierarchy when built with arbitrary values of the parameters. While it is true that if a model with exponent set is strongly hierarchical then it is also weakly hierarchical, it is also true that the parameters associated with such models may not necessarily satisfy the implications of the theorems in this section. Indeed it is possible to have a model that is strongly hierarchical that does not satisfy any of the sets of constraints as defined in this paper. Our contribution provides a useful set of parameter constraints that guarantee model hierarchy, and while it does not cover all possible cases, it gives a general and flexible modeling framework.

## 3 Parameter estimation

The development below is based around the lasso shrinkage methodology [23]. This methodology estimates by minimizing over the criterion

 L=12||Y−Xθ||22+λ||θ||1. (6)

If no shrinkage is needed, set in to coincide with least squares estimation. We next discuss constrained lasso and then a relaxed version of it.

### 3.1 Constrained lasso

The estimation problem for is to minimize of Equation (6) over subject to a set of constraints, and we refer to this as constrained lasso. In our proposal, the constraints are given by one of the sets , or , selected by the modeler. With a slight abuse of notation, let denote the column vector whose entries are absolute values of coefficients, i.e. . The parameter constraints take the form

 A|θ|≥0,

where is a matrix of constants with columns, read from the earlier Hasse development of Section 2.2, and the inequality is interpreted componentwise.

###### Example 5

Consider the hierarchy as developed in Example 4, and let the vector of absolute values of coefficients be , then the matrix for the constrained optimization would be

 A=⎛⎜ ⎜ ⎜⎝100−101000−1010−100010−1⎞⎟ ⎟ ⎟⎠.

For a given vector , simultaneous attainment of all the inequalities means that satisfies . For the cases of strong hierarchy S with weights and for unit weights, the matrix would be

 ⎛⎜⎝200−1−1010−100010−1⎞⎟⎠ and ⎛⎜⎝100−1−1010−100010−1⎞⎟⎠,

respectively. For weak hierarchy W with weights and for unit weights, the matrices would be

 (110−201010−2) and (110−101010−1).

For a given value of , the numerical minimization of constrained lasso is a minimization of a quadratic form with absolute constraints. Note that for every orthant of , the minimization remains a standard quadratic problem with linear constraints. This is because inside every orthant, the vector of absolute values is linear with respect to each of its coordinates .

In practice we use the following minimization procedure: select an initial orthant based on the least squares estimate, then for a set of values of , compute lasso estimates in the selected orthant. The minimization is done using the standard R package quadprog and by default, the values of range from to . As part of our search procedure, we also explore if neighboring orthants given by a sign change may have a better estimate than our selected orthant and thus for every value of we explore orthants. We do not claim that our search is a universally optimal procedure, but it has worked well in practice and it is much cheaper than a brute force exploration of all full dimensional orthants of .

### 3.2 Relaxed constrained lasso

An alternative approach to is to do a convex relaxation of the lasso problem, see [9]. This relaxation is used to linearize the sum of absolute values in Equation (6). In the relaxed lasso, instead of the parameter vector , we have two non-negative parameter vectors and , which for simplicity we collect in the column vector which has rows and is . The non-negative condition of from the relaxation implies that . The vector of parameters is built as the difference , that is

 θ=(I−I)u, (7)

where

is an identity matrix of size

. The proxy vector of absolute values of is defined as the addition of these nonnegative vectors

 |θ|=θ++θ−=(II)u

so that is replaced by the addition of all the elements in , that is by with a column of ones with rows. After collecting terms, the relaxed version of the Lasso criterion is

 Lr=12YTY−uT((XTY−XTY)−λ1)+12uT(XTX−XTX−XTXXTX)u. (8)

The problem of estimation of minimizing (6) has been turned into minimization of the relaxed criterion of (8) subject to (component wise). The matrix contains the constraints imposed by hierarchy as discussed earlier as well as the non negativity constraints for values of . The constraints matrix has the following block form

 B=⎛⎜⎝AAI00I⎞⎟⎠,

where is a matrix with columns that has the parameter constraints as developed in Section 2.2; the matrix is an identity matrix of size and in both cases above the matrix is of size . Note that when there are no edges in the Hasse diagram, then the matrix does not exist and matrix for the relaxation of Lasso only contains the lower part with identity and zero matrices.

The estimation of the lasso path for the relaxed constrained lasso is performed in a similar manner to the procedure in Section 3.1: for a collection of values of , estimates are computed using the same R package quadprog. Note that an advantage of the relaxation is that there is no need to explore different quadrants, as the nonnegativity of and hierarchy constraints are all handled by .

We briefly discuss the construction of through an example. Consider the model of Example 1 and let be the parameter vector so that for the relaxation we have and and . Any constraints for this model that involve absolute values, such as those given in Example 4, are reformulated using the component wise convention . For instance, when considering hierarchy in the relaxed Lasso for model , the constraint is replaced by . This constraint is rearranged as , where the brackets separate components of from those of . Note the repetition of roles of elements of and of inside each bracket. Rearranging the inequality, we read the constraint , where the duplication of roles in terms in brackets above implies the repetition of the coefficients as noted earlier. The rest of the constraints in give the hierarchy constraints

 ⎛⎜ ⎜ ⎜⎝100−10100−101000−11000−1010−10010−100010−10010−1⎞⎟ ⎟ ⎟⎠u≥0,

where, as in earlier developments, the inequality is interpreted component wise. This is simply a case of

 (AA)u≥0 (9)

where the matrix is the same as the development in the text of Section 3.1. The nonnegativity constraints for and for are simply written as

 (10)

respectively. In summary, for the hierarchy in model , the relaxed lasso uses a constraints matrix of size ; the first four rows impose the hierarchy , and the remaining ten rows are an identity of size ten that gives non-negativity constraints for the relaxed parameterisation.

Relaxed constrained lasso appears to be a simple alternative to constrained lasso of Section 3.1. The relaxed method does not have to consider multiple orthants and move around them, as this is automatically handled by the problem formulation. Hierarchy is rigorously held for the vector of proxy absolute values , but paradoxically, the method does not guarantee that hierarchy will be enforced for the vector of proxy parameters , nor that the model retrieved will coincide with that of constrained lasso, see the numerical comparison of methods in Section 4.4. We would advocate this simple, approximate method for cases where strong hierarchy is not a crucial requirement, or for a first stage screening of model terms where only a rough list of active terms is required.

## 4 Examples

### 4.1 Small example, synthetic data

Consider the data set in factors given in Table 1. The model under consideration has terms for the intercept and all linear factors together with the interaction terms and . Figure 3 shows the lasso paths for standard lasso in (a) as well as the lasso paths built using the following constraints as defined in this paper: in (b), with weights given by in (c), with all weights equal in (d).

The lasso path in Figure 3 (a) was computed with the standard package lars and added for reference, noting that the coefficients in this path do not obey hierarchy. The rest of the paths in Figure 3 were computed minimizing subject to absolute constraints. We note that the introduction of hierarchy severely constraints the lasso path to the point that the coefficient estimates for terms and become zero for all the path and the paths of and are fully correlated. This is of course too limiting, as seen in (b) in the said figure. The path of hierarchy shown in (c) is still limiting as the same coefficients as in are still zero, but the gradual change in the path towards lasso starts to become evident. Finally, the path in (d) is much closer to the original lasso, while still keeping hierarchy.

### 4.2 Prediction comparison: strong hierarchy and lasso

We carried out a simulation study to compare the performance of our constrained lasso proposal versus standard lasso. To this end, we considered data from a hierarchical model assumed true with terms (excluding intercept) in variables, and integer coefficients taken from the interval and an error term was added to the deterministic part of the model. The model had directing monomials and .

The above simulated data was modelled with a candidate model with terms and directing monomial . The design used was a random uniform design in with training, validation and prediction points. After training, the validation error was used to pick a model from the path for each of the trajectories of lasso and constrained lasso . Finally, using the selected model and the prediction data, we computed the prediction error and compared for both models. This comparison was repeated times for each of levels of variance of Gaussian error, and for nine different values of weights ranging from to .

Figure 4 shows the proportion of times that the prediction error of constrained lasso was smaller than or equal to that of lasso, plotted against weight . Firstly, in all simulations, this proportion was at least , and for increasing levels of error variance, the figure settled at around . An interesting case appeared for lower values of the weight , for which the proportion was initially much higher than the rest of cases, to finally settle for a similar proportion as the rest of cases.

### 4.3 Benchmarking techniques with engine data set

This data set was generated by a computer experiment involving input variables and one output. The first observations of the data were used to train the model and the remaining observations of the data were used for validation purposes. Two initial models were considered for analysis.

In the original analysis by Bates et al. (2003), they considered a saturated model of terms. For our analysis, in order to have a non-saturated initial model, we removed the single term of degree five in their model. This initial model is referred to as ‘Initial BGW’. Using the training data, we built constrained lasso paths and for each model in the path, we reestimated the coefficients using least squares estimates. We then selected a model from the path using the validation error. This analysis was performed for hierarchy and several choices of weights for hierarchies and .

The models, sizes and validation mean squared errors for this study are given in Table 2, and the last column of the table states whether the model satisfies strong hierarchy. The constrained lasso models in the first six rows of Table 2 are ordered according to a reading from left to right by row in the diagram of Theorem 3. The first four rows correspond to strongly hierarchical models which are ordered in decreasing order of restriction. For example, the large weight in row four made this model considerably less restrictive than any model above it, while still obeying strong hierarchy. The fifth and sixth row correspond to weak hierarchy are also ordered in decreasing order of restriction. Table 2 also shows in its last three rows the results from a standard lasso path, from the final model by [2], termed BGW; as well as the model termed RAI from the greedy stepwise regression search by [12].

A common feature in all the models we tried was that the term involving the second variable was absent from the selected model, a desirable analysis consequence noted in [2]. From the models, the best result was obtained with strong hierarchy and the large weight , which is the least restrictive of our constrained lasso with strong hierarchy. The second best model is also a strongly hierarchical. Interestingly enough, the entirely unconstrained standard lasso came second to last with its only benefit being a slightly smaller model size. The model BGW, despite having a relatively large MSE in Table 2, has a reduced size. This is a consequence of that model being obtained through a compound criterion that penalized heavily higher order terms by their curvature, in contrast with our methods that penalize higher order terms concerning hierarchy. The smallest model was RAI, with the trade-off of having the largest validation error in the table.

### 4.4 Comparison between methods of constrained estimation

We compared the performance of constrained lasso against the relaxed version of it using simulated data of the non-polynomial model , suggested by [2]. We modeled and compared two scenarios with and variables. In the first scenario, the data does not depend on one input factor while on the five dimensional scenario, factors and have no influence in the output.

For the three factor scenario, we generated a random latin hypercube design (LH) of points which were used to train a model of degree three with terms. A second latin hypercube of points was used to compute validation error which selected a working model. We did this for constrained lasso and relaxed lasso with constraints, using the strong hierarchy . We recorded the proportion of times that the final model of relaxed lasso concided with that of the constrained version in both terms and respective coefficient signs. We also recorded the proportion when they coincided on the terms only, disregarding the signs of coefficients. These proportions were computed using replications of this experiment, performed for a variety of weights for constraint .

For the five factor scenario, we used a LH design of points in five dimensions to train a model of terms with the same degree three and hierarchy as the previous scenario. The validation design consisted of a second LH of points, and we recorded the same proportions as the earlier case, using instead replications.

Figure 5 shows the experiment results. For moderate to large values of the weight , the coincidence of terms reaches a value of around for both three and five-dimensional scenarios. Coincidence of terms and signs is much lower for both scenarios, never reaching even

of the cases. The results suggest that while the relaxed lasso has the ability to detect active terms that obey the required hierarchy, it does not do that with a very high probability.

### 4.5 Run times of the calculations

This example is concerned with a comparison of the run times of the implementation of our computational procedure. We randomly generated data scenarios varying numbers of input factors and candidate models and to which each of the constraints and was fitted to generate lasso paths in both versions of constrained lasso and as relaxed lasso. Run times for a standard laptop (64 bit processor 1.8GHz, 8GB RAM) were recorded for each scenario and method.

The details of the simulation are as follows. The number of factors ranged from to . For each value of we generated a random latin hypercube with points. The number of points for this design was , where was taken at random between and . Output data values consisted of only simulated uniform noise with no specific trend, and to each design-output configuration we randomly choose a candidate model with hierarchical structure which was then fitted to data. The size of determines the dimension of and consequently, the complexity of the estimation. In total we ran such scenarios and for each scenario, the fit was performed for both constrained lasso and relaxed methodologies as described in Section 3. We generated a fit using each of the constraints and , using values of . For and hierarchies we used weights , respectively. The run time was recorded for each case. In summary, for each of the simulations, there were run times measured: three for each of and constraints using constrained lasso; another three for the same hierarchy cases with relaxed lasso; for comparison we also fitted standard, unconstrained lasso.

The scatterplot in Figure 6 shows run times for hierarchy against model size. This plot is representative of what happens for the other cases and . The run times for constrained lasso are between one and two orders of magnitude bigger than those of relaxed constrained lasso. Indeed this is one advantage to be had when using the relaxed version of constrained lasso, while the obvious drawback of it is the potential lack of hierarchy of models, as discussed earlier in Section 3.2.

The increasing pattern of run times is similar in shape for both constrained lasso and relaxed lasso. This fact is not surprising as both methods consist on quadratic minimization over orthants and what we are plotting are in essence, runtimes of quadprog. In both cases, the logarithm of run time appears to depend on the square root of the model size. Note in the same figure the different pattern and much lower run times of lasso which does not depend on constraints hence lasso models are not generally hierarchical. Interestingly enough, occasionally lasso has similar run times than the constrained versions.

### 4.6 Comparison between methodologies: olive oil data

We compared the performance of existing methodologies using the olive oil data set by [8]. The response was the indicator variable for oils coming from the region of Apulia, modelled as a function of eight other variables in the dataset. The data set was split randomly in two halves, one was used to training the model, while the other half of the data was used to compute the validation error. The smallest validation error in the path was recorded, and this procedure was repeated times for different random splits of the data.

The analysis was carried out for three different candidate models . The candidate model labelled as “Quadratic (full)” consisted of all terms of degree less than or equal to two in eight variables totalling terms. The model labelled as “Quadratic (square free)” had terms, obtained by removing the pure terms of degree two from the model “Quadratic (full)”. A more complex model labelled “Cubic” with terms was built adding pure terms of degree three and triple interactions to the model “Quadratic (full)”. To each of these candidate models , seven different constraints were tried to have scenarios. In addition to these, a simple lasso with main effects only was tried as well as the proposals RAI and FAMILY, see [12, 11]. Boxplots with the results for all these scenarios are shown in Figure 7. The boxplots corresponding to strong hierarchy or are shown in blue, while weak hierarchy are colored in green, while simple lasso is shown in red and boxplots in black correspond to other methodologies. Table 3 contains a description and labels of the cases that are shown in Figure 7.

Overall, adding extra terms of higher degree than two resulted in a model with better prediction capabilities. Indeed the results for candidate model “Cubic” were smaller than the rest of scenarios. Within this candidate model “Cubic”, the result labelled ‘H’ in the figure was the more consistent in the sense of having smaller range of the MSE. The strong hierarchy with bigger weights, labelled ‘SH’ in the plot, had the smallest errors but also had the second biggest spread in this group. For the candidate model “Quadratic (full)”, the result were very similar amongst methods, with the strong hierarchy labelled as ‘SB’ being slightly better. In this cathegory note that the method FAMILY with label ‘F’ has slightly worse results. For candidate model “Quadratic (square free)”, the results for five of the models are very similar, with the case ‘SB’ being a little better with smaller range. Note that for these two cases of quadratic candidate model, the cases ‘S1’ are precisely the hierNet analyses by [4]. In the “Quadratic (full)“, the case ‘S1’ is slightly improved by ‘SB’, while for “Quadratic (square free)” initial model, clearly ‘S1’ is the worst, although not by much. Finally, we note that apart from a main effects only lasso ‘ML’ which is the second worst of the methods tried, there is not much difference in each case between lasso and the rest of methodologies. In other words, adding constraints of the types , or does not impair severely the predictive capabilities of the model. The recent methodology RAI searches for models with higher interactions, and while it has the potential to produce good models, in the examples we tried it had the biggest variability, and the boxplot shown was trimmed due to its very large right tail.

## 5 Discussion

We perceive there is a lack in the literature for a single, comprehensive approach to polynomial hierarchy that is efficient and compares favorably with existing results. Our proposal gives direct, intuitive restrictions over the model coefficients so that the resulting model satisfies hierarchy. Using this idea in a Lasso context also provides a simple, efficient search over a potentially large set of candidate models.

The proposed methodology requires knowledge of a candidate model . If this candidate model is not known, it could be tempting to fit a relatively complex hierarchical model and let the constrained lasso procedure determine a suitable model. Theoretically this is possible and uses the dual nature of lasso as estimation and screening procedure. For example, if is a model of degree two, then our methodology coincides with that of [4], depending on selected weights. However in general we would not advocate to start with a complicated model with expensive parameter estimation and we would rather advocate a standard two stage procedure. An initial screening stage would reduce the number of factors and once a reduced set of factors is available, then a more complex model could be tried.

A line of future work is concerned with the application and development of constraints for other statistical models. In concrete, a natural development is to adapt our constrained methodology for the linear predictor of a generalized linear model so that . This could be particularly advantageous as the linear predictor would have more flexibility to describe different, non monotonic patterns.

Concerning implementation of our constrained lasso methodology of Section 3.1, work is under progress for efficient computation of the lasso path. An initial step is to take advantage of the recently developed package quadprogXT, a development based upon the techniques of the quadprog library [10, 24].

The minimization of the relaxed constrained lasso problem in Section 3.2 is easily implemented using existing the quadratic programming library quadprog see [24]. However, a point that needs clearer understanding is the matrix in the the third summand of (8). This matrix is at the core of the relaxed version of Lasso and is not a full rank matrix. To overcome this difficulty, a solution was to add a multiple of identity to the lower block of that matrix to have instead

 (XTX−XTX−XTXδI+XTX).

This solution has worked well in practice but its stability and accuracy needs further study.

## Acknowledgements

The first author acknowledges partial funding by EPSRC travel grant EP/K036106/1.

## References

• [1] R. A. Bailey (2008) Design of comparative experiments. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press. External Links: Document Cited by: §1.1, §1.2, §2.2.
• [2] R. A. Bates, B. Giglio, and H. P. Wynn (2003)

A global selection procedure for polynomial interpolators.

.
Technometrics 45 (3), pp. 246–255. External Links: Link Cited by: §1.1, §2, §4.3, §4.3, §4.4.
• [3] J. Bien, N. Simon, and R. Tibshirani (2015) Convex hierarchical testing of interactions. Ann. Appl. Stat. 9 (1), pp. 27–42. External Links: Cited by: §1.1.
• [4] J. Bien, J. Taylor, and R. Tibshirani (2013) A LASSO for hierarchical interactions. Ann. Statist. 41 (3), pp. 1111–1141. External Links: Cited by: §1.1, §1, §1, §2.2.2, §4.6, Table 3, §5, Remark 1.
• [5] R. Brüggemann and G. P. Patil (2011) Ranking and prioritization for multi-indicator systems. Environmental and Ecological Statistics, Vol. 5, Springer, New York. Note: Introduction to partial order applications External Links: Cited by: §2.2.
• [6] N. H. Choi, W. Li, and J. Ji Zhu (2010-03-01) Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association 105 (489), pp. 354–364 (English (US)). External Links: Document, ISSN 0162-1459 Cited by: §1.1.
• [7] D. A. Cox, J. Little, and D. O’Shea (2007) Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra, 3/e (undergraduate texts in mathematics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 0387356509 Cited by: §2.
• [8] M. Forina, C. Armanino, S. Lanteri, and E. Tiscornia (1983) Classification of olive oils from their fatty acid composition. In Food Research and Data Analysis, H. Martens and H. J. Russwurm (Eds.), pp. 189–214. Cited by: §4.6.
• [9] L. R. Foulds (1981) Optimization techniques. an introduction.. Undergraduate Texts in Mathematics, Springer-Verlag, New York-Berlin. Note: Cited by: §3.2.
• [10] D. Goldfarb and A. Idnani (1983) A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27, pp. 1––33. Cited by: §1.2, §5.
• [11] A. Haris, D. Witten, and N. Simon (2016) Convex modeling of interactions with strong heredity. J Comput Graph Stat 25 (4), pp. 981––1004. Note: doi: 10.1080/10618600.2015.1067217 Cited by: §1.1, §4.6, Table 3.
• [12] K. D. Johnson, R. A. Stine, and D. P. Foster (2019-06) Fitting High-Dimensional Interaction Models with Error Control. arXiv e-prints, pp. arXiv:1510.06322. External Links: 1510.06322 Cited by: §1.1, §4.3, §4.6, Table 3.
• [13] M. Lim and T. Hastie (2015) Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics 24 (3), pp. 627–654. Note: PMID: 26759522 External Links: Cited by: §1.1.
• [14] P. McCullagh and J. Nelder (1989) Generalized linear models. Vol. 37, CRC press. Cited by: §1.1, §1.
• [15] J. Nelder (1997) Functional marginality is important. Appl. Stat. 46, pp. 281–286. Cited by: §1.1.
• [16] J. Nelder (1998) The selection of terms in response-surface models-how strong is the weak-heredity principle?. Am. Stat. 52 (4), pp. 315–318. Cited by: §1.1.
• [17] J. Nelder (2000) Functional marginality and response-surface fitting. J. Appl. Stat. 27 (1), pp. 109–112. Cited by: §1.1.
• [18] H. Noguchi, Y. Ojima, and S. Yasui (2015) Bayesian lasso with effect heredity principle. In Frontiers in Statistical Quality Control 11, K. S. and S. W. (Eds.), pp. 355–365. Cited by: §1.1.
• [19] J.L. Peixoto (1987) Hierarchical variable selection in polynomial regression models. Am. Stat. 41 (4), pp. 311–313. Cited by: §1.1, §1.
• [20] J.L. Peixoto (1990) A property of well-formulated polynomial regression models. The American Statistician 44 (1), pp. 26–30. Cited by: §1.1.
• [21] G. Pistone, E. Riccomagno, and H. Wynn (2001) Algebraic statistics. Chapman and Hall/CRC. External Links: Document Cited by: §2.
• [22] P. Radchenko and J. M. Gareth (2010) Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association 105 (492), pp. 1541––1553. Cited by: §1.1.
• [23] R. Tibshirani (1996) Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 (1), pp. 267–288. External Links: Link Cited by: §1.2, §3, Table 3.
• [24] B. A. Turlach and A. Weingessel (2013) Quadprog: functions to solve quadratic programming problems.. Note: S original by Berwin A. Turlach R port by Andreas WeingesselR package version 1.5-5 External Links: Link Cited by: §5, §5.

## Appendix

### Proof of Lemma 2

Proof. We note that the sum of absolute values is greater than or equal than the any of its absolute components, that is for any , and the notation above is simply to distinguish terms inside the summation. Hence a consequence of the inequality in Equation (4) is that for all , so that when we have . All the above holds for coefficients of descendants (multiples) of with exponents . As we scan over all constraints of the form (4) in , we recover all of .

### Proof of Theorem 3

Proof. We first show the implications in the first row involving strong hierarchy . To deduce them it is enough to manipulate the inequality (4).

S with implies S with .

This first implication follows from noting that implies that so that the following holds .

S with implies S with .

For this implication, we first use Lemma 2 to retrieve a collection of inequalities of the type for , that is, constraints from the set . We then add the terms in each side of the inequality to retrieve the desired implication.

S with implies S with .

The third and last implication follows from noting that if (4) holds, then also holds, where .

We now go through the implications in the second row that involve weak hierarchy . The implications are deduced by manipulating the inequality (5).

W with implies W with .

This first implication follows from the fact that if , then so that .

W with implies W with .

This implication is a consecuence of the fact that as then the right hand side of inequality obbeys .

W with implies W with .

The implication follows from the fact that if (5) holds for , then we have for .

We are only left to show the implications that link sets and .

S with implies W with .

We first use Lemma 2 for all constraints in with to retrieve the full collection of constraints . Now for every with , we add each side of inequalities of the the type for to retrieve the constraint (4) with .

S with implies W with .

Consider a term for which we want to determine weak constraints of the type W with . In the development that goes below, refer to this exponent as . From the set of constraints S with , consider those constraints that involve for , that is . As is the set of parent terms to then for each constraint, clearly one of the elements in each sum is precisely the given of interest and we obtain the following inequality .

In short, we have the collection of inequalities

 {|B(α)||θα|≥|θβ′|:α∈A(β′)}

and our task is to show that these inequalities imply the constraint

 ∑α∈A(β′)|θα|≥|θβ′|.

The proof is completed by an indirect argument, and we negate the latter constraint to