## 1 Introduction

Shape restrictions such as monotonicity, convexity, and concavity provide a natural way of limiting the complexity of many statistical estimation problems. Shape-constrained estimation is not as well understood as more traditional nonparametric estimation involving smoothness constraints. For instance, the minimax rate of convergence for multivariate convex regression has yet to be rigorously established in full generality. Even the one-dimensional case is challenging, and has been of recent interest (Guntuboyina and Sen, 2013).

In this paper we study the problem of variable selection in multivariate convex regression. Assuming that the regression function is convex and sparse, our goal is to identify the relevant variables. We show that it suffices to estimate a sum of one-dimensional convex functions, leading to significant computational and statistical advantages. This is in contrast to general nonparametric regression, where fitting an additive model can result in false negatives. Our approach is based on a two-stage quadratic programming procedure. In the first stage, we fit a convex additive model, imposing a sparsity penalty. In the second stage, we fit a concave function on the residual for each variable. As we show, this non-intuitive second stage is in general necessary. Our first result is that this procedure is faithful in the population setting, meaning that it results in no false negatives, under mild assumptions on the density of the covariates. Our second result is a finite sample statistical analysis of the procedure, where we establish the rates at which the procedure achieves variable screening consistency. An additional contribution is to show how the required quadratic programs can be formulated to be more scalable. We give simulations to illustrate our method, showing that it performs in a manner that is consistent with our analysis.

Estimation of convex functions arises naturally in several applications. Examples include geometric programming (Boyd and Vandenberghe, 2004), computed tomography (Prince and Willsky, 1990), target reconstruction (Lele, Kulkarni and Willsky, 1992), image analysis (Goldenshluger and Zeevi, 2006) and circuit design (Hannah and Dunson, 2012). Other applications include queuing theory (Chen and Yao, 2001) and economics, where it is of interest to estimate concave utility functions (Meyer and Pratt, 1968). See Lim and Glynn (2012) for other applications. Beyond cases where the assumption of convexity is natural, the convexity assumption can be attractive as a tractable, nonparametric relaxation of the linear model.

Recently, there has been increased research activity on shape-constrained estimation. Guntuboyina and Sen (2013) analyze univariate convex regression and show surprisingly that the risk of the MLE is adaptive to the complexity of the true function. Seijo and Sen (2011) and Lim and Glynn (2012) study maximum likelihood estimation of multivariate convex regression and independently establish its consistency. Cule, Samworth and Stewart (2010) and Kim and Samworth (2014) analyze log-concave density estimation and prove consistency of the MLE; the latter further show that log-concave density estimation has minimax risk lower bounded by $n^{-2/(d+1)}$ for dimension $d \geq 3$, refuting a common notion that the condition of convexity is equivalent, in estimation difficulty, to the condition of having two bounded derivatives. Additive shape-constrained estimation has also been studied; Pya and Wood (2014) propose a penalized B-spline estimator while Chen and Samworth (2014) show the consistency of the MLE. To the best of our knowledge, however, there has been no work on variable selection and estimation of high-dimensional convex functions.

Variable selection in general nonparametric regression or function estimation is a notoriously difficult problem. Lafferty and Wasserman (2008) develop a greedy procedure for adjusting bandwidths in a local linear regression estimator, and show that the procedure achieves the minimax rate as if the relevant variables were isolated in advance. But the method only provably scales to dimensions $p$ that grow logarithmically in the sample size $n$, i.e., $p = O(\log n)$. This is in contrast to the high dimensional scaling behavior known to hold for sparsity selection in linear models using $\ell_1$ penalization, where the sample size $n$ is logarithmic in the dimension $p$. Bertin and Lecué (2008) develop an optimization-based approach in the nonparametric setting, applying the lasso in a local linear model at each test point. Here again, however, the method only scales as $p = O(\log n)$, the low-dimensional regime. An approximation theory approach to the same problem is presented in DeVore, Petrova and Wojtaszczyk (2011), using techniques based on hierarchical hashing schemes, similar to those used for “junta” problems (Mossel, O’Donnell and Servedio, 2004). There it is shown that the sample complexity can be made logarithmic in the ambient dimension if one adaptively selects the points on which the high-dimensional function is evaluated. Comminges and Dalalyan (2012) show that exponential scaling of the ambient dimension is achievable if the underlying function is assumed to be smooth with respect to a Fourier basis. They also give support for the intrinsic difficulty of variable selection in nonparametric regression, proving lower bounds which show that consistent variable selection is impossible unless the sample size grows exponentially with the number of relevant variables $s$. Variable selection over kernel classes is studied by Koltchinskii and Yuan (2010).

Perhaps more closely related to the present work is the framework studied by Raskutti, Wainwright and Yu (2012) for sparse additive models, where sparse regression is considered under an additive assumption, with each component function belonging to an RKHS. An advantage of working over an RKHS is that nonparametric regression with a sparsity-inducing regularization penalty can be formulated as a finite dimensional convex cone optimization. On the other hand, smoothing parameters for the component Hilbert spaces must be chosen, leading to extra tuning parameters that are difficult to select in practice. There has also been work on estimating sparse additive models over a spline basis, for instance the work of Huang, Horowitz and Wei (2010), but these approaches too require the tuning of smoothing parameters.

While nonparametric, the convex regression problem is naturally formulated using finite dimensional convex optimization, with no additional tuning parameters. The convex additive model can be used for convenience, without assuming it to actually hold, for the purpose of variable selection. As we show, our method scales to high dimensions, with a dependence on the intrinsic dimension that scales polynomially, rather than exponentially as in the general case analyzed in Comminges and Dalalyan (2012).

In the following section we give a high-level summary of our technical results, including additive faithfulness, variable selection consistency, and high dimensional scaling. In Section 3 we give a detailed account of our method and the conditions under which we can guarantee consistent variable selection. In Section 4 we show how the required quadratic programs can be reformulated to be more efficient and scalable. In Section 5 we give the details of our finite sample analysis, showing that a sample size growing polynomially in the number of relevant variables, and logarithmically in the ambient dimension, is sufficient for variable selection. In Section 6 we report the results of simulations that illustrate our methods and theory. The full proofs are given in a technical appendix.

## 2 Overview of Results

In this section we provide a high-level description of our technical results. The full technical details, the precise statements of the results, and their detailed proofs are provided in the following sections.

Our main contribution is an analysis of an additive approximation for identifying relevant variables in convex regression. We prove a result that shows when and how the additive approximation can be used without introducing false negatives in the population setting. In addition, we develop algorithms for the efficient implementation of the quadratic programs required by the procedure.

We first establish some notation, to be used throughout the paper. If $x$ is a vector, we use $x_{-k}$ to denote the vector with the $k$-th coordinate removed. If $v \in \mathbb{R}^p$, then $v_{(1)}$ denotes the smallest coordinate of $v$ in magnitude, and $v_{(j)}$ denotes the $j$-th smallest; $\mathbf{1}$ is the all ones vector. If $X$ is a random variable and $S \subset \{1,\dots,p\}$, then $X_S$ is the subvector of $X$ restricted to the coordinates in $S$. Given samples $X^{(1)},\dots,X^{(n)}$, we use $\widehat{\mathbb{E}}$ to denote the sample mean. Given a random variable $X_k$ and a scalar $x_k$, we use $\mathbb{E}[f(X) \mid x_k]$ as a shorthand for $\mathbb{E}[f(X) \mid X_k = x_k]$.

### 2.1 Faithful screening

The starting point for our approach is the observation that least squares nonparametric estimation under convexity constraints is equivalent to a finite dimensional quadratic program. Specifically, the infinite dimensional optimization

(2.1)  $\displaystyle \hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{C}} \sum_{i=1}^n \big(Y_i - f(X_i)\big)^2,$ where $\mathcal{C}$ denotes the set of convex functions,

is equivalent to the finite dimensional quadratic program

(2.2)  $\displaystyle \min_{f_1,\dots,f_n,\;\beta_1,\dots,\beta_n} \sum_{i=1}^n (Y_i - f_i)^2 \quad \text{such that } f_j \geq f_i + \beta_i^\top (X_j - X_i) \ \text{ for all } i, j.$

Here $f_i$ is the estimated function value $f(X_i)$, and the vectors $\beta_i \in \mathbb{R}^p$ represent supporting hyperplanes to the epigraph of $f$; see Boyd and Vandenberghe (2004), Section 6.5.5. Importantly, this finite dimensional quadratic program does not have tuning parameters for smoothing the function.

This formulation of convex regression is subject to the curse of dimensionality. Moreover, attempting to select variables by regularizing the subgradient vectors $\beta_i$

with a group sparsity penalty is not effective. Intuitively, the reason is that all components of the subgradient $\beta_i$ appear in every convexity constraint $f_j \geq f_i + \beta_i^\top (X_j - X_i)$; small changes to the subgradients may not violate the constraints. Experimentally, we find that regularization with a group sparsity penalty will make the subgradients of irrelevant variables small, but may not zero them out completely.

This motivates us to consider an additive approximation. Under a convex additive model, each component of the subgradient appears only in the convexity constraint for the corresponding variable:

(2.3)  $f_{jk} \geq f_{ik} + \beta_{ik}\,(X_{jk} - X_{ik}) \quad \text{for all } i, j \text{ and each } k = 1,\dots,p,$

where $f_{ik}$ is the estimated value $f_k(X_{ik})$ and $\beta_{ik}$ is the subgradient of the component function $f_k$ at the point $X_{ik}$. As we show, this leads to an effective variable selection procedure. The shape constraints play an essential role. For general regression, using an additive approximation for variable selection may make errors. In particular, the nonlinearities in the regression function may result in an additive component being wrongly zeroed out. We show that this cannot happen for convex regression under appropriate conditions.
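To make the decoupling concrete, the per-coordinate supporting-hyperplane constraints can be checked directly. The sketch below is pure Python with illustrative names (not taken from the paper's code); note that each inequality involves only a single coordinate $k$:

```python
# Sketch: checking the decoupled additive convexity constraints.
# x[i][k]: i-th sample, k-th coordinate; fvals[i][k]: fitted value f_k(x[i][k]);
# beta[i][k]: subgradient of the component f_k at x[i][k].

def satisfies_additive_convexity(x, fvals, beta, tol=1e-9):
    """True iff f_k(x[j][k]) >= f_k(x[i][k]) + beta[i][k] * (x[j][k] - x[i][k])
    for every pair (i, j) and every coordinate k."""
    n, p = len(x), len(x[0])
    for k in range(p):
        for i in range(n):
            for j in range(n):
                if fvals[j][k] < fvals[i][k] + beta[i][k] * (x[j][k] - x[i][k]) - tol:
                    return False
    return True

# A univariate convex fit f(t) = t^2 with subgradient 2t passes the check:
xs = [[-1.0], [0.0], [0.5], [2.0]]
fv = [[t[0] ** 2] for t in xs]
bt = [[2 * t[0]] for t in xs]
print(satisfies_additive_convexity(xs, fv, bt))  # True
```

Because the constraints factor over coordinates, zeroing out the subgradients of one variable does not interact with the constraints of any other variable, which is what makes group-sparsity regularization effective in the additive setting.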

We say that a differentiable function $f$ depends on variable $k$ if $\partial f/\partial x_k \neq 0$ with probability greater than zero. An additive approximation is given by

(2.4)  $\displaystyle (\mu^*, f_1^*, \dots, f_p^*) = \operatorname*{arg\,min}_{\mu,\, f_1,\dots,f_p} \; \mathbb{E}\Big( f(X) - \mu - \sum_{k=1}^p f_k(X_k) \Big)^2 \quad \text{such that } \mathbb{E}[f_k(X_k)] = 0.$

We say that $f$ is additively faithful in case $f_k^* = 0$ implies that $f$ does not depend on coordinate $k$. Additive faithfulness is a desirable property since it implies that an additive approximation may allow us to screen out irrelevant variables.

Our first result shows that convex multivariate functions are additively faithful under the following assumption on the distribution of the data.

###### Definition 2.1 .

Let $p(x)$ be a density supported on $[0,1]^p$. Then $p$ satisfies the *boundary flatness condition* if for all $k$, and for all $x_{-k}$,

$\dfrac{\partial}{\partial x_k}\, p(x_{-k} \mid x_k) = \dfrac{\partial^2}{\partial x_k^2}\, p(x_{-k} \mid x_k) = 0 \quad \text{at } x_k = 0 \text{ and } x_k = 1.$

As discussed in Section 3, this is a relatively weak condition. Our first result is that this condition suffices in the population setting of convex regression.

###### Theorem 1.

Let $p(x)$ be a positive density supported on $[0,1]^p$ that satisfies the boundary flatness property. If $f$ is convex and twice differentiable, then $f$ is additively faithful under $p$.

Intuitively, an additive approximation zeroes out variable $k$ when, fixing $x_k$, every “slice” of $f$ integrates to zero. We prove this result by showing that “slices” of convex functions that integrate to zero cannot be “glued together” while still maintaining convexity.

While this shows that convex functions are additively faithful, it is difficult to estimate the optimal additive functions $f_k^*$. The difficulty is that $f_k^*$ need not be a convex function, as we show through a counterexample in Section 3. It may be possible to estimate $f_k^*$ with smoothing parameters, but, for the purpose of variable screening, it is sufficient in fact to approximate $f$ by a *convex* additive model.

Our next result states that a convex additive fit, combined with a series of univariate concave fits, is faithful. We abuse notation in Theorem 2 and let the notation $f_k^*$ represent convex additive components.

###### Theorem 2.

Suppose $p(x)$ is a positive density supported on $[0,1]^p$ that satisfies the boundary flatness condition. Suppose that $f$ is convex and twice differentiable, and that $\partial f/\partial x_k$, $\partial p(x_{-k} \mid x_k)/\partial x_k$, and $\partial^2 p(x_{-k} \mid x_k)/\partial x_k^2$ are all continuous as functions on $[0,1]^p$. Define

(2.5)  $\displaystyle (f_1^*, \dots, f_p^*, \mu^*) = \operatorname*{arg\,min}_{\mu,\; f_k \in \mathcal{C}^1} \; \mathbb{E}\Big( f(X) - \mu - \sum_{k=1}^p f_k(X_k) \Big)^2,$

where $\mathcal{C}^1$ is the set of univariate convex functions, and, with respect to the $f_k^*$'s from above, define

(2.6)  $\displaystyle g_k^* = \operatorname*{arg\,min}_{g_k \in -\mathcal{C}^1} \; \mathbb{E}\Big( f(X) - \mu^* - \sum_{k' \neq k} f_{k'}^*(X_{k'}) - g_k(X_k) \Big)^2,$

with $-\mathcal{C}^1$ denoting the set of univariate concave functions. Then $f_k^* = 0$ and $g_k^* = 0$ together imply that $f$ does not depend on $x_k$, i.e., $\partial f/\partial x_k = 0$ with probability one.

This result naturally suggests a two-stage screening procedure for variable selection. In the first stage we fit a sparse convex additive model $\{\hat f_k\}$. In the second stage we fit a concave function $\hat g_k$ to the residual for each variable $k$ having a zero convex component $\hat f_k = 0$. If both $\hat f_k = 0$ and $\hat g_k = 0$, we can safely discard variable $x_k$. As a shorthand, we refer to this two-stage procedure as AC/DC. In the AC stage we fit an additive convex model. In the DC stage we fit decoupled concave functions on the residuals. The decoupled nature of the DC stage allows all of the fits to be carried out in parallel. The entire process involves no smoothing parameters. Our next results concern the required optimizations, and their finite sample statistical performance.

### 2.2 Optimization

In Section 4 we present optimization algorithms for the additive convex regression stage. The convex constraints for the additive functions, analogous to the multivariate constraints (2.2), are that each component $f_k$ can be represented by its supporting hyperplanes, i.e.,

(2.7)  $f_{jk} \geq f_{ik} + \beta_{ik}\,(X_{jk} - X_{ik}) \quad \text{for all } i, j \text{ and each } k,$

where $f_{ik} = f_k(X_{ik})$ and $\beta_{ik}$ is the subgradient at the point $X_{ik}$. While this apparently requires $O(pn^2)$ inequalities to impose the supporting hyperplane constraints, in fact only $O(pn)$ constraints suffice. This is because univariate convex functions are characterized by the condition that the subgradient, which is a scalar, must increase monotonically. This observation leads to a reduced quadratic program with $O(np)$ variables and $O(np)$ constraints.
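The reduction to a linear number of constraints can be illustrated with a short sketch (hypothetical helper names): for sorted univariate points, convexity of the fitted values is equivalent to the successive slopes being nondecreasing, which requires only one check per interior knot rather than a pairwise comparison:

```python
# Sketch: for sorted univariate points, convexity of fitted values reduces
# to monotonically nondecreasing successive slopes -- a linear number of
# constraints instead of the pairwise supporting-hyperplane inequalities.

def slopes_nondecreasing(x, f, tol=1e-9):
    """x: sorted distinct knots; f: fitted values. Returns True iff the
    piecewise-linear interpolant of (x, f) is convex."""
    s = [(f[i + 1] - f[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]
    return all(s[i + 1] >= s[i] - tol for i in range(len(s) - 1))

xs = [0.0, 1.0, 2.0, 3.0]
print(slopes_nondecreasing(xs, [0.0, 1.0, 4.0, 9.0]))  # True: values of t^2
print(slopes_nondecreasing(xs, [0.0, 1.0, 1.5, 1.6]))  # False: concave shape
```

In the quadratic program, the same idea replaces the pairwise constraints for each coordinate with ordered-slope constraints on consecutive sorted sample points.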

Directly applying a QP solver to this optimization is still computationally expensive for relatively large $n$ and $p$. We thus develop a block coordinate descent method, where in each step we solve a sparse quadratic program involving $O(n)$ variables and $O(n)$ constraints. This is efficiently solved using optimization packages such as MOSEK. The details of these optimizations are given in Section 4.

### 2.3 Finite sample analysis

In Section 5
we analyze the finite sample variable selection consistency of convex
additive modeling, without making the assumption that the true
regression function is additive. Our analysis first establishes
a sufficient deterministic condition for variable selection
consistency, and then considers a stochastic setting.
Our proof technique decomposes the KKT conditions for the optimization
in a manner that is similar to the now standard *primal-dual
witness* method (Wainwright, 2009).

We prove separate results that allow us to analyze false negative rates and false positive rates. To control false positives, we analyze scaling conditions on the regularization parameter $\lambda$ for group sparsity needed to zero out the irrelevant variables $k \notin S$, where $S$ is the set of variables selected by the AC/DC algorithm in the population setting. To control false negatives, we analyze the restricted regression where the variables in $S^c$ are zeroed out, following the primal-dual strategy.

Each of our theorems uses a subset of the following assumptions:

A1: The covariates $X_1, \dots, X_p$ are independent.

A2: $f$ is convex and twice-differentiable.

A3: $\|f\|_\infty \leq B$ and $\|f_k^*\|_\infty \leq B$ for all $k$.

A4: The noise is mean-zero sub-Gaussian, independent of $X$.

In Assumption A3, $f_k^*$ denotes the optimal additive projection of $f$ in the population setting.

Our analysis involves two signal strength parameters, defined precisely in Section 5, which measure the signal strength of the weakest relevant variable: one for the additive convex stage and one for the decoupled concave stage. Intuitively, if the first parameter is small, then it is easier to make a false omission in the additive convex stage of the procedure. If the second parameter is small, then it is easier to make a false omission in the decoupled concave stage of the procedure.

We make strong assumptions on the covariates in A1 in order to make very weak assumptions on the true regression function in A2; in particular, we do not assume that $f$ is additive. Relaxing the independence assumption is an important direction for future work. We also include an extra boundedness constraint in A3 in order to use new bracketing number results (Kim and Samworth, 2014).

Our main result is the following.

###### Theorem 3.

Suppose assumptions A1–A4 hold. Let $\{\hat f_k\}$ be any AC solution and let $\{\hat g_k\}$ be any DC solution, both estimated with an appropriately scaled regularization parameter $\lambda$. Suppose in addition that the sample size satisfies growth conditions (2.8) and (2.9), which require $n$ to grow polynomially in the number of relevant variables $s$ and logarithmically in the ambient dimension $p$; the precise forms of these conditions, and the scaling of $\lambda$, are given in Section 5. Then, for sufficiently large $n$, with high probability the set of variables selected by the AC/DC procedure coincides with the set $S$ selected in the population setting.

This shows that variable selection consistency is achievable with the ambient dimension $p$ growing exponentially in the sample size $n$, as for linear models. The cost of nonparametric estimation is reflected in the scaling with respect to the number of relevant variables $s$, which can grow only as a small power of $n$.

We remark that Comminges and Dalalyan (2012) show that, even with a product distribution, under traditional smoothness constraints consistent variable selection is achievable only if the sample size grows exponentially with the number of relevant variables $s$. Here we demonstrate that convexity reduces this to polynomial scaling in $s$.

## 3 Additive Faithfulness

For general regression, an additive approximation may result in a relevant variable being incorrectly marked as irrelevant. Such mistakes are inherent to the approximation and may persist even in the population setting. In this section we give examples of this phenomenon, and then show how the convexity assumption changes the behavior of the additive approximation. We begin with a lemma that characterizes the components of the additive approximation under mild conditions.

###### Lemma 3.1.

Let $P$ be a distribution on $[0,1]^p$ with a positive density function $p(x)$. Let $f$ be an integrable function, and define

$(\mu^*, f_1^*, \dots, f_p^*) = \operatorname*{arg\,min}_{\mu,\, f_1,\dots,f_p} \mathbb{E}\Big( f(X) - \mu - \sum_{k=1}^p f_k(X_k) \Big)^2 \quad \text{such that } \mathbb{E}[f_k(X_k)] = 0.$

Then $\mu^* = \mathbb{E}[f(X)]$, and for each $k$,

(3.1)  $\displaystyle f_k^*(x_k) = \mathbb{E}\Big[ f(X) - \sum_{j \neq k} f_j^*(X_j) \,\Big|\, x_k \Big] - \mu^*,$

and this solution is unique.

Lemma 3.1 follows from the stationarity conditions of the optimal solution. This result is known, and criterion (3.1) is used in the backfitting algorithm for fitting additive models. We include a proof as our results build on it.

###### Proof.

Let $\mu^*, f_1^*, \dots, f_p^*$ be the minimizers as defined. We first show that the optimal intercept is $\mu = \mathbb{E}[f(X)]$ for any $\{f_k\}$ such that $\mathbb{E}[f_k(X_k)] = 0$. This follows from the stationarity condition, which states that $\mu = \mathbb{E}\big[f(X) - \sum_k f_k(X_k)\big] = \mathbb{E}[f(X)]$. Uniqueness is apparent because the second derivative of the objective with respect to $\mu$ is strictly larger than zero, so strong convexity in $\mu$ is guaranteed.

We now turn our attention toward the $f_k^*$'s. It must be that $f_k^*$ minimizes

(3.2)  $\displaystyle \mathbb{E}\Big( f(X) - \mu^* - \sum_{j \neq k} f_j^*(X_j) - f_k(X_k) \Big)^2$

subject to $\mathbb{E}[f_k(X_k)] = 0$. Fixing $x_k$, we will show that the value

(3.3)  $\displaystyle f_k^*(x_k) = \mathbb{E}\Big[ f(X) - \sum_{j \neq k} f_j^*(X_j) \,\Big|\, x_k \Big] - \mu^*$

uniquely minimizes

(3.4)  $\displaystyle \mathbb{E}\Big[ \Big( f(X) - \mu^* - \sum_{j \neq k} f_j^*(X_j) - f_k(x_k) \Big)^2 \,\Big|\, x_k \Big].$

The first-order optimality condition gives us

(3.5)  $\displaystyle 0 = \frac{\partial}{\partial f_k(x_k)}\, \mathbb{E}\Big[ \Big( f(X) - \mu^* - \sum_{j \neq k} f_j^*(X_j) - f_k(x_k) \Big)^2 \,\Big|\, x_k \Big]$

(3.6)  $\displaystyle \phantom{0} = -2\, \mathbb{E}\Big[ f(X) - \mu^* - \sum_{j \neq k} f_j^*(X_j) - f_k(x_k) \,\Big|\, x_k \Big],$

so that

(3.7)  $\displaystyle f_k(x_k) = \mathbb{E}\Big[ f(X) - \sum_{j \neq k} f_j^*(X_j) \,\Big|\, x_k \Big] - \mu^*.$

The squared error objective is strongly convex, and the second derivative with respect to $f_k(x_k)$ is $2\,p_k(x_k)$, which is always positive under the assumption that $p$ is a positive density. Therefore, the solution is unique. Noting that $f_k^*$ as defined in (3.3) has mean zero as a function of $x_k$ completes the proof. ∎

In the case that the distribution in Lemma 3.1 is a product distribution, the additive components take on a simple form.

###### Corollary 3.1.

Let $P$ be a product distribution on $[0,1]^p$ with a density function $p(x)$ which is positive on $[0,1]^p$. Let $\mu^*$ and $f_k^*$ be defined as in Lemma 3.1. Then $\mu^* = \mathbb{E}[f(X)]$ and $f_k^*(x_k) = \mathbb{E}[f(X) \mid x_k] - \mu^*$, and this solution is unique.

###### Example 3.1 .

Using Corollary 3.1, we give two examples of *additive unfaithfulness* under the uniform distribution—where relevant variables are erroneously marked as irrelevant under an additive approximation. First, consider the following function:

(3.8)  $f(x_1, x_2) = \sin(2\pi x_1)\, \sin(2\pi x_2),$

defined for $(x_1, x_2) \in [0,1]^2$. Then $\mathbb{E}[f(X) \mid x_1] = 0$ and $\mathbb{E}[f(X) \mid x_2] = 0$ for each $x_1$ and $x_2$. An additive approximation would set $f_1^* = 0$ and $f_2^* = 0$. Next, consider the function

(3.9)  $f(x_1, x_2) = x_1 x_2^2,$

defined for $(x_1, x_2) \in [-1,1]^2$. In this case $\mathbb{E}[f(X) \mid x_2] = 0$ for each $x_2$; therefore, we expect $f_2^* = 0$ under the additive approximation. This function, for every fixed $x_2$, is a zero-intercept linear function of $x_1$ with slope $x_2^2$.
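Unfaithfulness of this kind is easy to verify numerically. The sketch below uses the illustrative function $f(x_1, x_2) = x_1 x_2^2$ on $[-1,1]^2$: every conditional mean given $x_2$ vanishes, so by the product-density formula the additive approximation zeroes out the second component even though $f$ depends on $x_2$:

```python
import random

# Monte Carlo sketch of additive unfaithfulness under the uniform
# distribution on [-1, 1]^2, using the illustrative f(x1, x2) = x1 * x2^2.
# Under a product density, f_k*(x_k) = E[f(X) | x_k] - E[f(X)]; here
# E[f(X) | x2] = E[X1] * x2^2 = 0 for every x2, so f_2* = 0.

random.seed(0)

def cond_mean_given_x2(f, x2, n=200_000):
    # Estimate E[f(X) | X2 = x2] with X1 ~ Uniform[-1, 1]
    return sum(f(random.uniform(-1, 1), x2) for _ in range(n)) / n

f = lambda x1, x2: x1 * x2 ** 2
for x2 in (-0.9, 0.0, 0.7):
    print(abs(cond_mean_given_x2(f, x2)) < 0.01)  # True: conditional mean ~ 0
```

A fixed seed keeps the Monte Carlo error well below the tolerance. In contrast, the conditional means given $x_1$ equal $x_1 \mathbb{E}[X_2^2] \neq 0$, so the first component survives the additive approximation.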

In order to exploit additive models in variable selection, it is important to understand when the additive approximation accurately captures all of the relevant variables. We call this property additive faithfulness. We first formalize the intuitive notion that a multivariate function *does not depend on* a coordinate $k$.

###### Definition 3.1 .

Let $f : [0,1]^p \to \mathbb{R}$ and let $k \in \{1,\dots,p\}$. We say that $f$ does not depend on coordinate $k$ if for all $x_{-k}$, $f(x_k, x_{-k})$ is a constant as a function of $x_k$. If $f$ is differentiable, then $f$ does not depend on coordinate $k$ if $\partial f/\partial x_k$ is 0 for all $x$.

In addition, suppose we have a distribution $P$ over $[0,1]^p$ and the additive approximation

(3.10)  $\displaystyle (\mu^*, f_1^*, \dots, f_p^*) = \operatorname*{arg\,min}_{\mu,\, f_1,\dots,f_p} \mathbb{E}\Big( f(X) - \mu - \sum_{k=1}^p f_k(X_k) \Big)^2 \quad \text{such that } \mathbb{E}[f_k(X_k)] = 0.$

We say that $f$ is additively faithful under $P$ if $f_k^* = 0$ implies that $f$ does not depend on coordinate $k$.

Additive faithfulness is an attractive property because it implies that, in the population setting, the additive approximation yields consistent variable selection.

### 3.1 Additive Faithfulness of Convex Functions

We now show that under a general class of distributions which we characterize below, convex multivariate functions are additively faithful.

###### Definition 3.2 .

A density $p(x)$ supported on $[0,1]^p$ satisfies the *boundary flatness condition* if, for all $k$, and for all $x_{-k}$:

(3.11)  $\dfrac{\partial}{\partial x_k}\, p(x_{-k} \mid x_k) = \dfrac{\partial^2}{\partial x_k^2}\, p(x_{-k} \mid x_k) = 0 \quad \text{at } x_k = 0 \text{ and } x_k = 1.$

The boundary flatness condition is a weak condition. For instance, it is satisfied when the density is flat at the boundary of its support—more precisely, when the joint density satisfies $\partial p(x)/\partial x_k = 0$ and $\partial^2 p(x)/\partial x_k^2 = 0$ at boundary points $x_k \in \{0,1\}$. The boundary flatness property is also trivially satisfied when $p(x)$ is a product density.
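The product-density case can be checked mechanically. The sketch below, with arbitrarily chosen illustrative marginals, finite-differences the conditional density of a product density in $x_1$ at the boundary and confirms that the derivative vanishes:

```python
# Sketch: for a product density p(x1, x2) = p1(x1) p2(x2), the conditional
# density p(x2 | x1) equals p2(x2) and does not vary with x1, so its
# x1-derivatives vanish everywhere -- in particular at the boundary
# x1 in {0, 1}. The marginals below are illustrative choices.

def p1(x1):                # a marginal density on [0, 1]
    return 0.5 + x1        # integrates to 1 over [0, 1]

def p2(x2):                # another marginal density on [0, 1]
    return 2.0 * x2        # integrates to 1 over [0, 1]

def cond_p2_given_x1(x2, x1):
    joint = p1(x1) * p2(x2)    # product density
    marginal = p1(x1)          # integral of the joint over x2
    return joint / marginal    # equals p2(x2), free of x1

# One-sided finite difference of the conditional in x1 at the boundary x1 = 0:
h = 1e-4
d = (cond_p2_given_x1(0.3, h) - cond_p2_given_x1(0.3, 0.0)) / h
print(abs(d) < 1e-8)  # True: the boundary flatness condition holds trivially
```

Note that the marginal $p_1$ here is deliberately *not* flat at the boundary; boundary flatness constrains only the conditional density, which a product structure makes constant in $x_k$.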

The following theorem is the main result of this section.

###### Theorem 3.1.

Let $p(x)$ be a positive density supported on $[0,1]^p$ that satisfies the boundary flatness property. If $f$ is convex and twice differentiable, then $f$ is additively faithful under $p$.

We pause to give some intuition before presenting the full proof. Suppose that the underlying distribution has a product density. Then we know from Lemma 3.1 that the additive approximation zeroes out $f_k^*$ when, fixing $x_k$, every “slice” of $f$ integrates to zero. We prove Theorem 3.1 by showing that “slices” of convex functions that integrate to zero cannot be “glued together” while still maintaining convexity.

###### Proof.

Fixing $k$ and using the result of Lemma 3.1, we need only show that $f_k^* = 0$ implies that $f$ does not depend on coordinate $k$, i.e., $\partial f/\partial x_k = 0$ for all $x$.

Let us use the shorthand $F(x) \equiv f(x) - \sum_{j \neq k} f_j^*(x_j) - \mu^*$, and assume that $f_k^* = 0$. By equation (3.1), we then have that for all $x_k$,

(3.12)  $\displaystyle \int F(x_k, x_{-k})\, p(x_{-k} \mid x_k)\, dx_{-k} = 0.$

We let $p'$ denote $\frac{\partial}{\partial x_k} p(x_{-k} \mid x_k)$ and $p''$ denote $\frac{\partial^2}{\partial x_k^2} p(x_{-k} \mid x_k)$, and likewise $F' = \partial F/\partial x_k$ and $F'' = \partial^2 F/\partial x_k^2$. We then differentiate under the integral, valid because all functions are bounded, and obtain

(3.13)  $\displaystyle \int \big( F'(x)\, p(x_{-k} \mid x_k) + F(x)\, p'(x_{-k} \mid x_k) \big)\, dx_{-k} = 0,$

(3.14)  $\displaystyle \int \big( F''(x)\, p(x_{-k} \mid x_k) + 2 F'(x)\, p'(x_{-k} \mid x_k) + F(x)\, p''(x_{-k} \mid x_k) \big)\, dx_{-k} = 0.$

By the boundary flatness condition, we have that $p'$ and $p''$ are zero at $x_k \in \{0, 1\}$. The integral equations then reduce to the following, holding at $x_k = 0$ and $x_k = 1$:

(3.15)  $\displaystyle \int F'(x)\, p(x_{-k} \mid x_k)\, dx_{-k} = 0,$

(3.16)  $\displaystyle \int F''(x)\, p(x_{-k} \mid x_k)\, dx_{-k} = 0.$

Because $f$ is convex, $f$ must be a convex function of $x_k$ for all $x_{-k}$. Note that $F' = \partial f/\partial x_k$ and $F'' = \partial^2 f/\partial x_k^2$, since the components $f_j^*$ for $j \neq k$ do not depend on $x_k$. Therefore, for all $x$, $F''(x) \geq 0$. Since $p(x_{-k} \mid x_k) > 0$ by the assumption that $p$ is a positive density, we have from (3.16) that $F''(x) = 0$ at $x_k = 0$ necessarily.

The Hessian of $f$ at $(0, x_{-k})$ then has a zero at the $k$-th main diagonal entry. A positive semidefinite matrix with a zero on the $k$-th main diagonal entry must have only zeros on the $k$-th row and column; see proposition 7.1.10 of Horn and Johnson (1990). Thus, at all $x = (0, x_{-k})$, the gradient of $\partial f/\partial x_k$ with respect to $x_{-k}$ must be zero. Therefore, $\frac{\partial f}{\partial x_k}(0, x_{-k})$ must be constant for all $x_{-k}$. By equation (3.15), we conclude that $\frac{\partial f}{\partial x_k}(0, x_{-k}) = 0$ for all $x_{-k}$. We can use the same reasoning for the case where $x_k = 1$ and deduce that $\frac{\partial f}{\partial x_k}(1, x_{-k}) = 0$ for all $x_{-k}$.

Because $f$ as a function of $x_k$ is convex, the partial derivative $\partial f/\partial x_k$ is nondecreasing in $x_k$, so it must be that, for all $x_{-k}$ and for all $x_k \in [0,1]$,

(3.17)  $\dfrac{\partial f}{\partial x_k}(0, x_{-k}) \leq \dfrac{\partial f}{\partial x_k}(x_k, x_{-k}) \leq \dfrac{\partial f}{\partial x_k}(1, x_{-k}).$

Therefore $\partial f/\partial x_k = 0$ everywhere, and $f$ does not depend on $x_k$.

∎
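The linear-algebra fact invoked in the proof—a positive semidefinite matrix with a zero diagonal entry has a zero row and column—follows from the $2 \times 2$ principal minors, and can be sanity-checked numerically (the matrices below are illustrative):

```python
# Sketch: if M is positive semidefinite and M[k][k] = 0, each 2x2 principal
# minor on {k, j} equals M[k][k] * M[j][j] - M[k][j]^2 = -M[k][j]^2, which is
# nonnegative only if M[k][j] = 0. So perturbing the k-th row away from zero
# immediately breaks positive semidefiniteness.

def psd_minor_check(M, tol=1e-12):
    # Necessary condition for PSD: all 1x1 and 2x2 principal minors nonnegative.
    n = len(M)
    for i in range(n):
        if M[i][i] < -tol:
            return False
        for j in range(i + 1, n):
            if M[i][i] * M[j][j] - M[i][j] * M[j][i] < -tol:
                return False
    return True

M = [[0.0, 0.0, 0.0],   # zero diagonal entry at k = 0, with zero row/column
     [0.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
print(psd_minor_check(M))  # True

M[0][1] = M[1][0] = 0.5    # nonzero off-diagonal next to the zero diagonal
print(psd_minor_check(M))  # False: the {0, 1} principal minor is -0.25
```

In the proof, this is exactly why a vanishing $k$-th second derivative of a convex $f$ forces the mixed partials $\partial^2 f / \partial x_k \partial x_j$ to vanish as well.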

Theorem 3.1 plays an important role in our finite sample analysis, where we show that the additive approximation is variable selection consistent (or “sparsistent”), even when the true function is not additive.

###### Remark 3.1.

We assume twice differentiability in Theorem 3.1 to simplify the proof. We expect, however, that this smoothness condition is not necessary—every convex function can be approximated arbitrarily well by a smooth convex function.

###### Remark 3.2 .

We have not found natural conditions under which the opposite direction of additive faithfulness holds—conditions implying that if $f$ does not depend on coordinate $k$, then $f_k^*$ will be zero in the additive approximation. Suppose, for example, that $f$ is only a function of $x_1$, and that $X$ follows a degenerate 3-dimensional distribution in which $X_2 = X_1$. In this case a component in $x_2$ can exactly capture the additive approximation error, so the best additive approximation of $f$ may have a nonzero component $f_2^*$ even though $f$ does not depend on $x_2$.

###### Remark 3.3 .

In Theorem 3.1, we do not assume a parametric form for the additive components. The additive approximation may not be faithful if we use parametric components. For example, suppose we approximate a convex function $f$ by a linear form $x^\top \beta$. When the $X_k$'s are independent, the optimal linear coefficients in the population setting satisfy $\beta_k \propto \operatorname{Cov}(X_k, f(X))$. If the $X_k$'s follow symmetric distributions and $f$ is even—for instance $f(x) = \sum_k x_k^2$—then $\operatorname{Cov}(X_k, f(X)) = 0$ and hence $\beta_k = 0$ for every $k$, even though $f$ depends on every coordinate.
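This failure of a parametric (linear) approximation is easy to verify numerically. The sketch below uses the illustrative even convex function $f(x) = x_1^2 + x_2^2$ with independent symmetric covariates, and estimates the covariance that determines the population linear coefficient:

```python
import random

# Sketch: with independent, symmetrically distributed covariates and the
# illustrative even convex function f(x) = x1^2 + x2^2, the best linear
# approximation has beta_k proportional to Cov(X_k, f(X)) = 0, so a linear
# fit zeroes out every variable even though f depends on both.

random.seed(1)
n = 100_000
xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
fvals = [x1 ** 2 + x2 ** 2 for x1, x2 in xs]
fbar = sum(fvals) / n

cov1 = sum(x[0] * (fv - fbar) for x, fv in zip(xs, fvals)) / n
print(abs(cov1) < 0.01)  # True: the population covariance is exactly zero
```

The population value is exactly zero because $\mathbb{E}[X_1 f(X)] = \mathbb{E}[X_1^3] + \mathbb{E}[X_1]\mathbb{E}[X_2^2] = 0$ under symmetry and independence; the Monte Carlo estimate sits well within the tolerance.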

###### Remark 3.4 .

It is possible to obtain a similar result for distributions with unbounded support, by replacing the boundary flatness condition with a limit condition requiring $\partial p(x_{-k} \mid x_k)/\partial x_k$ and $\partial^2 p(x_{-k} \mid x_k)/\partial x_k^2$ to vanish as $x_k \to \pm\infty$. Such a limit condition, however, is not obeyed by many common distributions such as the multivariate Gaussian distribution. The next example shows that certain convex functions are not additively faithful under certain multivariate Gaussian distributions.

###### Example 3.2 .

Consider a two dimensional quadratic function $f(x) = x^\top H x$, where $H$ is positive definite, and a Gaussian distribution $X \sim N(0, \Sigma)$ with correlated coordinates. As we show in Section 9 of the Appendix, the additive approximation has a closed form: each component $f_k^*$ is a quadratic function of $x_k$, centered so that $f_1^*$ and $f_2^*$ both have mean zero, with a leading coefficient that depends on both $H$ and $\Sigma$. For suitable choices of $H$ and the correlation, the leading coefficient of $f_1^*$ is zero, so that $f_1^* = 0$ and additive faithfulness is violated; for other choices, the leading coefficient is negative and $f_1^*$ is a concave function. We take a setting in which the leading coefficient of $f_1^*$ is zero, compute the optimal additive functions via numerical simulation, and show the results in Figure 2(a); $f_1^*$ is zero as expected.

Although the Gaussian distribution does not satisfy the boundary flatness condition, it is possible to approximate the Gaussian distribution arbitrarily well with distributions that do satisfy the boundary flatness conditions.

###### Example 3.3 .

Let $f$ be as in Example 3.2, with $H$ and $\Sigma$ chosen so that $f_1^* = 0$ under the Gaussian distribution. Consider a mixture $(1-\epsilon)\,P_{\mathrm{tr}} + \epsilon\, U$, where $P_{\mathrm{tr}}$ is the density of a *truncated* bivariate Gaussian bounded in a square and $U$ is the uniform distribution over a square. The uniform distribution is supported over a slightly larger square to satisfy the boundary flatness conditions.

When the truncation square is large and the mixture weight $\epsilon$ is small, the mixture closely approximates the Gaussian distribution but is still additively faithful for convex functions. Figure 2(b) shows the optimal additive components under the mixture distribution, computed by numerical integration. True to our theory, $f_1^*$, which is zero under the Gaussian distribution, is nonzero under the mixture approximation to the Gaussian distribution. We note that the magnitude of $f_1^*$, although non-zero, is very small, consistent with the fact that the mixture distribution closely approximates the Gaussian distribution.

### 3.2 Convex Additive Models

Although convex functions are additively faithful—under appropriate conditions—it is difficult to estimate the optimal additive functions $f_k^*$ as defined in equation (3.10). The reason is that $f_k^*$ need not be a convex function, as Example 3.2 shows. It may be possible to estimate $f_k^*$ via smoothing, but we prefer an approach that is free of smoothing parameters. Since the true regression function is convex, we approximate the additive model with a *convex* additive model. We abuse notation and, for the rest of the paper, use the notation $f_k^*$ to represent convex additive fits:

(3.18)  $\displaystyle (f_1^*, \dots, f_p^*, \mu^*) = \operatorname*{arg\,min}_{\mu,\; f_k \in \mathcal{C}^1} \; \mathbb{E}\Big( f(X) - \mu - \sum_{k=1}^p f_k(X_k) \Big)^2,$

where $\mathcal{C}^1$ is the set of univariate convex functions. The convex fits $f_k^*$ are not additively faithful by themselves, i.e., it could be that the true function $f$ depends on variable $x_k$ but $f_k^* = 0$. However, faithfulness can be restored by coupling the $f_k^*$'s with a set of univariate concave fits on the *residual* $f(X) - \mu^* - \sum_{k' \neq k} f_{k'}^*(X_{k'})$:

(3.19)  $\displaystyle g_k^* = \operatorname*{arg\,min}_{g_k \in -\mathcal{C}^1} \; \mathbb{E}\Big( f(X) - \mu^* - \sum_{k' \neq k} f_{k'}^*(X_{k'}) - g_k(X_k) \Big)^2,$ where $-\mathcal{C}^1$ denotes the set of univariate concave functions.

###### Theorem 3.2.

Suppose $p(x)$ is a positive density on $[0,1]^p$ that satisfies the boundary flatness condition. Suppose that $f$ is convex and twice differentiable, and that $\partial f/\partial x_k$, $\partial p(x_{-k} \mid x_k)/\partial x_k$, and $\partial^2 p(x_{-k} \mid x_k)/\partial x_k^2$ are all continuous as functions of $x$. Let $f_k^*$ and $g_k^*$ be as defined in equations (3.18) and (3.19); then the $f_k^*$'s and the $g_k^*$'s are unique. Furthermore, $f_k^* = 0$ and $g_k^* = 0$ together imply that $\partial f/\partial x_k = 0$, that is, $f$ does not depend on $x_k$.

Before we can prove the theorem, we need a lemma that generalizes Theorem 3.1.

###### Lemma 3.2.

Suppose $p(x)$ is a positive density on $[0,1]^p$ satisfying the boundary flatness condition. Let $f$ be a convex, twice differentiable function on $[0,1]^p$, and let $h$ be any function that does not depend on $x_k$. Then the unconstrained univariate minimizer

(3.20)  $\displaystyle f_k^* = \operatorname*{arg\,min}_{f_k} \; \mathbb{E}\big( f(X) - h(X_{-k}) - f_k(X_k) \big)^2$

is given by $f_k^*(x_k) = \mathbb{E}\big[\, f(X) - h(X_{-k}) \mid x_k \,\big]$, and $f_k^* = 0$ implies that $\partial f/\partial x_k = 0$.
