# Bayesian nonparametric multivariate convex regression

In many applications, such as economics, operations research and reinforcement learning, one often needs to estimate a multivariate regression function f subject to a convexity constraint. For example, in sequential decision processes the value of a state under optimal subsequent decisions may be known to be convex or concave. We propose a new Bayesian nonparametric multivariate approach based on characterizing the unknown regression function as the max of a random collection of unknown hyperplanes. This specification induces a prior with large support in a Kullback-Leibler sense on the space of convex functions, while also leading to strong posterior consistency. Although we assume that f is defined over R^p, we show that this model has a convergence rate of log(n)^-1 n^-1/(d+2) under the empirical L2 norm when f actually maps a d dimensional linear subspace to R. We design an efficient reversible jump MCMC algorithm for posterior computation and demonstrate the methods through application to value function approximation.

• 2 publications
• 85 publications
06/01/2022

### Asymptotic Properties for Bayesian Neural Network in Besov Space

Neural networks have shown great predictive power when dealing with vari...
03/30/2019

### Strong Consistency of Nonparametric Bayesian Inferential Methods for Multivariate Max-Stable Distributions

Predicting extreme events is important in many applications in risk anal...
12/07/2017

### Remarks on Bayesian Control Charts

There is a considerable amount of ongoing research on the use of Bayesia...
09/27/2021

### pyStoNED: A Python Package for Convex Regression and Frontier Estimation

Shape-constrained nonparametric regression is a growing area in economet...
05/23/2020

### Multivariate Convex Regression at Scale

We present new large-scale algorithms for fitting a multivariate convex ...
02/16/2018

### Nonparametric Bayesian estimation of multivariate Hawkes processes

This paper studies nonparametric estimation of parameters of multivariat...
06/17/2020

### Functional Group Bridge for Simultaneous Regression and Support Estimation

There is a wide variety of applications where the unknown nonparametric ...

## 1 Introduction

Consider the problem of estimating the function for the model

 y=f(x)+ϵ,

where , , is a mean regression function and Given the observations , we would like to estimate subject to the convexity constraint,

 f(x1)≥f(x2)+∇f(x1)T(x1−x2), (1)

for every , where is the gradient of at . This is called the convex regression problem. Convex regression can easily be modified to allow concave regression by multiplying all of the values by negative one.

Convex regression problems are common in economics, operations research and reinforcement learning. In economics, production functions (Skiba, 1978) and consumer preferences (Meyer & Pratt, 1968) are often convex, while in operations research and reinforcement learning, value functions for stochastic optimization problems can be convex (Shapiro et al., 2009). If a problem is known to be convex, a convex regression estimate provides advantages over an unrestricted estimate. First, convexity is a powerful regularizer: it places strong conditions on the derivatives—and hence smoothness—of a function. Convexity constraints can substantially reduce overfitting and lead to more accurate predictions. Second, maintaining convexity allows the use of convex optimization solvers when the regression estimate is used in an objective function of an optimization problem.

Multivariate convex regression has received relatively little attention in the literature. The oldest method is the least squares estimator (LSE) (Hildreth, 1954; Dykstra, 1983; Boyd & Vandenberghe, 2004; Seijo & Sen, 2011),

 min^y1:n,g1:n n∑i=1(yi−^yi)2 (2) subject to ^yj≥^yi+gTi(xj−xi),   i,j=1,…,n.

The resulting function is piecewise linear, generated by taking the maximum over the supporting hyperplanes, . However, Equation (2) has constraints, making solution infeasible for more than a few thousand observations. Recently, there has been interest in multivariate convex regression beyond the LSE. Henderson & Parmeter (2009) proposed a method that generates a regression estimator via a weighted kernel estimate subject to conditions on the Hessian of the estimator; solutions are found using sequential quadratic programming. Convexity is guaranteed only at points where the Hessian condition is enforced and the method does not scale well to high dimensions or large datasets. Hannah & Dunson (2011) proposed a method, Convex Adaptive Partitioning (CAP), that adaptively splits the dataset and fits linear estimates within each of the subsets. Like the least squares estimator, the CAP estimator is formed by taking the maximum over hyperplanes; unlike previous methods, it produces a sparse estimator that scales well to large datasets and large numbers of covariates. However, it has theoretical guarantees only in the univariate case.

Piecewise planar models, like the LSE and CAP, are poor when used in the objective function of an optimization problem. The minima of piecewise planar functions occur at a vertex where hyperplanes intersect. The location of vertices is sensitive to the number of hyperplanes and the hyperplane parameters. The parameters are in turn sensitive to noise and observation design. Bayesian models could reduce these problems: prior distributions on parameters reduce design sensitivity and model averaging produces a smoother estimate.

Bayesian models have been used for convex regression, but only in the univariate case. In this setting, methods rely on the ordering implicit to the real line: a positive semi-definite Hessian translates into an increasing derivative function in one dimension. Ramgopal et al. (1993) discretized the covariate space and placed a Dirichlet prior over the normalized integral of the slope parameters between those points. Chang et al. (2007) used Bernstein polynomials as a basis by placing a prior on the number of polynomials and then sampling from a restricted set of coefficients. Shively et al. (2011) used fixed knot and free-knot splines with a prior that placed an order restriction on the coefficients for each basis function. In a single dimension, Bayesian convex regression is closely related to Bayesian isotonic regression (Lavine & Mockus, 1995; Neelon & Dunson, 2004; Shively et al., 2009). In multiple dimensions, however, convexity constraints become combinatorially difficult to enforce through projections.

We take an entirely different approach to modeling convex functions. Instead of creating an estimator based on a set of restricted parameters or projecting an unconstrained estimate back into the space of convex functions, we place a prior over a smaller set of functions that are guaranteed to be convex: piecewise planar functions. The number of hyperplanes and their parameters are random; we define the function to be the maximum over the set of hyperplanes. We efficiently sample from the posterior distribution with reversible jump Markov chain Monte Carlo (RJMCMC). We call this approach Multivariate Bayesian Convex Regression (MBCR). Although the set of piecewise planar functions does not include all convex functions, it is dense over that space and we show strong (

) consistency for MBCR. If for some matrix and function , we show convergence rates for MBCR with respect to the norm to be . The dimension of the linear subspace, , determines the convergence rate, not the dimension of the full space, .

In numerical experiments, we show that MBCR produces estimates that are competitive with LSE and CAP in terms of traditional metrics, like mean squared error, and can outperform them in objective function approximation. Through examples on toy problems, we show that MBCR has the potential to produce regression estimates that are much better suited to objective function approximation than piecewise planar methods.

## 2 Multivariate Bayesian Convex Regression

Convexity is defined by the set of supporting hyperplane constraints in Equation (1): any supporting hyperplane of the function at is less than or equal to at any other point . This is equivalent to having a positive semi-definite Hessian. In multiple dimensions, it is difficult to project onto the set of functions that satisfy these constraints. Instead of placing a prior over an unconstrained set of functions and then restricting the parameters to meet convexity conditions, we place a prior over a smaller set of functions that automatically meet the conditions. Specifically, for all in a compact set we place a prior over all functions that are the maximum over a set of hyperplanes, ,

 f(x)=maxk∈{1,…,K}αk+βTkx, (3)

where is unknown. This set of functions can approximate any convex function arbitrarily well while maintaining straightforward inference.

Assuming follows Equation (3), we let

 Yi=f(xi;θ)+ϵi,ϵi∼N(0,σ2), (4)

where the unknown parameters are

 θ={K,α=(α1,…,αK)T,β=(β1,…,βK)T,σ2}.

The prior over is factored as,

 Π(K,α,β,σ2)=Πσ(σ2)ΠK(K)K∏k=1Πθ(αk,βk).

The prior for the variance parameter,

, is defined as , and the prior for the number of hyperplanes, , is The hyperplane parameters, are given the prior These yield the model,

 K ∼ΠK, σ2 ∼Πσ, θk|K ∼Πθ, k=1,…,K.

MBCR is similar to Bayesian adaptive regression spline (BARS) models (Denison et al., 1998; DiMatteo et al., 2001; Shively et al., 2009, 2011)

in that the method places a prior over a finite set of locally parametric models, with the prior accommodating uncertainty in the number of models, their locations and their parameters. Indeed, we use the same inference method: reversible jump Markov chain Monte Carlo (RJMCMC). In both cases, RJMCMC works by adaptively adding and removing local models while updating the model-specific parameters. However, while BARS explicitly introduces random changepoints or knots within a region, in MBCR regions are implicitly defined as corresponding to locations across which a particular hyperplane dominates. Let

be a partition of where

 Ak={x∈X:k=argmaxj∈{1,…,K}αj+βTjx}.

As in the local knot search of DiMatteo et al. (2001), we use these regions to produce an efficient proposal distribution for the RJMCMC. We discuss implementation details for MBCR in Section 4, but first we show consistency and rate of convergence for MBCR in Section 3.

## 3 Theoretical Results

Posterior consistency occurs if the posterior assigns probability converging to one in arbitrarily small neighborhoods of the true function

as the number of samples grows. The rate of convergence is the rate at which the neighborhood size can contract with respect to while still maintaining consistency. Despite the longstanding interest in shape-restricted estimators, relatively little work has explored their asymptotic properties—particularly in multivariate and Bayesian settings. In the frequentist framework, Hanson & Pledger (1976) showed consistency of the univariate LSE for convex regression; Groeneboom et al. (2001) showed it has a local convergence rate of . More recently, Seijo & Sen (2011) showed consistency for the multivariate LSE.

There is also a recent literature on the related topic of multivariate convex-transformed density estimation. Cule et al. (2010) showed consistency for the MLE log-concave density estimator; Seregin & Wellner (2010) showed consistency for the MLE of convex-transformed density estimators and gave a lower minimax bound on the convergence rate of . Bayesian shape-restricted asymptotics have received even less attention. Shively et al. (2009) showed consistency for monotone regression estimation with free knot splines in the univariate case; this was extended to univariate convex regression estimation by Shively et al. (2011).

Let be the set of parameters to be estimated. Let be the prior induced on by

 K−1 ∼Poisson(λ), σ2 ∼Πσ, θk|K ∼Πθ, k=1,…,K, (5)

where is defined in Assumption B2 and in Assumptions B3 and B4. We consider strong, or , consistency. That is, let

 Lϵ={(f,σ):∫X|f(x)−f0(x)|dx<ϵ, ∣∣∣σσ0−1∣∣∣<ϵ},

where the data-generating model is

 Yi=f0(xi)+ϵi,ϵi∼N(0,σ0)2.

We would like as , almost surely , where is the product measure under the true distribution. Throughout the rest of this paper, we use lower case and to denote known or observed quantities, while and

denote random variables. We show that MBCR is strongly consistent under a general set of conditions.

Bayesian rates of convergence are slightly different from their frequentist counterparts. A series where is a rate of convergence under a metric if

 P∞f0,σ0Π(θ∈Θ:d(θ,θ0)≥Hnϵn|(Xi,Yi)ni=1)→0

for every . We examine convergence rates with respect to the empirical norm. Moreover, if actually maps a -dimensional linear subspace of to , then the convergence rate is determined by the dimensionality of the subspace, , rather than the full dimensionality, .

### 3.1 Consistency

We consider two design cases for consistency: fixed design and random design. We place a series of assumptions on the true function, the prior and the design. Some of the assumptions on the prior are specific to the design type. In both cases, we assume that is uniformly bounded:

1. The function is uniformly bounded on the compact set .

Without loss of generality, we assume that .

For both design types, we need define the prior and in Equation (5). First, we assume that the prior on

has compact support bounded away from zero. This is not a restrictive assumption in practice since zero measurement error is unlikely to occur and an upper bound can be easily chosen to cover a wide range of plausible values. Second, in the case of fixed design, we assume compact support of the prior for the hyperplane parameters; again, a wide range of plausible values can be chosen. Truncated normal and inverse-gamma distributions provide a convenient choice.

1. Let be the prior on ; is non-atomic and only has support over with

2. Let be the prior on , where is the

dimensional Gaussian distribution.

3. Let . Let be a constant such that and for some , let

 Ω={(α,β):max{α,β1,…,βp}≤V}.

Set and let .

For both design cases, we need to ensure that the covariate space is sufficiently well-sampled.

1. For each hypercube in , let be the Lebesgue measure. Suppose that there exists a constant with such that whenever , contains at least one design point for sufficiently large .

2. Let be the density of the random design points; is non-atomic and for every .

With these assumptions, we now give consistency results.

###### Theorem 3.1.

Assume that is compact, the covariate design is fixed and that is convex with continuous first order partial derivatives. Suppose that conditions B1, B2, B4 and B5 hold. Then for every ,

 P∞f0,σ0Π(LCϵ|Y1,…,Yn,x1,…,xn)→0.

In the stochastic design case, assumptions B4 and B5 are replaced by B3 and B6, respectively. We note that for random design, convergence follows directly from convergence in probability for a uniformly bounded function .

###### Theorem 3.2.

Assume that is compact, the covariate design is random and that is convex with continuous first order partial derivatives. Suppose that conditions B1-B3 and B6 hold. Then for every ,

 P∞f0,σ0Π(LCϵ|Y1,…,Yn,X1,…,Xn)→0.

To prove Theorems 3.1 and 3.2, we use the consistency results for Bayesian nonparametric regression of Choi & Schervish (2007). We show that the prior satisfies the following assumptions,

1. The prior puts positive measure on the neighborhood

 Bδ={(f,σ):||f−f0||∞<δ,∣∣σσ0−1∣∣<δ}

for every .

2. Set , where

 Θ1n={f:||f||∞

and with . Then there exists such that .

Choi & Schervish (2007) modifies the consistency theorem of Schwartz (1965) for non-i.i.d. observations; the requirements are prior positivity on a set of variationally close Kullback-Leibler neighborhoods and the existence of exponentially consistent tests separating the desired posterior region from the rest. Assumption A1 satisfies the prior positivity while assumption A2 constructs a sieve that is used to create the exponentially consistent tests. Assumptions A1 and A2 generate pointwise convergence in the empirical and in--probability metrics for the fixed and random design cases, respectively. Assumptions B1 to B6 are then used to extend consistency under these metrics to consistency under the metric. See Appendix A for details.

### 3.2 Rate of Convergence

We determine the rate of convergence of MBCR with respect to the empirical norm,

 ||f||n=(1nn∑i=1f(xi)2)1/2.

For both the fixed design and random design cases, we make the following assumptions:

1. The model variance is known.

2. There exists a convex function and a matrix with rank where such that .

We make assumption B7 for convenience; it can be loosened with sufficient algebraic footwork. Assumption B8 says that actually lives on a dimensional subspace; this is not restrictive as it is possible for . However, many situations arise when . For example, there may be extraneous covariates or the mean function may be a function of a linear combination of the covariates—in effect, . It is not required that is known a priori, but simply that it exists. We also keep assumption B4

, which truncates the tails of the Gaussian priors for the hyperplane slopes and intercepts; this is done to bound the prior probability of the compliment of the sieve.

###### Theorem 3.3.

Assume that is compact and that is convex, has continuous first order partial derivatives and suppose that conditions B1, B4, B7 and B8 hold. For both random covariates and fixed covariates and sufficiently large ,

 P∞f0Π(f:||f−f0||n≥Hnϵn|Y1,…,Yn,x1,…,xn)→0

for any , where .

Theorem 3.3 is proven by showing that the conditions for Theorem 3 of Ghosal & van der Vaart (2007) are satisfied. Details are given in Appendix B.

We note that the rates achieved in Theorem 3.3 are within a log term of global minimax rates for general nonparametric convergence, , assuming . However, the -metric entropy of the set of bounded convex functions with respect to the metric scales like  (van der Vaart & Wellner, 1996), leaving open the possibility of convergence rates of for bounded convex functions. In certain settings of convex-transformed density estimation that rate has been obtained (Seregin & Wellner, 2010). We, however, do not believe that MBCR achieves this rate in a general setting.

## 4 Implementation

In this section, we extend MBCR to a model that can accommodate heteroscedastic data and provide a reversible jump MCMC sampler.

### 4.1 Heteroscedastic Model

The model in Section 2 assumes a global variance parameter,

. While this is often a reasonable assumption, it can lead to particularly poor results when it is violated in a shape-restricted setting: locally chasing outliers in high-variance regions can lead to globally poor prediction due to the highly constrained nature of convex regression. To accommodate heteroscedasticity, we consider the following model,

 Yi=f(xi;θ)+ϵi,ϵi∼N(0,g(xi)),

where . Specifically, to induce a flexible prior on , we introduce a separate variance term for each hyperplane and modify the model to let

 Yi =maxk∈{1,…,K}αk+βTkxi+ϵi,ϵi∼N(0,σ2k), (θk,σ2k) ∼Np+1IG(μα,β,Vα,β,a,b),k=1,…,K, K−1 ∼Poisson(λ).

Here denotes the normal inverse gamma distribution with a dimensional normal. We choose a Poisson prior for the number of components, although we note that the model is generally not sensitive to the prior on the number of components. Due to the adaptable nature of the heteroscedastic model and its resistance to variance misspecification, we use it for all numerical work.

### 4.2 Posterior Inference

To sample from the posterior distribution, we use RJMCMC with the marginal posterior distribution of as the stationary distribution. Similar methods have been used for posterior inference on free-knot spline models by Denison et al. (1998) and DiMatteo et al. (2001).

RJMCMC works by proposing a candidate model, , and determining whether or not to move to that new model based on a Metropolis-Hastings type acceptance probability,

 a(K∗,α∗,β∗,σ2∗ |K,α,β,σ2)=min{1,p(Y|x,K∗,α∗,β∗,σ2∗)p(Y|x,K,α,β,σ2) (6) ×Π(K∗,α∗,β∗,σ2∗)Π(K,α,β,σ2)q(K,α,β,σ2|K∗,α∗,β∗,σ2∗)q(K∗,α∗,β∗,σ2∗|K,α,β,σ2)}.

Here is the likelihood ratio of the data conditioned on the models, is the prior ratio of the models and

 q(K,α,β,σ2|K∗,α∗,β∗,σ2∗)/q(K∗,α∗,β∗,σ2∗|K,α,β,σ2)

is an asymmetry correction for the proposal distribution. Candidate models are entirely new models: all parameters are updated as a block. If only individual parameters or hyperplanes are updated, acceptance rates for parameters in the most constrained areas are orders of magnitude lower than those in the relatively unconstrained regions on the boundary of the function. Without block updates, there is poor mixing. There are three types of candidate models: hyperplane relocations, deletions and additions. All candidate models are generated from proposal distributions, which significantly impact the efficiency of the RJMCMC algorithm.

To generate proposal distributions we use the covariate partition induced by the current model to create a set of basis regions. Basis regions are determined by partitioning the set of training data. For example, suppose that a partition of the observations, , has subsets. Let , where We can use to produce a set of basis regions for generating with components,

 V∗k =(~V−1α,β+xT[k]x[k])−1, (7) μ∗k =V∗k(~V−1α,β~μα,β+xT[k]y[k]), a∗k =~a+nk2, b∗k =~b+12(~μTα,β~V−1α,β~μα,β+yT[k]y[k]−μ∗kTV∗k−1μ∗k), (α∗k,β∗k,σ2k∗) ∼Np+1IG(μ∗k,V∗k,a∗k,b∗k),k=1,…,K.

Here, , and is the number of elements in subset

. The hyperparameters for the proposal distributions,

, are not necessarily the same as those for the prior. Often the variance parameters are smaller to produce higher acceptance rates. The current set of hyperplanes, , are used to create the partitions that define the basis regions,

 Ck={i:k=argmaxj∈{1,…,K}αj+βTjxi}.

For a relocation proposal distribution, the basis regions are generated by the covariate partition of the current model. The removal proposal distribution is a mixture with components. Each component is generated by removing the hyperplane for and using the remaining hyperplanes to create a set of basis regions. Proposal distributions for additions are less straightforward. The addition proposal distribution is a mixture with components. Beginning with the subsets defined by the current model, , each subset is searched along a random direction for . On each of those random directions, the subset is divided according to a knot into the set of observations less than in direction and those greater. This is done for knots for each subset and direction . An example is shown in Figure 1. Full implementation details are given in Appendix C.

We note that the sampler for MBCR does not behave like a typical MCMC sampler. Convergence and mixing are extremely fast. Unlike most MCMC samplers, the MBCR sampler converges once the “right” number of components has been reached, typically within zero to four of the mean number of components. This is due to the way the proposal distributions are constructed and the strict requirements of convexity. Block updating ensures that autocorrelation drops to near zero rapidly. Numerical results suggest this generally happens after about three samples. While convexity endows the sampler with properties like fast convergence, it can also lead to situations where the restrictions are too rigid for the sampler to function. For example, if the noise level is very low, the number of observations is more than a few thousand, or the number of dimensions is moderate to high, the region of admissible models becomes very small and the acceptance rates rapidly drop to zero. Approximate inference methods seem to be required in these situations.

## 5 Applications

In Section 5.1, we compare the performance of MBCR to other regression methods on a set of synthetic problems. We show that convexity constraints can produce better estimates than their unconstrained counterparts and that MBCR is competitive with state of the art convex regression methods with respect to mean squared error. In Section 5.2, we analyze the behavior of MBCR, CAP and LSE when approximating an objective function for convex optimization. We show that MBCR produces estimates that are more suited to objective function approximation than those produced by CAP or LSE.

### 5.1 Synthetic Problems

In this subsection, we create a set of synthetic problems designed to show off the strength of convexity constraints. Problem 1 is highly non-linear and has moderate dimensionality (5); Problems 2 and 3 also have moderate dimensions in the covariate space (6 and 4, respectively), but both actually reside in a univariate subspace.

##### Problem 1.

Let . Set

 y=(x1+.5x2+x3)2−x4+.25x25+ϵ,

where . The covariates are drawn from a 5 dimensional standard Gaussian distribution, .

##### Problem 2.

Let . Set

 y=(x1+x2)2+ϵ,

where

. The covariates are drawn from a 6 dimensional uniform distribution,

for .

##### Problem 3.

Let . Set

 y =∣∣aTx∣∣+ϵ, aT =[0.8262,0.9305,1.6361,0.6072]

where . The covariates are drawn from a 4 dimensional uniform distribution, for .

#### 5.1.1 Results.

On all of these problems, MBCR is compared to CAP, LSE and Gaussian Process priors (Rasmussen & Williams, 2006). Gaussian process priors are a Bayesian method that is widely and successfully used in regression and classification settings; we use the Matlab gpml package for implementation. All methods were implemented in Matlab; the least squares estimate (LSE) was found using the cvx optimization package. LSE took 5 to 6 minutes to run with 500 observations and 50 to 60 minutes to run with 1,000. The tolerance parameter for CAP was chosen through five-fold cross-validation. MBCR was implemented with component-specific variances. It was run for 1,000 iterations with the first 500 discarded as a burn-in.

Due to the highly constrained nature of the model and block updating, convergence of the sampler was extremely fast in all settings. The model was generally insensitive to the hyperparameter for the number of hyperplanes, ; it was varied over three orders of magnitude and set to 20 for all tests. In lower, dimensions, however, choice of was more important. Likewise, variance hyperparameters were tested over three orders of magnitude with little sensitivity. Distributions were not placed over the variance hyperparameters because of the delicate relationship between the proposal distributions and the hyperparameters. All mean hyperparameters were set to 0.

MBCR and CAP dramatically outperformed other methods on all of the problems. LSE had relatively poor performance although it includes convexity constraints. This is due to overfitting, particularly in boundary regions. MBCR and CAP performed comparably on Problems 2 and 3, which both reside in a univariate subspace of the general covariate space. However, MBCR outperformed CAP on the more complex Problem 1, particularly when there were few observations available.

### 5.2 Objective Function Approximation for Stochastic Optimization

Stochastic optimization methods are used to solve optimization problems with uncertain outcomes. The traditional objective is to minimize expected loss. There are many problems in this class, ranging from stochastic search (Spall, 2003) to sequential decision problems (Sutton & Barto, 1998; Powell, 2007). In this section, we study the use of convex regression to compute response surfaces. A response surface is an approximation of an objective function based on a collection of noisy samples. Once a response surface has been created, it is searched to estimate the minimizer or maximizer of a function. Convex representations are desirable. First, the resulting approximation will likely be closer to the true objective function than an unconstrained approximation. Second, and more importantly, the surrogate objective function is now convex as well and can be easily searched with a commercial solver.

Consider the following problem. We would like to minimize an unknown function with respect to given noisy observations, , where ,

 minx∈XE{f(x)|(xi,yi)ni=1}. (8)

To solve Equation (8), we approximate with three different methods for regression: least squares, CAP and MBCR. Let be the estimate of the mean function given . Unlike CAP and LSE, MBCR is a Bayesian method; it places a distribution over functions rather than producing a single function estimate. Let be a sample from the posterior; the Bayes estimate of the mean function can be approximated by the average of samples from the posterior,

 ^fn(x)≈1MM∑m=1^f(m)n(x).

We demonstrate the empirical differences between the objective functions produced by MBCR, CAP and LSE by solving a small stochastic optimization problem.

##### Example.

Set

 Yi=xiQxTi+ϵi,Q=[10.20.21],ϵi∼N(0,0.1). (9)

The constraint set is for . Observations were sampled randomly from a uniform distribution, We used LSE, CAP and MBCR to approximate the objective function. To examine the stability of these methods for objective function approximation, we sampled 100 observations 50 times for Equation (9). Approximations of the objective functions for one sample are shown in Figure 2.

#### 5.2.1 Results

We compared MBCR, CAP and LSE across 50 samples of 100 observations. The minima of piecewise planar models, like CAP and LSE, are on one of the vertices (or occasionally along one of the edges); this makes the minima of such models highly sensitive to model parameters such as number of hyperplanes and the value of their coefficients. MBCR, however, places a distribution over piecewise planar models. The Bayes estimate averages those models to produce something that is close to smooth and hence is relatively robust to observation design. Figure 2 highlights these differences. The minima of both piecewise planar methods were sensitive to the observation design while the minima of MBCR proved more robust. Locations of minima are shown in Figure 3.

#### 5.2.2 Discussion

Many methods for solving stochastic optimization problems, including response surface methods (Barton & Meckesheimer, 2006; Lim, 2010), Q-learning (Ernst et al., 2005) and approximate dynamic programming (Powell, 2007), involve functional approximations that are then searched to find a solution that minimizes or maximizes the approximate reward. Current solution methods for these problems use either unconstrained regression methods (Lagoudakis & Parr, 2003; Ernst et al., 2005) or additive approximations with univariate convex functions (Powell et al., 2004; Nascimento & Powell, 2009). Robust multivariate convex regression methods could allow efficient solution of a broad set of stochastic optimization problems, inlcuding resource allocation, portfolio optimization and inventory management.

## 6 Conclusions and Future Work

In this article, we introduced a novel fully Bayesian, nonparametric model for multivariate convex regression and showed strong posterior consistency along with convergence rates. We presented an efficient RJMCMC sampler for posterior inference. Our model was used to approximate objective functions for stochastic optimization and showed improvement over existing frequentist methods.

While this work represents a large advancement for convex regression, much remains to be done. First, we need to develop sampling methods that scale to large problems. Second, MBCR needs to be tested on a variety of stochastic optimization problems. Third, MBCR can be combined with other Bayesian methods to produce a class of semi-convex estimators.

Currently, the RJMCMC sampling method only scales well to moderate dimensionality and problem size: its limits are about 8 to 10 dimensions and a few thousand observations. Approximate inference methods, such as variational Bayes, could allow MBCR to solve problems an order of magnitude larger. Implementation, however, is not a straightforward extension of existing methods.

In stochastic optimization, MBCR is an extremely promising tool for value function approximation. Many solution methods for sequential decision problems include value function approximation, such as point-based value iteration (Pineau et al., 2003), fitted Q-iteration (Ernst et al., 2005), approximate dynamic programming (Powell, 2007). All of these methods involve iterative searches of an approximate value function over sets of feasible actions where. In many problems, such as resource allocation, the value function is known to be convex. Robust multivariate convex regression methods would allow a wider variety of problems to be solved, including those with large action spaces and non-separable objective functions.

Perhaps the most intriguing feature of MBCR is that it is a Bayesian model and can easily be combined with other Bayesian models to produce estimators that are convex in some dimensions, but not all. For example, it is well known that consumer preferences for bundled products tend to be convex. However, it is likely that other covariates like consumer age, gender, income and education influence the preference function—and the function is not convex in these covariates. This set of functions could be well-modeled by a combination of MBCR and Bayesian mixture models like Dirichlet processes (Ferguson, 1973; Antoniak, 1974) or hierarchical Dirichlet processes (Teh et al., 2006). Such flexible models would be of great value to an assortment of fields, including economics, operations research and reinforcement learning.

## Acknowledgements

This research was partially supported by grant R01ES17240 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH). Lauren A. Hannah is partially supported by the Duke Provost’s Postdoctoral Fellowship.

## Appendix A

Appendix A contains the proofs for Section 3.1.

To show pointwise convergence, we use Theorems 1 to 3 of Choi & Schervish (2007); they are condensed for this paper. For the fixed design case, let be the empirical density of the design points, The empirical density is used to define the following neighborhood,

 Wϵ,n={(f,σ):∫|f(x)−f0(x)|dQn(x)<ϵ,∣∣∣σσ0−1∣∣∣<ϵ}.
###### Theorem 6.1.

(Choi & Schervish, 2007) Let denote the joint conditional distribution of given the covariates, assuming that is the true mean function and is the true variance. If assumptions A1, A2, B1 and B2 are satisfied, then for every ,

 P∞f0,σ0Π{(f,σ)∈WCϵ,n|Y1,…,Yn,x1,…,xn}→0.

For the random design case, let be the density of the random design points. Let

 Uϵ={(f,σ):inf{ϵ>0:Q({x:|f(x)−f0(x)|>ϵ})<ϵ},∣∣∣σσ0−1∣∣∣<ϵ}

be the set of neighborhoods based on the in-probability metric.

###### Theorem 6.2.

(Choi & Schervish, 2007) Let denote the joint conditional distribution of given the covariates, assuming that is the true mean function and is the true variance. If assumptions A1, A2, B1 and B2 are satisfied, then for every ,

 P∞f0,σ0Π{(f,σ)∈UCϵ|Y1,…,Yn,X1,…,Xn}→0.

We now show that the prior satisfies assumptions A1 and A2 for Theorems 6.1 and 6.2.

###### Lemma 6.3.

For every , the prior from B2 and B3 or B4 puts positive measure on the neighborhood

 Bδ={(f,σ2):||f−f0||∞<δ,∣∣∣σσ0−1∣∣∣<δ}.
###### Proof.

Fix . Break into two parts, , and Under a truncated inverse gamma prior,

 Π(Bδ(σ2))=Π((σ20(1−δ)2,σ20(1+δ)2))>0.

To show prior positivity on , we create a sufficiently fine mesh over . On each section of the mesh, we show that there exists a collection of hyperplanes that 1) do not intersect with , and 2) have an distance from of less than in that section. Since is bounded on , it is Lipschitz continuous with parameter . A mesh size parameter , which depends on , can be found to make a mesh over with the following requirements.

Number regions ; call the subsets of the covariate space defined by the regions . Because is Lipschitz continuous, an can be found such that for every region , one can find and where for every and ,

 αr+βTrx

for every .

We create a function to approximate by taking the maximum over the set of hyperplanes; using the above, we can bound the distance between and ,

 supx∈X||f0(x)−fδ(x)||∞ =supx∈X||f0(x)−maxr∈{1,…,R}αr+βTrx||∞, ≤maxr=1,…,Rsupx∈Mγr||f0(x)−αr−βTrx||∞, <δ.

To complete the proof, we note that and for .∎∎

###### Lemma 6.4.

Define the prior as in B2 and B3. There exist constants and such that .

###### Proof.

Without loss of generality, assume . Note that

 Θc1n =Θ∖{||f||∞

Taking the probability of the right hand side of Equation (10),

 Π(ΘC1n) ≤∞∑k=1ΠK(K=k)k∑j=1{Π(|αj|≥Mn2√p)+p∑ℓ=1Π(|βj,ℓ|≥Mn2√p)}, ≤2EΠ[K](p+1)∫∞c0Mn1√2πe−12x2dx, ≤C1e−c1n2α.

∎∎

Note that under the bounded prior assumption B4, for sufficiently large . Theorems 3.1 and 3.2 follow directly from Theorems 4 and 6, respectively, of Choi & Schervish (2007) and Theorems 6.1 and 6.2. In the random design case, convergence is equivalent to in-probability convergence under assumptions B1 and B6; the fixed design case requires more care. See Choi & Schervish (2007) for details.

## Appendix B

Appendix B contains the proofs for Section 3.2. Theorem 3.3 relies on verifying the conditions of Theorem 3 of Ghosal & van der Vaart (2007),

###### Theorem 6.5 (Ghosal & van der Vaart (2007)).

Let be a product measure and a semimetric and let be the space of all -tuples with positive measure under . Suppose that for a sequence such that is bounded away from zero, all sufficiently large and sets , the following conditions hold:

1. There exist tests such that and for all such that

where . Then, for every .

The distance metric, that we will use is the norm. Note that the norm is bounded by the norm; we shall do metric entropy computations with respect to the norm. The values and