# Piecewise Convex Function Estimation: Pilot Estimators

Given noisy data, function estimation is considered when the unknown function is known a priori to consist of a small number of regions where the function is either convex or concave. When the number of regions is unknown, the model selection problem is to determine the number of convexity change points. For kernel estimates in Gaussian noise, the number of false change points is evaluated as a function of the smoothing parameter. To ensure that the number of false convexity change points tends to zero, the smoothing level must be larger than is generically optimal for minimizing the mean integrated square error (MISE). A two-stage estimator is proposed and shown to achieve the optimal rate of convergence of the MISE. In the first-stage, the number and location of the change points is estimated using strong smoothing. In the second-stage, a constrained smoothing spline fit is performed with the smoothing level chosen to minimize the MISE. The imposed constraint is that a single change point occur in a region about each empirical change point from the first-stage estimate. This constraint is equivalent to the requirement that the third derivative of the second-stage estimate have a single sign in a small neighborhood about each first-stage change point. The change points from the second-stage are in a neighborhood of the first-stage change points, but need not be at the identical locations.

## Authors

• 25 publications
03/14/2018

### Signal Processing and Piecewise Convex Estimation

Many problems on signal processing reduce to nonparametric function esti...
03/11/2018

### Piecewise Convex Function Estimation and Model Selection

Given noisy data, function estimation is considered when the unknown fun...
03/11/2018

### Piecewise Convex Function Estimation: Representations, Duality and Model Selection

We consider spline estimates which preserve prescribed piecewise convex ...
12/07/2021

### Change-point regression with a smooth additive disturbance

We assume a nonparametric regression model with signals given by the sum...
03/11/2018

### Function Estimation Using Data Adaptive Kernel Estimation - How Much Smoothing?

We determine the expected error by smoothing the data locally. Then we o...
04/26/2019

### Smoothing and Interpolating Noisy GPS Data with Smoothing Splines

A comprehensive methodology is provided for smoothing noisy, irregularly...
12/01/2020

### Anisotropic local constant smoothing for change-point regression function estimation

Understanding forest fire spread in any region of Canada is critical to ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Our basic tenet is: “Most real world functions are piecewise -convex with a small number of change points of convexity.” Given measurements of the unknown function, , contaminated with random noise, we seek to estimate while preserving the geometric fidelity of the estimate, , with respect to the true function. In other words, the number and location of the change points of convexity of should approximate those of . We say that has change points of -convexity with change points if for .

The idea of constraining the function fit to preserve prescribed -convexity properties has been considered by a number of authors [6, 11, 18, 19]. The more difficult problems of determining the number and location of the -convexity breakpoints will be a focus of this article. Historical perspectives to the problem may be found in [5, 7]. “Bump hunting” dates back at least to [3]. Silverman [16] and Mammen et al. [8, 9]

formulated the problem as a sequential hypothesis testing problem. We refer to the estimation of the number of change points as the “model selection problem” because it resembles model selection in an infinite family of parametric models.

An interesting result of [8, 9, 15]

is that kernel smoothers will often produce too many inflection points/wiggles. If the amount of smoothing is chosen to minimize the mean integrated square error (MISE), then with nonvanishing asymptotic probability, the estimate will have multiple inflection points in a neighborhood of an actual one. For many applications, estimating the correct shape is more important than minimizing the MISE.

In this article, we propose a class of two-stage estimators which estimate the -change points in the first-stage and then perform a constrained regression in the second-stage. In the first stage, the function is strongly smoothed while the smoothing in the constrained second-stage is optimized for the minimal mean square error. When the change points are correctly specified, the constrained spline estimate has a smaller square error (as measured in a particular norm) than the unconstrained estimate.

Our second-stage estimate achieves the asymptotically optimal MISE convergence rate while suppressing artificial change points that can occur with the unconstrained method. Our proof does not exclude the possibility that the second-stage estimate has spurious inflection points far from the first-stage inflection points. We believe that our estimator has the same relative convergence rate as standard nonparametric methods. Thus our estimator suppresses artificial wiggles at nontrivial computational costs but no lost of MISE.

In Section 2, we show that the constrained smoothing spline estimate achieves the optimal rate of convergence for the expected square error even when the constraints are occasionally misspecified, provided that the misspecification rate is sufficiently small.

In Section 3, we evaluate the expected number of false (extra) empirical change points [9] for kernel smoothers when the errors are Gaussian. By adjusting the smoothing parameter, we can guarantee an asymptotically small probability of an error in our imposed constraints.

In Section 4, we propose two-stage estimators which estimate the number and location of the -change points in the first-stage. In the second-stage, we impose that be positive/negative in a small region about each of the empirical -change points. The main advantage of the two-stage procedure is that less smoothing is required in the second-stage than in the first-stage while preserving the geometric fidelity.

Section 5 discusses potential extensions of the method. Section 6 describes a global shape optimization that is heuristically designed to be efficient in the amount of smoothing subject to determining the number of change points consistently.

## 2 Expected error under inexact convexity constraints

We are given measurements of the unknown function, :

 yi=f(ti)+ϵi . (2.1)

The mean integrated squared error (MISE) for the estimate, , of the th derivative is defined as . We consider the MISE for constrained estimation as the number of measurements, , tends to infinity. In describing the large asymptotics, we consider a sequence of measurement problems. For each , the measurements occur at , We suppress the superscript, , on the measurement locations . We define the empirical distribution of measurements, , and let be its limiting distribution.

Assumption A Consider the sequence of estimation problems: , where the

are zero mean random variables and

. Assume that the distribution of measurement locations, converges in the sup norm: , where , is and .

We measure the convergence of a set of measurement times to the continuum limit in terms of the discrepancy of the point set:

Definition The star discrepancy of with respect to the continuous distribution is .

Equivalently, . For regularly spaced points, , while for randomly spaced points, by the Glivenko-Cantelli Theorem.

A popular linear estimator is the smoothing spline [20]: , where

 VP[f]≡λ2∫|f(m)(s)|2ds+1NN∑i=1|yi−f(ti)|2σ2 . (2.2)

We denote the standard Sobolev space of functions with square integrable derivatives by [20]. In general, the smoothing parameter will be decreased as increases: . For smoothing splines, we add the stronger requirements:

Assumption A (Cox [1]) Let Assumption A hold with and . Consider the sequence of smoothing spline minimizers of (2.2). Let the smoothing parameter, , satisfy and as .

The constraint that is very weak since the optimal value of satisfies . The assumption that is can be weakened.

For unconstrained smoothing spline estimates, the MISE has the following upper bound:

###### Theorem 2.1 (Ragozin [13], Cox [1])

Let Assumption A hold, and denote the smoothing spline estimate from (2.2) by . As ,

 E[∫|^f(j)N,λ(t)−f(j)(t)|2]≤αjλ(m−j)m∥f∥2m+βjσ2Nλ(2j+1)2m  , j≤m , (2.3)

where and are positive constants. The MISE is minimized by and

 E[∫|^f(j)N,λ(t)−f(j)(t)|2dt]=O(N−2(m−j)(2m+1)) . (2.4)

For uniformly spaced measurement points, this result is in [13], while Cox [1] generalizes the result to these more general conditions on the measurement points. Cox imposes the condition that while his proof applies to as well.

Given change points, , we define the closed convex cone:

 VK,ℓm,2[x1,…,xK]={f∈Wm,p | (−1)k−1f(ℓ)(t)≥0  for  xk−1≤t≤xk} , (2.5)

where and . When the change points are unknown, we need to consider , where .

If we know the change point locations, , a natural estimator is subject to the convex constraints that . This constrained spline estimate is the basis of our analysis. Detailed representation and duality results are in [14]. If we correctly impose the constraint that , the constrained spline fit always outperforms the nonconstrained fit as the following theorem indicates:

###### Theorem 2.2 (Utreras [18])

Let be in a closed convex cone, . Let be the unconstrained minimizer of (2.2) given ; and be the constrained minimizer, then , where

 ∥f∥2V≡λ2∫|f(m)(s)|2ds+1Nσ2N∑i=1|f(ti)|2 . (2.6)

Theorem 2.2 applies to any set of and does not use . Theorem 2.2 shows that if one is certain that is in a particular closed convex cone, the constrained estimate is always better than the unconstrained one. Unfortunately, Theorem 2.2 does not generalize to unions of convex cones, and thus does not apply to .

###### Theorem 2.3

Consider and let Assumption A hold. Consider a sequence of estimates, , which satisfy the error bound given by (2.3), and a second sequence of estimates, , which satisfy the error bound: The asymptotic error bound given by Eq. (2.3) holds for with different constants, and .

Proof. For uniform sampling, this theorem is proved in [18]

using interpolation inequalities. To generalize Utreras’s result to our sampling hypotheses, we replace his Lemma 4.3 with (

A.3) and (A.4) (with ) and substitute everywhere appears.

This generalization of Utreras’s result does not require the ratio of to to be bounded. In practice, we choose our constraints empirically and sometimes impose an incorrect constraint. We now show that occasionally imposing the wrong constraint does not degrade the asymptotic rate of convergence provided that the probability of an incorrect constraint is sufficiently small. The following theorem is a basis for our data adaptive estimators in Section 4 .

###### Theorem 2.4 (Occasional Misspecification)

Consider a two-stage estimator that with probability, , correctly chooses a closed convex cone , with , in the first-stage and then performs a constrained regression as in (2.2). Under Assumption A, the estimate, , satisfies the asymptotic bound (2.3) (with different constants, and ) provided that as , vanishes rapidly enough: and .

The proof is given in Appendix B.

## 3 Convergence of Kernel Smoothers

In this section, we examine the expected number of zeros of a kernel smoother as a function of the halfwidth parameter. The results in this section are proven in [15]. We begin by presenting convergence results for kernel estimators, , of as . We define , , . We use the notation to denote a size of relative to the main term: . We define to be the sum of the and total variation norms of .

###### Lemma 3.1 (Generalized Gasser-Müller [2, 15])

Let and consider a sequence of estimation problems satisfying Assumption A. Let be a kernel smoother estimate of the form:

 ^f(ℓ)(t)=1Nhℓ+1NN∑iyiwiF′(ti)κ(ℓ)(t−tihN) , (3.1)

where is the kernel halfwidth and the weights, satisfy . Let the kernel,

, satisfy the moment condition:

, and the boundary conditions: for . Choose the kernel halfwidths such that , and ; then , , ,   , and   uniformly in the interval, .

Lemma 3.1 applies to all of the common kernel smoother weightings [2] such as Priestley-Chao and Gasser-Müller. This result is slightly stronger than previous theorems on kernel smoothers [2]. Our hypotheses are stated in terms of the star discrepancy while previous convergence theorems [2] place restrictions on both and .

We now evaluate the expected number of false -change points for a sequence of kernel estimates of . We restrict to independent errors: . Thus, a Gaussian process. The following assumption rules out nongeneric cases:

Assumption B Let have -change points, , with , and

. Consider a sequence of estimation problems with independent, normally distributed measurement errors,

, with variance

. Let be a sequence of kernel estimates of , on the sequence of intervals, .

To neglect boundary effects, we take for kernel estimators and for splines. For each change point, , we define the change point variance [2]: . The following theorem bounds the probability of a false estimate of a change point far away from a true change point.

###### Theorem 3.2

Let Assumption B hold and consider a sequence of kernel estimators, , that satisfy the hypotheses of Lemma 3.1. Choose kernel halfwidths, , and uncertainty intervals, , such that , , . The probability, , that has a false change point outside of a width of from the actual -change points satisfies

 pN(wN)≤ K∑k=1O⎛⎝σif(xk)hNexp⎛⎝−w2N2σ2if(xk)⎞⎠ ⎞⎠ , (3.2)

where on the interval .

In [8, 9], Mammen et al. derive the number of false change points for kernel estimation of a probability density. We present the analogous result for regression function estimation.

###### Theorem 3.3 (Analog of [8, 9])

Let Assumption B hold. Consider a sequence of kernel smoother estimates which satisfy the hypotheses of Lemma 3.1 with . Let the sequence of kernel halfwidths, , satisfy and . The expected number of -change points of in the estimation region, , is asymptotically

 E[^K]−K= 2K∑k=1H⎛⎜⎝ ⎷|f(ℓ+1)(xk)|2NF′(xk)h2ℓ+3σ2∥κ(ℓ+1)∥2⎞⎟⎠ +oR(1) , (3.3)

where with and being the Gaussian density. If has Hölder smoothness of order for some , and , then (3.3) remains valid provided that .

In [8, 9], the correction in (3.3) is shown to be when . We strengthen this result by showing that (3.3) continues to represent the leading order asymptotics even when .

For each change point, , we define the uncertainty interval by , where is the two-sided

-quantile for a normal distribution. The probability that an empirical change point is more than

away from the th actual change point is less than . We consider two change points well resolved if the two uncertainty intervals do not overlap.

A similar variance for change point estimation is given in [10] where Müller shows that the left-most change point is asymptotically normally distributed with variance . For his result, Müller imposes stricter requirements on and , and does not obtain results pertaining to the expected number of false change points.

When , the halfwidth which minimizes the MISE scales as . Other schemes for piecewise convex fitting [5] choose the kernel halfwidth/smoothing parameter to be the smallest value, , that yields only change points in an unconstrained fit. Theorem 3.3 shows that is asymptotically larger than the halfwidth which minimizes the MISE for . As a result, these schemes oversmooth.

## 4 Data-based Pilot Estimators with Geometric Fidelity

We consider two-stage estimators that begin by estimating and using an unconstrained estimate with . From the pilot estimate, we determine the number, , and approximate locations of the change points. In the second-stage, we perform a constrained fit; requiring that be monotone in small regions about each empirical change point. Since spurious change points asymptotically occur only in a neighborhood of an actual change point, the second-stage fit need only be constrained in a vanishingly small portion of the domain asymptotically.

###### Theorem 4.1 (Asymptotic MISE for pilot estimation)

Let satisfy Assumption B and consider a sequence of two-stage estimators. In the first-stage, let the hypotheses of Theorem 3.2 be fulfilled. From the first stage estimate, denote the empirical -change points by . Choose widths such that , , and , where is the first stage halfwidth. In the second-stage, perform a constrained regression as in (2.2), where the second-stage smoothing parameter, , satisfies and . In the second-stage regression, impose the constraints that the second-stage has a single sign in the regions (which matches the sign of .) For , the second-stage estimate, , satisfies the expected error bounds of Eq. (2.3) (with different constants).

Proof. Theorem 3.2 shows that lie within of the with probability , where and . In the remainder of the proof, we implicitly neglect this set of measure and use arguments that are valid for large . By Assumption B, there are no zeros of in and thus . Note that

has a Gaussian distribution with variance

and a bias error bounded by [2]. Thus, the sign of is determined correctly with a probability of . The result follows from Theorem 2.4.

The trick of Theorem 4.1 is to constrain to be positive (or negative) in the uncertainty interval of the estimated change points (linear constraints) rather than constraining to have a single zero around (nonlinear constraints). Theorem 4.1 implies that the second-stage estimate has no false change points within of with high probability. It does not exclude the possibility that false change points occur outside of , but we believe that such false change points seldom occur in practice. We believe that it is adequate to eliminate false inflection points in the regions where they occur in the unconstrained nonparametric estimates.

Asymptotically, the zeros of

will occur in clusters with an odd number of zeros. If a cluster with an even number of zeros occurs in the first stage, it is spurious (with high asymptotic probability). We recommend imposing the constraint that the second-stage

(not ) has a single sign in each neighborhood where an even number of change points of the first-stage occur.

For data adaptive methods, we modify Theorem 4.1 slightly:

###### Corollary 4.2

The hypotheses of Theorem 4.1 on the first-stage estimate (such as , , and ) need only be true with probability , for the conclusion of Theorem 4.1 to be valid, where satisfies and .

Let denote the smoothing bandwidth chosen by generalized cross-validation. Under certain conditions [4], it can be shown that has an asymptotically normal distribution with mean , where and depend on and the kernel shape. For the first-stage halfwidth, we propose using , where is chosen such that satisfies 4.2. If , we recommend choosing with . This scaling corresponds to a uniformly consistent estimate of [2]. The overall moral is: the smoothing level chosen by GCV is asymptotically optimal for estimating functions, but derivative estimation requires more smoothing.

Numerical implementations [19] of constrained least squares usually apply the active set method of quadratic programming. The constrained smoothing spline regression reduces to a finite dimensional minimization when the constraints are on the th derivative. Since we constrain in each neighborhood of an estimated -change point, our algorithm is most readily implemented for using the duality result of [14].

To reduce the number of constraints, we seek to minimize the length of the constraint intervals, . When the hypotheses of Theorem 3.3 are satisfied with , we can estimate the uncertainty interval for inflection points by We recommend choosing the constraint width such that . (For smaller constraint widths, Theorem 4.1 is true, but false inflection points can still occur near the actual inflection point.)

## 5 Potential Extensions

• At present, we cannot exclude the posibility of false change points arising in the second stage estimate away from the constraint intervals. We believe that this is a technical gap in our analysis and not a practical difficulty. The risk of false change points can be reduced by choosing larger intervals to impose the constraints on . Let denote the zeros of in the first stage. Let and being the two closest zeros of to with . A judicous choice of constraint intervals is , which gives large constraint intervals with only a small chance of imposing an incorrect constraint on the second-stage .

• The pilot method suppresses false zeros of , but does not suppress false zeros of where is a nonzero real number. It may be more desirable to apply the pilot estimator to , where is a prescribed function possibly involving a small number of empirically estimated free parameters. The constraints in the second stage will virtually never need to be imposed if is always positive or negative. Thus, we suggest centering about zero by subtracting off a polynomial fit of order (and thereby centering about zero) prior to applying our two-stage estimator.

• Asymptotically, smoothing splines are equivalent to kernel smoothers [1, 17]. Using this convergence, analogous results to Theorems 3.2, 3.3 and 4.1 can be proved for when smoothing splines are used in the first stage estimate.

• It is tempting to try the pilot estimation procedure using local polynomial regression (LPR) in the second stage. Unfortunately, there is a difficulty with shape constrained LPR. Let for . If is constrained to be nonnegative, need not be convex because LPR does not require .

• Our data adaptive convergence results in Sec. 3 and Sec. 4

are for quadratic estimation with Gaussian errors. The results should be extendable to non-Gaussian errors using the central limit theorem and Brownian bridges as in

[9].

## 6 Piecewise Convex Information Criterion

Instead of the two stage pilot estimator, we now propose a second class of estimators which penalize both smoothness and the number of change points. Information/discrepancy criteria are used to measure whether the improvement in the goodness of fit is sufficient to justify using additional free parameters. Both the number of free parameters and their values are optimized with respect to the discrepancy criterion, . Let be a measure of the average residual error: , or its analog. Typical discrepancy functions are

 dI(^f,{yi})=^σ2/[1−(γ1p/N)]2  and (6.1)
 dB(^f,{yi})=^σ2[1+(γ2pln(N)/N)]  , (6.2)

where is the effective number of free parameters in the smoothing spline fit. For , is generalized cross-validation (GCV) which has the same asymptotic behavior as the Akaike information criterion. For , is the Bayesian/Schwartz information criterion. For a nested family of models, is appropriate while corresponds to a nonnested family with candidate models at the th level. In very specialized settings in regression theory and time series, it has been shown that functions like are asymptotically efficient while those like are asymptotically consistent. In other words, using -like criteria will asymptotically minimize the expected error at the cost of not always yielding the correct model. In contrast, the Bayesian criteria will asymptotically yield the correct model at the cost of having a larger expected error.

Our goal is to consistently select the number of convexity change points and efficiently estimate the model subject to the change point restrictions. Therefore, we propose the following discrepancy criterion:

 PCIC=σ2(^f,{yi})[1+γ2Kln(N)/N(1−γ1p/N)2] , (6.3)

where is the number of convexity change points and is the number of free parameters. PCIC stands for Piecewise Convex Information Criterion. In selecting the positions of the change points, there are essentially possible combinations of change point locations if we categorize the change points by the nearest measurement location. Thus, our default values are and .

We motivate PCIC: to add a change point requires an improvement in the residual square error of , which corresponds to an asymptotically consistent estimate. If the additional knot does not increase the number of change points, it will be added if the residual error decreases by . Presently, PCIC is purely a heuristic principle. We conjecture that it consistently selects the number of change points and is asymptotically efficient within the class of methods that are asymptotically consistent with regards to convexity change points.

## 7 Summary

Theorem 3.3 shows that for and , the amount of smoothing necessary for geometric fidelity is larger than the optimal value for minimizing the mean integrated square error. Therefore, we have considered two-stage estimators which estimate the -change points and their uncertainty intervals in the first-stage. In the second-stage, a constrained smoothing spline fit is applied using a data-adaptive estimate of the smoothing parameter.

Our main result is that such two-stage schemes achieve the same asymptotic rate of convergence as standard methods such as GCV that do not guarantee geometric fidelity. We prove this result incrementally. Theorem 2.4 evaluates an acceptable rate of failure for imposing the wrong constraints. Theorem 4.1 proves the asymptotic MISE result when the change points are estimated in the first-stage while the widths of the constraint intervals and the smoothing parameter in the first-stage satisfy certain scaling bounds. The second-stage estimates have no false change points in the regions where the unconstrained estimators have all of their false change points with probability approaching unity.

Linear constraints are necessary only in small neighborhoods about each -change point asymptotically. This suggest that the ratio of the MISE from our two stage estimate to that of kernel smoothers or spline tends to one. Our estimators should be useful in situations where obtaining the correct shape is important and computational costs are not an issue. Piecewise convex fitting may offer larger potential gains in the MISE in small sample situations because the knowledge that there are only a small number of inflections points should be of more value when less data is available. Numerical simulations are underway and will be discussed elsewhere.

Acknowledgments: We thank the referee and editor for their help. Work funded by U.S. Dept. of Energy Grant DE-FG02-86ER53223.

## Appendix A Interpolation Inequalities

We measure the distance of an arbitrary set of measurement times to an equi-spaced set of points in terms of the as defined in Sec. 2. The discrepancy is useful because it describe how closely a discrete sum over an arbitrarily placed set of points approximates an integral. In this appendix, we summarize this results and present a new interpolation identity for discrete sums. An useful condition is

Assumption 0 Assume that the limiting distribution of the measurement locations, , is and .

We denote the set of functions of bounded variation by and the corresponding norm by . The discrepancy is useful in evaluating the approximation accuracy of a discrete sum of arbitrarily placed points to the corresponding integral:

###### Theorem A.1 (Generalized Koksma [15])

Let be a bounded function of bounded variation, , on : . Let the star discrepancy be measured by a distribution, , which satisfies Assumption 0. If the discrete sum weights, , satisfy , then

 ∣∣ ∣∣∫10g(t)dF(t)−1NN∑i=1g(ti)wi∣∣ ∣∣≤[∥g∥TV+C∥g∥∞]D∗N . (A.1)

In our version of Koksma’s Theorem, we have added two new effects: a nonuniform weighting, , and a nonuniform distribution of points, . The total variation of with respect to is equal to the total variation of with respect to . Theorem A.1 follows from Koksma’s Theorem by a change of variables.

In the continuous case, the following Sobolev interpolation result [13] is well known:

###### Lemma A.2

There exists constants depending only on such that for all and :

 θ2j∫10|g(j)(s)|2ds≤cj[∫10|g(s)|2ds + θ2m∫10|g(m)(s)|2ds] . (A.2)

Using Koksma’s theorem and Lemma A.2, we can arrive at the following inequalities:

###### Corollary A.3

Let be in and assume the star discrepancy satisfies Assumption 0 with . The following interpolation bounds hold:

 1NN∑i=1g2(ti) ≤ C1∫10g2(t)dt + c1D∗mN∫10|g(m)(s)|2ds , (A.3)

where , and

 [cF−c1D∗δN−D∗N]∫10|g(s)|2ds≤1NN∑i=1g(ti)2+ c1D∗m(1−2δ)N∫10|g(m)(s)|2ds , (A.4)

for all in such that .

Proof. For , Koksma’s Theorem implies

 ∣∣ ∣∣∫10g2(t)dF(t)−1NN∑i=1g2(ti)∣∣ ∣∣ ≤ ∥g2∥TVD∗N ≤ (∥g∥20,2+∥g∥21,2)D∗N .

We then apply (A.2) to with for arbitrarily small .

 ∣∣ ∣∣∫10g2(t)dF(t)−1NN∑i=1g(ti)2∣∣ ∣