1 Introduction
In this paper, we consider the nonparametric regression setting in which we have observations $(x_1, y_1), \ldots, (x_n, y_n)$ of the pair of random variables $(x, y) \in \mathcal{X} \times \mathbb{R}$, where $(\mathcal{X}, d_{\mathcal{X}})$ is a metric space with metric $d_{\mathcal{X}}$. We suppose that the model
(1) $y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \ldots, n,$
holds, where $f_0 : \mathcal{X} \to \mathbb{R}$ is an unknown function that we wish to estimate. This problem arises in many settings, including demographic applications (Petersen et al., 2016a; Sadhanala and Tibshirani, 2017), environmental data analysis (Hengl et al., 2007), image processing (Rudin et al., 1992), and causal inference (Wager and Athey, 2017).
A substantial body of work has considered estimating the function $f_0$ in (1) at the observations (i.e. denoising), as well as at other values of the covariate (i.e. prediction). This includes the seminal papers by Breiman et al. (1984), Duchon (1977), and Friedman (1991), as well as more recent work by Petersen et al. (2016b), Petersen et al. (2016a), and Sadhanala and Tibshirani (2017). We refer the reader to Györfi et al. (2006) for a very detailed survey of classical nonparametric regression methods. The vast majority of existing work assumes that the true regression function is homogeneous, in the sense that it has the same smoothness level throughout its domain. For instance, a typical assumption is that $f_0$ is Lipschitz continuous. In this paper, we propose and analyze simple estimators that are locally adaptive, in that they can estimate $f_0$ when it is piecewise constant, piecewise Lipschitz, or satisfies a more general notion of bounded variation, as well as manifold adaptive, in the sense that they adapt to the intrinsic dimensionality of the data.
Recently, interest has focused on so-called trend filtering (Kim et al., 2009), which seeks to estimate $f_0$ under the assumption that its discrete derivatives are sparse, in a setting in which we have access to an unweighted graph that quantifies the pairwise relationships between the observations. In particular, the fused lasso, also known as zeroth-order trend filtering or total variation denoising (Mammen and van de Geer, 1997; Rudin et al., 1992; Tibshirani et al., 2005; Tibshirani, 2014; Sadhanala et al., 2017; Wang et al., 2016), solves the optimization problem
(2) $\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^n (y_i - \theta_i)^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|,$
where $\lambda$ is a nonnegative tuning parameter, and where $(i,j) \in E$ if and only if there is an edge between the $i$th and $j$th observations in the underlying graph. Then, $\hat{f}(x_i) = \hat{\theta}_i$. Computational aspects of the fused lasso have been studied extensively in the case of chain graphs (Johnson, 2013; Barbero and Sra, 2014; Davies and Kovac, 2001) as well as general graphs (Chambolle and Darbon, 2009; Chambolle and Pock, 2011; Tansey and Scott, 2015; Landrieu and Obozinski, 2015; Hoefling, 2010; Tibshirani and Taylor, 2011). Furthermore, the fused lasso is known to have excellent theoretical properties. In one dimension, Mammen and van de Geer (1997) and Tibshirani (2014) characterized rates under the assumption that $f_0$ has bounded variation, whereas Lin et al. (2017) and Guntuboyina et al. (2017) studied the performance of the fused lasso under the assumption that $f_0$ is piecewise constant. Hutter and Rigollet (2016), Sadhanala et al. (2016), and Sadhanala et al. (2017) considered grid graphs. Padilla et al. (2018) and Wang et al. (2016) provide results that hold for general graphs. Importantly, trend filtering is locally adaptive, which means that it can adapt to inhomogeneity in the level of smoothness of the underlying signal; this is discussed in Tibshirani (2014) in the context of a chain graph, and in Wang et al. (2016) in the context of a general graph. However, trend filtering and the fused lasso are only applicable when we have access to a graph underlying the observations (or when the covariate, $x$, is one-dimensional, thereby leading to a chain graph); thus, these approaches are not applicable to the general nonparametric regression setting of (1), in which no graph is available.
In this paper, we extend the utility of the fused lasso approach by combining it with the $K$-nearest neighbors ($K$-NN) procedure. $K$-NN has been well-studied from a theoretical (Stone, 1977; Chaudhuri and Dasgupta, 2014; Von Luxburg et al., 2014; Alamgir et al., 2014), methodological (Dasgupta, 2012; Kontorovich et al., 2016; Singh and Póczos, 2016; Dasgupta and Kpotufe, 2014), and algorithmic (Friedman et al., 1977; Dasgupta and Sinha, 2013; Zhang et al., 2012) perspective. We exploit these ideas in order to generalize the class of problems to which the fused lasso can be applied, thereby yielding a single procedure that inherits the computational and theoretical advantages of both $K$-NN and the fused lasso.
In greater detail, in this paper we extend the fused lasso to the general nonparametric setting of (1), by performing a two-step procedure.

Step 1. We construct a $K$-nearest-neighbor ($K$-NN) graph, by placing an edge between each observation and the $K$ observations to which it is closest, in terms of the metric $d_{\mathcal{X}}$.

Step 2. We apply the fused lasso to this $K$-NN graph.
The resulting $K$-NN fused lasso ($K$-NN-FL) estimator appeared as an application of graph trend filtering in Wang et al. (2016), but was not studied in detail there; we study it in detail in this paper. We also consider a variant obtained by replacing the $K$-NN graph in Step 1 with an $\epsilon$-nearest-neighbor ($\epsilon$-NN) graph, which contains an edge between the $i$th and $j$th observations only if $d_{\mathcal{X}}(x_i, x_j) \le \epsilon$.
The main contributions of this paper are as follows:

Local adaptivity. We show that provided that $f_0$ has bounded variation, along with an additional condition that generalizes piecewise Lipschitz continuity (Assumption 5), the mean squared errors (MSEs) of both the $K$-NN-FL estimator and the $\epsilon$-NN-FL estimator are of order $n^{-1/d}$, ignoring logarithmic factors; here, $d$ is the intrinsic dimension of the covariate space. When $d = 2$, this matches the minimax rate for estimating a two-dimensional Lipschitz function (Györfi et al., 2006), but over a much wider function class.

Manifold adaptivity. Suppose that the covariates are i.i.d. samples from a mixture of unknown bounded densities $p_1, \ldots, p_M$. Suppose further that for $l = 1, \ldots, M$, the support of $p_l$ is homeomorphic (see Assumption 3) to $[0,1]^{d_l}$, where $d_l$ is the intrinsic dimension of the $l$th mixture component. We show that under mild conditions, if the restriction of $f_0$ to each component is a function of bounded variation, then the $K$-NN-FL estimator attains the rate $n_l^{-1/d_l}$ on the $l$th component (ignoring logarithmic factors), where $n_l$ is the number of samples associated with the $l$th component. This, in turn, is the rate that we would obtain by running the $K$-NN-FL estimator separately on the samples associated with each mixture component.
The rest of this paper is organized as follows. In Section 2, we propose the $K$-NN-FL and $\epsilon$-NN-FL estimators. In Section 3, we show that these estimators are locally adaptive. In Section 4, we show that the $K$-NN-FL estimator is manifold adaptive. We discuss the use of $K$-NN-FL to predict the response at a new observation in Section 5. A simulation study and a data application are presented in Section 6, and we close with a discussion in Section 7. Proofs are provided in the Appendix.
2 Methodology
2.1 The $K$-Nearest-Neighbor and $\epsilon$-Nearest-Neighbor Fused Lasso
Both the $K$-NN-FL and $\epsilon$-NN-FL approaches are simple two-step procedures. The first step involves constructing a graph on the observations. The $K$-NN graph, $G_K$, has vertex set $\{1, \ldots, n\}$, and its edge set contains the pair $(i, j)$ if and only if $x_i$ is among the $K$ nearest neighbors (with respect to the metric $d_{\mathcal{X}}$) of $x_j$, or vice versa. By contrast, the $\epsilon$-NN graph, $G_\epsilon$, contains the edge $(i, j)$ if and only if $d_{\mathcal{X}}(x_i, x_j) \le \epsilon$.
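To make the two graph constructions concrete, here is a minimal Python sketch (the paper's experiments use Matlab; the function and variable names here are our own, and brute-force distance computation is used for clarity rather than speed):

```python
import math

def knn_edges(X, K):
    """Symmetric K-NN graph: include edge (i, j) if x_j is among the K nearest
    neighbors of x_i, or vice versa. Brute force: O(n^2) distance evaluations."""
    n = len(X)
    edges = set()
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(X[i], X[j]))
        for j in order[1:K + 1]:                 # order[0] is i itself (distance 0)
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

def eps_edges(X, eps):
    """eps-NN graph: include edge (i, j) whenever d(x_i, x_j) <= eps."""
    n = len(X)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(X[i], X[j]) <= eps]
```

Note that the $K$-NN edge set is the union over the two directions, so each vertex has degree at least $K$ but possibly more.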
After constructing the graph $G$ (either $G_K$ or $G_\epsilon$), we apply the fused lasso to $y = (y_1, \ldots, y_n)^T$ over $G$. We can rewrite the fused lasso optimization problem (2) as
(3) $\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^n} \; \frac{1}{2} \|y - \theta\|_2^2 + \lambda \|\nabla_G \theta\|_1,$
where $\lambda$ is a tuning parameter, and $\nabla_G$ is an oriented incidence matrix of $G$; each row of $\nabla_G$ corresponds to an edge in $G$. For instance, if the $l$th edge in $G$ connects the $i$th and $j$th observations, then the $l$th row of $\nabla_G$ has an entry of $1$ in position $i$, an entry of $-1$ in position $j$, and zeros elsewhere.
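The oriented incidence matrix can be built directly from the edge list; a small sketch (our own function names), which also verifies that the penalty $\|\nabla_G \theta\|_1$ in (3) equals the edge-wise sum of absolute differences in (2):

```python
def incidence_matrix(n, edges):
    """Oriented incidence matrix: row l has +1 at i and -1 at j for the
    l-th edge (i, j); all other entries are zero."""
    D = [[0] * n for _ in edges]
    for l, (i, j) in enumerate(edges):
        D[l][i], D[l][j] = 1, -1
    return D

def tv_penalty(D, theta):
    """||D theta||_1: sums |theta_i - theta_j| over the edges encoded in D."""
    return sum(abs(sum(d * t for d, t in zip(row, theta))) for row in D)
```

The edge orientation is arbitrary: flipping the signs of a row leaves the L1 penalty unchanged.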
In this paper, we mostly focus on the setting where $G$ is the $K$-NN graph $G_K$. We also include an analysis of the $\epsilon$-NN graph, which results from taking $G = G_\epsilon$, as a point of contrast.
Given the estimator $\hat{\theta}$ defined in (3), we can predict the response at a new observation $x \in \mathcal{X}$ according to
(4) $\hat{f}(x) = \sum_{i=1}^n w_i(x) \, \hat{\theta}_i.$
In the case of $K$-NN-FL, we take $w_i(x) = \frac{1}{K} 1\{i \in N_K(x)\}$, where $N_K(x)$ is the set of $K$ nearest neighbors of $x$ in the training data. In the case of $\epsilon$-NN-FL, we take $w_i(x) = 1\{d_{\mathcal{X}}(x_i, x) \le \epsilon\} / |\{j : d_{\mathcal{X}}(x_j, x) \le \epsilon\}|$. (Given a set $A$, $1\{x \in A\}$ is the indicator function that takes on a value of $1$ if $x \in A$, and $0$ otherwise.)
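The $K$-NN-FL prediction rule amounts to averaging the fitted values over the $K$ training points closest to the query. A minimal sketch (the function name is ours):

```python
import math

def predict_knn_fl(x_new, X_train, theta_hat, K):
    """Prediction rule for K-NN-FL: average the fitted values theta_hat
    over the K training observations nearest to x_new."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(x_new, X_train[i]))
    return sum(theta_hat[i] for i in order[:K]) / K
```

The $\epsilon$-NN-FL variant would instead average over all training points within distance $\epsilon$ of the query.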
We construct the $K$-NN and $\epsilon$-NN graphs using standard Matlab functions such as knnsearch and bsxfun; this requires at most $O(n^2)$ pairwise distance computations.
We solve the fused lasso with the parametric max-flow algorithm from Chambolle and Darbon (2009), for which software is available from the authors' website, http://www.cmap.polytechnique.fr/~antonin/software/; in practice, it is much faster than its worst-case complexity, which grows polynomially in $n$ and in the number of edges $m$ in the graph (Boykov and Kolmogorov, 2004; Chambolle and Darbon, 2009).
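We do not reproduce the parametric max-flow solver here; as an illustrative stand-in, problem (3) can be approximately minimized by plain subgradient descent. The sketch below (our own naming; far slower than max-flow, and only approximate) makes the objective explicit:

```python
def objective(y, theta, edges, lam):
    """0.5 * ||y - theta||_2^2 + lam * sum over edges of |theta_i - theta_j|."""
    fit = 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, theta))
    tv = sum(abs(theta[i] - theta[j]) for i, j in edges)
    return fit + lam * tv

def fused_lasso_subgradient(y, edges, lam, steps=3000, lr=0.005):
    """Approximate solver for (3): subgradient descent, keeping the best iterate
    (subgradient methods are not monotone, so we track the lowest objective)."""
    theta = list(y)
    best, best_obj = list(theta), objective(y, theta, edges, lam)
    for _ in range(steps):
        grad = [theta[i] - y[i] for i in range(len(y))]
        for i, j in edges:
            s = (theta[i] > theta[j]) - (theta[i] < theta[j])  # sign subgradient
            grad[i] += lam * s
            grad[j] -= lam * s
        theta = [t - lr * g for t, g in zip(theta, grad)]
        cur = objective(y, theta, edges, lam)
        if cur < best_obj:
            best, best_obj = list(theta), cur
    return best
```

On a small chain graph with two well-separated clusters of responses, the returned fit fuses values within each cluster while preserving the jump between them.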
In $K$-NN and $\epsilon$-NN, the values of $K$ and $\epsilon$ directly affect the sparsity of the graphs, and hence the computational performance of the $K$-NN-FL and $\epsilon$-NN-FL estimators. Corollary 3.23 in Miller et al. (1997) provides an upper bound on the maximum degree of arbitrary $K$-NN graphs in $\mathbb{R}^d$.
2.2 Example
We now illustrate the performance of $K$-NN-FL in a small example. We generate covariates in $[0,1]^2$ according to the probability density function
(5) 
Thus, the density (5) concentrates 64% of its mass in one small region of the domain, and 80% in a slightly larger one. The left-hand panel of Figure 1 displays a heatmap of observations drawn from (5).
We define $f_0$ in (1) to be the piecewise constant function
(6) 
We then generate the responses from model (1); the data are displayed in the right-hand panel of Figure 1. This simulation study has the following characteristics:

The function $f_0$ in (6) is not Lipschitz, but does have low total variation. This suggests that $K$-NN-FL should exhibit local adaptivity, and should outperform approaches that rely on Lipschitz continuity.

The probability density function in (5) is highly non-uniform. This should not pose a problem for $K$-NN-FL, given that it is manifold adaptive.
We compared the following methods:

$K$-NN-FL, with the number of neighbors $K$ held fixed, and the tuning parameter $\lambda$ chosen to minimize the average MSE over 100 Monte Carlo replicates.

CART (Breiman et al., 1984), with the complexity parameter chosen to minimize the average MSE over 100 Monte Carlo replicates.

$K$-NN regression (see e.g. Stone, 1977), with the number of neighbors chosen to minimize the average MSE over 100 Monte Carlo replicates.
The estimated regression functions resulting from these three approaches are displayed in Figure 2. We see that $K$-NN-FL can adapt to low-density and high-density regions of the distribution of the covariates, as well as to the local structure of the regression function. By contrast, CART has some artifacts due to the binary splits that make up the decision tree, and $K$-NN regression undersmooths in large areas of the domain.
3 Local Adaptivity of $K$-NN-FL and $\epsilon$-NN-FL
3.1 Assumptions
We assume that, in (1), the errors $\epsilon_1, \ldots, \epsilon_n$ are independent and identically-distributed, mean-zero, sub-Gaussian random variables,
(7) $\mathbb{P}(|\epsilon_i| > t) \le C \exp\!\left(-\frac{t^2}{2\sigma^2}\right) \quad \text{for all } t > 0,$
for some positive constants $C$ and $\sigma$. Furthermore, we assume that the errors are independent of the covariates.
In addition, for a set $A$ in a metric space, we let $\partial A$ denote the boundary of the set $A$; further set-related notation is collected in Appendix A.1.
In the covariate space $\mathcal{X}$, we consider the Borel sigma algebra, $\mathcal{B}(\mathcal{X})$, induced by the metric $d_{\mathcal{X}}$. We let $\mu$ be a measure on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$. We complement the model in (1) by assuming that the covariates are i.i.d. draws from a distribution with density $p$ with respect to $\mu$.
We begin by stating assumptions on the distribution of the covariates, and on the metric space $(\mathcal{X}, d_{\mathcal{X}})$. In the theoretical results in Section 3 of Györfi et al. (2006), it is assumed that $p$ is the probability density function of the uniform distribution in $[0,1]^d$. In this section, we will require only that $p$ is bounded above and below. This condition appeared in the framework for studying $K$-NN graphs in Von Luxburg et al. (2014), and in the work on density quantization by Alamgir et al. (2014).
Assumption 1.
The density $p$ satisfies $0 < p_{\min} \le p(x) \le p_{\max} < \infty$ for all $x \in \mathcal{X}$, where $p_{\min}$ and $p_{\max}$ are constants.
Although we do not require that $\mathcal{X}$ be a Euclidean space, we do require that balls in $\mathcal{X}$ have volume (with respect to $\mu$) that behaves similarly to the Lebesgue measure of balls in $\mathbb{R}^d$. This is expressed in the next assumption, which appeared as part of the definition of a valid region (Definition 2) in Von Luxburg et al. (2014).
Assumption 2.
The base measure $\mu$ on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ satisfies
$c_{\min} \, r^d \le \mu(B(x, r)) \le c_{\max} \, r^d$
for all $x \in \mathcal{X}$ and all $0 < r \le r_0$, where $c_{\min}$, $c_{\max}$, and $r_0$ are positive constants, and $d$ is the intrinsic dimension of $\mathcal{X}$.
Next, we make an assumption about the topology of the space $\mathcal{X}$. We require that the space has no holes, and is topologically equivalent to $[0,1]^d$, in the sense that there exists a continuous bijection between $[0,1]^d$ and $\mathcal{X}$.
Assumption 3.
There exists a homeomorphism (a continuous bijection with a continuous inverse) $h : [0,1]^d \to \mathcal{X}$, such that
$L_1 \|u - v\|_2 \le d_{\mathcal{X}}(h(u), h(v)) \le L_2 \|u - v\|_2 \quad \text{for all } u, v \in [0,1]^d,$
for some positive constants $L_1$ and $L_2$.
In particular, Assumptions 2 and 3 immediately hold if $\mathcal{X} = [0,1]^d$, with $d_{\mathcal{X}}$ the Euclidean distance, $h$ the identity mapping in $[0,1]^d$, and $\mu$ the Lebesgue measure in $[0,1]^d$. A metric space that satisfies Assumption 3 is a special case of a differential manifold; the intuition is that the space $\mathcal{X}$ is a chart of the atlas for said differential manifold.
We now proceed to state conditions on the regression function $f_0$. The first assumption simply requires bounded variation of the composition of the regression function with the homeomorphism $h$ from Assumption 3.
Assumption 4.
The function $g := f_0 \circ h$ has bounded variation, i.e. $g \in BV((0,1)^d)$. Here $(0,1)^d$ is the interior of $[0,1]^d$, and $BV((0,1)^d)$ is the class of functions in $L^1((0,1)^d)$ with bounded variation.
A discussion of bounded variation can be found in Appendix A.1. It is worth mentioning that if $\mathcal{X} = [0,1]^d$ and $h$ is the identity function in $[0,1]^d$, then Assumption 4 simply states that $f_0$ has bounded variation. However, in order to allow for more general scenarios, the condition is stated in terms of the function $g = f_0 \circ h$, which has domain in the unit box, whereas the domain of $f_0$ is the more general set $\mathcal{X}$.
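For completeness, one standard definition of bounded variation that is consistent with this usage (our phrasing; the paper's precise conventions are in Appendix A.1): a function $g \in L^1((0,1)^d)$ belongs to $BV((0,1)^d)$ if its total variation

```latex
\mathrm{TV}(g) \;=\; \sup\left\{ \int_{(0,1)^d} g(x)\, \operatorname{div}\phi(x)\, dx \;:\; \phi \in C_c^1\!\big((0,1)^d;\mathbb{R}^d\big),\ \|\phi\|_\infty \le 1 \right\}
```

is finite. For continuously differentiable $g$, this reduces to $\int_{(0,1)^d} \|\nabla g(x)\|_2\, dx$, so the class contains all Lipschitz functions, but it also admits jump discontinuities.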
We now recall the definition of a piecewise Lipschitz function; this induces a much larger class than the set of Lipschitz functions, as it allows for discontinuities.
Definition 1.
Let $\Omega = (0,1)^d$. We say that a function $f : \Omega \to \mathbb{R}$ is piecewise Lipschitz if there exists a set $S \subset \Omega$ that has the following properties:

$S$ has Lebesgue measure zero.

$\mu(\{x \in \Omega : d(x, S) \le r\}) \le C_S \, r$ for some constant $C_S > 0$, for all $r > 0$.

There exists a positive constant $L_0$ such that if $x$ and $y$ belong to the same connected component of $\Omega \setminus S$, then $|f(x) - f(y)| \le L_0 \|x - y\|_2$.
Roughly speaking, Definition 1 says that $f$ is piecewise Lipschitz if there exists a small set $S$ that partitions $\Omega$ in such a way that $f$ is Lipschitz within each connected component of $\Omega \setminus S$. Theorem 2.2.1 in Ziemer (2012) implies that if $f$ is piecewise Lipschitz, then $f$ has bounded variation on any open set within a connected component.
Theorem 1 will require Assumption 5, which is a milder assumption on $f_0$ than piecewise Lipschitz continuity (Definition 1). We now present some notation that is needed in order to introduce Assumption 5.
For an integer $N > 0$, we denote by $P_N$ the rectangular partition of $(0,1)^d$ whose elements all have Lebesgue measure $N^{-d}$. Then, for a set $A \subset (0,1)^d$, we define $P_N(A) = \{R \in P_N : R \cap A \ne \emptyset\}$; this is the partition induced in $A$ by the grid $P_N$.
For a function $f$ with domain $(0,1)^d$, we define
(8) 
If $f$ is piecewise Lipschitz, then the quantity in (8) will be bounded. To see this, assume that Definition 1 holds with the set $S$ above. Then
(9) 
where the second-to-last inequality follows from the fact that the volume of a $d$-dimensional ball of radius $r$ scales as $r^d$, together with a bound on the number of elements of the partition that intersect $S$. To better understand (8), notice that we can view (8) as a discretization, over the set $(0,1)^d$, of the integral of the function
(10) 
Recall that $x$ is a Lebesgue point of $f$ if the average of $f$ over balls shrinking to $x$ converges to $f(x)$ (see Definition 2.18 in Giaquinta and Modica, 2010). Furthermore, if $f \in L^1((0,1)^d)$ (see Appendix A.1 for this notation), then almost all points are Lebesgue points (this follows from the Lebesgue differentiation theorem; see Theorem 2.16 in Giaquinta and Modica, 2010). Thus, if $f \in L^1((0,1)^d)$, then, loosely speaking, each term in (8) is small for any configuration of points. In Assumption 5, we will further require the quantity in (8) to be bounded, which as mentioned before holds if $f$ is piecewise Lipschitz.
Next, we define
(11) 
where
(12) 
and $\psi$ is a test function (see (21) in the Appendix). Thus, (11) is the summation, over evenly-sized rectangles of volume $N^{-d}$ that intersect the domain, of the supremum of the function in (12). The latter, for a function $f$, can be thought of as the average "Lipschitz constant" near a point — see the expression in parentheses in (12) — weighted by the derivative of a test function. The scaling factor in (12) arises because the integral is taken over a set of measure proportional to $N^{-d}$.
As with (8), one can verify that if $f$ is a piecewise Lipschitz function, then the quantity in (11) is bounded. The argument is similar to that in (9).
We now make use of (8) and (11) in order to state our next condition on $f_0$. This condition is milder than a piecewise Lipschitz assumption (see Definition 1), and is independent of Assumption 4.
Assumption 5.
Let $g = f_0 \circ h$. There exists a set $S \subset (0,1)^d$ that satisfies the following:

$S$ has Lebesgue measure zero.

The quantities in (8) and (11), evaluated for $g$ with the set $S$, are bounded by constants that do not depend on $N$.

3.2 Results
Letting $\theta_0 = (f_0(x_1), \ldots, f_0(x_n))^T$, we express the MSEs of $K$-NN-FL and $\epsilon$-NN-FL in terms of the total variation of $\theta_0$ with respect to the $K$-NN and $\epsilon$-NN graphs.
Theorem 1.
Clearly, this rate is a function of $\|\nabla_G \theta_0\|_1$ (where $G$ equals $G_K$ or $G_\epsilon$). For the grid graph considered in Sadhanala et al. (2016), this quantity is well behaved, leading to the rate $n^{-1/d}$. In fact, when $d = 2$, this is the minimax rate for estimation of a Lipschitz continuous function (Györfi et al., 2006). However, for a general graph, there is no prior reason to expect that $\|\nabla_G \theta_0\|_1$ is well behaved. Our next result shows that this is actually the case under the assumptions discussed in Section 3.1.
Theorem 2.
Theorem 2 indicates that under Assumptions 1–5, both the $K$-NN-FL and $\epsilon$-NN-FL estimators attain the convergence rate $n^{-1/d}$, ignoring logarithmic terms. Importantly, Theorem 3.2 of Györfi et al. (2006) shows that in the two-dimensional setting, this rate is actually minimax for estimation of a Lipschitz continuous function when the design points are uniformly drawn from $[0,1]^2$. This is of course a much smaller class than that implied by Assumptions 1–5. In addition, the upper bounds in Theorem 2 hold if Assumptions 1–3 are met and the regression function satisfies Definition 1 (replacing Assumptions 4–5), i.e. if it is piecewise Lipschitz. The proof of this fact is included in Appendix A.9.
We see from Theorem 2 that both $K$-NN-FL and $\epsilon$-NN-FL are locally adaptive, in the sense that they can adapt to the form of the function $f_0$. Specifically, if $f_0$ satisfies Assumptions 4–5 (or, instead, Definition 1), then these estimators do not require knowledge of the set $S$. This is similar in spirit to the one-dimensional fused lasso, which does not require knowledge of the breakpoints when estimating a piecewise Lipschitz function.
However, there is an important difference between the applicability of Theorem 2 for $K$-NN-FL and $\epsilon$-NN-FL. In order to attain the rate in Theorem 2, $\epsilon$-NN-FL requires knowledge of the intrinsic dimension $d$, since this quantity appears in the rate of decay of $\epsilon$. But in practice, the value of $d$ might not be clear: for instance, a one-dimensional curve embedded in the plane is a subset of $\mathbb{R}^2$, but it is homeomorphic to $[0,1]$, so that $d = 1$. If $d$ is unknown, then it can be challenging to choose $\epsilon$ for $\epsilon$-NN-FL. By contrast, the choice of $K$ in $K$-NN-FL only involves the sample size $n$. Consequently, local adaptivity of $K$-NN-FL may be much easier to achieve in practice.
4 Manifold Adaptivity of $K$-NN-FL
In this section, we show that under a much milder set of assumptions than those posed in Section 3, the $K$-NN-FL estimator can still achieve a desirable rate. In particular, we now allow the observations to be drawn from a mixture distribution, in which each mixture component satisfies the assumptions in Section 3.
We assume that there exists a partition $\{\mathcal{X}_1, \ldots, \mathcal{X}_M\}$ of the covariate space $\mathcal{X}$ such that the data are generated as
(13) 
where $\epsilon_i$ satisfies (7), $p_l$ is a density with support $\mathcal{X}_l$ for $l = 1, \ldots, M$, and $\{\mathcal{X}_l\}_{l=1}^M$ is a collection of subsets of $\mathcal{X}$. For simplicity, we will assume that $\mathcal{X} \subset \mathbb{R}^D$ for some $D$, and that $d_{\mathcal{X}}$ is the Euclidean distance.
We further assume that each set $\mathcal{X}_l$ is homeomorphic to a Euclidean box whose dimension $d_l$ depends on $l$, as follows:
Assumption 6.
The constraints implied by Assumption 6 are very natural. Inequality (15) states that the intersections of the different manifolds are small. To put this in perspective, if the extrinsic space were $\mathbb{R}^D$ with the Lebesgue measure, then the boundary sets in question would have measure exhibiting the same scaling appearing in (15). On the other hand, (15) and (16) hold if $\mathcal{X}_1, \ldots, \mathcal{X}_M$ are compact and convex subsets of $\mathbb{R}^D$ whose interiors are disjoint.
We are now ready to extend Theorem 2 to the framework described in this section. The following theorem indicates that the $K$-NN-FL estimator can adapt to the "dimension" of the data in multiple regions of the domain.
Theorem 3.
Suppose that the data are generated as in (13), and that Assumption 6 holds. Suppose also that the restrictions of $f_0$ to the components satisfy Assumptions 4–5 in their respective domains. Then, for an appropriate choice of the tuning parameter $\lambda$, the $K$-NN-FL estimator defined in (3) satisfies
provided that each $n_l$ is at least $c \, q(\log n)$ for some positive constant $c$, where $q$ is a polynomial function.
The following example suggests that the $\epsilon$-NN-FL estimator may not be manifold adaptive.
Example 1.
For $M = 2$, suppose that the sets $\mathcal{X}_1$ and $\mathcal{X}_2$ satisfy Assumption 6 and have different intrinsic dimensions. Let us assume that all of the conditions of Theorem 3 are met. Motivated by the scaling of $\epsilon$ in Theorem 2, we consider two possibilities for $\epsilon$, one tuned to each of the two dimensions.

For the first choice of $\epsilon$, and for any positive constant $c$, there exists a positive constant $C$ such that
for large enough $n$, where $\hat{\theta}$ is the $\epsilon$-NN-FL estimator. (This is shown in Appendix A.11.) In other words, with this choice of $\epsilon$, $\epsilon$-NN-FL does not achieve a desirable rate. By contrast, with an appropriate choice of $K$ and $\lambda$, the $K$-NN-FL estimator attains the rate of Theorem 3 (ignoring logarithmic terms).

With the second choice of $\epsilon$, the minimum degree in the $\epsilon$-NN graph of the observations will, with high probability, grow polynomially in $n$ for some positive constant in the exponent. (This can be seen as in Lemma 8.) This has two consequences:

Recall that the algorithm from Chambolle and Darbon (2009) has a worst-case complexity that grows with the number of edges $m$ in the graph (although empirically the algorithm is typically much faster; Wang et al., 2016). Therefore, if $\epsilon$ is too large, then the computations involved in applying the fused lasso to the $\epsilon$-NN graph may be too demanding.

The choice of $\epsilon$ in Theorem 2 leads to an $\epsilon$-NN graph whose degree grows with $n$ (this follows from Proposition 27 in Von Luxburg et al., 2014). In a different context, El Alaoui et al. (2016) argue that it is desirable for geometric graphs (which generalize $\epsilon$-NN graphs) to have a much smaller degree. This suggests that this choice of $\epsilon$ leads to an $\epsilon$-NN graph that is too dense.

5 Prediction for $K$-NN-FL
Recall from Section 2 that we estimate $f_0(x)$ at a new observation $x$ using (4); this amounts to averaging the fitted values for the training observations that are near $x$. We now provide an upper bound for the mean integrated squared error (MISE) of this estimator.
Theorem 4.
The right-hand side of (17) involves the penalty term and the approximation error AErr. We saw in Section 3.2 that the former can be characterized under Assumptions 4–5, or if $f_0$ satisfies Definition 1. We will now show that the piecewise Lipschitz condition (Definition 1) is sufficient to ensure an upper bound on the approximation error.
Theorem 5.
This implies that the rate from Theorem 2 is also attained for the prediction error. In particular, in two dimensions, $K$-NN-FL together with interpolation exhibits the minimax rate at unobserved locations drawn from the covariate distribution.
6 Experiments
Throughout this section, we take $d_{\mathcal{X}}$ to be the Euclidean distance.
6.1 Simulated Data
In this section, we compare the following approaches:

$K$-NN-FL, with $K$ held fixed, and $\lambda$ treated as a tuning parameter.

$\epsilon$-NN-FL, with $\epsilon$ held fixed, and $\lambda$ treated as a tuning parameter.

CART (Breiman et al., 1984), implemented in the R package rpart, with the complexity parameter treated as a tuning parameter.
MARS (Friedman, 1991), implemented in the R package earth, with the penalty parameter treated as a tuning parameter.
Random forests (Breiman, 2001), implemented in the R package randomForest, with the number of trees fixed at 800, and with the minimum size of each terminal node treated as a tuning parameter.
$K$-NN regression (e.g. Stone, 1977), implemented in Matlab using the function knnsearch, with $K$ treated as a tuning parameter.
We evaluate each method's performance using mean squared error (MSE), defined as $\frac{1}{n} \sum_{i=1}^n \{\hat{\theta}_i - f_0(x_i)\}^2$. Specifically, we apply each method to 150 Monte Carlo data sets with a range of tuning parameter values. For each method, we then identify the tuning parameter value that leads to the smallest average MSE (where the average is taken over the 150 data sets). We refer to this smallest average MSE as the optimized MSE in what follows.
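As a concrete reading of this error metric, a small Python helper (our own function name) computing the MSE of fitted values against the true regression function at the design points:

```python
def mse(theta_hat, f0_values):
    """MSE as used in Section 6: average squared difference between the fitted
    values and the true regression function evaluated at the design points."""
    n = len(theta_hat)
    return sum((t - f) ** 2 for t, f in zip(theta_hat, f0_values)) / n
```

Note that this compares against the (in simulation, known) function values $f_0(x_i)$, not against the noisy responses $y_i$.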
6.1.1 Fixed $d$, varying $n$
Throughout this section, we consider two scenarios with two-dimensional covariates.
Scenario 1.
The function $f_0$ is piecewise constant,
(19) 
The covariates are drawn from a uniform distribution on $[0,1]^2$. The data are generated as in (1).
Scenario 2.
The function $f_0$ is as in (6), with the generative density for the covariates as in (5). The data are generated as in (1).
Data generated under Scenario 1 are displayed in Figure 3(a). Data generated under Scenario 2 are displayed in Figure 1(b).
Figure 3(b) and Figure 3(d) display the optimized MSE, as a function of the sample size, for Scenarios 1 and 2, respectively. $K$-NN-FL gives the best results in both scenarios. $\epsilon$-NN-FL performs a bit worse than $K$-NN-FL in Scenario 1, and very poorly in Scenario 2 (results not shown).
Timing results for all approaches are reported in Figure 3(c). For all methods, the times reported are averaged over a range of tuning parameter values. For instance, for $K$-NN-FL, we fix $K$ and compute the time for different choices of $\lambda$; we then report the average of those times.
6.1.2 Fixed $n$, varying $d$
We now consider two scenarios in which the dimension $d$ of the covariates varies.
Scenario 3.
The function $f_0$ is defined as
and the generative density for the covariates is uniform in $[0,1]^d$.
Scenario 4.
The function $f_0$ is defined as
where , , and
. Once again, the generative density for the covariates is uniform in $[0,1]^d$, and the errors are i.i.d.
Optimized MSE for each approach is displayed in Figure 4. When $d$ is small, most methods perform well; however, as $d$ increases, the competing methods quickly deteriorate, whereas $K$-NN-FL continues to perform well.
6.2 Flu Data
We consider a data set that consists of flu activity and atmospheric conditions in Texas between January 1, 2003 and December 31, 2009, across different cities. Our data-use agreement does not permit dissemination of these hospital records. We also use data on temperature and air quality (particulate matter) in these cities, which can be obtained directly from CDC Wonder (http://wonder.cdc.gov/). Using the number of flu-related doctor's office visits as the dependent variable, we fit a separate nonparametric regression model to each of 24 cities; each day was treated as a separate observation, so that the number of samples is the same in each city. Five independent variables are included in the regression: maximum and average observed concentration of particulate matter, maximum and minimum temperature, and day of the year. All variables are scaled to lie in $[0,1]$. We performed 50 random 75%/25% splits of the data into a training set and a test set. All models were fit on the training data, with 5-fold cross-validation to select tuning parameter values. Prediction performance was then evaluated on the test set.
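The preprocessing just described — scaling each covariate to $[0,1]$ and repeatedly splitting the observations into 75% training / 25% test — can be sketched as follows (Python, standard library only; the function names are ours, not the paper's):

```python
import random

def scale01(column):
    """Min-max scale one covariate so that its values lie in [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

def train_test_split(n, train_frac=0.75, seed=0):
    """One random 75%/25% split of the n observation indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]
```

In the paper's experiment this split is repeated 50 times (e.g. with seeds 0–49), with tuning parameters chosen by 5-fold cross-validation within each training set.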
We apply $K$-NN-FL and $\epsilon$-NN-FL. We also fit neural networks (Hagan et al., 1996; implemented in Matlab using the functions newfit and train), thin plate splines (Duchon, 1977; implemented using the R package fields), and MARS, CART, and random forests, using the software described in Section 6.1.
Average test-set prediction error (across the 50 test sets) is displayed in Figure 5. We see that $K$-NN-FL and $\epsilon$-NN-FL tend to have the best performance. In particular, $K$-NN-FL performs best in 13 out of the 24 cities, and second best in 6 cities. In 8 of the 24 cities, $\epsilon$-NN-FL performs best.
7 Conclusion
In this paper, we introduced a two-stage nonparametric regression estimator that proceeds by (i) constructing a nearest-neighbor graph over the observations; and (ii) performing the fused lasso over the graph. We have shown that when the $K$-NN or $\epsilon$-NN graph is used, the resulting procedure is locally adaptive; furthermore, when the $K$-NN graph is used, the procedure is manifold adaptive. Our theoretical results are bolstered by findings on simulated data and on a flu data set.
Appendix A Proofs
A.1 Notation
Throughout, given a positive integer $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. For a point and a set in a metric space, we write the distance from the point to the set in the usual way, and when the context is clear we will simply write $d$ instead of $d_{\mathcal{X}}$. In the case of the space $\mathbb{R}^d$, we will use the notation $\|\cdot\|_2$ for the metric induced by the Euclidean norm, and we will write