In this paper, we consider the non-parametric regression setting in which we have observations,
, of the pair of random variables, where is a metric space with metric . We suppose that the model
holds, where is an unknown function that we wish to estimate. This problem arises in many settings, including demographic applications (Petersen et al., 2016a; Sadhanala and Tibshirani, 2017), environmental data analysis (Hengl et al., 2007), image processing (Rudin et al., 1992), and causal inference (Wager and Athey, 2017).
A substantial body of work has considered estimating the function in (1) at the observations (i.e. denoising) as well as at other values of the random variable (i.e. prediction). This includes the seminal papers by Breiman et al. (1984), Duchon (1977), and Friedman (1991), as well as more recent work by Petersen et al. (2016b), Petersen et al. (2016a), and Sadhanala and Tibshirani (2017). We refer the reader to Györfi et al. (2006) for a very detailed survey of classical non-parametric regression methods. The vast majority of existing work assumes that the true regression function is homogeneous, in the sense that it has the same smoothness level throughout its domain. For instance, a typical assumption is that is Lipschitz continuous. In this paper, we propose and analyze simple estimators that are locally adaptive, in that they can estimate when it is piecewise constant, piecewise Lipschitz, or satisfies a more general notion of bounded variation, as well as manifold adaptive, in the sense that they adapt to the intrinsic dimensionality of the data.
Recently, interest has focused on so-called trend filtering (Kim et al., 2009), which seeks to estimate under the assumption that its discrete derivatives are sparse, in a setting in which we have access to an unweighted graph that quantifies the pairwise relationships between the observations. In particular, the fused lasso, also known as zeroth-order trend filtering or total variation denoising (Mammen and van de Geer, 1997; Rudin et al., 1992; Tibshirani et al., 2005; Tibshirani, 2014; Sadhanala et al., 2017; Wang et al., 2016), solves the optimization problem
where is a non-negative tuning parameter, and where if and only if there is an edge between the th and th observations in the underlying graph. Then, . Computational aspects of the fused lasso have been studied extensively in the case of chain graphs (Johnson, 2013; Barbero and Sra, 2014; Davies and Kovac, 2001) as well as general graphs (Chambolle and Darbon, 2009; Chambolle and Pock, 2011; Tansey and Scott, 2015; Landrieu and Obozinski, 2015; Hoefling, 2010; Tibshirani and Taylor, 2011; Chambolle and Darbon, 2009). Furthermore, the fused lasso is known to have excellent theoretical properties. In one dimension, Mammen and van de Geer (1997) and Tibshirani (2014) characterized rates under the assumption that has bounded variation, whereas Lin et al. (2017) and Guntuboyina et al. (2017) studied the performance of the fused lasso under the assumption that is piecewise constant. Hutter and Rigollet (2016), Sadhanala et al. (2016), and Sadhanala et al. (2017) considered grid graphs. Padilla et al. (2018) and Wang et al. (2016) provide results that hold for general graphs. Importantly, trend filtering is locally adaptive, which means that it can adapt to inhomogeneity in the level of smoothness of the underlying signal; this is discussed in Tibshirani (2014) in the context of a chain graph, and in Wang et al. (2016) in the context of a general graph. However, trend filtering and the fused lasso are only applicable when we have access to a graph underlying the observations (or when the covariate, , is one-dimensional, thereby leading to a chain graph); thus, these approaches are not applicable to the general non-parametric regression setting of (1), in which no graph is available.
In this paper, we extend the utility of the fused lasso approach by combining it with the -nearest neighbors (-NN) procedure. -NN has been well-studied from a theoretical (Stone, 1977; Chaudhuri and Dasgupta, 2014; Von Luxburg et al., 2014; Alamgir et al., 2014), methodological (Dasgupta, 2012; Kontorovich et al., 2016; Singh and Póczos, 2016; Dasgupta and Kpotufe, 2014), and algorithmic (Friedman et al., 1977; Dasgupta and Sinha, 2013; Zhang et al., 2012) perspective. We exploit these ideas in order to generalize the class of problems to which the fused lasso can be applied, thereby yielding a single procedure that inherits the computational and theoretical advantages of both -NN and the fused lasso.
In greater detail, in this paper we extend the fused lasso to the general non-parametric setting of (1), by performing a two-step procedure.
Step 1. We construct a -nearest-neighbor (-NN) graph, by placing an edge between each observation and the observations to which it is closest, in terms of the metric .
Step 2. We apply the fused lasso to this -NN graph.
The resulting -NN fused lasso (-NN-FL) estimator appeared as an application of graph trend filtering in Wang et al. (2016), but was not studied in detail. In this paper, we study this estimator in detail. We also consider a variant obtained by replacing the -NN graph in Step 1 with an -nearest-neighbor (-NN) graph, which contains an edge between and only if .
The main contributions of this paper are as follows:
Local adaptivity. We show that provided that has bounded variation, along with an additional condition that generalizes piecewise Lipschitz continuity (Assumption 5), then the mean squared errors (MSE) of both the -NN-FL estimator and the -NN-FL estimator are , ignoring logarithmic factors; here, is the dimension of . In fact, this matches the minimax rate for estimating a two-dimensional Lipschitz function (Györfi et al., 2006), but over a much wider function class.
Manifold adaptivity. Suppose that the covariates are i.i.d. samples from a mixture of unknown bounded densities . Suppose further that for , the support of is homeomorphic (see Assumption 3) to , where is the intrinsic dimension of . We show that under mild conditions, if the restriction of to is a function of bounded variation, then the -NN-FL estimator attains the rate , where is the number of samples associated with the th component. This, in turn, is the rate that we would obtain by running the -NN-FL estimator separately on the samples associated with each mixture component.
The rest of this paper is organized as follows. In Section 2, we propose the -NN-FL and -NN-FL estimators. In Section 3, we show that these estimators are locally adaptive. In Section 4, we show that the -NN-FL estimator is manifold adaptive. We discuss the use of -NN-FL to predict a response at a new observation in Section 5. A simulation study and a data application are presented in Section 6, and we close with a Discussion in Section 7. Proofs are provided in the Appendix.
2.1 The -Nearest-Neighbor and -Nearest-Neighbor Fused Lasso
Both the -NN-FL and -NN-FL approaches are simple two-step procedures. The first step involves constructing a graph on the observations. The -NN graph, , has vertex set , and its edge set contains the pair if and only if is among the -nearest neighbors (with respect to the metric ) of , or vice versa. By contrast, the -graph, , contains the edge in if and only if .
After constructing the graph, we apply the fused lasso to over the graph (either or ). We can re-write the fused lasso optimization problem (2) as
where is a tuning parameter, and is an oriented incidence matrix of ; each row of corresponds to an edge in . For instance, if the th edge in connects the th and th observations, then
In this paper, we mostly focus on the setting where is the -NN graph. We also include an analysis of the -graph, which results from using , as a point of contrast.
Given the estimator defined in (3), we can predict the response at a new observation according to
In the case of -NN-FL, we take , where is the set of nearest neighbors of in the training data. In the case of -NN-FL, we take . (Given a set , is the indicator function that takes on a value of if , and otherwise.)
We construct the -NN and -NN graphs using standard
Matlab functions such
bsxfun; this results in a computational complexity of .
We solve the fused lasso with the parametric max-flow algorithm from Chambolle and Darbon (2009), for which software
is available from the authors’ website, http://www.cmap.polytechnique.fr/~antonin/software/; it is in practice much faster than
its worst-case complexity of , where is the number of edges in the graph (Boykov and Kolmogorov, 2004; Chambolle and Darbon, 2009).
In -NN and -NN, the values of and directly affect the sparsity of the graphs, and hence the computational performance of the -NN-FL and -NN-FL estimators. Corollary 3.23 in Miller et al. (1997) provides an upper bound on the maximum degree of arbitrary -NN graphs in .
We now illustrate the performance of -NN-FL in a small example. We generate
according to the probability density function
We define in (1) to be the piecewise constant function
The function in (6) is not Lipschitz, but does have low total variation. This suggests that -NN-FL should exhibit local adaptivity, and should outperform approaches that rely on Lipschitz continuity.
The probability density function in (5) is highly non-uniform. This should not pose a problem for -NN-FL, given that it is has manifold adaptivity.
We compared the following methods:
-NN-FL, with the number of neighbors set to , and the tuning parameter chosen to minimize the average MSE over 100 Monte Carlo replicates.
CART (Breiman et al., 1984), with the complexity parameter chosen to minimize the average MSE over 100 Monte Carlo replicates.
-NN regression (see e.g. Stone, 1977), with the number of neighbors set to minimize the average MSE over 100 Monte Carlo replicates.
The estimated regression functions resulting from these three approaches are displayed in Figure 2. We see that
-NN-FL can adapt to low-density and high-density regions of the distribution of covariates, as well as to the local structure of the regression function. By contrast, CART has some artifacts due to the binary splits that make up the decision tree, and-NN regression undersmoothes in large areas of the domain.
3 Local Adaptivity of -NN-FL and -Nn-Fl
We assume that, in (1), the elements of are independent and identically-distributed mean-zero sub-Gaussian random variables,
for some positive constants and . Furthermore, we assume that is independent of .
In addition, for a set with a metric space, we write . We let denote the boundary of the set . Moreover, we define .
In the covariate space , we consider the Borel sigma algebra, , induced by the metric . We let be a measure on . We complement the model in (1) by assuming that the covariates satisfy .
We begin by stating assumptions on the distribution of the covariates , and on the metric space . In the theoretical results in Section 3 from Györfi et al. (2006), it is assumed that
is the probability density function of the uniform distribution in. In this section, we will require only that is bounded above and below. This condition appeared in the framework for studying -NN graphs in Von Luxburg et al. (2014), and in the work on density quantization by Alamgir et al. (2014).
The density satisfies , for all , where .
Although we do not require that be a Euclidean space, we do require that the balls in have volume (with respect to ) that behaves similarly to the Lebesgue measure of balls in . This is expressed in the next assumption, which appeared as part of the definition of a valid region (Definition 2) in Von Luxburg et al. (2014).
The base measure in satisfies
for all , where , , and are positive constants, and is the intrinsic dimension of .
Next, we make an assumption about the topology of the space . We require that the space has no holes, and is topologically equivalent to , in the sense that there exists a continuous bijection between and .
There exists a homeomorphism (a continuous bijection with a continuous inverse) , such that
for some positive constants .
In particular, Assumptions 2 and 3 immediately hold if , with as the Euclidean distance, as the identity mapping in , and as the Lebesgue measure in . A metric space that satisfies Assumption 3 is a special case of a differential manifold; the intuition is that the space is a chart of the atlas for said differential manifold.
We now proceed to state conditions on the regression function . The first assumption simply requires bounded variation of the composition of the regression function with the homeomorphism from Assumption 3.
The function has bounded variation, i.e. . Here is the interior of , and is the class of functions in with bounded variation.
A discussion of bounded variation can be found in Appendix A.1. It is worth mentioning that if and is the identity function in , then Assumption 4 simply states that has bounded variation. However, in order to allow for more general scenarios, the condition is stated in terms of the function which has domain in the unit box, whereas the domain of is the more general set .
We now recall the definition of a piecewise Lipschitz function, which induces a much larger class than the set of Lipschitz functions, as it allows for discontinuities.
Let . We say that a function is piecewise Lipschitz if there exists a set that has the following properties:
has Lebesgue measure zero.
for some constants , for all .
There exists a positive constant such that if and belong to the same connected component of , then .
Roughly speaking, Definition 1 says that is piecewise Lipschitz if there exists a small set that partitions in such a way that is Lipschitz within each connected component of the partition. Theorem 2.2.1 in Ziemer (2012) implies that if is piecewise Lipschitz, then has bounded variation on any open set within a connected component.
Theorem 1 will require Assumption 5, which is a milder assumption on than piecewise Lipschitz continuity (Definition 1). We now present some notation that is needed in order to introduce Assumption 5.
For , we denote by the rectangular partition of whose elements all have Lebesgue measure . We define . Then, for a set , we define
this is the partition induced in by the grid .
For a function with domain , we define
If is piecewise Lipschitz, then will be bounded. To see this, assume that Definition 1 holds with the set above. Then
where the second-to-last inequality follows from the fact that the volume of a -dimensional ball of radius is , as well as the fact that there are at most in the order of elements of . To better understand (8), notice that we can view as a discretization over the set of the integral of the function
Recall that is a Lebesgue point of if (see Definition 2.18 in Giaquinta and Modica, 2010). Furthermore, if (see Appendix A.1 for this notation) then almost all points are Lebesgue points (this follows from the Lebesgue differentiation theorem; see Theorem 2.16 in Giaquinta and Modica, 2010). Thus, if , then loosely speaking each term in (8) is , for any configuration of points . In Assumption 5, we will further require to be bounded, which as mentioned before holds if is piecewise Lipschitz.
Next, we define
and is a test function (see (21) in the Appendix). Thus, (11) is the summation, over evenly-sized rectangles of volume that intersect , of the supremum of the function in (12). The latter, for a function , can be thought as the average “Lipschitz constant” near — see the expression inside parenthesis in (12) — weighted by the derivative of a test function. The scaling factor in (12) arises because the integral is taken over a set of measure proportional to .
As with , one can verify that if is a piecewise Lipschitz function, then is bounded. The argument is similar to that in (9).
Let . There exists a set that satisfies the following:
has Lebesgue measure zero.
for some constants , and for all .
Letting , we express the MSEs of -NN-FL and -NN-FL in terms of the total variation of with respect to the -NN and -NN graphs.
Clearly, this rate is a function of (where equals or ). For the grid graph considered in Sadhanala et al. (2016), , leading to the rate . In fact, when , this is the minimax rate for estimation of a Lipschitz continuous function (Györfi et al., 2006). However, for a general graph, there is no prior reason to expect that . Our next result shows that this is actually the case under the assumptions discussed in Section 3.1.
Theorem 2 indicates that under Assumptions 1–5, both the -NN-FL and -NN-FL estimators attain the convergence rate , ignoring logarithmic terms. Importantly, Theorem 3.2 from Györfi et al. (2006) shows that in the two-dimensional setting, this rate is actually minimax for estimation of a Lipschitz continuous function, when the design points are uniformly drawn from . This is of course a much smaller class than that implied by Assumptions 1–5. In addition, the upper bounds in Theorem 2 hold if Assumptions 1–3 are met and satisfies Definition 1 (replacing Assumptions 4–5), i.e. if is piecewise Lipschitz. The proof of this fact is included in Appendix A.9.
We see from Theorem 2 that both -NN-FL and -NN-FL are locally adaptive, in the sense that they can adapt to the form of the function . Specifically, if satisfies Assumptions 4–5 (or, instead, Definition 1), then these estimators do not require knowledge of the set . This is similar in spirit to the one-dimensional fused lasso, which does not require knowledge of the breakpoints when estimating a piecewise Lipschitz function.
However, there is an important difference between the applicability of Theorem 2 for -NN-FL and -NN-FL. In order to attain the rate in Theorem 2, -NN-FL requires knowledge of the dimension , since this quantity appears in the rate of decay of . But in practice, the value of might not be clear: for instance, suppose that ; this is a subset of , but it is homeomorphic to , so that . If is unknown, then it can be challenging to choose for -NN-FL. By contrast, the choice of in -NN-FL only involves the sample size . Consequently, local adaptivity of -NN-FL may be much easier to achieve in practice.
4 Manifold Adaptivity of -Nn-Fl
In this section, we show that under a much milder set of assumptions than those posed in Section 3, the -NN-FL estimator can still achieve a desirable rate. In particular, we now allow the observations to be drawn from a mixture distribution, in which each mixture component satisfies the assumptions in Section 3.
We assume that there exists a partition of such that , where , and
where satisfies (7), is a density with support , , and is a collection of subsets of . For simplicity, we will assume that for some , and is the Euclidean distance.
We further assume that each set is homeomorphic to a Euclidean box of dimension depending on , as follows:
The constraints implied by Assumption 6 are very natural. Inequality (15) states that the intersections of the different manifolds are small. To put it in perspective, if the extrinsic space () were with the Lebesgue measure, then balls of radius of would have measure which is less than for all , and the set has measure that scales like , the same scaling appearing (15). On the other hand (15) and (16) hold if are compact and convex subsets of whose interiors are disjoint.
We are now ready to extend Theorem 2 to the framework described in this section. The following theorem indicates that the -NN-FL estimator can adapt to the “dimension” of the data in multiple regions of the domain.
Suppose that the data are generated as in (13), and Assumption 6 holds. Suppose also that the functions satisfy Assumptions 4–5 in the domain . If for all , then for an appropriate choice of the tuning parameter , the -NN-FL estimator defined in (3) satisfies
provided that for some positive constant , where is a polynomial function.
The following example suggests that the -NN-FL estimator may not be manifold adaptive.
For , suppose that for some , and . Note that the sets and satisfy Assumption 6. Let us assume that all of the conditions of Theorem 3 are met with , and . Motivated by the scaling of in Theorem 2, we let . We now consider two possibilities for : , and .
For any , and for any positive constant , there exists a positive constant such that
for large enough , where is the -NN-FL estimator. (This is shown in Appendix A.11.) In other words, if , then -NN-FL does not achieve a desirable rate. By contrast, with an appropriate choice of and , the -NN-FL estimator attains the rate (ignoring logarithmic terms) by Theorem 3.
If , then the minimum degree in the -graph of the observations in will be at least , with high probability, for some positive constant . (This can be seen as in Lemma 8.) This has two consequences:
Recall that the algorithm from Chambolle and Darbon (2009) has worst-case complexity , where is the number of edges in the graph (although empirically the algorithm is typically much faster, Wang et al., 2016). Therefore, if is too large, then the computations involved in applying the fused lasso to the -NN graph may be too demanding.
The choice of in Theorem 2 leads to an -NN graph with degree (This follows from Proposition 27 in Von Luxburg et al. (2014)). In a different context, El Alaoui et al. (2016) argues that it is desirable for geometric graphs (which generalize -NN graphs) to have degree . This suggests that using leads to an -NN graph that is too dense.
5 Prediction for -Nn-Fl
Recall from Section 2 that we estimate for using (4); this amounts to averaging the predictions for the training observations that are near . We now provide an upper bound for the mean integrated squared error (MISE) of this estimator.
The right-hand side of (17) involves the penalty and the approximation error AErr. We saw in Section 3.2 that the former can be characterized under Assumptions 4–5, or if satisfies Definition 1. We will now show that the piecewise Lipschitz condition (Definition 1) is sufficient to ensure an upper bound on the approximation error.
Througout this section, we take to be Euclidean distance.
6.1 Simulated Data
In this section, we compare the following approaches:
-NN-FL, with held fixed, and treated as a tuning parameter.
-NN-FL, with held fixed, and treated as a tuning parameter.
CART (Breiman et al., 1984), implemented in the
rpart, with the complexity parameter treated as a tuning parameter.
MARS (Friedman, 1991), implemented in the
earth, with the penalty parameter treated as a tuning parameter.
-NN regression (e.g. Stone, 1977), implemented in
Matlabusing the function
knnsearch, with treated as a tuning parameter.
We evaluate each method’s performance using mean squared error (MSE; defined as ). Specifically, we apply each method to 150 Monte Carlo data sets with a range of tuning parameter values. For each method, we then identify the tuning parameter value that leads to the smallest average MSE (where the average is taken over the 150 data sets). We refer to this smallest average MSE as the optimized MSE in what follows.
6.1.1 Fixed , varying
Throughout this section, we consider two scenarios with covariates.
The function is piecewise constant,
The covariates are drawn from a uniform distribution on . The data are generated as in (1) with errors.
Figure 3(b) and Figure 3(d) display the optimized MSE, as a function of the sample size, for Scenarios 1 and 2, respectively. -NN-FL gives the best results in both scenarios. -NN-FL performs a bit worse than -NN-FL in Scenario 1, and very poorly in Scenario 2 (results not shown).
Timing results for all approaches are reported in Figure 3(c). For all methods, the times reported are averaged over a range of tuning parameter values. For instance, for -NN-FL, we fix and compute the time for different choices of ; we then report the average of those times.
6.1.2 Fixed , varying
We now consider two scenarios with .
The function is defined as
and the generative density for is uniform in . The errors in (1) have a distribution.
The function is defined as
where , , and
. Once again, the generative density for is uniform in , and the errors are i.i.d. .
Optimized MSE for each approach is displayed in Figure 4. When is small, most methods perform well; however, as increases, the competing methods quickly deteriorate, whereas -NN-FL continues to perform well.
6.2 Flu Data
We consider a data set that consists of flu activity and atmospheric conditions in Texas between January 1, 2003 and December 31, 2009 across different cities. Our data-use agreement does not permit dissemination of these hospital records. We also use data on temperature and air quality (particulate matter) in these cities, which can be obtained directly from CDC Wonder (http://wonder.cdc.gov/). Using the number of flu-related doctor’s office visits as the dependent variable, we fit a separate nonparametric regression model to each of 24 cities; each day was treated as a separate observation, so that the number of samples is in each city. Five independent variables are included in the regression: maximum and average observed concentration of particulate matter, maximum and minimum temperature, and day of the year. All variables are scaled to lie in . We performed 50 75%/25% splits of the data into a training set and a test set. All models were fit on the training data, with 5-fold cross-validation to select tuning parameter values. Then prediction performance was evaluated on the test set.
We apply -NN-FL with , and -NN-FL with for
. We also fit neural networks (Hagan et al., 1996; implemented in
Matlabusing the functions
train), thin plate splines (Duchon, 1977; implemented using the
fields), and MARS, CART, and random forests, using software described in Section 6.1.
Average test set prediction error (across the 50 test sets) is displayed in Figure 5. We see that -NN-FL and -NN-FL tend to have the best performance. In particular, -NN-FL performs best in 13 out the 24 cities, and second best in 6 cities. In 8 of the 24 cities, -NN-FL performs best.
In this paper, we introduced a two-stage non-parametric regression estimator, by (i) constructing a nearest-neighbors graph over the observations; and (ii) performing the fused lasso over the graph. We have shown that when the -NN or -NN graph is used, the resulting procedure is locally adaptive; furthermore, when the -NN graph is used, the procedure is manifold adaptive. Our theoretical results are bolstered by findings on simulated data and on a flu data set.
Appendix A Proofs
Throughout, given , we denote by the set . For , , with a metric space, we write
and when the context is clear we will simply write instead of . In the case of the space , we will use the notation for the metric induced by the norm , and we will write