Fused Density Estimation: Theory and Methods

05/08/2018 · Robert Bassett, et al.

In this paper we introduce a method for nonparametric density estimation on geometric networks. We define fused density estimators as solutions to a total variation regularized maximum-likelihood density estimation problem. We provide theoretical support for fused density estimation by proving that the squared Hellinger rate of convergence for the estimator achieves the minimax bound over univariate densities of log-bounded variation. We reduce the original variational formulation to a tractable, finite-dimensional quadratic program. Because univariate densities are a special case of densities on geometric networks, this method also provides a useful tool for univariate density estimation. Lastly, we apply this method and assess its performance on examples in the univariate and geometric network settings. We compare the performance of different optimization techniques for solving the problem, and use these results to inform recommendations for the computation of fused density estimators.

1 Introduction

In the pantheon of statistical tools, the histogram remains the primary way to explore univariate empirical distributions. Since its introduction by Karl Pearson in the late 19th century, the form of the histogram has remained largely unchanged. In practice, the regular histogram, with its equal bin widths chosen by simple heuristic formulas, remains one of the most ubiquitous statistical methods. Most methodological improvements on the regular histogram have come from the selection of bin widths—this includes varying bin widths to construct irregular histograms—motivated by thinking of the histogram as a piecewise constant density estimate. In this work, we study a piecewise constant density estimation technique based on total variation penalized maximum likelihood. We call this method fused density estimation (FDE). We extend FDE from irregular histogram selection to density estimation over geometric networks, which can be used to model observations on infrastructure networks like road systems and water supply networks. The use of fusion penalties for density estimation is inspired by recent advances in theory and algorithms for the fused lasso over graphs [padilla, Sharpnack]. Our thesis, that FDE is an important algorithmic primitive for statistical modeling, compression, and exploration of stochastic processes, is supported by our development of fast implementations, minimax statistical theory, and experimental results.

In 1926, [sturges1926choice] provided a heuristic for regular histogram selection where, naturally, the bin width increases with the range and decreases with the number of points. The regular histogram is an efficient density estimate when the underlying density is uniformly smooth, but irregular histograms can 'zoom in' to regions where there is more data and better capture the local smoothness of the density. A simple irregular histogram, known as the equal-area histogram, is constructed by partitioning the domain so that each bin has the same number of points. [denby2009variations] noted that the equal-area histogram can often split bins unnecessarily when the density is smooth and merge bins when the density is variable, and proposed a heuristic method to correct this oversight. Recently, [li2016essential] proposed the essential histogram, an irregular histogram constructed to have the fewest bins while lying within a confidence band of the empirical distribution. While theoretically attractive, in practice its complex formulation is intractable and requires approximation. If the underlying density is nearly constant over a region, then the empirical distribution is well approximated locally by a constant, and hence the essential histogram will tend not to split this region into multiple bins. Such a method is called locally adaptive, because it adapts to the local smoothness of the underlying density.

In Figure 1, we compare FDE to the regular histogram, both of which have 70 bins. Because FDE can be thought of as a bin selection procedure, in this example we recompute the restricted MLE after the bin selection, which is common practice for model selection with lasso-type methods. We see that with 70 bins the regular histogram can capture the variability in the left-most region of the domain but under-smooths in the right-most region. Compare this to FDE, which adapts to the local smoothness of the true density. As a natural extension of one-dimensional data, we will consider distributions that lie on geometric networks—graphs where the edges are continuous line segments—as is common in many infrastructure networks. Another motivation to use total variation penalties is that they are easily defined over any geometric network, in contrast to other methods, such as the essential histogram and multiscale methods. Figure 2 depicts the FDE for data in downtown San Diego. The geometric network is generated from the road network in the area, and observations on the geometric network are the locations of eateries (data extracted from the OpenStreetMap database [OSM]).

Figure 1: A comparison of FDE (left) and the regular histogram (right) of 10,000 data points from a density (red) with varying smoothness—both have 70 bins.
Figure 2: FDE for the location of eateries in downtown San Diego.

Without any constraints, maximum likelihood will select histograms that have high variation (as in Figure 1), so to regularize the problem, we bias the solution toward low total variation. Total variation penalization is a popular method for denoising images, time series, and signals over the vertices of a graph, with many modern computational methods available, such as the alternating direction method of multipliers, projected Newton methods, and split Bregman iteration [tibshirani2005sparsity, rudin1992nonlinear, beck2009fast, Sharpnack]. Distributions over geometric networks, which we consider here, are distinguished from this literature by the fact that observations can occur at any point along an edge of the network. This leads to a variational density estimation problem, which we reduce to a finite-dimensional formulation.

Contribution 1. We show that the variational FDE is equivalent to a total variation penalized weighted least squares problem, enabling fast optimization.

In order to justify the use of FDE, we will analyze the statistical performance of FDE for densities of log-bounded variation over geometric networks. The majority of statistical guarantees for density estimates control some notion of divergence between the estimate and the true underlying density. Several authors have used the $L_2$ loss (mean integrated square error) to evaluate their methods for tuning the bin width of the regular histogram [scott1979optimal, freedman1981histogram, birge1997model]. While it is appealing to use $L_2$ loss, it is not invariant to the choice of base measure, and divergence measures such as $L_1$, Hellinger loss, and the Kullback-Leibler (KL) divergence are preferred for maximum likelihood, an idea pioneered by Le Cam [le2012asymptotics] and furthered by [devroye1985nonparametric, hall1988minimizing]. By appealing to Hellinger loss, [birge2006many] proposed a method for optimal choice of the number of bins in a regular histogram, and we will similarly focus on Hellinger loss.

Contribution 2. We provide a minimax non-parametric Hellinger distance rate guarantee for FDE in the univariate case, over densities of log-bounded variation.

When the log-density lies in a Sobolev space, an appropriate non-parametric approach to density estimation is maximum likelihood with a smoothing splines penalty [silverman]. The smoothing spline method is not locally adaptive, because it does not adjust to the local smoothness of the density or log-density. Epi-splines [RoysetWets] are density estimates formed by maximizing the likelihood such that the density, or log-density, has a representation in a local basis and lies in a prespecified constraint set. [donoho1996density] and [koo1996wavelet] studied wavelet thresholding for density estimation and proved rate and KL-divergence guarantees, respectively. In a related work, [koo2000logspline] considered log-spline density estimation from binned data with stepwise knot selection. [willett2007multiscale] used a recursive partitioning approach to form adaptive polynomial estimates of the density, an approach similar to wavelet decomposition.

Total variation penalization has previously been proposed as a histogram regularization technique in [koenker2007density, sardy2010density, padilla2015nonparametric]. In particular, [padilla2015nonparametric] separately studies the variational form of the fused density estimate and a discrete variant, and provides theoretical guarantees for Lipschitz classes. Our computational results improve on these works by minimizing the variational objective directly, instead of separately proposing discrete approximations to the variational problem. Our theoretical analysis improves on previous work by studying total variation classes directly: we show that FDE achieves the minimax rate in Hellinger risk over all densities of log-bounded variation. Moreover, we consider density estimation on geometric networks, and extend our Hellinger rate guarantees to this novel setting.

Contribution 3. We prove that the same Hellinger distance rate guarantee for the univariate case also holds for any connected geometric network.

1.1 Problem Statement

When considering road systems and water networks, we observe that individual roads or pipes can be modeled as line segments, and the entire network constructed by joining these segments at nodes of intersection. Mathematically, we model this as a geometric network $G = (V, E)$: a finite collection of nodes $V$ and edges $E$, where each edge $e$ is identified with a closed and bounded interval of the real line. Each edge in the network has a well-defined notion of length, inherited from the length of the closed interval. We fix an orientation of $G$ by assigning, for each edge $e$, a bijection between the nodes incident to $e$ and the endpoints of the closed interval associated with $e$. This corresponds to the intuitive notion of "gluing" edges together to form a geometric network. Because we only discuss geometric networks in this paper, we will often refer to them as networks.

A point in a geometric network $G$ is an element of one of the closed intervals identified with edges in $E$, modulo the equivalence of endpoints corresponding to the same node. After assigning an orientation to the network, a point can be viewed as a pair $(e, x)$, where $e$ is an edge and $x$ is a real number in the interval identified with $e$. However, because we wish to emphasize the network as a geometric object in its own right, we will only use this notation when our use of univariate theory makes it necessary.
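To make these objects concrete, the sketch below shows one minimal way to represent an oriented geometric network and points on it. It is illustrative only; the class and field names are ours and are not the data structures of the FDE-Tools package.

```python
from dataclasses import dataclass, field

@dataclass
class GeometricNetwork:
    """A geometric network: nodes, plus edges identified with intervals [0, length].

    Orientation is fixed by storing each edge as an ordered node pair (tail, head),
    with tail mapped to coordinate 0 and head to coordinate `length`.
    """
    nodes: list                                 # node labels
    edges: dict = field(default_factory=dict)   # edge id -> (tail, head, length)

    def add_edge(self, eid, tail, head, length):
        assert length > 0, "edges must have positive length"
        self.edges[eid] = (tail, head, float(length))

# A point on the network is a pair (edge id, coordinate along that edge).
net = GeometricNetwork(nodes=["a", "b", "c"])
net.add_edge("ab", "a", "b", 2.0)   # a road segment of length 2
net.add_edge("bc", "b", "c", 1.5)
point = ("ab", 0.75)                # the point 0.75 along edge "ab"
```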

Figure 3: An example of a geometric network. This network is formed from bike paths on a university campus. Nodes identify intersections of paths. Edges are paths connecting these intersections, and the length of each edge is the length of the corresponding path. In forming the geometric network, we discard all information related to its embedding in $\mathbb{R}^2$, only preserving the network structure and path lengths.

A real-valued function $f$, defined on a geometric network $G$, is a collection of univariate functions $f_e$, defined on the edges of $G$. We require that the function respects the network structure, by which we mean that for any two edges $e_1$ and $e_2$ which are incident at a node $v$, $f_{e_1}(v) = f_{e_2}(v)$. We abuse notation slightly: by referring to $f_e(v)$, we mean $f_e$ evaluated at the endpoint of the interval identified with $v$. A geometric network inherits a measure from its univariate segments in a natural way, as the sum of the Lebesgue measure along each segment. With this measure we have a straightforward extension of the Lebesgue measure to $G$, making $G$ a measurable space.

For any random variable taking values on the network $G$, we will assume that the measure induced by the random variable is absolutely continuous with respect to the base measure $\mu$, and so has density $f_0$. We will abuse notation by using $\mu$ to refer to both the Lebesgue measure and the base measure on a geometric graph; which of these we mean will be clear from context. Furthermore, we assume that the density is non-zero everywhere, so that its logarithm is well defined. Moreover, we will assume the log-density is not arbitrarily variable, and for this purpose we will use the notion of total variation. Let $f : [a, b] \to \mathbb{R}$. The total variation of $f$ is defined as

$TV(f) = \sup \sum_{i=1}^{k-1} |f(x_{i+1}) - f(x_i)|.$

The supremum is over all partitions, or finite ordered point-subsets $a \le x_1 < \dots < x_k \le b$, of $[a, b]$. For a real-valued function $f$ defined on a network $G$, we extend the univariate definition to

$TV_G(f) = \sum_{e \in E} TV(f_e).$

One advantage of the total variation penalty is that it is invariant to the choice of segment length in the geometric network: scaling an edge by a constant multiplier leaves the total variation unchanged. As a consequence, fused density estimation will be invariant to the choice of edge length.
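As a small illustration of these definitions, the snippet below computes the total variation of a piecewise-constant function from its ordered segment values and checks the shift invariance used later in Lemma 2.2. It is a sketch; the function name is ours.

```python
import numpy as np

def tv_piecewise_constant(values):
    """Total variation of a piecewise-constant function on an interval,
    given the values of its constant pieces in order. The supremum in the
    definition is attained by any partition with one point in each piece."""
    values = np.asarray(values, dtype=float)
    return np.abs(np.diff(values)).sum()

v = np.array([0.1, 0.4, 0.2, 0.2, 0.7])
print(tv_piecewise_constant(v))         # 0.3 + 0.2 + 0.0 + 0.5 = 1.0
print(tv_piecewise_constant(v + 3.0))   # shift invariance: still 1.0
# Rescaling the segment lengths also leaves TV unchanged, since TV depends
# only on the ordered values, not on where the jumps occur.
```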

Let $f_0$ be a density on a geometric network $G$, and $X_1, \dots, X_n$ an independent sample identically distributed according to $f_0$. Let $\mathbb{P}_n$ be the empirical measure associated to the sample. We let $\mathbb{P}_n$ act on a function, by which we mean that we take the expectation of that function with respect to $\mathbb{P}_n$. So for any function $g$,

$\mathbb{P}_n g = \frac{1}{n} \sum_{i=1}^{n} g(X_i).$

We will also use $\mathbb{P} g$ to denote $\int_G g \, d\mathbb{P}$ for non-empirical measures $\mathbb{P}$.

Fix $\lambda > 0$. A fused density estimator (FDE) of $f_0$ is a density $\hat f = e^{\hat g}$ such that the log-density $\hat g$ is a minimizer of the following program,

$\min_{g} \; -\mathbb{P}_n g + \lambda\, TV_G(g) \quad \text{subject to} \quad \int_G e^{g} \, d\mu = 1, \qquad (1)$

where the minimum is taken over all functions $g$ for which the expression is finite and the resulting $e^g$ is a valid density. That is, $g \in \mathcal{G}$, where

$\mathcal{G} = \left\{ g : TV_G(g) < \infty, \; \int_G e^{g} \, d\mu = 1 \right\}.$

The set $\mathcal{F} = \{ e^g : g \in \mathcal{G} \}$ will be referred to as the set of densities with log-bounded variation. Indeed, the integration constraint on elements of $\mathcal{G}$ makes them log-densities. Note that densities in $\mathcal{F}$ are necessarily bounded above and away from zero, as a result of the total variation condition.

The program in (1) is variational, because it is minimized over an infinite dimensional function space. It is quite common for variational problems in non-parametric statistics to involve a reproducing kernel Hilbert space (RKHS) penalty, as opposed to a total variation penalty [wahba1990spline]. In the RKHS setting, the Hilbert space structure allows us to establish representer theorems, which reduce the variational program to an equivalent finite dimensional one, so that it can be solved numerically. The space of functions of bounded variation, on the other hand, is an example of a more general Banach space, so RKHS results cannot be applied in this setting. In the next section, we discuss representer theorems for (1), and further show that it can be solved using a sparse quadratic program.

2 Computation

In this section we provide results toward the computation of fused density estimators. The key challenge is the variational formulation of the Fused Density Estimator (1). To this end, we prove that solutions to the variational problem can be finitely parametrized. Moreover, we show that after applying this representer theorem, the finite-dimensional analog of (1) has an equivalent formulation as a total variation penalized least-squares problem. Our main theorem of this section, which reduces the computation of a fused density estimator to a weighted fused-lasso problem, follows.

Theorem 2.1 (Informal).

For $\lambda > 1/(2n)$, the FDE exists almost surely. It can be computed as the minimizer of a finite-dimensional convex, sparse, and total variation penalized quadratic program. That is, FDEs are solutions to an optimization of the form

$\min_{\beta} \; \tfrac{1}{2}\, (y - \beta)^\top W (y - \beta) + \lambda\, \| D \beta \|_1. \qquad (2)$

The details of this theorem, by which we mean the constructions of $y$, $W$, and $D$, and the connection between the minimizer of (2) and the FDE $\hat f$, are given later in this section.

Theorem 2.1 demonstrates that the FDE (1) can be computed as a specific incarnation of the generalized lasso, for which there are well known fast implementations [arnold2016efficient]. In practice we will solve the dual to this problem, which we discuss in Proposition 2.5. Theorem 2.4 is a precise restatement of Theorem 2.1. In order to prove it, we proceed through a series of important lemmas. Lemma 2.2 proves that minimizers of (1) exist almost surely for $\lambda > 1/(2n)$, below which the solution degenerates to Dirac masses at the observations. The almost surely qualification pertains to the maximum number of observations which occur at any single point of the network; if observations occur simultaneously then $\lambda$ must be increased to overcome degeneracy. Lemma 2.2 further transforms the FDE problem from constrained to unconstrained by removing the integration constraint. From this new formulation, Lemma 2.3 shows that the search space for the fused density estimator problem can be reduced from functions of bounded variation to an equivalent, finite-dimensional version. Theorem 2.4 performs the final step in the proof, demonstrating that the previously derived finite-dimensional problem can be solved using a penalized quadratic program. The last subsection in this section is tangential, but sheds further light on the structure of fused density estimators. Proposition 2.6, which we refer to as the Ordering Property, qualifies the local adaptivity of fused density estimators by describing their local structure. Omitted proofs can be found in Appendix A.

2.1 Main Computational Results

Our first lemma reduces the fused density estimator problem, (1), to an unconstrained program where the integral constraint is incorporated into the objective. This result is originally due to Silverman [silverman], who proved the result in the context of univariate density estimation and Sobolev-norm penalties. Minor modifications allow us to extend it to geometric networks and the non-Sobolev total variation penalty.

Lemma 2.2.

The problem

$\min_{g} \; -\mathbb{P}_n g + \int_G e^{g} \, d\mu + \lambda\, TV_G(g), \qquad (3)$

where the minimum is over all functions $g$ on $G$ of bounded variation, gives an equivalent formulation of (1), because minimizers $\hat g$ of (3) satisfy $\int_G e^{\hat g} \, d\mu = 1$.

We remark that the objective in Lemma 2.2 is equivalent to a total variation penalized Poisson process likelihood, where the log-intensity is $g$, so our computations also apply to that setting. Lemma 2.2 gives that the fused density estimator definition (1) can instead be solved via the unconstrained problem (3) over all functions on $G$ of bounded variation. An alternative interpretation of the lemma is that the Lagrange multiplier associated to the constraint in (1) is $1$. The next lemma reduces the unconstrained problem (3) to an equivalent finite-dimensional version. The proof technique is analogous to similar results in [MammenvandeGeer]. In the context of reproducing kernel Hilbert spaces, results that reduce variational problem formulations to finite-dimensional analogs are referred to as representer theorems, e.g. [wahba1990spline]. We will also use this language to describe our result, even though we are in a more general Banach space setting. The result demonstrates that FDEs have large, piecewise constant regions, which is a well known property of fusion penalties [tibshirani2005sparsity, Kim, Sharpnack].

Lemma 2.3 (Representer Theorem).

A fused density estimator must be piecewise constant along each edge. All discontinuities are contained in the set $\{X_1, \dots, X_n\} \cup V$: the observations and the nodes of $G$.

Using Lemma 2.3, we can parametrize fused density estimators with three finite-dimensional vectors: the values of the log-density at the observation points, $\theta$; at the vertices of $G$, $\beta$; and on the piecewise constant segments, $z$. For simplicity, we will assume that no two observations occur at the same location, a condition that we can and will relax in the remark following Theorem 2.4.

Let $n_e$ denote the number of observations along edge $e$. We will denote by $\theta_{e,i}$ the value, in the vector $\theta$, at the $i$th ordered observation along edge $e$. For a FDE with log-density $\hat g$, $z_{e,i}$ is the value taken by $\hat g$ between the $(i-1)$th and $i$th observation in the interval associated with $e$, where the $0$th and $(n_e+1)$th observations are set to be the endpoints of that interval. Similarly, $\ell_{e,i}$ is the distance between the $(i-1)$th and $i$th observation, with the same convention that the $0$th and $(n_e+1)$th observations are the endpoints of the interval. This gives the length of the segment between two observations, over which the FDE is piecewise constant. We denote by $\beta_v$ the value in $\beta$ at the vertex $v$. For a given node $v$, let $E_v$ denote the set of edges which are incident to $v$, and denote by $z_{e(v)}$ the value of $z$ on the segment in $e$ which is incident to $v$. The problem (3) becomes

$\min_{\theta, z, \beta} \; \sum_{e \in E} \Big[ -\frac{1}{n} \sum_{i=1}^{n_e} \theta_{e,i} + \lambda \sum_{i=1}^{n_e} \big( |\theta_{e,i} - z_{e,i}| + |\theta_{e,i} - z_{e,i+1}| \big) + \sum_{i=1}^{n_e+1} \ell_{e,i}\, e^{z_{e,i}} \Big] + \lambda \sum_{v \in V} \sum_{e \in E_v} |\beta_v - z_{e(v)}|.$

The first summand, over the edges in $E$, gives the log-likelihood term, the total variation along each edge, and the integration term. The second summand gives the total variation at the nodes of the geometric graph.

Let $F(\theta, z, \beta)$ denote the objective function above. We show that this problem can be further reduced by removing the $\theta$ variables. Indeed, a necessary condition for any $(\theta, z, \beta)$ to minimize $F$ is that each $\theta_{e,i}$ minimizes $h(\theta_{e,i}) = -\theta_{e,i}/n + \lambda |\theta_{e,i} - z_{e,i}| + \lambda |\theta_{e,i} - z_{e,i+1}|$, the summands of $F$ in which it appears. But $h$ does not have a lower bound when $\lambda < 1/(2n)$. Furthermore, the set of minimizers of $h$ is unbounded when $\lambda = 1/(2n)$, and $h$ has a unique minimizer when $\lambda > 1/(2n)$. These facts are clear from the graph of $h$, given in Figure 4.

Figure 4: The function $h$ as the maximum of three affine functions. It attains a unique minimum at $\max(z_{e,i}, z_{e,i+1})$ when $\lambda > 1/(2n)$.

It is the maximum of three affine functions, with slopes $-1/n - 2\lambda$, $-1/n$, and $-1/n + 2\lambda$, from which we conclude that the minimum of $h$ is attained uniquely at $\theta_{e,i} = \max(z_{e,i}, z_{e,i+1})$ when $\lambda > 1/(2n)$. We have shown that $\lambda = 1/(2n)$ is a critical point for the existence of FDEs, below which the total variation penalty is not strong enough to prevent degenerate solutions to (1). For $\lambda > 1/(2n)$, the value of an FDE at the observations is well-behaved, and substituting $\theta_{e,i} = \max(z_{e,i}, z_{e,i+1})$ reduces $F$ to an objective in $(z, \beta)$ alone.
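A quick numeric check of this claim, using hypothetical values and the reconstruction of $h$ above:

```python
import numpy as np

# h(t) = -t/n + lam*|t - a| + lam*|t - b| should attain its unique minimum
# at max(a, b) whenever lam > 1/(2n). All values here are illustrative.
n, lam, a, b = 50, 0.05, -0.3, 0.8   # lam = 0.05 > 1/(2n) = 0.01

def h(t):
    return -t / n + lam * (np.abs(t - a) + np.abs(t - b))

grid = np.linspace(-5.0, 5.0, 200001)
t_star = grid[np.argmin(h(grid))]
print(t_star, max(a, b))             # both approximately 0.8
```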

Because $\max(a, b) = \tfrac{1}{2}(a + b) + \tfrac{1}{2}|a - b|$, we have the further equivalence of this reduced problem with one in which each $\max(z_{e,i}, z_{e,i+1})$ term is replaced by a linear term and an additional absolute-value (fusion) term.

By [rockafellar, Theorem 23.8], a necessary and sufficient condition for $(z, \beta)$ to solve this problem is that $0$ belongs to the subdifferential of the objective at $(z, \beta)$.

Here we make an important point. The subdifferential of each linear term is constant, and the subdifferential of each absolute-value term $|z_{e,i} - z_{e,i+1}|$ is piecewise constant, depending only on the ordering of $z_{e,i}$ and $z_{e,i+1}$. Similarly, the subdifferential of each $|\beta_v - z_{e(v)}|$ term is piecewise constant and again depends only on the ordering of its terms. Lastly, the subdifferential of the integration term is given by its gradient: the $(e,i)$th coordinate of the gradient is $\ell_{e,i}\, e^{z_{e,i}}$.

Consider the transformation $z_{e,i} \mapsto e^{z_{e,i}}$, $\beta_v \mapsto e^{\beta_v}$. This transformation preserves the ordering of the elements of $z$ and $\beta$, so the subdifferential of each absolute value term is invariant under it. Pursuing this line of reasoning gives the following theorem. In order to facilitate its statement, we briefly establish some notation.

The total variation of a FDE on $G$, which has been parametrized into vectors $z$ and $\beta$, can be expressed as a sum of pairwise distances between values in $z$ and $\beta$. That is, there are sets $S_1$ and $S_2$ of index pairs such that

$TV_G(\hat g) = \sum_{(i,j) \in S_1} |z_i - z_j| + \sum_{(i,j) \in S_2} |z_i - \beta_j|.$

This formulation depends on the underlying graph structure and the locations of the observations. The right-hand side of this expression can be written as the $\ell_1$ norm of the vector $D_1 z + D_2 \beta$, where $D_1$ and $D_2$ are matrices with elements in $\{-1, 0, 1\}$, each having one row per index pair. We denote by $x_1, \dots, x_n$ the locations of the observations on $G$, and we further partition these into ordered observations along each edge, so that $x_{e,i}$ denotes the $i$th observation along edge $e$. Recall the definition of the segment lengths $\ell_{e,i}$, and let $L$ be the diagonal matrix with $\ell$ on its diagonal. Lastly, define the vector $y$ such that $y_{e,i} = k_{e,i} / (2 n \ell_{e,i})$, where $k_{e,i}$ is the number of observations adjacent to the $i$th segment of edge $e$.

Theorem 2.4.

Let $\lambda > 1/(2n)$. Then the fused density estimator exists almost surely. It can be computed as follows. Let $u$ be a vector with indices enumerating the constant portions of the fused density estimator $\hat f$, such that $u_{e,i}$ denotes the value of the fused density estimator on the open interval between $x_{e,i-1}$ and $x_{e,i}$, or between an observation and the end of the edge if $i = 1$ or $i = n_e + 1$. Let $w$ be a vector with indices enumerating the nodes in $V$, such that $w_v$ denotes the value of $\hat f$ at node $v$. Then the fused density estimator for this sample is the minimizer of

$\min_{u, w} \; \tfrac{1}{2}\, (u - y)^\top L\, (u - y) + \lambda\, \| D_1 u + D_2 w \|_1. \qquad (4)$
Proof.

The proof follows directly from the line of reasoning before the theorem’s statement. Details can be found in Appendix A. ∎

Remark. The condition on $\lambda$ is an important one. As discussed previously, when $\lambda < 1/(2n)$ the total variation penalty is not strong enough to balance the likelihood term, and minimizers of (1) are degenerate. The almost surely condition is simply a requirement that no two observations occur at the same location, and that no observations occur at nodes of the geometric network. With a slight modification of the assumption on $\lambda$, Theorem 2.4 can be extended to the setting where multiple observations are allowed at a single location. This extension also allows observations to occur at nodes of the geometric network. In practice, this extension may be useful when dealing with imperfect data, though we will not focus on it here because coincident observations are a measure zero event in the density estimation paradigm. For completeness, we include the extension as Theorem A.1 of Appendix A.
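For the univariate case (a single edge, no interior nodes), the pieces of the quadratic program can be assembled in a few lines. The construction below follows our reading of the reduction above (in particular the choice of $y$, which splits each observation's $1/n$ mass between its two adjacent segments), so it should be treated as an illustrative sketch rather than the paper's exact matrices.

```python
import numpy as np
import scipy.sparse as sp

def univariate_fde_qp(x, a, b):
    """Assemble (y, lengths, D) for the fused density QP on a single edge [a, b].

    Segments are the n+1 gaps between consecutive sorted observations and the
    edge endpoints. y is a split-count histogram: each observation contributes
    half of its 1/n mass to each adjacent segment, so y integrates to one.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    knots = np.concatenate([[a], x, [b]])
    lengths = np.diff(knots)                 # l_i > 0 almost surely
    counts = np.full(n + 1, 2.0)             # interior segments touch 2 observations
    counts[[0, -1]] = 1.0                    # boundary segments touch 1
    y = counts / (2.0 * n * lengths)
    # First-difference (chain incidence) matrix on the n+1 segment values.
    D = sp.eye(n + 1, format="csc")
    D = (D[1:] - D[:-1]).tocsc()
    return y, lengths, D
```

On a general network, $D$ would additionally carry the node rows encoded by $D_2$; for a chain graph they are unnecessary.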

Methods for computing solutions to the problem in Theorem 2.4, a total variation regularized quadratic program, are well established. As in [Kim], we rely on solving the dual quadratic program. For convenience, we write the dual problem as a minimization instead of its typical maximum formulation.

Proposition 2.5.

The dual problem to (4) is

$\min_{\nu} \; \tfrac{1}{2}\, \nu^\top D_1 L^{-1} D_1^\top \nu - \nu^\top D_1 y \quad \text{subject to} \quad \|\nu\|_\infty \le \lambda, \; D_2^\top \nu = 0. \qquad (5)$

The primal solution can be recovered from the dual through the expression

$u = y - L^{-1} D_1^\top \nu.$

A more general statement of Proposition 2.5, which suits the more general statement of Theorem 2.4, can be found in Appendix A. It is worth noting that strong duality between the primal and dual problems in (4) and (5) follows immediately. Indeed, both are extended linear-quadratic programs in the sense of [VaAn]. By Theorem 11.42 in [VaAn], strong duality holds, and both the primal and dual problems attain their minimum values, if and only if (4) is bounded. This is guaranteed by the assumption on $\lambda$ in Theorem 2.4. Furthermore, the fact that the minimum of (4) is attained gives the existence of FDEs, as asserted in Theorem 2.4.
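Under the same caveats as the previous sketch, the single-edge dual (where the node block $D_2$ is absent) is a box-constrained QP that OSQP handles directly. This sketch solves our reconstruction of (5) and recovers the primal density values.

```python
import numpy as np
import osqp
import scipy.sparse as sp

def solve_fde_dual(y, lengths, D, lam):
    """Solve the (reconstructed) dual (5) with OSQP for a single edge and
    recover the primal density values u = y - L^{-1} D' nu."""
    Linv = sp.diags(1.0 / lengths)
    P = (D @ Linv @ D.T).tocsc()        # D L^{-1} D' is tridiagonal here
    q = -np.asarray(D @ y).ravel()      # linear term -(D y)
    m = D.shape[0]
    prob = osqp.OSQP()
    prob.setup(P=P, q=q, A=sp.eye(m, format="csc"),
               l=-lam * np.ones(m), u=lam * np.ones(m), verbose=False)
    nu = prob.solve().x
    return y - Linv @ (D.T @ nu)        # primal recovery
```

Because $D$ is the chain difference matrix in this setting, $P$ is tridiagonal, which is the banded structure that the interior point discussion in Section 3.3 exploits.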

2.2 Additional Properties of FDEs

In this section, we state a result on the local structure of an FDE and provide additional comments on its implementation details. The result is intuitive: along an edge, the value of a piecewise constant segment is inversely related to the length of the segment, relative to adjacent segments. Since shorter segments suggest higher probability in the corresponding region, this property demonstrates local structure of the estimator which aligns with its essential global behavior.

Proposition 2.6 (Ordering Property).

Let $\ell_{e,i}$ and $\ell_{e,i+1}$ be the lengths of two adjacent segments interior to an edge $e$, in the sense that neither contains an endpoint of the interval associated with $e$. Assume further that no two observations occur at the same location. Then $\ell_{e,i} < \ell_{e,i+1}$ implies that $u_{e,i} \ge u_{e,i+1}$. Similarly, $\ell_{e,i} > \ell_{e,i+1}$ implies $u_{e,i} \le u_{e,i+1}$.

Up to this point in our analysis, we have discussed the computation of the FDE without consideration for preprocessing the data or postprocessing the resulting FDE. Since the computation and rates of convergence of the FDE represent the bulk of our contribution, we will maintain this perspective in the remainder of the paper. It is worth mentioning, however, that FDE is amenable to pre- and postprocessing. Handling multiple observations at a single location in Theorem 2.4 makes initial binning or minor discretizations of data (such as projecting observations onto a geometric network) straightforward. Moreover, the FDE can be viewed exclusively as a method for generating adaptive bin widths, where the resulting bins can then be fit to the data as in a regular histogram. This approach performs model selection (via FDE) and model fit (via a post-selection MLE) of the histogram separately, and is common practice in model selection using the lasso and related methods [PostSelectLasso, GeneralPostSelect]. When FDE is used exclusively to find bins, it becomes a change point localization method, instead of a nonparametric density estimator as in its original formulation. Though FDE is amenable to these examples of pre- and postprocessing, we will examine the FDE as a density estimator in the remaining sections.
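As an illustration of the post-selection refit just described, the sketch below merges segments where the FDE is constant and refits each resulting bin's height by maximum likelihood; the helper name and merge tolerance are ours.

```python
import numpy as np

def postselection_mle(x, knots, u, tol=1e-8):
    """Merge adjacent segments where the FDE values agree, then refit each
    merged bin's height by MLE (count / (n * width)), as in a histogram.
    `knots` are the n+2 segment endpoints, `u` the n+1 FDE segment values."""
    jumps = np.flatnonzero(np.abs(np.diff(u)) > tol)   # boundaries the FDE kept
    edges = np.concatenate([[knots[0]], knots[jumps + 1], [knots[-1]]])
    counts, _ = np.histogram(x, bins=edges)
    widths = np.diff(edges)
    return edges, counts / (len(x) * widths)
```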

We also make some suggestions on the selection of $\lambda$. The choice of $\lambda$ leads to a fixed number of piecewise constant portions of the fused density estimator. In this sense, the choice of the parameter is analogous to choosing the number of bins in histogram estimation. One can tune this selection with information criteria (IC) such as AIC or BIC by selecting the FDE, over a grid of $\lambda$ values, that minimizes the IC. Each of these ICs requires the specification of the degrees of freedom, which can be set to the number of selected piecewise constant regions in the graph, as is done in the Gaussian case [Sharpnack, tibshirani2012degrees]. Alternatively, one could use cross-validation as a selection criterion. Implementing cross-validation is often practical for large problems because the sparse QP in (5) can be solved very quickly, as we will see in the next section.
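A sketch of the information-criterion tuning just described, in terms of the earlier helpers. Taking the larger adjacent segment value as the density at an observation follows the reduction in Section 2; the thresholds are ours.

```python
import numpy as np

def bic_for_lambda(x, a, b, lam):
    """BIC for one lambda: -2 log-likelihood + df * log(n), with df set to
    the number of fused (constant) pieces, as suggested above."""
    y, lengths, D = univariate_fde_qp(x, a, b)
    u = solve_fde_dual(y, lengths, D, lam)
    theta = np.maximum(u[:-1], u[1:])               # density at each observation
    loglik = np.sum(np.log(np.maximum(theta, 1e-300)))
    df = 1 + int(np.sum(np.abs(np.diff(u)) > 1e-8))
    return -2.0 * loglik + df * np.log(len(x))

# Example grid search (the grid's endpoints are illustrative):
# lam_grid = np.geomspace(1.0 / len(x), 10.0, 30)
# lam_best = min(lam_grid, key=lambda l: bic_for_lambda(x, a, b, l))
```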

3 Experiments

We have established a tractable formulation of the fused density estimator in (5). Quadratic programming is a mature technology, so reducing FDE to a quadratic program dramatically improves its computation. Quadratic program solvers designed to leverage sparsity in the $D_1$ and $D_2$ matrices allow the optimization portion of fused density estimation to scale to large networks and many observations.

In this section we compute FDEs on a number of synthetic and real-world examples.¹ We evaluate the performance of different optimization methods and provide recommendations for solvers which implement those methods. To facilitate accessibility and customization of these tools, each of the solvers we consider is open source and compares favorably with commercial alternatives.

¹These examples can be found at github.com/rbassett3/FDE-Tools, which also includes a Python package for fused density estimation on geometric networks.

3.1 Univariate Examples

We first evaluate fused density estimators in the context of univariate density estimation, where the geometric network is simply a single edge connecting two nodes. The penalty matrix $D_1$ is especially simple in this setting, corresponding to an oriented edge-incidence matrix of a chain graph. Figure 5 contains fused density estimators of the standard normal, exponential, and uniform densities, each derived from an iid sample. The parameter $\lambda$ in these experiments was selected by cross-validation.
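For instance, the normal panel of Figure 5 can be reproduced in outline with the helpers sketched in Section 2. The sample size and $\lambda$ below are ours, not the experiment's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
a, b = x.min() - 0.5, x.max() + 0.5
y, lengths, D = univariate_fde_qp(x, a, b)
u = solve_fde_dual(y, lengths, D, lam=0.05)
print("total mass:", float(np.dot(u, lengths)))   # approximately 1
```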

Figure 5: Univariate densities and fused density estimators

3.2 Geometric Network Examples

We next evaluate FDEs on geometric networks. For each of these examples, the underlying geometric network is extracted from the OpenStreetMap (OSM) database [OSM]. Figure 6 is a fused density estimator whose domain is the road network in a region of the city of Baghdad. Observations are the locations of terrorist incidents which occurred in this region from 2013 to 2016, according to the Global Terrorism Database [GTD]. The density we attempt to infer is the distribution of the location of terrorist attacks in this region of the city.

Figure 6: An FDE for the location of terrorist attacks in a neighborhood of Baghdad. The detected hotspot contains the streets and alleys near a hospital.
Figure 7: A fused density estimator for artificial observations on Monterey’s road network

Figure 7 is an FDE on the road network in Monterey, California. The observations were generated according to a multivariate normal distribution and projected onto the nearest waypoint in the OpenStreetMap dataset. These examples of FDEs on geometric networks illustrate some important properties of the estimator. The FDEs clearly respect the network topology. This is most obviously demonstrated in the Monterey example, where the red and light green regions, which correspond to elevated portions of the density, are sparsely connected regions of the network. This is intuitive because sparsely connected regions impact the fusion penalty less severely than highly connected regions, and it is one way that FDEs reflect the underlying network structure. The Baghdad and Davis examples demonstrate that FDEs can also be used for hot spot localization, especially in low-data circumstances. Lastly, we note that FDEs partition the geometric network into level sets, thereby forming various regions of the network into clusters. This clustering is an interesting aspect of FDEs, and suggests they could be used to classify regions into areas of high and low priority.

3.3 Algorithmic Concerns

The two most prevalent methods for solving sparse quadratic programs are interior point algorithms and the alternating direction method of multipliers (ADMM). Interior point methods to solve problems of the form (4) were introduced by [Kim]. Interior point approaches have the benefit of requiring few iterations for convergence. The cost per iteration, however, depends crucially on the structure of $D_1$ and $D_2$ when performing a Newton step on the relaxed KKT system. In the case of univariate fused density estimators, the Newton step requires inversion of a banded matrix, one which has its nonzero elements concentrated along the diagonal. Leveraging the banded structure allows inversion to be performed in linear time, which is crucial to the performance of the algorithm. For further details on interior point methods, we refer the reader to [WrightIP, ConvOpt, Nesterov].

The alternating direction method of multipliers proceeds by forming an augmented Lagrangian function and updating the primal and dual variables sequentially. More details can be found in [ADMM, Bertsekas]. Compared to interior point methods, ADMM usually requires more iterations, but each update is less expensive. In this section, we compare the performance of these algorithms on fused density estimation problems. A comparison between the methods on the related problem of trend filtering can be found in [Sharpnack], where the algorithmic comparison pertained only to the two-dimensional grid graph setting. Their results favor the ADMM approach, though the regularity of that graph structure makes generalizing to arbitrary graphs difficult.

For software, we use the Operator Splitting Quadratic Program (OSQP) solver and CVXOPT. These are mature sparse QP solvers that use ADMM and interior point algorithms, respectively. They are both open source, and compare favorably to commercial solvers [OSQP, QPExper]. Our choice to use these solvers instead of custom implementations reflects that (i) these tools are representative of what is available in practice, (ii) outsourcing this portion to other solvers reduces the ability for subtle differences in implementation to favor one method over the other, and (iii) these projects are production-quality, so their implementations are likely to be of higher quality than custom implementations. We first compare ADMM and interior point methods on univariate fused density estimator problems. We perform 200 simulations, sampling 100 data points from each distribution. We let $\lambda$ range from the lower bound on the parameter in Theorem 2.4, $1/(2n)$, to an upper bound which selects a constant or near-constant density. We report in-solver time, in seconds, and do not include the time required to convert to the sparse formats required for each solver.
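The harness we have in mind is nothing more than repeated timed solves. A minimal version for the OSQP route follows; note that, unlike the in-solver times reported below, this simple version also counts OSQP's setup.

```python
import time
import numpy as np

def time_solves(x, a, b, lam, reps=200):
    """Mean and standard deviation of wall-clock solve time over repetitions."""
    y, lengths, D = univariate_fde_qp(x, a, b)
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        solve_fde_dual(y, lengths, D, lam)
        times.append(time.perf_counter() - t0)
    return float(np.mean(times)), float(np.std(times))
```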

Density
Exponential
Normal
Uniform
Table 1: Mean and standard deviation of run time (s) for univariate OSQP experiments

Density
Exponential
Normal
Uniform
Table 2: Mean and standard deviation of run time (s) for univariate CVXOPT experiments

In these experiments, the interior point solver terminated in a consistently small number of iterations. The number of iterations for ADMM was less consistent, ranging from a few hundred to a few thousand.

For the geometric network case, we performed experiments using four examples: the San Diego and Baghdad datasets from Figures 2 and 6, in addition to similar datasets in Davis, California. One of these is a fused density estimator whose domain is the road network in downtown Davis, and the other is on the entire city of Davis, our largest example in this paper, which has 19,000 variables and 25,000 constraints in the dual formulation (5). We choose $\lambda$ in a range that progresses from overfitting to underfitting the data. By overfit, we mean that we choose $\lambda$ as small as possible while keeping the fused density estimator problem feasible. By underfit, we mean that the fused density estimator is a constant function. We record '-' when a solver does not run to successful completion. All experiments were run on a computer with 8 GB of memory, an Intel processor with four cores at 2.50 GHz, and a 64-bit Linux operating system.

λ parameter
Example Overfit Middle Underfit
Baghdad
San Diego
Downtown Davis
Davis
Table 3: OSQP run times (s) for geometric network examples
λ parameter
Example Overfit Middle Underfit
Baghdad
San Diego
Downtown Davis
Davis -
Table 4: CVXOPT run times (s) for geometric network examples

From these experiments we see that ADMM outperforms interior point methods on the geometric network examples. The lack of regularity in the matrices $D_1$ and $D_2$, and the large-scale matrix factorizations associated with the Newton steps, limit interior point methods in comparison to ADMM. On smaller, well-structured problems, like the univariate examples, interior point methods are often faster. On these well-structured problems, however, the gain in performance is negligible (on the order of a tenth of a second). On the other hand, the speed and versatility of OSQP, especially in the context of large, irregular network structure, leads us to recommend ADMM as the method to solve the fused density estimator problem (5). This supports the suggestion of using ADMM for trend filtering in [Sharpnack], and extends their recommendation beyond the grid graph.

4 Statistical Rates

In this section we prove a squared Hellinger rate of convergence for fused density estimation when the true log-density is of bounded variation. Hellinger distance is defined as

$H(f, g) = \left( \frac{1}{2} \int_G \big( \sqrt{f} - \sqrt{g} \big)^2 \, d\mu \right)^{1/2},$

where $\mu$ is the base measure over the edges in the geometric network $G$; in the univariate setting, this is just the Lebesgue measure. The factor of $1/2$ is a convention that ensures that the Hellinger distance is bounded above by $1$. The Hellinger distance is a natural choice for quantifying rates of convergence for density estimators because it is tractable for product measures and provides bounds for rates in other metrics [LeCam73, LeCamBook, Gibbs]. The squared Hellinger risk of an estimator $\hat f$ for $f_0$ is $\mathbb{E}\, H^2(\hat f, f_0)$. The minimax squared Hellinger risk over a set of densities $\mathcal{F}$, for a sample size $n$, is

$\min_{\tilde f} \max_{f_0 \in \mathcal{F}} \mathbb{E}_{f_0} H^2(\tilde f, f_0).$

The minimum is over all estimators $\tilde f$, which are measurable maps from the sample space of $(X_1, \dots, X_n)$ to $\mathcal{F}$.
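For piecewise-constant densities on shared segments, such as FDEs, the integral defining $H$ is a finite sum, as the short sketch below computes (the function name is ours).

```python
import numpy as np

def squared_hellinger(f, g, lengths):
    """H^2(f, g) = (1/2) * sum_i l_i * (sqrt(f_i) - sqrt(g_i))^2 for two
    piecewise-constant densities with common segment lengths l_i."""
    f, g, lengths = (np.asarray(v, dtype=float) for v in (f, g, lengths))
    return 0.5 * np.sum(lengths * (np.sqrt(f) - np.sqrt(g)) ** 2)

# Two densities on segments of lengths 1 and 1/3 (each integrates to one).
f = np.array([0.5, 1.5])
g = np.array([0.75, 0.75])
lengths = np.array([1.0, 1.0 / 3.0])
print(squared_hellinger(f, f, lengths))   # 0.0
print(squared_hellinger(f, g, lengths))   # strictly between 0 and 1
```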

We find that fused density estimation achieves a rate of convergence in squared Hellinger risk which matches the minimax rate over all univariate densities in $\mathcal{F}$, the densities of log-bounded variation where the underlying geometric network is simply a compact interval. In this sense, univariate FDE has the best possible squared Hellinger rate of convergence over this function class. The rate we attain is $n^{-2/3}$, and the equivalence of rates is asymptotic. On an arbitrary connected geometric network, minimax rates for density estimation can depend on the network, but our results demonstrate that FDE on a geometric network has squared Hellinger rate at most the univariate minimax rate.

We begin by establishing the minimax rate over the class $\mathcal{F}$, which gives a lower bound on the squared Hellinger rate for fused density estimation. To establish the lower bound, it is sufficient to examine the minimax rate of convergence over a set of densities contained in $\mathcal{F}$. Fixing a constant $B > 0$ and a compact interval $I$, we consider the set of densities supported on $I$ whose log-densities have total variation at most $B$.

Recall that, for a given radius $\varepsilon$, the packing entropy of a set with respect to a metric is the logarithm of its packing number, the size of the largest collection of points in the set which are at least $\varepsilon$-separated with respect to the metric. Because these densities are bounded below, the packing entropies in the Hellinger and $L_2$ metrics are of the same order. From Example 6.4 in [YangBarron], we have that this class has packing entropy of order $1/\varepsilon$. Applying Theorem 5 from [YangBarron] gives the minimax squared Hellinger rate over these densities as $n^{-2/3}$. In Theorem 4.2, we show that the FDE attains the rate $n^{-2/3}$ over the larger class $\mathcal{F}$. Therefore, the minimax squared Hellinger rate over $\mathcal{F}$ must also equal $n^{-2/3}$, so we have proven the following theorem. For sequences $a_n$ and $b_n$, we write $a_n \asymp b_n$ if $a_n = O(b_n)$ and $b_n = O(a_n)$.

Theorem 4.1.

The minimax squared Hellinger rate over $\mathcal{F}$, the set of densities with log-density of bounded variation, is $n^{-2/3}$. That is,

$\min_{\tilde f} \max_{f_0 \in \mathcal{F}} \mathbb{E}_{f_0} H^2(\tilde f, f_0) \asymp n^{-2/3}.$

To prove the FDE rate of convergence for univariate density estimation, we extend techniques developed for the theory of M-estimators [vandeGeer] and locally adaptive regression splines in Gaussian models [MammenvandeGeer]. A detailed proof of our main result can be found in Appendix A. This rate bound for FDE is based on novel empirical process bounds for log-densities of bounded variation, and these are used in conjunction with peeling arguments to provide a uniform bound on the Hellinger error. The empirical process bounds in Appendix A.3 rely on new Bernstein difference metric covering number bounds for functions of bounded variation, which can be found in Appendix B. We extend the FDE rates from the univariate setting to arbitrary geometric networks in Subsection 4.2; this requires embedding the geometric network into the real line. This embedding is constructed from the depth-first search algorithm, a technique used in [padilla] for regression over graphs, and is described in Appendix A.

The remainder of this section is organized as follows. In Subsection 4.1, we provide a proof sketch of the squared Hellinger rate of convergence for the univariate FDE. In Appendix A.3, we detail the lemmas used to prove the main result. In Subsection 4.2, we extend these rate results from the univariate setting to arbitrary geometric networks.

4.1 Upper Bounds for Rate of Univariate FDE

In this subsection we prove a squared Hellinger rate of $n^{-2/3}$ for univariate fused density estimation. Let the geometric network $G$ be a closed interval (a single edge connecting two nodes). Recall the definition of $\mathcal{F}$ as the set of densities with log-density of bounded variation. Let $f_0 \in \mathcal{F}$ be a fixed density on $G$, so that the total variation $TV(\log f_0)$ is constant as $n$ increases.

Theorem 4.2.

Let $\hat f$ be the fused density estimator of an iid sample of $n$ points drawn from a univariate density $f_0 \in \mathcal{F}$. There is an $n$-dependent sequence $\lambda_n$ such that $\lambda_n > 1/(2n)$, so the FDE is well defined, and

$\mathbb{E}\, H^2(\hat f, f_0) = O\!\left( n^{-2/3} \right).$

Combined with the lower bound in Theorem 4.1, this gives that univariate fused density estimation attains the minimax rate over densities in $\mathcal{F}$.

Proof Sketch (Detailed proof in Appendix A).

In order to control the Hellinger error for FDE, we rely on the fact that the FDE is the minimizer of (1). We derive an inequality involving the squared Hellinger distance, an empirical process, and fusion-penalty terms. This inequality (and in general inequalities serving this purpose; see [vandeGeer]) is referred to as a basic inequality. To reduce notation, we introduce the shorthand $\hat g = \log \hat f$, $g_0 = \log f_0$, $\hat T = TV(\hat g)$, $T_0 = TV(g_0)$, and $h = H(\hat f, f_0)$.

We arrive at the following basic inequality by manipulating the optimality condition $-\mathbb{P}_n \hat g + \lambda \hat T \le -\mathbb{P}_n g_0 + \lambda T_0$. In fact, from the definition of the FDE we have the stronger condition that this inequality holds with $g_0$ replaced by any $g \in \mathcal{G}$, but the weaker condition will suffice.

Lemma 4.3 (Basic Inequality).

$h^2 \le (\mathbb{P}_n - \mathbb{P})(\hat g - g_0) + \lambda\, T_0 - \lambda\, \hat T.$

Squared Hellinger rates now follow from controlling the right hand side. We do so by considering two cases. When $\hat T$ is small, we show that, with high probability,

$(\mathbb{P}_n - \mathbb{P})(\hat g - g_0) \le C\, n^{-2/3} (1 + \hat T).$

From the basic inequality, this gives

$h^2 \le (C\, n^{-2/3} - \lambda)\, \hat T + C\, n^{-2/3} + \lambda\, T_0. \qquad (6)$

Excluding details, when $\lambda$ is chosen to dominate $C\, n^{-2/3}$, the first term in (6) is negative, so we conclude that $h^2 = O(\lambda)$.

The condition "when $\hat T$ is small," and the corresponding control of the empirical process, can be formalized in the following theorem.

Theorem 4.4.

$\sup \, (\mathbb{P}_n - \mathbb{P})(g - g_0) = O_P\!\left( n^{-2/3} \right),$

where the supremum is taken over all $g \in \mathcal{G}$ with $TV(g) \le 1$.

When $\hat T$ is large, on the other hand, we show that $(\mathbb{P}_n - \mathbb{P})(\hat g - g_0) \le C\, n^{-2/3}\, \hat T$ with high probability. From the basic inequality, this gives

$h^2 \le (C\, n^{-2/3} - \lambda)\, \hat T + \lambda\, T_0.$

Whence we conclude that $h^2 = O(\lambda)$ in this case as well. This follows from the analogue of Theorem 4.4 when $\hat T$ is large.

Theorem 4.5.

$\sup \, \frac{(\mathbb{P}_n - \mathbb{P})(g - g_0)}{TV(g)} = O_P\!\left( n^{-2/3} \right),$

where the supremum is taken over all $g \in \mathcal{G}$ with $TV(g) > 1$.

To summarize our conclusions so far: the squared Hellinger rate is $O(\lambda)$ when $\lambda$ balances the competing terms in (6). By choosing

$\lambda_n \asymp n^{-2/3},$

we have a minimal $\lambda$ which dominates the empirical process term in (6). Furthermore, this choice of $\lambda$ satisfies the hypotheses of Theorems 4.4 and 4.5. We have established a squared Hellinger rate of $n^{-2/3}$ for both of the cases of $\hat T$ considered. Furthermore, this choice of $\lambda$ satisfies the condition on $\lambda$ in Theorem 2.4, so the FDE is well-defined.

Theorems 4.4 and 4.5 are essential components of the proof outlined above. Both of these results are new and of independent interest. Their derivation requires the following lemma.

Lemma 4.6.

Let $T \ge 1$ and $\delta > 0$. There is a constant $C$ and a choice of $c > 0$ such that, for all $n$ and all $\delta$ satisfying $\delta^2 \ge C\, T^{2/3}\, n^{-2/3}$,

$P\left( \sup \, (\mathbb{P}_n - \mathbb{P})(g - g_0) \ge \delta^2 \right) \le \exp\!\left( -c\, n\, \delta^2 \right),$

where the supremum is taken over all $g \in \mathcal{G}$ with $TV(g) \le T$ and $H(e^g, f_0) \le \delta$.

Lemma 4.6 can be used to prove Theorems 4.4 and 4.5 by applying the peeling device twice, once each for the parameters $T$ and $\delta$.

The proof of Lemma 4.6 requires three basic ingredients: control of the bracketing entropy of $\mathcal{G}$, a uniform bound on the Bernstein difference, and a relationship dictating how the Bernstein difference scales with control of the Hellinger distance. These ingredients have the same motivation as in [MammenvandeGeer], where the authors use a total variation penalty to construct adaptive estimators in the context of regression. In that work, the authors assume subgaussian errors and prove bounds on metric entropy for functions of bounded variation. The subgaussian assumption provides local error bounds, and the metric entropy condition bounds the number of sets on which we must control that error. Though similarly motivated, our context is more complicated. In order to control the empirical process in terms of the Hellinger distance, we consider coverings in the Bernstein difference metric instead of the $L_2$ metric. Using the Bernstein difference allows us to achieve the results in Lemma 4.6, but its use requires control of generalized bracketing entropy (bracketing with the Bernstein difference) instead of the usual bracketing entropy with the $L_2$ metric. In addition, the uniform bound we require is now on the Bernstein difference over $\mathcal{G}$.

In Appendix B, we show that the bracketing entropy of $\{g : TV(g) \le T\}$, with bracketing radius $\varepsilon$, is of order $T/\varepsilon$. This bracketing entropy result implies generalized bracketing entropy bounds, and can be proved similarly to results in monotonic shape-constrained estimation [vandervaart]. In order to obtain the finite sample bounds necessary to achieve these rates, Bernstein's inequality is used to provide concentration inequalities that are critical to bounding the basic inequality. With this combination of local error bounds and bracketing rates, we can apply results in the spirit of generic chaining [talagrand] to obtain Lemma 4.6.

Lastly, we translate the probabilistic results into bounds on Hellinger risk. In general, one cannot prove expected risk rates from convergence in probability, because the tails may not decay quickly enough to give a finite expectation. But from the proofs of Theorems 4.4 and 4.5, we can derive exponential tail bounds for the Hellinger error. This allows us to translate our probabilistic rates into rates on the Hellinger risk; doing so requires some care to simultaneously apply the rates in Theorems 4.4 and 4.5. These details are provided in the expanded proof in Appendix A.

4.2 Guarantees for Connected Geometric Networks

In the previous sections we proved an $n^{-2/3}$ rate of convergence for univariate fused density estimators. The following theorem extends this result to arbitrary connected geometric networks.

Theorem 4.7.

Let $\hat f$ be the FDE of an iid sample over a connected geometric network $G$ with true density $f_0 \in \mathcal{F}$. Then there exists a choice of $\lambda_n$, dependent on $n$, such that $\lambda_n > 1/(2n)$ and

$\mathbb{E}\, H^2(\hat f, f_0) = O\!\left( n^{-2/3} \right).$

We prove this theorem using an embedding lemma, Lemma A.11, which states that for any fixed geometric network $G$ there is a measure-preserving embedding of $G$ into $\mathbb{R}$ that preserves densities and Hellinger distances. Furthermore, for any function on $G$, the (univariate) total variation of the embedded function never exceeds twice the graph-valued total variation. With this lemma in hand, Theorem 4.7 is proven by strategically bounding terms in our analysis by their univariate counterparts. A detailed proof can be found in the supplementary material.
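The first step of such an embedding can be sketched as a depth-first traversal that lays each edge down on the line when it is first reached. The bookkeeping for orientations and revisited nodes in Lemma A.11's actual construction is omitted here, so this is illustrative only.

```python
def dfs_embedding(adj, root):
    """Assign each edge an offset on the real line in depth-first order.
    `adj` maps a node to a list of (neighbor, edge_id, length) triples;
    edge eid occupies [offsets[eid], offsets[eid] + length)."""
    offsets, seen_edges, seen_nodes = {}, set(), set()
    pos = [0.0]

    def visit(v):
        seen_nodes.add(v)
        for w, eid, length in adj.get(v, []):
            if eid in seen_edges:
                continue
            seen_edges.add(eid)
            offsets[eid] = pos[0]
            pos[0] += length
            if w not in seen_nodes:
                visit(w)

    visit(root)
    return offsets

adj = {"a": [("b", "ab", 2.0)],
       "b": [("a", "ab", 2.0), ("c", "bc", 1.5)],
       "c": [("b", "bc", 1.5)]}
print(dfs_embedding(adj, "a"))   # {'ab': 0.0, 'bc': 2.0}
```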

Theorem 4.7 provides an upper bound on the minimax Hellinger rate for densities of log-bounded variation on geometric networks. Unlike the univariate case, however, we do not have a matching lower bound as in Theorem 4.1, so we cannot conclude that FDE attains the minimax rate for arbitrary geometric networks. This mirrors similar results in the Gaussian regression setting; see for example [padilla]. The minimax squared Hellinger rate for density estimation on geometric networks is at least as small as the univariate rate, and it is reasonable to suspect that the minimax rate on some graphs may be strictly better than the univariate rate. Though the univariate case may seem simple, and hence one might expect it to be easier, the sparse connectivity of the underlying graph in univariate estimation negatively affects its minimax rate of convergence. In some sense this is intuitive. For example, adding cycles to a graph increases the total variation of a function defined on it, compared to the same graph with the cycles removed. The increased total variation can be seen as applying more shrinkage in the context of estimation, which makes total variation balls smaller and the problem easier. Similarly, tree graphs (graphs without cycles) have more connectivity than the univariate chain graph, and hence larger total variation for a function defined on them. This intuition is consistent with the formal results of [OptRatesforDenoising] in the regression setting. While there may be networks for which the FDE and minimax squared Hellinger rates decrease more quickly than $n^{-2/3}$, we leave that study to future work.

Acknowledgements

RB was supported in part by the U.S. Office of Naval Research grant N00014-17-2372. JS was supported in part by NSF grant DMS-1712996. We would like to thank Ryan Tibshirani, Roger Wets, and Matthias Köppe for helpful conversations.

Appendix A Appendix: Proofs

A.1 Proofs from Section 2

Proof of Lemma 2.2.

Let $g$ be any function of bounded variation on $G$. Set $c = \log \int_G e^g \, d\mu$, so that $\int_G e^{g - c} \, d\mu = 1$. The value of

$-\mathbb{P}_n g + \int_G e^{g} \, d\mu + \lambda\, TV_G(g) \qquad (7)$

evaluated at $g - c$ is

$-\mathbb{P}_n g + c + 1 + \lambda\, TV_G(g). \qquad (8)$

Whereas (7) evaluated at $g$ is

$-\mathbb{P}_n g + e^{c} + \lambda\, TV_G(g). \qquad (9)$

Here we have used that $TV_G(g - c) = TV_G(g)$, which follows from shift-invariance of total variation. That is, $TV_G(g + a) = TV_G(g)$ for any function $g$ and constant $a$. Subtracting (9) from (8) gives

$c + 1 - e^{c}.$

Recall that $e^c \ge c + 1$ for all $c$, with equality attained if and only if $c = 0$. This proves our result, since we have shown that $g$ cannot be a minimizer of (7) unless $c = 0$, in which case $\int_G e^g \, d\mu = 1$. ∎

It is worth noting that, in the proof of Lemma 2.2, we only require that total variation is shift invariant. A similar result holds with any shift invariant penalty substituted for total variation.

Proof of Representer Theorem, Lemma 2.3.

Let $e \in E$, and assume that $e$ is identified with some interval $[a, b]$. Consider any subinterval $(s, t)$ of $[a, b]$ which does not intersect $\{X_1, \dots, X_n\} \cup V$. Assume, towards a contradiction, that $\hat g$ is a function which is not constant on $(s, t)$. We will show that $\hat g$ cannot minimize (3).

Define

$m = \inf_{x \in (s, t)} \hat g(x).$

Let $\tilde g$ be the function on $G$ which is $m$ on $(s, t) \subset e$ and $\hat g$ otherwise. We next consider the effect of this change on the objective in (3). Since no $X_i \in (s, t)$, the term $\mathbb{P}_n \hat g$ is unaltered by changing $\hat g$ to $\tilde g$. The interval $(s, t)$ is contained in $e$, so we have

$TV_G(g) = TV_{[a, s]}(g) + TV_{[s, t]}(g) + TV_{[t, b]}(g) + \sum_{e' \ne e} TV(g_{e'})$

for every real-valued function $g$ on $G$. From the definitions of $m$ and $\tilde g$,

$TV_{[s, t]}(\tilde g) = |\hat g(s) - m| + |\hat g(t) - m| \le TV_{[s, t]}(\hat g).$

This equality is attained if and only if $\hat g$ is monotonic on $[s, t]$. It follows that $TV_G(\tilde g) \le TV_G(\hat g)$.

The integral term in (3) is less at $\tilde g$ than its evaluation at $\hat g$, since $\tilde g \le \hat g$. Equality holds when $\{x \in (s, t) : \hat g(x) > m\}$ has measure zero. Hence,

$-\mathbb{P}_n \tilde g + \int_G e^{\tilde g} \, d\mu + \lambda\, TV_G(\tilde g) \le -\mathbb{P}_n \hat g + \int_G e^{\hat g} \, d\mu + \lambda\, TV_G(\hat g).$

We cannot have that $\hat g$ is monotonic on $[s, t]$ and that $\{x \in (s, t) : \hat g(x) > m\}$ has measure zero, unless $\hat g$ is constant on $(s, t)$. By assumption it is not, so we conclude that $\hat g$ cannot minimize (3), since the objective evaluated at $\hat g$ is strictly greater than at $\tilde g$. Therefore, any $\hat g$ that minimizes (3) must be constant on $(s, t)$. ∎

Proof of Theorem 2.4.

We pick up from the discussion preceding the theorem's statement. Recall that the subdifferentials of the fusion penalty terms are preserved by the exponential transformation. This allows us to conclude that there is a $(z, \beta)$ satisfying

$0 \in \partial F(z, \beta)$

if and only if $(u, w) = (e^{z}, e^{\beta})$ satisfies