1 Introduction
We consider the problem of estimating a real-valued function of bounded variation from observations in the commonly used white noise regression model (see e.g. Brown and Low (1996), Reiß (2008) and Tsybakov (2009))
(1.1) 
Here, denotes the standard Gaussian white noise process in , and we identify the torus with the set , i.e. to simplify technicalities we assume to be a periodic function. To ease notation we will henceforth drop the symbol , and write for instance instead of , and so on. See Remark 4 in Section 2 for a more detailed explanation and for the arguments needed to treat non-periodic functions. The function is further assumed to be of bounded variation, written , meaning that and its weak partial derivatives of first order are finite Radon measures on (see e.g. Chapter 5 in Evans and Gariepy (2015)). Note that, for (1.1) to be well-defined, we need to assume additionally that if , since only in we have . In the following we assume that is known; otherwise it can be estimated efficiently (see e.g. Munk et al. (2005) or Spokoiny (2002)), which will not affect our results.
Due to their low smoothness, functions of bounded variation are well suited to model objects with discontinuities. This is a desirable property for instance in medical imaging applications, where sharp transitions between tissues occur, and smoother functions would represent them inadequately. Consequently, functions have been studied extensively in the applied and computational analysis literature, see e.g. Chambolle and Lions (1997), Meyer (2001), Rudin et al. (1992), Scherzer et al. (2009) and references therein. Remarkably, the very reason for the success of functions of bounded variation in applications, namely their low smoothness, has hindered the development of a rigorous theory for the corresponding estimators in a statistical setting. With the exception of the one-dimensional case , where Mammen and van de Geer (1997) showed that the least squares estimator with a total variation (TV) penalty attains the minimax optimal convergence rates, there are to the best of our knowledge no statistical guarantees for estimating functions in dimension . Roughly speaking, the main challenges in higher dimensions are twofold: first, the embedding fails if ; and second, the space does not admit a characterization in terms of the size of wavelet coefficients.
A way around the difficulty of analyzing TV-regularization in higher dimensions is to discretize the observational model (1.1), thereby reducing the problem of estimating a function
to that of estimating a vector of function values
. In particular, the risk is measured by the Euclidean norm of , and not by the continuous norm. TV-regularized least squares in this discrete setting is nowadays fairly well understood. We mention Dalalyan et al. (2017) and Hütter and Rigollet (2016), who proved convergence rates in any dimension, which were shown to be minimax optimal in that model (Sadhanala et al., 2016); its generalization to trend-filtering is a current research topic (Guntuboyina et al. (2017), Wang et al. (2016)). However, this discretized model disregards the infinite-dimensional nature of the function estimation problem, and our work aims to close this gap. To this end, we consider the continuous model (1.1) and present estimators for that are minimax optimal up to logarithmic factors in any dimension, i.e. they attain the polynomial rate for the risk, . These estimators combine the strengths of TV and multiscale data-fidelity constraints, in contrast to the above-mentioned works for the discretized model, which typically employ a simple quadratic data-fidelity (see Figure 1 for an illustration of the different performance when ). Multiscale data-fidelity terms and the associated reconstructions by the corresponding dictionary have been widely used since the introduction of wavelets (see e.g. Daubechies (1992) and Donoho (1993)), and especially for imaging tasks overcomplete frames such as curvelets (Candès and Donoho, 2000), shearlets (Guo et al. (2006), Labate et al. (2005)) and other multiresolution systems (see Haltmeier and Munk (2014) for a survey) have been shown to perform well in theory and numerical applications. In contrast, multiscale methods using overcomplete frames in combination with a TV-functional have so far only been empirically shown to yield promising results for function estimation (Candès and Guo (2002), Dong et al. (2011), Frick et al. (2012), Frick et al. (2013)).
Beyond the lack of a theoretical understanding in a statistical setup, these methods have rarely been used in routine applications, as they require large-scale convex optimization methods for their computation. However, such methods have in the meantime become computationally feasible due to recent progress in optimization, e.g. the development of primal-dual methods (Chambolle and Pock, 2011) or semismooth Newton methods (Clason et al., 2010). Still, this practical success is to this day overshadowed by the lack of theoretical guarantees for these estimation methods. In this paper we aim to fill this gap and show that such estimators are optimal in a minimax sense up to logarithmic factors.
Multiscale total variation estimator
Let be a dictionary of functions indexed by a countable set and satisfying . Consider the projection of the white noise model (1.1) onto ,
(1.2) 
where denotes the standard inner product in . For each , given the observations , our estimator for is defined as any solution to the constrained minimization problem
(1.3) 
Here, the expression denotes the bounded variation seminorm of a function (see Section 2.1), and the subset , the finite subset and the threshold will be specified later. We show below that, given a family , there are universal choices of and that guarantee optimal behavior of in an asymptotic minimax sense.
Example 1.
In order to explain the estimator , consider the situation where and the dictionary consists of normalized indicator functions of dyadic squares (Nemirovski, 2000),
where denotes the Lebesgue measure of the set . Choose the set in (1.3) to consist of all dyadic squares in of measure at dyadic positions, which gives . Then, the estimator has the form
(1.4) 
where the maximum is taken over squares of size with vertices at dyadic positions. The main peculiarity of is the data-fidelity term, which encourages proximity of to the truth simultaneously on all large enough dyadic squares . This results in an estimator that preserves features of the truth at both the large and the small scales, thus giving a spatially adaptive estimator. This is illustrated in Figure 1 (see Frick et al. (2013) for an algorithmic implementation): the estimator is shown in part (c), and it succeeds in reconstructing the image well at both the large scales (sky and building) and the small scales (stairway). In Figure 1(d) we show for comparison the classical TV-regularization estimator, a.k.a. the Rudin–Osher–Fatemi (ROF) estimator (Rudin et al., 1992)
(1.5) 
which employs a global data-fidelity term. The parameter is chosen here in an oracle way so as to minimize the distance to the truth, where we measure the "distance" by the symmetrized Bregman divergence of the seminorm (see Section 3 of Frick et al. (2012)). As seen in Figure 1(d), the estimator successfully denoises the image at the large scales at the cost of details at the small scales. The reason is simple: the norm used as a data-fidelity measures the proximity to the data globally. This means that the optimal parameter is forced to achieve the best trade-off between regularization and data fidelity in the whole image: in particular, in rich enough images there will be regions where one either over-regularizes or under-regularizes, e.g. the stairway in Figure 1(d).
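The multiscale data-fidelity of Example 1 is easy to compute in a discrete setting. The following sketch is our own one-dimensional toy version (dyadic intervals instead of dyadic squares, illustrative names): it evaluates the maximal normalized residual average over all dyadic blocks. A flat residual vector passes at every scale, while a localized systematic deviation is flagged at the matching scale.

```python
import math

def multiscale_statistic(residuals):
    """Maximum over all dyadic blocks B of |B|^{-1/2} |sum_{i in B} r_i|,
    a discrete analogue of testing the residuals simultaneously at all
    scales and locations."""
    n = len(residuals)          # assumed to be a power of two
    stat, length = 0.0, 1
    while length <= n:
        for start in range(0, n, length):
            block_sum = sum(residuals[start:start + length])
            stat = max(stat, abs(block_sum) / math.sqrt(length))
        length *= 2
    return stat

flat = [0.0] * 256
bump = [0.0] * 256
for i in range(64, 96):         # one dyadic block of length 32
    bump[i] = 1.0

print(multiscale_statistic(flat))   # 0.0
print(multiscale_statistic(bump))   # sqrt(32) ~ 5.657, attained at the bump's own scale
```

The normalization by the square root of the block length puts all scales on an equal footing, which is exactly what lets the constraint in (1.4) act simultaneously on large and small features.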
The estimator requires the choice of the parameter . By (1.2), we expect the coefficients of the truth to be close to . This motivates choosing such that the truth satisfies the constraint on the right-hand side of (1.4) with high probability, which is achieved by the universal threshold
(1.6) 
with depending on the dictionary in an explicit way. This choice yields minimax optimality for up to logarithmic factors in any dimension with respect to the risk, (see Theorem 1). We remark that this universal choice of the parameter appears to us as a great conceptual and practical advantage of the estimator (1.3), in contrast to penalized estimators such as (1.5), which require more complex parameter-choice methods (e.g. Lepskii (1991) or Wahba (1977)). In particular, in (1.6) can be computed using known or simulated quantities only.
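The remark that the threshold in (1.6) "can be computed using known or simulated quantities only" can be illustrated by a small Monte Carlo calibration: simulate pure noise, compute the maximal dictionary coefficient, and take a high empirical quantile as the threshold. The sketch below uses our own toy setup (normalized dyadic-block averages on a one-dimensional grid) and is illustrative only, not the paper's explicit constant.

```python
import math
import random

def max_noise_coefficient(n, rng):
    """Multiscale statistic (maximal normalized dyadic-block average)
    of a pure standard Gaussian noise vector of length n."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    stat, length = 0.0, 1
    while length <= n:
        for start in range(0, n, length):
            s = sum(noise[start:start + length])
            stat = max(stat, abs(s) / math.sqrt(length))
        length *= 2
    return stat

rng = random.Random(42)
n, reps = 256, 200
draws = sorted(max_noise_coefficient(n, rng) for _ in range(reps))
gamma = draws[int(0.9 * reps)]      # empirical 90% quantile as threshold
# gamma is of the order sqrt(2 log(#blocks)) = sqrt(2 log 511), roughly 3.5
print(gamma)
```

Choosing the quantile level trades the probability that the truth violates the constraint against the looseness of the feasible set; the theoretical choice (1.6) makes this trade-off explicit.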
Other examples that minimize the seminorm and fall into the framework of (1.3) result from dictionaries consisting of a wavelet basis (Donoho (1993), Härdle et al. (2012)), a curvelet frame (Candès and Donoho, 2000) or a shearlet frame (Labate et al., 2005). Such estimators have been proposed in the literature (Candès and Guo (2002), Frick et al. (2012), Malgouyres (2002)) and shown to perform very well in simulations, outperforming wavelet and curvelet thresholding as well as TV-regularization with a global data-fidelity, as confirmed in Figure 1.
Multiscale data-fidelity
We now discuss conditions on required for our results. One of these conditions takes the form of a compatibility between the dictionary and the norm of the Besov space (see Section 2.1). We will require that for each there is a finite subset such that
(1.7) 
holds for any function . This is a Jackson-type inequality (Cohen, 2003), representing how well a function can be approximated in the Besov norm by its coefficients with respect to . It is well-known that smooth enough wavelet bases satisfy this condition (Cohen, 2003). In Section 2.4 we will extend (1.7) to more general multiscale systems, e.g. the system in Example 1 and mixed frames of wavelets and curvelets and of wavelets and shearlets.
The appearance of the Besov space in (1.7) may seem arbitrary at first sight, but it is actually quite natural. It results from the use of the maximum of the frame coefficients as a constraint in (1.3), which itself is motivated by the following fact: the space measures the limit Besov regularity of the paths of white noise, in the sense that
where the white noise is seen as a random Schwartz distribution (see Veraar (2011) for more refined statements). This means that the norm is a natural way of measuring the smoothness of white noise, which suggests using that norm to measure the residuals of our estimator, i.e. the constraint in (1.3). Indeed, we show below that this constraint ensures that most of the noise is removed.
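The scaling behind this choice of norm is the classical fact that the maximum of N independent standard Gaussians concentrates around the square root of 2 log N, so the dictionary coefficients of white noise are uniformly of that logarithmic order. A quick numerical sanity check (pure simulation, not from the paper):

```python
import math
import random

rng = random.Random(1)

def max_abs_gaussians(n):
    """Maximum absolute value of n independent standard Gaussians."""
    return max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))

for n in (10**2, 10**4, 10**6):
    ratio = max_abs_gaussians(n) / math.sqrt(2.0 * math.log(n))
    print(n, round(ratio, 3))       # ratios settle near 1 as n grows
```

This is the reason the constraint in (1.3) removes most of the noise while only inflating the risk by logarithmic factors.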
Let us mention that there are modifications of the Besov spaces constructed to measure the smoothness of white noise more accurately (see e.g. Section 5.2.2 in Giné and Nickl (2015)). However, we will not need them in this work, since our theoretical analysis naturally fits in the context of Besov scales. In a different direction, the optimal balancing of the different scales in the constraint of (1.3) is very much related to the multiplicative scaling used by Dümbgen and Spokoiny (2001) to correctly weight their test statistics. One may wonder whether an additive scaling, as in their setting, is necessary in ours. The answer is that such an additive scaling would help us remove some (but not all) of the logarithmic terms in the error bound in Theorem 1 below, but would entail additional difficulties in the theoretical analysis of the estimator, since the constraint would not match the Besov scale exactly.
Convergence analysis
The main contribution of this paper is a unified analysis of estimators of the form (1.3) for a class of dictionaries . For fixed , define the ball of radius ,
(1.8) 
Our main theorem (Theorem 1 in Section 2.2) can be informally stated as follows.
Theorem 1 (Informal).
Let the dimension , and let satisfy an inequality of the form (1.7), with smooth enough elements (see Assumption 1 in Section 2.2). Let the threshold in (1.3) be as in (1.6) for depending on and only. Then the estimator attains the minimax optimal rates of convergence over up to a logarithmic factor,
(1.9) 
for large enough, for any , any and a constant independent of , but dependent on , , and . For , (1.9) holds for with an additional factor.
The convergence rate in (1.9) is indeed minimax optimal over the class up to the logarithmic factor, as it is the optimal rate over the smaller class of bounded functions with bounded gradients (Nemirovski, 2000).
The proof of (1.9) relies on an interpolation inequality between and (Cohen et al., 2003),
(1.10) 
for any , . This inequality can be proven by a delicate analysis of the wavelet coefficients of functions of bounded variation (the original proof is in Cohen et al. (2003), and in Section 5.3 we prove an extension of (1.10) to periodic functions that we need here). The inequality (1.10) is the first step towards bounding the risk of : inserting we can bound it in terms of the and the risks. It can be shown that the risk is bounded by a constant with high probability, while the risk can be handled using inequality (1.7) as follows:
(1.11) 
The first term is bounded by by construction, and it represents the error that we allow the minimization procedure to make. The second term behaves as asymptotically almost surely, and it represents the stochastic error of the estimator. The third term arises from the compatibility between and the Besov space stated in (1.7). Inserting the result in (1.10) yields the conclusion that with high probability. See Section 3 for the full proof.
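To make the role of the interpolation inequality (1.10) concrete, we record here a celebrated two-dimensional inequality of this type, due to Cohen, Dahmen, Daubechies and DeVore (2003) (see also Ledoux (2003)). We state it only as an indicative special case; the precise form used in the paper is the periodic extension proved in Section 5.3.

```latex
\| f \|_{L^2(\mathbb{R}^2)}
\;\le\; C \,
\| f \|_{B^{-1}_{\infty,\infty}(\mathbb{R}^2)}^{1/2} \,
| f |_{BV(\mathbb{R}^2)}^{1/2},
\qquad f \in BV(\mathbb{R}^2).
```

Applied to the difference between the estimator and the truth, the Besov factor is controlled by the multiscale constraint and the stochastic error, while the BV factor is controlled by the minimization, exactly as in the three-term decomposition sketched above.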
Crucial for our proof is the inequality (1.10), which relates the risk functional on the left-hand side with the data-fidelity and regularization functionals on the right-hand side. This inequality is sharp, in the sense that the norms on the right-hand side cannot both be replaced by weaker norms. In this sense, it is important that our estimator (1.3) combines a bound on the frame coefficients (related to the norm) with control on the seminorm. Further, the sharpness implies that our proof strategy does not apply to estimators defined with other data-fidelity or regularization functionals. In particular, we give in Remark 7 a heuristic argument for why this strategy cannot establish optimality of wavelet thresholding, since it does not control the seminorm. A similar argument holds for the TV-regularized estimator (1.5): it controls the norm of the residuals instead of their norm, and translating the former norm into the latter yields a suboptimal bound on the risk.
Let us finally comment on the parameter set (1.8). The requirement that includes a bound on the supremum norm is needed for the derivation of convergence rates. The boundedness, which could be relaxed to boundedness in the Besov norm (see Remark 3), is needed for an error control of the form
(1.12) 
Since the embedding holds for only (see (2.2)), and since we have , we see that a typical function of bounded variation does not belong to if . Hence, the Jackson-type inequality in (1.12) cannot hold for general functions of bounded variation in . This explains why our parameter space is the intersection of a ball with an ball. Finally, we remark that most works on function estimation deal with Hölder or Sobolev functions with , so the assumption is implicit. Alternatively, we refer to Section 3 in Lepski et al. (1997) and to Delyon and Juditsky (1996) for examples of estimation over Besov bodies where uniform boundedness has to be assumed explicitly if .
Related work
This paper is related to a number of results at the cutting edge of statistics and applied harmonic analysis. Starting with the seminal paper of Rudin et al. (1992), which proposed the TV-regularized least squares estimator (1.5) for image denoising, the subsequent development of TV-based estimators depended greatly on the spatial dimension. In dimension , Mammen and van de Geer (1997) showed that the ROF estimator attains the optimal rates of convergence in the discretized nonparametric regression model, and Donoho and Johnstone (1998) proved the optimality of wavelet thresholding for estimation over . We also refer to Davies and Kovac (2001) and Dümbgen and Kovac (2009) for a combination of TV-regularization with related multiscale data-fidelity terms in , and to Li et al. (2017) for a different approach.
In higher dimensions, the situation becomes more involved due to the low regularity of functions of bounded variation. An approach to dealing with this low regularity is to employ a finer data-fidelity term, a strategy that has been used by several authors. In this sense, we distinguish three different variants of the ROF model that are related to our approach. First, Meyer (2001) proposed replacing the norm in the ROF functional by a weaker norm designed to match the smoothness of Gaussian noise. Several algorithms and theoretical frameworks using the Besov norm (Garnett et al., 2007), the norm (Haddad and Meyer, 2007) and the Sobolev norm in (Osher et al., 2003) were proposed, but the statistical performance of these estimators was not analyzed. A different approach started with Durand and Froment (2001), Malgouyres (2001) and Malgouyres (2002), who proposed estimators of the form (1.3) with a wavelet basis in dimension one. Following this approach and the development of curvelets (see e.g. Candès and Donoho (2000) for an early reference), Candès and Guo (2002) and Starck et al. (2001) proposed the estimator (1.3) with being a curvelet frame and a mixed curvelet and wavelet family, respectively, which showed good numerical behavior. A third line of development that leads to the estimator (1.3) began with Nemirovski (1985) (see also Nemirovski (2000)), who proposed a variational estimator for nonparametric regression over Hölder and Sobolev spaces that used a data-fidelity term based on the combination of local likelihood ratio (LR) tests: the multiresolution norm. In statistical inverse problems, Dong et al. (2011) proposed an estimator using TV-regularization constrained by the sum of local averages of residuals, instead of the maximum we employ in (1.3).
Regarding the tools and techniques we use, we mention in particular the concept of an interpolation inequality relating the risk functional, the regularization functional and the data-fidelity term (see Nemirovski (1985) for the first derivation of such an inequality, and Grasmair et al. (2018)). Nevertheless, while the inequality in those papers is essentially the Gagliardo–Nirenberg inequality for Sobolev norms (see Lecture II in Nirenberg (1959)), we extend and make use of interpolation inequalities for the norm, e.g. equation (1.10), see Cohen et al. (2003) and Ledoux (2003). Finally, as opposed to Grasmair et al. (2018), we formulate our results in the white noise model. This eases the incorporation of results from harmonic analysis (e.g. the interpolation inequalities between and , and the characterization of Besov spaces by local means) and from probability theory (the path regularity of white noise) into our statistical analysis. See however Section 4 for a discussion of our results in the discretized nonparametric regression model analogous to (1.1).
Organization of the paper
The rest of the paper is organized as follows. In Section 2 we state general assumptions on the family under which the estimator is shown to be nearly minimax optimal over the set . We state a general convergence theorem and give a sketch of the proof. Then we present examples of the estimator (1.3) where is a wavelet basis, a multiresolution system, and a curvelet or shearlet frame combined with wavelets, and show their almost minimax optimality over a range of risks. The proof of the main theorem is given in Section 3, while the rest of the proofs are relegated to the Appendix in Section 5.
Notation
We denote the Euclidean norm of a vector by . For a real number , define and . The cardinality of a finite set is denoted by . We say that two norms and in a normed space are equivalent, and write , if there are constants such that for all . Finally, we denote by a generic positive constant that may change from line to line.
2 Results
2.1 Basic definitions
For , let denote the space of times continuously differentiable periodic functions on , which we identify with the torus . The space of periodic functions of bounded variation consists of functions whose weak distributional gradient is a valued finite Radon measure on . The finiteness implies that the bounded variation seminorm of , defined by
is finite. is a Banach space with the norm , see Evans and Gariepy (2015). For , let be an regular wavelet basis for whose elements are times continuously differentiable with absolutely integrable th derivative, indexed by the set
(2.1)  
In particular, we consider wavelets of the form
where
is a tensor product of periodized onedimensional wavelets, and
denotes either the mother wavelet or the father wavelet of a onedimensional wavelet basis of . The index refers here to (shifts of) the father wavelet . See e.g. Section 4.3.6 in Giné and Nickl (2015) for the construction of such a basis. Then for and with , the Besov norm of a (generalized) function is defined by
(2.2) 
with the usual modifications if or . If and , the Besov space consists of functions with finite Besov norm, while if and , then consists of continuous functions with finite Besov norm. In these cases, denotes the standard inner product in . If , consists of periodic distributions with finite Besov norm. Here, denotes the space of periodic distributions, defined as the topological dual to the space of infinitely differentiable periodic functions (see Section 4.1.1 in Giné and Nickl (2015)). In that case, is interpreted as the action of on the function .
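Returning to the tensor-product wavelet construction above, the Haar system gives a concrete instance (it is only piecewise constant, hence not smooth enough for the regularity assumptions used later, but it has the same scaling structure). The following sketch, with our own illustrative function names, evaluates a two-dimensional Haar wavelet and verifies its L2 normalization by a midpoint Riemann sum:

```python
import math

def haar(e, x):
    """1D Haar father (e = 0) or mother (e = 1) wavelet on [0, 1)."""
    if not 0.0 <= x < 1.0:
        return 0.0
    if e == 0:
        return 1.0
    return 1.0 if x < 0.5 else -1.0

def tensor_wavelet(e1, e2, j, k1, k2, x1, x2):
    """2D tensor wavelet 2^j psi^{e1}(2^j x1 - k1) psi^{e2}(2^j x2 - k2);
    for 0 <= k1, k2 < 2^j its support lies inside [0, 1)^2, so the
    periodization is trivial."""
    return (2.0 ** j
            * haar(e1, 2.0 ** j * x1 - k1)
            * haar(e2, 2.0 ** j * x2 - k2))

# Midpoint Riemann-sum check of the L^2 normalization on [0, 1)^2:
# the integral of the squared wavelet should equal 1.
N = 256
h = 1.0 / N
sq = sum(tensor_wavelet(1, 1, 2, 1, 3, (i + 0.5) * h, (l + 0.5) * h) ** 2
         for i in range(N) for l in range(N)) * h * h
print(round(sq, 6))  # 1.0
```

The factor 2^j (that is, 2^{jd/2} with d = 2) is exactly what keeps the L2 norm equal to one across scales, matching the normalization of the basis used in the text.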
Finally, we define the Fourier transform of a function by
(2.3) 
The Fourier transform of a function is defined as in (2.3) extending the integration over . The formal definition of the Fourier transform is as usual extended to functions in and, by duality, to distributions (see e.g. Section 4.1.1 in Giné and Nickl (2015)).
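As an illustration of definition (2.3) for periodic functions, the Fourier coefficients over one period can be approximated by an equispaced Riemann sum, which is exact for trigonometric polynomials of low degree by the aliasing formula. The code below is our own sketch, not part of the paper:

```python
import math
import cmath

def fourier_coefficient(f, k, n=1024):
    """Approximate c_k = int_0^1 f(x) exp(-2*pi*i*k*x) dx by an
    equispaced Riemann sum (exact for trigonometric polynomials of
    degree below n/2)."""
    return sum(f(j / n) * cmath.exp(-2j * math.pi * k * j / n)
               for j in range(n)) / n

# f(x) = cos(2*pi*x) has c_1 = c_{-1} = 1/2 and all other c_k = 0.
f = lambda x: math.cos(2.0 * math.pi * x)
print(abs(fourier_coefficient(f, 1) - 0.5) < 1e-9)   # True
print(abs(fourier_coefficient(f, 2)) < 1e-9)         # True
```
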
2.2 Main result
The main ingredient of the estimator (1.3) is the family of functions , on which we impose the following assumptions.
Assumption 1.
is of the form for a countable set and functions satisfying for all . For each , consider a subset of polynomial growth, meaning that for all for a polynomial . The sets satisfy the following:

there is a constant independent of such that
holds for any ;

there are constants such that the following hold:

for any we have
(2.4) for a constant independent of ;

we have for a constant and all .

Examples.

Another example of a family satisfying Assumption 1 is given by translations and rescalings of (the smooth approximation to) the indicator function of a cube. In Section 2.4.2 we verify the assumption for such a system, that has been used previously as a dictionary for function estimation (see Grasmair et al. (2018)).
Definition 1.
In the following we assume that , so that we do not have to worry about the case . The reason for minimizing over this set is that, in the analysis of the estimator , we will need upper bounds on its supremum norm. As it turns out, the upper bound can be chosen to grow logarithmically in without affecting the polynomial rate of convergence of the estimator (but yielding additional logarithmic factors in the risk). Alternatively, if we knew an upper bound for the supremum norm of , we could choose . In that case, the risk bounds in Theorem 1 below would improve in some logarithmic factors (see Remark 2).
Theorem 1.
Remark 1.

Notice that part a) of the theorem implies that (2.7) holds asymptotically almost surely if .

While Assumption 1(a) is enough to guarantee convergence rates with high probability, for convergence rates in expectation we need to impose the quantitative smoothness assumptions in part (b) of Assumption 1. Notice that (2.4) requires certain smoothness on the elements , as well as a form of stability of the dictionary . The latter condition is satisfied by dictionaries of wavelet type.

By the assumption that , we have the tail bound
for any and , where denotes the white noise process in . This bound follows from Chernoff’s inequality and the union bound (see Proposition 9 in the Appendix), and it will play an important role for bounding the stochastic estimation error of the estimator .
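For orientation, in the idealized case of N independent standard Gaussian coefficients (a simplification of the general dictionary setting treated in the Appendix), such a bound follows from the Gaussian tail inequality and the union bound:

```latex
\mathbb{P}\Big( \max_{1 \le i \le N} |\xi_i| \ge \sqrt{2 \log N} + t \Big)
\;\le\; 2N \exp\!\Big( -\tfrac{(\sqrt{2\log N} + t)^2}{2} \Big)
\;\le\; 2N \, e^{-\log N} \, e^{-t^2/2}
\;=\; 2\, e^{-t^2/2}, \qquad t \ge 0,
```

where the middle step expands the square and drops the nonnegative cross term. This is the mechanism by which the maximal noise coefficient exceeds the universal threshold only with exponentially small probability.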
Remark 2.
The logarithmic factors in (2.7) and (2.8) are equal to for and to for . They arise in part from the bound (which we get from minimizing over in (2.6)), while the rest arises from the estimation procedure itself. Indeed, if we additionally constrain the estimator to , the factors can be improved to and for and , respectively. See Remark 6 below for an explanation of the different factors in and .
Remark 3.
Recall that our parameter set involves a bound on the supremum norm. This bound can be relaxed to a bound on the Besov norm without changing the convergence rate for . Indeed, assume for simplicity that is an orthonormal wavelet basis of , and for let index the wavelet coefficients up to level . In the proof of Theorem 1 we need a relaxed form of Assumption 1(a), namely an inequality of the form
(2.9) 
for sufficiently smooth . But this inequality for all is equivalent to (see Bernstein-type inequalities for Besov spaces, e.g. in Section 3.4 in Cohen (2003)). Consequently, Theorem 1 can be extended to show that the estimator with an orthonormal wavelet basis attains the optimal polynomial rates of convergence uniformly over the enlarged parameter space .
Remark 4.
In this work we deal with the estimation of periodic functions, i.e. defined on the torus . The reason for that is purely technical: our analysis makes use of Banach spaces of functions, whose definition is simpler for functions defined over (a manifold without boundary) than over the hypercube (which has a boundary). We remark that our work could be extended to function spaces over by the use of boundary corrected wavelet bases (see Section 4.3.5 in Giné and Nickl (2015)), and adapting the definitions of Besov and spaces and their corresponding norms.
We can now state the main result of this paper, which is a direct consequence of Theorem 1.
Theorem 2.
2.3 Sketch of the proof of Theorem 1
The proof of Theorem 1, which we relegate to Section 3, relies heavily on results from the theory of function spaces. In particular, the following interpolation inequalities play a central role in our proof.
Proposition 1 (Interpolation inequalities).

Let , . Then there is a constant depending on such that
holds for any .

Let and . Then there is a constant such that
holds for any .
Remark 5.
The limitation in part a) is sharp in the sense that the constant grows unboundedly as .
Proposition 2.
Let and . Let be such that or , where is the Hölder conjugate of . Then for any and parameters such that
we have
(2.10) 
for any .
The proof of Proposition 2 is given in Section 5.3, and it is a generalization to periodic functions of a result by Cohen et al. (2003). Notice that Proposition 2 expresses an interpolation inequality in terms of Besov norms. In order to derive part b) of Proposition 1, we choose , , and . Then for , the norm in the lefthand side of (2.10) can be readily reformulated in terms of an norm. For the situation is more involved, since the embedding does not hold. A more refined argument is needed to prove convergence in for , for which we use a variation of part a) of Proposition 1. This difference is responsible for the different logarithmic factors in (2.7).
With Proposition 1, the proof of part a) of Theorem 1 for will be done along the following steps.

We bound the risk of using part b) of Proposition 1 as
(2.12) The rest of the proof consists in showing that the righthand side behaves as conditionally on .

By part (a) of Assumption 1, the Besov norm in (2.12) can be bounded as
(2.13) The first term satisfies
where the second inequality follows by construction of and the third one holds conditionally on . The second term in (2.13) is bounded by , since by construction and by . Using the expression (1.6) for we have the bound
conditionally on .

The bounded variation norm in (2.12) satisfies
The supremum norm of the difference can be bounded as in step 3. In order to bound , notice that conditionally on we have
and hence the function is feasible for the minimization problem (2.5) that defines . Therefore we conclude that conditionally on , and we have
where the second inequality follows from .
Remark 6.
While in this sketch we only considered the case , the proof for is analogous, but somewhat more involved. It uses part a) of Proposition 1 to bound the risk, and chooses a sequence for a certain constant and . An important step of the proof is to control the constant in the embedding as a function of , which allows us to bound the risk by the risk. This approach yields the additional logarithm in (2.7).