# Oracle inequalities for image denoising with total variation regularization

We derive oracle results for discrete image denoising with a total variation penalty. We consider the least squares estimator with a penalty on the ℓ^1-norm of the total discrete derivative of the image. This estimator falls into the class of analysis estimators. A bound on the effective sparsity by means of an interpolating matrix allows us to obtain oracle inequalities with fast rates. The bound is an extension of the bound by Ortelli and van de Geer [2019c] to the two-dimensional case. We also present an oracle inequality with slow rates, which matches, up to a log-term, the rate obtained for the same estimator by Mammen and van de Geer [1997]. The key ingredient for our results are the projection arguments to bound the empirical process due to Dalalyan et al. [2017].


## 1 Introduction

### 1.1 Model

We consider the problem of image denoising. We want to estimate the image $f^0$ based on its noisy observation

 $Y = f^0 + \epsilon,$

where

$\epsilon$ is a noise matrix with i.i.d. Gaussian entries with known variance $\sigma^2$

. Without loss of generality we assume throughout the paper that $\sigma^2 = 1$. Note that $Y, f^0, \epsilon$ are matrices of dimension $n_1 \times n_2$ with entries the pixel values. Let $n := n_1 n_2$. To estimate the image we use total variation regularized least squares.

For two integers $j \le k$, we use the notation $[j:k] := \{j, \ldots, k\}$. If $j = 1$, we write $[k] := [1:k]$.

For $f \in \mathbb{R}^{n_1 \times n_2}$, we denote the entries of a matrix by $f_{j,k}$ (using subscripts) or alternatively by $f(j,k)$ (using arguments). Indeed, we can also regard a matrix as a function with domain $[n_1] \times [n_2]$ mapping to $\mathbb{R}$ or a subset thereof.

In this paper, the total variation of an image $f$ is defined as

 $\mathrm{TV}(f) := \sum_{j=2}^{n_1}\sum_{k=2}^{n_2} |(\Delta f)_{j,k}|,$

where, for $(j,k) \in [2:n_1] \times [2:n_2]$,

 $(\Delta f)_{j,k} := f_{j,k} - f_{j,k-1} - f_{j-1,k} + f_{j-1,k-1}$

and $\Delta$ is the two dimensional discrete derivative operator. Note that $\Delta f = D_1 f D_2^T$, where $D_1 \in \mathbb{R}^{(n_1-1) \times n_1}$ and $D_2 \in \mathbb{R}^{(n_2-1) \times n_2}$ are of the form

 $\begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix}.$

Thus we see that $\Delta$ is a linear operator.
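As a concrete check of this linear-operator view, here is a minimal numpy sketch; the helper names `diff_matrix`, `delta` and `tv` are ours, not from the paper:

```python
import numpy as np

def diff_matrix(m: int) -> np.ndarray:
    """(m-1) x m first-difference matrix with rows (..., -1, 1, ...)."""
    return np.diff(np.eye(m), axis=0)

def delta(f: np.ndarray) -> np.ndarray:
    """Two-dimensional discrete derivative: Delta f = D1 f D2^T."""
    n1, n2 = f.shape
    return diff_matrix(n1) @ f @ diff_matrix(n2).T

def tv(f: np.ndarray) -> float:
    """TV(f): the l1-norm of the total discrete derivative."""
    return float(np.abs(delta(f)).sum())

rng = np.random.default_rng(0)
f = rng.standard_normal((5, 7))
d = delta(f)
# entry (j, k) of delta(f) (0-based) is the second mixed difference of f
j, k = 2, 3
assert np.isclose(d[j, k], f[j + 1, k + 1] - f[j + 1, k] - f[j, k + 1] + f[j, k])
```

Note that `tv` vanishes on any image that is additive in its two indices, which is exactly the point of this penalty.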

### 1.2 ANOVA decomposition

We now present the ANOVA decomposition of an image. This decomposition separates an image into four mutually orthogonal components: the global mean, the two matrices of the main effects - one for each of the two dimensions - and the matrix of interaction terms. We will need the insights brought by the ANOVA decomposition of an image to define the estimator for the interaction terms, which is the main object studied in this paper.

Let $\psi_{1,1} \in \mathbb{R}^{n_1 \times n_2}$ be the matrix having all the entries equal to one.

We decompose an image $f$ as

 $f = \tilde f + f(\cdot,\circ) + f(\circ,\cdot) + f(\circ,\circ)\,\psi_{1,1},$

where $f(\circ,\circ)$ is the global mean, $f(\cdot,\circ)$ and $f(\circ,\cdot)$ are the matrices of main effects and $\tilde f$ is the matrix of interaction terms.

The global mean is defined as

 $f(\circ,\circ) := \frac{1}{n_1 n_2}\sum_{j=1}^{n_1}\sum_{k=1}^{n_2} f(j,k).$

The main effects are defined as $(f(\cdot,\circ))_{j,k} := f(j,\circ)$ and $(f(\circ,\cdot))_{j,k} := f(\circ,k)$, $(j,k) \in [n_1]\times[n_2]$, where

 $f(j,\circ) := \frac{1}{n_2}\sum_{k=1}^{n_2} f(j,k) - f(\circ,\circ), \quad j \in [n_1]$

and

 $f(\circ,k) := \frac{1}{n_1}\sum_{j=1}^{n_1} f(j,k) - f(\circ,\circ), \quad k \in [n_2].$

Note that $f(\cdot,\circ)$ has identical columns and $f(\circ,\cdot)$ has identical rows.

The interaction terms are defined as

 $\tilde f(j,k) := f(j,k) - f(j,\circ) - f(\circ,k) - f(\circ,\circ), \quad (j,k) \in [n_1]\times[n_2].$

We define

 $\mathrm{TV}_1(f(\cdot,\circ)) := \sum_{j=2}^{n_1} |f(j,\circ) - f(j-1,\circ)|$

and similarly

 $\mathrm{TV}_2(f(\circ,\cdot)) := \sum_{k=2}^{n_2} |f(\circ,k) - f(\circ,k-1)|.$

Note that $\tilde f$, $f(\cdot,\circ)$, $f(\circ,\cdot)$ and $f(\circ,\circ)\psi_{1,1}$ are mutually orthogonal.

Let $\|\cdot\|_2$ be the Frobenius norm

 $\|f\|_2 := \Bigl(\sum_{j=1}^{n_1}\sum_{k=1}^{n_2} f_{j,k}^2\Bigr)^{1/2}$

and let $\|f\|_1$ be the sum of the absolute values of the entries of $f$, i.e.

 $\|f\|_1 := \sum_{(j,k)\in[n_1]\times[n_2]} |f_{j,k}|.$

We endow $\mathbb{R}^{n_1\times n_2}$ with the trace inner product. By orthogonality,

 $\|f\|_2^2 = \|\tilde f\|_2^2 + \|f(\cdot,\circ)\|_2^2 + \|f(\circ,\cdot)\|_2^2 + n_1 n_2 f^2(\circ,\circ).$
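These identities are easy to verify numerically. The following sketch (the function name `anova` is ours) computes the four components and checks the Pythagorean identity above:

```python
import numpy as np

def anova(f: np.ndarray):
    """Split f into interactions, row/column main effects and the global mean."""
    mean = f.mean()
    row = f.mean(axis=1, keepdims=True) - mean   # f(j, o): row main effects
    col = f.mean(axis=0, keepdims=True) - mean   # f(o, k): column main effects
    inter = f - row - col - mean                 # interaction terms ~f
    row = np.broadcast_to(row, f.shape)
    col = np.broadcast_to(col, f.shape)
    return inter, row, col, mean

rng = np.random.default_rng(1)
f = rng.standard_normal((4, 6))
inter, row, col, mean = anova(f)
n1, n2 = f.shape

# the four components add back up to f ...
assert np.allclose(inter + row + col + mean, f)
# ... and are mutually orthogonal, giving the Pythagorean identity
lhs = (f ** 2).sum()
rhs = (inter ** 2).sum() + (row ** 2).sum() + (col ** 2).sum() + n1 * n2 * mean ** 2
assert np.isclose(lhs, rhs)
```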

### 1.3 Estimator

We consider the estimator

 $\hat f := \arg\min_{f\in\mathbb{R}^{n_1\times n_2}}\bigl\{\|Y-f\|_2^2/n + 2\lambda_1\mathrm{TV}_1(f(\cdot,\circ)) + 2\lambda_2\mathrm{TV}_2(f(\circ,\cdot)) + 2\lambda\,\mathrm{TV}(f)\bigr\},$

where $\lambda, \lambda_1, \lambda_2$ are positive tuning parameters. We call $\hat f$ the two dimensional total variation regularized least squares estimator. This estimator has the form of a so-called analysis estimator: it approximates the observations under a regularization penalty on the $\ell^1$-norm of a linear operator of the signal $f$.

Since $\tilde Y$, $Y(\cdot,\circ)$, $Y(\circ,\cdot)$ and $Y(\circ,\circ)\psi_{1,1}$ are mutually orthogonal, we may decompose the estimator as

 $\hat f = \hat f(\circ,\circ) + \hat f(\cdot,\circ) + \hat f(\circ,\cdot) + \hat{\tilde f},$

where

 $\hat f(\circ,\circ) = Y(\circ,\circ),$
 $\hat f(\cdot,\circ) = \arg\min_{f\in\mathbb{R}^{n_1\times n_2}}\bigl\{\|Y(\cdot,\circ)-f\|_2^2/n + 2\lambda_1\mathrm{TV}_1(f)\bigr\},$
 $\hat f(\circ,\cdot) = \arg\min_{f\in\mathbb{R}^{n_1\times n_2}}\bigl\{\|Y(\circ,\cdot)-f\|_2^2/n + 2\lambda_2\mathrm{TV}_2(f)\bigr\},$
 $\hat{\tilde f} = \arg\min_{f\in\mathbb{R}^{n_1\times n_2}}\bigl\{\|\tilde Y-f\|_2^2/n + 2\lambda\,\mathrm{TV}(f)\bigr\}.$

We are going to focus on the estimator of the interaction terms $\hat{\tilde f}$, which is an analysis estimator as well.

### 1.4 Contributions

We apply to image denoising an extension of the approach developed in Ortelli and van de Geer (2019c) for bounding the effective sparsity for (higher order) total variation regularized estimators in one dimension. We obtain oracle inequalities with fast and slow rates for image denoising with the total variation penalty.

###### Theorem 1.1 (Main result for fast rates - simplified version of Theorem 2.1)

Let $S$ be an arbitrary subset of $[3:n_1-1]\times[3:n_2-1]$ of size $s$ defining a regular grid parallel to the coordinate axes. Choose $\lambda$ as in Theorem 2.1.

Then, for all $f\in\mathbb{R}^{n_1\times n_2}$, it holds that, with high probability,

 $\|\hat{\tilde f}-\tilde f^0\|_2^2/n \le \|f-\tilde f^0\|_2^2/n + 4\lambda \sum_{(j,k)\in[2:n_1]\times[2:n_2]\setminus S} |(\Delta f)_{j,k}| + O\Bigl(\frac{s^{3/2}\log n}{n}\Bigr).$

We also show that the projection arguments by Dalalyan et al. (2017) allow us to recover the same rate obtained by entropy calculations for the estimator of the interactions in a simple and constant-friendly way at the price of an additional log term.

###### Theorem 1.2 (Main result for slow rates - simplified version of Theorem 3.1)

Consider a square image (i.e. $n_1 = n_2 = \sqrt{n}$). Choose a subset $S$ of $[3:n_1-1]\times[3:n_2-1]$ of size $s$ defining a regular grid parallel to the coordinate axes with

 $s \asymp n^{2/5}\,\mathrm{TV}(f^0)^{4/5}(\log n)^{2/5}.$

Choose $\lambda$ as in Theorem 3.1. Then

 $\|\hat{\tilde f}-\tilde f^0\|_2^2/n = O_{\mathbb{P}}\bigl(n^{-3/5}\,\mathrm{TV}(f^0)^{4/5}(\log n)^{2/5}\bigr).$

### 1.5 Related literature

Total variation regularization as a method of proximal denoising of images dates back to Rudin et al. (1992). An early contribution treating total variation regularization on a two dimensional grid of points is the work by Mammen and van de Geer (1997). Two dimensional total variation regularization has also been the focus of some recent studies, such as Sadhanala et al. (2016); Wang et al. (2016); Hütter and Rigollet (2016); Chatterjee and Goswami (2019).

Total variation regularized estimators in two dimensions usually take the form of analysis estimators, for which general oracle inequalities with fast and slow rates are derived in Ortelli and van de Geer (2019b). Moreover Ortelli and van de Geer (2019a) explain how to transform general analysis estimators into their synthesis form, based on the work by Elad et al. (2007).

A criterion to divide the previous theoretical contributions in the field of image denoising with total variation regularization is the way in which total variation in two dimensions is defined.

Let us for a moment think of an image as a (differentiable) function

$f: [0,n_1]\times[0,n_2]\to\mathbb{R}$, $(x,y)\mapsto f(x,y)$. In the continuous case, we can define the total variation of $f$ in two ways.

• Either we use partial derivatives and define the total variation as

 $\int_0^{n_1}\!\int_0^{n_2}\Bigl(\Bigl|\frac{\partial f(x,y)}{\partial x}\Bigr|+\Bigl|\frac{\partial f(x,y)}{\partial y}\Bigr|\Bigr)\,dx\,dy.$

• Or we use the total derivative and define the total variation as

 $\int_0^{n_1}\!\int_0^{n_2}\Bigl|\frac{\partial^2 f(x,y)}{\partial x\,\partial y}\Bigr|\,dx\,dy.$

In this paper we are going to consider the discrete case, where an image is a matrix of pixels $f\in\mathbb{R}^{n_1\times n_2}$. For the discrete case, by replacing derivatives by discrete differences, we find the following counterparts to the definitions of total variation provided above for the continuous case.

• The first approach translates to the discrete case as

 $\frac{1}{n_1}\sum_{j=1}^{n_1}\sum_{k=2}^{n_2}|f_{j,k}-f_{j,k-1}| + \frac{1}{n_2}\sum_{k=1}^{n_2}\sum_{j=2}^{n_1}|f_{j,k}-f_{j-1,k}|.$

It corresponds, up to normalization, to summing up the edge differences of $f$ across the two dimensional grid graph. This approach is used in Sadhanala et al. (2016); Wang et al. (2016); Hütter and Rigollet (2016); Chatterjee and Goswami (2019). We denote the sum of the edge differences of $f$ across the two dimensional grid graph by

 $\mathrm{TV}^G(f) := \sum_{j=1}^{n_1}\sum_{k=2}^{n_2}|f_{j,k}-f_{j,k-1}| + \sum_{k=1}^{n_2}\sum_{j=2}^{n_1}|f_{j,k}-f_{j-1,k}|.$
• The parallel to the second approach is given by

 $\mathrm{TV}(f) := \sum_{j=2}^{n_1}\sum_{k=2}^{n_2}|f_{j,k}-f_{j,k-1}-f_{j-1,k}+f_{j-1,k-1}|$

and corresponds to the sum of the absolute values of the interaction terms among groups of 4 pixels. This approach is used in Mammen and van de Geer (1997) and will be used in this paper as well.
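The two discrete penalties behave quite differently: $\mathrm{TV}^G$ penalizes main effects, while the penalty used in this paper vanishes on them. A small numpy illustration (function names are ours):

```python
import numpy as np

def tv_grid(f: np.ndarray) -> float:
    """TV^G: sum of the edge differences over the two-dimensional grid graph."""
    return float(np.abs(np.diff(f, axis=1)).sum() + np.abs(np.diff(f, axis=0)).sum())

def tv_interaction(f: np.ndarray) -> float:
    """TV used in this paper: l1-norm of the second mixed differences."""
    return float(np.abs(np.diff(np.diff(f, axis=0), axis=1)).sum())

# A tilted plane (pure main effects) has zero interaction TV but positive grid TV.
j, k = np.meshgrid(np.arange(4), np.arange(5), indexing="ij")
plane = 2.0 * j + 3.0 * k
assert tv_interaction(plane) == 0.0
assert tv_grid(plane) == 78.0
```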

For the first approach, much work has been done. Sadhanala et al. (2016) derive minimax rates for this class. Hütter and Rigollet (2016) derive sharp oracle inequalities. The work by Wang et al. (2016) renders a rate (neglecting log terms) which, under the canonical scaling (see Sadhanala et al. (2016)), results in a corresponding upper bound. Lastly, the very recent work by Chatterjee and Goswami (2019) focuses on the constrained optimization problem and a tuning-free version thereof. For a certain regime they obtain a rate faster than the minimax rate. Their approach, like that of the work on higher order total variation regularization (Guntuboyina et al. (2017)), is based on bounding Gaussian widths of tangent cones.

In our latest work (Ortelli and van de Geer (2019c)), we derived oracle inequalities with fast rates by bounding the weighted weak compatibility constant introduced by Dalalyan et al. (2017). We want to apply the same approach here to obtain an oracle inequality with fast rates for the two dimensional total variation regularized estimator as defined by Mammen and van de Geer (1997).

Using oracle inequalities with slow rates, we also match (up to log terms) the rate obtained in Mammen and van de Geer (1997) by entropy calculations, which is of order $n^{-3/5}\,\mathrm{TV}(f^0)^{4/5}$. This is the last missing piece in our attempt to derive nonasymptotic counterparts of the theoretical guarantees on the mean squared error of the total variation regularized estimators handled in Mammen and van de Geer (1997).

We will assume throughout the paper that we have i.i.d. Gaussian errors with unit variance. For the case of unknown variance, Ortelli and van de Geer (2019b) show how to simultaneously estimate the signal and the variance of the noise by extending the idea of the square root lasso (Belloni et al. (2011)) to analysis estimators.

### 1.6 Synthesis formulation

The synthesis formulation of the two dimensional total variation regularized estimator expresses a matrix $f\in\mathbb{R}^{n_1\times n_2}$ as a linear combination of dictionary matrices, such that the penalty is the $\ell^1$-norm of (a subset of) the coefficients of the dictionary matrices. Here the notation with arguments instead of subscripts comes in handy. Consider some $f\in\mathbb{R}^{n_1\times n_2}$. We may write, for $j\in[n_1]$ and $k\in[n_2]$,

 $f(j,k) = \sum_{j'=1}^{n_1}\sum_{k'=1}^{n_2} \beta_{j',k'}\,\psi_{j',k'}(j,k),$

where for $(j',k')\in[n_1]\times[n_2]$ the dictionary matrices are $\psi_{j',k'}\in\mathbb{R}^{n_1\times n_2}$ with

 $\psi_{j',k'}(j,k) = 1\{j\ge j',\,k\ge k'\}, \quad (j,k)\in[n_1]\times[n_2].$

We call the collection of matrices $\{\psi_{j',k'}\}_{(j',k')\in[n_1]\times[n_2]}$ the dictionary. Moreover,

 $\beta_{1,1} = f(1,1),$
 $\beta_{j',1} = f(j',1)-f(j'-1,1), \quad j'\in[2:n_1],$
 $\beta_{1,k'} = f(1,k')-f(1,k'-1), \quad k'\in[2:n_2],$
 $\beta_{j',k'} = (\Delta f)_{j',k'}, \quad (j',k')\in[2:n_1]\times[2:n_2].$
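The display above says that $\beta$ is obtained from $f$ by differencing, and that $f$ is recovered from $\beta$ through the cumulative sums encoded in the indicator matrices $\psi_{j',k'}$. A numpy round-trip check (function names are ours):

```python
import numpy as np

def dictionary_coefficients(f: np.ndarray) -> np.ndarray:
    """Coefficients beta with f = sum_{j',k'} beta_{j',k'} psi_{j',k'}."""
    beta = np.zeros_like(f)
    beta[0, 0] = f[0, 0]
    beta[1:, 0] = np.diff(f[:, 0])
    beta[0, 1:] = np.diff(f[0, :])
    beta[1:, 1:] = np.diff(np.diff(f, axis=0), axis=1)  # (Delta f)_{j',k'}
    return beta

def synthesize(beta: np.ndarray) -> np.ndarray:
    """Since psi_{j',k'}(j,k) = 1{j >= j', k >= k'}, summing beta_{j',k'} psi_{j',k'}
    amounts to a cumulative sum along both dimensions."""
    return beta.cumsum(axis=0).cumsum(axis=1)

rng = np.random.default_rng(2)
f = rng.standard_normal((5, 4))
assert np.allclose(synthesize(dictionary_coefficients(f)), f)
```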

Define

 $\tilde\psi_{1,1} := \psi_{1,1},$
 $\tilde\psi_{j,1} := \psi_{j,1} - \psi_{j,1}(\circ,\circ), \quad j\in[2:n_1],$
 $\tilde\psi_{1,k} := \psi_{1,k} - \psi_{1,k}(\circ,\circ), \quad k\in[2:n_2],$
 $\tilde\psi_{j,k} := \psi_{j,k} - \psi_{j,k}(\cdot,\circ) - \psi_{j,k}(\circ,\cdot) - \psi_{j,k}(\circ,\circ), \quad (j,k)\in[2:n_1]\times[2:n_2].$
###### Remark 1.1

For a linear space $\mathcal{W}$, let $P_{\mathcal{W}}$ denote the projection operator on $\mathcal{W}$ and $A_{\mathcal{W}} := I - P_{\mathcal{W}}$ the anti-projection operator. Let $\mathcal{W}'$ be another linear space. Then $A_{\mathcal{W}}A_{\mathcal{W}'} = A_{\mathcal{W}+\mathcal{W}'}$ whenever $P_{\mathcal{W}}$ and $P_{\mathcal{W}'}$ commute.

We thus note that

 $\tilde\psi_{j,1} = A_{\mathrm{span}(\psi_{1,1})}\,\psi_{j,1}, \quad j\in[2:n_1],$
 $\tilde\psi_{1,k} = A_{\mathrm{span}(\psi_{1,1})}\,\psi_{1,k}, \quad k\in[2:n_2],$
 $\tilde\psi_{j,k} = A_{\mathrm{span}(\{\psi_{j,1}\}_{j\in[n_1]}\cup\{\psi_{1,k}\}_{k\in[n_2]})}\,\psi_{j,k}, \quad (j,k)\in[2:n_1]\times[2:n_2].$

Thus, the four linear spaces $\mathrm{span}(\tilde\psi_{1,1})$, $\mathrm{span}(\{\tilde\psi_{j,1}\}_{j\in[2:n_1]})$, $\mathrm{span}(\{\tilde\psi_{1,k}\}_{k\in[2:n_2]})$ and $\mathrm{span}(\{\tilde\psi_{j,k}\}_{(j,k)\in[2:n_1]\times[2:n_2]})$ are mutually orthogonal.

The following lemma shows how a matrix $f$ can be expressed as a linear combination of the elements of the dictionary $\{\tilde\psi_{j,k}\}$ and how in that case the meaning of the coefficients has to be adapted accordingly. In particular, the new dictionary can be partitioned into four mutually orthogonal parts corresponding to the sets $\{(1,1)\}$, $[2:n_1]\times\{1\}$, $\{1\}\times[2:n_2]$ and $[2:n_1]\times[2:n_2]$.

###### Lemma 1.3

It holds that

 $f = \sum_{j=1}^{n_1}\sum_{k=1}^{n_2} \tilde\beta_{j,k}\,\tilde\psi_{j,k},$

where

 $\tilde\beta_{1,1} = f(\circ,\circ),$
 $\tilde\beta_{j,1} = f(j,\circ)-f(j-1,\circ), \quad j\in[2:n_1],$
 $\tilde\beta_{1,k} = f(\circ,k)-f(\circ,k-1), \quad k\in[2:n_2],$
 $\tilde\beta_{j,k} = \beta_{j,k} = (\Delta f)_{j,k}, \quad (j,k)\in[2:n_1]\times[2:n_2].$
###### Proof of Lemma 1.3.

See Appendix A. ∎

The next lemma, which is based on Lemma 1.3, gives us a synthesis form of the estimator of the interaction terms $\hat{\tilde f}$.

###### Lemma 1.4

We have

 $\hat{\tilde f} = \sum_{j=2}^{n_1}\sum_{k=2}^{n_2} \hat{\tilde\beta}_{j,k}\,\tilde\psi_{j,k},$

where

 $\{\hat{\tilde\beta}_{j,k}\} = \arg\min_{\{b_{j,k}\}_{(j,k)\in[2:n_1]\times[2:n_2]}}\Bigl\{\Bigl\|Y-\sum_{j=2}^{n_1}\sum_{k=2}^{n_2} b_{j,k}\tilde\psi_{j,k}\Bigr\|_2^2/n + 2\lambda\sum_{j=2}^{n_1}\sum_{k=2}^{n_2}|b_{j,k}|\Bigr\}.$
###### Proof of Lemma 1.4.

See Appendix A. ∎
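Lemma 1.4 makes $\hat{\tilde f}$ computable by any lasso solver. The following sketch assumes nothing beyond numpy: it builds the orthogonalized interaction dictionary $\{\tilde\psi_{j,k}\}$ by double-centering the indicator matrices and runs plain proximal gradient (ISTA) on the synthesis objective. All function names, the toy sizes and the tuning value are our own choices, not the paper's.

```python
import numpy as np

def interaction_dictionary(n1: int, n2: int) -> np.ndarray:
    """Columns are vec(~psi_{j,k}), (j,k) in [2:n1]x[2:n2]: indicator matrices
    psi_{j,k}(a, b) = 1{a >= j, b >= k} with main effects and mean removed."""
    cols = []
    for j in range(1, n1):
        for k in range(1, n2):
            psi = np.zeros((n1, n2))
            psi[j:, k:] = 1.0
            psi -= psi.mean(axis=1, keepdims=True)  # remove row means
            psi -= psi.mean(axis=0, keepdims=True)  # then column means (double-centering)
            cols.append(psi.ravel())
    return np.column_stack(cols)

def ista(X: np.ndarray, y: np.ndarray, lam: float, n: int, iters: int = 20000) -> np.ndarray:
    """Proximal gradient for min_b ||y - X b||_2^2 / n + 2 * lam * ||b||_1."""
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)        # 1 / Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = b - step * 2.0 * X.T @ (X @ b - y) / n      # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - 2.0 * lam * step, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(3)
n1 = n2 = 6
n = n1 * n2
f0 = np.zeros((n1, n2))
f0[3:, 3:] = 2.0                                        # one interaction jump
f0 -= f0.mean(axis=1, keepdims=True)
f0 -= f0.mean(axis=0, keepdims=True)                    # f0 equals its interaction part
Y = f0 + 0.1 * rng.standard_normal((n1, n2))

X = interaction_dictionary(n1, n2)
lam = 0.01
b = ista(X, Y.ravel(), lam, n)
f_hat = (X @ b).reshape(n1, n2)
```

Since ISTA starts at zero and decreases the objective monotonically, the fit is at least as good as the trivial one, and for this single-jump signal the recovery is close to $\tilde f^0$.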

### 1.7 Notation

Fix some set

 $S := \{t_1,\ldots,t_s\} \subseteq [3:n_1-1]\times[3:n_2-1].$

One may think of $S$ as being the active set of the interactions of some "oracle" approximation of $f^0$. We refer to the elements of $S$ as (interaction) jump locations. The cardinality of $S$ is denoted by $s$ and the coordinates of a jump location $t_m$ are denoted by $t_m = (t_{1,m}, t_{2,m})$, $m\in[s]$. Note that we require that $3 \le t_{1,m} \le n_1-1$ and $3 \le t_{2,m} \le n_2-1$. This assumption ensures that we have no boundary effects when doing partial integration, see Corollary 2.7.

Given $S$, we can partition $[n_1]\times[n_2]$ into $s$ subsets, subset $m$ consisting of the points closest to $t_m$, $m\in[s]$. This corresponds to a Voronoi tessellation. The natural metric here would be the city block metric. However, a Voronoi tessellation typically has subsets of relatively irregular shape. We will require that the partition consists of rectangles. Indeed, a partition consisting of rectangles will ease the construction of an interpolating matrix, which is our tool to bound the effective sparsity (Theorem 2.1) and to obtain an oracle inequality with fast rates. The concepts of interpolating matrix and effective sparsity are presented in Section 2.2.

###### Definition 1.5

We call $\{R_m\}_{m\in[s]}$ a rectangular tessellation of $[n_1]\times[n_2]$ if it satisfies the following conditions:

• each $R_m$ is a rectangle ($m\in[s]$),
• $\cup_{m\in[s]} R_m = [n_1]\times[n_2]$,
• for all $m$ and $m'\ne m$, the rectangles $R_m$ and $R_{m'}$ possibly share boundary points, but not interior points,
• for all $m\in[s]$ the jump location $t_m$ is an interior point of $R_m$.

For a rectangular tessellation we let $d_m^{--}$ be the area of the part of $R_m$ North-West of $t_m$, $d_m^{-+}$ the area to the South-West, $d_m^{++}$ the area to the South-East and $d_m^{+-}$ the area to the North-East. In other words, writing $(t_{1,m}^-, t_{2,m}^-)$ and $(t_{1,m}^+, t_{2,m}^+)$ for two opposite corners of the rectangle $R_m$, we have

 $d_m^{--} = d_{1,m}^- d_{2,m}^-, \quad d_m^{-+} = d_{1,m}^- d_{2,m}^+, \quad d_m^{++} = d_{1,m}^+ d_{2,m}^+, \quad d_m^{+-} = d_{1,m}^+ d_{2,m}^-,$

where

 $d_{1,m}^- = t_{1,m}-t_{1,m}^-, \quad d_{2,m}^- = t_{2,m}-t_{2,m}^-, \quad d_{1,m}^+ = t_{1,m}^+-t_{1,m}, \quad d_{2,m}^+ = t_{2,m}^+-t_{2,m}.$
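As a worked instance of this bookkeeping, with $t_m=(4,5)$ inside a rectangle with opposite corners $(2,2)$ and $(7,9)$, the four quadrant areas $d_m^{--}, d_m^{-+}, d_m^{++}, d_m^{+-}$ come out as follows (the helper name is ours):

```python
def quadrant_areas(t, corner_lo, corner_hi):
    """Areas of the four quadrants into which t splits its rectangle R_m."""
    d1m, d2m = t[0] - corner_lo[0], t[1] - corner_lo[1]
    d1p, d2p = corner_hi[0] - t[0], corner_hi[1] - t[1]
    return {"--": d1m * d2m, "-+": d1m * d2p, "++": d1p * d2p, "+-": d1p * d2m}

areas = quadrant_areas((4, 5), (2, 2), (7, 9))
assert areas == {"--": 6, "-+": 8, "++": 12, "+-": 9}
assert sum(areas.values()) == (7 - 2) * (9 - 2)  # the quadrants tile the rectangle
```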

For two matrices $a$ and $b$ we use the symbol $\odot$ for entry-wise multiplication:

 $(a\odot b)_{j,k} := a_{j,k} b_{j,k}, \quad (j,k)\in[2:n_1]\times[2:n_2].$

Moreover,

 $a_S := \{a_{j,k} : (j,k)\in S\}, \quad a_{-S} := \{a_{j,k} : (j,k)\notin S\}.$

We will use the same notation $a_S$ for the matrix which shares its entries with $a$ for $(j,k)\in S$ and has all its other entries equal to zero. Similarly, $a_{-S}$ shares its entries with $a$ for $(j,k)\notin S$ and has its other entries equal to zero.

## 2 Fast rates

Our objective is now to establish that $\hat f$ can adapt to the number of jumps in the main effects and the interactions. The main effects can be dealt with by using the results for the one-dimensional total variation regularized estimator (see Ortelli and van de Geer (2018) and Ortelli and van de Geer (2019b)). Thus, our main result will be the adaptation for the estimator of the interactions $\hat{\tilde f}$. It will be derived invoking a general result from Ortelli and van de Geer (2019b) for analysis problems. For that purpose, we need to establish a bound for the so-called effective sparsity (Definition 2.3), which in turn can be done by using interpolating matrices (see Lemma 2.5). This way of bounding the effective sparsity is an extension to the two-dimensional case of the bound on the effective sparsity for one-dimensional total variation regularized estimators based on interpolating vectors exposed in Ortelli and van de Geer (2019c).

### 2.1 Main result

In this section we present the main result: an oracle inequality for the estimator of the interactions $\hat{\tilde f}$.

Fix a set $S\subseteq[3:n_1-1]\times[3:n_2-1]$. The set $S$ could be the jump locations of some oracle approximation of $\tilde f^0$, but we do not insist on this. Let

 $d_{1,\max}(S) := \max_{m\in[1:s]}\max\{d_{1,m}^-,\,d_{1,m}^+\}$

and

 $d_{2,\max}(S) := \max_{m\in[1:s]}\max\{d_{2,m}^-,\,d_{2,m}^+\}.$

For all $(j,k)\in[2:n_1]\times[2:n_2]\setminus S$ we define weights $v_{j,k}$; see (2).

###### Theorem 2.1

Let $t > 0$ and $x > 0$. Choose

 $\lambda \ge 2\lambda_0(t)\sqrt{d_{1,\max}(S)/n_1 + d_{2,\max}(S)/n_2}.$

For all $f\in\mathbb{R}^{n_1\times n_2}$, we have with probability at least $1-\mathrm{e}^{-x}-\mathrm{e}^{-t}$ that

 $\|\hat{\tilde f}-\tilde f^0\|_2^2/n \le \|f-\tilde f^0\|_2^2/n + 4\lambda\|(\Delta f)_{-S}\|_1 + \Bigl(\sqrt{s/n} + \sqrt{2x/n} + \lambda\,\Gamma(S,v_{-S})\Bigr)^2,$

where $\Gamma(S,v_{-S})$ is the effective sparsity, defined in Section 2.2.

Theorem 2.1 is an application of Theorem 4.4 in Ortelli and van de Geer (2019c). The work goes mainly into establishing a bound for the quantity $\Gamma(S,v_{-S})$, the so-called "effective sparsity", which we define in Section 2.2. It depends on $S$, on the weights $v_{-S}$ and on a sign pattern $q_S$. The weights are given in (2).

###### Corollary 2.2

Let us take $S$ to be the jump locations of $\Delta f$. Suppose these jump locations lie on a regular grid, with $s_1$ locations in the first dimension and $s_2$ locations in the second dimension, so that $s = s_1 s_2$. We then have $d_{1,\max}(S)\asymp n_1/s_1$ and $d_{2,\max}(S)\asymp n_2/s_2$. Then

 $\Gamma^2(S,v_{-S}) \le 8\bigl(\log(en_1)+\log(en_2)\bigr)\,\frac{n_1n_2}{(n_1-1)(n_2-1)}\,\frac{s^2}{n}$

and we may take

 $\lambda = 2\lambda_0(t)\sqrt{(s_1+s_2)/s}.$

So then

 $\lambda^2\Gamma^2(S,v_{-S},q_S) \le 16\lambda_0^2(t)\,\frac{s(s_1+s_2)}{n}\cdot\frac{n_1n_2\bigl(\log(en_1)+\log(en_2)\bigr)}{(n_1-1)(n_2-1)}.$

If $s_1\asymp s_2\asymp\sqrt{s}$, say, we see that $\lambda^2\Gamma^2(S,v_{-S},q_S) = O(s^{3/2}\log(n)/n)$. In other words, the oracle bound is then, up to log-terms, of order $s^{3/2}/n$.

### 2.2 Interpolating matrix and partial integration

The definition of effective sparsity is as in Ortelli and van de Geer (2019c), but formulated in matrix (instead of vector) form.

###### Definition 2.3

Let $v$ be a matrix of weights with entries $v_{j,k}\in[0,1]$, $(j,k)\in[2:n_1]\times[2:n_2]$. Let $q$ be a matrix s.t.

 $q_{j,k}\in\{-1,+1\}$ for $(j,k)\in S$, and $q_{j,k}\in[-1,1]$ for $(j,k)\notin S$.

We call the matrix $q$ a sign configuration. The effective sparsity is

 $\Gamma(S,v_{-S},q_S) := \max\Bigl\{\mathrm{trace}\bigl(q_S^T(D_1fD_2^T)_S\bigr) - \bigl\|(1-v)_{-S}\odot(D_1fD_2^T)_{-S}\bigr\|_1 \,:\, \|f\|_2^2/n = 1\Bigr\}.$

Moreover we write

 $\Gamma(S,v_{-S}) := \max_{q_S}\,\Gamma(S,v_{-S},q_S).$
###### Definition 2.4

Consider some sign configuration $q_S$ and a matrix of weights $v$. We call $w(q_S)$ an interpolating matrix for the weights $v$ if it has the following properties:

• $w_S(q_S) = q_S$,

• $|w_{j,k}(q_S)| \le 1 - v_{j,k}$ for all $(j,k)\notin S$.

For completeness, we give the matrix version of Lemma 4.2 in Ortelli and van de Geer (2019c).

###### Lemma 2.5

We have

 $\Gamma^2(S,v_{-S},q_S) \le n\,\min_{w(q_S)}\|D_1^T w(q_S) D_2\|_2^2,$

where the minimum is over all interpolating matrices $w(q_S)$ for the sign configuration $q_S$.

###### Proof of Lemma 2.5.

Let $f\in\mathbb{R}^{n_1\times n_2}$ with $\|f\|_2^2/n = 1$ be arbitrary and let $q_S$ be a sign configuration. Then

 $\mathrm{trace}\bigl(q_S^T(D_1fD_2^T)_S\bigr) - \bigl\|(1-v)_{-S}\odot(D_1fD_2^T)_{-S}\bigr\|_1$
 $\le \mathrm{trace}\bigl(q_S^T(D_1fD_2^T)_S\bigr) - \bigl\|w_{-S}(q_S)\odot(D_1fD_2^T)_{-S}\bigr\|_1$
 $\le \mathrm{trace}\bigl(q_S^T(D_1fD_2^T)_S\bigr) + \mathrm{trace}\bigl(w_{-S}(q_S)^T(D_1fD_2^T)_{-S}\bigr)$
 $= \mathrm{trace}\bigl(w(q_S)^T D_1 f D_2^T\bigr)$
 $= \mathrm{trace}\bigl(D_2^T w(q_S)^T D_1 f\bigr)$
 $\le \sqrt{n}\,\|D_1^T w(q_S) D_2\|_2,$

where the last step uses the Cauchy-Schwarz inequality together with $\|f\|_2 = \sqrt{n}$.
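Lemma 2.5 can be sanity-checked numerically: any feasible $f$ yields a lower estimate of $\Gamma(S,v_{-S},q_S)$, which must stay below $n\|D_1^Tw(q_S)D_2\|_2^2$ for any interpolating matrix $w(q_S)$. A small sketch, under our own toy choices of $S$, $v$ and $w$ (none of which come from the paper):

```python
import numpy as np

def diff_matrix(m):
    return np.diff(np.eye(m), axis=0)  # (m-1) x m first-difference matrix

rng = np.random.default_rng(4)
n1 = n2 = 6
n = n1 * n2
D1, D2 = diff_matrix(n1), diff_matrix(n2)

jS, kS = 2, 2                                  # single active location S (0-based)
v = np.full((n1 - 1, n2 - 1), 0.5)             # weights in [0, 1]
q = 1.0                                        # sign on S

# an interpolating matrix: equals q on S, |w| <= 1 - v off S
w = rng.uniform(-0.5, 0.5, (n1 - 1, n2 - 1))
w[jS, kS] = q
upper = n * np.linalg.norm(D1.T @ w @ D2) ** 2

def objective(f):
    X = D1 @ f @ D2.T
    off = np.abs((1.0 - v) * X).sum() - (1.0 - v[jS, kS]) * abs(X[jS, kS])
    return q * X[jS, kS] - off

# candidate maximizers: a step image aligned with S, plus random probes
best = -np.inf
step = np.zeros((n1, n2))
step[jS + 1:, kS + 1:] = 1.0
for f in [step] + [rng.standard_normal((n1, n2)) for _ in range(500)]:
    f = f * np.sqrt(n) / np.linalg.norm(f)     # enforce ||f||_2^2 / n = 1
    best = max(best, objective(f))

assert best > 0.0
assert best ** 2 <= upper
```

The random search only lower-bounds the effective sparsity, so the assertion checks the direction of the inequality the lemma guarantees.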