1 Introduction
1.1 Model
We consider the problem of image denoising. We want to estimate the image based on its noisy observation
where
is a noise matrix with i.i.d. Gaussian entries with known variance
. Without loss of generality we assume throughout the paper that . Note that are matrices of dimension with entries the pixel values. Let . To estimate the image we use total variation regularized least squares.For two integers , we use the notation . If , we write .
For , we denote the entries of a matrix by (using subscripts) or alternatively by (using arguments). Indeed, we can also regard a matrix as a function with domain mapping to or a subset thereof.
In this paper, the total variation of an image is defined as
where, for ,
and is the two dimensional discrete derivative operator. Note that , where and are of the form
Thus we see that is a linear operator.
1.2 ANOVA decomposition
We now present the ANOVA decomposition of an image. This decomposition separates an image into four mutually orthogonal components: the global mean, the two matrices of the main effects  one for each of the two dimensions  and the matrix of interaction terms. We will need the insights brought by the ANOVA decomposition of an image to define the estimator for the interaction terms, which is the main object studied in this paper.
Let be the matrix having all the entries equal to one.
We decompose an image as
where is the global mean, and are the matrices of main effects and is the matrix of interaction terms.
The global mean is defined as
The main effects are defined as and , where
and
Note that has identical columns and has identical rows.
The interaction terms are defined as
We define
and similarly
Note that and are mutually orthogonal.
Let be the Frobenius norm
and let be the sum of the absolute values of the entries of , i.e.
We endow with the traceinner product. By orthogonality
1.3 Estimator
We consider the estimator
where are positive tuning parameters. We call the two dimensional total variation regularized least squares estimator. This estimator has the form of a socalled analysis estimator: it approximates the observations under a regularization penalty on the norm of a linear operator of the signal .
Since , , and are mutually orthogonal, we may decompose the estimator as
where
We are going to focus on the estimator of the interaction terms , which is an analysis estimator as well.
1.4 Contributions
We apply to image denoising an extension of the approach developed in Ortelli and van de Geer (2019c) for bounding the the effective sparsity for (higher order) total variation regularized estimators in one dimension. We obtain oracle inequalities with fast and slow rates for image denoising with the total variation penalty.
Theorem 1.1 (Main result for fast rates  simplified version of Theorem 2.1)
Let be an arbitrary subset of size of defining a regular grid parallel to the coordinate axes. Choose
Then,
, it holds that, with high probability,
We also show that the projection arguments by Dalalyan et al. (2017) allow us to recover the same rate obtained by entropy calculations for the estimator of the interactions in a simple and constantfriendly way at the price of an additional log term.
Theorem 1.2 (Main result for slow rates  simplified version of Theorem 3.1)
Consider a square image (i.e. ). Choose a subset of of size defining a regular grid parallel to the coordinate axes with
Choose
Then
1.5 Related literature
Total variation regularization as a method of proximal denoising of images dates back to Rudin et al. (1992). An early contribution treating total variation regularization on a two dimensional grid of points is the work by Mammen and van de Geer (1997). Two dimensional total variation regularization has been also in the focus of some recent studies as Sadhanala et al. (2016); Wang et al. (2016); Hütter and Rigollet (2016); Chatterjee and Goswami (2019).
Total variation regularized estimators in two dimensions usually take the form of analysis estimators, for which general oracle inequalities with fast and slow rates are derived in Ortelli and van de Geer (2019b). Moreover Ortelli and van de Geer (2019a) explain how to transform general analysis estimators into their synthesis form, based on the work by Elad et al. (2007).
A criterion to divide the previous theoretical contributions in the field of image denoising with total variation regularization is the way in which total variation in two dimensions is defined.
Let us in a first moment think of an image as of a (differentiable) function
, . In the continuous case, we can define the total variation of in two ways.
Either we use partial derivatives and define the total variation as

Or we use the total derivative and define the total variation as
In this paper we are going to consider the discrete case, where an image is a matrix of pixels . For the discrete case, by replacing derivatives by discrete differences, we find the following counterparts to the definitions of total variation provided above for the continuous case.

The first approach translates to the discrete case as
It corresponds, up to normalization, to summing up the edge differences of across the two dimensional grid graph. This approach is used in Sadhanala et al. (2016); Wang et al. (2016); Hütter and Rigollet (2016); Chatterjee and Goswami (2019). We denote the sum of the edge differences of across the two dimensional grid graph by

The parallel to the second approach is given by
and corresponds to the sum of the interaction terms among groups of 4 pixels. This approach is used in Mammen and van de Geer (1997) and will be used in this paper as well.
For the first approach, much work has been done. Sadhanala et al. (2016) derive minimax rates, which, for large and under the canonical scaling , are of order . Hütter and Rigollet (2016) derive sharp oracle inequalities, that result in the rate . The work by Wang et al. (2016) renders a rate of order (neglecting log terms), which under the canonical scaling (s. Sadhanala et al. (2016)), would result in an upper bound of order . Lastly, the very recent work by Chatterjee and Goswami (2019) focuses on the constrained optimization problem and a tuningfree version thereof. For a certain they obtain a rate faster than the minimax rate. Their approach, as the one of the work for higher order total variation regularization (Guntuboyina et al. (2017)), is based on bounding Gaussian widths of tangent cones.
In our latest work (Ortelli and van de Geer (2019c)), we derived oracle inequalities with fast rates by bounding the weighted weak compatibility constant introduced by Dalalyan et al. (2017). We want to apply the same approach here to obtain an oracle inequality with fast rates for the two dimensional total variation regularized estimator as defined by Mammen and van de Geer (1997).
Using oracle inequalities with slow rates, we also match (up to log terms) the rate obtained in Mammen and van de Geer (1997) by entropy calculations, which is . This is the last missing piece in our attempt to derive nonasymptotic counterparts of the theoretical guarantees on the mean squared error of the total variation regularized estimators handled in Mammen and van de Geer (1997).
We will assume throughout the paper that we have i.i.d. Gaussian errors with unit variance. For the case of unknown variance, Ortelli and van de Geer (2019b) show how to simultaneously estimate the signal and the variance of the noise by extending the idea of the square root lasso (Belloni et al. (2011)) to analysis estimators.
1.6 Synthesis formulation
The synthesis formulation of the two dimensional total variation regularized estimator expresses a matrix as linear combination of dictionary matrices, such that the penalty is the norm of (a subset of) the coefficients of the dictionary matrices. Here the notation with arguments instead of subscripts comes in handy. Consider some . We may write for and ,
where for the dictionary matrices are with
We call the collection of matrices the dictionary. Moreover,
Define
Remark 1.1
For a linear space, let let denote the projection operator on and be the antiprojection operator. Let be anoter linear space. Then .
We thus note that
Thus, the four linear spaces , , , are mutually orthogonal.
The following lemma shows how a matrix can be expressed as a linear combination of the elements of the dictionary and how in that case the meaning of the coefficients has to be adapted accordingly. In particular, the new dictionary can be partitioned into four mutually orthogonal parts corresponding to the sets and .
Lemma 1.3
It holds that
where
The next lemma, which is based on Lemma 1.3, gives us a synthesis form of the estimator for the interaction terms .
Lemma 1.4
We have
where
1.7 Notation
Fix some set
One may think of as being the active set of the interactions of some “oracle" approximation of . We refer to the elements of as (interaction) jump locations. The cardinality of is denoted by and the coordinates of a jump location are denoted by , . Note that we require that and . This assumption ensures that we have no boundary effects when doing partial integration, see Corollary 2.7.
Given , we can partition into subsets, subset consisting of the points closest to , . This corresponds to a Voronoi tessellation. The natural metric here would be the city block metric. However, a Voronoi tessellation typically has subsets of relatively irregular shape. We will require that the partition consists of rectangles. Indeed, a partition consisting of rectangles will ease the construction of an interpolating matrix, which is our tool to bound the effective sparsity (Theorem 2.1) and to obtain an oracle inequality with fast rates. The concepts interpolating matrix and effective sparsity are presented in Section 2.2.
Definition 1.5
We call a rectangular tessellation of
if it satisfies the following conditions:
each is a rectangle (),
,
for all and , the rectangles and () possibly share
boundary points, but not interior points,
for all the jump location is an interior point of .
For a rectangular tessellation we let be the area of the rectangle NorthWest of , the area to the SouthWest, the area to the SouthEast and the area to the NorthEast. In other words, if , , , are the four corners of the rectangle , starting with the topleft corner and going clockwise along the boundary, then
where
For two matrices and we use the symbol for entrywise multiplication:
Moreover
We will use the same notation for the matrix which shares its entries with for and has all its other entries equal to zero. Similarly, shares its entries with for and has its other entries equal to zero.
2 Fast rates
Our objective is now to establish that can adapt to the number of jumps in the main effects and the interactions. The main effects can be dealt with by using the results for the onedimensional total variation regularized estimator (see Ortelli and van de Geer (2018) and Ortelli and van de Geer (2019b)). Thus, our main result will be the adaptation for the estimator of the interactions. It will be derived invoking a general result from Ortelli and van de Geer (2019b) for analysis problems. For that purpose, we need to establish a bound for the socalled effective sparsity (Definition 2.3), which in turn can be done by using interpolating matrices (see Lemma 2.5
). This way of bounding the effective sparsity is an extension to the twodimensional case of the bound on the effective sparsity for onedimensional total variation regularized estimators based on interpolating vectors exposed in
Ortelli and van de Geer (2019c).2.1 Main result
In this section we present the main result: an oracle inequality for the estimator of the interactions .
Fix a set . The set could be the jump locations of some oracle approximation of but we do not insist on this. Let
and
For all we define
Theorem 2.1
Let . Choose
For all , we have with probability at least
where
(1) 
Theorem 2.1 is an application of Theorem 4.4 in Ortelli and van de Geer (2019c). The work goes mainly into establishing a bound for quantity , the socalled “effective sparsity" which we define in Section 2.2. It depends on , on the weights and on a sign pattern . The weights are given in (2).
Corollary 2.2
Let us take and S the jump locations of . Suppose these jump locations lie on a regular grid. We have , and for all . Then
and we may take
So then
If say, we see that . In other words, the oracle bound is then up to logterms of order .
2.2 Interpolating matrix and partial integration
The definition for effective sparsity is as in Ortelli and van de Geer (2019c), but formulated in matrix (instead of vector) form.
Definition 2.3
Let be a matrix of weights. Let be a matrix s.t.
We call the matrix a sign configuration. The effective sparsity is
Moreover we write
Definition 2.4
Consider some sign configuration and a matrix . We call an interpolating matrix for the weights if it has the following properties:

, ,

.
For completeness, we give the matrix version of Lemma 4.2 in Ortelli and van de Geer (2019c).
Lemma 2.5
We have
where the minimum is over all interpolating matrices for the sign configuration .
Proof of Lemma 2.5.
Let be arbitrary and let be a sign configuration. Then