Unifying local and non-local signal processing with graph CNNs

02/24/2017 · by Gilles Puy, et al.

This paper deals with the unification of local and non-local signal processing on graphs within a single convolutional neural network (CNN) framework. Building upon recent works on graph CNNs, we propose to use convolutional layers that take as inputs two variables, a signal and a graph, allowing the network to adapt to changes in the graph structure. In this article, we explain how this framework allows us to design a novel method to perform style transfer.


1. Introduction

Convolutional neural networks (CNNs) have achieved unprecedented performance in a wide variety of applications, in particular for image analysis, enhancement and editing – e.g., classification [krizhevsky12], super-resolution [dong16], and colorisation [zhang16]. Yet standard CNNs can only handle signals that live on a regular grid, and each layer of a CNN performs only local processing. Locality has already been identified as a limitation for classical signal processing tasks, where powerful non-local methods have been proposed, such as patch-based methods for inpainting [criminisi04] or denoising [buades11, talmon2011transient]. Regular CNNs do not allow such non-local processing. Furthermore, the growing amount of signals collected on irregular grids, such as social, transportation or biological networks, requires extending signal processing from regular to irregular graphs [shuman_SPMAG2013].

Any CNN consists of a composition of convolutional and pooling layers. One should thus redefine both convolution and pooling to handle "graph signals". In this work, we use convolutional layers only and hence concentrate on the generalisation of the convolution. One major challenge in this generalisation is to take into account the possible changes of the graph structure from one signal instance to another: nodes and edges can appear or disappear, and the edge weights can vary. For instance, the connections between the users (the graph's vertices) in a social network change over time. It would be cumbersome to retrain a CNN each time a connection changes. In non-local signal processing methods, the situation is even more extreme, as the graph is a construct whose edges typically capture similarities between different parts of the signal itself. In this case, the CNN must not just be robust to a few variations in the graph structure but fully adapt to these variations. We propose here a solution to this challenge by passing two variables to the CNN: the signal itself, as usual, and the graph structure.

Contributions – We propose a graph CNN framework that takes as inputs two variables: a signal and a graph structure. This permits the adaptation of the CNN to changes in the structure of the graph on which the signal lives, even in the extreme case where this structure changes with the input signal itself. We also propose a unified way of defining convolutions on arbitrary graphs, in particular non-local convolutions, applicable to a wide range of signal processing tasks. Due to space constraints, we only present the use of graph CNNs for image style transfer in this article. We use a local CNN to capture and transfer local style properties of the painting to the photograph. We also use a non-local graph CNN to capture and transfer global style properties of the painting, as well as to preserve the content of the photograph. In addition, we show that this task can be done using only two random shallow networks, instead of a trained regular deep CNN [gatys16].

Let us mention that additional experiments in the Appendix demonstrate the effectiveness and versatility of our framework on other kinds of signals (greyscale images, color palettes, and speech signals) and tasks (color transfer and denoising). In particular, the experiments show that it is possible to identify the optimal mixing of local and non-local signal processing techniques by learning.

2. Graph CNN

2.1. State-of-the-art methods

In this section, we review different existing solutions to generalise CNNs to signals living on graphs. The reader can refer to [bronstein16] for a detailed overview. We restrict our attention here to the solutions most closely connected to ours.

A first approach to redefine the convolution for graph signals is to work in the spectral domain, for which we need to define the graph Fourier transform. To introduce this transform, we consider an undirected weighted graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}\}$ with graph Laplacian denoted by $\mathbf{L} \in \mathbb{R}^{n \times n}$. (A graph is a set of vertices $\mathcal{V} = \{1, \dots, n\}$, a set of edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, and a weighted adjacency matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$, with $W_{ij} \neq 0$ iff $(i, j) \in \mathcal{E}$. In this paper, we consider directed graphs unless explicitly stated; the matrix $\mathbf{W}$ is thus not symmetric in general.) For example, $\mathbf{L}$ can be the combinatorial graph Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$, or the normalised one $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$, where $\mathbf{I}$ is the identity matrix and $\mathbf{D}$ is the diagonal degree matrix with entries $D_{ii} = \sum_j W_{ij}$ [chung_book1997]. The matrix $\mathbf{L}$ is real, symmetric and positive semi-definite. Thus, there exists a set of orthonormal eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_n$ and real eigenvalues $\lambda_1, \dots, \lambda_n \geq 0$ such that $\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$, where $\mathbf{U} = (\mathbf{u}_1, \dots, \mathbf{u}_n)$ and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. The matrix $\mathbf{U}^\top$ is viewed as the graph Fourier basis [shuman_SPMAG2013].
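To make the construction concrete, here is a minimal NumPy sketch of the graph Fourier transform for a small undirected graph given by its weighted adjacency matrix; variable names are ours, not the paper's.

```python
# A minimal sketch of the graph Fourier transform on a toy undirected graph.
import numpy as np

def graph_fourier_basis(W):
    """Eigenvalues/eigenvectors of the combinatorial Laplacian L = D - W."""
    D = np.diag(W.sum(axis=1))          # diagonal degree matrix
    L = D - W                           # combinatorial graph Laplacian
    lam, U = np.linalg.eigh(L)          # L = U diag(lam) U^T, U orthonormal
    return lam, U

W = np.array([[0., 1., 0.], [1., 0., 2.], [0., 2., 0.]])
lam, U = graph_fourier_basis(W)
x = np.array([1., -2., 3.])             # a signal on the 3 vertices
x_hat = U.T @ x                         # graph Fourier transform
x_rec = U @ x_hat                       # inverse transform recovers x
assert np.allclose(x, x_rec)
```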

For any signal $\mathbf{x} \in \mathbb{R}^n$ defined on the vertices of $\mathcal{G}$, $\mathbf{U}^\top \mathbf{x}$ is its graph Fourier transform. One way to define the convolution on $\mathcal{G}$ between $\mathbf{x}$ and a filter $\mathbf{g}$ is by filtering in the graph Fourier domain:

(1)  $\mathbf{x} \ast_{\mathcal{G}} \mathbf{g} = \mathbf{U} \left[ \hat{\mathbf{g}} \odot \left( \mathbf{U}^\top \mathbf{x} \right) \right],$

where $\odot$ denotes the entry-wise multiplication and $\hat{\mathbf{g}} = \mathbf{U}^\top \mathbf{g}$. In the context of graph CNNs, this is the approach chosen in [bruna14]. This approach however has several drawbacks: computing $\mathbf{U}$ is often intractable for real-size graphs; matrix-vector multiplication with $\mathbf{U}$ is usually slow (there is no fast graph Fourier transform); this definition does not allow variations in the graph structure, as the matrix $\mathbf{U}$ is impacted by any such change; and the number of filter coefficients to learn is as large as the size of the input signal. The subsequent work [henaff15] solves this last issue by imposing that the filter $\hat{\mathbf{g}}$ lives in the span of a kernel matrix with far fewer columns than $n$.

To overcome the computational issues of the spectral approach, a known trick in the field of graph signal processing is to define the filter as a polynomial of the graph eigenvalues [hammond11]. Let $p$ be a polynomial of degree $d$: $p(\lambda) = \sum_{\ell=0}^{d} \alpha_\ell \, \lambda^\ell$, with $\alpha_0, \dots, \alpha_d \in \mathbb{R}$, and consider the filter $\hat{\mathbf{g}} = \left( p(\lambda_1), \dots, p(\lambda_n) \right)^\top$. One can easily prove that spectral filtering with $\hat{\mathbf{g}}$ satisfies

(2)  $\mathbf{U} \left[ \hat{\mathbf{g}} \odot \left( \mathbf{U}^\top \mathbf{x} \right) \right] = \sum_{\ell=0}^{d} \alpha_\ell \, \mathbf{L}^\ell \, \mathbf{x}.$

This expression involves only computations in the vertex domain, through matrix-vector multiplications with $\mathbf{L}$. As the Laplacian is usually a sparse matrix, filtering a signal with a polynomial filter is fast. This is the approach adopted by [defferrard16] and [kipf16] in their constructions of graph CNNs. Beyond the computational improvement, the number of coefficients to learn is also reduced: $d + 1$ instead of $n$. Furthermore, the localisation of the filter in the vertex domain is exactly controlled by the degree of the polynomial [hammond11, defferrard16]. Yet these polynomial filters are not entirely satisfying. Indeed, for, e.g., a graph modelling a regular lattice, polynomial filters are isotropic, unlike those in regular CNNs for images – where the underlying graph is precisely such a lattice. There is thus no equivalence between regular CNNs and graph CNNs with polynomial filters. Let us also mention the work of [atwood16], where the convolution is defined using a diffusion process on the graph. Due to lack of space we do not report the exact definition, but it shares similarities with (2), with the normalised transition matrix substituted for the Laplacian.
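As an illustration of eq. (2), the following sketch applies a polynomial filter using only sparse matrix-vector products (a Horner scheme); the Laplacian construction and the coefficients are placeholders.

```python
# A sketch of polynomial graph filtering, eq. (2): no eigendecomposition needed.
import numpy as np
import scipy.sparse as sp

def poly_filter(L, x, alpha):
    """Apply the filter sum_l alpha[l] * L^l to the signal x (Horner scheme)."""
    y = alpha[-1] * x
    for a in reversed(alpha[:-1]):
        y = L @ y + a * x
    return y

n = 5
W = sp.random(n, n, density=0.4)
W = W + W.T                                      # symmetric toy weights
L = sp.diags(np.ravel(W.sum(axis=1))) - W        # combinatorial Laplacian
x = np.random.randn(n)
y = poly_filter(L, x, alpha=[0.5, -0.2, 0.1])    # degree-2 filter
```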

In our work, we build upon the work of [niepert16] and [monti16] to get rid of these shortcomings. The convolutions are directly defined in the vertex domain in a way which allows one to directly control the computational complexity and the localisation of the filters. Furthermore, these filters do not suffer from the isotropy issue of polynomial filters.

2.2. Our method

Each layer of our graph CNN implements a function $f$ of two variables,

(3)  $(\mathbf{X}, G) \mapsto f(\mathbf{X}, G),$

where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the input signal and $G$ is an $n$-vertex graph on which the columns of $\mathbf{X}$ live and which defines how the convolution is done in this layer. The input signal has size $n$ in the "spatial" dimension – e.g., $n$ pixels for images – and has $p$ channels or feature maps – e.g., $p = 3$ for color images. The output signal $f(\mathbf{X}, G) \in \mathbb{R}^{n \times q}$ has the same spatial size – we do not use any pooling layers in this work – and $q$ feature maps.

2.2.1. Convolution

The convolution we use follows principles also used in, e.g., [scarselli09, li16, niepert16, monti16], where the computation done at one vertex is a function of (at least) the values of the signal at this vertex and neighbouring vertices as well as of labels attributed to each edge. We choose here to use the formalism of [monti16] for our description.

Convolutions in [monti16] are done in two steps: the extraction of a signal patch around each vertex, followed by a scalar product with the filter coefficients. We assume here that all vertices have the same number of connections: $|\mathcal{N}_i| = s$ for all $i \in \mathcal{V}$, where $\mathcal{N}_i$ denotes the set of vertices connected to vertex $i$. If this is not the case, one can always complete the set of edges and associate to these extra edges, e.g., a null weight. We also assume that $(i, i) \in \mathcal{E}$ for all $i \in \mathcal{V}$, i.e., each vertex is connected to itself.

For a given graph $G$ satisfying the above assumptions and with adjacency matrix $\mathbf{W}$, we model patch extraction at vertex $i$ with a function $\mathcal{P}_i : \mathbb{R}^n \to \mathbb{R}^s$,

(4)  $\mathcal{P}_i(\mathbf{x}) = \left( p_{i1}(\mathbf{x}), \dots, p_{is}(\mathbf{x}) \right)^\top,$

where each $p_{ik}$, $k = 1, \dots, s$, extracts one entry of the vector $\mathbf{x} \in \mathbb{R}^n$, which represents one column of the input signal $\mathbf{X}$. Let $j_1, \dots, j_s$ be the indices to which $i$ is connected. The order in which these entries are extracted by $\mathcal{P}_i$ is determined by "pseudo-coordinates" attributed to each connected vertex [monti16]. The nature of these pseudo-coordinates will be given in Section 2.2.2 for the local convolution and in Section 2.2.3 for the non-local convolution. We define

(5)  $p_{ik}(\mathbf{x}) = \theta_k(\mathbf{w}_i) \, x_{j_k},$

where $x_{j_k}$ is the $j_k$-th entry of $\mathbf{x}$. The vector $\mathbf{w}_i$ is the $i$-th row of $\mathbf{W}$, which contains at most $s$ non-zero entries, and $\theta$ is a re-weighting function that gives the possibility to account for each edge weight in the convolution. We noticed that the choice of this function is very important in the definition of the non-local convolutions to achieve good results in our signal processing applications (see its definition in Section 2.2.3). Note that $\theta$ depends on the vertex $i$ and on the weights $\mathbf{w}_i$ in our work, while this function depends solely on the pseudo-coordinates in [monti16]. This is a simple but important modification for our applications.

Convolving $\mathbf{x}$ with a filter $\mathbf{g} \in \mathbb{R}^s$ is then defined as in [monti16]:

(6)  $\left( \mathbf{x} \ast_G \mathbf{g} \right)_i = \mathbf{g}^\top \mathcal{P}_i(\mathbf{x}),$

for all $i \in \mathcal{V}$. Finally, the function $f$ in (3) satisfies

(7)  $f(\mathbf{X}, G) = h\!\left( \sum_{j=1}^{p} \mathbf{x}_j \ast_G \mathbf{g}_{1j} + b_1 \mathbf{1}, \; \dots, \; \sum_{j=1}^{p} \mathbf{x}_j \ast_G \mathbf{g}_{qj} + b_q \mathbf{1} \right),$

where $h$ is an element-wise non-linearity, e.g., the ReLU defined as $h(t) = \max(t, 0)$, $\mathbf{x}_j$ denotes the $j$-th column-vector of $\mathbf{X}$, $\mathbf{g}_{kj} \in \mathbb{R}^s$, $k = 1, \dots, q$, $j = 1, \dots, p$, are filters, and $b_1, \dots, b_q \in \mathbb{R}$ are biases.

Let us highlight that the size $n$ of the input signal in the spatial dimension is not fixed in (7). Hence, $f$ can be computed for signals of different sizes using the same filters $\mathbf{g}_{kj}$, exactly as with regular CNNs.
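For concreteness, here is a minimal NumPy sketch of the convolution (4)-(7) under the stated assumptions (every vertex has exactly $s$ neighbours, self-loop included). The arrays `nbrs` (neighbour indices, already ordered by pseudo-coordinates) and `theta` (re-weights) are our own encoding of $\mathcal{P}_i$ and $\theta$; this is an illustration, not the authors' implementation.

```python
import numpy as np

def graph_conv(x, nbrs, theta, g):
    """(x *_G g)_i = g^T P_i(x). nbrs: (n, s) ints, theta: (n, s), g: (s,)."""
    patches = theta * x[nbrs]            # (n, s): patch extraction, eqs. (4)-(5)
    return patches @ g                   # (n,):  scalar product, eq. (6)

def layer(X, nbrs, theta, G, b, h=lambda t: np.maximum(t, 0.0)):
    """Eq. (7): X is (n, p), G is (q, p, s) filters, b is (q,) biases."""
    q = G.shape[0]
    out = np.stack(
        [sum(graph_conv(X[:, j], nbrs, theta, G[k, j]) for j in range(X.shape[1]))
         + b[k] for k in range(q)], axis=1)
    return h(out)                        # element-wise non-linearity
```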

We explain in the next section how one can recover the usual local convolution for images from this definition. We then continue with the description of the proposed non-local filtering in Section 2.2.3.

2.2.2. Local convolution

As noticed in [monti16], the above definition of convolution permits us to recover easily the standard convolution for images (or, similarly, for signals on regular lattices) by constructing a local graph from the Cartesian coordinates of each pixel in the image (we consider a regular grid of equispaced pixels). We denote these coordinates $\mathbf{c}_i = (c_i^{(1)}, c_i^{(2)})$, $i = 1, \dots, n$. For a filter of size $\sqrt{s} \times \sqrt{s}$ (with $\sqrt{s}$ odd), we connect each pixel $i$ to all its local neighbours $j$ that satisfy $|c_j^{(\ell)} - c_i^{(\ell)}| \leq (\sqrt{s}-1)/2$ for $\ell = 1, 2$. We then build the local adjacency matrix $\mathbf{W}^{\mathrm{loc}}$ that satisfies $W^{\mathrm{loc}}_{ij} = 1$ if $(i, j) \in \mathcal{E}$, and $0$ otherwise. The pseudo-coordinates are determined using the relative position $\mathbf{c}_j - \mathbf{c}_i$ of each pixel $j$ to pixel $i$. For any pixel $i$ of the image, the connected pixels have relative coordinates in $\{-(\sqrt{s}-1)/2, \dots, (\sqrt{s}-1)/2\}^2$. We thus create a look-up table that associates a unique integer $k \in \{1, \dots, s\}$ to each of these relative coordinates, which fixes the extraction order $j_1, \dots, j_s$.

With this procedure the pixels are always extracted in the same order, e.g., lexicographically. Finally, (6) is equivalent to the usual convolution when using $\theta \equiv 1$ in (5).
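The look-up-table construction can be sketched as follows for a 3×3 filter on an H×W image; border handling (padding, or dummy edges with zero weight) is omitted for brevity. With `theta` identically one, the `graph_conv` sketch above then reduces to the standard 2D convolution.

```python
import numpy as np

def local_neighbours(H, W):
    """Neighbour indices for a 3x3 filter, ordered by a fixed look-up table."""
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # look-up table
    nbrs = np.zeros((H * W, len(offsets)), dtype=int)
    for y in range(1, H - 1):            # interior pixels only, for brevity
        for x in range(1, W - 1):
            i = y * W + x
            nbrs[i] = [(y + dy) * W + (x + dx) for dy, dx in offsets]
    return nbrs
```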

2.2.3. Non-local convolution

We now describe our proposal to perform more general non-local convolutions, i.e., we give the definition of the pseudo-coordinates and of the re-weighting function $\theta$ in (5). In our applications, these convolutions are based on a graph $G$ that captures some structure that we wish to preserve in the signal. The exact construction of $G$ thus differs depending on the application. Yet, we always define the pseudo-coordinates and the function $\theta$ in the same way, whatever the application.

The weight $W_{ij}$ of the edge between vertices $i$ and $j$ is determined based on a distance between feature vectors $\mathbf{f}_i$ and $\mathbf{f}_j$ extracted at vertices $i$ and $j$, respectively. Let $d_{ij}$ denote this distance. Let again $j_1, \dots, j_s$ be the vertices to which vertex $i$ is connected, now ordered such that $d_{ij_1} \leq d_{ij_2} \leq \dots \leq d_{ij_s}$. We propose to define the pseudo-coordinate of the neighbour $j_k$ as its rank $k$, so that $p_{ik}$ extracts the entry at the $k$-th closest vertex. In other words, the pseudo-coordinates re-order the distances between feature vectors in increasing order. Note that we break any tie arbitrarily.

Finally, we propose to use the following re-weighting function in (5):

(8)  $\theta_k(\mathbf{w}_i) = \frac{W_{ij_k}}{\sum_{l=1}^{s} W_{ij_l}},$

where $W_{ij_k}$ is the $j_k$-th entry of $\mathbf{w}_i$.
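A small sketch of this non-local construction, assuming Euclidean feature distances and a Gaussian kernel of width `sigma` (the kernel is our illustrative choice; the applications below define the weights per task):

```python
import numpy as np

def nonlocal_graph(feats, s, sigma=1.0):
    """Neighbours sorted by increasing feature distance, plus weights of eq. (8)."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)  # pairwise distances
    nbrs = np.argsort(d2, axis=1)[:, :s]          # s nearest neighbours (self first)
    w = np.exp(-np.take_along_axis(d2, nbrs, 1) / sigma**2)      # edge weights
    theta = w / w.sum(axis=1, keepdims=True)      # eq. (8): normalised weights
    return nbrs, theta
```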

3. Style transfer with graph CNNs

In this section, we substitute the shorthand $f(\mathbf{X})$ for $f(\mathbf{X}, G)$ in (7) to simplify notations. However, one should not forget that the convolution at each layer is defined by an underlying graph $G$. This graph will always be defined explicitly in the text.

Style transfer consists in transforming a target image $\mathbf{T}$, typically a photograph, to give it the "style" of a source image $\mathbf{S}$, typically a painting. Impressive results have recently been obtained using CNNs [gatys16]. The style transfer method of Gatys et al. consists in solving a minimisation problem of the form

(9)  $\min_{\mathbf{X}} \; \sum_{l \in \mathcal{L}_{\mathrm{s}}} \left\| F_l(\mathbf{X})^\top F_l(\mathbf{X}) - F_l(\mathbf{S})^\top F_l(\mathbf{S}) \right\|_{\mathrm{F}}^2 \, + \, \lambda \sum_{l \in \mathcal{L}_{\mathrm{c}}} \left\| F_l(\mathbf{X}) - F_l(\mathbf{T}) \right\|_{\mathrm{F}}^2,$

where $F_l(\cdot)$ is a matrix holding the feature maps at depth $l$ of a multi-layer CNN, $\mathcal{L}_{\mathrm{s}}$ and $\mathcal{L}_{\mathrm{c}}$ are two subsets of depths, and $\|\cdot\|_{\mathrm{F}}$ denotes the Frobenius norm. Gatys et al. used the very deep VGG-19 network, pre-trained for image classification [simonyan14]. The first term encourages the solution to have the style of the painting by matching the Gram matrices of the feature maps, such statistics capturing texture patterns at different scales. The second term ensures that the main structures (the "content") of the original photograph $\mathbf{T}$, as captured in the feature maps, are preserved in $\mathbf{X}$. Note that all the spatial information is lost in the first term, which encodes the style, while it is still present in the second term. It was proved shortly after that similar results can be obtained using a deep neural network with all the filter coefficients chosen randomly [he16]. Let us also mention that [ustyuzhaninov16] showed that texture synthesis, i.e., the case where only the first term in (9) is involved, can be done using multiple (8) one-layer CNNs with random filters, each yielding many feature maps.

We now show that our graph-based CNNs allow us to revisit neural style transfer. We use only two one-layer graph CNNs with random filters, each yielding only a small number of feature maps. This is a much "lighter" network than the ones used in the literature. The first network, denoted $f_{\mathrm{loc}}$, uses local convolutions (Section 2.2.2) and the second, denoted $f_{\mathrm{nl}}$, uses non-local convolutions (Section 2.2.3) on a graph that captures the structure of the photograph to be preserved. Both $f_{\mathrm{loc}}$ and $f_{\mathrm{nl}}$ have the form (7) with $p = 3$, for the three Lab channels of color images, and the same number $q$ of output feature maps. We also choose the same filter size $s$ and the ReLU non-linearity in both cases. The coefficients of the filters and the biases in (7) are chosen randomly using independent draws from the standard Gaussian distribution.

The graph in the second CNN is constructed as follows. For an image of interest, we construct a feature vector $\mathbf{f}_i$ at each pixel $i$ of the image by extracting all the pixels' Lab values in a small neighbourhood around $i$, as well as the absolute 2D coordinates of the pixel. We then search the $s$ nearest neighbours of each $\mathbf{f}_i$ in the set $\{\mathbf{f}_1, \dots, \mathbf{f}_n\}$ using the Euclidean distance. Let $\mathcal{D}$ be the set of all distances between each $\mathbf{f}_i$ and its $s$ nearest neighbours; we have $|\mathcal{D}| = ns$. To avoid that some pixels are too weakly connected to the others, which then produces artefacts in the final images, we compute a high percentile of the values in $\mathcal{D}$ and saturate all the distances above this percentile to this value. The weights of the adjacency matrix then satisfy

(10)  $W_{ij_k} = \exp\!\left( - \, d_{ij_k}^2 / \sigma^2 \right),$

with $\sigma$ equal to this percentile of $\mathcal{D}$.
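A possible implementation of this graph construction, assuming scikit-learn for the nearest-neighbour search; the patch size, the number of neighbours `s` and the percentile are placeholders, not the paper's values.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def style_graph(lab_img, patch=3, s=9, percentile=90):
    """Non-local graph on an (H, W, 3) Lab image, following eq. (10)."""
    H, W, _ = lab_img.shape
    r = patch // 2
    pad = np.pad(lab_img, ((r, r), (r, r), (0, 0)), mode="edge")
    coords = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)
    patches = np.stack([pad[y:y + patch, x:x + patch].ravel()
                        for y in range(H) for x in range(W)])
    feats = np.hstack([patches, coords.reshape(-1, 2)])   # Lab patch + 2D position
    dists, nbrs = NearestNeighbors(n_neighbors=s).fit(feats).kneighbors(feats)
    sigma = np.percentile(dists, percentile)
    dists = np.minimum(dists, sigma)               # saturate weak connections
    weights = np.exp(-dists**2 / sigma**2)         # eq. (10)
    return nbrs, weights
```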

Figure 1. Examples of style transfer results obtained where the photograph (left) is transformed to have the style of a given painting (top). Using the proposed graph CNN framework, only two one-layer random CNNs are required to extract matched statistics.

We capture the style of the painting $\mathbf{S}$ by computing the Gram matrices

(11)  $\mathbf{G}_{\mathrm{loc}} = f_{\mathrm{loc}}(\mathbf{S})^\top f_{\mathrm{loc}}(\mathbf{S}) \quad \text{and} \quad \mathbf{G}_{\mathrm{nl}} = f_{\mathrm{nl}}(\mathbf{S})^\top f_{\mathrm{nl}}(\mathbf{S}),$

where the non-local convolution in $f_{\mathrm{nl}}$ is computed using the graph constructed on $\mathbf{S}$. The matrix $\mathbf{G}_{\mathrm{loc}}$ captures local statistics while $\mathbf{G}_{\mathrm{nl}}$ captures non-local statistics. To give the style of $\mathbf{S}$ to $\mathbf{T}$, we now compute a new graph on $\mathbf{T}$. We then compute an image $\mathbf{X}^\star$ by solving

(12)  $\min_{\mathbf{X}} \; \alpha \left\| f_{\mathrm{loc}}(\mathbf{X})^\top f_{\mathrm{loc}}(\mathbf{X}) - \mathbf{G}_{\mathrm{loc}} \right\|_{\mathrm{F}}^2 + \beta \left\| f_{\mathrm{nl}}(\mathbf{X})^\top f_{\mathrm{nl}}(\mathbf{X}) - \mathbf{G}_{\mathrm{nl}} \right\|_{\mathrm{F}}^2 + \gamma \left\| \mathbf{X} \right\|_{\mathrm{TV}},$

where $f_{\mathrm{nl}}$ uses, this time, the graph constructed on $\mathbf{T}$ for the non-local convolutions, $\|\cdot\|_{\mathrm{TV}}$ is the Total Variation norm, and $\alpha, \beta, \gamma > 0$. Note that, unlike in (9), we do not try to match feature maps but only Gram matrices. Yet the final image retains the structure of $\mathbf{T}$ thanks to the non-local convolution in $f_{\mathrm{nl}}$.

In practice, we minimise (12) using the L-BFGS algorithm, starting from a random initialisation of $\mathbf{X}$. The parameters $\alpha$ and $\beta$ are computed so that the gradient coming from the term they respectively influence has a fixed maximum amplitude at the first iteration of the algorithm, and $\gamma$ is set to a fixed value. All images used in the experiments share the same size. However, we do not solve (12) directly at this resolution, but in a coarse-to-fine scheme instead: we start by downsampling all images to a coarse resolution; solve (12) at this resolution; upscale the solution to the next resolution; restart the same process at this new resolution using the up-scaled image as initialisation; and repeat until the final resolution is reached.
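The coarse-to-fine loop can be sketched as follows, assuming a function `loss_and_grad(x, shape)` that evaluates the objective of (12) and its gradient at a given resolution (SciPy's L-BFGS is used here; the number of levels is illustrative).

```python
import numpy as np
from scipy.optimize import minimize
from skimage.transform import resize

def coarse_to_fine(loss_and_grad, final_shape, levels=3, seed=0):
    """Optimise from coarse to fine, re-using each solution as initialisation."""
    rng = np.random.default_rng(seed)
    shapes = [tuple(d // 2**k for d in final_shape) for k in range(levels - 1, -1, -1)]
    x = rng.standard_normal(shapes[0] + (3,))      # random init at coarsest level
    for shape in shapes:
        x = resize(x, shape + (3,))                # up-scale previous solution
        res = minimize(loss_and_grad, x.ravel(), args=(shape,),
                       jac=True, method="L-BFGS-B")
        x = res.x.reshape(shape + (3,))
    return x
```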

We present some results obtained with our graph-based method in Fig. 1. One can notice that the main structure of the photograph clearly appears in $\mathbf{X}^\star$ thanks to the presence of the structure-preserving non-local convolutions in $f_{\mathrm{nl}}$. The style is also well transferred thanks to the matching of the Gram matrices $\mathbf{G}_{\mathrm{loc}}$ and $\mathbf{G}_{\mathrm{nl}}$.

To highlight the role of the graph CNN with non-local convolutions, we repeat exactly the same experiments but using only the non-local graph CNN, i.e., we do not use the regular CNN with local convolutions – the TV regularisation is still present. Fig. 2 shows results for one photograph and different paintings. First, we notice that the main structures of the photograph are well preserved thanks to the graph CNN. Second, the colors of the painting and their relative arrangement are well transferred. However, we are not able to transfer finer style details like brush strokes. On the contrary, the local CNN is able to capture these finer details, which appear in the results of the complete method.

Let us highlight that this experiment already shows that our graph CNN framework can adapt to many changes in the graph structure. Indeed, the graph used in $f_{\mathrm{nl}}$ was built from the painting $\mathbf{S}$ when computing $\mathbf{G}_{\mathrm{nl}}$ in (11), while it is built from the photograph $\mathbf{T}$ when solving (12).

Figure 2. Examples of style transfer results obtained with our graph CNN method using only non-local convolutions where the photograph (left) is transformed to have the style of a given painting (top).

4. Conclusion

We proposed a graph CNN framework that allows us to unify local and non-local processing of signals on graphs, and showed how to use this framework to perform style transfer. The results already suggest that the proposed convolution adapts correctly to changes in the input graph. Additional experiments in the Appendix demonstrate the versatility of our framework on other kinds of signals and tasks. Beyond signal processing, we believe that some of the tools presented here can be useful to other applications involving time-varying graph structures, such as in social networks.


References

Appendix A - Color transfer

In this appendix, we address another application, color transfer, where the goal is to transfer the "color palette" of a source image onto that of a target image. First, we describe how we perform this task using graph CNNs. Second, we describe a more traditional approach using optimal transport, such as used in [frigo14], with which we compare our results.

A.1. Color transfer with graph CNNs

All images are converted to Lab and only the chrominances a and b are processed – the luminance of target images remains untouched. The first step consists in building a palette representative of the color distribution of each image. This is done by clustering each image's chrominances into $m$ clusters using $k$-means. We thus obtain a source palette $\mathbf{P}_{\mathrm{s}} \in \mathbb{R}^{m \times 2}$ and a target one $\mathbf{P}_{\mathrm{t}} \in \mathbb{R}^{m \times 2}$. The first and second columns of these palettes represent respectively the a and b channels. The pixel values in the ab channels are normalised and shifted to lie in a fixed range.
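A minimal sketch of this palette-extraction step, assuming scikit-learn and scikit-image are available; the number of clusters `m` is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab

def palette(rgb_img, m=32):
    """Cluster the ab chrominances of an image; cluster centres form the palette."""
    ab = rgb2lab(rgb_img)[..., 1:].reshape(-1, 2)   # keep a and b channels only
    km = KMeans(n_clusters=m, n_init=10).fit(ab)
    return km.cluster_centers_                      # (m, 2) palette
```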

Similarly to what was done for style transfer, we capture the "statistics" of the reference palette by computing the Gram matrix of the output feature maps of a graph convolutional layer $f$ of the form (7). We have $f : \mathbb{R}^{m \times 2} \to \mathbb{R}^{m \times q}$, with $p = 2$ input channels (the a and b columns) and a fixed number $q$ of feature maps. The coefficients of the filters and the biases in (7) are chosen randomly using independent draws from the standard Gaussian distribution. We use the ReLU non-linearity.

The network implements non-local convolutions (Section 2.2.3) with a graph constructed as follows. For a palette $\mathbf{P}$, we construct a feature vector $\mathbf{f}_i$ for each palette element $i$ that simply contains the ab values of that element. We then search the $s$ nearest neighbours of each $\mathbf{f}_i$ in the set $\{\mathbf{f}_1, \dots, \mathbf{f}_m\}$ using the Euclidean distance. The weights of the adjacency matrix representing the graph used for convolution satisfy

(13)  $W_{ij} = \exp\!\left( - \left\| \mathbf{f}_i - \mathbf{f}_j \right\|_2^2 / \sigma^2 \right),$

with $\sigma > 0$, for connected palette entries, and $W_{ij} = 0$ otherwise.

We capture the color statistics of the source image by computing the Gram matrix

(14)  $\mathbf{G}_{\mathrm{s}} = f(\mathbf{P}_{\mathrm{s}})^\top f(\mathbf{P}_{\mathrm{s}}),$

where the graph used in $f$ was built using $\mathbf{P}_{\mathrm{s}}$.

To find a mapping between the colors in the source and target images, we solve the following minimisation problem:

(15)  $\min_{\mathbf{P} \in \mathbb{R}^{m \times 2}} \; \left\| f(\mathbf{P})^\top f(\mathbf{P}) - \mathbf{G}_{\mathrm{s}} \right\|_{\mathrm{F}}^2 + \mu \left\| \mathbf{P} - \mathbf{P}_{\mathrm{t}} \right\|_{\mathrm{F}}^2 + \nu \, \mathrm{tr}\!\left( \mathbf{P}^\top \mathbf{L} \mathbf{P} \right),$

where the graph used for the convolutions in $f$ is now built using $\mathbf{P}_{\mathrm{t}}$, and $\mu, \nu > 0$. We explain the role of each term and give the definition of $\mathbf{L}$ in the next paragraph. We solve this problem using the L-BFGS algorithm starting from $\mathbf{P}_{\mathrm{t}}$ as initialisation. The parameter $\mu$ is computed so that the gradient coming from the term it influences has a fixed maximum amplitude at the first iteration of the algorithm, and $\nu$ is set to a fixed value. Let $\mathbf{P}^\star$ be the obtained solution to (15). We transform the colors in the target image as follows. For each pixel ab value in this image, we find its nearest neighbour in the palette $\mathbf{P}_{\mathrm{t}}$, say the $i$-th color in the palette. The new color is then obtained by replacing the pixel ab value by the $i$-th entry of the new palette $\mathbf{P}^\star$, so that there are only $m$ distinct colors in the final images.

The first term in (15) ensures that the colors in $\mathbf{P}$ are similar to the colors in $\mathbf{P}_{\mathrm{s}}$. The second and third terms in (15) permit us to ensure a consistency in the mapping from the old colors $\mathbf{P}_{\mathrm{t}}$ to the new colors $\mathbf{P}$. The second term controls the total cost of moving from the old colors to the new ones. Note that this cost is also usually involved in optimal transport methods for color transfer [ferradans13, frigo14]. The role of the third term is to ensure that two similar palette elements in $\mathbf{P}_{\mathrm{t}}$ map to similar palette elements in $\mathbf{P}$. The matrix $\mathbf{L}$ is the combinatorial Laplacian constructed from a symmetric version $\bar{\mathbf{W}} = (\mathbf{W} + \mathbf{W}^\top)/2$ of the adjacency matrix built from $\mathbf{P}_{\mathrm{t}}$. The third term in (15) classically promotes smoothness on this weighted symmetrised graph, since $\mathrm{tr}(\mathbf{P}^\top \mathbf{L} \mathbf{P}) = \frac{1}{2} \sum_{i,j} \bar{W}_{ij} \left\| \mathbf{p}_i - \mathbf{p}_j \right\|_2^2$, where $\mathbf{p}_i$ denotes the $i$-th row of $\mathbf{P}$: similar elements in the palette $\mathbf{P}_{\mathrm{t}}$ should remain similar in the transformed palette $\mathbf{P}$.

A.2. Optimal transport

Let $\mathbf{C} \in \mathbb{R}^{m \times m}$ be the cost matrix with entries $C_{ij} = \| \mathbf{p}^{\mathrm{t}}_i - \mathbf{p}^{\mathrm{s}}_j \|_2^2$, where $\mathbf{p}^{\mathrm{t}}_i$ and $\mathbf{p}^{\mathrm{s}}_j$ are the $i$-th and $j$-th rows of $\mathbf{P}_{\mathrm{t}}$ and $\mathbf{P}_{\mathrm{s}}$, respectively. This matrix encodes the cost of moving the palette elements of one palette to the elements of the other, and vice versa. The optimal transport problem is about finding the transport from $\mathbf{P}_{\mathrm{s}}$ to $\mathbf{P}_{\mathrm{t}}$ that is the least costly:

(16)  $\min_{\mathbf{T} \in \mathbb{R}^{m \times m}} \; \sum_{i,j} T_{ij} \, C_{ij} \quad \text{subject to} \quad \mathbf{T}\mathbf{1} = \tfrac{1}{m}\mathbf{1}, \;\; \mathbf{T}^\top \mathbf{1} = \tfrac{1}{m}\mathbf{1}, \;\; \mathbf{T} \geq 0,$

where $\mathbf{1} = (1, \dots, 1)^\top$ and the inequality on the right-hand side holds element-wise. This is the optimal transport problem solved in [frigo14] for color transfer, except that the cost matrix also incorporates information about the luminance of each image in their work, which hence yields different results. Let $\mathbf{T}^\star$ be the solution to the above convex problem. We compute the new palette $\tilde{\mathbf{P}}_{\mathrm{t}}$ whose rows read

(17)  $\tilde{\mathbf{p}}^{\mathrm{t}}_i = \frac{\sum_{j=1}^{m} T^\star_{ij} \, \mathbf{p}^{\mathrm{s}}_j}{\sum_{j=1}^{m} T^\star_{ij}},$

for $i = 1, \dots, m$. Remark that the palette $\tilde{\mathbf{P}}_{\mathrm{t}}$ is made of colors similar to those of $\mathbf{P}_{\mathrm{s}}$, as each palette element of $\tilde{\mathbf{P}}_{\mathrm{t}}$ is a convex combination of palette elements of $\mathbf{P}_{\mathrm{s}}$. We finally transform the colors in the target image as follows. For each pixel ab value in the image, we find the nearest neighbour in the palette $\mathbf{P}_{\mathrm{t}}$, say the $i$-th entry of $\mathbf{P}_{\mathrm{t}}$. The new color is then obtained by replacing the pixel ab value by the $i$-th entry of the new palette $\tilde{\mathbf{P}}_{\mathrm{t}}$.
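For illustration, (16) can be solved as a small linear program, e.g., with scipy.optimize.linprog; the vectorisation below is row-major, so that $T_{ij}$ corresponds to `t[i * m + j]`.

```python
import numpy as np
from scipy.optimize import linprog

def transport_plan(C):
    """Solve the optimal transport LP (16) with uniform marginals 1/m."""
    m = C.shape[0]
    A_eq = np.zeros((2 * m, m * m))
    for i in range(m):
        A_eq[i, i * m:(i + 1) * m] = 1.0         # row i of T sums to 1/m
        A_eq[m + i, i::m] = 1.0                  # column i of T sums to 1/m
    b_eq = np.full(2 * m, 1.0 / m)
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(m, m)                   # optimal transport plan T
```

The barycentric projection (17) then reads `P_new = (T / T.sum(axis=1, keepdims=True)) @ P_s` for palettes stored as (m, 2) arrays.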

A.3. Results

We present color transfer results obtained with both methods in Fig. 3. One can notice that our results suffer from fewer artefacts than the ones obtained with optimal transport. One can also refer to the results of [hristova15] on the same images (available at http://people.irisa.fr/Hristina.Hristova/publications/2015_EXPRESSIVE/indexResults.html) for comparison with a method more sophisticated than plain optimal transport. We believe that our results achieve similar visual quality. Let us also mention that the optimal transport results presented here can certainly be improved with extra graph-regularisation terms such as those used in [ferradans13]. Nevertheless, our results show that one can achieve results competitive with the state-of-the-art using a completely different approach based on shallow graph CNNs with random weights.

Beyond the transformed images, it is also interesting to study how each palette is transformed with the different methods. We present these color palettes in Fig. 4. We remark that the palette $\tilde{\mathbf{P}}_{\mathrm{t}}$ obtained with optimal transport is almost identical to the source palette $\mathbf{P}_{\mathrm{s}}$, at the price of several artefacts in the resulting images. On the contrary, we observe more differences between the palette $\mathbf{P}^\star$ obtained with our graph CNN method and $\mathbf{P}_{\mathrm{s}}$: a better preservation of the internal structure of $\mathbf{P}_{\mathrm{t}}$ is obtained in $\mathbf{P}^\star$ thanks to the graph regularisation.

Figure 3. Examples of color transfer results where the photograph on the left is modified to take the color of the photograph on the right. The second image from the left shows the result obtained with our graph CNN method while the second from the right shows the result obtained with optimal transport.
Figure 4. Examples of color transfer results where the photograph on the left is modified to take the colors of the photograph on the right. The second image from the left shows the result obtained with our graph CNN method, while the second from the right shows the result obtained with optimal transport. We present below each image the corresponding color palette (from left to right: $\mathbf{P}_{\mathrm{t}}$, $\mathbf{P}^\star$, $\tilde{\mathbf{P}}_{\mathrm{t}}$, $\mathbf{P}_{\mathrm{s}}$).

Appendix B - Training graph CNNs for denoising

Figure 5. Left: Evolution of the average PSNR on the image validation set during training. The blue curve corresponds to the pre-training of the local network. The green and red curves correspond to the training of the local and non-local networks, respectively, using the pre-trained network for the initialisation. The vertical dashed-dotted black lines indicate the end of an epoch. Right: Evolution of the average SNR on the audio validation set during training. The green and red curves correspond to the training of the local and non-local networks, respectively.

In this appendix, we show that one can train a graph CNN to solve standard signal processing tasks. To demonstrate that the approach can lend itself to different kinds of data, this part is about denoising images and single-channel audio signals.

The ground-truth signal is denoted $\mathbf{x}$; its components take values in a fixed range, both for image pixels and for audio samples. An associated noisy version $\mathbf{y}$ satisfies

(18)  $\mathbf{y} = \mathbf{x} + \mathbf{n},$

where $\mathbf{n}$ is a random vector drawn from the centered Gaussian distribution, with one fixed standard deviation for images and another for audio.

In the following experiments, we compare the denoising performance achieved by trained local and non-local graph CNNs.

B.1. Network structures

For both types of signals, the first and last layers of the denoising CNNs use local convolutions (Section 2.2.2). We denote these layers by $f_{\mathrm{a}}$ and $f_{\mathrm{s}}$, where only $f_{\mathrm{a}}$ incorporates a non-linearity, while $f_{\mathrm{s}}$ is fully linear (without bias). For ease of notation, we slightly abuse the definition of the non-linearity in (7) to accommodate the soft-thresholding function

(19)  $h_{\tau}(t) = \mathrm{sign}(t) \, \max\left( |t| - \tau, \, 0 \right),$

where the thresholds $\tau$, like the biases, are parameters learned during training. The first layer $f_{\mathrm{a}}$ thus takes as input a noisy signal $\mathbf{y}$, while $f_{\mathrm{s}}$ returns its denoised version.
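In NumPy, the soft-thresholding non-linearity (19) is simply:

```python
import numpy as np

def soft_threshold(x, tau):
    """Eq. (19): shrink entries of x towards zero by the learned threshold tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)
```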

The graph for the non-local convolutions is built on the noisy input signal, and thus changes with each new signal to denoise. It is constructed based on a nearest-neighbour search as follows. From the noisy signal, we assemble a feature vector $\mathbf{f}_i$ by extracting all the components (i.e., pixels or samples) in a predefined local neighbourhood centered around the $i$-th entry of $\mathbf{y}$. We then search the $s$ nearest neighbours of each $\mathbf{f}_i$ in the set $\{\mathbf{f}_1, \dots, \mathbf{f}_n\}$ using the Euclidean distance. The weight of the adjacency matrix between connected elements $i$ and $j$ satisfies

(20)  $W_{ij} = \exp\!\left( - \, \frac{\max\left( \left\| \mathbf{f}_i - \mathbf{f}_j \right\|_2^2 - 2\sigma^2, \, 0 \right)}{\rho^2} \right),$

where $\sigma$ and $\rho$ are parameters learned during training. Note that (20) is the weighting function used in NL-means [buades11].
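Our reading of (20) as the NL-means weighting can be sketched as follows; since the exact parametrisation is reconstructed from [buades11], treat it as an assumption.

```python
import numpy as np

def nlmeans_weight(fi, fj, sigma, rho):
    """Edge weight of eq. (20) between two feature vectors (our reconstruction)."""
    d2 = float(np.sum((fi - fj) ** 2))
    return np.exp(-max(d2 - 2.0 * sigma**2, 0.0) / rho**2)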

We compare the denoising performance reached by a CNN using only local convolutions, i.e., a regular CNN, and a graph CNN where we also use non-local convolutions.

Image denoising – The regular CNN uses only $f_{\mathrm{a}}$ and $f_{\mathrm{s}}$ and therefore implements the function $f_{\mathrm{s}} \circ f_{\mathrm{a}}$; the local filters have a small square support. The non-local network is built by inserting one non-local layer, which we denote by $f_{\mathrm{nl}}$, between $f_{\mathrm{a}}$ and $f_{\mathrm{s}}$. This network implements the function $f_{\mathrm{s}} \circ f_{\mathrm{nl}} \circ f_{\mathrm{a}}$. The neighbourhood used for building the graph through the features $\mathbf{f}_i$ is a small square window around each pixel. The non-local layer is linear (no bias or soft-thresholding) and we interpret its role as an additional non-local denoising of the sparse feature maps given by $f_{\mathrm{a}}$ before reconstruction of the denoised image by $f_{\mathrm{s}}$. This layer implements a 2D graph convolution with a single filter applied to each feature map. We remark that, while one may expect an equivalent local CNN (implementing $f_{\mathrm{s}} \circ f_{\mathrm{loc}} \circ f_{\mathrm{a}}$, with $f_{\mathrm{loc}}$ a fully linear 2D local convolution layer of the same size as $f_{\mathrm{nl}}$) to perform better than $f_{\mathrm{s}} \circ f_{\mathrm{a}}$, our experiments showed that both yield equivalent results.

Audio denoising – The audio denoising CNNs (local and non-local) share the same architecture, $f_{\mathrm{s}} \circ f_{\mathrm{mid}} \circ f_{\mathrm{a}}$. The middle layer $f_{\mathrm{mid}}$ is linear (no bias or soft-thresholding) and implements either local or non-local convolutions as defined in (7) (with a slight abuse, we keep the same notation here as in the image denoising case for simplicity). Therefore, the number of filter coefficients to train is equal in both the local and non-local cases, and we choose the same filter size and number of feature maps for both networks. The non-local graph is generated as described before, with the local neighbourhood defined as a short signal segment around each sample.

B.2. Experimental setup

Figure 6. Noisy image (left, dB) denoised with the trained local network (middle left, dB) or the trained non-local network (middle right, dB) and the original image (right).

In all layers, one filter is initialised to a constant (averaging) filter, serving to extract and approximately reconstruct the mean of each patch. The remaining filters are initialised using random draws from a centered Gaussian distribution with small standard deviation. We noticed that this initialisation accelerates the learning. The thresholds and biases (one of each per feature map) are initialised at zero.

Training is done using ADAM [kingma14], with a fixed batch size, by minimising the Euclidean distance between the denoised and ground-truth signals. The datasets are divided into three subsets: training, validation and test sets. The validation set is used to monitor the evolution of the PSNR (images) or SNR (audio) during training. The training and test examples are always generated by corrupting the original signals, as defined in (18), with independent draws of the noise $\mathbf{n}$.
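A training step of this kind can be sketched with PyTorch, assuming `model` is one of the denoising networks above; the learning rate is a placeholder.

```python
import torch

def train_step(model, optimizer, noisy, clean):
    """One ADAM step on the squared Euclidean distance to the ground truth."""
    optimizer.zero_grad()
    loss = torch.sum((model(noisy) - clean) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # illustrative lr
```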

Image denoising – We train the two image denoising CNNs using the holidays dataset [jegou08]. The test set is built by choosing images at random among all available images; the validation set is built by choosing instances among the remaining ones; the rest form the training set. All images are cropped to become square and resized to a common size. First, we pretrain the local network for one epoch. Second, we build the non-local network using the pre-trained layers $f_{\mathrm{a}}$ and $f_{\mathrm{s}}$ and train this whole network. Third, starting again from the pre-trained layers $f_{\mathrm{a}}$ and $f_{\mathrm{s}}$, we continue training the local network. Note that training the non-local network is computationally more expensive than training the local network, as a new non-local graph must be constructed for each new noisy image; we thus trained the non-local network for fewer epochs than the local one. For the non-local layer, the default values proposed in [buades11] are used to initialise the parameters $\sigma$ and $\rho$.

Audio denoising – For audio, we use the TIMIT speech dataset [garofolo1993darpa], with training and test tracks of varying duration, sampled at 16 kHz. The validation set is built by separating a few tracks, chosen at random, from the training set. The audio files are first trimmed to equal length by extracting a fixed-duration segment from each original track. The two parameters in (20) are initialised to fixed values, and the training is performed on the whole network at once (for both local and non-local CNNs). We run the training procedure for one epoch.

B.3. Results

Figure 7. First and third panels: feature maps obtained at the analysis layer of both trained networks computed from the same image (the non-linearity was not applied). Second and fourth panels: feature maps obtained by applying the adjoint of the filters at the synthesis layer of both trained networks.

Figure 5 presents the evolution of the mean PSNR (for images) and of the mean SNR (for audio) on the validation sets during training. One can already notice at this stage that the non-local networks outperform the local ones.

Image denoising – The average PSNR of the image test set is dB before denoising. After denoising, we reach a PSNR of dB and dB with the local and non-local networks, respectively. The non-local network thus improves the denoising performance by dB. For comparison, we also denoise the test set by soft-thresholding in the Haar wavelet basis. We first compute the threshold that maximises the PSNR on the validation set and then use it to denoise the test set. We reach an average PSNR of  dB only on the image test set with this method. We present in Fig. 6 an image before and after denoising with both networks. We notice that the non-local network allows a better recovery of the homogeneous regions in the image.
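The wavelet baseline can be sketched with PyWavelets: decompose, soft-threshold the detail coefficients with the validated threshold, and reconstruct.

```python
import pywt

def haar_denoise(img, tau):
    """Soft-threshold the Haar detail coefficients of a 2D image with threshold tau."""
    coeffs = pywt.wavedec2(img, "haar")
    coeffs = [coeffs[0]] + [
        tuple(pywt.threshold(c, tau, mode="soft") for c in level)
        for level in coeffs[1:]
    ]
    return pywt.waverec2(coeffs, "haar")
```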

By analogy with dictionary learning and the terminology used in this field, we hereafter call the first layer $f_{\mathrm{a}}$ of the CNNs used for denoising the "analysis layer", and the last layer $f_{\mathrm{s}}$ the "synthesis layer". The analysis layer filters the input image with a bank of filters and soft-thresholds the result, yielding sparse feature maps. The synthesis layer resynthesises an image from the feature maps using another bank of filters. We present in Fig. 7 some feature maps obtained at the analysis layer – without application of the non-linearity – for both trained networks. We remark that the learned filters act like wavelet filters in both cases. This is a well-known phenomenon observed in dictionary learning and at the first layer of deep CNNs; our results are thus in line with these observations. In addition, we remark that the introduction of the non-local convolutions did not much affect the type of filters learned at the first layer. We also present the feature maps obtained by applying the adjoints of the filters at the synthesis layer of both trained networks. Note that in dictionary learning, the analysis filters are exactly the adjoints of the synthesis filters. Similarly, here, we notice that the (adjoints of the) synthesis filters act like wavelet filters, just as the analysis filters do. We also remark that the synthesis filters are quite similar in the absence and presence of the non-local filtering.

Audio denoising – The average SNR of the audio test set is dB before denoising. Denoising by the local CNN increases the average SNR to dB, whereas denoising with the non-local CNN increases the SNR to dB, i.e., a dB improvement compared to the former. As for images, we also denoise the test set by soft-thresholding in the Haar wavelet basis with a threshold optimised on the validation set. We reach an average SNR of dB only on the test set with this method.