
Deep Adaptive Wavelet Network

Even though convolutional neural networks have become the method of choice in many fields of computer vision, they still lack interpretability and are usually designed manually in a cumbersome trial-and-error process. This paper aims at overcoming those limitations by proposing a deep neural network which is designed in a systematic fashion and is interpretable, by integrating multiresolution analysis at the core of the deep neural network design. By using the lifting scheme, it is possible to generate a wavelet representation and design a network capable of learning wavelet coefficients in an end-to-end fashion. Compared to state-of-the-art architectures, the proposed model requires less hyper-parameter tuning and achieves competitive accuracy in image classification tasks.


1 Introduction

Convolutional neural networks (CNNs) have become the dominant machine learning approach for image recognition. Numerous deep learning architectures have been developed since AlexNet [16] greatly outperformed other models on the ImageNet Challenge [7] in 2012. Trained with backpropagation, CNNs can leverage the correlation and structure inside datasets by directly tuning the network's trainable parameters for a given task.

The trend in CNNs is to increase the number of layers in order to model more complicated mathematical functions, to the point that recent architectures surpass 100 layers [13, 14]. There is, however, no guarantee that increasing the number of layers is always advantageous. Zagoruyko and Komodakis [30] indeed showed that decreasing the number of layers and increasing the width of each layer leads to better performance than the commonly used thin and very deep counterparts, while reducing training time. Their results also support our general observation that current CNNs are not necessarily designed systematically, but usually through a manual trial-and-error process [9].

A limitation of such networks is the lack of interpretability, which is usually referred to as the Achilles heel of CNNs. Convolutional neural networks are frequently treated as black-box function approximators which map a given input to a classification output [8]. As deep learning becomes more ubiquitous in domains where transparency and reliability are priorities, such as healthcare, autonomous driving and finance, the need for interpretability becomes imperative [3]. Interpretability enables users to understand the strengths and weaknesses of a model and conveys an understanding of how to diagnose and correct potential problems [8]. Interpretable models are also considered less susceptible to adversarial attacks [23].

Theoretical properties of traditional signal processing approaches, such as multiresolution analysis using wavelets, are well studied, which makes such approaches more interpretable than CNNs. There are in fact several prior works that incorporate wavelet representations into CNNs. Oyallon et al. [22] proposed a hybrid network which replaces the first layers of ResNet with a wavelet scattering network; this modified ResNet performed comparably to the original while using fewer trainable parameters. Williams and Li [27] took the wavelet sub-bands of the input images as a new input and processed them with CNNs. In a different work [28], they presented a wavelet pooling algorithm which uses a second-level wavelet decomposition to subsample features. Lu et al. [19] addressed the organ tissue segmentation problem by using a dual-tree wavelet transform on top of a CNN. Cotter and Kingsbury [5] also used a dual-tree wavelet transform to learn filters by taking activation layers into the wavelet space.

Recently, Fujieda et al. [10] proposed wavelet CNNs (WCNNs), which were built upon the resemblance between multiresolution analysis and the convolutional filtering and pooling operations in CNNs. They proposed a CNN similar to DenseNet, but with Haar wavelets (commonly used in multiresolution analysis) serving as convolution and pooling layers. These wavelet layers were concatenated with the feature maps produced by the succeeding convolutional blocks. This model is more interpretable than ordinary CNNs, since the wavelet layers generate the wavelet transform of the input. The use of a fixed wavelet (Haar), however, is likely suboptimal, as it restricts adaptability and cannot leverage data-driven learning.

Inspired by WCNNs, we propose to perform multiresolution analysis within the network architecture by using the lifting scheme [25] to carry out a data-driven wavelet transform. The lifting scheme offers many advantages over first-generation wavelets, such as adaptivity, data-drivenness, non-linearity, faster and easier implementation, fully in-place calculation, and reversible integer-to-integer transforms [31].

Unlike previous works that combine CNNs and wavelets, our model learns all the filters from data in an end-to-end framework. Due to the connection with multiresolution analysis, the number of layers in our network is determined mathematically. The combination of end-to-end training and multiresolution analysis via the lifting scheme allows us to efficiently capture the essential information from the input for image classification tasks such as texture and object recognition. The multiresolution analysis generates a relevant visual representation at each decomposition level, which contributes to the interpretability of the network.

The proposed network was evaluated on three competitive benchmarks for texture and object classification, namely KTH-TIPS2-b, CIFAR-10, and CIFAR-100. The proposed model, trained end-to-end from scratch, attains results comparable to the state of the art on texture classification with a fraction of the number of trainable parameters. Moreover, it generalizes better than networks specifically tailored for texture recognition, as it also performs well on object classification. To our knowledge, this work is the first to propose trainable wavelet filters in the context of CNNs. In summary, we propose a deep neural network for image classification which exhibits the following properties:

The network is interpretable, since the approximation and detail coefficients generated by the multiresolution analysis using the lifting scheme at each decomposition level have a relevant visual representation.

The network extracts features using a multiresolution analysis approach and captures the essential information for the classification task, reducing the number of trainable parameters required for texture classification. The loss function used to train the network ensures that the captured information is relevant to the classification task.

The architecture offers competitive accuracy in texture and object classification tasks.

2 Background

This section briefly describes multiresolution analysis and the lifting scheme, which are the building blocks of our model.

2.1 CNNs as Multiresolution Analysis

Convolutional neural networks, proposed by LeCun in 1989 [17], contain filtering and downsampling steps. To better understand CNNs, we propose to interpret their convolution and pooling operations as operations in multiresolution analysis [20]. In the following, only one-dimensional input signals are considered for simplicity, but the analysis extends easily to higher-dimensional signals.

Given an input vector $x$ and a weighting function $w$, referred to as the kernel, the convolution layer output (or feature map) can be defined as

$$y[n] = \sum_{k \in \mathcal{K}} x[n+k]\, w[k], \qquad (1)$$

where $\mathcal{K}$ is the set of kernel indices.

The role of the pooling layers is to output a summary statistic of the input [12]. They are normally used to reduce complexity and simplify information. In signal processing terms, the most common pooling layers consist of a convolution followed by downsampling. Using the standard downsampling symbol $\downarrow$, the output vector of a pooling layer can be written as

$$y = (x * p) \downarrow 2, \qquad (2)$$

where $p$ is the pooling filter.

We can now interpret convolution and pooling layers as operations in multiresolution analysis. In this analysis, the resolution of a signal (a measure of the amount of detail in the signal) is changed by a filtering operation, and its scale is changed by a downsampling operation [21]. The wavelet transform, for example, repeatedly decomposes a signal into spectrum sub-bands by applying low-pass and high-pass filters followed by downsampling by a factor of 2.

Then, to perform a multiresolution analysis, a new decomposition is computed by taking the low-pass filtered sub-band as input. Each of these decompositions is referred to as a level, and each one splits the signal into approximation and detail coefficients. Let $l_j$ and $h_j$ denote the low-pass and high-pass filters at step $j$, respectively. The transformation is thus represented as a sequence of convolution and pooling operations,

$$a_j = (a_{j-1} * l_j) \downarrow 2, \qquad d_j = (a_{j-1} * h_j) \downarrow 2, \qquad (3)$$

where $a_j$ and $d_j$ denote the approximation and detail coefficients generated at step $j$, respectively, and $a_0 = x$. Based on this level-by-level construction, it is possible to compare CNN structures with multiresolution analysis: Eqns. 2 and 3 are quite similar, with the difference that in CNNs the filters are initialized randomly and learned without structural constraints, so their outputs do not have a meaningful interpretation.
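To make the correspondence concrete, the following minimal PyTorch sketch implements one decomposition level of Eqn. 3 as CNN operations. The fixed Haar filters are an illustrative choice of ours; the point of the comparison is that a CNN learns such filters instead.

```python
import torch
import torch.nn.functional as F

# One level of a 1D wavelet decomposition (Eqn. 3) written as
# convolution + stride-2 downsampling, i.e., a CNN pooling layer.
def wavelet_level(x):
    """x: tensor of shape (batch, 1, length), length even."""
    low = torch.tensor([[[0.5, 0.5]]])    # low-pass filter l_j (Haar)
    high = torch.tensor([[[0.5, -0.5]]])  # high-pass filter h_j (Haar)
    a = F.conv1d(x, low, stride=2)        # approximation a_j
    d = F.conv1d(x, high, stride=2)       # detail d_j
    return a, d

x = torch.arange(8, dtype=torch.float32).reshape(1, 1, 8)
a1, d1 = wavelet_level(x)   # level 1
a2, d2 = wavelet_level(a1)  # level 2: recurse on the approximation
```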

Figure 1: (a) The 2D Adaptive Lifting Scheme consists of successively applying horizontal and vertical lifting steps, each with its own predictor and updater. (b) The predictors and updaters are based on operations such as padding, convolution, and non-linear activation functions, which can be either trainable (red boxes) or fixed (green boxes).

2.2 Lifting Scheme

The first-generation wavelets are mathematical functions that allow for efficient representations of data using only a small set of coefficients, by exploiting correlation in space and frequency [21]. The main idea behind the wavelet transform is to build a sparse approximation of natural signals through the correlation structure present in them. This correlation is normally local in space and frequency, meaning that neighboring samples of a signal are more strongly correlated than distant ones. The construction of mother wavelets is traditionally performed using the Fourier transform; however, they can also be constructed in the spatial domain [6].

The lifting scheme, also known as second-generation wavelets [25], is a simple and powerful approach to designing wavelets that have the same properties as first-generation wavelets [6]. The lifting scheme takes a signal as input and generates as outputs the approximation and detail sub-bands of the wavelet transform. Designing such a lifting scheme consists of three stages [4], as follows.

Splitting the signal.

This step splits the input signal into two non-overlapping partitions. The simplest possible partition is chosen: the input signal $x$ is divided into even and odd components, denoted $x_e$ and $x_o$, respectively, and defined as $x_e[n] = x[2n]$ and $x_o[n] = x[2n+1]$.

Updater.

This stage takes care of the separation in the frequency domain, ensuring that the approximation has the same running average as the input signal [6]. To achieve this, the approximation should be a function of the even part of the signal plus an update operator $U$.

Let $x_o$ denote the sequence of neighboring odd polyphase samples of $x$. The even polyphase samples are updated with $U(x_o)$, and the result forms the approximation $a$, as described in Eqn. 4, where $U$ is the update operator.

$$a = x_e + U(x_o) \qquad (4)$$
Predictor.

The two partitions of the signal are typically closely correlated. Thus, given one of them, it is possible to build a good predictor for the other by tracking the difference (or details) between them [6]. As the even part of the signal corresponds to the approximation (Eqn. 4), it is possible to define $P$ as a function of $a$.

Let $a$ denote the sequence of approximation coefficients. In the prediction step, the odd polyphase samples are predicted from $a$. The resulting prediction residuals, or high sub-band coefficients $d$, are computed by Eqn. 5, where $P$ is the prediction operator.

$$d = x_o - P(a) \qquad (5)$$
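The three stages compose into a single lifting step. The sketch below is a minimal example with hand-picked linear operators (the averaging $U$ and identity $P$ are illustrative choices of ours, not taken from the paper) showing the split/update/predict flow of Eqns. 4 and 5:

```python
import torch

def lifting_step(x, U, P):
    """One 1D lifting step. x: tensor of shape (batch, length), length even."""
    x_e, x_o = x[:, 0::2], x[:, 1::2]  # split into even/odd samples
    a = x_e + U(x_o)                   # update: approximation (Eqn. 4)
    d = x_o - P(a)                     # predict: details (Eqn. 5)
    return a, d

U = lambda x_o: 0.5 * x_o              # illustrative fixed operators;
P = lambda a: a                        # together they give a Haar-like transform

x = torch.tensor([[4., 6., 2., 8.]])
a, d = lifting_step(x, U, P)           # a = [[7., 6.]], d = [[-1., 2.]]
```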

2.2.1 Lifting Scheme Via Neural Networks

Yi et al. [29] proposed to replace the updater and the predictor with non-linear functions represented by neural networks, so that they adapt to the input signals. To train them, the authors proposed the following loss functions:

$$\mathcal{L}_P = \| d \|_2^2, \qquad \mathcal{L}_U = \| a - x_e \|_2^2, \qquad (6)$$

where $\mathcal{L}_P$ and $\mathcal{L}_U$ are the loss functions for the predictor and updater, respectively. The loss for the predictor network promotes the minimization of the magnitude of the detail coefficients (Eqn. 5). Yi et al. [29] argued that $x_e$ is close to $x$ by definition, so it suffices for the loss function of the updater network to minimize the distance between $a$ and $x_e$. Note that in Yi et al. [29], the predictor and the updater were trained sequentially.
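A sketch of this setup, with small convolutional networks standing in for $U$ and $P$ (the layer shapes below are our assumption, not the configuration of Yi et al.), shows how the two losses of Eqn. 6 are computed:

```python
import torch

# Trainable update/predict operators acting on 1-channel sequences.
U = torch.nn.Sequential(torch.nn.Conv1d(1, 8, 3, padding=1),
                        torch.nn.ReLU(),
                        torch.nn.Conv1d(8, 1, 1))
P = torch.nn.Sequential(torch.nn.Conv1d(1, 8, 3, padding=1),
                        torch.nn.ReLU(),
                        torch.nn.Conv1d(8, 1, 1))

x = torch.randn(4, 1, 64)              # (batch, channels, length)
x_e, x_o = x[..., 0::2], x[..., 1::2]  # split
a = x_e + U(x_o)                       # Eqn. 4
d = x_o - P(a)                         # Eqn. 5

loss_P = (d ** 2).mean()               # minimize detail magnitude
loss_U = ((a - x_e) ** 2).mean()       # keep approximation close to x_e
```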

3 Deep Adaptive Wavelet Network (DAWN)

We propose a new network architecture, the Deep Adaptive Wavelet Network (DAWN), which uses the lifting scheme to capture essential information from the input data for image classification. The adaptive lifting scheme presented by Yi et al. [29] showed that neural networks trained through backpropagation can be used to implement the lifting scheme for one-dimensional (1D) signals. The DAWN architecture extends this idea to a classification task and integrates multiresolution analysis into neural networks. The proposed model performs multiresolution analysis at the core of the classification network by training the parameters of two-dimensional (2D) lifting schemes in an end-to-end fashion. None of the previous wavelet-based CNN approaches perform this end-to-end training while learning the wavelet parameters.

3.1 2D Adaptive Lifting Scheme

We first explain the proposed 2D Adaptive Lifting Scheme, and then present the integration of the 2D lifting scheme into the proposed classification architecture.

The 2D Adaptive Lifting Scheme consists of a horizontal lifting step followed by two independent vertical lifting steps that generate the four sub-bands of the wavelet transform. These sub-bands are denoted LL, LH, HL, and HH, where L and H represent low- and high-frequency information, respectively, and the first and second positions refer to the horizontal and vertical directions, respectively. Note that the 2D lifting scheme, illustrated in Figure 1(a), performs spatial pooling, as the spatial size of the outputs is reduced by half with respect to the input.

The adaptive horizontal lifting step performs horizontal analysis by splitting the 2D signal into two non-overlapping partitions, chosen as the even ($x_e$) and odd ($x_o$) horizontal components; a horizontal updater ($U$) and a horizontal predictor ($P$) are then applied in the same way as described in Section 2.2. The vertical lifting step has a similar structure, but the splitting is performed on the vertical components of the 2D signal, followed by the processing performed by the vertical updater and predictor operators.
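A structural sketch of this composition follows; the operator internals are left abstract, and the helper names are ours:

```python
# Figure 1(a) as code: a horizontal lifting step followed by two
# independent vertical steps, producing the four wavelet sub-bands.
def lift_horizontal(x, U, P):
    x_e, x_o = x[..., :, 0::2], x[..., :, 1::2]  # even/odd columns
    a = x_e + U(x_o)
    d = x_o - P(a)
    return a, d                                   # horizontal L and H

def lift_vertical(x, U, P):
    x_e, x_o = x[..., 0::2, :], x[..., 1::2, :]  # even/odd rows
    a = x_e + U(x_o)
    d = x_o - P(a)
    return a, d

def lifting_2d(x, ops):
    """x: (batch, channels, H, W); ops: the six update/predict modules."""
    L, H = lift_horizontal(x, ops["Uh"], ops["Ph"])
    LL, LH = lift_vertical(L, ops["Uv1"], ops["Pv1"])
    HL, HH = lift_vertical(H, ops["Uv2"], ops["Pv2"])
    return LL, LH, HL, HH   # each of spatial size (H/2, W/2)
```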

Predictor and Updater.

The internal structure of the updater and the predictor is the same for both the vertical and horizontal directions. Figure 1(b) shows the structure of the horizontal predictor (or horizontal updater). First, reflection padding is applied instead of zero padding, to prevent harmful border effects caused by the convolution operation. Then a 2D convolutional layer is applied, whose kernel size depends on the direction of analysis ($1 \times k$ if horizontal, $k \times 1$ if vertical). The output depth of this first convolutional layer is set to twice the number of channels of the input. A second convolutional layer with kernels of size $1 \times 1$ follows, whose output depth is set to the initial input depth of the predictor/updater. The stride of all the convolutions is set to 1. The first convolutional layer is followed by a ReLU activation function, benefiting from its sparsity and reduced likelihood of vanishing gradients. The last convolutional layer is followed by a tanh activation function, as we do not want to discard negative values at this stage.
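A PyTorch sketch of this block, under the kernel-orientation assumption above and with $k = 3$ (the default of Section 4.3):

```python
import torch.nn as nn

class PredictorUpdater(nn.Module):
    """Predictor/updater of Figure 1(b): ReflectionPad -> Conv -> ReLU
    -> 1x1 Conv -> Tanh. `horizontal` selects the kernel orientation."""
    def __init__(self, channels, k=3, horizontal=True):
        super().__init__()
        kernel = (1, k) if horizontal else (k, 1)
        # ReflectionPad2d takes (left, right, top, bottom)
        pad = (k // 2, k // 2, 0, 0) if horizontal else (0, 0, k // 2, k // 2)
        self.net = nn.Sequential(
            nn.ReflectionPad2d(pad),                              # avoid border effects
            nn.Conv2d(channels, 2 * channels, kernel, stride=1),  # expand depth 2x
            nn.ReLU(),                                            # sparse activations
            nn.Conv2d(2 * channels, channels, 1, stride=1),       # back to input depth
            nn.Tanh(),                                            # keep negative values
        )

    def forward(self, x):
        return self.net(x)
```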

Design Choices.

We arbitrarily chose to perform the horizontal analysis before the vertical analysis; we observed no performance variation when computing the vertical analysis first. The number of convolutional layers and the kernel size used in the predictor/updater are discussed in the hyperparameter study (Section 4.3). The main concern when choosing the depth was to maintain a relevant visual representation of the approximation and detail sub-bands while not considerably increasing the number of network parameters.

3.2 DAWN Architecture

The DAWN architecture is based on stacking multiple 2D Adaptive Lifting Schemes to perform multiresolution analysis (see Figure 2). The architecture starts with two convolutional layers, followed by a multiresolution analysis of $M$ levels. Each level consists of a 2D Adaptive Lifting Scheme, which outputs the four wavelet transform sub-bands LL, LH, HL, and HH; its input is the low-frequency sub-band (LL) of the previous level. The detail sub-bands of each level (LH, HL, HH) are concatenated and passed through a global average pooling layer [18], used to reduce overfitting and perform dimensionality reduction. Finally, the pooled detail outputs of every level, together with the pooled approximation of the last level, are concatenated before a final fully-connected layer and a log-softmax that perform the classification.
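A condensed sketch of this assembly, reusing the lifting_2d and PredictorUpdater sketches from Section 3.1 (all class and variable names are ours, not the authors'):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiftingScheme2D(nn.Module):
    """Wire six trainable PredictorUpdater operators into lifting_2d."""
    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleDict({
            "Uh": PredictorUpdater(ch, horizontal=True),
            "Ph": PredictorUpdater(ch, horizontal=True),
            "Uv1": PredictorUpdater(ch, horizontal=False),
            "Pv1": PredictorUpdater(ch, horizontal=False),
            "Uv2": PredictorUpdater(ch, horizontal=False),
            "Pv2": PredictorUpdater(ch, horizontal=False),
        })

    def forward(self, x):
        return lifting_2d(x, self.ops)

class DAWN(nn.Module):
    def __init__(self, in_ch=3, depth=64, levels=3, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(   # two initial Conv-BN-ReLU blocks
            nn.Conv2d(in_ch, depth, 3, padding=1), nn.BatchNorm2d(depth), nn.ReLU(),
            nn.Conv2d(depth, depth, 3, padding=1), nn.BatchNorm2d(depth), nn.ReLU())
        self.levels = nn.ModuleList(LiftingScheme2D(depth) for _ in range(levels))
        # 3 pooled detail maps per level, plus the final approximation
        self.fc = nn.Linear(depth * (3 * levels + 1), num_classes)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for level in self.levels:
            x, lh, hl, hh = level(x)             # x becomes the LL sub-band
            for sb in (lh, hl, hh):              # pool this level's details
                feats.append(F.adaptive_avg_pool2d(sb, 1).flatten(1))
        feats.append(F.adaptive_avg_pool2d(x, 1).flatten(1))  # final approximation
        return F.log_softmax(self.fc(torch.cat(feats, 1)), dim=1)
```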

Number of levels.

The minimum spatial size of the feature maps at the end of the network is set to $4 \times 4$, as this is the minimum size that still maintains the 2D signal structure. Assuming square input images, the number of levels $M$ is given by $M = \log_2(N) - 2$, where $N \times N$ is the input image size. For example, for input images of size $32 \times 32$, $N = 32$ and $M = 3$. Note that this number of layers follows automatically, since the network is based on multiresolution analysis. The effect of choosing a number of levels different from $M$ is analyzed in the hyperparameter study (Section 4.3).

Initial convolutional layers.

As in every classification task, the proposed approach needs a discriminative representation of the data before classification takes place. To obtain a discriminative feature set before the first downsampling of the signal, the architecture starts by extracting descriptors with two sequences of Conv-BN-ReLU (where Conv and BN stand for Convolution and Batch Normalization, respectively) with $3 \times 3$ kernels and the same depth. The depth of these initial convolutional layers is one of the few hyper-parameters of DAWN. By fixing this depth and determining the number of decomposition levels, one automatically obtains the depth of the feature maps of the last 2D lifting scheme for a given input image size.

Figure 2: The proposed architecture is composed of three modules: i) initial convolutional layers that increase the input depth; ii) $M$ levels of multiresolution analysis, where the 2D lifting scheme is applied to the approximation output of the previous level; and iii) a concatenation of the details from the different levels and the final approximation, followed by global average pooling and a dense layer. The operations in the architecture are either trainable (red boxes) or fixed (green boxes).

Loss function and constraints.

End-to-end training is performed using the cross-entropy loss function, in combination with regularization terms that enforce a wavelet decomposition structure during training. The loss function takes the form of Eqn. 7, where $C$ denotes the number of classes, and $y_i$ and $\hat{y}_i$ are the binary ground truth and the predicted probability of belonging to class $i$, respectively. The regularization parameters $\lambda_1$ and $\lambda_2$ tune the strength of the regularization terms. Also, $\mu_{x_j}$ and $\mu_{a_j}$ denote the mean of the input signal to the lifting scheme at level $j$ and the mean of the approximation sub-band at level $j$, respectively, and $d_j$ is the concatenation of the vectorized detail sub-bands at level $j$.

$$\mathcal{L} = -\sum_{i=1}^{C} y_i \log \hat{y}_i \;+\; \lambda_1 \sum_{j=1}^{M} H(d_j) \;+\; \lambda_2 \sum_{j=1}^{M} \big\| \mu_{x_j} - \mu_{a_j} \big\|_2^2 \qquad (7)$$

To promote low-magnitude detail coefficients [11], the first regularization term in Eqn. 7 minimizes the sum of the Huber norm $H(\cdot)$ of $d_j$ across all decomposition levels; the choice of the Huber norm over the $\ell_1$ norm is motivated by training stability. The second regularization term minimizes the squared difference between $\mu_{x_j}$ and $\mu_{a_j}$ across all decomposition levels, in order to preserve the mean of the input signal and form a proper wavelet decomposition [11].
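A sketch of this objective, assuming the per-level tensors are collected during the forward pass (function and argument names are ours; the $\lambda$ defaults are placeholders, as the published settings were lost from this copy):

```python
import torch
import torch.nn.functional as F

def dawn_loss(log_probs, target, details, inputs_, approxs,
              lam1=0.1, lam2=0.1):
    """details: concatenated detail sub-bands per level; inputs_/approxs:
    lifting-scheme input and LL output per level."""
    loss = F.nll_loss(log_probs, target)               # cross-entropy term
    for d in details:                                  # Huber norm of details
        loss = loss + lam1 * F.huber_loss(d, torch.zeros_like(d))
    for x, a in zip(inputs_, approxs):                 # mean preservation
        loss = loss + lam2 * (x.mean() - a.mean()) ** 2
    return loss
```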

4 Experiments and Results

The DAWN model was evaluated on one texture dataset, KTH-TIPS2-b, and on two benchmark datasets for object recognition, namely CIFAR-10 and CIFAR-100. The obtained results are compared against different models commonly used for classification: ResNet [13]; DenseNet [14] with a growth rate of 12; and a variant of VGG [24] that adds batch normalization, global average pooling, and dropout. The proposed architecture is also compared with previous networks that use a multiresolution analysis component: the wavelet CNN (WCNN) [10] and the Scattering network [22]. For the latter, we show the results of both the handcrafted representation and the hybrid network that combines a scattering transform with a Wide ResNet. For KTH-TIPS2-b, results of T-CNN [26] are shown, as this architecture is specifically tailored to texture analysis. The training was done on multiple NVIDIA V100 GPUs with 12 GB of memory.

4.1 Implementation

An SGD optimizer with momentum is used for training, with the same initial learning rate for all databases. The batch size, the number of epochs, and the epochs at which the learning rate is decayed are set separately for the CIFAR databases and for KTH-TIPS2-b. The regularization parameters $\lambda_1$ and $\lambda_2$ are set to the same value in all experiments. For the Scattering networks [22] on the CIFAR databases, the original training setup was used, as it achieves higher accuracy than the configuration used in this paper for the other architectures.
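The corresponding training setup can be expressed as follows; the concrete numbers (learning rate, milestones) were lost from this copy of the paper, so the values below are illustrative placeholders only:

```python
import torch

model = DAWN(depth=64, levels=3, num_classes=100)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1, momentum=0.9)   # placeholder values
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120], gamma=0.1)     # step decay at fixed epochs
```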

4.2 Databases and Results

KTH-TIPS2-b

The KTH-TIPS texture database was developed by the Computational Vision and Active Perception Laboratory (CVAP) at the KTH Royal Institute of Technology in Stockholm [2]. There are three versions of this dataset: KTH-TIPS, KTH-TIPS2-a, and KTH-TIPS2-b. In this study, we work with the third version, since it is the most widely used benchmark in texture analysis. It contains 11 classes with four folders per class, called samples; each sample has 108 images. As in other works [10], one sample of each class was used for training and the remaining three were used for testing. The data augmentation consists of random cropping and mirroring. Table 1 contains the average and standard deviation of the accuracy across different training sessions.

On this database, WCNN [10] with 4 levels achieves better accuracy than T-CNN with a smaller number of trainable parameters. The proposed architecture, with a depth of 16 for the initial convolutional layers, achieves the same accuracy as WCNN but with a much smaller number of parameters. Note that the initial convolutional layers are essential for extracting meaningful feature representations: without them, the performance of the model drops significantly.

The Scattering network with the handcrafted representation (Scatter+FC) consists of a scattering transform of fixed spatial scale, followed by global average pooling and a fully connected layer. This configuration is very similar to the proposed network structure used for this database (Figure 2). It achieves performance similar to the proposed approach with slightly fewer trainable parameters, as its wavelets are not trainable. This result indicates that our architecture is able to learn representations similar to the scattering transform.

The proposed architecture performs better than DenseNet 13 and DenseNet 22 BC with a similar number of parameters. Note that for DenseNet the number indicates the total number of layers in the network, and BC denotes the use of the bottleneck-compression approach [14]. The Scattering network in its hybrid configuration (Scatter+WRN) has significantly more trainable parameters than the handcrafted representation. This hybrid configuration performs poorly as it overfits the dataset, and it depends strongly on the underlying CNN architecture and hyperparameter setup.

Architecture       # param.     Avg.      Std.
T-CNN              19'938'059   63.80 %   1.68
WCNN L4            10'211'811   68.83 %   0.73
Scatter+WRN        10'934'283   60.33 %   2.19
Scatter+FC         22'484       68.57 %   2.86
DenseNet 22 BC     74'684       65.71 %   1.35
DenseNet 13        89'711       66.16 %   1.52
DAWN (no init.)    2'894        58.60 %   4.10
DAWN (16 init.)    71'227       68.88 %   2.14

Table 1: Comparison of accuracy on the KTH-TIPS2-b database, where all networks are trained from scratch without pre-trained weights.
CIFAR

CIFAR-10 [15] contains 60000 colour images of size $32 \times 32$ belonging to 10 classes. The same partition used to train and test DenseNet [14] is used in this paper, i.e., 50000 images for training and 10000 images for testing. CIFAR-100 [15] has 100 classes with 500 training images and 100 testing images per class. The data augmentation consists of random cropping with a padding of 4 pixels and horizontal mirroring.

Table 2 shows the best results of each architecture on these two databases. Several DenseNet configurations are available with the default growth rate of 12; the configuration chosen for the comparison is the one whose number of parameters is closest to that of the proposed model. The 18-layer ResNet architecture is used for comparison after replacing its initial layers with a single convolutional layer of stride 1 and kernel size $3 \times 3$; the original layers were removed because they are normally used to reduce the image size at the beginning of the network, which is not required for the small CIFAR images. For WCNN, an experiment varying the number of levels was conducted, and the best variant is reported in Table 2. The Scattering network configurations are the same as those used in the original paper [22] for these datasets.

For the CIFAR databases, the proposed network uses three levels of lifting scheme, as the input image size is $32 \times 32$. Table 2 shows that increasing the number of initial convolutional filters tends to improve accuracy; it is therefore up to the user to balance a more compact network, in terms of number of parameters, against better classification performance. The DAWN architecture outperforms WCNN on both datasets even though it has significantly fewer parameters. The Scattering network with the handcrafted representation (Scatter+MLP) achieves lower accuracy than the DAWN architecture, as its wavelets are not learned.

DAWN also offers competitive accuracy on CIFAR-10 compared to the VGG and ResNet architectures; furthermore, DAWN with a depth of 256 for the initial convolutional layers outperforms both architectures on CIFAR-100. The scattering hybrid representation (Scatter+WRN) has considerably more parameters than the other architectures, and its performance is similar to VGG and ResNet on both datasets. In this application, the DenseNet architecture exhibits good performance due to its ability to retain relevant features through the entire network.

Architecture       # param.   CIFAR-10   CIFAR-100
VGG (variation)    15.0 M     94.00 %    72.61 %
ResNet 18          11.2 M     94.25 %    73.30 %
DenseNet 40        1.10 M     94.73 %    75.25 %
DenseNet 100       7.19 M     95.90 %    79.80 %
WCNN L3            2.28 M     89.85 %    65.17 %
Scatter+WRN        45.5 M     92.31 %    72.26 %
Scatter+MLP        17.0 M     81.90 %    49.84 %
DAWN (16 init.)    59.3 K     86.04 %    56.70 %
DAWN (32 init.)    0.21 M     90.41 %    65.06 %
DAWN (64 init.)    0.73 M     92.69 %    70.57 %
DAWN (128 init.)   2.79 M     93.34 %    72.47 %
DAWN (256 init.)   10.9 M     92.02 %    74.04 %

Table 2: Comparison of accuracy on the CIFAR-10 and CIFAR-100 databases. The number of trainable parameters is given for the CIFAR-100 configuration.
Hybrid network

As an additional experiment, the proposed multiresolution analysis can be combined with other network architectures. This hybrid network (DAWN+WRN) replaces the scattering transform inside the Scatter+WRN architecture with the 2D lifting schemes (Figure 2). It has a similar number of trainable parameters to Scatter+WRN and achieves slightly higher accuracy on both CIFAR-10 and CIFAR-100.

4.3 Hyperparameter Tuning

The DAWN network uses only a small number of hyperparameters. Besides the initial convolution depth analyzed in Section 4.2, the remaining hyperparameters are the kernel size and the number of convolutional layers inside the updater and predictor of the lifting scheme. This section analyzes the effect of these hyperparameters on the final results. For simplicity, the experiments are performed on the CIFAR datasets using the DAWN architecture with 64 initial filters.

Kernel size and number of convolutions

Both of these hyperparameters affect the lifting scheme module, whose role is to generate a mathematical function for the wavelet representation. The update operator needs to represent the frequency structure of the input signal, while the predictor needs to represent its spatial structure. These hyperparameters also affect the total number of trainable parameters of the architecture. Table 3 shows the effect of changing them: i) the kernel-size experiments were obtained with the updater/predictor structure described in Figure 1; ii) additional hidden layers inside the module are generated by repeating the first convolutional layer of the updater/predictor. Performance does not vary much across hyperparameter combinations with a similar number of trainable parameters.

Number of multiresolution analysis levels

Table 3 also shows how the number of trainable parameters depends on the number of levels of the 2D adaptive lifting scheme, and how the performance varies from not using any lifting scheme level (only the initial convolutions), which results in poor performance, to using the maximum number of possible levels (according to Section 3). As shown in Table 3, it is usually beneficial to use the maximum number of levels, as it leads to higher accuracy on both datasets. Note that for the CIFAR databases the input size is $32 \times 32$, which makes the maximum number of possible levels equal to 3.

Configuration      CIFAR-10   CIFAR-100   # param.
(k=3, h=1, l=3)    92.69 %    70.57 %     734'628
(k=1, h=1, l=3)    88.09 %    64.30 %     439'716
(k=2, h=1, l=3)    92.27 %    68.01 %     587'172
(k=4, h=1, l=3)    92.69 %    70.96 %     882'084
(k=3, h=2, l=3)    92.58 %    70.51 %     918'564
(k=3, h=3, l=3)    92.46 %    68.85 %     1'140'900
(k=3, h=4, l=3)    92.35 %    68.19 %     1'363'236
(k=3, h=1, l=0)    75.49 %    44.12 %     45'348
(k=3, h=1, l=1)    90.53 %    66.71 %     275'108
(k=3, h=1, l=2)    92.17 %    70.42 %     504'868

Table 3: Results of tuning the DAWN architecture with 64 initial convolutions. The first entry is the network configuration used to generate the results in Table 2. The hyperparameters tested are the kernel size (k), the number of hidden convolutional layers (h), and the number of levels (l). The number of trainable parameters is given for the CIFAR-100 database.

4.4 Visual Representation Results

The decomposition generated by the lifting scheme has a relevant visual representation, as it is composed of the approximation and detail sub-bands of the input signal. Figure 3 shows the visualization of the multiresolution analysis for different numbers of decomposition levels. To generate the visualizations in Figure 3, the network was run without the initial convolutional layers on the KTH database.

Many decomposition levels are very similar to a traditional wavelet decomposition, where the approximation sub-band captures the low-frequency information of the image while the detail sub-bands tend to capture high-frequency information. However, some sub-bands differ slightly, as the loss function also minimizes the cross-entropy term to ensure good classification performance (Section 3).

Figure 3: Coefficients extracted for 3 decomposition levels of the 2D Adaptive Lifting Scheme in the DAWN architecture, trained with the loss function of Eqn. 7. For visualization purposes, the LH, HL, and HH sub-bands were amplified by a constant factor.

5 Discussion and Future work

Multiresolution analysis as a deep learning architecture

Analogous to the DAWN architecture, Bruna and Mallat [1] use a multiresolution analysis based on the wavelet transform as the backbone of their architecture. Both their work and ours focus on wavelet extraction as an operation invariant to deformation. In Bruna and Mallat's work, the modulus is taken of each wavelet coefficient at different levels; in the DAWN architecture, the detail coefficients of each level of the wavelet transform are carried to the end of the network. The biggest difference between DAWN and the handcrafted scattering representation is DAWN's ability to learn the wavelet configuration. It is this ability that allows it to adapt to the data and perform consistently across different datasets, as shown in Tables 1 and 2.

Combining multiresolution analysis with more traditional CNN architectures

The hybrid network with the proposed 2D lifting scheme shows the potential to improve the accuracy or reduce the number of trainable parameters of other networks. How to combine or incorporate more CNN features into the proposed network while keeping performance consistent across datasets is an interesting avenue for future work.

Initial convolutions

At the moment, the architecture uses initial convolutional layers to increase the number of channels of the input image, which is a simple approach. Research into more advanced architectures for this part of the network is left as future work. Moreover, multiresolution analysis is usually applied to an image rather than to a CNN output; changing the order of the initial convolutions and the lifting schemes might lead to exciting new architectures.

6 Conclusions

We presented the DAWN architecture, which combines the lifting scheme and CNNs to learn features using multiresolution analysis. In contrast to the black-box nature of CNNs, the DAWN architecture is designed to extract a wavelet representation of the input at each decomposition level. Unlike traditional wavelets, the proposed model is data-driven, so it adapts to the input images. It is trainable end-to-end and achieves state-of-the-art performance for texture classification with a very limited number of trainable parameters. Interpreting convolution and pooling operations in CNNs as operations in multiresolution analysis helped us systematically design a novel network architecture. The performance of DAWN is comparable to that of state-of-the-art classification networks on the CIFAR-10 and CIFAR-100 datasets.

References

  • [1] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2013.
  • [2] B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific material categorisation. IEEE International Conference on Computer Vision, pages 1597–1604, 2005.
  • [3] S. Chakraborty et al. Interpretability of deep learning models: A survey of results. In IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, pages 1–6, 2017.
  • [4] R. L. Claypoole, G. M. Davis, W. Sweldens, and R. G. Baraniuk. Nonlinear wavelet transforms for image coding via lifting. IEEE Transactions on Image Processing, 12(12):1449–1459, 2003.
  • [5] F. Cotter and N. Kingsbury. Deep learning in the wavelet domain. arXiv preprint arXiv:1811.06115, 2018.
  • [6] I. Daubechies and W. Sweldens. Factoring wavelet transforms into lifting steps. The Journal of Fourier Analysis and Applications, 4(3):247–269, 1998.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [8] Y. Dong, H. Su, J. Zhu, and B. Zhang. Improving interpretability of deep neural networks with semantic information. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4306–4314, 2017.
  • [9] T. Elsken, J. Metzen, and F. Hutter. Simple and efficient architecture search for CNNs. In Workshop on Meta-Learning at NIPS, 2017.
  • [10] S. Fujieda, K. Takayama, and T. Hachisuka. Wavelet convolutional neural networks. arXiv:1805.08620, 2018.
  • [11] M. Gholipour and H. A. Noubari. Hardware implementation of lifting based wavelet transform. In International Conference on Signal Processing Systems, volume 1, pages V1–215–V1–219, 2010.
  • [12] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition, 2016.
  • [14] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. Conference on Computer Vision and Pattern Recognition, 2017.
  • [15] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [16] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
  • [17] Y. LeCun. Generalization and network design strategies. Connectionism in Perspective, 1989.
  • [18] M. Lin, Q. Chen, and S. Yan. Network in network. International Conference on Learning Representations, 2014.
  • [19] H. Lu, H. Wang, Q. Zhang, D. Won, and S. W. Yoon. A dual-tree complex wavelet transform based convolutional neural network for human thyroid medical image segmentation. IEEE International Conference on Healthcare Informatics, pages 191–198, 2018.
  • [20] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(7):674–693, 1989.
  • [21] G. Meurant. Wavelets: a tutorial in theory and applications, volume 2. Academic press, 2012.
  • [22] E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering transform: Deep hybrid networks. International Conference on Computer Vision, 2017.
  • [23] A. Ross and F. Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In AAAI Conference on Artificial Intelligence, 2018.
  • [24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
  • [25] W. Sweldens. The lifting scheme: A construction of second generation wavelets. Society for Industrial and Applied Mathematics, 29(2):511–546, 1998.
  • [26] V. Andrearczyk and P. F. Whelan. Using filter banks in convolutional neural networks for texture classification. arXiv:1601.02919, 2016.
  • [27] T. Williams and R. Li. Advanced image classification using wavelets and convolutional neural networks. IEEE International Conference on Machine Learning and Applications, pages 233–239, 2016.
  • [28] T. Williams and R. Li. Wavelet pooling for convolutional neural networks. International Conference on Learning Representations, 2018.
  • [29] Z. Yi, R. Wang, and J. Li. Nonlinear wavelets and BP neural networks adaptive lifting scheme. International Conference on Apperceiving Computing and Intelligence Analysis Proceeding, pages 316–319, 2010.
  • [30] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2017.
  • [31] X. Zhang, W. Wang, T. Yoshikawa, and Y. Takei. Design of IIR orthogonal wavelet filter banks using lifting scheme. IEEE Transactions on Signal Processing, 54(7):2616–2624, 2006.