WaveSNet: Wavelet Integrated Deep Networks for Image Segmentation

05/29/2020 ∙ by Qiufu Li, et al. ∙ Shenzhen University 20

In deep networks, the lost data details significantly degrade the performances of image segmentation. In this paper, we propose to apply Discrete Wavelet Transform (DWT) to extract the data details during feature map down-sampling, and adopt Inverse DWT (IDWT) with the extracted details during the up-sampling to recover the details. We firstly transform DWT/IDWT as general network layers, which are applicable to 1D/2D/3D data and various wavelets like Haar, Cohen, and Daubechies, etc. Then, we design wavelet integrated deep networks for image segmentation (WaveSNets) based on various architectures, including U-Net, SegNet, and DeepLabv3+. Due to the effectiveness of the DWT/IDWT in processing data details, experimental results on CamVid, Pascal VOC, and Cityscapes show that our WaveSNets achieve better segmentation performances than their vanilla versions.



There are no comments yet.


page 1

page 5

page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1:

Comparison of max-pooling, max-unpooling, and wavelet transforms. Max-pooling is a common down-sampling operation in the deep networks, which easily removes the data details (the window boundary in area A) and breaks the object structure (the pole in area B). Based on the pooling indices, max-unpooling recovers the data resolution, but not the details. DWT keeps the object structure and saves the data details, while IDWT precisely recovers the original data using the DWT output.

Due to the advantage of deep network in extracting high-level features, deep learning has achieved high performances in various tasks, particular in the computer vision. However, the current deep networks are not good at extracting and processing data details. While deep networks with more layers are able to fit more functions, the deeper networks are not always associated with better performances

[11]. An important reason is that the details will be lost as the data flow through the layers. In particular, the lost data details significantly degrade the performacnes of the deep networks for the image segmentation. Various techniques, such as condition random field, àtrous convolution [3, 4], PointRend [12], are introduced into the deep networks to improve the segmentation performance. However, these techniques do not explicitly process the data details.

Wavelets [6, 14], well known as “mathematical microscope”, are effective time-frequency analysis tools, which could be applied to decompose an image X into the low-frequency component containing the image main information and the high-frequency components containing the details (Fig. 1). In this paper, we rewrite Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT) as the general network layers, which are applicable to 1D/2D/3D data and various wavelets. One can flexibly design end-to-end architectures using them, and directly process the data details in the deep networks. We design wavelet integrated deep networks for image segmentation, termed WaveSNets, by replacing the down-sampling with DWT and the up-sampling with IDWT in the U-Net [17], SegNet [1], and DeepLabv3+ [4]. When Haar, Cohen, and Daubechies wavelets are used, we evaluate WaveSNets using dataset CamVid [2], Pascal VOC [8], and Cityscapes [5]. The experimental results show that WaveSNets achieve better performances in semantic image segmentation than their vanilla versions, due to the effective segmentation for the fine and similar objects. In summary:

  • We rewrite DWT and IDWT as general network layers, which are applicable to various wavelets and can be applied to design end-to-end deep networks for processing the data details during the network inference;

  • We design WaveSNets using various network architectures, by replacing the down-sampling with DWT layer and up-sampling with IDWT layer;

  • WaveSNets are evaluated on the dataset of CamVid, Pascal VOC, Cityscapes, and achieve better performance for semantic image segmentation.

2 Related works

2.1 Sampling operation

Down-sampling operations, such as max-pooling, average-pooling, and strided-convolution, are introduced into the deep networks for local connectivity and weight sharing. These down-sampling operations usually ignore the chassic sampling theorem

[15], which result in aliasing among the data components in different frequency intervals. As a result, the data details presented by the high-frequency components are totally lost, and random noises showing up in the same components could be sampled into the low resolution data. In addition, the object basic structures presented by the low-frequency component will be broken. Fig. 1 shows a max-pooling example. In the signal processing, the low-pass filtering before the down-sampling is the standard method for anti-aliasing. Anti-aliased CNNs [21]

integrate the low-pass filtering with the down-sampling in the deep networks, which achieve increased classification acccuracy and better shift-robustness. However, the filters used in anti-aliased CNNs are empirically designed based on the row vectors of Pascal’s triangle, which are ad hoc and no theoretical justifications are given. As no up-sampling operation, i.e., reconstruction, of the low-pass filter is available, the anti-aliased U-Net

[21] has to apply the same filtering after normal up-sampling to achieve the anti-aliasing effect.

Up-sampling operations, such as max-unpooling [1], deconvolution [17]

, and bilinear interpolation

[3, 4]

, are widely used in the deep networks for image-to-image translation tasks. These up-sampling operations are usually applied to gradually recover the data resolution, while the data details can not be recovered from them. Fig.

1 shows a max-unpooling example. The lost data details would significantly degrade the network performance for the image-to-image tasks, such as the image segmentation. Various techniques, including àtrous convolution [3, 4], PointRend [12], etc., are introduced into the design of deep networks to capture the fine details for performance improvement of image segmentation. However, these techniques try to recover the data details from the detail-unrelated information. Their ability in the improvement of segmentation performance is limited.

2.2 Wavelet

Wavelets are powerful time-frequency analysis tools, which have been widely used in signal analysis, image processing, and pattern recognition. A wavelet is usually associated with scaling function and wavelet functions. The shifts and expansions of these functions compose stable basis for the signal space, with which the signal can be decomposed and reconstructed. The functions are closely related to the low-pass and high-pass filters of the wavelet. In practice, these filters are applied for the data decomposition and reconstruction. As Fig.

1 shows, 2D Discrete Wavelet Transform (DWT) decompose the image X into its low-frequency component and three high-frequency components . While is a low resolution version of the image, keeping its main information, save its horizontal, vertical, and diagonal details, respectively. 2D Inverse DWT (IDWT) could reconstruct the image using the DWT output.

Various wavelets, including orthogonal wavelets, biorthogonal wavelets, multiwavelets, ridgelet, curvelet, bandelet and contourlet, etc., have been designed, studied, and applied in signal processing, numerical analysis, pattern recognition, computer vision and quantum mechanics, etc. The àtrous convolution used in DeepLab [3, 4] is originally developed in the wavelet theory. In the deep learning, while wavelets are widely applied as data preprocessing or postprocessing tools, wavelet transforms are also introduced into the design of deep networks by taking them as substitutes of sampling operations.

Multi-level Wavelet CNN (MWCNN) [13]

integrates Wavelet Package Transform (WPT) into the deep network for image restoration. MWCNN concatenates the low-frequency and high-frequency components of the input feature map, and processes them in a unified way. The details stored in the high-frequency components would be largely wiped out via this processing mode, because the data amplitude in the components is significantly weaker than that in the low-frequency component. Convolutional-Wavelet Neural Network (CWNN)

[7] applies the redundant dual-tree complex wavelet transform (DT-CWT) to suppress the noise and keep the object structures for extracting robust features from SAR images. The architecture of CWNN contains only two convolution layers. While DT-CWT discards the high-frequency components output from DT-CWT, CWNN takes as its down-sampling output the average value of the multiple low-frequency components. Wavelet pooling proposed in [19] is designed using a two-level DWT. Its back-propagation performs a one-level DWT with a two-level IDWT, which does not follow the mathematical principle of gradient. Recently, the application of wavelet transform in image style transfer [20] is studied. In these works, the authors evaluate their methods using only one or two wavelets, because of the absence of the general wavelet transform layers; the data details presented by the high-frequency components are abandoned or processed together with the low-frequency component, which limits the detail restoration in the image-to-image translation tasks.

3 Our method

Our method is to replace the sampling operations in the deep networks with wavelet transforms. We firstly rewrite Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT) as the general network layers. Although the following analysis is mainly for orthogonal wavelet and 1D data, it can be generalized to other wavelets and 2D/3D data with slight changes.

3.1 Wavelet transform

For a given 1D data , DWT decomposes it, using two filters, i.e., low-pass filter and high-pass filter of an 1D orthogonal wavelet, into its low-frequency component and high-frequency component , where


and , denote the convolution and naive down-sampling, respectively. In theory, the length of every component is of that of x, i.e.,


Therefore, is usually even number.

IDWT reconstructs the original data x based on the two components,


where denotes the naive up-sampling operation.

For high-dimensional data, high dimensional DWT could decomposes it into one low-frequency component and multiple high-frequency components, while the corresponding IDWT could reconstructs the original data from the DWT output. For example, for a given 2D data

, 2D DWT decomposes it into four components,


where is the low-pass filter and are the high-pass filters of the 2D orthogonal wavelet. is the low-frequency component of the original data which is a low-resolution version containing the data main information; are three high-frequency components which store the vertical, horizontal, and diagonal details of the data. 2D IDWT reconstruct the original data from these components,


Similarly, the size of every component is size of the original 2D data in terms of the two dimensional direction, i.e.,


Therefore, are usually even numbers.

Generally, the filters of high-dimensional wavelet are tensor products of the two filters of 1D wavelet. For 2D wavelet, the four filters could be designed from


where is the tensor product operation. For example, the low-pass and high-pass filters of 1D Haar wavelet are


Then, the filters of the corresponding 2D Haar wavelet are


Eqs. (1), (3), (4) and (5

) present the forward propagations for 1D/2D DWT and IDWT. It is onerous to deduce the gradient for the backward propagations from these equations. Fortunately, the modern deep learning framework PyTorch

[16] could automatically deduce the gradients for the common tensor operations. We have rewrote 1D/2D DWT and IDWT as network layers in PyTorch, which will be publicly available for other researchers. In the layers, we do DWT and IDWT channel by channel for multi-channel data.

3.2 WaveSNet

We design wavelet integrated deep networks for image segmentation (WaveSNets), by replacing the down-sampling operations with 2D DWT and the up-sampling operations with 2D IDWT. In this paper, we take U-Net, SegNet, and DeepLabv3+ as the basic architectures.

(a) PDDS
(b) PUDS
Figure 2: The dual structures used in U-Net and SegNet.
Figure 3: The wavelet based dual structure (WADS).

WSegNets SegNet and U-Net share a similar symmetrical encoder-decoder architecture, but differ in their sampling operations. We name the pair of connected down-sampling and up-sampling operations and the associated convolutional blocks as dual structure, where the convolutional blocks process the feature maps with the same size. Fig. 2(a) and Fig. 2(b) show the dual structures used in U-Net and SegNet, which are named as PDDS (Pooling Deconvolution Dual Structure) and PUDS (Pooling-Unpooling Dual Structure), respectively. U-Net and SegNet consist of multiple nested dual structures. While they apply the max-pooling during their down-sampling, PDDS and PUDS use deconvolution and max-unpooling for the up-sampling, respectively. As Fig. 2(a) shows, PDDS copys and transmits the feature maps from encoder to decoder, concatenating them with the up-sampled features and extracting detailed information for the object boundaries restoration. However, the data tensor injected to the decoder might contain redundant information, which interferes with the segmentation results and introduces more convolutional paramters. PUDS transmits the pooling indices via the branch path for the upgrading of the feature map resolution in the decoder. As Fig. 1 shows, the lost data details can not be restored from the pooling indices.

To overcome the weaknesses of PDDS and PUDS, we adopt DWT for down-sampling and IDWT for up-sampling, and design WADS (WAvelet Dual Structure, Fig. 3). During its down-samping, WADS decomposes the feature map into low-frequency and high-requency components. Then, WADS injects the low-frequency component into the following layers in the deep networks to extract high-level features, and transmits the high-frequency components to the up-sampling layer for the recovering of the feature map resolution using IDWT. IDWT could also restore the data details from the high-frequency components during the up-sampling. We design wavelet integrated encoder-decoder networks using WADS, termed WSegNets, for imag segmentation.

Table 1 illustrates the configuration of WSegNets, together with that of U-Net and SegNet. In this paper, the U-Net consists of eight more convolutional layers than the original one [17].

data size the number of channels networks
encoder decoder SegNet U-Net WSegNet
3, 64 64, 64 PUDS PDDS WADS
64, 128 128, 64 PUDS PDDS WADS
128, 256, 256 256, 256, 128 PUDS PDDS WADS
256, 512, 512 512, 512, 256 PUDS PDDS WADS
512, 512, 512 512, 512, 512 PUDS PDDS WADS
Table 1: Configurations of U-Net, SegNet, and WSegNet.

In Table 1

, the first column shows the input size, though these networks can process images with arbitrary size. Every number in the table corresponds to a convolutional layer followed by a Batch Normalization (BN) and Rectified Linear Unit (ReLU). While the number in the column “encoder” is the number of the input channels of the convolution, the number in the column “decoder” is the number of the output channels. The encoder of U-Net and SegNet consists of 13 convolutional layers corresponding to the first 13 layers in the VGG16bn

[18]. A convolutional layer with kernel size of converts the decoder output into the predicted segmentation result.

WDeepLabv3+ DeepLabv3+ [4] employs an unbalanced encoder-decoder architecture. The encoder applys an àtrous convolutional version of CNN to alleviate the detailed information loss due to the common down-sampling operations, and an Àtrous Spatial Pyramid Pooling (ASPP) to extract image multiscale representations. At the begin of the decoder, the encoder feature map is directly upsampled using a bilinear interpolation with factor of 4, and then concatenated with a low-level feature map transmitted from the encoder. DeepLabv3+ adopts a dual structure connecting its encoder and decoder, which differs with PDDS only on the up-sampling and suffers the same drawbacks.

Figure 4: WDeepLabv3+ structure.

We design WDeepLabv3+, a wavelet integrated version of DeepLabv3+, by applying two wavelet version of dual structures. As Fig. 4 shows, the encoder applys a wavelet integrated àtrous CNN followed by the ASPP, which output encoder feature map and two sets of high-frequency components. The encoder feature map is up-sampled by two IDWT, integrated with the detail information contained in the high-frequency components, while the two IDWT are connected with a convolutional block. The decoder then apply a few convolutions to refine the feature map, followed by a bilinear interpolation to recover the resolution.

4 Experiment

We evaluate the WaveSNets (WSegNet, WDeepLabv3+) on the image dataset of CamVid, Pascal VOC, and Cityscapes, in terms of mean intersection over union (mIoU).

4.1 WSegNet

CamVid CamVid contains 701 road scene images (367, 101 and 233 for the training, validation and test respectively) with size of , which was used in [1] to quantify SegNet performance. Using the training set, we train SegNet, U-Net, and WSegNet when various wavelets are applied. The trainings are supervised by cross-entropy loss. We employ SGD solver, initial learning rate of with “poly” policy and , momentum of and weight decay of . With a mini-batch size of

, we train the networks 12K iterations (about 654 epochs). The input image size is

. For every image, we adopt random resizing between and , random rotation between and degrees, random horizontal flipping and cropping with size of . We apply a pre-trained VGG16bn model for the encoder initialization and initialize the decoder using the technique proposed in [10]. We do not tune above hyper-parameters for a fair comparison.

SegNet U-Net WSegNet
haar ch2.2 ch3.3 ch4.4 ch5.5 db2 db3 db4 db5 db6
sky 90.50 91.52 91.31 91.38 91.29 91.39 91.24 91.35 91.48 90.99 90.89 90.63
building 77.80 80.28 79.90 79.27 78.82 79.37 78.65 79.48 79.60 78.52 78.58 77.84
pole   9.14 27.97 27.99 29.38 27.90 28.96 27.26 27.91 28.38 28.04 26.66 25.96
road 92.69 93.71 93.69 93.47 93.47 93.77 93.78 93.72 93.91 92.57 92.12 92.11
sidewalk 78.05 80.05 80.33 79.44 79.34 79.89 80.08 79.67 80.58 76.95 75.62 76.65
tree 72.40 73.51 73.34 73.27 73.21 73.07 71.60 73.61 73.68 72.90 72.28 71.92
symbol 37.61 43.07 44.44 42.68 40.42 43.57 42.33 44.72 44.01 41.06 41.72 39.69
fence 19.92 27.50 32.59 24.62 25.59 28.40 28.85 25.52 29.60 26.90 24.15 29.00
car 79.31 85.04 83.21 84.43 82.63 84.57 84.14 84.30 83.97 81.92 81.07 78.38
walker 39.93 50.35 49.35 50.98 50.52 50.43 49.09 50.15 49.39 47.69 48.02 43.17
bicyclist 39.48 44.45 50.38 47.94 48.69 47.93 52.64 51.15 47.73 46.08 43.53 38.96
1-13[3pt/1pt] mIoU 57.89 63.40 64.23 63.35 62.90 63.76 63.61 63.78 63.85 62.15 61.33 60.39
gl. acc. 91.04 92.19 92.08 92.04 91.98 92.10 91.73 92.06 92.40 91.60 91.30 91.14
para.(e6) 29.52 37.42 29.52
Table 2: WSegNet results on CamVid test set

Table 2 shows the mIoU and global accuracy on the CamVid test set, together with the parameter numbers of U-Net, SegNet, and WSegNet with various wavelets. In the table, “dbx” represents orthogonal Daubechies wavelet with approximation order , and “chx.y” represents biorthogonal Cohen wavelet with approximation orders . Their low-pass and high-pass filters can be found in [6]. The length of these filters increase as the order increases. While Haar and Cohen wavelets are symmetric, Daubechies are not. The mIoU of WSegNet decreases from to as asymmetric wavelet varies from “db2” to “db6”, while it varies from to as symmetric wavelet varies from “ch2.2” to “ch5.5”. It seems that the performances among different asymmetric wavelets are much diverse than that among various symmetric wavelets. In the wavelet integrated deep networks, we truncate the DWT output to make them to be size of input data. As a result, IDWT with asymmetric wavelet can not fully restore an image in the region near the image boundary, and the region width increases as the wavelet filter length increases. With symmetric wavelet, however, one can fully restore the image based on symmetric extension of the DWT output. Consequently, WSegNet with Cohen wavelets performs better than that with Daubechies ones near the image boundary. Fig. 5

shows an example image, its manual annotation, a region consisting of “building”, “road” and “sidewalk” and the segmentation results achieved using different wavelets. The region is located close to the image boundary and has been enlarged with colored segmentation results for better illustration. One can observe from the results of “db5” and “db6” that a long line of “sidewalk” pixels, located near the image boundary, are classified as “road”. In comparison, the results of “ch4.4” and “ch5.5” match well with the ground truth.

Figure 5: Boundary comparison for images segmented by WSegNet with different wavelets.

In [1], the authors train the SegNet using an extended CamVid training set containing 3433 images, which achieved mIoU on the CamVid test set. We train SegNet, U-Net and WSegNet using only 367 CamVid training images. From Table 2, one can find WSegNet achieves the best mIoU () using Haar wavelet, and the best global accuracy () using “db3” wavelet. WSegNet is significantly superior to SegNet in terms of mIoU and the global accuracy, while they require the same amount of parameters (). The segmentation performance of WSegNet is also better than that of U-Net, while it requires much less parameters than U-Net ().

Table 2 also lists the IoUs for the 11 categories in the CamVid. Compared with U-Net and WSegNet, SegNet performs very poor on the fine objects, such as “pole”, “symbol”, and “fence”, as that the max-pooling indices used in SegNet are not helpful for restoring the image details. While U-Net achieves comparable or even better IoUs on the easily identifiable objects, such as the “sky”, “building”, and “car”, it does not discriminate well similar objects like“walker” and “bicyclist”, or “building” and “fence”. The feature maps of these two pairs of objects might look similar to the decoder of U-Net, as the data details are not separately provided. Table 3 shows the confusion matrices on these four categories. The proportion of “building” in the predicted “fence” decreases from to , as the network varies from SegNet to WPUNet, while that of “bicyclist” in the predicted “walker” decreases from to . These results suggest that WSegNet is more powerful than SegNet and U-Net in distinguishing similar objects.

SegNet U-Net WSegNet(haar)
building fence building fence building fence
building 88.01   1.30 89.91   1.19 90.22   1.08
fence 34.11 29.12 30.64 40.16 30.16 44.65
walker bicyclist walker bicyclist walker bicyclist
walker 53.30   2.54 66.83   1.95 68.74   1.90
bicyclist 13.69 49.66 16.73 51.45 12.52 59.80
Table 3: Confusion matrices for the CamVid test set.
Figure 6: Comparison of SegNet and WSegNet segmentations.

Fig. 6 presents a visual example for various networks, which shows the example image, its manual annotation, a region consisting of “building”, “tree”, “sky” and “pole”, and the segmentation results achieved using SegNet, U-Net and WSegNet. The region is enlarged with colored segmentation reults for better illustration. From the figure, one can find in the segmentation result that WSegNet keeps the basic structure of “tree”, “pole”, and “building” and restores the object details, such as the “tree” branches and the “pole”. The segmentation result of WSegNet matches the image region much better than that of SegNet and U-Net, even corrects the annotation noises about “building” and “tree” in the ground truth.

SegNet U-Net WSegNet
haar ch2.2 ch3.3 ch4.4 ch5.5 db2
Pascal VOC mIoU 61.33 63.64 63.95 63.50 63.46 63.48 63.62 63.48
2-10[1pt/1pt] gl. acc. 90.14 90.82 90.79 90.72 90.72 90.73 90.76 90.75
Cityscapes mIoU 65.75 70.05 70.09 69.86 69.73 70.13 70.63 70.37
2-10[1pt/1pt] gl. acc. 94.65 95.21 95.20 95.18 95.09 95.17 95.15 95.20
parameters (e6) 29.52 37.42 29.52
Table 4: WSegNet results on Pascal VOC and Cityscapes val set.

Pascal VOC and Cityscapes The original Pascal VOC2012 semantic segmentation dataset contains 1464 and 1449 annotated images for training and validation, respectively, and contains 20 object categories and one background class. The images are with various sizes. We augment the training data to 10582 by extra annotations provided in [9]. We train SegNet, U-Net, and WSegNet with various wavelets on the extended training set 50 epochs with batch size of 16. During the training, we adopt random horizontal flipping, random Gaussian blurring, and cropping with size of . The other hyper-parameters are the same with that used in CamVid training. Table 4 presents the results on the validation set for the trained networks.

Cityscapes contains 2975 and 500 high quality annotated images for the training and validation, respectively. The images are with size of . We train the networks on the training set 80 epochs with batch size of 8 and initial learning rate of . During the training, we adopt random horizontal flipping, random resizing between 0.5 and 2, and random cropping with size of . Table 4 presents the results on the validation set.

From Tabel 4, one can find the segmentation performance of SegNet is significant inferior to that of U-Net and WSegNet. While WSegNet achieves better mIoU ( for Pascal VOC and for Cityscapes) and requires less parameters (), the global accuracies of WSegNet on the two dataset are a little lower than that of U-Net. This result suggest that, compared with U-Net, WSegNet could more precisely classify the pixels at the “critical” locations.

4.2 WDeepLabv3+

DeepLabv3+ WDeepLabv3+
haar ch2.2 ch3.3 ch4.4 ch5.5 db2
background 93.83 93.82 93.85 93.94 93.91 93.86 93.87
aeroplane 92.29 93.14 92.50 91.56 93.21 92.73 92.41
bike 41.42 43.08 42.19 42.21 43.42 42.59 42.84
bird 91.47 90.47 91.60 90.34 91.24 90.81 90.73
boat 75.39 75.47 72.04 75.19 74.20 72.40 72.90
bottle 82.05 80.18 81.14 82.12 79.55 82.23 81.89
bus 93.64 93.25 93.25 93.20 93.07 93.52 93.31
car 89.30 90.36 90.00 90.67 88.79 86.31 87.11
cat 93.69 92.80 93.31 92.62 93.56 93.84 93.97
chair 38.28 40.80 40.75 39.32 39.79 43.27 41.60
cow 86.60 89.72 89.04 90.49 88.42 92.04 88.17
table 61.37 63.24 62.67 65.49 64.93 67.31 65.58
dog 91.16 90.24 91.04 89.54 89.97 90.38 90.65
horse 86.60 88.86 88.88 90.23 89.00 91.02 89.19
motorbike 88.47 87.94 87.89 88.61 88.36 87.30 87.58
person 86.71 86.88 86.61 86.35 86.59 86.32 86.96
plant 64.48 64.33 64.69 68.45 65.50 68.01 65.41
sheep 83.04 87.45 86.50 87.43 85.70 88.14 85.95
sofa 49.24 47.43 48.06 49.85 46.74 51.94 50.19
train 85.49 84.18 83.88 85.47 83.88 85.16 86.69
monitor 77.80 76.55 78.31 79.04 78.32 78.78 79.11
1-8[3pt/1pt] mIoU 78.68 79.06 78.96 79.62 78.96 79.90 79.34
gl. acc. 94.64 94.69 94.68 94.75 94.68 94.77 94.75
para. (e6) 59.34 60.22
Table 5: WDeepLabv3+ results on Pascal VOC val set.

Taking ResNet101 as the backbone, we build DeepLabv3+ and WDeepLabv3+ with , and train them on the Pascal VOC using the same training policy with that used in the training of WSegNet. Table 5 shows the segmentation results on the validation set.

We achieve mIoU using DeepLabv3+ on the Pascal VOC validation set, which is comparable to that () obtained by the inventors of this network [4]. With a few increase of parameter (), the segmentation performance of WDeepLabv3+ with various wavelet is always better than that of DeepLabv3+, in terms of mIoU and global accuracy. Using “ch5.5” wavelet, WDeepLabv3+ achieves the best performance, mIoU and global accuracy. From Table 5, one can find that the better performance of WDeepLabv3+ is mainly resulted from its better segmentation for the fine objects (“chair”, “table”, and “plant”) and similar objects (“cow”, “horse”, and “sheep”). The above results justify the high efficiency of DWT/IDWT layers in processing data details.

Figure 7: Comparison of DeepLabv3+ and WDeepLabv3+ results.

Fig. 7 shows four visual examples of segmentation results for DeepLabv3+ and WDeepLabv3+. The first and second columns present the original images and the segmentation ground truth, while the third and fourth columns show the segmentation results of DeepLabv3+ and WDeepLabv3+ with “ch5.5” wavelet, respectively. We show the segmentation results with colored pictures for better illustration. For the example image in Fig. 7(a), DeepLabv3+ falsely classifies the pixels in some detail regions on the cow and her calfs as “background” or “horse”, i.e., the regions for the cow’s left horn, the hind leg of the brown calf, and some fine regions on the two calfs. While WDeepLabv3+ correctly classifies the horse and the sheep in the Fig. 7(b) and Fig. 7(c), DeepLabv3+ classifies them as “cow” because of the similar object structures. In Fig. 7(d), the “table” segmented by WDeepLabv3+ is more complete than that segmented by DeepLabv3+. These results illustrate that WDeepLabv3+ performs better on the detail regions and the similar objects.

5 Conclusion

Our proposed general DWT and IDWT layers are applicable to various wavelets, which can be used to extract and process the data details during the network inference. We design WaveSNets (WSegNet and WDeepLabv3+) by replacing the down-sampling with DWT and replacing the up-sampling with IDWT, in U-Net, SegNet, and DeepLabv3+. Experimental results on the CamVid, Pascal VOC, and Cityscapes show that WaveSNets could well recover the image details and perform better for segmenting similar objects than their vanilla versions.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on PAMI 39 (12), pp. 2481–2495. Cited by: §1, §2.1, §4.1, §4.1.
  • [2] G. J. Brostow, J. Fauqueur, and R. Cipolla (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97. Cited by: §1.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. Cited by: §1, §2.1, §2.2.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the ECCV, pp. 801–818. Cited by: §1, §1, §2.1, §2.2, §3.2, §4.2.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016)

    The cityscapes dataset for semantic urban scene understanding

    In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §1.
  • [6] I. Daubechies (1992) Ten lectures on wavelets. Vol. 61, Siam. Cited by: §1, §4.1.
  • [7] Y. Duan, F. Liu, L. Jiao, P. Zhao, and L. Zhang (2017) SAR image segmentation based on convolutional-wavelet neural network and markov random field. Pattern Recognition 64, pp. 255–267. Cited by: §2.2.
  • [8] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. International journal of computer vision 111 (1), pp. 98–136. Cited by: §1.
  • [9] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998. Cited by: §4.1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015)

    Delving deep into rectifiers: surpassing human-level performance on imagenet classification

    In Proceedings of the IEEE ICCV, pp. 1026–1034. Cited by: §4.1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on CVPR, pp. 770–778. Cited by: §1.
  • [12] A. Kirillov, Y. Wu, K. He, and R. Girshick (2019) PointRend: image segmentation as rendering. arXiv preprint arXiv:1912.08193. Cited by: §1, §2.1.
  • [13] P. Liu, H. Zhang, W. Lian, and W. Zuo (2019)

    Multi-level wavelet convolutional neural networks

    IEEE Access 7, pp. 74973–74985. Cited by: §2.2.
  • [14] S. G. Mallat (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE transactions on PAMI 11 (7), pp. 674–693. Cited by: §1.
  • [15] H. Nyquist (1928) Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers 47 (2), pp. 617–644. Cited by: §2.1.
  • [16] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §3.1.
  • [17] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on MICCAI, pp. 234–241. Cited by: §1, §2.1, §3.2.
  • [18] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.
  • [19] T. Williams and R. Li (2018) Wavelet pooling for convolutional neural networks. In International Conference on Learning Representations, Cited by: §2.2.
  • [20] J. Yoo, Y. Uh, S. Chun, B. Kang, and J. Ha (2019) Photorealistic style transfer via wavelet transforms. arXiv preprint arXiv:1903.09760. Cited by: §2.2.
  • [21] R. Zhang (2019) Making convolutional networks shift-invariant again. arXiv preprint arXiv:1904.11486. Cited by: §2.1.