Owing to the strength of deep networks in extracting high-level features, deep learning has achieved high performance in various tasks, particularly in computer vision. However, current deep networks are not good at extracting and processing data details. While deep networks with more layers are able to fit more functions, deeper networks are not always associated with better performance. An important reason is that details are lost as the data flow through the layers. In particular, the lost data details significantly degrade the performance of deep networks for image segmentation. Various techniques, such as conditional random fields, àtrous convolution [3, 4], and PointRend, have been introduced into deep networks to improve the segmentation performance. However, these techniques do not explicitly process the data details.
Wavelets [6, 14], well known as the “mathematical microscope”, are effective time-frequency analysis tools, which can be applied to decompose an image X into a low-frequency component containing the main image information and high-frequency components containing the details (Fig. 1). In this paper, we rewrite the Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT) as general network layers, which are applicable to 1D/2D/3D data and various wavelets. One can flexibly design end-to-end architectures using them, and directly process the data details in deep networks. We design wavelet integrated deep networks for image segmentation, termed WaveSNets, by replacing the down-sampling with DWT and the up-sampling with IDWT in U-Net, SegNet, and DeepLabv3+. We evaluate WaveSNets with Haar, Cohen, and Daubechies wavelets on the CamVid, Pascal VOC, and Cityscapes datasets. The experimental results show that WaveSNets achieve better performance in semantic image segmentation than their vanilla versions, owing to their more effective segmentation of fine and similar objects. In summary:
- We rewrite DWT and IDWT as general network layers, which are applicable to various wavelets and can be used to design end-to-end deep networks that process the data details during network inference;
- We design WaveSNets using various network architectures, by replacing the down-sampling with a DWT layer and the up-sampling with an IDWT layer;
- WaveSNets are evaluated on the CamVid, Pascal VOC, and Cityscapes datasets, and achieve better performance for semantic image segmentation.
2 Related works
2.1 Sampling operation
Down-sampling operations, such as max-pooling, average-pooling, and strided convolution, are introduced into deep networks for local connectivity and weight sharing. These down-sampling operations usually ignore the classic sampling theorem, which results in aliasing among the data components in different frequency intervals. As a result, the data details carried by the high-frequency components are totally lost, and random noise in the same components can be sampled into the low-resolution data. In addition, the basic object structures carried by the low-frequency component are broken. Fig. 1 shows a max-pooling example. In signal processing, low-pass filtering before down-sampling is the standard method for anti-aliasing. Anti-aliased CNNs integrate low-pass filtering with down-sampling in deep networks, which achieves increased classification accuracy and better shift-robustness. However, the filters used in anti-aliased CNNs are empirically designed based on the row vectors of Pascal’s triangle; they are ad hoc, and no theoretical justification is given. As no up-sampling operation, i.e., reconstruction, is available for the low-pass filter, the anti-aliased U-Net has to apply the same filtering after normal up-sampling to achieve the anti-aliasing effect.
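The aliasing effect described above can be illustrated with a small sketch in plain Python. The test signal, the helper names, and the two-tap averaging filter are illustrative assumptions; they are not the filters used in anti-aliased CNNs or in this paper.

```python
def naive_downsample(x):
    """Keep every second sample (stride 2) without any filtering."""
    return x[::2]


def max_pool(x):
    """1D max-pooling with window 2 and stride 2."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]


def lowpass_then_downsample(x):
    """Average neighbouring pairs (a two-tap low-pass filter), then stride 2."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


# A constant signal of value 1 plus the highest-frequency oscillation (+1, -1, ...).
signal = [1 + (1 if i % 2 == 0 else -1) for i in range(8)]  # [2, 0, 2, 0, 2, 0, 2, 0]

print(naive_downsample(signal))         # [2, 2, 2, 2]: the oscillation aliases into an offset
print(max_pool(signal))                 # [2, 2, 2, 2]: same distortion of the constant part
print(lowpass_then_downsample(signal))  # [1.0, 1.0, 1.0, 1.0]: the constant part is kept
```

In this toy case both naive sub-sampling and max-pooling report a constant level of 2, although the low-frequency content of the signal is 1; filtering before sub-sampling avoids this.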
Up-sampling operations, such as max-unpooling, deconvolution, and bilinear interpolation [3, 4], are widely used in deep networks for image-to-image translation tasks. These up-sampling operations are usually applied to gradually recover the data resolution, but the data details cannot be recovered by them. Fig. 1 shows a max-unpooling example. The lost data details significantly degrade the network performance for image-to-image tasks, such as image segmentation. Various techniques, including àtrous convolution [3, 4], PointRend, etc., have been introduced into the design of deep networks to capture the fine details and improve segmentation performance. However, these techniques try to recover the data details from detail-unrelated information, so their ability to improve segmentation performance is limited.
Wavelets are powerful time-frequency analysis tools, which have been widely used in signal analysis, image processing, and pattern recognition. A wavelet is usually associated with a scaling function and wavelet functions. The shifts and dilations of these functions compose a stable basis for the signal space, with which the signal can be decomposed and reconstructed. The functions are closely related to the low-pass and high-pass filters of the wavelet. In practice, these filters are applied for data decomposition and reconstruction. As Fig. 1 shows, the 2D Discrete Wavelet Transform (DWT) decomposes the image $X$ into its low-frequency component $X_{ll}$ and three high-frequency components $X_{lh}$, $X_{hl}$, $X_{hh}$. While $X_{ll}$ is a low-resolution version of the image that keeps its main information, the three high-frequency components save its horizontal, vertical, and diagonal details. The 2D Inverse DWT (IDWT) can reconstruct the image from the DWT output.
Various wavelets, including orthogonal wavelets, biorthogonal wavelets, multiwavelets, ridgelets, curvelets, bandelets, and contourlets, have been designed, studied, and applied in signal processing, numerical analysis, pattern recognition, computer vision, and quantum mechanics. The àtrous convolution used in DeepLab [3, 4] was originally developed in wavelet theory. In deep learning, while wavelets are widely applied as data preprocessing or postprocessing tools, wavelet transforms have also been introduced into the design of deep networks as substitutes for sampling operations.
Multi-level Wavelet CNN (MWCNN) integrates the Wavelet Packet Transform (WPT) into the deep network for image restoration. MWCNN concatenates the low-frequency and high-frequency components of the input feature map, and processes them in a unified way. The details stored in the high-frequency components would be largely wiped out by this processing mode, because the data amplitude in these components is significantly weaker than that in the low-frequency component. Convolutional-Wavelet Neural Network (CWNN) applies the redundant dual-tree complex wavelet transform (DT-CWT) to suppress the noise and keep the object structures for extracting robust features from SAR images. The architecture of CWNN contains only two convolution layers. While CWNN discards the high-frequency components output from DT-CWT, it takes the average value of the multiple low-frequency components as its down-sampling output. Wavelet pooling is designed using a two-level DWT. Its back-propagation performs a one-level DWT with a two-level IDWT, which does not follow the mathematical principle of gradient. Recently, the application of wavelet transforms to image style transfer has also been studied. In these works, the authors evaluate their methods using only one or two wavelets, because of the absence of general wavelet transform layers; the data details carried by the high-frequency components are abandoned or processed together with the low-frequency component, which limits the detail restoration in image-to-image translation tasks.
3 Our method
Our method is to replace the sampling operations in deep networks with wavelet transforms. We first rewrite the Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT) as general network layers. Although the following analysis is mainly for orthogonal wavelets and 1D data, it can be generalized to other wavelets and 2D/3D data with slight changes.
3.1 Wavelet transform
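For given 1D data $x$, DWT decomposes it, using two filters, i.e., the low-pass filter $f_l$ and high-pass filter $f_h$ of a 1D orthogonal wavelet, into its low-frequency component $s_1$ and high-frequency component $d_1$, where

$$s_1 = (\downarrow 2)(f_l * x), \qquad d_1 = (\downarrow 2)(f_h * x),$$

and $*$, $(\downarrow 2)$ denote the convolution and naive down-sampling, respectively. In theory, the length of every component is half of that of $x$, i.e.,

$$\mathrm{len}(s_1) = \mathrm{len}(d_1) = \frac{\mathrm{len}(x)}{2}.$$

Therefore, $\mathrm{len}(x)$ is usually an even number.

IDWT reconstructs the original data $x$ based on the two components,

$$x = f_l * (\uparrow 2)\, s_1 + f_h * (\uparrow 2)\, d_1,$$

where $(\uparrow 2)$ denotes the naive up-sampling operation.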
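The 1D decomposition and reconstruction described above can be sketched in plain Python for the Haar wavelet. This is a minimal illustration, not the authors' PyTorch layer: the function names are hypothetical, and the pairwise sum/difference form is used in place of an explicit convolution followed by down-sampling, which for Haar gives the same components up to the indexing convention.

```python
import math

R = 1 / math.sqrt(2)  # normalisation of the orthogonal Haar filters


def haar_dwt_1d(x):
    """One-level 1D Haar DWT; returns (low, high), each half the length of x."""
    if len(x) % 2 != 0:
        raise ValueError("the length of x must be even")
    low = [R * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [R * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def haar_idwt_1d(low, high):
    """Inverse of haar_dwt_1d: interleave the reconstructed sample pairs."""
    x = []
    for s, d in zip(low, high):
        x.extend([R * (s + d), R * (s - d)])
    return x


x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0]
low, high = haar_dwt_1d(x)
reconstructed = haar_idwt_1d(low, high)

print(len(low), len(high))  # each component has half the length of x
print(max(abs(a - b) for a, b in zip(x, reconstructed)))  # reconstruction error
```

The reconstruction error is at the level of floating-point rounding, which reflects the perfect-reconstruction property of the orthogonal wavelet.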
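For high-dimensional data, the high-dimensional DWT decomposes it into one low-frequency component and multiple high-frequency components, while the corresponding IDWT reconstructs the original data from the DWT output. For example, for given 2D data $X$, the 2D DWT decomposes it into four components,

$$X_{c} = (\downarrow 2)(f_{c} * X), \qquad c \in \{ll, lh, hl, hh\},$$

where $*$ and $(\downarrow 2)$ denote the 2D convolution and naive down-sampling, $f_{ll}$ is the low-pass filter, and $f_{lh}$, $f_{hl}$, $f_{hh}$ are the high-pass filters of the 2D orthogonal wavelet. $X_{ll}$ is the low-frequency component of the original data, a low-resolution version containing the main information of the data; $X_{lh}$, $X_{hl}$, $X_{hh}$ are three high-frequency components which store the vertical, horizontal, and diagonal details of the data. The 2D IDWT reconstructs the original data from these components,

$$X = f_{ll} * (\uparrow 2)\, X_{ll} + f_{lh} * (\uparrow 2)\, X_{lh} + f_{hl} * (\uparrow 2)\, X_{hl} + f_{hh} * (\uparrow 2)\, X_{hh},$$

where $(\uparrow 2)$ denotes the naive up-sampling. Similarly, the size of every component is half the size of the original 2D data along each of the two dimensions, i.e.,

$$\mathrm{size}(X_{c}) = \frac{1}{2}\, \mathrm{size}(X), \qquad c \in \{ll, lh, hl, hh\}.$$

Therefore, the height and width of $X$ are usually even numbers.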
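The 2D decomposition into four half-size components can be sketched as a separable transform: a 1D Haar step along the width followed by one along the height. The code below is a plain-Python illustration with hypothetical function names; the labelling of the two mixed components (which one is called "lh" and which "hl") is one possible convention and is an assumption here.

```python
def haar_pairs(v):
    """Scaled pairwise sums and differences of a sequence (one 1D Haar step)."""
    r = 2 ** -0.5
    low = [r * (v[i] + v[i + 1]) for i in range(0, len(v), 2)]
    high = [r * (v[i] - v[i + 1]) for i in range(0, len(v), 2)]
    return low, high


def transpose(m):
    return [list(col) for col in zip(*m)]


def split_rows(m):
    """Apply the 1D Haar step to every row; return (low part, high part)."""
    lows, highs = zip(*(haar_pairs(row) for row in m))
    return [list(r) for r in lows], [list(r) for r in highs]


def haar_dwt_2d(X):
    """One-level 2D Haar DWT of a matrix with even height and width."""
    row_low, row_high = split_rows(X)  # filter along the width
    # Filter along the height: transpose, split the rows, transpose back.
    ll, lh = (transpose(p) for p in split_rows(transpose(row_low)))
    hl, hh = (transpose(p) for p in split_rows(transpose(row_high)))
    return ll, lh, hl, hh


X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
ll, lh, hl, hh = haar_dwt_2d(X)

print(len(ll), len(ll[0]))  # every component is half the size in both dimensions
energy_in = sum(v * v for row in X for v in row)
energy_out = sum(v * v for comp in (ll, lh, hl, hh) for row in comp for v in row)
print(energy_in, energy_out)  # an orthogonal transform preserves the energy
```

For this smooth ramp image the diagonal component is zero, while the low-frequency component holds scaled block averages of the input.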
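Generally, the filters of a high-dimensional wavelet are tensor products of the two filters of the 1D wavelet, i.e., its low-pass filter $f_l$ and high-pass filter $f_h$. For a 2D wavelet, the four filters can be designed from

$$f_{ll} = f_l \otimes f_l, \quad f_{lh} = f_l \otimes f_h, \quad f_{hl} = f_h \otimes f_l, \quad f_{hh} = f_h \otimes f_h,$$

where $\otimes$ is the tensor product operation. For example, the low-pass and high-pass filters of the 1D Haar wavelet are

$$f_l = \frac{1}{\sqrt{2}} (1, 1)^T, \qquad f_h = \frac{1}{\sqrt{2}} (1, -1)^T.$$

Then, the filters of the corresponding 2D Haar wavelet are

$$f_{ll} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad f_{lh} = \frac{1}{2} \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}, \quad f_{hl} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}, \quad f_{hh} = \frac{1}{2} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}.$$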
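The tensor-product construction described above can be checked numerically with a few lines of plain Python. The helper `outer` and the dictionary layout are illustrative choices, not part of the authors' implementation.

```python
import math


def outer(u, v):
    """Tensor (outer) product of two 1D filters, returned as a 2D list."""
    return [[a * b for b in v] for a in u]


def inner(f, g):
    """Sum of element-wise products of two 2D filters."""
    return sum(a * b for rf, rg in zip(f, g) for a, b in zip(rf, rg))


r = 1 / math.sqrt(2)
f_l = [r, r]    # 1D Haar low-pass filter
f_h = [r, -r]   # 1D Haar high-pass filter

filters = {
    "ll": outer(f_l, f_l),
    "lh": outer(f_l, f_h),
    "hl": outer(f_h, f_l),
    "hh": outer(f_h, f_h),
}

for name, f in filters.items():
    print(name, [[round(v, 3) for v in row] for row in f])

# The four 2D Haar filters are mutually orthogonal and have unit norm.
names = list(filters)
for a in names:
    for b in names:
        expected = 1.0 if a == b else 0.0
        assert abs(inner(filters[a], filters[b]) - expected) < 1e-12
```

Every entry of the resulting filters has magnitude 1/2, and only the sign pattern differs between the four filters.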
The equations above present the forward propagation of 1D/2D DWT and IDWT. It is onerous to derive the gradients for the backward propagation from these equations. Fortunately, the modern deep learning framework PyTorch can automatically compute the gradients for common tensor operations. We have rewritten 1D/2D DWT and IDWT as network layers in PyTorch, which will be made publicly available for other researchers. In the layers, we perform DWT and IDWT channel by channel for multi-channel data.
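For an orthogonal wavelet, the DWT is a linear map with an orthogonal matrix, so the gradient of a loss with respect to the DWT input is obtained by applying the IDWT to the gradient with respect to the DWT output. The sketch below checks this relation numerically for the Haar wavelet using finite differences. It is written in plain Python for illustration only and is not the PyTorch layer described above; all names and numbers are hypothetical.

```python
import math

R = 1 / math.sqrt(2)


def dwt(x):
    """One-level 1D Haar DWT (pairwise form)."""
    low = [R * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [R * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def idwt(low, high):
    """One-level 1D Haar IDWT."""
    x = []
    for s, d in zip(low, high):
        x += [R * (s + d), R * (s - d)]
    return x


def loss(x, g_low, g_high):
    """A linear loss whose gradient w.r.t. the DWT output is (g_low, g_high)."""
    low, high = dwt(x)
    return (sum(a * b for a, b in zip(low, g_low))
            + sum(a * b for a, b in zip(high, g_high)))


x = [0.5, -1.0, 2.0, 3.0]
g_low, g_high = [1.0, -2.0], [0.5, 4.0]

# Gradient w.r.t. x predicted by the adjoint relation: IDWT of the output gradient.
analytic = idwt(g_low, g_high)

# Finite-difference estimate of the same gradient.
eps = 1e-6
numeric = []
for i in range(len(x)):
    bumped = list(x)
    bumped[i] += eps
    numeric.append((loss(bumped, g_low, g_high) - loss(x, g_low, g_high)) / eps)

print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # small numerical error
```

The same check fails for a backward pass that mixes transform levels, which is the issue raised above for wavelet pooling.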
We design wavelet integrated deep networks for image segmentation (WaveSNets), by replacing the down-sampling operations with 2D DWT and the up-sampling operations with 2D IDWT. In this paper, we take U-Net, SegNet, and DeepLabv3+ as the basic architectures.
WSegNets. SegNet and U-Net share a similar symmetrical encoder-decoder architecture, but differ in their sampling operations. We refer to a pair of connected down-sampling and up-sampling operations, together with the associated convolutional blocks, as a dual structure, where the convolutional blocks process feature maps of the same size. Fig. 2(a) and Fig. 2(b) show the dual structures used in U-Net and SegNet, which are named PDDS (Pooling-Deconvolution Dual Structure) and PUDS (Pooling-Unpooling Dual Structure), respectively. U-Net and SegNet consist of multiple nested dual structures. While both apply max-pooling for down-sampling, PDDS and PUDS use deconvolution and max-unpooling for up-sampling, respectively. As Fig. 2(a) shows, PDDS copies and transmits the feature maps from the encoder to the decoder, concatenating them with the up-sampled features and extracting detailed information for the restoration of object boundaries. However, the data tensor injected into the decoder might contain redundant information, which interferes with the segmentation results and introduces more convolutional parameters. PUDS transmits the pooling indices via the branch path to upgrade the feature map resolution in the decoder. As Fig. 1 shows, the lost data details cannot be restored from the pooling indices.
To overcome the weaknesses of PDDS and PUDS, we adopt DWT for down-sampling and IDWT for up-sampling, and design WADS (WAvelet Dual Structure, Fig. 3). During its down-sampling, WADS decomposes the feature map into low-frequency and high-frequency components. Then, WADS injects the low-frequency component into the following layers of the deep network to extract high-level features, and transmits the high-frequency components to the up-sampling layer, where IDWT recovers the feature map resolution. IDWT can also restore the data details from the high-frequency components during the up-sampling. We design wavelet integrated encoder-decoder networks using WADS, termed WSegNets, for image segmentation.
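The data flow of the dual structure described above can be sketched on 1D data with the Haar wavelet. The function `wads` and the stand-in `inner_block` are hypothetical names; in the actual networks the inner block consists of convolutional layers acting on 2D feature maps, which this plain-Python illustration does not model.

```python
import math

R = 1 / math.sqrt(2)


def dwt(x):
    """One-level 1D Haar DWT (pairwise form)."""
    low = [R * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [R * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def idwt(low, high):
    """One-level 1D Haar IDWT."""
    x = []
    for s, d in zip(low, high):
        x += [R * (s + d), R * (s - d)]
    return x


def wads(x, inner_block):
    """Wavelet dual structure: DWT down, inner processing, IDWT up."""
    low, high = dwt(x)       # down-sampling: split the features
    low = inner_block(low)   # deeper layers only see the low-frequency part
    return idwt(low, high)   # up-sampling: details are re-injected by the IDWT


def identity(v):
    return v


x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

with_details = wads(x, identity)
low, _ = dwt(x)
without_details = idwt(low, [0.0] * len(low))  # up-sampling with the details discarded

print([round(v, 6) for v in with_details])     # the input is restored
print([round(v, 6) for v in without_details])  # only pairwise averages survive
```

With an identity inner block the structure is lossless, whereas discarding the high-frequency branch leaves only the pairwise averages, analogous to what an up-sampling path without detail information can recover.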
| data size | number of channels (encoder) | number of channels (decoder) | SegNet | U-Net | WSegNet |
|---|---|---|---|---|---|
| | 3, 64 | 64, 64 | PUDS | PDDS | WADS |
| | 64, 128 | 128, 64 | PUDS | PDDS | WADS |
| | 128, 256, 256 | 256, 256, 128 | PUDS | PDDS | WADS |
| | 256, 512, 512 | 512, 512, 256 | PUDS | PDDS | WADS |
| | 512, 512, 512 | 512, 512, 512 | PUDS | PDDS | WADS |
In Table 1, the first column shows the input size, though these networks can process images of arbitrary size. Every number in the table corresponds to a convolutional layer followed by Batch Normalization (BN) and a Rectified Linear Unit (ReLU). The number in the “encoder” column is the number of input channels of the convolution, while the number in the “decoder” column is the number of output channels. The encoder of U-Net and SegNet consists of 13 convolutional layers corresponding to the first 13 layers of VGG16bn. A final convolutional layer converts the decoder output into the predicted segmentation result.
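The encoder configuration in Table 1 can be turned into explicit (input, output) channel pairs as a quick consistency check. The sketch below assumes that each convolution's output channels equal the next convolution's input channels, that the last encoder convolution outputs 512 channels, and that all encoder kernels are 3×3 as in VGG16; these assumptions come from the standard VGG16 design rather than from the table itself.

```python
# Input channels of the encoder convolutions, one list per stage (from Table 1).
encoder_in = [
    [3, 64],
    [64, 128],
    [128, 256, 256],
    [256, 512, 512],
    [512, 512, 512],
]

flat = [c for stage in encoder_in for c in stage]

# Assumption: output channels of a conv = input channels of the next conv,
# and the final conv of the VGG16 encoder outputs 512 channels.
encoder_convs = list(zip(flat, flat[1:] + [512]))

# Assumption: 3x3 kernels with bias; BN parameters are not counted here.
conv_params = sum(c_in * c_out * 3 * 3 + c_out for c_in, c_out in encoder_convs)

print(len(encoder_convs))  # number of encoder convolutional layers
print(encoder_convs[:4])   # first two stages as (in, out) pairs
print(conv_params)         # convolution weights and biases in the encoder
```

The layer count agrees with the 13 convolutional layers stated in the text.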
WDeepLabv3+. DeepLabv3+ employs an unbalanced encoder-decoder architecture. The encoder applies an àtrous convolutional version of a CNN to alleviate the loss of detailed information caused by the common down-sampling operations, and an Àtrous Spatial Pyramid Pooling (ASPP) module to extract multiscale image representations. At the beginning of the decoder, the encoder feature map is directly up-sampled using bilinear interpolation with a factor of 4, and then concatenated with a low-level feature map transmitted from the encoder. DeepLabv3+ adopts a dual structure connecting its encoder and decoder, which differs from PDDS only in the up-sampling and suffers from the same drawbacks.
We design WDeepLabv3+, a wavelet integrated version of DeepLabv3+, by applying two wavelet versions of the dual structure. As Fig. 4 shows, the encoder applies a wavelet integrated àtrous CNN followed by the ASPP, which outputs the encoder feature map and two sets of high-frequency components. The encoder feature map is up-sampled by two IDWTs, which integrate the detail information contained in the high-frequency components, and the two IDWTs are connected by a convolutional block. The decoder then applies a few convolutions to refine the feature map, followed by a bilinear interpolation to recover the resolution.
We evaluate WaveSNets (WSegNet and WDeepLabv3+) on the CamVid, Pascal VOC, and Cityscapes image datasets, in terms of mean intersection over union (mIoU).
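For reference, mIoU and global accuracy can be computed from a confusion matrix as in the following sketch. The convention that rows index the ground-truth class and columns the predicted class, as well as the toy two-class matrix, are assumptions made for illustration; the evaluation code used for the reported results is not shown in the paper.

```python
def iou_per_class(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(conf):
    """Average IoU over the classes that occur in the ground truth or prediction."""
    valid = [v for v in iou_per_class(conf) if v == v]  # drop NaN entries
    return sum(valid) / len(valid)


def global_accuracy(conf):
    """Fraction of all pixels that are classified correctly."""
    total = sum(map(sum, conf))
    return sum(conf[i][i] for i in range(len(conf))) / total


# Toy example with two classes and 100 pixels.
conf = [[50, 10],
        [5, 35]]

print(iou_per_class(conf))    # IoU of class 0 is 50/65, IoU of class 1 is 35/50
print(mean_iou(conf))
print(global_accuracy(conf))  # 0.85
```

Because IoU penalises both false positives and false negatives per class, a network can have a higher mIoU but a slightly lower global accuracy than another one when it does better on small classes.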
CamVid. CamVid contains 701 road scene images (367, 101, and 233 for training, validation, and test, respectively) with size of , which was used in  to quantify SegNet performance. Using the training set, we train SegNet, U-Net, and WSegNet with various wavelets. The training is supervised by the cross-entropy loss. We employ an SGD solver, an initial learning rate of with the “poly” policy and , momentum of and weight decay of . With a mini-batch size of , we train the networks for 12K iterations (about 654 epochs). The input image size is . For every image, we adopt random resizing between and , random rotation between and degrees, random horizontal flipping, and cropping with size of . We apply a pre-trained VGG16bn model for the encoder initialization and initialize the decoder using the technique proposed in . For a fair comparison, we do not tune the above hyper-parameters.
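The “poly” learning rate policy and the relation between iterations and epochs can be sketched as follows. The base learning rate and power used below are placeholders, since the actual values do not appear in this text; the batch size of 20 is likewise an assumption, chosen only because it reproduces the stated figure of about 654 epochs for 12K iterations on 367 training images.

```python
def poly_lr(base_lr, iteration, max_iterations, power):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1 - iteration / max_iterations) ** power


def epochs(iterations, batch_size, num_train_images):
    """Number of passes over the training set for a given iteration budget."""
    return iterations * batch_size / num_train_images


# Placeholder hyper-parameters (hypothetical, not taken from the paper).
BASE_LR = 0.01
POWER = 0.9
MAX_ITERS = 12_000

for it in (0, 6_000, 12_000):
    print(it, poly_lr(BASE_LR, it, MAX_ITERS, POWER))

# Assumed batch size of 20 with the 367 CamVid training images.
print(round(epochs(MAX_ITERS, 20, 367)))  # 654
```

The learning rate starts at the base value and decays monotonically to zero at the last iteration.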
Table 2 shows the mIoU and global accuracy on the CamVid test set, together with the parameter numbers of U-Net, SegNet, and WSegNet with various wavelets. In the table, “dbx” represents the orthogonal Daubechies wavelet with approximation order x, and “chx.y” represents the biorthogonal Cohen wavelet with approximation orders (x, y). Their low-pass and high-pass filters can be found in . The length of these filters increases as the order increases. While Haar and Cohen wavelets are symmetric, Daubechies wavelets are not. The mIoU of WSegNet decreases from to as the asymmetric wavelet varies from “db2” to “db6”, while it varies from to as the symmetric wavelet varies from “ch2.2” to “ch5.5”. It seems that the performance varies much more among the asymmetric wavelets than among the symmetric ones. In the wavelet integrated deep networks, we truncate the DWT output so that every component is half the size of the input data. As a result, IDWT with an asymmetric wavelet cannot fully restore an image in the region near the image boundary, and the width of this region increases as the wavelet filter length increases. With a symmetric wavelet, however, one can fully restore the image based on a symmetric extension of the DWT output. Consequently, WSegNet with Cohen wavelets performs better than that with Daubechies wavelets near the image boundary. Fig. 5 shows an example image, its manual annotation, a region consisting of “building”, “road”, and “sidewalk”, and the segmentation results achieved using different wavelets. The region is located close to the image boundary and has been enlarged, with colored segmentation results, for better illustration. One can observe from the results of “db5” and “db6” that a long line of “sidewalk” pixels, located near the image boundary, is classified as “road”. In comparison, the results of “ch4.4” and “ch5.5” match well with the ground truth.
In , the authors train SegNet using an extended CamVid training set containing 3433 images, which achieved mIoU on the CamVid test set. We train SegNet, U-Net, and WSegNet using only the 367 CamVid training images. From Table 2, one can find that WSegNet achieves the best mIoU () using the Haar wavelet, and the best global accuracy () using the “db3” wavelet. WSegNet is significantly superior to SegNet in terms of mIoU and global accuracy, while they require the same number of parameters (). The segmentation performance of WSegNet is also better than that of U-Net, while it requires far fewer parameters than U-Net ().
Table 2 also lists the IoUs for the 11 categories in CamVid. Compared with U-Net and WSegNet, SegNet performs very poorly on fine objects, such as “pole”, “symbol”, and “fence”, because the max-pooling indices used in SegNet are not helpful for restoring the image details. While U-Net achieves comparable or even better IoUs on easily identifiable objects, such as “sky”, “building”, and “car”, it does not discriminate well between similar objects like “walker” and “bicyclist”, or “building” and “fence”. The feature maps of these two pairs of objects might look similar to the decoder of U-Net, as the data details are not separately provided. Table 3 shows the confusion matrices on these four categories. The proportion of “building” in the predicted “fence” decreases from to , as the network varies from SegNet to WSegNet, while that of “bicyclist” in the predicted “walker” decreases from to . These results suggest that WSegNet is more powerful than SegNet and U-Net in distinguishing similar objects.
Fig. 6 presents a visual example for various networks, which shows the example image, its manual annotation, a region consisting of “building”, “tree”, “sky”, and “pole”, and the segmentation results achieved using SegNet, U-Net, and WSegNet. The region is enlarged, with colored segmentation results, for better illustration. From the figure, one can find that WSegNet keeps the basic structure of “tree”, “pole”, and “building” and restores the object details, such as the “tree” branches and the “pole”. The segmentation result of WSegNet matches the image region much better than those of SegNet and U-Net, and even corrects the annotation noise about “building” and “tree” in the ground truth.
Pascal VOC and Cityscapes. The original Pascal VOC2012 semantic segmentation dataset contains 1464 and 1449 annotated images for training and validation, respectively, covering 20 object categories and one background class. The images have various sizes. We augment the training data to 10582 images using the extra annotations provided in . We train SegNet, U-Net, and WSegNet with various wavelets on the extended training set for 50 epochs with a batch size of 16. During training, we adopt random horizontal flipping, random Gaussian blurring, and cropping with size of . The other hyper-parameters are the same as those used in the CamVid training. Table 4 presents the results on the validation set for the trained networks.
Cityscapes contains 2975 and 500 high-quality annotated images for training and validation, respectively. The images have a size of . We train the networks on the training set for 80 epochs with a batch size of 8 and an initial learning rate of . During training, we adopt random horizontal flipping, random resizing between 0.5 and 2, and random cropping with size of . Table 4 presents the results on the validation set.
From Table 4, one can find that the segmentation performance of SegNet is significantly inferior to that of U-Net and WSegNet. While WSegNet achieves better mIoU ( for Pascal VOC and for Cityscapes) and requires fewer parameters (), the global accuracies of WSegNet on the two datasets are slightly lower than those of U-Net. This result suggests that, compared with U-Net, WSegNet can more precisely classify the pixels at the “critical” locations.
Taking ResNet101 as the backbone, we build DeepLabv3+ and WDeepLabv3+ with , and train them on Pascal VOC using the same training policy as that used for WSegNet. Table 5 shows the segmentation results on the validation set.
We achieve mIoU using DeepLabv3+ on the Pascal VOC validation set, which is comparable to that () obtained by the inventors of this network. With a small increase in the number of parameters (), the segmentation performance of WDeepLabv3+ with various wavelets is always better than that of DeepLabv3+, in terms of mIoU and global accuracy. Using the “ch5.5” wavelet, WDeepLabv3+ achieves the best performance, mIoU and global accuracy. From Table 5, one can find that the better performance of WDeepLabv3+ mainly results from its better segmentation of fine objects (“chair”, “table”, and “plant”) and similar objects (“cow”, “horse”, and “sheep”). The above results support the effectiveness of the DWT/IDWT layers in processing data details.
Fig. 7 shows four visual examples of segmentation results for DeepLabv3+ and WDeepLabv3+. The first and second columns present the original images and the segmentation ground truth, while the third and fourth columns show the segmentation results of DeepLabv3+ and WDeepLabv3+ with the “ch5.5” wavelet, respectively. We show the segmentation results as colored pictures for better illustration. For the example image in Fig. 7(a), DeepLabv3+ falsely classifies the pixels in some detail regions on the cow and her calves as “background” or “horse”, i.e., the regions for the cow’s left horn, the hind leg of the brown calf, and some fine regions on the two calves. While WDeepLabv3+ correctly classifies the horse and the sheep in Fig. 7(b) and Fig. 7(c), DeepLabv3+ classifies them as “cow” because of the similar object structures. In Fig. 7(d), the “table” segmented by WDeepLabv3+ is more complete than that segmented by DeepLabv3+. These results illustrate that WDeepLabv3+ performs better on detail regions and similar objects.
Our proposed general DWT and IDWT layers are applicable to various wavelets, and can be used to extract and process the data details during network inference. We design WaveSNets (WSegNet and WDeepLabv3+) by replacing the down-sampling with DWT and the up-sampling with IDWT in U-Net, SegNet, and DeepLabv3+. Experimental results on CamVid, Pascal VOC, and Cityscapes show that WaveSNets recover the image details well and perform better in segmenting similar objects than their vanilla versions.
References
- [1] (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on PAMI 39 (12), pp. 2481–2495.
- [2] (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97.
- [3] (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
- [4] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the ECCV, pp. 801–818.
- [5] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
- [6] (1992) Ten lectures on wavelets. Vol. 61, SIAM.
- [7] (2017) SAR image segmentation based on convolutional-wavelet neural network and Markov random field. Pattern Recognition 64, pp. 255–267.
- [8] (2015) The Pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136.
- [9] (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998.
- [10] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE ICCV, pp. 1026–1034.
- [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on CVPR, pp. 770–778.
- [12] (2019) PointRend: image segmentation as rendering. arXiv preprint arXiv:1912.08193.
- [13] (2019) Multi-level wavelet convolutional neural networks. IEEE Access 7, pp. 74973–74985.
- [14] (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on PAMI 11 (7), pp. 674–693.
- [15] (1928) Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers 47 (2), pp. 617–644.
- [16] (2017) Automatic differentiation in PyTorch.
- [17] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on MICCAI, pp. 234–241.
- [18] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [19] (2018) Wavelet pooling for convolutional neural networks. In International Conference on Learning Representations.
- [20] (2019) Photorealistic style transfer via wavelet transforms. arXiv preprint arXiv:1903.09760.
- [21] (2019) Making convolutional networks shift-invariant again. arXiv preprint arXiv:1904.11486.