1 Introduction
Due to the advantage of deep networks in extracting high-level features, deep learning has achieved strong performance in various tasks, particularly in computer vision. However, current deep networks are not good at extracting and processing data details. While deep networks with more layers are able to fit more functions, deeper networks are not always associated with better performance
[11]. An important reason is that the details are lost as the data flow through the layers. In particular, the lost data details significantly degrade the performance of deep networks for image segmentation. Various techniques, such as conditional random fields, à trous convolution [3, 4], and PointRend [12], have been introduced into deep networks to improve segmentation performance. However, these techniques do not explicitly process the data details.

Wavelets [6, 14], well known as the "mathematical microscope", are effective time-frequency analysis tools, which can be applied to decompose an image X into a low-frequency component containing the image's main information and high-frequency components containing its details (Fig. 1). In this paper, we rewrite the Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT) as general network layers, which are applicable to 1D/2D/3D data and various wavelets. One can flexibly design end-to-end architectures using them and directly process the data details in deep networks. We design wavelet integrated deep networks for image segmentation, termed WaveSNets, by replacing the downsampling with DWT and the upsampling with IDWT in UNet [17], SegNet [1], and DeepLabv3+ [4]. Using the Haar, Cohen, and Daubechies wavelets, we evaluate WaveSNets on the CamVid [2], Pascal VOC [8], and Cityscapes [5] datasets. The experimental results show that WaveSNets achieve better semantic segmentation performance than their vanilla versions, due to their effective segmentation of fine and similar objects. In summary:

We rewrite DWT and IDWT as general network layers, which are applicable to various wavelets and can be used to design end-to-end deep networks that process the data details during network inference;

We design WaveSNets using various network architectures, by replacing the downsampling with the DWT layer and the upsampling with the IDWT layer;

WaveSNets are evaluated on the CamVid, Pascal VOC, and Cityscapes datasets, and achieve better performance for semantic image segmentation.
2 Related works
2.1 Sampling operation
Downsampling operations, such as max-pooling, average-pooling, and strided convolution, are introduced into deep networks for local connectivity and weight sharing. These downsampling operations usually ignore the classic sampling theorem [15], which results in aliasing among the data components in different frequency intervals. As a result, the data details presented by the high-frequency components are totally lost, and random noise appearing in those components can be sampled into the low-resolution data. In addition, the basic object structures presented by the low-frequency component are broken. Fig. 1 shows a max-pooling example. In signal processing, low-pass filtering before downsampling is the standard method for anti-aliasing. Anti-aliased CNNs [21] integrate low-pass filtering with the downsampling in deep networks, which achieves increased classification accuracy and better shift-robustness. However, the filters used in anti-aliased CNNs are empirically designed based on the row vectors of Pascal's triangle, which are ad hoc, and no theoretical justification is given. As no upsampling operation, i.e., reconstruction, is available for the low-pass filter, the anti-aliased UNet [21] has to apply the same filtering after normal upsampling to achieve the anti-aliasing effect.

Upsampling operations, such as max-unpooling [1], deconvolution [17], and bilinear interpolation [3, 4], are widely used in deep networks for image-to-image translation tasks. These upsampling operations are usually applied to gradually recover the data resolution, but the data details cannot be recovered by them. Fig. 1 shows a max-unpooling example. The lost data details significantly degrade network performance on image-to-image tasks, such as image segmentation. Various techniques, including à trous convolution [3, 4] and PointRend [12], have been introduced into the design of deep networks to capture the fine details and improve segmentation performance. However, these techniques try to recover the data details from detail-unrelated information, so their ability to improve segmentation performance is limited.

2.2 Wavelet
Wavelets are powerful time-frequency analysis tools which have been widely used in signal analysis, image processing, and pattern recognition. A wavelet is usually associated with a scaling function and wavelet functions. The shifts and expansions of these functions compose a stable basis for the signal space, with which a signal can be decomposed and reconstructed. The functions are closely related to the low-pass and high-pass filters of the wavelet; in practice, these filters are applied for the data decomposition and reconstruction. As Fig. 1 shows, the 2D Discrete Wavelet Transform (DWT) decomposes an image $\mathbf{X}$ into its low-frequency component $\mathbf{X}_{ll}$ and three high-frequency components $\mathbf{X}_{lh}, \mathbf{X}_{hl}, \mathbf{X}_{hh}$. While $\mathbf{X}_{ll}$ is a low-resolution version of the image keeping its main information, the high-frequency components save its horizontal, vertical, and diagonal details, respectively. The 2D Inverse DWT (IDWT) can reconstruct the image from the DWT output.

Various wavelets, including orthogonal wavelets, biorthogonal wavelets, multiwavelets, ridgelets, curvelets, bandelets, and contourlets, have been designed, studied, and applied in signal processing, numerical analysis, pattern recognition, computer vision, quantum mechanics, etc. The à trous convolution used in DeepLab [3, 4] was originally developed in the wavelet theory. In deep learning, while wavelets are widely applied as data pre-processing or post-processing tools, wavelet transforms have also been introduced into the design of deep networks as substitutes for the sampling operations.
Multi-level Wavelet CNN (MWCNN) [13] integrates the Wavelet Packet Transform (WPT) into a deep network for image restoration. MWCNN concatenates the low-frequency and high-frequency components of the input feature map and processes them in a unified way. The details stored in the high-frequency components are largely wiped out by this processing mode, because the data amplitude in these components is significantly weaker than that in the low-frequency component. Convolutional-Wavelet Neural Network (CWNN) [7] applies the redundant dual-tree complex wavelet transform (DT-CWT) to suppress noise and keep the object structures for extracting robust features from SAR images. The architecture of CWNN contains only two convolutional layers. While CWNN discards the high-frequency components output by DT-CWT, it takes as its downsampling output the average of the multiple low-frequency components. The wavelet pooling proposed in [19] is designed using a two-level DWT; its backward propagation performs a one-level DWT with a two-level IDWT, which does not follow the mathematical principle of the gradient. Recently, the application of wavelet transforms in image style transfer [20] has been studied. In these works, the authors evaluate their methods using only one or two wavelets, because of the absence of general wavelet transform layers; the data details presented by the high-frequency components are abandoned or processed together with the low-frequency component, which limits the detail restoration in image-to-image translation tasks.

3 Our method
Our method replaces the sampling operations in deep networks with wavelet transforms. We first rewrite the Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT) as general network layers. Although the following analysis is mainly for orthogonal wavelets and 1D data, it generalizes to other wavelets and 2D/3D data with slight changes.
3.1 Wavelet transform
For a given 1D data $\mathbf{x}$, DWT decomposes it, using the two filters of a 1D orthogonal wavelet, i.e., the low-pass filter $\mathbf{l}$ and the high-pass filter $\mathbf{h}$, into its low-frequency component $\mathbf{x}_l$ and high-frequency component $\mathbf{x}_h$, where

$$\mathbf{x}_l = (\mathbf{l} * \mathbf{x}) \downarrow 2, \qquad \mathbf{x}_h = (\mathbf{h} * \mathbf{x}) \downarrow 2, \quad (1)$$

and $*$, $\downarrow 2$ denote the convolution and naive downsampling, respectively. In theory, the length of every component is half of that of $\mathbf{x}$, i.e.,

$$|\mathbf{x}_l| = |\mathbf{x}_h| = \frac{1}{2}|\mathbf{x}|. \quad (2)$$

Therefore, the length of $\mathbf{x}$ is usually an even number.
IDWT reconstructs the original data $\mathbf{x}$ based on the two components,

$$\mathbf{x} = \mathbf{l} * (\mathbf{x}_l \uparrow 2) + \mathbf{h} * (\mathbf{x}_h \uparrow 2), \quad (3)$$

where $\uparrow 2$ denotes the naive upsampling operation.
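As a concrete illustration of Eqs. (1)–(3), the following numpy sketch implements the one-level 1D transform for the Haar wavelet, whose length-2 filters reduce the convolutions to pairwise sums and differences. The helper names `dwt1d` and `idwt1d` are ours, not the released layers.

```python
import numpy as np

def dwt1d(x):
    """Eq. (1) for the Haar wavelet: apply the low-/high-pass filters
    l = (1, 1)/sqrt(2), h = (1, -1)/sqrt(2), then downsample by 2."""
    x = np.asarray(x, dtype=float)
    xl = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # low-frequency component
    xh = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # high-frequency component
    return xl, xh

def idwt1d(xl, xh):
    """Eq. (3): upsample both components and filter to rebuild x exactly."""
    x = np.empty(2 * len(xl))
    x[0::2] = (xl + xh) / np.sqrt(2.0)
    x[1::2] = (xl - xh) / np.sqrt(2.0)
    return x
```

Each component has half the length of the input, matching Eq. (2), and `idwt1d(*dwt1d(x))` reproduces `x` up to floating-point error.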
For high-dimensional data, the high-dimensional DWT decomposes it into one low-frequency component and multiple high-frequency components, while the corresponding IDWT reconstructs the original data from the DWT output. For example, for a given 2D data $\mathbf{X} \in \mathbb{R}^{M \times N}$, 2D DWT decomposes it into four components,

$$\mathbf{X}_{ll} = (\mathbf{f}_{ll} * \mathbf{X}) \downarrow (2,2), \quad \mathbf{X}_{lh} = (\mathbf{f}_{lh} * \mathbf{X}) \downarrow (2,2), \quad \mathbf{X}_{hl} = (\mathbf{f}_{hl} * \mathbf{X}) \downarrow (2,2), \quad \mathbf{X}_{hh} = (\mathbf{f}_{hh} * \mathbf{X}) \downarrow (2,2), \quad (4)$$

where $\mathbf{f}_{ll}$ is the low-pass filter and $\mathbf{f}_{lh}, \mathbf{f}_{hl}, \mathbf{f}_{hh}$ are the high-pass filters of the 2D orthogonal wavelet, and $\downarrow (2,2)$ denotes downsampling by a factor of two along both dimensions. $\mathbf{X}_{ll}$ is the low-frequency component of the original data, a low-resolution version containing the data's main information; $\mathbf{X}_{lh}, \mathbf{X}_{hl}, \mathbf{X}_{hh}$ are three high-frequency components which store the vertical, horizontal, and diagonal details of the data. 2D IDWT reconstructs the original data from these components,

$$\mathbf{X} = \mathbf{f}_{ll} * (\mathbf{X}_{ll} \uparrow (2,2)) + \mathbf{f}_{lh} * (\mathbf{X}_{lh} \uparrow (2,2)) + \mathbf{f}_{hl} * (\mathbf{X}_{hl} \uparrow (2,2)) + \mathbf{f}_{hh} * (\mathbf{X}_{hh} \uparrow (2,2)). \quad (5)$$

Similarly, the size of every component is half that of the original 2D data along each of the two dimensions, i.e.,

$$\mathbf{X}_{ll}, \mathbf{X}_{lh}, \mathbf{X}_{hl}, \mathbf{X}_{hh} \in \mathbb{R}^{\frac{M}{2} \times \frac{N}{2}}. \quad (6)$$

Therefore, $M, N$ are usually even numbers.
Generally, the filters of a high-dimensional wavelet are tensor products of the two filters of a 1D wavelet. For a 2D wavelet, the four filters can be designed from

$$\mathbf{f}_{ll} = \mathbf{l} \otimes \mathbf{l}, \quad \mathbf{f}_{lh} = \mathbf{l} \otimes \mathbf{h}, \quad \mathbf{f}_{hl} = \mathbf{h} \otimes \mathbf{l}, \quad \mathbf{f}_{hh} = \mathbf{h} \otimes \mathbf{h}, \quad (7)$$

where $\otimes$ is the tensor product operation. For example, the low-pass and high-pass filters of the 1D Haar wavelet are

$$\mathbf{l} = \frac{1}{\sqrt{2}}(1, 1)^T, \qquad \mathbf{h} = \frac{1}{\sqrt{2}}(1, -1)^T. \quad (8)$$

Then, the filters of the corresponding 2D Haar wavelet are

$$\mathbf{f}_{ll} = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad (9)$$

$$\mathbf{f}_{lh} = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix}, \quad (10)$$

$$\mathbf{f}_{hl} = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix}, \quad (11)$$

$$\mathbf{f}_{hh} = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}. \quad (12)$$
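The tensor-product construction of Eqs. (7)–(12) can be checked numerically. The sketch below builds the four 2D Haar filters as outer products of the 1D filters and then applies Eq. (4); because the Haar filters have length 2, the filtered-and-downsampled result is one weighted sum per non-overlapping 2×2 block. The function names are illustrative only.

```python
import numpy as np

l = np.array([1.0, 1.0]) / np.sqrt(2.0)   # 1D low-pass filter, Eq. (8)
h = np.array([1.0, -1.0]) / np.sqrt(2.0)  # 1D high-pass filter, Eq. (8)

# Eq. (7): the four 2D filters as tensor (outer) products, Eqs. (9)-(12).
f_ll, f_lh, f_hl, f_hh = (np.outer(a, b) for a in (l, h) for b in (l, h))

def dwt2d_haar(X):
    """Eq. (4): filter X with each 2D Haar filter and downsample by two
    along both dimensions, giving four components of half size (Eq. (6))."""
    X = np.asarray(X, dtype=float)
    blocks = X.reshape(X.shape[0] // 2, 2, X.shape[1] // 2, 2)
    return tuple(np.einsum("iajb,ab->ij", blocks, f)
                 for f in (f_ll, f_lh, f_hl, f_hh))
```

For a constant image the three high-frequency components vanish, while the low-frequency component is a scaled low-resolution copy.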
Eqs. (1)–(12) present the forward propagations for 1D/2D DWT and IDWT. It is onerous to deduce the gradients for the backward propagations from these equations. Fortunately, the modern deep learning framework PyTorch [16] can automatically deduce the gradients of common tensor operations. We have rewritten 1D/2D DWT and IDWT as network layers in PyTorch, which will be made publicly available for other researchers. In these layers, we perform DWT and IDWT channel by channel for multi-channel data.

3.2 WaveSNet
We design wavelet integrated deep networks for image segmentation (WaveSNets), by replacing the downsampling operations with 2D DWT and the upsampling operations with 2D IDWT. In this paper, we take UNet, SegNet, and DeepLabv3+ as the basic architectures.
WSegNets SegNet and UNet share a similar symmetric encoder-decoder architecture, but differ in their sampling operations. We name the pair of connected downsampling and upsampling operations, together with the associated convolutional blocks, a dual structure, where the convolutional blocks process feature maps of the same size. Fig. 2(a) and Fig. 2(b) show the dual structures used in UNet and SegNet, named PDDS (Pooling-Deconvolution Dual Structure) and PUDS (Pooling-Unpooling Dual Structure), respectively. UNet and SegNet consist of multiple nested dual structures. While both apply max-pooling for downsampling, PDDS and PUDS use deconvolution and max-unpooling for upsampling, respectively. As Fig. 2(a) shows, PDDS copies and transmits the feature maps from the encoder to the decoder, concatenating them with the upsampled features and extracting detailed information for restoring the object boundaries. However, the data tensor injected into the decoder might contain redundant information, which interferes with the segmentation results and introduces more convolutional parameters. PUDS transmits the pooling indices via the branch path to upgrade the feature map resolution in the decoder. As Fig. 1 shows, the lost data details cannot be restored from the pooling indices.
To overcome the weaknesses of PDDS and PUDS, we adopt DWT for downsampling and IDWT for upsampling, and design WADS (WAvelet Dual Structure, Fig. 3). During its downsampling, WADS decomposes the feature map into low-frequency and high-frequency components. WADS then injects the low-frequency component into the following layers of the deep network to extract high-level features, and transmits the high-frequency components to the upsampling layer, where IDWT recovers the feature map resolution. IDWT also restores the data details from the high-frequency components during upsampling. We design wavelet integrated encoder-decoder networks using WADS, termed WSegNets, for image segmentation.
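A minimal PyTorch sketch of the idea behind WADS, using the Haar wavelet: DWT halves the resolution and splits off the high-frequency components, a convolutional block (here a placeholder) processes only the low-frequency part, and IDWT restores the resolution together with the details. This illustrates the structure described above and is not the authors' released implementation; `haar_dwt`, `haar_idwt`, and `WADS` are hypothetical names.

```python
import torch

def haar_dwt(x):
    """One-level 2D Haar DWT of a (N, C, H, W) tensor, channel by channel."""
    p, q = x[..., 0::2, :], x[..., 1::2, :]   # even/odd rows
    u, v = p + q, p - q
    ll = (u[..., 0::2] + u[..., 1::2]) / 2    # low-frequency component
    lh = (u[..., 0::2] - u[..., 1::2]) / 2    # high-frequency components
    hl = (v[..., 0::2] + v[..., 1::2]) / 2
    hh = (v[..., 0::2] - v[..., 1::2]) / 2
    return ll, (lh, hl, hh)

def haar_idwt(ll, highs):
    """Inverse transform: exactly undoes haar_dwt."""
    lh, hl, hh = highs
    n, c, h2, w2 = ll.shape
    u = torch.empty(n, c, h2, 2 * w2)
    v = torch.empty(n, c, h2, 2 * w2)
    u[..., 0::2], u[..., 1::2] = ll + lh, ll - lh
    v[..., 0::2], v[..., 1::2] = hl + hh, hl - hh
    x = torch.empty(n, c, 2 * h2, 2 * w2)
    x[..., 0::2, :], x[..., 1::2, :] = (u + v) / 2, (u - v) / 2
    return x

class WADS(torch.nn.Module):
    """Wavelet dual structure: DWT down, convolutions on x_ll, IDWT up."""
    def __init__(self, body):
        super().__init__()
        self.body = body  # the convolutional block(s) between DWT and IDWT

    def forward(self, x):
        ll, highs = haar_dwt(x)       # downsampling; details are kept aside
        ll = self.body(ll)            # deeper layers see the low-pass part
        return haar_idwt(ll, highs)   # upsampling restores the details
```

With `body = torch.nn.Identity()`, the structure is lossless (the output equals the input), which is exactly the property the pooling-based dual structures lack.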
Table 1 illustrates the configuration of WSegNets, together with that of UNet and SegNet. In this paper, the UNet consists of eight more convolutional layers than the original one [17].
Table 1. Configurations of SegNet, UNet, and WSegNet. The middle columns give the numbers of channels.

| data size | encoder | decoder | SegNet | UNet | WSegNet |
|---|---|---|---|---|---|
| | 3, 64 | 64, 64 | PUDS | PDDS | WADS |
| | 64, 128 | 128, 64 | PUDS | PDDS | WADS |
| | 128, 256, 256 | 256, 256, 128 | PUDS | PDDS | WADS |
| | 256, 512, 512 | 512, 512, 256 | PUDS | PDDS | WADS |
| | 512, 512, 512 | 512, 512, 512 | PUDS | PDDS | WADS |
In Table 1, the first column shows the input size, though these networks can process images of arbitrary size. Every number in the table corresponds to a convolutional layer followed by Batch Normalization (BN) and a Rectified Linear Unit (ReLU). While the numbers in the column "encoder" are the numbers of input channels of the convolutions, the numbers in the column "decoder" are the numbers of output channels. The encoders of UNet and SegNet consist of 13 convolutional layers corresponding to the first 13 layers of VGG16bn [18]. A final convolutional layer converts the decoder output into the predicted segmentation result.

WDeepLabv3+ DeepLabv3+ [4] employs an unbalanced encoder-decoder architecture. The encoder applies an à trous convolutional version of a CNN to alleviate the loss of detailed information caused by common downsampling operations, and an À trous Spatial Pyramid Pooling (ASPP) module to extract multi-scale image representations. At the beginning of the decoder, the encoder feature map is directly upsampled using bilinear interpolation with a factor of 4, and then concatenated with a low-level feature map transmitted from the encoder. DeepLabv3+ thus adopts a dual structure connecting its encoder and decoder, which differs from PDDS only in the upsampling and suffers from the same drawbacks.

We design WDeepLabv3+, a wavelet integrated version of DeepLabv3+, by applying two wavelet versions of dual structures. As Fig. 4 shows, the encoder applies a wavelet integrated à trous CNN followed by the ASPP, which outputs the encoder feature map and two sets of high-frequency components. The encoder feature map is upsampled by two IDWTs, integrating the detail information contained in the high-frequency components, with a convolutional block connecting the two IDWTs. The decoder then applies a few convolutions to refine the feature map, followed by a bilinear interpolation to recover the resolution.
4 Experiment
We evaluate the WaveSNets (WSegNet and WDeepLabv3+) on the CamVid, Pascal VOC, and Cityscapes datasets, in terms of mean intersection over union (mIoU).
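For reference, mIoU averages the per-class intersection over union between the predicted and ground-truth label maps. A minimal numpy sketch follows; the function name and the `ignore_index` convention for unlabeled pixels are our assumptions, not specified by the paper.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection over union of two integer label maps."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index          # drop unlabeled pixels
    pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```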
4.1 WSegNet
CamVid CamVid contains 701 road scene images (367, 101, and 233 for training, validation, and test, respectively), and was used in [1] to quantify SegNet performance. Using the training set, we train SegNet, UNet, and WSegNet with various wavelets. The trainings are supervised by the cross-entropy loss. We employ an SGD solver with the "poly" learning-rate policy, momentum, and weight decay, and train the networks for 12K iterations (about 654 epochs). For every image, we adopt random resizing, random rotation, random horizontal flipping, and random cropping. We apply a pretrained VGG16bn model to initialize the encoder, and initialize the decoder using the technique proposed in [10]. We do not tune the above hyper-parameters, for a fair comparison.

Table 2. Results on the CamVid test set. The columns from "haar" to "db6" are WSegNet with the corresponding wavelet.

| class | SegNet | UNet | haar | ch2.2 | ch3.3 | ch4.4 | ch5.5 | db2 | db3 | db4 | db5 | db6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sky | 90.50 | 91.52 | 91.31 | 91.38 | 91.29 | 91.39 | 91.24 | 91.35 | 91.48 | 90.99 | 90.89 | 90.63 |
| building | 77.80 | 80.28 | 79.90 | 79.27 | 78.82 | 79.37 | 78.65 | 79.48 | 79.60 | 78.52 | 78.58 | 77.84 |
| pole | 9.14 | 27.97 | 27.99 | 29.38 | 27.90 | 28.96 | 27.26 | 27.91 | 28.38 | 28.04 | 26.66 | 25.96 |
| road | 92.69 | 93.71 | 93.69 | 93.47 | 93.47 | 93.77 | 93.78 | 93.72 | 93.91 | 92.57 | 92.12 | 92.11 |
| sidewalk | 78.05 | 80.05 | 80.33 | 79.44 | 79.34 | 79.89 | 80.08 | 79.67 | 80.58 | 76.95 | 75.62 | 76.65 |
| tree | 72.40 | 73.51 | 73.34 | 73.27 | 73.21 | 73.07 | 71.60 | 73.61 | 73.68 | 72.90 | 72.28 | 71.92 |
| symbol | 37.61 | 43.07 | 44.44 | 42.68 | 40.42 | 43.57 | 42.33 | 44.72 | 44.01 | 41.06 | 41.72 | 39.69 |
| fence | 19.92 | 27.50 | 32.59 | 24.62 | 25.59 | 28.40 | 28.85 | 25.52 | 29.60 | 26.90 | 24.15 | 29.00 |
| car | 79.31 | 85.04 | 83.21 | 84.43 | 82.63 | 84.57 | 84.14 | 84.30 | 83.97 | 81.92 | 81.07 | 78.38 |
| walker | 39.93 | 50.35 | 49.35 | 50.98 | 50.52 | 50.43 | 49.09 | 50.15 | 49.39 | 47.69 | 48.02 | 43.17 |
| bicyclist | 39.48 | 44.45 | 50.38 | 47.94 | 48.69 | 47.93 | 52.64 | 51.15 | 47.73 | 46.08 | 43.53 | 38.96 |
| mIoU | 57.89 | 63.40 | 64.23 | 63.35 | 62.90 | 63.76 | 63.61 | 63.78 | 63.85 | 62.15 | 61.33 | 60.39 |
| gl. acc. | 91.04 | 92.19 | 92.08 | 92.04 | 91.98 | 92.10 | 91.73 | 92.06 | 92.40 | 91.60 | 91.30 | 91.14 |

Parameters (×10⁶): SegNet 29.52, UNet 37.42, WSegNet 29.52.
Table 2 shows the mIoU and global accuracy on the CamVid test set, together with the parameter numbers of UNet, SegNet, and WSegNet with various wavelets. In the table, "dbx" represents the orthogonal Daubechies wavelet with approximation order x, and "chx.y" represents the biorthogonal Cohen wavelet with approximation orders x and y. Their low-pass and high-pass filters can be found in [6]. The length of these filters increases as the order increases. While the Haar and Cohen wavelets are symmetric, the Daubechies wavelets are not. The mIoU of WSegNet decreases from 63.78 to 60.39 as the asymmetric wavelet varies from "db2" to "db6", while it only varies between 63.35 and 63.61 as the symmetric wavelet varies from "ch2.2" to "ch5.5". The performance appears to vary much more across the asymmetric wavelets than across the symmetric ones. In the wavelet integrated deep networks, we truncate the DWT output to make the components half the size of the input data. As a result, IDWT with an asymmetric wavelet cannot fully restore an image in the region near the image boundary, and the width of this region increases with the length of the wavelet filter. With a symmetric wavelet, however, one can fully restore the image based on symmetric extension of the DWT output. Consequently, WSegNet with Cohen wavelets performs better near the image boundary than with Daubechies ones.

Fig. 5 shows an example image, its manual annotation, a region consisting of "building", "road", and "sidewalk", and the segmentation results achieved using different wavelets. The region is located close to the image boundary and has been enlarged, with colored segmentation results, for better illustration. One can observe from the results of "db5" and "db6" that a long line of "sidewalk" pixels located near the image boundary are classified as "road". In comparison, the results of "ch4.4" and "ch5.5" match the ground truth well.
In [1], the authors train SegNet using an extended CamVid training set containing 3433 images to achieve their reported mIoU on the CamVid test set; we train SegNet, UNet, and WSegNet using only the 367 CamVid training images. From Table 2, one can find that WSegNet achieves the best mIoU (64.23) using the Haar wavelet, and the best global accuracy (92.40) using the "db3" wavelet. WSegNet is significantly superior to SegNet in terms of mIoU and global accuracy, while they require the same amount of parameters (29.52 × 10⁶). The segmentation performance of WSegNet is also better than that of UNet, while it requires far fewer parameters than UNet (37.42 × 10⁶).
Table 2 also lists the IoUs for the 11 categories in CamVid. Compared with UNet and WSegNet, SegNet performs very poorly on the fine objects, such as "pole", "symbol", and "fence", as the max-pooling indices used in SegNet are not helpful for restoring the image details. While UNet achieves comparable or even better IoUs on the easily identifiable objects, such as "sky", "building", and "car", it does not discriminate well between similar objects, such as "walker" and "bicyclist", or "building" and "fence". The feature maps of these two pairs of objects might look similar to the decoder of UNet, as the data details are not separately provided. Table 3 shows the confusion matrices for these four categories. The proportion of "building" in the predicted "fence" decreases from 34.11 to 30.16 as the network varies from SegNet to WSegNet, while that of "bicyclist" in the predicted "walker" decreases from 2.54 to 1.90. These results suggest that WSegNet is more powerful than SegNet and UNet in distinguishing similar objects.
Table 3. Confusion matrices for the similar categories.

| | SegNet | | UNet | | WSegNet (haar) | |
|---|---|---|---|---|---|---|
| | building | fence | building | fence | building | fence |
| building | 88.01 | 1.30 | 89.91 | 1.19 | 90.22 | 1.08 |
| fence | 34.11 | 29.12 | 30.64 | 40.16 | 30.16 | 44.65 |
| | walker | bicyclist | walker | bicyclist | walker | bicyclist |
| walker | 53.30 | 2.54 | 66.83 | 1.95 | 68.74 | 1.90 |
| bicyclist | 13.69 | 49.66 | 16.73 | 51.45 | 12.52 | 59.80 |
Fig. 6 presents a visual comparison of the networks, showing an example image, its manual annotation, a region consisting of "building", "tree", "sky", and "pole", and the segmentation results achieved using SegNet, UNet, and WSegNet. The region is enlarged, with colored segmentation results, for better illustration. From the figure, one can find that WSegNet keeps the basic structure of "tree", "pole", and "building" and restores the object details, such as the "tree" branches and the "pole". The segmentation result of WSegNet matches the image region much better than those of SegNet and UNet, and even corrects the annotation noise about "building" and "tree" in the ground truth.
Table 4. Results on the Pascal VOC and Cityscapes validation sets. The columns from "haar" to "db2" are WSegNet with the corresponding wavelet.

| | | SegNet | UNet | haar | ch2.2 | ch3.3 | ch4.4 | ch5.5 | db2 |
|---|---|---|---|---|---|---|---|---|---|
| Pascal VOC | mIoU | 61.33 | 63.64 | 63.95 | 63.50 | 63.46 | 63.48 | 63.62 | 63.48 |
| | gl. acc. | 90.14 | 90.82 | 90.79 | 90.72 | 90.72 | 90.73 | 90.76 | 90.75 |
| Cityscapes | mIoU | 65.75 | 70.05 | 70.09 | 69.86 | 69.73 | 70.13 | 70.63 | 70.37 |
| | gl. acc. | 94.65 | 95.21 | 95.20 | 95.18 | 95.09 | 95.17 | 95.15 | 95.20 |

Parameters (×10⁶): SegNet 29.52, UNet 37.42, WSegNet 29.52.
Pascal VOC and Cityscapes The original Pascal VOC2012 semantic segmentation dataset contains 1464 and 1449 annotated images for training and validation, respectively, covering 20 object categories and one background class. The images have various sizes. We augment the training data to 10582 images using the extra annotations provided in [9]. We train SegNet, UNet, and WSegNet with various wavelets on the extended training set for 50 epochs with a batch size of 16. During training, we adopt random horizontal flipping, random Gaussian blurring, and random cropping. The other hyper-parameters are the same as those used in the CamVid training. Table 4 presents the results on the validation set for the trained networks.

Cityscapes contains 2975 and 500 high-quality annotated images for training and validation, respectively, all of the same size. We train the networks on the training set for 80 epochs with a batch size of 8. During training, we adopt random horizontal flipping, random resizing between 0.5 and 2, and random cropping. Table 4 presents the results on the validation set.
From Table 4, one can find that the segmentation performance of SegNet is significantly inferior to that of UNet and WSegNet. While WSegNet achieves better mIoU (63.95 for Pascal VOC and 70.63 for Cityscapes) and requires fewer parameters (29.52 × 10⁶), its global accuracies on the two datasets are a little lower than those of UNet. This result suggests that, compared with UNet, WSegNet classifies the pixels at the "critical" locations more precisely.
4.2 WDeepLabv3+
Table 5. Per-class IoU on the Pascal VOC validation set. The columns from "haar" to "db2" are WDeepLabv3+ with the corresponding wavelet.

| class | DeepLabv3+ | haar | ch2.2 | ch3.3 | ch4.4 | ch5.5 | db2 |
|---|---|---|---|---|---|---|---|
| background | 93.83 | 93.82 | 93.85 | 93.94 | 93.91 | 93.86 | 93.87 |
| aeroplane | 92.29 | 93.14 | 92.50 | 91.56 | 93.21 | 92.73 | 92.41 |
| bike | 41.42 | 43.08 | 42.19 | 42.21 | 43.42 | 42.59 | 42.84 |
| bird | 91.47 | 90.47 | 91.60 | 90.34 | 91.24 | 90.81 | 90.73 |
| boat | 75.39 | 75.47 | 72.04 | 75.19 | 74.20 | 72.40 | 72.90 |
| bottle | 82.05 | 80.18 | 81.14 | 82.12 | 79.55 | 82.23 | 81.89 |
| bus | 93.64 | 93.25 | 93.25 | 93.20 | 93.07 | 93.52 | 93.31 |
| car | 89.30 | 90.36 | 90.00 | 90.67 | 88.79 | 86.31 | 87.11 |
| cat | 93.69 | 92.80 | 93.31 | 92.62 | 93.56 | 93.84 | 93.97 |
| chair | 38.28 | 40.80 | 40.75 | 39.32 | 39.79 | 43.27 | 41.60 |
| cow | 86.60 | 89.72 | 89.04 | 90.49 | 88.42 | 92.04 | 88.17 |
| table | 61.37 | 63.24 | 62.67 | 65.49 | 64.93 | 67.31 | 65.58 |
| dog | 91.16 | 90.24 | 91.04 | 89.54 | 89.97 | 90.38 | 90.65 |
| horse | 86.60 | 88.86 | 88.88 | 90.23 | 89.00 | 91.02 | 89.19 |
| motorbike | 88.47 | 87.94 | 87.89 | 88.61 | 88.36 | 87.30 | 87.58 |
| person | 86.71 | 86.88 | 86.61 | 86.35 | 86.59 | 86.32 | 86.96 |
| plant | 64.48 | 64.33 | 64.69 | 68.45 | 65.50 | 68.01 | 65.41 |
| sheep | 83.04 | 87.45 | 86.50 | 87.43 | 85.70 | 88.14 | 85.95 |
| sofa | 49.24 | 47.43 | 48.06 | 49.85 | 46.74 | 51.94 | 50.19 |
| train | 85.49 | 84.18 | 83.88 | 85.47 | 83.88 | 85.16 | 86.69 |
| monitor | 77.80 | 76.55 | 78.31 | 79.04 | 78.32 | 78.78 | 79.11 |
| mIoU | 78.68 | 79.06 | 78.96 | 79.62 | 78.96 | 79.90 | 79.34 |
| gl. acc. | 94.64 | 94.69 | 94.68 | 94.75 | 94.68 | 94.77 | 94.75 |

Parameters (×10⁶): DeepLabv3+ 59.34, WDeepLabv3+ 60.22.
Taking ResNet101 as the backbone, we build DeepLabv3+ and WDeepLabv3+, and train them on Pascal VOC using the same training policy as used in the training of WSegNet. Table 5 shows the segmentation results on the validation set.
We achieve 78.68 mIoU using DeepLabv3+ on the Pascal VOC validation set, which is comparable to the result obtained by the inventors of the network [4]. With a small increase in parameters (from 59.34 × 10⁶ to 60.22 × 10⁶), the segmentation performance of WDeepLabv3+ with various wavelets is consistently better than that of DeepLabv3+, in terms of both mIoU and global accuracy. Using the "ch5.5" wavelet, WDeepLabv3+ achieves the best performance, with 79.90 mIoU and 94.77 global accuracy. From Table 5, one can find that the better performance of WDeepLabv3+ mainly results from its better segmentation of the fine objects ("chair", "table", and "plant") and similar objects ("cow", "horse", and "sheep"). These results demonstrate the effectiveness of the DWT/IDWT layers in processing data details.
Fig. 7 shows four visual examples of segmentation results for DeepLabv3+ and WDeepLabv3+. The first and second columns present the original images and the segmentation ground truth, while the third and fourth columns show the segmentation results of DeepLabv3+ and of WDeepLabv3+ with the "ch5.5" wavelet, respectively. We show the segmentation results as colored pictures for better illustration. For the example image in Fig. 7(a), DeepLabv3+ falsely classifies the pixels in some detail regions of the cow and her calves as "background" or "horse", i.e., the regions of the cow's left horn, the hind leg of the brown calf, and some fine regions on the two calves. While WDeepLabv3+ correctly classifies the horse and the sheep in Fig. 7(b) and Fig. 7(c), DeepLabv3+ classifies them as "cow" because of the similar object structures. In Fig. 7(d), the "table" segmented by WDeepLabv3+ is more complete than that segmented by DeepLabv3+. These results illustrate that WDeepLabv3+ performs better on detail regions and similar objects.
5 Conclusion
Our proposed general DWT and IDWT layers are applicable to various wavelets and can be used to extract and process the data details during network inference. We design WaveSNets (WSegNet and WDeepLabv3+) by replacing the downsampling with DWT and the upsampling with IDWT in UNet, SegNet, and DeepLabv3+. Experimental results on CamVid, Pascal VOC, and Cityscapes show that WaveSNets recover the image details well and segment similar objects better than their vanilla versions.
References
[1] (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on PAMI 39(12), pp. 2481–2495.
[2] (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30(2), pp. 88–97.
[3] (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
[4] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the ECCV, pp. 801–818.
[5] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
[6] (1992) Ten Lectures on Wavelets. Vol. 61, SIAM.
[7] (2017) SAR image segmentation based on convolutional-wavelet neural network and Markov random field. Pattern Recognition 64, pp. 255–267.
[8] (2015) The Pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111(1), pp. 98–136.
[9] (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998.
[10] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE ICCV, pp. 1026–1034.
[11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on CVPR, pp. 770–778.
[12] (2019) PointRend: image segmentation as rendering. arXiv preprint arXiv:1912.08193.
[13] (2019) Multi-level wavelet convolutional neural networks. IEEE Access 7, pp. 74973–74985.
[14] (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on PAMI 11(7), pp. 674–693.
[15] (1928) Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers 47(2), pp. 617–644.
[16] (2017) Automatic differentiation in PyTorch.
[17] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on MICCAI, pp. 234–241.
[18] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[19] (2018) Wavelet pooling for convolutional neural networks. In International Conference on Learning Representations.
[20] (2019) Photorealistic style transfer via wavelet transforms. arXiv preprint arXiv:1903.09760.
[21] (2019) Making convolutional networks shift-invariant again. arXiv preprint arXiv:1904.11486.