I Introduction
Dense pixelwise prediction tasks such as semantic segmentation, optical flow or depth estimation remain open challenges in computer vision. They attract rapidly growing interest for applications such as autonomous driving, robotic vision and image scene understanding. Following their remarkable success in image recognition [1], deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in dense prediction tasks such as semantic segmentation [2, 3, 4] or single-image depth estimation [5]. Many dense prediction tasks consist of two concurrent goals: classification and localization. Classification is well tackled by an end-to-end trainable CNN architecture, e.g. VGGNet [6] or ResNet [7], which typically stacks multiple layers of successive convolution, nonlinear activation, and pooling. A typical pooling step, which performs either a subsampling or some strided averaging on an input volume, is favorable for the invariance of prediction results to small spatial translations in the input data, as well as for a boost in computational efficiency via dimension reduction. Its downside, however, is the loss of resolution in the output feature maps, which renders high-quality pixelwise prediction challenging.
Several remedies for this dilemma have been proposed in the literature. As suggested in [8, 9], one may mirror the encoder network with a decoder network. Each upsampling (or unpooling) layer in the decoder network is introduced in symmetry to a corresponding pooling layer in the encoder network, and is then followed by trainable convolutional layers. Alternatively, one may use dilated (also known as atrous) convolutions in a CNN encoder as proposed in [10, 11, 12]. This enables the CNN to expand the receptive fields of pixels as convolutional layers stack up without losing resolution in the feature maps, however at the cost of significant computational time and memory. Another alternative is to combine a CNN low-resolution classifier with a conditional random field (CRF) [13, 14], either as a standalone post-processing step [11, 12] or combined with a CNN in an end-to-end trainable architecture [15, 16]. The latter also comes with an increased runtime demand for training and inference.

Motivated by the close analogy between pooling (resp. unpooling) in an encoder-decoder CNN and decomposition (resp. reconstruction) in multi-resolution wavelet analysis, this paper proposes a new class of CNNs with wavelet unpooling and wavelet pyramids, which we name WCNN. The first contribution of WCNN is to achieve unpooling with the inverse discrete wavelet transform (iDWT). To this end, the DWT is applied in the encoder to decompose feature maps into frequency bands. The high-frequency components are skip-connected to the decoder to perform the iDWT jointly with the coarse-resolution feature maps. Wavelet unpooling does not require any additional parameters over baseline CNNs; the only overhead is the memory to cache the frequency coefficients from the encoder. The second contribution of WCNN is two wavelet-based pyramid variants that bridge the standard encoder and decoder. The wavelet pyramids obtain global context from a receptive field covering the entire image by exploiting the multi-resolution DWT/iDWT. Experiments on the Cityscapes dataset show that the proposed WCNN yields systematic improvements in dense prediction accuracy.
II Related Work
Many challenging tasks in computer vision, such as single-image depth prediction or semantic image segmentation, require models for dense prediction, since they involve either regressing quantities pixelwise or classifying individual pixels. Most current state-of-the-art methods for dense prediction tasks are based on end-to-end trainable deep learning architectures. Early methods segment the image into regions such as superpixels in a bottom-up fashion, and predictions for the regions are determined based on deep neural network features [17, 18, 19]. The use of image-based bottom-up regions supports adherence of the dense predictions to boundaries in the image.

Aiming at end-to-end CNNs, Long et al. [20] propose a fully convolutional network (FCN) architecture for semantic image segmentation which successively convolves and pools feature maps with an increasing number of feature channels. FCNs employ the transposed convolution to learn the upsampling of coarse feature maps. To obtain the segmentation, feature maps at intermediate resolutions are concatenated and further processed by transposed convolutions. Since the introduction of FCNs, many variants for dense prediction have been proposed. Hariharan et al. [21] classify pixels based on feature vectors that are extracted at corresponding locations across all feature maps in a CNN. This way, the method combines features across all layers of the network, capturing high-resolution detail as well as context from large receptive fields. However, this approach becomes inefficient in deep architectures with many wide layers. Noh et al. [8] and Dosovitskiy et al. [22] propose encoder-decoder CNN architectures which successively unpool and convolve the lowest-resolution feature map of the encoder back to a high output resolution. Since the feature maps in the encoder lose spatial information through pooling, Noh et al. [8] exploit memorized unpooling [27] to upscale coarse feature maps at the decoder stage, where the stored pooling locations determine where activations are placed. The FCN of Laina et al. [5] uses the deep residual network [7] as an encoder, where most pooling layers are replaced by stride-two convolutions. For upscaling, the up-projection block is developed as an efficient implementation of up-convolution. The principle of up-convolution was introduced in [28]: a feature map is first unpooled by placing each activation into one entry of a block, and the sparse feature map is then filtered with a convolution. Details in the predictions of such encoder-decoder FCNs can be improved by feeding the feature maps at each scale of the encoder to the corresponding scale of the decoder (skip connections, e.g. [22]). In RefineNet [3], the decoder feature maps are successively refined using multi-resolution fusion with their higher-resolution counterparts in the encoder. In this paper, we also reincorporate the high-frequency information that is discarded during pooling to successively refine feature maps in the decoder.

Some FCN architectures use dilated convolutions in order to increase the receptive field without pooling, maintaining a high resolution of the feature maps [11, 12, 10]. These dilated CNNs trade high-resolution output for high memory consumption, which quickly becomes a bottleneck for training with large batch sizes compared to encoder-decoder CNNs. The full-resolution residual network (FRRN) by [4] is an alternative model which keeps features in a high-resolution lane and, at the same time, extracts low-resolution higher-order features in an encoder-decoder architecture.
The high-resolution features are successively refined with residuals computed through the encoder-decoder lane. While the model is highly demanding in memory and training time, it achieves high-resolution predictions that adhere well to segment boundaries. [23] take inspiration from Laplacian image decompositions for their network design: they successively refine the high-frequency parts of the score maps in order to improve predictions at segment boundaries. Structured prediction approaches integrate inference in CRFs with deep neural networks in end-to-end trainable models [15, 24, 25, 16]. While these models are capable of recovering high-resolution predictions, inference and learning typically require tedious iterative procedures. In contrast to those approaches, we aim to provide detailed predictions in a swift and direct forward pass. Recently, the pyramid scene parsing network (PSPNet) from [2] extracts global context features using a pyramid pooling module, which shows the benefit of aggregating global information for dense prediction. The pyramid design in PSPNet relies on multiple average pooling layers with heuristic window sizes. In this work, we instead propose a more efficient pyramid pooling stage based on the multi-resolution DWT.
III WCNN Encoder-Decoder Architectures
Recently, CNNs have demonstrated impressive performance on many dense pixelwise prediction tasks, including semantic image segmentation, optical flow estimation, and depth regression. CNNs extract image features through successive layers of convolution and nonlinear activation. In encoder architectures, as the stack of layers gets deeper, the dimension of the feature vectors increases while the spatial resolution is reduced. For dense prediction tasks, CNNs with an encoder-decoder architecture are widely applied, in which the feature maps of the encoder are successively unpooled and deconvolved. Research on architectures for the encoder part is relatively mature; e.g., state-of-the-art CNNs such as VGGNet [6] and ResNet [7] are commonly used in various applications. In contrast, the design of the decoder has not yet converged to a universally accepted solution. While it is easy to reduce the spatial dimension by either pooling or strided convolution, recovering a detailed prediction from a coarse and high-dimensional feature space is less straightforward. In this paper, we draw an analogy between CNN encoder-decoders and the multi-resolution wavelet transform (see Figure 1). We match the pooling operations of the CNN encoder with the multi-level forward transformation of a signal by a wavelet; the decoder performs the corresponding inverse wavelet transform for unpooling. The analogy is straightforward: the wavelet transform successively filters the signal into frequency subbands while reducing the spatial resolution, and the inverse wavelet transform successively composes the frequency subbands back to full resolution. While the encoder and the decoder transform between different domains (e.g. image-to-semantic-segmentation vs. image-to-image in wavelet transforms), we find that wavelet unpooling provides an elegant mechanism to transmit high-frequency information from the image domain to the semantic segmentation.
It also imposes a strong architectural regularization, as the feature dimensions between the encoder and the decoder need to match through the wavelet coefficients.
III-A Discrete Wavelet Transform
We briefly introduce the main concepts of the DWT (see [26] for a comprehensive introduction). The multi-resolution wavelet transform provides a localized time-frequency analysis of signals and images. Consider a 2D input $x$, and let $l$ and $h$ denote the 1D low-pass and high-pass filters, respectively, with the 2D separable filters $f_{ll} = l\,l^\top$, $f_{lh} = l\,h^\top$, $f_{hl} = h\,l^\top$, $f_{hh} = h\,h^\top$. The single-level DWT is defined as follows,

(1)  $x_{ll} = (f_{ll} * x)_{\downarrow 2}, \quad x_{lh} = (f_{lh} * x)_{\downarrow 2}, \quad x_{hl} = (f_{hl} * x)_{\downarrow 2}, \quad x_{hh} = (f_{hh} * x)_{\downarrow 2}.$

All convolutions above are performed with stride 2, yielding a subsampling ($\downarrow 2$) by a factor of 2 along each spatial dimension. We call $x_{ll}$ the low-low, $x_{lh}$ the low-high, $x_{hl}$ the high-low, and $x_{hh}$ the high-high frequency component. The DWT thus results in the tuple $(x_{ll}, x_{lh}, x_{hl}, x_{hh})$. Conversely, supplied with the wavelet coefficients, and provided that $(l, h)$ and the reconstruction filters $(\tilde{l}, \tilde{h})$ form biorthogonal wavelet filters, the original input can be reconstructed by the inverse DWT as

(2)  $x = \tilde{f}_{ll} \star x_{ll} + \tilde{f}_{lh} \star x_{lh} + \tilde{f}_{hl} \star x_{hl} + \tilde{f}_{hh} \star x_{hh},$

where $\star$ denotes the stride-2 transposed convolution and $\tilde{f}_{ll} = \tilde{l}\,\tilde{l}^\top$, etc.
A cascaded wavelet decomposition successively applies Equation 1 to the low-low frequency coefficients from fine to coarse resolution, while the reconstruction works in reverse, from coarse to fine resolution. In this sense, decomposition and reconstruction in multi-resolution wavelet analysis are in analogy to the pooling and unpooling steps in an encoder-decoder CNN (e.g., [8]). Moreover, it is worth noting that, while the low-frequency coefficients store local averages of the input data, the high-frequency counterparts, namely $x_{lh}$, $x_{hl}$, and $x_{hh}$, encode local textures which are vital for recovering sharp boundaries. This motivates us to make use of the high-frequency wavelet coefficients to improve the quality of unpooling during the decoder stage and, hence, the accuracy of the CNN in pixelwise prediction.
Throughout this paper, we use the Haar wavelet for its simplicity and its effectiveness in boosting the performance of the underlying CNN. In this scenario, the Haar filters used for decomposition in Equation 1 are given by

(3)  $l = \tfrac{1}{\sqrt{2}}\,(1,\ 1)^\top, \qquad h = \tfrac{1}{\sqrt{2}}\,(1,\ -1)^\top.$

Since the Haar wavelet is orthogonal, the corresponding reconstruction filters in Equation 2 are given by $\tilde{l} = l$ and $\tilde{h} = h$, and hence the inverse transform reduces to a sum of Kronecker products (denoted by $\otimes$)

(4)  $x = x_{ll} \otimes f_{ll} + x_{lh} \otimes f_{lh} + x_{hl} \otimes f_{hl} + x_{hh} \otimes f_{hh}.$
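To make the Haar transform concrete, the following is a minimal NumPy sketch of the single-level decomposition (Equation 1) and its inverse (Equation 4). The function names are ours, and the sign convention follows the filters above; for the 2-tap Haar filters, the strided convolutions reduce to simple slicing.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT: split x into four half-resolution subbands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # even rows: even/odd columns
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # odd rows:  even/odd columns
    ll = (a + b + c + d) / 2.0            # low-low: local average
    lh = (a - b + c - d) / 2.0            # low-high: horizontal detail
    hl = (a + b - c - d) / 2.0            # high-low: vertical detail
    hh = (a - b - c + d) / 2.0            # high-high: diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse 2D Haar DWT: recombine the four subbands to full resolution."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

Applying `haar_idwt2` to the output of `haar_dwt2` recovers the input exactly, which is the perfect-reconstruction property exploited by the wavelet unpooling below.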
With CNNs, the data at every layer are structured into 4D tensors, i.e., along the dimensions of batch size, channel number, width and height. To perform the wavelet transform within CNNs, we apply the DWT/iDWT channelwise. Without risk of confusion, the remaining text adopts the shorthand notation $\mathrm{DWT}(\cdot)$ for the Haar DWT and $\mathrm{iDWT}(\cdot)$ for the corresponding iDWT.

III-B Wavelet CNN Encoder-Decoder Architecture
We propose a CNN encoder-decoder that resembles multi-resolution wavelet decomposition and reconstruction in its pooling and unpooling operations. In addition, we introduce two pyramid variants to capture global contextual features based on the wavelet transform.
Figure 1 gives an overview of the proposed WCNN architecture. WCNN employs ResNet [7] as the encoder. In ResNet, the input resolution is successively reduced by a factor of 32 via one max-pooling layer and four stride-two convolutional layers, i.e., conv1, conv3_1, conv4_1 and conv5_1. In order to restore the input resolution with the decoder, WCNN inserts three DWT layers after conv2, conv3 and conv4 to decompose the corresponding feature maps into four frequency bands. The high frequencies are skip-connected to the decoder to perform unpooling via the iDWT layers, which we discuss in detail in Section III-B1. We add three convolutional residual blocks [7] to further filter the unpooled feature maps before the next unpooling stage. As illustrated in Figure 1, the three iDWT layers successively upsample the feature maps, and the full-resolution output is obtained with two up-convolutional blocks based on transposed convolution. To bridge the encoder and the decoder, a contextual pyramid with wavelet transforms is added; Section III-B2 details the pyramid design.

III-B1 Wavelet Unpooling
WCNN achieves unpooling through the iDWT layers. To this end, DWT layers are added to the encoder to obtain the high-frequency components. The idea is straightforward: at the encoder, the DWT layers decompose the feature maps channelwise into four frequency bands, where each frequency band has half the resolution of the input. The high-frequency components are skip-connected to the decoder, where the spatial resolution needs to be upscaled by a factor of two. Taking the layer idwt4 in Figure 1 as an example, the inputs to this layer are the four components required to perform the iDWT. The pyramid output serves as the low-low frequency band $x_{ll}$, while the output of the dwt4 layer operating on conv4 provides the three high-frequency components $x_{lh}$, $x_{hl}$, and $x_{hh}$. With the iDWT, the spatial resolution is doubled. The output of layer idwt4 is finalized by adding the direct output of conv4 at the same resolution, a standard skip connection commonly used by state-of-the-art encoder-decoder CNNs to improve upsampling performance. The iDWT layer can thus be described by

(5)  $y = \mathrm{iDWT}\left(x_{ll}^{dec},\ x_{lh}^{enc},\ x_{hl}^{enc},\ x_{hh}^{enc}\right) + x^{enc},$

where $x_{ll}^{dec}$ is the decoder (or pyramid) feature map serving as the low-low band, $x_{lh}^{enc}$, $x_{hl}^{enc}$, $x_{hh}^{enc}$ are the cached high-frequency bands from the encoder, and $x^{enc}$ is the skip-connected encoder feature map before the DWT.
We denote this approach of upscaling the decoder feature map with the wavelet coefficients from the encoder as wavelet unpooling.
Typically, CNNs extract features with many layers of convolution and nonlinear operations, which transform and embed the feature space differently layer by layer. Wavelet unpooling aims to maintain a similar frequency structure throughout the CNN. By replacing the low-frequency band of the encoder with the corresponding output of the decoder to perform the iDWT with the high-frequency bands from the encoder, wavelet unpooling encourages the network to learn feature maps whose frequency structure is invariant under layers of filtering. The skip connections of the signals before the DWT also support learning such consistency.
In comparison to other unpooling methods, for example upsampling by transposed convolution as proposed in [20], wavelet unpooling does not require any parameters in the DWT and iDWT layers. Compared to memorized unpooling as proposed in [27], or to mapping the low-resolution feature map to the top-left entry of each block [28], wavelet unpooling restores every entry according to the frequency structure.
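The wavelet unpooling step of Equation 5 can be sketched in NumPy as follows, assuming (C, H, W) feature maps and the Haar convention above; `wavelet_unpool` and its argument names are illustrative, not the actual implementation.

```python
import numpy as np

def haar_idwt2(ll, lh, hl, hh):
    """Inverse 2D Haar DWT on one channel."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def wavelet_unpool(decoder_feat, enc_lh, enc_hl, enc_hh, enc_skip):
    """Wavelet unpooling (Eq. 5): the decoder feature map acts as the low-low
    band, the cached encoder coefficients supply the high-frequency bands,
    and the pre-DWT encoder feature map is added as a skip connection.
    decoder_feat and the bands are (C, H, W); enc_skip and the output are
    (C, 2H, 2W)."""
    up = np.stack([haar_idwt2(decoder_feat[c], enc_lh[c], enc_hl[c], enc_hh[c])
                   for c in range(decoder_feat.shape[0])])
    return up + enc_skip
```

Note that the upscaling itself is parameter-free; only the residual blocks that follow it in the network carry learnable weights.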
III-B2 Wavelet Pyramid
In CNNs designed for classification tasks, the last few layers typically reduce the spatial resolution to a very small size. Such feature maps have a receptive field covering the entire input image and therefore capture global context. Recent works have demonstrated that capturing global context information is also crucial for accurate dense pixelwise prediction [11, 2]. While it is straightforward to obtain global context with a fully-connected layer or with convolutional layers of large filter size, it is difficult to bridge an encoder with drastically reduced spatial resolution to a proper decoder. Most state-of-the-art CNN encoders reduce the spatial resolution by a factor of 32, producing very coarse output feature maps given the input dimensions. If the global context is captured by a simple fully-connected layer, learning the upsampling kernels is challenging.
One solution is to use dilated convolutions, which increase the receptive field with the same number of parameters [11, 12]. Building on dilated CNNs, the pyramid scene parsing network (PSPNet) [2] introduces a pyramid on the feature map with multiple average pooling operations of different window sizes. Notably, dilated convolutions demand considerably larger amounts of memory to host the data, which quickly becomes the bottleneck for training with large batch sizes. In this work, we base our network design on non-dilated CNNs and instead construct the pyramids through wavelet transforms. We propose two wavelet pyramid variants, namely the low-frequency propagation (LFP) and the full-frequency composition (FFC) pyramid, as shown in Figure 2.
Low-Frequency Propagation Wavelet Pyramid
Shown in Figure 2 (a), the LFP pyramid successively performs the DWT on the low-low frequency components $x_{ll}$. At each pyramid level, the extracted component is further transformed with a convolutional layer and then bilinearly upsampled to the same spatial resolution as the pyramid input, i.e., conv5. We then concatenate the upsampled feature maps to aggregate the global context captured at the different scales. This concatenated feature map is combined with the skip-connected conv5 by an element-wise addition, and is then filtered with a convolutional layer to match the channel dimension of the decoder.
With LFP, a multi-resolution wavelet pyramid is constructed in which only the low-low frequency bands of each level are used. The LFP pyramid resembles the pyramid proposed by PSPNet [2]. In particular, with the Haar wavelet, the low-low frequency band is equivalent, up to a constant factor, to average pooling with a 2×2 window. The difference is that PSPNet designs its average pooling with multiple heuristic window sizes, whereas the LFP pyramid is performed strictly according to the frequency decomposition. Although the Haar wavelet is used in this work, the LFP pyramid generalizes easily to other wavelet basis functions.
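The data flow of the LFP pyramid can be sketched as follows. This is a simplified NumPy illustration under our own naming: the per-level convolutions of the real network are omitted, and nearest-neighbor upsampling stands in for the bilinear upsampling of the actual design.

```python
import numpy as np

def haar_ll(x):
    """Low-low Haar band of a (C, H, W) feature map; equals 2x2 average
    pooling up to a constant factor of 2."""
    return (x[:, 0::2, 0::2] + x[:, 0::2, 1::2]
            + x[:, 1::2, 0::2] + x[:, 1::2, 1::2]) / 2.0

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling along both spatial axes."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def lfp_pyramid(feat, levels):
    """LFP pyramid sketch: successively take the low-low band, upsample each
    level back to the input resolution, and concatenate along channels."""
    outs, x = [], feat
    for lvl in range(1, levels + 1):
        x = haar_ll(x)                          # descend one pyramid level
        outs.append(upsample_nearest(x, 2 ** lvl))
    return np.concatenate(outs, axis=0)         # aggregate multi-scale context
```

In the full network, the concatenated map is further added to the skip-connected conv5 and projected by a convolution, as described above.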
Full-Frequency Composition Wavelet Pyramid
The LFP pyramid uses only the low-low frequency bands. In order to make full use of the frequency decomposition, we develop the FFC pyramid. Shown in Figure 2 (b), the FFC pyramid amounts to a small encoder-decoder with wavelet unpooling. Starting from the input conv5, the DWT is performed to obtain the four frequency bands. The low-low frequency band is filtered by an additional convolutional layer, and the high-frequency bands are cached for unpooling. The filtered low-low frequency band is then further decomposed by the DWT into the next level, and the same operation repeats until the coarsest feature map is obtained. To upscale from the coarsest level, we again adopt the wavelet unpooling described by Equation 5: the iDWT is first performed using the cached high-frequency bands, and the output is then fused with the skip connection. The wavelet unpooling successively restores the spatial resolution to that of the pyramid input. Finally, we skip-connect conv5 with the wavelet output by an element-wise addition, and project the global context with a convolution to bridge to the following decoder. The FFC pyramid thus mimics the encoder-decoder design, reducing the spatial resolution and restoring it in a manner consistent with the rest of the network.
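The descend-and-ascend structure of the FFC pyramid can be sketched in NumPy as below. We again omit the per-level convolutions, so this sketch reduces to perfect reconstruction plus the skip connection; it illustrates the data flow only, and the function names are our own.

```python
import numpy as np

def haar_dwt2(x):
    """Channelwise single-level Haar DWT on a (C, H, W) tensor."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Channelwise inverse Haar DWT on (C, H, W) subbands."""
    C, h, w = ll.shape
    x = np.empty((C, 2 * h, 2 * w), dtype=ll.dtype)
    x[:, 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[:, 0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[:, 1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[:, 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def ffc_pyramid(feat, levels):
    """FFC pyramid sketch: descend by DWT while caching the high-frequency
    bands, ascend by iDWT with the cached bands, then add the input as a
    skip connection (the conv5 analogue)."""
    cache, x = [], feat
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(x)
        cache.append((lh, hl, hh))   # keep high bands for unpooling
        x = ll                       # continue on the low-low band
    for lh, hl, hh in reversed(cache):
        x = haar_idwt2(x, lh, hl, hh)
    return x + feat                  # element-wise skip connection
```

Without the convolutions, the descent and ascent cancel exactly, so the sketch returns twice its input; in the real network, the per-level convolutions transform the low-low bands between decomposition and reconstruction.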
IV Evaluations
In this section, we evaluate the proposed WCNN method on the task of semantic image segmentation. To this end, we use the Cityscapes benchmark dataset [29], which contains 2,975 training, 500 validation and 1,525 test images captured in 50 different cities from a driving car. All images are densely annotated with 30 commonly observed object classes occurring in urban street scenes, of which 19 classes are used for evaluation. The Cityscapes benchmark provides all images at the same high resolution of 2048×1024 pixels. The ground truth for the test images is not publicly available, and evaluations on the test set are submitted online (http://www.cityscapesdataset.com) for fair comparison.
Table I: WCNN network configuration (resolutions relative to the input, followed by channel counts).

layer  operation  input  dimension
conv1  conv, s2  RGB  1/2, 64
maxpool  maxpool, s2  conv1  1/4, 64
conv2_x  resblock  maxpool  1/4, 256
dwt2  DWT  conv2_x  1/8, 256
conv3_1  resblock, s2  conv2_x  1/8, 512
conv3_x  resblock  conv3_1  1/8, 512
dwt3  DWT  conv3_x  1/16, 512
conv4_1  resblock, s2  conv3_x  1/16, 1024
conv4_x  resblock  conv4_1  1/16, 1024
dwt4  DWT  conv4_x  1/32, 1024
conv5_1  resblock, s2  conv4_x  1/32, 2048
conv5_x  resblock  conv5_1  1/32, 2048
pyramid  LFP/FFC  conv5_x  1/32, 1024
idwt4  iDWT  pyramid, dwt4  1/16, 1024
dconv4_x  resblock  idwt4, conv4_x  1/16, 512
idwt3  iDWT  dconv4_x, dwt3  1/8, 512
dconv3_x  resblock  idwt3, conv3_x  1/8, 256
idwt2  iDWT  dconv3_x, dwt2  1/4, 256
dconv2_x  resblock  idwt2, conv2_x  1/4, 128
upconv2_x  upconv  dconv2_x  1/2, 64
upconv1_x  upconv  upconv2_x  1/1, 64
IV-A WCNN Configurations
Table I presents the network configuration of the proposed WCNN. We take the state-of-the-art ResNet-101 [7] as the encoder. ResNet-101 uses stride-two convolutions to reduce the spatial resolution. To implement WCNN, we preserve the stride-two convolution layers and insert three DWT layers dwt2, dwt3, dwt4 after the encoder blocks conv2_x, conv3_x, conv4_x, respectively, to obtain the frequency bands. At each upscaling stage of the decoder, the corresponding frequency bands are used, followed by several residual blocks before the next upscaling stage. The last two upscaling stages are implemented as up-convolutions, where a transposed convolution first scales up the resolution by a factor of two and residual blocks then further filter the intermediate output. In WCNN, we rely heavily on the residual blocks proposed in ResNet [7], where each block is a stack of three convolutional layers, with the second layer performing feature extraction and the first and third layers acting as 1×1 convolutions for feature projection.

In this work, we develop CNNs for high-resolution predictions. An input image of 1024×512 pixels yields conv5_x with a spatial resolution of 32×16. Therefore, we design both the LFP and FFC pyramids to have four levels of DWT, which produce frequency components at four successively halved resolutions. The coarsest pyramid level thus has a receptive field covering the entire input. The details of the LFP and FFC pyramids are given in Table II.
To evaluate the proposed network, the baseline CNN is designed to have minimal difference from WCNN. Taking the WCNN configuration in Table I, the baseline model 1) removes all DWT layers in the encoder, 2) replaces the pyramid by one convolutional layer, and 3) replaces the iDWT layers by transposed convolutions to upscale the feature maps by a factor of 2. The remaining layers are the same as in WCNN. In the following experiments, we compare the baseline model, the baseline model with the LFP and FFC pyramids, and the WCNN with the LFP and FFC pyramids.
Table II: Configurations of the LFP and FFC pyramids.

LFP pyramid
layer  operation  input  dimension
dwt_p1  DWT  conv5  2048
conv_p1  conv  dwt_p1  512
dwt_p2  DWT  dwt_p1  512
conv_p2  conv  dwt_p2  512
concat  concatenation  conv_p1, conv_p2  2048
conv_pyr  conv  concat, conv5  1024

FFC pyramid
layer  operation  input  dimension
dwt_p1  DWT  conv5  2048
conv_p1  conv  dwt_p1  2048
dwt_p2  DWT  conv_p1  2048
conv_p2  conv  dwt_p2  2048
idwt_p2  iDWT  conv_p2, dwt_p2  2048
idwt_p1  iDWT  idwt_p2, dwt_p1  2048
conv_pyr  conv  idwt_p1, conv5  1024
IV-B Implementation Details
We implemented all our methods in the TensorFlow [30] machine learning framework. For network training, we initialize the parameters of the encoder layers from a ResNet model pretrained on ImageNet, and initialize the convolutional kernels of the decoder with He [31] initialization. We train with a batch size of four on an Nvidia Titan X GPU. For training, we minimize the cross-entropy loss using the Stochastic Gradient Descent (SGD) solver with a momentum of 0.9. The initial learning rate is set to 0.001 and decreased by a factor of 0.9 every 10 epochs. We train the networks until convergence; for Cityscapes, all variants in our experiments converge after around 60K iterations. Following [4], we apply bootstrapped loss minimization for the Cityscapes benchmark in order to speed up training and boost segmentation accuracy. For all Cityscapes experiments, we fix the bootstrapping threshold to the top 8192 most difficult pixels per image.
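A bootstrapped cross-entropy of this kind can be sketched as follows in NumPy, averaging the loss only over the k hardest pixels (the setting above uses k = 8192 per image). The function name and tensor layout are our own; this is an illustration of the principle, not the paper's implementation.

```python
import numpy as np

def bootstrapped_ce(logits, labels, k):
    """Bootstrapped cross-entropy sketch: keep only the k pixels with the
    largest per-pixel loss.  logits: (C, H, W), labels: (H, W) class ids."""
    z = logits - logits.max(axis=0, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = labels.shape
    # negative log-likelihood of the ground-truth class at each pixel
    per_pixel = -log_probs[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    hardest = np.sort(per_pixel.ravel())[-k:]               # top-k largest losses
    return hardest.mean()
```

Restricting the average to the hardest pixels focuses the gradient on difficult regions such as thin structures and segment boundaries.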
Table III: Class-wise IoU and mean IoU (avg) on the Cityscapes validation set.

method  road  sidewalk  building  wall  fence  pole  traffic light  traffic sign  vegetation  terrain  sky  person  rider  car  truck  bus  train  motorcycle  bicycle  avg
frequency  37.7  5.4  21.9  0.7  0.8  1.5  0.2  0.7  17.2  0.8  3.4  1.3  0.2  6.6  0.3  0.4  0.1  0.1  0.7  
baseline  98.8  88.8  96.0  51.5  61.6  62.0  66.6  76.5  96.0  70.1  97.1  85.8  66.4  97.0  81.4  85.4  59.0  53.8  84.6  69.2 
baselineLFP  98.6  90.1  95.5  62.6  62.6  61.3  65.7  76.0  95.9  69.3  97.4  85.4  63.6  97.1  80.1  88.4  73.8  61.2  85.1  71.2 
baselineFFC  98.6  89.6  95.3  63.4  62.0  61.3  67.8  74.4  96.1  64.6  97.3  85.9  63.0  96.9  85.5  89.4  73.6  58.5  84.5  70.7 
WCNNLFP  98.6  89.8  95.7  63.0  65.8  61.5  67.8  76.2  96.3  69.4  97.4  85.8  67.4  97.2  82.0  88.9  69.9  59.9  84.9  71.6 
WCNNFFC  98.7  90.5  95.6  64.8  64.6  63.2  67.8  77.3  96.1  71.0  97.3  86.1  65.3  97.0  82.7  88.7  77.6  57.7  85.1  71.9 
baselineMS  99.0  90.6  96.7  48.0  61.2  68.2  72.9  80.2  96.3  72.5  97.7  89.1  70.3  97.6  76.6  82.2  48.9  60.7  84.9  71.4 
baselineLFPMS  98.7  92.2  96.5  54.0  65.5  68.9  71.2  79.0  96.1  64.7  97.6  88.1  64.3  97.8  71.2  87.3  71.8  68.5  85.7  73.3 
baselineFFCMS  98.7  91.7  96.4  64.6  65.0  67.4  74.3  79.7  96.7  68.9  98.0  88.8  68.9  97.5  88.3  90.6  79.3  60.9  85.8  74.7 
WCNNLFPMS  98.8  92.4  96.2  61.2  68.0  68.5  71.2  79.8  96.3  64.8  97.5  88.4  70.1  97.8  77.8  89.3  61.6  74.1  87.1  73.9 
WCNNFFCMS  98.8  92.2  96.6  68.6  64.8  69.1  73.9  81.6  96.7  72.4  97.8  89.3  68.9  97.5  87.3  90.5  73.3  58.0  85.3  75.2 
Table IV: Comparison with FRRN [4] on Cityscapes.

method  class mIoU  category mIoU
FRRN [4]  71.8  88.9 
WCNNFFC  70.9  86.1 
WCNNFFCMS  73.7  88.3 
To train all variants of the baseline and our model, we fix the network input to quarter resolution of the original dataset, i.e., 1024×512 pixels. For evaluation on the validation dataset, we bilinearly upsample the output logits to half of the original resolution (to match the network input resolution) and compute the intersection-over-union (IoU) score for each class and on average. We also experiment with test-time data augmentation, where we feed randomly scaled versions of the input images through the network and fuse the resulting scores.
IV-C Cityscapes
We evaluate segmentation accuracy using the standard IoU metric.
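For reference, per-class and mean IoU can be computed from predicted and ground-truth label maps as in the following NumPy sketch (our own helper, not the benchmark's official evaluation script, which additionally handles ignore labels):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class and mean intersection-over-union from integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return ious, float(np.mean(ious))
```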
Table III gives the class-wise IoU and the mean IoU over the 19 classes. Adding the LFP and FFC pyramids to the baseline network already improves segmentation performance significantly over the baseline, and the FFC pyramid outperforms the LFP pyramid in most settings. With WCNN, we gain a further increase in mean IoU of up to 1.2 points over the corresponding baseline. With multi-scale test-time augmentation, the accuracy of each model increases, and a similar ranking is observed among the different methods. Our variants benefit strongly, with the combination of wavelet unpooling and the FFC wavelet pyramid achieving the largest gain over the baseline (+6.0 mIoU). These results demonstrate that wavelet unpooling as well as the FFC wavelet pyramid improve the dense prediction of the baseline model. Qualitative comparisons are shown in Figure 3: the WCNN approach recovers fine-detailed structures such as fences, poles or traffic signs with higher accuracy than the baselines.

Table IV compares our method with the current state-of-the-art method FRRN [4] at the same input resolution (2x subsampling) on the Cityscapes benchmark. Our WCNN-FFC-MS outperforms FRRN by 1.9 mean IoU over the 19 classes, while it is slightly worse (-0.6 mIoU) at the category level. Notably, WCNN is much less memory-demanding than FRRN.
V Conclusion
This paper introduced WCNN, a novel encoder-decoder CNN architecture for dense pixelwise prediction. The key innovation is to exploit the discrete wavelet transform (DWT) and its inverse (iDWT) to design the unpooling operation. In the proposed network, the high-frequency coefficients extracted by the DWT at the encoder stage are cached and later combined with coarse-resolution feature maps at the decoder to perform accurate upsampling and, ultimately, pixelwise prediction. Further, two wavelet pyramid variants were introduced, namely the low-frequency propagation (LFP) pyramid and the full-frequency composition (FFC) pyramid. Both pyramids extract global context from the encoder output via multi-resolution wavelet decomposition. As shown in the experiments, WCNN outperforms the corresponding baseline CNNs and achieves state-of-the-art semantic segmentation performance on the Cityscapes dataset.
In future work, we will evaluate WCNNs on other dense pixelwise prediction tasks, e.g., depth estimation and optical flow estimation. We will also perform an ablation study of the wavelet pyramid to evaluate different pyramid configurations. It would further be interesting to extend WCNN to other wavelet basis functions, or ultimately to learn optimal basis functions with CNNs.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012.

[2] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [3] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [4] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Fullresolution residual networks for semantic segmentation in street scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [5] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in International Conference on 3D Vision (3DV), pp. 239–248, 2016.
 [6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
 [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
 [8] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in IEEE International Conference on Computer Vision (ICCV), pp. 1520–1528, 2015.
 [9] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv:1511.00561, 2016.
 [10] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations (ICLR), 2016.
 [11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” arXiv:1606.00915, 2016.
 [12] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 834–848, 2018.
 [13] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” in Advances in Neural Information Processing Systems (NIPS), 2011.
 [14] P. Krähenbühl and V. Koltun, “Parameter learning and convergent inference for dense random fields,” in International Conference on Machine Learning (ICML), pp. 513–521, 2013.
 [15] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in IEEE International Conference on Computer Vision (ICCV), pp. 1529–1537, 2015.
 [16] G. Lin, C. Shen, A. van den Hengel, and I. Reid, “Efficient piecewise training of deep structured models for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3194–3203, 2016.
 [17] J. Yan, Y. Yu, X. Zhu, Z. Lei, and S. Z. Li, “Object detection by labeling superpixels,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [18] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1915–1929, Aug 2013.
 [19] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5162–5170, 2015.
 [20] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [21] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [22] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
 [23] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid reconstruction and refinement for semantic segmentation,” in European Conference on Computer Vision (ECCV), pp. 519–534, 2016.
 [24] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, “Semantic image segmentation via deep parsing network,” in IEEE International Conference on Computer Vision (ICCV), pp. 1377–1385, 2015.
 [25] S. Chandra and I. Kokkinos, “Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs,” in European Conference on Computer Vision (ECCV), pp. 402–418, 2016.
 [26] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 3rd ed., 2009.
 [27] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in IEEE International Conference on Computer Vision (ICCV), pp. 2018–2025, 2011.
 [28] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox, “Learning to generate chairs with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1538–1546, 2015.
 [29] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223, 2016.
 [30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Largescale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
 [31] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in IEEE International Conference on Computer Vision (ICCV), December 2015.