Waterfall Atrous Spatial Pooling Architecture for Efficient Semantic Segmentation

12/06/2019 ∙ by Bruno Artacho, et al. ∙ Rochester Institute of Technology

We propose a new efficient architecture for semantic segmentation, based on a "Waterfall" Atrous Spatial Pooling architecture, that achieves a considerable accuracy increase while decreasing the number of network parameters and memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture while maintaining multiscale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with Conditional Random Fields, which further reduces complexity and required training time. We demonstrate that the Waterfall approach with a ResNet backbone is a robust and efficient architecture for semantic segmentation obtaining state-of-the-art results with significant reduction in the number of parameters for the Pascal VOC dataset and the Cityscapes dataset.




1 Introduction

Semantic segmentation is an important computer vision task Garcia-Garcia et al. (2017); Zhu et al. (2016); Thoma (2016) with applications in autonomous driving Ess et al. (2009), human–machine interaction Oberweger and Vincent Lepetit (2015), computational photography Fan et al. (2017), and image search engines Tzelepi and Tefas (2018). The significance of semantic segmentation, in both the development of novel architectures and its practical use, has motivated the development of several approaches that aim to improve the encouraging initial results of Fully Convolutional Networks (FCN) Long et al. (2015). One important challenge to address is the decrease of the feature map size due to pooling, which requires unpooling to perform pixel-wise labeling of the image for segmentation.

DeepLab Chen et al. (2018), for instance, used dilated or Atrous Convolutions to tackle the limitations posed by the loss of resolution inherited from unpooling operations. The advantage of Atrous Convolution is that it maintains the Field-of-View (FOV) at each layer of the network. DeepLab implemented Atrous Spatial Pyramid Pooling (ASPP) blocks in the segmentation network, allowing the utilization of several Atrous Convolutions at different dilation rates for a larger FOV.

A limitation of the ASPP architecture is that the network experiences a significant increase in size and memory required. This limitation was addressed in Chen et al. (2017), by replacing ASPP modules with the application of Atrous Convolutions in series, or cascade, with progressive rates of dilation. However, although this approach successfully decreased the size of the network, it presented the setback of decreasing the size of the FOV.

Motivated by the success achieved by a network architecture with parallel branches introduced by the Res2Net module Gao et al. (2020), we incorporate Res2Net blocks in a semantic segmentation network. Then, we propose a novel architecture named the Waterfall Atrous Spatial Pooling (WASP) and use it in a semantic segmentation network we refer to as WASPnet (see segmentation examples in Figure 1). Our WASP module combines the cascaded approach used in Chen et al. (2017) for Atrous Convolutions with the larger FOV obtained from traditional ASPP in DeepLab for the deconvolutional stages of semantic segmentation.

Figure 1: Semantic segmentation examples using WASPnet.

The WASP approach leverages the progressive extraction of larger FOV from cascade methods, and is able to achieve parallelism of branches with different FOV rates while maintaining reduced parameter size. The resulting architecture has a flow that resembles a waterfall, which is how it gets its name.

The main contributions of this paper are as follows.


  • We propose the Waterfall method for Atrous Spatial Pooling that achieves significant reduction in the number of parameters in our semantic segmentation network compared to current methods based on the spatial pyramid architecture.

  • Our approach increases the receptive field of the network by combining the benefits of cascade Atrous Convolutions with multiple fields-of-view in a parallel architecture inspired by the spatial pyramid approach.

  • Our results show that the Waterfall approach achieves state-of-the-art accuracy with a significant reduction in the number of network parameters.

  • Due to the superior performance of the WASP architecture, our network does not require postprocessing of the semantic segmentation result with a CRF module, making it even more efficient in terms of computational complexity.

2 Related Work

The innovations in Convolutional Neural Networks (CNNs) by the authors of Krizhevsky et al. (2012); Simonyan and Zisserman (2015); Szegedy et al. (2014); He et al. (2015) form the core of image classification and serve as the structural backbone for state-of-the-art methods in semantic segmentation. However, an important challenge with incorporating CNN layers in segmentation is the significant reduction of resolution caused by pooling.

The breakthrough work of Long et al. (2015) introduced Fully Convolutional Networks (FCN) by replacing the final fully connected layers with deconvolutional stages. FCN Long et al. (2015) addressed the resolution reduction problem by deploying upsampling strategies across deconvolution layers. These deconvolution stages attempt to reverse the convolution operation and increase the feature map size back to the dimensions of the original image. The contributions of FCN Long et al. (2015) triggered research in semantic segmentation that led to a variety of approaches, which are illustrated in Figure 2.

Figure 2: Semantic segmentation research overview.

2.1 Atrous Convolution

The most popular technique shared among semantic segmentation architectures is the use of dilated or Atrous Convolutions. An early work by Yu and Koltun (2016) highlighted the uses of dilation. Atrous Convolutions were further explored by the authors of Chen et al. (2018, 2017, 2018); Paszke et al. (2016). The main objectives of Atrous Convolutions are to increase the size of the receptive fields in the network, avoid downsampling, and generate a multiscale framework for segmentation.

The name Atrous is derived from the French expression “algorithme à trous”, which translates to English as “algorithm with holes”. As its name suggests, an Atrous Convolution alters the convolutional filter by inserting “holes”, or zero values, between the filter weights. This increases the size of the receptive field without adding weights, so the operation resembles a hybrid of convolution and pooling layers. The use of Atrous Convolutions in the network is shown in Figure 3.
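To make the effect of the inserted holes concrete, the short sketch below computes how many input samples a dilated kernel spans along one axis. It uses the standard relation span = k + (k - 1)(r - 1) for a kernel of size k and dilation rate r; the function name is illustrative and not taken from the paper.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Number of input samples spanned along one axis by a dilated kernel.

    Inserting (rate - 1) zeros between neighbouring weights stretches a
    kernel of size k to k + (k - 1) * (rate - 1) samples.
    """
    if kernel_size < 1 or rate < 1:
        raise ValueError("kernel_size and rate must be positive")
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Size-3 kernels at the rates of Figure 3 (1, 2, 3) and at the
# dilation rates later selected for the WASP module (6, 12, 18, 24).
for rate in (1, 2, 3, 6, 12, 18, 24):
    print(f"rate {rate:2d}: spans {effective_kernel_size(3, rate)} samples")
```

The number of weights stays at three per axis in every case; only the spacing between them grows.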

In the simpler case of a one-dimensional convolution, the output of the signal is defined as follows Chen et al. (2018),

$$y[i] = \sum_{k=1}^{K} x[i + r \cdot k] \, w[k] \qquad (1)$$

where $r$ is the rate at which the Atrous Convolution is dilated, $w[k]$ is the filter of length $K$, $x[i]$ is the input, and $y[i]$ is the output of a pixel. As pointed out in Chen et al. (2018), a rate of $r = 1$ results in a regular convolution operation.
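The one-dimensional definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper: it uses 0-based filter indices (k = 0, ..., K - 1, a constant shift relative to the 1-based sum) and evaluates only positions where every filter tap falls inside the input, which is an assumption about boundary handling.

```python
from typing import List, Sequence


def atrous_conv1d(x: Sequence[float], w: Sequence[float], rate: int) -> List[float]:
    """Compute y[i] = sum_k x[i + rate * k] * w[k] at all fully valid positions."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    span = rate * (len(w) - 1)  # distance between the first and last filter tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]

# rate = 1 reduces to a regular (undilated) sliding filter.
print(atrous_conv1d(signal, kernel, rate=1))
# rate = 2 keeps the same three weights but samples inputs two apart.
print(atrous_conv1d(signal, kernel, rate=2))
```

With this difference kernel, the first call subtracts samples two positions apart and the second subtracts samples four positions apart, which shows the wider field of view obtained with the same number of weights.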

Figure 3: Input pixels using 3 × 3 Atrous Convolutions with dilation rates of 1, 2, and 3, respectively.

Leveraging the success of the Spatial Pyramid Pooling (SPP) structure by He et al. (2014), the ASPP architecture was introduced in DeepLab Chen et al. (2018). The special configuration of ASPP assembles dilated convolutions in four parallel branches with different rates. The resulting feature maps are combined by fast bilinear interpolation with an additional factor of eight to recover the feature maps in the original resolution.

2.2 DeepLabv3

The application of Atrous Convolution following the ASPP approach in Chen et al. (2018) was later extended in Chen et al. (2017) to the cascade approach, that is, the use of several Atrous Convolutions in sequence with rates increasing along the sequence. This approach, named DeepLabv3 Chen et al. (2017), allows the architecture to perform deeper analysis and improve its performance using approaches similar to those in Dai et al. (2017).

Contributions in Chen et al. (2017) included module realization in a cascade fashion, investigation of different multi-grid configurations for dilation in the cascade of convolutions, training with different output stride scales for the Atrous Convolutions, and techniques to improve the results when testing and fine-tuning for segmentation challenges. Another addition presented by Chen et al. (2017) is the inclusion of a ResNet-101 model, pretrained on both ImageNet Deng et al. (2009) and JFT-300M Sun et al. (2017) datasets.

More recently, DeepLabv3+ Chen et al. (2018) proposed the incorporation of ASPP modules with the encoder–decoder structure adopted by Badrinarayanan et al. (2015), reporting a better refinement in the border of the objects being segmented. This novel approach represented a significant improvement in accuracy from previous methods. In a separate development, Auto-DeepLab Liu et al. (2019) uses an Auto-ML approach to learn a semantic segmentation architecture by searching both the network level and the cell level of the structure. It achieves results comparable to current methods without requiring ImageNet Deng et al. (2009) pre-training or hierarchical architecture search.

2.3 CRF

A complication resulting from the lack of pooling layers is a reduction of spatial invariance. Thus, additional techniques are used to recover spatial definition, namely, Conditional Random Fields (CRF) and Atrous Convolutions. One popular method relying on CRF is CRFasRNN Zheng et al. (2015). Aiming to better delineate objects in the image, CRFasRNN combines CNN and CRF in a single network to incorporate the probabilistic method of the Gaussian pairwise potentials during inference. That enables end-to-end training, avoiding the need for postprocessing with a separate CRF module, as done in Chen et al. (2018). A limitation of architectures using CRF is that CRF has difficulty capturing delicate boundaries, as they have low confidence in the unary term of the CRF energy function.

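The postprocessing module of CRF performs refining of the prediction by Gaussian filters and iterative comparisons of pixels in the output image. The iteration process aims to minimize the “energy” below.

$$E(x) = \sum_{i} \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j) \qquad (2)$$

The energy consists of the summations of the unary potentials $\theta_i(x_i) = -\log P(x_i)$, where $P(x_i)$ is the probability (softmax) that pixel $i$ is correctly computed by the CNN, and the pairwise potential energy $\theta_{ij}(x_i, x_j)$, which is determined by the relationship between two pixels. Following the authors of Krähenbühl and Koltun (2011), $\theta_{ij}(x_i, x_j)$ is defined as

$$\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \left[ \omega_1 \exp\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right) + \omega_2 \exp\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right) \right] \qquad (3)$$

where the function $\mu(x_i, x_j)$ is defined to be equal to 1 in the case of $x_i \neq x_j$ and zero otherwise, that is, the CRF only accounts for energy that needs to be minimized when the labels differ. The pairwise potential function utilizes two Gaussian kernels: the first depends on pixel positions $p$ and the RGB color $I$; the second depends only on pixel positions. The Gaussian kernels are controlled by the hyperparameters $\sigma_\alpha$, $\sigma_\beta$, and $\sigma_\gamma$, which are determined through the iterations of the CRF, as well as the weights $\omega_1$ and $\omega_2$.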
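As a rough illustration of the pairwise term described above, the sketch below evaluates the two-kernel potential for a single pair of pixels: an appearance kernel over position and RGB color plus a smoothness kernel over position only, applied when the two labels differ. The function name and the default hyperparameter values are assumptions chosen for illustration; they are not the tuned values from the paper.

```python
import math
from typing import Sequence


def pairwise_potential(
    label_i: int,
    label_j: int,
    pos_i: Sequence[float],
    pos_j: Sequence[float],
    rgb_i: Sequence[float],
    rgb_j: Sequence[float],
    w1: float = 4.0,            # weight of the appearance kernel (illustrative)
    w2: float = 3.0,            # weight of the smoothness kernel (illustrative)
    sigma_alpha: float = 50.0,  # position scale of the appearance kernel
    sigma_beta: float = 3.0,    # color scale of the appearance kernel
    sigma_gamma: float = 3.0,   # position scale of the smoothness kernel
) -> float:
    """Pairwise CRF energy for one pixel pair; zero when the labels agree."""
    if label_i == label_j:
        return 0.0
    dist_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    dist_rgb = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))
    appearance = w1 * math.exp(
        -dist_pos / (2 * sigma_alpha ** 2) - dist_rgb / (2 * sigma_beta ** 2)
    )
    smoothness = w2 * math.exp(-dist_pos / (2 * sigma_gamma ** 2))
    return appearance + smoothness


# Nearby pixels with similar colors but different labels are penalized most.
similar = pairwise_potential(0, 1, (10, 10), (11, 10), (200, 30, 30), (201, 30, 30))
different = pairwise_potential(0, 1, (10, 10), (11, 10), (200, 30, 30), (20, 200, 90))
print(similar > different)
```

The penalty shrinks as the two pixels move apart or differ in color, so label changes are cheapest across strong color edges.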

2.4 Other Methods

In contrast to the large scale of segmentation networks using Atrous Convolutions, the Efficient Neural Network (ENet) Paszke et al. (2016) produces a real-time segmentation by trading off some of its accuracy for a significant reduction in processing time; ENet is up to 18× faster than other architectures.

During learning, CNN architectures have the tendency to learn information that is specific to the scale of the input image dataset. In an attempt to deal with this issue, a multiscale approach is used. For instance, the authors of Raj et al. (2015) proposed a network with two paths, one containing the original resolution image and another with double the resolution. The former is processed through a short CNN and the latter through a fully convolutional VGG-16. The first path is then combined with the upsampled version of the second, resulting in a network that can deal with larger variations in scale. A similar approach is applied in Eigen and Fergus (2014); Roy and Todorovic (2016); Bian et al. (2016), expanding the structure to include a larger number of networks and scales.

Other architectures achieved good results in semantic segmentation by using an encoder–decoder variant. For instance, SegNet Badrinarayanan et al. (2015) utilizes both an encoder and decoder phase, while relying on pooling indices from the encoder phase to aid the decoder phase. The Softmax classifier generates the final segmentation prediction map. The architecture presented by SegNet was further developed to include Bayesian techniques to model uncertainty in the network Kendall et al. (2015).

Contrasting with the work in Long et al. (2015), ParseNet Liu et al. (2015) completes an early fusion in the network, by performing an early merge of the global features from previous layers with the current map of the posterior layer. In ParseNet, the previous layer is unpooled and concatenated to the following layers to generate the final classifier prediction with both having the same size. This approach differs from FCN where the skip connection concatenates maps of different sizes.

Recurrent Neural Networks (RNN) have been used to successfully combine pixel-level information with local region information, enabling the RNN to include global context in the construction of the segmented image. A limitation of RNN, when used for Semantic Segmentation, is that it has difficulty constructing a sequence based on the structure of natural images. ReSeg Visin et al. (2015a) is a network based on previous work by ReNet Visin et al. (2015b). ReSeg presents an approach where RNN blocks from ReNet are applied after a few layers of a VGG structure, generating the final segmentation map by the use of upsampling by transposed convolutions. However, RNN-based architectures suffer from the vanishing gradient problem.

Networks using Long Short-Term Memory (LSTM) aim to tackle the issue of vanishing gradients. For instance, LSTM Context Fusion (LSTM-CF) Li et al. (2016) utilizes the concatenation of an architecture similar to DeepLab to process RGB and depth information. It uses three different scales for the RGB feature response and depth, similar to the work in Li and Yu (2016). Likewise, the authors of Byeon et al. (2015) used four different LSTM cells, each receiving distinct parts of the image. Recurrent Convolutional Neural Networks (rCNN) Pinheiro and Collobert (2013) recurrently train the network using different input window sizes fed into the RNN. This approach achieves better segmentation and avoids the loss of resolution encountered with fixed window fitting in RNN methods.

3 Methodology

We propose an efficient architecture for Semantic Segmentation that combines the large FOV generated by Atrous Convolutions with a cascade of convolutions in a “Waterfall” configuration. Our WASP architecture provides benefits due to its multiscale representations as well as efficiency in the reduced size of the network.

The processing pipeline is shown in Figure 4. The input image is initially fed into a deep CNN (namely a ResNet-101 architecture) with the final layers replaced by a WASP module. The resultant score map with the probability distributions obtained from Softmax is processed by a decoder network that performs bilinear interpolation and generates a more efficient segmentation without the use of postprocessing with CRF. We provide a comparison of our WASP architecture with DeepLab’s original ASPP architecture and with a modified architecture based on the Res2Net module.

Figure 4: WASPnet architecture for semantic segmentation.

3.1 Res2Net-Seg Module

Res2Net Gao et al. (2020) is a recently developed architecture designed to improve upon ResNet He et al. (2015). Res2Net incorporates multiscale features with a Squeeze-and-Excitation (SE) block Hu et al. (2017) to obtain better representations and achieves promising results. The Res2Net module divides the original bottleneck block into four parallel streams, each containing 25% of the layers that are fed to 4 different 3 × 3 convolutions. Simultaneously, it incorporates the output of the parallel convolution. The SE block is an adaptable architecture that can recalibrate the responses in the feature map channel by modeling the interdependencies between channels. This allows improvements in performance by exploiting the dependencies between feature maps without an increase in the network size.

Inspired by the work in Gao et al. (2020), we present a modified version of the Res2Net module that is suitable for segmentation, named Res2Net-Seg. The Res2Net-Seg module, shown in Figure 5, includes the main structure of Res2Net and, additionally, utilizes Atrous Convolutions at each scale for increased FOV and a fifth parallel branch that performs average pooling of all features, which incorporates the original scale in the feature map. The Res2Net-Seg module is utilized in the WASPnet architecture of Figure 4 in place of the WASP module. We next propose the WASP module, inspired by multiscale representations, which is an improvement over both the Res2Net-Seg and the ASPP configurations.

Figure 5: Res2Net-Seg block.

3.2 WASP Module

We propose the “Waterfall Atrous Spatial Pooling” module, shown in Figure 6. WASP is a novel architecture with Atrous Convolutions that is able to leverage both the larger FOV of the ASPP configuration and the reduced size of the cascade approach.

An important drawback of Atrous Convolution, applied in either the cascade fashion or the ASPP (parallel design), is that it requires a larger number of parameters and more memory for its implementation, compared to standard convolution. Chen et al. (2018) experimented with replacing convolutional layers of the network backbone architecture, namely, VGG-16 or ResNet-101, with Atrous Convolution modules, but this was too costly in terms of memory requirements. A compromise solution was to apply the cascade of Atrous Convolutions and ASPP modules starting after block 5 when ResNet-101 was utilized.

We overcome these limitations with our Waterfall architecture for improved performance and efficiency. The Waterfall approach is inspired by multiscale approaches Eigen and Fergus (2014); Roy and Todorovic (2016), the parallel structures of ASPP Chen et al. (2018), and Res2Net modules Gao et al. (2020), as well as the cascade configuration Chen et al. (2017). It is designed with the goal of reducing the number of parameters and memory required, which are the main limitation of Atrous Convolutions. The WASP module is utilized in the WASPnet architecture shown in Figure 4.

A comparison between the ASPP module, the cascade configuration, and the proposed WASP module is visually highlighted in Figures 6 and 7: Figure 6 shows the WASP module, and Figure 7 shows the ASPP and cascade modules. The WASP configuration consists of four branches of a Large-FOV being fed forward in a waterfall-like fashion. In contrast, the ASPP module uses parallel branches that use more parameters and are less efficient, while the cascade architecture uses sequential filtering operations lacking the larger FOV.

Figure 6: Proposed Waterfall Atrous Spatial Pooling (WASP) module.
Figure 7: Comparison for Atrous Spatial Pyramid Pooling (ASPP) Chen et al. (2018) and Cascade configuration Chen et al. (2017).
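The difference between the three configurations is mainly one of data flow, which the toy sketch below traces symbolically. Each "branch" is a stand-in that only records which Atrous Convolution was applied to which input, so no real filtering takes place. Reusing the rates {6, 12, 18, 24} for all three configurations and omitting any further per-branch layers are simplifying assumptions made for illustration.

```python
from typing import Callable, List

Branch = Callable[[str], str]


def atrous_branch(rate: int) -> Branch:
    """Stand-in for one Atrous Convolution: records the op applied to its input."""
    return lambda features: f"A{rate}({features})"


def aspp(x: str, branches: List[Branch]) -> List[str]:
    """Parallel: every branch reads the same input and all outputs are kept."""
    return [branch(x) for branch in branches]


def cascade(x: str, branches: List[Branch]) -> List[str]:
    """Sequential: each branch reads the previous output; only the last is kept."""
    for branch in branches:
        x = branch(x)
    return [x]


def waterfall(x: str, branches: List[Branch]) -> List[str]:
    """Sequential like the cascade, but every intermediate output is also kept."""
    outputs = []
    for branch in branches:
        x = branch(x)
        outputs.append(x)
    return outputs


branches = [atrous_branch(rate) for rate in (6, 12, 18, 24)]
for name, module in (("ASPP", aspp), ("cascade", cascade), ("waterfall", waterfall)):
    print(f"{name:9s}", module("x", branches))
```

In this trace the waterfall variant exposes four outputs, like ASPP, while each branch consumes the output of the previous one, like the cascade.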

3.3 Decoder

To process the score maps resulting from the WASP module, a short decoder stage was implemented containing the concatenation with low level features from the first block of the ResNet backbone, convolutional layers, dropout layers, and bilinear interpolations to generate output maps at the same resolution as the input image.

Figure 8 shows the decoder and the respective stage dimensions and number of layers. The representation considers an input image with dimensions of 1920 × 1080 × 3 for width, height, and RGB color, respectively. In this case, the decoder receives 256 maps of dimensions 240 × 135 and 256 low level features of dimension 480 × 270. After matching the dimensions for inputs of the decoder, the layers are concatenated and processed through convolutional layers, dropout, and a final bilinear interpolation to reach the original input size.
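The sizes quoted above imply the upsampling factors applied inside the decoder. The short check below derives them from the stated dimensions only; the helper name is illustrative, and the factors are inferred from the numbers in the text rather than stated by the authors.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height)


def upsampling_factor(source: Size, target: Size) -> Tuple[int, int]:
    """Integer factor needed to resize `source` to `target` along each axis."""
    if target[0] % source[0] or target[1] % source[1]:
        raise ValueError("target is not an integer multiple of source")
    return target[0] // source[0], target[1] // source[1]


input_image = (1920, 1080)
wasp_maps = (240, 135)           # 256 score maps from the WASP module
low_level_features = (480, 270)  # 256 maps from the first ResNet block

# Step 1: bring the WASP maps up to the low-level feature resolution.
print(upsampling_factor(wasp_maps, low_level_features))
# Step 2: final bilinear interpolation back to the input resolution.
print(upsampling_factor(low_level_features, input_image))
# Overall reduction of the WASP maps relative to the input image.
print(upsampling_factor(wasp_maps, input_image))
```

For this example the two stages correspond to factors of 2 and 4, for an overall factor of 8 between the WASP output and the input image.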

Figure 8: Decoder used in the WASPnet method.

4 Experiments

4.1 Datasets

We performed experiments on three datasets used for pre-training, training, validation, and testing. The Microsoft Common Objects in Context (COCO) dataset Lin et al. (2014) was used by Chen et al. (2018) for pre-training, as it includes a large amount of data, providing a good set of starting weights when training with different datasets and consequently increasing the precision of the segmentation.

Pascal Visual Object Class (VOC) 2012 Everingham et al. (2010) is a dataset containing objects in different scenarios including people, animals, vehicles, and indoor objects. It contains three different types of challenges: classification, detection, and segmentation; the latter was utilized in this paper. For the segmentation benchmark, the dataset contains 1464 images for training, 1449 images for validation, and 1456 images for testing annotated for 21 classes. Data augmentation was used to increase the training set size to 10,582.

Cityscapes Cordts et al. (2016) is a larger dataset containing urban scene images recorded in street scenes of 50 different cities with pixel annotations of 25,000 frames. In the Cityscapes dataset, 5000 images are finely annotated at pixel level divided into 2975 images for training, 500 for validation, and 1525 for testing. Cityscapes is annotated in 19 semantic classes divided into 7 categories (construction, ground, human, nature, object, sky, and vehicle).

4.2 Evaluation Metrics

We based our comparison of performance with other methods on the Mean Intersection over Union (mIOU), considered the most important and most widely used metric for semantic segmentation. A pixel-level analysis of detection is conducted for each class, reporting the number of true positive (TP) pixels as a percentage of the union of TP with false negative (FN) and false positive (FP) pixels, that is, TP / (TP + FN + FP); the per-class values are then averaged.
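A minimal sketch of the metric on flattened label maps is given below. It is not the evaluation code used for the benchmarks; in particular, skipping classes that appear in neither the prediction nor the ground truth is an assumption about how empty classes are handled.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], truth: Sequence[int], num_classes: int) -> float:
    """Average of TP / (TP + FP + FN) over the classes present in pred or truth."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        tp = sum(p == c and t == c for p, t in zip(pred, truth))
        fp = sum(p == c and t != c for p, t in zip(pred, truth))
        fn = sum(p != c and t == c for p, t in zip(pred, truth))
        union = tp + fp + fn
        if union:  # class absent from both maps: no contribution (assumption)
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
# Per-class IOU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
print(round(mean_iou(pred, truth, num_classes=3), 4))
```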

4.3 Simulation Parameters

We calculate the learning rate based on the polynomial method (“poly”) Liu et al. (2015), also adopted in Chen et al. (2018). The poly learning rate, which results in more effective updating of the weights than the traditional “step” learning rate, is given as

$$lr = lr_0 \cdot \left(1 - \frac{iter}{max\_iter}\right)^{power} \qquad (4)$$

where $lr_0$ is the initial learning rate, $iter$ is the current iteration, $max\_iter$ is the total number of iterations, and $power = 0.9$ was employed. We utilized a batch size of eight due to physical memory constraints in the hardware available, lower than the batch size of ten used by DeepLab. A subtle improvement in training with a larger batch size is expected for the architectures proposed.
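The poly schedule can be written as a one-line function, sketched below. The default `power=0.9` follows the value used with this schedule by DeepLab; the base learning rate and iteration count in the usage lines are hypothetical placeholders, since this section does not state them.

```python
def poly_learning_rate(
    base_lr: float, iteration: int, max_iterations: int, power: float = 0.9
) -> float:
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1 - iteration / max_iterations) ** power


# Hypothetical settings, chosen only to show the shape of the decay.
base_lr, max_iterations = 0.01, 20000
for iteration in (0, 5000, 10000, 15000, 20000):
    print(iteration, f"{poly_learning_rate(base_lr, iteration, max_iterations):.6f}")
```

Unlike a step schedule, which holds the rate constant and then drops it abruptly, this schedule decays smoothly from the base rate at the first iteration to zero at the last.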

We experimented with different rates of dilation on WASP and found that larger rates generally result in better mIOU. The set of rates {6, 12, 18, 24} was selected for the WASP module. In addition, we performed pre-training using the MS-COCO dataset Lin et al. (2014), and data augmentation by rescaling randomly selected images by a factor between 0.5 and 1.5.

5 Results

Following training, validation, and testing procedures, the WASPnet architecture was implemented utilizing the WASP module, the Res2Net-Seg module, or the ASPP module. The validation mIOU results are presented in Table 1 for the Pascal VOC dataset. When following similar guidelines as in Chen et al. (2018) for training and hyperparameters, and using the WASP module, an mIOU of 80.22% is achieved without the need for CRF postprocessing. Our WASPnet resulted in a relative gain of 5.07% on the validation set (80.22% versus 76.35% for DeepLab) and reduced the number of parameters by 20.69%.

| Architecture | Number of Parameters | Parameter Reduction | mIOU |
|---|---|---|---|
| WASPnet-CRF (ours) | 47.482 M | 20.69% | 80.41% |
| WASPnet (ours) | 47.482 M | 20.69% | 80.22% |
| Res2Net-Seg-CRF | 50.896 M | 14.99% | 80.12% |
| Res2Net-Seg | 50.896 M | 14.99% | 78.53% |
| Deeplab-CRF Chen et al. (2018) | 59.869 M | - | 77.69% |
| Deeplab Chen et al. (2018) | 59.869 M | - | 76.35% |

Table 1: Pascal Visual Object Class (VOC) validation set results.
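The percentages in Table 1 can be reproduced from the raw entries, which also makes explicit that the quoted gain in mIOU is relative to the DeepLab baseline rather than a difference in percentage points. The helper below is an illustrative check, not part of the original evaluation.

```python
def relative_change_percent(value: float, baseline: float) -> float:
    """Change of `value` with respect to `baseline`, in percent of the baseline."""
    return (value - baseline) / baseline * 100


deeplab_params, deeplab_miou = 59.869, 76.35  # millions of parameters, mIOU in %
waspnet_params, waspnet_miou = 47.482, 80.22
res2net_seg_params = 50.896

# Parameter reductions relative to DeepLab (sign flipped to report a reduction).
print(round(-relative_change_percent(waspnet_params, deeplab_params), 2))
print(round(-relative_change_percent(res2net_seg_params, deeplab_params), 2))

# mIOU gain of WASPnet over DeepLab: relative, and in absolute percentage points.
print(round(relative_change_percent(waspnet_miou, deeplab_miou), 2))
print(round(waspnet_miou - deeplab_miou, 2))
```

The absolute difference between 80.22% and 76.35% is 3.87 percentage points, which corresponds to the 5.07% relative gain quoted in the text.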

The Res2Net-Seg approach results in an mIOU of 78.53% without CRF, achieves mIOU of 80.12% with CRF, and reduces the number of parameters by 14.99%. The Res2Net-Seg approach still shows benefits with the incorporation of CRF as postprocessing, similar to the cascade and ASPP methods.

Overall, the WASP architecture provides the best result and the highest reduction in parameters. Sample results for the WASPnet architecture are shown in Figure 9 for validation images from the Pascal VOC dataset Everingham et al. (2010). Note, from the generated segmentation, that our method presents a better definition in the detection shape, being closer to the ground-truth when compared to previous methods utilizing ASPP (DeepLab).

We tested the effects of different dilation rates (in our WASP module) on the final segmentation. In our tests, all kernel sizes were set to 3, following the procedures in Chen et al. (2018). Table 2 reports the accuracy, in mIOU, for the Pascal VOC dataset for different dilation rates in the WASP module. The configuration with dilation rates of {6, 12, 18, 24} resulted in the best accuracy for the Pascal VOC dataset; therefore, the following tests were conducted using these dilation rates.

| WASP Dilation Rates | mIOU |
|---|---|
| {2, 4, 6, 8} | 79.61% |
| {4, 8, 12, 16} | 79.72% |
| {6, 12, 18, 24} | 80.22% |
| {8, 16, 24, 32} | 79.92% |

Table 2: Pascal VOC validation set results for different sets of dilation in the WASP module.

We also experimented with postprocessing using CRF. The application of CRF has the benefit of better defining the shapes of the segmented areas. Similarly to the procedures followed in Chen et al. (2018), we performed parameter tuning, for the parameters of Equation (3), by varying $\omega_1$ between 3 and 6, $\sigma_\alpha$ from 30 to 100, and $\sigma_\beta$ from 3 to 6, while fixing both $\omega_2$ and $\sigma_\gamma$ to 3.
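The tuning described above amounts to a small grid search over three CRF parameters with the remaining two held fixed. The sketch below only enumerates the candidate settings. The step sizes (1 for the first kernel weight and the color scale, 10 for the position scale of the first kernel) are assumptions, since the text gives only the ranges, and the dictionary keys are illustrative names.

```python
from itertools import product

# Ranges taken from the text; the step sizes are assumed.
w1_values = range(3, 7)                   # 3, 4, 5, 6
sigma_alpha_values = range(30, 101, 10)   # 30, 40, ..., 100
sigma_beta_values = range(3, 7)           # 3, 4, 5, 6
FIXED = {"w2": 3, "sigma_gamma": 3}       # held constant during tuning

grid = [
    {"w1": w1, "sigma_alpha": sigma_alpha, "sigma_beta": sigma_beta, **FIXED}
    for w1, sigma_alpha, sigma_beta in product(
        w1_values, sigma_alpha_values, sigma_beta_values
    )
]

print(len(grid), "candidate CRF settings")
print(grid[0])
```

Each candidate would then be scored on the validation set by mIOU and the best setting retained; that evaluation step is left out because it depends on the CRF implementation in use.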

Figure 9: Results sample for Pascal VOC dataset Everingham et al. (2010).

The addition of CRF postprocessing to our WASPnet method resulted in a modest increase of 0.2% in the mIOU for both the validation and test sets of the Pascal VOC dataset. The gains from using CRF are less significant than those in Chen et al. (2018), due to more efficient use of FOV by WASPnet. The effects of CRF on accuracy were not consistent across different classes. Classes with objects that do not have extremities, such as bottle, car, bus, and train, benefited most, whereas there was a decrease in accuracy for classes with more delicate boundaries such as bicycle, plant, and motorcycle.

Results on the Pascal VOC test set are shown in Table 3. The additional training dataset column refers to DeepLabv3 types of models where a ResNet-101 model was pretrained on both ImageNet Deng et al. (2009) and JFT-300M Sun et al. (2017) when performing the test challenge for Pascal VOC. JFT-300M is Google’s internal dataset of 300 million images labeled in 18,291 categories, and therefore these results cannot be compared directly to other external architectures, including this work. The addition of the JFT dataset for training allows the architecture to achieve performance improvements that are not possible without such a large number of training samples. Note that training of the WASPnet network was performed only on the training dataset provided by the challenge, consisting of 1464 images. Based on these results, WASPnet outperforms all of the other methods that are trained on the same dataset.

| Architecture | Additional Training Dataset Used | mIOU |
|---|---|---|
| DeepLabv3+ Chen et al. (2018) | JFT-300M Sun et al. (2017) | 87.8% |
| Deeplabv3 Chen et al. (2017) | JFT-300M Sun et al. (2017) | 85.7% |
| Auto-DeepLab-L Liu et al. (2019) | JFT-300M Sun et al. (2017) | 85.6% |
| Deeplab Chen et al. (2018) | JFT-300M Sun et al. (2017) | 79.7% |
| WASPnet-CRF (ours) | - | 79.6% |
| WASPnet (ours) | - | 79.4% |
| Dilation Yu and Koltun (2016) | - | 75.3% |
| CRFasRNN Zheng et al. (2015) | - | 74.7% |
| ParseNet Liu et al. (2015) | - | 69.8% |
| FCN 8s Long et al. (2015) | - | 67.2% |
| Bayesian SegNet Kendall et al. (2015) | - | 60.5% |

Table 3: Pascal VOC test set results.

WASPnet was also used with the Cityscapes dataset Cordts et al. (2016) following similar procedures. Table 4 shows the results obtained for Cityscapes, resulting in an mIOU of 74.0%, a relative gain of 4.2% over Chen et al. (2018). The Res2Net-Seg version of the network achieved 72.1% mIOU.

| Architecture | Number of Parameters | Parameter Reduction | mIOU |
|---|---|---|---|
| WASPnet (ours) | 47.482 M | 20.69% | 74.0% |
| WASPnet-CRF (ours) | 47.482 M | 20.69% | 73.2% |
| Res2Net-Seg (ours) | 50.896 M | 14.99% | 72.1% |
| Deeplab-CRF Chen et al. (2018) | 59.869 M | - | 71.4% |
| Deeplab Chen et al. (2018) | 59.869 M | - | 71.0% |

Table 4: Cityscapes validation set results.

For both WASP and Res2Net-Seg architectures tested on the Cityscapes dataset, the CRF postprocessing did not have much benefit. A similar result was found with DeepLab where CRF resulted in a small improvement of the mIOU. The higher resolution and shape of detected instances in the Cityscapes dataset likely affected the effectiveness of the CRF. With Cityscapes, we used a batch size of 4 due to hardware constraints during training; other architectures have used batch sizes of up to ten.

Table 5 shows the results of WASPnet on the Cityscapes test set. WASPnet achieved an mIOU of 70.5% and outperformed other architectures trained on the same dataset. We only performed training on the finely annotated images from the Cityscapes dataset, containing 2975 images, whereas the DeepLabv3 style architectures used larger datasets for training, such as JFT-300M containing 300 million images for pre-training and the coarsely annotated Cityscapes set containing 20,000 images.

| Architecture | Additional Training Dataset Used | mIOU |
|---|---|---|
| Auto-DeepLab-L Liu et al. (2019) | Coarse Cityscapes Cordts et al. (2016) | 82.1% |
| DeepLabv3+ Chen et al. (2018) | Coarse Cityscapes Cordts et al. (2016) | 82.1% |
| WASPnet (ours) | - | 70.5% |
| Deeplab Chen et al. (2018) | - | 70.4% |
| Dilation Yu and Koltun (2016) | - | 67.1% |
| FCN 8s Long et al. (2015) | - | 65.3% |
| CRFasRNN Zheng et al. (2015) | - | 62.5% |
| ENet Paszke et al. (2016) | - | 58.3% |
| SegNet Badrinarayanan et al. (2015) | - | 55.6% |
| Mask-RCNN He et al. (2017) | - | 49.9% |

Table 5: Cityscapes test set results.

Figure 10 shows examples of Cityscapes image segmentations with the WASPnet method. Like our observations from the Pascal VOC dataset, our method produces better defined shapes for the segmentation compared to DeepLab. Our results are closer to the ground-truth data, and show better segmentation of smaller objects that are further away from the camera.

Figure 10: Results sample for Cityscapes dataset Cordts et al. (2016).

Our results in Table 4 illustrate that postprocessing with CRF slightly decreased the mIOU by 0.8% in the Cityscapes dataset: CRF has difficulty dealing with delicate boundaries, which are common in the Cityscapes dataset. With WASPnet, the presence of larger FOV due to the WASP module is able to offset the potential gains of the CRF module from previous networks. An additional limitation is that CRF requires substantial extra time for processing. For these reasons, we conclude that WASPnet can be used without CRF postprocessing.

Fail Cases

Classes that contain more delicate, and consequently harder to accurately detect, shapes contribute the most to segmentation errors. Particularly, tables, chairs, leaves, and bicycles present a bigger challenge to segmentation networks. These classes also resulted in a lower accuracy when applying CRF. Representative examples of fail cases are shown in Figure 11 for classes chair and bicycle, which are the most difficult to segment. Even in these cases, WASPnet (without CRF) is able to better detect the general shape compared to DeepLab.

Figure 11: Occurrence of fail cases to detect more delicate boundaries

6 Conclusions

We propose a “Waterfall” architecture based on the WASP module for efficient semantic segmentation that achieves high mIOU scores on the Pascal VOC and Cityscapes datasets. The smaller size of this efficient architecture improves its functionality and reduces the risk of overfitting without the need for postprocessing with the time-consuming CRF. The results of WASPnet segmentation demonstrated superior performance compared to Res2Net-Seg and DeepLab. This work provides the foundation for further application of WASP in a broader range of applications for more efficient multiscale analysis.

Author Contributions: Conceptualization, B.A. and A.S.; methodology, B.A.; algorithm and experiments, B.A. and A.S.; original draft preparation, B.A. and A.S.; review and editing, B.A. and A.S.; supervision, A.S.; project administration, A.S.; funding acquisition, A.S.

Funding: This research was funded in part by National Science Foundation grant number 1749376.

Conflicts of Interest: The authors declare no conflict of interest.

The following abbreviations are used in this manuscript:

ASPP Atrous Spatial Pyramid Pooling
COCO Common Objects in Context
CNN Convolutional Neural Networks
CRF Conditional Random Fields
ENet Efficient Neural Network
FCN Fully Convolutional Networks
FN False Negative
FOV Field-of-View
FP False Positive
LSTM Long Short-Term Memory
LSTM-CF Long Short-Term Memory Context Fusion
rCNN Recurrent Convolutional Neural Networks
mIOU Mean Intersection over Union
RGB Red, Green, and Blue
RNN Recurrent Neural Networks
SE Squeeze-and-Excitation
TP True Positive
VOC Visual Object Class
WASP Waterfall Atrous Spatial Pooling