Spectral Analysis for Semantic Segmentation with Applications on Feature Truncation and Weak Annotation

12/28/2020
by   Li-Wei Chen, et al.
National Chiao Tung University

Current neural networks for semantic segmentation usually predict the pixel-wise semantics on a down-sampled grid of the image to alleviate the computational cost of dense maps. However, the accuracy of the resulting segmentation maps may also be degraded, particularly in the regions near object boundaries. In this paper, we investigate the sampling efficiency of the down-sampled grid in depth. By applying a spectral analysis that examines the network back-propagation process in the frequency domain, we discover that cross-entropy is mainly contributed by the low-frequency components of segmentation maps, as well as by those of the features in CNNs. The network performance is maintained as long as the resolution of the down-sampled grid meets the cut-off frequency. This finding leads us to propose a simple yet effective feature-truncation method that limits the feature size in CNNs and removes the associated high-frequency components. The method not only reduces the computational cost but also maintains the performance of semantic segmentation networks. Moreover, it can be seamlessly integrated with typical network pruning approaches for further model reduction. In addition, we propose a block-wise weak annotation for semantic segmentation that captures the low-frequency information of the segmentation map and is easy to collect. Using the proposed analysis scheme, one can easily estimate the efficacy of the block-wise annotation and the feature-truncation method.


1 Introduction

Semantic segmentation, which densely assigns a semantic label to each image pixel, is one of the important topics in computer vision. Recently, CNNs based on the encoder-decoder architecture [16, 4, 28, 2] have achieved striking performance on several segmentation benchmarks [19, 7, 5, 29, 1]. Generally, an encoder-decoder architecture consists of an encoder module that extracts context information and gradually reduces the spatial resolution of the features to save computational cost, and a decoder module that aggregates the information from the encoder and gradually recovers the spatial resolution of the dense segmentation map. Among these works, Long et al. first propose the Fully Convolutional Network (FCN) [16], which predicts the dense segmentation map via a skip architecture, where features of different granularities in the encoder are up-sampled and integrated in the decoder; yet it still faces the challenge of acquiring accurate object boundaries in the segmentation map. A similar idea can be observed in U-Net [23], which further adds dense skip-connections between the corresponding down-sampling and up-sampling modules of the same feature dimensions; as a result, boundary localization is improved but not fully resolved. Beyond skip connections, Chen et al. propose the DeepLab models [2, 3, 4], which integrate the atrous spatial pyramid pooling (ASPP) module and the dense conditional random field (dense CRF) [13]: the former employs dilated convolutional layers composed of filters at multiple sampling rates, capturing contextual information at various spatial resolutions, while the latter boosts the edge response at object boundaries. Moreover, Kirillov et al. [12] propose the PointRend module, which outputs an initial segmentation map and then adaptively refines it with extra point-wise predictions along object boundaries.

Clearly, improving the edge response at object boundaries has become a challenge for existing segmentation networks. In the frequency domain, the edge response corresponds to the accuracy in the high-frequency region of segmentation maps. However, most existing methods predict the dense segmentation map on a low-resolution image grid, which is then up-sampled to the original image resolution to save computational cost [4, 16]. Such a low-resolution grid inevitably limits the edge response of the segmentation networks, and it is unclear whether these low-resolution grids are sufficient to capture the information of segmentation maps. On the other hand, it is also critical to understand, in the frequency domain, how these networks learn the semantics from the groundtruth annotation. In fact, networks can learn sufficient semantic content from weak annotations, which keep only the locations or the coarse contours of objects [20, 11], and achieve accuracy comparable to networks trained on the full groundtruth annotation. These results indicate that:
(1) Most of the semantic content in segmentation maps can be learned from the low-frequency region. This further rationalizes the low-resolution grid of the predicted segmentation maps in existing segmentation networks.
(2) Despite the inaccurate object boundaries of weak annotations, the trained networks can still tolerate the induced annotation noise and learn the semantics.
In other words, the networks are insensitive to the noise associated with semantic edges. Such noise robustness helps the networks adapt to real-world datasets such as satellite and aerial images, for which accurate annotation is almost impossible at large scale. For aerial images, annotation noise can come from various sources and thus appears at different scales [18]. Many works try to increase noise robustness by proposing noise-robust training objectives [27, 25, 26]; yet they discuss neither the scale dependency of annotation noise nor the response in the frequency domain. Lastly, the general training framework for segmentation networks optimizes a training objective such as cross-entropy (CE) and evaluates with the intersection-over-union (IoU) score. The correlation among the sampling frequency, CE, the IoU score, and the edge response remains unclear. Thus, further investigation of the sampling efficiency and the frequency response of a semantic segmentation network is indispensable for a clear evaluation of network performance.

For semantic segmentation, the response in the frequency domain thus turns out to be critical yet unclear, as summarized above. For simpler tasks such as 1D regression, it has been found that a network tends to learn the low-frequency components of the target signal in the early training stage [21, 22, 17]; such a tendency is also known as spectral bias [21]. The spectral bias indicates that it is easier for a network to learn low-frequency targets, which is consistent with the trend in semantic segmentation. However, these works only study the regression of data with various frequencies, without considering the prior distribution of segmentation maps. For semantic segmentation, the frequency distribution of the signal in the segmentation maps should be considered. This further motivates us to study the frequency distribution of the signal as well as its contribution to the training objective function and the evaluation metric.

In this work, we propose a spectral analysis that provides a theoretical point of view on the issues mentioned above. We analyze the common objective function and evaluation metric for semantic segmentation (i.e., CE and the IoU score, respectively), as well as the gradient propagation within CNNs, in the frequency domain. Our analysis demonstrates the following novel discoveries:

  • The cross-entropy (CE) can be explicitly decomposed into a summation of frequency components; we can thus evaluate their contributions to CE and further estimate the sampling efficiency of the low-resolution grid for segmentation maps, including the segmentation output and the groundtruth annotation. We find that CE is mainly contributed by the low-frequency components of the segmentation maps.

  • The decomposition of CE inspires us to further deduce the correlation, in the frequency domain, between the segmentation logits and the features within CNNs. We discover that the segmentation logits at a specific frequency are mainly affected by the features at the same frequency.

  • Based on the two findings above, the high-frequency components of smooth features (e.g., those in the decoder of a CNN) are found to be negligible; our experiments show that truncating these high-frequency components does not degrade the performance of semantic segmentation networks.

  • Frequency analysis of the IoU score reveals its close correlation to CE. This justifies the use of the IoU metric for evaluating segmentation networks optimized with the CE objective.

The findings above contribute to semantic segmentation networks through the following two applications:
(1) Feature truncation for segmentation networks. The features in the decoder, which are generally smoother than those in the encoder, can be truncated to reduce the computational cost. This truncation method can also be easily integrated with commonly-used pruning approaches [15, 10] for further cost reduction, since pruning instead removes redundant filters or weights and is independent of the feature size. Moreover, one can determine the efficient sizes of these features by validating the sampling efficiency of the corresponding band limits;
(2) Block-wise annotation. As a novel weak annotation for semantic segmentation, it is cheaper to collect than the full pixel-wise groundtruth while keeping the low-resolution information of the groundtruth annotation. The intuition behind the block-wise annotation is similar to existing weak annotations that utilize only the coarse contours of the instances in the segmentation map [20, 11] and thus contain only low-frequency information. The block-wise annotation can be directly associated with the spectral information, and its efficiency is well explained by our analysis scheme.

2 Proposed Spectral Analysis

As motivated above, we investigate the sampling efficiency of semantic segmentation networks by applying a spectral analysis to the cross-entropy (CE) objective function and the intersection-over-union (IoU) evaluation metric. In sections 2.1 and 2.2, CE and the IoU score are decomposed into their frequency components, respectively. In section 2.3, we further deduce the gradient propagation through convolutional layers in order to demonstrate the correlation between the segmentation output and the features in CNNs.

Notation

The notations in this section are defined as follows. In general, an upper-case letter, e.g., $A(x)$, denotes a functional in the spatial domain, while the corresponding lower-case letter, e.g., $a(k)$, denotes its spectrum in the frequency domain, obtained by the Fourier transform of the spatial functional: $a(k) = \mathcal{F}[A](k) = \int A(x)\, e^{-i 2\pi k x}\, dx$. The remaining notations are defined where they first appear.

2.1 Spectral Decomposition of Cross-Entropy

Let $Z_c(x)$ denote the segmentation logits produced by a semantic segmentation network and $Y_c(x)$ denote the groundtruth annotation, in which $c$ and $x$ are the indexes for the object class and image pixel, respectively. The commonly-used objective function for learning semantic segmentation, cross-entropy (CE), can be written as

$\mathrm{CE} = -\sum_c \int Y_c(x)\, \log P_c(x)\, dx$,    (1)

where $P_c(x) = \exp(Z_c(x)) / \sum_{c'} \exp(Z_{c'}(x))$ is the softmax output of the logits. For each class $c$, the integral can be transformed to the frequency domain as follows (see Lemma 1 of the supplementary):

$\int Y_c(x)\, \log P_c(x)\, dx = \int y_c^*(k)\, p_c(k)\, dk$,    (2)

where $p_c(k)$ and $y_c(k)$ are the spectrum of the segmentation logits $\log P_c(x)$ and that of the groundtruth annotation $Y_c(x)$, respectively, and $y_c^*(k)$ denotes the complex conjugate of $y_c(k)$. For simplicity, we hereafter refer to $p_c(k)$ as the segmentation spectrum. The overall CE as written in Eq. 1 is hence given as

$\mathrm{CE} = -\sum_c \int y_c^*(k)\, p_c(k)\, dk$.    (3)

The discrete form of Eq. 3 is

$\mathrm{CE} = -\sum_c \sum_k y_c^*(k)\, p_c(k)$.    (4)

By Eq. 4, we can thus decompose CE as the summation of components $E(k)$ over the frequency domain, where

$E(k) = -\sum_c y_c^*(k)\, p_c(k)$;    (5)

we hereafter name $E(k)$ the frequency components of CE. Moreover, based on such spectral analysis, the contribution from each frequency component to CE can be evaluated. Later in section 3.1, we will further demonstrate that CE is mainly contributed by the low-frequency components.
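To make the decomposition concrete, the following minimal numpy sketch (our own illustration, not the authors' released code; all function names are ours) computes the frequency components of CE for a discrete segmentation map via the discrete Parseval relation and checks that they sum back to the pixel-wise CE:

```python
import numpy as np

def ce_frequency_components(probs, onehot):
    """Per-frequency components E(k) of the cross-entropy (Eqs. 4-5).

    probs:  (C, H, W) softmax probabilities P_c(x).
    onehot: (C, H, W) one-hot groundtruth Y_c(x).
    Returns a real (H, W) array whose sum equals the spatial cross-entropy.
    """
    C, H, W = probs.shape
    e = np.zeros((H, W))
    for c in range(C):
        y = np.fft.fft2(onehot[c])          # annotation spectrum y_c(k)
        q = np.fft.fft2(np.log(probs[c]))   # segmentation spectrum p_c(k)
        # discrete Parseval: sum_x Y(x) Q(x) = (1/N) sum_k y*(k) q(k)
        e += -np.real(np.conj(y) * q) / (H * W)
    return e

# sanity check: the components sum back to the pixel-wise cross-entropy
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 8, 8))
probs = np.exp(logits) / np.exp(logits).sum(0, keepdims=True)
onehot = np.eye(3)[rng.integers(0, 3, size=(8, 8))].transpose(2, 0, 1)
e = ce_frequency_components(probs, onehot)
assert np.allclose(e.sum(), -(onehot * np.log(probs)).sum())
```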

2.2 Spectral Analysis of Intersection-over-Union Score

Given the segmentation logits and the groundtruth annotation, the intersection-over-union (IoU) score is typically defined as $\mathrm{IoU} = I/U$, where $I$ and $U$ are the intersection and the union between the binarized segmentation output and the annotation. In order to analyze the IoU score in the frequency domain, we extend the above definition to the continuous space as follows:

$\mathrm{IoU} = \dfrac{\int P(x)\, Y(x)\, dx}{\int \big(P(x) + Y(x) - P(x)\, Y(x)\big)\, dx}$,    (6)

where $x$ denotes the pixel index and $P(x)$ is the softmax segmentation output. Please note that Eq. 6 holds for each object class, where we skip the class index $c$ for simplicity. Besides, this definition is equivalent to the original definition of the IoU score for binarized segmentation maps. The components in Eq. 6 can be written in the frequency domain as follows (see Lemma 1 and Lemma 2 of the supplementary):

$\int P(x)\, Y(x)\, dx = \int \rho(k)\, y^*(k)\, dk$,  $\int P(x)\, dx = \rho(0)$,  $\int Y(x)\, dx = y(0)$,    (7)

where $\rho(k)$ is the spectrum of $P(x)$. The IoU score can hence be written as

$\mathrm{IoU} = \dfrac{\int \rho(k)\, y^*(k)\, dk}{\rho(0) + y(0) - \int \rho(k)\, y^*(k)\, dk}$,    (8)

and it is composed of two terms: the union $U$ and the intersection $I = \int \rho(k)\, y^*(k)\, dk$. The latter is positively correlated to the corresponding component of CE in Eq. 1, since the $\log$ function is monotonically increasing. As a result, the minimization of CE maximizes $I$ as well as the IoU score. This derivation justifies the rationality of the common learning procedure for semantic segmentation models: the networks are trained with the CE objective while being validated by the IoU score. In addition, although the IoU score in Eq. 8 cannot be explicitly decomposed into frequency components as Eq. 3 due to its non-linearity, the positive correlation between IoU and CE still helps to clarify its connection to the edge response (i.e., the high-frequency response) which most segmentation models target to improve. Therefore, in our experiments and analysis, we adopt the decomposition of CE to study the frequency response (as well as the edge response) and take IoU as a reasonable evaluation metric.
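As an illustration, the continuous-space IoU of Eq. 6 can be evaluated directly on soft network outputs; the sketch below assumes the standard soft relaxation of intersection and union implied by Eq. 6, and reduces to the usual IoU for binarized inputs (names are ours):

```python
import numpy as np

def soft_iou(probs, onehot):
    """Continuous-space IoU of Eq. 6, evaluated per class.

    With binarized probs this reduces to the usual intersection-over-union:
    I = sum(P * Y), U = sum(P) + sum(Y) - I.
    """
    inter = (probs * onehot).sum(axis=(1, 2))          # I, one value per class
    union = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2)) - inter
    return inter / union
```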

2.3 Spectral Gradient of Convolutional Layers

In section 2.1, we demonstrated that CE can be decomposed into the summation of frequency components $E(k)$. Here we further deduce the gradient propagation of $E(k)$ within CNNs in order to reveal how $E(k)$ updates the network; we hereafter refer to gradient propagation in the frequency domain as the spectral gradient. For simplicity, we deduce the spectral gradient of a convolutional layer, consisting of a convolution and an activation function:

$F(x) = \sigma\big((W * G)(x)\big)$,    (9)

where $W$ is the kernel, $\sigma$ is the activation function, $G$ is the input feature, and $F$ is the output of the convolutional layer. We consider the soft-plus activation $\sigma(u) = \ln(1 + e^u)$, since it is everywhere differentiable and thus eases the analysis. We now derive the spectral gradient for Eq. 9, that is, the gradient of the output spectrum $f(k')$ with respect to the input spectrum $g(k)$, i.e., $\partial f(k') / \partial g(k)$. Assuming the kernel $W$ is small and $|(W * G)(x)| \ll 1$, the gradient of $f$ at frequency $k'$ with respect to $g$ at frequency $k$ is written as follows (see Lemma 3 of the supplementary):

$\dfrac{\partial f(k')}{\partial g(k)} \approx \dfrac{1}{2}\, w(k)\, \delta_{k',k}$,    (10)

where $w(k)$ is the spectrum of the kernel $W$ and $\delta_{k',k}$ is the Kronecker delta. These assumptions rely on the fact that the numeric scales of the features and kernels are usually limited to a small range of values for the numeric stability of networks. Under these assumptions, the spectral gradient in Eq. 10 can be approximated by a delta function, which reveals how the variation of $g(k)$ affects $f(k')$: due to the property of the delta function, $f(k)$ is affected only by $g(k)$ at the same frequency.

Following the spectral gradient in Eq. 10, we now derive the spectral gradient of $E(k)$ with respect to the features in a convolutional layer. For each semantic class $c$ in the segmentation maps, let $w_c(k)$ and $p_c(k)$ denote the spectrum of the kernel and that of the segmentation output for a convolutional layer, respectively. By Eq. 5 and Eq. 10, the spectral gradient of $E(k)$ can be written as follows (see Lemma 4 of the supplementary):

$\dfrac{\partial E(k)}{\partial g(k')} \approx -\dfrac{1}{2} \sum_c y_c^*(k)\, w_c(k)\, \delta_{k,k'}$.    (11)

This results in a complicated spectral gradient consisting of $y_c^*(k)$ and $w_c(k)$. Recall that the segmentation map is usually predicted on a low-resolution grid of the original image, as mentioned in section 1. As a result, the segmentation map should be smooth, such that its spectrum is small when $k$ is large. Hence,

$\dfrac{\partial E(k)}{\partial g(k')} \approx 0$  when $k = k'$ is large.    (12)

Furthermore, $E(k)$ itself should be small at high frequencies, since it is the product of the spectra of the segmentation maps, i.e., $y_c^*(k)$ and $p_c(k)$, as shown in Eq. 5. We therefore neglect the high-frequency region of $E(k)$ and focus on the case when $k$ is small. In this case, $\partial E(k) / \partial g(k')$ is negligible when $k'$, as well as $k$, is large. Namely, modifying the features at a high frequency $k'$ does not affect $E(k)$ at a low frequency $k$. We provide a numerical validation of this observation for various cases of $k$ and $k'$ in section 3.1.

2.4 Discussion for Spectral Analysis

In this section, we summarize the discussion of the spectral analysis based on the theoretical results above. As mentioned in section 1, the segmentation map is usually predicted on a low-resolution grid of the original image, yet it is unclear whether this grid is sufficient to capture most of the information of CE. Following the decomposition of CE in Eq. 4, we define the truncated CE as the summation of the frequency components of CE filtered by a low-pass filter with band limit $B$:

$\mathrm{CE}_B = \sum_{|k| \le B} E(k)$.    (13)

The loss of CE due to the band limit can thus be estimated from the discrepancy between $\mathrm{CE}_B$ and $\mathrm{CE}$. Accordingly, an efficient grid can be defined as one whose band limit makes this discrepancy negligible. These efficient grids can be applied to both the segmentation output and the groundtruth annotation, since $E(k)$ is related to the product of $p_c(k)$ and $y_c^*(k)$, as shown in Eq. 5. Furthermore, based on the discussion of Eq. 12, the features can also be truncated at high frequencies, i.e., where $k'$ is large. Based on the above observations, we can apply these efficient grids not only to the groundtruth annotations but also to the features in the decoder of the CNN.
In the following section, we apply the efficient grids to the features and the groundtruth annotation and validate the positive correlation between the loss of CE and the loss of IoU score caused by the efficient grids. The truncated CE as well as the efficient grids can therefore serve as an informative reference for removing the high-frequency components of the segmentation output and the groundtruth annotation in our applications.

3 Validation and Applications

In section 2, we proposed a spectral analysis of the deep-learning framework from the perspective of the frequency domain, with the frequency components of CE and the spectral gradient given in Eq. 5 and Eq. 11, respectively. In section 3.1, we validate the proposed spectral analysis, including the spectral decomposition of CE and the spectral gradients, on the various segmentation datasets and models described below. Based on these numeric validations, we identify the efficient grids and apply them to the features in CNNs and to the groundtruth annotation. This leads us to two novel applications: (1) feature truncation and (2) block-wise annotation, which are detailed in sections 3.2 and 3.3, respectively.

Datasets.

We conduct experiments on the following three semantic segmentation datasets: the PASCAL semantic segmentation benchmark [7], the DeepGlobe land-cover classification challenge [6], and the Cityscapes pixel-level semantic labeling task [5] (denoted as PASCAL, DeepGlobe and Cityscapes, respectively). The PASCAL dataset contains 21 categories, 1464 training images, and 1449 validation images; the dataset is further augmented with the extra annotations from [8]. The DeepGlobe dataset contains 7 categories and 803 training images, which are split into 701 and 102 images for training and validation, respectively. The Cityscapes dataset contains 19 categories, 2975 training images, and 500 validation images.

Segmentation models and implementation details.

In our experiments, we utilize the standard segmentation models DeepLab v3+ [4] and Deep Aggregation Net (DAN) [14], adopting ResNet-101 [9] pre-trained on ImageNet-1k [24] as their backbone. The models are trained with the following policies. For all datasets, the images are randomly cropped to 513×513 pixels and randomly flipped during training; the training batch size is 8. For the PASCAL dataset, the model is trained with an initial learning rate of 0.0007 for 100 epochs; for the DeepGlobe dataset, with an initial learning rate of 0.007 for 600 epochs; and for the Cityscapes dataset, with an initial learning rate of 0.001 for 200 epochs.

3.1 Validation of Spectral Analysis

Spectral Decomposition of CE.

Recall from Eq. 4 that $\mathrm{CE} = \sum_k E(k)$ with $E(k) = -\sum_c y_c^*(k)\, p_c(k)$, and that the segmentation map is up-sampled from a low-resolution grid, which makes $p_c(k)$ small in the high-frequency region; $E(k)$ should therefore also be small in the high-frequency region. To elaborate the limitation of such a low-resolution grid, we evaluate the truncated CE of Eq. 13 and further define the relative discrepancy between the truncated CE and CE,

$\Delta_B = \dfrac{\mathrm{CE} - \mathrm{CE}_B}{\mathrm{CE}}$.    (14)

In our experiments, $B$ is at most 256, which is half the size of the segmentation map, i.e., 513. Hence $\mathrm{CE}_{256} = \mathrm{CE}$ and $\Delta_{256} = 0$.
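Given the components $E(k)$ (e.g., from the ce_frequency_components sketch in section 2.1), $\mathrm{CE}_B$ and $\Delta_B$ follow directly; the square band-limit mask below is our reading of the low-pass filter in Eq. 13:

```python
import numpy as np

def truncated_ce(e, band_limit):
    """CE_B of Eq. 13: sum of the components E(k) within band limit B."""
    H, W = e.shape
    ky = np.minimum(np.arange(H), H - np.arange(H))   # wrap-around |k| per axis
    kx = np.minimum(np.arange(W), W - np.arange(W))
    mask = (ky[:, None] <= band_limit) & (kx[None, :] <= band_limit)
    return e[mask].sum()

def relative_discrepancy(e, band_limit):
    """Delta_B of Eq. 14, the relative loss of CE under the band limit."""
    ce = e.sum()
    return (ce - truncated_ce(e, band_limit)) / ce
```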

We evaluate $E(k)$, $\Delta_B$, and the magnitudes of the spectra $p_c(k)$ and $y_c(k)$ for the various segmentation models (DeepLab v3+ and DAN) and datasets (PASCAL, DeepGlobe and Cityscapes); the spectra are averaged over all semantic classes. In addition, the evaluation is performed at both the initial and final stages of model training, in order to monitor the training progress. The results are shown in Fig. 1 and Fig. 2. The results in Fig. 1 indicate that the segmentation spectrum is indeed small in the high-frequency region, leading to small $E(k)$, as discussed above. These results also support the fact that CE is mainly contributed by the low-frequency components. On the other hand, the results in Fig. 2 reveal that the low-frequency components of $E(k)$ decrease apparently as training progresses, suggesting that the model learns to capture the low-resolution grid more effectively.

We further investigate the limitation of the low-resolution grid and the efficient resolution of the features based on $\Delta_B$; the numeric values of $\Delta_B$ are listed in Table 1. Apparently, as the resolution of the grid goes higher, $\Delta_B$ becomes smaller since more high-frequency information is captured. More specifically, $\Delta_B$ drops dramatically to small values once $B \ge 16$ in all experiments of the trained models. On the other hand, the segmentation maps predicted by these models are evaluated on the 129×129 low-resolution grid, which corresponds to $B = 64$. This empirical evidence suggests that grids with band limits below $B = 64$ could still serve as efficient samplings of the segmentation map. These efficient grids are further validated in our proposed applications, i.e., feature truncation and block-wise annotation, in sections 3.2 and 3.3, respectively. Besides, it is apparent from Table 1 that $\Delta_B$ on the DeepGlobe dataset is significantly smaller than on the PASCAL and Cityscapes datasets; a better efficacy of the feature truncation and the block-wise annotation is thus expected, as will be shown in the following sections. In summary, our analysis can serve as a quantitative criterion of the sampling efficiency across datasets and models.

Figure 1: The spectral decomposition of CE ($E(k)$), the relative discrepancy of CE ($\Delta_B$), and the magnitudes of the spectra of the segmentation output and the annotation. The profiles for DeepLab v3+ and DAN are denoted by deeplab and dan, respectively. Note that the spectrum profiles are normalized with respect to their corresponding maximal values for better comparison across datasets.
Figure 2: Spectral decomposition of CE for DeepLab v3+ and DAN at both the initial and final training stages. The notation follows Fig. 1, with separate profiles for each model at the initial and final training stages.
Dataset       B=8     B=16    B=32    B=64    B=256
DeepLab v3+
  PASCAL      0.275   0.040   0.018   0.010   0
  DeepGlobe   0.102   0.028   0.010   0.005   0
  Cityscapes  2.152   0.668   0.146   0.017   0
DAN
  PASCAL      0.248   0.035   0.019   0.009   0
  DeepGlobe   0.081   0.021   0.008   0.004   0
  Cityscapes  1.307   0.255   0.029   0.012   0
Table 1: The relative discrepancy $\Delta_B$ between the truncated CE and CE under various band limits $B$, on the PASCAL, DeepGlobe and Cityscapes datasets.
Figure 3: Architecture of DeepLab v3+ [4].

Spectral Gradient.

We now demonstrate numerically that the spectral gradient can be approximated by a delta function, as derived in section 2.3. We validate this approximation on the three datasets (PASCAL, DeepGlobe and Cityscapes) and take the spectra of the ASPP features in DeepLab v3+ as our example of $g(k)$. We also validate $\partial f(k') / \partial g(k)$ for various operations within the decoder module of DeepLab v3+, including the convolution, ReLU, and bilinear up-sampling, where $g(k)$ and $f(k')$ denote the input and output spectra of these operations, respectively. These spectral gradients are illustrated in Fig. 4, where each column represents a different operation or dataset, the rows correspond to different input frequencies $k$ relative to the size of the input features, and each panel includes all possible $k'$. Clearly, $\partial f(k') / \partial g(k)$ for the convolution, ReLU, and bilinear up-sampling operations demonstrates delta responses, which is consistent with our discussion of Eq. 10, where the spectral gradient is approximated by a delta function. Regarding $\partial E(k) / \partial g(k')$ on the PASCAL, DeepGlobe and Cityscapes datasets, it still behaves close to a delta function despite the complex operations along the gradient propagation from the segmentation output back to the ASPP module. Note that the segmentation output is related not only to the ASPP features but also to the low-level features from the encoder, as shown in Fig. 3; as the gradient paths from the segmentation output to these two features deviate only by a bilinear up-sampling operator, whose spectral gradient acts as a delta function, we validate only the former. Besides the validation on DeepLab v3+, we also provide the validation on DAN in Fig. 8 of the supplementary, where the spectral gradients are likewise well approximated by delta functions. These results verify our assumptions in section 2.3 as well as the corresponding approximation, i.e., Eq. 12, and enable the feature truncation in section 3.2, which removes the redundant high-frequency components of the features in CNNs without significant degradation of the segmentation performance.

Figure 4: The evaluation of the spectral gradient. Starting from the leftmost column, $\partial f(k') / \partial g(k)$ for the convolution, ReLU, and bilinear up-sampling operations, and $\partial E(k) / \partial g(k')$ for the PASCAL, DeepGlobe and Cityscapes datasets are denoted as Conv, ReLU, Upsample, PASCAL, DeepGlobe and Cityscapes, respectively.
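The delta-like behavior can also be probed in a few lines: the sketch below builds the exact frequency-domain Jacobian of a tiny 1D convolution + soft-plus layer by a change of basis with the DFT matrix, and measures how much of its mass lies on the diagonal $k' = k$ (sizes, seed and circular padding are our choices, not the paper's experimental setup):

```python
import torch

# Numeric probe of the spectral gradient d f(k') / d g(k) for a small
# convolution + soft-plus layer: a 1D sketch of the check behind Fig. 4.
torch.manual_seed(0)
N = 32
kernel = 0.1 * torch.randn(1, 1, 3)     # small kernel, as assumed for Eq. 10

def layer(g):
    """Spatial-domain convolutional layer F = softplus(W * G)."""
    g = torch.nn.functional.pad(g.view(1, 1, -1), (1, 1), mode='circular')
    return torch.nn.functional.softplus(
        torch.nn.functional.conv1d(g, kernel)).view(-1)

x = 0.1 * torch.randn(N)                # small input feature G(x)
J_s = torch.autograd.functional.jacobian(layer, x)   # d F(n) / d G(m)

# change of basis to the frequency domain: J_f = F . J_s . F^{-1}
F_mat = torch.fft.fft(torch.eye(N), dim=0)           # DFT matrix
J_f = F_mat @ J_s.to(F_mat.dtype) @ torch.linalg.inv(F_mat)

diag_fraction = J_f.abs().diag().sum() / J_f.abs().sum()
print(diag_fraction.item())  # near 1 => delta-like response, as in Eq. 10
```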

3.2 Application on Feature Truncation

Recall from section 3.1 that truncating the components of the decoder features at a frequency $k'$ affects only $\partial E(k) / \partial g(k')$, which is negligible when $k'$ is large. Here, we perform experiments on DeepLab v3+ and DAN to demonstrate that our feature truncation can not only reduce the computational cost but also maintain the performance. Notably, the encoder of DeepLab v3+ has 95.6 billion FLOPs (floating-point operations) and 60.1 million parameters, while its decoder has 43.4 billion FLOPs but only 1.3 million parameters. Similarly, the encoder of DAN has 95.6 billion FLOPs with 60.1 million parameters, while its decoder has 11.1 billion FLOPs with only 0.4 million parameters. Although the decoder has few parameters, its cost is comparable to that of the encoder, since it up-samples the features for the dense segmentation map and thus operates on large feature maps. Truncating the features in the decoder is therefore expected to effectively reduce the computational cost. Moreover, one can combine feature truncation with typical network pruning methods to reduce the computational cost in two complementary aspects, i.e., the feature size and the redundant parameters.
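In practice the truncation amounts to down-sampling the decoder features to the efficient grid before the remaining decoder operations; a minimal PyTorch sketch is given below (the down-sampling mode and the module names in the usage comment are our assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def truncate_feature(feat, grid_size):
    """Down-sample a decoder feature map to an efficient grid, discarding the
    frequency components above the corresponding band limit (a sketch).

    feat:      (N, C, 129, 129) ASPP or low-level feature maps.
    grid_size: target resolution, e.g. 65 (B = 32) or 33 (B = 16).
    """
    return F.interpolate(feat, size=(grid_size, grid_size),
                         mode='bilinear', align_corners=True)

# e.g., in a DeepLab v3+-style decoder (module names here are illustrative):
# aspp_feat = truncate_feature(aspp_feat, 65)
# low_level = truncate_feature(low_level, 65)
# seg = decoder(torch.cat([aspp_feat, low_level], dim=1))
# seg = F.interpolate(seg, size=(513, 513), mode='bilinear',
#                     align_corners=True)
```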

Here, we adopt the Soft Filter Pruning (SFP) method [10] and combine it with our feature truncation as an example. The experiments are conducted under the following conditions. For SFP, we use pruning rates of 20%-60% for the encoder and 20% for the decoder, since the encoder has far more parameters than the decoder and is potentially over-parameterized. For the feature truncation applied to the decoder, we down-sample the decoder features, including the high-level features of the ASPP module and the low-level features of the backbone network, from the original size of 129×129 to the efficient low-resolution grids suggested by the analysis in section 3.1; for simplicity, we validate only the grids 65×65 and 33×33, corresponding to $B = 32$ and $B = 16$. The results for DeepLab v3+ are summarized in Table 2; similar results for DAN are summarized in Table 4 of the supplementary. Clearly, the IoU score and the FLOPs decrease as the pruning rate increases or as the feature size decreases. In the baseline model of DeepLab v3+, truncating the features from 129 to 65 reduces the FLOPs by 23.3% and leads to mIoU drops of 0.5% (from 78.6% to 78.1%), 0.3% (from 53.9% to 53.6%) and 5.1% (from 67.8% to 62.7%) on the PASCAL, DeepGlobe and Cityscapes datasets, respectively. Recall that $\Delta_{32}$ for the PASCAL, DeepGlobe and Cityscapes datasets is 0.018, 0.010 and 0.146, respectively: across datasets, the mIoU drop is positively correlated with $\Delta_B$, the relative information loss of CE. A similar correlation is found when the features are truncated from 129 to 33, which leads to mIoU drops of 3.0% (from 78.6% to 75.6%), 0.8% (from 53.9% to 53.1%) and 14.5% (from 67.8% to 53.3%) on the three datasets, respectively; this trend is again consistent with $\Delta_{16}$, which is 0.040, 0.028 and 0.668 for these datasets. Clearly, the larger the $\Delta_B$, the larger the mIoU drop. Similar trends are also observed for DAN and for the pruned models integrating SFP. The correlation between the loss of CE and the drop of the IoU score is consistent with the discussion in section 2.2; we can therefore utilize $\Delta_B$ to estimate the efficacy of feature truncation. For the PASCAL and DeepGlobe datasets, which have small $\Delta_B$, feature truncation effectively reduces around 20%-30% of the FLOPs with a small mIoU drop. Further integrating SFP with a 20% pruning rate, we reduce the FLOPs by 45.6% with feature size 65 at a 2.6% mIoU drop (from 78.6% to 76.0%) on the PASCAL dataset. Combining SFP with a 40% pruning rate, we reduce the FLOPs by 66.1% with the features truncated to 33 at only a 1.9% mIoU drop (from 53.9% to 52.0%) on the DeepGlobe dataset. These results demonstrate that feature truncation integrates effectively with typical network pruning approaches to reduce the computational cost.

To further analyze the efficiency of these models, we define the FLOPs per IoU score (FPI) as FLOPs/mIoU; the lower the FPI, the more efficient the model. The FPI for DeepLab v3+ and DAN is illustrated in Fig. 5(a) and Fig. 5(b), and the numeric values are summarized in Table 2 and Table 4, respectively. The FPI apparently decreases as the feature size is reduced from 129 to 65, indicating an efficient cost reduction by the feature truncation. As the feature size is reduced from 129 to 33, a similar decrement of FPI is achieved on the PASCAL and DeepGlobe datasets; yet the FPI increases on the Cityscapes dataset. This increment of FPI is compatible with the large $\Delta_{16}$ on Cityscapes for both DeepLab v3+ and DAN (c.f. Table 1), which indicates a large information loss of CE and hence a large mIoU drop; such a large mIoU drop renders the model reduction inefficient.

In summary, we evaluated DeepLab v3+ and DAN with various setups of feature truncation and SFP on the three datasets. The results demonstrate that the proposed feature truncation integrates effectively with typical network pruning approaches to reduce the computational cost. Furthermore, the trend of the segmentation performance is consistent with our theoretical analysis. To our knowledge, this is the first work that provides a theoretical framework for analyzing segmentation performance from the perspective of the frequency domain. As mentioned in section 1, existing segmentation models predict the segmentation maps on low-resolution grids to save computational cost; our framework serves as an effective analysis tool for estimating the efficient grid size of the segmentation maps, as well as of the features in CNNs.

Feature   PASCAL                      DeepGlobe                   Cityscapes                  FLOPs    Pruned
size      mIoU   drop    FPI (10^9)   mIoU   drop    FPI (10^9)   mIoU   drop    FPI (10^9)   (10^9)   FLOPs
Baseline
129       78.6%  0.0%    177          53.9%  0.0%    258          67.8%  0.0%    205          139      0.0%
65        78.1%  0.5%    136          53.6%  0.3%    199          62.7%  5.1%    170          107      23.3%
33        75.6%  3.0%    130          53.1%  0.8%    185          53.3%  14.5%   185          98       29.2%
20% pruning rate for encoder
129       76.6%  2.0%    130          53.7%  0.2%    186          67.2%  0.6%    148          100      28.2%
65        76.0%  2.6%    100          53.4%  0.5%    143          62.4%  5.3%    122          76       45.6%
33        73.2%  5.4%    96           52.7%  1.2%    133          53.4%  14.4%   131          70       49.6%
40% pruning rate for encoder
129       74.4%  4.2%    103          53.0%  0.9%    145          66.1%  1.7%    116          77       44.7%
65        73.9%  4.7%    72           52.7%  1.2%    101          61.5%  6.3%    86           53       61.8%
33        72.1%  6.5%    65           52.0%  1.9%    91           52.8%  14.9%   89           47       66.1%
60% pruning rate for encoder
129       65.1%  13.5%   90           50.1%  3.8%    117          58.6%  9.1%    100          59       57.7%
65        64.7%  13.9%   54           49.9%  4.0%    70           54.8%  13.0%   64           35       74.8%
33        63.3%  15.3%   46           49.4%  4.5%    59           47.4%  20.4%   61           29       79.1%
Table 2: Results of feature truncation and network pruning for DeepLab v3+. The experiment without SFP is denoted as the "baseline" model, and those with SFP as "X% pruning rate for encoder", where X = 20, 40 and 60. For each pruning setup, we evaluate the feature truncation with feature sizes 129, 65 and 33. For each experiment we report the mIoU score, the mIoU drop, the FPI and the FLOPs, where mIoU is the IoU score averaged over all semantic classes and FPI is the FLOPs per IoU score; FLOPs and FPI are in units of 10^9. The mIoU drop is the deviation of mIoU with respect to the baseline model with feature size 129×129.
(a) DeepLab v3+
(b) DAN
Figure 5: The FPI for DeepLab v3+ and DAN with various pruning rates of SFP and feature sizes (i.e., 33, 65, 129) for the feature truncation.

3.3 Application on Block-wise annotation

In section 3.1, we determined the efficient low-resolution grids for the segmentation maps by analyzing $\Delta_B$, which estimates the discrepancy between CE and the truncated CE. In this section, we apply these low-resolution grids to the groundtruth annotation, resulting in the proposed block-wise annotation, a novel weak annotation. Moreover, we demonstrate that the performance of a semantic segmentation network trained with the block-wise annotation can be estimated from $\Delta_B$. We train DeepLab v3+ and DAN with the block-wise annotation at various band limits $B$ (from 1 to 256) and evaluate them against the original pixel-wise annotation. Examples of the block-wise annotation and of the predictions of the two models on the PASCAL, DeepGlobe and Cityscapes datasets are illustrated in Fig. 7. Please note that the block-wise annotation at $B = 256$ is equivalent to the original pixel-wise groundtruth. The experimental results are summarized in Table 3. For each $B$, we evaluate the IoU score and the relative IoU score, where the relative IoU score is the ratio of the IoU score to that at $B = 256$, i.e., to the IoU score of the network trained with the pixel-wise groundtruth annotation. As the band limit goes lower, the IoU score as well as the relative IoU score decreases. For the PASCAL and DeepGlobe datasets, both scores drop apparently when the band limit decreases from 16 to 8: the relative IoU score drops from 94.9% to 86.8% on PASCAL and from 99.3% to 95.4% on DeepGlobe. This observation is compatible with the large increments of $\Delta_B$ in Table 1: as $B$ decreases from 16 to 8, $\Delta_B$ increases from 0.040 to 0.275 on PASCAL and from 0.028 to 0.102 on DeepGlobe. For the Cityscapes dataset, the relative IoU score dramatically drops from 100% to 87.5% when the band limit decreases from 256 to 32; such a large decrement is consistent with the large $\Delta_B$, which indicates a large information loss of CE as well as of the IoU score. Besides, the relative IoU score on the DeepGlobe dataset is larger than that on the PASCAL dataset, reflecting the fact that the DeepGlobe dataset contains mainly low-frequency components, which agrees with the observations in section 3.1. Similar observations hold for the experiments on DAN. To summarize all experiments, we illustrate the correlation between the relative mIoU and $\Delta_B$ (c.f. Table 1) in Fig. 6: clearly, the larger the $\Delta_B$, the lower the relative IoU score. As a result, one can estimate the performance of a semantic segmentation network trained with the block-wise annotation by simply evaluating $\Delta_B$, without performing the experiments over all band limits. In summary, the proposed spectral analysis enables an advanced analysis of weak annotation in the frequency domain. Our analysis reveals the correlation between the segmentation performance and the low-resolution grid of the segmentation maps. Based on this analysis, we propose the block-wise annotation, a novel weak annotation that applies the low-resolution grid to the segmentation maps. Notably, these low-resolution grids correspond to the coarse contours of the instances in the segmentation maps, which are widely exploited by existing weak annotations. In this work, we aim to provide the theoretical background of the spectral analysis and the block-wise annotation; further research should investigate the spectral analysis of existing weak annotations [20, 11].
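A minimal sketch of how such an annotation can be generated from a pixel-wise label map is given below; the grid size $2B + 1$ and the block-center sampling are our reading of the text (under which $B = 256$ on 513×513 images recovers the pixel-wise groundtruth), since the exact construction is not spelled out here:

```python
import numpy as np

def block_wise_annotation(label, band_limit):
    """Build a block-wise weak annotation for band limit B (a sketch).

    The pixel-wise label map is sampled on the efficient low-resolution grid
    of size 2B + 1 and held constant within each block. Whether the paper
    assigns block labels by centre sampling (as here) or majority vote is
    not specified.
    """
    H, W = label.shape
    grid = 2 * band_limit + 1
    ys = np.linspace(0, H - 1, grid).round().astype(int)
    xs = np.linspace(0, W - 1, grid).round().astype(int)
    coarse = label[np.ix_(ys, xs)]                 # low-resolution annotation
    yy = np.linspace(0, grid - 1, H).round().astype(int)
    xx = np.linspace(0, grid - 1, W).round().astype(int)
    return coarse[np.ix_(yy, xx)]                  # nearest-neighbour blocks
```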

Figure 6: The correlation between the relative IoU score and $\Delta_B$ on the PASCAL, DeepGlobe and Cityscapes datasets.
              B=8     B=16    B=32    B=256
DeepLab v3+
  PASCAL      0.682   0.746   0.775   0.786
  DeepGlobe   0.514   0.535   0.536   0.539
  Cityscapes  0.385   0.500   0.593   0.678
DAN
  PASCAL      0.687   0.746   0.777   0.787
  DeepGlobe   0.500   0.522   0.533   0.536
  Cityscapes  0.384   0.501   0.585   0.664
(a) IoU score.
              B=8     B=16    B=32    B=256
DeepLab v3+
  PASCAL      86.8%   94.9%   98.6%   100.0%
  DeepGlobe   95.4%   99.3%   99.4%   100.0%
  Cityscapes  56.8%   73.7%   87.5%   100.0%
DAN
  PASCAL      87.3%   94.8%   98.7%   100.0%
  DeepGlobe   93.3%   97.4%   99.3%   100.0%
  Cityscapes  57.8%   75.5%   88.2%   100.0%
(b) Relative IoU score.
Table 3: Experimental results of learning semantic segmentation from the proposed block-wise annotation on the PASCAL, DeepGlobe and Cityscapes datasets. (a) and (b) summarize the IoU score and the relative IoU score, respectively.
Figure 7: Examples of the block-wise annotation and the predictions of DeepLab v3+ and DAN. For each dataset (PASCAL, DeepGlobe, Cityscapes), the block-wise groundtruth annotations and the corresponding predictions of DeepLab v3+ and DAN are shown at band limits B = 256, 32, 16 and 8, denoted at the top of each column.

4 Conclusion

Our proposed spectral analysis for semantic segmentation networks correlates CE, the IoU score, and gradient back-propagation from the spectral point of view. We discover that CE is mainly contributed by the low-frequency components of the segmentation maps, which are associated with the features in CNNs at the same frequency. As a result, a quantitative estimation of the sampling efficiency of the segmentation map becomes possible. We test our theory on two applications: feature truncation and block-wise annotation. Our results show that the combination of feature truncation and network pruning can save computational cost significantly with a small accuracy loss. In addition, the block-wise annotation can potentially save labeling cost, since a network trained with the block-wise annotation on an efficient grid performs close to the original network.

References

  • [1] H. Caesar, J. Uijlings, and V. Ferrari (2018) COCO-Stuff: thing and stuff classes in context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062. Cited by: §1.
  • [3] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587. Cited by: §1.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV). Cited by: §1, Figure 3, §3.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §3.
  • [6] I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, and R. Raska (2018) DeepGlobe 2018: a challenge to parse the earth through satellite images. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops). Cited by: §3.
  • [7] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. International Journal of Computer Vision (IJCV). Cited by: §1, §3.
  • [8] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV), Cited by: §3.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.
  • [10] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, and Y. Yang (2019) Asymptotic soft filter pruning for deep convolutional neural networks. IEEE Transactions on Cybernetics. Cited by: §1, §3.2.
  • [11] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele (2017) Simple does it: weakly supervised instance and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §3.3.
  • [12] A. Kirillov, Y. Wu, K. He, and R. Girshick (2019) PointRend: image segmentation as rendering. arXiv:1912.08193. Cited by: §1.
  • [13] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems (NIPS). Cited by: §1.
  • [14] T. Kuo, K. Tseng, J. Yan, Y. Liu, and Y. F. Wang (2018) Deep aggregation net for land cover classification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops). Cited by: §3.
  • [15] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • [16] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1.
  • [17] T. Luo, Z. Ma, Z. J. Xu, and Y. Zhang (2019) Theory of the frequency principle for general deep neural networks. arXiv preprint arXiv:1906.09235. Cited by: §1.
  • [18] V. Mnih (2013) Machine learning for aerial image labeling. Ph.D. thesis, University of Toronto. Cited by: §1.
  • [19] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [20] G. Papandreou, L. Chen, K. P. Murphy, and A. L. Yuille (2015) Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §3.3.
  • [21] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville (2018) On the spectral bias of neural networks. arXiv preprint arXiv:1806.08734. Cited by: §1.
  • [22] B. Ronen, D. Jacobs, Y. Kasten, and S. Kritchman (2019) The convergence rate of neural networks for learned functions of different frequencies. In Advances in Neural Information Processing Systems, pp. 4763–4772. Cited by: §1.
  • [23] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (MICCAI), Cited by: §1.
  • [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV). Cited by: §3.
  • [25] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey (2019) Symmetric cross entropy for robust learning with noisy labels. In IEEE International Conference on Computer Vision (ICCV), pp. 322–330. Cited by: §1.
  • [26] Y. Xu, P. Cao, Y. Kong, and Y. Wang (2019) L_DMI: a novel information-theoretic loss function for training deep nets robust to label noise. In Advances in Neural Information Processing Systems (NIPS), pp. 6222-6233. Cited by: §1.
  • [27] Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems (NIPS), pp. 8778–8788. Cited by: §1.
  • [28] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [29] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ADE20K dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.

5 Supplementary

5.1 Fourier transform of spatial integrals

Lemma 1.

Given two functionals $A(x)$ and $B(x)$ in the spatial domain, the overlapping integral $\int A(x)\, B(x)\, dx$ can be transformed into the frequency domain as

$\int A(x)\, B(x)\, dx = \int a(k)\, b^*(k)\, dk$,    (15)

where $a(k) = \mathcal{F}[A](k)$ and $b(k) = \mathcal{F}[B](k)$.

Proof.

Expressing $A$ and $B$ through the inverse Fourier transform $\mathcal{F}^{-1}$, the integral can be written as

$\int A(x)\, B(x)\, dx = \int \mathcal{F}^{-1}[a](x)\; \mathcal{F}^{-1}[b](x)\, dx$,    (16)

which expands to

$\int\!\!\int\!\!\int a(k)\, b(k')\, e^{i 2\pi (k + k') x}\, dk\, dk'\, dx$.    (17)

By the orthogonality of the Fourier basis, we have

$\int e^{i 2\pi (k + k') x}\, dx = \delta(k + k')$,    (18)

where $\delta$ is the Dirac delta function with the integral property $\int f(k')\, \delta(k - k')\, dk' = f(k)$. Hence, Eq. 17 is given as

$\int A(x)\, B(x)\, dx = \int a(k)\, b(-k)\, dk = \int a(k)\, b^*(k)\, dk$,    (19)

where the last equality uses $b(-k) = b^*(k)$ for real-valued $B(x)$. ∎
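A quick numeric check of Lemma 1 in its discrete (DFT) form, where the identity picks up a $1/N$ factor from the unnormalized transform (a sanity check of ours, not part of the paper):

```python
import numpy as np

# Discrete form of Lemma 1: for real signals,
# sum_x A(x) B(x) = (1/N) sum_k a(k) conj(b(k))
rng = np.random.default_rng(1)
A, B = rng.normal(size=(2, 64))
lhs = (A * B).sum()
rhs = (np.fft.fft(A) * np.conj(np.fft.fft(B))).sum().real / 64
assert np.allclose(lhs, rhs)
```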

Lemma 2.

Given a functional $A(x)$ in the spatial domain, the integral $\int A(x)\, dx$ can be transformed into the frequency domain as

$\int A(x)\, dx = a(0)$,    (20)

where $a(k) = \mathcal{F}[A](k)$.

Proof.

The proof follows a process similar to that of Lemma 1:

$\int A(x)\, dx = \int\!\!\int a(k)\, e^{i 2\pi k x}\, dk\, dx = \int a(k)\, \delta(k)\, dk = a(0)$.    (21)

∎

5.2 Gradient propagation for a convolutional layer

Consider a convolutional layer consisting of a convolutional kernel $W$ and the soft-plus activation function $\sigma(u) = \ln(1 + e^u)$, where $x$ is the spatial location. Let $G(x)$ denote the input; the output of the convolutional layer is written as

$F(x) = \sigma\big((W * G)(x)\big)$.    (22)

Lemma 3.

Assuming the kernel $W$ is small and $|(W * G)(x)| \ll 1$, the spectral gradient can be approximated as

$\dfrac{\partial f(k')}{\partial g(k)} \approx \dfrac{1}{2}\, w(k)\, \delta_{k',k}$,    (23)

where $f$, $g$ and $w$ are the spectra of $F$, $G$ and $W$, respectively.

Proof.

The spectral gradient of a convolutional layer consists of the spectral gradient of the convolution operator and that of the activation function; we derive the two gradients and combine them at the end of the derivation.

The convolution operator can be written in the frequency domain as $u(k) = w(k)\, g(k)$, where $u$, $w$ and $g$ are the spectra of $U = W * G$, $W$ and $G$, respectively. Without loss of generality, in the discrete frequency domain, the gradient of $u$ at a specific frequency $k'$ with respect to $g$ at frequency $k$ is

$\dfrac{\partial u(k')}{\partial g(k)} = w(k)\, \delta_{k',k}$,    (24)

where $\delta_{k',k}$ is the Kronecker delta, which equals 1 if $k' = k$ and 0 otherwise.

The soft-plus function can first be expressed as a Taylor series,

$\sigma(u) = \ln 2 + \dfrac{u}{2} + \dfrac{u^2}{8} + O(u^4)$,    (25)

in which $u = (W * G)(x)$ is small since the kernel is small and $|(W * G)(x)| \ll 1$ by assumption; hence the quartic and higher-order terms become negligible. The Fourier transform of $F = \sigma(U)$ is thus given as

$f(k) \approx \ln 2\; N\, \delta_{k,0} + \dfrac{1}{2}\, u(k) + \dfrac{1}{8N} \sum_{q} u(q)\, u(k - q)$,    (26)

and its spectral gradient is

$\dfrac{\partial f(k')}{\partial u(k)} \approx \dfrac{1}{2}\, \delta_{k',k} + \dfrac{1}{4N}\, u(k' - k)$,    (27)

where $q$ is a dummy variable for the convolution and $N$ is the spectrum size of the features. By Eq. 24 and Eq. 27, the spectral gradient of the convolutional layer in Eq. 9 is then written as

$\dfrac{\partial f(k')}{\partial g(k)} = \sum_{k''} \dfrac{\partial f(k')}{\partial u(k'')}\, \dfrac{\partial u(k'')}{\partial g(k)} \approx \dfrac{1}{2}\, w(k)\, \delta_{k',k} + \dfrac{1}{4N}\, u(k' - k)\, w(k)$,    (28)

where $k$, $k'$ and $k''$ are frequency indices. Since $U = W * G$ is small as argued above, its spectrum $u$ should also be small. We can therefore neglect the second term of Eq. 28, i.e., $\frac{1}{4N}\, u(k' - k)\, w(k)$, and approximate Eq. 28 as

$\dfrac{\partial f(k')}{\partial g(k)} \approx \dfrac{1}{2}\, w(k)\, \delta_{k',k}$.    (29)

∎
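The expansion in Eq. 25 can be checked numerically in a few lines (our own sanity check):

```python
import numpy as np

# Taylor expansion of Eq. 25: softplus(u) = ln 2 + u/2 + u^2/8 + O(u^4)
u = np.linspace(-0.1, 0.1, 201)
exact = np.log1p(np.exp(u))
approx = np.log(2) + u / 2 + u ** 2 / 8
print(np.max(np.abs(exact - approx)))   # ~5e-7: the tiny O(u^4) remainder
```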

5.3 Gradient propagation for the frequency components of CE

Lemma 4.

Given a convolutional layer that satisfies the assumptions of Lemma 3, let $g(k)$ denote the spectrum of the input feature. For each semantic class $c$ in the segmentation maps, let $w_c(k)$ and $p_c(k)$ denote the spectrum of the kernel and that of the segmentation output, respectively. The spectral gradient of the frequency component of CE, $E(k)$, is

$\dfrac{\partial E(k)}{\partial g(k')} \approx -\dfrac{1}{2} \sum_c y_c^*(k)\, w_c(k)\, \delta_{k,k'}$.    (30)

Proof.

By Lemma 3 and Eq. 5, the spectral gradient is given as

$\dfrac{\partial E(k)}{\partial g(k')} = -\sum_c y_c^*(k)\, \dfrac{\partial p_c(k)}{\partial g(k')}$,    (31)

in which, by Lemma 3,

$\dfrac{\partial p_c(k)}{\partial g(k')} \approx \dfrac{1}{2}\, w_c(k)\, \delta_{k,k'}$,    (32)

where $P_c(x)$ is the segmentation output after performing softmax on the logits and $p_c(k)$ is the spectrum of the segmentation output. Substituting Eq. 32 into Eq. 31 yields Eq. 30. ∎